[Paper Reading] On Clustering Using Random Walks — reading notes
1. Problem Formulation
1.1 Problem Description
Let $G(V,E,\omega)$ be a weighted graph, where $V$ is the set of nodes, $E$ is the set of edges between nodes in $V$, and $\omega : E \to \mathbb{R}$ is a weight function that measures the similarity between pairs of items (a higher value means more similar).
$$p_{ij} = \frac{\omega(i,j)}{d_i}, \qquad d_i = \sum_{k=1}^{n}\omega(i,k)$$

$M^G \in \mathbb{R}^{n \times n}$ is the associated transition matrix:

$$M^G_{ij} = \begin{cases} p_{ij} & \langle i,j \rangle \in E \\ 0 & \textrm{otherwise} \end{cases}$$
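The definitions above translate directly into code. This is a minimal sketch (the function name `transition_matrix` and the toy weights are mine, not from the paper): each row of the similarity matrix is divided by its degree $d_i$, giving a row-stochastic transition matrix.

```python
import numpy as np

# Build M^G from a weighted similarity matrix W, where W[i, j] = omega(i, j).
# Row i is divided by d_i = sum_k omega(i, k), so each row sums to 1.
def transition_matrix(W):
    W = np.asarray(W, dtype=float)
    d = W.sum(axis=1)          # d_i = sum_k omega(i, k)
    return W / d[:, None]      # M^G_ij = omega(i, j) / d_i

# Toy 3-node graph: nodes 1 and 2 are more similar to each other.
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
M = transition_matrix(W)
print(M)
```

Each row of `M` sums to 1, as a transition matrix must.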
Question: this part is tricky. I could not find anywhere in the paper how $P^{k}_{\textrm{visit}}(i)$ is computed, and I was stuck here for a long time.
The original paper describes it as follows:
Now, denote by $P^k_{\textrm{visit}}(i) \in \mathbb{R}^n$ the vector whose $j$-th component is the probability that a random walk originating at $i$ will visit node $j$ in its $k$-th step. Thus, $P^k_{\textrm{visit}}(i)$ is the $i$-th row in the matrix $(M^G)^k$, the $k$-th power of $M^G$.
We already know how $M^G$ is computed, but what is $(M^G)^k$? The paper calls it "the $k$-th power of $M^G$", which I take to mean the matrix $M^G$ multiplied by itself $k$ times (ordinary matrix multiplication).
$P^k_{\textrm{visit}}(i)$ is the $i$-th row of the matrix $(M^G)^k$:

$$P^k_{\textrm{visit}}(i) = (M^G)^k_i$$

so the power matrix stacks all of these vectors as its rows:

$$(M^G)^k = \{P^k_{\textrm{visit}}(1)^{\mathbf{T}}, P^k_{\textrm{visit}}(2)^{\mathbf{T}}, \dots, P^k_{\textrm{visit}}(n)^{\mathbf{T}}\}$$
Notice: up to this point, this is essentially the same as the Markov Cluster algorithm (MCL). MCL iterates until the matrix no longer changes; here, since that is computationally expensive, the authors use the sum of the first $k$ powers instead.
We now offer two methods for performing the edge separation, both based on deterministic analysis of random walks.
Edge separation and sharpening:
NS: Separation by neighborhood similarity.
CE: Separation by circular escape.
the weighted neighborhood
bipartite subgraph
$$P^{\leq k}_{\textrm{visit}}(v) = \sum_{i=1}^{k} P^{i}_{\textrm{visit}}(v)$$
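Under this reading, both $P^{k}_{\textrm{visit}}$ and the cumulative $P^{\leq k}_{\textrm{visit}}$ fall out of repeated matrix multiplication. A small sketch (the helper names are mine):

```python
import numpy as np

# Row i of the k-th matrix power of M is P^k_visit(i).
def p_visit_k(M, k):
    return np.linalg.matrix_power(M, k)

# Row v of the result is P^{<=k}_visit(v) = sum over i = 1..k of P^i_visit(v).
def p_visit_upto_k(M, k):
    acc = np.zeros_like(M, dtype=float)
    Mi = np.eye(len(M))
    for _ in range(k):
        Mi = Mi @ M          # (M^G)^i
        acc += Mi
    return acc

# Toy row-stochastic transition matrix for a 3-node graph.
M = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
```

Since each $P^i_{\textrm{visit}}(v)$ is a probability distribution, every row of `p_visit_upto_k(M, k)` sums to $k$.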
2. NS: Separation by neighborhood similarity.
Now, in order to estimate the closeness of two nodes $v$ and $u$, we fix some small $k$ (e.g. $k = 3$) and compare $P^{\leq k}_{\textrm{visit}}(v)$ with $P^{\leq k}_{\textrm{visit}}(u)$. The smaller the difference, the greater the intimacy between $u$ and $v$.
$$NS(G) \stackrel{\textrm{dfn}}{=} G_s(V, E, \omega_s),$$

where $\forall \langle v, u \rangle \in E$, $\omega_s(u, v) = sim^k\!\left(P^{\leq k}_{\textrm{visit}}(v), P^{\leq k}_{\textrm{visit}}(u)\right)$.

Here $sim^k(x, y)$ is some similarity measure of the vectors $x$ and $y$ whose value increases as $x$ and $y$ become more similar. A suitable choice for $sim^k(x, y)$ is:
$$f^k(x,y) \stackrel{\textrm{dfn}}{=} \exp\left(2k - \|x - y\|_{L_1}\right) - 1 \tag{1}$$

$$\|x - y\|_{L_1} = \sum_{i=1}^{n} |x_i - y_i|$$
Another choice is:

$$\cos(x,y) = \frac{(x,y)}{\sqrt{(x,x)} \cdot \sqrt{(y,y)}} \tag{2}$$

where $(\cdot,\cdot)$ denotes the inner product.
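Both measures are short to state in code. A sketch (the function names `f_k` and `cos_sim` are mine), assuming `x` and `y` are the two $P^{\leq k}_{\textrm{visit}}$ vectors:

```python
import numpy as np

# Eq. (1): L1-based similarity. Each P^{<=k}_visit vector has entries
# summing to k, so ||x - y||_L1 <= 2k, and f^k ranges from 0 (maximally
# different vectors) up to exp(2k) - 1 (identical vectors).
def f_k(x, y, k):
    return np.exp(2 * k - np.abs(np.asarray(x) - np.asarray(y)).sum()) - 1

# Eq. (2): cosine similarity via inner products.
def cos_sim(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return (x @ y) / (np.sqrt(x @ x) * np.sqrt(y @ y))
```

For identical vectors, `f_k(x, x, k)` equals `exp(2k) - 1` and `cos_sim(x, x)` equals 1, the respective maxima.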
3. CE: Separation by circular escape.
4. Code Implementation (MCL)
```python
import numpy as np


def markovCluster(adjacencyMat, dimension, numIter, power=2, inflation=2):
    # Normalize each column so it sums to 1 (column-stochastic matrix).
    columnSum = np.sum(adjacencyMat, axis=0)
    probabilityMat = adjacencyMat / columnSum

    # Expand: take the power-th matrix power (repeated matrix multiplication).
    def _expand(mat, power):
        expandMat = mat
        for _ in range(power - 1):
            expandMat = np.dot(expandMat, mat)
        return expandMat

    # Inflate: take the inflation-th elementwise power of the matrix,
    # then renormalize each column.
    def _inflate(mat, inflation):
        powerMat = mat
        for _ in range(inflation - 1):
            powerMat = powerMat * mat
        inflateColumnSum = np.sum(powerMat, axis=0)
        return powerMat / inflateColumnSum

    inflateMat = _inflate(_expand(probabilityMat, power), inflation)
    for _ in range(numIter):
        expand = _expand(inflateMat, power)
        inflateMat = _inflate(expand, inflation)
    print(inflateMat)
    # Nonzero entries mark the transitions that survive the iteration.
    print(np.zeros((dimension, dimension)) != inflateMat)


if __name__ == "__main__":
    dimension = 4
    numIter = 10
    adjacencyMat = np.array([[1, 1, 1, 1],
                             [1, 1, 0, 1],
                             [1, 0, 1, 0],
                             [1, 1, 0, 1]])
    # A 7-node example (set dimension = 7 to use it):
    # adjacencyMat = np.array([[1, 1, 1, 1, 0, 0, 0],
    #                          [1, 1, 1, 1, 1, 0, 0],
    #                          [1, 1, 1, 1, 0, 0, 0],
    #                          [1, 1, 1, 1, 0, 0, 0],
    #                          [0, 1, 0, 0, 1, 1, 1],
    #                          [0, 0, 0, 0, 1, 1, 1],
    #                          [0, 0, 0, 0, 1, 1, 1]])
    markovCluster(adjacencyMat, dimension, numIter)
```
Output:

```
[[1.00000000e+000 1.00000000e+000 1.00000000e+000 1.00000000e+000]
 [5.23869755e-218 5.23869755e-218 5.23869755e-218 5.23869755e-218]
 [0.00000000e+000 0.00000000e+000 0.00000000e+000 0.00000000e+000]
 [5.23869755e-218 5.23869755e-218 5.23869755e-218 5.23869755e-218]]
[[ True  True  True  True]
 [ True  True  True  True]
 [False False False False]
 [ True  True  True  True]]
```
From this we can read off the clustering $\{\{1,2,4\},\{3\}\}$.
Related links:
Spectral clustering
MCL
MCL GitHub