【Paper Reading】On clustering using random walks

Reading notes on *On Clustering Using Random Walks*.

1. Problem Formulation

1.1 Problem Description

Let $G(V, E, \omega)$ be a weighted graph, where $V$ is the set of nodes, $E$ is the set of edges between nodes in $V$, and $\omega: E \to \mathbb{R}$ is a function that measures the similarity between pairs of items (a higher value means more similar).

$$p_{ij} = \frac{\omega(i,j)}{d_i}, \qquad d_i = \sum_{k=1}^n \omega(i,k)$$

$M^G \in \mathbb{R}^{n \times n}$ is the associated transition matrix,

$$M^G_{ij} = \begin{cases} p_{ij} & \langle i,j \rangle \in E \\ 0 & \textrm{otherwise} \end{cases}$$

Question:

  1. $\omega$ represents the similarity between nodes, but in practice we often only have an undirected graph that tells us whether two nodes are connected. How can we construct $\omega$ from the information we have?
    Answer: the similarity here can be taken to be the weight of the edge between two nodes, so $M^G_{ij}$ can be viewed as the result of normalizing the adjacency matrix.
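As a concrete illustration of this answer, here is a minimal sketch (my own, not code from the paper; the function name is hypothetical) that builds $M^G$ by row-normalizing a weighted adjacency matrix:

```python
import numpy as np

def transition_matrix(adjacency):
    """Row-normalize a weighted adjacency matrix to obtain M^G."""
    adjacency = np.asarray(adjacency, dtype=float)
    degrees = adjacency.sum(axis=1, keepdims=True)  # d_i = sum_k w(i, k)
    return adjacency / degrees                      # p_ij = w(i, j) / d_i

# For an unweighted, undirected graph the 0/1 adjacency matrix works directly:
A = np.array([[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]])
M = transition_matrix(A)  # each row of M sums to 1
```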

This part is tricky: I could not find anywhere in the paper how $P^{k}_{\textrm{visit}}(i)$ is computed, and I was stuck here for a very long time.

The original paper describes it as follows:

Now, denote by $P^k_{visit}(i) \in \mathbb{R}^n$ the vector whose j-th component is the probability that a random walk originating at $i$ will visit node $j$ in its k-th step. Thus, $P^k_{visit}(i)$ is the i-th row in the matrix $(M^G)^k$, the k'th power of $M^G$.

Now we know how $M^G$ is computed, but what about $(M^G)^k$? The paper describes it as "the k'th power of $M^G$", which I understand to be the matrix $M^G$ raised to the k-th power (repeated matrix multiplication).

$P^k_{visit}(i)$ is the i-th row in the matrix $(M^G)^k$:

$$P^k_{visit}(i) = (M^G)^k_i$$
$$(M^G)^k = \{P^k_{visit}(1)^{\mathbf{T}}, P^k_{visit}(2)^{\mathbf{T}}, \dots, P^k_{visit}(n)^{\mathbf{T}}\}$$
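To make this concrete, a small sketch of my own (using `np.linalg.matrix_power` for the repeated matrix product):

```python
import numpy as np

def p_visit(M, i, k):
    """P^k_visit(i): row i of (M^G)^k, i.e. the distribution over nodes
    of a random walk that started at node i, after exactly k steps."""
    return np.linalg.matrix_power(M, k)[i]
```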

Notice: up to this point, this is the same as the Markov Cluster algorithm (MCL). MCL keeps iterating until the matrix no longer changes; here, since that is computationally expensive, the authors instead use the sum of the first k powers as a substitute.

We now offer two methods for performing the edge separation, both based on deterministic analysis of random walks.

Edge separation (sharpening):

NS: Separation by neighborhood similarity.

CE: Separation by circular escape.

Terms from the paper: the weighted neighborhood; bipartite subgraph.

$$P^{\leq k}_{\textrm{visit}}(v) = \sum_{i=1}^k P^{i}_{\textrm{visit}}(v)$$
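A minimal sketch of this quantity (again my own helper, not the authors' code):

```python
import numpy as np

def p_visit_le(M, v, k):
    """P^{<=k}_visit(v): sum of the first k step distributions from v."""
    total = np.zeros(M.shape[0])
    step = np.eye(M.shape[0])[v]   # one-hot start: the walk begins at v
    for _ in range(k):
        step = step @ M            # advance one step: P^i_visit(v)
        total += step
    return total
```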

2. NS: Separation by neighborhood similarity.

Now, in order to estimate the closeness of two nodes $v$ and $u$, we fix some small $k$ (e.g. $k = 3$) and compare $P^{\leq k}_{\textrm{visit}}(v)$ and $P^{\leq k}_{\textrm{visit}}(u)$. The smaller the difference, the greater the intimacy between $u$ and $v$.

$$NS(G) \xlongequal{dfn} G_s(V, E, \omega_s),$$
where $\forall \langle v, u \rangle \in E,\ \omega_s(u, v) = sim^k\big(P^{\leq k}_{visit}(v), P^{\leq k}_{visit}(u)\big)$.

$sim^k(x, y)$ is some similarity measure of the vectors $x$ and $y$, whose value increases as $x$ and $y$ become more similar.

A suitable choice for $sim^k(x, y)$ is:

$$f^k(x, y) \xlongequal{dfn} \exp\big(2k - \|x - y\|_{L_1}\big) - 1 \tag{1}$$
$$\|x - y\|_{L_1} = \sum_{i=1}^n |x_i - y_i|$$

Another choice is:

$$\cos(x, y) = \frac{(x, y)}{\sqrt{(x, x)} \cdot \sqrt{(y, y)}} \tag{2}$$

where $(\cdot,\cdot)$ denotes the inner product.
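A sketch of the two measures and of the NS reweighting itself (the helper names are mine, and `p_visit_le` is the helper defined earlier; this is an illustration under those assumptions, not the paper's code):

```python
import numpy as np

def f_k(x, y, k):
    # Equation (1): exp(2k - ||x - y||_L1) - 1
    return np.exp(2 * k - np.abs(x - y).sum()) - 1

def cos_sim(x, y):
    # Equation (2): cosine similarity via inner products
    return x @ y / (np.sqrt(x @ x) * np.sqrt(y @ y))

def ns_weights(M, edges, k=3):
    """NS(G): reweight each edge <u, v> by f^k of the P^{<=k}_visit
    vectors of its endpoints (cos_sim could be substituted)."""
    return {(u, v): f_k(p_visit_le(M, u, k), p_visit_le(M, v, k), k)
            for (u, v) in edges}
```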

3. CE: Separation by circular escape.

4. Code Implementation

An MCL implementation on an undirected weighted graph:

```python
import numpy as np


def markovCluster(adjacencyMat, dimension, numIter, power=2, inflation=2):
    columnSum = np.sum(adjacencyMat, axis=0)
    probabilityMat = adjacencyMat / columnSum

    # Expansion: raise the matrix to the `power`-th matrix power.
    def _expand(probabilityMat, power):
        expandMat = probabilityMat
        for i in range(power - 1):
            expandMat = np.dot(expandMat, probabilityMat)
        return expandMat

    expandMat = _expand(probabilityMat, power)

    # Inflation: take the `inflation`-th element-wise power of the
    # matrix, then re-normalize each column.
    def _inflate(expandMat, inflation):
        powerMat = expandMat
        for i in range(inflation - 1):
            powerMat = powerMat * expandMat
        inflateColumnSum = np.sum(powerMat, axis=0)
        inflateMat = powerMat / inflateColumnSum
        return inflateMat

    inflateMat = _inflate(expandMat, inflation)
    for i in range(numIter):
        expand = _expand(inflateMat, power)
        inflateMat = _inflate(expand, inflation)
    print(inflateMat)
    # Boolean mask of the nonzero entries of the converged matrix.
    print(np.zeros(adjacencyMat.shape) != inflateMat)


if __name__ == "__main__":
    dimension = 4
    numIter = 10
    adjacencyMat = np.array([[1, 1, 1, 1],
                             [1, 1, 0, 1],
                             [1, 0, 1, 0],
                             [1, 1, 0, 1]])
    # adjacencyMat = np.array([[1, 1, 1, 1, 0, 0, 0],
    #                          [1, 1, 1, 1, 1, 0, 0],
    #                          [1, 1, 1, 1, 0, 0, 0],
    #                          [1, 1, 1, 1, 0, 0, 0],
    #                          [0, 1, 0, 0, 1, 1, 1],
    #                          [0, 0, 0, 0, 1, 1, 1],
    #                          [0, 0, 0, 0, 1, 1, 1],
    #                          ])
    markovCluster(adjacencyMat, dimension, numIter)
```
```
[[1.00000000e+000 1.00000000e+000 1.00000000e+000 1.00000000e+000]
 [5.23869755e-218 5.23869755e-218 5.23869755e-218 5.23869755e-218]
 [0.00000000e+000 0.00000000e+000 0.00000000e+000 0.00000000e+000]
 [5.23869755e-218 5.23869755e-218 5.23869755e-218 5.23869755e-218]]
[[ True  True  True  True]
 [ True  True  True  True]
 [False False False False]
 [ True  True  True  True]]
```

From this we can read off the clustering $\{\{1,2,4\},\{3\}\}$: nodes 1, 2 and 4 end up with identical nonzero rows, while node 3's row converges to all zeros.
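A small helper that automates this reading (my own heuristic: group nodes by the nonzero pattern of their rows, which reproduces the grouping above; note that canonical MCL instead reads clusters off attractor rows):

```python
import numpy as np

def clusters_from_rows(mat):
    # Group nodes (1-based) whose rows share the same nonzero pattern,
    # mirroring the boolean matrix printed above; nodes with an all-zero
    # row form singleton clusters.
    groups = {}
    for node, row in enumerate(np.asarray(mat), start=1):
        pattern = tuple(row != 0)
        if not any(pattern):
            groups[("zero", node)] = {node}
        else:
            groups.setdefault(pattern, set()).add(node)
    return list(groups.values())

# On the matrix printed above this returns [{1, 2, 4}, {3}].
```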

Spectral clustering
MCL
MCL GitHub