Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering the vertices of a graph given pairwise similarities as edge properties, described in Lin and Cohen, Power Iteration Clustering. It computes a pseudo-eigenvector of the normalized affinity matrix of the graph via power iteration and uses it to cluster vertices. MLlib includes an implementation of PIC using GraphX as its backend. It takes an RDD of (srcId, dstId, similarity) tuples and outputs a model with the clustering assignments. The similarities must be nonnegative. PIC assumes that the similarity measure is symmetric. A pair (srcId, dstId), regardless of ordering, should appear at most once in the input data. If a pair is missing from the input, its similarity is treated as zero.

MLlib’s PIC implementation takes the following (hyper-)parameters:
k: number of clusters
maxIterations: maximum number of power iterations
initializationMode: initialization mode. This can be either “random”, which is the default, to use a random vector as vertex properties, or “degree” to use the normalized sum of similarities.
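
To make the input rules and the parameters above concrete, here is a minimal pure-Python sketch of the idea behind PIC. It is not MLlib's implementation and does not use Spark or GraphX. The function names (build_affinity, power_iterate, kmeans_1d, pic) are hypothetical, the default of 10 iterations and the fixed random seed are arbitrary choices, rejecting duplicate pairs with an error is an assumption of this sketch, and the final step uses a small one-dimensional k-means as a stand-in for the clustering stage. It also omits any early-stopping check and simply runs the requested number of iterations.

```python
import random


def build_affinity(edges):
    """Turn (src_id, dst_id, similarity) tuples into a symmetric adjacency map."""
    adj, seen = {}, set()
    for src, dst, sim in edges:
        if sim < 0:
            raise ValueError("similarities must be nonnegative")
        pair = (min(src, dst), max(src, dst))  # ordering of the pair is ignored
        if pair in seen:
            raise ValueError(f"pair {pair} appears more than once")
        seen.add(pair)
        adj.setdefault(src, {})[dst] = sim
        adj.setdefault(dst, {})[src] = sim  # similarity is assumed symmetric
    return adj  # pairs that never appear simply have no entry (similarity zero)


def l1_normalize(vec):
    """Scale a vertex -> value map so the absolute values sum to one."""
    norm = sum(abs(x) for x in vec.values())
    if norm == 0:
        raise ValueError("vector is all zeros; at least one similarity must be positive")
    return {v: x / norm for v, x in vec.items()}


def power_iterate(adj, max_iterations=10, initialization_mode="random", seed=0):
    """Run power iteration on the row-normalized affinity matrix W = D^-1 A."""
    degree = {v: sum(nbrs.values()) for v, nbrs in adj.items()}
    if initialization_mode == "degree":
        start = dict(degree)  # sum of similarities per vertex, normalized below
    elif initialization_mode == "random":
        rng = random.Random(seed)
        start = {v: rng.random() for v in adj}
    else:
        raise ValueError("initialization_mode must be 'random' or 'degree'")
    vec = l1_normalize(start)
    for _ in range(max_iterations):
        # One step v <- W v: a similarity-weighted average over each vertex's neighbors.
        stepped = {
            v: sum(w * vec[u] for u, w in nbrs.items()) / degree[v] if degree[v] > 0 else 0.0
            for v, nbrs in adj.items()
        }
        vec = l1_normalize(stepped)
    return vec


def kmeans_1d(vec, k, rounds=20):
    """Cluster the scalar vertex values; assumes k <= number of vertices."""
    values = sorted(vec.values())
    n = len(values)
    # Deterministic start: evenly spaced picks from the sorted values.
    centers = [values[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = {}
    for _ in range(rounds):
        labels = {v: min(range(k), key=lambda c: abs(x - centers[c])) for v, x in vec.items()}
        for c in range(k):
            members = [vec[v] for v, lab in labels.items() if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = sum(members) / len(members)
    return labels


def pic(edges, k, max_iterations=10, initialization_mode="random"):
    """Return a vertex -> cluster label map for the given similarity tuples."""
    adj = build_affinity(edges)
    vec = power_iterate(adj, max_iterations, initialization_mode)
    return kmeans_1d(vec, k)


# Two tightly connected triangles joined by one weak edge.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 3.0), (4, 5, 3.0), (3, 5, 3.0),
         (2, 3, 0.1)]
assignments = pic(edges, k=2, initialization_mode="degree")
print(assignments)
```

The number of iterations is deliberately small: if the iteration runs for too long on a connected graph, the vector drifts toward a constant and the cluster structure fades. Note also that "degree" initialization cannot separate groups that look identical by degree (for example, two triangles with the same weights), which is one reason a random start can be useful.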