'PageRank implementation for Research

After reading theory of PageRank algorithm from this site I would like to play with it. I am trying to implement this in Java. I mean I would like to play with PageRank in detail (like giving different weights and so on). For this I need to build hyperlink matrix. If I have 1 million nodes then my hyperlink matrix will be 1 million x 1 million size, which causes this exception:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at WebGraph.main(WebGraph.java:6)

How can I implement PageRank in Java, is there any way of storing hyperlink matrix?



Solution 1:[1]

Python networkx module has a nice implementation of pagerank. It uses scipy/numpy for the matrix implementation. The below two questions on stackoverflow should be enough to get you started.

Solution 2:[2]

A few suggestions:

  • Use python, not Java: python is an excellent prototyping language, and has available sparse matrices (in scipy) as well as many other goodies. As others have noted, it also has a pagerank implementation.

  • Store your data not all in memory: any type of lightweight database would be fine, for instance sqlite, hibernate, ...

  • Work on tiles of the data: if there is a big matrix NxN, break it up into small tiles MxM where M is a fraction of N, that fit in memory. Combined with sparse matrices this allows you to work with really big N (hundreds of millions to billions, depending on how sparse the data is).

Solution 3:[3]

As Dan W suggested, try to increase the heap size. If you run your Java application from the command line, just add the switch -Xmx with the desired heap size. Let's assume you compiled your Java code into a runnable JAR file called pagerank.jar, and you want to set your heap size to 512 MB, you would issue the following command:

java -jar -Xmx512m pagerank.jar

EDIT: But that only works if you don't have that many "pages" ... A 1 Million x 1 Million array is too big to fit into your RAM (1 trillion times * 64 bit double value = 7.27595761 terabytes). You should change your algorithm to load chunks of data from the disk, manipulate it, and store it back to disk.

You could use a graph database like Neo4j for that purpose.

Solution 4:[4]

You don't have to store the whole 1000000x1000000 matrix, because most matrix entries will be zero. Instead, you can (for example) store a list of nonzero entries for each row, and write your matrix functions to use it directly, without expanding it into a full matrix.

This kind of compressed representation is called a sparse matrix format, and most matrix libraries have an option to build and work with sparse matrices.

One disadvantage with sparse matrices is that multiplying two of them will result in a matrix which is much less sparse. However, the PageRank algorithm is designed so that you don't need to do that: the hyperlink matrix is constant, and only the score vector is updated.

Solution 5:[5]

PageRank is performed by Google using the 'Pregel' BSP (really just keywords) framework.

I remembered Apache Giraph (another Pregel), which includes a version of PageRank in its benchmark package.

Here's a video about Giraph: it's an introduction, and it specifically talks about handling PageRank.

If that doesn't work:

In Java there is an implementation of Pregel called GoldenOrb.

Pseudo code for the PageRank algorithm is here (on a different implementation of Pregel).

You'll have to read around BSP, and PageRank to handle the size of data you have.

Solution 6:[6]

Because the matrix is sparse you can implement dimensionality reduction like svd,pca,mds or Lsi that includes svd. There is a library to implement this kind of processes which is called Jama. You can find it here

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Community
Solution 2 Alex I
Solution 3
Solution 4 comingstorm
Solution 5
Solution 6 Lefteris Bab