Saturday, February 9, 2013

PageRank explained

Today, I thought I'd post about something near and dear to my heart: math.  When I was a senior at BYU (insert obligatory boos) studying numerical analysis, I wrote a paper about the PageRank algorithm for one of my classes.  Seeing as this is a web analytics class and a big part of web analytics these days is search engine optimization, I thought I’d revisit the topic.  This time, though, I’ll do it in a way that is a lot simpler, involves fewer mathematical proofs, and, I hope, is less boring.
For those of you who don’t know what PageRank is: before Google adopted the “whoever gives the most money to Google wins” algorithm, Larry Page and Sergey Brin from Stanford developed a way to, in essence, let the internet itself determine the relative importance of the pages that it contains.  In the algorithm, each member (page) of a group of hyperlinked documents (aka the internet) is assigned a weight based on the hyperlinks to it from other pages.  So, a page with a lot of links to it generally has a higher rank than a page with only a few links to it.

How it works

Suppose we have an internet with 4 pages: A, B, C, and D with links to each other as illustrated below.  In this case, A has one link to it, from D; B has three links to it, from A, C, and D; C has one link to it, from D; and D has two links to it, one from A and one from B.  Every time a page links to another page, it transfers a portion of its “rank” to the page that it links to, split evenly among all of its outbound links.  For example, D links to A, B, and C, so it transfers a third of its rank to A, a third to B, and a third to C.

So in our example, the rank of each page is represented by the following equations:
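
Reading the links off the description above (A links to B and D, B links to D, C links to B, and D links to A, B, and C), and writing A, B, C, and D for the ranks themselves, the equations would be:

```latex
\begin{align*}
A &= \tfrac{1}{3} D \\
B &= \tfrac{1}{2} A + C + \tfrac{1}{3} D \\
C &= \tfrac{1}{3} D \\
D &= \tfrac{1}{2} A + B
\end{align*}
```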

Or, using matrix notation, the vector of ranks is the solution to the system of equations below.
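
In matrix form, a reconstruction from the same link description might read as follows. Each column holds the shares that one page hands out (column order A, B, C, D), so every column sums to 1:

```latex
\[
\begin{pmatrix} A \\ B \\ C \\ D \end{pmatrix}
=
\begin{pmatrix}
0 & 0 & 0 & \tfrac{1}{3} \\
\tfrac{1}{2} & 0 & 1 & \tfrac{1}{3} \\
0 & 0 & 0 & \tfrac{1}{3} \\
\tfrac{1}{2} & 1 & 0 & 0
\end{pmatrix}
\begin{pmatrix} A \\ B \\ C \\ D \end{pmatrix}
\]
```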

Those of you who are mathematicians will notice that the PageRank values are an eigenvector of the matrix of link weights (the one with eigenvalue 1).  Any multiple of an eigenvector is also an eigenvector, so in our case we want the one where the sum of all the ranks is 1.  So, for our model A=0.13, B=0.33, C=0.13, and D=0.40.
So, what exactly does this rank mean?  One interpretation is that it is the probability that after following links for a long time you’ll end up on that particular page (if you try this on the real internet, you’ll likely either end up looking at Wikipedia or porn).
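
If you'd rather check those numbers than take my word for it, here is a minimal sketch that finds them by repeatedly passing rank along the links until the values settle (this is usually called power iteration). The `LINKS` dictionary and the `simple_pagerank` function are names I made up for this example, and this is the simplified model from this post, not Google's actual implementation.

```python
# Hypothetical encoding of the four-page example: page -> pages it links to.
LINKS = {
    "A": ["B", "D"],
    "B": ["D"],
    "C": ["B"],
    "D": ["A", "B", "C"],
}


def simple_pagerank(links, iterations=100):
    """Repeatedly hand each page's rank to the pages it links to."""
    pages = sorted(links)
    # Start with every page equally ranked; the ranks sum to 1 throughout.
    ranks = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_ranks = {page: 0.0 for page in pages}
        for page, targets in links.items():
            # A page splits its rank evenly among its outbound links.
            share = ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        ranks = new_ranks
    return ranks


ranks = simple_pagerank(LINKS)
for page in sorted(ranks):
    print(page, round(ranks[page], 2))
```

This sketch assumes every page has at least one outbound link, which is true for our little internet but not for the real one.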
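
To make the "following links for a long time" interpretation concrete, here is a rough simulation of a surfer who clicks a random link on every page. The function name `random_surfer`, the step count, the seed, and the starting page are all arbitrary choices of mine; the fraction of time spent on each page should land near the ranks above.

```python
import random

# Hypothetical encoding of the four-page example: page -> pages it links to.
LINKS = {
    "A": ["B", "D"],
    "B": ["D"],
    "C": ["B"],
    "D": ["A", "B", "C"],
}


def random_surfer(links, steps=200_000, seed=0):
    """Follow random links and report the fraction of visits to each page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    visits = {page: 0 for page in links}
    page = "A"  # arbitrary starting page
    for _ in range(steps):
        page = rng.choice(links[page])  # click one outbound link at random
        visits[page] += 1
    return {page: count / steps for page, count in visits.items()}


frequencies = random_surfer(LINKS)
for page in sorted(frequencies):
    print(page, round(frequencies[page], 2))
```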
This is a simplified version of PageRank.  The actual algorithm is a bit different to take into account that not all pages have outbound links, people don’t just follow links all day when they surf the internet, etc., but this is essentially how it works.
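
For the curious, here is one common way those two adjustments are described; treat it as a sketch under my own assumptions, not as what Google actually runs. With probability `damping` the surfer follows a link, and otherwise jumps to a random page. A page with no outbound links spreads its rank evenly over every page. The value 0.85 is the damping factor usually quoted from Page and Brin's original paper, and page E is a hypothetical dead-end page I added to the example.

```python
# The four-page example plus a hypothetical page E that links nowhere.
LINKS_WITH_DEAD_END = {
    "A": ["B", "D"],
    "B": ["D"],
    "C": ["B"],
    "D": ["A", "B", "C", "E"],
    "E": [],
}


def damped_pagerank(links, damping=0.85, iterations=100):
    """Simplified PageRank with random jumps and dead-end handling."""
    pages = sorted(links)
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank sitting on dead-end pages is shared among all pages.
        dangling = sum(ranks[page] for page in pages if not links[page])
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {page: base for page in pages}
        for page in pages:
            targets = links[page]
            for target in targets:
                new_ranks[target] += damping * ranks[page] / len(targets)
        ranks = new_ranks
    return ranks


damped = damped_pagerank(LINKS_WITH_DEAD_END)
for page in sorted(damped):
    print(page, round(damped[page], 3))
```

Setting `damping=1.0` on a graph with no dead ends reduces this to the simplified version described above.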

What it means for your site

Knowing all this, what does this mean for your site?  Well, the first and most obvious thing is that the more links to your page, the better.  You might be thinking “great, I can just go out and plaster links to my site all over message boards, blogs, Facebook, Twitter, etc. to increase its PageRank” or “OK, I can just pay people to put links to my site on theirs.”  Unfortunately, to prevent this kind of spamming, most message boards, blogs, etc. mark user-posted links with the "nofollow" attribute (rel="nofollow"), which tells Google not to include those links in the PageRank calculation.  Also, Google has specifically cautioned against selling links to increase PageRank.  If Google catches people doing this, their links are excluded from the calculations [1].  For this reason, Google has advised using "nofollow" on sponsored links.

Also, take note of where links to your page are coming from.  Remember that when a page links to yours, it transfers a portion of its PageRank to it.  A link from a big, important site is worth a lot more than a bunch of links from small, obscure sites.
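
Our four-page example shows this on a small scale. In the simplified model, D gets links from both A and B, but the two links are worth very different amounts. The helper name `rank_passed` is mine, and the ranks are the rounded values from earlier in the post.

```python
def rank_passed(rank, outbound_links):
    """Rank one link hands over in the simplified model: an equal share."""
    return rank / outbound_links


from_b = rank_passed(0.33, 1)  # B links only to D, so D gets all of it
from_a = rank_passed(0.13, 2)  # A splits its rank between B and D

print("from B:", from_b)
print("from A:", from_a)
```

B's link is worth about five times as much as A's, both because B has the higher rank and because B does not split it with anyone else.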