Today, I thought I'd post about something near and dear to my heart: math. When I was a senior at BYU (insert obligatory boos) studying
numerical analysis, I wrote a paper about the PageRank algorithm for one of my
classes. Seeing as this is a web
analytics class and a big part of web analytics these days is search
engine optimization, I thought I’d revisit the topic. This time, though, I’ll do it in a way that
is a lot simpler, involves fewer mathematical proofs, and, I hope, is less boring.
For those of you who don’t know what PageRank is: before
Google adopted the “whoever gives the most money to Google wins” algorithm,
Larry Page and Sergey Brin from Stanford developed a way to, in essence, let the
internet itself determine the relative importance of the pages that it
contains. In the algorithm, each member
(page) of a set of hyperlinked documents (aka the internet) is assigned a
weight based on the number of hyperlinks to it from other pages. So, a page with a lot of links to it has a
higher rank than a page with only a few links to it.
How it works
Suppose we have an internet with
4 pages: A, B, C, and D, linked to each other as follows. A has one link to it, from D; B
has three links to it, from A, C, and D; C has one link to it, from D; and D has 2
links to it, one from A and one from B.
Every time a page links to another page, it transfers a portion of its
“rank” to the page that it links to, split evenly among all of the pages it links to. For example,
D links to A, B, and C, so it transfers a third of its rank to A, a third
to B, and a third to C.
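So in our example, the ranks of the pages are represented by the following equations:
A = D/3
B = A/2 + C + D/3
C = D/3
D = A/2 + B
Or, using matrix notation, the ranks are the
solution to the system of equations below.
[A]   [  0    0    0   1/3 ] [A]
[B] = [ 1/2   0    1   1/3 ] [B]
[C]   [  0    0    0   1/3 ] [C]
[D]   [ 1/2   1    0    0  ] [D]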
Those of you who are mathematicians will
notice that the PageRank values are an eigenvector of the matrix of link
weights, namely one with eigenvalue 1. An eigenvector is only determined up to a scale factor, so in our case we want the one where the sum of all the ranks is 1. So, for our model, A=0.13, B=0.33, C=0.13, and
D=0.4.
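If you'd rather let a computer grind through this, here is a minimal Python sketch of the simplified model for the four-page example. It starts every page with an equal rank and repeatedly passes rank along the links until the numbers settle. The names `links` and `simple_pagerank` are my own, not part of any library, and this is only the toy version described above, not Google's actual implementation.

```python
# Outbound links for the four-page example: page -> pages it links to.
links = {
    "A": ["B", "D"],
    "B": ["D"],
    "C": ["B"],
    "D": ["A", "B", "C"],
}


def simple_pagerank(links, iterations=100):
    """Repeatedly pass each page's rank along its outbound links."""
    pages = sorted(links)
    # Start with the total rank of 1 split evenly across all pages.
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for page, targets in links.items():
            # A page gives an equal share of its rank to each page it links to.
            share = rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


ranks = simple_pagerank(links)
for page in sorted(ranks):
    print(page, round(ranks[page], 2))
# prints:
# A 0.13
# B 0.33
# C 0.13
# D 0.4
```

Because every page here has at least one outbound link, the total rank stays at 1 on every pass, so no rescaling is needed at the end.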
So, what exactly does this rank
mean? One interpretation is that it
is the probability that, after following links for a long time, you’ll end up on
that particular page (if you try this on the real internet, you’ll likely
either end up looking at Wikipedia or porn).
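To see that interpretation in action, here is a rough Python sketch that sends a made-up "random surfer" clicking through the four-page example and counts how often each page gets visited. The function name `random_surfer`, the step count, and the seed are all my own choices for illustration.

```python
import random

# Outbound links for the four-page example: page -> pages it links to.
links = {
    "A": ["B", "D"],
    "B": ["D"],
    "C": ["B"],
    "D": ["A", "B", "C"],
}


def random_surfer(links, steps=200_000, seed=42):
    """Follow random links and return the fraction of visits to each page."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same counts
    visits = {page: 0 for page in links}
    page = sorted(links)[0]  # arbitrary starting page
    for _ in range(steps):
        page = rng.choice(links[page])  # click one outbound link at random
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}


frequencies = random_surfer(links)
for page in sorted(frequencies):
    print(page, round(frequencies[page], 2))
```

The fractions should land close to the ranks above (about 0.13, 0.33, 0.13, and 0.4), and they get closer the longer the surfer keeps clicking.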
This is a simplified version of
PageRank. The actual algorithm is a bit
different: it takes into account that not all pages have outbound links, that people
don’t just follow links all day when they surf the internet, etc., but this is
essentially how it works.
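For the curious, here is a hedged Python sketch of how those two wrinkles are commonly handled. A "damping factor" models a surfer who sometimes gets bored and jumps to a random page instead of following a link, and a page with no outbound links is treated as if it linked to every page. The value 0.85 is the commonly cited damping factor; the function name `damped_pagerank` and the dead-end handling are my assumptions about one reasonable way to do it, not a description of what Google runs.

```python
def damped_pagerank(links, damping=0.85, iterations=100):
    """PageRank sketch with a damping factor and dead-end pages.

    Assumes every page that is linked to also appears as a key in `links`.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Baseline share from surfers who jump to a random page.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            # A page with no outbound links spreads its rank over all pages.
            targets = links[page] or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# The four-page example from this post.
example_links = {
    "A": ["B", "D"],
    "B": ["D"],
    "C": ["B"],
    "D": ["A", "B", "C"],
}
damped = damped_pagerank(example_links)
for page in sorted(damped):
    print(page, round(damped[page], 3))

# A hypothetical two-page web where B is a dead end.
dead_end = damped_pagerank({"A": ["B"], "B": []})
```

With damping, the ranks in the example shift a little toward the less popular pages, but D still comes out on top, followed by B.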
What it means for your site
Knowing all this, what does it mean for
your site? Well, the first and most
obvious thing is that the more links to your page, the better. You might be thinking “great, I can just go
out and plaster links to my site all over message boards, blogs, Facebook,
Twitter, etc. to increase its PageRank” or “OK, I can just pay people to put
links to my site on theirs.”
Unfortunately, most message boards, blogs, etc. mark user-posted links with the "nofollow" attribute (rel="nofollow"),
which tells Google not to include these links in the PageRank calculation, to
prevent this kind of spamming. Also,
Google has specifically cautioned against selling links to increase PageRank. If it catches sites doing this, their links
are excluded from calculations [1]. For
this reason, Google has advised using the "nofollow" attribute on sponsored links.
Also, take note of where links to your
page are coming from. Remember that when
a page links to yours, it transfers a portion of its PageRank to it. A link from a big, important site is worth a
lot more than a bunch of links from small, obscure sites.
1. http://en.wikipedia.org/wiki/Pagerank
I never knew how many moving pieces there were to page ranking. Thanks for the details!
Wow. This was slightly over my head, but also very helpful. I will definitely be using some of these tips and the idea of the algorithm you posted in future optimization of my website.
Wow very interesting. Good insight to keep in mind