22 October 2008

Google PageRank: The Mathematics Behind It All

PageRank is some order for pages that rates them based on how important they are. Since "importance" is such a subjective term, how can you actually rank pages? Search engines have to solve this problem so they know what to display first. Luckily, I learned something interesting in my probability course about Google PageRank.

Let's start off with an amazing(ly vague) definition that tells us very little and a whole lot at the same time:
"A page has high rank if the sum of the ranks of its backlinks is high."

Wow. What a statement. Let's break it down.

What Is A Backlink?
It's when someone else links to your website. So it doesn't matter if you have a million links to other websites, it only matters if those million wesbites have links to you. Sounds simple enough.

So what if you have a thousands of dummy pages that all link to your page? Well, it would seem that your page gets a high rank then. However, how does Google prevent this? The pages that link to your page need to be important too. So if some big website links to you, you're somewhat important too.

Probability and Stationary Equations (for Math People)
Anyways, it turns out Google PageRank uses this nifty idea in probability involving stationary equations of Markov chains (anyone scared of math, close your eyes now and scroll to the end):






Putting it together, we get that the PageRank of page i is the sum of all the PageRanks of page j times the probability of it going to page i.

This is the exact formula for stationary equations in Markov chains. So now, if you have a bunch of dummy pages that link to your page, you're still not important because the probability that anyone goes to those dummy pages if very low (hence all of those PageRanks stay low). Now let's try to screw this thing up too.

What if you have some person that decides not link to anyone at all! Or maybe just links to itself. So if even one page has a link TO this page, then as time goes by, everyone will end up at this page and get "stuck" (we don't use the back button since we only follow links). This is called a deadend or spidertrap. In the stationary equations, what happens is that this page gets ALL the importance.

Taxation
One solution to this: taxation. Basically, "tax" everyone's page and take that percent and redistribute links randomly. So suppose there's a 30% tax. Then the probabilities of going through a link is only 70% of what it originally was. Then we take that 30%, and add a random link to some random page. So now, with probability 70%, the "spidertrap" goes back to itself. The other 30% goes to some other page(s).

So you might be thinking how do you calculate the probability of going from one webpage to another. Easy way: you don't! Just assume that you have equal probability. I guess there are better ways that might get only a slightly better answer, but to save time and calculation power, each link has an equal probability of being followed.

For the Non-Math People
So those that skipped over the previous section. Here's a summary: There's an interesting topic in probability called stationary equations that helps you find what "fraction" of the time you are on a webpage. Essentially, if you have a lot of high-ranked websites backlinking to you, many visitors are very likely spending lots of time on your webpage.

What if you try to hog all that importance? What if you don't put links out to other people? In theory (without the browser back button), visitors that come to your page will get "stuck" there. This is called a deadend or a spidertrap.

Taxes on Webpages (For Non-Math People)
Taxation solves this by solving the calculations in stationary equations with random links pointing from every page to every other page. In other words, it creates smaller and less important links that aren't really there when doing the calculations. So when you get stuck in some page, you can follow the non-existent link to something else. It's called a tax because you can think of it as though you're very unlikely to follow that non-existent link and the webpage loses some of it's PageRank (based on how big the tax is).


You can kind of think of it as money. Everyone gets money, everyone gives money. If someone only got money and never spent anything, you can imagine that over a long period of time, he would have most of the money in the world. The taxes basically take that money and give it back to everyone. (Think communism when you see 100% tax... everyone is equal).

Current Model
Google is constantly changing their PageRank algorithm but this is essentially the basis for their model. Understanding Google's PageRank can really help if you have a new website and you want become important really fast. Or if you're interested in working for Google (or planning on buying out Google), doesn't hurt to know this either.

1 comment:

Andy said...

Hmmmm, your point of view is interesting. Google Page Rank for most people is so complicated but your understanding of it, as well as your sharing information makes me impressed. Keep up a good work.