Definition of PageRank

In document: Jeffrey D. Ullman (pages 159-163)

PageRank is a function that assigns a real number to each page in the Web (or at least to that portion of the Web that has been crawled and its links discovered). The intent is that the higher the PageRank of a page, the more “important” it is. There is not one fixed algorithm for assignment of PageRank, and in fact variations on the basic idea can alter the relative PageRank of any two pages. We begin by defining the basic, idealized PageRank, and follow it by modifications that are necessary for dealing with some real-world problems concerning the structure of the Web.

Think of the Web as a directed graph, where pages are the nodes, and there is an arc from page $p_1$ to page $p_2$ if there are one or more links from $p_1$ to $p_2$. Figure 5.1 is an example of a tiny version of the Web, where there are only four pages. Page A has links to each of the other three pages; page B has links to A and D only; page C has a link only to A, and page D has links to B and C only.

Figure 5.1: A hypothetical example of the Web (four nodes: A, B, C, and D)

Suppose a random surfer starts at page A in Fig. 5.1. There are links to B, C, and D, so this surfer will next be at each of those pages with probability 1/3, and has zero probability of being at A. A random surfer at B has, at the next step, probability 1/2 of being at A, 1/2 of being at D, and 0 of being at B or C.

In general, we can define the transition matrix of the Web to describe what happens to random surfers after one step. This matrix M has n rows and columns, if there are n pages. The element $m_{ij}$ in row i and column j has value 1/k if page j has k arcs out, and one of them is to page i. Otherwise, $m_{ij} = 0$.

Example 5.1 : The transition matrix for the Web of Fig. 5.1 is

$$
M = \begin{bmatrix}
0 & 1/2 & 1 & 0 \\
1/3 & 0 & 0 & 1/2 \\
1/3 & 0 & 0 & 1/2 \\
1/3 & 1/2 & 0 & 0
\end{bmatrix}
$$

In this matrix, the order of the pages is the natural one, A, B, C, and D. Thus, the first column expresses the fact, already discussed, that a surfer at A has a 1/3 probability of next being at each of the other pages. The second column expresses the fact that a surfer at B has a 1/2 probability of being next at A and the same of being at D. The third column says a surfer at C is certain to be at A next. The last column says a surfer at D has a 1/2 probability of being next at B and the same at C.
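As a rough illustration of the definition of M, the sketch below rebuilds the matrix of Example 5.1 from the link structure of Fig. 5.1. The names `links` and `transition_matrix` are hypothetical helpers introduced here, not part of the text, and exact fractions are used only so that the entries print as 1/2 and 1/3.

```python
from fractions import Fraction

# Out-links of the four-page Web of Fig. 5.1 (page -> pages it links to).
links = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "C"],
}

def transition_matrix(links):
    """Return (pages, M) with M[i][j] = 1/k if page j has k arcs out and one goes to page i."""
    pages = sorted(links)                      # natural order A, B, C, D
    index = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    M = [[Fraction(0)] * n for _ in range(n)]
    for src, targets in links.items():
        arcs = set(targets)                    # several links to one page form a single arc
        for dst in arcs:
            M[index[dst]][index[src]] = Fraction(1, len(arcs))
    return pages, M

pages, M = transition_matrix(links)
for row in M:
    print("  ".join(f"{str(x):>3}" for x in row))
```

Each column of the result sums to 1, which is the property the text later calls stochastic.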

5.1. PAGERANK 149

The probability distribution for the location of a random surfer can be described by a column vector whose jth component is the probability that the surfer is at page j. This probability is the (idealized) PageRank function.

Suppose we start a random surfer at any of the n pages of the Web with equal probability. Then the initial vector $v_0$ will have 1/n for each component.

If M is the transition matrix of the Web, then after one step, the distribution of the surfer will be $M v_0$, after two steps it will be $M(M v_0) = M^2 v_0$, and so on. In general, multiplying the initial vector $v_0$ by M a total of i times will give us the distribution of the surfer after i steps.

To see why multiplying a distribution vector v by M gives the distribution $x = Mv$ at the next step, we reason as follows. The probability $x_i$ that a random surfer will be at node i at the next step is $\sum_j m_{ij} v_j$. Here, $m_{ij}$ is the probability that a surfer at node j will move to node i at the next step (often 0 because there is no link from j to i), and $v_j$ is the probability that the surfer was at node j at the previous step.
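The sum over j can be written out directly. The following minimal sketch applies one step to the uniform starting vector for the matrix of Example 5.1; the function name `step` is an assumption made for illustration.

```python
from fractions import Fraction as F

# Transition matrix of Example 5.1, pages in the order A, B, C, D.
M = [
    [F(0),    F(1, 2), F(1), F(0)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(0),    F(0), F(1, 2)],
    [F(1, 3), F(1, 2), F(0), F(0)],
]

def step(M, v):
    """One step of the random surfer: x_i = sum over j of m_ij * v_j."""
    n = len(v)
    return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

v0 = [F(1, 4)] * 4            # surfer starts at each page with probability 1/4
v1 = step(M, v0)
print([str(x) for x in v1])   # ['3/8', '5/24', '5/24', '5/24']
```

The components of `v1` still sum to 1, as they must for a probability distribution.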

This sort of behavior is an example of the ancient theory of Markov processes.

It is known that the distribution of the surfer approaches a limiting distribution v that satisfies $v = Mv$, provided two conditions are met:

1. The graph is strongly connected; that is, it is possible to get from any node to any other node.

2. There are no dead ends: nodes that have no arcs out.

Note that Fig. 5.1 satisfies both these conditions.
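Both conditions can be tested mechanically on a small graph. The sketch below is one possible check, assuming the graph is given as a dictionary of out-links; `has_dead_end`, `reachable`, and `is_strongly_connected` are hypothetical names. It uses the standard fact that a graph is strongly connected exactly when a single start node reaches every node both in the graph and in the graph with all arcs reversed.

```python
from collections import deque

# Out-links of the Web of Fig. 5.1.
links = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "C"],
}

def all_nodes(links):
    """Every page that appears as a source or as a target."""
    nodes = set(links)
    for targets in links.values():
        nodes |= set(targets)
    return nodes

def has_dead_end(links):
    """True if some node has no arcs out."""
    return any(len(links.get(node, [])) == 0 for node in all_nodes(links))

def reachable(graph, start):
    """Set of nodes reachable from start, found by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def is_strongly_connected(links):
    nodes = all_nodes(links)
    reverse = {node: [] for node in nodes}
    for src, targets in links.items():
        for dst in targets:
            reverse[dst].append(src)
    start = next(iter(nodes))
    return reachable(links, start) == nodes and reachable(reverse, start) == nodes

print(is_strongly_connected(links), has_dead_end(links))
```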

The limit is reached when multiplying the distribution by M another time does not change the distribution. In other terms, the limiting v is an eigenvector of M (an eigenvector of a matrix M is a vector v that satisfies $Mv = \lambda v$ for some constant eigenvalue $\lambda$). In fact, because M is stochastic, meaning that its columns each add up to 1, v is the principal eigenvector (its associated eigenvalue is the largest of all eigenvalues). Note also that, because M is stochastic, the eigenvalue associated with the principal eigenvector is 1.
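As a concrete check, take the vector with components 1/3, 2/9, 2/9, 2/9, which Example 5.2 gives as the limit for the Web of Fig. 5.1. Multiplying it by the matrix M of Example 5.1 returns the same vector, so it is an eigenvector with eigenvalue 1:

```latex
M v =
\begin{bmatrix}
0 & 1/2 & 1 & 0 \\
1/3 & 0 & 0 & 1/2 \\
1/3 & 0 & 0 & 1/2 \\
1/3 & 1/2 & 0 & 0
\end{bmatrix}
\begin{bmatrix} 1/3 \\ 2/9 \\ 2/9 \\ 2/9 \end{bmatrix}
=
\begin{bmatrix}
\tfrac12 \cdot \tfrac29 + 1 \cdot \tfrac29 \\
\tfrac13 \cdot \tfrac13 + \tfrac12 \cdot \tfrac29 \\
\tfrac13 \cdot \tfrac13 + \tfrac12 \cdot \tfrac29 \\
\tfrac13 \cdot \tfrac13 + \tfrac12 \cdot \tfrac29
\end{bmatrix}
=
\begin{bmatrix} 1/3 \\ 2/9 \\ 2/9 \\ 2/9 \end{bmatrix}
= 1 \cdot v
```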

The principal eigenvector of M tells us where the surfer is most likely to be after a long time. Recall that the intuition behind PageRank is that the more likely a surfer is to be at a page, the more important the page is. We can compute the principal eigenvector of M by starting with the initial vector v0 and multiplying by M some number of times, until the vector we get shows little change at each round. In practice, for the Web itself, 50–75 iterations are sufficient to converge to within the error limits of double-precision arithmetic.
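The iteration described above might be sketched as follows for the matrix of Example 5.1. The stopping rule (total absolute change below `tol`) and the parameter names `tol` and `max_rounds` are assumptions chosen for this illustration; the text only says to stop when the vector shows little change.

```python
# Transition matrix of Example 5.1 (pages in the order A, B, C, D), as floats.
M = [
    [0.0, 1/2, 1.0, 0.0],
    [1/3, 0.0, 0.0, 1/2],
    [1/3, 0.0, 0.0, 1/2],
    [1/3, 1/2, 0.0, 0.0],
]

def power_iterate(M, v0, tol=1e-12, max_rounds=100):
    """Repeatedly replace v by Mv until the vector changes by less than tol."""
    v = list(v0)
    for rounds in range(1, max_rounds + 1):
        nxt = [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]
        change = sum(abs(a - b) for a, b in zip(nxt, v))
        v = nxt
        if change < tol:
            return v, rounds
    return v, max_rounds        # did not converge within max_rounds

n = len(M)
v, rounds = power_iterate(M, [1.0 / n] * n)   # start from the uniform vector
print(rounds, [round(x, 6) for x in v])
```

A dense list-of-lists matrix is only workable for toy graphs like this one; it is used here to keep the correspondence with M visible.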

Example 5.2 : Suppose we apply the process described above to the matrix M from Example 5.1. Since there are four nodes, the initial vector $v_0$ has four components, each 1/4. The sequence of approximations to the limit that we get by multiplying at each step by M is:

$$
\begin{bmatrix} 1/4 \\ 1/4 \\ 1/4 \\ 1/4 \end{bmatrix},\quad
\begin{bmatrix} 9/24 \\ 5/24 \\ 5/24 \\ 5/24 \end{bmatrix},\quad
\begin{bmatrix} 15/48 \\ 11/48 \\ 11/48 \\ 11/48 \end{bmatrix},\quad
\begin{bmatrix} 11/32 \\ 7/32 \\ 7/32 \\ 7/32 \end{bmatrix},\quad \ldots,\quad
\begin{bmatrix} 3/9 \\ 2/9 \\ 2/9 \\ 2/9 \end{bmatrix}
$$

Notice that in this example, the probabilities for B, C, and D remain the same. It is easy to see that B and C must always have the same values at any iteration, because their rows in M are identical. To show that their values are also the same as the value for D, an inductive proof works, and we leave it as an exercise. Given that the last three values of the limiting vector must be the same, it is easy to discover the limit of the above sequence. The first row of M tells us that the probability of A must be 3/2 the other probabilities, so the limit has the probability of A equal to 3/9, or 1/3, while the probability for the other three nodes is 2/9.

This difference in probability is not great. But in the real Web, with billions of nodes of greatly varying importance, the true probability of being at a node like www.amazon.com is orders of magnitude greater than the probability of typical nodes.

Solving Linear Equations

If you look at the 4-node “Web” of Example 5.2, you might think that the way to solve the equation $v = Mv$ is by Gaussian elimination. Indeed, in that example, we argued what the limit would be essentially by doing so. However, in realistic examples, where there are tens or hundreds of billions of nodes, Gaussian elimination is not feasible. The reason is that Gaussian elimination takes time that is cubic in the number of equations.

Thus, the only way to solve equations on this scale is to iterate as we have suggested. Even that iteration is quadratic in the number of nodes at each round, but we can speed it up by taking advantage of the fact that the matrix M is very sparse; there are on average about ten links per page, i.e., ten nonzero entries per column.

Moreover, there is another difference between PageRank calculation and solving linear equations. The equation $v = Mv$ has an infinite number of solutions, since we can take any solution v, multiply its components by any fixed constant c, and get another solution to the same equation. When we include the constraint that the sum of the components is 1, as we have done, then we get a unique solution.
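The sparsity point made in the “Solving Linear Equations” box can be illustrated with a small sketch. Instead of storing all n × n entries of M, it stores only each page's out-links, so one round costs time proportional to the number of links rather than to n squared. The names `out_links` and `sparse_step` are hypothetical, the count of 60 rounds is an arbitrary choice for this toy graph, and the code assumes there are no dead ends and no duplicate links.

```python
# Out-links of the Web of Fig. 5.1; column j of M has one nonzero entry per out-link of page j.
out_links = {
    "A": ["B", "C", "D"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "C"],
}

def sparse_step(out_links, v):
    """One round v -> Mv that touches only the nonzero entries of M."""
    nxt = {page: 0.0 for page in v}
    for src, targets in out_links.items():
        share = v[src] / len(targets)      # each of the k out-links carries 1/k of src's probability
        for dst in targets:
            nxt[dst] += share
    return nxt

v = {page: 1.0 / len(out_links) for page in out_links}   # uniform start
for _ in range(60):
    v = sparse_step(out_links, v)
print({page: round(p, 6) for page, p in v.items()})
```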
