S2ProT: rank allocation by superpositioned propagation of topic relevance
The Authors
Ola Ågren, Department of Computing Science, Umeå University, Umeå, Sweden
Acknowledgements
The authors thank the referees for their valuable comments that have led to numerous improvements on the structure and content of the paper. They also thank all the participants of the assessment.
Abstract
Purpose – The purpose of this paper is to assign topic-specific ratings to web pages.
Design/methodology/approach – The paper uses power iteration to assign topic-specific rating values (called relevance) to web pages, creating a ranking or partial order among these pages for each topic. This approach depends on a set of pages that are initially assumed to be relevant for a specific topic; the spatial link structure of the web pages; and a net-specific decay factor designated ξ.
Findings – The paper finds that this approach exhibits desirable properties such as fast convergence, stability and yields relevant answer sets. The first property will be shown using theoretical proofs, while the others are evaluated through stability experiments and assessments of real world data in comparison with already established algorithms.
Research limitations/implications – In the assessment, all pages that a web spider was able to find in the Nordic countries were used. It is also important to note that entities that use domains outside the Nordic countries (e.g..com or.org) are not present in the paper's datasets even though they reside logically within one or more of the Nordic countries. This is quite a large dataset, but still small in comparison with the entire worldwide web. Moreover, the execution speed of some of the algorithms unfortunately prohibited the use of a large test dataset in the stability tests.
Practical implications – It is not only possible, but also reasonable, to perform ranking of web pages without using Markov chain approaches. This means that the work of generating answer sets for complex questions could (at least in theory) be divided into smaller parts that are later summed up to give the final answer.
Originality/value – This paper contributes to the research on internet search engines.
Article Type:
Research paper
Keyword(s):
Worldwide web; Information retrieval; Spatial data structures; Search engines.
Journal:
International Journal of Web Information Systems
Volume:
4
Number:
4
Year:
2008
pp:
416-440
Copyright ©
Emerald Group Publishing Limited
ISSN:
1744-0084
1 Introduction
Web search is normally performed using search engines that process text queries, where each query is built up by a combination of search terms (called topics in this paper). Since the web has grown to such an enormous size, we can expect the number of pages matching each text query to be huge as well. This means that there should be some way of deciding the order (or ranking) of the resulting pages as a support for the user. While many different methods of ranking pages on the internet exist, some have been more prolific than others. Probably the most important method relies on the fact that people tend to put in links to pages that they find interesting on roughly the same topics as their own pages, e.g. pages on table tennis tend to have links to other pages and sites on table tennis (Davison, 2000). These links can be used in different ways to calculate relevance values. We see two major approaches for computing relevance values based on the structure of links between pages; topic independent and topic specific.
Topic-independent approaches generate one set of rankings, and then use a subset of this to answer each query. This means that answering a question about pages that contain a specific search term boils down to looking up the ranking for the pages that contain that term. The challenge is to find a ranking order that says something for each and every topic, since the ranking algorithm does not care whether a search term is part of the page or not. The most prominent example of this approach is PageRank (Page et al., 1998).
Topic-specific approaches instead generate one (or more) set of rankings for each topic. The set of rankings is tailor-made to this specific topic by using either weighting of certain pages or by starting with smaller sets of pages around the pages containing the search term. While this can often leads to smaller solutions sets, the challenge is still to create sets of rankings for a sufficiently large number of topics because the computational cost of generating each topic-specific ranking is usually quite large. Typical examples of this approach are Topic-sensitive PageRank (Haveliwala, 2002) and HITS (Kleinberg, 1999).
We propose a new approach for generating topic-specific rankings. The main idea of our approach is to start with an initial set Θ of pages that are assumed to be relevant for the topic. For instance, Θ may be the set of all pages containing a specific keyword. Each page in Θ is initially assigned a certain relevance value (1 in our tests), whereas all other pages are assigned the relevance value 0. The relevance values are then propagated and decreased in a controlled fashion over the network of links at hand (be it a single site, the entire internet, or anything in between). This gives a relevance value for each page, which can be used to generate a ranking for these pages.
We have implemented a number of different algorithms based on our approach. Two of these will be described in detail and another in a cursory manner in this paper.
One thing that is important for all ranking algorithms is that they should preferably give roughly the same answer even if small perturbations such as missed links occur. The less susceptible to perturbations a given algorithm is, the more stable it is said to be. Our results show that our algorithms are very stable, yielding good results even when operating on small, site-specific data sets. We will show the stability using statistics from actual data. The more advanced version of the algorithm is also very fast and scalable as we will show.
Disadvantages of our approach are that it requires an initial set Θ of pages, and depends on a parameter that we denote by ξ. The former can be found by using, e.g. web directories such as Yahoo or by checking the content of a web page word by word (which is the method we used in our tests). The parameter ξ is an algorithm-dependent decay factor; our basic algorithm requires ξ to be a close approximation of the dominant eigenvalue of the underlying network while the advanced version works better and faster when larger values of ξ are chosen, i.e. four or five times larger than the dominant eigenvalue.
1.1 Layout of this paper
Section 2 contains background material and definitions that are essential for the technical details of (but not central to) this paper. Section 3 describes related work. The section following it, Section 4, describes our algorithms in detail. The behaviour patterns of each algorithm are discussed in Section 5. Section 6 contains empirical results from running the algorithms, as well as comparisons of these results with PageRank and Topic-sensitive PageRank. The last section, Section 7, contains a discussion and some concluding remarks.
2 Preliminaries
This section sets up notations and terminology that are required for the paper, even though it is not central to the work described herein.
2.1 Webs as graphs
Throughout this paper, we will identify a web with a graph (as far as its link structure is concerned). For this, let V be the set of web pages and E be the set of hyperlinks (and thus directed edges) between them. The pair (V,E) denotes the unweighted
The adjacency matrix of this graph is obtained as usual, based on an arbitrary but fixed ordering of the set V.
2.2 Nomenclature
We use the following naming conventions in order to minimise misunderstandings: Equation 1 Let A be a square matrix. The transpose of A is denoted A′. Its dominant eigenvalue is denoted by λ 1(A) and the second largest eigenvalue by λ 2(A), etc. The corresponding eigenvectors are denoted by π 1(A),π 2(A), etc. The spectrum of A is the set Λ(A)={λ 1(A), λ 2(A),…}.
Definition 2.1 (rating)
A rating function is a total functionρ:V→[0,1]. For each individual page i∈V, the value ρ(i) is called the rating of i.
As output, all algorithms in this paper yield rating functions.
Definition 2.2 (ranking)
A ranking is a partial order of pages according to a rating function, where pages with rating values larger than a certain cut-off value ɛ are ordered in decreasing order with respect to their rating.
Note that a ranking does not need to contain all elements in V, as only pages with large enough rating are included. We have used ɛ=10−6 when nothing else is said in this paper.
Definition 2.3 (ranking order)
A ranking order assigns to each page i the index σ i of that page within a ranking, where σ i =1 for the highest ranked page. If the range is restricted, then the ranking order is adapted accordingly, e.g. removing page b from {σ a =3,σ b =2,σ c =1} yields {σ a =2,σ c =1}.
Ties because of equal ratings have been handled by using the fixed ordering among the elements of the set V (Section 2.1).
2.3 Metrics
We use four different measurements when comparing the algorithms:
Definition 2.4 (n-value)
Given n=|Θ| pages that are initially assumed to be relevant for a topic, the n-value is the percentage of these appearing in the n top elements in the output ranking.
An n-value corresponds to both a recall and a precision value, and is usually called R-Prec when using exact rather than assumed number of relevant pages (Buckley and Voorhees, 2000).
Definition 2.5 (total value)
The total value is the percentage of the pages in the given input set that received a value larger than the cut-off ɛ after running an algorithm.
Definition 2.6 (Spearman footrule distance, SFD)
SFD (Spearman, 1904, 1906; Dwork et al., 2001) shows how different rankings are. This value is between 0 (identical ranking orders) and 1 (inverted or heavily permuted ranking orders) with 0.5 meaning totally random orders with no correlation. It is defined as follows: Given two ranking orders σ and τ with s elements in common: Equation 2
Definition 2.7 (order percentage)
Given two rankings, the order percentage is the probability that two pages that are consecutive in the first ranking will have the same relative order in the other, considering only pages that are common to both rankings.
2.4 Small test dataset
The first database used in our experiments consists of a subset of the pages available at: http://cs.umu.se/, the web site of the Department of Computing Science, Umeå University. This dataset was collected in January 2003.
The database contains 7,312 pages, of which 2,728 are HTML pages with outgoing links. A total of 4,486 of the pages contain text. There are a total of 22,970 hyperlinks, yielding an average of approximately 8.42 outgoing hyperlinks per HTML page and just over 3.14 incoming links per page. The corresponding unweighted adjacency matrix has a dominant eigenvalue of λ 1≈25.813242.
A total of 57,823 distinct stemmed words are present in the entire set of pages. These words are present in on average 9.87 pages, for a total of 570,870 page occurrences. Among these words, 33,854 appear in the same set of pages as another word. Thus, only 23,969 unique rankings per algorithm are required for the entire dataset.
2.5 Large test dataset
The dataset used in the assessment consists of all web pages found within the Nordic countries, i.e. Denmark, Finland, Iceland, Norway, and Sweden, by using a web spider. It was collected in January and February 2007. This dataset was chosen since these countries have had access to the internet for a long time, they contain a mix of old and new, academic and commercial, web servers, and provide fast access from our location.
This database contains 3,087,531 web addresses, of which 478,985 contain hyperlinks that point to another server. It contains 37,245,054 hyperlinks, of which 3,889,216 are non-local. The corresponding unweighted non-local adjacency matrix has a dominant eigenvalue of λ 1=49.135476.
All in all, 8,054,200 stemmed words of at least four characters appear 221,259,520 times in 727,757 of the pages, leading to an average of just over 304 unique words per page in that set.
2.6 Performance details
All running times reported throughout this paper have been measured while running the algorithms on a 1.7 GHz Pentium M laptop with 1.5 GB RAM. While this is a quite modest machine, given the newest machines available in the market (especially stationary machines with 64 bit processors), it will still give a rough estimate of the time required to run the algorithms.
3 Related works
Web link mining is the discovery of useful knowledge from the structure of hyperlinks. This link structure has been exploited in several techniques to say something about the importance of different web pages, such as in HITS (Kleinberg, 1999), Clever (Chakrabarti et al., 1998), PageRank and Google (Brin and Page, 1998), cGraph (Kubica et al., 2003), the Intelligent Surfer (Richardson and Domingos, 2001), as well as Rafiei and Mendelzon (2000) and Nie et al. (2006). The strength of using links in this way is that neighbouring web pages (when using hyperlinks to define distance) can be used to either deduce or corroborate information about a web page.
Almost all of these techniques use a (modified) connection matrix corresponding to the graph that they are working on. Each algorithm yields one or more eigenvector(s) corresponding to the eigenvalue(s) of the connection matrix. Such eigenvectors can be seen as a tuple of values that define a rating for the corresponding pages.
The algorithms, as well as their underlying methodologies differ considerably between various approaches within this domain. Most of the link mining systems available today are based on either PageRank or HITS. Both of these families of algorithms use a connection matrix, but in quite different ways.
The PageRank algorithms first performs normalisation on the matrix in order to get a zero-sum propagation of data when multiplying with a random vector whose values form a Markov chain (Brin and Page, 1998). This matrix has to be updated further, since cross-referrals would accumulate more and more of the value in each iteration. The way this is handled in PageRank is to simulate a jump to a random page with a certain probability, also known as the damping factor. This also fulfils the requirement on a Markov chain, that each page must have both incoming and outgoing links. The final answer appears when the changes between two iterations of the algorithm have become small enough, and corresponds intuitively to the probability that a random user would look at the corresponding web page. This is called the PageRank. We will also use a version of PageRank whose random jumps only lead to a starting set of pages, called Topic-sensitive PageRank (Haveliwala, 2002). Using 1−μ as the damping factor, δ(i) is the out-degree of node and n is the number of pages, we can define PageRank as: Equation 3 The HITS algorithms, on the other hand, yield not one value, but two; Authority – indicating how interesting the information on the page relative to the given query is, and Hub – indicating how good the pointers from the page are (Kleinberg, 1999).These numbers are obtained by multiplying the connection matrix with the Hub values – yielding the new set of Authority values – followed by a multiplication of the transpose of the connection matrix with the Authority values – resulting in a new set of Hub values – and finally a normalisation of both Authority and Hub values so that they remain within a reasonable interval. We will not compare our algorithms with the HITS algorithms in this paper, since our algorithms are more similar to the PageRank family.
For more information on ranking systems see Langville and Meyer (2005).
4 Propagation of topic relevance
The basic idea of the algorithms proposed in this paper is that each page is assigned a relevance value for each topic. This relevance value propagates along hyperlinks, while decreasing for each link travelled. This decrease is controlled using a parameter ξ, called the decay factor. Therefore, the method is called propagation of topic relevance (ProT). The method is quite similar to spreading activation (Pirolli et al., 1996; Crestani, 1997; Crestani and Lee, 1999; Aswath et al., 2005), but uses multiplicative rather than additive activation in each iteration.
To implement this idea, we use an iterative approach similar to the algorithms described in Section 3. Starting with an initial assignment of relevance values, iterated updates and normalisations are made until a fixed point is reached (or, more precisely, until the changes are smaller than a certain threshold that we call cut-off). Let Π j k be the relevance value for page i∈V in iteration k≥0. The initial values depend on the set of pages initially assumed to be on-topic, i.e. the set Θ: Equation 4 The ProT algorithm is then given by normalisation of: Equation 5 This is the same thing as using the wide-spread power method (Golub and Van Loan, 1996) on a matrix A^, consisting of a standard adjacency matrix A divided by ξ with the addition that diagonal elements corresponding to pages in the set Θ are set to 1 (and using the values in the diagonal as the starting vector). Each iteration computes the next answer vector Π 1, Π 2, etc. The final result is given after a suitably large number of iterations (i.e., lim k→∞ Π k =π 1(A^), but the algorithm usually converges quite quickly as long as normalisation is done after each iteration (see Section 4.1 and the more advanced version of the algorithm given in Section 4.4).
Notation 4.1
Let D Θ denote the diagonal matrix where the only non-zero positions are the diagonal elements corresponding to a member in Θ; these elements are set to 1.
Observation 4.2 (eigenvalues of a diagonal matrix). Given that |Θ|=k, then: Equation 6
4.1 Resulting values
Using equation (3) without normalisation leads to values Π i k much larger than 1, as soon as:
- there exists a path from one on-topic page to another, and/or; and
- the sum of predecessor values for a page i is larger than ξ.
Even though we are interested in relative (rather than absolute) values, we need to keep these numbers in check, since they will accumulate over each iteration performed. These values can otherwise become arbitrarily large until they no longer can be represented in floating point format (see also HITS (Kleinberg, 1999; Chakrabarti et al., 1998)). The easiest way to handle this is by using normalisation after applying equation (3) once to all pages (or one iteration of the power method, as the case might be). Normalisation will, moreover, make it easier to test if changes compared with previous results are below the cut-off.
This leads to a rather interesting question, namely which normalisation to use. Since we are using relevance values and we want to consider at least one page as relevant, it seems natural to assign full relevance (i.e., 100 percent) to the most relevant pages. This indicates that a normalisation using ‖ · ‖∞ = max( · ) should be used, leading to Algorithm 1:
Algorithm 1 (ProT)
Parameters: A,ξ,D Θ (as defined previously)
A^=A/ξ+D Θ
∀i∈V:Π i 0=A^ (i,i)
for k=1,2, …
Π k =A^Π k−1
Π k =Π k /‖Π k ‖∞
if ‖Π k −Π k−1‖<ɛ, stop
end for
This algorithm will converge, as long as the two largest eigenvalues of A^ are not equal (Golub and Van Loan, 1996).
4.2 Expected results
The intuitive view is that the results given by the algorithm should correspond to two authoritative sources: Θ and the links of the web. If an appropriate set Θ was chosen, all pages that belong to Θ should appear fairly early in the resulting rankings. Moreover, pages that are pointed to from many of the pages in Θ should be given a high relevance value as well. If ξ is increased, this should intuitively strengthen the relative effect of “+D Θ”.
There is always the trade-off between these mutually conflicting expectations. As will be seen below, we can tune the tendencies towards the different expectations by setting the decay factor ξ. The larger ξ, the closer the results go towards pages in Θ, and vice versa.
4.3 The size of ξ
There is a rather complex relationship between the parameter ξ and the behaviour of ProT, as we will show. In general:
- If ξ is too small (e.g., zero) then the primary eigenvector of the original matrix is given, regardless of Θ. This often leads to the situation where none of the pages in Θ are given as a result for a search term.
- Increasing ξ strengthens certain eigenvectors of A^. While this is a desired effect, it may lead to very slow convergence if ξ gets too large.
Let us explain this in more detail. While the eigenvalues of a matrix are given by the matrix, their behaviours are complex when the matrix is changed. Increasing multiple values in the diagonal might result in an increase of multiple eigenvalues, thereby inhibiting the convergence of ProT (Conjecture A2).
Values added to diagonal positions in the adjacency matrix that do not belong to the eigenvectors of the non-zero eigenvalues results in the creation of an eigenvector with this element as its only member. The corresponding eigenvalue is equal to the value added in the diagonal, see Theorem A1.
All other diagonal additions result in the increase of at least one eigenvalue, and possibly shifting (and resizing) of others so that all eigenvalues will continue to be linearly independent of each other.
4.3.1 Using too small ξ
Let B be the sum of D Θ multiplied by ξ and the adjacency matrix A: Equation 7 B is after normalisation identical in all respects (except the absolute size of the eigenvalues) to the matrix A^ used elsewhere in this paper, except at ξ=0 or ξ=∞.
It is obvious that λ 1(B)=λ 1(A) when ξ=0. What happens when ξ is increased is more complex. It depends on both the layout of the web and the number as well as the position of elements in Θ.
However, if one or more of the pages in Θ coincide with the non-zero elements of the dominant eigenvector π 1(B), enlarging ξ will strengthen the corresponding eigenvalue (while shifting the eigenvectors slightly toward these pages). Unless another eigenvector gets an even stronger boost from multiple pages in Θ, this will continue to be the strongest eigenvector.
On the other hand, if some other eigenvector is strengthened to the point where its eigenvalue is equal to the (possibly increased) dominant eigenvalue of A, they will compete for position as the strongest eigenvalue. We call the ξ where this happen the peak value, since it corresponds to a distinct local maximum in the number of iterations required to reach a stable value (Figure 3(b) in Section 5.1 for typical examples from the test database). The exact value required to reach the peak value is somewhere in (0,λ 1(A)], depending on A and Θ. We are currently not able to predict exactly where the peak values are.
A typical behaviour of ProT when given too small values for ξ can be seen in Figure 1. When ξ is smaller than the peak value, the dominant eigenvector of the connection matrix is given. The strengthened eigenvalue competes with the dominant eigenvalue of the original connection matrix, and then takes over (15.2≤ξ≤19.1). Multiple eigenvalues are increased at the same time, resulting in slower convergence, as can be seen in right half of Figure 1 when ξ is increased above 19.1.
4.3.2 Using too large ξ
The situation is quite different when ξ is large and |Θ|>1. Multiple eigenvectors will be affected by the values in the diagonal, indicating that it will be very hard to find the correct eigenvectors using the power method. This leads us to Observation 4.3.
Observation 4.3 (Too large ξ). Using a very large ξ in ProT will prohibit convergence of the power method.
Let the matrix A^ ξ,Θ be the sum of the adjacency matrix A divided by ξ and D Θ. This gives, per definition, that: Equation 8 This means that the |Θ| largest eigenvalues in the spectrum of A^ ξ,Θ are equal to 1. The convergence rate of the power method is linear to |λ 2/λ 1| (Golub and Van Loan, 1996) and here we have |λ 2(A^ ∞,Θ)/λ 1(A^ ∞,Θ)|=1 leading to no convergence at all.
One observation that can be made here is that this is not true when |Θ|=1, since there is only one non-zero eigenvalue when ξ→∞. This leads us to the version of our algorithm as described in Section 4.4.
4.3.3 Selection of ξ
A moderate solution is to use ξ=⌊λ 1(A)+1⌋. This is large enough to be over the peak value, since the peak value appears in (0,λ 1(A)]. This value is still not so high that slow convergence is a problem for smaller web sites. Examples of this can be seen in Figure 1.
4.4 Superpositioned singleton propagation of topic relevance (S2ProT)
A further development of the ProT algorithm using the same general idea but a slightly different approach is the S2ProT algorithm. Instead of trying to generate the entire eigenvector at once, it creates one vector for each page in Θ and then performs additive superpositioning of these followed by normalisation, thus resulting in the vector that yields the returned rating. The rationale for this is that even though many different calculations need to be performed, this is offset by much faster convergence for each subproblem and reuse of vectors whenever a page is on-topic for more than one topic.
The reason for the fast propagation is that each such vector calculation can be viewed as a propagation with decreasing strength, i.e. a topological ordering with minor changes because of back links.
Definition 4.4 (singleton matrix)
Let S (i) be the V×V matrix with a single non-zero value equal to 1 in the diagonal. This corresponds to a self-reference of the page i∈V. Such a matrix is called a singleton matrix and satisfies, Λ(S (i))={1, 0,…,0}:
Algorithm 2 (S2ProT)
Parameters: A,ξ,D Θ (as defined previously)
∀i∈Θ:rating i = ProT(A,ξ,S (i))
rating=∑ i∈Θ rating i
return rating/‖rating‖∞
Note that when computing ratings for sets Θ1,…,Θ k , we only need to compute rating i once, for each i∈Y 1≤j≤k Θ j . In practise, this will be very useful as it allows one to consider a large collection of topics. Moreover, S2ProT has excellent convergence rate when ξ>λ 1, proportional to ξ/λ 1, and will finish in log(ɛ)/(log(λ 1(A))−log(ξ)) iterations or less (Theorems A3 and A4, respectively).
This means that each individual page rating rating
i
can be calculated in just a few iterations. PageRank (using μ=0.85
Almost as important is the fact that we do not need to concern ourselves with pages further away from a starting node than the maximum number of iterations required. This means that we can speed up S2ProT very effectively by increasing ξ, leading to calculations on a small subset of the original web with only a few iterations required. The downside of using large decay factors is that fewer pages not in Θ will be included in the answer, and they will be given later in the ranking order.
5 Comparison of algorithm behaviours
In this section, we describe the general behaviour of our algorithms when applied to our small test database, using the well-known Topic-sensitive PageRank as a basis of comparison. We look at the number of non-zero elements in the rating functions, scalability, execution times, and similarities between our algorithms and Topic-sensitive PageRank.
Our algorithms have very different behaviours regarding how they are affected by the input data size (handled below) and their different decay factors (Figure 2). The number of non-zero elements in the rating functions is important when assessing relevance; the precision (equation (7) in Section 6) goes down drastically when too many pages are given in the results.
ProT. The basic ProT algorithm yields reasonable results, but has a number of disadvantages when it comes to efficiency. For a large web, the number of iterations required to reach a stable state is typically in the same range as for Topic-sensitive PageRank. Thus, ProT does not scale well. Choosing an appropriate decay factor is imperative; too small and the result is misleading at best, and too large and no answer will be given in due time (if ever). The number of non-zero elements in the resulting vectors depends on the decay factor, as can be seen in Figure 2(a). The rather complex behaviour of the basic algorithm with respect to the decay factor can be seen in Figure 3(b).
S2ProT. This version scales well, and yields reasonable results as long as the decay factor is high enough, i.e. larger than the dominant eigenvalue of the adjacency matrix. The larger the decay factor, the smaller the resulting set and the faster the convergence, as can be seen in Figures 2(a) and 3(c).
Topic-sensitive PageRank. We implemented Topic-sensitive PageRank (Haveliwala, 2002) and tested it on the same data as ProT and S2ProT. It turns out that this algorithm has roughly the same scalability as ProT. The number of non-zero elements in the resulting vector is very large when using the recommended values of μ (0.7-0.85). Moreover, it requires a lot of iterations to reach a stable state, as can be seen in Figure 3(a).
Applying the approach of Jeh and Widom (2002, 2003) would require fewer iterations, but still much more than for S2ProT. The main problem with their approach is that the choice of hub nodes to use in the calculation is critical.
5.1 Execution time
Using the small test database, we found the execution time of the various algorithms (Table I and Figure 3). Each individual word available within any of these pages constitutes its own query, e.g. the word “ladok” exists on 23 pages and these 23 pages belong to the Θ of the query for “ladok”. The cut-off for termination was set to ɛ=10−6 for all algorithms. Table I shows the total number of iterations and the total CPU time required to compute all queries.
5.1.1 Topic-sensitive PageRank
A total of 115 M iterations were required for an average of just under 4,700 iterations per search term using μ=0.85. The maximum required number of iterations for a single query is just over 4 M. Calculating the answer set for all queries amounts to approximately 7.5 d on the test computer with only minimal optimisation done on the code. A typical behaviour of this algorithm with respect to the damping factor can be seen in Figure 3(a).
5.1.2 ProT
A total of 557 M iterations were required for an average of approximately 23,200 iterations per search term with ξ=26. The maximum required number of iterations for a single query is just over 7 M. Calculating the answer set for all queries amounts to 10 d on the test computer with only minimal optimisation done on the code. The behaviour of the basic algorithm depends heavily on the decay factor (including the unstable peaks at 15-20), as can be seen in Figure 3(b).
5.1.3 S2ProT
A total of almost 88 k iterations were used to create the complete set of individual page ratings for the 4,486 pages with text when the decay factor was set to 26. This amounts to an average of 19.6 iterations per page, with a maximum of 1,586 iterations for one page. This corresponds to a total of 1.52 iterations/search term, since already computed page ratings can be reused for other queries, as discussed earlier.
Calculating the answer vectors for all questions takes a total of 158 s of user time on the test computer, including superpositioning and all overheads and with no optimisation made.
Figure 3(c) shows the typical behaviour pattern for this algorithm with respect to the decay factor. Please note the peak values when the decay factor is between 10 and 20, i.e. just under λ 1.
Figure 4 shows that the number of iterations is well below the upper bound as given in Section 4.4, specifically Theorem A4.
5.2 Summary
In this section, we compared the execution behaviour of ProT and S2ProT to that of Topic-sensitive PageRank. Specifically:
- In the beginning of Section 5, we showed that the number of non-zero elements in the resulting vector depends on μ (for Topic-sensitive PageRank) or ξ (for ProT/S2ProT). We also discussed the scalability of each algorithm in terms of CPU usage, with S2ProT being the most scalable of the three.
- In Section 5.1 we showed that the ProT algorithm is roughly comparable to Topic-sensitive PageRank when it comes to execution speed, while S2ProT is much faster. The practical rate of convergence for S2ProT was also shown to be well below the theoretical upper bound for our test database.
This means that S2ProT will yield results in a timely fashion, making it suitable even for very large web systems. This is especially true for query systems with many search terms, since page ratings can be reused for all terms for which this page occurs as a starting page, i.e. is in Θ.
6 Empirical results
In the previous section, we evaluated the general behaviour of our algorithms compared to Topic-sensitive PageRank for a small (but real) example database. In this section, we evaluate the quality of the resulting ratings and the stability of our algorithms compared to PageRank and Topic-sensitive PageRank, based on perceived relevance and stability.
This is the second study of the ProT algorithm. The first study was done on a single web server and was reported in a resent paper (Ågren, 2006), while the one reported here used all web pages our web spider was able to find in the Nordic countries (Sections 2.4 and 2.5, respectively).
6.1 Assessment
The relevance assessment was done on the large dataset (Section 2.5). In particular, the web considered in this assessment did not consist of pages mainly taken from the academic environment as in Ågren (2006), but covered commercial, private, and public service sites as well. The three algorithms that were considered in this assessment were PageRank, Topic-sensitive PageRank and S2ProT.
6.1.1 Relevance
One of the most important factors of a search engine is how relevant the resulting sets of pages are, especially the pages given early. The two measurements typically used to judge web search and Information Retrieval systems are called precision and recall, as defined in equations (7) and (8). These cannot always be computed, since the set of relevant items is not always given in such a system. High precision indicates that most of the items retrieved are relevant, while high recall indicates that most of the available relevant records in the database have been retrieved: Equation 9 Equation 10 In the assessment, we will instead use perceived relevance, i.e. how relevant and appropriate each page was for the given search term, as judged by a group of people making individual assessments.
6.1.2 Choice of search terms
The search terms were chosen according to the following selection criteria:
- at least 20 pages had to contain the search term;
- the tenth result given by each algorithm had to be above the base level ((1−μ)/n for PageRank, etc.);
- at least three different rating values had to exist in each top ten list;
- the search terms should be spread over a wide range of topics; and
- there should be no risk that the search terms could be considered offensive by the participants.
The reason behind the first criterion was to get enough variability. The next two criteria ensured that the rankings from the algorithms would determine the result, rather than the alphabetical order. The last two criteria ensured that the assessment would be as fair and free of bias as possible.
There was a very large set of possible search terms in this experiment, so the final choice came down to whether a search term was well defined and had a clear cut-off around the 10th or 11th element in the top lists of Topic-sensitive PageRank and S2ProT. The latter was used to assure that the alphabetical order that we used as arbitration between pages with equal ranking would not affect the result one way or the other. This resulted in a list of 85 search terms. The complete list is given in Appendix 2.
6.1.3 Setup of the assessment
The basic setup of the assessment was that participants assessed the subjective relevance of each given page with respect to each search term. These pages were given in alphabetical order without any marking as to what algorithm or which algorithms yielded the page in the top list. There were five different grades among which the test subjects were asked to choose the most appropriate one for each page:
- The first grade indicated that the reviewer was unable to say anything about the relevance of the page regarding the current keyword. These assessments were ignored in all calculations.
- The second grade corresponded to a complete lack of relevance.
- The third grade was indicative of some relevance.
- The fourth grade indicated a moderate amount of relevance.
- The fifth and final grade indicated that the page was very relevant, corresponding to 100 percent relevance.
The second, third, fourth, and fifth grades were assigned the numerical values 0, 0.5, 0.8, and 1, respectively. This scaling was used to avoid an underestimation of the true relevance of the pages, since we were using terms for the grades that intuitively should have had a higher relevance, e.g. “relevant” for the fourth grade (was 2/3 in Ågren (2006) compared to 0.8 in this paper). The relevance numbers were then disseminated back to each algorithm by averaging the grades that were given by the graders. This leads to a system where the subjective perceived relevance values can be computed for the results from each algorithm.
6.1.4 Results
A total of 587 valid assessments (i.e., with at least one grading being from grades two to four) resulted in between 577 and 581 answer sets per algorithm or approximately 7.6 answer sets per search term
The resulting values were then compared using ANOVA tests. The results showed that he S2ProT algorithm gives better relevance results than Topic-sensitive PageRank with statistical significance (p<0.05), that in turn had higher relevance than PageRank with very high statistical significance (p<0.001). The average values for the entire test as well as the 95 percent confidence intervals can be seen in Figure 6, and the ANOVA answers are given in Table II.
6.2 Stability
Another really important aspect of a web search engine is that the results should not change too much even if small changes are applied to the underlying web. The users are expecting valid results even if some information is not available to the search engine when generating answer vectors. Since all of the algorithms that we talk about in this paper work on graphs and most of them use a set Θ of starting pages, we can find two different ways in which we can disturb the system in order to check its stability:
- removal of pages from Θ (thus affecting the starting sets per topic); and
- removal of links (affecting the graph). We will show empirical results from both types of perturbations in this section.
6.2.1 Missing pages in Θ
Using our small test data set, it was possible to remove a number of pages and still get a valid number of remaining pages (between 20 and 30 percent of the original Θ removed) for 16,681 of the 57,823 words.
By running our algorithms as well as Topic-sensitive PageRank on the diminished dataset, we found some interesting measurements when we compared the results of the different algorithms. In each case, we compared the results from using the diminished dataset with the original input dataset, given in n-value and total value (Section 2.3).
The results from applying each algorithm to the diminished data sets can be seen in Table III and (for S2ProT) Figure 7.
Both the n-value and the total value are quite high for Topic-sensitive PageRank. The total value in particular indicates that the results are relevant even when some pages have not yet been indexed.
ProT does not have as good an n-value as Topic-sensitive PageRank. It does, however, have a better total value even though it on average returns only 594.5 pages per search term while Topic-sensitive PageRank returns on average 2,024.53 pages (a factor of over 3.4 times as many pages for Topic-sensitive PageRank).
The precision (and recall) over the first n values of S2ProT are much higher than for the other algorithms. The total-value recall is only slightly lower than Topic-sensitive PageRank and ProT. It does, however, depend heavily on the chosen decay factor, as is clearly visible in Figure 7. Using a larger ξ means faster results but fewer pages in the resulting sets.
6.2.2 Link removal
One of the major concerns for some of the other approaches to searching has been their partial lack of stability if links are missing (Ng et al., 2001a, 2001b).
Our approach is very stable: the resulting ranking order is almost identical even when 10 percent of the links have been dropped at random. In test databases with between 5,000 and 15,000 web pages, we have found an average SFD of less than 0.1 when 10 percent of the links were dropped. That means that our algorithms are in the same order of stability as Topic-sensitive PageRank as far as missing links are concerned. Table IV shows the results from applying each algorithm to the small test database, and the stability is shown when just over one in ten (10.2 percent) of the links have been removed randomly.
It is unfortunately not that easy to compare the results from these algorithms, since their stability varies considerably depending on how it is measured. The SFD indicates that ProT is the most stable algorithm, followed by Topic-sensitive PageRank, and, finally, S2ProT. Order percentage on the other hand gives a different picture, according to which S2ProT is the most stable algorithm, followed by ProT, and, finally, Topic-sensitive PageRank.
We have tested the link removal stability of each algorithm for ten of our queries in the two test databases. The total scatter plots as well as least squares best fit correlations for 320 different link removals are shown in Figures 8 and 9, respectively. Even though all of the algorithms are somewhat sensitive to link removals, both Topic-sensitive PageRank and S2ProT give average ranking orders, in the form of Spearman Footrule Distance, lower than the removal rate for these datasets. S2ProT increases in stability as the number of pages in Θ are increased.
6.3 Summary
In this section, we compared the perceived relevance and stability using actual queries on actual datasets. Our results show that our algorithms compare well with algorithms in use today. Specifically:
- In Section 6.1, we showed that the results given by S2ProT are good on a typical larger dataset, by comparing it to PageRank and Topic-sensitive PageRank.
- The stability of our algorithms was studied in Section 6.2, specifically:
- In Section 6.2.1, we showed that our algorithms are stable when Θ is diminished. This indicates that our algorithms can be used even if some pages have not yet been indexed or have been incorrectly indexed.
- In Section 6.2.2, we showed that our algorithms give more or less the same result even if some of the links are removed from the datasets. This indicates that our algorithms are useful even if some of the links are missing from the database or some of the pages have not yet been traversed.
This means that our algorithms will yield good results even when applied to datasets containing small errors and omissions. As can be expected, the more errors in the input, the less good the results will be.
7 Discussion
7.1 Efficient implementation
Using a value of one in the starting vector at the positions corresponding to the elements of Θ ensures that the resulting vector will always be non-zero, as this value is always non-zero in the dominant eigenvector. Moreover, it is for S2ProT usually very close to the direction of the dominant eigenvector, thus leading to even faster convergence.
Another way of speeding up the calculation of ProT or S2ProT is to keep a list of currently reached pages R⊂V, i.e. i∈R in iteration k indicates that Π i k >0. This means that only the important pages are used in the calculations, reducing the number of elements to calculate to a bare minimum. The downside of doing this is that a stride of larger than 1 will be used when going through the database, thus increasing the likelihood of cache misses and possibly even page faults for large databases. This can be diminished by using a storage order that as much as possible is derived from the link structure. This is, however, not trivial to implement since most webs have a lot of circular structures and back links.
Using S2ProT on a parallel computer or in a distributed environment is straight forward. The individual page ratings can be calculated at the same time, since they do not depend on each other. The precalculated topic vectors can also be stored efficiently, as was shown in Richardson and Domingos (2001).
7.2 Hybrid S2ProT
Another version that we have tried is what was called the hybrid superpositioned singleton propagation of topic relevance (HyS2ProT) in Ågren (2006). It uses singletons as in S2ProT (Section 4.4), but each outgoing value is further decreased with the number of outgoing links (i.e., the out-degree of a page) in the same manner as in PageRank (hence the name). It turns out to be slightly more stable than S2ProT when missing links are concerned and works with a matrix having a fixed dominant eigenvalue of less than or equal to 1, but is otherwise inferior to S2ProT.
7.3 Future work
We are currently developing a complete search system based on our algorithms. The system is supposed to handle everything from the initial retrieval of the data to the actual searching, using a user-friendly web interface.
A number of open questions remain:
- The exact decay value required to overcome the dominant eigenvector of the original matrix when using ProT depends on both the underlying matrix and the set Θ used (how many pages as well as their position in the dataset). Further study regarding the most suitable choice of ξ would be interesting.
- The results and stability of ProT are promising, but using power iteration is often too CPU intensive on even moderately large webs to be practically usable. Better methods could be explored, such as Arnoldi iteration (Scott, 1995). This problem does not exist when using S2ProT, as can be seen in Section 4.4.
- Our algorithms should be compared to some new development using spatial link structures, such as Chakrabarti (2007).
- It is possible to have the same relevance value for many pages in the resulting set when running these algorithms, thus resulting only in a partial order among the pages. Some possible ways to discriminate among these pages could be to use weighing according to the term frequency (Salton and Yang, 1973) or the corresponding eigenvalue when carrying out the superpositioning. This was not a major issue in our datasets, so this has not been pursued further at this time.
7.4 Conclusions
We have shown that topic-specific answer sets can be generated very quickly using S2ProT (Section 4.4 and Section 5) and that the results are both relevant and stable (Section 6). This indicates that our algorithms are very useful for generating relevance values and rankings for web pages in a topic-sensitive manner.
Equation 1
Equation 2
Equation 3
Equation 4
Equation 5
Equation 6
Equation 7
Equation 8
Equation 9
Equation 10
Equation 11
Equation 12
Equation 13
Equation 14
Equation 15
Equation 16
Figure 1The relationship between decay factor and minimum number of iterations required to reach a stable result for ProT. The vertical line corresponds to the dominant eigenvector of the adjacency matrix
Figure 2Typical relationships between input parameters (ξ and μ, respectively) and number of non-zero elements in the resulting vectors using actual data. Topic-sensitive PageRank added for comparison
Figure 3Typical relationships between input parameters (u and ξ, respectively) and total number of iterations required to reach a stable state for each algorithm
Figure 4The relationship between ξ /λ
1(A) and the number of iterations required to reach a stable state, looking at both practical values as well as upper bound (as given in Theorem A4) using a cut-off of 10−6
Figure 5Preference of relevance per search term using pair-wise algorithm comparisons
Figure 6The average 95 percent confidence intervals for each algorithm for the assessment
Figure 7Recall values for the top n results and in all values given from the S2ProT algorithm using the diminished sets with regard to the original Θ sets. S2ProT values are not valid unless the decay factor is larger than the dominant eigenvalue of the matrix, in this case ≈25.813242
Figure 8Stability of some of the mentioned algorithms on the smaller test database when removing a certain amount of links, given as average Spearman Footrule Distance from ten search terms with various removals as well as least-squares best fit of a×x
b
for each algorithm
Figure 9Stability of some of the mentioned algorithms on the larger test database when removing a certain amount of links, given as average Spearman Footrule Distance from ten search terms with various removals as well as least-squares best fit of a×x
b
for each algorithm
Table IActual number of iterations and CPU time required for our small database per algorithm
Table IIANOVA test results from the assessment
Table IIIRecall values for various algorithms using a diminished data set as basis
Table IV
References
Ågren, O. (2006), "Assessment of WWW-based ranking systems for smaller web sites", INFOCOMP Journal of Computer Science, Vol. 5 No.2, pp.45-55.
Aswath, D., Ahmed, S.T., D'cunha, J., Davulcu, H. (2005), "Boosting item keyword search with spreading activation", in Skowron, A., Agrawal, R., Luck, M., Yamaguchi, T., Morizet-Mahoudeaux, P., Liu, J., Zhong, N. (Eds),Web Intelligence, IEEE Computer Society, Washington, DC, pp.704-7.
Berman, A., Plemmons, R.J. (1994), Nonnegative Matrices in the Mathematical Sciences, SIAM, Philadelphia, PA, .
Brin, S., Page, L. (1998), "The anatomy of a large-scale hypertextual web search engine", Computer Networks and ISDN Systems, Vol. 30 No.1/7, pp.107-17.
Buckley, C., Voorhees, E.M. (2000), "Evaluating evaluation measure stability", SIGIR '00: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, New York, NY, pp.33-40.
Chakrabarti, S. (2007), "Dynamic personalized pagerank in entity-relation graphs", WWW '07: Proceedings of the 16th International Conference on World Wide Web, ACM Press, New York, NY, pp.571-80.
Chakrabarti, S., Dom, B.E., Indyk, P. (1998), "Enhanced hypertext categorization using hyperlinks", in Haas, L.M., Tiwary, A. (Eds),Proceedings of SIGMOD-98, ACM International Conference on Management of Data, ACM Press, Seattle, WA, pp.307-18.
Crestani, F. (1997), "Application of spreading activation techniques in information retrieval", Artificial Intelligence Review, Vol. 11 No.6, pp.453-82.
Crestani, F., Lee, P.L. (1999), "WebSCSA: web search by constrained spreading activation", IEEE Forum on Research and Technology Advances in Digital Libraries (ADL '99), Vol. 99 pp.163-70.
Cvetković, D., Rowlinson, P., Simić, S. (1997), Eigenspaces of Graphs, Cambridge University Press, Cambridge, .
Davison, B.D. (2000), "Topical locality in the web", Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, Athens, pp.272-9.
Dwork, C., Kumar, R., Naor, M., Sivakumar, D. (2001), "Rank aggregation methods for the web", WWW '01: Proceedings of the 10th International Conference on World Wide Web, ACM Press, New York, NY, pp.613-22.
Golub, G.H., Van Loan, C.F. (1996), Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, .
Haveliwala, T.H. (2002), "Topic-sensitive PageRank", Proceedings of the 11th International Conference on World Wide Web, ACM Press, New York, NY, pp.517-26.
Jeh, G., Widom, J. (2002), Stanford University Database Group, Stanford, CA, Scaling Personalized Web Search, Technical Report 2002-12, .
Jeh, G., Widom, J. (2003), "Scaling personalized web search", WWW '03: Proceedings of the 12th International Conference on World Wide Web, ACM Press, New York, NY, pp.271-9.
Kleinberg, J.M. (1999), "Authoritative sources in a hyperlinked environment", Journal of the Association for Computing Machinery, Vol. 46 No.5, pp.604-32.
Kubica, J., Moore, A., Cohn, D., Schneider, J. (2003), "cGraph: a fast graph-based method for link analysis and queries", Proceedings of the 2003 IJCAI Text-Mining & Link-Analysis Workshop, Acapulco, Mexico, .
Langville, A.N., Meyer, C.D. (2005), "A survey of eigenvector methods of web information retrieval", The SIAM Review, Vol. 47 No.1, .
Minc, H. (1988), Nonnegative Matrices, John Wiley and Sons, New York, NY, .
Ng, A.Y., Zheng, A.X., Jordan, M. (2001a), "Link analysis, eigenvectors, and stability", Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI-01), .
Ng, A.Y., Zheng, A.X., Jordan, M. (2001b), "Stable algorithms for link analysis", Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, New York, NY, .
Nie, L., Davison, B.D., Qi, X. (2006), "Topical link analysis for web search", in Efthimiadis, E.N., Dumais, S.T., Hawking, D., Järvelin, K. (Eds),SIGIR, ACM Press, New York, NY, pp.91-8.
Page, L., Brin, S., Motwani, R., Winograd, T. (1998), "The PageRank citation ranking: bringing order to the web", technical report, Stanford Digital Library Technologies Project, .
Pirolli, P., Pitkow, J.E., Rao, R. (1996), "Silk from a sow's ear: extracting usable structures from the web", CHI, pp.118-25.
Rafiei, D., Mendelzon, A.O. (2000), "What is this page known for? Computing web page reputations", Computer Networks, Vol. 33 No.1/6, pp.823-35.
Richardson, M., Domingos, P. (2001), "The intelligent surfer: probabilistic combination of link and content information in PageRank", in Dietterich, T.G., Becker, S., Ghahramani, Z. (Eds),Neural Information Processing Systems, MIT Press, Cambridge, MA, pp.1441-8.
Salton, G., Yang, C.S. (1973), "On the specification of term values in automatic indexing", Journal of Documentation, Vol. 29 No.4, pp.351-72.
Scott, J.A. (1995), "An Arnoldi code for computing selected eigenvalues of sparse, real, unsymmetric matrices", ACM Transactions on Mathematical Software, Vol. 21 No.4, pp.432-75.
Spearman, C. (1904), "The proof and measurement of association between two things", American Journal of Psychology, Vol. 15 pp.72-101.
Spearman, C. (1906), "‘Footrule’ for measuring correlations", British Journal of Psychology, Vol. 2 pp.89-108.
Appendix 1. Theorems and proofs
An eigenvector corresponding to a non-zero eigenvalue is called a proper eigenvector.
Adding a value to a diagonal element that does not belong to a proper eigenvector of an adjacency matrix with zero diagonal results in the creation of a new eigenvector that contains at least that element. The corresponding eigenvalue will be as large as the value added to the diagonal.
Proof
That an element a belongs to a cycle within the matrix (i.e., there is a nontrivial path from a back to a) or on a direct path from such a cycle implies that a belongs to a proper eigenvector, and vice versa (Berman and Plemmons, 1994, Chapter 2, Theorem 3.20).
An element a that does not belong to a proper eigenvector has zero value for all non-zero eigenvectors of the matrix. Adding the diagonal element creates a new local cycle that contains at least a.
None of the other eigenvalues are affected and the sum of the eigenvalues is equal to the trace of the matrix, indicating that the eigenvalue corresponding to the newly added eigenvector is equal to the value added in the diagonal.
Adding a value to a diagonal element that belongs to a proper eigenvector of an adjacency matrix with zero diagonal results in the increase of the nonnegative eigenvalue of that eigenvector of at least the added value.
Proof
There is no connection between components of a graph in the corresponding matrix and updates of the diagonal of the matrix do not change this, so we can consider only this component without loss of generality.
The result follows from the fact that only one eigenvalue can be increased by the addition and that ∑ i∈n λ i (A)=tr(A). The corresponding eigenvector is nonnegative (Minc, 1988, Chapter 1, Theorem 4.2) and this is the eigenvector/eigenvalue pair found when using the power method, if the starting vector is nonnegative and non-zero.
Adding multiple values in the diagonal of a nonnegative matrix A increases the eigenvalue of the corresponding component (or nonnegative eigenvector of that principal submatrix (Cvetković et al., 1997, Theorem 2.1.5)) that each element belongs to.
Worst case scenario is that multiple eigenvalues are increased to the same size. This yields a spectrum with dominating real eigenvalues of multiplicity higher than 1, with corresponding nonnegative eigenvectors. The result of this is that the power method will not converge at all. This situation is fortunately not that common, but should not be ignored when using ProT.
The rate of convergence of S2ProT depends on ξ/λ 1 (as long as ξ>λ 1). The smaller this number the faster the convergence.
Proof. Let A¯ be the adjacency matrix A (with zeroes in the diagonal) divided by ξ>λ 1. This matrix has a spectrum of: Equation 11 Let S (i) be a singleton matrix with a spectrum of Λ(S (i))={1,0,…,0}.
The composition of these two matrices results in the matrix A^ (i)=A¯+S (i), which is formed by ProT for each call from the definition of S2ProT. There are two distinct possibilities for this matrix:
- If the page i is not found in any of the proper eigenvectors of A (and thus A¯), then Λ(A^ (i))={1,λ 1(A)/ξ,…} as per Theorem A1.
- In all other cases, λ 1(A^ (i))>1 and λ 2(A^ (i))≤λ 1(A)/ξ, since the addition of S (i) will increase the strength of one of the eigenvalues of A that it coincides with (Theorem A2). All other eigenvalues of this component might be adjusted (and possibly diminished) to still be linearly independent of this new dominant eigenvector.
Recall that the rate of convergence of the power method depends on the relative size of the two largest eigenvalues, i.e. the smaller |λ 2/λ 1|, the faster the convergence (Golub and Van Loan, 1996). For A^ (i) this corresponds to at most (λ 1(A)/ξ)/1=λ 1(A)/ξ. This factor corresponds to the relative diminishing effect of applying each power iteration on all eigenvalues except λ 1(A^ (i)). Thus, the larger ξ is, the faster the convergence.
Given a page i∈V, a starting vector that is not orthogonal to the dominant eigenvector of the matrix π 1(A^) (formed by the adjacency matrix A using A^=A/ξ+S (i)), and a maximum cut-off of ɛ, ProT yields a stable result within I max =log(ɛ)/(log(λ 1(A))−log(ξ)) iterations.
Proof
The relative strength of an eigenvector π i (A^) (where i>1) will be decreased with a factor of λ i (A^)/λ 1(A^) relative to the strength of π 1(A^) in each iteration of the power method. After k iterations this means that the relative strength of π i (A^) is: Equation 12 We are interested in the number of iterations required to let the relative strength of π 2(A^) drops below the cut-off (ɛ), and this happens when: Equation 13 We have already established that (λ 2(A^)/λ 1(A^))≤(λ 1(A)/ξ), and this together with equation (A2) yields: Equation 14 Equation 15 Equation 16
Q.E.D.
Appendix 2. Search terms for the assessment
Search terms that received no assessments are given in italic.
Danish: Boghandler, musikvidenskab, rigsarkivet, uddannelse, and undervisningsministeriet.
English: Helicopter, infection, jubo
[4]
, railway and station, swegrid, table and tennis, and trout and fish.
Finnish: Aineisto, elämä, ihminen, jakelu, korkeakoulu, osaaminen, tietoa, tieto-kone, toimitusjohtaja, and vastuu.
Icelandic: Áhersla, bókasafn, bókmenntir, greinar, háskóla, heilbrigði, kennara, landafraeði, pappír, smiði, stærðfraeði, and tölva.
Nordish: Billedkunst, Kaspersen, kulturstudier, science and fiction, Sturlasson, and universitet.
Norwegian: Akershus, datamaskin, fartsprøve, forbrukerombudet, karrieresenter, kringkastingssjef, and nasjonalbiblioteket.
Swedish: Ågren, ansökningshandlingar, arkitekturmuseet, arkivcentrum, centrumbildningar, civilingenjör, datatermgruppen, datavetenskap, doktorandhandbok, energimyndigheten, fakultetsnämnden, forskningsdatabas, forskningsområden, genusforskning, geovetenskap, grundskolor, hemvärnet, humlab, idéskolor, idrottshögskolan, informationsteknik, källkritik, kammarkollegiet, lammstek, styrmekanism, kommunikationsteknik, länsbiblioteket, lärarutbildning, marklära, matematikdidaktik, miljöanalys, minoritetsspråk, naturvetenskaplig, överprövning, punktskriftsböcker, rymdfysik, sökmotoroptimering, and språkverkstaden.
Corresponding author
Ola Ågren can be contacted at: ola.agren@cs.umu.se