What is the most effective algorithm for ranking web pages in search engines?
2.1 First-generation search engines based on word frequency statistics and word position weighting

Ranking documents by the frequency and position of keywords within them was the earliest core idea of search engines, and its technology is the most mature. It was the dominant ranking technique of first-generation search engines, was widely adopted, and remains a core ranking component of many engines today. The basic principle is that the more frequently a keyword appears in a document and the more prominent its positions, the more relevant the document is judged to be to the search term.

1) Word frequency statistics

The word frequency of a document refers to how often the query keywords appear in it. The more frequently a query keyword appears in a document, the more relevant the document is assumed to be. However, when a keyword is a common word, frequency alone says little about relevance. TF-IDF solves this problem well and is regarded as one of the most important inventions in information retrieval. TF (Term Frequency) is the single-document term frequency: the number of occurrences of the keyword divided by the total number of words on the page, often called the "keyword frequency". IDF (Inverse Document Frequency) is the inverse document frequency: if a keyword appears in n web pages, then the larger n is, the smaller the keyword's weight, and vice versa. Common words therefore receive very small weights, which remedies the main defect of raw word frequency statistics.
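
As a hedged illustration of the TF-IDF idea described above, the following minimal Python sketch scores a small set of documents for a single keyword. The toy documents, the whitespace tokenization and the exact IDF form (log of N over n) are assumptions made for clarity, not a description of any particular engine.

```python
import math
from collections import Counter

def tf_idf(documents, keyword):
    """Score each document for one query keyword using TF-IDF.

    TF  = occurrences of the keyword / total words in the document.
    IDF = log(N / n), where N is the number of documents and n is the
          number of documents containing the keyword.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_containing = sum(1 for words in tokenized if keyword in words)
    if n_containing == 0:
        return [0.0] * len(documents)
    idf = math.log(len(documents) / n_containing)
    scores = []
    for words in tokenized:
        tf = Counter(words)[keyword] / len(words)
        scores.append(tf * idf)
    return scores

docs = [
    "rice blast is a common rice disease",
    "rice is a staple food",
    "search engines rank web pages",
]
print(tf_idf(docs, "rice"))   # common word: appears in 2 of 3 docs, low IDF
print(tf_idf(docs, "blast"))  # rarer word: higher weight where it appears
```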

2) Word position weighting

In search engines, word position weighting is applied mainly to web pages, so analysis of page layout information is very important. By assigning different weights to the different positions and layout elements in which a search keyword appears, the engine can judge the degree of relevance between a page and the keyword from the accumulated weight. Layout information that can be considered includes whether the keyword appears in the title, in the meta keywords, in the body text, its font size, whether it is bold, and so on. Anchor text is also very important, since it usually describes the content of the page it points to quite accurately.
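
The sketch below illustrates position weighting under assumed, purely illustrative weights; real engines tune such weights empirically and consider many more layout signals.

```python
# Illustrative weights only, not any engine's real values: more prominent
# positions contribute more to the page's score for a keyword.
POSITION_WEIGHTS = {
    "title": 5.0,     # keyword in the page title
    "keywords": 3.0,  # keyword in the meta keywords
    "anchor": 4.0,    # keyword in anchor text of links pointing to the page
    "bold": 2.0,      # keyword in bold text
    "body": 1.0,      # keyword in ordinary body text
}

def position_score(page_fields, keyword):
    """page_fields maps a layout field name to the text found in it."""
    keyword = keyword.lower()
    score = 0.0
    for field, weight in POSITION_WEIGHTS.items():
        text = page_fields.get(field, "").lower()
        score += weight * text.split().count(keyword)
    return score

page = {
    "title": "Rice blast prevention",
    "body": "Rice blast is a fungal disease of rice plants ...",
    "anchor": "rice blast control guide",
}
print(position_score(page, "blast"))  # 5.0 + 1.0 + 4.0 = 10.0
```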

2.2 Second-generation search engines based on link-analysis ranking

The idea of link-analysis ranking comes from the citation-index mechanism: the more often, or by the more authoritative works, a paper is cited, the more valuable it is. Link analysis works the same way: the more often a web page is cited by other pages, or by more authoritative pages, the greater its value; the more it is linked to, the more popular, authoritative and high-quality it is considered to be. Link-analysis ranking algorithms can be roughly divided into the following categories: those based on the random-walk model, such as PageRank and Reputation; those based on probabilistic models, such as SALSA and PHITS; those based on the mutual reinforcement of hubs and authorities, such as HITS and its variants; and those based on Bayesian models, such as the Bayesian algorithm and its simplified version. In practice these algorithms are combined with traditional content-analysis techniques for optimization. This paper mainly introduces the following classic ranking algorithms:

1) PageRank algorithm

The PageRank algorithm was proposed by Sergey Brin and Lawrence Page, then doctoral students at Stanford University. It is the core ranking algorithm of the Google search engine, one of the key factors that made Google the most successful search engine in the world, and it also set off the wave of research on link analysis.

The basic idea of PageRank is to measure the importance of a page by its PageRank value, which reflects two aspects: the number of pages that reference the page and the importance of those referencing pages. If page P(A) is cited by page P(B), this can be regarded as P(B) recommending P(A): P(B) distributes its own importance (PageRank value) equally among all the pages it cites. Therefore, the more pages cite P(A), the more PageRank is assigned to it; and the more important P(B) is, the more PageRank it can pass on to the pages it cites. The higher the PageRank value of P(A), the more important the page.

Its calculation formula is:

PR(A) = (1 - d) + d × ( PR(P1)/C(P1) + PR(P2)/C(P2) + ... + PR(Pn)/C(Pn) )

where P1, ..., Pn are the pages that link to page A, and:

PR(A): the PageRank value of page A;

d: the damping factor. Because some pages have no inbound or outbound links, their PageRank value cannot be calculated; the damping factor was introduced to avoid this problem (the "link sink" problem) and is usually set to 0.85.

PR(Pi): the PageRank value of page Pi, a page that links to A;

C(Pi): the number of outbound links on page Pi.

At the start of the computation, all pages are given the same PageRank value. Because pages linked to by important pages are themselves important, iterative computation is required so that this factor is not lost. According to calculations reported by Zhang Yinghai, the link evaluation values stabilize after roughly ten iterations; that is, the PR values of the system converge after repeated iteration.
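
The iterative computation described above can be illustrated with a minimal Python sketch. The toy link graph, the fixed iteration count and the variable names are assumptions made here for illustration; the code follows the formula given earlier rather than any production implementation.

```python
def pagerank(links, d=0.85, iterations=20):
    """Iteratively compute PR values for a small link graph.

    links maps each page to the list of pages it links to. Each step
    applies the formula above:
    PR(A) = (1 - d) + d * sum(PR(Pi) / C(Pi)) over pages Pi linking to A.
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # every page starts with the same value
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = [p for p in pages if page in links[p]]
            new_pr[page] = (1 - d) + d * sum(pr[p] / len(links[p]) for p in incoming)
        pr = new_pr
    return pr

# Toy graph: A and B link to each other, C links only to A.
graph = {"A": ["B"], "B": ["A"], "C": ["A"]}
print(pagerank(graph))  # values stabilize after a handful of iterations
```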

PageRank is a static, query-independent algorithm, so the PageRank values of all web pages can be computed offline. This reduces the ranking work needed at query time, and query response time is therefore greatly shortened. However, PageRank has two defects. First, it seriously penalizes new pages: a new page usually has very few inbound and outbound links, so its PageRank value is very low. Second, PageRank depends only on the number and importance of inbound links and ignores the topical relevance of pages, so some pages whose topic is unrelated to the query (such as advertising pages) obtain large PageRank values and degrade the accuracy of search results. For this reason, various topic-related algorithms have emerged, of which the following are the most typical.

2) Topic-Sensitive PageRank algorithm

Because the original PageRank algorithm did not take topic into account, Taher Haveliwala of the Computer Science Department at Stanford University proposed the Topic-Sensitive PageRank algorithm to address the problem of "topic drift". The algorithm recognizes that a page considered important in one field is not necessarily important in other fields.

A link from web page A to web page B can be regarded as a vote by A for B. If A and B belong to the same topic, the vote can be considered more reliable, because A and B can be regarded as peers, and peers usually understand their own field better than outsiders do, so a peer's vote tends to carry more weight than a non-peer's. Unfortunately, TSPR does not use this kind of topic relevance to improve the accuracy of link scores.
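
The following sketch illustrates the underlying idea of Topic-Sensitive PageRank in simplified form: the random jump is biased toward a set of pages belonging to one topic, so one biased PageRank vector can be precomputed per topic. The toy graph, the single topic set and the parameter values are assumptions for illustration; the published algorithm precomputes one vector per topic category and combines them at query time according to the query's topic distribution.

```python
def topic_sensitive_pagerank(links, topic_pages, d=0.85, iterations=20):
    """PageRank whose random jump is biased toward pages of one topic.

    Instead of jumping uniformly to any page, the random surfer jumps only
    to pages in topic_pages, so pages of that topic (and the pages they
    link to) accumulate higher scores.
    """
    pages = list(links)
    jump = {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0)
            for p in pages}
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = [p for p in pages if page in links[p]]
            from_links = sum(pr[p] / len(links[p]) for p in incoming)
            new_pr[page] = (1 - d) * jump[page] + d * from_links
        pr = new_pr
    return pr

# One biased vector per topic is computed offline; at query time the scores
# would be combined according to how strongly the query matches each topic.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}
print(topic_sensitive_pagerank(graph, topic_pages={"C"}))
```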

3) HillTop algorithm

HillTop is covered by a patent filed in 2001 by Google engineer Krishna Bharat. It is a query-dependent link-analysis algorithm and thus overcomes the query independence of PageRank. The HillTop algorithm holds that links among related documents on the same topic are more valuable to searchers. It considers only "expert pages", pages whose purpose is to guide people to other resources. When HillTop receives a query, it first computes a list of the expert pages most relevant to the query topic, and then ranks the target pages according to the number and relevance of the independent expert pages that point to them.
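
A very rough sketch of this two-phase process is given below. It is not the patented algorithm itself: the expert-page data structure, the relevance test by simple word overlap and the independence test by distinct hosts are all simplifying assumptions made here; the real algorithm uses key phrases, affiliation detection and finer-grained scoring.

```python
def hilltop_rank(query_terms, expert_pages, min_experts=2):
    """Rank target pages by endorsements from relevant, independent experts.

    expert_pages: list of dicts such as
    {"text": "...", "links": ["a.html"], "host": "agri.example"}.
    """
    query_terms = {t.lower() for t in query_terms}

    # Phase 1: keep only the expert pages relevant to the query
    # (here crudely approximated by word overlap with the expert's text).
    relevant = [e for e in expert_pages
                if query_terms & set(e["text"].lower().split())]

    # Phase 2: count independent experts (distinct hosts) pointing to each
    # target, and drop targets endorsed by too few of them.
    votes = {}
    for expert in relevant:
        for target in expert["links"]:
            votes.setdefault(target, set()).add(expert["host"])
    ranked = [(target, len(hosts)) for target, hosts in votes.items()
              if len(hosts) >= min_experts]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

experts = [
    {"text": "rice blast resources and guides",
     "links": ["a.html", "b.html"], "host": "agri.example"},
    {"text": "rice disease directory",
     "links": ["a.html"], "host": "farm.example"},
]
print(hilltop_rank(["rice", "blast"], experts))  # a.html wins with 2 expert hosts
```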

The HillTop algorithm bases ranking on how well pages match the search keywords, rather than relying too heavily on PageRank to find authoritative pages, and so it defeats many spamming tricks that try to raise a page's PageRank by adding large numbers of meaningless links. HillTop keeps the evaluation relevant to the keywords by grading expert pages, keeps it relevant to the topic (industry) through the positions in which keywords appear, and guards against keyword stuffing by distinguishing the number of matched phrases.

Finding and identifying expert pages is the key step of the algorithm, and the quality of the expert pages is decisive for its accuracy, while the influence of the vast majority of non-expert pages is ignored. Expert pages account for only a very small proportion of the Internet (about 1.79%) and cannot represent all of its pages, which gives HillTop certain limitations. In addition, unlike PageRank, the HillTop computation runs online at query time, which puts great pressure on the system's response time.

4) HITS algorithm

The HITS (Hyperlink-Induced Topic Search) algorithm was proposed by Kleinberg in 1998 and is the other most famous ranking algorithm based on hyperlink analysis. According to the direction of hyperlinks, the algorithm divides web pages into two categories: authority pages and hub pages. An authority page is a page whose content is closest to a query keyword or combination of keywords; a hub page is essentially a directory page, whose content consists mainly of many links to authority pages and whose main function is to bring those authority pages together. For an authority page P, the more hub pages point to P and the higher their quality, the greater P's authority value; for a hub page H, the more authority pages H points to and the higher their quality, the greater H's hub value. Across the whole web, authorities and hubs are interdependent and mutually reinforcing, and this mutually reinforcing relationship is the basis of the HITS algorithm.

The basic idea of HITS is to measure the importance of a web page by its in-links (hyperlinks pointing to the page) and out-links (hyperlinks from the page to other pages). After the set of pages to be considered is defined, an adjacency matrix is built from their in-links and out-links. Through iterative operations on this matrix, with a defined convergence threshold, the authority and hub vectors are updated repeatedly until they converge.
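
A minimal sketch of this iteration, assuming a toy adjacency matrix and a fixed iteration count in place of a convergence threshold:

```python
import numpy as np

def hits(adjacency, iterations=50):
    """Compute hub and authority scores by repeated matrix multiplication.

    adjacency[i][j] = 1 if page i links to page j.
    Each step: authority = A^T @ hub, hub = A @ authority, then normalize.
    """
    A = np.asarray(adjacency, dtype=float)
    hub = np.ones(A.shape[0])
    authority = np.ones(A.shape[0])
    for _ in range(iterations):
        authority = A.T @ hub
        authority /= np.linalg.norm(authority)
        hub = A @ authority
        hub /= np.linalg.norm(hub)
    return hub, authority

# Toy graph: pages 0 and 1 both link to page 2.
adj = [[0, 0, 1],
       [0, 0, 1],
       [0, 0, 0]]
hub, auth = hits(adj)
print(hub, auth)  # page 2 gets the highest authority; pages 0 and 1 act as hubs
```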

Experimental data show that the ranking accuracy of HITS is higher than that of PageRank, and that the design of HITS matches the way users generally judge the quality of network resources, making it easier for users to employ information-retrieval tools to access Internet resources.

However, HITS has the following defects. First, it computes only the principal eigenvector and therefore cannot handle topic drift well. Second, when searching for a narrow topic, topic generalization may occur. Third, HITS can be regarded as an experimental attempt: it must run after the content-based retrieval step of the search system, computing over the link relationships between the content-retrieval result pages and the pages directly connected to them. Although attempts have been made to improve the algorithm, for example by building a connectivity server to achieve a degree of real-time online computation, the computational cost remains unacceptable.

2.3 Third-generation search engines based on intelligent ranking

The ranking algorithm plays a particularly important role in a search engine, and many engines are currently studying new ranking methods to improve user satisfaction. However, second-generation search engines still have two shortcomings, described below, and it is against this background that third-generation search engines based on intelligent ranking have emerged.

1) The relevance problem

Relevance refers to the degree to which a search term and a page are related. Because of the complexity of language, judging relevance only by link analysis and the surface features of web pages is one-sided. For example, when a user searches for "rice blast", a page that discusses rice diseases and pests but never contains the literal words "rice blast" cannot be retrieved at all. For the same reason, a large amount of search-engine spam cannot be eliminated. The way to solve the relevance problem is to add semantic understanding and analyze how strongly the search keywords and a web page are related: the more accurate the relevance analysis, the better the user's results, and pages with low relevance can be filtered out, effectively preventing search-engine spam. Running such relevance analysis between search keywords and web pages across the whole Internet puts great pressure on the system, and a distributed architecture can improve the system's scale and performance.
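
As a toy illustration of what adding a little semantic understanding might look like, the sketch below expands the query with related terms before computing a simple cosine similarity. The hand-written concept table, the tokenization and the example page are all assumptions; real systems would rely on ontologies or learned representations rather than such a lookup table.

```python
import math
from collections import Counter

# A hand-written concept table stands in for real semantic analysis; actual
# systems would use ontologies or learned representations instead.
CONCEPTS = {"rice blast": ["rice", "disease", "fungus", "pyricularia"]}

def expand(query):
    """Add related terms so pages without the literal phrase can still match."""
    terms = query.lower().split()
    for phrase, related in CONCEPTS.items():
        if phrase in query.lower():
            terms += related
    return terms

def cosine_relevance(query, page_text):
    q = Counter(expand(query))
    p = Counter(page_text.lower().split())
    dot = sum(q[t] * p[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in p.values())))
    return dot / norm if norm else 0.0

page = "this page describes a fungal disease that damages rice leaves"
print(cosine_relevance("rice blast", page))  # nonzero despite the missing phrase
```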

2) The uniformity of search results

In a search engine, everyone who searches for the same word gets the same results, but this does not meet the needs of all users, because different users have different requirements for the results. For example, an ordinary farmer searching for "rice blast" just wants information about the disease and how to prevent and control it, whereas an agricultural expert or researcher may want papers related to rice blast.

The way to solve the uniformity of search results is to provide personalized service and realize intelligent search: through Web data mining, user models (covering, for example, the user's background, interests, behavior and style) are built and used to deliver personalized results.
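
A minimal sketch of personalized re-ranking along these lines is shown below. The result tuples, interest tags and user profiles are invented for illustration; in practice the profile would be learned through Web data mining as described above.

```python
def personalized_rerank(results, user_profile):
    """Re-rank the same result list differently for different users.

    results: list of (url, base_score, tags) tuples from the engine.
    user_profile: maps interest tags (learned from the user's background
    and behavior) to weights.
    """
    def score(item):
        url, base, tags = item
        boost = sum(user_profile.get(tag, 0.0) for tag in tags)
        return base * (1.0 + boost)
    return sorted(results, key=score, reverse=True)

results = [
    ("prevention-guide.html", 0.8, ["farming", "practical"]),
    ("journal-paper.html", 0.7, ["research"]),
]
farmer = {"farming": 1.0, "practical": 0.5}
expert = {"research": 1.5}
print(personalized_rerank(results, farmer))  # prevention guide ranked first
print(personalized_rerank(results, expert))  # research paper ranked first
```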