Does anyone know how to implement a web crawler and a search engine in Java? Please explain the principles, ideally with code attached. Thank you very much. Extra points for a good answer.

Use Heritrix to crawl the web pages.

There is too much to web page parsing to cover here; it is best to write that part yourself.

Use Lucene for indexing.

First of all, the crawler needs a processor chain. Crawling web pages cannot be done in a few dozen lines of code, because there are many problems to handle.
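To give a rough idea of what a processor chain looks like, here is a minimal sketch in plain Java. It is not Heritrix's actual API: `Processor`, `CrawlItem`, `SchemeFilter`, `FakeFetcher` and `LinkExtractor` are names made up for illustration, and the fetch step returns canned HTML instead of touching the network.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical names throughout; this only illustrates the chain idea.
class ProcessorChainSketch {

    /** Mutable state handed from one processor to the next for a single URL. */
    static class CrawlItem {
        final String url;
        String html = "";                              // filled in by the fetch step
        final List<String> links = new ArrayList<>();  // filled in by the extractor
        final List<String> log = new ArrayList<>();    // what each step did
        CrawlItem(String url) { this.url = url; }
    }

    /** One step of the chain; returning false stops processing of this item. */
    interface Processor {
        boolean process(CrawlItem item);
    }

    /** Rejects anything that is not plain http or https. */
    static class SchemeFilter implements Processor {
        public boolean process(CrawlItem item) {
            boolean ok = item.url.startsWith("http://") || item.url.startsWith("https://");
            item.log.add(ok ? "accepted" : "rejected");
            return ok;
        }
    }

    /** Stand-in for a real fetcher: uses canned HTML instead of the network. */
    static class FakeFetcher implements Processor {
        private final String cannedHtml;
        FakeFetcher(String cannedHtml) { this.cannedHtml = cannedHtml; }
        public boolean process(CrawlItem item) {
            item.html = cannedHtml;
            item.log.add("fetched");
            return true;
        }
    }

    /** Naive href extraction with a regex; a real crawler would use an HTML parser. */
    static class LinkExtractor implements Processor {
        private static final Pattern HREF =
                Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);
        public boolean process(CrawlItem item) {
            Matcher m = HREF.matcher(item.html);
            while (m.find()) {
                item.links.add(m.group(1));
            }
            item.log.add("extracted " + item.links.size() + " links");
            return true;
        }
    }

    /** Runs the processors in order until one of them returns false. */
    static CrawlItem run(List<Processor> chain, String url) {
        CrawlItem item = new CrawlItem(url);
        for (Processor p : chain) {
            if (!p.process(item)) {
                break;
            }
        }
        return item;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://example.com/a\">A</a> <a href=\"http://example.com/b\">B</a>";
        List<Processor> chain = Arrays.asList(
                new SchemeFilter(), new FakeFetcher(html), new LinkExtractor());
        CrawlItem item = run(chain, "http://example.com/");
        System.out.println(item.links);
        System.out.println(item.log);
    }
}
```

The point of the design is that each concern (filtering, fetching, extraction, later on storage and de-duplication) is a separate step that can be added, removed or reordered without touching the others.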

The main steps are:

1. Fetching web pages: detect the page encoding, locate the main body text of the page, extract the URLs in the page (URL filtering, caching and storage need optimizing, and so does the thread pool), allocate the URLs, and start the thread pool.

2. Persisting web pages: parse the page, download the style sheets and images it references, save the page (XML and HTML), and generate a page snapshot.

3. Web page de-duplication and de-noising: remove useless pages. A vertical search engine needs more judgment here, which can be implemented with content templates and the vector space model.

4. Building and optimizing the index, which mainly means building the inverted index.
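For step 4, an inverted index maps each term to the documents that contain it, so a query only has to look up its terms instead of scanning every page. Below is a minimal in-memory sketch, assuming a very crude tokenizer (lowercase, split on anything that is not a letter or digit) and integer document ids. `InvertedIndexSketch` and its methods are hypothetical names; Lucene's real index is far more elaborate (segments, compression, scoring).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical class for illustration; not Lucene's API.
class InvertedIndexSketch {
    // term -> sorted set of ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    /** Crude tokenizer: lowercase, split on non-alphanumeric characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Adds one document to the index. */
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** AND query: ids of the documents that contain every term of the query. */
    List<Integer> search(String query) {
        TreeSet<Integer> result = null;
        for (String term : tokenize(query)) {
            TreeSet<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // copy, so the index itself is not modified
            } else {
                result.retainAll(docs);         // intersect the posting lists
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(1, "Java web crawler");
        index.add(2, "Lucene search engine in Java");
        index.add(3, "Web search engine");
        System.out.println(index.search("java"));          // [1, 2]
        System.out.println(index.search("search engine")); // [2, 3]
        System.out.println(index.search("web java"));      // [1]
        System.out.println(index.search("python"));        // []
    }
}
```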

The classification you want can basically be achieved with content templates and vector space calculations.
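As a rough illustration of the vector space part, the sketch below turns each text into a term-frequency vector and compares vectors by cosine similarity; a page is then assigned to the category whose sample text it resembles most. This is only one common way to do it and rests on assumptions: raw term counts instead of TF-IDF weights, one sample text per category, and hypothetical names (`VectorSpaceSketch`, `classify`, the category labels). The same `cosine` function can also serve for de-duplication, by treating two pages whose similarity is close to 1 as near-duplicates.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical class for illustration.
class VectorSpaceSketch {

    /** Raw term-frequency vector; real systems usually weight terms, e.g. with TF-IDF. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tf.merge(t, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Cosine of the angle between two term vectors; 0.0 when they share no terms. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;   // an empty vector is similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the category whose sample text is most similar to the page, or null if none overlaps. */
    static String classify(String page, Map<String, String> categorySamples) {
        Map<String, Integer> pageVector = termFrequencies(page);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, String> e : categorySamples.entrySet()) {
            double score = cosine(pageVector, termFrequencies(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> samples = new LinkedHashMap<>();
        samples.put("sports", "football match team goal score");
        samples.put("tech", "java code software crawler index");
        System.out.println(classify("a java crawler that builds a search index", samples));
    }
}
```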

There are many other details that I can't go into for now. How far do you want to take this? (For example: the vector space algorithm and reference values for its results, or how to build web page content templates.)