I won't say much about web page analysis here; it's better to write that part yourself.
Use Lucene for the index.
First of all, the crawler needs a processing chain; you cannot crawl web pages properly with just a few dozen lines of code, because there are many problems to deal with. Roughly, the steps are:
1. Fetch the page: detect the page encoding, locate the main body text, and extract the URLs in the page (URL filtering, caching, and storage also need a tuned thread pool), then distribute the URLs and start the thread pool (a fetch sketch follows this list).
2. Page persistence: parse the page, download the style sheets and images it references, save the page (as XML and HTML), and generate a page snapshot.
3. Page deduplication and denoising: remove useless pages; a vertical search engine needs extra judgment here, which can be done with content templates and vector-space calculations (see the similarity sketch further below).
4. Index building and optimization, which mainly comes down to building the inverted index (a Lucene sketch also follows this list).
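
As a rough illustration of step 1, here is a minimal fetch-and-extract sketch. It assumes the jsoup library, which is not named in the original answer; the user agent, seed URL, and pool size are placeholders you would replace.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Fetch one page and collect the absolute URLs it links to.
// jsoup reads the charset from the HTTP header or the <meta> tag,
// which covers the "detect the page encoding" step.
public class PageFetcher {

    public static Set<String> fetchAndExtract(String url) throws Exception {
        Document doc = Jsoup.connect(url)
                .userAgent("my-crawler/0.1")   // hypothetical user agent
                .timeout(10_000)
                .get();

        Set<String> outLinks = new HashSet<>();
        for (Element link : doc.select("a[href]")) {
            String href = link.attr("abs:href");  // resolve relative URLs
            if (!href.isEmpty()) {
                outLinks.add(href);               // filter/dedupe before queueing
            }
        }
        return outLinks;
    }

    public static void main(String[] args) {
        // A fixed-size thread pool stands in for the "distribute URLs" step.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        pool.submit(() -> fetchAndExtract("https://example.com/")); // placeholder seed URL
        pool.shutdown();
    }
}
```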
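For step 4, a minimal Lucene indexing sketch (API roughly as in Lucene 5+; the "index" directory path and the field names are my assumptions, not from the original answer):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

// Add one crawled page to the inverted index.
public class PageIndexer {

    public static void index(String url, String title, String body) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("index")),          // assumed index directory
                new IndexWriterConfig(new StandardAnalyzer()))) {

            Document doc = new Document();
            doc.add(new StringField("url", url, Field.Store.YES));   // stored as-is, not tokenized
            doc.add(new TextField("title", title, Field.Store.YES)); // tokenized into the inverted index
            doc.add(new TextField("body", body, Field.Store.NO));
            writer.addDocument(doc);
        }
    }
}
```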
Classification can likewise be done mostly with content templates and vector-space calculations, as in the small similarity sketch below.
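A toy sketch of the vector-space idea used in steps 3 and in classification: build a term-frequency vector for each page and compare pages with cosine similarity. The whitespace tokenization and any duplicate threshold (e.g. treating scores above 0.9 as near-duplicates) are illustrative choices, not values from the original answer.

```java
import java.util.HashMap;
import java.util.Map;

// Term-frequency vectors plus cosine similarity, the simplest vector-space comparison.
public class VectorSpace {

    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) normB += (double) v * v;
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }
}
```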
There is a lot more that I can't go into for now. How far do you want to take this? (For example: the vector-space algorithm and how to interpret its scores, or how to build page content templates.)