Web Crawler Paper Based on C# ~~ desperately requested!
After introducing the technical background of web crawler systems, the next step is to complete the overall design of this web crawler system according to the results of the requirements analysis. First, this chapter presents the requirements analysis of the distributed crawler system, including the target web pages to be crawled, the functional requirements, and the requirements of the system itself. Then, based on the Hadoop distributed system architecture, it gives a general picture of the crawler system and clarifies the important role the crawler system plays in cross-language search. In addition, the overall structure and functional modules of the system are designed, and a flow chart is given. At the end of the chapter, the internal structure of each functional module is described in detail and its implementation method is pointed out.

3.1 Requirements Analysis

This system is a sub-project of a distributed cross-language search project. The project consists of two parts: the first is data acquisition; the second is information search. This thesis is mainly responsible for the acquisition of data. Before explaining the contents of this chapter in detail, the background of the project is briefly introduced. Simply put, the project takes a keyword entered in one language and finds information related to that keyword in many languages. At present, the project can search information in 27 languages. These include mainstream languages such as Chinese, Japanese, English, German, French, and Russian, as well as less widely used languages such as Mongolian, Vietnamese, and Hindi. The project mainly searches for news information in these 27 languages. Finally, the project explicitly requires that both the web crawler system and the information search system use a distributed architecture.

3.1.1 Functional Requirements Analysis

Because this system is a sub-project of the distributed cross-language search project, we should first gain a general understanding of the overall layout of the cross-language search project before introducing the system itself. This overview lets us understand the distributed web crawler system as a whole, grasp the overall module design of the system, and appreciate the importance of the system within the whole project, so that the requirements can be analyzed more accurately. At the same time, it clarifies the purpose and work of the crawler system and lays the foundation for the subsequent indexing work.

The distributed cross-language search project is built on the Hadoop distributed system framework, which is widely used today. As introduced in the previous chapter, Hadoop is a framework for cloud computing, composed mainly of HDFS and the Map/Reduce programming model. Users do not need to know the underlying implementation when using this framework, which makes program development more convenient. The project contains roughly five functional modules, and each functional module has its corresponding Map/Reduce computing model. The five modules are: crawling, parsing, indexing, searching, and querying. In particular, all five modules must adopt distributed technology. This thesis discusses how to use distributed technology to implement a web crawler system; a sketch of how one crawl round can be expressed as a Map/Reduce job follows below. Figure 3-1 shows the functional module division of the project.
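To make the Map/Reduce idea concrete, here is a minimal sketch of one crawl round expressed as a Hadoop MapReduce job. Although the post title asks for C#, Hadoop's native MapReduce API is Java, so Java is used here. The thesis excerpt gives no code, so the class names, input/output paths, and the naive regex-based link extractor below are all illustrative assumptions, not the thesis author's design: the mapper fetches each URL in the current frontier and emits its outlinks, and the reducer de-duplicates those links to form the next round's frontier.

```java
// A minimal sketch of one crawl round as a Hadoop MapReduce job.
// All names (CrawlRound, FetchMapper, DedupReducer) are hypothetical.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CrawlRound {

    // Map: each input line is one URL from the current frontier.
    // Fetch the page and emit every absolute outlink found in it.
    public static class FetchMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        private static final Pattern HREF =
                Pattern.compile("href=\"(https?://[^\"]+)\"");

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String url = value.toString().trim();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(url).openStream()))) {
                StringBuilder page = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) page.append(line);
                Matcher m = HREF.matcher(page);
                while (m.find()) {
                    ctx.write(new Text(m.group(1)), NullWritable.get());
                }
            } catch (IOException e) {
                // Unreachable or malformed URLs are simply skipped here.
            }
        }
    }

    // Reduce: identical URLs emitted by different pages arrive at the
    // same reducer, so writing each key once de-duplicates the frontier.
    public static class DedupReducer
            extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text url, Iterable<NullWritable> vals,
                Context ctx) throws IOException, InterruptedException {
            ctx.write(url, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "crawl-round");
        job.setJarByClass(CrawlRound.class);
        job.setMapperClass(FetchMapper.class);
        job.setReducerClass(DedupReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // current frontier
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // next frontier
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

In a real crawler each round's output would feed the next job's input, with URL normalization, politeness delays, and robots.txt handling layered on top; this sketch only illustrates how crawling distributes naturally over the Map/Reduce model.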

For more details, you can send me a private message. ...