3.1 Requirements analysis
This system is a sub-project of a distributed cross-language search project, which consists of two main parts: data acquisition and information search. This thesis is mainly responsible for the acquisition of data. Before explaining the contents of this chapter in detail, the background of the project is introduced. Simply put, the project takes a keyword entered in one language and finds information related to that keyword in many languages. At present, information in 27 languages can be searched, including mainstream languages such as Chinese, Japanese, English, German, French and Russian, as well as less widely used languages such as Mongolian, Vietnamese and Hindi. The project mainly searches news information in these 27 languages. Finally, the project clearly stipulates that both the web crawler system and the information search system must use a distributed architecture.
3.1.1 Functional requirements analysis
Because this system is a sub-project of the distributed cross-language search project, we should first gain a general understanding of the overall layout of the cross-language search project before introducing the system itself. Through this overview, we can understand the distributed web crawler system as a whole, understand the overall module design of the system, and appreciate the importance of the system within the whole project, so as to better analyze the requirements. At the same time, we can also understand the purpose and work of the crawler system, laying the foundation for the subsequent indexing work.
The distributed cross-language search project is built on the widely used Hadoop distributed system framework. As introduced in the previous chapter, Hadoop is a framework based on cloud computing, mainly composed of HDFS and the Map/Reduce model. Users do not need to know the underlying implementation when using this framework, which makes program development more convenient. The project comprises five functional modules: crawler, analysis, index, search and query, and each module has a corresponding Map/Reduce computing model. In particular, all five modules must adopt distributed technology. This thesis discusses how to use distributed technology to realize a web crawler system. Figure 3-1 shows the functional module division of the project.
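To make the Map/Reduce computing model concrete, the sketch below simulates its phases in plain Python. This is an illustrative toy under our own assumptions, not the project's actual Hadoop code: the map phase emits (word, 1) pairs from raw news text, a shuffle step groups pairs by key as Hadoop does between phases, and the reduce phase aggregates the counts. Each of the five modules would express its own logic in this same pattern as a Hadoop job.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: scan each input record and emit (key, value) pairs.
    Here each word in a document yields the pair (word, 1)."""
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as Hadoop does
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's list of values.
    Here the aggregation is a simple sum of the counts."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical sample input standing in for crawled news text.
news = ["crawler fetches news pages", "news pages feed the index"]
counts = reduce_phase(shuffle(map_phase(news)))
print(counts["news"])   # "news" appears once in each document
```

In a real Hadoop deployment, the map and reduce phases run in parallel on different cluster nodes and the intermediate data flows through HDFS; the framework handles that distribution so the developer only writes the two phase functions.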