
Web information retrieval of professional content based on grid

1 Introduction

In recent years, with the rapid development of the Internet, the information resources available online have grown enormous, and this information is large in quantity, dispersed and heterogeneous. As a result, traditional Web information retrieval tools have begun to show poor performance: they often return thousands or even tens of thousands of records, which users cannot examine carefully, or the content returned lies outside the professional field the user is interested in, so much of it is useless. At the same time, as people's awareness of information grows, their demands on information content and information services keep evolving, which places new requirements on the specialization and effectiveness of information acquisition. How to provide professional, tailor-made information services for specific users in a professional field, so that they can find the information they most need in the shortest time, is a common concern. This paper designs a grid-based Web information retrieval system for professional content using grid computing, cluster systems, XML and related technologies. The system logically organizes and manages geographically dispersed, heterogeneous information according to professional content, and gives users a fast and effective way to obtain the information they need.

2 Design of Web Information Retrieval Architecture for Professional Content Based on Grid

Grid computing is an important information technology that has emerged in recent years. Its purpose is to organize all kinds of networked resources under a unified framework, provide a user-friendly virtual platform for large-scale complex computing, data services and various network information services, and achieve full interconnection of all resources on the Internet and full sharing of information resources.

To solve the problems of complex scientific computing and massive information services in different fields, people have built different grids on top of network interconnection. These grids differ in architecture and in the types of problems they solve, but any grid needs at least three basic functions: resource management, task management and task scheduling. The information retrieval architecture designed in this paper, built around these basic functions and the characteristics of information retrieval, consists of the following three levels, as shown in Figure 1.

(1) Grid node: A node is a provider of grid computing resources. In this system the nodes are a series of geographically distributed cluster systems, which together form a distributed retrieval group that serves as the basic platform for information sharing. Each cluster system is responsible for information management, maintenance and query within its own cluster.

(2) Grid computing middleware: Middleware is the tool for information resource management, user task scheduling and task management, and is the core of grid information resource management as a whole. Based on the user's information request task, it matches and locates information resources across the whole grid and maps user tasks onto cluster systems.

(3) Grid user layer: This layer mainly provides interfaces for user applications and supports users in describing, creating and submitting requests for the information resources they need.

Figure 1

The main idea of the system is to divide geographically dispersed, heterogeneous information logically into multiple cluster systems, each of which manages its own resources and schedules tasks within the cluster. The grid middleware then manages the cluster systems, which gives management of the resources of the whole grid and uniform management and scheduling of users' information requests. This management model respects the local information management strategy of each cluster system while still managing grid information resources globally through the middleware.
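The three levels can be sketched in a few lines of Python. This is only an illustrative model, not the paper's implementation: the class names `Cluster` and `Middleware`, the `domain` attribute and the substring matching are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Grid node: manages and queries its own local resources."""
    name: str
    domain: str                                   # professional content area
    documents: list[str] = field(default_factory=list)

    def query(self, keyword: str) -> list[str]:
        # Local policy (assumed): case-insensitive substring match.
        return [d for d in self.documents if keyword.lower() in d.lower()]


@dataclass
class Middleware:
    """Maps a user task onto the clusters that cover the requested domain."""
    clusters: list[Cluster] = field(default_factory=list)

    def dispatch(self, domain: str, keyword: str) -> dict[str, list[str]]:
        matches = [c for c in self.clusters if c.domain == domain]
        return {c.name: c.query(keyword) for c in matches}


# User layer: describe and submit a request through the middleware.
grid = Middleware([
    Cluster("cluster-a", "medicine",
            ["Grid computing in radiology", "Cardiology notes"]),
    Cluster("cluster-b", "law", ["Grid of case law citations"]),
])
results = grid.dispatch("medicine", "grid")
```

Only `cluster-a` is consulted for a medicine request, so the law cluster's matching document never reaches the user. This is the filtering by professional content that the architecture aims for.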

2.1 Design of Cluster System

Because Web information resources are so vast, users face a massive data query problem when they use existing search engines, and the search is often inaccurate and incomplete even after consuming large amounts of communication resources. Web server cluster systems based on a single system image can connect multiple servers through a LAN so that they appear to clients as a single server, which makes it possible to merge and organize geographically distributed information resources logically. This paper therefore starts from a distributed cooperation strategy and divides Web information resources by region and by professional content. On the one hand, each cluster holds fewer information resources, which simplifies data organization, management and maintenance. On the other hand, a common XML specification can be drawn up for each professional content area and used to describe the information resources in the cluster, which yields an XML-based information integration system oriented to professional content. The structure of the cluster system is shown in Figure 2.

The cluster server consists mainly of an interface agent, an XML information integration system based on professional content, a resource service agent and a resource publishing agent. The interface agent registers, receives and manages information resource request tasks according to the interface parameters supplied with each task, and provides security authentication and authorization. The resource service agent takes an information resource request task, performs the actual retrieval using the data provided by the XML information integration system, and sends the results to the user. The resource publishing agent provides the grid middleware with the logical data and interface parameters of the local information resources.
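As a rough sketch of how these three agents might divide the work, the following toy class models them as methods. The class name `ClusterServer`, the token-based check and the record layout are hypothetical and stand in for whatever authentication and storage a real cluster would use.

```python
import itertools


class ClusterServer:
    """Toy cluster server; all names and the token check are illustrative."""

    def __init__(self, name, domain, endpoint, records, allowed_tokens):
        self.name = name
        self.domain = domain
        self.endpoint = endpoint
        self.records = records            # stand-in for the XML integration system
        self.allowed_tokens = set(allowed_tokens)
        self.tasks = {}                   # registered request tasks
        self._ids = itertools.count(1)

    def interface_agent(self, token, keyword):
        """Authenticate the caller, then register the request task."""
        if token not in self.allowed_tokens:
            raise PermissionError("request rejected: unknown token")
        task_id = next(self._ids)
        self.tasks[task_id] = keyword
        return task_id

    def resource_service_agent(self, task_id):
        """Run the retrieval for a registered task and return the hits."""
        keyword = self.tasks.pop(task_id).lower()
        return [r for r in self.records if keyword in r["title"].lower()]

    def resource_publishing_agent(self):
        """Publish logical data and interface parameters for the middleware."""
        return {"name": self.name, "domain": self.domain,
                "endpoint": self.endpoint, "size": len(self.records)}


server = ClusterServer(
    name="cluster-a", domain="medicine", endpoint="http://a.example/query",
    records=[{"title": "Grid computing in radiology"},
             {"title": "Cardiology notes"}],
    allowed_tokens=["secret-token"],
)
task = server.interface_agent("secret-token", "grid")
hits = server.resource_service_agent(task)
```

The dictionary returned by `resource_publishing_agent` is the kind of record the middleware would store about each cluster.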

The construction of the XML information integration system based on professional content is explained below.

XML (Extensible Markup Language) was published by the W3C in 1998 as a new standard for data representation and data exchange on the Internet. It is a self-describing language: developers describe their own data by creating custom tags, which are defined by a document type. A DTD (Document Type Definition) defines the syntax and data structure of an XML file. XML is plain text, so it is cross-platform. Its advantages are: (1) Simplicity and standardization: XML documents are based on text tags, with a rigorous and concise grammar that is easy for both computers and users to understand. (2) Extensibility: users can define tags with specific meanings, and these tags can be shared among organizations, customers and applications. (3) Self-description: this makes XML well suited to data exchange between different applications, because the exchange does not require a set of data structures to be agreed in advance, so it is highly open. (4) Interoperability: XML can store all information in documents for transmission, and remote applications can extract the information they need from them. Because XML data is independent of any specific platform or application, it is an excellent means of expressing specific professional content and can serve as a language for describing it.
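A small example makes the extensibility and self-description points concrete. The sketch below uses Python's standard `xml.etree.ElementTree` module. The tag names `resource`, `title`, `author` and `url`, the `domain` attribute and the sample values are invented for illustration and are not taken from the paper's XML specification.

```python
import xml.etree.ElementTree as ET

# Describe one information resource with custom, domain-specific tags.
resource = ET.Element("resource", attrib={"domain": "medicine"})
ET.SubElement(resource, "title").text = "Grid computing in radiology"
ET.SubElement(resource, "author").text = "A. Example"
ET.SubElement(resource, "url").text = "http://example.org/doc/1"

# Serialize to plain text, the form in which it would be exchanged.
xml_text = ET.tostring(resource, encoding="unicode")

# A receiving application needs no prior data structure: the tags
# themselves say what each value means.
parsed = ET.fromstring(xml_text)
record = {child.tag: child.text for child in parsed}
```

The receiver rebuilds a usable record from the tag names alone, which is what allows clusters and middleware to exchange descriptions without a shared binary format.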

At present, the basic methods for developing a Web information integration system fall into two categories: the warehouse method and the virtual method. Both can exploit the advantages of XML in data organization and exchange: the integration schema based on professional content, and the mapping between that schema and the underlying resources, are expressed with DTD files and XML documents, and an XML-based Web information integration system is built on top of them. See reference [2] for its structure and information acquisition process.

Figure 2

2.2 Design of Grid Middleware

The grid middleware shown in Figure 3 has three main functions. (1) It eliminates the differences in data representation between users and cluster systems, so that the format of information resource data is transparent to users. (2) It manages and maintains the cluster systems distributed on the Web. The middleware records the logical information and professional content of all cluster systems in a relational database. Operations on this database maintain the distributed logical structure of the cluster systems and keep that structure flexible and extensible. (3) It accepts the user's information request task and, by querying the relational database, quickly locates the cluster systems that meet the requirements, thereby establishing the correspondence between the request task and the cluster systems.
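The registry role of the relational database can be illustrated with Python's built-in `sqlite3` module. The table name `cluster`, its columns and the sample endpoints are assumptions made for this sketch. The paper does not specify a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cluster ("
    " name TEXT PRIMARY KEY,"
    " domain TEXT NOT NULL,"      # professional content covered
    " endpoint TEXT NOT NULL)"    # interface parameter for sending tasks
)
conn.executemany(
    "INSERT INTO cluster VALUES (?, ?, ?)",
    [
        ("cluster-a", "medicine", "http://a.example/query"),
        ("cluster-b", "law", "http://b.example/query"),
        ("cluster-c", "medicine", "http://c.example/query"),
    ],
)


def locate_clusters(conn, domain):
    """Return (name, endpoint) for every cluster covering the domain."""
    rows = conn.execute(
        "SELECT name, endpoint FROM cluster WHERE domain = ? ORDER BY name",
        (domain,),
    )
    return rows.fetchall()


before = locate_clusters(conn, "medicine")

# Maintaining the grid is an ordinary database operation:
# here one cluster leaves the grid.
conn.execute("DELETE FROM cluster WHERE name = ?", ("cluster-c",))
after = locate_clusters(conn, "medicine")
```

Adding, removing or reclassifying a cluster changes only rows in the table, which is why the structure stays flexible and extensible.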

The main internal function modules are described as follows:

(1) Receiving agent module: registers, receives and manages information resource request tasks, and provides security authentication and authorization.

(2) Relational database and data service agent: The relational database records the logical information and professional content of all cluster systems. The data service agent provides access to the relational database and supports adding, deleting, retrieving and modifying the data records for cluster systems.

(3) Format conversion agent module: converts between the format of the user's information resource request document and the document formats used in each cluster system. Because XML tags are user-defined, different users may represent the same data differently, that is, describe the same information resource in different ways. These format differences are captured in the associated DTD or Schema, so after format conversion the format of information resources is transparent to users.

(4) XML document analysis agent module: extracts each tag from the XML document after format conversion and queries the relational database in the grid middleware to establish the correspondence between the user's information request task and the cluster systems. The result is the relevant information and interface parameters of each cluster system that meets the conditions.

(5) Sending agent module: sends the converted information resource request XML document to the corresponding cluster systems.
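The conversion, analysis and sending modules can be sketched as a small pipeline. Everything named here is hypothetical: `TAG_MAP` stands for the differences a DTD or Schema comparison would reveal, the `REGISTRY` dictionary stands in for the middleware's relational database, and sending is reduced to returning (endpoint, document) pairs rather than making network calls.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from one user's tag vocabulary to the cluster vocabulary.
TAG_MAP = {"query": "request", "subject": "domain", "term": "keyword"}

# Stand-in for the relational database: domain -> cluster endpoints.
REGISTRY = {
    "medicine": ["http://a.example/query"],
    "law": ["http://b.example/query"],
}


def convert_format(xml_text):
    """Format conversion agent: rename user tags to cluster tags."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = TAG_MAP.get(elem.tag, elem.tag)
    return root


def analyse(root):
    """XML document analysis agent: extract each tag and its value."""
    return {child.tag: (child.text or "").strip() for child in root}


def route(xml_text):
    """Convert, analyse, look up matching clusters, and queue the sends."""
    root = convert_format(xml_text)
    tags = analyse(root)
    endpoints = REGISTRY.get(tags.get("domain", ""), [])
    payload = ET.tostring(root, encoding="unicode")
    # Sending agent (stubbed): one converted document per target cluster.
    return [(endpoint, payload) for endpoint in endpoints]


user_request = "<query><subject>medicine</subject><term>grid</term></query>"
outbox = route(user_request)
```

The user writes the request in their own vocabulary, and the cluster receives it in the cluster's vocabulary. Neither side needs to know the other's tag names.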

Agent technology is the key to solving distributed intelligent application problems. An agent is an entity that can act autonomously, run within other systems and interact continuously with its environment. Introducing agents makes the system more user-friendly: agents complete tasks on behalf of users, adapt dynamically to changes in the environment, better meet users' needs and improve the system's information retrieval capability.