With the rapid development of the Internet worldwide, the tension between the enormous volume of digital information online and people's ability to find what they need has become increasingly prominent. Studying network information retrieval technology and its development trends is therefore an urgent and practical task. This paper analyzes the basic principles, techniques and tools of network information retrieval, reviews its current state, and predicts its development trends. The aim is to find effective ways to improve retrieval methods and, ultimately, retrieval effectiveness, so that network information resources can be used fully and efficiently.
The full text consists of six main parts:
The first part is an overview of network information retrieval. It sets out the related concepts, such as information retrieval technology, the characteristics of network information retrieval, and the evaluation of retrieval effectiveness.
The second part focuses on the basic technologies of network information retrieval, such as information push-pull technology, data mining technology, information filtering technology and natural language processing technology. It aims to clarify the technical foundations of network information retrieval and to prepare the ground for predicting its development trends.
The third part describes the search engine, the most important tool of network information retrieval, and analyzes the retrieval characteristics and functions of different types of search engines in terms of their retrieval mechanisms. Its distinctive contribution is a comprehensive summary of the basic functions of search engines and a systematic classification of currently popular search engines. ...
The fourth part analyzes and discusses another branch of retrieval technology: content-based retrieval.
The fifth part analyzes the limitations of network information retrieval tools, mainly with respect to text retrieval and multimedia retrieval.
Finally, I converted it to .txt and am posting the text below:
1.1 Network information resources
Network information resources refer to "all kinds of information resources available through the Internet".
With the rapid development of the Internet, online information resources have grown exponentially. As a new kind of information resource they play an increasingly important role. Their content is almost all-encompassing, covering politics, economics, culture, science, entertainment and other areas; their media forms are varied, including text, graphics, images, sound and video; and their scope spans the social sciences, natural sciences, humanities and engineering.
1.2 Information retrieval technology
Information retrieval technology is one of the key technologies of the modern information society. Information retrieval refers to the process and techniques of organizing and storing information in a certain way and then finding the required information according to users' information needs; its full name is therefore "information storage and retrieval". In the narrow sense, information retrieval refers only to the process of finding the required information in an information collection, that is, using the retrieval tools of an information system to find the requested information. People obtain information sources mainly in the following ways: ① Traditional retrieval, widely used for vast library holdings, in which the index number of a document is found by manual index lookup and the original document is then obtained; ② Online information retrieval. This has also gone through a process of development, from results that offered only secondary information such as catalogue entries and abstracts to direct access to the electronic full text. In terms of retrieval method, it has developed from conventional access points, such as specific keywords or auxiliary information like authors and institutions, to full-text search on any word in the original document. Of these, full-text retrieval has developed rapidly in recent years because of the originality of the information it covers, the thoroughness of retrieval and the naturalness of its query language. It has become a very effective information retrieval technology and has attracted wide attention. It is the most effective means of accurately locating the required information in large-capacity document archives [3].
1.3 Network information retrieval
Network information retrieval is carried out in two ways: browsing and search engines.
(1) Browsing mode (browsing systems). Anyone with Internet access can use a browser and the WWW service provided over the HTTP protocol to browse Web pages and, through the retrieval facilities those pages provide, access databases.
(2) Search engines. A search engine is a website on the Internet that provides a search service on the Web. It uses certain techniques and strategies to collect and discover information on the Internet, then interprets, extracts and processes that information to build a database, and offers users a retrieval interface in the form of a Web page. Acting on the keywords, phrases or other search terms that users enter, it finds the records in its database that match the query, returns the results and outputs them in order of relevance, so that information can be found quickly. The information resources handled by search engines mainly include information on World Wide Web servers, as well as e-mail and newsgroup information. The purpose of a search engine service is to satisfy users' information needs, so it is user-oriented and interactive.
Network information retrieval tools gather their data either through active submission or through automatic searching.
1.4 Evaluation of network information retrieval effect
At present, the accepted criteria for evaluating retrieval effectiveness are recall, precision, coverage and output format, of which recall and precision are the most important.
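To make the two most important criteria concrete, here is a minimal sketch using the standard definitions: recall is the share of all relevant documents that were retrieved, and precision is the share of retrieved documents that are relevant. The function name and the document ids are illustrative assumptions, not taken from the thesis.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    retrieved: document ids returned by the retrieval tool.
    relevant:  document ids that are actually relevant to the query.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were found
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical query: 4 documents returned, 5 relevant ones in the collection.
r, p = recall_and_precision(["d1", "d2", "d3", "d4"],
                            ["d1", "d2", "d5", "d6", "d7"])
print(f"recall={r:.2f} precision={p:.2f}")  # recall=0.40 precision=0.50
```

In practice the set of relevant documents on the Web is not known in full, so recall can usually only be estimated.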
The development of modern information science and technology has given people a variety of ways and techniques for acquiring and transmitting information. In terms of the relationship between the information "source" and the "user", these can be divided into two modes: "information push" and "information pull". In the information push mode, the "source" actively pushes information to the "user", as in radio broadcasting; in the information pull mode, the "user" actively pulls information from the "source", as in querying a database.
2.2.1 Information push technology
"Push" mode network information service is a new form of service that arose in the network environment: information service providers use "push" technology to deliver information services to specific users over the Internet. Push technology has become an emerging Internet technology because it gives network information service tools initiative. It can not only push the information users are interested in directly to them, but also make effective use of network resources and improve network throughput. In addition, push technology allows transparent communication between users and the servers that provide the information, which is a great convenience for users.
Push technology, also known as Webcasting technology, is essentially a kind of software on the Internet that automatically collects the information users are most likely to be interested in, according to criteria the users define, and then delivers it at an appropriate time to a "location" specified by the user. Technically, therefore, a "push" mode network information service is an intelligent set of software services that provides information automatically. It can not only understand and discover users' interests (the topics they may care about), but also actively search for information on the Internet and, after screening and sorting it, actively push it to each user according to that user's specific needs [4].
(1) Information push modes. There are two kinds of information push: webcast push and intelligent push.
Webcast modes include: channel push, a widely used mode at present, in which certain pages are defined as channels in the browser and users receive the webcast information they are interested in just as they would choose a TV channel; e-mail push, in which information such as international conference notices and product advertisements is actively delivered to users by e-mail; Web-page push, in which the information is published to users on a specific Web page, such as the page of an enterprise, an institution or an individual; and dedicated push, in which the source uses special-purpose sending and receiving software to push information to particular users, as in secure peer-to-peer communication.
Intelligent push methods include: operation push (client push), in which the push is started by a client's data operation: when a client modifies data, the push process starts once the new data has been stored in the database and pushes the new data to other clients; and trigger push (server push), in which triggers in the database start the push process and push new data to other clients. Whenever the data changes, for example through an insert, delete or update, the trigger fires and starts the information push process.
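The trigger push idea described above can be sketched in a few lines. This is a simplified in-memory illustration, not the thesis's implementation: the class `PushStore`, its methods and the callback signature are hypothetical names, and a real system would rely on database triggers and network connections instead of Python callbacks.

```python
class PushStore:
    """Toy data store that pushes every change to subscribed clients."""

    def __init__(self):
        self._rows = {}
        self._subscribers = []  # callbacks standing in for connected clients

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def _trigger(self, action, key, value):
        # Plays the role of the database trigger: runs after each change
        # and pushes the changed data to every subscriber.
        for callback in self._subscribers:
            callback(action, key, value)

    def insert(self, key, value):
        self._rows[key] = value
        self._trigger("insert", key, value)

    def update(self, key, value):
        if key not in self._rows:
            raise KeyError(key)
        self._rows[key] = value
        self._trigger("update", key, value)

    def delete(self, key):
        value = self._rows.pop(key)
        self._trigger("delete", key, value)


received = []  # what one hypothetical client has been pushed so far
store = PushStore()
store.subscribe(lambda action, key, value: received.append((action, key, value)))
store.insert("notice", "conference call for papers")
store.update("notice", "deadline extended")
store.delete("notice")
print(received)
```

The client never asks for anything: each insert, update or delete on the store is what starts the push.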
(2) Characteristics of information push. Information push is characterized by initiative, targeting, intelligence, efficiency, flexibility and comprehensiveness [5].
Initiative. The core of push technology is that the service provider takes the initiative to transmit data to the client. Initiative is therefore one of the most basic characteristics of "push" mode network information service, in sharp contrast to the passive service of the browser-based "pull" mode.
Targeting (personalization). Targeting means that push technology can search for, process and push information in response to users' specific information needs, and can provide users with a customized search interface based on those needs.
Intelligence. The push server can automatically collect the information users are interested in, according to their requirements, and push it to them. The "client agent" in push technology can even search predefined sites on a schedule, collect updated information and send it back to the user. Meanwhile, to improve the accuracy of "push", personal information service agents and subject search agents can control the depth of the search and filter out unneeded information, and they work with the client agent to identify a Web site's resource list and its update status. "Push" information services in the network environment are therefore highly intelligent, something the traditional selective dissemination of information (SDI) service cannot match.
Efficiency. Efficiency is another important feature of "push" information services in the network environment. Push transfers can be started when the network is idle, which makes effective use of network bandwidth and is better suited to transmitting large volumes of multimedia information.
Flexibility. Flexibility means that users can, at their own convenience and according to their needs, flexibly configure how they connect and receive specific information resources on the Internet, whether through e-mail, dialog boxes, audio or video.
Comprehensiveness. Implementing a "push" mode network information service requires not only information technology equipment but also the integration of search software, classification and indexing software and other technologies [6].
However, at the current stage of information technology development, "push" technology still has serious shortcomings: it cannot guarantee delivery, it has no state tracking, it lacks group management functions, and so on. Researchers at home and abroad have therefore proposed "super-push" technology. Super-push is a newer push technology that retains and builds on the advantages of push (active delivery and personalized customization) while discarding many of its disadvantages. Its most important feature is guaranteed delivery: every piece of information is delivered to a specific user at a specific time, and continuous user records are maintained, so that it is always possible to know who received the information, whether the information was customized for the user, whether the user's environment is suitable, and so on [7].
2.2.2 Information retrieval technology
A common and typical information retrieval ("pull") technology is the database query, in which users actively query a database and extract the information they need from it. Its main advantage is that it is well targeted: users can query and search purposefully for the information they need, according to their own requirements.
Information retrieval technology on the Internet can be regarded as an extension of database query technology. On the Internet, users face not a single database but an environment containing a massive amount of information. As a result, the search engine, an auxiliary tool for pulling (querying) all kinds of network information, came into being. Information push and information pull each have their own characteristics, and in practice they are often combined. The commonly used combinations are:
(1) "Push first, then pull". The latest information (dynamic updates) is pushed promptly, and the user then pulls the information needed in a targeted way. This makes it easy for users to follow new developments and trends in the information and to choose dynamically which information to explore in depth.
(2) "Pull first, then push". The user first pulls the required information, and other related information is then pushed in a targeted way according to the user's interests.
(3) "Pull within push". During information push, users may interrupt at any time and freeze on a Web page of interest, search further and actively pull more information.
(4) "Push within pull". While the information pulled by the user is being searched, the information source actively pushes related and up-to-date information based on the keywords the user entered. This not only serves users in a timely and targeted way, but also reduces the network load and widens the range of users [8].
The combination of information push and information pull is therefore one direction of development for today's Internet, database systems and other information systems in providing users with active information services.
2.3 Web mining technology
With its development, the Internet has become a public information source for human society. It has brought people unprecedented access to information, but it has also made the human information environment more complicated. The problem of how to use information has not been solved satisfactorily by the development of information technology, as had been expected. On the contrary, as information technology has developed, the surge of information has created a contradiction between the amount of information an individual actually needs and the massive amount of information on the Web, which makes it harder for individuals to use information. Although the Web environment has dedicated retrieval tools, search engines were developed from traditional retrieval technology, and with user demands continually rising, traditional retrieval technology can no longer meet people's needs. Web mining, as a new method of knowledge discovery, offers a new way to make more effective use of network information resources.
2.3.1 Content of Web mining
Data mining is the process of extracting potentially useful information and knowledge, not known in advance, from large amounts of incomplete, noisy, fuzzy and random data. Web mining is the extraction of useful patterns and hidden information from the WWW and its related resources and behaviors. Here, the WWW and its related resources are the Web documents on the WWW, the log files on Web servers and user data. As this definition shows, Web mining is essentially a means of knowledge discovery, and it is carried out mainly in the following three areas.
(1) Web content mining. Web content mining extracts knowledge from Web data in order to automate the retrieval of Web resources and make the use of Web data more efficient. As the Internet develops, Web data grows ever larger and more varied. In form it includes text as well as multimedia such as images, audio and video; it includes structured data from databases, semi-structured data marked up with HTML tags, and unstructured free text. Web content mining is therefore carried out mainly from the following two perspectives [11].
First, from the perspective of information retrieval, the research mainly concerns how to handle text and hyperlinked documents, which are unstructured or semi-structured. Unstructured data is generally handled with the word-set method: the unstructured text is represented as a set of words, preprocessed with information evaluation techniques, and then represented with an appropriate model. Text can also be represented using the maximum character-sequence length, segmentation, concept classification, machine learning and natural-language statistics. For semi-structured data, relevant algorithms can be used to classify hyperlinks, identify the relationships between Web pages and extract rules. Compared with unstructured data, semi-structured data adds HTML tag information and the hyperlink structure inside Web documents, which enriches the ways in which it can be represented.
Second, from the database perspective, the work mainly concerns structured Web databases, that is, hyperlinked documents; the data is mostly represented by weighted graphs, the Object Exchange Model (OEM) or relational databases. By applying certain algorithms, the internal relationships between Web pages can be discovered; the main purpose is to infer the structure of a Web site, or to turn the Web into a database so that information can be managed and queried better. Web database management is generally divided into three areas: first, modeling the Web and studying advanced Web query languages, so that querying is not limited to keywords; second, information integration and extraction, in which each Web site and its wrapper program are treated as a Web data source and multiple sources are integrated through a Web data warehouse or a virtual Web database; third, the creation and restructuring of Web sites, in which research on Web query languages is used to build and maintain sites.
(2) Web structure mining. Web structure mining mainly refers to analyzing Web documents to obtain useful patterns from the organizational structure between documents. Whereas content mining studies the relationships within documents, Web structure mining focuses on the hyperlink structure of Web sites. Once a model of the page link structure has been found, it can be used to reclassify Web pages or to find similar Web sites.
The data used in Web structure mining is Web structural data. Structural data describes how the content of a Web page is organized: the structure within a page can be represented as a tree of HTML tags, while the structure between pages can be represented by the hyperlinks connecting different pages. Links between documents reflect relationships between them, such as subordinate and parallel relationships or citing and cited relationships. By classifying the hyperlinks on Web pages, the attribute relationships between pages can be judged and identified. Because Web pages contain more or less structural information, studying the internal structure of Web pages makes it possible to find other pages related to the set of pages the user has selected, and to check the completeness of the information on a Web site.
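As a small illustration of how hyperlinks between pages can be analyzed, the sketch below counts in-links per page (how often a page is "cited") and co-citation counts (how many pages link to both members of a pair, one simple signal that two pages are related). The link data and function names are hypothetical; co-citation is chosen here as a familiar example and is not a method the thesis names.

```python
from collections import defaultdict
from itertools import combinations


def inlink_counts(links):
    """links: dict mapping a page to the list of pages it links to."""
    counts = defaultdict(int)
    for targets in links.values():
        for target in set(targets):  # ignore repeated links on one page
            counts[target] += 1
    return dict(counts)


def cocitation(links):
    """Count, for each pair of pages, how many pages link to both."""
    pairs = defaultdict(int)
    for targets in links.values():
        for a, b in combinations(sorted(set(targets)), 2):
            pairs[(a, b)] += 1
    return dict(pairs)


# Hypothetical site: three hub pages pointing at content pages a, b and c.
links = {
    "index": ["a", "b", "c"],
    "topics": ["a", "b"],
    "news": ["b", "c"],
}
print(inlink_counts(links))  # {'a': 2, 'b': 3, 'c': 2}
print(cocitation(links))
```

Pairs with a high co-citation count are candidates for "similar pages"; a real system would build `links` by parsing the anchor tags of crawled pages.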
(3) Web user behavior mining. Web user behavior mining mainly analyzes Web server log files and user information in order to obtain useful patterns about users. The data it mines are chiefly the user behavior patterns contained in Web logs, including retrieval time, search terms, search paths, search results and which results were browsed. Because the Web is heterogeneous, distributed, dynamic and lacks a uniform structure, mining content on the Web is difficult and depends on breakthroughs in artificial intelligence and natural language understanding. Fortunately, the log files of Web servers are well structured. When a user visits a Web site, information about the visit, such as the page, the time and the user ID, is recorded in the log, which makes log mining feasible and meaningful. In practice, the log data is generally first mapped into relational form and preprocessed, including removing information irrelevant to mining. To improve performance, the methods currently used for mining log data include path analysis, association rules, pattern discovery and cluster analysis. To improve accuracy, behavior mining also draws on Web site structure information and page content information.
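Path analysis, one of the log mining methods listed above, can be illustrated with a very small sketch that counts how often users move from one page to the next. It assumes the log has already been cleaned and reduced to (user, time, page) records; the record layout, the sample data and the function name are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def transition_counts(records):
    """records: iterable of (user_id, timestamp, page) tuples.

    Returns a Counter of (from_page, to_page) pairs taken over each
    user's visits in time order.
    """
    by_user = defaultdict(list)
    for user, timestamp, page in records:
        by_user[user].append((timestamp, page))

    counts = Counter()
    for visits in by_user.values():
        pages = [page for _, page in sorted(visits)]
        counts.update(zip(pages, pages[1:]))  # consecutive page pairs
    return counts


# Hypothetical, already cleaned log records (timestamps in seconds).
log = [
    ("u1", 10, "/home"), ("u1", 25, "/search"), ("u1", 40, "/doc/7"),
    ("u2", 12, "/home"), ("u2", 30, "/search"), ("u2", 55, "/doc/9"),
]
print(transition_counts(log).most_common(1))  # [(('/home', '/search'), 2)]
```

Frequent transitions such as the one printed here are the raw material for finding common search paths.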
2.3.2 Application of Web mining technology in network information retrieval
(1) Application of Web content mining in retrieval. Web content mining is the process of acquiring knowledge from document content and its descriptions. Because traditional information retrieval technology does not process Web documents deeply enough, Web content mining can be used in network information retrieval to improve the processing of Web documents, specifically in the following respects.
① Text summarization. Text summarization extracts the key information from a document and presents a summary or representation of the Web document in concise form. By browsing this key information, users can get a general idea of what a Web page contains, judge its relevance and decide whether to select it.
② Text classification. In content mining, text classification means using a computer to assign each document in a collection automatically to one of a set of predefined topic categories. Its value for network information retrieval is that it narrows the search scope and can greatly improve precision. Many text classification techniques now exist, such as the TF-IDF algorithm. Because text mining and search engines deal with almost exactly the same kind of text, text classification can be applied directly to the automatic classification done by search engines: classifying large numbers of pages automatically, quickly and effectively improves the precision of document retrieval.
③ Text clustering. Text clustering is the reverse of text classification. It divides the documents in a collection into smaller clusters, such that documents within a cluster are as similar as possible and different clusters are as unrelated as possible. The clusters correspond to the categories of a classification scheme. Because text clustering does not require topic categories to be defined in advance, a search engine's categories can adapt to the information it has collected. Compared with manual classification, text clustering is faster and more objective. It can also be combined with text classification to make information processing more convenient, for example by classifying search results and grouping similar results together.
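The idea of grouping similar search results can be sketched with TF-IDF term weights and cosine similarity. This is a deliberately simple single-pass grouping under stated assumptions: documents are already tokenized, the similarity threshold of 0.1 is arbitrary, and all function names are hypothetical. It is not the clustering algorithm of any particular search engine.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))  # document frequency
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def group_similar(docs, threshold=0.1):
    """Put each document in the first cluster whose first member is
    similar enough; otherwise start a new cluster."""
    vectors = tfidf_vectors(docs)
    clusters = []  # each cluster is a list of document indexes
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Hypothetical search results, already split into words.
results = [
    "web search engine ranking".split(),
    "search engine index web".split(),
    "image video retrieval".split(),
    "video image features".split(),
]
print(group_similar(results))  # [[0, 1], [2, 3]]
```

No categories are defined in advance: the two groups emerge from the shared vocabulary, which is the property of clustering the text describes.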
(2) Application of Web structure mining in network information retrieval. The Web organizes information in a non-flat structure; generally speaking, information on the Web is organized by content. Because this structural information is difficult to process, search engines generally ignore it and treat Web pages as flat text. By mining the organizational structure of Web documents, however, a search engine can extend its retrieval capability and improve retrieval effectiveness [13].
(3) Application of Web user behavior mining in network information retrieval. Web behavior mining discovers and summarizes patterns in users' retrieval behavior, which has always been an important topic in information retrieval research. Web behavior mining can reveal not only the latent behavior patterns shared by most users but also the personalized behavior of individual users. Studying these patterns provides better feedback on a search engine's retrieval effectiveness, so that search strategies can be refined and retrieval improved.
2.3.3 Limitations and development directions of Web mining technology
(1) Web content mining. Whether data on the Web is expressed in HTML or XML, markup cannot completely solve the problem that Web data is unstructured. In Chinese in particular, sentence patterns vary widely and there is no absolute boundary between function words and content words, so word segmentation is difficult and data cannot be labeled automatically. Information technologies such as the data warehouse therefore need to be combined with Web content mining to store the information, and ultimately to achieve intelligent, automatic representation and indexing of data for retrieval. The representation of data and the ways in which it is used are usually interrelated, so designing mining algorithms with high recall and precision for a given representation is, like data representation itself, one of the future directions. In addition, how to recognize, classify and index multimedia data is a further difficulty and direction for future research on Web content mining.
(2) Web structure mining. With the rapid development of the Internet, Web sites are becoming richer in content and more complex in structure. Representing the link structure of a very large site as a directed graph will no longer meet data-processing needs, so new data structures are required to represent site structure. Moreover, the only user usage information available for comparative analysis to find problems is the log stream. How to identify each link relationship in the log stream, what structure to use to represent it, and how to extract useful patterns are therefore important research questions not only for Web behavior mining but also for Web structure mining.
(3) Web user behavior mining. Because the Internet transport protocol HTTP is stateless and proxy servers and clients keep caches, user access logs are spread across servers, proxy servers and clients. The greatest difficulty in learning user access patterns from Web access logs is therefore how to preprocess logs distributed in different locations so as to form, for each user, the set of requests that make up one visit (a session). For static Web sites, server-side logs are generally easy to obtain, whereas the access logs on clients and proxy servers are not. Secondly, because a complete Web page is made up of individual images and frame pages, and users access the server concurrently, determining what a user accessed requires picking out from the server log the page the user actually requested and its main content. In addition, existing data mining algorithms were developed mainly for large volumes of transaction data, so their structure must also be redesigned to handle massive Web user access logs [14].
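The preprocessing step described above, turning raw log records into one visit (session) per user, can be sketched as follows. The sketch makes several assumptions that the thesis does not specify: users have already been identified, a gap of more than 30 minutes starts a new session (a commonly used heuristic), and requests for images, stylesheets and scripts are treated as page components and skipped. All names are hypothetical.

```python
from collections import defaultdict

# Assumption: requests ending in these suffixes are page components, not pages.
COMPONENT_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")


def build_sessions(records, timeout=1800):
    """records: iterable of (user_id, timestamp_seconds, url).

    Returns {user_id: [session, ...]}, where each session is the list of
    pages requested without a gap longer than `timeout` seconds.
    """
    by_user = defaultdict(list)
    for user, timestamp, url in records:
        if url.lower().endswith(COMPONENT_SUFFIXES):
            continue  # skip images and other embedded resources
        by_user[user].append((timestamp, url))

    sessions = {}
    for user, visits in by_user.items():
        visits.sort()
        user_sessions, current, last_time = [], [], None
        for timestamp, url in visits:
            if last_time is not None and timestamp - last_time > timeout:
                user_sessions.append(current)  # gap too long: close session
                current = []
            current.append(url)
            last_time = timestamp
        user_sessions.append(current)
        sessions[user] = user_sessions
    return sessions


# Hypothetical merged log from several sources.
merged_log = [
    ("u1", 0, "/home.html"), ("u1", 1, "/logo.gif"),
    ("u1", 600, "/search.html"), ("u1", 5000, "/home.html"),
    ("u2", 100, "/doc/3.html"),
]
print(build_sessions(merged_log))
```

The hard parts the text points to, identifying the same user behind proxies and caches and merging logs from different locations, are outside this sketch and are assumed to have been done before `build_sessions` is called.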
2.4 Information filtering technology
The open environment of the Internet makes it very convenient for people to retrieve and use information, but at the same time it makes it hard to find the needed information promptly and accurately. There are three reasons. First, information sources in the network environment are complex, diverse and uncontrolled: any person or organization can publish information online, whatever the circumstances or motives, and the creation and dissemination of information is not screened or reviewed, so the reliability, quality and value of information have become users' main concerns. Second, most current search tools aim at comprehensive coverage: their robots try to fetch every kind of Web page, process it superficially and store it in a database for searching. Third, the retrieval methods that search engines offer users directly are mostly keyword-based Boolean matching, which returns every document containing the keywords, so the number of results far exceeds what users can absorb and use, leaving them at a loss. This is the phenomenon often called "information overload" or "information flooding". It is against this background that information filtering technology began to attract attention. Its purpose is to give search engines more "intelligence" and involve them more deeply and closely in the user's whole retrieval process, from choosing keywords and determining the search scope to refining the results, helping users find the information truly relevant to their needs within the vast amount available.
2.4.1 Information filtering model
Information filtering is in essence still an information retrieval technology, so it still depends on some information retrieval model. Different retrieval models give rise to different filtering methods [15].
(1) Filtering with the Boolean logic model. The Boolean model is a simple retrieval model. Searching is based only on whether a document contains the keywords, so there is no need to process the content of Web pages in depth. The simplest keyword table can be designed with just three fields: the keyword, the number of the document containing the keyword, and the number of times the keyword appears in that document. When searching, users submit keywords.
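The three-field keyword table described above lends itself to a short sketch. The code builds rows of (keyword, document number, occurrence count) from a few hypothetical documents and answers Boolean AND, OR and NOT queries by set operations on the matching document numbers. The function names and sample texts are assumptions, and splitting on whitespace stands in for real keyword extraction.

```python
from collections import Counter


def build_keyword_table(docs):
    """docs: dict mapping document number to text.

    Returns rows of (keyword, doc_id, count): the three fields of the table.
    """
    rows = []
    for doc_id, text in docs.items():
        for keyword, count in Counter(text.lower().split()).items():
            rows.append((keyword, doc_id, count))
    return rows


def docs_with(rows, keyword):
    """Document numbers whose rows contain the keyword."""
    return {doc_id for kw, doc_id, _ in rows if kw == keyword.lower()}


def boolean_and(rows, *keywords):
    sets = [docs_with(rows, kw) for kw in keywords]
    return set.intersection(*sets) if sets else set()


def boolean_or(rows, *keywords):
    return set().union(*(docs_with(rows, kw) for kw in keywords))


def boolean_not(rows, all_ids, keyword):
    return set(all_ids) - docs_with(rows, keyword)


# Hypothetical documents.
docs = {
    1: "network information retrieval",
    2: "information filtering model",
    3: "network search engine",
}
rows = build_keyword_table(docs)
print(boolean_and(rows, "network", "information"))  # {1}
print(boolean_or(rows, "filtering", "engine"))      # {2, 3}
print(boolean_not(rows, docs, "network"))           # {2}
```

Matching looks only at whether a keyword is present, which is why the Boolean model needs no deeper analysis of page content; the count field is stored but not used by these queries.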
……………………………………
It's too long to post in full. I hope this is useful to you; if it really isn't enough, contact me (leave me a message) and I will send it to your email.