For example, according to system structure and implementation technology, crawlers can be divided into general web crawlers (which crawl all reachable content on the web, regardless of priority), focused web crawlers (which crawl only pages related to preset topics), incremental web crawlers (which crawl only new or changed pages), and deep web crawlers (which visit deep pages behind forms or logins). The crawlers we usually encounter are likewise used to grab data. Such a crawler actually does two things:
1. Get the source code of the web page;
2. Parse the source code and extract the required data.

Many anti-crawler techniques target the first task: they try to prevent you from obtaining the source code through a crawler. Once you have the source code, there are many ways to parse it and extract the data. It is fair to say that by the time you have the source code, most of the crawler's work is done.
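The two steps above can be sketched in a few lines of Python. To keep the sketch self-contained, step 1 (fetching) is replaced by a hard-coded page; the URL, page layout, and the `item` class name are all hypothetical, and a real crawler would download the page with `urllib.request` or a similar library.

```python
from html.parser import HTMLParser

# Step 1: get the page source. In a real crawler this would be a network
# download, e.g. urllib.request.urlopen(url).read(); here we hard-code a
# small page so the example runs without network access.
PAGE_SOURCE = """
<html><body>
  <h1>Product list</h1>
  <ul>
    <li class="item">Widget A</li>
    <li class="item">Widget B</li>
  </ul>
</body></html>
"""

class ItemParser(HTMLParser):
    """Step 2: extract the text of every <li class="item"> element."""

    def __init__(self):
        super().__init__()
        self.in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "item") in attrs:
            self.in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_item = False

    def handle_data(self, data):
        if self.in_item and data.strip():
            self.items.append(data.strip())

parser = ItemParser()
parser.feed(PAGE_SOURCE)
print(parser.items)  # → ['Widget A', 'Widget B']
```

In practice most Python crawlers use a third-party parser such as Beautiful Soup instead of hand-writing an `HTMLParser` subclass, but the division of labor is the same: fetch first, then parse and extract.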
How to improve the efficiency of a web crawler
1. Increase the crawler's crawling frequency and handle the verification some websites use. The verification a website adopts is usually a CAPTCHA or a requirement that the user log in.
2. Make the crawler multithreaded, and make sure the machine has enough memory. You should also use proxy IPs, and the proxies should be stable and online. This is a good way to improve efficiency.
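The multithreading idea from point 2 can be sketched with Python's standard `concurrent.futures` thread pool. Because downloading is I/O-bound, a thread pool lets many requests overlap instead of waiting for each response in turn. The URLs and the proxy address below are placeholders, not real endpoints, and the `fetch` function stands in for an actual download.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical page list; a real crawler would use its target URLs.
URLS = [f"https://example.com/page/{i}" for i in range(8)]

# Proxy configuration in the mapping form urllib and requests expect.
# The address is a placeholder, not a working proxy.
PROXIES = {"http": "http://127.0.0.1:8080", "https": "http://127.0.0.1:8080"}

def fetch(url):
    # Stand-in for a real download, which might look like:
    #   requests.get(url, proxies=PROXIES, timeout=10).text
    # Returning a fake page keeps the sketch runnable without network access.
    return f"<html>source of {url}</html>"

# Four worker threads download pages concurrently; map() preserves the
# order of URLS in the returned results.
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch, URLS))

print(len(pages))  # → 8
```

In a real crawler the worker count should stay modest and requests should be throttled: too many concurrent requests to one site is exactly the behavior that triggers its anti-crawler verification.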
Legal basis:
Civil Code of the People's Republic of China
Article 110
Natural persons enjoy the rights to life, body, health, name, portrait, reputation, honor, privacy, and marital autonomy. Legal persons and unincorporated organizations enjoy the rights to name, reputation, and honor.