Current location - Education and Training Encyclopedia - Graduation thesis - How to use python or R to capture the hidden source code of a web page?
How to use python or R to capture the hidden source code of a web page?
Hide source code? I want to know what you mean. I have two understandings. First, instead of what is shown in the previous paragraph, you should check the source code from time to time. Second, the content loaded asynchronously is invisible in the front end and source code. The first one is easy to solve. You must mean the second one. There are three solutions:

Simulated browser, dynamic access, can use the killer selenium tool.

Using this method, you can grab anything you can see, such as mouse sliding, asynchronous loading and so on. Because it can behave exactly like a browser, but this method is the least efficient and is generally not recommended unless it is absolutely necessary.

Execute js code

Execute asynchronous loading js code in python, and get some things such as mouse sliding and drop-down loading. However, there are a lot of js codes in the current website, so it is very difficult and time-consuming to find the target js code to be executed. In addition, python and js are not very compatible and are not recommended.

Finding json files loaded asynchronously is the most common, convenient and best method. This is the most commonly used method when I usually grab dynamic asynchronous loading websites, which can solve 99% of my problems. The specific use method is to open the developer tool of the browser, enter the network option, and then reload the webpage to find the json file that needs to be loaded dynamically and asynchronously in the list in the network. Take JD.COM as an example, as shown in the figure, the first json file of inventory information loaded asynchronously and the second json file of comment information loaded asynchronously were found:

Specific and more detailed methods can be Google or Baidu.