Ten principles for reproducible data science research

In a paper, a group of researchers described ten rules for reproducible computational research. If you follow these rules, your results should be much easier to reproduce.

All data science is research. The fact that the results are never published in an academic paper does not change what we are doing: trying to extract insight from large amounts of data. Any data scientist doing internal analysis should therefore take the paper's ten rules seriously.

Rule 1: For every result, keep track of how it was produced. Knowing how a conclusion was derived from the raw data lets you defend the result, correct it when an error is found, regenerate it when the data is updated, and submit it for review. If you write your analysis as scripts in a programming language (R, Python, Julia, F#, and so on), the process is explicit, provided you avoid manual steps. If you use a point-and-click tool such as Excel, recording the steps becomes much harder: you have to describe a sequence of manual operations, which are difficult to document and to reproduce.

Rule 2: Avoid manual data manipulation steps. It is tempting to open a data file in an editor and fix formatting errors or delete outliers by hand, and modern operating systems make cutting and pasting between applications easy. Resist the temptation to take these shortcuts: manual data manipulation leaves no trace.

Rule 3: Archive the exact versions of all external programs you use. Ideally, create a virtual machine containing all the software your scripts run on; this gives you a snapshot of the entire analysis ecosystem and makes the results easy to reproduce. This is not always feasible: if you rely on cloud services, or the data set you analyze is very large, the full environment is hard to capture for archiving.
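Even when a full virtual machine is impractical, much of the version information can be captured from within the analysis script itself. A minimal sketch in Python, using only the standard library (the package list passed in is illustrative; record whatever your analysis actually imports):

```python
import importlib
import json
import platform
import sys

def environment_snapshot(package_names):
    """Record OS, interpreter, and package versions as a plain dict."""
    snapshot = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "packages": {},
    }
    for name in package_names:
        try:
            module = importlib.import_module(name)
            # Not every module exposes __version__; note that explicitly.
            snapshot["packages"][name] = getattr(module, "__version__", "unknown")
        except ImportError:
            snapshot["packages"][name] = "not installed"
    return snapshot

# Archive the snapshot next to the results, e.g. as JSON.
snap = environment_snapshot(["csv", "json"])
archived = json.dumps(snap, indent=2)
```

Storing this JSON alongside each set of results satisfies the minimum of Rule 3: the operating system and software versions that produced a result are written down, even if the environment itself cannot be archived.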
Commercial tools can also make such an environment difficult to share with others. At the very least, record the version of every piece of software you use, including the operating system; even a small change in the software can affect the results.

Rule 4: Record the versions of all custom scripts. Use a version control system such as Git, and tag (snapshot) the scripts so that every result you generate can be traced back to a revision. That way, if you later modify a script (and you certainly will), you can still find the exact version that produced a specific result.

Rule 5: Record intermediate results, preferably in a standard format. If you follow Rule 1, you should in theory be able to reproduce any result from the raw data. In practice there are obstacles: you may lack the resources to rerun everything from scratch (for example, a large amount of cluster computing time), lack a license for a commercial tool, or lack the expertise to operate one. In these cases it is wise to start from a data set derived from the raw data. Intermediate data sets in a standard format such as CSV open up more analysis options, and when a mistake is made they make the problematic step easier to pinpoint, so you do not have to start over from the beginning.

Rule 6: For analyses that include randomness, record the underlying random seeds. Data scientists often fail to set seed values for their analyses, and so cannot reproduce their machine-learning experiments exactly. Many machine-learning algorithms contain random components; a strong result may be statistically repeatable, but nothing matches being able to reproduce exactly the numbers someone else generated. If you script your analysis and keep it under source control, you can set the seed in the script itself.

Rule 7: Always save the raw data behind your charts.
If you use a scripting language, charts are usually generated automatically from data that is already stored. If you draw a chart with a tool such as Excel, however, make sure to save the underlying data. This lets you regenerate the chart and examine the numbers behind it more closely.

Rule 8: Generate hierarchical analysis output, allowing layers of increasing detail to be inspected. A data scientist's job is to summarize data in some form and extract insight from it. But summaries are easily misread, so interested parties should be able to break a summary down into the underlying data points. Every summary result should be linked to the data used to compute it.

Rule 9: Connect textual statements to the underlying results. In the end, the findings of a data analysis are presented in words, and words are imprecise; the connection between a conclusion and the analysis behind it can be hard to pin down. Because the write-up is often the most influential part of a piece of research, it is important to link it to the results, and, by Rule 1, to the raw data. One way to do this is to add footnotes to the text, where each footnoted file or URL contains the specific data the statement refers to. If you cannot make this connection, you probably have not recorded all the steps.

Rule 10: Provide public access to scripts, runs, and results. In a business setting it may not be appropriate to expose all data publicly, but sharing it with others in the organization is usually fine. Cloud-based source control systems such as Bitbucket and GitHub allow private repositories that any authorized colleague can access. Scrutiny improves the quality of analysis, so the more you share, the better your analysis is likely to become.
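Rule 6 can be as simple as constructing the random generator with an explicit seed. A sketch using Python's standard library (the seed value is arbitrary; the point is that it lives in the script, under version control):

```python
import random

SEED = 20240101  # arbitrary choice, but recorded in the script

def noisy_sample(n, seed=SEED):
    """Draw n pseudo-random values from a generator with a fixed seed."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Two runs with the same seed produce identical "random" results.
assert noisy_sample(5) == noisy_sample(5)
```

Using a local `random.Random(seed)` instance rather than the module-level global also keeps the analysis reproducible when other code draws from the same generator.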
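For Rule 7, one habit worth scripting is to write the plotted points to a standard format before (or instead of) embedding them in a chart. A sketch using the standard csv module, with `io.StringIO` standing in for a real file:

```python
import csv
import io

def save_plot_data(points, stream):
    """Persist (x, y) pairs as CSV so the chart can be regenerated later."""
    writer = csv.writer(stream)
    writer.writerow(["x", "y"])  # column headers
    writer.writerows(points)

buffer = io.StringIO()
save_plot_data([(1, 2.5), (2, 3.1), (3, 2.8)], buffer)
header = buffer.getvalue().splitlines()[0]  # "x,y"
```

Anyone who later questions the chart can reload this CSV and inspect or re-plot the exact numbers behind it.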
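The linkage Rule 8 asks for, between a summary and its data points, can be kept in the output structure itself. A sketch in Python (the field names `region` and `sales` are hypothetical):

```python
from collections import defaultdict

def summarize(records, group_key, value_key):
    """Group records and keep each mean attached to the rows behind it."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record)
    return {
        group: {
            "mean": sum(r[value_key] for r in rows) / len(rows),
            "rows": rows,  # the drill-down: data points behind the summary
        }
        for group, rows in groups.items()
    }

data = [
    {"region": "north", "sales": 10},
    {"region": "north", "sales": 14},
    {"region": "south", "sales": 9},
]
report = summarize(data, "region", "sales")
# report["north"]["mean"] is 12.0, with both source rows still attached
```

A reader who doubts the "north" average can decompose it into the two records it came from, rather than taking the summary on faith.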