Time: 2019-02-01, 3:30 p.m.
Lecturer: Liu
Venue: Audio-visual classroom on the fourth floor
Data cleaning is a key step in data governance. It is the process of reviewing, verifying and processing the raw data that has been collected (also known as "dirty data") in order to delete duplicate information, correct erroneous information and keep the data consistent.
Generally speaking, data cleaning consists of three tasks: error correction, null removal and deduplication.
First, error correction.
For a data table containing name, ID number and license plate number, a record is treated as erroneous if it matches any of the following rules:
1. The license plate number contains neither "gan" nor "rao".
2. The first two digits of the year in the ID number are neither 19 nor 20, or the month in the ID number is greater than 12, or the day in the ID number is greater than 31.
3. The ID number is not 18 characters long.
4. The name is at most 1 character long.
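The four rules above can be sketched as a small check function. This is only an illustration under stated assumptions: the function name `rule_violations` is hypothetical, the plate markers are matched as the literal strings "gan" and "rao" as written in this text, and the birth date is assumed to occupy characters 7 to 14 of the ID number in YYYYMMDD form. The sample records are made up.

```python
# Assumption: the plate markers are stored exactly as written in the text.
PLATE_MARKERS = ("gan", "rao")


def rule_violations(name, id_number, plate_number):
    """Return the numbers of the error correction rules that a record matches."""
    violations = []

    # Rule 1: the plate contains none of the expected markers.
    if not any(marker in plate_number for marker in PLATE_MARKERS):
        violations.append(1)

    if len(id_number) != 18:
        # Rule 3: the ID number is not 18 characters long.
        violations.append(3)
    else:
        # Rule 2 (assumed layout): characters 7-14 hold the birth date.
        century = id_number[6:8]
        month = id_number[10:12]
        day = id_number[12:14]
        # Non-numeric month or day is also flagged; this goes slightly
        # beyond the rule as stated, to avoid a conversion error.
        bad_date = (
            century not in ("19", "20")
            or not month.isdecimal()
            or not day.isdecimal()
            or int(month) > 12
            or int(day) > 31
        )
        if bad_date:
            violations.append(2)

    # Rule 4: the name is at most one character long.
    if len(name) <= 1:
        violations.append(4)

    return violations


# Made-up sample records, for illustration only.
clean = rule_violations("Zhang San", "110101199003071234", "ganA12345")
several = rule_violations("L", "11010118901307123", "B12345")
bad_year = rule_violations("Li Si", "110101189013071234", "raoB0001")
```

An empty list means the record passed every rule. Because the date can only be located in an ID number of the right length, rule 2 is checked only when rule 3 does not apply.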
Second, null removal.
Key data must not be empty. For such data, query the table to check whether any null values are present.
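A null check of this kind might look like the following sketch, which uses Python's built-in `sqlite3` module so that it can run on its own. The table name `owner_info`, its column names, the helper `find_rows_with_nulls` and the sample rows are all assumptions made for illustration. Treating an empty string as missing is also an assumption that goes beyond a strict NULL test.

```python
import sqlite3


def find_rows_with_nulls(conn, table, key_columns):
    """Return rows in which any key column is NULL or an empty string.

    The table and column names are inserted into the SQL text directly,
    so they must come from trusted code, never from user input.
    """
    condition = " OR ".join(
        f"{col} IS NULL OR {col} = ''" for col in key_columns
    )
    sql = f"SELECT * FROM {table} WHERE {condition} ORDER BY rowid"
    return conn.execute(sql).fetchall()


# Hypothetical in-memory table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE owner_info (name TEXT, id_number TEXT, plate_number TEXT)"
)
conn.executemany(
    "INSERT INTO owner_info VALUES (?, ?, ?)",
    [
        ("Owner A", "ID-1", "ganA0001"),
        ("Owner B", None, "ganB0002"),   # missing ID number
        ("Owner C", "ID-3", ""),         # empty plate number
    ],
)

rows_with_nulls = find_rows_with_nulls(
    conn, "owner_info", ["name", "id_number", "plate_number"]
)
```

Which columns count as key data is a business decision, so the list of key columns is passed in rather than fixed in the function.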
Third, deduplication.
In a table, some data columns are allowed to contain repeated values, while others are not. For example, in an owner information table, the name and ID number can repeat, because one person may register several cars, and this repetition is not an error. The license plate number, however, must not be duplicated, otherwise business logic errors will follow. The license plate number column therefore needs to be deduplicated.
Duplicate data can be listed with an SQL statement that groups the rows by license plate number and keeps only the groups containing more than one row.
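The statement itself is not reproduced in this text. A minimal sketch of such a query, assuming a hypothetical table `owner_info` with a `plate_number` column, is shown here embedded in Python's built-in `sqlite3` module so that it can be run as is. The sample rows are made up.

```python
import sqlite3

# Assumed query shape: group by plate number, keep groups with more than one row.
DUPLICATE_PLATES_SQL = """
SELECT plate_number, COUNT(*) AS occurrences
FROM owner_info
GROUP BY plate_number
HAVING COUNT(*) > 1
ORDER BY plate_number
"""

# Hypothetical in-memory table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE owner_info (name TEXT, id_number TEXT, plate_number TEXT)"
)
conn.executemany(
    "INSERT INTO owner_info VALUES (?, ?, ?)",
    [
        ("Owner A", "ID-1", "ganA0001"),
        ("Owner A", "ID-1", "ganA0002"),  # same owner, second car: allowed
        ("Owner B", "ID-2", "ganA0001"),  # same plate as the first row: an error
    ],
)

duplicate_plates = conn.execute(DUPLICATE_PLATES_SQL).fetchall()
```

Grouping on the plate number alone matches the business rule described above: a repeated name or ID number is not reported, while a repeated plate is.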
To sum up, data cleaning requires an understanding of the business as well as of the technology. Without it, the cleaning rules cannot be formulated correctly, and data cleaning becomes a mere formality that fails to achieve its purpose.