What does big data preprocessing include?
First, data cleaning.

Not all data is useful. Some of it is irrelevant to the question at hand, and some of it is simply wrong and distorts the analysis. The data therefore has to be filtered and denoised before it is fit for use.

Data cleaning mainly covers the handling of missing values (attributes of interest that have no value), noisy data (values that are incorrect or deviate from what is expected), and inconsistent data.

Missing values can be filled with a global constant, with the attribute mean, or with the most probable value, or the affected record can simply be ignored. Noisy data can be smoothed by binning (sorting the data, splitting it into bins, and then smoothing the values within each bin), by clustering, by combined computer and human inspection, or by regression.
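
Two of the techniques above, filling with the attribute mean and smoothing by bin means, can be sketched as follows. This is only a minimal illustration: the function names, the sample values, and the bin size of 3 are assumptions made for the example, not part of any particular tool.

```python
from statistics import mean


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the values that are present."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot compute a mean: every value is missing")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def smooth_by_bin_means(values, bin_size):
    """Sort the values, cut them into bins of bin_size, and replace
    every value with the mean of its bin."""
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


# Hypothetical sample data.
ages = [23, None, 31, 27, 35]
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]

filled_ages = fill_missing_with_mean(ages)       # None becomes 29
smoothed_prices = smooth_by_bin_means(prices, 3)  # bins become 9, 22, 29
print(filled_ages)
print(smoothed_prices)
```

Smoothing by bin means trades detail for stability: larger bins remove more noise but also flatten more of the real variation.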

Second, data integration and transformation.

Data integration means combining data from multiple data sources and storing it in one consistent data store. Three difficult problems have to be dealt with in the process: schema matching, data redundancy, and the detection and resolution of data value conflicts.

Because the sources name their data differently, the same real-world entity often appears under different names in different sources. The last key problem of data integration is conflicting data values: the same entity carries different values for the same attribute in different sources.
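
A small sketch may make the naming and value-conflict problems concrete. It assumes two hypothetical sources, "crm" and "billing", whose field names are mapped by hand onto one unified schema. The mapping, the field names, and the rule of keeping the first value seen are all assumptions made for illustration, and real resolution rules depend on the business context.

```python
# Hypothetical mapping from each source's field names to the unified schema.
FIELD_MAP = {
    "crm": {"cust_id": "customer_id", "tel": "phone"},
    "billing": {"customer_number": "customer_id", "phone_no": "phone"},
}


def to_unified(source, record):
    """Rename a record's fields to the unified names; unmapped fields are kept."""
    mapping = FIELD_MAP[source]
    return {mapping.get(name, name): value for name, value in record.items()}


def integrate(sources, key="customer_id"):
    """Merge records by key and collect attributes whose values disagree.

    The first value seen is kept; each disagreement is reported as
    (key value, attribute, kept value, conflicting value).
    """
    merged = {}
    conflicts = []
    for source, records in sources.items():
        for record in records:
            unified = to_unified(source, record)
            entity = merged.setdefault(unified[key], {})
            for name, value in unified.items():
                if name in entity and entity[name] != value:
                    conflicts.append((unified[key], name, entity[name], value))
                else:
                    entity[name] = value
    return merged, conflicts


# Hypothetical sample records.
sources = {
    "crm": [
        {"cust_id": 1, "tel": "555-0100"},
        {"cust_id": 2, "tel": "555-0111"},
    ],
    "billing": [
        {"customer_number": 1, "phone_no": "555-0100"},
        {"customer_number": 2, "phone_no": "555-0199"},
    ],
}

merged, conflicts = integrate(sources)
print(conflicts)  # customer 2 has two different phone numbers
```

Reporting conflicts instead of silently overwriting them leaves the resolution decision to whoever knows which source is more trustworthy.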

Third, data reduction.

Data reduction mainly includes data aggregation, dimensionality reduction, data compression, numerosity reduction, and concept hierarchy generation.

If the data needed for analysis is pulled from the data warehouse according to business requirements, the resulting data set is very large, and analyzing and mining that much data is very expensive. Data reduction techniques produce a reduced representation of the data set that is much smaller yet still largely preserves the integrity of the original data. Mining the reduced data set still yields almost the same analytical results as mining the original one.
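
Data aggregation, the first technique listed above, can be illustrated with a minimal sketch that rolls daily sales rows up into monthly totals. The row layout (an ISO date string plus an amount) and the sample figures are assumptions made for the example.

```python
from collections import defaultdict


def aggregate_by_month(daily_sales):
    """Roll (date 'YYYY-MM-DD', amount) rows up into one total per month."""
    totals = defaultdict(float)
    for date, amount in daily_sales:
        month = date[:7]  # 'YYYY-MM' prefix of the ISO date string
        totals[month] += amount
    return dict(totals)


# Hypothetical sample rows.
daily_sales = [
    ("2023-01-03", 120.0),
    ("2023-01-17", 80.0),
    ("2023-02-02", 200.0),
    ("2023-02-20", 50.0),
    ("2023-02-27", 25.0),
]

monthly_sales = aggregate_by_month(daily_sales)
print(monthly_sales)  # five rows reduced to two
```

The reduced set has fewer rows but the same overall total, so an analysis that only needs monthly figures gets the same answer at lower cost.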

That is what big data preprocessing includes, as shared by the Qingteng editor. If you are interested in big data engineering, I hope this article helps. To learn more about the skills of data analysts and big data engineers, see the other articles on this site.