Predict the future development trend of big data from three directions.

As technology develops, the world generates more data every day. Since the concept of big data was introduced, the technology has gradually grown into an industry whose prospects are viewed with rising optimism. So where is the big data industry heading? Here are three directions from which to predict the future development trend of big data technology:

(1) Social networking and Internet of Things technologies have expanded the technical channels for data collection.

Years of industry informatization have accumulated large volumes of internal data in fields such as medical care, transportation, and finance, which constitute the "stock" of big data resources. The development of the mobile Internet and the Internet of Things has greatly enriched the channels through which big data can be collected: data from external social networks, wearable devices, the Internet of Vehicles, the Internet of Things, and government open-information platforms will become the main body of incremental big data resources. The deep penetration of the mobile Internet already provides rich data sources for big data applications.

In addition, the rapidly developing Internet of Things will become an increasingly important provider of big data resources. Compared with the chaotic, low-value-density data of the existing Internet, data collected purposefully through terminals such as wearables and connected vehicles is more valuable. For example, after several years of development, smart wearable devices such as bracelets, wristbands, and watches have matured, and smart key chains, bicycles, and even chopsticks keep emerging, backed by companies such as Intel, Google, and Facebook abroad and Baidu, JD.COM, and Xiaomi in China.

Internal enterprise data is still the main source of big data, but demand for external data is growing stronger. At present, 32% of enterprises purchase data from outside, while only 18% use government open data. How to build up big data resources, improve data quality, and promote cross-border integration and circulation is one of the key issues in advancing big data applications.

Generally speaking, every industry is working to expand emerging technical channels of data acquisition and to develop incremental resources while making good use of existing ones. Social media and the Internet of Things have greatly enriched the potential channels of data collection; in theory, data collection will become easier and easier.

(2) Distributed storage and computing technology has laid a solid technical foundation for big data processing.

Big data storage and computing technology is the foundation of the whole big data system.

In terms of storage, the Google File System (GFS) published by Google in 2003 and the later Hadoop Distributed File System (HDFS) laid the foundation of big data storage technology.

Compared with traditional systems, GFS/HDFS co-locates computing and storage on the same physical nodes, avoiding the I/O throughput bottleneck that data-intensive computing easily creates. At the same time, this distributed file system adopts a distributed architecture that supports highly concurrent access.
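To make the design concrete, here is a toy, self-contained Python sketch (not the real HDFS API; the block size, node names, and put helper are all hypothetical) of the core idea: a file is split into fixed-size blocks, and each block is replicated on several nodes so that computation can later be scheduled next to the data.

```python
# Toy sketch of the HDFS block model: files become fixed-size blocks,
# each replicated on several "nodes" (plain dicts here).
BLOCK_SIZE = 8          # bytes; real HDFS blocks are 64-128 MB
REPLICATION = 2

nodes = {f"node{i}": {} for i in range(3)}

def put(filename, data):
    """Split data into blocks and place each block on REPLICATION nodes."""
    placements = []
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for idx, block in enumerate(blocks):
        # Round-robin placement; a real namenode also balances load and racks.
        chosen = [f"node{(idx + r) % len(nodes)}" for r in range(REPLICATION)]
        for name in chosen:
            nodes[name][(filename, idx)] = block
        placements.append((idx, chosen))
    return placements

print(put("log.txt", b"records from many data collection terminals"))
```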

In terms of computing, the MapReduce distributed parallel computing technology that Google published in 2004 is representative of the new generation of distributed computing. A MapReduce system is built from cheap commodity servers; adding server nodes expands the system's total processing capacity almost linearly, a great advantage in cost and scalability.
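The programming model itself is simple enough to sketch on a single machine. The following illustrative Python word count shows the map, shuffle, and reduce phases; in a real MapReduce cluster each phase runs in parallel across many servers, whereas here everything runs in one process.

```python
from collections import defaultdict
from itertools import chain

# Map phase: each document independently becomes (word, 1) pairs,
# so this step can run in parallel across many server nodes.
def map_phase(document):
    return [(word, 1) for word in document.split()]

# Shuffle phase: pairs are grouped by key so all counts for one
# word end up at the same reducer.
def shuffle(mapped):
    groups = defaultdict(list)
    for key, value in chain.from_iterable(mapped):
        groups[key].append(value)
    return groups

# Reduce phase: each key's values are folded into a single result.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data needs big storage", "data drives decisions"]
counts = reduce_phase(shuffle(map(map_phase, documents)))
print(counts)  # e.g. {'big': 2, 'data': 2, ...}
```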

(3) Emerging technologies such as deep neural networks have opened a new era of big data analysis technology.

Big data analysis technology is generally divided into two categories: on-line analytical processing (OLAP) and data mining.

OLAP technology starts from a set of user hypotheses and performs interactive queries and association operations on multidimensional data sets (generally using SQL statements) to verify those hypotheses; it embodies deductive reasoning.
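As a small illustration of this style of analysis, the following sketch (using pandas on an invented sales table) aggregates a measure along two dimensions at once, the same "slice and dice" an OLAP cube query or a SQL GROUP BY performs.

```python
import pandas as pd

# Hypothetical sales records with three analysis dimensions:
# region, product, and quarter.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "product": ["A", "B", "A", "B"],
    "quarter": ["Q1", "Q1", "Q1", "Q2"],
    "revenue": [120, 80, 150, 95],
})

# Aggregate revenue along two dimensions at once to test a hypothesis
# such as "the South outperforms the North in Q1".
cube = sales.pivot_table(index="region", columns="quarter",
                         values="revenue", aggfunc="sum")
print(cube)
```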

Data mining technology, by contrast, actively searches massive data for models, automatically uncovering the patterns hidden in the data; it embodies inductive reasoning.

Traditional data mining algorithms mainly include the following (a brief code sketch of all three appears after this list):

(1) Clustering, also known as group analysis, is a statistical method for classifying samples or indicators: it divides a set of data into several categories according to their similarities and differences, so that data within the same category are highly similar while data in different categories are dissimilar and weakly correlated. With clustering, an enterprise can segment customers along different dimensions without knowing the behavioural characteristics of each group in advance, then extract and analyse the characteristics of each segment in order to understand its customers and recommend appropriate products and services.

(2) Classification is similar to clustering but serves a different purpose. It can start from a model pre-generated by clustering, or learn the similarities of a group of data objects from empirical data, and its goal is to map data items into given categories through a classification model; a representative algorithm is CART (Classification and Regression Trees). Enterprises can classify business data such as users, products, and services, build a classification model, and then assign new data to the existing categories. Classification algorithms are mature and accurate, which makes them very effective for precisely targeting, marketing to, and serving customers, and for supporting business decisions.

(3) Regression reflects the characteristics of data attribute values, expressing the mapping between them as a function and uncovering the dependencies between attribute values. It can be applied to forecasting data series and studying correlations. Enterprises can use regression models to analyse and predict market sales and adjust strategy in time; in risk prevention and anti-fraud, regression models can also provide early warning.
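The sketch below runs all three techniques with scikit-learn on synthetic data; the customer features, spend figures, and model settings are invented for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# (1) Clustering: group synthetic "customers" by two behavioural
# features without any labels.
customers = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
segments = KMeans(n_clusters=2, n_init=10).fit_predict(customers)

# (2) Classification: a CART-style decision tree learns the segment
# labels and can then assign a new customer to an existing category.
tree = DecisionTreeClassifier().fit(customers, segments)
print(tree.predict([[4.8, 5.2]]))

# (3) Regression: fit a linear relationship between spend and revenue,
# then use it to forecast revenue at a new spend level.
spend = rng.uniform(0, 10, (100, 1))
revenue = 3.0 * spend[:, 0] + rng.normal(0, 1, 100)
model = LinearRegression().fit(spend, revenue)
print(model.predict([[7.5]]))
```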

Traditional methods, whether OLAP or data mining, struggle to meet the challenge of big data. The first problem is low execution efficiency: traditional data mining technologies are built on centralized underlying software architectures that are difficult to parallelize, so they process data beyond the TB level inefficiently. The second is that the accuracy of analysis is hard to improve, especially on unstructured data.

Only a small part of all human digital data (about 1% of the total) has been deeply analysed and mined (through regression, classification, clustering, and the like). Large Internet companies perform shallow analysis (such as sorting) on semi-structured data like web indexes and social data, while unstructured data such as voice, pictures, and video, which account for nearly 60% of the total, remain difficult to analyse effectively.

Therefore, big data analysis technology needs breakthroughs in two respects. The first is to analyse massive structured and semi-structured data efficiently and deeply, mining tacit knowledge such as the semantics, sentiment, and intent of text and web pages written in natural language. The second is to analyse unstructured data, transforming massive, complex, multi-source voice, image, and video data into machine-recognizable information with clear semantics, from which useful knowledge can then be extracted.

At present, big data analysis technology represented by emerging techniques such as deep neural networks has made considerable progress.

A neural network is an advanced artificial intelligence technique characterized by self-adaptive processing, distributed storage, and high fault tolerance. It is well suited to handling nonlinear, fuzzy, incomplete, or imprecise knowledge and data, and therefore well suited to big data mining.

Typical neural network models fall into three categories. The first is feedforward networks for classification, prediction, and pattern recognition, represented by function networks and the perceptron. The second is feedback networks for associative memory and optimization, represented by Hopfield's discrete and continuous models. The third is self-organizing maps for clustering, represented by the ART model. Although there are many neural network models and algorithms, there is no uniform rule for choosing among them in a given data mining domain, and the network's learning and decision process remains hard for people to interpret.
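As a minimal illustration of the first category, here is a single-layer perceptron trained on a toy linearly separable problem (logical AND); the learning rate and epoch count are arbitrary choices for this sketch.

```python
import numpy as np

# Training data for logical AND: output 1 only when both inputs are 1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)   # weights
b = 0.0           # bias
lr = 0.1          # learning rate

for _ in range(20):                 # a few epochs suffice for this toy task
    for xi, target in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        error = target - pred       # perceptron update rule
        w += lr * error * xi
        b += lr * error

print([(1 if xi @ w + b > 0 else 0) for xi in X])  # -> [0, 0, 0, 1]
```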

With the continuing integration of the Internet and traditional industries, mining and analysing web data has become an important part of demand analysis and market forecasting. Web data mining is a comprehensive technique that discovers hidden patterns from document structure, content, and usage records.

At present, the PageRank algorithm has been widely studied and applied. PageRank is an important part of Google's ranking algorithm; it was patented in the United States in September 2001 and named after Larry Page, one of Google's founders. PageRank measures the value of a website according to the quantity and quality of its inbound links. The idea is inspired by a phenomenon in academic research: the more frequently a paper is cited, the higher its authority and quality are generally judged to be.
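The core of PageRank can be sketched as a power iteration over a link graph. The four-page graph below is invented for illustration; the damping factor 0.85 follows the value commonly cited for the original formulation.

```python
import numpy as np

# A tiny 4-page web: links[i] lists the pages that page i links to.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = len(links)
d = 0.85  # damping factor

# Column-stochastic transition matrix: M[j, i] is the probability of
# moving from page i to page j by following a random outgoing link.
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

# Power iteration: repeatedly redistribute rank until it stabilises.
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - d) / n + d * M @ rank

print(rank)  # page 2 collects the most inbound weight, hence the top rank
```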

It should be noted that data mining and analysis are highly industry- and enterprise-specific. Beyond some basic analysis tools, there is a shortage of targeted yet general-purpose modelling and analysis tools, so each industry and enterprise must build data models around its own business. The ability to build data analysis models has become the key to winning the big data competition.