Big Data
Big data, also called huge data, refers to data sets so large that they cannot be captured, managed, processed, and organized by mainstream software tools within a reasonable time to help enterprises make more proactive business decisions.

Chinese name: big data (大数据); English name: Big Data; also known as: huge data. This entry covers: basic information (data volume, timeliness, diversity, veracity), technical overview, cache, distributed database, distributed system, data mart, NoSQL, and summary.

Basic information

Big data is not only about data volume; it involves four aspects: volume, timeliness, variety, and veracity.

Volume: the generation, processing, and preservation of very large amounts of data.

Timeliness: IBM's explanation of this aspect fits best, namely the timeliness of processing. Since one purpose of big data, as noted above, is market prediction, processing that takes too long loses its predictive value, so timeliness is critical for big data; an in-depth analysis of 5 million records, for example, may take only 5 minutes.

Variety: variety refers to the forms data takes: structured and unstructured data, including text, video, web pages, streams, and so on.

Veracity: veracity refers to whether the reliability and quality of the data itself are sufficient as data sources become more diverse. If the data itself is flawed, the analysis results will be wrong.

Technical overview

Big data is a recent technical hotspot, but judging by the name alone it is not a new idea; after all, "big" is a relative concept. Historically, technologies in the information-management field such as databases, data warehouses, and data marts were also largely aimed at solving problems of large-scale data. As early as the 1990s, Bill Inmon, known as the father of the data warehouse, often spoke of big data. That "big data" has become a hot term in its own right is mainly due to the rapid development of the Internet, cloud computing, mobile, and the Internet of Things in recent years. Ubiquitous mobile devices, RFID tags, and wireless sensors are generating data every minute, and Internet services with hundreds of millions of users are producing vast numbers of interactions around the clock. The amount of data to be processed is too large and growing too fast, and business needs and competitive pressure place ever higher demands on the real-time effectiveness of data processing, which conventional techniques cannot cope with at all. Under these circumstances, engineers have developed and adopted many new technologies, chiefly distributed caches, MPP-based distributed databases, distributed file systems, and various NoSQL distributed storage schemes.

Ten years ago, Eric Brewer put forward the famous CAP theorem: a distributed system cannot satisfy all three of consistency, availability, and partition tolerance, but can meet at most two at the same time. Systems emphasize different points and adopt different strategies, and only by truly understanding a system's requirements can the CAP theorem be put to good use. Architects generally apply CAP theory in one of two directions. One is key-value storage, such as Amazon Dynamo, flexibly selecting database products with different leanings according to CAP theory. The other is domain model + distributed cache + storage, customizing a flexible distributed scheme according to CAP theory, which is difficult. For large websites, availability and partition tolerance take precedence over data consistency; they generally design toward A and P as far as possible and then ensure the consistency the business requires by other means. Architects should not waste energy on designing a perfect distributed system that satisfies all three, but should know how to choose. Different data have different consistency requirements: an SNS site can tolerate a relatively long window of inconsistency without hurting transactions or user experience, whereas transaction and accounting data such as Alipay's are highly sensitive and usually cannot tolerate inconsistency lasting more than a second.
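To make the A-versus-C trade-off concrete, the following minimal Python sketch (an illustration, not taken from any product named above; all class and variable names are hypothetical) contrasts an AP-style store, which keeps answering during a partition at the risk of staleness, with a CP-style store, which refuses writes it cannot apply everywhere.

```python
class Replica:
    def __init__(self):
        self.data = {}          # this replica's local copy: key -> value
        self.reachable = True   # False simulates a network partition

class APStore:
    """Availability + partition tolerance: always answer, possibly stale."""
    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        # Apply the write to every replica we can reach; replicas cut off
        # by a partition are repaired later (eventual consistency).
        for r in self.replicas:
            if r.reachable:
                r.data[key] = value

    def read(self, key):
        # Answer from the first reachable replica, even if it missed writes.
        for r in self.replicas:
            if r.reachable:
                return r.data.get(key)
        return None

class CPStore:
    """Consistency + partition tolerance: refuse service rather than diverge."""
    def __init__(self, replicas):
        self.replicas = replicas

    def write(self, key, value):
        # Reject the write unless it can be applied everywhere at once.
        if not all(r.reachable for r in self.replicas):
            raise RuntimeError("partition: write rejected to stay consistent")
        for r in self.replicas:
            r.data[key] = value

# Demo: during a partition the AP store serves stale data but stays up.
a, b = Replica(), Replica()
ap = APStore([a, b])
ap.write("balance", 100)   # both replicas now hold 100
b.reachable = False        # partition: b stops receiving writes
ap.write("balance", 50)    # only a is updated
a.reachable = False        # now a is unreachable and b is back
b.reachable = True
print(ap.read("balance"))  # prints 100: stale, but the system answered
```

The point of the sketch is exactly the choice described above: the AP store stays available and repairs consistency later, which suits an SNS feed, while the CP store sacrifices availability rather than ever serve diverging balances, which suits accounting data like Alipay's.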
Cache

Caching is widely used in Web development. Memcached is a distributed memory object caching system developed by danga (the technical team running LiveJournal); it is used in dynamic systems to reduce database load and improve performance. (Figure 1: the composition of memcached.) Memcached has the following characteristics: a simple protocol; event handling based on libevent; a built-in memory storage scheme; and decentralized servers that do not communicate with one another. For each (Key, Value) pair (hereafter, KV pair) that memcached handles, the key is converted by a hash algorithm into a hash key, which makes lookup, comparison, and hashing as fast as possible; memcached also uses a two-level hash, maintained in one large hash table. Memcached consists of two core components: the server (ms) and the client (mc). In a memcached query, mc first determines which ms holds the KV pair by computing the hash value of the key; once the ms is determined, mc sends the query request to that ms to locate the exact data. Because there is no inter-server interaction and no multicast protocol, memcached's interactions have little impact on the network.

MemcacheDB is a distributed key-value persistent storage system. It is not a caching component but a reliable, fast persistence engine based on object access. Its protocol is consistent with memcached's (though not completely), so many memcached clients can connect to it. MemcacheDB uses Berkeley DB as its persistence component and therefore supports many Berkeley DB features. There are many products of this kind; for example, Taobao's Tair is a key-value store that is widely used inside Taobao, and Tair later added a persistent version whose idea is basically the same as Sina's MemcacheDB.
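The query routing just described can be pictured with a short Python sketch. This is not memcached's actual client code: the names are hypothetical, the network layer is replaced by an in-process dictionary, and plain modulo hashing stands in for the consistent hashing many real clients use.

```python
import hashlib

class MemcacheLikeClient:
    def __init__(self, servers):
        self.servers = servers                  # e.g. ["ms1:11211", ...]
        self.store = {s: {} for s in servers}   # stands in for real sockets

    def _pick_server(self, key):
        # First-level hash: the client maps the key onto one server, so the
        # servers never need to coordinate. Real clients often use consistent
        # hashing so that adding or removing a server remaps only a fraction
        # of the keys; plain modulo is shown here for brevity.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.servers[h % len(self.servers)]

    def set(self, key, value):
        server = self._pick_server(key)
        # In a real client this would be a network call to `server`; the
        # server then applies its own (second-level) hash internally.
        self.store[server][key] = value

    def get(self, key):
        server = self._pick_server(key)
        return self.store[server].get(key)

client = MemcacheLikeClient(["ms1:11211", "ms2:11211", "ms3:11211"])
client.set("user:42", "alice")
print(client.get("user:42"))   # "alice", always routed to the same ms
```

Because every client computes the same hash, all clients agree on which ms owns a key without the servers ever exchanging messages, which is why memcached traffic has so little impact on the network.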
Distributed database

Alipay took the lead in China in adopting the Greenplum database, migrating its data warehouse from the original Oracle RAC platform to a Greenplum cluster and using Greenplum's computing power to support its ever-growing business needs. The Greenplum data engine software is designed for the large-scale data and complex query capabilities that next-generation data warehouses require. It is based on MPP (massively parallel processing) and a Shared-Nothing architecture, and is built on open-source software and commodity x86 hardware for better cost-performance. (Figure 2: the Greenplum data engine software.)

Distributed system

No discussion of distributed file systems can omit Google's GFS. GFS is built on a cluster of many ordinary PCs running Linux; the whole cluster consists of one Master (usually with several backups) and many ChunkServers. Files in GFS are split into fixed-size chunks that are stored on different ChunkServers, and each chunk has multiple replicas (usually three), likewise stored on different ChunkServers. The Master maintains GFS's metadata, that is, the file names and their chunk information. A client first obtains a file's metadata from the Master and then, according to where in the file the data to be read lies, communicates with the corresponding ChunkServers to fetch the data.

Hadoop was born after Google published its papers. Today Hadoop is embraced by many of China's largest Internet companies: Baidu's search-log analysis and the data warehouses of Tencent, Taobao, and Alipay all make use of Hadoop. Hadoop is characterized by low hardware cost, an open-source software stack, strong flexibility, the freedom for users to modify the code themselves, and support for massive data storage and computing tasks. Hive is a data warehouse platform built on Hadoop: queries written for it are translated into corresponding MapReduce programs and executed on Hadoop, and with Hive developers can build ETL easily. (Figure 3: a diagram of Hive and Hadoop, quoted from a Facebook engineer.)
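What Hive generates under the hood can be pictured with the canonical MapReduce example, word count. The sketch below is a toy, in-process imitation of the map, shuffle, and reduce phases, not the Hadoop API; the function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Emit (key, 1) pairs, like a Hadoop Mapper.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group values by key, like the framework's shuffle/sort step.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate each key's values, like a Hadoop Reducer.
    return key, sum(values)

lines = ["big data big cluster", "data mart data"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
result = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(result)   # {'big': 2, 'data': 3, 'cluster': 1, 'mart': 1}
```

A Hive query such as a GROUP BY count is compiled into essentially this shape: mappers emit key-value pairs, the framework groups them by key, and reducers aggregate each group across the cluster.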
The distributed file storage system (DFS) of the Yonghong data mart's data grid is an improvement and extension of Hadoop HDFS; it manages and stores, in a unified way, the files kept on all nodes of a server cluster. These nodes include a single NamingNode, which provides the metadata service within the DFS, and many MapNodes, which provide the storage blocks. Files stored in the DFS are divided into blocks, and these blocks are replicated onto multiple computers (MapNodes), which is very different from a traditional RAID architecture. The block size and the number of replicas are determined by the client when the file is created. The NamingNode monitors file operations on all nodes in the server cluster, such as file creation, deletion, moves, and renames.

Data mart

Data marts, also called data markets, are warehouses that collect data from operational data and other data sources and serve a particular professional group. In scope, the data is extracted from enterprise-wide databases, data warehouses, or more specialized data warehouses. The key point of a data mart is that it caters to the special needs of its professional users in analysis, content, performance, and ease of use; data mart users want data represented in terms they are familiar with. In Gartner's well-known report on data mart products, the agile business-intelligence products in the first quadrant include QlikView, Tableau, and SpotView, all of them full in-memory-computing data mart products that challenge the traditional business-intelligence giants on big data. Domestic BI products started later; the well-known agile BI products include PowerBI, Yonghong Technology's Z-Suite, and SmartBI, among which Yonghong Technology's Z-Data Mart is an in-memory-computing data mart product. De'ang Information is also a domestic system integrator of data mart products. The Yonghong Data Mart is data storage and data processing software developed by Yonghong Technology on its own technology, and its underlying techniques are distributed computing, distributed communication, in-memory computing, and columnar storage.

NoSQL

With the growth of data volume, more and more attention has turned to NoSQL, especially since the second half of 2010, when Facebook chose HBase as its real-time message storage system to replace the original Cassandra system; this drew many people's attention to HBase. Facebook's choice of HBase was based on two requirements: short-term batches of volatile temporary data, and an ever-growing body of data that is rarely accessed. HBase is a highly reliable, high-performance, column-oriented, scalable distributed storage system; with HBase, a large-scale structured storage cluster can be built on cheap PC servers. HBase is an open-source implementation of BigTable and uses HDFS as its file storage system. Just as Google runs MapReduce to process the massive data in BigTable, HBase uses MapReduce to process the massive data it holds; and where BigTable uses Chubby as its coordination service, HBase uses Zookeeper.

Summary

Lately the use of NoSQL databases has become more and more popular, and almost all large Internet companies are practicing and exploring in this area. Beyond enjoying the inherent scalability, fault tolerance, and high read/write throughput of this class of databases (although mainstream NoSQL systems are still being improved), more and more practical needs are leading people into areas where NoSQL is less strong, such as search, near-real-time statistical analysis, and simple transactions. In practice, other technologies are combined around NoSQL to form an overall solution. (Figure 4: seamless integration of an online application system and the data platform.)
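As a closing illustration of the column-oriented model described in the NoSQL section, here is a minimal Python sketch of a BigTable/HBase-style table: values addressed by row key and column (family:qualifier), multiple timestamped versions per cell, and rows served in sorted key order so that range scans work. This is a toy model under those assumptions, not HBase's API.

```python
import time
from collections import defaultdict

class MiniColumnStore:
    """A toy BigTable/HBase-like table: (row, column, timestamp) -> value."""
    def __init__(self):
        # row key -> column ("family:qualifier") -> list of (ts, value)
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts=None):
        # Cells are versioned: each put appends a new timestamped value.
        ts = ts if ts is not None else time.time()
        self.rows[row][column].append((ts, value))

    def get(self, row, column):
        # Return the newest version of the cell, as HBase does by default.
        versions = self.rows[row][column]
        return max(versions)[1] if versions else None

    def scan(self, start_row, stop_row):
        # Rows are yielded in sorted key order, enabling cheap range scans.
        for row in sorted(self.rows):
            if start_row <= row < stop_row:
                yield row, {c: self.get(row, c) for c in self.rows[row]}

t = MiniColumnStore()
t.put("user#1", "info:name", "alice")
t.put("user#1", "info:name", "alicia")   # a newer version of the same cell
t.put("user#2", "info:name", "bob")
print(t.get("user#1", "info:name"))      # "alicia": the latest version wins
print(list(t.scan("user#1", "user#3")))  # ordered range scan over row keys
```

Keeping row keys sorted is what lets a system of this kind split a table into contiguous key ranges across servers and still answer range scans efficiently.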