This paper introduces the processing methods of massive data.

1. Bloom filter

Scope of application: implementing a data dictionary, checking data for duplicates, or computing set intersections.

Basic principles and points:

The principle is simple: a bit array plus k independent hash functions. To insert an element, set to 1 the bits at the positions given by the hash functions; to query, check whether all the bits corresponding to the hash functions are 1. Obviously this process cannot guarantee that the lookup result is 100% correct (there may be false positives). It also does not support deleting an inserted keyword, because clearing the bits of that keyword would affect other keywords. A simple improvement is therefore the counting Bloom filter, which supports deletion by replacing the bit array with an array of counters.

There is another important question: how to determine the size m of the bit array and the number of hash functions from the number n of input elements. The error rate is minimized when the number of hash functions is k = (ln 2) * (m/n). If the error rate is to be no greater than E, m must be at least n * lg(1/E) to represent any set of n elements. But m should be larger still, because at least half of the bit array should remain 0; thus m should be >= n * lg(1/E) * lg(e), which is about 1.44 times n * lg(1/E) (lg denotes the logarithm with base 2).

For example, assuming an error rate of 0.01, m should be about 13 times n, and k is then about 8.

Note here that m and n have different units: m is measured in bits, while n is measured in number of elements (more precisely, the number of distinct elements). Since a single element usually occupies many bits, using a Bloom filter in memory is usually a saving.
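As a concrete illustration, here is a minimal Bloom filter sketch in Python. It follows the sizing formulas above; deriving the k hash functions from two md5-based hashes (double hashing) is an implementation choice of this sketch, not something prescribed by the text.

import math
import hashlib

class BloomFilter:
    def __init__(self, n, error_rate=0.01):
        # m >= n * lg(1/E) * lg(e), i.e. about 1.44 * n * lg(1/E) bits
        self.m = max(1, math.ceil(-n * math.log(error_rate) / (math.log(2) ** 2)))
        # k = (ln 2) * (m / n) hash functions minimizes the error rate
        self.k = max(1, round(math.log(2) * self.m / n))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # derive k bit positions from two base hashes (double hashing)
        digest = hashlib.md5(item.encode("utf-8")).hexdigest()
        h1, h2 = int(digest[:16], 16), int(digest[16:], 16)
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # may return a false positive, but never a false negative
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))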

Extension:

A Bloom filter maps the elements of a set into a bit array and uses whether all k mapped bits are 1 (k is the number of hash functions) to indicate whether an element is in the set. The counting Bloom filter (CBF) expands each bit of the bit array into a counter, thereby supporting deletion of elements. The spectral Bloom filter (SBF) further associates the counters with the number of occurrences of the collected elements; SBF uses the minimum value among an element's counters to estimate its frequency.

Example: you are given two files, A and B, each containing 5 billion URLs; each URL occupies 64 bytes and the memory limit is 4 GB. Find the URLs common to files A and B. What if there are three or even n files?

Based on this problem, let us estimate the memory usage. 4 GB = 2^32 bytes, about 4 billion bytes, i.e. about 34 billion bits, and n = 5 billion. With an error rate of 0.01, about 65 billion bits are needed. We only have about 34 billion available, which is not far off; it will just raise the error rate somewhat. In addition, if these URLs correspond one-to-one with IP addresses, they can be converted into IPs, which greatly simplifies the problem.
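A hedged sketch of how the example could use the BloomFilter class sketched above (the function name, file handling, and the 0.01 error rate are assumptions made here): build the filter from file A, then stream file B through it; every hit is only a candidate and still needs a verification pass because of false positives.

def candidate_common_urls(path_a, path_b, n_estimate):
    bf = BloomFilter(n_estimate, error_rate=0.01)
    with open(path_a) as fa:
        for line in fa:
            bf.add(line.strip())
    with open(path_b) as fb:
        for line in fb:
            url = line.strip()
            if url in bf:      # possible false positive, verify later
                yield url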

2. Hash method

Scope of application: a basic data structure for fast lookup and deletion; usually the total amount of data must fit into memory.

Basic principles and points:

Choice of hash function: for strings, integers, permutations, and so on, there are specific corresponding hash methods.

Collision handling: one approach is open hashing, also called the chaining (zipper) method; the other is closed hashing, also known as open addressing.

Extension:

The d in d-left hashing means multiple. Let us first simplify the problem and look at 2-left hashing. 2-left hashing means splitting a hash table into two halves of equal length, called T1 and T2, and providing a separate hash function, h1 and h2, for T1 and T2 respectively. When storing a new key, both hash functions are computed at the same time, yielding two addresses h1[key] and h2[key]. We then check position h1[key] in T1 and position h2[key] in T2 to see which one already stores more (colliding) keys, and store the new key in the less loaded position. If both sides are equally loaded, for example both positions are empty or each stores one key, the new key is stored in the left subtable T1, which is where "2-left" comes from. When looking up a key, both hashes must be computed and both locations checked at the same time.
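Below is a minimal 2-left hashing sketch in Python matching the description above; the fixed table size and the salted uses of Python's built-in hash as stand-ins for h1 and h2 are assumptions of this sketch.

class TwoLeftHash:
    def __init__(self, half_size=1024):
        self.size = half_size
        self.t1 = [[] for _ in range(half_size)]   # left subtable T1
        self.t2 = [[] for _ in range(half_size)]   # right subtable T2

    def _h1(self, key):
        return hash(("left", key)) % self.size

    def _h2(self, key):
        return hash(("right", key)) % self.size

    def insert(self, key):
        b1 = self.t1[self._h1(key)]
        b2 = self.t2[self._h2(key)]
        # store in the less loaded position; ties go to the left table T1
        if len(b1) <= len(b2):
            b1.append(key)
        else:
            b2.append(key)

    def lookup(self, key):
        # both locations must be checked
        return key in self.t1[self._h1(key)] or key in self.t2[self._h2(key)]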

Example of problem: 1) From massive log data, extract the IP that visited Baidu most often on a certain day.

The number of IPs is still limited, at most 2^32, so we can consider using a hash to store the IPs directly in memory and then do the counting.
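A minimal sketch of this idea in Python, assuming one log record per line with the IP as the first field (the file layout is an assumption):

from collections import Counter

def most_frequent_ip(log_path):
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            fields = line.split()
            if not fields:
                continue
            counts[fields[0]] += 1   # assume the IP is the first field
    return counts.most_common(1)[0]  # (ip, count)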

3. Bitmap

Scope of application: fast lookup, duplicate checking, and deletion of data; generally, the data range is less than about 10 times the range of an int.

Basic principles and key points: use a bit array to indicate whether certain elements exist, for example the set of 8-digit telephone numbers.

Extension: bloom filter can be regarded as an extension of bitmap.

Example of problem:

1) A file is known to contain some telephone numbers, each 8 digits long; count the number of distinct numbers.

An 8-digit number is at most 99,999,999, which requires about 99 million bits, i.e. a little over 10 MB of memory.
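A sketch of the bitmap approach for this problem; the input format (one 8-digit number per line) is an assumption.

def count_distinct_numbers(path):
    bits = bytearray((100_000_000 + 7) // 8)   # one bit per possible 8-digit number
    distinct = 0
    with open(path) as f:
        for line in f:
            n = int(line.strip())
            byte, mask = n // 8, 1 << (n % 8)
            if not bits[byte] & mask:
                bits[byte] |= mask
                distinct += 1
    return distinct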

2) Find the number of non-repeating integers among 250 million integers, given that the available memory cannot hold all 250 million integers.

Extend the bitmap to a 2-bit bitmap: use 2 bits to represent each number, where 00 means it does not appear, 01 means it appears once, and 10 means it appears twice or more. Alternatively, instead of a 2-bit bitmap, we can use two ordinary bitmaps to simulate it.
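A sketch of the 2-bit bitmap just described, assuming unsigned 32-bit integers; the full table takes about 1 GB, which is the usual trade-off for this technique.

def count_unique_ints(numbers):
    table = bytearray(2 ** 32 // 4)               # two bits per possible value, about 1 GB
    for n in numbers:
        byte, shift = n // 4, (n % 4) * 2
        state = (table[byte] >> shift) & 0b11
        if state < 2:                             # saturate at 2 ("seen twice or more")
            table[byte] = (table[byte] & (0xFF ^ (0b11 << shift))) | ((state + 1) << shift)
    unique = 0
    for byte in table:
        for shift in (0, 2, 4, 6):
            if (byte >> shift) & 0b11 == 1:       # appeared exactly once
                unique += 1
    return unique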

4. Heap

Scope of application: finding the top n items of massive data, where n is relatively small, so that the heap fits in memory.

Basic principles and key points: use a max-heap to find the smallest n elements and a min-heap to find the largest n. For example, to find the smallest n, we compare the current element with the largest element in the max-heap and, if it is smaller, replace that largest element; in the end the n elements in the heap are the smallest n. This is suitable for large data volumes where n is relatively small: all of the top n can be obtained in a single scan, which is very efficient.

Extension: a double heap, a max-heap combined with a min-heap, can be used to maintain the median.

Example of problem:

1) Find the largest 100 numbers among 1,000,000 numbers.

Use a min-heap of 100 elements.
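A sketch of this answer using heapq from the Python standard library, which provides a min-heap:

import heapq

def top_100(numbers):
    heap = []                      # min-heap holding the 100 largest values seen so far
    for n in numbers:
        if len(heap) < 100:
            heapq.heappush(heap, n)
        elif n > heap[0]:          # larger than the smallest of the current top 100
            heapq.heapreplace(heap, n)
    return sorted(heap, reverse=True)

heapq.nlargest(100, numbers) does the same thing in a single call.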

5. Double-layer bucket division

Scope of application: finding the k-th largest number, the median, and non-repeating or repeating numbers.

Basic principles and key points: because the range of the elements is too large for a direct addressing table, the range is narrowed step by step through repeated partitioning until it finally falls within an acceptable range. The reduction can be applied several times; the double layer is just one example.

Extension:

Example of problem:

1) Find the number of non-repeating integers among 250 million integers, given that the available memory cannot hold all 250 million integers.

This is a bit like the pigeonhole principle. There are 2^32 possible integers, so we can divide the 2^32 values into 2^8 regions (for example, one file per region), distribute the data into the different regions, and then solve each region directly with a bitmap. In other words, as long as there is enough disk space, the problem is easily solved.
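A sketch of the partitioning step in Python; the choice of 2^8 regions follows the text, while the file naming and temporary-file handling are assumptions.

import os

def partition_into_regions(numbers, work_dir, region_bits=8):
    # split the 2^32 value range into 2^8 regions, one temporary file per region
    shift = 32 - region_bits
    files = [open(os.path.join(work_dir, f"region_{i:03d}.txt"), "w")
             for i in range(1 << region_bits)]
    for n in numbers:
        files[n >> shift].write(f"{n}\n")
    for f in files:
        f.close()

Each region then spans only 2^24 values, so a per-region 2-bit bitmap of 2^24 * 2 bits = 4 MB (as in the bitmap section) is enough to count its non-repeating integers.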

2) Find the median of 500 million ints.

This example is more obvious than the previous one. First, we divide the int range into 2^16 regions and read through the data, counting how many values fall into each region. From these counts we can determine which region the median falls into and which rank within that region is exactly the median. Then, in a second scan, we only need to examine the values that fall into that region.

In fact, if the values are int64 rather than int, we can still reduce the range to an acceptable level after three such partitions: the int64 range is first divided into 2^24 regions and the region containing the desired rank is determined; that region is then divided into 2^20 sub-regions and the sub-region containing the desired rank is determined; finally, that sub-region contains only about 2^20 values, so a direct addressing table can be used for the statistics.
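A sketch of the two-pass 2^16-region search for the int case; read_ints is a hypothetical callable returning a fresh iterator over the data so it can be scanned twice.

def median_by_buckets(read_ints, total_count):
    # pass 1: count how many values fall into each of the 2^16 regions (by high 16 bits)
    region_counts = [0] * (1 << 16)
    for n in read_ints():
        region_counts[n >> 16] += 1

    target = (total_count - 1) // 2            # 0-based rank of the (lower) median
    seen = 0
    for region, count in enumerate(region_counts):
        if seen + count > target:
            break
        seen += count
    rank_in_region = target - seen             # rank of the median inside that region

    # pass 2: count only the values in the chosen region (by low 16 bits)
    inner_counts = [0] * (1 << 16)
    for n in read_ints():
        if n >> 16 == region:
            inner_counts[n & 0xFFFF] += 1
    running = 0
    for low, count in enumerate(inner_counts):
        running += count
        if running > rank_in_region:
            return (region << 16) | low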

6. Database index

Scope of application: adding, deleting, modifying and querying big data.

Basic principles and key points: use the design and implementation methods of databases to handle the insertion, deletion, modification, and query of massive data.
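A small illustrative sketch using sqlite3 from the Python standard library; the database file, table, and column names are made up for illustration.

import sqlite3

conn = sqlite3.connect("massive.db")                    # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS logs (ip TEXT, url TEXT, ts INTEGER)")
# an index lets lookups and aggregations on ip avoid full table scans
conn.execute("CREATE INDEX IF NOT EXISTS idx_logs_ip ON logs (ip)")
conn.execute("INSERT INTO logs VALUES (?, ?, ?)", ("10.0.0.1", "/index.html", 1700000000))
conn.commit()
top_ips = conn.execute(
    "SELECT ip, COUNT(*) AS c FROM logs GROUP BY ip ORDER BY c DESC LIMIT 10"
).fetchall()
conn.close()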

Extension:

Example of problem:

7. Inverted index

Scope of application: search engine, keyword query

Basic principles and key points: why is it called an inverted index? It is an indexing method used in full-text search to store, for each word, a mapping to its locations in a document or a set of documents.

Taking English as an example, here is the text to be indexed:

T0 = "That's the way it is"

T 1 = "What's this?"

T2 = "This is a banana"

We can get the following reverse file index:

" a": {2}

"Banana": {2}

"Yes": {0, 1, 2}

"It": {0, 1, 2}

"What": {0, 1}

The search conditions "What", "Yes" and "It" will correspond to the intersection of sets.

A forward index, by contrast, is developed to store a list of the words of each document. Queries on a forward index tend to involve ordered, frequent full-text queries over each document and verification of each word within a checked document. In a forward index, documents occupy the central position: each document points to a sequence of the index entries it contains, i.e. the document points to the words it contains. In an inverted index, the word points to the documents that contain it, so the inverted relationship is easy to see.
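A minimal sketch that builds the inverted index of the example above and answers a multi-word query by set intersection; tokenizing by a simple lowercase split is an assumption of this sketch.

from collections import defaultdict

def build_inverted_index(documents):
    index = defaultdict(set)                 # word -> set of document ids
    for doc_id, text in enumerate(documents):
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

docs = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(docs)
result = index["what"] & index["is"] & index["it"]   # {0, 1}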

Extension:

Examples of problems: document retrieval system, which queries which documents contain a certain word, such as keyword search of common academic papers.

8. External sorting

Scope of application: sorting and deduplication of big data.

Basic principles and key points: the merge method of external sorting, the principle of replacement selection using a loser tree, and the optimal merge tree.

Extension:

Example of problem:

1) There is a 1 GB file in which each line is a word; the word length does not exceed 16 bytes and the memory limit is 1 MB. Return the 100 words with the highest frequency.

This data has an obvious characteristic: words are at most 16 bytes long, but the memory is only 1 MB, which is not enough for a hash table, so sorting can be used instead, with the memory serving as the input buffer.
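A sketch of the external-sorting approach in Python: split the input into runs that fit in memory, sort each run, and merge the sorted runs with a heap; the run size and the temporary-file handling are simplifications of this sketch.

import heapq
import os
import tempfile
from itertools import islice

def external_sort(input_path, output_path, max_lines_in_memory=50_000):
    run_paths = []
    with open(input_path) as src:
        while True:
            run = list(islice(src, max_lines_in_memory))
            if not run:
                break
            run.sort()                                  # sort one in-memory run
            fd, path = tempfile.mkstemp(text=True)
            with os.fdopen(fd, "w") as tmp:
                tmp.writelines(run)
            run_paths.append(path)
    runs = [open(p) for p in run_paths]
    with open(output_path, "w") as out:
        out.writelines(heapq.merge(*runs))              # k-way merge of the sorted runs
    for f in runs:
        f.close()
    for p in run_paths:
        os.remove(p)

After the file is sorted, equal words are adjacent, so their frequencies can be accumulated in one sequential pass while a small heap maintains the current top 100.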

9. Trie tree

Scope of application: a large amount of data with many repetitions, but a small number of distinct items, so that it can fit into memory.

Basic principles and key points: the implementation approach and the way the children of a node are represented.
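A minimal trie sketch in Python where each node keeps its children in a dict and counts how many times the word ending at it has been inserted; this is just one of the possible child representations mentioned above.

class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.count = 0       # times the word ending here has been inserted

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += 1

    def frequency(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return 0
        return node.count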

Extension: compressed implementation.

Example of problem:

1) There are 10 files, each 1 GB in size; every line of every file stores a user query, and the queries of each file may be repeated. You are asked to sort the queries by frequency.

2) There are 10 million strings, some of which are identical (duplicates); all duplicates must be removed, keeping only the distinct strings. How would you design and implement this?

3) Finding popular queries: the repetition rate of the query strings is relatively high; although the total number is 10 million, after removing duplicates there are no more than 3 million distinct queries, and each query string is no longer than 255 bytes.

10. Distributed processing: MapReduce

Scope of application: a large amount of data, but the number of distinct items is small enough to fit into memory.

Basic principles and key points: hand the data over to different machines for processing, partitioning the data and reducing the results.

Extension:

Example of problem:

1) The canonical example application of MapReduce is a process to count the appearances of each different word in a set of documents:

void map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    EmitIntermediate(w, 1);

void reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  int result = 0;
  for each v in partialCounts:
    result += ParseInt(v);
  Emit(result);

Here, each document is split into words, and each word is counted by the map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call of reduce, so that function just needs to sum all of its input values to find the total number of appearances of that word.

2) Massive data is distributed across 100 computers; find a way to efficiently count the TOP 10 of this batch of data.

3) There are N machines in total, and each machine holds N numbers. Each machine can store at most O(N) numbers and operate on them. How do we find the median of the N^2 numbers?

Classical problem analysis

Given tens of millions or hundreds of millions of data items (with duplicates), count the top N items with the highest frequency. There are two situations: the data can be read into memory at once, or it cannot.

Available approaches: trie tree plus heap, database index, partitioned subset statistics, hashing, distributed computing, approximate statistics, external sorting.

Whether the data can be read into memory at once should actually refer to the amount of data after duplicates are removed. If the deduplicated data fits into memory, we can build a dictionary over the data, for example a map, hashmap, or trie, and then count directly. Of course, while updating each item's occurrence count we can use a heap to maintain the N items with the most occurrences, but this increases the maintenance cost and is less efficient than finding the top N after the counting is complete.

If the data does not fit into memory, on the one hand we can consider whether the dictionary method above can be adapted to this situation: the change is to store the dictionary on the hard disk instead of in memory, for which we can refer to the way databases store data.

Of course, there is a better way: distributed computing, essentially the MapReduce process, can be adopted. First, the data is partitioned across different machines according to the data values or their hash (e.g. md5) values; ideally, after partitioning, each machine's share can be read into memory at once, so that different machines handle different value ranges. This is in effect the map step. After the results are obtained, each machine only needs to extract its local top N items with the highest frequency, and then the top N among all of these are selected in a final summary, which is in effect the reduce step.

One might actually want to simply spread the data evenly across the machines, but that cannot guarantee a correct solution, because the occurrences of one item may be spread evenly across different machines while another item's occurrences are concentrated entirely on one machine, and there may also be items with equal counts. For example, suppose we want the 100 most frequent items and we spread the data evenly over 10 machines, finding the local top 100 on each. There is no guarantee that the true top 100 are found after merging: an item that truly belongs in the top 100 may be split across the 10 machines so that each machine sees only 1,000 of its occurrences, while on each machine other items, whose occurrences happen to be concentrated on a single machine, rank just ahead of it with slightly more than 1,000 occurrences; that item would then be eliminated from every local top 100. Therefore we cannot split the data across machines arbitrarily; instead, each item should be mapped to a machine according to its hash value, so that identical items are processed on the same machine.
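A single-process sketch of this hash-partition-then-merge idea, with the "machines" represented by in-memory lists purely for illustration: identical items are routed to the same shard by their hash, each shard computes its local counts and top N, and the candidates are merged at the end.

from collections import Counter

def distributed_top_n(items, n=100, machines=10):
    # map step: route each item to a machine by its hash so that equal items co-locate
    shards = [[] for _ in range(machines)]
    for item in items:
        shards[hash(item) % machines].append(item)

    # each "machine" counts its own items and reports its local top n
    merged = Counter()
    for shard in shards:
        local = Counter(shard)
        merged.update(dict(local.most_common(n)))

    # reduce step: the global top n is guaranteed to be among the reported candidates
    return merged.most_common(n)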

External sorting consumes a lot of IO and is not very efficient. The distributed method above can also be used in a single-machine version: the total data is divided into several sub-files according to value ranges and processed one by one; afterwards, the words and their frequencies are merged, and the merge process of external sorting can in fact be used for this.

In addition, approximate computation can also be considered: by exploiting the properties of natural language, only the words that occur most frequently in actual practice are used as the dictionary, so that this scale can fit into memory.