Current location - Education and Training Encyclopedia - Graduation thesis - Paper reading: an improved data flow summary: counting-minimum sketch and its application.
Paper reading: an improved data flow summary: counting-minimum sketch and its application.
This paper introduces a new sublinear spatial data structure-counting minimum sketch to summarize data flow.

CM sketch allows quick response to basic queries (such as point, range and inner product queries) in data stream summarization, and can also be used to solve several important problems in data stream, such as finding quantiles and common items.

Using CM sketches to solve these problems shows that the time and space range significantly improves the previously known problems, usually from to.

We consider a vector, which is presented in an implicit and incremental way. The dimension of this vector is n, and its current state at time t is:

Initially, a is a zero vector, and all I's.

The update of a single entry of a vector is presented in the form of a stream, and t update is, which means:

Generally, it means that only one dimension in a vector is modified, and the other dimensions are not changed.

In the recent data flow scheme, the function calculation algorithm in the context of data flow needs to meet the following requirements:

In recent years, several different sketches have been proposed in the context of data flow, which allow approximation of many simple set functions.

The sketches designed so far are usually linear functions of their inputs, which can be expressed as projections of basic vectors, and some randomly selected projection matrices are used to represent data.

This means that by converting these functions into calculations on sketches, some functions can be easily calculated by some functions distributed on the data on the site. Therefore, they are also suitable for distributed applications.

Although sketches have proved to be powerful, they have the following disadvantages:

In view of the fact that the data flow field is driven by high-performance monitoring applications, such as the response time requirements of data flow algorithms used to monitor IP packet flows, these shortcomings ultimately limit the use of many known data flow algorithms in appropriate applications.

We will solve all these problems by proposing a new sketch structure, which we call Count-Min or CM sketch.

This sketch has the following advantages:

At any time t, the query needs to calculate some specific functions on a(t):

These queries are necessary for many applications in data stream algorithms and have been widely studied.

Count-Min or CM sketch is named after two basic operations used to answer point queries. Count first, then calculate the minimum value. We use e to represent the bottom of the natural logarithmic function ln.

A Count-Min(CM) sketch with parameters (ε, δ) is represented by a two-dimensional array Count with width w and depth d: count [1, 1] ... count [d, w].

Then we set the parameters, as well as w and d.

Every item in the array starts with zero. In addition, the D-hash function H 1 ... HD: {1...n} → {1...w} is randomly and evenly selected from pairs of independent families.

When the update arrives, it means that the item is updated by numbers and then added to the count of each line, and the counter is determined by.

Formal settings:

The space used in the Count-Min sketch is an array in wd order, which requires wd words and D hash functions. When the paired functions described in the reference are used, each hash function can be stored in 2 words.