Five data processing architectures

Big data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, and process large data sets and extract insight from them. Although working with data that exceeds the computing power or storage capacity of a single computer is not a new problem, the pervasiveness, scale, and value of this type of computing have expanded enormously in recent years.

This article introduces one of the most fundamental components of a big data system: the processing framework. Processing frameworks are responsible for computing over data in the system, for example processing data read from non-volatile storage or data that has just been ingested. Computing over data means extracting information and insight from large quantities of individual data points.

These frameworks are described below:

Batch-only frameworks:

Apache Hadoop

Stream-only frameworks:

Apache Storm

Apache Samza

Hybrid frameworks:

Apache Spark

Apache Flink

What is a big data processing framework?

Processing frameworks and processing engines are responsible for computing over data in a data system. Although there is no authoritative definition of the difference between "engine" and "framework", the former is often defined as the component actually responsible for operating on data, while the latter is a set of components designed to serve a similar role.

For example, Apache Hadoop can be considered a processing framework with MapReduce as its default processing engine. Engines and frameworks can often be swapped out or used in tandem. For instance, Apache Spark, another framework, can hook into Hadoop as a replacement for MapReduce. This interoperability between components is one reason big data systems are so flexible.

Although the systems that handle this stage of the data life cycle can be quite complex, their goals, broadly speaking, are the same: to improve understanding by performing operations on the data, to reveal the patterns the data contains, and to gain insight into complex interactions.

To simplify the discussion of these components, we will group the processing frameworks by the state of the data they are designed to handle. Some systems handle data in batches, while others process data continuously as it flows into the system. Still others can handle both types of data.

Before introducing the indicators and conclusions of different implementations, it is necessary to briefly introduce the concepts of different processing types.

Batch processing system

Batch processing has a long history in the field of big data. Batch processing mainly operates on large static data sets and returns the results after the calculation process is completed.

Data sets used in batch processing typically share the following characteristics:

Bounded: a batch data set represents a finite collection of data.

Persistent: the data is almost always backed by some type of permanent storage.

Massive: batch operations are often the only option for processing extremely large data sets.

Batch processing is ideal for computations that require access to a complete set of records. For example, when calculating totals and averages, the data set must be treated holistically rather than as a collection of individual records. These operations require the data to maintain its state for the duration of the calculation.

Tasks that require very large volumes of data are often best handled by batch operations. Whether the data sets are processed directly from permanent storage or loaded into memory first, batch systems are built with massive quantities of data in mind and have the resources to handle them. Because batch processing excels at handling large volumes of persistent data, it is frequently used for historical data analysis.

The trade-off of processing large quantities of data is longer turnaround time, so batch processing is not appropriate when processing time matters significantly.

Apache Hadoop

Apache Hadoop is a processing framework dedicated to batch processing. Hadoop is the first big data framework that has received great attention in the open source community. Based on a large number of papers and experiences on massive data processing published by Google, Hadoop has re-implemented related algorithms and component stacks, making large-scale batch processing technology easier to use.

New versions of Hadoop contain multiple components, or layers, that work together to process batch data:

HDFS: HDFS is the distributed file system layer that coordinates storage and replication across the cluster nodes. HDFS ensures that data remains available in spite of inevitable node failures, and it serves as the source of data, stores intermediate processing results, and persists the final computed results. A short read sketch follows this component list.

YARN: YARN, which stands for Yet Another Resource Negotiator, is the cluster coordination component of the Hadoop stack. It is responsible for coordinating and managing the underlying resources and for scheduling jobs. By acting as an interface to the cluster's resources, YARN makes it possible to run many more types of workloads on a Hadoop cluster than in earlier iterations.

MapReduce: MapReduce is Hadoop's native batch engine.
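To make the HDFS layer more concrete, here is a minimal sketch using Hadoop's Java client API to read a file stored in the distributed file system. The input path is hypothetical, and the cluster address is assumed to come from the core-site.xml configuration on the classpath.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS and other settings from the cluster configuration
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; HDFS transparently serves the file's blocks
        // from whichever nodes currently hold replicas
        Path input = new Path("/data/example/input.txt");
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(input)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```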

Batch processing mode

Hadoop's processing functionality comes from the MapReduce engine. MapReduce's processing technique follows the map, shuffle, reduce algorithm using key-value pairs. The basic procedure, illustrated by the sketch after this list, involves:

Reading the data set from the HDFS file system.

Dividing the data set into chunks and distributing them among the available nodes.

Applying the computation to the subset of data on each node (intermediate results are written back to HDFS).

Redistributing the intermediate results and grouping them by key.

"Reducing" the value of each key by summarizing and combining the results computed by the individual nodes.

Writing the final computed results back to HDFS.
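As an illustration of this flow, below is a minimal sketch of the canonical word-count job written against Hadoop's Java MapReduce API; the input and output HDFS paths are supplied as hypothetical command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: emit (word, 1) for every word in this node's input split
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: after the shuffle groups values by key, sum the counts
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Hypothetical HDFS paths for the input data and the final output
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The mapper corresponds to the per-node computation step, the framework itself performs the shuffle and grouping by key, and the reducer performs the summarization before the final results are written back to HDFS.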

Advantages and limitations

Because this methodology relies heavily on persistent storage, with each task performing multiple reads and writes to disk, it tends to be fairly slow. On the other hand, since disk space is typically one of the most abundant server resources, MapReduce can handle enormous data sets. It also means that, compared with similar technologies, Hadoop's MapReduce can typically run on inexpensive hardware, because it does not attempt to hold everything in memory. MapReduce has tremendous scalability potential, and production deployments have included tens of thousands of nodes.

MapReduce has a steep learning curve. Although other technologies in the Hadoop ecosystem can greatly reduce the impact of this problem, it can still be a factor when trying to implement applications quickly on a Hadoop cluster.

A vast ecosystem has grown up around Hadoop, and Hadoop clusters themselves are often used as building blocks for other software. Many other processing frameworks and engines can use HDFS and the YARN resource manager through their integration with Hadoop.

Summary

Apache Hadoop and its MapReduce processing engine offer a well-tested batch processing model that is best suited for handling very large data sets where time is not a significant factor. A fully functional Hadoop cluster can be built from very inexpensive components, making this cheap and effective processing technology applicable to many use cases. Compatibility and integration with other frameworks and engines also make Hadoop a useful foundation for multiple workload processing platforms built on different technologies.

Stream processing system

Stream processing systems compute over data as it enters the system. This is a completely different approach from the batch paradigm: instead of operating on the entire data set, stream processing operates on each individual data item as it passes through the system.

The data set in stream processing is "unbounded", which has a few important implications:

The complete data set can only represent the total amount of data that has entered the system so far.

The working data set is perhaps more relevant, and is limited to a single item at a time.

Processing is event-based and has no "end" until it is explicitly stopped. Results are available immediately and are continually updated as new data arrives.

Stream processing systems can handle a nearly unlimited amount of data, but they process only one item (true stream processing) or very few items (micro-batch processing) at a time, with minimal state maintained between records. Although most systems provide some way to maintain state, stream processing is primarily optimized for functional processing with few side effects.

Functional operations focus on discrete steps that have limited state or side effects. Performing the same operation on the same data produces the same output independent of other factors, which suits stream processing well, because state between items tends to be some combination of difficult, limiting, and, in some cases, undesirable. So while some types of state management are usually possible, these frameworks are simpler and more efficient in their absence.

This type of processing lends itself very well to certain kinds of workloads. Tasks with near real-time requirements are a natural fit for the streaming model. Analytics, server or application error logging, and other time-based metrics are among the most suitable use cases, because reacting to changes in these areas can be critical to business functions. Stream processing suits data for which you must respond to changes or spikes, and where you care about trends over time.

Apache Storm

Apache Storm is a streaming processing framework that focuses on very low latency, and may be the best choice for workloads that need near real-time processing. This technology can handle a very large amount of data and provide lower latency results than other solutions.

Stream processing mode

Storm stream processing works by orchestrating DAGs (directed acyclic graphs) called topologies within the framework. These topologies describe the various transformations or steps to be performed on each incoming piece of data as it enters the system.

Topologies include:

Stream: an ordinary data stream; unbounded data that will continuously arrive at the system.

Spout: a source of data streams at the edge of the topology, for instance an API or a queue, from which the data to be processed emerges.

Bolt: a bolt represents a processing step that consumes stream data, applies an operation to it, and outputs the result as a stream. Bolts connect to each spout and then to one another to arrange all the necessary processing. At the end of the topology, the output of the final bolt can be used as the input of another, connected system.

The idea behind Storm is to define many small, discrete operations using the above components and then compose them into the required topology. By default, Storm offers an "at least once" processing guarantee: each message is processed at least once, but in some failure scenarios it may be processed multiple times. Storm cannot guarantee that messages are processed in any particular order.
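A minimal sketch of composing such a topology with Storm's Java API follows. The spout and bolt classes (SentenceSpout, SplitSentenceBolt, WordCountBolt) are hypothetical implementations, and the topology is run on an in-process LocalCluster purely for illustration.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout: the source of an unbounded stream of sentences (hypothetical class)
        builder.setSpout("sentences", new SentenceSpout(), 1);

        // First bolt: split each sentence into individual words
        builder.setBolt("split", new SplitSentenceBolt(), 2)
               .shuffleGrouping("sentences");

        // Second bolt: count words; fieldsGrouping routes every tuple with
        // the same word to the same task, keeping each running count consistent
        builder.setBolt("count", new WordCountBolt(), 2)
               .fieldsGrouping("split", new Fields("word"));

        // Run the topology locally for demonstration purposes
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", new Config(), builder.createTopology());
        Thread.sleep(30_000);
        cluster.shutdown();
    }
}
```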

To achieve exactly-once, stateful processing, an abstraction called Trident can be used. Storm without Trident is often referred to as Core Storm. Trident significantly alters Storm's processing characteristics: it increases latency, adds state to the processing, and replaces the item-by-item pure streaming model with micro-batches.

To avoid these penalties, Storm users are generally advised to stick with Core Storm whenever possible. That said, Trident's exactly-once guarantee is useful in cases where the system cannot intelligently handle duplicate messages. Trident is also your only choice within Storm when you need to maintain state between items, such as counting how many users clicked a link within an hour. Trident gives Storm flexibility, even though it does not play to the framework's natural strengths.

Trident topologies are composed of the following (a short sketch follows the list):

Stream batches: micro-batches of stream data that are chunked to provide batch processing semantics.

Operations: batch procedures that can be performed on the data.
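The fragment below sketches, in the spirit of the Storm Trident tutorial, what a Trident word count might look like. The sentence spout is a hypothetical implementation, and MemoryMapState is an in-memory state backend suitable only for demonstration.

```java
import org.apache.storm.topology.IRichSpout;
import org.apache.storm.trident.TridentState;
import org.apache.storm.trident.TridentTopology;
import org.apache.storm.trident.operation.BaseFunction;
import org.apache.storm.trident.operation.TridentCollector;
import org.apache.storm.trident.operation.builtin.Count;
import org.apache.storm.trident.testing.MemoryMapState;
import org.apache.storm.trident.tuple.TridentTuple;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Values;

public class TridentWordCount {

    // Splits each sentence tuple into one tuple per word
    public static class Split extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            for (String word : tuple.getString(0).split(" ")) {
                collector.emit(new Values(word));
            }
        }
    }

    public static TridentState buildWordCounts(TridentTopology topology,
                                               IRichSpout spout) {
        return topology
                .newStream("spout1", spout)   // the stream is consumed in micro-batches
                .each(new Fields("sentence"), new Split(), new Fields("word"))
                .groupBy(new Fields("word"))
                // persistentAggregate maintains the counts as Trident-managed
                // state, giving exactly-once update semantics per batch
                .persistentAggregate(new MemoryMapState.Factory(),
                                     new Count(),
                                     new Fields("count"));
    }
}
```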

Advantages and limitations

Storm is probably the best solution currently available for near real-time processing. It can handle data with extremely low latency, for workloads where the lowest possible latency is required. Storm is often a good choice when processing time directly affects the user experience, for example when feedback from the processing is fed directly back to a visitor's page on a website.

Storm with Trident gives users the option of micro-batches in place of pure stream processing. While this gives users greater flexibility to build a tool that fits their needs, it also tends to erode the technology's biggest advantage over other solutions. That said, having a choice of stream processing styles never hurts.

Core Storm cannot guarantee ordered processing of messages. Core Storm offers an "at least once" processing guarantee, meaning every message is processed, but duplicates may occur. Trident offers an exactly-once guarantee and can provide ordering between batches, but not within them.

In terms of interoperability, Storm integrates with Hadoop's YARN resource negotiator, making it easy to hook into an existing Hadoop deployment. Storm also supports more programming languages than most processing frameworks, giving users many options for defining topologies.

Summary

For pure stream processing workloads with very strict latency requirements, Storm is probably the best mature option. It can guarantee that every message is processed and can be used with many programming languages. Because Storm does not do batch processing, you will need additional software if you require those capabilities. If exactly-once processing guarantees are a firm requirement, Trident can provide them, although at that point other stream processing frameworks may be a better fit.

Apache Samza

Apache Samza is a stream processing framework tightly tied to the Apache Kafka messaging system. While Kafka can be used by many stream processing systems, Samza is designed specifically to take advantage of Kafka's unique architecture and guarantees. It uses Kafka to provide fault tolerance, buffering, and state storage.

Samza uses YARN for resource negotiation. This means a Hadoop cluster is required by default (at least HDFS and YARN), but it also means that Samza can rely on the rich features built into YARN.

Stream processing mode

Samza relies on Kafka's semantics to define how to handle streams. Kafka involves the following concepts when processing data:

Topic: each stream of data entering a Kafka system is called a topic. A topic is essentially a stream of related information that consumers can subscribe to.

Partition: to spread a topic across multiple nodes, Kafka divides the incoming messages into partitions. Partitioning is based on a key, so that every message with the same key is guaranteed to land in the same partition. Ordering within a partition is guaranteed.

Broker: each of the nodes that make up a Kafka cluster is called a broker.

Producer: any component that writes data to a Kafka topic is called a producer. The producer supplies the key used to partition a topic (see the sketch after this list).

Consumer: any component that reads from a Kafka topic is called a consumer. Consumers are responsible for keeping track of their own offsets, so that they know which records have already been processed if a failure occurs.
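The next sketch shows how a producer supplies such a key using Kafka's Java client; the broker address, topic name, and keys are hypothetical. Because both records share the key "user-42", they are guaranteed to land in the same partition, in order.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // hypothetical broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key determines the partition, so all of this user's events
            // arrive in a single partition and preserve their order
            producer.send(new ProducerRecord<>("page-views", "user-42", "/home"));
            producer.send(new ProducerRecord<>("page-views", "user-42", "/checkout"));
        }
    }
}
```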

Because Kafka amounts to an immutable log, Samza deals with immutable streams. This means that any transformation creates a new stream that other components can consume without affecting the initial stream in any way.
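A minimal sketch of a Samza job using the framework's classic low-level StreamTask API appears below. The topic names are hypothetical, and in a real job the input topic and this task class would be wired together in the job's configuration file rather than in code.

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class UppercaseTask implements StreamTask {
    // Output goes to a new Kafka topic; the input topic is left untouched
    private static final SystemStream OUTPUT =
            new SystemStream("kafka", "page-views-upper");

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String message = (String) envelope.getMessage();
        // The transformation produces a new stream that downstream
        // consumers can subscribe to independently
        collector.send(new OutgoingMessageEnvelope(OUTPUT, message.toUpperCase()));
    }
}
```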

Advantages and limitations

At first glance, Samza's reliance on the Kafka messaging system might look like a restriction, but it also gives the system some unique guarantees and features that other stream processing systems lack.

For example, Kafka already provides replicated data storage that can be accessed with low latency, and it offers a very easy and inexpensive multi-subscriber model for each individual data partition. All output, including intermediate results, is also written to Kafka and can be independently consumed by downstream stages.

This tight reliance on Kafka mirrors, in many ways, the MapReduce engine's reliance on HDFS. While depending on HDFS between every calculation creates some serious performance problems for batch processing, it resolves a number of problems that stream processing otherwise encounters.

Samza's close relationship with Kafka keeps the processing steps themselves very loosely coupled. Any number of subscribers can be added to the output of any step without prior coordination, which can be very useful for organizations where multiple teams need access to similar data. Teams can subscribe to the topics of data entering the system, or they can subscribe to topics produced by other teams after some processing, all without putting additional load on load-sensitive infrastructure such as databases.

Writing straight to Kafka also eliminates the problem of back pressure. Back pressure occurs when load spikes cause data to flow in faster than components can process it in real time, which can lead to stalled processing and potential data loss. Kafka is designed to hold data for very long periods of time, which means components can process at their convenience and can be restarted without consequence.
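To illustrate both points, here is a sketch of an independent consumer using Kafka's Java client; the broker address, group id, and topic are hypothetical. Each consumer group maintains its own offsets, so a new team can subscribe to an existing topic and read at its own pace without coordinating with anyone else or slowing the producers.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AnalyticsConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // hypothetical broker
        props.put("group.id", "analytics-team");            // this team's own group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            // Offsets are tracked per group, so a slow consumer simply falls
            // behind in the retained log instead of pressuring the producers
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```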

Samza itself stores state using a fault-tolerant checkpointing system implemented as a local key-value store. This gives Samza an "at least once" delivery guarantee, but it cannot accurately recover aggregated state (such as counts) after a failure, since data might be delivered more than once.

The high-level abstractions Samza offers are, in many ways, easier to work with than the primitives provided by systems like Storm. So far Samza supports only JVM languages, which means it does not match Storm's language flexibility.

Summary

Apache Samza is a good choice for streaming workloads where Hadoop and Kafka are either already available or sensible to implement. Samza is a natural fit for organizations with multiple teams that need to use (but do not necessarily tightly coordinate around) data streams at various stages of processing. Samza greatly simplifies many parts of stream processing and offers low-latency performance. It may not be a good fit if the deployment requirements are incompatible with your current systems, if you need extremely low-latency processing, or if you have strong needs for exactly-once semantics.

Mixed processing system: batch processing and stream processing

Some processing frameworks can handle batch and stream processing workloads. These frameworks can process two types of data with the same or related components and APIs, thus simplifying different processing requirements.

This capability is chiefly represented by Spark and Flink, both of which are introduced below. The key to achieving such a feature is how the two processing paradigms are unified, and what assumptions are made about the relationship between fixed and unfixed data sets.

While projects that focus on one processing type fit specific use cases closely, the hybrid frameworks aim to offer a general solution for data processing. They not only provide the methods needed to process data, but also supply their own integrations, libraries, and tooling for tasks such as graph analysis, machine learning, and interactive querying.

Apache Spark

Apache Spark is a next-generation batch processing framework with stream processing capabilities. Built using many of the same principles as Hadoop's MapReduce engine, Spark focuses primarily on speeding up batch workloads through full in-memory computation and processing optimizations.

Spark can be deployed as a standalone cluster (paired with a capable storage layer) or can hook into Hadoop as an alternative to the MapReduce engine.

Batch processing mode

Unlike MapReduce, Spark processes all data in memory, interacting with the storage layer only to load the data initially and to persist the final results. All intermediate results are managed in memory.

While in-memory processing contributes substantially to its speed, Spark is also faster on disk-related tasks because of holistic optimization, achieved by analyzing the complete set of tasks ahead of time. To this end, Spark builds directed acyclic graphs (DAGs) that represent all of the operations to perform, the data to operate on, and the relationships between them, giving the processor a much greater ability to coordinate work intelligently.

To implement in-memory batch computation, Spark uses a model called Resilient Distributed Datasets (RDDs) to work with data. These are immutable structures, kept in memory, that represent collections of data. Operations on an RDD produce a new RDD, and each RDD can trace its lineage back through its parent RDDs and, ultimately, to the data on disk. Through RDDs, Spark achieves fault tolerance without needing to write the result of every operation back to disk.
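A minimal word-count sketch using Spark's Java RDD API follows; the HDFS paths are hypothetical. Note that each transformation simply produces a new RDD, and nothing runs until the final action (saveAsTextFile) triggers execution of the whole DAG.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("word-count");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Transformations only build up the lineage graph; no work happens yet
            JavaRDD<String> lines = sc.textFile("hdfs:///data/input"); // hypothetical path
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);

            // The action triggers the whole DAG and persists the final result
            counts.saveAsTextFile("hdfs:///data/output");
        }
    }
}
```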

Stream processing mode

Stream processing capabilities are provided by Spark Streaming. Spark itself is designed primarily with batch workloads in mind, so to bridge the gap between the engine's design and the characteristics of streaming workloads, Spark implements a concept called micro-batching. This strategy treats a stream of data as a series of very small "batches" that can be handled using the native semantics of the batch engine.

Spark Streaming works by buffering the stream in sub-second increments, then processing those buffers as small, fixed data sets in batch fashion. In practice this works fairly well, but it does lead to a different performance profile than true stream processing frameworks.
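The micro-batch model is visible directly in the API, as in this sketch: a streaming context with a one-second batch interval turns a hypothetical socket source on port 9999 into a sequence of small RDDs, each processed with ordinary batch-style operations.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

import scala.Tuple2;

public class StreamingWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("streaming-word-count");
        // Every one-second buffer becomes one small batch job
        JavaStreamingContext jssc =
                new JavaStreamingContext(conf, Durations.seconds(1));

        // Hypothetical text source; each line arrives as an event
        JavaReceiverInputDStream<String> lines =
                jssc.socketTextStream("localhost", 9999);

        JavaPairDStream<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);
        counts.print(); // updated once per micro-batch

        jssc.start();
        jssc.awaitTermination();
    }
}
```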

Advantages and limitations

The most obvious reason to use Spark instead of Hadoop MapReduce is speed. Thanks to its in-memory computing strategy and advanced DAG scheduling, Spark can process the same data sets significantly faster.

Another major advantage of Spark is its versatility. It can be deployed as a standalone cluster or integrated with an existing Hadoop cluster. It can perform both batch and stream processing, letting a single cluster handle multiple types of tasks.

Beyond the capabilities of the engine itself, an ecosystem of libraries has been built around Spark, providing better support for tasks such as machine learning and interactive queries. Spark tasks are widely acknowledged to be easier to write than their MapReduce equivalents, which can significantly improve productivity.

The caveat of the batch-oriented approach to stream processing is that data entering the system must be buffered. Buffering allows the technology to handle very high volumes of incoming data, increasing overall throughput, but waiting for the buffer to flush also increases latency. This means Spark Streaming may not be appropriate for workloads where low latency is imperative.

Because RAM is generally more expensive than disk space, Spark can cost more to run than disk-based systems. However, faster processing means tasks complete sooner, which can often offset the higher cost in environments where resources are billed by the hour.

One other consequence of Spark's in-memory design is that resource scarcity can become an issue when it is deployed on shared clusters. Compared with Hadoop MapReduce, Spark uses significantly more resources, which can interfere with other tasks trying to use the cluster at the same time. In essence, Spark can be a less considerate neighbor than other components of the Hadoop stack.

Summary

Spark is an excellent choice for workloads with diverse processing requirements. Spark's batch processing offers remarkable speed advantages at the cost of higher memory consumption. Spark Streaming is a good streaming solution for workloads that value throughput over latency.

Apache Flink

Apache Flink is a stream processing framework that can also handle batch tasks. It treats batches simply as data streams with finite boundaries, so batch processing becomes a subset of stream processing. This stream-first approach to all processing has a number of interesting side effects.

The stream-first approach has been called the Kappa architecture, in contrast to the more widely known Lambda architecture (in which batch processing is the primary method and streams supplement it with early, unrefined results). The Kappa architecture, where everything is a stream, simplifies the model, and it has only become feasible with the recent maturation of stream processing engines.

Stream processing model

Flink's stream processing model handles incoming data on an item-by-item basis as a true stream. Flink provides its DataStream API for working with unbounded streams of data. The basic components Flink works with are listed below, followed by a sketch that wires them together:

Streams: immutable, unbounded data sets that flow through the system.

Operators: functions that operate on data streams to produce other streams.

Sources: the entry points for streams coming into the system.

Sinks: the places where streams flow out of the Flink system; a sink might be a database or a connector to another system.
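The following sketch combines these components using Flink's Java DataStream API: a hypothetical socket source, a chain of operators splitting and counting words, and a simple print sink.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Source: an unbounded stream of text lines (hypothetical address)
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // Operators: each transformation produces a new stream
        DataStream<Tuple2<String, Integer>> counts = lines
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line,
                                        Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.split(" ")) {
                            out.collect(Tuple2.of(word, 1));
                        }
                    }
                })
                .keyBy(t -> t.f0)   // partition the stream by word
                .sum(1);            // maintain a running count per word

        // Sink: print here; in practice a database or another connector
        counts.print();
        env.execute("word-count");
    }
}
```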

To recover from problems during computation, streaming tasks create snapshots at set points in time. For state storage, Flink can work with a number of state backends, depending on the complexity and persistence level required by the implementation.

In addition, Flink's stream processing understands the concept of "event time", meaning the time at which an event actually occurred, and it can handle sessions as well. This means that execution order and grouping can be guaranteed in some interesting ways.
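As an illustration of event time, the fragment below assigns watermarks to a hypothetical DataStream<Click> (Click being an assumed event type with user and timestampMillis fields) and counts clicks per user in one-minute tumbling windows of event time, tolerating events that arrive up to five seconds out of order.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// clicks is an existing DataStream<Click>; Click is a hypothetical event type
DataStream<Click> withTimestamps = clicks.assignTimestampsAndWatermarks(
        WatermarkStrategy
                // accept events arriving up to 5 seconds late
                .<Click>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                // tell Flink where each record's event time lives
                .withTimestampAssigner((click, ts) -> click.timestampMillis));

// Windows are defined by when events occurred, not by when they arrived
DataStream<Tuple2<String, Integer>> perMinute = withTimestamps
        .map(click -> Tuple2.of(click.user, 1))
        .returns(Types.TUPLE(Types.STRING, Types.INT))
        .keyBy(t -> t.f0)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .sum(1);
```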

Batch model

Flink's batch processing model is, in large part, just an extension of its stream processing model. Instead of reading from a persistent stream, it reads bounded data sets from persistent storage as streams. Flink uses exactly the same runtime for both processing models.
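A minimal sketch of the batch side, using Flink's DataSet API with a hypothetical HDFS path, looks almost identical to the streaming version above; the bounded input is simply treated as a stream that happens to end.

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class FlinkBatchWordCount {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // A bounded data set read from persistent storage (hypothetical path),
        // processed by the same runtime that handles unbounded streams
        DataSet<String> lines = env.readTextFile("hdfs:///data/input");

        lines.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                 @Override
                 public void flatMap(String line,
                                     Collector<Tuple2<String, Integer>> out) {
                     for (String word : line.split(" ")) {
                         out.collect(Tuple2.of(word, 1));
                     }
                 }
             })
             .groupBy(0)  // group by the word field
             .sum(1)      // sum the counts
             .print();    // print() also triggers execution
    }
}
```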

Flink offers some optimizations for batch workloads. For example, since batch operations are backed by persistent storage, Flink does not create snapshots of batch workloads; the data is still recoverable, and normal processing completes faster.

Another optimization involves breaking batch tasks apart so that stages and components are only invoked when needed. This helps Flink coexist with other users of a cluster. Analyzing tasks ahead of time gives Flink visibility into all of the operations to be performed, the size of the data set, and the downstream steps, enabling further optimization.

Advantages and limitations

Flink is currently unique in the processing framework world. Although Spark also performs batch and stream processing, its micro-batch architecture makes its streaming unsuitable for many use cases. Flink's stream-first approach offers low latency, high throughput, and genuinely item-by-item processing.

Flink manages many things on its own. Somewhat unconventionally, it manages its own memory for performance reasons instead of relying on the native Java garbage collection mechanisms. Unlike Spark, Flink does not require manual optimization and adjustment when the characteristics of the data it processes change, and it handles data partitioning and caching automatically as well.

Flink analyzes its work and optimizes tasks in a number of ways. This analysis is similar in part to what SQL query planners do within relational databases, mapping out the most effective way to implement a given task. Flink supports multi-stage parallel execution while gathering the data of blocking tasks together. For iterative tasks, Flink tries, for performance reasons, to perform the computation on the nodes where the data is stored. It can also perform "delta iterations", iterating over only the portions of the data that have changed.

In terms of user tooling, Flink offers a web-based scheduling view for easily managing tasks and viewing the system state. Users can also inspect the optimization plan for a submitted task to see how it will actually be executed on the cluster. For analysis tasks, Flink offers SQL-style querying, graph processing, and machine learning libraries, along with in-memory computation.

Flink operates well with other components. When used within a Hadoop stack, it fits nicely into the rest of the environment, occupying only the necessary resources at any given time. It integrates easily with YARN, HDFS, and Kafka. With the help of compatibility packages, Flink can also run tasks written for other processing frameworks, such as Hadoop and Storm.

One of Flink's biggest limitations at the moment is that it is still a very young project. Large-scale deployments in the wild are not yet as common as with the other processing frameworks, and Flink's scaling limitations have not been deeply explored. With the project's rapid development cycle and improvements such as the compatibility packages, Flink deployments are likely to become more common as more organizations begin experimenting with it.

Summary

Flink offers low-latency stream processing with support for traditional batch tasks. Flink is probably best suited to organizations with heavy stream processing requirements and some batch-oriented tasks. Its compatibility with native Storm and Hadoop programs, and its ability to run on a YARN-managed cluster, make it easy to evaluate. Its rapid pace of development makes it worth keeping an eye on.

Conclusion

Big data systems can use a variety of processing technologies.

For batch-only workloads that are not time-sensitive, Hadoop is a good choice that is likely to be less expensive to implement than the alternatives.

For stream-only workloads, Storm has broad language support and can deliver very low-latency processing, but its default configuration can produce duplicate results, and it cannot guarantee ordering. Samza's tight integration with YARN and Kafka provides greater flexibility, easier multi-team use, and simpler replication and state management.

For mixed workloads, Spark provides high-speed batch processing along with micro-batch processing for streams. It enjoys broad support, with integrated libraries and tooling for flexible integration. Flink offers true stream processing with batch processing support; it is heavily optimized, can run tasks written for other platforms, and provides low-latency processing, though it is still in the early stages of practical adoption.

The best fit depends heavily on the state of the data to be processed, how time-bound your requirements are, and what kinds of results you are interested in. There are trade-offs between implementing an all-in-one solution and working with tightly focused projects, and similar considerations apply when evaluating any new and innovative solution against its mature, well-tested counterparts.