The big data industry is growing rapidly, and this has created a severe shortage of big data talent in China. The following IT training introduction covers the skills needed to manage big data storage in a Hadoop environment.
1. Distributed storage
Traditional centralized storage has been around for some time, but big data is not well suited to a centralized storage architecture.
Hadoop is designed to move computation closer to the data nodes while exploiting the massive horizontal scalability of the HDFS file system.
A common workaround for Hadoop's inefficiency in managing its own data is to store the Hadoop data on a SAN, but this introduces its own performance and scale bottlenecks. Pushing all of the data through a centralized SAN controller defeats Hadoop's distributed, parallel design. You can either manage separate SANs for different data nodes or consolidate all data nodes onto a single SAN; either way, Hadoop is a distributed application and should run on distributed storage, so that the storage retains the same flexibility as Hadoop itself. That argues for a software-defined storage layer running on commodity servers, which is naturally more efficient than a Hadoop deployment throttled by a storage bottleneck.
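To make the data-locality point concrete, here is a minimal sketch, not taken from any particular product, that uses the standard Hadoop FileSystem API to print which hosts hold each block of an HDFS file; the path /data/events.log is a hypothetical example. A scheduler that respects these block locations is what allows computation to move to the data instead of the data moving to a central controller.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocality {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/events.log");   // hypothetical example path
        FileStatus status = fs.getFileStatus(file);

        // Each BlockLocation lists the DataNodes holding a replica of that block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```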
2. Hyperconverged vs. distributed
Be careful not to confuse hyperconvergence with distribution. Some hyperconverged solutions are distributed storage, but the term usually means that your applications and your storage run on the same compute nodes. This is an attempt to solve the data locality problem, but it creates heavy resource contention: the Hadoop applications and the storage platform end up competing for the same memory and CPU. It is better to run Hadoop in a dedicated application layer and distributed storage in a dedicated storage layer, and then use caching and tiering to address data locality and offset the loss in network performance.
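As one illustration of tiering inside HDFS itself, the sketch below is a hypothetical example rather than a prescription from this article: it uses the HDFS storage-policy API to keep hot data on SSD and push cold data to archival disks. The /lake paths and policy choices are invented; HDFS also offers centralized cache management for pinning hot datasets in DataNode memory.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TierLakeData {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Storage policies exist since Hadoop 2.6; they are exposed on FileSystem in
        // newer releases and on DistributedFileSystem in older 2.x releases.
        fs.setStoragePolicy(new Path("/lake/hot/clickstream"), "ALL_SSD"); // frequently scanned data
        fs.setStoragePolicy(new Path("/lake/cold/2015"), "COLD");          // rarely read archive

        fs.close();
    }
}
```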
3. Avoid controller bottlenecks
An important aspect is to avoid processing data through a single point, such as a traditional storage controller. Instead, make sure the storage platform itself is parallelized; this can significantly improve performance. Such an approach also provides incremental scalability: adding capacity to the data lake is as simple as adding x86 servers, and the distributed storage platform automatically incorporates the new capacity and rebalances data when necessary.
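The scale-out model can be illustrated with a small, hypothetical monitoring sketch built on the HDFS client API: it lists each DataNode's used and remaining capacity. Once new x86 servers register as DataNodes, the standard HDFS balancer can redistribute blocks across them.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ClusterCapacity {
    private static final long GB = 1024L * 1024 * 1024;

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // One line per DataNode: newly added servers show up here automatically.
            for (DatanodeInfo node : dfs.getDataNodeStats()) {
                System.out.printf("%s used=%dGB remaining=%dGB%n",
                        node.getHostName(),
                        node.getDfsUsed() / GB,
                        node.getRemaining() / GB);
            }
        }
        fs.close();
    }
}
```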
4. Deduplication and compression
Deduplication and compression are key techniques for getting big data under control. Large datasets typically shrink by 70% to 90%, and at petabyte scale that translates into tens of thousands of dollars in saved disk costs. Modern platforms provide inline (as opposed to post-processing) deduplication and compression, which greatly reduces the capacity needed to store data.
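Deduplication is usually a feature of the underlying storage platform, but compression can also be applied at the Hadoop layer. As a rough sense of the arithmetic, a 70% to 90% reduction means a petabyte of raw data occupies only about 100 to 300 TB on disk. The sketch below is a hypothetical example (the /lake/raw paths are invented) that uses Hadoop's CompressionCodecFactory to rewrite a file with gzip compression on ingest.

```java
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CompressOnIngest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path source = new Path("/lake/raw/events.log");      // hypothetical uncompressed input
        Path target = new Path("/lake/raw/events.log.gz");   // codec inferred from the .gz suffix

        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        CompressionCodec codec = factory.getCodec(target);   // resolves to GzipCodec

        try (InputStream in = fs.open(source);
             OutputStream out = codec.createOutputStream(fs.create(target))) {
            IOUtils.copyBytes(in, out, 4096, false);          // stream copy; try-with-resources closes both streams
        }
        fs.close();
    }
}
```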
5. Consolidate Hadoop distributions
Many large enterprises run multiple Hadoop distributions, whether because developers asked for them or because different departments have standardized on different versions. Either way, IT eventually has to maintain and operate all of those clusters. Once massive data volumes really begin to affect the business, multiple Hadoop distributions become a source of inefficiency.
Data efficiency can be improved by consolidating them into a single deduplicated and compressed data lake.
6. Virtualize Hadoop
Virtualization has swept the enterprise market: in many environments, more than 80% of physical servers are now virtualized. Yet many enterprises still shy away from virtualizing Hadoop because of concerns about performance and data locality.
7. Create an elastic data lake
Building a data lake is not easy, but big data storage may well demand one. There are many ways to do it, but which is right? The right architecture is a dynamic, elastic data lake that can store data from all sources in all formats: structured, semi-structured, and unstructured. Even more importantly, it must support running applications against local data rather than remote copies.
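To illustrate the all-formats requirement, here is a minimal hypothetical sketch that lands structured, semi-structured, and unstructured examples in the same HDFS-backed lake; the /lake layout and file contents are invented. Because the data then lives in the same distributed file system that the compute nodes use, applications can run against local replicas instead of pulling from remote storage.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LandMixedFormats {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Structured: a CSV export from an operational database.
        write(fs, "/lake/structured/orders.csv", "id,amount\n1,99.50\n");
        // Semi-structured: raw JSON events from an application log.
        write(fs, "/lake/semistructured/event.json", "{\"user\":1,\"action\":\"click\"}\n");
        // Unstructured: free text, images, etc., stored as-is.
        write(fs, "/lake/unstructured/note.txt", "free-form text payload\n");

        fs.close();
    }

    private static void write(FileSystem fs, String path, String body) throws Exception {
        try (FSDataOutputStream out = fs.create(new Path(path))) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
    }
}
```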