A JVM language: JVM languages currently account for a very large share of the big data ecosystem, and it is no exaggeration to say that they hold something close to a monopoly. I recommend learning Java or Scala. Languages such as Clojure are harder to get started with, so I do not recommend them as a first choice. This is also an era in which, as the Chinese idiom has it, "a mother is honored through her child": a successful framework drives the popularity of the language it is written in, as Docker did for Go and Kafka did for Scala. So I suggest you master at least one JVM language. Just as important, you must understand that language's multithreading model and memory model. The processing model of many big data frameworks closely resembles multithreaded processing at the language level; the framework simply extends it across many machines.
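To make the last point concrete, here is a minimal sketch of the split, compute, and combine pattern on a single machine, using only the Java standard library. The class name `ParallelSum` and its `sum` method are hypothetical names chosen for illustration; a distributed framework applies the same idea with machines in place of threads, plus fault tolerance and data movement that this sketch does not attempt.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration class, not part of any framework.
class ParallelSum {
    // Split the input into chunks, sum each chunk on a worker thread,
    // then combine the partial sums on the calling thread.
    static long sum(int[] data, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            int chunk = (data.length + workers - 1) / workers; // ceiling division
            List<Future<Long>> partials = new ArrayList<>();
            for (int start = 0; start < data.length; start += chunk) {
                final int from = start;
                final int to = Math.min(start + chunk, data.length);
                Callable<Long> task = () -> {
                    long s = 0;
                    for (int i = from; i < to; i++) {
                        s += data[i];
                    }
                    return s;
                };
                partials.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : partials) {
                total += f.get(); // blocks until that chunk has been summed
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] data = new int[1000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1;
        }
        System.out.println(sum(data, 4)); // prints 500500
    }
}
```

The partial sums are handed back through `Future.get()`, which is also where the memory model matters: results written by a worker thread are guaranteed to be visible to the caller once `get()` returns.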
Compute framework: Strictly speaking, this category divides into offline batch processing and stream processing. Stream processing is where the field is heading, and I suggest you learn it. Offline batch processing is close to outdated: it assumes the input is bounded and cannot handle unbounded data sets, so its range of applications shrinks by the day. Google has in fact publicly moved away from MapReduce-style offline processing inside the company. Therefore, if you want to learn big data engineering, you need to master a real-time stream processing framework. The mainstream frameworks at present are Apache Samza, Apache Storm, Apache Spark Streaming, and Apache Flink, which has been in the limelight in recent years. Apache Kafka has also launched its own stream processing framework, Kafka Streams.
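The difference from batch processing can be sketched in a few lines: a stream processor updates its result as each event arrives instead of waiting for the input to end. The example below counts events per fixed-size (tumbling) time window in plain Java. `TumblingWindowCounter` is a hypothetical class written for this illustration and is not the API of any of the frameworks named above; it assumes non-negative timestamps and ignores the concerns real frameworks handle, such as late events, state checkpointing, and distribution.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class, not the API of Flink, Storm, or Kafka Streams.
class TumblingWindowCounter {
    private final long windowMillis;
    // Window start time -> number of events seen in that window so far.
    private final Map<Long, Long> counts = new TreeMap<>();

    TumblingWindowCounter(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Called once per event as it arrives; the input never has to "finish".
    // Assumes eventTimeMillis is non-negative.
    void onEvent(long eventTimeMillis) {
        long windowStart = eventTimeMillis - (eventTimeMillis % windowMillis);
        counts.merge(windowStart, 1L, Long::sum);
    }

    // Copy of the current per-window counts, ordered by window start.
    Map<Long, Long> snapshot() {
        return new TreeMap<>(counts);
    }

    public static void main(String[] args) {
        TumblingWindowCounter counter = new TumblingWindowCounter(1000);
        long[] eventTimes = {100, 250, 999, 1000, 1500, 3200};
        for (long t : eventTimes) {
            counter.onEvent(t);
        }
        System.out.println(counter.snapshot()); // prints {0=3, 1000=2, 3000=1}
    }
}
```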
Distributed storage framework: Although MapReduce is somewhat outdated, HDFS, the other cornerstone of Hadoop, remains strong and is the most popular distributed storage system in the open source community. You should definitely take the time to study it. If you want to go deeper, you must also read Google's GFS paper (/media/research.google.com/en//archive/GFS-sosp2003.pdf). There are of course many other distributed storage systems in the open source world, and Alibaba's OceanBase, from China, is also an excellent one.
Resource scheduling framework: Docker has been extremely popular over the past year or two, and every company is trying to build container solutions on top of it. The best-known open source container scheduling framework is Kubernetes (K8s), but Hadoop's YARN and Apache Mesos are also well known. The latter two can schedule not only container clusters but also non-container clusters, which makes them well worth learning.
Distributed coordination framework: Every mainstream distributed big data framework needs a common set of functions, such as service discovery, leader election, distributed locks, and key-value storage. These needs gave rise to distributed coordination frameworks. The oldest and best known is Apache ZooKeeper; newer ones include Consul and etcd. If you are learning big data engineering, you cannot ignore distributed coordination frameworks, and you need to understand them in some depth.
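The core of leader election and distributed locking is an atomic "create this entry only if it does not exist yet" operation. The sketch below imitates that contract inside a single JVM. `LeaderElectionSketch` and its methods are hypothetical names, and the class is only an in-process stand-in: it does not use the ZooKeeper, Consul, or etcd client APIs, and it omits what makes the real problem hard, namely network failures and releasing leadership automatically when the holder crashes (ZooKeeper uses ephemeral znodes tied to a client session for that).

```java
// Hypothetical in-process stand-in for a coordination service's leader entry.
class LeaderElectionSketch {
    private String leader; // null means nobody currently holds leadership

    // Succeeds only for the first candidate that finds the entry empty.
    synchronized boolean tryAcquire(String candidateId) {
        if (leader == null) {
            leader = candidateId;
            return true;
        }
        return false;
    }

    // Only the current leader may step down.
    synchronized boolean release(String candidateId) {
        if (candidateId.equals(leader)) {
            leader = null;
            return true;
        }
        return false;
    }

    synchronized String currentLeader() {
        return leader;
    }

    public static void main(String[] args) {
        LeaderElectionSketch election = new LeaderElectionSketch();
        System.out.println(election.tryAcquire("node-a")); // prints true
        System.out.println(election.tryAcquire("node-b")); // prints false
        election.release("node-a");
        System.out.println(election.tryAcquire("node-b")); // prints true
    }
}
```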
KV database: Memcached and Redis are the typical examples, and Redis in particular is developing rapidly. Its concise API design and high throughput are winning over more and more users. Even if you are not learning big data, Redis is well worth learning.
Column storage database: I spent a long time studying Oracle, but I have to admit that relational databases have gradually faded from the spotlight, and there are now many alternatives to the RDBMS. Because row-oriented storage is poorly suited to ad hoc queries over big data, column-oriented storage was developed; the typical column store in the open source community is HBase. The storage concept behind it also comes from a Google paper, the Bigtable paper, which is worth reading if you are interested.
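Why row storage struggles with ad hoc analytical queries can be shown with two in-memory layouts. The sketch below is a generic illustration of row-oriented versus column-oriented layout, not a description of HBase's actual storage format; the `Order` record and all field names are hypothetical.

```java
// Hypothetical illustration of row-oriented vs. column-oriented layout.
class RowVsColumn {
    // Row layout: all fields of one record are stored together.
    static final class Order {
        final long id;
        final String customer;
        final double amount;

        Order(long id, String customer, double amount) {
            this.id = id;
            this.customer = customer;
            this.amount = amount;
        }
    }

    // Column layout: one array per field, aligned by position.
    static final class OrderColumns {
        final long[] ids;
        final String[] customers;
        final double[] amounts;

        OrderColumns(long[] ids, String[] customers, double[] amounts) {
            this.ids = ids;
            this.customers = customers;
            this.amounts = amounts;
        }
    }

    // Summing one field still walks every full record.
    static double sumAmountRowStore(Order[] rows) {
        double total = 0;
        for (Order o : rows) {
            total += o.amount;
        }
        return total;
    }

    // Summing one field scans a single contiguous array and skips the rest.
    static double sumAmountColumnStore(OrderColumns cols) {
        double total = 0;
        for (double a : cols.amounts) {
            total += a;
        }
        return total;
    }

    public static void main(String[] args) {
        Order[] rows = {
            new Order(1, "alice", 10.0),
            new Order(2, "bob", 20.5),
            new Order(3, "carol", 5.25)
        };
        OrderColumns cols = new OrderColumns(
            new long[] {1, 2, 3},
            new String[] {"alice", "bob", "carol"},
            new double[] {10.0, 20.5, 5.25});
        System.out.println(sumAmountRowStore(rows));    // prints 35.75
        System.out.println(sumAmountColumnStore(cols)); // prints 35.75
    }
}
```

On disk the same effect is larger: a query that reads one column out of many can skip most of the stored bytes, and values of a single type stored together tend to compress better.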
Message queue: As the main system for "peak shaving and valley filling" (smoothing out bursts of load) in big data pipelines, a message queue is essential. There are many solutions in this field, including ActiveMQ and Kafka, and in China Alibaba has open-sourced RocketMQ. Apache Kafka is the best of them. Many of Kafka's design ideas are especially well suited to distributed stream data processing. It is no wonder that Jay Kreps, one of Kafka's original authors, is regarded as a leading figure in real-time stream processing.
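"Peak shaving" means placing a buffer between a bursty producer and a consumer that drains at its own pace. A minimal in-process sketch with a bounded `BlockingQueue` from the Java standard library follows; `PeakShavingSketch` and `run` are hypothetical names. This is an assumption-laden analogy only: a real message queue such as Kafka persists messages and runs across machines, while here a full buffer simply makes the producer wait.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical in-process analogy for a message queue between two services.
class PeakShavingSketch {
    // Sends `messages` integers through a buffer of size `capacity`
    // and returns them in the order the consumer processed them.
    static List<Integer> run(int messages, int capacity) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        List<Integer> processed = new ArrayList<>();

        Thread consumer = new Thread(() -> {
            try {
                for (int i = 0; i < messages; i++) {
                    processed.add(queue.take()); // waits while the buffer is empty
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        consumer.start();

        try {
            for (int i = 0; i < messages; i++) {
                queue.put(i); // waits while the buffer is full, absorbing the burst
            }
            consumer.join(); // after join, the consumer's writes are visible here
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return processed;
    }

    public static void main(String[] args) {
        List<Integer> result = run(100, 8);
        System.out.println(result.size()); // prints 100
    }
}
```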