First of all, you need to know the Java language and the Linux operating system. These two are the foundation for learning big data, and you can learn them in either order. A Java background is undoubtedly an excellent start and cornerstone. You could say that someone with it is already ahead at the starting line, and will pick up and absorb big data knowledge more easily than most people.
Java: As long as you know the basics, you don't need deep Java expertise to do big data. Learning Java SE is enough to give you the foundation for learning big data.
Linux: Because big data software runs on Linux, you should learn Linux well. A solid grasp of Linux helps you master big data technologies quickly, and helps you understand the runtime environment and network configuration of big data software such as Hadoop, Hive, HBase and Spark. It saves you from many pitfalls, and once you can read shell scripts you will find it easier to understand and configure big data clusters. It also lets you pick up new big data technologies faster in the future.
OK, with the basics covered, let's talk about which big data technologies you need to learn. You can learn them in the order I list them.
Hadoop: This is a popular big data processing platform that has become almost synonymous with big data, so it is a must. Hadoop includes several components: HDFS, MapReduce and YARN. HDFS is where the data is stored, just like the hard disk of your computer. MapReduce processes and computes on the data. Its characteristic is that, however large the data set is, it will get through all of it given enough time, but it may not be fast, which is why it is called batch processing.
Remember: once you have learned this much, you can treat it as a milestone in your big data studies.
Zookeeper: It is a jack-of-all-trades. You will use it when setting up Hadoop HA, and you will use it again later with HBase. It is generally used to store coordination information, which is small and generally does not exceed 1 MB. Every piece of software that uses it depends on it. For us personally, it is enough to install it correctly and keep it running normally.
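To make the MapReduce idea a bit more concrete, here is a minimal sketch of the map, shuffle and reduce steps as a word count in plain Python. It runs in a single process and does not use Hadoop at all; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration.

```python
from collections import defaultdict


def map_phase(line):
    # Map: emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)


def word_count(lines):
    # Run the three steps one after another over the whole input.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data node"]))
```

On a real cluster the map and reduce steps run in parallel on many machines over data stored in HDFS, but the shape of the computation is the same.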
MySQL: Now that we have covered processing big data, we need to learn the MySQL database, a tool for processing small data, because it will be needed later when installing Hive. How well do you need to know MySQL? You should be able to install it on Linux, get it running, configure simple permissions, change the root password, and create a database. The main thing to learn here is SQL syntax, because Hive's syntax is very similar.
Sqoop: This is used to import data from MySQL into Hadoop. Of course, you can also do without it and export a MySQL table directly to a file and put that file on HDFS. Either way, in a production environment you should watch the load you put on MySQL.
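As a rough illustration of the kind of SQL worth practising, here is a small sketch that creates a table, inserts a few rows and runs a GROUP BY query. It uses Python's built-in sqlite3 module as a stand-in so that it runs without a MySQL server; the table name orders and its columns are invented for the example, and MySQL's own data types and client libraries differ in the details.

```python
import sqlite3


def total_per_city():
    # An in-memory SQLite database stands in for a MySQL server here.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, city TEXT, amount INTEGER)"
    )
    cur.executemany(
        "INSERT INTO orders (city, amount) VALUES (?, ?)",
        [("Beijing", 30), ("Shanghai", 20), ("Beijing", 50)],
    )
    # Aggregate with GROUP BY: the same style of query is common in Hive.
    cur.execute(
        "SELECT city, SUM(amount) AS total FROM orders GROUP BY city ORDER BY city"
    )
    rows = cur.fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for city, total in total_per_city():
        print(city, total)
```

If you are comfortable writing queries like the SELECT above, the step to Hive later on should feel small.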
Hive: This is a godsend for anyone who knows SQL syntax. It makes dealing with big data easy, and you no longer have to labor over writing MapReduce programs. Some people ask, what about Pig? It is much the same kind of tool as Hive. Mastering one of them is enough.
Oozie: Now that you have learned Hive, I'm sure you will need this. It can manage your Hive, MapReduce and Spark scripts, check whether your programs ran correctly, alert you when something goes wrong, retry a program for you, and most importantly, let you configure dependencies between tasks. I'm sure you'll like it; otherwise you'll feel miserable staring at that pile of scripts and a dense wall of crond entries.
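The two ideas mentioned above, task dependencies and retries, can be sketched in a few lines of Python. This is not Oozie's API (Oozie has its own workflow definitions); it is only a toy illustration, and the names run_order and run_with_retry as well as the task names are hypothetical. It assumes Python 3.9 or later for the graphlib module.

```python
from graphlib import TopologicalSorter


def run_order(dependencies):
    # dependencies maps each task to the set of tasks that must finish first.
    # static_order() yields the tasks so that prerequisites always come first.
    return list(TopologicalSorter(dependencies).static_order())


def run_with_retry(job, max_attempts=3):
    # Call job() until it succeeds or the attempts are used up.
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception as exc:
            print(f"attempt {attempt} failed: {exc}")
    raise RuntimeError(f"job failed after {max_attempts} attempts")


if __name__ == "__main__":
    deps = {
        "import_data": set(),
        "hive_query": {"import_data"},
        "export_result": {"hive_query"},
    }
    for task in run_order(deps):
        run_with_retry(lambda: print("running", task))
```

A scheduler does this kind of bookkeeping for you, so you don't have to encode the ordering by hand in crond start times.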
HBase: This is the NoSQL database of the Hadoop ecosystem. Its data is stored as keys and values. Keys are unique, so it can be used to deduplicate data. Compared with MySQL, it can store far more data. For that reason it is often used as the storage destination once big data processing is finished.
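Why unique keys help with duplicates can be shown with a tiny sketch. A plain Python dict stands in for a table here; real HBase also has column families and cell versions, which this ignores, and the function name put_all and the row keys are made up for the example.

```python
def put_all(records):
    # A dict stands in for a key-value table: each key appears at most once.
    table = {}
    for row_key, value in records:
        # Writing an existing key replaces the old value instead of adding a row.
        table[row_key] = value
    return table


if __name__ == "__main__":
    records = [("user1", "a"), ("user2", "b"), ("user1", "c")]
    print(put_all(records))
```

Three records go in, but only two keys come out, because the second write to user1 replaces the first.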
Kafka: This is a handy queuing tool. What is a queue for? You line up to buy tickets, right? When there is too much data, it also needs to line up to be processed, so that the colleagues who work with you don't start yelling: why did you hand me so much data (for example, files hundreds of gigabytes in size), what am I supposed to do with it? Don't blame him just because he doesn't work with big data. You can tell him that you have put the data in a queue and he can take it piece by piece as he uses it. Then he will stop complaining and go optimize his program right away, because if he can't keep up, that is his problem, not a problem with what you gave him. Of course, you can also use this tool to load online real-time data into storage or HDFS. For this you can pair it with a tool called Flume, which is built to do simple processing on data and write it to various data receivers (such as Kafka).
Spark: This makes up for the slow data processing of MapReduce. Its characteristic is that it loads data into memory for computation, instead of reading from a hard disk that is painfully slow and whose technology is also evolving particularly slowly. It is especially well suited to iterative computation, so algorithm people are particularly fond of it. It is written in Scala. You can work with it in either Java or Scala, because both run on the JVM.
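The "put it in a queue and take it piece by piece" idea can be sketched with Python's standard library. This is only an in-process stand-in using queue.Queue and threads, not Kafka itself and not a Kafka client; the names producer, consumer and run_pipeline are hypothetical, and the doubling step is just a placeholder for real processing.

```python
import queue
import threading


def producer(q, items):
    # Put items on the queue one by one; put() waits while the queue is full.
    for item in items:
        q.put(item)
    q.put(None)  # sentinel telling the consumer there is nothing more


def consumer(q, out):
    # Take items off the queue at the consumer's own pace.
    while True:
        item = q.get()
        if item is None:
            break
        out.append(item * 2)  # placeholder for real processing


def run_pipeline(items, capacity=2):
    # A small bounded queue decouples the fast side from the slow side.
    q = queue.Queue(maxsize=capacity)
    out = []
    t_prod = threading.Thread(target=producer, args=(q, items))
    t_cons = threading.Thread(target=consumer, args=(q, out))
    t_prod.start()
    t_cons.start()
    t_prod.join()
    t_cons.join()
    return out


if __name__ == "__main__":
    print(run_pipeline([1, 2, 3, 4, 5]))
```

The consumer never sees more than it can handle at once, which is the point of putting a queue between the two sides.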