PS: I'm a novice. If you spot any mistakes, please leave a comment below!
Hadoop Distributed File System (HDFS) is a distributed file system with high fault tolerance and high throughput, designed for processing massive data sets.
An HDFS cluster often consists of hundreds of machines, each storing part of the overall data set, so detecting failures and recovering from them quickly is a core goal of HDFS.
The HDFS interface is designed for high throughput rather than low latency.
HDFS supports very large data sets; a single cluster can typically hold tens of millions of files.
HDFS applications need a write-once-read-many access model: a file is written once and read many times, and file modification supports only appending and truncation.
Because of the massive data volumes and this simple access model, it is more efficient to move computation to where the data lives than to move the data to the computation.
HDFS is portable across platforms.
HDFS adopts a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates client access to files, and a number of DataNodes, usually one per machine, which manage the storage attached to the machines they run on. HDFS exposes a file system namespace and lets user data be stored in files. Internally, a file is split into one or more blocks, which are stored on a set of DataNodes. The NameNode executes namespace operations such as opening, closing and renaming files and directories, and records the mapping of blocks to DataNodes. DataNodes serve read and write requests from clients and handle block-related operations.

The NameNode and DataNodes typically run on GNU/Linux. HDFS is written in Java, so the NameNode and DataNodes can run on any machine that supports Java, and Java's portability means HDFS can be deployed on a wide variety of machines. A typical deployment runs the NameNode on one machine, with each of the other machines running one DataNode (running more than one per machine is possible but very rare). Having a single NameNode greatly simplifies the architecture of the system: it stores all HDFS metadata, and user data never flows through it. The following figure shows the HDFS architecture:
HDFS supports a traditional hierarchical file organization: a user or application can create directories and store files inside them. The namespace is similar to that of most other file systems and supports creating, deleting, moving and renaming files. HDFS supports user quotas and access permissions, but does not support soft (symbolic) or hard links, although nothing prevents users from implementing links themselves. The NameNode maintains the namespace, and almost any change to the namespace is recorded by the NameNode. An application can specify how many replicas of a file HDFS should keep; this number is called the replication factor and is recorded by the NameNode.
HDFS stores each file as a sequence of blocks, with a per-file block size and replication factor to support fault tolerance. All blocks in a file except the last are the same size, though variable-length blocks are also supported. The replication factor is specified when the file is created and can be changed later. A file can have only one writer at any time. The NameNode makes all decisions about block replication: it periodically receives a heartbeat and a block report from each DataNode. The heartbeat indicates that the DataNode is operating normally, and the block report lists all the blocks stored on that DataNode.
The replica placement policy is very important for the reliability and performance of HDFS. To improve data reliability and availability and to make full use of network bandwidth, HDFS introduced a rack-aware replica placement policy; this is only the first step, and it lays the foundation for later optimization. Large HDFS clusters usually span many racks, and data transfer between two nodes in the same rack is generally faster than between nodes on different racks. A simple policy would store each replica on a separate rack, which prevents data loss when an entire rack fails and makes use of bandwidth across racks, but it increases the cost of writes. With the common replication factor of 3, the HDFS policy is to place the first replica on the local machine (or on a random DataNode in the same rack), and the other two replicas on two different DataNodes in a single different (remote) rack. The NameNode never allows the same DataNode to hold more than one replica of the same block. Building on rack awareness, a policy that also takes storage types into account will be supported in the future: put simply, on top of rack-aware placement it checks whether a DataNode supports the storage type required for the file, and if not, moves on to the next candidate.
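The default 3-replica placement described above can be sketched as follows. This is a simplified toy model, not HDFS's real `BlockPlacementPolicy`; the topology layout and all node names are made up for illustration:

```python
import random

def choose_replica_nodes(writer_node, nodes_by_rack, replication=3):
    """Toy sketch of 3-replica placement: replica 1 on the writer's node,
    replicas 2 and 3 on two different DataNodes in one remote rack."""
    writer_rack = next(r for r, nodes in nodes_by_rack.items()
                       if writer_node in nodes)
    targets = [writer_node]
    # Pick a single different (remote) rack for the remaining replicas.
    remote_rack = random.choice([r for r in nodes_by_rack if r != writer_rack])
    # The NameNode never places two replicas of a block on the same DataNode,
    # so the remaining targets are sampled without replacement.
    remote_nodes = nodes_by_rack[remote_rack]
    targets += random.sample(remote_nodes, min(replication - 1, len(remote_nodes)))
    return targets

topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4", "dn5"]}
print(choose_replica_nodes("dn1", topology))  # e.g. ['dn1', 'dn4', 'dn3']
```

Note how the policy trades a little rack diversity (two racks instead of three) for a cheaper write pipeline, exactly as discussed above.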
HDFS reads data by the proximity principle: it first looks for a replica on the same rack as the reader, then in the local data center, and finally falls back to a remote replica.
At startup, the NameNode enters safe mode, during which no block replication takes place. The NameNode receives heartbeats and block reports from the DataNodes. Each block has a configurable minimum number of replicas; once the NameNode has registered that minimum number of replicas for a block, the block is considered safely replicated. When the proportion of safely replicated blocks reaches a configurable percentage, the NameNode waits another 30 seconds and then exits safe mode, after which it determines which blocks (if any) still have fewer than the minimum number of replicas and replicates them.
The NameNode uses a transaction log called the EditLog to persistently record every change to file system metadata (such as creating a file or changing a replication factor), and a file called the FsImage to store the entire file system namespace, including the mapping of blocks to files and file system properties. Both the EditLog and the FsImage are stored in the NameNode's local file system. The NameNode also keeps a snapshot of the namespace and block map in memory. When the NameNode starts up, or when a configured threshold is reached, it reads the FsImage and EditLog from disk, applies the EditLog transactions to the in-memory FsImage, flushes the new version of the FsImage to disk, and then truncates the EditLog records it has processed. This process is called a checkpoint. Its purpose is to ensure that HDFS has a consistent view of metadata: changes are observed through the in-memory snapshot and persisted to the on-disk FsImage. Checkpoints are governed by two configuration parameters: a time period (dfs.namenode.checkpoint.period) and a number of file system transactions (dfs.namenode.checkpoint.txns). When both are configured, meeting either condition triggers a checkpoint.
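The checkpoint process above can be sketched in a few lines. This is a toy model: the FsImage is a plain dict, the EditLog a list of records, and the operation names are invented for illustration:

```python
def checkpoint_due(seconds_since_last, txns_since_last,
                   period=3600, txns=1_000_000):
    """Either condition alone triggers a checkpoint, mirroring
    dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns."""
    return seconds_since_last >= period or txns_since_last >= txns

def checkpoint(fsimage, editlog):
    """Replay each EditLog record into the in-memory FsImage snapshot,
    then truncate the processed records (persisting is omitted here)."""
    for op, path, value in editlog:
        if op == "create":
            fsimage[path] = value
        elif op == "set_replication":
            fsimage[path]["replication"] = value
        elif op == "delete":
            fsimage.pop(path, None)
    editlog.clear()  # truncate: the records are now reflected in FsImage
    return fsimage
```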
All HDFS communication protocols sit on top of TCP/IP. A client connects to a configurable TCP port on the NameNode machine and speaks the Client Protocol with it; DataNodes talk to the NameNode using the DataNode Protocol. A Remote Procedure Call (RPC) abstraction wraps both protocols. By design, the NameNode never initiates requests; it only responds to RPC requests from clients or DataNodes.
The core goal of HDFS is to store data reliably, even in the presence of failures. The three common failure types are NameNode failures, DataNode failures, and network partitions.
A network partition can cut off a subset of DataNodes from the NameNode. The NameNode detects this through missing heartbeats and marks DataNodes without recent heartbeats as dead; all the data registered on those DataNodes becomes unavailable, which may leave some blocks with fewer replicas than their configured replication factor. The NameNode constantly tracks which blocks need re-replication and starts replication whenever necessary. Re-replication can be triggered in several situations: a DataNode becomes unavailable, a replica becomes corrupted, a disk on a DataNode fails, or the replication factor of a file is increased. To avoid replication storms caused by DataNodes flapping in and out of contact, the timeout before a DataNode is marked dead is conservatively long (10 minutes by default); users can also configure a shorter interval after which DataNodes are marked stale, so that stale nodes are avoided in reads and writes that need high performance.
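The live/stale/dead distinction is just two timeouts applied to the age of the last heartbeat. A minimal sketch, assuming a 30-second stale interval and the 10-minute dead timeout mentioned above (both are configurable in a real cluster):

```python
def datanode_state(seconds_since_heartbeat,
                   stale_interval=30.0,   # assumed shorter "stale" interval
                   dead_timeout=600.0):   # the ~10 min default dead timeout
    """Classify a DataNode from the age of its last heartbeat.
    Stale nodes are deprioritized; dead nodes trigger re-replication."""
    if seconds_since_heartbeat >= dead_timeout:
        return "dead"
    if seconds_since_heartbeat >= stale_interval:
        return "stale"
    return "live"
```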
The HDFS architecture is compatible with various data rebalancing schemes. One scheme might move data from one DataNode to another when the first node's free space falls below a threshold; another might respond to a sudden spike in read demand for a particular file by creating extra replicas and rebalancing other data across the cluster. These kinds of rebalancing schemes have not yet been implemented (it is not clear to me what schemes exist today ...).
Problems in storage devices, the network or software can cause a client to fetch corrupted data from a DataNode. HDFS clients implement checksum verification of file contents: when a client creates a file, it computes a checksum for each block and stores the checksums in the same HDFS namespace. When a client retrieves the file, it verifies each block against its stored checksum; if the check fails, the client fetches a replica of that block from another DataNode.
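The verify-and-fall-back logic can be sketched as follows. This is a simplification: a plain CRC32 stands in for the real per-chunk checksums, and the replica list stands in for contacting other DataNodes:

```python
import zlib

def block_checksums(blocks):
    # On write, the client computes a checksum per block
    # (CRC32 here as a stand-in for HDFS's real checksums).
    return [zlib.crc32(b) for b in blocks]

def read_block(replicas, expected_checksum):
    """Try each replica of a block in turn; return the first whose
    contents match the stored checksum, skipping corrupted copies."""
    for data in replicas:
        if zlib.crc32(data) == expected_checksum:
            return data
    raise IOError("all replicas of this block are corrupted")
```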
The FsImage and EditLog are core data structures of HDFS, and corrupting them can render the whole file system unusable. The NameNode can therefore be configured to maintain multiple copies of the FsImage and EditLog, updating all copies synchronously whenever they change. Another option is to use shared storage on NFS, or a distributed edit log, to support multiple NameNodes; the distributed edit log is the officially recommended approach.
Snapshots record a copy of the data at a particular point in time, so that HDFS can be rolled back to a previous known-good state when an error occurs.
The application scenario for HDFS is large data sets: data is written once but read one or more times, and read at streaming speeds. The typical block size is 128 MB, so a file is chopped into 128 MB blocks, and each block may be placed on a different DataNode.
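Splitting a file into fixed-size blocks is straightforward; the sketch below shows the block boundaries for a file of a given size, with every block except possibly the last at the full 128 MB:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # the default 128 MB block size

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Yield (offset, length) for each block of a file; only the
    last block may be shorter than the block size."""
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        yield (offset, length)
        offset += length

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
print(list(split_into_blocks(300 * 1024 * 1024)))
```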
When a client writes a file with a replication factor of 3, the NameNode uses its target-selection algorithm to return the set of DataNodes that will hold the replicas. The client streams the data to the first DataNode in portions; the first DataNode stores each portion locally and forwards it to the second DataNode, which likewise stores it locally and forwards it to the third DataNode, which stores it locally. This is pipelined replication.
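The store-and-forward pipeline can be modelled with a short recursion. A toy sketch with made-up node names, using an in-memory dict in place of each DataNode's disk:

```python
def forward(part, pipeline, storage):
    """Each DataNode stores the part locally, then forwards it to the
    next DataNode downstream (if any)."""
    if not pipeline:
        return
    head, rest = pipeline[0], pipeline[1:]
    storage.setdefault(head, []).append(part)  # store locally
    forward(part, rest, storage)               # forward downstream

def pipeline_write(parts, datanodes, storage):
    # The client pushes each part into the head of the pipeline;
    # replication to the other DataNodes happens via forwarding.
    for part in parts:
        forward(part, datanodes, storage)

store = {}
pipeline_write([b"part1", b"part2"], ["dn1", "dn2", "dn3"], store)
```

The point of the pipeline is that the client uploads each part only once; the DataNodes themselves fan the data out, which spreads the write bandwidth across the cluster.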
HDFS can be accessed in a variety of ways: a Java API for applications, a C language wrapper around that Java API, and a REST API; an HDFS instance can also be browsed directly with a web browser. Through the NFS gateway, a client can mount HDFS as part of its local file system.
HDFS organizes data into directories and files and provides a command-line interface called the FS shell. Here are a few simple commands:
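For example (the paths and file names below are just placeholders; these commands need a running HDFS cluster):

```shell
hadoop fs -mkdir /user/alice                 # create a directory
hadoop fs -put localfile.txt /user/alice/    # copy a local file into HDFS
hadoop fs -ls /user/alice                    # list a directory
hadoop fs -cat /user/alice/localfile.txt     # print a file's contents
hadoop fs -rm /user/alice/localfile.txt      # delete a file (or move it to trash)
```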
The DFSAdmin command set is used to administer an HDFS cluster; these commands are for cluster administrators only. Here are a few simple commands:
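For example (again, these require administrator privileges on a running cluster):

```shell
hdfs dfsadmin -report            # print basic statistics about the cluster
hdfs dfsadmin -safemode enter    # manually put the NameNode into safe mode
hdfs dfsadmin -safemode leave    # manually leave safe mode
hdfs dfsadmin -refreshNodes      # re-read the allowed/excluded DataNode lists
```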
A typical HDFS installation configures a web server that exposes the HDFS namespace through a configurable TCP port, so a user can browse the namespace and view file contents in a web browser.
If the trash configuration is enabled, files deleted through the FS shell are not removed immediately; instead they are first moved to a per-user trash directory (/user/
When the replication factor of a file is reduced, the NameNode selects excess replicas to delete and passes the deletion instruction to each DataNode in its reply to that node's heartbeat. As with deletion above, it can take some time before the freed space becomes visible on the cluster.
[Figure: HDFS architecture]