An event can be a collection of the following macros:
Advantages of epoll over Select/Poll:
The kernel code related to epoll is in the fs/eventpoll.c file. The kernel implementations of the three functions epoll_create, epoll_ctl and epoll_wait are analyzed here in turn. The Linux kernel source used in the analysis is version 4.1.2.
epoll_create is used to create the epoll handle. It is implemented in the kernel as follows:
sys_epoll_create:
It can be seen that when we call epoll_create, the passed-in size parameter is only checked for being less than or equal to 0 (in which case -EINVAL is returned); apart from that check it is not used.
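The size check can be observed from user space. The following is a minimal sketch, assuming Linux with glibc; the helper name try_epoll_create is made up for this illustration.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns 0 if epoll_create(size) succeeds,
 * otherwise the errno value reported by the call. */
int try_epoll_create(int size)
{
    int epfd = epoll_create(size);
    if (epfd < 0)
        return errno;
    close(epfd);
    return 0;
}
```

Under these assumptions, try_epoll_create(0) and try_epoll_create(-1) should report EINVAL, while any positive size, small or large, should succeed.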
The whole function has only three lines of code; the real work is done in the sys_epoll_create1 function.
sys_epoll_create -> sys_epoll_create1:
The function flow of sys_epoll_create1 is as follows:
sys_epoll_create -> sys_epoll_create1 -> ep_alloc:
sys_epoll_create -> sys_epoll_create1 -> get_unused_fd_flags:
In the Linux kernel, current is a macro that returns a pointer to a task_struct structure (we call it the process descriptor), which represents the current process. The file resources opened by a process are stored in the files member of the process descriptor, so current->files gives the file resources opened by the current process. The rlimit(RLIMIT_NOFILE) function gets the maximum number of file descriptors that the current process can open. This value can be set, and the default value is typically 1024.
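The same limit that the kernel reads with rlimit(RLIMIT_NOFILE) is visible to a process through getrlimit. The sketch below is a user-space illustration only; the helper name open_file_limits is an assumption of this example.

```c
#include <sys/resource.h>

/* Hypothetical helper: stores the soft and hard RLIMIT_NOFILE limits of the
 * calling process. The soft limit is the value that bounds descriptor
 * allocation. Returns 0 on success, -1 on failure. */
int open_file_limits(rlim_t *soft, rlim_t *hard)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    *soft = rl.rlim_cur;
    *hard = rl.rlim_max;
    return 0;
}
```

On many systems the soft limit comes back as 1024 unless it has been changed (for example with ulimit -n), but the exact value depends on the environment.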
The job of __alloc_fd is to allocate an available file descriptor for the process in the range [start, end) (note: here start is 0 and end is the maximum number of file descriptors that the process can open), so I won't go into details here. The code is as follows:
sys_epoll_create -> sys_epoll_create1 -> get_unused_fd_flags -> __alloc_fd:
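The allocation policy can be observed from user space: the descriptor handed out is the lowest free one. The following sketch is an illustration, not kernel code. It assumes a single-threaded process in which nothing else opens a descriptor between the two calls, and the function name is hypothetical.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: creates an epoll instance, closes it, and creates
 * another one. Returns 1 if the second instance reuses the descriptor
 * number just freed, 0 if not, -1 on error. */
int lowest_fd_is_reused(void)
{
    int first = epoll_create1(0);
    if (first < 0)
        return -1;
    close(first);                 /* the slot becomes free again */

    int second = epoll_create1(0);
    if (second < 0)
        return -1;
    close(second);

    return first == second;
}
```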
Then, sys_epoll_create1 calls anon_inode_getfile to create the file structure, as shown below:
sys_epoll_create -> sys_epoll_create1 -> anon_inode_getfile:
The anon_inode_getfile function first allocates a file structure and a dentry structure, and then hooks them to the anonymous inode anon_inode_inode. Note that the ep variable of the eventpoll structure allocated earlier is passed in when anon_inode_getfile is called, and the private_data member of the new file (file->private_data) will point to this ep variable. After anon_inode_getfile returns, ep->file is set to point to the file structure that the function allocated.
A brief word about file/dentry/inode: when a process opens a file, the kernel allocates a file structure for the process, representing the opened file in the context of that process, and the application then accesses this structure through an int file descriptor. In fact, the kernel maintains an array of file structure pointers for each process, and the file descriptor is the index of the corresponding file structure in that array.
The dentry structure (called a "directory entry") records various attributes of a file, such as the file name and access rights. Each file has only one dentry structure, yet a process can open a file multiple times, and multiple processes can also open the same file. In these cases, the kernel allocates multiple file structures and establishes multiple file contexts. However, no matter how many times the same file is opened, the kernel assigns it only one dentry. Therefore, there is a many-to-one relationship between the file structure and the dentry structure.
In addition to a dentry, each file also has an inode (index node) structure, which records the location and layout of the file on the storage medium; the kernel allocates only one inode per file. The dentry and the inode describe different things. A file may have several file names (for example, through links), and access to the same file through different file names may differ. The dentry represents a file in the logical sense and records its logical attributes, while the inode represents a file in the physical sense and records its physical attributes. There is a many-to-one relationship between dentry and inode structures.
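The many-to-one relationship between file structures and an inode can be illustrated from user space. The sketch below opens one temporary file twice: each open gets its own file structure (and therefore its own offset), while fstat reports the same inode for both. It assumes a writable /tmp directory; the path template and the function name are made up for this example.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if two opens of the same file share one inode
 * but keep independent file offsets, 0 if not, -1 on error. */
int two_opens_share_one_inode(void)
{
    char path[] = "/tmp/epoll_demo_XXXXXX";    /* assumed scratch location */
    int fd1 = mkstemp(path);
    if (fd1 < 0)
        return -1;

    int fd2 = open(path, O_RDONLY);
    if (fd2 < 0) {
        close(fd1);
        unlink(path);
        return -1;
    }

    int result = -1;
    struct stat st1, st2;
    if (write(fd1, "abc", 3) == 3 &&
        fstat(fd1, &st1) == 0 && fstat(fd2, &st2) == 0) {
        off_t off1 = lseek(fd1, 0, SEEK_CUR);  /* advanced by the write */
        off_t off2 = lseek(fd2, 0, SEEK_CUR);  /* untouched: separate file structure */
        result = st1.st_ino == st2.st_ino && st1.st_dev == st2.st_dev &&
                 off1 == 3 && off2 == 0;
    }

    close(fd1);
    close(fd2);
    unlink(path);
    return result;
}
```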
sys_epoll_create -> sys_epoll_create1 -> fd_install:
To summarize the function of epoll_create: calling epoll_create allocates, in the kernel, an eventpoll structure and a file structure representing an epoll file, associates the two structures with each other, and returns an epoll file descriptor fd that is associated with the file structure. When an application operates on epoll, it passes in this epoll file descriptor. From the fd the kernel finds the epoll file structure, and through that file it obtains the eventpoll structure variable allocated by epoll_create, in which all the important epoll-related information is stored. All subsequent operations of the epoll interface functions are performed on this eventpoll structure variable.
So the role of epoll_create is to establish, for the process, a channel from the epoll file descriptor to the eventpoll structure variable in the kernel.
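The channel from descriptor to eventpoll can be pictured with a small toy model. This is a simplified user-space sketch, not the kernel's data structures: every toy_ name, the fixed table size and the use of calloc are assumptions made for illustration. The comments name the kernel step that each line stands in for.

```c
#include <stddef.h>
#include <stdlib.h>

#define TOY_MAX_FDS 8

struct toy_file;

struct toy_eventpoll {
    struct toy_file *file;      /* back-pointer, like ep->file */
};

struct toy_file {
    void *private_data;         /* like file->private_data, points at the eventpoll */
};

/* Per-process descriptor table: the fd is simply the index into this array. */
static struct toy_file *toy_fd_table[TOY_MAX_FDS];

/* Models the steps of sys_epoll_create1; returns the new fd or -1. */
int toy_epoll_create(void)
{
    struct toy_eventpoll *ep = calloc(1, sizeof(*ep));     /* ep_alloc */
    struct toy_file *file = calloc(1, sizeof(*file));      /* anon_inode_getfile */
    int fd = 0;

    while (fd < TOY_MAX_FDS && toy_fd_table[fd] != NULL)   /* get_unused_fd_flags */
        fd++;
    if (ep == NULL || file == NULL || fd == TOY_MAX_FDS) {
        free(ep);
        free(file);
        return -1;
    }
    file->private_data = ep;
    ep->file = file;
    toy_fd_table[fd] = file;                               /* fd_install */
    return fd;
}

/* Follows fd -> file -> private_data, the lookup later epoll calls rely on. */
struct toy_eventpoll *toy_ep_from_fd(int fd)
{
    if (fd < 0 || fd >= TOY_MAX_FDS || toy_fd_table[fd] == NULL)
        return NULL;
    return toy_fd_table[fd]->private_data;
}
```

In this model, toy_ep_from_fd(toy_epoll_create()) hands back the same eventpoll object that the create step allocated, which is the whole point of the association.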
The epoll_ctl interface is used to add/modify/delete the events monitored on a file. The kernel code is as follows:
sys_epoll_ctl:
According to the description of the epoll_ctl interface, op is the epoll operation to perform (add/modify/delete an event), and ep_op_has_event(op) tests whether the operation is something other than a delete. If op != EPOLL_CTL_DEL is true, the copy_from_user function is called to copy the event from user space into the kernel's epds variable, because the delete operation is the only one that does not need the event passed in by the process.
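One visible consequence is that a delete accepts a NULL event pointer, while add and modify must be given a readable event. The sketch below probes this from user space on a pipe; it assumes Linux with glibc, and the helper name ctl_with_null_event is hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical probe: performs epoll_ctl(op) on a pipe's read end with a NULL
 * event pointer. For MOD and DEL the descriptor is first registered with a
 * valid event. Returns 0 on success, otherwise the errno of the failing call. */
int ctl_with_null_event(int op)
{
    int pipefd[2];
    int err = 0;

    if (pipe(pipefd) < 0)
        return errno;

    int epfd = epoll_create1(0);
    if (epfd < 0) {
        err = errno;
    } else {
        struct epoll_event ev = { .events = EPOLLIN };

        if (op != EPOLL_CTL_ADD &&
            epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0)
            err = errno;
        if (err == 0 && epoll_ctl(epfd, op, pipefd[0], NULL) < 0)
            err = errno;          /* the kernel could not copy the event */
        close(epfd);
    }

    close(pipefd[0]);
    close(pipefd[1]);
    return err;
}
```

With EPOLL_CTL_DEL the call should return 0; with EPOLL_CTL_ADD or EPOLL_CTL_MOD the failed copy from user space should surface as EFAULT.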
Then call fdget twice in a row to get the file structure variables of the epoll file and the monitored file (hereinafter referred to as the target file) (note: this function returns the fd structure variables, and the fd structure contains the file structure).
The next step is to check the parameters. In the following cases, you can think that there is something wrong with the passed-in parameters and return an error directly:
Of course, if it is an addition operation, there are still some judgments about the operation. I won't explain it here. It's relatively simple and you can read it yourself.
A red-black tree is maintained in ep. Every time an event is registered, a variable of the epitem structure is allocated to represent the listening item, and it is inserted into ep's red-black tree. In epoll_ctl, the ep_find function is called to look up, in ep's red-black tree, the listening item for the target file; the returned listening item may be empty.
Next, the code in the switch block is the core of the entire epoll_ctl function. The switch on op covers three cases: adding (EPOLL_CTL_ADD), deleting (EPOLL_CTL_DEL) and modifying (EPOLL_CTL_MOD). Here I take addition as an example. The other two cases are similar: once you know how a listening event is added, deleting and modifying can be worked out by analogy.
When adding a listening event for a target file, the target file must not already be monitored in the current ep. If it is (epi is not empty), an -EEXIST error is returned. Otherwise the parameters are valid, the POLLERR and POLLHUP events are added to the target file's event mask by default, and the ep_insert function is called to insert the target file's listening item into the red-black tree maintained by ep.
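Both behaviours of the add path, the duplicate check and the implicit error/hang-up events, can be seen from user space. The following is a sketch under the assumption of Linux with glibc; both function names are made up for this example.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: adds the same descriptor twice. Returns the errno of the
 * second add (0 if it unexpectedly succeeded), or -1 if the setup failed. */
int second_add_errno(void)
{
    int pipefd[2], result = -1;

    if (pipe(pipefd) < 0)
        return -1;
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN };

    if (epfd >= 0 && epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0)
        result = epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) < 0 ? errno : 0;

    if (epfd >= 0)
        close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return result;
}

/* Hypothetical demo: registers a pipe's read end with an empty event mask,
 * then closes the write end. Returns the events epoll reports (0 if none). */
uint32_t events_for_empty_mask(void)
{
    int pipefd[2];
    uint32_t reported = 0;

    if (pipe(pipefd) < 0)
        return 0;
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = 0 }, out;

    if (epfd >= 0 && epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0) {
        close(pipefd[1]);             /* hang-up on the read end */
        pipefd[1] = -1;
        if (epoll_wait(epfd, &out, 1, 0) == 1)
            reported = out.events;
    }

    if (epfd >= 0)
        close(epfd);
    close(pipefd[0]);
    if (pipefd[1] >= 0)
        close(pipefd[1]);
    return reported;
}
```

The second add should fail with EEXIST, and the hang-up should be reported as EPOLLHUP even though the caller asked for no events at all.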
sys_epoll_ctl -> ep_insert:
As mentioned above, the monitoring of a target file is maintained by a listening item of the epitem structure, so ep_insert first calls kmem_cache_alloc to allocate an epitem from the slab allocator and then initializes it. There is not much to say here. Next, let's look at the call to ep_item_poll:
sys_epoll_ctl -> ep_insert -> ep_item_poll:
The ep_item_poll function calls the poll function of the target file, which points to different functions for different kinds of target files. If the target file is a socket, poll points to sock_poll, and if it is a TCP socket, the function that ends up being called is tcp_poll. Although poll may point to different functions, they all do the same job: obtain the event bits currently raised on the target file, and bind the listening item to the target file's poll hook (most importantly, by registering the ep_ptable_queue_proc callback function). ep_ptable_queue_proc is invoked during this poll call and installs ep_poll_callback on the target file's wait queue, so that ep_poll_callback is called when the target file generates an event in the future.
Next, list_add_tail_rcu is called to add the current listening item to the target file's f_ep_links list, that is, the target file's list of epoll hooks; every listening item that monitors the target file is added to this list.
Then ep_rbtree_insert is called to add the epi listening item to the red-black tree maintained by ep. No further explanation is needed here; the code is as follows:
sys_epoll_ctl -> ep_insert -> ep_rbtree_insert:
As mentioned earlier, ep_insert calls ep_item_poll to get the event bits raised on the target file. The events that the process wants to monitor may already have occurred before epoll_ctl was called. If a monitored event is present ((revents & event->events) is true) and the target file's listening item is not yet linked into ep's ready list rdllist, the listening item is added to rdllist, which links the listening items of all ready target files monitored by the epoll descriptor. In addition, if a task is waiting for events, wake_up_locked is called to wake the waiting tasks so that they can handle the event. A process that calls epoll_wait is placed on ep's wq wait queue. The epoll_wait function is explained next.
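The case where the event already exists at registration time is easy to reproduce from user space. In the sketch below, data is written to a pipe before its read end is registered; a non-blocking epoll_wait then finds it ready at once. The function name is hypothetical, and Linux with glibc is assumed.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: returns the number of ready events that a non-blocking
 * epoll_wait reports for a descriptor that was already readable when it was
 * registered, or -1 on error. */
int ready_at_registration(void)
{
    int pipefd[2], n = -1;

    if (pipe(pipefd) < 0)
        return -1;
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN }, out;

    if (epfd >= 0 &&
        write(pipefd[1], "x", 1) == 1 &&                        /* data arrives first */
        epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0)    /* registration second */
        n = epoll_wait(epfd, &out, 1, 0);                       /* timeout 0: no blocking */

    if (epfd >= 0)
        close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```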
To sum up the epoll_ctl function: this function applies for a listening item for the target file according to the monitored events, and hangs the listening item in the red-black tree of the eventpoll structure.
epoll_wait waits for events. The kernel code is as follows:
sys_epoll_wait:
The first is to check some parameters passed in by the process:
After checking all parameters, call the ep_poll function for actual processing:
sys_epoll_wait -> ep_poll:
The first thing ep_poll does is handle the wait time. The timeout is in milliseconds. A timeout greater than 0 means the call waits for at most that long. If the timeout is equal to 0, the function does not block and returns immediately. If it is less than 0, the call blocks indefinitely and does not return until an event occurs.
When no event is available (!ep_events_available(ep) is true), __add_wait_queue_exclusive is called to add the current process to the ep->wq wait queue. Then, in an infinite for loop, set_current_state(TASK_INTERRUPTIBLE) first puts the current process into the interruptible sleep state; the current process then gives up the CPU and goes to sleep, and does not execute the following code until another process calls wake_up or a signal arrives to wake it.
When the process is woken, it first checks whether an event is available, whether the wait has timed out, or whether it was woken by a signal. In any of these cases it breaks out of the loop, removes the current process from the ep->wq wait queue, and sets the current process to the TASK_RUNNING state.
If there is an event, the ep_send_events function is called to transfer the event to user space.
sys_epoll_wait -> ep_poll -> ep_send_events:
ep_send_events does little work itself; the real work is done in the ep_scan_ready_list function:
sys_epoll_wait -> ep_poll -> ep_send_events -> ep_scan_ready_list:
ep_scan_ready_list first moves the items on ep's ready list to a local list txlist, which leaves ep's ready list empty, and sets ep's ovflist to NULL. ovflist is a singly linked list that serves as a backup list for ready events. While the kernel is copying events to user space, the target files may generate new events; these new events are linked into ovflist.
Then the sproc callback function is called (here it is ep_send_events_proc) to copy the event data from the kernel to user space.
sys_epoll_wait -> ep_poll -> ep_send_events -> ep_scan_ready_list -> ep_send_events_proc:
The ep_send_events_proc callback loops over the listening items. For each listening item, it calls ep_item_poll to get the events of the monitored target file. If an event is obtained, it calls __put_user to copy the data to user space.
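Because the events are polled again at delivery time, a listening item whose event has disappeared in the meantime is not reported. The sketch below tries to show this from user space: a byte is written to a registered pipe and read back before epoll_wait is called. The function name is an assumption of this example.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: returns the number of events reported after the only
 * pending byte has already been consumed, or -1 on error. */
int wait_after_drain(void)
{
    int pipefd[2], n = -1;
    char c;

    if (pipe(pipefd) < 0)
        return -1;
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN }, out;

    if (epfd >= 0 &&
        epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0 &&
        write(pipefd[1], "x", 1) == 1 &&     /* the descriptor becomes readable */
        read(pipefd[0], &c, 1) == 1)         /* ...and is drained again */
        n = epoll_wait(epfd, &out, 1, 0);    /* delivery re-checks the descriptor */

    if (epfd >= 0)
        close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return n;
}
```

The expected result is 0: although the write made the descriptor ready for a moment, nothing is readable any more when the events are collected.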
Back in ep_scan_ready_list: as mentioned above, while the sproc callback is running, the target files may generate new events, which are linked into ovflist. So after the callback returns, the events in ovflist need to be added back to the rdllist ready list.
Finally, if rdllist is not empty (meaning there are ready events) and a process is waiting for events, wake_up_locked is called to wake the process again to handle the newly arrived events (the flow is the same as before, that is, the events are copied to user space).
At this point the epoll_wait flow is complete, but one question remains: the process goes to sleep after calling epoll_wait, so when is it woken? When epoll_ctl is called to register a listening item for the target file, the ep_ptable_queue_proc callback function is registered for that listening item. ep_ptable_queue_proc adds the listening item to the wait queue of the target file and registers the ep_poll_callback callback function. When the target file generates an event, ep_poll_callback wakes the process in the wait queue.
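The sleep-then-wake sequence can be sketched in user space with two processes: the parent blocks in epoll_wait with a negative timeout, and a child writes to the monitored pipe a little later. This is only an illustration of the observable behaviour; the function name and the 50 ms delay are assumptions of the example.

```c
#include <sys/epoll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if a blocked epoll_wait is woken by a write
 * from another process, 0 if it returned without EPOLLIN, -1 on error. */
int woken_by_writer(void)
{
    int pipefd[2], result = -1;

    if (pipe(pipefd) < 0)
        return -1;
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN }, out;

    if (epfd >= 0 && epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0) {
        pid_t pid = fork();
        if (pid == 0) {
            usleep(50000);                       /* let the parent go to sleep first */
            ssize_t w = write(pipefd[1], "x", 1);
            _exit(w == 1 ? 0 : 1);
        }
        if (pid > 0) {
            int n = epoll_wait(epfd, &out, 1, -1);   /* sleeps until the write */
            waitpid(pid, NULL, 0);
            result = (n == 1 && (out.events & EPOLLIN)) ? 1 : 0;
        }
    }

    if (epfd >= 0)
        close(epfd);
    close(pipefd[0]);
    close(pipefd[1]);
    return result;
}
```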
To sum up the epoll function: the epoll_wait function will put the calling process to sleep (except when timeout is 0). If there are monitored events, the process will wake up, and the events will be copied from the kernel to user space and returned to the process.