Beida Jade Bird Java Training: How Can Operations and Maintenance Programmers Deal with Online Problems Quickly?
Most operations and maintenance programmers need to keep a constant watch on possible problems in servers and system programs and resolve them before they turn into failures.

Today, through case analysis, we look at how operations and maintenance programmers can deal with online problems quickly.

Once you have fallen into a pit, the sensible order is: jump out of the pit -> fill the pit -> avoid the pit. Online fault handling follows the same process, and these three goals are listed from highest to lowest priority.

Jump out of the pit: quickly restore the online service, or at least reduce the impact on it.

The availability of online services determines the interests of the service provider's customers and affects the company's income.

Once the online environment is unavailable and users cannot be served, the company or team suffers economic losses and, more seriously, damage to its reputation.

Therefore, companies generally set stability and reliability requirements for the online environment, and these are also KPIs for the team and even the department.

For this reason, the first task after a production failure is to restore the production service. Even if the online service cannot be fully restored, we should do our best to minimize the impact on it.

Fill the pit: find the cause of the problem and solve it fundamentally.

After restoring online services and minimizing the impact on users/companies/teams, we need to thoroughly investigate the problem, find out the root cause of the failure and fundamentally solve the problem.

Usually, filling the pit and jumping out of it happen at the same time, and once the pit is filled, the jump has succeeded. In an emergency, however, there are some special ways to jump out, such as restarting services, or degrading or circuit-breaking services. In those cases the pit has not actually been filled yet; unconventional means are used to get out of the pit first.
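As an illustration of the degradation idea mentioned above, here is a minimal sketch of a circuit breaker in Java. It is not taken from the original article: the class name `SimpleCircuitBreaker`, the `failureThreshold` and `openMillis` parameters, and the fallback value are all hypothetical, and a production system would more likely rely on an established library than on hand-written code like this.

```java
import java.util.function.Supplier;

/**
 * Minimal circuit-breaker sketch (hypothetical, for illustration only).
 * After too many consecutive failures it stops calling the real service
 * for a while and serves a degraded fallback instead.
 */
public class SimpleCircuitBreaker {
    private final int failureThreshold; // assumed tuning knob: failures before opening
    private final long openMillis;      // assumed tuning knob: how long to stay open
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0L;

    public SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    // Synchronized for simplicity; this serializes calls, which a real
    // implementation would avoid.
    public synchronized <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        long now = System.currentTimeMillis();
        if (open && now - openedAt < openMillis) {
            // Breaker is open: skip the failing service, return degraded result.
            return fallback.get();
        }
        try {
            T result = primary.get();
            // Success closes the breaker and resets the failure count.
            consecutiveFailures = 0;
            open = false;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                open = true;
                openedAt = now;
            }
            return fallback.get();
        }
    }

    public synchronized boolean isOpen() {
        return open;
    }
}
```

A caller would wrap the risky call, for example `breaker.call(() -> remoteLookup(id), () -> cachedDefault)`, where `remoteLookup` and `cachedDefault` are placeholders. Users keep getting a degraded answer while the team works on the root cause.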

Avoid the pit: generalize from the incident and eliminate hidden dangers.

After finding the root cause and solving the problem, we need to generalize from this one case and think about the weaknesses exposed during the investigation and handling. Which processes, specifications or systems need to be optimized? Do similar problems exist in other systems or teams? This reflection and self-criticism produces an online incident report. The process is then continuously improved so the team does not fall into the same pit again, and the experience is shared within the team so that everyone improves together.

The approach to online fault handling follows from these goals and their priority. The first goal is to restore the online service or reduce the impact on it, and the key word is "fast". Only after jumping out of the pit and filling it do we turn to avoiding the pit.

Therefore, online fault handling can be divided into four steps: fault discovery, fault location, fault elimination and fault review. The first three steps are "jumping out of the pit", and the last step covers "filling the pit" and "avoiding the pit".

These steps are not meant to be carried out strictly one after another. It is better to run them in parallel, as long as the work stays organized. After an online fault, an emergency fault-handling process is usually started, and everyone in operations and maintenance, development, testing and product takes part. The work is divided among them, findings are gathered in parallel, and the team troubleshoots quickly and restores the service.

This idea is similar to the fork/join model found in operating systems and parallel programming: split the work, run the pieces concurrently, then merge the results, all to improve efficiency.
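To make the fork/join analogy concrete, here is a small Java sketch that "forks" several independent checks and then "joins" their findings. It only illustrates the pattern and does not come from the original article: the class `ParallelTriage`, the method `runAll`, and the sample findings are hypothetical. It uses `CompletableFuture` from the standard library, and `List.of` assumes Java 9 or later.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;
import java.util.stream.Collectors;

/** Hypothetical sketch: run independent checks in parallel, then merge the findings. */
public class ParallelTriage {

    /** Starts every check concurrently (fork), then waits for all of them (join). */
    public static List<String> runAll(List<Supplier<String>> checks) {
        // Fork: submit each check to the common pool without waiting.
        List<CompletableFuture<String>> futures = checks.stream()
                .map(check -> CompletableFuture.supplyAsync(check))
                .collect(Collectors.toList());
        // Join: block until each check finishes; results keep the input order.
        return futures.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Placeholder findings standing in for real work done by each role.
        List<Supplier<String>> checks = List.of(
                () -> "operations: host metrics reviewed",
                () -> "development: recent releases reviewed",
                () -> "testing: failing scenario reproduced");
        runAll(checks).forEach(System.out::println);
    }
}
```

Collecting all futures before joining any of them is what keeps the checks concurrent; joining inside the first `map` would run them one at a time.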

When the cause of the fault cannot be found quickly, we should decisively skip the fault location step and go straight to eliminating the fault, for example by degrading the service or adding servers, so that the online service stays at least minimally available.

Beida Jade Bird (Beijing) suggests that once the online service is holding up, we can take the time to locate the cause of the failure and solve the problem fundamentally.