I. Introduction to the Operation and Maintenance Engineer
1. The operation and maintenance engineer is responsible for maintaining the whole service and ensuring its high availability, while continuously optimizing the system architecture to improve deployment efficiency, resource utilization, and overall ROI.
2. The biggest challenge for operation and maintenance engineers is managing large-scale clusters: keeping services highly available while they run across hundreds of thousands of servers.
II. Work Content of the Operation and Maintenance Engineer
1. Event management: The goal is to restore service as quickly as possible when it becomes abnormal, thereby ensuring availability. It also covers analyzing the causes of failures in depth, upgrading and repairing the defects found in the service, and designing and formulating contingency plans so that losses can be stopped efficiently when the service fails.
2. Problem finding: Design and develop efficient monitoring and alerting platforms, and use machine learning, big data analysis, and other methods to aggregate and analyze the large volume of monitoring data the system produces, so that problems are found quickly when the system is abnormal and the impact of a fault can be assessed.
3. Problem handling: Design and develop efficient problem-handling platforms and tools that can make decisions quickly or automatically when the system is abnormal, trigger the relevant stop-loss plans, and restore services quickly.
4. Problem tracking: Determine the root cause of a problem by analyzing the system's various signals at the time it occurred (logs, changes, and monitoring), then formulate and develop contingency-plan tools.
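As a rough illustration of how problem finding and problem handling can connect, the sketch below flags abnormal points in a monitoring series and maps the affected metric to a stop-loss plan. It is a minimal sketch under stated assumptions, not a description of any real platform: a trailing-window z-score stands in for the machine learning and big data methods mentioned above, and the function names, metric names, and plan names are all hypothetical.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate from the mean of the
    preceding `window` samples by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat baseline: treat any change as abnormal.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical mapping from metric name to stop-loss plan (assumption).
STOP_LOSS_PLANS = {
    "error_rate": "roll_back_last_change",
    "latency_ms": "shift_traffic_to_standby",
}


def choose_plan(metric, samples, window=5, threshold=3.0):
    """Return the name of the plan to trigger, or None if the series looks normal.
    Metrics without a dedicated plan fall back to paging a human."""
    if find_anomalies(samples, window, threshold):
        return STOP_LOSS_PLANS.get(metric, "page_on_call_engineer")
    return None


latency = [100, 102, 98, 101, 99, 100, 500, 101]
print(find_anomalies(latency))
print(choose_plan("latency_ms", latency))
```

Note that once a spike enters the trailing window it inflates the standard deviation, so the points right after it are less likely to be flagged. A production system would typically use a more robust baseline and would also record which change or log event coincided with the anomaly to support root-cause tracking.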