Source | Data Whale
Machine learning and data science are broad terms that span many fields and bodies of knowledge. What one data scientist does may be very different from what another does, and the same is true of machine learning engineers. What they have in common is that the past (data) is used to understand or predict the future (the model).
To put the points below in context, I should explain what my role was. I worked on a small machine learning consulting team. We worked across almost every industry you can think of, handling everything from data collection to cleaning, modeling and service deployment. Because the team was small, everyone wore many hats.
A day in the life of a machine learning engineer
At 9 o'clock in the morning, I walk into the office, say hello to my colleagues, put my food in the refrigerator, pour a cup of coffee and go to my desk. Then I sit down, look at my notes from the previous day, open Slack, read the unread messages, and open the links to papers or blog articles that the team has shared. Because this field moves quickly, I try to read as much cutting-edge material as I can.
After clearing the unread messages, I usually spend some time browsing papers and blog articles, and I study closely the parts I don't understand. Some of this material is a great help to the work I am doing. Reading generally takes me an hour or more, depending on the article. Some friends ask me why I spend so long on it.
In my opinion, reading is the ultimate meta-skill. As soon as there is a better way to do what I am doing now, I can learn it and use it immediately, which saves time and energy. There are exceptions: if a project deadline is approaching, I shorten my reading time to move the project forward.
After reading, I check the previous day's work and my notebook to see where I need to pick up and why. My notebook works as a running diary.
For example: "The data has been processed into the correct format; the next step is to train the model on it." If I run into difficulties, I write something like: "There is a data mismatch. I will try to fix the mismatch and get a baseline before trying the new model."
At about 4 pm, I tidy up my code, which usually involves cleaning up the messy parts, adding comments and merging. Why do this? Because I often ask myself: what if others can't understand this? What would I need most if I had to read this code? With that in mind, I think it is well worth spending some time tidying the code. By about 5 pm, my code should be pushed to GitHub.
This is an ideal day, but not every day is like this. Sometimes you have an excellent idea at 4 pm, you follow it, and you end up working all night.
Now you should have a general understanding of the daily life of a machine learning engineer. Next, I will share my experience with you:
1. It is always about the data.
Machine learning engineers often focus on building better models instead of improving the data the models are built on. A model can deliver exciting short-term results if you throw enough computing power at it, but that alone will never get you to the goal you actually want.
When you are new to a project, spend a lot of time getting familiar with the data. In the long run, that familiarity will save you far more time than it costs.
That doesn't mean you shouldn't start with the details. For any new data set, your goal should be to become an "expert" on it. Check the distributions, identify the different types of features, find the outliers and work out why they are outliers. If you can't tell the story of the data you have, how can you expect a model to handle it well?
[Figure: an example exploratory data analysis life cycle, the steps performed every time a new data set is encountered.]
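As a rough illustration of a first pass over a new data set, here is a minimal Python sketch that summarizes one numeric column and flags outliers. The function name `summarize_column`, the sample values and the use of the common 1.5 x IQR rule are my own assumptions for illustration; they are not taken from any particular project.

```python
import statistics


def summarize_column(values, iqr_factor=1.5):
    """Summarize one numeric column and flag outliers (hypothetical helper).

    None is treated as a missing value. The 1.5 * IQR fence is a common
    rule of thumb, assumed here purely for illustration.
    """
    present = sorted(v for v in values if v is not None)
    q1, _, q3 = statistics.quantiles(present, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "outliers": [v for v in present if v < low or v > high],
    }


# Made-up sample: one missing value and one suspicious reading.
sample = [10, 12, 11, 13, None, 12, 11, 10, 95]
summary = summarize_column(sample)
print(summary)
```

Flagging a value is only the start. The useful part is finding out why it is an outlier: a sensor fault, a data entry error, or a real event the model needs to know about.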
2. Communication is more difficult than solving technical problems.
Most of the obstacles I encountered were not technical but related to communication. Of course there are technical challenges too, but solving technical problems is what engineers are for.
But never underestimate the importance of internal and external communication. There is nothing worse than solving a technical challenge that turns out to be the wrong one to solve. How does this happen?
Externally, it happens when what customers want does not match what we can provide. Internally, it happens because many people hold several roles, so it is hard to keep everyone focused on the same thing.
How do you solve these problems?
For external problems, we can only communicate with customers constantly. Do your customers understand the services you can provide? Do you understand the needs of customers? Do they understand what machine learning can and cannot provide? How can you convey your ideas more effectively?
For internal problems, you can gauge how hard internal communication is by the number of software tools built to solve it: Asana, Jira, Trello, Slack, Basecamp, Monday, Microsoft Teams. One of the most effective methods I have found is to post a short update in the relevant project channel at the end of each day.
Is it perfect? No, but it seems to work. It gives me a chance to reflect on what I have done, tells everyone what I plan to do next and where I need support, and even lets me collect advice from the team.
No matter how good an engineer you are, your ability to keep existing business and win new business depends on your communication skills.
3. Stability > state of the art
We had a natural language problem: classifying text into categories. The goal was to let users send a piece of text to a service that automatically classifies it into one of two categories. If the model is not confident in its prediction, the text is passed to a human classifier. The daily load was about 1,000 to 3,000 requests.
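The hand-off to a human can be sketched in a few lines of Python. This is only a sketch under assumptions: the function name `route_prediction`, the category names and the 0.8 threshold are hypothetical, and the probabilities would come from whatever classifier sits behind the service.

```python
def route_prediction(probabilities, threshold=0.8):
    """Return the predicted label, or hand the text to a human reviewer.

    `probabilities` maps each category to the model's probability for it.
    The 0.8 threshold is an assumed value; in practice it would be tuned
    against how much review work the human classifiers can absorb.
    """
    label, confidence = max(probabilities.items(), key=lambda item: item[1])
    if confidence >= threshold:
        return {"label": label, "confidence": confidence, "needs_human": False}
    # Not confident enough: leave the label empty and flag for human review.
    return {"label": None, "confidence": confidence, "needs_human": True}


confident = route_prediction({"category_a": 0.95, "category_b": 0.05})
unsure = route_prediction({"category_a": 0.55, "category_b": 0.45})
print(confident)
print(unsure)
```

With a load of a few thousand requests a day, the threshold becomes a trade-off between how many mistakes reach users and how many texts land on a person's desk.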
BERT was very popular last year. But without Google-scale computing, training a BERT model to solve our problem was still very complicated, because a lot would have had to be modified before it could go into production. Instead, we used another method, ULMFiT. It is not the most advanced, but it still gave us satisfactory results and was easier to use.
4. The two most common pitfalls for machine learning beginners
There are two pitfalls in applying machine learning in production: the gap between course work and project work, and the gap between a model in a notebook and a model in production (model deployment).
I took machine learning courses online and completed my master's degree in AI. But even after finishing many of the best courses, when I started working as a machine learning engineer I found that my skills rested on the structured backbone of the courses, and real projects are not as organized as courses.
I lack a lot of specific knowledge that I can't learn in the course, such as how to question the data, what data to explore and what data to use.
How do you make up for this gap? I was lucky to work alongside some of the best talent in Australia, and I was also willing to learn and to make mistakes. Of course, mistakes are not the goal, but to get things right you must first find out where you went wrong.
If you are learning machine learning through a course, keep taking the course, but put what you are learning into practice on your own projects to fill the gaps the course leaves.
As for deployment, I am still not very good at it. Fortunately, I have noticed a trend: machine learning engineering and software engineering are merging. With tools like Seldon, Kubeflow and Kubernetes, machine learning will soon become just another part of the stack. Building a model in Jupyter is simple, but how do you make it available to thousands or even millions of people? This is what machine learning engineers should be thinking about, and it is the precondition for machine learning to create value. However, according to recent discussions in Cloud Native, people outside big companies don't know how to do this.
5. 20% time
20% time means that we spend 20% of our time learning. "Learning" is a loose term here: anything related to machine learning counts, and so does learning about the relevant business. As a machine learning engineer, knowing the business can greatly improve your effectiveness.
If your business advantage lies in what you do best now, then your future business depends on you continuing to do what you do best, which means you need to keep learning.
6. One in ten papers is worth reading, and even fewer get used.
This is a rough rule of thumb. However, when exploring any data set or model, you will soon find that it holds broadly. In other words, among the thousands of submissions each year, you may get 10 groundbreaking papers. Of these 10, 5 may come from the same institution or individual.
You can't keep up with every new breakthrough, so it is better to build a solid foundation that has stood the test of time and apply new work on top of it.
Next is the problem of exploration versus exploitation.
7. Be your biggest skeptic.
The exploration versus exploitation problem is the dilemma between trying new things and sticking with what has already worked. You can deal with it by becoming your own biggest skeptic. Keep asking yourself: what is the benefit of choosing this over the old approach?
Exploit
Generally speaking, it is easy to run a model you have already used, get a high accuracy figure, and report it to the team as a new benchmark. But if you get a good result, remember to check your work, and then ask your team to check it again. You are an engineer, so you should have this habit.
Explore
Spending 20% of your time on exploration is a good decision, but a 70/20/10 split may be better. This means spending 70% of your time on core products, 20% on building on top of the core products, and 10% on moonshots, even though those may not pay off immediately. I am ashamed to say that I have never practiced this in my role, but it is the direction I am working toward.
8. Toy problems are very useful
Toy problems can help you understand many things, and they are especially helpful when you are trying to solve a complex problem. First, set up a simple problem, which may use a small part of your data or even an unrelated data set. Find a solution to that problem, then extend it to the whole data set. In a small team, the trick is to abstract the problems and then work through them one by one.
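One simple way to set up a toy problem is to carve a small, reproducible sample out of the full data set. The sketch below assumes the data fits in a Python list; the helper name `toy_subset`, the 10% fraction and the seed are hypothetical choices for illustration.

```python
import random


def toy_subset(rows, fraction=0.1, seed=42):
    """Return a reproducible random sample of `rows` (hypothetical helper).

    A fixed seed means every run works on the same toy data, so results
    on the small problem can be compared from one attempt to the next.
    """
    size = max(1, int(len(rows) * fraction))
    return random.Random(seed).sample(rows, size)


full_dataset = list(range(1000))  # stand-in for real records
toy = toy_subset(full_dataset)
print(len(toy))
```

Once an approach works on the toy subset, the same code can be pointed at the full data set by raising `fraction`.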
9. Rubber duck
If you have a problem, sitting down and staring at the code may or may not solve it. At that point, try talking it through with a colleague and treating them as your rubber duck; the problem often becomes easy to solve.
"Ron, I'm trying to iterate through this array, loop through another array and track the states, and then I want to combine these states into a tuple list."
"Loops within loops? Why not vectorize?"
"Can I do this?"
"Let's try it."
10. The number of models built from scratch is decreasing.
This is related to the integration of machine learning engineering and software engineering.
Unless your data problem is very specific, many problems are similar: classification, regression, time series forecasting and recommendation.
Services such as Google's and Microsoft's AutoML offer world-class machine learning to anyone who can upload a data set and select a target variable. For developers, there are libraries like fast.ai, which provide state-of-the-art models in a few lines of code, and various model zoos (collections of pre-built models), such as PyTorch Hub and TensorFlow Hub, which offer the same thing.
This means that we don't need to know the deeper principles of data science and machine learning, only need to know their basic principles, and we should be more concerned about how to apply them to practical problems to create value.
11. Math or code?
For the customer problems I deal with, we focus on code, and all of our machine learning and data science code is in Python. Sometimes I dabble in mathematics by reading a paper and reproducing it, but most existing frameworks already implement the mathematics. This is not to say that mathematics is unnecessary. After all, machine learning and deep learning are both forms of applied mathematics.
Knowing at least matrix operations, some linear algebra and some calculus, especially the chain rule, is enough to become a machine learning practitioner.
Remember that most of the time, for most practitioners, the goal is not to invent a new machine learning algorithm but to show customers whether machine learning could help their business.
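To make the chain rule concrete, here is a small Python sketch, assuming a one-weight model with a sigmoid output and a squared-error loss. The model, the input values and the function names are my own choices for illustration. The gradient worked out by hand with the chain rule is checked against a numerical estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, x, y):
    """Squared error of a one-weight model: L = (sigmoid(w * x) - y) ** 2."""
    return (sigmoid(w * x) - y) ** 2


def grad_chain_rule(w, x, y):
    """dL/dw = dL/dp * dp/dz * dz/dw, with z = w * x and p = sigmoid(z)."""
    z = w * x
    p = sigmoid(z)
    dL_dp = 2.0 * (p - y)   # derivative of (p - y) ** 2 with respect to p
    dp_dz = p * (1.0 - p)   # derivative of sigmoid(z) with respect to z
    dz_dw = x               # derivative of w * x with respect to w
    return dL_dp * dp_dz * dz_dw


def grad_numeric(w, x, y, eps=1e-6):
    """Central finite-difference estimate of dL/dw, used as a sanity check."""
    return (loss(w + eps, x, y) - loss(w - eps, x, y)) / (2.0 * eps)


analytic = grad_chain_rule(0.5, 2.0, 1.0)
numeric = grad_numeric(0.5, 2.0, 1.0)
print(analytic, numeric)
```

Frameworks apply this same rule automatically across every layer of a network, which is why a working grasp of it goes a long way.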
12. What you did last year may be invalid next year.
This is a general trend, and as software engineering and machine learning engineering merge, it is becoming more and more obvious.
But that's why you entered this industry. The framework will change, and all kinds of practical libraries will change, but the basic statistics, probability and mathematics will remain unchanged. The biggest challenge remains: how to apply them to create value.
What now? There are surely many more pitfalls to discover on the way to becoming a machine learning engineer. If you are a novice, mastering these 12 first is enough.