Machine learning powers many aspects of modern society: from web search to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones.
Machine learning systems are used to identify objects in images, transcribe speech into text, match news items, jobs or products with users' interests, and select relevant search results. Increasingly, these applications make use of a class of techniques called deep learning.

Conventional machine learning techniques are limited in their ability to process raw data. For decades, constructing a pattern recognition or machine learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector, from which a learning subsystem, typically a classifier, could detect or classify patterns in the input.

Representation learning is a set of methods that allow a machine to be fed raw data and to automatically discover the representations needed for detection or classification. Deep learning methods are representation-learning methods that transform the raw data into progressively higher-level, more abstract representations through the composition of simple but non-linear modules. With enough such transformations, very complex functions can be learned. For classification tasks, the higher-level representations amplify the aspects of the input that matter for discrimination and suppress irrelevant variations. An image, for example, arrives as an array of pixel values; the features learned in the first layer typically indicate the presence or absence of edges at particular positions and orientations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, while ignoring small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers recombine these parts to form the objects to be detected.
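The classical pipeline described above can be sketched in a few lines: a hand-designed feature extractor feeds a separate, simple classifier. This is a minimal illustrative sketch; the crude edge features, the toy 2x2 "image" and the fixed weights are all assumptions introduced here, not part of the original text.

```python
import numpy as np

def edge_features(image):
    """Hand-engineered features: mean horizontal and vertical
    pixel differences, a crude stand-in for edge detectors."""
    horiz = np.abs(np.diff(image, axis=1)).mean()  # responds to vertical edges
    vert = np.abs(np.diff(image, axis=0)).mean()   # responds to horizontal edges
    return np.array([horiz, vert])

def linear_classifier(features, weights, bias):
    """The downstream learner: a weighted sum compared with zero."""
    return 1 if features @ weights + bias > 0 else 0

# A tiny "image" with a strong vertical edge between its two columns.
image = np.array([[0.0, 1.0],
                  [0.0, 1.0]])
f = edge_features(image)                # -> [1.0, 0.0]
label = linear_classifier(f, weights=np.array([1.0, -1.0]), bias=0.0)
```

The point of the sketch is the division of labour: all the domain knowledge lives in `edge_features`, while the classifier only sees the hand-crafted vector. Deep learning replaces the hand-designed extractor with learned layers.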
The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.
Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition to beating records in image recognition and speech recognition, it has beaten other machine learning techniques at predicting the activity of potential drug molecules, analysing particle accelerator data, reconstructing brain circuits, and predicting the effects of mutations in non-coding DNA on gene expression and disease. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding, particularly topic classification, sentiment analysis, question answering and language translation. We believe that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures currently being developed for deep neural networks will only accelerate this progress.
Supervised learning
The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the correct category to have the highest score of all categories, but this is unlikely to happen before training. To measure the error (or distance) between the output scores and the desired pattern of scores, we compute an objective function. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as "knobs" that define the input-output function of the machine. In a typical deep learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine. To adjust the weight vector properly, the learning algorithm computes a gradient vector that indicates, for each weight, by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the direction opposite to the gradient vector. The objective function, averaged over all the training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking the weights closer to a minimum, where the average output error is lowest.
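The weight-update rule described above can be written out concretely. The sketch below, a minimal assumed example, fits a one-weight-vector linear model to a single training pair using squared error; the data values and learning rate are illustrative, not from the original text.

```python
import numpy as np

x = np.array([1.0, 2.0])     # input sample
t = 1.0                      # target score
w = np.array([0.0, 0.0])     # adjustable "knobs" (weights)
lr = 0.1                     # step size

for _ in range(100):
    y = w @ x                # forward pass: output score
    grad = 2 * (y - t) * x   # dE/dw for the error E = (y - t)**2
    w -= lr * grad           # move opposite to the gradient

error = (w @ x - t) ** 2     # essentially zero after training
```

Each iteration asks, for every weight, "if I nudge this knob up, does the error rise or fall?" (the gradient), then turns every knob a little in the error-reducing direction.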
In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vectors for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine: its ability to produce sensible answers on new inputs that it has never seen during training.
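The minibatch loop and the held-out test set described above can be sketched as follows. This is an assumed toy setup: the synthetic noise-free linear data, the batch size of 10 and the learning rate are all illustrative choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.0, -2.0, 0.5])          # "ground truth" to recover
y = X @ true_w
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]            # unseen samples: generalization

w = np.zeros(3)
lr = 0.05
for epoch in range(50):
    idx = rng.permutation(len(X_train))      # reshuffle each epoch
    for start in range(0, len(X_train), 10): # minibatches of 10 examples
        batch = idx[start:start + 10]
        err = X_train[batch] @ w - y_train[batch]
        grad = 2 * X_train[batch].T @ err / len(batch)  # noisy avg gradient
        w -= lr * grad

test_mse = float(np.mean((X_test @ w - y_test) ** 2))   # generalization check
```

Each batch gradient is only a noisy estimate of the full-data gradient, yet the noisy steps still descend the average error landscape, and the final test error on the 50 unseen samples is what measures generalization.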
Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the components of the feature vector; if the weighted sum is above a threshold, the input is classified as belonging to a particular category. Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions: half-spaces separated by a hyperplane.
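The weighted-sum-plus-threshold rule above amounts to testing which side of a hyperplane a point lies on. A minimal sketch, with weights, threshold and test points assumed purely for illustration:

```python
def classify(x, w, threshold):
    """Two-class linear classifier: weighted sum vs. threshold."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score > threshold else 0

w = [2.0, -1.0]       # normal vector of the separating hyperplane
threshold = 0.5       # the decision boundary is the line 2*x0 - x1 = 0.5

# Two points on opposite sides of that line.
a = classify([1.0, 0.0], w, threshold)   # score  2.0 -> class 1
b = classify([0.0, 1.0], w, threshold)   # score -1.0 -> class 0
```

No matter how the weights are chosen, this classifier can only draw a single straight boundary, which is exactly the limitation the text attributes to linear classifiers.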
However, for problems such as image and speech recognition, the required input-output function must be insensitive to irrelevant variations of the input, such as variations in the position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a white, wolf-like breed of dog such as a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be worlds apart, whereas two images of a Samoyed and a wolf in the same position and against similar backgrounds may be very similar to each other.
Figure 1 Multilayer Neural Networks and the Backpropagation Algorithm
A multilayer neural network (shown by the connected dots) can distort the input space to make the classes of data (examples of which are on the red and blue lines) linearly separable. Note how a regular grid in the input space (left) is transformed by the hidden units (transformed grid, right). This illustrative example uses only two input units, two hidden units and one output unit, but the networks used for object recognition or natural language processing contain tens or hundreds of thousands of units. Companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing convolutional neural network chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.
Distributed feature representation and language processing
Deep learning theory shows that deep networks have two different exponential advantages over classical learning algorithms that do not use distributed representations. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure. First, learning distributed representations enables generalization to new combinations of the values of learned features beyond those seen during training (for example, n binary features admit 2^n possible combinations). Second, composing layers of representation in a deep network brings the potential for another exponential advantage (exponential in the depth).
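The first, combinatorial advantage can be made concrete: n binary features jointly take 2^n distinct configurations, so a distributed code over a handful of features can distinguish exponentially many inputs. A small sketch (the choice of n = 4 is an arbitrary assumption):

```python
from itertools import product

# Enumerate every joint configuration of n binary features.
n = 4
configurations = list(product([0, 1], repeat=n))
count = len(configurations)   # 2**n = 16 distinct codes from only 4 features
```

A one-of-N (non-distributed) code would need 16 separate symbols to cover the same set of cases, and could not generalize to a configuration it had never seen; a distributed code can, because each feature is learned separately.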
The hidden layers of a multilayer neural network learn to represent the network's inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local context of earlier words. Each word in the context is presented to the network as a one-of-N vector, that is, one component has a value of 1 and the rest are 0. In the first layer, each word creates a different pattern of activations, or word vector (as shown in Figure 4). In a language model, the other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability of any word in the vocabulary appearing as the next word in the sentence. The network learns word vectors that contain many active components, each of which can be interpreted as a separate feature of the word, as was first demonstrated in the context of learning distributed representations for symbols. These semantic features were not explicitly present in the input; they were discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple "micro-rules". Learning word vectors turns out to work very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable. When the trained model is used to predict new cases, words with similar meanings are easily interchanged, such as Tuesday and Wednesday, or Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive, and their many configurations correspond to the variations seen in the observed data. These word vectors are composed of learned features that were not determined in advance by experts but were automatically discovered by the neural network.
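The one-of-N encoding and the first-layer word vectors described above can be sketched directly. The tiny vocabulary, embedding dimension and random weights below are assumptions for illustration; in a real language model the matrix E is learned, not random.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
V, d = len(vocab), 3

def one_hot(word):
    """One-of-N encoding: a single component is 1, the rest are 0."""
    v = np.zeros(V)
    v[vocab.index(word)] = 1.0
    return v

rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))      # first-layer weights: one row per word

# Multiplying a one-hot vector by E simply selects one row of E:
# that row is the word's activation pattern, i.e. its word vector.
wv = one_hot("cat") @ E
same = np.allclose(wv, E[vocab.index("cat")])
```

This makes the mechanism plain: the "word vector" is nothing more than the pattern of first-layer activations that a given one-hot input switches on, and training adjusts those rows so that words used similarly end up with similar rows.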
Vector representations of words learned from text are now very widely used in natural language applications.
Figure 4 Visualization of Word Vector Learning
The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something whose only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use, and to reason with symbols they must be bound to appropriately chosen rules of inference. By contrast, neural networks use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast "intuitive" inference that underpins effortless commonsense reasoning.
Before introducing neural language models, let us briefly describe the standard approach: language models based on statistics that do not exploit distributed representations. These models count the frequencies of occurrence of short symbol sequences of length up to N (called N-grams). The number of possible N-grams is on the order of V^N, where V is the vocabulary size, so taking into account a context of more than a handful of words would require very large training corpora. N-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences of words. Neural language models can, because they associate each word with a vector of real-valued features, and semantically related words end up close to each other in that vector space (Figure 4).
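The count-based approach described above can be sketched in a few lines for bigrams (N = 2). The tiny corpus is an illustrative assumption; with vocabulary size V, a model of order N would need counts for on the order of V**N sequences, which is why larger contexts demand enormous corpora.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat sat".split()

# Count each adjacent word pair, and each possible preceding word.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def p_next(prev, word):
    """Maximum-likelihood bigram estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

p = p_next("the", "cat")   # "the" is followed by "cat" in 2 of its 3 uses
```

Note the atomic-symbol limitation: having seen "the cat sat", this model learns nothing about "the dog sat", because "cat" and "dog" are unrelated symbols; a neural language model would place their feature vectors close together and generalize between them.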
Recurrent neural networks
When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs. An RNN processes an input sequence one element at a time, maintaining in its hidden units a "state vector" that implicitly contains information about the history of all the past elements of the sequence. If we consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network, it becomes clear how we can train an RNN with backpropagation (as shown in Figure 5, right).
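The recurrent state update described above can be sketched directly. The matrix sizes, random weights and random input sequence below are illustrative assumptions; the essential point is that the same two weight matrices are reused at every time step.

```python
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.5, size=(4, 3))   # input  -> hidden weights
W_hh = rng.normal(scale=0.5, size=(3, 3))   # hidden -> hidden weights
h = np.zeros(3)                             # initial state vector

sequence = rng.normal(size=(5, 4))          # five 4-dimensional inputs
states = []
for x in sequence:                          # one sequence element at a time
    h = np.tanh(x @ W_xh + h @ W_hh)        # same shared weights every step
    states.append(h)

# Unrolled in time, this loop is equivalent to a 5-layer feedforward
# network in which every layer shares the same W_xh and W_hh.
depth = len(states)
```

Viewing the loop as a deep network with tied weights is exactly what makes backpropagation-through-time possible: the gradient flows backwards through the unrolled layers.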
Fig. 5 Recurrent Neural Network
RNNs are very powerful dynamical systems, but training them has proved to be problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish.
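The explode-or-vanish behaviour follows from repeated multiplication during the backward pass: roughly one multiplication by the recurrent Jacobian per time step. The scalar stand-in below is an assumed simplification of that Jacobian (its spectral norm), but it shows the two regimes.

```python
def gradient_after(T, factor):
    """Magnitude of a backpropagated gradient after T time steps,
    modelling each step as multiplication by a constant factor."""
    g = 1.0
    for _ in range(T):
        g *= factor          # one multiplication per time step
    return g

vanished = gradient_after(100, 0.9)   # 0.9**100: shrinks toward zero
exploded = gradient_after(100, 1.1)   # 1.1**100: grows without bound
```

Unless the per-step factor is held very close to 1, long-range gradients are either too small to drive learning or too large to be stable, which is why architectures such as LSTMs add mechanisms to preserve gradient flow.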
Thanks to advances in their architecture and in ways of training them, RNNs have been found to be very good at predicting the next character in a text or the next word in a sentence, and they can also be applied to more complex tasks. For example, after reading an English sentence one word at a time, an English "encoder" network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This "thought vector" can then be used as the initial hidden state of (or as extra input to) a jointly trained French "decoder" network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network, it will then output a probability distribution for the second word of the translation, and so on until a full stop is chosen. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather simple approach to machine translation has quickly become competitive with the state-of-the-art methods, and this raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions manipulated by inference rules. It is more compatible with the view that everyday reasoning involves analogy and plausible conclusions.
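The encoder-decoder handoff described above can be sketched in miniature: the encoder's final hidden state becomes the decoder's initial state, and the decoder emits a probability distribution over output words. Everything below is an assumed toy setup with random, untrained weights; a real system learns the weights of both networks jointly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, V_out = 3, 4                       # hidden size, output vocabulary size
W_xh = rng.normal(scale=0.5, size=(d, d))
W_hh = rng.normal(scale=0.5, size=(d, d))
W_out = rng.normal(scale=0.5, size=(d, V_out))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Encoder: read the source sequence into a single "thought vector".
source = rng.normal(size=(4, d))      # four source-word vectors
h = np.zeros(d)
for x in source:
    h = np.tanh(x @ W_xh + h @ W_hh)
thought = h                           # final state summarizes the sentence

# Decoder: start from the thought vector and emit a distribution
# over the first output word; the chosen word would then be fed
# back in to produce the next distribution, and so on.
probs = softmax(thought @ W_out)
first_word = int(np.argmax(probs))
```

The only channel between the two networks is the thought vector, which is what makes it a plausible candidate for "the meaning of the sentence" in vector form.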
The rest exceeded the word limit. ...