YOLO stands for "You Only Look Once": a single look at the image is enough to predict both the locations and the classes of objects, which I find quite vivid. It does not need the RPN stage of Faster R-CNN. Instead, it divides the image into a 7×7 grid and lets each grid cell predict two bounding boxes, so from the very beginning there are 7×7×2 candidate boxes covering the whole image. The idea is that since Faster R-CNN's first-stage box regression still needs second-stage fine-tuning anyway, it is acceptable to do a single, rougher box regression directly.
Let's take a closer look at this model.
First, the model structure
In simple terms, the model's output can be summarized as follows.
How do the 30 output channels break down?
A. Location of 2 bounding boxes (8 channels)
Each bounding box needs four values to describe its position: (center_x, center_y, width, height), i.e. the x and y coordinates of the box center plus the box's width and height. The two bounding boxes together therefore need eight values.
B. Confidence of 2 bounding boxes (2 channels)
C. 20 classification probabilities (20 channels)
Finally, the remaining 20 classification channels: each channel holds the probability of one category. Because YOLO supports the recognition of 20 object classes (person, bird, cat, car, chair, etc.), there are 20 values giving the probability that an object of each class is present at that grid position. Note that a single image can predict at most 49 objects: each of the 7×7 grid cells has two bounding boxes, but only one of them (the one with the larger IoU against the ground truth) is made responsible for the prediction, so there are at most 7×7 = 49 detected objects.
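As a concrete illustration, here is a minimal NumPy sketch of slicing a 7×7×30 output tensor into the three parts described above. The exact channel ordering differs between implementations, so the layout used here is an assumption:

```python
import numpy as np

S, B, C = 7, 2, 20                 # grid size, boxes per cell, classes
output = np.random.rand(S, S, 30)  # stand-in for a network output

# Assumed channel layout: 8 box coords, 2 confidences, 20 class probabilities
boxes = output[..., :8].reshape(S, S, B, 4)  # 2 boxes x (x, y, w, h)
conf = output[..., 8:10]                     # 2 confidence scores
probs = output[..., 10:]                     # 20 class probabilities per cell

# Class-specific confidence for each box: P(class | object) * confidence
scores = conf[..., :, None] * probs[..., None, :]  # shape (7, 7, 2, 20)
assert scores.shape == (S, S, B, C)
```

Note that 8 + 2 + 20 = 30, which is exactly the channel count discussed above.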
In the figure, the bicycle's position is assigned to bounding box 1. More precisely, during training the two bounding boxes output by the network are compared against the bicycle's actual (ground-truth) box by IoU; the ground-truth box is assigned to whichever prediction has the larger IoU (assumed to be bounding box 1 in the figure), and that box's target confidence is set to 1.
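The responsibility assignment above can be sketched as follows; the helper function and the example boxes are illustrative, not from the original text:

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (center_x, center_y, width, height)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))  # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))  # intersection height
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

# The predicted box with the larger IoU against the ground truth
# becomes "responsible" and gets target confidence 1.
gt = (0.5, 0.5, 0.4, 0.6)
predictions = [(0.48, 0.52, 0.4, 0.5), (0.2, 0.2, 0.3, 0.3)]
responsible = max(range(len(predictions)), key=lambda j: iou(predictions[j], gt))
```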
Second, the loss function
Generally speaking, the sum of squared errors between the network output and the sample label is taken as the total error for the sample.
Each term in the loss function corresponds to a part of the 30-dimensional output vector.
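For reference, the sum-squared loss from the YOLO v1 paper (with $\lambda_{coord} = 5$ and $\lambda_{noobj} = 0.5$) can be written as:

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B}
    \mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2}\sum_{j=0}^{B}
    \mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2
    +\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
&+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}\,(C_i-\hat{C}_i)^2
 + \lambda_{noobj} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}\,(C_i-\hat{C}_i)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} (p_i(c)-\hat{p}_i(c))^2
\end{aligned}
```

Here $\mathbb{1}_{ij}^{obj}$ is 1 when box $j$ of cell $i$ is responsible for an object, and the square roots on width and height dampen the effect of errors on large boxes.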
Third, YOLO v1 defects
Notes and details:
The last layer of YOLO uses a linear activation function; all other layers use leaky ReLU. Dropout and data augmentation are used during training to prevent overfitting. Please refer to the original paper for more details.
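The two activation functions mentioned can be sketched as below; the leaky slope of 0.1 matches the YOLO paper, and the function names are illustrative:

```python
def leaky_relu(x, slope=0.1):
    """Leaky ReLU used in all hidden layers: small slope for x < 0."""
    return x if x > 0 else slope * x

def linear(x):
    """Identity activation used by the final layer."""
    return x
```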
At 67 FPS, YOLOv2 obtains 76.8% mAP on PASCAL VOC 2007; at 40 FPS it reaches 78.6% mAP, which is better than Faster R-CNN with ResNet and better than SSD. On the strength of these results, YOLOv2 was published at CVPR in 2017 and has been cited more than 1000 times. YOLO has two main shortcomings: inaccurate localization, and a lower recall than region-proposal-based methods. YOLOv2 therefore focuses its improvements on these two aspects. Notably, YOLOv2 does not improve accuracy by deepening or widening the network; instead, it simplifies the network.
The following introduces YOLO v2's improvements from two angles: Better and Faster.
1. Darknet-19
In YOLO v1, the author's training network was based on GoogLeNet. Here the author makes a brief comparison between GoogLeNet and VGG-16: GoogLeNet is cheaper in computation (8.52 billion operations vs 30.69 billion), but its top-5 accuracy is slightly lower than VGG-16's (88.0% vs 90.0%). In YOLO v2, the author adopts a new classification model as the base network, Darknet-19; Table 6 shows its final structure. Darknet-19 needs only 5.58 billion operations. It contains 19 convolutional layers and 5 max-pooling layers, whereas the GoogLeNet-style network used by YOLO v1 contains 24 convolutional layers and 2 fully connected layers, so Darknet-19 performs fewer convolution operations than YOLO v1's network, which is the key to reducing computation. Finally, a global average-pooling layer replaces the fully connected layers for prediction. This network achieves 91.2% top-5 accuracy on ImageNet.
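The layer schedule below is a sketch of Darknet-19 as listed in the YOLOv2 paper's Table 6 (kernel sizes and filter counts reproduced from memory of that table), confirming the 19-convolution / 5-maxpool count quoted above:

```python
# ("conv", filters, kernel_size) or ("maxpool",); global average
# pooling and softmax follow the final 1x1 classification conv.
DARKNET19 = [
    ("conv", 32, 3), ("maxpool",),
    ("conv", 64, 3), ("maxpool",),
    ("conv", 128, 3), ("conv", 64, 1), ("conv", 128, 3), ("maxpool",),
    ("conv", 256, 3), ("conv", 128, 1), ("conv", 256, 3), ("maxpool",),
    ("conv", 512, 3), ("conv", 256, 1), ("conv", 512, 3),
    ("conv", 256, 1), ("conv", 512, 3), ("maxpool",),
    ("conv", 1024, 3), ("conv", 512, 1), ("conv", 1024, 3),
    ("conv", 512, 1), ("conv", 1024, 3),
    ("conv", 1000, 1),  # classification head, replaced for detection
]

n_conv = sum(1 for layer in DARKNET19 if layer[0] == "conv")
n_pool = sum(1 for layer in DARKNET19 if layer[0] == "maxpool")
assert (n_conv, n_pool) == (19, 5)
```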
2. Classification training
The second and third parts above are tips for training and processing. Training for classification here means pre-training on ImageNet, in two steps: 1. Train Darknet-19 from scratch for 160 epochs on ImageNet, with input images of 224×224 and an initial learning rate of 0.1; standard data augmentation such as random cropping, rotation, and hue/brightness adjustment is used during training. 2. Fine-tune the network, this time with 448×448 inputs; all parameters are unchanged except the number of epochs and the learning rate, which become 10 epochs at a learning rate of 0.001. After fine-tuning, top-1 and top-5 accuracy are 76.5% and 93.3% respectively, versus 72.9% top-1 and 91.2% top-5 for the original Darknet-19. Steps 1 and 2 therefore improve the backbone's classification accuracy from the two angles of network structure and training regime.
3. Detection training
After step 2, the network is adapted for detection and fine-tuned on detection data. First the last convolutional layer is removed, then three 3×3 convolutional layers with 1024 filters each are added, followed by a final 1×1 convolutional layer whose number of filters depends on the classes to be detected. For VOC data, each grid cell predicts 5 boxes, each box with 5 coordinate/confidence values and 20 class values, so the final layer has 125 filters per grid cell (unlike YOLOv1, where each grid cell had 30 filters; remember the 7×7×30 tensor? In YOLOv1 the class probabilities were predicted per grid cell, i.e. the two boxes of a cell shared the same class probabilities, whereas in YOLOv2 the class probabilities belong to the boxes: each box carries its own class probabilities rather than taking them from the grid cell, so each box corresponds to 25 predicted values, 5 coordinates plus 20 class values, while in YOLOv1 the 20 class values of a cell's two boxes were identical). The author also mentions a passthrough connection from the last 3×3×512 convolutional layer to the second-to-last convolutional layer. Finally, the author fine-tunes the model for 160 epochs on the detection dataset with a learning rate of 0.001, dividing the learning rate by 10 at epochs 60 and 90, with a weight decay of 0.0005.
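The 125-filter count above follows directly from the per-box layout; a one-line sketch (function name is illustrative):

```python
def detection_filters(num_anchors, num_classes):
    """Filters in YOLOv2's final 1x1 conv: each anchor box predicts
    4 coordinates + 1 objectness confidence + one score per class."""
    return num_anchors * (5 + num_classes)

# VOC: 5 anchor boxes, 20 classes -> 5 * (5 + 20) = 125 filters
print(detection_filters(5, 20))  # prints 125
```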
Compared with YOLO v2, YOLO v3 makes three main changes: 1. Multi-scale feature maps are used to detect objects. 2. The backbone network structure is adjusted. 3. Independent logistic classifiers replace the softmax for class prediction.