The second reason is that, at a rate of 28 Mbytes/s, an uncompressed 15-second image sequence occupies 420 Mbytes of storage, which is unacceptable for most desktop computers, which can handle only small image fragments.
Nowadays, the key problem in bringing moving images into electronic systems is the compression method. Several different compression methods exist, but MPEG is the most promising one.
The history and advantages of MPEG
MPEG (Moving Picture Experts Group) is an international standard, namely ISO 11172. Its two standards, MPEG-1 and MPEG-2, are particularly important. MPEG-1 was introduced in 1991 to speed up image transmission from optical discs. Its purpose is to compress a 221 Mbit/s NTSC image down to 1.2 Mbit/s, a compression ratio of roughly 200:1. It is an image compression standard recognized throughout the industry.
The image quality of MPEG-2 over broadband transmission reaches the standard of TV broadcasting and even HDTV. Compared with MPEG-1, MPEG-2 supports a wider range of resolutions and bit rates, and it will become the compression mode of the digital video disc (DVD) and digital broadcast television. These markets will intertwine with the computer market, making MPEG-2 an important image compression standard for computers. This matters because an MPEG-2 decoder can also decompress an MPEG-1 bit stream. Another standard, MPEG-4, is under development; it will support applications with extremely low bit rate data streams, such as video telephony, video mail and electronic newspapers.
The wide acceptance of MPEG means investment protection for its users. Many vendors sell MPEG software or hardware players, and this competition leads to lower prices and higher quality. MPEG-2 is backward compatible with MPEG-1, so it is a standard with room to grow.
Basic principles of the MPEG video compression algorithm
Generally speaking, video sequences contain a great deal of statistical and subjective redundancy, both within and between frames. The ultimate goal of video source coding is to reduce the bit rate required to store and transmit video information by exploiting this statistical and subjective redundancy. A practical coding scheme is a compromise between coding performance (high compression with sufficient quality) and implementation complexity. For the development of the MPEG compression algorithms, the most important considerations were the expected life cycle of the standards and the capabilities of state-of-the-art VLSI technology.

Depending on the application requirements, the coding of video data may be lossless or lossy. The purpose of lossless coding is to reduce the amount of image or video data to be stored and transmitted while preserving the original image quality exactly (the decoded image is identical to the image before coding). In contrast, the purpose of "lossy" coding techniques (the ones relevant to the MPEG-1 and MPEG-2 video standards) is to fit the video into a given storage or transmission bit rate. Some important applications include transmitting video over communication channels of limited or narrow bandwidth, and storing video efficiently. In these applications, high video compression is achieved by reducing the video quality: compared with the original image before encoding, the "objective" quality of the decoded image is reduced (with the mean squared error between the original and reproduced images serving as the measure of objective image quality) so as to meet the lower target bit rate of the channel. The greater the compression imposed on the video, the more perceptible the coding artifacts usually become. The ultimate goal of lossy coding is to obtain the best possible image quality, by an "objective" or "subjective" criterion, at the specified target bit rate. It should be noted that the degree of image degradation (both the objective loss and the number of detectable artifacts) depends on the complexity of the compression technique relative to the material: for simple scenes with little motion, even a simple compression technique can produce a good reproduced image with no detectable artifacts.
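As a concrete illustration of the "objective" quality criterion mentioned above, the sketch below computes the mean squared error between an original and a decoded frame, together with the PSNR figure usually derived from it. This is a minimal sketch assuming 8-bit grayscale frames stored as numpy arrays; the function names are ours, not part of any MPEG specification.

```python
import numpy as np

def mse(original: np.ndarray, decoded: np.ndarray) -> float:
    """Mean squared error between two equally sized images."""
    diff = original.astype(np.float64) - decoded.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(original: np.ndarray, decoded: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB (higher means closer to the original)."""
    err = mse(original, decoded)
    return float("inf") if err == 0 else 10.0 * np.log10(peak ** 2 / err)
```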
(a) The MPEG video encoder source model
MPEG digital video coding is essentially a statistical technique. Video sequences usually contain statistical redundancy in both the temporal and the spatial direction. The basic statistical property on which MPEG compression relies is inter-pixel correlation, including the assumption of simple correlated translational motion between consecutive frames. Pixel values in a particular picture can be predicted either from nearby pixels within the same frame (intra-frame coding techniques) or from pixels of a nearby frame (inter-frame techniques). Intuitively, in some cases, such as at a scene change in a video sequence, the temporal correlation between pixels of nearby frames is small or vanishes entirely: the video then becomes a collection of uncorrelated still pictures. In such cases, intra-frame coding techniques that exploit spatial correlation achieve efficient data compression. The MPEG compression algorithm uses discrete cosine transform (DCT) coding on blocks of 8×8 pixels to exploit the spatial correlation between adjacent pixels within a frame. However, if the correlation between pixels of adjacent frames is high, that is, if the contents of two consecutive frames are similar or identical, an inter-frame DPCM coding technique using temporal prediction (motion-compensated inter-frame prediction) is appropriate. In the various MPEG video coding schemes, high data compression is achieved by adaptively combining temporal motion-compensated prediction with transform coding of the remaining spatial information (hybrid DPCM/DCT coding of video). Figure 1 gives an example of the correlation properties between pixels of a picture, based on a very simple but valuable statistical model. This simple hypothetical model already captures some basic correlation properties of many "typical" pictures, namely the high correlation between adjacent pixels and the monotonic decay of correlation with increasing pixel distance. We will use this model below to demonstrate some properties of transform-domain coding. The spatial inter-pixel correlation of "typical" pictures shown in Figure 1 was calculated from an AR(1) Gauss-Markov image model with high inter-pixel correlation; the variables x and y denote the distances between pixels in the horizontal and vertical directions of the image, respectively.
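To make the Figure 1 model concrete, the following sketch evaluates a separable AR(1) Gauss-Markov correlation model, in which the correlation between two pixels decays exponentially with their horizontal and vertical distances. The correlation coefficient of 0.95 is an assumed "typical" value chosen for illustration, not a figure taken from the text.

```python
import numpy as np

def ar1_correlation(x: int, y: int, rho_h: float = 0.95, rho_v: float = 0.95) -> float:
    """Correlation between two pixels separated by (x, y) pixels
    under a separable AR(1) Gauss-Markov image model."""
    return (rho_h ** abs(x)) * (rho_v ** abs(y))

# Adjacent pixels are highly correlated; correlation decays with distance.
for d in (0, 1, 2, 4, 8):
    print(d, round(ar1_correlation(d, 0), 3))
```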
(b) Subsampling and interpolation
Almost all the video coding techniques described here perform extensive subsampling and quantization before coding. The basic concept of subsampling is to reduce the dimensions (horizontal and/or vertical) of the input video, and thus the number of pixels, before encoding. It is worth noting that in some applications video is also subsampled in the temporal direction to reduce the frame rate before encoding. At the receiver, the decoded image is interpolated for display. This can be regarded as the simplest compression technique. It takes advantage of particular physiological characteristics of the human eye and thereby removes subjective redundancy contained in the video data: the eye is more sensitive to changes in brightness than to changes in chromaticity. Therefore, the MPEG coding schemes first divide the picture into YUV components (one luminance component and two chrominance components). The chrominance components are then subsampled relative to the luminance component, with a Y:U:V ratio that depends on the application (for the MPEG-2 standard, 4:1:1 or 4:2:2).
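A minimal sketch of subsampling and interpolation as applied to one chrominance plane follows, assuming simple 2×2 averaging for subsampling and nearest-neighbour interpolation for display; real codecs choose better filters, and the exact ratio depends on the profile, as noted above.

```python
import numpy as np

def subsample_2x(chroma: np.ndarray) -> np.ndarray:
    """Average each 2x2 block, halving both dimensions (H and W assumed even)."""
    h, w = chroma.shape
    return chroma.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def interpolate_2x(chroma: np.ndarray) -> np.ndarray:
    """Nearest-neighbour upsampling back to full resolution for display."""
    return np.repeat(np.repeat(chroma, 2, axis=0), 2, axis=1)
```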
(c) Motion-compensated prediction
Motion-compensated prediction is a powerful tool for reducing temporal redundancy between frames, and it is used extensively in the MPEG-1 and MPEG-2 video coding standards as the prediction technique for temporal DPCM coding. The concept of motion compensation is based on estimating motion between video frames: if all objects in a video scene undergo only spatial displacement, the motion between frames can be described by a limited set of motion parameters (for example, translational motion of pixels can be described by motion vectors). In this simple case, the best prediction of a pixel in the current frame is the motion-compensated pixel from a previously coded frame. Usually both the prediction error and the motion vectors are transmitted to the receiver. However, it is neither worthwhile nor necessary to transmit motion information for every pixel of the coded image. Since the spatial correlation between motion vectors is usually high, one motion vector can often be taken to represent the motion of a whole block of adjacent pixels. To this end, the picture is divided into disjoint pixel blocks (16×16 pixels in the MPEG-1 and MPEG-2 standards), and only one motion vector is estimated, coded and transmitted for each such block (Figure 2). In the MPEG compression algorithms, motion-compensated prediction is used to reduce the temporal redundancy between frames, and only the prediction-error image (the difference between the original image and the motion-compensated prediction) is encoded. Generally, because the prediction is based on previously coded frames, the correlation between pixels in the motion-compensated inter-frame error image to be coded is low compared with the intra-frame correlation shown in Figure 1. Fig. 2, block-matching method for motion compensation: for each block of the current frame n to be encoded, a motion vector (mv) is estimated relative to a reference block of the same size in the previously coded frame n-1. The motion-compensated prediction error is then calculated by subtracting, pixel by pixel, the motion-shifted reference block of the previous frame from the corresponding block of the current frame.
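The block-matching idea of Figure 2 can be sketched as an exhaustive search: for one 16×16 macroblock of the current frame, every candidate displacement within a small window of the previous reconstructed frame is tried, and the vector minimizing the sum of absolute differences (SAD) wins. The ±7-pixel search range and the SAD criterion are common illustrative choices, not requirements of the standard.

```python
import numpy as np

def motion_estimate(cur, prev, bx, by, block=16, search=7):
    """Return the motion vector (dx, dy) minimizing SAD for one block
    of the current frame, searched in the previous reconstructed frame."""
    target = cur[by:by + block, bx:bx + block].astype(np.int32)
    best, best_sad = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + block > prev.shape[0] or x + block > prev.shape[1]:
                continue
            cand = prev[y:y + block, x:x + block].astype(np.int32)
            sad = int(np.abs(target - cand).sum())
            if sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best
```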
(d) Transform-domain coding
Transform coding has been studied extensively over the past twenty years and has become a very popular compression method for still-image and video coding. The purpose of transform coding is to decorrelate the content of the intra-frame or inter-frame error image and to encode transform coefficients rather than the original pixels of the picture. To this end, the input image is divided into disjoint blocks b of N×N pixels. The transform can be expressed as a matrix operation using a linear, separable and unitary forward transformation with an N×N transformation matrix A, giving the N×N block of transform coefficients c:

c = A b A^T

where A^T denotes the transpose of the transformation matrix A. Note that the transformation is reversible: the original block b of N×N pixels can be reconstructed using the linear, separable inverse transformation

b = A^T c A.

Among the many possible transforms, the discrete cosine transform (DCT) applied to small image blocks of 8×8 pixels has become the transform of choice for still-image and video coding. Because DCT-based methods have high decorrelation performance and fast DCT algorithms exist, they are suitable for real-time applications and have been adopted in most image and video coding standards; commercial DCT VLSI chips operate at speeds suitable for a broad range of video applications. The main goal of transform coding is to make as many transform coefficients as possible small enough to be insignificant (by both statistical and subjective measures), while at the same time minimizing the statistical dependence between coefficients, so as to reduce the number of bits needed to encode the remaining coefficients. Figure 3 shows the variance (energy) of the DCT coefficients of an 8×8 intra-frame pixel block, based on the simple statistical model assumptions discussed for Figure 1. Here the variance of each coefficient expresses its variability averaged over a large number of blocks. Coefficients with small variance are less significant for the reconstruction of the image block than coefficients with large variance. As Figure 3 suggests, in general only a small number of DCT coefficients need to be transmitted to the receiver to obtain a useful approximate reconstruction of the image block. The most significant DCT coefficients are concentrated in the upper-left corner (low-order coefficients), and the significance of the coefficients decays with increasing distance from it. This means that higher-order DCT coefficients are less important for the reconstruction of image blocks than lower-order coefficients. When motion-compensated prediction is used, the DCT is applied to the temporal DPCM signal, and this signal inherits similar statistical correlation properties in the DCT domain (although with reduced energy), which is why the MPEG algorithms use DCT coding to make inter-frame compression successful. Fig. 3 illustrates the variance distribution of the DCT coefficients, calculated as averages over a large number of image blocks.
The variances of the DCT coefficients were calculated using the statistical model of Figure 1; u and v are the horizontal and vertical transform-domain variables of the 8×8 block. Most of the total variance is concentrated around the DC coefficient (u = 0, v = 0). The DCT is closely related to the discrete Fourier transform (DFT), and it is important to realize that the DCT coefficients can be interpreted in terms of frequency content within a picture block: low-order DCT coefficients correspond to lower spatial frequencies and high-order DCT coefficients to higher frequencies. This property is exploited in the MPEG coding schemes to remove subjective redundancy contained in the image data, based on criteria derived from the human visual system. Because a viewer is more sensitive to reconstruction errors at lower spatial frequencies than to errors at higher spatial frequencies, the coefficients are often adaptively weighted (quantized) according to a visual criterion, with the weighting increasing with frequency (perceptual quantization), in order to improve the visual quality of the decoded picture at a given bit rate. The combination of the two techniques described above, temporal motion-compensated prediction and transform-domain coding, is regarded as the core of the MPEG coding standards; a third characteristic is that these techniques operate on small picture blocks (typically 16×16 pixels for motion compensation and 8×8 pixels for DCT coding). The MPEG coding algorithm is therefore usually described as a block-based hybrid DPCM/DCT algorithm.
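The following sketch pulls together the transform and the perceptual quantization just described: the 8×8 DCT is written as the matrix product c = A b A^T with inverse b = A^T c A, and each coefficient is then divided by a weight that grows with spatial frequency. The orthonormal DCT-II basis matrix below is standard; the simple weight 8 + 2(u + v) is an assumed stand-in for a real visually derived quantization matrix.

```python
import numpy as np

N = 8
# Orthonormal DCT-II basis matrix A (row u, column x).
A = np.array([[np.sqrt((1 if u == 0 else 2) / N)
               * np.cos((2 * x + 1) * u * np.pi / (2 * N))
               for x in range(N)] for u in range(N)])

def dct2(b):
    return A @ b @ A.T              # forward transform: c = A b A^T

def idct2(c):
    return A.T @ c @ A              # inverse transform: b = A^T c A

# Frequency-dependent (perceptual) weights: coarser steps at high frequency.
weights = np.array([[8 + 2 * (u + v) for v in range(N)] for u in range(N)])

def quantize(c, qscale=2):
    return np.round(c / (weights * qscale)).astype(np.int32)

def dequantize(levels, qscale=2):
    return levels * (weights * qscale)

# Energy compaction: a smooth gradient block keeps almost all of its
# energy in the low-order (upper-left) coefficients after quantization.
block = np.tile(np.linspace(0.0, 255.0, N), (N, 1))
print(quantize(dct2(block)))
```

With these definitions, quantize(dct2(block)) leaves only a handful of non-zero levels in the upper-left corner, which is exactly the pattern the zigzag scan described later is designed to exploit.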
MPEG-1: A Generic Standard for Digital Storage Media
Coding standard for moving pictures and associated audio (at rates up to about 1.5 Mbit/s)
The video compression technique developed for MPEG-1 covers a wide range of applications, from interactive systems on CD-ROM to video delivery over telecommunication networks. The MPEG-1 video coding standard is considered a generic standard: to support a wide variety of applications, users can specify many input parameters, including flexible picture size and frame rate. MPEG recommends a constrained set of parameters that every MPEG-1 compatible decoder must at least be able to support: video with up to 720 pixels per line, up to 576 lines per picture, up to 30 frames per second, and bit rates up to 1.86 Mbit/s; the standard video input is a non-interlaced picture format. It should be pointed out that this does not mean MPEG-1 applications are limited to this constrained parameter set. The MPEG-1 video algorithm was developed out of the JPEG and H.261 activities; the idea at the time was to retain as much commonality with CCITT H.261 as possible, so that implementations supporting both standards would be plausible. However, the main target of MPEG-1 was multimedia CD-ROM applications, which require additional functionality from both encoder and decoder. Important features provided by MPEG-1 include frame-based random access to video, fast forward/fast reverse search through the compressed bit stream, reverse playback of video, and editability of the compressed bit stream.
(a) The basic MPEG-1 inter-frame coding scheme
The basic MPEG-1 (and MPEG-2) video compression technique is based on a macroblock structure, motion compensation, and conditional replenishment of macroblocks. As shown in Fig. 4a, the MPEG-1 coding algorithm encodes the first frame of a video sequence in intra-frame coding mode (I-picture). Each subsequent frame is coded using inter-frame prediction (P-picture): only data from the most recently coded I- or P-frame is used for prediction. The MPEG-1 algorithm processes the frames of a video sequence block by block. As shown in Fig. 4b, each colour input frame is divided into non-overlapping "macroblocks". Each macroblock contains four luminance blocks (Y1, Y2, Y3, Y4) and two chrominance blocks (U, V), each block being 8×8 pixels; the blocks are taken from co-located regions of the luminance and chrominance planes. The sampling ratio between Y:U:V luminance and chrominance pixels is 4:1:1.

Fig. 4a: P-pictures are coded by motion-compensated prediction based on the most recent previous frame; each frame is divided into disjoint "macroblocks" (MB). Fig. 4b: for each macroblock, information for four luminance blocks (Y1, Y2, Y3, Y4) and two chrominance blocks (U, V) is coded, each block containing 8×8 pixels.

The block diagram of the basic hybrid DPCM/DCT MPEG-1 encoder and decoder structure is shown in Figure 5. The first frame of a video sequence (I-picture) is encoded in intra mode, without reference to any past or future frame. At the encoder, the DCT is applied to each 8×8 luminance and chrominance block, and each of the 64 DCT coefficients at the DCT output is uniformly quantized (Q); the quantizer step size used for the DCT coefficients of a macroblock is transmitted to the receiver. After quantization, the lowest-order DCT coefficient (the DC coefficient) is treated differently from the remaining coefficients (the AC coefficients): the DC coefficient represents the average brightness of the block and is coded using a differential DC prediction method. The non-zero quantizer values of the remaining DCT coefficients and their positions are zigzag-scanned and coded using variable-length code (VLC) tables. Fig. 5 is the block diagram of the basic hybrid DPCM/DCT encoder and decoder structure.

The concept of zigzag scanning of the coefficients is shown in Figure 6. Since the scan converts the two-dimensional image signal in the quantized DCT domain into a one-dimensional sequence, variable-length codeword assignment follows the scan. The quantizer values (levels) of the non-zero AC coefficients are detected along the scan together with the distance (run) between consecutive non-zero coefficients, and each consecutive (run, level) pair is encoded by transmitting a single VLC codeword. The purpose of zigzag scanning is to visit the low-frequency DCT coefficients (containing most of the energy) before the high-frequency ones. Fig. 6, zigzag scanning of the quantized DCT coefficients of an 8×8 block: only non-zero quantized DCT coefficients are coded. The figure indicates the possible positions of non-zero DCT coefficients; the coefficients are visited in order of their significance (see Fig. 3). The lowest-order DCT coefficient (0,0) contains the largest share of the energy in the block, and the energy is concentrated around the low-order coefficients.
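The zigzag scan and (run, level) pairing of Figure 6 can be sketched as follows; the VLC table lookup that would turn each pair into a codeword, and the end-of-block code that closes each block in a real stream, are omitted, and the helper names are ours.

```python
import numpy as np

def zigzag_order(n: int = 8):
    """Yield (u, v) index pairs in zigzag order over an n x n block."""
    for s in range(2 * n - 1):
        diag = [(u, s - u) for u in range(n) if 0 <= s - u < n]
        # Alternate the direction of traversal on each anti-diagonal.
        yield from (diag if s % 2 else reversed(diag))

def run_level_pairs(block: np.ndarray):
    """Encode a quantized block as (run-of-zeros, level) pairs."""
    pairs, run = [], 0
    for u, v in zigzag_order(block.shape[0]):
        if block[u, v] == 0:
            run += 1
        else:
            pairs.append((run, int(block[u, v])))
            run = 0
    return pairs
```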
The decoder performs the inverse operations. First, the variable-length codewords are decoded (VLD) from the bit stream to obtain the positions and quantizer values of the non-zero DCT coefficients of each block. The pixel values of a block are recovered by reconstructing all its non-zero DCT coefficients (Q^-1) followed by the inverse DCT (DCT^-1). Processing the entire bit stream in this way decodes and reconstructs all image blocks.

To encode P-pictures, the previously coded frame n-1 (I- or P-picture) is stored in a frame memory in both encoder and decoder. Motion compensation (MC) is performed per macroblock: for the macroblock to be coded, only one motion vector is estimated between frame n and frame n-1. These motion vectors are coded and transmitted to the receiver. The motion-compensated prediction error is calculated by subtracting, pixel by pixel, the motion-shifted macroblock of the previous frame from the current macroblock. An 8×8 DCT is then applied to each of the 8×8 blocks in the macroblock, followed by quantization (Q) of the DCT coefficients, run-length coding and variable-length coding (VLC). A video buffer is needed to ensure that the encoder produces a constant target bit rate output: the quantizer step size can be adjusted for each macroblock of a frame to achieve a given target bit rate and to avoid overflow and underflow of the buffer.

The decoder uses the inverse process to reproduce a macroblock of frame n at the receiver. After decoding the variable-length words (VLD) contained in the video decoder buffer (VB), the prediction-error pixel values are reconstructed (Q^-1 and DCT^-1 operations). The motion-compensated pixels from frame n-1, contained in the frame store (FS), are added to the prediction error to recover the macroblock of frame n.

Figs. 7a-7d use a typical test sequence to illustrate the benefit of coding video with motion-compensated prediction based on the reconstructed frame n-1 in the MPEG encoder. Fig. 7a shows the frame to be encoded at time n, and Fig. 7b the frame reconstructed at time n-1 and stored in the frame memory (FS) of both encoder and decoder. The block motion vectors (mv, see Figure 2) depicted in Fig. 7b were estimated by the encoder's motion estimation procedure and predict the translational displacement of each macroblock of frame n with reference to frame n-1. Fig. 7c shows the pure frame-difference signal (frame n minus frame n-1) obtained if no motion compensation is used in the coding process, i.e. if all motion vectors are assumed to be zero. Fig. 7d shows the motion-compensated frame-difference signal when the motion vectors of Fig. 7b are used for prediction. Evidently, compared with the pure frame-difference coding of Fig. 7c, motion compensation greatly reduces the residual signal to be coded. Fig. 7: (a) the frame to be encoded at time n; (b) the frame at time n-1, used to predict the content of frame n (note: the motion vectors displayed are not part of the reconstructed image stored in encoder and decoder); (c) the prediction-error image obtained without motion compensation, assuming all motion vectors are zero; (d) the prediction-error image to be coded when motion-compensated prediction is used.
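A minimal sketch of the decoder path just described, for one block of a P-picture: dequantize the received levels, apply the inverse DCT to recover the prediction error, and add the motion-compensated block fetched from the frame store. dequantize and idct2 refer to the earlier sketches; everything here is illustrative rather than a conforming decoder.

```python
import numpy as np

def decode_p_block(levels, prev_frame, bx, by, mv, block=8):
    """Reconstruct one 8x8 block of frame n from frame n-1.
    Relies on dequantize() and idct2() from the earlier sketch."""
    error = idct2(dequantize(levels))            # reproduce prediction error
    dx, dy = mv                                  # decoded motion vector
    prediction = prev_frame[by + dy:by + dy + block,
                            bx + dx:bx + dx + block].astype(np.float64)
    return np.clip(prediction + error, 0, 255)   # frame-n block pixels
```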
How does MPEG work
MPEG-1 is characterized by lossy, asymmetric coding. Lossy means that some image and sound information is discarded to achieve a low bit rate; usually this is the information to which human eyes and ears are least sensitive, so even at single-speed (1x) CD-ROM rates it can achieve VHS-level picture quality and high-fidelity stereo. Asymmetric coding means that compressing images is much slower than decompressing them.
An MPEG-1 data stream contains three parts: a video stream, an audio stream and a system stream. The video stream contains only picture information and the audio stream only sound information; the system stream synchronizes image and sound and carries all the clock information required to play back MPEG video and audio data.
MPEG uses sophisticated mathematical and perceptual techniques to achieve its compression results. MPEG audio coding exploits research on the sensitivity of the human ear, and the video coding exploits corresponding results on the eye's sensitivity to brightness, colour and motion.
MPEG audio
The two channels of CD audio together contain a 1.4 Mbit/s data stream. Psychoacoustic research shows that, with appropriate compression techniques, this stream can be compressed to 256 kbit/s without audible distortion. MPEG audio takes advantage of this result, although some MPEG compressors do not support the higher-quality layers.
MPEG audio coding offers three layers of compression. Layer I is simple compression: subband coding driven by a psychoacoustic model. Layer II adds higher precision, and Layer III adds advanced techniques such as non-linear quantization and Huffman coding to achieve low bit rates with high fidelity. Each successive layer provides higher quality and a higher compression rate, but demands more compression power from the computer. MPEG Layer II can compress a 1.4 Mbit/s stereo stream to between 32 kbit/s and 384 kbit/s while maintaining high fidelity. Typical targets are 192 kbit/s per channel for Layer I, 128 kbit/s per channel for Layer II, and 64 kbit/s per channel for Layer III. At 64 kbit/s per channel Layer II is not as good as Layer III, but at 128 kbit/s per channel Layers II and III sound the same, and both are better than Layer I. As noted above, 128 kbit/s per channel achieves good fidelity, so Layer II is necessary and sufficient for hi-fi stereo.
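A quick check of the figures above, comparing the raw CD audio rate (44.1 kHz sampling, 16 bits, two channels) with the per-channel targets quoted for the three layers:

```python
# Raw CD audio rate: about 1.41 Mbit/s for two channels.
cd_rate = 44_100 * 16 * 2
for layer, per_channel in (("I", 192_000), ("II", 128_000), ("III", 64_000)):
    total = per_channel * 2  # stereo total at the quoted per-channel target
    print(f"Layer {layer}: {total / 1000:.0f} kbit/s, ratio {cd_rate / total:.1f}:1")
```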
MPEG-1 supports two channels, configured as mono, dual-channel, stereo or joint stereo. Layer II joint stereo combines the high-frequency parts (above 2 kHz) of the two signals: the overall stereo image is preserved, but only one instantaneous envelope is transmitted. Layer I does not support joint stereo. Some MPEG compressors cannot produce Layer II audio streams, so their sound fidelity is lower and the joint stereo feature is absent.
MPEG video
MPEG video coding uses three kinds of frames: I-frames, P-frames and B-frames. During MPEG encoding, some pictures are compressed into I-frames, some into P-frames, and some into B-frames. I-frame compression alone achieves a compression ratio of about 6:1 without producing any perceptible blur. Adding P-frame compression achieves a higher compression ratio, still without perceptible blur. With B-frames the compression ratio can reach 200:1; a B-frame is typically about 15% the size of an I-frame and less than half the size of a P-frame. I-frame compression removes the spatial redundancy of the image, while P-frames and B-frames remove temporal redundancy, as explained further below.
I-frame compression treats the frame as a reference and provides intra-frame compression only: when compressing a picture into an I-frame, only the image within that frame is considered, so I-frame compression cannot eliminate inter-frame redundancy. The intra-frame compression is based on the discrete cosine transform (DCT), similar to the DCT-based compression used in the JPEG and H.261 standards.
P-frames use predictive coding, exploiting the statistical similarity of adjacent frames; in other words, they provide inter-frame coding that takes motion into account. A P-frame encodes the difference between the current frame and the most recent I-frame or P-frame.
B-frames use bi-directional inter-frame coding: they draw data from both the preceding and the following I-frame or P-frame. A B-frame is compressed from the differences between the current frame and the images of the previous and next reference frames.
At the start of an MPEG data stream, the uncompressed digital image is sampled at the SIF resolution specified by CCIR-601. For the NTSC system, SIF resolution means a luminance signal of 352×240 pixels and two chrominance signals of 176×120 pixels each, at 30 frames per second. The MPEG compressor decides whether the current frame becomes an I-frame, a P-frame or a B-frame. Once the frame type is determined, the DCT is applied, and the result is quantized, zigzag-scanned, and run-length and variable-length coded. A typical frame sequence of the coded pictures is IBBPBBPBBPBB, repeated.
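The following back-of-the-envelope sketch ties these figures together: the raw SIF data rate, and the average compressed frame size for a common GOP pattern using the relative frame sizes quoted earlier (a B-frame about 15% of an I-frame, a P-frame roughly twice a B-frame). The absolute I-frame size of 18 KB is an assumption chosen purely for illustration.

```python
# Raw SIF data rate: luminance plus two subsampled chrominance planes.
luma = 352 * 240
chroma = 2 * 176 * 120
raw_bits_per_frame = (luma + chroma) * 8          # 8 bits per sample
print(raw_bits_per_frame * 30 / 1e6, "Mbit/s raw")   # about 30 Mbit/s

i_size = 18_000                                    # bytes per I-frame, assumed
b_size = 0.15 * i_size                             # B ~ 15% of I (from the text)
p_size = 2 * b_size                                # P ~ twice a B, assumed
gop = "IBBPBBPBBPBB"                               # a common pattern
avg = sum({"I": i_size, "P": p_size, "B": b_size}[f] for f in gop) / len(gop)
print(avg * 8 * 30 / 1e6, "Mbit/s compressed (rough)")  # near the 1.2 Mbit/s target
```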
B-frames and P-frames demand more computing power. Some compressors cannot produce B-frames or even P-frames, and their compressed images show obvious discontinuity as a result.
Other forms of image compression
Of course, MPEG is not the only image compression standard. H.261, Motion JPEG, Cinepak and Indeo are the best-known alternatives.
H.261 and Motion JPEG use techniques similar to MPEG, namely the discrete cosine transform (DCT). But Motion JPEG, like MPEG I-frame compression, is intra-frame only, and its compression ratio should not exceed about 10:1 if perceptible blur is to be avoided. JPEG is therefore a poor choice for delivering video from CD or over the Internet, where compression ratios around 200:1 are required. H.261 can provide a higher compression ratio, but it is not suited to images with much motion; it is best for talking-head scenes with a static background. Although H.261 supports inter-frame compression through P-frames, it does not support B-frame compression, so its high compression rate comes at the expense of image quality. When image quality and motion both matter, H.261 is no longer a good choice.
Indeo 3.2 and 4.0 are proprietary and use different compression techniques. Indeo 4.0 is the more sophisticated of the two, allowing bi-directional prediction (B-frames) and scaling. Indeo 4.0 compression is generally done in software and is very slow, especially when B-frame coding is used; Indeo's B-frame compression can also cause dropped frames, and the scaling feature can lead to obvious pixelation at sharp edges as well as frame loss. Without B-frame compression and scaling, Indeo compresses 320×240, 15 frame-per-second video to about 200 KB per second. By comparison, MPEG provides a higher compression ratio, compressing 352×240, 30 frame-per-second video to about 150 KB per second.
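The comparison can be made concrete in bits per pixel: at the quoted rates, MPEG carries more than twice as many pixels per second at three quarters of the data rate. A small sketch, assuming the KB figures above mean kilobytes:

```python
# Bits per pixel at the quoted rates (KB taken as kilobytes, an assumption).
def bits_per_pixel(width, height, fps, kbytes_per_sec):
    return kbytes_per_sec * 1000 * 8 / (width * height * fps)

print("Indeo:", bits_per_pixel(320, 240, 15, 200))   # ~1.39 bits/pixel
print("MPEG: ", bits_per_pixel(352, 240, 30, 150))   # ~0.47 bits/pixel
```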
Cinepak is a compression technique developed by Radius. It is also proprietary, and compression is very slow. From CD it generally delivers 15 frames per second, rather than 30 frames per second like MPEG.