Compression is a mechanism for reducing the size of computer files with a specific algorithm. It is a very convenient invention, especially for network users, because it reduces the total number of bytes in a file, lets files be transmitted faster over a slow Internet connection, and also reduces the disk space the files occupy.
Compression reduces data size to save storage space and transmission time. For data transmission, compression can be applied to the content alone or to entire transmission units (including headers), depending on specific factors.
Content compression is conceptually simple: remove redundant blank characters, insert a single repeat marker to stand for a run of identical characters, and replace common characters with short bit strings. This type of compression can reduce the size of a text file by 50%. Compression is performed by programs that use specific formulas and algorithms to decide how to compress and decompress data, processing files either lossily or losslessly so that most of the information is kept while the file becomes smaller.

The basic principle of file compression is to find duplicate byte sequences in a file, record them in a "dictionary", and represent each one with a short code. For example, if the same phrase "People's Republic of China" appears in several places in a file, it can be represented everywhere by one code recorded in the "dictionary", shrinking the file. Because computers represent all information as binary numbers, compression software marks identical bit strings in that binary data with special codes to achieve compression.

To picture this, imagine a photo of blue sky and white clouds. For thousands of monotonous blue pixels, instead of listing the color "blue, blue, blue …" pixel by pixel, it is far more concise to tell the computer "starting at this location, store 117 blue pixels", which greatly saves storage space. This is a very simple example of image compression. In the final analysis, all computer files are stored as "1"s and "0"s; just as with the blue pixels, a suitable mathematical formula can greatly reduce a file's size without losing any data. Broadly speaking, compression divides into lossy compression and lossless compression.
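The blue-sky example above is essentially run-length encoding. Here is a minimal Python sketch; the `rle_encode` and `rle_decode` names are illustrative, not from any particular library:

```python
def rle_encode(pixels):
    """Collapse runs of identical values into (value, run_length) pairs."""
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1       # extend the current run
        else:
            runs.append([p, 1])    # start a new run
    return [tuple(r) for r in runs]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out

sky = ["blue"] * 117 + ["white"] * 3
encoded = rle_encode(sky)
print(encoded)  # [('blue', 117), ('white', 3)]
assert rle_decode(encoded) == sky
```

Two pairs replace 120 list entries, which is exactly the "store 117 blue pixels" trick described above.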
If losing a few individual data points does not matter much, it is reasonable to discard them; this is lossy compression. Lossy compression is widely used for video, sound and image files, with MPEG, MP3 and JPG as typical examples. In many other cases, however, the decompressed data must be exact, so people designed lossless compression formats, such as the common ZIP and RAR. Compression software is simply a tool that applies these compression principles to data. The file produced by compression is called an archive, and its size may be only a fraction of the original. An archive is, of course, a different file format: to use the data inside it, you must first restore the data with compression software, a process called decompression. Common compression software includes WinZip, WinRAR, etc.
Computer data repeats in two forms, and ZIP compresses both.
The first type
The first is repetition at the phrase level, that is, repeats of three bytes or more. For this kind of repetition, ZIP records two numbers: the distance between the earlier occurrence and the current compression position, and the length of the repeat. Assuming each of these two numbers fits in one byte, the data is compressed, and the scheme is easy to understand.
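This distance-plus-length idea can be sketched in Python. The functions below are a simplified illustration of LZ77-style matching, not the actual ZIP implementation (real ZIP/DEFLATE finds matches with hash chains and then entropy-codes its output):

```python
def lz77_tokens(data, window=255):
    """Greedy tokenizer: emit ('match', distance, length) for repeats of
    3+ bytes within `window` bytes back, single literals otherwise."""
    i, out = 0, []
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= 3:
            out.append(("match", best_dist, best_len))
            i += best_len
        else:
            out.append(("lit", data[i]))
            i += 1
    return out

def lz77_decode(tokens):
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:  # copy `length` bytes starting `distance` back (may overlap)
            _, dist, length = tok
            for _ in range(length):
                out.append(out[-dist])
    return bytes(out)

data = b"ask not what your country can do for you - ask what you"
assert lz77_decode(lz77_tokens(data)) == data
```

Note that a match's source region may overlap the bytes being produced: `b"abcabcabc"` encodes as three literals followed by a single (distance 3, length 6) match.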
A single byte has 256 possible values (0-255), and three bytes have 256 × 256 × 256, more than 16 million, possible combinations. The number of possible values for longer phrases grows exponentially, so the probability of a repeat seems extremely low. In fact, all types of data tend to repeat. In a paper, technical terms appear again and again; in a novel, names and places recur; in a picture with a vertical gradient background, pixels repeat along the horizontal direction; in program source code, grammatical keywords recur (and how many times do we copy and paste while writing a program?). In uncompressed data tens of kilobytes in size, phrase repetition is everywhere. After the compression described above, however, the tendency toward phrase repetition is largely destroyed, so applying phrase compression a second time to the result is generally ineffective.
The second type
The second type of repetition is single-byte repetition: a byte has only 256 possible values, so this kind of repetition is inevitable. Some byte values appear many times and others rarely, so the statistical distribution tends to be uneven, which is easy to understand. For example, in an ASCII text file some symbols may be rarely used while letters and digits appear often, and each letter has its own frequency; the letter e is said to be the most frequently used. Many pictures are predominantly dark or bright, so dark (or bright) pixels dominate. (As an aside: the PNG picture format is a lossless format whose core algorithm is the ZIP algorithm; its main difference from a ZIP file is that, being a picture format, it stores information such as the image dimensions and color count in its header.) The results of the phrase compression above show the same tendency: repeats occur mostly near the current compression position, and repeat lengths tend to be short (within 20 bytes). This unevenness can itself be compressed by re-encoding the 256 byte values so that frequent bytes get shorter codes and rare bytes get longer ones. When the shortened bytes outnumber the lengthened ones, the total length of the file shrinks, and the more uneven the byte distribution, the greater the compression ratio.
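Giving frequent bytes shorter codes is the idea behind Huffman coding, which DEFLATE (the algorithm behind ZIP) uses for this step. A minimal sketch, illustrative rather than ZIP's actual table construction:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix code in which frequent symbols get shorter bit strings."""
    freq = Counter(text)
    # Heap items are (frequency, tiebreak, tree); a tree is either a symbol
    # or a (left, right) pair. The tiebreak keeps trees from being compared.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # merge the two rarest subtrees
        f2, _, t2 = heapq.heappop(heap)
        counter += 1
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # lone-symbol edge case
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("aaaabbc")
bits = sum(len(codes[ch]) for ch in "aaaabbc")
print(codes, bits, "bits vs", 7 * 8, "bits uncompressed")
```

For `"aaaabbc"`, the frequent `a` receives a 1-bit code while `b` and `c` each get 2 bits, so the 7 symbols need 10 bits instead of 56.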
Commonly used compression software and compression formats
Universal compression software
WinMount、WinRAR、WinZip、7-Zip、coolrar
Common compressed file formats
Mainly: rar, zip, tar, cab, uue, jar, iso, z, 7z, ace, lzh, arj, gzip, bz2 and other compressed file formats.
Files produced by compression software are called compressed files. The principle is to compress the file's binary code by shrinking runs of adjacent 0s and 1s: for example, a run of six zeros, 000000, can be recorded as "6 zeros" instead of writing each one out, reducing the space the file occupies.
JAR
A JAR file is a Java archive file; its use is closely tied to Java, and it is Java's document format. A JAR file is very similar to a ZIP file; in fact, it is a ZIP file, which is why it is called a file package. The only difference between a JAR file and a ZIP file is that a JAR file contains a META-INF/MANIFEST.MF file, which is created automatically when the JAR file is generated.
ZIP
ZIP is probably the most common compressed file format. You do not even need to install separate compression software for it, because Windows has built-in support for the ZIP format.
RAR
Although ZIP holds a dominant position among compressed file formats, quite a few download sites choose to compress files in RAR format, fundamentally because the RAR format achieves a higher compression ratio than ZIP.
7Z
As a rising star among compression formats, 7Z achieves a higher compression ratio than RAR, making files even more compact. However, because the RAR format is already so widespread and 7Z has not had the same time to spread across the network, it will be quite difficult for 7Z to displace RAR.
CAB
CAB is a Microsoft installation-file compression format, used mainly in software installers. Because an installer is involved, the files inside a CAB are usually not simply compressed copies: their file names are processed, so although a CAB can be extracted directly, the extracted files usually cannot be used directly.
ISO
Many people think ISO is a compression format, an impression that comes from WinRAR's ability to "decompress" ISO files. In fact, ISO is not a compression format, and the files it contains are not compressed. ISO is simply a disc image format that stores a complete copy of a CD's contents. The so-called "decompression" of an ISO is just the extraction of the files inside it.
TAR
A file with the .tar extension can be opened by WinZip or WinRAR, because both programs support the .tar format; in other words, such files can be unpacked by the corresponding software.
.tar is a common archive file format under Linux, not a database file. A tar file itself only bundles files together; it is usually compressed separately, for example with gzip, producing .tar.gz.
UUE
UUE is an encoding format once widely used to send binary files by email. If a uuencoded attachment arrives as seemingly garbled text, WinZip or WinRAR can decode it.
Above, we have introduced the commonly used compressed file formats.
Basic principle of compression
Overview
If you download many programs and files from the Internet, you have probably encountered many ZIP files. This compression mechanism is a very convenient invention, especially for network users, because it reduces the total number of bits and bytes in a file, lets the file transfer faster over a slow Internet connection, and also reduces the disk space it occupies. After downloading, the computer can use a program such as WinZip or StuffIt to expand the file back to its original size. If all goes well, the expanded file is identical to the original before compression. At first glance this seems mysterious: how can you reduce the number of bits and bytes and then restore them exactly? Once everything is laid out, you will find the basic idea behind the process is actually simple and clear. In this article, we discuss this method of significantly shrinking files through simple compression.
Most computer file types contain considerable redundancy: they list the same information over and over again. File compression programs are designed to eliminate this redundancy. Instead of listing a piece of information repeatedly, a file compressor lists it once and then refers back to it wherever it appears in the original file.
An example
Take a familiar type of information, text, as an example.
John F. Kennedy said the following famous words in his 1961 inaugural address:
Ask not what your country can do for you - ask what you can do for your country.
This passage has 17 words, consisting of 61 letters, 16 spaces, 1 dash and 1 period. If each letter, space or punctuation mark occupies 1 storage unit, the total file size is 79 units. To reduce the file size, we need to find its redundant parts.
We immediately notice the following:
If we ignore the difference between uppercase and lowercase letters, almost half of the sentence is redundant. Nine words (ask, not, what, your, country, can, do, for, you) provide almost everything needed to build the whole quotation. To construct the second half, we only need to take words from the first half and add spaces and punctuation.
Most compression programs use a variant of the LZ algorithm, which is based on an adaptive dictionary, to compress files. "LZ" refers to Lempel and Ziv, the algorithm's inventors, and "dictionary" refers to the method of cataloging blocks of data.
There are many ways to arrange a dictionary; it can be as simple as a numbered list. When we examine Kennedy's famous sentence, we can pick out the repeated words and put them into a numbered index. Then we write the number in place of the whole word.
Therefore, if our dictionary is:
1. ask
2. what
3. your
4. country
5. can
6. do
7. for
8. you
Then our sentence becomes:
1 not 2 3 4 5 6 7 8 - 1 2 8 5 6 7 3 4
If you understand this mechanism, you can easily reconstruct the original sentence using the dictionary and the number pattern. That is exactly what your computer's decompression program does when it expands a downloaded file. You may also have encountered self-extracting compressed files: to create one, the programmer bundles a simple decompression program into the archive itself, so that after downloading it can automatically rebuild the original file.
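The dictionary-and-numbers scheme above can be sketched directly in Python; the `compress` and `expand` names are illustrative:

```python
# The numbered dictionary from the example above (1-based indexing).
DICT = ["ask", "what", "your", "country", "can", "do", "for", "you"]

def compress(sentence):
    """Replace each dictionary word with its number; keep other tokens."""
    return " ".join(str(DICT.index(w) + 1) if w in DICT else w
                    for w in sentence.split())

def expand(encoded):
    """Reverse the substitution: numbers become dictionary words again."""
    return " ".join(DICT[int(t) - 1] if t.isdigit() else t
                    for t in encoded.split())

quote = "ask not what your country can do for you - ask what you can do for your country"
packed = compress(quote)
print(packed)  # 1 not 2 3 4 5 6 7 8 - 1 2 8 5 6 7 3 4
assert expand(packed) == quote
```

The round trip shows why the dictionary must travel with the file: `expand` is useless without `DICT`.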
But how much space does this mechanism actually save? "1 not 2 3 4 5 6 7 8 - 1 2 8 5 6 7 3 4" is certainly shorter than "Ask not what your country can do for you - ask what you can do for your country." But note that we need to store the dictionary along with the file.
In a real compression scheme, working out the various storage requirements is quite complicated. Let us return to the example above. Each character and space takes 1 storage unit, so the whole original sentence takes 79 units. The compressed sentence (including spaces) takes 37 units, and the dictionary (words and numbers) also takes 37 units, giving a file size of 74 units. So we have not reduced the file size by much.
But this is just one sentence! If the rest of Kennedy's speech were run through this compression program, we would find these and other words repeated many more times. And, as the next section explains, the dictionary can be rewritten to squeeze out the greatest possible efficiency.
In the last example we picked out all the repeated words and put them in a dictionary. To us, this is the most obvious way to build a dictionary. But a compression program sees it differently: it has no concept of words; it only looks for patterns. To shrink the file as much as possible, it carefully selects the patterns that work best.
Looked at from this angle, the sentence yields a completely different dictionary.
As the compressor scans Kennedy's sentence, the first redundancy it meets is only a couple of letters long. In "ask not what your" there is a repeated pattern: the letter t followed by a space, once in "not" and once in "what". If the compressor wrote this pattern into the dictionary, it could write a "1" every time a t is followed by a space. But in this short sentence the pattern does not occur often enough to be worth a dictionary entry, so the program eventually overwrites it.
The next thing the program notices is ou, which appears in both your and country. In a long document, writing this pattern into the dictionary could save a lot of space, since ou is a common letter combination in English. But by the time the compressor has read the whole sentence, it has found a better candidate: not only is ou repeated, but the whole words your and country are repeated, and in fact they repeat together, as the phrase your country. So the program overwrites its ou entry with a your country entry.
The phrase can do for is also repeated, once followed by you and once by your, so can do for you is itself a repeating pattern. It lets us replace 15 characters (including spaces) with one number, whereas your country only lets us replace 13 characters (including spaces). Since can do for you swallows the you at the start of your, what remains of your country is r country, so the program overwrites the your country entry with an r country entry and then writes a separate can do for you entry. The program continues working this way, picking out all the repeated information and calculating which patterns should go into the dictionary. The "adaptive" part of the adaptive-dictionary LZ algorithm refers to this ability to rewrite the dictionary; the process by which a program actually carries this out is quite intricate.
Whatever method is used, this deep-search mechanism compresses a file more efficiently than simply picking out whole words. Using the patterns extracted above, and writing "__" for each space, we end up with the following larger dictionary:
1. ask__
2. what__
3. you
4. r__country
5. __can__do__for__you
And this shorter sentence:
1not__2345__-__12354
The sentence now takes 18 storage units and the dictionary 41, so we have reduced the total file size from 79 units to 59! This is just one way of compressing the sentence, and not necessarily the most efficient one. (See whether you can find a better way!)
Advantages
So how good is this mechanism? The compression ratio of a file depends on many factors, including the file type, the file size and the compression scheme.
In most of the world's languages, certain letters and words repeatedly appear together in the same patterns. Because of this high redundancy, text files compress very well: a reasonably sized text file often compresses by 50% or more. Most programming languages are also highly redundant, because they use a relatively small set of commands that tend to follow fixed patterns. Files that contain a lot of non-repeating information (such as images or MP3 files) cannot be compressed much by this mechanism, because they contain few repeating patterns.
If a file contains many repeating patterns, the compression ratio usually increases with file size. This is visible in our example: if we had quoted more of Kennedy's speech, the patterns in our dictionary would occur more often, so each dictionary entry would save more space. Larger files are also more likely to contain more general patterns from which a more efficient dictionary can be built.
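Both claims, that redundant text compresses well and that the ratio improves with size, are easy to check against a real DEFLATE implementation such as Python's zlib module:

```python
import zlib

sentence = ("ask not what your country can do for you - "
            "ask what you can do for your country. ")

def deflate_ratio(text):
    """Compressed size as a fraction of the original size (smaller is better)."""
    raw = text.encode()
    return len(zlib.compress(raw, 9)) / len(raw)

r_short = deflate_ratio(sentence)        # one sentence: little to reuse
r_long = deflate_ratio(sentence * 500)   # the same patterns, repeated 500 times
print(f"short: {r_short:.2f}  long: {r_long:.4f}")
assert r_long < r_short
```

The repeated text collapses to a tiny fraction of its original size, while the lone sentence barely shrinks: the "dictionary" overhead is amortized only when patterns recur.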
The efficiency of file compression also depends on the specific algorithm the program uses. Some programs are better at spotting patterns in certain types of files and so compress those types more effectively. Others use very large dictionaries, which works well when compressing big files but is inefficient for small ones. Although all such programs are built on the same basic idea, each implements it differently. Programmers are always trying to build better compression mechanisms.
Lossy compression and lossless compression
The type of compression we have discussed so far is called lossless compression, because the file you re-create is exactly the same as the original. All lossless compression is based on the idea of storing or transmitting a file in a smaller form and then restoring it for reuse after it is received.
Lossy compression is completely different. Lossy programs simply delete "unnecessary" information, trimming the file to make it smaller. This type of compression is widely used to reduce the size of bitmap images, which tend to be very large. To see how it works, let us consider how your computer might compress a scanned photograph.
For this kind of file, lossless compression usually does not achieve a high compression ratio. Although large parts of the picture may look the same, for example an entire sky of blue, most pixels differ subtly from their neighbors. To shrink the picture without reducing its resolution, the program must change the color values of some pixels. If the picture contains a large expanse of blue sky, it picks one blue that can stand in for all of those pixels, then rewrites the file using that value for every sky pixel. If the compression scheme is well chosen, you will not notice the change, but the file size will drop significantly.
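A crude sketch of that idea, assuming pixels are RGB tuples: snapping each color channel into coarse buckets makes nearby blues identical, so a subsequent lossless pass compresses them far better. This is only an illustration; real formats such as JPEG use much more sophisticated transforms.

```python
def quantize(pixels, step=32):
    """Snap each 0-255 channel to the centre of its `step`-wide bucket.
    Visually similar pixels end up with one shared exact value; the
    original values are gone for good (this is the 'lossy' part)."""
    return [tuple((c // step) * step + step // 2 for c in p) for p in pixels]

sky = [(30, 60, 200), (31, 62, 198), (29, 59, 201)]  # three slightly different blues
print(quantize(sky))  # all three collapse to (16, 48, 208)
```

After quantization the three distinct pixels are identical, so a run-length or dictionary pass can replace them with a single entry, at the cost of never being able to recover the original values.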
Of course, with lossy compression you cannot recover the original file afterwards; you must accept the compressor's reinterpretation of it. For content that must be reproduced exactly, such as software applications, databases and presidential inaugural addresses, this form of compression should not be used.