1. Main frequency
The main frequency, also called the clock frequency, is measured in MHz and indicates how fast the CPU clock runs. Main frequency of CPU = external frequency × frequency multiplication factor. Many people think that the main frequency determines the running speed of the CPU; this view is one-sided, and for servers it is misleading. So far there is no definite formula relating the main frequency to actual running speed, and even the two major processor manufacturers, Intel and AMD, disagree strongly on this point. The development of Intel's products shows that Intel attaches great importance to raising the main frequency. Other manufacturers take a different route: a 1 GHz Transmeta processor was once compared with, and found roughly equivalent to, a 2 GHz Intel processor.
Therefore, the main frequency of a CPU is not directly related to its actual computing power; it only represents the oscillation speed of the digital pulse signal in the CPU. Intel's own products offer an example: a 1 GHz Itanium chip can be almost as fast as a 2.66 GHz Xeon/Opteron, and a 1.5 GHz Itanium 2 is about as fast as a 4 GHz Xeon/Opteron. The running speed of a CPU also depends on the performance of its pipeline.
Of course, the main frequency is related to actual running speed, but it is only one aspect of CPU performance and does not represent the CPU's overall performance.
2. External frequency
The external frequency is the reference frequency of the CPU, measured in MHz. It determines the running speed of the whole motherboard. Put simply, on desktop machines, overclocking usually means raising the CPU's external frequency (in general, the CPU's multiplication factor is locked). For server CPUs, however, overclocking is absolutely not allowed. As mentioned above, the CPU's external frequency determines the running speed of the motherboard, and the two run synchronously. If a server CPU is overclocked by changing the external frequency, the two will run asynchronously (many desktop motherboards support asynchronous operation), and the whole server system becomes unstable.
At present, in most computer systems the external frequency is also the speed at which memory and motherboard run synchronously. In this sense the external frequency of the CPU is tied directly to the memory, so that the two run in sync. The external frequency is easily confused with the front-side bus (FSB) frequency; the FSB section below explains the difference.
3. FSB frequency
The FSB frequency directly affects the speed of data exchange between CPU and memory. The data bandwidth can be calculated as data bandwidth = (bus frequency × data bit width) / 8, so the maximum bandwidth depends on the width of the data transmitted at one time and on the transmission frequency. For example, the 64-bit Xeon Nocona has an 800 MHz front-side bus, so by the formula its maximum data transmission bandwidth is 6.4 GB/s.
The difference between external frequency and FSB frequency: the FSB speed refers to the speed of data transmission, while the external frequency refers to the speed at which CPU and motherboard run synchronously. In other words, a 100 MHz external frequency means that the digital pulse signal oscillates 100 million times per second, while a 100 MHz front-side bus means that the CPU can accept at most 100 MHz × 64 bit ÷ 8 bit/byte = 800 MB/s of data.
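The bandwidth formula above can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and the inputs are the 64-bit bus width and the 100 MHz and 800 MHz bus frequencies quoted in the text.

```python
def bus_bandwidth_mb_s(bus_frequency_mhz, data_width_bits):
    """Peak bandwidth in MB/s: (bus frequency in MHz x data width in bits) / 8."""
    return bus_frequency_mhz * data_width_bits / 8


# Figures quoted in the text: a 64-bit front-side bus at 100 MHz and at 800 MHz.
for mhz in (100, 800):
    print(f"{mhz} MHz x 64 bit -> {bus_bandwidth_mb_s(mhz, 64):.0f} MB/s")
```

The two results, 800 MB/s and 6400 MB/s (6.4 GB/s), match the figures given in the text. They are theoretical peaks, not measured throughput.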
In fact, the appearance of the "HyperTransport" architecture has changed the practical meaning of the FSB frequency. The IA-32 architecture needs three important components: the memory controller hub (MCH), the I/O controller hub and the PCI hub. Intel's typical chipsets Intel 7501 and Intel 7505, for example, are tailored for dual Xeon processors; the MCH they contain provides the CPU with a 533 MHz front-side bus, and with DDR memory the front-side bus bandwidth can reach 4.3 GB/s. However, as processor performance keeps improving, this system architecture runs into many problems. The "HyperTransport" architecture not only solves them but also raises bus bandwidth more effectively. Take the AMD Opteron processor: its flexible HyperTransport I/O bus architecture lets it integrate the memory controller, so that the processor exchanges data with memory directly instead of passing through the system bus to the chipset. In that case it is hard to say what the front-side bus (FSB) frequency of an AMD Opteron processor even refers to.
4. CPU bits and word length
Bit: digital circuits and computer technology use binary coding, whose only symbols are "0" and "1"; each "0" or "1" is one bit in the CPU.
Word length: in computer technology, the number of binary digits that a CPU can process at one time is called its word length. A CPU that processes data with a word length of 8 bits is therefore usually called an 8-bit CPU, and likewise a 32-bit CPU processes 32 bits of binary data at a time. Difference between byte and word length: because commonly used English characters can be represented by 8 binary digits, 8 bits are usually called a byte. The word length is not fixed; it differs from CPU to CPU. An 8-bit CPU can process only one byte at a time, a 32-bit CPU can process four bytes at a time, and a CPU with a word length of 64 bits can process 8 bytes at a time.
5. Frequency multiplication factor (multiplier)
The frequency multiplication factor (multiplier) is the ratio of the CPU main frequency to the external frequency. At the same external frequency, the higher the multiplier, the higher the CPU frequency. In practice, however, a high multiplier at the same external frequency is of limited value by itself. The data transmission speed between the CPU and the rest of the system is limited, so a CPU that reaches a high main frequency purely through a high multiplier hits an obvious "bottleneck": the rate at which it can obtain data from the system cannot keep up with its own running speed. Generally, apart from engineering samples, Intel CPUs have a locked multiplier, while AMD's earlier CPUs were not locked.
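The relationship main frequency = external frequency × frequency multiplication factor can be written out as a small Python sketch. The 200 MHz, 220 MHz and ×14 values are hypothetical numbers chosen for illustration; they do not describe any particular CPU.

```python
def main_frequency_mhz(external_mhz, multiplier):
    """Main frequency = external frequency x frequency multiplication factor."""
    return external_mhz * multiplier


# Hypothetical CPU: 200 MHz external frequency, multiplication factor locked at 14.
stock = main_frequency_mhz(200, 14)
# With the factor locked, the only way to overclock is to raise the external frequency.
overclocked = main_frequency_mhz(220, 14)
print(stock, overclocked)
```

In this sketch a 10% increase in external frequency gives a 10% increase in main frequency, and it also speeds up everything else that is clocked from the external frequency.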
6. Cache
Cache size is also one of the important indicators of a CPU, and the structure and size of the cache have a great influence on CPU speed. The cache in a CPU runs at a very high frequency, usually the same frequency as the processor, so it is far faster than system memory or the hard disk. In practice the CPU often needs to read the same data block repeatedly. A larger cache greatly improves the hit rate for data read inside the CPU, so that it need not go to memory or the hard disk, which improves system performance. However, because of CPU die area and cost, caches are very small.
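The effect of cache capacity on hit rate when the same data blocks are read repeatedly can be illustrated with a toy simulation. This is a sketch under stated assumptions: it models a fully associative cache with least-recently-used (LRU) replacement, which is simpler than the set-associative caches in real CPUs, and the access trace is hypothetical.

```python
from collections import OrderedDict


def lru_hit_rate(accesses, capacity):
    """Fraction of accesses served from an LRU cache that holds `capacity` blocks."""
    cache = OrderedDict()
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark block as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return hits / len(accesses)


# Hypothetical trace: a program loops over the same 8 data blocks 100 times.
trace = list(range(8)) * 100

small = lru_hit_rate(trace, capacity=4)  # working set does not fit in the cache
large = lru_hit_rate(trace, capacity=8)  # working set fits in the cache
print(f"4-block cache hit rate: {small:.0%}")
print(f"8-block cache hit rate: {large:.0%}")
```

With this trace the 4-block cache never hits, because each block is evicted before it is reused, while the 8-block cache misses only on the first pass. Real workloads fall between these two extremes.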
L1 cache is the first-level cache of the CPU, divided into a data cache and an instruction cache. The capacity and structure of the built-in L1 cache have a great influence on CPU performance. However, cache memory is built from static RAM, whose structure is complex, so given that the CPU die cannot be too large, the L1 cache cannot be made very large. The L1 cache of a typical server CPU is usually 32-256 KB.
L2 cache is the second-level cache of the CPU, and may be on-chip or off-chip. On-chip L2 cache runs at the same speed as the main frequency, while off-chip L2 cache runs at only half the main frequency. L2 cache capacity also affects CPU performance, and the principle is the bigger the better. At present the largest L2 cache in home-use CPUs is 512 KB, while the L2 cache of CPUs for servers and workstations ranges from 256 KB to 1 MB, and in some cases reaches 2 MB or 3 MB.
L3 cache (third-level cache) comes in two types: the early off-chip kind and the present on-chip kind. Its practical effect is to further reduce memory latency and to improve processor performance when computing over large amounts of data. Both are very helpful for games. In the server field, adding L3 cache also improves performance significantly. For example, a configuration with a larger L3 cache uses physical memory more efficiently, so it can handle more data requests even with a slower disk I/O subsystem. Processors with larger L3 caches also provide more efficient file system caching behavior and shorter message and processor queue lengths.
In fact, the earliest L3 cache was used with the K6-III processor released by AMD. Because of the manufacturing process of the time, that L3 cache was not integrated into the chip but placed on the motherboard. An L3 cache that could only run in sync with the system bus frequency was not much different from main memory. L3 cache was later used in the Itanium processor that Intel introduced for the server market, and then in the P4EE and Xeon MP. Intel also plans to launch an Itanium 2 processor with 9 MB of L3 cache and, later, a dual-core Itanium 2 processor with 24 MB of L3 cache.
However, L3 cache is not decisive for processor performance. For example, a Xeon MP processor equipped with 1 MB of L3 cache is still no match for the Opteron, which shows that a faster front-side bus brings a more effective performance improvement than a larger cache.
7. CPU extended instruction set
A CPU relies on instructions to compute and to control the system, and each CPU is designed with an instruction system that matches its hardware circuits. The strength of the instruction set is an important indicator of a CPU, and the instruction set is one of the most effective tools for improving microprocessor efficiency. In terms of current mainstream architectures, instruction sets fall into two groups, complex instruction sets and reduced instruction sets. In terms of specific applications, Intel's MMX (Multimedia Extensions), SSE and SSE2 (Streaming SIMD Extensions, where SIMD stands for Single Instruction, Multiple Data) and SSE3, and AMD's 3DNow!, are all extended instruction sets, which strengthen the CPU's multimedia, graphics and image, and Internet processing capabilities. The extended instruction set of a CPU is usually just called the "CPU instruction set". SSE3 is also the smallest of these instruction sets: MMX contains 57 instructions, SSE contains 50, SSE2 contains 144 and SSE3 contains 13. At present SSE3 is also the most advanced instruction set. The Intel Prescott processor already supports SSE3, AMD will add SSE3 support in its future dual-core processors, and Transmeta processors will also support it.
8. CPU core and I/O working voltage
Starting with 586-class CPUs, the working voltage of the CPU has been divided into core voltage and I/O voltage, and the core voltage is generally less than or equal to the I/O voltage. The core voltage depends on the CPU's manufacturing process: generally, the smaller the process, the lower the core working voltage. I/O voltage is generally 1.6-5 V. Low voltage addresses the problems of excessive power consumption and excessive heat.
9. Manufacturing process
The micron (or nanometer) figure of a manufacturing process refers to the distance between circuits within an integrated circuit. The trend in manufacturing processes is toward higher density. A denser IC design means that an IC of the same size and area can hold circuits of higher density and more complex function. The main processes today are 180 nm, 130 nm and 90 nm, and a 65 nm manufacturing process has recently been officially announced.
10. Instruction set
(1)CISC instruction set
The CISC instruction set is also known as the complex instruction set (CISC is the abbreviation of Complex Instruction Set Computer). In a CISC microprocessor, the instructions of a program are executed serially in order, and the operations within each instruction are also executed serially in order. The advantage of sequential execution is simple control, but the utilization of the various parts of the computer is low and execution is slow. In practice, CISC means the x86 series (that is, IA-32 architecture) CPUs produced by Intel and compatible CPUs such as those from AMD and VIA. Even the new x86-64 (also called AMD64) belongs to CISC.
To understand what an instruction set is, we should start with today's x86 architecture CPUs. The x86 instruction set was developed by Intel for its first 16-bit CPU (the i8086). The CPU in the world's first PC, introduced by IBM in 1981, was the i8088 (a simplified version of the i8086), and it also used x86 instructions. An x87 chip was added to the computer to improve floating-point processing, and from then on the x86 instruction set and the x87 instruction set were together referred to as the x86 instruction set.
With the continuous development of CPU technology, Intel successively developed the newer i80386 and i80486, then the PII Xeon, PIII Xeon and Pentium 3, and finally today's Pentium 4 series and Xeon (excluding the Xeon Nocona). To ensure that computers can keep running the applications developed in the past, and so protect and inherit a rich body of software, all of these Intel CPUs continue to use the x86 instruction set and therefore still belong to the x86 series. Because Intel's x86 series and compatible CPUs (such as the AMD Athlon MP) all use the x86 instruction set, they form today's huge lineup of x86 and compatible CPUs. At present, x86 CPUs mainly comprise Intel's server CPUs and AMD's server CPUs.
(2)RISC instruction set
RISC is the abbreviation of "Reduced Instruction Set Computing". It was developed on the basis of CISC instruction systems. Tests on CISC machines showed that different instructions are used with very different frequency: the most commonly used are a number of simple instructions that make up only 20% of the instruction set but account for 80% of the instructions appearing in programs. A complex instruction system inevitably makes the microprocessor more complex, leading to long development time and high cost, and complex instructions require complex operations, which inevitably slows the computer down. For these reasons RISC CPUs were born in the 1980s. Compared with a CISC CPU, a RISC CPU not only simplifies the instruction system but also adopts superscalar and superpipelined structures, which greatly increase parallel processing ability. The RISC instruction set is the development direction of high-performance CPUs and stands in contrast to traditional CISC: it has a uniform instruction format and fewer instruction types and addressing modes than a complex instruction set, and its processing speed is correspondingly much higher. CPUs with this instruction system are widely used in mid-range and high-end servers, and high-end servers in particular all use RISC CPUs. The RISC instruction system is better suited to UNIX, the operating system of high-end servers, and Linux is also a UNIX-like operating system. RISC CPUs are incompatible with Intel and AMD CPUs in both software and hardware.
At present, the CPU using RISC instruction in middle and high-end servers mainly includes the following categories: PowerPC processor, SPARC processor, PA-RISC processor, MIPS processor and Alpha processor.
(3)IA-64
There has been a lot of debate about whether EPIC (Explicitly Parallel Instruction Computing) is the successor of the RISC and CISC systems. Taken on its own, the EPIC system looks more like an important step by Intel's processors toward the RISC system. In theory, under the same host configuration, a CPU designed on the EPIC system handles Windows application software much better than Unix-based application software.
Intel's server CPU with EPIC technology is the Itanium (development code name Merced). It is a 64-bit processor and the first model in the IA-64 series. Microsoft has also developed an operating system code-named Win64 to support it in software. After staying with the x86 instruction set for so long, Intel turned to look for a more advanced 64-bit microprocessor. It did so because it wanted to shed the bulky x86 architecture and introduce a vigorous and powerful instruction set, and so the IA-64 architecture with the EPIC instruction set was born. IA-64 improves on x86 in many respects: it breaks through many limitations of the traditional IA-32 architecture and achieves breakthroughs in data processing capacity, system stability, security, usability and observability.
The biggest defect of IA-64 microprocessors is their incompatibility with x86. To let IA-64 processors run software of both generations, Intel added an x86-to-IA-64 decoder to its IA-64 processors (Itanium, Itanium 2 ...) that translates x86 instructions into IA-64 instructions. This decoder is not the most efficient one, nor is it the best way to run x86 code (the best way is to run x86 code directly on an x86 processor), so Itanium and Itanium 2 perform poorly when running x86 applications. This is also the root cause of the emergence of x86-64.
(4)X86-64 (AMD64 / EM64T)
x86-64 was designed by AMD. It handles 64-bit integer operations and remains compatible with the 32-bit x86 architecture. It supports 64-bit logical addressing with the option of switching to 32-bit addressing. Data operation instructions default to 32 bits and 8 bits, with the option of switching to 64 bits and 16 bits. General-purpose registers are supported, and the result of a 32-bit operation is extended to a full 64 bits. Instructions therefore distinguish between "direct execution" and "conversion execution". Instruction fields are 8 bits or 32 bits, which keeps the fields from becoming too long.
The emergence of x86-64 (also called AMD64) was not without cause. The 32-bit addressing space of x86 processors is limited to 4 GB of memory, and IA-64 processors are not compatible with x86. Taking customers' needs into account, AMD strengthened the x86 instruction set so that it also supports a 64-bit computing mode, and called the resulting architecture x86-64. Technically, for 64-bit operation AMD introduced the general-purpose registers R8-R15 as an extension of the original x86 registers, although these registers are not fully usable in a 32-bit environment. The original registers such as EAX and EBX were also widened from 32 bits to 64 bits. Eight new registers were added to the SSE unit to support SSE2. The larger number of registers improves performance. To support both 32-bit and 64-bit code and registers, the x86-64 architecture lets the processor work in two modes, long mode and legacy mode, and long mode is divided into two sub-modes (64-bit mode and compatibility mode). This standard has been introduced in the Opteron, AMD's server processor.
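The 4 GB limit of 32-bit addressing follows directly from the number of address bits. The short Python sketch below shows the arithmetic for a flat, byte-addressed space. The function name is made up for illustration, and the 64-bit figure is the architectural maximum only; actual processors implement fewer address bits.

```python
GIB = 2 ** 30  # bytes in one gibibyte (the "GB" used in the text)


def addressable_gib(address_bits):
    """Size in GiB of a flat, byte-addressed space with the given number of address bits."""
    return 2 ** address_bits // GIB


print(addressable_gib(32))  # 4
print(addressable_gib(64))  # 17179869184
```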
This year Intel also introduced EM64T, its 64-bit technology. Before it was officially named EM64T it was called IA32E, the name of Intel's 64-bit extension technology, used to distinguish it from the x86 instruction set. Intel's EM64T supports a 64-bit sub-mode similar to AMD's x86-64 technology. It uses 64-bit flat linear addressing and adds 8 new general-purpose registers (GPRs) and 8 registers for SSE instructions. As with AMD, Intel's 64-bit technology is compatible with both IA32 and IA32E, and IA32E is used only when running a 64-bit operating system. IA32E consists of two sub-modes, 64-bit sub-mode and 32-bit sub-mode, and like AMD64 it is backward compatible. Intel's EM64T will be fully compatible with AMD's x86-64 technology. The Nocona processor now includes some of this 64-bit technology, and Intel's Pentium 4E processor also supports it.
Both are 64-bit microprocessor architectures compatible with the x86 instruction set, but there are some differences between EM64T and AMD64. For example, the NX bit found in AMD64 processors is not provided in Intel's processors.
11. Superpipeline and superscalar
Before explaining superpipelining and superscalar, let us first understand the pipeline. Intel first used a pipeline in the 486 chip. A pipeline works like an assembly line in industrial production. In the CPU, an instruction processing pipeline is made up of 5-6 circuit units with different functions; an x86 instruction is split into 5-6 steps, each executed by one of these units, so that one instruction can be completed per CPU clock cycle, which raises the running speed of the CPU. Each integer pipeline of the classic Pentium is divided into four stages (instruction prefetch, decoding, execution and result write-back), and the floating-point pipeline is divided into eight stages.
Superscalar means executing multiple instructions at the same time through multiple built-in pipelines; in essence it trades space for time. Superpipelining means completing one or more operations per machine cycle by dividing the pipeline into finer stages and raising the main frequency; in essence it trades time for space. For example, the pipeline of the Pentium 4 is as long as 20 stages. The more stages a pipeline has, the less work each stage does and the faster each stage completes, so the design can adapt to a higher working frequency. However, a very long pipeline also brings side effects, and a CPU with a higher frequency may well end up with lower actual running speed. This is the case with Intel's Pentium 4: although its main frequency can be as high as 1.4 GHz, its performance falls well short of the AMD Athlon and even the Pentium III.
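The gain from pipelining and from multiple pipelines can be estimated with the usual idealized cycle count. The sketch below assumes a perfect pipeline: every stage takes one cycle and there are no stalls, branch mispredictions or dependencies between instructions. Real processors, and long pipelines in particular, fall short of this model. The function names and the 100-instruction, 5-stage example are chosen for illustration.

```python
import math


def cycles_unpipelined(n_instructions, n_stages):
    """Each instruction passes through all stages before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions, n_stages, n_pipelines=1):
    """Ideal count: fill the pipeline once, then finish up to n_pipelines instructions per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + math.ceil(n_instructions / n_pipelines) - 1


n, stages = 100, 5
print(cycles_unpipelined(n, stages))               # 500
print(cycles_pipelined(n, stages))                 # 104
print(cycles_pipelined(n, stages, n_pipelines=2))  # 54
```

In this model one pipeline approaches one instruction per cycle, and two pipelines (a simple superscalar arrangement) approach two per cycle. Adding stages only lengthens the fill time here; the benefit of a deeper pipeline comes from the higher clock frequency it allows, which this cycle count does not capture.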
12. Packaging form
CPU packaging is a protective measure in which the CPU chip or CPU module is encapsulated in a specific material to prevent damage. Generally a CPU can be delivered to users only after it has been packaged. The packaging method depends on how the CPU is installed and on the integrated design of the device. Broadly speaking, CPUs installed in Socket-type sockets are usually packaged as PGA (Pin Grid Array), while CPUs installed in Slot x slots are all packaged as SEC (Single Edge Contact cartridge). There are now also packaging technologies such as PLGA (Plastic Land Grid Array) and OLGA (Organic Land Grid Array). With market competition increasingly fierce, CPU packaging technology is currently developing mainly toward lower cost.
13. Multithreading
Simultaneous multithreading is abbreviated SMT. By duplicating the architectural state of the processor, SMT lets multiple threads on the same processor execute simultaneously and share the processor's execution resources. This makes the most of wide-issue, out-of-order superscalar processing, raises the utilization of the processor's functional units, and hides memory access latency caused by data dependences or cache misses. When multiple threads are not available, an SMT processor is almost the same as a traditional wide-issue superscalar processor. The most attractive thing about SMT is that it needs only small changes to the design of the processor core and almost no additional cost to improve performance significantly. Multithreading can keep the high-speed computing core supplied with more data to process and reduce its idle time, which is undoubtedly very attractive for low-end desktop systems. Starting with the 3.06 GHz Pentium 4, all Intel processors will support SMT.
14. Multi-core
Multi-core is also known as the chip multiprocessor (CMP). CMP was proposed at Stanford University. The idea is to integrate the SMP (symmetric multiprocessor) of large-scale parallel processors onto a single chip, with each processor executing a different process in parallel. Compared with CMP, the SMT processor structure is notable for its flexibility. However, once semiconductor technology reached 0.18 micron, wire delay exceeded gate delay, which requires microprocessor designs to be divided into many basic unit structures that are smaller and have better locality. The CMP structure is already divided into multiple processor cores, each relatively simple and therefore easier to optimize, so it has better development prospects. At present both IBM's Power 4 chip and Sun's MAJC5200 chip use the CMP structure. Multi-core processors can share cache within the processor, which improves cache utilization and reduces the complexity of multiprocessor system design.
In the second half of 2005, new processors from Intel and AMD will also adopt the CMP structure. The new Itanium processor, development code name Montecito, has a dual-core design, at least 18 MB of on-chip cache, and is manufactured on a 90 nm process. Its design is a real challenge to today's chip industry. Each core has its own L1, L2 and L3 caches, and the chip contains about 1 billion transistors.
15. SMP
Symmetric multi-processing (SMP) refers to a group of processors (multiple CPUs) assembled in one computer, with the CPUs sharing the memory subsystem and the bus structure. With this technology a server system can run multiple processors at the same time and share memory and other host resources among them. A dual Xeon system, what we call a dual-processor system, is the most common kind of symmetric multiprocessor system (Xeon MP can support four processors, and AMD Opteron can support 1-8). A few systems have 16 processors. Generally speaking, though, machines with an SMP structure scale poorly, and it is difficult to reach 100 or more processors. The usual number is 8 to 16, which is enough for most users. SMP is most common in high-performance server and workstation-class motherboard architectures, such as UNIX servers, which can support systems with up to 256 CPUs.
The necessary conditions for building an SMP system are: hardware that supports SMP, including the motherboard and CPUs; a system platform that supports SMP; and application software that supports SMP.
For an SMP system to run efficiently, the operating system must support SMP, as do 32-bit operating systems such as WINNT, LINUX and UNIX. That is, it must be capable of multitasking and multithreading. Multitasking means that the operating system lets different CPUs complete different tasks at the same time; multithreading means that the operating system lets different CPUs complete the same task in parallel.
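The idea of several workers completing the same task in parallel can be sketched in Python with the standard `concurrent.futures` module. This only illustrates the structure of splitting one task across threads; the function names and the choice of summing a list are made up for the example. Note also that in standard CPython builds the global interpreter lock keeps threads from running Python bytecode on several CPUs at once, so this sketch shows the decomposition rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(chunk):
    """Work done by one thread: sum its own slice of the data."""
    return sum(chunk)


def parallel_sum(numbers, workers=4):
    """Split one task (summing a list) into chunks handled by several worker threads."""
    size = max(1, len(numbers) // workers)
    chunks = [numbers[i:i + size] for i in range(0, len(numbers), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))


print(parallel_sum(list(range(1, 101))))  # 5050
```

On an SMP system it is the operating system scheduler, not the program, that decides which CPU each thread runs on.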
Building an SMP system places high demands on the CPUs chosen. First, the CPU must have a built-in APIC (Advanced Programmable Interrupt Controller) unit; the core of Intel's multiprocessing specification is the use of advanced programmable interrupt controllers (APICs). Second, the CPUs must be the same product model, with the same type of CPU core and exactly the same running frequency. Finally, the product serial numbers should be kept as close as possible, because when CPUs from two different batches run as dual processors, one CPU may be heavily loaded while the other is only lightly loaded, so that the system cannot reach its full performance and, more seriously, may crash.
16. NUMA technology
NUMA is a non-uniform memory access, distributed shared-memory technology: a system made up of several independent nodes connected through a high-speed private network. Each node can be a single CPU or an SMP system. In NUMA there are several solutions for cache coherence, which need the support of the operating system and special software. Fig. 2 shows an example, the NUMA system of Sequent. Three SMP modules are connected into one node through a high-speed private network, and each node can have 12 CPUs. A system like Sequent's can have up to 64 CPUs or even 256 CPUs. Clearly, this combines SMP with NUMA technology.
17. Out-of-order execution technology
Out-of-order execution is a technique in which the CPU allows multiple instructions to be issued out of the order specified in the program and sent to the corresponding circuit units for processing. Based on the state of each circuit unit and on whether each instruction can be executed early, instructions that can be executed early are sent immediately to the corresponding circuit units. During this phase instructions are not executed in the specified order; afterwards a reordering unit rearranges the results from the execution units into program order. The purpose of out-of-order execution is to keep the CPU's internal circuits fully loaded and so raise the speed at which the CPU runs programs. Branch technology: a branch instruction has to wait for a result before it can proceed. In general, unconditional branches simply execute in instruction order, while conditional branches must decide, according to the result already processed, whether to continue in the original order.
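The issue-early, reorder-afterwards behavior described above can be illustrated with a toy scheduler in Python. This is a deliberately simplified sketch, not a model of any real processor. It assumes unlimited execution units, that each register is written by only one instruction (as if registers had already been renamed), and that the instruction names, registers and latencies are hypothetical.

```python
def simulate_out_of_order(program):
    """program: list of (name, dest, sources, latency) tuples in program order.

    Returns (issue_order, retire_order), where retire_order pairs each
    instruction with the cycle in which its result is committed.
    """
    # Registers that some instruction will write are not ready until it finishes.
    ready_at = {dest: float("inf") for _, dest, _, _ in program}
    finish = {}
    pending = list(program)
    issue_order = []
    cycle = 0
    while pending:
        for instr in list(pending):
            name, dest, sources, latency = instr
            # Issue as soon as every source register holds its value,
            # regardless of where the instruction sits in program order.
            if all(ready_at.get(src, 0) <= cycle for src in sources):
                issue_order.append(name)
                finish[name] = cycle + latency
                ready_at[dest] = cycle + latency
                pending.remove(instr)
        cycle += 1

    # Reordering step: results are committed strictly in program order.
    retire_order = []
    last = 0
    for name, _dest, _sources, _latency in program:
        last = max(last, finish[name])
        retire_order.append((name, last))
    return issue_order, retire_order


program = [
    ("LOAD", "r1", [], 3),           # slow memory load
    ("ADD", "r2", ["r1"], 1),        # must wait for the load
    ("MUL", "r3", ["r4", "r5"], 1),  # independent, can run early
]
issued, retired = simulate_out_of_order(program)
print(issued)   # ['LOAD', 'MUL', 'ADD']
print(retired)  # [('LOAD', 3), ('ADD', 4), ('MUL', 4)]
```

MUL is issued ahead of ADD because its operands are already available, but its result is committed only after ADD, which preserves program order.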
18. Memory controller of CPU
Many applications have complex read patterns (almost random, especially when cache hits are unpredictable) and cannot use bandwidth effectively. Business processing software is a typical example. Even with CPU features such as out-of-order execution, such software is limited by memory latency: the CPU must wait until the data needed for an operation has been loaded before it can execute the instruction (whether the data comes from the CPU cache or from main memory). At present the memory latency of low-end systems is about 120-150 ns, while CPU speeds are above 3 GHz, so a single memory request may waste 200-300 CPU cycles. Even with a cache hit rate of 99%, the CPU may spend 50% of its time waiting for memory requests to finish, purely because of memory latency.
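The claim that a 99% hit rate can still leave the CPU waiting about half the time can be reproduced with a simple average-cost model. The sketch below takes the 99% hit rate and a 300-cycle miss penalty from the text. The figure of 3 cycles of useful work per memory access is an assumption made for this illustration, and the function name is hypothetical.

```python
def stall_fraction(hit_rate, miss_penalty_cycles, busy_cycles_per_access):
    """Fraction of time spent waiting on memory under a simple average-cost model."""
    stall = (1.0 - hit_rate) * miss_penalty_cycles  # average stall cycles per access
    return stall / (busy_cycles_per_access + stall)


# 99% hit rate and 300-cycle miss penalty from the text;
# 3 busy cycles per access is an assumed value.
fraction = stall_fraction(0.99, 300, 3)
print(f"{fraction:.0%}")
```

Under these assumptions the 1% of accesses that miss cost on average 3 cycles per access, as much as the useful work itself, so about half of all cycles are spent waiting.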
As can be seen, the latency of the memory controller integrated into the Opteron is much lower than that of a chipset-based dual-channel DDR memory controller. Intel, as planned, is also integrating a memory controller into the processor, which makes the northbridge chip less important. This changes the way the processor accesses main memory, and it helps to increase bandwidth, reduce memory latency and improve processor performance.