Hyper-Threading Technology based on the Itanium architecture: IOE+CMT
Hyper-Threading Technology based on the Atom architecture: IOE+SMT
Hyper-Threading Technology based on the Nehalem architecture: OOOE+SMT

SMT (simultaneous multithreading) is an established technical term and the name of a specific technique. It is used not only by Nehalem but by many commercial processors, including the Pentium 4. More precisely, Nehalem's HT, like the Pentium 4's HT, is an implementation of SMT.
In fact, several Intel processor families use Hyper-Threading Technology: besides the Pentium 4 (NetBurst architecture) and Core i7 (Nehalem architecture), there are also Itanium 2 and Atom (Silverthorne). However, not all of the HT implementations they carry are SMT!
Before sorting out Intel's HT variants, let's review how multithreading techniques are classified. Multithreading (MT) runs multiple worker threads on a single processing core at the same time. This sets it apart from CMP (chip multiprocessing), the familiar multi-core approach, which raises system throughput by integrating multiple processing cores. Mainstream processors all use CMP.
However, CMP adds a large amount of circuitry and raises cost. MT does not: it needs only a small amount of extra circuitry (usually about 2%) to raise the processor's overall throughput, and it readily improves the performance of suitable applications.
Multithreading originated from the idea of thread-level parallelism (TLP), the counterpart of ILP (instruction-level parallelism), and can be traced back to the 1990s. This idea gave rise to the term throughput computing, which aims to improve the performance of parallel workloads such as online transaction processing. The two main approaches to throughput computing are multiprocessing and multithreading.
Initially, to exploit ILP, many techniques were developed over the past decades: superscalar execution (multiple execution units working at once), out-of-order execution (letting instructions with no data dependence run at the same time), dynamic branch prediction, VLIW (very long instruction word), and so on. The first three can be seen in the classic Pentium Pro architecture; the last appears in Itanium processors. However, superscalar design greatly increases complexity, and the data and control dependencies between instructions limit the ILP that can be extracted. These and other factors make it hard for a classic superscalar processor to improve performance further.
Moreover, from the application side, workloads such as online transaction processing (OLTP), decision support systems (DSS), and Web services are rich in thread-level parallelism but poor in ILP, which drove the emergence of multiprocessing and multithreading.
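To make the "limited ILP" point concrete, here is a minimal sketch that estimates the best-case ILP of a short instruction sequence from its data dependencies. The function name `ilp_limit`, the six-instruction sequence, and the assumptions of unlimited execution units and one-cycle latency are all illustrative and not taken from any real processor.

```python
def ilp_limit(deps):
    """Estimate the best-case ILP of an instruction sequence.

    deps[i] lists the indices of earlier instructions that instruction i
    depends on. Assumes unlimited execution units and 1-cycle latency.
    Returns (critical_path_length, average_ilp).
    """
    depth = []
    for producers in deps:
        # An instruction can run one cycle after its latest producer.
        depth.append(1 + max((depth[j] for j in producers), default=0))
    critical = max(depth, default=0)
    ilp = len(deps) / critical if critical else 0.0
    return critical, ilp


# Hypothetical sequence:
#   i0: a = load x
#   i1: b = load y
#   i2: c = a + b    (needs i0, i1)
#   i3: d = c * 2    (needs i2)
#   i4: e = load z
#   i5: f = d + e    (needs i3, i4)
deps = [[], [], [0, 1], [2], [], [3, 4]]
critical, ilp = ilp_limit(deps)
print(critical, ilp)  # 4 1.5
```

Even with unlimited hardware, the dependency chain i0 → i2 → i3 → i5 forces four steps, so on average only 1.5 instructions can run per cycle. Adding more execution units does not help such code.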
The idea behind multithreading is somewhat similar to early time-sharing systems: when a processor running multiple threads stalls because of a cache miss or a branch misprediction, it can switch to another thread. Mainstream multithreading comes in three forms, coarse-grained (CMT), fine-grained (FMT), and simultaneous (SMT), which differ in the resources shared between threads and in the thread-switching mechanism.
Of these, CMT and FMT both work with a single execution unit, so different threads are never truly "parallel" at the instruction level, whereas SMT has multiple execution units and can execute several instructions at once. The first two are therefore sometimes grouped as TMT (temporal multithreading) to distinguish them from SMT.
We start with CMT, coarse-grained multithreading, because it is the simplest form. When the running thread hits a long-latency event (such as a cache miss), the processor switches to another thread and runs it until the operation the original thread was waiting for completes. Coarse-grained multithreading is sometimes also called block multithreading or cooperative multithreading.
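The switch-on-stall behaviour can be sketched with a toy cycle-level model. Everything in it is an assumption made for illustration: the name `run_cmt`, the encoding of a thread as a string of `'c'` (one-cycle compute) and `'m'` (cache-missing) operations, the four-cycle miss latency, and a zero-cost thread switch (real hardware pays a pipeline-flush penalty).

```python
def run_cmt(threads, miss_latency=4):
    """Toy coarse-grained multithreading on one execution unit.

    threads: list of strings; 'c' = 1-cycle compute op, 'm' = op that
    misses the cache and blocks its thread for miss_latency cycles.
    Returns the thread id issued in each cycle (None = idle cycle).
    """
    n = len(threads)
    pc = [0] * n          # next op index per thread
    ready_at = [0] * n    # first cycle each thread may issue again
    current, cycle, timeline = 0, 0, []

    def runnable(t, now):
        return pc[t] < len(threads[t]) and ready_at[t] <= now

    while any(pc[t] < len(threads[t]) for t in range(n)):
        if not runnable(current, cycle):
            # Switch only when the current thread is blocked or finished.
            candidates = [t for t in range(n) if runnable(t, cycle)]
            if not candidates:
                timeline.append(None)   # every thread is waiting
                cycle += 1
                continue
            current = candidates[0]
        op = threads[current][pc[current]]
        pc[current] += 1
        if op == 'm':
            ready_at[current] = cycle + 1 + miss_latency
        timeline.append(current)
        cycle += 1
    return timeline


print(run_cmt(["cmcc", "ccmc"]))
# [0, 0, 1, 1, 1, None, 0, 0, None, 1]
```

In this model each thread alone needs 8 cycles (4 operations plus a 4-cycle stall), so running them back to back would take 16 cycles. With switching on the miss, both finish in 10.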
Because it is simple, CMT has been implemented in many processors, including many embedded microcontrollers, as well as those listed below:
1999 IBM RS64-III "Pulsar" (single-core / 2 threads)
2005 Fujitsu SPARC64 VI "Olympus-C" (dual-core / 4 threads)
2006 Intel Itanium 2 "Montecito" (dual-core / 4 threads)
2007 Intel Itanium 2 "Montvale" (dual-core / 4 threads)
Intel's Itanium 2 stands out on this list.
FMT, fine-grained multithreading, can switch between threads on every clock cycle in pursuit of maximum throughput. Of course, switching this often has a price: it lengthens the average execution time of each individual thread. Fine-grained multithreading is sometimes called interleaved multithreading or preemptive multithreading.
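A minimal sketch of per-cycle interleaving follows. It is a toy model built on illustrative assumptions: the name `run_fmt`, threads encoded as strings of `'c'` (one-cycle compute) and `'m'` (cache-missing) operations, a four-cycle miss latency, and a plain round-robin selection policy. Real fine-grained designs use more elaborate selection logic.

```python
def run_fmt(threads, miss_latency=4):
    """Toy fine-grained multithreading on one execution unit.

    Every cycle the next ready thread in round-robin order issues one op.
    threads: list of strings; 'c' = 1-cycle compute op, 'm' = op that
    misses the cache and blocks its thread for miss_latency cycles.
    Returns the thread id issued in each cycle (None = idle cycle).
    """
    n = len(threads)
    pc = [0] * n          # next op index per thread
    ready_at = [0] * n    # first cycle each thread may issue again
    last, cycle, timeline = -1, 0, []

    while any(pc[t] < len(threads[t]) for t in range(n)):
        chosen = None
        # Start searching from the thread after the one issued last.
        for k in range(1, n + 1):
            t = (last + k) % n
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                chosen = t
                break
        if chosen is not None:
            op = threads[chosen][pc[chosen]]
            pc[chosen] += 1
            if op == 'm':
                ready_at[chosen] = cycle + 1 + miss_latency
            last = chosen
        timeline.append(chosen)
        cycle += 1
    return timeline


print(run_fmt(["cmcc", "ccmc"]))
# [0, 1, 0, 1, 1, None, None, 0, 0, 1]
```

The timeline shows threads alternating cycle by cycle while both are ready, which is also why each thread's own operations finish later than they would if it ran alone.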
FMT is more complex than CMT, so fewer processors implement it. Examples include:
2005 Sun UltraSPARC T1 "Niagara" (8 cores / 32 threads)
2007 Sun UltraSPARC T2 "Niagara 2" (8 cores / 64 threads)
In fact, the UltraSPARC T2 also uses another MT technique, which makes its multithreading capability twice that of the T1. Take a closer look at the picture above: which MT technique does the T2 use? (Note that the first "CMT" there refers to chip multithreading, not coarse-grained multithreading.)
Although FMT has not been widely applied in CPUs, fine-grained multithreading was already in use in GPUs as early as NVIDIA's G40 and ATI's R520. To hide the long memory latency of cache misses, the execution units in a GPU constantly switch between worker threads, which raises overall throughput. I don't know exactly how many threads the G40's FMT implements; according to the available data, it should be around 100.
As mentioned earlier, SMT differs from the other two multithreading techniques, which are grouped together as TMT (temporal multithreading). SMT, simultaneous multithreading, has multiple execution units and can run several instructions at the same time, hence the name. SMT grew out of the effort to fully exploit superscalar processors, since superscalar means that several different instructions can execute at once. SMT therefore offers the greatest flexibility and resource utilization, but it is also the most complicated of the three to implement (although it is still a piece of cake compared with a multi-core design). Processors that implement SMT include:
2002 Intel Pentium 4 Xeon "Prestonia" (single-core / 2 threads)
2007 Sun UltraSPARC T2 "Niagara 2" (8 cores / 64 threads)
2008 Intel Core i7 "Nehalem" (4 cores / 8 threads)
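As a rough illustration of what "simultaneous" means, the toy model below lets several execution units issue operations from different threads in the same cycle. The name `run_smt`, the string encoding of threads (`'c'` for a one-cycle compute operation, `'m'` for a cache-missing one), the four-cycle miss latency, and the limit of one operation per thread per cycle are all simplifying assumptions. A real SMT core can also issue several instructions from the same thread.

```python
def run_smt(threads, width=2, miss_latency=4):
    """Toy simultaneous multithreading with `width` execution units.

    Each cycle, up to `width` ready threads issue one op each, so ops
    from different threads execute in the same cycle.
    threads: list of strings; 'c' = 1-cycle compute op, 'm' = op that
    misses the cache and blocks its thread for miss_latency cycles.
    Returns, per cycle, a tuple of the thread ids that issued.
    """
    n = len(threads)
    pc = [0] * n          # next op index per thread
    ready_at = [0] * n    # first cycle each thread may issue again
    cycle, timeline = 0, []

    while any(pc[t] < len(threads[t]) for t in range(n)):
        issued = []
        for t in range(n):
            if len(issued) == width:
                break       # all execution units are busy this cycle
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                op = threads[t][pc[t]]
                pc[t] += 1
                if op == 'm':
                    ready_at[t] = cycle + 1 + miss_latency
                issued.append(t)
        timeline.append(tuple(issued))
        cycle += 1
    return timeline


print(run_smt(["cmcc", "ccmc"]))
# [(0, 1), (0, 1), (1,), (), (), (), (0,), (0, 1)]
```

Cycles where the tuple holds two ids are the ones in which both execution units are busy with different threads. A single execution unit cannot do this, however it switches between threads.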
Here the UltraSPARC T2 appears again, because it combines FMT and SMT: each T2 core has two execution units, one per thread group, and the four threads within each group are switched in the same way as on the T1 (FMT). Modern GPUs also adopt a similar hybrid design:
Different stream processors can execute different threads simultaneously, and a single stream processor can also switch between threads.

Having introduced the various MT techniques, let's look at Intel's Hyper-Threading Technology. As mentioned earlier, Intel CPUs with Hyper-Threading Technology include the Pentium 4 (NetBurst architecture), Core i7 (Nehalem architecture), Itanium 2, and Atom (Silverthorne). We already know that the Pentium 4 and Pentium 4 Xeon with Hyper-Threading Technology (not every P4 has it) use SMT, and that the Core i7 carries an improved version.

Now consider Itanium 2. Montecito is a dual-core design with two threads per core, and its Hyper-Threading uses CMT. Itanium 2's Hyper-Threading therefore differs from the Pentium 4's SMT: it is actually coarse-grained multithreading. The reason is that Itanium 2 is an in-order architecture, whereas SMT was originally conceived to squeeze the most out of OOOE (out-of-order execution). Itanium 2, being in-order, does not adopt SMT because the cost of implementing it would be too high.

So can an in-order processor not implement SMT at all? It can, and Intel's Atom is a typical example. Besides Atom, IBM's monster POWER6 (starting at 4.7 GHz), a dual-core processor with two threads per core, also implements SMT on an in-order architecture. In short, POWER6 is in-order + SMT, while POWER5 is out-of-order + SMT.
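The combinations discussed in this article can be collected into a small lookup table. This is only a restatement of the text in code form. The names `DESIGNS` and `group_by_design` are made up for the example, and the "out-of-order" label on the Pentium 4 is general background rather than something stated explicitly above.

```python
from collections import defaultdict

# (execution order, multithreading type) for each processor discussed above.
DESIGNS = {
    "Pentium 4 (NetBurst)": ("out-of-order", "SMT"),  # order: background knowledge
    "Core i7 (Nehalem)": ("out-of-order", "SMT"),
    "Itanium 2 (Montecito)": ("in-order", "CMT"),
    "Atom (Silverthorne)": ("in-order", "SMT"),
    "IBM POWER5": ("out-of-order", "SMT"),
    "IBM POWER6": ("in-order", "SMT"),
}


def group_by_design(designs):
    """Group processor names by their (execution order, MT type) pair."""
    groups = defaultdict(list)
    for cpu, design in designs.items():
        groups[design].append(cpu)
    return dict(groups)


for (order, mt), cpus in group_by_design(DESIGNS).items():
    print(f"{order} + {mt}: {', '.join(cpus)}")
```

Grouped this way, "Hyper-Threading" turns out to be a brand name covering three different combinations rather than a single technique.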