Simultaneous multithreading

From Wikipedia, the free encyclopedia

Simultaneous multithreading, often abbreviated as SMT, is a technique for improving the overall efficiency of superscalar CPUs. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.

1 Details
2 Taxonomy
3 Historical implementations
4 Modern commercial implementations
5 See also
6 References
7 External links

[edit] Details

Normal multithreading operating systems allow multiple processes and threads to utilize the processor one at a time, giving exclusive ownership to a particular thread for a time slice in the order of milliseconds - this is called Temporal multithreading. Quite often, a process will stall for hundreds of cycles while waiting for some external resource (for example, a RAM load), thus lowering processor efficiency.

A successive improvement is super-threading, where the processor can execute instructions from a different thread each cycle. Thus cycles left unused by a thread can be used by another that is ready to run.

Still, a given thread is almost surely not utilizing all the multiple execution units of a modern processor at the same time. Simultaneous multithreading allows multiple threads to execute different instructions in the same clock cycle, using the execution units that the first thread left spare. This is done without great changes to the basic processor architecture: the main additions needed are the ability to fetch instructions from multiple threads in a cycle, and a larger register file to hold data from multiple threads. The number of concurrent threads can be decided by the chip designers, but practical restrictions on chip complexity usually limit the number to 2, 4 or sometimes 8 concurrent threads.

Since the technique is really an efficiency solution, and there is inevitable increased conflict on shared resources, measuring or agreeing on the "goodness" of the solution can be difficult. Some researchers have shown that the extra threads can be used to proactively seed a shared resource like a cache, to improve the performance of another single thread, and claim this shows that SMT is not just an efficiency solution. Others use SMT to provide redundant computation, for some level of error detection and recovery.

But, in most current cases, SMT is about efficiency and increased throughput of computations per amount of hardware used.

[edit] Taxonomy

In processor design, there are two ways to increase on-chip parallelism with less resource requirement: one is superscalar technique which tries to increase Instruction Level Parallelism (ILP), the other is multithreading approach exploiting Thread Level Parallelism (TLP).

Superscalar means executing multiple instructions at the same time while chip-level multithreading (CMT) executes instructions from multiple threads within one processor chip at the same time. There are many ways to support more than one thread within a chip, namely:

Multithreaded: Interleaved issue of multiple instructions from different threads
Simultaneous multithreading (SMT): Issue multiple instructions from multiple threads in one cycle.
Chip-level multiprocessing (CMP or Multicore): integrate two or more superscalar processors into one chip, each execute one thread independently
Any combination of multithreaded/SMT/CMP

The key factor to distinguish them is to look at how many instructions the processor can issue in one cycle and how many threads from which the instructions come. For example, Sun Microsystems' UltraSPARC T1 (known as "Niagara" until its November 14, 2005 release) is a multicore processor combined with multithreaded technique instead of simultaneous multithreading because each core can only issue one instruction at a time.

[edit] Historical implementations

While multithreading CPUs have been around since the 1950s, Simultaneous Multithreading was first researched by IBM in 1968. The first major commercial CPU developed with SMT was the DEC 21464 (EV8). This chip was developed by DEC in coordination with Dean Tullsen of the University of California, San Diego. The processor was never released, since the Alpha line of processors was discontinued when Compaq acquired DEC. Dean Tullsen's work was also used to create the Intel Pentium 4 Processor. The technology developed for this processor may eventually find its way into Tukwila, a CPU being developed at Intel by many of the engineers who designed the EV8.

[edit] Modern commercial implementations

The Intel Pentium 4 was the first modern desktop processor to implement simultaneous multithreading, starting from the 3.06GHz model released in 2002, and since introduced into a number of their processors. Intel calls the functionality Hyper-Threading Technology (HTT), and provides a basic two-thread SMT engine. Intel claims up to a 30% speed improvement compared against an otherwise identical, non-SMT Pentium 4. The performance improvement seen is very application dependent, however, and some programs actually slow down slightly when HTT is turned on. This is due to the replay system of the Pentium 4 tying up valuable execution resources, thereby starving the other thread. However, any performance degradation is unique to the Pentium 4 (due to various architectural nuances), and is not characteristic of SMT in general.

The latest MIPS architecture designs include an SMT system known as "MIPS MT". MIPS MT provides for both heavyweight virtual processing elements and lighter-weight hardware microthreads. RMI, a Cupertino-based startup, is the first MIPS vendor to provide a processor SOC based on 8 cores, each of which runs 4 threads. The threads can be run in fine-grain mode where a different thread can be executed each cycle. The threads can also be assigned priorities.

The IBM POWER5, announced in May 2004, which comes as either a dual core DCM, or quad-core or 8-core MCM, with each core including a two-thread SMT engine. IBM's implementation is more sophisticated than the previous ones, because it can assign a different priority to the various threads, is more fine-grained, and the SMT engine can be turned on and off dynamically, to better execute those workloads where an SMT processor would not increase performance. This is IBM's second implementation of generally available hardware multithreading.

Although many people reported that Sun Microsystems' UltraSPARC T1 (known as "Niagara" until its 14 November 2005 release) and the upcoming processor codenamed "Rock" (to be launched ~2007) are implementations of SPARC focused almost entirely on exploiting SMT and CMP techniques, Niagara is not actually using SMT. Sun refers to these combined approaches as "CMT", and the overall concept as "Throughput Computing". The Niagara chip uses fine-grained multithreading. Unlike SMT, where instructions from multiple threads can be issued simultaneously, the processor uses a round robin policy to issue instructions from a single thread each cycle. The designers of the Montecito (processor) have also chosen not to use SMT.

[edit] See also

Chip-level multiprocessing, a complementary technique
Thread (computer science), what is executed
Symmetric multiprocessing uses multiple processors

[edit] References

LE Shar and ES Davidson, "A Multiminiprocessor System Implemented through Pipelining", Computer Feb 1974
D.M. Tullsen, S.J. Eggers, and H.M. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," In 22nd Annual International Symposium on Computer Architecture, June, 1995
D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, and R.L. Stamm, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," In 23rd Annual International Symposium on Computer Architecture, May, 1996