H ISTORY OF P ROCESSOR P ROGRESS - 具複雜運算單元之低功率多執行緒資料路徑的研究與設計

1978 1980 1982 1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004 2006

Performance (vs. VAX-11/780)

• RISC + x86: ??%/year 2002 to present Hennessy and Patterson, Computer

Architecture: A Quantitative Approach, 4th edition, 2006

Figure 1-1 Uniprocessor Performance

[1] Figure 1-1 shows the uniprocessor performance measured by SPECint.

 Complex Instruction Set Computer (CISC)

From 1978 to 1986, the CISC-style processors dominate the processor market.

The essence of CISC is to allocate as many hardware as the functions need. Therefore, the CISC processors can perform some specific functions at high speed.

But CISC processors have some well-known drawbacks.

Instructions of CISC processors have very different execution cycles. Single instruction may consume cycles from several to thousands corresponding to its functionality and complexity. The complexity of instruction variation and hardware selection lead to the inefficiency of the compiler. Besides, If pipeline technique is coordinated to boost the performance of CISC processors, it would be hard to pipeline, and the improvement of performance is limited.

In addition, all CISC-style processors suffer a serious problem. Some hardware is idle at most time. DEC-PDP 10 is a famous CISC processor, and some surveys [2]

of this processor tell us that 70 instructions account for 99% of operation and 50 instructions account for 95% of operation. Lots of instructions and hardware are rarely used, and the idle hardware implies the waste of power consumption.

 Simple scalar, Reduced Instruction Set Computer (RISC)

In 1980s, the RISC concept showed up. The essence of RISC is to handle single function within single instruction. Figure 1-2 is an example of RISC, simple scalar.

Designers only deploy some primitive hardware to maintain its functionality. All the instructions of RISC have same instruction length, so it is easy to pipeline. Moreover, the performance can be easily enhanced along with the advance of technology. Using some instruction encoding techniques can raise the performance, too. Because of the regularity and simplicity of RISC, it has been widely spread out.

Figure 1-2 Scalar

Here are some disadvantages of RISC processors. The RISC processors have a low hardware utilization problem. They operate one function in single instruction, their hardware utilization is 1. When the number of primitive functional units increases, the low utilization characteristic remains unchanged. What is more, the RISC needs to fetch operands from register or memory first. After perform the single operation, the RISC needs to store the data into register of memory. The accesses per operation of the RISC is very high. It implies the power inefficiency.

 Multi-issue (VLIW)

Multi-issue is opposite to single-issue. It means that multiple instructions issued at the same time. The most popular type called Very Long Instruction Word (VLIW) stands for the multi-issue processors since it has been proposed in 1980s. Figure 1-3 shows a 3-way VLIW and a 4-way VLIW.

Figure 1-3 VLIW

The VLIW processors exploit the instruction level parallelism (ILP). The functional units operate concurrently. It can reach high performance by taking advantage of ILP.

The greatest problem of VLIW is the register file pressure. Because the VLIW perform multiple functions in the mean time, each functional unit needs corresponding read or write ports to the register file. The port number strongly affects the area of register file. According to [3], in full custom design, for N FUs, area and delay are increases as N ³ and N ^3/2. Besides, the same frequency of register or memory accesses with RISC processor makes the power inefficiency problem remained.

Finally, the performance can’t be raised infinitely due to the ILP has its limit.

Some other processor architectures are taken into consideration.

 Multi-core

After 2000, the performance of VLIW is no longer sufficient for some specific application.

Multi-core processors use several homogeneous or heterogeneous processors to do things at the same time. Hence the higher performance can be achieved. In this thesis, we only concern about the single core processor. The multi-core issue is beyond the scope.

 Multi-threaded

[4] In unending quest for computers with higher performance, computer system architects seek to reduce or hide latency, the number of cycles an operation takes from start to finish. A long latency may extend for 10 to 100cycles, forcing the traditional processor to sit idle until the result comes in. Less time is wasted if the latency is reduced or even hidden behind the ongoing execution of another operation.

A popular means of reducing latency is the on-chip cache memory, which can shorten the round trip to data storage from tens of cycles to just one or two.

Multithreaded architectures, however, take the tack of hiding latency by supporting multiple concurrent streams of execution, or threads, which are independent of one another. The threads are interleaved on a single processor. When a long-latency operation occurs in one of the threads, another begins execution. In this way, useful work is performed while the time-consuming operation is completed.

Figure 1-4 shows the multithreaded architecture. It includes several parts including computing, selection network, hardware and software context (threads)…etc. The computing part is composed of some functional units, memory/registers, and some interconnection network. Threads are mapped onto hardware context, which each include general-purpose registers, status registers, and a program counter. One context represents a running thread, while the others represent

threads that are eligible to run or are waiting on an operation to complete. Because of hardware limits, some threads are not currently mapped. The functional units handle the operations. The memory/registers store some intermediated value to accelerate the whole works. The interconnection network and the context selection hardware maintain the accuracy of each interleaved thread.

Figure 1-4 Multithreaded architecture

Multi-threaded architectures take advantage of thread level parallelism. Three categories of multi-threaded architectures are coarse-grained multithreading, fine-grained multi-threading, and simultaneous multithreading [5].

z Coarse-grained (block) multithreading (BMT)

The simplest type of multi-threading is where one thread runs until it is blocked by an event that normally would create a long latency stall. Such a stall might be a cache-miss that has to access off-chip memory, which

might take hundreds of CPU cycles for the data to return. Instead of waiting for the stall to resolve, a threaded processor would switch execution to another thread that was ready to run. Only when the data for the previous thread had arrived, would the previous thread be placed back on the list of ready-to-run threads.

z Fine-grained (interleaved) multithreading (IMT)

A higher performance type of multithreading is where the processor switches threads every CPU cycle. The purpose is to remove all data dependency stalls from the execution pipeline. Since one thread is relatively independent from other threads, there's less chance of one instruction in one pipe stage needing an output from an older instruction in the pipeline.

z Simultaneous multithreading (SMT)

The most advanced type of multi-threading applies to superscalar processors. A normal superscalar processor issues multiple instructions from a single thread every CPU cycle. In Simultaneous Multi-threading (SMT), the superscalar processor can issue instructions from multiple threads every CPU cycle. Recognizing that any single thread has a limited amount of instruction level parallelism, this type of multithreading is trying to exploit parallelism available across multiple threads to decrease the waste associated with unused issue slots.

SMT is the most complex because of the functionality among threads must be maintained. The most regular is IMT. By the way, the IMT can totally hide instruction latency if enough threads are supported. The hardware cost of IMT, since there are more threads being executed concurrently in the pipeline, shared resources such as caches and TLBs need to larger to avoid thrashing between the different threads. In this thesis, our hide instruction latency technique will focus on IMT.

在文檔中具複雜運算單元之低功率多執行緒資料路徑的研究與設計 (頁 11-17)