Chapter 4 Decoder and Control Circuit Design
4.5 The post-simulations
A dual-thread 64 x 64-bit register file with the proposed low-power techniques is implemented in UMC 90nm CMOS technology. The simulation results are shown in Table 4.1. The operating voltage ranges from 0.5 V to 1.0 V. The register file can operate at up to 204MHz at 0.5 V and consumes 197.51µW of read power and 175.77µW of write power at 50MHz and 0.5 V. At 250MHz and 1.0 V, it consumes 3.62mW of read power and 3.04mW of write power. Fig. 4.16 shows the layout photograph of the proposed register file.
Technology      90nm UMC CMOS
Configuration   Dual thread, 4W/4R, 64 x 64 bits
Area            426 x 219 µm²
Power supply    0.5 V        1.0 V
Frequency       50MHz        250MHz
Read power      197.51µW     3.62mW
Write power     175.77µW     3.04mW
Access time     10.42ns      2.70ns
Table 4.1 Register file simulation results
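As a rough cross-check of Table 4.1, the energy drawn per clock cycle can be estimated as average power divided by clock frequency. The sketch below assumes that the reported powers are averages at the listed frequencies; the function name is illustrative and not part of the original design flow.

```python
def energy_per_cycle_pj(power_w, freq_hz):
    """Average energy per clock cycle in picojoules (E = P / f)."""
    return power_w / freq_hz * 1e12


# Values taken from Table 4.1: (supply, frequency, read power, write power)
points = [
    ("0.5 V", 50e6, 197.51e-6, 175.77e-6),
    ("1.0 V", 250e6, 3.62e-3, 3.04e-3),
]

for vdd, freq, p_read, p_write in points:
    print(f"{vdd}: read {energy_per_cycle_pj(p_read, freq):.2f} pJ/cycle, "
          f"write {energy_per_cycle_pj(p_write, freq):.2f} pJ/cycle")
```

Under these assumptions the read energy per cycle is about 3.95 pJ at 0.5 V and about 14.48 pJ at 1.0 V, a ratio of roughly 3.7. This is in the same range as the factor of 4 expected if dynamic energy scaled purely with the square of the supply voltage.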
Fig. 4.15 Comparison between this work and the conventional design: (a) area; (b) power consumption.
4.6 Conclusion
The decoder is separated into a row decoder and a block decoder.
When a block is unused, it can be disabled by turning off the switch of its block decoder to save power. The Logical Effort technique is used to determine transistor sizes with speed as the objective function.
This chapter also discusses the timing control circuit, which keeps the register file operating correctly at all process corners and over the wide Vdd range from 0.5 V to 1.0 V.
The dual-thread 64 x 64-bit register file implemented in UMC 90nm CMOS technology consumes 197.51µW of read power and 175.77µW of write power at 50MHz and 0.5 V, and 3.62mW of read power and 3.04mW of write power at 250MHz and 1.0 V.
Fig. 4.16 Layout photograph of the dual-thread 64 x 64-bit register file (dimension label in the figure: 426.25 µm).
Chapter 5
Multithreading and Multi-core systems
5.1 Different Types of Register File Organizations
Register files serve not only as storage elements but also as communication components. For multi-port register files above a threshold size, the area of the communication switch dominates the area of the register file. This section reviews ways to rearrange and decouple the storage and communication functions of register files.
5.1.1 Clustered architecture
The scheme, used in the Alpha 21264 [5.1] and 21464 [5.2] designs, consists of dividing the functional units among two clusters and providing a copy of all registers in each cluster. This approach halves the number of read ports required on each copy of the register file, but requires the same number of write ports on both register files to allow values produced in one cluster to be made available in the second cluster.
An extension of this approach is a clustered architecture that divides the registers among a number of clusters [5.3], [5.4], [5.5], [5.6], [5.7]. Clustered architectures also allow the instruction window to be divided among clusters and have the potential to scale to larger issue widths at high clock frequencies. The number of write and read ports on each individual physical register is reduced, as is the overall complexity of the physical register file, the bypass network and the wake-up logic.
For example, a 4-cluster architecture is shown in Fig. 5.1(b).
Compared with a conventional superscalar architecture (Fig. 5.1(a)), the 4-cluster architecture presents a major difference: any physical register is connected to only half of the functional unit entries and can be written by only one fourth of the functional units.
However, a clustered architecture requires inter-cluster communication whenever a value is needed from a different cluster. The primary disadvantages of a clustered architecture are the complexity of the inter-cluster control logic and the additional area required to achieve performance similar to that of a centralized architecture.
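The inter-cluster communication cost described above can be illustrated with a small behavioural model. This is only a sketch: the class name, the mapping of register r to cluster r mod N, and the counter are assumptions made for illustration and are not taken from the cited designs.

```python
class ClusteredRegisterFile:
    """Registers partitioned among clusters; remote reads are counted."""

    def __init__(self, num_regs, num_clusters):
        self.num_regs = num_regs
        self.num_clusters = num_clusters
        # One bank of storage per cluster.
        self.banks = [dict() for _ in range(num_clusters)]
        self.inter_cluster_copies = 0

    def home(self, reg):
        # Assumption: register r is stored in cluster r % num_clusters.
        return reg % self.num_clusters

    def write(self, reg, value):
        self.banks[self.home(reg)][reg] = value

    def read(self, cluster, reg):
        home = self.home(reg)
        if home != cluster:
            # The value must travel over the inter-cluster network.
            self.inter_cluster_copies += 1
        return self.banks[home].get(reg, 0)


rf = ClusteredRegisterFile(num_regs=64, num_clusters=4)
rf.write(5, 42)           # register 5 is stored in cluster 1
local = rf.read(1, 5)     # local read, no copy needed
remote = rf.read(3, 5)    # remote read, one inter-cluster copy
print(local, remote, rf.inter_cluster_copies)
```

Each bank only needs ports for its own cluster, but every remote read adds to the inter-cluster traffic that the control logic has to manage.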
Fig. 5.1 Monolithic versus clustered register file organization: (a) monolithic register file; (b) clustered register file.
5.1.2 Duplicated Register File
In an SMT microprocessor, the access time of the register file is a crucial part of instruction latency, and it increases as the size and the number of ports of the register file increase.
In [5.8], a duplicated register file architecture is proposed for embedded SMT microprocessors. The architecture distributes the read ports among the local function units, which reduces access time by reducing the number of read ports on each copy of the register file. Every copy has the same size, the same number of ports and the same contents.
Each function unit writes its results to all copies simultaneously, so no separate synchronization of the copies is needed.
As a result, no communication between different clusters is needed when a function unit uses a value generated by another function unit. This kind of duplicated register file architecture therefore has dual functions: storage and communication.
Fig. 5.2 4-thread, 2-read, 6-write, full-duplicate register file architecture.
Fig. 5.2 shows a 6-duplicate (full-duplicate) register file architecture. The access time of the duplicated register files is shorter than that of a central register file.
However, the total area and power consumption of all the copies are larger than those of a central register file.
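The broadcast-write, local-read behaviour described above can be sketched as a minimal behavioural model. The class and method names are hypothetical and the model ignores timing and port arbitration; it only shows why every copy holds the same contents.

```python
class DuplicatedRegisterFile:
    """One full copy of the register file per function unit."""

    def __init__(self, num_regs, num_copies):
        self.copies = [[0] * num_regs for _ in range(num_copies)]

    def write(self, reg, value):
        # Broadcast: every copy receives the result at the same time,
        # so all copies always hold identical contents.
        for copy in self.copies:
            copy[reg] = value

    def read(self, unit, reg):
        # Each function unit reads only its own local copy.
        return self.copies[unit][reg]


rf = DuplicatedRegisterFile(num_regs=64, num_copies=6)
rf.write(reg=3, value=7)
print([rf.read(unit, 3) for unit in range(6)])
```

The storage cost grows with the number of copies, which corresponds to the area and power penalty noted above.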
5.1.3 Multilevel Register File
Fig. 5.3 Multilevel register file (register file cache)
In [5.9], [5.10], registers are cached to reduce the average access latency. A processor needs many physical registers, but only a very small number of them are actually required from the register file at any given moment.
A multilevel register file architecture consists of several levels of physical registers with a heterogeneous organization.
Each level may have a different number of registers, a different number of ports and a different access time.
In a multi-level organization, the functional units can obtain source operands directly only from the uppermost level. A subset of the registers in the lower levels is cached in the upper levels, depending on the expectation that they will be required in the near future. Results are always written to the lowest level, which contains all the values, and optionally to upper levels if they are expected to be useful in the near future.
A bank at the upper level of a register file cache can have many ports but few registers, which may allow a single-cycle access time. Banks at the lower levels have many more registers and a somewhat lower number of ports, and may have a higher latency. A more aggressive fetching mechanism could prefetch values before they are required. As in cache memories, prefetching must be carefully implemented to prevent premature or unnecessary fetches from polluting the upper levels. In general, prefetching can be implemented by software or hardware schemes.
A critical issue for this approach is deciding which values are cached in the upper level of the hierarchy. As in cache memories, upper levels should contain the values that are most likely to be accessed in the near future. However, the locality properties of registers and memory are very different. First of all, registers have much lower temporal reuse. In fact, most physical registers are read only once, and a significant percentage are never read. Spatial locality is also rare, since physical register allocation and register references are not correlated at all.
Register caches have much worse locality than conventional data caches. Therefore, register caching can add considerable control complexity to an architecture and determining the appropriate values to cache is nontrivial.
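The read and write rules of a two-level organization can be sketched as follows. The LRU replacement policy, the fetch-on-miss behaviour and all names are assumptions chosen for illustration; the cited works use their own caching and prefetching heuristics.

```python
from collections import OrderedDict


class TwoLevelRegisterFile:
    """Small upper level caching a subset of a complete lower level."""

    def __init__(self, num_regs, upper_size):
        self.lower = [0] * num_regs      # lowest level holds every value
        self.upper = OrderedDict()       # uppermost level, few registers
        self.upper_size = upper_size
        self.misses = 0

    def _install(self, reg, value):
        self.upper[reg] = value
        self.upper.move_to_end(reg)
        if len(self.upper) > self.upper_size:
            self.upper.popitem(last=False)   # evict least recently used

    def write(self, reg, value, cache_in_upper=False):
        self.lower[reg] = value              # always written to the lowest level
        if cache_in_upper:
            self._install(reg, value)        # optionally cached for near-term use
        elif reg in self.upper:
            del self.upper[reg]              # drop the stale upper-level copy

    def read(self, reg):
        # Functional units read only from the uppermost level.
        if reg in self.upper:
            self.upper.move_to_end(reg)
        else:
            self.misses += 1                 # operand fetched from below first
            self._install(reg, self.lower[reg])
        return self.upper[reg]


rf = TwoLevelRegisterFile(num_regs=8, upper_size=2)
rf.write(1, 10, cache_in_upper=True)
rf.write(2, 20)
print(rf.read(1), rf.read(2), rf.misses)
```

In this example the read of register 1 hits in the upper level, while the read of register 2 misses and has to be fetched from the lower level, which is the extra latency that a good caching policy tries to avoid.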
5.1.4 One-Level Less-Ported Register File Architecture
Using a less-ported structure and allowing only necessary register file read accesses reduces the register file’s area, energy, and access time. The designs in [5.11], [5.12], [5.13] do not use banked reads, to avoid increasing the complexity of the select logic.
[5.12] proposes two techniques to reduce the number of register ports without impacting performance. First, a small memory structure, the delayed write-back queue, is added. Accessing the write-back queue instead of the register file reduces the access frequency of the register file. In addition, results are written to both the register file and the write-back queue concurrently to avoid consistency problems during renaming.
Second, a technique is proposed to reduce the number of read ports by pre-fetching ready operands. It employs an operand pre-fetch buffer to store the pre-fetched operands, and a status bit in each instruction queue entry, the pre-fetch flag, to specify whether the operand is in the pre-fetch buffer or in the register file.
Two options for reducing the demand for read ports are presented in [5.11]. The first option is straightforward: bypass operands are identified in an extra pipeline stage inserted between out-of-order issue and register read. The second is a novel technique called bypass hint.
However, the select logic still has to select no more instructions than the number of available read ports after considering the bypass hint bits [5.11] or the prefetch flags [5.12]. [5.13] presents a novel register file architecture, which has single ported cells and asymmetric interfaces to the memory and to the datapath.
A high number of ports has a negative impact on the energy efficiency of register files. Traditionally, this problem is addressed through various clustering techniques that partition (or bank) the RF. However, as partitions get smaller the cost of inter-cluster copies quickly grows and the resulting register files are still multi-ported. For high energy efficiency, it is preferable that the registers be single ported.
Fig. 5.4. Very Wide Register Organization
By making wide memories, related blocks of data can be loaded in parallel, thereby reducing the decoder overhead. This requires the bus between the memories and the register file to be wide as well.
Three aspects are important in the proposed organization: the interface to the memory, single ported cells and the interface to the datapath. The interface of this foreground memory organization is asymmetric: wide towards the memory and narrower towards the datapath.
A set of Very Wide Registers (VWR), with a single port each is used to replace a traditional register file. Every single VWR is made of single ported cells and it has no pre-decode circuit. A post-decode circuit consisting of a multiplexer is provided to select the appropriate word(s).
The asymmetric interface of the VWR, with a wide connection to the memory (as wide as a complete row of the scratchpad) and a narrow, one-word-wide connection to the datapath, results in the following mode of operation: a complete row of the scratchpad is copied to the VWR at once using a LOAD row operation. This scheme can save considerable power compared to a clustered VLIW register file.
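The asymmetric interface can be illustrated with a short behavioural sketch: one wide transfer fills the register, and a multiplexer then selects single words for the datapath. The class, the method names and the 8-word row width are assumptions made for illustration only.

```python
class VeryWideRegister:
    """Single-ported wide register with a post-decode word multiplexer."""

    def __init__(self, words_per_row):
        self.words = [0] * words_per_row

    def load_row(self, scratchpad, row):
        # Wide interface: a complete scratchpad row is copied at once.
        self.words = list(scratchpad[row])

    def read_word(self, index):
        # Narrow interface: the multiplexer selects one word for the datapath.
        return self.words[index]


# Hypothetical scratchpad with 4 rows of 8 words each.
scratchpad = [[row * 8 + word for word in range(8)] for row in range(4)]

vwr = VeryWideRegister(words_per_row=8)
vwr.load_row(scratchpad, row=2)
print([vwr.read_word(i) for i in range(3)])
```

After a single row load, successive datapath reads are served from the VWR without accessing the scratchpad again.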
5.2 Multithreading
Servers equipped with more powerful and power-hungry processors to meet higher computational demands are pushing the power and cooling capabilities of datacenters to their limits, resulting in increased operating costs and decreased system reliability. Therefore, achieving high performance while maintaining existing power and thermal envelopes requires that microprocessor designs focus not only on performance but also on the aggregate performance per watt.
Multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion. To permit this sharing, the processor must duplicate the independent state of each thread. For example, a separate copy of the register file, a separate PC, and a separate page table are required for each thread.
The memory itself can be shared through the virtual memory mechanisms, which already support multiprogramming. In addition, the hardware must support the ability to change to a different thread relatively quickly; in particular, a thread switch should be much more efficient than a process switch, which typically requires hundreds to thousands of processor cycles.
Fig. 5.5 How four threads use the issue slots of a superscalar processor in different approaches.
The top portion of Fig. 5.5 shows how four threads would execute independently on a superscalar with no multithreading support. In the superscalar without multithreading support, the use of issue slots is limited by a lack of instruction-level parallelism. In addition, a major stall, such as an instruction cache miss, can leave the entire processor idle. The bottom of Fig. 5.5 shows the three multithreading categories: fine-grained, coarse-grained, and simultaneous multithreading.
5.2.1 Fine-grained multithreading
Fine-grained multithreading switches between threads on each instruction, resulting in interleaved execution of multiple threads.
This interleaving is often done in a round-robin fashion, skipping any threads that are stalled at that time. To make fine-grained multithreading practical, the processor must be able to switch threads on every clock cycle.
One key advantage of fine-grained multithreading is that it can hide the throughput losses that arise from both short and long stalls, since instructions from other threads can be executed when one thread stalls. The primary disadvantage of fine-grained multithreading is that it slows down the execution of the individual threads, since a thread that is ready to execute without stalls will be delayed by instructions from other threads.
In the fine-grained case, the interleaving of threads eliminates fully empty slots. Because only one thread issues instructions in a given clock cycle, however, instruction-level parallelism limitations still lead to a significant number of idle slots within individual clock cycles.
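The round-robin selection with skipping of stalled threads can be sketched as below. The function name and the representation of stalls as a list of booleans are assumptions for illustration; a real processor makes this choice in hardware every clock cycle.

```python
def next_thread(last, stalled):
    """Return the next non-stalled thread after `last` in round-robin
    order, or None if every thread is currently stalled."""
    n = len(stalled)
    for step in range(1, n + 1):
        candidate = (last + step) % n
        if not stalled[candidate]:
            return candidate
    return None


stalled = [False, True, False, False]   # thread 1 is stalled
last = 3
schedule = []
for _ in range(6):
    last = next_thread(last, stalled)
    schedule.append(last)
print(schedule)
```

With thread 1 stalled, the issue order cycles through threads 0, 2 and 3, so no cycle is left completely empty as long as at least one thread is ready.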
5.2.2 Coarse-grained multithreading
Coarse-grained multithreading was invented as an alternative to fine-grained multithreading. Coarse-grained multithreading switches threads only on costly stalls, such as level 2 cache misses.
This change relieves the need to have thread switching be essentially free and is much less likely to slow down the execution of an individual thread, since instructions from other threads will only be issued when a thread encounters a costly stall.
Coarse-grained multithreading suffers, however, from a major drawback: it is limited in its ability to overcome throughput losses, especially from shorter stalls. This limitation arises from the pipeline start-up costs of coarse-grained multithreading. Because a CPU with coarse-grained multithreading issues instructions from a single thread, when a stall occurs, the pipeline must be emptied or frozen. The new thread that begins executing after the stall must fill the pipeline before instructions will be able to complete.
Because of this start-up overhead, coarse-grained multithreading is much more useful for reducing the penalty of high-cost stalls, where pipeline refill is negligible compared to the stall time.
In the coarse-grained multithreaded superscalar, the long stalls are partially hidden by switching to another thread that uses the resources of the processor. Although this reduces the number of completely idle clock cycles, within each clock cycle, the instruction-level parallelism limitations still lead to idle cycles.
Furthermore, in a coarse-grained multithreaded processor, since thread switching only occurs when there is a stall and the new thread has a start-up period, there are likely to be some fully idle cycles remaining.
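The effect of the start-up overhead can be illustrated with a simple cost model: switching threads hides only the part of a stall that exceeds the pipeline refill time. The function and the cycle counts below are hypothetical and are meant only to show the trend, not to model a specific processor.

```python
def hidden_fraction(stall_cycles, refill_cycles):
    """Fraction of a stall hidden by switching to another thread,
    assuming the new thread needs `refill_cycles` to fill the pipeline."""
    if stall_cycles <= 0:
        return 0.0
    return max(0, stall_cycles - refill_cycles) / stall_cycles


# Hypothetical stall lengths, with an assumed 10-cycle pipeline refill.
for name, stall in [("short stall", 5), ("level 2 cache miss", 200)]:
    fraction = hidden_fraction(stall, refill_cycles=10)
    print(f"{name}: {fraction:.0%} of the stall hidden")
```

With these assumed numbers, a 5-cycle stall gains nothing from a switch, while 95 percent of a 200-cycle stall is hidden, which matches the observation that coarse-grained multithreading pays off mainly for high-cost stalls.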
5.2.3 Simultaneous multithreading
Simultaneous multithreading (SMT) is a variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor to exploit thread-level parallelism at the same time it exploits instruction-level parallelism.
The key insight that motivates SMT is that modern multiple-issue processors often have more functional unit parallelism available than a single thread can effectively use. Furthermore, with register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to the dependences among them; the resolution of the dependences can be handled by the dynamic scheduling capability.
In the SMT case, thread-level parallelism (TLP) and instruction-level parallelism (ILP) are exploited simultaneously, with multiple threads using the issue slots in a single clock cycle.
Ideally, the issue slot usage is limited by imbalances in the resource needs and resource availability over multiple threads.
In practice, other factors—including how many active threads are considered, finite limitations on buffers, the ability to fetch enough instructions from multiple threads, and practical limitations of what instruction combinations can issue from one thread and from multiple threads—can also restrict how many slots are used. Although Fig. 5.5 greatly simplifies the real operation of these processors, it does illustrate the potential performance advantages of multithreading in general and SMT in particular.
As mentioned earlier, simultaneous multithreading uses the insight that a dynamically scheduled processor already has many of the hardware mechanisms needed to support the integrated exploitation of TLP through multithreading. In particular, dynamically scheduled superscalar processors have a large set of registers that can be used to hold the register sets of independent threads (assuming separate renaming tables are kept for each thread).
Because register renaming provides unique register identifiers, instructions from multiple threads can be mixed in the data path without confusing sources and destinations across the threads.
This observation leads to the insight that multithreading can be built on top of an out-of-order processor by adding a per-thread renaming table, keeping separate PCs, and providing the capability for instructions from multiple threads to commit. There are complications in handling instruction commit, since we would like instructions from independent threads to be able to commit independently. The independent commitment of instructions from separate threads can be supported by logically keeping a separate reorder buffer for each thread.
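The role of the per-thread renaming tables can be sketched with a minimal model in which all threads draw physical registers from one shared pool. The class, its methods and the register counts are assumptions for illustration; freeing of old mappings at commit is deliberately left out.

```python
class SMTRenamer:
    """Per-thread rename tables on top of one shared physical register pool."""

    def __init__(self, num_threads, phys_regs):
        self.free = list(range(phys_regs))                 # shared free list
        # One table per thread: architectural register -> physical register.
        self.tables = [dict() for _ in range(num_threads)]

    def rename_dest(self, thread, arch_reg):
        # Allocate a fresh physical register for the new value.
        # (An empty free list raises IndexError here; a real design would stall.)
        phys = self.free.pop(0)
        self.tables[thread][arch_reg] = phys
        return phys

    def rename_src(self, thread, arch_reg):
        # Source operands are looked up in the issuing thread's own table.
        return self.tables[thread][arch_reg]


renamer = SMTRenamer(num_threads=2, phys_regs=8)
p0 = renamer.rename_dest(thread=0, arch_reg=1)   # thread 0 writes r1
p1 = renamer.rename_dest(thread=1, arch_reg=1)   # thread 1 writes r1
print(p0, p1, renamer.rename_src(0, 1), renamer.rename_src(1, 1))
```

Both threads write the same architectural register r1, yet they receive different physical registers, so their instructions can be mixed in the datapath without confusing sources and destinations.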
There are a variety of other design challenges for an SMT processor. The first is dealing with the larger register file needed to hold multiple contexts. The second is maintaining low overhead on the clock cycle, particularly in critical steps such as instruction issue, where more candidate instructions need to be considered, and in instruction completion, where choosing what instructions to commit may be challenging. The third is ensuring that the cache conflicts generated by the simultaneous execution of multiple threads do not cause significant performance degradation.
In viewing these problems, two observations are important. First, in many cases, the potential performance overhead due to multithreading is small, and simple choices work well enough. Second, the efficiency of current superscalars is low enough that there is room for significant improvement, even at the cost of some overhead. SMT appears to be the most promising way to achieve that improvement in throughput.
5.3 Multiprocessors
Computer performance has been driven largely by decreasing the size of chips while increasing the number of transistors they contain. In accordance with Moore’s law, this has caused chip speeds to rise and prices to drop. This ongoing trend has driven much of the computing industry for years.
However, transistors can’t shrink forever. Even now, as transistor components grow thinner, chip manufacturers have struggled to cap power usage and heat generation, two critical problems. Even performance-enhancing approaches like running multiple instructions per thread have bottomed out.
For these reasons, processor performance increases have begun slowing. Chip performance increased 60 percent per year in the 1990s but slowed to 40 percent per year from 2000 to 2004, and in 2004 it increased by only 20 percent.
Manufacturers are building chips with multiple cooler-running, more energy-efficient processing cores instead of one increasingly powerful core. The multicore chips don’t necessarily run as fast as the highest performing single-core models, but they improve overall performance by handling more work in parallel.
Current transistor technology limits the ability to continue making single processor cores more powerful. For example, as a transistor gets smaller, the gate, which switches the electricity on and off, gets thinner and less able to block the flow of electrons.
Thus, small transistors tend to use electricity all the time, even when they aren’t switching. This wastes power. Also, increasing clock speeds causes transistors to switch faster and thus generate more heat and consume more power. However, this approach can’t keep pace with processors’ increasing power and heat build up.