簡介 - 可動態擴充之數位訊號處理核心於系統單晶片內之整合架構研究(III)

Instruction set extension (ISE) is an effective means of meeting the growing efficiency demands for both circuit and speed in many applications. Currently, the growing number of commercial products is marked, such as Tensilica Xtensa [1], ARC ARCtangent [2], MIPS CorExtend [3] and Nios II [4]. Because most applications frequently execute the several instruction patterns, grouping these instruction patterns into new instructions, i.e. instructions in ISE, is an effective way to enhance the performance. For simplicity, instruction(s) in ISE is called ISE(s) hereinafter. ISEs are realized by application specific functional units (ASFU) within the execution stage of the pipeline. Notably, since this work adopts a load-store architecture, the ASFU cannot access data directly from main memory.

The ISE design flow, as illustrated in Figure 1, comprises application profiling, basic block (BB) selection and ISE generation which consists of ISE exploration as well as selection phases. After profiling, basic blocks are selected as the input of ISE exploration based on their execution time. ISE exploration explores legal instruction patterns as ISE candidates, which have to conform to predefined physical constraints, including ISA format, pipestage timing and instruction/operation types. In other words, ISE exploration determines which operation in basic block should be implemented in hardware (i.e. ASFU) or in software (i.e. executed in CPU core). In addition to exploring ISE candidate(s), the proposed algorithm also explores the hardware implementation options of all operations in ISE candidates. After generating ISE candidates, ISE selection chooses as many ISEs as possible among ISE candidates under predefined physical constraints, which are silicon area and ISA format. In this paper, we only focus on ISE exploration; therefore, the algorithms of ISE selection would not be addressed. Interesting readers could refer ISE selection to [9, 15 and 16] for details.

Profiling

BB Selection ISE

Exploration

ISE Selection Application(s)

ISE(s)

Physical constraints BBs

ISE candidates

Figure 1: ISE design flow

Several important physical constraints, e.g., pipestage timing, instruction set architecture (ISA) format, register file, convex and silicon area, should be considered during exploring ISE candidates. These physical constraints are described as follows:

Pipestage Timing

The pipestage timing constraint refers to a situation in which the execution time of ASFU should fit in the original pipestage, i.e. the execution time of ASFU should be an integral number of original pipeline cycles. Many operations in ASFU often have multiple hardware implementation options owing to different area and speed requirements. In ISE exploration, to achieve the highest speed-up ratio, most works [XXXX] usually deploy the fastest hardware implementation option for each operation in ASFU. Nevertheless, the fastest hardware implementation option may not be the best choice unless it can make the execution time of ASFU to fit in or to be as close as possible to the original pipestage. Restated, an ASFU using miser hardware implementation option is better choice due to the manufacturing cost benefit, if it has same execution time reduction with the one using the fastest hardware implementation option. From another perspective, under the same silicon area constraints, designers can employ more ISEs to achieve higher performance improvement than previous works, if the miser implementation option is adopted.

Figure 2 is an example to explain the benefit of considering pipestage timing constraint. Assume that there are two hardware implementation options for an ISE, namely implementation option-A (IO-A) and implementation option-B (IO-B). The silicon area and execution time of IO-A are 3000 µm² and 0.7 cycle, respectively; as well as IO-B, 7000 µm² and 0.35 cycle. Obviously, IO-B has faster execution time than IO-A. Nevertheless, under pipestage timing constraint, both of them have same performance improvement, but IO-A consumes less silicon area. This is why considering pipestage timing constraint can bring benefit in silicon area saving;

meanwhile, maintain the same level of performance.

Figure 2: The benefit with considering pipestage timing constraint Instruction Set Architecture (ISA) Format

The ISA format represents two constraints. The first constraint limits the number of input/output operands employed by an ISE. The second constraint is the number of ISEs and is usually used in ISE selection. That is, the number of ISEs generally cannot exceed the number of unused opcodes. However, perhaps some people would question why not use multiple instructions to overcome both constraints. The reason is that using multiple instructions to represent one ISE may increase ISE fetching latency. This leads to lengthening the execution time of ISE. The trade-off between execution time reduction and the number of opcodes as well as operands is another problem. We do not intend to address this problem in this paper.

The Register file constraint resembles the first constraint of the ISA format.

Under the register file constraint, the number of input/output operands adopted by an ISE cannot exceed the number of read/write ports of the register file.

Convex

The convex constraint is that the ISE’s output cannot connect to its input via other operations not grouped in ISE. In other words, if no path exists from a operation u ISE A to another operation v ISE A involving a operation w∉ ISE A, then ISE A is convex. Figure 3 illustrates an example of the convex and non-convex ISEs. The convex constraint is needed to ensure that a feasible scheduling exists. On the other hand, an ISE violating the convex constraint would have less or no execution time reduction.

Figure 3: The convex and non-convex ISEs Silicon Area

The silicon area constraint limits the silicon area usage for a single ISE to a predefined or reasonable size. In ISE selection, the silicon area constraint also restricts the total silicon area utilized by all ASFUs.

In ISE exploration, many investigations may overlook pipestage timing constraint such that causes unnecessary waste of silicon area. To handle pipestage timing and other constraints above, we propose a new ISE exploration algorithm. The proposed algorithm is derived from the ant colony optimization (ACO) algorithm [5, 6 and 7]. In contrast with previous studies [8], the proposed algorithm explores not only ISE candidates, but also their hardware implementation option, thus reducing the silicon area and minimizing the execution time. Results with MiBench reveal that under same number of ISE, the proposed approach achieves 69.43%, 1.26% and 33.8% (max., min. and avg.) of further reduction in silicon area, and also has maximally 1.6% performance improvement over the previous work [8]. Conversely, under the same silicon area constraint, the proposed approach reaches up to 3.85%, 0.97% and 2.17% (max., min. and avg.) more speedup than the previous one [8].

Moreover, simulation results also demonstrate that the proposed approach is extremely close to the optimal scheme, but takes much less computing time.

This study has the following contributions:

1. An ISE exploration algorithm is proposed, which explores not only ISE candidates, but also their hardware implementation options, and thus reducing the silicon area cost.

2. The proposed ISE exploration algorithm not only significantly lowers silicon area cost, but also enhances performance over the previous algorithm.

3. The proposed ISE exploration algorithm can explore a search space comprising hundreds of instructions in a few minutes, and has a near-optimal solution.

The rest of this work is structured as follows. Section 2 studies the previous related work and background of Ant Colony Optimization Algorithm. Section 3 then presents the proposed approach. Next, Section 4 presents the simulation results and discussion. Conclusions are finally drawn in Section 5.

在文檔中可動態擴充之數位訊號處理核心於系統單晶片內之整合架構研究(III) (頁 8-11)