Design and implementation of a high-performance and complexity-effective VLIW DSP for multimedia applications


DOI: 10.1007/s11265-007-0061-x


TAY-JYI LIN, SHIN-KAI CHEN, YU-TING KUO AND CHIH-WEI LIU Department of Electronics Engineering, National Chiao Tung University,

Hsinchu, Taiwan

PI-CHEN HSIAO

SoC Technology Center, Industrial Technology Research Institute, Hsinchu, Taiwan

Received: 13 September 2006; Revised: 9 March 2007; Accepted: 17 March 2007

Abstract. This paper presents the design and implementation of a novel VLIW digital signal processor (DSP) for multimedia applications. The DSP core embodies a distributed and ping-pong register file which, by restricting its access patterns, saves 76.8% of the silicon area and 46.9% of the access time of the centralized register files found in most VLIW processors. Nevertheless, its performance (estimated in cycles) remains comparable to that of state-of-the-art DSPs for multimedia applications. A hierarchical instruction encoding scheme is also adopted to reduce the program sizes to 24.1–26.0%. The DSP has been fabricated in the UMC 0.13 μm 1P8M Copper Logic Process, and it operates at 333 MHz while consuming 189 mW. The core size is 3.2 × 3.15 mm², including 160 KB of on-chip SRAM.

Keywords: VLIW, digital signal processor, register organization, instruction encoding, micro-architecture

1. Introduction

Programmable processors are attractive in embedded multimedia systems because software-based systems need less development effort. Bugs can be fixed with field patches, and products can be upgraded to support new standards with software alone. These factors reduce the time-to-market, extend the time-in-market, and thus help to maximize profits. However, today's multimedia applications, including speech, audio, image and video processing, demand extremely high computing power, and the processors must exploit the inherent parallelism extensively to meet the requirements. For instruction-level parallelism, VLIW architectures [1] have low-cost static instruction scheduling (by compilers or assembly programmers) and thus deterministic execution time compared with dynamically scheduled superscalar ones. They have already become the mainstream of high-performance DSP designs [2, 3]. However, VLIW processors have two major weaknesses that prevent the integration of more functional units with a higher instruction issue rate: the dramatically growing register file complexity and the poor code density. This paper proposes a distributed and ping-pong register organization, which saves 76.8% of the silicon area and 46.9% of the access time of the non-scalable centralized register files in conventional VLIW processors. A hierarchical instruction-encoding scheme is also proposed to reduce the program sizes to 24.1–26.0%. Figure 1 shows the VLIW DSP architecture with packed instructions and the clustered architecture (Pica), which integrates up to four clusters depending on application requirements. These clusters can operate independently by exploiting the data-level parallelism in most DSP applications, or inter-cluster communications can be carried out via the memory subsystem. We have implemented a prototype of the Pica DSP with two identical clusters in the UMC 0.13 μm 1P8M Copper Logic Process. The chip operates at 333 MHz and consumes 189 mW average power. Its core size is 3.2 × 3.15 mm², including the 160 KB on-chip memory.

The rest of this paper is organized as follows. Section 2 describes clustered architectures and our proposed distributed and ping-pong register file, both of which reduce datapath complexity. Section 3 describes the hierarchical instruction encoding that improves the code density of VLIW processors. The design and verification flow of programmable processors is summarized in Section 4, while the implementation results, from instruction set simulation to silicon proof, are given in Section 5. Section 6 concludes this work and outlines our future research.

2. Register Organization

2.1. Clustered Architectures

A register file provides temporary storage and interconnections for the parallel functional units (FU) in a microprocessor. As more and more FU are integrated into a microprocessor, the complexity of the register file grows rapidly. For a centralized register file serving N FU, where each FU can read or write any register directly, the area complexity is $N^3$, the access delay is $N^{3/2}$, and the power dissipation is $N^3$ [4]. Therefore, register files in modern wide-issue microprocessors are usually partitioned to reduce hardware complexity [5–12]. Figure 2 shows a generic clustered architecture, where each FU has direct access only to a subset of the register file. If data must pass across clusters, additional wiring and execution cycles are necessary for inter-cluster communication (ICC).

ICC mechanisms can be classified into three categories [5, 6]. The first uses "copy" instructions, which are issued through existing instruction slots [7] or dedicated slots [8], to copy variables from (or to) the register files of other clusters. The former avoids additional access ports on each of the partitioned register files, but it wastes some effective instruction cycles. The second category is extended register access, where each cluster has limited "read" accesses from [9] (or "write" accesses to [10]) other partitioned register files. The third category communicates via common storage resources, such as shared memory or specific shared registers [11].

Figure 1. Pica DSP architecture. [Block diagram: instruction memory and data memory; a program control unit with instruction fetch, instruction unpack (dispatch and pre-decode), interrupt controller and program sequencer; Cluster 0 and Cluster 1, each with a load/store unit (L/S) with address registers, ping and pong registers, and an arithmetic unit (AU) with accumulators and status/predicate registers.]

Figure 2. Clustered architecture. [Clusters 0 to N-1, each with a partitioned register file (RF) and its functional units, connected by a switch for inter-cluster communication.]

Table 1 compares the ICC mechanisms [6]. The function $f(x, y, n, w)$ denotes the hardware complexity of a centralized register file with $x$ read ports, $y$ write ports, and $n$ $w$-bit registers. Its cell-based implementation requires $n$ registers, $x$ $n$-input multiplexers ($n$:1 MUX) for the read ports, and $n$ $y$:1 MUX with some glue control for the write ports. The glue control is negligible when $w$ is much larger than $y$. Thus, the area complexity of $f$ can be approximated by the area of $[x(n-1) + n(y-1)] \cdot w$ 2:1 MUX and $n \cdot w$ flip-flops, while the time complexity can be approximated by the delay of $\log_2 n$ 2:1 MUX, or the delay of $\log_2 y$ 2:1 MUX plus the write control overheads. The register file complexity of the clustered architectures with dedicated ICC copy slots can be described as $c \cdot f(x/c + p,\ y/c + p,\ n/c,\ w) + g_{CP}(c, p, w)$, where $c$ denotes the number of clusters, each of which has $p$ extra ports. The $g$ function describes the interconnection complexity of the ICC switch network. The architectures without dedicated copy slots have the same register file complexity as those with extended read ICC: $c \cdot f(x/c + p,\ y/c,\ n/c,\ w) + g_{ER}(c, p, w)$, where no extra write port is needed. Similarly, the register file complexity of those with extended write ICC can be described as $c \cdot f(x/c,\ y/c + p,\ n/c,\ w) + g_{EW}(c, p, w)$. We have adopted the ICC mechanism through the on-chip data memory [6]. When a load/store pair in a VLIW packet has an identical memory address, the data variable from the "store" is forwarded directly to the "load". The proposed ICC mechanism with load/store pairs has the lowest complexity, $c \cdot f(x/c,\ y/c,\ n/c,\ w) + g_{LS}(c, p, w)$. Note that the existing crossbar in the memory subsystem can be used for ICC, and therefore the $g$ complexity can be absorbed.
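The comparison in Table 1 can be made concrete with a small sketch. It evaluates only the 2:1 MUX term of $f$ and leaves out the flip-flops and every $g$ interconnect term. The function names and the sample parameters (8 read ports, 4 write ports, 64 registers, 2 clusters, 1 extra port) are illustrative assumptions, not figures from the paper.

```python
def rf_mux_area(x, y, n, w):
    """2:1 MUX count of f(x, y, n, w): [x(n-1) + n(y-1)] * w."""
    return (x * (n - 1) + n * (y - 1)) * w


def clustered_mux_area(x, y, n, w, c, p, mechanism):
    """Register-file term of Table 1 for c clusters with p extra ports.

    The interconnect term g(c, p, w) is deliberately omitted.
    """
    extra_read, extra_write = {
        "copy": (p, p),              # dedicated copy slots
        "extended_read": (p, 0),
        "extended_write": (0, p),
        "load_store_pair": (0, 0),   # ICC through the data memory
    }[mechanism]
    return c * rf_mux_area(x // c + extra_read, y // c + extra_write, n // c, w)


# Hypothetical configuration, chosen only for illustration.
X, Y, N, W, C, P = 8, 4, 64, 32, 2, 1
centralized = rf_mux_area(X, Y, N, W)
by_mechanism = {
    m: clustered_mux_area(X, Y, N, W, C, P, m)
    for m in ("copy", "extended_read", "extended_write", "load_store_pair")
}
```

Under these assumed parameters the load/store-pair mechanism has the smallest register-file term, which agrees with the ordering stated in the text; the absolute counts say nothing about real silicon area.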

2.2. Distributed and Ping-Pong Register File

Besides global clustering, the register file complexity of each cluster can be further reduced. Figure 3 shows a cluster of Pica with a load/store unit (LS), an arithmetic unit (AU), and the proposed distributed and ping-pong register file with 32 registers. The 32 registers are divided into four independent banks, and each bank is equipped with the access ports for a single functional unit (FU) only (i.e. two "read" and two "write" ports in our case). The address registers (a0–a7) and the accumulators (ac0–ac7) are dedicated to LS and AU respectively, and they are not visible to the other FU. The remaining 16 registers are shared between the two FU and are divided into two banks, ping and pong, with exclusive accesses. In other words, when LS accesses the ping bank, AU can only access the pong bank, and vice versa. Each VLIW packet of Pica specifies its access mode (either ping or pong) with a corresponding bit in its encoding (the "ping-pong index" described later).

Pica supports powerful SIMD instructions based on the proposed distributed and ping-pong register file. One example is the double load (store) word instruction:

dlw $d_m$, $(a_i)+k$, $(a_j)+l$

It performs two memory accesses (i.e. $d_m \leftarrow \mathrm{Mem32}[a_i]$ and $d_{m+1} \leftarrow \mathrm{Mem32}[a_j]$) and two address updates (i.e. $a_i \leftarrow a_i + k$ and $a_j \leftarrow a_j + l$) simultaneously. The index $m$ must be an even number, with $m + 1$ implicitly specified. These double load/store instructions require six concurrent accesses of the register file (two "reads" and four "writes" for dlw, or four "reads" and two "writes" for dsw). They do not cause access conflicts on the distributed and ping-pong register file, because $a_i$ and $a_j$ are private address registers while $d_m$ and $d_{m+1}$ are ping-pong registers that deliver data between LS and AU. These registers are located in independent banks.

Table 1. Comparison of ICC mechanisms.

Mechanism          HW complexity
Centralized        $f(x, y, n, w)$
Copy               $c \cdot f(x/c + p,\ y/c + p,\ n/c,\ w) + g_{CP}(c, p, w)$
Extended read      $c \cdot f(x/c + p,\ y/c,\ n/c,\ w) + g_{ER}(c, p, w)$
Extended write     $c \cdot f(x/c,\ y/c + p,\ n/c,\ w) + g_{EW}(c, p, w)$
Load/store pair    $c \cdot f(x/c,\ y/c,\ n/c,\ w) + g_{LS}(c, p, w)$

[Figure 3: a cluster of Pica. The load/store unit (LS) owns the address registers a0–a7, the arithmetic unit (AU) owns the accumulators ac0–ac7, and the two units share the ping registers d0–d7 and the pong registers d'0–d'7.]

The DSP also supports sub-word (16-bit) SIMD multiply-accumulations with full-precision results on two 40-bit accumulators $ac_i$ and $ac_{i+1}$:

fmac.d $ac_i$, $d_m$, $d_n$

It performs $ac_i \leftarrow ac_i + d_m.H \times d_n.H$ ($d_m.H$ and $d_n.H$ denote the upper 16 bits of $d_m$ and $d_n$ respectively) and $ac_{i+1} \leftarrow ac_{i+1} + d_m.L \times d_n.L$ ($d_m.L$ and $d_n.L$ denote the lower 16 bits of $d_m$ and $d_n$) in parallel. Similarly, the index $i$ must be an even number, with $i + 1$ implicitly specified. This instruction requires six concurrent accesses of the register file (i.e. four "reads" and two "writes"). Without the proposed ping-pong register file, Pica would need a 14-port centralized register file for the 14 simultaneous register accesses (i.e. four "reads" and four "writes" for LS, and four "reads" and two "writes" for AU).
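The register-transfer semantics of dlw and fmac.d given above can be sketched as a small behavioural model. This is not the authors' simulator. The register-file layout (`regs`), the dictionary memory, signed integer multiplication of the 16-bit halves, and wrap-around at 40 bits are all assumptions made for illustration; the paper does not specify fractional scaling or saturation here.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def dlw(regs, mem, m, i, k, j, l):
    """dlw d_m, (a_i)+k, (a_j)+l: two 32-bit loads and two post-increments."""
    assert m % 2 == 0, "m must be even; d_(m+1) is implied"
    regs["d"][m] = mem[regs["a"][i]]
    regs["d"][m + 1] = mem[regs["a"][j]]
    regs["a"][i] += k
    regs["a"][j] += l


def fmac_d(regs, i, m, n):
    """fmac.d ac_i, d_m, d_n: two 16-bit MACs into ac_i and ac_(i+1)."""
    assert i % 2 == 0, "i must be even; ac_(i+1) is implied"
    dm, dn = regs["d"][m], regs["d"][n]
    high = to_signed(dm >> 16, 16) * to_signed(dn >> 16, 16)
    low = to_signed(dm, 16) * to_signed(dn, 16)
    # Assumed behaviour: accumulators wrap at 40 bits (no saturation).
    regs["ac"][i] = to_signed(regs["ac"][i] + high, 40)
    regs["ac"][i + 1] = to_signed(regs["ac"][i + 1] + low, 40)


# Hypothetical state: byte-addressed memory holding 32-bit words.
regs = {"a": [0, 8] + [0] * 6, "d": [0] * 8, "ac": [0] * 8}
mem = {0: 0x0002FFFF, 8: 0x00030004}   # halves (2, -1) and (3, 4)
dlw(regs, mem, 0, 0, 4, 1, 4)
fmac_d(regs, 0, 0, 1)
```

After the two calls, `ac0` holds the product of the upper halves and `ac1` the product of the lower halves, while both address registers have advanced by four bytes.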

Finally, each of the four four-port banks (i.e. with two "reads" and two "writes") in Fig. 3 can be further partitioned into even and odd banks, so that the distributed and ping-pong register file can be implemented using eight smaller banks, each of which has only two "read" ports and one "write" port. We define $f_{DPP}(n, w) = 8 \cdot f(2, 1, n/8, w) + g_{DPP}(w)$ as the hardware complexity of a distributed and ping-pong register file with $n$ $w$-bit registers. The $f$ function is that of the centralized register file described in Section 2.1, and $g_{DPP}$ denotes the complexity of the mapping between logical and physical ports. The port mapping contains six 6:1 MUX and two 3:1 MUX (for $r_{m+1}$ and $r_{i+1}$ in LS and AU respectively) for the read ports, and one 2:1 MUX (even private bank), one 4:1 MUX and one 2:1 MUX (odd private banks), two 3:1 MUX (even ping-pong banks) and two 6:1 MUX (odd ping-pong banks) for the write ports. In addition, the ping-pong registers can be extended into a special ICC mechanism via register permutations, such as the "ring-structure register organization" proposed in our previous work [12].
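As a rough check of the definition of $f_{DPP}$, the sketch below counts 2:1 MUX equivalents for the eight 2R/1W banks plus the port mapping, and for a 14-port centralized file (8 reads, 6 writes, as counted in the text). Two assumptions are mine: a k:1 MUX is counted as k-1 two-input MUXes, and each mapping MUX is w bits wide. Flip-flops, wiring and decoders are ignored, so the ratio is not a reproduction of the 76.8% area saving reported in the paper.

```python
def rf_mux_area(x, y, n, w):
    """2:1 MUX count of f(x, y, n, w): [x(n-1) + n(y-1)] * w."""
    return (x * (n - 1) + n * (y - 1)) * w


def mux2_equiv(k):
    """Assumption: a k:1 MUX costs k-1 two-input MUXes per bit."""
    return k - 1


def dpp_mux_area(n, w):
    """f_DPP(n, w) = 8 * f(2, 1, n/8, w) + g_DPP(w), MUX term only."""
    banks = 8 * rf_mux_area(2, 1, n // 8, w)
    read_map = 6 * mux2_equiv(6) + 2 * mux2_equiv(3)
    write_map = (mux2_equiv(2)                       # even private bank
                 + mux2_equiv(4) + mux2_equiv(2)     # odd private banks
                 + 2 * mux2_equiv(3)                 # even ping-pong banks
                 + 2 * mux2_equiv(6))                # odd ping-pong banks
    return banks + (read_map + write_map) * w


dpp = dpp_mux_area(32, 32)
centralized_14_port = rf_mux_area(8, 6, 32, 32)
mux_ratio = dpp / centralized_14_port
```

The MUX count of the partitioned organization comes out at roughly a quarter of the centralized one under these assumptions, which is at least consistent in direction with the savings the paper reports from its own implementation.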

The assembly syntax of our VLIW packet starts with the ping-pong index, followed by the instructions for each issue slot in sequence:

pingpong index; i0 (for LS); i1 (for AU);

In the following, we use FIR and DCT, two popular DSP kernels [13], to illustrate how the proposed distributed and ping-pong register file works and to show some code optimization techniques. For simplicity, we assume there are no delay slots (for example, in classical five-stage pipelined processors an ALU instruction that immediately follows a load instruction cannot use the load result). The data memory is byte-addressable.

• FIR filtering

The following code segment performs 64-tap finite-impulse response (FIR) filtering on 1,024 samples. The inputs, including both the data samples (pointed to by X) and the coefficients (pointed to by COEF), are 16-bit fractional numbers, and the outputs (pointed to by Y) are 32-bit variables.

    PP  LS                         AU
 1  0;  li a0, COEF;               li ac0, 0;
 2  0;  li a1, X;                  nop;
 3  0;  li a2, Y;                  nop;
 4  rpt 1024, 8;
 5  0;  dlw d0, (a0)+4, (a1)+4;    li ac1, 0;
 6  rpt 15, 2;
 7  1;  dlw d0, (a0)+4, (a1)+4;    fmac.d ac0, d0, d1;
 8  0;  dlw d0, (a0)+4, (a1)+4;    fmac.d ac0, d0, d1;
 9  1;  dlw d0, (a0)+4, (a1)+4;    fmac.d ac0, d0, d1;
10  0;  li a0, COEF;               fmac.d ac0, d0, d1;
11  0;  addi a1, a1, -126;         add d0, ac0, ac1;
12  1;  sw (a2)+4, d0;             li ac0, 0;

The zero-overhead loop instructions (rpt in lines 4 and 6) are carried out in the instruction dispatcher and do not consume execution cycles of the datapath. The inner loop (lines 7–8, shaded dark gray in the original listing) loads two 16-bit inputs and two 16-bit coefficients into the 32-bit d0 and the 32-bit d1 with the SIMD load operations (i.e. $d_0 \leftarrow \mathrm{Mem32}[a_0]$ and $d_1 \leftarrow \mathrm{Mem32}[a_1]$), and the address registers a0 and a1 are updated simultaneously (i.e. $a_0 \leftarrow a_0 + 4$ and $a_1 \leftarrow a_1 + 4$). Meanwhile, AU performs two 16-bit (SIMD) multiply-accumulations (i.e. $ac_0 \leftarrow ac_0 + d_0.H \times d_1.H$ and $ac_1 \leftarrow ac_1 + d_0.L \times d_1.L$), accumulating the 32-bit products in the two 40-bit accumulators. Then ac0 and ac1 are added together and rounded back to the 32-bit d0 in the ping-pong registers. Finally, the 32-bit output is stored in the memory by LS via d0. In this example, each output is produced in 35 cycles, and the loops can be unrolled to easily achieve similar performance when the load slots are taken into account.
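The arithmetic of the FIR listing can be checked with a short reference model. It is a sketch under assumptions: samples and coefficients are treated as plain signed integers (the fractional scaling and the final rounding are left out), and the pairing of even-indexed taps with `ac0` and odd-indexed taps with `ac1` is my reading of the `.H`/`.L` halves. Since both pointers post-increment, the loop computes the sum of `coef[k] * x[n + k]`.

```python
def fir_direct(x, coef, n_out):
    """Reference: y[n] = sum_k coef[k] * x[n + k]."""
    return [sum(c * x[n + k] for k, c in enumerate(coef)) for n in range(n_out)]


def fir_two_accumulators(x, coef, n_out):
    """Mimics the listing: one dlw + fmac.d handles two taps per cycle."""
    out = []
    for n in range(n_out):
        ac0 = ac1 = 0
        for k in range(0, len(coef), 2):
            ac0 += coef[k] * x[n + k]            # one 16-bit half
            ac1 += coef[k + 1] * x[n + k + 1]    # the other half
        out.append(ac0 + ac1)                    # 'add d0, ac0, ac1'
    return out


def cycles_per_output(prologue=1, inner_repeats=15, inner_len=2, epilogue=4):
    """Line 5, then lines 7-8 repeated 15 times, then lines 9-12."""
    return prologue + inner_repeats * inner_len + epilogue


x = list(range(-40, 40))                     # 80 hypothetical samples
coef = [(k % 5) - 2 for k in range(64)]      # 64 hypothetical taps
loads_per_output = 1 + 15 * 2 + 1            # dlw in lines 5, 7-8 and 9
net_pointer_step = loads_per_output * 4 - 126
```

The cycle count reproduces the 35 cycles per output quoted in the text, and the net advance of `a1` per output is two bytes, i.e. one 16-bit sample, which is what `addi a1, a1, -126` is for.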

• Discrete Cosine Transform (DCT)

Figure 4 shows the eight-point fast DCT algorithm [14], which is extensively used in image and video processing. In this case, the number of operations per sample is larger than in the previous FIR example. Thus, LS can assist by performing some simple arithmetic operations to balance the load.

The first dlh (double load half words; line 3) instruction loads and sign-extends the first and the last 16-bit input data (i.e. x0 and x7 in Fig. 4) into the 32-bit d0 and d1. AU performs a butterfly operation on these two data by switching the ping-pong index in the succeeding VLIW packet. The butterfly operations on the other three input pairs are performed accordingly. The final two butterfly operations for the even outputs (i.e. X0, X2, X4, and X6 in Fig. 4) are performed in LS instead (lines 12–17) after the coefficient multiplication, which must be performed in AU following the two butterfly operations (lines 8–11). In this example, an eight-point DCT can be carried out in 24 cycles. Note that two eight-point DCT can be performed concurrently in one cluster with SIMD instructions, and thus a 2D 8 × 8 DCT (i.e. equivalent to sixteen eight-point DCT) requires only 192 cycles.

    PP  LS                       AU
 1  0;  li a0,X;                 nop;
 2  0;  li a1,COEF;              nop;
 3  0;  dlh d0,0(a0),14(a0);     nop;
 4  1;  dlh d0,2(a0),12(a0);     bf ac0,d0,d1;
 5  0;  dlh d0,4(a0),10(a0);     bf ac4,d0,d1;
 6  1;  dlh d0,6(a0),8(a0);      bf ac6,d0,d1;
 7  0;  lh d4,0(a1);             bf ac2,d0,d1;
 8  1;  dlh d4,0(a1),2(a1);      bf d0,ac0,ac2;
 9  1;  dlh d6,4(a1),6(a1);      bf d2,ac4,ac6;
10  1;  nop;                     add d3,d1,d3;
11  1;  nop;                     xmpy16 d3,d3,d4;
12  0;  add d4,d0,d2;            add d0,ac3,ac7;
13  0;  sub d5,d0,d2;            add d1,ac5,ac7;
14  0;  dsh d4,0(a0),8(a0);      add d2,ac5,ac1;
15  0;  add d4,d1,d3;            xmpy16 d4,d4,d1;
16  0;  sub d5,d1,d3;            bf ac0,ac1,d4;
17  0;  dsh d4,4(a0),12(a0);     sub d3,d0,d2;
18  0;  nop;                     xmpy16 d3,d3,d5;
19  0;  nop;                     xmpy16 d6,d0,d6;
20  0;  nop;                     add ac2,d3,d6;
21  0;  nop;                     xmpy16 d7,d2,d7;
22  0;  nop;                     add d7,d7,d2;
23  0;  nop;                     add ac3,d3,d7;
24  0;  nop;                     bf d0,ac1,ac3;
25  1;  dsh d0,10(a0),6(a0);     bf d0,ac0,ac2;

Figure 4. Discrete cosine transform. [Signal-flow graph from the inputs x0–x7 to the outputs X0–X7, with cosine and sine coefficients such as C1/4, C3/8, S3/8, C7/16, S7/16, C3/16 and S3/16.]

3. Instruction Encoding

VLIW architectures are notorious for their poor code density, which comes from the redundancy inside (1) fixed-length RISC-like instructions, because many instructions do not actually use all of their bits; (2) fixed-length VLIW packets, whose encoding dedicates a bit-field to each issue slot so that NOPs must be filled into the unused slots; and (3) the repeated code that results from loop unrolling or software pipelining. Here, we define an instruction to be a RISC-like operation for an issue slot, and a VLIW packet (or simply a packet) to be the instructions issued in the same cycle. HAT [15] is an effective variable-length instruction format that addresses the first problem. The variable-length VLIW encoding applied in TI C64 [9] and NEC SPXK5 [16] can eliminate NOPs in a packet by attaching a slot number to each instruction for run-time dispersal. In addition, a mark is needed for each instruction to denote the boundary of each variable-length packet (i.e. one with a varying number of effective instructions). Indirect VLIW [8] uses a programmable microinstruction memory for the VLIW datapath (i.e. the programmable VIM), and the VLIW packets are stored internally and executed with short indices. The instructions in existing packets may be reused to synthesize new packets, which reduces the instruction bandwidth. Systemonic proposes an incremental encoding scheme for the prolog and the epilog of software-pipelined code [17], which helps to remove the repeated instructions. Pica uses a novel and integrated encoding scheme [18], which takes all three problems into account to improve the VLIW code density.

3.1. Variable-length Instructions

First, each instruction is variable-length coded, depending on the number of operands, the size of its immediate value, the frequency of its occurrence, and whether it is conditionally executed. Note that the same operation in different issue slots need not be coded identically. Following the HAT format, the variable-length instruction is divided into a fixed-length "head" and a variable-length "tail" to reduce the complexity of instruction alignment and dispersal. Figure 5 shows our encoding of the add/sub instructions for the arithmetic unit (AU). Note that the heads need not have the same length across different issue slots.

3.2. VLIW Packets without NOP

NOP instructions are not coded in the VLIW packets for Pica. Instead, we attach a fixed-length CAP to each VLIW packet. The CAP has a "valid" field for centralized dispersal, in which each bit indicates whether the corresponding issue slot has an effective instruction. In other words, NOPs are eliminated by turning the corresponding "valid" bits off. Moreover, the total length of the "heads" and "tails" (HT) of a VLIW packet can be calculated easily for parallel grouping, using the "valid" field for the fixed-length "heads" and the "tail length" field for the variable-length "tails". To reduce the CAP length and the alignment complexity while retaining enough flexibility, the tail lengths are restricted to multiples of 4. Figure 6 shows the 14-bit CAP format of the 4-way Pica DSP. Figure 7a gives an illustrative example with only two effective instructions. The addi instructions are first converted into machine code by looking up the encoding table in Fig. 5. The CAP is then encoded as 0 for a VLIW packet; 0101 to remove the NOPs in the first and the third instruction slots; 00010 for an eight-bit total tail length; 00 for the ping-pong indices; and 00 to disable SIMD clusters and conditional execution.
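The CAP fields described above can be packed with a few lines of code. This is a hedged sketch rather than the authors' assembler: the field order (mode, valid, tail length, ping-pong, S, C) follows the order in which the worked example lists them, the function names are hypothetical, and `ht_length` assumes 16-bit heads in every slot and a 20-bit condition field when C is set, as drawn in the decoder figures.

```python
def encode_cap(valid, tail_bits, pp=0, mode=0, simd=0, cond=0):
    """Return the 14-bit CAP as a bit string (assumed field order)."""
    if tail_bits % 4 or not 0 <= tail_bits <= 124:
        raise ValueError("tail length must be a multiple of 4 in 0..124")
    if not 0 <= valid <= 0b1111 or not 0 <= pp <= 0b11:
        raise ValueError("field out of range")
    fields = [
        (mode, 1),            # 0: VLIW/data streaming, 1: scalar/program control
        (valid, 4),           # one bit per issue slot
        (tail_bits // 4, 5),  # total tail length in units of 4 bits
        (pp, 2),              # ping-pong indices
        (simd, 1),            # S: SIMD clusters
        (cond, 1),            # C: conditional execution
    ]
    return "".join(format(value, f"0{width}b") for value, width in fields)


def ht_length(valid, tail_bits, cond=0, head_bits=16):
    """Total head + tail length of a packet, in bits (assumed head size)."""
    return bin(valid).count("1") * head_bits + tail_bits + (20 if cond else 0)


example_cap = encode_cap(valid=0b0101, tail_bits=8)
example_ht = ht_length(valid=0b0101, tail_bits=8)
longest_packet = 14 + ht_length(valid=0b1111, tail_bits=124, cond=1)
```

With the example from the text (valid 0101, eight tail bits, everything else zero) the fields concatenate to 0 0101 00010 00 0 0, and the longest packet under these assumptions is 222 bits, the same value that appears as the maximum packet length in the buffer-size estimate of Section 3.3.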

The identical clusters in the VLIW/data streaming mode of Pica can be configured for SIMD execution by asserting the S bit in the CAP. The instructions of the main cluster are then replicated to reduce the code size. Figure 7b is an example, which eliminates the encoding of the instruction for the slave cluster by turning on the S bit. Besides SIMD execution, we also support differential encoding to reduce the repeated instructions in loop-unrolled or software-pipelined programs. A VLIW packet can be reused to synthesize a new one for its succeeding cycle with a small register index offset or new ping-pong indices.

Figure 5. Instruction encoding for add/sub instructions. [Each instruction has a 16-bit head and a 0–32-bit tail. Legible field values: for the less frequently occurring 3-address forms, func: 0000 (sub), 0001 (sub.d), 0010 (sub.q), 0011 (adc), 0101 (bf.d), 0110 (add.q); for the immediate forms, func: 00 (addi), 01 (addi.d), 10 (addi.q), 11 (rsbi) and DL: 00 (4-bit), 01 (8-bit), 10 (16-bit), 11 (32-bit).]

Figure 6. CAP in Pica. [Fields: M (1-bit; 0: VLIW/data streaming mode, 1: scalar/program control mode and differential encoding), Valid (4-bit), Tail Length (5-bit), PP (2-bit), S (SIMD clusters), C (conditional execution).]

Figure 7. Example of the proposed VLIW encoding. (a) An illustrative example with only two effective instructions, e.g. 00 nop; addi d0, ac4, -8; nop; addi d0, ac4, -8;. (b) SIMD execution by asserting the S bit in the CAP.

3.3. Instruction Bundles

In order to simplify the accesses of variable-length VLIW packets in the instruction memory, the packets are first packed into fixed-length instruction bundles. The fixed-length CAP and the variable-length HT of a VLIW packet are placed separately, starting from the two ends of an instruction bundle, as shown in Fig. 8a. Moreover, all fixed-length "heads" are placed first in order, ahead of their variable-length "tails". The proposed code layout enables parallel packet fetch, alignment, dispersal, and decoding. It is even possible to look ahead at succeeding VLIW packets to further reduce control overheads. We attach a "#CAP" field to each bundle to indicate the number of packets in the bundle. In our simulations, a 512-bit bundle size is optimal: it gives practical decoding complexity and acceptable fragmentation (i.e. the unused bits in a bundle that cannot accommodate a new packet). Our first prototype of the Pica DSP contains 32 KB of on-chip instruction memory, which can hold 512 512-bit bundles. We use these parameters hereafter for simplicity.

Instead of huge multiplexers, we use incremental and logarithmic shifters to extract VLIW packets from an instruction bundle, as shown in Fig. 8b. The CAP decoder first decodes the leading CAP (14-bit) of the CAP buffer, and this CAP is then shifted out to bring in a new CAP for the next packet. The right-hand-side incremental shifters then shift out 0–4 fixed-length "heads", depending on the "valid" field of the current CAP. If the C bit in the CAP is asserted, 20 more bits are shifted out, so that each instruction slot has an independent five-bit condition code. Finally, the logarithmic tail shifter shifts out all "tails" of the VLIW packet. A new VLIW packet can thus be brought in, similarly to the CAP. In brief, a CAP and an HT are continuously shifted out from the two ends of an instruction bundle, and a new VLIW packet is aligned every cycle to the boundaries of the CAP buffer and the HT buffer respectively. The lengths of the CAP and HT buffers can be estimated as follows:

[Figure 8: (a) layout of a 512-bit bundle, with the 5-bit #CAP field and the 14-bit CAPs placed from one end and the heads (H1, H3) and tails (T1, T3) of each variable-length VLIW packet placed from the other; (b) packet extraction datapath: 32-KB instruction memory with 512 bundles, 512-bit bundle register, 14-bit CAP shifter, 280-bit CAP buffer, CAP decoder, H0–H3 shifters (0/16-bit), C shifter (0/20-bit), tail shifter (0–124-bit), 465-bit HT buffer, head decoder and tail decoder.]

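$$\text{bundle capacity} = \text{bundle size} - \lceil \log_2(\text{worst-case \#CAP}) \rceil = 512 - \lceil \log_2(19.5) \rceil = 507 \text{ (bits)}$$

$$\text{worst-case \#CAP} \approx \frac{\text{bundle capacity}}{\text{average length of scalar instr.}} = \frac{507}{26} = 19.5 \ (\approx 20)$$

$$\text{CAP buffer size} = \text{CAP size} \times \text{worst-case \#CAP} = 14 \times 20 = 280 \text{ (bits)}$$

$$\text{HT buffer size} = \text{bundle capacity} - \text{CAP size} \times \left\lceil \frac{\text{bundle capacity}}{\text{maximum packet length}} \right\rceil = 507 - 14 \times \left\lceil \frac{507}{222} \right\rceil = 465 \text{ (bits)}$$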

First, the capacity of a bundle to accommodate VLIW packets is calculated by deducting the "#CAP" bits from the bundle size. We assume that the worst-case number of CAPs occurs when a bundle holds only scalar instructions (encoded as a fixed-length CAP and a variable-length "tail", as described later), which are much shorter than VLIW packets. Thus, the CAP buffer needs to hold this worst-case number of entries, which can be approximated by the maximum number of scalar instructions of average length. The worst-case HT buffer size is the bundle capacity minus the bits that cannot be either "heads" or "tails" (i.e. the CAPs of the minimum number of VLIW packets in a bundle). Note that the two buffers have overlapping bits, because the boundary between CAPs and HTs is not deterministic. Finally, it is important to mention that the above estimation is conservative. The buffer sizes can be chosen arbitrarily, with constraints on the linker and the code generation tools to prevent illegal instruction bundles.
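The buffer-size estimate can be reproduced numerically. The sketch below takes the four parameters quoted in the text (512-bit bundle, 14-bit CAP, 26-bit average scalar instruction, 222-bit maximum packet) and resolves the mutual dependence between the bundle capacity and the worst-case CAP count with a short fixed-point iteration; that iteration and the function name are my own choices, since the paper simply states the resulting numbers.

```python
from math import ceil, log2


def buffer_sizes(bundle_bits=512, cap_bits=14,
                 avg_scalar_bits=26, max_packet_bits=222):
    """Return (bundle capacity, worst-case #CAP, CAP buffer, HT buffer) in bits."""
    capacity = bundle_bits
    for _ in range(8):  # converges after one step for the paper's numbers
        count_field = ceil(log2(capacity / avg_scalar_bits))
        new_capacity = bundle_bits - count_field
        if new_capacity == capacity:
            break
        capacity = new_capacity
    worst_caps = ceil(capacity / avg_scalar_bits)
    cap_buffer = cap_bits * worst_caps
    min_packets = ceil(capacity / max_packet_bits)
    ht_buffer = capacity - cap_bits * min_packets
    return capacity, worst_caps, cap_buffer, ht_buffer


capacity, worst_caps, cap_buffer, ht_buffer = buffer_sizes()
```

The four results are the 507-bit capacity, 20 CAPs, 280-bit CAP buffer and 465-bit HT buffer given in the equations. Changing `avg_scalar_bits` or `max_packet_bits` shows how sensitive the two buffer sizes are to the instruction mix.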

Program flow instructions in Pica carry a bundle offset, a CAP index, and an HT pointer, so they can jump to an arbitrary VLIW packet by fetching a new instruction bundle and shifting out the unused CAPs and HTs. They are variable-length encoded like the effective VLIW instructions: a variable-length program flow instruction is decomposed into a fixed-length CAP with a leading 1 (instead of a fixed-length "head" as in a VLIW instruction) and a variable-length "tail". Program flow instructions and VLIW packets are mixed together, and their CAPs, "heads" and "tails" are stuffed in order into instruction bundles. We have developed a linking tool that automatically translates the code labels in Pica assembly (or the instruction offsets of PC-relative jump or branch instructions in compiled code) into the corresponding bundle offsets, CAP indices and HT pointers in the machine code. Figure 9 shows the complete instruction decoder for the Pica DSP, where a CAP selector is used instead of the shifter in Fig. 8b. The "head" dispersal of a packet and the packet alignment with branching proceed in parallel, so the sizes of the incremental shifters are significantly reduced.

4. Design and Verification Flow

Figure 10 shows a generic design flow for programmable processors, including the definition of an instruction set, the exploration of an optimal micro-architecture that implements the instruction set architecture (ISA), RTL authoring, and silicon or FPGA implementation. Designers move forward to the next phase once the function and the performance meet the processor specification. Although iteration is inevitable in the design process, the overall design time can still be minimized by reducing the number of iterations, especially in the major loops. In the following sections, we describe each of these design tasks in turn.

[Figure 9: complete instruction decoder for the Pica DSP: 32-KB instruction memory with 512 bundles, 512-bit bundle register, CAP selector (20-to-1), CAP decoder, H0–H3 shifters (0/16-bit), C shifter (0/20-bit), HT aligner (0–456-bit), 465-bit HT buffer, head decoder and tail decoder.]

4.1. Instruction Set Design

An instruction set characterizes the functional behavior of a processor, and software must be mapped to, or encoded in, this instruction set in order to be executed by the processor. In other words, every program is compiled into a sequence of instructions in the instruction set. Attributes associated with an instruction set design include the assembly language, instruction format, addressing modes and programming model, which are exposed to the software as perceived by compilers or assembly programmers. The instruction set may influence design effort and implementation complexity. Thus, we focus not only on the micro-architecture techniques but also on the instruction set design. The primitive instructions of the Pica DSP are defined by pruning the MIPS32 instruction set. DSP-enhanced instructions are added after surveying the instruction sets of state-of-the-art DSPs, such as TI C64 [9]/C55 [19], NEC SPXK5 [16], and Intel/ADI MSA [20]. Whether an instruction is included in our instruction set depends on the tradeoff between performance (in terms of cycle counts or code sizes of applications) and implementation complexity (both the cycle time and the silicon area). An instruction set design can be modeled as an instruction set simulator (ISS) for early performance evaluation; the ISS is usually the most abstract model of a programmable processor. The ISS is also very useful for the development of large-scale application software, because it hides the hardware details that are not exposed to software programming.

4.2. Micro-Architecture Exploration

A micro-architecture is a specific implementation of an ISA design, where every implementation must be able to execute any program encoded in that instruction set. Attributes associated with an implementation include the pipeline design and the memory organization. Figure 11 shows the pipeline design flow. The instructions are first classified in order to design their corresponding datapaths. In general, a processor must do three generic tasks to perform a computation: (1) arithmetic operations, (2) data movements, and (3) instruction sequencing. The semantics of each of the three instruction types can be specified based on the sub-computations performed by the instruction type. Designers can begin with the five generic sub-computations. Instructions with distinct sequences of sub-computations can be separated into different pipelines to improve modularity [21].

Figure 10. Generic design flow for programmable processors. [Phases: ISA design (ISS), micro-architecture exploration (SystemC), RTL authoring, implementation. Checks per phase: function and instruction count; function and cycle count; function, timing and area/power; function. Feedback actions: modify the ISA, add or factorize instructions; modify latency, merge or factorize stages; improve coding style; improve constraints. Verification: hand-coded assembly; hand-coded assembly or compiled programs; equivalence checking with the cycle-accurate SystemC model; formal equivalence checking with the RTL model.]

[Figure 11: pipeline design flow from instruction set to processor pipeline: instruction classification, operation factorization, pipeline balancing (subdivide or merge), optimal forwarding.]

One natural operation factorization is based on the five generic sub-computations. That is, each of the five sub-computations is mapped to a pipeline stage, resulting in a five-stage pipeline. The pipeline stages can be further balanced to minimize the internal fragment with two methods—(1) merge multiple computations into one, and (2) subdivide a sub-computation into multiple sub-sub-computations. Figure 12a shows the initial design with two four-stage pipelines, where the critical path lies on the multiply-accumulator (MAC) for arithmetic operations. Figure 12b shows more balanced pipelines by subdividing MAC and memory accesses into three stages.
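The effect of balancing can be illustrated by taking the slowest stage as a first-order estimate of the cycle time. The stage delays below are transcribed from the labels of Figure 12; grouping them into an LS pipeline and an AU pipeline is my reading of the figure, and the estimate ignores register overhead, clock skew and the forwarding logic, so it is a sketch rather than the authors' timing analysis.

```python
# (pipeline, stage, delay in ns); grouping of stages per pipeline is assumed.
CLASSICAL = [
    ("front", "Instruction Dispatcher", 2.08),
    ("LS", "OF", 2.05), ("LS", "EX/AG", 2.00), ("LS", "MEM", 4.30), ("LS", "WB", 1.88),
    ("AU", "OF", 2.05), ("AU", "EX", 5.35), ("AU", "WB", 1.88),
]

BALANCED = [
    ("front", "Instruction Dispatcher", 2.08),
    ("LS", "OF", 2.05), ("LS", "EX/AG", 2.00), ("LS", "MC", 1.43),
    ("LS", "MEM1", 2.03), ("LS", "MEM2", 1.44), ("LS", "WB", 1.88),
    ("AU", "OF", 2.05), ("AU", "EX_B", 1.86), ("AU", "EX1", 2.14),
    ("AU", "EX2", 2.16), ("AU", "AC", 1.65), ("AU", "WB", 1.88),
]


def critical_stage(stages):
    """Return the (pipeline, stage, delay) entry with the largest delay."""
    return max(stages, key=lambda entry: entry[2])


def imbalance(stages):
    """Total idle time per cycle: sum of (cycle time - stage delay)."""
    cycle = critical_stage(stages)[2]
    return sum(cycle - delay for _, _, delay in stages)


slowest_classical = critical_stage(CLASSICAL)
slowest_balanced = critical_stage(BALANCED)
```

In this reading the multiply-accumulate stage limits the classical design, matching the statement that the critical path lies on the MAC, and subdividing it together with the memory access brings the slowest stage close to the delay of the remaining stages.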

The forwarding paths can be grouped as a separate module, as shown in Fig. 13, to improve timing closure in silicon implementation with registered output ports [22, 23]. We define the "tolerable latency (TL)" of a data dependency between consecutive in-flight instructions as the difference between the time the produced datum is ready and the time the consumer actually needs it. A dependency with a negative TL must be resolved with processor stalls or exposed explicitly as delay slots. On the other hand, dependencies with non-negative TLs are harmless, because the datum can be passed through either the register file or specific forwarding paths. Thus, the dependencies from all data producers to all data consumers in a pipeline can be classified into non-causal (TL < 0), timing-critical (TL = 0), and normal (TL > 0) ones. Data for normal dependencies are passed through the register file or the forwarding unit with little or no impact on the timing of the datapath. However, the forwarding paths for timing-critical dependencies increase the number of multiplexer (MUX) inputs of the functional units, which usually lengthens the critical path. Thus, the inclusion of a timing-critical forwarding path is a difficult tradeoff between the cycle time and the cycle counts of the application software.
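The classification by tolerable latency can be sketched in a few lines. The stage indices, the timing convention (a result becomes available in the cycle after its producing stage, and an operand is read at the start of its consuming stage) and all names below are hypothetical choices for illustration, not details taken from the Pica design.

```python
from enum import Enum


class Dependency(Enum):
    NON_CAUSAL = "non-causal"            # TL < 0: stall or delay slot required
    TIMING_CRITICAL = "timing-critical"  # TL = 0: needs a direct forwarding path
    NORMAL = "normal"                    # TL > 0: register file or relaxed forwarding


def tolerable_latency(produce_stage, consume_stage, distance):
    """TL in cycles for a consumer issued `distance` cycles after the producer.

    Stages are counted from 0 relative to each instruction's issue cycle
    (assumed convention for this sketch).
    """
    ready = produce_stage + 1          # first cycle in which the result exists
    needed = distance + consume_stage  # cycle in which the consumer reads it
    return needed - ready


def classify(tl):
    if tl < 0:
        return Dependency.NON_CAUSAL
    if tl == 0:
        return Dependency.TIMING_CRITICAL
    return Dependency.NORMAL


if __name__ == "__main__":
    # Hypothetical four-stage numbering: OF=0, EX=1, MEM=2, WB=3.
    print(classify(tolerable_latency(1, 1, 1)))  # ALU result used by the next instruction
    print(classify(tolerable_latency(2, 1, 1)))  # load result used by the next instruction
    print(classify(tolerable_latency(1, 1, 3)))  # ALU result used three instructions later
```

Under these assumptions a back-to-back ALU dependency is timing-critical, a back-to-back load-use dependency is non-causal, and a dependency three instructions apart is normal.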

The micro-architecture designs were modeled in SystemC [24] and validated against the ISS to ensure that they meet the functional requirements specified by the ISA. Figure 14 is an example dump file from the cycle-accurate SystemC simulation, which is faster than RTL simulation by an order of magnitude. After the micro-architecture explorations, the final design was described in synthesizable Verilog for the implementation phase. Moreover, the RTL design has been cross-verified with the cycle-accurate SystemC model from the bottom up to achieve 100% coverage (e.g. statement, branch, state, arc, etc.) [25].

Figure 12. Pipelines for Pica: (a) classical and (b) balanced pipeline stages. Stage delays annotated in (a): Instruction Dispatcher 2.08 ns; OF 2.05 ns, EX/AG 2.00 ns, MEM 4.30 ns, WB 1.88 ns; OF 2.05 ns, EX 5.35 ns (-0.60 ns), WB 1.88 ns. Stage delays annotated in (b): Instruction Dispatcher 2.08 ns; OF 2.05 ns, EX/AG 2.00 ns, MC 1.43 ns, MEM1 2.03 ns, MEM2 1.44 ns, WB 1.88 ns; OF 2.05 ns, EX_B 1.86 ns, EX1 2.14 ns, EX2 2.16 ns, AC 1.65 ns, WB 1.88 ns.

[Figure: forwarding unit placed between the pipeline registers and the ALU, carrying timing-critical and normal forwarding paths from producers to consumers.]

4.3. FPGA Prototyping

Figure 15 depicts a simplified Pica system for multimedia applications based on the AMBA AHB bus. We have prototyped this system on the ARM Versatile platform [26] with an ARM926EJ-S core, 128 MB SDRAM, and a rich set of peripherals such as an LCD controller and a stereo audio codec. Pica was first encapsulated in an AHB wrapper and ported to the Xilinx XC2V6000 FPGA on the Logic Tile daughter card. The system AHB, the expansion AHB and the ARM run at 70 MHz, 35 MHz and 210 MHz respectively, while Pica operates at 70 MHz with the embedded DLL-based clock generator on the FPGA, which doubles the clock rate of the expansion AHB. The ARM functions as a smart DMA controller that moves data between the Pica on-chip memory and the on-board 128 MB SDRAM. All peripherals are also controlled by the ARM. After the power-on reset, Pica stays in its standby mode until the ARM initializes the instruction memory and triggers an interrupt. We have successfully demonstrated a simplified H.264 intra coder on the prototyping system, which can compress 21 CIF-format pictures per second in real time. We have also implemented a standard JPEG encoder and a real-time MP3 player on this prototyping platform.

5. Results

5.1. Performance Evaluation

We have hand-optimized several DSP kernels in assembly to evaluate the performance of the Pica DSP with cycle-accurate instruction set simulation. Table 2 compares state-of-the-art DSP processors with the Pica DSP. The first column shows the number of cycles required for 40-sample, 16-tap real-valued FIR filtering on 16-bit samples. The second column compares the execution cycles of the eight-by-eight 2D DCT. The third column is for the 256-point radix-2 fast Fourier transform (FFT) [13]. The last column summarizes the cycle counts for block-based motion estimation under the MAE criterion [27], where the block size is 16 × 16 and the search range is within ±15 pixels. TI C64 [9] and NEC SPXK5 [16] are two high-performance VLIW DSPs. C64 can issue eight instructions per cycle and its datapath is partitioned into two identical clusters, each of which has a 16-port centralized register file. The two clusters can communicate with each other via the extended-read mechanism. SPXK5 is a four-issue processor with a centralized register file, which has only eight general-purpose registers to reduce the hardware complexity. TI C55 [19] and Intel/ADI MSA [20] are conventional DSP architectures with dual multiply-accumulators, and the former even allows memory operands (i.e. it is not a load/store architecture [28]); their results are included just for reference. All cycle counts are taken from the vendors' application notes to provide more objective performance indices.
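To make the FIR workload concrete, the following reference model counts the multiply-accumulate operations of a 40-sample, 16-tap filter and relates them to two cycle counts from Table 2. It is a plain functional sketch: the function name, the windowing convention (a block of 40 outputs computed from 55 inputs) and the derived MACs-per-cycle figure are assumptions for illustration, since the text does not state exactly how the vendors delimit the benchmark.

```python
def fir_block(x, h, n_out):
    """Direct-form FIR over a block: y[n] = sum_k h[k] * x[n + T - 1 - k].

    Returns the outputs and the number of multiply-accumulates performed.
    """
    taps = len(h)
    if len(x) < n_out + taps - 1:
        raise ValueError("not enough input samples for the requested block")
    macs = 0
    y = []
    for n in range(n_out):
        acc = 0
        for k, coeff in enumerate(h):
            acc += coeff * x[n + taps - 1 - k]
            macs += 1
        y.append(acc)
    return y, macs


if __name__ == "__main__":
    h = [1] * 16                # 16 taps (placeholder coefficients)
    x = list(range(40 + 15))    # 55 samples give 40 outputs
    y, macs = fir_block(x, h, 40)
    # Cycle counts taken from the FIR column of Table 2.
    for name, cycles in (("Pica DSP", 232), ("TI C64", 194)):
        print(f"{name}: {macs} MACs / {cycles} cycles = {macs / cycles:.2f} MACs per cycle")
```

Under this counting the kernel needs 640 MACs, so a cycle count of 232 corresponds to roughly 2.8 sustained MACs per cycle.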

~~~~~~~~ Cycle 71 ~~~~~~~~
======== BOOLEAN REGISTERS ========
B0 = 1 B1 = 0 B2 = 0 B3 = 0 B4 = 0 B5 = 0 B6 = 0 B7 = 0
======== CLUSTER 0 ========
======== PING REGISTERS ========
D'[0] = ffff6400 D'[1] = 6600 D'[2] = fffffe00 D'[3] = db3a
D'[4] = 1413a D'[5] = ffff8ac6 D'[6] = 0 D'[7] = 5a84
======== ADDRESS REGISTERS ========
A[0] = 0 A[1] = 8 A[2] = 8 A[3] = 0
A[4] = 0 A[5] = 0 A[6] = 0 A[7] = 0
======== PONG REGISTERS ========
D[0] = fffe4d00 D[1] = 4d00 D[2] = 1065e D[3] = 1fc4b
D[4] = 1535e D[5] = ffff46a2 D[6] = 494b D[7] = fffc50b5
======== ACCUMULATORS ========
AC[0] = 113e6 AC[1] = fffff278 AC[2] = e865 AC[3] = fffeb300
AC[4] = ffffcc00 AC[5] = ffff0000 AC[6] = ffff7f00 AC[7] = ffffe700
======== CLUSTER 1 ========
======== PING REGISTERS ========
D'[0] = 0 D'[1] = 0 D'[2] = 0 D'[3] = 0
D'[4] = 0 D'[5] = 0 D'[6] = 0 D'[7] = 0
======== ADDRESS REGISTERS ========
A[0] = 0 A[1] = 0 A[2] = 0 A[3] = 0
A[4] = 0 A[5] = 0 A[6] = 0 A[7] = 0
======== PONG REGISTERS ========
D[0] = 0 D[1] = 0 D[2] = 0 D[3] = 0
D[4] = 0 D[5] = 0 D[6] = 0 D[7] = 0
======== ACCUMULATORS ========
AC[0] = 0 AC[1] = 0 AC[2] = 0 AC[3] = 0
AC[4] = 0 AC[5] = 0 AC[6] = 0 AC[7] = 0

Figure 14. Cycle-accurate SystemC simulation.

[Figure 15 block diagram. Labels: IM, DM, Pica, DMA, Unified RISC/VLIW DSP Core, AMBA AHB, Main Memory (SRAM/SDRAM), Misc Peripherals, Versatile Baseboard, Logic Tile; clock labels: 70 MHz, 210 MHz, 35 MHz, 70 MHz.]

Pica has DSP-oriented instructions such as MAC and rich SIMD instructions (e.g. those described in Section 2), but they are not new and can be found in other DSP processors. Moreover, most Pica instructions are simple (i.e. RISC-like). Pica would likely have comparable performance with state-of-the-art DSPs if it were equipped with a generic register file, because the supported instructions are very similar. The simulation results in Table 2 show that the clustered architecture with the distributed and ping-pong register organization has a small impact on highly parallel DSP kernels, provided that the dataflow is appropriately arranged. In addition, the instruction slots in the hand-optimized assembly codes can be almost completely filled to maximize the hardware utilization. Note that the concurrently issued instructions may be severely limited in general applications with irregular dataflow or arbitrary dependencies (especially across the clusters), and thus the performance will degrade significantly. Finally, we have also hand-crafted a standard JPEG encoder [14, 29] on the Pica DSP, which can compress the 512 × 512 24-bit color Lena and Baboon images in 4,340,149 and 6,326,148 cycles respectively (i.e. about 53.6–76.7 frames/s at 333 MHz).
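The frame rates follow directly from the cycle counts and the clock frequency. The short calculation below assumes a clock of exactly 333 MHz and one image per frame; the function name is illustrative.

```python
CLOCK_HZ = 333e6  # assumed exact clock frequency


def frames_per_second(cycles_per_frame, clock_hz=CLOCK_HZ):
    """Frames per second when each frame costs a fixed number of cycles."""
    return clock_hz / cycles_per_frame


if __name__ == "__main__":
    for image, cycles in (("Lena", 4_340_149), ("Baboon", 6_326_148)):
        print(f"{image}: {frames_per_second(cycles):.1f} frames/s")
```

Under this assumption Lena evaluates to about 76.7 frames/s and Baboon to about 52.6 frames/s. The latter is slightly below the lower bound quoted in the text, which may reflect rounding or a slightly different clock assumption in the original calculation.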

5.2. Code Size Analysis

It is very difficult to evaluate the quality of the instruction encoding of different processor architectures, because it strongly depends on the instruction set architecture (ISA), the compiler and the code generation strategy. Therefore, we base the comparison on a single ISA (the aforementioned Pica DSP) and try our best to make the comparison fair. Table 3 compares various instruction encoding schemes. The first four rows summarize the results for the hand-optimized assembly programs described in Table 2. The fifth and the sixth rows are for a standard JPEG encoder [14] and a simplified H.264 encoder [27]. The first column lists the code sizes for 192-bit fixed-length VLIW packets, each of which contains four 48-bit instructions. The second column lists the variable-length VLIW encoding used in TI C64 and NEC SPXK5. The third column shows the results with variable-length instruction encoding and the proposed NOP removal scheme with CAP. The fourth column shows the improved code sizes with SIMD execution, while the fifth shows those with additional differential encoding.

The variable-length VLIW packets are then stuffed into bundles of different sizes to evaluate the fragmentation (i.e. the unused bits in a bundle that cannot accommodate a new packet). Note that the lengths of program flow instructions depend on the bundle size (the "bundle offset" decreases by 1 bit, but both the "CAP index" and the "HT pointer" need 1 more bit, when the bundle size is doubled). To minimize the offset of the intra-instruction fragments (i.e. the "tails" must be multiples of 4), we use the worst-case lengths of program flow instructions for all bundle sizes under comparison. Figure 16 shows the bundling overheads relative to the effective instructions.
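The fragmentation measurement can be approximated with a simple in-order packing model. The sketch below assumes that packets are placed greedily in program order, that a packet never straddles two bundles, and that the unused tail of the last bundle also counts as a fragment. These rules, the function names and the sample packet lengths are illustrative assumptions rather than the exact procedure used for Fig. 16.

```python
def pack_into_bundles(packet_bits, bundle_bits):
    """Greedy in-order packing. Returns (number_of_bundles, fragment_bits)."""
    bundles = 0
    free = 0        # free bits left in the bundle currently being filled
    fragment = 0    # unused bits accumulated over all closed bundles
    for bits in packet_bits:
        if bits > bundle_bits:
            raise ValueError("packet does not fit into a single bundle")
        if bits > free:
            fragment += free    # leftover of the bundle being closed
            bundles += 1
            free = bundle_bits
        free -= bits
    fragment += free            # tail of the last bundle
    return bundles, fragment


def bundling_overhead(packet_bits, bundle_bits):
    """Fragment bits relative to the effective instruction bits."""
    _, fragment = pack_into_bundles(packet_bits, bundle_bits)
    return fragment / sum(packet_bits)


if __name__ == "__main__":
    packets = [100, 60, 120, 48, 200, 96]  # hypothetical packet lengths in bits
    for size in (256, 512, 1024, 2048):
        n, frag = pack_into_bundles(packets, size)
        print(f"{size}-bit bundles: {n} bundles, {frag} fragment bits, "
              f"overhead {bundling_overhead(packets, size):.1%}")
```

In a real program the outcome also depends on branch targets and on the bundle-size-dependent lengths of program flow instructions mentioned above, which this model leaves out.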

Table 2. Performance comparison (in cycles).

Processor            FIR   DCT   FFT     ME       Processor description
TI C64 [9]           194   126   1,246   36,538   8-issue VLIW at 1 GHz (90 nm)
NEC SPXK5 [16]       –     240   2,944   –        4-issue VLIW at 250 MHz (0.13 μm)
TI C55 [19]          394   238   4,922   82,260   Dual MAC at 300 MHz
Intel/ADI MSA [20]   381   284   3,176   90,550   Dual MAC at 400 MHz
Pica DSP             232   123   2,510   41,372   4-issue VLIW at 333 MHz (0.13 μm)

Table 3. Code size comparison (in bits).

Program   Fixed     NOP removal   Hierarchical (original)   Hierarchical (+SIMD)   Hierarchical (+Diff)
FIR       5,016     4,947         2,686                     1,646                  1,646
DCT       11,248    10,557        6,160                     3,696                  3,696
ME        12,616    13,158        6,946                     4,486                  3,838
FFT       60,648    56,202        32,850                    20,058                 19,698
JPEG      62,472    47,481        28,936                    20,128                 19,748
H.264     229,368   167,535       99,424                    72,992                 72,128

Table 4. Comparison of register organizations.

              Centralized         Centralized                  Proposed
              f(16, 12, 64, 32)   2 × {f(8, 6, 32, 32) + LS}   2 × {f′(32, 32) + LS}
Delay         3.18 ns             1.91 ns                      1.69 ns
Gate count    196,920             104,976                      45,850
Area          1,440,000 μm²       766,314 μm²                  334,544 μm²

[Figure 16 plot: bundling overhead (axis from 0% to 20%) for FIR, DCT, ME, FFT, JPEG and H.264 with 256-bit, 512-bit, 1024-bit and 2048-bit bundles.]

5.3. Silicon Implementation

We have implemented the proposed distributed and ping-pong register file for a cluster with 32 32-bit registers, as well as equivalent centralized register files for both one and two clusters. These three designs are described in Verilog RTL and then synthesized using Synopsys Design Compiler and the UMC L130E High Speed FSG Library (from Faraday Technology). The designs are optimized for timing, and their net-lists are placed and routed for the UMC 0.13 μm 1P8M Copper Logic Process using Cadence SoC Encounter. Table 4 summarizes the implementation results from post-layout simulations, where the proposed register organization reduces the access delay and the silicon area by factors of 1.88 and 4.30 respectively for the dual-cluster configuration.
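The improvement factors and the savings quoted in the abstract and the conclusions can be rederived from the delay and area rows of Table 4. The snippet below is only a consistency check on those published numbers; the variable and function names are illustrative.

```python
# Post-layout results from Table 4 (dual-cluster centralized vs. proposed).
CENTRALIZED = {"delay_ns": 3.18, "area": 1_440_000}
PROPOSED = {"delay_ns": 1.69, "area": 334_544}


def improvement_factor(baseline, improved):
    """How many times smaller the improved value is than the baseline."""
    return baseline / improved


def saving_percent(baseline, improved):
    """Relative reduction with respect to the baseline, in percent."""
    return 100.0 * (1.0 - improved / baseline)


if __name__ == "__main__":
    for key in ("delay_ns", "area"):
        base, new = CENTRALIZED[key], PROPOSED[key]
        print(f"{key}: factor {improvement_factor(base, new):.2f}, "
              f"saving {saving_percent(base, new):.1f}%")
```

The factors of 1.88 and 4.30 correspond to savings of 46.9% in access time and 76.8% in silicon area.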

We have implemented the instruction decoders for various bundle sizes in Verilog RTL, which are synthesized using Synopsys Design Compiler and the UMC L130E High Speed FSG Library (from Faraday Technology). Table 5 shows the gate counts and the delays of the synthesis results. Combined with the aforementioned overhead analysis depicted in Fig. 16, 512-bit bundles are optimal for our implementation.

Finally, we have implemented the complete four-way Pica DSP with the proposed ICC mechanism with load/store instruction pairs, the distributed and ping-pong register file, and the hierarchical VLIW decoder. The RTL design is synthesized using Synopsys Physical Compiler and the UMC L130E High Speed FSG Library. The gate count is 223,578, where the register file and the instruction dispatcher (i.e. the hierarchical VLIW decoder) account for 20.5 and 13.1% respectively. The net-lists are placed and routed using Cadence SoC Encounter for the UMC 0.13 μm 1P8M Copper Logic Process. Figure 17 shows the layout of the Pica DSP, and the core size is 3.2 × 3.15 mm² including the 128-kB data memory and the 32-kB instruction memory. The DSP core can operate at 333 MHz while consuming 189 mW average power (while running 2D DCT), where the power breakdown is shown in Fig. 18.

Figure 17. Layout of Pica DSP.

[Figure 18 pie chart. Labels: Instruction Dispatcher, ICC & Memory Control, Cluster 0, Cluster 1, Memory; percentages: 7%, 4%, 11%, 9%, 69%.]

Figure 18. Power breakdown of Pica DSP.

Table 5. Complexity comparison.

Bundle size   256-bit   512-bit   1,024-bit   2,048-bit
Delay (ns)    1.79      2.08      2.32        2.71
Gate count    14,780    29,792    66,165      147,195

6. Conclusions

The paper describes the design and implementation of a high-performance and complexity-efficient VLIW DSP for multimedia applications. A simple inter-cluster communication mechanism with load/store pairs and a novel distributed and ping-pong register organization are proposed to reduce the register file complexity. The simulation results show that a four-issue VLIW DSP with the proposed micro-architecture design has comparable performance with state-of-the-art high-performance DSPs, while saving 76.8% of the silicon area and 46.9% of the access time relative to an equivalent centralized register file found in most DSPs. Hierarchical VLIW encoding with flexible variable-length instruction formats, NOP removal and automatic code replication is proposed to solve the VLIW code density problem. In our simulations, the code sizes of a four-issue VLIW DSP can be reduced to only 24.1–26.0%. An efficient decoding architecture is proposed, too. Finally, a complete four-issue VLIW DSP with the area-efficient register organization and the hierarchical instruction decoder is implemented in the UMC 0.13 μm Copper Logic Process. The gate count is 223,578 (core only), where the register file and the instruction dispatcher account for 20.5 and 13.3% respectively. The core size is 3.2 × 3.15 mm² including the 128 kB data memory and the 32 kB instruction memory. The DSP core can operate at 333 MHz with 189 mW power dissipation (while performing 2D DCT on a natural image).

The drawback of the proposed register organization is that it significantly complicates the operation scheduling and the code generation. At present, we can only optimize the assembly codes by hand. We have tried some compilation methods with ping-pong access constraints, but the performance of the compiled codes is still far behind that of the hand-optimized ones. Development of customized compilation techniques may be a good research direction [30].

The variable-length VLIW packets are packed into fixed-length instruction bundles to simplify the instruction memory accesses. However, this introduces fragments, especially for small bundles. There is an opportunity for future research on optimal code generation and linking methods to reduce such fragments. Besides, automatic synthesis of optimized variable-length instruction encoding for application-specific instruction set processors (ASIP) is also an interesting topic. The overheads of the "valid" bits used to remove NOPs in each VLIW can be reduced by allowing only limited patterns [31] or even by applying entropy coding according to the occurrence frequency. Compressing redundant operands, as in the reduced ISA of ARM Thumb [32], may also further improve the code density at a small cost.
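The idea of "valid" bits, and of restricting them to a limited set of patterns, can be illustrated with a generic model. The sketch below assumes four issue slots and 48-bit instructions, as in the fixed-length baseline of Table 3. The encoding itself, the function names and the pattern-table variant are hypothetical and do not reproduce the actual Pica format.

```python
SLOTS = 4        # issue slots per VLIW packet
SLOT_BITS = 48   # bits per instruction in the fixed-length baseline


def encode_packet(slots):
    """Drop NOPs (None) and return (valid_mask, kept_instructions)."""
    mask = 0
    kept = []
    for i, instruction in enumerate(slots):
        if instruction is not None:
            mask |= 1 << i
            kept.append(instruction)
    return mask, kept


def decode_packet(mask, kept):
    """Re-expand a compressed packet to SLOTS entries, with None for NOPs."""
    remaining = iter(kept)
    return [next(remaining) if (mask >> i) & 1 else None for i in range(SLOTS)]


def packet_bits(mask, header_bits=SLOTS):
    """Encoded size: header plus one instruction per set valid bit."""
    return header_bits + SLOT_BITS * bin(mask).count("1")


def header_bits_for(num_patterns):
    """Bits needed to index a table of allowed valid-bit patterns."""
    return max(1, (num_patterns - 1).bit_length())


if __name__ == "__main__":
    packet = [0xA, None, None, 0xB]
    mask, kept = encode_packet(packet)
    assert decode_packet(mask, kept) == packet
    print("full valid mask:", packet_bits(mask), "bits vs", SLOTS * SLOT_BITS, "fixed")
    print("8 allowed patterns:", packet_bits(mask, header_bits_for(8)), "bits")
```

Restricting the allowed patterns trades header bits for scheduling freedom, since a packet whose slot usage is not in the table would have to be split or padded.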

References

1. P. Lapsley, J. Bier, and E. A. Lee, "DSP Processor Fundamentals - Architectures and Features," IEEE Press, 1996.

2. Y. H. Hu, "Programmable Digital Signal Processors: Architecture, Programming, and Applications," Marcel Dekker, 2002.

3. J. A. Fisher, P. Faraboschi, and C. Young, "Embedded Computing: A VLIW Approach to Architecture, Compiler, and Tools," Morgan Kaufmann, 2005.

4. S. Rixner, W. J. Dally, B. Khailany, P. Mattson, U. J. Kapasi, and J. D. Owens, "Register Organization for Media Processing," in Proc. HPCA, 2000, pp. 375–386.

5. A. Terechko, E. L. Thenaff, M. Garg, J. Eijndhoven and H. Corporaal, "Inter-Cluster Communication Models for Clustered VLIW Processors," in Proc. HPCA, 2003, pp. 354–364.

6. T. J. Lin, P. C. Hsiao, C. W. Liu, and C. W. Jen, "Area-Efficient Register Organization for Fully-Synthesizable VLIW DSP Cores," Int. J. Electr. Eng., vol. 13, pp. 117–127, May 2006.

7. P. Faraboschi, G. Brown, J. A. Fisher, G. Desoll and F. M. O. Homewood, "Lx: A Technology Platform for Customizable VLIW Embedded Processing," in Proc. ISCA, 2000, pp. 203–213.

8. G. G. Pechanek and S. Vassiliadis, "The ManArray Embedded Processor Architecture," in Proc. Euromicro Conf., 2000, pp. 348–355.

9. TMS320C64x DSP Generation. http://www.ti.com.

10. K. Arora, H. Sharangpani, and R. Gupta, "Copied Register Files for Data Processors Having Many Execution Units," US Patent 6,629,232, Sep. 30, 2003.

11. A. Kowalczyk et al., "The First MAJC Microprocessor: A Dual CPU System-On-a-Chip," IEEE J. Solid-State Circuits, vol. 36, pp. 1609–1616, Nov. 2001.

12. T. J. Lin et al., "Performance Evaluation of Ring-Structure Register File in Multimedia Applications," in Proc. ICME, July 2003.

13. A. V. Oppenheim, R. W. Schafer, and J. R. Buck, "Discrete-Time Signal Processing, 2nd ed.," Prentice Hall, 1999.

14. Independent JPEG Group. http://www.ijg.org.

15. H. Pan and K. Asanovic, "Heads and Tails: A Variable-Length Instruction Format Supporting Parallel Fetch and Decode," in Proc. CASES, 2001.

16. T. Kumura, M. Ikekawa, M. Yoshida, and I. Kuroda, "VLIW DSP for Mobile Applications," IEEE Signal Process. Mag., pp. 10–21, July 2002.

17. G. Fettweis, M. Bolle, J. Kneip, and M. Weiss, "OnDSP: A New Architecture for Wireless LAN Applications," presented at Embedded Processor Forum, San Jose, 2002.

18. T. J. Lin et al., "A Unified Processor Architecture for RISC & VLIW DSP," in Proc. GLSVLSI, Apr. 2005.

19. TMS320C55x DSP Generation. http://www.ti.com.

20. R. K. Kolagotla et al., "High-Performance Dual-MAC DSP Architecture," IEEE Signal Process. Mag., pp. 42–53, July 2002.

21. J. P. Shen and M. H. Lipasti, "Modern Processor Design: Fundamentals of Superscalar Processors," McGraw-Hill, 2005.

22. M. Keating and P. Bricaud, "Reuse Methodology Manual for System-on-a-Chip Designs, 3rd ed.," Kluwer, 2002.

23. D. Chinnery and K. Keutzer, "Closing the Gap Between ASIC & Custom: Tools and Techniques for High-Performance ASIC Design," Kluwer, 2002.

24. J. Bhasker, "A SystemC Primer," Star Galaxy Publishing, 2002.

25. J. Bergeron, "Writing Testbenches: Functional Verification of HDL Models, 2nd ed.," Kluwer, 2003.

26. Versatile Platform Baseboard for ARM926EJ-S. http://www.arm.com/.

27. I. E. G. Richardson, "H.264 and MPEG-4 Video Compression," Wiley, 2003.

28. J. L. Hennessy and D. A. Patterson, "Computer Architecture: A Quantitative Approach, 3rd ed.," Morgan Kaufmann, 2002.

29. W. B. Pennebaker and J. L. Mitchell, "JPEG: Still Image Data Compression Standard," Van Nostrand Reinhold, 1993.

30. Y. C. Lin, Y. P. You, and J. K. Lee, "Register Allocation for VLIW DSP Processors with Irregular Register Files," in Proc. CPC, 2006.

31. Intel 64 and IA-32 Architectures Software Developer's Manual, Intel, Nov. 2006.


Tay-Jyi Lin received the B.S. degree in Electrical and Control Engineering and the Ph.D. degree in Electronics Engineering, from National Chiao Tung University, Taiwan, in 1998 and 2005, respectively. He is now with the Department of Electronics Engineering and the Institute of Electronics, National Chiao Tung University, as a Researcher Assistant Professor. His research interests include VLSI signal processing, configurable computing, and computer architecture.

Shin-Kai Chen received the B.S. degree in Electronics Engineering from National Chiao Tung University, Taiwan, in 2004, where he is working toward the Ph.D. degree. His research interests include system software designs and application mapping of programmable processors.

Yu-Ting Kuo received the B.S. and the M.S. degrees in Electronics Engineering from National Chiao Tung University, Taiwan, in 2004 and 2006, respectively. He is working towards a Ph.D. degree. His research interests include low-power signal processing, digital hearing aids, and computer architecture.

Chih-Wei Liu received the B.S. and Ph.D. degrees in Electrical Engineering from National Tsing Hua University, Taiwan, in 1991 and 1999, respectively. From 1999 to 2000, he was an Integrated Circuit Design Engineer at the Electronics Research and Service Organization (ERSO) of Industrial Technology Research Institute (ITRI), Taiwan. Then, near the end of 2000, he started to work for the SoC Technology Center (STC) of ITRI as a project leader and eventually left ITRI at the end of October, 2003. He is currently with the Department of Electronics Engineering, National Chiao Tung University, Taiwan, as an Assistant Professor. His current research interests include SoC and VLSI system design, processor architecture, digital signal processing, digital communications, and coding theory.

Pi-Chen Hsiao received the B.S. degree in Electronics Engineering from National Central University, Taiwan, in 2003, and the M.S. degree in Electronics Engineering from National Chiao Tung University, Taiwan, in 2005. He is now with the SoC Technology Center at Industrial Technology Research Institute, Hsinchu, Taiwan. His research interests include DSP architecture and datapath designs.