Scalar Coprocessors for Accelerating the G723.1 and G729A Speech Coders

Vassilios A. Chouliaras and Jose Nunez, Member, IEEE

Abstract — We investigate two scalar coprocessors for accelerating the ITU-T G723.1 and G729A speech coders.

Architecture space exploration indicates up to 72% reduction in the total number of instructions executed through the introduction of custom instructions and small changes to the C reference code. The accelerators are designed to be attached to a configurable embedded RISC CPU where they make use of the host register file and Load/Store Infrastructure¹.

Index Terms —Coprocessor, Embedded systems, RISC CPU, Speech coding.

I. INTRODUCTION

Speech compression is utilized in a multitude of applications including, amongst others, VoIP networks and digital satellite systems. Typical consumer products comprise multimedia terminals, digital dictation machines, videophones and IP phones. The G723.1 recommendation [1] in particular was designed to standardize telephony and videoconferencing over public telephone lines (POTS) and is part of the ITU H.324 standard.

This work investigates the benefit, in terms of complexity reduction, of architecture (instruction) extensions for the efficient execution of the above vocoders, building on previous work by the authors [6].

The identified extensions are implemented as coprocessors, tightly coupled to a configurable, embedded RISC processor.

There is a significant body of research into application acceleration via targeted coprocessors: application domains are diverse, ranging from cryptography [12] and maze-routing [7] to high-end video processing [19]. Previous research into the efficient execution of speech coders includes [13] and [14], which describe the necessary changes in the ITU reference code when targeting very high-performance, off-the-shelf digital signal processors. [15] describes a semi-automated chip-synthesis flow targeting a horizontally microprogrammed (VLIW) embedded DSP architecture, capable of executing one multiply-accumulate operation per clock cycle. The workload in this case was the GSM half-rate speech coder.

Our research is a continuation of [6], which describes instruction set extensions implemented in a moderate-complexity datapath (coprocessor) attached to a configurable embedded processor. We have investigated a second coprocessor configuration which includes a private register file. Results indicate that the new configuration is superior to the previously reported method.

V. A. Chouliaras is with the Department of Electronic and Electrical Engineering, University of Loughborough, Loughborough, Leicestershire LE11 3TU, UK (e-mail: V.A.Chouliaras@lboro.ac.uk).

Jose Nunez is with the Department of Electronic and Electrical Engineering, University of Loughborough, Loughborough, Leicestershire LE11 3TU, UK (e-mail: J.L.Nunez-yanez@lboro.ac.uk).

II. LPAS-BASED SPEECH CODERS

The G723.1 and G729A standards belong to the category of Linear-Prediction Analysis-by-Synthesis (LPAS) [21] speech coders. They produce low bit-rate, high-quality speech by combining analysis-by-synthesis techniques, in which the encoder (analysis) includes the decoder (synthesis) to determine the initial excitation signal, with linear prediction techniques that determine the coefficients of the speech synthesis filter. The G723.1 standard specifies a dual-rate speech coder that can operate at 5.3 or 6.3 Kbps, while G729A operates at a fixed rate of 8 Kbps. Quality improves with higher bit rates, although the overall performance of G723.1 at 6.3 Kbps and G729A is similar. A clear difference between these coders is their algorithmic delay: the total one-way delay of 25 ms for G729A compares favorably with the 67.5 ms of G723.1. Technically, G723.1 at 6.3 Kbps differs from G729A and G723.1 at 5.3 Kbps in the excitation model for the synthesis filter. G723.1 at 6.3 Kbps uses multi-pulse excitation with a maximum likelihood quantizer (MP-MLQ), while G723.1 at 5.3 Kbps and G729A use code excited linear prediction (CELP) [21]. CELP coders are based on a codebook that stores possible excitation sequences for the synthesis filter. This is the most common realization of the LPAS paradigm and its dataflow is depicted in figure 1.

In the figure, the original input speech is used to perform linear prediction analysis and calculate the coefficients of a tenth-order synthesis filter. The filter order models the number of resonant frequencies or formants of the transfer function of the human vocal tract. The excitation signal to the synthesis filter is obtained from two codebooks that model the initial stages of the human sound production system. An adaptive codebook models the pitch structure of voiced sounds originating in the vibrating vocal cords, and a fixed codebook models unvoiced sounds such as nasal or plosive sounds. The residual error between the reconstructed speech produced by the synthesis filter and the original input speech is then further processed by a perceptual weighting filter. The output signal from this process is matched against the adaptive codebook elements to determine the codebook index and gain that best approximate the residual signal. The adaptive codebook contribution is removed from the residual and the same process is repeated using the fixed codebook. The indices and gains for both codebooks are assembled together with the synthesis filter coefficients in the bitstream transmitted to the decoder. This processing is done for every 10 ms frame of the voice signal.

The G729A decoder dataflow is illustrated in figure 2. The received bitstream is disassembled to obtain the filter coefficients and the codebook parameters. The excitation is constructed by adding the adaptive and fixed codebook vectors scaled by their gains.

The excitation is then filtered through the same synthesis filter as during encoding. Additional post-processing of the speech signal is performed to enhance its quality.
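The decoder steps described above (gain-scaled sum of the two codebook vectors, followed by the tenth-order synthesis filter) can be sketched in C as follows. This is a minimal floating-point illustration, not the ITU fixed-point reference code: the function names, the argument layout and the sign convention A(z) = 1 + sum of a[k] z^-k are assumptions made for the example.

```c
#include <stddef.h>

#define LPC_ORDER 10  /* tenth-order synthesis filter, as in the text */

/* Build the excitation as the gain-scaled sum of the adaptive and
 * fixed codebook vectors (hypothetical names, float for clarity). */
void build_excitation(const float *adaptive, const float *fixed,
                      float gain_pitch, float gain_code,
                      float *exc, size_t n)
{
    for (size_t i = 0; i < n; i++)
        exc[i] = gain_pitch * adaptive[i] + gain_code * fixed[i];
}

/* All-pole synthesis filter 1/A(z), assuming
 * A(z) = 1 + a[1] z^-1 + ... + a[10] z^-10 (a[0] is unused).
 * mem[] holds the previous ten outputs, most recent first, and is
 * updated in place so that consecutive subframes can be chained. */
void synthesis_filter(const float a[LPC_ORDER + 1], const float *exc,
                      float *out, size_t n, float mem[LPC_ORDER])
{
    for (size_t i = 0; i < n; i++) {
        float acc = exc[i];
        for (int k = 1; k <= LPC_ORDER; k++)
            acc -= a[k] * mem[k - 1];
        /* shift the filter memory and store the new output sample */
        for (int k = LPC_ORDER - 1; k > 0; k--)
            mem[k] = mem[k - 1];
        mem[0] = acc;
        out[i] = acc;
    }
}
```

With a[1] = -0.5 and all other coefficients zero, a unit impulse at the input produces the decaying sequence 1, 0.5, 0.25, which is a quick sanity check of the recursion.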

Figure 1: G729A CELP Coder

Figure 2: G729A CELP Decoder

III. PROBLEM FORMULATION

This research identifies architecture and microarchitecture requirements for the efficient implementation of the G729A and G723.1 speech coders on high-performance, low-cost, configurable microprocessors.

The workloads were initially executed and profiled in native mode (Linux X86): Table 1 shows the relative amount of time spent outside the DSP emulation instructions.

In order to investigate the potential acceleration of the algorithms when executing on an embedded microprocessor, the workload was recompiled for the Simplescalar instruction set architecture (ISA) [15]. Table 2 illustrates the simulated processor profiling results.

As expected, the workloads spend a significant amount of time/instructions executing the DSP emulation functions. It is clear that efficient implementation of the DSP emulation instructions on a configurable, extensible microprocessor can lead to a very high-performance, targeted architecture for the particular workloads. The small form-factor and reduced power consumption of the proposed solution make it a very attractive candidate for replication and integration in an SoC ASIC.

Table 1: Relative amount of time spent outside the DSP emulation instructions

Algorithm       Relative time (%, native)
G723 Coder      31.3
G723 Decoder    22.8
G729 Coder      30.4
G729 Decoder    26.9

Table 2: Relative number of total instructions executed outside the DSP emulation instructions

Algorithm       Relative instructions (%, simulated)
G723 Coder      34.5
G723 Decoder    33.3
G729 Coder      34.2
G729 Decoder    37.2

This is the approach taken in this work: the Instruction Set Architecture was chosen to be precisely the DSP emulation instructions as they appear in the reference source. It is summarized in table 3:

Table 3: Coprocessor ISA

Move ops    Description
Mvrc        Move RISC CPU register to coprocessor register
Mvcr        Move coprocessor register to RISC CPU register
Mvrv        Move RISC CPU register LSB to coprocessor overflow
Mvcvr       Move coprocessor overflow to RISC CPU register LSB

Data ops    Description
Sature      32-16 bit ITU saturate
Add         16-bit add and saturate
Sub         16-bit sub and saturate
Abs_s       16-bit absolute value
L_abs       32-bit absolute value
Shl         16-bit shift-left with negative shift support and saturation
Shr         16-bit shift-right with negative shift support and saturation
Negate      16-bit negation
Norm_s      16-bit normalization calculation
Norm_l      32-bit normalization calculation
L_add       32-bit add with overflow saturation
L_sub       32-bit sub with overflow and saturation
Mult        16x16->16 signed multiplication with overflow and saturation
L_mult      16x16->32 signed multiplication with overflow and saturation
L_mac       16x16->32 multiplication and 32-bit summation with overflow and saturation
L_msu       16x16->32 multiplication and 32-bit subtraction with overflow and saturation

Miscellaneous ops    Description
Clv                  Clear sticky overflow bit
Setv                 Set sticky overflow bit
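To make the semantics of the data operations in table 3 concrete, the sketch below models a few of them in portable C, following the usual behaviour of the ITU-style DSP emulation functions (saturating arithmetic plus a sticky overflow flag). It is an illustrative model only: the lower-case function names, the global flag name and the use of 64-bit intermediates are assumptions of this example, not the reference source or the coprocessor RTL.

```c
#include <stdint.h>

int Overflow = 0;   /* sticky overflow flag, set on saturation, never cleared here */

/* Sature: saturate a 32-bit value to the 16-bit range */
int16_t sature(int32_t x)
{
    if (x > INT16_MAX) { Overflow = 1; return INT16_MAX; }
    if (x < INT16_MIN) { Overflow = 1; return INT16_MIN; }
    return (int16_t)x;
}

/* Add / Sub: 16-bit add and subtract with saturation */
int16_t add(int16_t a, int16_t b) { return sature((int32_t)a + b); }
int16_t sub(int16_t a, int16_t b) { return sature((int32_t)a - b); }

/* L_add: 32-bit add with saturation (64-bit intermediate avoids UB) */
int32_t L_add(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) { Overflow = 1; return INT32_MAX; }
    if (s < INT32_MIN) { Overflow = 1; return INT32_MIN; }
    return (int32_t)s;
}

/* L_mult: 16x16->32 fractional multiply (product doubled).
 * Only -32768 * -32768 can overflow. */
int32_t L_mult(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;
    if (p == 0x40000000) { Overflow = 1; return INT32_MAX; }
    return p * 2;
}

/* L_mac: multiply-accumulate built from L_mult and L_add */
int32_t L_mac(int32_t acc, int16_t a, int16_t b)
{
    return L_add(acc, L_mult(a, b));
}
```

In this model a call such as add(32767, 1) returns 32767 and leaves the flag set, which is the kind of behaviour the Clv and Setv instructions exist to manage.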


IV. MICROARCHITECTURE

We have investigated two microarchitectures: one that uses the main CPU register file and another that utilizes its own.

Both microarchitectures make use of the RISC memory subsystem (L1 Data cache) and are designed to be attached to a Sparc-V8 compliant SoC subsystem distributed under LGPL [10]. We chose to connect the coprocessors directly to the integer unit pipeline instead of designing them as AHB-compliant masters [11] for performance reasons: stand-alone AHB coprocessors are very effective when working on medium to large blocks of streaming data. Although the workloads perform a lot of work on blocks of data (samples), there were many more instances where we had to insert custom assembly code into irregular (non-iterative) blocks. As a result, we opted for a very tightly coupled configuration which efficiently accommodates both cases. High-level views of both microarchitectures are depicted in figures 4 and 6 respectively.

This section discusses a number of design parameters:

A. Coprocessor Interface

The open-source embedded RISC processor lacked detailed microarchitecture documentation. Initial experimentation with the already existing coprocessor interface was inconclusive as to its ability to operate in a pipelined fashion. That would have had a detrimental effect on the performance of the coprocessors and it was therefore decided to implement a new, pipelined coprocessor interface. The newly developed coprocessor port can handle two coprocessors and is able to deliver an instruction on every cycle. External coprocessors provide flow control to the main processor through a dedicated stall signal.

The diagram of figure 3 shows a coprocessor data operation on cycle 1 followed by a host-to-coprocessor register transfer on cycle 2. In cycle 3, a coprocessor register is requested by the RISC processor but, due to internal stall conditions, data are made available one cycle later than the expected time (cycle 5 instead of cycle 4). During that time, the main processor is held with the holdn signal. Finally, a second read operation, this time directed to Coprocessor 1, is initiated in cycle 6. Results are made available to the main pipeline in cycle 7.

B. Microarchitecture 1: Using the main RISC CPU Register File

This is the simplest microarchitecture since it makes use of the main RISC processor register file. This type of approach has been adopted by configurable microprocessor vendors [18], [22] and it is effectively a side-datapath with associated control, attached to the main CPU as depicted in Figure 4:

[Figure 3 shows a timing diagram over cycles 1 to 7 for the instruction sequence data_op, mvrc, mvcr, data_op, mvcr, with signals clk, pcop_in.cop_no, pcop_in.holdn, pcop_in.valid, pcop_in.opc[19:0], pcop_in.din[31:0], pcop_out[0].dout[31:0], pcop_out[0].holdn, pcop_out[1].dout[31:0] and pcop_out[1].holdn.]

Figure 3: Pipelined coprocessor I/F

[Figure 4 shows the RISC CPU pipeline (IFETCH, DECODE, EXEC, DMEM/EXEC2, WB) with its instruction cache, data cache and register file RF(1,2), together with the coprocessor blocks: Coproc Decode, the control pipeline, the third register-file port RF (RF3) and a datapath containing the shift unit, misc unit, 16x16 signed multiplier and 32-bit signed adder with saturation.]

Figure 4: Microarchitecture without register file

In this case, the coprocessor consists of the Datapath and the Control Pipeline.

Starting at the IFETCH stage, the main RISC processor fetches one instruction word from a multi-way set-associative instruction cache and clocks it into the instruction register.

RISC and coprocessor decoding take place concurrently at the DECODE stage with the main RISC register file accessed at the falling edge of the clock. Due to the significant number of Multiply-add operations in the workload, a third read port was added to the main CPU register file to accommodate single-cycle addition (RF3). This port is depicted as an embedded SRAM block, instantiated in the coprocessor hierarchy, clocked at the falling edge of the DECODE stage. Finally, all result bypassing takes place in this stage.

The EXEC stage is the main processing stage for both the RISC processor and the coprocessor. During this stage all non-arithmetic operations are computed in the coprocessor. In addition, the 16-bit signed multiplication is performed. All transfers between the main RISC pipeline and the internal coprocessor state take place in this stage.

Coprocessor results are pipelined in the EXEC2 stage where the add part of the Multiply-add operation is performed along with saturation. During this stage, the L1 data cache is accessed and one 32-bit word is returned to the main RISC pipeline from the load path as depicted in the diagram. It is this stage that qualifies state updates in the coprocessor side since all possible exception conditions have been resolved.

Finally, results are clocked into a staging register prior to committing to the RISC register file, on the falling edge of the clock.

C. Microarchitecture 2: Using private Register File

This microarchitecture differs considerably from the previous one in that it utilizes a separate, 16x32-bit register file in addition to a more elaborate control mechanism. The coprocessor state is fully accessible from the RISC CPU and is shown in figure 5:

[Figure 5 shows registers 0 to 15 and the overflow bit V.]

Figure 5: Coprocessor Programmers Model

It consists of sixteen 32-bit registers and a sticky overflow bit.

Bi-directional transfer instructions, between the host RISC processor and the coprocessor, were added to accommodate the lack of Move-to-coprocessor/Move-from-coprocessor instructions in the Sparc V8 architecture [17].
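The programmer's model and the transfer instructions of table 3 can be pictured with the following C sketch. It is a behavioural illustration only: the struct layout, the function signatures and the choice to return the overflow bit as a 0/1 value are assumptions of this example, and no instruction encoding is implied.

```c
#include <stdint.h>

#define COP_NREGS 16

/* Coprocessor architectural state: sixteen 32-bit registers
 * and one sticky overflow bit (V). */
typedef struct {
    int32_t r[COP_NREGS];
    uint8_t v;
} cop_state;

/* Mvrc: RISC CPU register value -> coprocessor register cd */
void mvrc(cop_state *c, unsigned cd, int32_t rs) { c->r[cd % COP_NREGS] = rs; }

/* Mvcr: coprocessor register cs -> RISC CPU register value */
int32_t mvcr(const cop_state *c, unsigned cs) { return c->r[cs % COP_NREGS]; }

/* Mvrv: LSB of a RISC CPU register -> coprocessor overflow bit */
void mvrv(cop_state *c, uint32_t rs) { c->v = (uint8_t)(rs & 1u); }

/* Mvcvr: coprocessor overflow bit -> LSB of a RISC CPU register
 * (upper bits returned as zero, an assumption of this sketch) */
uint32_t mvcvr(const cop_state *c) { return c->v; }

/* Clv / Setv: clear or set the sticky overflow bit */
void clv(cop_state *c)  { c->v = 0; }
void setv(cop_state *c) { c->v = 1; }
```

Saving and restoring the full coprocessor context from the host would then amount to sixteen Mvcr/Mvrc transfers plus one Mvcvr/Mvrv pair.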

The high-level schematic of the coprocessor with its own register file is depicted in figure 6. In this case, the coprocessor pipeline is segmented in three major sections: Front-end, Control pipeline and Datapath.

Starting from the top, the main CPU reads an instruction from the multi-way set-associative instruction cache and clocks it into the instruction register. The latched command is then decoded, both at the RISC processor and the coprocessor front-end, and register-file read-addresses are extracted. In parallel, the coprocessor decoding logic computes a number of control fields that are sent to the control pipeline.

During the EXEC/READ stage, the register file is accessed followed by operand bypassing. The resolved operands opr1, opr2 and opr3 are clocked into the operand registers where they are utilized during the first execution stage (EXEC1).

In DMEM/EXEC1, all shifting, normalization and miscellaneous operations are performed. In addition, the signed-multiplier is accessed if the command specifies that.

Results are passed to EXEC2 for the second stage of execution where all arithmetic and saturation takes place.

The configuration of figure 6 permits the pipelined execution of all the commands with a latency of 1 cycle. The only exceptions are the multiply-add and multiply-subtract with saturation, which span both execution stages and have a latency of 2 cycles.

[Figure 6 shows the RISC CPU pipeline (IFETCH, DECODE, EXEC, DMEM, WB) with its instruction cache, data cache and register file, together with the coprocessor: a front end (Coproc Decode, CPU command I/F), a control pipeline (DECODE, READ CTRL, EXEC1 CTRL, EXEC2/WB CTRL) and a datapath (READ, EXEC1, EXEC2) containing the private register file with bypass paths (RF BYPASS1, RF BYPASS2), shift unit, misc unit, 16x16 signed multiplier and 32-bit signed adder with saturation.]

Figure 6: high-level microarchitecture

The following sections discuss in more detail the microarchitecture blocks common to both coprocessors. These include the EXEC1 and EXEC2 stages and lower hierarchical blocks.

1) EXEC1 Stage

EXEC1 includes datapath logic to perform 16x16 bit signed multiplication, all ITU shift operations and a miscellaneous block responsible for handling all opcodes not falling in the previous category. These are depicted in figure 7.

a) Multiplier

This is the signed, 16-bit multiplier. Due to the highly configurable nature of the RISC processor and the portability requirements of this work, HDL constants are used to select whether the multiplier is inferred in the RTL code or instantiated. In the latter case, a Booth-Encoded, Wallace-tree multiplier [20] is utilized due to its higher pipelined performance when compared to the implementations chosen by the synthesis tools.

[Figure 7 shows the EXEC1 stage blocks shift_unit, misc_unit and the signed 16-bit multiplier, fed from the 15:0 and 31:16 fields of opr1 and opr2 under control of cmde/cmdo, with mux_proc selecting the stage result (s3_res) and overflow update (s3_setv) according to cmd_s3.]

Figure 7: EXEC1 Stage

Table 4: Multiplier performance vs. architecture (MHz)

Multiplier Unpipelined 2-stage
Synthesis/CS 204 330
Synthesis/NBW 376
Synthesis/WALL 385 502
WALL/No BOOTH 345 476
WALL/BOOTH 370 574

Table 4 depicts the unpipelined and two-stage pipelined maximum operating frequency of the 16x16 signed multiplier in a high-performance 0.13 µm process. Our timing budget allows for the use of a non-pipelined multiplier, thus simplifying the coprocessor pipeline design.

b) Shift Unit

The shift unit implements the 16 and 32-bit ITU shift operations. A particular characteristic of these operations is the ability to specify negative shift amounts resulting in a positive shift in the opposite direction. The high-level schematic of the shift unit is depicted in figure 8.
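The negative-shift behaviour mentioned above can be illustrated with a C model of the 16-bit shifts. This is a sketch in the spirit of the ITU-style shl/shr emulation functions rather than the hardware implementation of figure 8; the flag name, the clamping of large negative shift amounts and the exact function signatures are assumptions of the example.

```c
#include <stdint.h>

int Overflow = 0;   /* sticky overflow flag, set when a left shift saturates */

int16_t shr(int16_t x, int16_t n);

/* 16-bit shift left with saturation; a negative amount shifts right */
int16_t shl(int16_t x, int16_t n)
{
    if (n < 0)
        return shr(x, (int16_t)(n < -16 ? 16 : -n));
    if (x == 0)
        return 0;
    if (n > 15) {                       /* every non-zero value overflows */
        Overflow = 1;
        return (x > 0) ? INT16_MAX : INT16_MIN;
    }
    int32_t r = (int32_t)x * ((int32_t)1 << n);   /* fits in 32 bits */
    if (r > INT16_MAX) { Overflow = 1; return INT16_MAX; }
    if (r < INT16_MIN) { Overflow = 1; return INT16_MIN; }
    return (int16_t)r;
}

/* 16-bit arithmetic shift right; a negative amount shifts left */
int16_t shr(int16_t x, int16_t n)
{
    if (n < 0)
        return shl(x, (int16_t)(n < -16 ? 16 : -n));
    if (n >= 15)
        return (int16_t)((x < 0) ? -1 : 0);
    if (x < 0)                          /* avoid shifting a negative value */
        return (int16_t)~((~x) >> n);
    return (int16_t)(x >> n);
}
```

For example, shl(-8, -1) and shr(-8, 1) both yield -4, while shl(0x4000, 1) saturates to 32767 and sets the flag.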

2) EXEC2 Stage

This stage performs the Add-part of the MAC instruction as well as all arithmetic and saturation. Results commit to the private register file at the end of this cycle or return to the host pipeline during stage DMEM. The common EXEC2 high-level schematic is shown in figure 9.

[Figure 8 shows the ITU shifter schematic: sign extension (sext), 32-bit left and right shifters (sl32, sr32), shift-amount range checks (>15, >31), MIN16/MAX16 saturation constants and result multiplexers (sel_mx1, sel_mx2, sel_shift) producing shift_rese and shift_reso.]

Figure 8: ITU Shifter Schematic

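[Figure 9 shows the EXEC2 stage blocks: 16-bit operand inputs, adder, sign extension (SEXT), saturation (SATURE), the register file (RF) and the return path to the host CPU operands.]

Figure 9: EXEC2 Stage high-level schematic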

V. RESULTS

Results were obtained for both coprocessors at the architectural level with the baseline architecture being the Simplescalar ISA. The workloads were compiled and all ITU test vectors were validated on the standard architecture simulator (sim-profile). Tables 5 and 6 depict the number of simulated processor instructions required for each workload, for the G723.1 and G729A algorithms respectively.

Table 5: G723.1 unmodified instruction count

Test vector               Instructions
Dtx53mix (mix rate)       1,063,099,834
Dtx53mix (5.3 Kbits/s)    926,595,183
Dtx63 (6.3 Kbits/s)       10,159,707,298

Table 6: G729A unmodified instruction count

Test vector    Instructions
Algthm         62,620,904
Fixed          213,968,970
Lsp            3,977,189,411
Pitch          3,253,182,556
Tame           230,922,927

The workloads were then modified to include custom assembly instructions, and a new architecture-level simulator (sim-coproc), based on the existing profiling simulator, was designed. The test vectors were again simulated and the algorithmic complexity was measured and compared to that obtained in the previous run. Full compliance with the ITU-T test vectors was maintained at all times.

A. Coprocessor without register file results

Tables 7 and 8 depict the average (over all test vectors), relative algorithmic complexity for both the coder and decoder of the G729A and G723.1 standards respectively when compiled and simulated for a coprocessor using the RISC processor register file.

Table 7: G729 Coder Results (average)

               Normalized Complexity
Instruction    Coder    Decoder    Coder Delta    Decoder Delta
SATURE         0.940    0.972      0.060          0.028
ADD            0.937    0.969      0.003          0.002
SUB            0.927    0.967      0.010          0.002
ABS_S          0.927    0.967      0.000          0.000
SHL            0.924    0.962      0.003          0.005
SHR            0.923    0.956      0.002          0.006
L_SHL          0.899    0.898      0.024          0.059
L_SHR          0.896    0.895      0.002          0.002
NEGATE         0.896    0.895      0.000          0.000
L_ADD          0.814    0.837      0.082          0.059
L_SUB          0.802    0.812      0.012          0.025
ROUND          0.796    0.801      0.006          0.011
L_ABS          0.796    0.801      0.000          0.000
NORM_S         0.796    0.801      0.000          0.000
NORM_L         0.795    0.799      0.001          0.002
DIV_S          0.792    0.797      0.003          0.002
MULT           0.771    0.784      0.021          0.012
L_MULT         0.660    0.674      0.111          0.110
L_MAC          0.534    0.580      0.126          0.094
L_MSU          0.510    0.529      0.024          0.051

Table 8: G723.1 Coder Results (average)

               Normalized Complexity
Instruction    Coder    Decoder    Coder Delta    Decoder Delta
SATURE         0.987    0.985      0.013          0.015
ADD            0.985    0.981      0.002          0.004
SUB            0.985    0.980      0.000          0.000
ABS_S          0.984    0.977      0.001          0.003
SHL            0.981    0.965      0.003          0.012
SHR            0.981    0.959      0.000          0.006
L_SHL          0.936    0.908      0.044          0.051
L_SHR          0.912    0.901      0.024          0.006
NEGATE         0.912    0.901      0.000          0.000
L_ADD          0.824    0.819      0.088          0.082
L_SUB          0.814    0.804      0.010          0.015
ROUND          0.809    0.788      0.005          0.016
L_ABS          0.809    0.788      0.000          0.000
NORM_S         0.809    0.788      0.000          0.000
NORM_L         0.808    0.787      0.001          0.001
DIV_S          0.807    0.787      0.000          0.001
MULT           0.806    0.786      0.001          0.001
L_MULT         0.678    0.670      0.129          0.116
L_MAC          0.563    0.541      0.114          0.129
L_MSU          0.543    0.510      0.020          0.031

The tables illustrate the fractional complexity reduction as extension instructions are added, one by one, for both coder and decoder. In the case of G729A, an average architectural improvement in algorithmic complexity of the order of 49% (coder) to 47.1% (decoder) is achieved. The G723.1 standard achieves similar figures, with 45.7% and 49% complexity reduction for the coder and the decoder respectively. These improvement figures do not take into account cycle-level effects such as cache misses, prefetching or the possibility of multi-issue.
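The relation between the table columns and the quoted percentages can be written out explicitly. The symbols below are introduced only for this illustration: C_k is the normalized complexity after the k-th extension instruction has been added (with C_0 = 1 for the unmodified code), Delta_k is the corresponding delta column and R is the overall reduction. Because the table entries are rounded to three decimals, the sum of the deltas matches R only up to rounding.

```latex
\begin{align}
\Delta_k &= C_{k-1} - C_k, \qquad C_0 = 1, \\
R &= 1 - C_N = \sum_{k=1}^{N} \Delta_k, \\
R_{\text{G729A, coder}}    &= 1 - 0.510 = 0.490, &
R_{\text{G729A, decoder}}  &= 1 - 0.529 = 0.471, \\
R_{\text{G723.1, coder}}   &= 1 - 0.543 = 0.457, &
R_{\text{G723.1, decoder}} &= 1 - 0.510 = 0.490.
\end{align}
```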

B. Coprocessor with private register file results

Tables 9 and 10 show the average (over all test vectors), relative algorithmic complexity of the G723.1 and G729A coders respectively for a coprocessor with a private register file and utilizing all the defined instructions of table 3 (except division). Further substantial gains are observed: the G723.1 coder demonstrates an average complexity reduction of 65% relative to the unmodified standard (fractional complexity 0.35) and an improvement of 35.6% over the previous architecture, whereas the G729A coder achieves a 69% reduction (fractional complexity 0.31) and an improvement of 39.3% over the previous architecture.

It is clear that the introduction of the coprocessor register file provided significant benefit due to reducing the register pressure compared to the previous method. In addition, a significant number of Load/Store operations were eliminated since transient values are now cached in the dedicated register file.

Table 9: G723.1 Results

Benchmark                 Instruction Count (Coprocessor)    Fractional complexity
Dtx53mix (mix rate)       380,717,669                        0.36
Dtx53mix (5.3 Kbits/s)    257,744,402                        0.28
Dtx63 (6.3 Kbits/s)       4,261,239,585                      0.42
Average                                                      0.35

Table 10: G729A Results

Benchmark    Instruction Count (Coprocessor)    Fractional complexity
Algthm       19,765,353                         0.31
Fixed        67,662,019                         0.31
Lsp          1,257,199,028                      0.31
Pitch        1,030,256,280                      0.31
Tame         73,056,645                         0.31
Average                                         0.31
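The fractional complexity column is presumably the ratio of the coprocessor instruction count to the unmodified instruction count of tables 5 and 6 for the same test vector. A small C helper makes the arithmetic explicit; the function names are hypothetical and the interpretation of the column as this ratio is an assumption of the example.

```c
#include <stdint.h>

/* Ratio of instructions executed with the coprocessor to the
 * instructions executed by the unmodified reference code. */
double fractional_complexity(uint64_t coproc_instr, uint64_t baseline_instr)
{
    return (double)coproc_instr / (double)baseline_instr;
}

/* Reduction in executed instructions, in percent. */
double reduction_percent(uint64_t coproc_instr, uint64_t baseline_instr)
{
    return 100.0 * (1.0 - fractional_complexity(coproc_instr, baseline_instr));
}
```

For the Dtx53mix vector at 5.3 Kbits/s, fractional_complexity(257744402, 926595183) evaluates to roughly 0.278, i.e. a reduction of about 72%, which is consistent with the "up to 72%" figure quoted in the abstract.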

VI. SOC SUBSYSTEM

Architecture research demonstrated the superiority of the coprocessor with a private register file. This microarchitecture is currently being implemented in RTL VHDL as a tightly coupled coprocessor for the Leon Sparc-V8 CPU. Detailed microarchitecture analysis followed by trial synthesis confirmed that all instructions can fit in a single high-frequency cycle, resulting in a latency of 1 and an initiation rate of 1. Exceptions to this are the Multiply-add/subtract instructions and the short divide, with latency/initiation rate of 2/1 and 17/17 respectively. In particular, it was decided that, due to the very low improvement it offers, the iterative divider block would not be utilized.

The CPU/Coprocessor attaches to a 32-bit AHB system which connects to an external host via an AHB-PCI Bridge. This is depicted in figure 10.

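[Figure 10 shows the SoC subsystem: the RISC CPU with coprocessor, instruction cache and data cache attached to an AHB bus with arbiter and memory controller (SDRAM controller with SDRAM, PROM controller with PROM), and an AHB/Wishbone bridge leading to a Wishbone interconnect and a PCI interface to the PCI host.]

Figure 10: SoC Subsystem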

The optimized speech coder and the frames to be processed are transferred with DMA from the host PC to the SDRAM memory of the RISC/Coprocessor FPGA board. After that, the RISC CPU/coprocessor combination processes the frames and stores the compressed frames in local memory (SDRAM). The compressed frames are transferred back to the PC memory for comparison with the ITU-T test vectors.

VII. SYSTEM VERIFICATION

Significant effort is spent in validating the system at both block and system level [16]:

A. Block-level verification

The reference code DSP emulation instructions were instrumented to produce human-readable files of their input operands, the state of the global Overflow flag and output results. These vectors were subsequently fed into the individual datapath blocks and their functionality validated on a per-workload basis.
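The kind of instrumentation described above might look like the following C sketch, which wraps one emulation function and writes one text record per call containing the input operands, the overflow flag before and after, and the result. The record format, the helper names and the stand-in L_add_ref are all hypothetical; the actual file format used by the authors is not given in the text.

```c
#include <stdint.h>
#include <stdio.h>

int Overflow = 0;   /* global sticky overflow flag of the emulation layer */

/* Stand-in for the reference 32-bit saturating add */
static int32_t L_add_ref(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + b;
    if (s > INT32_MAX) { Overflow = 1; return INT32_MAX; }
    if (s < INT32_MIN) { Overflow = 1; return INT32_MIN; }
    return (int32_t)s;
}

/* Format one human-readable vector:
 * op, operand 1, operand 2, flag before, result, flag after */
int format_record(char *buf, size_t len, const char *op,
                  int32_t a, int32_t b, int v_in, int32_t r, int v_out)
{
    return snprintf(buf, len, "%s %08lx %08lx %d %08lx %d\n", op,
                    (unsigned long)(uint32_t)a, (unsigned long)(uint32_t)b,
                    v_in, (unsigned long)(uint32_t)r, v_out);
}

/* Instrumented wrapper: behaves like the reference function and
 * appends one record to the log (if a log file is supplied). */
int32_t L_add_logged(FILE *log, int32_t a, int32_t b)
{
    char line[64];
    int v_in = Overflow;
    int32_t r = L_add_ref(a, b);
    format_record(line, sizeof line, "L_add", a, b, v_in, r, Overflow);
    if (log != NULL)
        fputs(line, log);
    return r;
}
```

A block-level testbench can then replay each record against the corresponding datapath block and compare both the result and the overflow flag.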

B. System level verification

In parallel to block-level verification, system verification involved the design of a DMA controller, to transfer the embedded processor binary and frames from the host memory into the FPGA board SDRAM. The RISC processor, without the coprocessor, executed the workload and agreement with the ITU-T test vectors was obtained.
