Manuscript received May 22, 2003 0098 3063/00 $10.00 © 2003 IEEE
Scalar Coprocessors for Accelerating the G723.1 and G729A Speech Coders
Vassilios A. Chouliaras and Jose Nunez, Member, IEEE
Abstract — We investigate two scalar coprocessors for accelerating the ITU-T G723.1 and G729A speech coders.
Architecture space exploration indicates up to 72% reduction in the total number of instructions executed through the introduction of custom instructions and small changes to the C reference code. The accelerators are designed to be attached to a configurable embedded RISC CPU where they make use of the host register file and load/store infrastructure.
Index Terms —Coprocessor, Embedded systems, RISC CPU, Speech coding.
I. INTRODUCTION
Speech compression is utilized in a multitude of applications including, amongst others, VoIP networks and digital satellite systems. Typical consumer products comprise multimedia terminals, digital dictation machines, videophones and IP phones. The G723.1 recommendation [1] in particular was designed to standardize telephony and videoconferencing over public telephone lines (POTS) and is part of the ITU H.324 standard.
This work investigates the benefit, in terms of complexity reduction, of architecture (instruction) extensions for the efficient execution of the above vocoders, building on previous work by the authors [6].
The identified extensions are implemented as coprocessors, tightly-coupled to a configurable, embedded RISC processor.
There is a significant body of research into application acceleration via targeted coprocessors: application domains are diverse, ranging from cryptography [12] and maze-routing [7] to high-end video processing [19]. Previous research into the efficient execution of speech coders includes [13] and [14], which describe the necessary changes in the ITU reference code when targeting very high-performance, off-the-shelf digital signal processors. [15] describes a semi-automated chip-synthesis flow targeting a horizontally microprogrammed (VLIW) embedded DSP architecture, capable of executing one multiply-accumulate operation per clock cycle. The workload in this case was the GSM half-rate speech coder.
Our research is a continuation of [6], which describes instruction set extensions implemented in a moderate-complexity datapath (coprocessor) attached to a configurable embedded processor. We have investigated a second coprocessor configuration which includes a private register file. Results indicate that the new configuration is superior to the previously reported method.
V. A. Chouliaras is with the Department of Electronic and Electrical Engineering, University of Loughborough, Loughborough, Leicestershire LE11 3TU, UK (e-mail: V.A.Chouliaras@lboro.ac.uk).
Jose Nunez is with the Department of Electronic and Electrical Engineering, University of Loughborough, Loughborough, Leicestershire LE11 3TU, UK (e-mail: J.L.Nunez-yanez@lboro.ac.uk).
II. LPAS-BASED SPEECH CODERS
The G723.1 and G729A standards belong to the category of Linear-Prediction Analysis-by-Synthesis (LPAS) [21] speech coders. They produce low bit-rate, high-quality speech by combining analysis-by-synthesis techniques, where the encoder (analysis) includes the decoder (synthesis) to determine the initial excitation signal, with linear prediction techniques that determine the coefficients of the speech synthesis filter. The G723.1 standard specifies a dual-rate speech coder that can operate at 5.3 or 6.3 Kbps, while G729A operates at a fixed rate of 8 Kbps. Quality improves with higher bit rates, although the overall performance of G723.1 at 6.3 Kbps and G729A is similar. A clear difference between these coders is their algorithmic delay: the total one-way delay of 25 ms for G729A compares favorably with the 67.5 ms of G723.1. Technically, G723.1 at 6.3 Kbps differs from G729A and G723.1 at 5.3 Kbps in the excitation model for the synthesis filter. G723.1 at 6.3 Kbps uses multi-pulse excitation with a maximum likelihood quantizer (MP-MLQ), while G723.1 at 5.3 Kbps and G729A use code excited linear prediction (CELP) [21]. CELP coders are based on a codebook that stores possible excitation sequences for the synthesis filter. This is the most common realization of the LPAS paradigm and its dataflow is depicted in figure 1.
In the figure, the original input speech is used to perform linear prediction analysis and calculate the coefficients of a tenth-order synthesis filter. The filter order models the number of resonant frequencies, or formants, of the transfer function of the human vocal tract. The excitation signal to the synthesis filter is obtained from two codebooks that model the initial stages of the human sound production system. An adaptive codebook is used to model the pitch structure of voiced sounds originating in the vibrating vocal cords, and a fixed codebook is used to model unvoiced sounds such as nasal or plosive sounds. The residual error between the reconstructed speech produced by the synthesis filter and the original input speech is then further processed by a perceptual weighting filter. The output signal from this process is then matched against the adaptive codebook elements to determine the codebook index and gain that best approximate the residual signal. The adaptive codebook contribution is removed from the residual and the same process is repeated using the fixed codebook. The index and gains for both codebooks are assembled together with the synthesis filter coefficients in the bitstream transmitted to the decoder. This processing is done for every 10 ms frame of the voice signal. The G729A decoder dataflow is illustrated in figure 2. The received bitstream is disassembled to obtain the filter coefficients and the codebook parameters. The excitation is constructed by adding the adaptive and fixed codebook vectors scaled by their gains.
The excitation is then filtered through the same synthesis filter as during encoding. Additional post-processing of the speech signal is performed to enhance its quality.
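The decoder steps described above (scaling and summing the two codebook vectors, then running the result through the tenth-order synthesis filter) can be sketched as follows. This is a floating-point illustration only, not the ITU fixed-point reference code; the function names, the coefficient sign convention A(z) = 1 + sum a[k] z^-k, and the filter-memory layout are assumptions made for the example.

```c
#include <stddef.h>

#define LPC_ORDER 10 /* tenth-order synthesis filter, as stated in the text */

/* Excitation: u[n] = gain_p * adaptive[n] + gain_c * fixed[n].
 * Hypothetical helper; names are illustrative. */
void build_excitation(const float *adaptive, float gain_p,
                      const float *fixed, float gain_c,
                      float *excitation, size_t n)
{
    for (size_t i = 0; i < n; i++)
        excitation[i] = gain_p * adaptive[i] + gain_c * fixed[i];
}

/* All-pole synthesis filter 1/A(z):
 *   s[n] = u[n] - sum_{k=1..LPC_ORDER} a[k] * s[n-k]
 * a[0] is assumed to be 1 and is not used. mem[] holds the previous
 * LPC_ORDER outputs, most recent first, and is updated in place so the
 * filter state carries over from one frame to the next. */
void synthesis_filter(const float a[LPC_ORDER + 1], const float *excitation,
                      float *speech, size_t n, float mem[LPC_ORDER])
{
    for (size_t i = 0; i < n; i++) {
        float acc = excitation[i];
        for (int k = 1; k <= LPC_ORDER; k++)
            acc -= a[k] * mem[k - 1];
        for (int k = LPC_ORDER - 1; k > 0; k--)
            mem[k] = mem[k - 1];
        mem[0] = acc;
        speech[i] = acc;
    }
}
```

In the fixed-point reference code the inner multiply-accumulate loop of such a filter is expressed with the DSP emulation functions, which is why those functions dominate the instruction count.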
Figure 1: G729A CELP Coder
Figure 2: G729A CELP Decoder
III. PROBLEM FORMULATION
This research identifies architecture and microarchitecture requirements for the efficient implementation of the G729A and G723.1 speech coders on high-performance, low-cost, configurable microprocessors.
The workloads were initially executed and profiled in native mode (Linux X86): Table 1 shows the relative amount of time spent outside the DSP emulation instructions.
In order to investigate the potential acceleration of the algorithms when executing on an embedded microprocessor, the workload was recompiled for the Simplescalar instruction set architecture (ISA) [5]. Table 2 illustrates the simulated processor profiling results.
As expected, the workloads spend a significant amount of time/instructions executing the DSP emulation functions. It is clear that efficient implementation of the DSP emulation instructions on a configurable, extensible microprocessor can lead to a very high-performance, targeted architecture for the particular workloads. The small form-factor and reduced power consumption of the proposed solution make it a very attractive candidate for replication and integration in an SoC ASIC.
Table 1: Relative amount of time spent outside the DSP emulation instructions
Algorithm       Relative time (%, native)
G723 Coder      31.3
G723 Decoder    22.8
G729 Coder      30.4
G729 Decoder    26.9
Table 2: Relative number of total instructions executed outside the DSP emulation instructions
Algorithm       Relative instructions (%, simulated)
G723 Coder      34.5
G723 Decoder    33.3
G729 Coder      34.2
G729 Decoder    37.2
This is the approach taken in this work: the Instruction Set Architecture was chosen to be precisely the DSP emulation instructions as they appear in the reference source. It is summarized in table 3:
Table 3: Coprocessor ISA
Move ops            Description
Mvrc                Move RISC CPU register to coprocessor register
Mvcr                Move Coprocessor register to RISC CPU register
Mvrv                Move RISC CPU register LSB to coprocessor overflow
Mvcvr               Move coprocessor overflow to RISC CPU register LSB
Data ops            Description
Sature              32-16 bit ITU saturate
Add                 16-bit add and saturate
Sub                 16-bit sub and saturate
Abs_s               16-bit absolute value
L_abs               32-bit absolute value
Shl                 16-bit shift-left with negative shift support and saturation
Shr                 16-bit shift-right with negative shift support and saturation
Negate              16-bit negation
Norm_s              16-bit normalization calculation
Norm_l              32-bit normalization calculation
L_add               32-bit add with overflow saturation
L_sub               32-bit sub with overflow and saturation
Mult                16x16->16 signed multiplication with overflow and saturation
L_mult              16x16->32 signed multiplication with overflow and saturation
L_mac               16x16->32 multiplication and 32-bit summation with overflow and saturation
L_msu               16x16->32 multiplication and 32-bit subtraction with overflow and saturation
Miscellaneous ops   Description
Clv                 Clear sticky overflow bit
Setv                Set sticky overflow bit
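To make the semantics of a few of the data operations in table 3 concrete, the sketch below models Sature, Add, L_add, L_sub, L_mult, L_mac and L_msu in portable C. It follows the commonly used ITU-T fixed-point conventions (saturation on overflow, a sticky overflow flag, and a doubled 16x16 product for L_mult), but it is an illustrative model written for this text, not the reference source; the lower-case function names and the overflow_flag variable are assumptions.

```c
#include <stdint.h>

/* Models the sticky overflow bit: set on saturation, cleared only by software. */
static int overflow_flag = 0;

/* Sature: saturate a 32-bit value to the 16-bit range. */
int16_t sature(int32_t x)
{
    if (x > INT16_MAX) { overflow_flag = 1; return INT16_MAX; }
    if (x < INT16_MIN) { overflow_flag = 1; return INT16_MIN; }
    return (int16_t)x;
}

/* Add: 16-bit add and saturate. */
int16_t add16(int16_t a, int16_t b)
{
    return sature((int32_t)a + (int32_t)b);
}

/* Saturate a 64-bit intermediate to the 32-bit range. */
static int32_t sat32(int64_t x)
{
    if (x > INT32_MAX) { overflow_flag = 1; return INT32_MAX; }
    if (x < INT32_MIN) { overflow_flag = 1; return INT32_MIN; }
    return (int32_t)x;
}

/* L_add / L_sub: 32-bit add and subtract with saturation. */
int32_t l_add(int32_t a, int32_t b) { return sat32((int64_t)a + (int64_t)b); }
int32_t l_sub(int32_t a, int32_t b) { return sat32((int64_t)a - (int64_t)b); }

/* L_mult: 16x16->32 signed multiply with the product doubled.
 * Only (-32768) * (-32768) overflows after doubling. */
int32_t l_mult(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    if (p == 0x40000000) { overflow_flag = 1; return INT32_MAX; }
    return p * 2;
}

/* L_mac / L_msu: multiply, then accumulate or subtract with saturation. */
int32_t l_mac(int32_t acc, int16_t a, int16_t b) { return l_add(acc, l_mult(a, b)); }
int32_t l_msu(int32_t acc, int16_t a, int16_t b) { return l_sub(acc, l_mult(a, b)); }
```

Each of these functions costs several host instructions when compiled for a plain RISC ISA, whereas a custom instruction performs the same operation in a single issue slot, which is the source of the instruction-count reduction studied here.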
IV. MICROARCHITECTURE
We have investigated two microarchitectures: One that uses the main CPU register file and another that utilizes its own.
Both microarchitectures make use of the RISC memory subsystem (L1 Data cache) and are designed to be attached to a Sparc-V8 compliant SoC subsystem distributed under LGPL [10]. We chose to connect the coprocessors to the integer unit pipeline directly, instead of designing them as AHB-compliant masters [11], for performance reasons: stand-alone AHB coprocessors are very effective when working on medium to large blocks of streaming data. Although the workloads perform a lot of work on blocks of data (samples), there were many more instances where we had to insert custom assembly code into irregular (non-iterative) blocks. As a result, we opted for a very tightly-coupled configuration which accommodates both cases efficiently. High-level views of both microarchitectures are depicted in figures 4 and 6 respectively.
This section discusses a number of design parameters:
A. Coprocessor Interface
The open-source embedded RISC processor lacked detailed microarchitecture documentation. Initial experimentation with the already existing coprocessor interface was inconclusive as to its ability to operate in a pipelined fashion. That would have had a detrimental effect on the performance of the coprocessors and it was therefore decided to implement a new, pipelined coprocessor interface. The newly developed coprocessor port can handle two coprocessors and is able to deliver an instruction on every cycle. External coprocessors provide flow control to the main processor through a dedicated stall signal.
The diagram of figure 3 shows a coprocessor data operation on cycle 1 followed by a host-to-coprocessor register transfer on cycle 2. In cycle 3, a coprocessor register is requested by the RISC processor but due to internal stall conditions, data are made available one cycle later than the expected time (cycle 5 instead of cycle 4). During that time, the main processor is held with the holdn signal. Finally, a second read operation, this time directed to Coprocessor 1, is initiated in cycle 6.
Results are made available to the main pipeline in cycle 7.
B. Microarchitecture 1: Using the main RISC CPU Register File
This is the simplest microarchitecture since it makes use of the main RISC processor register file. This type of approach has been adopted by configurable microprocessor vendors [18] [22] and it is effectively a side-datapath with associated control, attached to the main CPU as depicted in Figure 4:
Figure 3: Pipelined coprocessor I/F (timing diagram over cycles 1 to 7 for the sequence data_op, mvrc, mvcr, data_op, mvcr; signals: clk, pcop_in.cop_no, pcop_in.holdn, pcop_in.valid, pcop_in.opc[19:0], pcop_in.din[31:0], pcop_out[0].holdn, pcop_out[0].dout[31:0], pcop_out[1].holdn, pcop_out[1].dout[31:0])
Figure 4: Microarchitecture without register file (RISC CPU pipeline stages IFETCH, DECODE, EXEC, DMEM/EXEC2 and WB with instruction cache, data cache and register file RF(1,2); coprocessor decode, control pipeline, third register-file read port RF3, and a datapath with shift unit, misc unit, 16x16 signed multiplier and 32-bit signed adder with saturation)
In this case, the coprocessor consists of the Datapath and the Control Pipeline.
Starting at the IFETCH stage, the main RISC processor fetches one instruction word from a multi-way set-associative instruction cache and clocks it into the instruction register.
RISC and coprocessor decoding take place concurrently at the DECODE stage with the main RISC register file accessed at the falling edge of the clock. Due to the significant number of Multiply-add operations in the workload, a third read port was added to the main CPU register file to accommodate single-cycle addition (RF3). This port is depicted as an embedded SRAM block, instantiated in the coprocessor hierarchy, clocked at the falling edge of the DECODE stage. Finally, all result bypassing takes place in this stage.
The EXEC stage is the main processing stage for both the RISC processor and the coprocessor. During this stage all non-arithmetic operations are computed in the coprocessor. In addition, the 16-bit signed multiplication is performed. All transfers between the main RISC pipeline and the internal coprocessor state take place in this stage.
Coprocessor results are pipelined in the EXEC2 stage where the add part of the Multiply-add operation is performed along with saturation. During this stage, the L1 data cache is accessed and one 32-bit word is returned to the main RISC pipeline from the load path as depicted in the diagram. It is this stage that qualifies state updates in the coprocessor side since all possible exception conditions have been resolved.
Finally, results are clocked into a staging register prior to committing to the RISC register file, on the falling edge of the clock.
C. Microarchitecture 2: Using private Register File
This microarchitecture differs considerably from the previous one: it uses a separate, 16x32-bit register file and a more elaborate control mechanism. The coprocessor state is fully accessible from the RISC CPU and is shown in figure 5:
Figure 5: Coprocessor Programmers Model (registers 0 to 15 and overflow bit V)
It consists of sixteen 32-bit registers and a sticky overflow bit.
Bi-directional transfer instructions, between the host RISC processor and the coprocessor, were added to accommodate the lack of Move-to-coprocessor/Move-from-coprocessor instructions in the Sparc V8 architecture [17].
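The programmer's model and the transfer instructions can be mirrored in a small C model, which is also how an architecture-level simulator might hold the coprocessor state. This is a sketch under assumptions: the struct layout, the cop_ prefix and the masking of register indices to four bits are illustrative choices, not details taken from the hardware.

```c
#include <stdint.h>

#define COP_NUM_REGS 16 /* sixteen 32-bit registers, as in figure 5 */

typedef struct {
    uint32_t r[COP_NUM_REGS]; /* private register file */
    unsigned v;               /* sticky overflow bit V */
} cop_state;

/* Mvrc: move a RISC CPU register value into a coprocessor register. */
void cop_mvrc(cop_state *c, unsigned crd, uint32_t risc_value)
{
    c->r[crd & (COP_NUM_REGS - 1)] = risc_value;
}

/* Mvcr: move a coprocessor register to a RISC CPU register (returned). */
uint32_t cop_mvcr(const cop_state *c, unsigned crs)
{
    return c->r[crs & (COP_NUM_REGS - 1)];
}

/* Mvrv: move the LSB of a RISC CPU register into the overflow bit. */
void cop_mvrv(cop_state *c, uint32_t risc_value)
{
    c->v = risc_value & 1u;
}

/* Mvcvr: move the overflow bit to the LSB of a RISC CPU register. */
uint32_t cop_mvcvr(const cop_state *c)
{
    return c->v;
}

/* Clv / Setv: clear or set the sticky overflow bit. */
void cop_clv(cop_state *c)  { c->v = 0; }
void cop_setv(cop_state *c) { c->v = 1; }
```

Because the overflow bit can be both read and written from the host, the reference code's global Overflow variable can be saved, restored and tested without leaving the coprocessor instruction set.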
The high-level schematic of the coprocessor with its own register file is depicted in figure 6. In this case, the coprocessor pipeline is segmented in three major sections: Front-end, Control pipeline and Datapath.
Starting from the top, the main CPU reads an instruction from the multi-way set-associative instruction cache and clocks it into the instruction register. The latched command is then decoded, both at the RISC processor and the coprocessor front-end, and register-file read-addresses are extracted. In parallel, the coprocessor decoding logic computes a number of control fields that are sent to the control pipeline.
During the EXEC/READ stage, the register file is accessed followed by operand bypassing. The resolved operands opr1, opr2 and opr3 are clocked into the operand registers where they are utilized during the first execution stage (EXEC1).
In DMEM/EXEC1, all shifting, normalization and miscellaneous operations are performed. In addition, the signed-multiplier is accessed if the command specifies that.
Results are passed to EXEC2 for the second stage of execution where all arithmetic and saturation takes place.
The configuration of figure 6 permits the pipelined execution of all the commands with a latency of 1 cycle. The only exceptions are the multiply-add and multiply-subtract with saturation, which span both execution stages and have a latency of 2 cycles.
Figure 6: high-level microarchitecture (RISC CPU pipeline stages IFETCH, DECODE, EXEC, DMEM and WB; coprocessor front end with decode and CPU command I/F; control pipeline with DECODE, READ, EXEC1 and EXEC2/WB control; datapath stages READ, EXEC1 and EXEC2 with private RF, RF bypass stages, shift unit, misc unit, 16x16 signed multiplier and 32-bit signed adder with saturation)
The following sections discuss in more detail the microarchitecture blocks common to both coprocessors. These include the EXEC1 and EXEC2 stages and lower hierarchical blocks.
1) EXEC1 Stage
EXEC1 includes datapath logic to perform 16x16 bit signed multiplication, all ITU shift operations and a miscellaneous block responsible for handling all opcodes not falling in the previous categories. These are depicted in figure 7.
a) Multiplier
This is the signed, 16-bit multiplier. Due to the highly configurable nature of the RISC processor and the portability requirements of this work, HDL constants are used to select whether the multiplier is inferred in the RTL code or instantiated. In the latter case, a Booth-encoded, Wallace-tree multiplier [20] is utilized due to its higher pipelined performance when compared to the implementations chosen by the synthesis tools.
Figure 7: EXEC1 Stage (shift_unit, misc_unit and signed 16-bit multiplier operating on bits 31:16 and 15:0 of opr1 and opr2, with a result multiplexer producing s3_res and s3_setv)
Table 4: Multiplier performance vs. architecture (MHz)
Multiplier        Unpipelined   2-stage
Synthesis/CS      204           330
Synthesis/NBW     376
Synthesis/WALL    385           502
WALL/No BOOTH     345           476
WALL/BOOTH        370           574
Table 4 depicts the unpipelined and two-stage pipelined maximum operating frequency of the 16x16 signed multiplier in a high-performance 0.13 µm process. Our timing budget allows for the use of a non-pipelined multiplier, thus simplifying coprocessor pipeline design.
b) Shift Unit
The shift unit implements the 16 and 32-bit ITU shift operations. A particular characteristic of these operations is the ability to specify negative shift amounts resulting in a positive shift in the opposite direction. The high-level schematic of the shift unit is depicted in figure 8.
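The negative-shift behaviour can be illustrated for the 16-bit case with the following C model. It is a sketch of the commonly documented ITU-T shl/shr semantics (left shifts saturate and set the overflow flag, right shifts are arithmetic and a negative amount reverses direction); the function names, the shift_overflow variable and the use of int for the shift amount are assumptions made for this example.

```c
#include <stdint.h>

/* Models the sticky overflow bit for the shift operations. */
static int shift_overflow = 0;

int16_t shr16(int16_t x, int shift);

/* Shl: 16-bit shift-left with saturation; a negative amount shifts right.
 * The shift amount is a 16-bit operand in hardware, so -shift cannot
 * overflow the int used here. */
int16_t shl16(int16_t x, int shift)
{
    if (shift < 0)
        return shr16(x, -shift);
    if (x == 0)
        return 0;
    if (shift > 15) { /* any non-zero value overflows */
        shift_overflow = 1;
        return x > 0 ? INT16_MAX : INT16_MIN;
    }
    int32_t wide = (int32_t)x * ((int32_t)1 << shift);
    if (wide > INT16_MAX) { shift_overflow = 1; return INT16_MAX; }
    if (wide < INT16_MIN) { shift_overflow = 1; return INT16_MIN; }
    return (int16_t)wide;
}

/* Shr: 16-bit arithmetic shift-right; a negative amount shifts left. */
int16_t shr16(int16_t x, int shift)
{
    if (shift < 0)
        return shl16(x, -shift);
    if (shift >= 15)
        return x < 0 ? -1 : 0;
    /* Shift the complement of negative values so that only non-negative
     * numbers are shifted, which keeps the result well defined in C. */
    return x < 0 ? (int16_t)~((~x) >> shift) : (int16_t)(x >> shift);
}
```

In software each call involves several compares and branches, which is part of why a dedicated shift unit with built-in direction reversal and saturation removes a noticeable number of instructions.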
2) EXEC2 Stage
This stage performs the Add-part of the MAC instruction as well as all arithmetic and saturation. Results commit to the private register file at the end of this cycle or return to the host pipeline during stage DMEM. The common EXEC2 high-level schematic is shown in figure 9.
Figure 8: ITU Shifter Schematic
Figure 9: EXEC2 Stage high-level schematic (adder, sign extension and SATURE block; results go to the register file or to the host CPU)
V. RESULTS
Results were obtained for both coprocessors at the architectural level with the baseline architecture being the Simplescalar ISA. The workloads were compiled and all ITU test vectors were validated on the standard architecture simulator (sim-profile). Tables 5 and 6 depict the number of simulated processor instructions required for each workload, for the G723.1 and G729A algorithms respectively.
Table 5: G723.1 unmodified instruction count
Test vector               Instructions
Dtx53mix (mix rate)       1,063,099,834
Dtx53mix (5.3 Kbits/s)    926,595,183
Dtx63 (6.3 Kbits/s)       10,159,707,298
Table 6: G729A unmodified instruction count
Test vector   Instructions
Algthm        62,620,904
Fixed         213,968,970
Lsp           3,977,189,411
Pitch         3,253,182,556
Tame          230,922,927
The workloads were then modified to include custom assembly instructions, and a new architecture-level simulator (sim-coproc), based on the existing profiling simulator, was designed. The test vectors were simulated again and the algorithmic complexity was measured and compared to that obtained in the previous run. Full compliance with the ITU-T test vectors was maintained throughout.
A. Coprocessor without register file results
Tables 7 and 8 depict the average (over all test vectors), relative algorithmic complexity for both the coder and decoder of the G729A and G723.1 standards respectively when compiled and simulated for a coprocessor using the RISC processor register file.
Table 7: G729 Coder Results (average), normalized complexity
Instruction   Coder   Decoder   Coder Delta   Decoder Delta
SATURE        0.940   0.972     0.060         0.028
ADD           0.937   0.969     0.003         0.002
SUB           0.927   0.967     0.010         0.002
ABS_S         0.927   0.967     0.000         0.000
SHL           0.924   0.962     0.003         0.005
SHR           0.923   0.956     0.002         0.006
L_SHL         0.899   0.898     0.024         0.059
L_SHR         0.896   0.895     0.002         0.002
NEGATE        0.896   0.895     0.000         0.000
L_ADD         0.814   0.837     0.082         0.059
L_SUB         0.802   0.812     0.012         0.025
ROUND         0.796   0.801     0.006         0.011
L_ABS         0.796   0.801     0.000         0.000
NORM_S        0.796   0.801     0.000         0.000
NORM_L        0.795   0.799     0.001         0.002
DIV_S         0.792   0.797     0.003         0.002
MULT          0.771   0.784     0.021         0.012
L_MULT        0.660   0.674     0.111         0.110
L_MAC         0.534   0.580     0.126         0.094
L_MSU         0.510   0.529     0.024         0.051
Table 8: G723.1 Coder Results (average), normalized complexity
Instruction   Coder   Decoder   Coder Delta   Decoder Delta
SATURE        0.987   0.985     0.013         0.015
ADD           0.985   0.981     0.002         0.004
SUB           0.985   0.980     0.000         0.000
ABS_S         0.984   0.977     0.001         0.003
SHL           0.981   0.965     0.003         0.012
SHR           0.981   0.959     0.000         0.006
L_SHL         0.936   0.908     0.044         0.051
L_SHR         0.912   0.901     0.024         0.006
NEGATE        0.912   0.901     0.000         0.000
L_ADD         0.824   0.819     0.088         0.082
L_SUB         0.814   0.804     0.010         0.015
ROUND         0.809   0.788     0.005         0.016
L_ABS         0.809   0.788     0.000         0.000
NORM_S        0.809   0.788     0.000         0.000
NORM_L        0.808   0.787     0.001         0.001
DIV_S         0.807   0.787     0.000         0.001
MULT          0.806   0.786     0.001         0.001
L_MULT        0.678   0.670     0.129         0.116
L_MAC         0.563   0.541     0.114         0.129
L_MSU         0.543   0.510     0.020         0.031
The tables illustrate the fractional complexity reduction as extension instructions are added, one by one, for both coder and decoder. In the case of G729A, an average architectural improvement in algorithmic complexity of 49% (coder) and 47.1% (decoder) is achieved. The G723.1 standard achieves similar figures, with 45.7% and 49% complexity reduction for the coder and the decoder respectively. These improvement figures do not take into account cycle effects such as cache misses, prefetching or the possibility of multi-issue.
B. Coprocessor with private register file results
Tables 9 and 10 show the relative algorithmic complexity, per test vector and on average, of the G723.1 and G729A coders respectively for a coprocessor with a private register file, using all the instructions defined in table 3 (except division). Further substantial gains are observed: the G723.1 coder shows an average complexity reduction of 65% relative to the unmodified standard (fractional complexity 0.35), an improvement of 35.6% over the previous architecture, whereas the G729A coder achieves a 69% reduction (fractional complexity 0.31), an improvement of 39.3% over the previous architecture.
It is clear that the introduction of the coprocessor register file provided significant benefit due to reducing the register pressure compared to the previous method. In addition, a significant number of Load/Store operations were eliminated since transient values are now cached in the dedicated register file.
Table 9: G723.1 Results
Benchmark                 Instruction Count (Coprocessor)   Fractional complexity
Dtx53mix (mix rate)       380,717,669                       0.36
Dtx53mix (5.3 Kbits/s)    257,744,402                       0.28
Dtx63 (6.3 Kbits/s)       4,261,239,585                     0.42
Average                                                     0.35
Table 10: G729A Results
Benchmark   Instruction Count (Coprocessor)   Fractional complexity
Algthm      19,765,353                        0.31
Fixed       67,662,019                        0.31
Lsp         1,257,199,028                     0.31
Pitch       1,030,256,280                     0.31
Tame        73,056,645                        0.31
Average                                       0.31
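The fractional complexity figures follow directly from the instruction counts: each is the coprocessor instruction count divided by the unmodified count for the same test vector (Tables 5 and 6), and the reduction is one minus that fraction. The helper below is a minimal sketch of this arithmetic; the function names are illustrative and are not part of the simulators.

```c
#include <stdint.h>

/* Fractional complexity: instructions with the coprocessor divided by
 * instructions of the unmodified workload for the same test vector. */
double fractional_complexity(uint64_t coproc_count, uint64_t baseline_count)
{
    return (double)coproc_count / (double)baseline_count;
}

/* Complexity reduction in percent, i.e. the share of instructions removed. */
double reduction_percent(uint64_t coproc_count, uint64_t baseline_count)
{
    return 100.0 * (1.0 - fractional_complexity(coproc_count, baseline_count));
}
```

For example, the 5.3 Kbits/s G723.1 vector gives 257,744,402 / 926,595,183, or about 0.28, which corresponds to the reduction of roughly 72% quoted in the abstract.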
VI. SOC SUBSYSTEM
Architecture research demonstrated the superiority of the coprocessor with a private register file. This microarchitecture is currently being implemented in RTL VHDL as a tightly-coupled coprocessor for the Leon Sparc-V8 CPU. Detailed microarchitecture analysis followed by trial synthesis confirmed that all instructions can fit in a single high-frequency cycle, resulting in a latency of 1 and an initiation rate of 1. Exceptions to this are the Multiply-add/subtract instructions and the short divide, with latency/initiation rate of 2/1 and 17/17 respectively. In particular, it was decided that due to the very low improvement, the iterative divider block would not be utilized.
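The effect of these latency and initiation figures on a run of back-to-back operations can be estimated with simple pipeline arithmetic. The sketch below assumes that "initiation rate" means the number of cycles between successive issues and that the operations are independent and never stall; both are assumptions made for illustration, and the function name is hypothetical.

```c
/* Cycles to complete n independent operations on a unit with the given
 * latency and initiation interval (both in cycles), assuming no stalls:
 * the last operation issues at cycle (n - 1) * interval and completes
 * latency cycles later. */
unsigned long pipeline_cycles(unsigned long n, unsigned latency,
                              unsigned interval)
{
    if (n == 0)
        return 0;
    return (n - 1) * interval + latency;
}
```

Under these assumptions, 40 multiply-adds at 2/1 take 41 cycles, only one more than 40 single-cycle operations, whereas 40 divides at 17/17 would take 680 cycles, which illustrates why a fully pipelined multiply-add matters far more than the divider.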
The CPU/Coprocessor attaches to a 32-bit AHB system which connects to an external host via an AHB-PCI Bridge. This is depicted in figure 10.
Figure 10: SoC Subsystem (RISC CPU with coprocessor, instruction cache and data cache on an AHB bus with arbiter, memory controller, SDRAM controller and SDRAM, PROM controller and PROM; an AHB/Wishbone bridge and Wishbone interconnect lead to the PCI I/F and the PCI host)
The optimized speech coder and the frames to be processed are transferred with DMA from the host PC to the SDRAM memory of the RISC/Coprocessor FPGA board. After that, the RISC CPU/coprocessor combination processes the frames and stores the compressed frames in local memory (SDRAM). The compressed frames are transferred back to the PC memory for comparison with the ITU-T test vectors.
VII. SYSTEM VERIFICATION
Significant effort is spent in validating the system both at block as well as system level [16]:
A. Block-level verification
The reference code DSP emulation instructions were instrumented to produce human-readable files of their input operands, the state of the global Overflow flag and output results. These vectors were subsequently fed into the individual datapath blocks and their functionality validated on a per-workload basis.
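The instrumentation described above can be sketched in C as a pair of helpers: one formats a single human-readable vector line for a DSP emulation call, and the other checks a datapath block's output against a recorded line. The line format (operation name, two operands, Overflow flag before, result, Overflow flag after, with values in hexadecimal) and the function names are assumptions for illustration; the format actually used in this work is not specified here.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format one vector line: "<op> <in1> <in2> <v_before> <result> <v_after>".
 * Returns the value of snprintf. Hypothetical format. */
int format_trace(char *buf, size_t len, const char *op,
                 int32_t in1, int32_t in2, int v_before,
                 int32_t result, int v_after)
{
    return snprintf(buf, len,
                    "%s %08" PRIx32 " %08" PRIx32 " %d %08" PRIx32 " %d",
                    op, (uint32_t)in1, (uint32_t)in2, v_before,
                    (uint32_t)result, v_after);
}

/* Compare what a datapath block produced for the recorded operands
 * against the recorded line; returns 1 on agreement, 0 otherwise. */
int trace_matches(const char *recorded, const char *op,
                  int32_t in1, int32_t in2, int v_before,
                  int32_t block_result, int block_v)
{
    char expected[96];
    format_trace(expected, sizeof expected, op, in1, in2, v_before,
                 block_result, block_v);
    return strcmp(recorded, expected) == 0;
}
```

In the reference code, a call such as format_trace would sit inside each emulation function and its output would be written to a per-workload file; an HDL testbench would then replay the operands and perform the comparison.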
B. System level verification
In parallel to block-level verification, system verification involved the design of a DMA controller, to transfer the embedded processor binary and frames from the host memory into the FPGA board SDRAM. The RISC processor, without the coprocessor, executed the workload and agreement with the ITU-T test vectors was obtained.
Figure 11: High-level schematic of limited dual-issue CPU (RISC CPU front-end with instruction cache, branch prediction, instruction buffers and coprocessor dispatch; main execution pipeline with GPR, bypass buses, Ld/St unit and data cache; pipeline stages ICACHE1, ICACHE2, DISPATCH, RF, ALU, DCACHE1, DCACHE2 and WB)
VIII. CONCLUSIONS AND FUTURE WORK
We utilized a combination of techniques to profile and optimize the ITU-T G729A and G723.1 speech coders.
A further significant source of optimization lies with tapping the amount of data-level parallelism available in the workloads. Our group currently investigates vector architectures for the efficient execution of the speech coders.
Additional insight on the cycle effects will be provided through the cycle-accurate modeling of both coprocessors when attached to a more generic RISC CPU with limited dual-issue ability. This is portrayed in figure 11, where a high-performance scalar RISC processor with 8 pipeline stages and limited dual-issue capability (one scalar, one coprocessor) is described. This will allow for experimentation with the processor/coprocessor design space and provide insight into the necessary microarchitecture requirements for the efficient execution of the workloads.
Finally, we are building the RTL model of the microarchitecture of figure 6 in the context of the system of figure 10.
REFERENCES
[1] ITU-T Recommendation G.723.1, ‘Dual Rate Speech coder for multimedia communications transmitting at 5.3 and 6.3 kbits/s’, 3/96
[2] ITU-T Recommendation G.729, ‘Coding of speech at 8 kbits/s using conjugate-structure algebraic-code-excited linear-prediction (CS-ACELP)’, 3/96
[3] M. Prasad, P. Arcy, M. Diamondstein, H. Srinivas, ‘Half-Rate GSM Vocoder Implementation on a Dual-Mac Digital Signal Processor’, Proceedings of the 1997 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 619-622
[4] Vinod Kathail, Shail Aditya, Robert Schreiber, B. Ramakrishna Rau, Darren C. Cronquist, Mukund Sivaraman, ‘PICO: Automatically designing custom computers’, IEEE Computer, 35(9), September 2002
[5] D. Burger, T. Austin, ‘Evaluating Future Microprocessors: The Simplescalar Tool Set’, http://www.simplescalar.com
[6] V. A. Chouliaras, J. L. Nunez, ‘A scalar coprocessor for accelerating the G723.1 and G729A speech coders’, accepted for publication in the IEEE International Conference on Consumer Electronics (ICCE03)
[7] Y. Won, S. Sahni, Y. El-Ziq, ‘A hardware accelerator for maze routing’, IEEE Trans on Computers, vol. 39, no. 1, pp. 141-145, Jan. 1990
[8] R. Cox, ‘Three new speech coders from the ITU cover a range of applications’, IEEE Communications magazine, pp. 40-47, Sept 1997
[9] R. Cox, P. Kroon, ‘Low bit-rate speech coders for multimedia communication’, IEEE Communications magazine, pp. 34-41, December 1996
[10] ‘The Leon-2 processor User’s manual, XST edition, ver. 1.0.14’, www.gaisler.com
[11] ‘AMBA Specification (Rev 2.0)’, www.arm.com
[12] A. Royo, J. Moran, C. Lopez, ‘Design and implementation of a coprocessor for cryptography applications’, Proceedings of the 1997 IEEE European Design and Test Conference (ED&TC’97), pp. 213-217
[13] B. Costinescu, R. Ungureanu, M. Stoica, E. Medve, R. Pread, M. Alexiu, C. Ilas, ‘ITU-T G729 Implementation on Starcore SC140’, AN2094/D, Rev. 0, 02/2001, www.motorola.com
[14] S. Chang, J. Hu, ‘Real-time implementation of G723.1 speech codec on a 16-bit DSP processor’, Department of electronic and control engineering, National Chiao Tung University, Hsinchu, Taiwan, R.O.C
[15] M. Soler, A. Andre, E. Closse, J. Laval, F. Balestro, D. Morche, P. Senn, ‘An embedded DSP platform for multi-standard ITU G728, G729 & G723.1 audio compression’, France Telecom, CNET
[16] M. Medina, G. Ezer, P. Konas, ‘Verification of configurable processor cores’, proceedings of the 2000 Design Automation Conference, Los Angeles, California
[17] ‘The Sparc Architecture Manual Version 8’, www.sparc.com
[18] A. Wang, E. Killian, D. Maydan, C. Rown, ‘Hardware/software instruction set configurability for system-on-chip processors’, proceedings of the 2001 Design Automation Conference, Las Vegas, Nevada
[19] W. Raab, N. Bruels, U. Hachmann, J. Harnisch, U. Ramacher, C. Sauer, A. Techmer, ‘A 100-GOPS programmable processor for vehicle vision systems’, IEEE Design and Test of Computers, pp. 8-16, Jan-Feb 2003
[20] Arithmetic module generator, http://www.fysel.ntnu.no/modgen/
[21] A. S. Spanias, ‘Speech Coding: A tutorial review’, Proceedings of the IEEE, vol. 82, no. 10, pp. 1541-1581, October 1994
[22] Y. Zhao, A. Wang, M. Moskewicz, C. Madigan, ‘Matching architecture to application via configurable processors. A case study with the Boolean satisfiability problem’, proceedings of the 2001 International Conference on Computer Design: VLSI in Computers and Processors
Vassilios A. Chouliaras was born in Athens, Greece in 1969. He received a B.Sc. in Physics and Laser Science from Heriot-Watt University, Edinburgh in 1993 and an M.Sc. in VLSI Systems Engineering from UMIST in 1995. He worked as an ASIC design engineer for Intracom SA and as a senior R&D Engineer/Microprocessor architect for ARC International.
Currently, he is a lecturer in the Department of Electronic and Electrical Engineering at the University of Loughborough, UK. His research interests include superscalar and vector CPU microarchitecture, high-performance embedded CPU implementations, performance modeling, custom instruction set design and self-timed design.
José Luis Núñez is a research fellow in the department of Electronic Engineering at Loughborough University where he has worked since 1997. His current interests include the areas of lossless data compression, reconfigurable vector architectures, FPGA-based design and high-speed data networks. He received his BS and MS degrees in Electronics Engineering from Universidad de La Coruna (La Coruna, Spain) and Universidad Politécnica de Cataluña (Barcelona, Spain) in 1993 and 1997 respectively. He received his PhD degree at Loughborough University (Loughborough, England) in 2001, working in the area of hardware architectures for high-speed data compression.