
CHAPTER 3 DESIGN OF APPLICATION-DRIVEN DIGITAL SIGNAL PROCESSOR

3.9 DEVELOPMENT SOFTWARE

In order to verify and debug the DSP programs, a tool called DEFY-I was developed for functional emulation. DEFY-I is an instruction-set-level hardware emulator for the processor core. In the emulator, each instruction is fetched from the program memory and placed in the instruction register for analysis and execution. Finally, the execution results are written back to the register file or the data memories. The flowchart of DEFY-I is shown in Fig. 3-13. The whole emulator is built around the functional simulation kernel and can connect to peripheral devices that perform the memory and display functions.

[Figure: the functional simulation kernel (instruction analysis and execution) with filter parameters as input.]

Fig. 3-13. The structure of DEFY-I for LASP24.

The high-level algorithm model written in the LASP24 assembly language is translated into machine language by the developed translator, which also generates a conversion table of mnemonics and operation codes. For software development, an effective functional simulation tool lets developers embed a software application into the tool to verify its function. The tool, named the hardware emulator, helps software developers simulate and debug applications under development. It is an instruction-set-level hardware emulator based on an application-specific speech processor. In the emulator, the basic operation of LASP24 is to fetch the instruction from the program memory, place it in the INST register, decode it, execute it, and finally write the results back to the register file (RF) or the data memories (RAM0, RAM1, EXT RAM). The operation flow of the hardware emulator is shown in Fig. 3-14 in C pseudo code, and its structure is shown in Fig. 3-13. The whole emulator is built around the functional simulation kernel and connects to peripheral devices such as memories.

[Fig. 3-14: operation flow of the hardware emulator in C pseudo code.]

Initial memory data (RAM, ROM)

Instruction fetch:
    INST = MEM[PC];
    if (branch taken) PC = jump address;
    else PC = PC + 1;

Instruction decode & execute:
    switch (OPCODE)
        default: undefined OPCODE.

Write back:
    if (REG mode)
    write back (MEM or RF)
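The fetch, decode, execute, and write-back flow described above can be sketched as a small interpreter loop. This is only an illustrative Python sketch, not the DEFY-I implementation: the opcodes (`OP_LOADI`, `OP_ADD`, `OP_STORE`, `OP_JMP`, `OP_HALT`), the tuple instruction format, and the `Emulator` class are hypothetical and do not reflect the real LASP24 encoding.

```python
from dataclasses import dataclass, field

# Hypothetical opcodes for illustration only; the real LASP24 encoding differs.
OP_HALT, OP_LOADI, OP_ADD, OP_STORE, OP_JMP = range(5)


@dataclass
class Emulator:
    program: list                                       # program memory: (opcode, a, b) tuples
    rf: list = field(default_factory=lambda: [0] * 8)   # register file R0 to R7
    ram: dict = field(default_factory=dict)             # data memory (sparse)
    pc: int = 0
    halted: bool = False

    def step(self):
        """Fetch, decode, execute, and write back one instruction."""
        opcode, a, b = self.program[self.pc]    # INST = MEM[PC]
        next_pc = self.pc + 1                   # default: PC = PC + 1
        if opcode == OP_LOADI:                  # write back to the register file
            self.rf[a] = b
        elif opcode == OP_ADD:
            self.rf[a] = self.rf[a] + self.rf[b]
        elif opcode == OP_STORE:                # write back to data memory
            self.ram[b] = self.rf[a]
        elif opcode == OP_JMP:                  # branch taken: PC = jump address
            next_pc = a
        elif opcode == OP_HALT:
            self.halted = True
            next_pc = self.pc
        else:                                   # default: undefined OPCODE
            raise ValueError(f"undefined OPCODE {opcode}")
        self.pc = next_pc

    def run(self, max_steps=10_000):
        """Execute until a halt instruction or the step limit is reached."""
        steps = 0
        while not self.halted and steps < max_steps:
            self.step()
            steps += 1
        return steps


demo = Emulator(program=[
    (OP_LOADI, 0, 5),      # R0 = 5
    (OP_LOADI, 1, 7),      # R1 = 7
    (OP_ADD, 0, 1),        # R0 = R0 + R1
    (OP_STORE, 0, 100),    # RAM[100] = R0
    (OP_HALT, 0, 0),
])
demo_steps = demo.run()
```

The `else` branch plays the role of the `default: undefined OPCODE` case of the flow, so an unknown instruction stops the emulation instead of being silently skipped.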

For the hardware emulator to improve the software development flow effectively, we identify the following functions and requirements:

● Step execution: The emulator can execute instructions one at a time so that the programmer can trace the execution result per instruction or clock cycle.

● Free run: When a program prototype is finished, it can be simulated in free-run mode to obtain the expected result.

● Set breakpoint: Users can set a breakpoint value based on the program counter. The program runs until the program counter equals the breakpoint value.

● Displays: The screen of the emulator is shown in Fig. 3-15. It displays the program counter (PC) in region D, the general-purpose registers (R0 to R7) in region A, the source program in region C, the auxiliary registers (R, J, M, N, R_EXT, R_FIL) and status registers (TC, NTC, Z, NZ) in region E, and the contents of the memory banks (RAM0, RAM1, ROM, EXT RAM) in region B, where RAM0 and RAM1 indicate the internal memory, ROM indicates the filter and window ROMs, and EXT RAM indicates the external (or on-chip) memory.

Fig. 3-15. The hardware emulator.

● Debug information: When the emulator loads the object code, the related debug information is read as well. The emulator can then show where the executing instruction is located in the source code, which makes debugging easier for programmers.

● Emulator initialization: When the emulator starts, it searches the current working directory for the related initial files. If these files, which include the filter parameters, window coefficients, and initial values of the external memory, exist, the emulator loads them automatically and completes the initialization.
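The step, free-run, and breakpoint functions listed above can be sketched as a thin control layer around an emulator core. The sketch below is illustrative only: `DebugController` and the toy `CountingCore` are hypothetical names, and the assumed core interface (a `pc` attribute, a `halted` flag, and a `step()` method) is not taken from DEFY-I.

```python
class DebugController:
    """Step, free-run, and breakpoint control for any core exposing
    `pc`, `halted`, and `step()` (an assumed, hypothetical interface)."""

    def __init__(self, core):
        self.core = core
        self.breakpoints = set()

    def set_breakpoint(self, pc_value):
        """Register a program counter value at which free run stops."""
        self.breakpoints.add(pc_value)

    def step(self):
        """Execute exactly one instruction and return the new PC."""
        if not self.core.halted:
            self.core.step()
        return self.core.pc

    def free_run(self, max_steps=100_000):
        """Run until the core halts or the PC reaches a breakpoint."""
        executed = 0
        while not self.core.halted and executed < max_steps:
            self.core.step()
            executed += 1
            if self.core.pc in self.breakpoints:
                break
        return executed


class CountingCore:
    """Toy core: each step adds the PC to an accumulator and halts at `end`."""

    def __init__(self, end):
        self.pc = 0
        self.acc = 0
        self.end = end
        self.halted = False

    def step(self):
        self.acc += self.pc
        self.pc += 1
        if self.pc >= self.end:
            self.halted = True


dbg = DebugController(CountingCore(end=10))
dbg.set_breakpoint(4)
ran_to_break = dbg.free_run()   # stops when PC == 4
pc_after_step = dbg.step()      # single step: PC becomes 5
ran_to_end = dbg.free_run()     # no further breakpoint: runs until halt
```

Checking the breakpoint after each executed instruction means a free run started from a breakpoint address does not stop immediately at the same address.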

When the design is completed, we check it against the specifications for completeness and correctness. A co-verification method is created for this purpose, and its script file is described as follows:

load (analyzed sources);
load (target library);
load (debugging information);
while ( (read (instruction) != NULL) or (!finish) )
    execute the instruction from HW simulator;
    check (debugging information);
    match the results;
    if ( mismatch )
        printf (show messages and different values);
        errcount++;
    endif
end
if ( errcount != 0 )
    printf ("Here are %d errors between HW and SW", errcount);
else
    printf ("Matching is finished. No error found.");
endif

The automatic verification helps us check whether the specifications of the hardware/software co-design are correct. If there is any violation, the output information immediately shows the location of the error. Thus the debugging time can be reduced, and the functional design can quickly meet our requirements.
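The comparison step of the co-verification flow can be sketched in a few lines of Python. The trace format, a list of `(pc, value)` pairs per executed instruction, and the function names `co_verify` and `report` are assumptions made for illustration. The actual script works on the HW simulator output and the debugging information.

```python
from itertools import zip_longest


def co_verify(hw_trace, sw_trace):
    """Compare per-instruction results of the HW simulation and the SW emulator.

    Each trace is a list of (pc, value) pairs (assumed format). A trace that
    ends early is reported as a mismatch against None.
    """
    mismatches = []
    for index, (hw, sw) in enumerate(zip_longest(hw_trace, sw_trace)):
        if hw != sw:
            mismatches.append((index, hw, sw))
    return mismatches


def report(mismatches):
    """Build the final message in the same style as the script above."""
    if mismatches:
        return f"Here are {len(mismatches)} errors between HW and SW"
    return "Matching is finished. No error found."


hw = [(0, 5), (1, 7), (2, 12)]
sw = [(0, 5), (1, 7), (2, 13)]
errors = co_verify(hw, sw)
message = report(errors)
```

Returning the index together with both values mirrors the "show messages and different values" step, so the first diverging instruction can be located directly.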

CHAPTER 4 SIMULATION RESULTS

Two algorithms, speech coding and audio enhancement processing by reverberation, are performed on the proposed digital signal processor, LASP24. They are implemented in the LASP24 assembly language and can be performed in real time. Finally, the performance results are compared with those of the TI TMS320C3x.

4.1 Speech Processing

4.1.1 LPC and pitch estimation

Fig. 4-1 shows the microprogramming flow for the three kernel functions (LPC, PE, and test mode) analyzed in Chapter 2. A C program was used to verify the speech processing algorithms and to test the floating-point precision. Table 4-1 shows the experimental 10th-order LPC coefficients at two floating-point precisions (24-bit and 32-bit). The maximal error occurred at the frequency 18.52 Hz, and the difference between the two precisions in Table 4-1 is largest at LPC order 4. When the precision or the number of divider iterations was not high enough, the reconstructed speech signals sounded unnatural. After listening to the synthesized speech, we judged the 24-bit floating-point precision to be good enough.

Fig. 4-1. The microprogramming flow in the program ROM of LASP24.

Table 4-1. Simulation results of LPC calculations at different floating-point precisions.

LPC order   32-bit floating point   24-bit floating point
1           -1.948923               -1.954345
2            0.923492                0.913543
3           -0.052776                0.017284
4            0.841343                0.730545
5           -1.204289               -1.122589
6           -0.476735               -0.426231
7           -0.280020               -0.223022
8           -0.945771               -0.904251
9           -0.968852               -0.966308
10           0.296600                0.302504
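As a quick check of the statement that the two precisions differ most at LPC order 4, the absolute difference per order can be computed from the values in Table 4-1. The variable names below are illustrative.

```python
# Coefficients copied from Table 4-1, orders 1 to 10.
lpc_32bit = [-1.948923, 0.923492, -0.052776, 0.841343, -1.204289,
             -0.476735, -0.280020, -0.945771, -0.968852, 0.296600]
lpc_24bit = [-1.954345, 0.913543, 0.017284, 0.730545, -1.122589,
             -0.426231, -0.223022, -0.904251, -0.966308, 0.302504]

# Absolute difference between the two precisions for each order.
abs_error = [abs(a - b) for a, b in zip(lpc_32bit, lpc_24bit)]

# LPC order (1-based) with the largest difference.
worst_order = max(range(len(abs_error)), key=abs_error.__getitem__) + 1

for order, err in enumerate(abs_error, start=1):
    print(f"order {order:2d}: |difference| = {err:.6f}")
```

With these values the largest difference, about 0.1108, is indeed found at order 4.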

The RTL code was written in Verilog and simulated. Design Compiler was used to synthesize the RTL code into gate-level code. From the RTL simulation, we obtained the execution times of the realized speech processing algorithms in Table 4-2, where Pitch 1 (P1) performs τ=15 to 76 and Pitch 2 (P2) performs τ=77 to 152 in Eq. (2.13).

Table 4-2. Timing simulation results. The unit of execution time is the millisecond (ms), and the total execution time is the sum of the LPC and PE computation times.

Algorithm   Execution   Vector operation     Execution time (ms)
            (cycles)    cycles (rate, %)     25 MHz   33 MHz   40 MHz
LPC          3,298       2,348 (71)          0.13     0.10     0.08
P1          14,346      13,698 (95.5)        0.57     0.43     0.35
P2          17,424      16,680 (95.7)        0.70     0.52     0.44
Total       35,068      32,736 (93.3)        1.40     1.05     0.87

These simulations were executed at operating frequencies of 25 MHz, 33 MHz, and 40 MHz. Vector and matrix operations took about 93.3% of the cycles of the whole algorithm; that is, the chip ran in its optimal condition 93.3% of the time. The internal driving ability between cells of the chip was also checked in gate-level simulations.
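The execution times in Table 4-2 follow from the cycle counts and the clock frequency. A short sketch of that arithmetic is given below; the function name `exec_time_ms` is illustrative, and the results agree with the table up to rounding.

```python
# Cycle counts copied from Table 4-2.
cycles = {"LPC": 3298, "P1": 14346, "P2": 17424}
vector_cycles = {"LPC": 2348, "P1": 13698, "P2": 16680}


def exec_time_ms(n_cycles, clock_mhz):
    """Execution time in milliseconds for a cycle count at a given clock."""
    return n_cycles / (clock_mhz * 1e6) * 1e3


total_cycles = sum(cycles.values())

# Share of cycles spent in vector operations, in percent.
vector_rate = {name: 100.0 * vector_cycles[name] / cycles[name] for name in cycles}

for clock in (25, 33, 40):
    print(f"{clock} MHz: total = {exec_time_ms(total_cycles, clock):.2f} ms")
```

For example, 3,298 cycles at 25 MHz take about 0.13 ms, which confirms that the table is expressed in milliseconds.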

After the timing simulation, the post-layout simulation was performed. The final power dissipation and maximal operating frequency could be estimated at this stage. The performance of LASP24 in the typical (33 MHz), best (40 MHz), and worst (25 MHz) cases was also simulated. In the typical case, LASP24 provides a computation capability of 66.6 MFLOPS (million floating-point operations per second) and 33.3 MIPS (million instructions per second). In the best case, it reaches 80 MFLOPS and 40 MIPS with single-cycle execution. In the worst case, the computation power is 50 MFLOPS and 25 MIPS. At room temperature (25°C ∼ 27°C) and 5 V, the current requirement was 4 mA, about 20 mW, and the maximal frequency was 28.5 MHz, which was lower than the gate-level simulation result. In the worst case, 85°C and 4.5 V, the current requirement [...] case, LASP24 could still provide 50 MFLOPS and 25 MIPS of computation power, which was higher than that of the TMS320C30.
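The quoted MIPS, MFLOPS, and power figures are related by simple arithmetic. The sketch below assumes one instruction per clock cycle and two floating-point operations per instruction (for example a multiply and an accumulate). This factor of two is inferred from the ratio of the quoted numbers and is not stated explicitly in the text.

```python
def peak_rates(clock_mhz, flops_per_instruction=2):
    """Return (MIPS, MFLOPS), assuming one instruction per cycle.

    flops_per_instruction=2 is an assumption inferred from the quoted ratio.
    """
    mips = clock_mhz
    mflops = mips * flops_per_instruction
    return mips, mflops


def power_mw(voltage_v, current_ma):
    """Power in milliwatts from supply voltage (V) and current (mA)."""
    return voltage_v * current_ma


for label, clock in (("worst", 25), ("typical", 33.3), ("best", 40)):
    mips, mflops = peak_rates(clock)
    print(f"{label}: {mips} MIPS, {mflops} MFLOPS")

print(f"5 V x 4 mA = {power_mw(5, 4)} mW")
```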

We compared the performance of the LASP24 processor with that of the TMS320C3x series, which are floating-point general-purpose DSPs. Fig. 4-2 shows the floating-point operation ability of each processor and compares their vector operation abilities. In the best case, LASP24 at 40 MHz provided 80 MFLOPS, which was much better than the TMS320C31 at 50 MHz. For the vector operation mode, we set the vector processing ability of LASP24 at 25 MHz as index 100 and compared it with the other processors. In the figure, a higher value indicates higher performance. In the best case, LASP24 at 40 MHz was about 4.75 times higher than the TMS320C30 and about 3.2 times higher than the TMS320C31.

[Figure: bar chart of ability index and vector MFLOPS (scale 0 to 180) for LASP24 at 40, 33, and 25 MHz, TMS320C31 at 40 MHz, and TMS320C30 at 40 and 33 MHz.]

Fig. 4-2. Performance comparisons of LASP24 and TMS320C3x.

4.1.2 MELP Coding

The MELP coder is divided into an encoder module and a decoder module. The frame size is 22.5 ms (180 samples) with a sampling frequency of 8000 Hz. The MELP coder is based on the traditional Linear Prediction Coding (LPC) parametric model, but also includes five additional features: mixed excitation, aperiodic pulses, adaptive spectral enhancement, pulse dispersion, and Fourier magnitude modeling. The encoder uses 10th-order LPC coefficients, which are transformed into line spectral frequencies for quantization and transmission. For each voiced and unvoiced frame, the parameters computed are listed in Table 4-3.

Line spectral frequencies are computed from the prediction coefficients using Chebyshev polynomials. A fast numerical method is used for the implementation on the proposed processor. The final pitch is computed using an autocorrelation analysis on the low-pass-filtered residual signal:

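$$ r(\tau) = \frac{c_\tau(0,\tau)}{\sqrt{c_\tau(0,0)\,c_\tau(\tau,\tau)}}, \qquad (4.1) $$

$$ c_\tau(m,n) = \sum_{k=-\lfloor \tau/2 \rfloor - 80}^{-\lfloor \tau/2 \rfloor + 79} s_{k+m}\, s_{k+n}, \qquad (4.2) $$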

where τ is the lag. The computation of the autocorrelation sequence is centered on the last sample of the past frame. Band-pass voicing strengths are computed using an autocorrelation analysis around the pitch lag for each of the bands. The gain is computed twice per frame using an adaptive window size, which is a multiple of the pitch period. A residual signal is obtained by filtering the input speech with the set of de-quantized LPC coefficients. An FFT is performed on this residual signal, and a search selects 10 Fourier magnitudes.

Table 4-3. Bit allocation for the MELP coder.

Parameters           Per frame   Voiced (bits)   Unvoiced (bits)
LSFs                 10          25              25
Pitch                1           7               7
Band pass voicings   5           4               -
Aperiodic flag       1           1               -
Fourier magnitudes   10          8               -
Gain                 2           8               8
Error protection                 -               13
Sync bit                         1               1
Total                            54              54
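The totals of Table 4-3 and the frame duration can be cross-checked with a few lines of Python. In this sketch the "-" entries of the table are treated as 0 bits, which is an assumption about the table layout. The resulting rate of 54 bits per 22.5 ms frame is 2400 bit/s.

```python
# Bits per frame copied from Table 4-3; "-" entries are taken as 0 bits (assumption).
voiced = {"LSFs": 25, "Pitch": 7, "Band pass voicings": 4, "Aperiodic flag": 1,
          "Fourier magnitudes": 8, "Gain": 8, "Error protection": 0, "Sync bit": 1}
unvoiced = {"LSFs": 25, "Pitch": 7, "Band pass voicings": 0, "Aperiodic flag": 0,
            "Fourier magnitudes": 0, "Gain": 8, "Error protection": 13, "Sync bit": 1}

FRAME_SAMPLES = 180
SAMPLE_RATE_HZ = 8000

# Frame duration in milliseconds and the resulting bit rate in bit/s.
frame_ms = 1000.0 * FRAME_SAMPLES / SAMPLE_RATE_HZ
bit_rate = sum(voiced.values()) / (frame_ms / 1000.0)

print(f"voiced total   = {sum(voiced.values())} bits")
print(f"unvoiced total = {sum(unvoiced.values())} bits")
print(f"frame = {frame_ms} ms, bit rate = {bit_rate:.0f} bit/s")
```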

We provide an instruction set for matrix operations, which reduces the program memory size to 12 KB. For example, the autocorrelation operation of (4.2) is optimized and implemented with LASP24 instructions as follows:

// Variable definition
// input R1 = s[0]
// input R2 = x
// input R4 = n-m

// Set initial address for input signals
         FIX   R2, R7
         SHF   R7, +1
         FLOAT R7, R2
         ADD   R2, ROM[&80.0]
         SUB   R2, R1, R1
         ADD   R3, R1, R1
         FIX   R1, R7
         LDE   R_EXT, R7

// Data moving from external RAM to RAM0 (A)
         RPB   j, #160
L1:      MOV   EXT[R_EXT+j], RAM0[j]
         RETB  j, L1

         CMPR  R4, ROM[&0.0]
         BCND  NZ, P1

// Data moving from external RAM to RAM1 (B); if m = n, then A = B
         RPB   j, #160
L2:      MOV   RAM0[j], RAM1[j]
         RETB  j, L2
         BCND  Z, P2

// Data moving from external RAM to RAM1 (B); if m ≠ n, then A ≠ B
P1:      ADD   R4, R1, R1
         FIX   R1, R7
         LDE   R_EXT, R7
         RPB   j, #160
L3:      MOV   EXT[R_EXT+j], RAM1[j]
         RETB  j, L3

// Calculate Cx(m,n) and store in R3
P2:      MOV   FIL[&0.0], R3
         RPB   j, #160
COR_MAC: MAC   RAM0[j], RAM1[j], R3
         RETB  j, COR_MAC
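For readers who want to cross-check the assembly listing, the same computation can be sketched in Python. The sketch assumes the windowed cross term c(m,n), a sum of s[k+m]·s[k+n] over 160 samples whose start is offset by half the lag plus 80, and the normalized form c(0,τ)/sqrt(c(0,0)·c(τ,τ)) commonly used in MELP pitch analysis. The function names and the `origin` parameter (index of s[0] inside the Python list) are hypothetical. The two shifted windows correspond to buffers A (RAM0) and B (RAM1), and the inner sum corresponds to the `MAC` loop.

```python
import math


def c_tau(s, origin, tau, m, n, length=160):
    """Windowed cross term: sum of s[k+m] * s[k+n] over `length` samples.

    `origin` is the list index that corresponds to s[0]; the window starts at
    origin - tau // 2 - length // 2 (assumed windowing, see lead-in).
    """
    start = origin - tau // 2 - length // 2
    window_a = [s[start + j + m] for j in range(length)]   # buffer A (RAM0)
    window_b = [s[start + j + n] for j in range(length)]   # buffer B (RAM1)
    return sum(a * b for a, b in zip(window_a, window_b))  # MAC loop


def normalized_autocorr(s, origin, tau):
    """Normalized autocorrelation at lag tau; returns 0.0 for a silent window."""
    num = c_tau(s, origin, tau, 0, tau)
    den = math.sqrt(c_tau(s, origin, tau, 0, 0) * c_tau(s, origin, tau, tau, tau))
    return num / den if den > 0 else 0.0


# A sine with a period of 40 samples correlates fully with itself at lag 40.
signal = [math.sin(2 * math.pi * k / 40) for k in range(600)]
print(f"r(40) = {normalized_autocorr(signal, 300, 40):.3f}")
print(f"r(20) = {normalized_autocorr(signal, 300, 20):.3f}")
```

When m equals n, both windows hold the same samples, which is the case the listing handles by copying RAM0 to RAM1 instead of reading the external RAM again.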

4.1.3 Power Analysis

To achieve power saving, LASP24 was also designed with a gated-clock architecture.

The power dissipation of LASP24 is summarized in Table 4-4, which includes the average dynamic power dissipation and the power reduction. The power reduction compares the average power dissipation of the gated-clock design with that of the original implementation. It is expressed as a percentage by the following equation:

Power reduction = (1 − power ratio) × 100, (4.3)

where the power ratio is Pgatedclock/Poriginal, computed from the average dynamic power dissipation. Table 4-4 lists the power dissipation of three parts, namely the ALU, the system control unit, and the whole design, at different operating frequencies and in different processes. The results indicate that the ALU consumes more power than the other units. The reason is [...] power optimization, the average power reduction is about one-fourth at 33 MHz and 40 MHz, but the power can be reduced by 60% at 25 MHz. We find that the power dissipation is reduced to about 3/4 of the total power for the whole arithmetic unit shown in Table 4-4.

Table 4-4. Power dissipation analysis of LASP24 between different processes.

Before/After gated-clock design (0.6 um) (5 V supply voltage, unit: mW)

Before/After gated-clock design (0.35 um) (3.3 V supply voltage, unit: mW)

Before/After gated-clock design (0.18 um) (1.8 V supply voltage, unit: mW)

Frequency   ALU unit        Control unit   Average power   Power reduction
100 MHz     181.35/137.72   15.02/11.50    166.67/128.45   22.9%

Before/After gated-clock design (Cyclone FPGA) (3.3 V supply voltage, unit: mW)

Frequency   ALU unit        Control unit   Average power   Power reduction
25 MHz      58.67/26.11     2.24/1.07      40.92/17.04     58%
33 MHz      81.43/69.33     4.77/2.61      67.13/45.28     32.5%
40 MHz      100.72/75.20    3.15/2.53      83.72/56.06     33%
80 MHz      140.13/89.27    10.13/7.39     109.86/90.32    17.8%
100 MHz     172.16/112.94   13.76/11.53    138.34/98.74    28.6%
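The power reduction column can be reproduced from the average power values with Eq. (4.3). The sketch below uses the Cyclone FPGA rows of Table 4-4; the function and variable names are illustrative.

```python
def power_reduction(p_original_mw, p_gated_mw):
    """Eq. (4.3): (1 - P_gatedclock / P_original) * 100, in percent."""
    return (1.0 - p_gated_mw / p_original_mw) * 100.0


# Average power before/after gated-clock design (mW), keyed by frequency in MHz.
average_power = {
    25: (40.92, 17.04),
    33: (67.13, 45.28),
    40: (83.72, 56.06),
    80: (109.86, 90.32),
    100: (138.34, 98.74),
}

reductions = {freq: power_reduction(before, after)
              for freq, (before, after) in average_power.items()}

for freq, value in reductions.items():
    print(f"{freq:3d} MHz: power reduction = {value:.1f}%")
```

The computed values agree with the table column to within rounding.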

4.2 Reverberation Algorithm

4.2.1 DSP Programming

Digital reverberation algorithms try to mimic room reverberation by using primarily two types of infinite impulse response (IIR) filters, so that the output gradually decays. One such filter is the comb filter, which gets its name from the comb-like notches in its frequency response. The other primary filter is the allpass filter, which has the useful property that all frequencies are passed equally, reducing the coloration of the sound.

Much of the early work on digital reverberation was done by Schroeder, and one of his well-known reverberation designs uses four comb filters and two allpass filters. More advanced algorithms can be developed to model specific room sizes. With a chosen room geometry, source location, and listener location, ray tracing techniques can be used to derive a reverberation pattern. By modifying Schroeder's algorithm, a finite impulse response (FIR) filter is used to create the early reflections, and then IIR filters are used to create the diffuse reverberation. Low-pass filters may be used to model the air absorption. Reverberation [...]

The design and real-time prototyping of the digital reverberation algorithm are based on random FIR filters, as presented in [13], to construct artificial early reflections. Four parallel comb filters and four cascaded all-pass filters model the late reverberation and increase the echo density. Consider a modified comb filter in the frequency domain given by:

where M is the delay length, and (4.5) is a low pass filter. Combining (4.4) and (4.5), we can obtain (4.6):

Here four cascade all-pass filters are used to increase echo density and disperse the phase.

Each all-pass filter has its own delay length Di and coefficient ai. Hence the total transfer function will be

The algorithm is run on a single 80 MHz processor (about 80 MIPS), where each instruction cycle is 12.5 ns. The original and processed sounds are stored in the external RAM. For each filter, 2,500 memory locations are used as a spatial buffer. The parameters of the four all-pass filters and the four comb filters are listed in Table 4-5 and Table 4-6, respectively. Simulated waveforms are shown in Fig. 4-3.

Table 4-5. Allpass filter coefficients.

Filter      Di   ai
Allpass-1   22   0.45
Allpass-2   36   0.45
Allpass-3   23   0.45
Allpass-4   33   0.45

Table 4-6. Comb filter coefficients.

Parameter   Comb-1   Comb-2   Comb-3   Comb-4
a           0.25     0.27     0.28     0.29
g           0.7      0.680    0.674    0.654
m           37       40       41       43
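A minimal Python sketch of the late-reverberation structure, four parallel comb filters followed by four cascaded all-pass filters, is given below using the parameters of Tables 4-5 and 4-6. The difference equations are assumptions: a feedback comb with a one-pole low-pass filter in the loop (coefficient a, feedback gain g, delay m) and a standard Schroeder all-pass section. They may differ from the transfer functions of this chapter, and the class and function names are hypothetical.

```python
# Parameters copied from Tables 4-5 and 4-6; the filter forms below are assumed.
ALLPASS_PARAMS = [(22, 0.45), (36, 0.45), (23, 0.45), (33, 0.45)]   # (D_i, a_i)
COMB_PARAMS = [(37, 0.7, 0.25), (40, 0.680, 0.27),
               (41, 0.674, 0.28), (43, 0.654, 0.29)]                 # (m, g, a)


class LowpassComb:
    """Feedback comb with a one-pole low-pass filter in the loop (assumed form)."""

    def __init__(self, delay, g, a):
        self.buf = [0.0] * delay
        self.idx = 0
        self.g = g
        self.a = a
        self.lp = 0.0

    def process(self, x):
        out = self.buf[self.idx]                            # sample delayed by m
        self.lp = (1.0 - self.a) * out + self.a * self.lp   # low-pass in the loop
        self.buf[self.idx] = x + self.g * self.lp           # feedback path
        self.idx = (self.idx + 1) % len(self.buf)
        return out


class Allpass:
    """Schroeder all-pass: y[n] = -a*x[n] + x[n-D] + a*y[n-D]."""

    def __init__(self, delay, a):
        self.buf = [0.0] * delay
        self.idx = 0
        self.a = a

    def process(self, x):
        delayed = self.buf[self.idx]
        v = x + self.a * delayed
        self.buf[self.idx] = v
        self.idx = (self.idx + 1) % len(self.buf)
        return delayed - self.a * v


def reverberate(signal):
    """Average four parallel combs, then pass through four cascaded all-passes."""
    combs = [LowpassComb(m, g, a) for (m, g, a) in COMB_PARAMS]
    allpasses = [Allpass(d, a) for (d, a) in ALLPASS_PARAMS]
    out = []
    for x in signal:
        y = sum(c.process(x) for c in combs) / len(combs)
        for ap in allpasses:
            y = ap.process(y)
        out.append(y)
    return out


impulse_response = reverberate([1.0] + [0.0] * 3999)
```

Each delay line is a circular buffer, so one input sample costs one read and one write per filter regardless of the delay length.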

Fig. 4-3. The original and resulting waveforms after the reverberation algorithm: (a) is a simulated impulse response with early reflection in FIR; (b) is FIR coefficients using a pseudo random method; (c) and (e) are original audio music and female speech with 44.1 kHz sampling rate and 16-bit data format; (d) and (f) are the signals after processing (c) and (e).

4.2.2 Implementation of Application-Specific Reverberator

The multi-tap FIR filter, constructed as a two-stage pipeline architecture for audio reverberation applications, is designed and simulated in HDL and C. It consists of pipeline registers, two circular buffers, 16-bit carry look-ahead adders, shifters, and a fast state machine controller. Because the pseudo-random coefficients based on (2.4) contain many zero values, the execution time and computational cost of the FIR filter are reduced. Fig. 4-4 shows the desired result and the HDL result of the FIR filter over 1,000 FIR orders. The two results are quite similar, but differ by about 2% around the 800th sample because of truncation errors.

[Figure: desired and HDL outputs (amplitude -200 to 200) over samples 1 to 1001.]

Fig. 4-4. Fully FSM control flows for two-stage architecture.

A given piece of music is fed as the input source through the I²S interface into the spatial circular buffer. After the FIR processing, the results are shown in Fig. 4-5. The circular buffer is set to 2,500 blocks. The test processes a single channel of 20,282 input samples, which is about 0.46 s of audio at a 44,100 Hz sampling rate and 16-bit data width. The desired result shown in Fig. 4-5(a) is similar to the HDL simulation result shown in Fig. 4-5(b).

Table 4-7 compares different FIR schemes for the implementation of early reflection. The numbers of adders, multipliers, and shifters and the delay latency are estimated and compared. The FIR styles include Direct Form (DF), Distributed Arithmetic (DA) [38], Canonic Sign Digit (CSD) [39], Digital Signal Processor (DSP) [40], and our proposed method. The delay latency is defined as the delay until the first output data. As can be seen in Fig. 4-5, the proposed method can greatly save multiplication power. Most MAC instructions in a DSP need two or more clock cycles to complete. Although DA and CSD do not need any multiplier, their delay latency is more than one stage due to bit and table operations. For the proposed two-stage FIR design, the number of adders and shifters is reduced to half the order for each stage, which makes it suitable for audio reverberation.

Table 4-7. Comparison of different FIR schemes for early reflection implementation.

Items           DF (TDF)   DA [38]   CSD [39]   DSP [40]   Proposed
Adder           Order      Order/2   2*Order    Order      Order/2
Multiplier      Order      None      None       Order      None
Shift           None       Order/2   Order/6    None       Order/2
Delay latency   None       16        32         None       1

Fig. 4-5. Sound with 20,282 digital samples after FIR processing: (a) desired results and (b) design results. (c) and (d) are the results of frequency domain analysis with a Hamming window for (a) and (b).

For the multi-tap filter implementation, the parallel architecture and random coefficients not only reduce computation but also save multiplication power. At the same time, the circular buffer can be used effectively as the spatial buffer. Thus, the proposed two-stage architecture can be used effectively in FIR filter hardware for the audio reverberator system. In the future, an adaptive pseudo-random FIR coefficient generator can be implemented in hardware according to the feature parameters of non-zero filter taps, sampling rate, and time variance.
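The saving from zero-valued coefficients and the use of a circular buffer can be illustrated with a small software model. This is a sketch under stated assumptions, not the two-stage HDL design: the class `SparseFIR`, the reference function `direct_fir`, and the coefficient generator (50 non-zero taps out of 1,000, each a signed power of two so that a shifter could replace a multiplier) are hypothetical.

```python
import random


class SparseFIR:
    """FIR filter that stores only non-zero taps and reads a circular buffer."""

    def __init__(self, coeffs, buffer_size=2500):
        if len(coeffs) > buffer_size:
            raise ValueError("filter order exceeds the circular buffer size")
        # Keep (delay, coefficient) pairs for non-zero taps only.
        self.taps = [(i, c) for i, c in enumerate(coeffs) if c != 0.0]
        self.buf = [0.0] * buffer_size
        self.head = 0                                # index of the newest sample

    def process(self, x):
        self.head = (self.head + 1) % len(self.buf)
        self.buf[self.head] = x
        acc = 0.0
        for delay, c in self.taps:                   # zero taps are never visited
            acc += c * self.buf[(self.head - delay) % len(self.buf)]
        return acc


def direct_fir(coeffs, signal):
    """Reference direct-form convolution that visits every tap."""
    return [sum(c * signal[n - k] for k, c in enumerate(coeffs) if n - k >= 0)
            for n in range(len(signal))]


# Hypothetical pseudo-random coefficients: 50 of 1,000 taps are non-zero,
# each a signed power of two (implementable with a shift).
rng = random.Random(0)
coeffs = [0.0] * 1000
for i in rng.sample(range(1000), 50):
    coeffs[i] = rng.choice([-1, 1]) * 2.0 ** -rng.randint(1, 6)

signal = [rng.uniform(-1.0, 1.0) for _ in range(500)]
fir = SparseFIR(coeffs)
output = [fir.process(x) for x in signal]
```

With 50 non-zero taps, each output sample needs 50 multiply-accumulate steps instead of 1,000, while the result matches the direct-form reference.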

4.3 Performance Analysis of LASP24

Complexity is measured in million instructions per second (MIPS) and in random access memory (RAM) and read only memory (ROM) sizes. MIPS values are measured using the execution time and instruction counts. The required memory sizes are obtained from the linker memory maps. As Table 4-8 shows, the MELP complexity exceeds that of LPC and CELP in both processor and memory requirements. Additionally, the total number of execution cycles is listed for the MELP, CELP (TI DSP [84]), and reverberation algorithms.

LASP24 can now perform the two practical applications in real time, and we analyze their performance. For the MELP coder, the encoder program takes 1,338,280 cycles at 60 MHz. The frame size is 22.5 ms (180 samples) with a sampling frequency of 8000 Hz. Hence, the latency is about 22.3 ms (1,338,280×16.67 ns) for the encoder. For the decoder, the latency is about 9.1 ms. Because many filters are used in the reverberation algorithm, the required execution time is larger. The program takes 1,574,430 cycles at 80 MHz. The frame size is 22.7 us (stereo channels) with a sampling frequency of 44,100 Hz. The latency is about 19.67 us. LASP24 can operate at a maximum frequency of 100 MHz. From the above analysis, it is able to satisfy all conditions with an operating frequency of 80 MHz.
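The MELP latency figures can be checked against the frame duration with the cycle counts of Table 4-8. The sketch assumes that both the encoder and the decoder run from a 60 MHz clock, which is the value that reproduces the quoted 9.1 ms decoder latency. The function name is illustrative.

```python
def latency_ms(total_cycles, clock_mhz):
    """Processing time in milliseconds for a cycle count at a given clock."""
    return total_cycles / (clock_mhz * 1e6) * 1e3


FRAME_MS = 1000.0 * 180 / 8000               # 180 samples at 8000 Hz

# Cycle counts from Table 4-8; the 60 MHz clock for both is an assumption.
encoder_ms = latency_ms(1_338_280, 60)
decoder_ms = latency_ms(546_449, 60)

print(f"frame   = {FRAME_MS:.1f} ms")
print(f"encoder = {encoder_ms:.2f} ms")
print(f"decoder = {decoder_ms:.2f} ms")
```

The encoder needs about 22.3 ms, which fits within the 22.5 ms frame, so it can keep up with the incoming speech at 60 MHz.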

Table 4-8. Complexity comparison between LASP24 and memory with optimization codes.

DSP algorithm               MIPS   RAM (byte)   ROM (byte)   Total cycles
MELP Decoder                40     96K          10K          546449
MELP Encoder                60     96K          26K          1338280
CELP Decoder (TI 320C3X)    30     14.8K        128K         364299
Reverberation               80     96K          30K          1574430

CHAPTER 5 THE INTEGRATED PLATFORM FOR MULTIMEDIA PROCESSING

5.1 Introduction

Today, the growing gap in VLSI between silicon gate capacity and engineering productivity has led to the advance of System-on-Chip (SoC) designs and the need for new forms of design reuse and methodologies [50]. With the rapid progress of semiconductor technology, SoC has recently become very popular. Reuse is done at the chip level through Virtual Components (VCs) or intellectual property (IP) blocks, which represent functions of specification domains such as DSP or multimedia modules. In order to connect the IP blocks on an SoC, a standardized bus is indispensable [55].

Several bus protocols enjoying a certain degree of popularity are currently used in
