CHAPTER 2 MULTIMEDIA IN SOUND PROCESSING
2.3 SPEECH PROCESSING
property that it is symmetrical and all elements along a given diagonal are equal, i.e., it is a Toeplitz matrix [18]. Equation (2.10) can be solved by simple inversion of the p × p matrix; however, this is not usually done because computational errors caused by finite precision tend to accumulate. By exploiting the Toeplitz structure, very efficient recursive procedures have been devised. The Levinson-Durbin algorithm [33] is used to compute the prediction coefficients for LPC analysis from the auto-correlation sequence of the samples. It solves the linear equations through a recursive procedure that exploits the symmetry property.
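The recursion described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the function name `levinson_durbin`, the argument names, and the sign convention (predictor ŝ(n) = Σ a_k s(n−k)) are assumptions, since Eq. (2.10) is not reproduced in this section.

```python
def levinson_durbin(r, p):
    """Solve the p x p Toeplitz normal equations recursively.

    r : auto-correlation values r[0..p] (assumed input layout)
    p : prediction order
    Returns (a, err), where a[k-1] is the k-th prediction coefficient
    and err is the final prediction error energy.
    """
    a = [0.0] * (p + 1)          # a[0] is unused; a[1..p] hold coefficients
    err = r[0]                   # zeroth-order prediction error
    for i in range(1, p + 1):
        # Reflection coefficient for order i
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err
        # Update coefficients of order i from those of order i-1
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)     # error energy shrinks at every order
    return a[1:], err


if __name__ == "__main__":
    # Auto-correlation of a first-order process with coefficient 0.5
    coeffs, error = levinson_durbin([1.0, 0.5, 0.25], 2)
    print(coeffs, error)
```

Each order reuses the coefficients of the previous order, so no matrix inversion is needed and the work grows with p² rather than p³.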
2.3.2 Pitch Estimation
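Accurate estimation of the pitch period, or lag τ, is very important in speech coding. The direct distance measurement is the most popular criterion; it examines the similarity between two waveforms and can be expressed as

$$E(\tau, \beta) = \sum_{n=0}^{N-1} \left[ s(n) - \beta\, s(n+\tau) \right]^2 \tag{2.12}$$

where s(n) is the speech signal, N is the number of samples in the analysis window, and β is a scaling factor, or pitch gain, that controls the changes in signal level. Under the assumption that the signal is stationary, the energy terms do not depend on τ, and the error criterion of Eq. (2.12) can be written as

$$R(\tau) = \sum_{n=0}^{N-1} s(n)\, s(n+\tau) \tag{2.13}$$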
Speech is a non-stationary signal in the long term, so the direct similarity criterion may exhibit large errors, showing less similarity at the position where the shift equals the real pitch period. Equation (2.13) is the direct auto-correlation, which indicates more similarity at triple the pitch period when the amplitude is increasing. The normalized similarity criterion in Eq. (2.12) is derived to account for such a non-stationary process.
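Setting ∂E(τ, β)/∂β = 0 in Eq. (2.12), the optimum normalization coefficient (pitch gain) can be calculated using

$$\beta = \frac{\sum_{n=0}^{N-1} s(n)\, s(n+\tau)}{\sum_{n=0}^{N-1} s^2(n+\tau)} \tag{2.14}$$

By substituting the optimum gain back into the error function of Eq. (2.12), the pitch can be estimated by minimizing

$$E(\tau) = \sum_{n=0}^{N-1} s^2(n) - \frac{\left[ \sum_{n=0}^{N-1} s(n)\, s(n+\tau) \right]^2}{\sum_{n=0}^{N-1} s^2(n+\tau)} \tag{2.15}$$

This is equivalent to maximizing the square of the normalized auto-correlation function given by

$$R_n^2(\tau) = \frac{\left[ \sum_{n=0}^{N-1} s(n)\, s(n+\tau) \right]^2}{\sum_{n=0}^{N-1} s^2(n+\tau)} \tag{2.16}$$

The pitch period can be determined from Eq. (2.16). The normalized auto-correlation method performs much better than the direct (un-normalized) auto-correlation method.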
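The two criteria can be compared with a short Python sketch. Everything here is illustrative: the function names, the synthetic test signal, the lag range, and the convention that the shifted waveform is s(n+τ) are assumptions rather than details taken from the thesis.

```python
import math


def correlations(s, tau, n_samples):
    """Return (sum s(n)s(n+tau), sum s(n+tau)^2) for n = 0..n_samples-1."""
    cross = sum(s[n] * s[n + tau] for n in range(n_samples))
    energy = sum(s[n + tau] ** 2 for n in range(n_samples))
    return cross, energy


def pitch_direct(s, tau_min, tau_max, n_samples):
    """Lag that maximizes the direct (un-normalized) auto-correlation."""
    return max(range(tau_min, tau_max + 1),
               key=lambda tau: correlations(s, tau, n_samples)[0])


def pitch_normalized(s, tau_min, tau_max, n_samples):
    """Lag that maximizes the squared normalized auto-correlation."""
    def score(tau):
        cross, energy = correlations(s, tau, n_samples)
        if cross <= 0.0 or energy == 0.0:
            return 0.0               # ignore negatively correlated lags
        return cross * cross / energy
    return max(range(tau_min, tau_max + 1), key=score)


def synthetic_voiced(period, growth, length):
    """Hypothetical test signal: two harmonics with an exponential envelope."""
    return [growth ** n * (math.sin(2 * math.pi * n / period)
                           + 0.5 * math.sin(4 * math.pi * n / period))
            for n in range(length)]


if __name__ == "__main__":
    rising = synthetic_voiced(period=40, growth=1.005, length=600)
    print(pitch_normalized(rising, 20, 70, 400))
```

The signal must contain at least `n_samples + tau_max` samples so that every shifted index is valid. Because the normalized score divides by the energy of the shifted segment, a rising or falling envelope scales numerator and denominator together and does not bias the estimate.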
CHAPTER 3
DESIGN OF APPLICATION-DRIVEN DIGITAL SIGNAL PROCESSOR
3.1 Introduction
The proposed application-driven digital signal processor (DSP) [85]-[87], called LASP24 (Low-cost Application-driven Speech Processor, 24-bit data width), is built as a reduced instruction set computer (RISC) architecture with vector and matrix operations and power optimization. An effective verification flow supports the hardware design and reduces debugging time during hardware and software development. High performance is achieved by vector and matrix operations that general-purpose DSPs do not usually support. The parallel architecture of LASP24 executes vector and matrix operations quickly and without extra overhead. High flexibility in use, small silicon area, high data throughput, and fast portability to a wide range of technologies are our main targets in the core development.
The development of the digital signal processor shown in Fig. 3-1 is intended to meet system demands that are based on sophisticated arithmetic algorithms and that emphasize both hardware and software solutions. The verified tools make it possible to trade off software (for flexibility) against hardware (for performance and power optimization).
The development flow consists of two parts: hardware implementation and software development. The software part includes two development tools: the assembler and the emulator. The assembler translates assembly language into binary code (machine code). At the same time, it generates the initial ROM file for the processor emulator and the HDL simulator. The emulator emulates the computations of the processor hardware and verifies the precision of different floating-point formats, such as 32-bit or 24-bit. In the hardware part, the processor is implemented in a hardware description language (HDL), and its performance and power dissipation are improved for speech/audio algorithms. The processor can be regarded as an embedded DSP processor.
Fig. 3-1. Hardware/Software development flow for LASP24.
3.2 Micro-architecture
RISC-type [31] processors have traditionally enhanced performance by using a reduced instruction set to maximize throughput, and most of them access a rather large program memory at every clock cycle to fetch each instruction. An application-driven design can therefore reduce complexity and greatly enhance performance. For an embedded DSP, the architecture should support effective data communication between the memory system and the execution units, low-overhead loop control, and an accumulator-based instruction set architecture.
An efficient method of data representation and its hardware implementation are proposed to use a smaller program memory while maintaining other merits of RISC, such as simple decoding, fixed instruction size, and high performance. LASP24 is a 24-bit DSP processor with a floating-point unit and is easy to use. Its architecture has a 24-bit single-instruction/multiple-data (SIMD) instruction set with five addressing modes and a five-stage pipelined execution engine: Instruction Fetch (IF), Instruction Decode (ID), Execution (EX1, EX2), and Write Back (WB). It is important to perform parallel multiplication and arithmetic operations in a single cycle. The pipeline allows instruction execution to overlap, so the effective execution time for most instructions is one cycle. Some key features of LASP24 are listed below:
• 24-bit fixed-length instructions that support two and/or three operands.
• Five pipeline stages to improve throughput.
• Five addressing modes and one control mode, supporting up to 32 instructions.
• Two internal memory banks for vector addressing.
• 24 address stacks and 70 data stacks.
• Block repeat capability.
• Zero-overhead loops with a single-cycle branch.
• Hardware detection and resolution of branch conflicts.
• Power-saving considerations.
Floating-point operations provide fast, accurate, and precise computations. The 24-bit floating-point format is compatible with the IEEE-754 standard [32]. Specifically, LASP24 supports high-speed floating-point operations for speech/audio signal processing: addition, subtraction, multiplication, and simulated division.
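The precision cost of a 24-bit format can be explored with a small emulation sketch. The bit layout is not given in this section, so the code below assumes, purely for illustration, that a 24-bit value is an IEEE-754 single-precision number with its 8 least significant mantissa bits truncated (1 sign bit, 8 exponent bits, 15 mantissa bits). The function name `to_float24` is hypothetical.

```python
import struct


def to_float24(x):
    """Truncate x to an ASSUMED 24-bit layout: the upper 24 bits of its
    IEEE-754 single-precision encoding (1 sign, 8 exponent, 15 mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]   # float32 bit pattern
    truncated = bits & 0xFFFFFF00                         # drop the low 8 bits
    return struct.unpack(">f", struct.pack(">I", truncated))[0]


if __name__ == "__main__":
    for value in (1.0, 0.75, 0.1):
        approx = to_float24(value)
        print(value, approx, abs(approx - value) / value)
```

Under this assumed layout, values such as 1.0 and 0.75 are represented exactly, while a value such as 0.1 carries a relative error below 2^-15. An emulator can use this kind of comparison to check whether an algorithm tolerates the shorter mantissa.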
The block diagram of the proposed LASP24 is shown in Fig. 3-2. LASP24 is functionally partitioned into the following major blocks: a computation unit, which comprises the ALU, the multiplier, and the accumulators; a program control unit (PCU); an external bus control, which manages the LASP24 external buses; and a vector address generator, which computes the addresses used in vector operations. The PCU performs instruction fetch, decoding, exception handling, and wait-state support. It also generates the next program memory address and controls hardware loops.
Fig. 3-2. The block diagram of the proposed digital signal processor.
LASP24 includes four register groups. The eight general-purpose registers (Register File) can store and support operations on 24-bit floating-point numbers. The two 8-bit auxiliary registers can be accessed by the processor and modified by the auxiliary register arithmetic unit. The primary function of the auxiliary registers is the generation of 8-bit addresses. They can also be used as loop counters or as matrix pointer registers. The status registers contain information about the state of the ALU and the parallel multiplication. When the status registers are loaded, LASP24 sends out a busy signal and executes the selected function. The two 8-bit repeat counters specify the number of times a block is repeated when a block repeat is performed.
LASP24 uses a five-stage pipelined structure, and the pipelined operation is shown in Fig. 3-3. The Instruction Fetch (I) stage fetches the instruction words from instruction ROM and updates the program counter (PC). The Read and Decode (R) stage decodes the instruction word and performs address generation. Also, it controls the modification of the AR0 and AR1 registers in the matrix and vector addressing modes, and if required, reads the operands from memory or general registers. The Execution (E) stage is divided into two stages and performs the necessary operation, such as floating-point addition, subtraction and multiplication. The Write Back (W) stage, if required, writes results to the register file and memory.
Fig. 3-3. Pipelining operations.
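The overlap of the five stages can be illustrated with a small timing sketch. It models an ideal pipeline with no stalls; the stage labels E1 and E2 for the two execution stages and the function names are assumptions made for this example.

```python
STAGES = ["I", "R", "E1", "E2", "W"]   # fetch, read/decode, two execute stages, write back


def pipeline_schedule(n_instructions):
    """Map each clock cycle to the (instruction, stage) pairs active in it,
    assuming an ideal pipeline in which one instruction issues per cycle."""
    schedule = {}
    for i in range(n_instructions):
        for depth, stage in enumerate(STAGES):
            schedule.setdefault(i + depth, []).append((i, stage))
    return schedule


def total_cycles(n_instructions):
    """Cycles needed to finish all instructions, including pipeline fill."""
    return len(pipeline_schedule(n_instructions))


if __name__ == "__main__":
    for cycle, active in sorted(pipeline_schedule(5).items()):
        print(cycle, active)
    print(total_cycles(100))
```

After the four-cycle fill, one instruction completes per cycle, which is why the effective execution time of most instructions is one cycle.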
Pipelined control suffers from conflicts (hazards). The conflicts can be grouped into branch, memory, and register conflicts. Branch and register conflicts are described in [58], and the solutions proposed there are applied to our design.
Register conflicts arise when an instruction depends on the result of a previous instruction and the two instructions overlap in the pipeline. Forwarding solves the problem of register conflicts. Branch conflicts arise from the pipelining of branches and other instructions that change the PC. The condition of a branch conflict is shown in Fig. 3-4. Control should return to the jth instruction, but the pipeline register has already fetched the (i+2)th instruction. When the branch is taken, the (i+2)th instruction is not used and is replaced by a “NOP” instruction. This change solves the branch conflict, but it adds pipeline overhead. Hence, we modify the branch handling of Fig. 3-4 to avoid the NOP operation and to reduce the time overhead.
Fig. 3-4. Branch operations.
Branch conflicts do not occur in LASP24 because the PC is changed in the I stage and the R stage, not in the E stage. Before the next cycle, the target instruction of the branch is ready in the I stage. The program control is therefore free of branch conflicts, and a branch instruction has zero overhead. Memory conflicts arise from resource conflicts when the hardware cannot support all possible combinations of overlapping instructions. Fig. 3-5 shows how this type of conflict may happen. The ith instruction has not yet written R1 to RAM0[r] when the (i+1)th instruction reads data from RAM0[r], so a memory hazard occurs in the pipeline. Likewise, the (i+2)th instruction reads data from RAM0[r] and RAM1[r] while the ith instruction is writing R1 to RAM0[r], which is a serious conflict on the memory data buses. The solution is to give memory writes a higher priority than memory reads. A similar condition occurs between the two internal RAMs and the external bus. Alternatively, this type of conflict can be avoided in software.
Fig. 3-5. Memory accessing operations.
3.3 Instruction Set
The processor instruction set has been designed with two goals in mind: 1) to make maximum use of the processor’s underlying hardware, thus increasing efficiency, and 2) to minimize the amount of memory space required to store DSP programs, since DSP applications are often quite cost-sensitive and the cost of memory contributes substantially to the overall chip and/or system cost. To accomplish these two goals, it is necessary to reduce the number of bits required to encode instructions and to offer fewer registers and addressing modes than other types of processors. Thus, the architecture of LASP24 is defined with a fixed instruction length of 24 bits. A 24-bit instruction uses five bits each for addressing 8 general-purpose registers. The LASP24 instruction set includes five addressing modes and is classified into three groups: data transfer, arithmetic, and control instructions. About twenty-five instructions are defined in total (see Appendix A for details). Some representative instructions are listed as follows.
Instruction   Descriptions and Examples

Load and Store Instructions
MOV           Load, store and move data
              1. General data moves
                 EX: MOV RAM0[address], R0; R0=RAM0[address]
              2. Data moves for the matrix addressing mode
                 EX: MOV RAM1[AR1L+1, AR0L], R3; R3=RAM1[AR1L+1, AR0L],
                     where AR0L and AR1L are defined as AR0[3:0] and AR1[3:0].
LD            Load fixed values as follows: 0.0, 0.75, 1.0, and 2.0 - A
              (the floating-point value of 2.0 minus operand A)

Arithmetic Instructions
ADD           Add floating-point values
              EX: ADD R0,R1,R2; R2=R1+R0
SUB           Subtract floating-point values
              EX: SUB R0,R1,R2; R2=R1-R0
MPY           1. General multiplication
                 EX: MPY R0,R1,R2; R2=R1×R0
              2. Matrix multiplication
                 EX: MPY R3,RAM0[1110,AR0L-AR1L]; R3=RAM0[1110,AR0L-AR1L]×R3
VMPY          Vector multiplication
              EX: VMPY EXT[j],WIN[j],RAM0[j],RAM1[j]; {RAM0[j],RAM1[j]}=EXT[j]×WIN[j]
MAC           Multiplication-and-accumulation
              EX: MAC RAM0[j], RAM1[j], R3; R3=RAM0[j]×RAM1[j]+ACC, where ACC is an accumulator.
DIVEXP        Re-scale after division
              EX: DIVEXP R0,R3; R3=DIVEXP(R0)
NORM          Normalize floating-point value
              EX: NORM R0,R1; R1=Norm(R0)

Program Control Instructions
NOP           No operation
LDC           Load AR0 and AR1 value
              EX: LDC AR0,#14; load 14 to AR0
RPB           Begin repeat block
              EX: RPB RC0, 255; for (r=0; r<=254; r++)
RETB          Return repeat block of instruction
              EX: RETB AR0, label; if AR0=RC0, goto label
END           End of programs (halt)
3.3 Addressing Modes
Most speech and audio processing relies on auto-correlation, convolution, and FIR calculation. Hence, the addressing modes are designed to enhance the hardware computing capability for these algorithms. Five types of addressing modes allow access to data and instruction words in memory and registers: register, direct, indirect, immediate, and vector addressing modes. The detailed addressing formats are described in Appendix B.
The register addressing mode offers internal access to the general-purpose registers. In this mode, an instruction takes three register operands, as shown in the general operation “RA Operation RB ⇒ RC,” where RC is the destination operand and RA and RB are the source operands. The direct addressing mode uses an immediate value as a memory address index to access memory data. In this mode, the data address is formed by bits 0-7 of the instruction. Because the instruction length is short, the direct addressing mode supports only RAM block 0. The matrix addressing mode is designed for Durbin's algorithm [33] and is used to compute matrix multiplication, for example a 10×10 matrix multiplication. To access matrix data quickly, the auxiliary registers (AR0 and AR1) assist in addressing the coordinate (X, Y) in the matrix. In matrix addressing, a three-operand instruction can be used in the indirect addressing mode. The vector addressing mode is used for memory-to-memory computation. This mode provides 512-data-length vector operations and can also execute parallel instructions, which makes the auto-correlation function run faster than on general-purpose DSPs.
Additionally, a control mode is defined to control the data paths in the processor design. Programmers can use this mode to control the program flow and/or to set the repeat counters easily. Through the two auxiliary registers (AR0 and AR1), the processor can execute two-level nested loops. The function-finishing instruction and the holding status also belong to the control mode. The loop control is very useful for the auto-correlation function and for Durbin's algorithm [33] because both are two-level nested loops. The mode handles the program flow very efficiently, without the additional instructions that other general-purpose DSPs might need.
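The two-level nesting mentioned above is easy to see in a plain software version of the auto-correlation function. The sketch below is only a reference model; the function name and the use of Python lists are assumptions, and the comments indicate where a repeat counter or auxiliary register could take over the loop bookkeeping.

```python
def autocorrelation(samples, max_lag):
    """Return r[0..max_lag] with r[lag] = sum over n of s(n) * s(n + lag)."""
    result = []
    for lag in range(max_lag + 1):               # outer loop: one pass per lag
        acc = 0.0
        for n in range(len(samples) - lag):      # inner loop: multiply-accumulate
            acc += samples[n] * samples[n + lag]
        result.append(acc)
    return result


if __name__ == "__main__":
    print(autocorrelation([1.0, 2.0, 3.0], 2))
```

On a general-purpose processor, each loop level costs extra compare and branch instructions. With hardware loop control, those instructions disappear and only the multiply-accumulate work remains.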
3.4 Matrix Processing Technique
In particular, we design an auto-index method that uses the auxiliary registers to address memory data, as shown in Fig. 3-6. This method, called matrix addressing, fetches memory data within a single multiply instruction. When the instruction decoder gets the vector address, the address represents a coordinate in the matrix. Matrix multiplication operates on RAM0 and R3 (the third general-purpose register), and the result is stored in R3. An example equation of matrix multiplication is
$$R1 = \sum_{j} \mathrm{WIN}[j+1,\, r] \times \mathrm{RAM0}[\mathrm{AR0},\, r-j]$$

We can replace the above with the following LASP24 micro codes:
    RPB j, #r-1                    // set repeat block counter
L1: MOV WIN[j+1, r], R3;           // move a coefficient to R3
    MPY R3, RAM0[AR0, r-j], R3     // matrix multiplication
    ADD R1, R3, R1                 // R1=R1+R3
    RETB j, L1                     // if j≠0, return to L1
The index of a matrix coordinate is defined by the auxiliary registers (AR0 and AR1). The address index increases automatically so that the pointer indicates the next matrix address. Hence, this addressing method enables single-instruction matrix computation, which reduces both the size of the program memory and the number of program memory accesses.
Fig. 3-6. Illustration for computing a matrix address with the vector addressing mode.
In Fig. 3-6, the instruction decoder gets the matrix position from the four bits listed in Table 3-1 and transfers them to the address processing unit, which analyzes and calculates the matrix address (X, Y) in RAM0. Table 3-1 lists the coordinates of two kinds of addressing modes: the indirect addressing mode, such as RAM0[AR0], and the matrix addressing mode, such as RAM0[AR0L+1, AR0L+1]. The matrix coordinate is defined in AR0 and AR1. The index is incremented automatically so that the pointer indicates the next matrix address. The vectors {0000, 0001} and {1110, 1111} are two special coordinates that directly access the start and the end of a row in the matrix. Hence, the proposed matrix addressing method enables single-instruction matrix computation, which reduces the total number of program instructions.
Table 3-1. The matrix coordinate for the matrix addressing mode, where AR0L and AR1L represent the lower four bits of AR0 (AR0[3:0]) and AR1 (AR1[3:0]), respectively.

CODE   Addressing Mode            CODE   Addressing Mode
0000   RAM0[AR0]                  1000   RAM0[AR0L-AR1L, AR0L]
0001   RAM0[AR1]                  1001   RAM0[AR1L+1, AR0L+1]
0010   RAM0[AR0+AR1]              1010   Reserved
0011   RAM0[1111, AR0L]           1011   Reserved
0100   RAM0[AR1L+1, AR0L]         1100   RAM0[0000, AR0L]
0101   RAM0[1110, AR0L-AR1L]      1101   RAM0[1110, AR0L]
0110   RAM0[1110, AR0L+1]         1110   Reserved
0111   RAM0[AR0L+1, AR0L+1]       1111   RAM0[0001, AR0L]
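The matrix entries of Table 3-1 can be modelled as a lookup from the 4-bit code to a coordinate pair. The sketch below is hypothetical: the function name `matrix_coordinate` is invented, each coordinate is assumed to wrap to 4 bits, and the three indirect codes (0000 to 0010) are left out because they produce a linear address rather than a coordinate.

```python
def matrix_coordinate(code, ar0, ar1):
    """Return the (X, Y) coordinate selected by a 4-bit matrix addressing code.

    ar0, ar1 : 8-bit auxiliary register values; only the low 4 bits are used.
    Wrapping each coordinate to 4 bits is an ASSUMPTION of this sketch.
    """
    a0, a1 = ar0 & 0xF, ar1 & 0xF            # AR0L and AR1L
    table = {
        0b0011: (0b1111, a0),
        0b0100: (a1 + 1, a0),
        0b0101: (0b1110, a0 - a1),
        0b0110: (0b1110, a0 + 1),
        0b0111: (a0 + 1, a0 + 1),
        0b1000: (a0 - a1, a0),
        0b1001: (a1 + 1, a0 + 1),
        0b1100: (0b0000, a0),
        0b1101: (0b1110, a0),
        0b1111: (0b0001, a0),
    }
    if code not in table:
        raise ValueError("indirect or reserved code: no matrix coordinate")
    x, y = table[code]
    return x & 0xF, y & 0xF


if __name__ == "__main__":
    print(matrix_coordinate(0b0100, ar0=0x23, ar1=0x15))
```

Keeping the mapping in one table mirrors how a decoder would select among a fixed set of coordinate formulas with a single 4-bit field.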
3.5 Vector Processing Technique
The SIMD-style vector processing scheme provides an approach to accelerating the processing of data streams. This technique can provide a significant speedup for communications, multimedia, and other performance-driven applications by using data-level parallelism. In vector processors [34], [35], the design can provide high-level operations that work on vectors, that is, linear arrays of numbers. The vector processing unit supports both intra- and extra-memory operations. In an operation, elements are processed in parallel with the corresponding elements from multiple intra- or extra-memory sources, and the results are placed in the corresponding fields of the destination operand memories. An operation example is the vector multiplication (VMPY) instruction shown in Fig. 3-7, and the instruction format and addressing representation are shown in Table 3-2.
Fig. 3-7. An example of memory operations in LASP24: the elements of VA (source memory 1) and VB (source memory 2) are combined element by element by OP, and the results are written to VC (destination memory). OP indicates the vector multiplication. VA, VB, and VC represent different memory banks. They are defined in Table 3-2.
Table 3-2. The format of the vector addressing mode and the representation of vector addresses in LASP24, where OP indicates operation; VA, VB, and VC represent vector registers. The symbols FIL, EXT, WIN, RAM0, and RAM1 are memory symbols.

VC ⇐ VA[AR_A] OP VB[AR_B]

Bits    23~19    18~16   15~14   13~12   11~10   9~8    7~6    5~4   3~2   1~0
Field   OPCODE   011     NU      FIL     EXT     RAM0   RAM1   VC    VA    VB

        FIL       EXT       RAM0      RAM1      VC     VA     VB
VL      13~12     11~10     9~8       7~6       5~4    3~2    1~0
00      FIL       EXT       AR0       AR0       RAM0   RAM0   RAM0
01      FIL+AR0   EXT+AR0   AR1       AR1       RAM1   RAM1   RAM1
10      FIL+AR1   EXT+AR1   AR0+AR1   AR0+AR1   EXT    EXT    WIN
11      FIL-AR0   EXT-AR0   AR1-AR0   AR1-AR0   R3     -      FIL
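The field layout of Table 3-2 can be unpacked with shifts and masks. The sketch below assumes that bit 0 is the least significant bit of the 24-bit word; the function name, the dictionary keys, and the opcode value used in the example are hypothetical.

```python
# (name, lowest bit, width) for each field of the vector addressing format
FIELDS = [
    ("opcode", 19, 5), ("mode", 16, 3), ("nu", 14, 2),
    ("fil", 12, 2), ("ext", 10, 2), ("ram0", 8, 2), ("ram1", 6, 2),
    ("vc", 4, 2), ("va", 2, 2), ("vb", 0, 2),
]

# Memory selected by each 2-bit value, taken from the lower part of Table 3-2
VC_NAMES = ["RAM0", "RAM1", "EXT", "R3"]
VA_NAMES = ["RAM0", "RAM1", "EXT", None]     # value 11 is not defined for VA
VB_NAMES = ["RAM0", "RAM1", "WIN", "FIL"]


def decode_vector_instruction(word):
    """Split a 24-bit instruction word into its vector-format fields."""
    fields = {name: (word >> low) & ((1 << width) - 1)
              for name, low, width in FIELDS}
    fields["vc_name"] = VC_NAMES[fields["vc"]]
    fields["va_name"] = VA_NAMES[fields["va"]]
    fields["vb_name"] = VB_NAMES[fields["vb"]]
    return fields


if __name__ == "__main__":
    # Hypothetical opcode 00101, mode 011, VC=EXT, VA=RAM0, VB=RAM1
    word = (0b00101 << 19) | (0b011 << 16) | (0b10 << 4) | (0b00 << 2) | 0b01
    print(decode_vector_instruction(word))
```

The example word selects EXT as destination and RAM0 and RAM1 as sources, which corresponds to an element-wise operation of the form EXT ⇐ RAM0 OP RAM1.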
The vector multiplier has several important properties that solve most of the above problems as explained below.
1. The computation of each result is independent of the computation of previous results, allowing pipelined operation without generating any data hazards.
2. A single vector instruction specifies a great deal of computation work. It is equivalent to executing an entire loop. Thus, the number of instruction fetches is reduced, and the fetch bottleneck is considerably mitigated.
3. The vector instruction has a known memory access pattern. If the vector's elements are all adjacent, then fetching the vector from a set of heavily interleaved memory banks works very well. The high latency of initiating a main memory access, compared with accessing an instruction ROM, is amortized because a single access is initiated for the entire vector rather than for a single element. Thus, the cost of the latency to memory is seen only once for the entire vector, rather than once for each element of the vector.
4. Because an entire loop is replaced by a vector instruction whose behavior is predetermined, control hazards that would normally arise from the loop branch are nonexistent.
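The second property can be made concrete with a toy model that counts instruction fetches. The counts are assumptions chosen for illustration: the scalar loop is taken to fetch two instructions per element (a multiply and a loop return), while the vector form is taken to fetch one instruction for the whole vector. The function names are hypothetical.

```python
def scalar_loop_multiply(a, b):
    """Element-wise product computed by a loop; returns (result, fetches)."""
    result, fetches = [], 0
    for x, y in zip(a, b):
        result.append(x * y)     # scalar multiply
        fetches += 2             # ASSUMED: multiply + loop-return per element
    return result, fetches


def vector_multiply(a, b):
    """Element-wise product issued as one vector instruction."""
    return [x * y for x, y in zip(a, b)], 1    # ASSUMED: a single fetch


if __name__ == "__main__":
    ram0 = [float(i) for i in range(100)]
    ram1 = [0.5] * 100
    loop_result, loop_fetches = scalar_loop_multiply(ram0, ram1)
    vec_result, vec_fetches = vector_multiply(ram0, ram1)
    print(loop_result == vec_result, loop_fetches, vec_fetches)
```

Both functions produce the same products; only the number of instruction fetches differs, which is the bottleneck the vector instruction is meant to relieve.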
To illustrate the above features, we compare the performance with that of a general-purpose DSP in computing a 100-point vector multiplication. A vector multiplication instruction fetches data from RAM0 and RAM1 and feeds them into the ALU. The ALU executes the “MAC” operation and adds the result to the accumulating register. The final results are stored in the external memory. An example of vector processing (100 points) is shown as follows.
L1: MPY RAM0(r), RAM1(r), EXT(r);   // EXT(r)=RAM0(r)×RAM1(r)
    RETB r, L1                      // r=r+1; if r≠100, jump to L1
The total execution time is about 200 clock cycles. Hence, we use a single instruction