CHAPTER 4 SIMULATION RESULTS
Fig. 4-5. Sound with 20,282 digital samples after FIR processing: (a) desired results and (b) design results; (c) and (d) are the frequency-domain analyses, using a Hamming window, of (a) and (b), respectively.
For multi-tap filter implementation, the parallel architecture and random coefficients not only reduce computation but also save multiplier power. At the same time, the circular buffer makes efficient use of memory space. Thus, the proposed two-stage architecture can be used effectively in FIR filter hardware implementations for the audio reverberator system. In the future, the adaptive pseudo-random FIR coefficient generator can be implemented in hardware according to the feature parameters of non-zero filter taps, sampling rate, and time variance.
4.3 Performance Analysis of LASP24
Complexity is measured in million instructions per second (MIPS) and in random access memory (RAM) and read-only memory (ROM) requirements. MIPS figures are derived from the execution time and instruction counts, and the required memory sizes are obtained from the linker memory maps. As Table 4-8 shows, MELP complexity exceeds that of LPC and CELP in both processor and memory requirements. Additionally, the total cycle counts are listed for the MELP, CELP (TI DSP [84]), and reverberation algorithms.
LASP24 can now run the two practical applications in real time, and we compare their performance. For the MELP coder, the encoder program performs 1,338,280 cycles at 60 MHz. The frame size is 22.5 ms (180 samples) at a sampling frequency of 8,000 Hz. Hence, the encoder latency is about 22.3 ms (1,338,280 × 16.67 ns), which fits within one frame. For the decoder, the latency is about 9.1 ms. Because the reverberation algorithm uses many filters, its required execution time is larger: the program performs 1,574,430 cycles at 80 MHz. The frame size is 22.7 µs (stereo channels) at a sampling frequency of 44,100 Hz, and the latency is about 19.67 µs. LASP24 can operate at a maximum frequency of 100 MHz. By the above analysis, an operating frequency of 80 MHz satisfies all conditions.
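As a quick check of the MELP figures, the latency is simply the cycle count divided by the clock rate. The following Python sketch evaluates this for the encoder and decoder cycle counts quoted in this section and compares each result with the 22.5 ms frame period. The function and variable names are illustrative and not part of the thesis, and running the decoder at 60 MHz is an assumption (it reproduces the quoted 9.1 ms).

```python
def latency_ms(cycles, clock_hz):
    """Execution time in milliseconds for a cycle count at a given clock rate."""
    return cycles / clock_hz * 1e3


# One MELP frame: 180 samples at 8 kHz (22.5 ms).
FRAME_MS = 180 / 8000 * 1e3

# Cycle counts quoted in this section; the 60 MHz clock for the decoder
# is an assumption made for this sketch.
MELP_CYCLES = {"encoder": 1_338_280, "decoder": 546_449}

for name, cycles in MELP_CYCLES.items():
    t = latency_ms(cycles, 60e6)
    print(f"MELP {name}: {t:.2f} ms per frame, real time: {t <= FRAME_MS}")
```

Both values stay below the frame period, which is the real-time condition used in the analysis.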
Table 4-8. Complexity comparison between LASP24 and memory with optimization codes.
DSP Algorithm              MIPS   RAM (bytes)   ROM (bytes)   Total Cycles
MELP Decoder               40     96K           10K           546449
MELP Encoder               60     96K           26K           1338280
CELP Decoder (TI 320C3X)   30     14.8K         128K          364299
Reverberation              80     96K           30K           1574430
CHAPTER 5
THE INTEGRATED PLATFORM FOR MULTIMEDIA PROCESSING
5.1 Introduction
Today, the growing gap in VLSI between silicon gate capacity and engineering productivity has led to the advance of System-on-Chip (SoC) designs and to the need for new forms of design reuse and methodologies [50]. With the rapid progress of semiconductor technology, SoC design has recently become very popular. Reuse takes place at the chip level through blocks called virtual components (VCs) or intellectual property (IP) cores, which represent functions of specific domains such as DSP or multimedia modules. To connect the IPs on an SoC, a standardized bus is indispensable [55].
Several bus protocols enjoying a certain degree of popularity are currently used in SoC design. IBM’s CoreConnect [51] is supported by a vast set of tools that allow the automatic generation of many parts of the system. The Wishbone specification [52] offers a set of guidelines for a basic, simple bus structure, and the OpenCores organization [53] has selected it as the standard to follow for the development of its free IP library. Advanced RISC Machines Inc. (ARM) developed the very popular AMBA AHB/APB protocol [54], which has been used in many products. This protocol is also adopted in this thesis to develop the SoC platform and audio IPs.
A single-chip SoC platform [80] that combines a microprocessor, memory, and other functional modules such as GPIO (General-Purpose Input/Output), I2S (Inter-IC Sound), and communication (UART) into one IC has recently become popular. To verify these functions [73] of the proposed platform for audio processing, the FPGA rapid prototyping approach is used. For years, FPGAs were used primarily for prototyping and lower-volume applications, while custom ASICs were used for high-volume, cost-sensitive designs.
Today, state-of-the-art wafer fabrication finds that FPGAs are an excellent mechanism for testing new wafer technology because of their reprogrammable and high-volume nature.
Hence, more and more designers use platform FPGA technology [56] to develop and verify their SoC designs quickly. In this thesis, we offer a way to develop low-cost SoC products within the FPGA environment for fast design and verification. The SoC integration platform is demonstrated with digital audio applications such as reverberation and speech processing, which have been presented in Chapter 2.
5.2 SoC Platform
The proposed SoC platform for speech/audio processing is shown in Fig. 5-1. The primitive prototype is built on two platforms: the FPGA Integration Platform (FIP), using the Altera Cyclone Edition, and the DSP Verification Platform (DSPVP), using the Analog Devices DSP KIT [59] evaluation system for Blackfin embedded media processors [76].
FIP has a full implementation of the AMBA AHB and APB on-chip buses. A flexible configuration scheme makes it simple to add new IP cores. In addition, all provided peripheral units implement the AMBA AHB/APB interface, making it easy to add more of them or to reuse them in other AMBA-based components. In the SoC platform, the 8051 microcontroller is used as the main resource dispenser. Because of timing requirements, the 8051 microcontroller cannot connect directly to the AHB, so it must be enclosed in a wrapper. Likewise, a wrapper is required if a DSP processor is added to the AHB. In this case, the DSP wrapper is not shown because the DSP evaluation system is used. An on-chip memory, a dual-port synchronous static RAM (SSRAM), is embedded into the system. The SSRAM, which stores multi-channel audio streams, is defined as a shared buffer between the FPGA and the DSP. Four peripheral devices, namely GPIO (General-Purpose I/O), I2S (Inter-IC Sound), the Interrupt Controller, and the UART (Universal Asynchronous Receiver Transmitter), are attached to the APB. These modules are listed in Table 5-1 and explained in detail in the next section.
Table 5-1. Module design of the SoC platform.
Module design (bus technology): Description
Dual-port SSRAM controller (AHB): Dual-port synchronous static RAM; offers a general-purpose memory access interface.
8051 wrapper (AHB): Converts 8051 signals into AHB signals, acting as an interface and buffer device. In the platform, it is the main control center, i.e., the master.
UART (APB): A device, usually an integrated circuit chip, that performs the parallel-to-serial conversion of digital data to be transmitted and the serial-to-parallel conversion of digital data that has been received.
GPIO (APB): Each port can be individually configured through software as either an input or an output. GPIO provides additional control and monitoring when the microcontroller or chipset has insufficient I/O ports, or in systems where serial communication and control from a remote location is advantageous. In this platform, GPIO provides as few as 4 and up to 24 ports.
I2S (APB): Provides a digital sound processing interface, a serial link for connecting digital sound devices. For multi-channel sound, one or more I2S groups can be placed on the peripheral bus.
Interrupt controller (APB): Supports programmable edge- or level-triggered interrupts for handling interrupt routines. It provides an 8-bit or 16-bit input width for fast and general interrupts.
APB bridge (between AHB and APB): Handles signal translation between the high-speed bus and the low-speed bus. The bridge can serve devices on both buses at the same time.
AHB decoder (AHB): Provides the interconnection and the mechanism by which master and slave devices use the AHB, i.e., the bus arbiter.
DSP (LASP24): 24-bit floating-point digital signal processor described in Chapter 3.
DSPVP performs given 3-D audio algorithms such as reverberation in real time, but these audio algorithms are not shown in this thesis. The audio stream is fed into the DSP via the SSRAM, and the processed audio stream is exported to stereo speakers via the audio codec. During development, DSPVP serves as an auxiliary platform that supports the verification of the designed SoC system.
[Figure: block diagram with an 80C51 core, bridge, reverb block, and dual-port SRAM; 32-bit and 16-bit data paths; increment-mode and single transfers.]
Fig. 5-1. Multimedia SoC platform: (a) SoC architecture and (b) the proposed prototype system.
5.3 Intellectual Property Design
5.3.1 Microprocessor
The microprocessor is compatible with the MCS-51 family, originally designed by Intel in the 1980s. The processor has gained great popularity since its introduction and is estimated to be used in a large percentage of all embedded system products. Its features are an 8-bit CPU, on-chip memory with separate data and program (read-only) memory, two 16-bit timer/counters, and four 8-bit I/O ports including two interrupts. There are 64K bytes of off-chip program memory and up to 4K bytes of on-chip program memory. The remaining part of the program memory is external and can be reached with a specific signal, EA. Some features of the 8051 IP core, taken from OpenCores [53], are as follows.
8-bit CPU optimized for control applications
Extensive Boolean processing (single-bit logic) capabilities
64K program and data memory address space
32 bidirectional and individually addressable I/O lines
6-source/5-vector interrupt structure with two priority levels
Up to 4K bytes of on-chip program memory
Two 16-bit timer/counters
As already noted, the instruction set of the 8051 core is optimized for 8-bit control applications. This optimization shows in a variety of fast addressing modes for accessing the internal RAM, which facilitate byte operations on small data structures. The instruction set is also well suited to systems that require a lot of Boolean processing, because it has extensive support for one-bit variables as a separate data type, which makes direct bit manipulation much easier. There are six addressing modes: direct, indirect, register, register-specific, immediate, and indexed addressing.
The 8051 core contains four I/O ports, all of which are bidirectional. Each port has an SFR (Special Function Registers P0 through P3) that works like a latch, an output driver, and an input buffer. Both the output driver and the input buffer of Port 0, and the output driver of Port 2, are used for accessing external memory. Port 0 outputs the low byte of the external memory address (time-multiplexed with the byte being written or read), and Port 2 outputs the high byte of the external memory address (needed only when the address is 16 bits wide). If the address is 8 bits wide, the Port 2 pins are not needed for this purpose. The Port 3 pins are multifunctional, and their alternate functions are listed in Table 5-2. An alternate function is activated by writing a 1 to the corresponding bit latch in the port SFR.
Table 5-2. Microprocessor’s alternate functions.
P3 Port Pin   Alternate Function
PIN 2         INT0 (external interrupt)
PIN 3         INT1 (external interrupt)
PIN 4         T0 (timer/counter 0 external input)
PIN 5         T1 (timer/counter 1 external input)
PIN 6         WR (external data memory write strobe)
PIN 7         RD (external data memory read strobe)
The new value arrives at the latch during the last phase (Phase 2) of the final cycle of the instruction that changes the value in a port latch. Because the port latches are sampled by their output buffers only during Phase 1 of any clock period (during Phase 2 the output buffer holds the value it saw during the previous Phase 1), the new value in the port latch does not actually appear at the output pin until the next Phase 1, which is at the beginning of the following machine cycle.
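As a brief illustration of the external-memory addressing described in this subsection, the Python sketch below splits a 16-bit external address into the byte driven on Port 2 (high byte) and the byte driven on Port 0 (low byte). The function name is hypothetical and the sketch models only the address split, not the time multiplexing of Port 0 with data.

```python
def split_external_address(addr):
    """Return (port2_high_byte, port0_low_byte) for a 16-bit external address."""
    if not 0 <= addr <= 0xFFFF:
        raise ValueError("external address must fit in 16 bits")
    high = (addr >> 8) & 0xFF  # driven on Port 2
    low = addr & 0xFF          # driven on Port 0 (shared with the data byte)
    return high, low


p2, p0 = split_external_address(0x1234)
print(f"P2 = 0x{p2:02X}, P0 = 0x{p0:02X}")
```

For an address that fits in 8 bits the high byte is zero, which corresponds to the case where the Port 2 pins are not needed.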
5.3.2 Inter-IC Sound Interface
The I2S bus is used only to handle serial audio data. To minimize the number of pins required and to keep wiring simple, a 3-line serial bus is used, consisting of a line for two time-multiplexed data channels, a word select line (WS), and a clock line (SCK). Serial data (SD) is transmitted in two’s complement with the MSB first. A simple configuration and the basic interface timing are illustrated in Fig. 5-2. The MSB is transmitted first because the transmitter and receiver may have different word lengths. The WS line indicates the channel being transmitted: when WS=0, SD belongs to the left channel; otherwise, SD belongs to the right channel. The WS line changes one clock period before the MSB is transmitted. This allows the slave transmitter to derive synchronous timing of the serial data that will be set up for transmission. Serial data sent by the transmitter may be synchronized with either the trailing (high-to-low) or the leading (low-to-high) edge of the clock signal. However, the serial data must be latched into the receiver on the leading edge of the serial clock signal, so there are some restrictions when transmitting data that is synchronized with the leading edge.
[Timing diagram: SCLK and SD/WCLK waveforms marking valid and invalid data windows, cycle time, setup time, and hold time; labeled 44.1 kHz and 3.072 MHz.]
Fig. 5-2. The basic interface timing of I2S.
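The framing rules above (two's complement, MSB first, WS=0 for the left channel, and WS changing one clock period before the MSB) can be sketched in Python as a bit-level model. This is an illustrative sketch only, not the hardware design: the function names and the choice of a 16-bit word are assumptions, and clock edges and setup/hold timing are not modeled.

```python
def to_bits(value, width):
    """Two's-complement bits of value, MSB first."""
    value &= (1 << width) - 1
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]


def from_bits(bits):
    """Inverse of to_bits: interpret MSB-first bits as a signed integer."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value - (1 << len(bits)) if bits[0] else value


def i2s_encode(frames, width=16):
    """Return one (ws, sd) pair per SCK period for (left, right) samples."""
    ws, bits = [], []
    for left, right in frames:
        ws += [0] * width + [1] * width          # WS=0 left, WS=1 right
        bits += to_bits(left, width) + to_bits(right, width)
    # SD lags WS by one clock, so WS changes one period before each MSB.
    return list(zip(ws + [0], [0] + bits))


def i2s_decode(signal, width=16):
    """Recover (left, right) samples from the (ws, sd) sequence."""
    left, right, word = [], [], []
    for t in range(1, len(signal)):
        word.append(signal[t][1])
        if len(word) == width:
            channel = signal[t - 1][0]           # WS one clock earlier
            (right if channel else left).append(from_bits(word))
            word = []
    return list(zip(left, right))


samples = [(1, -2), (-32768, 32767)]
assert i2s_decode(i2s_encode(samples)) == samples
```

In the encoded stream, the LSB of each word is still on SD during the first clock period after WS has switched, which is the one-clock offset described in the text.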
The hardware configurations of the I2S transmitter and receiver are shown in Fig. 5-3. At each WS-level change, a pulse WSP is derived for synchronously parallel-loading the shift register. For the transmitter, the output of one of the data latches is then enabled, depending on the WS signal. Since the serial data input is zero, all the bits after the LSB will also be zero. For the receiver, following the first WS-level change, WSP resets the counter on the falling edge of SCK. As the counter increases by one on every clock pulse, subsequent data bits are latched into the 16-bit shift register.
Fig. 5-3. Block diagram of the audio I2S configuration (SCK = 64×fs and WS = fs = 48 kHz): (a) transmitter and (b) receiver.
In the I2S format, any device can act as the system master by providing the necessary clock signals. A slave will usually derive its internal clock signal from an external clock input. Taking into account the propagation delays between the master clock and the data and/or word-select signals, the total delay is therefore simply the sum of two terms: the delay between the external (master) clock and the slave’s internal clock, and the delay between the internal clock and the data and/or word-select signals. For data and word-select inputs, the external-to-internal clock delay is of no consequence because it only lengthens the effective set-up time (see Fig. 5-2). The major part of the time margin accommodates the difference between the propagation delay of the transmitter and the time required to set up the receiver. All timing requirements are specified relative to the clock period or to the minimum allowed clock period of a device, which means that higher data rates can be used in the future. Fig. 5-4 shows the operation of audio data transmission from the I2S interface to the internal high-speed system bus (indicated by arrowheads). The audio data is then stored in the shared memory.
Fig. 5-4. FPGA simulation of I2S transmission.
5.3.3 Serial Communication Design
The UART (Universal Asynchronous Receiver/Transmitter) is designed as an interface between an RS-232 line and an AMBA bus. It can be connected to the standard serial port of any device to exchange data with custom electronics. It was designed to be very small but efficient, since it has to fit in a small FPGA. It is not suited to interfacing a modem because there is no control handshaking (CTS/RTS). It integrates two separate clocks, one for the AMBA bus and the other for bitstream generation, which lets the user supply the desired frequency for the baud rate. In this design, however, the baud rate is fixed at 9,600 bps.
The core implements the AMBA SoC bus interface for communication with the platform. For compatibility reasons, it uses an 8-bit data bus, even parity, and 1 stop bit. The core requires one interrupt and 2 pins on the chip (serial in, RX, and serial out, TX).
The block diagram of the core is shown in Fig. 5-5. The line control register selects either the transmit or the receive operation. If the receive operation is active, serial data (RX) is fed into the receiver shift register. When the reception is finished, the receiver logic sends an interrupt signal to the microprocessor. Conversely, if the transmit operation is active, 8-bit data is fed to TX from the transmitter shift register.
Fig. 5-5. The block diagram of UART (Baud rate at 9,600 b/s).
The UART simulation is shown in Fig. 5-6. For the receiver (Fig. 5-6(a)), after SRX has received 8-bit data, the data is stored in the data_out_reg register. At this moment, an interrupt is triggered through INTn in order to inform the master device, which can then obtain the data from the peripheral bus. Note that when the bus is selected (apb_sel=1), the interrupt signal INTn has to be disabled. The received data is placed on apb_rdata. The transmitter (Fig. 5-6(b)) is simpler: STX first sends a start bit (=0) and then sends the 8-bit data in order from LSB to MSB, followed by the parity and stop bits.
Fig. 5-6. FPGA simulation of UART (a) receiver and (b) transmitter.
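The transmit frame described above (a start bit of 0, eight data bits from LSB to MSB, an even parity bit, and one stop bit) can be sketched in Python as follows. This is a behavioral sketch rather than the RTL of the core; the function name is hypothetical, and the stop bit level of 1 follows the usual UART convention.

```python
def uart_frame(byte):
    """Return the bit sequence on TX for one byte: start, data, parity, stop."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("data must fit in 8 bits")
    data = [(byte >> i) & 1 for i in range(8)]  # LSB first
    parity = sum(data) % 2  # even parity: data plus parity has an even number of ones
    return [0] + data + [parity, 1]             # start bit = 0, stop bit = 1


BAUD = 9600
frame = uart_frame(0x55)
bit_time_us = 1e6 / BAUD  # duration of one bit in microseconds
print(frame, f"{len(frame) * bit_time_us / 1e3:.3f} ms per frame")
```

Each byte therefore occupies 11 bit periods on the line at the configured baud rate.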
5.3.4 Wrapper and Interrupt Design
Since the 8051 I/O signals cannot directly meet the AHB timing constraints, an 8051 wrapper is necessary. The interface is prepared for 8-bit accesses. Read and write accesses to the AMBA AHB do not require wait-state cycles. Thus, all AMBA AHB single transfer modes are supported in the 8051 wrapper design. The simulation of single transfer mode is shown in Fig. 5-7.
Fig. 5-8 shows the complete state machine for the 8051 wrapper design. The structure of the state machine is divided into three main parts: I2S data processing, user-programmable inputs via GPIO registers, and UART communication. When an interrupt occurs, the 8051 can handle the corresponding procedure via the wrapper. Note that the wrapper spends most of its time in I2S data communication. In other words, the wrapper must access three-channel data at the start of each sampling period.
Fig. 5-7. FPGA simulation of data transfer for 8051 wrapper.
Fig. 5-8. Complete state machine for 8051 wrapper.
Because the 8051 accepts only two interrupts, an interrupt controller is used to handle more interrupts for the system and to decide their priority. GPIO has the highest priority, followed by UART and then I2S. The interrupt controller supports up to 16 interrupts: 3 interrupts from the internal APB devices, with the others reserved.
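The fixed priority order described above can be sketched in Python as a simple resolver that returns the highest-priority pending source. The names below are illustrative, and the sketch ignores masking, edge or level triggering, and the reserved interrupt lines.

```python
# Priority order stated in the text: GPIO first, then UART, then I2S.
PRIORITY = ("GPIO", "UART", "I2S")


def next_interrupt(pending):
    """Return the highest-priority pending source, or None if nothing is pending."""
    for source in PRIORITY:
        if source in pending:
            return source
    return None


def service_order(pending):
    """Return the order in which a set of simultaneous requests is served."""
    remaining = set(pending)
    order = []
    while (source := next_interrupt(remaining)) is not None:
        order.append(source)
        remaining.discard(source)
    return order


print(service_order({"I2S", "GPIO", "UART"}))
```

A hardware controller would typically implement the same decision with a priority encoder over the request lines.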
5.3.5 Specialized Hardware for System Verification
To realize the benefits of emulation, virtually all of the circuit and testbench for the design must run on the emulator. This means that the testbench should be synthesizable.
One approach would be to make the testbench synthesizable from the beginning, and to use the same testbench for both RTL verification and emulation. The bus functional models (BFMs) used in the SoC platform are a common method of creating testbenches. Typically they are written at the register-transfer level (RTL), in a testbench automation tool, or in C/C++, and use some form of command language to create sequences of transactions on the bus. They do not model any of the functionality of an agent on the bus; each read and write transaction is specified explicitly by the test developer. Because of their simplicity, these bus models place little demand on simulator performance; simulation speeds are mostly determined by the macro itself.
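To make the idea of a command-driven bus functional model concrete, the Python sketch below interprets a small command file in which every read and write transaction is listed explicitly. The command syntax (`<bus> <read|write> <hex address> [hex data]`), the function name, and the dictionary-backed slave models are all assumptions made for illustration; they are not the testbench used in this work.

```python
def run_commands(text, slaves):
    """Execute explicit bus transactions against dictionary-backed slave models.

    Returns a log of (bus, operation, address, data) tuples.
    """
    log = []
    for raw in text.splitlines():
        line = raw.split("#")[0].strip()        # drop comments and blank lines
        if not line:
            continue
        bus, op, *args = line.split()
        memory = slaves[bus]
        addr = int(args[0], 16)
        if op == "write":
            memory[addr] = int(args[1], 16)
            log.append((bus, "write", addr, memory[addr]))
        elif op == "read":
            log.append((bus, "read", addr, memory.get(addr, 0)))
        else:
            raise ValueError(f"unknown operation: {op}")
    return log


COMMANDS = """
# hypothetical command file
AHB write 0010 A5
APB write 0004 01
AHB read  0010
"""

for entry in run_commands(COMMANDS, {"AHB": {}, "APB": {}}):
    print(entry)
```

Because the model only replays transactions, it stays simple and fast, which matches the low simulator load noted above.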
Many testbenches require multiple BFMs, as in the SoC platform above. In this case, it is best to use a single command file to coordinate the actions of the various models, such as those for AHB and APB. The models must be written so that they can share a common command file. Many commercial BFMs offer this capability. If we take our canonical design, the following approach seems reasonable. In Fig. 5-9, the software for the processor is compiled and loaded into memory in the emulator. This allows the processor and peripherals to perform at full emulation speed. The stimulus for the data transformation