Thesis Organization - A High-Performance MP3 Decoding System Using a 32-bit Low-Power Embedded Processor

Chapter 1 Introduction

1.3 Thesis Organization

Chapter 1 introduces the motivation: the number of digital consumer electronic products is increasing dramatically, new products are released rapidly, and they are all battery-powered. This raises two challenges that are among the most important in a competitive market: how to reduce the power consumption of these portable devices, and how to design a product with a short time-to-market cycle. This thesis proposes a low-power embedded processor running MP3 decoder software.

Chapter 2 describes previous work related to the ARM9E-S, a member of the Advanced RISC Machines (ARM) family of general-purpose 32-bit embedded processors. It also introduces MPEG-1 Audio Layer-3, more commonly referred to as MP3, a digital audio encoding format that uses a form of lossy data compression and is a common format for consumer audio storage.

Chapter 3 describes the proposed core design, ACARM9, which improves operating performance and power consumption. The architecture of the proposed design is discussed first, followed by the verification strategy and the synthesis results.

Chapter 4 presents an MP3 platform-based system built around ACARM9, which is configured in an FPGA as a special-purpose processor; all system behaviors are controlled by an ARM926EJ-S processor on the development board. The system is examined from three aspects: system level, hardware level and software level.

Chapter 5 describes the MP3 decoder software design, which is analyzed in two phases: before and after optimization. After all the improvements, the analysis shows the minimum clock rate required in the FPGA. The porting technique is also presented in this chapter.

Chapter 6 presents the experimental results for the decoder software and the MP3 decoder system, and discusses the improvements. The experimental results show the minimum clock rate required for real-time decoding.

Chapter 7 presents the conclusions and future work of this thesis.

Chapter 2 Preliminaries

This chapter describes the preliminaries of this thesis. Section 2.1 introduces the ARM9E-S, which is a member of the Advanced RISC Machines (ARM) family of general-purpose 32-bit embedded processors. Section 2.2 introduces the MPEG-1 Layer-3 (MP3) decoder. Section 2.3 describes the target system platform, the RealView Platform Baseboard for ARM926EJ-S.

2.1 Overview of ARM9E-S

This section describes some prerequisites related to the ARM9E-S, which is a member of the Advanced RISC Machines (ARM) family of general-purpose 32-bit embedded processors. Section 2.1.1 depicts the basic architecture of the processor core. Section 2.1.2 covers the programmer's model. Section 2.1.3 describes the instruction set architecture (ISA) of the ARM9E-S. More related work on the ARM9E-S is discussed later in Section 3.1.

2.1.1 Core Architecture of ARM9E-S

The ARM architecture is based on Reduced Instruction Set Computer (RISC) principles. The reduced instruction set and related decode mechanism are much simpler than those of Complex Instruction Set Computer (CISC) designs. This simplicity gives a high instruction throughput, an excellent real-time interrupt response and a small, cost-effective, processor macrocell.

The ARM9E-S core uses a pipeline to increase the speed of the flow of instructions to the processor. This enables several operations to take place simultaneously, and the processing and memory systems to operate continuously. A five-stage pipeline is used, consisting of Fetch, Decode, Execute, Memory, and Write-back stages, as shown in Fig. 2.1[18]. The program counter (PC) points to the instruction being fetched rather than to the instruction being executed.

Fig. 2.1 Five-stage pipeline (Fetch; Decode: registers used in the instruction are decoded; Execute: shift and ALU operation; Memory: data access to/from memory; Write-back: writeback to register bank)

During normal operation, one instruction is being fetched from memory, the previous instruction is being decoded, the instruction before that is being executed, the instruction before that is performing data accesses (if applicable), and the instruction before that is writing its data back to the register bank.
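The overlap described above can be illustrated with a minimal sketch of an ideal pipeline. It assumes no stalls, branches, or multi-cycle instructions, and the function names are hypothetical, chosen only for this illustration.

```python
# Stage names follow the five-stage pipeline described in the text.
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Write-back"]


def pipeline_snapshot(cycle, num_instructions):
    """Return {stage: instruction index or None} for an ideal, stall-free pipeline.

    Assumption: instruction i enters Fetch in cycle i, so in a given cycle
    stage number s holds instruction (cycle - s) when that index is valid.
    """
    snapshot = {}
    for s, name in enumerate(STAGES):
        idx = cycle - s
        snapshot[name] = idx if 0 <= idx < num_instructions else None
    return snapshot


def total_cycles(num_instructions):
    """Cycles needed to drain the ideal pipeline: fill latency plus one per instruction."""
    if num_instructions == 0:
        return 0
    return len(STAGES) + num_instructions - 1
```

In cycle 4 of a ten-instruction program, for example, `pipeline_snapshot(4, 10)` places instruction 4 in Fetch while instruction 0 is in Write-back, which matches the steady-state description above.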

Fig. 2.2[18] illustrates the ARM9E-S core diagram. The major components are:

‧ The address register and incrementer, which select and hold all memory addresses and generate sequential addresses when required.

‧ The register file, which contains 31 32-bit general purpose registers and six status registers.

‧ The 32x16 multiplier, which can perform multi-cycle multiply operations.

‧ The barrel shifter, which can shift or rotate one operand by an immediate offset or a register offset.

‧ The ALU, which can perform arithmetic and logic operations.

‧ The read and write data registers, which hold data being stored to or loaded from memory.

‧ The instruction decoder and control logic, which can decode instructions and generate associated control signals.

‧ The count leading zero (CLZ) and saturation detection (SAT) logic, which are particularly useful for certain DSP applications.
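As an illustration of the last component in the list above, the following sketch models what count-leading-zero and saturation logic compute on 32-bit values. It is a behavioral model only, not the hardware structure of the core, and the function names are assumptions.

```python
INT32_MIN = -(1 << 31)
INT32_MAX = (1 << 31) - 1


def clz32(x):
    """Count the leading zero bits of a 32-bit value (32 when x is zero)."""
    x &= 0xFFFFFFFF
    return 32 - x.bit_length()


def saturating_add32(a, b):
    """Signed 32-bit addition that clamps on overflow instead of wrapping.

    Returns (result, saturated); the boolean models saturation being
    detected, which is what sets the Q flag described in Section 2.1.2.
    """
    total = a + b
    if total > INT32_MAX:
        return INT32_MAX, True
    if total < INT32_MIN:
        return INT32_MIN, True
    return total, False
```

Saturating arithmetic of this kind is useful in audio code because a clipped sample is far less audible than one that wraps around to the opposite sign.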

2.1.2 Programmer’s Model

ARM9E-S has 37 registers including 31 general-purpose registers and six status registers as shown in Fig. 2.3[18]. The processor state and operating modes determine which registers are available to the programmer. ARM9E-S supports seven operating modes including user mode, system mode, fiq mode, supervisor mode, abort mode, irq mode, and undefined mode. The non-user modes, a.k.a. privileged modes, are used for system level programming and typically for handling interrupts or exceptions.

Fig. 2.3 ARM9E-S registers organization in ARM state

ARM9E-S has six program status registers (PSRs): a current program status register (CPSR) and five saved program status registers (SPSRs). These registers hold information about the most recently executed ALU operation, control the enabling and disabling of FIQ/IRQ interrupts, and set the processor operating mode. Fig. 2.4[18] shows the PSR format. The top five bits (N, Z, C, V, Q), a.k.a. the condition code flags, may be affected by the result of an arithmetic or logical instruction and may be used to determine whether an instruction should be executed. The bottom eight bits are the control bits. When the I bit or the F bit is set, the processor disables IRQ or FIQ interrupts, respectively. The T bit describes the state of ARM9E-S: when it is set, the processor is in THUMB state; otherwise it is in ARM state. The M4, M3, M2, M1 and M0 bits are the mode bits, which determine the current operating mode of the processor. Table 2.1 shows the mode bit values and the corresponding operating modes.

Fig. 2.4 Program status register format

Table 2.1 PSR mode bit values

M[4:0] Mode

10000 User

10001 FIQ

10010 IRQ

10011 Supervisor

10111 Abort

11011 Undefined

11111 System
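The PSR layout and Table 2.1 can be summarized in a small decoding sketch. The bit positions used here (N, Z, C, V, Q in bits 31 to 27; I, F, T in bits 7 to 5; mode in bits 4 to 0) follow the ARM architecture's PSR format; the function name and the returned dictionary layout are hypothetical.

```python
# Mode encodings copied from Table 2.1.
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undefined",
    0b11111: "System",
}

# Condition code flag positions in the top five bits of the PSR.
FLAG_BITS = (("N", 31), ("Z", 30), ("C", 29), ("V", 28), ("Q", 27))


def decode_psr(psr):
    """Split a 32-bit PSR value into flags, interrupt masks, state and mode."""
    return {
        "flags": {name: bool((psr >> bit) & 1) for name, bit in FLAG_BITS},
        "irq_disabled": bool((psr >> 7) & 1),  # I bit
        "fiq_disabled": bool((psr >> 6) & 1),  # F bit
        "state": "THUMB" if (psr >> 5) & 1 else "ARM",  # T bit
        "mode": MODES.get(psr & 0x1F, "reserved"),
    }
```

For instance, `decode_psr(0x600000D3)` reports Z and C set, both interrupts disabled, ARM state, and Supervisor mode.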

ARM exceptions can be classified into three groups:

‧ Exceptions generated as the direct effect of executing an instruction, such as software interrupt (SWI), undefined instruction, and prefetch abort (instruction fetch memory fault).

‧ Exceptions generated as the side-effect of an instruction, such as data abort (data access memory fault).

‧ Exceptions generated externally, unrelated to the instruction flow, such as reset, IRQ (normal interrupt request), and FIQ (fast interrupt request).

When two or more exceptions arise at the same time, the exception priorities determine the order in which the exceptions are handled. Table 2.2 shows the exception priorities from high to low.

Table 2.2 Exception priorities

Priority Exception

1 Reset

2 Data abort

3 FIQ

4 IRQ

5 Prefetch abort

6 Undefined instruction, software interrupt
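A minimal sketch of how Table 2.2 resolves simultaneous exceptions is given below. The priority numbers come from the table; the exception key names and the function name are hypothetical.

```python
# Priority values from Table 2.2; a lower number means higher priority.
PRIORITY = {
    "reset": 1,
    "data_abort": 2,
    "fiq": 3,
    "irq": 4,
    "prefetch_abort": 5,
    "undefined": 6,
    "swi": 6,
}


def highest_priority(pending):
    """Return the pending exception that is handled first, or None if none is pending."""
    pending = list(pending)
    if not pending:
        return None
    return min(pending, key=PRIORITY.__getitem__)
```

Undefined instruction and software interrupt share the lowest priority because both result from decoding a single instruction, so they cannot arise at the same time.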

2.1.3 ARM Instruction Set Architecture Version 5

The ARM9E-S adopts the ARM ISA version 5. Its instructions can be classified into the following nine types:

‧ Data processing instructions. These instructions perform arithmetic and logical operations on data values in registers; they typically require two operands and produce a single result. The second register operand can be shifted or rotated before it is combined with the first operand.

‧ DSP enhanced instructions. These instructions include saturated integer arithmetic, half-word integer multiply and multiply-accumulate, two-word load and store, cache preload and two-word coprocessor register transfer.

‧ Program status register transfer instructions. The MRS instruction moves PSR contents to a register. The MSR instruction moves the contents of a register or a 32-bit immediate value to the relevant PSR. Note that in user mode the control bits of the CPSR are protected from change, so only the condition code flags of the CPSR can be changed.

‧ Multiply instructions. These instructions perform multiply and multiply-accumulate operations.

‧ Data transfer instructions. These instructions copy memory values into registers (load instructions) or copy register values into memory (store instructions).

‧ Swap instruction. This instruction exchanges values between a memory location and a register.

‧ Branch instructions. These instructions transfer execution to a different address, either permanently (B) or while saving a return address (BL).

‧ Coprocessor instructions. This class of instructions is used to tell a coprocessor to perform some data operations or data transfers.

‧ Software interrupt instruction. The SWI instruction is used to enter supervisor mode in a controlled manner.

Fig. 2.5[6] shows the encoding formats. In Fig. 2.5, the most significant four bits of each instruction are called the condition field. In ARM state, all instructions are conditionally executed according to the CPSR condition code flags and the condition field of the instruction. An instruction is executed only when its condition is true; otherwise it is ignored. Table 2.3[18] lists the conditions.

Fig. 2.5 ARM instruction set formats

Table 2.3 The condition code summary

Code Suffix Flags Meanings

0000 EQ Z set equal

0001 NE Z clear not equal

0010 CS C set unsigned higher or same

0011 CC C clear unsigned lower

0100 MI N set negative

0101 PL N clear positive or zero

0110 VS V set overflow

0111 VC V clear no overflow

1000 HI C set and Z clear unsigned higher

1001 LS C clear or Z set unsigned lower or same

1010 GE N equals V greater or equal

1011 LT N not equal to V less than

1100 GT Z clear AND (N equals V) greater than

1101 LE Z set OR (N not equal to V) less than or equal

1110 AL (ignored) always
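The conditions in Table 2.3 translate directly into a predicate over the N, Z, C and V flags. The sketch below is one way to write it; the function name and argument order are assumptions, and code 1111 is rejected because the table does not list it.

```python
def condition_passed(cond, n, z, c, v):
    """Evaluate a 4-bit condition field against the N, Z, C, V flags (booleans)."""
    checks = {
        0b0000: z,                     # EQ: equal
        0b0001: not z,                 # NE: not equal
        0b0010: c,                     # CS: unsigned higher or same
        0b0011: not c,                 # CC: unsigned lower
        0b0100: n,                     # MI: negative
        0b0101: not n,                 # PL: positive or zero
        0b0110: v,                     # VS: overflow
        0b0111: not v,                 # VC: no overflow
        0b1000: c and not z,           # HI: unsigned higher
        0b1001: (not c) or z,          # LS: unsigned lower or same
        0b1010: n == v,                # GE: greater or equal
        0b1011: n != v,                # LT: less than
        0b1100: (not z) and (n == v),  # GT: greater than
        0b1101: z or (n != v),         # LE: less than or equal
        0b1110: True,                  # AL: always
    }
    if cond not in checks:
        raise ValueError("condition code not listed in Table 2.3")
    return bool(checks[cond])
```

A decoder model would call this once per instruction and turn the instruction into a no-op when the result is false.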

2.2 Overview of MPEG-1 Layer-3 Decoder

This section describes some prerequisites related to MPEG-1 Layer-3. Section 2.2.1 describes the background of MPEG audio, Section 2.2.2 introduces the bitstream format of MPEG-1 Layer-3, and Section 2.2.3 describes the decoding procedure of MPEG audio.

2.2.1 Introduction

MPEG (Moving Picture Experts Group) Layer-3, otherwise known as MP3, has generated phenomenal interest among internet users, or at least among those who want to download highly compressed digital audio files at near-CD quality.

MPEG-1 is the name for the first phase of MPEG work. MPEG-1 audio consists of three operating modes called "Layers", with increasing complexity and performance, named Layer-1, Layer-2 and Layer-3. Layer-3, with the highest complexity, was designed to provide the highest sound quality at low bit-rates (around 128 kbit/s for a typical stereo signal). The MPEG-1 audio features are shown in Table 2.4.

Table 2.4 MPEG-1 audio features

MPEG-1 audio

Channel One or two

Sample rate 32, 44.1 or 48 kHz

Bit rate From 32 kbps up to 448 kbps

Samples per frame 384, 576 or 1152 samples
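The parameters in Table 2.4 determine how long one frame lasts and how many bytes it occupies on average. The sketch below works through that arithmetic; the function names are hypothetical, and the example values (1152 samples, 44.1 kHz, 128 kbit/s) are one combination chosen for illustration.

```python
def frame_duration_ms(samples_per_frame, sample_rate_hz):
    """Playback time covered by one frame, in milliseconds."""
    return 1000.0 * samples_per_frame / sample_rate_hz


def average_frame_bytes(samples_per_frame, sample_rate_hz, bit_rate_bps):
    """Average frame size in bytes: frame duration times bit rate, divided by 8."""
    return samples_per_frame * bit_rate_bps / (8.0 * sample_rate_hz)


# Example: 1152 samples per frame at 44.1 kHz and 128 kbit/s.
duration = frame_duration_ms(1152, 44100)          # about 26.12 ms
size = average_frame_bytes(1152, 44100, 128000)    # about 417.96 bytes
```

A real-time decoder has to finish decoding each frame within its playback duration, which is why the frame duration bounds the processing time available per frame.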

2.2.2 Bitstream Format

MP3 is based on frames; each frame consists of five parts: header, CRC, side information, main audio data and ancillary data. The data describing the structure of a frame is called the frame's "header" and is shown in Fig. 2.6[42]; Fig. 2.7[42] shows the MP3 frame header represented visually, and Table 2.5[42] lists the characteristics of the thirteen header fields. The header is 32 bits long and starts with a 12-bit synchronization word that marks the beginning of each frame. The other bits specify the parameters of the file, such as layer number, bit-rate and sampling frequency. Information such as the copyright and original bits is not used by the decoder.

The CRC is an optional feature added during the encoding process. The side information consists of information needed to decode the main data. The main audio data, shown in Fig. 2.8, is decoded into audio samples. The ancillary data stores user-defined information and is not used by the decoder.

Fig. 2.6 MP3 frame header

Fig. 2.7 MP3 frame header represented visually

Table 2.5 Characteristics of the thirteen header fields

Position Purpose Length (in bits)

A Frame sync 12

B MPEG audio version (MPEG-1, 2, etc.) 2

C MPEG layer (Layer I, II, III, etc.) 2

D Protection (if on, then checksum follows header) 1

E Bitrate index (lookup table used to specify bitrate for this MPEG version and layer) 4

F Sampling rate frequency (44.1 kHz, etc., determined by lookup table) 2

G Padding bit (on or off, compensates for unfilled frames) 1

H Private bit (on or off, allows for application-specific triggers) 1

I Channel mode (stereo, joint stereo, dual channel, single channel) 2

J Mode extension (used only with joint stereo, to conjoin channel data) 2

K Copyright (on or off) 1

L Original (off if copy of original, on if original) 1

M Emphasis (respects emphasis bit in the original recording; now largely obsolete) 2
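The header fields can be unpacked with a small bit-field parser, sketched below. It assumes the ISO/IEC 11172-3 field widths, in which the 12-bit syncword is followed by a 1-bit ID so that the widths sum to the 32-bit header length; if a different split of the leading bits is intended, only the `FIELDS` list needs to change. The field and function names are hypothetical, and the raw field values are returned without table lookups.

```python
# (name, width in bits), most significant field first.
# Assumption: ISO/IEC 11172-3 widths, which total 32 bits.
FIELDS = [
    ("sync", 12),
    ("id", 1),
    ("layer", 2),
    ("protection", 1),
    ("bitrate_index", 4),
    ("sampling_frequency", 2),
    ("padding", 1),
    ("private", 1),
    ("mode", 2),
    ("mode_extension", 2),
    ("copyright", 1),
    ("original", 1),
    ("emphasis", 2),
]


def parse_header(header_bytes):
    """Unpack a 4-byte frame header into a dict of raw field values."""
    if len(header_bytes) != 4:
        raise ValueError("a frame header is exactly 4 bytes")
    word = int.from_bytes(header_bytes, "big")
    fields = {}
    shift = 32
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    if fields["sync"] != 0xFFF:
        raise ValueError("synchronization word not found")
    return fields
```

For example, `parse_header(bytes([0xFF, 0xFB, 0x90, 0x64]))` yields a bitrate index of 9 and a sampling frequency index of 0; mapping those indices to actual rates requires the lookup tables of the standard, which are not reproduced here.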

Fig. 2.8 Main data structure

2.2.3 Decoding Procedure

The decoding process is shown in Fig. 2.9[31]. We briefly describe the MP3 decoding flow.

‧ Synchronization: The purpose is to find the start of a frame, which is done by searching for the 12-bit synchronization word.

‧ Huffman decoding: Huffman encoding is a lossless compression technique that compresses data according to how frequently each input symbol occurs. The symbols that occur most frequently are assigned shorter codes, and vice versa.

‧ Requantization: The requantization process rescales the Huffman coded data back to the spectral values.

‧ Reordering: The purpose of reordering is to reorder the frequency lines in the granule to compensate for the reordering done in the encoding process.

‧ Joint Stereo Decoding: The purpose of joint stereo decoding is to convert the encoded stereo signal into left/right stereo signals. The information on which mode is used can be found in the header of each frame.

‧ Alias Reduction: The alias reduction tries to reduce the alias effects by merging the frequencies using the butterfly calculations.

‧ Inverse Modified Discrete Cosine Transform (IMDCT): The IMDCT transforms the frequency lines into time-domain samples.

‧ Frequency Inversion: This is done to compensate for the frequency inversion in the synthesis polyphase filterbank process.

‧ Synthesis Polyphase Filterbank: It transforms the 32 subbands of 18 time domain samples in each granule to 18 blocks of 32 PCM samples, which is the final decoding result.
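The Huffman decoding step in the list above relies on the code being prefix-free, so the decoder can emit a symbol as soon as the bits read so far match a code. The sketch below shows the idea with a toy code table; the table and function name are hypothetical, and the actual MP3 Huffman tables defined by the standard are not reproduced.

```python
# Toy prefix-free table: the most frequent symbol "a" gets the shortest code.
TOY_TABLE = {"0": "a", "10": "b", "110": "c", "111": "d"}


def huffman_decode(bits, code_table):
    """Decode a string of '0'/'1' characters using a prefix-free {code: symbol} table."""
    symbols = []
    current = ""
    for bit in bits:
        current += bit
        if current in code_table:
            # No code is a prefix of another, so the first match is the symbol.
            symbols.append(code_table[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return symbols
```

With this table, the ten bits `0100110111` decode to the five symbols a, b, a, c, d.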

Fig. 2.9 Decoding process

2.3 Overview of Target Platform

This section describes some prerequisites related to RealView Platform Baseboard for ARM926EJ-S; Section 2.3.1 describes the major components on the platform system.

2.3.1 Major Platform Components

The platform components adopted by the proposed system are:

‧ ARM926EJ-S PXP Development Chip

— ARM926EJ-S processor core

— Multi-layer bus matrix that gives highly efficient simultaneous transfers

— Two external AHB master bridges and one external AHB slave bridge

— DMA controller

— Vectored Interrupt Controller (VIC)

‧ Field Programmable Gate-Array (FPGA)

‧ 128MB of 32-bit wide SDRAM

‧ 2MB of 32-bit wide static RAM

‧ 64MB of 32-bit wide NOR flash

‧ Programmable clock generators

‧ Connectors for keyboard, audio

‧ RealView Logic Tile

— Xilinx Virtex II FPGA

— configuration Programmable Logic Device (PLD) and flash memory for storing FPGA configurations

— Two 2MB ZBT SSRAM chips (these can be used, for example, to model IRAM and DRAM for an ARM core)

— clock generators and reset sources

The ARM926EJ-S PXP Development Chip is illustrated in Fig. 2.10[5].

Fig. 2.10 ARM926EJ-S Development Chip block diagram

Chapter 3 ACARM9 Processor Core Design

This chapter describes the proposed processor core design. An ARM9E-like, 32-bit RISC embedded processor core is implemented at the register-transfer level (RTL) in Verilog. Section 3.1 depicts the organization of the proposed processor core. Section 3.2 describes the implementation of multi-cycle multiplication. Section 3.3 describes the verification strategy, and Section 3.4 describes the synthesis result.

3.1 Organization of the Proposed Processor Core

The proposed Verilog RTL model consists of 13 major functional blocks including the decoder, register file (RF), barrel shifter (BS), arithmetic/logical unit (ALU), 32x32 multiplier (MUL), count leading zero (CLZ), forwarding unit, program counter (PC) selector, operand fetcher (OF), load/store data address generator (DAG), read/write data operation and control logic. The proposed Verilog RTL model is shown in Fig. 3.1.

Fig. 3.1 Block diagram of the proposed Verilog RTL model (pipeline registers shown: IF/ID, ID/EX, EX/MEM)

The decoder unit obtains the instruction from the IF stage and decodes it to generate all the information needed by the other functional units.

The register file is composed of a total of 37 registers: 31 general-purpose 32-bit registers and six status registers.

The execution stage consists of a BS, an ALU, a CLZ, and a 32x32 multiplier. The arithmetic and logical operations are implemented in the ALU module, whose second operand comes from the BS so that a shift operation can be applied if needed. The 32x32 multiplier produces a 32-bit product. More details of the multi-cycle multiply operation are described in Section 3.2.

The forwarding unit forwards the output data of the EX stage so that the next instruction can get these data as soon as possible.

The program counter selector chooses one valid address from five sources, namely the PC incrementer, the ALU output, the LDM (load multiple)/STM (store multiple) output, the read data and the interrupt, as shown in Fig. 3.2.

Fig. 3.2 Source selection of the address registers

The operand fetcher selects data from the register file, read data, ALU output, data address generator, immediate value or forwarding value, as shown in Fig. 3.3.

Fig. 3.3 Operand selection

The data address generator calculates the read/write data addresses on the data bus, including those for multiple load/store, and the branch addresses on the instruction bus.

The control logic controls all the data flows of combinational logic and all the state transitions of sequential logic.

The read/write data operation performs data alignment. As shown in Fig. 3.4, when the processor reads byte or halfword data from memory, the read operation shifts the data to the bottom of a 32-bit register with zero extension or sign extension. When writing halfword data to memory, the write operation copies the low halfword to the high halfword to fill the 32-bit data width. When writing byte data to memory, the write operation copies the least significant byte to the other three, more significant bytes.
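The alignment behavior described above can be modeled in a few lines. This is a behavioral sketch only: it assumes little-endian byte lanes, and the function names and the `size` convention (1 for byte, 2 for halfword, 4 for word) are hypothetical.

```python
def align_read(word, address, size, signed):
    """Extract a byte (size=1) or halfword (size=2) from a 32-bit memory word.

    The selected lane is shifted to the bottom of the result and then
    zero-extended or sign-extended to 32 bits. Assumes little-endian lanes.
    """
    bits = 8 * size
    lane = (address & 3) if size == 1 else (address & 2)
    value = (word >> (8 * lane)) & ((1 << bits) - 1)
    if signed and (value >> (bits - 1)) & 1:
        value |= 0xFFFFFFFF & ~((1 << bits) - 1)  # fill the upper bits with ones
    return value


def replicate_write(value, size):
    """Replicate a byte or halfword across the 32-bit write data bus."""
    if size == 1:
        return (value & 0xFF) * 0x01010101
    if size == 2:
        return (value & 0xFFFF) * 0x00010001
    return value & 0xFFFFFFFF
```

Replicating the write data across all lanes means the memory system can pick the correct lane from the address and size alone, without the core having to shift the data into position.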

Fig. 3.4 Read/write operation

3.2 Multi-cycle Multiplication

For a multiply instruction, source operand A and source operand B are fed to the multiplier directly to perform multiply operation. Fig. 3.5 shows the multi-stage multiplication FSM.

Fig. 3.5 Multiplication FSM of ARM

A normal multiply instruction, with or without accumulation, needs two cycles. In the first cycle, two 64-bit partial products are obtained. The low 32-bit halves of the two partial products and the value to be accumulated are then added by a carry-save adder to generate two values, which are added by a 32-bit adder to obtain the final 32-bit result; all 32 bits of the product are therefore valid after two cycles. A long multiply instruction, with or without accumulation, needs three cycles: after the first cycle obtains the two 64-bit partial products, the 64-bit values take two further cycles to perform the 32-bit addition twice.

When the multi-cycle multiply operation is finished, a finish signal is sent from the multiplication FSM to the main FSM to indicate that the multi-cycle operation is complete and that the next instruction may be executed to continue the program flow.
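The carry-save step mentioned above reduces three operands to two without propagating carries, leaving a single carry-propagating addition for the 32-bit adder. The sketch below models that arithmetic; how the two partial products are produced is not modeled, and the names `pp0`, `pp1` and `acc` are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def carry_save_add(a, b, c):
    """3:2 compression of three 32-bit values into a sum word and a carry word.

    Each bit position is handled independently: the sum bit is the XOR of
    the three inputs and the carry bit is their majority, shifted left by one.
    """
    sum_word = (a ^ b ^ c) & MASK32
    carry_word = (((a & b) | (a & c) | (b & c)) << 1) & MASK32
    return sum_word, carry_word


def multiply_accumulate_low32(pp0, pp1, acc):
    """Combine two partial-product low halves and an accumulate value.

    The carry-save adder yields two values, and a single 32-bit addition
    (modulo 2**32) produces the final 32-bit result.
    """
    s, c = carry_save_add(pp0 & MASK32, pp1 & MASK32, acc & MASK32)
    return (s + c) & MASK32
```

The result always equals `(pp0 + pp1 + acc)` modulo 2^32, which is the property that lets the carry-save adder replace one of the two full additions.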

3.3 Low-power Techniques

The proposed ACARM9 core is designed for ultra-low-power operation and is implemented with the following techniques. Section 3.3.1 presents gated unused registers, Section 3.3.2 describes gated unexecuted functional units, Section 3.3.3 describes the low-power MUX tree, Section 3.3.4 presents the low-power IP Fong-adder and Section 3.3.5 presents early flag checking for conditional instructions.

3.3.1 Gated Unused Registers

The data stored in registers are updated in every clock cycle when no clock gating is applied; as a result, a large amount of power is consumed, and some measures must be taken to avoid wasting so much power in the proposed design. One measure, which can be implemented at the RTL level, allows new data to enter the D port of a flip-flop only when the enable signal is high, as can be seen in Fig. 3.8. The
