Chapter 2 Preliminaries
2.3 Overview of Target Platform
2.3.1 Major Platform Components
The proposed system uses the following platform components:
‧ ARM926EJ-S PXP Development Chip
— ARM926EJ-S processor core
— Multi-layer bus matrix that gives highly efficient simultaneous transfers
— Two external AHB master bridges and one external AHB slave bridge
— DMA controller
— Vectored Interrupt Controller (VIC)
‧ Field Programmable Gate-Array (FPGA)
‧ 128MB of 32-bit wide SDRAM
‧ 2MB of 32-bit wide static RAM
‧ 64MB of 32-bit wide NOR flash
‧ Programmable clock generators
‧ Connectors for keyboard, audio
‧ RealView Logic Tile
— Xilinx Virtex II FPGA
— Configuration Programmable Logic Device (PLD) and flash memory for storing FPGA configurations
— Two 2MB ZBT SSRAM chips (these can be used, for example, to model IRAM and DRAM for an ARM core)
— Clock generators and reset sources
The ARM926EJ-S PXP Development Chip is illustrated in Fig. 2.10 [5].
Fig. 2.10 ARM926EJ-S Development Chip block diagram
Chapter 3
ACARM9 Processor Core Design
This chapter describes the proposed processor core design. An ARM9E-like, 32-bit RISC embedded processor core is implemented at the register-transfer level in Verilog. Section 3.1 describes the organization of the proposed processor core. Section 3.2 describes the implementation of multi-cycle multiplication. Section 3.3 presents the low-power techniques, Section 3.4 describes the verification strategy, and Section 3.5 reports the synthesis results.
3.1 Organization of the Proposed Processor Core
The proposed Verilog RTL model consists of 13 major functional blocks, including the decoder, register file (RF), barrel shifter (BS), arithmetic/logical unit (ALU), 32x32 multiplier (MUL), count-leading-zeros unit (CLZ), forwarding unit, program counter (PC) selector, operand fetcher (OF), load/store data address generator (DAG), read/write data operation unit and control logic. The proposed Verilog RTL model is shown in Fig. 3.1.
Fig. 3.1 Block diagram of the proposed Verilog RTL model (pipeline registers: IF/ID, ID/EX, EX/MEM)
The decoder unit receives the instruction from the IF stage and decodes it to generate all the information needed by the other functional units.
The register file is composed of a total of 37 registers: 31 general-purpose 32-bit registers and six status registers.
The execution stage consists of a BS, an ALU, a CLZ, and a 32x32 multiplier. Arithmetic and logical operations are performed by the ALU, whose second operand passes through the BS so that it can be shifted if needed. The 32x32 multiplier produces a 32-bit product. The multi-cycle multiply operation is described in more detail in Section 3.2.
The forwarding unit forwards the output data of the EX stage so that the next instruction can obtain these data as soon as possible.
The program counter selector chooses one valid address from five sources: the PC incrementer, the ALU output, the LDM (load multiple)/STM (store multiple) output, the read data and the interrupt, as shown in Fig. 3.2.
Fig. 3.2 Source selection of the address registers
The operand fetcher selects data from the register file, the read data, the ALU output, the data address generator, the immediate value or the forwarded value, as shown in Fig. 3.3.
Fig. 3.3 Operand selection
The data address generator calculates the read/write data address on the data bus, including the addresses for multiple load/store operations, and the branch address on the instruction bus.
The control logic controls all the data flows of combinational logic and all the state transitions of sequential logic.
The read/write data operation performs data alignment. As shown in Fig. 3.4, when the processor reads byte or halfword data from memory, the read operation shifts the data to the bottom of a 32-bit register and zero-extends or sign-extends it. When writing halfword data to memory, the write operation copies the low halfword to the high halfword to fill the 32-bit data width. When writing byte data to memory, the write operation copies the least significant byte to the other three, more significant bytes.
Fig. 3.4 Read/write operation
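The alignment rules above can be illustrated with a small behavioral model. This is only a sketch in Python, not the Verilog of the proposed core; the function names and the little-endian byte-lane selection are assumptions made for the example.

```python
def align_read(word, addr, size, signed=False):
    """Shift the addressed byte/halfword to bit 0 of a 32-bit value and extend it.

    size is 1 (byte) or 2 (halfword); little-endian lane selection is assumed.
    """
    bits = 8 * size
    shift = 8 * (addr & (4 - size))        # byte: addr[1:0], halfword: addr[1]
    value = (word >> shift) & ((1 << bits) - 1)
    if signed and value >> (bits - 1):     # sign bit set: fill the upper bits with ones
        value |= 0xFFFFFFFF & ~((1 << bits) - 1)
    return value


def replicate_write(data, size):
    """Copy the low byte/halfword onto every lane of the 32-bit write bus."""
    if size == 1:
        return (data & 0xFF) * 0x01010101
    if size == 2:
        return (data & 0xFFFF) * 0x00010001
    return data & 0xFFFFFFFF
```

For example, a signed byte read of 0x80 yields 0xFFFFFF80, while a byte write of 0x78 drives 0x78787878 onto the bus so that memory can pick the addressed lane.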
3.2 Multi-cycle Multiplication
For a multiply instruction, source operand A and source operand B are fed to the multiplier directly to perform multiply operation. Fig. 3.5 shows the multi-stage multiplication FSM.
Fig. 3.5 Multiplication FSM of ARM
A normal multiply instruction, with or without accumulation, needs two cycles. Two 64-bit partial products are obtained in the first cycle. In the second cycle, the low 32-bit halves of the two partial products and the accumulate value are combined by a carry-save adder into two values, which a 32-bit adder then adds to produce the final 32-bit result. A long multiply instruction, with or without accumulation, needs three cycles: after the first cycle produces the two 64-bit partial products, two further cycles are needed because the 64-bit values are added with two successive 32-bit additions.
When the multi-cycle multiply operation is finished, the multiplication FSM sends a finish signal to the main FSM to indicate that the multi-cycle operation is complete and that the next instruction may be executed.
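The cycle counts above can be sketched with a behavioral model. The code below is a Python illustration under stated assumptions, not the proposed RTL: it handles unsigned operands only, models the "two 64-bit partial products" as a carry-save (sum, carry) pair, and all function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def csa(x, y, z, mask):
    """3:2 carry-save adder: x + y + z == s + c (modulo mask + 1)."""
    s = (x ^ y ^ z) & mask
    c = (((x & y) | (x & z) | (y & z)) << 1) & mask
    return s, c


def partial_products(a, b):
    """Cycle 1: reduce the shifted rows of a*b to two 64-bit values."""
    s, c = 0, 0
    for i in range(32):
        if (b >> i) & 1:
            s, c = csa(s, c, (a << i) & MASK64, MASK64)
    return s, c


def mla(a, b, acc=0):
    """Normal multiply(-accumulate): returns (32-bit result, cycles)."""
    s, c = partial_products(a, b)                                # cycle 1
    s2, c2 = csa(s & MASK32, c & MASK32, acc & MASK32, MASK32)   # cycle 2: CSA
    return (s2 + c2) & MASK32, 2                                 # cycle 2: 32-bit add


def long_mla(a, b, acc=0):
    """Long multiply(-accumulate): returns (64-bit result, cycles)."""
    s, c = partial_products(a, b)                                # cycle 1
    s2, c2 = csa(s, c, acc & MASK64, MASK64)
    lo = (s2 & MASK32) + (c2 & MASK32)                           # cycle 2: low word
    hi = (s2 >> 32) + (c2 >> 32) + (lo >> 32)                    # cycle 3: high word
    return ((hi & MASK32) << 32) | (lo & MASK32), 3
```

The long multiply reuses a 32-bit addition twice, with the carry out of the low word fed into the high word, which is why it takes one cycle more than the normal multiply.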
3.3 Low-power Techniques
The proposed ACARM9 core design targets ultra-low power consumption and applies the following techniques. Section 3.3.1 presents gating of unused registers, Section 3.3.2 describes gating of unexecuted functional units, Section 3.3.3 describes the low-power MUX tree, Section 3.3.4 presents the low-power IP Fong-adder and Section 3.3.5 presents early flag checking for conditional instructions.
3.3.1 Gated Unused Registers
Without clock gating, the data stored in registers are updated in every clock cycle, which consumes considerable power. One remedy, implemented at the RTL level, lets new data reach the D port of a flip-flop only when the enable signal is high, as shown in Fig. 3.6. The disadvantage is that the register is still rewritten with its old value on every clock edge, so power is still wasted. The lower part of Fig. 3.6 shows a clock-gating (CG) cell built around a latch controlled by the enable signal; it gates the clock of unused registers so that they hold their old values without being updated. This method shuts down the update of unused registers entirely and removes the associated unnecessary power consumption.
Fig. 3.6 Registers with clock gated by RTL and synthesized CG cell
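The difference between the two schemes can be made concrete with a small cycle-level model. The following Python sketch is illustrative only; the function name and the use of "clock edges seen by the flip-flops" as a stand-in for dynamic power are assumptions of the example.

```python
def simulate(enables, data, gated):
    """Return (final register value, clock edges seen by the flip-flops)."""
    q, edges = 0, 0
    for en, d in zip(enables, data):
        if gated:
            # CG cell: the clock edge reaches the flip-flops only when enable is high.
            if en:
                q, edges = d, edges + 1
        else:
            # RTL enable: the flip-flops are clocked every cycle and a mux
            # feeds the old value back when enable is low.
            q, edges = (d if en else q), edges + 1
    return q, edges
```

Both schemes end with the same register contents, but with the gated clock the flip-flops are clocked only in the cycles in which the enable is high.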
3.3.2 Gated Unexecuted Functional Units
All the functional units in the EX stage discussed in Sections 3.1 and 3.2 are gated by registers: the operands of each functional unit are isolated by operand registers, as shown in Fig. 3.7. When one functional unit is active, all the other functional units are gated by their operand registers.
Fig. 3.7 Functional units in EX stage
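As a rough illustration of why operand gating saves power, the sketch below counts bit toggles on the operand inputs of one functional unit for a stream of instructions. It is a Python model under assumptions made for this example (the unit names, the instruction tuples, and toggle count as a proxy for switching power are all hypothetical).

```python
def toggles(prev, new):
    """Number of bits that change between two register values."""
    return bin(prev ^ new).count("1")


def input_activity(ops, unit, isolated):
    """Count operand-input toggles of `unit` over a stream of (unit_name, a, b)."""
    a_reg = b_reg = 0
    total = 0
    for name, a, b in ops:
        if isolated and name != unit:
            continue  # gated operand registers hold their old values: no toggles
        total += toggles(a_reg, a) + toggles(b_reg, b)
        a_reg, b_reg = a, b
    return total
```

With isolation, the multiplier inputs in this model change only when a multiply instruction is executed; without it, every instruction's operands ripple into the multiplier.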
3.3.3 Low-power MUX Tree
This section introduces a method to reduce transitions in the MUX tree of the register file. The register file has 37 32-bit registers, and one or two registers are selected for each instruction, so the selection MUX tree consumes a great deal of power. An independent selection signal method is proposed in Fig. 3.8; it reduces the number of transitions in the MUX tree.
Fig. 3.8 Low-power MUX tree
3.3.4 Low-power IP Fong-adder
The Fong-adder [40] is one of the most significant functional units in the proposed core design because of its low power consumption and smaller area, achieved with no performance overhead.
Table 3.1 compares the Fong-adder with the Hybrid K-S Ling-adder [43]. The Fong-adder saves 32.40% power at a data width of 32 bits and 12.05% at 64 bits. As Table 3.2 shows, the delay of the Fong-adder is 2.00% better than that of the Hybrid K-S Ling-adder at 32 bits and 6.56% better at 64 bits, and its area is 21.40% smaller at 32 bits and 17.72% smaller at 64 bits. A design implemented with the Fong-adder therefore has very low power consumption and small area while retaining high performance, which is one of the main advantages of the proposed core design.
Table 3.1 Power-delay product (PDP) analysis (UMC 0.18um / TT corner)
Data width   Power (mW)                  PDP (pJ)                    PDP saving
(bits)       Hybrid K-S Ling   Fong      Hybrid K-S Ling   Fong
32           18.98             12.83     18.98*1.02        12.83*1.00   33.73%
64           32.77             28.82     32.77*1.22        28.82*1.14   17.82%
Note: Power (mW) / PDP (pJ)
Table 3.2 Timing and area analysis (UMC 0.18um / TT corner)
Data width   Delay (ns)                               Area (um2)
(bits)       Hybrid K-S Ling   Fong   Improvement     Hybrid K-S Ling   Fong        Reduction
32           1.02              1.00   2.00%           14173.790         11140.114   21.40%
64           1.22              1.14   6.56%           28839.888         23730.583   17.72%
Note: Delay (ns) / Area (um2)
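The savings quoted for power, PDP and area can be reproduced from the table entries. The short Python check below assumes that each saving is computed relative to the Hybrid K-S Ling-adder value and that PDP is power multiplied by delay; the helper names are introduced only for this example.

```python
def saving_percent(baseline, proposed):
    """Relative saving of `proposed` with respect to `baseline`, in percent."""
    return round(100 * (baseline - proposed) / baseline, 2)


def pdp(power_mw, delay_ns):
    """Power-delay product: mW * ns = pJ."""
    return power_mw * delay_ns


power_32 = saving_percent(18.98, 12.83)
power_64 = saving_percent(32.77, 28.82)
pdp_32 = saving_percent(pdp(18.98, 1.02), pdp(12.83, 1.00))
pdp_64 = saving_percent(pdp(32.77, 1.22), pdp(28.82, 1.14))
area_32 = saving_percent(14173.790, 11140.114)
area_64 = saving_percent(28839.888, 23730.583)
```

Under these assumptions the computed values match the 32.40%, 12.05%, 33.73%, 17.82%, 21.40% and 17.72% figures given in the text and tables.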
3.3.5 Early Flag Checking
Section 3.3.2 described gating of unexecuted functional units. That method can be improved further by early flag checking. When a flag-setting instruction is executed, the flags are calculated and checked in the EX stage. Checking the flags in the EX stage saves power when the next instruction is not executed, because the control signals for the gated operand registers are prepared before that instruction enters the EX stage.
Fig. 3.9 Conditional check
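The condition check itself follows the standard ARM condition codes evaluated on the N, Z, C and V flags. The Python sketch below shows that evaluation and how its result could drive the operand-register enable; the function names and the enable signal are hypothetical, and the 0b1111 encoding is deliberately not modeled.

```python
def condition_passed(cond, n, z, c, v):
    """Evaluate a 4-bit ARM condition field (0x0..0xE) against the NZCV flags."""
    table = {
        0x0: z,                   # EQ
        0x1: not z,               # NE
        0x2: c,                   # CS/HS
        0x3: not c,               # CC/LO
        0x4: n,                   # MI
        0x5: not n,               # PL
        0x6: v,                   # VS
        0x7: not v,               # VC
        0x8: c and not z,         # HI
        0x9: (not c) or z,        # LS
        0xA: n == v,              # GE
        0xB: n != v,              # LT
        0xC: (not z) and n == v,  # GT
        0xD: z or n != v,         # LE
        0xE: True,                # AL
    }
    return bool(table[cond])


def operand_register_enable(cond, flags):
    """Open the gated operand registers only if the next instruction will execute."""
    n, z, c, v = flags
    return condition_passed(cond, n, z, c, v)
```

If the flags produced in the EX stage make the condition fail, the enable stays low and the functional units of the skipped instruction see no new operands.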
3.4 Verification Strategy
A functional verification flow is proposed in this section. Section 3.4.1 presents the overall functional verification flow. Section 3.4.2 describes coding style checking by linting, Section 3.4.3 describes a deterministic verification approach, and Section 3.4.4 describes an input-constrained random verification flow.
3.4.1 Functional Verification
The functional verification flow is shown in Fig. 3.10. In each step, the RTL model is verified against a SystemC behavior model that is designed to match the cycle behavior of ADS. Any mismatch found during the comparison sends the flow back to the RTL revision step for modification.
Fig. 3.10 Functional verification flow
3.4.2 Coding Style Checking by Linting
This section describes a static coding style check that improves the quality of the RTL code from the perspectives of reuse and verification. The design is checked with the Novas nLint tool using 328 lint rules adopted from the Freescale Semiconductor Reuse Standard (SRS). The check helps catch many kinds of warnings and errors, including naming, synthesis, simulation, common syntax, undeclared objects, unexpected latches and DFT issues.
3.4.3 Deterministic Verification
This section describes deterministic verification, which checks all the regular cases and special corner cases that must be handled in the simulation phase. Deterministic verification is composed of three parts, including specialized handcrafted patterns and real application patterns.
The handcrafted patterns cover all cases of the instructions implemented in the proposed design. As the first step of deterministic verification, they check that the results of all instructions are correct.
The real application patterns consist of benchmarks such as Dhrystone, Whetstone, and DSPstone. In addition, a JPEG encoder and an MP3 decoder program are used to verify the correctness of the design. This kind of pattern checks cases encountered in real-world applications and is essential in the deterministic verification phase.
Once code coverage analysis indicates that the test vectors applied to the design under verification are sufficient, the next verification step, input-constrained random verification, can begin.
3.4.4 Input-constrained Random Verification
Input-constrained random verification is implemented with a constraint-driven random pattern generator that feeds random patterns to both the ISS model and the RTL model; the values of the two models are compared cycle by cycle, as shown in Fig. 3.11.
This kind of verification is intended to detect unexpected cases. The random patterns are constrained to a meaningful range so that undefined instructions are not generated. The more simulation cycles are run, the more of the vector space the random patterns cover; fewer bugs are therefore likely to remain, and the design can be considered more robust.
Fig. 3.11 Input-constrained random verification
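The comparison loop can be sketched as follows. This Python example is a toy stand-in, not the actual test bench: the three-operation instruction set, the model functions and the seed handling are all assumptions chosen to keep the example self-contained.

```python
import random

# Only these opcodes are "defined"; the generator never leaves this set.
OPS = {
    "add": lambda a, b: (a + b) & 0xFFFFFFFF,
    "sub": lambda a, b: (a - b) & 0xFFFFFFFF,
    "and": lambda a, b: a & b,
}


def random_pattern(rng):
    """Constrained generator: a defined opcode and two 32-bit operands."""
    return rng.choice(sorted(OPS)), rng.getrandbits(32), rng.getrandbits(32)


def iss_model(op, a, b):
    """Reference (instruction set simulator) result."""
    return OPS[op](a, b)


def rtl_model(op, a, b):
    """Stand-in for the RTL simulation, written differently on purpose."""
    if op == "sub":
        return (a + ((~b) & 0xFFFFFFFF) + 1) & 0xFFFFFFFF  # two's complement subtract
    return OPS[op](a, b)


def run(cycles, seed=0):
    """Drive both models with the same patterns and collect mismatches."""
    rng = random.Random(seed)
    mismatches = []
    for cycle in range(cycles):
        pattern = random_pattern(rng)
        if iss_model(*pattern) != rtl_model(*pattern):
            mismatches.append((cycle, pattern))
    return mismatches
```

A fixed seed makes a failing run reproducible, so a mismatch reported at a given cycle can be replayed after the RTL is revised.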
3.5 Synthesis Result
Synthesis results are presented in this section. Section 3.5.1 compares the proposed core with the in-house ACARM7 and with ARM cores. Section 3.5.2 summarizes the FPGA synthesis results.
3.5.1 Comprehensive Comparison
This section presents the experimental results in a comprehensive comparison among the proposed core ACARM9, the in-house ACARM7, the ARM946E-S without cache and the ARM7TDMI-S.
Timing, area, and power are compared among the four cores in Table 3.3. The ARM7TDMI-S and ARM946E-S are intellectual property (IP) of ARM. The ACARM7 and ACARM9 are in-house ARM-like IP. All four cores use a 0.18 um process. Results marked with * are measured under the worst-case condition: 0.18 um process, 1.62 V, 125℃, slow silicon. Results marked with ** are measured under the typical-case condition: 0.18 um process, 1.8 V, 25℃, typical silicon. The area is given as a gate count (TSMC 0.18 um NAND2 equivalent).
Table 3.3 Comprehensive comparison between four different cores
                 ARM7TDMI-S   ACARM7   ACARM9   ARM946E-S
Process (um)     0.18         0.18     0.18     0.18
Gate-count (k)   N/A          35.8     48.7     N/A
Area* (mm2)      0.62         0.36     0.49     2.00
3.5.2 Summaries of Synthesis Results
This section analyzes the FPGA synthesis results of ACARM9. The device utilization summary and timing summary are shown in Table 3.4. The maximum frequency on the FPGA is 29.914MHz. The frequency at which ACARM9 can run the MP3 decoder in real time is calculated in Chapter 5 and Chapter 6.
Table 3.4 ACARM9 FPGA synthesis results
Tool: Xilinx ISE 6.2i
Device: Versatile LT-XC2V6000
Device utilization summary:
Number of Slices:              5633 out of 33792   16%
Number of Slice Flip Flops:    2560 out of 67584    3%
Number of 4 input LUTs:       10291 out of 67584   15%
Number of bonded IOBs:          401 out of 1104    36%
Number of TBUFs:                  1 out of 16896    0%
Number of GCLKs:                  2 out of 16      12%
Timing summary:
Minimum period: 33.429ns
Maximum Frequency: 29.914MHz
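The timing and utilization figures in Table 3.4 are related by simple arithmetic, shown here as a short Python check. The helper names are introduced for this example, and the use of integer truncation for the percentages is an assumption that happens to match the reported values.

```python
def max_frequency_mhz(min_period_ns):
    """Maximum clock frequency in MHz for a minimum period given in ns."""
    return 1000.0 / min_period_ns


def utilization_percent(used, available):
    """Utilization as a whole percentage, truncated rather than rounded."""
    return 100 * used // available


f_max = round(max_frequency_mhz(33.429), 3)
slices = utilization_percent(5633, 33792)
flip_flops = utilization_percent(2560, 67584)
luts = utilization_percent(10291, 67584)
iobs = utilization_percent(401, 1104)
```

A minimum period of 33.429 ns corresponds to the reported 29.914 MHz.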
Chapter 4
MP3 Platform-based System
This chapter describes the proposed MP3 decoder system. The ACARM9 processor core is implemented in the FPGA as a special-purpose processor, and all system behavior is controlled by an ARM926EJ-S processor on the development board. Section 4.1 describes the implementation of the system. Section 4.2 describes the organization at the system level. Section 4.3 describes the organization from the hardware perspective. Section 4.4 describes the organization from the software perspective.
4.1 Implementation
The MP3 decoder system is implemented on a Versatile RealView Platform Baseboard for ARM926EJ-S. The proposed core ACARM9 is configured in the FPGA of a Versatile LT-XC2V6000 (Xilinx VirtexII) logic tile. The system software is downloaded from a PC through an ARM® RealView MultiICE. The development environment is based on ARM Developer Suite (ADS) Version 1.2. The proposed design is converted to a BIT file by Xilinx Integrated Software Environment (ISE) 6.2i and configured in the FPGA. The device drivers cover the ARM PrimeCell Vectored Interrupt Controller (PL190), ARM PrimeCell Keyboard Mouse Controller (PL050), ARM PrimeCell Advanced Audio CODEC Interface (PL041), ARM PrimeCell DMA (PL080) and the character LCD display. The complete environment setup is listed in Table 4.1.
Table 4.1 The platform environment setup of the MP3 decoder system
Development Board   Versatile RealView Platform Baseboard for ARM926EJ-S®
Logic Tile          Versatile LT-XC2V6000 (Xilinx VirtexII)
Multi-ICE           ARM® RealView MultiICE®
ADS                 ARM® Developer Suite (ADS) Version 1.2
Xilinx ISE          Xilinx® Integrated Software Environment (ISE) 6.2i
Device Driver       ARM PrimeCell Vectored Interrupt Controller (PL190)
                    ARM PrimeCell Keyboard Mouse Controller (PL050)
                    ARM PrimeCell Advanced Audio CODEC Interface (PL041)
                    ARM PrimeCell DMA (PL080)
                    Character LCD display
4.2 Organization in System Level
This section gives a system-level overview of the proposed system. Section 4.2.1 describes the major components of the proposed system and Section 4.2.2 describes its entire control flow.
4.2.1 Major Components
This section describes the major components of the proposed MP3 decoder system, which is composed of an ARM926EJ-S core, a 128MB SDRAM, a 64MB NOR flash, a 2MB SSRAM, device drivers, an FPGA on the development board and a logic tile. The ARM926EJ-S core controls the behavior and components of the entire system. The core can load data from and store data to the SSRAM. The 64MB NOR flash is nonvolatile storage: the bitstream and programs are stored in the flash so that they can be loaded into SDRAM automatically right after power-up. The logic tile includes an FPGA and two 2MB ZBTSRAMs (Zero Bus Turnaround SRAM). The system also uses a keyboard interface, which serves as the user interface controller, a DMAC (Direct Memory Access Controller) and a VIC (Vectored Interrupt Controller), which are used for audio playback, and many other devices and components. All the components discussed above are shown in Fig. 4.1.
The control flow of the proposed system at the system level is discussed in the next section.
Fig. 4.1 Block diagram in system level
4.2.2 Control Flow in System Level
This section describes the control flow of the proposed system.
First, all the software binaries are produced by the ADS tool and written into the NOR flash in advance through a Multi-ICE with the semi-hosting mechanism. The software binaries include the boot loader, the MP3 system control program and the MP3 decoder program for ACARM9. The MP3 bitstream is also written into the NOR flash, and its write address in the NOR flash is designated at the same time.
Second, the boot program is configured. After the development board boots, the MP3 system control program is loaded into SDRAM and executed by the ARM926EJ-S core.
Third, the MP3 system control program initializes all hardware drivers, including the Vectored Interrupt Controller, Secondary Vectored Interrupt Controller, Advanced Audio CODEC Interface, PS2 Keyboard/Mouse Interface, Real Time Clock Controller, Direct Memory Access Controller and Character LCD Display.
Fourth, the MP3 system control program sets the address of the MP3 bitstream, and the MP3 decoder program is loaded from the NOR flash into the two ZBTSRAMs on the logic tile. Part of the encoded MP3 bitstream file is loaded into the ZBTSRAM as well, so both the program and the encoded bitstream are ready for the ACARM9 configured in the FPGA on the logic tile.
Fifth, after all initialization is done, the ARM926EJ-S sends a signal that tells ACARM9 to begin decoding the bitstream in the ZBTSRAM.
Sixth, once a frame is completely decoded, ACARM9 raises an interrupt and the ARM926EJ-S switches to the interrupt service routine. The ARM926EJ-S moves the audio data from the ZBTSRAM on the logic tile to the SSRAM on the development board, and moves new bitstream data from the NOR flash into the ZBTSRAM for the next frame to be decoded. The audio data is then moved to the Advanced Audio CODEC by the Direct Memory Access Controller and played.
Seventh, when ACARM9 finishes the decoding procedure, it sends a finish signal to the ARM926EJ-S, and the MP3 control program can restart with the next bitstream.
The major steps of the system-level control flow are illustrated in Fig. 4.2. These are the primary steps only; more refined and efficient control techniques are discussed in Section 4.4, Organization in Software Level. The proposed core wrapped with an AHB interface is discussed in more detail in Section 4.3, Organization in Hardware Level.
Fig. 4.2 Control flow in system level (flowchart: boot development board, load control program, initialize peripherals and enable interrupt, run by the ARM926EJ-S; load MP3 decoder and bitstream, run by ACARM9; repeat on each interrupt until finished)
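The frame-by-frame handshake between the two processors can be sketched in software. The Python model below is a simplified illustration of the fourth to seventh steps described above, with memories modeled as plain Python containers; the function name, the `decode` callback and the one-frame-at-a-time buffering are assumptions of the example, not details of the actual control program.

```python
from collections import deque


def run_system(bitstream_frames, decode):
    """Simulate the control flow; return the audio frames in playback order."""
    nor_flash = deque(bitstream_frames)    # encoded frames stored in NOR flash
    # Fourth step: preload the first part of the bitstream into the ZBTSRAM.
    zbt_in = nor_flash.popleft() if nor_flash else None
    ssram, played = [], []
    while zbt_in is not None:
        # Fifth/sixth step: ACARM9 decodes one frame, then raises an interrupt.
        zbt_out = decode(zbt_in)
        # Interrupt service routine on the ARM926EJ-S:
        ssram.append(zbt_out)              # audio data: ZBTSRAM -> SSRAM
        zbt_in = nor_flash.popleft() if nor_flash else None  # refill bitstream
        played.append(ssram.pop(0))        # DMA: SSRAM -> audio CODEC
    # Seventh step: no frames left, ACARM9 signals that decoding is finished.
    return played
```

For instance, `run_system([1, 2, 3], lambda frame: frame * 10)` models three frames passing through the decoder and reaching the audio output in order.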
4.3 Organization in Hardware Level
This section describes hardware architecture of the proposed core wrapped with an AMBA bus interface in FPGA of the proposed system. Section 4.3.1 describes all the AHB peripherals in FPGA. Section 4.3.2 describes the core wrapped with AHB bus interface. Section 4.3.3 describes the implementation of interrupt logic. Section 4.3.4