
Chapter 1 Introduction

1.3 Thesis Organization

The remainder of this thesis is organized as follows. Chapter 2 covers the preliminaries: an overview of SystemC, the abstraction levels of processor modeling, and the instruction set architecture (ISA) that we adopt for the ISS implementation. Chapter 3 describes the SystemC RTL model. Chapter 4 describes the proposed SystemC cycle-accurate behavioral-level ISS. Chapter 5 presents our verification strategies. Chapter 6 discusses the experimental results, and Chapter 7 concludes this thesis and describes future work. The references are provided afterward.

Chapter 2

Preliminaries

This chapter describes the preliminaries of this thesis. Section 2.1 introduces the basic components of SystemC. Section 2.2 describes the abstraction levels and views of processor modeling. Finally, Section 2.3 introduces the ARM7TDMI and the related processor ACARM7.

2.1 Overview of SystemC

SystemC is based on the C++ programming language and extends it with the capabilities needed for hardware description. It can be used at high abstraction levels, such as the system level, as well as at lower levels, such as the register-transfer level. Fig. 2.1 shows the basic components of SystemC.

Fig. 2.1 Basic components of SystemC: A module, ports, processes and signals.

• Modules: A module is the basic unit for describing a structure or class in SystemC and is declared with the SC_MODULE macro. A module can be hierarchical: it can contain processes and other child modules instantiated within it.

• Processes: A process describes functionality. A module can have one or more processes. There are three kinds of processes: SC_METHOD, SC_THREAD, and SC_CTHREAD. An SC_METHOD process can be invoked multiple times but cannot be suspended during execution. An SC_THREAD or SC_CTHREAD process is invoked only once and has the option to suspend itself.

• Ports: Ports, including input, output, and inout ports, are used to communicate with other modules.

• Signals: A signal is used to connect processes and child modules.

Besides the C++ native data types, SystemC also supports synthesizable hardware-oriented data types in SystemC RTL as shown in Table 2.1 [4].

Table 2.1 SystemC data types that are supported in SystemC RTL.

Name           Description
sc_bit         Single bit with two values, ‘0’ and ‘1’
sc_bv<n>       Arbitrary width bit vector
sc_logic       Single bit with four values, ‘0’, ‘1’, ‘X’, ‘Z’
sc_lv<n>       Arbitrary width logic vector
sc_int<n>      Signed integer type, up to 64 bits
sc_uint<n>     Unsigned integer type, up to 64 bits
sc_bigint<n>   Arbitrary width signed integer type
sc_biguint<n>  Arbitrary width unsigned integer type

2.2 Processor Modeling at Different Abstraction Levels and Views

A processor can be modeled at different abstraction levels and views to meet different evaluation requirements. Abstraction levels such as the register-transfer level and transaction-level modeling (TLM), and views such as the Programmer’s View (PV), Architecture View (AV), and Verification View (VV), each serve a different purpose in design and verification, as shown in Table 2.2 [3].

Table 2.2 Abstraction of different views of modeling.

View               Model Accuracy     Purpose
Functional View    Event ordering     Functional specification and algorithm development
Programmer’s View  Bit accurate       Software development and verification
Architecture View  Cycle approximate  Architectural exploration and verification
Verification View  Cycle accurate     System-level verification

A Functional View model is used for developing and verifying software functions and algorithms. Only the instruction set of the processor is modeled. This model is just instruction-accurate and un-timed for achieving high simulation speed. Similarly, a Programmer’s View model is used for software development too. Typically, the model is bit-accurate and un-timed.

On the other hand, an Architecture View model is used for architecture exploration; it is transaction-accurate and usually cycle-approximate. A Verification View model is cycle-accurate and typically focuses on hardware-software co-simulation and co-verification.

2.3 Overview of ARM7TDMI and ACARM7 Verilog RTL Model

2.3.1 Core Architecture of ARM7TDMI

ARM7TDMI [5, 6] is a general-purpose 32-bit RISC processor developed by Advanced RISC Machines (ARM) Limited. It employs the von Neumann architecture, which uses a single memory to hold both instructions and data. Fig. 2.2 illustrates the ARM7TDMI core diagram [5]. The major components are:

• The address register and incrementer, which select and hold all memory addresses and generate sequential addresses when required.

• The register file, which contains 31 general-purpose 32-bit registers and six status registers.

• The 32x8 multiplier, which performs multi-cycle multiply operations.

• The barrel shifter, which can shift or rotate one operand by an immediate offset or a register offset.

• The ALU, which performs arithmetic and logic operations.

• The read and write data registers, which hold data being loaded from or stored to memory.

• The instruction decoder and control logic, which decode instructions and generate the associated control signals.

Fig. 2.2 ARM7TDMI core diagram. (Buses labeled in the figure: PC bus, incrementer bus, A bus, B bus, ALU bus.)

The ARM7TDMI adopts a 3-stage pipeline consisting of the instruction fetch (IF) stage, the instruction decode (ID) stage, and the execution (EX) stage. In the IF stage, the instruction is fetched from memory and queued in the instruction pipeline. In the ID stage, the instruction is decoded and the control signals for the following cycles are generated. In the EX stage, the instruction enters the datapath: the register bank is read, the second operand is shifted or rotated if needed, and the result is generated by the ALU and written back to a destination register. A single-cycle instruction, such as a data processing instruction, takes one EX-stage cycle, as shown in Fig. 2.3. A multi-cycle instruction takes more than one EX-stage cycle. Fig. 2.4 shows a sequence of single-cycle ADD instructions with a data store instruction, STR, following the first ADD. The cycles that access memory are shaded, and the memory is used in every cycle. The datapath is likewise used in every cycle, because the STR instruction takes two EX-stage cycles: one for address calculation and one for data transfer. Memory access is therefore the major factor limiting the number of cycles taken by a sequence of instructions.

Fig. 2.3 ARM7TDMI single-cycle instruction 3-stage pipeline operation.

Fig. 2.4 ARM multi-cycle instruction 3-stage pipeline operation.
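The cycle behavior described above can be approximated with a small counting sketch in plain C++. It assumes an ideal 3-stage pipeline with no branches or stalls, so a straight-line sequence costs two pipeline-fill cycles plus the sum of its EX-stage cycles. The function name `total_cycles` is hypothetical and is not part of the proposed design.

```cpp
#include <numeric>
#include <vector>

// Hypothetical helper: total cycles for a straight-line instruction sequence
// on an ideal 3-stage pipeline (IF, ID, EX). Assumption: no branches or
// stalls, so the first instruction needs two cycles (IF, ID) to reach EX and
// the EX stage is then occupied back to back.
int total_cycles(const std::vector<int>& ex_cycles) {
    if (ex_cycles.empty()) return 0;
    const int fill = 2;  // IF and ID cycles of the first instruction
    return fill + std::accumulate(ex_cycles.begin(), ex_cycles.end(), 0);
}
```

For example, the sequence ADD, STR, ADD, ADD corresponds to EX-stage cycle counts of 1, 2, 1, and 1, which gives 2 + 5 = 7 cycles under these assumptions.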

2.3.2 Programmer’s Model

ARM7TDMI can be in one of two states, ARM state or THUMB state. In ARM state, the processor executes 32-bit, word-aligned ARM instructions; while in THUMB state, the processor executes 16-bit, halfword-aligned THUMB instructions.

Since the proposed designs described in Chapters 3 and 4 implement the ARM ISA without THUMB instructions and coprocessor instructions, we focus only on the ARM state in this thesis.

ARM7TDMI has 37 registers including 31 general-purpose registers and six status registers as shown in Fig. 2.5 [5]. The processor state and operating modes determine which registers are available to the programmer. ARM7TDMI supports seven operating modes including user mode, system mode, fiq mode, supervisor mode, abort mode, irq mode, and undefined mode. The non-user modes, a.k.a. privileged modes, are used for system level programming and typically for handling interrupts or exceptions.

Fig. 2.5 ARM7TDMI register organization in ARM state. (Columns in the figure: System & User, FIQ, Supervisor, Abort, IRQ, Undefined. The figure shows the ARM-state general registers and program counter, and the ARM-state program status registers: the CPSR appears in every mode, and the banked SPSR_fiq, SPSR_svc, SPSR_abt, SPSR_irq, and SPSR_und appear in the FIQ, Supervisor, Abort, IRQ, and Undefined modes, respectively.)
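The count of 31 general-purpose registers follows from the banking scheme of the ARM programmer's model: R0-R7 and R15 are shared by all modes, R8-R12 have a second copy for FIQ mode only, and R13 and R14 exist once for User/System and once for each of the other five modes. The sketch below, in plain C++, maps a mode and an architectural register number to a physical register index. The index layout and the names `bank_of`, `physical_index`, and `count_physical_registers` are illustrative assumptions, not the layout used in the proposed design.

```cpp
#include <set>

enum class Mode { User, System, FIQ, Supervisor, Abort, IRQ, Undefined };

// Bank selector for R13/R14: User and System share one bank,
// every other mode has its own.
int bank_of(Mode m) {
    switch (m) {
        case Mode::User:
        case Mode::System:     return 0;
        case Mode::FIQ:        return 1;
        case Mode::Supervisor: return 2;
        case Mode::Abort:      return 3;
        case Mode::IRQ:        return 4;
        case Mode::Undefined:  return 5;
    }
    return 0;
}

// Map (mode, architectural register r0..r15) to a physical index 0..30.
// The numbering itself is an assumption made for illustration.
int physical_index(Mode m, int r) {
    if (r < 8)   return r;                                    // R0-R7: never banked
    if (r == 15) return 30;                                   // PC: shared by all modes
    if (r <= 12) return (m == Mode::FIQ) ? 13 + (r - 8) : r;  // R8-R12: FIQ has its own copy
    return 18 + 2 * bank_of(m) + (r - 13);                    // R13, R14: six banks
}

// Count the distinct physical registers reachable from all modes.
int count_physical_registers() {
    const Mode modes[] = {Mode::User, Mode::System, Mode::FIQ, Mode::Supervisor,
                          Mode::Abort, Mode::IRQ, Mode::Undefined};
    std::set<int> seen;
    for (Mode m : modes)
        for (int r = 0; r < 16; ++r)
            seen.insert(physical_index(m, r));
    return static_cast<int>(seen.size());
}
```

Enumerating all modes and register numbers yields 8 + 5 + 5 + 12 + 1 = 31 distinct physical registers, which matches the general-purpose register count stated above.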

ARM7TDMI has six program status registers (PSRs): a current program status register (CPSR) and five saved program status registers (SPSRs). These registers hold information about the most recently executed ALU operation, control the enabling and disabling of FIQ/IRQ interrupts, and set the processor operating mode. Fig. 2.6 [5] shows the PSR format. The top four bits (N, Z, C, V), a.k.a. the condition code flags, may be affected by the result of an arithmetic/logical instruction and may be used to determine whether an instruction should be executed. The bottom eight bits are the control bits. When the I bit or the F bit is set, the IRQ or FIQ interrupt, respectively, is disabled. The T bit reflects the operating state: when it is clear, the processor is in ARM state. The M4, M3, M2, M1, M0 bits are the mode bits, which determine the current operating mode of the processor. Table 2.3 shows the mode bit values and the corresponding operating modes.

Fig. 2.6 Program status register format.
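As a minimal sketch of the PSR layout, the following plain C++ function extracts the fields of a 32-bit PSR value. It assumes the standard ARM7TDMI bit positions (N, Z, C, V in bits 31 to 28; I, F, T in bits 7, 6, 5; mode bits in bits 4 to 0). The struct and function names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical container for the decoded PSR fields.
struct PsrFields {
    bool n, z, c, v;     // condition code flags, bits 31..28
    bool irq_disabled;   // I bit, bit 7
    bool fiq_disabled;   // F bit, bit 6
    bool thumb;          // T bit, bit 5 (clear means ARM state)
    std::uint8_t mode;   // M[4:0], bits 4..0
};

PsrFields decode_psr(std::uint32_t psr) {
    PsrFields f;
    f.n = ((psr >> 31) & 1u) != 0;
    f.z = ((psr >> 30) & 1u) != 0;
    f.c = ((psr >> 29) & 1u) != 0;
    f.v = ((psr >> 28) & 1u) != 0;
    f.irq_disabled = ((psr >> 7) & 1u) != 0;
    f.fiq_disabled = ((psr >> 6) & 1u) != 0;
    f.thumb = ((psr >> 5) & 1u) != 0;
    f.mode = static_cast<std::uint8_t>(psr & 0x1Fu);
    return f;
}
```

For instance, the value 0x600000D3 decodes to Z and C set, both interrupts disabled, ARM state, and mode bits 10011.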

Table 2.3 PSR mode bit values.

M[4:0]  Mode
10000   User
10001   FIQ
10010   IRQ
10011   Supervisor
10111   Abort
11011   Undefined
11111   System
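The mapping in Table 2.3 can be written as a small lookup in plain C++. This is only an illustrative sketch; the function name `mode_name` and the "Invalid" result for unlisted encodings are assumptions.

```cpp
#include <cstdint>
#include <string>

// Return the operating mode encoded in the low five bits of a PSR value,
// following Table 2.3. Encodings not listed in the table are reported as
// "Invalid" (an assumption of this sketch).
std::string mode_name(std::uint32_t psr) {
    switch (psr & 0x1Fu) {
        case 0x10: return "User";        // 10000
        case 0x11: return "FIQ";         // 10001
        case 0x12: return "IRQ";         // 10010
        case 0x13: return "Supervisor";  // 10011
        case 0x17: return "Abort";       // 10111
        case 0x1B: return "Undefined";   // 11011
        case 0x1F: return "System";      // 11111
        default:   return "Invalid";
    }
}
```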

ARM exceptions can be classified into three groups:

• Exceptions generated as the direct effect of executing an instruction, such as software interrupt (SWI), undefined instruction, and prefetch abort (instruction fetch memory fault).

• Exceptions generated as the side-effect of an instruction, such as data abort (data access memory fault).

• Exceptions generated externally, unrelated to the instruction flow, such as reset, IRQ (normal interrupt request), and FIQ (fast interrupt request).

When two or more exceptions arise at the same time, the exception priorities determine the order in which the exceptions are handled. Table 2.4 shows the exception priorities from high to low.

Table 2.4 Exception priorities.

Priority  Exception
1         Reset
2         Data abort
3         FIQ
4         IRQ
5         Prefetch abort
6         Undefined instruction, Software interrupt
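The priority rule of Table 2.4 can be sketched as a priority encoder over a bit mask of pending exceptions. This plain C++ fragment is an illustration under the assumption that bit i of the mask corresponds to priority i + 1; the names `Exception`, `pending_bit`, and `highest_priority` are hypothetical.

```cpp
// Exceptions ordered from highest to lowest priority, as in Table 2.4.
// Undefined instruction and software interrupt share the lowest level.
enum class Exception { Reset, DataAbort, FIQ, IRQ, PrefetchAbort, UndefinedOrSWI, None };

// Bit position in the pending mask for a given exception (assumed encoding).
constexpr unsigned pending_bit(Exception e) {
    return 1u << static_cast<unsigned>(e);
}

// Return the pending exception with the highest priority,
// or Exception::None if the mask is empty.
Exception highest_priority(unsigned pending) {
    for (unsigned i = 0; i < 6; ++i) {
        if (pending & (1u << i)) return static_cast<Exception>(i);
    }
    return Exception::None;
}
```

If a data abort and an IRQ are pending in the same cycle, the sketch selects the data abort, consistent with the table.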

2.3.3 ARM Instruction Set Architecture Version 4

The ARM7TDMI adopts the ARM ISA version 4. It can be classified into eight instruction types:

• Data processing instructions. These instructions perform arithmetic and logical operations on data values in registers; they typically take two operands and produce a single result. The second register operand can be shifted or rotated before it is combined with the first operand.

• Program status register transfer instructions. The MRS instruction moves the PSR contents to a register. The MSR instruction moves the contents of a register or a 32-bit immediate value to the relevant PSR. Note that in user mode the control bits of the CPSR are protected from change, so only the condition code flags of the CPSR can be changed.

• Multiply instructions. These instructions perform multiply and multiply-accumulate operations.

• Data transfer instructions. These instructions copy memory values into registers (load instructions) or copy register values into memory (store instructions).

• Swap instruction. This instruction exchanges values between a memory location and a register.

• Branch instructions. These instructions switch execution to a different address, either permanently (B) or after saving a return address (BL).

• Coprocessor instructions. This class of instructions tells a coprocessor to perform data operations or data transfers.

• Software interrupt instruction. The SWI instruction is used to enter supervisor mode in a controlled manner.

Fig. 2.7 [5] shows the encoding formats, and Table 2.5 lists the complete set of the ISA [5]. In Fig. 2.7, the most significant four bits of each instruction form the condition field. In ARM state, every instruction is conditionally executed according to the CPSR condition code flags and the condition field of the instruction: the instruction is executed only when the condition is true; otherwise it is ignored. Table 2.6 lists the conditions [5].
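As a minimal sketch of conditional execution, the following plain C++ function checks the condition field (bits 31 to 28 of the instruction) against the N, Z, C, V flags of the CPSR. The encodings follow the standard ARM condition-code assignment; treating the reserved code 1111 as "never executed" is an assumption of this sketch, and the function name `condition_passed` is hypothetical.

```cpp
#include <cstdint>

// Return true if the instruction's condition field is satisfied by the
// condition code flags in cpsr (N, Z, C, V in bits 31..28).
bool condition_passed(std::uint32_t instr, std::uint32_t cpsr) {
    const bool n = ((cpsr >> 31) & 1u) != 0;
    const bool z = ((cpsr >> 30) & 1u) != 0;
    const bool c = ((cpsr >> 29) & 1u) != 0;
    const bool v = ((cpsr >> 28) & 1u) != 0;
    switch (instr >> 28) {
        case 0x0: return z;              // EQ: equal
        case 0x1: return !z;             // NE: not equal
        case 0x2: return c;              // CS: unsigned higher or same
        case 0x3: return !c;             // CC: unsigned lower
        case 0x4: return n;              // MI: negative
        case 0x5: return !n;             // PL: positive or zero
        case 0x6: return v;              // VS: overflow
        case 0x7: return !v;             // VC: no overflow
        case 0x8: return c && !z;        // HI: unsigned higher
        case 0x9: return !c || z;        // LS: unsigned lower or same
        case 0xA: return n == v;         // GE: signed greater or equal
        case 0xB: return n != v;         // LT: signed less than
        case 0xC: return !z && (n == v); // GT: signed greater than
        case 0xD: return z || (n != v);  // LE: signed less than or equal
        case 0xE: return true;           // AL: always
        default:  return false;          // 1111: reserved (assumed never executed)
    }
}
```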

2.3.4 ACARM7 Verilog RTL Model

The ACARM7 Verilog RTL model is an ultra low-power, high performance ARM7-like processor core. This Verilog model and our proposed design are co-developed based on the same ISA. We use ACARM7 Verilog RTL model as our reference model for co-simulation, co-verification and performance comparison.

Fig. 2.7 ARM instruction set formats.

Table 2.5 The ARM Instruction set

Mnemonic  Instruction                                     Action
ADC       Add with carry                                  Rd := Rn + Op2 + Carry
CDP       Coprocessor Data Processing
LDC       Load coprocessor from memory                    Coprocessor load
LDM       Load multiple registers                         Stack manipulation (Pop)
LDR       Load register from memory                       Rd := (address)
MCR       Move CPU register to coprocessor register       cRn := rRn {<op>cRm}
MLA       Multiply Accumulate                             Rd := (Rm * Rs) + Rn
MOV       Move register or constant                       Rd := Op2
MRC       Move from coprocessor register to CPU register  Rn := cRn {<op>cRm}
MRS       Move PSR status/flags to register
MVN       Move negative register                          Rd := 0xFFFFFFFF EOR Op2
ORR       OR                                              Rd := Rn OR Op2
RSB       Reverse Subtract                                Rd := Op2 - Rn
RSC       Reverse Subtract with Carry                     Rd := Op2 - Rn - 1 + Carry
SBC       Subtract with Carry                             Rd := Rn - Op2 - 1 + Carry
STC       Store coprocessor register to memory            address := CRn
STM       Store Multiple                                  Stack manipulation (Push)
STR       Store register to memory                        <address> := Rd
SUB       Subtract                                        Rd := Rn - Op2
SWI       Software Interrupt                              OS call
SWP       Swap register with memory                       Rd := [Rn], [Rn] := Rm
TEQ       Test bitwise equality                           CPSR flags := Rn EOR Op2
TST       Test bits                                       CPSR flags := Rn AND Op2

Table 2.6 Condition code summary.

Code Suffix Flags Meanings

0000 EQ Z set equal

Chapter 3

Proposed SystemC Register-Transfer Level Model

This chapter describes the proposed SystemC RTL design. An ARM7-like, 32-bit RISC embedded processor core is implemented at the register-transfer level in SystemC. Section 3.1 presents the block diagram of the proposed design. Section 3.2 describes the control logic. Section 3.3 describes how an arithmetic/logical operation is carried out in the EX stage, and Section 3.4 describes the mechanism of multi-cycle multiplication in the EX stage.

3.1 Block Diagram

The proposed SystemC RTL model consists of 10 major functional blocks including the decoder, register file, barrel shifter (BS), arithmetic/logical unit (ALU), 32x8 multiplier (MUL), forwarding unit, address register, read data selector, write data selector, and control logic. Fig. 3.1 shows the block diagram of the proposed design.

Fig. 3.1 Block diagram of the proposed SystemC RTL model.

The decoder unit obtains the instruction from the IF stage and decodes it to generate all the information needed by the other functional units.

The register file is composed of a total of 37 registers: 31 general-purpose 32-bit registers and six status registers, as shown in Fig. 2.5.

The execution stage consists of a BS, an ALU, and a 32x8 multiplier. Arithmetic and logical operations are implemented in the ALU module, whose second operand comes from the BS so that it can be shifted if needed. The 32x8 multiplier produces a 40-bit product. The multi-cycle multiply operation is described in more detail in Section 3.4.

The forwarding unit forwards the output data of the EX stage so that the next instruction can get these data as soon as possible.

The address register chooses one valid address from four sources: the PC incrementer, the ALU output, the LDM (load multiple)/STM (store multiple) output, and the interrupt, as shown in Fig. 3.2.

Fig. 3.2 Source selection of the address register.

The read/write selector performs data alignment operations. As shown in Fig. 3.3, when the processor reads byte or halfword data from memory, the read selector shifts the data to the bottom of a 32-bit register and applies zero extension or sign extension. When writing halfword data to memory, the write selector copies the low halfword to the high halfword to fill the 32-bit data width. When writing byte data to memory, the write selector copies the least significant byte to the other three more significant bytes.
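The alignment operations described above can be sketched in plain C++ as follows. The byte-lane selection by the low address bits assumes a little-endian memory layout, which is an assumption of this sketch, and all function names are hypothetical.

```cpp
#include <cstdint>

// Write selector: replicate the least significant byte across all four bytes.
std::uint32_t replicate_byte(std::uint32_t data) {
    const std::uint32_t b = data & 0xFFu;
    return b | (b << 8) | (b << 16) | (b << 24);
}

// Write selector: copy the low halfword into the high halfword.
std::uint32_t replicate_halfword(std::uint32_t data) {
    const std::uint32_t h = data & 0xFFFFu;
    return h | (h << 16);
}

// Read selector: move the addressed byte to bits 7..0, then extend.
// byte_offset is address[1:0]; little-endian lanes are assumed.
std::uint32_t read_byte(std::uint32_t word, unsigned byte_offset, bool sign_extend) {
    std::uint32_t b = (word >> (8u * (byte_offset & 3u))) & 0xFFu;
    if (sign_extend && (b & 0x80u)) b |= 0xFFFFFF00u;
    return b;
}

// Read selector: move the addressed halfword to bits 15..0, then extend.
// Only address bit 1 selects the halfword lane.
std::uint32_t read_halfword(std::uint32_t word, unsigned byte_offset, bool sign_extend) {
    std::uint32_t h = (word >> (8u * (byte_offset & 2u))) & 0xFFFFu;
    if (sign_extend && (h & 0x8000u)) h |= 0xFFFF0000u;
    return h;
}
```

Replicating the write data across the word presumably lets the memory pick the correct byte lane from the address alone, so the write selector itself does not need to shift the data.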

3.2 Control Logic

The control logic controls all the data flows of the combinational logic and the sequential logic. The instruction operation cycles are controlled by five finite state machines (FSMs): the load/store sub-FSM, the shift sub-FSM, the multiplication sub-FSM, the branch sub-FSM, and one main FSM, as shown in Fig. 3.4.

Fig. 3.4 FSMs of the control logic.

The load/store sub-FSM handles four types of instructions: store, load with branch, load without branch, and swap. Store instructions comprise the single data store instruction and the multiple data store instruction. For the single data store instruction, the destination address is calculated in the first state and the data is written to memory in the second state, which finishes the instruction. For the multiple data store instruction, the sub-FSM stays in the second state until all data transfers are done.

Except for the store instructions, all other instructions finish their last execution cycle in the normal main state. Like store instructions, load instructions comprise the single data load instruction and the multiple data load instruction. The load address is calculated in the first state, and the data is fetched from memory in the second state. The single data load instruction finishes its last execution cycle in the normal main state; the multiple data load instruction moves to the normal main state for its last execution cycle only after all other data transfers are done. The load with branch instruction moves to the branch sub-FSM after the load operation. The swap instruction exchanges data between a register and external memory and can be implemented as a load operation followed by a store operation.

The shift sub-FSM handles data processing instructions whose shift amount is specified by a register. When the destination register is r15 (PC), the processor moves to the branch sub-FSM to perform the branch operation.

The multiplication sub-FSM handles the multiply instructions. At the last execution cycle, the finish signal is sent to the main FSM and the processor moves to the normal main state. More details are given in Section 3.4.

The branch sub-FSM handles the instructions that use r15 as the destination register. A new PC address is calculated at the first state and the proper new

3.3 Arithmetic/Logical Operation in Execution Stage

Fig. 3.5 shows the detailed block diagram of EX stage.

Fig. 3.5 Detailed block diagram of EX stage.

Arithmetic and logical operations are performed by the ALU. Src_a and Src_b are the two inputs of the EX stage. They are also fed directly to the multiplier for multiply instructions, which are discussed in the next section.

Src_b can be fed into the BS to perform any of the shift/rotate operations: logical shift left, logical shift right, arithmetic shift right, and rotate right. The output of the BS and Src_a are then fed into a reverse-inverse-multiplexer, which decides whether an input should be inverted for subtraction and whether the two inputs should be swapped for reverse instructions such as RSB and RSC.
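The four shift/rotate operations of the BS can be sketched in plain C++ as below. The sketch assumes a shift amount in the range 0 to 31 and does not model the special ARM encodings for a zero amount or the carry-out of the shifter; the names `ShiftOp` and `barrel_shift` are hypothetical.

```cpp
#include <cstdint>

enum class ShiftOp { LSL, LSR, ASR, ROR };

// Shift or rotate a 32-bit value. Assumption: amount is taken modulo 32,
// and an amount of 0 leaves the value unchanged.
std::uint32_t barrel_shift(ShiftOp op, std::uint32_t v, unsigned amount) {
    const unsigned n = amount & 31u;
    if (n == 0) return v;
    switch (op) {
        case ShiftOp::LSL:
            return v << n;
        case ShiftOp::LSR:
            return v >> n;
        case ShiftOp::ASR: {
            // Fill the vacated upper bits with copies of the sign bit.
            const std::uint32_t fill = (v & 0x80000000u) ? ~(0xFFFFFFFFu >> n) : 0u;
            return (v >> n) | fill;
        }
        case ShiftOp::ROR:
            return (v >> n) | (v << (32u - n));
    }
    return v;
}
```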

The results of the reverse-inverse-multiplexer are sent to a logical unit to perform a logical operation, or sent to a 64-bit adder [7] to perform an arithmetic operation.

Arithmetic instructions such as ADD, SUB, and ADC use the upper 32 bits of the adder; the lower 32 bits are filled with zeros.
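The role of the reverse-inverse-multiplexer can be illustrated with a short plain C++ sketch in which subtraction is computed as an addition with an inverted operand and a carry-in, following the actions listed in Table 2.5. The sketch uses a plain 32-bit addition rather than the 64-bit adder of the design, and the names `ArithOp` and `alu_arith` are hypothetical.

```cpp
#include <cstdint>

enum class ArithOp { ADD, ADC, SUB, SBC, RSB, RSC };

// Compute the 32-bit result of an arithmetic instruction.
// rn is the first operand, op2 the (already shifted) second operand.
std::uint32_t alu_arith(ArithOp op, std::uint32_t rn, std::uint32_t op2, bool carry) {
    std::uint32_t a = rn;
    std::uint32_t b = op2;

    // "Reverse": swap the operands for RSB and RSC.
    if (op == ArithOp::RSB || op == ArithOp::RSC) {
        const std::uint32_t t = a;
        a = b;
        b = t;
    }

    // "Inverse": invert the subtrahend for every subtract-type instruction.
    const bool subtract = (op != ArithOp::ADD && op != ArithOp::ADC);
    if (subtract) b = ~b;

    // Carry-in: 0 for ADD, 1 for SUB/RSB, the C flag for ADC/SBC/RSC.
    std::uint32_t cin = 0;
    if (op == ArithOp::SUB || op == ArithOp::RSB) cin = 1;
    if (op == ArithOp::ADC || op == ArithOp::SBC || op == ArithOp::RSC) cin = carry ? 1u : 0u;

    return a + b + cin;  // unsigned arithmetic wraps modulo 2^32
}
```

With this formulation, SBC yields Rn + NOT(Op2) + Carry, which equals Rn - Op2 - 1 + Carry as listed in Table 2.5.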

3.4 Multi-cycle Multiplication in Execution Stage

For a multiply instruction, Src_a and Src_b are fed to the multiplier [8] directly to perform multiply operation. Fig. 3.6 shows the 7-stage multiplication sub-FSM.

Fig. 3.6 7-stage multiplication sub-FSM. (States in the figure: S1, S2, S3, S4, MLA, Lwrite, and S_Finish. The multiplier is shifted right by 8 bits on each transition from S1 through S4. The sub-FSM finishes directly when neither accumulation nor a 64-bit result is required, passes through MLA when accumulation is required, and passes through Lwrite when a 64-bit result is written.)

The 40-bit product of the multiplier is calculated from the 32-bit multiplicand and the bottom 8 bits of the multiplier operand. A normal multiply instruction without accumulation needs at least 2 cycles, since a 40-bit product is obtained in the first cycle and is then added in the 64-bit adder to obtain the final 64-bit result. When all 32 bits of the multiplier operand are significant, four multiplication cycles are needed (8 bits per cycle). In each cycle, the 40-bit product obtained in the preceding cycle is added to the 64-bit adder result. Therefore, a normal multiply instruction takes at most 5 cycles to finish. A multiply with accumulate instruction takes one more cycle, and writing a 64-bit result to two 32-bit registers also takes one more cycle. As a result, multiplying a 32-bit multiplicand by a 32-bit multiplier with accumulation and writing the 64-bit result to two 32-bit registers takes 7 cycles in total.
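The iteration described above can be sketched in plain C++. The sketch assumes unsigned operands and assumes that the iteration stops early once the remaining bits of the multiplier operand are zero; neither detail is confirmed by the text, and the names `MulResult`, `multiply_32x8_iterative`, and `multiply_cycles` are hypothetical.

```cpp
#include <cstdint>

struct MulResult {
    std::uint64_t product;  // accumulated 64-bit result
    int steps;              // number of 32x8 partial products used (1..4)
};

// Multiply by consuming 8 bits of the multiplier operand per step.
// Each step forms a 32x8 partial product (at most 40 bits) and adds it,
// shifted to its weight, into the 64-bit accumulator.
MulResult multiply_32x8_iterative(std::uint32_t multiplicand, std::uint32_t multiplier) {
    std::uint64_t acc = 0;
    int steps = 0;
    do {
        const std::uint64_t partial =
            static_cast<std::uint64_t>(multiplicand) * (multiplier & 0xFFu);
        acc += partial << (8 * steps);
        multiplier >>= 8;  // next 8 bits of the multiplier operand
        ++steps;
    } while (multiplier != 0 && steps < 4);
    return {acc, steps};
}

// Cycle count model taken from the text: one cycle per 32x8 step, one more
// for the final addition, plus one each for accumulation and for writing a
// 64-bit result.
int multiply_cycles(int steps, bool accumulate, bool long_result) {
    return steps + 1 + (accumulate ? 1 : 0) + (long_result ? 1 : 0);
}
```

Under these assumptions the model reproduces the counts given above: 2 cycles for a multiplier operand that fits in 8 bits, 5 cycles when all 32 bits are significant, and 7 cycles with accumulation and a 64-bit result.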

While the multi-cycle multiply operation is finished, a finish signal is sent from

