DESIGN THE SA8051 - Design and Implementation of the Low-Power Asynchronous Embedded Processor SA8051

This chapter is organized as follows. First, we describe the architecture of the SA8051 and model it in Balsa. We then describe the interface among the CPU, the memory and the external environment, followed by the technique of bypassing the buses and the ALU. Finally, we discuss the optimizations in the control path.

3-1 The Architecture of SA8051

Figure 16: The architecture of SA8051 (block diagram; labels: activate_0r, reset_0r, reset_0a, reset_0d; P0_in, P1_in, P2_in, P3_in; P0_out, P1_out, P2_out, P3_out; rom_addr_0r, rom_addr_0a, rom_addr_0d, rom_en, rom_addr; rom_data_0r, rom_data_0a, rom_data_0d, rom_data, Rom_rfd; handshake interface; 5 MHz)

The global structure of the SA8051 consists of the CPU, ROM, RAM, four input ports and four output ports, as depicted in figure 16. The CPU is activated when the signal activate_0r is set to 1. When it is not activated, the CPU stays in idle mode and consumes little energy. The CPU communicates with the RAM and the ROM through a handshake interface. The four output ports P0_out, P1_out, P2_out and P3_out are each one byte wide, and each is mapped to one RAM location. The four input ports P0_in, P1_in, P2_in and P3_in receive data from the environment. When the signal reset_0d is set, the CPU initializes the contents of all special function registers (SFRs).

Figure 17: The architecture of CPU

Figure 17 depicts the architecture of the CPU in the SA8051. It differs slightly from the synchronous architecture. The ALU (Arithmetic and Logic Unit) has three inputs (T1, T2 and T3), two outputs (result1 and result2) and a PSW (Program Status Word). Most instructions use only two inputs (T1 and T2) and one output (result1). A few instructions, such as MUL, DIV and JMP, use three inputs and two outputs. Some instructions, such as MOV, do not use the ALU at all; skipping the ALU reduces power consumption and improves performance at the cost of extra hardware. The registers T1, T2 and RAR (RAM Address Register) each have an input port and an output port in order to support instructions that use the bypassing technique, which is described in section 3-6. The broken lines in the figure separate the processor core from the peripherals of the SA8051.

Figure 18: Balsa program for main loop of CPU

The main loop of the Balsa program for the CPU takes the form shown in figure 18.

Initially, the SA8051 resets the contents of each SFR (Special Function Register) and the PC (Program Counter). Inside the loop, before an instruction is executed, the CPU checks whether a reset has occurred. It then fetches the first byte of the instruction, which contains the opcode, and increments the PC. Finally, the execution unit decodes the opcode and executes the corresponding operations.
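The loop described above can be sketched as a small Python model. This is only an illustration of the control flow, not the Balsa program of figure 18: the names run, reset_requested and max_steps are hypothetical, the real loop is unbounded, and decoding and execution are replaced by recording the fetched opcode.

```python
def run(rom, max_steps, reset_requested=lambda step: False):
    """Model of the main loop: reset once, then check-reset / fetch / execute."""
    trace = []
    pc = 0                          # the initial reset clears the PC (SFRs omitted here)
    for step in range(max_steps):   # assumption: bounded loop so the sketch terminates
        if reset_requested(step):   # check for a reset before each instruction
            pc = 0
        ir = rom[pc]                # fetch the first byte, which holds the opcode
        pc = (pc + 1) & 0xFFFF      # increment the 16-bit program counter
        trace.append((pc, ir))      # stand-in for decode and execute
    return trace
```

Calling run(rom, 3) on a three-byte ROM image returns one (pc, opcode) pair per fetched byte; passing a reset_requested function that returns true at some step restarts the fetch at address 0.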

3-2 Design the Fetch_ir Unit

Fetching an instruction involves sending an address to the program memory, receiving the corresponding instruction opcode and incrementing the program counter, as shown in figure 19. The names par_b and p_b denote the PAR bus and the P bus of figure 17, respectively. First, pc assigns its value to par_b, and par receives the value from par_b. Then, par assigns its value to p_b, and p_b is sent out on the channel rom_addr. Finally, the bus receives the data from the channel rom_data and assigns it to ir (the instruction register).

Figure 19: Balsa program for Fetch_ir
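As a rough illustration of the fetch sequence, the following Python sketch moves the address through par_b, par and p_b and then loads ir. It is not the Balsa code of figure 19. The dictionary state and the list rom are hypothetical stand-ins for the registers and the ROM channels, and two points are assumptions because the text does not state them: that the returning data travels over p_b, and that the PC is incremented at the end of the sequence.

```python
def fetch_ir(state, rom):
    """One instruction fetch; `state` maps register and bus names to values."""
    state["par_b"] = state["pc"]      # pc drives the PAR bus
    state["par"] = state["par_b"]     # par latches the value from the PAR bus
    state["p_b"] = state["par"]       # par drives the P bus
    rom_addr = state["p_b"]           # p_b is sent out on the rom_addr channel
    state["p_b"] = rom[rom_addr]      # assumption: the data returns over the same bus
    state["ir"] = state["p_b"]        # the bus value is assigned to ir
    state["pc"] = (state["pc"] + 1) & 0xFFFF  # assumption: increment after the fetch
    return state
```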

3-3 Design the ALU

Figure 20 shows the block diagram of the ALU in Balsa. The ALU has six input ports and five output ports, which are described in detail in table 4. The port alu_op selects the operation that the ALU performs. The two input flags src_cy and src_ac are bit 7 and bit 6 of the PSW, respectively. The ALU has three data input ports: src_1, src_2 and src_3. Most instructions use only src_1 and src_2; some instructions, such as MOVC and CJNE, also use src_3. The ALU has two data output ports, result_1 and result_2. Most instructions use only result_1, but some instructions, such as MOVC and CJNE, use both result_1 and result_2. Some operations, such as ADD and SUB, also update the flags.

Figure 20: The block diagram of ALU

I/O Port     Port Type   Port Size   Description
alu_op       in          5 bits      ALU operation code
src_1        in          8 bits      ALU input data
src_2        in          8 bits      ALU input data
src_3        in          8 bits      ALU input data
src_cy       in          1 bit       Carry flag
src_ac       in          1 bit       Auxiliary carry flag
result_1     out         8 bits      ALU result 1
result_2     out         8 bits      ALU result 2
result_cy    out         1 bit       Carry flag
result_ac    out         1 bit       Auxiliary carry flag
result_ov    out         1 bit       Overflow flag

Table 4: The description of the ports in ALU

Balsa adopts syntax-directed compilation. The transparent compilation of a Balsa program into an asynchronous circuit implies that a separate piece of hardware is generated for each expression in the Balsa text. We can therefore reduce area by sharing pieces of hardware. For example, we combine the functions of ADD and SUB and implement them as a shared procedure in Balsa.

Figure 21 is an example of combining the functions of ADD and SUB. The shared procedure either adds src_1, src_2 and the carry-in ci, or subtracts src_2 and ci from src_1. For SUB, it adds src_1, the bitwise complement of src_2 and the complement of src_cy. The shared function also updates the carry, auxiliary carry and overflow flags. Similar shared functions can be written for the bit-wise Boolean operations AND, OR and XOR.
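The shared adder can be modelled in Python as follows. This is a sketch, not the Balsa function of figure 21: the name add_sub and the flag is_sub are hypothetical. One step is an assumption that the text does not spell out: for subtraction the carries of the complemented addition are inverted again, so that the carry and auxiliary carry flags report a borrow, as in the standard 8051 SUBB instruction.

```python
def add_sub(src_1, src_2, src_cy, is_sub):
    """Shared 8-bit adder for ADD/ADDC (is_sub=False) and SUBB (is_sub=True).

    Returns (result, cy, ac, ov).
    """
    b = (~src_2 & 0xFF) if is_sub else src_2       # complement src_2 for SUB
    ci = (src_cy ^ 1) if is_sub else src_cy        # complement the carry-in for SUB
    c4 = ((src_1 & 0x0F) + (b & 0x0F) + ci) >> 4   # carry out of bit 3
    c7 = ((src_1 & 0x7F) + (b & 0x7F) + ci) >> 7   # carry into bit 7
    total = src_1 + b + ci
    c8 = total >> 8                                # carry out of bit 7
    result = total & 0xFF
    ov = c7 ^ c8                                   # signed overflow
    if is_sub:                                     # assumption: flags report a borrow
        return result, c8 ^ 1, c4 ^ 1, ov
    return result, c8, c4, ov
```

For example, add_sub(0x7F, 0x01, 0, False) yields 0x80 with the auxiliary carry and overflow flags set, and add_sub(0x00, 0x01, 0, True) yields 0xFF with the carry flag set to signal the borrow.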

Figure 21: (a) Balsa shared function for ADD and SUB (b) operands assignment used in AddSub function.

3-3 Design the Decoder Unit

After fetching the first byte of the instruction, the CPU decodes the instruction opcode in the register ir and executes the statements associated with that instruction. The 8051 instruction set in table 5 is partially regular, and we exploit this regularity in the decoder to reduce the area cost. For example, in rows 8 to F each column holds the same instruction, differing only in the index of the operand Rn. A similar argument holds for rows 6 and 7. In the table, the regular part is gray, and the irregularity increases toward the top of the table. We therefore decode the instruction first by row (the four least significant bits) and then by column (the four most significant bits) to determine the instruction to execute.

Table 5: Regular (gray part) and Irregular (white part) part of the 8051 instruction set

We need a decoder to judge whether the instruction opcode belongs to the regular or the irregular part. The shared function judge_regular is shown in figure 22. L_ir holds the four least significant bits of the instruction register, and H_ir holds the four most significant bits. If the instruction belongs to the regular part, the register regular is set.

Figure 22: The judge_regular shared function
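The row and column split can be illustrated with a short Python sketch. It is not the Balsa function of figure 22. The threshold first_regular_row = 6 is an assumption taken from the remark that rows 6 to F are regular; the actual gray region of table 5 may exclude some columns, which this simplified test ignores. The helper register_index is hypothetical.

```python
def split_opcode(ir):
    """Return (H_ir, L_ir): the upper and lower four bits of the opcode."""
    return (ir >> 4) & 0xF, ir & 0xF

def judge_regular(ir, first_regular_row=6):
    """Assumption: an opcode is regular when its row (L_ir) is 6 or higher."""
    _, l_ir = split_opcode(ir)
    return l_ir >= first_regular_row

def register_index(ir):
    """For rows 8 to F, the low three bits select the register Rn."""
    return ir & 0x7
```

For instance, the opcode 2BH (ADD A, R3) lies in row B, so it is judged regular and its register index is 3, while 02H (LJMP) lies in row 2 and is judged irregular.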

Most instructions in the regular part share the structure shown in figure 23. They get their first operand from the ROM or the RAM and store it in the register T1. They may then get a second operand from the RAM and store it in the register T2. Finally, they execute the corresponding operation and store the result in the destination register.

Figure 23: The structure of the regular part

3-4 Deal with Bit-Operation Instructions

The 8051 contains 210 bit-addressable locations, of which 128 are at byte addresses 20H through 2FH and the rest are in the special function registers. The instructions that use the bit-addressing mode can be classified into two kinds, as shown in table 6. Instructions of the first kind fetch a bit from the data memory and do not modify it. Instructions of the second kind fetch a bit from the data memory, modify it and write it back.

KIND          MNEMONIC            DESCRIPTION
First kind    JC rel              Jump if Carry set
              JB bit, rel         Jump if bit set
              JBC bit, rel        Jump if bit set and clear bit
              JNB bit, rel        Jump if bit not set
              JNC rel             Jump if Carry not set
              MOV C, bit          Move bit variable
              ANL C, <src-bit>    AND bit with C, AND NOT bit with C
Second kind   CLR bit             Clear bit
              CPL bit             Complement bit
              ORL C, <src-bit>    OR bit with C, OR NOT bit with C

Table 6: Instructions with bit-addressing mode

When a bit-addressable instruction is executed, the byte containing the bit is fetched from the data memory. We store this byte in the register T1 and use a register bit_index to record which bit is to be read or modified. An instruction that modifies the bit does so in the register T1, at the position given by bit_index. Figure 24 shows the Balsa program for handling bit-addressable instructions.

Figure 24: (a) Set the value of the rar and bit_data_index (b) Get byte from the data memory and store it in register T1
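The address computation can be sketched in Python using the standard 8051 bit-address map: bit addresses 00H to 7FH fall in bytes 20H to 2FH, and bit addresses 80H to FFH fall in the SFR whose byte address is the bit address with its low three bits cleared. The function names are hypothetical, and the code only models the idea of figure 24; it is not the Balsa program itself.

```python
def bit_location(bit_addr):
    """Map an 8-bit bit address to (rar, bit_index)."""
    bit_index = bit_addr & 0x07
    if bit_addr < 0x80:
        rar = 0x20 + (bit_addr >> 3)   # internal RAM bytes 20H..2FH
    else:
        rar = bit_addr & 0xF8          # bit-addressable SFR byte
    return rar, bit_index

def read_bit(t1, bit_index):
    """Read one bit of the byte held in T1."""
    return (t1 >> bit_index) & 1

def write_bit(t1, bit_index, value):
    """Return T1 with the selected bit set or cleared; the caller writes it back."""
    mask = 1 << bit_index
    return (t1 | mask) if value else (t1 & ~mask & 0xFF)
```

For example, bit address D7H maps to byte D0H (the PSW), bit 7, which is the carry flag.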

3-5 Handshake Interface to the Memory

Because the RAM and the ROM are synchronous, we add a handshake interface between the memory and the CPU. When the CPU wants to fetch an instruction from the ROM, it sets both signals rom_addr_0r and rom_data_0r and sends out the address. A C-element checks whether rom_addr_0r and rom_data_0r are both set or both reset. When both are set, the ROM is enabled, and after a latency of 6 ns the signal Rom_rfd is set. After a delay of about one clock cycle, the signals rom_addr_0a and rom_data_0a are set. The return-to-zero portion of the handshake protocol then follows. To make it fast, we use the asynchronous CLR input of a D-type flip-flop, so that the acknowledge signals of the CPU are reset quickly when the signal Rom_rfd is reset.
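The behaviour of the C-element used here can be illustrated with a minimal Python model. A Muller C-element sets its output when both inputs are 1, resets it when both inputs are 0, and otherwise holds the previous value. The class name is hypothetical, and treating the output as the ROM enable signal is an assumption made for illustration; the timing (6 ns latency, clock delay) is not modelled.

```python
class CElement:
    """Muller C-element: the output follows the inputs only when they agree."""

    def __init__(self):
        self.out = 0

    def update(self, a, b):
        if a == b:          # both set or both reset: the output changes
            self.out = a
        return self.out     # otherwise the previous output is held
```

With a = rom_addr_0r and b = rom_data_0r, the output rises only after both requests are set and falls only after both have returned to zero.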

The handshake interface between the CPU and the RAM differs slightly from that of the ROM. To read data from the RAM, the CPU sets the signal ram_in_data_0r; to store data in the RAM, it sets the signal ram_out_data_0r. An OR gate therefore combines the two signals. When the signal Ram_en is set, the data is written or read according to the signal Ram_rNw_0d. After a latency of 6 ns the signal Ram_rfd is set, and after a delay of about one clock cycle the acknowledge signals are set.

In the worst case, a memory access is delayed by about two clock cycles because of the handshake interface. It is therefore important to reduce the number of accesses to the ROM and the RAM. For example, in the synchronous 8051, machine cycle 1 of the execution scheme reads the ROM twice, although not all instructions need two bytes from the ROM. The SA8051 avoids these unnecessary reads, which increases performance and also reduces the energy dissipated in the memory.

Figure 25: Handshake interface between Memory and CPU

3-6 Bypassing the Bus and ALU

The 8051 has three buses, the I-Bus, the P-Bus and the PAR-Bus, as shown in figure 17. The synchronous bus implementation can be mimicked in Balsa by introducing the variables IB, PB and PARB. Every communication between two registers then goes through a bus: the source register is first copied to the bus, and the destination register then receives the data from the bus. For example, to copy the content of the register PC to the register PAR, we write

PBus := PC ; PAR := PBus

If we use the bus bypassing technique, the above statements can be rewritten as

PAR := PC

This reduces the area cost because the sequencer component (;) is removed. However, it introduces a multiplexer (BreezeCall component) in front of the destination register when the register is assigned in more than one place. Without bus bypassing, only the write port of the variable PBus needs a multiplexer, and PAR does not need one.

The fewer data transfers are performed, the more energy is saved. Bus bypassing therefore saves energy when it is applied to the frequently used communication paths. Table 7 shows the opportunities for bypassing the bus among the registers.

Source Register                    Destination Register             Bypassed bus
PC (Program Counter)               PAR (Program Address Register)   PARB
Result1 @ Result2 (ALU results)    Buffer                           IB
T1                                 RAR (RAM Address Register)
T1                                 T2                               IB
T2                                 RAR                              IB

Table 7: The opportunity for bypassing the bus
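The trade-off between the two styles can be made concrete with a small counting model in Python. It is purely illustrative and is not derived from the Balsa netlists: it assumes that every transfer through a bus costs two assignments and one sequencer, that a direct transfer costs one assignment, and that a variable needs a write-port multiplexer only when it has more than one distinct writer. The function names and the example transfer list are hypothetical.

```python
from collections import defaultdict

def via_bus(transfers):
    """Cost of routing every (source, destination) pair through one shared bus."""
    writers = {src for src, _ in transfers}
    return {
        "assignments": 2 * len(transfers),   # bus := src ; dst := bus
        "sequencers": len(transfers),        # one ';' per transfer
        "muxed": ["bus"] if len(writers) > 1 else [],
    }

def bypassed(transfers):
    """Cost of assigning every destination directly from its source."""
    writers = defaultdict(set)
    for src, dst in transfers:
        writers[dst].add(src)
    return {
        "assignments": len(transfers),       # dst := src
        "sequencers": 0,
        "muxed": sorted(d for d, s in writers.items() if len(s) > 1),
    }
```

For the three register pairs T1 to T2, T2 to RAR and T1 to RAR, the bus version needs six assignments, three sequencers and a multiplexer on the bus, while the bypassed version needs three assignments, no sequencer and a multiplexer in front of RAR only.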

In the 8051, not all instructions need arithmetic or logic operations. To run faster and save energy, such instructions do not transfer their data to the ALU and do not wait for an ALU operation to complete. For example, the instruction MOV just moves data between registers and needs no arithmetic or logical operation, so it bypasses the ALU. In an asynchronous architecture, this ALU bypassing technique comes naturally.

3-7 Optimizations in Control Path

As described in section 3-3, the transparent compilation of a Balsa program into an asynchronous circuit implies that a separate piece of hardware is generated for each statement in the Balsa text. We can therefore optimize the control path by rewriting the Balsa text.

For example, the 8051 CPU contains the following fragment of the program

The signal isel is a one-bit selection signal. Each of the four statements (S0, S1, P0, P1) represents a piece of hardware. The corresponding handshake circuit generated by Balsa is shown in figure 26 (a).

Figure 26: handshake circuit for the case-statement (a) not optimized (b) optimized

The case-statement can be rewritten as

The corresponding handshake circuit generated by Balsa is shown in figure 26 (b). Comparing the two handshake circuits, figure 26 (a) contains one case component (labeled “@”), two sequencer components (labeled “;”) and two call components (labeled “|”), whereas figure 26 (b) contains only one sequencer and one case component. The circuit in figure 26 (b) is therefore better in terms of area, speed and power.

3-8 Concluding Remarks

In this chapter we presented the architecture of the asynchronous 8051 and modelled it in the Balsa language. We described techniques for optimizing the ALU and the decoder unit of the SA8051, and the method for handling bit-operation instructions. The handshake interface was designed to communicate with the synchronous memory. The bypassing techniques were introduced to reduce power and area cost. Finally, we described optimizations in the control path that follow from the syntax-directed compilation of Balsa.

