RELATED WORK - Design and Implementation of the Low-Power Asynchronous Embedded Processor SA8051

This chapter is organized as follows. We first briefly introduce the architecture of the synchronous 8051. We then describe the classification of asynchronous circuits according to their delay assumptions. Finally, we describe several basic cells generated by the Balsa synthesis tool.

2-1 Overview of 8051

In this section we will describe the instruction set and the architecture of the Intel 8051.

2-1-1 Instruction Set

The 8051 is a complex instruction set computer (CISC). It has 255 instructions of variable length, from one to three bytes, and supports various addressing modes. The opcode of an instruction is always encoded in the first byte. The second and third bytes, if present, are operands. The instruction set is divided into five functional groups: arithmetic, logical, data transfer, Boolean variable and program branching. The 8051 is a Harvard architecture: instruction memory and data memory are separate.

The instruction set provides eight addressing modes [11], as depicted in figure 4:

(a) Register addressing: the register (R0 to R7) is encoded in the three least-significant bits of the instruction opcode.

(b) Direct addressing: the operand is specified by an 8-bit address field in the instruction, representing an address in the internal data RAM or a special-function register (SFR).

(c) Indirect addressing: the instruction specifies a register (R0 or R1) containing the address of the operand in data memory.

(d) Immediate addressing: the constant operand value is part of the instruction.

(e) Relative addressing: a relative address (or offset) is an 8-bit signed value, which is added to the program counter to form the address of the next instruction executed.

(f) Absolute addressing: these instructions allow branching within the current 2K page of code memory by providing the 11 least-significant bits of the destination address.

(g) Long addressing: these instructions include a full 16-bit destination address as bytes 2 and 3 of the instruction.

(h) Indexed addressing: a base register (either the program counter or the data pointer) and an offset (the accumulator) form the effective address for a JMP or MOVC instruction.

(a) Register addressing (e.g. ADD A, R5): Opcode | n

(b) Direct addressing (e.g. ADD A,55H): Opcode | Direct address

(c) Indirect addressing: Opcode | i

(d) Immediate addressing (e.g. ADD A,#44H): Opcode | Immediate data

(e) Relative addressing (e.g. SJMP AHEAD): Opcode | Relative offset

(f) Absolute addressing (e.g. AJMP BACK): A10-A8 and opcode | A7-A0

(g) Long addressing (e.g. LJMP FAR_AHEAD): Opcode | A15-A8 | A7-A0

(h) Indexed addressing (e.g. MOVC A, @A+PC): Base register (PC or DPTR) + Offset (ACC) = Effective address

Figure 4: The 8051 addressing modes
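The address calculations behind modes (e) to (h) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and `pc_next` is assumed to be the program counter after it has been advanced past the current instruction.

```python
def relative_target(pc_next: int, offset: int) -> int:
    """Relative addressing: add a signed 8-bit offset to the program counter."""
    signed = offset - 0x100 if offset & 0x80 else offset  # two's complement
    return (pc_next + signed) & 0xFFFF


def absolute_target(pc_next: int, addr11: int) -> int:
    """Absolute addressing: stay in the current 2K page, replace the low 11 bits."""
    return (pc_next & 0xF800) | (addr11 & 0x07FF)


def long_target(high_byte: int, low_byte: int) -> int:
    """Long addressing: bytes 2 and 3 form the full 16-bit destination."""
    return ((high_byte & 0xFF) << 8) | (low_byte & 0xFF)


def indexed_address(base: int, acc: int) -> int:
    """Indexed addressing: base register (PC or DPTR) plus the accumulator."""
    return (base + (acc & 0xFF)) & 0xFFFF
```

For example, under these assumptions an offset byte of FEH encodes -2, so `relative_target(0x0102, 0xFE)` branches back to 0100H.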

Table 2: The 8051 instruction set. All mnemonics copyrighted Intel Corporation 1980

Table 2 shows the complete instruction set of the 8051. In this table the rows represent the four least significant bits of the opcode, while the columns represent the four most significant bits. Thus, the instruction at entry PiRj has opcode ij in hexadecimal notation. Rows R8 to RF are combined into one row because these instructions differ only in the last three bits, which specify a register. Rows R6 and R7 are also combined into one row because the last bit of the opcode indicates which register (R0 or R1) is used as the indirect address. Note that only one entry (PA R5) in this table does not contain an instruction.
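The table layout described above can be expressed as a short decoding sketch in Python. The function names are hypothetical; the code only reflects the row and column rules stated in the text.

```python
from typing import Optional, Tuple


def table_position(opcode: int) -> Tuple[int, int]:
    """Return (column i, row j) of Table 2 for the 8-bit opcode ij."""
    return (opcode >> 4) & 0xF, opcode & 0xF


def register_operand(opcode: int) -> Optional[str]:
    """Return the register operand selected by the low opcode bits, if any."""
    row = opcode & 0xF
    if row >= 0x8:                 # rows R8..RF: last three bits select Rn
        return f"R{opcode & 0x7}"
    if row in (0x6, 0x7):          # rows R6, R7: last bit selects @R0 or @R1
        return f"@R{opcode & 0x1}"
    return None                    # other rows carry no register field
```

For instance, `table_position(0xA5)` gives column A, row 5, which is the single empty entry of the table.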

2-1-2 Synchronous Architecture

Figure 5: The architecture of the synchronous 8051

Figure 5 shows the architecture of the synchronous 8051 [12]. It has three buses: the IB, PB and PARB buses. The IB bus acts as the communication channel between any two registers. The PB bus acts as the communication channel among the PAR (Program Address Register), Buffer, PC Incrementer, PC and DPTR. The PAR sends out the program address on the PARB bus. The IB bus is 1 byte wide, while the PARB and PB buses are 2 bytes wide. The internal memory consists of on-chip ROM and on-chip data RAM. The on-chip RAM contains a rich arrangement of general-purpose storage, bit-addressable storage, register banks, and special function registers (SFRs). The registers and input/output ports are memory mapped and accessible like any other memory location, and the stack resides within the internal RAM rather than in external RAM. The SFRs take care of the communication between the CPU and the peripherals. There are four bidirectional ports (P0 to P3) for communication to and from the outside world.

The 8051 also includes bit operations, which affect only a single bit in a given register. Only some locations of the internal RAM are bit-accessible, namely addresses 20H to 2FH and some SFRs. Internally, a bit operation is performed by reading the whole byte from internal memory, modifying the single bit, and then writing the value back in the same operation cycle.
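The read-modify-write behaviour can be sketched as follows in Python. The helper names are hypothetical. The sketch assumes the conventional 8051 mapping, not spelled out in the text, in which bit addresses 00H to 7FH cover bytes 20H to 2FH, eight bits per byte; bit-addressable SFRs are not modelled.

```python
from typing import Tuple


def bit_location(bit_addr: int) -> Tuple[int, int]:
    """Map a bit address 00H..7FH to (byte address in 20H..2FH, bit position)."""
    if not 0x00 <= bit_addr <= 0x7F:
        raise ValueError("only the internal-RAM bit space is modelled here")
    return 0x20 + (bit_addr >> 3), bit_addr & 0x7


def write_bit(ram: bytearray, bit_addr: int, value: bool) -> None:
    """Read the whole byte, modify one bit, and write the byte back."""
    byte_addr, pos = bit_location(bit_addr)
    byte = ram[byte_addr]              # read the whole byte
    if value:
        byte |= 1 << pos               # set the selected bit
    else:
        byte &= ~(1 << pos) & 0xFF     # clear the selected bit
    ram[byte_addr] = byte              # write the byte back
```

Only the addressed bit changes; the other seven bits of the byte are written back unmodified.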

Table 3 shows the instruction execution scheme of the synchronous 8051 [13]. Each instruction is executed in one, two or four machine cycles. A machine cycle consists of a sequence of six states, numbered S1 through S6, and each state lasts for two oscillator periods, so a machine cycle takes 12 oscillator periods. Therefore, with an internal clock frequency of 12 MHz the processor completes at most one million machine cycles per second, and the performance cannot exceed 1 MIPS. In each state of the execution scheme a specific action takes place. One-cycle instructions execute only the first machine cycle C1, while two-cycle instructions execute C1 and C2 consecutively.

The scheme results in many redundant cycles during execution because not all actions are required in every machine cycle. For example, two program fetches are generated during each machine cycle, even if the instruction being executed does not require them.
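The timing figures above follow from simple arithmetic, sketched here in Python. The constant and function names are hypothetical; the numbers are those given in the text.

```python
STATES_PER_MACHINE_CYCLE = 6   # S1 through S6
OSC_PERIODS_PER_STATE = 2      # each state lasts two oscillator periods


def instructions_per_second(f_osc_hz: float, machine_cycles: int) -> float:
    """Execution rate for instructions that all take `machine_cycles` cycles."""
    periods = STATES_PER_MACHINE_CYCLE * OSC_PERIODS_PER_STATE * machine_cycles
    return f_osc_hz / periods
```

At 12 MHz a stream of one-cycle instructions reaches one million instructions per second, and any two-cycle or four-cycle instruction in the mix pulls the rate below that.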

Machine cycle C1 (states S1 to S6), actions in order: Access ROM; ACC -> T2; Access RAM; Access ROM; OP -> T1 or T2; ALU -> dest.

Machine cycle C2 (states S1 to S6), actions in order: Access ROM; Calculate jump address; PC incr.; OP -> T1 or T2; ALU -> dest.

Table 3: Instruction execution scheme

When external memory is accessed, Port 0 carries the data byte and the least significant byte of the address, multiplexed on the same pins. Address Latch Enable (ALE) signals external circuitry to latch the address LSB before Port 0 switches to either reading or writing the data byte. If a 16-bit address is used, Port 2 outputs the high byte of the address. In this mode, Port 2 also uses strong internal pull-ups to output the address MSB. Finally, pins 6 and 7 of Port 3 signal a write or a read on the bus respectively. In the SA8051, however, all instructions are in internal memory.

2-2 Classification of Asynchronous Circuits

Figure 6: A circuit fragment with gate and wire delays

At the gate level, asynchronous circuits can be classified as delay-insensitive, quasi-delay-insensitive, speed-independent or self-timed, depending on the delay assumptions that are made [4]. Figure 6 serves to illustrate the following discussion. In this figure there are three gates (A, B, C) and three wires (W1, W2, W3). d_A, d_B and d_C represent the gate delays of A, B and C respectively. d_1, d_2 and d_3 represent the wire delays of W1, W2 and W3 respectively.

(a) Delay-Insensitive (DI): a circuit that operates correctly with positive, bounded but unknown delays in wires and gates. Referring to figure 6 this means arbitrary d_A, d_B, d_C, d_1, d_2 and d_3.

(b) Quasi-Delay-Insensitive (QDI): a QDI circuit is DI with the exception of some carefully identified wire forks called "isochronic forks". Referring to figure 6 this means arbitrary d_A, d_B, d_C, d_1 but d_2 = d_3.

(c) Speed-Independent (SI): an SI circuit operates correctly assuming positive, bounded but unknown delays in gates and ideal zero-delay wires. Referring to figure 6 this means arbitrary d_A, d_B, d_C but d_1 = d_2 = d_3 = 0.

(d) Self-Timed (ST): a self-timed circuit contains a group of self-timed elements. Each element is contained in an “equipotential region”, where wires have negligible or well-bounded delay. An element itself may be an SI circuit, or a circuit whose correct operation relies on use of local timing assumptions. However, no timing assumptions are made on the communication between regions. That is, communication between regions is DI.
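The delay assumptions of classes (a) to (c) can be written down as predicates over the six delays of figure 6. The Python sketch below is only an illustration with hypothetical names: it checks whether a given delay assignment lies within the range a model allows, and it leaves out the self-timed class, whose assumptions are local to each equipotential region.

```python
from dataclasses import dataclass


@dataclass
class Delays:
    d_A: float  # gate delays of A, B, C
    d_B: float
    d_C: float
    d_1: float  # wire delays of W1, W2, W3
    d_2: float
    d_3: float


def within_assumptions(model: str, d: Delays) -> bool:
    """True if the delays are admissible under the named delay model."""
    gates_positive = all(x > 0 for x in (d.d_A, d.d_B, d.d_C))
    wires = (d.d_1, d.d_2, d.d_3)
    if model == "DI":    # arbitrary positive gate and wire delays
        return gates_positive and all(w > 0 for w in wires)
    if model == "QDI":   # as DI, but the fork W2/W3 is isochronic
        return gates_positive and all(w > 0 for w in wires) and d.d_2 == d.d_3
    if model == "SI":    # arbitrary gate delays, ideal zero-delay wires
        return gates_positive and all(w == 0 for w in wires)
    raise ValueError(f"unknown model: {model}")
```

A DI circuit must work for every assignment accepted under "DI", whereas a QDI or SI circuit is only required to work for the narrower sets accepted under "QDI" or "SI".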

2-3 Balsa Back-End

The Balsa back-end generates a gate-level netlist that is imported into the target CAD system to produce a circuit implementation [14]. In this section we describe some basic cells that Balsa generates for Xilinx technology, such as the Muller C-element and the S-element. We also describe some handshake components of the Balsa synthesis system.

2-3-1 Basic Elements

The gate-level netlist generated by Balsa for Xilinx technology uses only a few basic cells: AND, OR, NOR, XOR, NAND, BUF, XNOR, INV, FD (D-type flip-flop), FDC and FDCE. The basic elements are composed of these cells.

i0  i1  q
0   0   0
0   1   no change
1   0   no change
1   1   1

Figure 7: The Muller C-element (a) symbol (b) truth table (c) gate-level implementation

Figure 7 shows the Muller C-element. It is one of the most common additions to the basic set of logic gates, introduced to make asynchronous circuits easier to implement, and it is used extensively in them. It is a state-holding element, like an asynchronous set-reset latch. When both inputs are 0, the output is set to 0. When both inputs are 1, the output is set to 1. For other input combinations the output does not change.
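The behaviour of the C-element can be captured in a small Python model. This is a behavioural sketch with hypothetical names. The `c_next` function uses the common majority form q' = i0·i1 + i0·q + i1·q (three AND gates and one OR gate), which is assumed here and may differ from the exact gate-level circuit of figure 7(c).

```python
class CElement:
    """Behavioural model of a two-input Muller C-element."""

    def __init__(self, q: int = 0) -> None:
        self.q = q  # state-holding output

    def update(self, i0: int, i1: int) -> int:
        if i0 == i1:        # both 0 -> output 0, both 1 -> output 1
            self.q = i0
        return self.q       # otherwise the output does not change


def c_next(i0: int, i1: int, q: int) -> int:
    """Assumed gate-level next-state function: majority of i0, i1 and q."""
    return (i0 & i1) | (i0 & q) | (i1 & q)
```

Driving the model with (1, 0), (1, 1), (0, 1), (0, 0) from an initial output of 0 yields 0, 1, 1, 0: the output follows the inputs only once they agree.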

Figure 8: The NC2P-element (a) symbol (b) truth table (c) gate-level implementation

Figure 8 shows the NC2P element. When i0 is equal to 0, the output is 0. When i0 and i1 are both equal to 1, the output is 1. For other input combinations the output does not change. It is much like the inverter of a C-element, except that when i0 is equal to 0 and i1 is equal to 1, the output is 1.

Figure 9: The S-element (a) symbol (b) gate-level implementation (c) handshaking protocol

Figure 9 shows the S-element, a circuit element commonly found in the implementation of handshake components [1]. An S-element has four pins, forming two request/acknowledge handshake pairs: 'Ar'/'Aa' and 'Br'/'Ba'. The Balsa system replaces the "inverter of C-element" in the S-element with "nc2p". This reduces the number of gates, because the "inverter of C-element" uses 3 AND gates, 1 OR gate and 1 inverter, whereas "nc2p" uses 2 AND gates, 1 NOR gate and 1 inverter.

Figure 10: The multiplexer (a) function block (b) truth table (c) gate level implementation

Figure 11: The de-multiplexer (a) function block (b) truth table (c) gate level implementation

Figures 10 and 11 show the multiplexer and de-multiplexer elements. They are used in many elements, such as the Balsa full adder and BrzCase.

2-3-2 Handshake Components

Balsa has about 40 components that use handshake signaling for communication. Each of these handshake components has a concrete gate-level implementation. In the following we illustrate some handshake components [14].

Figure 12: The Fetch Component (a) handshake component (b) gate level implementation

Figure 12 shows the Fetch component. This component is the most common way of controlling a datapath from a control tree. Fetch components, also called transferrers, implement assignment, input and output channel operations in Balsa by pulling a data value from a pull datapath and pushing it towards a push datapath [14].

Figure 13: The Sequence Component (a) handshake component (b) gate level implementation

Figures 13 and 14 show the sequence and concurrent components respectively. They form a large part of handshake circuit control trees [14]. They are used to activate a number of commands under the control of an activate handshake.

Figure 14: The Concurrent Component (a) handshake component (b) gate level implementation

Figure 15: The Variable Component (a) handshake component, with ports write, read[0] and read[1] (b) gate level implementation

Figure 15 shows the variable component. It uses D-type flip-flops to store data, and the clock source of the flip-flops is the signal write_0r. To store a piece of data, the signal write_0r is set and then reset. To read the data, the signal read_0r or read_1r is set. Because the flip-flops are clocked only when a write is requested, the component naturally achieves the effect of clock gating.
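The storage behaviour of the variable component can be sketched as a small Python model. The class and method names are hypothetical, the acknowledge signals are omitted, and the flip-flops are assumed to capture data on the rising edge of write_0r.

```python
class Variable:
    """Behavioural model of a variable with one write and two read ports."""

    def __init__(self, width: int = 8) -> None:
        self.mask = (1 << width) - 1
        self.value = 0
        self._write_0r = 0  # previous level of the write request

    def drive_write(self, write_0r: int, data: int) -> None:
        # write_0r acts as the clock: store only on its rising edge (assumed)
        if write_0r and not self._write_0r:
            self.value = data & self.mask
        self._write_0r = write_0r

    def read(self, port: int) -> int:
        # port 0 stands for read_0r, port 1 for read_1r
        if port not in (0, 1):
            raise ValueError("the modelled variable has two read ports")
        return self.value
```

While write_0r stays low the stored value never toggles, which is the clock-gating effect described above.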

2-4 Concluding Remarks

In this chapter we introduced the synchronous 8051 architecture. The 8051 is a complex instruction set computer with variable-length instructions of one to three bytes. Each state of a machine cycle uses the bus, so it is not easy to overlap the execution of instructions, i.e. to implement pipelining. We then introduced the classification of asynchronous circuits: they can be classified as SI, DI, QDI or ST, depending on the delay assumptions. Finally, we illustrated the Balsa back-end. The Balsa synthesis system is composed of about 40 handshake components. Each can be translated to a gate-level netlist, and they communicate using a handshaking protocol.
