Designing a Scalable Communication Scheme for On-Chip Multiprocessors

Tsung-Hsien Chen, Wei-Chung Hsu

Department of Electrical Engineering and Computer Science Columbia University

New York, USA tc2225@columbia.edu wh2138@columbia.edu

1. Introduction

We develop a scalable network-on-chip design that supports communication across a grid of processors hosted on a single chip. Our approach is to design an on-chip switch and a network protocol for cross-processor communication. We also develop interfaces between the cores, the switch, and memory that translate the regular bus signals into on-chip packets.

Section 2 describes related work and the source code and modules we build on. Section 3 describes our system architecture, including the block diagram, the MIPS core, the external memory, and the interfaces. Section 4 explains the design of our on-chip network. Section 5 describes our implementation. Section 6 presents our system evaluation.

2. Related Work

We took the source code of the MIPS R2000 core from SoC-MOBINET [1], a 32-bit MIPS R2000 implemented in SystemC. It also uses the Timer module from the AVR core, another project on the same website, so the AVR core must be installed as well. We tried to borrow an idea from the MPARM paper [2], which uses a linker script to distribute instructions and data across multiple cores. However, we could not figure out how to do this with a linker script, so we load the complete program image into every core in the system.

3. System Architecture

Our system consists of four kinds of components:

(1) a 50 MHz MIPS R2000 core (up to 4 cores), (2) a 50 MHz external memory module (up to 4), (3) 1 GHz interfaces connecting the processors and memory modules to the routers, and (4) a 1 GHz on-chip network. We raise the network clock so that a processor can receive data from memory in the same cycle in which it sends the request. Figure 1 shows the block diagram of our system.

Figure 1: Block-Diagram with 4 cores architecture

Icore0~Icore3 are the core-side interfaces and Imem0~Imem3 are the memory-side interfaces. There are four kinds of operations in this network architecture: (1) read request from core to memory (r-req type), (2) write request from core to memory (w-req type), (3) read response from memory to core (r-res type), and (4) write response from memory to core (w-res type).

For the read request operation, the CPU core sends its read request signals, such as the address, the read/write signal, and the byte-select signal, to the core-side interface. The interface translates these signals into a packet and sends it to the router. After the router forwards the packet to its destination, the memory-side interface translates the packet back into the original request signals and sends them to memory.


After the r-req, the memory-side interface is held until the memory sends the data signal back. The interface then translates this data signal into an r-res packet and sends it to the switch. Finally, the core-side interface receives the r-res packet and translates it back into regular signals for the core.

For the write request operation, the CPU core sends the write request along with the data signal to the core-side interface, which sends w-req packets to the switch. The memory-side interface receives the packets, translates them into data signals, and writes to the external memory. At the same time, the memory-side interface sends a w-res packet back, which the core-side interface treats as the acknowledge signal.

3.1 MIPS R2000 Core structure

The original MIPS R2000 design from SoC-MOBINET has two memory modules, data memory and instruction memory, and loads the whole binary file into both. Since we want communication between cores, we break the connection between the processor and the data memory and place the network between them. The CPU core is unaware of the external memory; it simply sends and receives regular signals to and from the interface. We leave the core structure unchanged except for the connection between the processor and the data memory. In other words, each core still has its own instruction memory and loads its own program code. Each core therefore runs an independent program and communicates with the others through the external memory. This is our workaround for implementing on-chip communication. The core runs at 50 MHz.

3.2 External memory Structure

We reuse the original memory module from the MIPS R2000 implementation as our external memory, because it can directly process the data-access signals from the CPU core, including the address and byte-select functions. The size of each memory bank is 0x8000 = 32 KB; this size can be changed in the source code. According to the linker file we derived from the original MIPS R2000 code, the starting address of the first memory bank is 0x0750.

Since we connect several memory modules, we need a way to specify which memory bank an access targets. Our approach uses modulo arithmetic on the address: the quotient selects the bank and the remainder is the shifted address within it. For example, address 0xabcd falls in bank 1 with shifted address 0x2bcd. Each core can therefore access all memory banks, but it must take care of race conditions and reader-writer issues. This module runs at 50 MHz.

3.3 Interface Structure

There are two kinds of interfaces, core-side and memory-side. Here we focus on the FSM of each interface. Figure 2 shows the FSM for the core-side interface. It starts in S0, waiting for CPU request signals. In the r-req case, when a CPU request arrives, the interface translates the signals into an r-req packet, sends the r-req header to the switch, and moves to S2. In S2 it sends the second packet, the address packet, to the switch and moves to S3. The interface is then held in S3 until it receives the response packets from the memory side. When it receives the r-res header packet, it moves to S8. In S8 it receives the r-res address packet and moves to S9. In S9 it receives the r-res data packet, translates it into CPU data signals, sends them to the CPU, and returns to S0 for the next request.

In the w-req case, the interface receives data and address signals from the CPU. It then sends the w-req header packet to the switch and moves to S5. In S5 it sends the w-req address packet and moves to S6. In S6 it sends the w-req data packet and moves to S7, where it is held until the w-res packet arrives from the memory side. When it receives the w-res header from the switch, it moves to S10. In S10 it receives the w-res address packet and returns to S0 for the next request.

Figure 2: FSM for core-side interface implementation


Figure 3 shows the FSM for the memory-side interface. It starts in S11, waiting for packets from the core side. In the r-req/r-res case, the interface first receives the r-req header packet and moves to S12. In S12 it receives the r-req address packet, translates it into regular signals, sends them to memory, and moves to S13, where it is held until the memory sends the data signal back. In S13, when it receives the data signal from memory, it sends the r-res header packet to the switch and moves to S19. In S19 it sends the r-res address packet to the switch and moves to S20. In S20 it sends the r-res data packet to the switch and returns to S11 for the next core-side packet.

In the w-req/w-res case, the interface first receives the w-req header packet and moves to S14. In S14 it receives the w-req address packet, stores it in a 32-bit register for use in the next cycle, and moves to S15. In S15 it receives the w-req data packet, translates it into regular signals, and sends them to memory. At the same time, it sends the w-res header back to the switch and moves to S17. These signals must reach memory in the same cycle, which is why the interface must hold all incoming packets until the last data packet arrives. In S17 it sends the w-res address packet to the switch and returns to S11 for the next core-side packet.

Figure 3: FSM for memory-side interface implementation

4. On-Chip Network

We divide our on-chip network design into three parts: network topology, communication protocol, and switching mechanism with routing algorithm. We discuss each in turn.

We select a fat-tree topology as our network topology because it scales to connect many processors. It also uses fewer routers than a mesh or torus topology when connecting more than four processors. In our design, each router connects to up to four parents and four children.

For example, to connect two processors and two memory modules, we use one router (Figure 4a). To connect four processors and four memory modules, we use four routers, two child routers and two parent routers (Figure 4b).

Figure 4

To support this topology, each switch requires eight input ports and eight output ports. Half of them connect to the children of the switch and the others connect to its parents (Figure 5).

Figure 5


As for the communication protocol, we treat the processor as master and the memory module as slave. All communication starts with the processor sending a request, to which the memory module replies with a response. The processor issues two possible requests, write request and read request. After receiving a packet from the processor, the memory module replies with a write response or a read response, according to the type of packet it received. Here we make a simplifying assumption: all transmissions in the network are lossless. In other words, we do not have to handle the case in which terminals must re-send packets because of packet loss. In our design, every request and response is split into several flits according to its type. Figure 6 shows what each type of packet is made of.

Figure 6

From Figure 6, we can see that there are three different kinds of flit. The address flit and the data flit simply carry a memory address and data from the processor. The only special flit, which does not exist in an ordinary processor, is the header flit. The structure of the header flit is shown in Figure 7. There are five fields: the packet type indicates the type of the packet; the processor id and memory id are unique addresses of processors and memory modules, and to avoid all-zero headers, memory ids start from 32; the byte select carries the core's byte-select signal; and the waiting cycle records how many cycles the packet has been blocked and waiting in the input buffer.

Figure 7

As for the switching mechanism and routing algorithm, our approach is based on wormhole switching: a packet must cross a channel entirely before the channel can be used by another packet. If a header flit arrives and finds that none of the possible outputs is available, it is blocked and waits in the input buffer. After a header flit is transmitted through an output path, all following flits are transmitted along the same path until the whole packet has been transmitted. Figure 8 shows the state diagram of the switching operation.

Figure 8

We need to take care of two things while switching: how we determine whether a flit is a header, and how we allocate an output path and to whom. In our design, if the destination of a packet is a child of the router, there is only one possible output path. Otherwise, the packet must go through a parent of the router, and all paths connected to the router's parents are possible output paths. After receiving a header flit, the router tries to find an available output path and allocate it to the packet. We use pointers in both input and output buffers: on allocation, the pointer in the input buffer points to the allocated output buffer, and the pointer in the output buffer points back to the input buffer. One question about allocation is: if more than one packet tries to allocate the same path in the same cycle, which should the router select? Our approach is to select the packet with the largest number of waiting cycles first, to reduce packet latency. Packets not selected are blocked and wait in the input buffer; for every cycle they wait, the waiting-cycle field in their header flit is incremented by one. If packets contending for the same path have the same number of waiting cycles, we simply pick one at random.

Packet composition (Figure 6; each flit is 32 bits):

Packet Type       Flits
Write request     Header, Address, Data
Read request      Header, Address
Write response    Header, Address
Read response     Header, Address, Data

Header flit fields (Figure 7):

Packet Type | Processor ID | Memory ID | Byte Select | Waiting Cycle | Unused


To allocate output paths, the router needs to know whether a flit is a header, since it only allocates an output path for a header flit. In our design, we use a counter in every output buffer and set it to zero at boot. While the counter is zero, the router expects the next flit it receives to be a header.

After receiving the header flit, the router tries to allocate an available output path to the packet and, if successful, sets the counter of that output path to the number of flits in the packet. For example, if the packet is a read request, it has two flits, so the router sets the counter in the output buffer to two. Whenever a flit goes through the output path, the counter is decreased by one. When the counter reaches zero, the router knows the packet has been completely transmitted and the output path is available to other packets.

5. Implementation

Our implementation is written in SystemC and runs on the CLIC machines, which are Linux-based. We also installed a GCC cross-compiler for MIPS and downloaded the test program from SoC-MOBINET for our testing. To compile and obtain a binary file, we need the "crt0.s" assembly file, a bootstrap file that initializes the stack pointer and jumps to the main() function. This bootstrap file had a small bug: a "beq" was written where a "bne" belongs, so the test program never actually ran. We fixed this bug and it works well on our platform. We had the compiler produce the assembly file and checked it against the MIPS core output; everything is correct and works fine.

This is a large project that requires a lot of installation. First, you need to install the cross-compiler to compile the test programs; we provide the install script "setup_cross_compiler.sh". Second, install SystemC; we offer the script "setup_systemc.sh" for this step. Third, install the MIPS core and our modified files with the script "setup_mips_core.sh". This is the tricky part, because the original source code is messy and contains many customized environment paths. It is therefore not guaranteed to work in your account, and you may need to change some paths in the Makefile.

Right now we have four test programs, named test0.c, test1.c, test2.c and test3.c, so that we can load one into each core of the 4-core architecture. You can modify these test programs to suit what you want to test. In addition, we provide three main files for the three architectures: "main_1core.cpp" for one core, "main_2core.cpp" for two cores, and "main_4core.cpp" for four cores. Rename one of these to main.cpp and recompile to get that system architecture.

6. Evaluation

Since each core has its own program code, we write a C program for each core. The following listing shows one of our test programs. We use "array = (int *)(0xaddress)" to specify the memory area this core will use. This is a very simple test program that just reads and writes the external memory bank; we dump all the on-chip network packets to check that the traffic is correct.

int *array;

int main(void) {
    int a, b, i;
    a = 1;
    b = 2;
    a = b + 3;
    array = (int *)(0x0800);
    array[0] = 1;
    for (i = 1; i < 100; i++)
        array[i] = array[i-1] + 1;
    a = array[1];
    b = array[10];
    return 0;
}

7. Conclusion

For us, the most difficult part of this project was not implementing the network architecture but integrating the processors with the routers. For example, there appears to be no memory stall in the MIPS R2000 implementation. As a result, we had to raise the clock frequency of the network, which caused clock issues between the components; these issues sometimes made the whole system behave incorrectly. Besides, we were stuck for a long time on how to distribute a program's instructions and data across several cores and memory modules, and we still think our current approach does not faithfully simulate a true multiprocessor program.

References


[1] SoC-MOBINET: courseware elements, http://www2.imm.dtu.dk/~jan/socmobinet/courseware/elements/mips_core.htm

[2] Luca Benini and Davide Bertozzi, "MPARM: Exploring the Multi-Processor SoC Design Space with SystemC", Springer Journal of VLSI Signal Processing, vol. 41, no. 2, pp. 169-182, 2005.
