Figure 2-1 shows a high-level block diagram of the CBE processor hardware. The CBE processor is a multicore processor with nine processor elements in total and an on-chip shared coherent memory. The processor elements are of two kinds: the PowerPC Processor Element (PPE) and the Synergistic Processor Element (SPE). There is one PPE and there are eight identical SPEs. All processor elements are connected to each other and to the on-chip memory and I/O controllers by the memory-coherent element interconnect bus (EIB).
Figure 2-1 Block Diagram of Cell Broadband Engine
PowerPC Processor Elements
The PowerPC Processor Element (PPE) is a general-purpose, dual-threaded, 64-bit RISC processor that conforms to the PowerPC Architecture, with the vector/SIMD multimedia extensions. The PPE consists of two main units, the PowerPC processor unit (PPU) and the PowerPC processor storage subsystem (PPSS) as shown in Figure 2-2.
Figure 2-2 PPE Block Diagram
The PPU performs instruction execution. It has a level-1 (L1) instruction cache, an L1 data cache, and six execution units. It can load 32 bytes and store 16 bytes independently and memory-coherently per processor cycle. The PPSS handles memory requests from the PPU and external requests to the PPE from SPEs or I/O devices. It has a unified level-2 (L2) instruction and data cache. The PPU, the PPSS, and their functional units are shown in Figure 2-3.
Figure 2-3 PPE Functional Units
The PPU can be further divided into the following functional units:
• Instruction Unit (IU)
The IU contains a 32 KB, 2-way set-associative, reload-on-error L1 instruction cache with a cache-line size of 128 bytes. The IU performs the instruction-fetch, decode, dispatch, issue, and completion portions of execution.
• Branch Unit (BRU)
The BRU performs branch operations.
• Fixed-Point Unit (FXU)
The FXU performs fixed-point operations, including add, multiply, divide, compare, shift, rotate, and logical instructions.
• Load and Store Unit (LSU)
The LSU contains a 32 KB, 4-way set-associative, write-through L1 data cache with a cache-line size of 128 bytes. The LSU performs all data accesses, including load and store instructions.
• Vector/Scalar Unit (VSU)
The VSU contains a floating-point unit (FPU) and a 128-bit vector/SIMD multimedia extension unit (VXU), which together execute floating-point and vector/SIMD multimedia extension instructions.
• Memory Management Unit (MMU)
The MMU contains a 64-entry segment look-aside buffer (SLB) and a 1024-entry, unified, parity-protected translation look-aside buffer (TLB). The MMU manages address translation for all memory accesses.
The PPSS handles all memory accesses by the PPU and memory-coherence operations from the element interconnect bus (EIB). The PPSS has a unified, 512 KB, 8-way set-associative, write-back L2 cache with error-correction code (ECC). The L2 cache-line size is 128 bytes, the same as the L1 cache-line size. The PPSS performs data prefetch for the PPU, and bus arbitration and pacing onto the EIB. The PPU's MMU, L1 instruction cache, and L1 data cache receive data from the PPSS through a shared 32-byte load port. The MMU and the L1 data cache send data to the PPSS through a shared 16-byte store port. The interface between the PPSS and the EIB supports 16-byte load and 16-byte store buses. One storage access occurs at a time, and all accesses appear to occur in program order. The interface supports resource allocation management.
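The cache parameters quoted above (capacity, associativity, and line size) determine how many sets each cache has. The following is a minimal Python sketch of that arithmetic, not a description of the actual hardware indexing; the function name cache_sets and the variable names are chosen here for illustration only.

```python
KB = 1024
LINE_BYTES = 128  # cache-line size shared by the L1 and L2 caches


def cache_sets(size_bytes: int, ways: int, line_bytes: int = LINE_BYTES) -> int:
    """Return the number of sets in a set-associative cache."""
    lines, remainder = divmod(size_bytes, line_bytes)
    if remainder or lines % ways:
        raise ValueError("size must be a whole number of ways x lines")
    return lines // ways


# Figures taken from the text: 32 KB 2-way L1 I-cache, 32 KB 4-way L1 D-cache,
# 512 KB 8-way L2 cache, all with 128-byte lines.
l1i_sets = cache_sets(32 * KB, ways=2)
l1d_sets = cache_sets(32 * KB, ways=4)
l2_sets = cache_sets(512 * KB, ways=8)

if __name__ == "__main__":
    print(l1i_sets, l1d_sets, l2_sets)  # 128 64 512
```

Under these figures the L1 instruction cache has 128 sets, the L1 data cache 64 sets, and the L2 cache 512 sets.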
Synergistic Processor Elements
The eight Synergistic Processor Elements (SPEs) execute a new single instruction, multiple data (SIMD) instruction set—the Synergistic Processor Unit Instruction Set Architecture. Each SPE is a 128-bit RISC processor specialized for data-rich, compute-intensive SIMD and scalar applications. It consists of two main units, the synergistic processor unit (SPU) and the memory flow controller (MFC), as shown in Figure 2-4.
Figure 2-4 SPE Block Diagram
The local store (LS) of each SPU is a 256 KB, error-correcting code (ECC)-protected, single-ported, noncaching memory. It stores all instructions and data used by the SPU. It supports one access per cycle from either SPE software or DMA transfers. SPU instruction prefetches are 128 bytes per cycle. SPU data-access bandwidth is 16 bytes per cycle, quadword aligned. DMA-access bandwidth is 128 bytes per cycle. DMA transfers perform a read-modify-write of the LS for writes of less than a quadword.
Each SPU has its own MFC. The MFC serves as the SPU’s interface, by means of the element interconnect bus (EIB), to main storage and to other processor elements and system devices. The MFC’s primary role is to interface its LS storage domain with the main-storage domain. It does this by means of a DMA controller that moves instructions and data between its LS and main storage. The MFC also supports storage protection on the main-storage side of its DMA transfers, synchronization between main storage and the LS, and communication functions (such as mailbox and signal-notification messaging) with the PPE and other SPEs and devices.
Figure 2-5 SPE Functional Units
Figure 2-5 shows the SPE functional units. The SPU can issue two instructions per cycle, one to each of its two execution pipelines. The pipelines are referred to as even (pipeline 0) and odd (pipeline 1). Whether an instruction goes to the odd or the even pipeline depends on the instruction type. The functional units of the SPU are as follows:
• SPU Odd Fixed-Point Unit (SFS)
The SFS executes byte shift, rotate mask, and shuffle operations on quadwords.
• SPU Load and Store Unit (SLS)
The SLS executes load and store instructions and hint-for-branch instructions. It also handles DMA requests to the LS.
• SPU Control Unit (SCN)
The SCN fetches and issues instructions to the two pipelines. It performs control functions such as executing branch instructions and arbitrating access to the LS and the register file.
• SPU Channel and DMA Unit (SSC)
The SSC manages communication, data transfer, and control into and out of the SPU.
• SPU Even Fixed-Point Unit (SFX)
The SFX executes arithmetic instructions, logical instructions, word SIMD shifts and rotations, floating-point comparisons, and floating-point reciprocal and reciprocal square-root estimations.
• SPU Floating-Point Unit (SFP)
The SFP executes single-precision and double-precision floating-point instructions, 16-bit integer multiplies and conversions, and byte operations. The 32-bit multiplies are implemented in software using 16-bit multiplies.
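The remark that 32-bit multiplies are built in software from 16-bit multiplies can be illustrated with the standard partial-product decomposition. The Python sketch below shows the general technique only; it is not the instruction sequence actually emitted for the SPU, and the helper names mul16 and mul32_low are hypothetical.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mul16(a: int, b: int) -> int:
    """Model of a 16 x 16 -> 32-bit unsigned hardware multiply."""
    if not (0 <= a <= MASK16 and 0 <= b <= MASK16):
        raise ValueError("operands must fit in 16 bits")
    return a * b


def mul32_low(a: int, b: int) -> int:
    """Low 32 bits of a 32 x 32 product, using only 16-bit multiplies."""
    a &= MASK32
    b &= MASK32
    a_hi, a_lo = a >> 16, a & MASK16
    b_hi, b_lo = b >> 16, b & MASK16
    # a*b = a_hi*b_hi*2^32 + (a_hi*b_lo + a_lo*b_hi)*2^16 + a_lo*b_lo.
    # The first term vanishes modulo 2^32, so three multiplies suffice.
    low = mul16(a_lo, b_lo)
    cross = (mul16(a_hi, b_lo) + mul16(a_lo, b_hi)) & MASK16
    return (low + (cross << 16)) & MASK32


if __name__ == "__main__":
    print(hex(mul32_low(0x12345678, 0x9ABCDEF0)))
```

Because only the low 32 bits are kept, the same result is valid for signed and unsigned two's-complement operands.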
Element Interconnect Bus
Figure 2-6 shows the element interconnect bus (EIB), which is the communication path for commands and data among the PPE, SPEs, main system memory, and external I/O. The EIB data network consists of four 16-byte-wide data rings: two running clockwise and two running counterclockwise. Each ring allows up to three concurrent data transfers, as long as their paths do not overlap.
Figure 2-6 Element Interconnect Bus (EIB)
To initiate a data transfer, bus elements must request data bus access. The EIB data bus arbiter processes these requests and decides which ring should handle each request. The arbiter always selects one of the two rings that travel in the direction of the shortest transfer, thus ensuring that the data will not need to travel more than halfway around the ring to its destination. The arbiter also schedules the transfer to ensure that it will not interfere with other in-flight transactions. To minimize stalls on reads, the arbiter gives priority to requests coming from the memory controller. It treats all others equally in round-robin fashion. Thus, certain communication patterns will be more efficient than others.
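The direction choice and the non-overlap condition described above can be sketched with a toy ring model in Python. This is an illustration under stated assumptions, not the real arbiter: the ring is assumed to have 12 attachment points numbered 0 to 11 (the text does not give the number or order of units), ties are broken in favour of the clockwise direction, and all function names are hypothetical.

```python
N_UNITS = 12  # assumed number of bus units on the ring; not stated in the text


def shortest_direction(src: int, dst: int, n_units: int = N_UNITS):
    """Pick the ring direction with the fewest hops from src to dst."""
    if src == dst:
        raise ValueError("source and destination must differ")
    cw_hops = (dst - src) % n_units
    ccw_hops = (src - dst) % n_units
    # Assumption: a tie (exactly halfway) goes clockwise.
    return ("cw", cw_hops) if cw_hops <= ccw_hops else ("ccw", ccw_hops)


def links_used(src: int, dst: int, n_units: int = N_UNITS):
    """Return the direction and the set of ring links a transfer occupies."""
    direction, hops = shortest_direction(src, dst, n_units)
    step = 1 if direction == "cw" else -1
    links, pos = set(), src
    for _ in range(hops):
        nxt = (pos + step) % n_units
        links.add((pos, nxt))
        pos = nxt
    return direction, links


def can_share_ring(t1, t2, n_units: int = N_UNITS) -> bool:
    """True if two (src, dst) transfers go the same way without overlapping."""
    d1, l1 = links_used(t1[0], t1[1], n_units)
    d2, l2 = links_used(t2[0], t2[1], n_units)
    return d1 == d2 and l1.isdisjoint(l2)


if __name__ == "__main__":
    print(shortest_direction(0, 9))          # ('ccw', 3)
    print(can_share_ring((0, 3), (3, 6)))    # True: paths do not overlap
    print(can_share_ring((0, 3), (2, 5)))    # False: both need link 2 -> 3
```

In this model no transfer ever needs more than half the ring, which mirrors the halfway guarantee stated in the text.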
The EIB operates at half the processor-clock speed. Each EIB unit can simultaneously send and receive 16 bytes of data every bus cycle. The EIB’s maximum data bandwidth is limited by the rate at which addresses are snooped across all units in the system, which is one address per bus cycle. Each snooped address request can potentially transfer up to 128 bytes, so in a 3.2 GHz Cell processor, the theoretical peak data bandwidth on the EIB is 128 bytes × 1.6 GHz = 204.8 GB/s.
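The peak-bandwidth figure follows directly from the numbers in the paragraph above. A short Python sketch of the arithmetic, with variable names chosen here for illustration, also gives the per-unit port bandwidth implied by 16 bytes per bus cycle in each direction.

```python
CPU_CLOCK_HZ = 3_200_000_000          # 3.2 GHz processor clock
EIB_CLOCK_HZ = CPU_CLOCK_HZ // 2      # the EIB runs at half the processor clock
BYTES_PER_SNOOPED_REQUEST = 128       # at most one snooped address per bus cycle
PORT_BYTES_PER_CYCLE = 16             # each unit sends and receives 16 bytes

# Theoretical peak for the whole EIB, in GB/s (1 GB = 10^9 bytes).
eib_peak_gb_s = BYTES_PER_SNOOPED_REQUEST * EIB_CLOCK_HZ / 1e9

# Peak for one unit's port, per direction, in GB/s.
port_peak_gb_s = PORT_BYTES_PER_CYCLE * EIB_CLOCK_HZ / 1e9

if __name__ == "__main__":
    print(f"EIB peak:  {eib_peak_gb_s:.1f} GB/s")
    print(f"Port peak: {port_peak_gb_s:.1f} GB/s per direction")
```

The first value reproduces the 204.8 GB/s quoted above; the second, 25.6 GB/s per direction, is derived here from the stated 16 bytes per bus cycle.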
However, the actual data bandwidth achieved on the EIB depends on several factors: the relative locations of the source and destination, the chance that a new transfer interferes with transfers in progress, the number of Cell chips in the system, whether data transfers are to or from memory or between local stores in the SPEs, and the data arbiter’s efficiency. In non-ideal cases, the achieved EIB bandwidth falls below the theoretical peak.
Inter Processor Communication
The Cell Broadband Engine (CBE) has many attributes of a shared-memory system. The PowerPC Processor Element (PPE) and all Synergistic Processor Elements (SPEs) have coherent access to main storage. However, the CBE processor is not a traditional shared-memory processor: an SPE can execute programs and directly access data only in its own local store (LS). Because it lacks direct access to shared memory, an SPE must use three primary communication mechanisms to communicate with other elements on the EIB: DMA transfers, mailbox messages, and signal notification. All three mechanisms are controlled by the SPE’s memory flow controller (MFC). They are summarized as follows:
• DMA transfers
Used to move data and instructions between main storage and a local store (LS). An MFC supports naturally aligned DMA transfer sizes of 1, 2, 4, 8, and 16 bytes and multiples of 16 bytes. For naturally aligned 1-, 2-, 4-, and 8-byte transfers, the source and destination addresses must have the same 4 least significant bits. A single DMA command can transfer up to 16 KB between an LS and main storage. The throughput of a DMA transfer whose source and destination addresses are 128-byte aligned is double that of a misaligned transfer within a cache line. This is because a misaligned transfer is a partial cache-line transfer and may require two bus requests. Peak performance is achieved when the size of the transfer is a multiple of 128 bytes and both the effective address (EA) and the local store address (LSA) of the DMA transfer are 128-byte aligned. SPEs rely on asynchronous DMA transfers to hide memory latency and transfer overhead by moving data in parallel with synergistic processor unit (SPU) computation.
An MFC has only 16 entries in its MFC SPU command queue. A DMA list is a sequence of eight-byte list elements, stored in an SPE’s LS, each of which describes a DMA transfer; a DMA list command occupies only one entry in the SPU command queue. DMA list commands can be initiated only by SPU programs, not by other devices. A DMA list command can specify up to 2048 DMA transfers, each up to 16 KB in length. Thus, a DMA list command can transfer up to 32 MB, which is 128 times the size of the 256 KB LS, more than enough to accommodate future increases in the size of the LS. The space required for the maximum-size DMA list is 16 KB.
DMA list commands are used to move data between a contiguous area in an SPE’s LS and a possibly noncontiguous area in the effective address space.
• Mailboxes
Used for control communication between an SPE and the PPE or other devices. Mailboxes support the sending and buffering of 32-bit messages. Each SPE can access three mailbox channels, each of which is connected to a mailbox register in the SPU’s MFC. Two are one-entry channels, the SPU Write Outbound Mailbox and the SPU Write Outbound Interrupt Mailbox, which send messages from the SPE to the PPE or other devices. The third is a four-entry channel, the SPU Read Inbound Mailbox, which receives messages from the PPE or from other SPEs or devices.
• Signal notification
Used for control communication from the PPE or other devices. SPE signal-notification channels are connected to inbound registers (into the SPE). The PPE, other SPEs, and other devices use the signal-notification registers to send information, such as a buffer-completion synchronization flag, to an SPE. An SPE has two 32-bit signal-notification registers, each of which has a corresponding MMIO register that can be written with signal-notification data.
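The DMA size and alignment rules and the DMA list limits described above can be checked with a small Python model. This is a sketch under assumptions, not MFC code: it treats "naturally aligned" for sizes of 16 bytes and above as 16-byte alignment of both addresses, rejects zero-length transfers for simplicity, and uses hypothetical function names (is_valid_dma, is_peak_dma, split_transfer).

```python
KB = 1024
MAX_DMA_BYTES = 16 * KB          # largest single DMA command
SMALL_SIZES = (1, 2, 4, 8)       # sub-quadword transfer sizes
LIST_ELEMENT_BYTES = 8           # size of one DMA list element
MAX_LIST_ELEMENTS = 2048         # transfers per DMA list command
LS_BYTES = 256 * KB

# Limits quoted in the text, derived from the constants above.
MAX_LIST_TRANSFER_BYTES = MAX_LIST_ELEMENTS * MAX_DMA_BYTES   # 32 MB
MAX_LIST_STORAGE_BYTES = MAX_LIST_ELEMENTS * LIST_ELEMENT_BYTES  # 16 KB


def is_valid_dma(lsa: int, ea: int, size: int) -> bool:
    """Check the size and alignment rules for one DMA command."""
    if size in SMALL_SIZES:
        # Naturally aligned, and both addresses share the low 4 bits.
        return lsa % size == 0 and ea % size == 0 and (lsa & 0xF) == (ea & 0xF)
    if size <= 0 or size % 16 != 0 or size > MAX_DMA_BYTES:
        return False  # zero length is treated as invalid in this sketch
    return lsa % 16 == 0 and ea % 16 == 0


def is_peak_dma(lsa: int, ea: int, size: int) -> bool:
    """True when the transfer meets the peak-performance conditions."""
    return (is_valid_dma(lsa, ea, size)
            and size % 128 == 0 and lsa % 128 == 0 and ea % 128 == 0)


def split_transfer(lsa: int, ea: int, total: int):
    """Split a 16-byte-aligned transfer into commands of at most 16 KB."""
    if total <= 0 or total % 16 or lsa % 16 or ea % 16:
        raise ValueError("sketch handles only 16-byte-aligned transfers")
    chunks, offset = [], 0
    while offset < total:
        size = min(MAX_DMA_BYTES, total - offset)
        chunks.append((lsa + offset, ea + offset, size))
        offset += size
    return chunks


if __name__ == "__main__":
    for chunk in split_transfer(0x0, 0x10000, 40 * KB):
        print(chunk, is_peak_dma(*chunk))
```

A 40 KB transfer, for example, splits into two 16 KB commands and one 8 KB command, each of which could also be expressed as one element of a single DMA list command so that only one command-queue entry is used.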