Cell Broadband Engine

The Cell Broadband Engine is the first incarnation of a new family of microprocessors conforming to the Cell Broadband Engine Architecture (CBEA, or, informally, "Cell"). The CBEA is a new architecture that extends the 64-bit PowerPC Architecture. The CBEA and the Cell Broadband Engine are the result of collaboration between Sony, Toshiba, and IBM, known as STI, formally started in early 2001 [2][3].

Figure 2-1 shows the block diagram of the CBE processor hardware. The CBE processor is a multi-core processor with nine processor elements and a shared, coherent memory on a single chip. The nine elements fall into two categories: the PowerPC Processor Element (PPE) and the Synergistic Processor Element (SPE). The CBE processor contains one PPE and eight SPEs. On the PlayStation 3, only six SPEs are available to programmers, a restriction that improves chip production yield.

Figure 2-1 The block diagram of the CBE processor (BEI: Cell Broadband Engine Interface; PPE: PowerPC Processor Element)

The PPE complies with the 64-bit PowerPC Architecture and can run both 32-bit and 64-bit operating systems and applications. The SPE, on the other hand, is optimized for running compute-intensive applications. The PPE and the SPEs can work collaboratively: the PPE runs the operating system and the top-level thread control for applications, while the SPEs provide the computing power to boost application performance. Brief block diagrams of the PPE and SPE are shown in Figure 2-2 [4].

Figure 2-2 Brief block diagrams of PPE and SPE

To the application programmer, the Cell processor can be viewed as a 9-way multiprocessor. The PPE is suited to control-intensive tasks and task switching. The SPEs are suited to compute-intensive tasks but not to task switching. The more significant difference between the SPE and the PPE lies in how they access memory. The PPE can access main storage at all 2⁶⁴ memory addresses, also called effective addresses (EA), with the help of caches.

The SPEs, in contrast, access main storage through direct memory access (DMA) commands issued explicitly by programmers. Each SPE has its own 256-KB local store (LS). The LS serves as a scratchpad memory for its SPE and can be accessed by the PPE or other SPEs via DMA. Table 2-1 summarizes some differences between the PPE and the SPE.

Table 2-1 Differences between PPE and SPE

| Feature | PPE | SPE |
|---|---|---|
| Addressability | 2⁶⁴ bytes | 256-KB LS |
| Load latency | variable (cache) | 6 cycles [20] |
| 128-bit SIMD registers | 32 | 128 |
| Doubleword SIMD | no | yes |
| Usage | control | computation |

PowerPC Processing Elements

The PowerPC Processor Element (PPE) is a general-purpose, dual-threaded, 64-bit RISC processor with vector/SIMD multimedia extensions. The PPE is responsible for overall control of a CBE system and runs the operating systems. The PPE consists of two main units, as shown in Figure 2-3 [4]: the PowerPC processor unit (PPU), which performs computation, and the PowerPC processor storage subsystem (PPSS), which handles storage.

Figure 2-3 The PowerPC Processor Element (PPE).

The PPU can be further divided into the following units.

• Instruction Unit (IU)

The IU contains a 32-KB, 2-way set-associative, reload-on-error L1 instruction cache. The cache-line size is 128 bytes. The IU performs the instruction-fetch, decode, dispatch, issue, and completion portions of execution.

• Branch Unit (BRU)

The BRU executes branch instructions.

• Fixed-Point Unit (FXU)

The FXU performs fixed-point operations, including add, multiply, divide, compare, shift, rotate, and logical instructions.

• Load and Store Unit (LSU)

The LSU contains a 32-KB, 4-way set-associative, write-through L1 data cache. The cache-line size is 128 bytes. The LSU performs all data accesses, including load and store instructions.

• Vector/Scalar Unit (VSU)

The VSU contains a floating-point unit (FPU) and a 128-bit vector/SIMD multimedia extension unit (VXU), which together execute floating-point and vector/SIMD multimedia extension instructions.

• Memory Management Unit (MMU)

The MMU contains a 64-entry segment look-aside buffer (SLB) and a 1024-entry, unified, parity-protected translation look-aside buffer (TLB). The MMU manages address translation for all memory accesses.

The PPSS contains a unified, 512-KB, 8-way set-associative, write-back L2 cache with error-correction code (ECC). The L2 cache-line size is 128 bytes, the same as the L1 cache-line size. The PPSS handles all memory accesses by the PPU and memory-coherence (snooping) operations from the element interconnect bus (EIB). The PPSS also performs data prefetch for the PPU, as well as bus arbitration and pacing onto the EIB. The MMU, L1 instruction cache, and L1 data cache of the PPU receive data from the PPSS through a shared 32-byte load port. The MMU and L1 data cache send data to the PPSS through a shared 16-byte store port. The interface between the PPSS and the EIB supports 16-byte load and 16-byte store buses.
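The cache parameters above determine how many sets each cache has, since the number of sets equals the cache size divided by the product of associativity and line size. The following is a minimal sketch of that arithmetic using only the sizes stated in the text; the function name `cache_sets` is a hypothetical helper, not part of any Cell toolchain.

```python
def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache: size / (ways * line size)."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("cache size must be a multiple of ways * line size")
    return sets


KB = 1024
LINE_BYTES = 128  # cache-line size stated in the text for L1 and L2

# Geometry taken from the text: (size, associativity)
l1_icache_sets = cache_sets(32 * KB, 2, LINE_BYTES)   # L1 instruction cache
l1_dcache_sets = cache_sets(32 * KB, 4, LINE_BYTES)   # L1 data cache
l2_cache_sets = cache_sets(512 * KB, 8, LINE_BYTES)   # unified L2 cache

print(l1_icache_sets, l1_dcache_sets, l2_cache_sets)
```

Under these parameters the L1 instruction cache has 128 sets, the L1 data cache has 64 sets, and the L2 cache has 512 sets.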

Synergistic Processor Elements

Figure 2-4 The synergistic processor elements (SPE)

Each SPE is a 128-bit RISC processor for data-rich, compute-intensive applications. It consists of two main units, the synergistic processor unit (SPU) and the memory flow controller (MFC), as shown in Figure 2-4 [4]. The data interface consists of a 128-bit read bus and a 128-bit write bus. The MFC supports up to 16 outstanding MFC commands. It also supports atomic requests and snoop requests (read and write) of the SPU's LS memory and the MFC's MMIO registers.

Figure 2-5 The functional units in SPU

Figure 2-5 shows the functional units in the SPU. The SPU can issue two instructions at a time, one to each of its two execution pipelines. The pipelines are referred to as even (pipeline 0) and odd (pipeline 1). The SPU contains the following units.

• SPU Odd Fixed-Point Unit (SFS)

The SFS executes byte shift, rotate mask, and shuffle operations on quadwords.

• SPU Load and Store Unit (SLS)

The SLS executes load and store instructions and hint-for-branch instructions. It also handles DMA requests to the LS.

• SPU Control Unit (SCN)

The SCN fetches and issues instructions to the two pipelines. It also performs control functions such as executing branch instructions and arbitrating access to the LS and the register file.

• SPU Channel and DMA Unit (SSC)

The SSC manages communication, data transfer, and control into and out of the SPU.

• SPU Even Fixed-Point Unit (SFX)

The SFX executes arithmetic instructions, logical instructions, word SIMD shifts and rotations, floating-point comparisons, and floating-point reciprocal and reciprocal square-root estimations.

• SPU Floating-Point Unit (SFP)

The SFP executes single-precision and double-precision floating point instructions, 16-bit integer multiplies and conversions, and byte operations. The 32-bit multiplies are implemented in software using 16-bit multiplies.
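The last point, building a 32-bit multiply from 16-bit multiplies, can be illustrated with a short sketch. It is not the SPU instruction sequence itself; `mul16` is a hypothetical stand-in for a 16-bit by 16-bit hardware multiply, and the code only shows the decomposition into partial products for the low 32 bits of an unsigned product.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mul16(a, b):
    """Hypothetical stand-in for a 16-bit x 16-bit -> 32-bit hardware multiply."""
    assert 0 <= a <= MASK16 and 0 <= b <= MASK16
    return a * b


def mul32_low(a, b):
    """Low 32 bits of an unsigned 32-bit x 32-bit product, using only mul16."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16

    low = mul16(a_lo, b_lo)                        # contributes at bit 0
    cross = mul16(a_hi, b_lo) + mul16(a_lo, b_hi)  # contributes at bit 16
    # a_hi * b_hi would be shifted left by 32 bits, so it never reaches
    # the low 32-bit word and can be skipped.
    return (low + ((cross & MASK16) << 16)) & MASK32


print(hex(mul32_low(0x12345678, 0x9ABCDEF0)))
```

Three 16-bit multiplies plus shifts and adds are enough for the low word, which is why the cost of a 32-bit multiply is noticeably higher than that of a 16-bit one.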

Element Interconnect Bus

Figure 2-6 [5] shows the element interconnect bus (EIB), the heart of the Cell processor’s communication architecture, which enables communication among the PPE, the SPEs, main system memory, and external I/O. The EIB has separate communication paths for commands and data. The EIB data network consists of four 16-byte data rings: two running clockwise and the other two counterclockwise. Each ring allows up to three concurrent data transfers, as long as their paths don’t overlap.

Bus elements request the data bus to initiate a data transfer. The data bus arbiter gives highest priority to requests from the memory controller, which minimizes read stalls, and treats all other requests equally in round-robin fashion. The arbiter receives these requests and decides which ring should handle each one. It selects one of the two rings that travel in the direction of the shortest path, which ensures that data never travels more than halfway around the ring to its destination. The arbiter also schedules each transfer so that it does not interfere with other in-flight transactions. The EIB operates at half the processor clock speed. Each bus element can simultaneously send and receive 16 bytes of data every bus cycle.
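The ring-direction rule can be sketched as a small function. This is an illustration only: the position numbering, the default of 12 bus elements on the ring, and the tie-breaking choice are assumptions made for the example, and the real arbiter also considers conflicts with in-flight transfers.

```python
def shortest_direction(src, dst, num_elements=12):
    """Pick the ring direction with fewer hops from src to dst.

    Positions are numbered 0..num_elements-1 in clockwise order
    (a hypothetical numbering chosen for this sketch).
    Returns (direction, hops).
    """
    if not (0 <= src < num_elements and 0 <= dst < num_elements):
        raise ValueError("position out of range")
    clockwise_hops = (dst - src) % num_elements
    counter_hops = (src - dst) % num_elements
    # Ties (exactly halfway) go clockwise here; that choice is arbitrary.
    if clockwise_hops <= counter_hops:
        return "clockwise", clockwise_hops
    return "counterclockwise", counter_hops


print(shortest_direction(0, 3))
print(shortest_direction(0, 9))
```

Because the shorter of the two directions is always chosen, the hop count never exceeds half the number of elements on the ring.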

Figure 2-6 The element interconnect bus (EIB)

The EIB's maximum data bandwidth is limited by the rate at which addresses are snooped across all units in the system, which is one address per bus cycle. Each snooped address request can potentially transfer up to 128 bytes, so in a 3.2 GHz Cell processor, where the EIB runs at 1.6 GHz, the theoretical peak data bandwidth on the EIB is 128 bytes × 1.6 GHz = 204.8 GB/s. The maximum bandwidth of the Cell processor is summarized in Figure 2-7. The actual data bandwidth, however, depends on several factors: the relative locations of the source and destination, interference between a new transfer and in-flight transfers, and the efficiency of the data arbiter.
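The peak-bandwidth arithmetic can be reproduced in a few lines. The sketch below uses only figures stated in the text (a 3.2 GHz processor clock, an EIB at half that speed, 128 bytes per snooped address, and 16 bytes per bus element per bus cycle); the constant and function names are hypothetical.

```python
CPU_CLOCK_HZ = 3.2e9
EIB_CLOCK_HZ = CPU_CLOCK_HZ / 2   # the EIB runs at half the processor clock
BYTES_PER_SNOOP = 128             # maximum data moved per snooped address
PORT_WIDTH_BYTES = 16             # bytes a bus element moves per bus cycle


def gbytes_per_sec(bytes_per_cycle, clock_hz):
    """Bandwidth in GB/s (10^9 bytes per second)."""
    return bytes_per_cycle * clock_hz / 1e9


eib_peak = gbytes_per_sec(BYTES_PER_SNOOP, EIB_CLOCK_HZ)
port_peak = gbytes_per_sec(PORT_WIDTH_BYTES, EIB_CLOCK_HZ)

print(f"EIB peak: {eib_peak:.1f} GB/s")
print(f"Per element, per direction: {port_peak:.1f} GB/s")
```

The first value is the 204.8 GB/s peak quoted above. The second, 25.6 GB/s per element in each direction, is consistent with the 25.6 GB/s labels that appear in Figure 2-7.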

Figure 2-7 Maximum bandwidth of Cell processor (link labels in the figure: 25.6 GB/s and 51.2 GB/s)

Inter-processor Communication

The PowerPC processor element (PPE) and all synergistic processor elements (SPEs) share the main storage and have coherent access to it. Each SPE's memory flow controller (MFC) implements and controls the communication mechanisms. To communicate with other bus elements in the system, an SPE must explicitly use one of three mechanisms: DMA transfers, mailbox messages, or signaling messages. Table 2-2 summarizes these three mechanisms.

Table 2-2 Three primary mechanisms of interprocessor communication

| Mechanism | Description |
|---|---|
| Signaling | Used for control communication from the PPE or other devices. Signaling utilizes 32-bit registers for one-sender-to-one-receiver signaling or many-senders-to-one-receiver signaling. |
| Mailboxes | Used for control communication between an SPE and the PPE or other devices. Each SPE has two mailboxes for sending and one mailbox for receiving 32-bit messages. |
| DMA Transfers | Used for data communication between main storage and an LS of the SPE. The asynchronous DMA transfers of the MFC hide the memory latency and transfer overhead by moving data in parallel with SPU computation. |
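The two signaling styles in Table 2-2 can be pictured with a toy model of a single 32-bit signal register. The sketch assumes that many-to-one signaling accumulates senders' bits with a logical OR, that one-to-one signaling overwrites the register, and that a read clears it. These behaviors and the class name `SignalRegister` are assumptions made for illustration, not details given in the text.

```python
MASK32 = 0xFFFFFFFF


class SignalRegister:
    """Toy model of one 32-bit signal register (assumed semantics)."""

    def __init__(self, or_mode):
        self.or_mode = or_mode  # True: many-to-one, False: one-to-one
        self.value = 0

    def send(self, bits):
        bits &= MASK32
        if self.or_mode:
            self.value |= bits   # many senders: accumulate bits
        else:
            self.value = bits    # single sender: overwrite

    def read(self):
        """Return the pending value and clear the register."""
        value, self.value = self.value, 0
        return value


many_to_one = SignalRegister(or_mode=True)
many_to_one.send(1 << 0)   # hypothetical sender A sets bit 0
many_to_one.send(1 << 3)   # hypothetical sender B sets bit 3
print(bin(many_to_one.read()))
```

In the many-to-one case each sender can be given its own bit, so the receiver learns which senders signaled from a single read.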

DMA

An MFC supports naturally aligned DMA transfer sizes of 1, 2, 4, 8, and 16 bytes, as well as multiples of 16 bytes. For naturally aligned 1-, 2-, 4-, and 8-byte transfers, the source and destination addresses must have the same 4 least significant bits (LSB). A single DMA command can transfer up to 16 KB between an LS and main storage.
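The size and alignment rules above can be expressed as a simple validity check. This is a sketch under stated assumptions: the function name `is_valid_dma` is hypothetical, transfers of 16 bytes or more are assumed to require 16-byte alignment of both addresses, and zero-length transfers are treated as invalid for simplicity.

```python
MAX_DMA_BYTES = 16 * 1024  # largest single DMA command, per the text


def is_valid_dma(ls_addr, ea, size):
    """Check one DMA command against the size and alignment rules."""
    if size in (1, 2, 4, 8):
        # Small transfers: naturally aligned, and both addresses must
        # agree in their 4 least significant bits.
        same_low_bits = (ls_addr & 0xF) == (ea & 0xF)
        return same_low_bits and ls_addr % size == 0 and ea % size == 0
    if 0 < size <= MAX_DMA_BYTES and size % 16 == 0:
        # Assumption: 16-byte alignment for 16 bytes and its multiples.
        return ls_addr % 16 == 0 and ea % 16 == 0
    return False


print(is_valid_dma(0x100, 0x1000, 128))   # aligned multiple of 16
print(is_valid_dma(0x104, 0x2008, 4))     # low 4 bits differ
```

A check like this is cheap to run in debug builds before a transfer is issued, since a malformed command is otherwise hard to diagnose.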

When the source and destination addresses are 128-byte aligned, a DMA transfer achieves twice the throughput of a misaligned transfer within a cache line. The reason is that a misaligned transfer is a partial cache-line transfer and may require two bus requests. Peak performance is achieved when the size of the transfer is a multiple of 128 bytes and both the effective address (EA) and the local store address (LSA) of the DMA transfer are 128-byte aligned. This leads to the following performance guidelines for DMA commands on the CBE.

• Minimize small transfers.

• Align source and destination addresses to a 128-byte cache-line boundary.

• Minimize the use of synchronizing and data-ordering commands.

• Have the SPEs (not the PPE) initiate DMA transfers. The reasons are stated in Table 2-3.

Table 2-3 Reasons for having the SPEs initiate DMA commands

| Feature | SPE | PPE |
|---|---|---|
| Number of processors | 8 | 1 |
| MFC command queue entries | 16 | 8 |
| Synchronization | easy | hard |
| Number of cycles to initiate a DMA transfer | smaller | larger |
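The first two guidelines can be illustrated with a small planning helper that splits a large buffer into as few DMA commands as possible and checks whether each one meets the peak-performance conditions. This is a sketch rather than SDK code: `plan_transfers` and `is_peak_friendly` are hypothetical names, and the helper only handles total sizes that are multiples of 16 bytes.

```python
MAX_DMA_BYTES = 16 * 1024  # largest single DMA command
CACHE_LINE = 128           # cache-line size in bytes


def plan_transfers(ls_addr, ea, total_bytes):
    """Split a transfer into commands of at most 16 KB each.

    Returns a list of (ls_addr, ea, size) tuples.
    """
    if total_bytes <= 0 or total_bytes % 16:
        raise ValueError("total size must be a positive multiple of 16")
    commands = []
    offset = 0
    while offset < total_bytes:
        size = min(MAX_DMA_BYTES, total_bytes - offset)
        commands.append((ls_addr + offset, ea + offset, size))
        offset += size
    return commands


def is_peak_friendly(ls_addr, ea, size):
    """True if size and both addresses are multiples of 128 bytes."""
    return all(v % CACHE_LINE == 0 for v in (ls_addr, ea, size))


plan = plan_transfers(ls_addr=0x0, ea=0x10000, total_bytes=40 * 1024)
for cmd in plan:
    print(cmd, is_peak_friendly(*cmd))
```

Because 16 KB is itself a multiple of 128 bytes, a buffer that starts on a cache-line boundary keeps every full-size chunk aligned as well.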

Mailbox

Mailboxes carry 32-bit messages between an SPE and other devices. Each SPE has three mailbox channels: two one-entry channels and one four-entry channel. The two one-entry channels, the SPU Write Outbound Mailbox and the SPU Write Outbound Interrupt Mailbox, send messages from the SPE to the PPE or other bus elements. The four-entry channel, the SPU Read Inbound Mailbox, receives messages sent by the PPE or other bus elements to the SPE. Table 2-4 gives details about the mailbox channels and their associated MMIO registers.

Table 2-4 The mailbox channels and their associated MMIO registers

| Mnemonic | SPU_Out_Mbox | SPU_In_Mbox | SPU_Out_Intr_Mbox |
|---|---|---|---|
| # of Entries | 1 | 4 | 1 |
| R/W | R | W | R |
| Width (bits) | 32 | 32 | 64 |
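The queue depths in Table 2-4 can be modeled with a toy bounded mailbox. The sketch below only captures the entry counts and the 32-bit message width described in the text; the `Mailbox` class, its method names, and the choice to report a full mailbox by returning `False` are assumptions for illustration and do not reflect the hardware's stall or overwrite behavior.

```python
from collections import deque


class Mailbox:
    """Toy bounded FIFO of 32-bit messages (illustrative model only)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self._entries = deque()

    def try_write(self, message):
        """Queue a message; return False if the mailbox is full."""
        if len(self._entries) >= self.capacity:
            return False
        self._entries.append(message & 0xFFFFFFFF)
        return True

    def try_read(self):
        """Return the oldest message, or None if the mailbox is empty."""
        return self._entries.popleft() if self._entries else None


def make_spe_mailboxes():
    """One set of mailboxes per SPE, with the depths from Table 2-4."""
    return {
        "SPU_Out_Mbox": Mailbox("SPU_Out_Mbox", 1),
        "SPU_In_Mbox": Mailbox("SPU_In_Mbox", 4),
        "SPU_Out_Intr_Mbox": Mailbox("SPU_Out_Intr_Mbox", 1),
    }


boxes = make_spe_mailboxes()
inbound = boxes["SPU_In_Mbox"]
accepted = [inbound.try_write(i) for i in range(5)]
print(accepted)
```

The fifth write is rejected because the inbound mailbox holds only four entries, which shows why a sender should check for free space before writing.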
