This chapter introduces the core architecture of Analog Devices' Blackfin processor and its dual-core version, the BF561.
4.1 Blackfin Core
Blackfin processors are a new breed of 16-/32-bit embedded processor designed specifically to meet the computational demands and power constraints of today’s embedded audio, video and communications applications. Based on the Micro Signal Architecture (MSA) jointly developed with Intel Corporation, Blackfin processors combine a 32-bit RISC-like instruction set and dual 16-bit multiply-accumulate (MAC) signal processing functionality with the ease-of-use attributes found in general-purpose microcontrollers. This combination of processing attributes enables Blackfin processors to perform equally well in both signal processing and control processing applications, in many cases eliminating the need for separate heterogeneous processors. This capability greatly simplifies both the hardware and software design implementation tasks.
As shown in Figure 4.1, the Blackfin core contains two 16-bit multipliers, two 40-bit accumulators, two 40-bit arithmetic logic units (ALUs), four 8-bit video ALUs, and a 40-bit shifter, along with the other functional units. The computational units process 8-, 16-, or 32-bit data from the register file. The compute register file contains eight 32-bit registers. When performing compute operations on 16-bit operand data, the register file operates as 16 independent 16-bit registers. All operands for compute operations come from the multiported register file and instruction constant fields. Each MAC can perform a 16- by 16-bit multiply per cycle, with accumulation to a 40-bit result. Signed and unsigned formats, rounding, and saturation are supported. The ALUs perform a traditional set of arithmetic and logical operations on 16-bit or 32-bit data. Many special instructions are included to accelerate various signal processing tasks. These include bit operations such as field extract and population count, divide primitives, saturation and rounding, and sign/exponent detection. The set of video instructions includes byte alignment and packing operations, 16-bit and 8-bit adds with clipping, 8-bit average operations, and 8-bit subtract/absolute value/accumulate (SAA) operations. Also provided are the compare/select and vector search instructions.
Figure 4.1: Blackfin core architecture.
CHAPTER 4. THE ARCHITECTURE OF ANALOG DEVICE BF561 23
For some instructions, two 16-bit ALU operations can be performed simultaneously on register pairs [4].
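The MAC and SAA behaviour described above can be illustrated in portable C. The following is only a minimal sketch, not Blackfin code: it assumes integer (not fractional) arithmetic, models the 40-bit accumulator inside a 64-bit integer, and uses hypothetical function names (`sat40`, `mac16`, `dot_dual_mac`, `saa8`) that do not correspond to any Blackfin instruction or library call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Limits of a signed 40-bit accumulator, modelled in a 64-bit integer. */
#define ACC40_MAX ((int64_t)0x7FFFFFFFFFLL)
#define ACC40_MIN (-ACC40_MAX - 1)

/* Clamp a value to the signed 40-bit range (saturation). */
int64_t sat40(int64_t v)
{
    if (v > ACC40_MAX) return ACC40_MAX;
    if (v < ACC40_MIN) return ACC40_MIN;
    return v;
}

/* One MAC step: signed 16 x 16-bit multiply, saturating 40-bit accumulate. */
int64_t mac16(int64_t acc, int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;  /* always fits in 32 bits */
    return sat40(acc + product);
}

/* Dot product that spreads the work over two accumulators, mimicking
 * two MAC units operating side by side (an illustration only). */
int64_t dot_dual_mac(const int16_t *x, const int16_t *y, size_t n)
{
    int64_t a0 = 0, a1 = 0;
    size_t i;

    assert(x != NULL && y != NULL);
    for (i = 0; i + 1 < n; i += 2) {
        a0 = mac16(a0, x[i], y[i]);
        a1 = mac16(a1, x[i + 1], y[i + 1]);
    }
    if (i < n)                      /* odd-length tail */
        a0 = mac16(a0, x[i], y[i]);
    return sat40(a0 + a1);
}

/* Subtract / absolute value / accumulate over a run of 8-bit values. */
uint32_t saa8(const uint8_t *p, const uint8_t *q, size_t n, uint32_t acc)
{
    assert(p != NULL && q != NULL);
    for (size_t i = 0; i < n; ++i) {
        int diff = (int)p[i] - (int)q[i];
        acc += (uint32_t)(diff < 0 ? -diff : diff);
    }
    return acc;
}
```

Keeping two accumulators is what lets two multiplies proceed per iteration. The 40-bit width leaves headroom above a 32-bit product, so many products can be summed before saturation takes effect.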
4.2 Blackfin ADSP-BF561
The ADSP-BF561 is a member of the Blackfin processor family targeting consumer multimedia applications. At the heart of this device are two independent enhanced Blackfin processor cores that offer high performance and low power consumption while retaining their ease-of-use and code-compatibility benefits. As shown in Figure 4.2, the two Blackfin cores are connected through a fairly complex bus system. In addition to L1 instruction SRAM and L1 data SRAM, there is an L2 SRAM that runs at roughly half the speed of L1 SRAM and can be accessed by both cores.
Figure 4.2: Block diagram of BF561 architecture.
4.2.1 Memory Hierarchy
Blackfin products support a modified Harvard architecture in combination with the hierarchical memory structure shown in Figure 4.2. Generally speaking, a hierarchical memory architecture consists of multiple levels of memory blocks that run at different speeds, from fast to slow. The memory block nearest the processor core usually runs at the highest speed and is called Level 1 (L1) memory; the levels that follow are called L2, L3, and so on. A hierarchical memory structure is designed to be cost- and power-effective.
The Level 1 (L1) memory of the Blackfin BF561 operates at the full processor speed with little or no latency. At the L1 level, the instruction memory holds instructions, the data memory holds data, and a dedicated scratchpad data memory stores stack and local variable information.
The L1 instruction SRAM consists of 32 KB of SRAM, of which 16 KB can be configured as a four-way set-associative cache. When configured as general instruction SRAM, it can hold data as well as instructions. However, data placed in the instruction SRAM can be moved only by DMA; the core cannot read data from L1 instruction SRAM directly.
The L1 data SRAM consists of two banks of 32 KB each. Half of each bank is always configured as SRAM, while the other half can be configured as SRAM or as a two-way set-associative cache. In addition, there is a 4 KB block of L1 scratchpad SRAM, which runs at full speed but is accessible only as data SRAM and cannot be configured as cache memory.
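To make the set-associative organisation concrete, the sketch below computes which set an address maps to in a generic set-associative cache. The capacity and associativity come from the text; the 32-byte line size and the simple "line number modulo number of sets" mapping are assumptions made for illustration, and the actual Blackfin address-to-set mapping may differ. All names (`cache_geometry`, `cache_num_sets`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Geometry of a generic set-associative cache. */
typedef struct {
    uint32_t size_bytes;  /* total cache capacity                    */
    uint32_t ways;        /* associativity (lines per set)           */
    uint32_t line_bytes;  /* bytes per cache line (assumed value)    */
} cache_geometry;

/* Number of sets = capacity / (ways * line size). */
uint32_t cache_num_sets(cache_geometry g)
{
    assert(g.ways > 0 && g.line_bytes > 0);
    return g.size_bytes / (g.ways * g.line_bytes);
}

/* Set selected by an address: line number modulo the number of sets. */
uint32_t cache_set_index(cache_geometry g, uint32_t addr)
{
    return (addr / g.line_bytes) % cache_num_sets(g);
}

/* Tag stored with the line: the remaining upper part of the line number. */
uint32_t cache_tag(cache_geometry g, uint32_t addr)
{
    return (addr / g.line_bytes) / cache_num_sets(g);
}
```

Under these assumptions, a 16 KB four-way cache has 128 sets and a 16 KB two-way cache has 256 sets. Two addresses with the same set index but different tags compete for the ways of one set.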
For safe memory access, the Memory Management Unit (MMU) provides memory protection for individual tasks that may be operating on the core and can protect system registers from unintended access.
The two cores of the ADSP-BF561 share an on-chip L2 memory system, which provides high-speed SRAM access with somewhat longer latency than the L1 memory banks. The L2 memory is a unified instruction and data memory and can hold any mixture of code and data required by the system. It can be configured only as SRAM, not as a cache. It can, however, be marked cacheable, which means that its contents can be cached by the data cache. The total L2 SRAM size in the BF561 is 128 KB.
The L1 instruction SRAM and data SRAM are divided into 4 KB sub-banks, which the DMA and the core can access independently and simultaneously.
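One practical consequence is that buffers can be placed so that the core and the DMA work in different sub-banks. The helper below is a sketch under two assumptions: that contention arises only when both accesses fall in the same 4 KB sub-bank, and that `bank_base` is a hypothetical, 4 KB-aligned start address of an L1 bank supplied by the caller (no real BF561 addresses are implied).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SUBBANK_BYTES 4096u  /* 4 KB sub-bank size, as stated in the text */

/* Index of the sub-bank that holds addr, counted from bank_base. */
uint32_t subbank_of(uint32_t bank_base, uint32_t addr)
{
    assert(addr >= bank_base);
    return (addr - bank_base) / SUBBANK_BYTES;
}

/* True when a core access and a DMA access land in the same sub-bank,
 * i.e. when they could not be served independently (assumed model). */
bool subbank_conflict(uint32_t bank_base, uint32_t core_addr, uint32_t dma_addr)
{
    return subbank_of(bank_base, core_addr) == subbank_of(bank_base, dma_addr);
}
```

A linker description or buffer allocator could use such a check to keep a DMA destination buffer and the core's working buffer in separate sub-banks.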
External (off-chip) memory is accessed via the External Bus Interface Unit (EBIU). This 32-bit EBIU provides a glueless connection to as many as four banks of synchronous DRAM (SDRAM) and as many as four asynchronous memory devices, including flash memory, EPROM, ROM, SRAM, and memory-mapped I/O devices. The PC133-compliant SDRAM controller can be programmed to interface to up to 512 MB of SDRAM.
4.2.2 DMA Support
A notable feature of the ADSP-BF561 architecture is its two DMA controllers. DMA is well known for efficient data movement and is found not only in general-purpose CPUs but also in DSP processors. The advantage of the DMA controllers in the BF561 is that they connect to the internal L1 SRAM and L2 SRAM over independent buses. This is unusual: in most other processors, DMA devices are attached to the main bus and share bus access with the processor cores and the other devices on that bus, which is the origin of the term “cycle stealing”. Because DMA accesses on the BF561 are independent, cycle stealing does not occur, so using DMA on the BF561 can yield higher performance.
Although DMA accesses to internal L1 and L2 SRAM benefit from independent buses, every access to external SDRAM is controlled by the EBIU. In this case there appears to be little difference between core access and DMA access. DMA access can nevertheless be more efficient, because it uses burst reads and writes.
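The benefit of burst transfers can be seen in a simple cost model. This is a rough sketch, not a timing model of the BF561 or its EBIU: it assumes that every access pays a fixed overhead (for example, address setup) plus a per-word transfer cost, and that a burst pays the overhead once for a whole group of words. The parameters and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles to move n words one at a time: each word pays the overhead. */
uint64_t single_access_cycles(uint64_t n, uint64_t overhead, uint64_t per_word)
{
    return n * (overhead + per_word);
}

/* Cycles to move n words in bursts of burst_len words:
 * the overhead is paid once per burst rather than once per word. */
uint64_t burst_access_cycles(uint64_t n, uint64_t burst_len,
                             uint64_t overhead, uint64_t per_word)
{
    assert(burst_len > 0);
    uint64_t bursts = (n + burst_len - 1) / burst_len;  /* round up */
    return bursts * overhead + n * per_word;
}
```

With illustrative inputs of 64 words, an overhead of 6 cycles, and 1 cycle per word, the model gives 448 cycles for single accesses and 112 cycles for bursts of 8. These numbers only show the shape of the saving and are not measurements.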
According to their purpose, the DMAs on the BF561 fall into three categories:
• Peripheral DMA (DMA): It is used to transfer data between peripheral devices and internal L1 and L2 SRAM.
• Memory DMA (MDMA): It is used to transfer data between external SDRAM and internal L1 and L2 SRAM.
• Internal Memory DMA (IMDMA): It is used to transfer data between internal L1 and L2 SRAM.
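The three categories above amount to a choice based on where the source and destination of a transfer reside. The following sketch encodes only the pairings listed in the text. The enum and function names are hypothetical, and pairings the list does not mention (such as external-to-external) are reported as `DMA_NONE` rather than guessed.

```c
#include <assert.h>

/* Where one end of a transfer resides. */
typedef enum { LOC_PERIPHERAL, LOC_INTERNAL_SRAM, LOC_EXTERNAL_SDRAM } location;

/* DMA category, following the three-way classification in the text. */
typedef enum { DMA_PERIPHERAL, DMA_MEMORY, DMA_INTERNAL_MEMORY, DMA_NONE } dma_kind;

dma_kind choose_dma(location src, location dst)
{
    /* Reject values outside the location enum. */
    assert((int)src >= 0 && (int)src <= (int)LOC_EXTERNAL_SDRAM);
    assert((int)dst >= 0 && (int)dst <= (int)LOC_EXTERNAL_SDRAM);

    if (src == LOC_PERIPHERAL || dst == LOC_PERIPHERAL) {
        /* Peripheral DMA: peripheral <-> internal L1/L2 SRAM. */
        if (src == LOC_INTERNAL_SRAM || dst == LOC_INTERNAL_SRAM)
            return DMA_PERIPHERAL;
        return DMA_NONE;            /* pairing not covered by the list */
    }
    if (src == LOC_INTERNAL_SRAM && dst == LOC_INTERNAL_SRAM)
        return DMA_INTERNAL_MEMORY; /* IMDMA: internal <-> internal */
    if (src == LOC_EXTERNAL_SDRAM && dst == LOC_EXTERNAL_SDRAM)
        return DMA_NONE;            /* pairing not covered by the list */
    return DMA_MEMORY;              /* MDMA: external SDRAM <-> internal */
}
```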
Figure 4.3 shows the bus architecture of the Blackfin BF561. Independent buses connect to the L1 SRAM and the L2 SRAM. If accesses by the DMA devices and the cores are arranged to overlap, performance can be improved.
Figure 4.3: Memory and bus architecture of BF561.
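A common way to overlap DMA and core accesses is ping-pong (double) buffering: while the core processes one buffer, the DMA fills the other. The sketch below shows only the control flow. It is not BF561 driver code: `dma_fetch_stub` is a hypothetical stand-in that copies with `memcpy` and completes immediately, whereas a real DMA transfer would run concurrently and need a completion check before each buffer swap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_WORDS 4u  /* illustrative block size */

/* Stand-in for starting a DMA transfer into a fast local buffer. */
void dma_fetch_stub(int32_t *dst, const int32_t *src, size_t words)
{
    memcpy(dst, src, words * sizeof *dst);
}

/* Core-side work on one block; here simply a sum. */
int64_t process_block(const int32_t *buf, size_t words)
{
    int64_t sum = 0;
    for (size_t i = 0; i < words; ++i)
        sum += buf[i];
    return sum;
}

/* Ping-pong loop: the next block is requested before the current one
 * is processed, so on real hardware transfer and compute would overlap. */
int64_t sum_with_ping_pong(const int32_t *src, size_t total_words)
{
    int32_t buf[2][BLOCK_WORDS];
    int64_t total = 0;
    int cur = 0;

    assert(total_words % BLOCK_WORDS == 0);
    size_t blocks = total_words / BLOCK_WORDS;
    if (blocks == 0)
        return 0;

    dma_fetch_stub(buf[cur], src, BLOCK_WORDS);       /* prime first buffer */
    for (size_t b = 0; b < blocks; ++b) {
        if (b + 1 < blocks)                           /* request next block */
            dma_fetch_stub(buf[cur ^ 1],
                           src + (b + 1) * BLOCK_WORDS, BLOCK_WORDS);
        total += process_block(buf[cur], BLOCK_WORDS); /* core works on current */
        cur ^= 1;                                     /* swap buffers */
    }
    return total;
}
```

Placing the two buffers in different memory sub-banks would keep the DMA writes and the core reads from contending for the same memory block.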