PAC DSP CORE AND APPLICATION PROCESSORS
David Chih-Wei Chang, I-Tao Liao, Jenq-Kuen Lee, Wen-Feng Chen, Shau-Yin Tseng, Chein-Wei Jen
SoC Technology Center
Industrial Technology Research Institute Hsinchu, Taiwan 310, R.O.C.
[email protected]
ABSTRACT
This paper provides an overview of the Parallel Architecture Core (PAC) project led by the SoC Technology Center of the Industrial Technology Research Institute (STC/ITRI) in Taiwan. It presents the background of the PAC project, a brief introduction to the PAC core technologies, the PAC SoC development suite, PAC benchmarks, and applications. The main objective of the PAC development plan is to strengthen industrial competitiveness in the core technologies behind key components, especially for portable multimedia applications.
1. INTRODUCTION
In recent years, the markets for communication systems and consumer electronics have grown dramatically, which in turn drives the demand for digital signal processor (DSP) solutions. To meet increasing requirements for high-performance, multi-function, real-time multimedia processing, DSP solutions have been embedded in a wide variety of consumer electronics and home entertainment products, such as cellular phones, MP3 players, GPS, digital cameras, DVD players, set-top boxes, and DTV consoles. DSP implementations can take the form of ASIC chips or DSP cores. Considering the flexibility needed for system and leading-edge algorithm design, a programmable DSP core (DSP chip or DSP/MPU SoC) is the ideal choice for supporting the multiple applications, high bandwidth, and multiple communication standards required by emerging mobile multimedia devices.
Because the DSP core is regarded as the key component of modern communication and consumer electronics appliances, the PAC project was initiated in early 2004. It aims to develop a solution based on a 32-bit programmable DSP core that enables richer multimedia capabilities, reduces development effort, and shortens time to market. The highly integrated PAC SoC platform features a dual-core architecture that combines the command and control capabilities of a RISC MPU with a high-performance, low-power DSP core capable of parallel processing. The PAC application processor is developed mainly for next-generation media-rich, multi-function portable devices, such as PMPs, PDAs, and smart phones.
2. PAC CORE TECHNOLOGIES
PAC DSP is a 32-bit fixed-point, low-power, high-performance DSP with a 5-way VLIW (Very Long Instruction Word) architecture targeted at mobile applications. It has one scalar unit and two data stream clusters. Each data stream cluster contains two functional units and a distinct, partitioned, low-power register file structure. PAC DSP has a rich but optimized instruction set that supports 8-bit and 16-bit SIMD operations. It is targeted to run at a maximum frequency of 250-300 MHz.
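To make the 8-bit and 16-bit SIMD operations concrete, the following minimal Python sketch models a packed addition on a 32-bit word. The function names `simd_add16` and `simd_add8` are illustrative and are not PAC DSP mnemonics, and wraparound (non-saturating) lane arithmetic is an assumption; the actual instruction set may behave differently.

```python
MASK16 = 0xFFFF
MASK8 = 0xFF

def simd_add16(a: int, b: int) -> int:
    """Add two 32-bit words as two independent 16-bit lanes (wraparound assumed)."""
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo

def simd_add8(a: int, b: int) -> int:
    """Add two 32-bit words as four independent 8-bit lanes (wraparound assumed)."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        lane_sum = (((a >> shift) & MASK8) + ((b >> shift) & MASK8)) & MASK8
        result |= lane_sum << shift
    return result
```

A carry out of one lane never reaches its neighbour, which is what distinguishes a SIMD add from an ordinary 32-bit add.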
The PAC DSP core can be used as a co-processor in a dual-core processor platform (e.g., the PAC SoC Platform) or as a standalone unit in a single-processor DSP platform. Along with the PAC DSP processor, a complete tool chain of compiler, assembler, and linker has also been developed. A high-performance assembly code library will be provided as well for multimedia applications. PAC DSP targets, but is not limited to, the following application domains:
Video and Image processing (H.264, MPEG-4, JPEG, Color space transform, etc.)
Audio and Speech processing (MP3, AAC, and GSM speech processing, etc.)
Voice processing and enhancement (digital hearing aid, voice-controlled gadgets, VoIP Telephony, etc.)
2.1. PAC DSP Core
PAC DSP is a silicon-proven IP core developed by STC. It employs a VLIW architecture and a SIMD instruction set to provide high parallel computing capability. The PAC DSP kernel contains the instruction pipeline and is the computation engine of PAC DSP. An application-specific Customized Function Unit (CFU) can be used to enhance the computational power of the PAC DSP kernel. One example of such a CFU is a motion-estimation engine for video encoding applications. The CFU executes in parallel with the PAC DSP kernel and interfaces with the kernel through either the PAC DSP data memory or the CFU interface. Fig. 1 shows the PAC DSP Core Architecture. Fig. 2 illustrates a block diagram of the architecture of the PAC DSP Kernel. The kernel has three main components: a program sequence control unit, a scalar unit, and two clusters of VLIW data path.
1424403677/06/$20.00 ©2006 IEEE ICME 2006
[Figure: block diagram of the PAC DSP core. Recoverable labels: PAC DSP Kernel; Memory Subsystem (MIU & Local Data Memory, IMIU & Instruction Cache); BIU with AHB Master Interface and AHB Slave Interface; Embedded ICE (JTAG); Power Control Unit; Accelerators; Interrupt, Execution Control, Instruction Memory, Data Memory, Debug, Customized Function Unit, and Power Control interfaces; nReset and CLK signals.]
Fig. 1 PAC DSP Core architecture
[Figure: block diagram of the PAC DSP kernel. Recoverable labels: Program Sequence Control Unit (Dispatch Unit, Interrupt Handler); Scalar Unit with RF (8); VLIW Datapath with Cluster 1 and Cluster 2, each containing a Load/Store Unit, an Arithmetic Unit, Customized FUs, a Public Ping-Pong RF (16), Private RFs (16 and 8), and a Coeff. RF (32); Memory Interface Unit (MIU); Customized Functional Unit; Accelerators; Bus Interface Unit (BIU).]
Fig. 2 The Architecture of PAC DSP Kernel
The program sequence control unit dispatches instructions to the scalar unit and the VLIW data path. It also handles interrupt and exception events. The scalar unit executes the scalar instructions and has 8 local registers; most of the program sequence control instructions are defined in this unit.
The VLIW data path is composed of two clusters that execute the data operations in the program. The number of clusters in the VLIW data path can be scaled up or down based on the target application’s performance requirements. Each cluster contains a load/store unit (L/S) and an arithmetic unit (AU). Both units can execute instructions concurrently, so two instruction slots in each instruction packet are allocated to a cluster.
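The slot allocation described above (one scalar slot plus an L/S slot and an AU slot per cluster, five in total) can be sketched as a simple packing model. The slot names, the instruction mnemonics in the usage note, and the greedy in-order packing rule are all assumptions made for illustration; the sketch ignores data dependencies and is not the PAC compiler's scheduler.

```python
# Hypothetical slot names: one scalar slot, plus L/S and AU slots per cluster.
SLOTS = ("scalar", "c1_ls", "c1_au", "c2_ls", "c2_au")

def pack(instructions):
    """Greedily group (slot, op) pairs into packets, one op per slot per packet.

    A new packet is started whenever the required slot is already occupied.
    Data dependencies are deliberately ignored in this sketch.
    """
    packets = []
    current = {}
    for slot, op in instructions:
        if slot not in SLOTS:
            raise ValueError(f"unknown slot: {slot}")
        if slot in current:
            packets.append(current)
            current = {}
        current[slot] = op
    if current:
        packets.append(current)
    return packets
```

For example, `pack([("c1_ls", "lw"), ("c1_au", "add"), ("c2_ls", "lw"), ("c1_au", "mul")])` yields two packets, because the second arithmetic operation for cluster 1 cannot share a packet with the first.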
Each cluster has its own register file structure. There are private register files for the L/S unit and the arithmetic unit: the private register file for the L/S unit is the address register file, and the private register file for the arithmetic unit is the AC register file.
Communication between the two units goes through the ping-pong register file. The specific operation defined for the ping-pong register file reduces power consumption. Data communication between clusters is achieved using explicit “data broadcast” and “receive” instructions.
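One way to picture communication through a ping-pong register file is as two banks whose ownership alternates between the L/S unit and the arithmetic unit. The sketch below is a behavioural model under stated assumptions: the class and method names are hypothetical, the split of the 16 public registers into two banks of 8 is assumed, and the explicit `swap` step stands in for whatever bank-selection mechanism the hardware actually uses.

```python
class PingPongRF:
    """Two-bank register file shared by a load/store unit and an arithmetic unit."""

    def __init__(self, regs_per_bank: int = 8):
        # Assumption: the public ping-pong registers are split into two equal banks.
        self.banks = [[0] * regs_per_bank, [0] * regs_per_bank]
        self.ls_bank = 0  # index of the bank currently owned by the L/S unit

    @property
    def au_bank(self) -> int:
        # The arithmetic unit always owns the bank the L/S unit does not.
        return 1 - self.ls_bank

    def ls_write(self, index: int, value: int) -> None:
        self.banks[self.ls_bank][index] = value

    def au_read(self, index: int) -> int:
        return self.banks[self.au_bank][index]

    def swap(self) -> None:
        """Exchange bank ownership so each unit sees the other's data."""
        self.ls_bank = 1 - self.ls_bank
```

Because each unit touches only its own bank at any time, each bank needs ports for one unit only, which is consistent with the port reduction the paper attributes to this structure.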
This well-defined register file structure ensures effective data communication among the register files. Area and power consumption are greatly reduced because the register file partition scheme and the Ping-Pong register file structure reduce the number of register file ports. Because instructions are scheduled statically, a VLIW architecture also saves more power than a superscalar architecture, which suits the low-power requirements of portable applications.
Both dynamic and static power management methodologies are defined in PAC DSP. Static power management provides a control register for turning off sub-blocks of PAC DSP. Dynamic power management turns off the unused processing elements in the data path dynamically.
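The two power management levels can be illustrated with a small bit-mask model: a static control register disables whole sub-blocks, and a per-packet check leaves idle units unclocked. The unit names, the bit assignments, and the function names are hypothetical; the paper does not describe the register layout.

```python
from enum import IntFlag

class Unit(IntFlag):
    # Hypothetical bit assignments for the five processing elements.
    SCALAR = 1
    C1_LS = 2
    C1_AU = 4
    C2_LS = 8
    C2_AU = 16

ALL_UNITS = 0b11111

def statically_enabled(power_off_reg: int) -> Unit:
    """Static management: a set bit in the control register turns a unit off."""
    return Unit(ALL_UNITS & ~power_off_reg)

def clocked_units(power_off_reg: int, used_slots) -> Unit:
    """Dynamic management: only enabled units that the packet uses stay active."""
    used = Unit(0)
    for name in used_slots:
        used |= Unit[name]
    return statically_enabled(power_off_reg) & used
```

With cluster 2 switched off statically (`power_off_reg = 0b11000`), a packet that names `C1_AU` and `C2_AU` activates only `C1_AU` in this model.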
In addition, PAC DSP uses variable instruction and packet lengths to address the low code density problem. The built-in hierarchical encoding/decoding technique limits the complexity that variable-length packets would otherwise add to instruction dispatch.
Embedded systems require sufficient performance at minimal power consumption. To fulfill the requirements of different applications in multi-function portable devices, the computing power of PAC DSP can be re-defined at design time, and a well-designed power management methodology reduces power consumption.
2.2. PAC DSP Software Development Suite
The PAC DSP Software Development Suite offers common user interfaces in a Linux environment, which makes it easy to learn and to develop across platforms. From PCs to PAC-based platforms, this cross-platform functionality lets developers repurpose previously developed applications and gives them a head start in PAC-based product development. The suite provides ever-expanding support for the features of the latest PAC DSP processors, including dual-core technology, cluster-wise Ping-Pong architecture technology, and joint VLIW-SIMD ISA technology.
The PAC DSP Software Development Suite includes a C Compiler, Assembler/Linker, Debugger, Libraries, and other Supporting Utilities. These help system developers deliver applications with good code quality. For example, the PAC DSP C Compiler, which is ported from the ORD compiler, ensures that PAC DSP applications can be developed in a programmer-friendly environment, thus reducing time-to-market and development cost for the end products.
2.3. PAC SoC Platform
The PAC SoC Platform is designed to provide an application processor SoC for next-generation mobile devices such as PMPs, smart phones, and PDAs. The PAC SoC platform features a dual-core architecture that combines the command and control capabilities of the MPU with the high-performance and low-power capabilities of the DSP core. The dual-core architecture utilizes both RISC MPU and VLIW DSP technologies.
Fig. 3 shows the basic SoC Platform Architecture. It can be scaled up or down to meet the performance requirements of different applications. The basic PAC SoC Platform consists of a Dual-Core Processor (MPU + DSP), Memory Subsystem, System DMA, I/O Peripherals, and an on-chip System Bus Network, through which these components communicate.
The PAC Platform uses an ESL (Electronic System Level) design methodology. The ESL platform provides co-verification of hardware and software designs. For hardware RTL design, compared to traditional verification, the ESL platform can provide real data such as H.264 stream data; for software design, it provides a verification environment for compilers and debuggers, such as step-by-step debug tools and memory and register analysis.
[Figure: block diagram of the PAC SoC platform. Recoverable labels: PAC Core with MPU, DSP, and DMA as bus masters (M) on the MPU AHB, DSP AHB, and DMA AHB; bus slaves (S) including ROM, Flash, SDRAM (SMI, SDRAM Controller), On-Chip SRAM, Mail Box, MPU VIC, and DSP VIC; AHB/APB bridges to the MPU APB, DSP APB, and DMA APB; MPU, DSP, and DMA peripherals including TIMERs, WATCHDOG, PWM, I2C, I2S, UARTs, RTC, GPIOs, SPI, SSP, Smart Card, USB OTG, and LCD Controller.]
Fig. 3 PAC SoC Platform Architecture
In addition, the PAC Platform uses DVFS (Dynamic Voltage and Frequency Scaling) to address the power gap problem. The power gap is one of the major challenges of IC design, and multiple Vdd (mVdd, i.e., voltage scaling) is one of the most important and effective low-power design methodologies. By using mVdd and power-aware management technology, PAC can save 5-70% of the original power.
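The reason voltage and frequency scaling can yield such savings follows from the standard first-order model of CMOS dynamic power, P ≈ C·V²·f, which is not stated in the paper and is used here as an assumption. The sketch below computes power relative to a reference operating point; the reduced operating point in the usage note (1.0 V at 150 MHz) is hypothetical.

```python
def relative_dynamic_power(v: float, f_mhz: float, v_ref: float, f_ref_mhz: float) -> float:
    """Dynamic power relative to a reference point, using P proportional to V^2 * f.

    Switched capacitance is assumed constant, and leakage power is ignored.
    """
    return (v / v_ref) ** 2 * (f_mhz / f_ref_mhz)

def saving_percent(v: float, f_mhz: float, v_ref: float, f_ref_mhz: float) -> float:
    """Percentage of dynamic power saved relative to the reference point."""
    return 100.0 * (1.0 - relative_dynamic_power(v, f_mhz, v_ref, f_ref_mhz))
```

Taking 1.2 V at 300 MHz as the reference, `saving_percent(1.0, 150, 1.2, 300)` evaluates to roughly 65%, which shows how halving the clock and lowering the supply together produce savings toward the upper end of the quoted range.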
2.4. PAC SoC Embedded Software
The PAC platform provides an embedded Linux software solution. Compared to the standard kernel, many features are added to meet the requirements of consumer electronics products, including fast boot, XIP, hard real-time support, and power management. In addition, the Inter-Processor Communication (IPC) software framework simplifies communication between the two cores. With embedded Linux technology, PAC will be a stable, flexible, and extensible platform for dual-core architecture developers. Fig. 4 shows the reference embedded software structure for a PMP. The embedded software for the PAC platform includes the following components: HAL library and boot monitor, Embedded Linux, Middleware, Codec engine & applications, and DSP microkernel.
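As a rough illustration of how an IPC framework might pass work between the MPU and the DSP through a mailbox, the sketch below uses two bounded queues, one per direction. Everything here is hypothetical: the `Mailbox` class, the queue depth, the message tuples, and the `dsp_service` handler are invented for illustration and do not describe the actual PAC IPC interface.

```python
from collections import deque

class Mailbox:
    """Bounded one-way message queue between two cores (hypothetical model)."""

    def __init__(self, depth: int = 4):
        self._depth = depth
        self._slots = deque()

    def post(self, message) -> bool:
        """Enqueue a message; return False if the mailbox is full."""
        if len(self._slots) >= self._depth:
            return False
        self._slots.append(message)
        return True

    def fetch(self):
        """Dequeue the oldest message, or return None if the mailbox is empty."""
        return self._slots.popleft() if self._slots else None

def dsp_service(inbox: Mailbox, outbox: Mailbox) -> None:
    """DSP-side loop body: acknowledge every pending 'decode' request."""
    while True:
        message = inbox.fetch()
        if message is None:
            break
        command, frame_id = message
        if command == "decode":
            outbox.post(("done", frame_id))
```

In this model the MPU posts `("decode", 7)` to the MPU-to-DSP mailbox, the DSP side runs `dsp_service`, and the MPU then fetches `("done", 7)` from the reverse mailbox.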
[Figure: layered software stack for a portable media player. Recoverable labels: Application Layer (H.264 player, MP3 player, photo viewer and extractor, User Interface); Middleware (Multimedia framework, Audio/Video Codecs, Graphics Libs/W.S., Libs & Services); Embedded Linux kernel (fast boot, power-management, preemptive, XIP, flash file system, network management, DPM) with Power-aware IPC; Drivers (A/V Drivers & other I/O Drivers; USB 2.0, IDE, WLAN); DSP side (DSP Application Layer with Video and Audio, Micro Kernel/BIOS, Resource management, DPM, Drivers, Data flow, IPC); F/W and Bootloader; Portable Media Player Platform Hardware with MPU/IOs (USB, IDE, A/V, WLAN, LCD,…) and DSP/IOs.]
Fig. 4 PAC PMP Reference Software Structure
2.5. PAC Benchmarks
PAC DSP achieves a favorable power-performance ratio, as shown in Fig. 5.
Vendor    Core             Architecture         Freq (MHz)  Process        Performance (MIPS)  Power (mW/MIPS)         Power Mgmt  Area
CEVA      CEVA-X 1620      8-way VLIW           450         0.13µm         3600                0.08 (without memory)   Yes         1.6mm2
StarCore  SC1000 (SC1400)  6-way VLIW           305         0.13µm         1830                0.098 (without memory)  Yes         -
StarCore  SC2000 (SC2400)  6-way VLIW           250~350     0.13µm ~ 90nm  1500 ~ 2100         -                       Yes         -
ITRI/STC  PAC DSP v2.0     5-way VLIW           250         0.13µm         1250                0.08 (without memory)   Yes         1.2mm2
LSI       ZSP500           4-issue superscalar  400         0.13µm ~ ?     1600                0.107 (without memory)  Yes         -
3DSP      SP5              2-way superscalar    320         0.13µm         640                 0.125 (without memory)  Yes         0.16mm2
Fig. 5 Benchmarks of DSP Cores (1)
The signal processing performance of PAC DSP is pre-evaluated using a suite of DSP benchmarks developed by Berkeley Design Technology Inc. (BDTI). Fig. 6 shows the execution cycle counts of each kernel for PAC DSP and its competitors. With the same MAC resources, 30% of the benchmarking results of PAC DSP are better than those of its competitors. The optimized ISA and special architecture of PAC DSP are the main reasons. In Fig. 7, the PAC SoC Processor is compared with well-known low-power application processors offered by TI, Freescale, and Intel.
Kernel                    PAC DSP  CEVA-X 1620  CEVA-X 1640  StarCore SC1200  StarCore SC1400  TI C6414
Clock (MHz)               250      450          340          305              300              1000
Vector Add                21       33           18           19               19               27
Vector Dot                23       26           19           25               16               25
Vector Max                43       29           22           44               27               36
Control                   444      639          639          425              425              475
Bit unpack                146      106          61           164              124              97
Real-valued Block FIR     317      351          182          354              185              194
Complex-valued Block FIR  993      1330         690          1333             675              674
SS FIR                    18       21           19           16               14               26
IIR                       19       9            8            10               9                16
LMS                       34       29           24           26               19               37
Viterbi                   3505     2304         1925         2880             1935             1740
FFT                       1684     2207         1248         3230             1631             1246
DSP Platform Architecture (entries as extracted; column alignment not recoverable): 4-way VLIW + Scalar, 2 MACs, 4-way VLIW, 2 MACs, 6-way VLIW, 4 MACs, 8-way VLIW, 4 MACs, 8-way VLIW, 2 MACs, 8-way VLIW
Fig. 6 Benchmarks of DSP Cores (2)
Note: PAC DSP has been submitted for BDTI’s official certification.
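Because the cores in Fig. 6 run at different clock rates, cycle counts alone do not give execution time. The sketch below converts a cycle count into microseconds with the relation time = cycles / clock. The FFT figures are copied from Fig. 6, on the assumption that the data columns follow the order in which the cores are listed; the dictionary and function names are illustrative.

```python
# (clock in MHz, FFT cycle count) per core, taken from Fig. 6.
# Assumption: data columns follow the order in which the cores are listed.
FFT_RESULTS = {
    "PAC DSP": (250, 1684),
    "CEVA-X 1620": (450, 2207),
    "CEVA-X 1640": (340, 1248),
    "StarCore SC1200": (305, 3230),
    "StarCore SC1400": (300, 1631),
    "TI C6414": (1000, 1246),
}

def execution_time_us(cycles: int, clock_mhz: float) -> float:
    """Execution time in microseconds: cycles divided by clock rate in MHz."""
    return cycles / clock_mhz

def fft_times_us(results=FFT_RESULTS):
    """Map each core name to its FFT execution time in microseconds."""
    return {
        name: execution_time_us(cycles, clock_mhz)
        for name, (clock_mhz, cycles) in results.items()
    }
```

Under these assumptions the PAC DSP FFT takes 1684 / 250, or about 6.7 µs, at its listed clock rate.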
                    PAC          TI OMAP 2410/20        TI OMAP1610 (1611/1612)  Freescale MXC275-30  Intel PXA800F
Processor Core I    ARM9/S+Core  ARM1136JF-S            ARM926EJ-S               ARM1136JF-S          XScale
Freq (MHz)          244          330                    204                      532                  312
Processor Core II   PAC DSP 2.0  TMS320C55x             TMS320C55x               StarCore SC140e DSP  MSA DSP (Frio)
Freq (MHz)          300          220                    204                      208                  104
Accelerator(s)      Custom Core  2D/3D Graphics, Video  Video, Security          Security (HW/SW)     16-bit SIMD, Viterbi, Voice
Power (mW@MHz)      450@300      650@330                240@204                  650@532              350@312
IC Process          0.13µm       0.09µm                 0.13µm                   0.09µm               0.13µm
Core Voltage        1.2V         N/A                    1.1~1.5V                 Not Open             1.2V
Peripheral Voltage  2.5/3.3V     N/A                    1.8V/3.0V                1.8V~3.3V            1.8V~3.3V
Package             288 BGA      289 BGA                289 BGA                  Not Open             294 TPBGA
Fig. 7 Benchmarks of Low-Power Application Processors
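The power entries in Fig. 7 are quoted as milliwatts at a given clock rate, so a per-MHz figure makes them easier to read side by side. The sketch below parses the `mW@MHz` notation used in the figure; the helper name and dictionary are illustrative. The processors are built in different processes and clocked differently, so the resulting numbers are only a rough normalization.

```python
# Power entries copied from the "Power (mW@MHz)" row of Fig. 7.
FIG7_POWER = {
    "PAC": "450@300",
    "TI OMAP 2410/20": "650@330",
    "TI OMAP1610": "240@204",
    "Freescale MXC275-30": "650@532",
    "Intel PXA800F": "350@312",
}

def mw_per_mhz(entry: str) -> float:
    """Convert a 'mW@MHz' entry into milliwatts per megahertz."""
    power_mw, clock_mhz = entry.split("@")
    return float(power_mw) / float(clock_mhz)

def normalized_power(table=FIG7_POWER):
    """Map each processor name to its power in mW/MHz."""
    return {name: mw_per_mhz(entry) for name, entry in table.items()}
```

For the PAC entry, `mw_per_mhz("450@300")` gives 1.5 mW/MHz.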
3. PAC APPLICATION PROCESSORS
STC cooperates with several fabless IC design companies in Taiwan to develop applications based on the PAC design. These primarily target low-power, high-performance portable multimedia devices that need to process an enormous amount of digital audio and video streams, such as PDAs, smart phones, PMPs, DSCs, and DVRs, as well as VoIP handsets/gateways that require real-time signal processing.
PMPs and PDAs/smart phones are two key potential implementations. With PAC as the system foundation, a PMP will offer multiple multimedia functions, such as MP3/AAC audio encoding/decoding, MPEG-4 D1-resolution encoding/decoding, H.264 D1 decoding/QCIF encoding, and signal equalization/amplification control. It also has various peripheral controls, including monitor, audio/video I/O, and external memory, to meet the hardware requirements of next-generation PMPs. PDAs/smart phones are regarded as the biggest market segment for PAC applications. The PAC Media Processor comprises a multimedia application processor (connected externally to a baseband processor) and standard peripherals for high-performance PDAs/smart phones. The PAC Media Processor will be introduced and promoted to Taiwan-based companies in the initial phase. The goal is to enter the market currently dominated by foreign media processor providers.
4. CONCLUSION
The PAC SoC Platform consists of a 32-bit PAC DSP core and MPU, a memory subsystem, DMA, I/O peripherals, and an on-chip system bus network. In addition, a low-power methodology, performance evaluation, and hardware/software co-verification techniques were developed during the design process. The complete software tools and hardware development environment further reduce development risks and shorten time to market. Featuring high-performance operation at optimized low power consumption, the dual-core PAC platform provides an ideal application processor solution for implementing more robust SoC designs for next-generation multimedia mobile devices.
ACKNOWLEDGEMENT
We wish to thank the experts who offered valuable advice on the PAC project, especially Dr. HT Kung, William H. Gates Professor of Computer Science and Electrical Engineering of Harvard University, and Dr. Paul Lin, General Director of Information and Communications Research Laboratories of ITRI. We are also very grateful to all the PAC team members who made this work possible. We would also like to express our appreciation for the assistance given by Alan Kang and Winnie Chu, planners in the Planning & Promotion Division of STC/ITRI, in compiling information for this paper.