

Chapter 3 Integrated DOT/ECG/EEG Biomedical Multiprocessor for Portable Brain-Heart

3.3 Hardware Design and Implementation

3.3.2 Core Module Design

3.3.2.6 Lossless Data Compression (COMP) Engine

The hardware architecture of the lossless data compression engine is shown in Figure 3-42. It comprises three pipeline stages: a prediction and parameter estimation stage, a Golomb-Rice entropy coding stage, and an output packaging stage.

Since Golomb-Rice codes vary in length, the number of clock cycles needed to completely pack an encoded stream onto a fixed bus width varies at the final packaging stage. To prevent pipeline overflow, a ready-acknowledge handshaking mechanism is employed at every stage of the engine, including the input and output ports that connect to the PDS and UART respectively.
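The effect of the handshake can be illustrated with a small cycle-level model. This is only a behavioral sketch under stated assumptions, not the RTL of the engine: each stage is assumed to hold one item, an item advances only when the downstream register is free, and `pack_cycles` is a hypothetical function giving the number of cycles the packaging stage spends on an item.

```python
from collections import deque


def simulate_pipeline(samples, pack_cycles):
    """Cycle-level model of a three-stage pipeline with back-pressure.

    Each stage holds at most one item. An item advances only when the
    downstream register is free, which models the ready/acknowledge
    handshake. pack_cycles(item) is a hypothetical function returning
    the cycles the packaging stage needs for that item.
    """
    source = deque(samples)
    predict = encode = package = None   # one register per stage
    remaining = 0                       # cycles left in the packaging stage
    output, cycles = [], 0

    while (source or predict is not None
           or encode is not None or package is not None):
        cycles += 1
        # Packaging stage: variable latency, releases its item when done.
        if package is not None:
            remaining -= 1
            if remaining == 0:
                output.append(package)
                package = None
        # Stages are evaluated from last to first, so a register freed in
        # this cycle can accept the upstream item in the same cycle.
        if package is None and encode is not None:
            package, encode = encode, None
            remaining = max(1, pack_cycles(package))
        if encode is None and predict is not None:
            encode, predict = predict, None
        if predict is None and source:
            predict = source.popleft()
    return output, cycles
```

Calling `simulate_pipeline(list(range(10)), lambda s: 1 + s % 3)` returns the samples in their original order with none dropped, even though the last stage stalls the upstream stages for a varying number of cycles.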






Figure 3-42 Hardware architecture of the lossless data compression engine

To conserve hardware resources, the prediction circuitry is shared among the different biomedical signals. The upstream PDS unit functions as a multiplexer/arbiter, feeding the otherwise parallel multi-channel biomedical signals serially into the lossless data compression engine. At the same time, the PDS unit annotates each sample with its signal type, precision and bypass mode through the CHSEL, SP and BYPASS pins respectively. Table 3-18 describes these control signals in detail.

The prediction stage, upon receiving a biomedical sample, determines its context and loads the context statistics from memory. In the following clock cycle, the k parameter is calculated, the prediction error is remapped, and the context variables are updated and written back to memory. The prediction stage maintains separate memory locations for the context statistics of the different biomedical data channels. For DOT, DPCM prediction is performed on an inter-frame, per-pixel basis, and context-based k parameter estimation is not performed, due to memory space limitations.
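The per-sample work of the prediction stage can be sketched in software as follows. The exact estimator is the one defined in 3.2.4 and is not reproduced here; this sketch assumes a JPEG-LS (LOCO-I) style rule in which k is the smallest integer with N * 2^k >= A, a previous-sample (DPCM) predictor, and hypothetical values for the initial statistics and the reset threshold.

```python
def estimate_k(a, n):
    """Smallest k such that n * 2**k >= a (JPEG-LS style rule, assumed)."""
    k = 0
    while (n << k) < a:
        k += 1
    return k


def remap_error(e):
    """Fold a signed prediction error into a non-negative integer:
    0, -1, 1, -2, 2, ... map to 0, 1, 2, 3, 4, ..."""
    return 2 * e if e >= 0 else -2 * e - 1


class ChannelContext:
    """Context statistics for one channel (initial values are assumptions)."""

    def __init__(self, a_init=4, reset=64):
        self.a = a_init      # accumulated error magnitude
        self.n = 1           # sample count
        self.prev = 0        # previous sample, used as the prediction
        self.reset = reset   # halve the statistics when n reaches this

    def process(self, sample):
        error = sample - self.prev          # DPCM prediction error
        k = estimate_k(self.a, self.n)      # from statistics before update
        mapped = remap_error(error)
        # Update the context variables and "write them back".
        self.a += abs(error)
        self.n += 1
        if self.n >= self.reset:
            self.a >>= 1
            self.n >>= 1
        self.prev = sample
        return k, mapped


contexts = {}  # one context per CHSEL value, mirroring per-channel memory


def predict_stage(chsel, sample):
    """Look up the channel context, then return (k, remapped error)."""
    ctx = contexts.setdefault(chsel, ChannelContext())
    return ctx.process(sample)
```

Keeping one `ChannelContext` per CHSEL value mirrors the separate memory locations described above, while the arithmetic itself is shared by all channels.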

Table 3-18 Control signal parameters of the lossless data compression engine

Port Name                 Width (bits)   Description

BYPASS                    1              If the input data comes with the BYPASS signal set to high, data compression is bypassed and the input data is directly packed as raw data by the output packaging unit

CHSEL (channel select)    4              Value: Channel
                                         0: EEG/ICA Ch0
                                         1: EEG/ICA Ch1
                                         2: EEG/ICA Ch2
                                         3: EEG/ICA Ch3
                                         4: HRV Coefficients
                                         5: DOT Pixel
                                         6: EKG Ch0
                                         7: EKG Ch1
                                         8: EKG Ch2

SP (sample precision)     3              Value: Precision
                                         0: 8-bit
                                         1: 10-bit
                                         2: 12-bit
                                         3: 14-bit
                                         4: 16-bit
                                         5: 18-bit
                                         6: 20-bit
                                         7: 32-bit
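For test bench or host-side software, the encodings in Table 3-18 can be captured in a small lookup. The function name and the tuple it returns are illustrative assumptions; only the value assignments are taken from the table.

```python
# Value assignments copied from Table 3-18.
CHSEL_CHANNELS = {
    0: "EEG/ICA Ch0",
    1: "EEG/ICA Ch1",
    2: "EEG/ICA Ch2",
    3: "EEG/ICA Ch3",
    4: "HRV Coefficients",
    5: "DOT Pixel",
    6: "EKG Ch0",
    7: "EKG Ch1",
    8: "EKG Ch2",
}
SP_BITS = (8, 10, 12, 14, 16, 18, 20, 32)  # indexed by the 3-bit SP value


def decode_control(chsel, sp, bypass):
    """Translate raw CHSEL/SP/BYPASS values into (channel, bits, bypass).

    Hypothetical helper: raises ValueError for values not listed in the table.
    """
    if chsel not in CHSEL_CHANNELS:
        raise ValueError("unassigned CHSEL value: %d" % chsel)
    if not 0 <= sp < len(SP_BITS):
        raise ValueError("SP out of range: %d" % sp)
    return CHSEL_CHANNELS[chsel], SP_BITS[sp], bool(bypass)
```

For example, `decode_control(5, 2, 0)` identifies a 12-bit DOT pixel sample that is to be compressed.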

The Golomb-Rice entropy coder in Figure 3-42 implements the Golomb-Rice coding table shown in Figure 3-9b for various values of k. It calculates the quotient Q and remainder R based on the estimated Golomb-Rice k parameter and input remapped prediction errors, and outputs the result to the next stage within a single clock cycle.
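The quotient and remainder computation reduces to a shift and a mask, which is why it fits in a single clock cycle. The sketch below assumes a unary quotient written as ones terminated by a zero, followed by the k-bit remainder; the actual bit convention is the one given in Figure 3-9b, and any limit on the code length is not modelled.

```python
def golomb_rice_encode(m, k):
    """Encode a non-negative remapped error m with Rice parameter k.

    Returns (q, r, bits), where bits is the code word as a string of
    '0'/'1' characters. The unary convention (q ones, then a zero) is
    an assumption of this sketch.
    """
    if m < 0 or k < 0:
        raise ValueError("m and k must be non-negative")
    q = m >> k                    # quotient: m divided by 2**k
    r = m & ((1 << k) - 1)        # remainder: the low k bits of m
    bits = "1" * q + "0"          # unary part
    if k:
        bits += format(r, "0{}b".format(k))   # k-bit binary remainder
    return q, r, bits
```

With m = 9 and k = 2 this gives Q = 2 and R = 1, and a code word of length Q + 1 + k = 5 bits, which illustrates how the code length depends on both the sample and the estimated k.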

The output packaging unit is the final pipeline stage and maintains four separate output buffers for the different biomedical signals, which are filled as samples are encoded into bit streams. Whenever any of the buffers becomes full, the buffer value is driven onto the output bus together with an appended data type ID indicating the type of biomedical data, as described previously in 3.3.1.4. If two or more buffers are full simultaneously, a priority scheme is enforced such that minimal output latency is achieved.
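The buffering behavior can be sketched as one bit accumulator per data type that emits a tagged word each time it fills. The bus width of 8 bits, the type ID strings and the class name are assumptions for illustration, and the priority scheme for simultaneously full buffers is not modelled.

```python
class OutputPackager:
    """Per-type bit buffers that emit (type_id, word) when a buffer fills."""

    def __init__(self, type_ids, width=8):
        self.width = width  # assumed bus width in bits
        # Each buffer is [accumulator, number of bits currently held].
        self.buffers = {t: [0, 0] for t in type_ids}

    def push(self, type_id, bits):
        """Append a code word (string of '0'/'1') to one buffer.

        Returns the list of full words produced by this call, each
        tagged with its data type ID.
        """
        acc, count = self.buffers[type_id]
        words = []
        for b in bits:
            acc = (acc << 1) | (1 if b == "1" else 0)
            count += 1
            if count == self.width:
                words.append((type_id, acc))
                acc, count = 0, 0
        self.buffers[type_id] = [acc, count]
        return words
```

Because code words rarely align with the bus width, a single `push` may return no word, one word or several, which is the source of the variable cycle count at this stage.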

The first two stages, namely the prediction and Golomb-Rice encoding stages, correspond to the proposed algorithm described in 3.2.4 to which the overhead energy consumption Ecomp (3.2.4.2) is attributed. The PDS and output packaging stages perform mainly data routing and packing, which operate regardless of whether compression is employed or not.

3.3.3 Functional Verification

The functionality of the complete system from algorithm down to hardware implementation must be verified thoroughly before taping out the design for chip fabrication.

To achieve this, detailed functional verification is performed at key points in the pre-silicon phase of the IC development flow shown in Figure 3-43. The algorithms (i.e. DOT, ICA, HRV and COMP) have been developed from the very beginning with hardware issues considered as much as possible. However, since not every problem that may arise later in chip implementation can be foreseen (whether through lack of judgment or experience), redesign at the RTL, architectural or even algorithm level is typically necessary to resolve any area, timing closure or power issues that may block the way to successful physical implementation. The following discussion describes the verification approaches at both the individual subsystem and chip integration levels in further detail.


Figure 3-43 IC development flow methodology


The algorithms for the individual biomedical signal processing engines, including lossless data compression, have been previously developed and verified using MATLAB.

Simulation results of the developed ICA algorithm demonstrate a verified average correlation coefficient performance of 0.86 using randomly generated mixed super-gaussian sources.

Performance of the HRV algorithm was evaluated and verified using both real ECG signals (from the MIT-BIH database) and artificially controlled RRI data [68]. Reconstructed DOT images based on fNIR light intensities acquired from a self-developed DOT sensor array board were compared to reference images using MSE [50]. Finally, the lossless data compression algorithm was evaluated using publicly available reference biomedical data such as the MIT-BIH and UCI-KDDI databases, as well as outputs of the upstream algorithm blocks [69]. Table 3-19 summarizes the functional verification and performance evaluation procedures performed for the various biomedical signal processing algorithms as described above.

Table 3-19 Verification of the various biomedical signal processing algorithms

Subsystem          ICA       DOT       HRV       COMP

Algorithm model    MATLAB    MATLAB    MATLAB    MATLAB

Input pattern      1. Random mixed Gaussian signals 2. Real EEG acquired using Neuroscan EEG

To ensure that the hardware implementation behaves the same way, golden test patterns at key points in the algorithms are generated from the MATLAB models and attached to the RTL, gate-level and post-layout simulation test benches for automated data checking. Individual unit testing of the ICA, DOT, HRV and COMP engines at the RTL level is performed first, before moving up to chip integration level testing, so as to allow easier localization of bugs or defects. While the main focus during unit testing is to check the correctness of the individual processing engines against their respective algorithms, system-level verification focuses more on the correctness of data communication and chip integration.

The system-level test bench used to verify the chip design and integration is shown in Figure 3-44. After the test patterns have been attached to their respective models and monitors, chip operation is initiated by the main test control, either through a manual pin trigger or a system mode command transmitted through UART. The DUT then begins to periodically request raw biomedical data samples for processing. In response, the AFE model retrieves the requested data from the attached raw input data files and sends them back to the DUT according to the AFE interface protocol. Processed data received by the UART model is unpacked, decoded and finally verified on-line against expected data generated from the MATLAB models.
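The decode-and-check step performed on the UART side can be sketched as follows. This is a simplified illustration, not the test bench code: it assumes the same unary convention as a plain Golomb-Rice code (ones terminated by a zero), a fixed k for the whole stream (whereas the engine adapts k per sample), and hypothetical function names.

```python
def golomb_rice_decode(bits, k, count):
    """Decode `count` values from a '0'/'1' string with fixed parameter k."""
    pos, values = 0, []
    for _ in range(count):
        q = 0
        while bits[pos] == "1":       # unary quotient
            q += 1
            pos += 1
        pos += 1                      # skip the terminating zero
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        values.append((q << k) | r)
    return values


def unmap_error(m):
    """Inverse of the signed-to-unsigned fold: 0, 1, 2, 3 -> 0, -1, 1, -2."""
    return m >> 1 if m % 2 == 0 else -((m + 1) >> 1)


def first_mismatch(observed, golden):
    """Index of the first difference from the golden pattern, or None."""
    for i, (o, g) in enumerate(zip(observed, golden)):
        if o != g:
            return i
    if len(observed) != len(golden):
        return min(len(observed), len(golden))
    return None
```

Reporting the index of the first mismatch, rather than a plain pass or fail, makes it easier to trace a failure back to a specific sample in the simulation waveform.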

Figure 3-44 System test bench for functional verification of biomedical multiprocessor IC

3.3.4 Chip Implementation in UMC 65nm CMOS Technology

The functionally and physically verified chip has been successfully implemented and fabricated using UMC 65nm CMOS 1P10M technology. The die size is 1,317 x 1,317 um2 and the core size is 680 x 680 um2, comprising a total of 368,314 gates. Power consumption estimated using Synopsys Prime Power reports a total of 3.6mW consumed at an operating condition of 1.0V core voltage and 24MHz clock frequency. The details of the area and power breakdown across the different modules are shown in Table 3-21. The floor plan and die micrograph of the fabricated chip are shown in Figure 3-45, and a summary of the chip specifications is given in Table 3-20. The chip bonding map is shown in Figure 3-46 and the packaged chip is shown in Figure 3-47.

Figure 3-45 Die micrograph and chip layout of the integrated biomedical DSP

Table 3-20 Chip specifications summary

Technology     UMC 65nm CMOS 1P10M
Die size       1.317 x 1.317 mm2
Core size      0.680 x 0.680 mm2
Gate count     368,314
Power          3.6 mW (full system mode)
Pin count      104 pins (functional / […] / core power : 32 / 64 / 8)

Table 3-21 Area and power breakdown across different modules

Modules     Area (um2)     Power (mW): Leakage, Dynamic

Figure 3-46 Chip bonding map

01 I_MANUAL_START
02 P_SVSSIO15
03 P_SVDDIO19
04 P_SVDDIO0
05 O_DOT_CHSEL1
06 O_DOT_CHSEL3
07 I_RX
08 O_AIC_VALID_CONVERSION
09 O_LED_SEL0
10 P_SVSSIO0
11 (unconnected)
12 (unconnected)
13 P_SVDD6
14 O_AIC_CLK_SLOW
15 (unconnected)
16 I_RESET
17 (unconnected)
18 P_SVSS1
19 P_SVSSIO14
20 (unconnected)
21 (unconnected)
22 P_SVDD1
23 I_CLK
24 P_SVSS6
25 P_SVDDIO23
26 P_SVSSIO1
27 P_SVSSIO23
28 P_SVDDIO1
29 P_SVDDIO13
30 O_TX
31 P_SVDDIO14
32 P_SVSSIO19

Figure 3-47 Chip packaged in 28mm 128-pin CQFP

Chapter 4 Integrated DOT/ECG/EEG Multiprocessor IP for SoC Implementation

In the previous chapter, the design and implementation of a highly-integrated DOT/ECG/EEG biomedical signal multiprocessor IC was presented. Although it possesses desirable features of small size, low power and a high degree of functional integration, the designed chip lacks the flexibility and adaptability to be extended in future derivative applications. In this chapter, the extended development of the proposed biomedical multiprocessor as an SoC-compatible IP is presented as a solution.

4.1 Motivation for SoC Implementation

SoC stands for system-on-a-chip, and is literally the large-scale functional integration of the traditional discrete PCB (printed circuit board) components comprising a complete computer system onto a single chip. A typical SoC design is a HW/SW integration of at least one programmable microcontroller, microprocessor or DSP core; on-chip memory blocks such as ROM, RAM, EEPROM and flash; HW accelerators that perform special processing tasks (e.g. data encryption, multimedia encoding, 3D graphics rendering); I/O connectivity modules such as ADC/DAC, UART, USB, SPI, SD, Ethernet, Firewire, Thunderbolt, LCD and GPIO to interface with the outside world; computer architecture peripherals such as direct memory access (DMA), memory and interrupt controllers, bus arbiters and bridges, counter timers and real-time clocks; bus interconnects for data transfer and control; oscillators, phase-locked loops (PLL), voltage regulators, power management circuits and reset generators for providing clock, power and reset infrastructure; and, most importantly, the embedded software that controls it. Table 4-1 summarizes the above enumeration and Figure 4-1 shows a typical SoC hardware architecture.

Table 4-1 Typical hardware components comprising an SoC

Category/Function               Component(s)

Main processor                  Microcontroller, microprocessor or DSP core
Memory                          ROM, RAM, EEPROM, Flash
Architecture peripherals        DMA, memory and interrupt controllers, bus arbiters and bridges, counter timers, real-time clocks
[…] control lines               Bus interconnects, encoders, decoders and multiplexers
HW accelerator IPs (examples)   Data encryption, multimedia encoder/decoder, 3D graphics engine, etc.
I/O connectivity                ADC/DAC, UART, USB, SPI, SD, Ethernet, LCD, GPIO
Clock, power and reset          Oscillators, PLLs, voltage regulators, power management circuits, reset and clock generators

Figure 4-1 Atmel AT91SAM9263 SoC architecture

Such a high degree of system integration is possible thanks to the steady advance of semiconductor device fabrication technology, in which transistor sizes have shrunk […] [70]. As more components are moved from the board into […], […] and power consumption become greatly reduced while […] enhanced significantly due to faster internal […] effective silicon that can be shared across different applications, shorter software development schedules, lower component counts and assembly costs, dramatic reductions in the total system cost can be achieved. With these many advantages, SoC-based systems have become so popular that they can be found in virtually every modern electronic device today.

SoC development is not without challenges, however. As more and more functions are pushed into the SoC, it becomes so complex that the time and effort needed for design and verification become a significant issue. To make matters worse, the product lifecycles and development schedules demanded by the market have only shortened over the years. To work around these difficulties, the industry has widely adopted approaches such as design reuse, orthogonal design partitioning, and higher abstraction levels (e.g. transistors, gates, RTL, behavioral) to increase productivity. Platform-based design is an SoC design methodology that is a culmination of these ideas, and its practice has changed from option to necessity in recent years.

In a typical platform-based design, basic system functionality is provided by a reusable HW/SW platform (Figure 4-2, left side), comprising a reference SoC platform design, a corresponding basic set of device driver software, and optionally an operating system with system function libraries. This pre-verified HW/SW platform serves as a stable foundation upon which application-specific hardware and software (Figure 4-2, right side) can be rapidly and reliably built to allow system customization/differentiation for various target applications. This is supported in particular by the extensive use of well-designed system interfaces: libraries and APIs for software, and standard data transfer protocols, memory maps and interrupt schemes for hardware (Figure 4-2, in red font). Such an organized structure of hardware and software architectures facilitates easy functional extension, parallel development (both in-house and third-party) and maximal reuse of IP at both platform and component levels; as a result, system complexity becomes manageable, and development time, effort and costs are reduced significantly.


Figure 4-2 Platform-based system design

In recent years, as SoC platforms have become more common, system design is increasingly performed from the perspective of software running on an SoC embedded processor [71]. As much as possible, differentiation is implemented in software avoiding the use of silicon, with the goal of achieving a performance just enough to meet the requirements.

Only when software profiling really calls for it are a few select functions chosen for specialized hardware implementation. With the bulk of the commercial value now increasingly associated with system design and application differentiation, modeling of customized systems becomes a top priority. Examples of new methodologies built on top of the framework of platform-based design are early high-level exploration of function versus architecture in electronic system-level (ESL) design, and separation of computation and communication as manifested in transaction-level modeling (TLM).

In line with these recent trends, the previously developed integrated DOT/ECG/EEG biomedical multiprocessor design is repackaged and delivered as a pre-verified IP core for ready integration in today’s ubiquitous platform-based SoCs. Among the many different SoC platforms on the market, we choose the platform based on the ARM processor and AMBA bus architecture due to its status as the most widely adopted platform in the industry today. In the remaining sections of this chapter, the development of the proposed biomedical multiprocessor IP as an application extension to SOCLE’s Cheetah ARM SoC platform is presented in greater detail.

4.2 SoC-based Design of the Biomedical Multiprocessor System

4.2.1 SoC Architecture

The Cheetah ARM SoC platform solution offered by SOCLE Technology Corp. (platform vendor) is chosen for the development of the proposed multiprocessor IP for its rich set of SoC features and friendly technical support. With respect to Figure 4-2, the Cheetah SoC platform solution comprises 1) for hardware, a reusable SoC platform based on the ARM926EJ-S processor and AMBA AHB/APB bus protocols, including many of the peripheral/support hardware components listed in Table 4-1; and 2) for software, a suite comprising a toolchain set, board support package, boot loader (U-boot), operating system (OpenLinux), root file system tools (BusyBox) and device drivers. A feature list of the Cheetah ARM SoC is shown in Table 4-2 and its architecture is shown in Figure 4-3.

Table 4-2 Select features of the SOCLE Cheetah ARM SoC platform

CPU (ARM926EJ-S)             Clock frequency up to 266MHz

AHB bus                      Clock frequency up to 50/133MHz (FPGA expansion/no FPGA exp.)
                             Bus extensions: 2 Master, 2 Slave

Memory Controller            NOR-Flash/NAND-Flash/SRAM/ROM (4 banks)
                             SDRAM (4 banks)

System Controller            Reset control, power mode control (normal/idle/slow)
                             Clock control: CPU/AHB clock ratios 8:1/4:1/3:1/2:1/1:1

DMA Controller               4 channels, mem-to-mem/IO-to-mem/mem-to-IO transfers

Interrupt Controller         31 sources, programmable rise/fall/high/low scheme, 1 FIQ

General Purpose I/O (GPIO)   8 individually programmable input/output pins