Implementation of Local Motion Estimation

CHAPTER 2 DIGITAL IMAGE STABILIZATION

2.3 Implementation of Local Motion Estimation

Fig. 9 The block diagram of the proposed CMV generation method.


After programming the DIS system, we found that it performs excellently off-line compared with the RPM fuzzy set theory method [32]. However, it is hard to implement the algorithm in practical consumer camcorders. Although several DIS systems can already run on-line on a PC [45], there is still a long way to go before DIS is realized in dedicated hardware.

Therefore, it is necessary to analyze the DIS computation load before implementing the system in practical camcorders. We recorded the computation time of each DIS step and compared them. These measurements may lack accuracy because they depend on the test video, but they still show that most of the computation load belongs to the motion estimation step. Fig. 10 shows the percentage of the computation load of each block in the DIS system. Motion estimation takes about 80% of the processing time and motion compensation the remaining 20%. More than half of the total computation time (about 63% in Fig. 10) belongs to the local motion estimation step. This is because the RPM method in the LMV step must first load the image SAD values into memory and then find the minimum value pixel by pixel. To find the global minimum position, a DSP processor compares each value with the stored minimum, stores the smaller one, and moves on to the next pixel; the position of the minimum must also be stored in memory. The larger the processing unit, the longer the processing time. Our strategy is to design an application-specific chip that greatly reduces the computation time of LMV estimation. The less processing time the LMV estimation step needs, the better the image stabilization processing can operate in real time.
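To make the cost of the DSP approach concrete, the following Python sketch models the compare-and-store loop described above: it scans an accumulated SAD array one element at a time, keeps the running minimum and its position, and counts the comparisons. The function name and the comparison counter are illustrative assumptions, not part of the original design; the 19×25 array size is taken from Chapter 4.

```python
from typing import List, Tuple


def sequential_argmin(sad: List[List[int]]) -> Tuple[int, Tuple[int, int], int]:
    """Scan a 2-D SAD array element by element, as a DSP loop would.

    Returns (minimum value, (row, col) of the first minimum, number of comparisons).
    """
    best_val = sad[0][0]      # running minimum, kept in memory
    best_pos = (0, 0)         # position of the running minimum, also stored
    comparisons = 0
    for r, row in enumerate(sad):
        for c, val in enumerate(row):
            if (r, c) == (0, 0):
                continue      # the first element initialises the minimum
            comparisons += 1
            if val < best_val:        # compare with the stored minimum ...
                best_val = val        # ... and store the new minimum
                best_pos = (r, c)     # together with its position
    return best_val, best_pos, comparisons


if __name__ == "__main__":
    rows, cols = 19, 25
    sad = [[100 + r + c for c in range(cols)] for r in range(rows)]
    sad[7][12] = 3
    print(sequential_argmin(sad))  # -> (3, (7, 12), 474)
```

The comparison count grows linearly with the number of positions (N - 1 comparisons for N positions), which is the sequential scaling the CNN-based search is intended to avoid.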

The LMV estimation chip is designed with CNN technology to solve the heavy computation problem. Compared with conventional digital technology, CNN-based computing can realize TeraOPS-range image processing tasks in a cost-effective implementation. The design concept of the ASCNN chip is shown in Fig. 11. Using an 8-bit D/A converter, the absolute image difference, which ranges from 0 to 255, can be stored in the CNN local analog memories (LAM). With the aid of CNN technology, the LMV position can then be found much more easily than with a DSP processor. The CNN theory and the chip design are introduced in Chapter 3 and Chapter 4, respectively.

Fig. 10 Computational complexity of the DIS flow.

Fig. 11 The design concept of ASCNN chip architecture.

[Fig. 10 labels: Original Images; Store in Buffer; Motion Estimation ~80% (LMVs Estimation ~63%, Global Motion Vector Estimation ~17%); Motion Compensation ~20% (Compensation Motion Vector Estimation, Image Compensation); Compensated Images; signals LMVs, GMV, CMV.]

CHAPTER 3

Cellular Neural Network

The original Cellular Neural/Nonlinear Network (CNN) paradigm was introduced by Chua and Yang [20]. CNN technology is both a revolutionary and an experimentally proven computing paradigm. Its two most fundamental ingredients are analog processing cells with continuous signal values and local interaction within a finite radius. CNN shares some key features of neural networks and has important potential applications in areas such as image processing and pattern recognition. This chapter first introduces CNN theory and architecture, then the CNN circuit, and finally the inversion and adaptive threshold properties of CNN that are used to calculate the LMV.

3.1 CNN Theory

CNN can be considered an implementable alternative to fully connected neural networks and a remarkable improvement in the hardware implementation of artificial neural networks. Local interconnection and simple synaptic operators are the most attractive features of CNN for VLSI implementation in high-speed, real-time applications [35], and CNNs are widely used in application fields such as image processing and pattern recognition. Several hardware implementations of CNN have been reported in the literature [37], [38], [39].

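The state equation of CNN can be represented by

$$\frac{dx_{ij}(t)}{dt} = -x_{ij}(t) + \sum_{C(k,l)\in N_r(i,j)} A(i,j;k,l)\,y_{kl}(t) + \sum_{C(k,l)\in N_r(i,j)} B(i,j;k,l)\,u_{kl} + I \tag{3.1}$$

$$y_{ij}(t) = f\big(x_{ij}(t)\big) = \tfrac{1}{2}\left(\lvert x_{ij}(t)+1\rvert - \lvert x_{ij}(t)-1\rvert\right) \tag{3.2}$$

where $x_{ij}$, $u_{ij}$, and $y_{ij}$ are the state, input, and output of cell $C(i,j)$, $I$ is the threshold, and $C(k,l)$ is a point in the neighborhood $N_r(i,j)$ within a radius r of the cell $C(i,j)$. A and B are the (in general nonlinear) cloning templates [40]. Fig. 12 shows the dynamic route of the state in CNN, and the output function of Eq. (3.2) is plotted in Fig. 13.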

In many applications, the templates (A, B) and the threshold I are translation invariant. When A and B are functions of a single variable, the linear (space-invariant) template is represented by the additive terms of Eq. (3.1). When the template is space invariant, every cell is described by the same cloning template, defined by two (2r + 1) × (2r + 1) real matrices A and B together with the constant term I. As a very special case, if the input and initial state values are sufficiently small and f is piecewise linear, the dynamics of the CNN array are linear.
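The space-invariant dynamics can be explored numerically. The sketch below integrates Eq. (3.1) with forward Euler, assuming the standard normalized Chua-Yang form dx/dt = -x + ΣA·y + ΣB·u + I (C = R = 1), the output y = ½(|x+1| - |x-1|) of Eq. (3.2), r = 1, and a zero-valued boundary (the source does not specify the boundary condition). The example uses a textbook threshold-style template (centre feedback 2, B = 0, I = -z, x(0) = u) purely as an illustration; it is not claimed to be the template of this chip, and the function names are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def f(x: float) -> float:
    """Piecewise-linear output nonlinearity, Eq. (3.2)."""
    return 0.5 * (abs(x + 1.0) - abs(x - 1.0))


def template_sum(t: Grid, grid: Grid, i: int, j: int) -> float:
    """Sum of t[dk][dl] * grid[i+dk-1][j+dl-1] over the r = 1 neighbourhood.

    Cells outside the array contribute 0 (assumed fixed, zero-valued boundary).
    """
    rows, cols = len(grid), len(grid[0])
    total = 0.0
    for dk in range(3):
        for dl in range(3):
            ii, jj = i + dk - 1, j + dl - 1
            if 0 <= ii < rows and 0 <= jj < cols:
                total += t[dk][dl] * grid[ii][jj]
    return total


def simulate_cnn(a: Grid, b: Grid, bias: float, u: Grid, x0: Grid,
                 dt: float = 0.05, steps: int = 400) -> Grid:
    """Forward-Euler integration of Eq. (3.1); returns the final outputs y = f(x)."""
    rows, cols = len(u), len(u[0])
    x = [row[:] for row in x0]
    # The feedforward term B*u is constant because the input does not change.
    bu = [[template_sum(b, u, i, j) for j in range(cols)] for i in range(rows)]
    for _ in range(steps):
        y = [[f(v) for v in row] for row in x]
        x = [[x[i][j] + dt * (-x[i][j] + template_sum(a, y, i, j) + bu[i][j] + bias)
              for j in range(cols)]
             for i in range(rows)]
    return [[f(v) for v in row] for row in x]


if __name__ == "__main__":
    A = [[0, 0, 0], [0, 2, 0], [0, 0, 0]]   # illustrative threshold-style template
    B = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
    z = 0.25
    u = [[0.6, -0.2], [0.1, 0.4]]
    y = simulate_cnn(A, B, -z, u, x0=u)
    print([[round(v) for v in row] for row in y])  # -> [[1, -1], [-1, 1]]
```

With this template each cell settles to +1 when its input exceeds z and to -1 otherwise, which shows how a single bias term I can act as a shared threshold across the whole array.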

Unlike other standard analog processing arrays or neural networks, CNN relies crucially on the one-to-one geometric (topographic) correspondence between the processing elements and the elements of the processed signal array (e.g., pixels). Moreover, the template has a geometric meaning that can be exploited to gain geometric insight and simpler design methods.

Fig. 12 The dynamic route of state in CNN.

3.2 CNN Architecture

The basic circuit unit of CNN is called a cell. It contains linear and nonlinear circuit elements, which typically are linear capacitors, linear resistors, linear and nonlinear controlled sources, and independent sources. The structure of CNN is similar to that found in cellular automata, and each cell in a CNN is connected only to its neighbor cells. Adjacent cells can interact directly with each other. Cells not directly connected together may affect each other indirectly because of the propagation effects of the continuous-time dynamics of the network. A typical example of a cell C(i, j) is shown in Fig. 14, where the suffixes u, x, and y denote the input, state, and output, respectively.

Fig. 14 The circuit of a CNN cell.

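The differential equation governing a CNN in Eq. (3.1) is rewritten in circuit form as follows:

$$C\frac{dv_{xij}(t)}{dt} = -\frac{1}{R_x}\,v_{xij}(t) + \sum_{C(k,l)\in N_r(i,j)} A(i,j;k,l)\,v_{ykl}(t) + \sum_{C(k,l)\in N_r(i,j)} B(i,j;k,l)\,v_{ukl} + I$$

where $A(i,j;k,l)$ is the feedback cloning template and $B(i,j;k,l)$ is the feedforward cloning template. The synapse weights of the shift-invariant CNN can be described by the feedback and feedforward cloning templates; for r = 1,

$$A = \begin{bmatrix} a_{-1,-1} & a_{-1,0} & a_{-1,1} \\ a_{0,-1} & a_{0,0} & a_{0,1} \\ a_{1,-1} & a_{1,0} & a_{1,1} \end{bmatrix}$$

The matrix B is defined in the same way. Then, assuming $|v_{xij}(0)| \le 1$ and $|v_{ukl}| \le 1$ (so that $|v_{ykl}| \le 1$), the maximum value of the state, in particular in the steady state, is bounded by the sum of the absolute values of all inputs from the neighborhood cells:

$$|v_{xij}|_{\max} = 1 + R_x|I| + R_x \max_{i,j}\left[\sum_{C(k,l)\in N_r(i,j)}\big(|A(i,j;k,l)| + |B(i,j;k,l)|\big)\right]$$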
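The state bound described above is useful when sizing the swing of the state node. The sketch below evaluates it, assuming Chua and Yang's standard form |x|max = 1 + Rx·|I| + Rx·Σ(|A| + |B|) for a space-invariant template with |x(0)|, |u|, |y| ≤ 1. With a space-invariant template every interior cell has the same sum, so the maximum over cells reduces to one template sum. The function name is hypothetical.

```python
from typing import List

Template = List[List[float]]


def chua_yang_state_bound(a: Template, b: Template, bias: float, rx: float = 1.0) -> float:
    """Upper bound on |x| for a space-invariant CNN, assuming |x(0)|, |u|, |y| <= 1."""
    abs_sum = (sum(abs(v) for row in a for v in row)
               + sum(abs(v) for row in b for v in row))
    return 1.0 + rx * abs(bias) + rx * abs_sum


if __name__ == "__main__":
    A = [[0, 0, 0], [0, 2, 0], [0, 0, 0]]   # centre feedback only
    B = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]   # no feedforward
    print(chua_yang_state_bound(A, B, bias=-0.25))  # -> 3.25
```

In normalized units (Rx = 1) this template can never drive the state beyond 3.25, so a circuit realizing it needs a state range of at least that size.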

3.3 CNN Circuit Design

The current-mode approach [40], [42] is used in the CNN circuit design because addition is especially simple in current mode: weighted currents are summed by joining them at a common node, and the weights themselves are set by appropriate transistor sizing. The piecewise-linear function is achieved by cascading two current limiters, as shown in Fig. 15.


Fig. 15 Piecewise linear function. (a) Schematic view and (b) Transfer characteristics of two current limiters in the cascade [36].

The limiting operation on the input current, denoted by Ix, takes place at a negative value Ix = -IQ and at a positive value Ix = IQ. For Ix ≤ -IQ, no current flows through the transistors M3 and M4. Therefore, IDS5 = IDS6 = 2IQ and Iy = -IQ, where IDS5 represents the drain-to-source current of M5, and so on. For -IQ < Ix < IQ, the currents IDS3 = IDS4 = IQ + Ix and IDS5 = IDS6 = IQ - Ix produce the output current Iy = IQ - IDS6 = Ix. However, if Ix ≥ IQ, then IDS5 = IDS6 = 0 and Iy = IQ.
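The three operating regions of the cascaded limiters can be captured in a few lines. This Python sketch follows the drain-current relations given for Fig. 15 (Iy = IQ - IDS6 in every region) and uses µA units; choosing IQ = 20 µA for the example is an assumption borrowed from the 20 µA current sources mentioned in Section 3.4.2, and the function name is hypothetical.

```python
def limiter_output(ix: float, iq: float) -> float:
    """Output current Iy of two cascaded current limiters (units: µA)."""
    if ix <= -iq:
        ids6 = 2 * iq      # M3 and M4 carry no current: IDS5 = IDS6 = 2IQ
    elif ix < iq:
        ids6 = iq - ix     # linear region: IDS5 = IDS6 = IQ - Ix
    else:
        ids6 = 0.0         # IDS5 = IDS6 = 0
    return iq - ids6       # Iy = IQ - IDS6


if __name__ == "__main__":
    for ix in (-30.0, 5.0, 25.0):
        print(ix, limiter_output(ix, iq=20.0))  # -> -20.0, 5.0, 20.0
```

The result is the current-domain counterpart of the piecewise-linear output function of Eq. (3.2): unity gain between -IQ and IQ, and saturation at ±IQ outside that range.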

Figure 16 shows a detailed schematic diagram of a neuron cell [36]. The synaptic weight is realized by M11 to M14 for a0 and by M15 to M16 for a1. Four copies of a current mirror provide the weight for the four neighboring cells. The external input current, bias current, feedback current, and the currents from the neighboring cells are summed at the drain terminal of M1. The offset circuit provides a bias current set by the bias voltage VBI. The output voltage generator is a simple current comparator built from a cascade of two inverters. The input currents from the neuron circuit and the weighting circuit are compared with a reference set by VBB to produce an output Vy, which represents the sign of the neuron output.

Fig. 16 The CNN cell with fixed weights (templates).

3.4 CNN Template Consideration

3.4.1 Image Difference

The first step of the RPM method subtracts the representative-point pixel value of the previous frame from the pixels of the present sub-region. This subtraction can be implemented with image inversion followed by current addition. The inversion template [43] is listed below. The input of the CNN is the gray-scale representative sub-region.

Figure 17 shows the simulation result of the CNN inversion template [44]. Fig. 17(a) shows the input of the CNN, and Fig. 17(b) shows the initial state of the CNN, which is subtracted from the input.


Fig. 17 Simulation of the CNN inversion template. (a) Input gray-scale image. (b) Initial state of the CNN. (c) Output after difference processing.

Because a current-mode CNN is used as the processing core, the addition step can set the initial state of the CNN to zero and directly sum the input currents of the inverted representative sub-region and the present sub-region.
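The inversion-plus-addition idea can be sketched at the level of per-cell currents. In the sketch below, the inversion template is modelled simply as a sign flip, the addition as summing two currents at each cell node (initial state zero), and the representative point of frame t-1 is broadcast to every cell of the sub-region. Taking the magnitude at the end is an assumption: the text does not state where the absolute value is formed. All function names are hypothetical.

```python
from typing import List

Block = List[List[float]]


def invert(block: Block) -> Block:
    """Inversion template modelled as a sign flip of each cell's current."""
    return [[-v for v in row] for row in block]


def current_sum(a: Block, b: Block) -> Block:
    """Sum of two input currents at each cell node (initial state = 0)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def rpm_abs_difference(present: Block, rep_value: float) -> Block:
    """|present - representative point| for one sub-region."""
    rep_block = [[rep_value] * len(present[0]) for _ in present]
    signed = current_sum(present, invert(rep_block))   # present + (-representative)
    return [[abs(v) for v in row] for row in signed]   # magnitude (assumed step)


if __name__ == "__main__":
    print(rpm_abs_difference([[10, 12], [9, 30]], 12))  # -> [[2, 0], [3, 18]]
```

Because the subtraction becomes an addition of an inverted current, no separate subtractor circuit is needed in this model; the node itself performs the operation.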

3.4.2 Global Minimum

Searching for the minimum position in a specified area not only takes time but also consumes a lot of power. The basic processing step is to compare each value with the previous minimum and store the smaller one, so the larger the area to be searched, the more clock cycles, and hence the more power, it takes. Egusa [17] proposed an analog circuit to find the global minimum value, but the circuit is only suitable for applications with few inputs. Therefore, we propose to use a CNN adaptive threshold template, which can find the global minimum position in a larger array within fewer clock periods.

The adaptive threshold template is listed below.

The adaptive threshold template not only simplifies the CNN state equation (3.1) but also makes the template easier to implement in VLSI. Eq. (3.8) can then be written as:

We then set the initial state of the CNN to zero, which is reasonable for the circuit design and removes the need for any initialization circuit in the CNN core. The sigmoid function is set to saturate at ±20 µA, because every current source in the CNN core is 20 µA.

The analysis of the CNN adaptive threshold template is shown in Table 1.

Table 1: CNN Adaptive Threshold Template Analysis

[Table 1 columns: Case 1 | Case 2]

We use the property of Case 2 to implement the adaptive threshold template for searching the global minimum position in the RPM method. With the aid of the CNN array, a new search method has been developed for the DIS algorithm. As shown in Fig. 18, where the SAD values are plotted in a 3D view, the CNN output of a cell changes from logic 1 to logic 0 when its difference value falls below the threshold level, while all other outputs remain logic 1. If no position in the processing array flips its output, the CNN bias control circuit raises the threshold to a higher level, until the minimum input in the array is lower than the bias current.
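The search loop described above can be sketched behaviourally in Python. Assumptions, not taken from the circuit: the threshold rises through 32 evenly spaced levels (the 32-level bias of Chapter 4) over a 20 µA full scale (the VCC output limit), starting from the lowest level; one level is tried per clock cycle; and the function name is hypothetical. Every cell compares its own input with the shared threshold at once, so the loop runs over threshold levels, not over cells.

```python
from typing import List, Tuple


def adaptive_threshold_search(
    inputs: List[List[float]], levels: int = 32, full_scale: float = 20.0
) -> Tuple[List[Tuple[int, int]], float, int]:
    """Raise a global threshold one level per clock until some cell output flips.

    inputs: per-cell input currents in µA (e.g. accumulated SAD after the VCC).
    Returns (positions whose output flipped to 0, threshold used, clock cycles).
    """
    step = full_scale / levels
    for cycle in range(1, levels + 1):
        threshold = cycle * step
        # All cells compare in parallel: output 0 if the input is below the
        # threshold, otherwise the output stays at logic 1.
        outputs = [[0 if v < threshold else 1 for v in row] for row in inputs]
        flipped = [(r, c) for r, row in enumerate(outputs)
                   for c, out in enumerate(row) if out == 0]
        if flipped:
            return flipped, threshold, cycle
    return [], full_scale, levels   # no cell flipped within the available levels


if __name__ == "__main__":
    sad = [[15.0] * 25 for _ in range(19)]   # 19 x 25 array, uniform background
    sad[4][9] = 3.1                          # a single clear minimum
    print(adaptive_threshold_search(sad))   # -> ([(4, 9)], 3.125, 5)
```

In this model the number of cycles depends on the value of the minimum, never on the array size, and is at most 32. The trade-off is resolution: two positions whose inputs fall within the same 0.625 µA step flip in the same cycle and cannot be told apart.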

Fig. 18 Searching global minimum by using CNN adaptive threshold template.

CHAPTER 4

CIRCUIT DESIGN OF CNN-BASED LMV ESTIMATION

The process of finding LMVs is computationally intensive, requiring a very large number of operations for each image. The most complex operations are 1) computing the difference value for each candidate motion vector and 2) storing a difference value together with its position information whenever it is smaller than every previous value. Because these sequential operations slow down the computation, the motion computation is mapped onto a CNN with a fixed template and a tunable bias current circuit for each cell, an architecture well suited to this task.

The test image is a tennis player video of 312×200 pixels. Since image sensors capture more pixels than the video image requires, a boundary band is first cut from each captured image and saved as the compensation area. Removing this boundary area is called the pre-processing step, after which each image is 300×190 pixels. Fig. 19 shows the block diagram of the motion estimation that finds LMVs with CNN processing. Before CNN processing, the captured image is cut into four regions, and an LMV is found for each region. Each region is then divided into 30 sub-regions, as mentioned in Chapter 2, and each sub-region is a 19×25-pixel processing block. We first store the center-point pixel value (0 to 255) of every sub-region in the prescribed region of the (t-1)th image. According to the RPM method described in Chapter 2, the absolute difference is computed between the tth and (t-1)th sub-images. Through the digital-to-analog converter, the difference information is stored in the Local Analog Memories (LAM). The absolute difference arrays of the 30 sub-regions of each processing region are accumulated in the CNN LAM, so the voltage held by each memory cell varies with the accumulated difference value. The CNN does not begin to compute the motion vector until the LAM has accumulated the difference information of all 30 sub-regions. The CNN processor then searches for the global minimum position using a 32-level threshold bias, so the minimum difference value and its position are found within 32 clock cycles and latched in the location registers.
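The RPM accumulation for one region can be sketched in software as follows. Assumptions, not stated in the text: the four regions form a 2×2 grid, and each 150×95 region is tiled as 5 rows × 6 columns of 19×25 sub-regions (this tiling is inferred because 6 × 25 = 150 and 5 × 19 = 95, giving exactly 30 sub-regions); the representative point is the centre pixel (9, 12) of each sub-region in frame t-1; the frame contents and the jitter (dy, dx) in the example are synthetic; and all names are hypothetical. The sketch finds the exact integer minimum, whereas the chip holds these sums as LAM voltages and locates the minimum with the 32-level threshold search.

```python
import random
from typing import List, Tuple

Frame = List[List[int]]

SUB_H, SUB_W = 19, 25          # sub-region (processing block) size from the text
GRID_ROWS, GRID_COLS = 5, 6    # assumed tiling: 5 x 6 = 30 sub-regions, 95 x 150 region


def rpm_accumulate(prev: Frame, curr: Frame, top: int, left: int) -> List[List[int]]:
    """Accumulate |curr - representative point| of all 30 sub-regions into one 19 x 25 array."""
    acc = [[0] * SUB_W for _ in range(SUB_H)]
    for gr in range(GRID_ROWS):
        for gc in range(GRID_COLS):
            y0, x0 = top + gr * SUB_H, left + gc * SUB_W
            rep = prev[y0 + SUB_H // 2][x0 + SUB_W // 2]  # centre pixel of frame t-1
            for m in range(SUB_H):
                for n in range(SUB_W):
                    acc[m][n] += abs(curr[y0 + m][x0 + n] - rep)
    return acc


def local_motion_vector(acc: List[List[int]]) -> Tuple[int, int]:
    """(dy, dx) of the accumulated minimum, measured from the sub-region centre."""
    _, m, n = min((v, m, n) for m, row in enumerate(acc) for n, v in enumerate(row))
    return m - SUB_H // 2, n - SUB_W // 2


if __name__ == "__main__":
    rng = random.Random(0)
    big = [[rng.randrange(256) for _ in range(310)] for _ in range(200)]
    dy, dx = 2, -3                                                 # synthetic jitter
    prev = [row[5:305] for row in big[5:195]]                      # 190 x 300 frame t-1
    curr = [row[5 - dx:305 - dx] for row in big[5 - dy:195 - dy]]  # curr[y][x] = prev[y-dy][x-dx]
    acc = rpm_accumulate(prev, curr, top=0, left=0)                # top-left region
    print(local_motion_vector(acc))  # -> (2, -3)
```

The 19×25 accumulated array has 475 entries, matching the 475 LAM cells driven by the DAC; each entry can reach at most 30 × 255 in this model.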

The processing time is not affected by the size of the CNN array, since the threshold search takes at most 32 clock cycles regardless of the number of cells. Therefore, the larger the difference array, the greater the speed advantage over a DSP processor.

The system shown in Fig. 19 includes windowing (RPM and SAD), the 19×25 CNN array and LAM, the bias control circuit, and the addressing decoder.

Fig. 19 The flow of CNN-based local motion estimation.

Figure 20 shows the architecture of the application-specific CNN (ASCNN) design. Using the windowing component of Fig. 19, the sequential images of each region are segmented into 19×25-pixel sub-regions, and the absolute difference [17] of the two images is calculated.

Through an 8-bit D/A converter, each difference value is loaded and accumulated into the LAM, which consists of switched MOS capacitors. The bias current circuit adjusts the threshold of the CNN array according to the values on the global output chains. If all difference values are higher than the given bias, the bias circuit feeds a higher current into the CNN input. The X and Y position information is then detected quickly and stored in the registers.

[Fig. 19 and Fig. 20 diagram labels: Capture single image; Cut four regions; Cut 30 sub-images for a process element; bias circuit; X Reg; Y Reg]

Fig. 20 AS-CNN chip architecture.

4.1 D/A Converter

Analyzing the minimum image difference value of the input video sequence is necessary for designing the D/A converter. Fig. 21 plots the minimum difference value of each frame of the image sequence. The mean of the minimum difference values is 529, but the data vary from video to video. Therefore, the upper bound of the minimum difference is set to 1024. This upper bound corresponds to four maximum charging inputs, i.e., 256 × 4, before the LAM reaches 3.3 V. Note that there is an exception at sequence No. 48, where the minimum value exceeds 1024. This happens because the whole region undergoes a large movement, caused either by intentionally moving the camcorder or by an object so large that it is treated as the background of the region. The LMV found in such a region is not dependable and should not be stored. This kind of situation can be detected and discarded by the CNN controller.

[Fig. 21 plot: x-axis Frame No. (0 to 90); y-axis difference (pixel count) (0 to 1400)]

Fig. 21 Analysis of minimum difference value for tennis player video frames.

An 8-bit D/A converter translates the absolute image difference code into an analog current, which is loaded into the local analog memory. First, the DAC input passes through registers so that it is synchronized with the digital control circuit, which is triggered by a 20 MHz clock.

The DAC is made of eight sets of current mirrors, shown in Fig. 22, and its output stage drives the 475 LAM cells (one for each position of the 19×25 array). Table 2 lists the DAC specification required to keep the input stage functionally accurate.

Table 2: 8-bit D/A Converter Specification Table.

Model: Application Specific CNN 8-bit DAC
Output loading: 50 Ω
Operating voltage range (VDD = 3.3 V): MAX @ 11111111

Figure 23 shows the output voltage variation of 256 levels with 50Ω loading. The charging voltage of the analog memory in the output stage is proportional to the input current and the loading time. The larger the image difference value is, the higher the memory voltage will be.

Fig. 23 The output voltage variation due to input change of (8-bit) 256 steps.

2. DNL

Figure 24 shows the DNL analysis, comparing the output voltage in Fig. 23 with the ideal voltage curve V_ideal = 50 Ω × (0.43091/1000) × t, where 0.43091e-3 is the slope of the output voltage for an input change from 0 to 1. The Y axis gives the difference between the two curves in LSB, and the X axis is time, with a time step of 50 ns while the input code changes from 0 to 255. The largest DNL in Fig. 24 is about 0.25 LSB.

Fig. 24 DNL analysis.

3. INL

The INL is obtained by accumulating all the DNL data in Fig. 24. The result is INL = 0.1615 LSB, which is below 0.5 LSB.
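Both figures of merit follow from the measured output levels. The Python sketch below computes DNL as the deviation of each code step from one LSB and INL as the running sum of the DNL, matching the accumulation described above. Following the document, the LSB can be passed in explicitly (e.g. the slope of the 0 to 1 step); if it is omitted, the sketch falls back to an end-point LSB, which is an assumption. The function name and the example levels are illustrative, not measured data.

```python
from typing import List, Optional, Tuple


def dnl_inl(levels: List[float], lsb: Optional[float] = None) -> Tuple[List[float], List[float]]:
    """Return (DNL, INL) in LSB for output levels measured at codes 0..N-1."""
    if lsb is None:
        # End-point LSB: full-scale span divided by the number of steps.
        lsb = (levels[-1] - levels[0]) / (len(levels) - 1)
    # DNL: deviation of each code step from one ideal LSB.
    dnl = [(levels[k + 1] - levels[k]) / lsb - 1.0 for k in range(len(levels) - 1)]
    # INL: running sum of the DNL, i.e. the deviation of levels 1..N-1
    # from the ideal line through level 0.
    inl = []
    total = 0.0
    for d in dnl:
        total += d
        inl.append(total)
    return dnl, inl


if __name__ == "__main__":
    dnl, inl = dnl_inl([0.0, 1.0, 2.25, 3.0, 4.0], lsb=1.0)
    print(dnl)  # -> [0.0, 0.25, -0.25, 0.0]
    print(inl)  # -> [0.0, 0.25, 0.0, 0.0]
```

A positive and a negative DNL error can cancel in the running sum, which is why the reported INL (0.1615 LSB) can be smaller than the largest DNL (about 0.25 LSB).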

4. SNR

Because the input working frequency is set to 20 MHz, i.e., fclk = 20 MHz, we apply a sine-wave input of frequency 20 MHz/28 to the DAC and record the output waveform, as shown in (a). We then sample the stable output voltage at 25 ns and obtain 140 sampling points. The DFT analysis is shown in (b). Taking the logarithm of the x coordinate, as shown in (c), we find the following:

Power DC = 0.1479, Power Sine = 0.0369

Power Harmonic = 1.6947e-007, Power Noise = 9.6831e-008

SNR_dB = 55.8148 dB > 49 dB

Fig. 25 SNR analysis of the 8-bit DAC. (a) Sine-wave output waveform of the DAC; (b) DFT analysis of the DAC output with 140 sampling points; (c) logarithm of the x-coordinate in (b).
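The reported SNR can be checked directly from the tabulated powers. This Python sketch assumes SNR = 10·log10(P_sine / P_noise), with the DC bin excluded; the function and constant names are illustrative.

```python
import math


def snr_db(p_signal: float, p_noise: float) -> float:
    """Ratio of signal power to noise power, in dB."""
    return 10.0 * math.log10(p_signal / p_noise)


# Powers reported for the 8-bit DAC test (DC bin excluded).
P_SINE, P_HARMONIC, P_NOISE = 0.0369, 1.6947e-7, 9.6831e-8

if __name__ == "__main__":
    print(round(snr_db(P_SINE, P_NOISE), 2))               # harmonics excluded -> 55.81
    print(round(snr_db(P_SINE, P_HARMONIC + P_NOISE), 2))  # harmonics included -> 51.42
```

The first value matches the reported 55.8148 dB to within the rounding of the tabulated powers, which suggests the harmonic power was not counted as noise. Counting it as well (a SINAD-style figure) gives about 51.4 dB, which is still above 49 dB.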

The layout of the DAC is shown in Fig. 26. The 8-bit input comes from the output of the synchronization registers, and the maximum output current is 2.1064 mA, which is designed to give each LAM cell five times the charging capability before it reaches the 3.3 V upper bound.

Fig. 26 Layout of 8-bit current mode D/A converter

4.2 Local Analog Memory

The local analog memory (LAM) stores the image difference value. Fig. 27 shows the basic structure of a LAM cell. It consists of a transmission gate controlled by the CNN controller and a 2 pF MOS capacitor, defined in the input stage. Vctrl_P and Vctrl_N are switch signals from the position decoder. Although the MOS capacitor has nonlinearity problems, we choose it rather than a poly capacitor for area reasons. For a 2 pF poly capacitor, the nonlinearity problem is much easier to solve, but the capacitor occupies 2314 µm², while the MOS capacitor occupies only 1800 µm² [51]. Using MOS capacitors therefore saves considerable on-chip area. The characteristic curve of the MOS capacitor is shown in Fig. 28. Because of this nonlinear characteristic, the capacitance varies by 600 fF as the gate voltage Vgs changes from 0 to 0.65 V. Therefore, each capacitor should be pre-charged to a specific voltage, i.e., 0.65 V, before information is loaded into it. The layout is shown in Fig. 29. A symmetric layout style is used for every two LAM cells to reduce mismatch during fabrication.

However, the absolute accuracy of the MOS capacitor value is not important. As long as the capacitors across the whole LAM array have uniform properties, every accumulated difference is scaled by the same factor, so the CNN processor will still find the correct global minimum position.

Fig. 28 The characteristic curve of MOS capacitor.

Fig. 29 Layout of MOS capacitor.

4.3 Voltage to Current Converter

A voltage-to-current converter (VCC), shown in Fig. 30, transforms the LAM voltage into the CNN input current. With a properly designed W/L ratio, the VCC output to the CNN is limited to 20 µA. The maximum power consumption of the VCC is 104.6 µW when the input signal is 3.3 V. Fig. 31 shows the layout of the VCC. A common-centroid arrangement is used for the current-mirror devices M1 and M2.

Fig. 30 Schematic of voltage to current converter.

From the document: A Shake Vector Estimation Chip for Image Stabilization Implemented with a Cellular Neural Network Architecture
