A novel scalable architecture with memory interleaving organization for full search block-matching algorithm

(1)

1997 IEEE International Symposium on Circuits

and

Systems, June 9-12,1997,

Hong Kong

A

NOVEL SCALABLE

ARCHITECTURE w I m MEMORY INTERLEAVING

ORGANIZATION FOR FULL SEARCH BLOCK-MATCHING ALGORITHM

Yeong-Kang Lai, Liang-Gee Chen, Tsung-Hun. Tsai, and Po-Cheng Wu

Department of Electrical Enigineering National Taiwan University

Taipei, Taiwan, R. 0. C. China

ABSTRACT

This paper describes a high-throughput scalalble architecture for full-search block-matching algorithm (FSBMA). T h e number of processing elements (PES) is scalable according to the variable algorithm parameters and the performance required for different applications. By use of the efficient PE-rings and the intelligent memory-interleaving organization, the efficiency of the architecture can be in- creased. Techniques for reducing interconnections and ex- ternal memory accesses are also presented. Our results demonstrate t h a t the scalable PE-ringed architecture is a flexible and high-performance solution for

FSBMA.

1. INTRODUCTION

Among various video compression techniques, the motion- compensated hybrid coding is the most popular one and is adopted by several standards and proposals [I]-[3]. Block matching algorithm for motion estimation is nowadays used in a wide variety of applications. It removes the temporal redundancy within frame sequences, thus it provides these coding systems with significant bit-rate reduction. Howev- er, it also requires a large amount of computation and a heavy memory bandwidth.

The different performance requirements are needed for different real-time video applications. T h e performance requirements can be evaluated by some essential parameters such as the block size, the search area size, the frame size, and the frame rate. T o meet real-time video application, the number of P E S depends on the performance requirements. For numerous application domains, optimal block-matching parameters have not been specified yet. Furthermore, flexibility in parameters is explicitly favored for emulation of algorithms in an early phase of system development. Hence, there exists a substantial need for a flexible motion estima- tor, allowing the user t o select his own parameters and t o check the influence of parameter variations.

Several dedicated hardware implementations have been realized for full-search (FS) or full-search-based block matching algorithms [4]-[ll]. Principally, these realizations are systolic arrays, laid out for a specific set of parameter values. Hence they usually offer only a limited flexibility, or even no flexibility a t all. Moreover, the parallel architecture with multiple PES can efficiently increase throughput. However, the number of input pins, the difficulties on d a t a addressing, and interconnection complexity between memory modules and PES make it hard t o implement. We propose the novel scalable PE-ringed architecture effectively t o solve all these problems.

In the presented paper, a scalable, parametrizable and

0-7803-3583-XI97 $10.00 01997

IEEE

O B I B 2 B i 04 05 0 6 0 7 B8 BY B I O B l i 812813 B l 4 B i 16017HIX . ~ B l i l B l l

.I-.

12 013 1174 O M 047 48 B4Y H i l l BA2 B h i 64 865 1166 B7Y BXI a(0.0)a(0,l)a(o.Z) . , , , . .q7,o)a(l,l)a(l,z) a(2,0)a(2.l)a(2.2). . . , , , . 224 0225 8226. . 240 BZ4l 0242 ~~ -8 16x1 6 template block from current frame

X

'

31 x31 search area from previous frame

7

Figure: 1. Template block and search area for 256 possible displacement

( N

= 16,p = 8).

yet efficient full search block-matching architecture is de- scribed. The proposed approach offers the various d a t a flow of different parameters and thus achieves an efficiency close t o 1 0 0 ]percent. It also utilizes the data-reuse t o reduces the number of input pins. Section 2 shows the overall scalable architecture. In Section 3 , we present some techniques of memor:y bandwidth reduction, d a t a arrangement, and interconnection simplification. In Section 4, the performance of the proposed VLSI architecture along with various conven- tional full search block-matching architectures is analyzed. Finally, Section 5 gives a conclusion.

2. THE PROPOSED VLSI ARCHITECTURE

The procedure of a block-matching algorithm is t o find the best matched displaced block from the previous frame Ft-1, within a search range, for each

N

x N block in the present frame .Ft. A straightforward method, the full search, ex-

haustively matches all possible candidates t o find the displacement (called motion vector) with a minimal distortion. As a c.riterion of distortion, the mean absolute difference (MAD) is calculated for each candidate location ( U , v ) :

(2)

H MM columns and H P E columns _{from the above}_PE _{from memory module} current block - DIXBIS v MM l O W S v PE rows

...

time-sharing common bus I

motion vector

Figure 2 . The scalable PE-ringed architecture.

N-1 N-1

= _IFt(z

+

_{I , Y}

+

_{m ) -}_Ft-i(z

+

_{I +} U , Y + m +

.)I

1=0 m = O

(1) where ( z , y ) is the coordinate of the upper-left pixel of the current block in F,, and the values of U and TJ are limited

t o between - p t o p - 1. Fig. 1 illustrates the procedure of FSBMA. T h e candidate blocks are labeled by BO N B255.

Fig. 2 shows the scalable PE-ringed architecture t o perform the FSBMA. T o exploit the parallelism in FSBMA, the proposed architecture is consists of H x V identical on-chip Memory Modules (MMs) and H x V identical processing elements (PES). T h e P E S are connected in a ringed fashion. Each of the P E S is composed of an absolute difference unit, an accumulator, and a final-result latch in a pipelined fashion. T h e detailed hardware architecture of a PE is shown in Fig. 3. According t o the performance requirements of different real-time video applications, the number of the MMs and the PES, i.e., H x V , is determined by the following evaluation equation. Assume t h a t the cycle time is Tp for

t h e N x N current block with the search range of - p t o p - 1. ( 2 ) x ( 2 ~ ) ’ x N 2 x Tp 1

f r

< -

H x V T =

where T denotes the total block-matching time of all blocks in a frame (frame size: W X L). It should be small- er than frame period, l / f r . The search area pixels t o be computed are first stored in the MMs. T h e current block pixels are sequentially input and broadcasted t o all P E S in

current block pixels from the left PE

to time-sharing common bus to the below PE to .the

right PE

Figure 3. Architecture of PE.

a raster-scan order. The upper-left H x V of the all candidate blocks are first computed. After 256 clock cycles, the MAD of each candidate block is produced in each PE and transferred to PE’s final-result latch. These H x V latched MADS can then be sent t o the minimum extractor t o get a minimum MAD. During the comparisons, the PES continue t o perform the block-matchings of the next H x V candidate blocks. It does not consume extra cycles t o fill u p the pipeline operations.

3. T E C H N I Q U E S FOR DATA M A N A G E M E N T This architecture is based on some d a t a management techniques: (1) On-chip memory configuration for data-reuse, ( 2 ) Memory interleaving organization for parallel d a t a accesses, and (3) Propagation of the accumulated partial- results for eliminating the interconnection overheads between the PES and the MMs, These techniques are de- scribed in the following subsections. Fig. 4 shows a simple example t o illustrate the operation of the proposed architecture. In terms of (Z), we assume t h a t the 4x4 MMs and 4x4 PES are needed t o meet a certain real-time application.

3.1. On-chip Memory C o n f i g u r a t i o n for Data- The memory bandwidth is the main bottleneck t o support the high performance motion estimators. However, pixels are repeatedly used several times t o evaluate different candidate blocks. To avoid the extremely high bandwidth requirements for chip 1/0 and memory systems, we utilize the overlap between the search areas of the adjacent current blocks (see Fig. 5) and propose the scheme of three half- search-area (HSA) segments. Each of the memory modules is partitioned into three HSA segments, as shown in Fig. 6(a). One HSA block (31 x 16 pixels) is interleaved t o reside the same labeled segments in the 4 x 4 MMs. Fig. 6(b) illustrates the operations of the three HSA segments. When executing task 0 (matching current block 0 ) , the PES access the search area pixels from the segments 0 and 1. During performing the FS of current block 0, the HSA segment 2 in all memory modules is being filled by the right-half of the search area for task 1 (HSA C). When executing task 1 (matching current block l), the search area pixels from the

(3)

L I I ' I ' i ' I '

l

Figure 4. A simple architecture e x a m p l e for

N

= 4, p = 2, H = 4, and V = 4.(some interconnections are o m i t t e d . )

HSA blocks (31 x 16) ~ w r s n l blocks (16 x 16)

P

a

I I 1 I I I

A R C D E F G H - - .

+_I

--

ssarch area of c u m n l block 0 (task 0) search area of current block 1 (task 1)

1

- search area of current black 2 (lark 2)

F i g u r e 5 . Overlapped search areas of a d j a c e n t cur-

rent blocks.

segments 1 and 2 of the 4 x 4 MMs are accessed by the PES, and the new d a t a (HSA D) are input to segment 0. In this cyclic manner, the 31 x 16 new pixels can be easily accessed from system memory during performing the block-matching of the current block by only one input port.

3.2. M e m o r y I n t e r l e a v i n g O r g a n i z a t i o n

T h e on-chip memory is used t o reduce the 1oa.d for chip

1/0 and memory system. The next problem is: how to provide all required d a t a for the H x V P E S simultaneously. Our approach is t o divide the memory into H x V memory modules, and to interleave the input pixels to these H x V modules. An example are shown in Fig. 7. The label (0 N 15) for each pixel indicates the module t h a t stores

the pixel. This memory interleaving organization provides

a solution for parallel d a t a accesses, but it stipulates t h a t each of P E S is able t o access the corresponding memory module in parallel. HSA segment 0 HSA segment 1 HSA segrnent 2 memory module i

I

segment2

1

contents of segments TO T1 T2 T3 T4 T5 T6 A I n / D D m G G B B m E E M H m C C I F / F F m * time

H full search for current block Ti: task i

F i g u r e 16. ( a ) The three partitioned HSA s e g m e n t s . (b) Contents of the three HSA s e g m e n t s ( w r i t e operations are m a r k e d b y small f r a m e s ) .

3.3. P r o p a g a t i o n of the accumulated partial-

In most of multiple

PES

designs, a fully connected interconnection is demanded between the multiple PES and multiple

MMs. However, the interconnection gives the extra delay and consumes larger routing space due to large numbers of buses, multiplexers and tri-state buffers. To overcome the drawback, the adjacent P E S are connected in the horizon- tal rings and the vertical rings, and we arrange the d a t a flow from the memory modules to the P E S and redistribute the PES operations. This method allows the operations be- longing i o a candidate block t o be performed by the several PES. Then, for every clock cycle, P E propagates the accumulated partial-result to t h e adjacent

PE

t o form the next partial-result of the block candidate. After 256 clock cycles, the final result of each candidate block is produced in each P E . The d a t a flow is shown in Table 1. By use of the propagation of the accumulated partial-results, each of the P E S is only connected to one memory module. This eliminates the complicated interconnection and the switching circuitry between the P E S and the memory modules.

results

4. P E R F O R M A N C E A N A L Y S I S

In this section, we analyze the performance of the proposed VLSI architecture for the full search block-matching algorithm. Table 2 presents a comparison between the proposed and previous architectures. T h e 2-D array provides high-speed motion estimators with very high costs. T h e 1-

D array is an efficient low-cost design for some low-speed applications. However, the proposed architecture gives a satisfactory solution taking into account scalability, input ports and computation speed.

5. C O N C L U S I O N

A scalable PE-ringed architecture for F S B M A has been de- scribed. The number of processing elements (PES) is scalable according to the variable algorithm parameters and the

(4)

Architecture No. of t h e PES

No. ofinDut d a t a Dins

proposed architecture 2-D array [4] 1-D array [5]

128

I

64

1

32

[

16 128

1

64

1

32

1

16 128

1

64

1

32

I

16 16 I 16 I 16 I 16 24 I 24 I 24 1 24 136 I 72 I 40 I 24

Figure 7. An example for the distribution of search area pixels to 4 x 4-memory modules.

performance required for different applications. A configuration of random-access on-chip memory modules solves t h e problems of chip 1/0 and memory bandwidth require- ment. Input d a t a are arranged in the memory modules by memory interleaving organization. Combined with a tech- nique of t h e propagation of accumulated partial results, the interleaved memory module provides every PE with its required d a t a simultaneously without introducing complicated interconnections and switching circuitry. In summary, t h e proposed architectures have t h e following desirable fea- tures: (1) low hardware cost, ( 2 ) high throughput rate, ( 3 )

low latency delay, (4) low 1/0 and memory bandwidth requirements, and (5) 100 percent computation efficiency.

REFERENCES

[l] C C I T T SGXV Working P a r t y XV/1 Specialists Group on Coding for Visual Telephony, Document #584, Nov.

1989.

[2]

ISO/IEC/JTCl/SC29/WGll/Draft/CD

13818-2,

“Recommendation H.262,” November, 1993.

[3]

ISO/IEC/JTCl/SC29/WGll/Draft/CD

13818-1,

“MPEG-2 Systems,” November, 1993.

[4] Luc De Vos and Michael Stegherr,

,

“Parameterisable VLSI Architectures for the Full-Search Block-Matching Algorithm

,”

I E E E Transactions o n Circuit and Sys- tem., vol. 36 no. 10, p p 1309-1316, Oct. 1989. [5] K . M. Yang, M. T . Sun, and L. Wu, “ A family of VLSI

designs for t h e motion compensation block-matching algorithm,” I E E E Transactions o n Circuit and S y s t e m

for Video Technology., vol. 36, no. 10, pp.1317-1325,

Oct. 1989.

Clock cycles per block matching

Table 1. Data flow for FSBMA ( N = 1 6 , p = 8 )

T i m e I D a t e S e q u e n c e I PBO I P B l I . I P B 1 4 1 P B 1 6 O X 1 6 4 0 I 512 1024 2048 4096 512 1024 2048 4096 512 1024 2048 4096 ... ... ... ... ... ... . ( 0 , 1 4 ) B 2 O X 16+14 OX16+16 .(0,16) B 1 1X 16+0 . ( 1 , 0 ) B 4 8 1X 1 6 + 1 .(l,l) BKl ... ... ... ... ... ... 1 x 1 6 1 1 4 a ( 1 . 1 4 ) BKO 1 X 1 6 + 1 6 & ( 1 , 1 6 ) 8 4 9 ... ... ... ... ... ... ... ... ... ... ... ... 14X 16+0 r(14,O) 8 3 2 14X 16+1 4 1 4 . 1 ) 8 3 6 1 4 X 1 6 + 1 4 1 4 x 1 6 + 1 6 16X 16+0 16X 16+1 B 4 9 ... ... ... a ( 1 4 , 1 4 ) 8 3 4 B 3 6 . B 1 6 B 1 7 .(14,16) B 3 3 8 3 4 . B 1 9 8 1 6 a(16.0) B16 B l t . B 1 B 3 ~ ( 1 6 ~ 1 ) B 1 9 B 1 6 . B 1 B 2 ... ... B 6 0 ... ... 1 6 1 1 6 + 1 4 1 6 X 1 6 + 1 6 ... ... ... B 3 1 ... ... . . . ... ... ... . . . ... a ( l K , l 4 ) B 1 1 B 1 9 . BO B 1 E l ? B l 8 . B 3 BO a( 1 6 , l S ) I ... I ... I . . . I . . . I . I ... I I ... I ... I ... l . . . I . I ... I ...

I

” ’

I

”

I

...

I

N B X T

1 . 1

...

1

... ... ... ... I ( ” I

[6] Y. S. Jehng, L. G. Chen,and T. D. Chiueh, “ An ef-

ficient and simple VLSI tree architecture for motion estimation algorithm,’’ I E E E Transactions o n Signal

Proc., vol. 41, no. 2 , Feb. 1993.

[7] H. M. Jong, L. G. Chen, and T. D. Chiueh, I‘ Paral- lel Architecture for 3-Step Hierarchical Search Block- Matching Algorithm,” I E E E Transactions o n Circuit and S y s t e m for Video Technology., vol. 4 , no. 4, Aug.

1994.

[8] Shifan Chang, Juin-Haur Hwang, and Chein-Wei Jen

,

“Scalable Array Architecture for Full Search Block Matching

,”

I E E E Transactions o n Circuit and S y s t e m

f o r Video Technology., vol. 5 no. 4 , pp 332-353, Aug.

1995.

[9] Luc De Vos and Matthias Schobinger

,

“VLSI Architec- ture for a Flexible Block Matching Processor

,”

IEEE

Transactions o n Circuit and S y s t e m for Video Tech- nology., vol. 5 no. 5, pp 417-428, Oct. 1995.

[lo] Santanu D u t t a and Wayne W o l f , “A Flexible Paral- lel Architecture Adapted t o Block-Matching Motion-

Estimation Algorithms

,”

IEEE Transactions on Cir- cuit and S y s t e m for Video Technology., vol. 6 no. 1, p p

74-86, Feb. 1996.

[ll] Gangan G u p t a and Chaitali Chakrabarti

,

“Archi- tectures for Hierarchical Other Block Matching Algo- rithms

,”

I E E E Transactions o n Circuit and S y s t e m

for Video Technology., vol. 5 no. 6, p p 477-489, Dec.

1995.