• 沒有找到結果。

A novel scalable architecture with memory interleaving organization for full search block-matching algorithm

N/A
N/A
Protected

Academic year: 2021

Share "A novel scalable architecture with memory interleaving organization for full search block-matching algorithm"

Copied!
4
0
0

加載中.... (立即查看全文)

全文

(1)

1997 IEEE International Symposium on Circuits

and

Systems, June 9-12,1997,

Hong Kong

A

NOVEL SCALABLE

ARCHITECTURE w I m MEMORY INTERLEAVING

ORGANIZATION FOR FULL SEARCH BLOCK-MATCHING ALGORITHM

Yeong-Kang Lai, Liang-Gee Chen, Tsung-Hun. Tsai, and Po-Cheng Wu

Department of Electrical Enigineering National Taiwan University

Taipei, Taiwan, R. 0. C. China

ABSTRACT

This paper describes a high-throughput scalalble architec- ture for full-search block-matching algorithm (FSBMA). T h e number of processing elements (PES) is scalable ac- cording to the variable algorithm parameters and the per- formance required for different applications. By use of the efficient PE-rings and the intelligent memory-interleaving organization, the efficiency of the architecture can be in- creased. Techniques for reducing interconnections and ex- ternal memory accesses are also presented. Our results demonstrate t h a t the scalable PE-ringed architecture is a flexible and high-performance solution for

FSBMA.

1. INTRODUCTION

Among various video compression techniques, the motion- compensated hybrid coding is the most popular one and is adopted by several standards and proposals [I]-[3]. Block matching algorithm for motion estimation is nowadays used in a wide variety of applications. It removes the temporal redundancy within frame sequences, thus it provides these coding systems with significant bit-rate reduction. Howev- er, it also requires a large amount of computation and a heavy memory bandwidth.

The different performance requirements are needed for different real-time video applications. T h e performance re- quirements can be evaluated by some essential parameters such as the block size, the search area size, the frame size, and the frame rate. T o meet real-time video application, the number of P E S depends on the performance requirements. For numerous application domains, optimal block-matching parameters have not been specified yet. Furthermore, flex- ibility in parameters is explicitly favored for emulation of algorithms in an early phase of system development. Hence, there exists a substantial need for a flexible motion estima- tor, allowing the user t o select his own parameters and t o check the influence of parameter variations.

Several dedicated hardware implementations have been realized for full-search (FS) or full-search-based block matching algorithms [4]-[ll]. Principally, these realizations are systolic arrays, laid out for a specific set of parameter values. Hence they usually offer only a limited flexibility, or even no flexibility a t all. Moreover, the parallel architec- ture with multiple PES can efficiently increase throughput. However, the number of input pins, the difficulties on d a t a addressing, and interconnection complexity between mem- ory modules and PES make it hard t o implement. We pro- pose the novel scalable PE-ringed architecture effectively t o solve all these problems.

In the presented paper, a scalable, parametrizable and

0-7803-3583-XI97 $10.00 01997

IEEE

O B I B 2 B i 04 05 0 6 0 7 B8 BY B I O B l i 812813 B l 4 B i 16017HIX . ~ B l i l B l l

.I-.

12 013 1174 O M 047 48 B4Y H i l l BA2 B h i 64 865 1166 B7Y BXI a(0.0)a(0,l)a(o.Z) . , , , . .q7,o)a(l,l)a(l,z) a(2,0)a(2.l)a(2.2). . . , , , . 224 0225 8226. . 240 BZ4l 0242 ~~ -8 16x1 6 template block from current frame

X

'

31 x31 search area from previous frame

7

Figure: 1. Template block and search area for 256 possible displacement

( N

= 16,p = 8).

yet efficient full search block-matching architecture is de- scribed. The proposed approach offers the various d a t a flow of different parameters and thus achieves an efficiency close t o 1 0 0 ]percent. It also utilizes the data-reuse t o reduces the number of input pins. Section 2 shows the overall scalable architecture. In Section 3 , we present some techniques of memor:y bandwidth reduction, d a t a arrangement, and inter- connection simplification. In Section 4, the performance of the proposed VLSI architecture along with various conven- tional full search block-matching architectures is analyzed. Finally, Section 5 gives a conclusion.

2. THE PROPOSED VLSI ARCHITECTURE

The procedure of a block-matching algorithm is t o find the best matched displaced block from the previous frame Ft-1, within a search range, for each

N

x N block in the present frame .Ft. A straightforward method, the full search, ex-

haustively matches all possible candidates t o find the dis- placement (called motion vector) with a minimal distortion. As a c.riterion of distortion, the mean absolute difference (MAD) is calculated for each candidate location ( U , v ) :

(2)

H MM columns and H P E columns from the above PE from memory module current block - DIXBIS v MM l O W S v PE rows

...

...

time-sharing common bus I

motion vector

Figure 2 . The scalable PE-ringed architecture.

N-1 N-1

= IFt(z

+

I , Y

+

m ) - Ft-i(z

+

I + U , Y + m +

.)I

1=0 m = O

(1) where ( z , y ) is the coordinate of the upper-left pixel of the current block in F,, and the values of U and TJ are limited

t o between - p t o p - 1. Fig. 1 illustrates the procedure of FSBMA. T h e candidate blocks are labeled by BO N B255.

Fig. 2 shows the scalable PE-ringed architecture t o per- form the FSBMA. T o exploit the parallelism in FSBMA, the proposed architecture is consists of H x V identical on-chip Memory Modules (MMs) and H x V identical processing el- ements (PES). T h e P E S are connected in a ringed fashion. Each of the P E S is composed of an absolute difference unit, an accumulator, and a final-result latch in a pipelined fash- ion. T h e detailed hardware architecture of a PE is shown in Fig. 3. According t o the performance requirements of dif- ferent real-time video applications, the number of the MMs and the PES, i.e., H x V , is determined by the following evaluation equation. Assume t h a t the cycle time is Tp for

t h e N x N current block with the search range of - p t o p - 1. ( 2 ) x ( 2 ~ ) ’ x N 2 x Tp 1

f r

< -

H x V T =

where T denotes the total block-matching time of all blocks in a frame (frame size: W X L). It should be small- er than frame period, l / f r . The search area pixels t o be computed are first stored in the MMs. T h e current block pixels are sequentially input and broadcasted t o all P E S in

current block pixels from the left PE

to time-sharing common bus to the below PE to .the

right PE

Figure 3. Architecture of PE.

a raster-scan order. The upper-left H x V of the all candi- date blocks are first computed. After 256 clock cycles, the MAD of each candidate block is produced in each PE and transferred to PE’s final-result latch. These H x V latched MADS can then be sent t o the minimum extractor t o get a minimum MAD. During the comparisons, the PES continue t o perform the block-matchings of the next H x V candi- date blocks. It does not consume extra cycles t o fill u p the pipeline operations.

3. T E C H N I Q U E S FOR DATA M A N A G E M E N T This architecture is based on some d a t a management tech- niques: (1) On-chip memory configuration for data-reuse, ( 2 ) Memory interleaving organization for parallel d a t a ac- cesses, and (3) Propagation of the accumulated partial- results for eliminating the interconnection overheads be- tween the PES and the MMs, These techniques are de- scribed in the following subsections. Fig. 4 shows a simple example t o illustrate the operation of the proposed archi- tecture. In terms of (Z), we assume t h a t the 4x4 MMs and 4x4 PES are needed t o meet a certain real-time application.

3.1. On-chip Memory C o n f i g u r a t i o n for Data- The memory bandwidth is the main bottleneck t o support the high performance motion estimators. However, pixels are repeatedly used several times t o evaluate different can- didate blocks. To avoid the extremely high bandwidth re- quirements for chip 1/0 and memory systems, we utilize the overlap between the search areas of the adjacent current blocks (see Fig. 5) and propose the scheme of three half- search-area (HSA) segments. Each of the memory mod- ules is partitioned into three HSA segments, as shown in Fig. 6(a). One HSA block (31 x 16 pixels) is interleaved t o reside the same labeled segments in the 4 x 4 MMs. Fig. 6(b) illustrates the operations of the three HSA segments. When executing task 0 (matching current block 0 ) , the PES access the search area pixels from the segments 0 and 1. During performing the FS of current block 0, the HSA segment 2 in all memory modules is being filled by the right-half of the search area for task 1 (HSA C). When executing task 1 (matching current block l), the search area pixels from the

(3)

L I I ' I ' i ' I '

l

Figure 4. A simple architecture e x a m p l e for

N

= 4, p = 2, H = 4, and V = 4.(some interconnections are o m i t t e d . )

HSA blocks (31 x 16) ~ w r s n l blocks (16 x 16)

P

a

I I 1 I I I

A R C D E F G H - - .

+_I

--

ssarch area of c u m n l block 0 (task 0) search area of current block 1 (task 1)

1

- search area of current black 2 (lark 2)

F i g u r e 5 . Overlapped search areas of a d j a c e n t cur-

rent blocks.

segments 1 and 2 of the 4 x 4 MMs are accessed by the PES, and the new d a t a (HSA D) are input to segment 0. In this cyclic manner, the 31 x 16 new pixels can be easily accessed from system memory during performing the block-matching of the current block by only one input port.

3.2. M e m o r y I n t e r l e a v i n g O r g a n i z a t i o n

T h e on-chip memory is used t o reduce the 1oa.d for chip

1/0 and memory system. The next problem is: how to provide all required d a t a for the H x V P E S simultaneously. Our approach is t o divide the memory into H x V memory modules, and to interleave the input pixels to these H x V modules. An example are shown in Fig. 7. The label (0 N 15) for each pixel indicates the module t h a t stores

the pixel. This memory interleaving organization provides

a solution for parallel d a t a accesses, but it stipulates t h a t each of P E S is able t o access the corresponding memory module in parallel. HSA segment 0 HSA segment 1 HSA segrnent 2 memory module i

I

segment2

1

contents of segments TO T1 T2 T3 T4 T5 T6 A I n / D D m G G B B m E E M H m C C I F / F F m * time

H full search for current block Ti: task i

F i g u r e 16. ( a ) The three partitioned HSA s e g m e n t s . (b) Contents of the three HSA s e g m e n t s ( w r i t e op- erations are m a r k e d b y small f r a m e s ) .

3.3. P r o p a g a t i o n of the accumulated partial-

In most of multiple

PES

designs, a fully connected intercon- nection is demanded between the multiple PES and multiple

MMs. However, the interconnection gives the extra delay and consumes larger routing space due to large numbers of buses, multiplexers and tri-state buffers. To overcome the drawback, the adjacent P E S are connected in the horizon- tal rings and the vertical rings, and we arrange the d a t a flow from the memory modules to the P E S and redistribute the PES operations. This method allows the operations be- longing i o a candidate block t o be performed by the several PES. Then, for every clock cycle, P E propagates the accu- mulated partial-result to t h e adjacent

PE

t o form the next partial-result of the block candidate. After 256 clock cycles, the final result of each candidate block is produced in each P E . The d a t a flow is shown in Table 1. By use of the prop- agation of the accumulated partial-results, each of the P E S is only connected to one memory module. This eliminates the complicated interconnection and the switching circuitry between the P E S and the memory modules.

results

4. P E R F O R M A N C E A N A L Y S I S

In this section, we analyze the performance of the proposed VLSI architecture for the full search block-matching algo- rithm. Table 2 presents a comparison between the pro- posed and previous architectures. T h e 2-D array provides high-speed motion estimators with very high costs. T h e 1-

D array is an efficient low-cost design for some low-speed applications. However, the proposed architecture gives a satisfactory solution taking into account scalability, input ports and computation speed.

5. C O N C L U S I O N

A scalable PE-ringed architecture for F S B M A has been de- scribed. The number of processing elements (PES) is scal- able according to the variable algorithm parameters and the

(4)

Architecture No. of t h e PES

No. ofinDut d a t a Dins

proposed architecture 2-D array [4] 1-D array [5]

128

I

64

1

32

[

16 128

1

64

1

32

1

16 128

1

64

1

32

I

16 16 I 16 I 16 I 16 24 I 24 I 24 1 24 136 I 72 I 40 I 24

Figure 7. An example for the distribution of search area pixels to 4 x 4-memory modules.

performance required for different applications. A config- uration of random-access on-chip memory modules solves t h e problems of chip 1/0 and memory bandwidth require- ment. Input d a t a are arranged in the memory modules by memory interleaving organization. Combined with a tech- nique of t h e propagation of accumulated partial results, the interleaved memory module provides every PE with its re- quired d a t a simultaneously without introducing complicat- ed interconnections and switching circuitry. In summary, t h e proposed architectures have t h e following desirable fea- tures: (1) low hardware cost, ( 2 ) high throughput rate, ( 3 )

low latency delay, (4) low 1/0 and memory bandwidth re- quirements, and (5) 100 percent computation efficiency.

REFERENCES

[l] C C I T T SGXV Working P a r t y XV/1 Specialists Group on Coding for Visual Telephony, Document #584, Nov.

1989.

[2]

ISO/IEC/JTCl/SC29/WGll/Draft/CD

13818-2,

“Recommendation H.262,” November, 1993.

[3]

ISO/IEC/JTCl/SC29/WGll/Draft/CD

13818-1,

“MPEG-2 Systems,” November, 1993.

[4] Luc De Vos and Michael Stegherr,

,

“Parameterisable VLSI Architectures for the Full-Search Block-Matching Algorithm

,”

I E E E Transactions o n Circuit and Sys- tem., vol. 36 no. 10, p p 1309-1316, Oct. 1989. [5] K . M. Yang, M. T . Sun, and L. Wu, “ A family of VLSI

designs for t h e motion compensation block-matching algorithm,” I E E E Transactions o n Circuit and S y s t e m

for Video Technology., vol. 36, no. 10, pp.1317-1325,

Oct. 1989.

Clock cycles per block matching

Table 1. Data flow for FSBMA ( N = 1 6 , p = 8 )

T i m e I D a t e S e q u e n c e I PBO I P B l I . I P B 1 4 1 P B 1 6 O X 1 6 4 0 I 512 1024 2048 4096 512 1024 2048 4096 512 1024 2048 4096 ... ... ... ... ... ... . ( 0 , 1 4 ) B 2 O X 16+14 OX16+16 .(0,16) B 1 1X 16+0 . ( 1 , 0 ) B 4 8 1X 1 6 + 1 .(l,l) BKl ... ... ... ... ... ... 1 x 1 6 1 1 4 a ( 1 . 1 4 ) BKO 1 X 1 6 + 1 6 & ( 1 , 1 6 ) 8 4 9 ... ... ... ... ... ... ... ... ... ... ... ... 14X 16+0 r(14,O) 8 3 2 14X 16+1 4 1 4 . 1 ) 8 3 6 1 4 X 1 6 + 1 4 1 4 x 1 6 + 1 6 16X 16+0 16X 16+1 B 4 9 ... ... ... a ( 1 4 , 1 4 ) 8 3 4 B 3 6 . B 1 6 B 1 7 .(14,16) B 3 3 8 3 4 . B 1 9 8 1 6 a(16.0) B16 B l t . B 1 B 3 ~ ( 1 6 ~ 1 ) B 1 9 B 1 6 . B 1 B 2 ... ... B 6 0 ... ... 1 6 1 1 6 + 1 4 1 6 X 1 6 + 1 6 ... ... ... B 3 1 ... ... . . . ... ... ... . . . ... a ( l K , l 4 ) B 1 1 B 1 9 . BO B 1 E l ? B l 8 . B 3 BO a( 1 6 , l S ) I ... I ... I . . . I . . . I . I ... I I ... I ... I ... l . . . I . I ... I ...

I

” ’

I

I

...

I

N B X T

1 . 1

...

1

... ... ... ... I ( ” I

[6] Y. S. Jehng, L. G. Chen,and T. D. Chiueh, “ An ef-

ficient and simple VLSI tree architecture for motion estimation algorithm,’’ I E E E Transactions o n Signal

Proc., vol. 41, no. 2 , Feb. 1993.

[7] H. M. Jong, L. G. Chen, and T. D. Chiueh, I‘ Paral- lel Architecture for 3-Step Hierarchical Search Block- Matching Algorithm,” I E E E Transactions o n Circuit and S y s t e m for Video Technology., vol. 4 , no. 4, Aug.

1994.

[8] Shifan Chang, Juin-Haur Hwang, and Chein-Wei Jen

,

“Scalable Array Architecture for Full Search Block Matching

,”

I E E E Transactions o n Circuit and S y s t e m

f o r Video Technology., vol. 5 no. 4 , pp 332-353, Aug.

1995.

[9] Luc De Vos and Matthias Schobinger

,

“VLSI Architec- ture for a Flexible Block Matching Processor

,”

IEEE

Transactions o n Circuit and S y s t e m for Video Tech- nology., vol. 5 no. 5, pp 417-428, Oct. 1995.

[lo] Santanu D u t t a and Wayne W o l f , “A Flexible Paral- lel Architecture Adapted t o Block-Matching Motion-

Estimation Algorithms

,”

IEEE Transactions on Cir- cuit and S y s t e m for Video Technology., vol. 6 no. 1, p p

74-86, Feb. 1996.

[ll] Gangan G u p t a and Chaitali Chakrabarti

,

“Archi- tectures for Hierarchical Other Block Matching Algo- rithms

,”

I E E E Transactions o n Circuit and S y s t e m

for Video Technology., vol. 5 no. 6, p p 477-489, Dec.

1995.

數據

Figure  2 .   The scalable PE-ringed  architecture.
Figure  4.  A  simple architecture  e x a m p l e  for  N  =  4,  p  = 2,  H  = 4,  and  V  = 4.(some  interconnections  are  o m i t t e d
Figure  7.  An example for  the distribution  of  search  area pixels  to  4  x  4-memory modules

參考文獻

相關文件

Upon reception of a valid write command (CMD24 or CMD25 in the SD Memory Card protocol), the card will respond with a response token and will wait for a data block to be sent from

– Each time a file is opened, the content of the directory entry of the file is moved into the table.. • File Handle (file descriptor, file control block): an index into the table

– Maintain a heavy path: Instead of recalculate all ances tors' value, only the corresponding overlapping subpa th will be recalculated. It cost O(1) time for each verte x, and

Note: BCGD has stronger global convergence property (and cheaper iteration) than BCM..1. Every cluster point of the x-sequence is a minimizer

In particular, we present a linear-time algorithm for the k-tuple total domination problem for graphs in which each block is a clique, a cycle or a complete bipartite graph,

• The randomized bipartite perfect matching algorithm is called a Monte Carlo algorithm in the sense that.. – If the algorithm finds that a matching exists, it is always correct

• We shall prove exponential lower bounds for NP-complete problems using monotone circuits. – Monotone circuits are circuits without

• Consider an algorithm that runs C for time kT (n) and rejects the input if C does not stop within the time bound.. • By Markov’s inequality, this new algorithm runs in time kT (n)