IEEE Transactions on Consumer Electronics, Vol. 39, No. 3, AUGUST 1993
AN AREA-EFFICIENT MEDIAN FILTERING IC FOR IMAGE/VIDEO APPLICATIONS*
Po-Wen Hsieh, Jer-Min Tsai, and Chen-Yi Lee
Dept. of Electronics Eng. and Institute of Electronics, National Chiao Tung University, 1001, University Road, Hsinchu 300, Taiwan, ROC
ABSTRACT
An area-efficient IC for high-throughput median filtering applications is presented in this paper. This IC implements a modified delete-and-insert sorting algorithm which is very efficient for running order statistics applications. In hardware design, we first map the algorithm onto a regular PE structure, where each PE consists of a shift register, a comparator, and some control gates. Then we conduct full-custom circuit/layout design of the PE to meet the performance requirement. A prototype chip for 64 input samples is implemented and tested. Results show that a clock rate up to 50 MHz can be achieved using a 1.2 µm CMOS double-metal technology. Two outstanding features of this IC are: (1) any specified order of input patterns can be produced within one clock cycle; (2) each chip can handle at most 64 data and can be cascaded when the number of sorted data is over 64. Thus this IC releases the bottleneck of median search in hardware realization for many system designs, making real-time performance achievable.
1. INTRODUCTION
Image median filters are well known for being able to remove impulse noise [1], to preserve image edges, and recently to improve coding efficiency [2] as well as to enhance color signal processing [3]. Many algorithmic approaches with hardware implementation can be found in the literature [1, 4, 5, 6] and they show that the applicability of the median filtering technique is very application-dependent. Moreover, different windows and pixel locations may be required to achieve specified behavior for different target applications.

In principle, these algorithms and methods can be classified into two categories: word-level and bit-level, as discussed in [6]. In this paper, only the word-level median filters are considered since they offer the high throughput capability required in many real-time image/video systems. However, a very cost-effective hardware solution to meet this goal is often difficult to achieve and hence system
*WORK SUPPORTED BY THE NATIONAL SCIENCE COUNCIL OF TAIWAN, ROC UNDER GRANT NSC-83-0404-E009-184.
Manuscript received March 8, 1993
performance becomes degraded to allow a trade-off between hardware cost and achievable performance. For example, a fast median filter based on the bubble sorting algorithm can be found in [5]. By means of a set of processing elements (PEs), the required values can be obtained with a latency of N cycles, where N is the number of input samples. However, the hardware overhead of a parallel bubble sorter increases rapidly with the number of input samples. In addition to this sorting kernel, it is necessary to provide extra hardware in the form of a data buffer to rearrange input samples for the parallel processing, which increases the memory bandwidth. Another solution is a message passing method [8] realized on a systolic array architecture [9]. Both deletion and insertion messages pass through the systolic arrays until certain conditions are encountered. Although the hardware complexity depends on the number of input samples (N), the latency remains the same as that needed in the parallel bubble sorter. This latency of N cycles may not be acceptable when real-time performance is required.
In this paper, we present a more cost-effective hardware solution which can be integrated with other hardware without degrading overall system performance. This is achieved by reducing the latency of median search on a set of input samples so that the median can be obtained immediately without causing a stall in the data flow. For example, the median can be obtained right after the last sample is presented and then immediately passed to the next stage. In Section 2, an algorithm suitable for such an implementation is first introduced. The main idea is to partition the sorted items into two groups and then perform shift operations on one of the groups according to the operation mode. Section 3 presents a shiftable content address memory (SCAM) VLSI architecture which is a processor element (PE) based structure. Each PE contains two different cells: one (sort-cell) stores sorted items and the other (compare-cell) compares the current input sample with the sorted items so that it can be placed appropriately to reach high-speed sorting. A prototype chip based on this design approach is given in Section 4 to evaluate design efficiency. Some comparisons with available approaches are also included to highlight the performance of the SCAM VLSI architecture for sorter-based applications.
0098-3063/93 $03.00 © 1993 IEEE
Hsieh, Tsai and Lee: An Area-Efficient Median Filtering IC for Image/Video Applications
Fig. 1. Illustration of the delete and insert sorting operations: (a) insert (example: insert 20); (b) delete (example: delete 16).
2. ALGORITHM DESCRIPTION AND TRANSFORMATION
Since median search can be decomposed into two stages, (1) sorting the input samples and (2) selecting the specified order from the sorted sequence, and the former is much more complex than the latter, throughput can only be improved when the complexity of the sorting stage is overcome. We select the delete-and-insert algorithm [8] due to its low memory bandwidth requirement and its suitability for raster-scan sequences.
The Delete-and-Insert Sort Algorithm. The delete-and-insert sort algorithm is one of the many available sort algorithms. Its operations can be described as follows. For data insertion as shown in Figure 1(a), the current input sample is 20 and should be placed in between 25 and 16, which are already sorted and stored in a memory. Each time a new sample is given, it can be placed according to this scheme. After all samples are processed, the sorted sequence can be obtained from the memory. For data deletion, it is only necessary to identify the sorted item whose value is equal to the input sample's value, as shown in Figure 1(b). Thus for any running order operation, this delete operation becomes useful since only those samples out of the mask region need to be removed and new input samples need to be inserted. In addition to these features, this algorithm can also provide an ascending sequence if smaller items are placed at the left side instead of the right side as used for the descending sequence (as illustrated in Figure 1).
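As a minimal software sketch of the delete-and-insert scheme applied to a running median (a sequential illustration only, not the hardware realization discussed in this paper), the following Python function keeps a sorted window, deletes the sample that leaves the window, and inserts the new one. The function name `running_median`, the use of the standard `bisect` module, the ascending storage order, and the odd window length are assumptions made for illustration.

```python
from bisect import bisect_left, insort


def running_median(samples, window):
    """Running median over a 1-D sequence using delete-and-insert.

    Assumes `window` is odd. The sorted window is kept in ascending order.
    """
    sorted_win = []  # sorted copy of the samples currently in the window
    medians = []
    for k, x in enumerate(samples):
        if k >= window:
            # Delete: locate the sample that just left the window.
            old = samples[k - window]
            del sorted_win[bisect_left(sorted_win, old)]
        # Insert: place the new sample at its sorted position.
        insort(sorted_win, x)
        if k >= window - 1:
            # Select the specified order (here: the middle element).
            medians.append(sorted_win[window // 2])
    return medians


print(running_median([5, 200, 7, 6, 8, 0, 9], 3))  # prints [7, 7, 7, 6, 8]
```

Each step touches only the departing and arriving samples, which is the low memory bandwidth property noted above; the list deletion and insertion, however, still move many items one at a time in software.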
This algorithm description implies that a lot of comparisons are needed in order to locate the position at which the input sample can be correctly placed or removed. Moreover, a lot of move operations are needed once the input sample is inserted or deleted. For real-time image/video processing, either the insert or the delete operation has to be done within a very stringent timing constraint. Thus if this algorithm is implemented on a general-purpose computer, the result is obviously not suitable for real-time applications and hence the algorithm has to be modified. A modified version, called the optimized delete-and-insert (ODI) sort algorithm, is thus developed. It fully exploits parallelism so that a very high throughput can be obtained by exploiting VLSI advantages such as speed and density.
The ODI Sort Algorithm. As described above, the bottleneck in enhancing the delete-and-insert sorter's throughput lies in two factors: (1) identification of the insert/delete target and (2) data movement among sorted items. The complexity of these two factors becomes higher as the number of input samples increases. However, if we fully explore the parallelism inherent in the algorithm, then this bottleneck can be overcome. Here it is found that for each input sample, N comparisons are needed in order to find the position where the input sample should be placed or removed. This parallel-comparison process can be realized on a content addressable memory (CAM) to identify the target to be inserted or deleted. Once the target is identified, the next step is to perform data movement on part or all of the sorted items. For example, Figure 1(a) illustrates the insertion of a data item whose value is 20. Those sorted items whose value is less than or equal to the input item have to move one position right; while in data deletion, those sorted items whose value is less than or equal to the input item have to move one position left, as shown in Figure 1(b). In other words, the sorted items are divided into two groups: one is the LE group (i.e. sorted items less than or equal to the input sample) and the other is the GT group (i.e. sorted items greater than the input sample). Thus the data movement can be replaced by shift operations working on the LE group instead of read/write operations on memory and, in particular, these shift operations can be executed in parallel. Thus this modified ODI algorithm operates on all sorted items simultaneously and generates any order if both concepts of content-addressable memory and shift registers are used.
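The LE/GT partition and the shift on the LE group can be sketched in software as follows. This is an illustrative model under stated assumptions, not the authors' circuit: the function name `odi_step` is hypothetical, the Python loop stands in for what all positions would do simultaneously in hardware, the vacated rightmost position is assumed to be filled with zero on deletion, and an insertion into a full memory is assumed to drop the smallest item.

```python
def odi_step(sorted_desc, sample, delete=False):
    """One insert or delete step on a descending-sorted memory.

    Every position computes its next value from the OLD contents only,
    mimicking a parallel update of all storage positions.
    """
    n = len(sorted_desc)
    # Parallel comparison: True marks the LE group (item <= sample).
    le = [v <= sample for v in sorted_desc]
    nxt = list(sorted_desc)  # GT group keeps its contents
    for i in range(n):
        if not le[i]:
            continue
        if delete:
            # LE group moves one position left (assumed zero fill at the end).
            nxt[i] = sorted_desc[i + 1] if i + 1 < n else 0
        elif i == 0 or not le[i - 1]:
            # Boundary between GT and LE groups: the sample is placed here.
            nxt[i] = sample
        else:
            # Remaining LE items move one position right.
            nxt[i] = sorted_desc[i - 1]
    return nxt


mem = [25, 16, 9, 0, 0]
mem = odi_step(mem, 20)               # insertion as in Figure 1(a)
print(mem)                            # prints [25, 20, 16, 9, 0]
mem = odi_step(mem, 16, delete=True)  # deletion as in Figure 1(b)
print(mem)                            # prints [25, 20, 9, 0, 0]
```

Because each new value depends only on the old value of a neighbour or on the sample itself, no position has to wait for another, which is what allows a single-step update.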
3. ARCHITECTURE AND CIRCUIT DESIGN ISSUES
In the previous section, we discussed how the ODI algorithm resolves the bottleneck and speeds up throughput by means of the support of content addressable memory and shift registers. In this section, we discuss in more detail how the ODI algorithm is mapped onto the shiftable content-address memory (SCAM) architecture, which is very suitable for VLSI implementation.
The PE-Based Structure. Suppose that there are N samples to be sorted; we can obtain any specified order right after the last sample is given. This is the specification of our high-speed median filter design. We also assume that the sorted sequence is in descending order. Thus N storage spaces are needed to store the sorted samples. Initially all contents of the SCAM are reset to zero. For each input sample, the content of each storage space is read out and compared with the input sample in order to identify the position where the input sample should be placed. Once the position is located, the content of each storage space in the LE group has to be shifted. This implies that the architecture consists of N processor elements (PEs), each of which contains two basic cells: (1) a sort-cell and (2) a compare-cell (as shown in Figure 2). The former is a shift register containing the sorted data and can be shifted right or left, while the latter is designed to provide control signals orchestrating how the former should be operated.

Fig. 2. The shiftable content address memory is realized on a PE structure.

Fig. 3. Circuit diagram of the sort-cell. Note that a weak inverter is placed at INV2 to overcome leakage paths.
Detailed Architecture and Circuit Design for the Sort Cell. From the previous discussion, it can be found that the sort-cell should have the following functionalities:
1. shift data right,
2. shift data left,
3. load data,
4. data remains unchanged,
5. reset content.
Here items 1, 3, and 4 are required in insertion operations, while items 2 and 4 are needed in deletion operations. Item 5 is only used when a new input set is to be processed. With these defined functionalities, the corresponding circuit for such a cell can be easily derived as shown in Figure 3. Only 3 inverters and 6 pass transistors are needed for each bit cell. The first inverter acts as an internal buffer for data pre-shifting to enhance the clock rate (see the last paragraph of this section for more details). Note that a two-phase non-overlapping clocking strategy is used here and all control gates function at φ2.
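The five functionalities can be summarized as a next-state function of a single sort-cell. The following is a behavioral sketch only and says nothing about the transistor-level circuit of Figure 3; the names `Op` and `sort_cell_next` are hypothetical, and "shift right" is taken to mean that a cell receives the value of its left neighbour.

```python
from enum import Enum


class Op(Enum):
    SHIFT_RIGHT = 1  # take the value held by the left neighbour
    SHIFT_LEFT = 2   # take the value held by the right neighbour
    LOAD = 3         # take the current input sample
    HOLD = 4         # data remains unchanged
    RESET = 5        # clear the content to zero


def sort_cell_next(op, stored, left, right, sample):
    """Return the value a sort-cell holds after one clock cycle."""
    if op is Op.SHIFT_RIGHT:
        return left
    if op is Op.SHIFT_LEFT:
        return right
    if op is Op.LOAD:
        return sample
    if op is Op.RESET:
        return 0
    return stored  # Op.HOLD


# A cell holding 16, between neighbours 25 (left) and 9 (right), sample 20:
print(sort_cell_next(Op.SHIFT_RIGHT, 16, 25, 9, 20))  # prints 25
print(sort_cell_next(Op.LOAD, 16, 25, 9, 20))         # prints 20
```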
Detailed Architecture and Circuit Design for the Compare Cell. This cell is designed to generate all control signals required by the sort cell. Different control signals are generated when the following conditions are encountered:
• shift data right (shr): This occurs when data insertion is under execution. However, only those sorted items whose value is less than or equal to that of the current input sample (i.e. those belonging to the LE group as defined above) will be shifted right, and hence the carry (C_i in the i-th compare-cell) is needed. Thus shr is activated when shc is low (for insertion) and C_i is high.
• shift data left (shl): shl is activated when data deletion is performed (i.e. shc is high) and C_i is high.
• load data (load): Since only one PE requires this signal, i.e. the input sample can only be placed at one correct position, the generation of the load signal is more complex than that of the previous two control signals. Not only shc and C_i are considered, but the carry from the previous stage (C_{i-1}) should also be taken into account. This is because the correct storage space for the current input sample lies between the items in the GT group (those greater than the input sample, i.e. C_{i-1}=0) and the LE group (those less than or equal to the input sample, i.e. C_i=1).
To speed up the carry generation, a carry-look-ahead technique [10] is exploited. The overall circuit diagram for this compare-cell is shown in Figure 4.
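The conditions listed above can be written out as boolean expressions per PE. The sketch below is a behavioral illustration, not the circuit of Figure 4. The function name `control_signals` is hypothetical, and two points are assumptions not stated in the text: the carry into the first PE is taken as 0, and load is given priority over shr at the PE on the GT/LE boundary so that exactly one signal is active per PE.

```python
def control_signals(stored, sample, shc):
    """Return one (shr, shl, load) tuple per PE.

    stored: descending-sorted contents of the sort-cells.
    shc:    0 for insertion, 1 for deletion.
    """
    # Carry C_i is high when the stored item belongs to the LE group.
    carries = [1 if v <= sample else 0 for v in stored]
    signals = []
    prev = 0  # assumed carry C_{i-1} seen by the first PE
    for c in carries:
        # load: insertion, C_i = 1 and C_{i-1} = 0 (the GT/LE boundary).
        load = (not shc) and c == 1 and prev == 0
        # shr: insertion and C_i = 1; load is assumed to take priority.
        shr = (not shc) and c == 1 and not load
        # shl: deletion and C_i = 1.
        shl = bool(shc) and c == 1
        signals.append((shr, shl, load))
        prev = c
    return signals


for pe, sig in enumerate(control_signals([30, 25, 16, 9, 0], 20, 0)):
    print(pe, sig)
```

For the insertion of 20 into the contents 30, 25, 16, 9, 0, this model activates load at the PE holding 16 and shr at the two PEs to its right, while the PEs holding 30 and 25 receive no active signal and keep their data.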
Operation and Clocking Strategy. To speed up the clock rate, a 2-phase non-overlapping clock is exploited here. All the control signals are generated at φ1 and data shift operations are performed at φ2. In addition, a pre-shift strategy is also exploited to improve clock speed. This strategy shifts sorted items to the buffer (the first inverter INV1 of the sort-cell) of the next PE during φ1, where they are then conditionally stored at φ2. The detailed operation for data deletion on the sort-cell is illustrated in Figure 5. Note that the first inverter of each sort cell acts as a buffer during pre-shifting, as needed for both shift right and shift left operations.

4. EVALUATION AND DISCUSSIONS
To evaluate this SCAM architecture, a prototype chip for a maximum of 64 input samples has been fabricated and tested. In this section, we first present some experimental and test results for this ODI chip and then provide some comparisons with the two approaches described in the Introduction.
Fig. 4. Circuit diagram of the compare-cell. The carry is generated by the carry-look-ahead technique to enhance clock speed.

Design of the ODI Chip. A block diagram of this chip is given in Figure 6. This chip can be used either as a high-speed data sorter or as an image median filter. Input samples are first sorted through the shiftable content address memory. Then the specified order, such as the median, can be identified from the sorted items by means of a dynamic selection circuit. For raster scan images, this chip provides an optimal solution since the required order can be obtained right after the last sample is given. It should be noted that, using the pre-shift strategy, both phases (φ1 and φ2) are quite balanced. Experimental results show that a clock rate up to 50 MHz can be achieved. A die photo of this chip is shown in Figure 7. Some key features of this ODI chip are given below:
• Each chip can handle 64 samples and can be cascaded for more samples when data sorting is considered.
• Any specified order can be obtained right after the last sample is given, i.e. without latency. This is very useful for pipelined architecture design.
• Running order can be handled by exploiting both the deletion and insertion operations.
• Design is very regular and hardware complexity is linearly proportional to the number of input samples.

Detailed specifications of this chip are given below:
• Die size: 5053 µm x 4774 µm;
• Pin count: 67;
• Transistor count: N-type: 13060, P-type: 10141, total: 23201;
• Clock rate: 50 MHz;
• Technology: 1.2 µm CMOS double-metal technology.

Fig. 5. Delete data flow on the sort-cell: (a) pre-shift stage at φ1 and (b) data shift at φ2. Note that the dark lines indicate how the circuit works at each phase.
Comparisons with Other Approaches. We only compare the results with the bit-parallel approaches given in the Introduction since they are more practical for real-time image/video applications.

The parallel bubble sorter described in [5] requires a lot of hardware as the number of input samples increases. In addition, its input format has to be adjusted so that a set of input samples can be fed simultaneously to the sorter arrays. This implies that a large memory bandwidth is needed. Furthermore, given a set of N input samples, the required order can only be obtained after N cycles. Our ODI chip therefore outperforms this bubble sorter in terms of area and latency.

The ROS sorter from [8], based on a systolic architecture, offers hardware complexity similar to that of our design. However, our design offers higher throughput and expansibility.

In addition to the specific features of less area and no data latency, our design also provides a testable strategy since test patterns can easily be applied to the SCAM and detected from the selection circuit or from both shift output ports.
5. CONCLUSION
In this paper, we have presented a high-throughput median filter design based on the ODI algorithm and the SCAM architecture. This chip is obtained by first exploring the inherent parallelism and then exploiting shift operations to achieve high-speed sorting so that any order of input samples can be obtained. Using the shiftable content address memory architecture, we have developed a very cost-effective hardware solution for median search as well as for data sorting. Final test results show that the ODI chip outperforms available approaches for median search in both area and throughput. This area-efficient solution can also be integrated with other hardware when median search is demanded in system development. We are currently looking into the applicability of this high-speed sorting architecture to adaptive coding, which is often used in entropy coding to enhance compression ratio.

Fig. 6. Block diagram of the ODI chip.

Fig. 7. Plot of the ODI chip design.
Acknowledgement: The authors would like to thank their colleagues within the VLSI/CAD group of NCTU for many fruitful discussions. Also the MPC support from the Chip Implementation Center (CIC) of NSC is acknowledged.
REFERENCES
[1] A.K. Jain, “Fundamentals of Digital Image Processing”, Prentice-Hall, 1989.
[2] D.H. Kang, J.H. Choi, Y.H. Lee, and C. Lee, “Applications of a DPCM System with Median Predictors for Image Coding”, IEEE Trans. on Consumer Electronics, Vol. 38, No. 3, pp. 429-435, Aug. 1992.
[3] H. Rantanen, M. Karlsson, P. Pohjala, and S. Kalli, “Color Video Signal Processing with Median Filters”, IEEE Trans. on Consumer Electronics, Vol. 38, No. 3, pp. 157-161, Aug. 1992.
[4] H.M. Lin and A.N. Willson, Jr., “Median Filters with Adaptive Length”, IEEE Trans. on CAS, Vol. 35, No. 6, pp. 675-690, June 1988.
[5] J. Offen and R. Raymond, “VLSI Image Processing”, McGraw-Hill, 1985.
[6] C.L. Lee and C.W. Jen, “Bit-sliced Median Filter Design Based on a Majority Gate”, IEE Proc.-G, Vol. 139, No. 1, pp. 63-71, Feb. 1992.
[7] V.V. Bapeswara Rao and K. Sankara Rao, “A New Algorithm for Real-Time Median Filtering”, IEEE Trans. on ASSP, Vol. ASSP-34, No. 6, pp. 1674-1675, Dec. 1986.
[8] A.L. Fisher, “Systolic Algorithms for Running Order Statistics”, in Signal and Image Processing, Dept. of Computer Science, Carnegie Mellon University, Pittsburgh, July 1981.
[9] H.T. Kung, “Why Systolic Architectures?”, IEEE Computer, Vol. 15, No. 1, Jan. 1982.
[10] N. Weste and K. Eshraghian, “Principles of CMOS VLSI Design: A Systems Perspective”, Addison-