
Chapter 3 Line-based Adaptive Lossless Video Compression

3.4 Case Study: Hardware Architecture

3.4.3 Context model simplification

Figure 24. Context merging by symmetric property

In JPEG-LS, the number of context model states can be halved by exploiting symmetry. The software implementation of this symmetric merging is illustrated in Figure 24. The method needs two stages of table look-up (LUT). In Figure 25, the first LUT performs the quantization mapping: G1 and G2 are each quantized into 9 levels (-4 to 4). The second LUT performs the symmetric merging. The first LUT maps 512 entries to 81 and the second maps 81 to 41, which suits software implementation because LUT access is fast. For hardware implementation, a simple combinational circuit can calculate the final entry address instead.

Both LUTs fit within the memory requirements.
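The two mapping stages described above can be sketched arithmetically, without tables. This is only an illustration under stated assumptions: the thresholds are the JPEG-LS defaults for 8-bit samples, and the function names and the address layout (0..40) are hypothetical, not the layout of Figure 25.

```python
from typing import Tuple

# Assumption: JPEG-LS default thresholds for 8-bit samples.
T1, T2, T3 = 3, 7, 21


def quantize_gradient(d: int) -> int:
    """Stage 1: map a local gradient to one of 9 levels, -4..4."""
    sign = -1 if d < 0 else 1
    m = abs(d)
    if m == 0:
        level = 0
    elif m < T1:
        level = 1
    elif m < T2:
        level = 2
    elif m < T3:
        level = 3
    else:
        level = 4
    return sign * level


def merge_context(q1: int, q2: int) -> Tuple[int, int]:
    """Stage 2: fold (q1, q2) and (-q1, -q2) onto one entry.

    Returns (address, sign). The address is in 0..40 (hypothetical layout);
    sign = -1 means the prediction error is negated before coding.
    """
    sign = 1
    if q1 < 0 or (q1 == 0 and q2 < 0):
        q1, q2, sign = -q1, -q2, -1
    if q1 == 0:
        return q2, sign                        # addresses 0..4
    return 5 + (q1 - 1) * 9 + (q2 + 4), sign   # addresses 5..40


def context_address(g1: int, g2: int) -> Tuple[int, int]:
    """Quantize both gradients, then merge symmetric contexts."""
    return merge_context(quantize_gradient(g1), quantize_gradient(g2))


# The 81 quantized pairs collapse onto 41 distinct addresses.
addresses = {merge_context(a, b)[0] for a in range(-4, 5) for b in range(-4, 5)}
print(len(addresses))
```

Mirrored gradient pairs such as (10, -2) and (-10, 2) share one address and differ only in the returned sign, which is the property that halves the context memory.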

Figure 25. Context model entry address

3.4.4 Summary

For the hardware design, we have given the architecture of the temporal prediction module, which performs the mode decision and produces the temporal residuals. The modified module implementation reduces the memory requirement. The overall system has a simple structure and an efficient circuit area, which is beneficial for real-time system implementation.

Chapter 4 Experimental Results

4.1 Performance & Results

4.1.1 Optimized Performance Evaluation

The mode decision of LALVC plays an important role in choosing the better of the two modes, the difference (Diff) mode and the raw data mode. When the Diff mode is applied, the zero-motion residuals are fed into the line-based encoder.

To evaluate the best performance of the zero-motion architecture, a near-optimum case is observed. We first encode each line in both the difference mode and the raw data mode, and then apply the better mode for the actual encoding of the current line. Because of the statistical properties of the context model, this method, called Opt_1_line, cannot reach near-optimum performance: each mode decision alters the context statistics, which changes the coding states of the following lines. For some sequences, even the forced zero-motion method achieves a better compression ratio.

Figure 26. Compression ratios for each frame of the sequence “Foreman” (curves: forced zero motion, Intra_Coding, Opt_1_line)

In Figure 26, the method “Opt_1_line” means that each line of the Foreman sequence is encoded three times: two pre-encoding passes choose the better of the difference and raw data modes, and a third pass performs the actual encoding. Figure 26 shows the compression ratio for each frame.

Opt_1_line cannot maintain the highest performance in comparison with intra-frame coding and enforced zero-motion prediction coding. Figure 27 shows the compression of the 86th frame to examine the performance of line-based compression.

In Figure 27, Opt_1_line performs better than forced zero-motion at the beginning, but its curve shows a decreasing ratio. To test the Opt_1_line algorithm, we force the earlier mode selections to follow enforced zero-motion and let Opt_1_line operate on the later part. This curve is labeled “Test” in Figure 27. Figure 27 shows that Opt_1_line cannot find the globally best mode selection. The following extends Opt_1_line to find the near-upper-bound performance under the low-complexity architecture.

Figure 27. Line-level compression detail for the 86th frame of “Foreman” (x-axis: line_index; y-axis: compression ratio; curves: Opt_1_line, forced zero-motion, Test)

We now observe the effect of extending the selection window in line mode. Opt_1_line, introduced in the previous section, is not aware of the optimum case for the image. Context model statistics accumulate during coding. Under the structure of LALVC, the context is empty at the beginning, so each decision changes the overall coding performance. Opt_1_line can be viewed as a pre-coding test with a one-line window. Now we extend the window to more than one line. When the window is extended to the whole frame, the performance is the best. For a CIF sequence (352*288), the number of lines is 288. In the worst case, all lines are used for the mode selection, and with a frame-sized window each line is encoded up to 2²⁸⁸ times.

To simplify the derivation of the optimal case, we set the window size to 16 lines, so each line is encoded at most 2¹⁶ times. On a desktop platform (Pentium 4 2.0 GHz, Windows XP), encoding each frame takes 20 minutes, which does not meet practical requirements, so some complexity reduction is needed.
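The windowed search can be illustrated with a small sketch. Everything here is a stand-in chosen for illustration: the cost model is a toy adaptive Rice-code length rather than the LALVC coder, the windows are assumed to be non-overlapping, and the names `select_modes`, `line_cost`, `candidates` and `remap` are hypothetical. The point is only that the coder state carries over between lines, so a one-line window (as in Opt_1_line) can commit to a locally good but globally poor choice.

```python
from itertools import product

DIFF, RAW = 0, 1


def remap(e):
    """Fold a signed residual into a non-negative integer (assumed mapping)."""
    return 2 * e if e >= 0 else -2 * e - 1


def candidates(cur_line, prev_line):
    """Values to code for one line in each mode, indexed by DIFF / RAW."""
    diff = [remap(c - p) for c, p in zip(cur_line, prev_line)]
    return diff, list(cur_line)


def line_cost(values, state):
    """Toy adaptive cost: bits of a Rice code whose parameter k follows the
    running mean of everything coded so far. state is (sum, count)."""
    total, count = state
    bits = 0
    for v in values:
        k = 0
        while (count << k) < total:
            k += 1
        bits += (v >> k) + 1 + k      # unary quotient, stop bit, k low bits
        total, count = total + v, count + 1
    return bits, (total, count)


def select_modes(cur, prev, window):
    """Try all 2**window mode combinations per window and keep the cheapest."""
    state = (0, 1)
    modes, total_bits = [], 0
    for start in range(0, len(cur), window):
        rows = range(start, min(start + window, len(cur)))
        best = None
        for choice in product((DIFF, RAW), repeat=len(rows)):
            trial_state, bits = state, 0
            for row, mode in zip(rows, choice):
                values = candidates(cur[row], prev[row])[mode]
                cost, trial_state = line_cost(values, trial_state)
                bits += cost
            if best is None or bits < best[0]:
                best = (bits, choice, trial_state)
        total_bits += best[0]
        modes.extend(best[1])
        state = best[2]               # statistics carry over to the next window
    return modes, total_bits


prev = [[10, 10, 10, 10], [12, 12, 12, 12], [200, 10, 200, 10], [50, 60, 70, 80]]
cur = [[10, 11, 10, 9], [90, 3, 250, 7], [200, 10, 200, 10], [0, 1, 0, 2]]
print(select_modes(cur, prev, 1))   # one-line window
print(select_modes(cur, prev, 4))   # window covering the whole toy frame
```

A window covering the whole frame can never cost more bits than the one-line window, because the one-line result is one of the combinations it enumerates. The price is the 2**window trial encodings per window, which is 65536 for 16 lines.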

The following results are derived from simulations with a window of 16 lines. For the slow-motion sequence Akiyo, the compression ratio is shown in Table 14; the Opt_16_line method gives the same result as the forced zero-motion method.

Table 14 lists the compression results of the three methods. Overall, Opt_16_line achieves the best performance on average. In addition to Table 14, we analyze frame-level behavior to evaluate the overall performance.

(1). Akiyo is a typical slow-motion sequence. In Figure 29, OPT_16 is identical to the enforced zero-motion method. Zero-motion residuals are better for removing coding redundancy.

Figure 28. The diagram of optimum performance evaluation.

Table 14. The list of compression ratios

Sequence                   Intra_coding   Zero_motion   Opt_16_line
Akiyo.Y (300 frames)       2.70           6.76          6.76
Bus.Y (150 frames)         1.63           1.44          1.64
Football.Y (250 frames)    1.99           1.75          2.03
Foreman.Y (300 frames)     1.97           1.84          2.03
Mobile.Y (300 frames)      1.43           1.36          1.43
Silent.Y (300 frames)      1.86           2.61          2.64

(2). Foreman is a sequence with more motion than Akiyo and Silent. In Figure 30, Opt_16_line lies at the top of the three curves, which indicates that a window size of 16 lines can capture the near-optimum case for the Foreman sequence.

(3). Mobile is a sequence with a moving train and zooming in and out. In Figure 31, Opt_16_line is not on top for several frames, which means a larger window size is needed to capture the best case.

Figure 29. Compression ratio for each frame of the sequence “Akiyo” (curves: Intra Ratio, ZeroMotion Ratio, OPT_16)

Figure 30. Compression ratio for each frame of the sequence “Foreman” (curves: Intra Ratio, ZeroMotion Ratio, OPT_16)

Figure 31. Compression ratio for each frame of the sequence “Mobile” (curves: Intra Ratio, ZeroMotion Ratio, OPT_16)

In the frame-level statistics (Figure 26), intra-frame coding is the best for the 86th frame. In Figure 27, Opt_1_line shows the best gain at the beginning. After the 10th line, intra-frame coding becomes better, and ultimately it achieves the better performance. The coding of each pixel alters the context model, which affects the coding of subsequent pixels. To address this issue, we examine two factors to observe the performance variation.

1. Context Model Interference:

The energy of the residuals generated from the difference mode is smaller, so the context model for the difference mode should be distinguished from that for the raw data mode. Hence, the following simulation compares the performance of one model against two separate models.

2. Context Model Continuity:

Context models are derived from the statistical properties of the processed content. If the model predicts very well, the probability of zero residuals increases, which indicates that the OGDS (one-sided geometric distribution) becomes steeper.

Originally, the reset interval is set to one frame. If successive frames have some stationary property, enlarging the reset interval may benefit the prediction performance. The remaining issue is deciding when to reset.
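The link between the probability of zero residuals and the steepness of the distribution can be made explicit with the standard form of a one-sided geometric distribution over the mapped (non-negative) residuals. The symbols ε and θ are introduced here only for illustration and are not the thesis's notation.

```latex
P(\varepsilon) = (1-\theta)\,\theta^{\varepsilon},
\qquad \varepsilon = 0, 1, 2, \dots, \quad 0 < \theta < 1,
\qquad P(0) = 1-\theta,
\qquad \frac{P(\varepsilon+1)}{P(\varepsilon)} = \theta .
```

A larger P(0) therefore corresponds to a smaller θ, so each successive residual value is less likely by a larger factor and the distribution decays faster.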

For the first issue, context model interference, we investigate the choice between one and two context models.

Table 15. 2-Model and 1-Model performance comparison

Size (Kilo-Bytes) Compression ratios Sequences Opt_16_line

In Table 15, separate context models do not improve performance significantly. When the mode decision chooses the same modes as Opt_16_line, the 1-Model approach has the same compression ratios as the 2-Model approach. For low complexity, the 1-Model approach is used instead of the 2-Model approach.

Table 16. Performance comparison for the number of context model states

            Size (Kilo-Bytes)                 Compression ratios
Sequence    2M3G    2M2G    1M3G    1M2G      2M3G     2M2G     1M3G     1M2G
Akiyo       4456    4671    4457    4675      6.665    6.358    6.664    6.353
bus         9093    9085    9091    9085      1.633    1.635    1.633    1.635
Football    12222   12423   12249   12438     2.025    1.992    2.021    1.990
Foreman     14918   15098   14939   15152     1.991    1.967    1.988    1.960
Mobile      20900   21070   20958   21168     1.421    1.410    1.417    1.403
Silent      11330   11740   11330   11748     2.621    2.530    2.621    2.528
OutR        495     477     492     476       20.000   20.755   20.122   20.798
OutG        460     447     460     447       21.522   22.148   21.522   22.148
OutB        417     406     416     407       23.741   24.384   23.798   24.324
Total       74291   75417   74392   75596     2.532    2.494    2.528    2.488

*Bold number represents the best in the row of table

In Table 16, “2M3G” means that two models and three gradients are applied in the algorithm; the other labels are read the same way. The best overall performance is achieved by 2M3G. Table 15 shows that the 2-Model approach performs close to the 1-Model approach, which has slightly lower coding efficiency because of the mismatch between the mode decision and Opt_16_line.
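The two halves of Table 16 are related by simple arithmetic, which the following sketch reproduces for three rows of the 2M3G column. It assumes that “Kilo-Bytes” means 1024 bytes, that only the 8-bit Y plane of each CIF frame is counted, and that the frame counts are those listed in Table 14; the function names are hypothetical.

```python
CIF_WIDTH, CIF_HEIGHT = 352, 288
BYTES_PER_KB = 1024  # assumption: "Kilo-Bytes" means 1024 bytes


def raw_size_kb(frames, width=CIF_WIDTH, height=CIF_HEIGHT):
    """Uncompressed size of an 8-bit single-plane sequence, in KB."""
    return frames * width * height / BYTES_PER_KB


def compression_ratio(frames, compressed_kb):
    """Ratio of uncompressed size to compressed size."""
    return raw_size_kb(frames) / compressed_kb


# (sequence, frames from Table 14, 2M3G size in KB from Table 16)
rows = [("Akiyo", 300, 4456), ("bus", 150, 9093), ("Football", 250, 12222)]
for name, frames, size_kb in rows:
    print(name, round(compression_ratio(frames, size_kb), 3))
# prints 6.665, 1.633 and 2.025, matching the 2M3G ratio column
```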

The Y component of the D1-resolution sequences has 720*480 pixels. The four testing sequences are full of motion. We divide the original sequence into four small sequences by cutting the spatial area. In the high-resolution sequence, the performance improvement is not significant. We then divide the sequence into sub-resolution sequences of reduced width to evaluate the effect of the line length. In Figure 15, the performance difference between the sub-resolution settings is minor. Cutting D1 into 360*480 sequences gives a minor improvement of 0.006 in the ratio. The side information for tags increases as more segments are used. At a resolution of 240*480, the D1 sequence is cut into three small ones, and the average performance may not be better than at 360*480.

Table 17. Simulation of D1 sequence (720*480), compression ratios

Sequence               WinRAR   WinZip   LALVC(1M3G)   Intra   ZeroMotion
crew(300)              2.047    1.456    2.347         2.348   1.980
harbour(300)           1.494    1.162    1.991         1.991   1.869
night(230)             1.858    1.396    2.220         2.194   2.023
pour_water(1017)       2.279    1.583    2.626         2.608   2.298
rolling_tomato(222)    2.549    1.705    2.883         2.845   2.616
sailormen(300)         1.806    1.310    1.938         1.949   1.683
Total                  2.024    1.452    2.366         2.357   2.094

Table 18. Simulation of sub-resolution sequences cut from D1

Sequence               720*480   (360*480)*2   (240*480)*3
crew(300)              2.347     2.345         2.342
harbour(300)           1.991     1.992         1.990
night(230)             2.220     2.220         2.217
pour_water(1017)       2.626     2.644         2.646
rolling_tomato(222)    2.883     2.891         2.894
sailormen(300)         1.938     1.932         1.924
Total                  2.366     2.372         2.370

4.1.2 Comparison of Tool Performance

LALVC has four modes: Skip, DC, Diff and Raw. We compare the gains of the various tools; the results are shown in Figure 32, Figure 33 and Figure 34. In Figure 32, the Diff mode improves the two slow-motion sequences. In Figure 33, the mode decision selects the raw data for encoding fast-motion sequences; consequently, the Skip and Diff modes show no improvement for these sequences. In Figure 34, the Skip mode gives the best performance for computer-generated sequences, in which most of the screen area stays still across successive frames.

Figure 32. Tool performance of low motion sequences

Figure 33. Tool performance of fast motion sequences

Figure 34. Tool performance of computer generated sequences

4.1.3 Execution Speed Evaluation

4.1.3.1 Profile analysis

Figure 35. Profiling of LALVC encoding modules

In LALVC, the major difference in computational load between the encoder and the decoder is the inter-process, which includes mode decision and residual production. We profile the software and show the results in Figure 35. The platform is a P4 2.0 GHz desktop running Windows XP, and the test sequence is Foreman. In the profiling analysis, the inter-process occupies 5% of the execution time. To evaluate whether the codec is symmetric, we run simulations to collect run-time statistics.

4.1.3.2 Encoding and decoding rates

Table 19. Coding speed for various natural sequences

            Time (sec)          Coding rate (fps)
Sequence    Encode   Decode     Encode    Decode
Akiyo       2.78     2.23       107.87    134.29
Bus         2.97     2.28       50.54     65.76
Football    4.86     3.59       51.45     69.56
Foreman     6.02     4.5        49.87     66.67
Mobile      6.31     5.73       47.53     52.32
Silent      5.97     4.05       50.27     74.11
Average                         55.36     71.46

In actual encoding, the encoding rate is 55.4 fps on average for natural sequences. The decoding rate is 71 fps, that is, about 16 more frames per second than encoding. The detailed results are shown in Table 19. Computer-generated sequences, on the other hand, are processed faster than natural sequences. In Table 20, the encoding rate is 133 fps and the decoding rate reaches 584 fps.
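As a cross-check of the averages, the sketch below recomputes the rates from the run times in Table 19 and the frame counts in Table 14. It assumes the “Average” row is total frames divided by total time; this reading is an inference from the numbers, not something the text states, and small deviations are expected because the tabulated times are rounded.

```python
# sequence: (frames from Table 14, encode seconds, decode seconds from Table 19)
RUNS = {
    "Akiyo": (300, 2.78, 2.23),
    "Bus": (150, 2.97, 2.28),
    "Football": (250, 4.86, 3.59),
    "Foreman": (300, 6.02, 4.5),
    "Mobile": (300, 6.31, 5.73),
    "Silent": (300, 5.97, 4.05),
}


def fps(frames, seconds):
    """Coding rate in frames per second."""
    return frames / seconds


total_frames = sum(frames for frames, _, _ in RUNS.values())
encode_avg = fps(total_frames, sum(enc for _, enc, _ in RUNS.values()))
decode_avg = fps(total_frames, sum(dec for _, _, dec in RUNS.values()))

print(round(encode_avg, 1), round(decode_avg, 1))
```

Under this assumption the results land within about 0.05 fps of the tabulated 55.36 and 71.46, whereas a plain mean of the per-sequence encode rates would be close to 59.6 fps.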

Table 20. Coding speed of computer generated sequences

            Time (sec)          Coding rate (fps)
Sequence    Encode   Decode     Encode    Decode
OutR        0.75     0.17       133.33    584.8
OutG        0.77     0.17       130.72    584.8
OutB        0.74     0.17       136.05    584.8
Average                         133.33    584.8

4.1.3.3 Complexity Analysis

LALVC has low complexity, which makes it suitable for both software implementation and hardwired circuit design. For the preprocessing, if the frame resolution is N*N, LALVC needs (7N²-6N) adders to perform the mode decision and produce the zero-motion prediction residuals, as in Figure 13. In addition, we reduce the context model from three gradients to two, which retains comparable performance with a smaller memory requirement. The fixed predictor and entropy coder have been shown to have low complexity in JPEG-LS. The execution time of the LALVC algorithm is measured on a P4 2.0 GHz desktop running Windows XP. The results show that the encoding rate is about 55 frames per second (fps) for natural CIF sequences and nearly 133 fps for computer-generated CIF sequences. For the LALVC decoder, the average decoding rates are 70 fps and 584 fps for natural and synthetic video bitstreams, respectively.

In [11], the mode decision needs side information for each block (5 by 5 pixels) of data. The MSE is calculated for the spatial, spectral and temporal predictors, and the most suitable one is chosen. This MSE computation is more expensive than the LALVC mode decision.

4.1.4 Hardware Implementation Result

The simple image coder JPEG-LS, which includes a fixed predictor and a context model, has been implemented in hardware [14]. That implementation reveals that the chip area is dominated by the SRAM rather than by the circuits of the function blocks. The memory requirement is discussed in the previous section. The following hardware implementation does not include the actual SRAM: the gate count statistics in Table 21 cover only the function blocks used in LALVC. The flow of functions is shown in Figure 36 and the gate counts are listed in Table 21.

Figure 36. Flow of function blocks in Hardware

Table 21. Gate count statistics of hardware modules (0.18 µm)

Module               Area         Gate count
Regular Mode         13497.5      1377.30
Run Mode             88205.24     9000.53
Mode Switch (5ns)    678.8        69.27
Temporal (5ns)       10751.4      1097.08
Cont                 3338         340.61
Shift_Reg (5ns)      7882.3       804.32
Total                116470.94    11884.79

4.2 Summary

Simulation results of LALVC, including compression ratio and execution speed, are given here. To evaluate LALVC performance, we use both natural and computer-generated video sequences. For natural video, we use the same test sequences as the MPEG standards. The natural video sequences are in CIF (352x288) and YCbCr 4:2:0 format, with 8 bits per pixel. Only the Y component is used for simulation. The synthetic video sequences, which include scrolling web pages and changing window sizes, are captured from the computer screen.

Table 22. Compression ratio list*

Sequence      WinRAR   JPEG-LS   CALIC   LALVC   Opt_16_line
Akiyo_Y       5.85     6.74      3.43    6.35    6.76
Bus_Y         1.29     1.44      1.09    1.64    1.64
Football_Y    1.67     1.75      1.28    1.99    2.03
Foreman_Y     1.80     1.84      1.26    1.96    2.03
Mobile_Y      1.32     1.36      1.07    1.40    1.43
Silent_Y      2.52     2.61      1.49    2.53    2.64
Browser_R     30.28    12.84     18.20   20.80   20.63
Browser_G     29.73    13.69     18.64   22.15   22.15
Browser_B     29.46    15.07     19.16   24.32   24.50

*The sequences are preprocessed with enforced zero-motion prediction.

The synthetic video sequences are in CIF resolution and RGB format. The compression ratios are compared in Table 22 for the first six (natural) sequences and the last three (synthetic) sequences. Among lossless compression algorithms, results for JPEG-LS, CALIC and WinRAR are provided. WinRAR is commercial software for generic data compression. The performance of the context modeling, intra predictor and entropy coder in LALVC is compared with that of WinRAR, CALIC and JPEG-LS. Prior to encoding, each input sequence is transformed by zero-motion prediction and sign remapping to form a new sequence of unsigned prediction residuals. Table 22 gives the comparison based on the total number of bits used to represent the new sequences. LALVC has the best average performance on the natural sequences. On the computer-generated sequences, LALVC performs better than JPEG-LS and CALIC on average; only WinRAR outperforms it.
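The preprocessing applied before the comparison can be sketched as follows. The text does not specify the sign remapping, so the interleaved mapping used here (0, -1, 1, -2, ... to 0, 1, 2, 3, ...) is an assumption borrowed from the usual JPEG-LS-style error mapping, and all function names are hypothetical.

```python
def remap_signed(e):
    """Assumed sign remapping: fold a signed residual into an unsigned one."""
    return 2 * e if e >= 0 else -2 * e - 1


def unmap(m):
    """Inverse of remap_signed."""
    return m // 2 if m % 2 == 0 else -((m + 1) // 2)


def preprocess(cur_frame, prev_frame):
    """Zero-motion prediction: subtract the co-located pixel of the previous
    frame, then remap each residual to a non-negative value."""
    return [
        [remap_signed(c - p) for c, p in zip(cur_row, prev_row)]
        for cur_row, prev_row in zip(cur_frame, prev_frame)
    ]


def reconstruct(mapped_frame, prev_frame):
    """Decoder side: undo the remapping and add back the previous frame."""
    return [
        [p + unmap(m) for m, p in zip(mapped_row, prev_row)]
        for mapped_row, prev_row in zip(mapped_frame, prev_frame)
    ]


prev = [[100, 100, 100], [50, 60, 70]]
cur = [[100, 98, 103], [50, 60, 255]]
mapped = preprocess(cur, prev)
print(mapped)
print(reconstruct(mapped, prev) == cur)
```

With 8-bit pixels the residuals span -255..255, so under this mapping the unsigned values span 0..510 and no longer fit in 8 bits; how the thesis handles that range is not stated in this section.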

Chapter 5 Conclusion

5.1 Contributions

In this thesis, we have presented LALVC, a low-complexity and low-latency algorithm for real-time interactive applications in a universal multimedia access environment. LALVC consists of three parts: the preprocessing, the mode-dependent spatial prediction and the coding prediction. The preprocessing efficiently removes temporal redundancy through zero-motion prediction and an optimized mode decision. The proposed mode decision adapts the predictor so that it predicts accurately on various video sources. In addition, the syntax is simple, which eases encoder and decoder implementation. The simulation results show that LALVC achieves significant compression ratios for both natural and synthetic sequences. For hardware implementation, we design several schemes for memory reduction. Unrolling the loop execution benefits real-time coding because the hardware can execute in parallel.

The low complexity and low delay match the requirements of real-time operation and low power consumption. Since LALVC requires no multipliers or dividers, the architecture is suitable for hardware realization, including FPGA implementation and ASIC design.

5.2 Future Works

The structure of LALVC is based on raster scanning order. The simulation results show the syntax of line-based header information. For a high-resolution display, an extended version could adjust the line-level information: the lines could be divided into several small portions, each followed by one tag, to avoid large local-area variation. The control range of the tag is another issue to evaluate.

The color plane coding scheme needs more investigation to achieve better encoding performance.

If random access is to be supported, the I-frame encoding concept could be incorporated into the structure at the cost of coding efficiency.

For a complete hardware implementation, the SRAM module and post-layout simulation should be included for realistic simulation.

References

[1] X. Wu and N. Memon, “Context-based, adaptive, lossless image coding,” IEEE Trans. on Communications, vol. 45, no. 4, pp. 437-444, Apr. 1997.

[2] M.J. Weinberger, G. Seroussi and G. Sapiro, “The LOCO-I lossless image compression algorithm: principles and standardization into JPEG-LS,” IEEE Trans. on Image Processing, vol. 9, no. 8, pp. 1309-1324, Aug. 2000.

[3] R. Barequet and M. Feder, “SICLIC: a simple inter-color lossless image coder,” Proc. DCC '99, 29-31 March 1999, pp. 501-510.

[4] X. Wu and N. Memon, “Context-based lossless interband compression-extending CALIC,” IEEE Trans. on Image Processing, vol. 9, no. 6, pp. 994-1001, June 2000.

[5] E.S.G. Carotti, J.C. De Martin and A.R. Meo, “Backward-adaptive lossless compression of video sequences,” Proc. ICASSP'02, vol. 4, 13-17 May 2002, pp. 3417-3420.

[6] D. Brunello, G. Calvagno, G.A. Mian and R. Rinaldo, “Lossless compression of video using temporal information,” IEEE Trans. on Image Processing, vol. 12, no. 2, pp. 132-139, Feb. 2003.

[7] N.D. Memon and K. Sayood, “Lossless compression of video sequences,” IEEE Trans. on Communications, vol. 44, no. 10, pp. 1340-1345, Oct. 1996.

[8] S. Todd, G. G. Langdon, Jr., and J. Rissanen, “Parameter reduction and context selection for compression of the gray-scale images,” IBM J. Res. Develop., vol. 29, no. 2, pp. 188-193, Mar. 1985.

[9] S.W. Golomb, “Run-length encodings,” IEEE Trans. Inform. Theory, vol. IT-12, pp. 399-401, 1966.

[10] http://www.xs4all.nl/~brw/ds_products/hot_math.html

[11] A. J. Penrose and N. A. D., “Extending lossless image compression,” Eurographics UK '99, Fitzwilliam College, Cambridge, 13-15 Apr. 1999, ISBN 0-9521097-8-6 (http://www.cl.cam.ac.uk/users/nad/pubs/#COMPRESSION)

[12] M.-F. Zhang, J. Hu and L.-M. Zhang, “Lossless video compression using combination of temporal and spatial prediction,” Proc. NNSP’2003, vol. 2, pp. 1193-1196, Dec. 14-17, 2003.

[13] “Lossless and near-lossless coding of continuous tone still images (JPEG-LS),” ISO/IEC JTC 1/SC29/WG1 FCD 14495, public draft, 1997/7/16.

[14] Andreas Savakis and Michael Piorun, “Benchmarking and hardware implementation of JPEG-LS,” Proc. ICIP’02, vol. 2, pp. 949-952, 2002.

[15] Kyeong Ho Yang and A. Farid Faryar, “A context-based predictive coder for lossless and near-lossless compression of video,” Proc. ICIP’00, vol. 1, pp. 144-147, 2000.

Appendix A. Testing Sequences

A. CIF format (352*288) natural sequences

1. Akiyo 2. Bus

3. Football 4. Foreman

5. Mobile 6. Silent

B. CIF format (352*288) computer-like sequence

C. D1 resolution (720*480) natural sequences

1. Crew 2. Harbour

3. Night 4. Pour_water

5. Rolling_tomato 6. Sailormen
