Acceleration and Implementation of JPEG2000 Encoder on TI DSP Platform


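National Chiao Tung University, College of Electrical and Computer Engineering, Industrial Technology R & D Master Program on IC Design. Master's Thesis. Acceleration and Implementation of JPEG2000 Encoder on TI DSP Platform. Student: Chien-Chih Liu (劉建志). Advisor: Dr. Hsueh-Ming Hang (杭學鳴). December 2006 (Republic of China year 95).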

Acceleration and Implementation of JPEG2000 Encoder on TI DSP Platform

Student: Chien-Chih Liu
Advisor: Dr. Hsueh-Ming Hang

A Thesis Submitted to College of Electrical and Computer Engineering, National Chiao Tung University, in Partial Fulfillment of the Requirements for the Degree of Master in Industrial Technology R & D Master Program on IC Design.

December 2006. HsinChu, Taiwan, Republic of China.

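Acceleration and Implementation of JPEG2000 Encoder on TI DSP Platform. Student: Chien-Chih Liu. Advisor: Dr. Hsueh-Ming Hang. Master Program, College of Electrical and Computer Engineering, National Chiao Tung University.

Abstract (translated from the Chinese version)

As digital imaging applications become increasingly widespread, a new-generation still image compression standard, JPEG2000, was created to provide higher compression efficiency and to support more image processing functionality. It delivers good subjective quality even at high compression ratios, and it offers finer-grained scalability in compression performance and in bit-stream transmission. However, the computational complexity of JPEG2000 is quite high. In this thesis, we implement a JPEG2000 encoder on a TI DSP platform. Targeting the Tier-1 part, the most complex part of JPEG2000, we propose two improved methods and combine them with the TI DSP optimization tools for acceleration. Our reference software is OpenJPEG ver. 1.0. Because the wavelet transform module of this software is already accelerated with a 1-D lifting scheme, we concentrate on the Tier-1 module, which accounts for about 90% of the encoder's computation. We first study the common improvement methods and test them on our platform. We then propose two improved methods to reduce the computation: one is called VGOSS (Variable Group Of Sample Skip), and the other is a modification of VGOSS. The approach records the data that need to be coded, which reduces the time wasted on checking data that do not need to be coded. In addition, we change the original coding order to provide a faster computing structure. For lossless coding of images, besides the proposed acceleration methods we also use DSP compiler optimization, code acceleration techniques, and cache memory reconfiguration. The final experimental data on the DSP system show that, with all of the above techniques, the encoder runs up to 32 times faster than the original program. Under the same DSP optimization settings and memory configuration, our fast algorithm still reduces the computation by 45%.

Keywords: JPEG2000, TI DSP, DSP system acceleration, EBCOT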

Acceleration and Implementation of JPEG2000 Encoder on TI DSP Platform. Student: Chien-Chih Liu. Advisor: Dr. Hsueh-Ming Hang. College of Electrical and Computer Engineering, National Chiao Tung University.

Abstract

As digital imagery becomes increasingly popular, a new still image coding standard called JPEG2000 was proposed to improve compression efficiency and add features. It provides excellent subjective quality at low bit rates. It also offers fine-granularity scalability in compression efficiency and in transmitting the compressed bit stream. However, JPEG2000 is computationally very demanding. In this thesis, we implement a JPEG2000 encoder on the TI DSP platform. We propose two speed-up methods and use the TI DSP optimization tools to accelerate the Tier-1 module, which is the most complex part of the JPEG2000 standard. We start with the OpenJPEG ver. 1.0 reference software, which has already adopted the 1-D lifting scheme to accelerate the DWT module. We therefore focus on the Tier-1 module, which takes about 90% of the total computing time. We first study the previous methods and examine their effectiveness on our DSP platform. Then we propose two improved methods: one is called VGOSS (Variable Group Of Sample Skip), and the other is a modified VGOSS method. We eliminate unnecessary checking cycles by recording the NBC (Need-to-Be-Coded) samples on a list. Furthermore, the sample index is reordered to facilitate fast execution. In the DSP implementation of the proposed methods, we use code acceleration techniques and DSP compiler-level optimization. We also tune the cache allocation to reduce memory access time. The experimental results show that the best configuration is up to 32 times faster than the original program without any optimization on the DSP platform. If the original program is compiled with the DSP optimization tools and proper cache assignment, our fast algorithm can still reduce the computation by 45%.

Key words: JPEG2000, TI DSP, DSP platform acceleration, EBCOT

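Acknowledgments

During these two years of graduate study I went through many challenges and setbacks. I thank God for letting me learn and grow here. I want to thank my advisor, Professor Hsueh-Ming Hang (杭學鳴); his careful guidance and rich knowledge and experience benefited me greatly. Whenever my research hit a bottleneck, he helped me regain my confidence in a caring and understanding way, and in these two short years I gained far more than I had imagined. I also want to thank my colleagues in the Communication Electronics and Signal Processing Laboratory; the research and discussions here helped me solve many problems in coursework and research. My senior labmate VK always gave me excellent reminders and suggestions. Thanks also to my seniors 峰誠, 俊榮, 崑健, and 繼大, and to 雄哥, 思浩, Osban, 大師, Mark, Stan, 小新, John, 家賢, 阿竹, 蔡蟲, Geni, 旻弘, and everyone else in the lab; you made my graduate life more colorful. Thank you as well to the many friends who helped and supported me. Finally, I thank my family and my girlfriend for their support and care, which let me concentrate on my research; they gave me the greatest encouragement and support in both life and study.

Thank you to all the teachers, family members, and friends who accompanied me through this period.

Chien-Chih, National Chiao Tung University, Hsinchu, December 2006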

Contents

Abstract (in Chinese)
Abstract
Acknowledgments
Chapter 1 Introduction
  1.1 Introduction
  1.2 Overview of the Thesis
Chapter 2 Conspectus of JPEG2000 Algorithm
  2.1 Introduction to JPEG2000
  2.2 Pre-Processing
    2.2.1 Image Tiling
    2.2.2 DC Level Shifting
    2.2.3 Component Transformation
  2.3 Discrete Wavelet Transform and Quantization
  2.4 Embedded Block Coding with Optimized Truncation
    2.4.1 Tier-1 Coding
    2.4.2 Tier-2 Coding
Chapter 3 DSP Implementation Environment
  3.1 DSP Platform Introduction
  3.2 Major DSP Module
    3.2.1 Central Processing Unit
    3.2.2 Memory and Peripherals
  3.3 Coding Development Environment
    3.3.1 Code Composer Studio
    3.3.2 Code Development Flow
    3.3.3 Simulation Tools
  3.4 Optimization on TI DSP Platform
    3.4.1 Architecture of TI TMSC6000 Family
    3.4.2 Compiler-Level Optimization
    3.4.3 Program-Level Optimization
Chapter 4 Analysis of Embedded Block Coding and Speed-Improving Methods
  4.1 Parameters and Software Environment
    4.1.1 Jasper and OpenJPEG Reference Software
    4.1.2 Parameter Configuration
  4.2 JPEG2000 Encoder Complexity Analysis
  4.3 Major Encumbrances
    4.3.1 Memory System
    4.3.2 Analysis of Bit-Plane Coding
  4.4 A Few Known Speed-Improving Methods
    4.4.1 CUPS and PP Methods
    4.4.2 SS and GOCS Methods
    4.4.3 PPP Method
Chapter 5 Acceleration of JPEG2000 Encoder on DSP Platform
  5.1 Proposed Acceleration Method
    5.1.1 Coding Procedure of VGOSS method
    5.1.2 Modified VGOSS method
    5.1.3 Advantages of the Proposed Methods
    5.1.4 Software Speed-up Techniques
  5.2 Experimental Results
Chapter 6 Conclusions and Future Work
  6.1 Conclusion
  6.2 Future Works
References
Autobiography

List of Figures

Figure 2-1 General block diagram of JPEG2000 encoder [1]
Figure 2-2 General block diagram of JPEG2000 decoder [1]
Figure 2-3 Tiling, DC-Level shifting, and Component transformation (optional)
Figure 2-4 2-D forward discrete wavelet transform
Figure 2-5 2-D DWT decomposition
Figure 2-6 Hierarchical of multi-level 2-D DWT
Figure 2-7 An example of Lena image for multi-level 2-D DWT
Figure 2-8 Two tiers of EBCOT algorithm
Figure 2-9 Diagram of tile, code-block, bit-plane, stripe and coding pass
Figure 2-10 Context window and Neighbors states
Figure 2-11 Basic operation of the AE
Figure 3-1 SMT395 module and SMT310 carrier
Figure 3-2 Block diagram of emulator system
Figure 3-3 Block diagram of the TMS320C64x DSPs [13]
Figure 3-4 Development cycle
Figure 3-5 Code composer studio development
Figure 3-6 Code develop flow
Figure 3-7 TMS320C64x hierarchical memory
Figure 3-8 C/C++ compiler
Figure 3-9 SIMD example for using word access for adding short data
Figure 4-1 Gray level test images
Figure 4-2 Comparison of the 5-3 filter and the 9-7 filter
Figure 4-3 Comparison of different decomposition levels (Goldhill)
Figure 4-4 Bike 2048x2560
Figure 4-5 Impact of tile size on coding performance
Figure 4-6 Complexity profiling of the JPEG2000 encoder on the C64xx simulator
Figure 4-7 Complexity profiling of the JPEG2000 encoder on the C6416 simulator
Figure 4-8 Profile using file level optimization (-o3) on C64xx simulator
Figure 4-9 Profile using L2 cache and file level optimization (-o3) on C6416 simulator
Figure 4-10 Flowchart of bit-plane coding
Figure 4-11 Analysis of Pass Contribution
Figure 4-12 Flowchart of the CUPS method
Figure 4-13 Significant sample inheritance
Figure 4-14 Significance Propagation
Figure 4-15 Four conditions in continuous-five mode
Figure 4-16 Boundary extension
Figure 4-17 Prediction table for Pass1
Figure 4-18 Concept of SS method
Figure 4-19 Flowchart of SS method
Figure 4-20 Example of the GOCS method
Figure 4-21 Flowchart of sample checking
Figure 4-22 Analysis with different number of columns as a group [25]
Figure 4-23 Parallel processing of passes
Figure 5-1 Diagram of the stripe and coding pass
Figure 5-2 Flag-block and code-block
Figure 5-3 Address order of the stripe in the rearranged code-block
Figure 5-4 Rearranged flag-block with paddings
Figure 5-5 Restored flag-block with paddings
Figure 5-6 Flowchart of the bit-plane coding
Figure 5-7 Flowchart of the Pass3 process
Figure 5-8 Flowchart of the Pass1 process
Figure 5-9 Flowchart of the Pass2 process
Figure 5-10 Flowchart of the Pass1 process
Figure 5-11 Checking cycles of the GOCS method
Figure 5-12 Checking cycles of the VGOSS method
Figure 5-13 RENORME and modified procedure

List of Tables

Table 2-1 Part of the JPEG2000 standard
Table 2-2 Le Gall 5-3 analysis and synthesis filter coefficients
Table 2-3 Daubechies 9-7 analysis and synthesis filter coefficients
Table 2-4 Coding Pass Classification
Table 2-5 Contexts for the significance propagation pass and cleanup coding passes
Table 2-6 Contributions of the vertical (and the horizontal) neighbors to the sign context
Table 2-7 Contexts for the magnitude refinement coding pass
Table 3-1 Different data types
Table 4-1 PSNR (dB) of different images using JasPer Ver.1.701 encoder
Table 4-2 PSNR (dB) of different images using OpenJPEG Ver.1.0 encoder
Table 4-3 Cycles on different simulators
Table 4-4 The effect of using L2 cache memory
Table 4-5 Calls of Pass3 process function
Table 4-6 Comparison with the CUPS method
Table 4-7 SS method on C64xx simulator
Table 4-8 SS + GOCS (different columns as a group) method on C64xx simulator
Table 4-9 Comparison of processing time using PPP method [26]
Table 5-1 Effect of the data rearrangement
Table 5-2 Comparison between original and modified updating flag procedures
Table 5-3 Percentage of encoding samples in each pass process
Table 5-4 Cycles of the MQ coder on the C6416 simulator
Table 5-5 DWT module on C6416 simulator
Table 5-6 Comparison using C64xx simulator without compiler-level optimization
Table 5-7 Comparison using C64xx simulator with file level optimization
Table 5-8 Comparison using C6416 simulator without compiler-level optimization
Table 5-9 Comparison using C6416 simulator with file level optimization
Table 5-10 Comparison by C6416 simulator using L2 cache
Table 5-11 Comparison using C6416 simulator with L2 cache and file level optimization
Table 5-12 Comparison using C64xx simulator (Best solution) with file level optimization
Table 5-13 Comparison using C6416 simulator (Best solution) with file level optimization
Table 5-14 Comparison of the executing time on the C6416 emulator
Table 5-15 Comparison between the C6416 simulator and the C6416 emulator

Chapter 1 Introduction

1.1 Introduction

Digital images are an essential part of the information we handle every day, so standards for the efficient representation and interchange of digital images are important. JPEG2000 is an image coding algorithm well known for its excellent coding performance, especially at low bit rates. It is the most recent addition to a family of international standards developed by the Joint Photographic Experts Group (JPEG). This group operates under the auspices of Joint Technical Committee 1, Subcommittee 29, Working Group 1 (JTC 1/SC 29/WG 1), a collaborative effort between the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC). The JPEG committee has already released the JPEG and JPEG-LS standards, and JPEG has been the most popular image compression standard in recent years. However, the JPEG committee intended to create a new image coding system for different types of still images (bi-level, gray-level, color, multi-component) with different characteristics (natural images, scientific, medical, remote sensing, text, rendered graphics, etc.). The JPEG2000 coding system targets low bit-rate operation with rate-distortion and subjective image quality performance superior to the existing image standards.

The JPEG2000 standard implements an entirely new way of compressing images based on the wavelet transform, in contrast to the discrete cosine transform (DCT) used in the JPEG standard. It supports lossy and lossless compression of single-component (gray level) images and multi-component (color) images. In addition to this basic compression functionality, a number of other features are provided, including progressive recovery of an image by fidelity or resolution, region of interest coding, random access, and so on.

However, the complexity of the JPEG2000 algorithm is the most critical issue when implementing it on an embedded system. The Embedded Block Coding with Optimized Truncation (EBCOT) is the major and most computationally intensive part of the JPEG2000 algorithm. EBCOT employs a post-compression Rate-Distortion Optimization (RDO) tool, which truncates the bit-stream at the target bit-rate while providing optimal image quality. Because of these tools, the JPEG2000 algorithm requires much more computation than the JPEG algorithm. In order to reduce cost and power consumption, we analyze and accelerate the JPEG2000 algorithm in this study.

1.2 Overview of the Thesis

In this thesis, the JPEG2000 encoder is implemented on an embedded system, a TI DSP platform, and a few speed-up methods are adopted in our encoder. Chapter 2 introduces the concepts of the JPEG2000 algorithm and presents all coding modules. Chapter 3 introduces the implementation environment, including the DSP platform, the code development tools, and some typical optimization methods. In Chapter 4, the JPEG2000 encoder is profiled and analyzed, and some previous acceleration methods are reviewed and adapted to our DSP platform. In Chapter 5, we propose our improved methods to accelerate the JPEG2000 encoder and present extensive experiments using the different methods. Finally, Chapter 6 summarizes this project and discusses possible future work.

Chapter 2 Conspectus of JPEG2000 Algorithm

The JPEG standard has been in use for more than a decade. It has been a valuable tool during all these years, but it cannot fulfill today's advanced requirements for image coding. By adopting new technologies, the JPEG2000 standard provides a set of features that are important to many high-end and emerging applications. This chapter introduces the feature set and provides an overview of Part 1 of the JPEG2000 standard, which is the core of the JPEG2000 image coding system. The details of JPEG2000 Part 1 can be found in [1].

2.1 Introduction to JPEG2000

In March 1997, a call for contributions was launched for the development of a new standard for the compression of still images, the JPEG2000 standard [1], [2]. The requested compression technologies were submitted for evaluation at the November 1997 WG1 meeting in Sydney, Australia. The JPEG2000 standard achieves many desired features, covering different types of still images, different characteristics, and different imaging models within a unified system. The most important features [7] of the JPEG2000 algorithm are listed below.

Superior low bit-rate performance: While superior performance at all bit-rates was considered desirable, improved performance at low bit-rates (e.g., below 0.25 bpp) with respect to JPEG was considered an important requirement for JPEG2000. JPEG2000 has a compression advantage over JPEG of roughly 20% and a subjective quality benefit.

Continuous-tone and bi-level compression: Seamless compression of image components (e.g., R, G, or B), each from 1 to 16 bits deep, was desired from one unified compression architecture.

Progressive transmission by pixel accuracy and resolution: Progressive transmission that allows images to be reconstructed with increasing pixel accuracy or spatial resolution is essential for many applications. Common examples include the World Wide Web, image archival, and printers.

Lossless and lossy compression: JPEG2000 provides both lossless and lossy compression from a single compression architecture. It is desirable to provide lossless compression in the natural course of progressive decoding.

Region-of-Interest Coding: Some parts of an image are more important than others and should be transmitted with better quality and less distortion than the rest of the image. Users can define certain ROIs in the image to be coded and transmitted first.

Random code-stream access and processing: This feature allows users to define certain ROIs in the image to be coded and transmitted with less distortion than the rest of the image. In addition, rotation, filtering, translation, scaling, and feature extraction are supported.

Robustness to bit-errors: It is desirable to consider robustness to bit-errors when designing the codestream. In noisy communication channels (e.g., wireless), proper design of the codestream can aid subsequent error correction systems in alleviating catastrophic decoding failures.

Open architecture: It is desirable to allow an open architecture so that the system can be optimized for different image types and applications. A decoder is only required to implement the core tool set and a parser that understands the codestream. Furthermore, unknown tools could be sent from the source and adopted by the decoder.

Content-based description: Image archival, indexing, and searching are important areas of image processing. Content-based description of images might be available as part of the compression system.
Side channel spatial information (transparency): Side channel spatial information such as alpha planes and transparency planes is useful for transmitting information for processing the image for display, printing, or editing.

Protective image security: Protection of a digital image can be achieved by means of watermarking, stamping, encryption, and labeling. SPIFF already implements the labeling method, and JPEG2000 should make this easy to achieve as well.

Figure 2-1 General block diagram of JPEG2000 encoder [1]. (Blocks: Source Image Data, Pre-Processing, Forward DWT, Uniform Scalar Quantization, Tier-1 Encoder, Tier-2 Encoder, Rate Control, Coded Image.)

Figure 2-2 General block diagram of JPEG2000 decoder [1]. (Blocks: Coded Image, Tier-2 Decoder, Tier-1 Decoder, Dequantization, Inverse DWT, Post-Processing, Reconstructed Image.)

Due to the attractive features mentioned above, JPEG2000 has a very large potential application base. Some possible application areas include document imaging, digital photography, desktop publishing, Internet, image archiving, medical imaging, remote sensing, and web browsing. The JPEG2000 compression engine (encoder and decoder) is illustrated by the block diagrams in Figure 2-1 and Figure 2-2. The standard comprises numerous parts, several of which are listed in Table 2-1. Part 2 [3] and Part 3 [4] describe extensions to the baseline codec that are useful for certain specific applications such as intraframe-style video compression. For convenience, we refer to the codec defined in Part 1 of the standard as the baseline codec. Before introducing the major blocks of the codec, note that most parts of the JPEG2000 standard are written from the point of view of the decoder. Because the decoder is the reverse of the encoder, we describe only the JPEG2000 encoding tools in the following sections.

Part | Title                   | Purpose
-----|-------------------------|--------
1    | Core coding system      | Specifies the core codec for the JPEG2000 family of standards.
2    | Extensions [3]          | Specifies additional functionalities that are useful in some applications but need not be supported by all codecs.
3    | Motion JPEG2000 [4]     | Specifies extensions to JPEG2000 for intraframe-style video compression.
4    | Conformance testing [5] | Specifies the procedure to be employed for compliance testing.
5    | Reference software [6]  | Provides sample software implementations of the standard to serve as a guide for implementations.

Table 2-1 Part of the JPEG2000 standard

2.2 Pre-Processing
The Pre-Processing block includes three types of processes: “Image Tiling”, “DC Level Shifting”, and “Component Transformation”. We describe them as follows.

Figure 2-3 Tiling, DC-Level shifting, and Component transformation (optional).

2.2.1 Image Tiling
The standard operations, including component mixing, wavelet transform, quantization, and entropy coding, work on image tiles, which are partitions of the original image. The image tiles are rectangular non-overlapping blocks that are compressed independently. Tiling reduces memory requirements, and since the tiles are reconstructed independently, they can be used to decode specific parts of the image instead of the whole image.

2.2.2 DC Level Shifting
After tiling, all samples of each tile are DC level shifted by subtracting the same quantity 2^(P-1), where P is the component’s precision. DC level shifting is performed only on samples of components that are unsigned.
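The DC level shift described above can be sketched in a few lines. This is a minimal illustration, not the code of the reference software; the function names and the list-based sample layout are assumptions made for the example.

```python
def dc_level_shift(samples, precision):
    """Subtract 2**(precision - 1) from every unsigned sample of a tile component."""
    offset = 1 << (precision - 1)
    return [s - offset for s in samples]


def dc_level_unshift(samples, precision):
    """Inverse operation, as a decoder would apply it after reconstruction."""
    offset = 1 << (precision - 1)
    return [s + offset for s in samples]
```

For 8-bit unsigned samples (P = 8) the offset is 128, so the range 0..255 is mapped to -128..127, which is centered around zero.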

2.2.3 Component Transformation
The next stage is an optional inter-component transformation. It reduces the correlation between components and leads to improved coding efficiency [8]. JPEG2000 supports multiple-component images and different bit depths. For reversible (i.e., lossless) systems, the only requirement is that the bit depth of each output image component must be identical to the bit depth of the corresponding input image component. JPEG2000 supports two different component transforms: the irreversible component transformation (ICT) for lossy coding and the reversible component transformation (RCT) for lossless or lossy coding. The image component samples I0(x, y), I1(x, y), I2(x, y), corresponding to the first, second, and third components, produce the transform samples Y0(x, y), Y1(x, y), Y2(x, y). The forward and inverse RCT are given by (2.2-1) and (2.2-2). The forward and inverse ICT are given by (2.2-3) and (2.2-4).

Forward RCT:
\[
\begin{pmatrix} Y_0 \\ Y_1 \\ Y_2 \end{pmatrix}
=
\begin{pmatrix} \left\lfloor \dfrac{I_0 + 2 I_1 + I_2}{4} \right\rfloor \\ I_2 - I_1 \\ I_0 - I_1 \end{pmatrix}
\tag{2.2-1}
\]

Inverse RCT:
\[
\begin{pmatrix} I_1 \\ I_0 \\ I_2 \end{pmatrix}
=
\begin{pmatrix} Y_0 - \left\lfloor \dfrac{Y_1 + Y_2}{4} \right\rfloor \\ Y_2 + I_1 \\ Y_1 + I_1 \end{pmatrix}
\tag{2.2-2}
\]

Forward ICT:
\[
\begin{pmatrix} Y_0 \\ Y_1 \\ Y_2 \end{pmatrix}
=
\begin{pmatrix}
0.299 & 0.587 & 0.114 \\
-0.16875 & -0.33126 & 0.5 \\
0.5 & -0.41869 & -0.08131
\end{pmatrix}
\begin{pmatrix} I_0 \\ I_1 \\ I_2 \end{pmatrix}
\tag{2.2-3}
\]

Inverse ICT:
\[
\begin{pmatrix} I_0 \\ I_1 \\ I_2 \end{pmatrix}
=
\begin{pmatrix}
1.0 & 0 & 1.402 \\
1.0 & -0.34413 & -0.71414 \\
1.0 & 1.772 & 0
\end{pmatrix}
\begin{pmatrix} Y_0 \\ Y_1 \\ Y_2 \end{pmatrix}
\tag{2.2-4}
\]
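To illustrate why the RCT is reversible, the following sketch applies (2.2-1) and (2.2-2) to integer samples. The function names are hypothetical, and the floor division by 4 is written as an arithmetic right shift, which is an implementation choice made for this example.

```python
def rct_forward(i0, i1, i2):
    """Forward RCT of (2.2-1) on one pixel; inputs are integers."""
    y0 = (i0 + 2 * i1 + i2) >> 2  # floor((I0 + 2*I1 + I2) / 4)
    y1 = i2 - i1
    y2 = i0 - i1
    return y0, y1, y2


def rct_inverse(y0, y1, y2):
    """Inverse RCT of (2.2-2); recovers the original integers exactly."""
    i1 = y0 - ((y1 + y2) >> 2)  # floor also holds for negative sums in Python
    i0 = y2 + i1
    i2 = y1 + i1
    return i0, i1, i2
```

The round trip is exact because Y1 + Y2 = I0 + I2 - 2·I1, so the floor term in the inverse differs from Y0 by exactly I1.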

2.3 Discrete Wavelet Transform and Quantization
The wavelet transform is used to analyze the tile components into different decomposition levels. These decomposition levels contain a number of subbands, which consist of coefficients that describe the horizontal and vertical spatial frequency characteristics of the original tile component. Due to the statistical properties of these subband signals, the transformed data can usually be coded more efficiently than the original untransformed data. The JPEG2000 system provides two wavelet transform kernels, so the DWT can be irreversible or reversible. The default reversible transform is implemented by means of the Le Gall 5-3 filter; the analysis and the corresponding synthesis filter coefficients are given in Table 2-2. The default irreversible transform is implemented by means of the Daubechies 9-7 filter, and the corresponding coefficients are given in Table 2-3.

i | Analysis low-pass h_L(i) | Analysis high-pass h_H(i) | Synthesis low-pass g_L(i) | Synthesis high-pass g_H(i)
0 | 6/8 | 1 | 1 | 6/8
±1 | 2/8 | -1/2 | 1/2 | -2/8
±2 | -1/8 | | | -1/8

Table 2-2 Le Gall 5-3 analysis and synthesis filter coefficients

i | Analysis low-pass h_L(i) | Analysis high-pass h_H(i) | Synthesis low-pass g_L(i) | Synthesis high-pass g_H(i)
0 | 0.6029490182363579 | 1.115087052456994 | 1.115087052456994 | 0.6029490182363579
±1 | 0.2668641184428723 | -0.5912717631142470 | 0.5912717631142470 | -0.2668641184428723
±2 | -0.07822326652898785 | -0.05754352622849957 | -0.05754352622849957 | -0.07822326652898785
±3 | -0.01686411844287495 | 0.09127176311424948 | -0.09127176311424948 | 0.01686411844287495
±4 | 0.02674875741080976 | | | 0.02674875741080976

Table 2-3 Daubechies 9-7 analysis and synthesis filter coefficients.
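As an illustration of the reversible kernel, the sketch below implements one level of a 1-D 5-3 transform in its integer lifting form (a predict step followed by an update step). The lifting formulation, the symmetric boundary extension, the restriction to even-length inputs, and the function names are assumptions of this example rather than a description of any particular implementation.

```python
def dwt53_forward(x):
    """One level of the reversible 5-3 transform on an even-length integer list."""
    n = len(x)
    assert n >= 2 and n % 2 == 0
    half = n // 2
    # Predict step: detail (high-pass) samples at odd positions.
    d = []
    for k in range(half):
        left = x[2 * k]
        right = x[2 * k + 2] if 2 * k + 2 < n else x[n - 2]  # symmetric extension
        d.append(x[2 * k + 1] - ((left + right) >> 1))
    # Update step: smooth (low-pass) samples at even positions.
    s = []
    for k in range(half):
        d_prev = d[k - 1] if k > 0 else d[0]  # symmetric extension
        s.append(x[2 * k] + ((d_prev + d[k] + 2) >> 2))
    return s, d


def dwt53_inverse(s, d):
    """Undo dwt53_forward by running the lifting steps in reverse order."""
    half = len(s)
    n = 2 * half
    x = [0] * n
    for k in range(half):
        d_prev = d[k - 1] if k > 0 else d[0]
        x[2 * k] = s[k] - ((d_prev + d[k] + 2) >> 2)
    for k in range(half):
        left = x[2 * k]
        right = x[2 * k + 2] if 2 * k + 2 < n else x[n - 2]
        x[2 * k + 1] = d[k] + ((left + right) >> 1)
    return x
```

Because every lifting step adds or subtracts an integer that the inverse can recompute from data it already has, the reconstruction is exact, which is what makes this kernel suitable for lossless coding.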

Figure 2-4 2-D forward discrete wavelet transform. (The source image data pass through vertical filtering and then horizontal filtering. Each stage applies the low-pass filter h_L(i) and the high-pass filter h_H(i), each followed by downsampling by 2, and the outputs are the subbands LL_{i+1}, LH_{i+1}, HL_{i+1}, and HH_{i+1}.)

Figure 2-5 2-D DWT decomposition.

Usually, the two-dimensional (2-D) discrete wavelet transform is accomplished by cascading two one-dimensional (1-D) discrete wavelet transforms. The image is decomposed by a 2-channel 1-D discrete wavelet transform in the vertical and horizontal directions in turn, as shown in Figure 2-4. After the 1-D vertical discrete wavelet transform, two subbands are formed. The low-pass samples represent a downsampled low-resolution version of the original set. The high-pass samples represent a downsampled residual version of the original set. The two subbands then pass through the horizontal filters. Each of the four resulting subbands is a quarter of the original image size, as shown in Figure 2-5.

Part 1 allows power-of-2 decompositions in the form of dyadic decomposition, as shown in Figure 2-6. For an N by N image decomposed by an M-level 2-D discrete wavelet transform, each subband at the M-th level has size N/2^M by N/2^M. An example of a dyadic decomposition of the image ‘Lena’ into subbands is illustrated in Figure 2-7.

Figure 2-6 Hierarchy of multi-level 2-D DWT.

Figure 2-7 An example of Lena image for multi-level 2-D DWT. (Axis labels: N, N/2, and N/4 along each side.)
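A small worked example makes the subband sizes concrete. The helper below is hypothetical and assumes that N is divisible by 2^M so that every level splits evenly.

```python
def subband_sizes(n, levels):
    """Return (level, side length) pairs for a dyadic decomposition of an n-by-n image."""
    sizes = []
    for m in range(1, levels + 1):
        sizes.append((m, n >> m))  # each level halves the side length
    return sizes
```

For a 512 by 512 image and three levels, the subbands have side lengths 256, 128, and 64.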

After transformation, all coefficients are quantized. Several quantization options are provided in the JPEG2000 standard. Only uniform scalar quantization, which is the default quantization method in JPEG2000 Part 1, is introduced here. In integer mode, the quantizer step sizes are always fixed at one, effectively bypassing quantization and forcing the quantizer indices and transform coefficients to be one and the same. In this case, lossy coding is still possible, but rate control is achieved by another mechanism. In real mode, the quantizer step sizes are chosen in conjunction with rate control. Each transform coefficient a_b(u, v) of the subband b is quantized to the value q_b(u, v) according to (2.3-1). Since the step size Δ_b is represented relative to the dynamic range R_b of the subband b, it is defined as in (2.3-2). The exponent/mantissa pairs (ε_b, μ_b) are either explicitly signaled in the bit stream syntax for every subband or signaled only for the LL subband, with the remaining pairs derived from it.

\[
q_b(u, v) = \operatorname{sign}\bigl(a_b(u, v)\bigr) \left\lfloor \frac{\lvert a_b(u, v) \rvert}{\Delta_b} \right\rfloor
\tag{2.3-1}
\]

\[
\Delta_b = 2^{R_b - \varepsilon_b} \left( 1 + \frac{\mu_b}{2^{11}} \right)
\tag{2.3-2}
\]
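The two formulas can be checked numerically with a short sketch. It assumes the usual dead-zone form of (2.3-1), in which the floor is taken of the magnitude divided by the step size; the function names and argument order are illustrative.

```python
import math


def step_size(dynamic_range, exponent, mantissa):
    """Step size of (2.3-2): 2**(R_b - eps_b) * (1 + mu_b / 2**11)."""
    return 2.0 ** (dynamic_range - exponent) * (1.0 + mantissa / 2.0 ** 11)


def quantize(coefficient, delta):
    """Quantizer index of (2.3-1): sign(a) * floor(|a| / delta)."""
    sign = -1 if coefficient < 0 else 1
    return sign * math.floor(abs(coefficient) / delta)
```

With R_b = ε_b and μ_b = 0 the step size is exactly one, which corresponds to the integer mode in which quantization is effectively bypassed.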

2.4 Embedded Block Coding with Optimized Truncation
Embedded block coding with optimized truncation (EBCOT) [9] is adopted for the entropy coding of JPEG2000. EBCOT consists of two major coding steps, tier-1 and tier-2, as shown in Figure 2-8. The tier-1 part is the embedded block coding (EBC), which is composed of the context formation (CF) and the arithmetic encoder (AE). The tier-1 coder divides the coefficients of each subband into code-blocks, and all code-blocks are coded separately into a block-based embedded bit-stream. The coding is performed using the bit-plane coder described in the next section. For each code-block, an embedded code comprised of numerous coding passes is produced, and the output of tier-1, the block-based embedded bit-stream, is a collection of coding passes for the various code-blocks. After that, tier-2 truncates the embedded bit-stream to minimize the overall distortion. We introduce the two tiers in the following sections.

Figure 2-8 Two tiers of EBCOT algorithm. (Tier-1, the EBC: the DWT coefficients enter the context formation, which passes context and decision to the arithmetic encoder. Tier-2: rate-distortion optimization, which outputs the full-featured bit-stream.)

2.4.1 Tier-1 Coding
The tier-1 coding is also known as the embedded block coding (EBC). It includes the context formation (CF) and the arithmetic encoder (AE), and its basic coding unit is a code-block. The EBC is a bit-level processing algorithm: the code-block is coded bit-plane by bit-plane, from the most significant bit (MSB) bit-plane to the least significant bit (LSB) bit-plane. Every bit-plane takes three passes and is scanned in a stripe-based manner, as presented in Figure 2-9.
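The stripe-based scan can be sketched as a generator of sample coordinates. The sketch assumes the usual JPEG2000 convention of stripes that are four rows high, scanned column by column from left to right and from top to bottom within each column; the function name is hypothetical.

```python
def stripe_scan_order(width, height, stripe_height=4):
    """Return the (row, col) visiting order of one bit-plane of a code-block."""
    order = []
    for top in range(0, height, stripe_height):
        bottom = min(top + stripe_height, height)  # last stripe may be shorter
        for col in range(width):
            for row in range(top, bottom):
                order.append((row, col))
    return order
```

For a code-block that is two samples wide and four rows high, the scan visits the four samples of the first column before moving on to the second column.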

Figure 2-9 Diagram of tile, code-block, bit-plane, stripe and coding pass.

2.4.1.1 Context Formation (CF)
The embedded block coding is essentially a context-adaptive arithmetic encoder, as shown in Figure 2-8. The context formation (CF) generates context-decision pairs for the arithmetic encoder (AE). The AE uses the context to adapt its probability estimate for the decision. In context modeling, all code-blocks are coded one bit-plane at a time, starting from the first MSB bit-plane with a non-zero element and ending at the LSB bit-plane. For each bit-plane in a code-block, a special scan pattern is used for each of the three coding passes. The three coding passes are coded in the order Pass1 (significance propagation pass), Pass2 (magnitude refinement pass), and then Pass3 (cleanup pass). Each coefficient bit from DWT is coded in

only one of the three coding passes, and the coding condition is shown in Table 2-4.

Coding Pass | Coding Condition
Pass1 (Significance Propagation Pass) | Insignificant sample with at least one significant neighbor
Pass2 (Magnitude Refinement Pass) | Significant sample
Pass3 (Cleanup Pass) | Insignificant sample with all neighbors insignificant

Table 2-4 Coding Pass Classification.

Figure 2-10 Context window and Neighbors states. (Within a stripe of rows n to n+3, samples are numbered down each column: 0 to 3 in the first column, 4 to 7 in the second, and so on up to 23. The context window around the current sample X is laid out as
D0 V0 D1
H0 X H1
D2 V1 D3.)

Since context-based arithmetic coding is employed, a means of selecting the context is necessary. Figure 2-10 shows the context window; the context is selected by examining the state information of the 4-connected or 8-connected neighbors of a sample. The first coding pass (Pass1) for each bit-plane is the significance propagation pass. During this pass, a bit is coded if its location is not significant but at least one of its 8-connected neighbors is significant. Nine context labels (Table 2-5) are created based on how many and which neighbors are significant. The significance propagation pass includes only bits of coefficients that were insignificant and have a non-zero context. All other coefficients are skipped. If the value of this bit is 1, the significance state is set to 1 and sign coding must then be performed. The sign coding is determined using another context table.
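The classification of Table 2-4 reduces to two tests on the state of a sample at the moment the scan reaches it. The following sketch is illustrative; the function name and the boolean inputs are assumptions, and `is_significant` refers to the state the sample had before the current bit-plane.

```python
def coding_pass(is_significant, has_significant_neighbor):
    """Return the pass (1, 2, or 3) that codes this sample's bit in the current bit-plane."""
    if is_significant:
        return 2  # magnitude refinement pass
    if has_significant_neighbor:
        return 1  # significance propagation pass
    return 3      # cleanup pass
```

Every sample falls into exactly one branch, which reflects the statement that each coefficient bit is coded in only one of the three passes.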

Only four neighbors are considered, and each neighbor may have one of three states: significant positive, significant negative, or insignificant. The vertical and horizontal neighbor pairs each make their own contribution to the context table. The nine permutations of the vertical and horizontal contributions are reduced to five context labels, as shown in Table 2-6. The sign-coding decision is obtained by XORing the sign bit with the XOR bit from the sign context table.

The second coding pass (Pass2) for each bit-plane is the magnitude refinement pass. This pass signals the subsequent bits after the most significant bit of each sample. If a sample was found to be significant in a previous bit-plane (excluding samples that have just become significant in the immediately preceding significance propagation pass), the next most significant bit of that sample is conveyed using a single binary symbol. The context used in magnitude refinement coding is determined by the summation of the significance states of the horizontal, vertical, and diagonal neighbors, as shown in Table 2-7.

All the remaining coefficients in the bit-plane are insignificant and had a context value of zero during the significance propagation pass. These are all included in the cleanup pass (Pass3). The cleanup coding uses not only the neighbor context, like the significance coding from Table 2-5, but also run-length coding. If the context labels of four contiguous samples in a column are all zero, run-length coding is performed.

LL and LH sub-bands (vertical high-pass) | HL sub-band (horizontal high-pass) | HH sub-band (diagonally high-pass) | Context Label
ΣH ΣV ΣD | ΣH ΣV ΣD | Σ(H+V) ΣD |
2 X X | X 2 X | X ≧3 | 8
1 ≧1 X | ≧1 1 X | ≧1 2 | 7
1 0 ≧1 | 0 1 ≧1 | 0 2 | 6
1 0 0 | 0 1 0 | ≧2 1 | 5
0 2 X | 2 0 X | 1 1 | 4
0 1 X | 1 0 X | 0 1 | 3
0 0 ≧2 | 0 0 ≧2 | ≧2 0 | 2
0 0 1 | 0 0 1 | 1 0 | 1
0 0 0 | 0 0 0 | 0 0 | 0

Table 2-5 Contexts for the significance propagation pass and cleanup coding passes. (X denotes “don’t care”.)
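The LL/LH column group of Table 2-5 can be expressed as a short decision function, which may be easier to follow than the table itself. The function name is hypothetical, and the inputs are the numbers of significant horizontal (0 to 2), vertical (0 to 2), and diagonal (0 to 4) neighbors.

```python
def zc_context_ll_lh(sum_h, sum_v, sum_d):
    """Context label for LL and LH subbands, following the rows of Table 2-5."""
    if sum_h == 2:
        return 8
    if sum_h == 1:
        if sum_v >= 1:
            return 7
        return 6 if sum_d >= 1 else 5
    # From here on sum_h == 0.
    if sum_v == 2:
        return 4
    if sum_v == 1:
        return 3
    if sum_d >= 2:
        return 2
    return sum_d  # 1 when one diagonal neighbor is significant, otherwise 0
```

The HL column group follows the same structure with the roles of the horizontal and vertical sums exchanged.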

Horizontal contribution | Vertical contribution | Context Label | XOR bit
1 | 1 | 13 | 0
1 | 0 | 12 | 0
1 | -1 | 11 | 0
0 | 1 | 10 | 0
0 | 0 | 9 | 0
0 | -1 | 10 | 1
-1 | 1 | 11 | 1
-1 | 0 | 12 | 1
-1 | -1 | 13 | 1

Table 2-6 Contributions of the vertical (and the horizontal) neighbors to the sign context.

ΣH+ΣV+ΣD | First refinement for this sample | Context Label
X | False | 16
≧1 | True | 15
0 | True | 14

Table 2-7 Contexts for the magnitude refinement coding pass.

Figure 2-11 Basic operation of the AE (Most Probable Symbol, Least Probable Symbol, and Renormalization).
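Tables 2-6 and 2-7 are small enough to be written directly as lookups. In the sketch below the neighbor states are encoded as +1 (significant positive), -1 (significant negative), and 0 (insignificant). Combining the two neighbors of a pair by clamping their sum to the range -1..1 is the usual JPEG2000 convention and is an assumption here, since the text gives only the resulting contributions; all names are illustrative.

```python
# (horizontal contribution, vertical contribution) -> (context label, XOR bit)
SIGN_CONTEXT = {
    (1, 1): (13, 0), (1, 0): (12, 0), (1, -1): (11, 0),
    (0, 1): (10, 0), (0, 0): (9, 0), (0, -1): (10, 1),
    (-1, 1): (11, 1), (-1, 0): (12, 1), (-1, -1): (13, 1),
}


def contribution(a, b):
    """Combine two neighbor states into one contribution in {-1, 0, 1} (assumed rule)."""
    return max(-1, min(1, a + b))


def sign_context(h0, h1, v0, v1):
    """Look up Table 2-6 for the sign coding of the current sample."""
    return SIGN_CONTEXT[(contribution(h0, h1), contribution(v0, v1))]


def refinement_context(first_refinement, neighbor_sum):
    """Look up Table 2-7; neighbor_sum is the count of significant H, V, and D neighbors."""
    if not first_refinement:
        return 16
    return 15 if neighbor_sum >= 1 else 14
```

The symmetry of Table 2-6 is visible in the dictionary: negating both contributions keeps the context label and flips the XOR bit.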

2.4.1.2 Arithmetic Encoder (AE)
The decision produced by the CF is coded by the arithmetic encoder. The AE is an adaptive, binary MQ-coder [10]. The basis of the binary arithmetic coding process is the recursive probability interval subdivision of Elias coding. Since it is a binary AE, there are only two sub-intervals. With each binary decision, the current probability interval is subdivided into two sub-intervals, and the codestream is modified (if necessary) so that it points to the base (lower bound) of the probability sub-interval assigned to the symbol, as shown in Figure 2-11. In addition, a lazy coding mode can be used to reduce the number of symbols that are arithmetically coded. In this mode, after the fourth bit-plane is coded, the first and second passes are included as raw data, while only the third coding pass of each bit-plane employs arithmetic coding.

2.4.2 Tier-2 Coding
The tier-2 encoding follows the tier-1 encoding, and its input is the set of bit-plane coding passes generated during tier-1 encoding. Each coding pass is a candidate truncation point of a code-block, and the coding pass information is packaged into data units called packets in tier-2 coding. To meet a target bit-rate or transmission time, the packaging process imposes a particular organization of coding pass data in the output codestream. Rate control thus ensures that the codestream uses the desired number of bytes while achieving the highest possible image quality. We review the RDO algorithm in the following section. In the encoder, rate control can be achieved through two distinct mechanisms: the choice of quantization step sizes and the selection of the subset of coding passes to include in the codestream. When the reversible (integer) mode is employed, only the second mechanism may be used, because the quantization step sizes must be fixed to one. In lossy coding mode, both mechanisms may be employed.
If the quantization step sizes are changed, the tier-1 encoding must be performed again. Since tier-1 coding requires a lot of computation, changing the step sizes may not be practical in the encoder. Instead, the encoder can elect to discard coding passes in order to control the rate. The encoder knows the contribution that each coding pass makes to the rate, and it can also calculate the distortion reduction associated with each pass. Using this information, the encoder can include the coding passes in order of decreasing distortion reduction until the bit budget has been exhausted. The goal of rate control is to minimize the distortion while keeping the rate smaller than the target rate, R_T. The problem is mapped into a Lagrange optimization problem [11] as (2.4-1).

\[
\min(D + \lambda R) = \min\Bigl( \sum_i \bigl( D_i^{Z_i} + \lambda R_i^{Z_i} \bigr) \Bigr)
\tag{2.4-1}
\]

Here D is the total distortion and R is the total bit rate. The Lagrange multiplier λ is used to minimize J = D + λR, and thus the derivative of J is set to zero. The candidate truncation point corresponding to pass m of bit-plane k in code-block i (B_i) is represented as Z_i. The optimal λ, denoted λ*, which is also the slope of the R-D curve, is then obtained as (2.4-2).

\[
\lambda^{*} = -\frac{\partial D}{\partial R}
\tag{2.4-2}
\]

For each code-block B_i, the slope of the R-D curve at truncation point Z_i is given by (2.4-3). S_i^{Z_i} is the rate of distortion reduction when B_i is truncated at Z_i. The optimal solution proved in [11] is constrained as in (2.4-4). Z_i* is the optimal truncation point of B_i, and rate-distortion optimization is achieved when Z_i is sufficiently close to Z_i*.

\[
S_i^{Z_i} = \frac{\Delta D_i^{Z_i}}{\Delta R_i^{Z_i}} = \frac{D_i^{Z_i} - D_i^{Z_i + 1}}{R_i^{Z_i} - R_i^{Z_i + 1}}
\tag{2.4-3}
\]

\[
\begin{cases}
-S_i^{Z_i} \ge \lambda^{*}, & Z_i \ge Z_i^{*} \\
-S_i^{Z_i} < \lambda^{*}, & Z_i < Z_i^{*}
\end{cases}
\quad \text{and} \quad
\sum_i R_i^{Z_i^{*}} \le R_T
\tag{2.4-4}
\]
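The Lagrangian formulation can be illustrated with a small sketch: for a given λ, each code-block independently picks the truncation point that minimizes D + λR, and λ is then adjusted until the total rate fits the budget. This is a simplified illustration under stated assumptions, not the procedure of [11] or of any particular encoder. The data layout (a list of (rate, distortion) candidates per code-block), the bisection search, and all names are hypothetical.

```python
def choose_truncation(blocks, lam):
    """For each code-block, return the index of the candidate minimizing D + lam * R."""
    choice = []
    for candidates in blocks:
        costs = [d + lam * r for (r, d) in candidates]
        choice.append(costs.index(min(costs)))
    return choice


def total_rate(blocks, choice):
    """Sum of the rates of the chosen truncation points."""
    return sum(blocks[i][z][0] for i, z in enumerate(choice))


def fit_budget(blocks, budget, lam_hi=1e6, iterations=60):
    """Bisect on lam; assumes the rate at lam_hi already fits within the budget."""
    lam_lo = 0.0
    for _ in range(iterations):
        mid = 0.5 * (lam_lo + lam_hi)
        if total_rate(blocks, choose_truncation(blocks, mid)) > budget:
            lam_lo = mid  # too many bits: penalize rate more strongly
        else:
            lam_hi = mid  # fits: try a smaller penalty
    return choose_truncation(blocks, lam_hi)
```

The bisection relies on the total rate being non-increasing in λ, which holds because a larger λ never makes a higher-rate candidate more attractive.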

Chapter 3 DSP Implementation Environment
In this chapter, we briefly introduce the DSP platform environment and some optimization methods. We use the DSP module (SMT395) made by Sundance. It houses two important chips, the TMS320C6416T DSP chip made by Texas Instruments and a Xilinx Virtex II Pro FPGA. As our implementation is a software-based system, we focus only on the DSP chip. In addition, we introduce the software development tool, the Code Composer Studio (CCS), and present some efficient optimization methods that use this environment.

3.1 DSP Platform Introduction
Our DSP platform includes two major modules, SMT395 and SMT310. The DSP module, SMT395, is based on the 1 GHz 64-bit TMS320C6416T DSP, which is manufactured in 90 nm process technology. It is also supported by the TI Code Composer Studio and the 3L Diamond RTOS to enable full multi-DSP systems with minimum effort by the programmers. We use the PCI module carrier (SMT310) to communicate between the SMT395 and the personal computer. Our emulation results are passed over the SMT310 PCI bus and shown in CCS windows. Figure 3-1 shows pictures of the SMT395 and SMT310. The SMT395 module can be installed on the SMT310 carrier, and the SMT310 can be installed in a personal computer. The block diagram of the emulator system is shown in Figure 3-2. We introduce the main DSP module (SMT395) and the software environment in the following sections.

Figure 3-1 SMT395 module and SMT310 carrier.

Figure 3-2 Block diagram of emulator system.

3.2 Major DSP Module
In our emulator system, the DSP module (SMT395) is the most important part. First, we list some important features of the SMT395 module as follows [12].

- 1GHz TMS320C6416T fixed point DSP.
- 8000MIPS peak performance.
- Xilinx Virtex II Pro FPGA. XC2V920-6 in FF896 package.
- 256 Mbytes of SDRAM @ 133MHz using k4s511632M.
- Two Sundance High-speed Bus (50MHz, 100MHz or 200MHz) ports 32 bits wide.
- Eight 2 Gbit/sec Rocket Serial Links (RSL) for Inter-Module communications.
- Six Comports up to 20 Mbytes/sec each for Inter-DSP communication/configuration.
- 8 Mbytes Flash ROM for configuration and booting.
- JTAG diagnostics port.

The TMS320C6416T DSP belongs to the highest-performance fixed-point DSP generation, the TMS320C64x series of the TMS320C6000 DSP family. It is based on the second-generation high-performance, advanced VelociTI very-long-instruction-word (VLIW) architecture (called VelociTI.2) developed by Texas Instruments [13]. The VelociTI.2 extensions in the eight functional units include new instructions that accelerate performance in key applications and extend the parallelism of the VelociTI architecture. The functional block and DSP core diagram of the TMS320C64x series is shown in Figure 3-3. In the following sections, the three major parts of the TMS320C64x DSP are introduced in turn: the central processing unit, the memory, and the peripherals.

Figure 3-3 Block diagram of the TMS320C64x DSPs [13].

3.2.1 Central Processing Unit
The DSP core of the C64x series consists of eight independent functional units, 64 general-purpose registers, a program fetch unit, an instruction dispatch unit (with advanced instruction packing), an instruction decode unit, two data paths, a test unit, an emulation unit, interrupt logic, etc. The instruction dispatch and decode units decode up to eight instructions and assign them to the eight functional units. The eight functional units in the C64x architecture could

be further divided into two data paths, data path A and B, as shown in Figure 3-3. Each path has one unit for multiplication operations (.M), one for logical and arithmetic operations (.L), one for branch, bit manipulation, and arithmetic operations (.S), and one for loading/storing, address calculation, and arithmetic operations (.D). The (.S) and (.L) units are for arithmetic, logical, and branch instructions. All data transfers make use of the (.D) units. Two cross-paths (1x and 2x) allow functional units from one data path to access a 32-bit operand from the register file on the other side. Each register file contains 32 general-purpose registers, some of which are reserved for specific addressing modes or used for conditional instructions. Each functional unit has its own 32-bit bus for writing into a general-purpose register file. All functional units whose names end in 1 (for example, (.L1)) write to register file A, while all functional units whose names end in 2 (for example, (.L2)) write to register file B.

3.2.2 Memory and Peripherals
The C64x uses a two-level cache-based architecture and has a powerful and diverse set of peripherals. The level 1 program cache (L1P) is a 128 Kbit direct-mapped cache, and the level 1 data cache (L1D) is a 128 Kbit 2-way set-associative cache. The level 2 memory/cache (L2) consists of an 8 Mbit memory space that can be used as mapped memory or as a combination of cache (up to 256 Kbytes) and mapped memory. In addition, the TMS320C6416T uses two external memory interfaces (EMIF) to access asynchronous memories (SRAM and EPROM) and synchronous memories (SDRAM, SBSRAM, ZBT SRAM, and FIFO). The C64x contains peripherals such as the enhanced direct memory access (EDMA) controller, the host-port interface (HPI), the external memory interface (EMIF), PCI, etc. The EDMA supports up to 64 EDMA channels, which service peripheral devices and external memory.
For the C64x device, the association of an event to a channel is fixed, and each of the EDMA channels has one specific event associated with it. These specific events are captured in the EDMA event registers even if the events are disabled by the EDMA event enable registers. The HPI is a parallel port through which a host processor can directly access the CPU’s memory space. The host can also directly and easily access memory-mapped peripherals. The PCI module supports connection of the C6000 device to a PCI host via the integrated PCI master/slave bus interface.

3.3 Coding Development Environment
In this section, we give a brief introduction to the coding development environment used in this project. The Code Composer Studio (CCS) and the coding development flow are described. The tutorial [14] introduces the key features of CCS, and the programmer’s guide [15] gives a reference for programming TMS320C6000 digital signal processor (DSP) devices. A programmer needs to be familiar with the coding development flow and CCS in order to build a new project on the DSP platform efficiently.

3.3.1 Code Composer Studio
Code Composer Studio (CCS) speeds up and enhances the development process for programmers who create and test real-time, embedded signal processing applications. CCS extends the basic code generation tools with a set of debugging and real-time analysis capabilities, as described in Figure 3-4. In addition, CCS includes the following components, all of which work together as shown in Figure 3-5.

- TMS320C6000 code generation tools.
- Code Composer Studio Integrated Development Environment (IDE).
- DSP/BIOS plug-ins and API.
- RTDX plug-in, host interface, and API.

The code generation tools, such as the C compiler, assembler, assembly optimizer, linker, and archiver, provide the foundation for the development environment provided by CCS. The Code Composer Studio integrated development environment is designed for editing, building, and debugging DSP target programs. During the analysis phase of the software development cycle, traditional debugging features are ineffective for diagnosing subtle problems that arise from time-dependent interactions. Therefore, the DSP/BIOS plug-ins provide real-time analysis such as program tracing, performance monitoring, and file streaming. In addition, the real-time data exchange (RTDX) provides real-time, continuous visibility into the way DSP applications operate in the real world. It allows system developers

to transfer data for bi-directional real-time communications between a host computer and the DSP devices without stopping their target application.

Figure 3-4 Development cycle.

Figure 3-5 Code composer studio development.

3.3.2 Code Development Flow
Traditional development flows in the DSP industry have involved validating a C model for correctness on a host PC or UNIX workstation and then painstakingly porting that C code to hand-coded DSP assembly language. Because this is both time-consuming and error-prone, the recommended code development flow instead involves utilizing the C6000 code generation tools to

aid in optimization rather than forcing the programmer to code by hand in assembly. This approach allows the compiler to do all the laborious work of instruction selection, parallelizing, pipelining, and register allocation. The phases of the recommended code development flow are described in Figure 3-6. Phase 3, writing linear assembly code, is adopted only when the compiler cannot achieve efficient software pipelining or cannot resolve unbalanced resource allocation from the C code.

Figure 3-6 Code development flow. (Phase 1: develop C code. Phase 2: refine C code. Phase 3: write linear assembly. In each phase the code is compiled or assembled and then profiled; the flow completes as soon as the result is efficient, and otherwise continues with more C optimization or moves on to the next phase.)

3.3.3 Simulation Tools
From the code development flow in Figure 3-6, we know that profiling is an essential step for analyzing coding efficiency. We use the C64xx CPU cycle accurate simulator to simulate the core of the C64xx processor with cycle accuracy. It is faster than the device cycle accurate simulators but does not simulate the peripherals or the cache system (it uses a flat memory system). In addition, we use another simulator, the C6416 device cycle accurate simulator, to simulate the C64xx XDS510 emulator. It simulates the C6416 processor and supports L1D, L1P, L2 cache, EDMA, QDMA, Interrupt Selector, McBSP(3), Timer(3), TCP, VCP, and EMIF. It also supports interfacing with Async, SDRAM, and Generic sync RAM memory models. Finally, we use the C64xx XDS510 emulator with the hardware board to verify

our project. The TMS320C6416T is connected via the XDS510 emulator, which sets the I/O ports on our DSP platform. In the following sections, the profiling results of all the simulators are presented. Note, however, that the C64xx XDS510 emulator cannot profile CPU cycles.

3.4 Optimization on TI DSP Platform
As Figure 3-6 indicates, the optimization tools increase execution performance. In the following sections, several optimization techniques based on the VelociTI architecture and on software methods are introduced and adopted in this project.

3.4.1 Architecture of TI TMS320C6000 Family
The TMS320C6000 series uses the VelociTI architecture, which is a high-performance, advanced very-long-instruction-word (VLIW) architecture. The architecture contains multiple execution units running in parallel, which allows it to execute multiple instructions in a single clock cycle. This makes it an excellent choice for multi-channel, multi-function, and performance-driven applications. In addition, the C6000 pipeline can dispatch eight parallel instructions every cycle, and parallel instructions proceed simultaneously through the same pipeline phases. This eliminates traditional architectural bottlenecks in program fetch, data access, and multiple operations. More detailed features of this architecture are introduced in [16]. The TMS320C621x, TMS320C671x, and TMS320C64x DSPs of the TMS320C6000 DSP family have a two-level memory architecture for program and data. The first-level program cache is designated L1P, and the first-level data cache is designated L1D. Program and data memory share the second-level memory, designated L2. The L2 is configurable, allowing for various amounts of cache and SRAM. Figure 3-3 shows the block diagram of the C64x DSP. The L1P and L1D provide fast on-chip memory. Accesses by the CPU to these first-level caches can complete without CPU pipeline stalls.
If the data requested by the CPU is not contained in cache, it is fetched from the next lower memory level. However, over the past years the performance of processors has improved at a much faster

pace than that of memory. As a result, there is a performance gap between CPU and memory speed. High-speed memory is available but occupies much more area and is more expensive than slow memory. A hierarchical memory architecture is therefore commonly adopted in embedded systems, as shown in Figure 3-7. A fast but small memory that can be accessed without stalls is placed close to the CPU. The next lower memory levels are increasingly larger but also slower the further away they are from the CPU. Addresses are mapped from a larger memory to a smaller but faster memory higher in the hierarchy. Typically, the higher-level memories are cache memories that are automatically managed by a cache controller. L2 memory is configurable and can be split into L2 SRAM (addressable on-chip memory) and L2 cache for caching external memory locations. The L2 cache is a 4-way set-associative cache whose capacity varies between 32 Kbytes and 256 Kbytes depending on its mode. It services cache misses from both L1P and L1D as well as DMA accesses using the EDMA controller. On a C6416T DSP, for instance, the L1D and L1P are 16 Kbytes each. The size of L2 is 1 Mbyte, and external memory can be several Mbytes large. Although the L2 memory can operate as SRAM, as cache, or as both, the L2 SRAM and L2 cache perform with little difference. For example, a single L1D read miss takes 6 cycles when serviced from L2 SRAM and 8 cycles when serviced from L2 cache. The detailed specifications are described in [17]. In a practical implementation, an image-processing program usually needs a large amount of memory. Although the L2 memory is configured as SRAM after a reset, it is not large enough for all the program instructions and data, so the L1 cache controller fetches most data from external memory, causing many CPU stalls. In order to exploit all of the L2 SRAM, programmers must specify the relevant data in the linker command file and modify the data structure.
This takes extra development time and affects the program structure. Because the L1 cache is not large enough, the L2 cache is a convenient way to decrease CPU stalls. There are two ways to configure the L2 cache on the DSP platform. If the DSP/BIOS is used, the L2 cache is enabled automatically. Otherwise, the L2 cache can be enabled in the program code by issuing the appropriate chip support library (CSL) commands. Additionally, the memory to be used as L2 SRAM has to be specified in the linker command file. Since the L2 cache cannot be used for code or data placement by the linker, all sections must be linked into L2 SRAM or external memory. Further, external memory addresses can be set as cacheable or non-cacheable in the program code. The real effect on DSP platform is going to be

(42) presented in the following chapters.

Figure 3-7 TMS320C64x hierarchical memory.

3.4.2 Compiler-Level Optimization

The compiler, which includes the parser and optimizer, accepts C/C++ source code and produces C6x assembly language source code. Figure 3-8 illustrates the C/C++ compiler. The optimizer can reduce code size and improve execution time when it is enabled through compiler options. There are four optimization levels: register (-o0), local (-o1), function (-o2), and file (-o3).

(43) Figure 3-8 C/C++ compiler.

The register level (-o0) performs control-flow-graph simplification, allocation of variables to registers, loop rotation, elimination of unused code, simplification of expressions and statements, and expansion of calls to functions declared inline. Next, the local level (-o1) performs all -o0 optimizations, plus local copy/constant propagation, removal of unused assignments, and elimination of local common expressions. The function level (-o2) performs all -o1 optimizations, plus software pipelining, loop optimizations, elimination of global common sub-expressions and unused assignments, conversion of array references in loops to incremented pointer form, and loop unrolling. Finally, the highest level, the file level (-o3), performs all -o2 optimizations, plus removal of all functions that are never called, simplification of functions whose return values are never used, inlining of calls, reordering of function declarations, propagation of arguments into function bodies, and identification of file-level variable characteristics. In general, the -o2 or -o3 level is needed to get good performance and code size. These options are also used with the assembly optimizer. Key optimizations such as software pipelining and loop unrolling are enabled by these options.

(44) 3.4.3 Program-Level Optimization

In addition to the optimizations mentioned in the previous sections, there are several other methods to speed up the program. First, the linker command file allocates the data sections to different memories. Data that are accessed frequently should be allocated in a higher and faster memory level such as SRAM or cache. The programmer needs to analyze how frequently the data are accessed to get better performance. Although the L2 cache provides an easy way to access external memory, exploiting the SRAM sometimes gives better performance than using the L2 cache. Besides, the missing cycles are

Second, the C6000 C/C++ compiler supports pragmas such as CODE_SECTION, DATA_SECTION, MUST_ITERATE, and UNROLL. We know that a failed branch prediction costs many cycles. A pragma such as MUST_ITERATE gives the compiler information that helps it choose the best loops and loop transformations, that is, software pipelining and nested loop transformations. There are three ways to unroll a loop. First, the compiler can unroll the loop automatically. Second, the programmer can suggest that the compiler unroll the loop by using these DSP pragmas. Third, the programmer can unroll the code by hand. Hand unrolling sometimes helps the compiler reduce code size, because unrolling by the compiler sometimes generates redundant loops. The detailed specifications are described in [18]. Some of these pragmas are adopted in this project, and the test results are shown in the following sections.

Third, the C64x DSPs are fixed-point processors, so they do not directly support floating-point data types. C64x DSPs can emulate floating-point operations, but this takes many extra clock cycles. Reducing the number of floating-point operations is therefore another way to speed up the system. Table 3-1 shows the data types supported in CCS; note that the long data type is 40 bits wide.
In addition, use the short data type for fixed-point multiplication inputs whenever possible, because this data type makes the most efficient use of the 16-bit multiplier in the C6000. A "short × short" multiplication takes about one cycle, versus five cycles for "int × int". For loop counters, however, use the int or unsigned int data type rather than short or unsigned short, to avoid unnecessary sign-extension instructions.
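The advice above on pragmas, manual unrolling, and data types can be sketched in C as follows. This is a minimal illustration and not code from the OpenJPEG port: the function names `dot_short` and `dot_short_unrolled4` are hypothetical, and the precondition that `n` is a positive multiple of 4 is an assumption chosen so that the loop hints are valid. The pragmas are guarded by `__TI_COMPILER_VERSION__` so that other compilers simply skip them.

```c
#include <assert.h>

/* Hypothetical example: dot product with 16-bit inputs.
 * short operands let the compiler use the 16x16 multiplier;
 * the loop counter is int to avoid sign-extension instructions. */
int dot_short(const short *a, const short *b, int n)
{
    int sum = 0;
    int i;

    /* Assumed precondition; it must hold for the MUST_ITERATE hint to be safe. */
    assert(n >= 4 && n % 4 == 0);

#ifdef __TI_COMPILER_VERSION__
    /* Trip count: at least 4, no stated maximum, always a multiple of 4. */
    #pragma MUST_ITERATE(4, , 4)
    /* Suggest (not force) unrolling by a factor of 4. */
    #pragma UNROLL(4)
#endif
    for (i = 0; i < n; i++) {
        sum += a[i] * b[i];
    }
    return sum;
}

/* The same computation unrolled by hand with four independent accumulators. */
int dot_short_unrolled4(const short *a, const short *b, int n)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;

    assert(n >= 4 && n % 4 == 0);

    for (i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    return s0 + s1 + s2 + s3;
}
```

The four accumulators in the hand-unrolled version remove the dependence between consecutive additions, which gives the scheduler more independent instructions to place in parallel; the price is larger code.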

(45)
Data Type    Size (bits)
char         8
short        16
int          32
float        32
long         40
long long    64
double       64

Table 3-1 Different data types.

Loop unrolling is an efficient way to improve the performance of the compiled code. The compiler tries to reschedule the assembly so that the pipeline is full. In most cases, having more independent instructions improves parallelism and reduces stalls. Compiler-level optimizations and pragmas facilitate loop unrolling. The TMS320C6416T uses the Very-Long-Instruction-Word (VLIW) structure called VelociTI.2, which works efficiently with loop unrolling to produce a near-optimal schedule. Unrolling loops by hand is effective too. Sometimes compiler-level optimizations are restricted by compiler rules, so the loop has to be unrolled by hand. To use the VLIW structure efficiently, we use loop unrolling to fill the functional unit slots. The major shortcoming is that the code size grows with the unrolling factor. In contrast, the C6000 software pipelining mentioned earlier is a technique that reorganizes loops: it interleaves instructions from different iterations without unrolling the loop. Both techniques can be applied together on the platform for H/W and S/W optimization, which reduces loop overhead and execution time.

Finally, the C6000 compiler provides some special functions called intrinsics. These functions map directly to inlined C64x instructions, so the C/C++ code can be optimized quickly. All instructions that are not easily expressed in C/C++ code are supported as intrinsics. The trick is that intrinsics use a single load or store instruction to access multiple data (SIMD). For example, four 8-bit data (char) or two 16-bit data (short) can be combined into one 32-bit word, and then one operation is executed instead of four (char) or two (short) operations. If the SIMD method is employed, the code efficiency is improved substantially.
Figure 3-9 shows an example of the SIMD method. Other intrinsics improve efficiency in a similar way and are described in [19].

(46) [Figure: Single Instruction Multiple Data. Two 32-bit words, each holding two shorts, are added lane by lane: (A1, A2) + (B1, B2) = (A1+B1, A2+B2).]

Figure 3-9 SIMD example for using word access for adding short data.
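The packed addition of Figure 3-9 can be emulated in portable C, as sketched below. This is only an illustration of the idea: the helper names `pack2`, `lane_hi`, `lane_lo`, and `add2_emul` are hypothetical, and the layout (first short in the upper 16 bits) is an assumption. On the C64x, the compiler's `_add2` intrinsic is understood to perform this kind of two-lane 16-bit addition in a single instruction; the emulation spends several operations to show what that instruction computes.

```c
#include <assert.h>
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word:
 * hi goes to bits 31..16, lo goes to bits 15..0 (assumed layout). */
uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Extract the lanes again. Converting an out-of-range value to int16_t
 * is implementation-defined; common compilers wrap modulo 2^16. */
int16_t lane_hi(uint32_t w) { return (int16_t)(uint16_t)(w >> 16); }
int16_t lane_lo(uint32_t w) { return (int16_t)(uint16_t)(w & 0xFFFFu); }

/* Emulated packed add: each 16-bit lane wraps on its own,
 * and no carry propagates from the low lane into the high lane. */
uint32_t add2_emul(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0x0000FFFFu;
    uint32_t hi = ((a >> 16) + (b >> 16)) << 16;
    return hi | lo;
}
```

For example, `add2_emul(pack2(A1, A2), pack2(B1, B2))` yields a word whose lanes are `A1+B1` and `A2+B2`, which is the operation drawn in the figure.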

(47) Chapter 4 Analysis of Embedded Block Coding and Speed-Improving Methods

In this chapter, we introduce the JPEG2000 software environment and its configuration. The JPEG2000 configuration can have an impact on performance. Some features improve the coding performance but cost a lot of memory or computational complexity. We then analyze the JPEG2000 encoder and identify the most complex elements of the JPEG2000 algorithm. The goal of this chapter is to find algorithms that reduce the complexity of the JPEG2000 implementation on the DSP platform. Several speed-up methods are presented and compared with each other.

4.1 Parameters and Software Environment

4.1.1 JasPer and OpenJPEG Reference Software

Part 5 of the JPEG2000 standard [6] provides two standard reference software packages: JasPer and JJ2000. JJ2000 is a Java implementation of ISO/IEC 15444-1 (i.e., JPEG2000 image coding standard Part 1), and JasPer is written in the C programming language for the codec specified in ISO/IEC 15444-1. JasPer is an open-source initiative to provide a free software-based reference implementation of the JPEG2000 codec. All the related documents and software of JasPer can be downloaded from [20]. At the time of writing, the latest available version of the JasPer software is 1.701. We have tested the JasPer reference software, and the results are shown in Table 4-1. The main configuration uses a 64 by 64 code-block size, 5 decomposition levels, and 1 tile. We use the 512 by 512 gray images Goldhill, Barb, Lena, and Baboon, which are shown in Figure 4-1.

(48)
        5-3 filter                              9-7 filter
BPP     Goldhill  Barb      Lena      Baboon    Goldhill  Barb      Lena      Baboon
0.04    25.1      21.9      25.8      20.0      25.3      22.3      26.2      20.1
0.05    25.7      22.4      26.7      20.2      25.9      22.8      27.0      20.3
0.0625  26.2      22.7      27.4      20.4      26.4      23.1      27.9      20.6
0.125   28.1      24.6      30.2      21.3      28.4      25.2      30.9      21.6
0.25    30.1      27.3      33.2      22.8      30.5      28.3      34.1      23.2
0.5     32.7      30.9      36.3      25.1      33.2      32.1      37.3      25.5
1       35.9      35.8      39.3      28.6      36.5      37.2      40.4      29.1
2       40.7      41.3      43.4      34.1      41.9      43.1      44.6      34.8
3       44.8      45.2      47.5      39.0      46.6      46.8      47.1      40.0
4       49.2      49.5      53.7      43.9      46.9      47.0      47.1      45.4
5       66.5      66.5      66.7      48.2      46.9      47.0      47.1      46.6
6       66.5      66.5      66.7      58.3      46.9      47.0      47.1      46.6
7       66.5      66.5      66.7      66.9      46.9      47.0      47.1      46.6
8       66.5      66.5      66.7      66.9      46.9      47.0      47.1      46.6

Table 4-1 PSNR (dB) of different images using JasPer Ver.1.701 encoder.

        5-3 filter                              9-7 filter
BPP     Goldhill  Barb      Lena      Baboon    Goldhill  Barb      Lena      Baboon
0.04    25.3      22.1      26.1      20.1      25.4      22.3      26.5      20.2
0.05    25.8      22.5      26.9      20.3      26.0      22.8      27.2      20.4
0.0625  26.3      22.9      27.5      20.5      26.5      23.4      28.0      20.7
0.125   28.1      24.6      30.2      21.3      28.5      25.4      31.0      21.7
0.25    30.1      27.3      33.1      22.8      30.5      28.4      34.2      23.2
0.5     32.7      30.9      36.3      25.1      33.2      32.3      37.3      25.6
1       35.9      35.8      39.4      28.6      36.6      37.2      40.4      29.1
2       40.9      41.4      43.7      34.2      41.9      43.2      44.9      34.8
3       49.5      49.8      54.0      44.0      49.5      49.3      49.0      45.6
4       49.5      49.8      54.0      44.0      49.5      49.3      49.0      45.6
5       Infinite  Infinite  92.3      Infinite  49.5      49.3      49.0      50.5
6       Infinite  Infinite  92.3      Infinite  49.5      49.3      49.0      50.5
7       Infinite  Infinite  92.3      Infinite  49.5      49.3      49.0      50.5
8       Infinite  Infinite  92.3      Infinite  49.5      49.3      49.0      50.5

Table 4-2 PSNR (dB) of different images using OpenJPEG Ver.1.0 encoder.
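For reading the two tables, it helps to recall how PSNR is usually defined for 8-bit gray images. This section does not state the formula used for the measurements, so the definition below is an assumption (the conventional one), with $x$ the original $M \times N$ image and $\hat{x}$ the decoded image:

```latex
\mathrm{MSE} = \frac{1}{MN}\sum_{i=0}^{M-1}\sum_{j=0}^{N-1}\bigl(x(i,j)-\hat{x}(i,j)\bigr)^{2},
\qquad
\mathrm{PSNR} = 10\log_{10}\frac{255^{2}}{\mathrm{MSE}}\ \text{dB}
```

Under this definition, a reconstruction that is identical to the original has an MSE of zero and therefore an unbounded PSNR, which is consistent with the "Infinite" entries for the reversible 5-3 filter at high bit rates in Table 4-2.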