Chapter 5 Acceleration of JPEG2000 Encoder on DSP Platform

5.2 Experimental Results

In this section, we present the experimental results obtained with the test images described in Section 4.1. The notation “M1” denotes the scheme that uses the SS and GOSS methods described in Section 4.4.2. “M2” denotes the proposed VGOSS method described in Section 5.1.1, and “M3” denotes the modified VGOSS method described in Section 5.1.2. “Ori” denotes the original program. The Tier1 module comprises the Pass1, Pass2, and Pass3 modules. The proposed M2+ and M3+ methods are additionally accelerated with the program-level optimization described in Sections 3.4.3 and 5.1.4. For compiler-level optimization, each experiment uses either the file-level optimization or no optimization at all.

Unless a table states otherwise, each ratio is the cycle count of the given method divided by the cycle count of the original program, expressed as a percentage.
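The ratio definition above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original software; the function name `cycle_ratio_percent` is an assumption made for illustration, and the cycle counts are the Goldhill Tier1 values reported in Table 5-6. The printed table entries are whole percentages, so the computed values agree with them only up to rounding.

```python
def cycle_ratio_percent(method_cycles: int, original_cycles: int) -> float:
    """Cycles of a method as a percentage of the original program's cycles."""
    return 100.0 * method_cycles / original_cycles


# Goldhill, Tier1 module, C64xx simulator without compiler-level optimization.
ORI_TIER1 = 846_100_895
M1_TIER1 = 691_653_125
M3_TIER1 = 610_316_669

m1_ratio = cycle_ratio_percent(M1_TIER1, ORI_TIER1)
m3_ratio = cycle_ratio_percent(M3_TIER1, ORI_TIER1)

# The cycle reduction is simply the complement of the ratio.
m3_reduction = 100.0 - m3_ratio

print(f"M1 ratio: {m1_ratio:.1f}%")
print(f"M3 ratio: {m3_ratio:.1f}%  (reduction {m3_reduction:.1f}%)")
```

A ratio below 100% means the method uses fewer cycles than the original program; a ratio above 100% means it uses more.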

In Table 5-6, we compare the methods using the C64xx simulator. The programs are built without compiler-level optimization. The Tier1 function takes the most cycles in the JPEG2000 encoder. Our proposed methods reduce the Tier1 cycles by close to 30% (Tier1 ratios of 72% to 73%). In other words, the proposed methods perform better even without any hardware or software optimization enabled. The comparison between M2 and M3 is discussed later.

| Image | Method | Pass1 (cycles) | Ratio | Pass2 (cycles) | Ratio | Pass3 (cycles) | Ratio | Tier1 (cycles) | Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Goldhill | Ori | 305,253,023 | N/A | 230,409,367 | N/A | 226,859,815 | N/A | 846,100,895 | N/A |
| Goldhill | M1 | 282,191,607 | 92% | 161,154,959 | 69% | 153,603,555 | 67% | 691,653,125 | 81% |
| Goldhill | M2 | 293,537,043 | 96% | 147,094,303 | 64% | 140,270,753 | 62% | 617,477,074 | 73% |
| Goldhill | M3 | 268,073,463 | 87% | 146,687,443 | 63% | 154,029,099 | 67% | 610,316,669 | 72% |
| Barb | Ori | 298,583,742 | N/A | 242,601,594 | N/A | 229,189,368 | N/A | 854,930,539 | N/A |
| Barb | M1 | 271,438,557 | 91% | 171,781,735 | 71% | 155,859,497 | 68% | 686,567,790 | 80% |
| Barb | M2 | 285,719,070 | 96% | 157,152,430 | 65% | 142,649,768 | 62% | 622,141,054 | 73% |
| Barb | M3 | 260,057,725 | 87% | 156,718,367 | 65% | 156,544,656 | 68% | 614,892,043 | 72% |
| Lena | Ori | 290,392,605 | N/A | 199,107,794 | N/A | 221,486,542 | N/A | 793,191,469 | N/A |
| Lena | M1 | 266,220,518 | 92% | 126,029,839 | 63% | 160,976,617 | 73% | 638,251,568 | 80% |
| Lena | M2 | 278,136,726 | 96% | 111,841,674 | 56% | 148,145,016 | 67% | 574,660,595 | 72% |
| Lena | M3 | 253,699,774 | 87% | 111,530,410 | 56% | 162,578,666 | 73% | 569,297,918 | 72% |
| Baboon | Ori | 340,504,870 | N/A | 329,348,990 | N/A | 251,280,448 | N/A | 1,011,860,852 | N/A |
| Baboon | M1 | 307,918,629 | 90% | 260,281,541 | 79% | 148,146,791 | 59% | 810,277,071 | 80% |
| Baboon | M2 | 325,038,946 | 95% | 242,780,578 | 74% | 135,044,067 | 54% | 739,661,816 | 73% |
| Baboon | M3 | 295,676,364 | 87% | 242,114,405 | 74% | 148,261,575 | 59% | 727,801,158 | 72% |

Table 5-6 Comparison using C64xx simulator without compiler-level optimization

Next, we enable the compiler-level optimization to accelerate the encoder. The exact cycle counts are shown in Table 5-7; they are roughly half of their counterparts in Table 5-6, which means the encoder runs about twice as fast. The original encoder takes about 0.8 sec to encode a 512 by 512 image on the C64xx simulator. The running time is estimated for a 1 GHz DSP processor. The encoding cycles are thus

method has a reduction of up to 37% in the Tier1 module. The proposed M2 method performs better than the proposed M3 method, and the additional program-level optimization helps M2 reach about a 46% reduction in computation cycles, as the M2+ results show. The M2+ results are also slightly better than the M3+ results.

| Image | Method | Pass1 (cycles) | Ratio | Pass2 (cycles) | Ratio | Pass3 (cycles) | Ratio | Tier1 (cycles) | Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Goldhill | Ori | 161,696,792 | N/A | 123,912,376 | N/A | 132,353,060 | N/A | 435,017,649 | N/A |
| Goldhill | M1 | 127,578,760 | 78% | 81,792,398 | 66% | 83,544,436 | 63% | 310,108,162 | 71% |
| Goldhill | M2 | 125,231,272 | 77% | 65,696,264 | 53% | 71,670,151 | 54% | 278,049,520 | 63% |
| Goldhill | M2+ | 111,479,250 | 69% | 52,262,769 | 42% | 64,940,416 | 49% | 236,919,268 | 54% |
| Goldhill | M3 | 134,596,715 | 83% | 64,882,190 | 52% | 72,759,624 | 54% | 287,959,889 | 66% |
| Goldhill | M3+ | 120,941,561 | 75% | 51,041,835 | 41% | 68,490,347 | 52% | 246,919,263 | 57% |
| Barb | Ori | 158,478,206 | N/A | 129,282,989 | N/A | 133,847,824 | N/A | 438,746,356 | N/A |
| Barb | M1 | 122,166,076 | 77% | 87,196,372 | 67% | 85,158,426 | 64% | 311,697,932 | 71% |
| Barb | M2 | 120,764,739 | 76% | 70,177,352 | 54% | 72,977,347 | 55% | 279,411,207 | 64% |
| Barb | M2+ | 107,855,946 | 68% | 55,856,088 | 43% | 66,049,340 | 49% | 238,014,402 | 54% |
| Barb | M3 | 130,671,579 | 82% | 69,308,863 | 54% | 74,122,298 | 55% | 289,863,349 | 66% |
| Barb | M3+ | 117,796,112 | 74% | 54,551,358 | 42% | 75,907,089 | 57% | 248,421,913 | 57% |
| Lena | Ori | 153,954,906 | N/A | 108,237,140 | N/A | 129,077,570 | N/A | 408,233,549 | N/A |
| Lena | M1 | 120,130,825 | 78% | 64,744,920 | 60% | 88,197,594 | 68% | 290,075,454 | 71% |
| Lena | M2 | 118,381,074 | 77% | 49,993,868 | 46% | 75,995,528 | 59% | 259,792,581 | 64% |
| Lena | M2+ | 105,516,600 | 69% | 39,695,355 | 37% | 68,746,351 | 53% | 222,169,225 | 54% |
| Lena | M3 | 126,991,911 | 82% | 49,370,996 | 46% | 77,238,351 | 60% | 269,292,266 | 66% |
| Lena | M3+ | 114,192,081 | 74% | 38,759,155 | 36% | 78,765,768 | 61% | 231,705,425 | 57% |
| Baboon | Ori | 179,696,714 | N/A | 171,827,724 | N/A | 147,598,193 | N/A | 516,672,503 | N/A |
| Baboon | M1 | 138,079,206 | 77% | 130,939,612 | 76% | 80,407,186 | 54% | 367,019,323 | 71% |
| Baboon | M2 | 136,630,164 | 76% | 108,313,813 | 63% | 68,628,442 | 46% | 329,208,921 | 64% |
| Baboon | M2+ | 122,000,811 | 68% | 86,310,345 | 50% | 62,199,750 | 42% | 278,849,859 | 54% |
| Baboon | M3 | 149,193,658 | 83% | 106,981,058 | 62% | 69,600,789 | 47% | 341,680,709 | 66% |
| Baboon | M3+ | 134,442,629 | 75% | 84,308,963 | 49% | 78,911,706 | 53% | 291,039,854 | 56% |

Table 5-7 Comparison using C64xx simulator with file level optimization

To analyze the actual performance on the DSP emulator, we profile the encoder on the C6416 simulator. First, we compare all the methods on the C6416 simulator without any compiler-level or program-level optimization. The results are presented in Table 5-8. All methods show little improvement on the C6416 simulator. The Pass1 process checks all samples in the code-block, whereas the other processes encode the NBC samples only. Unlike in Table 5-6, the Pass1 process does not appear to gain any acceleration here.
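The division of work described above, where Pass1 inspects every sample while the later passes touch only the NBC samples, can be pictured with a small sketch. This is a deliberately simplified illustration and not the VGOSS implementation: the function names, the `needs_coding` predicate (here simply "sample is nonzero"), and the stand-in `encode` function are all hypothetical and do not reflect the actual EBCOT pass rules.

```python
from typing import Callable, List, Sequence


def pass1_collect_nbc(
    samples: Sequence[int], needs_coding: Callable[[int], bool]
) -> List[int]:
    """Scan every sample once and record the positions that still need coding."""
    return [i for i, s in enumerate(samples) if needs_coding(s)]


def later_pass(
    samples: Sequence[int],
    nbc_positions: Sequence[int],
    encode: Callable[[int], int],
) -> List[int]:
    """Visit only the recorded positions instead of re-checking every sample."""
    return [encode(samples[i]) for i in nbc_positions]


# Hypothetical code-block row: zeros stand for samples that need no coding.
samples = [0, 5, 0, 0, -3, 0, 7, 0]

nbc = pass1_collect_nbc(samples, lambda s: s != 0)  # placeholder predicate
coded = later_pass(samples, nbc, abs)               # placeholder "encoder"

print(nbc)    # positions visited by the later passes
print(coded)  # results of the placeholder encoder
```

In this toy setting the first scan still costs one check per sample, which is consistent with Pass1 benefiting least, while the later passes perform work only in proportion to the number of recorded positions.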

| Image | Method | Pass1 (cycles) | Ratio | Pass2 (cycles) | Ratio | Pass3 (cycles) | Ratio | Tier1 (cycles) | Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Goldhill | Ori | 3,017,620,410 | N/A | 2,042,494,971 | N/A | 1,878,381,119 | N/A | 7,509,481,733 | N/A |
| Goldhill | M1 | 3,025,667,367 | 100% | 1,741,809,460 | 85% | 1,421,319,656 | 75% | 6,770,907,062 | 90% |
| Goldhill | M2 | 3,136,395,302 | 103% | 1,705,536,220 | 83% | 1,401,228,658 | 74% | 6,559,372,183 | 87% |
| Goldhill | M3 | 3,109,291,411 | 103% | 1,705,523,470 | 83% | 1,413,508,168 | 75% | 6,549,455,756 | 87% |
| Barb | Ori | 2,896,173,480 | N/A | 2,158,293,928 | N/A | 1,893,826,534 | N/A | 7,524,305,067 | N/A |
| Barb | M1 | 2,878,730,141 | 99% | 1,876,066,087 | 87% | 1,452,988,453 | 77% | 6,786,729,765 | 90% |
| Barb | M2 | 2,997,467,962 | 103% | 1,822,898,565 | 84% | 1,418,383,687 | 75% | 6,555,043,830 | 87% |
| Barb | M3 | 2,970,457,703 | 103% | 1,822,891,700 | 84% | 1,430,763,867 | 76% | 6,545,326,635 | 87% |
| Lena | Ori | 2,836,766,126 | N/A | 1,664,108,463 | N/A | 1,858,239,288 | N/A | 6,922,931,525 | N/A |
| Lena | M1 | 2,839,909,398 | 100% | 1,353,038,108 | 81% | 1,494,062,471 | 80% | 6,253,649,422 | 90% |
| Lena | M2 | 2,954,013,613 | 104% | 1,295,049,729 | 78% | 1,461,410,941 | 79% | 6,026,615,579 | 87% |
| Lena | M3 | 2,928,402,594 | 103% | 1,295,042,543 | 78% | 1,474,297,106 | 79% | 6,018,804,632 | 87% |
| Baboon | Ori | 3,307,931,711 | N/A | 3,113,296,427 | N/A | 2,018,577,407 | N/A | 9,047,999,627 | N/A |
| Baboon | M1 | 3,264,497,611 | 99% | 2,875,062,110 | 92% | 1,401,263,282 | 69% | 8,152,221,728 | 90% |
| Baboon | M2 | 3,395,751,005 | 103% | 2,819,811,057 | 91% | 1,365,494,753 | 68% | 7,897,700,868 | 87% |
| Baboon | M3 | 3,363,909,744 | 102% | 2,819,796,302 | 91% | 1,377,280,810 | 68% | 7,882,548,295 | 87% |

Table 5-8 Comparison using C6416 simulator without compiler-level optimization

Then we use the file level optimization which has been described in section 3.4.2 and the

memory system as discussed in section 4.3.1. So far, the memory system dominates the performance on the C6416 simulator. Although the C64xx simulator results in Table 5-7 are very good, the final judgment of performance rests on the real-time system, i.e., the C6416 emulator. Because the C6416 emulator cannot profile the cycle counts of the encoder on the board, we assume that the results of the C6416 simulator approximately represent those of the C6416 emulator.

| Image | Method | Pass1 (cycles) | Ratio | Pass2 (cycles) | Ratio | Pass3 (cycles) | Ratio | Tier1 (cycles) | Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Goldhill | Ori | 2,112,041,634 | N/A | 1,378,876,280 | N/A | 1,247,956,850 | N/A | 5,249,595,284 | N/A |
| Goldhill | M1 | 1,909,390,616 | 90% | 1,170,834,214 | 84% | 1,017,962,461 | 81% | 4,609,052,664 | 87% |
| Goldhill | M2 | 2,117,760,455 | 100% | 1,219,529,873 | 88% | 1,008,254,857 | 80% | 4,613,282,518 | 87% |
| Goldhill | M3 | 2,107,525,133 | 99% | 1,220,058,177 | 88% | 999,638,429 | 80% | 4,595,193,586 | 87% |
| Barb | Ori | 2,039,284,465 | N/A | 1,452,630,437 | N/A | 1,262,285,138 | N/A | 5,269,580,191 | N/A |
| Barb | M1 | 1,823,491,911 | 89% | 1,265,470,505 | 87% | 1,039,524,522 | 82% | 4,643,920,352 | 88% |
| Barb | M2 | 2,024,787,896 | 99% | 1,301,723,052 | 90% | 1,023,242,623 | 81% | 4,617,566,683 | 88% |
| Barb | M3 | 2,016,239,162 | 99% | 1,302,308,385 | 90% | 1,014,583,518 | 80% | 4,601,177,484 | 87% |
| Lena | Ori | 1,988,997,365 | N/A | 1,138,002,909 | N/A | 1,247,101,515 | N/A | 4,878,183,463 | N/A |
| Lena | M1 | 1,788,189,759 | 90% | 922,499,516 | 81% | 1,069,658,356 | 86% | 4,284,479,896 | 88% |
| Lena | M2 | 1,985,936,734 | 100% | 929,742,870 | 82% | 1,054,472,108 | 85% | 4,237,821,331 | 87% |
| Lena | M3 | 1,976,123,575 | 99% | 930,163,782 | 82% | 1,045,640,767 | 84% | 4,219,831,865 | 87% |
| Baboon | Ori | 2,319,243,443 | N/A | 2,064,641,254 | N/A | 1,324,106,786 | N/A | 6,253,178,570 | N/A |
| Baboon | M1 | 2,073,048,478 | 89% | 1,925,065,285 | 93% | 996,385,327 | 75% | 5,539,745,533 | 89% |
| Baboon | M2 | 2,297,572,925 | 99% | 2,005,493,307 | 97% | 979,527,354 | 74% | 5,550,726,038 | 89% |
| Baboon | M3 | 2,289,349,786 | 99% | 2,006,364,071 | 97% | 970,994,914 | 73% | 5,535,075,645 | 89% |

Table 5-9 Comparison using C6416 simulator with file level optimization

Enabling the L2 cache system improves the total encoding time on the C6416 emulator. Comparing Table 5-8 with Table 5-10, the encoder runs about eight times faster with the L2 cache system than without it. As for the speed-up methods, M1 achieves only about a 20% reduction in computation cycles in the Tier1 module, whereas our proposed methods achieve close to 30% in the same module.

| Image | Method | Pass1 (cycles) | Ratio | Pass2 (cycles) | Ratio | Pass3 (cycles) | Ratio | Tier1 (cycles) | Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Goldhill | Ori | 308,284,632 | N/A | 251,097,478 | N/A | 232,841,458 | N/A | 878,528,789 | N/A |
| Goldhill | M1 | 283,514,595 | 91% | 164,102,018 | 65% | 170,341,982 | 73% | 715,481,512 | 81% |
| Goldhill | M2 | 310,759,475 | 100% | 150,799,659 | 60% | 142,705,474 | 61% | 643,385,635 | 73% |
| Goldhill | M3 | 286,373,249 | 92% | 151,103,318 | 60% | 156,405,272 | 67% | 637,677,407 | 72% |
| Barb | Ori | 301,567,183 | N/A | 262,931,785 | N/A | 235,251,179 | N/A | 887,081,291 | N/A |
| Barb | M1 | 272,633,925 | 90% | 174,226,431 | 66% | 172,958,920 | 74% | 710,121,902 | 80% |
| Barb | M2 | 301,927,080 | 100% | 161,098,804 | 61% | 145,148,671 | 62% | 647,343,159 | 73% |
| Barb | M3 | 277,311,304 | 92% | 161,388,576 | 61% | 158,987,469 | 68% | 641,531,078 | 72% |
| Lena | Ori | 293,270,703 | N/A | 213,746,035 | N/A | 227,449,758 | N/A | 819,425,979 | N/A |
| Lena | M1 | 267,314,454 | 91% | 127,929,966 | 60% | 178,827,591 | 79% | 661,884,222 | 81% |
| Lena | M2 | 293,726,038 | 100% | 114,665,949 | 54% | 150,681,664 | 66% | 598,149,363 | 73% |
| Lena | M3 | 270,410,167 | 92% | 114,862,234 | 54% | 165,059,091 | 73% | 594,083,833 | 72% |
| Baboon | Ori | 344,015,246 | N/A | 360,498,318 | N/A | 257,274,679 | N/A | 1,055,301,462 | N/A |
| Baboon | M1 | 309,308,487 | 90% | 263,822,006 | 73% | 164,094,852 | 64% | 834,062,055 | 79% |
| Baboon | M2 | 344,603,091 | 100% | 248,835,280 | 69% | 137,393,177 | 53% | 770,203,559 | 73% |
| Baboon | M3 | 316,054,573 | 92% | 249,336,149 | 69% | 150,554,300 | 59% | 759,984,175 | 72% |

Table 5-10 Comparison by C6416 simulator using L2 cache without compiler-level optimization

Finally, we enable the file-level optimization; all results are shown in Table 5-11. Our proposed methods, M2 and M3, reduce the computation cycles by up to 35%. With the additional program-level optimization, M2+ and M3+ reduce the computation cycles in the Tier1 module by up to 45%. These results are close to those of Table 5-7, which were generated by the C64xx simulator (using the flat memory system).

| Image | Method | Pass1 (cycles) | Ratio | Pass2 (cycles) | Ratio | Pass3 (cycles) | Ratio | Tier1 (cycles) | Ratio |
|---|---|---|---|---|---|---|---|---|---|
| Goldhill | Ori | 163,080,554 | N/A | 124,821,177 | N/A | 133,472,282 | N/A | 441,174,161 | N/A |
| Goldhill | M1 | 128,930,478 | 79% | 83,076,184 | 66% | 101,499,569 | 76% | 333,581,675 | 75% |
| Goldhill | M2 | 128,567,039 | 78% | 67,302,409 | 53% | 73,328,088 | 54% | 287,411,966 | 65% |
| Goldhill | M2+ | 112,848,128 | 69% | 53,775,561 | 43% | 66,497,047 | 49% | 243,772,802 | 55% |
| Goldhill | M3 | 135,518,273 | 83% | 66,348,752 | 53% | 77,189,981 | 57% | 297,154,656 | 67% |
| Goldhill | M3+ | 122,246,007 | 75% | 52,620,095 | 42% | 67,810,545 | 51% | 253,842,164 | 58% |
| Barb | Ori | 159,898,842 | N/A | 130,222,929 | N/A | 135,022,022 | N/A | 445,058,085 | N/A |
| Barb | M1 | 125,607,586 | 79% | 88,384,942 | 68% | 100,587,687 | 74% | 334,635,666 | 75% |
| Barb | M2 | 123,971,510 | 78% | 71,805,581 | 55% | 74,693,375 | 55% | 288,726,161 | 65% |
| Barb | M2+ | 109,214,696 | 68% | 57,459,450 | 44% | 67,696,874 | 50% | 245,428,000 | 55% |
| Barb | M3 | 131,593,386 | 82% | 70,838,058 | 54% | 78,625,116 | 58% | 299,196,088 | 67% |
| Barb | M3+ | 119,108,218 | 74% | 56,192,138 | 43% | 69,091,530 | 51% | 255,512,543 | 57% |
| Lena | Ori | 155,250,384 | N/A | 109,054,998 | N/A | 130,199,119 | N/A | 414,230,692 | N/A |
| Lena | M1 | 123,411,310 | 79% | 65,712,808 | 60% | 104,118,479 | 80% | 313,098,474 | 76% |
| Lena | M2 | 121,457,646 | 78% | 51,269,739 | 47% | 77,789,117 | 60% | 268,694,555 | 65% |
| Lena | M2+ | 106,797,331 | 69% | 40,955,168 | 38% | 70,440,728 | 54% | 229,200,081 | 55% |
| Lena | M3 | 127,855,746 | 82% | 50,567,223 | 46% | 81,937,980 | 63% | 278,421,068 | 67% |
| Lena | M3+ | 115,431,429 | 74% | 40,045,260 | 37% | 71,836,295 | 55% | 238,400,634 | 58% |
| Baboon | Ori | 181,379,029 | N/A | 172,997,973 | N/A | 148,789,097 | N/A | 523,492,124 | N/A |
| Baboon | M1 | 141,906,202 | 78% | 132,573,813 | 77% | 95,167,853 | 64% | 390,208,533 | 75% |
| Baboon | M2 | 140,240,468 | 77% | 110,727,881 | 64% | 70,188,995 | 47% | 339,579,061 | 65% |
| Baboon | M2+ | 123,588,182 | 68% | 88,581,223 | 51% | 63,678,663 | 43% | 287,012,673 | 55% |
| Baboon | M3 | 150,294,582 | 83% | 109,189,994 | 63% | 73,777,013 | 50% | 351,565,266 | 67% |
| Baboon | M3+ | 135,935,289 | 75% | 86,669,837 | 50% | 64,985,784 | 44% | 298,879,302 | 57% |

Table 5-11 Comparison using C6416 simulator with L2 cache and file level optimization

We now compare the proposed M2 and M3 methods. The M2 method records the offset, whereas the M3 method records the address directly without counting the offset, so M3 would be expected to perform better than M2. However, once file-level optimization is enabled (Tables 5-7 and 5-11), the results show that M2 is more efficient than M3. The difference comes mainly from the Pass1 process, where M3 takes more cycles than M2 in those tables (Pass1 ratios of 82% to 83% for M3 against 76% to 78% for M2). According to the assembly code, the M2 method uses more counters to compute the offset, but those counters appear to allow better parallelism.

We now summarize the best performance in Table 5-12 and Table 5-13. When the memory-system bottleneck is ignored, the results with file-level optimization on the C64xx simulator (Ori+) are 1.9 times faster than the original program (Ori), as shown in Table 5-12. Our best solution (M2+) is up to 3.6 times faster than the original program. On the C6416 simulator, Ori+ is measured with the L2 cache and file-level optimization, and the proposed method (M2+) runs the Tier1 module nearly 2 times faster than Ori+. Relative to Ori on the C6416 simulator, M2+ is 30 to 32 times faster, as shown in Table 5-13.
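The speed-up multiples quoted above, and their relation to the percentage reductions used elsewhere in this chapter, follow from simple arithmetic. The sketch below is illustrative only; the helper names `speedup` and `reduction_percent` are assumptions, and the cycle counts are the Goldhill Tier1 values on the C64xx simulator taken from Tables 5-6 and 5-7.

```python
def speedup(original_cycles: int, method_cycles: int) -> float:
    """How many times faster the method is than the original program."""
    return original_cycles / method_cycles


def reduction_percent(original_cycles: int, method_cycles: int) -> float:
    """Share of the original cycles that the method saves, in percent."""
    return 100.0 * (1.0 - method_cycles / original_cycles)


# Goldhill, Tier1 module, C64xx simulator.
ORI = 846_100_895       # original program, no compiler-level optimization
ORI_PLUS = 435_017_649  # original program with file-level optimization
M2_PLUS = 236_919_268   # M2 with file-level and program-level optimization

ori_plus_speedup = speedup(ORI, ORI_PLUS)
m2_plus_speedup = speedup(ORI, M2_PLUS)
m2_plus_reduction = reduction_percent(ORI, M2_PLUS)

print(f"Ori+ speed-up: {ori_plus_speedup:.1f}x")
print(f"M2+  speed-up: {m2_plus_speedup:.1f}x")
print(f"M2+  reduction vs Ori: {m2_plus_reduction:.0f}%")
```

A speed-up of s corresponds to a reduction of 1 - 1/s, so a multiple near 3.6 and a reduction near 72% describe the same measurement.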

| Image | Method | Pass1 (cycles) | Speed-up | Pass2 (cycles) | Speed-up | Pass3 (cycles) | Speed-up | Tier1 (cycles) | Speed-up |
|---|---|---|---|---|---|---|---|---|---|
| Goldhill | Ori | 305,253,023 | N/A | 230,409,367 | N/A | 226,859,815 | N/A | 846,100,895 | N/A |
| Goldhill | Ori+ | 161,696,792 | 1.9x | 123,912,376 | 1.9x | 132,353,060 | 1.7x | 435,017,649 | 1.9x |
| Goldhill | M2+ | 111,479,250 | 2.7x | 52,262,769 | 4.4x | 64,940,416 | 3.5x | 236,919,268 | 3.6x |
| Barb | Ori | 298,583,742 | N/A | 242,601,594 | N/A | 229,189,368 | N/A | 854,930,539 | N/A |
| Barb | Ori+ | 158,478,206 | 1.9x | 129,282,989 | 1.9x | 133,847,824 | 1.7x | 438,746,356 | 1.9x |
| Barb | M2+ | 107,855,946 | 2.8x | 55,856,088 | 4.3x | 66,049,340 | 3.5x | 238,014,402 | 3.6x |
| Lena | Ori | 290,392,605 | N/A | 199,107,794 | N/A | 221,486,542 | N/A | 793,191,469 | N/A |
| Lena | Ori+ | 153,954,906 | 1.9x | 108,237,140 | 1.8x | 129,077,570 | 1.7x | 408,233,549 | 1.9x |
| Lena | M2+ | 105,516,600 | 2.8x | 39,695,355 | 5.0x | 68,746,351 | 3.2x | 222,169,225 | 3.6x |
| Baboon | Ori | 340,504,870 | N/A | 329,348,990 | N/A | 251,280,448 | N/A | 1,011,860,852 | N/A |
| Baboon | Ori+ | 179,696,714 | 1.9x | 171,827,724 | 1.9x | 147,598,193 | 1.7x | 516,672,503 | 2.0x |
| Baboon | M2+ | 122,000,811 | 2.8x | 86,310,345 | 3.8x | 62,199,750 | 4.0x | 278,849,859 | 3.6x |

Table 5-12 Comparison using C64xx simulator (Best solution) with file level optimization

| Image | Method | Pass1 (cycles) | Speed-up | Pass2 (cycles) | Speed-up | Pass3 (cycles) | Speed-up | Tier1 (cycles) | Speed-up |
|---|---|---|---|---|---|---|---|---|---|
| Goldhill | Ori | 3,017,620,410 | N/A | 2,042,494,971 | N/A | 1,878,381,119 | N/A | 7,509,481,733 | N/A |
| Goldhill | Ori+ | 163,080,554 | 19x | 124,821,177 | 16x | 133,472,282 | 14x | 441,174,161 | 17x |
| Goldhill | M2+ | 112,848,128 | 27x | 53,775,561 | 38x | 66,497,047 | 28x | 243,772,802 | 31x |
| Barb | Ori | 2,896,173,480 | N/A | 2,158,293,928 | N/A | 1,893,826,534 | N/A | 7,524,305,067 | N/A |
| Barb | Ori+ | 159,898,842 | 18x | 130,222,929 | 17x | 135,022,022 | 14x | 445,058,085 | 17x |
| Barb | M2+ | 109,214,696 | 27x | 57,459,450 | 38x | 67,696,874 | 28x | 245,428,000 | 31x |
| Lena | Ori | 2,836,766,126 | N/A | 1,664,108,463 | N/A | 1,858,239,288 | N/A | 6,922,931,525 | N/A |
| Lena | Ori+ | 155,250,384 | 18x | 109,054,998 | 15x | 130,199,119 | 14x | 414,230,692 | 17x |
| Lena | M2+ | 106,797,331 | 27x | 40,955,168 | 41x | 70,440,728 | 26x | 229,200,081 | 30x |
| Baboon | Ori | 3,307,931,711 | N/A | 3,113,296,427 | N/A | 2,018,577,407 | N/A | 9,047,999,627 | N/A |
| Baboon | Ori+ | 181,379,029 | 18x | 172,997,973 | 18x | 148,789,097 | 14x | 523,492,124 | 17x |
| Baboon | M2+ | 123,588,182 | 27x | 88,581,223 | 35x | 63,678,663 | 32x | 287,012,673 | 32x |

Table 5-13 Comparison using C6416 simulator (Best solution) with file level optimization

Afterward, we measure the real-time execution on the C6416 emulator, i.e., the hardware platform. We use the timer on the DSP platform to measure the execution time. We compare ‘Ori’, ‘M1’, ‘M2+’, and ‘M3+’; all results are shown in Table 5-14. The ‘M2+’ result achieves about a 45% reduction in computation time, similar to the result on the C6416 simulator. The reduction ratios in Table 5-11 and Table 5-14 are consistent. The results on the C64xx simulator are also close to those on the hardware platform and on the C6416 simulator.

Finally, we compare the execution time on the C6416 simulator with that on the C6416 emulator (i.e., the hardware platform) in Table 5-15. We convert the cycle counts of the C6416 simulator to seconds by dividing by 10⁹ (1 GHz DSP). For the original program, execution on the C6416 emulator is slower than on the C6416 simulator: the simulator time is 86% to 89% of the emulator time, a gap of about 11% to 14%. When the L2 cache is adopted, the execution time on the hardware platform (C6416 emulator) is similar to that on the C6416 simulator. In summary, our proposed method achieves better performance and can be implemented on the DSP platform in an efficient and simple manner. The experimental results on the different simulators and on the emulator are consistent, which indicates that our proposed method can reduce the execution time by nearly half on the real system.
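The conversion from simulator cycles to seconds and the simulator-to-emulator ratio can be reproduced as follows. This is a minimal sketch under the stated 1 GHz clock assumption; the names `CLOCK_HZ`, `cycles_to_seconds`, and `sim_to_emu_percent` are introduced here for illustration, and the inputs are the Goldhill values from Table 5-15.

```python
CLOCK_HZ = 1_000_000_000  # 1 GHz DSP clock, as assumed in the text


def cycles_to_seconds(cycles: int, clock_hz: int = CLOCK_HZ) -> float:
    """Convert a simulator cycle count to seconds at the given clock rate."""
    return cycles / clock_hz


def sim_to_emu_percent(sim_cycles: int, emu_seconds: float) -> float:
    """Simulator time as a percentage of the measured emulator time."""
    return 100.0 * cycles_to_seconds(sim_cycles) / emu_seconds


# Goldhill, Tier1 module: (simulator cycles, emulator seconds).
ori_ratio = sim_to_emu_percent(7_509_481_733, 8.741983)
m2_plus_ratio = sim_to_emu_percent(243_772_802, 0.250016)

print(f"Ori: {cycles_to_seconds(7_509_481_733):.6f} s simulated, "
      f"{ori_ratio:.1f}% of emulator time")
print(f"M2+: {cycles_to_seconds(243_772_802):.6f} s simulated, "
      f"{m2_plus_ratio:.1f}% of emulator time")
```

A ratio close to 100% means the simulator predicts the hardware time well; the lower ratio for the original program reflects the larger gap observed without the L2 cache.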

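| Image | Method | No L2 cache, no file-level optimization (sec) | Ratio | L2 cache, no file-level optimization (sec) | Ratio | L2 cache with file-level optimization (sec) | Ratio |
|---|---|---|---|---|---|---|---|
| Goldhill | Ori | 8.741983 | N/A | 0.898467 | N/A | 0.447415 | N/A |
| Goldhill | M1 | 7.666878 | 88% | 0.717021 | 80% | 0.346838 | 78% |
| Goldhill | M2+ | 7.646112 | 87% | 0.579569 | 65% | 0.250016 | 56% |
| Goldhill | M3+ | 7.643923 | 87% | 0.580308 | 65% | 0.259049 | 58% |
| Barb | Ori | 8.752465 | N/A | 0.904697 | N/A | 0.451394 | N/A |
| Barb | M1 | 7.654667 | 87% | 0.719655 | 80% | 0.347562 | 77% |
| Barb | M2+ | 7.711458 | 88% | 0.584394 | 65% | 0.251254 | 56% |
| Barb | M3+ | 7.701879 | 88% | 0.584885 | 65% | 0.260636 | 58% |
| Lena | Ori | 7.786148 | N/A | 0.835078 | N/A | 0.420186 | N/A |
| Lena | M1 | 7.052966 | 91% | 0.670789 | 80% | 0.325005 | 77% |
| Lena | M2+ | 7.067648 | 91% | 0.54088 | 65% | 0.23467 | 56% |
| Lena | M3+ | 7.056214 | 91% | 0.542619 | 65% | 0.2434 | 58% |
| Baboon | Ori | 10.2021 | N/A | 1.077518 | N/A | 0.53121 | N/A |
| Baboon | M1 | 9.19398 | 90% | 0.84524 | 78% | 0.405664 | 76% |
| Baboon | M2+ | 9.306895 | 91% | 0.69134 | 64% | 0.294029 | 55% |
| Baboon | M3+ | 9.328353 | 91% | 0.689299 | 64% | 0.304899 | 57% |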

Table 5-14 Comparison of the executing time on the C6416 emulator

| Image | Method | C6416 simulator (cycles) | Conversion (sec) | C6416 emulator (sec) | Ratio (simulator/emulator) |
|---|---|---|---|---|---|
| Goldhill | Ori | 7,509,481,733 | 7.509482 | 8.741983 | 86% |
| Goldhill | Ori+ | 441,174,161 | 0.441174 | 0.447415 | 99% |
| Goldhill | M2+ | 243,772,802 | 0.243773 | 0.250016 | 98% |
| Barb | Ori | 7,524,305,067 | 7.524305 | 8.752465 | 86% |
| Barb | Ori+ | 445,058,085 | 0.445058 | 0.451394 | 99% |
| Barb | M2+ | 245,428,000 | 0.245428 | 0.251254 | 98% |
| Lena | Ori | 6,922,931,525 | 6.922932 | 7.786148 | 89% |
| Lena | Ori+ | 414,230,692 | 0.414231 | 0.420186 | 99% |
| Lena | M2+ | 229,200,081 | 0.2292 | 0.23467 | 98% |
| Baboon | Ori | 9,047,999,627 | 9.048 | 10.2021 | 89% |
| Baboon | Ori+ | 523,492,124 | 0.523492 | 0.53121 | 99% |
| Baboon | M2+ | 287,012,673 | 0.287013 | 0.294029 | 98% |

Table 5-15 Comparison between the C6416 simulator and the C6416 emulator (The executing time of the Tier1 module)

Chapter 6 Conclusions and Future Work

6.1 Conclusion

The main goal of this thesis is to accelerate the JPEG2000 encoder and implement it on the TI C6416T DSP platform. We have presented three known methods and proposed two improved methods to accelerate the Tier1 module, which is the major part of the JPEG2000 encoder. We first presented the previous speed-up methods and discussed their advantages and drawbacks. The advantages of these methods may not carry over to a sequential processing environment. Therefore, we proposed the VGOSS and the modified VGOSS methods. The code is also modified to allow the program-level optimizations discussed in Sections 3.4.3 and 5.1.4.

The proposed VGOSS method is built on the re-ordered code-block samples and extends the GOSS and PP methods. It encodes only the NBC (need-to-be-coded) samples and thereby reduces the checking cycles in the original three pass processes. It is simple to implement and performs well on the DSP platform. It runs about 3.6 times faster than the original software on the C64xx simulator, which uses the flat memory system. This corresponds to a reduction in computation cycles of up to 72%. However, the real DSP system may spend extra cycles on accessing external memory. This effect is profiled with the C6416 simulator explained in Section 3.3.3. Our best performance there is up to 32 times faster than the original program without DSP optimization. Relative to the original code built with the DSP compiler-level optimization, the reduction is about 45%. The improvement is due to two factors: (1) our proposed VGOSS method and (2) the program-level optimizations. Finally, we compare the results between the simulator and the emulator (i.e., the hardware platform). The modified VGOSS method has about the same execution performance as VGOSS method on

quite significantly. However, the JPEG2000 algorithm is still complicated to implement in hardware. The memory bottleneck remains a challenge for embedded systems. Because our DSP platform supports the L2 cache memory system, we obtain a large improvement by using this feature. There is still room to further accelerate the JPEG2000 algorithm on different types of embedded systems.

6.2 Future Work

Block coding accounts for the major part of the computation cycles in the JPEG2000 encoding process. The experimental results in this thesis are for lossless image encoding, where our proposed methods reduce the checking cycles of the Pass2 and Pass3 processes.

The acceleration methods can also be used for lossy image encoding. Some effort is needed to improve the rate control module using our proposed method in lossy image coding. In each bit-plane, our proposed methods count the number of NBC samples in the first pass process, which means that the distortion information can be obtained before the Pass2 and Pass3 processes of the current bit-plane. Our proposed method also matches the DSP hardware well and can be more efficient than the other known methods. In the reference software, the DWT module has been accelerated by a lifting scheme and is about 2 times faster than in the previous version of the reference software. We have not yet focused on accelerating the DWT module, but it becomes an important part of the JPEG2000 encoder after our acceleration. Speeding up the DWT module can be another topic for research.

Autobiography

Chien-Chih Liu (劉建志) was born in Taichung City in 1978. He received the BEng degree from the Department of Electrical Engineering, Yuan-Ze University, in 2000. After completing his two-year military service in Kinmen, he worked as a field application engineer at Winbond Electronics for three years. In 2005 he enrolled in Chiao Tung University to pursue a master's degree. His research interests are digital signal processing and image and video compression.
