Department of Electronics Engineering and Center for Telecommunications Research National Chiao Tung University
Hsinchu, Taiwan 30010, R.O.C.
E-mail: [email protected], [email protected]
Abstract
Fine Granularity Scalability (FGS) is a technique specified in the Amendment of MPEG-4. It was developed in response to the growing need for video delivery over the Internet. Compared to conventional techniques, it offers a different way to optimize video quality over a range of bitrates. In this work, we implement a real-time MPEG-4 FGS encoder on digital signal processors (DSPs). The digital signal processing environment is Innovative Integration’s Quatro62 personal computer plug-in card, which houses several Texas Instruments’ TMS320C6201 DSPs. We use a previously developed ITU-T H.263+ encoder as the base-layer encoder, which resides on one DSP. The FGS encoder works at the enhancement layer and resides on a second DSP. We base our FGS encoder on the publicly available MoMuSys software. To achieve real-time encoding on the DSP, we replace a few slow blocks in the original C program and further refine our code by taking into account the features of the DSP chip. Overall, we speed up the MPEG-4 FGS encoder on the DSP by a factor of about 6.5. The final encoding speed is about 12 QCIF frames per second with all bitplanes encoded, and about 18 frames per second with the last two bitplanes dropped.
1. Introduction
In response to fast-growing network video applications, the Amendment of MPEG-4 has specified Fine Granularity Scalability (FGS) coding to provide enhanced video delivery capability for services such as Internet streaming video. Compared to conventional layered scalability techniques, FGS employs a different strategy to optimize video quality over a range of bitrates. Through FGS coding, the enhancement bitstream can be truncated to nearly any number of bits to provide partial enhancement according to the bits delivered or decoded for each frame.
This work was supported in part by the National Science Council of R.O.C. under grant no. NSC 91-2219-E-009-045.
Fig. 1: Basic FGS encoder structure.
The basic FGS encoder structure is shown in Figure 1. In the encoder, the base-layer bitstream is generated from motion compensation, DCT (discrete cosine transform), quantization, and VLC (variable-length coding) according to the MPEG-4 standard.
The FGS enhancement encoder takes the original and reconstructed DCT coefficients as inputs. After all the DCT residues of a VOP (video object plane) are obtained, the maximum absolute value of the residues is found and the maximum number of bitplanes for the VOP is determined. The enhancement bitstream is then generated by coding each bitplane with bitplane variable-length coding. The bitstream of the FGS enhancement layer may be truncated to nearly any number of bits per picture after the encoding is completed.
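The bitplane bookkeeping described above can be sketched in C as follows. This is a minimal illustration and not the MoMuSys code: the function names `max_abs_residue`, `num_bitplanes`, and `bitplane_bit` are hypothetical, and residues are assumed to fit in a 16-bit signed integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Largest absolute value among n DCT residues (hypothetical helper). */
int max_abs_residue(const short *res, size_t n)
{
    int max = 0;
    size_t i;
    assert(res != NULL || n == 0);
    for (i = 0; i < n; i++) {
        int a = abs((int)res[i]);
        if (a > max)
            max = a;
    }
    return max;
}

/* Number of bitplanes needed to represent magnitudes up to max_abs. */
int num_bitplanes(int max_abs)
{
    int planes = 0;
    assert(max_abs >= 0);
    while (max_abs > 0) {
        planes++;
        max_abs >>= 1;
    }
    return planes;
}

/* Bit of |residue| in the given plane; plane 0 is the least significant. */
int bitplane_bit(short residue, int plane)
{
    assert(plane >= 0 && plane < 16);
    return (abs((int)residue) >> plane) & 1;
}
```

For the residues {3, -9, 0, 5}, for example, the maximum magnitude is 9, so four bitplanes are needed, and the most significant plane (plane 3) is set only for the residue -9.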
The goal of this work is a real-time implementation of an MPEG-4 Fine Granularity Scalable video encoder on digital signal processors (DSPs). The environment of our DSP implementation involves a host PC, a DSP board, and the DSP chips on the board. The DSP chips are Texas Instruments (TI) TMS320C6201 devices. The TMS320C62x is a fixed-point DSP with a 5 ns instruction cycle time. It employs the VelociTI Very Long Instruction Word (VLIW) architecture, which enables sustained throughput of up to eight instructions in parallel [1]. In addition, the C62x DSPs come with on-chip program and data memories, which may be configured as cache on some devices. The DSP board we use is Innovative Integration (II)’s Quatro6x. It is a PCI-bus-compatible DSP card housing four TI TMS320C62x processors in a symmetric multiprocessing relationship with high-bandwidth inter-processor communication links.
Fig. 2: Architecture of the overall video encoder system.
For convenience, we use an H.263+ encoder as the base-layer encoder. The encoder is a result of earlier work [2]. In the development process, we first combine these two encoders on a PC and then convert the environment from PC to DSP. We make use of the features of the C62x chip to enhance the FGS encoder on the DSP. The resulting system achieves real-time coding speed for QCIF pictures.
2. Architecture of Overall Video Encoder System
Figure 2 shows the overall encoder system architecture and how it operates. Image data are captured by the camera and transmitted to the host PC. The host PC is in charge of the communication mechanism between PCI and DSP. Of the four DSPs on the Quatro6x card, only one, denoted CPU0, can communicate with the host directly. We let it implement the base-layer encoder. After CPU0 receives the image data from the host, the encoding process begins and the base-layer bitstream is generated. In the middle of the base-layer encoding process, the residues, that is, the differences between the original and reconstructed DCT coefficients, are generated and transmitted to the enhancement-layer encoder through the FIFOLink on the Quatro6x between CPU0 and CPU3. As mentioned, we employ a previously developed H.263+ encoder as the base-layer encoder and employ the MPEG-4 FGS encoder as the enhancement-layer encoder.
Fig. 3: Procedure of FGS coding.
The base-layer bitstream and the enhancement-layer bitstream are packed together and transmitted to the host. After post-processing in the host, the combined bitstream is divided into two bitstreams and stored on disk.
3. FGS Encoder Optimization
3.1. Profile of the Original FGS Encoder on DSP
Since an existing software H.263+ encoder on DSP is used for base-layer encoding, the main task of our real-time implementation work, besides system integration, is to obtain an efficient DSP implementation of the FGS encoder. This primarily involves studying how to speed up the execution of FGS encoding on the DSP.
Figure 3 shows the procedure of FGS coding, where “data preprocessing” (the third block) means the procedure in which the residues are altered by the fgs shift matrix and the fgs rectangular shift factor when either frequency weighting or selective enhancement is enabled. By profiling the FGS encoder on the DSP without any optimization, we obtain the proportions of execution time of these procedures, which are shown in Figure 4. The bit-plane VLC coding dominates the execution time of encoding. One reason is that FGS does not do motion estimation, motion-compensated prediction, DCT, and IDCT, which often consume the majority of computation time in motion-compensated DCT coding. Moreover, typical general-purpose CPU architectures and compiler properties are at odds with what VLC requires for efficient execution.
Further, the temporal redundancy in video is not exploited in FGS coding as in predictive coding. Consequently, the size of the full FGS encoded bitstream is much larger than that of H.263+. We use akiyo qcif.yuv as a test sequence. The H.263+ encoder only encodes
Fig. 4: Proportions of execution time of different procedures.
Fig. 5: FGS output bitstream sizes without frequency weighting or selective enhancement.
weighting and selective enhancement are disabled in the FGS encoder. Figure 6 shows the results where frequency weighting and selective enhancement are enabled.
3.2. Code Acceleration
To speed up FGS encoding, we make use of the features of the C62x chip as well as the relevant provisions of the compiler to optimize the FGS encoder.
1. Configuring Compiler Options
TI’s Code Composer Studio (CCS) is a useful GUI tool that helps engineers develop DSP code. CCS compiles the C code and assembles it into the COFF file format. Compiler options control the operation of the compiler. Proper configuration of the compiler options helps the compiler generate efficient assembly code.
2. Software Pipelining
Software pipelining is a technique used to schedule instructions in a loop so that multiple iterations of the loop execute in parallel. Its realization consists of implementing parallel instructions, filling delay slots with useful instructions, loop unrolling, and maximizing usage of functional units. Software pipelining is an efficient way to improve performance.
Fig. 6: FGS output bitstream sizes with frequency weighting and selective enhancement.
3. Using Intrinsics
TI’s C6000 compiler provides intrinsics, which are special functions that map directly to inlined C62x instructions, to optimize C code. Many efficient DSP instructions that are not easily expressed in C code are supported as intrinsics.
4. Packed Data Processing
To maximize data throughput, it is often desirable to use a single load or store instruction to access multiple data values located consecutively in memory. When operating on a stream of 16-bit data, for example, we can use word (32-bit) accesses to read two 16-bit values at a time, and then use C62x intrinsics to operate on the data in parallel.
5. Memory Usage Strategy
Accesses by the C62x to the external memory require more cycles than accesses to the internal memory. The external memory access time also depends on what kind of RAM is used. The Quatro62 board uses SDRAM and SBRAM as external memories. It is therefore good to use on-chip memory as much as possible to decrease the number of external memory accesses. If some data have to be put in the external memory, one should try to use DMA to load them into the on-chip memory before processing them.
6. Memory Model and Allocation
To maximize code efficiency, the compiler schedules as many instructions as possible in parallel. To schedule instructions in parallel, the compiler must determine the dependencies between instructions, that is, whether one instruction must be executed before another. For example, a variable must be loaded from memory before it can be used. Because only independent instructions can execute in parallel, dependency inhibits parallelism. To help the compiler determine memory dependencies, we can qualify a pointer, reference, or array with the “restrict” keyword. This practice helps the compiler optimize certain sections of code because aliasing information can be more easily determined.
7. Using Macros
Since a function call takes some clock cycles to complete and since the compiler cannot software-pipeline a loop that contains function calls, we may change functions into “define” macros under some conditions to speed up the execution. Because macros are expanded in the resulting code, the program size is usually bigger than when using function calls.
8. Short Format for Multiplication
The multiplication units of the C62x perform 16-bit by 16-bit multiply operations. Multiplication of longer operands is broken into several such operations. One should therefore use the short data type for multiplication inputs whenever possible because this data type provides the most efficient use of the 16-bit multiplier in the C62x. For loop counters, one should use int or unsigned int, rather than short or unsigned short, to avoid unnecessary sign-extension.
9. System Level Pipelining
The basic data flow of the whole system is shown in Figure 7. CPU0 does the base-layer encoding and CPU3 performs the enhancement-layer encoding. The base-layer encoder encodes one macroblock and feeds the residues of that macroblock to the enhancement-layer encoder. This means that when the residues of one entire picture have been generated, the base-layer encoding is almost done. After receiving all the residues and rearranging them into bit-planes, the enhancement-layer encoder begins the bit-plane VLC coding. As the profile in Figure 4 shows, the bit-plane VLC coding occupies most of the execution time. Therefore, the scheme depicted in Figure 7 causes CPU0 to idle and wait for the enhancement-layer encoding to finish. This is clearly not an efficient system design. Consequently, we reschedule the flow of the whole system as shown in Figure 8. After CPU0 finishes the encoding of the whole Picture 0, the residues are sent to CPU3 and the output bitstream of Picture 0 is buffered. Then the base-layer encoder continues to encode Picture 1, since it does not have to wait for the end of enhancement-layer encoding. In this manner, most of the CPU idle time is removed because CPU0 and CPU3 can work in parallel. The whole system is much more efficient than the non-pipelined design.
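The effect of this rescheduling can be illustrated with a simple timing model. The sketch below assumes, hypothetically, that every picture takes a fixed time `t_base` on CPU0 and `t_enh` on CPU3 and that transfer time is negligible; the function names are illustrative and are not part of the implemented system.

```c
#include <assert.h>

/* Total time for n pictures when CPU0 waits for CPU3 on every picture. */
double time_without_pipelining(int n, double t_base, double t_enh)
{
    assert(n >= 0 && t_base >= 0.0 && t_enh >= 0.0);
    return n * (t_base + t_enh);
}

/* Total time when CPU0 encodes picture k+1 while CPU3 encodes picture k. */
double time_with_pipelining(int n, double t_base, double t_enh)
{
    double slower = (t_base > t_enh) ? t_base : t_enh;
    assert(n >= 0 && t_base >= 0.0 && t_enh >= 0.0);
    if (n == 0)
        return 0.0;
    /* The first picture fills the pipeline; the slower stage paces the rest. */
    return t_base + t_enh + (n - 1) * slower;
}
```

With the illustrative values `t_base = 1` and `t_enh = 3` time units, ten pictures take 40 units without pipelining and 31 units with it. For long sequences the cost per picture approaches that of the slower stage, which is consistent with the overall speed being governed by the FGS encoder.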
Fig. 7: System without pipelining.
Fig. 8: System with system-level pipelining.
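Several of the code-level techniques listed in Section 3.2 (packed data processing, the restrict qualifier, short multiplication inputs, and int loop counters) can be illustrated together in one portable C sketch. It is not taken from the FGS encoder: the function `dot16_packed` is hypothetical, the packed access is emulated with `memcpy` and shifts rather than C62x intrinsics, and the element count is assumed to be even.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Dot product of two 16-bit arrays, reading two elements per 32-bit access.
   restrict tells the compiler that a and b do not alias each other. */
int32_t dot16_packed(const int16_t *restrict a, const int16_t *restrict b, int n)
{
    int32_t sum = 0;
    int i;                              /* int loop counter, not short */
    assert(n >= 0 && n % 2 == 0);
    for (i = 0; i < n; i += 2) {
        uint32_t wa, wb;
        int16_t a0, a1, b0, b1;
        /* One 32-bit access fetches two consecutive 16-bit values. */
        memcpy(&wa, &a[i], sizeof wa);
        memcpy(&wb, &b[i], sizeof wb);
        /* Split each word into halves. The order of the halves does not
           matter here because both words are split the same way. The casts
           assume the usual two's-complement conversion. */
        a0 = (int16_t)(wa & 0xFFFFu);
        a1 = (int16_t)(wa >> 16);
        b0 = (int16_t)(wb & 0xFFFFu);
        b1 = (int16_t)(wb >> 16);
        /* 16-bit by 16-bit multiplies with 32-bit accumulation. */
        sum += (int32_t)a0 * b0 + (int32_t)a1 * b1;
    }
    return sum;
}
```

On the C62x, the same pattern would presumably rely on word loads and the compiler's intrinsics instead of the explicit shifts used here; the sketch only shows the data layout idea in standard C.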
Figure 10 shows the proportions of execution time of different program sections after acceleration.
4. Additional Performance Results
We present some additional performance data of the implemented MPEG-4 encoder. We use the clock functions defined in “time.h” on the PC to estimate the speed of our system. The test sequence is akiyo qcif.yuv. We use different quantization step sizes to measure the speed of our system under different conditions.
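A speed measurement of this kind can be sketched with the standard C `clock()` function. The helper names below are hypothetical, and the sketch only illustrates the arithmetic; it is not the measurement code used for the tables. Note that whether `clock()` reflects processor time or elapsed time depends on the host platform.

```c
#include <assert.h>
#include <time.h>

/* Frames per second from a frame count and elapsed clock ticks. */
double frames_per_second(long frames, clock_t elapsed_ticks)
{
    assert(frames >= 0 && elapsed_ticks > 0);
    return (double)frames * (double)CLOCKS_PER_SEC / (double)elapsed_ticks;
}

/* Time a caller-supplied routine that encodes `frames` pictures. */
double measure_fps(void (*encode_all)(long frames), long frames)
{
    clock_t start, elapsed;
    assert(encode_all != NULL);
    start = clock();
    encode_all(frames);
    elapsed = clock() - start;
    if (elapsed <= 0)
        elapsed = 1;            /* guard against a zero-length interval */
    return frames_per_second(frames, elapsed);
}
```

For instance, 100 frames encoded in an interval of two seconds' worth of clock ticks correspond to 50 frames per second.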
4.1. With and Without Frequency Weighting and Selective Enhancement
Table 1 shows the overall coding speed under different quantization step sizes in the base layer. We consider three kinds of FGS options: “non-optimized” is the FGS encoder on DSP without any optimization, “software pipelining” means the encoder obtained by setting the compiler options properly so that the compiler can perform software pipelining, and “optimized” means the optimized final code. Frequency weighting and selective enhancement are both disabled in all three cases.
Fig. 10: Proportions of execution time of different program sections after acceleration.
Now we enable frequency weighting and selective enhancement. The fgs shift matrix we use is as follows:
The value of fgs rectangular shift factor is 3. The experimental result is shown in Table 2.
4.2. With and Without Encoding of the Last Two Bitplanes
In our implementation, the FGS output bitstream is transmitted to the base-layer encoder when all the bit-planes are encoded. In practice, the channel bandwidth may be lower than the bitrate of the full bitstream, so the FGS stream may be truncated before being transmitted over the channel. Indeed, it is the experience of some researchers that the last few bitplanes of the bitstream may be truncated without much effect on the subjective quality of the decoded video. Table 3 shows the performance of the FGS encoder when the last two bitplanes are not encoded. Since the last two bitplanes only affect the last two bits of the residues, the quality of the reconstructed pictures does not change significantly, while the improvement in speed is significant.
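The bounded effect of dropping the last two bitplanes can be illustrated as follows. The function name is hypothetical, and the sketch assumes that the decoder simply treats the dropped bits of each residue magnitude as zero, with no bitplane shifting applied.

```c
#include <assert.h>
#include <stdlib.h>

/* Residue as reconstructed when the `dropped` least significant bitplanes
   of its magnitude are not encoded (the sign is kept). */
int truncate_residue(int residue, int dropped)
{
    int mag;
    assert(dropped >= 0 && dropped < 15);
    mag = abs(residue);
    mag = (mag >> dropped) << dropped;   /* clear the dropped bitplanes */
    return (residue < 0) ? -mag : mag;
}
```

Under these assumptions, dropping two bitplanes changes each residue magnitude by at most 3: a residue of 13 is reconstructed as 12, and a residue of -7 as -4.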
5. Concluding Remarks
We considered real-time implementation of an MPEG-4 FGS video encoder on DSPs. We have used a previously implemented H.263+ encoder as the base-layer encoder. The FGS enhancement-layer encoder is based on the FGS section of the MoMuSys software.
For DSP implementation, we have focused on the speed-up of the FGS encoder and the overall system design, since our system requires a host PC and two DSP chips to work together. Two DSPs were used for simplicity of system integration: one DSP implements the H.263+ base-layer encoder and the other implements the FGS encoder. The code size of the FGS encoder was considerably smaller than the DSP’s internal memory size. Therefore, code size reduction was not a major point of our work, unlike in some other implementation studies.
We profiled the FGS encoder and identified the bottlenecks in the encoder functions. We then sought to accelerate the code by utilizing the features of the C62x chip, as well as the provisions of the compiler.
The bitplane VLC coding was found to take the majority of the program execution time. In particular, the function that outputs bits to the bitstream was found to consume an unexpectedly large share of the execution time. Simply by rewriting this function, we gained proportionately the most improvement of all the work that we did.
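One common way to make bit output cheaper is to accumulate bits in a machine word and write whole bytes at a time, rather than handling one bit per call. The sketch below shows that general idea only; the text does not describe how the function was actually rewritten, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *buf;      /* output buffer */
    size_t   cap;      /* buffer capacity in bytes */
    size_t   pos;      /* bytes written so far */
    uint32_t acc;      /* pending bits, right-aligned */
    int      nbits;    /* number of pending bits (0..7 between calls) */
} BitWriter;

void bw_init(BitWriter *bw, uint8_t *buf, size_t cap)
{
    bw->buf = buf;
    bw->cap = cap;
    bw->pos = 0;
    bw->acc = 0;
    bw->nbits = 0;
}

/* Append the `len` low-order bits of `code`, most significant bit first. */
void bw_put(BitWriter *bw, uint32_t code, int len)
{
    assert(len >= 0 && len <= 24);
    bw->acc = (bw->acc << len) | (code & ((1u << len) - 1u));
    bw->nbits += len;
    while (bw->nbits >= 8) {            /* emit every complete byte */
        assert(bw->pos < bw->cap);
        bw->nbits -= 8;
        bw->buf[bw->pos++] = (uint8_t)(bw->acc >> bw->nbits);
    }
}

/* Pad the final partial byte with zero bits and return the byte count. */
size_t bw_flush(BitWriter *bw)
{
    if (bw->nbits > 0)
        bw_put(bw, 0, 8 - bw->nbits);
    return bw->pos;
}
```

Writing the 3-bit code 101, the 5-bit code 11111, and a single 1 bit, followed by a flush, produces the two bytes 0xBF and 0x80. Each variable-length code then costs one shift, one OR, and an occasional byte store.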
In system integration, we scheduled the workflow so that the H.263+ encoder and the FGS encoder could work in parallel. Since the FGS encoder was slower than the H.263+ encoder, the speed of the overall system depended on the improvement of the FGS encoder. The final encoding speed of the implementation is about 12 QCIF frames per second with no video quality loss from bitplane dropping, which is about 6.5 times the speed of the original encoder with no optimization. With the last two bitplanes dropped, the speed can reach about 18 frames per second.
6. References
[1] N. Seshan, “High VelociTI processing,” IEEE Signal Processing Mag., vol. 15, no. 2, pp. 86–101, Mar. 1998.
[2] M.-L. Woo, “Real-Time Implementation of H.263+ Using TI TMS320C62x,” M.S. thesis, Department of Electronics Engineering, National Chiao Tung University, June 2000.
Table 1: Overall Coding Speed Without Frequency Weighting or Selective Enhancement
Average QCIF frames per second (columns 2 to 4)
QPI   Non-optimized   Software pipelining   Optimized   Speed-up (non-optimized vs. optimized)   Speed-up (proper configuration vs. optimized)
4     2.10053         5.51755               13.50439    6.42904                                  2.44754
8     2.08065         5.45524               13.44447    6.46168                                  2.46451
12    1.98942         5.23286               12.98364    6.52636                                  2.48117
16    1.97656         5.23259               12.87830    6.51552                                  2.46117
20    1.92894         5.11169               12.70025    6.58406                                  2.48455
24    1.90360         5.06380               12.54863    6.59204                                  2.47810
28    1.88512         4.97661               12.38237    6.56847                                  2.48811
32    1.87403         5.01128               12.49688    6.66846                                  2.49375
Table 2: Overall Coding Speed with Frequency Weighting and Selective Enhancement
Average QCIF frames per second (columns 2 to 4)
QPI   Non-optimized   Software pipelining   Optimized   Speed-up (non-optimized vs. optimized)   Speed-up (proper configuration vs. optimized)
4     1.69497         4.73732               12.64702    7.46149                                  2.66966
8     1.67628         4.66810               12.49844    7.45607                                  2.67741
12    1.61283         4.51610               12.07438    7.48648                                  2.67363
16    1.60720         4.49438               11.98179    7.45507                                  2.66595
20    1.57480         4.42576               11.80638    7.49705                                  2.66765
24    1.55453         4.35996               11.74260    7.55378                                  2.69328
28    1.54552         4.33614               11.72058    7.58357                                  2.70300
32    1.53440         4.29516               11.55268    7.52911                                  2.68970
Table 3: Overall Coding Speed Without Encoding of the Last Two Bitplanes
Average QCIF frames per second
QPI   Frequency Weighting and Selective Enhancement Disabled   Frequency Weighting and Selective Enhancement Enabled
4     19.33862                                                 19.17178
8     19.28268                                                 19.16076
12    19.22338                                                 18.08318
16    18.88574                                                 18.03101
20    18.33181                                                 17.71793
24    18.17851                                                 17.43983
28    18.13237                                                 17.30104
32    17.63668                                                 17.02128