Chapter 4 Hardware Graphics Accelerator and System Integration 27
4.3 Integration with JMF
4.2.1 Design and Implementation
The idea is to replace the software alpha composition and scaling units with their hardware counterparts. To integrate our hardware logic with the existing software JMF, we have to regenerate the ACE file, which contains the hardware binary and the Linux kernel [11] [12].
First, we download the ML403 reference design from the Xilinx web site and add the scaling logic and the composition logic. However, in this version, which includes the IPIF DMA service, the FPGA is too small to hold both hardware IPs.
We therefore remove the IPIF DMA service from the two logic blocks and generate the binary.
Then, we must build the BSP (Board Support Package) needed to rebuild the Linux kernel.
In order to build the BSP, we modify a few software parameters, including:
(1) Change the software platform to Linux
(2) Change the driver for GPIOs from v2.0 to v1.0
The open source MontaVista Linux can be obtained from the web site http://source.mvista.com. Rebuilding the kernel requires the following steps.
(1) Prepare the cross compiling environment
(2) Copy the BSP to the kernel source (drivers and arch directories)
(3) Cross-build (the cross-compiler should be listed in the search path):
command: make menuconfig ARCH="ppc" CROSS_COMPILE="powerpc-405-linux-gnu-"
- turn off the network device and the character LCD support, so that we do not have to patch the kernel
command: make dep ARCH="ppc" CROSS_COMPILE="powerpc-405-linux-gnu-"
command: make zImage ARCH="ppc" CROSS_COMPILE="powerpc-405-linux-gnu-"
(4) Copy arch/ppc/boot/image/zImage.embedded to the reference design directory for building the ACE file
To integrate the hardware binary and the rebuilt kernel:
(1) Launch the EDK shell
(2) command: xmd -tcl genace.tcl -jprog -board user -target ppc_hw -hw implementation/system.bit -elf zImage.embedded -ace system.ace
The ACE file is generated after running through the above steps.
Now we have an ACE file that contains the binary for the accelerating IPs and the corresponding Linux kernel. However, how do we access the registers or data buffers of the hardware IPs from within the operating system environment? How do we map the physical addresses of these IPs to virtual addresses? We could write drivers for these IPs, but there is a simpler way: Linux provides a special device, /dev/mem. We can use this device together with the mmap() function to map a non-RAM (I/O memory) address to a corresponding virtual address in user space. To use mmap(), we have to include the sys/mman.h header. The format of the call is as follows:
pa=mmap(addr, len, prot, flags, fildes, off);
The mmap() function shall establish a mapping between the address space of the process at an address pa for len bytes to the memory object represented by the file descriptor fildes at offset off for len bytes. A successful mmap() call shall return pa as its result.
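The mapping step can be illustrated with a minimal sketch. It is written in Python for brevity and is not the code used in this work. The helper names map_registers, write_reg32 and read_reg32 are hypothetical. The sketch shows the one detail the call format above does not make obvious: the offset passed to mmap() must be page-aligned, so the requested physical address is rounded down to a page boundary and the remainder is kept as an offset into the mapping.

```python
import mmap
import os
import struct

PAGE = mmap.ALLOCATIONGRANULARITY  # mmap offsets must be a multiple of this


def map_registers(path, phys_addr, length):
    """Map `length` bytes starting at byte offset `phys_addr` of `path`.

    Returns (mapping, delta), where `delta` is the position of `phys_addr`
    inside the mapping (the mapping itself starts on a page boundary).
    """
    base = phys_addr - (phys_addr % PAGE)   # round down to a page boundary
    delta = phys_addr - base
    fd = os.open(path, os.O_RDWR | os.O_SYNC)
    try:
        mapping = mmap.mmap(fd, delta + length, mmap.MAP_SHARED,
                            mmap.PROT_READ | mmap.PROT_WRITE, offset=base)
    finally:
        os.close(fd)   # the mapping stays valid after the descriptor is closed
    return mapping, delta


def write_reg32(mapping, delta, reg_offset, value):
    """Store a 32-bit big-endian word (the PowerPC 405 is big-endian)."""
    struct.pack_into(">I", mapping, delta + reg_offset, value)


def read_reg32(mapping, delta, reg_offset):
    """Load a 32-bit big-endian word."""
    return struct.unpack_from(">I", mapping, delta + reg_offset)[0]
```

On the target, the call would take the form map_registers("/dev/mem", IP_BASE, IP_SPAN), where IP_BASE and IP_SPAN are placeholders for the address range assigned to the IP in the EDK project. Opening /dev/mem normally requires root privileges. A C implementation using a volatile pointer gives tighter control over the width of each bus access than this sketch does.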
The accelerating IPs are now successfully integrated with the software JMF system. We then perform a simple performance profiling: we play a 54-second QCIF MPEG-1 video at thirty frames per second and measure the execution time. Table 4.3 shows the results.
Table 4.3 Simple profiling of the JMF system with various configurations
Configuration                                                            Execution time (seconds)
Pure software JMF system                                                 900
JMF system with hardware scaling logic                                   493
JMF system with hardware composition logic                               1138
JMF system with hardware scaling logic and hardware composition logic    728
We can see that the performance of the JMF system is not improved as much as that of the standalone subsystem. In particular, the JMF system with the hardware composition logic is even slower than the pure software version. The reason is the overhead of data access. In the standalone subsystem, DMA is used to reduce this overhead, but in the above experiments we use the memcpy() function to access data. We also tried to use DMA in this environment. However, we need to write the source and destination physical addresses to the DMA registers, which is complicated because the physical address information is not available in the user application space. A possible solution is to write a driver for our IPs that reserves an appropriately sized memory region when the system boots. The driver can then access the physical addresses needed by the DMA, and it can also use kernel-space functions to return the corresponding virtual addresses to user applications.
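To make the comparison in Table 4.3 easier to read, the execution times can be converted into speedup factors relative to the pure software system. The short sketch below does this arithmetic. The frame-rate figure additionally assumes that every one of the 54 × 30 = 1620 nominal frames of the clip is processed, which the text does not state explicitly.

```python
# Execution times in seconds, copied from Table 4.3.
TIMES = {
    "pure software": 900,
    "hardware scaling": 493,
    "hardware composition": 1138,
    "hardware scaling + composition": 728,
}

CLIP_SECONDS = 54        # clip length stated in the text
FRAMES_PER_SECOND = 30   # nominal frame rate stated in the text


def speedup(config, baseline="pure software"):
    """Speedup of `config` over the baseline (values above 1 mean faster)."""
    return TIMES[baseline] / TIMES[config]


def achieved_fps(config):
    """Average frames handled per second of execution time.

    Assumption: all CLIP_SECONDS * FRAMES_PER_SECOND frames are processed.
    """
    return CLIP_SECONDS * FRAMES_PER_SECOND / TIMES[config]


if __name__ == "__main__":
    for name in TIMES:
        print(f"{name:32s} speedup {speedup(name):.2f}x, "
              f"{achieved_fps(name):.2f} frames/s")
```

The hardware scaling logic alone gives a speedup of about 1.83, the hardware composition logic alone about 0.79 (a slowdown), and both together about 1.24.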
4.2.2 Performance Profiling
In order to evaluate the execution time performance, several scenarios are chosen and the execution time of a specific function is measured. These scenarios simulate typical DTV video and graphics presentations. The scenario shown here simulates a user navigating the available services while watching a program, as illustrated in Figure 4.17. The video that the user is currently watching is presented on the video plane. At the same time, the user opens the EPG to navigate the other available services. The GUI components of the EPG are presented on the graphics plane, including several menu items and a component-based video that previews the content of the service the user wants to navigate to.
Figure 4.17 Scenario of typical EPG usage
To simulate this scenario, a component-based video and a full-screen-sized translucent AWT component are treated as the EPG GUI components. The decoded source video is in QCIF format. The background video presented on the video plane is scaled to the full screen dimensions of 640×480. The component-based video presented on the graphics plane keeps the dimensions of the original decoded video, and therefore no video scaling operation is needed. Both the component-based video and the background video have 891 frames.
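The scaling of the background video from QCIF (176×144) to 640×480 can be illustrated with a small sketch. This section does not describe the algorithm used by the scaling logic, so the code below is only an assumption: a nearest-neighbour scaler that selects source pixels with an integer error accumulator, in the spirit of Bresenham's technique [13]. The function names are hypothetical.

```python
QCIF = (176, 144)      # width, height of a QCIF frame
SCREEN = (640, 480)    # full-screen size used for the video plane


def source_indices(src_len, dst_len):
    """For each destination index, choose a source index (nearest-neighbour,
    rounding down) using only integer additions and comparisons."""
    indices = []
    src = 0
    err = 0
    for _ in range(dst_len):
        indices.append(src)
        err += src_len
        while err >= dst_len:   # advance the source position whenever the
            err -= dst_len      # accumulated error exceeds one output step
            src += 1
    return indices


def scale(frame, dst_w, dst_h):
    """Scale a frame given as a list of rows of pixel values."""
    cols = source_indices(len(frame[0]), dst_w)
    rows = source_indices(len(frame), dst_h)
    return [[frame[r][c] for c in cols] for r in rows]
```

With these dimensions the scale factors are 640/176 ≈ 3.64 horizontally and 480/144 ≈ 3.33 vertically, so each source pixel is repeated three or four times in each direction. Pixel replication of this kind is a typical cause of a blocky appearance in the scaled image.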
Table 4.4 The profiling of the JMF system
Condition
Component-based video    576    790    354
Background video         208    269    146
We compare the performance of (a) pure software JMF, (b) JMF with composition and scaling accelerators and (c) JMF with scaling accelerator, as shown in Table 4.4. We make the following observations.
(1) In each condition, the performance of JMF with the composition and scaling accelerators is improved by up to 30% compared to the pure software version.
(2) The performance of JMF with the composition and scaling accelerators under the two conditions, background video and background video plus translucent AWT component, is very close. This is because the operations under the two conditions are the same: every time the background video changes, the SRC_OVER composition between the video plane and the graphics plane is executed.
(3) The performance of JMF with the composition and scaling accelerators under the condition of background video plus component-based video is clearly slower than under the background-video-only condition. This is because two video clips have to be decoded. The composition operations also slow down the performance, because the composition is executed whenever the graphics plane changes.
(4) The performance of JMF with the scaling accelerator alone is better than the performance of JMF with both the scaling and composition accelerators. This is due to the data access overhead problem discussed earlier.
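The SRC_OVER composition referred to in the observations above is the standard Porter-Duff "source over destination" rule. The following sketch shows the rule for one pixel and for whole planes. The pixel layout (8-bit RGBA tuples with non-premultiplied colour) and the function names are assumptions made for illustration; the sketch uses floating-point arithmetic and does not reflect the datapath of the hardware composition logic.

```python
def src_over(src, dst):
    """Porter-Duff SRC_OVER for one pixel.

    `src` and `dst` are (r, g, b, a) tuples with 8-bit channels and
    non-premultiplied colour. Returns the composed (r, g, b, a) pixel.
    """
    sa = src[3] / 255.0
    da = dst[3] / 255.0
    out_a = sa + da * (1.0 - sa)          # resulting alpha
    if out_a == 0.0:
        return (0, 0, 0, 0)               # both pixels fully transparent
    out = []
    for s, d in zip(src[:3], dst[:3]):
        # Weighted colour, then divided by out_a to undo premultiplication.
        c = (s * sa + d * da * (1.0 - sa)) / out_a
        out.append(int(round(c)))
    out.append(int(round(out_a * 255.0)))
    return tuple(out)


def compose_planes(graphics, video):
    """Compose the graphics plane over the video plane, pixel by pixel."""
    return [[src_over(g, v) for g, v in zip(g_row, v_row)]
            for g_row, v_row in zip(graphics, video)]
```

An opaque graphics pixel hides the video pixel beneath it, a fully transparent one leaves the video pixel unchanged, and a translucent one blends the two in proportion to its alpha value. Because the result depends on both planes, it must be recomputed whenever either plane changes.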
Chapter 5 Conclusion and Future Work
In this thesis, two accelerating IPs, the scaling and composition circuits, are implemented and integrated into a software MHP graphics and video system. These two hardware units accelerate the system performance. To achieve this goal, several important steps were carried out:
Understand the MHP graphics model and the composition pipeline.
Gain an in-depth knowledge of the target environment, the Xilinx ML403 platform, including the EDK tools and the Linux operating system.
Understand the pure software MHP graphics and video system.
Design and implement the composition logic and the scaling logic.
Implement standalone subsystems for the two accelerating IPs and evaluate their performance.
Rebuild the MontaVista Linux kernel, connect our IPs to the ML403 reference design, generate the binary and finally generate the ACE file.
Replace the software scaling and composition operations with these two accelerating IPs.
However, a few points are worth further work:
Use more sophisticated algorithms for the scaling logic to eliminate the blocking effect.
Use DMA in the integrated JMF system to reduce the data access overhead of the hardware.
References
[1] ETSI TS 101 812 V1.3.1, “Digital Video Broadcasting (DVB); Multimedia Home Platform (MHP) Specification 1.0.3”, June 2003.
[2] IBM, Inc., “CoreConnect Bus Architecture”, http://www-03.ibm.com/chips/products/coreconnect/
[3] Xilinx, Inc., “CoreConnect Bus Architecture”, http://www.xilinx.com/ipcenter/processor_central/coreconnect/coreconnect.htm
[4] Xilinx, Inc., “ML401/ML402/ML403 Evaluation Platform User Guide”, May 2006.
[5] Xilinx, Inc., “Virtex-4 Family Overview”, January 2007.
[6] Xilinx, Inc., “Virtex-4 User Guide”, October 2006.
[7] Xilinx, Inc., “Embedded System Tools Reference Manual”, October 2005.
[8] Xilinx, Inc., “OS and Libraries Document Collection”, January 2006.
[9] Xilinx, Inc., “OPB IPIF Product Specification”, April 2005.
[10] Xilinx, Inc., “PLB IPIF Product Specification”, August 2004.
[11] Brigham Young University, “Linux on ML403 tutorial”, http://splish.ee.byu.edu/projects/LinuxFPGA/configuring.htm
[12] Xilinx, Inc., “Commands to generate ACE file”, http://toolbox.xilinx.com/docsan/xilinx8/help/platform_studio/html/ps_p_dld_using_genace_script.htm
[13] J. E. Bresenham, “Algorithm for computer control of a digital plotter”, IBM Systems Journal, vol. 4, no. 1, pp. 25-30, 1965.