
Chapter 2 MHP Graphics

2.5 Bresenham Algorithm

The Bresenham algorithm [13] can be used for the scaling operation. It uses only integer addition, subtraction, and bit shifting, all of which are very cheap operations in standard computer architectures.

The line is drawn between two points (x0, y0) and (x1, y1), as shown in Figure 2.12.

Assume that the slope of the line is restricted to lie between 0 and 1. The goal is, for each column x between x0 and x1, to identify the row y in that column which is closest to the line and to plot a pixel at (x, y). The general formula for the line between the two points is given by:

$$y = \frac{y_1 - y_0}{x_1 - x_0}\,(x - x_0) + y_0$$

Figure 2.12 Diagram of the line drawing

Note that y starts at y0, and each time we add 1 to x, we add the fixed value $(y_1 - y_0)/(x_1 - x_0)$ to the exact y. Moreover, since this is the slope of the line, by assumption it is between 0 and 1. In other words, after rounding, in each column we either use the same y as in the previous column or add one to it. We can decide which of these to do by keeping track of an error value, which denotes the vertical distance between the current y value and the exact y value of the line for the current x. Each time we increment x, we increase the error by the slope value above. Each time the error surpasses 0.5, the line has become closer to the next y value, so we add 1 to y and simultaneously decrease the error by 1. Figure 2.13 shows the pseudo code of the Bresenham algorithm. Owing to limited time, the Bresenham algorithm is not included in our implementation.

Figure 2.13 The pseudo code of Bresenham algorithm

Bresenham line(x0, x1, y0, y1)

boolean steep := (y1 - y0) > (x1 - x0)
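The error-tracking procedure described above can be sketched in Python as follows. This is a minimal illustration rather than the thesis implementation (which does not include the algorithm). The function name `bresenham_line`, the handling of steep and right-to-left lines, and the scaling of the error by 2·dx so that only integer arithmetic is needed are all assumptions made for this sketch. The argument order follows the pseudo code header.

```python
def bresenham_line(x0, x1, y0, y1):
    """Return the pixels of the line from (x0, y0) to (x1, y1).

    Argument order follows the pseudo code header: (x0, x1, y0, y1).
    """
    steep = abs(y1 - y0) > abs(x1 - x0)
    if steep:
        # Reflect across y = x so that the slope magnitude is at most 1.
        x0, y0 = y0, x0
        x1, y1 = y1, x1
    if x0 > x1:
        # Always walk in the direction of increasing x.
        x0, x1 = x1, x0
        y0, y1 = y1, y0

    dx = x1 - x0
    dy = abs(y1 - y0)
    y_step = 1 if y0 < y1 else -1

    # The error is scaled by 2*dx, so the 0.5 threshold becomes dx and
    # "decrease the error by 1" becomes "subtract 2*dx".
    error = 0
    y = y0
    pixels = []
    for x in range(x0, x1 + 1):
        pixels.append((y, x) if steep else (x, y))
        error += 2 * dy
        if error > dx:
            y += y_step
            error -= 2 * dx
    return pixels


print(bresenham_line(0, 5, 0, 2))
```

For the line from (0, 0) to (5, 2), the exact y values are 0, 0.4, 0.8, 1.2, 1.6 and 2, so the sketch selects rows 0, 0, 1, 1, 2 and 2 for columns 0 to 5.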

Chapter 3 Development Environment

3.1 Objectives

We have already built a pure software video and graphics subsystem conforming to the MHP graphics model in an embedded system environment, but this approach is not fast enough to support high-quality video. According to the profiling data of the pure software system on the Xilinx ML310 platform, there are three bottlenecks: video scaling, alpha composing of the video plane and the graphics plane, and screen blitting.

Therefore, the objective of this thesis is to improve the performance by adding hardware logic and applying software-hardware co-design.

The embedded system environment used for deployment is described in the following sections.

3.2 Target Platform and Development Tools

The Xilinx ML403 platform is chosen as the target environment to emulate a DTV set-top box. The hardware and software configurations are described in the following sections.

3.2.1 Xilinx ML403 Platform

The Xilinx ML403 evaluation platform [4] [5] enables designers to develop embedded system and system-on-chip applications. The block diagram of the ML403 platform is shown in Figure 3.1.

Figure 3.1 Xilinx ML403 platform block diagram [4, Figure 1, pp.11]

The Virtex-4 XC4VFX12-FF668 FPGA has one embedded PowerPC 405 processor clocked at up to 300 MHz. The FPGA provides 648 Kb of block RAM.

Data accesses between the processor and block RAM are faster than those between the processor and memory on the bus. The ML403 platform has 64 MB of DDR SDRAM with a 32-bit interface running at up to a 266 MHz data rate. It also has 8 Mb of ZBT synchronous SRAM on a 32-bit data bus with no parity bits. The CompactFlash card contains a DOS FAT16 filesystem partition and a Linux EXT3 filesystem partition. The DOS filesystem partition contains a set of ACE files to run board diagnostics, as well as to demonstrate the operation of various operating

3.2.2 Embedded Development Kit (EDK)

The Embedded Development Kit (EDK) provides a suite of design tools to help us design a complete embedded processor system in a Xilinx FPGA device [7][8].

Though the Xilinx Integrated Software Environment (ISE) must be installed along with EDK, it is possible to create our entire design from beginning to end in the EDK environment.

EDK includes several tools, which are:

● Xilinx Platform Studio (XPS): An integrated design environment in which we can create our complete embedded design.

● The Base System Builder (BSB) Wizard: Allows us to create a working embedded design quickly, using any features of a supported development board or using the basic functionality common to most embedded systems.

● Platform Generator (Platgen): Constructs the programmable system on a chip in the form of HDL and implementation netlist files.

● Xilinx Microprocessor Debugger (XMD): Opens a shell for software download and debugging.

● System ACE File Generator (GenACE): Generates a Xilinx System ACE configuration file based on the FPGA configuration bitstream and software executable to be stored in a non-volatile device in a production system.

In addition to the aforementioned tools, EDK offers other tools for developing a custom system, such as “EDK Command Line”, which lets us run the embedded design flows or change tool options from a command line; “Library Generator”, which constructs a software platform comprising a customized collection of software libraries, drivers and OS; “GCC”; and others.

Figure 3.2 EDK Hardware Tools Architecture [7, Figure 1-2, pp.26]

Figure 3.2 shows how the system bitstream is created and downloaded to the FPGA device.

When generating the system, the Platform Generator (Platgen) imports the MHS (Microprocessor Hardware Specification) file and the MPD (Microprocessor Peripheral Definition) and PAO (Peripheral Analyze Order) files of all peripherals in the system. The MHS file defines the configuration of the embedded processor system, including buses, peripherals, processors, connectivity, and address space. The MPD file defines the interface of a peripheral, and the PAO file tells other tools (Platgen and Simgen, for example) which HDL files are required for compilation (synthesis or simulation) of the peripheral and in what order. MPD and PAO files are generated automatically when we create or import a peripheral. Platgen then generates the system and wrapper HDL files and the system.BMM file. The BMM (BlockRAM Memory Map) file contains a syntactic description of how individual BlockRAMs constitute a contiguous logical data space. The system and wrapper HDL files are then fed to ISE tools such as XST and NGDBuild. Finally, the iMPACT tool downloads the bitstream file to the FPGA device over the JTAG cable.

Figure 3.3 EDK Software Tools Architecture [7, Figure 1-2, pp.26]

Figure 3.3 shows how the application ELF (Executable and Linkable Format) file is created and downloaded to the FPGA device.

Before creating an application ELF file, we must first generate the corresponding libraries. The Library Generator imports the EDK software library files, such as the BSP (Board Support Package) and MLD (Microprocessor Library Definition) files, and the IP library or user repository files (MDD (Microprocessor Driver Definition) and MLD). EDK provides several software libraries that simplify embedded system development. These libraries include “LibXil File”, which provides block access to file systems and devices using standard calls such as open, close, read and write; “LibXilNet”, which includes functions to support the TCP/IP stack and the higher-level application programming interface (socket APIs); “XilFatfs FATfile”, which provides read/write access to files stored on a System ACE CompactFlash; and “LibXil MFS”, which provides functions for the memory file system and can be accessed directly or through the LibXil File module. An IP library or user repository is created if the hardware IP was set, when it was created, to include standard functions such as DMA or address range support. Once the ELF file is created, the debugger tool (XMD or GDB) downloads it to the FPGA device over the JTAG cable.

3.2.3 Block RAM

The Virtex-4 XC4VFX12-FF668 FPGA has 36 block RAMs, each of which stores 18 Kb of data [6]. Write and read are synchronous operations; the two ports are symmetrical and totally independent, sharing only the stored data. Each port can be configured in any “aspect ratio” from 16Kx1, 8Kx2, to 512x36, and the two ports are independent even in this regard. Figure 3.4 illustrates the dual-port data flow. Table 3.1 lists the port names and descriptions.

A read or write operation uses one clock edge. When reading, the read address is registered on the read port and the stored data is loaded into the output latches after the RAM access time. Similarly, when writing, the write address is registered on the write port and the data input is stored in memory.

Figure 3.4 Dual-Port Data Flow [6, Figure 4-1, pp.111]

Table 3.1 Port names and descriptions of the BRAM

Besides the basic read and write operations, a write operation has three modes: WRITE_FIRST, READ_FIRST, and NO_CHANGE. The default is WRITE_FIRST, and the mode is set during device configuration. These choices increase the efficiency of the block RAM at each clock cycle. The three modes differ in the data available on the output latches after a write clock edge.

Figure 3.5 WRITE_FIRST mode waveform [6, Figure 4-2, pp.112]

As shown in Figure 3.5, in WRITE_FIRST mode, the input data is simultaneously written into memory and stored in the data output (transparent write).

Figure 3.6 READ_FIRST mode waveform [6, Figure 4-3, pp.113]

Figure 3.6 shows that in READ_FIRST mode, data previously stored at the write address appears on the output latches while the input data is being stored in memory (read before write).

Figure 3.7 NO_CHANGE mode waveform [6, Figure 4-4, pp.113]

As shown in Figure 3.7, in NO_CHANGE mode, the output latches remain unchanged during a write operation.
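The difference between the three write modes can be illustrated with a small behavioural model of a single port. This is only a sketch and is not cycle accurate. The class name `BlockRamPort`, its `clock` method and the initial latch value of 0 are hypothetical choices made for illustration and do not correspond to any Xilinx primitive or API.

```python
class BlockRamPort:
    """Behavioural model of one block RAM port (illustrative only)."""

    MODES = ("WRITE_FIRST", "READ_FIRST", "NO_CHANGE")

    def __init__(self, depth, write_mode="WRITE_FIRST"):
        if write_mode not in self.MODES:
            raise ValueError("unknown write mode: " + write_mode)
        self.mem = [0] * depth
        self.write_mode = write_mode
        self.data_out = 0  # models the output latches (assumed to start at 0)

    def clock(self, addr, write_enable=False, data_in=0):
        """Apply one active clock edge and return the output latches."""
        if not write_enable:
            self.data_out = self.mem[addr]   # plain read
        elif self.write_mode == "WRITE_FIRST":
            self.mem[addr] = data_in
            self.data_out = data_in          # transparent write
        elif self.write_mode == "READ_FIRST":
            self.data_out = self.mem[addr]   # old data appears first
            self.mem[addr] = data_in
        else:                                # NO_CHANGE
            self.mem[addr] = data_in         # latches keep their value
        return self.data_out


for mode in BlockRamPort.MODES:
    port = BlockRamPort(16, mode)
    port.clock(3, write_enable=True, data_in=0x11)
    print(mode, hex(port.clock(3, write_enable=True, data_in=0x22)))
```

After two consecutive writes to the same address, the model shows the new data on the latches in WRITE_FIRST mode, the previously stored data in READ_FIRST mode, and the untouched latch value in NO_CHANGE mode.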

Because the block RAM is a true dual-port RAM, both ports can access any memory location at any time. When accessing the same memory location from both ports, we must be very careful to avoid conflicts.

3.2.4 Intellectual Property InterFace (IPIF)

The IPIF is a module that enables us to easily integrate our own custom, application-specific, value-added IP into CoreConnect-based systems [9][10]. Figure 3.8 shows the IPIF positioned between the bus and the user IP. The interface seen by the IP is called the IP Interconnect (IPIC). The combination of the IPIF and the user IP is called a device (or a peripheral). Figure 3.8 also shows the services provided by the IPIF. In addition to facilitating bus attachment, other services (FIFOs, DMA, Scatter Gather (automated DMA), software reset, interrupt support and bus-master access) are placed in the IPIF to standardize functionality that is common to many IPs and to reduce IP development effort.

Figure 3.8 IPIF block diagram [9, Figure 1, pp.2]

The base element of the design is the Slave Attachment. This block provides the basic functionality for Slave operation. It implements the protocol and timing translation between the Bus and the IPIC. Optionally, the Slave Attachment can be enhanced with burst transfer support. This feature provides higher data transfer rates for sequential-address accesses.

The Byte Steering function is responsible for steering data bus information onto the correct byte lanes. This block supports both read and write directions and is required when a target address space in the IPIF or the User IP has a data width smaller than the bus width (32 bits for OPB and 64 bits for PLB).

The User may use a Master Attachment Service. This is needed when the User IP needs to access the OPB as a Master device.

The IPIF provides an optional Direct Memory Access (DMA) Service. This service automates the movement of large amounts of data between the user IP or IPIF FIFOs and other peripherals (such as System Memory or a Bridge device). Because the DMA Service requires access to the bus as a Master device, including the DMA Service automatically includes the Master Attachment Service.

Chapter 4 Hardware Graphics Accelerator and System Integration

According to the software system profiling data, the video scaling, alpha composing, and screen blitting operations are the bottlenecks of the pure software system on the ML403 platform. We plan to design and implement the hardware accelerators for these three parts to improve the performance.

The pure software implementation is built on several components, including TinyX, Microwindows-based AWT, JMF, FFmpeg, and others. TinyX is an X server implementation of the X Window System included in MontaVista Linux. Microwindows is an open-source windowing system for resource-constrained devices. JMF is an API for incorporating time-based media into Java applications. FFmpeg is a set of open-source, cross-platform libraries for multimedia applications.

The original AWT does not provide the stacked planes such as background plane, video plane, and graphics plane. In order to present video together with AWT graphics and to alpha-blend the graphics over the video at the same time, the modified graphics model as shown in Figure 4.1 is designed.

Figure 4.1 Modified graphics architecture

The off-screen buffer plays the role of the graphics plane in the MHP graphics model. The X Pixmap is designed to be both the video plane and the composition buffer of the video plane and the graphics plane. The X Window represents the final display device. The composite result is first drawn on the X Pixmap and then is blitted to the X Window to avoid the pixel-by-pixel flickers.

The third bottleneck, screen blitting, is caused by blitting from the X Pixmap to the X Window. Because the whole system is built on TinyX, this performance loss is unavoidable.

To speed up the system, we design alpha composing logic and scaling logic in hardware, and then integrate these circuits into the system to replace their software counterparts.

4.1 Alpha Composition Circuits

4.1.1 Design and Implementation

As described in Section 2.3, the graphics components are composed first, and then the graphics plane (component plane) is combined with the video plane (background plane). Compositions between components occur only in the overlap regions, but the composition between the component plane and the background covers the whole screen. Since the component and background composition takes much more computing time than the other operations, the alpha composition logic is designed to speed up that task. The component plane and the background plane are composed using the SRC_OVER rule, as shown in Figure 4.2.


Figure 4.2 Graphics and video composition
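As an illustration of the SRC_OVER rule applied to one row, the following Python sketch blends graphics pixels over video pixels. It is a software model only, not the hardware design. It assumes 8-bit channels, non-premultiplied graphics pixels given as (alpha, red, green, blue) tuples, and a fully opaque video plane; the function names `src_over_channel` and `compose_row` are hypothetical.

```python
def src_over_channel(graphics, video, alpha):
    """Blend one 8-bit channel of graphics over video with an 8-bit alpha."""
    # result = graphics * a + video * (1 - a), with a = alpha / 255.
    # Adding 127 rounds to the nearest integer before dividing by 255.
    return (graphics * alpha + video * (255 - alpha) + 127) // 255


def compose_row(graphics_row, video_row):
    """Compose one row of (a, r, g, b) graphics over (r, g, b) video."""
    result = []
    for (a, r, g, b), (vr, vg, vb) in zip(graphics_row, video_row):
        result.append((src_over_channel(r, vr, a),
                       src_over_channel(g, vg, a),
                       src_over_channel(b, vb, a)))
    return result


graphics = [(255, 10, 20, 30), (0, 10, 20, 30), (128, 255, 255, 255)]
video = [(1, 2, 3), (1, 2, 3), (0, 0, 0)]
print(compose_row(graphics, video))
```

A fully opaque graphics pixel replaces the video pixel, a fully transparent one leaves the video pixel unchanged, and intermediate alpha values mix the two.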

Figure 4.3 shows the procedure of composition. The composition logic contains a data buffer and a few control registers that control the logic or record its state. We first write one row of graphics data and video data to the data buffer. Then we write the specific value “0x0A” to the start control register to start the composition logic. The logic composes the graphics and video data pixel by pixel and writes the results back to the data buffer. Finally, we copy the results from the data buffer to the destination.

Figure 4.3 The procedure of composition logic

First of all, when we create a new peripheral, we need to choose the type of bus to which it is connected: OPB, PLB, or Fast Simplex Link. We choose the PLB bus because of its higher performance. Next, in the EDK graphical user interface, we can choose the optional IPIF (Intellectual Property InterFace) services, including S/W reset and the module information register, burst transaction support, DMA, FIFO, interrupt support, S/W register support, master support and address range support. We include the DMA service to accelerate data access, the S/W register service to provide the control registers, and the address range service to support the block RAM interface. Figure 4.4 shows the connections between the IPIF and the user logic when these optional functions are included.

Figure 4.4 The connections between IPIF and user logic

The purpose of the Bus2IP_BE port is to support byte-by-byte data access. The Bus2IP_RdCE and Bus2IP_WrCE ports can be from 1 to 32 bits wide, and register addressing is implemented through them. For example, if we want the IPIF to support four registers, the Bus2IP_RdCE and Bus2IP_WrCE ports are each four bits wide. If we want to write data to the third register (memory-mapped to BASEADDR+2×BusDataWidth), Bus2IP_WrCE will be “0010”. The purpose of Bus2IP_ArData, Bus2IP_ArBE, Bus2IP_ArCS and IP2Bus_ArData is to support the address range service. The IPIF supports at most eight address ranges.
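The register selection through Bus2IP_WrCE can be pictured as a one-hot pattern. The short Python sketch below reproduces the “0010” example. It assumes that the first register corresponds to the leftmost bit, as that example suggests; the helper `chip_enable` is a hypothetical name used only for illustration.

```python
def chip_enable(reg_index, num_regs):
    """Return the one-hot chip-enable string for a zero-based register index."""
    if not 0 <= reg_index < num_regs:
        raise ValueError("register index out of range")
    bits = ["0"] * num_regs
    bits[reg_index] = "1"  # assumption: register 0 is the leftmost bit
    return "".join(bits)


# The third register of four has zero-based index 2.
print(chip_enable(2, 4))
```

With four registers, the four possible patterns are “1000”, “0100”, “0010” and “0001”, so exactly one register is enabled per access.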

We use eight dual-port block RAMs as the data buffer, each configured as 8 bits wide by 2k deep. The EDK maps the beginning of the data buffer to the virtual memory address XPAR_USER_LOGIC_0_AR0_ADDR.

Figure 4.5 shows the data buffer in the composition logic. Since the PLB bus is 64 bits wide, the eight block RAMs are placed in one row so that the memory mapping is easier. For example, the virtual memory addresses 0x10010010 to 0x1001001F are mapped to block RAM address “000_0000_0001”, which is taken from the 17th to the 27th bit of the virtual memory address. Byte-by-byte data access is handled by the Bus2IP_ArBE signal. If we want to write 32-bit data to the virtual memory address 0x10010110, the block RAM address will be “000_0001_0001” and Bus2IP_ArBE will be “1111_0000”. Because the full screen size is 640×480 and the composition operation is executed row by row, we only need 640×4 bytes of buffer space for each of the graphics and video planes. Therefore, the first 512 entries of the data buffer are allocated to graphics, the next 512 entries to video, and the rest of the data buffer to the composition results.

Figure 4.5 Data buffer in the composition logic
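The two address examples above can be reproduced with a few lines of Python. The sketch assumes that the bit positions are counted from the most significant bit (bit 0 is the MSB of the 32-bit address), so bit 27 sits four places above the least significant bit. It also assumes that the byte lane is the address modulo the 8-byte bus width. The text does not spell out either convention, and the function names are hypothetical, so this only checks the worked examples and is not a description of the actual hardware decoding.

```python
ROW_BITS = 11   # a 2k-deep block RAM needs an 11-bit address
ROW_SHIFT = 4   # bit 27 (counting from the MSB) is bit 4 from the LSB


def bram_row_address(address):
    """Extract the 17th to 27th bit (MSB-first numbering) of an address."""
    return (address >> ROW_SHIFT) & ((1 << ROW_BITS) - 1)


def format_row(row):
    """Format a row address in the style used in the text."""
    bits = format(row, "011b")
    return "_".join((bits[:3], bits[3:7], bits[7:]))


def byte_enable(address, num_bytes, bus_bytes=8):
    """Return the byte-enable pattern, leftmost character = lowest byte lane."""
    lane = address % bus_bytes
    if lane + num_bytes > bus_bytes:
        raise ValueError("access crosses the bus width")
    return "0" * lane + "1" * num_bytes + "0" * (bus_bytes - lane - num_bytes)


print(format_row(bram_row_address(0x10010010)))
print(format_row(bram_row_address(0x10010110)))
print(byte_enable(0x10010110, 4))
```

Under these assumptions, the sketch yields “000_0000_0001” and “000_0001_0001” for the two addresses and the pattern 11110000 for the 32-bit write, matching the values quoted in the text.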

After writing one row of graphics and video data to the data buffer of the composition logic, we must write a few parameters to the registers to control the behavior of the logic and to trigger it to start.

The EDK maps the beginning of the registers to the virtual memory address XPAR_USER_LOGIC_0_ADDR. Figure 4.6 shows the control registers in the composition logic. The first eight bits form the start control register. Once all other data and parameters are prepared, it must be written with the specific value 0x0A. The start register is reset to 0x00 when the logic is ready to work. If we write any value other than 0x0A, the start register simply keeps that value and the composition logic does not start. Bit 8 is the busy flag register and bit 9 is the done flag register. We can tell whether a single-row composition is done through these two state registers. They are initialized to “00”, changed to “10” when the logic starts, and changed to “01” when a single-row composition is finished. Bits 10 to 21 are unused and reserved for future expansion. The next ten bits, bit 22 to bit 31, form the width control register, which tells the logic how many pixels have to be processed.

Figure 4.6 Control registers of the composition logic
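The register layout can be made concrete with a small packing and unpacking sketch in Python. It assumes that bit 0 is the most significant bit of a 32-bit word, so the start field occupies the top byte and the width field the lowest ten bits; this numbering is an assumption, as are the names `pack_control` and `unpack_control`.

```python
START_MAGIC = 0x0A  # value that triggers the composition logic


def pack_control(start, width):
    """Build the 32-bit word written by software (flags are left at 0)."""
    if not 0 <= width < (1 << 10):
        raise ValueError("width must fit in 10 bits")
    return ((start & 0xFF) << 24) | width


def unpack_control(word):
    """Split a 32-bit control word into its fields."""
    return {
        "start": (word >> 24) & 0xFF,  # bits 0 to 7
        "busy": (word >> 23) & 1,      # bit 8
        "done": (word >> 22) & 1,      # bit 9
        "width": word & 0x3FF,         # bits 22 to 31
    }


word = pack_control(START_MAGIC, 640)
print(hex(word))
print(unpack_control(word))
```

A full 640-pixel row fits in the ten-bit width field, whose largest value is 1023.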

Figure 4.7 The FSMD model of the composition logic

Figure 4.7 shows the finite-state machine with datapath (FSMD) model of the composition logic. When the logic is reset, it enters the “IDLE” state. The state does not change until “0x0A” is written to the start register, which drives the signal Com_go high. When Com_go becomes high, the logic moves to the “Get_data” state to get data from the data buffer, Busy is changed to high, Done is changed to low, and Com_count is initialized to Width, the value contained in the
