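Performance Study - Research on MPEG-4/21 SoC Design and New-Generation Mobile Communications, Subproject 2: Research on a Digital Baseband SoC Acceleration Architecture for Multimedia Communication and the Embedded Operating System Interface (III)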

In this section, we first introduce our development environment, the Xilinx Spartan-3 Development Board, and then describe the Java benchmarks used in this research. Finally, we present and discuss the experimental results, analyzing performance in terms of both execution time and power consumption.

4.1 Xilinx Spartan-3 Development Board

The Xilinx Spartan-3 Development Board is used to develop the proposed Java VM acceleration algorithm. The detailed specification of the board is given in [21]. The target Spartan-3 device has an equivalent gate count of 200,000 gates, and JOP occupies 64 percent of the FPGA logic. The data path of Spartan-3 is 32 bits wide with an 8-bit memory interface. A shift instruction completes in exactly one cycle. The external memory of JOP on Spartan-3 consists of a 32-bit SRAM block of 1M bytes and an 8-bit flash of 2M bits. A Java program is compacted by JCC into a *.jop file, which is loaded into SRAM, while configuration data is stored in flash. Finally, the maximum working frequency of this processor is 194.621 MHz, according to the synthesizer.

4.2 Java Benchmark Programs

In this research, we use three small Java benchmark programs [14]: one synthetic benchmark (Sieve of Eratosthenes) and two application benchmarks, Kfl and UDP/IP. We describe them in the following subsections.

i. Sieve of Eratosthenes

This program produces a list of prime numbers using the algorithm proposed by Eratosthenes. His method is as follows.

First, write down a list of integers. Then mark all multiples of 2 (other than 2 itself). Next, move to the next unmarked number, here 3, and mark all its multiples. Continue marking the multiples of the next unmarked number until no new unmarked numbers remain. The numbers that survive this marking process (the Sieve of Eratosthenes) are the primes.
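The marking procedure described above can be sketched as follows. This is a minimal illustration in Python, not the Java benchmark source used in the study; the function name `sieve_of_eratosthenes` and the `limit` parameter are assumptions made for this example.

```python
def sieve_of_eratosthenes(limit):
    """Return all primes <= limit by marking multiples of each unmarked number."""
    if limit < 2:
        return []
    # marked[n] becomes True once n is known to be a multiple of a smaller prime
    marked = [False] * (limit + 1)
    primes = []
    for n in range(2, limit + 1):
        if marked[n]:
            continue  # n was marked as a multiple of an earlier number
        primes.append(n)  # n survived the marking, so it is prime
        for multiple in range(2 * n, limit + 1, n):
            marked[multiple] = True
    return primes


print(sieve_of_eratosthenes(30))
```

For `limit = 30` this prints `[2, 3, 5, 7, 11, 13, 17, 19, 23, 29]`.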

ii. Kfl

Kfl is adapted from a real-time application and is taken from one of the nodes of a distributed motor control system. The motor control system is a solution for rail cargo: when goods are loaded onto and unloaded from wagons, a large amount of time is lost because the contact wires are in the way. Balfour Beatty Austria developed and patented a technical solution called Kippfahrleitung, which tilts up the contact wire. An asynchronous motor on each mast performs this tilting, which has to be done synchronously along the whole line [23].

Each motor is controlled by an embedded system. This system also measures the position and communicates with a base station [14]. The base station needs to control the deviation of the individual positions during the tilt, and it also provides the user interface for the operator. In technical terms, this is a distributed, embedded real-time control system that communicates over an RS 485 network.

A simulation of both the environment (sensors and actors) and the communication system (commands from the master station) forms part of the benchmark, so as to simulate the real-time workload.

iii. UDP/IP

The UDP/IP benchmark is built on Ejip, a tiny TCP/IP stack for embedded Java. It contains two UDP server/clients that exchange messages via a loopback device.

4.3 Experiment Results

We simulated our dynamic code optimization scheme on Spartan-3. Logic utilization increases by less than 1%, yet both execution time and power consumption improve substantially. We discuss these two aspects in turn.

i. Execution Time

We synthesize our JDCO system and compare it with DCO (no frequency check) and the original JOP system. The execution times are listed in Table III and shown in Fig. 41. The table shows that the average speedup of our system is 13.8%; compared with the DCO system, we still achieve a 7.1% speedup in execution time.

Let us focus on the results of the UDP/IP benchmark. Our JDCO system achieves a 9.7% speedup over the DCO system on this benchmark, while the other two benchmarks gain only 6.0% and 5.6%. The reason is that the UDP/IP benchmark contains a large amount of initialization and executed-only-once code, so our JDCO system gains considerably by avoiding such cases. In general, the performance of this system depends on the behavior of the Java program.
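As a quick consistency check on the figures quoted above, the three per-benchmark speedups over DCO can be averaged. The sketch below uses only the percentages stated in the text. The variable names are assumptions for this example, and the text does not say which of 6.0% and 5.6% belongs to Sieve and which to Kfl, so they are left unlabeled.

```python
# Speedups of JDCO over DCO, in percent, as quoted in the text.
udp_ip_speedup = 9.7
other_two_speedups = [6.0, 5.6]  # Sieve and Kfl; the assignment is not stated

all_speedups = other_two_speedups + [udp_ip_speedup]
average_speedup = sum(all_speedups) / len(all_speedups)

print(round(average_speedup, 1))
```

The rounded average is 7.1, which matches the 7.1% average speedup over DCO reported for Table III.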

TABLE III. Execution Time
[Table body not recoverable; one surviving entry: 0.929]

Fig. 41. Execution Time (x-axis: Java benchmark: Sieve, Kfl, UDP/IP; y-axis: execution time in milliseconds, 0 to 15000; series: JOP, DCO, JDCO)

ii. Power Consumption

To estimate the savings in power consumption, we analyze the microcode execution cycles and the external memory access times. We discuss these two aspects in the following subsections.

iii. Microcode Execution Cycles

The fewer microcode execution cycles are needed, the lower the power consumption. We analyze the microcode execution cycles of each bytecode and separate them by the number of occurrences.

Because the microcode execution cycles differ with the number of occurrences, we need the total execution times of the modified bytecodes of each benchmark, separated by the number of occurrences, which are listed in Table V. These totals are sums over the four modified bytecodes, getfield (180), putfield (181), invokevirtual (182), and invokeinterface (185), so we also need the share of each of them. By analyzing the benchmark programs, we assume the following ratio of the bytecodes:

180 : 181 : 182 : 185 = 40 : 20 : 20 : 1

TABLE IV. Microcode Execution Cycles of Each Bytecode (unit: cycles)
[Table body not recoverable; surviving headers: bytecodes; # occurrences: first, second, third and later; for DCO: first, second and later]

We calculate the microcode execution cycles as follows. T is the execution times in Table V, and P is the fraction of each bytecode; for example, P of getfield is 40 / (40+20+20+1).

The principle of the calculation is to sum the products of the execution cycles and the execution times, where the execution cycles for each number of occurrences are weighted by the fraction P of each bytecode. Note that the microcode execution cycles of the original JOP are always the same as those of the first occurrence in JDCO.
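The weighted sum described above can be sketched as follows. This is an illustration under stated assumptions rather than the report's exact formula. The per-bytecode cycle counts and the execution times in the usage lines are hypothetical placeholders, not values from Tables IV and V, and the function names `fractions` and `total_cycles` are invented for this example. Only the 40 : 20 : 20 : 1 ratio comes from the text.

```python
# Ratio of the four modified bytecodes assumed in the text (180 : 181 : 182 : 185).
RATIO = {"getfield": 40, "putfield": 20, "invokevirtual": 20, "invokeinterface": 1}


def fractions(ratio):
    """Turn a ratio into fractions P that sum to 1."""
    total = sum(ratio.values())
    return {bytecode: weight / total for bytecode, weight in ratio.items()}


def total_cycles(cycles, exec_times, p):
    """Sum over occurrence classes of T * (P-weighted average cycles).

    cycles[occurrence][bytecode] -> microcode cycles (role of Table IV)
    exec_times[occurrence]       -> execution times T (role of Table V)
    """
    total = 0.0
    for occurrence, t in exec_times.items():
        weighted = sum(p[b] * cycles[occurrence][b] for b in p)
        total += t * weighted
    return total


P = fractions(RATIO)

# Hypothetical placeholder numbers, chosen only to exercise the functions.
example_cycles = {
    "first": {b: 30 for b in RATIO},
    "later": {b: 10 for b in RATIO},
}
example_exec_times = {"first": 100, "later": 900}
print(total_cycles(example_cycles, example_exec_times, P))
```

With these placeholders, P of getfield evaluates to 40/81, as in the text.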

TABLE V. Execution Times of Bytecodes 180, 181, 182, 185
[Table body not recoverable; surviving headers: bytecodes; # occurrences: first, second, third and later; for JOP: first, second and later]

We again calculate the execution cycles of our JDCO system and compare them with the DCO and original JOP systems. The experimental results are listed in Table VI and shown in Fig. 42. Because the calculation covers only the modified bytecodes, we need to know what fraction of all bytecodes they represent. By analyzing the benchmark programs, we find that this fraction is roughly 1/2. That is, the saving over all bytecodes is half of the saving measured on the modified bytecodes.

As shown in Table VI, our JDCO needs on average 20.8% fewer execution cycles for the modified bytecodes, so over all bytecodes it needs 10.4% fewer execution cycles than the original system. However, our JDCO needs slightly more microcode execution cycles than the DCO system. This is easily explained: compared with DCO, JDCO spends fewer execution cycles on the executed-only-once bytecodes, but it imposes a needless first-time overhead on all the other bytecodes.
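The scaling from the modified bytecodes to all bytecodes is a single multiplication, sketched below. The function name `overall_saving` and its parameter names are assumptions for this example. The inputs are the percentages quoted in the text (20.8% for microcode cycles and 22.2% for external memory accesses) and the rough fraction of 1/2.

```python
def overall_saving(saving_on_modified, modified_fraction=0.5):
    """Scale a saving measured on the modified bytecodes to all bytecodes.

    modified_fraction is the rough share of modified bytecodes (about 1/2
    according to the analysis of the benchmark programs).
    """
    return saving_on_modified * modified_fraction


print(round(overall_saving(20.8), 1))  # microcode execution cycles
print(round(overall_saving(22.2), 1))  # external memory access times
```

The two printed values are 10.4 and 11.1, the overall percentages reported in the text.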

TABLE VI. Microcode Execution Cycles of Bytecodes 180, 181, 182, 185 (unit: cycles)
[Table body not recoverable; surviving headers: benchmark, system, JOP; one surviving entry: 1.020]

Fig. 42. Microcode Execution Cycles of Bytecodes 180, 181, 182, 185 (axes: Java benchmark: Sieve, Kfl, UDP/IP; microcode execution cycles, 0 to 1000000; series: JOP, DCO, JDCO)

iv. External Memory Access Times

Besides the microcode execution cycles, another important factor in power consumption is the number of external memory accesses. As with the microcode execution cycles, the fewer external memory accesses there are, the more power is saved.

The calculation is similar to that of the microcode execution cycles. We list the external memory access times of each bytecode, separated by the number of occurrences, in Table VII, where each entry is the sum of memory reads and memory writes. The factor of 3 or 5 depends on the number of addresses we modify, because an address holds 32 bits. For example, if the address containing the modified bytecode holds “42 1 2 181”, we must also modify the next address because it contains the operands of bytecode 181. In the calculation, we use the average, 4.
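The "42 1 2 181" example above can be illustrated by counting how many 32-bit addresses one instruction spans. The sketch below assumes that the bytecode stream is packed four bytes per address, and that the instruction is three bytes long (opcode plus a two-byte index, as for getfield, putfield, and invokevirtual in the JVM instruction set). The function name `words_touched` is hypothetical, and the sketch does not attempt to reproduce how the factors 3 and 5 in the table were derived.

```python
WORD_BYTES = 4  # assumption: one 32-bit address holds four bytecode bytes


def words_touched(opcode_offset, instruction_length=3):
    """Return the indices of the 32-bit words covered by one instruction.

    opcode_offset is the byte offset of the opcode in the bytecode stream;
    instruction_length counts the opcode plus its operand bytes.
    """
    first = opcode_offset // WORD_BYTES
    last = (opcode_offset + instruction_length - 1) // WORD_BYTES
    return list(range(first, last + 1))


# "42 1 2 181": the opcode 181 sits in the last byte (offset 3) of word 0,
# so its two operand bytes fall into word 1.
print(words_touched(3))
# An opcode at the start of a word keeps its operands in the same word.
print(words_touched(0))
```

The first call returns `[0, 1]`, showing why the next address must also be modified; the second returns `[0]`.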

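Using this information together with the total execution times of the modified bytecodes of each benchmark, separated by the number of occurrences (Table V), we calculate the external memory access times in the same way as the microcode execution cycles, with the access times of each bytecode taking the place of its execution cycles.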

TABLE VII. External Memory Access Times of Each Bytecode (unit: times)
[Table body not recoverable; surviving headers: bytecodes; # occurrences: first, second, third and later; for DCO: first, second and later]

The experimental results are listed in Table VIII and shown in Fig. 43. Our JDCO system needs 22.2% fewer external memory accesses for the modified bytecodes, so over all bytecodes it needs 11.1% fewer external memory accesses. Compared with the DCO system, JDCO still needs slightly more external memory accesses, for the reason given in the previous subsection.

TABLE VIII. External Memory Access Times of Bytecodes 180, 181, 182, 185
[Table body not recoverable; one surviving entry: 1.012]

Fig. 43. External Memory Access Times of Bytecodes 180, 181, 182, 185 (axes: Java benchmark: Sieve, Kfl, UDP/IP; external memory access times, 0 to 60000; series: JOP, DCO, JDCO)
