4.6 Evaluation of the System Performance

This section presents how the system performance is evaluated. In particular, several evaluation metrics are introduced in our simulation infrastructure. The optimized design alternative is selected based on the concept of Pareto-based multi-objective optimization.
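As a rough illustration of what Pareto-based selection means for the three metrics used in this section, the following Python sketch filters a set of design alternatives down to the non-dominated ones. The `Design` record, the field names, and all numbers are hypothetical and serve only as an example; this is not the selection procedure of the actual exploration flow.

```python
from typing import NamedTuple

class Design(NamedTuple):
    name: str           # hypothetical label of a design alternative
    cycles: float       # average execution cycles (lower is better)
    gates: float        # equivalent gate count (lower is better)
    utilization: float  # hardware utilization in [0, 1] (higher is better)

def dominates(a: Design, b: Design) -> bool:
    """True if a is no worse than b on every metric and strictly better on one."""
    no_worse = (a.cycles <= b.cycles and a.gates <= b.gates
                and a.utilization >= b.utilization)
    strictly_better = (a.cycles < b.cycles or a.gates < b.gates
                       or a.utilization > b.utilization)
    return no_worse and strictly_better

def pareto_front(designs):
    """Keep only the designs that no other design dominates."""
    return [d for d in designs
            if not any(dominates(other, d) for other in designs)]

# Made-up numbers, for illustration only.
candidates = [
    Design("A", cycles=500, gates=120_000, utilization=0.60),
    Design("B", cycles=450, gates=150_000, utilization=0.70),
    Design("C", cycles=520, gates=130_000, utilization=0.55),  # dominated by A
    Design("D", cycles=450, gates=150_000, utilization=0.65),  # dominated by B
]
front_names = [d.name for d in pareto_front(candidates)]
```

Designs A and B survive because each is better than the other on at least one metric. Choosing between them remains a designer's trade-off.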

4.6.1 Evaluation Metric

To evaluate the system performance, we consider the following design metrics: (1) processing speed, (2) silicon area, and (3) hardware utilization. The processing speed is used to measure the system throughput. To measure the speed, we use the average execution cycle counts of one

pipeline is dominated by the following four modules: IQIDCT, IIP, DB, and DRAM access.

The silicon area is used to measure the hardware cost. Although the actual chip size is not available until the chip is taped out, we use the equivalent gate count to estimate the silicon area at the system level. The equivalent gate count includes the cost of both the local memories and the logic circuits. One problem in [48][49][50][51][52][53][54][55][56][57][58][59] is that the cost of the memory units and the cost of the logic circuits are not normalized to the same scale.

Specifically, the cost of memory is measured by the number of storage bits, while the cost of the logic circuit is measured by the synthesized gate count. In some cases, a smaller logic circuit is obtained at the expense of a higher memory cost [51][57]. Therefore, we normalize these two cost factors to the equivalent gate count for a fair comparison.
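The normalization can be sketched in Python as follows. The conversion factor `GATES_PER_BIT` and all numbers are assumptions made for illustration only; the real factor depends on the memory type and the process technology and is not given here.

```python
def equivalent_gate_count(logic_gates: int, memory_bits: int,
                          gates_per_bit: float) -> float:
    """Express logic and memory cost on one scale (equivalent gates)."""
    return logic_gates + memory_bits * gates_per_bit

# Assumed conversion factor (hypothetical value, for illustration only).
GATES_PER_BIT = 1.5

# Design B trades logic for memory: fewer gates, twice the storage bits.
design_a = equivalent_gate_count(logic_gates=40_000, memory_bits=8_192,
                                 gates_per_bit=GATES_PER_BIT)
design_b = equivalent_gate_count(logic_gates=30_000, memory_bits=16_384,
                                 gates_per_bit=GATES_PER_BIT)
```

With these made-up numbers, design B has the smaller logic circuit but the larger total cost once its memory is counted on the same scale. A comparison based on the gate count alone would hide this.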

The hardware utilization is used to evaluate the efficiency of a design and to determine how well the explored modules are utilized. The higher the hardware utilization, the lower the overhead in the individual design. Conventionally, the utilization is the ratio of the computational cycles to the execution cycles of the system. However, this factor is not scaled according to the cost of each module. An unbalanced design alternative can increase the utilization of a low-cost module while the high-cost modules are significantly underutilized. Because multiple components jointly determine the system utilization, the cost of each module should be taken into account when assessing the impact of the utilization on the overall performance.

We use the average utilization normalized by the cost to represent the balance of a design combination. Let X be the random variable that represents how the hardware cost is utilized among the M components to be explored, and let x[i] be its value at the i-th cycle:

$$x[i] = \frac{1}{C} \sum_{j=1}^{M} C_j \, u_j[i]$$

where the symbols C_j and u_j[i] represent the hardware cost of the j-th component and its utilization at the i-th cycle, respectively. The symbol C represents the overall cost of the M components. We further calculate the mean value of X over the total execution period of N cycles as follows.

$$\bar{X} = \frac{1}{N} \sum_{i=1}^{N} x[i] = \sum_{j=1}^{M} \frac{C_j}{C} \, U_j$$

where C_j/C is the proportional cost of the j-th component and U_j is the average hardware utilization of the j-th component. A designer should keep the modules of a pipelined system balanced so that the hardware is utilized more efficiently and the system throughput improves. Therefore, we use the area-weighted hardware utilization, which is the sum of the products of proportional cost and utilization, to represent the overall balanced system performance.
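A small Python sketch can make the area-weighted utilization concrete. It assumes that each component reports a per-cycle utilization of 0 (idle) or 1 (busy). The function names, the costs, and the traces are hypothetical. The sketch computes the metric in two ways, per cycle and per component, to show that both give the same value.

```python
def area_weighted_utilization(costs, traces):
    """Sum over components of (C_j / C) * U_j.

    costs[j]     : hardware cost C_j of component j
    traces[j][i] : utilization u_j[i] of component j at cycle i (0 or 1)
    """
    total_cost = sum(costs)
    n_cycles = len(traces[0])
    # U_j: average utilization of component j over the N cycles
    avg_util = [sum(trace) / n_cycles for trace in traces]
    return sum(c / total_cost * u for c, u in zip(costs, avg_util))

def mean_per_cycle_utilization(costs, traces):
    """Same quantity, computed as the mean over cycles of the utilized cost share."""
    total_cost = sum(costs)
    n_cycles = len(traces[0])
    per_cycle = [sum(c * trace[i] for c, trace in zip(costs, traces)) / total_cost
                 for i in range(n_cycles)]
    return sum(per_cycle) / n_cycles

# Made-up example: a cheap module that is always busy and an
# expensive module that is busy in one cycle out of four.
costs = [10_000, 90_000]
traces = [[1, 1, 1, 1],
          [1, 0, 0, 0]]

weighted = area_weighted_utilization(costs, traces)
per_cycle_mean = mean_per_cycle_utilization(costs, traces)
unweighted = sum(sum(t) / len(t) for t in traces) / len(traces)
```

In this example the unweighted average of the two utilizations is 0.625, which looks acceptable, while the area-weighted value is about 0.325 because the expensive module is mostly idle. This is the kind of imbalance the cost weighting is meant to expose.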

4.6.2 Simulation Infrastructure

To effectively evaluate the system performance of the numerous design alternatives, we employ the transaction level modeling (TLM) technique to increase the simulation speed. In particular, the entire system is modeled at mixed levels of abstraction and simulated using CoWare ConvergenSC™ [60].

4.6.2.1 System Simulation with Mixed Levels of Abstraction

Traditionally, coding at a lower level of abstraction, such as the register transfer level (RTL), is costly, and the resulting simulation is very time consuming. However, because we need to check at each cycle whether the DRAM access violates the timing constraints, the memory subsystem, including the DRAM and the DRAM controller, is simulated at the cycle-accurate RTL using Verilog.

On the other hand, the remaining parts of the system are modeled at a higher level of abstraction, known as the transaction level, with SystemC [61] to achieve a faster simulation speed.

In the SystemC implementation, the interface between modules and the granularity of data exchange are defined. To synchronize modules at different levels of abstraction, the data exchanges are triggered by calls to the interface functions bound to the ports of a module and are carried out by the implementation of the channels that connect the ports of different modules. In addition, the functionality within a module is described by a set of concurrent processes with annotated timing information. Specifically, within a module, a task is modeled as the task execution followed by N waiting cycles, where N is annotated from the associated hardware architecture, as shown in Figure 4.14. Compared to RTL, TLM significantly reduces the design regression cycle while offering sufficient simulation accuracy.
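The idea of a task execution followed by N annotated waiting cycles can be mimicked outside SystemC. The Python sketch below is only an analogy and is not the SystemC model itself: a generator stands in for a concurrent process, `yield wait_cycles` plays the role of a timed wait, and `simulate` is a hypothetical minimal scheduler. Ports, channels, and inter-stage synchronization are deliberately left out, and the stage names and cycle counts are made up.

```python
import heapq

def annotated_task(n_tasks, wait_cycles):
    """A process that runs n_tasks tasks, each followed by an annotated delay."""
    for _ in range(n_tasks):
        # The functional behaviour of the task would run here,
        # consuming zero simulated time.
        yield wait_cycles  # timing annotation: block for N cycles

def simulate(processes):
    """Run all processes concurrently; return the finish cycle of each.

    processes: dict mapping a process name to its generator.
    """
    # Each queue entry is (wake-up cycle, tie-breaker, process name).
    queue = [(0, order, name) for order, name in enumerate(processes)]
    heapq.heapify(queue)
    finish = {}
    while queue:
        now, order, name = heapq.heappop(queue)
        try:
            delay = next(processes[name])  # execute one task
        except StopIteration:
            finish[name] = now             # process has no tasks left
            continue
        heapq.heappush(queue, (now + delay, order, name))
    return finish

finish = simulate({
    "stage1": annotated_task(n_tasks=3, wait_cycles=4),
    "stage2": annotated_task(n_tasks=3, wait_cycles=6),
})
```

Because only the annotated delays advance the simulated time, the cost of the simulation grows with the number of tasks rather than with the number of clock cycles. This is the main reason a timing-annotated model runs faster than a cycle-by-cycle one.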

In summary, the simulation with mixed levels of abstraction allows a system designer to find the best trade-off between the simulation speed and the modeling accuracy. The mixed level of abstraction not only considers the impact of the external memory but also simulates faster than a pure RTL simulation, so that we can effectively perform the architectural evaluation.

4.6.2.2 Platform Creation

After the model creation, all the modules in the proposed H.264/AVC decoder are assembled and simulated on the Platform Architect of CoWare's ConvergenSC™, as shown in Figure 4.15. In particular, the external memory interface modeled in Verilog is instantiated in the SystemC-based platform using a proxy module, which acts as a wrapper and deals with the

[Figure labels: Behavior1; Stage1, Stage2, Stage3; synchronization event @ posedge clk; Read_if (ping-pong index); data and control]

Figure 4.14: The SystemC Implementation with the Annotated Timing
