Thesis Organization - Electronic System Level Design and Analysis for WiMAX Systems

Chapter 1 Introduction

1.2 Thesis Organization

The rest of this thesis is organized as follows. Chapter 2 introduces the fundamental concepts of the virtual platform environment that we use, including SystemC Transaction Level Modeling (TLM), the CoWare Platform Architect ESL tool, and the study case, an IEEE 802.16e baseband transmitter. Chapter 3 details the virtual platform we built and the application software, together with the techniques we used, and then describes several simulation configurations with figures. Chapter 4 discusses the simulation results of our analysis and makes suggestions. Finally, Chapter 5 presents the conclusion and future work.

Chapter 2 Fundamental Concepts

This chapter presents the fundamental concepts, including an introduction to SystemC TLM, CoWare Platform Architect, and the referenced SW code, an IEEE 802.16e baseband transmitter.

2.1 SystemC Transaction Level Modeling

Transaction Level Modeling (TLM) works at a higher abstraction level than Register Transfer Level (RTL). It is used to reduce SoC design complexity by raising the level of hardware abstraction, which helps system designers build larger systems. Several classifications of transaction level models are introduced by Lukai Cai and Daniel Gajski [4]. Several programming languages can be used at the transaction level, such as C/C++ and SystemVerilog, but SystemC is more convenient for programmers who are familiar with C/C++ because it is based on the C++ programming language. Since the referenced SW code was written in C++ and the CoWare Platform Architect ESL tool supports SystemC IP and the SystemC Modeling Library, we use SystemC at the transaction level.

SystemC is both a system-level and a hardware description language; it can model designs at RTL or at the algorithmic level. At the beginning of design development, the hardware simulation model needed at system level should have full functionality, just as an RTL or gate-level circuit does. It must define the correct circuit behavior according to the specification, but it does not need to consider how the real circuit is designed. For this reason, designing a TLM component should be easier than designing an RTL component with the same functionality, and the required simulation time should be shorter, too.

Figure 2.1 is an example of a TLM-based ESL design flow. The four major use cases for TLM and the particular ESL design task supported by each of them are introduced by Tim Kogel and Matthew Braun in [5]. In our study case, the Functional View (FV) corresponds to the referenced SW code for the IEEE 802.16e baseband TX; the Architect View (AV) corresponds to the platform environment; the Programmers View (PV) corresponds to application SW development on the platform; and the Verification View (VV) corresponds to the HW and SW implementation, which is not our concern at present. In other words, the main work in this thesis lies in the AV and PV use cases.

Figure 2.1: TLM based ESL design flowchart [5]

The OSCI TLM working group defines communication through function calls as the foundation of TLM [6]. This minimizes the number of events and the amount of information that have to be processed during simulation. A virtual platform has three main types of components: the instruction-set simulator (ISS), the bus model, and peripheral models such as memories and HW accelerators. Bus models for platform simulation offer different simulation speed/accuracy trade-offs in the PV, AV, and VV use cases. Because we need some timing information to estimate performance for architectural exploration, an AV bus model is suitable for our study case. The ISS and the bus model can come directly from well-developed ESL tools, but the application-specific peripheral models have to be created by ourselves. We model them in SystemC TLM with the Platform Architect interface (CoWare TLM API) rather than with a wrapper, to avoid the wrapping delay problem.
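To illustrate what communication through function calls means, the following is a minimal plain C++ sketch in which an initiator reaches a peripheral model through an interface method call instead of through signal events. It uses neither SystemC nor the CoWare TLM API; the names BusTargetIf, SimpleMemory, and copy_words are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical transaction interface: one function call per bus transfer.
struct BusTargetIf {
    virtual ~BusTargetIf() = default;
    virtual void write(std::uint32_t addr, std::uint32_t data) = 0;
    virtual std::uint32_t read(std::uint32_t addr) = 0;
};

// Minimal peripheral model: a sparse, word-addressed memory.
class SimpleMemory : public BusTargetIf {
public:
    void write(std::uint32_t addr, std::uint32_t data) override {
        assert(addr % 4 == 0);  // this sketch only models aligned word accesses
        words_[addr] = data;
    }
    std::uint32_t read(std::uint32_t addr) override {
        assert(addr % 4 == 0);
        auto it = words_.find(addr);
        return it == words_.end() ? 0u : it->second;  // unwritten words read as 0
    }

private:
    std::map<std::uint32_t, std::uint32_t> words_;
};

// Initiator side: copies n words and returns the number of transactions issued.
inline unsigned copy_words(BusTargetIf& bus, std::uint32_t src,
                           std::uint32_t dst, unsigned n) {
    unsigned transactions = 0;
    for (unsigned i = 0; i < n; ++i) {
        bus.write(dst + 4 * i, bus.read(src + 4 * i));
        transactions += 2;  // one read call plus one write call
    }
    return transactions;
}
```

Each transfer costs one function call, so the simulator has no per-signal events to schedule; any timing information would have to be added on top of this purely functional sketch.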

2.2 CoWare Platform Architect [7]


Figure 2.2 shows a simplified view of the ESL tool classification in [8], with three bins and meta-bins that consist of two or more bins. F stands for functionality, while P and M represent platform environment and mapping, respectively. Functionality indicates functional representations of a design that are completely independent of implementation architectures. Platform concerns the modules used to implement the functional description. Mapping refers to instances of the design in which the functionality has been assigned to a set of correctly interconnected modules. Under this classification, CoWare ConvergenSC (the predecessor of Platform Architect) is in meta-bin PM, which means it combines architectural services and mapping.

CoWare Platform Architect performs SystemC platform capture and reconfiguration through the Platform Creator graphical user interface (GUI), which provides rapid assembly and reconfiguration of hierarchical SoC platforms. It also supports architecture analysis for platform-driven ESL design. Details can be found in its datasheet [9].

In our study case, we use this ESL tool and its supported libraries to build a virtual platform. The Platform Creator GUI supports drag-and-drop assembly of SystemC transaction level platforms. We can regard it as an architecture modeler that creates and modifies SystemC transaction level architectures. The design procedure is as follows: place blocks and nodes, complete the connections, create the memory map, set the parameters for each block, and export the system. We can then build and run simulations on the exported system, which is a virtual platform.

The ISS we use is the ARM926EJS processor support package (PSP), and the bus model is the AMBA bus library (BL). The default TLM bus simulation model follows the AMBA 2.0 specification [10] and helps design for reusability [11]. We use the ARM Symbolic Debugger (ASD) as the simulation user interface, load an ARM executable image into the virtual platform, and run the system simulation with the ISS, the bus simulator, and the OSCI simulation kernel. System profiling integrates HW and SW profiling through the CoWare profiling utilities; in addition, we add some variables to the TLM models to gather the information we want.

2.3 IEEE 802.16e Baseband Transmitter

Our study case ports the IEEE 802.16e baseband referenced SW code, written in C++, to the virtual platform we built, under several different HW/SW configurations. In this section, we introduce both the standard and the referenced SW code.

WiMAX is the abbreviation of Worldwide Interoperability for Microwave Access, a name for Wireless Metropolitan Area Network (WMAN) communication devices that follow the IEEE 802.16 wireless communication standard [12]. IEEE 802.16 can be divided into two parts: fixed WMAN and mobile WMAN. IEEE 802.16e is the standard for mobile WMAN, based on IEEE 802.16-2004 with added mobility support. Because of this mobility, a WiMAX 16e system is suitable for portable devices and can be used in vehicles.

Our referenced SW code is composed of three parts: the transmitter (TX), the channel model, and the receiver (RX). It simulates IEEE 802.16e baseband transmission by randomly creating information bits, passing them through the TX part to produce the transmit signals, and then putting the transmit signals through the multipath channel model at different velocities (V) and signal-to-noise ratios (SNR) to simulate the received signals. After the received signals are demodulated by the RX part, the decided bits are compared with the information bits to obtain the average bit error rate (BER).
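The final comparison step can be sketched as follows. This is only an illustration of how a BER figure is typically computed from information bits and decided bits; the function names bit_error_rate and average_ber are hypothetical and are not taken from the referenced SW code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fraction of positions where the decided bit differs from the information bit.
inline double bit_error_rate(const std::vector<int>& info_bits,
                             const std::vector<int>& decided_bits) {
    assert(!info_bits.empty() && info_bits.size() == decided_bits.size());
    std::size_t errors = 0;
    for (std::size_t i = 0; i < info_bits.size(); ++i) {
        if (info_bits[i] != decided_bits[i]) ++errors;
    }
    return static_cast<double>(errors) / static_cast<double>(info_bits.size());
}

// Average BER over several simulated frames (hypothetical helper).
inline double average_ber(const std::vector<double>& per_frame_ber) {
    assert(!per_frame_ber.empty());
    double sum = 0.0;
    for (double b : per_frame_ber) sum += b;
    return sum / static_cast<double>(per_frame_ber.size());
}
```

In a full simulation, one such average would be collected for each combination of velocity and SNR.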

Figure 2.3 shows a brief flowchart of the referenced SW code. The code is useful at the algorithmic level, where programmers verify their algorithms and evaluate their performance, but that is not our concern.

Figure 2.3: Brief flowchart of referenced SW

We are working on architecture exploration, gathering architectural and performance analyses in order to find a suitable system configuration for running the prescribed algorithms on portable devices. The first thing we have to do is therefore to trace the referenced SW code and calculate the execution time boundary according to the specification. To this end, we cut the channel model part and the RX part out of the referenced SW code, keep only the TX part, and modify it to run a single iteration for later analysis. The functional blocks and iteration flow are shown in Figure 2.4.

Figure 2.4: Functional block and iteration flow of TX

The TX transmits one frame during the iteration, and each frame contains two Orthogonal Frequency Division Multiplexing (OFDM) symbols. Thus the execution time boundary is equal to two OFDM symbol times. Equation (2.1) gives the OFDM symbol time, where the useful symbol time follows from equations (2.2) and (2.3). The length of the Guard Interval (GI) and the OFDM symbol length in the referenced SW code are 64 and 320, respectively, so the useful part has length 320 - 64 = 256 and the GI ratio is $G = T_g / T_b = 64/256 = 1/4$. Table 2.1 lists the bandwidth (BW) for different Fast Fourier Transform (FFT) sizes and the corresponding OFDM symbol time. In our performance analysis, we must make sure that the whole TX function completes within the time boundary.

OFDM symbol time: $T_s = (1 + G)\,T_b$ (2.1)

Useful symbol time: $T_b = N_{FFT} / F_s$ (2.2)

Sampling frequency: $F_s = \mathrm{floor}\left(\frac{8}{7} \times BW \times \frac{1}{8000}\right) \times 8000$ (2.3)

Table 2.1: Execution time boundary in different FFT sizes

NFFT        128          256          512          1024         2048
BW          1.25 MHz     2.5 MHz      5 MHz        10 MHz       20 MHz
Fs          1.424 MHz    2.856 MHz    5.712 MHz    11.424 MHz   22.856 MHz
Tb          89.8876 μs   89.6358 μs   89.6358 μs   89.6358 μs   89.6045 μs
G           0.25         0.25         0.25         0.25         0.25
Ts          112.3595 μs  112.0447 μs  112.0447 μs  112.0447 μs  112.0056 μs
Boundary    224719 ns    224089 ns    224089 ns    224089 ns    224011 ns
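The entries of Table 2.1 can be reproduced from equations (2.1) to (2.3) with a short calculation. The sketch below is a hypothetical helper, not part of the referenced SW code; it assumes a GI ratio of 0.25 and two OFDM symbols per frame as stated in the text, and it truncates the boundary to whole nanoseconds, which matches the values in the table.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Timing figures for one FFT size / bandwidth pair, following (2.1)-(2.3).
struct OfdmTiming {
    double fs_hz;              // sampling frequency Fs
    double tb_us;              // useful symbol time Tb
    double ts_us;              // OFDM symbol time Ts
    std::int64_t boundary_ns;  // execution time boundary, truncated to whole ns
};

// Hypothetical helper; the parameter names are chosen for this illustration.
inline OfdmTiming ofdm_timing(int n_fft, double bw_hz, double gi_ratio = 0.25,
                              int symbols_per_frame = 2) {
    assert(n_fft > 0 && bw_hz > 0.0);
    OfdmTiming t{};
    t.fs_hz = std::floor(8.0 / 7.0 * bw_hz / 8000.0) * 8000.0;  // (2.3)
    t.tb_us = n_fft / t.fs_hz * 1e6;                             // (2.2)
    t.ts_us = (1.0 + gi_ratio) * t.tb_us;                        // (2.1)
    t.boundary_ns =
        static_cast<std::int64_t>(symbols_per_frame * t.ts_us * 1e3);
    return t;
}
```

For example, ofdm_timing(128, 1.25e6) yields a sampling frequency of 1.424 MHz and a boundary of 224719 ns, in agreement with the first column of the table.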

Chapter 3 Virtual Platform

After the introduction in the previous chapters, we can regard the virtual platform as a SW model of a HW SoC platform. We also know what a virtual platform can do and what its characteristics are. We try to confirm them in an experimental study case: we carry out HW platform architecture exploration and optimization, SW development, debugging, and optimization, and HW/SW co-design, and we additionally experience the high simulation speed, the flexibility, and the usability for users who are not experts in HW design.

In the rest of this chapter, we describe in detail the virtual platform we used and the analysis flow of the study case step by step, and then give several different HW/SW partition configurations. All the techniques that we used to improve system performance are also discussed in this chapter.

3.1 Platform Built-up

Figure 3.1 shows the block diagram of our initial virtual platform. We use CoWare Platform Architect and its supported IP libraries (ARM926EJS_AHB_PSP, AMBA BL, Auxiliary, Peripherals) to build it [13]. The design procedure is to place blocks and nodes, complete the connections, create the memory map, set the parameters, and export the system as a virtual platform, which we can then use to build and run simulations.

Figure 3.1: Block diagram of referenced platform

For the later analysis of HW/SW partitioning, we put some effort into building our own library [14] of SystemC TLM models, and we use it to build our virtual platform in different system configurations. The user-defined block library includes memory TLM models and HW accelerator TLM models. The next two subsections introduce these TLM models in detail.

3.1.1 Memory Model

In order to gather memory statistics such as read/write access counts and the minimum memory space needed, we modified the SystemC memory TLM model in the Peripheral block library from the CoWare Training Class. Figure 3.2 illustrates the memory model, and Figures 3.3 and 3.4 show parts of the memory module's source code. The highlighted (inverse) text in those figures marks what we modified, which corresponds to the Access counter and Max address blocks in Figure 3.2.

[Figure 3.2 block labels: p_AHB, receiveWriteData, sendReadData, sendDelayedEoT, Memory, Access counter, Max address]

Figure 3.2: Block diagram of memory model

    …
    SC_MODULE(memory_AHB_TLM) {
        AMBA::AHBLiteTarget_inoutslave_port<26, 32> p_AHB;
        …
        memory_AHB_TLM(sc_module_name name_, const int _nr_of_wait_states)
            : sc_module(name_),
              p_AHB("p_AHB"),
              nr_of_wait_states(_nr_of_wait_states) {
            …
            rdcnt = 0;
            wtcnt = 0;
            maxaddr = 0;
        } //memory_AHB_TLM()
        …
        unsigned int rdcnt, wtcnt, maxaddr;
    }; //SC_MODULE(memory_AHB_TLM)

Figure 3.3: Part of memory_AHB_TLM.h

        …
            p_AHB.ReadDataTrf->setReadData(wtcnt);
        } else if (address == 0x3fffffc) {
            p_AHB.ReadDataTrf->setReadData(rdcnt);
        } else if (address == 0x3fffff4) {
            p_AHB.ReadDataTrf->setReadData(maxaddr);
        } else {
            …
            if ((address < 0x3ffffe0) && ((address + accessSize/8) > maxaddr))
                maxaddr = address + accessSize/8;
            rdcnt++;
        }
        …
    } //memory_AHB_TLM::sendReadData()

Figure 3.4: Part of memory_AHB_TLM.cpp

To obtain the statistics we are interested in, we first introduce two counters: rdcnt for the read access count and wtcnt for the write access count. The counter rdcnt is incremented by 1 for each memory read handled by sendReadData(), while wtcnt is incremented by 1 on every receiveWriteData() function call. For the second purpose, we set the address width of port p_AHB to 26 bits so that the model fills the memory map with 64 MB of memory space, and then use a register maxaddr to record the maximum address accessed during the simulation, in order to evaluate the memory space the system needs. To account for different access sizes, the maximum address of a memory access is taken as the sum of the variable address and accessSize/8. The three values can be obtained by reading the last 12 bytes of the memory model. Finally, the variable nr_of_wait_states, together with the sendDelayedEoT() function, determines the latency of memory accesses when using the CoWare TLM API.
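The bookkeeping described above can be isolated from the bus interface, as in the following plain C++ sketch. The class MemoryStats and its method names are hypothetical; the guard value 0x3ffffe0 and the accessSize/8 arithmetic come from Figure 3.4, while the symmetric treatment of writes is an assumption, since the write path is not shown in the figures.

```cpp
#include <cassert>
#include <cstdint>

// Plain C++ stand-in for the access counters and the max-address register.
class MemoryStats {
public:
    // Accesses at or above this address do not update maxaddr
    // (the guard value that appears in Figure 3.4).
    static constexpr std::uint32_t kStatsRegionBase = 0x3ffffe0;

    void on_read(std::uint32_t address, unsigned access_size_bits) {
        track_max(address, access_size_bits);
        ++rdcnt_;
    }

    // Assumption: writes are tracked in the same way as reads.
    void on_write(std::uint32_t address, unsigned access_size_bits) {
        track_max(address, access_size_bits);
        ++wtcnt_;
    }

    unsigned read_count() const { return rdcnt_; }
    unsigned write_count() const { return wtcnt_; }
    std::uint32_t max_address() const { return maxaddr_; }

private:
    void track_max(std::uint32_t address, unsigned access_size_bits) {
        assert(access_size_bits > 0 && access_size_bits % 8 == 0);
        const std::uint32_t end = address + access_size_bits / 8;
        if (address < kStatsRegionBase && end > maxaddr_) maxaddr_ = end;
    }

    unsigned rdcnt_ = 0;
    unsigned wtcnt_ = 0;
    std::uint32_t maxaddr_ = 0;
};
```

After a simulation run, max_address() gives an estimate of the memory space the application actually touched, which is the quantity the maxaddr register is meant to expose.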

3.1.2 Model Template

In addition to modifying the memory TLM model, we also created a template for modeling HW accelerators, illustrated in Figure 3.5. The TLM model template has two main purposes. The first is similar to that of the memory model: it uses the counters rdcnt and wtcnt to record the read and write access counts of a HW accelerator TLM model. Although this information could also be obtained by tracing the SW code or by recording it in SW, recording it in the TLM model is more convenient and better for analysis, because it adds no extra memory accesses.

The second purpose is to make sendReadData() a multi-cycle access function and to add the system clock to its sensitivity list, which ensures that the data read from the model is correct and lets us simulate the HW delay if needed. Figure 3.6 is the header file of the HW accelerator TLM model template, followed by the function code of the model template in Figures 3.7, 3.8, and 3.9.

[Figure 3.5 block labels: p_AHB, receiveWriteData, sendReadData, Delay Process, other functions, ready, tri_w, tri_r]

Figure 3.5: Block diagram of HW model template

    SC_MODULE(HW_AHB_TLM) {
        …
            sc_module(name_), p_AHB("p_AHB"), clk_p("clk_p"), rst_n("rst_n") {
            SC_METHOD(receiveWriteData);
            sensitive << p_AHB.getReceiveWriteDataTrfEventFinder();
            dont_initialize();

            SC_THREAD(sendReadData);
            sensitive_pos << clk_p;
            sensitive << ready;
            dont_initialize();

            SC_METHOD(sendEoT);
            sensitive << p_AHB.getSendEotTrfEventFinder();
            dont_initialize();

            SC_METHOD(checkReady);
            sensitive_pos << clk_p;
            sensitive_neg << rst_n;
            sensitive << flag;
            dont_initialize();

            SC_METHOD(cl_flag);
            sensitive << tri_w << tri_r;
            dont_initialize();
        …

#define DELAY_HW 0

Figure 3.7: HW accelerator TLM model template (function code part 1)

    #define Addr_name 0
    …
            cout << "Access Error!" << endl;
        }
        wtcnt++;
        if (p_AHB.getEotTrf()) p_AHB.sendEotTrf();
    } //HW_AHB_TLM::receiveWriteData()

Figure 3.8: HW accelerator TLM model template (function code part 2)

    void HW_AHB_TLM::sendReadData() {
        while (1) {
            if (p_AHB.getReadDataTrf()) {
                unsigned int address = p_AHB.ReadDataTrf->getAddrTrf()->getAddress();
                unsigned long long datatemp;
                switch (address) {
                    …
                        p_AHB.ReadDataTrf->setReadData(datatemp);
                        break;
                    …
                    default:
                        cout << "Access Error!" << endl;
                        p_AHB.ReadDataTrf->setReadData(0);
                }
                rdcnt++;
                p_AHB.sendReadDataTrf();
            }
            …
        }
    } //HW_AHB_TLM::sendReadData()

    void HW_AHB_TLM::sendEoT() {
        // p_AHB.getEotTrf();
        // p_AHB.sendEotTrf();
    } //HW_AHB_TLM::sendEoT()

Figure 3.9: HW accelerator TLM model template (function code part 3)

We add the clock clk_p and the reset rst_n to the TLM model template as input signals. The cl_flag() function models combinational logic with an exclusive-or gate, raising the signal flag while tri_w is not equal to tri_r. The value of tri_w is inverted when the model finishes receiving input data and starts its behavior, and tri_r is inverted when the model finishes sending output data and prepares for the next inputs. The checkReady() function models sequential logic: while flag is true, it counts up the counter clkcount and raises the signal ready once clkcount is larger than or equal to the defined delay cycles of the model; otherwise, ready remains low.
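The interplay of tri_w, tri_r, flag, and ready can be tried out without a SystemC kernel, as in the following plain C++ sketch. The class ReadyLogic is hypothetical. Two details are assumptions because the corresponding source is not shown: the counter is incremented before it is compared with the delay, and it is cleared whenever flag is low.

```cpp
#include <cassert>

// Plain C++ stand-in for the cl_flag()/checkReady() pair of the template.
class ReadyLogic {
public:
    explicit ReadyLogic(unsigned delay_cycles) : delay_(delay_cycles) {}

    // Input data fully received: the modeled HW starts working.
    void toggle_write() { tri_w_ = !tri_w_; }

    // Output data fully sent: the model is ready for the next inputs.
    void toggle_read() {
        assert(flag());  // sanity check of this sketch: work must be pending
        tri_r_ = !tri_r_;
    }

    // Combinational part: exclusive-or of the two toggle signals.
    bool flag() const { return tri_w_ != tri_r_; }

    // Sequential part, evaluated once per rising clock edge.
    void clock_edge() {
        if (flag()) {
            ++clkcount_;
            ready_ = clkcount_ >= delay_;
        } else {
            clkcount_ = 0;  // assumption: the counter clears while idle
            ready_ = false;
        }
    }

    bool ready() const { return ready_; }

private:
    unsigned delay_;
    unsigned clkcount_ = 0;
    bool tri_w_ = false;
    bool tri_r_ = false;
    bool ready_ = false;
};
```

With a delay of three cycles, ready stays low for two clock edges after toggle_write() and goes high on the third, which is the point at which a pending read transaction would be allowed to complete.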

The sendReadData() function is an SC_THREAD process, which allows multi-cycle accesses: a read transaction cannot get data until the ready signal is raised. The end of each transaction is therefore handled by the sendReadData() and receiveWriteData() functions, and the sendEoT() function does nothing.

Finally, the counters rdcnt and wtcnt are used in the same way as in the memory model described in the previous subsection. This completes the template for HW accelerator TLM models. Almost all HW TLM models in our virtual platform are created from this template and inherit its properties.

3.2 Software Optimization and System Profiling

After building a virtual platform, we still need to build the application software to run simulations. In this section, we describe step by step the SW optimization methods adopted during system profiling and performance analysis. The HW/SW interface is introduced at the point where HW accelerators are first used.

We want to find the system bottleneck and then deal with it. Function profiling of the referenced SW is the first step; it gives us information such as the number of function calls and the function execution time. SW profiling results depend on many factors, from SW to HW and from the compiler to the microprocessor and other components. In order to make the profiling comprehensive, we integrate it with system performance analysis. That is, SW profiling is carried out while porting the application SW to the virtual platform, using the CoWare profiling utilities. According to the profiling results, we can optimize the execution time of the application SW functions by reducing the number of function calls, the number of instructions per function, and so on.

Figure 3.10: Block diagram of platform with channel model

Figure 3.10 is the block diagram of our platform after the first step. In this platform, the ROM and RAM are modeled by the modified memory TLM model, the APB and the Display models are removed, and the Channel Model is added on the AHB. The Display model is not necessary here, because we can use the semihosting supported by the ARM926 PSP to display values for debugging. The Channel Model replaces the corresponding part of the referenced SW code because of its large amount of floating-point arithmetic. During our exploration, we do not need to consider the precision or the algorithm at this stage, but we must make sure that the program result remains correct whenever we make a change to increase performance. Thus the Channel Model uses 64-bit data and its I/O is mapped to AHB address 0x10000000, so that we do not need to change the data types in the SW code.

Figure 3.11 is an example of a SW function that communicates with the Channel Model HW accelerator, and Figure 3.12 shows the data type reinterpretation in the HW accelerator TLM model.

void Channel(Complex tx_data, Complex rx_data)

Figure 3.11: Communicate SW with HW accelerator

    unsigned int access;
    …
        tempin |= p_AHB.WriteDataTrf->getWriteData() << 32;
        datain = reinterpret_cast<double &>(tempin);
    }
    …
        p_AHB.ReadDataTrf->setReadData(datatemp >> 32);
    } else {
        p_AHB.ReadDataTrf->setReadData(datatemp);
    } …
    } //HW_AHB_TLM::sendReadData()

Figure 3.12: Data type reinterpretation in TLM model
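The idea behind this reinterpretation, moving a 64-bit double across a 32-bit data bus as two words without changing its bit pattern, can be sketched in standalone C++ as follows. The sketch uses std::memcpy instead of the reinterpret_cast shown in Figure 3.12, which is a substitution made here to stay within the C++ aliasing rules. The names WordPair, split_double, join_words, and check_round_trip are hypothetical, and sending the low word before the high word is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "this sketch assumes a 64-bit double");

// The two 32-bit bus words that together carry one double.
struct WordPair {
    std::uint32_t low;
    std::uint32_t high;
};

// SW side: split the bit pattern of a double into two bus words.
inline WordPair split_double(double value) {
    std::uint64_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);
    return WordPair{static_cast<std::uint32_t>(bits),
                    static_cast<std::uint32_t>(bits >> 32)};
}

// Model side: reassemble the words and reinterpret them as a double.
inline double join_words(std::uint32_t low, std::uint32_t high) {
    const std::uint64_t bits = (static_cast<std::uint64_t>(high) << 32) | low;
    double value = 0.0;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}

// Usage example: a value survives the trip through two 32-bit transfers.
inline void check_round_trip(double value) {
    const WordPair w = split_double(value);
    assert(join_words(w.low, w.high) == value);  // holds for any non-NaN value
}
```

Because only the bit pattern is transferred, the accelerator model computes on exactly the same double values as the original SW function, which is why the SW data types can stay unchanged.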

Since the HW TLM model has the same behavior as the original SW functions, we can simply copy them from the SW into the HW TLM model with small changes to the I/O. In this case, adopting the Channel Model HW accelerator TLM model decreases the simulation time by about 30k seconds, bringing its share of the total execution time from over 96% to under 0.4%, and reduces ROM accesses by 98.4% and RAM accesses by 94.6%. Although the HW TLM model does not consider timing and is not cycle-accurate, the improvement in simulation speed and the advantages of using it are obvious, and it can also be used in the simulation of a HW/SW co-design by users who may not even know how to design HW.
