
Chapter 3 Virtual Platform

3.1 Platform Built-up

3.1.1 Memory Model

To gather memory statistics such as read/write access counts and the minimum required memory space, we modified the SystemC memory TLM model from the Peripheral block library of the CoWare Training Class. Figure 3.2 illustrates the memory model, and Figures 3.3 and 3.4 show parts of the memory module's source code. The highlighted (reverse-video) text marks our modifications, which correspond to the Access counter and Max address blocks in Figure 3.2.

[Figure 3.2 labels: p_AHB; receiveWriteData; sendReadData; sendDelayedEoT; Memory; Access counter; Max address]

Figure 3.2: Block diagram of memory model

… 

SC_MODULE(memory_AHB_TLM) {
    AMBA::AHBLiteTarget_inoutslave_port<26, 32> p_AHB;
    …
    memory_AHB_TLM(sc_module_name name_, const int _nr_of_wait_states) :
        sc_module(name_),
        p_AHB("p_AHB"),
        nr_of_wait_states(_nr_of_wait_states) {
        …
        rdcnt = 0;
        wtcnt = 0;
        maxaddr = 0;
    } //memory_AHB_TLM()
    …
    unsigned int rdcnt, wtcnt, maxaddr;
}; //SC_MODULE(memory_AHB_TLM)

Figure 3.3: Part of memory_AHB_TLM.h

… 

        p_AHB.ReadDataTrf->setReadData(wtcnt);
    } else if (address == 0x3fffffc) {
        p_AHB.ReadDataTrf->setReadData(rdcnt);
    } else if (address == 0x3fffff4) {
        p_AHB.ReadDataTrf->setReadData(maxaddr);
    } else {
        …
        if ((address < 0x3ffffe0) && ((address + accessSize/8) > maxaddr))
            maxaddr = address + accessSize/8;
        rdcnt++;
    }
    …
} //memory_AHB_TLM::sendReadData()

Figure 3.4: Part of memory_AHB_TLM.cpp

To obtain these statistics, we first introduce two counters: rdcnt for read accesses and wtcnt for write accesses. The counter rdcnt is incremented by 1 on every sendReadData() call, and wtcnt is incremented by 1 on every receiveWriteData() call. Second, we set the address width of port p_AHB to 26 bits so that the memory map covers a 64 MB address space, and we use a register maxaddr to record the highest address accessed during simulation, which gives an estimate of the memory space the system needs. To account for different access sizes, the highest address touched by a memory access is taken as the sum of the variable address and accessSize/8. The three values can be read back from the last 12 bytes of the memory model. Finally, the variable nr_of_wait_states and the sendDelayedEoT() function determine the delay latency of memory accesses when using the CoWare TLM API.
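The bookkeeping described above can be sketched in plain C++, without SystemC or the CoWare TLM API. This is a minimal illustration under stated assumptions, not the thesis code: the class name StatMemory and its methods are hypothetical, the register address for wtcnt (0x3fffff8) is assumed because it is elided in Figure 3.4, and updating maxaddr on writes as well as reads is also an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Plain C++ stand-in for the statistics logic; no SystemC or CoWare API is used.
class StatMemory {
public:
    // Statistics registers in the last 12 bytes of the 64 MB (2^26-byte) space.
    static constexpr std::uint32_t kMaxAddrReg = 0x3fffff4;
    static constexpr std::uint32_t kWtCntReg   = 0x3fffff8;  // assumed address
    static constexpr std::uint32_t kRdCntReg   = 0x3fffffc;
    // Accesses at or above this address do not update maxaddr.
    static constexpr std::uint32_t kTrackLimit = 0x3ffffe0;

    void write(std::uint32_t address, std::uint32_t data, unsigned accessSizeBits) {
        assert(accessSizeBits % 8 == 0);
        track(address, accessSizeBits);  // assumption: writes update maxaddr too
        store_[address] = data;
        ++wtcnt_;
    }

    std::uint32_t read(std::uint32_t address, unsigned accessSizeBits) {
        // Reading a statistics register does not count as a memory access.
        if (address == kWtCntReg)   return wtcnt_;
        if (address == kRdCntReg)   return rdcnt_;
        if (address == kMaxAddrReg) return maxaddr_;
        assert(accessSizeBits % 8 == 0);
        track(address, accessSizeBits);
        ++rdcnt_;
        auto it = store_.find(address);
        return it == store_.end() ? 0u : it->second;
    }

private:
    // The highest byte touched is address + accessSize/8, as in Figure 3.4.
    void track(std::uint32_t address, unsigned accessSizeBits) {
        const std::uint32_t end = address + accessSizeBits / 8;
        if (address < kTrackLimit && end > maxaddr_) maxaddr_ = end;
    }

    std::unordered_map<std::uint32_t, std::uint32_t> store_;
    std::uint32_t rdcnt_ = 0, wtcnt_ = 0, maxaddr_ = 0;
};
```

In this sketch, application SW would read the three register addresses at the end of a run to retrieve the counts, mirroring the read-back of the last 12 bytes described above.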

3.1.2 Model Template

In addition to modifying the memory TLM model, we created a template for modeling HW accelerators, illustrated in Figure 3.5. The template serves two main purposes. The first is similar to that of the memory model: the counters rdcnt and wtcnt record the read and write access counts of a HW accelerator TLM model. Although this information could be obtained by tracing the SW code or by recording it in SW, recording it in the TLM model is more convenient and keeps the analysis free of additional, otherwise useless memory accesses.

The second purpose is to make sendReadData() a multi-cycle access function and to add the system clock to its sensitivity list, which ensures that the data read from the model is correct and allows the HW delay to be simulated if needed. Figure 3.6 is the header file of the HW accelerator TLM model template, followed by the function code of the model template in Figures 3.7, 3.8, and 3.9.

[Figure 3.5 labels: p_AHB; receiveWriteData; sendReadData; Delay Process; other functions; ready; tri_w; tri_r]

Figure 3.5: Block diagram of HW model template

SC_MODULE(HW_AHB_TLM) {
        sc_module(name_), p_AHB("p_AHB"), clk_p("clk_p"), rst_n("rst_n")
    {
        SC_METHOD(receiveWriteData);
        sensitive << p_AHB.getReceiveWriteDataTrfEventFinder();
        dont_initialize();

        SC_THREAD(sendReadData);
        sensitive_pos << clk_p;
        sensitive << ready;
        dont_initialize();

        SC_METHOD(sendEoT);
        sensitive << p_AHB.getSendEotTrfEventFinder();
        dont_initialize();

        SC_METHOD(checkReady);
        sensitive_pos << clk_p;
        sensitive_neg << rst_n;
        sensitive << flag;
        dont_initialize();

        SC_METHOD(cl_flag);
        sensitive << tri_w << tri_r;
        dont_initialize();

#define DELAY_HW 0 

Figure 3.7: HW accelerator TLM model template (function code part 1)

#define Addr_name 0
        cout << "Access Error!" << endl;
    }
    wtcnt++;
    if (p_AHB.getEotTrf()) p_AHB.sendEotTrf();
} //HW_AHB_TLM::receiveWriteData()

Figure 3.8: HW accelerator TLM model template (function code part 2)

void HW_AHB_TLM::sendReadData() {
    while (1) {
        if (p_AHB.getReadDataTrf()) {
            unsigned int address = p_AHB.ReadDataTrf->getAddrTrf()->getAddress();
            unsigned long long datatemp;
            switch (address) {
                p_AHB.ReadDataTrf->setReadData(datatemp);
                break;
            …
            default:
                cout << "Access Error!" << endl;
                p_AHB.ReadDataTrf->setReadData(0);
            }
            rdcnt++;
            p_AHB.sendReadDataTrf();
        }

void HW_AHB_TLM::sendEoT () {
//        p_AHB.getEotTrf();
//        p_AHB.sendEotTrf();
} //HW_AHB_TLM::sendEoT()

Figure 3.9: HW accelerator TLM model template (function code part 3)

We add the clock clk_p and the reset rst_n to the TLM model template as input signals.

The cl_flag() function models combinational logic: an exclusive-or gate raises the signal flag whenever tri_w is not equal to tri_r. The value of tri_w is inverted when the model finishes receiving input data and starts its behavior, and tri_r is inverted when the model finishes sending output data and is ready for the next inputs. The checkReady() function models sequential logic: while flag is true, it increments the counter clkcount and raises the signal ready once clkcount is greater than or equal to the defined delay cycle count of the model; otherwise, ready remains low.

The sendReadData() function is an SC_THREAD process, which allows multi-cycle accesses: a read transaction cannot obtain data until the ready signal is raised. The end of each transaction is therefore handled by the sendReadData() and receiveWriteData() functions, and the sendEoT() function does nothing.
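The interaction of cl_flag(), checkReady(), and the ready signal can be illustrated with a small cycle-stepped model in plain C++. This is a sketch under assumptions rather than the template's actual code: the struct DelayModel and its methods inputDone(), outputDone(), and clockEdge() are hypothetical names, one call to clockEdge() stands in for one rising edge of clk_p, and clearing clkcount and ready as soon as flag drops is an assumed reset behavior.

```cpp
#include <cassert>

// Plain C++ stand-in for the delay process of the HW model template.
struct DelayModel {
    explicit DelayModel(unsigned delayCycles) : delay(delayCycles) {}

    // cl_flag(): exclusive-or of the two toggle signals.
    bool flag() const { return tri_w != tri_r; }

    // Input data fully received: toggle tri_w, which raises flag.
    void inputDone() { tri_w = !tri_w; }

    // Output data fully sent: toggle tri_r, which lowers flag again.
    void outputDone() {
        tri_r = !tri_r;
        update();  // checkReady() is also sensitive to flag
    }

    // checkReady() on a rising clock edge.
    void clockEdge() { update(/*countCycle=*/true); }

    bool tri_w = false;
    bool tri_r = false;
    bool ready = false;
    unsigned clkcount = 0;
    unsigned delay;

private:
    void update(bool countCycle = false) {
        if (!flag()) {            // idle: hold counter and ready low (assumption)
            clkcount = 0;
            ready = false;
            return;
        }
        if (countCycle && clkcount < delay) ++clkcount;
        ready = (clkcount >= delay);
    }
};
```

A read transaction in this model would poll ready after each clock edge and complete only once it is high, which is the multi-cycle behavior that the SC_THREAD process provides in the template.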

Finally, the counters rdcnt and wtcnt are used in the same way as in the memory model described in the previous subsection. This completes the template for HW accelerator TLM models; almost all HW TLM models in our virtual platform are created from this template and inherit its properties.

3.2 Software Optimization and System Profiling

After building the virtual platform, we still need application software to run the simulation. In this section, we describe step by step the SW optimization methods adopted during system profiling and performance analysis. The HW/SW interface is introduced where HW accelerators are first used.

We want to find the system bottleneck and then deal with it. Profiling the functions of the referenced SW is the first step; it provides information such as the number of calls and the execution time of each function. SW profiling results depend on many factors, from SW to HW and from the compiler to the microprocessor and other components. To make the profiling comprehensive, we integrate it with system performance analysis. That is, SW profiling is carried out while porting the application SW to the virtual platform, using the CoWare profiling utilities. According to the profiling results, we optimize the execution time of the application SW functions by reducing the number of function calls, the number of instructions per function, and so on.


Figure 3.10: Block diagram of platform with channel model

Figure 3.10 is the block diagram of our platform after the first step. In this platform, ROM and RAM are modeled by the modified memory TLM model, the APB and the Display model are removed, and a Channel Model is added on the AHB. The Display model is not necessary here, because we can use the semihosting supported by the ARM926 PSP to display values for debugging. The Channel Model replaces the corresponding part of the referenced SW code because that part contains a large amount of floating-point arithmetic. At this stage of the exploration we do not need to consider precision or the algorithm itself, but we must make sure that the program result remains correct whenever a change is made to increase performance. The Channel Model therefore uses 64-bit data, with its I/O mapped to AHB address 0x10000000, so that we do not need to change data types in the SW code.

Figure 3.11 is an example of SW function code that communicates with the Channel Model HW accelerator, and Figure 3.12 shows the data type reinterpretation in the HW accelerator TLM model.

void Channel(Complex tx_data, Complex rx_data) 

Figure 3.11: Communicate SW with HW accelerator

unsigned int access;
        tempin |= p_AHB.WriteDataTrf->getWriteData() << 32;
        datain = reinterpret_cast<double &>(tempin);
    }
        p_AHB.ReadDataTrf->setReadData(datatemp >> 32);
    } else {
        p_AHB.ReadDataTrf->setReadData(datatemp);
    }
    …
} //HW_AHB_TLM::sendReadData()

Figure 3.12: Data type reinterpretation in TLM model
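The idea behind this reinterpretation is that a 64-bit double crosses the 32-bit bus as two words and is reassembled bit for bit on the other side. The following self-contained sketch shows one way to do this in portable C++. It uses std::memcpy instead of the reinterpret_cast shown in the figure, the helper names splitDouble() and joinDouble() are hypothetical, and the low-word-first ordering is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <utility>

// Split a double into (low word, high word), as SW would before two 32-bit writes.
inline std::pair<std::uint32_t, std::uint32_t> splitDouble(double value) {
    std::uint64_t bits;
    static_assert(sizeof(bits) == sizeof(value), "double must be 64 bits wide");
    std::memcpy(&bits, &value, sizeof(bits));  // copy the bit pattern unchanged
    return { static_cast<std::uint32_t>(bits),
             static_cast<std::uint32_t>(bits >> 32) };
}

// Reassemble the double from the two received words, as the HW model would.
inline double joinDouble(std::uint32_t low, std::uint32_t high) {
    // Widen before shifting so the high word is not shifted out of a 32-bit type.
    const std::uint64_t bits = (static_cast<std::uint64_t>(high) << 32) | low;
    double value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}
```

Because only the bit pattern is moved, the SW side can keep its double data type unchanged, at the cost of two bus transactions per value.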

Since the HW TLM model has the same behavior as the original SW functions, we can simply copy them from the SW into the HW TLM model with small changes to the I/O. In this case, adopting the Channel Model HW accelerator TLM model reduces the simulation time by about 30k seconds, bringing the corresponding share of total execution time from over 96% to under 0.4%, and it removes 98.4% of the ROM accesses and 94.6% of the RAM accesses. Although the HW TLM model does not consider timing and is not cycle-accurate, the improvement in simulation speed and the advantages of using it are obvious. It can also be used for simulating a HW/SW co-design by users who may not know how to design HW.

In the second step, we keep only the TX part of the referenced SW code as our application SW, and the Channel Model HW is removed. We do this to focus on TX first and to keep other parts from affecting the performance analysis. In this step, we avoid semihosting by moving the pilot subcarrier values, the index set, and the pilot preamble from a text file into the SW code as constants. We also store frequently used values in a table to reduce math library function calls. Other effective optimization techniques are replacing heavy arithmetic with bitwise or logical operations and unrolling loops.
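As an illustration of the table and bitwise techniques mentioned above, the following sketch precomputes cosine values once and replaces a modulo operation with a bit mask. It is a hypothetical example, not code from the application SW: the table size of 64, the choice of cosine, and the names cosTable() and fastCos() are all assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

constexpr std::size_t kTableSize = 64;  // hypothetical size; must be a power of two

// Fill the table once so that the hot loop never calls std::cos.
inline const std::array<double, kTableSize>& cosTable() {
    static const std::array<double, kTableSize> table = [] {
        std::array<double, kTableSize> t{};
        const double twoPi = 2.0 * std::acos(-1.0);
        for (std::size_t k = 0; k < kTableSize; ++k)
            t[k] = std::cos(twoPi * static_cast<double>(k) / kTableSize);
        return t;
    }();
    return table;
}

// cos(2*pi*k/kTableSize) for any k: the mask replaces k % kTableSize.
inline double fastCos(std::size_t k) {
    return cosTable()[k & (kTableSize - 1)];
}
```

The mask is equivalent to the modulo only because the table size is a power of two; for other sizes the division would have to stay.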

After the second step, the performance analysis shows that the execution time of almost every function far exceeds the time constraint. Therefore, in the third and later steps we use the same method as for the Channel Model in the first step to create several HW accelerator TLM models. Several HW/SW partition configurations are created for architectural exploration and system performance analysis. Being able to create a virtual platform with different configurations simply and quickly demonstrates the flexibility of the virtual platform. These HW/SW partition configurations are described in detail in the next section.

3.3 HW/SW Partition Configurations

In this section, three main classes of HW/SW partition configurations are described in separate subsections: no HW, individual HWs, and one combined HW. Each subsection may contain more than one simulation case, and for clarity we illustrate the platform and the program flow of each simulation case.

3.3.1 Software Only

The simulation case of this subsection is the one introduced in the second step of Section 3.2. It uses the optimized TX function from the referenced SW and the simplest platform architecture, without any HW accelerators. Figure 3.13 is the block diagram of the platform in this case, and Figure 3.14 shows its function blocks and program flow.

Figure 3.13: Block diagram of platform in Case 1

Figure 3.14: Function block and program flow in Case 1

Since no HW accelerators are used in this simulation case, we take it as the reference case and use it for program validation of the other configurations.

Figure 3.15 shows the program validation flow, where the SW golden function TX1 comes from this case and the partitioned function TX2 comes from the other configurations.

[Figure 3.15 (program validation flow) labels: Random number generator; Information bits; TX1 (SW golden function); TX2 (Partitioned function); TX signal_1; TX signal_2; Comparator; Match?]

Before profiling any simulation case, we run this validation flow to make sure that the modifications are functionally correct, whether they were made in the SW C++ program or in the HW SystemC TLM models. Once validation passes, we can be confident that the modification for the new configuration is correct. We then remove the SW golden functions and run the simulation again with analyses such as bus information and function profiling.
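The validation flow can be summarized in a few lines of plain C++. This is a minimal sketch under assumptions: the type aliases, the function name validate(), and the callable signatures for TX1 and TX2 are hypothetical, and comparing the two TX signals for exact equality is an assumed comparator policy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <random>
#include <vector>

using Bits = std::vector<std::uint32_t>;   // information bits, one word per entry
using Signal = std::vector<double>;        // TX output samples
using TxFunction = std::function<Signal(const Bits&)>;

// Feed the same random information bits to the golden and the partitioned
// transmitter and report whether their output signals match.
inline bool validate(const TxFunction& tx1Golden,
                     const TxFunction& tx2Partitioned,
                     std::size_t words, unsigned seed) {
    std::mt19937 rng(seed);                 // random number generator block
    Bits info(words);
    for (auto& w : info) w = static_cast<std::uint32_t>(rng());
    const Signal txSignal1 = tx1Golden(info);
    const Signal txSignal2 = tx2Partitioned(info);
    return txSignal1 == txSignal2;          // comparator block
}
```

A fixed seed keeps a failing run reproducible, which helps when tracking down a mismatch between the SW golden function and a partitioned version.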

3.3.2 Individual Hardware Accelerators

This subsection covers platform architectures that include one or more individual HW accelerators. From step 3 in Section 3.2 onward, each step adds a HW accelerator for the current bottleneck function, improving system performance step by step. Figures 3.16 and 3.17 illustrate the block diagram of the platform and the program flow with function blocks in simulation Case 2, which represents the configuration after step 5.


Figure 3.16: Block diagram of platform in Case 2

[Figure 3.17 labels: Modulation; STBC]

Figure 3.17: Function block and program flow in Case 2

Of the cases in this subsection, we keep only this one for the next chapter, to compare against the third type of configuration described in the next subsection. This floating-point version of the application SW doubles the number of bus transactions in HW acceleration, because it uses the 64-bit IEEE 754 double-precision floating-point data type. There are three ways to solve this problem. The first is to convert the application SW to a fixed-point version; however, this does not suit our aim of early exploration. The second is to simulate with a 64-bit bus, which requires modifying the memory models and an ISS that supports 64 bits. The third is the one we use in our case study: we adjust the HW/SW configuration to remove the influence of the 64-bit floating-point data. In other words, we simplify the bus transactions across the HW/SW interface by using only one combined HW accelerator.

Therefore, we can easily make a conversion of transaction counts from 64-bit to 32-bit

3.3.3 Combined Hardware

In this subsection, combined HW means that a configuration of this type has only one HW accelerator TLM model on the platform. Consequently, there is only one virtual platform block diagram for this configuration type, shown in Figure 3.18. The combined HW accelerator contains several HW function blocks and their interconnections, which reduces the bus transactions between the individual HW function blocks and the SW. Although this may increase the area cost and complexity of the corresponding HW, it simplifies the problem from the viewpoint of platform architecture. Different SW program structures and their corresponding HW accelerators yield many different configurations within this type, and any fine tuning may lead to a different configuration and a different analysis result under different assumptions. We discuss two final configurations in this subsection, named simulation Case 3 and Case 4.

Figure 3.18: Block diagram of platform in Case 3 & Case 4

Figure 3.19: Function block and program flow in Case 3

Figure 3.19 illustrates the program flow of simulation Case 3. No function block remains inside the SW part in this case; that is, we simulate a configuration in which all function blocks of the baseband TX run in HW. The SW part mainly handles the parameter settings for each HW function block and is also used in validation. This case rests on one assumption: the input information bits are in 32-bit data format and are sent to the HW accelerator directly. If the input information bits are not 32-bit data, we may need a data buffer, or we may reduce the transaction data width at the cost of more bus transactions. Both options can be regarded as further configurations based on this simulation case.

The other final configuration, our simulation Case 4, is illustrated in Figure 3.20. In this case, the Modulation function block is moved from the HW part to the SW part. This configuration is more suitable for representing HW/SW co-design on our virtual platform.


Figure 3.20: Function block and program flow in Case 4

In this simulation case, we make the same assumption about the input information bits as in Case 3. To support the three modulation modes, i.e., QPSK, 16QAM, and 64QAM, we need an unpacking process to handle the input information bits. Figure 3.20 shows the 16QAM modulation, in which the real part and the imaginary part are modulated in turn, so the unpacking process unpacks the input data into 2-bit groups. The modulated output in this case is in 64-bit double-precision floating-point format. As before, many different configurations could be derived from this case. One option is to encode the modulated output into 4 bits regardless of the modulation mode, and then either send the 4-bit encoded data to the HW part directly or buffer it to make full use of the 32-bit bus width. Alternatively, we could represent the modulated output in a fixed-point format narrower than 32 bits, without an additional codec.
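The unpacking step can be sketched as follows. This is an illustrative assumption rather than the thesis implementation: the function name unpackWord() is hypothetical, groups are taken starting from the least significant bit, and bits left over at the top of the word (2 bits in the 3-bit case) are simply dropped instead of being carried into the next word.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Unpack one 32-bit input word into groups of bitsPerComponent bits.
// Bits per real or imaginary component: QPSK 1, 16QAM 2, 64QAM 3.
inline std::vector<std::uint8_t> unpackWord(std::uint32_t word,
                                            unsigned bitsPerComponent) {
    assert(bitsPerComponent >= 1 && bitsPerComponent <= 3);
    const std::uint32_t mask = (1u << bitsPerComponent) - 1u;
    std::vector<std::uint8_t> groups;
    // Assumption: least significant group first; leftover top bits are dropped.
    for (unsigned pos = 0; pos + bitsPerComponent <= 32; pos += bitsPerComponent)
        groups.push_back(static_cast<std::uint8_t>((word >> pos) & mask));
    return groups;
}
```

For 16QAM, one 32-bit word yields sixteen 2-bit groups, which would alternate between the real and imaginary parts of eight modulated symbols.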

Chapter 4 Result and Analysis

In this chapter, we list the profiling results of each simulation case defined in the previous chapter and use them for performance analysis.

Section 4.1 lists the profiling results for each case and briefly explains how they were gathered. Section 4.2 presents our performance analysis by comparing two or more configurations, followed by some discussion.

4.1 Profiling Result

For each simulation case, we first modify the application SW C++ program and build an ARM executable image file with the CoWare ARM926T Bootcode and scatter load. Second, we modify the HW SystemC TLM models (if needed) and create a virtual platform using Platform Creator. Finally, we build and run the simulation with the ARM Symbolic Debugger (ASD), load the executable image file, enable the analysis functions of the CoWare profiling utilities, and wait for the results [15].

In addition to the analysis reported by the CoWare profiling utilities, we gather some information ourselves, as introduced in Chapter 3. However, this information is displayed directly on ASD, which affects the performance analysis of the CoWare profiling utilities. We therefore build two executable images for each simulation case: one with our profiling code added to the functions to print the simulation information, run without the analysis functions; the other without that code, run with the analysis functions of the CoWare profiling utilities enabled.

Our simulation flow for each case is as follows. First, we combine the SW golden functions with our modified functions to make sure the modification is functionally correct. Second, we remove the SW golden functions, add our own profiling code, and run again. Third, we remove our own profiling code and run the simulation with the analysis functions of the CoWare profiling utilities enabled.

We simulate the three supported modulation modes separately for each simulation case and combine the results. The profiling results include function execution time, memory accesses, bus transactions, and so on.

Cache size is one of the factors we are interested in. Both the instruction and the data cache sizes of the ARM926 PSP can be adjusted with the Parameter Editor in Platform Creator, in units of kilobytes. Since setting the cache size to zero is illegal, we must build another executable image file that disables the cache module for the no-cache case. For each simulation case, we have three results with different cache sizes: no cache, 4k bytes for both instruction and data cache, and 32k

