5.1 Experiment Environment Setup
Several development tools are available for programming the BF561.
The official integrated tool is VisualDSP++, an integrated development environment (IDE) comparable to the ARM Developer Suite (ADS) in ARM-based environments. For more complex applications, the vendor also provides a lightweight real-time kernel called VDK, which offers many libraries for developers of real-time applications.
Besides the official tools, we have another choice: the GNU open-source project. In this project, we can use uClinux and the GCC toolchain on Blackfin systems; both are well supported by the community. uClinux is a lightweight version of Linux that supports processors without a memory management unit (MMU).
We choose the open-source GNU project for our experimental environment for two reasons. First, an open-source environment is better suited to academic research. Second, with Linux kernel support on the BF561, we can in principle port code from any other Linux-based platform and exploit the libraries available for Linux; this lets us develop our applications quickly, since resources for Linux-based systems are easy to find on the Internet.
On the dual-core BF561, uClinux can run on one core only or on both cores. If uClinux runs on one core, the other core is treated as a device and can run programs through
CHAPTER 5. IMPLEMENTATION AND OPTIMIZATION 29
driver support. Besides running on one core, uClinux can also run on both cores; this is called “SMP-like” mode.
We call it “SMP-like” because the BF561 lacks a hardware cache coherency mechanism, which a “real” SMP system must have. Hence, cache coherency has to be maintained in software when needed. This has three significant implications [5]:
• caches must be in write-through mode,
• more overhead is introduced due to software coherency mechanism, and
• all threads of a process are restricted to be executed on the same core.
Another problem is that the L1 SRAM owned by one core cannot be accessed directly from the other core, so L1 SRAM cannot be used in the kernel: a kernel panic occurs when a kernel thread running on one core tries to access kernel resources placed in the L1 SRAM of the other core. This reduces the optimization potential, because we cannot put critical system calls in L1 SRAM to optimize the Linux kernel. User-space applications also need care: if we put data or instruction code in L1 SRAM, the user process must run on the specific core that owns that SRAM.
We finally configure uClinux in SMP-like mode, because full Linux support on both cores gives us a consistent environment for developing applications. There is no need to load programs onto the other core through special drivers.
5.2 Software-based JPEG2000 Implementation and Profiling
Several open-source projects are developing JPEG2000 codecs. The best known are JasPer [18] and OpenJPEG [20].
JasPer is developed and maintained by its main author, Michael Adams, who is affiliated with the Digital Signal Processing Group (DSPG) in the Department of Electrical and Computer Engineering at the University of Victoria. It implements the JPEG-2000 Part-1 standard (i.e., ISO/IEC 15444-1) and is itself part of the JPEG-2000 Part-5 standard (i.e., ISO/IEC 15444-5).
OpenJPEG implements not only the Part-1 standard but also many other features, such as the JP2 (JPEG2000) and MJ2 (Motion JPEG2000) file formats and the JPEG2000 Interactive Protocol. It is developed and maintained by the Communications and Remote Sensing Lab at the Université catholique de Louvain (UCL).
After comparing the two implementations, we choose OpenJPEG for our implementation for two reasons: the source code is easy to trace, and the code is clearly partitioned.
Since the source code of OpenJPEG is well written and portable, porting it to our platform is not difficult. uClinux is also easy to configure in SMP-like mode.
Figure 5.1 shows the execution time breakdown of JPEG2000 compression on the BF561; the input image is a 640x480 color image taken from the official OpenJPEG site, and the profiling uses the default settings: DWT level n = 5, code-block size b = 64x64, lossless. We see that EBCOT Tier-1 and DWT dominate JPEG2000 compression; together the two components account for 92% of the total execution time. Our optimizations focus on these two parts, because they are not only the hotspots of JPEG2000 compression but also potentially parallelizable.
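As a rough illustration of why these two parts matter, Amdahl's law gives an upper limit on the dual-core speedup. This is only a sketch under an idealized assumption that we introduce here: the fraction p = 0.92 (EBCOT Tier-1 plus DWT) is split perfectly across N = 2 cores, with no synchronization or data-transfer cost. It is a bound, not a measured result.

```latex
S_{\max} = \frac{1}{(1 - p) + \dfrac{p}{N}}
         = \frac{1}{0.08 + \dfrac{0.92}{2}}
         = \frac{1}{0.54}
         \approx 1.85
```

Any real implementation falls below this limit, because the overheads ignored here are not zero.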
5.3 Overview of JPEG2000 Optimizations on BF561
5.3.1 Data Locality Optimization
After porting the kernel and the JPEG2000 codec to our BF561 environment, where do we start optimizing JPEG2000? Image processing often divides an image (the data) into several blocks, and conceptually a block is a 2-D array. However, memory accesses are in practice 1-D; hence, performance suffers if we do not carefully arrange the data in suitable locations. A major problem is cache misses. Much research addresses the management of data locality at different design levels, such as the system level, the application level, and the compiler level [26] [6] [1] [10].
Figure 5.1: Execution time breakdown of JPEG2000 compression on the BF561 processor (EBCOT Tier-1 71%, DWT 21%, pre-processing 7%, EBCOT Tier-2 1%).
As discussed in Section 4.2.1, the BF561 architecture provides an L1 instruction SRAM, an L1 data SRAM, and an L2 SRAM, accompanied by two DMA devices. Here we focus on the data SRAM. Part of the L1 data SRAM can only be configured as general data SRAM rather than as cache, so it has no cache-miss problem. The data SRAM runs as fast as the core. It is a precious resource for data optimization. For convenience, we shorten the term “general data SRAM” to “data SRAM”.
The best scenario for using the L1 data SRAM is to put all data in it to achieve the best performance. However, this is rarely possible because of the limited SRAM size. Hence, we can move only some of the data into L1 SRAM; these may include parts of the input data, the output buffer, temporary data, constant data, and so on.
On the other hand, we configure the parts that can serve as cache to be cache, because this mature mechanism improves performance efficiently without any software overhead. This configuration is a good choice for general use. The use of the general SRAM, however, is up to application developers. Hence, the use of SRAM is a focus of our optimization.
DMA is a technique designed for moving data and is now found in almost every modern CPU. The BF561 has two DMA devices. Unlike many other SoC and CPU designs, the two DMA devices in the BF561 have independent buses and can access one SRAM sub-bank while the Blackfin core is accessing another. Each of them has 16 channels, 4 of which can be used as memory DMA (MDMA) channels; this means that we can use them to move data among L1 SRAM, L2 SRAM, and external memory.
As a result, we can move data into the L1 data SRAM by DMA before they are needed, and then move them out after the processing is completed. Ideally, the data movement overlaps with the accesses from the processor cores.
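The overlap described above is commonly arranged with two alternating buffers. The following is a minimal host-side sketch of that idea, not the implementation used in this work: `dma_start` and `dma_wait` are hypothetical stand-ins (here a plain `memcpy` and a no-op), `LINE_LEN` and `process_line` are made up for illustration, and the scheme assumes that `dma_wait` waits for all outstanding transfers.

```c
#include <stddef.h>
#include <string.h>

#define LINE_LEN 8   /* hypothetical line length, in samples */

/* Stand-in for an MDMA transfer. On the BF561 this would program a memory
 * DMA channel and return at once; here memcpy completes immediately. */
static void dma_start(int *dst, const int *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* Stand-in for waiting until ALL outstanding transfers have completed
 * (nothing to wait for in this host-side model). */
static void dma_wait(void) { }

/* Hypothetical per-line kernel: doubles every sample in place. */
static void process_line(int *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] *= 2;
}

/* Process `lines` rows of `image` using two buffers that would live in
 * L1 data SRAM: while one is being processed, the other is being filled. */
static void process_image(int *image, size_t lines)
{
    static int l1_buf[2][LINE_LEN];
    size_t cur = 0;

    if (lines == 0)
        return;
    dma_start(l1_buf[cur], image, LINE_LEN);            /* prefetch line 0 */
    for (size_t y = 0; y < lines; y++) {
        size_t nxt = cur ^ 1;

        dma_wait();              /* line y is in; previous write-back is out */
        if (y + 1 < lines)       /* prefetch line y+1 into the idle buffer */
            dma_start(l1_buf[nxt], image + (y + 1) * LINE_LEN, LINE_LEN);
        process_line(l1_buf[cur], LINE_LEN);    /* would overlap the prefetch */
        dma_start(image + y * LINE_LEN, l1_buf[cur], LINE_LEN); /* write back */
        cur = nxt;
    }
    dma_wait();                                  /* last write-back finished */
}
```

With a real asynchronous DMA engine, the core would spend the transfer time inside `process_line` instead of waiting; in this model the calls simply run one after another.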
Hotspot instruction code can also be placed in the instruction SRAM, as we do with data. For this purpose, GCC supports function attributes that place specific procedures in the L1 instruction SRAM. For instance, we can simply write __attribute__ ((l1_text)) in the source code to place one procedure in the L1 instruction SRAM. The attribute is written after the declaration of the procedure. The following example shows how to use the attribute:
void foo(int a) __attribute__ ((l1_text));
The function foo(int a) will be allocated in the L1 instruction SRAM, and the linker will maintain the linking information for calls to foo.
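A slightly fuller sketch of the same mechanism is given below. It also uses the `l1_data` variable attribute of the Blackfin GCC port to place a buffer in L1 data SRAM. The macro names `L1_TEXT` and `L1_DATA`, the buffer `line_buf`, and the routine `sum_line` are hypothetical and serve only as illustration; the guard on `__ADSPBLACKFIN__` is our assumption for letting the same file build on a non-Blackfin host, where the attributes are simply dropped.

```c
#include <stddef.h>

/* Use the Blackfin placement attributes only when compiling for Blackfin;
 * elsewhere the macros expand to nothing so the sketch still builds. */
#if defined(__ADSPBLACKFIN__)
#define L1_TEXT __attribute__((l1_text))
#define L1_DATA __attribute__((l1_data))
#else
#define L1_TEXT
#define L1_DATA
#endif

#define LINE_BUF_LEN 640  /* hypothetical: one line of a 640-pixel-wide image */

/* Hypothetical line buffer intended for L1 data SRAM. */
static int line_buf[LINE_BUF_LEN] L1_DATA;

/* Hypothetical hot routine intended for L1 instruction SRAM. */
int sum_line(const int *src, size_t n) L1_TEXT;

int sum_line(const int *src, size_t n)
{
    int sum = 0;

    /* Copy at most LINE_BUF_LEN samples into the buffer and add them up. */
    for (size_t i = 0; i < n && i < LINE_BUF_LEN; i++) {
        line_buf[i] = src[i];
        sum += line_buf[i];
    }
    return sum;
}
```

Because L1 SRAM is private to each core, code and data placed this way are only usable by a process that runs on the owning core.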
5.3.2 Utilization of Two Cores
Having discussed the SRAM, we now turn to the two cores of the BF561. If we can move part of the computation onto the other core to be processed simultaneously, performance improves significantly. There are two well-known ways to partition computation across multiple cores: task partition and data partition. Task partition means that the cores run different code and the data pass through the cores like a pipeline. Data partition means that the cores run the same code and the data are divided among them.
Following the same principle, David J. Katz and Rick Gentile, members of Analog Devices’ Embedded Processor Application Group, use MPEG-2 as an example to show the two ways of partitioning on the Blackfin BF561 [14]. The first, shown in Figure 5.2, is a master-slave model; it is similar to “data partition”. In this model, the coding process is mainly controlled by the master core, which hands some data over to the other core for processing. The advantage of this model is that the code does not need to change much; the development procedure is similar to single-core development. However, synchronization overhead is incurred and the slave core is not fully loaded. In the example of Figure 5.2, some components of MPEG-2 compression are parallelized across both cores and some are not. Whether a component can be parallelized depends on its algorithm. While the non-parallelized components run, the slave core is idle. In addition, synchronization is needed after some components to make sure that the data for the next components are ready.
The other programming model is a pipelined model; it is similar to “task partition”, and some people call it “stream partition”. As shown in Figure 5.3, the compression procedure is divided into several sub-procedures, which are then dispatched to the two cores. If the load of the two cores is balanced well enough, the idle states of the master-slave model do not occur here. However, the development procedure has to change more and is less straightforward than in the master-slave model.
Figure 5.2: Master-slave model of MPEG-2 encoder on dual-core processors.
Figure 5.3: Pipelined model of MPEG-2 encoder on dual-core processors.
Comparing these two models, we choose the master-slave model for several reasons: first, it scales better when the number of hardware cores changes; second, we can easily increase or decrease the load of the slave core if we need to assign other jobs to it; finally, according to our profiling results presented in Section 5.2, it is hard to partition JPEG2000 into balanced jobs.
5.4 Optimization of DWT
5.4.1 Data Locality Optimization
As described in Section 3.2.2, JPEG2000 uses a 2-D DWT to transform the input image into high-frequency and low-frequency parts. The 2-D DWT computation is shown in Figure 5.4; we perform the DWT on the input image line by line, in the horizontal and the vertical direction in turn.
Let us take a close look at the dataflow of the DWT computation in Figure 5.5. Before performing a one-line DWT, we need to move the line data into a buffer on which the processor core does the calculation. Thanks to the well-designed (5,3) lifting-based DWT, the calculation is “in-place” and needs only one buffer. In the general case, the processor itself can move the data well, and the data cache can hold the subsequent data for potential use. Hence, the following data are readily available in the high-speed cache memory if the data are contiguous in memory; in image processing, this means that the data come from the horizontal direction. When reading in the vertical direction, however, the cache performs poorly. Furthermore, it is wasteful to have the processor core move the data; the core should focus on computation.
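To make the “in-place” property concrete, the following is a minimal sketch of a forward reversible (5,3) lifting step on one line. It is our own illustration, not OpenJPEG's dwt_encode_1: the function name `dwt53_forward_line` is hypothetical, the line length is assumed to be even, and the boundary samples are mirrored.

```c
#include <assert.h>
#include <stddef.h>

/* Forward reversible (5,3) lifting on one line, in place.
 * On return, even indices hold low-pass coefficients and odd indices hold
 * high-pass coefficients. Assumes n is even and n >= 2, and that >> on a
 * negative int is an arithmetic shift (true for GCC). */
static void dwt53_forward_line(int *a, size_t n)
{
    size_t i;

    assert(n >= 2 && n % 2 == 0);

    /* Predict: d = odd - floor((left even + right even) / 2).
     * The right neighbour of the last odd sample is mirrored. */
    for (i = 1; i < n; i += 2) {
        int left = a[i - 1];
        int right = (i + 1 < n) ? a[i + 1] : a[i - 1];
        a[i] -= (left + right) >> 1;
    }

    /* Update: s = even + floor((left d + right d + 2) / 4).
     * The left neighbour of the first even sample is mirrored. */
    for (i = 0; i < n; i += 2) {
        int right = a[i + 1];
        int left = (i > 0) ? a[i - 1] : a[i + 1];
        a[i] += (left + right + 2) >> 2;
    }
}
```

Each step overwrites only the samples it computes, so no second buffer is needed; the price is that the low-pass and high-pass outputs end up interleaved in the same array.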
In a general memory device, data are physically stored and moved in 1-D even though the high-level description is 2-D. For this reason, we change our view from 2-D to 1-D to see how data are moved into and out of the buffer. Figure 5.6(a) shows how data are moved into the buffer from the candidate line data in the horizontal direction. We can see that the data are read into the buffer contiguously; this is the access pattern that the cache handles best.
Figure 5.4: 2D-DWT computation: (a) 1-D DWT in horizontal direction; (b) 1-D DWT in vertical direction.
Once the buffer is filled, the DWT can be performed on the data in it. As discussed in Section 3.2.2, the DWT computation produces DWT coefficients in which the low-frequency and high-frequency coefficients are regularly interleaved.
After the DWT computation, when the data are moved back, we have to separate the low-frequency and high-frequency coefficients and put them back in the correct locations. Figure 5.6(b) shows how the data are moved back. We see that the low-frequency and high-frequency coefficients are gathered at the start and the middle of the original line data, respectively.
In the vertical direction, however, the candidate line data are not contiguous. As shown in Figure 5.7(a), the data read from the candidate line are separated by a fixed stride, which the cache handles poorly. On the other hand, as with data restoration in the horizontal direction, we need to put the interleaved low-frequency and high-frequency coefficients back in the correct locations. Figure 5.7(b) shows where the data are put back.
Figure 5.5: Dataflow of DWT computation performed in one line (diagram labels: buffer; Blackfin core running static void dwt_encode_1(int *a, int dn, int sn, int cas); data flow).
From this observation and analysis, all of the data movement, both into and out of the buffer and in both the horizontal and vertical directions, can be assigned to DMA. The main reason DMA can perform these data movements is that they are regular. Suppose one “data movement” consists of moving several data elements; if the source elements are placed at a fixed stride and their target locations are also at a fixed stride, we call the data movement “regular”, and it can be performed by DMA.
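The notion of a “regular” movement can be sketched as a copy with a fixed source stride and a fixed destination stride. The code below is a software model only; it does not program the BF561 DMA controller, and the names `strided_copy`, `fetch_column`, and `store_column` are hypothetical. It shows that both the vertical buffer fill and the de-interleaving write-back decompose into such copies.

```c
#include <assert.h>
#include <stddef.h>

/* Software model of one "regular" transfer: copy `count` elements, reading
 * every `src_stride` elements and writing every `dst_stride` elements.
 * On the BF561 a DMA channel would do this; the loop is only a stand-in. */
static void strided_copy(int *dst, size_t dst_stride,
                         const int *src, size_t src_stride, size_t count)
{
    for (size_t i = 0; i < count; i++)
        dst[i * dst_stride] = src[i * src_stride];
}

/* Fill the line buffer with column `col` of a w-by-h image (stride w). */
static void fetch_column(int *buf, const int *img,
                         size_t w, size_t h, size_t col)
{
    strided_copy(buf, 1, img + col, w, h);
}

/* Write an interleaved buffer back to column `col`: low-pass coefficients
 * (even indices) go to the top half, high-pass coefficients (odd indices)
 * to the bottom half. Each half is one regular transfer. */
static void store_column(int *img, const int *buf,
                         size_t w, size_t h, size_t col)
{
    size_t half = h / 2;

    assert(h % 2 == 0);
    strided_copy(img + col, w, buf, 2, half);                 /* low  */
    strided_copy(img + col + half * w, w, buf + 1, 2, half);  /* high */
}
```

The horizontal case is the same model with an image stride of 1 instead of w.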
Figure 5.6: The data moving flow in a horizontal line: (a) buffer fill from candidate data before DWT; (b) data restoration from buffer. Here x is the horizontal length of the image and y is the vertical length of the image.
As a result, we use DMA to move data into and out of the buffer, and we place the buffer in the L1 data SRAM so that it can be accessed at high speed. The dataflows before and after our optimization are illustrated in Figure 5.8. We add the cache to the figure to highlight what is special about our optimization: it bypasses the cache mechanism.
Because DMA operations are invoked frequently, a low-latency system call for configuring the DMA controllers is essential. For this reason, we write a lightweight system call instead of using a standard Linux I/O control driver, and we put it in the L1 instruction SRAM. In addition, because the L1 instruction SRAM of one core cannot be accessed by the other core, the DMA system call is cloned into the L1 instruction SRAM of both cores so that it can be called from either core.
Figure 5.7: The data moving flow in a vertical line: (a) buffer fill from candidate data before DWT. Here x is the horizontal length of the image and y is the vertical length of the image.
5.4.2 Utilization of Two Cores
5.4.2.1 Data Partition
Having discussed the optimization using DMA and internal SRAM, we now discuss how to partition the computation onto the other core. As mentioned in Section 5.3, we use data partition to hand half of the data to the other core to speed up the calculation.
Figure 5.8: Dataflow of DWT computation performed in one line: (a) before data locality optimization (b) after data locality optimization.
The fact that L1 SRAM cannot be accessed by the other core is still a problem here. It forces us to bind the user process to one of the two cores; that is, we must make the Linux kernel schedule the process on only one core. This can be achieved with the system call int sched_setaffinity(pid_t pid, unsigned int cpusetsize, cpu_set_t *mask). Another problem is that on the BF561 a thread can only run on the same core as its process, owing to the lack of hardware cache coherency. As a result, we have to create a new process and bind it to the other core to share the calculations. The new process is generated by calling vfork() and a function of the exec() family.
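The two steps above, binding a process to a core and starting a helper process, can be sketched as follows. This is a minimal illustration using the standard Linux interfaces; the helper names `bind_self_to_core` and `spawn_helper` are hypothetical, and how the helper program learns which core to use is left open.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <unistd.h>

/* Pin the calling process to one core. Returns 0 on success, -1 on error. */
static int bind_self_to_core(int core)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    return sched_setaffinity(0, sizeof mask, &mask);  /* pid 0 = caller */
}

/* Start a helper program as a new process. Without an MMU there is no
 * fork(), so vfork() followed immediately by exec is the usual pattern.
 * The helper is expected to pin itself to the other core when it starts.
 * Returns the child's pid, or -1 if vfork() failed. */
static pid_t spawn_helper(const char *path, char *const argv[])
{
    pid_t pid = vfork();

    if (pid == 0) {
        execv(path, argv);
        _exit(127);            /* reached only if execv() failed */
    }
    return pid;
}
```

A typical use would be for the main process to call `bind_self_to_core(0)` before touching its L1 buffers, and for the helper program to call `bind_self_to_core(1)` as its first action.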
The only parameter that needs to be passed to the new process is the address of the tile (which is the start address of the image in our scenario); it is placed in the shared L2 SRAM, which now serves as shared memory for communication between the two cores. As discussed, the DWT of each line is independent; hence, we divide the lines of the same direction at the same level into two groups, the front half and the back half, and perform the DWT on different cores, as shown in Figure 5.9. The next stage, which may be the other direction or the next level, is not allowed to start until both cores have finished their dispatched jobs; synchronization is therefore needed here to make sure that both cores have finished. We use shared variables for synchronization; they are also placed in the shared L2 SRAM.
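The front-half/back-half split and the shared-variable synchronization can be sketched on a Linux host as follows. This is an illustration under several assumptions of ours, not the code used in this work: an anonymous shared mapping stands in for L2 SRAM, `fork()` stands in for the vfork()/exec() sequence, a C11 atomic flag stands in for the shared variable, and `struct shared_ctl`, `do_line`, `run_stage`, and `alloc_shared` are hypothetical names. On the BF561 the flag would additionally have to bypass or flush the data cache, which this sketch does not model.

```c
#define _DEFAULT_SOURCE
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_LINES 8   /* hypothetical upper bound on lines per stage */

/* Hypothetical control block shared by both cores. */
struct shared_ctl {
    atomic_int helper_done;   /* set to 1 by the helper after its half */
    int lines[MAX_LINES];     /* stand-in for per-line results */
};

/* Stand-in for a one-line DWT: records which side handled the line. */
static void do_line(struct shared_ctl *ctl, size_t y, int who)
{
    ctl->lines[y] = who;
}

/* Run one stage over n lines: the caller takes the front half, a child
 * process takes the back half, and the caller does not return until both
 * halves are finished. Returns 0 on success, -1 on error. */
static int run_stage(struct shared_ctl *ctl, size_t n)
{
    size_t half = n / 2;
    pid_t pid;

    if (n > MAX_LINES)
        return -1;
    atomic_store(&ctl->helper_done, 0);
    pid = fork();                        /* host-side stand-in only */
    if (pid < 0)
        return -1;
    if (pid == 0) {
        for (size_t y = half; y < n; y++)
            do_line(ctl, y, 2);
        atomic_store(&ctl->helper_done, 1);
        _exit(0);
    }
    for (size_t y = 0; y < half; y++)
        do_line(ctl, y, 1);
    while (atomic_load(&ctl->helper_done) == 0)
        ;                                /* spin until the helper signals */
    waitpid(pid, NULL, 0);
    return 0;
}

/* Allocate the control block in memory visible to both processes. */
static struct shared_ctl *alloc_shared(void)
{
    void *p = mmap(NULL, sizeof(struct shared_ctl), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}
```

The spin loop marks the stage boundary: the next direction or level may start only after the flag has been observed as set.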
However, we find that the data partition across two cores is inefficient. The main reason is that the data transfer load is heavy; this is discussed further in Section 6.1. Hence, we propose another method of partitioning the DWT jobs, which is discussed in the following subsection.
Figure 5.9: The data partition to two cores.
5.4.2.2 Task Partition
Because of the heavy data transfer load, we try to partition the DWT computation in another way (we name the two cores CoreA and CoreB for discussion): we partition the jobs between memory transfers and the DWT computations themselves. CoreA focuses on DMA control; it is responsible for controlling DMA to move the data into L1 data SRAM