Difficulty and Overhead - Optimization of Multimedia Multi-Resolution Processing Applications on Distributed Scratchpad Memory Multicore Platforms

To preserve identical functional behavior, tile-based scanning introduces memory access overhead in both the X-direction and the Y-direction. Tile shifting in the X-direction carries an additional constraint: memory transfers must be aligned, so the tile's shift has to be a multiple of 16 bytes, which equals 4 pixels, to prevent a bus error. Consequently, the boundary issue is more serious in the X-direction.
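The alignment constraint can be sketched as follows. This is a minimal illustration, not the thesis implementation: the function names and the 24-pixel detection window are assumptions, and the 4-bytes-per-pixel figure is taken from the statement that 16 bytes equal 4 pixels.

```python
ALIGN_BYTES = 16       # transfers must be 16-byte aligned (from the text)
BYTES_PER_PIXEL = 4    # 16 bytes correspond to 4 pixels (from the text)


def aligned_shift(tile_width_px, window_px):
    """Largest X shift, in pixels, that is a multiple of the alignment and
    still keeps every window position covered by some tile.

    Without alignment, the ideal shift is tile_width - (window - 1), which
    leaves an overlap of window - 1 pixels between neighbouring tiles.
    """
    align_px = ALIGN_BYTES // BYTES_PER_PIXEL
    ideal = tile_width_px - (window_px - 1)
    if ideal < align_px:
        raise ValueError("tile is too narrow for this window size")
    return (ideal // align_px) * align_px   # round down to the alignment


def overlap_px(tile_width_px, window_px):
    """Pixels of the previous tile that are transferred again."""
    return tile_width_px - aligned_shift(tile_width_px, window_px)


# Hypothetical example: a 176-pixel-wide tile and a 24-pixel window.
print(aligned_shift(176, 24), overlap_px(176, 24))
```

Rounding the shift down rather than up is deliberate: a larger shift would skip window positions, whereas a smaller one only costs additional overlap.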

Figure 3-3 Boundary issue in X-direction and Y-direction (labels: image, tile; panels: x-direction, y-direction)

Figure 3-3 illustrates the boundary problem of tile-based scanning. In the left graph, the computations that belong to a tile, across all resolutions, are executed while that tile's data resides in local memory. However, some computations need data that lies partly inside the tile and partly outside it. This occurs in both the X-direction and the Y-direction. In the X-direction, shown in the upper-right graph, the current tile and the previous tile share an overlapping data range. This overlap must be transferred again, and the extra memory transfer is the cost of maintaining the same behavior.
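To get a feeling for how much extra transfer the overlap causes, the following sketch estimates the ratio of transferred pixels to image pixels. It is a rough model under stated assumptions: every tile is transferred in full, the X shift is rounded down to 4 pixels, the Y shift is unconstrained, and the image size and window size in the example are hypothetical.

```python
import math

ALIGN_PX = 4  # X shifts must be multiples of 4 pixels (16 bytes), per the text


def tiles_along(length, tile, shift):
    """Number of tiles needed to cover `length` pixels when each tile spans
    `tile` pixels and consecutive tiles start `shift` pixels apart."""
    if length <= tile:
        return 1
    return 1 + math.ceil((length - tile) / shift)


def transfer_ratio(img_w, img_h, tile_w, tile_h, window):
    """Transferred pixels divided by image pixels (1.0 means no overhead)."""
    shift_x = ((tile_w - (window - 1)) // ALIGN_PX) * ALIGN_PX  # aligned
    shift_y = tile_h - (window - 1)                             # unaligned
    if shift_x <= 0 or shift_y <= 0:
        raise ValueError("tile is smaller than the window")
    nx = tiles_along(img_w, tile_w, shift_x)
    ny = tiles_along(img_h, tile_h, shift_y)
    return (nx * ny * tile_w * tile_h) / (img_w * img_h)


# Hypothetical 640x480 image, 176x84 tile, 24-pixel window.
print(transfer_ratio(640, 480, 176, 84, 24))
```

The model overestimates slightly at the image border, where a real implementation would clip the last tile, but it shows how both overlap directions multiply into the total transfer volume.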

3.4 3-Step Optimization Flow

Data-oriented task partitioning improves performance by taking memory access into account. Furthermore, several parameters affect how much performance is gained. The small window implementation depends heavily on tile size, and the peak performance is related to the number of resolutions that use the small window implementation. Therefore, we propose a platform-dependent optimization flow to achieve better performance. This flow enhances the data-oriented task partition in three steps. First, we consider data allocation in the local store; a simple experiment lets us rearrange the data structures stored there. Second, we examine how the tile shape affects performance. Third, we explore the relationship between task granularity and performance. The entire flow is illustrated in Figure 3-4.

Figure 3-4 Platform-dependent optimization flow (start, Data allocation, Tile-shape exploration, Granularity exploration, end)

• Data allocation

This step concerns data allocation in the local store. In object detection, the main data structures are the kernel program, the classifier, and the integral image (held in the small window buffer). The CellCV implementation assumes that the front stages of the classifier are used most often during detection. For convenient execution and to avoid frequent memory access, it loads the front-stage classifiers into a resident local store buffer, while the remaining stages are loaded through a ping-pong buffer during detection. It allocates a small window buffer of 176×84 pixels to store the integral image. The size of this buffer affects performance because it determines the number of resolutions that can use the small window optimization.

Leaving the kernel program aside, we want to enlarge the small window buffer so that more resolutions can use the small window optimization. We therefore use a simple application-level analysis as a guideline for the tradeoff between classifier and integral image.
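The tradeoff between classifier and integral image can be pictured as a simple byte budget. The sketch below is only illustrative: the 256 KB figure is the local store size of a Cell SPE, the 4 bytes per pixel follow from the 16-byte/4-pixel alignment note, and the kernel, resident classifier, and ping-pong sizes in the example are hypothetical placeholders rather than measured values.

```python
LOCAL_STORE_BYTES = 256 * 1024   # local store size of one Cell SPE
BYTES_PER_PIXEL = 4              # 16 bytes correspond to 4 pixels


def window_buffer_pixels(kernel_bytes, resident_bytes, pingpong_half_bytes):
    """Pixels left for the small window buffer after the kernel program,
    the resident classifier, and both halves of the ping-pong buffer."""
    used = kernel_bytes + resident_bytes + 2 * pingpong_half_bytes
    free = LOCAL_STORE_BYTES - used
    if free < 0:
        raise ValueError("allocation exceeds the local store")
    return free // BYTES_PER_PIXEL


# Size of the 176x84 small window buffer mentioned in the text.
baseline_bytes = 176 * 84 * BYTES_PER_PIXEL

# Hypothetical sizes: 100 KB kernel, 60 KB resident classifier,
# 8 KB per ping-pong half.
pixels = window_buffer_pixels(100 * 1024, 60 * 1024, 8 * 1024)
print(baseline_bytes, pixels)
```

Every byte removed from the resident classifier becomes available to the small window buffer, which is why the next analysis asks how many classifier stages actually need to stay resident.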

Figure 3-5 Application-specific analysis of classifier (horizontal axis: stage, 1 to 23; vertical axis: # of candidates, 0 to 100,000)

Figure 3-5 plots the rejection distribution of detection candidates. The horizontal axis is the stage ID and the vertical axis is the number of candidates. For example, the first point shows that about 70,000 candidates are rejected at the first stage, and so on. According to this result, 60 percent of candidates have been rejected by the second stage and 90 percent by the fifth stage.
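An analysis of this kind can be reproduced from per-stage rejection counts. The sketch below uses made-up counts, not the data behind Figure 3-5, and the function names are illustrative; it finds the earliest stage by which a given fraction of candidates has been rejected.

```python
from itertools import accumulate


def cumulative_rejection(rejected_per_stage):
    """Fraction of all rejected candidates removed by each stage (cumulative)."""
    total = sum(rejected_per_stage)
    return [count / total for count in accumulate(rejected_per_stage)]


def stages_needed(rejected_per_stage, fraction):
    """Earliest stage (1-based) by which `fraction` of candidates is rejected."""
    curve = cumulative_rejection(rejected_per_stage)
    for stage, cumulative in enumerate(curve, start=1):
        if cumulative >= fraction:
            return stage
    return len(rejected_per_stage)


# Hypothetical rejection counts for an 8-stage cascade.
counts = [50, 20, 10, 6, 5, 4, 3, 2]
print(stages_needed(counts, 0.6), stages_needed(counts, 0.9))
```

The stage returned for a high fraction is a natural first guess for how many stages to keep resident, to be confirmed by measurement on the target hardware.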

This suggests that classifier usage is concentrated in the first five stages. We can therefore shrink the resident classifier and enlarge the small window buffer under the memory size constraint. However, we cannot be sure that keeping five stages as the resident classifier gives the best performance on the Cell processor, so we also use a simulation-based analysis to find the solution. In Figure 3-6, the X-axis represents the number of features stored as the resident classifier and the Y-axis represents the execution time in milliseconds. Therefore, we can find out the peak

Figure 3-6 Hardware-dependent analysis of classifier

• Tile-shape exploration

The previous analysis settles the data allocation and, with it, the size of the small window buffer. How to use this buffer efficiently is the next critical issue. In this step we therefore examine how the tile shape affects performance. Different shapes introduce different degrees of the boundary problem in both the X-axis and the Y-axis. The shape also affects the number of resolutions that can be implemented with the small window optimization; for example, a slender tile restricts the resolutions usable in small window mode through its shorter side. In Figure 3-7, the X-axis represents the tile's width from short to long, with the tile's height varying inversely. The optimal point is at w=h (square). This matches our expectation, because a square tile lets more resolutions use the small window buffer. However, a small window buffer of fixed size cannot always be mapped to a perfect square, so this exploration selects the tile shape that uses the buffer most appropriately. The case w>h is better than the case w<h. The reason is that window shifting in the X-direction must be a multiple of 16 bytes (aligned access), and the case w<h requires more tile movements in the X-direction, which causes more redundant memory access.
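The candidate shapes for such an exploration can be enumerated mechanically. The following sketch is an assumption-laden illustration rather than the procedure used in the thesis, which selects the shape by simulation: it lists tile shapes that fit a fixed buffer capacity with widths that are multiples of 4 pixels, and picks the most nearly square shape with w >= h. The minimum side of 24 pixels is hypothetical.

```python
def candidate_shapes(capacity_px, min_side, align_px=4):
    """All (width, height) pairs that fit in capacity_px pixels, with the
    width a multiple of align_px and both sides at least min_side."""
    shapes = []
    width = align_px * -(-min_side // align_px)  # first aligned width >= min_side
    while capacity_px // width >= min_side:
        shapes.append((width, capacity_px // width))
        width += align_px
    return shapes


def most_square_wide(shapes):
    """Among shapes with width >= height, the one with the longest shorter
    side, since the shorter side limits the usable resolutions."""
    wide = [s for s in shapes if s[0] >= s[1]]
    return max(wide, key=lambda s: s[1])


# Capacity of the 176x84 small window buffer from the text.
shapes = candidate_shapes(176 * 84, min_side=24)
print(most_square_wide(shapes))
```

Restricting the choice to w >= h reflects the observation that wide tiles need fewer aligned shifts in the X-direction than tall ones.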

Figure 3-7 Tile-shape analysis (horizontal axis: tile width in pixels, 80 to 200; vertical axis: execution time in ms, 400 to 460)

• Granularity exploration

Task granularity raises two issues. Too few tasks lead to an imbalanced workload, while too many tasks increase the software overhead of grabbing tasks from the centralized queue. This step therefore analyzes the relationship between granularity and performance. In this experiment we consider the case of uniform tasks, where multiple tasks can be merged into a packed task to reduce the number of task instances. In the actual setup, the task partition in the regular implementation follows the resolution-based task partition, while the number of tasks in the small window implementation can be controlled by adjusting the tile size. However, this configuration introduces workload imbalance because of an inappropriate task sequence: the earlier tasks, which come from the small window implementation, are finer than the later tasks, which come from the regular implementation. We therefore partition the tasks of the regular implementation more finely, at position level, and then merge tasks into uniform tasks to analyze how task granularity influences performance. Figure 3-8 shows a simulation-based analysis used to find the optimal solution. The horizontal axis denotes the task granularity as a binary logarithm and the vertical axis denotes execution time. The result shows that, with uniform task merging, an average of 512 task instances gives the best performance.
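The two opposing effects of granularity can be illustrated with a toy model of a centralized queue. This is a simplified sketch under assumptions, not the simulation behind Figure 3-8: task costs, the number of workers, and the per-grab overhead are hypothetical, and workers are assumed to take tasks strictly in queue order.

```python
import heapq


def merge_tasks(costs, n_packed):
    """Pack consecutive fine-grained tasks into n_packed tasks that each
    hold a nearly equal number of original tasks."""
    n = len(costs)
    bounds = [i * n // n_packed for i in range(n_packed + 1)]
    return [sum(costs[bounds[i]:bounds[i + 1]]) for i in range(n_packed)]


def makespan(task_costs, n_workers, grab_overhead):
    """Finish time when idle workers repeatedly grab the next task from a
    shared queue and every grab costs grab_overhead."""
    free_at = [0.0] * n_workers      # time at which each worker becomes idle
    heapq.heapify(free_at)
    for cost in task_costs:
        start = heapq.heappop(free_at)            # earliest idle worker
        heapq.heappush(free_at, start + grab_overhead + cost)
    return max(free_at)


# Hypothetical workload: 4096 unit-cost tasks, 6 workers, grab overhead 0.5.
fine = [1.0] * 4096
for log2_n in range(4, 13):
    packed = merge_tasks(fine, 2 ** log2_n)
    print(log2_n, makespan(packed, n_workers=6, grab_overhead=0.5))
```

Sweeping the number of packed tasks in powers of two mirrors the binary-logarithm axis described above; with real, non-uniform task costs the imbalance at coarse granularity becomes more pronounced than in this uniform example.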
