
Chapter 2 Background and Related Work

2.2 OpenCL

OpenCL is an open industry standard from the Khronos Group, supported by AMD, IBM, Intel, NVIDIA and others [5]. It is therefore vendor-neutral. It models heterogeneous multiprocessor platforms consisting of a host and one or more devices, which can be CPUs, GPUs, DSPs, Cell/B.E. processors, and so on. The OpenCL architecture is specified by four models: the platform model, the execution model, the memory model, and the programming model [6].


2.2.1 Platform Model

In the platform model, one host connects to one or more OpenCL devices (compute devices) on a platform as shown in Figure 2-4 [6]. One or more processing elements (PEs) compose a compute unit (CU), and one or more compute units compose an OpenCL device.

Processing elements within a device are the workers doing computations.

An OpenCL program runs on the host according to the models native to the host platform, and submits commands from the host to the processing elements within a device, which execute the computations. The processing elements within a compute unit execute an instruction stream either as SIMD units (each PE has its own data but shares a program counter with the others) or as SPMD units (each PE has its own data and its own program counter).

Figure 2-5 shows an example platform with an Intel Core 2 as the host and NVIDIA's OpenCL platform. In this example the platform has a single compute device, an NVIDIA GeForce GTX 280. The device has 240 shader processors in total, grouped into 30 sets of 8 shader processors each. Mapped onto the OpenCL platform model, the device therefore has 30 compute units, each comprising 8 processing elements.
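The mapping in this example can be written down as a small data structure. The following is only a sketch in plain C, not OpenCL API code: the type device_model and the functions total_processing_elements and gtx280_model are hypothetical names introduced here for illustration, and the numbers are the ones quoted in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one OpenCL device in the platform model. */
typedef struct {
    const char *name;
    size_t compute_units;        /* number of CUs in the device */
    size_t pes_per_compute_unit; /* number of PEs in each CU */
} device_model;

/* Total number of processing elements in the device. */
size_t total_processing_elements(const device_model *dev)
{
    assert(dev != NULL);
    return dev->compute_units * dev->pes_per_compute_unit;
}

/* The GeForce GTX 280 figures quoted in the text: 30 CUs of 8 PEs each. */
device_model gtx280_model(void)
{
    device_model dev = { "NVIDIA GeForce GTX 280", 30, 8 };
    return dev;
}
```

Calling total_processing_elements on the value returned by gtx280_model gives back the 240 shader processors mentioned above.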

Figure 2-4 The Platform Model of OpenCL

Figure 2-5 The Platform Example of OpenCL

2.2.2 Execution Model

An OpenCL program consists of two parts: a host program that executes on the host, and kernels that execute on one or more OpenCL devices [6]. The host program defines the context for the kernels and manages their execution.

How a kernel executes on an OpenCL device is the heart of the OpenCL execution model, shown in Figure 2-6. When a kernel is executed on an OpenCL device, an index space is formed. Each point in the index space corresponds to one kernel instance, called a work-item, which is identified by its position in the index space. Although every work-item executes the same code, each can have its own data and its own program counter; in other words, control flow through the code can vary per work-item. Work-items are further organized into work-groups, which provide a more coarse-grained decomposition of the index space. The work-items in a given work-group execute concurrently on the processing elements of a single compute unit.

The index space within OpenCL is called an NDRange, which is an N-dimensional index space, where N is one, two or three. An NDRange is defined by an integer array of length N specifying the extent of the index space in each dimension starting at an offset index (zero by default).
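The relationship between a work-item's position in the NDRange and its position inside a work-group can be illustrated for a single dimension. The sketch below is plain C rather than kernel code; the names work_item_index, decompose_global_id, compose_global_id and ndrange_work_items are hypothetical, and it assumes that the work-group size divides the extent of the index space evenly.

```c
#include <assert.h>
#include <stddef.h>

/* Position of one work-item along a single NDRange dimension. */
typedef struct {
    size_t global_id; /* index in the whole index space (offset included) */
    size_t group_id;  /* which work-group the item belongs to */
    size_t local_id;  /* index inside that work-group */
} work_item_index;

/* Split a global ID into work-group ID and local ID. */
work_item_index decompose_global_id(size_t global_id, size_t offset,
                                    size_t local_size)
{
    work_item_index idx;
    assert(local_size > 0 && global_id >= offset);
    idx.global_id = global_id;
    idx.group_id = (global_id - offset) / local_size;
    idx.local_id = (global_id - offset) % local_size;
    return idx;
}

/* Inverse mapping: rebuild the global ID from group and local IDs. */
size_t compose_global_id(size_t group_id, size_t local_id, size_t offset,
                         size_t local_size)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id + offset;
}

/* Number of work-items in an N-dimensional range, N being 1, 2 or 3. */
size_t ndrange_work_items(const size_t *extent, unsigned dims)
{
    size_t total = 1;
    assert(extent != NULL && dims >= 1 && dims <= 3);
    for (unsigned d = 0; d < dims; ++d)
        total *= extent[d];
    return total;
}
```

For instance, with a zero offset and a work-group size of 4, the work-item with global ID 13 is the second item (local ID 1) of work-group 3. In a two- or three-dimensional NDRange the same decomposition applies to each dimension independently.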

Figure 2-6 The Execution Model of OpenCL

2.2.3 Memory Model

A work-item executing a kernel may access four distinct memory regions [6]:

• Global Memory:

This memory region permits read/write access to all work-items in all work-groups, so work-items can read from or write to any element of a memory object. Reads and writes to global memory may be cached, depending on the capabilities of the device.

• Constant Memory:

This memory region is a region of global memory that remains constant during the execution of a kernel. The host allocates and initializes memory objects placed into constant memory.

• Local Memory:

This memory region is local to a work-group. It can be used to allocate variables that are shared by all work-items in that work-group. It may be implemented as dedicated regions of memory on the OpenCL device. Alternatively, the local memory region may be mapped onto sections of the global memory.

• Private Memory:

This memory region is private to a work-item. Variables defined in one work-item's private memory are not visible to another work-item.
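The roles of global, local and private memory can be mimicked in ordinary C. The following sketch is a sequential simulation, not kernel code: the function group_sums_sketch and the constant GROUP_SIZE are hypothetical, the loops stand in for work-groups and work-items, and the comment marks the point where a real kernel would need a work-group barrier before reading the shared data.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 4 /* hypothetical work-group size */

/* Compute one partial sum per work-group.
 * global_in and group_sums play the role of global memory,
 * local_scratch the role of local memory, priv the role of private memory. */
void group_sums_sketch(const float *global_in, size_t n, float *group_sums)
{
    assert(global_in != NULL && group_sums != NULL && n % GROUP_SIZE == 0);
    for (size_t group = 0; group < n / GROUP_SIZE; ++group) {
        float local_scratch[GROUP_SIZE]; /* shared by the items of this group */

        /* Each iteration stands for one work-item of the group. */
        for (size_t lid = 0; lid < GROUP_SIZE; ++lid) {
            float priv = global_in[group * GROUP_SIZE + lid]; /* this item only */
            local_scratch[lid] = priv;
        }

        /* On a real device a barrier would be required here. */
        float sum = 0.0f;
        for (size_t lid = 0; lid < GROUP_SIZE; ++lid)
            sum += local_scratch[lid];
        group_sums[group] = sum; /* the result goes back to global memory */
    }
}
```

With the eight inputs 1 to 8 and a group size of 4, the two work-groups produce the partial sums 10 and 26.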

Table 2-1 describes whether the host or the device can allocate from a memory region, and the type of allocation (static, i.e., at compile time, vs. dynamic, i.e., at runtime). Table 2-2 describes the type of access allowed, i.e., whether the host or the device can read and/or write to a memory region. Table 2-3 describes the lifetime of variables defined in each memory region.

Table 2-1 Type of Allocation to Each Memory Region for Hosts and Devices

Allocation   Global    Constant  Local     Private
Host         Dynamic   Dynamic   Dynamic   No
Device       No        Static    Static    Static

Table 2-2 Access Allowed to Each Memory Region for Hosts and Devices

Access   Global       Constant     Local        Private
Host     Read/write   Read/write   No           No
Device   Read/write   Read-only    Read/write   Read/write

Table 2-3 Lifetime of Variables Defined in Each Memory Region

Global Constant Local Private

Lifetime

The memory regions and how they relate to the platform model and the execution model are described in Figure 2-7.

Figure 2-7 The Memory Model of OpenCL

2.2.4 Programming Model

The OpenCL execution model supports the data parallel and task parallel programming models, as well as hybrids of the two [6]. The primary model driving the design of OpenCL is the data parallel programming model.

• Data Parallel Programming Model

A data parallel programming model defines a computation in terms of a sequence of instructions applied to multiple elements of a memory object. The index space associated with the OpenCL execution model defines the work-items and how the data maps onto the work-items. In a strictly data parallel programming model, there is a one-to-one mapping between the work-item and the element in a memory object over which a kernel can be executed in parallel. However, OpenCL implements a relaxed version of the data parallel programming model where a strict one-to-one mapping is not a requirement.
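The difference between the strict and the relaxed mapping can be sketched in plain C. This is a sequential illustration under stated assumptions, not kernel code: the functions scale_one_element, scale_strict and scale_relaxed are hypothetical, each outer loop iteration models one work-item, and the strided assignment of elements in the relaxed version is only one possible mapping.

```c
#include <assert.h>
#include <stddef.h>

/* What a single work-item does in the strict model: exactly one element. */
void scale_one_element(float *data, size_t global_id, float factor)
{
    data[global_id] *= factor;
}

/* Strict mapping: as many work-items as elements. */
void scale_strict(float *data, size_t n, float factor)
{
    assert(data != NULL);
    for (size_t gid = 0; gid < n; ++gid) /* each iteration models a work-item */
        scale_one_element(data, gid, factor);
}

/* Relaxed mapping: fewer work-items, each striding over several elements. */
void scale_relaxed(float *data, size_t n, float factor, size_t work_items)
{
    assert(data != NULL && work_items > 0);
    for (size_t gid = 0; gid < work_items; ++gid)
        for (size_t i = gid; i < n; i += work_items)
            data[i] *= factor;
}
```

Both functions leave the array in the same state; they differ only in how many work-items are modeled and how many elements each one touches.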

• Task Parallel Programming Model

The OpenCL task parallel programming model defines a model in which a single kernel instance is executed independently of any index space. It is logically equivalent to executing a kernel on a compute unit with a work-group containing a single work-item. Under this model, users may express parallelism by one of the following schemes.

i. Using vector data types implemented by the device.

ii. Enqueuing multiple kernels.

iii. Enqueuing native kernels developed using a programming model other than that of OpenCL.

2.2.5 The OpenCL Framework

The OpenCL framework gives developers the means to develop an OpenCL program and to control its behavior. The framework includes the following components:

• OpenCL Platform Layer

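The platform layer implements platform-specific features that allow programs to query OpenCL devices and their configuration information, and to create OpenCL contexts using one or more devices.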

• OpenCL Runtime

The runtime supports numerous API calls that manage OpenCL objects such as command-queues, memory objects, program objects, and kernel objects for __kernel functions in a program object; it also supports API calls that allow you to enqueue commands to a command-queue for executing a kernel, and reading or writing a memory object.

• OpenCL Compiler

The OpenCL standard defines a language for programming kernels as a subset of the ISO C99 language with extensions for parallelism. A compiler that supports this language is therefore required; it creates program executables containing OpenCL kernels.
