Instruction-Level Parallelism: Concepts and Challenges

3.1 Instruction-Level Parallelism: Concepts and Challenges

Although this chapter could be understood without all of the ideas in Section 3.1, this basic material is important to later sections of this chapter as well as to Chapter 4.

There are two largely separable approaches to exploiting ILP. This chapter covers techniques that are largely dynamic and depend on the hardware to locate the parallelism. The next chapter focuses on techniques that are static and rely much more on software. In practice, this partitioning between dynamic and static and between hardware-intensive and software-intensive is not clean, and techniques from one camp are often used by the other. Nonetheless, for exposition purposes, we have separated the two approaches and tried to indicate where an approach is transferable.

The dynamic, hardware-intensive approaches dominate the desktop and server markets and are used in a wide range of processors, including the Pentium III and 4, the Athlon, the MIPS R10000/12000, the Sun UltraSPARC III, the PowerPC 603, G3, and G4, and the Alpha 21264. The static, compiler-intensive approaches, which we focus on in the next chapter, have seen broader adoption in the embedded market than in the desktop or server markets, although the new IA-64 architecture and Intel’s Itanium use this more static approach.

In this section, we discuss features of both programs and processors that limit the amount of parallelism that can be exploited among instructions, as well as the critical mapping between program structure and hardware structure, which is key to understanding whether a program property will actually limit performance and under what circumstances.

Recall that the value of the CPI (Cycles per Instruction) for a pipelined processor is the sum of the base CPI and all contributions from stalls:

Pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls

The ideal pipeline CPI is a measure of the maximum performance attainable by the implementation. By reducing each of the terms of the right-hand side, we minimize the overall pipeline CPI and thus increase the IPC (Instructions per Clock).

In this chapter we will see that the techniques we introduce to increase the ideal IPC can increase the importance of dealing with structural, data hazard, and control stalls. The equation above allows us to characterize the various techniques we examine in this chapter by what component of the overall CPI a technique reduces. Figure 3.1 shows the techniques we examine in this chapter and in the next, as well as the topics covered in the introductory material in Appendix A.

Before we examine these techniques in detail, we need to define the concepts on which these techniques are built. These concepts, in the end, determine the limits on how much parallelism can be exploited.

Instruction-Level Parallelism

All the techniques in this chapter and the next exploit parallelism among instructions. As we stated above, this type of parallelism is called instruction-level parallelism or ILP. The amount of parallelism available within a basic block (a straight-line code sequence with no branches in except at the entry and no branches out except at the exit) is quite small. For typical MIPS programs the average dynamic branch frequency is often between 15% and 25%, meaning that between four and seven instructions execute between a pair of branches. Since these instructions are likely to depend upon one another, the amount of overlap we can exploit within a basic block is likely to be much less than the average basic block size. To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks.
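The pipeline CPI equation given in this section and the branch-frequency estimate above can be checked with a little arithmetic. The following is a minimal sketch in Python; the function names and the stall figures are hypothetical values chosen for illustration, not measurements from the text.

```python
def pipeline_cpi(ideal_cpi, structural_stalls, data_hazard_stalls, control_stalls):
    """Sum the ideal CPI and the per-instruction stall contributions."""
    return ideal_cpi + structural_stalls + data_hazard_stalls + control_stalls


def instructions_per_branch(branch_frequency):
    """Average number of instructions executed per branch, counting the branch."""
    return 1.0 / branch_frequency


# Hypothetical stall contributions, in cycles per instruction.
cpi = pipeline_cpi(ideal_cpi=1.0, structural_stalls=0.05,
                   data_hazard_stalls=0.25, control_stalls=0.20)
ipc = 1.0 / cpi  # IPC is the reciprocal of CPI
print(f"CPI = {cpi:.2f}, IPC = {ipc:.2f}")

# The 15% to 25% dynamic branch frequencies quoted in the text.
for freq in (0.15, 0.25):
    n = instructions_per_branch(freq)
    print(f"branch frequency {freq:.0%}: about {n:.1f} instructions per branch")
```

Reducing any one stall term lowers the sum and raises the IPC, and the reciprocal of the branch frequency gives the four-to-seven instruction range mentioned above.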

The simplest and most common way to increase the amount of parallelism available among instructions is to exploit parallelism among iterations of a loop. This type of parallelism is often called loop-level parallelism. Here is a simple example of a loop, which adds two 1000-element arrays, that is completely parallel:

for (i=1; i<=1000; i=i+1) x[i] = x[i] + y[i];

Every iteration of the loop can overlap with any other iteration, although within each loop iteration there is little or no opportunity for overlap.

There are a number of techniques we will examine for converting such loop-level parallelism into instruction-level parallelism. Basically, such techniques work by unrolling the loop either statically by the compiler (an approach we explore in the next chapter) or dynamically by the hardware (the subject of this chapter).
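To make the idea of unrolling concrete, here is a minimal sketch in Python of the array-addition loop and a version unrolled by a factor of four. The function names and the unrolling factor are assumptions made for illustration, and the arrays are indexed from 0 rather than from 1 as in the loop above.

```python
def add_arrays_rolled(x, y):
    """One element per iteration, as in the original loop."""
    for i in range(len(x)):
        x[i] = x[i] + y[i]
    return x


def add_arrays_unrolled_by_4(x, y):
    """Four independent element updates per iteration, plus a cleanup loop."""
    n = len(x)
    i = 0
    while i + 4 <= n:
        # None of these four statements reads a value written by another,
        # so they expose four independent operations per loop test.
        x[i] = x[i] + y[i]
        x[i + 1] = x[i + 1] + y[i + 1]
        x[i + 2] = x[i + 2] + y[i + 2]
        x[i + 3] = x[i + 3] + y[i + 3]
        i += 4
    while i < n:  # leftover elements when n is not a multiple of 4
        x[i] = x[i] + y[i]
        i += 1
    return x
```

The unrolled body contains one loop test for every four element updates, which is how unrolling turns parallelism among iterations into parallelism among the instructions of a single, larger basic block.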

Technique | Reduces | Section
Forwarding and bypassing | Potential data hazard stalls | A.2
Delayed branches and simple branch scheduling | Control hazard stalls | A.2
Basic dynamic scheduling (scoreboarding) | Data hazard stalls from true dependences | A.8
Dynamic scheduling with renaming | Data hazard stalls and stalls from antidependences and output dependences | 3.2
Dynamic branch prediction | Control stalls | 3.4
Issuing multiple instructions per cycle | Ideal CPI | 3.6
Speculation | Data hazard and control hazard stalls | 3.5
Dynamic memory disambiguation | Data hazard stalls with memory | 3.2, 3.7
Loop unrolling | Control hazard stalls | 4.1
Basic compiler pipeline scheduling | Data hazard stalls | A.2, 4.1
Compiler dependence analysis | Ideal CPI, data hazard stalls | 4.4
Software pipelining, trace scheduling | Ideal CPI, data hazard stalls | 4.3
Compiler speculation | Ideal CPI, data, control stalls | 4.4

FIGURE 3.1 The major techniques examined in Appendix A, Chapter 3, or Chapter 4 are shown together with the component of the CPI equation that the technique affects.

An important alternative method for exploiting loop-level parallelism is the use of vector instructions (see Appendix B). Essentially, a vector instruction operates on a sequence of data items. For example, the above code sequence could execute in four instructions on some vector processors: two instructions to load the vectors x and y from memory, one instruction to add the two vectors, and an instruction to store back the result vector. Of course, these instructions would be pipelined and have relatively long latencies, but these latencies may be overlapped. Vector instructions and the operation of vector processors are described in detail in the online Appendix B. Although the development of the vector ideas preceded many of the techniques we examine in these two chapters for exploiting ILP, processors that exploit ILP have almost completely replaced vector-based processors. Vector instruction sets, however, may see a renaissance, at least for use in graphics, digital signal processing, and multimedia applications.

Data Dependence and Hazards

Determining how one instruction depends on another is critical to determining how much parallelism exists in a program and how that parallelism can be exploited. In particular, to exploit instruction-level parallelism we must determine which instructions can be executed in parallel. If two instructions are parallel, they can execute simultaneously in a pipeline without causing any stalls, assuming the pipeline has sufficient resources (and hence no structural hazards exist). If two instructions are dependent, they are not parallel and must be executed in order, though they may often be partially overlapped. The key in both cases is to determine whether an instruction is dependent on another instruction.

Data Dependences

There are three different types of dependences: data dependences (also called true data dependences), name dependences, and control dependences. An instruction j is data dependent on instruction i if either of the following holds:

• Instruction i produces a result that may be used by instruction j, or

• Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.

The second condition simply states that one instruction is dependent on another if there exists a chain of dependences of the first type between the two instructions. This dependence chain can be as long as the entire program.
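The two conditions above can be sketched as a small dependence finder. This is a minimal illustration in Python under simplifying assumptions: the code is straight-line, only register operands are tracked (memory is ignored), and the `(dest, sources)` tuple encoding and the function name are hypothetical. The sample input encodes a five-instruction loop body that loads an element, adds a scalar, stores the result, decrements a pointer, and branches.

```python
def data_dependences(instrs):
    """Return the set of pairs (i, j) where instruction j is data dependent on i.

    instrs is a list of (dest, sources) tuples for straight-line code: dest is a
    register name or None, and sources is a tuple of register names.
    """
    last_writer = {}  # register -> index of the most recent instruction writing it
    depends_on = [set() for _ in instrs]
    for j, (dest, sources) in enumerate(instrs):
        for reg in sources:
            if reg in last_writer:
                i = last_writer[reg]
                # First condition: i produces a value that j uses.
                # Second condition: j inherits everything i depends on.
                depends_on[j] |= {i} | depends_on[i]
        if dest is not None:
            last_writer[dest] = j
    return {(i, j) for j, deps in enumerate(depends_on) for i in deps}


loop_body = [
    ("F0", ("R1",)),       # 0: L.D    F0,0(R1)
    ("F4", ("F0", "F2")),  # 1: ADD.D  F4,F0,F2
    (None, ("F4", "R1")),  # 2: S.D    F4,0(R1)
    ("R1", ("R1",)),       # 3: DADDIU R1,R1,-8
    (None, ("R1", "R2")),  # 4: BNE    R1,R2,Loop
]
print(sorted(data_dependences(loop_body)))
```

Because instructions are visited in program order, the dependence set of an earlier instruction is already complete when a later instruction inherits it, so the chain condition needs no separate closure pass.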

For example, consider the following code sequence that increments a vector of values in memory (starting at 0(R1) and with the last element at 8(R2)) by a scalar in register F2:

Loop: L.D     F0,0(R1)     ;F0=array element
      ADD.D   F4,F0,F2     ;add scalar in F2
      S.D     F4,0(R1)     ;store result
      DADDIU  R1,R1,#-8    ;decrement pointer 8 bytes (per DW)
      BNE     R1,R2,Loop   ;branch if R1!=R2

The data dependences in this code sequence involve both floating-point data:

Loop: L.D     F0,0(R1)     ;F0=array element
        |
        v
      ADD.D   F4,F0,F2     ;add scalar in F2
        |
        v
      S.D     F4,0(R1)     ;store result

and integer data:

      DADDIU  R1,R1,-8     ;decrement pointer 8 bytes (per DW)
        |
        v
      BNE     R1,R2,Loop   ;branch if R1!=R2

Both of the above are dependent sequences, as shown by the arrows, with each instruction depending on the previous one. The arrows here and in following examples show the order that must be preserved for correct execution. Each arrow points from an instruction that must precede the instruction that the arrowhead points to.

If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped. The dependence implies that there would be a chain of one or more data hazards between the two instructions. Executing the instructions simultaneously will cause a processor with pipeline interlocks to detect a hazard and stall, thereby reducing or eliminating the overlap. In a processor without interlocks that relies on compiler scheduling, the compiler cannot schedule dependent instructions in such a way that they completely overlap, since the program will not execute correctly. The presence of a data dependence in an instruction sequence reflects a data dependence in the source code from which the instruction sequence was generated. The effect of the original data dependence must be preserved.

Dependences are a property of programs. Whether a given dependence results in an actual hazard being detected and whether that hazard actually causes a stall are properties of the pipeline organization. This difference is critical to understanding how instruction-level parallelism can be exploited.

In our example, there is a data dependence between the DADDIU and the BNE; this dependence causes a stall because we moved the branch test for the MIPS pipeline to the ID stage. Had the branch test stayed in EX, this dependence would not cause a stall. Of course, the branch delay would then still be 2 cycles, rather than 1.

The presence of the dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline. The importance of the data dependences is that a dependence (1) indicates the possibility of a hazard, (2) determines the order in which results must be calculated, and (3) sets an upper bound on how much parallelism can possibly be exploited. Such limits are explored in Section 3.8.

Since a data dependence can limit the amount of instruction-level parallelism we can exploit, a major focus of this chapter and the next is overcoming these limitations. A dependence can be overcome in two different ways: maintaining the dependence but avoiding a hazard, and eliminating a dependence by transforming the code. Scheduling the code is the primary method used to avoid a hazard without altering a dependence. In this chapter, we consider hardware schemes for scheduling code dynamically as it is executed. As we will see, some types of dependences can be eliminated, primarily by software, and in some cases by hardware techniques.

A data value may flow between instructions either through registers or through memory locations. When the data flow occurs in a register, detecting the dependence is reasonably straightforward since the register names are fixed in the instructions, although it gets more complicated when branches intervene and correctness concerns cause a compiler or hardware to be conservative.

Dependences that flow through memory locations are more difficult to detect, since two addresses may refer to the same location but look different: for example, 100(R4) and 20(R6) may be identical. In addition, the effective address of a load or store may change from one execution of the instruction to another (so that 20(R4) and 20(R4) may be different), further complicating the detection of a dependence. In this chapter, we examine hardware for detecting data dependences that involve memory locations, but we shall see that these techniques also have limitations. The compiler techniques for detecting such dependences are critical in uncovering loop-level parallelism, as we shall see in the next chapter.
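The two difficulties just described can be illustrated with a few lines of Python. This is only a sketch: the register contents are hypothetical values picked so that the two operands happen to coincide, and `effective_address` is an illustrative helper, not part of any real tool.

```python
def effective_address(offset, base, regs):
    """Address computed by a MIPS-style displacement operand offset(base)."""
    return regs[base] + offset


regs = {"R4": 1000, "R6": 1080}  # hypothetical register contents

# Different-looking operands can name the same memory location.
same_location = effective_address(100, "R4", regs) == effective_address(20, "R6", regs)

# The same operand can name different locations on different executions.
first = effective_address(20, "R4", regs)
regs["R4"] -= 8  # the base register changes between executions
second = effective_address(20, "R4", regs)

print(same_location, first, second)
```

Neither fact is visible from the operand text alone, which is why memory dependences generally have to be resolved from address values at run time or by compiler analysis.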

Name Dependences

The second type of dependence is a name dependence. A name dependence occurs when two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name. There are two types of name dependences between an instruction i that precedes instruction j in program order:

1. An antidependence between instruction i and instruction j occurs when instruction j writes a register or memory location that instruction i reads. The original ordering must be preserved to ensure that i reads the correct value.

2. An output dependence occurs when instruction i and instruction j write the same register or memory location. The ordering between the instructions must be preserved to ensure that the value finally written corresponds to instruction j.

Both antidependences and output dependences are name dependences, as opposed to true data dependences, since there is no value being transmitted between the instructions. Since a name dependence is not a true dependence, instructions involved in a name dependence can execute simultaneously or be reordered, if the name (register number or memory location) used in the instructions is changed so the instructions do not conflict. This renaming can be more easily done for register operands, where it is called register renaming. Register renaming can be done either statically by a compiler or dynamically by the hardware. Before describing dependences arising from branches, let’s examine the relationship between dependences and pipeline data hazards.
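Register renaming as described above can be sketched in a few lines of Python. This is a simplified illustration, not a description of any particular compiler or processor: the `(op, dest, sources)` encoding, the `T0, T1, ...` temporary names, and the three-instruction sequence are all hypothetical.

```python
def rename_registers(instrs):
    """Give every register write a fresh name so only true dependences remain.

    instrs is a list of (op, dest, sources) tuples; dest may be None.
    """
    current = {}  # architectural register -> its latest renamed version
    counter = 0
    renamed = []
    for op, dest, sources in instrs:
        # Sources read whichever version is current, preserving the data flow.
        new_sources = tuple(current.get(reg, reg) for reg in sources)
        new_dest = None
        if dest is not None:
            new_dest = f"T{counter}"  # hypothetical pool of temporary names
            counter += 1
            current[dest] = new_dest
        renamed.append((op, new_dest, new_sources))
    return renamed


code = [
    ("ADD.D", "F6", ("F0", "F8")),    # reads F8
    ("SUB.D", "F8", ("F10", "F14")),  # writes F8: antidependence with ADD.D
    ("MUL.D", "F6", ("F10", "F8")),   # writes F6: output dependence with ADD.D
]
for instr in rename_registers(code):
    print(instr)
```

After renaming, the SUB.D and MUL.D no longer share a destination or overwrite a source of the ADD.D, while the true dependence of MUL.D on the value produced by SUB.D is kept through the shared temporary name.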

Data Hazards

A hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap caused by pipelining, or other reordering of instructions, would change the order of access to the operand involved in the dependence. Because of the dependence, we must preserve what is called program order, that is, the order that the instructions would execute in if executed sequentially one at a time as determined by the original source program. The goal of both our software and hardware techniques is to exploit parallelism by preserving program order only where it affects the outcome of the program. Detecting and avoiding hazards ensures that necessary program order is preserved.

Data hazards may be classified as one of three types, depending on the order of read and write accesses in the instructions. By convention, the hazards are named by the ordering in the program that must be preserved by the pipeline.

Consider two instructions i and j, with i occurring before j in program order. The possible data hazards are

• RAW (read after write): j tries to read a source before i writes it, so j incorrectly gets the old value. This hazard is the most common type and corresponds to a true data dependence. Program order must be preserved to ensure that j receives the value from i. In the simple common five-stage static pipeline (see Appendix A) a load instruction followed by an integer ALU instruction that directly uses the load result will lead to a RAW hazard.

• WAW (write after write): j tries to write an operand before it is written by i. The writes end up being performed in the wrong order, leaving the value written by i rather than the value written by j in the destination. This hazard corresponds to an output dependence. WAW hazards are present only in pipelines that write in more than one pipe stage or allow an instruction to proceed even when a previous instruction is stalled. The classic five-stage integer pipeline used in Appendix A writes a register only in the WB stage and avoids this class of hazards, but this chapter explores pipelines that allow instructions to be reordered, creating the possibility of WAW hazards. WAW hazards can also occur between a short integer pipeline and a longer floating-point pipeline (see the pipelines in Sections A.5 and A.6 of Appendix A). For example, a floating-point multiply instruction that writes F4, shortly followed by a load of F4, could yield a WAW hazard, since the load could complete before the multiply completed.

• WAR (write after read): j tries to write a destination before it is read by i, so i incorrectly gets the new value. This hazard arises from an antidependence. WAR hazards cannot occur in most static issue pipelines, even deeper pipelines or floating-point pipelines, because all reads are early (in ID) and all writes are late (in WB). (See Appendix A to convince yourself.) A WAR hazard occurs either when there are some instructions that write results early in the instruction pipeline and other instructions that read a source late in the pipeline, or when instructions are reordered, as we will see in this chapter.

Note that the RAR (read after read) case is not a hazard.
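The classification above can be sketched as a small Python function that, given an instruction i and a later instruction j, reports which kinds of hazard their register usage could give rise to. The `(dest, sources)` encoding and the function name are hypothetical, and the result only names potential hazards: whether one actually occurs depends on the pipeline, as discussed earlier.

```python
def classify_hazards(i, j):
    """Name the potential data hazards between instruction i and a later instruction j.

    Each instruction is a (dest, sources) tuple: dest is a register name or None,
    and sources is a tuple of register names.
    """
    dest_i, sources_i = i
    dest_j, sources_j = j
    hazards = set()
    if dest_i is not None and dest_i in sources_j:
        hazards.add("RAW")  # j reads what i writes (true data dependence)
    if dest_j is not None and dest_j in sources_i:
        hazards.add("WAR")  # j writes what i reads (antidependence)
    if dest_i is not None and dest_i == dest_j:
        hazards.add("WAW")  # both write the same register (output dependence)
    return hazards


load = ("R1", ("R2",))            # a load into R1
alu = ("R3", ("R1", "R4"))        # an ALU operation that uses R1
fp_multiply = ("F4", ("F0", "F2"))
fp_load = ("F4", ("R5",))         # a later load that also writes F4

print(classify_hazards(load, alu))
print(classify_hazards(fp_multiply, fp_load))
```

Two instructions that only read a common register produce an empty set, matching the observation that the RAR case is not a hazard.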

Control Dependences

The last type of dependence is a control dependence. A control dependence determines the ordering of an instruction, i, with respect to a branch instruction so that the instruction i is executed in correct program order and only when it should be. Every instruction, except for those in the first basic block of the program, is control dependent on some set of branches, and, in general, these control dependences must be preserved to preserve program order. One of the simplest examples of a control dependence is the dependence of the statements in the “then” part of an if statement on the branch. For example, in the code segment:

if p1 {
    S1;
};
if p2 {
    S2;
}

S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1. In general, there are two constraints imposed by control dependences:

1. An instruction that is control dependent on a branch cannot be moved before the branch so that its execution is no longer controlled by the branch. For example, we cannot take an instruction from the then-portion of an if-statement and move it before the if-statement.

2. An instruction that is not control dependent on a branch cannot be moved after the branch so that its execution is controlled by the branch. For example, we cannot take a statement before the if-statement and move it into the then-portion.

Control dependence is preserved by two properties in a simple pipeline, such as that in Chapter 1. First, instructions execute in program order. This ordering ensures that an instruction that occurs before a branch is executed before the branch. Second, the detection of control or branch hazards ensures that an instruction that is control dependent on a branch is not executed until the branch direction is known.

Although preserving control dependence is a useful and simple way to help preserve program order, the control dependence in itself is not the fundamental performance limit. We may be willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program. Control dependence is not the critical property that must be preserved. Instead, the two properties critical to program correctness, and normally preserved by maintaining both data and control dependence, are the exception behavior and the data flow.

Preserving the exception behavior means that any changes in the ordering of instruction execution must not change how exceptions are raised in the program. Often this is relaxed to mean that the reordering of instruction execution must not cause any new exceptions in the program. A simple example shows how maintaining the control and data dependences can prevent such situations. Consider
