Intel SIMD architecture
Computer Organization and Assembly Languages
Yung-Yu Chuang
Overview
• SIMD
• MMX architectures
• MMX instructions
• examples
• SSE/SSE2
• SIMD instructions are probably the best place to use assembly, since compilers usually do not do a good job of using these instructions
Performance boost
• Increasing clock rate is not fast enough for boosting performance
In his 1965 paper, Intel co-founder Gordon Moore observed that the number of transistors per square inch had doubled every year; the rate is now commonly quoted as a doubling every 18 months.
Performance boost
• Architecture improvements (such as pipeline/cache/SIMD) are more significant
• Intel analyzed multimedia applications and found they share the following characteristics:
– Small native data types (8-bit pixel, 16-bit audio)
– Recurring operations
– Inherent parallelism
SIMD
• SIMD (single instruction multiple data) architecture performs the same operation on multiple data elements in parallel
• PADDW MM0, MM1
SISD/SIMD/Streaming
IA-32 SIMD development
• MMX (Multimedia Extension) was introduced in 1996 (Pentium with MMX and Pentium II).
• SSE (Streaming SIMD Extension) was introduced with Pentium III.
• SSE2 was introduced with Pentium 4.
• SSE3 was introduced with Pentium 4 supporting hyper-threading technology. SSE3 adds 13 more instructions.
MMX
• After analyzing a lot of existing applications such as graphics, MPEG, music, speech recognition, games and image processing, Intel found that many multimedia algorithms execute the same instructions on many pieces of data in a large data set.
• Typical elements are small: 8 bits for pixels, 16 bits for audio, 32 bits for graphics and general computing.
MMX data types
MMX integration into IA
• 8 MMX registers, MM0~MM7, are stored in the low 64 bits of the 80-bit FPU registers.
• When an MMX register is written, bits 79-64 are set to ones (11…11), so the value reads as NaN or infinity when interpreted as a real.
• Even if MMX registers are 64-bit, they don't extend Pentium to a 64-bit CPU since only logic instructions are provided for 64-bit data.
Compatibility
• To be fully compatible with existing IA, no new mode or state was created. Hence, for context switching, no extra state needs to be saved.
• To reach the goal, MMX is hidden behind FPU. When floating-point state is saved or restored, MMX is saved or restored.
• It allows existing OS to perform context switching on the processes executing MMX instructions without being aware of MMX.
• However, it means MMX and FPU cannot be used at the same time. Big overhead to switch.
Compatibility
• Although Intel defends its decision to alias MMX to FPU for compatibility, it is actually a bad decision. The OS can just provide a service pack or get updated.
• This is why Intel introduced SSE later without any aliasing.
MMX instructions
• 57 MMX instructions are defined to perform the parallel operations on multiple data elements packed into 64-bit data types.
• These include add, subtract, multiply, compare, shift, data conversion, 64-bit data move, 64-bit logical operation and multiply-add for multiply-accumulate operations.
• All instructions except for data move use MMX registers as operands.
• Most complete support for 16-bit operations.
Saturation arithmetic
• Useful in graphics applications.
• When an operation overflows or underflows, the result becomes the largest or smallest possible representable number.
• Two types: signed and unsigned saturation
Figure: wrap-around vs. saturating arithmetic
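The difference between wrap-around and saturating addition can be sketched in scalar C. This is only an illustration of the arithmetic that instructions such as PADDB (wrap-around), PADDUSB (unsigned saturation) and PADDSB (signed saturation) apply to each byte; the function names are hypothetical.

```c
#include <stdint.h>

/* Wrap-around: the result is taken modulo 256. */
uint8_t add_wrap_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);
}

/* Unsigned saturation: the result is clamped to 0..255. */
uint8_t add_sat_u8(uint8_t a, uint8_t b) {
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Signed saturation: the result is clamped to -128..127. */
int8_t add_sat_s8(int8_t a, int8_t b) {
    int sum = (int)a + b;
    if (sum > 127)  sum = 127;
    if (sum < -128) sum = -128;
    return (int8_t)sum;
}
```

For a pixel value of 200 brightened by 100, wrap-around gives 44 (a dark pixel), while unsigned saturation gives 255 (white), which is why saturation is useful in graphics.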
MMX instructions
Arithmetic
• PADDB/PADDW/PADDD: add two packed numbers; no EFLAGS is set, so you must ensure overflow never occurs yourself
• Multiplication: two steps
– PMULLW: multiplies four words and stores the four lo words of the four double word results
– PMULHW/PMULHUW: multiplies four words and stores the four hi words of the four double word results. PMULHUW for unsigned.
Arithmetic
• PMADDWD
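PMADDWD multiplies corresponding signed words and adds each adjacent pair of 32-bit products. A minimal sketch using the SSE2 intrinsic `_mm_madd_epi16` (the 128-bit form of the same instruction, chosen here so the code also builds on 64-bit compilers); the helper name `dot8_i16` is hypothetical.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Dot product of two 8-element vectors of signed 16-bit values. */
int32_t dot8_i16(const int16_t *x, const int16_t *y) {
    __m128i vx = _mm_loadu_si128((const __m128i *)x);
    __m128i vy = _mm_loadu_si128((const __m128i *)y);
    /* PMADDWD: 8 word products, adjacent pairs summed into 4 dwords */
    __m128i prod = _mm_madd_epi16(vx, vy);
    int32_t part[4];
    _mm_storeu_si128((__m128i *)part, prod);
    /* Final horizontal sum of the four partial results */
    return part[0] + part[1] + part[2] + part[3];
}
```

This multiply-accumulate pattern is the core of kernels such as FIR filters and vector dot products.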
Detect MMX/SSE
    mov  eax, 1          ; request version info
    cpuid                ; supported since Pentium
    test edx, 00800000h  ; bit 23: MMX
                         ; 02000000h (bit 25) SSE
                         ; 04000000h (bit 26) SSE2
    jnz  HasMMX
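The same check can be written in C. The sketch below assumes a GCC or Clang toolchain on x86, which provides the `<cpuid.h>` helper `__get_cpuid`; other compilers expose CPUID differently. The function names are hypothetical.

```c
#include <cpuid.h>   /* GCC/Clang helper header (toolchain assumption) */

/* Returns 1 if CPUID leaf 1 reports the given feature bit in EDX. */
int cpu_has_edx_bit(unsigned bit) {
    unsigned eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                 /* leaf 1 is not available */
    return (int)((edx >> bit) & 1u);
}

int has_mmx(void)  { return cpu_has_edx_bit(23); }  /* 00800000h */
int has_sse(void)  { return cpu_has_edx_bit(25); }  /* 02000000h */
int has_sse2(void) { return cpu_has_edx_bit(26); }  /* 04000000h */
```

A program would typically call these once at startup and select a SIMD or scalar code path accordingly.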
Example: add a constant to a vector
char d[]={5, 5, 5, 5, 5, 5, 5, 5};
char clr[]={65,66,68,...,87,88}; // 24 bytes
__asm{
    movq mm1, d
    mov ecx, 3          // LOOP counts with ECX in 32-bit code
    mov esi, 0
L1: movq mm0, clr[esi]
    paddb mm0, mm1
    movq clr[esi], mm0
    add esi, 8
    loop L1
    emms
}
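For comparison, here is a sketch of the same loop with compiler intrinsics. It uses SSE2 intrinsics on 8-byte chunks to mirror the MOVQ/PADDB steps (SSE2 rather than MMX is a substitution made here so that no EMMS is needed); `add_const_bytes` is a hypothetical name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Adds k to every byte of buf; n must be a multiple of 8. */
void add_const_bytes(char *buf, size_t n, char k) {
    __m128i vk = _mm_set1_epi8(k);            /* k replicated in every byte */
    for (size_t i = 0; i < n; i += 8) {
        /* 64-bit load, like MOVQ */
        __m128i v = _mm_loadl_epi64((const __m128i *)(buf + i));
        v = _mm_add_epi8(v, vk);              /* wrap-around byte add, like PADDB */
        /* 64-bit store, like MOVQ */
        _mm_storel_epi64((__m128i *)(buf + i), v);
    }
}
```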
Comparison
• No EFLAGS are set (how many flags would you need?). Results are stored in the destination.
• EQ/GT, no LT
Change data types
• Pack: converts a larger data type to the next smaller data type.
• Unpack: takes two operands and interleaves them. It can be used to expand a data type for intermediate calculation.
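A short sketch of the unpack/compute/pack pattern, written with SSE2 intrinsics that correspond to PUNPCKLBW and PACKUSWB. The function name `brighten8` and the choice of a brightness offset as the intermediate calculation are illustrative assumptions.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Adds a signed offset to 8 unsigned pixels, clamping to 0..255. */
void brighten8(const uint8_t *src, uint8_t *dst, int16_t offset) {
    __m128i zero  = _mm_setzero_si128();
    __m128i bytes = _mm_loadl_epi64((const __m128i *)src);
    /* Unpack: interleave with zero, so each byte becomes a 16-bit word */
    __m128i words = _mm_unpacklo_epi8(bytes, zero);
    /* The intermediate result may leave the 0..255 range */
    words = _mm_add_epi16(words, _mm_set1_epi16(offset));
    /* Pack with unsigned saturation: each word is clamped to 0..255 */
    __m128i packed = _mm_packus_epi16(words, zero);
    _mm_storel_epi64((__m128i *)dst, packed);
}
```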
Figure: Pack with signed saturation
Figure: Unpack low portion
Figure: Unpack high portion
Keys to SIMD programming
• Efficient data layout
• Elimination of branches
Application: frame difference
Figure: frames A and B, and their absolute difference |A-B|
Figure: A-B, B-A, and (A-B) or (B-A)
Application: frame difference
MOVQ    mm1, A    //move 8 pixels of image A
MOVQ    mm2, B    //move 8 pixels of image B
MOVQ    mm3, mm1  // mm3=A
PSUBUSB mm1, mm2  // mm1=A-B, unsigned saturation (0 where A<B)
PSUBUSB mm2, mm3  // mm2=B-A, unsigned saturation (0 where B<A)
POR     mm1, mm2  // mm1=|A-B|
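The branch-free absolute difference can also be sketched with SSE2 intrinsics, processing 16 pixels at a time. Pixels are assumed to be unsigned bytes, so unsigned saturating subtraction is used; `abs_diff16` is a hypothetical name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* out[i] = |a[i] - b[i]| for 16 unsigned 8-bit pixels. */
void abs_diff16(const uint8_t *a, const uint8_t *b, uint8_t *out) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i d1 = _mm_subs_epu8(va, vb);   /* A-B, clamped to 0 where A<B */
    __m128i d2 = _mm_subs_epu8(vb, va);   /* B-A, clamped to 0 where B<A */
    /* At most one of d1, d2 is nonzero per byte, so OR merges them */
    _mm_storeu_si128((__m128i *)out, _mm_or_si128(d1, d2));
}
```

Saturation replaces the per-pixel comparison and branch that a scalar implementation would need.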
Example: image fade-in-fade-out
A*α+B*(1-α) = B+α(A-B)
Figure: images A and B blended with α=0.75, α=0.5 and α=0.25
Example: image fade-in-fade-out
• Two formats: planar and chunky
• In chunky format, 16 bits of 64 bits are wasted
• So, we use planar in the following example
Figure: chunky pixel layout R G B A R G B A
Example: image fade-in-fade-out
MOVQ      mm0, alpha //four 16-bit zero-padded copies of α
MOVD      mm1, A     //move 4 pixels of image A
MOVD      mm2, B     //move 4 pixels of image B
PXOR      mm3, mm3   //clear mm3 to all zeroes
//unpack 4 pixels to 4 words
PUNPCKLBW mm1, mm3   // Because A-B could be
PUNPCKLBW mm2, mm3   // negative, need 16 bits
PSUBW     mm1, mm2   //(A-B)
PMULHW    mm1, mm0   //(A-B)*fade/256
PADDW     mm1, mm2   //(A-B)*fade + B
//pack four words back to four bytes
PACKUSWB  mm1, mm3
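A sketch of the blend B+α(A-B) with SSE2 intrinsics. The fixed-point scaling is an assumption of this sketch, not taken from the slide: α is passed as an integer 0..128 (α×128), so the product (A-B)×α always fits in a signed 16-bit word and a low multiply plus an arithmetic shift can be used. `fade8` is a hypothetical name.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* out = B + ((A - B) * alpha >> 7) for 8 pixels, alpha in 0..128. */
void fade8(const uint8_t *a, const uint8_t *b, uint8_t *out, int alpha) {
    __m128i zero = _mm_setzero_si128();
    /* Unpack bytes to words because A-B can be negative */
    __m128i va = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)a), zero);
    __m128i vb = _mm_unpacklo_epi8(_mm_loadl_epi64((const __m128i *)b), zero);
    __m128i diff = _mm_sub_epi16(va, vb);                 /* A-B */
    /* |A-B| <= 255 and alpha <= 128, so the product fits in 16 bits */
    __m128i prod = _mm_mullo_epi16(diff, _mm_set1_epi16((short)alpha));
    __m128i res  = _mm_add_epi16(vb, _mm_srai_epi16(prod, 7));
    /* Pack the words back to bytes with unsigned saturation */
    _mm_storel_epi64((__m128i *)out, _mm_packus_epi16(res, zero));
}
```

Calling `fade8` repeatedly with alpha stepping from 0 to 128 fades from image B to image A.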
Data-independent computation
• Each operation can execute without needing to know the results of a previous operation.
• Example, sprite overlay
for i=1 to sprite_Size
  if sprite[i]=clr
    then out_color[i]=bg[i]
    else out_color[i]=sprite[i]
• How to execute data-dependent calculations on several pixels in parallel?
Application: sprite overlay
MOVQ    mm0, sprite
MOVQ    mm2, mm0    // mm2=sprite
MOVQ    mm4, bg
MOVQ    mm1, clr
PCMPEQW mm0, mm1    // mm0=mask, all ones where sprite==clr
PAND    mm4, mm0    // mm4=bg where sprite==clr
PANDN   mm0, mm2    // mm0=sprite where sprite!=clr
POR     mm0, mm4    // merge the two
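The compare-and-mask selection can be sketched with SSE2 intrinsics for eight 16-bit pixels. The function name `overlay8` and the 16-bit pixel format are assumptions made for illustration.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* out[i] = (sprite[i] == clr) ? bg[i] : sprite[i], for 8 pixels, no branches. */
void overlay8(const uint16_t *sprite, const uint16_t *bg,
              uint16_t clr, uint16_t *out) {
    __m128i s = _mm_loadu_si128((const __m128i *)sprite);
    __m128i g = _mm_loadu_si128((const __m128i *)bg);
    /* PCMPEQW: each word becomes all ones where sprite == clr, else zero */
    __m128i mask = _mm_cmpeq_epi16(s, _mm_set1_epi16((short)clr));
    __m128i from_bg     = _mm_and_si128(mask, g);     /* PAND  */
    __m128i from_sprite = _mm_andnot_si128(mask, s);  /* PANDN: (~mask) & s */
    _mm_storeu_si128((__m128i *)out, _mm_or_si128(from_bg, from_sprite));
}
```

The mask turns the data-dependent if/else into three logical operations that apply to all pixels at once.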
Application: matrix transpose
Application: matrix transpose
char M1[4][8];// matrix to be transposed
char M2[8][4];// transposed matrix
int n=0;
for (int i=0;i<4;i++)
  for (int j=0;j<8;j++)
    { M1[i][j]=n; n++; }
__asm{
//move the 4 rows of M1 into MMX registers
movq mm1,M1
movq mm2,M1+8
movq mm3,M1+16
movq mm4,M1+24
Application: matrix transpose
//generate rows 1 to 4 of M2
punpcklbw mm1, mm2
punpcklbw mm3, mm4
movq mm0, mm1
punpcklwd mm1, mm3 //mm1 has row 2 & row 1
punpckhwd mm0, mm3 //mm0 has row 4 & row 3
movq M2, mm1
movq M2+8, mm0
Application: matrix transpose
//generate rows 5 to 8 of M2
movq mm1, M1 //get row 1 of M1
movq mm3, M1+16 //get row 3 of M1
punpckhbw mm1, mm2
punpckhbw mm3, mm4
movq mm0, mm1
punpcklwd mm1, mm3 //mm1 has row 6 & row 5
punpckhwd mm0, mm3 //mm0 has row 8 & row 7
//save results to M2
movq M2+16, mm1
movq M2+24, mm0
emms
}
Performance boost (data from 1996)
Benchmark kernels: FFT, FIR, vector dot-product, IDCT, motion compensation.
65% performance gain
Lower the cost of multimedia programs by removing the need for specialized DSP chips
How to use assembly in projects
• Write the whole project in assembly
• Link with high-level languages
• Inline assembly
• Intrinsics
Link ASM and HLL programs
• Assembly is rarely used to develop the entire program.
• Use high-level language for overall project development
– Relieves programmer from low-level details
• Use assembly language code
– Speed up critical sections of code
– Access nonstandard hardware devices
– Write platform-specific code
– Extend the HLL's capabilities
General conventions
• Considerations when calling assembly language procedures from high-level languages:
– Both must use the same naming convention (rules regarding the naming of variables and procedures)
– Both must use the same memory model, with compatible segment names
– Both must use the same calling convention
Inline assembly code
• Assembly language source code that is inserted directly into a HLL program.
• Compilers such as Microsoft Visual C++ and Borland C++ have compiler-specific directives that identify inline ASM code.
• Efficient inline code executes quickly because CALL and RET instructions are not required.
• Simple to code because there are no external names, memory models, or naming conventions involved.
• Decidedly not portable because it is written for a single platform.
__asm directive in Microsoft Visual C++
• Can be placed at the beginning of a single statement
• Or, it can mark the beginning of a block of assembly language statements
• Syntax:
__asm statement
__asm {
    statement-1
    statement-2
    ...
    statement-n
}
Intrinsics
• An intrinsic is a function known by the compiler that directly maps to a sequence of one or more assembly language instructions.
• The compiler manages things that the user would normally have to be concerned with, such as register names, register allocations, and memory locations of data.
• Intrinsic functions are inherently more efficient than called functions because no calling linkage is required.
Intrinsics
#include <xmmintrin.h>
__m128 a , b , c;
c = _mm_add_ps( a , b );

// the intrinsic above does the same work as this scalar loop
float a[4] , b[4] , c[4];
for( int i = 0 ; i < 4 ; ++ i )
    c[i] = a[i] + b[i];
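A complete, compilable version of the intrinsic form might look like the sketch below. Unaligned loads and stores are used so that the caller's arrays need no special alignment; `add4` is a hypothetical name.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* c[i] = a[i] + b[i] for i = 0..3, using one packed add. */
void add4(const float *a, const float *b, float *c) {
    __m128 va = _mm_loadu_ps(a);            /* load 4 floats (no alignment required) */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(c, _mm_add_ps(va, vb));   /* ADDPS, then store 4 floats */
}
```

The compiler chooses the XMM registers and addressing modes; the source only names the operation.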
SSE
• Adds eight 128-bit registers
• Allows SIMD operations on packed single-precision floating-point numbers
• Most SSE instructions require 16-byte-aligned addresses
SSE features
• Adds eight 128-bit data registers (XMM registers) in non-64-bit modes; sixteen XMM registers are available in 64-bit mode.
• 32-bit MXCSR register (control and status)
• Adds a new data type: 128-bit packed single-precision floating-point (4 FP numbers)
• Instructions to perform SIMD operations on 128-bit packed single-precision FP and additional 64-bit SIMD integer operations.
• Instructions that explicitly prefetch data, control data cacheability and ordering of stores
SSE programming environment
XMM0 ~ XMM7
MM0 ~ MM7
EAX, EBX, ECX, EDX, EBP, ESI, EDI, ESP
MXCSR control and status register
Generally faster, but not compatible with IEEE 754
Exception
_MM_ALIGN16 float test1[4] = { 0, 0, 0, 1 };
_MM_ALIGN16 float test2[4] = { 1, 2, 3, 0 };
_MM_ALIGN16 float out[4];
_MM_SET_EXCEPTION_MASK(0); //enable exception; without this, result is 1.#INF
__try {
    __m128 a = _mm_load_ps(test1);
    __m128 b = _mm_load_ps(test2);
    a = _mm_div_ps(a, b);
    _mm_store_ps(out, a);
}
__except(EXCEPTION_EXECUTE_HANDLER) {
    if(_mm_getcsr() & _MM_EXCEPT_DIV_ZERO)
        cout << "Divide by zero" << endl;
    return;
}
SSE packed FP operation
• ADDPS/SUBPS: packed single-precision FP
SSE scalar FP operation
SSE2
• Provides ability to perform SIMD operations on double-precision FP, allowing advanced graphics such as ray tracing
• Provides greater throughput by operating on 128-bit packed integers, useful for RSA and RC5
SSE2 features
• Adds data types and instructions for them
• Programming environment unchanged
Example
void add(float *a, float *b, float *c) {
    for (int i = 0; i < 4; i++)
        c[i] = a[i] + b[i];
}

movaps: move aligned packed single-precision FP
addps: add packed single-precision FP

__asm {
    mov eax, a
    mov edx, b
    mov ecx, c
    movaps xmm0, XMMWORD PTR [eax]
    addps xmm0, XMMWORD PTR [edx]
    movaps XMMWORD PTR [ecx], xmm0
}
SSE Shuffle (SHUFPS)
SHUFPS xmm1, xmm2, imm8
Select[1..0] decides which DW of DEST is copied to the 1st DW of DEST ...
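The selection rule can be tried out through the `_mm_shuffle_ps` intrinsic: the two low result elements are picked from the first operand and the two high ones from the second, each by a 2-bit field of the immediate. A small sketch with hypothetical function names:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Reverses the four floats: out = in[3], in[2], in[1], in[0]. */
void reverse4(const float *in, float *out) {
    __m128 v = _mm_loadu_ps(in);
    /* _MM_SHUFFLE(d,c,b,a): result[0]=first[a], result[1]=first[b],
       result[2]=second[c], result[3]=second[d] */
    _mm_storeu_ps(out, _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3)));
}

/* Combines the low halves: out = a[0], a[1], b[0], b[1]. */
void low_halves(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_shuffle_ps(va, vb, _MM_SHUFFLE(1, 0, 1, 0)));
}
```

Passing the same register as both operands, as in `reverse4`, gives an arbitrary permutation of one vector.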
Example (cross product)
Vector cross(const Vector& a , const Vector& b ) {
    return Vector(
        ( a[1] * b[2] - a[2] * b[1] ) ,
        ( a[2] * b[0] - a[0] * b[2] ) ,
        ( a[0] * b[1] - a[1] * b[0] ) );
}
Example (cross product)
/* cross */
__m128 _mm_cross_ps( __m128 a , __m128 b ) {
    __m128 ea , eb;
    // set to a[1][2][0][3] , b[2][0][1][3]
    ea = _mm_shuffle_ps( a, a, _MM_SHUFFLE(3,0,2,1) );
    eb = _mm_shuffle_ps( b, b, _MM_SHUFFLE(3,1,0,2) );
    // multiply
    __m128 xa = _mm_mul_ps( ea , eb );
    // set to a[2][0][1][3] , b[1][2][0][3]
    a = _mm_shuffle_ps( a, a, _MM_SHUFFLE(3,1,0,2) );
    b = _mm_shuffle_ps( b, b, _MM_SHUFFLE(3,0,2,1) );
    // multiply
    __m128 xb = _mm_mul_ps( a , b );
    // subtract
    return _mm_sub_ps( xa , xb );
}
Example: dot product
• Given a set of vectors {v1,v2,…vn}={(x1,y1,z1), (x2,y2,z2),…, (xn,yn,zn)} and a vector vc=(xc,yc,zc), calculate {vc·vi}
• Two options for memory layout
• Array of structure (AoS)
typedef struct { float dc, x, y, z; } Vertex;
Vertex v[n];
• Structure of array (SoA)
Example: dot product (AoS)
movaps xmm0, v         ; xmm0 = DC, x0, y0, z0
movaps xmm1, vc        ; xmm1 = DC, xc, yc, zc
mulps xmm0, xmm1       ; xmm0 = DC, x0*xc, y0*yc, z0*zc
movhlps xmm1, xmm0     ; xmm1 = DC, DC, DC, x0*xc
addps xmm1, xmm0       ; xmm1 = DC, DC, DC, x0*xc+z0*zc
movaps xmm2, xmm0
shufps xmm2, xmm2, 55h ; xmm2 = DC, DC, DC, y0*yc
addps xmm1, xmm2       ; xmm1 = DC, DC, DC, x0*xc+y0*yc+z0*zc
Example: dot product (SoA)
; X = x1,x2,...,x3
; Y = y1,y2,...,y3
; Z = z1,z2,...,z3
; A = xc,xc,xc,xc
; B = yc,yc,yc,yc
; C = zc,zc,zc,zc
movaps xmm0, X    ; xmm0 = x1,x2,x3,x4
movaps xmm1, Y    ; xmm1 = y1,y2,y3,y4
movaps xmm2, Z    ; xmm2 = z1,z2,z3,z4
mulps xmm0, A     ; xmm0 = x1*xc,x2*xc,x3*xc,x4*xc
mulps xmm1, B     ; xmm1 = y1*yc,y2*yc,y3*yc,y4*yc
mulps xmm2, C     ; xmm2 = z1*zc,z2*zc,z3*zc,z4*zc
addps xmm0, xmm1
addps xmm0, xmm2  ; xmm0 = (x1*xc+y1*yc+z1*zc)…
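The SoA version maps directly onto intrinsics: four dot products are produced with three multiplies and two adds, and no shuffles are needed. A minimal sketch, assuming the x, y and z components are stored in separate float arrays; `dot4_soa` is a hypothetical name.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* out[i] = x[i]*xc + y[i]*yc + z[i]*zc for i = 0..3. */
void dot4_soa(const float *x, const float *y, const float *z,
              float xc, float yc, float zc, float *out) {
    /* _mm_set1_ps broadcasts a scalar, like the A, B, C constants above */
    __m128 acc = _mm_mul_ps(_mm_loadu_ps(x), _mm_set1_ps(xc));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(y), _mm_set1_ps(yc)));
    acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(z), _mm_set1_ps(zc)));
    _mm_storeu_ps(out, acc);
}
```

Compared with the AoS version, every lane of every instruction does useful work, which is the point of choosing an efficient data layout.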
Other SIMD architectures
• Graphics Processing Unit (GPU): nVidia 7800, 24 pipelines (8 vector/16 fragment)
NVidia GeForce 8800, 2006
• Each GeForce 8800 GPU stream processor is a fully generalized, fully decoupled, scalar processor that supports IEEE 754 floating point precision.
• Up to 128 stream processors
Cell processor
• Cell Processor (IBM/Toshiba/Sony): 1 PPE (Power Processor Element) + 8 SPEs (Synergistic Processor Elements)
• An SPE is a RISC processor with 128-bit SIMD for single/double precision instructions, 128 128-bit registers, 256 KB local store
• Used in PS3.
Cell processor
GPUs keep pace with Moore’s law better
Different programming paradigms