Algorithms for Dynamical Fermions
A D Kennedy
School of Physics, The University of Edinburgh
Outline
Monte Carlo integration
Importance sampling
Markov chains
Detailed balance, Metropolis algorithm
Symplectic integrators
Hybrid Monte Carlo
Pseudofermions
RHMC
Monte Carlo methods: I
Monte Carlo integration is based on the identification of probabilities with measures
There are much better methods of carrying out low dimensional quadrature
All other methods become hopelessly expensive for large dimensions
In lattice QFT there is one integration per degree of freedom
Measure the value of Ω on each configuration and compute the average
Ω̄ ≡ (1/N) Σ_{t=1}^{N} Ω(φ_t)
Monte Carlo methods: II
Generate a sequence of random field configurations (φ₁, φ₂, …, φ_t, …, φ_N) chosen from the probability distribution
P(φ_t) dφ_t = (1/Z) e^{−S(φ_t)} dφ_t
Central Limit Theorem: I
Distribution of values for a single sample ω = Ω(φ):
P_Ω(ω) ≡ ∫ dφ P(φ) δ(ω − Ω(φ)) = ⟨δ(ω − Ω(φ))⟩
Law of Large Numbers: lim_{N→∞} Ω̄ = ⟨Ω⟩
Central Limit Theorem: Ω̄ ∼ ⟨Ω⟩ + O(√(C₂/N)), where the variance of the distribution of Ω̄ is C₂/N with C₂ ≡ ⟨(Ω − ⟨Ω⟩)²⟩
The Laplace-DeMoivre Central Limit theorem is an asymptotic expansion for the probability distribution of Ω̄
The first few cumulants are C₀ = 0, C₁ = ⟨Ω⟩, C₂ = ⟨(Ω − C₁)²⟩, C₃ = ⟨(Ω − C₁)³⟩, C₄ = ⟨(Ω − C₁)⁴⟩ − 3C₂²
Central Limit Theorem: II
Generating function for connected moments (cumulants):
W_Ω(k) ≡ ln ∫ dω P_Ω(ω) e^{ikω} = ln ∫ dφ P(φ) e^{ikΩ(φ)} = ln ⟨e^{ikΩ}⟩ = Σ_{n=0}^{∞} C_n (ik)ⁿ/n!
Central Limit Theorem: III
Connected generating function for the sample average: with
P_Ω̄(ω) ≡ ∫ dφ₁ ⋯ dφ_N P(φ₁) ⋯ P(φ_N) δ(ω − (1/N) Σ_{t=1}^{N} Ω(φ_t))
we have
W_Ω̄(k) = ln ⟨e^{ikΩ̄}⟩ = ln [∫ dφ P(φ) e^{ikΩ(φ)/N}]^N = N W_Ω(k/N) = Σ_{n=1}^{∞} C_n (ik)ⁿ/(n! N^{n−1})
Central Limit Theorem: IV
Take the inverse Fourier transform to obtain the distribution P_Ω̄:
P_Ω̄(ω) = (1/2π) ∫ dk e^{−ikω} e^{W_Ω̄(k)}
∼ (1/2π) ∫ dk exp[ik(⟨Ω⟩ − ω) − C₂k²/2N − iC₃k³/3!N² + C₄k⁴/4!N³ − ⋯]
= exp[−(C₃/3!N²) d³/dω³ + (C₄/4!N³) d⁴/dω⁴ − ⋯] √(N/2πC₂) e^{−N(ω−⟨Ω⟩)²/2C₂}
Central Limit Theorem: V
where ξ ≡ (ω − ⟨Ω⟩)√N, so that P_Ω̄(ω) = F(ξ) dξ/dω with
F(ξ) = (e^{−ξ²/2C₂}/√(2πC₂)) [1 + (C₃/6C₂³√N)(ξ³ − 3C₂ξ) + ⋯]
Importance Sampling: I
Sample from the distribution ρ:
Probability: ρ(x) > 0 a.e.
Normalisation: N = ∫ dx ρ(x) = 1
Integral: I = ∫ dx f(x)
Estimator of the integral: I = ⟨f/ρ⟩_ρ = ∫ dx ρ(x) f(x)/ρ(x)
Estimator of the variance: V = ∫ dx ρ(x) (f(x)/ρ(x) − I)² = ∫ dx f(x)²/ρ(x) − I²
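As an illustration, here is a minimal numpy sketch of these two estimators (the test function, weight, and sampler below are hypothetical stand-ins, not anything from the lectures):

```python
import numpy as np

def importance_estimate(f, rho, sample_rho, n=100_000, seed=0):
    """Estimate I = ∫ f(x) dx by sampling x ~ rho and averaging f(x)/rho(x)."""
    rng = np.random.default_rng(seed)
    x = sample_rho(n, rng)        # draws from the normalised weight rho
    w = f(x) / rho(x)             # each sample is an unbiased estimator of I
    return w.mean(), w.var()      # estimators of the integral I and variance V

# Example: I = ∫₀^π sin(x) dx = 2, sampling uniformly on [0, π] (rho = 1/π)
I, V = importance_estimate(
    f=np.sin,
    rho=lambda x: np.full_like(x, 1.0 / np.pi),
    sample_rho=lambda n, rng: rng.uniform(0.0, np.pi, n))
print(I, V)   # I ≈ 2; the statistical error on I is sqrt(V / n)
```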
Importance Sampling: II
Minimise the variance subject to the normalisation constraint N = 1, with Lagrange multiplier λ:
δ(V + λN)/δρ(y) = −(f(y)/ρ(y))² + λ = 0
Optimal measure: ρ_opt(x) = |f(x)| / ∫ dx |f(x)|
Example: f(x) = sin(x) on [0, 2π]
Optimal weight: ρ_opt(x) = |sin(x)|/4
Optimal variance: V_opt = (∫₀^{2π} dx |sin(x)|)² − (∫₀^{2π} dx sin(x))² = 16
Importance Sampling: IV
Example: ∫₀^{2π} dx sin(x)
1. Construct a cheap approximation to |sin(x)|
2. Calculate the relative area within each rectangle
3. Choose a random number uniformly
4. Select the corresponding rectangle
5. Choose another random number uniformly
6. Select the corresponding x value within the rectangle
(A sketch of this two-level sampling scheme follows.)
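A sketch of the six steps in Python (numpy only; the choice of 100 equal-width rectangles mirrors the slide, everything else is an assumption of the example):

```python
import numpy as np

def make_rectangle_sampler(g, a, b, n_rect=100):
    """Sample from the density proportional to a piecewise-constant
    (rectangle) approximation of g on [a, b]."""
    edges = np.linspace(a, b, n_rect + 1)
    mid = 0.5 * (edges[:-1] + edges[1:])
    heights = g(mid)                          # 1. cheap approximation to g
    areas = heights * np.diff(edges)          # 2. relative area of each rectangle
    cdf = np.cumsum(areas / areas.sum())

    def sample(n, rng):
        k = np.searchsorted(cdf, rng.uniform(size=n))   # 3+4. pick a rectangle
        k = np.minimum(k, n_rect - 1)                   # guard against fp edge
        x = edges[k] + rng.uniform(size=n) * np.diff(edges)[k]  # 5+6. point inside
        return x, heights[k] / areas.sum()    # x and its sampling density
    return sample

rng = np.random.default_rng(1)
sample = make_rectangle_sampler(lambda x: np.abs(np.sin(x)), 0.0, 2 * np.pi)
x, rho = sample(100_000, rng)                 # x is distributed ≈ |sin(x)|/4
```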
Importance Sampling: V
Sampling from ρ_opt ∝ |sin(x)| with 100 rectangles we have V = 16.02328561
But we can do better! Split the integral into its positive and negative parts,
I = ∫₀^{2π} dx sin(x) θ(sin(x)) + ∫₀^{2π} dx sin(x) θ(−sin(x))
and sample each part from its own optimal weight, for which V_opt = 0
With 100 rectangles we have V = 0.011642808
Markov Chains: I
State space
Ω
(Ergodic) stochastic transitions
P’: Ω → Ω
Deterministic evolution of the probability distribution P: Q → Q
The distribution converges to a unique fixed point Q
The sequence Q, PQ, P²Q, P³Q, … is Cauchy
Define a metric on the space of (equivalence classes of) probability distributions:
d(Q₁, Q₂) ≡ ∫ dx |Q₁(x) − Q₂(x)|
Prove that d(PQ₁, PQ₂) ≤ (1 − α) d(Q₁, Q₂) with α > 0, so that the Markov process P is a contraction mapping
The space of probability distributions is complete, so the sequence converges to a unique fixed point Q = lim_{n→∞} PⁿQ
Sketch of the proof, writing ΔQ(y) ≡ Q₁(y) − Q₂(y) and θ for the step function:
d(PQ₁, PQ₂) = ∫ dx |PQ₁(x) − PQ₂(x)| = ∫ dx |∫ dy P(x ← y) Q₁(y) − ∫ dy P(x ← y) Q₂(y)|
= ∫ dx |∫ dy P(x ← y) ΔQ(y) [θ(ΔQ(y)) + θ(−ΔQ(y))]|   (using θ(y) + θ(−y) = 1)
= ∫ dx { ∫ dy P(x ← y) |ΔQ(y)| − 2 min_± ∫ dy P(x ← y) |ΔQ(y)| θ(±ΔQ(y)) }   (using |a − b| = a + b − 2 min(a, b))
≤ ∫ dy |ΔQ(y)| − 2 ∫ dx inf_y P(x ← y) min_± ∫ dy |ΔQ(y)| θ(±ΔQ(y))   (using ∫ dx P(x ← y) = 1)
= (1 − α) d(Q₁, Q₂)
Here ∫ dy ΔQ(y) = ∫ dy Q₁(y) − ∫ dy Q₂(y) = 1 − 1 = 0 implies ∫ dy |ΔQ(y)| θ(±ΔQ(y)) = ½ ∫ dy |ΔQ(y)|, and α ≡ ∫ dx inf_y P(x ← y) > 0
Markov Chains: II
Use Markov chains to sample from Q
Suppose we can construct an ergodic Markov process P which has distribution Q as its fixed point
Start with an arbitrary state (“field configuration”)
Iterate the Markov process until it has converged (“thermalized”)
Thereafter, successive configurations will be distributed according to Q
But in general they will be correlated
To construct P we only need relative probabilities of states
Do not know the normalisation of Q
Cannot use Markov chains to compute integrals directly
We can compute ratios of integrals
Metropolis algorithm:
P(x ← y) = min(1, Q(x)/Q(y))
Markov Chains: III
How do we construct a Markov process with a specified fixed point Q(x)?
Integrate w.r.t. y to obtain the fixed point condition:
Q(x) = ∫ dy P(x ← y) Q(y)
Detailed balance:
P(y ← x) Q(x) = P(x ← y) Q(y)
Sufficient but not necessary for a fixed point
Consider the cases Q(x) > Q(y) and Q(x) < Q(y) separately to verify that the Metropolis choice satisfies detailed balance
Other choices are possible, e.g., P(x ← y) = Q(x)/[Q(x) + Q(y)]
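For concreteness, a minimal Metropolis sketch for a one-dimensional toy distribution (the symmetric Gaussian proposal and the quartic action are illustrative assumptions; only relative probabilities of Q enter, through log Q):

```python
import numpy as np

def metropolis(logQ, y0, n_steps, step=0.5, seed=0):
    """Markov chain with fixed point Q, known only up to normalisation."""
    rng = np.random.default_rng(seed)
    y, chain = y0, np.empty(n_steps)
    for t in range(n_steps):
        x = y + step * rng.normal()                    # symmetric proposal
        if np.log(rng.uniform()) < logQ(x) - logQ(y):  # accept w.p. min(1, Q(x)/Q(y))
            y = x
        chain[t] = y                                   # keep old state on rejection
    return chain

# Toy example: Q(q) ∝ exp(-S(q)) with S(q) = q⁴ - 2q²; Z is never needed
chain = metropolis(lambda q: -(q**4 - 2 * q**2), y0=0.0, n_steps=50_000)
```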
Markov Chains: IV
Composition of Markov steps
Let P1 and P2 be two Markov steps which have the desired fixed point distribution
They need not be ergodic
Then the composition of the two steps, P2 ∘ P1, will also have the desired fixed point
And it may be ergodic
This trivially generalises to any (fixed) number of steps
For the case where P1 is not ergodic but (P1)ⁿ is, the terminology …
Markov Chains: V
This result justifies “sweeping” through a
lattice performing single site updates
Each individual single site update has the desired fixed point because it satisfies detailed balance
The entire sweep therefore has the desired fixed point, and is ergodic
But the entire sweep does not satisfy detailed balance
Of course it would satisfy detailed balance if the sites were updated in a random order
But this is not necessary
Markov Chains: VI
Coupling from the Past (Propp and Wilson, 1996)
Use a fixed set of random numbers
Flypaper principle: if states coalesce they stay together forever
Eventually, all states coalesce to some state α with probability one
Any state from t = −∞ will coalesce to α
Markov Chains: VII
Suppose we have a lattice, that is, a partial ordering with a largest and a smallest element, and an update step that preserves the ordering
Then once the largest and smallest states have coalesced all the others must have too
[Figure: Hasse diagram of a partial order on states a-g]
Markov Chains: VIII
A non-trivial example of where this is possible is the ferromagnetic Ising model
H = β Σ_{⟨ij⟩} s_i s_j with β ≥ 0
Ordering is on spin configurations: A ≥ B iff A_s ≥ B_s for every site s
Update is a local heatbath
Autocorrelations: I
Exponential autocorrelations
The unique fixed point of an ergodic Markov process corresponds to the unique eigenvector with eigenvalue 1
All its other eigenvalues must lie within the unit circle
In particular, the largest subleading eigenvalue satisfies |λ_max| < 1
This corresponds to the exponential autocorrelation time N_exp ≡ −1/ln|λ_max| > 0
Autocorrelations: II
Integrated autocorrelations
Consider the autocorrelation of some operator Ω; without loss of generality we may assume ⟨Ω⟩ = 0
⟨Ω̄²⟩ = (1/N²) Σ_{t,t′=1}^{N} ⟨Ω(φ_t) Ω(φ_{t′})⟩ = (1/N²) { Σ_{t=1}^{N} ⟨Ω(φ_t)²⟩ + 2 Σ_{t=1}^{N−1} Σ_{t′=t+1}^{N} ⟨Ω(φ_t) Ω(φ_{t′})⟩ }
Define the autocorrelation function C_Ω(ℓ) ≡ ⟨Ω(φ_{t+ℓ}) Ω(φ_t)⟩ / ⟨Ω²⟩; then
⟨Ω̄²⟩ = (⟨Ω²⟩/N) { 1 + 2 Σ_{ℓ=1}^{N−1} (1 − ℓ/N) C_Ω(ℓ) }
Autocorrelations: III
The autocorrelation function must fall faster than the exponential autocorrelation: C_Ω(ℓ) ≤ |λ_max|^ℓ = e^{−ℓ/N_exp}
For a sufficiently large number of samples
⟨Ω̄²⟩ = (⟨Ω²⟩/N) { 1 + 2 Σ_{ℓ=1}^{∞} C_Ω(ℓ) } [1 + O(N_exp/N)] = (⟨Ω²⟩/N) (1 + 2A_Ω) [1 + O(N_exp/N)]
where the integrated autocorrelation is A_Ω ≡ Σ_{ℓ=1}^{∞} C_Ω(ℓ)
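In practice A_Ω is estimated from the measured time series with some window on the lag sum; a minimal numpy sketch (the fixed truncation window below is a crude assumption; real analyses choose it self-consistently):

```python
import numpy as np

def integrated_autocorrelation(omega, max_lag=None):
    """Estimate A = Σ_{l>=1} C(l) from a series of measurements omega."""
    w = omega - omega.mean()
    var = np.dot(w, w) / len(w)
    max_lag = max_lag or len(w) // 50          # crude truncation window
    A = 0.0
    for l in range(1, max_lag):
        A += np.dot(w[:-l], w[l:]) / (len(w) - l) / var   # estimator of C(l)
    return A

# The naive error estimate must be inflated by the factor (1 + 2A):
# var_mean ≈ omega.var() / len(omega) * (1 + 2 * integrated_autocorrelation(omega))
```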
l lIn order to carry out Monte Carlo computations
including the effects of dynamical fermions we
would like to find an algorithm which
Update the fields globally
Because single link updates are not cheap if the action is not local
Take large steps through configuration space
Because small-step methods carry out a random walk which leads to critical slowing down with a dynamical critical exponent z=2
Hybrid Monte Carlo: I
Does not introduce any systematic errors
z relates the autocorrelation to the correlation length of the system, AΩ ∝ ξz
A useful class of algorithms with these properties is
the (Generalised) Hybrid Monte Carlo (HMC) method
Introduce a “fictitious” momentum p corresponding to each dynamical degree of freedom q
Find a Markov chain with fixed point ∝ exp[−H(q,p)] where H is the “fictitious” Hamiltonian ½p² + S(q)
The action S of the underlying QFT plays the rôle of the potential in the “fictitious” classical mechanical system
This gives the evolution of the system in a fifth dimension, “fictitious” or computer time
This generates the desired distribution exp[-S(q) ] if we ignore the momenta p (i.e., the marginal distribution)
The HMC Markov chain alternates two Markov
steps
Molecular Dynamics Monte Carlo (MDMC)
(Partial) Momentum Refreshment
Both have the desired fixed point
Together they are ergodic
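Putting the pieces together for a toy scalar model, here is a hedged sketch of the full HMC update (leapfrog MD plus Metropolis accept/reject, with complete momentum refreshment, i.e. θ = π/2; the quadratic action is only an example):

```python
import numpy as np

def hmc(S, grad_S, q, n_traj=1000, n_md=20, dt=0.1, seed=0):
    """Hybrid Monte Carlo with fixed point ∝ exp(-S(q)), H = ½p² + S(q)."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_traj):
        p = rng.normal(size=q.shape)            # momentum heatbath (refreshment)
        H_old = 0.5 * p @ p + S(q)
        qn, pn = q.copy(), p.copy()
        pn -= 0.5 * dt * grad_S(qn)             # leapfrog: half step in p
        for _ in range(n_md - 1):
            qn += dt * pn                       # full step in q
            pn -= dt * grad_S(qn)               # full step in p
        qn += dt * pn
        pn -= 0.5 * dt * grad_S(qn)             # final half step in p
        dH = 0.5 * pn @ pn + S(qn) - H_old
        if np.log(rng.uniform()) < -dH:         # MDMC accept with min(1, e^{-δH})
            q = qn                              # this makes the algorithm exact
        samples.append(q.copy())
    return np.array(samples)

samples = hmc(S=lambda q: 0.5 * q @ q, grad_S=lambda q: q, q=np.zeros(4))
```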
Hybrid Monte Carlo: III
MDMC: I
If we could integrate Hamilton’s equations
exactly we could follow a trajectory of
constant fictitious energy
This corresponds to a set of equiprobable fictitious phase space configurations
Liouville’s theorem tells us that this also preserves the functional integral measure dp ∧ dq as required
Any approximate integration scheme which is reversible and area preserving may be used to suggest configurations to a Metropolis accept/reject test
MDMC: II
We build the MDMC step out of three parts:
Molecular Dynamics (MD), an approximate integrator U(τ): (q, p) ↦ (q′, p′) which is exactly
Area preserving: det[∂(q′, p′)/∂(q, p)] = 1
Reversible: F ∘ U(τ) ∘ F ∘ U(τ) = 1
A momentum flip F: p ↦ −p
A Metropolis accept/reject step
The composition of these gives
(q′, p′) = [F ∘ U(δτ)](q, p) θ(e^{−δH} − y) + (q, p) θ(y − e^{−δH})
where y is uniformly distributed on [0, 1]
Partial Momentum Refreshment
This mixes the Gaussian distributed momenta p with Gaussian noise ξ:
(p′, ξ′) = F ∘ (p cos θ + ξ sin θ, −p sin θ + ξ cos θ)
The Gaussian distribution of p is invariant under F
The extra momentum flip F ensures that for small θ the momenta are reversed after a rejection rather than after an acceptance
For θ = π/2 all momentum flips are irrelevant
If A and B belong to any (non-commutative) algebra then e^A e^B = e^{A+B+δ}, where δ is constructed from commutators of A and B (i.e., it lies in the Free Lie Algebra generated by {A, B})
Symplectic Integrators: I
Baker-Campbell-Hausdorff (BCH) formula
More precisely, ln(e^A e^B) = Σ_{n≥1} c_n, where c₁ = A + B and each higher term c_n is a combination of nested commutators of A and B, given recursively with coefficients involving the Bernoulli numbers B_{2m}
Symplectic Integrators: II
Explicitly, the first few terms are
ln(e^A e^B) = A + B + ½[A,B] + (1/12)([A,[A,B]] + [B,[B,A]]) − (1/24)[B,[A,[A,B]]] + (degree-5 nested commutators with coefficient 1/720) + ⋯
In order to construct reversible integrators we use symmetric symplectic integrators, for which
ln(e^{A/2} e^B e^{A/2}) = A + B − (1/24)([A,[A,B]] + 2[B,[A,B]]) + (degree-5 nested commutators with coefficient 1/5760) + ⋯
Symplectic Integrators: III
We are interested in finding the classical trajectory in phase space of a system described by the Hamiltonian
H(q, p) = T(p) + S(q) = ½p² + S(q)
The basic idea of such a symplectic integrator is to write the time evolution operator as
exp(τ d/dt) = exp(τ [(dq/dt) ∂/∂q + (dp/dt) ∂/∂p])
= exp(τ [(∂H/∂p) ∂/∂q − (∂H/∂q) ∂/∂p]) ≡ e^{τĤ}
= exp(τ [T′(p) ∂/∂q − S′(q) ∂/∂p])
Symplectic Integrators: IV
Define P ≡ −S′(q) ∂/∂p and Q ≡ T′(p) ∂/∂q, so that Ĥ = P + Q
Since the kinetic energy T is a function only of p and the potential energy S is a function only of q, the action of e^{τP} and e^{τQ} may be evaluated trivially:
e^{τQ}: f(q, p) ↦ f(q + τT′(p), p)
e^{τP}: f(q, p) ↦ f(q, p − τS′(q))
Symplectic Integrators: V
From the BCH formula we find that the PQP symmetric symplectic integrator is given by
U₀(δτ)^{τ/δτ} = [e^{½δτP} e^{δτQ} e^{½δτP}]^{τ/δτ}
= exp[τ(P + Q) − (τδτ²/24)([P,[P,Q]] + 2[Q,[P,Q]]) + O(δτ⁴)]
= e^{τĤ′} = e^{τ(P+Q)} + O(δτ²)
In addition to conserving energy to O(δτ²), such symmetric symplectic integrators are manifestly area preserving and reversible
Symplectic Integrators: VI
For each symplectic integrator there exists a
Hamiltonian
H’
which is
exactly conserved
This is given by substituting Poisson brackets for commutators in the BCH formula
H′ = T + S − (δτ²/24)({S,{S,T}} + 2{T,{S,T}}) + (δτ⁴/5760)(combinations of degree-5 nested Poisson brackets of S and T, with integer coefficients 7, 8, 12, 16, 28, 32) + O(δτ⁶)
Symplectic Integrators: VII
Poisson bracket: {A, B} ≡ (∂A/∂p)(∂B/∂q) − (∂A/∂q)(∂B/∂p)
Poisson brackets form a Lie algebra
Homework exercise: verify the Jacobi identity {{A,B},C} + {{B,C},A} + {{C,A},B} = 0
Symplectic Integrators: VIII
The explicit result for H’ is
H′ = H + (δτ²/24)(2p²S″ − S′²) + (δτ⁴/720)(terms involving p⁴S⁗, p²S″², p²S′S‴, and S′²S″) + O(δτ⁶)
Note that H′ cannot be written as the sum of a p-dependent kinetic term and a q-dependent potential term
As H′ is conserved, δH is of O(δτ²) for trajectories of arbitrary length
Symplectic Integrators: IX
Multiple timescales (Sexton-Weingarten)
Split the Hamiltonian into pieces: H(q, p) = T(p) + S₁(q) + S₂(q)
Define Pᵢ ≡ −Sᵢ′(q) ∂/∂p and Q ≡ T′(p) ∂/∂q, so that Ĥ = P₁ + P₂ + Q
Introduce a symmetric symplectic integrator of the form (schematically)
U_SW(δτ)^{τ/δτ} = ( e^{½δτP₁} [ e^{(δτ/2n)P₂} e^{(δτ/n)Q} e^{(δτ/2n)P₂} ]ⁿ e^{½δτP₁} )^{τ/δτ}
If the sub-step sizes are chosen so that each force contributes comparably, the instability in the integrator is tickled equally by each sub-step
This helps if the most expensive force computation does not correspond to the largest force
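A sketch of the nested update in Python (the two-term split of the action is an assumption of the example; the expensive force S₁ is evaluated once per coarse step, while the cheap force S₂ is integrated with n substeps):

```python
def nested_leapfrog(q, p, dt, n, grad_S1, grad_S2, n_steps):
    """Coarse PQP steps in S1; each inner update is itself a fine PQP
    integrator for T(p) + S2(q) with step dt/n."""
    for _ in range(n_steps):
        p -= 0.5 * dt * grad_S1(q)           # expensive force, coarse timescale
        for _ in range(n):                   # cheap force, fine timescale
            p -= 0.5 * (dt / n) * grad_S2(q)
            q += (dt / n) * p
            p -= 0.5 * (dt / n) * grad_S2(q)
        p -= 0.5 * dt * grad_S1(q)
    return q, p
```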
Dynamical fermions: I
Pseudofermions
Direct simulation of Grassmann fields is not feasible
The problem is not that of manipulating anticommuting values in a computer
It is that e^{−S_F} = e^{−ψ̄Mψ} is not positive, and thus we get poor importance sampling
We therefore integrate out the fermion fields to obtain the fermion determinant
∫ dψ̄ dψ e^{−ψ̄Mψ} ∝ det M
(ψ̄ and ψ always occur quadratically)
Dynamical fermions: II
Any operator Ω can be expressed solely in terms of the bosonic fields:
⟨Ω(ψ̄, ψ, φ)⟩ = ⟨Ω′(δ/δη, δ/δη̄, φ) e^{η̄ M⁻¹(φ) η}⟩|_{η = η̄ = 0}
Use Schwinger's source method and integrate out the fermions
E.g., the fermion propagator is G(x, y) = ⟨ψ(x) ψ̄(y)⟩ = M⁻¹(x, y)
Dynamical fermions: III
Represent the fermion determinant as a bosonic Gaussian integral with a non-local kernel:
det M(φ) ∝ ∫ dχ̄ dχ e^{−χ̄ M⁻¹(φ) χ}
The fermion kernel must be positive definite (all its eigenvalues must have positive real parts), otherwise the bosonic integral will not converge
Including the determinant as part of the observable to be measured is not feasible:
⟨Ω⟩ = ⟨Ω(φ) det M(φ)⟩_{S_B} / ⟨det M(φ)⟩_{S_B}
The determinant is extensive in the lattice volume, thus again we get poor importance sampling
Dynamical fermions: IV
It is usually convenient to introduce two flavours of fermion and to write
(det M(φ))² = det(M†(φ) M(φ)) ∝ ∫ dχ† dχ e^{−χ† (M†M)⁻¹ χ}
This not only guarantees positivity, but also allows us to generate the pseudofermions from a global heatbath by applying M† to a Gaussian distributed random field
The evaluation of the pseudofermion action and the corresponding force then requires us to find the solution of a (large) set of linear equations, ψ = (M†M)⁻¹ χ
The equations of motion for the boson (gauge) fields are
φ̇ = π
π̇ = −∂S_B/∂φ − χ† [∂(M†M)⁻¹/∂φ] χ = −∂S_B/∂φ + χ† (M†M)⁻¹ [∂(M†M)/∂φ] (M†M)⁻¹ χ
1 ? ? ٛ B B S S M M M M M M M M φ π φ φ π χ χ χ χ φ φ φ φ − − − = ∂ ∂ ∂ ⎡ ⎤ ∂ ⎡ ⎤ = − − = − + ⎢ ⎥ ⎢ ⎥ ⎣ ⎦ ⎣ ⎦ ∂ ∂ ∂ ∂ & &Dynamical fermions: V
It is not necessary to carry out the inversions required for the equations of motion exactly
There is a trade-off between the cost of computing the force and the acceptance rate of the Metropolis MDMC step
The inversions required to compute the pseudofermion action for the accept/reject step do need to be computed exactly, however
We usually take "exactly" to be synonymous with "to machine precision"
Reversibility: I
Are HMC trajectories reversible and area
preserving in practice?
The only fundamental source of irreversibility is the rounding error caused by using finite precision floating point arithmetic
For fermionic systems we can also introduce irreversibility by
choosing the starting vector for the iterative linear equation solver time-asymmetrically
We do this if we use a Chronological Inverter, which takes (some extrapolation of) the previous solution as the starting vector
Floating point arithmetic is not associative
It is more natural to store compact variables as scaled integers (fixed point)
Saves memory
Reversibility: II
Data for SU(3) gauge theory and QCD with heavy quarks show that rounding errors are amplified exponentially
The underlying continuous time equations of motion are chaotic
Ляпунов exponents
characterise the divergence of nearby trajectories
The instability in the integrator occurs when δH » 1
Reversibility: III
In QCD the Ляпунов exponents appear to scale with β as the system approaches the continuum limit β → ∞
νξ = constant
This can be interpreted as saying that the Ляпунов exponent characterises the
chaotic nature of the continuum classical equations of motion, and is not a lattice artefact
Therefore we should not have to worry about reversibility breaking down as we approach the continuum limit
Caveat: data is only for small lattices, and …
Polynomial approximation
What is the best polynomial approximation p(x) to a continuous function f(x) for x in [0, 1]?
Best with respect to the appropriate norm, where n ≥ 1:
‖p − f‖_n = ( ∫₀¹ dx |p(x) − f(x)|ⁿ )^{1/n}
Weierstraß’ theorem
Weierstraß: Any continuous function can be arbitrarily well approximated by a polynomial
Taking n → ∞ this is the minimax norm
‖p − f‖_∞ = min_p max_{0≤x≤1} |p(x) − f(x)|
Бернштейне polynomials
The explicit solution is provided by the Бернштейне polynomials
p_n(x) ≡ Σ_{k=0}^{n} f(k/n) (n choose k) x^k (1 − x)^{n−k}
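A small numerical illustration (standard library plus numpy; f = sin on [0, 1] is an arbitrary test function):

```python
import numpy as np
from math import comb

def bernstein(f, n, x):
    """Evaluate the degree-n Бернштейне approximation of f at points x."""
    k = np.arange(n + 1)
    binom = np.array([comb(n, j) for j in k], dtype=float)
    basis = binom * np.power.outer(x, k) * np.power.outer(1.0 - x, n - k)
    return basis @ f(k / n)

x = np.linspace(0.0, 1.0, 201)
for n in (4, 16, 64):
    print(n, np.max(np.abs(bernstein(np.sin, n, x) - np.sin(x))))
# converges for every continuous f, but slowly: it is not the minimax polynomial
```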
Чебышев’s theorem
Чебышев: There is always a unique polynomial of any degree d which minimises
‖p − f‖_∞ = max_{0≤x≤1} |p(x) − f(x)|
The error |p(x) − f(x)| reaches its maximum at exactly d + 2 points on the unit interval
Чебышев’s theorem: Necessity
Suppose p − f has less than d + 2 extrema of equal magnitude
Then at most d + 1 maxima exceed some magnitude
This defines a "gap"
We can construct a polynomial q of degree d which has the opposite sign to p − f at each of these maxima (Lagrange interpolation), and whose magnitude is smaller than the "gap"
The polynomial p + q is then a better approximation than p to f
Чебышев’s theorem: Sufficiency
Suppose there is a polynomial p′ with ‖p′ − f‖_∞ ≤ ‖p − f‖_∞
Then at each of the d + 2 extrema, |p′(xᵢ) − f(xᵢ)| ≤ |p(xᵢ) − f(xᵢ)|
Therefore p′ − p must have d + 1 zeros on the unit interval
Thus p′ − p = 0, as it is a polynomial of degree d
Чебышев polynomials
The best approximation of degree d − 1 over [−1, 1] to x^d is
p_{d−1}(x) ≡ x^d − 2^{1−d} T_d(x)
where the Чебышев polynomials are T_d(x) = cos(d cos⁻¹(x))
(The notation T is an old transliteration of Чебышев!)
The error is
‖x^d − p_{d−1}(x)‖_∞ = 2^{1−d} ‖T_d‖_∞ = 2^{1−d} = e^{−(d−1) ln 2}
Convergence is often exponential in d
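A quick numerical check of this error formula (numpy; T_d evaluated directly as cos(d cos⁻¹ x)):

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 100_001)
for d in (3, 5, 8):
    T_d = np.cos(d * np.arccos(x))
    p = x**d - 2.0**(1 - d) * T_d        # best degree-(d-1) approximation to x^d
    print(d, np.max(np.abs(x**d - p)), 2.0**(1 - d))   # the two errors agree
```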
Чебышев rational functions
Чебышев’s theorem is easily extended to rational approximations
Rational functions with nearly equal degree numerator and denominator are usually best
Rational functions usually give a much better approximation than polynomials
Convergence is still often exponential
A simple (but somewhat slow) numerical algorithm for finding the optimal Чебышев rational approximation was given by Ремез
Чебышев rationals: Example
A realistic example of a rational approximation (in this case to 1/√x) is
1/√x ≈ 0.3904603901 (x + 2.3475661045)(x + 0.1048344600)(x + 0.0073063814) / [(x + 0.4105999719)(x + 0.0286165446)(x + 0.0012779193)]
The partial fraction expansion of the rational function above is
1/√x ≈ 0.3904603901 + 0.0511093775/(x + 0.0012779193) + 0.1408286237/(x + 0.0286165446) + 0.5964845033/(x + 0.4105999719)
Using a partial fraction expansion of such rational functions allows us to use a multishift linear equation solver, thus reducing the cost significantly.
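The two forms above are easy to check against each other and against 1/√x numerically (the test points are arbitrary; the approximation is only good over a limited spectral range):

```python
import numpy as np

shifts = np.array([0.0012779193, 0.0286165446, 0.4105999719])
residues = np.array([0.0511093775, 0.1408286237, 0.5964845033])
zeros = np.array([2.3475661045, 0.1048344600, 0.0073063814])

def r_product(x):
    x = np.asarray(x)[:, None]
    return 0.3904603901 * np.prod(x + zeros, axis=1) / np.prod(x + shifts, axis=1)

def r_partial(x):
    x = np.asarray(x)[:, None]
    return 0.3904603901 + np.sum(residues / (x + shifts), axis=1)

x = np.geomspace(0.01, 1.0, 7)
print(np.max(np.abs(r_product(x) - r_partial(x))))       # identical to roundoff
print(np.max(np.abs(r_partial(x) * np.sqrt(x) - 1.0)))   # small relative error
```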
Polynomials versus rationals
The optimal L² approximation to sgn(x) with weight 1/√(1 − x²) is
(4/π) Σ_{j=0}^{n} (−1)^j T_{2j+1}(x) / (2j + 1)
The optimal L∞ approximation cannot be too much better (or it would lead to a better L² approximation)
Золотарев’s formula has L∞ error which falls exponentially in n (with a rate controlled by ln(1/ε), where ε bounds the spectrum away from zero)
Non-linearity of CG solver
Suppose we want to solve A²x = b for Hermitian A by CG
It is better to solve Ax = y, Ay = b successively
Condition number κ(A²) = κ(A)²
Cost is thus 2κ(A) < κ(A)² in general
Suppose we want to solve Ax = b
Why don't we solve A^{1/2}x = y, A^{1/2}y = b successively?
The square root of A is uniquely defined if A > 0
This is the case for fermion kernels
All this generalises trivially to nth roots
No tuning needed to split the condition number evenly
Rational matrix approximation
Functions on matrices: defined for a Hermitian matrix by diagonalisation
H = U D U⁻¹
f(H) = f(U D U⁻¹) = U f(D) U⁻¹
Rational functions do not require diagonalisation:
αH^m + βH^n = U (αD^m + βD^n) U⁻¹
H⁻¹ = U D⁻¹ U⁻¹
Rational functions have nice properties: they are (relatively) cheap to apply
No Free Lunch Theorem
We must apply the rational approximation M^{1/n} ≈ r(M) with each CG iteration
The condition number for each term in the partial fraction expansion is approximately κ(M)
So the cost of applying M^{1/n} is proportional to κ(M)
Even though the condition number κ(M^{1/n}) = κ(M)^{1/n}
And even though κ(r(M)) = κ(M)^{1/n}
Pseudofermions
We want to evaluate a functional integral including the fermionic determinant det M
We write this as a bosonic functional integral over a pseudofermion field with kernel M⁻¹:
det M ∝ ∫ dφ* dφ e^{−φ* M⁻¹ φ}
Multipseudofermions
We are introducing extra noise into the system by using a single pseudofermion field to sample this functional integral
This noise manifests itself as fluctuations in the force exerted by the pseudofermions on the gauge fields
This increases the maximum fermion force
This triggers the integrator instability
This requires decreasing the integration step size
A better strategy is to use n pseudofermion fields, each with a milder kernel:
det M = (det M^{1/n})ⁿ ∝ ∏_{j=1}^{n} ∫ dφⱼ* dφⱼ e^{−φⱼ* M^{−1/n} φⱼ}
Hasenbusch’s method
Clever idea due to Hasenbusch:
Start with the Wilson fermion action M = 1 − κH
Introduce the quantity M′ = 1 − κ′H
Use the identity M = M′(M′⁻¹M)
Write the fermion determinant as det M = det M′ det(M′⁻¹M)
Introduce separate pseudofermions for each determinant
Adjust κ′ to minimise the cost
Easily generalises
More than two pseudofermions
Wilson-clover action
Violation of NFL Theorem
So let's try using our nth root trick to implement multipseudofermions
Condition number κ(r(M)) = κ(M)^{1/n}
So the maximum force is reduced by a factor of n κ(M)^{(1/n)−1}
This is a good approximation if the condition number is dominated by a few isolated tiny eigenvalues
This is so in the case of interest
Cost reduced by a factor of n κ(M)^{(1/n)−1}
Optimal value n_opt ≈ ln κ(M)
So the optimal cost reduction is (e ln κ)/κ
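For a feel of the sizes involved, a small numerical check (κ = 10⁴ is an arbitrary illustrative condition number):

```python
import numpy as np

kappa = 1.0e4
n = np.arange(1, 16)
factor = n * kappa**(1.0 / n - 1.0)      # multiplicative cost factor
print(n[np.argmin(factor)], np.log(kappa))          # optimum sits near ln κ
print(factor.min(), np.e * np.log(kappa) / kappa)   # ≈ (e ln κ)/κ
```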
Rational Hybrid Monte Carlo: I
RHMC algorithm for the fermionic kernel (M†M)^{1/2n}
Generate the pseudofermion from a Gaussian heatbath, P(ξ) ∝ e^{−½ξ†ξ}, via χ = (M†M)^{1/4n} ξ, so that
P(χ) ∝ ∫ dξ e^{−½ξ†ξ} δ(χ − (M†M)^{1/4n} ξ) ∝ e^{−½ χ† (M†M)^{−1/2n} χ}
Use an accurate rational approximation r(x) ≈ x^{1/4n} for the heatbath and the accept/reject step
Use a less accurate approximation r̃(x) ≈ x^{−1/2n} for the MD evolution
r̃ is a direct approximation to x^{−1/2n}, not the square of an approximation to x^{−1/4n}, so there are no double poles
Rational Hybrid Monte Carlo: II
Reminders:
Apply rational approximations using their partial fraction expansions
Denominators are all just shifts of the original fermion kernel
All poles of optimal rational approximations are real and positive for the cases of interest (Miracle #1)
Only simple poles appear (by construction!)
Use a multishift solver to invert all the partial fractions using a single Krylov space
Cost is dominated by the Krylov space construction, at least for O(20) shifts
Result is numerically stable, even in 32-bit precision
All partial fractions have positive coefficients (Miracle #2)
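A dense-matrix sketch of applying r(A) through its partial fractions (plain numpy solves stand in for the single multishift Krylov solve; the coefficients are the illustrative 1/√x ones quoted earlier, so the spectrum is arranged to lie in their range of validity):

```python
import numpy as np

def apply_rational(A, v, alpha0, alphas, betas):
    """r(A) v = alpha0 v + Σ_k alphas[k] (A + betas[k])⁻¹ v."""
    out = alpha0 * v
    I = np.eye(A.shape[0])
    for a, b in zip(alphas, betas):
        out += a * np.linalg.solve(A + b * I, v)   # one multishift CG in practice
    return out

rng = np.random.default_rng(2)
lam = rng.uniform(0.01, 1.0, size=12)              # eigenvalues in the valid range
Q, _ = np.linalg.qr(rng.normal(size=(12, 12)))
A = Q @ np.diag(lam) @ Q.T                         # SPD test matrix
v = rng.normal(size=12)

w = apply_rational(A, v, 0.3904603901,
                   [0.0511093775, 0.1408286237, 0.5964845033],
                   [0.0012779193, 0.0286165446, 0.4105999719])
exact = Q @ ((1.0 / np.sqrt(lam)) * (Q.T @ v))     # A^{-1/2} v by diagonalisation
print(np.linalg.norm(w - exact) / np.linalg.norm(exact))
```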
Comparison with R algorithm: I
Binder cumulant of the chiral condensate, B₄, and RHMC acceptance rate A from a finite temperature study (2+1 flavour naïve staggered fermions, Wilson gauge action, V = 8³×4, m_ud = 0.0076, …)

Algorithm   δt       B₄        A
R           0.0019   1.56(5)   -
R           0.0038   1.73(4)   -
RHMC        0.055    1.57(2)   84%
Comparison with R algorithm: II
Algorithm   m_ud   m_s    δt      A       P
R           0.04   0.04   0.01    -       0.60812(2)
R           0.02   0.04   0.01    -       0.60829(1)
R           0.02   0.04   0.005   -       0.60817
RHMC        0.04   0.04   0.02    65.5%   0.60779(1)

The different masses at which domain wall results were gathered, together with the step-sizes δt, acceptance rates A, and plaquettes P (V = 16³×32×8, DBW2 gauge action, β = 0.72)
[Figure: the step-size variation of the plaquette with m_ud = 0.02]
Comparison with R algorithm: III
[Figure: the integrated autocorrelation time of the 13th time-slice of the pion propagator from the domain wall test, with m_ud = 0.04]
Multipseudofermions with multiple timescales
Semiempirical observation: the largest force from a single pseudofermion does not come from the smallest shift
For example, look at the numerators (residues) in the partial fraction expansion we exhibited earlier:
1/√x ≈ 0.3904603901 + 0.0511093775/(x + 0.0012779193) + 0.1408286237/(x + 0.0286165446) + 0.5964845033/(x + 0.4105999719)
Make use of this by using a coarser timescale for the more expensive smaller shifts
[Chart: residue α, L² force α/(β + 0.125), and CG iterations versus shift ln β]
L² versus L∞ force norms
Wilson fermion forces (from Urbach et al.)
Berlin Wall for Wilson fermions
HMC results from C Urbach, K Jansen, A Schindler, and U Wenger, hep-lat/0506011 and hep-lat/0510064
Comparable performance to Lüscher's SAP algorithm
RHMC?
Conclusions (RHMC)
Advantages of RHMC
Exact
No step-size errors; no step-size extrapolations
Significantly cheaper than the R algorithm
Allows easy implementation of Hasenbusch (multipseudofermion) acceleration
Further improvements possible
Such as multiple timescales for different terms in the partial fraction expansion
Disadvantages of RHMC
QCD Machines: I
We want a cost-effective computer to solve
interesting scientific problems
In fact we wanted a computer to solve lattice QCD
But it turned out that there is almost nothing that was not applicable to many other problems too
Not necessary to solve all problems with one architecture
Demise of the general-purpose computer?
Development cost « hardware cost for one large machine
Simple OS and software model
QCD Machines: II
Take advantage of mass market components
It is not cost- or time-effective to compete with PC market in designing custom chips
Use standard software and tools whenever possible
Do not expect compilers or optimisers to do anything particularly smart
Parallelism has to be built into algorithms and programs from the start
Hand code critical kernels in assembler
QCD Machines: III
Parallel applications
Many real-world applications are intrinsically parallel
Because they are approximations to continuous systems
Lattice QCD is a good example
The lattice is a discretisation of four dimensional space-time
Lots of arithmetic on small complex matrices and vectors
Relatively tiny amount of I/O required
Amdahl’s law
Amount of parallel work may be increased by working on a larger volume
Strong Scaling
The amount of computation V^δ required to equilibrate a system of volume V increases faster than linearly
If we are to have any hope of equilibrating large systems the value of δ cannot be much larger than one
For lattice QCD we have algorithms with δ = 5/4
We are therefore driven to as small a number of sites per processor (σ) as possible
This corresponds to "thin nodes," as opposed to the "fat nodes" of PC clusters
Clusters are competitive in price/performance up to a
certain maximum problem size
Data Parallelism
All processors run the same code
Not necessarily SIMD, where they share a common clock
Synchronization on communication, or at explicit barriers
Types of data parallel operations
Pointwise arithmetic
Nearest neighbour shifts
Perhaps simultaneously in several directions
Global operations
Alternative Paradigms
Multithreading
Parallelism comes from running many separate
more-or-less independent threads
Recent architectures propose running many
light-weight threads on each processor to overcome
memory latency
What are the threads that need almost no memory?
Calculating zillions of digits of π? Cryptography?
Computational Grids
In the future carrying out large scale computations using the Grid will be as easy as plugging into an electric socket
Hardware Choices
Cost/MFlop
Goal is to be about 10 times more cost-effective than commercial machines
Otherwise it is not worth the effort and risk of building our own machine
Our current Teraflops machines cost about $1/MFlop
For a Petaflops machine we will need to reach about $1/GFlop
Power/cooling
Most cost effective to use low-power components and high-density packaging
Clock Speed
Peak/Sustained speed
The sustained performance should be about 20%-50% of peak
Otherwise there is either too much or too little floating point hardware
Real applications rarely have equal numbers of adds and multiplies
Clock speed
“Sweet point” is currently at 0.5—1 GHz chips
High-performance chips running at 3 GHz are both hot and expensive
Using moderate clock speed makes electrical design issues such as clock distribution much simpler
Memory Systems
Memory bandwidth
This is currently the main bottleneck in most
architectures
There are two obvious solutions
Data prefetching
Vector processing is one way of doing this
Requires more sophisticated software
Feasible for our class of applications because the control flow is essentially static (almost no data dependencies)
Hierarchical memory system (NUMA)
We make use of both approaches
Memory Systems
On-chip memory
For QCD the memory footprint is small enough that we can put all the required data into an on-chip embedded DRAM memory
Cache
Whether the on-chip memory is managed as a cache or as directly addressable memory is not too important
Cache flushing for communications DMA is a nuisance
Caches are built into most μ-processors, not worth designing our own!
Off-chip memory
The amount of off-chip memory is determined by cost
If the cost of the memory is » 50% of the total cost then buy more processing nodes instead
Communications Network: I
This is where a massively parallel computer
differs from a desktop or server machine
In the future the network will become the
principal bottleneck
For large data parallel applications
We will end up designing networks and decorating them with processors and memories which are almost free
Communications Network
Topology
Grid
This is easiest to build
How many dimensions?
QCDSP had 4, Blue Gene/L has 3, and QCDOC has 6
Extra dimensions allow easier partitioning of the machine
Hypercube/Fat Tree/Ω network/Butterfly network
These are all essentially an infinite dimensional grid
Good for FFTs
Switch
Global Operations
Grid wires
[Figure: a global sum performed stepwise across the grid wires]
Not very good error propagation
O(N^{1/4}) latency (grows as the perimeter of the grid)
O(N) hardware
Combining network
A tree which can perform arithmetic at each node
Useful for global reductions
But global operations can be performed using grid wires
It is all a matter of cost
Used in BGL, not in QCDx
Global Operations
Combining tree
[Figure: a binary tree summing the values 1…8 pairwise (3, 7, 11, 15 → 10, 26 → 36)]
Good error propagation
O(log N) latency
O(N log N) hardware
Bit-wise operations
Allow arbitrary precision
Used on QCDSP
Not used on QCDOC or BGL
Because data sent byte-wise (for ECC)
Global Operations
[Figure: bit-serial global operations on grid wires, e.g. selecting the maximum MSB-first (broadcasting 110100) and adding LSB-first with carry propagation]
Communications Network: II
Parameters
Bandwidth, latency, packet size
The ability to send small [O(10²) byte] packets between neighbours is vital if we are to run efficiently with a small number of sites per processor (σ)
Control (DMA)
The DMA engine needs to be sophisticated enough to interchange neighbouring faces without interrupting the CPU
Block-strided moves
Block-strided Moves
[Figure: a block-strided move gathering regularly spaced blocks from source memory into contiguous target memory]
For each direction separately we specify in memory-mapped registers:
1. Source starting address
2. Target starting address
3. Block size
4. Stride
Hardware Design Process
Optimise machine for applications of interest
We run simulations of lattice QCD to tune our design
Most design tradeoffs can be evaluated using a spreadsheet as the applications are mainly static (data-independent)
It also helps to debug the machine if you use the actual kernels you will run in production as test cases!
But running QCD even on a RTL simulator is painfully slow
Circuit design done using hardware
description languages (e.g., VHDL)
Time to completion is critical
Performance Model
[Chart: simulated sustained-performance fraction (0 to 1) for successive design variants (no cache, FIFO, DMA-RF, CBUF, CBUF+SBUF, SBUF+CBUF+), broken down into FPU, communications, memory, and bus contributions]
VHDL

```vhdl
entity lcount5 is
  port ( signal clk, res   : in    vlbit;
         signal reset, set : in    vlbit;
         signal count      : inout vlbit_1d(4 downto 0) );
end lcount5;

architecture bhv of lcount5 is
begin
  process
  begin
    wait until prising(clk) or res = '1';
    if res = '1' then
      count <= b"00000";
    elsif reset = '1' then
      count <= b"00000";
    elsif set = '1' then
      count(4) <= count(4) xor (count(0) and count(1) and count(2) and count(3));
      count(3) <= count(3) xor (count(0) and count(1) and count(2));
      count(2) <= count(2) xor (count(0) and count(1));
      count(1) <= count(1) xor count(0);
      count(0) <= not count(0);
    end if;
  end process;
end bhv;
```
QCDSP
QCDOC ASIC Design
Complete processor node for "QCD On a Chip", fabricated by IBM using the SA27E 0.18μ logic + embedded DRAM process
4 Mbytes of embedded DRAM
[Block diagram: IBM ASIC library components (440 core, FPU, PLB arbiter, DCR SDRAM controllers, MCMAL Ethernet DMA, EMACS, interrupt controller, boot/clock support, OFB-PLB bridge, HSSLs, PLL, EDRAM) together with custom-designed logic (SCU, PEC, 24-link DMA communications controller, IDC interface, serial number)]
100 Mbit/sec Fast Ethernet
24 off-node links with 12 Gbit/sec bandwidth
24-link DMA communications control
2.6 Gbyte/sec EDRAM/SDRAM DMA
1 (0.8) GFlops double precision RISC processor
8 Gbyte/sec memory-processor bandwidth
2.6 Gbyte/sec interface to external memory
PEC = Prefetching EDRAM Controller; EDRAM = Embedded DRAM; DRAM = Dynamic RAM
Edinburgh ACF
UKQCDOC is
located in the
ACF near
Edinburgh
Our insurers will
not let us show
photographs of
the building for
security reasons!
Vicious Design Cycle
1. Communications DMA controller required
Otherwise control by CPU needed
Requires interrupts at each I/O completion
CPU becomes the bottleneck
2. Introduce more sophisticated DMA instructions
Block-strided moves for QCDSP
Sequences of block-strided moves for QCDOC
Why not? It is cheap
3. Use CPU as DMA engine
Second CPU in Blue Gene/L
4. Add FPU
Why not? It is cheap, and you double your peak Flops
5. Use second FPU for computation
Otherwise sustained fraction of peak is embarrassingly small
Power Efficiencies
[Chart: GFlops/Watt versus year (1997-2005) for QCDSP, ASCI White, QCDOC, Blue Gene/L, NCSA Intel Xeon, ASCI Q, and the Earth Simulator]
What Next?
We need a Petaflop computer for QCD
New fermion formulations solve a lot of problems, but cost significantly more in computer time
Algorithms are improving
Slowly
By factors of about 2