Algorithms for Dynamical Fermions
A D Kennedy
School of Physics, The University of Edinburgh
Outline
Monte Carlo integration
Importance sampling
Markov chains
Detailed balance, Metropolis algorithm
Symplectic integrators
Hybrid Monte Carlo
Pseudofermions
RHMC
Monte Carlo methods: I
Monte Carlo integration is based on the identification of probabilities with measures
There are much better methods of carrying out low dimensional quadrature
All other methods become hopelessly expensive for large dimensions
In lattice QFT there is one integration per degree of freedom
Measure the value of Ω on each configuration and compute the average
Ω̄ ≡ (1/N) Σ_{t=1}^{N} Ω(φ_t)
Monte Carlo methods: II
Generate a sequence of random field configurations (φ₁, φ₂, …, φ_t, …, φ_N) chosen from the probability distribution
P(φ_t) dφ_t = (1/Z) e^{−S(φ_t)} dφ_t
Central Limit Theorem: I
Distribution of values for a single sample ω = Ω(φ):
P_Ω(ω) ≡ ∫ dφ P(φ) δ(ω − Ω(φ)) = ⟨δ(ω − Ω(φ))⟩
Law of Large Numbers: lim_{N→∞} Ω̄ = ⟨Ω⟩
Central Limit Theorem: Ω̄ ∼ ⟨Ω⟩ + O(√(C₂/N)), where the variance of the distribution of Ω̄ is C₂/N with C₂ ≡ ⟨(Ω − ⟨Ω⟩)²⟩
The Laplace-DeMoivre Central Limit theorem is an asymptotic expansion for the probability distribution of Ω̄
The first few cumulants are C₀ = 0, C₁ = ⟨Ω⟩, C₂ = ⟨(Ω − C₁)²⟩, C₃ = ⟨(Ω − C₁)³⟩, C₄ = ⟨(Ω − C₁)⁴⟩ − 3C₂²
Central Limit Theorem: II
Generating function for connected moments (cumulants):
W_Ω(k) ≡ ln ∫ dω P_Ω(ω) e^{ikω} = ln ∫ dφ P(φ) e^{ikΩ(φ)} = ln ⟨e^{ikΩ}⟩ = Σ_{n=0}^{∞} C_n (ik)ⁿ/n!
Central Limit Theorem: III
Connected generating function for the sample average: with
P_Ω̄(ω) ≡ ∫ dφ₁ ⋯ dφ_N P(φ₁) ⋯ P(φ_N) δ(ω − (1/N) Σ_{t=1}^{N} Ω(φ_t))
we have
W_Ω̄(k) = ln ⟨e^{ikΩ̄}⟩ = ln [∫ dφ P(φ) e^{ikΩ(φ)/N}]^N = N W_Ω(k/N) = Σ_{n=1}^{∞} C_n (ik)ⁿ/(n! N^{n−1})
Central Limit Theorem: IV
Take the inverse Fourier transform to obtain the distribution P_Ω̄:
P_Ω̄(ω) = (1/2π) ∫ dk e^{−ikω} e^{W_Ω̄(k)}
∼ (1/2π) ∫ dk exp[ik(⟨Ω⟩ − ω) − C₂k²/2N − iC₃k³/3!N² + C₄k⁴/4!N³ − ⋯]
= exp[−(C₃/3!N²) d³/dω³ + (C₄/4!N³) d⁴/dω⁴ − ⋯] √(N/2πC₂) e^{−N(ω−⟨Ω⟩)²/2C₂}
Central Limit Theorem: V
where ξ ≡ (ω − ⟨Ω⟩)√N, so that P_Ω̄(ω) = F(ξ) dξ/dω with
F(ξ) = (e^{−ξ²/2C₂}/√(2πC₂)) [1 + (C₃/6C₂³√N)(ξ³ − 3C₂ξ) + ⋯]
Importance Sampling: I
Sample from the distribution ρ:
Probability: ρ(x) > 0 a.e.
Normalisation: N = ∫ dx ρ(x) = 1
Integral: I = ∫ dx f(x)
Estimator of the integral: I = ⟨f/ρ⟩_ρ = ∫ dx ρ(x) f(x)/ρ(x)
Estimator of the variance: V = ∫ dx ρ(x) (f(x)/ρ(x) − I)² = ∫ dx f(x)²/ρ(x) − I²
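As an illustration, here is a minimal numpy sketch of these two estimators (the test function, weight, and sampler below are hypothetical stand-ins, not anything from the lectures):

```python
import numpy as np

def importance_estimate(f, rho, sample_rho, n=100_000, seed=0):
    """Estimate I = ∫ f(x) dx by sampling x ~ rho and averaging f(x)/rho(x)."""
    rng = np.random.default_rng(seed)
    x = sample_rho(n, rng)        # draws from the normalised weight rho
    w = f(x) / rho(x)             # each sample is an unbiased estimator of I
    return w.mean(), w.var()      # estimators of the integral I and variance V

# Example: I = ∫₀^π sin(x) dx = 2, sampling uniformly on [0, π] (rho = 1/π)
I, V = importance_estimate(
    f=np.sin,
    rho=lambda x: np.full_like(x, 1.0 / np.pi),
    sample_rho=lambda n, rng: rng.uniform(0.0, np.pi, n))
print(I, V)   # I ≈ 2; the statistical error on I is sqrt(V / n)
```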
Importance Sampling: II
Minimise the variance subject to the normalisation constraint N = 1, with Lagrange multiplier λ:
δ(V + λN)/δρ(y) = −(f(y)/ρ(y))² + λ = 0
Optimal measure: ρ_opt(x) = |f(x)| / ∫ dx |f(x)|
Example: f(x) = sin(x) on [0, 2π]
Optimal weight: ρ_opt(x) = |sin(x)|/4
Optimal variance: V_opt = (∫₀^{2π} dx |sin(x)|)² − (∫₀^{2π} dx sin(x))² = 16
Importance Sampling: IV
Example: ∫₀^{2π} dx sin(x)
1. Construct a cheap approximation to |sin(x)|
2. Calculate the relative area within each rectangle
3. Choose a random number uniformly
4. Select the corresponding rectangle
5. Choose another random number uniformly
6. Select the corresponding x value within the rectangle
(A sketch of this two-level sampling scheme follows.)
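A sketch of the six steps in Python (numpy only; the choice of 100 equal-width rectangles mirrors the slide, everything else is an assumption of the example):

```python
import numpy as np

def make_rectangle_sampler(g, a, b, n_rect=100):
    """Sample from the density proportional to a piecewise-constant
    (rectangle) approximation of g on [a, b]."""
    edges = np.linspace(a, b, n_rect + 1)
    mid = 0.5 * (edges[:-1] + edges[1:])
    heights = g(mid)                          # 1. cheap approximation to g
    areas = heights * np.diff(edges)          # 2. relative area of each rectangle
    cdf = np.cumsum(areas / areas.sum())

    def sample(n, rng):
        k = np.searchsorted(cdf, rng.uniform(size=n))   # 3+4. pick a rectangle
        k = np.minimum(k, n_rect - 1)                   # guard against fp edge
        x = edges[k] + rng.uniform(size=n) * np.diff(edges)[k]  # 5+6. point inside
        return x, heights[k] / areas.sum()    # x and its sampling density
    return sample

rng = np.random.default_rng(1)
sample = make_rectangle_sampler(lambda x: np.abs(np.sin(x)), 0.0, 2 * np.pi)
x, rho = sample(100_000, rng)                 # x is distributed ≈ |sin(x)|/4
```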
Importance Sampling: V
Sampling from ρ_opt ∝ |sin(x)| with 100 rectangles we have V = 16.02328561
But we can do better! Split the integral into its positive and negative parts,
I = ∫₀^{2π} dx sin(x) θ(sin(x)) + ∫₀^{2π} dx sin(x) θ(−sin(x))
and sample each part from its own optimal weight, for which V_opt = 0
With 100 rectangles we have V = 0.011642808
Markov Chains: I
State space
Ω
(Ergodic) stochastic transitions
P’: Ω → Ω
Deterministic evolution of the probability distribution P: Q → Q
The distribution converges to a unique fixed point Q
The sequence Q, PQ, P²Q, P³Q, … is Cauchy
Define a metric on the space of (equivalence classes of) probability distributions:
d(Q₁, Q₂) ≡ ∫ dx |Q₁(x) − Q₂(x)|
Prove that d(PQ₁, PQ₂) ≤ (1 − α) d(Q₁, Q₂) with α > 0, so that the Markov process P is a contraction mapping
The space of probability distributions is complete, so the sequence converges to a unique fixed point Q = lim_{n→∞} PⁿQ
Sketch of the proof, writing ΔQ(y) ≡ Q₁(y) − Q₂(y) and θ for the step function:
d(PQ₁, PQ₂) = ∫ dx |PQ₁(x) − PQ₂(x)| = ∫ dx |∫ dy P(x ← y) Q₁(y) − ∫ dy P(x ← y) Q₂(y)|
= ∫ dx |∫ dy P(x ← y) ΔQ(y) [θ(ΔQ(y)) + θ(−ΔQ(y))]|   (using θ(y) + θ(−y) = 1)
= ∫ dx { ∫ dy P(x ← y) |ΔQ(y)| − 2 min_± ∫ dy P(x ← y) |ΔQ(y)| θ(±ΔQ(y)) }   (using |a − b| = a + b − 2 min(a, b))
≤ ∫ dy |ΔQ(y)| − 2 ∫ dx inf_y P(x ← y) min_± ∫ dy |ΔQ(y)| θ(±ΔQ(y))   (using ∫ dx P(x ← y) = 1)
= (1 − α) d(Q₁, Q₂)
Here ∫ dy ΔQ(y) = ∫ dy Q₁(y) − ∫ dy Q₂(y) = 1 − 1 = 0 implies ∫ dy |ΔQ(y)| θ(±ΔQ(y)) = ½ ∫ dy |ΔQ(y)|, and α ≡ ∫ dx inf_y P(x ← y) > 0
Markov Chains: II
Use Markov chains to sample from Q
Suppose we can construct an ergodic Markov process P which has distribution Q as its fixed point
Start with an arbitrary state (“field configuration”)
Iterate the Markov process until it has converged (“thermalized”)
Thereafter, successive configurations will be distributed according to Q
But in general they will be correlated
To construct P we only need relative probabilities of states
Do not know the normalisation of Q
Cannot use Markov chains to compute integrals directly
We can compute ratios of integrals
Metropolis algorithm:
P(x ← y) = min(1, Q(x)/Q(y))
Markov Chains: III
How do we construct a Markov process with a specified fixed point Q(x)?
Integrate w.r.t. y to obtain the fixed point condition:
Q(x) = ∫ dy P(x ← y) Q(y)
Detailed balance:
P(y ← x) Q(x) = P(x ← y) Q(y)
Sufficient but not necessary for a fixed point
Consider the cases Q(x) > Q(y) and Q(x) < Q(y) separately to verify that the Metropolis choice satisfies detailed balance
Other choices are possible, e.g., P(x ← y) = Q(x)/[Q(x) + Q(y)]
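For concreteness, a minimal Metropolis sketch for a one-dimensional toy distribution (the symmetric Gaussian proposal and the quartic action are illustrative assumptions; only relative probabilities of Q enter, through log Q):

```python
import numpy as np

def metropolis(logQ, y0, n_steps, step=0.5, seed=0):
    """Markov chain with fixed point Q, known only up to normalisation."""
    rng = np.random.default_rng(seed)
    y, chain = y0, np.empty(n_steps)
    for t in range(n_steps):
        x = y + step * rng.normal()                    # symmetric proposal
        if np.log(rng.uniform()) < logQ(x) - logQ(y):  # accept w.p. min(1, Q(x)/Q(y))
            y = x
        chain[t] = y                                   # keep old state on rejection
    return chain

# Toy example: Q(q) ∝ exp(-S(q)) with S(q) = q⁴ - 2q²; Z is never needed
chain = metropolis(lambda q: -(q**4 - 2 * q**2), y0=0.0, n_steps=50_000)
```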
Markov Chains: IV
Composition of Markov steps
Let P1 and P2 be two Markov steps which have the desired fixed point distribution
They need not be ergodic
Then the composition of the two steps, P2 ∘ P1, will also have the desired fixed point
And it may be ergodic
This trivially generalises to any (fixed) number of steps
For the case where P1 is not ergodic but (P1)ⁿ is, the terminology …
Markov Chains: V
This result justifies “sweeping” through a
lattice performing single site updates
Each individual single site update has the desired fixed point because it satisfies detailed balance
The entire sweep therefore has the desired fixed point, and is ergodic
But the entire sweep does not satisfy detailed balance
Of course it would satisfy detailed balance if the sites were updated in a random order
But this is not necessary
Markov Chains: VI
Coupling from the Past (Propp and Wilson, 1996)
Use a fixed set of random numbers
Flypaper principle: if states coalesce they stay together forever
Eventually, all states coalesce to some state α with probability one
Any state from t = −∞ will coalesce to α
Markov Chains: VII
Suppose we have a lattice, that is, a partial ordering with a largest and a smallest element, and an update step that preserves the ordering
Then once the largest and smallest states have coalesced all the others must have too
[Figure: Hasse diagram of a partial order on states a-g]
Markov Chains: VIII
A non-trivial example of where this is possible is the ferromagnetic Ising model
H = β Σ_{⟨ij⟩} s_i s_j with β ≥ 0
Ordering is on spin configurations: A ≥ B iff A_s ≥ B_s for every site s
Update is a local heatbath
Autocorrelations: I
Exponential autocorrelations
The unique fixed point of an ergodic Markov process corresponds to the unique eigenvector with eigenvalue 1
All its other eigenvalues must lie within the unit circle
In particular, the largest subleading eigenvalue satisfies |λ_max| < 1
This corresponds to the exponential autocorrelation time N_exp ≡ −1/ln|λ_max| > 0
Autocorrelations: II
Integrated autocorrelations
Consider the autocorrelation of some operator Ω; without loss of generality we may assume ⟨Ω⟩ = 0
⟨Ω̄²⟩ = (1/N²) Σ_{t,t′=1}^{N} ⟨Ω(φ_t) Ω(φ_{t′})⟩ = (1/N²) { Σ_{t=1}^{N} ⟨Ω(φ_t)²⟩ + 2 Σ_{t=1}^{N−1} Σ_{t′=t+1}^{N} ⟨Ω(φ_t) Ω(φ_{t′})⟩ }
Define the autocorrelation function C_Ω(ℓ) ≡ ⟨Ω(φ_{t+ℓ}) Ω(φ_t)⟩ / ⟨Ω²⟩; then
⟨Ω̄²⟩ = (⟨Ω²⟩/N) { 1 + 2 Σ_{ℓ=1}^{N−1} (1 − ℓ/N) C_Ω(ℓ) }
Autocorrelations: III
The autocorrelation function must fall faster than the exponential autocorrelation: C_Ω(ℓ) ≤ |λ_max|^ℓ = e^{−ℓ/N_exp}
For a sufficiently large number of samples
⟨Ω̄²⟩ = (⟨Ω²⟩/N) { 1 + 2 Σ_{ℓ=1}^{∞} C_Ω(ℓ) } [1 + O(N_exp/N)] = (⟨Ω²⟩/N) (1 + 2A_Ω) [1 + O(N_exp/N)]
where the integrated autocorrelation is A_Ω ≡ Σ_{ℓ=1}^{∞} C_Ω(ℓ)
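In practice A_Ω is estimated from the measured time series with some window on the lag sum; a minimal numpy sketch (the fixed truncation window below is a crude assumption; real analyses choose it self-consistently):

```python
import numpy as np

def integrated_autocorrelation(omega, max_lag=None):
    """Estimate A = Σ_{l>=1} C(l) from a series of measurements omega."""
    w = omega - omega.mean()
    var = np.dot(w, w) / len(w)
    max_lag = max_lag or len(w) // 50          # crude truncation window
    A = 0.0
    for l in range(1, max_lag):
        A += np.dot(w[:-l], w[l:]) / (len(w) - l) / var   # estimator of C(l)
    return A

# The naive error estimate must be inflated by the factor (1 + 2A):
# var_mean ≈ omega.var() / len(omega) * (1 + 2 * integrated_autocorrelation(omega))
```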
l lIn order to carry out Monte Carlo computations
including the effects of dynamical fermions we
would like to find an algorithm which
Update the fields globally
Because single link updates are not cheap if the action is not local
Take large steps through configuration space
Because small-step methods carry out a random walk which leads to critical slowing down with a dynamical critical exponent z=2
Hybrid Monte Carlo: I
Does not introduce any systematic errors
z relates the autocorrelation to the correlation length of the system, AΩ ∝ ξz
A useful class of algorithms with these properties is
the (Generalised) Hybrid Monte Carlo (HMC) method
Introduce a “fictitious” momentum p corresponding to each dynamical degree of freedom q
Find a Markov chain with fixed point ∝ exp[−H(q,p)] where H is the “fictitious” Hamiltonian ½p² + S(q)
The action S of the underlying QFT plays the rôle of the potential in the “fictitious” classical mechanical system
This gives the evolution of the system in a fifth dimension, “fictitious” or computer time
This generates the desired distribution exp[-S(q) ] if we ignore the momenta p (i.e., the marginal distribution)
The HMC Markov chain alternates two Markov
steps
Molecular Dynamics Monte Carlo (MDMC)
(Partial) Momentum Refreshment
Both have the desired fixed point
Together they are ergodic
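Putting the pieces together for a toy scalar model, here is a hedged sketch of the full HMC update (leapfrog MD plus Metropolis accept/reject, with complete momentum refreshment, i.e. θ = π/2; the quadratic action is only an example):

```python
import numpy as np

def hmc(S, grad_S, q, n_traj=1000, n_md=20, dt=0.1, seed=0):
    """Hybrid Monte Carlo with fixed point ∝ exp(-S(q)), H = ½p² + S(q)."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_traj):
        p = rng.normal(size=q.shape)            # momentum heatbath (refreshment)
        H_old = 0.5 * p @ p + S(q)
        qn, pn = q.copy(), p.copy()
        pn -= 0.5 * dt * grad_S(qn)             # leapfrog: half step in p
        for _ in range(n_md - 1):
            qn += dt * pn                       # full step in q
            pn -= dt * grad_S(qn)               # full step in p
        qn += dt * pn
        pn -= 0.5 * dt * grad_S(qn)             # final half step in p
        dH = 0.5 * pn @ pn + S(qn) - H_old
        if np.log(rng.uniform()) < -dH:         # MDMC accept with min(1, e^{-δH})
            q = qn                              # this makes the algorithm exact
        samples.append(q.copy())
    return np.array(samples)

samples = hmc(S=lambda q: 0.5 * q @ q, grad_S=lambda q: q, q=np.zeros(4))
```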
Hybrid Monte Carlo: III
MDMC: I
If we could integrate Hamilton’s equations
exactly we could follow a trajectory of
constant fictitious energy
This corresponds to a set of equiprobable fictitious phase space configurations
Liouville’s theorem tells us that this also preserves the functional integral measure dp ∧ dq as required
Any approximate integration scheme which is reversible and area preserving may be used to suggest configurations to a Metropolis accept/reject test
MDMC: II
We build the MDMC step out of three parts:
Molecular Dynamics (MD), an approximate integrator U(τ): (q, p) ↦ (q′, p′) which is exactly
Area preserving: det[∂(q′, p′)/∂(q, p)] = 1
Reversible: F ∘ U(τ) ∘ F ∘ U(τ) = 1
A momentum flip F: p ↦ −p
A Metropolis accept/reject step
The composition of these gives
(q′, p′) = [F ∘ U(δτ)](q, p) θ(e^{−δH} − y) + (q, p) θ(y − e^{−δH})
where y is uniformly distributed on [0, 1]
Partial Momentum Refreshment
This mixes the Gaussian distributed momenta p with Gaussian noise ξ:
(p′, ξ′) = F ∘ (p cos θ + ξ sin θ, −p sin θ + ξ cos θ)
The Gaussian distribution of p is invariant under F
The extra momentum flip F ensures that for small θ the momenta are reversed after a rejection rather than after an acceptance
For θ = π/2 all momentum flips are irrelevant
If A and B belong to any (non-commutative) algebra then e^A e^B = e^{A+B+δ}, where δ is constructed from commutators of A and B (i.e., it lies in the Free Lie Algebra generated by {A, B})
Symplectic Integrators: I
Baker-Campbell-Hausdorff (BCH) formula
More precisely, ln(e^A e^B) = Σ_{n≥1} c_n, where c₁ = A + B and each higher term c_n is a combination of nested commutators of A and B, given recursively with coefficients involving the Bernoulli numbers B_{2m}
Symplectic Integrators: II
Explicitly, the first few terms are
ln(e^A e^B) = A + B + ½[A,B] + (1/12)([A,[A,B]] + [B,[B,A]]) − (1/24)[B,[A,[A,B]]] + (degree-5 nested commutators with coefficient 1/720) + ⋯
In order to construct reversible integrators we use symmetric symplectic integrators, for which
ln(e^{A/2} e^B e^{A/2}) = A + B − (1/24)([A,[A,B]] + 2[B,[A,B]]) + (degree-5 nested commutators with coefficient 1/5760) + ⋯
Symplectic Integrators: III
We are interested in finding the classical trajectory in phase space of a system described by the Hamiltonian
H(q, p) = T(p) + S(q) = ½p² + S(q)
The basic idea of such a symplectic integrator is to write the time evolution operator as
exp(τ d/dt) = exp(τ [(dq/dt) ∂/∂q + (dp/dt) ∂/∂p])
= exp(τ [(∂H/∂p) ∂/∂q − (∂H/∂q) ∂/∂p]) ≡ e^{τĤ}
= exp(τ [T′(p) ∂/∂q − S′(q) ∂/∂p])
Symplectic Integrators: IV
Define P ≡ −S′(q) ∂/∂p and Q ≡ T′(p) ∂/∂q, so that Ĥ = P + Q
Since the kinetic energy T is a function only of p and the potential energy S is a function only of q, the action of e^{τP} and e^{τQ} may be evaluated trivially:
e^{τQ}: f(q, p) ↦ f(q + τT′(p), p)
e^{τP}: f(q, p) ↦ f(q, p − τS′(q))
Symplectic Integrators: V
From the BCH formula we find that the PQP symmetric symplectic integrator is given by
U₀(δτ)^{τ/δτ} = [e^{½δτP} e^{δτQ} e^{½δτP}]^{τ/δτ}
= exp[τ(P + Q) − (τδτ²/24)([P,[P,Q]] + 2[Q,[P,Q]]) + O(δτ⁴)]
= e^{τĤ′} = e^{τ(P+Q)} + O(δτ²)
In addition to conserving energy to O(δτ²), such symmetric symplectic integrators are manifestly area preserving and reversible
Symplectic Integrators: VI
For each symplectic integrator there exists a
Hamiltonian
H’
which is
exactly conserved
This is given by substituting Poisson brackets for commutators in the BCH formula
H′ = T + S − (δτ²/24)({S,{S,T}} + 2{T,{S,T}}) + (δτ⁴/5760)(combinations of degree-5 nested Poisson brackets of S and T, with integer coefficients 7, 8, 12, 16, 28, 32) + O(δτ⁶)
Symplectic Integrators: VII
Poisson bracket: {A, B} ≡ (∂A/∂p)(∂B/∂q) − (∂A/∂q)(∂B/∂p)
Poisson brackets form a Lie algebra
Homework exercise: verify the Jacobi identity {{A,B},C} + {{B,C},A} + {{C,A},B} = 0
Symplectic Integrators: VIII
The explicit result for H’ is
H′ = H + (δτ²/24)(2p²S″ − S′²) + (δτ⁴/720)(terms involving p⁴S⁗, p²S″², p²S′S‴, and S′²S″) + O(δτ⁶)
Note that H′ cannot be written as the sum of a p-dependent kinetic term and a q-dependent potential term
As H′ is conserved, δH is of O(δτ²) for trajectories of arbitrary length
Symplectic Integrators: IX
Multiple timescales (Sexton-Weingarten)
Split the Hamiltonian into pieces: H(q, p) = T(p) + S₁(q) + S₂(q)
Define Pᵢ ≡ −Sᵢ′(q) ∂/∂p and Q ≡ T′(p) ∂/∂q, so that Ĥ = P₁ + P₂ + Q
Introduce a symmetric symplectic integrator of the form (schematically)
U_SW(δτ)^{τ/δτ} = ( e^{½δτP₁} [ e^{(δτ/2n)P₂} e^{(δτ/n)Q} e^{(δτ/2n)P₂} ]ⁿ e^{½δτP₁} )^{τ/δτ}
If the sub-step sizes are chosen so that each force contributes comparably, the instability in the integrator is tickled equally by each sub-step
This helps if the most expensive force computation does not correspond to the largest force
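A sketch of the nested update in Python (the two-term split of the action is an assumption of the example; the expensive force S₁ is evaluated once per coarse step, while the cheap force S₂ is integrated with n substeps):

```python
def nested_leapfrog(q, p, dt, n, grad_S1, grad_S2, n_steps):
    """Coarse PQP steps in S1; each inner update is itself a fine PQP
    integrator for T(p) + S2(q) with step dt/n."""
    for _ in range(n_steps):
        p -= 0.5 * dt * grad_S1(q)           # expensive force, coarse timescale
        for _ in range(n):                   # cheap force, fine timescale
            p -= 0.5 * (dt / n) * grad_S2(q)
            q += (dt / n) * p
            p -= 0.5 * (dt / n) * grad_S2(q)
        p -= 0.5 * dt * grad_S1(q)
    return q, p
```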
Dynamical fermions: I
Pseudofermions
Direct simulation of Grassmann fields is not feasible
The problem is not that of manipulating anticommuting values in a computer
It is that e^{−S_F} = e^{−ψ̄Mψ} is not positive, and thus we get poor importance sampling
We therefore integrate out the fermion fields to obtain the fermion determinant
∫ dψ̄ dψ e^{−ψ̄Mψ} ∝ det M
(ψ̄ and ψ always occur quadratically)
Dynamical fermions: II
Any operator Ω can be expressed solely in terms of the bosonic fields:
⟨Ω(ψ̄, ψ, φ)⟩ = ⟨Ω′(δ/δη, δ/δη̄, φ) e^{η̄ M⁻¹(φ) η}⟩|_{η = η̄ = 0}
Use Schwinger's source method and integrate out the fermions
E.g., the fermion propagator is G(x, y) = ⟨ψ(x) ψ̄(y)⟩ = M⁻¹(x, y)
Dynamical fermions: III
Represent the fermion determinant as a bosonic Gaussian integral with a non-local kernel:
det M(φ) ∝ ∫ dχ̄ dχ e^{−χ̄ M⁻¹(φ) χ}
The fermion kernel must be positive definite (all its eigenvalues must have positive real parts), otherwise the bosonic integral will not converge
Including the determinant as part of the observable to be measured is not feasible:
⟨Ω⟩ = ⟨Ω(φ) det M(φ)⟩_{S_B} / ⟨det M(φ)⟩_{S_B}
The determinant is extensive in the lattice volume, thus again we get poor importance sampling
Dynamical fermions: IV
It is usually convenient to introduce two flavours of fermion and to write
(det M(φ))² = det(M†(φ) M(φ)) ∝ ∫ dχ† dχ e^{−χ† (M†M)⁻¹ χ}
This not only guarantees positivity, but also allows us to generate the pseudofermions from a global heatbath by applying M† to a Gaussian distributed random field
The evaluation of the pseudofermion action and the corresponding force then requires us to find the solution of a (large) set of linear equations, ψ = (M†M)⁻¹ χ
The equations of motion for the boson (gauge) fields are
φ̇ = π
π̇ = −∂S_B/∂φ − χ† [∂(M†M)⁻¹/∂φ] χ = −∂S_B/∂φ + χ† (M†M)⁻¹ [∂(M†M)/∂φ] (M†M)⁻¹ χ
1 ? ? ٛ B B S S M M M M M M M M φ π φ φ π χ χ χ χ φ φ φ φ − − − = ∂ ∂ ∂ ⎡ ⎤ ∂ ⎡ ⎤ = − − = − + ⎢ ⎥ ⎢ ⎥ ⎣ ⎦ ⎣ ⎦ ∂ ∂ ∂ ∂ & &Dynamical fermions: V
It is not necessary to carry out the inversions required for the equations of motion exactly
There is a trade-off between the cost of computing the force and the acceptance rate of the Metropolis MDMC step
The inversions required to compute the pseudofermion action for the accept/reject step do need to be computed exactly, however
We usually take "exactly" to be synonymous with "to machine precision"
Reversibility: I
Are HMC trajectories reversible and area
preserving in practice?
The only fundamental source of irreversibility is the rounding error caused by using finite precision floating point arithmetic
For fermionic systems we can also introduce irreversibility by
choosing the starting vector for the iterative linear equation solver time-asymmetrically
We do this if we use a Chronological Inverter, which takes (some extrapolation of) the previous solution as the starting vector
Floating point arithmetic is not associative
It is more natural to store compact variables as scaled integers (fixed point)
Saves memory
Reversibility: II
Data for SU(3) gauge theory and QCD with heavy quarks show that rounding errors are amplified exponentially
The underlying continuous time equations of motion are chaotic
Ляпунов exponents
characterise the divergence of nearby trajectories
The instability in the integrator occurs when δH » 1
Reversibility: III
In QCD the Ляпунов exponents appear to scale with β as the system approaches the continuum limit β → ∞
νξ = constant
This can be interpreted as saying that the Ляпунов exponent characterises the
chaotic nature of the continuum classical equations of motion, and is not a lattice artefact
Therefore we should not have to worry about reversibility breaking down as we approach the continuum limit
Caveat: data is only for small lattices, and …
Polynomial approximation
What is the best polynomial approximation p(x) to a continuous function f(x) for x in [0, 1]?
Best with respect to the appropriate norm, where n ≥ 1:
‖p − f‖_n = ( ∫₀¹ dx |p(x) − f(x)|ⁿ )^{1/n}
Weierstraß’ theorem
Weierstraß: Any continuous function can be arbitrarily well approximated by a polynomial
Taking n → ∞ this is the minimax norm
‖p − f‖_∞ = min_p max_{0≤x≤1} |p(x) − f(x)|
Бернштейне polynomials
The explicit solution is provided by the Бернштейне polynomials
p_n(x) ≡ Σ_{k=0}^{n} f(k/n) (n choose k) x^k (1 − x)^{n−k}
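A small numerical illustration (standard library plus numpy; f = sin on [0, 1] is an arbitrary test function):

```python
import numpy as np
from math import comb

def bernstein(f, n, x):
    """Evaluate the degree-n Бернштейне approximation of f at points x."""
    k = np.arange(n + 1)
    binom = np.array([comb(n, j) for j in k], dtype=float)
    basis = binom * np.power.outer(x, k) * np.power.outer(1.0 - x, n - k)
    return basis @ f(k / n)

x = np.linspace(0.0, 1.0, 201)
for n in (4, 16, 64):
    print(n, np.max(np.abs(bernstein(np.sin, n, x) - np.sin(x))))
# converges for every continuous f, but slowly: it is not the minimax polynomial
```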
Чебышев’s theorem
Чебышев: There is always a unique polynomial of any degree d which minimises
‖p − f‖_∞ = max_{0≤x≤1} |p(x) − f(x)|
The error |p(x) − f(x)| reaches its maximum at exactly d + 2 points on the unit interval
Чебышев’s theorem: Necessity
Suppose p − f has less than d + 2 extrema of equal magnitude
Then at most d + 1 maxima exceed some magnitude
This defines a "gap"
We can construct a polynomial q of degree d which has the opposite sign to p − f at each of these maxima (Lagrange interpolation), and whose magnitude is smaller than the "gap"
The polynomial p + q is then a better approximation than p to f
Чебышев’s theorem: Sufficiency
Suppose there is a polynomial p′ with ‖p′ − f‖_∞ ≤ ‖p − f‖_∞
Then at each of the d + 2 extrema, |p′(xᵢ) − f(xᵢ)| ≤ |p(xᵢ) − f(xᵢ)|
Therefore p′ − p must have d + 1 zeros on the unit interval
Thus p′ − p = 0, as it is a polynomial of degree d
Чебышев polynomials
The best approximation of degree d − 1 over [−1, 1] to x^d is
p_{d−1}(x) ≡ x^d − 2^{1−d} T_d(x)
where the Чебышев polynomials are T_d(x) = cos(d cos⁻¹(x))
(The notation T is an old transliteration of Чебышев!)
The error is
‖x^d − p_{d−1}(x)‖_∞ = 2^{1−d} ‖T_d‖_∞ = 2^{1−d} = e^{−(d−1) ln 2}
Convergence is often exponential in d
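A quick numerical check of this error formula (numpy; T_d evaluated directly as cos(d cos⁻¹ x)):

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 100_001)
for d in (3, 5, 8):
    T_d = np.cos(d * np.arccos(x))
    p = x**d - 2.0**(1 - d) * T_d        # best degree-(d-1) approximation to x^d
    print(d, np.max(np.abs(x**d - p)), 2.0**(1 - d))   # the two errors agree
```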
Чебышев rational functions
Чебышев’s theorem is easily extended to rational approximations
Rational functions with nearly equal degree numerator and denominator are usually best
Rational functions usually give a much better approximation than polynomials
Convergence is still often exponential
A simple (but somewhat slow) numerical algorithm for finding the optimal Чебышев rational approximation was given by Ремез
Чебышев rationals: Example
A realistic example of a rational approximation (in this case to 1/√x) is
1/√x ≈ 0.3904603901 (x + 2.3475661045)(x + 0.1048344600)(x + 0.0073063814) / [(x + 0.4105999719)(x + 0.0286165446)(x + 0.0012779193)]
The partial fraction expansion of the rational function above is
1/√x ≈ 0.3904603901 + 0.0511093775/(x + 0.0012779193) + 0.1408286237/(x + 0.0286165446) + 0.5964845033/(x + 0.4105999719)
Using a partial fraction expansion of such rational functions allows us to use a multishift linear equation solver, thus reducing the cost significantly.
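The two forms above are easy to check against each other and against 1/√x numerically (the test points are arbitrary; the approximation is only good over a limited spectral range):

```python
import numpy as np

shifts = np.array([0.0012779193, 0.0286165446, 0.4105999719])
residues = np.array([0.0511093775, 0.1408286237, 0.5964845033])
zeros = np.array([2.3475661045, 0.1048344600, 0.0073063814])

def r_product(x):
    x = np.asarray(x)[:, None]
    return 0.3904603901 * np.prod(x + zeros, axis=1) / np.prod(x + shifts, axis=1)

def r_partial(x):
    x = np.asarray(x)[:, None]
    return 0.3904603901 + np.sum(residues / (x + shifts), axis=1)

x = np.geomspace(0.01, 1.0, 7)
print(np.max(np.abs(r_product(x) - r_partial(x))))       # identical to roundoff
print(np.max(np.abs(r_partial(x) * np.sqrt(x) - 1.0)))   # small relative error
```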
Polynomials versus rationals
The optimal L² approximation to sgn(x) with weight 1/√(1 − x²) is
(4/π) Σ_{j=0}^{n} (−1)^j T_{2j+1}(x) / (2j + 1)
The optimal L∞ approximation cannot be too much better (or it would lead to a better L² approximation)
Золотарев’s formula has L∞ error which falls exponentially in n (with a rate controlled by ln(1/ε), where ε bounds the spectrum away from zero)
Non-linearity of CG solver
Suppose we want to solve A²x = b for Hermitian A by CG
It is better to solve Ax = y, Ay = b successively
Condition number κ(A²) = κ(A)²
Cost is thus 2κ(A) < κ(A)² in general
Suppose we want to solve Ax = b
Why don't we solve A^{1/2}x = y, A^{1/2}y = b successively?
The square root of A is uniquely defined if A > 0
This is the case for fermion kernels
All this generalises trivially to nth roots
No tuning needed to split the condition number evenly
Rational matrix approximation
Functions on matrices: defined for a Hermitian matrix by diagonalisation
H = U D U⁻¹
f(H) = f(U D U⁻¹) = U f(D) U⁻¹
Rational functions do not require diagonalisation:
αH^m + βH^n = U (αD^m + βD^n) U⁻¹
H⁻¹ = U D⁻¹ U⁻¹
Rational functions have nice properties: they are (relatively) cheap to apply
No Free Lunch Theorem
We must apply the rational approximation M^{1/n} ≈ r(M) with each CG iteration
The condition number for each term in the partial fraction expansion is approximately κ(M)
So the cost of applying M^{1/n} is proportional to κ(M)
Even though the condition number κ(M^{1/n}) = κ(M)^{1/n}
And even though κ(r(M)) = κ(M)^{1/n}
Pseudofermions
We want to evaluate a functional integral including the fermionic determinant det M
We write this as a bosonic functional integral over a pseudofermion field with kernel M⁻¹:
det M ∝ ∫ dφ* dφ e^{−φ* M⁻¹ φ}
Multipseudofermions
We are introducing extra noise into the system by using a single pseudofermion field to sample this functional integral
This noise manifests itself as fluctuations in the force exerted by the pseudofermions on the gauge fields
This increases the maximum fermion force
This triggers the integrator instability
This requires decreasing the integration step size
A better strategy is to use n pseudofermion fields, each with a milder kernel:
det M = (det M^{1/n})ⁿ ∝ ∏_{j=1}^{n} ∫ dφⱼ* dφⱼ e^{−φⱼ* M^{−1/n} φⱼ}
Hasenbusch’s method
Clever idea due to Hasenbusch:
Start with the Wilson fermion action M = 1 − κH
Introduce the quantity M′ = 1 − κ′H
Use the identity M = M′(M′⁻¹M)
Write the fermion determinant as det M = det M′ det(M′⁻¹M)
Introduce separate pseudofermions for each determinant
Adjust κ′ to minimise the cost
Easily generalises
More than two pseudofermions
Wilson-clover action
Violation of NFL Theorem
So let's try using our nth root trick to implement multipseudofermions
Condition number κ(r(M)) = κ(M)^{1/n}
So the maximum force is reduced by a factor of n κ(M)^{(1/n)−1}
This is a good approximation if the condition number is dominated by a few isolated tiny eigenvalues
This is so in the case of interest
Cost reduced by a factor of n κ(M)^{(1/n)−1}
Optimal value n_opt ≈ ln κ(M)
So the optimal cost reduction is (e ln κ)/κ
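For a feel of the sizes involved, a small numerical check (κ = 10⁴ is an arbitrary illustrative condition number):

```python
import numpy as np

kappa = 1.0e4
n = np.arange(1, 16)
factor = n * kappa**(1.0 / n - 1.0)      # multiplicative cost factor
print(n[np.argmin(factor)], np.log(kappa))          # optimum sits near ln κ
print(factor.min(), np.e * np.log(kappa) / kappa)   # ≈ (e ln κ)/κ
```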
Rational Hybrid Monte Carlo: I
RHMC algorithm for the fermionic kernel (M†M)^{1/2n}
Generate the pseudofermion from a Gaussian heatbath, P(ξ) ∝ e^{−½ξ†ξ}, via χ = (M†M)^{1/4n} ξ, so that
P(χ) ∝ ∫ dξ e^{−½ξ†ξ} δ(χ − (M†M)^{1/4n} ξ) ∝ e^{−½ χ† (M†M)^{−1/2n} χ}
Use an accurate rational approximation r(x) ≈ x^{1/4n} for the heatbath and the accept/reject step
Use a less accurate approximation r̃(x) ≈ x^{−1/2n} for the MD evolution
r̃ is a direct approximation to x^{−1/2n}, not the square of an approximation to x^{−1/4n}, so there are no double poles
Rational Hybrid Monte Carlo: II
Reminders:
Apply rational approximations using their partial fraction expansions
Denominators are all just shifts of the original fermion kernel
All poles of optimal rational approximations are real and positive for the cases of interest (Miracle #1)
Only simple poles appear (by construction!)
Use a multishift solver to invert all the partial fractions using a single Krylov space
Cost is dominated by the Krylov space construction, at least for O(20) shifts
Result is numerically stable, even in 32-bit precision
All partial fractions have positive coefficients (Miracle #2)
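A dense-matrix sketch of applying r(A) through its partial fractions (plain numpy solves stand in for the single multishift Krylov solve; the coefficients are the illustrative 1/√x ones quoted earlier, so the spectrum is arranged to lie in their range of validity):

```python
import numpy as np

def apply_rational(A, v, alpha0, alphas, betas):
    """r(A) v = alpha0 v + Σ_k alphas[k] (A + betas[k])⁻¹ v."""
    out = alpha0 * v
    I = np.eye(A.shape[0])
    for a, b in zip(alphas, betas):
        out += a * np.linalg.solve(A + b * I, v)   # one multishift CG in practice
    return out

rng = np.random.default_rng(2)
lam = rng.uniform(0.01, 1.0, size=12)              # eigenvalues in the valid range
Q, _ = np.linalg.qr(rng.normal(size=(12, 12)))
A = Q @ np.diag(lam) @ Q.T                         # SPD test matrix
v = rng.normal(size=12)

w = apply_rational(A, v, 0.3904603901,
                   [0.0511093775, 0.1408286237, 0.5964845033],
                   [0.0012779193, 0.0286165446, 0.4105999719])
exact = Q @ ((1.0 / np.sqrt(lam)) * (Q.T @ v))     # A^{-1/2} v by diagonalisation
print(np.linalg.norm(w - exact) / np.linalg.norm(exact))
```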
Comparison with R algorithm: I
Binder cumulant of the chiral condensate, B₄, and RHMC acceptance rate A from a finite temperature study (2+1 flavour naïve staggered fermions, Wilson gauge action, V = 8³×4, m_ud = 0.0076, …)

Algorithm   δt       B₄        A
R           0.0019   1.56(5)   -
R           0.0038   1.73(4)   -
RHMC        0.055    1.57(2)   84%
Comparison with R algorithm: II
Algorithm   m_ud   m_s    δt      A       P
R           0.04   0.04   0.01    -       0.60812(2)
R           0.02   0.04   0.01    -       0.60829(1)
R           0.02   0.04   0.005   -       0.60817
RHMC        0.04   0.04   0.02    65.5%   0.60779(1)

The different masses at which domain wall results were gathered, together with the step-sizes δt, acceptance rates A, and plaquettes P (V = 16³×32×8, DBW2 gauge action, β = 0.72)
[Figure: the step-size variation of the plaquette with m_ud = 0.02]
Comparison with R algorithm: III
[Figure: the integrated autocorrelation time of the 13th time-slice of the pion propagator from the domain wall test, with m_ud = 0.04]
Multipseudofermions with multiple timescales
Semiempirical observation: the largest force from a single pseudofermion does not come from the smallest shift
For example, look at the numerators (residues) in the partial fraction expansion we exhibited earlier:
1/√x ≈ 0.3904603901 + 0.0511093775/(x + 0.0012779193) + 0.1408286237/(x + 0.0286165446) + 0.5964845033/(x + 0.4105999719)
Make use of this by using a coarser timescale for the more expensive smaller shifts
[Chart: residue α, L² force α/(β + 0.125), and CG iterations versus shift ln β]
L² versus L∞ force norms
Wilson fermion forces (from Urbach et al.)
Berlin Wall for Wilson fermions
HMC results from C Urbach, K Jansen, A Schindler, and U Wenger, hep-lat/0506011 and hep-lat/0510064
Comparable performance to Lüscher's SAP algorithm
RHMC?
Conclusions (RHMC)
Advantages of RHMC
Exact
No step-size errors; no step-size extrapolations
Significantly cheaper than the R algorithm
Allows easy implementation of Hasenbusch (multipseudofermion) acceleration
Further improvements possible
Such as multiple timescales for different terms in the partial fraction expansion
Disadvantages of RHMC
QCD Machines: I
We want a cost-effective computer to solve
interesting scientific problems
In fact we wanted a computer to solve lattice QCD
But it turned out that there is almost nothing that was not applicable to many other problems too
Not necessary to solve all problems with one architecture
Demise of the general-purpose computer?
Development cost « hardware cost for one large machine
Simple OS and software model
QCD Machines: II
Take advantage of mass market components
It is not cost- or time-effective to compete with PC market in designing custom chips
Use standard software and tools whenever possible
Do not expect compilers or optimisers to do anything particularly smart
Parallelism has to be built into algorithms and programs from the start
Hand code critical kernels in assembler
QCD Machines: III
Parallel applications
Many real-world applications are intrinsically parallel
Because they are approximations to continuous systems
Lattice QCD is a good example
The lattice is a discretisation of four dimensional space-time
Lots of arithmetic on small complex matrices and vectors
Relatively tiny amount of I/O required
Amdahl’s law
Amount of parallel work may be increased by working on a larger volume
Strong Scaling
The amount of computation V^δ required to equilibrate a system of volume V increases faster than linearly
If we are to have any hope of equilibrating large systems the value of δ cannot be much larger than one
For lattice QCD we have algorithms with δ = 5/4
We are therefore driven to as small a number of sites per processor (σ) as possible
This corresponds to "thin nodes," as opposed to the "fat nodes" of PC clusters
Clusters are competitive in price/performance up to a
certain maximum problem size
Data Parallelism
All processors run the same code
Not necessarily SIMD, where they share a common clock
Synchronization on communication, or at explicit barriers
Types of data parallel operations
Pointwise arithmetic
Nearest neighbour shifts
Perhaps simultaneously in several directions
Global operations
Alternative Paradigms
Multithreading
Parallelism comes from running many separate
more-or-less independent threads
Recent architectures propose running many
light-weight threads on each processor to overcome
memory latency
What are the threads that need almost no memory?
Calculating zillions of digits of π? Cryptography?
Computational Grids
In the future carrying out large scale computations using the Grid will be as easy as plugging into an electric socket
Hardware Choices
Cost/MFlop
Goal is to be about 10 times more cost-effective than commercial machines
Otherwise it is not worth the effort and risk of building our own machine
Our current Teraflops machines cost about $1/MFlop
For a Petaflops machine we will need to reach about $1/GFlop
Power/cooling
Most cost effective to use low-power components and high-density packaging
Clock Speed
Peak/Sustained speed
The sustained performance should be about 20%-50% of peak
Otherwise there is either too much or too little floating point hardware
Real applications rarely have equal numbers of adds and multiplies
Clock speed
“Sweet point” is currently at 0.5—1 GHz chips
High-performance chips running at 3 GHz are both hot and expensive
Using moderate clock speed makes electrical design issues such as clock distribution much simpler
Memory Systems
Memory bandwidth
This is currently the main bottleneck in most
architectures
There are two obvious solutions
Data prefetching
Vector processing is one way of doing this
Requires more sophisticated software
Feasible for our class of applications because the control flow is essentially static (almost no data dependencies)
Hierarchical memory system (NUMA)
We make use of both approaches
Memory Systems
On-chip memory
For QCD the memory footprint is small enough that we can put all the required data into an on-chip embedded DRAM memory
Cache
Whether the on-chip memory is managed as a cache or as directly addressable memory is not too important
Cache flushing for communications DMA is a nuisance
Caches are built into most μ-processors, not worth designing our own!
Off-chip memory
The amount of off-chip memory is determined by cost
If the cost of the memory is » 50% of the total cost then buy more processing nodes instead
Communications Network: I
This is where a massively parallel computer
differs from a desktop or server machine
In the future the network will become the
principal bottleneck
For large data parallel applications
We will end up designing networks and decorating them with processors and memories which are almost free
Communications Network
Topology
Grid
This is easiest to build
How many dimensions?
QCDSP had 4, Blue Gene/L has 3, and QCDOC has 6
Extra dimensions allow easier partitioning of the machine
Hypercube/Fat Tree/Ω network/Butterfly network
These are all essentially an infinite dimensional grid
Good for FFTs
Switch
Global Operations
Grid wires
[Figure: a global sum performed stepwise across the grid wires]
Not very good error propagation
O(N^{1/4}) latency (grows as the perimeter of the grid)
O(N) hardware
Combining network
A tree which can perform arithmetic at each node
Useful for global reductions
But global operations can be performed using grid wires
It is all a matter of cost
Used in BGL, not in QCDx
Global Operations
Combining tree
[Figure: a binary tree summing the values 1…8 pairwise (3, 7, 11, 15 → 10, 26 → 36)]
Good error propagation
O(log N) latency
O(N log N) hardware
Bit-wise operations
Allow arbitrary precision
Used on QCDSP
Not used on QCDOC or BGL
Because data sent byte-wise (for ECC)
Global Operations
[Figure: bit-serial global operations on grid wires, e.g. selecting the maximum MSB-first (broadcasting 110100) and adding LSB-first with carry propagation]
Communications Network: II
Parameters
Bandwidth, latency, packet size
The ability to send small [O(10²) byte] packets between neighbours is vital if we are to run efficiently with a small number of sites per processor (σ)
Control (DMA)
The DMA engine needs to be sophisticated enough to interchange neighbouring faces without interrupting the CPU
Block-strided moves
Block-strided Moves
[Figure: a block-strided move gathering regularly spaced blocks from source memory into contiguous target memory]
For each direction separately we specify in memory-mapped registers:
1. Source starting address
2. Target starting address
3. Block size
4. Stride
Hardware Design Process
Optimise machine for applications of interest
We run simulations of lattice QCD to tune our design
Most design tradeoffs can be evaluated using a spreadsheet as the applications are mainly static (data-independent)
It also helps to debug the machine if you use the actual kernels you will run in production as test cases!
But running QCD even on a RTL simulator is painfully slow
Circuit design done using hardware
description languages (e.g., VHDL)
Time to completion is critical
Performance Model
[Chart: simulated sustained-performance fraction (0 to 1) for successive design variants (no cache, FIFO, DMA-RF, CBUF, CBUF+SBUF, SBUF+CBUF+), broken down into FPU, communications, memory, and bus contributions]
VHDL

```vhdl
entity lcount5 is
  port ( signal clk, res   : in    vlbit;
         signal reset, set : in    vlbit;
         signal count      : inout vlbit_1d(4 downto 0) );
end lcount5;

architecture bhv of lcount5 is
begin
  process
  begin
    wait until prising(clk) or res = '1';
    if res = '1' then
      count <= b"00000";
    elsif reset = '1' then
      count <= b"00000";
    elsif set = '1' then
      count(4) <= count(4) xor (count(0) and count(1) and count(2) and count(3));
      count(3) <= count(3) xor (count(0) and count(1) and count(2));
      count(2) <= count(2) xor (count(0) and count(1));
      count(1) <= count(1) xor count(0);
      count(0) <= not count(0);
    end if;
  end process;
end bhv;
```
QCDSP
QCDOC ASIC Design
Complete processor node for "QCD On a Chip", fabricated by IBM using the SA27E 0.18μ logic + embedded DRAM process
4 Mbytes of embedded DRAM
[Block diagram: IBM ASIC library components (440 core, FPU, PLB arbiter, DCR SDRAM controllers, MCMAL Ethernet DMA, EMACS, interrupt controller, boot/clock support, OFB-PLB bridge, HSSLs, PLL, EDRAM) together with custom-designed logic (SCU, PEC, 24-link DMA communications controller, IDC interface, serial number)]
100 Mbit/sec Fast Ethernet
24 off-node links with 12 Gbit/sec bandwidth
24-link DMA communications control
2.6 Gbyte/sec EDRAM/SDRAM DMA
1 (0.8) GFlops double precision RISC processor
8 Gbyte/sec memory-processor bandwidth
2.6 Gbyte/sec interface to external memory
PEC = Prefetching EDRAM Controller; EDRAM = Embedded DRAM; DRAM = Dynamic RAM
Edinburgh ACF
UKQCDOC is
located in the
ACF near
Edinburgh
Our insurers will
not let us show
photographs of
the building for
security reasons!
Vicious Design Cycle
1. Communications DMA controller required
Otherwise control by CPU needed
Requires interrupts at each I/O completion
CPU becomes the bottleneck
2. Introduce more sophisticated DMA instructions
Block-strided moves for QCDSP
Sequences of block-strided moves for QCDOC
Why not? It is cheap
3. Use CPU as DMA engine
Second CPU in Blue Gene/L
4. Add FPU
Why not? It is cheap, and you double your peak Flops
5. Use second FPU for computation
Otherwise sustained fraction of peak is embarrassingly small
Power Efficiencies
[Chart: GFlops/Watt versus year (1997-2005) for QCDSP, ASCI White, QCDOC, Blue Gene/L, NCSA Intel Xeon, ASCI Q, and the Earth Simulator]
What Next?
We need a Petaflop computer for QCD
New fermion formulations solve a lot of problems, but cost significantly more in computer time
Algorithms are improving
Slowly
By factors of about 2