Chapter 3 Integrated DOT/ECG/EEG Biomedical Multiprocessor for Portable Brain-Heart
3.3 Hardware Design and Implementation
3.3.2 Core Module Design
3.3.2.4 Independent Component Analysis (ICA) Engine
The independent component analysis (ICA) engine is mostly a straightforward implementation of the Infomax ICA algorithm outlined in Section 3.2.1. The hardware architecture of the ICA engine, shown in Figure 3-28, comprises 1) an input buffering and preprocessing unit (STAGE1) that calculates the covariance matrix and centers the input EEG data; 2) a whitening unit (WU) which calculates the whitening transformation matrix; 3) an ICA training unit (TU) which performs the unmixing weight training; and 4) an ICA computation unit (CU) that calculates the whitened unmixing weight matrix and outputs the final estimated independent source signals.
Figure 3-28 Architecture and sequenced operation of the ICA engine
The top operation of the ICA engine follows the described algorithm flow in Section 3.2.1 and is detailed as follows:
1. STAGE1 receives sparsely-timed EEG data and the input buffering unit (IBU) stores them into an interleaved SRAM array for later processing. (sequence in green)
2. Once a complete window of 64 by 4ch EEG samples has been received, the EEG samples are read out from the memory and the channel means are calculated. The mean values are then 1) immediately used to calculate the covariance matrix; and 2) saved in a register array for later use during the centering operation. The eigenvalue value decomposition (EVD) unit of the whitening unit then iteratively diagonalizes the
55
covariance matrix, and upon convergence outputs the final estimated eigenvalue D and eigenvector E matrices. The inverse square root of D calculated by the INV_SQRT unit is then multiplied together with E and its transpose, and the resulting matrix P is stored in a register array for use later on during Infomax ICA training and calculation.
(sequence in light blue)
3. All this time, the ICA training unit (TU) had been waiting for whitened data to train on, asserting the request signal Z_REQ. Once the whitening matrix P has been calculated, the whitening unit issues a request (X_ZM_REQ) and STAGE1 starts outputting centered data (channel means are subtracted from the raw data read from IBU’s SRAM). The whitening unit then transforms the centered data X4x1 to uncorrelated whitened data Z4x1 = P4x4 X4x1, and forwards the result to the ICA training unit. The ICA TU calculates an estimate of the unmixing weight matrix W based on a window of Z4x1, if the change in W is small enough, convergence is reached and training concludes; otherwise, the ICA TU will keep on requesting for the same window of Z4x64 data (through X_ZM_RESEND and X_ZM_REQ) up to 512 times or until W convergence is achieved, whichever comes first. The final value of W in this process serves as the best estimate of W and is stored for 1) training the next window; and 2) calculating ICA_OUT. (sequence in red; note how all of IBU, CTR, WU and TU are active concurrently)
4. In the beginning of the next window cycle when the P and W have not yet been calculated and overwritten, STAGE1 triggers ICA output calculation by issuing an OUTPUT_W command for ICA TU to send to ICA CU the previously calculated best estimate of the unmixing weight matrix W. Upon receiving W, ICA CU is triggered to calculate UW = WP which is stored in UW_REG upon completion (sequence in pink).
Finally, STAGE1 sends X and the ICA CU calculates the final ICA_OUT = WPX (sequence in orange).
56
The above discussion only focuses on the top operation of the ICA engine, omitting the details on the data windowing scheme, eigenvalue and eigenvector calculation in the whitening unit, and the iterative operation of the Infomax ICA training unit. These mechanisms actually form bulk of the complexity of the ICA engine; therefore, in the following subsections, more details will be provided regarding these topics, and special highlight will be shed on the hardware implementation issues associated with each of them.
3.3.2.4.1 Data windowing scheme
A basic but important parameter to consider in the implementation of real-time ICA is the input data length (number of multi-channel sampling events) on which one full run of the algorithm is operated upon. This length of data or “window size (WS)” is illustrated in Figure 3-29. The input data is broken down into data segments each of length N, denoted as x1, x2, …, xn, and data windows are specified by combining a number of consecutive data segments, denoted as w1, w2, …, wn. In order to achieve tractable performance and complexity in the proposed design, only fixed window sizes are considered (the window size is not allowed to vary during operation) while overlapped windows are introduced in order to improve ICA training and to facilitate pipelined operation.
A window size of 512 with 50% overlap (N=256, WS=2N, OV=50%) was initially chosen since it was able to achieve a good ICA performance of 0.9208 correlation coefficient using super-gaussian random pattern sets. However, because the tape out shuttle area budget was severely limited, this design was not feasible to be released due to its large area attributed to high memory requirements. Therefore with the intention of minimizing memory size at acceptable performance loss, a performance trade-off analysis of the ICA algorithm against various combinations of window size (WS) and overlap (OV) parameters was performed using MATLAB. Based on the results shown in Table 3-13, a window size of 64 with 50%
window overlap was chosen for the final design since it was able to achieve an acceptable 0.84 correlation using only a minimal 3.84Kb of memory.
57
Figure 3-29 Illustration of the data windowing concept
Table 3-13 ICA correlation performance vs. window size and window overlap Window Size
(WS)
Window Overlap
(OV) N Average
Correlation
Memory Size(Kb)
512 256 (50%) 256 0.9208 30.72
128
112 (87.5%) 16 0.8307 5.76
96 (75%) 32 0.8336 6.4
64 (50%) 64 0.8334 7.68
64 32 (50%) 32 0.8401 3.84
Figure 3-30 illustrates the implications of the chosen design parameters (N=32, WS=2N, OV=50%) on the pipelined operation schedule of the ICA engine. Three separate interleaved SRAMs each of size N are employed and allocated such that memory writes performed during storing of x2 (RAM2) do not conflict with the memory reading of x0 and x1 (RAM0 and RAM1) during training of window w0. Because of the 50% window overlap, a new unmixing weight matrix wn is available for every new xn. Therefore, under the assumption that the unmixing weight matrix does not change too much from window to window, the unmixing of xn (ICA computation) is performed using the unmixing weight matrix wn-1, which is based on the training of window wn-1 formed by xn and xn-1.
Figure 3-30 Data windowing scheme (50% overlap) and operation schedule
58
3.3.2.4.2 Eigenvalue and eigenvector calculation in the whitening unit
The primary function of the whitening unit is the determination of the whitening transformation matrix P expressed in Equation (3-4). Its calculation requires the matrix factorization of the covariance matrix of X into the form , where each of diagonal matrix D and orthogonal matrix E comprises the covariance matrix’s eigenvalues and eigenvectors respectively. The diagonalization of the real, symmetric, positive semi-definite covariance matrix Xcov into , known more formally in linear algebra as eigenvalue decomposition or EVD, is specifically performed by the EVD_PROCESSOR shown in Figure 3-28. It is a non-trivial mathematical problem whose solution takes up most of the computational burden handled by the whitening unit and is arguably the most interesting and algorithmically colorful and complex numerical method in the ICA engine. The details of the EVD computation, in terms of both algorithm and hardware architecture, are described here instead of the ICA algorithm discussion in 3.2.1, as the method’s rationale is more of a consequence of hardware implementation rather than ICA itself.
The Jacobi Eigenvalue Decomposition Algorithm - Introduction
Eigenvalue decomposition is an old mathematical problem and many methods for its solution have been proposed in the literature. Among these methods, the two most popular are the Jacobi eigenvalue algorithm [51] and the QR algorithm [52][53]. Although the QR method has been widely celebrated for its superior computational speed (presumably on general purpose computers), it has proved to be numerically less stable and less accurate than the Jacobi method [54][55][56]. On the other hand, the Jacobi algorithm possesses, in addition, desirable characteristics of simplicity, elegance, and regularity [57] making it extremely well-suited to efficient parallel VLSI implementations [58]. Indeed, while hardware implementation of the QR method has proven to be problematic due to the division, square root and inverse square root operations performed in the algorithm [59][60], rotation matrix multiplications in the parallel Jacobi method [61] can be efficiently handled by CORDIC
59
(COordinate Rotation DIgital Computer) arithmetic [62] such that even multiplications can be avoided altogether [63][64]. Due to its numerous attractive features, especially ones concerning hardware implementation, the Jacobi method was chosen for solving the EVD problem in the developed ICA engine. In the following subsection the Jacobi eigenvalue and the CORDIC algorithms are described in more detail.
The Jacobi Eigenvalue Decomposition Algorithm – Basic Idea
The basic idea behind Jacobi’s method is to systematically reduce to zero the quantity
off = kIv!
6 vvI 6
I (3-17)
i.e., the “norm” of the off-diagonal elements of the input symmetric n-by-n matrix A. The input matrix A is iteratively updated according to
Y = Y Y Y, (3-18)
where JJJJ is orthogonal, such that the new A is more diagonal than the previous one in the sense of (3-17). When off(Ak+1) goes below a predefined threshold, say after K iterations, the algorithm is said to have converged; and the final
= ! … ! !… ! (3-19) is a diagonal matrix D containing the eigenvalues of A0. Furthermore, recalling that JJJJ is orthogonal, then JJJJT = JJJJ-1 and J J J J JJJJT = JJJJT JJJJ = IIII, and manipulating both sides of (3-19) gives
… … =… … … … (3-20)
= (3-21)
where matrix E = JE = JE = JE = J0 JJJJ1…JJJJK-1 is orthogonal and whose rows (or columns) are the eigenvectors of A0, thus solving the original eigenvalue decomposition problem.
By now it should be apparent that the key to Jacobi’s method lies in the proper design of matrix JJJJ to produce the effects described above. It follows then that the specifications of JJJJ
60
must at least include the following: 1) it should be orthogonal; 2) for manageability, it should be able to selectively “annihilate” specific off-diagonal elements only; 3) it should be able to cause a net effect of off(Ak+1) < off(Ak) after each iteration; and 4) it should have the ability to converge.
To satisfy these requirements, Jacobi defined JJJJ as
, , θ =
called the Jacobi rotation matrix, which is simply an identity matrix whose elements at (p,p), (q,q), (p,q) and (q,p) are replaced with sine and cosine trigonometric functions as shown above. By using trigonometric identities, it can be easily verified that JJJJ is orthogonal, satisfying the first requirement. To illustrate the second requirement, Figure 3-31 shows an expansion of (3-18) using the Jacobi rotation matrix defined in (3-22). The left (right) rotations in blue (red) simply “remixes” or “rotates” pairs of column (row) elements (kI|, kI|) to (kI|, kI|) ((kI|, kI|) to (kI|, kI|)). It becomes obvious then that the only opportunity here is to “rotate” the energy of the off-diagonal elements apq and aqp into the diagonal elements app and aqq, since the remaining off-diagonal elements (red only or blue only) have no paired diagonal elements to transfer their energy to. It becomes clear then that the rotation angle θ must be chosen based on app, aqq, apq and aqp in order to achieve the desired effect.
Figure 3-31 Illustration of matrix AAAA update using two-sided Jacobi plane rotations
61
Evaluating the right hand side of Figure 3-31 for k|and k| and noting that k| = k| since Ak is symmetric, gives
k| = k| = k| sinθcosθ + k| cosθcosθ − k| sinθsinθ − k| sinθcosθ
= k| − k| sinθcosθ + k| cos!θ − k| sin!θ
= k| − k| sinθcosθ + k| cos!θ − sin!θ
= k| − k| tanθ + k| 1 − tan!θ (3-23) Since the off-diagonal elements in the updated matrix must be 0, setting k|= k| = 0 gives
2tanθ
1 − tan!θ = 2k|
k| − k|
tan2θ = 2k|
k| − k|
θ =1
2 tan 2k|
k| − k| ¡ (3-24)
Using the rotation angle θ “annihilates” k| and k| to zero, and their energy is transferred to the diagonal elements. As for the rest of the rotation pairs (in blue only or red only), it can be shown that each pair’s total norm remains constant, i.e., kIv|!+ kil|! = kIv|!+
kil| !, such that on the global perspective, off(Ak+1)2 – off(Ak)2 = 0! − 2k| ! or
offY! = offY!− k| ! (3-25) satisfying the third requirement.
The remaining question now is how to choose the indices (p,q) in each iteration. From the standpoint of maximizing the reduction of off(Ak) in (3-25), it makes sense to choose (p,q) so that k| ! is maximal. Indeed, this is the strategy originally proposed by Jacobi in his work [51], where a proof of convergence is also given. However, the drawback of this method is that sweeping the whole n-by-n matrix just to search for the largest off-diagonal element at every iteration is a costly operation. This problem is overcome by fixing the sequence of (p,q) locations to be updated in advance. Common schemes are the by-row and the
cyclic-62
by-column procedures where the pair (p,q) is chosen in a row-by-row, or column-by-column fashion respectively [65][66]. Any such sequence, called a sweep, must cover all possible pairs of (p,q) and requires N = n(n-1)/2 iterations. For example, the row-cyclic strategy uses the following sequence of update locations (p,q):
(1,2) → (1,3) → (1,4) → … → (1,n-1) → (1,n) → (2,3) → (2,4) → (2,5) → … → (2,n) →
…
(n-2,n-1) → (n-2,n) → (n-1,n)
By foregoing the off-diagonal searching procedure, the cyclic Jacobi executes considerably faster than Jacobi’s original method.
While not immediately apparent in Jacobi’s original method due to the conditional off-diagonal search, the more regular cyclic version is able to expose more readily the inherent parallelism present in Jacobi’s method. From Figure 3-31, it can be seen that each right rotation (matrix multiplication in red) involves only columns p and q. Thus, any other right rotation (p’,q’) where p’≠p and q’≠q is a totally separate problem. For example, given n = 8, a possible “non-conflicting” set of rotation pairs could be {(1,2),(3,4),(5,6),(7,8)} such that all four can be performed in parallel. After all the right rotations are done, the left rotations can all be performed in parallel as well.
The final question would then be how to group together the remaining (p,q) pairs in the sweep while also maintaining the maximum n/2- way parallelism as shown above. A practical approach to the problem is to visualize a chess tournament with n players in which everybody must play everybody else exactly once; and, to minimize the total duration of the tournament (like a sweep), matches are conducted in parallel as much as possible. A simple method for generating rotation sets covering a full sweep is presented in [61] and a mnemonic illustration of this process is given in Figure 3-32.
63
Figure 3-32 Illustration of parallel Jacobi rotation set sequence The CORDIC (COordinate Rotation DIgital Computer) Algorithm
Based on the discussion above, hardware implementation of the Jacobi EVD algorithm presents some challenges, particularly in the following sub operations: 1) squaring and square root functions in the calculation of off(A) during convergence check, as in (3-17); 2) division and arctangent operations for calculating θ in (3-24); and 3) numerous matrix rotation multiplications during iterative updates, involving sine and cosine elements as shown in Figure 3-31. Straightforward evaluation of these functions can prove costly in hardware;
therefore alternative approaches are investigated in order to achieve efficient implementation.
The square and square root operations in 1) are avoided by simplifying the condition for convergence to max(|aij|,i≠j) < threshold. On the other hand, explicit division, arctangent and matrix rotation operations associated with 2) and 3) are avoided altogether through the use of CORDIC arithmetic, a powerful algorithm capable of evaluating these functions using only simple add, shift and table look-up operations.
CORDIC stands for “COordinate Rotation DIgital Computer” and is a simple yet very powerful computational algorithm invented by Jack E. Volder in 1959 [62]. Using only shift-add and look-up operations, it has the ability to evaluate a wide variety of mathematical computational tasks including trigonometric, hyperbolic, exponential and logarithmic
functions, real and complex multiplication, division, square making it indispensable in
eigenvalue estimation, singular value decomposition, QR factorization and so on CORDIC has been utilized for applications in diverse areas such as signal processing, communication systems, robotics and 3D graphics apar
and technical computation.
The CORDIC algorithm is a vector in the Cartesian plane is rotated
a. The relationship between constrained by θ,
Instead of finding the solution of breaks down the
moves the system closer
decomposed into component rotations 1 specifying the direction of
Table 3-14. Any angle d1,d2,…,dn-1
functions, real and complex multiplication, division, square t indispensable in solving
eigenvalue estimation, singular value decomposition, QR factorization and so on CORDIC has been utilized for applications in diverse areas such as signal processing, communication systems, robotics and 3D graphics apar
and technical computation.
The CORDIC algorithm is
in the Cartesian plane is rotated relationship between
θ, can be derived
(a) Figure
Instead of finding the solution of breaks down the problem into a series of small moves the system closer to the co
decomposed into component rotations ying the direction of
. Any angle θ can
1} where n is the total number of iterations.
functions, real and complex multiplication, division, square
solving higher level mathematical problems s
eigenvalue estimation, singular value decomposition, QR factorization and so on CORDIC has been utilized for applications in diverse areas such as signal processing, communication systems, robotics and 3D graphics apar
The CORDIC algorithm is based on the in the Cartesian plane is rotated
relationship between the
can be derived using elementary trigonometry and is
Figure 3-33 Rotation of vector in a Cartesian plane {¢= {
£¢= { Instead of finding the solution of
problem into a series of small to the correct decomposed into component rotations
ying the direction of rotation. The
can then be uniquely represented is the total number of iterations.
64
functions, real and complex multiplication, division, square
higher level mathematical problems s
eigenvalue estimation, singular value decomposition, QR factorization and so on CORDIC has been utilized for applications in diverse areas such as signal processing, communication systems, robotics and 3D graphics apar
based on the 2D geometric in the Cartesian plane is rotated counterclockwise
original vector (
using elementary trigonometry and is
Rotation of vector in a Cartesian plane { cosθ − £
{ sinθ + £ Instead of finding the solution of (3-26)
problem into a series of smaller rotation rrect solution as in decomposed into component rotations ¤I = tan
rotation. These fixed
uniquely represented is the total number of iterations.
functions, real and complex multiplication, division, square
higher level mathematical problems s
eigenvalue estimation, singular value decomposition, QR factorization and so on CORDIC has been utilized for applications in diverse areas such as signal processing, communication systems, robotics and 3D graphics apar
2D geometric
counterclockwise by an angle vector (x,y) and
using elementary trigonometry and is
Rotation of vector in a Cartesian plane sinθ is the total number of iterations.
functions, real and complex multiplication, division, square-root, and many others higher level mathematical problems s
eigenvalue estimation, singular value decomposition, QR factorization and so on CORDIC has been utilized for applications in diverse areas such as signal
processing, communication systems, robotics and 3D graphics apart from general scientific
2D geometric rotation transformation by an angle
) and the rotated using elementary trigonometry and is shown in
(b) Rotation of vector in a Cartesian plane
in one evaluation, the such that each Figure 3-33b. T
such that θ = ∑ component rotation angles
as a sequence of rotation directions root, and many others higher level mathematical problems such as linear systems, eigenvalue estimation, singular value decomposition, QR factorization and so on CORDIC has been utilized for applications in diverse areas such as signal Rotation of vector in a Cartesian plane
in one evaluation, the CORDIC algorithm each incremental rotation
The rotation angle
∑l I ~I¤I,
rotation angles ¤I are shown in a sequence of rotation directions root, and many others, thus linear systems, eigenvalue estimation, singular value decomposition, QR factorization and so on. The and image a sequence of rotation directions thus a sequence of rotation directions
65
The advantage of this scheme becomes apparent when the cosine terms in (3-26) are factored out and the equations rewritten in incremental form:
{I= cos ~I¤I {I − £I tan ~I¤I
£I= cos~I¤I £I+ {I tan~I¤I (3-27) and substituting tan ¤I = 2 I,
{I= cos ¤I {I − ~I£I2 I
£I= cos¤I £I + ~I{I2 I (3-28) since αi < π/2 and cos() and tan() are even and odd functions respectively.
However, based on (3-28), cos ¤I can be factored out from the right product term {I − ~I£I2 I, and then cos¤I ! and so on. Thus, instead of multiplying by cos(αi) in every iteration, an accumulated scaling product ¦ = ∏ cos¤lI I is defined, and applied only once at the final output. This way, many multiplication operations can be saved and the update iterations simplified to
{I= {I− ~I£I2 I
£I= £I + ~I{I2 I (3-29)
which only requires elementary shift and add operations. Table 3-14 shows the value of the
which only requires elementary shift and add operations. Table 3-14 shows the value of the