
Efficient schemes with diverse of a pair of circulant matrices for AES MixColumns-InvMixcolumns transformation

Jeng-Jung Wang¹, Yan-Haw Chen²*, Guan-Hsiung Liaw³, Jack Chang⁴, Cheng-Chih Lee⁵

¹,²,³,⁵Dept. of Information Engineering, I-Shou University, Kaohsiung, Taiwan 84008.

⁴Intellectual Property Group, Davis, Wright, & Tremaine, Seattle, Washington, USA

¹,²,³,⁵[email protected], ⁴[email protected]

Abstract

AES is a commonly used encryption-decryption algorithm in wireless communication protocols. Both the confidentiality and the speed of the Cipher-InvCipher are important issues in many current communication systems. The key idea of this paper is a method that allows more variation in the circulant matrix, in order to enhance security in the AES MixColumns-InvMixColumns step. The paper also proposes a method that, in theory, minimizes the number of multiplications in the matrix multiplication, based on the two-point cyclic convolution properties of the circulant matrix. The conventional 4×4 matrix multiplication typically needs 16 multiplications and 12 additions; the proposed method, described herein as Scheme 3, reduces this to 5 multiplications and 15 additions and is used for both encryption and decryption. Using Scheme 3 with Horner's rule-based multiplication on an Intel CPU, the computational cost of the matrix multiplication is reduced by ~63%. Furthermore, experiments using Scheme 3 with Horner's rule-based multiplication and AES key lengths of 128, 192, and 256 bits on an STM32L476VG MCU show that both the encryption time and the decryption time are reduced by ~60%. Finally, the proposed procedure finds many pairs of circulant matrices for the AES Cipher-InvCipher, so that a diversity of circulant-matrix pairs can enhance the security of data transmission.

Keywords: AES; Circulant; Lookup Table; Finite Field; Multiplication

*Corresponding author. Email: [email protected], Fax: (886-7)-657-8944.


1. Introduction

New features are being introduced, and protecting data transmission is now more important than ever. Thus, improvements that let the Advanced Encryption Standard (AES) be applied efficiently to communication systems and to cloud computing in healthcare systems [18] are important. The MixColumns-InvMixColumns transformation [13] is one of the functions in the Cipher-InvCipher. In AES, the MixColumns transformation is a computationally expensive operation in which the input matrix is multiplied by an MDS matrix. This transformation plays an important role in the wide trail strategy of the cipher. Earlier, MDS matrices were also used in error-correcting codes: Lacan [14] and MacWilliams [9] performed cyclic convolution of complex values with a hybrid transformation over finite fields. Several new research directions on search methods for finding MDS matrices are suggested in [7][8][16][17]. Moreover, [10] shows a method that can generate a random MDS matrix, and those techniques can be enhanced by dynamic MDS matrices. Diverse circulant matrices are used in modern cryptographic methods such as AES. MDS matrix computation is used for encryption and decryption in, for example, the Rijndael and Twofish methods [5]. However, these articles do not describe a method for obtaining the inverse MDS matrices.

Furthermore, attacks [1] on AES-128 that use a known-key distinguishing attack with a computation complexity 2 method create opportunities to enhance the security of data transmission. We propose using different coefficients for the polynomial A(x) and its inverse polynomial A^-1(x). They are used in the AES MixColumns-InvMixColumns step, with some bits of the AES key serving as an index that selects the variation of the polynomial coefficients. The coefficients are then more difficult for attackers to locate, which makes the scheme less prone to attacks in general. This paper also proposes an efficient method to find pairs consisting of the polynomial A(x) and A^-1(x) by means of the Find_inv_matrix() procedure. Scheme 3, as described in this paper, may be implemented as a VLSI circuit, see [2][4][6][15][11][20], where it can decrease the number of logic gates. The matrix product can be combined with different methods of multiplication in a finite field, see [3][12]. The method can also secure data transfer to a health monitoring system on ARM-based microcontrollers [18].

The remainder of this paper is organized as follows. Section 2 introduces enhanced security in the AES MixColumns step. Section 3 discusses the finite field multiplication concepts necessary for further developments, and also proposes methods that reduce the number of multiplications in the matrix products for AES encryption-decryption; these methods are called Scheme 1, Scheme 2, and Scheme 3, respectively. Section 4 proposes an efficient method to find the row vectors of the inverse matrix of A for use in the AES MixColumns-InvMixColumns step. Section 5 presents a performance analysis of the AES Cipher-InvCipher on an Intel CPU and an STM32L476VG ARM-based MCU. Section 6 concludes the paper.

2. Enhanced security in AES MixColumns step

This paper does not focus on the fixed polynomial a(x) of the AES MixColumns transformation. Instead, we aim to enhance the security of the AES algorithm by varying the coefficients of the MixColumns polynomial. Even when both the plaintext and the ciphertext are given, determining the key requires an exhaustive search. In addition, encrypting and decrypting data requires knowledge of Table A and Table B, as shown in Figure 1. In other words, the key cannot be determined from the plaintext and the ciphertext, because these are obtained from the AES standard MixColumns (02, 03, 01, 01) and InvMixColumns (0e, 0b, 0d, 09) transformations.

Furthermore, the coefficients of the polynomial a(x) might be sent to the receiver using the ECDH algorithm of elliptic curve cryptography. The receiver, having obtained the polynomial a(x), must compute the inverse of a(x) for decryption. In that case, Table A and Table B are not needed.

Figure 1: Some bits of a key as index of coefficients


3. Fast matrix multiplication in AES Mixcolumns step

A new method for computing the circulant matrix product is described here, based on the 2-point cyclic convolution matrix. This section consists of three subsections. The first subsection describes different methods of multiplication over a finite field, which can also be applied to the matrix operation. The second presents Scheme 1, which uses a two-point cyclic matrix to reduce the number of multiplications in the matrix product. The third presents Scheme 2, which uses the fact that 2 multiplied by any element of GF(2^m) is zero (that is, x + x = 0) to save a multiplication. In Scheme 2 the coefficients of the polynomial A(x) satisfy (a₀+a₁+a₂+a₃) = r₂, where a_j is in GF(2^m), so a lookup table can be used to remove 4 multiplications. Lastly, Scheme 3 uses the property that the coefficients of A(x) sum to (a₀+a₁+a₂+a₃) = 1, which removes those 4 multiplications without a table.

3.1 Multiplication over finite field

Let $a(x) = \sum_{i=0}^{m-1} a_i x^i$ and $b(x) = \sum_{i=0}^{m-1} b_i x^i$ be polynomials of degree m-1 in GF(2^m), where $a_i, b_i \in \{0, 1\}$. It is well known that finite field addition is defined as:

$$c(x) = a(x) + b(x), \tag{1}$$

Note that the symbol "+" is the bitwise XOR operation, so it does not need an extra function in C. Finite field multiplication is defined as:

$$c(x) = a(x)\,b(x) \bmod f(x), \tag{2}$$

where, for the AES algorithm, the irreducible polynomial is $f(x) = x^8 + x^4 + x^3 + x + 1$. For (2), the Russian Peasant method can be written as a C function as follows:


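Russian Peasant method

unsigned char GFM(unsigned char a, unsigned char b){
    unsigned char c = 0;              /* running product */
    for (int i = 0; i < 8; i++){
        if (b & 1) c ^= a;            /* add a when the current bit of b is 1 */
        if (a & 0x80)
            a = (a << 1) ^ 0x11b;     /* a*x, reduced modulo f(x) */
        else
            a <<= 1;                  /* a*x, no reduction needed */
        b >>= 1;
    }
    return c;
}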
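As a quick sanity check of the shift-and-add routine, the following sketch restates it under the hypothetical name `gf_mul_peasant` (not a name used in this paper) and compares it with the worked products {57}·{83} = {c1} and {57}·{13} = {fe} from FIPS 197. It assumes the AES polynomial f(x) = x⁸+x⁴+x³+x+1; the helper `check_gf_mul_peasant` is likewise only illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-add ("Russian Peasant") multiplication in GF(2^8) modulo
   f(x) = x^8 + x^4 + x^3 + x + 1. The function name is illustrative. */
uint8_t gf_mul_peasant(uint8_t a, uint8_t b) {
    uint8_t c = 0;
    for (int i = 0; i < 8; i++) {
        if (b & 1) c ^= a;           /* add a when the low bit of b is set */
        uint8_t carry = a & 0x80;    /* will a*x overflow into x^8? */
        a = (uint8_t)(a << 1);       /* multiply a by x */
        if (carry) a ^= 0x1b;        /* reduce: x^8 = x^4 + x^3 + x + 1 */
        b >>= 1;
    }
    return c;
}

/* Compare against the worked examples in FIPS 197. */
void check_gf_mul_peasant(void) {
    assert(gf_mul_peasant(0x57, 0x83) == 0xc1);
    assert(gf_mul_peasant(0x57, 0x13) == 0xfe);
    assert(gf_mul_peasant(0x01, 0xab) == 0xab);   /* 1 is the identity */
}
```

Calling `check_gf_mul_peasant()` once at start-up is a cheap way to catch a wrong reduction constant before the routine is used inside MixColumns.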

For (2), the proposed multiplication can be evaluated using Horner's rule, according to the following recursive formula:

$$\begin{aligned} c(x) = \Big(\big(\,(a_7 B x \bmod f(x) + a_6 B)\,x^2 \bmod f(x) &+ a_5 B x \bmod f(x) + a_4 B\,\big)\,x^2 \bmod f(x)\\ &+ a_3 B x \bmod f(x) + a_2 B\Big)\,x^2 \bmod f(x) + a_1 B x \bmod f(x) + a_0 B, \end{aligned}$$

where B denotes the polynomial b(x). Thus, an expression $(a_i B x \bmod f(x) + a_j B)$ can be stored in a lookup table as follows:

$$Bt[a_i, a_j] = (a_i B x \bmod f(x) + a_j B), \quad a_i, a_j \in GF(2).$$

Let c be $Bt[a_i, a_j]$. Then $c = c\,x^2 \bmod f(x)$ can be represented as $c = c\,x^2 + f[c_{m-1}, c_{m-2}]$ (with $c\,x^2$ truncated to m bits), where $f[c_i, c_j] = c_i\,re(x)\,x + c_j\,re(x)$ and $re(x) = x^m \bmod f(x)$ is the remainder polynomial (e.g., $re(x) = x^4 + x^3 + x + 1$, binary 11011, hex 0x1b). Horner's rule is written in C as shown below:

Horner's rule

unsigned char f[4];
unsigned char Bt[4];

unsigned char GFM(unsigned char a, unsigned char b){
    unsigned char c;
    /* f[k]: reduction term for the two bits k shifted out of c by c*x^2 */
    f[0] = 0; f[1] = 0x1b; f[2] = 0x36; f[3] = 0x2d;
    /* Bt[k]: product of b with the two-bit value k (0, 1, x, x+1) */
    Bt[0] = 0;
    Bt[1] = b;
    if (b & 0x80)
        Bt[2] = (b << 1) ^ 0x1b;
    else
        Bt[2] = (b << 1);
    Bt[3] = Bt[2] ^ b;
    /* consume a two bits at a time, most significant pair first */
    c = Bt[(a >> 6) & 0x3];
    c = (c << 2) ^ f[c >> 6] ^ Bt[(a >> 4) & 0x3];
    c = (c << 2) ^ f[c >> 6] ^ Bt[(a >> 2) & 0x3];
    c = (c << 2) ^ f[c >> 6] ^ Bt[a & 0x3];
    return c;
}
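One way to gain confidence in the two-bit Horner recurrence is to compare it exhaustively with a bit-serial reference. The sketch below does this for all 65536 operand pairs; the names `gf_mul_ref`, `gf_mul_horner2`, and `check_horner2` are hypothetical, and the AES polynomial is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: bit-serial multiplication modulo f(x) = x^8 + x^4 + x^3 + x + 1. */
uint8_t gf_mul_ref(uint8_t a, uint8_t b) {
    uint8_t c = 0;
    while (b) {
        if (b & 1) c ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return c;
}

/* Horner's rule, consuming two bits of a per step (illustrative name). */
uint8_t gf_mul_horner2(uint8_t a, uint8_t b) {
    /* f[k]: value of (k * x^8) mod f(x) for the two bits k shifted out */
    static const uint8_t f[4] = {0x00, 0x1b, 0x36, 0x2d};
    uint8_t bt[4];
    bt[0] = 0;
    bt[1] = b;
    bt[2] = (uint8_t)((b << 1) ^ ((b & 0x80) ? 0x1b : 0x00));  /* b*x mod f */
    bt[3] = (uint8_t)(bt[2] ^ b);                              /* b*(x+1)  */

    uint8_t c = bt[(a >> 6) & 3];
    for (int shift = 4; shift >= 0; shift -= 2)
        c = (uint8_t)((c << 2) ^ f[c >> 6] ^ bt[(a >> shift) & 3]);
    return c;
}

/* Exhaustive comparison of the two routines. */
void check_horner2(void) {
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            assert(gf_mul_horner2((uint8_t)a, (uint8_t)b) ==
                   gf_mul_ref((uint8_t)a, (uint8_t)b));
}
```

The table `f` is declared `static const` here so that it is not rebuilt on every call; that is a design choice of this sketch, not something the paper prescribes.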

As mentioned above, either of the two multiplication methods can be used to build a 2D array GFMT[][] for the lookup table method (i.e., GFMT[i][j] = GFM(i, j), where 0 ≤ i, j ≤ 255). The array GFMT[][] needs 256*256 = 64K bytes of storage. The lookup table method is shown below:

Lookup table method

unsigned char GFM(unsigned char a, unsigned char b) {
    unsigned char c = 0;
    c = GFMT[a][b];     /* precomputed product of a and b */
    return c;
}
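The lookup routine presumes that GFMT[][] has already been filled. A minimal sketch of that initialization follows, assuming the table is built once at start-up from a bit-serial multiplier; the names `gf_mul_bitwise`, `gfmt_init`, `gf_mul_table`, and `check_gfmt` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

static uint8_t GFMT[256][256];   /* 64K bytes: GFMT[i][j] = i*j in GF(2^8) */

/* Bit-serial multiplication modulo f(x) = x^8 + x^4 + x^3 + x + 1. */
uint8_t gf_mul_bitwise(uint8_t a, uint8_t b) {
    uint8_t c = 0;
    while (b) {
        if (b & 1) c ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return c;
}

/* Fill the full product table once, before any lookup is made. */
void gfmt_init(void) {
    for (int i = 0; i < 256; i++)
        for (int j = 0; j < 256; j++)
            GFMT[i][j] = gf_mul_bitwise((uint8_t)i, (uint8_t)j);
}

/* Table-based multiplication: one memory access per product. */
uint8_t gf_mul_table(uint8_t a, uint8_t b) {
    return GFMT[a][b];
}

/* The table must be symmetric and agree with the FIPS 197 example. */
void check_gfmt(void) {
    gfmt_init();
    assert(gf_mul_table(0x57, 0x83) == 0xc1);
    for (int i = 0; i < 256; i++)
        for (int j = 0; j < 256; j++)
            assert(GFMT[i][j] == GFMT[j][i]);
}
```

On a small MCU the 64K-byte table may not fit in RAM, which is one reason the computed methods above remain relevant.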

3.2 Reducing multiplications in matrix multiplication

The AES MixColumns transformation, the modular product of A(x) and B(x), is the four-term polynomial D(x), defined as

$$D(x) = A(x)\,B(x) \bmod T(x), \tag{3}$$

where $T(x) = x^4 + 1$, $A(x) = a_3x^3 + a_2x^2 + a_1x + a_0$, and $B(x) = b_3x^3 + b_2x^2 + b_1x + b_0$, for $a_i, b_i \in GF(2^m)$. By (3), there is a circulant matrix form:

$$\begin{bmatrix} d_0\\ d_1\\ d_2\\ d_3 \end{bmatrix} = \begin{bmatrix} a_0 & a_3 & a_2 & a_1\\ a_1 & a_0 & a_3 & a_2\\ a_2 & a_1 & a_0 & a_3\\ a_3 & a_2 & a_1 & a_0 \end{bmatrix} \begin{bmatrix} b_0\\ b_1\\ b_2\\ b_3 \end{bmatrix}. \tag{4}$$

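In (4), the matrix D is the product of the matrices A and B, which requires 16 multiplications and 12 additions (16M, 12A), listed below:

(16M, 12A)

$$\begin{aligned} d_0 &= a_0b_0 + a_3b_1 + a_2b_2 + a_1b_3\\ d_1 &= a_1b_0 + a_0b_1 + a_3b_2 + a_2b_3\\ d_2 &= a_2b_0 + a_1b_1 + a_0b_2 + a_3b_3\\ d_3 &= a_3b_0 + a_2b_1 + a_1b_2 + a_0b_3 \end{aligned}$$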
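The sixteen products of the (16M, 12A) form can be written as a double loop over the circulant index (i − j) mod 4. The sketch below is one such baseline, useful as a reference when checking the faster schemes; the names `gf_mul`, `mixcol_direct`, and `check_mixcol_direct` are hypothetical. The coefficient array is ordered (a₀, a₁, a₂, a₃), so the AES polynomial is stored as {02, 01, 01, 03}, and the check uses a widely circulated MixColumns test column.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-serial multiplication modulo f(x) = x^8 + x^4 + x^3 + x + 1. */
uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t c = 0;
    while (b) {
        if (b & 1) c ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return c;
}

/* Direct circulant product: d_i = sum over j of a_{(i-j) mod 4} * b_j.
   Uses 16 multiplications and 12 effective additions (XORs). */
void mixcol_direct(const uint8_t a[4], const uint8_t b[4], uint8_t d[4]) {
    for (int i = 0; i < 4; i++) {
        d[i] = 0;
        for (int j = 0; j < 4; j++)
            d[i] ^= gf_mul(a[(i - j + 4) & 3], b[j]);
    }
}

/* AES coefficients (a0..a3) applied to the column db 13 53 45. */
void check_mixcol_direct(void) {
    const uint8_t a[4] = {0x02, 0x01, 0x01, 0x03};
    const uint8_t b[4] = {0xdb, 0x13, 0x53, 0x45};
    uint8_t d[4];
    mixcol_direct(a, b, d);
    assert(d[0] == 0x8e && d[1] == 0x4d && d[2] == 0xa1 && d[3] == 0xbc);
}
```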


















Using the two-point cyclic convolution matrix property, the multiplication for 2×2 matrices is given by:

$$\begin{bmatrix} y_0\\ y_1 \end{bmatrix} = \begin{bmatrix} a_0 & a_1\\ a_1 & a_0 \end{bmatrix} \begin{bmatrix} b_0\\ b_1 \end{bmatrix} = \begin{bmatrix} a_0(b_0 + b_1) + (a_0 + a_1)\,b_1\\ a_0(b_0 + b_1) + (a_0 + a_1)\,b_0 \end{bmatrix}. \tag{5}$$

Hence, the method requires only 3 multiplications and 4 additions (3M, 3A once s₁ is precomputed), as shown in Table 1.

Table 1: The two-point cyclic convolution method with (3M, 3A).

$$\begin{array}{ll} s_0 = a_0(b_0 + b_1) & s_1 = a_0 + a_1\\ y_0 = s_0 + s_1 b_1 & y_1 = s_0 + s_1 b_0 \end{array}$$
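The entries of Table 1 can be checked numerically against the plain four-multiplication product. The following sketch does so over a sampled grid of inputs; `cyc2_direct`, `cyc2_fast`, and `check_cyc2` are hypothetical names, and s₁ is passed in as an argument to reflect that it depends only on the fixed coefficients.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-serial multiplication modulo f(x) = x^8 + x^4 + x^3 + x + 1. */
uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t c = 0;
    while (b) {
        if (b & 1) c ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return c;
}

/* Plain product of [[a0,a1],[a1,a0]] with (b0,b1): 4 multiplications. */
void cyc2_direct(uint8_t a0, uint8_t a1, uint8_t b0, uint8_t b1, uint8_t y[2]) {
    y[0] = (uint8_t)(gf_mul(a0, b0) ^ gf_mul(a1, b1));
    y[1] = (uint8_t)(gf_mul(a1, b0) ^ gf_mul(a0, b1));
}

/* Table 1: 3 multiplications and 3 additions, with s1 = a0 + a1 supplied. */
void cyc2_fast(uint8_t a0, uint8_t s1, uint8_t b0, uint8_t b1, uint8_t y[2]) {
    uint8_t s0 = gf_mul(a0, (uint8_t)(b0 ^ b1));
    y[0] = (uint8_t)(s0 ^ gf_mul(s1, b1));
    y[1] = (uint8_t)(s0 ^ gf_mul(s1, b0));
}

/* Compare both forms on a sampled grid of coefficients and inputs. */
void check_cyc2(void) {
    for (int a0 = 0; a0 < 256; a0 += 51)
        for (int a1 = 0; a1 < 256; a1 += 51)
            for (int b0 = 0; b0 < 256; b0 += 15)
                for (int b1 = 0; b1 < 256; b1 += 15) {
                    uint8_t y[2], z[2];
                    uint8_t s1 = (uint8_t)(a0 ^ a1);  /* precomputed per matrix */
                    cyc2_direct((uint8_t)a0, (uint8_t)a1,
                                (uint8_t)b0, (uint8_t)b1, y);
                    cyc2_fast((uint8_t)a0, s1, (uint8_t)b0, (uint8_t)b1, z);
                    assert(y[0] == z[0] && y[1] == z[1]);
                }
}
```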

In Table 1, the two entries a₀ and a₁ are fixed data, so the item s₁ = a₀ + a₁ can be precomputed in the program. Thus, the 2-point cyclic matrix method uses only 3 multiplications and 3 additions. If the matrix $A = \begin{bmatrix} a_0 & a_3\\ a_1 & a_0 \end{bmatrix}$ is not a 2-point cyclic matrix, the product of the matrices A and B is given by

$$\begin{bmatrix} y_0\\ y_1 \end{bmatrix} = \begin{bmatrix} a_0 & a_3\\ a_1 & a_0 \end{bmatrix} \begin{bmatrix} b_0\\ b_1 \end{bmatrix} = \begin{bmatrix} a_0b_0 + a_3b_1\\ a_1b_0 + a_0b_1 \end{bmatrix}. \tag{6}$$

Theorem 1 Let A be any n×n cyclic matrix, where n = n₁n₂ and GCD(n₁, n₂) = 1. Then the matrix A can be partitioned into a cyclic n₁×n₁ matrix whose entries are n₂×n₂ submatrices.

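The proof is similar to that of Winograd (1978). Using (4) and Theorem 1, the four-point cyclic matrix can be partitioned as

$$\left[\begin{array}{c} d_0\\ d_1\\ \hline d_2\\ d_3 \end{array}\right] = \left[\begin{array}{cc|cc} a_0 & a_3 & a_2 & a_1\\ a_1 & a_0 & a_3 & a_2\\ \hline a_2 & a_1 & a_0 & a_3\\ a_3 & a_2 & a_1 & a_0 \end{array}\right] \left[\begin{array}{c} b_0\\ b_1\\ \hline b_2\\ b_3 \end{array}\right]. \tag{7}$$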

From (7), it can be rewritten as

$$\begin{bmatrix} D_0\\ D_1 \end{bmatrix} = \begin{bmatrix} A_0 & A_1\\ A_1 & A_0 \end{bmatrix} \begin{bmatrix} B_0\\ B_1 \end{bmatrix}, \tag{8}$$

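where $D_0 = \begin{bmatrix} d_0\\ d_1 \end{bmatrix}$, $D_1 = \begin{bmatrix} d_2\\ d_3 \end{bmatrix}$, $A_0 = \begin{bmatrix} a_0 & a_3\\ a_1 & a_0 \end{bmatrix}$, $A_1 = \begin{bmatrix} a_2 & a_1\\ a_3 & a_2 \end{bmatrix}$, $B_0 = \begin{bmatrix} b_0\\ b_1 \end{bmatrix}$, and $B_1 = \begin{bmatrix} b_2\\ b_3 \end{bmatrix}$. In (8), the number of multiplications can be reduced by applying the form of (5) as follows:

$$\begin{bmatrix} D_0\\ D_1 \end{bmatrix} = \begin{bmatrix} A_0(B_0 + B_1) + (A_0 + A_1)\,B_1\\ A_0(B_0 + B_1) + (A_0 + A_1)\,B_0 \end{bmatrix} = \begin{bmatrix} F + G\\ F + H \end{bmatrix}, \tag{9}$$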

where

$$F = A_0(B_0 + B_1) = \begin{bmatrix} a_0 & a_3\\ a_1 & a_0 \end{bmatrix} \begin{bmatrix} b_0 + b_2\\ b_1 + b_3 \end{bmatrix},$$

$$G = (A_0 + A_1)\,B_1 = \begin{bmatrix} a_0 + a_2 & a_3 + a_1\\ a_1 + a_3 & a_0 + a_2 \end{bmatrix} \begin{bmatrix} b_2\\ b_3 \end{bmatrix}, \quad \text{and} \quad H = (A_0 + A_1)\,B_0 = \begin{bmatrix} a_0 + a_2 & a_3 + a_1\\ a_1 + a_3 & a_0 + a_2 \end{bmatrix} \begin{bmatrix} b_0\\ b_1 \end{bmatrix}.$$

The matrix F is formed by (6), and the matrices G and H are formed by (5), which yields:

$$\begin{aligned} F &= \begin{bmatrix} a_0(b_0 + b_2) + a_3(b_1 + b_3)\\ a_1(b_0 + b_2) + a_0(b_1 + b_3) \end{bmatrix},\\ G &= \begin{bmatrix} (a_0 + a_2)(b_2 + b_3) + (a_0 + a_2 + a_3 + a_1)\,b_3\\ (a_0 + a_2)(b_2 + b_3) + (a_0 + a_2 + a_3 + a_1)\,b_2 \end{bmatrix},\\ H &= \begin{bmatrix} (a_0 + a_2)(b_0 + b_1) + (a_0 + a_2 + a_3 + a_1)\,b_1\\ (a_0 + a_2)(b_0 + b_1) + (a_0 + a_2 + a_3 + a_1)\,b_0 \end{bmatrix}. \end{aligned} \tag{10}$$

Obviously, the matrices F, G, and H are combinations of terms in the elements b_i. Writing s₀ = b₀+b₂, s₁ = b₁+b₃, s₂ = a₀s₀+a₃s₁, s₃ = a₁s₀+a₀s₁, s₄ = b₂+b₃, and s₅ = b₀+b₁, they become:

$$F = \begin{bmatrix} s_2\\ s_3 \end{bmatrix}, \quad G = \begin{bmatrix} (a_0 + a_2)\,s_4 + (a_0 + a_2 + a_3 + a_1)\,b_3\\ (a_0 + a_2)\,s_4 + (a_0 + a_2 + a_3 + a_1)\,b_2 \end{bmatrix}, \quad \text{and} \quad H = \begin{bmatrix} (a_0 + a_2)\,s_5 + (a_0 + a_2 + a_3 + a_1)\,b_1\\ (a_0 + a_2)\,s_5 + (a_0 + a_2 + a_3 + a_1)\,b_0 \end{bmatrix}.$$

Next, in the matrices G and H, substitute w₀ = a₀+a₂ and w₁ = a₃+a₁. Thus, the matrices G and H can be given as

$$G = \begin{bmatrix} r_0 + r_2 b_3\\ r_0 + r_2 b_2 \end{bmatrix} \quad \text{and} \quad H = \begin{bmatrix} r_1 + r_2 b_1\\ r_1 + r_2 b_0 \end{bmatrix},$$

where r₀ = w₀s₄, r₁ = w₀s₅, and r₂ = w₀+w₁. Finally, the four-point cyclic matrix method takes the new matrix form

$$\begin{bmatrix} d_0\\ d_1\\ d_2\\ d_3 \end{bmatrix} = \begin{bmatrix} D_0\\ D_1 \end{bmatrix} = \begin{bmatrix} F + G\\ F + H \end{bmatrix} = \begin{bmatrix} s_2 + r_0 + r_2 b_3\\ s_3 + r_0 + r_2 b_2\\ s_2 + r_1 + r_2 b_1\\ s_3 + r_1 + r_2 b_0 \end{bmatrix}.$$

In this simplified form, the MixColumns transformation can be performed with 10 multiplications and 17 additions. The three items w₀ = a₀+a₂, w₁ = a₃+a₁, and r₂ = w₀+w₁ are known, because the coefficients a_i of the polynomial A(x) are fixed, so they can be precomputed in the program. The method then uses only 10 multiplications and 14 additions, denoted (10M, 14A).

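Scheme 1. (10M, 14A)

$$\begin{aligned} &s_0 = b_0 + b_2, \quad s_1 = b_1 + b_3,\\ &s_2 = a_0s_0 + a_3s_1, \quad s_3 = a_1s_0 + a_0s_1, \quad w_0 = a_0 + a_2, \quad w_1 = a_3 + a_1,\\ &r_0 = w_0(b_2 + b_3), \quad r_1 = w_0(b_0 + b_1), \quad r_2 = w_0 + w_1,\\ &d_0 = s_2 + r_0 + r_2 b_3\\ &d_1 = s_3 + r_0 + r_2 b_2\\ &d_2 = s_2 + r_1 + r_2 b_1\\ &d_3 = s_3 + r_1 + r_2 b_0 \end{aligned}$$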
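The Scheme 1 listing translates almost line by line into C. The sketch below follows it and compares the result with the direct 16-multiplication product; the names `gf_mul`, `mixcol_direct`, `mixcol_scheme1`, and `check_scheme1` are hypothetical. The second coefficient set in the check is an arbitrary test value chosen so that r₂ ≠ 1, not a proposed MDS matrix.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit-serial multiplication modulo f(x) = x^8 + x^4 + x^3 + x + 1. */
uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t c = 0;
    while (b) {
        if (b & 1) c ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return c;
}

/* Reference (16M): d_i = sum over j of a_{(i-j) mod 4} * b_j. */
void mixcol_direct(const uint8_t a[4], const uint8_t b[4], uint8_t d[4]) {
    for (int i = 0; i < 4; i++) {
        d[i] = 0;
        for (int j = 0; j < 4; j++)
            d[i] ^= gf_mul(a[(i - j + 4) & 3], b[j]);
    }
}

/* Scheme 1 (10M). w0, w1, r2 depend only on A(x) and could be precomputed. */
void mixcol_scheme1(const uint8_t a[4], const uint8_t b[4], uint8_t d[4]) {
    uint8_t w0 = a[0] ^ a[2], w1 = a[3] ^ a[1], r2 = w0 ^ w1;
    uint8_t s0 = b[0] ^ b[2], s1 = b[1] ^ b[3];
    uint8_t s2 = gf_mul(a[0], s0) ^ gf_mul(a[3], s1);
    uint8_t s3 = gf_mul(a[1], s0) ^ gf_mul(a[0], s1);
    uint8_t r0 = gf_mul(w0, b[2] ^ b[3]);
    uint8_t r1 = gf_mul(w0, b[0] ^ b[1]);
    d[0] = s2 ^ r0 ^ gf_mul(r2, b[3]);
    d[1] = s3 ^ r0 ^ gf_mul(r2, b[2]);
    d[2] = s2 ^ r1 ^ gf_mul(r2, b[1]);
    d[3] = s3 ^ r1 ^ gf_mul(r2, b[0]);
}

/* Compare Scheme 1 with the reference on 1000 generated columns. */
void check_scheme1(void) {
    const uint8_t aes[4] = {0x02, 0x01, 0x01, 0x03};  /* a0..a3 of AES a(x) */
    const uint8_t any[4] = {0x05, 0x07, 0x0a, 0x21};  /* arbitrary, r2 != 1 */
    for (unsigned n = 0; n < 1000; n++) {
        uint8_t b[4], d1[4], d2[4];
        for (unsigned k = 0; k < 4; k++)
            b[k] = (uint8_t)(n * (37u + 2u * k) + 11u * k);
        mixcol_direct(aes, b, d1);
        mixcol_scheme1(aes, b, d2);
        assert(memcmp(d1, d2, 4) == 0);
        mixcol_direct(any, b, d1);
        mixcol_scheme1(any, b, d2);
        assert(memcmp(d1, d2, 4) == 0);
    }
}
```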

















3.3 Reducing multiplications by multiply 2

The matrix product $A_0B_0 = \begin{bmatrix} a_0 & a_3\\ a_1 & a_0 \end{bmatrix} \begin{bmatrix} b_0\\ b_1 \end{bmatrix}$ can be further simplified using the properties of addition over GF(2^m). Here 2x denotes x + x, which is zero in GF(2^m). Adding the two zero terms 2a₀b₁ = 0 and 2a₀b₀ = 0 into the matrix product A₀B₀ gives:

$$A_0B_0 = \begin{bmatrix} a_0b_0 + a_3b_1 + 2a_0b_1\\ a_1b_0 + a_0b_1 + 2a_0b_0 \end{bmatrix} = \begin{bmatrix} a_0(b_0 + b_1) + (a_0 + a_3)\,b_1\\ a_0(b_0 + b_1) + (a_0 + a_1)\,b_0 \end{bmatrix}. \tag{11}$$
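Identity (11) trades one multiplication for additions in the non-cyclic 2×2 product. The sketch below checks it numerically against the direct four-multiplication form; `prod2_direct`, `prod2_fast`, and `check_prod2` are hypothetical names, and the sums a₀+a₃ and a₀+a₁ are computed inline here although they could be precomputed for fixed coefficients.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-serial multiplication modulo f(x) = x^8 + x^4 + x^3 + x + 1. */
uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t c = 0;
    while (b) {
        if (b & 1) c ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return c;
}

/* Direct product of [[a0,a3],[a1,a0]] with (b0,b1): 4 multiplications. */
void prod2_direct(uint8_t a0, uint8_t a1, uint8_t a3,
                  uint8_t b0, uint8_t b1, uint8_t y[2]) {
    y[0] = (uint8_t)(gf_mul(a0, b0) ^ gf_mul(a3, b1));
    y[1] = (uint8_t)(gf_mul(a1, b0) ^ gf_mul(a0, b1));
}

/* Same product by (11): 3 multiplications. */
void prod2_fast(uint8_t a0, uint8_t a1, uint8_t a3,
                uint8_t b0, uint8_t b1, uint8_t y[2]) {
    uint8_t t0 = (uint8_t)(a0 ^ a3);            /* a0 + a3 */
    uint8_t t1 = (uint8_t)(a0 ^ a1);            /* a0 + a1 */
    uint8_t m  = gf_mul(a0, (uint8_t)(b0 ^ b1)); /* shared term a0(b0+b1) */
    y[0] = (uint8_t)(m ^ gf_mul(t0, b1));
    y[1] = (uint8_t)(m ^ gf_mul(t1, b0));
}

/* Compare both forms on a sampled grid of inputs. */
void check_prod2(void) {
    for (int a0 = 1; a0 < 256; a0 += 63)
        for (int a1 = 0; a1 < 256; a1 += 51)
            for (int a3 = 0; a3 < 256; a3 += 85)
                for (int b0 = 0; b0 < 256; b0 += 17)
                    for (int b1 = 0; b1 < 256; b1 += 15) {
                        uint8_t y[2], z[2];
                        prod2_direct((uint8_t)a0, (uint8_t)a1, (uint8_t)a3,
                                     (uint8_t)b0, (uint8_t)b1, y);
                        prod2_fast((uint8_t)a0, (uint8_t)a1, (uint8_t)a3,
                                   (uint8_t)b0, (uint8_t)b1, z);
                        assert(y[0] == z[0] && y[1] == z[1]);
                    }
}
```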


In Scheme 1, the matrix F was formed by (6). Now, the matrix F is formed by (11) instead, to obtain the following matrix:

$$\begin{aligned} F = \begin{bmatrix} a_0 & a_3\\ a_1 & a_0 \end{bmatrix} \begin{bmatrix} b_0 + b_2\\ b_1 + b_3 \end{bmatrix} &= \begin{bmatrix} a_0(b_0 + b_2 + b_1 + b_3) + (a_0 + a_3)(b_1 + b_3)\\ a_0(b_0 + b_2 + b_1 + b_3) + (a_0 + a_1)(b_0 + b_2) \end{bmatrix}\\ &= \begin{bmatrix} a_0(s_0 + s_1) + t_0 s_1\\ a_0(s_0 + s_1) + t_1 s_0 \end{bmatrix} = \begin{bmatrix} t_2 + t_0 s_1\\ t_2 + t_1 s_0 \end{bmatrix}, \end{aligned}$$

where s₀ = b₀+b₂, s₁ = b₁+b₃, t₀ = a₀+a₃, t₁ = a₀+a₁, and t₂ = a₀(s₀+s₁).

In Scheme 1, the two items s₂ = a₀s₀+a₃s₁ and s₃ = a₁s₀+a₀s₁ can therefore be replaced by s₂ = t₂+t₀s₁ and s₃ = t₂+t₁s₀, with t₀ = a₀+a₃, t₁ = a₀+a₁, and t₂ = a₀(s₀+s₁), for computing the MixColumns transformation. In addition, each r₂b_i term in Scheme 1 can be replaced by a lookup table tc[b_i] = r₂b_i; that is, multiplication by the constant r₂ no longer requires a computed multiplication. The table needs 256 bytes of memory. The resulting method is called Scheme 2.

Scheme 1 can thus be rewritten as follows:

Scheme 2. (5M, 15A) (It needs 256 bytes for the lookup table)

$$\begin{aligned} &s_0 = b_0 + b_2, \quad s_1 = b_1 + b_3,\\ &t_0 = a_0 + a_3, \quad t_1 = a_0 + a_1, \quad t_2 = a_0(s_0 + s_1),\\ &s_2 = t_2 + t_0 s_1, \quad s_3 = t_2 + t_1 s_0, \quad w_0 = a_0 + a_2,\\ &r_0 = w_0(b_2 + b_3), \quad r_1 = w_0(b_0 + b_1),\\ &d_0 = s_2 + r_0 + tc[b_3]\\ &d_1 = s_3 + r_0 + tc[b_2]\\ &d_2 = s_2 + r_1 + tc[b_1]\\ &d_3 = s_3 + r_1 + tc[b_0] \end{aligned}$$
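A sketch of Scheme 2 in C follows, assuming the constants derived from A(x) and the 256-byte table tc are prepared once per coefficient set. The struct `scheme2_ctx` and the functions `scheme2_init`, `mixcol_scheme2`, and `check_scheme2` are hypothetical names introduced for this illustration, and the coefficient set in the check is an arbitrary test value, not a proposed MDS matrix.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit-serial multiplication modulo f(x) = x^8 + x^4 + x^3 + x + 1. */
uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t c = 0;
    while (b) {
        if (b & 1) c ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return c;
}

/* Reference (16M): d_i = sum over j of a_{(i-j) mod 4} * b_j. */
void mixcol_direct(const uint8_t a[4], const uint8_t b[4], uint8_t d[4]) {
    for (int i = 0; i < 4; i++) {
        d[i] = 0;
        for (int j = 0; j < 4; j++)
            d[i] ^= gf_mul(a[(i - j + 4) & 3], b[j]);
    }
}

typedef struct {
    uint8_t a0, t0, t1, w0;   /* constants derived from A(x) */
    uint8_t tc[256];          /* tc[x] = r2 * x, r2 = a0 + a1 + a2 + a3 */
} scheme2_ctx;

/* Precompute the constants and the 256-byte table for one A(x). */
void scheme2_init(scheme2_ctx *ctx, const uint8_t a[4]) {
    uint8_t r2 = a[0] ^ a[1] ^ a[2] ^ a[3];
    ctx->a0 = a[0];
    ctx->t0 = a[0] ^ a[3];
    ctx->t1 = a[0] ^ a[1];
    ctx->w0 = a[0] ^ a[2];
    for (int x = 0; x < 256; x++)
        ctx->tc[x] = gf_mul(r2, (uint8_t)x);
}

/* Scheme 2: 5 multiplications, 15 additions, 4 table lookups. */
void mixcol_scheme2(const scheme2_ctx *ctx, const uint8_t b[4], uint8_t d[4]) {
    uint8_t s0 = b[0] ^ b[2], s1 = b[1] ^ b[3];
    uint8_t t2 = gf_mul(ctx->a0, s0 ^ s1);
    uint8_t s2 = t2 ^ gf_mul(ctx->t0, s1);
    uint8_t s3 = t2 ^ gf_mul(ctx->t1, s0);
    uint8_t r0 = gf_mul(ctx->w0, b[2] ^ b[3]);
    uint8_t r1 = gf_mul(ctx->w0, b[0] ^ b[1]);
    d[0] = s2 ^ r0 ^ ctx->tc[b[3]];
    d[1] = s3 ^ r0 ^ ctx->tc[b[2]];
    d[2] = s2 ^ r1 ^ ctx->tc[b[1]];
    d[3] = s3 ^ r1 ^ ctx->tc[b[0]];
}

/* Compare Scheme 2 with the reference on 1000 generated columns. */
void check_scheme2(void) {
    const uint8_t any[4] = {0x05, 0x07, 0x0a, 0x21};  /* arbitrary, r2 != 1 */
    scheme2_ctx ctx;
    scheme2_init(&ctx, any);
    for (unsigned n = 0; n < 1000; n++) {
        uint8_t b[4], d1[4], d2[4];
        for (unsigned k = 0; k < 4; k++)
            b[k] = (uint8_t)(n * (37u + 2u * k) + 11u * k);
        mixcol_direct(any, b, d1);
        mixcol_scheme2(&ctx, b, d2);
        assert(memcmp(d1, d2, 4) == 0);
    }
}
```

Keeping the table inside a context struct lets several coefficient sets coexist, which matches the idea of selecting A(x) from key bits; a single global table would also work when only one A(x) is in use.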

















Scheme 2 uses only 5 multiplications and 18 additions, with 256 bytes of memory, for the matrix multiplication. If the coefficients of the polynomial A(x) satisfy a₀+a₃+a₂+a₁ = 1, as in the AES standard, then r₂ = w₀+w₁ = 1 in Scheme 2. Consequently, with r₂ = 1 the lookup table of Scheme 2 is not required (e.g., tc[b_i] = r₂b_i = 1·b_i = b_i), and no table memory is needed in an embedded system; the resulting method is Scheme 3. In Scheme 3, the three items w₀ = a₀+a₂, t₀ = a₀+a₃, and t₁ = a₀+a₁ can be precomputed in the program, so the method uses only 5 multiplications and 15 additions, namely, (5M, 15A).
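Under the condition a₀+a₁+a₂+a₃ = 1, Scheme 3 can be sketched in C as below. The function names `gf_mul`, `mixcol_direct`, `mixcol_scheme3`, and `check_scheme3` are hypothetical. The check uses the standard AES coefficients for MixColumns and InvMixColumns, both of which sum to 1, so the same routine serves encryption and decryption; w₀, t₀, and t₁ are computed inline here for brevity although they could be precomputed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit-serial multiplication modulo f(x) = x^8 + x^4 + x^3 + x + 1. */
uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t c = 0;
    while (b) {
        if (b & 1) c ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return c;
}

/* Reference (16M): d_i = sum over j of a_{(i-j) mod 4} * b_j. */
void mixcol_direct(const uint8_t a[4], const uint8_t b[4], uint8_t d[4]) {
    for (int i = 0; i < 4; i++) {
        d[i] = 0;
        for (int j = 0; j < 4; j++)
            d[i] ^= gf_mul(a[(i - j + 4) & 3], b[j]);
    }
}

/* Scheme 3 (5M, 15A). Valid only when a0 + a1 + a2 + a3 = 1, so r2 = 1.
   The output d must not alias the input b. */
void mixcol_scheme3(const uint8_t a[4], const uint8_t b[4], uint8_t d[4]) {
    uint8_t w0 = a[0] ^ a[2], t0 = a[0] ^ a[3], t1 = a[0] ^ a[1];
    uint8_t s0 = b[0] ^ b[2], s1 = b[1] ^ b[3];
    uint8_t t2 = gf_mul(a[0], s0 ^ s1);
    uint8_t s2 = t2 ^ gf_mul(t0, s1);
    uint8_t s3 = t2 ^ gf_mul(t1, s0);
    uint8_t r0 = gf_mul(w0, b[2] ^ b[3]);
    uint8_t r1 = gf_mul(w0, b[0] ^ b[1]);
    d[0] = s2 ^ r0 ^ b[3];
    d[1] = s3 ^ r0 ^ b[2];
    d[2] = s2 ^ r1 ^ b[1];
    d[3] = s3 ^ r1 ^ b[0];
}

/* Forward and inverse AES coefficients, ordered (a0, a1, a2, a3). */
void check_scheme3(void) {
    const uint8_t fwd[4] = {0x02, 0x01, 0x01, 0x03};
    const uint8_t inv[4] = {0x0e, 0x09, 0x0d, 0x0b};
    assert((fwd[0] ^ fwd[1] ^ fwd[2] ^ fwd[3]) == 0x01);
    assert((inv[0] ^ inv[1] ^ inv[2] ^ inv[3]) == 0x01);
    for (unsigned n = 0; n < 1000; n++) {
        uint8_t b[4], d1[4], d2[4], back[4];
        for (unsigned k = 0; k < 4; k++)
            b[k] = (uint8_t)(n * (37u + 2u * k) + 11u * k);
        mixcol_direct(fwd, b, d1);
        mixcol_scheme3(fwd, b, d2);
        assert(memcmp(d1, d2, 4) == 0);
        mixcol_scheme3(inv, d2, back);      /* InvMixColumns undoes it */
        assert(memcmp(back, b, 4) == 0);
    }
}
```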