Academic Open Internet Journal |
Volume 12, 2004 |

**Efficient
VLSI implementation of the block cipher Rijndael algorithm**

*** Mrs.G.Umamaheswari, **
Dr.A.Shanmugam**

***Research
scholar, ** Professor,**

**ECE
Department,**

**PSG
College of technology, Coimbatore**

**Email:gumabhaskar@yahoo.co.in**

**Abstract:** Cryptographic
algorithms are more efficiently implemented in custom hardware than in software
running on general-purpose processors. This paper addresses efficient hardware
implementation approaches for the AES (Advanced Encryption Standard)Algorithm
and describes the design and performance testing Rijndael algorithm. Compared
to software implementation, hardware implementations provide more physical
security as well as higher speed. An optimized coding for the implementation of Rijndael
algorithm for 128 bytes has been developed. The speed factor of the algorithm
implementation has been targeted and a software code in VHDL **(V**HSIC **H**ardware **D**escription **L**anguage) which boasts of a throughput
of 2.18 Gb/sec has been developed. The architectural innovations that have been
incorporated in the coding include on the fly round key generation, which
facilitates simultaneous execution of sub bytes, shift rows and mix columns and
round key generation. A look up table called S-Box has been used to obtain the
sub byte values instead of applying affine transformation every time to
calculate sub byte values. This implementation will be useful in wireless
security like military communication and mobile telephony where there is a
greater emphasis on the speed of communication.

**Keywords:** Rijndael, Encryption, Advanced Encryption Standard (AES),
pipelining, security, VHDL**.**

**INTRODUCTION**

The cipher Rijndael is one of the
five finalists of the Advanced Encryption Standard. The AES Rijndael algorithm was chosen in October 2000 and is
expected to replace the DES and Triple DES because of its enhanced security
levels [6, 7, 8]. The algorithm has been designed by Joan Daemen and Vincent
Rijmen and its specification is given in [1]. It is a block cipher. The length
of the block and the length of the key can be specified to be 128, 192, 256
bits. In this paper, the hardware implementation with 128-bit blocks and
128-bit keys is presented. In this variant the cipher consists of 10 rounds.
Last of all, it is often the case that applications require the encryption
logic be physically isolated from the rest of the system, so that the
encryption can be secured more easily. In this case a hardware accelerator is a
more suitable solution as well[10,11]. In this paper, VLSI optimizations of the
Rijndael algorithm are discussed and several hardware design modifications and
techniques are used, such as memory sharing and parallelism. Different
applications of the AES algorithm require different speed/area trade offs. Some
applications such as WWW servers and ATMS are speed critical. Moreover, in
terms of mobile application like cellular phones, PDA’s, etc.,software
implementation[12] on general-purpose processors consumes much more power than
special purpose ASIC's do. The high performance is especially reached by
keeping combinational paths balanced so that every clock cycle is fully
utilized. The paper presents implemented Rijndael processor structure of
moderate gate count and high speed and it is organized as follows. In section
2, the algorithm flow is discussed. In section 3, the hardware implementations
and technique system level architecture are discussed. The design methodology
is discussed in section 4. Finally the measurements and performance results are
shown in section 5, conclusions and outlook are given in section 6.

**2. THE RIJNDAEL
ENCRYPTION/DECRYPTION ALGORITHM**

The NIST initiate to develop the Federal Information Processing Standard (FIPS) for the AES in replacing the Data Encryption Standard (DES) which expired in 1998 [3]. A new block cipher algorithm called Rijndael has been developed and published by Daeman and Rijmen[2].The symmetric block cipher was standardized by NIST as AES.This algorithm is iterative block cipher with variable block length and a variable key length.

Fig1
.Hardware structure

The round function of Rijndael in 128-bit blocks is composed of four distinct invertible transformations as follows:

-The ByteSub transformation (16 lookup tables with 8bit-input/output)

-The ShiftRow transformation (no hardware operations)

-The MixColumn transformation (logical AND and XOR operations)

-The AddRoundKey transformation and 4 data-dependent rotations (Logical XOR)

Before the first round, the AddRoundKey transformation is also performed, and in the

final round, the MixColumn transformation is omitted.

The basic components of Rijndael are
logical operations and lookup tables; the latter is actually a composite
function of an inversion over GF (_{2}^{8}) with an affine mapping. Hence the structure of Rijndael is
expected to be suitable for hardware implementation [10, 11].

**3. SYSTEM SPECIFICATION AND ARCHITECTURE**

** **

Fig 2.Block diagram of the hardware implementation of
a symmetric-block cipher

Hence by the time these operations get over, the round key is ready for performing the XOR operation with the output of mix column operation. Since the same hardware is used for generating all the round keys, the amount of hardware involved in the round key generation process is reduced greatly. The cost of generating round keys through this method is far less than the cost involved in the previous method. The on the fly method of generating round keys is only applicable to the encryption process. In the decryption process all the round keys are generated before hand. This is because in the decryption process, the last generated round key in the encryption process is used first and thus annuls the possibility of generating the round keys from the first round key.

In S-box implementation the input values are divided into chunk of eight bits each and each of them is used as an address for the S-Box [3,4] look up as shown in fig3. The constraint involved in applying Affine transformation as and when needed is that the time involved is large. Since the sub bytes operation takes place in each of the eleven cycles in the encryption process, a significant amount of the processor’s resources is used up. In the S-Box implementation, since the sub bytes values are already calculated before hand, no further processing is required for sub byte calculation. Even though the S-Box implementation involves the use of a significant amount of memory space , the time required of retrieving the sub byte values from the S-Box is significantly lesser than the time involved in calculating the sub bytes through Affine transformation. On the whole the trade off between the memory space and processing involved is loaded in favor of the former from the cost point of view and also reduces the time of operation. The concept of parallel processing has been used in this architecture for calculating the sub bytes. Instead of calculating the sub bytes sequentially, which consumes a lot of time, the 16 sub bytes are generated simultaneously using 16 S-Boxes [2]. By using 16 S-Boxes 16 sub bytes can be calculated in one go and hence reduces the time of operation tremendously.

** **

** **

** **

** 16S-Boxes
**

** **** **

Fig** **3.Parallel Implementation of S-Boxes

The encryption algorithm has been designed in such a way that the generation of round key and the round calculations can be parallely executed. The advantage of this design is the fact that we do not need to store the round key since they are currently calculated. The decryption block is shown in fig 4.Encryption and decryptions are mathematically inverse.

Figure 4. Block Diagram of Decryption Cores

The decryption function inv_cipher as shown in Fig 5 step-by-step reverses the transformations of the encryption process. The input parameter of inv_cipher is the ciphertext to be decrypted (back) into plaintext, the inverse S-box inv_s_box), the key schedule w, and the inverse polynomial matrix inv_poly_mat[7].

** **

** **

** **Fig 5.Decryption function inv_cipher

**The RoundKey**

The
Roundkeys are made by expanding the encryption key into an array holding the
Round Keys one after another. The expansion works on words of four bytes. Nk is
a constant defined as the number of four bytes words in the key. The encryption
key is filled into the first Nk words and the rest of the key material is
defined recursively from proceeding words. The word in position i, W[i], except
the first word of a RoundKey, is defined as the XOR between the preceding
word,W[i-1], and W[i-Nk]. The first word of each RoundKey, W[i] (where i mod Nk
= = 0), is defined as the XOR of a transformation on the preceding word, T(W[i
- 1]) and W[i - Nk]. The transformation T on a word, w, is w rotated to the
left by one byte, XOR’ed by a round constant and with each byte substituted by
the S-box.

**4. OPTIMIZATION IN CODING**

The various parameters involved in judging the efficacy of any implementation design include hardware complexity, area, cost, speed etc. The Verilog code will be optimized no matter where it resides in the hierarchy, even if VHDL is instantiated by the Verilog. The +opt switch is well suited for both RTL and Gate flows.

Since the number of input bytes is fixed to 128 bytes , the number of rounds involved in calculating the cipher text is also fixed to 10 , so the coding required to accommodate 12 and 14 rounds are eliminated. Also the number of round keys required is also fixed, this also helps in increasing the throughput. The program is simulated in ModelSim 5.7c and the results are captured.

**5. DESIGN METHODOLOGIES**

This architecture implements one round of the AES algorithm in one clock
cycle. This means that the S-boxes (the largest component in the
algorithm) are repeated 32 times. Sub keys are generated “on the fly” in this
architecture, which eliminates the need to store the sub keys.The basic
decryption round (which is the inverse of the basic encryption round) contains
the following transformations: AddRoundKey, InvMixColumn, InvShiftRow,
InvByteSub and the decryption round number one (which is the inverse of the
final encryption round) does not contain the InvMixColumn transformation. The
implementation of the decryption round has been designed in the similar way as
the encryption one. Rijndael operates on a state that is initialized with a
plaintext block, and after encryption this contains the ciphertext. The state
can be pictured as an rectangular array of bytes. It consists of four rows and
a number of columns defined by the block size in bytes divided by four. A block
size of 128 bit would require a state of four rows and 4 columns.** **Among the various time-space tradeoffs, we
focus primarily on time performance. The inherent parallelism of cryptographic
core and the low level hardware features of FPGAs are exploited to enhance the
performance.

** **

** **

**6.
EXPERIMENTAL AND SIMULATION RESULTS**

The test vectors provided by [7] for the algorithm was applied to assure that the design was working as intended. Performance comparisons between the proposed and previous designs are shown in Table 6.1 and Table 6.2. The encryption throughput of the proposed Implementation is many times higher comparing to the software implementations. Measurement results and comparisons between the proposed and previous hardware and software implementations are presented. The program is simulated in ModelSim 5.7c and the results are captured and displayed below.(Simulation results shown below). Table 6.3 compares the design in this paper with the design by Henry Kuo [2].

Throughput denotes the speed of the encryption process. It is calculated as,

Throughput = 128 Bits / (Cycles per Encrypted Block * Time Period)

the time period of the clock used is 769.23ns. The number of cycles per encryption block is 71.Therefore the throughput is given as

Throughput = 128 / (71* 769.23ns)

** =
2.18 Gb/sec**

**CALCULATION OF THROUGHPUT PER SLICE:**

Hardware resources within CLB slices may not be fully utilized by the place-and-route software so as to relieve routing congestion. This results in an increase in the number of CLB slices without a corresponding increase in logic gates. To achieve a more accurate measure of chip utilization, CLB slice count was chosen as the most reliable area measurement. Therefore, to measure the hardware resource cost associated with an implementation’s resultant throughput, the Throughput per Slice (TPS) metric is used.

** TPS = (Encryption Rate) / (# CLB Slices
Used)**

Here the encryption rate is 2.18 Gb/ sec.The number of slices used is 548.

Therefore the TPS is calculated as

TPS = (2.18 Gb/sec) / 548

**= 3978.1 Kb/sec/slice**

**HDL SYNTHESIS REPORT**

Macro Statistics

# ROMs : 56

16x128-bit ROM : 56

# Multiplexers : 66

8-bit 10-to-1 multiplexer : 10

8-bit 16-to-1 multiplexer : 56

# Xors : 171

128-bit xor2 : 11

8-bit xor2 : 150

8-bit xor3 : 10

Table 5.1Performance Comparison

Implementation |
Encryption Speed |

Software implementation(ANSI C) |
27Mb/s |

Visual C++ |
70.5Mb/s |

Hardware implementation(Altra) |
268Mb/s |

ProposedVHDL(Virtex II) |
2.18Gb/s |

(Encryption Simulation Result1)

** **

**(**Decryption Simulation Result2)

Table 6.2Comparison AES 128 standard module and VHDL implementation.

MODULE / COMPONENT |
GE[5] |
VHDLIMPLEMENTATION
VALUES |

S-BOXES |
16* 392 |
16*392 |

MULTIPLIERS |
16*212 |
16*230 |

D-CELLS |
16*87 |
16*76 |

MULTIPLEXERS |
374 |
512 |

Table 6.3 Comparison between standard AES 128 module and VHDL implementation

PARAMETERS |
STANDARD[2] |
VHDL IMPLEMENTATION (Proposed) |

CRITICAL PATH |
10ns |
10.443ns |

GATE COUNT |
173,000 |
184,000 |

THROUGHPUT |
1.82 Gb/sec |
2.18Gb/sec |

** **

**7. FUTURE DEVELOPMENT**

For future development, estimation
on the real time required for key initialization and time for a whole
encryption should be done on the real chip. Although the newly described design has only
been tested for 128 bit key, research is still going on the encryptor core for
higher bit lengths. Similar performance analysis can be performed for larger
sizes of data blocks and keys as well as for implementation s that process
multiple blocks of data concurrently.FPGA based solutions have shown
significant speedups compared with software based approaches[3,9]. The
widespread adoption of distributed, wireless, and mobile computing makes the
inclusion of privacy, authentication and security in general a necessity[1].
Power consumption will remain a critical factor ,especially when cryptographic
applications will move into embedded context.

**8. CONCLUSION**

In this paper, a VLSI implementation for the Rijndael encryption algorithm is presented. The combination of security, and high speed implementation, makes it a very good choice for wireless systems. The whole design was captured entirely in VHDL language using a bottom-up design and verification methodology. The proposed VLSI implementation of the algorithm reduces the covered area and achieves a data throughput up to 2.18Gbit/sec. Thus, an optimized coding for the implementation of Rijndael algorithm for 128 bits has been developed. Keeping in mind the speed factor and a software code in VHDL which boasts of a throughput of 2.18GB/s has been developed. Architectural innovations like on the fly round key generation, which facilitates simultaneous execution of sub bytes, shift rows and mix columns and round key generation has been incorporated in our coding.

**REFERENCES**

1. W.Diffle
and H.Hellman,”Privacy and authentication: An Introduction to cryptography”,
Proceedings of IEEE, 67(1979) ,pp 397-427.

2. Henry
Kuo and Ingrid Verbauwhede,”Architectural Optimization for a 1.82Gbits/sec VLSI
Implementation of the AES Rijindael Algorithm”, in 3^{rd} international workshop cryptographic Hardware and
embedded systems (CHES 2001), LNCS2162, Paris, May 2001,pp 51-64.

3. Ingrid
Verbauwhede and Henry Kuo, ”Design and performance testing of a 2.29Gbits/sec
Rijindael algorithm”, IEEE Journal of solid state circuits ,volume 38,No.3,March 2003,pp 569-572.

4.
J. Daemen and V. Rijmen, “AES Proposal:
Rijndael”, http://www.esat.kuleuven.ac.be/~rijmen/rijndael/, September, 1999.

5. K. Gaj
and P. Chodowiec , “Implementations of the AES Candidate Algorithms using
FPGA Devices” Technical Report, George Mason University, April, 2000.

6. Joan
Daemen and Vincent Rijmen , “ A Specification for the AES Algorithm Rijndael”,V
3.7,10^{th} May , 2003.

7. http://www.esat.kulven.ac.be/~rijmen/rijndael/,
(2001)

8. http://www.ascom.ch/infosec/downloads/IDEACryptCoprocessor.pdf

9.
Xilinx Webpack 5.1 Software,http://support.xilinx.com. M. McLoone, J. McCanny, “High

Performance Single-Chip FPGA Rijndael
Algorithm Implementations,”Proceedings

Cryptographic Hardware and
EmbeddedSystems Workshop, CHES, Paris, May 2001.

10. T. Ichikawa, T. Kasuya, M. Matsui,
“Hardware Evaluation of the AES Finalists,” in AES3: the third Advanced
Encryption Standard Candidate conference, New-York,April 13-14, 2000.

11. M.
Bean, C. Ficke, T. Rozylowicz, B. Weeks, “Hardware Performance simulations of
Round

2 Advanced Encryption Standard
Algorithms,” available at

http://csrc.nist.gov/encryption/aes/round2/NSAAESfinalreport.

12. B. Gladman, “The AES Algorithm (Rijndael)
in C and C++, performance of the optimized

implementation,” from http://fp.gladman.plus.com/cryptography_technology/rijndael/index.htm

Technical College - Bourgas,