Academic Open Internet Journal

Volume 12, 2004

Efficient VLSI implementation of the block cipher Rijndael algorithm

* Mrs.G.Umamaheswari, ** Dr.A.Shanmugam

*Research scholar,   ** Professor,

ECE Department,

PSG College of technology, Coimbatore

Abstract: Cryptographic algorithms are more efficiently implemented in custom hardware than in software running on general-purpose processors. This paper addresses efficient hardware implementation approaches for the AES (Advanced Encryption Standard)Algorithm and describes the design and performance testing Rijndael algorithm. Compared to software implementation, hardware implementations provide more physical security as well as higher speed. An optimized coding for the implementation of Rijndael algorithm for 128 bytes has been developed. The speed factor of the algorithm implementation has been targeted and a software code in VHDL (VHSIC Hardware Description Language) which boasts of a throughput of 2.18 Gb/sec has been developed. The architectural innovations that have been incorporated in the coding include on the fly round key generation, which facilitates simultaneous execution of sub bytes, shift rows and mix columns and round key generation. A look up table called S-Box has been used to obtain the sub byte values instead of applying affine transformation every time to calculate sub byte values. This implementation will be useful in wireless security like military communication and mobile telephony where there is a greater emphasis on the speed of communication.

Keywords: Rijndael, Encryption, Advanced Encryption Standard (AES), pipelining, security, VHDL.



The cipher Rijndael is one of the five finalists of the Advanced Encryption Standard.  The AES Rijndael algorithm was chosen in October 2000 and is expected to replace the DES and Triple DES because of its enhanced security levels [6, 7, 8]. The algorithm has been designed by Joan Daemen and Vincent Rijmen and its specification is given in [1]. It is a block cipher. The length of the block and the length of the key can be specified to be 128, 192, 256 bits. In this paper, the hardware implementation with 128-bit blocks and 128-bit keys is presented. In this variant the cipher consists of 10 rounds. Last of all, it is often the case that applications require the encryption logic be physically isolated from the rest of the system, so that the encryption can be secured more easily. In this case a hardware accelerator is a more suitable solution as well[10,11]. In this paper, VLSI optimizations of the Rijndael algorithm are discussed and several hardware design modifications and techniques are used, such as memory sharing and parallelism. Different applications of the AES algorithm require different speed/area trade offs. Some applications such as WWW servers and ATMS are speed critical. Moreover, in terms of mobile application like cellular phones, PDA’s, etc.,software implementation[12] on general-purpose processors consumes much more power than special purpose ASIC's do. The high performance is especially reached by keeping combinational paths balanced so that every clock cycle is fully utilized. The paper presents implemented Rijndael processor structure of moderate gate count and high speed and it is organized as follows. In section 2, the algorithm flow is discussed. In section 3, the hardware implementations and technique system level architecture are discussed. The design methodology is discussed in section 4. Finally the measurements and performance results are shown in section 5, conclusions and outlook are given in section 6.




The NIST initiate to develop the Federal Information Processing Standard (FIPS) for the AES in replacing the Data Encryption Standard (DES) which expired in 1998 [3]. A new block cipher algorithm called Rijndael has been developed and published by Daeman and Rijmen[2].The symmetric block cipher was standardized by NIST as AES.This algorithm is iterative block cipher with variable block length and a variable key length.


Fig1 .Hardware structure

The round function of Rijndael in 128-bit blocks is composed of four distinct invertible transformations as follows:

-The ByteSub transformation (16 lookup tables with 8bit-input/output)

-The ShiftRow transformation (no hardware operations)

-The MixColumn transformation (logical AND and XOR operations)

-The AddRoundKey transformation and 4 data-dependent rotations (Logical XOR)

Before the first round, the AddRoundKey transformation is also performed, and in the

final round, the MixColumn transformation is omitted.

The basic components of Rijndael are logical operations and lookup tables; the latter is actually a composite function of an inversion over GF (28) with an affine mapping. Hence the structure of Rijndael is expected to be suitable for hardware implementation [10, 11].




The basic organization of the hardware implementation of a symmetric block cipher is shown in Fig. 2.The encryptor core controlled by a control block present in the architecture.The round keys are generated on the fly in the encryption process. The hardware required to generate one set of round key is implemented and re-used it for calculating the rest of the round keys. This results in reduction in space used for storing the sub keys values and also improves the speed of operation since round key is generated simultaneously while the sub bytes, shift rows and mix columns take place.


Fig 2.Block diagram of the hardware implementation of a symmetric-block cipher


Hence by the time these operations get over, the round key is ready for performing the XOR operation with the output of mix column operation. Since the same hardware is used for generating all the round keys, the amount of hardware involved in the round key generation process is reduced greatly. The cost of generating round keys through this method is far less than the cost involved in the previous method. The on the fly method of generating round keys is only applicable to the encryption process. In the decryption process all the round keys are generated before hand. This is because in the decryption process, the last generated round key in the encryption process is used first and thus annuls the possibility of generating the round keys from the first round key.

 In S-box implementation the input values are divided into chunk of eight bits each and each of them is used as an address for the S-Box [3,4] look up as shown in fig3. The constraint involved in applying Affine transformation as and when needed is that the time involved is large. Since the sub bytes operation takes place in each of the eleven cycles in the encryption process, a significant amount of the processor’s resources is used up.  In the S-Box implementation, since the sub bytes values are already calculated before hand, no further processing is required for sub byte calculation. Even though the S-Box implementation involves the use of a significant amount of memory space , the time required of retrieving the sub byte values from the S-Box is significantly lesser than the time involved in calculating the sub bytes through Affine transformation. On the whole the trade off between the memory space and processing involved is loaded in favor of the former from the cost point of view and also reduces the time of operation.   The concept of parallel processing has been used in this architecture for calculating the sub bytes. Instead of calculating the sub bytes sequentially, which consumes a lot of time, the 16 sub bytes are generated simultaneously using 16 S-Boxes [2]. By using 16 S-Boxes 16 sub bytes can be calculated in one go and hence reduces the time of operation tremendously.






                       Fig 3.Parallel Implementation of S-Boxes

The encryption algorithm has been designed in such a way that the generation of round key and the round calculations can be parallely executed. The advantage of this design is the fact that we do not need to store the round key since they are currently calculated. The decryption block is shown in fig 4.Encryption and decryptions are mathematically inverse.

                      Figure 4. Block Diagram of Decryption Cores


The decryption function inv_cipher as shown in Fig 5 step-by-step reverses the transformations of the encryption process. The input parameter of inv_cipher is the ciphertext to be decrypted (back) into plaintext, the inverse S-box inv_s_box), the key schedule w, and the inverse polynomial matrix inv_poly_mat[7].



                         Fig 5.Decryption function inv_cipher

The RoundKey

The Roundkeys are made by expanding the encryption key into an array holding the Round Keys one after another. The expansion works on words of four bytes. Nk is a constant defined as the number of four bytes words in the key. The encryption key is filled into the first Nk words and the rest of the key material is defined recursively from proceeding words. The word in position i, W[i], except the first word of a RoundKey, is defined as the XOR between the preceding word,W[i-1], and W[i-Nk]. The first word of each RoundKey, W[i] (where i mod Nk = = 0), is defined as the XOR of a transformation on the preceding word, T(W[i - 1]) and W[i - Nk]. The transformation T on a word, w, is w rotated to the left by one byte, XOR’ed by a round constant and with each byte substituted by the S-box.



The various parameters involved in judging the efficacy of any implementation design include hardware complexity, area, cost, speed etc. The Verilog code will be optimized no matter where it resides in the hierarchy, even if VHDL is instantiated by the Verilog. The +opt switch is well suited for both RTL and Gate flows.

Since the number of input bytes is fixed to 128 bytes , the number of rounds involved in calculating the cipher text is also fixed to 10 , so the coding required to accommodate 12 and 14 rounds are eliminated. Also the number of round keys required is also fixed, this also helps in increasing the throughput. The program is simulated in ModelSim 5.7c and the results are captured.


This architecture implements one round of the AES algorithm in one clock cycle. This means that the S-boxes (the largest component in the algorithm) are repeated 32 times. Sub keys are generated “on the fly” in this architecture, which eliminates the need to store the sub keys.The basic decryption round (which is the inverse of the basic encryption round) contains the following transformations: AddRoundKey, InvMixColumn, InvShiftRow, InvByteSub and the decryption round number one (which is the inverse of the final encryption round) does not contain the InvMixColumn transformation. The implementation of the decryption round has been designed in the similar way as the encryption one. Rijndael operates on a state that is initialized with a plaintext block, and after encryption this contains the ciphertext. The state can be pictured as an rectangular array of bytes. It consists of four rows and a number of columns defined by the block size in bytes divided by four. A block size of 128 bit would require a state of four rows and 4 columns. Among the various time-space tradeoffs, we focus primarily on time performance. The inherent parallelism of cryptographic core and the low level hardware features of FPGAs are exploited to enhance the performance.





The test vectors provided by [7] for the algorithm was applied to assure that the design was working as intended. Performance comparisons between the proposed and previous designs are shown in Table 6.1 and Table 6.2. The encryption throughput of the proposed Implementation is many times higher comparing to the software implementations. Measurement results and comparisons between the proposed and previous hardware and software implementations are presented. The program is simulated in ModelSim 5.7c and the results are captured and displayed below.(Simulation results shown below). Table 6.3 compares the design in this paper with the design by Henry Kuo [2].

Throughput denotes the speed of the encryption process. It is calculated as,

Throughput = 128 Bits / (Cycles per Encrypted Block * Time Period)

the time period of the clock used is 769.23ns. The number of cycles per encryption block is 71.Therefore the throughput is given as

Throughput       = 128 / (71* 769.23ns)

                                                             = 2.18 Gb/sec


Hardware resources within CLB slices may not be fully utilized by the place-and-route software so as to relieve routing congestion. This results in an increase in the number of CLB slices without a corresponding increase in logic gates. To achieve a more accurate measure of chip utilization, CLB slice count was chosen as the most reliable area measurement. Therefore, to measure the hardware resource cost associated with an implementation’s resultant throughput, the Throughput per Slice (TPS) metric is used.

                           TPS = (Encryption Rate) / (# CLB Slices Used)

Here the encryption rate is 2.18 Gb/ sec.The number of slices used is 548.

Therefore the TPS is calculated as

TPS     = (2.18 Gb/sec) / 548

= 3978.1 Kb/sec/slice


Macro Statistics

# ROMs                                               : 56

16x128-bit ROM                     : 56

# Multiplexers                                       : 66

8-bit 10-to-1 multiplexer           : 10

8-bit 16-to-1 multiplexer           : 56

# Xors                                                  : 171

128-bit xor2                             : 11

8-bit xor2                                 : 150

8-bit xor3                                 : 10

Table 5.1Performance Comparison


Encryption Speed

Software implementation(ANSI C)


Visual C++


Hardware implementation(Altra)


ProposedVHDL(Virtex II)













(Encryption Simulation Result1)














(Decryption Simulation Result2)

Table 6.2Comparison AES 128 standard module and VHDL implementation.







16* 392













Table 6.3 Comparison between standard AES 128 module and VHDL implementation













1.82 Gb/sec




For future development, estimation on the real time required for key initialization and time for a whole encryption should be done on the real chip. Although the newly described design has only been tested for 128 bit key, research is still going on the encryptor core for higher bit lengths. Similar performance analysis can be performed for larger sizes of data blocks and keys as well as for implementation s that process multiple blocks of data concurrently.FPGA based solutions have shown significant speedups compared with software based approaches[3,9]. The widespread adoption of distributed, wireless, and mobile computing makes the inclusion of privacy, authentication and security in general a necessity[1]. Power consumption will remain a critical factor ,especially when cryptographic applications will move into embedded context.



In this paper, a VLSI implementation for the Rijndael encryption algorithm is presented. The combination of security, and high speed implementation, makes it a very good choice for wireless systems. The whole design was captured entirely in VHDL language using a bottom-up design and verification methodology. The proposed VLSI implementation of the algorithm reduces the covered area and achieves a data throughput up to 2.18Gbit/sec. Thus, an optimized coding for the implementation of Rijndael algorithm for 128 bits has been developed. Keeping in mind the speed factor and a software code in VHDL which boasts of a throughput of 2.18GB/s has been developed. Architectural innovations like on the fly round key generation, which facilitates simultaneous execution of sub bytes, shift rows and mix columns and round key generation has been incorporated in our coding.


1.      W.Diffle and H.Hellman,”Privacy and authentication: An Introduction to cryptography”, Proceedings of IEEE, 67(1979) ,pp 397-427.

2.      Henry Kuo and Ingrid Verbauwhede,”Architectural Optimization for a 1.82Gbits/sec VLSI Implementation of the AES Rijindael Algorithm”, in  3rd international workshop cryptographic Hardware and embedded systems (CHES 2001), LNCS2162, Paris, May 2001,pp 51-64.

3.      Ingrid Verbauwhede and Henry Kuo, ”Design and performance testing of a 2.29Gbits/sec Rijindael algorithm”, IEEE Journal of solid state circuits  ,volume 38,No.3,March 2003,pp 569-572.            

4.      J. Daemen and V. Rijmen, “AES Proposal: Rijndael”,  , September, 1999.

5.      K. Gaj and P. Chodowiec , “Implementations of the AES Candidate Algorithms using FPGA  Devices”  Technical Report, George Mason University, April, 2000.

6.      Joan Daemen and Vincent Rijmen , “ A Specification for the AES Algorithm Rijndael”,V 3.7,10th  May , 2003.

7., (2001)


9.      Xilinx Webpack 5.1 Software, M. McLoone, J. McCanny, “High   

      Performance Single-Chip FPGA Rijndael Algorithm Implementations,”Proceedings       

       Cryptographic Hardware and EmbeddedSystems Workshop, CHES, Paris, May 2001.

10.  T. Ichikawa, T. Kasuya, M. Matsui, “Hardware Evaluation of the AES Finalists,” in AES3:      the third Advanced  Encryption Standard Candidate conference, New-York,April 13-14, 2000.

11. M. Bean, C. Ficke, T. Rozylowicz, B. Weeks, “Hardware Performance simulations of Round  

      2 Advanced Encryption Standard Algorithms,” available at  

 12. B. Gladman, “The AES Algorithm (Rijndael) in C and C++, performance of the optimized  

        implementation,” from


Technical College - Bourgas,

All rights reserved, © March, 2000