Transactions on Cryptographic Hardware and Embedded Systems, Volume 2021
Fixslicing AES-like ciphers:
New bitsliced AES speed records on ARM-Cortex M and RISC-V
README
Artifact for the paper "Fixslicing AES-like Ciphers" published in TCHES 2021/1
Overview
Main motivation
This artifact explains how to reproduce the benchmarking results reported in the paper "Fixslicing AES-like Ciphers" on the following development boards:
- HiFive1 Rev B (E31 RISC-V core)
- STM32L100C (ARM Cortex-M3)
- STM32F407VG (ARM Cortex-M4).
More specifically, we provide the following information:
- our development environment (on Ubuntu 18.08)
- our benchmark methodology
- results interpretation.
Organization
The artifact consists of two folders, as described below:
artifact_fixslicing
│ README.md
│ LICENSE
│
├───benchmark_arm
│ ├───1storder_masking
│ ├───common
│ ├───barrel_shiftrows
│ └───fixslicing
│
├───benchmark_riscv
│ ├───barrel_shiftrows
│ ├───fixslicing
│ └───screen.sh
where benchmark_arm and benchmark_riscv contain the necessary material to run the benchmark on the STM32 and HiFive1 Rev B boards, respectively.
The source code for the AES implementations comes from the GitHub repository published alongside the paper. Because that repository includes only the non-unrolled implementations, unrolling them to obtain the corresponding performance figures is left to the reader. Nevertheless, we provide the file benchmark_arm/fixslicing/aes_encrypt_unroll.s to give some insight into how this can be done: simply replace aes_encrypt.s with aes_encrypt_unroll.s in the makefile to benchmark the unrolled fixsliced encryption functions.
Benchmark on STM32 boards
Setup
Below we describe our setup to compile, load and run our code on the STM32 development boards mentioned above. We used a laptop running Ubuntu (18.08) with the following tools:
- GNU Arm Embedded Toolchain (arm-none-eabi gcc v9.2.1) to compile
- libopencm3 open-source firmware library for ARM Cortex-M microcontrollers (installed from the GitHub repository, commit 946c1cbc48f58e56e5f1d3b65d91c7fd2b94140e), to be used with arm-none-eabi
- STLink open source toolset (v1.6.1-98-gd819a4a) to program the boards
- pySerial Python module (v3.5) for serial communications to /dev/ttyUSB0
Note that the common folder contains linker scripts and wrappers for the above-mentioned development boards, as well as a simple Python script, bench.py, to read the benchmark output.
:warning: Depending on where arm-none-eabi and libopencm3 are installed, you might need to adapt the following lines within makefiles:
OPENCM3DIR = ../libopencm3
ARMNONEEABIDIR = /usr/arm-none-eabi
After running make in the folder to benchmark, one can use stlink to program the boards by running st-flash write aes_m3.bin 0x8000000 for the STM32L100C or st-flash write aes_m4.bin 0x8000000 for the STM32F407VG. Note that the 1storder_masking implementations are only compiled for the STM32F407VG board, since the STM32L100C does not embed a random number generator.
Finally, to execute the benchmark, run python3 ../common/bench.py and reset the board. Note that a USB-to-TTL adapter is required to communicate with the boards, with TX and RX connected to PA3 and PA2, respectively.
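To illustrate the host side of this step, here is a minimal sketch of a serial reader built on pySerial. It is not the provided bench.py: the helper names, the 115200 baud rate and the 10-second timeout are assumptions chosen for illustration, and the actual output format depends on the firmware.

```python
def read_benchmark_lines(port, max_lines=None):
    """Yield decoded, non-empty lines from any object exposing readline()."""
    count = 0
    while max_lines is None or count < max_lines:
        raw = port.readline()
        if not raw:
            # Empty read: timeout on a real serial port, EOF on a file.
            break
        text = raw.decode("ascii", errors="replace").strip()
        if text:
            count += 1
            yield text


def main(device="/dev/ttyUSB0", baudrate=115200):
    # pySerial is imported here so the helper above works without it.
    # The baud rate and timeout are assumptions, not values from the artifact.
    import serial

    with serial.Serial(device, baudrate, timeout=10) as port:
        for line in read_benchmark_lines(port):
            print(line)
```

Calling main() with the adapter attached and then resetting the board would print whatever the firmware sends until the port stays silent for the timeout period.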
Benchmarking methodology
Our benchmark simply consists of measuring the execution time of a single function call.
The execution time is measured by reading the DWT Cycle Counter (DWT_CYCCNT) register before and after the function call.
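Since DWT_CYCCNT is a 32-bit counter, the difference between the two readings is best taken modulo 2^32 so that a wraparound between them does not produce a negative value. The arithmetic can be modelled as follows; this is only a sketch with a hypothetical helper name, and whether the artifact's firmware handles wraparound this way is not stated here.

```python
CYCCNT_MASK = 0xFFFFFFFF  # DWT_CYCCNT is 32 bits wide


def elapsed_cycles(before, after):
    """Return the cycle count between two 32-bit counter readings.

    Valid as long as fewer than 2**32 cycles elapsed between the readings.
    """
    return (after - before) & CYCCNT_MASK
```

For a single AES call the counter is very unlikely to wrap, but the masked subtraction costs nothing and keeps the result correct if it does.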
Regarding the code size, it can be measured manually by disassembling the .elf file: run arm-none-eabi-objdump -d aes_m3.elf > code_size.txt and inspect the disassembly output.
A Python script, parse_arm-none-eabi-objdump.py, is also provided to do this automatically: simply run python3 parse_arm-none-eabi-objdump.py code_size.txt to print the code size of the different functions listed in the main section.
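For readers who want to see the idea behind such a parser, the sketch below sums the encoded bytes under each symbol of an objdump -d listing. It is not the provided script: the regular expressions assume the usual tab-separated objdump layout (address, hex encoding, mnemonic), and the function names in the usage example are hypothetical.

```python
import re

# "08000188 <name>:" starts a new symbol in objdump -d output.
HEADER = re.compile(r"^[0-9a-f]+ <([^>]+)>:\s*$")
# " 8000188:<TAB>b5f0      <TAB>push ..." carries the encoded bytes.
INSN = re.compile(r"^\s*[0-9a-f]+:\t([0-9a-f ]+)\t")


def function_sizes(disassembly):
    """Map each symbol name to the number of bytes listed under it."""
    sizes = {}
    current = None
    for line in disassembly.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            sizes[current] = 0
            continue
        insn = INSN.match(line)
        if insn and current is not None:
            # Each hex group encodes len(group) / 2 bytes.
            sizes[current] += sum(len(g) // 2 for g in insn.group(1).split())
    return sizes
```

A possible use, on a listing saved as described above:

```python
with open("code_size.txt") as f:
    for name, size in function_sizes(f.read()).items():
        print(f"{name}: {size} bytes")
```

Counting encoded bytes also picks up literal pools emitted inside a function, which is usually what one wants when reporting code size.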
Results interpretation
For the key schedule functions, the number of cycles printed in the console should match the numbers reported in the paper. For the encryption functions, the number of cycles has to be divided by 2 for the fixsliced representation and by 8 for the barrel-shiftrows representation, since they process 2 and 8 blocks in parallel, respectively, whereas the paper reports results in cycles per block.
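The conversion from a measured cycle count to cycles per block can be written as a one-line helper. The sketch below uses hypothetical names, and the cycle count in the usage note is a placeholder rather than a measured result.

```python
# Blocks processed in parallel by one encryption call, per representation.
BLOCKS_PER_CALL = {"fixslicing": 2, "barrel_shiftrows": 8}


def cycles_per_block(total_cycles, representation):
    """Convert the cycle count of one encryption call to cycles per block."""
    return total_cycles / BLOCKS_PER_CALL[representation]
```

For instance, a placeholder reading of 1000 cycles would correspond to 500 cycles per block for fixslicing and 125 for barrel-shiftrows. Key schedule figures need no conversion.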
Benchmark on the HiFive1 Rev B board
Setup
Regarding the E31 RISC-V core, we used the SiFive Freedom E SDK (v20.05.00.00).
Once the SDK is set up correctly, one can simply copy the files benchmark_riscv/fixslicing/* to freedom-e-sdk/software/fixslicing/* and run make BSP=metal PROGRAM=fixslicing TARGET=sifive-hifive1-revb clean software upload from the freedom-e-sdk folder.
In order to display the output, open another terminal and run the script screen.sh.
Benchmarking methodology
The benchmarking methodology is the same as the one described for the STM32 boards: we read the cycle counter before and after a single function call. The routine to read the counter is written in RV32I assembly in getcycles.S.
Because the E31 RISC-V core embeds a branch predictor that can introduce penalty cycles on a misprediction, we execute each function several times before benchmarking it in order to fill the instruction cache and train the branch predictor, so that such penalties are avoided.
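The warm-up pattern described above can be modelled as follows. This is a host-side sketch only: on the board the counter is read by the assembly routine, whereas here read_counter is a hypothetical callable, and the default of 5 warm-up runs is an assumption since the exact number is not stated.

```python
def benchmark_with_warmup(func, read_counter, warmup_runs=5):
    """Run func a few times untimed, then time one further call."""
    for _ in range(warmup_runs):
        func()  # warm the instruction cache and branch predictor
    start = read_counter()
    func()
    end = read_counter()
    return end - start
```

Only the final call is timed, so the reported figure reflects steady-state behaviour rather than the first, cold execution.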
For the code size, one can follow the same methodology described for the STM32 boards: first run riscv64-unknown-elf-objdump -d fixslicing.elf > code_size.txt and then either inspect the disassembly output manually or simply use the script by running python3 parse_riscv64-unknown-elf-objdump.py code_size.txt.
Results interpretation
Same remarks as for the STM32 boards.