Transactions on Cryptographic Hardware and Embedded Systems, Volume 2024

Fast and Clean: Auditable high-performance assembly via constraint solving

README

      # Submission artifact

This directory contains the artifact submission for the CHES 2024 paper [Fast and Clean: Auditable
high-performance assembly via constraint solving](https://eprint.iacr.org/2022/1303.pdf).

The artifact enables interested readers to:

1. _Optimize:_ Reproduce the SLOTHY optimizations described in the paper.
2. _Test:_ Validate the functional correctness of the optimized code through tests.
3. _Benchmark:_ If suitable development boards are available, evaluate the performance of the optimized code.

## Setup

Optimization requires the [SLOTHY](https://github.com/slothy-optimizer/slothy) repository. For testing and benchmarking,
we recommend and describe the use of the [pqmx](https://github.com/slothy-optimizer/pqmx) and
[pqax](https://github.com/slothy-optimizer/pqax) test repositories for Cortex-M and Cortex-A. Benchmarking further
requires the availability of suitable devices or development boards.

### Docker

SLOTHY, pqmx and pqax have a number of dependencies that can be cumbersome to setup manually, including [Google
OR-Tools](https://github.com/google/or-tools/) and cross-compilers for AArch64 and Armv8.1-M.

For convenience, this directory contains a Dockerfile [slothy.Dockerfile](./slothy.Dockerfile) establishing an
Ubuntu-22.04-based Docker image with SLOTHY, pqax and pqmx setup and ready for use.

#### Build image

* Build the image:

```
docker build -f slothy.Dockerfile -t slothy_image .
```

* Check success:

```
docker image ls
```

should show a line like this:

```
% docker image ls
REPOSITORY      TAG       IMAGE ID       CREATED         SIZE
slothy_image    latest    b009755ab33e   2 hours ago     3.5GB
```

#### Create and run container

* Create docker container from image

```
% docker run --name slothy_container -d -it slothy_image /bin/bash
e06f3c0155e552ce41a7fecdccf27f18e04e888ee30b5a43b48b98326df360bd
```

* Check that the container is running

```
% docker container ls
CONTAINER ID   IMAGE        COMMAND       CREATED          STATUS          PORTS     NAMES
e06f3c0155e5   slothy_image "/bin/bash"   20 seconds ago   Up 19 seconds             slothy_container
```

* Start shell in docker container

```
% docker exec -it slothy_container /bin/bash
root@e06f3c0155e5:/slothy#
```

Here are all steps together:

```
docker build -f slothy.Dockerfile -t slothy_image .
docker image ls
docker run --name slothy_container -d -it slothy_image /bin/bash
docker container ls
docker exec -it slothy_container /bin/bash
```

### Manual build

For a manual local build, you largely follow the steps in [`slothy.Dockerfile`](./slothy.Dockerfile). At the top-level,
you should have a directory structured as follows:

```
* artifact
* |-- pqmx
*     |-- submodules
*         |-- slothy # symlink to slothy repository
* |-- pqax
*     |-- submodules
*         |-- slothy # symlink to slothy repository
* |-- slothy
```

Note that to avoid having three copies of SLOTHY, you should not use `git submodule` in the pqmx and pqax repositories,
but symlink the SLOTHY repository into the submodule location.

For the SLOTHY repository, the main dependency is Google OR-Tools; see the README in the SLOTHY repository for setup
instructions (those are also followed in the Dockerfile).

## Using the artifact

Within either the docker container or your local copy of SLOTHY, please see `slothy/paper/README.md` for a detailed
description of how to reproduce, test and benchmark the results of the paper. This file is also available online
[here](https://github.com/slothy-optimizer/slothy/blob/ches2024_artifact/paper/README.md).

======================================================================================================================
|                                                                                                                    |
|                                                   slothy/paper/README.md                                           |
|                                                                                                                    |
======================================================================================================================

This directory contains supporting material for the paper [Fast and Clean: Auditable
high-performance assembly via constraint solving](https://eprint.iacr.org/2022/1303.pdf)
that introduced SLOTHY. It enables interested readers to:

1. _Optimize:_ Reproduce the SLOTHY optimizations described in the paper.
2. _Test:_ Validate the functional correctness of the optimized code through tests.
3. _Benchmark:_ If suitable development boards are available, evaluate the performance of the optimized code.

For optimization, only the SLOTHY repository is needed. For testing and benchmarking, we recommend the use of the
[pqmx](https://github.com/slothy-optimizer/pqmx) and [pqax](https://github.com/slothy-optimizer/pqax)
repositories. See the respective READMEs for setup instructions, or use the Dockerfile provided in
[artifact](artifact) (see also [artifact/README.md](artifact/README.md)).

# Testing optimized code

SLOTHY, pqmx and pqax are shipped with the optimized code for the workloads discussed in the paper. To test that
those versions are functional, or to test re-optimizations (see below), you have the following options. We describe
them separately for the AArch64 and Armv8.1-M examples.

## AArch64

AArch64 tests live in pqax, which provides unit tests for the Kyber NTTs, Dilithium NTTs, and X25519 scalar
multiplication. Each unit test can be built and run in different test environments depending on the target platform.
We refer to the pqax README for a detailed description of the repository structure.

To build and/or run a test, do:

```
make {build, run}-{cross,native_mac,native_linux}-{ntt_dilithium,ntt_kyber,x25519}
```

Here, the first argument from `{build, run}` indicates whether the test should be built only, or built-and-run.
The third argument from `{ntt_dilithium, ntt_kyber, x25519}` denotes the workload under test. The
second argument from `{cross, native_mac, native_linux}` denotes the test environment:

* The `cross` test environment cross-compiles a user space binary for a Linux-AArch64 target that can be
  either run emulated, or copied onto a remote device and tested there.
* `native_linux` assumes native compilation on a Linux-AArch64 host.
* `native_mac` assumes native compilation on an Arm-based MacOS host.

Upon success, the test binaries can be found in `envs/{cross, native_mac, native_linux}`.

__Examples:__

* If you work with a local copy of pqax on a Mac, use `native_mac` as the test environment:
```
% make run-native_mac-ntt_kyber
% make run-native_mac-ntt_dilithium
% make run-native_mac-x25519
```

* If you work in a Docker container on an Arm-based Mac, or on an AArch64-based Linux host, use `native_linux` as the
  test environment:

```
% make run-native_linux-ntt_kyber
% make run-native_linux-ntt_dilithium
% make run-native_linux-x25519
```

* If you work on an x86 Linux host, use `cross` to build and run user QEMU user space emulation:

```
% make run-cross-ntt_kyber
% make run-cross-ntt_dilithium
% make run-cross-x25519
```

* If you want to cross-compile test binaries for a remote AARch64 Linux target, build the tests via

```
% make run-cross-ntt_kyber
% make run-cross-ntt_dilithium
% make run-cross-x25519
```

then copy and run them on the target.

### Troubleshooting

* Garbage benchmarks: The test binaries include both functional correctness checks and benchmarks. However, when built
  in the way described above, cycle measurements are stubbed out, so benchmark results will be meaningless -- please
  ignore them. We describe below how to build binaries for environments with cycle accurate benchmarking.

### Re-optimization, connection with SLOTHY

Unit tests in pqax obtain their assembly source from `pqax/asm/manual/`, which has symlinks into the subdirectory
of the SLOTHY repository where optimization outputs are stored. Running the above commands after re-optimization
in SLOTHY (explained below) should therefore automatically pick up the new files.

## Armv8.1-M

Armv8.1-M tests live in pqmx, which is structured in the same way as pqax. pqmx provides unit tests for the Kyber and
Dilithium NTTs, and for the floating point and fixed point partial FFT.

To build a unit test for use with QEMU, use

```
make build-m55-core-{ntt_kyber, ntt_dilithium, fx_fft, flt_fft}
```

The resulting image is located in `envs/core`, and can be run on QEMU via

```
make run-m55-core-{ntt_kyber, ntt_dilithium, fx_fft, flt_fft}
```

For example, to build and test all examples, do:

```
% make run-m55-core-ntt_kyber
% make run-m55-core-ntt_dilithium
% make run-m55-core-fx_fft
% make run-m55-core-flt_fft
```

# Reproducing SLOTHY optimizations

We now describe how to reproduce the SLOTHY optimizations discussed in the paper:

- The Number Theoretic Transforms (NTT) underlying Kyber/ML-KEM and Dilithium/ML-DSA, optimized for Cortex-A55,
  Cortex-A72, Cortex-M55 and Cortex-M85.

- An instance of the Fast Fourier Transform (FFT) in fixed-point and floating-point arithmetic,
  optimized for Cortex-M55 and Cortex-M85.

- The X25519 scalar multiplication, optimized for Cortex-A55.

## Overview

The optimizations described in the SLOTHY paper are driven by the following scripts:

```
scripts/slothy_dilithium_ntt_a55.sh
scripts/slothy_dilithium_ntt_a72.sh
scripts/slothy_fft.sh
scripts/slothy_kyber_ntt_a55.sh
scripts/slothy_kyber_ntt_a72.sh
scripts/slothy_ntt_helium.sh
scripts/slothy_sqmag.sh
scripts/slothy_x25519.sh
```

Each script optimizes one or more 'base' version(s) of the corresponding workload from [clean/helium/](./clean/helium/)
(for Armv8.1-M code) and [clean/neon](clean/neon) (for AArch64 code) and stores the optimized
code in [opt/helium](./opt/helium) and [opt/neon](./opt/neon), respectively. Optimized
source files is suffixed with `_opt` and the target microarchitecture: For example, one of the optimizations conducted
by `slothy_kyber_ntt_a55.sh` transforms
[clean/neon/ntt_dilithium_123_45678.s](./clean/neon/ntt_dilithium_123_45678.s) to
[opt/neon/ntt_dilithium_123_45678_opt_a55.s](./opt/neon/ntt_dilithium_123_45678_opt_a55.s).

## Setup

Follow the [SLOTHY Readme](../README.md) to setup SLOTHY. If you use the Docker container provided in
[artifact](artifact), this step is not necessary.

## Running the optimizations

* From [scripts/](./scripts/), run one of the optimization scripts, e.g.

```
./slothy_kyber_ntt_a55.sh
```

If you want to run all optimizations, run `all.sh`, passing `SILENT={N,Y}` to indicate whether you want to see log
output from SLOTHY.

```
SILENT={Y,N} ./all.sh
```

* Now, wait. Running all optimizations at once will take multiple hours.

* Upon success, find the optimized source files in [examples/opt/](../examples/opt). They should be structurally equal
  to the input files, with the base assembly sections replaced by the optimized kernels and the rescheduling permutation
  indicated through comments.

### Trouble-shooting

* Timing and quality of results: The underlying CP-SAT constraint solver is non-deterministic, which means that the
  optimization timings may vary. The performance of the optimized code may also vary, esp. for the Cortex-A72
  optimizations which are based on a heuristic model of Cortex-A72. Variations for the in-order cores Cortex-M55,
  Cortex-M85 and Cortex-A55 should be smaller.

* Timeout: In the extreme case, examples may not terminate in acceptable time. In this case, you can either re-run
  the optimization, or send a SIGINT via CTRL+C while the CP-SAT solver is running, which will abort the current
  optimization and attempt another one with a larger number of stalls. Note, though, that this may lead to a non-optimal
  result.

* Compilation failure from immediate offsets: Rarely, it can happen that the resulting code fails to compile because immediate offsets in
  load/store instructions have gone out of bounds: SLOTHY will adjust such offsets when a post-increment load/store like
  `ldr/str Q0, [X0], #imm0` is reordered against a load/store with immediate offset `ldr/str Q0, [X0, #imm1]`, but it is
  not presently aware of the architectural limitations of those offsets. If the optimized code fails to compile because
  of an excessive immediate, please re-run the respective script.

In case of other issues, please let us know and we will investigate.

# Benchmarking

## AArch64

To enable benchmarking in the test binaries for the Kyber NTTs, Dilithium NTTs, and X25519 scalar multiplication on
Cortex-A55 and Cortex-A72, use the following:

```
CYCLES={PMU,PERF} make {build,run}-{cross,native_linux}-{ntt_dilithium,ntt_kyber,x25519}
```

Here, `CYCLES=PMU` means that cycle counts will be obtained by directly accessing the PMU cycle counter register. This
access needs to be enabled by loading a suitable kernel module as described in
[https://github.com/mupq/pqax#enable-access-to-performance-counters](https://github.com/mupq/pqax#enable-access-to-performance-counters).

Alternatively, `CYCLES=PERF` means that cycle counts will be obtained via the `perf` module.

### Troubleshooting

* `Illegal instruction`: This fatal error is encountered when access to the PMU cycle counter has not been enabled. See
  the link above, or use `CYCLES=PERF` instead.

* No cycles on Arm-based Macs: pqax does not offer cycle measurements on Arm-based Macs yet.

## Armv8.1-M

pqmx supports building images ready for use with the MPS3 FPGA prototyping board and the AN547 and AN555 nodes for the
Cortex-M55 and Cortex-M85, respectively.

To build the respective images, use

```
make build-{m55-an547, m85-an555}-{ntt_kyber, ntt_dilithium, fx_fft, flt_fft}
```

Those images then need to be flashed onto the MPS3 for test.
International Association for Cryptologic Research

International Association
for Cryptologic Research

Transactions on Cryptographic Hardware and Embedded Systems, Volume 2024

Fast and Clean: Auditable high-performance assembly via constraint solving

README

International Association for Cryptologic Research

International Associationfor Cryptologic Research

Transactions on Cryptographic Hardware and Embedded Systems, Volume 2024

Fast and Clean: Auditable high-performance assembly via constraint solving

README

International Association
for Cryptologic Research