International Association for Cryptologic Research

International Association
for Cryptologic Research

Transactions on Cryptographic Hardware and Embedded Systems, Volume 2024

Fast and Clean: Auditable high-performance assembly via constraint solving

Amin Abdulrahman
Ruhr University Bochum, Bochum, Germany; Max Planck Institute for Security and Privacy, Bochum, Germany

Hanno Becker
Automated Reasoning Group, Amazon Web Services, Cambridge, United Kingdom

Matthias J. Kannwischer
Quantum Safe Migration Center, Chelpis Quantum Tech, Taipei, Taiwan

Fabien Klein
Arm Limited

Keywords: Superoptimization, Constraint Solving, Cryptography, Post-Quantum Cryptography, Armv8.1-M, AArch64, Helium, Neon, Kyber, Dilithium, X25519, Fast Fourier Transform, FFT, Number Theoretic Transform, NTT, Software Pipelining, Google OR-Tools


Handwritten assembly is a widely used tool in the development of highperformance cryptography: By providing full control over instruction selection, instruction scheduling, and register allocation, highest performance can be unlocked. On the flip side, developing handwritten assembly is not only time-consuming, but the artifacts produced also tend to be difficult to review and maintain – threatening their suitability for use in practice. In this work, we present SLOTHY (Super (Lazy) Optimization of Tricky Handwritten assemblY), a framework for the automated superoptimization of assembly with respect to instruction scheduling, register allocation, and loop optimization (software pipelining): With SLOTHY, the developer controls and focuses on algorithm and instruction selection, providing a readable “base” implementation in assembly, while SLOTHY automatically finds optimal and traceable instruction scheduling and register allocation strategies with respect to a model of the target (micro)architecture. We demonstrate the flexibility of SLOTHY by instantiating it with models of the Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72microarchitectures, implementing the Armv8.1-M+Helium and AArch64+Neon architectures. We use the resulting tools to optimize three workloads: First, for Cortex-M55 and Cortex-M85, a radix-4 complex Fast Fourier Transform (FFT) in fixed-point and floating-point arithmetic, fundamental in Digital Signal Processing. Second, on Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72, the instances of the Number Theoretic Transform (NTT) underlying CRYSTALS-Kyber and CRYSTALS-Dilithium, two recently announced winners of the NIST Post-Quantum Cryptography standardization project. Third, for Cortex-A55, the scalar multiplication for the elliptic curve key exchange X25519. The SLOTHY-optimized code matches or beats the performance of prior art in all cases, while maintaining compactness and readability.


Transactions of Cryptographic Hardware and Embedded Systems, Volume 2024, Issue 1



Artifact number

Artifact published
March 7, 2024

🏆 IACR CHES Results Reproduced


ZIP (9098099 Bytes)  

This work is licensed under the MIT License.

Some files in this archive are licensed under a different license. See the contents of this archive for more information.

BibTeX How to cite

Abdulrahman, A., Becker, H., Kannwischer, M. J., & Klein, F. (2023). Fast and Clean: Auditable high-performance assembly via constraint solving. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2024(1), 87–132. Artifact available at