Transactions on Cryptographic Hardware and Embedded Systems, Volume 2024
Fast and Clean: Auditable high-performance assembly via constraint solving
Amin Abdulrahman
Ruhr University Bochum, Bochum, Germany; Max Planck Institute for Security and Privacy, Bochum, Germany
Hanno Becker
Automated Reasoning Group, Amazon Web Services, Cambridge, United Kingdom
Matthias J. Kannwischer
Quantum Safe Migration Center, Chelpis Quantum Tech, Taipei, Taiwan
Fabien Klein
Arm Limited
Keywords: Superoptimization, Constraint Solving, Cryptography, Post-Quantum Cryptography, Armv8.1-M, AArch64, Helium, Neon, Kyber, Dilithium, X25519, Fast Fourier Transform, FFT, Number Theoretic Transform, NTT, Software Pipelining, Google OR-Tools
Abstract
Handwritten assembly is a widely used tool in the development of highperformance cryptography: By providing full control over instruction selection, instruction scheduling, and register allocation, highest performance can be unlocked. On the flip side, developing handwritten assembly is not only time-consuming, but the artifacts produced also tend to be difficult to review and maintain – threatening their suitability for use in practice. In this work, we present SLOTHY (Super (Lazy) Optimization of Tricky Handwritten assemblY), a framework for the automated superoptimization of assembly with respect to instruction scheduling, register allocation, and loop optimization (software pipelining): With SLOTHY, the developer controls and focuses on algorithm and instruction selection, providing a readable “base” implementation in assembly, while SLOTHY automatically finds optimal and traceable instruction scheduling and register allocation strategies with respect to a model of the target (micro)architecture. We demonstrate the flexibility of SLOTHY by instantiating it with models of the Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72microarchitectures, implementing the Armv8.1-M+Helium and AArch64+Neon architectures. We use the resulting tools to optimize three workloads: First, for Cortex-M55 and Cortex-M85, a radix-4 complex Fast Fourier Transform (FFT) in fixed-point and floating-point arithmetic, fundamental in Digital Signal Processing. Second, on Cortex-M55, Cortex-M85, Cortex-A55 and Cortex-A72, the instances of the Number Theoretic Transform (NTT) underlying CRYSTALS-Kyber and CRYSTALS-Dilithium, two recently announced winners of the NIST Post-Quantum Cryptography standardization project. Third, for Cortex-A55, the scalar multiplication for the elliptic curve key exchange X25519. The SLOTHY-optimized code matches or beats the performance of prior art in all cases, while maintaining compactness and readability.
Publication
Transactions of Cryptographic Hardware and Embedded Systems, Volume 2024, Issue 1
PaperArtifact
Artifact number
tches/2024/a6
Artifact published
March 7, 2024
Badge
🏆 IACR CHES Results Reproduced
License
This work is licensed under the MIT License.
Some files in this archive are licensed under a different license. See the contents of this archive for more information.
BibTeX How to cite
Abdulrahman, A., Becker, H., Kannwischer, M. J., & Klein, F. (2023). Fast and Clean: Auditable high-performance assembly via constraint solving. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2024(1), 87–132. https://doi.org/10.46586/tches.v2024.i1.87-132 Artifact available at https://artifacts.iacr.org/tches/2024/a6