International Association for Cryptologic Research

International Association
for Cryptologic Research

Transactions on Cryptographic Hardware and Embedded Systems, Volume 2022

Multi-Parameter Support with NTTs for NTRU and NTRU Prime on Cortex-M4


README

Multi-Parameter Support with NTTs for NTRU and NTRU Prime on Cortex-M4

This artifact accompanies our paper Multi-Parameter Support with NTTs for NTRU and NTRU Prime on Cortex-M4
published at TCHES 2022.
You can find the repository here, the talk here and an updated version here.

Summary of Overall Performance

The following table summarizes the overall performance of NTRU and NTRU Prime of our work.
It is stripped from Tables 9 and 10 in our paper.

scheme implementation key generation [cycles] encapsulation [cycles] decapsulation [cycles]
ntruhps2048677 m4f_1440 3,912k 525k 718k
ntruhrss701 m4f_1440 3,822k 361k 778k
ntruhps4096821 m4f 5,217k 654k 908k
ntrulpr653 m4f 669k 1,131k 1,231k
ntrulpr761 m4f 710k 1,266k 1,365k
ntrulpr857 m4f 886k 1,465k 1,596k
sntrup653 m4f 6,623k 621k 527k
sntrup761 m4f 7,937k 666k 563k
sntrup857 m4f 10,192k 812k 685k

The Structure of the Sources

We follow pqm4 directory structure to make it easier to adopt.
- common contains the common functions.
- crypto_kem contains our implementations. We provided implementations of each parameter sets for each scheme in a separeted directory under the crypto_kem directory.
- mk contains the relevant makefiles stripped from pqm4 and mupq.
- ldscripts contains the linker for our board.

We focus on the NTT-based polynomial multiplications.
For each implementation,
- The prototypes for assembly functions and precomputed tables are given in a header file NTT.h, and the parameters for the convolution implementation are given in a separeted header file NTT_params.h.
- Dedicated butterfly operations, which are explained in Section 4.1 of the paper, are implemented in the special_butterflies.i and used in Good_3x2.S files.
- The final map implementations are provided in the final_map.S files.

Imported Code

We used latest version of NTRU and NTRU Prime implementations from pqm4.
- We used TMVP-based polynomial multiplication from Faster NTRU on ARM Cortex-M4 with TMVP-based multiplication by Irem Keskinkurt Paksoy and Murak Cenk when any of the inputs in the polynomial multiplication of NTRU implementations doesn't have restriction that all coefficients should be in {-1, 0, 1}. In particular,
- tmvp4_677_10.S
- tmvp4_821_12.S
- tmvp4_701_10.S
- We used NTT-based multiplication (without changing the coefficient rings) for polynomial multiplications in the key generation of Streamlined NTRU Prime implementations from Number Theoretic Transform for Polynomial Multiplication in Lattice-based Cryptography on ARM Processors by Yun-Li Cheng. Please refer to the thesis for more details.
- We used polynomial inversions for the key generations of ntruhps, ntruhrss, and sntrup from Implementation of Polynomial Modular Inversion in Lattice- based cryptography on ARM by Ching-Lin Li. Please refer to the thesis for more details.

Running tests

Dependencies

Compilation

For a better compilation performance we recommend compile the code with following command.

make -j8

Binary files

After compilation, binary files are created under bin directory and can be uploaded to the board with using st-link tools as follows:

st-flash write XXX.bin 0x8000000

where XXX.bin is the produced binary file.

How to Read from Board

To read the output, one needs to connect the boards PA2 and PA3 pins to an usb-ttl serial adapter and run the following command.

python3 ./read_serial.py

Setting is in the file config.py.

Script for Testing

To test the correctness of the implementations in the repository, one can run the following python script.

python3 ./test.py

Script for Benchmarking

To verify cycle counts reported in the paper, one can use following script.

python3 ./speed.py

Results will be shown in the command line output for each implementation and the formatted tables are generated in the file speed.txt at the end.
For the overall performance, implementations without the suffix _168MHz are reported in our paper and summarized in the beginning of this README.
The implementations with suffix 168MHz are reported for completeness, but it is known that some parts of the programs (that are not programs from this paper) are not optimized for code size and incur non-negligible performance penalties while running at the full speed 168MHz.
For the performance of NTT-based polynomial multiplications, we report the numbers at both 24MHz and 168MHz.

Licences

Please refer to LICENSE.md for details.

Related External Code

We also have some programs for generating the tables of twiddle factors and computing the worst-case bounds based on Montgomery reductions and multiplications.

Generating Twiddle Factors

In gen_table,
the following folders generate the tables of twiddle factors in this paper:
- ntruhps2048677_1440
- ntruhps2048677_1536
- ntruhps4096821
- ntruhrss701_1440
- ntruhrss701_1536
- ntrup653
- ntrup761
- ntrup857

Computing Worst-Case Bounds

In bound,
the following folders compute the bounds (based on Montgomery reductions and multiplications) of implementations in this paper:
- ntruhps2048677_1440
- ntruhps2048677_1536
- ntruhps4096821
- ntruhrss701_1440
- ntruhrss701_1536
- ntrup653
- ntrup761
- ntrup857