Transactions on Cryptographic Hardware and Embedded Systems, Volume 2022

A Fast Large-Integer Extended GCD Algorithm and Hardware Design for Verifiable Delay Functions and Modular Inversion

README

A Fast Large-Integer Extended GCD Algorithm and Hardware Design for Verifiable Delay Functions and Modular Inversion

Authors: Kavya Sreedhar, Mark Horowitz, Christopher Torng

Contact: skavya@stanford.edu

XGCD Execution Flow Diagram – Key components in the execution flow are broken out in detail. (A) Pre-processing step to generate odd inputs to iterations loop; (B) Update for \delta register; (C) Variable updates for a, b, u, y, m, n registers in the iterations loop illustrating the wide parallel datapath with late selects (the logic for unique update types are shown in detail); (D) Control flow state diagram with termination condition; (E) Post-processing step to generate XGCD outputs. Please see the paper link above for more detail.

Example Designs
Installation
- Required System Setup
- Repo Setup
Extended GCD
Debugging
Physical Design
- SKY 130nm
- TSMC 16nm
FPGA Synthesis, Place and Route
Common Issues
Citation

Example Designs

Design | Bitwidth | Constant-time | Technology Node | Area (mm^2) | Clock Frequency (GHz) | Cycles | Execution Time (ns) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
Design (1) | 1024 | No | TSMC 16nm ASIC | 0.41 | 3.89 | 1143 | 294 |
Design (2) | 255 | Yes | TSMC 16nmn ASIC | 0.059 | 4.55 | 399 | 88 |

Applications:
- Design (1) targets a vdf construction of squaring binary quadratic forms over class groups
- Design (2) targets constant-time modular inversion for Curve25519 (elliptic curve cryptography)

Installation

Required System Setup

Note that the code in this repository has been tested on the following platforms:
- MacOS (Big Sur Version 11.4, Catalina 10.15.7)
- Linux (Ubuntu 18.04.6 LTS, CentOS Linux 7 (Core))

System Requirements:
1) DesignWare Library - Datapath and Building Block IP (https://www.synopsys.com/dw/buildingblock.php), specifically DW01_csa (https://www.synopsys.com/dw/ipdir.php?c=DW01_csa) for the DesignWare carry-save adder module and DW_tap (https://www.synopsys.com/dw/ipdir.php?c=DW_tap) for the DesignWare TAP controller (for our JTAG module)
* We provide a simple handwritten CSA module that can be used instead of the DW01_csa module. Note that the DesignWare CSA module will result in better performance.
* The DW_tap module is only needed for physical design for JTAG generation. Thus, the extended GCD functional model, hardware generator, and testbench can be run without this file.

2) Python 3.7+ (We have used Python 3.7.4 and Python 3.9.4 with this repo)

3) Verilator (https://verilator.org/guide/latest/install.html) required for running tests.

For MacOS:

brew install verilator

4) Genesis r11879 (https://github.com/StanfordVLSI/Genesis2/wiki) required for JTAG in RTL generation for physical design script execution.

5) Vivado 2019.2 required for FPGA experiments.

Repo Setup

1) Clone repository:

git clone https://github.com/kavyasreedhar/sreedhar-xgcd-hardware-ches2022
cd sreedhar-xgcd-hardware-ches2022

2) Create a virtual environment (python3.7+ required)

python3.7 -m venv xgcd_env

3) Activate virtual environment

source xgcd_env/bin/activate

4) Install requirements

python -m pip install --upgrade pip
pip install -e .

The above command installs the below packages.

Name	Version
kratos	0.0.38
magma-lang	2.2.26
fault	3.1.2
pytest	7.1.2
mflowgen	0.4.0
gnureadline	8.1.2

5) Set environment variables:
* TOP: set to the path that should contain this cloned repository, including sreedhar-xgcd-hardware-ches2022
* DW_PATH, set to the file path for DW01_csa.v (the DesignWare carry-save adder module)

export TOP=$PWD
# example command, put the file to your DW01_csa.v file here
# skip this step if you are using our simple handwritten CSA module instead of the DesignWare CSA module
export DW_PATH=/path/to/src_ver/DW01_csa.v

Notes:
- Steps 1, 2, 4 are one-time installation steps. Steps 3 and 5 must always be run when working with this repository.
- Unless otherwise specified, all commands below should be run from $TOP.

Extended GCD

Directory Structure

Important files within these directories are explained further in the below sections.
- $TOP -- Cloned repository
* tests -- Tests for XGCD hardware generator with functional model
* xgcd -- Project source code
* functional_models -- XGCD functional model in Python
* hardware -- Hardware files, primarily in Kratos (a hardware design language capable of generating SystemVerilog)
* extended_gcd -- Hardware files for XGCD modules
* jtag -- Hardware files for JTAG modules in Genesis
* csa.py -- Hardware files for carry-save adder (CSA) related modules
* physical_design -- Physical design scripts for open-source (SKY130) and commercial (TSMC16) technologies, to the extent possible under TSMC16 NDA
* sky130 -- Physical design scripts with mflowgen to run the flow through signoff
* tsmc16 -- Physical design scripts with mflowgen to run the flow through synthesis
* utils -- General helper functions
* XGCD_debug -- Directory for logs and waveforms when debugging hardware tests
* xgcd_env -- Virtual environment directory

Functional Model

Usage:

cd $TOP
python xgcd/functional_models/xgcd_model.py -h

Arguments:

Example usage:

For non-constant-time 1024-bit XGCD designs, including Design (1):
- using random inputs, with set seed

python xgcd/functional_models/xgcd_model.py  --bit_length 1024 --reduction_factor_even 8 --reduction_factor_odd 4

Expected output:

Test Stats:
Namespace(bit_length=1024, diff_bit_lengths=False, input_bit_length=1024, debug=False, random=True, use_seed=True, num_tests=1024, constant_time=False, min_pair_bezout=False, reduction_factor_even=8, reduction_factor_odd=4)

Running 1024 tests...
0 / 1024 tests have successfully completed.
100 / 1024 tests have successfully completed.
200 / 1024 tests have successfully completed.
300 / 1024 tests have successfully completed.
400 / 1024 tests have successfully completed.
500 / 1024 tests have successfully completed.
600 / 1024 tests have successfully completed.
700 / 1024 tests have successfully completed.
800 / 1024 tests have successfully completed.
900 / 1024 tests have successfully completed.
1000 / 1024 tests have successfully completed.

All 1024 tests have successfully completed.

Summary stats for our algorithm
-------------------------------
Average number of iterations across the 1024 tests:  1131.140625
Maximum number of iterations across the 1024 tests: 1172

For constant-time 255-bit XGCD designs, including Design (2):
- using random inputs, with set seed

python xgcd/functional_models/xgcd_model.py  --bit_length 255 --reduction_factor_even 2 --reduction_factor_odd 4 --constant_time

Expected output:

Test Stats:
Namespace(bit_length=255, diff_bit_lengths=False, input_bit_length=1024, debug=False, random=True, use_seed=True, num_tests=1024, constant_time=True, min_pair_bezout=False, reduction_factor_even=2, reduction_factor_odd=4)

Running 1024 tests...
0 / 1024 tests have successfully completed.
100 / 1024 tests have successfully completed.
200 / 1024 tests have successfully completed.
300 / 1024 tests have successfully completed.
400 / 1024 tests have successfully completed.
500 / 1024 tests have successfully completed.
600 / 1024 tests have successfully completed.
700 / 1024 tests have successfully completed.
800 / 1024 tests have successfully completed.
900 / 1024 tests have successfully completed.
1000 / 1024 tests have successfully completed.

All 1024 tests have successfully completed.

Summary stats for our algorithm
-------------------------------
Average number of iterations across the 1024 tests:  387.0
Maximum number of iterations across the 1024 tests: 387

Hardware Generator

Relevant Files:
- $TOP/xgcd/hardware/extended_gcd -- Contains all hardware files required for the extended GCD hardware design
* xgcd_top.py -- Contains top-level wiring for XGCD module
* pre_processing.py -- Executes the pre-processing step in the XGCD algorithm
* update_a_b.py -- Updates a and b every iteration
* update_a_b_odd.py -- Computes updates for a and b when they are both odd
* update_bezout.py -- Updates one Bezout coefficient variable every iteration (four of these modules are required to implement all Bezout coefficient updates)
* update_bezout_odd.py -- Computes updates for one Bezout coefficient variable when a and b are odd
* post_processing.py -- Executes the post-processing step in the XGCD algorithm
* half_adder.py -- Uses a half adder to correct truncated results when shifting in CSA form
* xgcd_wrapper.py -- Top-level wiring for XGCD Unit, JTAG modules, and reset synchronizer
- $TOP/xgcd/hardware/csa.py -- Contains CSA-related modules (including logic for shifting in CSA form)
- $TOP/xgcd/hardware/jtag -- Contains JTAG modules
- $TOP/xgcd/hardware/AhaResetSync.py, AhaResetSync.v -- Contains reset synchronizer module

To generate Core XGCD Design with $TOP/xgcd/hardware/extended_gcd:

Commonly Used Generator Parameters:

cd $TOP
python xgcd/hardware/extended_gcd/xgcd_top.py -h

For the full list of generator parameters, please see the XGCDTop module definition in xgcd_top.py.
After executing xgcd_top.py, the Verilog output will be at $TOP/XGCDTop.v.

Example Usage:

To generate Design (1) Verilog using the DesignWare CSA module:

python xgcd/hardware/extended_gcd/xgcd_top.py --bit_length 1024 --reduction_factor_even 8 --reduction_factor_odd 4

Expected printout:

Verilog Generation Parameter Summary:
Namespace(bit_length=1024, constant_time_support=False, reduction_factor_even=8, reduction_factor_odd=4, csa_handwritten=False, output='XGCDTop.v')

Generating 1024-bit XGCD...
- without constant-time-support
- with DesignWare CSA module

Generated Verilog is at XGCDTop.v

To generate Design (1) Verilog with our simple handwritten CSA module:
- Note that this command will take slightly longer than the above command.

python xgcd/hardware/extended_gcd/xgcd_top.py --bit_length 1024 --reduction_factor_even 8 --reduction_factor_odd 4 --csa_handwritten

Expected printout:

Verilog Generation Parameter Summary:
Namespace(bit_length=1024, constant_time_support=False, reduction_factor_even=8, reduction_factor_odd=4, csa_handwritten=True, output='XGCDTop.v')

Generating 1024-bit XGCD...
- without constant-time-support
- with handwritten CSA module (NOT DesignWare CSA module)

Generated Verilog is at XGCDTop.v

To generate Design (2) Verilog using the DesignWare CSA module:

python xgcd/hardware/extended_gcd/xgcd_top.py --bit_length 255 --reduction_factor_even 2 --reduction_factor_odd 4 --constant_time_support

Expected printout:

Verilog Generation Parameter Summary:
Namespace(bit_length=255, constant_time_support=True, reduction_factor_even=2, reduction_factor_odd=4, csa_handwritten=False, output='XGCDTop.v')

Generating 255-bit XGCD...
- with constant-time-support
- with DesignWare CSA module

Generated Verilog is at XGCDTop.v

To generate Design (2) Verilog with our simple handwritten CSA module:

python xgcd/hardware/extended_gcd/xgcd_top.py --bit_length 255 --reduction_factor_even 2 --reduction_factor_odd 4 --constant_time_support --csa_handwritten

Expected printout:

Verilog Generation Parameter Summary:
Namespace(bit_length=255, constant_time_support=True, reduction_factor_even=2, reduction_factor_odd=4, csa_handwritten=True, output='XGCDTop.v')

Generating 255-bit XGCD...
- with constant-time-support
- with handwritten CSA module (NOT DesignWare CSA module)

Generated Verilog is at XGCDTop.v

Here is how to generate the wrapper that instantiates and wires together the XGCDTop, JTAG, and reset synchronizer modules. Note that the first step of the generate RTL step in the Physical Design section below automates this procedure:

XGCDWrapper generation
- python xgcd/hardware/extended_gcd/xgcd_wrapper.py
- This script accepts the same arguments as the XGCDTop generation script above (python xgcd/hardware/extended_gcd/xgcd_top.py)
- The generated top-level module XGCDWrapper will instantiate XGCDTop, jtag_wrapper, and AhaResetSync in the output Verilog

For Design (1):

python xgcd/hardware/extended_gcd/xgcd_wrapper.py --bit_length 1024 --reduction_factor_even 8 --reduction_factor_odd 4

Expected printout:

XGCDWrapper Verilog Generation Parameter Summary:
Namespace(bit_length=1024, constant_time_support=False, reduction_factor_even=8, reduction_factor_odd=4, csa_handwritten=False, output='XGCDWrapper.v')

Generating XGCDWrapper at XGCDWrapper.v

Generating 1024-bit XGCD...
- without constant-time-support
- with DesignWare CSA module

For Design (2):

python xgcd/hardware/extended_gcd/xgcd_wrapper.py --bit_length 255 --reduction_factor_even 2 --reduction_factor_odd 4 --constant_time_support

Expected printout:

XGCDWrapper Verilog Generation Parameter Summary:
Namespace(bit_length=255, constant_time_support=True, reduction_factor_even=2, reduction_factor_odd=4, csa_handwritten=False, output='XGCDWrapper.v')

Generating XGCDWrapper at XGCDWrapper.v

Generating 255-bit XGCD...
- with constant-time-support
- with DesignWare CSA module

The JTAG is generated with the generate RTL step in the Physical Design section (more information below).

Tests

These tests verify the functionality of the hardware design ($TOP/xgcd/hardware/xgcd_top.py) with the functional model ($TOP/xgcd/functional_models/xgcd_model.py). The functional model includes assertions that verify the gcd result with Python's math package, as well as the xgcd result using Bezout's Identity.

For each pair of XGCD inputs, our testbench automatically checks functionality for various

Reduction factor pairs: (2, 4), (4, 4), (8, 4), and (4, 8)
Levels of constant_time support: no constant-time support, constant_time support with constant_time evaluation configuration set to True, and constant_time support with constant_time evaluation configuration set to False

We split our tests into four categories, which each test inputs with all of the above combinations

basic: Tests small input values on hardware with various input bitwidth support
small: Tests one random input with all of the above combinations, for 12 total tests
medium: Tests 50 random inputs with all of the above combinations, for 600 total tests
large: Tests 100 random inputs with all of the above combinations, for 1200 total tests

First, try our small tests to verify that the repo setup is working. This will run 36 tests (1 random input pair, 12 hardware combinations with reduction factor pair and constant_time support parameters, 3 input bitwidths) and will take about five minutes.

cd $TOP
pytest tests/ -k "small" -v

The following tests will run:

$TOP/tests/test_255_small.py
$TOP/tests/test_512_small.py
$TOP/tests/test_1024_small.py

Expected output:

Note that the Python version and directory paths may differ, but all other output should be identical

========================================== test session starts ===========================================
platform darwin -- Python 3.9.4, pytest-7.1.2, pluggy-1.0.0 -- /Users/kavya/Desktop/Kavya/xgcd/xgcd_env/bin/python3
cachedir: .pytest_cache
rootdir: /Users/kavya/Desktop/Kavya/xgcd
collected 5616 items / 5580 deselected / 36 selected

tests/test_1024_small.py::test_xgcd[129740214252898973727546781863690236427505086565118448844733374574338935066795444075011107409251780249534288652292843452467431465455814122189084624424706410901040204839071069816085988326202894375752477038977995129925461413982352331306193938911620728120537498098087099200301958466874771000687451301847359266754-49187590039535125951987380134882589439567227195013692231006837852847386192875140839226122174891376217921494817877253369498542074188542392087455774134995355878141646924561048463369962741837813262970855428491139534497215856651744571501108030438689721399347064521728275123901691438822576871224914785078523326200-True-1-2-2-4] PASSED [  2%]
tests/test_1024_small.py::test_xgcd[129740214252898973727546781863690236427505086565118448844733374574338935066795444075011107409251780249534288652292843452467431465455814122189084624424706410901040204839071069816085988326202894375752477038977995129925461413982352331306193938911620728120537498098087099200301958466874771000687451301847359266754-49187590039535125951987380134882589439567227195013692231006837852847386192875140839226122174891376217921494817877253369498542074188542392087455774134995355878141646924561048463369962741837813262970855428491139534497215856651744571501108030438689721399347064521728275123901691438822576871224914785078523326200-True-1-4-4-4] PASSED [  5%]
tests/test_1024_small.py::test_xgcd[129740214252898973727546781863690236427505086565118448844733374574338935066795444075011107409251780249534288652292843452467431465455814122189084624424706410901040204839071069816085988326202894375752477038977995129925461413982352331306193938911620728120537498098087099200301958466874771000687451301847359266754-49187590039535125951987380134882589439567227195013692231006837852847386192875140839226122174891376217921494817877253369498542074188542392087455774134995355878141646924561048463369962741837813262970855428491139534497215856651744571501108030438689721399347064521728275123901691438822576871224914785078523326200-True-1-8-8-4] PASSED [  8%]
tests/test_1024_small.py::test_xgcd[129740214252898973727546781863690236427505086565118448844733374574338935066795444075011107409251780249534288652292843452467431465455814122189084624424706410901040204839071069816085988326202894375752477038977995129925461413982352331306193938911620728120537498098087099200301958466874771000687451301847359266754-49187590039535125951987380134882589439567227195013692231006837852847386192875140839226122174891376217921494817877253369498542074188542392087455774134995355878141646924561048463369962741837813262970855428491139534497215856651744571501108030438689721399347064521728275123901691438822576871224914785078523326200-True-1-4-4-8] PASSED [ 11%]
tests/test_1024_small.py::test_xgcd[129740214252898973727546781863690236427505086565118448844733374574338935066795444075011107409251780249534288652292843452467431465455814122189084624424706410901040204839071069816085988326202894375752477038977995129925461413982352331306193938911620728120537498098087099200301958466874771000687451301847359266754-49187590039535125951987380134882589439567227195013692231006837852847386192875140839226122174891376217921494817877253369498542074188542392087455774134995355878141646924561048463369962741837813262970855428491139534497215856651744571501108030438689721399347064521728275123901691438822576871224914785078523326200-True-0-2-2-4] PASSED [ 13%]
tests/test_1024_small.py::test_xgcd[129740214252898973727546781863690236427505086565118448844733374574338935066795444075011107409251780249534288652292843452467431465455814122189084624424706410901040204839071069816085988326202894375752477038977995129925461413982352331306193938911620728120537498098087099200301958466874771000687451301847359266754-49187590039535125951987380134882589439567227195013692231006837852847386192875140839226122174891376217921494817877253369498542074188542392087455774134995355878141646924561048463369962741837813262970855428491139534497215856651744571501108030438689721399347064521728275123901691438822576871224914785078523326200-True-0-4-4-4] PASSED [ 16%]
tests/test_1024_small.py::test_xgcd[129740214252898973727546781863690236427505086565118448844733374574338935066795444075011107409251780249534288652292843452467431465455814122189084624424706410901040204839071069816085988326202894375752477038977995129925461413982352331306193938911620728120537498098087099200301958466874771000687451301847359266754-49187590039535125951987380134882589439567227195013692231006837852847386192875140839226122174891376217921494817877253369498542074188542392087455774134995355878141646924561048463369962741837813262970855428491139534497215856651744571501108030438689721399347064521728275123901691438822576871224914785078523326200-True-0-8-8-4] PASSED [ 19%]
tests/test_1024_small.py::test_xgcd[129740214252898973727546781863690236427505086565118448844733374574338935066795444075011107409251780249534288652292843452467431465455814122189084624424706410901040204839071069816085988326202894375752477038977995129925461413982352331306193938911620728120537498098087099200301958466874771000687451301847359266754-49187590039535125951987380134882589439567227195013692231006837852847386192875140839226122174891376217921494817877253369498542074188542392087455774134995355878141646924561048463369962741837813262970855428491139534497215856651744571501108030438689721399347064521728275123901691438822576871224914785078523326200-True-0-4-4-8] PASSED [ 22%]
tests/test_1024_small.py::test_xgcd[129740214252898973727546781863690236427505086565118448844733374574338935066795444075011107409251780249534288652292843452467431465455814122189084624424706410901040204839071069816085988326202894375752477038977995129925461413982352331306193938911620728120537498098087099200301958466874771000687451301847359266754-49187590039535125951987380134882589439567227195013692231006837852847386192875140839226122174891376217921494817877253369498542074188542392087455774134995355878141646924561048463369962741837813262970855428491139534497215856651744571501108030438689721399347064521728275123901691438822576871224914785078523326200-False-0-2-2-4] PASSED [ 25%]
tests/test_1024_small.py::test_xgcd[129740214252898973727546781863690236427505086565118448844733374574338935066795444075011107409251780249534288652292843452467431465455814122189084624424706410901040204839071069816085988326202894375752477038977995129925461413982352331306193938911620728120537498098087099200301958466874771000687451301847359266754-49187590039535125951987380134882589439567227195013692231006837852847386192875140839226122174891376217921494817877253369498542074188542392087455774134995355878141646924561048463369962741837813262970855428491139534497215856651744571501108030438689721399347064521728275123901691438822576871224914785078523326200-False-0-4-4-4] PASSED [ 27%]
tests/test_1024_small.py::test_xgcd[129740214252898973727546781863690236427505086565118448844733374574338935066795444075011107409251780249534288652292843452467431465455814122189084624424706410901040204839071069816085988326202894375752477038977995129925461413982352331306193938911620728120537498098087099200301958466874771000687451301847359266754-49187590039535125951987380134882589439567227195013692231006837852847386192875140839226122174891376217921494817877253369498542074188542392087455774134995355878141646924561048463369962741837813262970855428491139534497215856651744571501108030438689721399347064521728275123901691438822576871224914785078523326200-False-0-8-8-4] PASSED [ 30%]
tests/test_1024_small.py::test_xgcd[129740214252898973727546781863690236427505086565118448844733374574338935066795444075011107409251780249534288652292843452467431465455814122189084624424706410901040204839071069816085988326202894375752477038977995129925461413982352331306193938911620728120537498098087099200301958466874771000687451301847359266754-49187590039535125951987380134882589439567227195013692231006837852847386192875140839226122174891376217921494817877253369498542074188542392087455774134995355878141646924561048463369962741837813262970855428491139534497215856651744571501108030438689721399347064521728275123901691438822576871224914785078523326200-False-0-4-4-8] PASSED [ 33%]
tests/test_255_small.py::test_xgcd[37011419202199436911699116621961681821072612210737535774885147375630198915587-28579201187133985494884386205927947524432691876410703363446155211413454751348-True-1-2-2-4] PASSED [ 36%]
tests/test_255_small.py::test_xgcd[37011419202199436911699116621961681821072612210737535774885147375630198915587-28579201187133985494884386205927947524432691876410703363446155211413454751348-True-1-4-4-4] PASSED [ 38%]
tests/test_255_small.py::test_xgcd[37011419202199436911699116621961681821072612210737535774885147375630198915587-28579201187133985494884386205927947524432691876410703363446155211413454751348-True-1-8-8-4] PASSED [ 41%]
tests/test_255_small.py::test_xgcd[37011419202199436911699116621961681821072612210737535774885147375630198915587-28579201187133985494884386205927947524432691876410703363446155211413454751348-True-1-4-4-8] PASSED [ 44%]
tests/test_255_small.py::test_xgcd[37011419202199436911699116621961681821072612210737535774885147375630198915587-28579201187133985494884386205927947524432691876410703363446155211413454751348-True-0-2-2-4] PASSED [ 47%]
tests/test_255_small.py::test_xgcd[37011419202199436911699116621961681821072612210737535774885147375630198915587-28579201187133985494884386205927947524432691876410703363446155211413454751348-True-0-4-4-4] PASSED [ 50%]
tests/test_255_small.py::test_xgcd[37011419202199436911699116621961681821072612210737535774885147375630198915587-28579201187133985494884386205927947524432691876410703363446155211413454751348-True-0-8-8-4] PASSED [ 52%]
tests/test_255_small.py::test_xgcd[37011419202199436911699116621961681821072612210737535774885147375630198915587-28579201187133985494884386205927947524432691876410703363446155211413454751348-True-0-4-4-8] PASSED [ 55%]
tests/test_255_small.py::test_xgcd[37011419202199436911699116621961681821072612210737535774885147375630198915587-28579201187133985494884386205927947524432691876410703363446155211413454751348-False-0-2-2-4] PASSED [ 58%]
tests/test_255_small.py::test_xgcd[37011419202199436911699116621961681821072612210737535774885147375630198915587-28579201187133985494884386205927947524432691876410703363446155211413454751348-False-0-4-4-4] PASSED [ 61%]
tests/test_255_small.py::test_xgcd[37011419202199436911699116621961681821072612210737535774885147375630198915587-28579201187133985494884386205927947524432691876410703363446155211413454751348-False-0-8-8-4] PASSED [ 63%]
tests/test_255_small.py::test_xgcd[37011419202199436911699116621961681821072612210737535774885147375630198915587-28579201187133985494884386205927947524432691876410703363446155211413454751348-False-0-4-4-8] PASSED [ 66%]
tests/test_512_small.py::test_xgcd[11131124899557068741862058561592518079504046495653300707145427789461633036882920487399788808158681316368204285944213603344797542642903805191868520283316585-3495104725420115573371166489124179573918088228342617275036983782911199651299585406475638639213621113074267389327609157584179448626288140116893135735248984-True-1-2-2-4] PASSED [ 69%]
tests/test_512_small.py::test_xgcd[11131124899557068741862058561592518079504046495653300707145427789461633036882920487399788808158681316368204285944213603344797542642903805191868520283316585-3495104725420115573371166489124179573918088228342617275036983782911199651299585406475638639213621113074267389327609157584179448626288140116893135735248984-True-1-4-4-4] PASSED [ 72%]
tests/test_512_small.py::test_xgcd[11131124899557068741862058561592518079504046495653300707145427789461633036882920487399788808158681316368204285944213603344797542642903805191868520283316585-3495104725420115573371166489124179573918088228342617275036983782911199651299585406475638639213621113074267389327609157584179448626288140116893135735248984-True-1-8-8-4] PASSED [ 75%]
tests/test_512_small.py::test_xgcd[11131124899557068741862058561592518079504046495653300707145427789461633036882920487399788808158681316368204285944213603344797542642903805191868520283316585-3495104725420115573371166489124179573918088228342617275036983782911199651299585406475638639213621113074267389327609157584179448626288140116893135735248984-True-1-4-4-8] PASSED [ 77%]
tests/test_512_small.py::test_xgcd[11131124899557068741862058561592518079504046495653300707145427789461633036882920487399788808158681316368204285944213603344797542642903805191868520283316585-3495104725420115573371166489124179573918088228342617275036983782911199651299585406475638639213621113074267389327609157584179448626288140116893135735248984-True-0-2-2-4] PASSED [ 80%]
tests/test_512_small.py::test_xgcd[11131124899557068741862058561592518079504046495653300707145427789461633036882920487399788808158681316368204285944213603344797542642903805191868520283316585-3495104725420115573371166489124179573918088228342617275036983782911199651299585406475638639213621113074267389327609157584179448626288140116893135735248984-True-0-4-4-4] PASSED [ 83%]
tests/test_512_small.py::test_xgcd[11131124899557068741862058561592518079504046495653300707145427789461633036882920487399788808158681316368204285944213603344797542642903805191868520283316585-3495104725420115573371166489124179573918088228342617275036983782911199651299585406475638639213621113074267389327609157584179448626288140116893135735248984-True-0-8-8-4] PASSED [ 86%]
tests/test_512_small.py::test_xgcd[11131124899557068741862058561592518079504046495653300707145427789461633036882920487399788808158681316368204285944213603344797542642903805191868520283316585-3495104725420115573371166489124179573918088228342617275036983782911199651299585406475638639213621113074267389327609157584179448626288140116893135735248984-True-0-4-4-8] PASSED [ 88%]
tests/test_512_small.py::test_xgcd[11131124899557068741862058561592518079504046495653300707145427789461633036882920487399788808158681316368204285944213603344797542642903805191868520283316585-3495104725420115573371166489124179573918088228342617275036983782911199651299585406475638639213621113074267389327609157584179448626288140116893135735248984-False-0-2-2-4] PASSED [ 91%]
tests/test_512_small.py::test_xgcd[11131124899557068741862058561592518079504046495653300707145427789461633036882920487399788808158681316368204285944213603344797542642903805191868520283316585-3495104725420115573371166489124179573918088228342617275036983782911199651299585406475638639213621113074267389327609157584179448626288140116893135735248984-False-0-4-4-4] PASSED [ 94%]
tests/test_512_small.py::test_xgcd[11131124899557068741862058561592518079504046495653300707145427789461633036882920487399788808158681316368204285944213603344797542642903805191868520283316585-3495104725420115573371166489124179573918088228342617275036983782911199651299585406475638639213621113074267389327609157584179448626288140116893135735248984-False-0-8-8-4] PASSED [ 97%]
tests/test_512_small.py::test_xgcd[11131124899557068741862058561592518079504046495653300707145427789461633036882920487399788808158681316368204285944213603344797542642903805191868520283316585-3495104725420115573371166489124179573918088228342617275036983782911199651299585406475638639213621113074267389327609157584179448626288140116893135735248984-False-0-4-4-8] PASSED [100%]

============================ 36 passed, 5580 deselected in 196.20s (0:03:16) =============================

These tests can also be run individually.

pytest tests/test_255_small.py -v
pytest tests/test_512_small.py -v
pytest tests/test_1024_small.py -v

You can also run the basic test, which executes with small input values on the hardware with 255-bit, 512-bit, and 1024-bit input bitwidth support. This test also checks that the order of the two inputs does not affect the final result. In total, it executes 180 tests (5 small input pairs, 12 hardware combinations with reduction factor pair and constant_time support parameters, 3 input bitwidths). This is expected to take longer than the small tests to run.

pytest tests/test_basic.py

Here is how to execute on more reasonably sized inputs for the hardware datapath bitwidth (e.g., 255-bit inputs on 255-bit hardware datapath and similarly for 512 bits and 1024 bits). This will run 1800 tests (50 random input pairs, 12 hardware combinations with reduction factor pair and constant_time support parameters, 3 input bitwidths).

cd $TOP
pytest tests/ -k "medium" -v

These tests can also be run individually.

pytest tests/test_255_medium.py -v
pytest tests/test_512_medium.py -v
pytest tests/test_1024_medium.py -v

Finally, you can also execute a more extensive test suite, again with reasonably sized inputs for the hardware datapath bitwidth (for 255-bit, 512-bit, and 1024-bit hardware datapaths). This will run 3600 tests (100 random input pairs, 12 hardware combinations with reduction factor pair and constant_time support parameters, 3 input bitwidths).

cd $TOP
pytest tests/ -k "large" -v

These tests can also be run individually.

pytest tests/test_255_large.py -v
pytest tests/test_512_large.py -v
pytest tests/test_1024_large.py -v

These tests may take a while to run. For example, tests/test_1024_large.py runs for 7 hours on our machines.

Test Setup Parameters:

Name	Type	Default	Description
bit_length	int	1024	Number of bits for XGCD inputs.
use_seed	bool	True	Whether to set random seed to 0.
num_inputs	bool	True	Number of input pairs to test reduction factor and constant time support combinations with.

Commonly Used Testbench Parameters in $TOP/tests/xgcd_test_core.py (helper function used in all tests):

Debugging

To debug a pair of inputs in a test, such as tests/test_basic.py:
- Make sure the inputs and hardware parameterization is correctly specified in the test function call at the bottom of the file
- Set debug_print=True (as is done in the test function call at the bottom of the file) to dump waveforms and other generated information
- Execute the script, with an example below

python tests/test_basic.py

Debugging Information:
- We recommend debugging by comparing the waveforms with the functional model output.
- The waveforms will be at $TOP/XGCD_debug/logs/XGCDTop_W.vcd. We use gtkwave (http://gtkwave.sourceforge.net/) to view the waveforms.
- The functional model output will be displayed on the terminal that the test command above is run on. Every iteration has an output line such as a 31 b -2 y 4773 n -24685 u 2241 l -11590 delta 0 that shows the current state of all the variables. This can be compared to the value of these variables every cycle in the waveforms.
- The functional model output also prints other intermediate variables and messages. The level of prints can be controlled by commenting out print_debug function calls in xgcd/functional_models/xgcd_model.py and xgcd/functional_models/xgcd_helper.py
- Our RTL is written in Kratos. The generated Verilog will be at $TOP/XGCD_debug/XGCDTop_W.v.

Physical Design

We use mflowgen (https://github.com/mflowgen/mflowgen, a tool presented at DAC 2022) to assemble our physical design flow. We also draw from mflowgen's common library of reusable scripts to implement much of our flow.

We provide scripts with mflowgen for commercial tools, including Synopsys DC for synthesis and Cadence Innovus 19.10.000 for floorplan, power, place, clock tree synthesis, and route. We use the following tool versions.

Name	Version	Purpose
mflowgen	0.4.0	Physical design flow organization
Synopsys DC	2019.03	Synthesis
Cadence Innovus	19.10.000	Floorplan, power, place, clock tree synthesis, and route

The information below contains various mflowgen commands. For more information on mflowgen commands and features, please see https://mflowgen.readthedocs.io/en/latest/.

The hardware design parameters can be set in the parameters dict in $TOP/xgcd/physical_design/sky130/design/construct.py and $TOP/xgcd/physical_design/tsmc16/design/construct.py for the physical design flow.

bit_length
reduction_factor_even
reduction_factor_odd
constant_time support

The following parameters in construct.py are REQUIRED. The default values can be used for the first two, but DW_ALL_PATH must be set specifically depending on where your files are.

design_name -- XGCDWrapper by default
clock_period -- in nanoseconds
DW_ALL_PATH -- Path to sim_ver of DW_tap.v, typically in the form of dw/sim_ver with the DW_tap.v file in sim_ver. Do not include the file itself (e.g., DW_tap.v) in this parameter.

SKY 130nm

The SKY130 process node is an open-source hybrid 180nm-130nm technology (https://github.com/google/skywater-pdk). Although our paper focused on TSMC16 results, open-source technologies enable far more accessible replication of results. The SKY130 technology is a real technology that can be fabricated. In this section, we use commercial tools together with this open-source technology.

Here is how to see the steps in the physical design flow with mflowgen:

For Design (1):

cd $TOP/xgcd/physical_design/sky130/build
mflowgen run --design ../design/construct_design1.py
make status

For Design (2):

cd $TOP/xgcd/physical_design/sky130/build
mflowgen run --design ../design/construct_design2.py
make status

For any design, with parameters set in $TOP/xgcd/physical_design/sky130/design/construct.py:

cd $TOP/xgcd/physical_design/sky130/build
mflowgen run --design ../design
make status

To generate the RTL (note that "2" corresponds to the number associated with the rtl node in make status and make list:

make 2

This step requires Genesis2 r11879 (https://github.com/StanfordVLSI/Genesis2) for JTAG
This step generates and wires the XGCD hardware, the JTAG modules, and the reset synchornizer. The commands that are run are listed in $TOP/xgcd/physical_design/sky130/design/rtl/configure.yml.
The top level pin names match with the Caravel wrapper provided by Efabless and Google for SKY130 projects.

To run the physical design flow (note that prerequisite steps will be automatically run, and you can simply run the final step):

# adk
make 3
# synthesis
make 5
# place
make 9
# clock tree synthesis
make 10
# route
make 12
# signoff
make 14

Final GDS location: $TOP/xgcd/physical_design/sky130/build/14-decapsignoff/design.gds.gz

TSMC 16nm

The TSMC16 process node is a 16nm commercial technology. We cannot release PDK files for this technology and PDK-specific scripts since they are under NDA. Thus, we provide our physical design scripts up to the synthesis step. The scripts we provide DO NOT CONTAIN any confidential information.

To see the steps in the physical design flow with mflowgen:

For Design (1):

cd $TOP/xgcd/physical_design/tsmc16/build
mflowgen run --design ../design/construct_design1.py
make status

For Design (2):

cd $TOP/xgcd/physical_design/tsmc16/build
mflowgen run --design ../design/construct_design2.py
make status

For any design, with parameters set in $TOP/xgcd/physical_design/tsmc16/design/construct.py:

cd $TOP/xgcd/physical_design/tsmc16/build
mflowgen run --design ../design
make status

To generate the RTL (again, note that "2" corresponds to the number associated with the rtl node in make status and make list:

make 2

This step requires Genesis2 r11879 (https://github.com/StanfordVLSI/Genesis2) for JTAG
This step generates and wires the XGCD hardware, the JTAG modules, and the reset synchornizer. The commands that are run are listed in $TOP/xgcd/physical_design/tsmc16/design/rtl/configure.yml.
The top level pin names match with the Caravel wrapper provided by Efabless and Google for SKY130 projects.

To run synthesis with Synopsys DC (similarly to the RTL step, note that "4" corresponds to the synopsys-dc-synthesis node in make status and make list:

make 4

FPGA Synthesis, Place and Route

We use Vivado v2019.2 (module load vivado/2019.2) and provide scripts with mflowgen (please see Physical Design section above for more information on mflowgen).

To run synthesis, place, and route for Design (1):

cd $TOP/xgcd/fpga/build_1024
mflowgen run --design ../design_1024
make 2

This command takes 3.5 hours on our machine.

Similarly, to run synthesis, place, and route for Design (2):

cd $TOP/xgcd/fpga/build_255
mflowgen run --design ../design_255
make 2

This command takes 30 minutes on our machine.

The commands below are for Design (1), corresponding to $TOP/xgcd/fpga/design_1024 and $TOP/xgcd/fpga/build_1024. The same commands can be used for Design (2), by replacing those paths with $TOP/xgcd/fpga/design_255 and $TOP/xgcd/fpga/build_255 respectively instead.

To view timing and utilization reports:

cd $TOP/xgcd/fpga/build_1024
cd 2-fpga_synth_pnr/reports/post-route

timing-summary.rpt - timing report
util-summary.rpt - utilization report

To change the clock period or FPGA part used for synthesis, place, and route, please edit the corresponding parameters (clock_period and fpga_part) in $TOP/xgcd/fpga/design_1024/construct.py and then run

cd $TOP/xgcd/fpga/build_1024
mflowgen run --update

To verify that the clock period and/or FPGA part parameters have been updated, you can use

make info-2

and see that the parameter values have been changed for this mflowgen step.

Finally, you can re-run this mflowgen step with the updated parameters.

make clean-all
make 2

The timing and utilization reports will be updated in the same paths as listed above.

Common Issues

Here are some common errors you may run into when running tests.

Error:

  if not follow_symlinks and _islink(src):
      os.symlink(os.readlink(src), dst)
  else:
>           with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
E           IsADirectoryError: [Errno 21] Is a directory: '/Users/sreedhar/xgcd'

/usr/local/Cellar/python@3.9/3.9.1_4/Frameworks/Python.framework/Versions/3.9/lib/python3.9/shutil.py:264: IsADirectoryError

Solution

If using the DesignWare CSA module (not our handwritten stub), please make sure that $DW_PATH includes DW01_CSA.v (it should include the filename /path/to/src_ver/DW01_csa.v, instead of only the directory /path/to/src_ver.

Error:

---------------------------------------------------------- Captured stdout call ----------------------------------------------------------
Generating 1024-bit XGCD with constant-time-support
Using our functional model
Running command: verilator -Wall -Wno-INCABSPATH -Wno-DECLFILENAME -Wno-fatal --cc XGCDTop_1024_W.v --exe XGCDTop_1024_W_driver.cpp --top-module XGCDTop_1024_W
<STDERR>
/bin/sh: verilator: command not found
</STDERR>

Solution:

Install Verilator (see "Required System Setup" section above).

Error:

mflowgen: command not found

Solution:

cd $TOP
source xgcd_env/bin/activate

Other notes:
- Unless otherwise specified, all commands should be run from $TOP.
- If using our handwritten stub (not the DesignWare CSA module), please make sure that DW_PATH is not set (unset DW_PATH)

Citation

@article{sreedhar2022fast,
  title={A Fast Large-Integer Extended GCD Algorithm and Hardware Design for Verifiable Delay Functions and Modular Inversion},
  author={Sreedhar, Kavya and Horowitz, Mark and Torng, Christopher},
  journal={IACR Transactions on Cryptographic Hardware and Embedded Systems (TCHES)},
  year={2022}
}

International Association for Cryptologic Research

International Association
for Cryptologic Research

Transactions on Cryptographic Hardware and Embedded Systems, Volume 2022

A Fast Large-Integer Extended GCD Algorithm and Hardware Design for Verifiable Delay Functions and Modular Inversion

README

A Fast Large-Integer Extended GCD Algorithm and Hardware Design for Verifiable Delay Functions and Modular Inversion

Table of Contents

Example Designs

Installation

Required System Setup

Repo Setup

Extended GCD

Directory Structure

Functional Model

Hardware Generator

Tests

Debugging

Physical Design

SKY 130nm

TSMC 16nm

FPGA Synthesis, Place and Route

Common Issues

Citation

International Association for Cryptologic Research

International Associationfor Cryptologic Research

Transactions on Cryptographic Hardware and Embedded Systems, Volume 2022

A Fast Large-Integer Extended GCD Algorithm and Hardware Design for Verifiable Delay Functions and Modular Inversion

README

A Fast Large-Integer Extended GCD Algorithm and Hardware Design for Verifiable Delay Functions and Modular Inversion

Table of Contents

Example Designs

Installation

Required System Setup

Repo Setup

Extended GCD

Directory Structure

Functional Model

Hardware Generator

Tests

Debugging

Physical Design

SKY 130nm

TSMC 16nm

FPGA Synthesis, Place and Route

Common Issues

Citation

International Association
for Cryptologic Research