So I've run up against the point in work on elliptic curve diffie hellman in which I need to implement point addition and doubling in a manner that will not take 6 hours to complete. Here's the research I've done so far and what, from what I understand, is needed:

Point Addition

Code:

Point addition seems to be an operation relative to the slope of the line that links both points, defined:
P + Q = R
if P or Q are point at infinity (0,0?), R = the other point
if P == Q, double instead
if Xp == Xq, set R to point at infinity.
else:
slope = (Yp - Yq) / (Xp - Xq)
Xr = slope^2 + Xp - Xq
Yr = slope(Xp - Xr) - Yp


Point Doubling

Code:

P + P = R
compute slope as:
slope = (3(Xp)^2 + a) / 2(Yp)
use slope in the Xr, Yr computations for addition


I foresee a few issues trying to implement this:
1. Value overflow. From what I can tell, this doesn't involve modulo some type width, you need to work with exact values. So would this need a few bytes to buffer for overflow?
2. Computational cost for division. The curve I am using is a 224-bit curve, meaning this will be running that many divisions * 2. Isn't that expensive? I read something in the docs about deferring division, ie: maintaining a numerator and a denominator until the end and then computing division then? How does that work exactly? Is there a better way?
3. Fastest algorithm for multiplication? As of now I'm using the double-then-add method for multiplication, but is there a faster way? If so, how does it work.

My existing API uses:

Code:

#define ECC_PRV_KEY_SIZE   28
...
#define OVERFLOW_BYTES      4

// a bigint type constrainted to the private key size + 4 bytes for overflow buffering
typedef uint8_t BIGINT[ECC_PRV_KEY_SIZE + OVERFLOW_BYTES];

struct Point {
   BIGINT x;
   BIGINT y;
};

struct Curve {
   BIGINT polynomial;
   BIGINT coeff_a;
   BIIGINT coeff_b;
   Point G;
   BIGINT b_order;
   uint8_t cofactor;
};

I save all static data as big-endian encoded bytearrays, but when loading the data in for arithmetic, the array is reversed into little endian for more efficient mathing.

What I am seeking here is (1) recommendations for the best way to do the 4 issues I indicated above, as well as to fish for if anyone has, or can write up, fast bigint arithmetic routines supporting up to 32-bytes. Or at least can point to the best resources for implementing it, though I'd be doing it in C which would not have the ideal speed. It would seem we need addition, subtraction, multiplication, division, and squaring.
Timing resistance preferred.

For reference, the full API as it presently stands is:
https://github.com/acagliano/cryptx/blob/dev/src/ecdh.c

And I've taken some stabs at writing the bigint routines I think I can do myself:
Here they are (ps the shifting was written by Zeda. I've done add/sub/mul)
https://github.com/acagliano/cryptx/blob/dev/src/asm/ecdh_ops.asm
Update
With a lot of help from the likes of calc84maniac and jacobly, I have managed to work out bigint addition and multiplication. I'll post the code here if anyone else wants to recommend fixes or optimizations.
I am still missing multiplicative inversion. To that end, I will post the algorithm based on my reference for anyone who may have a fast way to do it.

Addition/Subtraction

Code:

; bigint_add(uint8_t *op1, uint8_t *op2);
; hard limit to 32 bytes
; output in op1
; addition over a galois field of form GF(2^m) is mod 2 or just xor
_bigint_add:
   ti._frameset0
   ld hl, (ix + 3)      ; op2
   ld de, (ix + 6)      ; op1
   ld b, 32
.loop:
   ld a,(de)
   ;adc a,(hl)
   xor (hl)
   ld (de),a
   inc hl
   inc de
   djnz .loop
   ld sp, ix
   pop ix
   ret

; apparently they are the same
_bigint_sub := _bigint_add


Multiplication

Code:

; bigint_mul(uint8_t *op1, uint8_t *op2)
; multiplication is add then double, then a polynomial reduction
_bigint_mul:
   ld hl, 32
   ti._frameset
   lea de, ix - 32      ; stack mem?
   ld hl, (ix + 9)      ; op1 (save a copy)
   ld bc, 32
   ldir            ; ix - 32 = tmp = op1
   
   ; zero out op1
   ld de, (ix + 9)      ; op 1
   ld a, 0
   ld (de), a
   inc de
   ld hl, (ix + 9)
   ld bc, 31
   ldir            ; op1 = res = 0
   
   ld hl, (ix + 6)      ; op2 = for bit in bits
   ld c, 32
.loop_op2
   ld a, (hl)
   ld b, 8
.loop_bits_in_byte:
   rra
   push af
      sbc a,a
      push bc, hl
         ld c,a
         
         ; add op1 (res) + tmp
         ld hl, (ix +9)      ; hl = op1 (dest)
         lea de, ix - 32      ; de = tmp (src)
         ld b, 32
.loop_add:
         ld a, (de)
         and a, c
         xor a, (hl)
         ld (hl), a
         inc hl
         inc de
         djnz .loop_add
      
         ; now double tmp
         lea hl, ix - 32      ; tmp in hl
         ld b, 32
         or a            ; reset carry
.loop_mul2:
         inc hl
         rl (hl)
         djnz .loop_mul2
         
         ; now xor with polynomial if tmp degree too high
         ; this means timing analysis will leak polynomial info
         ; however, this is a public spec and therefore not
         ; implementation-breaking
         bit 1, (ix - 4)      ; polynomial is 233 bits, check 234th bit
         jr z, .no_xor_poly

         ; xor byte 1 (little-endian encoding)
         ld a, (ix - 32 + 1)
         xor 2
         ld (ix - 32 + 1), a
         
         ; xor byte 21 (little endian encoding)
         ld a, (ix - 32 + 21)
         xor 4
         ld (ix - 32 + 21), a
         
         ; xor byte 28 (little endian encoding)
         ld a, (ix - 32 + 28)
         xor 1
         ld (ix - 32 + 28), a
         
.no_xor_poly
      pop hl,bc
   pop af
   djnz .loop_bits_in_byte
   inc hl
   dec c
   jr nz, .loop_op2
   ld sp, ix
   pop ix
   ret
 


And lastly:
Algorithm for Mult Inv

Code:

_bigint_invert:
   ; tmp = op
   ; v = polynomial
   ; g = 0
   ; res = 1
   ; while(tmp != 1)
   ;    i = degree(tmp) - degree(v)
   ;   if( i < 0 )
   ;      swap tmp1, v
   ;      swap g, res
   ;      i = -i
   ;   h = left_shift v by i bits
   ;   add tmp, h
   ;   h = left_shift g by i bits
   ;   add res, h
        ; return res
So I have completed the bigint routine for multiplicative inverse over a finite field of form F2m.

The code is here: https://github.com/acagliano/cryptx/blob/stable/encrypt/encrypt.asm#L6231
There are some issues with the algorithm. For reasons currently unknown to me the algorithm seems to work fine for powers of 2, but loop infinitely for anything that is not an even power of two. Several days of debugging have brought it to its current state from infinitely looping for everything. If anyone is good with this kind of math and wants to take a stab at determining how dumb I am, feel free. Please and thanks.
  
Register to Join the Conversation
Have your own thoughts to add to this or any other topic? Want to ask a question, offer a suggestion, share your own programs and projects, upload a file to the file archives, get help with calculator and computer programming, or simply chat with like-minded coders and tech and calculator enthusiasts via the site-wide AJAX SAX widget? Registration for a free Cemetech account only takes a minute.

» Go to Registration page
Page 1 of 1
» All times are UTC - 5 Hours
 
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum

 

Advertisement