From: olson@umbc.edu (Bryan G. Olson)

Newsgroups: sci.crypt

Subject: A Quick IDEA, was: Speed of DES/IDEA implementations

Date: 7 Dec 1993 21:49:41 -0500



A while ago I posted a message claiming a speed of 238,000

bytes/sec for an implementation of IDEA on a 33Mh 486.  Below is

an explanation and some code to show how it works.  The basic

trick should be useful on many (but not all) processors.  I

expect only those familiar with IDEA and its reference

implementation will be able to follow the discussion.  See:



Lai, Xueja and Massey, James L.  A Proposal for a New Block

Encryption Standard, Eurocrypt 90



For those who have been asking for the code, sorry I kept

putting it off.  I wanted to get it out of Turbo Pascal

ideal-mode, but I never had the time.



Colin Plum wrote IDEA-386 code which is included in PGP

2.3a and uses the same tricks.  I don't know who's is

faster, but I expect they will be very close.  Now

here's how it's done.



A major bottleneck in software IDEA is the mul() routine, which

is used 34 times per 64 bit block.  The routine performs

multiplication in the multiplicative group mod 2^16+1.  The two

factors are each in a 16 bit word, and the output is also in a 16

bit word.  Note that 0 is not a member of the multiplicative

group and 2^16 does not fit in 16 bits. We therefor use the 0

word to represent 2^16.  Now group elements map one to one onto

all possible 16 bit words, since 2^16+1 is prime.



Here is (essentially) the reference implementation from [Lai].





unsigned mul( unsigned a, unsigned b ) {

  long int p ;

  long unsigned q ;

    if( a==0 ) p= 0x00010001 - b ;

    else if( b==0 ) p= 0x00010001 - a ;

    else {

        q= a*b;

        p= (q & 0xffff) - (q>>16)

        if( p<0 ) p= p + 0x00010001 ;

      }

    return (unsigned)(p & 0xffff) ;

}





Note the method of reducing a 32 bit word modulo 2^16-1.  We

subtract the high word from the low word, and add the modulus

back if the result is less than 0.  [Lai] contains a proof that

this works, and you can convince yourself fairly easily.



To speed up this routine, we note that the tests for a=0 and b=0

will rarely be false.  With the possible exception of the first 2

of the 34 multiplications, 0 should be no more likely than any of

the other 65535 numbers.  Note that if (and only if) either a or

b is 0 then q will also be 0, and we can check for this in one

instruction if our processor sets a zero flag for multiplication

(as the 68000 does but 80x86 does not).  



Fortunately p will also be zero after the subtraction if and only

if either a or b is 0.  Proof: r will be zero when the high order

word of q equals the low order word, and that happens when q is

divisible by 00010001 hex.  Since 00010001h = 2^16+1 is prime,

this happens if either a or b is a multiple of 2^16+1, and 0 is

the only such multiple which will fit in a 16 bit word.



The speed-up strategy is to proceed under the assumption that a

and b are not 0, check to be sure in one instruction, and

recompute if the assumption was wrong.  Here's some 8086

assembler code:



    mov  ax, [a]

    mul  [b]        ; ax is implied. q is now in DX AX

    sub  ax, dx     ; mod 2^16+1 

    jnz  not0       ; Jump if neither op was 0. Usually taken.



    mov  ax, 1      ; recompute result knowing one op is 0.

    sub  ax, [a]

    sub  ax, [b]

    jmp  out        ; Just jump over adding the carry.

not0:

    adc  ax, 0      ; If r<0 add 1, otherwise do nothing.

out:                ; Result is now in ax





Note that when r<0 we add 1 instead of 2^16+1 since the 2^16 part

overflows out of the result.  The "adc  ax, 0" does all the work

of checking for a negative result and adding the modulus if

needed.



The multiplication takes 9 instructions, 4 of which are rarely

executed.  I believe similar tricks are possible on many

processors.  The one drawback to the check-after-multiply tactic

is that we can't let the multiply overwrite the only copy of an

operand.



Note that most software implementations of IDEA will run at

slightly different speeds when 0's come up in the multiply

routine.  The reference implementation is faster on 0, this one

is faster on non-zero.  This may be a problem for some real-time

stuff, and also suggests an attack based on timing.



Finally, below is an implementation of the complete encryption

function in 8086 assembler, to replace the cipher_idea() function

in PGP.  It takes the same parameters as the function from PGP,

and uses the c language calling conventions.  I tested it using

the debug features of the idea.c file in PGP.  You will need to

add segment/assume directives.  This version uses no global data

and should be reentrant.



The handling of zero multipliers is outside the inner loop so

that a short conditional jump can loop back to the beginning. 

Forward conditional jumps are usually not taken and backward

jumps are usually taken, which is consistent with 586 branch

prediction (or so I've heard).  Stalls where the output of one

instruction is needed for the next seem unavoidable.



Last I heard, IDEA was patent pending.  My code is up for grabs,

although I would get a kick out being credited if you use it.

On the other hand Colin's code is already tested and ready

to assemble and link with PGP.



--Bryan



____________________CODE STARTS BELOW THIS LINE_________



;  Called as: asmcrypt( inbuff, outbuff, zkey ) just like PGP



PROC    _asmcrypt



        ; establish parameter and local space on stack

        ; follow c language calling conventions



        ARG  inblock:Word, outblock:Word, zkey:Word

        LOCAL sx1:Word,sx4:Word,skk:Word,done8:Word =stacksize



        push bp

        mov  bp, sp

        sub  sp, stacksize



 ;      push ax     ; My compiler assumes these are not saved.

 ;      push bx

 ;      push cx

 ;      push dx



        push si

        push di



; Put the 16 bit sub-blocks in registers and/or local variables

        mov  si, [inblock]

        mov  ax, [si]

        mov  [sx1], ax       ; x1  is in ax and sx1

        mov  di, [si+2]      ; x2  is in di

        mov  bx, [si+4]      ; x3  is in bx

        mov  dx, [si+6]

        mov  [sx4], dx       ; x4  is in sx4



        mov  si, [zkey]      ; si points to next subkey

        mov  [done8], si

        add  [done8], 96     ; we will be finished with 8 rounds

                             ; when si=done8



@@loop:                      ; 8 rounds of this

        add  di, [si+2]      ; x2+=zkey[2]  is in di

        add  bx, [si+4]      ; x3+=zkey[4]  is in bx



        mul  [Word si]       ;x1 *= zkey[0]

        sub  ax, dx

        jz  @@x1             ; if 0, use special case multiply

        adc  ax, 0

@@x1out:

        mov  [sx1], ax       ; x1 is in ax and sx1



        xor  ax, bx          ; ax= x1^x3

        mul  [Word si+8]     ; compute kk

        sub  ax, dx          ; if 0, use special case multiply

        jz  @@kk

        adc  ax, 0

@@kkout:

        mov  cx, ax          ; kk is in cx



        mov  ax, [sx4]       ; x4 *= zkey[6]

        mul  [Word si+6]

        sub  ax, dx

        jz   @@x4            ; if 0, use special case multiply

        adc  ax, 0

@@x4out:

        mov  [sx4], ax       ; x4 is in sx4 and ax



        xor  ax, di          ; x4^x2

        add  ax, cx          ; kk+(x2^x4)

        mul  [Word si+10]    ; compute t1

        sub  ax, dx

        jz  @@t1             ; if 0, use special case multiply

        adc  ax, 0

@@t1out:                     ; t1 is in ax



        add  cx, ax          ; t2 is in cx   kk+t1



        xor  [sx4], cx       ; x4 in sx4

        xor  di, cx          ; new x3 in di

        xor  bx, ax          ; new x2 in bx

        xchg bx, di          ; x2 in di, x3 in bx

        xor  ax, [sx1]       ; x1 in ax

        mov  [sx1], ax       ; and [sx1]



        add  si, 12          ; point to next subkey

        cmp  si, [done8]

        jne  @@loop

        jmp  @@out8



;------------------------------------------

; Special case multiplications, when one factor is 0



@@x1:   mov  ax, 1

        sub  ax, [sx1]

        sub  ax, [Word si]

        jmp  @@x1out



@@kk:   mov  ax, [sx1]       ; rebuild overwritten operand

        xor  ax, bx

        neg  ax

        inc  ax

        sub  ax, [si+8]

        jmp  @@kkout



@@x4:   mov  ax, 1

        sub  ax, [sx4]

        sub  ax, [Word si+6]

        jmp  @@x4out



@@t1:   mov  ax, [sx4]       ; rebuild

        xor  ax, di

        add  ax, cx

        neg  ax

        inc  ax

        sub  ax, [si+10]

        jmp  @@t1out



;---------------------------------------------------

;   8 rounds are done, now that extra pseudo-round



@@out8:

        push di

        mov  di, [outblock]



        mul  [Word si]

        sub  ax, dx

        jnz  @@o1n           ; jump over special case code

        mov  ax, 1

        sub  ax, [sx1]

        sub  ax, [si]

        jmp  @@o1out

@@o1n:  adc  ax, 0

@@o1out:  mov [di], ax       ; final ciphertext block 1



        mov  ax, [sx4]

        mul  [Word si+6]

        sub  ax, dx

        jnz  @@o4n           ; jump over special case code

        mov  ax, 1

        sub  ax, [sx4]

        sub  ax, [si+6]

        jmp  @@o4out

@@o4n:  adc  ax, 0

@@o4out: mov  [di+6], ax     ; final ciphertext block 4



        add  bx, [si+2]

        mov  [di+2], bx      ; final ciphertext block 2

        pop  ax

        add  ax, [si+4]

        mov  [di+4], ax      ; final ciphertext block 3



;  Restore the stack and return



        pop  di

        pop  si

;       pop  dx

;       pop  cx

;       pop  bx

;       pop  ax



        mov  sp, bp

        pop  bp

        ret

ENDP    _asmcrypt



