8-bit Multiply: Difference between revisions

From NESdev Wiki
Jump to navigationJump to search
No edit summary
m (Thwaite moved to userspace)
 
(28 intermediate revisions by 4 users not shown)
Line 1: Line 1:
The following code multiplies two 8-bit integers (range 0...255) and outputs a 16-bit result using only real calculation, no lockup table so the size of the code is very small.
Since the 6502 CPU has no multiplication instruction, this feature has to be written in software.


<source lang="6502">
See also: [[Fast signed multiply]]
 
== tepples ==
This 8x8 multiply routine from ''[[User:Tepples/Thwaite|Thwaite]]'' does binary long multiplication (a.k.a. "the [[wikipedia:Ancient Egyptian multiplication#Russian_peasant_multiplication|Russian Peasant method]]").
 
<pre>
;;
; Multiplies two 8-bit factors to produce a 16-bit product
; in about 153 cycles.
; @param A one factor
; @param Y another factor
; @return high 8 bits in A; low 8 bits in $0000
;        Y and $0001 are trashed; X is untouched
.proc mul8
prodlo  = $0000
factor2 = $0001
 
  ; Factor 1 is stored in the lower bits of prodlo; the low byte of
  ; the product is stored in the upper bits.
  lsr a  ; prime the carry bit for the loop
  sta prodlo
  sty factor2
  lda #0
  ldy #8
loop:
  ; At the start of the loop, one bit of prodlo has already been
  ; shifted out into the carry.
  bcc noadd
  clc
  adc factor2
noadd:
  ror a
  ror prodlo  ; pull another bit out for the next iteration
  dey        ; inc/dec don't modify carry; only shifts and adds do
  bne loop
  rts
.endproc
</pre>
 
This is a similar technique to Bregalad's method below, but is further optimized to keep part of the running total in the accumulator.
 
Assuming no page crossings and zero page, this routine takes 137-169 cycles, not counting the JSR to call it. (Each non-zero bit in A adds 4 cycles.)
 
== tepples unrolled ==
The above code can be made to run slightly faster by both unrolling the loop and pre-decrementing factor2 so that CLC isn't required. Note that the low byte is now returned in A, the high byte in Y, and that CA65 syntax is used.
<pre>
; @param A one factor
; @param Y another factor
; @return low 8 bits in A; high 8 bits in Y
mul8_multiply:
    lsr
    sta prodlo
    tya
    beq mul8_early_return
    dey
    sty factor2
    lda #0
.repeat 8, i
    .if i > 0
        ror prodlo
    .endif
    bcc :+
    adc factor2
:
    ror
.endrepeat
    tay
    lda prodlo
    ror
mul8_early_return:
    rts
</pre>
 
The code takes up 70 bytes and runs in 120 cycles or less.
 
== Bregalad ==
This routine by Bregalad is similar to tepples' method above, but is less optimized.
 
<pre>
;8-bit multiply
;8-bit multiply
;by Bregalad
;by Bregalad
;Enter with A,Y, numbers to multiply
;Enter with A,Y, numbers to multiply
;Output with YA = 16-bit result (X is unchanged)
;Output with YA = 16-bit result (X is unchanged)
Multiply
Multiply:
sty Factor ;Store input factor
sty Factor ;Store input factor
ldy #$00
ldy #$00
sty Res
sty Res
sty Res2 ;Clear result
sty Res2   ;Clear result
ldy #$08 ;Number of shifts needed
ldy #$08   ;Number of shifts needed


- lsr A ;Shift right input number
- lsr A       ;Shift right input number
bcc + ;Check if bit is set
bcc +       ;Check if bit is set
pha
pha
lda Res2
lda Res2
clc
clc
adc Mult
adc Factor
sta Res2 ;If so add number to result
sta Res2   ;If so add number to result
pla
pla
+ lsr Res2 ;Shift result right
+ lsr Res2   ;Shift result right
ror Res
ror Res
dey
dey
Line 28: Line 106:
ldy Res2
ldy Res2
rts
rts
</source>
</pre>
 
An optimization for efficiency is made here; binary long multiplication requires adding one multiplicand to the result at various bit-shifts (i.e. multiply by each power of 2). The naive approach might maintain the value to add as a 16-bit value, left shifting it once each iteration to reach the next power of 2. This one, however, takes advantage of the input being only 8-bits wide, and instead pre-multiplies the result by 256 (8 bits), and each iteration instead right-shifts the result. After 8 iterations the pre-multiply is undone, and the advantage gained is that only the shift is 16-bit; adding the multiplicand remains an efficient 8-bit add.


== Russian Peasant Algorithm ==
Assuming no page crossings and zero page, this routine takes 184-320 cycles, not counting the JSR to call it. (Each non-zero bit in A adds 17 cycles.)
This is the Russian peasant algorithm which was suggested by Bob Rost in his NES Development class PDFs.  With 8 bit values the maximum number of iterations would be 7.  Try to have value2ptr point to the lesser of the two values to reduce iterations.
 
<source lang="6502">
== frantik ==
 
This routine by frantik is another binary long multiplication.
 
<pre>
; Multiply two bytes in memory using Russian peasant algorithm
; Multiply two bytes in memory using Russian peasant algorithm
; by frantik
; by frantik


 
; Accepts: value1 and value2, labels for bytes in memory
; Accepts: value1ptr and value2ptr, pointers to bytes in memory
value2 should ideally be the lesser of the two input values  
value2ptr should point to the lesser of the two values  
;  for increased efficiency
;  for increased efficiency
; Uses: $00, $01, $02 for temporary variables
; Uses: $00, $01, $02 for temporary variables
; Returns: 16 bit value in $00 and $01
; Returns: 16 bit value in $00 and $01


.macro multiply value1ptr, value2ptr
.macro multiply value1, value2


ret  = $00 ; return value
ret  = $00             ; return value
temp = $02 ; temp storage
temp = $02             ; temp storage


lda #$00 ; clear temporary variables
lda #$00       ; clear temporary variables
sta ret
sta ret
sta ret+1
sta ret+1
sta temp
sta temp
jmp start:
lda value2
bne +end"
jmp +start:


-loop:
-loop:
asl value1ptr ; double first value
asl value1      ; double first value
rol temp ; using 16bit precision
rol temp       ; using 16bit precision
lsr value2ptr ; halve second vale
lsr value2      ; halve second vale
start:
+start:
lda value2ptr ;
lda value2      ;
and #01 ; is new 2nd value an odd number?
and #01         ; is new 2nd value an odd number?
beq -loop: ;  
beq -loop:     ;  
clc ; if so, add new 1st value to running total
clc             ; if so, add new 1st value to running total
lda ret ;
lda ret         ;
adc value1ptr ;
adc value1      ;
sta ret ;
sta ret         ;
lda ret+1 ;
lda ret+1       ;
adc temp ;
adc temp       ;
sta ret+1 ;
sta ret+1       ;
lda value2ptr ;
lda value2      ;
cmp #01 ; is 2nd value 1?  if so, we're done
cmp #01         ; is 2nd value 1?  if so, we're done
bne -loop: ; otherwise, loop
bne -loop:     ; otherwise, loop
+end:
.endm
.endm
</source>
</pre>
 
Assuming no page crossings and zero page, this routine takes 17-403 cycles, though it is an in-place macro generating 46 bytes of new code each time it is used.
 
== MMC5 ==
The [[MMC5]] contains an 8x8-bit multiplier at $5205-$5206.
 
== External links ==
* [//forums.nesdev.org/viewtopic.php?f=2&t=16571 Forum post:] Fast multi, ... - Faster multiplication routine using lookup tables, by keldon.
* [//forums.nesdev.org/viewtopic.php?f=2&t=16616 Forum post:] What's the best (fastest) multiplication/division code?
* [//atariage.com/forums/topic/71120-6502-killer-hacks/page-2#entry896028 AtariAge forum post:] 6502 Killer hacks
 
[[Category:Arithmetic]]
[[Category:Arithmetic]]

Latest revision as of 00:35, 18 March 2023

Since the 6502 CPU has no multiplication instruction, this feature has to be written in software.

See also: Fast signed multiply

tepples

This 8x8 multiply routine from Thwaite does binary long multiplication (a.k.a. "the Russian Peasant method").

;;
; Multiplies two 8-bit factors to produce a 16-bit product
; in about 153 cycles.
; @param A one factor
; @param Y another factor
; @return high 8 bits in A; low 8 bits in $0000
;         Y and $0001 are trashed; X is untouched
.proc mul8
prodlo  = $0000
factor2 = $0001

  ; Factor 1 is stored in the lower bits of prodlo; the low byte of
  ; the product is stored in the upper bits.
  lsr a  ; prime the carry bit for the loop
  sta prodlo
  sty factor2
  lda #0
  ldy #8
loop:
  ; At the start of the loop, one bit of prodlo has already been
  ; shifted out into the carry.
  bcc noadd
  clc
  adc factor2
noadd:
  ror a
  ror prodlo  ; pull another bit out for the next iteration
  dey         ; inc/dec don't modify carry; only shifts and adds do
  bne loop
  rts
.endproc

This is a similar technique to Bregalad's method below, but is further optimized to keep part of the running total in the accumulator.

Assuming no page crossings and zero page, this routine takes 137-169 cycles, not counting the JSR to call it. (Each non-zero bit in A adds 4 cycles.)

tepples unrolled

The above code can be made to run slightly faster by both unrolling the loop and pre-decrementing factor2 so that CLC isn't required. Note that the low byte is now returned in A, the high byte in Y, and that CA65 syntax is used.

; @param A one factor
; @param Y another factor
; @return low 8 bits in A; high 8 bits in Y
mul8_multiply:
    lsr
    sta prodlo
    tya
    beq mul8_early_return
    dey
    sty factor2
    lda #0
.repeat 8, i
    .if i > 0
        ror prodlo
    .endif
    bcc :+
    adc factor2
:
    ror
.endrepeat
    tay
    lda prodlo
    ror
mul8_early_return:
    rts

The code takes up 70 bytes and runs in 120 cycles or less.

Bregalad

This routine by Bregalad is similar to tepples' method above, but is less optimized.

;8-bit multiply
;by Bregalad
;Enter with A,Y, numbers to multiply
;Output with YA = 16-bit result (X is unchanged)
Multiply:
	sty Factor  ;Store input factor
	ldy #$00
	sty Res
	sty Res2    ;Clear result
	ldy #$08    ;Number of shifts needed

-	lsr A       ;Shift right input number
	bcc +       ;Check if bit is set
	pha
	lda Res2
	clc
	adc Factor
	sta Res2    ;If so add number to result
	pla
+	lsr Res2    ;Shift result right
	ror Res
	dey
	bne -
	lda Res
	ldy Res2
	rts

An optimization for efficiency is made here; binary long multiplication requires adding one multiplicand to the result at various bit-shifts (i.e. multiply by each power of 2). The naive approach might maintain the value to add as a 16-bit value, left shifting it once each iteration to reach the next power of 2. This one, however, takes advantage of the input being only 8-bits wide, and instead pre-multiplies the result by 256 (8 bits), and each iteration instead right-shifts the result. After 8 iterations the pre-multiply is undone, and the advantage gained is that only the shift is 16-bit; adding the multiplicand remains an efficient 8-bit add.

Assuming no page crossings and zero page, this routine takes 184-320 cycles, not counting the JSR to call it. (Each non-zero bit in A adds 17 cycles.)

frantik

This routine by frantik is another binary long multiplication.

; Multiply two bytes in memory using Russian peasant algorithm
; by frantik

; Accepts: value1 and value2, labels for bytes in memory
;   value2 should ideally be the lesser of the two input values 
;   for increased efficiency
; Uses: $00, $01, $02 for temporary variables
; Returns: 16 bit value in $00 and $01

.macro multiply value1, value2

ret  = $00              ; return value
temp = $02              ; temp storage

	lda #$00        ; clear temporary variables
	sta ret
	sta ret+1
	sta temp
	lda value2
	bne +end"
	jmp +start:

-loop:
	asl value1      ; double first value
	rol temp        ; using 16bit precision
	lsr value2      ; halve second vale
+start:
	lda value2      ;
	and #01         ; is new 2nd value an odd number?
	beq -loop:      ; 
	clc             ; if so, add new 1st value to running total
	lda ret         ;
	adc value1      ;
	sta ret         ;
	lda ret+1       ;
	adc temp        ;
	sta ret+1       ;
	lda value2      ;
	cmp #01         ; is 2nd value 1?  if so, we're done
	bne -loop:      ; otherwise, loop
+end:
.endm

Assuming no page crossings and zero page, this routine takes 17-403 cycles, though it is an in-place macro generating 46 bytes of new code each time it is used.

MMC5

The MMC5 contains an 8x8-bit multiplier at $5205-$5206.

External links