6502 assembly optimisations: Difference between revisions

From NESdev Wiki
Jump to navigationJump to search
(Pedants will be pedants. The syntax I added is broken until someone installs Cite.php into this installation of MediaWiki. I may tomorrow.)
Line 28: Line 28:
=== Split word tables in high and low components ===
=== Split word tables in high and low components ===


This optimisation is not human friendly, makes the source code much bigger, but still makes the compiled size smaller and faster:
This optimisation is not human friendly, makes the source code much bigger, but still makes the compiled<ref name="compile_pedantry">Pedants may complain that "compile" is incorrect terminology for "translate a program written in assembly language into object code". But it is the most familiar term meaning "translate a program, no matter the language, into object code", and the same issues apply to code generators within a compiler that targets the 6502 as to programs written in 6502 assembly language.</ref> size smaller and faster:


  Example
  Example
Line 53: Line 53:


  PointerTableL
  PointerTableL
   .db <Pointer1, <Pointer2, ....
   .byt <Pointer1, <Pointer2, ....


  PointerTableH
  PointerTableH
   .db >Pointer1, >Pointer2, ....
   .byt >Pointer1, >Pointer2, ....


Savings : 2 bytes, 4 cycles
Savings : 2 bytes, 4 cycles
Line 236: Line 236:


Identity
Identity
     .db $00, $01, $02, $03, .....
     .byt $00, $01, $02, $03, .....


Savings : 2 cycles
Savings : 2 cycles
Line 339: Line 339:
   jmp (parameters)
   jmp (parameters)


Syscalls in Apple ProDOS and FDS BIOS work this way.
Syscalls in Apple ProDOS<ref>''ProDOS 8 Technical Reference Manual''</ref> and FDS BIOS work this way.


Savings : Complicated to estimate - only saves bytes if the trick is used fairly often across the program, in order to compensate for the size of the PassArguments routine.
Savings : Complicated to estimate - only saves bytes if the trick is used fairly often across the program, in order to compensate for the size of the PassArguments routine.
Line 345: Line 345:
== See also ==
== See also ==
*[[Synthetic instructions]]
*[[Synthetic instructions]]
== Notes ==
<references />

Revision as of 03:29, 8 June 2013

This page is about optimisations that are possible in assembly language, or various things one programmer has to keep in mind to make his code as optimal as possible.

There is two major kind of optimisations: Optimisation for speed (code executes in fewer cycles) and optimisation for size (the code takes fewer bytes).

There is also some other kinds of optimisations, such as constant-executing-time optimisation (code execute in a constant number of cycle no matter what it has to do), or RAM usage optimisation (use as few variables as possible). Because those optimisations have more to do with the algorithm than with its implementation in assembly, only speed and size optimisations will be discussed in this article.

Optimise both speed and size of the code

Avoid a jsr + rts chain

A tail call occurs when a subroutine finishes by calling another subroutine. This can be optimised into a JMP instruction:

MySubroutine
  lda Foo
  sta Bar
  jsr SomeRandomRoutine
  rts

becomes :

MySubroutine
  lda Foo
  sta Bar
  jmp SomeRandomRoutine

Savings : 9 cycles, 1 byte

Split word tables in high and low components

This optimisation is not human friendly, makes the source code much bigger, but still makes the compiled[1] size smaller and faster:

Example
  lda FooBar
  asl A
  tax
  lda PointerTable,X
  sta Temp
  lda PointerTable+1,X
  sta Temp+1
  ....
PointerTable
  .dw Pointer1, Pointer2, ....

Becomes :

Example
  ldx FooBar
  lda PointerTableL,X
  sta Temp
  lda PointerTableH,X
  sta Temp+1
  ....
PointerTableL
  .byt <Pointer1, <Pointer2, ....
PointerTableH
  .byt >Pointer1, >Pointer2, ....

Savings : 2 bytes, 4 cycles

Use Jump tables with RTS instruction instead of JMP indirect instruction

The so-called RTS Trick is a method of implementing jump tables by pushing a subroutine's entry point to the stack.

 Example
   ldx JumpEntry
   lda PointerTableL,X
   sta Temp
   lda PointerTableH,X
   sta Temp+1
   jmp [temp]

becomes :

 Example
   ldx JumpEntry
   lda PointerTableH,X
   pha
   lda PointerTableL,X
   pha
   rts

Note that PointerTable entries must point to one byte before the intended target when the RTS trick is used, because RTS will add 1 to the offset.

Savings : 4 bytes, 1 cycle.

Use a macro instead of a subroutine which is only called once

What is the point to call a subroutine if you only call it at a single place ? It would be more optimal to just insert the code where the subroutine is called. However this makes the code less structured and harder to understand.

How macros are used depends on the assembler so no code examples will be placed here to avoid further confusion.

Savings : 4 bytes, 12 cycles.

Arithmetic shift right

Compact way to divide a variable by 2 but keep its sign:

  cmp #$80
  ror A

Easily test 2 upper bits of a variable

   lda FooBar
   asl A         ;C = b7, N = b6

Alternative:

   bit Foobar    ;N = b7, V = b6, regardless of the value of A.

This can be e.g. used to poll the sprite-0-hit flag in $2002.

Negating a value without temporaries

  eor #$FF
  clc
  adc #1

Avoiding the need for CLC/SEC with ADC/SBC

If you are using ADC #imm, and you know your carry is already cleared, you do not need to do CLC. However, if you know that carry is set (for example, your code is located in a branch that is only ever entered with a BCS instruction), you can still avoid using CLC by just doing ADC #(value-1).

Similarly for SBC #imm: If you know know that carry is clear, you can still avoid using SEC by just doing SBC #(value+1).

Test bits in decreasing order

  lda foobar 
  bmi bit7_set 
  cmp #$40  ; we know that bit 7 wasn't set 
  bcs bit6_set 
  cmp #$20 
  bcs bit5_set 
            ; and so on

Or if you do not need to preserve the value of A:

  lda foobar 
  asl
  bcs bit7_set 
  asl
  bcs bit6_set 
  asl
  bcs bit5_set 
            ; and so on

This saves one byte per comparison, but 2 cycles more are used because of the extra ASL.

Test bits in increasing order

  lda foobar 
  lsr
  bcs bit0_set
  lsr
  bcs bit1_set
  lsr
  bcs bit2_set
            ; and so on

Note: This does not preserve the value of A.

Test bits without destroying the accumulator

The AND instruction can be used to test bits, but this destroy the value in the accumulator. The BIT can do this but it has no immediate adressing mode. A way to do it is to look for an opcode that has the bits you want to test, and use bit $xxxx on this opcode.

Example
   lda foobar
   and #$30
   beq bits_clear
   lda foobar
   ....
bits_clear
   lda foobar
   .....

becomes :

Example
   lda foobar
   bit _bmi_instruction ;equivalent to and #$30 but preserves A
   beq bits_clear
   ....
bits_clear
   .....
anywhere_in_the_code
    ....
_bmi_instruction    ;The BMI opcode = $30
    bmi somewhere

Savings : 2 cycles, 3 bytes

Use opposite rotate instead of a great number of shifts

To retrieve the 3 highest bits of a value in the low positions, you might be tempted to do 5 LSRs in a row. However, if you do not need the 5 top bits to be cleared, this is more efficient:

  lda value   ; got: 76543210 c
  rol         ; got: 6543210c 7
  rol         ; got: 543210c7 6 
  rol         ; got: 43210c76 5
  rol         ; got: 3210c765 4
  ; Only care about these ^^^

It works the same for replacing 5 ASLs with 4 RORs.

To replace 6 ASLs you can use 3 RORs:

  lda value   ; got: 76453210 c
  ror         ; got: c7654321 0
  ror         ; got: 0c765432 1
  ror         ; got: 10c76543 2
  and #$C0    ; got: 10------

Optimise speed at the expense of size

Those optimisations will make code faster to execute, but use more ROM.

Use identity look-up table instead of temp variable

 Example
    ldx Foo
    lda Bar
    stx Temp
    clc
    adc Temp    ;A = Foo + Bar

becomes :

 Example
    ldx Foo
    lda Bar
    clc
    adc Identity,X    ;A = Foo + Bar

Identity

    .byt $00, $01, $02, $03, .....

Savings : 2 cycles

Use look-up table to shift left 4 times

Provided that the high nibble is already cleared, you can shift left by 4 by making a multiplication look-up table.

Example:
  lda rownum
  asl A
  asl A
  asl A
  asl A
  rts

becomes

Example:
  ldx rownum
  lda times_sixteen,x
  rts

times_sixteen:
  .byt $00, $10, $20, $30, $40, $50, $60, $70
  .byt $80, $90, $A0, $B0, $C0, $D0, $E0, $F0

Savings: 4 cycles

Optimise code size at the expense of cycles

Those optimisations will produce code that is smaller but takes more cycles to execute.

Use the stack instead of a temp variable

Example
   lda Foo
   sta Temp
   lda Bar
   ....
   ....
   lda Temp   ;Restores Foo
   .....

becomes:

Example
   lda Foo
   pha
   lda Bar
   ....
   ....
   pla   ;Restores Foo
   .....

Savings : 2 bytes.

Use an "intelligent" argument system

Each time a routine needs multiple bytes of arguments (>3) it's hard to code it without wasting a lot of bytes.

 Example
    lda Argument1
    sta Temp
    lda Argument2
    ldx Argument3
    ldy Argument4
    jsr RoutineWhichNeeds4Args
    .....

Becomes something like:

 Example
    jsr PassArguments
    .dw RoutineWhichNeeds4Args
    .db Argument1, Argument2, Argument3, Argument4
    .db $00
    ....
PassArguments
  pla 
  tay 
  pla 
  pha                    ; put the high byte back 
  sta pointer+1 
  ldx #$00 
  beq SKIP 
LOOP 
  sta parameters,x 
  inx 
SKIP 
  iny                    ; pointing one short first pass here fixes that 
  lda (pointer),y 
  bne LOOP      
  iny 
  lda (pointer),y 
  beq LOOP
  dey                    ; fix the return address guess we can't return to a 
                        ;  break        
  tya 
  pha 
  jmp (parameters)

Syscalls in Apple ProDOS[2] and FDS BIOS work this way.

Savings : Complicated to estimate - only saves bytes if the trick is used fairly often across the program, in order to compensate for the size of the PassArguments routine.

See also

Notes

  1. Pedants may complain that "compile" is incorrect terminology for "translate a program written in assembly language into object code". But it is the most familiar term meaning "translate a program, no matter the language, into object code", and the same issues apply to code generators within a compiler that targets the 6502 as to programs written in 6502 assembly language.
  2. ProDOS 8 Technical Reference Manual