6502 assembly optimisations: Difference between revisions

From NESdev Wiki
Jump to navigationJump to search
No edit summary
No edit summary
Line 12: Line 12:
<span id="Tail call"></span>A [[wikipedia:tail call|tail call]] occurs when a subroutine finishes by calling another subroutine. This can be optimised into a JMP instruction:
<span id="Tail call"></span>A [[wikipedia:tail call|tail call]] occurs when a subroutine finishes by calling another subroutine. This can be optimised into a JMP instruction:


MySubroutine
<source lang="6502tasm">
  lda Foo
MySubroutine
  sta Bar
  lda Foo
  jsr SomeRandomRoutine
  sta Bar
  rts
  jsr SomeRandomRoutine
  rts
</source>


becomes :
becomes :
MySubroutine
 
  lda Foo
<source lang="6502tasm">
  sta Bar
MySubroutine
  jmp SomeRandomRoutine
  lda Foo
  sta Bar
  jmp SomeRandomRoutine
</source>


Savings : 9 cycles, 1 byte
Savings : 9 cycles, 1 byte
Line 30: Line 35:
This optimisation is not human friendly, makes the source code much bigger, but still makes the compiled<ref name="compile_pedantry">Pedants may complain that "compile" is incorrect terminology for "translate a program written in assembly language into object code". But it is the most familiar term meaning "translate a program, no matter the language, into object code", and the same issues apply to code generators within a compiler that targets the 6502 as to programs written in 6502 assembly language.</ref> size smaller and faster:
This optimisation is not human friendly, makes the source code much bigger, but still makes the compiled<ref name="compile_pedantry">Pedants may complain that "compile" is incorrect terminology for "translate a program written in assembly language into object code". But it is the most familiar term meaning "translate a program, no matter the language, into object code", and the same issues apply to code generators within a compiler that targets the 6502 as to programs written in 6502 assembly language.</ref> size smaller and faster:


Example
<source lang="6502tasm">
  lda FooBar
Example
  asl A
  lda FooBar
  tax
  asl A
  lda PointerTable,X
  tax
  sta Temp
  lda PointerTable,X
  lda PointerTable+1,X
  sta Temp
  sta Temp+1
  lda PointerTable+1,X
  ....
  sta Temp+1
  ....


PointerTable
PointerTable
  .dw Pointer1, Pointer2, ....
  .dw Pointer1, Pointer2, ....
</source>


Becomes :
Becomes :
Example
  ldx FooBar
  lda PointerTableL,X
  sta Temp
  lda PointerTableH,X
  sta Temp+1
  ....


PointerTableL
<source lang="6502tasm">
  .byt <Pointer1, <Pointer2, ....
Example
  ldx FooBar
  lda PointerTableL,X
  sta Temp
  lda PointerTableH,X
  sta Temp+1
  ....
 
PointerTableL
  .byt <Pointer1, <Pointer2, ....


PointerTableH
PointerTableH
  .byt >Pointer1, >Pointer2, ....
  .byt >Pointer1, >Pointer2, ....
</source>


Some assemblers may have a way to implement a macro to automatically make the table coded like this (Unofficial MagicKit Assembler is one such program).
Some assemblers may have a way to implement a macro to automatically make the table coded like this (Unofficial MagicKit Assembler is one such program).
Line 66: Line 76:
<span id="RTS Trick"></span>The so-called [[RTS Trick]] is a method of implementing jump tables by pushing a subroutine's entry point to the stack.
<span id="RTS Trick"></span>The so-called [[RTS Trick]] is a method of implementing jump tables by pushing a subroutine's entry point to the stack.


<pre>
<source lang="6502tasm">
; This:
; This:


Line 84: Line 94:
   pha
   pha
   rts
   rts
</pre>
</source>


Note that PointerTable entries must point to one byte ''before'' the intended target when the RTS trick is used, because RTS will add 1 to the offset.
Note that PointerTable entries must point to one byte ''before'' the intended target when the RTS trick is used, because RTS will add 1 to the offset.
Line 91: Line 101:


To squeeze slightly more, it's possible to combine this with the tail call optimization:
To squeeze slightly more, it's possible to combine this with the tail call optimization:
<pre>
 
<source lang="6502tasm">
; This:
; This:


Line 110: Line 121:
   pha
   pha
   jmp SomeOtherFunction
   jmp SomeOtherFunction
</pre>
</source>
Here, the CPU runs SomeOtherFunction, then returns to the function from the jump table, then returns to what called this dispatcher.
Here, the CPU runs SomeOtherFunction, then returns to the function from the jump table, then returns to what called this dispatcher.


Line 130: Line 141:
Compact way to divide a variable by 2 but keep its sign:
Compact way to divide a variable by 2 but keep its sign:


<source lang="6502tasm">
   cmp #$80
   cmp #$80
   ror A
   ror A
</source>


=== Easily test 2 upper bits of a variable ===
=== Easily test 2 upper bits of a variable ===


<source lang="6502tasm">
     lda FooBar
     lda FooBar
     asl A        ;C = b7, N = b6
     asl A        ;C = b7, N = b6
</source>


Alternative:
Alternative:


<source lang="6502tasm">
     bit Foobar    ;N = b7, V = b6, regardless of the value of A.
     bit Foobar    ;N = b7, V = b6, regardless of the value of A.
</source>


This can be e.g. used to poll the sprite-0-hit flag in $2002.
This can be e.g. used to poll the sprite-0-hit flag in $2002.
Line 146: Line 163:
=== Negating a value without temporaries ===
=== Negating a value without temporaries ===


<source lang="6502tasm">
   eor #$FF
   eor #$FF
   clc
   clc
   adc #1
   adc #1
</source>


=== Avoiding the need for CLC/SEC with ADC/SBC ===
=== Avoiding the need for CLC/SEC with ADC/SBC ===
Line 162: Line 181:


=== Test bits in decreasing order ===
=== Test bits in decreasing order ===
<source lang="6502tasm">
   lda foobar  
   lda foobar  
   bmi bit7_set  
   bmi bit7_set  
Line 169: Line 190:
   bcs bit5_set  
   bcs bit5_set  
             ; and so on
             ; and so on
</source>


Or if you do not need to preserve the value of A:
Or if you do not need to preserve the value of A:
<source lang="6502tasm">
   lda foobar  
   lda foobar  
   asl
   asl
Line 179: Line 203:
   bcs bit5_set  
   bcs bit5_set  
             ; and so on
             ; and so on
</source>


This saves one byte per comparison, but 2 cycles more are used because of the extra ASL.
This saves one byte per comparison, but 2 cycles more are used because of the extra ASL.


=== Test bits in increasing order ===
=== Test bits in increasing order ===
<source lang="6502tasm">
   lda foobar  
   lda foobar  
   lsr
   lsr
Line 191: Line 218:
   bcs bit2_set
   bcs bit2_set
             ; and so on
             ; and so on
</source>


Note: This does not preserve the value of A.
Note: This does not preserve the value of A.
Line 199: Line 227:
A way to do it is to look for an opcode that has the bits you want to test, and use bit $xxxx on this opcode.
A way to do it is to look for an opcode that has the bits you want to test, and use bit $xxxx on this opcode.


Example
<source lang="6502tasm">
    lda foobar
Example
    and #$30
  lda foobar
    beq bits_clear
  and #$30
    lda foobar
  beq bits_clear
    ....
  lda foobar
  ....


bits_clear
bits_clear
    lda foobar
  lda foobar
    .....
  .....
</source>


becomes :
becomes :
Example
    lda foobar
    bit _bmi_instruction ;equivalent to and #$30 but preserves A
    beq bits_clear
    ....


bits_clear
<source lang="6502tasm">
    .....
Example
  lda foobar
  bit _bmi_instruction ;equivalent to and #$30 but preserves A
  beq bits_clear
  ....
 
bits_clear
  .....


anywhere_in_the_code
anywhere_in_the_code
    ....
    ....
_bmi_instruction    ;The BMI opcode = $30
_bmi_instruction    ;The BMI opcode = $30
    bmi somewhere
    bmi somewhere
</source>


Savings : 2 cycles, 3 bytes
Savings : 2 cycles, 3 bytes
Line 230: Line 263:


To retrieve the 3 highest bits of a value in the low positions, you might be tempted to do 5 LSRs in a row. However, if you do not need the 5 top bits to be cleared, this is more efficient:
To retrieve the 3 highest bits of a value in the low positions, you might be tempted to do 5 LSRs in a row. However, if you do not need the 5 top bits to be cleared, this is more efficient:
<pre> lda value  ; got: 76543210 c
<source lang="6502tasm">
  lda value  ; got: 76543210 c
   rol        ; got: 6543210c 7
   rol        ; got: 6543210c 7
   rol        ; got: 543210c7 6  
   rol        ; got: 543210c7 6  
   rol        ; got: 43210c76 5
   rol        ; got: 43210c76 5
   rol        ; got: 3210c765 4
   rol        ; got: 3210c765 4
   ; Only care about these ^^^</pre>
   ; Only care about these ^^^
</source>


It works the same for replacing 5 ASLs with 4 RORs.
It works the same for replacing 5 ASLs with 4 RORs.


To replace 6 ASLs you can use 3 RORs:
To replace 6 ASLs you can use 3 RORs:
<pre>
<source lang="6502tasm">
   lda value  ; got: 76453210 c
   lda value  ; got: 76453210 c
   ror        ; got: c7654321 0
   ror        ; got: c7654321 0
   ror        ; got: 0c765432 1
   ror        ; got: 0c765432 1
   ror        ; got: 10c76543 2
   ror        ; got: 10c76543 2
   and #$C0    ; got: 10------</pre>
   and #$C0    ; got: 10------
</source>


== Optimise speed at the expense of size ==
== Optimise speed at the expense of size ==
Line 253: Line 289:
=== Use identity look-up table instead of temp variable ===
=== Use identity look-up table instead of temp variable ===


  Example
<source lang="6502tasm">
    ldx Foo
Example
    lda Bar
    ldx Foo
    stx Temp
    lda Bar
    clc
    stx Temp
    adc Temp    ;A = Foo + Bar
    clc
    adc Temp    ;A = Foo + Bar
</source>


becomes :
becomes :
  Example
    ldx Foo
    lda Bar
    clc
    adc Identity,X    ;A = Foo + Bar


  Identity
<source lang="6502tasm">
    .byt $00, $01, $02, $03, .....
Example
    ldx Foo
    lda Bar
    clc
    adc Identity,X    ;A = Foo + Bar
 
Identity
    .byt $00, $01, $02, $03, .....
</source>


If your program is very large, it is possible that this way uses slightly less ROM; also, it might save one byte of RAM in some circumstances.
If your program is very large, it is possible that this way uses slightly less ROM; also, it might save one byte of RAM in some circumstances.
Line 276: Line 317:
=== Use look-up table to shift left 4 times ===
=== Use look-up table to shift left 4 times ===
Provided that the high nibble is already cleared, you can shift left by 4 by making a multiplication look-up table.
Provided that the high nibble is already cleared, you can shift left by 4 by making a multiplication look-up table.
<pre>
<source lang="6502tasm">
Example:
Example:
   lda rownum
   lda rownum
Line 284: Line 325:
   asl A
   asl A
   rts
   rts
</pre>
</source>


becomes
becomes
<pre>
 
<source lang="6502tasm">
Example:
Example:
   ldx rownum
   ldx rownum
Line 296: Line 338:
   .byt $00, $10, $20, $30, $40, $50, $60, $70
   .byt $00, $10, $20, $30, $40, $50, $60, $70
   .byt $80, $90, $A0, $B0, $C0, $D0, $E0, $F0
   .byt $80, $90, $A0, $B0, $C0, $D0, $E0, $F0
</pre>
</source>


In very large programs, this might save some ROM space. However, it will use up the X register, so it might not be ideal.
In very large programs, this might save some ROM space. However, it will use up the X register, so it might not be ideal.
Line 308: Line 350:
=== Use the stack instead of a temp variable ===
=== Use the stack instead of a temp variable ===


Example
<source lang="6502tasm">
    lda Foo
Example
    sta Temp
  lda Foo
    lda Bar
  sta Temp
    ....
  lda Bar
    ....
  ....
    lda Temp  ;Restores Foo
  ....
    .....
  lda Temp  ;Restores Foo
  .....
</source>


becomes:
becomes:
Example
 
    lda Foo
<source lang="6502tasm">
    pha
Example
    lda Bar
  lda Foo
    ....
  pha
    ....
  lda Bar
    pla  ;Restores Foo
  ....
    .....
  ....
  pla  ;Restores Foo
  .....
</source>


Savings : 2 bytes.
Savings : 2 bytes.
Line 333: Line 380:
Each time a routine needs multiple bytes of arguments (>3) it's hard to code it without wasting a lot of bytes.
Each time a routine needs multiple bytes of arguments (>3) it's hard to code it without wasting a lot of bytes.


  Example
<source lang="6502tasm">
    lda Argument1
Example
    sta Temp
  lda Argument1
    lda Argument2
  sta Temp
    ldx Argument3
  lda Argument2
    ldy Argument4
  ldx Argument3
    jsr RoutineWhichNeeds4Args
  ldy Argument4
    .....
  jsr RoutineWhichNeeds4Args
  .....
</source>


Becomes something like:
Becomes something like:


  Example
<source lang="6502tasm">
    jsr PassArguments
Example
    .dw RoutineWhichNeeds4Args
  jsr PassArguments
    .db Argument1, Argument2, Argument3, Argument4
  .dw RoutineWhichNeeds4Args
    .db $00
  .db Argument1, Argument2, Argument3, Argument4
    ....
  .db $00
  ....


PassArguments
PassArguments
   pla  
   pla  
   tay  
   tay  
Line 359: Line 409:
   ldx #$00  
   ldx #$00  
   beq SKIP  
   beq SKIP  
LOOP  
LOOP  
   sta parameters,x  
   sta parameters,x  
   inx  
   inx  
SKIP  
SKIP  
   iny                    ; pointing one short first pass here fixes that  
   iny                    ; pointing one short first pass here fixes that  
   lda (pointer),y  
   lda (pointer),y  
Line 375: Line 425:
   pha  
   pha  
   jmp (parameters)
   jmp (parameters)
</source>


Syscalls in Apple ProDOS<ref>''ProDOS 8 Technical Reference Manual''</ref> and FDS BIOS work this way.
Syscalls in Apple ProDOS<ref>''ProDOS 8 Technical Reference Manual''</ref> and FDS BIOS work this way.

Revision as of 09:15, 24 August 2013

This page is about optimisations that are possible in assembly language, or various things one programmer has to keep in mind to make his code as optimal as possible.

There is two major kind of optimisations: Optimisation for speed (code executes in fewer cycles) and optimisation for size (the code takes fewer bytes).

There is also some other kinds of optimisations, such as constant-executing-time optimisation (code execute in a constant number of cycle no matter what it has to do), or RAM usage optimisation (use as few variables as possible). Because those optimisations have more to do with the algorithm than with its implementation in assembly, only speed and size optimisations will be discussed in this article.

Optimise both speed and size of the code

Avoid a jsr + rts chain

A tail call occurs when a subroutine finishes by calling another subroutine. This can be optimised into a JMP instruction:

<source lang="6502tasm"> MySubroutine

 lda Foo
 sta Bar
 jsr SomeRandomRoutine
 rts

</source>

becomes :

<source lang="6502tasm"> MySubroutine

 lda Foo
 sta Bar
 jmp SomeRandomRoutine

</source>

Savings : 9 cycles, 1 byte

Split word tables in high and low components

This optimisation is not human friendly, makes the source code much bigger, but still makes the compiled[1] size smaller and faster:

<source lang="6502tasm"> Example

 lda FooBar
 asl A
 tax
 lda PointerTable,X
 sta Temp
 lda PointerTable+1,X
 sta Temp+1
 ....

PointerTable

 .dw Pointer1, Pointer2, ....

</source>

Becomes :

<source lang="6502tasm"> Example

 ldx FooBar
 lda PointerTableL,X
 sta Temp
 lda PointerTableH,X
 sta Temp+1
 ....

PointerTableL

 .byt <Pointer1, <Pointer2, ....

PointerTableH

 .byt >Pointer1, >Pointer2, ....

</source>

Some assemblers may have a way to implement a macro to automatically make the table coded like this (Unofficial MagicKit Assembler is one such program).

Savings : 2 bytes, 4 cycles

Use Jump tables with RTS instruction instead of JMP indirect instruction

The so-called RTS Trick is a method of implementing jump tables by pushing a subroutine's entry point to the stack.

<source lang="6502tasm">

This
 ldx JumpEntry
 lda PointerTableL,X
 sta Temp
 lda PointerTableH,X
 sta Temp+1
 jmp [Temp]
becomes this
 ldx JumpEntry
 lda PointerTableH,X
 pha
 lda PointerTableL,X
 pha
 rts

</source>

Note that PointerTable entries must point to one byte before the intended target when the RTS trick is used, because RTS will add 1 to the offset.

Savings : 4 bytes, 1 cycle.

To squeeze slightly more, it's possible to combine this with the tail call optimization:

<source lang="6502tasm">

This
 jsr SomeOtherFunction  ; MUST NOT modify JumpEntry
 ldx JumpEntry
 lda PointerTableH,X
 pha
 lda PointerTableL,X
 pha
 rts
Becomes this
 ldx JumpEntry
 lda PointerTableH,X
 pha
 lda PointerTableL,X
 pha
 jmp SomeOtherFunction

</source> Here, the CPU runs SomeOtherFunction, then returns to the function from the jump table, then returns to what called this dispatcher.

Use a macro instead of a subroutine which is only called once

What is the point to call a subroutine if you only call it at a single place? It would be more optimal to just insert the code where the subroutine is called. However this makes the code less structured and harder to understand. Inline expansion of a subroutine into another subroutine can be done with a macro. One drawback is that if the subroutine is called in a loop, it may require some JMPing to work around the 128-byte limit on branch length.

How macros are used depends on the assembler so no code examples will be placed here to avoid further confusion. In C, the static inline keyword acts as a hint to expand a function as a macro.

Savings : 4 bytes, 12 cycles.

Arithmetic shift right

Compact way to divide a variable by 2 but keep its sign:

<source lang="6502tasm">

  cmp #$80
  ror A

</source>

Easily test 2 upper bits of a variable

<source lang="6502tasm">

   lda FooBar
   asl A         ;C = b7, N = b6

</source>

Alternative:

<source lang="6502tasm">

   bit Foobar    ;N = b7, V = b6, regardless of the value of A.

</source>

This can be e.g. used to poll the sprite-0-hit flag in $2002.

Negating a value without temporaries

<source lang="6502tasm">

  eor #$FF
  clc
  adc #1

</source>

Avoiding the need for CLC/SEC with ADC/SBC

If you are using ADC #imm, and you know your carry is already cleared, you do not need to do CLC. However, if you know that carry is set (for example, your code is located in a branch that is only ever entered with a BCS instruction), you can still avoid using CLC by just doing ADC #(value-1). The PLOT subroutine in the Apple II Monitor uses this.

Similarly for SBC #imm: If you know know that carry is clear, you can still avoid using SEC by just doing SBC #(value+1).

Test bits in decreasing order

<source lang="6502tasm">

  lda foobar 
  bmi bit7_set 
  cmp #$40  ; we know that bit 7 wasn't set 
  bcs bit6_set 
  cmp #$20 
  bcs bit5_set 
            ; and so on

</source>

Or if you do not need to preserve the value of A:

<source lang="6502tasm">

  lda foobar 
  asl
  bcs bit7_set 
  asl
  bcs bit6_set 
  asl
  bcs bit5_set 
            ; and so on

</source>

This saves one byte per comparison, but 2 cycles more are used because of the extra ASL.

Test bits in increasing order

<source lang="6502tasm">

  lda foobar 
  lsr
  bcs bit0_set
  lsr
  bcs bit1_set
  lsr
  bcs bit2_set
            ; and so on

</source>

Note: This does not preserve the value of A.

Test bits without destroying the accumulator

The AND instruction can be used to test bits, but this destroy the value in the accumulator. The BIT can do this but it has no immediate adressing mode. A way to do it is to look for an opcode that has the bits you want to test, and use bit $xxxx on this opcode.

<source lang="6502tasm"> Example

  lda foobar
  and #$30
  beq bits_clear
  lda foobar
  ....

bits_clear

  lda foobar
  .....

</source>

becomes :

<source lang="6502tasm"> Example

  lda foobar
  bit _bmi_instruction ;equivalent to and #$30 but preserves A
  beq bits_clear
  ....

bits_clear

  .....

anywhere_in_the_code

   ....

_bmi_instruction ;The BMI opcode = $30

   bmi somewhere

</source>

Savings : 2 cycles, 3 bytes

Use opposite rotate instead of a great number of shifts

To retrieve the 3 highest bits of a value in the low positions, you might be tempted to do 5 LSRs in a row. However, if you do not need the 5 top bits to be cleared, this is more efficient: <source lang="6502tasm">

 lda value   ; got: 76543210 c
 rol         ; got: 6543210c 7
 rol         ; got: 543210c7 6 
 rol         ; got: 43210c76 5
 rol         ; got: 3210c765 4
 ; Only care about these ^^^

</source>

It works the same for replacing 5 ASLs with 4 RORs.

To replace 6 ASLs you can use 3 RORs: <source lang="6502tasm">

 lda value   ; got: 76453210 c
 ror         ; got: c7654321 0
 ror         ; got: 0c765432 1
 ror         ; got: 10c76543 2
 and #$C0    ; got: 10------

</source>

Optimise speed at the expense of size

Those optimisations will make code faster to execute, but use more ROM. Therefore, it is useful in NMI routines and other things that need to run fast.

Use identity look-up table instead of temp variable

<source lang="6502tasm"> Example

   ldx Foo
   lda Bar
   stx Temp
   clc
   adc Temp    ;A = Foo + Bar

</source>

becomes :

<source lang="6502tasm"> Example

   ldx Foo
   lda Bar
   clc
   adc Identity,X    ;A = Foo + Bar

Identity

   .byt $00, $01, $02, $03, .....

</source>

If your program is very large, it is possible that this way uses slightly less ROM; also, it might save one byte of RAM in some circumstances.

Savings : 2 cycles

Use look-up table to shift left 4 times

Provided that the high nibble is already cleared, you can shift left by 4 by making a multiplication look-up table. <source lang="6502tasm"> Example:

 lda rownum
 asl A
 asl A
 asl A
 asl A
 rts

</source>

becomes

<source lang="6502tasm"> Example:

 ldx rownum
 lda times_sixteen,x
 rts

times_sixteen:

 .byt $00, $10, $20, $30, $40, $50, $60, $70
 .byt $80, $90, $A0, $B0, $C0, $D0, $E0, $F0

</source>

In very large programs, this might save some ROM space. However, it will use up the X register, so it might not be ideal.

Savings: 4 cycles

Optimise code size at the expense of cycles

Those optimisations will produce code that is smaller but takes more cycles to execute. Therefore, it can be used in the parts of the program that do not have to be fast.

Use the stack instead of a temp variable

<source lang="6502tasm"> Example

  lda Foo
  sta Temp
  lda Bar
  ....
  ....
  lda Temp   ;Restores Foo
  .....

</source>

becomes:

<source lang="6502tasm"> Example

  lda Foo
  pha
  lda Bar
  ....
  ....
  pla   ;Restores Foo
  .....

</source>

Savings : 2 bytes.

Use an "intelligent" argument system

Each time a routine needs multiple bytes of arguments (>3) it's hard to code it without wasting a lot of bytes.

<source lang="6502tasm"> Example

  lda Argument1
  sta Temp
  lda Argument2
  ldx Argument3
  ldy Argument4
  jsr RoutineWhichNeeds4Args
  .....

</source>

Becomes something like:

<source lang="6502tasm"> Example

  jsr PassArguments
  .dw RoutineWhichNeeds4Args
  .db Argument1, Argument2, Argument3, Argument4
  .db $00
  ....

PassArguments

  pla 
  tay 
  pla 
  pha                    ; put the high byte back 
  sta pointer+1 
  ldx #$00 
  beq SKIP 

LOOP

  sta parameters,x 
  inx 

SKIP

  iny                    ; pointing one short first pass here fixes that 
  lda (pointer),y 
  bne LOOP      
  iny 
  lda (pointer),y 
  beq LOOP
  dey                    ; fix the return address guess we can't return to a 
                        ;  break        
  tya 
  pha 
  jmp (parameters)

</source>

Syscalls in Apple ProDOS[2] and FDS BIOS work this way.

Savings : Complicated to estimate - only saves bytes if the trick is used fairly often across the program, in order to compensate for the size of the PassArguments routine.

Using relative branch instruction instead of absolute

If the state of one of the processor flags is already known at this point and the branch target is not too far away, then you can use a condition branch instruction.

Savings : 1 byte.

See also

Notes

  1. Pedants may complain that "compile" is incorrect terminology for "translate a program written in assembly language into object code". But it is the most familiar term meaning "translate a program, no matter the language, into object code", and the same issues apply to code generators within a compiler that targets the 6502 as to programs written in 6502 assembly language.
  2. ProDOS 8 Technical Reference Manual