Delay code

From NESdev Wiki
Latest revision as of 16:08, 27 July 2022

Code that causes a parametrised number of cycles of delay.

Note that all branch instructions are written assuming that no page wrap occurs. If you want to ensure this condition at compile time, use the bccnw/beqnw/etc. macros that are listed at Fixed cycle delay.
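The macros themselves are listed at Fixed cycle delay and are not reproduced here. As a rough sketch of what such a check might look like in ca65, the following uses a hypothetical macro name, bne_nopagecross, which is an assumption and not the name used on that page:

```asm
; Hypothetical macro, not the one from "Fixed cycle delay":
; a BNE that fails the build if taking the branch would cross a page.
.macro bne_nopagecross target
        bne target
        ; "*" is now the address of the instruction after the branch.
        ; A taken branch costs one extra cycle when the target lies on
        ; a different page than that address.
        .assert .hibyte(target) = .hibyte(*), error, "bne crosses a page boundary"
.endmacro
```

Used in place of a plain bne, this turns a silent timing error into an assemble-time or link-time error.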

Inline code

2–3 cycles of delay (delay=r+2; 0 ≤ r ≤ 1, r⊢Z, Δr = 0)

        bne @1
@1:
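The heading notation is not defined on this page. Reading r⊢Z as "the Z flag currently reflects r" and Δr = 0 as "r is left unchanged" is an assumption, but it matches the code. Under that reading, a usage sketch could look like this, where scroll_parity is a hypothetical zero-page variable:

```asm
        ldx scroll_parity   ; hypothetical variable holding 0 or 1; LDX sets Z from it
        bne @1              ; X=0: not taken, 2 cycles; X=1: taken, 3 cycles
@1:
```

The load itself is not part of the 2 to 3 cycle delay; only the branch is counted.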

4–5 cycles of delay (delay=r+4; 0 ≤ r ≤ 1, Δr = 0)

        ora #0 ; use ora=A, cpx=X, cpy=Y
        bne @1
@1:

4–5 cycles of delay (delay=X+4; 0 ≤ X ≤ 1)

        dex
        bpl @1
@1:

5–7 cycles of delay (delay=A+5; 0 ≤ A ≤ 2, A⊢Z)

        beq @2
        lsr
@2:     bne @3
@3:

5–7 cycles of delay (delay=r+5; 0 ≤ r ≤ 2, Δr = 0)

        cmp #1 ; use cmp=A, cpx=X, cpy=Y
        bcc @3
        bne @3
@3:

5–7 cycles of delay (delay=X+5; 0 ≤ X ≤ 2)

        dex
        bmi @3
        bne @3
@3:

6–9 cycles of delay (delay=A+6; 0 ≤ A ≤ 3, A⊢Z)

        beq @2
        lsr
@2:     beq @4
        bcs @4
@4:

7–10 cycles of delay (delay=A+7; 0 ≤ A ≤ 3)

        lsr
        beq @3
        bpl @3
@3:     bcs @4
@4:

8–11 cycles of delay (delay=X+8; 0 ≤ X ≤ 3)

        dex
        bmi @4
        dex
        bmi @5
@4:     bne @5
@5:

9–13 cycles of delay (delay=A−242; 251 ≤ A ≤ 255; C = 0)

        adc #3  ;  2 2 2 2 2  FE FF 00 01 02
        bcc @4  ;  3 3 2 2 2  FE FF 00 01 02 ;bmi works too
        lsr     ;  - - 2 2 2  -- -- 00 00 01
        beq @5  ;  - - 3 3 2  -- -- 00 00 01
@4:     lsr     ;  2 2 - - 2  7F 7F -- -- 00
@5:     bcs @6  ;  2 3 2 3 3  7F 7F 00 00 00
@6:

10–14 cycles of delay (delay=X+10; 0 ≤ X ≤ 4)

        cpx #3
        bcc @3
        bne @3
@3:     dex
        bmi @6
        bne @6
@6:

9–14 cycles of delay (delay=A+9; 0 ≤ A ≤ 5)

        lsr
        bcs @2
@2:     beq @5
        lsr
        bcs @6 ;beq works too
@5:     bne @6
@6:

9–16 cycles of delay (delay=A+9; 0 ≤ A ≤ 7)

        lsr
        bcs @2
@2:     beq @6
        lsr
        beq @7
        bcc @7
@6:     bne @7
@7:

11–19 cycles of delay (delay=A+11; 0 ≤ A ≤ 8)

;      Cycles | A | 0  0  0  0  0  0  0  0  0  | 0  1  2  3  4  5  6  7  8
        lsr       ; 2  2  2  2  2  2  2  2  2  | 0  0  1  1  2  2  3  3  4
        bcs @3    ; 2  3  2  3  2  3  2  3  2  | 0 c0  1 c1  2 c2  3 c3  4
        adc #255  ; 2  -  2  -  2  -  2  -  2  |-1  - c0  - c1  - c2  - c3
@3:     beq @7    ; 2  3  3  2  2  2  2  2  2  |-1 c0 c0 c1 c1 c2 c2 c3 c3
        bcc @9    ; 3  -  -  2  2  2  2  2  2  |-1  -  - c1 c1 c2 c2 c3 c3 ;bmi works too
        lsr       ; -  -  -  2  2  2  2  2  2  | -  -  - c0 c0  1  1 c1 c1
        beq @9    ; -  -  -  3  3  2  2  2  2  | -  -  - c0 c0  1  1 c1 c1
@7:     bcc @9    ; -  2  2  -  -  3  3  2  2  | - c0 c0  -  -  1  1 c1 c1
        bne @9    ; -  2  2  -  -  -  -  3  3  | - c0 c0  -  -  -  - c1 c1
@9:       ;Total:  11 12 13 14 15 16 17 18 19

12–23 cycles of delay (delay=A+12; 0 ≤ A ≤ 11)

;      Cycles | A | 0  0  0  0  0  0  0  0  0  0  0  0  | 0  1  2  3  4  5  6  7  8  9 10 11
        lsr       ; 2  2  2  2  2  2  2  2  2  2  2  2  | 0  0  1  1  2  2  3  3  4  4  5  5
        bcs @2    ; 2  3  2  3  2  3  2  3  2  3  2  3  | 0  0  1  1  2  2  3  3  4  4  5  5
@2:     lsr       ; 2  2  2  2  2  2  2  2  2  2  2  2  | 0  0  0  0  1  1  1  1  2  2  2  2
        bcc @5    ; 3  3  2  2  3  3  2  2  3  3  2  2  | 0  0  0  0  1  1  1  1  2  2  2  2
        bcs @5    ; -  -  3  3  -  -  3  3  -  -  3  3  | -  -  0  0  -  -  1  1  -  -  2  2 ;bpl works too
@5:     beq @10   ; 3  3  3  3  2  2  2  2  2  2  2  2  | 0  0  0  0  1  1  1  1  2  2  2  2
        lsr       ; -  -  -  -  2  2  2  2  2  2  2  2  | -  -  -  -  0  0  0  0  1  1  1  1
        bcs @10   ; -  -  -  -  3  3  3  3  2  2  2  2  | -  -  -  -  0  0  0  0  1  1  1  1 ;beq works too
        delay_n 5 ; -  -  -  -  -  -  -  -  5  5  5  5  | -  -  -  -  -  -  -  -  1  1  1  1
@10:      ;Total:  12 13 14 15 16 17 18 19 20 21 22 23

For delay_n 5, anything that causes 5 cycles of delay works. Examples: inc $00 (which also modifies the byte at $00), or nop followed by cmp $C5 (2 + 3 cycles, which also changes the flags).

15–270 cycles of delay (delay=A+15; 0 ≤ A ≤ 255)

This code peels slices of 5 cycles with a SBC-BCS loop until A underflows, which leaves A in the range 251–255 with carry clear, and then executes the delay code for A=251–255. The same code appears later as a function version, where JSR+RTS add 12 cycles of overhead.

        sec     
@L:     sbc #5  
        bcs @L  ;  6 6 6 6 6  FB FC FD FE FF
        adc #3  ;  2 2 2 2 2  FE FF 00 01 02
        bcc @4  ;  3 3 2 2 2  FE FF 00 01 02
        lsr     ;  - - 2 2 2  -- -- 00 00 01
        beq @5  ;  - - 3 3 2  -- -- 00 00 01
@4:     lsr     ;  2 2 - - 2  7F 7F -- -- 00
@5:     bcs @6  ;  2 3 2 3 3  7F 7F 00 00 00
@6:
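As a usage sketch, the block can be preceded by a load whose own 2 cycles are counted in the budget. The symbol WANTED and the target of 100 cycles are assumptions chosen for illustration:

```asm
WANTED = 100                     ; hypothetical total; 17..272 is possible this way
        lda #(WANTED - 15 - 2)   ; 2 cycles; A = 83
        sec                      ; from here on: A + 15 = 98 cycles
@L:     sbc #5
        bcs @L
        adc #3
        bcc @4
        lsr
        beq @5
@4:     lsr
@5:     bcs @6
@6:                              ; reached 100 cycles after the LDA began
```

With A = 83 the loop takes 16 taken iterations (80 cycles) plus a final pass of 4 cycles. Together with the 2-cycle SEC and the 12-cycle tail for A = $FE, that is 98 cycles.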

16–271 cycles of delay (delay=A+16; 0 ≤ A ≤ 255)

This code peels slices of 9 cycles with a CMP-BCC-SBC-BCS loop, and then executes the delay code for A=0—8.

@L:     cmp #9          ;2
        bcc @0          ;2 (+1)
        sbc #9          ;2
        bcs @L          ;3
;      Cycles | A | 5  5  5  5  5  5  5  5  5  | 0  1  2  3  4  5  6  7  8
@0:     lsr       ; 2  2  2  2  2  2  2  2  2  | 0  0  1  1  2  2  3  3  4
        bcs @3    ; 2  3  2  3  2  3  2  3  2  | 0 c0  1 c1  2 c2  3 c3  4
        adc #255  ; 2  -  2  -  2  -  2  -  2  |-1  - c0  - c1  - c2  - c3
@3:     beq @7    ; 2  3  3  2  2  2  2  2  2  |-1 c0 c0 c1 c1 c2 c2 c3 c3
        bcc @9    ; 3  -  -  2  2  2  2  2  2  |-1  -  - c1 c1 c2 c2 c3 c3
        lsr       ; -  -  -  2  2  2  2  2  2  | -  -  - c0 c0  1  1 c1 c1
        beq @9    ; -  -  -  3  3  2  2  2  2  | -  -  - c0 c0  1  1 c1 c1
@7:     bcc @9    ; -  2  2  -  -  3  3  2  2  | - c0 c0  -  -  1  1 c1 c1
        bne @9    ; -  2  2  -  -  -  -  3  3  | - c0 c0  -  -  -  - c1 c1
@9:       ;Total:  16 17 18 19 20 21 22 23 24

5—65285 cycles of delay: delay = 256×X + 5

Clobbers A:

@0:     txa       ;2
        beq @10   ;3
        nop       ;2
        tya       ;2
         ldy #48  ;2
@l:      dey      ;2×48
         bne @l   ;3×48
        tay       ;2−1
        dex       ;2
        jmp @0    ;3
@10:

Doesn’t clobber A (2 bytes longer):

@0:     cpx #0    ;2
        beq @10   ;3
        pha       ;3
        tya       ;2
         ldy #47  ;2
@l:      dey      ;2×47
         bne @l   ;3×47
        tay       ;2−1
        pla       ;4
        dex       ;2
        jmp @0    ;3
@10:

18—218103813 cycles of delay: delay = 13×(65536×Y + 256×A + X) + 18

        iny
@l1:    nop
        nop
@l2:    cpx #1
        dex
        sbc #0
        bcs @l1
        dey
        bne @l2
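As a usage sketch, the three counter bytes can be derived from one assemble-time constant. The symbol COUNT and its value are assumptions chosen for illustration; .lobyte, .hibyte and .bankbyte are ca65 functions. The three loads add 6 cycles that the formula does not include.

```asm
COUNT = 76921                   ; hypothetical 24-bit count: 65536*Y + 256*A + X
        ldy #.bankbyte(COUNT)   ; bits 16-23
        lda #.hibyte(COUNT)     ; bits 8-15
        ldx #.lobyte(COUNT)     ; bits 0-7
        ; ... the delay block from above goes here: 13*COUNT + 18 cycles ...
```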

Callable functions

A + 25 cycles of delay, clobbers A, Z&N, C, V

This code peels slices of 7 cycles with a CMP-BCS-SBC loop, and then executes the delay code for 9–16 cycles with A = 0–6. Its overhead is smaller than that of the version that peels 5 cycles because the case A<7 executes only two instructions (CMP, BCS) before the remainder code instead of three (SEC, SBC, BCS). The cost is that the entry point is not the first instruction, so the code can only exist as a callable function, not as inline code.

;;;;;;;;;;;;;;;;;;;;;;;;
; Delays A clocks + overhead
; Clobbers A. Preserves X,Y.
; Time: A+25 clocks (including JSR)
;;;;;;;;;;;;;;;;;;;;;;;;
                  ;       Cycles              Accumulator         Carry flag
                  ; 0  1  2  3  4  5  6          (hex)           0 1 2 3 4 5 6
                  ;
                  ; 6  6  6  6  6  6  6   00 01 02 03 04 05 06
:      sbc #7     ; carry set by CMP
delay_a_25_clocks:
       cmp #7     ; 2  2  2  2  2  2  2   00 01 02 03 04 05 06   0 0 0 0 0 0 0
       bcs :-     ; 2  2  2  2  2  2  2   00 01 02 03 04 05 06   0 0 0 0 0 0 0
       lsr        ; 2  2  2  2  2  2  2   00 00 01 01 02 02 03   0 1 0 1 0 1 0
       bcs *+2    ; 2  3  2  3  2  3  2   00 00 01 01 02 02 03   0 1 0 1 0 1 0
       beq :+     ; 3  3  2  2  2  2  2   00 00 01 01 02 02 03   0 1 0 1 0 1 0
       lsr        ;       2  2  2  2  2         00 00 01 01 01       1 1 0 0 1
       beq @rts   ;       3  3  2  2  2         00 00 01 01 01       1 1 0 0 1
       bcc @rts   ;             3  3  2               01 01 01           0 0 1
:      bne @rts   ; 2  2              3   00 00             01   0 1         0
@rts:  rts        ; 6  6  6  6  6  6  6   00 00 00 00 01 01 01   0 1 1 1 0 0 1
; Total cycles:    25 26 27 28 29 30 31
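A call-site sketch, assuming a target of 100 cycles measured from the start of the LDA to the instruction after the JSR (the figure 100 is only an illustration):

```asm
        lda #(100 - 25 - 2)     ; 2 cycles for LDA #imm; A = 73
        jsr delay_a_25_clocks   ; A + 25 = 98 cycles, JSR and RTS included
        ; execution resumes here 100 cycles after the LDA began
```

Because A is 8 bits wide, totals from 27 to 282 cycles can be reached this way.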

A + 27 cycles of delay, clobbers A, Z&N, C, V

This function has a longer overhead than delay_a_25_clocks, but it can be appended to other functions, because execution begins at the first instruction.

;;;;;;;;;;;;;;;;;;;;;;;;
; Delays A clocks + overhead
; Clobbers A. Preserves X,Y.
; Time: A+27 clocks (including JSR)
;;;;;;;;;;;;;;;;;;;;;;;;
delay_a_27_clocks:
        sec     
@L:     sbc #5  
        bcs @L  ;  6 6 6 6 6  FB FC FD FE FF
        adc #3  ;  2 2 2 2 2  FE FF 00 01 02
        bcc @4  ;  3 3 2 2 2  FE FF 00 01 02
        lsr     ;  - - 2 2 2  -- -- 00 00 01
        beq @5  ;  - - 3 3 2  -- -- 00 00 01
@4:     lsr     ;  2 2 - - 2  7F 7F -- -- 00
@5:     bcs @6  ;  2 3 2 3 3  7F 7F 00 00 00
@6:     rts     ;

It is created by wrapping the code for 15–270 cycles of delay into a function. The JSR and RTS instructions add 12 cycles of overhead.

256×A + X + 33 cycles of delay, clobbers A, Z&N, C, V

;;;;;;;;;;;;;;;;;;;;;;;;
; Delays A:X clocks+overhead
; Time: 256*A+X+33 clocks (including JSR)
; Clobbers A. Preserves X,Y. Has relocations.
;;;;;;;;;;;;;;;;;;;;;;;;
:       ; 5 cycles done, do 256-5 more.
        sbc #1                  ; 2 cycles - Carry was set from cmp
        pha                     ; 3
         lda #(256-27 - 16)     ; 2
         jsr delay_a_27_clocks  ; 240
        pla                     ; 4
delay_256a_x_33_clocks:
        cmp #1                  ; +2
        bcs :-                  ; +3
        ; 0-255 cycles remain, overhead = 4
        txa                     ; -1+2; 6; +27 = 33
        ;passthru
<<Place the function delay_a_27_clocks immediately following here>>

Can be trivially changed to swap X, Y.

If you can clobber Y, change the part that begins with "pha" and ends with "pla" into this, for 1 byte shorter code:

        ldy #49  ; 2
@l:     dey      ; 49*2
        bne @l   ; 49*3
        ldy $A4  ; 3-1

256×A + 16 cycles of delay, clobbers A, Z&N, C, V

;;;;;;;;;;;;;;;;;;;;;;;;
; Delays A*256 clocks + overhead
; Clobbers A. Preserves X,Y.
; Time: A*256+16 clocks (including JSR)
;;;;;;;;;;;;;;;;;;;;;;;;
delay_256a_16_clocks:
	cmp #0
	bne delay_256a_11_clocks_
	rts
delay_256a_11_clocks_:
	;5 cycles done. Must consume 256 cycles; 251 cycles remain.
        pha                      ;3 - 3
        tya                      ;2 - 5
         ldy #46                 ;2 - 7
@l:      dey                     ;2*46 - 99
         bne @l                  ;3*46 - 237
         nop                     ;2-1 - 238
        tay                      ;2 - 240
        pla                      ;4 - 244
	sec                      ;2 - 246
	sbc #1                   ;2 - 248
	jmp delay_256a_16_clocks ;3 - 251

If you can clobber Y, replace the part from pha through pla with this shorter code:

	ldy #48  ; 2
@l:     dey      ; 48*2
        bne @l   ; 48*3
        ldy $A4  ; 3-1
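
The loop length follows from a short calculation: ldy #n followed by a dey/bne loop takes 2 + 5n - 1 = 5n + 1 cycles, because the last bne falls through in 2 cycles instead of 3. The replaced span costs 3+2+2+229+2+2+4 = 244 cycles, and 5*48 + 1 = 241 plus the 3-cycle ldy $A4 gives 244 again. As a sketch, the same arithmetic can be left to the assembler with a hypothetical helper macro (burn_y is not part of this page), assuming as elsewhere that the branch does not cross a page:

```asm
; Hypothetical helper: burn exactly `cycles` cycles with a Y countdown loop.
; Only totals of the form 5*n+1 with n = 1..255 are possible. Clobbers Y, Z&N.
.macro burn_y cycles
        .local loop
        .assert ((cycles) - 1) .mod 5 = 0, error, "cycles must be 5*n+1"
        .assert (cycles) >= 6 && (cycles) <= 1276, error, "cycles out of range"
        ldy #(((cycles) - 1) / 5)       ; 2
loop:   dey                             ; 2 per iteration
        bne loop                        ; 3 per iteration, 2 on the last one
.endmacro

        burn_y 241                      ; same as ldy #48 / dey / bne
        ldy $A4                         ; 3-cycle filler, as in the listing above
```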

256×X + 16 cycles of delay, clobbers X, Z&N

;;;;;;;;;;;;;;;;;;;;;;;;
; Delays X*256 clocks + overhead
; Clobbers X. Preserves A,Y. Relocatable.
; Time: X*256+16 clocks (including JSR)
;;;;;;;;;;;;;;;;;;;;;;;;
delay_256x_16_clocks:
	cpx #0
	bne delay_256x_11_clocks_
	rts
delay_256x_11_clocks_:
	;5 cycles done. Must consume 256 cycles; 251 cycles remain.
        pha                      ;3
        tya                      ;2
         ldy #46                 ;2
@l:      dey                     ;2*46
         bne @l                  ;3*46
         nop                     ;2-1
         nop                     ;2
        tay                      ;2
        pla                      ;4
	dex                      ;2
	jmp delay_256x_16_clocks ;3

Can be trivially changed to swap X, Y.

If you can clobber Y, replace the part from pha through pla with this shorter code:

        ldy #49                  ;2-1
@l:	dey                      ;2*49
	bne @l                   ;3*49

256×X + A + 30 cycles of delay, clobbers A, X, Z&N, C, V

;;;;;;;;;;;;;;;;;;;;;;;;
; Delays X*256+A clocks + overhead
; Clobbers A,X. Preserves Y.
; Depends on delay_a_25_clocks within short branch distance
; Time: X*256+A+30 clocks (including JSR)
;;;;;;;;;;;;;;;;;;;;;;;;
delay_256x_a_30_clocks:
        cpx #0                  ;2
        beq delay_a_25_clocks   ;3
        ;4 cycles done. Must consume 256 cycles; 252 cycles remain.
        pha                             ;3
         lda #(256-4-(3+2+4+2+3))-25    ;2
         jsr delay_a_25_clocks          ;238
        pla                             ;4
        dex                             ;2
        jmp delay_256x_a_30_clocks      ;3

Can be trivially changed to swap X, Y.

Alternative version that does not depend on other delay functions but otherwise has the same timing and clobbers:

:      sbc #7    ; carry set by CMP
delay_256x_a_30_clocks_b:
       cmp #7    ; 2  2  2  2  2  2  2   00 01 02 03 04 05 06   0 0 0 0 0 0 0
       bcs :-    ; 2  2  2  2  2  2  2   00 01 02 03 04 05 06   0 0 0 0 0 0 0
       lsr       ; 2  2  2  2  2  2  2   00 00 01 01 02 02 03   0 1 0 1 0 1 0
       bcs @2    ; 2  3  2  3  2  3  2   00 00 01 01 02 02 03   0 1 0 1 0 1 0
@2:    beq @6    ; 3  3  2  2  2  2  2   00 00 01 01 02 02 03   0 1 0 1 0 1 0
       lsr       ;       2  2  2  2  2         00 00 01 01 01       1 1 0 0 1
       beq @do_x ;       3  3  2  2  2         00 00 01 01 01       1 1 0 0 1
       bcc @do_x ;             3  3  2               01 01 01           0 0 1
@6:    bne @do_x ; 2  2              3   00 00             01   0 1         0
@do_x: txa       ;2
       beq @rts  ;3
       ;4 cycles done. Must consume 256 cycles; 252 cycles remain.
       nop       ;2
       tya       ;2
        ldy #48  ;2
@l:     dey      ;2*48
        bne @l   ;3*48
       tay       ;2-1
       dex       ;2
       jmp @do_x ;3
@rts:  rts

This function is constructed by concatenating delay_a_25_clocks and the inline delay code for 5—65285 cycles.

851968×Y + 3328×A + 13×X + 30 cycles of delay, clobbers A, X, Y, Z&N, C, V

;;;;;;;;;;;;;;;;;;;;;;;;
; Delays 30+13*(65536*Y+256*A+X) cycles including JSR.
; Clobbers A,X,Y.
delay_851968y_3328a_13x_30_clocks:
        iny
@l1:    nop
        nop
@l2:    cpx #1
        dex
        sbc #0
        bcs @l1
        dey
        bne @l2
        rts

This is constructed by wrapping the 18—218103813 cycles inline delay code in a function.
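
A usage sketch, assuming a constant delay of the form 30 + 13*k is wanted; WAIT_CYCLES and COUNT are hypothetical names, not part of this page. The three immediate loads add 6 cycles on top of WAIT_CYCLES:

```asm
WAIT_CYCLES = 30 + 13 * 100000          ; hypothetical: cycles spent in the function, JSR included
COUNT       = (WAIT_CYCLES - 30) / 13   ; 24-bit iteration count
        .assert (WAIT_CYCLES - 30) .mod 13 = 0, error, "delay must be 30 + 13*k"
        .assert COUNT >= 0 && COUNT <= $FFFFFF, error, "delay out of range"
        ldy #.bankbyte(COUNT)           ; bits 16-23
        lda #.hibyte(COUNT)             ; bits 8-15
        ldx #.lobyte(COUNT)             ; bits 0-7
        jsr delay_851968y_3328a_13x_30_clocks
```

Delays that are not of the form 30 + 13*k need the remainder made up separately, for example with one of the shorter delay routines on this page.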

See also