Cycle counting: Difference between revisions

Latest revision as of 03:22, 3 June 2024

It is often useful to delay a specific number of CPU cycles. Timing raster effects or generating PCM audio are some examples that might utilize this. This article outlines a few relevant techniques.

Instruction timings

You can use a comprehensive guide^[1]^[2] as reference for instruction timings, but there are some rules-of-thumb^[3] that can help remember most of them:

There is a minimum of 2 cycles per instruction.
Each byte of memory read or written adds 1 more cycle to the instruction. This includes fetching the instruction, and each byte of its operand, then any memory it references.
Indexed instructions which cross a page take 1 extra cycle to adjust the high byte of the effective address first.
Read-modify-write instructions perform a dummy write during the "modify" stage and thus take 1 extra cycle.
Instructions that push data onto the stack take 1 extra cycle.
Instructions that pop data from the stack take 2 extra cycles, since they also need to pre-increment the stack pointer.
"Extra" cycles often include an extra read or write that usually does not affect the outcome.

Examples:

Instructions	Cycles	Bytes	Details
`SEC`	2	1	opcode, but has to wait for the 2-cycle minimum.
`AND #imm`	2	2	opcode + operand. Only affects registers.
`LDA zp`	3	2	opcode + operand + byte fetched from zp.
`STA abs`	4	3	opcode + 2 byte operand + byte written to abs.
`LDA abs,X`	4 or 5	3	opcode + 2 byte operand + read from abs, but it delays 1 extra cycle if the addition of the X index causes a page crossing.
`STA abs,X`	5	3	Like LDA abs,X, but assumes the worst case of page crossing and always spends 1 extra read cycle.
`ASL zp`	5	2	opcode + operand + read from zp + write to zp, but it takes 1 extra cycle to modify the value.
`LDA (indirect),Y`	5 or 6	2	opcode + operand + two reads from zp + read from indirect address. 1 extra cycle if a page is crossed.
`STA (indirect),Y`	6	2	Like LDA (indirect),Y, but assumes the worst case of page crossing and always spends 1 extra read cycle.
`PHA`	3	1	opcode + stack write, but requires 1 extra cycle to perform the stack operation.
`RTS`	6	1	opcode + two stack reads, but requires 2 extra cycles to perform the stack operations, plus 1 cycle to post-increment the program counter (to compensate for the off-by-1 address pushed by JSR).

Short delays

Here are few ways to create short delays without side effects. As the shortest instruction time is 2 cycles, it is not possible to delay 1 cycle on its own. NOP is essential for 2 cycle delays. 3 cycle delays always take 2 bytes, but usually have some compromise. More options become available as delays become longer.

Instructions	Cycles	Bytes	Side effects and notes
`NOP`	2	1
`JMP *+3`	3	3
`Bxx *+2`	3	2	None, but requires a known flag state (e.g. BCC if carry is known to be clear).
`BIT zp`	3	2	Clobbers NVZ flags. Reads zp.
`IGN zp`	3	2	Reads zp. (Unofficial instruction.)
`NOP, NOP`	4	2
`NOP, ...`	5	3	(... = 3 cycle delay of choice)
`CLV, BVC *+2`	5	3	Clears V flag. (Can instead use C flag with CLC, BCC or SEC, BCS.)
`NOP, NOP, NOP`	6	3
`PHP, PLP`	7	2	Modifies 1 byte of stack.
`NOP, NOP, NOP, NOP`	8	4
`PHP, PLP, NOP`	9	3	Modifies 1 byte of stack.
`PHP, CMP zp, PLP`	10	4	Modifies 1 byte of stack. Reads zp.
`PHP, PLP, NOP, NOP`	11	4	Modifies 1 byte of stack.
`JSR, RTS`	12	3	Modifies 2 bytes of stack. (Takes 3 bytes only if reusing an existing RTS; otherwise 4.)

Clockslide

A clockslide^[4] is a sequence of instructions that wastes a small constant amount of cycles plus one cycle per executed byte, no matter whether it's entered on an odd or even address. With official instructions, one can construct a clockslide from CMP instructions: ... C9 C9 C9 C9 C5 EA Disassemble from the start and you get CMP #$C9 CMP #$C9 CMP $EA (6 bytes, 7 cycles). Disassemble one byte in and you get CMP #$C9 CMP #$C5 NOP (5 bytes, 6 cycles). The entry point can be controlled with an indirect jump or the RTS Trick to precisely control raster effect or sample playback timing.

CMP has a side effect of destroying most of the flags, but you can substitute other instructions with the same size and timing that preserve whichever flags/registers you need at the end of the slide. There are unofficial instructions that can avoid altering any state: replace $C9 (CMP) with $89 or $80, which ignores an immediate operand, and replace $C5 with $04, $44, or $64, which ignore a read from the zero page.

Resources

Delay code - various variable-cycle delays
Fixed cycle delay - shortest fixed-cycle delays
Fixed-cycle delay code vending machine - code for generating shortest-possible delay routines at compile-time.
6502 vdelay - code for delaying a variable number of cycles at run-time.

References

[1] Obelisk: 6502 instruction reference

[2] 6502_cpu.txt: cycle-by-cycle instruction behaviour

[3] Forum: Is there a logic to instruction timings?

[4] Clockslide: How to waste an exact number of clock cycles on the 6502

[1]

[2]

[3]

[4]

@@ Line 4: / Line 4: @@
 You can use a comprehensive
-guide<ref>[http://www.obelisk.me.uk/6502/reference.html Obelisk: 6502 instruction reference]</ref><ref>[http://nesdev.org/6502_cpu.txt 6502_cpu.txt: cycle-by-cycle instruction behaviour]</ref>
+guide<ref>[https://www.nesdev.org/obelisk-6502-guide/reference.html Obelisk: 6502 instruction reference]</ref><ref>[http://nesdev.org/6502_cpu.txt 6502_cpu.txt: cycle-by-cycle instruction behaviour]</ref>
 as reference for instruction timings,
 but there are some
@@ Line 10: / Line 10: @@
 that can help remember most of them:
-* Each byte of memory read or write adds another cycle to the instruction. This includes fetching the instruction, and each byte of its operand, then any memory it references.
+* There is a minimum of 2 cycles per instruction.
+* Each byte of memory read or written adds 1 more cycle to the instruction. This includes fetching the instruction, and each byte of its operand, then any memory it references.
 * Indexed instructions which cross a page take 1 extra cycle to adjust the high byte of the effective address first.
-* Read-modify-write instructions perform a dummy write during the "modify" stage and thus take an extra cycle.
+* Read-modify-write instructions perform a dummy write during the "modify" stage and thus take 1 extra cycle.
-* Instructions that push data onto the stack take extra cycles.
+* Instructions that push data onto the stack take 1 extra cycle.
-** Instructions which pop data from the stack take '''two''' extra cycles since they also need to pre-increment the stack pointer
+* Instructions that pop data from the stack take 2 extra cycles, since they also need to pre-increment the stack pointer.
 * "Extra" cycles often include an extra read or write that usually does not affect the outcome.
-* There is a minimum of 2 cycles per instruction.
 Examples:
-* <code>SEC</code> - 2 cycles: 1 byte opcode, but has to wait for the 2-cycle minimum.
+{|class="tabular"
-* <code>AND #imm</code> - 2 cycles: opcode + operand = 2 bytes. Only affects registers.
+! Instructions || Cycles || Bytes || Details
-* <code>LDA zp</code> - 3 cycles: opcode + operand + byte fetched from zp.
+|-
-* <code>STA abs</code> - 4 cycles: opcode + 2 byte operand + byte written to abs.
+| <code>SEC</code> || 2 || 1 || opcode, but has to wait for the 2-cycle minimum.
-* <code>LDA abs, X</code> - 4 or 5 cycles: opcode + 2 byte operand + read from abs, but if the addition of the X index causes a page crossing it delays 1 extra cycle.
+|-
-* <code>ASL zp</code> - 5 cycles: opcode + operand + read from zp + write to zp, but it takes 1 extra cycle to modify the value.
+| <code>AND #imm</code> || 2 || 2 || opcode + operand. Only affects registers.
-* <code>LDA (indirect), Y</code> - 5 or 6 cycles: opcode + operand + two reads from zp + read from indirect address. 1 extra cycle if a page is crossed.
+|-
-* <code>STA (indirect), Y</code> - 6 cycles: like LDA (indirect) but assumes the worst case of page crossing, so always spends 1 extra cycle reading in case the page correction is being applied.
+| <code>LDA zp</code> || 3 || 2 || opcode + operand + byte fetched from zp.
-* <code>PHA</code> - 3 cycles: opcode + stack write, but requires 1 extra cycle to perform the stack operation.
+|-
-* <code>RTS</code> - 6 cycles: opcode + two stack reads, but requires 2 extra cycles to perform the stack operations, plus an additional cycle to post-increment the program counter (to compensate for the off-by-1 address pushed by JSR)
+| <code>STA abs</code> || 4 || 3 || opcode + 2 byte operand + byte written to abs.
+|-
+| <code>LDA abs,X</code> || 4 or 5 || 3 || opcode + 2 byte operand + read from abs, but it delays 1 extra cycle if the addition of the X index causes a page crossing.
+|-
+| <code>STA abs,X || 5 || 3 || Like LDA abs,X, but assumes the worst case of page crossing and always spends 1 extra read cycle.
+|-
+| <code>ASL zp</code> || 5 || 2 || opcode + operand + read from zp + write to zp, but it takes 1 extra cycle to modify the value.
+|-
+| <code>LDA (indirect),Y</code> || 5 or 6 || 2 || opcode + operand + two reads from zp + read from indirect address. 1 extra cycle if a page is crossed.
+|-
+| <code>STA (indirect),Y</code> || 6 || 2 || Like LDA (indirect),Y, but assumes the worst case of page crossing and always spends 1 extra read cycle.
+|-
+| <code>PHA</code> || 3 || 1 || opcode + stack write, but requires 1 extra cycle to perform the stack operation.
+|-
+| <code>RTS</code>|| 6 || 1 || opcode + two stack reads, but requires 2 extra cycles to perform the stack operations, plus 1 cycle to post-increment the program counter (to compensate for the off-by-1 address pushed by JSR).
+|}
 == Short delays ==
@@ Line 36: / Line 51: @@
 NOP is essential for 2 cycle delays. 3 cycle delays always take 2 bytes, but usually have some compromise. More options become available as delays become longer.
-* <code>NOP</code> - 2 cycles, 1 byte, no side effects
+{|class="wikitable sortable"
-* <code>JMP *+3</code> - 3 cycles, 3 bytes, no side effects
+! Instructions !! Cycles !! Bytes !!class="unsortable"| Side effects and notes
-* <code>Bxx *+2</code> - 3 cycles, 2 bytes, no side effects but requires a known flag state (e.g. BCC if carry is known to be clear)
+|-
-* <code>BIT zp</code> - 3 cycles, 2 bytes, clobbers NVZ but preserves C, reads zp
+| <code>NOP</code> || 2 || 1 ||
-* <code>[[Programming with unofficial opcodes#NOPs|IGN zp]]</code> - 3 cycles, 2 bytes, only side effect is a read, unofficial instruction
+|-
-* <code>NOP, NOP</code> - 4 cycles, 2 bytes
+| <code>JMP *+3</code> || 3 || 3 ||
-* <code>NOP, ...</code> - 5 cycles, 3 bytes (... = 3 cycle delay of choice)
+|-
-* <code>CLV, BVC *+2</code> - 5 cycles, 3 bytes, clears V flag (can instead use C flag with CLC, BCC or SEC, BCS)
+| <code>Bxx *+2</code> || 3 || 2 || None, but requires a known flag state (e.g. BCC if carry is known to be clear).
-* <code>NOP, NOP, NOP</code> - 6 cycles, 4 bytes
+|-
-* <code>PHP, PLP</code> - 7 cycles, 2 bytes, modifies 1 byte of stack
+| <code>BIT zp</code> || 3 || 2 || Clobbers NVZ flags. Reads zp.
-* <code>NOP, NOP, NOP, NOP</code> - 8 cycles, 4 bytes
+|-
-* <code>PHP, PLP, NOP</code> - 9 cycles, 3 bytes, modifies 1 byte of stack
+| <code>[[Programming with unofficial opcodes#NOPs|IGN zp]]</code> || 3 || 2 || Reads zp. (Unofficial instruction.)
-* <code>PHP, CMP zp, PLP</code> - 10 cycles, 4 byes, modifies 1 byte of stack, reads zp
+|-
-* <code>PHP, PLP, NOP, NOP</code> - 11 cycles, 4 bytes, modifies 1 byte of stack
+| <code>NOP, NOP</code> || 4 || 2
-* <code>JSR, RTS</code> - 12 cycles, 3 bytes (if taking advantage of an existing RTS elsewhere), modifies 2 bytes of stack
+|-
+| <code>NOP, ...</code> || 5 || 3 || (... = 3 cycle delay of choice)
+|-
+| <code>CLV, BVC *+2</code> || 5 || 3 || Clears V flag. (Can instead use C flag with CLC, BCC or SEC, BCS.)
+|-
+| <code>NOP, NOP, NOP</code> || 6 || 3
+|-
+| <code>PHP, PLP</code> || 7 || 2 || Modifies 1 byte of stack.
+|-
+| <code>NOP, NOP, NOP, NOP</code> || 8 || 4
+|-
+| <code>PHP, PLP, NOP</code> || 9 || 3 || Modifies 1 byte of stack.
+|-
+| <code>PHP, CMP zp, PLP</code> || 10 || 4 || Modifies 1 byte of stack. Reads zp.
+|-
+| <code>PHP, PLP, NOP, NOP</code> || 11 || 4 || Modifies 1 byte of stack.
+|-
+| <code>JSR, RTS</code> || 12 || 3 || Modifies 2 bytes of stack. (Takes 3 bytes only if reusing an existing RTS; otherwise 4.)
+|}
 == Clockslide ==
@@ Line 59: / Line 92: @@
 The entry point can be controlled with an indirect jump or the [[RTS Trick]] to precisely control raster effect or sample playback timing.
-CMP has a side effect of destroying most of the flags, but [[CPU unofficial opcodes|unofficial instructions]] that skip one byte can be used to preserve them.
+CMP has a side effect of destroying most of the flags, but you can substitute other instructions with the same size and timing that preserve whichever flags/registers you need at the end of the slide.
-For example, replace $C9 (CMP) with $89 or $80, which skips one immediate byte, and replace $C5 with $04, $44, or $64, which reads a byte from zero page and ignores it.
+There are [[CPU unofficial opcodes|unofficial instructions]] that can avoid altering any state:
+replace $C9 (CMP) with $89 or $80, which ignores an immediate operand, and replace $C5 with $04, $44, or $64, which ignore a read from the zero page.
 == Resources ==
+* [[Delay code]] - various variable-cycle delays
+* [[Fixed cycle delay]] - shortest fixed-cycle delays
 * [https://bisqwit.iki.fi/utils/nesdelay.php Fixed-cycle delay code vending machine] - code for generating shortest-possible delay routines at compile-time.
 * [https://github.com/bbbradsmith/6502vdelay 6502 vdelay] - code for delaying a variable number of cycles at run-time.

Cycle counting: Difference between revisions

Latest revision as of 03:22, 3 June 2024

Contents

Instruction timings

Short delays

Clockslide

Resources

References

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

Tools