Consistent frame synchronization

Introduction

This page describes a method for consistently synchronizing with the PPU every frame from inside an NMI handler, without having to cycle-time everything. The method achieves synchronization as good as is possible with completely cycle-timed code. The code first synchronizes precisely with the PPU, which ensures that it behaves the same way every time it runs and every time the NES is powered up or reset. Its behavior is fully predictable.

Currently only the PAL version is covered, since the PAL PPU's frame timing is simpler. The NTSC version operates in a similar manner, and will be covered eventually.

PAL timing

The PAL NES has a master clock shared by the PPU and CPU. The CPU divides the master clock by 16 to get its instruction cycle clock; we'll call one period of that clock a cycle. For example, a NOP instruction takes 2 cycles. The PPU divides the master clock by 5 to get its pixel clock; we'll call one period of that clock a pixel. There are 16/5 = 3.2 pixels per cycle.

A video frame consists of 312 scanlines, each 341 pixels long. Unlike NTSC, there are no short frames, and rendering being enabled or disabled has no effect on frame length. Thus, every frame is exactly 312*341 = 106392 pixels = 33247.5 cycles long. We'll have pixel 0 refer to the first pixel of a frame, and pixel 106391 refer to the last pixel of a frame.

A frame begins with the vertical blanking interval (VBL), then the visible scanlines. The notation VBL+N refers to N cycles after the cycle that VBL began within, VBL+0. To talk about pixels since VBL, we simply refer to pixel P, where pixel 0 is the beginning of VBL, and pixel 106391 is the last pixel in the frame.
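The arithmetic in this section can be checked with a short script. This is only a sketch: the constant and function names are illustrative and do not come from any NES toolchain. Exact fractions are used so that the half cycle is not lost to rounding.

```python
from fractions import Fraction

CPU_DIVIDER = 16        # master clocks per CPU cycle (PAL)
PPU_DIVIDER = 5         # master clocks per PPU pixel (PAL)
SCANLINES = 312
PIXELS_PER_LINE = 341

PIXELS_PER_CYCLE = Fraction(CPU_DIVIDER, PPU_DIVIDER)   # 3.2
FRAME_PIXELS = SCANLINES * PIXELS_PER_LINE              # 106392
FRAME_CYCLES = FRAME_PIXELS / PIXELS_PER_CYCLE          # 33247.5

def pixel_to_cycle(pixel):
    """CPU cycles elapsed since VBL when the given pixel begins."""
    return Fraction(pixel) / PIXELS_PER_CYCLE

print(float(PIXELS_PER_CYCLE), FRAME_PIXELS, float(FRAME_CYCLES))
```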

Basic synchronization

If we're going to write at a particular pixel, we must first synchronize the CPU to the beginning of a frame, so that pixel 0 begins at the beginning of a cycle, and we know how many cycles ago that was. Reading $2002 gives the current status of the VBL flag in bit 7, then clears it. The VBL flag is set at pixel 0 of each frame, and cleared around when VBL ends. We can use the VBL flag to achieve synchronization.

A frame is 33247.5 cycles long. If we could somehow read $2002 every 33247.5 cycles, we'd read at the same point in each frame. But if we read $2002 every 33248 cycles, we'll be reading 0.5 cycles (1.6 pixels) later each successive frame. If we have a loop do this until it finds the VBL flag set, it will synchronize with the PPU. Each time through, it will read later in the frame, until it reads just as the VBL flag for the next frame is set.

<source lang="6502tasm">
        ; Fine synchronize
:       delay 33241
        bit $2002
        bpl :-
</source>

Cycle	PPU	CPU
0
1
...
33246		Read $2002 = 0
33246.5
33247
33247.5	Set VBL flag
...
66494		Read $2002 = 0
66494.5
66495	Set VBL flag
...
99742		Read $2002 = 0
99742.5	Set VBL flag
...
132990	Set VBL flag	Read $2002 = $80

Looking at it relative to each frame, we more clearly see how the CPU effectively reads later by half a cycle each frame.

Cycle	Frame 1	Frame 2	Frame 3	Frame 4	Event
-1.5	read
-1.0		read
-0.5			read
0				read	VBL flag set
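The drift shown in the two tables can be reproduced with a small model. It is a sketch under two assumptions that are not hardware facts established here: the flag is clear at every earlier read, and a read at or after the instant the flag is set sees it set. The function name is hypothetical.

```python
from fractions import Fraction

FRAME = Fraction(66495, 2)   # 33247.5 cycles per PAL frame
LOOP = 33248                 # cycles between $2002 reads

def fine_sync(first_read):
    """Return (read_cycle, frames_waited) for the first read that sees the flag.

    Assumes the flag is clear at every earlier read and that a read at or
    after the instant the flag is set sees it set.
    """
    read = Fraction(first_read)
    vbl = FRAME              # next time the flag is set
    frames = 0
    while read < vbl:
        read += LOOP
        vbl += FRAME
        frames += 1
    return read, frames

print(fine_sync(33246))  # (Fraction(132990, 1), 3), matching the first table
```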

The loop must be started so that the first $2002 read is slightly before the end of the frame, otherwise it might start out reading well after the flag has been set. We can do this by starting with a simpler coarse synchronization loop.

<source lang="6502tasm">
sync_ppu:
        ; Coarse synchronize
        bit $2002
:       bit $2002
        bpl :-
        
        delay 33231
        jmp first
        
        ; Fine synchronize
:       delay 33241
first:  bit $2002
        bpl :-
        
        rts
</source>

The coarse synchronization loop might read $2002 just as the VBL flag was set, or read it nearly 7 cycles after it was set. Then, in the fine synchronization loop, $2002 is read 33240 to 33247 cycles later. In most cases, this will be slightly before the VBL flag is set, so the loop will delay and read $2002 again 33248 cycles later, etc.
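To see how long the fine loop can take, the sketch below counts its reads for each whole-cycle offset left by the coarse loop. It reuses the same simplified flag model as before (a read at or after the instant the flag is set sees it), and the function name is hypothetical.

```python
from fractions import Fraction

FRAME = Fraction(66495, 2)   # 33247.5 cycles per PAL frame

def fine_iterations(coarse_offset):
    """Fine-loop reads needed when the coarse loop saw the flag
    `coarse_offset` cycles (0 <= offset < 7) after it was set."""
    read = Fraction(coarse_offset) + 33240   # first fine read
    vbl = FRAME                              # next time the flag is set
    reads = 1
    while read < vbl:
        read += 33248
        vbl += FRAME
        reads += 1
    return reads

for offset in range(7):
    print(offset, fine_iterations(offset))
```

Under this model the routine needs at most 16 reads, or roughly 16 frames, which happens when the coarse loop caught the flag exactly as it was set.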

When the loop exits, two cycles (the untaken BPL) have elapsed since the final $2002 read, which found the VBL flag just set.

Writing to a particular pixel

In order to achieve some graphical effect, we want to write to the PPU at a particular pixel every frame. As an example, we'll write to $2006 at pixel 30400, which is near the upper-center of the screen. To simplify things, we'll not care what value we write. This requires that we write to $2006 at VBL+9500.

<source lang="6502tasm">
        ; Synchronize to PPU
        jsr sync_ppu
        
        ; Delay almost a full frame, so that the code below begins at
        ; the start of a frame.
        delay 33238

vbl:    ; VBL begins in this cycle
        delay 9497
        sta $2006
</source>

Pixel	Cycle	Event
0	0	VBL begins delay 9497
...
	9497	STA $2006
	9498
	9499
30400	9500	$2006 write
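The delay value follows from the target pixel and from the fact that STA absolute performs its write on the last of its four cycles. A minimal sketch of that calculation, with hypothetical names:

```python
from fractions import Fraction

PIXELS_PER_CYCLE = Fraction(16, 5)   # PAL: 3.2 pixels per CPU cycle
STA_ABS_WRITE_OFFSET = 3             # STA abs writes on its 4th (last) cycle

def delay_for_pixel(pixel, start_cycle=0):
    """Cycles to delay from `start_cycle` (relative to VBL) before an
    STA absolute so that its write lands when `pixel` begins."""
    write_cycle = Fraction(pixel) / PIXELS_PER_CYCLE
    if write_cycle.denominator != 1:
        raise ValueError("pixel does not start on a cycle boundary")
    return int(write_cycle) - STA_ABS_WRITE_OFFSET - start_cycle

print(delay_for_pixel(30400))      # 9497
print(delay_for_pixel(30400, 9))   # 9488, for code that starts at VBL+9
```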

If we try to make this write to the same pixel each frame, we run into a problem: the frame length isn't a whole number of cycles. We'll count frames and treat odd frames as being 33247 cycles long, and even frames 33248 cycles long, which will average to the correct 33247.5 cycles per frame.

<source lang="6502tasm">
        ; Synchronize to PPU
        jsr sync_ppu
        
        ; Delay almost a full frame, so that the code below begins at
        ; the start of a frame.
        delay 33233
        
        ; We were on frame 1 after sync_ppu, but vbl will begin on frame 2
        lda #2
        sta frame_count

vbl:    ; VBL begins in this cycle
        delay 9497
        sta $2006
        
        delay 23731
        
        ; Delay extra cycle on even frames
        lda frame_count
        and #$01
        beq extra
extra:  inc frame_count
        jmp vbl
</source>

Now our write time doesn't drift, but it still doesn't write to the same pixel each frame. Since even frames begin in the middle of a cycle, our write is half a cycle/1.6 pixels earlier.

Odd frame pixel	Even frame pixel	Cycle	Event
0		0	VBL begins delay 9497
	0	0.5	VBL begins delay 9497
...
		9497	STA $2006
		9498
		9499
30400	30398.4	9500	$2006 write

Our write will thus fall on pixel 30400 on odd frames, and pixel 30398.4 on even frames. That's the best we can do, regardless of how we write our code, as this is a hardware limitation.
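The alternation can be verified numerically. The sketch below assumes frame 1 starts exactly on a cycle boundary and that each pass of the loop is 33247 cycles on odd frames and 33248 on even frames; the function name is illustrative.

```python
from fractions import Fraction

PIXELS_PER_CYCLE = Fraction(16, 5)
FRAME_PIXELS = 312 * 341             # 106392

def write_pixels(frames):
    """Pixel (within its frame) hit by the VBL+9500 write on each frame,
    when the loop alternates 33247-cycle and 33248-cycle passes."""
    result = []
    loop_start = 0                   # cycle at which the `vbl` label is reached
    for frame in range(1, frames + 1):
        frame_start = (frame - 1) * FRAME_PIXELS      # in pixels
        write = (loop_start + 9500) * PIXELS_PER_CYCLE
        result.append(write - frame_start)
        loop_start += 33247 if frame % 2 else 33248   # extra cycle on even frames
    return result

print([float(p) for p in write_pixels(4)])  # [30400.0, 30398.4, 30400.0, 30398.4]
```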

Another similar limitation is that when the NES is powered up or reset, the CPU and PPU master clock dividers start in random states, adding up to 1.6 additional pixels of variance. This offset doesn't change until the NES is powered off or reset.

Ideal NMI

Above, all the code had to be cycle-timed to ensure that each write occurred at the correct time. This isn't practical in most programs, which instead use NMI for synchronizing roughly to VBL. In these programs, timing-critical code is at the beginning of the NMI handler, followed by code that isn't carefully timed. Thus, such code relies on NMI occurring shortly after VBL, and not being delayed.

Ideally, NMI would begin a fixed number of cycles after VBL, without waiting for the current instruction to finish. If that were the case, we'd have it nearly as easy as before. Here, we'll imagine NMI always occurs at VBL+2. NMI takes 7 cycles to vector to our NMI handler, so that our NMI handler begins at VBL+9. To simplify the code and timing diagrams, we won't bother saving any registers as we'd normally do in an NMI handler.

<source lang="6502tasm">
nmi:    ; VBL+9
        delay 9488
        sta $2006         ; write at VBL+9500
</source>

Odd frame pixel	Even frame pixel	Cycle	Event
0		0	VBL begins
	0	0.5	VBL begins
		1
		2	NMI vectored
		3
		4
		5
		6
		7
		8
		9	delay 9488
...
		9497	STA $2006
		9498
		9499
30400	30398.4	9500	$2006 write

NMI delay

In reality, NMI waits until the current instruction completes before vectoring to the NMI handler, adding an extra delay as compared to the ideal NMI described above. Also, sometimes the NES powers up with the PPU and CPU dividers such that the NMI occurs an additional cycle later.

By ensuring that a short instruction is executing when VBL occurs, we can minimize the delay before NMI is vectored. For example, if we have a series of NOP instructions executing when VBL occurs, NMI will occur from 2 to 4 cycles after VBL. The table shows the three possible timings, with each column titled with the time NMI vectoring begins.

<source lang="6502tasm">
        nop
        nop
        nop
</source>

Cycle	VBL + 2	VBL + 3	VBL + 4	Event
-1		NOP
0	NOP		NOP	VBL begins
1		NOP
2	NMI vectored		NOP
3		NMI vectored
4			NMI vectored

So, at best, we have 2 to 4 cycles of delay between VBL and our NMI handler.

Using a long sequence of NOP instructions isn't practical, because it requires either a large number of NOP instructions, or that we know how long the code before them takes so that we can delay entry into the NOP sequence until NMI is about to occur. If we instead have a simple infinite loop made of a single JMP instruction, we only increase the maximum delay by one cycle, to 5.

<source lang="6502tasm">
loop:   jmp loop
</source>

Cycle	VBL + 2	VBL + 3	VBL + 4	VBL + 5	Event
-1	JMP			JMP
0		JMP			VBL begins
1			JMP
2	NMI vectored			JMP
3		NMI vectored
4			NMI vectored
5				NMI vectored
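Both tables are consistent with a simple model, sketched below. The model is an assumption inferred from the tables rather than a description of the hardware: vectoring begins at the first instruction boundary at or after VBL+2, or VBL+3 when the dividers powered up in the less favourable alignment.

```python
def nmi_delays(instruction_cycles):
    """Possible delays (cycles after VBL) before NMI vectoring begins while
    a loop of `instruction_cycles`-long instructions is running.

    Model assumed here: vectoring starts at the first instruction boundary
    at or after VBL+2, or VBL+3 with the less favourable divider alignment.
    """
    delays = set()
    for earliest in (2, 3):                      # divider alignment
        for phase in range(instruction_cycles):  # where boundaries fall
            boundary = phase
            while boundary < earliest:
                boundary += instruction_cycles
            delays.add(boundary)
    return sorted(delays)

print(nmi_delays(2))  # NOP: [2, 3, 4]
print(nmi_delays(3))  # JMP: [2, 3, 4, 5]
```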

Compensating for NMI delay

With a JMP loop to wait for NMI, we have 2 to 5 cycles of delay between VBL and our NMI handler. We want to compensate for this delay D by delaying an additional 5-D cycles. Here, we have the NOP always begin at VBL+12. We can't actually write this code, since D isn't known to the handler, but it shows the effect we must achieve by other means.

<source lang="6502tasm">
nmi:    delay 5-D
        nop
</source>

Cycle	VBL + 2	VBL + 3	VBL + 4	VBL + 5	Event
-1	JMP			JMP
0		JMP			VBL begins
1			JMP
2	NMI vectored			JMP
3		NMI vectored
4			NMI vectored
5				NMI vectored
6
7
8
9	delay 3
10		delay 2
11			delay 1
12	NOP	NOP	NOP	NOP (no delay)

We just have to find out how to determine the number of cycles of delay to add.

Sprite DMA always ends on even cycle

When sprite DMA ($4014) is written to, the next instruction always begins on an odd cycle. If the $4014 write is on an odd cycle, it pauses the CPU for an additional 513 cycles, otherwise 514 cycles. We can use this aspect to partially compensate for NMI's variable delay.

<source lang="6502tasm">
nmi:    lda #$07        ; sprites at $700
        sta $4014
        nop
</source>

Cycle	VBL + 2	VBL + 3	VBL + 4	VBL + 5
0	VBL begins
1
2	NMI
3		NMI
4			NMI
5				NMI
6
7
8
9	LDA #$07
10		LDA #$07
11	STA $4014		LDA #$07
12		STA $4014		LDA #$07
13			STA $4014
14	$4014 write			STA $4014
15	514-cycle DMA	$4014 write
16		513-cycle DMA	$4014 write
17			514-cycle DMA	$4014 write
18				513-cycle DMA
...
527
528	DMA finishes	DMA finishes
529	NOP	NOP
530			DMA finishes	DMA finishes
531			NOP	NOP

This reduces the number of different delays from four to two. The NOP always executes at either VBL+529 or VBL+531. This is an improvement. We just need a way to determine which time DMA finished at, and delay two extra cycles if it was the earlier one.
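The DMA rule stated above is easy to model. The sketch below takes the cycle on which STA $4014 begins and returns the cycle on which the following instruction begins; the function name is hypothetical, and the 513/514 rule is taken directly from the text.

```python
def after_dma(sta_start):
    """Cycle on which the instruction after STA $4014 begins.

    `sta_start` is the cycle STA $4014 begins on; its write happens on the
    fourth cycle. DMA then takes 513 cycles after a write on an odd cycle
    and 514 after a write on an even cycle, as described above.
    """
    write = sta_start + 3
    dma = 513 if write % 2 else 514
    return write + dma + 1

for d in (2, 3, 4, 5):                 # NMI delay after VBL
    handler = d + 7                    # 7 cycles to vector
    print(d, after_dma(handler + 2))   # LDA #$07 takes 2 cycles
```

For the four NMI delays this prints 529, 529, 531 and 531, and the result is odd for any starting cycle.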

VBL flag cleared at end of VBL

The VBL flag is cleared near the end of VBL. If we read $2002 around the time the flag is cleared, we can determine whether the read occurred before or after the flag was cleared. We will have to avoid reading $2002 elsewhere in the NMI handler, since reading $2002 clears the flag.

The VBL flag is cleared around pixel 23869, sometimes one less, so we want to read $2002 at VBL+7458 or VBL+7460. It works out nicely that sprite DMA leaves two cycles between the possible ending times, as this ensures that our $2002 read is several pixels before or after when the flag is cleared, giving us a good margin for error. If we find the flag set, we know we are on the earlier of the two DMA ending times, so we delay an extra two cycles.

<source lang="6502tasm">
nmi:    lda #$07        ; sprites at $700
        sta $4014
        delay 6926
        bit $2002       ; read at VBL+7458 or VBL+7460
        bpl skip
        bit 0
skip:   nop
</source>

Cycle	VBL + 2	VBL + 3	VBL + 4	VBL + 5
0	VBL begins
1
2	NMI
3		NMI
4			NMI
5				NMI
6
7
8
9	LDA #$07
10		LDA #$07
11	STA $4014		LDA #$07
12		STA $4014		LDA #$07
13			STA $4014
14	$4014 write			STA $4014
15	514-cycle DMA	$4014 write
16		513-cycle DMA	$4014 write
17			514-cycle DMA	$4014 write
18				513-cycle DMA
...
527
528	DMA finishes	DMA finishes
529	delay 6926	delay 6926
530			DMA finishes	DMA finishes
531			delay 6926	delay 6926
...
7455	BIT $2002	BIT $2002
7456
7457			BIT $2002	BIT $2002
7458	$2002 read = $80	$2002 read = $80
7459	BPL not taken	BPL not taken	VBL cleared	VBL cleared
7460			$2002 read = 0	$2002 read = 0
7461	BIT 0	BIT 0	BPL taken	BPL taken
7462
7463
7464	NOP	NOP	NOP	NOP

This achieves our goal, but not in all cases.
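For frames in which VBL begins during an even cycle, the table can be checked by adding up instruction times. This sketch assumes the flag is cleared during cycle 7459, as in the table, and uses hypothetical names.

```python
VBL_CLEAR = 7459   # cycle during which the flag is cleared (from the table)

def nop_cycle(nmi_delay):
    """Cycle on which the final NOP begins, for VBL starting on cycle 0."""
    t = nmi_delay + 7                # handler entry
    t += 2                           # lda #$07
    write = t + 3                    # sta $4014 writes on its 4th cycle
    t = write + (513 if write % 2 else 514) + 1
    t += 6926                        # delay 6926
    flag_set = (t + 3) < VBL_CLEAR   # bit $2002 reads on its 4th cycle
    t += 4                           # bit $2002
    if flag_set:
        t += 2 + 3                   # bpl not taken, bit 0
    else:
        t += 3                       # bpl taken
    return t

print([nop_cycle(d) for d in (2, 3, 4, 5)])  # [7464, 7464, 7464, 7464]
```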

VBL begins on odd cycles

Unfortunately, VBL doesn't always begin during an even cycle, as we've so far assumed. When VBL begins during an odd cycle, our code doesn't work so well:

Cycle	VBL + 2	VBL + 3	VBL + 4	VBL + 5
1	VBL begins
2
3	NMI
4		NMI
5			NMI
6				NMI
7
8
9
10	LDA #$07
11		LDA #$07
12	STA $4014		LDA #$07
13		STA $4014		LDA #$07
14			STA $4014
15	$4014 write			STA $4014
16	513-cycle DMA	$4014 write
17		514-cycle DMA	$4014 write
18			513-cycle DMA	$4014 write
19				514-cycle DMA
...
527
528	DMA finishes
529
530		DMA finishes	DMA finishes
531
532				DMA finishes

Now DMA ends at three different times, covering a wider range than the original NMI times did, thus making things worse!

We need to keep track of when VBL begins during an odd cycle, and compensate before we begin DMA. After our PPU synchronization routine finishes, the last $2002 read it makes will have just found the VBL flag set. In the following table, that is cycle 0.

Pixel	Cycle	Frame
0	0	1
106392	33247.5	2
212784	66495	3
319176	99742.5	4
425568	132990	5
531960	166237.5	6
638352	199485	7
744744	232732.5	8

Looking at which cycle each frame begins on, we see they follow a four-frame pattern: even, odd, odd, even. So we'll just have a variable that starts out at 1 and increments every frame, then examine bit 1 and delay an extra cycle if it's clear. This extra code takes 8 cycles on frames where VBL begins during an even cycle, and 7 cycles otherwise.

But we also need to insert a complementary delay after DMA, before the $2002 read, since on frames where VBL begins during an odd cycle we'll need to read $2002 one cycle later after DMA than for even frames.
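The four-frame pattern and the bit 1 test can be checked against the frame start times in the table. A minimal sketch, with illustrative function names and with cycle 0 taken as the start of frame 1:

```python
from fractions import Fraction
from math import floor

FRAME = Fraction(66495, 2)   # 33247.5 cycles

def vbl_on_odd_cycle(frame):
    """True if frame `frame` (1-based) begins during an odd CPU cycle,
    counting cycle 0 as the start of frame 1."""
    return floor((frame - 1) * FRAME) % 2 == 1

def counter_says_odd(frame_count):
    """The test used by the handler: bit 1 of the frame counter."""
    return (frame_count & 0x02) != 0

print([vbl_on_odd_cycle(n) for n in range(1, 9)])
```

Because the pattern repeats every four frames and 256 is a multiple of four, a one-byte counter that wraps around stays in step.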

<source lang="6502tasm">
nmi:    lda frame_count
        and #$02
        beq even
even:   lda #$07        ; sprites at $700
        sta $4014
        delay 6911
        lda frame_count
        and #$02
        bne odd
odd:    bit $2002
        bpl skip
        bit 0
skip:   inc frame_count
        delay 2028
        sta $2006
</source>

Cycle	Frames 1, 4, 5, 8 ...				Frames 2, 3, 6, 7 ...
Cycle	VBL + 2	VBL + 3	VBL + 4	VBL + 5	VBL + 2	VBL + 3	VBL + 4	VBL + 5
0	VBL	VBL	VBL	VBL
1					VBL	VBL	VBL	VBL
2	NMI
3		NMI			NMI
4			NMI			NMI
5				NMI			NMI
6								NMI
7
8
9	LDA frame_count
10		LDA frame_count			LDA frame_count
11			LDA frame_count			LDA frame_count
12	AND #$02			LDA frame_count			LDA frame_count
13		AND #$02			AND #$02			LDA frame_count
14	BEQ taken		AND #$02			AND #$02
15		BEQ taken		AND #$02	BEQ not taken		AND #$02
16			BEQ taken			BEQ not taken		AND #$02
17	LDA #$07			BEQ taken	LDA #$07		BEQ not taken
18		LDA #$07				LDA #$07		BEQ not taken
19	STA $4014		LDA #$07		STA $4014		LDA #$07
20		STA $4014		LDA #$07		STA $4014		LDA #$07
21			STA $4014				STA $4014
22	$4014 write			STA $4014	$4014 write			STA $4014
23	514-cycle DMA	$4014 write			514-cycle DMA	$4014 write
24		513-cycle DMA	$4014 write			513-cycle DMA	$4014 write
25			514-cycle DMA	$4014 write			514-cycle DMA	$4014 write
26				513-cycle DMA				513-cycle DMA
...
535
536	DMA finishes	DMA finishes			DMA finishes	DMA finishes
537	delay 6911	delay 6911			delay 6911	delay 6911
538			DMA finishes	DMA finishes			DMA finishes	DMA finishes
539			delay 6911	delay 6911			delay 6911	delay 6911
...
7448	LDA frame_count	LDA frame_count			LDA frame_count	LDA frame_count
7449
7450			LDA frame_count	LDA frame_count			LDA frame_count	LDA frame_count
7451	AND #$02	AND #$02			AND #$02	AND #$02
7452
7453	BNE not taken	BNE not taken	AND #$02	AND #$02	BNE taken	BNE taken	AND #$02	AND #$02
7454
7455	BIT $2002	BIT $2002	BNE not taken	BNE not taken			BNE taken	BNE taken
7456					BIT $2002	BIT $2002
7457			BIT $2002	BIT $2002
7458	$2002 read = $80	$2002 read = $80					BIT $2002	BIT $2002
7459	BPL not taken	BPL not taken	VBL cleared	VBL cleared	$2002 read = $80	$2002 read = $80
7460			$2002 read = 0	$2002 read = 0	BPL not taken	BPL not taken	VBL cleared	VBL cleared
7461	BIT 0	BIT 0	BPL taken	BPL taken			$2002 read = 0	$2002 read = 0
7462					BIT 0	BIT 0	BPL taken	BPL taken
7463
7464	INC frame_count	INC frame_count	INC frame_count	INC frame_count
7465					INC frame_count	INC frame_count	INC frame_count	INC frame_count
7466
7467
7468
7469	delay 2028	delay 2028	delay 2028	delay 2028
7470					delay 2028	delay 2028	delay 2028	delay 2028
...
9497	STA $2006	STA $2006	STA $2006	STA $2006
9498					STA $2006	STA $2006	STA $2006	STA $2006
9499
9500	$2006 write at VBL+9500	$2006 write at VBL+9500	$2006 write at VBL+9500	$2006 write at VBL+9500
9501					$2006 write at VBL+9500	$2006 write at VBL+9500	$2006 write at VBL+9500	$2006 write at VBL+9500

The $2006 write is done at VBL+9500 in all cases. Remember that the right four columns have VBL beginning on cycle 1 (an odd cycle), which is why the final writes appear to be one cycle later than the others.
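The whole table can be re-derived by summing instruction times for every combination of NMI delay and frame parity. The sketch below is such a model, not an emulator: it assumes `frame_count` lives in zero page, that the flag is cleared 7459 cycles after VBL, and the function name is hypothetical.

```python
VBL_CLEAR = 7459   # cycles after VBL at which the flag is cleared (from the table)

def write_offset(nmi_delay, vbl_cycle, frame_count):
    """Cycles after VBL at which the final $2006 write lands.

    `vbl_cycle` is the absolute CPU cycle VBL begins in; its parity is what
    the DMA alignment depends on. Zero-page `frame_count` is assumed.
    """
    odd = (frame_count & 0x02) != 0
    t = vbl_cycle + nmi_delay + 7        # handler entry
    t += 3 + 2 + (2 if odd else 3)       # lda, and, beq (taken on even frames)
    t += 2                               # lda #$07
    write = t + 3                        # sta $4014 writes on its 4th cycle
    t = write + (513 if write % 2 else 514) + 1
    t += 6911                            # delay 6911
    t += 3 + 2 + (3 if odd else 2)       # lda, and, bne (taken on odd frames)
    flag_set = (t + 3 - vbl_cycle) < VBL_CLEAR
    t += 4                               # bit $2002
    t += (2 + 3) if flag_set else 3      # bpl not taken + bit 0, or bpl taken
    t += 5                               # inc frame_count
    t += 2028                            # delay 2028
    return t + 3 - vbl_cycle             # sta $2006 writes on its 4th cycle

for vbl_cycle, frame_count in ((0, 1), (1, 2), (1, 3), (0, 4)):
    print(frame_count, [write_offset(d, vbl_cycle, frame_count) for d in (2, 3, 4, 5)])
```

Every combination yields 9500 in this model.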

Synchronizing with even CPU cycle

Since our final synchronization method relies on knowing whether a given frame begins during an even or odd cycle, we must initially ensure that our PPU synchronization routine's final $2002 read is also during an even cycle. Since the fine synchronization loop takes an even number of cycles, we merely need to ensure that the $2002 read is on an even cycle the first time through. We can do this by initiating sprite DMA before the fine synchronization loop.

<source lang="6502tasm">
sync_ppu:
        ; Coarse synchronize
        bit $2002
:       bit $2002
        bpl :-
        
        sta $4014
        delay 32713
        jmp first
        
        ; Fine synchronize
:       delay 33241
first:  bit $2002
        bpl :-
        
        ; NMI won't be fired until frame 2
        lda #2
        sta frame_count
        
        rts
</source>

The STA $4014 takes up to 518 cycles, so we subtract that from the initial delay. After the STA $4014, the delay begins on an odd cycle. Since it is also an odd number of cycles from there until the $2002 read, the read will occur on an even cycle, as desired.
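The parity argument can be checked for every possible starting cycle. This is a sketch with a hypothetical function name; it relies only on the DMA rule and the instruction times given earlier.

```python
def first_fine_read(sta_start):
    """Absolute cycle of the first fine-loop $2002 read, given the cycle
    on which STA $4014 begins."""
    write = sta_start + 3                          # write on the 4th cycle
    t = write + (513 if write % 2 else 514) + 1    # next instruction: always odd
    t += 32713                                     # delay 32713
    t += 3                                         # jmp first
    return t + 3                                   # bit $2002 reads on its 4th cycle

print(all(first_fine_read(s) % 2 == 0 for s in range(100)))  # True
```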

Simpler synchronization routine

The PPU synchronization routine is pretty short, but it requires use of the delay macro, which takes a fair amount of code to implement. It's possible to eliminate that without any negative impact.

The fine synchronization loop needs to read $2002 every 33248 cycles, so it can find when the VBL flag is set just before the read. This seems to require a long delay between reads. Until the final iteration, it must not find the VBL flag set. If it were like the coarse loop and read the VBL flag every 7 cycles, it would clearly stop somewhere near the beginning of the first frame, but rarely right at the beginning. It might read $2002 one cycle before the VBL flag is set, loop, then read it 7 cycles later and find it now set. This isn't what we want. If we read it slightly more often, like every 33248/2 = 16624 cycles, it would still work, since the VBL flag is automatically cleared near the end of VBL.

<source lang="6502tasm">
sync_ppu:
        ; Coarse synchronize
        bit $2002
:       bit $2002
        bpl :-
        
        sta $4014
        delay 16089
        jmp first
        
        ; Fine synchronize
:       delay 16617
first:  bit $2002
        bpl :-
        
        rts
</source>

Cycle	PPU	CPU
0	Set VBL flag
7459	Clear VBL flag
16622		Read $2002 = 0
33246		Read $2002 = 0
33247.5	Set VBL flag
40706.5	Clear VBL flag
49870		Read $2002 = 0
66494		Read $2002 = 0
66495	Set VBL flag
73954	Clear VBL flag
83118		Read $2002 = 0
99742		Read $2002 = 0
99742.5	Set VBL flag
107201.5	Clear VBL flag
116366		Read $2002 = 0
132990	Set VBL flag	Read $2002 = $80

That works, but reducing the delays doesn't eliminate the need for them. The important thing is that the $2002 read only be able to happen just after the VBL flag is set, rather than many cycles after it was set. Rather than rely on the PPU to clear the VBL flag, we can clear it ourselves. 16 is a factor of 33248, so we can have the loop take only 16 cycles and still synchronize properly.

<source lang="6502tasm">
sync_ppu:
        ; Coarse synchronize
        bit $2002
:       bit $2002
        bpl :-
        
        sta $4014
        bit <0
        
        ; Fine synchronize
:       bit <0
        nop
        bit $2002
        bit $2002
        bpl :-
        
        rts
</source>

Cycle	PPU	CPU
0	Set VBL flag
10		Dummy read $2002 = $80
14		Read $2002 = 0
26		Dummy read $2002 = 0
30		Read $2002 = 0
...
33242		Dummy read $2002 = 0
33246		Read $2002 = 0
33247.5	Set VBL flag
33258		Dummy read $2002 = $80
33262		Read $2002 = 0
...
66490		Dummy read $2002 = 0
66494		Read $2002 = 0
66495	Set VBL flag
66506		Dummy read $2002 = $80
66510		Read $2002 = 0
...
99738		Dummy read $2002 = 0
99742		Read $2002 = 0
99742.5	Set VBL flag
99754		Dummy read $2002 = $80
99758		Read $2002 = 0
...
132986		Dummy read $2002 = 0
132990	Set VBL flag	Read $2002 = $80

Essentially, the second $2002 read in the loop watches a four-cycle window for the VBL flag being set. On entry to the loop, we ensure that the flag will never be set within this window. Every 33248/16 = 2078 iterations, the second $2002 read is half a cycle later in the frame, just like the original version. On every other iteration, the dummy $2002 read four cycles before has ensured that the VBL flag is cleared.
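The 16-cycle loop can be simulated directly. The sketch below works in half cycles so that all times are integers, and it makes the same simplifying assumptions as the table: the flag is set at multiples of the frame length, any $2002 read clears it, and a read at the instant it is set sees it set. The function name is hypothetical.

```python
FRAME_HALF = 66495          # frame length in half cycles (33247.5 cycles)

def sync_16(first_dummy=10):
    """Cycle of the loop's second $2002 read that first sees the flag set.

    Assumptions of this sketch: the flag is set at multiples of the frame
    length, any read clears it, and a read at the instant it is set sees it.
    """
    flag = True                       # set at cycle 0, before the loop
    next_vbl = FRAME_HALF
    k = 0
    while True:
        for offset, is_test in ((0, False), (4, True)):
            t = 2 * (first_dummy + 16 * k + offset)   # read time, half cycles
            while next_vbl <= t:
                flag = True
                next_vbl += FRAME_HALF
            seen, flag = flag, False  # reading $2002 clears the flag
            if is_test and seen:
                return t // 2
        k += 1

print(sync_16())  # 132990
```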
