Consistent frame synchronization

From NESdev Wiki
Latest revision as of 15:35, 24 March 2024

Introduction

This page describes a method for consistently synchronizing with the PPU every frame from inside an NMI handler, without having to cycle-time everything. The method allows synchronization just as good as is possible with completely cycle-timed code. The code first synchronizes precisely with the PPU, which ensures that it behaves the same every time it's run and every time the NES is powered up or reset. It's fully predictable.

Currently only the PAL version is covered, since the PAL PPU's frame timing is simpler. The NTSC version operates in a similar manner, and will be covered eventually.

PAL timing

See also: Cycle reference chart

The PAL NES has a master clock shared by the PPU and CPU. The CPU divides the master clock by 16 to get its instruction cycle clock, which we'll call cycle for simplicity. For example, a NOP instruction takes 2 cycles. The PPU divides the master clock by 5 to get its pixel clock, which we'll call pixel for simplicity. There are 16/5 = 3.2 pixels per cycle.

A video frame consists of 312 scanlines, each 341 pixels long. Unlike NTSC, there are no short frames, and rendering being enabled or disabled has no effect on frame length. Thus, every frame is exactly 312*341 = 106392 pixels = 33247.5 cycles long. We'll have pixel 0 refer to the first pixel of a frame, and pixel 106391 refer to the last pixel of a frame.
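
The arithmetic in this section can be checked with exact fractions. The following Python sketch is only an illustration; the constant names are made up for this example and do not come from any NES tool.

```python
from fractions import Fraction

# PAL clock dividers and frame geometry, as stated in the text.
MASTER_PER_CPU_CYCLE = 16
MASTER_PER_PIXEL = 5
SCANLINES = 312
PIXELS_PER_SCANLINE = 341

pixels_per_cycle = Fraction(MASTER_PER_CPU_CYCLE, MASTER_PER_PIXEL)
pixels_per_frame = SCANLINES * PIXELS_PER_SCANLINE
cycles_per_frame = pixels_per_frame / pixels_per_cycle

print(float(pixels_per_cycle))   # 3.2
print(pixels_per_frame)          # 106392
print(float(cycles_per_frame))   # 33247.5
```

Using `Fraction` keeps the half cycle exact, which matters once frame lengths are added up over many frames.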

A frame begins with the vertical blanking interval (VBL), then the visible scanlines. The notation VBL+N refers to N cycles after the cycle that VBL began within, VBL+0. To talk about pixels since VBL, we simply refer to pixel P, where pixel 0 is the beginning of VBL, and pixel 106391 is the last pixel in the frame.

Basic synchronization

If we're going to write at a particular pixel, we must first synchronize the CPU to the beginning of a frame, so that pixel 0 begins at the beginning of a cycle, and we know how many cycles ago that was. Reading $2002 gives the current status of the VBL flag in bit 7, then clears it. The VBL flag is set at pixel 0 of each frame, and cleared around when VBL ends. We can use the VBL flag to achieve synchronization.

A frame is 33247.5 cycles long. If we could somehow read $2002 every 33247.5 cycles, we'd read at the same point in each frame. But if we read $2002 every 33248 cycles, we'll be reading 0.5 cycles (1.6 pixels) later each successive frame. If we have a loop do this until it finds the VBL flag set, it will synchronize with the PPU. Each time through, it will read later in the frame, until it reads just as the VBL flag for the next frame is set.

        ; Fine synchronize
:       delay 33241
        bit $2002
        bpl :-

Cycle     PPU            CPU
0
1
...
33246                    Read $2002 = 0
33246.5
33247
33247.5   Set VBL flag
...
66494                    Read $2002 = 0
66494.5
66495     Set VBL flag
...
99742                    Read $2002 = 0
99742.5   Set VBL flag
...
132990    Set VBL flag   Read $2002 = $80

Looking at it relative to each frame, we more clearly see how the CPU effectively reads later by half a cycle each frame.

Cycle   Frame 1   Frame 2   Frame 3   Frame 4   Event
-1.5    read
-1.0              read
-0.5                        read
0                                     read      VBL flag set
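
The half-cycle drift can be reproduced with a small model. This Python sketch assumes an idealized PPU in which a read sees the VBL flag exactly when it happens at or after the moment the flag is set; the function name `fine_sync` is made up for this example.

```python
from fractions import Fraction

FRAME = Fraction(66495, 2)   # PAL frame length in CPU cycles (33247.5)
LOOP = 33248                 # cycles between $2002 reads in the fine loop

def fine_sync(first_read):
    """Return (read_cycle, offset_from_vbl) for each read until the flag is seen.

    Assumes first_read falls less than one frame before the next VBL.
    """
    reads = []
    read = Fraction(first_read)
    frame = 1
    while True:
        offset = read - frame * FRAME   # negative: the read is before the flag is set
        reads.append((read, offset))
        if offset >= 0:
            return reads
        read += LOOP
        frame += 1

for read, offset in fine_sync(33246):
    print(int(read), float(offset))
```

Starting at cycle 33246, the offsets step through -1.5, -1.0, -0.5 and 0, and the loop exits on the read at cycle 132990.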

The loop must be started so that the first $2002 read is slightly before the end of the frame, otherwise it might start out reading well after the flag has been set. We can do this by starting with a simpler coarse synchronization loop.

sync_ppu:
        ; Coarse synchronize
        bit $2002
:       bit $2002
        bpl :-
        
        delay 33231
        jmp first
        
        ; Fine synchronize
:       delay 33241
first:  bit $2002
        bpl :-
 
        rts

The coarse synchronization loop might read $2002 just as the VBL flag was set, or read it nearly 7 cycles after it was set. Then, in the fine synchronization loop, $2002 is read 33240 to 33247 cycles later. In most cases, this will be slightly before the VBL flag is set, so the loop will delay and read $2002 again 33248 cycles later, etc.

Once done, the CPU will have executed two cycles after the final $2002 read that found the VBL flag just set.
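
The cycle counts quoted above follow from the standard 6502 instruction timings. The Python sketch below adds them up; it assumes that no branch crosses a page boundary and that `delay N` takes exactly N cycles, and its function names are made up for this example.

```python
# Base 6502 timings for the instructions used in sync_ppu.
BIT_ABS = 4         # BIT $2002; the read happens on the 4th cycle
BPL_TAKEN = 3       # branch taken, no page crossing (assumed)
BPL_NOT_TAKEN = 2
JMP_ABS = 3

def coarse_to_first_fine_read(delay):
    """Cycles from the coarse loop's final $2002 read to the first fine-loop read."""
    return BPL_NOT_TAKEN + delay + JMP_ABS + BIT_ABS

def fine_loop_period(delay):
    """Cycles between successive $2002 reads in the fine loop."""
    return delay + BIT_ABS + BPL_TAKEN

print(coarse_to_first_fine_read(33231))  # 33240
print(fine_loop_period(33241))           # 33248
```

Since the coarse read lands 0 to nearly 7 cycles after the flag was set, adding 33240 gives the 33240 to 33247 range mentioned in the text.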

Writing to a particular pixel

In order to achieve some graphical effect, we want to write to the PPU at a particular pixel every frame. As an example, we'll write to $2006 at pixel 30400, which is near the upper-center of the screen. To simplify things, we'll not care what value we write. This requires that we write to $2006 at VBL+9500.

        ; Synchronize to PPU
        jsr sync_ppu
        
        ; Delay almost a full frame, so that the code below begins at
        ; the start of a frame.
        delay 33238
        
vbl:    ; VBL begins in this cycle
        
        delay 9497
        sta $2006

Pixel   Cycle   Event
0       0       VBL begins
                delay 9497
        ...
        9497    STA $2006
        9498
        9499
30400   9500    $2006 write
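
Converting a target pixel into the delay used above is a short calculation. In this Python sketch the helper names are made up for this example; the only hardware facts used are the 3.2 pixels per cycle from the PAL timing section and that an absolute-addressed STA performs its write on its fourth cycle.

```python
from fractions import Fraction

PIXELS_PER_CYCLE = Fraction(16, 5)
STA_ABS_WRITE_OFFSET = 3   # the write is 3 cycles after the instruction begins

def pixel_to_cycle(pixel):
    """Cycle offset from VBL at which the given pixel begins."""
    return Fraction(pixel) / PIXELS_PER_CYCLE

def delay_before_sta(pixel):
    """Cycles to wait from VBL+0 before starting the STA."""
    return pixel_to_cycle(pixel) - STA_ABS_WRITE_OFFSET

print(pixel_to_cycle(30400))     # 9500
print(delay_before_sta(30400))   # 9497
```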

If we try to make this write to the same pixel each frame, we run into a problem: the frame length isn't a whole number of cycles. We'll count frames and treat odd frames as being 33247 cycles long, and even frames 33248 cycles long, which will average to the correct 33247.5 cycles per frame.

        ; Synchronize to PPU
        jsr sync_ppu
         
        ; Delay almost a full frame, so that the code below begins at
        ; the start of a frame.
        delay 33233
        
        ; We were on frame 1 after sync_ppu, but vbl will begin on frame 2
        lda #2
        sta frame_count
        
vbl:    ; VBL begins in this cycle
        
        delay 9497
        sta $2006
        
        delay 23731
        
        ; Delay extra cycle on even frames
        lda frame_count
        and #$01
        beq extra
extra:  inc frame_count

        jmp vbl

Now our write time doesn't drift, but it still doesn't write to the same pixel each frame. Since even frames begin in the middle of a cycle, our write is half a cycle/1.6 pixels earlier.

Odd frame pixel   Even frame pixel   Cycle   Event
0                                    0       VBL begins
                                             delay 9497
                  0                  0.5
                                     ...
                                     9497    STA $2006
                                     9498
                                     9499
30400             30398.4            9500    $2006 write

Our write will thus fall on pixel 30400 on odd frames, and pixel 30398.4 on even frames. That's the best we can do, regardless of how we write our code, as this is a hardware limitation.
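
The alternation between the two pixels can be reproduced by tracking where the code's loop begins relative to the true start of each frame. This Python sketch numbers frames from 1 with the first loop pass aligned to VBL, which is a simplification of the listing above; `write_pixels` is a name made up for this example.

```python
from fractions import Fraction

PIXELS_PER_CYCLE = Fraction(16, 5)
FRAME = Fraction(66495, 2)       # true frame length: 33247.5 cycles

def write_pixels(num_frames, write_offset=9500):
    """Pixel hit by the write in each frame, relative to that frame's VBL."""
    pixels = []
    loop_start = 0               # cycle at which the code's "vbl:" label runs
    for frame in range(1, num_frames + 1):
        vbl_start = (frame - 1) * FRAME
        write_cycle = loop_start + write_offset
        pixels.append((write_cycle - vbl_start) * PIXELS_PER_CYCLE)
        # The code treats odd frames as 33247 cycles and even frames as 33248.
        loop_start += 33247 if frame % 2 else 33248
    return pixels

print([float(p) for p in write_pixels(4)])
```

The result alternates between 30400 and 30398.4 and never drifts further, because two code frames add up to exactly two real frames.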

Another similar limitation is that when the NES is powered up or reset, the CPU and PPU master clock dividers start in random states, adding up to 1.6 additional pixels of variance. This offset doesn't change until the NES is powered off or reset.

Ideal NMI

Above, all the code had to be cycle-timed to ensure that each write occurred at the correct time. This isn't practical in most programs, which instead use NMI for synchronizing roughly to VBL. In these programs, timing-critical code is at the beginning of the NMI handler, followed by code that isn't carefully timed. Thus, such code relies on NMI occurring shortly after VBL, and not being delayed.

Ideally, NMI would begin a fixed number of cycles after VBL, without waiting for the current instruction to finish. If that were the case, we'd have it nearly as easy as before. Here, we'll imagine NMI always occurs at VBL+2. NMI takes 7 cycles to vector to our NMI handler, so that our NMI handler begins at VBL+9. To simplify the code and timing diagrams, we won't bother saving any registers as we'd normally do in an NMI handler.

nmi:    ; VBL+9
        delay 9488
        sta $2006         ; write at VBL+9500

Even frame pixel   Odd frame pixel   Cycle   Event
0                                    0       VBL begins
                   0                 0.5
                                     1
                                     2       NMI vectored
                                     3
                                     4
                                     5
                                     6
                                     7
                                     8
                                     9       delay 9488
                                     ...
                                     9497    STA $2006
                                     9498
                                     9499
30400              30398.4           9500    $2006 write

NMI delay

In reality, NMI waits until the current instruction completes before vectoring to the NMI handler, adding an extra delay as compared to the ideal NMI described above. Also, sometimes the NES powers up with the PPU and CPU dividers such that the NMI occurs an additional cycle later.

By ensuring that a short instruction is executing when VBL occurs, we can minimize the delay before NMI is vectored. For example, if we have a series of NOP instructions executing when VBL occurs, NMI will occur from 2 to 4 cycles after VBL. The table shows the three possible timings, with each column titled with the time NMI vectoring begins.

        nop
        nop
        nop

Cycle   VBL + 2        VBL + 3        VBL + 4        Event
-1                     NOP
0       NOP                           NOP            VBL begins
1                      NOP
2       NMI vectored                  NOP
3                      NMI vectored
4                                     NMI vectored

So, at best, we have 2 to 4 cycles of delay between VBL and our NMI handler.

Using a long sequence of NOP instructions isn't practical, because it requires either a large number of NOP instructions, or that we know how long the code before them takes so that we can delay entry into the NOP sequence until NMI is about to occur. If we instead have a simple infinite loop made of a single JMP instruction, we only increase the maximum delay by one cycle, to 5.

 loop:   jmp loop

Cycle   VBL + 2        VBL + 3        VBL + 4        VBL + 5        Event
-1      JMP                                          JMP
0                      JMP                                          VBL begins
1                                     JMP
2       NMI vectored                                 JMP
3                      NMI vectored
4                                     NMI vectored
5                                                    NMI vectored
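
Both delay ranges can be reproduced with a simple model. The Python sketch below is an assumption chosen to match the tables, not a description of the CPU's interrupt logic: NMI vectoring begins right after the first instruction whose last cycle is at or past a threshold, where the threshold is cycle 1 normally and cycle 2 for the power-up alignment that adds one cycle. The function name is made up for this example.

```python
def nmi_vector_cycles(instr_len):
    """Possible cycles (relative to VBL+0) at which NMI vectoring begins
    when the CPU is running back-to-back instructions of instr_len cycles."""
    cycles = set()
    for threshold in (1, 2):                  # normal and late CPU/PPU alignment
        for start in range(1 - instr_len, 1): # instruction in progress at cycle 0
            while start + instr_len - 1 < threshold:
                start += instr_len            # this instruction ends too early
            cycles.add(start + instr_len)
    return sorted(cycles)

print(nmi_vector_cycles(2))   # NOP: [2, 3, 4]
print(nmi_vector_cycles(3))   # JMP: [2, 3, 4, 5]
```

Under the same assumption, longer instructions widen the range further, which is why the wait loop should use short instructions.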

Compensating for NMI delay

With a JMP loop to wait for NMI, we have 2 to 5 cycles of delay between VBL and our NMI handler. We want to compensate for this delay D by delaying an additional 5-D cycles, so that the NOP below always begins at VBL+12. We can't write this directly, since the handler doesn't know D, but it shows the effect we must achieve.

nmi:	delay 5-D
	nop

Cycle   VBL + 2        VBL + 3        VBL + 4        VBL + 5        Event
-1      JMP                                          JMP
0                      JMP                                          VBL begins
1                                     JMP
2       NMI vectored                                 JMP
3                      NMI vectored
4                                     NMI vectored
5                                                    NMI vectored
6
7
8
9       delay 3
10                     delay 2
11                                    delay 1
12      NOP            NOP            NOP            NOP (no delay)

We just have to find out how to determine the number of cycles of delay to add.

Sprite DMA always ends on even cycle

After a write to sprite DMA ($4014), the next instruction always begins on an odd cycle. If the $4014 write falls on an odd cycle, the DMA pauses the CPU for 513 cycles, otherwise for 514 cycles. We can use this to partially compensate for NMI's variable delay.

nmi:    lda #$07          ; sprites at $700
        sta $4014
        nop

Cycle   VBL + 2            VBL + 3            VBL + 4            VBL + 5
0       VBL begins
1
2       NMI
3                          NMI
4                                             NMI
5                                                                NMI
6
7
8
9       LDA #$07
10                         LDA #$07
11      STA $4014                             LDA #$07
12                         STA $4014                             LDA #$07
13                                            STA $4014
14      $4014 write                                              STA $4014
15      514-cycle DMA      $4014 write
16                         513-cycle DMA      $4014 write
17                                            514-cycle DMA      $4014 write
18                                                               513-cycle DMA
...
527
528     DMA finishes       DMA finishes
529     NOP                NOP
530                                           DMA finishes       DMA finishes
531                                           NOP                NOP

This reduces the number of different delays from four to two. The NOP always executes at either VBL+529 or VBL+531. This is an improvement. We just need a way to determine which time DMA finished at, and delay two extra cycles if it was the earlier one.
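
The way DMA folds four entry times into two can be modelled directly from the rule above. This Python sketch assumes VBL begins during cycle 0 and that the handler starts 7 cycles after NMI vectoring; `first_cycle_after_dma` is a name made up for this example.

```python
def first_cycle_after_dma(nmi_delay):
    """Cycle at which the instruction after STA $4014 begins, for NMI at VBL+nmi_delay."""
    handler = nmi_delay + 7        # NMI takes 7 cycles to reach the handler
    write = handler + 2 + 3        # LDA #$07 (2 cycles), STA writes on its 4th cycle
    dma = 513 if write % 2 else 514
    return write + dma + 1

print([first_cycle_after_dma(d) for d in (2, 3, 4, 5)])   # [529, 529, 531, 531]
```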

VBL flag cleared at end of VBL

The VBL flag is cleared near the end of VBL. If we read $2002 around the time the flag is cleared, we can determine whether the read occurred before or after the flag was cleared. We will have to avoid reading $2002 elsewhere in the NMI handler, since reading $2002 clears the flag.

The VBL flag is cleared around pixel 23869, sometimes one less, so we want to read $2002 at VBL+7458 or VBL+7460. It works out nicely that sprite DMA leaves two cycles between the possible ending times, as this ensures that our $2002 read is several pixels before or after when the flag is cleared, giving us a good margin for error. If we find the flag set, we know we are on the earlier of the two DMA ending times, so we delay an extra two cycles.

nmi:    lda #$07          ; sprites at $700
        sta $4014
        delay 6926
        bit $2002         ; read at VBL+7458 or VBL+7460
        bpl skip
        bit 0
skip:   nop

Cycle   VBL + 2            VBL + 3            VBL + 4            VBL + 5
0       VBL begins
1
2       NMI
3                          NMI
4                                             NMI
5                                                                NMI
6
7
8
9       LDA #$07
10                         LDA #$07
11      STA $4014                             LDA #$07
12                         STA $4014                             LDA #$07
13                                            STA $4014
14      $4014 write                                              STA $4014
15      514-cycle DMA      $4014 write
16                         513-cycle DMA      $4014 write
17                                            514-cycle DMA      $4014 write
18                                                               513-cycle DMA
...
527
528     DMA finishes       DMA finishes
529     delay 6926         delay 6926
530                                           DMA finishes       DMA finishes
531                                           delay 6926         delay 6926
...
7455    BIT $2002          BIT $2002
7456
7457                                          BIT $2002          BIT $2002
7458    $2002 read = $80   $2002 read = $80
7459    BPL not taken      BPL not taken      VBL cleared        VBL cleared
7460                                          $2002 read = 0     $2002 read = 0
7461    BIT 0              BIT 0              BPL taken          BPL taken
7462
7463
7464    NOP                NOP                NOP                NOP

This achieves our goal, but not in all cases.
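
The handler above can be traced for both DMA ending times. This Python sketch assumes the flag is cleared exactly at pixel 23869 and that a read sees the flag only if it happens before that moment; `nop_cycle` is a name made up for this example.

```python
from fractions import Fraction

FLAG_CLEAR = Fraction(23869 * 5, 16)   # pixel 23869 in cycles after VBL (about 7459.06)

def nop_cycle(after_dma):
    """Return (cycle of the $2002 read, cycle at which the final NOP begins)."""
    t = after_dma + 6926       # delay 6926
    read = t + 3               # BIT $2002 reads on its 4th cycle
    t += 4
    if read < FLAG_CLEAR:      # flag still set: BPL not taken, then BIT 0
        t += 2 + 3
    else:                      # flag already cleared: BPL taken
        t += 3
    return read, t

print(nop_cycle(529))   # (7458, 7464)
print(nop_cycle(531))   # (7460, 7464)
```

Both paths reach the NOP at VBL+7464, so the two remaining DMA timings collapse into one.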

VBL begins on odd cycles

Unfortunately, VBL doesn't always begin during an even cycle, as we've so far assumed. When VBL begins during an odd cycle, our code doesn't work so well:

Cycle   VBL + 2            VBL + 3            VBL + 4            VBL + 5
1       VBL begins
2
3       NMI
4                          NMI
5                                             NMI
6                                                                NMI
7
8
9
10      LDA #$07
11                         LDA #$07
12      STA $4014                             LDA #$07
13                         STA $4014                             LDA #$07
14                                            STA $4014
15      $4014 write                                              STA $4014
16      513-cycle DMA      $4014 write
17                         514-cycle DMA      $4014 write
18                                            513-cycle DMA      $4014 write
19                                                               514-cycle DMA
...
527
528     DMA finishes
529
530                        DMA finishes       DMA finishes
531
532                                                              DMA finishes

Now DMA ends at three different times, covering a wider range than the original NMI times did, thus making things worse!
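
The effect of VBL's starting parity on the DMA end time can be compared side by side. This Python sketch uses the same 513/514-cycle rule as the text; `dma_finish_cycle` and its `vbl_cycle` parameter are names made up for this example.

```python
def dma_finish_cycle(nmi_delay, vbl_cycle):
    """Last cycle of sprite DMA when VBL begins during vbl_cycle."""
    handler = vbl_cycle + nmi_delay + 7   # NMI handler entry
    write = handler + 2 + 3               # LDA #$07, then STA $4014 writes on its 4th cycle
    return write + (513 if write % 2 else 514)

even_start = [dma_finish_cycle(d, 0) for d in (2, 3, 4, 5)]
odd_start = [dma_finish_cycle(d, 1) for d in (2, 3, 4, 5)]
print(even_start)   # [528, 528, 530, 530]
print(odd_start)    # [528, 530, 530, 532]
```

Measured from VBL rather than from cycle 0, the odd-start case ends at VBL+527, VBL+529 and VBL+531, so the pairing of NMI delays no longer matches the even-start case.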

We need to keep track of when VBL begins during an odd cycle, and compensate before we begin DMA. After our PPU synchronization routine finishes, the last $2002 read it makes will have just found the VBL flag set. In the following table, that is cycle 0.

Pixel    Cycle      Frame
0        0          1
106392   33247.5    2
212784   66495      3
319176   99742.5    4
425568   132990     5
531960   166237.5   6
638352   199485     7
744744   232732.5   8

Looking at which cycle each frame begins on, we see they follow a four-frame pattern: even, odd, odd, even. So we'll just have a variable that starts out at 1 and increments every frame, then examine bit 1 and delay an extra cycle if it's clear. This extra code takes 8 cycles on frames where VBL begins during an even cycle, and 7 cycles otherwise.
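
The four-frame pattern and the bit 1 test can be checked against each other. In this Python sketch frame 1 begins at cycle 0, as in the table above; the function names are made up for this example.

```python
FRAME_HALF_CYCLES = 66495   # 33247.5 cycles, counted in half cycles

def vbl_begins_on_odd_cycle(frame):
    """True if the given frame (numbered from 1) begins during an odd CPU cycle."""
    start_cycle = (frame - 1) * FRAME_HALF_CYCLES // 2
    return start_cycle % 2 == 1

def bit1_set(frame_count):
    """The test the NMI handler performs with AND #$02."""
    return (frame_count & 0x02) != 0

for frame in range(1, 9):
    print(frame, vbl_begins_on_odd_cycle(frame), bit1_set(frame))
```

The two columns agree for every frame, so bit 1 of a counter that starts at 1 tracks the parity of the cycle in which VBL begins.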

But we also need to insert a complementary delay after DMA, before the $2002 read: on frames where VBL begins during an odd cycle, $2002 must be read one cycle later after DMA than on frames where it begins during an even cycle.

nmi:    lda frame_count
        and #$02
        beq even
even:   lda #$07          ; sprites at $700
        sta $4014
        delay 6911
        lda frame_count
        and #$02
        bne odd
odd:    bit $2002
        bpl skip
        bit 0
skip:   inc frame_count
        delay 2028
        sta $2006

Frames 1, 4, 5, 8 ... (VBL begins during cycle 0)

Cycle   VBL + 2            VBL + 3            VBL + 4            VBL + 5
0       VBL                VBL                VBL                VBL
1
2       NMI
3                          NMI
4                                             NMI
5                                                                NMI
6
7
8
9       LDA frame_count
10                         LDA frame_count
11                                            LDA frame_count
12      AND #$02                                                 LDA frame_count
13                         AND #$02
14      BEQ taken                             AND #$02
15                         BEQ taken                             AND #$02
16                                            BEQ taken
17      LDA #$07                                                 BEQ taken
18                         LDA #$07
19      STA $4014                             LDA #$07
20                         STA $4014                             LDA #$07
21                                            STA $4014
22      $4014 write                                              STA $4014
23      514-cycle DMA      $4014 write
24                         513-cycle DMA      $4014 write
25                                            514-cycle DMA      $4014 write
26                                                               513-cycle DMA
...
535
536     DMA finishes       DMA finishes
537     delay 6911         delay 6911
538                                           DMA finishes       DMA finishes
539                                           delay 6911         delay 6911
...
7448    LDA frame_count    LDA frame_count
7449
7450                                          LDA frame_count    LDA frame_count
7451    AND #$02           AND #$02
7452
7453    BNE not taken      BNE not taken      AND #$02           AND #$02
7454
7455    BIT $2002          BIT $2002          BNE not taken      BNE not taken
7456
7457                                          BIT $2002          BIT $2002
7458    $2002 read = $80   $2002 read = $80
7459    BPL not taken      BPL not taken      VBL cleared        VBL cleared
7460                                          $2002 read = 0     $2002 read = 0
7461    BIT 0              BIT 0              BPL taken          BPL taken
7462
7463
7464    INC frame_count    INC frame_count    INC frame_count    INC frame_count
7465
7466
7467
7468
7469    delay 2028         delay 2028         delay 2028         delay 2028
...
9497    STA $2006          STA $2006          STA $2006          STA $2006
9498
9499
9500    $2006 write at VBL+9500 (all four columns)

Frames 2, 3, 6, 7 ... (VBL begins during cycle 1)

Cycle   VBL + 2            VBL + 3            VBL + 4            VBL + 5
0
1       VBL                VBL                VBL                VBL
2
3       NMI
4                          NMI
5                                             NMI
6                                                                NMI
7
8
9
10      LDA frame_count
11                         LDA frame_count
12                                            LDA frame_count
13      AND #$02                                                 LDA frame_count
14                         AND #$02
15      BEQ not taken                         AND #$02
16                         BEQ not taken                         AND #$02
17      LDA #$07                              BEQ not taken
18                         LDA #$07                              BEQ not taken
19      STA $4014                             LDA #$07
20                         STA $4014                             LDA #$07
21                                            STA $4014
22      $4014 write                                              STA $4014
23      514-cycle DMA      $4014 write
24                         513-cycle DMA      $4014 write
25                                            514-cycle DMA      $4014 write
26                                                               513-cycle DMA
...
535
536     DMA finishes       DMA finishes
537     delay 6911         delay 6911
538                                           DMA finishes       DMA finishes
539                                           delay 6911         delay 6911
...
7448    LDA frame_count    LDA frame_count
7449
7450                                          LDA frame_count    LDA frame_count
7451    AND #$02           AND #$02
7452
7453    BNE taken          BNE taken          AND #$02           AND #$02
7454
7455                                          BNE taken          BNE taken
7456    BIT $2002          BIT $2002
7457
7458                                          BIT $2002          BIT $2002
7459    $2002 read = $80   $2002 read = $80
7460    BPL not taken      BPL not taken      VBL cleared        VBL cleared
7461                                          $2002 read = 0     $2002 read = 0
7462    BIT 0              BIT 0              BPL taken          BPL taken
7463
7464
7465    INC frame_count    INC frame_count    INC frame_count    INC frame_count
7466
7467
7468
7469
7470    delay 2028         delay 2028         delay 2028         delay 2028
...
9498    STA $2006          STA $2006          STA $2006          STA $2006
9499
9500
9501    $2006 write at VBL+9500 (all four columns)

The $2006 write is done at VBL+9500 in all cases. Remember that in the second table VBL begins on cycle 1 (an odd cycle), which is why its final writes appear to be one cycle later than those in the first table.
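
All eight cases can be traced in a few lines. This Python sketch follows the handler instruction by instruction; it assumes standard 6502 timings with no page-crossing branches, that frame_count's bit 1 is set exactly when VBL begins during an odd cycle, and that the flag is cleared at pixel 23869. The function name is made up for this example.

```python
from fractions import Fraction

FLAG_CLEAR = Fraction(23869 * 5, 16)   # pixel 23869 in cycles after VBL

def write_cycle_after_vbl(vbl_cycle, nmi_delay):
    """Cycles after VBL at which the final STA $2006 performs its write."""
    odd_frame = vbl_cycle % 2 == 1
    t = vbl_cycle + nmi_delay + 7            # NMI handler entry
    t += 3 + 2 + (2 if odd_frame else 3)     # lda frame_count / and #$02 / beq even
    t += 2                                   # lda #$07
    write = t + 3                            # sta $4014 writes on its 4th cycle
    t = write + 1 + (513 if write % 2 else 514)
    t += 6911                                # delay 6911
    t += 3 + 2 + (3 if odd_frame else 2)     # lda frame_count / and #$02 / bne odd
    read = t + 3                             # bit $2002 reads on its 4th cycle
    t += 4
    if read - vbl_cycle < FLAG_CLEAR:
        t += 2 + 3                           # bpl not taken, bit 0
    else:
        t += 3                               # bpl taken
    t += 5                                   # inc frame_count
    t += 2028                                # delay 2028
    return t + 3 - vbl_cycle                 # sta $2006 writes on its 4th cycle

for vbl_cycle in (0, 1):
    print([write_cycle_after_vbl(vbl_cycle, d) for d in (2, 3, 4, 5)])
```

Each line prints four copies of 9500, one for every combination of VBL parity and NMI delay.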

Synchronizing with even CPU cycle

Since our final synchronization method relies on knowing whether a given frame begins during an even or odd cycle, we must initially ensure that our PPU synchronization routine's final $2002 read is also during an even cycle. Since the fine synchronization loop takes an even number of cycles, we merely need to ensure that the first $2002 read in that loop is on an even cycle. We can do this by initiating sprite DMA before the fine synchronization loop.

sync_ppu:
        ; Coarse synchronize
        bit $2002
:       bit $2002
        bpl :-
        
        sta $4014
        delay 32713
        jmp first
        
        ; Fine synchronize
:       delay 33241
first:  bit $2002
        bpl :-
        
        ; NMI won't be fired until frame 2
        lda #2
        sta frame_count
        
        rts

The STA $4014 takes up to 518 cycles, so we subtract that from the initial delay. After the STA $4014, the delay begins on an odd cycle. Since it's also an odd number of cycles from there until the $2002 read, the read will occur on an even cycle, as desired.
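
The parity argument is easy to verify. This Python sketch assumes standard timings (JMP absolute takes 3 cycles and BIT absolute reads on its 4th cycle); the function name is made up, and any odd starting cycle can be passed in.

```python
JMP_ABS = 3
BIT_READ_OFFSET = 3      # BIT $2002 reads 3 cycles after it begins
FINE_LOOP_PERIOD = 33241 + 4 + 3

def first_fine_read_cycle(start_after_dma, delay=32713):
    """Cycle of the first fine-loop $2002 read, given the cycle the delay begins on."""
    return start_after_dma + delay + JMP_ABS + BIT_READ_OFFSET

# After sprite DMA the next instruction always begins on an odd cycle.
for start in (1, 529, 531):
    print(start, first_fine_read_cycle(start) % 2)   # parity 0 means even
print(FINE_LOOP_PERIOD % 2)                          # 0: later reads stay even
```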

Simpler synchronization routine

The PPU synchronization routine is pretty short, but it requires use of the delay macro, which takes a fair amount of code to implement. It's possible to eliminate that without any negative impact.

The fine synchronization loop needs to read $2002 every 33248 cycles, so that it can find the moment when the VBL flag is set just before the read. This seems to require a long delay between reads. Until the final iteration, it must not find the VBL flag set. If it were like the coarse loop and read the VBL flag every 7 cycles, it would clearly stop somewhere near the beginning of the first frame, but rarely right at the beginning. It might read $2002 one cycle before the VBL flag is set, loop, then read it 7 cycles later and find it now set. This isn't what we want. If we read it twice as often, every 33248/2 = 16624 cycles, it would still work, since the VBL flag is automatically cleared near the end of VBL.

sync_ppu:
        ; Coarse synchronize
        bit $2002
:       bit $2002
        bpl :-
        
        sta $4014
        delay 16089
        jmp first
        
        ; Fine synchronize
:       delay 16617
first:  bit $2002
        bpl :-

        rts

Cycle      PPU              CPU
0          Set VBL flag
7459       Clear VBL flag
16622                       Read $2002 = 0
33246                       Read $2002 = 0
33247.5    Set VBL flag
40706.5    Clear VBL flag
49870                       Read $2002 = 0
66494                       Read $2002 = 0
66495      Set VBL flag
73954      Clear VBL flag
83118                       Read $2002 = 0
99742                       Read $2002 = 0
99742.5    Set VBL flag
107201.5   Clear VBL flag
116366                      Read $2002 = 0
132990     Set VBL flag     Read $2002 = $80
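
The table can be regenerated with a small model. This Python sketch assumes the flag stays set for exactly 7459 cycles after each VBL and that the first read happens at cycle 16622, as in the table; `half_frame_sync` is a name made up for this example.

```python
from fractions import Fraction

FRAME = Fraction(66495, 2)   # 33247.5 cycles
VBL_LENGTH = 7459            # cycles the flag stays set if nobody reads it

def half_frame_sync(first_read=16622, period=16624):
    """Return (read_cycle, flag_seen) for each read until the flag is seen."""
    reads = []
    read = first_read
    while True:
        frames_elapsed = read // FRAME              # whole frames since cycle 0
        since_set = read - frames_elapsed * FRAME   # cycles since the flag was last set
        seen = since_set < VBL_LENGTH
        reads.append((read, seen))
        if seen:
            return reads
        read += period

for read, seen in half_frame_sync():
    print(read, "$80" if seen else "0")
```

Every second read lands roughly half a frame after VBL, well past the point where the flag has cleared itself, so only the read at cycle 132990 sees it set.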

That works, but reducing the delays doesn't eliminate the need for them. The important thing is that the $2002 read only be able to happen just after the VBL flag is set, rather than many cycles after it was set. Rather than rely on the PPU to clear the VBL flag, we can clear it ourselves. 16 is a factor of 33248, so we can have the loop take only 16 cycles and still synchronize properly.

sync_ppu:
        ; Coarse synchronize
        bit $2002
:       bit $2002
        bpl :-
        
        sta $4014
        bit <0
        
        ; Fine synchronize
:       bit <0
        nop
        bit $2002
        bit $2002
        bpl :-

        rts

Cycle     PPU            CPU
0         Set VBL flag
10                       Dummy read $2002 = $80
14                       Read $2002 = 0
26                       Dummy read $2002 = 0
30                       Read $2002 = 0
...
33242                    Dummy read $2002 = 0
33246                    Read $2002 = 0
33247.5   Set VBL flag
33258                    Dummy read $2002 = $80
33262                    Read $2002 = 0
...
66490                    Dummy read $2002 = 0
66494                    Read $2002 = 0
66495     Set VBL flag
66506                    Dummy read $2002 = $80
66510                    Read $2002 = 0
...
99738                    Dummy read $2002 = 0
99742                    Read $2002 = 0
99742.5   Set VBL flag
99754                    Dummy read $2002 = $80
99758                    Read $2002 = 0
...
132986                   Dummy read $2002 = 0
132990    Set VBL flag   Read $2002 = $80

Essentially, the second $2002 read in the loop watches a four-cycle window for the VBL flag being set. On entry to the loop, we ensure that the flag will never be set within this window. Every 33248/16 = 2078 iterations, the second $2002 read is half a cycle later in the frame, just like the original version. On every other iteration, the dummy $2002 read four cycles earlier has ensured that the VBL flag is cleared.
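
The four-cycle window can be simulated directly. This Python sketch assumes the read times from the table (dummy read at cycle 10, real read at cycle 14, 16 cycles per iteration) and treats the flag as seen when it was set after the dummy read and no later than the real read; `short_loop_sync` is a name made up for this example.

```python
FRAME_HALF_CYCLES = 66495   # 33247.5 cycles, counted in half cycles

def short_loop_sync(first_dummy=10, first_read=14, period=16):
    """Return (cycle of the read that finds the flag set, iterations before it)."""
    iteration = 0
    while True:
        dummy = 2 * (first_dummy + period * iteration)   # in half cycles
        read = 2 * (first_read + period * iteration)
        # Most recent moment at or before the real read when the flag was set.
        latest_set = (read // FRAME_HALF_CYCLES) * FRAME_HALF_CYCLES
        if latest_set > dummy:       # set inside the window the dummy read left open
            return read // 2, iteration
        iteration += 1

print(short_loop_sync())   # (132990, 8311)
```

The loop exits on the same cycle as the long-delay version, after 4 * 2078 - 1 = 8311 earlier iterations.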