amiga-bootcamp/17_demoscene/timing_optimization.md


Timing Optimization — Cycle Counting, Blitter-CPU Interleaving, and Self-Modifying Code

Overview

On a stock PAL Amiga 500, the 68000 runs at 7.09 MHz and must share memory bandwidth with the Copper, Blitter, bitplane DMA, sprite DMA, and audio DMA, all running simultaneously. A PAL frame lasts about 20 ms (312 scanlines of roughly 64 µs each), which is about 141,600 CPU cycles at 454 cycles per scanline. After DMA takes its share, the CPU may be left with only 40–60% of that time. Every instruction, every memory access, every bus arbitration event is a battle for scarce bandwidth.

Demoscene coding is the art of extracting maximum performance from this constrained environment. This article covers the timing optimization techniques that demoscene coders developed: cycle-accurate instruction scheduling, Blitter-CPU interleaving to recover stolen cycles, copper-wait placement to minimize stalls, memory access pattern optimization, and self-modifying code for runtime specialization.

graph TB
    subgraph "Analysis"
        CC["Cycle Counting<br/>Know every instruction cost"]
        PROF["Profiling<br/>Measure actual DMA budget"]
    end
    subgraph "Scheduling"
        BCI["Blitter-CPU Interleave<br/>Overlap compute with DMA"]
        CW["Copper Wait Placement<br/>Minimize bus contention"]
    end
    subgraph "Memory"
        SEQ["Sequential Access<br/>Exploit 68000 prefetch"]
        CHIP["Chip vs Fast<br/>Put code in Fast RAM"]
    end
    subgraph "Advanced"
        SMC["Self-Modifying Code<br/>Runtime specialization"]
        FLI["Line-F Trap<br/>Transparent FPU emulation"]
        ROL["Register Allocation<br/>Minimize memory traffic"]
    end

    CC --> BCI
    CC --> SEQ
    PROF --> CW
    BCI --> SMC
    SEQ --> ROL
    CHIP --> BCI

Foundation: The DMA Budget

Per-Frame Cycle Budget (PAL)

| Resource | Time per Frame (µs) | Percentage | Notes |
|---|---|---|---|
| Total frame | 19,968 | 100% | 312 scanlines × 64 µs; about 141,600 CPU cycles (454 per scanline) |
| Bitplane DMA | 2,496–7,488 | 12–37% | Depends on depth/resolution |
| Sprite DMA | 1,248 | 6% | Fixed: 8 sprites always active |
| Audio DMA | 312 | 1.5% | Fixed: 4 channels |
| Copper DMA | 624–1,248 | 3–6% | Depends on copper list length |
| Refresh DMA | 624 | 3% | DRAM refresh (fixed) |
| Available for CPU + Blitter | ~9,000–15,000 | 45–75% | Shared between CPU and Blitter |
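
As a cross-check on the frame length, the arithmetic can be sketched in C. The constants are the commonly quoted PAL figures (227 color clocks per line, two CPU cycles per color clock, 312 lines) and should be treated as assumptions; the function names are illustrative and not taken from any SDK.

```c
#include <stdint.h>

/* Commonly quoted PAL A500 figures; treat them as assumptions. */
#define PAL_LINES_PER_FRAME   312u
#define COLOR_CLOCKS_PER_LINE 227u
#define CPU_CYCLES_PER_CCK    2u

/* Total 68000 clock cycles in one (short) PAL frame. */
static uint32_t pal_frame_cycles(void)
{
    return PAL_LINES_PER_FRAME * COLOR_CLOCKS_PER_LINE * CPU_CYCLES_PER_CCK;
}

/* Cycles left for CPU + Blitter when DMA takes dma_percent of the frame. */
static uint32_t cycles_after_dma(uint32_t dma_percent)
{
    return pal_frame_cycles() * (100u - dma_percent) / 100u;
}
```

With the table's 45–75% availability range, `cycles_after_dma(55)` and `cycles_after_dma(25)` give the lower and upper bounds of the shared CPU + Blitter budget in CPU cycles.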

The Key Constraint: Bus Arbitration

The 68000 and DMA controllers share the same memory bus. When DMA is active, the CPU stalls — it cannot fetch instructions or data. The BLTPRI bit (Blitter Nasty mode) gives the Blitter total bus priority, starving the CPU almost completely:

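| Mode | Blitter Behavior | CPU Gets | Use When |
|---|---|---|---|
| Normal (BLTPRI=0) | Yields one cycle after the CPU has been denied three in a row | Roughly 25% of contested cycles, more when the blit leaves slots unused | Normal operation |
| Blitter Nasty (BLTPRI=1) | Takes every cycle it can use | ~0% (only cycles the Blitter leaves idle) | Critical Blitter operations |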
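
Toggling Blitter Nasty means writing DMACON with the SET/CLR convention: bit 15 selects whether the other set bits are turned on or off. A minimal sketch of the two register words follows; the bit values match the Hardware Reference Manual layout, while the helper names are illustrative.

```c
#include <stdint.h>

#define DMAF_SETCLR  0x8000u  /* bit 15: 1 = set the named bits, 0 = clear them */
#define DMAF_BLITHOG 0x0400u  /* bit 10: BLTPRI, "Blitter Nasty" */

/* Word to write to DMACON to switch the given bits on. */
static uint16_t dmacon_enable(uint16_t bits)
{
    return (uint16_t)(DMAF_SETCLR | bits);
}

/* Word to write to DMACON to switch the given bits off. */
static uint16_t dmacon_disable(uint16_t bits)
{
    return (uint16_t)(bits & 0x7FFFu);
}
```

On hardware the result would be written to DMACON ($DFF096), for example `dmacon_enable(DMAF_BLITHOG)` before a critical blit and `dmacon_disable(DMAF_BLITHOG)` afterwards.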

Technique 1: Cycle Counting

Every 68000 instruction has a known cycle cost. Demoscene coders count cycles the way a financial analyst counts dollars — every single one matters.

68000 Instruction Cycle Costs (Most Common)

| Instruction | Cycles | Notes |
|---|---|---|
| MOVE.W Dn,Dn | 4 | Register-to-register: fastest |
| MOVE.W (An),Dn | 8 | Memory read: 4 (operand read) + 4 (prefetch) |
| MOVE.W Dn,(An) | 8 | Memory write |
| MOVE.L (An)+,(An)+ | 20 | Post-increment: two long accesses (the MOVE.W form takes 12) |
| ADD.W Dn,Dn | 4 | Register add |
| ADD.W #imm,Dn | 8 | Immediate add (2 words to fetch) |
| MULS.W Dn,Dn | 38–70 | Signed 16×16→32 multiply; data-dependent |
| MULU.W Dn,Dn | 38–70 | Unsigned multiply; 38 + 2 per set bit in the source |
| DIVS.W Dn,Dn | up to 158 | Signed divide; DIVU.W takes up to 140 |
| MULS.L Dn,Dn | ~28-44 | 68020+: 32×32→64 multiply |
| DBRA Dn,label | 10 (taken) / 14 (exit) | Loop branch |
| BRA label | 10 | Unconditional branch |
| BCC label | 10 (taken) / 8 (not taken, short) / 12 (not taken, word) | Conditional branch |
| JSR (An) | 16 | Subroutine call |
| RTS | 16 | Return |
| LEA (An),An | 4 | Address computation (no memory access) |
| SWAP Dn | 4 | Swap 16-bit halves |
| LSL.W #n,Dn | 6+2n | Shift left: 6 + 2 per position |
| ROL.W #n,Dn | 6+2n | Rotate left |
| MOVE.W (An)+,Dn | 8 | Read with auto-increment |
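
Multiply timing on the 68000 is data-dependent, which matters when budgeting an inner loop. The sketch below follows the formulas in Motorola's timing tables (38 + 2n cycles, where n is the number of set bits in the source for MULU, and the number of 01/10 bit transitions in the source with a zero appended for MULS). It excludes effective-address time, and the function names are illustrative.

```c
#include <stdint.h>

/* MULU.W: 38 + 2n, n = number of 1 bits in the 16-bit source operand. */
static unsigned mulu_cycles(uint16_t src)
{
    unsigned n = 0;
    for (; src != 0; src >>= 1)
        n += src & 1u;
    return 38u + 2u * n;
}

/* MULS.W: 38 + 2n, n = number of 01/10 transitions in (src << 1). */
static unsigned muls_cycles(uint16_t src)
{
    uint32_t v = (uint32_t)src << 1;        /* append a 0 below bit 0 */
    uint32_t t = (v ^ (v >> 1)) & 0xFFFFu;  /* 1 where adjacent bits differ */
    unsigned n = 0;
    for (; t != 0; t >>= 1)
        n += t & 1u;
    return 38u + 2u * n;
}
```

A table of small multipliers therefore costs noticeably less than one full of alternating bit patterns, which is one reason demo coders pay attention to how fixed-point constants are scaled.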

DMA Stall Impact

When DMA is active, instruction cycles increase due to bus contention:

/*
 * Effective cycle cost = base_cycles + dma_stalls
 *
 * dma_stalls depends on:
 *   1. Number of DMA channels active on current scanline
 *   2. Whether the access is to Chip RAM or Fast RAM
 *   3. BLTPRI (Blitter Nasty) mode
 *
 * Rule of thumb on stock A500:
 *   - With 4 bitplanes LoRes, DMA steals ~40% of bus cycles
 *   - With 6 bitplanes HiRes, DMA steals ~60% of bus cycles
 *   - During Blitter operation (normal): CPU gets ~25% of cycles
 *   - During Blitter Nasty: CPU gets ~0-5% of cycles
 */

Technique 2: Blitter-CPU Interleaving

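The most important optimization on the Amiga. When the Blitter is running (copying, filling, drawing lines), CPU accesses to Chip RAM are delayed while the CPU waits for the bus. Interleaving means giving the CPU useful work during the blit that needs the bus as little as possible: register-only arithmetic, and especially long instructions such as MULS, whose internal cycles leave the bus free for the Blitter. Instruction fetches still go over the bus, so the benefit is greatest when the code consists of slow register-only instructions or runs from Fast RAM.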

The Interleaving Principle

gantt
    title Blitter-CPU Interleaving (Single Frame)
    dateFormat X
    axisFormat %s

    section CPU
    Compute 3D vertices    :crit, 0, 4
    (stalled by Blitter)   : 4, 7
    Compute more vertices  :crit, 7, 9
    (stalled by Blitter)   : 9, 12
    Prepare next blit      :crit, 12, 14

    section Blitter
    Fill polygon A         :active, 0, 7
    Fill polygon B         :active, 7, 12
    Fill polygon C         :active, 12, 16

Implementation Pattern

/*
 * interleave.c — Blitter-CPU interleaving pattern
 *
 * The key: start a Blitter operation, then do CPU computation
 * that only uses registers (no memory access) while Blitter runs.
 */

void render_frame(void) {
    /* Phase 1: Start Blitter fill for first polygon */
    start_blitter_fill(&polygons[0]);

    /* Phase 2: CPU computes next polygon's vertex positions
       while Blitter fills the first one.
       IMPORTANT: Only register-to-register operations here!
       Any memory access will stall until Blitter finishes. */
    {
        register FIXED rx asm("d0");
        register FIXED ry asm("d1");
        register FIXED rz asm("d2");

        /* Transform vertices for polygon 2 — register-only math */
        rx = fixed_mul(m00, vx1) + fixed_mul(m01, vy1) + fixed_mul(m02, vz1);
        ry = fixed_mul(m10, vx1) + fixed_mul(m11, vy1) + fixed_mul(m12, vz1);
        rz = fixed_mul(m20, vx1) + fixed_mul(m21, vy1) + fixed_mul(m22, vz1);

        /* Store results (will stall if Blitter still running) */
        screen_x1 = project_x(rx, rz);
        screen_y1 = project_y(ry, rz);
    }

    /* Phase 3: Wait for Blitter to finish, then start next blit */
    wait_blitter();
    start_blitter_fill(&polygons[1]);

    /* Phase 4: More CPU computation for polygon 3... */
    /* ... repeat ... */
}

Assembly Interleaving

In 68000 assembly, the interleaving is explicit:

; interleave.asm — Start Blitter, do CPU work, wait for Blitter

        ; ---- Start Blitter fill for polygon A ----
        ; Assumes the Blitter is idle and that masks and modulos are set up.
        ; poly_a_data is assumed to hold a pointer to the last word of the
        ; area, because fill mode requires descending mode.
        lea     $DFF000,a6
        move.w  #$09F0,BLTCON0(a6)     ; USEA+USED, minterm $F0 (D = A)
        move.w  #$000A,BLTCON1(a6)     ; Inclusive fill, descending mode
        move.l  poly_a_data,BLTAPTH(a6)
        move.l  poly_a_data,BLTDPTH(a6)
        move.w  #(HEIGHT<<6)|WIDTH_BLT,BLTSIZE(a6)  ; Start!

        ; ---- CPU work: compute polygon B vertices ----
        ; Register-only operations: no data accesses, only instruction
        ; fetches (the data was pre-loaded into registers)
        move.l  d0,d4           ; 4 cycles
        muls.w  d1,d4           ; 38-70 cycles, bus mostly idle
        add.l   d4,d2           ; 8 cycles
        swap    d2              ; 4 cycles
        move.w  d2,d5           ; 4 cycles
        muls.w  d3,d5           ; 38-70 cycles, bus mostly idle
        ; ... more register math ...

        ; ---- Now we need memory: check if Blitter is done ----
.blit_wait:
        btst    #6,DMACONR(a6)  ; Bit 6 of the high byte = BBUSY (bit 14)
        bne.s   .blit_wait      ; Loop if Blitter busy

        ; ---- Start Blitter fill for polygon B ----
        move.w  #$09F0,BLTCON0(a6)     ; BLTCON1 keeps its fill setting
        move.l  poly_b_data,BLTAPTH(a6)
        move.l  poly_b_data,BLTDPTH(a6)
        move.w  #(HEIGHT2<<6)|WIDTH_BLT,BLTSIZE(a6)  ; Start!

        ; ---- CPU work: compute polygon C vertices ----
        ; ... register-only math again ...
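
The two hardware details the assembly relies on, the BLTSIZE packing and the busy bit, can be sketched in C as pure functions. This is a minimal illustration under the register layout described in the Hardware Reference Manual; the helper names are hypothetical.

```c
#include <stdint.h>

/* Pack a blit size for BLTSIZE: height in bits 15-6, width (in words)
   in bits 5-0. A field value of 0 encodes the maximum (1024 lines, 64 words). */
static uint16_t bltsize(unsigned height_lines, unsigned width_words)
{
    return (uint16_t)(((height_lines & 0x3FFu) << 6) | (width_words & 0x3Fu));
}

/* BBUSY is bit 14 of DMACONR; a nonzero result means the Blitter is running. */
static int blitter_busy(uint16_t dmaconr)
{
    return (dmaconr & 0x4000u) != 0;
}
```

Writing the packed value to BLTSIZE is what starts the blit, so it must be the last Blitter register written.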

Technique 3: Copper-Wait Placement

The Copper competes with the CPU for bus cycles. Poorly-placed copper lists steal cycles from the CPU during critical computation windows. The optimization: move copper activity to scanlines where the CPU is idle (during vertical blank or display border areas).

Optimal Copper-CPU Scheduling

graph LR
    subgraph "Poor Schedule"
        P1["CPU computing during display<br/>Copper also active → contention"]
    end
    subgraph "Good Schedule"
        G1["CPU computes during VBlank<br/>Copper active during display"]
        G2["CPU idle during display<br/>Copper runs free"]
    end

    P1 -->|"Restructure"| G1

Practical Scheduling

/*
 * schedule.c — Optimal copper-CPU scheduling
 *
 * Principle: Move CPU-heavy computation to VBlank period
 * when the Copper is idle (already executed its list).
 * Let the Copper do its work during the display period
 * when the CPU has less to do.
 */

void main_loop(void) {
    while (1) {
        /* Wait for VBlank (vertical blanking interval)
           During VBlank: no display DMA, minimal Copper activity */
        WaitTOF();  /* Wait for Top of Frame */

        /* ---- VBlank period: CPU-heavy computation ---- */
        /* This runs during lines 0-19 (top border) and
           lines 250-311 (bottom border + VBlank)
           Minimal DMA contention here! */
        update_3d_vertices();
        update_physics();
        update_audio_buffers();
        build_copper_list();

        /* ---- Display period: let Copper run ---- */
        /* During lines 20-249, the Copper is writing color
           registers and the CPU should do minimal work.
           Only Blitter operations (which have their own DMA)
           or register-only computation should happen here. */
        render_blitter_objects();
    }
}
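
WaitTOF() only synchronizes to the start of a frame. To decide where in the frame a piece of work lands, demos usually read the beam counter directly. The sketch below combines the two register values, assuming the OCS layout (V8 in bit 0 of VPOSR, V7-V0 in the high byte of VHPOSR). The display-window lines are passed in as parameters because they depend on the screen setup; the function names are illustrative.

```c
#include <stdint.h>

/* Combine VPOSR ($DFF004) and VHPOSR ($DFF006) into a scanline number. */
static unsigned beam_line(uint16_t vposr, uint16_t vhposr)
{
    return ((vposr & 1u) << 8) | (unsigned)(vhposr >> 8);
}

/* True when the beam is outside the display window [first, last]. */
static int in_border(unsigned line, unsigned first, unsigned last)
{
    return line < first || line > last;
}
```

A main loop could poll `beam_line()` and start its bus-heavy work only once `in_border()` reports that bitplane DMA has stopped for the frame.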

Technique 4: Memory Access Optimization

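The 68000 has a two-word instruction prefetch queue but no data cache, so data reads are not faster merely because they are sequential. What matters is the addressing mode and the number of bus accesses: MOVE.W (An)+,Dn takes 8 cycles, while the indexed form MOVE.W d8(An,Xn),Dn takes 14 and usually needs extra instructions to compute the index. Taken branches also discard the prefetched words, which is one reason unrolled loops run faster.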

Memory Access Rules

| Access Pattern | Effective Speed | Notes |
|---|---|---|
| Sequential read (post-increment) | Fast | Cheapest memory addressing mode; the pointer update is free |
| Sequential write (post-increment) | Fast | Blitter-style linear access |
| Random read (indexed) | Slow | 6 extra cycles per access plus index computation |
| Register-only operations | Fastest | No data access at all |
| Chip RAM access | Variable | DMA contention adds wait states |
| Fast RAM access | Consistent | No DMA contention |

Optimization Techniques

; memory_opt.asm — Optimize memory access patterns

        ; ---- BAD: Random access pattern ----
        ; (The 68000 has no scaled index, so the index values
        ; must already be byte offsets)
        move.w  0(a0,d0.w),d1         ; Indexed: 14 cycles
        move.w  2(a0,d1.w),d2         ; Indexed: 14 cycles
        move.w  4(a0,d2.w),d3         ; Indexed: 14 cycles

        ; ---- GOOD: Sequential access with post-increment ----
        move.w  (a0)+,d1              ; Sequential: 8 cycles
        move.w  (a0)+,d2              ; Sequential: 8 cycles
        move.w  (a0)+,d3              ; Sequential: 8 cycles

        ; ---- GOOD: Move data in register-sized blocks ----
        ; MOVEM amortizes the instruction fetch over many registers
        movem.l (a0)+,d0-d7           ; Burst read: 8 registers
        ; ... process d0-d7 ...
        movem.l d0-d7,(a1)            ; Burst write (no post-increment form)
        lea     32(a1),a1             ; Advance destination by 8 longwords
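
To see when MOVEM pays off, compare the published 68000 timings for the two ways of loading n longwords. This is a small cost model, not a measurement; it assumes no wait states and the function names are illustrative.

```c
/* MOVEM.L (An)+,<list>: 12 + 8n cycles for n registers. */
static unsigned movem_l_load_cycles(unsigned nregs)
{
    return 12u + 8u * nregs;
}

/* n separate MOVE.L (An)+,Dn instructions: 12 cycles each. */
static unsigned move_l_load_cycles(unsigned nregs)
{
    return 12u * nregs;
}
```

The fixed 12-cycle overhead means MOVEM only wins from four registers upward; for eight registers the model gives 76 cycles against 96.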

Chip RAM vs Fast RAM Strategy

| Data Type | Best Location | Why |
|---|---|---|
| Copper lists | Chip RAM | Copper DMA can only read Chip RAM |
| Bitplane data | Chip RAM | Display DMA can only read Chip RAM |
| Sprite data | Chip RAM | Sprite DMA can only read Chip RAM |
| Audio sample data | Chip RAM | Audio DMA can only read Chip RAM |
| Code (instructions) | Fast RAM | No DMA contention, consistent speed |
| Vertex data | Fast RAM | No DMA stalls during computation |
| Sine tables | Fast RAM | No DMA stalls during lookup |
| Stack | Fast RAM | No DMA stalls during JSR/RTS/PEA |
| Lookup tables | Fast RAM | No DMA stalls during indexed access |

Tip

On a stock A500 with only 512 KB Chip RAM, there is no Fast RAM: all code runs in Chip RAM and contends with DMA. The A501 trapdoor expansion adds 512 KB of "Slow RAM" (trapdoor RAM, a.k.a. "Ranger" memory, mapped at $C00000). The custom chips cannot address it, but it sits on the same bus as Chip RAM, so CPU accesses suffer the same DMA contention. It adds capacity, not speed; only true Fast RAM on the CPU side of the bus escapes contention.


Technique 5: Self-Modifying Code

Self-modifying code (SMC) changes its own instructions at runtime. On the Amiga, this is used for:

  1. Loop unrolling with constants — Patch immediate values in unrolled loops
  2. Branch optimization — Replace computed branches with direct branches
  3. Copper list generation — Write copper instructions directly into the instruction stream
  4. Function specialization — Remove condition checks for known states

SMC for Copper List Patching

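The most common demoscene SMC pattern: a copper list is assembled alongside the code (in a section that loads into Chip RAM, since the Copper can only fetch from there), and the CPU patches the color values each frame. Strictly speaking, the CPU is modifying the Copper's program rather than its own:

; smc_copper.asm — Self-modifying copper list

        ; The copper list sits next to the code; its section must load
        ; into Chip RAM. Color values are patched by the CPU each frame.
copper_list:
        dc.w    $8033,$FFFE           ; WAIT on line $80 (bit 0 set = WAIT)
        dc.w    $0180,$DEAD           ; ← CPU patches $DEAD each frame
        dc.w    $8051,$FFFE           ; WAIT further along line $80
        dc.w    $0180,$BEEF           ; ← CPU patches $BEEF each frame
        dc.w    $FFFF,$FFFE           ; End of list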

update_copper:
        ; Calculate new colors for this frame
        move.w  #some_color,d0

        ; Patch copper list directly (self-modifying!)
        move.w  d0,copper_list+3*2    ; Overwrite $DEAD

        move.w  #other_color,d0
        move.w  d0,copper_list+7*2    ; Overwrite $BEEF
        rts
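
The bit layout of the two copper instructions used above can be sketched as C encoders. This is a minimal illustration of the word format only (the buffer would still need to live in Chip RAM on real hardware), and the helper names are hypothetical.

```c
#include <stdint.h>

/* MOVE: first word = register offset from $DFF000 (even, bit 0 clear),
   second word = value to write. */
static void cop_move(uint16_t out[2], uint16_t reg_offset, uint16_t value)
{
    out[0] = (uint16_t)(reg_offset & 0x01FEu);
    out[1] = value;
}

/* WAIT: first word = VP in bits 15-8, HP in bits 7-1, bit 0 set.
   Second word $FFFE = compare all position bits, bit 0 clear. */
static void cop_wait(uint16_t out[2], unsigned vp, unsigned hp)
{
    out[0] = (uint16_t)(((vp & 0xFFu) << 8) | (hp & 0xFEu) | 1u);
    out[1] = 0xFFFEu;
}
```

Because bit 0 of the first word distinguishes WAIT from MOVE, a wait position with an even low byte would be decoded as a register write instead.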

SMC for Loop Specialization

; smc_loop.asm — Self-modifying loop with patched constant

        ; The loop inner constant starts as a placeholder
inner_count:
        dc.w    320-1                  ; ← Patched at runtime (DBRA runs count+1 times)

render_line:
        move.w  inner_count(pc),d0     ; Load (possibly patched) count
.next_pixel:
        ; ... pixel processing ...
        dbra    d0,.next_pixel
        rts

; Somewhere during setup:
specialize_loop:
        ; If we know we only need 160 pixels, patch the count.
        ; PC-relative modes cannot be a destination, so go through a0.
        lea     inner_count(pc),a0
        move.w  #160-1,(a0)
        rts

Warning

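Self-modifying code requires the modified instructions to be in writable memory (RAM, not ROM). The 68000 has no instruction cache, so a patch takes effect on the next fetch; the only hazard is patching one of the two words already sitting in the prefetch queue. On the 68020 and later, the instruction cache must be invalidated after modification. Under AmigaOS 2.0+ the portable way is exec.library's CacheClearU(). In supervisor mode the raw equivalents are MOVEC CACR,D0; BSET #3,D0; MOVEC D0,CACR on the 68020/030 (bit 3 is Clear Instruction Cache) and CPUSHA BC on the 68040/060, which also pushes patched data out of a copyback data cache.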


Technique 6: Fast Division via Reciprocal Table

Division is the most expensive 68000 operation (up to 158 cycles for DIVS, 140 for DIVU). For 3D rendering, where a division by Z is needed for every vertex, demoscene coders pre-compute reciprocal tables:

/* reciprocal.c — Pre-computed 1/z table for fast division */

#define RECIP_TABLE_SIZE 1024
#define RECIP_SHIFT      16     /* 16.16 fixed-point */

static FIXED recip_table[RECIP_TABLE_SIZE];

void build_reciprocal_table(void) {
    int i;
    for (i = 1; i < RECIP_TABLE_SIZE; i++) {
        /* 1.0 / i in 16.16 fixed-point */
        recip_table[i] = ((FIXED)1 << RECIP_SHIFT) / i;
    }
    recip_table[0] = 0x7FFFFFFF;  /* "Infinity" */
}

/* Fast divide: x / z ≈ x × recip_table[z] */
static inline FIXED fast_div(FIXED x, int z) {
    if (z <= 0 || z >= RECIP_TABLE_SIZE) return 0;
    return fixed_mul(x, recip_table[z]);
}
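
The listing above leaves FIXED and fixed_mul undefined. A self-contained variant that compiles on a modern host might look like the following sketch; the 16.16 typedef, the 64-bit intermediate in fixed_mul, and the shortened names are assumptions made for illustration (a 68000 version would build the product from 16-bit MULS steps instead).

```c
#include <stdint.h>

typedef int32_t FIXED;                  /* 16.16 fixed-point (assumed) */
#define TABLE_SIZE 1024

static FIXED recip[TABLE_SIZE];

/* 16.16 multiply via a 64-bit intermediate (host-side convenience;
   relies on arithmetic right shift for negative products). */
static FIXED fixed_mul(FIXED a, FIXED b)
{
    return (FIXED)(((int64_t)a * b) >> 16);
}

static void build_recip(void)
{
    for (int i = 1; i < TABLE_SIZE; i++)
        recip[i] = (FIXED)(65536 / i);  /* 1.0 / i, truncated */
    recip[0] = 0x7FFFFFFF;              /* "infinity" placeholder */
}

/* x / z approximated as x * (1/z); out-of-range divisors return 0. */
static FIXED fast_div(FIXED x, int z)
{
    if (z <= 0 || z >= TABLE_SIZE) return 0;
    return fixed_mul(x, recip[z]);
}
```

Dividing 10.0 by 4 is exact because 1/4 is representable, while 3.0 divided by 3 comes out one unit below 1.0: the truncated reciprocal trades a small error for the speed of a multiply.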

Technique 7: Line-F Trap (FPU Transparency)

On 68040/060 systems, floating-point instructions that the FPU doesn't implement in hardware trigger a Line-F exception (trap vector $2C). The OS provides emulation routines, but demoscene coders can install custom traps that are faster than the OS defaults:

; linef_trap.asm — Custom Line-F trap handler for 68040/060

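        ; Install custom Line-F handler (supervisor mode; assumes VBR = 0)
        move.l  $2C.w,old_linef_handler   ; Save old handler
        lea     my_linef_handler,a0
        move.l  a0,$2C.w                   ; Install new handler
        rts

        ; The handler decodes the trapped FPU instruction
        ; and executes an optimized software equivalent
my_linef_handler:
        movem.l d0/a0,-(sp)        ; Preserve scratch registers (8 bytes)
        move.l  10(sp),a0          ; Stacked PC follows the 16-bit SR
        move.w  (a0),d0            ; Read the FPU opcode
        and.w   #$FE00,d0          ; Mask to Line-F family
        cmp.w   #$F200,d0          ; Is it an FPU instruction?
        beq.s   .handle_fpu
        ; ... restore registers and chain to old handler if not ...

.handle_fpu:
        ; Decode specific FPU instruction and emulate
        ; (e.g., FSIN → table lookup + interpolation)
        ; ... specific emulation code ...
        ; Assumes the stacked PC points at the trapped instruction and that
        ; it is 4 bytes long (opcode + extension word, no EA words).
        ; Exception frame formats vary by CPU model: check the format word.
        addq.l  #4,10(sp)          ; Skip past the FPU instruction
        movem.l (sp)+,d0/a0
        rte                        ; Return from exception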
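
The opcode test in the handler can be expressed as a one-line C predicate, which makes the bit fields easier to see. This is only a sketch of the decoding step; the function name is illustrative.

```c
#include <stdint.h>

/* F-line opcodes have 1111 in bits 15-12; bits 11-9 carry the coprocessor ID.
   ID 1 is the FPU, so FPU opcodes match $F200 under mask $FE00. */
static int is_fpu_opcode(uint16_t opword)
{
    return (opword & 0xFE00u) == 0xF200u;
}
```

Anything that fails the test, including F-line opcodes with another coprocessor ID, should be passed on to the previously installed handler.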

Antipatterns

1. The Blitter Busy Loop

Polling the Blitter's busy flag in a tight loop while the CPU could be doing useful work.

Broken:

/* CPU does nothing while waiting for Blitter */
start_blitter_fill(&poly);
while (blitter_busy()) {
    /* Tight loop — wastes every cycle */
}
start_blitter_fill(&next_poly);

Fixed:

start_blitter_fill(&poly);

/* Do useful register-only work while Blitter runs */
compute_next_vertices();  /* Register math only */
prepare_next_blit_params();

/* Now check if Blitter is done */
while (blitter_busy()) {}  /* Minimal wait */
start_blitter_fill(&next_poly);

2. The Chip RAM Code Trap

Running performance-critical code from Chip RAM on a system with Fast RAM available. Chip RAM access is slowed by DMA contention.

Broken:

/* Code runs in Chip RAM by default */
void hot_function(void) {
    /* Every instruction fetch contends with DMA */
    for (i = 0; i < 1000; i++) { ... }
}

Fixed:

/* Copy hot function to Fast RAM at startup.
   Assumes the routine is position-independent. */
extern UBYTE fast_ram_code[];
extern const UBYTE hot_function_src[];
extern const UBYTE hot_function_end[];

void init(void) {
    ULONG size = hot_function_end - hot_function_src;
    memcpy(fast_ram_code, hot_function_src, size);
    CacheClearU();  /* 68020+: discard stale instruction cache contents */
    /* Call fast_ram_code() instead of hot_function() */
}

3. The Naive Division

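Using DIVS.W (or DIVS.L on the 68020+) in inner loops. On the 68000, DIVS.W takes up to 158 cycles and DIVU.W up to 140, making division the single most expensive instruction.

Broken:

; Division in inner loop: up to 158 cycles each!
.loop:
        divs.w  d1,d0           ; d0 = d0 / d1 (SLOW!)
        ; ...
        dbra    d7,.loop

Fixed:

; Pre-compute reciprocal, use multiply instead
; (the 68000 has no scaled index: double the index by hand)
        add.w   d1,d1                      ; Word index -> byte offset
        lea     recip_table(pc),a0
        move.w  (a0,d1.w),d2               ; Load 1/divisor
.loop:
        muls.w  d2,d0           ; d0 = d0 × (1/divisor): 38-70 cycles
        ; ... rescale the product to undo the table's fixed-point shift ...
        dbra    d7,.loop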

4. The Indexed Array Trap

Using register-indirect with index addressing (d(An,Dn.W)) in tight loops. On the 68000 each indexed access costs 6 cycles more than (An)+, and the index usually has to be scaled by hand first.

Broken:

; Indexed access: 14 cycles per read (d0, d2 = byte offsets)
        move.w  (a0,d0.w),d1    ; Random access: slow addressing mode
        move.w  (a0,d2.w),d3    ; Random access: slow addressing mode

Fixed:

; Restructure data so related values are adjacent, then read sequentially
        lea     (a0,d0.w),a1    ; Compute base address once (d0 = byte offset)
        move.w  (a1)+,d1        ; Sequential: 8 cycles
        move.w  (a1)+,d3        ; Sequential: 8 cycles

5. The Cache-Coherency Miss

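On 68020+ systems with caches enabled, modifying code or copper lists without keeping the caches coherent. Patched code can leave stale instructions in the instruction cache, and on a 68040/060 with a copyback data cache a patched copper list may not have reached Chip RAM when the Copper fetches it (unless Chip RAM is mapped cache-inhibited).

Broken:

/* Modify copper list, but the write may still sit in a copyback cache */
copper_list[offset] = new_color;
/* The Copper reads memory, not the CPU cache, so it may see the old value */
custom.cop1lc = (ULONG)copper_list;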

Fixed:

copper_list[offset] = new_color;

/* Flush data cache for modified region (68030+) */
#if defined(__m68030) || defined(__m68040)
    CacheClearU();  /* Or flush specific address range */
#endif

custom.cop1lc = (ULONG)copper_list;

Decision Guide

flowchart TD
    START[Need to optimize] --> Q1{What is the bottleneck?}
    Q1 -->|CPU too slow| Q2{Memory or computation bound?}
    Q1 -->|Blitter waiting| BCI[Blitter-CPU Interleave]
    Q1 -->|Bus contention| Q3{During display or VBlank?}

    Q2 -->|Memory| CHIP[Move code/data to Fast RAM]
    Q2 -->|Computation| Q4{Division or multiply heavy?}

    Q4 -->|Division| RECIP[Use reciprocal table]
    Q4 -->|Multiply| Q5{Can precompute any values?}

    Q5 -->|Yes| PRECOMP[Pre-calculate tables]
    Q5 -->|No| SMC[Self-modifying code<br/>for specialization]

    Q3 -->|During display| CW[Move CPU work to VBlank]
    Q3 -->|During VBlank| DMA[Reduce DMA activity<br/>or disable unused channels]

    BCI --> Q6{Blitter operations<br/>overlap with CPU math?}
    Q6 -->|Yes| OK[Interleave: start Blit,<br/>then compute in registers]
    Q6 -->|No| SEQ[Reorder: batch Blits,<br/>then batch CPU work]

Historical Timeline

timeline
    title Timing Optimization Evolution
    1985 : Amiga launch — 7.09 MHz 68000, shared bus
         : Coders learn bus arbitration overhead
    1987 : First cycle-counted inner loops
         : Copper-CPU scheduling understood
    1988 : Blitter-CPU interleaving becomes standard
         : Self-modifying copper lists in demos
    1989 : Reciprocal tables replace division
         : Fast RAM awareness for A2000/A500+accelerator
    1990 : Cycle-accurate demo effects (Scoopex, Sanity)
         : MOVEM burst optimization for large copies
    1991 : 68020 acceleration — more registers, MULS.L
         : Cache coherency becomes a concern
    1992 : 68030/040 optimization — cache management
         : Line-F trap handlers for FPU emulation
    1993 : Instruction cache awareness in demo code
         : 68060 superscalar scheduling
    1994 : Demo coders master 68040/060 pipelines
         : Cache-line alignment for critical loops
    2000+ : Cycle-accurate emulators enable precise profiling
          : MiSTer FPGA provides real-hardware verification

Modern Analogies

| Amiga Optimization | Modern Equivalent | Why It Maps |
|---|---|---|
| Cycle counting | GPU occupancy profiling | Both count execution units per cycle |
| Blitter-CPU interleave | Async compute (GPU) | Both overlap independent operations |
| Chip vs Fast RAM | VRAM vs system RAM | Both have bandwidth vs capacity tradeoffs |
| Self-modifying code | JIT compilation | Both generate code at runtime |
| Reciprocal table | Fast inverse sqrt / RCP | Both approximate division with table/LUT |
| Copper-wait scheduling | Pipeline barrier placement | Both minimize stalls from synchronization |
| MOVEM burst read | SIMD load (NEON/SSE) | Both load multiple values in one operation |
| Cache flush | Cache maintenance instructions | Both ensure data consistency |
| Bus arbitration | Memory bandwidth allocation | Both divide bandwidth between agents |
| Register allocation | Register allocation (compiler) | Both minimize memory traffic |

Use Cases

| Use Case | Technique | Impact |
|---|---|---|
| 3D vertex transform | Blitter-CPU interleave | ~2× throughput |
| Inner loop rendering | Cycle counting + fast RAM | ~30% speedup |
| Division-heavy code | Reciprocal table | ~2–4× vs DIVS on the 68000 |
| Dynamic effects | Self-modifying copper list | Eliminates copy overhead |
| Blitter-heavy frame | Interleave + scheduling | ~50% more CPU time |
| Fast data copy | MOVEM burst | ~4× vs MOVE.W loop |
| 68040/060 code | Cache management | Prevents stale data bugs |
| Interrupt handlers | Register-only computation | Minimal latency |
| Audio mixing | Fast RAM + sequential access | Consistent 50 FPS |

FPGA / Emulation Impact

| Concern | Impact | Notes |
|---|---|---|
| Cycle-accurate timing | Demos that rely on exact cycle counts break if timing is wrong | WinUAE "cycle-exact" mode required for many demos |
| Bus arbitration | CPU/DMA cycle interleaving must match Agnus scheduler | Minimig implements 4-cycle DMA slots |
| 68000 prefetch | Instruction prefetch queue must be modeled | Affects branch timing and self-modifying code |
| Blitter busy detection | DMACONR bit 14 (BBUSY) timing must be exact | Some demos poll at precise cycle counts |
| Cache behavior | 68020+ instruction/data cache affects timing | Emulators must model cache size and replacement policy |
| Self-modifying code | Instruction cache flush must work correctly | 68040+ demos depend on CPUSHA instruction |

Note

The MiSTer Amiga core (based on Minimig) implements cycle-exact bus arbitration, which is why many timing-sensitive demos work on MiSTer but not on simpler FPGA implementations that approximate timing.


FAQ

Q: How do I measure actual DMA contention on real hardware? A: Use the E-Clock counter (ReadEClock()) or the CIA timers to measure execution time of specific code blocks. Compare timing with display DMA enabled vs disabled. The difference reveals the DMA overhead. Alternatively, use WinUAE's built-in profiler in cycle-exact mode.
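
A minimal sketch of the arithmetic behind such a measurement, assuming two E-clock tick counts have already been captured (for example with ReadEClock() before and after the code under test). On a 68000 Amiga the E clock runs at one tenth of the CPU clock; the helper names are illustrative.

```c
#include <stdint.h>

/* The E clock ticks once per 10 CPU clock cycles on a 68000 Amiga. */
static uint32_t eclock_to_cpu_cycles(uint32_t eclock_ticks)
{
    return eclock_ticks * 10u;
}

/* DMA overhead as a percentage of the uncontended run time. */
static uint32_t dma_overhead_percent(uint32_t ticks_dma_on, uint32_t ticks_dma_off)
{
    if (ticks_dma_off == 0 || ticks_dma_on < ticks_dma_off)
        return 0;
    return (ticks_dma_on - ticks_dma_off) * 100u / ticks_dma_off;
}
```

The ten-cycle granularity means short routines should be run many times in a loop so that the tick count is large enough to be meaningful.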

Q: Is self-modifying code still useful on modern processors? A: Not on x86/ARM — their deeply pipelined superscalar architectures with complex branch prediction make SMC counterproductive (cache invalidation stalls). On the 68000, which has no cache and a simple 2-stage prefetch, SMC is nearly free and often beneficial.

Q: What is "Blitter Nasty" mode and when should I use it? A: Setting BLTPRI (bit 10 in DMACON) gives the Blitter total bus priority, leaving almost no cycles for the CPU. Use it only when the Blitter operation is the critical path and you have no useful CPU work to do. In practice, most demos use normal Blitter mode with interleaving instead.

Q: How much faster is Fast RAM really? A: On an A1200 with Fast RAM expansion, code runs ~2-3× faster when placed in Fast RAM vs Chip RAM (during display), because there is no DMA contention. During VBlank (no display DMA), the difference is much smaller. The improvement is most dramatic during the display period when bitplane DMA is active.

Q: Can I use all these techniques together? A: Yes, and the best demos do. The optimal pattern is: schedule CPU computation during VBlank, interleave register-only computation during Blitter operations, use Fast RAM for code and data tables, and patch copper lists via self-modifying code. The techniques are complementary.

Q: What is the single most impactful optimization? A: Blitter-CPU interleaving. On a stock A500, the Blitter and CPU share the bus. If you wait for the Blitter to finish before doing any CPU work, you waste 50-75% of available cycles. Starting the Blitter and then doing register-only computation nearly doubles effective throughput.


References

External Resources

  • 68000 Instruction Timing: Motorola M68000 8-/16-/32-Bit Microprocessors User's Manual, Section 8 (Instruction Execution Times)
  • Amiga Hardware Reference Manual: DMA timing, bus arbitration
  • WinUAE: Cycle-exact Amiga emulator with profiler
  • Pouet.net (https://www.pouet.net): Demo source code with optimization notes
  • Demozoo (https://demozoo.org): Demoscene encyclopedia
  • Amiga Graphics Archive (https://amiga.lychesis.net): Per-game copper list and DMA budget analysis
  • Scoopex Amiga Hardware Programming (Photon), YouTube playlist: Cycle-exact programming techniques and DMA interleaving; companion site: coppershade.org