amiga-bootcamp/08_graphics/blitter_programming.md
Ilia Sharin f8f8d1c834 docs(amiga): add Tier 4 content — AHI, cross-compilation, RTG, demoscene section
2026-05-13 17:49:28 -04:00

Blitter Programming — Deep Dive

Overview

The Blitter (Block Image Transferrer) is a DMA coprocessor inside the Agnus chip that performs raster operations on rectangular memory blocks at bus speed — without CPU involvement. While the 68000 executes game logic, physics, or AI, the Blitter simultaneously clears screens, copies bitmap regions, composites masked sprites ("cookie-cut"), draws lines, and fills polygons. This parallelism is fundamental to why the Amiga could deliver arcade-quality 2D graphics on a 7 MHz processor with 512 KB of RAM.

The Blitter operates on up to 4 DMA channels (A, B, C → D) using a programmable 8-bit minterm truth table that encodes any Boolean function of three inputs. Combined with per-channel shift, modulo, and first/last word masking, this makes the Blitter a general-purpose 2D rasterization engine — not merely a memory copier.

Warning

The Blitter can only access Chip RAM. Pointing any channel register at Fast RAM causes silent data corruption or system crashes. Always allocate blitter-visible memory with AllocMem(size, MEMF_CHIP).

Channel A ──→ ┐
Channel B ──→ ├──→ Minterm Logic ──→ Channel D (output)
Channel C ──→ ┘

A = mask/pattern (e.g., cookie shape, font glyph)
B = source image data
C = background / destination read-back
D = output destination

Architecture

The Blitter sits inside Agnus (OCS/ECS) or Alice (AGA), sharing the DMA bus with the Copper, bitplane fetches, sprite DMA, disk, and audio. It accesses memory through 4 independent DMA channels, each with its own pointer and modulo register:

graph LR
    subgraph "Agnus / Alice"
        A["Channel A<br/>(mask/pattern)"] --> ML["Minterm Logic<br/>(8-bit truth table)"]
        B["Channel B<br/>(source data)"] --> ML
        C["Channel C<br/>(background read-back)"] --> ML
        ML --> D["Channel D<br/>(output)"]
    end

    CRAM["Chip RAM"] --> A
    CRAM --> B
    CRAM --> C
    D --> CRAM

    style ML fill:#fff9c4,stroke:#f9a825
    style CRAM fill:#e8f4fd,stroke:#2196f3

The Minterm Logic block is the Blitter's core innovation. It takes the current bit from channels A, B, and C (three Boolean inputs) and produces one output bit for channel D according to a programmable 8-bit truth table stored in BLTCON0 bits 7–0. Since 3 inputs have 8 possible combinations (2³), the 8-bit minterm encodes any Boolean function of three variables: 256 possible logic operations selected by a single register write. This is what lets one piece of hardware do copies (D=A, minterm $F0), clears (D=0, minterm $00), cookie-cut compositing (D=A·B+¬A·C, minterm $CA), XOR highlighting (D=A⊕C, minterm $5A), and any other combination just by changing the 8-bit minterm value. See Minterm Logic below for the full truth table and common values.

Each channel reads (or writes, for D) from a different memory pointer with independent modulo, allowing operations on sub-rectangles within larger bitmaps. Writing to BLTSIZE ($DFF058) starts the blit immediately — always configure all other registers first.

Channel Roles

| Channel | DMA Direction | Typical Role | Has Shift? | Has Mask? |
|---|---|---|---|---|
| A | Read | Mask, cookie shape, font glyph, line texture | Yes (ASH, 0–15 px) | Yes (BLTAFWM/BLTALWM) |
| B | Read | Source image data | Yes (BSH, 0–15 px) | No |
| C | Read | Background / destination read-back | No | No |
| D | Write | Output destination | No | No |

Note

Any channel can be disabled per operation via BLTCON0 bits 11–8 (USEA/B/C/D). Disabling unused channels saves DMA cycles: a D-only clear needs far less bus time than a full ABCD blit.

CPU / Blitter Bus Interaction

The Blitter and the 68000 CPU share the Chip RAM bus — they cannot access it simultaneously. Agnus arbitrates access on a cycle-by-cycle basis:

┌────────────────────────────────────────────────────────────┐
│                    Chip RAM Bus (16-bit)                   │
├──────────┬──────────┬──────────┬──────────┬────────────────┤
│ Bitplane │  Sprite  │  Copper  │ Blitter  │   CPU (left-   │
│   DMA    │   DMA    │   DMA    │   DMA    │   over slots)  │
├──────────┴──────────┴──────────┴──────────┴────────────────┤
│               Fixed priority (high → low)                  │
└────────────────────────────────────────────────────────────┘
  • Without BLTPRI: The Blitter still takes most free DMA slots, but once the CPU has been held off for three consecutive bus cycles, Agnus grants it the next one. Code running from Chip RAM slows down noticeably but keeps making progress.
  • With BLTPRI (nasty mode): The Blitter takes all free DMA slots. The CPU is completely frozen on any Chip RAM access until the blit completes. The CPU can still execute from Fast RAM or ROM — but any Chip RAM read/write stalls.
  • Display DMA always wins: Bitplane, sprite, and audio DMA have fixed priority above the Blitter. In high-resolution modes, display DMA alone consumes most of the bus, leaving few slots for blitter operations.

Chip RAM vs. Fast RAM

The Blitter is physically wired to the Chip RAM bus inside Agnus. It has no connection to the Fast RAM (Zorro) bus:

| Memory Type | Blitter Access? | CPU Access? | Notes |
|---|---|---|---|
| Chip RAM (first 512 KB–2 MB) | ✓ Yes | ✓ Yes (contended) | Screen buffers, audio, sprites, all DMA-visible data |
| Fast RAM (Zorro II/III) | ✗ No | ✓ Yes (uncontended) | Code, variables, non-DMA data |
| ROM ($F80000–$FFFFFF) | ✗ No | ✓ Yes | Kickstart, libraries |

This creates the key optimization opportunity on Amigas fitted with Fast RAM (A3000, A4000, or an A500/A1200 with a memory or accelerator expansion): the CPU can execute code and access Fast RAM at full speed while the Blitter simultaneously works on Chip RAM. On a stock A500 with only Chip RAM, the CPU and Blitter always contend for the same bus.

Important

There is no hardware error when pointing blitter registers at Fast RAM addresses. The pointer registers only implement as many address bits as the Agnus/Alice revision can reach (512 KB to 2 MB of Chip RAM), so the upper bits are ignored and the access lands somewhere inside Chip RAM, silently corrupting whatever lives there.

Minterm Logic

The minterm is an 8-bit value stored in BLTCON0 (bits 7–0) that tells the Blitter what to do with each pixel. Think of it as a tiny program: for every pixel position, the Blitter reads the current bit from channels A, B, and C, looks up the answer in the minterm, and writes that answer to channel D (destination memory).

Since there are 3 inputs (A, B, C), each either 0 or 1, there are exactly 8 possible input combinations. The 8-bit minterm has one bit for each combination — that bit decides whether the output pixel is on (1) or off (0):

| Minterm Bit | Input A (mask) | Input B (source) | Input C (background) | "If these inputs look like this…" |
|---|---|---|---|---|
| Bit 7 | 1 | 1 | 1 | …mask on, source on, background on |
| Bit 6 | 1 | 1 | 0 | …mask on, source on, background off |
| Bit 5 | 1 | 0 | 1 | …mask on, source off, background on |
| Bit 4 | 1 | 0 | 0 | …mask on, source off, background off |
| Bit 3 | 0 | 1 | 1 | …mask off, source on, background on |
| Bit 2 | 0 | 1 | 0 | …mask off, source on, background off |
| Bit 1 | 0 | 0 | 1 | …mask off, source off, background on |
| Bit 0 | 0 | 0 | 0 | …mask off, source off, background off |

Each bit is a simple yes/no: "should the output pixel be on for this combination?"

The most important minterm is $CA — the cookie-cut blit used for sprite compositing. In binary, $CA = 11001010. Let's read each bit:

| Bit | A (mask) | B (source) | C (background) | $CA bit value | Output pixel | Why |
|---|---|---|---|---|---|---|
| 7 | on | on | on | 1 | on | Inside the shape, source pixel is on → show it |
| 6 | on | on | off | 1 | on | Inside the shape, source pixel is on → show it |
| 5 | on | off | on | 0 | off | Inside the shape, source pixel is off → show it (it's dark) |
| 4 | on | off | off | 0 | off | Inside the shape, source pixel is off → show it |
| 3 | off | on | on | 1 | on | Outside the shape → keep background (it's on) |
| 2 | off | on | off | 0 | off | Outside the shape → keep background (it's off) |
| 1 | off | off | on | 1 | on | Outside the shape → keep background (it's on) |
| 0 | off | off | off | 0 | off | Outside the shape → keep background (it's off) |

The pattern: where the mask (A) is set → take the source pixel (B). Where the mask is clear → keep the background pixel (C). That's a sprite draw with transparency — exactly what every Amiga game uses.
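
The table lookup described above can be modelled on a host machine. The following is a minimal sketch, not hardware code: `blit_minterm` is a hypothetical helper name, and it processes one 16-bit word at a time the way the text describes (inputs A, B, C form a 3-bit index that selects one bit of the minterm).

```c
#include <stdint.h>

/* Apply an 8-bit minterm (LF byte) to three 16-bit input words.
   For every bit position the inputs form an index (A<<2 | B<<1 | C)
   that selects one bit of the minterm as the output bit. */
uint16_t blit_minterm(uint8_t lf, uint16_t a, uint16_t b, uint16_t c)
{
    uint16_t d = 0;
    for (int bit = 0; bit < 16; bit++) {
        unsigned idx = (((a >> bit) & 1u) << 2)
                     | (((b >> bit) & 1u) << 1)
                     |  ((c >> bit) & 1u);
        if ((lf >> idx) & 1u)
            d |= (uint16_t)(1u << bit);
    }
    return d;
}
```

With minterm $CA the result equals `(a & b) | (~a & c)`, which is the cookie-cut rule from the table: source where the mask is set, background elsewhere.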

Common Minterms

| Minterm | Hex | Operation | Description | Real-World Use Case |
|---|---|---|---|---|
| D = A | $F0 | Copy A | Output is a copy of channel A; every A-set pixel appears in D | Block copy: duplicate a screen region, copy a font glyph to the display |
| D = B | $CC | Copy B | Output is a copy of channel B regardless of A and C | Shifted copy: B has a barrel shift, so this copies with pixel-level repositioning |
| D = C | $AA | Copy C | Output is a copy of the destination read-back | No-op / readback: passes C through to D unchanged |
| D = A·B + ¬A·C | $CA | Cookie-cut | Where mask (A) is 1: show source (B). Where mask is 0: show background (C) | Sprite compositing: draw a player character with transparency onto the game world |
| D = 0 | $00 | Clear | Output is always 0 regardless of inputs | Screen clear: zero out a bitplane, erase a region |
| D = $FFFF | $FF | Set all | Output is always 1 | Fill with 1s: set all pixels in a region (useful for masks) |
| D = A XOR C | $5A | XOR | Output toggles wherever A has a set bit | Cursor blink: XOR the cursor shape to toggle it on/off without saving background |
| D = A OR C | $FA | OR | Output is set wherever either A or C has a set bit | Overlay: stamp a shape onto the background without erasing existing pixels |
| D = ¬A AND C | $0A | Mask out | Output keeps C pixels only where A is clear, erasing through the mask | Erase shape: cut a hole in the background matching the mask shape (first pass of two-pass sprite draw) |
| D = A AND B | $C0 | AND | Output is set only where both A and B are set | Masked pattern: apply a fill pattern (B) clipped to a shape (A) |
| D = A XOR B | $3C | XOR (A,B) | Output is set wherever A and B differ | Difference detection: find which pixels changed between two frames |
| D = NOT A | $0F | Invert | Output is the bitwise complement of A | Mask inversion: generate a negative mask from a positive one |

A = mask (1 = sprite pixel, 0 = transparent)
B = sprite image data
C = background
D = result

Minterm $CA:
  Where A=1: D = B  (show sprite)
  Where A=0: D = C  (show background)
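
Minterm bytes do not have to be memorised. A common trick is to treat each input as its truth-table column ($F0 for A, $CC for B, $AA for C) and evaluate the desired Boolean expression on those constants. The sketch below shows the idea; the macro and function names are illustrative, not from any Amiga header.

```c
#include <stdint.h>

/* Truth-table columns: bit i of each constant is that input's value
   for minterm index i, where i = (A<<2 | B<<1 | C). */
#define MT_A 0xF0u
#define MT_B 0xCCu
#define MT_C 0xAAu
#define MT(expr) ((uint8_t)((expr) & 0xFFu))

/* D = A·B + ¬A·C (cookie-cut) */
uint8_t mt_cookie_cut(void) { return MT((MT_A & MT_B) | (~MT_A & MT_C)); }

/* D = A XOR C */
uint8_t mt_xor_ac(void)     { return MT(MT_A ^ MT_C); }

/* D = ¬A AND C (erase through mask) */
uint8_t mt_mask_out(void)   { return MT(~MT_A & MT_C); }
```

These evaluate to $CA, $5A and $0A respectively, matching the table of common minterms.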

Register Reference

| Address | Name | R/W | Description |
|---|---|---|---|
| $DFF040 | BLTCON0 | W | Control: ASH (bits 15–12), channel enables (bits 11–8), minterm (bits 7–0) |
| $DFF042 | BLTCON1 | W | Control: BSH (bits 15–12), fill/line mode (bits 4–0) |
| $DFF044 | BLTAFWM | W | First word mask for channel A |
| $DFF046 | BLTALWM | W | Last word mask for channel A |
| $DFF048 | BLTCPTH/L | W | Channel C pointer (32-bit) |
| $DFF04C | BLTBPTH/L | W | Channel B pointer (32-bit) |
| $DFF050 | BLTAPTH/L | W | Channel A pointer (32-bit) |
| $DFF054 | BLTDPTH/L | W | Channel D pointer (32-bit) |
| $DFF058 | BLTSIZE | W | Blit dimensions + START (write triggers blit!) |
| $DFF05C | BLTSIZV | W | Blit height, ECS/AGA only (15-bit, up to 32768 lines) |
| $DFF05E | BLTSIZH | W | Blit width + START, ECS/AGA only (11-bit, up to 2048 words) |
| $DFF060 | BLTCMOD | W | Channel C modulo (bytes to skip per row) |
| $DFF062 | BLTBMOD | W | Channel B modulo |
| $DFF064 | BLTAMOD | W | Channel A modulo |
| $DFF066 | BLTDMOD | W | Channel D modulo |
| $DFF070 | BLTCDAT | W | Channel C data register (preload) |
| $DFF072 | BLTBDAT | W | Channel B data register (preload / line texture) |
| $DFF074 | BLTADAT | W | Channel A data register (preload / line-mode pixel) |
| $DFF002 | DMACONR | R | DMA status: bit 14 (BBUSY) = blitter busy |

BLTCON0 Encoding

Bits 15–12: ASH  : A channel barrel shift (0–15 pixels right)
Bit  11:    USEA : enable channel A DMA
Bit  10:    USEB : enable channel B DMA
Bit   9:    USEC : enable channel C DMA
Bit   8:    USED : enable channel D DMA (almost always 1)
Bits  7–0:  LF   : minterm (logic function truth table)
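
As an illustration of the bit layout above, the following sketch packs a BLTCON0 value from its three fields. The constant and function names are made up for this example; real code would normally use the definitions from the system's hardware headers.

```c
#include <stdint.h>

/* Channel-enable bits (BLTCON0 bits 11..8). Names are illustrative. */
#define BLT_USEA 0x0800u
#define BLT_USEB 0x0400u
#define BLT_USEC 0x0200u
#define BLT_USED 0x0100u

/* Pack shift (0..15), channel-enable mask and minterm into BLTCON0. */
uint16_t make_bltcon0(unsigned ash, uint16_t channels, uint8_t minterm)
{
    return (uint16_t)(((ash & 15u) << 12) | (channels & 0x0F00u) | minterm);
}
```

For example, `make_bltcon0(0, BLT_USEA | BLT_USED, 0xF0)` yields $09F0, the A→D copy value used in the block-copy examples.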

BLTCON1 Encoding

Bits 15–12: BSH  : B channel barrel shift (0–15 pixels right)
Bit   4:    EFE  : exclusive fill enable
Bit   3:    IFE  : inclusive fill enable
Bit   2:    FCI  : fill carry input (initial state)
Bit   1:    DESC : descending mode (blit bottom-right → top-left)
Bit   0:    LINE : line draw mode

BLTSIZE Encoding (OCS/ECS)

Bits 15–6: Height in lines (1–1024, 0 = 1024)
Bits  5–0: Width in words  (1–64,   0 = 64)
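
The "0 means maximum" convention is easy to get wrong, so here is a small sketch of a packing helper with range checks. `make_bltsize` is a hypothetical name; the encoding follows the two bit fields listed above.

```c
#include <stdint.h>

/* Encode a BLTSIZE value. Returns 1 and stores the result in *out on
   success, or 0 if the dimensions do not fit the 10-bit/6-bit fields.
   A height of 1024 or a width of 64 wraps to 0, which the hardware
   interprets as the maximum. */
int make_bltsize(unsigned height_lines, unsigned width_words, uint16_t *out)
{
    if (height_lines < 1 || height_lines > 1024) return 0;
    if (width_words  < 1 || width_words  > 64)   return 0;
    *out = (uint16_t)(((height_lines & 0x3FFu) << 6) | (width_words & 0x3Fu));
    return 1;
}
```

A 256-line by 20-word blit encodes as $4014, the same value as `(256<<6)|20` in the assembly examples.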

Warning

Writing BLTSIZE starts the blit! Always configure all other registers (pointers, modulos, control, masks) before writing BLTSIZE. On ECS/AGA, write BLTSIZV first, then BLTSIZH (which triggers the blit).

Ascending vs. Descending Mode

When source and destination overlap in memory, the blit direction determines whether data is corrupted:

Ascending (default, DESC=0):
  Reads/writes top-left → bottom-right
  Use when: dest address < source address (or the blocks do not overlap)

Descending (DESC=1):
  Reads/writes bottom-right → top-left
  Use when: dest address > source address
  Pointers must be set to the LAST word of the block
  Modulos are subtracted instead of added
  Barrel shifts move data left instead of right

This is critical for scrolling — shifting the screen contents by a few pixels requires an overlapping copy, and using the wrong direction produces garbage.
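
The direction rule is the same one `memmove` applies: if the destination starts inside the source block at a higher address, copy from the end. The sketch below models this on a plain word array; both function names are illustrative, and the copy loop only mimics the order in which the Blitter walks memory.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if an overlapping copy must run in descending mode,
   i.e. the destination starts inside the source block at a higher index. */
int blit_needs_descending(size_t src_index, size_t dst_index, size_t count)
{
    return dst_index > src_index && dst_index < src_index + count;
}

/* Copy 'count' words within one buffer in the chosen direction. */
void copy_words(uint16_t *mem, size_t src, size_t dst, size_t count, int descending)
{
    if (descending) {
        for (size_t i = count; i-- > 0; )
            mem[dst + i] = mem[src + i];
    } else {
        for (size_t i = 0; i < count; i++)
            mem[dst + i] = mem[src + i];
    }
}
```

Running the ascending loop on a block whose destination is one word higher than its source smears the first word across the whole block, which is the "garbage" a wrong-direction scroll produces.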

Shift and Alignment

The Blitter is a word-aligned (16-bit) processor. Moving objects to arbitrary pixel positions requires the barrel shifter:

  • ASH (channel A shift) and BSH (channel B shift) shift data 0–15 pixels to the right
  • A rectangle N pixels wide at a non-aligned X position spans ⌈(N + shift) / 16⌉ words, one more than when aligned
  • BLTAFWM (first word mask) and BLTALWM (last word mask) prevent the shifted data from corrupting pixels outside the target area
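
The arithmetic in the list above can be written out as a small helper. This is a sketch under the assumption of a 1-bitplane row addressed in bytes; `BlitSpan` and `blit_span` are names invented for the example.

```c
#include <stdint.h>

typedef struct {
    unsigned shift;        /* ASH/BSH value, 0..15 */
    unsigned width_words;  /* blit width including any spill-over word */
    unsigned byte_offset;  /* offset of the first destination word in the row */
} BlitSpan;

/* Work out shift, width in words and row offset for an object
   'width_px' pixels wide drawn at horizontal pixel position 'x'. */
BlitSpan blit_span(unsigned x, unsigned width_px)
{
    BlitSpan s;
    s.shift       = x & 15u;
    s.width_words = (width_px + s.shift + 15u) / 16u;  /* ceil((N + shift) / 16) */
    s.byte_offset = (x >> 4) * 2u;
    return s;
}
```

A 16-pixel object at x = 37 needs a shift of 5 and a 2-word blit starting 4 bytes into the row; the same object at x = 32 fits in a single word.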

Complete Examples

Example 1: Clear Screen (320×256, 1 bitplane)

    lea     $DFF000,a5

    ; Wait for blitter idle:
.bwait:
    btst    #14,$002(a5)    ; DMACONR bit 14 = BBUSY
    bne.s   .bwait

    ; D channel only, minterm $00 (clear):
    move.l  #$01000000,$040(a5)  ; BLTCON0: USED=1, minterm=$00
    clr.w   $042(a5)             ; BLTCON1: 0
    move.l  #ScreenMem,$054(a5)  ; BLTDPT
    clr.w   $066(a5)             ; BLTDMOD: 0 (contiguous)
    move.w  #(256<<6)|20,$058(a5) ; BLTSIZE: 256 lines × 20 words (320/16)
    ; Blit is now running!

Example 2: Block Copy (No Shift)

    ; Copy 64×64 pixel block from source to dest (1 bitplane)
    ; Source and dest are in contiguous bitmap, 320 pixels wide

    ; Width = 64 pixels = 4 words
    ; Modulo = (320 - 64) / 16 = 16 words = 32 bytes

    lea     $DFF000,a5

.bwait:
    btst    #14,$002(a5)
    bne.s   .bwait

    move.l  #$09F00000,$040(a5)  ; BLTCON0: USEA+USED, minterm=$F0 (A→D)
    clr.w   $042(a5)             ; BLTCON1
    move.w  #$FFFF,$044(a5)      ; BLTAFWM = all bits
    move.w  #$FFFF,$046(a5)      ; BLTALWM = all bits
    move.l  #SourceAddr,$050(a5) ; BLTAPT
    move.l  #DestAddr,$054(a5)   ; BLTDPT
    move.w  #32,$064(a5)         ; BLTAMOD = 32 bytes
    move.w  #32,$066(a5)         ; BLTDMOD = 32 bytes
    move.w  #(64<<6)|4,$058(a5)  ; BLTSIZE: 64 lines × 4 words → GO!

Example 3: Cookie-Cut Blit (Masked Sprite)

    ; Blit a 16×16 masked sprite onto background
    ; A = mask, B = sprite data, C = background, D = destination

    lea     $DFF000,a5

.bwait:
    btst    #14,$002(a5)
    bne.s   .bwait

    move.l  #$0FCA0000,$040(a5)  ; BLTCON0: A+B+C+D, minterm=$CA
    clr.w   $042(a5)             ; BLTCON1
    move.w  #$FFFF,$044(a5)      ; BLTAFWM
    move.w  #$FFFF,$046(a5)      ; BLTALWM
    move.l  #MaskData,$050(a5)   ; BLTAPT = mask
    move.l  #SpriteData,$04C(a5) ; BLTBPT = sprite imagery
    move.l  #ScreenPos,$048(a5)  ; BLTCPT = background (read-back)
    move.l  #ScreenPos,$054(a5)  ; BLTDPT = same as C (overwrite)
    clr.w   $064(a5)             ; BLTAMOD = 0 (mask is 16px = 1 word wide)
    clr.w   $062(a5)             ; BLTBMOD = 0
    move.w  #38,$060(a5)         ; BLTCMOD = (320-16)/8 = 38 bytes
    move.w  #38,$066(a5)         ; BLTDMOD = 38
    move.w  #(16<<6)|1,$058(a5)  ; BLTSIZE: 16 lines × 1 word → GO!

Example 4: Line Drawing

    ; Draw a line from (x1,y1) to (x2,y2) using blitter line mode
    ; This is complex — blitter line mode uses a Bresenham-style algorithm
    ; implemented in hardware

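    ; BLTCON1 bit 0 = LINE mode
    ; Channel A = single pixel ($8000 in BLTADAT), positioned by ASH
    ; Channel B = line texture (BLTBDAT, $FFFF = solid)
    ; Channel C/D = destination bitmap

    ; See HRM for the full algorithm; here's the concept.
    ; Not shown: BLTAPTL, BLTAMOD and BLTBMOD carry the Bresenham error
    ; terms derived from dx and dy, and BLTCON1 needs the octant bits.
    move.l  #$0B4A0000,$040(a5)  ; BLTCON0: A+C+D, minterm=$4A (XOR); OR in ASH = x1 mod 16
    move.w  #$0001,$042(a5)      ; BLTCON1: LINE=1, octant bits set per slope
    move.w  #$8000,$074(a5)      ; BLTADAT: single pixel pattern
    move.w  #$FFFF,$072(a5)      ; BLTBDAT: solid line texture
    move.w  #$FFFF,$044(a5)      ; BLTAFWM
    move.w  #$FFFF,$046(a5)      ; BLTALWM
    move.l  #StartPos,$048(a5)   ; BLTCPT: line start position in bitmap
    move.l  #StartPos,$054(a5)   ; BLTDPT: same
    move.w  #Modulo,$060(a5)     ; BLTCMOD: bitmap width in bytes
    move.w  #Modulo,$066(a5)     ; BLTDMOD: bitmap width in bytes
    move.w  #(len<<6)|2,$058(a5) ; BLTSIZE: length × 2 → GO!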

Advanced Use Cases & Cookbook

Use Case 1: Shifted BOB (Sprite at Arbitrary X Position)

The most common real-world blitter task: draw a 16×16 sprite at pixel position (x, y) on a 320-pixel-wide screen. Since x may not be word-aligned, the barrel shifter handles sub-word positioning:

    ; Draw 16×16 BOB at pixel (x, y) on a 320px wide screen
    ; Inputs: d0.w = x position, d1.w = y position
    ;         a0 = mask data, a1 = sprite data, a2 = screen base

    lea     $DFF000,a5

    ; Calculate screen byte offset:
    move.w  d1,d2
    mulu    #40,d2              ; y × 40 bytes/row (320 pixels / 8)
    move.w  d0,d3
    lsr.w   #3,d3               ; x / 8 = byte offset in row
    and.w   #$FFFE,d3           ; word-align (drop bit 0)
    add.w   d3,d2               ; total byte offset into screen
    lea     (a2,d2.w),a3        ; a3 = screen pointer for this BOB

    ; Calculate shift amount:
    move.w  d0,d3
    and.w   #$000F,d3           ; shift = x mod 16 (0–15 pixels)
    ror.w   #4,d3               ; move to bits 15–12
    move.w  d3,d4               ; d4 = BSH for BLTCON1 (B must shift with A)
    or.w    #$0FCA,d3           ; d3 = ASH + channels A+B+C+D + minterm $CA

.bwait:
    btst    #14,$002(a5)
    bne.s   .bwait

    move.w  d3,$040(a5)         ; BLTCON0: shift + channels + minterm
    move.w  d4,$042(a5)         ; BLTCON1: BSH = same shift, ascending, no fill
    move.w  #$FFFF,$044(a5)     ; BLTAFWM: all bits in first word
    move.w  #$0000,$046(a5)     ; BLTALWM: mask off last word (shift overflow)
    move.l  a0,$050(a5)         ; BLTAPT = mask
    move.l  a1,$04C(a5)         ; BLTBPT = sprite imagery
    move.l  a3,$048(a5)         ; BLTCPT = background read-back
    move.l  a3,$054(a5)         ; BLTDPT = write back to same position
    move.w  #-2,$064(a5)        ; BLTAMOD = -2 (mask rows are 1 word, blit fetches 2)
    move.w  #-2,$062(a5)        ; BLTBMOD = -2 (sprite rows are 1 word, blit fetches 2)
    move.w  #36,$060(a5)        ; BLTCMOD = 40 - (2 words × 2) = 36 bytes
    move.w  #36,$066(a5)        ; BLTDMOD = 36
    move.w  #(16<<6)|2,$058(a5) ; BLTSIZE: 16 lines × 2 words (1 extra for shift) → GO!

Key insight: the blit is 2 words wide even though the sprite is only 16 pixels (1 word). The barrel shift pushes bits into the second word, so we need that extra word — and BLTALWM=$0000 masks it so we don't corrupt adjacent pixels.
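
To see why the extra word is needed, the sketch below models what a right shift does to a single 16-pixel source row: the bits that fall off the right edge of the first word reappear at the left edge of the next. `shift_row` is a name invented for this host-side illustration.

```c
#include <stdint.h>

/* Shift one 16-pixel row right by 'shift' pixels (0..15) and spread
   the result over two destination words, as a 2-word-wide blit would. */
void shift_row(uint16_t src, unsigned shift, uint16_t out[2])
{
    uint32_t wide = (uint32_t)src << (16u - (shift & 15u));
    out[0] = (uint16_t)(wide >> 16);       /* pixels still in the first word */
    out[1] = (uint16_t)(wide & 0xFFFFu);   /* pixels spilled into the second */
}
```

A fully set row ($FFFF) shifted by 4 becomes $0FFF followed by $F000: twelve pixels stay in the first word and four land in the second.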

Use Case 2: Hardware Scroll (Left by N Pixels)

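Scrolling the screen left copies every word to a lower address than it came from (destination < source). An ascending blit is safe here because each source word is read before the write pointer reaches it; the opposite case, scrolling right, needs descending mode: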

    ; Scroll 320×256 screen left by 16 pixels (1 word = fastest case)
    ; Source: screen + 2 bytes (one word right)
    ; Dest:   screen base
    ; No shift needed for 16-pixel increments

    lea     $DFF000,a5

.bwait:
    btst    #14,$002(a5)
    bne.s   .bwait

    move.l  #$09F00000,$040(a5) ; BLTCON0: A+D, minterm $F0 (copy)
    clr.w   $042(a5)            ; BLTCON1: ascending (safe because dest < source)
    move.w  #$FFFF,$044(a5)     ; BLTAFWM
    move.w  #$FFFF,$046(a5)     ; BLTALWM
    move.l  #Screen+2,$050(a5)  ; BLTAPT: source is 1 word to the right
    move.l  #Screen,$054(a5)    ; BLTDPT: destination is screen start
    clr.w   $064(a5)            ; BLTAMOD = 0 (full-width rows)
    clr.w   $066(a5)            ; BLTDMOD = 0
    move.w  #(256<<6)|20,$058(a5) ; BLTSIZE: 256 lines × 20 words → GO!
    ; After blit: draw new column at right edge (column 19)

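For sub-word scrolling (1–15 pixels), combine this with the barrel shifter and draw the new edge column from tile data. The shifter moves data right in ascending mode and left in descending mode, so a sub-word left scroll is normally set up as a descending blit.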

Use Case 3: Area Fill (Filled Polygon)

The blitter's fill mode is a two-step process: (1) draw the polygon outline with XOR lines, (2) fill the region. This is how games like Carrier Command and Starglider 2 achieved real-time filled 3D:

    ; Step 1: Draw polygon edges using blitter line mode (XOR, single-bit)
    ; (Repeat for each edge of the polygon)
    ; Use minterm $4A (A XOR C) and BLTCON1 bit 0 = LINE, bit 1 = SING

    ; Step 2: Fill the outlined region
    ; Fill works RIGHT-TO-LEFT, BOTTOM-TO-TOP — requires descending mode
    ; Pointers must point to the LAST word of the bitmap region

    lea     $DFF000,a5

.bwait:
    btst    #14,$002(a5)
    bne.s   .bwait

    ; Set up inclusive fill (IFE):
    move.l  #$09F00000,$040(a5)  ; BLTCON0: A+D, minterm $F0 (copy with fill)
    move.w  #$000A,$042(a5)      ; BLTCON1: DESC=1 (bit 1), IFE=1 (bit 3)
                                  ; IFE = inclusive fill enable
    move.w  #$FFFF,$044(a5)      ; BLTAFWM
    move.w  #$FFFF,$046(a5)      ; BLTALWM

    ; Pointers to LAST word of the fill region (descending!):
    move.l  #FillBufferEnd,$050(a5) ; BLTAPT: last word of source
    move.l  #FillBufferEnd,$054(a5) ; BLTDPT: last word of dest (same buffer)
    clr.w   $064(a5)               ; BLTAMOD = 0
    clr.w   $066(a5)               ; BLTDMOD = 0
    move.w  #(Height<<6)|Width,$058(a5)  ; BLTSIZE → GO!

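How it works: the fill logic scans each row from right to left with a carry flag that starts at the FCI value and toggles at every set pixel. Between two outline pixels on the same scanline the carry stays on, which fills the interior. This is why the outline must be drawn in single-bit mode (SING=1): if an edge leaves more than one pixel on a scanline, the extra pixels toggle the carry again and break the fill.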
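
The carry behaviour can be modelled for a single 16-bit word. This sketch follows the commonly described rule (inclusive fill ORs the carry into the output, exclusive fill XORs it) and is an approximation for illustration, not a cycle-exact model; `fill_word` is a hypothetical name.

```c
#include <stdint.h>

/* Fill one word right-to-left (bit 0 = rightmost pixel).
   'exclusive' selects EFE behaviour, otherwise IFE.
   '*carry' holds the fill carry on entry (FCI for the first word of a
   row) and is updated so it can be passed on to the next word left. */
uint16_t fill_word(uint16_t in, int exclusive, int *carry)
{
    uint16_t out = in;
    for (int bit = 0; bit < 16; bit++) {
        uint16_t m = (uint16_t)(1u << bit);
        if (*carry) {
            if (exclusive) out ^= m;   /* drops the closing edge pixel */
            else           out |= m;   /* keeps both edge pixels */
        }
        if (in & m)
            *carry = !*carry;          /* every set pixel toggles the carry */
    }
    return out;
}
```

With edge pixels at bits 2 and 5 (input $0024), inclusive fill produces $003C and exclusive fill produces $001C. A row with only one edge pixel leaves the carry set, so the fill runs on into the next word to the left.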

Use Case 4: Interleaved Bitplane BOBs

Standard bitplane layout stores all of plane 0, then all of plane 1, etc. Interleaved layout stores one row of plane 0, then one row of plane 1, alternating. This allows a single blit to draw a BOB across all bitplanes at once:

    ; Interleaved screen layout:
    ;   Row 0, Plane 0 (40 bytes)
    ;   Row 0, Plane 1 (40 bytes)
    ;   Row 0, Plane 2 (40 bytes)
    ;   Row 0, Plane 3 (40 bytes)
    ;   Row 0, Plane 4 (40 bytes)
    ;   Row 1, Plane 0 (40 bytes)
    ;   ...

    ; Blit a 16×16 cookie-cut BOB across all 5 bitplanes in ONE operation:
    ; Height = 16 lines × 5 planes = 80 rows
    ; Modulo = 40 - 2 = 38 bytes per interleaved row (skip rest of scanline row)
    ; BOB data is also stored interleaved

    lea     $DFF000,a5

.bwait:
    btst    #14,$002(a5)
    bne.s   .bwait

    move.l  #$0FCA0000,$040(a5) ; BLTCON0: A+B+C+D, minterm $CA
    clr.w   $042(a5)            ; BLTCON1
    move.w  #$FFFF,$044(a5)     ; BLTAFWM
    move.w  #$FFFF,$046(a5)     ; BLTALWM
    move.l  #BOBMask,$050(a5)   ; BLTAPT (interleaved mask: same mask for all planes)
    move.l  #BOBData,$04C(a5)   ; BLTBPT (interleaved sprite data)
    move.l  a3,$048(a5)         ; BLTCPT (screen position)
    move.l  a3,$054(a5)         ; BLTDPT (same)
    clr.w   $064(a5)            ; BLTAMOD = 0 (mask repeats)
    clr.w   $062(a5)            ; BLTBMOD = 0
    move.w  #38,$060(a5)        ; BLTCMOD = 38 (skip to next interleaved row)
    move.w  #38,$066(a5)        ; BLTDMOD = 38
    move.w  #(80<<6)|1,$058(a5) ; BLTSIZE: 80 rows (16×5) × 1 word → GO!

Why this matters: without interleaving, drawing one BOB on a 5-plane screen requires 5 separate blits (one per plane), each with its own WaitBlit + register setup overhead. Interleaving does it in 1 blit — 5× less setup time, critical when drawing 15+ BOBs per frame.
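
The difference between the two layouts comes down to how a (x, y, plane) coordinate maps to a byte offset. The following sketch shows both mappings; the function names and parameters are illustrative.

```c
#include <stddef.h>

/* Planar: each plane is a separate block of 'height' rows. */
size_t planar_offset(unsigned x, unsigned y, unsigned plane,
                     size_t bytes_per_row, unsigned height)
{
    return (size_t)plane * bytes_per_row * height
         + (size_t)y * bytes_per_row
         + (x >> 4) * 2u;
}

/* Interleaved: the rows of all 'depth' planes for one scanline are
   stored back to back, so consecutive planes are one row apart. */
size_t interleaved_offset(unsigned x, unsigned y, unsigned plane,
                          size_t bytes_per_row, unsigned depth)
{
    return ((size_t)y * depth + plane) * bytes_per_row + (x >> 4) * 2u;
}
```

In the interleaved case the next plane of the same scanline is exactly one row (40 bytes here) further on, which is why a single blit with height = lines × planes and the ordinary row modulo covers every plane.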

Use Case 5: Double-Buffered Game Loop

The standard pattern for flicker-free game rendering:

MainLoop:
    ; --- Wait for vertical blank ---
    bsr     WaitVBL             ; Wait for beam to reach line 0

    ; --- Swap display buffer ---
    ; Copper list points to the currently visible buffer
    ; We draw into the hidden back buffer
    move.l  BackBuffer,a0
    move.l  FrontBuffer,a1
    move.l  a0,FrontBuffer      ; Back buffer becomes front (display)
    move.l  a1,BackBuffer       ; Old front becomes new back (draw target)

    ; Update Copper list bitplane pointers to show new front buffer:
    bsr     UpdateCopperBPLPTRs

    ; --- Clear back buffer ---
    bsr     WaitBlit
    move.l  #$01000000,$040(a5) ; D-only, minterm $00
    clr.w   $042(a5)
    move.l  a1,$054(a5)         ; BLTDPT = back buffer
    clr.w   $066(a5)
    move.w  #(256<<6)|20,$058(a5) ; Clear 320×256 → GO!

    ; --- Draw all BOBs ---
    ; CPU can process game logic while the clear blit runs!
    bsr     UpdateGameLogic     ; Physics, AI, input — runs on CPU
    bsr     WaitBlit            ; Wait for clear to finish
    bsr     DrawAllBOBs         ; Chain of cookie-cut blits

    bra     MainLoop

Key optimization: UpdateGameLogic runs on the CPU while the screen clear runs on the Blitter. This is the core of the Amiga's parallelism: the whole duration of the clear, a millisecond or more per 320×256 bitplane, is available for CPU work every frame.
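
The buffer swap at the top of the loop is just a pointer exchange. A minimal host-side sketch, with `DoubleBuffer` and `swap_buffers` as made-up names:

```c
#include <stdint.h>

typedef struct {
    uint8_t *front;  /* buffer the display is showing */
    uint8_t *back;   /* buffer being drawn into */
} DoubleBuffer;

/* Exchange front and back; returns the new draw target (the old front). */
uint8_t *swap_buffers(DoubleBuffer *db)
{
    uint8_t *old_front = db->front;
    db->front = db->back;
    db->back  = old_front;
    return db->back;
}
```

On real hardware the new front pointer still has to be written into the Copper list's bitplane pointers, and the swap should happen during vertical blank so the display never shows a half-drawn frame.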

Use Case 6: GUI Window Drag (System-Friendly)

Workbench and applications use graphics.library for window dragging, icon rendering, and menu drawing. The OS handles Blitter synchronization:

#include <graphics/gfx.h>
#include <graphics/rastport.h>
#include <proto/graphics.h>   /* prototypes for ScrollRaster, BltBitMap, RectFill */

/* Scroll a window's contents up by 8 pixels (text scroll): */
ScrollRaster(rp,       /* RastPort */
             0, 8,     /* dx=0, dy=8 (scroll up by 8 pixels) */
             0, 0,     /* top-left corner of scroll area */
             319, 199); /* bottom-right */
/* The OS automatically uses an ascending/descending blit, sets modulos, */
/* and clears the exposed bottom strip. */

/* Copy a rectangular region between two bitmaps: */
BltBitMap(srcBM, 0, 0,       /* source bitmap, x, y */
          dstBM, 100, 50,    /* dest bitmap, x, y */
          64, 32,            /* width, height */
          0xC0,              /* minterm 0xC0: plain copy of source to dest */
          0xFF,              /* all bitplanes */
          NULL);             /* no temp buffer needed */

/* Draw a filled rectangle (uses the Blitter internally): */
SetAPen(rp, 3);              /* Set pen color to index 3 */
RectFill(rp, 10, 10, 100, 50); /* Filled rectangle */

Use Case 7: Tile Map Renderer

Games like The Settlers, Cannon Fodder, and most platformers render backgrounds from tile maps. Each tile is a 16×16 (or 32×32) block blitted to screen coordinates:

    ; Render a 20×16 tile map (320×256 screen, 16×16 tiles)
    ; TileMap: array of 320 bytes (20×16), each byte = tile index
    ; TileGfx: tile graphics, 16×16 pixels × 5 planes, interleaved

    lea     $DFF000,a5          ; custom chip base for the register writes below
    lea     TileMap,a0
    lea     Screen,a2
    moveq   #16-1,d7            ; 16 tile rows

.tilerow:
    moveq   #20-1,d6            ; 20 tiles per row

.tilecol:
    moveq   #0,d0
    move.b  (a0)+,d0            ; Get tile index
    mulu    #16*5*2,d0          ; Tile data offset (16 rows × 5 planes × 2 bytes)
    lea     TileGfx,a1
    add.l   d0,a1               ; a1 = tile graphics pointer

    bsr     WaitBlit
    move.l  #$09F00000,$040(a5) ; BLTCON0: A+D, minterm $F0 (straight copy)
    clr.w   $042(a5)            ; BLTCON1
    move.w  #$FFFF,$044(a5)     ; BLTAFWM
    move.w  #$FFFF,$046(a5)     ; BLTALWM
    move.l  a1,$050(a5)         ; BLTAPT = tile data (interleaved)
    move.l  a2,$054(a5)         ; BLTDPT = screen position
    clr.w   $064(a5)            ; BLTAMOD = 0 (tile data is contiguous)
    move.w  #38,$066(a5)        ; BLTDMOD = 40 - 2 = 38 (interleaved screen)
    move.w  #(80<<6)|1,$058(a5) ; BLTSIZE: 80 rows (16×5) × 1 word → GO!

    addq.l  #2,a2               ; Next tile position (1 word right)
    dbf     d6,.tilecol

    ; Move to next tile row: advance screen pointer by 16 scanlines × 5 planes × 40 bytes
    add.l   #16*5*40-40,a2      ; Subtract the 40 bytes already advanced by 20 tiles
    dbf     d7,.tilerow
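
The pointer arithmetic in the loop above can be checked with a couple of offset helpers. This sketch assumes the same layout as the listing (16×16 tiles, 5 interleaved planes, 40 bytes per row); the names are illustrative.

```c
#include <stddef.h>

enum { TILE_W = 16, TILE_H = 16, DEPTH = 5, ROW_BYTES = 40 };

/* Byte offset of a tile's graphics inside the interleaved tile set. */
size_t tile_src_offset(unsigned tile_index)
{
    return (size_t)tile_index * TILE_H * DEPTH * (TILE_W / 8);
}

/* Byte offset of tile cell (tx, ty) inside the interleaved screen. */
size_t tile_dst_offset(unsigned tx, unsigned ty)
{
    return (size_t)ty * TILE_H * DEPTH * ROW_BYTES + (size_t)tx * (TILE_W / 8);
}
```

One tile row spans 16 × 5 × 40 = 3200 bytes of screen memory, which matches the `16*5*40-40` adjustment in the assembly once the 40 bytes already stepped across the row are accounted for.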

Good and Bad Patterns

✓ Pattern: "Blit and Compute" — Overlap CPU and Blitter Work

    ; Start a blit, then do CPU work while it runs:
    bsr     SetupAndStartBlit   ; Triggers BLTSIZE write
    bsr     UpdatePlayerPhysics ; CPU work — runs in parallel!
    bsr     ProcessInput        ; More CPU work
    bsr     WaitBlit            ; NOW wait for blit to finish
    bsr     SetupNextBlit       ; Safe to touch registers

This is the entire point of having a Blitter. Any code that busy-waits immediately after starting a blit wastes the Amiga's key advantage.

✗ Antipattern: "The Busy-Wait Hog"

    ; ✗ BAD: Wait immediately after every blit — wastes CPU cycles
    bsr     StartBlit
.wait1: btst #14,$002(a5)
    bne.s   .wait1              ; CPU does NOTHING while blitter runs
    bsr     StartNextBlit
.wait2: btst #14,$002(a5)
    bne.s   .wait2              ; More wasted time

✓ Pattern: "Batch Then Wait" — Chain Setup, Single Sync Point

    ; Process all game logic FIRST:
    bsr     RunAI
    bsr     RunPhysics
    bsr     AnimateFrames
    ; THEN start the rendering blits in sequence:
    bsr     WaitBlit
    bsr     BlitBOB1
    bsr     WaitBlit
    bsr     BlitBOB2
    bsr     WaitBlit
    bsr     BlitBOB3
    ; The CPU-intensive work happened during the previous frame's display time

✗ Antipattern: "The Single-Plane-At-A-Time"

    ; ✗ BAD: Blit each bitplane separately (5× setup overhead)
    lea     Plane0,a0
    bsr     BlitBOBOnePlane
    lea     Plane1,a0
    bsr     BlitBOBOnePlane
    lea     Plane2,a0
    bsr     BlitBOBOnePlane
    lea     Plane3,a0
    bsr     BlitBOBOnePlane
    lea     Plane4,a0
    bsr     BlitBOBOnePlane     ; 5 blits, 5 WaitBlit calls, 5× register setup
    ; ✓ GOOD: Use interleaved bitplanes — ONE blit for all planes
    bsr     BlitBOBInterleaved  ; 1 blit, 1 WaitBlit, 1× register setup

✗ Antipattern: "System-Unfriendly Direct Access"

/* ✗ BAD: Hit blitter registers directly from a Workbench app */
custom.bltcon0 = 0x09F0;   /* bltcon0 is a 16-bit register */
/* The OS may be using the blitter RIGHT NOW for window operations */
/* ✓ GOOD: Use OwnBlitter/DisownBlitter for exclusive access */
OwnBlitter();           /* Wait for and lock the blitter */
WaitBlit();              /* Ensure previous blit is done */
/* ... safe to program registers directly ... */
DisownBlitter();         /* Release for OS use */

✗ Antipattern: "Hardcoded 320-Pixel Modulo"

    ; ✗ BAD: Assumes screen width is always 320 pixels (modulo = 40 - blit_width*2)
    move.w  #36,$066(a5)        ; BLTDMOD = 36 (hardcoded for 320px)

Many Amiga programs run on PAL overscan (352 or 384 pixels) or productivity modes (640+). Always calculate modulo from the actual screen byte width:

    ; ✓ GOOD: Calculate modulo from actual bitmap width
    move.w  ScreenBytesPerRow,d0
    sub.w   BlitWidthBytes,d0
    move.w  d0,$066(a5)         ; BLTDMOD = dynamic
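
The same calculation as a C helper, for comparison. `blit_modulo` is a hypothetical name; the result is signed because a source that is narrower than the blit (such as a 1-word BOB fetched as 2 words) needs a negative modulo.

```c
#include <stdint.h>

/* Modulo = bytes per bitmap row minus bytes covered by the blit. */
int16_t blit_modulo(unsigned bytes_per_row, unsigned blit_width_words)
{
    return (int16_t)((int)bytes_per_row - (int)(blit_width_words * 2u));
}
```

For a 2-word blit this gives 36 on a 320-pixel screen (40 bytes per row) and 44 on a 384-pixel overscan screen (48 bytes per row).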

✗ Antipattern: "Ignoring the DMA Budget"

The Blitter shares the DMA bus with display, audio, and disk. In high-bandwidth display modes, there are fewer free DMA slots:

| Display Mode | DMA Slots Used by Display | Remaining for Blitter | Effect |
|---|---|---|---|
| Lores 320×256 × 5 planes | ~100 per line | ~126 per line | Full blitter speed |
| Hires 640×256 × 4 planes | ~160 per line | ~66 per line | Blitter runs at ~50% speed |
| Super Hires 1280 × 4 planes | ~200+ per line | ~26 per line | Blitter barely runs |
| HAM8 (AGA) | ~200 per line | ~26 per line | Blitter barely runs |

Rule of thumb: if your game stutters in hires modes, it's probably DMA contention, not CPU speed.


Practical Limitations

| Limitation | Detail | Workaround |
| --- | --- | --- |
| Max blit size (OCS, BLTSIZE) | 1024 lines × 64 words (1024×1024 pixels) | Split into multiple blits |
| Max blit size (ECS/AGA, BLTSIZV/BLTSIZH) | 32768 lines × 2048 words | Rarely a practical issue |
| Word alignment | Blitter operates on 16-bit word boundaries only | Use barrel shift + masks for sub-word positioning; costs 1 extra word of width |
| No scaling | Cannot scale or rotate; purely rectangular block ops | Use CPU for affine transforms, then blit the result |
| No clipping | Blitter will happily write outside the screen bitmap | Implement clipping in software before setting up the blit |
| Single operation at a time | Only one blit can run at a time; there is no queue | Pipeline setup: compute the next blit's parameters on the CPU while the current blit runs |
| Chip RAM only | All 4 channels must point to Chip RAM | Use MEMF_CHIP for all blitter-visible allocations; see memory_types.md |
| Fill carry direction | Fill mode only works right-to-left (descending) | Always use DESC=1 with fill; set pointers to the end of the data |
| No transparency levels | Boolean operations only: 1-bit masking, no alpha | Dithering or multiple passes for graduated transparency |
| Line mode limitations | Lines drawn with SING=1 for fill prep are single-dot-per-row, leaving visible gaps on shallow (mostly horizontal) lines | Use non-SING mode for visible lines, SING only for fill boundaries |
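
The BLTSIZE limit comes straight from the register layout: 10 bits of height and 6 bits of width, with the maximum values encoded as zero. A minimal sketch of the packing and of the split count for taller blits follows; `bltsize` and `blits_needed` are hypothetical helper names.

```c
#include <stdint.h>

/* Pack height (lines) and width (words) into a BLTSIZE value.
 * The fields are 10 and 6 bits wide; the maximum values 1024 and 64
 * wrap to 0, which the hardware interprets as "maximum". */
static uint16_t bltsize(unsigned height_lines, unsigned width_words)
{
    return (uint16_t)(((height_lines & 0x3FFu) << 6) | (width_words & 0x3Fu));
}

/* Number of blits needed to cover `lines` rows when each blit is
 * limited to 1024 rows. */
static unsigned blits_needed(unsigned lines)
{
    return (lines + 1023u) / 1024u;
}
```

`bltsize(256, 20)` yields `0x4014`, the same value as the `(256<<6)|20` expression used in the assembly examples.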

Performance Analysis

DMA Cycle Costs

The Blitter consumes DMA cycles proportional to the number of active channels. Each active channel adds 1 DMA cycle per word per row:

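| Channels Active | Cycles/Word | Example Operation | Time for 320×256 (1 plane) |
| --- | --- | --- | --- |
| D only | 1 cycle | Screen clear | ~1.4 ms |
| A + D | 2 cycles | Simple copy (A→D) | ~2.9 ms |
| A + B + D | 3 cycles | Masked copy | ~4.3 ms |
| A + B + C + D | 4 cycles | Cookie-cut blit | ~5.7 ms |

At the 3.58 MHz DMA clock, 1 cycle ≈ 280 ns, and one 320×256 plane is 20 words × 256 rows = 5,120 words, so each time above is cycles/word × 5,120 × 280 ns. A full 320×256×5-plane screen clear therefore takes ~7.2 ms under this model (D-only × 5 planes). Treat these figures as lower bounds: display DMA steals slots, and the HRM's pipeline diagrams show idle slots for some channel combinations.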
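
The cost model described here (active channels × words × roughly 280 ns) is easy to evaluate for any blit size. The sketch below does only that arithmetic. It assumes no display contention and no idle pipeline slots, so real blits will be slower; `SLOT_NS` and `blit_time_ms` are illustrative names.

```c
#include <stdint.h>

#define SLOT_NS 280.0   /* approximate DMA slot length from the text */

/* Estimated blit time in milliseconds under the one-slot-per-channel
 * model, with no display contention. */
static double blit_time_ms(unsigned width_words, unsigned height_lines,
                           unsigned active_channels)
{
    double slots = (double)width_words * height_lines * active_channels;
    return slots * SLOT_NS / 1.0e6;
}
```

For one 320×256 plane (20 words × 256 lines) this gives about 1.4 ms with one channel and about 5.7 ms with four.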

CPU vs. Blitter Crossover

The Blitter is not always faster than the 68000:

| Operation Size | Winner | Why |
| --- | --- | --- |
| < ~40 words | CPU (68000) | Blitter setup overhead (~20 cycles) exceeds the DMA savings |
| 40–200 words | Tie | Depends on whether CPU needs the bus |
| > 200 words | Blitter | DMA runs independently; CPU can compute in parallel |
| Any size (A1200) | Measure | 68020 can access 32-bit Fast RAM while Blitter uses Chip RAM bus; often faster to do both |

Nasty Mode (BLTPRI)

Setting bit 10 of DMACON (BLTPRI) gives the Blitter priority over the CPU for every free DMA slot: the CPU is locked out of Chip RAM and the custom registers until the blit completes. This maximizes blitter throughput but:

  • Disables all interrupt servicing during the blit
  • Breaks timing-sensitive code (audio, serial)
  • Most professional software avoids it; demos use it freely
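
DMACON is a set/clear register: bit 15 of the written word selects whether the other 1 bits are set or cleared. The values needed to switch nasty mode on and off can be sketched as below. The macro and function names are illustrative, not NDK identifiers; on real hardware the result would be written to DMACON at $096.

```c
#include <stdint.h>

#define DMACON_SETCLR 0x8000u  /* bit 15: 1 = set the named bits, 0 = clear them */
#define DMACON_BLTPRI 0x0400u  /* bit 10: blitter priority ("nasty" mode) */

/* Word to write to DMACON to enable nasty mode. */
static uint16_t dmacon_nasty_on(void)  { return DMACON_SETCLR | DMACON_BLTPRI; }

/* Word to write to DMACON to disable it again (SET/CLR bit left at 0). */
static uint16_t dmacon_nasty_off(void) { return DMACON_BLTPRI; }
```

Pairing the two writes tightly around the blits that need them keeps the window in which the CPU is starved as short as possible.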

When to Use / When NOT to Use

When to Use the Blitter

  • Screen clearing — D-only blit at 1 cycle/word is unbeatable
  • BOB/sprite compositing — cookie-cut blit is the standard technique for all Amiga game objects
  • Scrolling — overlapping copy with correct ascending/descending mode
  • Polygon filling — exclusive/inclusive fill after boundary line drawing
  • Large memory copies — any block > ~40 words benefits from DMA parallelism
  • Line drawing — hardware Bresenham is faster than any software implementation on 68000

When NOT to Use

  • Small copies (< 40 words) — 68000 MOVEM or MOVE.L loop is faster due to blitter setup overhead
  • Fast RAM operations — the Blitter cannot access Fast RAM at all; use CPU
  • Pixel-level operations — the Blitter works on word-aligned rectangles; per-pixel logic requires CPU
  • A1200/A4000 with Fast RAM — the 68020/030 running from 32-bit Fast RAM can often outperform the Blitter on Chip RAM, especially if you can overlap CPU work with display DMA

Applicability Ranges

  • BOBs: Practical limit ~15–20 per frame at 320×256×5 planes before exhausting DMA bandwidth
  • Fill mode: Works on single bitplanes only — filling a 5-plane display requires 5 passes
  • Line mode: Maximum line length limited by the blit height field (1024 via BLTSIZE on OCS, 32768 via BLTSIZV on ECS/AGA)

Historical Context — The 1985 Competitive Landscape

The Blitter was architecturally unprecedented in 1985. No competing home computer shipped with a comparable 2D rasterization coprocessor:

| Feature | Amiga (1985) | Atari ST (1985) | PC EGA (1984) | Mac 128K (1984) | C64 (1982) |
| --- | --- | --- | --- | --- | --- |
| Hardware blitter | Yes: 4-channel DMA with minterm logic | No (added in Mega ST/STE, 1987; 1 source only) | No | No | No |
| Channels | 3 source + 1 dest | 1 source + 1 dest (STE) | n/a | n/a | n/a |
| Boolean ops | 256 minterms (arbitrary 3-input logic) | 16 logic ops (STE) | n/a | n/a | n/a |
| Line drawing | Hardware Bresenham | No | No | No | No |
| Area fill | Hardware inclusive/exclusive fill | No | No | No | No |
| Shift/mask | Per-channel barrel shift + first/last word masks | Shift + endmask (STE) | n/a | n/a | n/a |
| CPU relief | Full DMA: CPU free during blit | Partial: CPU still involved (STE) | CPU does everything | CPU does everything | CPU does everything |

Pros (in 1985 context)

  • Parallelism: The 68000 could execute game logic while the Blitter handled all rendering — this was the Amiga's key advantage over every competitor
  • Generality: 256 minterm combinations meant any Boolean compositing operation was a single register write, not a software loop
  • Integration: Shared DMA bus with Copper and sprites meant the entire display pipeline was hardware-driven
  • Line + fill in hardware: Enabled real-time filled polygon rendering (used in games like Carrier Command, Starglider 2) that was impossible on competing platforms

Cons (in 1985 context)

  • Chip RAM only: All blitter-visible data had to live in the first 512 KB (later 1–2 MB), competing with screen memory, audio, and disk buffers
  • Word alignment: Sub-pixel positioning required shift + extra word width + masking — complex setup for simple operations
  • No scaling/rotation: Purely rectangular block operations; affine transforms required CPU
  • DMA contention: Heavy blitter use starved the CPU of bus cycles even without nasty mode

Modern Analogies

| Amiga Blitter Concept | Modern Equivalent | Notes |
| --- | --- | --- |
| 4-channel minterm blit | GPU blend equations (Vulkan VkBlendOp) | The minterm is a fixed-function Boolean blend; modern GPUs use programmable shaders but the concept of combining sources through a logic function is identical |
| Cookie-cut (A·B + ¬A·C) | Alpha compositing / Porter-Duff SrcOver | The Amiga used 1-bit masks; modern systems use 8-bit alpha channels, but the compositing algebra is the same |
| DMA-driven blit | vkCmdCopyImage / MTLBlitCommandEncoder | Modern GPUs have dedicated DMA/copy engines that run asynchronously, exactly like the Blitter ran independently of the 68000 |
| OwnBlitter/DisownBlitter | Vulkan queue submission / Metal command buffer | Exclusive access to a shared hardware resource, then release; the synchronization pattern is identical |
| BLTPRI (nasty mode) | GPU preemption priority | Giving the transfer engine absolute bus priority at the cost of starving other consumers |
| Fill mode | GPU rasterizer fill | Hardware polygon fill is now done by the rasterizer stage; the Blitter's XOR-toggle fill was a clever 1985 approximation |
| BLTSIZE triggers blit | Command buffer submission | Writing the final register starts execution, analogous to vkQueueSubmit or [commandBuffer commit] |
| Barrel shift + word masks | Texture sampling with sub-texel offset | Achieving sub-pixel positioning through hardware shift and masking |
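
The cookie-cut expression in the table maps to a concrete minterm byte: bit (A<<2 | B<<1 | C) of the byte holds the desired output for that input combination. The sketch below derives the byte from any three-input Boolean function; `minterm_of`, `cookie_cut`, and `copy_a` are illustrative names.

```c
#include <stdint.h>

typedef int (*BoolFn3)(int a, int b, int c);

/* Build the 8-bit minterm (low byte of BLTCON0): bit (a<<2 | b<<1 | c)
 * is set when f(a, b, c) is true. */
static uint8_t minterm_of(BoolFn3 f)
{
    uint8_t lf = 0;
    for (int i = 0; i < 8; i++) {
        int a = (i >> 2) & 1, b = (i >> 1) & 1, c = i & 1;
        if (f(a, b, c))
            lf |= (uint8_t)(1u << i);
    }
    return lf;
}

/* D = A·B + ¬A·C: mask in A, image in B, background in C. */
static int cookie_cut(int a, int b, int c) { return (a && b) || (!a && c); }

/* D = A: plain copy. */
static int copy_a(int a, int b, int c) { (void)b; (void)c; return a; }
```

`minterm_of(cookie_cut)` produces `0xCA` and `minterm_of(copy_a)` produces `0xF0`, the low byte of the `$09F0` BLTCON0 value used for simple copies.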

Pitfalls & Common Mistakes

Pitfall 1: "The Silent Corruption" — Fast RAM Pointers

    ; ✗ BAD: Buffer allocated in Fast RAM
    move.l  #FastRAMBuffer,$054(a5)  ; BLTDPT points to Fast RAM
    move.w  #(256<<6)|20,$058(a5)   ; Blit runs... but writes garbage

The Blitter's DMA engine is wired to the Chip RAM bus only. Fast RAM addresses silently alias to Chip RAM addresses or produce random data. There is no error signal — the blit completes "successfully" with corrupt output.

    ; ✓ GOOD: Buffer in Chip RAM
    move.l  #ChipRAMBuffer,$054(a5) ; Allocated with MEMF_CHIP

Pitfall 2: "The Race Condition" — Missing WaitBlit

    ; ✗ BAD: Start a new blit without waiting for previous one
    move.l  #$09F00000,$040(a5)  ; Overwrite BLTCON0 while previous blit runs!
    move.l  #NewSource,$050(a5)  ; Corrupt the in-progress blit
    move.w  #(64<<6)|4,$058(a5)  ; Start another blit — undefined behavior

Modifying blitter registers while a blit is in progress produces unpredictable results — partial data, corrupted pointers, or system crashes.

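    ; ✓ GOOD: Always wait
    tst.w   $002(a5)             ; Dummy DMACONR read: early Agnus revisions can report BBUSY late
.bwait:
    btst    #6,$002(a5)          ; BBUSY is bit 14 of DMACONR = bit 6 of its high byte (btst on memory is byte-sized)
    bne.s   .bwait
    ; Now safe to set up the next blit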

Pitfall 3: "The Wrong Direction" — Overlapping Copy Corruption

    ; ✗ BAD: Scrolling right (dest > source) with ascending mode
    ; Source at offset 0, dest at offset 2: ascending overwrites source words
    ; before they are read, producing smeared garbage
    ; ✓ GOOD: Use descending mode when dest > source and the regions overlap
    move.w  #$0002,$042(a5)      ; BLTCON1: DESC=1
    ; Set pointers to LAST word of block, not first
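
The direction choice for an overlapping copy can be written as a small predicate. This is a minimal sketch that assumes one contiguous block (modulo 0) and byte addresses; `needs_descending` and `blit_start` are hypothetical helper names.

```c
#include <stdint.h>
#include <stdbool.h>

/* True when the copy must run in descending mode: the regions overlap
 * and the destination starts at a higher address than the source. */
static bool needs_descending(uint32_t src, uint32_t dst, uint32_t size_bytes)
{
    return dst > src && dst < src + size_bytes;
}

/* Address to load into a pointer register: the first word of the block
 * for ascending mode, the last word for descending mode. */
static uint32_t blit_start(uint32_t base, uint32_t size_bytes, bool descending)
{
    return descending ? base + size_bytes - 2u : base;
}
```

When the destination sits below the source, or the regions do not overlap at all, ascending mode is safe and the pointers stay at the start of the block.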

Pitfall 4: "The Off-By-One Word" — Forgetting Shift Width Expansion

    ; ✗ BAD: 32-pixel wide blit at non-aligned X — width still set to 2 words
    ; Shifted data spills into adjacent word, corrupting neighboring pixels
    move.w  #(16<<6)|2,$058(a5)  ; Only 2 words wide — but shift needs 3!
    ; ✓ GOOD: Add 1 word when shift > 0
    move.w  #$0000,$046(a5)      ; BLTALWM = 0: the extra source word contributes no pixels
    move.w  #(16<<6)|3,$058(a5)  ; 3 words: 2 for data + 1 for shift overflow (BLTSIZE written last)
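
The width expansion follows from the X position alone. The sketch below computes the shift, the BLTSIZE width, and the shift field for BLTCON0 bits 15-12. The struct and function names are illustrative; the source modulo adjustment for the extra word is left out.

```c
#include <stdint.h>

typedef struct {
    uint16_t shift;        /* pixels to shift right, 0..15 */
    uint16_t width_words;  /* BLTSIZE width including any overflow word */
    uint16_t bltcon0_ash;  /* shift value positioned in BLTCON0 bits 15-12 */
} BlitSpan;

/* Derive blit width and shift for an image drawn at horizontal pixel x. */
static BlitSpan blit_span(unsigned x, unsigned width_pixels)
{
    BlitSpan s;
    s.shift = (uint16_t)(x & 15u);
    s.width_words = (uint16_t)((width_pixels + 15u) / 16u + (s.shift ? 1u : 0u));
    s.bltcon0_ash = (uint16_t)(s.shift << 12);
    return s;
}
```

A 32-pixel image at x = 5 needs 3 words and a shift of 5; the same image at x = 16 is word-aligned and needs only 2.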

Pitfall 5: "The Stale Pointer" — Reusing Registers After a Blit

After a blit completes, all pointer registers have advanced to the end of the data. A second blit with the same pointers starts where the first one left off — not at the original position.

    ; ✓ GOOD: Always reload all pointers before each blit
    move.l  #SourceAddr,$050(a5) ; Reload BLTAPT
    move.l  #DestAddr,$054(a5)   ; Reload BLTDPT

Impact on FPGA/Emulation

The Blitter is one of the most complex subsystems to reproduce accurately in an FPGA core:

  • DMA slot timing: The Blitter shares DMA slots with bitplane, sprite, Copper, disk, and audio DMA. Incorrect slot allocation produces visible glitches in demos that count cycles
  • Barrel shifter pipeline: The A and B channel shifts operate on a word pipeline — off-by-one in the shift register produces 1-pixel horizontal offset errors visible in scrolling
  • Fill mode carry propagation: The fill carry bit (FCI) must propagate correctly from right to left within each word and across word boundaries; errors produce "zebra stripe" artifacts
  • Line mode octant handling: The Bresenham algorithm implementation requires precise handling of 8 octants with correct sign and direction — many emulators get diagonal lines wrong in edge cases
  • BLTSIZE write-trigger: The blit must start on the exact cycle that BLTSIZE is written, not one cycle later; demos that chain blits back-to-back depend on this timing
  • Nasty mode interaction: BLTPRI must correctly freeze the CPU and still allow DMA from other sources (Copper, bitplanes) — freezing everything breaks display output

Real-World Software Usage

| Software | Blitter Usage | Notes |
| --- | --- | --- |
| Deluxe Paint | Brush compositing, flood fill, line tools | Canonical use of BltBitMap + BltMaskBitMapRastPort through the OS |
| Shadow of the Beast | Multi-layer parallax scrolling | Custom blitter routines for layer compositing, bypasses OS |
| Carrier Command | Filled polygon 3D rendering | Blitter line draw + fill mode for real-time vector graphics |
| Lemmings | Terrain destruction, character compositing | Cookie-cut blits for each lemming; XOR blits for terrain modification |
| Workbench | Window dragging, icon rendering, menu drawing | All through graphics.library: system-friendly blitter usage |
| Demo scene | Virtually everything | Chunky-to-planar conversion, texture mapping, copper+blitter co-programming |

Best Practices

  1. Always call WaitBlit() or poll BBUSY before touching any blitter register
  2. Write BLTSIZE last — it triggers the blit; all other registers must be configured first
  3. Use OwnBlitter()/DisownBlitter() for system-friendly code — never assume you have exclusive access
  4. Disable unused channels — fewer channels = fewer DMA cycles = faster blit
  5. Set BLTAFWM and BLTALWM to $FFFF for word-aligned blits — forgetting this produces partial-word masking bugs
  6. Account for shift width expansion — non-aligned blits are 1 word wider than you expect
  7. Choose ascending/descending correctly for overlapping copies — test both scroll directions
  8. Interleave CPU work with blitter operations — the whole point of DMA is parallelism; don't busy-wait when you could be computing
  9. Profile before choosing Blitter vs CPU — on accelerated Amigas, the 68020+ with Fast RAM often wins

References

  • HRM: Amiga Hardware Reference Manual — Blitter chapter (complete register descriptions and timing)
  • NDK 3.9: hardware/blit.h, hardware/custom.h, graphics/gfx.h
  • ADCD 2.1: Hardware Manual — Blitter chapter
  • See also: blitter.md — hardware register reference
  • See also: animation.md — GEL system (BOBs use the Blitter internally)
  • See also: copper.md — Copper coprocessor (often co-programmed with the Blitter)
  • See also: rastport.md — RastPort drawing context (uses Blitter for all draw operations)
  • See also: display_modes.md — DMA slot budget (Blitter competes for bus bandwidth)
  • See also: Akiko — CD32 C2P — hardware Chunky-to-Planar conversion (CD32 alternative to CPU/Blitter C2P)
  • Scoopex Amiga Hardware Programming (Photon) — YouTube: Blitter episodes — Video walkthroughs of Blitter setup, cookie-cut masking, line draw, and fill mode. Companion articles: coppershade.org