Blitter Programming — Deep Dive
Overview
The Blitter (Block Image Transferrer) is a DMA coprocessor inside the Agnus chip that performs raster operations on rectangular memory blocks at bus speed — without CPU involvement. While the 68000 executes game logic, physics, or AI, the Blitter simultaneously clears screens, copies bitmap regions, composites masked sprites ("cookie-cut"), draws lines, and fills polygons. This parallelism is fundamental to why the Amiga could deliver arcade-quality 2D graphics on a 7 MHz processor with 512 KB of RAM.
The Blitter operates on up to 4 DMA channels (A, B, C → D) using a programmable 8-bit minterm truth table that encodes any Boolean function of three inputs. Combined with barrel shifters on channels A and B, a modulo per channel, and first/last word masking on channel A, this makes the Blitter a general-purpose 2D rasterization engine rather than a mere memory copier.
Warning
The Blitter can only access Chip RAM. Pointing any channel register at Fast RAM causes silent data corruption or system crashes. Always allocate blitter-visible memory with AllocMem(size, MEMF_CHIP).
Channel A ──→ ┐
Channel B ──→ ├──→ Minterm Logic ──→ Channel D (output)
Channel C ──→ ┘
A = mask/pattern (e.g., cookie shape, font glyph)
B = source image data
C = background / destination read-back
D = output destination
Architecture
The Blitter sits inside Agnus (OCS/ECS) or Alice (AGA), sharing the DMA bus with the Copper, bitplane fetches, sprite DMA, disk, and audio. It accesses memory through 4 independent DMA channels, each with its own pointer and modulo register:
graph LR
subgraph "Agnus / Alice"
A["Channel A<br/>(mask/pattern)"] --> ML["Minterm Logic<br/>(8-bit truth table)"]
B["Channel B<br/>(source data)"] --> ML
C["Channel C<br/>(background read-back)"] --> ML
ML --> D["Channel D<br/>(output)"]
end
CRAM["Chip RAM"] --> A
CRAM --> B
CRAM --> C
D --> CRAM
style ML fill:#fff9c4,stroke:#f9a825
style CRAM fill:#e8f4fd,stroke:#2196f3
The Minterm Logic block is the Blitter's core innovation. It takes the current bit from channels A, B, and C (three Boolean inputs) and produces one output bit for channel D according to a programmable 8-bit truth table stored in BLTCON0 bits 7–0. Since 3 inputs have 8 possible combinations (2³), the 8-bit minterm encodes any Boolean function of three variables — that's 256 possible logic operations in a single register write. This is what lets one piece of hardware do copies (D=A, minterm $F0), clears (D=0, minterm $00), cookie-cut compositing (D=A·B+¬A·C, minterm $CA), XOR highlighting (D=A⊕C, minterm $5A), and any other combination — all without changing hardware, just the 8-bit minterm value. See Minterm Logic below for the full truth table and common values.
Each channel reads (or writes, for D) from a different memory pointer with independent modulo, allowing operations on sub-rectangles within larger bitmaps. Writing to BLTSIZE ($DFF058) starts the blit immediately — always configure all other registers first.
Channel Roles
| Channel | DMA Direction | Typical Role | Has Shift? | Has Mask? |
|---|---|---|---|---|
| A | Read | Mask, cookie shape, font glyph, line texture | Yes (ASH, 0–15 px) | Yes (BLTAFWM/BLTALWM) |
| B | Read | Source image data | Yes (BSH, 0–15 px) | No |
| C | Read | Background / destination read-back | No | No |
| D | Write | Output destination | No | No |
Note
Any channel can be disabled per operation via BLTCON0 bits 11–8 (USEA/B/C/D). Disabling unused channels saves DMA cycles: per the Hardware Reference Manual cycle sequences, a D-only clear or an A→D copy needs 2 bus cycles per word, against 4 for a full ABCD blit.
CPU / Blitter Bus Interaction
The Blitter and the 68000 CPU share the Chip RAM bus — they cannot access it simultaneously. Agnus arbitrates access on a cycle-by-cycle basis:
┌────────────────────────────────────────────────────────────┐
│ Chip RAM Bus (16-bit) │
├──────────┬──────────┬──────────┬──────────┬────────────────┤
│ Bitplane │ Sprite │ Copper │ Blitter │ CPU (left- │
│ DMA │ DMA │ DMA │ DMA │ over slots) │
├──────────┴──────────┴──────────┴──────────┴────────────────┤
│ Fixed priority (high → low) │
└────────────────────────────────────────────────────────────┘
- Without BLTPRI: The Blitter still has priority over the CPU, but Agnus watches the CPU's requests. If the CPU has been locked out for three consecutive memory cycles, the Blitter releases the bus for one cycle. CPU accesses to Chip RAM are slowed, not stopped.
- With BLTPRI ("nasty" mode): The Blitter takes all free DMA slots. Any CPU access to Chip RAM stalls until the blit completes. The CPU can still execute from Fast RAM or ROM as long as it does not touch Chip RAM.
- Display DMA always wins: Bitplane, sprite, and audio DMA have fixed priority above the Blitter. In high-resolution modes, display DMA alone consumes most of the bus, leaving few slots for blitter operations.
Chip RAM vs. Fast RAM
The Blitter is physically wired to the Chip RAM bus inside Agnus. It has no connection to the Fast RAM (Zorro) bus:
| Memory Type | Blitter Access? | CPU Access? | Notes |
|---|---|---|---|
| Chip RAM (first 512 KB–2 MB) | ✓ Yes | ✓ Yes (contended) | Screen buffers, audio, sprites, all DMA-visible data |
| Fast RAM (Zorro II/III) | ✗ No | ✓ Yes (uncontended) | Code, variables, non-DMA data |
| ROM ($F80000–$FFFFFF) | ✗ No | ✓ Yes | Kickstart, libraries |
This creates the key optimization opportunity on Amigas with Fast RAM (A3000, A4000, or an expanded A500/A1200): the CPU can execute code and access Fast RAM at full speed while the Blitter simultaneously works on Chip RAM. On a machine with only Chip RAM, such as a stock A500 or A1200, the CPU and Blitter always contend for the same bus.
Important
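There is no hardware error when pointing blitter registers at Fast RAM addresses. Agnus/Alice only drives enough address bits to cover Chip RAM (512 KB to 2 MB depending on the chip revision), so the upper bits of the pointer are ignored and the access wraps into Chip RAM space, producing silent data corruption at an unintended Chip RAM location.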
Minterm Logic
The minterm is an 8-bit value stored in BLTCON0 (bits 7–0) that tells the Blitter what to do with each pixel. Think of it as a tiny program: for every pixel position, the Blitter reads the current bit from channels A, B, and C, looks up the answer in the minterm, and writes that answer to channel D (destination memory).
Since there are 3 inputs (A, B, C), each either 0 or 1, there are exactly 8 possible input combinations. The 8-bit minterm has one bit for each combination — that bit decides whether the output pixel is on (1) or off (0):
| Minterm Bit | Input A (mask) | Input B (source) | Input C (background) | "If these inputs look like this…" |
|---|---|---|---|---|
| Bit 7 | 1 | 1 | 1 | …mask on, source on, background on |
| Bit 6 | 1 | 1 | 0 | …mask on, source on, background off |
| Bit 5 | 1 | 0 | 1 | …mask on, source off, background on |
| Bit 4 | 1 | 0 | 0 | …mask on, source off, background off |
| Bit 3 | 0 | 1 | 1 | …mask off, source on, background on |
| Bit 2 | 0 | 1 | 0 | …mask off, source on, background off |
| Bit 1 | 0 | 0 | 1 | …mask off, source off, background on |
| Bit 0 | 0 | 0 | 0 | …mask off, source off, background off |
Each bit is a simple yes/no: "should the output pixel be on for this combination?"
Worked Example: Cookie-Cut ($CA)
The most important minterm is $CA — the cookie-cut blit used for sprite compositing. In binary, $CA = 11001010. Let's read each bit:
| Bit | A (mask) | B (source) | C (background) | $CA bit value | Output pixel | Why |
|---|---|---|---|---|---|---|
| 7 | on | on | on | 1 | on | Inside the shape, source pixel is on → show it |
| 6 | on | on | off | 1 | on | Inside the shape, source pixel is on → show it |
| 5 | on | off | on | 0 | off | Inside the shape, source pixel is off → show it (it's dark) |
| 4 | on | off | off | 0 | off | Inside the shape, source pixel is off → show it |
| 3 | off | on | on | 1 | on | Outside the shape → keep background (it's on) |
| 2 | off | on | off | 0 | off | Outside the shape → keep background (it's off) |
| 1 | off | off | on | 1 | on | Outside the shape → keep background (it's on) |
| 0 | off | off | off | 0 | off | Outside the shape → keep background (it's off) |
The pattern: where the mask (A) is set → take the source pixel (B). Where the mask is clear → keep the background pixel (C). That's a sprite draw with transparency — exactly what every Amiga game uses.
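The table lookup described above can be modelled in a few lines of C. This is only a host-side sketch for checking minterm values, not how the hardware is programmed; the function name `blit_logic` is hypothetical, and it processes one 16-bit word per call with no shifting or masking.

```c
#include <stdint.h>

/* Software model of the minterm lookup for one 16-bit word per channel.
   For each bit position the three input bits form an index
   (A << 2) | (B << 1) | C, and that bit of the minterm is the output. */
static uint16_t blit_logic(uint8_t minterm, uint16_t a, uint16_t b, uint16_t c)
{
    uint16_t d = 0;
    for (int bit = 0; bit < 16; bit++) {
        unsigned idx = (((a >> bit) & 1u) << 2)
                     | (((b >> bit) & 1u) << 1)
                     |  ((c >> bit) & 1u);
        if ((minterm >> idx) & 1u)
            d |= (uint16_t)(1u << bit);
    }
    return d;
}
```

With minterm $CA, mask $0FF0, source $3C3C and background $AAAA, the model returns $AC3A: the middle eight bits come from the source and the outer eight from the background.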
Common Minterms
| Minterm | Hex | Operation | Description | Real-World Use Case |
|---|---|---|---|---|
| D = A | $F0 | Copy A | Output is a copy of channel A: every A-set pixel appears in D | Block copy: duplicate a screen region, copy a font glyph to the display |
| D = B | $CC | Copy B | Output is a copy of channel B regardless of A and C | Shifted copy: B has its own barrel shift (BSH), so this copies with pixel-level repositioning |
| D = C | $AA | Copy C | Output is a copy of the destination read-back | No-op / readback: useful for fill mode where C→D with fill carry toggling |
| D = A·B + ¬A·C | $CA | Cookie-cut | Where mask (A) is 1: show source (B). Where mask is 0: show background (C) | Sprite compositing: draw a player character with transparency onto the game world |
| D = 0 | $00 | Clear | Output is always 0 regardless of inputs | Screen clear: zero out a bitplane, erase a region |
| D = 1 | $FF | Set all | Output is always 1 (every word written is $FFFF) | Fill with 1s: set all pixels in a region (useful for masks) |
| D = A XOR C | $5A | XOR | Output toggles wherever A has a set bit | Cursor blink: XOR the cursor shape to toggle it on/off without saving background |
| D = A OR C | $FA | OR | Output is set wherever either A or C has a set bit | Overlay: stamp a shape onto the background without erasing existing pixels |
| D = ¬A AND C | $0A | Mask out | Output keeps C pixels only where A is clear, erasing through the mask | Erase shape: cut a hole in the background matching the mask shape (first pass of two-pass sprite draw) |
| D = A AND B | $C0 | AND | Output is set only where both A and B are set | Masked pattern: apply a fill pattern (B) clipped to a shape (A) |
| D = A XOR B | $3C | XOR (A,B) | Output is set wherever A and B differ | Difference detection: find which pixels changed between two frames |
| D = NOT A | $0F | Invert | Output is the bitwise complement of A | Mask inversion: generate a negative mask from a positive one |
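Any value in this table can be derived rather than memorised. A common trick, sketched here in C, is to represent each input by its column of the truth table (A = $F0, B = $CC, C = $AA) and write the Boolean expression directly. The names `MT_A`, `MT_B`, `MT_C` and `MINTERM` are illustrative and not taken from any system header.

```c
#include <stdint.h>

/* Truth-table columns of the three inputs, in minterm bit order 7..0. */
enum { MT_A = 0xF0, MT_B = 0xCC, MT_C = 0xAA };

/* Reduce an expression over MT_A/MT_B/MT_C to the 8-bit minterm value. */
#define MINTERM(expr) ((uint8_t)((expr) & 0xFF))

/* Cookie-cut: source where the mask is set, background elsewhere. */
static const uint8_t MT_COOKIE = MINTERM((MT_A & MT_B) | (~MT_A & MT_C));
```

`MT_COOKIE` evaluates to $CA, and `MINTERM(MT_A ^ MT_C)` gives $5A, matching the rows above.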
Cookie-Cut Explained
A = mask (1 = sprite pixel, 0 = transparent)
B = sprite image data
C = background
D = result
Minterm $CA:
Where A=1: D = B (show sprite)
Where A=0: D = C (show background)
Register Reference
| Address | Name | R/W | Description |
|---|---|---|---|
| $DFF040 | BLTCON0 | W | Control: ASH (bits 15–12), channel enables (bits 11–8), minterm (bits 7–0) |
| $DFF042 | BLTCON1 | W | Control: BSH (bits 15–12), fill/line mode (bits 4–0) |
| $DFF044 | BLTAFWM | W | First word mask for channel A |
| $DFF046 | BLTALWM | W | Last word mask for channel A |
| $DFF048 | BLTCPTH/L | W | Channel C pointer (32-bit) |
| $DFF04C | BLTBPTH/L | W | Channel B pointer (32-bit) |
| $DFF050 | BLTAPTH/L | W | Channel A pointer (32-bit) |
| $DFF054 | BLTDPTH/L | W | Channel D pointer (32-bit) |
| $DFF058 | BLTSIZE | W | Blit dimensions + START (write triggers blit!) |
| $DFF05C | BLTSIZV | W | Blit height, ECS/AGA only (15-bit, up to 32768 lines) |
| $DFF05E | BLTSIZH | W | Blit width + START, ECS/AGA only (11-bit, up to 2048 words) |
| $DFF060 | BLTCMOD | W | Channel C modulo (bytes to skip per row) |
| $DFF062 | BLTBMOD | W | Channel B modulo |
| $DFF064 | BLTAMOD | W | Channel A modulo |
| $DFF066 | BLTDMOD | W | Channel D modulo |
| $DFF070 | BLTCDAT | W | Channel C data register (preload) |
| $DFF072 | BLTBDAT | W | Channel B data register (preload / line texture) |
| $DFF074 | BLTADAT | W | Channel A data register (preload; $8000 in line mode) |
| $DFF002 | DMACONR | R | DMA status: bit 14 (BBUSY) = blitter busy |
BLTCON0 Encoding
Bits 15–12: ASH — A channel barrel shift (0–15 pixels right)
Bit 11: USEA — enable channel A DMA
Bit 10: USEB — enable channel B DMA
Bit 9: USEC — enable channel C DMA
Bit 8: USED — enable channel D DMA (almost always 1)
Bits 7–0: LF — minterm (logic function truth table)
BLTCON1 Encoding
Bits 15–12: BSH — B channel barrel shift (0–15 pixels right)
Bit 4: EFE — exclusive fill enable
Bit 3: IFE — inclusive fill enable
Bit 2: FCI — fill carry input (initial state)
Bit 1: DESC — descending mode (blit bottom-right → top-left)
Bit 0: LINE — line draw mode
BLTSIZE Encoding (OCS/ECS)
Bits 15–6: Height in lines (1–1024, 0 = 1024)
Bits 5–0: Width in words (1–64, 0 = 64)
Warning
Writing BLTSIZE starts the blit! Always configure all other registers (pointers, modulos, control, masks) before writing BLTSIZE. The big-blit registers on ECS and AGA follow the same rule: write BLTSIZV first, then BLTSIZH (which triggers the blit).
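The bit layouts above translate into two small packing helpers. This is a sketch for building the register values on the host or in a C program; the `BLT_USE*` constants and the function names are hypothetical, not definitions from the system headers.

```c
#include <stdint.h>

/* Channel-enable bits of BLTCON0 (bits 11..8). */
enum { BLT_USEA = 0x0800, BLT_USEB = 0x0400, BLT_USEC = 0x0200, BLT_USED = 0x0100 };

/* BLTCON0 = A shift (bits 15..12) | channel enables | minterm. */
static uint16_t pack_bltcon0(unsigned ash, unsigned channels, uint8_t minterm)
{
    return (uint16_t)(((ash & 15u) << 12) | (channels & 0x0F00u) | minterm);
}

/* BLTSIZE = height in lines (bits 15..6) | width in words (bits 5..0).
   The maxima, 1024 lines and 64 words, are encoded as 0 in their fields,
   which the masking below produces automatically. */
static uint16_t pack_bltsize(unsigned height, unsigned width_words)
{
    return (uint16_t)(((height & 0x3FFu) << 6) | (width_words & 0x3Fu));
}
```

For example, `pack_bltcon0(0, BLT_USEA | BLT_USED, 0xF0)` yields $09F0 and `pack_bltsize(256, 20)` yields $4014, the values used for the copy and clear examples in this document.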
Ascending vs. Descending Mode
When source and destination overlap in memory, the blit direction determines whether data is corrupted:
Ascending (default, DESC=0):
Reads/writes top-left → bottom-right
Safe for overlapping blits when: dest address < source address
Descending (DESC=1):
Reads/writes bottom-right → top-left
Safe for overlapping blits when: dest address > source address
Pointers must be set to the LAST word of the block
Modulos are subtracted instead of added
This is critical for scrolling — shifting the screen contents by a few pixels requires an overlapping copy, and using the wrong direction produces garbage.
Shift and Alignment
The Blitter is a word-aligned (16-bit) processor. Moving objects to arbitrary pixel positions requires the barrel shifter:
- ASH (channel A shift) and BSH (channel B shift) shift data 0–15 pixels to the right (to the left in descending mode)
- A rectangle N pixels wide at a non-aligned X position spans ⌈(N + shift) / 16⌉ words, which is one more than the aligned case when N is a multiple of 16
- BLTAFWM (first word mask) and BLTALWM (last word mask) prevent the shifted data from corrupting pixels outside the target area
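The arithmetic in the list above can be written out as a small C helper. This is a sketch under the assumption of a single non-interleaved bitplane; the `BobPlacement` type and `place_bob` function are hypothetical names.

```c
#include <stdint.h>

typedef struct {
    uint32_t byte_offset;  /* word-aligned offset of the first destination word */
    unsigned shift;        /* value for ASH/BSH, 0..15 */
    unsigned width_words;  /* ceil((width_px + shift) / 16) */
} BobPlacement;

/* Split a pixel position into a word-aligned address and a barrel shift,
   and compute how many words the shifted rectangle touches per row. */
static BobPlacement place_bob(unsigned x, unsigned y,
                              unsigned width_px, unsigned bytes_per_row)
{
    BobPlacement p;
    p.shift = x & 15u;
    p.byte_offset = (uint32_t)y * bytes_per_row + (x >> 4) * 2u;
    p.width_words = (width_px + p.shift + 15u) / 16u;
    return p;
}
```

A 16-pixel object at x = 35 on a 40-byte row gets shift 3 and spans 2 words, while the same object at x = 32 needs only 1. Many BOB routines nevertheless always blit the extra word with BLTALWM = $0000, so the register setup does not depend on the shift.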
Complete Examples
Example 1: Clear Screen (320×256, 1 bitplane)
lea $DFF000,a5
; Wait for blitter idle:
.bwait:
btst #14,$002(a5) ; DMACONR bit 14 = BBUSY
bne.s .bwait
; D channel only, minterm $00 (clear):
move.l #$01000000,$040(a5) ; BLTCON0: USED=1, minterm=$00
clr.w $042(a5) ; BLTCON1: 0
move.l #ScreenMem,$054(a5) ; BLTDPT
clr.w $066(a5) ; BLTDMOD: 0 (contiguous)
move.w #(256<<6)|20,$058(a5) ; BLTSIZE: 256 lines × 20 words (320/16)
; Blit is now running!
Example 2: Block Copy (No Shift)
; Copy 64×64 pixel block from source to dest (1 bitplane)
; Source and dest are in contiguous bitmap, 320 pixels wide
; Width = 64 pixels = 4 words
; Modulo = (320 - 64) / 16 = 16 words = 32 bytes
lea $DFF000,a5
.bwait:
btst #14,$002(a5)
bne.s .bwait
move.l #$09F00000,$040(a5) ; BLTCON0: USEA+USED, minterm=$F0 (A→D)
clr.w $042(a5) ; BLTCON1
move.w #$FFFF,$044(a5) ; BLTAFWM = all bits
move.w #$FFFF,$046(a5) ; BLTALWM = all bits
move.l #SourceAddr,$050(a5) ; BLTAPT
move.l #DestAddr,$054(a5) ; BLTDPT
move.w #32,$064(a5) ; BLTAMOD = 32 bytes
move.w #32,$066(a5) ; BLTDMOD = 32 bytes
move.w #(64<<6)|4,$058(a5) ; BLTSIZE: 64 lines × 4 words → GO!
Example 3: Cookie-Cut Blit (Masked Sprite)
; Blit a 16×16 masked sprite onto background
; A = mask, B = sprite data, C = background, D = destination
lea $DFF000,a5
.bwait:
btst #14,$002(a5)
bne.s .bwait
move.l #$0FCA0000,$040(a5) ; BLTCON0: A+B+C+D, minterm=$CA
clr.w $042(a5) ; BLTCON1
move.w #$FFFF,$044(a5) ; BLTAFWM
move.w #$FFFF,$046(a5) ; BLTALWM
move.l #MaskData,$050(a5) ; BLTAPT = mask
move.l #SpriteData,$04C(a5) ; BLTBPT = sprite imagery
move.l #ScreenPos,$048(a5) ; BLTCPT = background (read-back)
move.l #ScreenPos,$054(a5) ; BLTDPT = same as C (overwrite)
clr.w $064(a5) ; BLTAMOD = 0 (mask is 16px = 1 word wide)
clr.w $062(a5) ; BLTBMOD = 0
move.w #38,$060(a5) ; BLTCMOD = (320-16)/8 = 38 bytes
move.w #38,$066(a5) ; BLTDMOD = 38
move.w #(16<<6)|1,$058(a5) ; BLTSIZE: 16 lines × 1 word → GO!
Example 4: Line Drawing
; Draw a line from (x1,y1) to (x2,y2) using blitter line mode
; Line mode implements a Bresenham-style algorithm in hardware
; BLTCON1 bit 0 = LINE mode
; BLTADAT = $8000 (single pixel, moved into position by ASH)
; BLTBDAT = line texture ($FFFF = solid)
; Channel C/D = destination bitmap
; Conceptual outline only: BLTAPT, BLTAMOD and BLTBMOD (the Bresenham
; error terms) and the BLTCON1 octant bits are omitted here; see the HRM.
move.l #$0B4A0000,$040(a5) ; BLTCON0: A+C+D, minterm=$4A (line-mode XOR); ASH = x1 & 15 goes in bits 15-12
move.w #$0001,$042(a5) ; BLTCON1: LINE=1, octant bits set per slope
move.w #$8000,$074(a5) ; BLTADAT: single pixel pattern
move.w #$FFFF,$072(a5) ; BLTBDAT: solid line texture
move.w #$FFFF,$044(a5) ; BLTAFWM
move.w #$FFFF,$046(a5) ; BLTALWM
move.l #StartPos,$048(a5) ; BLTCPT: line start position in bitmap
move.l #StartPos,$054(a5) ; BLTDPT: same
move.w #Modulo,$060(a5) ; BLTCMOD: bitmap width in bytes
move.w #Modulo,$066(a5) ; BLTDMOD: same
move.w #(len<<6)|2,$058(a5) ; BLTSIZE: len = pixels along the major axis, width always 2 → GO!
Advanced Use Cases & Cookbook
Use Case 1: Shifted BOB (Sprite at Arbitrary X Position)
The most common real-world blitter task: draw a 16×16 sprite at pixel position (x, y) on a 320-pixel-wide screen. Since x may not be word-aligned, the barrel shifter handles sub-word positioning:
; Draw 16×16 BOB at pixel (x, y) on a 320px wide screen
; Inputs: d0.w = x position, d1.w = y position
; a0 = mask data, a1 = sprite data, a2 = screen base
lea $DFF000,a5
; Calculate screen byte offset:
move.w d1,d2
mulu #40,d2 ; y × 40 bytes/row (320 pixels / 8)
move.w d0,d3
lsr.w #3,d3 ; x / 8 = byte offset in row
and.w #$FFFE,d3 ; word-align (drop bit 0)
add.w d3,d2 ; total byte offset into screen
lea (a2,d2.w),a3 ; a3 = screen pointer for this BOB
; Calculate shift amount:
move.w d0,d3
and.w #$000F,d3 ; shift = x mod 16 (0–15 pixels)
ror.w #4,d3 ; move to bits 15–12 for BLTCON0
or.w #$0FCA,d3 ; channels A+B+C+D, minterm $CA
.bwait:
btst #14,$002(a5)
bne.s .bwait
move.w d3,$040(a5) ; BLTCON0: shift + channels + minterm
clr.w $042(a5) ; BLTCON1: ascending, no fill
move.w #$FFFF,$044(a5) ; BLTAFWM: all bits in first word
move.w #$0000,$046(a5) ; BLTALWM: mask off last word (shift overflow)
move.l a0,$050(a5) ; BLTAPT = mask
move.l a1,$04C(a5) ; BLTBPT = sprite imagery
move.l a3,$048(a5) ; BLTCPT = background read-back
move.l a3,$054(a5) ; BLTDPT = write back to same position
move.w #-2,$064(a5) ; BLTAMOD = -2 (mask is 1 word wide, but the blit fetches 2 words per row)
move.w #-2,$062(a5) ; BLTBMOD = -2 (same for the sprite data)
move.w #36,$060(a5) ; BLTCMOD = 40 - (2 words × 2) = 36 bytes
move.w #36,$066(a5) ; BLTDMOD = 36
move.w #(16<<6)|2,$058(a5) ; BLTSIZE: 16 lines × 2 words (1 extra for shift) → GO!
Key insight: the blit is 2 words wide even though the sprite is only 16 pixels (1 word). The barrel shift pushes bits into the second word, so we need that extra word, and BLTALWM=$0000 masks it so we don't corrupt adjacent pixels. Because channels A and B also fetch 2 words per row from data that is only 1 word wide, their modulos are -2, which steps each pointer back onto the next source row.
Use Case 2: Hardware Scroll (Left by N Pixels)
Scrolling the screen left moves data toward lower addresses: the destination lies below the source, so an ascending blit is safe because every word is read before it is overwritten. Scrolling right is the case that needs descending mode.
; Scroll 320×256 screen left by 16 pixels (1 word = fastest case)
; Source: screen + 2 bytes (one word right)
; Dest: screen base
; No shift needed for 16-pixel increments
lea $DFF000,a5
.bwait:
btst #14,$002(a5)
bne.s .bwait
move.l #$09F00000,$040(a5) ; BLTCON0: A+D, minterm $F0 (copy)
clr.w $042(a5) ; BLTCON1: ascending (dest < source, so ascending is safe)
move.w #$FFFF,$044(a5) ; BLTAFWM
move.w #$FFFF,$046(a5) ; BLTALWM
move.l #Screen+2,$050(a5) ; BLTAPT: source is 1 word to the right
move.l #Screen,$054(a5) ; BLTDPT: destination is screen start
clr.w $064(a5) ; BLTAMOD = 0 (full-width rows)
clr.w $066(a5) ; BLTDMOD = 0
move.w #(256<<6)|20,$058(a5) ; BLTSIZE: 256 lines × 20 words → GO!
; After blit: draw new column at right edge (column 19)
For sub-word scrolling (1–15 pixels), combine this with the barrel shifter and draw the new edge column from tile data.
Use Case 3: Area Fill (Filled Polygon)
The blitter's fill mode is a two-step process: (1) draw the polygon outline with XOR lines, (2) fill the region. This is how games like Carrier Command and Starglider 2 achieved real-time filled 3D:
; Step 1: Draw polygon edges using blitter line mode (XOR, single-bit)
; (Repeat for each edge of the polygon)
; Use minterm $4A (the line-mode XOR function) and BLTCON1 bit 0 = LINE, bit 1 = SING
; Step 2: Fill the outlined region
; Fill works RIGHT-TO-LEFT, BOTTOM-TO-TOP — requires descending mode
; Pointers must point to the LAST word of the bitmap region
lea $DFF000,a5
.bwait:
btst #14,$002(a5)
bne.s .bwait
; Set up inclusive fill (IFE):
move.l #$09F00000,$040(a5) ; BLTCON0: A+D, minterm $F0 (copy with fill)
move.w #$000A,$042(a5) ; BLTCON1: DESC=1 (bit 1), IFE=1 (bit 3)
; IFE = inclusive fill enable
move.w #$FFFF,$044(a5) ; BLTAFWM
move.w #$FFFF,$046(a5) ; BLTALWM
; Pointers to LAST word of the fill region (descending!):
move.l #FillBufferEnd,$050(a5) ; BLTAPT: last word of source
move.l #FillBufferEnd,$054(a5) ; BLTDPT: last word of dest (same buffer)
clr.w $064(a5) ; BLTAMOD = 0
clr.w $066(a5) ; BLTDMOD = 0
move.w #(Height<<6)|Width,$058(a5) ; BLTSIZE → GO!
How it works: the Blitter scans each row from right to left with a fill carry that starts at the FCI value and toggles on every set pixel. Between two outline pixels on the same scanline the carry stays on, filling the interior. This is why the outline must be drawn in single-bit mode (SING=1): a second outline pixel on the same row would toggle the carry again and break the fill.
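The carry behaviour can be modelled in software to preview what a fill will produce. The following C sketch is a simplified model of one 16-bit word, not a description of the hardware implementation; `fill_word` is a hypothetical name. It assumes the scan runs from bit 0 (rightmost pixel) upward and that the caller resets the carry to FCI at the start of each row.

```c
#include <stdint.h>
#include <stdbool.h>

/* Model of blitter fill for one word. *carry is passed from word to word
   moving leftwards along a row. Inclusive fill keeps both outline pixels;
   exclusive fill drops the left-hand (closing) one. */
static uint16_t fill_word(uint16_t in, bool *carry, bool exclusive)
{
    uint16_t out = 0;
    for (int bit = 0; bit < 16; bit++) {
        bool edge = ((in >> bit) & 1u) != 0;
        if (edge)
            *carry = !*carry;           /* every set pixel toggles the carry */
        bool set = exclusive ? *carry : (*carry || edge);
        if (set)
            out |= (uint16_t)(1u << bit);
    }
    return out;
}
```

An outline word of $0024 (two edge pixels) becomes $003C with inclusive fill and $001C with exclusive fill. A row with an odd number of outline pixels leaves the carry set and fills all the way to the left edge, which is the typical symptom of a broken outline.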
Use Case 4: Interleaved Bitplane BOBs
Standard bitplane layout stores all of plane 0, then all of plane 1, etc. Interleaved layout stores one row of plane 0, then one row of plane 1, alternating. This allows a single blit to draw a BOB across all bitplanes at once:
; Interleaved screen layout:
; Row 0, Plane 0 (40 bytes)
; Row 0, Plane 1 (40 bytes)
; Row 0, Plane 2 (40 bytes)
; Row 0, Plane 3 (40 bytes)
; Row 0, Plane 4 (40 bytes)
; Row 1, Plane 0 (40 bytes)
; ...
; Blit a 16×16 cookie-cut BOB across all 5 bitplanes in ONE operation:
; Height = 16 lines × 5 planes = 80 rows
; Modulo = 40 - 2 = 38 bytes per interleaved row (skip rest of scanline row)
; BOB data is also stored interleaved
lea $DFF000,a5
.bwait:
btst #14,$002(a5)
bne.s .bwait
move.l #$0FCA0000,$040(a5) ; BLTCON0: A+B+C+D, minterm $CA
clr.w $042(a5) ; BLTCON1
move.w #$FFFF,$044(a5) ; BLTAFWM
move.w #$FFFF,$046(a5) ; BLTALWM
move.l #BOBMask,$050(a5) ; BLTAPT (interleaved mask: same mask for all planes)
move.l #BOBData,$04C(a5) ; BLTBPT (interleaved sprite data)
move.l a3,$048(a5) ; BLTCPT (screen position)
move.l a3,$054(a5) ; BLTDPT (same)
clr.w $064(a5) ; BLTAMOD = 0 (mask is stored interleaved too: each mask row repeated once per plane)
clr.w $062(a5) ; BLTBMOD = 0
move.w #38,$060(a5) ; BLTCMOD = 38 (skip to next interleaved row)
move.w #38,$066(a5) ; BLTDMOD = 38
move.w #(80<<6)|1,$058(a5) ; BLTSIZE: 80 rows (16×5) × 1 word → GO!
Why this matters: without interleaving, drawing one BOB on a 5-plane screen requires 5 separate blits (one per plane), each with its own WaitBlit + register setup overhead. Interleaving does it in 1 blit — 5× less setup time, critical when drawing 15+ BOBs per frame.
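The blit geometry for an interleaved BOB follows directly from the layout. A small C sketch, with hypothetical names, that also checks the result against the BLTSIZE field limits described earlier:

```c
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    unsigned rows;          /* height to put in BLTSIZE: BOB height x planes */
    unsigned dest_modulo;   /* BLTCMOD/BLTDMOD in bytes */
    bool     fits_bltsize;  /* false if the blit exceeds 1024 rows or 64 words */
} InterleavedBlit;

/* Geometry of a single blit that covers every plane of an interleaved
   screen. bytes_per_plane_row is the width of one plane's scanline
   (40 for a 320-pixel screen). */
static InterleavedBlit interleaved_blit(unsigned bob_height, unsigned bob_width_words,
                                        unsigned planes, unsigned bytes_per_plane_row)
{
    InterleavedBlit g;
    g.rows = bob_height * planes;
    g.dest_modulo = bytes_per_plane_row - bob_width_words * 2u;
    g.fits_bltsize = g.rows <= 1024u && bob_width_words <= 64u;
    return g;
}
```

The 16×16, 5-plane case gives 80 rows and a modulo of 38, as in the listing. A full-height 256-line object on 5 planes would need 1,280 rows, which no longer fits the 10-bit BLTSIZE height field and has to be split.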
Use Case 5: Double-Buffered Game Loop
The standard pattern for flicker-free game rendering:
MainLoop:
; --- Wait for vertical blank ---
bsr WaitVBL ; Wait for beam to reach line 0
; --- Swap display buffer ---
; Copper list points to the currently visible buffer
; We draw into the hidden back buffer
move.l BackBuffer,a0
move.l FrontBuffer,a1
move.l a0,FrontBuffer ; Back buffer becomes front (display)
move.l a1,BackBuffer ; Old front becomes new back (draw target)
; Update Copper list bitplane pointers to show new front buffer:
bsr UpdateCopperBPLPTRs
; --- Clear back buffer ---
bsr WaitBlit
move.l #$01000000,$040(a5) ; D-only, minterm $00
clr.w $042(a5)
move.l a1,$054(a5) ; BLTDPT = back buffer
clr.w $066(a5)
move.w #(256<<6)|20,$058(a5) ; Clear 320×256 → GO!
; --- Draw all BOBs ---
; CPU can process game logic while the clear blit runs!
bsr UpdateGameLogic ; Physics, AI, input — runs on CPU
bsr WaitBlit ; Wait for clear to finish
bsr DrawAllBOBs ; Chain of cookie-cut blits
bra MainLoop
Key optimization: UpdateGameLogic runs on the CPU while the screen clear runs on the Blitter. This is the core of the Amiga's parallelism: the 5,120-word clear above occupies the Blitter for roughly 3 ms (at 2 bus cycles per word), and the CPU is free to do other work for that whole time.
Use Case 6: GUI Window Drag (System-Friendly)
Workbench and applications use graphics.library for window dragging, icon rendering, and menu drawing. The OS handles Blitter synchronization:
#include <graphics/gfx.h>
#include <graphics/rastport.h>
#include <clib/graphics_protos.h> /* prototypes for ScrollRaster, BltBitMap, SetAPen, RectFill */
/* Scroll a window's contents up by 8 pixels (text scroll): */
ScrollRaster(rp, /* RastPort */
0, 8, /* dx=0, dy=8 (scroll up by 8 pixels) */
0, 0, /* top-left corner of scroll area */
319, 199); /* bottom-right */
/* The OS automatically uses an ascending/descending blit, sets modulos, */
/* and clears the exposed bottom strip. */
/* Copy a rectangular region between two bitmaps: */
BltBitMap(srcBM, 0, 0, /* source bitmap, x, y */
dstBM, 100, 50, /* dest bitmap, x, y */
64, 32, /* width, height */
0xC0, /* minterm: straight copy (for BltBitMap, B = source and C = dest) */
0xFF, /* all bitplanes */
NULL); /* no temp buffer needed */
/* Draw a filled rectangle (uses the Blitter internally): */
SetAPen(rp, 3); /* Set pen color to index 3 */
RectFill(rp, 10, 10, 100, 50); /* Filled rectangle */
Use Case 7: Tile Map Renderer
Games like The Settlers, Cannon Fodder, and most platformers render backgrounds from tile maps. Each tile is a 16×16 (or 32×32) block blitted to screen coordinates:
; Render a 20×16 tile map (320×256 screen, 16×16 tiles)
; TileMap: array of 320 bytes (20×16), each byte = tile index
; TileGfx: tile graphics, 16×16 pixels × 5 planes, interleaved
lea $DFF000,a5 ; custom chip base, used by the register writes below
lea TileMap,a0
lea Screen,a2
moveq #16-1,d7 ; 16 tile rows
.tilerow:
moveq #20-1,d6 ; 20 tiles per row
.tilecol:
moveq #0,d0
move.b (a0)+,d0 ; Get tile index
mulu #16*5*2,d0 ; Tile data offset (16 rows × 5 planes × 2 bytes)
lea TileGfx,a1
add.l d0,a1 ; a1 = tile graphics pointer
bsr WaitBlit
move.l #$09F00000,$040(a5) ; BLTCON0: A+D, minterm $F0 (straight copy)
clr.w $042(a5) ; BLTCON1
move.w #$FFFF,$044(a5) ; BLTAFWM
move.w #$FFFF,$046(a5) ; BLTALWM
move.l a1,$050(a5) ; BLTAPT = tile data (interleaved)
move.l a2,$054(a5) ; BLTDPT = screen position
clr.w $064(a5) ; BLTAMOD = 0 (tile data is contiguous)
move.w #38,$066(a5) ; BLTDMOD = 40 - 2 = 38 (interleaved screen)
move.w #(80<<6)|1,$058(a5) ; BLTSIZE: 80 rows (16×5) × 1 word → GO!
addq.l #2,a2 ; Next tile position (1 word right)
dbf d6,.tilecol
; Move to next tile row: advance screen pointer by 16 scanlines × 5 planes × 40 bytes
add.l #16*5*40-40,a2 ; Subtract the 40 bytes already advanced by 20 tiles
dbf d7,.tilerow
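The address arithmetic in the loop above is easy to get wrong, so it can help to check it in C first. The sketch below assumes the same interleaved layout for both tile graphics and screen; the function names are hypothetical.

```c
#include <stdint.h>

/* Byte offset of a tile's graphics within an interleaved tile sheet
   where tiles are stored one after another. */
static uint32_t tile_src_offset(unsigned index, unsigned tile_height,
                                unsigned planes, unsigned tile_width_words)
{
    return (uint32_t)index * tile_height * planes * tile_width_words * 2u;
}

/* Byte offset of a tile position on an interleaved screen. Each tile row
   covers tile_height scanlines, and each scanline holds `planes` rows of
   bytes_per_plane_row bytes. */
static uint32_t tile_dst_offset(unsigned col, unsigned row, unsigned tile_height,
                                unsigned planes, unsigned tile_width_words,
                                unsigned bytes_per_plane_row)
{
    return (uint32_t)row * tile_height * planes * bytes_per_plane_row
         + col * tile_width_words * 2u;
}
```

For 16×16 tiles on a 5-plane, 40-byte-wide screen, tile 3 starts 480 bytes into the tile sheet, and each new tile row starts 3,200 bytes after the previous one, matching the 16*5*40 step in the assembly.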
Good and Bad Patterns
✓ Pattern: "Blit and Compute" — Overlap CPU and Blitter Work
; Start a blit, then do CPU work while it runs:
bsr SetupAndStartBlit ; Triggers BLTSIZE write
bsr UpdatePlayerPhysics ; CPU work — runs in parallel!
bsr ProcessInput ; More CPU work
bsr WaitBlit ; NOW wait for blit to finish
bsr SetupNextBlit ; Safe to touch registers
This is the entire point of having a Blitter. Any code that busy-waits immediately after starting a blit wastes the Amiga's key advantage.
✗ Antipattern: "The Busy-Wait Hog"
; ✗ BAD: Wait immediately after every blit — wastes CPU cycles
bsr StartBlit
.wait1: btst #14,$002(a5)
bne.s .wait1 ; CPU does NOTHING while blitter runs
bsr StartNextBlit
.wait2: btst #14,$002(a5)
bne.s .wait2 ; More wasted time
✓ Pattern: "Batch Then Wait" — Chain Setup, Single Sync Point
; Process all game logic FIRST:
bsr RunAI
bsr RunPhysics
bsr AnimateFrames
; THEN start the rendering blits in sequence:
bsr WaitBlit
bsr BlitBOB1
bsr WaitBlit
bsr BlitBOB2
bsr WaitBlit
bsr BlitBOB3
; The CPU-intensive work happened during the previous frame's display time
✗ Antipattern: "The Single-Plane-At-A-Time"
; ✗ BAD: Blit each bitplane separately (5× setup overhead)
lea Plane0,a0
bsr BlitBOBOnePlane
lea Plane1,a0
bsr BlitBOBOnePlane
lea Plane2,a0
bsr BlitBOBOnePlane
lea Plane3,a0
bsr BlitBOBOnePlane
lea Plane4,a0
bsr BlitBOBOnePlane ; 5 blits, 5 WaitBlit calls, 5× register setup
; ✓ GOOD: Use interleaved bitplanes — ONE blit for all planes
bsr BlitBOBInterleaved ; 1 blit, 1 WaitBlit, 1× register setup
✗ Antipattern: "System-Unfriendly Direct Access"
/* ✗ BAD: Hit blitter registers directly from a Workbench app */
custom.bltcon0 = 0x09F0; /* bltcon0 is a 16-bit register */
/* The OS may be using the blitter RIGHT NOW for window operations */
/* ✓ GOOD: Use OwnBlitter/DisownBlitter for exclusive access */
OwnBlitter(); /* Wait for and lock the blitter */
WaitBlit(); /* Ensure previous blit is done */
/* ... safe to program registers directly ... */
DisownBlitter(); /* Release for OS use */
✗ Antipattern: "Hardcoded 320-Pixel Modulo"
; ✗ BAD: Assumes screen width is always 320 pixels (modulo = 40 - blit_width*2)
move.w #36,$066(a5) ; BLTDMOD = 36 (hardcoded for 320px)
Many Amiga programs run on PAL overscan (352 or 384 pixels) or hires screens (640+ pixels). Always calculate the modulo from the actual row width in bytes (for an OS bitmap, the BytesPerRow field of struct BitMap):
; ✓ GOOD: Calculate modulo from actual bitmap width
move.w ScreenBytesPerRow,d0
sub.w BlitWidthBytes,d0
move.w d0,$066(a5) ; BLTDMOD = dynamic
✗ Antipattern: "Ignoring the DMA Budget"
The Blitter shares the DMA bus with display, audio, and disk. In high-bandwidth display modes, there are fewer free DMA slots:
| Display Mode | DMA Slots Used by Display | Remaining for Blitter | Effect |
|---|---|---|---|
| Lores 320×256 × 5 planes | ~100 per line | ~126 per line | Full blitter speed |
| Hires 640×256 × 4 planes | ~160 per line | ~66 per line | Blitter runs at ~50% speed |
| Super Hires 1280 × 4 planes | ~200+ per line | ~26 per line | Blitter barely runs |
| HAM8 (AGA) | ~200 per line | ~26 per line | Blitter barely runs |
Rule of thumb: if your game stutters in hires modes, it's probably DMA contention, not CPU speed.
Practical Limitations
| Limitation | Detail | Workaround |
|---|---|---|
| Max blit size (OCS, BLTSIZE) | 1024 lines × 64 words (1024×1024 pixels) | Split into multiple blits |
| Max blit size (ECS/AGA, BLTSIZV/BLTSIZH) | 32768 lines × 2048 words | Rarely a practical issue |
| Word alignment | Blitter operates on 16-bit word boundaries only | Use barrel shift + masks for sub-word positioning; costs 1 extra word of width |
| No scaling | Cannot scale or rotate — purely rectangular block ops | Use CPU for affine transforms, then blit the result |
| No clipping | Blitter will happily write outside the screen bitmap | Implement clipping in software before setting up the blit |
| Single operation at a time | Only one blit can run at a time — no queue | Pipeline setup: compute next blit's parameters on CPU while current blit runs |
| Chip RAM only | All 4 channels must point to Chip RAM | Use MEMF_CHIP for all blitter-visible allocations; see memory_types.md |
| Fill carry direction | Fill mode only works right-to-left (descending) | Always use DESC=1 with fill; set pointers to the end of the data |
| No transparency levels | Boolean operations only — 1-bit masking, no alpha | Dithering or multiple passes for graduated transparency |
| Line mode limitations | Lines drawn with SING=1 for fill prep have a single dot per row, leaving visible gaps on shallow (near-horizontal) angles | Use non-SING mode for visible lines, SING only for fill boundaries |
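As an illustration of the "No clipping" row, here is a minimal software clip in C that would run before any registers are written. The `Rect` type and `clip_rect` function are hypothetical; a real routine would also advance the source pointer and mask by the amount clipped off the top and left.

```c
#include <stdbool.h>

typedef struct { int x, y, w, h; } Rect;

/* Clip r against a screen of screen_w x screen_h pixels.
   Returns false if nothing remains visible, so the blit can be skipped. */
static bool clip_rect(Rect *r, int screen_w, int screen_h)
{
    int x0 = r->x < 0 ? 0 : r->x;
    int y0 = r->y < 0 ? 0 : r->y;
    int x1 = r->x + r->w;
    int y1 = r->y + r->h;
    if (x1 > screen_w) x1 = screen_w;
    if (y1 > screen_h) y1 = screen_h;
    if (x0 >= x1 || y0 >= y1)
        return false;
    r->x = x0;
    r->y = y0;
    r->w = x1 - x0;
    r->h = y1 - y0;
    return true;
}
```

A 16×16 object at (-4, 250) on a 320×256 screen is reduced to 12×6 pixels at (0, 250), and one positioned at x = 320 is rejected outright.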
Performance Analysis
DMA Cycle Costs
The Blitter's DMA cost depends on which channels are active. The figures below follow the cycle sequences in the Hardware Reference Manual; note that a D-only blit still takes 2 cycles per word, so it is no faster than an A→D copy:
| Channels Active | Cycles/Word | Example Operation | Time for 320×256 (1 plane) |
|---|---|---|---|
| D only | 2 cycles | Screen clear | ~2.9 ms |
| A + D | 2 cycles | Simple copy (A→D) | ~2.9 ms |
| A + B + D | 3 cycles | Masked copy | ~4.3 ms |
| A + B + C + D | 4 cycles | Cookie-cut blit | ~5.7 ms |
At 3.58 MHz DMA clock, 1 cycle ≈ 280 ns, and a 320×256 bitplane is 5,120 words. These are best-case times with the Blitter getting every cycle; display DMA and CPU contention add to them. A full 320×256×5-plane screen clear takes ~14 ms (D-only × 5 planes).
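A best-case blit time is simply words × cycles per word × cycle time. The C sketch below makes that explicit; `blit_time_us` is a hypothetical helper, the 280 ns cycle time is the rounded figure quoted above, and the cycles-per-word value is supplied by the caller. Contention from display DMA or the CPU is not modelled.

```c
/* Best-case blit duration in microseconds, ignoring all bus contention. */
static double blit_time_us(unsigned width_words, unsigned height,
                           unsigned cycles_per_word)
{
    const double cycle_ns = 280.0;  /* one DMA cycle at about 3.58 MHz */
    return (double)width_words * height * cycles_per_word * cycle_ns / 1000.0;
}
```

For a 20-word by 256-line blit at 4 cycles per word this gives about 5,734 µs. A 16×16 cookie-cut BOB (2 words wide with the shift word, 16 lines) costs about 36 µs per plane.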
CPU vs. Blitter Crossover
The Blitter is not always faster than the 68000:
| Operation Size | Winner | Why |
|---|---|---|
| < ~40 words | CPU (68000) | Blitter setup overhead (around ten register writes plus the busy-wait) exceeds the DMA savings |
| 40–200 words | Tie | Depends on whether CPU needs the bus |
| > 200 words | Blitter | DMA runs independently; CPU can compute in parallel |
| Any size (A1200) | Measure | 68020 can access 32-bit Fast RAM while Blitter uses Chip RAM bus — often faster to do both |
Nasty Mode (BLTPRI)
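Setting bit 10 of DMACON (BLTPRI) gives the Blitter priority over the CPU for every free DMA slot, so the CPU is locked out of Chip RAM until the blit completes. This maximizes blitter throughput but:
- Delays interrupt servicing for the duration of the blit whenever the handler or its data live in Chip RAM (always the case on a Chip-RAM-only machine)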
- Breaks timing-sensitive code (audio, serial)
- Most professional software avoids it; demos use it freely
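DMACON is a set/clear register: bit 15 of the written word selects whether the other set bits are turned on or off. A small C sketch of the two values involved follows; the constant and function names are hypothetical, and the actual write goes to DMACON at $DFF096.

```c
#include <stdint.h>

enum {
    DMACON_SETCLR = 0x8000,  /* bit 15: 1 = set the selected bits, 0 = clear them */
    DMACON_BLTPRI = 0x0400   /* bit 10: blitter priority ("nasty" mode) */
};

/* Word to write to DMACON to switch nasty mode on or off. */
static uint16_t dmacon_nasty(int enable)
{
    return (uint16_t)(enable ? (DMACON_SETCLR | DMACON_BLTPRI) : DMACON_BLTPRI);
}
```

Writing $8400 enables nasty mode and writing $0400 disables it. Code that enables it around a single critical blit would typically disable it again straight after the blit finishes.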
When to Use / When NOT to Use
When to Use the Blitter
- Screen clearing: a D-only blit needs no source fetches and leaves the CPU free to run game logic in parallel
- BOB/sprite compositing — cookie-cut blit is the standard technique for all Amiga game objects
- Scrolling — overlapping copy with correct ascending/descending mode
- Polygon filling — exclusive/inclusive fill after boundary line drawing
- Large memory copies — any block > ~40 words benefits from DMA parallelism
- Line drawing — hardware Bresenham is faster than any software implementation on 68000
When NOT to Use
- Small copies (< 40 words): a 68000 `MOVEM` or `MOVE.L` loop is faster due to blitter setup overhead
- Fast RAM operations: the Blitter cannot access Fast RAM at all; use the CPU
- Pixel-level operations — the Blitter works on word-aligned rectangles; per-pixel logic requires CPU
- A1200/A4000 with Fast RAM — the 68020/030 running from 32-bit Fast RAM can often outperform the Blitter on Chip RAM, especially if you can overlap CPU work with display DMA
Applicability Ranges
- BOBs: Practical limit ~15–20 per frame at 320×256×5 planes before exhausting DMA bandwidth
- Fill mode: Works on single bitplanes only — filling a 5-plane display requires 5 passes
- Line mode: Maximum line length limited by the blit height field (1024 via BLTSIZE on all chipsets; 32768 via BLTSIZV on ECS and AGA)
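The `(height<<6)|width` expressions used throughout this document pack two fields into BLTSIZE: height in bits 15-6 and width in words in bits 5-0. A minimal sketch of that packing, including the wrap-to-zero encoding of the maximum values, is shown here; `bltsize_encode` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: pack rows and width (in 16-bit words) into a
   BLTSIZE value. The maxima (1024 rows, 64 words) do not fit in their
   10-bit and 6-bit fields and are encoded as 0. */
static uint16_t bltsize_encode(unsigned rows, unsigned words)
{
    assert(rows >= 1 && rows <= 1024);
    assert(words >= 1 && words <= 64);
    return (uint16_t)(((rows & 0x3FFu) << 6) | (words & 0x3Fu));
}
```

For example, a 320×256 plane is `bltsize_encode(256, 20)`, which yields `$4014`, the same value as `(256<<6)|20`.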
Historical Context — The 1985 Competitive Landscape
The Blitter was architecturally unprecedented in 1985. No competing home computer shipped with a comparable 2D rasterization coprocessor:
| Feature | Amiga (1985) | Atari ST (1985) | PC EGA (1984) | Mac 128K (1984) | C64 (1982) |
|---|---|---|---|---|---|
| Hardware blitter | Yes — 4-channel DMA with minterm logic | No (added in Mega ST/STE, 1987 — 1 source only) | No | No | No |
| Channels | 3 source + 1 dest | 1 source + 1 dest (STE) | — | — | — |
| Boolean ops | 256 minterms (arbitrary 3-input logic) | 16 logic ops (STE) | — | — | — |
| Line drawing | Hardware Bresenham | No | No | No | No |
| Area fill | Hardware inclusive/exclusive fill | No | No | No | No |
| Shift/mask | Per-channel barrel shift + first/last word masks | Shift + endmask (STE) | — | — | — |
| CPU relief | Full DMA — CPU free during blit | Partial — CPU still involved (STE) | CPU does everything | CPU does everything | CPU does everything |
Pros (in 1985 context)
- Parallelism: The 68000 could execute game logic while the Blitter handled all rendering — this was the Amiga's key advantage over every competitor
- Generality: 256 minterm combinations meant any Boolean compositing operation was a single register write, not a software loop
- Integration: Shared DMA bus with Copper and sprites meant the entire display pipeline was hardware-driven
- Line + fill in hardware: Enabled real-time filled polygon rendering (used in games like Carrier Command, Starglider 2) that competing platforms could only attempt with CPU-driven fills
Cons (in 1985 context)
- Chip RAM only: All blitter-visible data had to live in the first 512 KB (later 1–2 MB), competing with screen memory, audio, and disk buffers
- Word alignment: Positioning at arbitrary pixel offsets within a 16-pixel word required shift + extra word width + masking, a complex setup for simple operations
- No scaling/rotation: Purely rectangular block operations; affine transforms required CPU
- DMA contention: Heavy blitter use starved the CPU of bus cycles even without nasty mode
Modern Analogies
| Amiga Blitter Concept | Modern Equivalent | Notes |
|---|---|---|
| 4-channel minterm blit | GPU blend equations (Vulkan VkBlendOp) | The minterm is a fixed-function Boolean blend; modern GPUs use programmable shaders but the concept of combining sources through a logic function is identical |
| Cookie-cut (A·B + ¬A·C) | Alpha compositing / Porter-Duff SrcOver | The Amiga used 1-bit masks; modern systems use 8-bit alpha channels, but the compositing algebra is the same |
| DMA-driven blit | vkCmdCopyImage / MTLBlitCommandEncoder | Modern GPUs have dedicated DMA/copy engines that run asynchronously, exactly like the Blitter ran independently of the 68000 |
| OwnBlitter/DisownBlitter | Vulkan queue submission / Metal command buffer | Exclusive access to a shared hardware resource, then release — the synchronization pattern is identical |
| BLTPRI (nasty mode) | GPU preemption priority | Giving the transfer engine absolute bus priority at the cost of starving other consumers |
| Fill mode | GPU rasterizer fill | Hardware polygon fill is now done by the rasterizer stage; the Blitter's XOR-toggle fill was a clever 1985 approximation |
| BLTSIZE triggers blit | Command buffer submission | Writing the final register starts execution — analogous to vkQueueSubmit or [commandBuffer commit] |
| Barrel shift + word masks | Texture sampling with sub-texel offset | Achieving pixel-accurate positioning inside a 16-pixel word through hardware shift and masking |
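To make the "fixed-function Boolean blend" concrete, the minterm can be modelled in software as an 8-entry lookup indexed by the A, B, and C bits. This is a minimal sketch of that model, not a description of the hardware's internal implementation; `apply_minterm` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Apply an 8-bit minterm to three 16-bit words, bit by bit.
   Truth-table index = (A<<2)|(B<<1)|C; the minterm bit at that
   index becomes the output bit for D. */
static uint16_t apply_minterm(uint8_t minterm, uint16_t a, uint16_t b,
                              uint16_t c)
{
    uint16_t d = 0;
    for (int bit = 0; bit < 16; bit++) {
        unsigned idx = (((a >> bit) & 1u) << 2) |
                       (((b >> bit) & 1u) << 1) |
                       ((c >> bit) & 1u);
        if ((minterm >> idx) & 1u)
            d |= (uint16_t)(1u << bit);
    }
    return d;
}
```

With minterm `$CA`, every bit where the mask A is 1 takes its value from B and every other bit keeps the background C, which is the 1-bit version of source-over compositing.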
Pitfalls & Common Mistakes
Pitfall 1: "The Silent Corruption" — Fast RAM Pointers
; ✗ BAD: Buffer allocated in Fast RAM
move.l #FastRAMBuffer,$054(a5) ; BLTDPT points to Fast RAM
move.w #(256<<6)|20,$058(a5) ; Blit runs... but writes garbage
The Blitter's DMA engine is wired to the Chip RAM bus only. Fast RAM addresses silently alias to Chip RAM addresses or produce random data. There is no error signal — the blit completes "successfully" with corrupt output.
; ✓ GOOD: Buffer in Chip RAM
move.l #ChipRAMBuffer,$054(a5) ; Allocated with MEMF_CHIP
Pitfall 2: "The Race Condition" — Missing WaitBlit
; ✗ BAD: Start a new blit without waiting for previous one
move.l #$09F00000,$040(a5) ; Overwrite BLTCON0 while previous blit runs!
move.l #NewSource,$050(a5) ; Corrupt the in-progress blit
move.w #(64<<6)|4,$058(a5) ; Start another blit — undefined behavior
Modifying blitter registers while a blit is in progress produces unpredictable results — partial data, corrupted pointers, or system crashes.
; ✓ GOOD: Always wait
.bwait:
btst #14,$002(a5) ; Test BBUSY in DMACONR
bne.s .bwait
; Now safe to set up the next blit
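The same polling loop can be sketched in C. Passing the register address as a parameter is an assumption made so the sketch compiles and can be exercised off-target; on real hardware the pointer would refer to DMACONR at `$DFF002`, and system-friendly code would call `WaitBlit()` instead. The names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DMACONR_BBUSY (1u << 14) /* blitter-busy flag in DMACONR */

/* Non-blocking check on a value already read from DMACONR. */
static int blitter_busy(uint16_t dmaconr_value)
{
    return (dmaconr_value & DMACONR_BBUSY) != 0;
}

/* Spin until BBUSY clears. The volatile qualifier forces a fresh
   read of the register on every loop iteration. */
static void wait_blit_poll(volatile const uint16_t *dmaconr)
{
    assert(dmaconr != 0);
    while (blitter_busy(*dmaconr)) {
        /* busy-wait; ideally the caller does useful CPU work instead */
    }
}
```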
Pitfall 3: "The Wrong Direction" — Overlapping Copy Corruption
; ✗ BAD: Scrolling right (dest > source) with ascending mode
; Source at offset 0, dest at offset 2: ascending overwrites source data
; before it's read, producing smeared garbage
; ✓ GOOD: Use descending mode when dest > source and the blocks overlap
move.w #$0002,$042(a5) ; BLTCON1: DESC=1
; Set pointers to LAST word of block, not first
; (dest < source is the safe case for ascending mode)
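The direction rule follows from which words are overwritten before they are read. The sketch below assumes a contiguous block (modulo 0) addressed by word index; it picks descending mode only when the destination starts above the source inside the overlapping range, and models the copy in software to show the result. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLTCON1_DESC 0x0002u /* descending-mode bit in BLTCON1 */

struct blit_plan {
    uint16_t bltcon1;   /* 0 = ascending, BLTCON1_DESC = descending */
    size_t   src_start; /* word index to load into the source pointer */
    size_t   dst_start; /* word index to load into the dest pointer */
};

static struct blit_plan plan_overlapping_copy(size_t src, size_t dst,
                                              size_t words)
{
    struct blit_plan p;
    assert(words >= 1);
    if (dst > src && dst < src + words) {
        /* Dest overlaps the tail of the source: walk backwards,
           starting at the LAST word of each block. */
        p.bltcon1 = BLTCON1_DESC;
        p.src_start = src + words - 1;
        p.dst_start = dst + words - 1;
    } else {
        p.bltcon1 = 0;
        p.src_start = src;
        p.dst_start = dst;
    }
    return p;
}

/* Software model of the copy, one word at a time. */
static void run_plan(uint16_t *mem, struct blit_plan p, size_t words)
{
    size_t s = p.src_start, d = p.dst_start;
    for (size_t i = 0; i < words; i++) {
        mem[d] = mem[s];
        if (p.bltcon1 & BLTCON1_DESC) { s--; d--; } else { s++; d++; }
    }
}
```

Copying 4 words from index 0 to index 2 yields a descending plan that starts at indices 3 and 5; the reverse copy (index 2 to index 0) stays ascending.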
Pitfall 4: "The Off-By-One Word" — Forgetting Shift Width Expansion
; ✗ BAD: 32-pixel wide blit at non-aligned X — width still set to 2 words
; Shifted data spills into adjacent word, corrupting neighboring pixels
move.w #(16<<6)|2,$058(a5) ; Only 2 words wide — but shift needs 3!
; ✓ GOOD: Add 1 word when shift > 0
move.w #(16<<6)|3,$058(a5) ; 3 words: 2 for data + 1 for shift overflow
move.w #$FFF0,$046(a5) ; BLTALWM masks off the rightmost 4 pixels
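The width rule can be written down as a short calculation. This sketch assumes the source image is a whole number of words wide (as BOB data normally is) and follows the rule stated above: one extra word whenever the shift is non-zero. `blit_geometry` and `struct blit_geom` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

struct blit_geom {
    unsigned shift;       /* 0-15: pixel offset inside the first word */
    unsigned width_words; /* BLTSIZE width, including the spill word */
};

/* Hypothetical helper: derive shift and blit width from the target
   X position and the source width in pixels. */
static struct blit_geom blit_geometry(unsigned x, unsigned width_px)
{
    struct blit_geom g;
    assert(width_px >= 1);
    g.shift = x & 15u; /* position within the 16-pixel word */
    g.width_words = (width_px + 15u) / 16u + (g.shift ? 1u : 0u);
    return g;
}
```

A 32-pixel object at X=32 needs 2 words and no shift; the same object at X=37 needs shift 5 and 3 words.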
Pitfall 5: "The Stale Pointer" — Reusing Registers After a Blit
After a blit completes, all pointer registers have advanced to the end of the data. A second blit with the same pointers starts where the first one left off — not at the original position.
; ✓ GOOD: Always reload all pointers before each blit
move.l #SourceAddr,$050(a5) ; Reload BLTAPT
move.l #DestAddr,$054(a5) ; Reload BLTDPT
Impact on FPGA/Emulation
The Blitter is one of the most complex subsystems to reproduce accurately in an FPGA core:
- DMA slot timing: The Blitter shares DMA slots with bitplane, sprite, Copper, disk, and audio DMA. Incorrect slot allocation produces visible glitches in demos that count cycles
- Barrel shifter pipeline: The A and B channel shifts operate on a word pipeline — off-by-one in the shift register produces 1-pixel horizontal offset errors visible in scrolling
- Fill mode carry propagation: The fill carry bit (`FCI`) must propagate correctly from right to left within each word and across word boundaries; errors produce "zebra stripe" artifacts
- Line mode octant handling: The Bresenham algorithm implementation requires precise handling of 8 octants with correct sign and direction; many emulators get diagonal lines wrong in edge cases
- BLTSIZE write-trigger: The blit must start on the exact cycle that BLTSIZE is written, not one cycle later; demos that chain blits back-to-back depend on this timing
- Nasty mode interaction: `BLTPRI` must correctly freeze the CPU and still allow DMA from other sources (Copper, bitplanes); freezing everything breaks display output
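The fill carry behaviour is a common source of core bugs, so a software reference model is useful for comparing against FPGA or emulator output. The sketch below assumes the commonly documented behaviour: bits are scanned right to left, the carry toggles at every set input bit, inclusive fill outputs input OR carry, and exclusive fill outputs the carry after the toggle. `fill_row` is a hypothetical name, and this is a model for testing, not a cycle-accurate description.

```c
#include <assert.h>
#include <stdint.h>

/* Model one row of area fill in place.
   exclusive == 0: inclusive fill (both boundary pixels kept)
   exclusive != 0: exclusive fill (leftmost pixel of each span dropped)
   fci: initial carry for the row (the FCI bit). */
static void fill_row(uint16_t *row, unsigned words, int exclusive,
                     unsigned fci)
{
    unsigned carry = fci & 1u;
    for (unsigned w = words; w-- > 0; ) {     /* rightmost word first */
        uint16_t in = row[w], out = 0;
        for (unsigned bit = 0; bit < 16; bit++) { /* LSB = rightmost pixel */
            unsigned b = (in >> bit) & 1u;
            unsigned next = carry ^ b;        /* toggle on each set bit */
            unsigned o = exclusive ? next : (carry | b);
            out |= (uint16_t)(o << bit);
            carry = next;                     /* carried into the next word */
        }
        row[w] = out;
    }
}
```

The second test case in a harness should place the two boundary pixels in different words, since that is where a dropped carry shows up as a stripe running to the edge of the row.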
Real-World Software Usage
| Software | Blitter Usage | Notes |
|---|---|---|
| Deluxe Paint | Brush compositing, flood fill, line tools | Canonical use of BltBitMap + BltMaskBitMapRastPort through the OS |
| Shadow of the Beast | Multi-layer parallax scrolling | Custom blitter routines for layer compositing, bypasses OS |
| Carrier Command | Filled polygon 3D rendering | Blitter line draw + fill mode for real-time vector graphics |
| Lemmings | Terrain destruction, character compositing | Cookie-cut blits for each lemming; XOR blits for terrain modification |
| Workbench | Window dragging, icon rendering, menu drawing | All through graphics.library — system-friendly blitter usage |
| Demo scene | Virtually everything | Chunky-to-planar conversion, texture mapping, copper+blitter co-programming |
Best Practices
- Always call `WaitBlit()` or poll BBUSY before touching any blitter register
- Write BLTSIZE last: it triggers the blit, so all other registers must be configured first
- Use `OwnBlitter()`/`DisownBlitter()` for system-friendly code; never assume you have exclusive access
- Disable unused channels: fewer channels = fewer DMA cycles = faster blit
- Set BLTAFWM and BLTALWM to `$FFFF` for word-aligned blits; forgetting this produces partial-word masking bugs
- Account for shift width expansion: non-aligned blits are 1 word wider than you expect
- Choose ascending/descending correctly for overlapping copies — test both scroll directions
- Interleave CPU work with blitter operations — the whole point of DMA is parallelism; don't busy-wait when you could be computing
- Profile before choosing Blitter vs CPU — on accelerated Amigas, the 68020+ with Fast RAM often wins
References
- HRM: Amiga Hardware Reference Manual — Blitter chapter (complete register descriptions and timing)
- NDK 3.9: `hardware/blit.h`, `hardware/custom.h`, `graphics/gfx.h`
- ADCD 2.1: Hardware Manual (Blitter chapter)
- See also: blitter.md — hardware register reference
- See also: animation.md — GEL system (BOBs use the Blitter internally)
- See also: copper.md — Copper coprocessor (often co-programmed with the Blitter)
- See also: rastport.md — RastPort drawing context (uses Blitter for all draw operations)
- See also: display_modes.md — DMA slot budget (Blitter competes for bus bandwidth)
- See also: Akiko — CD32 C2P — hardware Chunky-to-Planar conversion (CD32 alternative to CPU/Blitter C2P)