rslr/amiga-bootcamp


mirror of https://github.com/alfishe/amiga-bootcamp.git synced 2026-06-12 16:16:28 +00:00

Ilia Sharin aa72007e48 Added ATA/ATAPI article - dramatic story as always

2026-06-02 23:48:29 -04:00


Bus Architecture & Register Access

Overview

Every instruction the CPU executes, every pixel the display fetches, and every audio sample Paula plays involves a bus transaction — a precisely timed handshake between a bus master and a target device over shared address and data lines. Understanding how these transactions physically propagate through the Amiga's bus hierarchy is the key to understanding why Chip RAM is slow, why Fast RAM is fast, how custom chip registers must be accessed, and what happens when an accelerated CPU tries to talk to hardware designed for a 7 MHz 68000.

This article covers the mechanics of bus operation. For where things are mapped, see address_space.md. For what kinds of memory exist, see memory_types.md. For DMA slot scheduling and bandwidth, see dma_architecture.md. For the clock tree that drives all bus timing, see video_timing.md.

§1 — The Bus Hierarchy

The Dual-Domain Architecture

The Amiga is not a single-bus machine. Even a stock A500 has a hierarchy of interconnected buses, and adding an accelerator card creates a dual-domain system with two fundamentally different clock domains:

┌─────────────────────────────────────────────────────────────────────┐
│                    ACCELERATOR LOCAL BUS                            │
│            (32-bit, CPU speed: 25–50+ MHz)                          │
│  ┌──────────┐    ┌──────────────┐                                   │
│  │ 68030/40 │◄──►│  Fast RAM    │  ← Zero wait states, no DMA       │
│  │  /060    │    │  (private)   │     contention, full CPU speed    │
│  └────┬─────┘    └──────────────┘                                   │
│       │                                                             │
│  ┌────▼─────────────────────┐                                       │
│  │   BUS ADAPTER / BRIDGE   │  ← Clock domain crossing              │
│  │  (CPLD/FPGA/gate array)  │     Synchronizes to 7 MHz             │
│  └────┬─────────────────────┘                                       │
└───────┼─────────────────────────────────────────────────────────────┘
        │
┌───────▼─────────────────────────────────────────────────────────────┐
│                  MOTHERBOARD BUS (16-bit, 7.09 MHz)                 │
│                                                                     │
│  ┌──────────┐                                                       │
│  │  Gary /  │  ← Address decoder: reads A17–A23,                    │
│  │ Fat Gary │    generates chip-select signals                      │
│  └────┬─────┘                                                       │
│       │                                                             │
│  ┌────▼──────┐  ┌──────────┐  ┌───────────┐  ┌──────────┐           │
│  │   Agnus   │  │  CIAs    │  │ Kickstart │  │  Zorro   │           │
│  │  (DMA     │  │ ($BFx000)│  │   ROM     │  │  Slots   │           │
│  │  master)  │  │          │  │($F80000)  │  │($200000+)│           │
│  └────┬──────┘  └──────────┘  └───────────┘  └──────────┘           │
│       │                                                             │
│  ┌────▼───────────────────────────────────────────────────┐         │
│  │              CHIP RAM BUS                              │         │
│  │     (16-bit, DMA-interleaved, video-synchronous)       │         │
│  │                                                        │         │
│  │  Shared between: Agnus DMA, Denise, Paula, CPU         │         │
│  │  Access rule: time-slot interleaving (see dma_arch.)   │         │
│  └────────────────────────────────────────────────────────┘         │
└─────────────────────────────────────────────────────────────────────┘

Key Principles

Everything below Gary runs at the video-derived clock — 7.09 MHz (PAL) or 7.16 MHz (NTSC). This frequency is not arbitrary; it is the master crystal ÷ 4, the same clock that drives CPU cycles, DMA slots, and the Copper. See video_timing.md §2 for the full derivation.
Agnus is the bus master — on the Chip RAM bus, Agnus (or Alice on AGA) decides who gets access each cycle. The CPU gets whatever slots the custom chips don't need. On a heavy display (6 bitplanes, all sprites active, Blitter running), the CPU may get zero Chip RAM slots during the active display area.
Gary is the address decoder — it doesn't touch the data bus. It reads the upper address lines and asserts chip-select signals: /CAS for Chip RAM, /ROMEN for Kickstart, /CIAA//CIAB for the CIAs, etc. See gary_system_controller.md for the full decode matrix.
The bus adapter is a clock-domain bridge — when an accelerated CPU accesses anything on the motherboard (Chip RAM, custom registers, CIAs, ROM), the bus adapter must synchronize the request to the 7 MHz motherboard clock and wait for an available slot. This is the "Chip RAM penalty" — typically 5–15× slower than local Fast RAM access.
Fast RAM is private — it lives on the accelerator's local bus, runs at the CPU's native clock, and has no contention with DMA. This is why adding Fast RAM to an Amiga produces a dramatic speedup even without changing the CPU.

§2 — Anatomy of a Bus Cycle

The 68000 Bus Cycle

Every CPU access to memory or I/O follows a strict protocol defined by the 68000 bus specification. The minimum bus cycle is 4 clock periods (S0–S7, where each S-state is a half-clock):

Clock:  ─┐  ┌──┐  ┌──┐  ┌──┐  ┌──
         │  │  │  │  │  │  │  │
         └──┘  └──┘  └──┘  └──┘

State:  S0  S1  S2  S3  S4  S5  S6  S7
        ├───┤   ├───┤   ├───┤   ├───┤
         ↑       ↑       ↑       ↑
         │       │       │       └── Data latched (read) or removed (write)
         │       │       └────────── DTACK sampled — if not asserted, insert
         │       │                   wait states (repeat S4–S5)
         │       └────────────────── Address stable, AS asserted, R/W set
         └────────────────────────── Address output begins

At 7.09 MHz (PAL), one clock period = 141 ns, so the minimum 4-clock bus cycle = ~564 ns. This is the fastest possible memory access on a stock Amiga.

Read vs. Write

Phase	Read Cycle	Write Cycle
S0–S1	CPU drives address, sets R/W = Read	CPU drives address, sets R/W = Write
S2	`/AS` asserted, address valid	`/AS` asserted, address valid
S3	—	CPU drives data onto bus
S4	Target samples address, asserts `/DTACK`	Target samples address + data, asserts `/DTACK`
S5	— (wait for DTACK if not yet asserted)	—
S6	CPU samples data from bus	—
S7	`/AS` released, bus cycle ends	`/AS` released, bus cycle ends

Note

On the 68000, the CLR.W instruction reads its destination before writing zero to it. On a write-only register or a strobe register (like COPJMP1), that read has side effects. The 68010 and later CPUs perform CLR as a pure write, creating a behavioral difference that matters for hardware register access. See §4.

Wait States

When the target is not ready, it delays asserting /DTACK. The CPU inserts wait states — extra S4–S5 pairs — until DTACK arrives. Each wait state adds one full clock (141 ns):

Target	Typical Wait States	Total Cycle Time (PAL)
Chip RAM (CPU's turn)	0	~564 ns (4 clocks)
Chip RAM (DMA busy)	2–8+	~846 ns – 1.7 µs
Kickstart ROM	0–2	~564–846 ns
CIA registers	6–10	~1.4–2.0 µs (E-clock sync)
Zorro II I/O	0–4 (card-dependent)	~564 ns – 1.1 µs
Non-existent address	Timeout → Bus Error	~10 µs (Gary bus timeout)
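
The cycle times in this table follow from simple arithmetic: four clocks for the base cycle plus one clock per wait state. A minimal C sketch of that calculation, assuming a PAL CPU clock of 7.09379 MHz (ten times the E-clock frequency quoted in §5); the function names are illustrative only:

```c
#include <stdio.h>

/* 68000 bus cycle: 4 clocks minimum, plus one clock per wait state. */
#define MIN_BUS_CLOCKS 4u

/* Returns the bus cycle time in nanoseconds for a CPU clock given in Hz. */
double bus_cycle_ns(double clock_hz, unsigned wait_states)
{
    double clock_period_ns = 1e9 / clock_hz;
    return (MIN_BUS_CLOCKS + wait_states) * clock_period_ns;
}

/* Prints the PAL cycle time for 0, 2, 4, 6 and 8 wait states. */
void print_pal_cycle_times(void)
{
    const double pal_hz = 7093790.0;   /* assumed PAL CPU clock, 7.09379 MHz */
    for (unsigned ws = 0; ws <= 8; ws += 2)
        printf("%u wait states: %.0f ns\n", ws, bus_cycle_ns(pal_hz, ws));
}
```

With these assumptions, two wait states come out at roughly 846 ns, which matches the lower bound of the "DMA busy" row.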

Word and Long-Word on a 16-bit Bus

The 68000 has a 16-bit data bus. Accessing different data sizes costs different numbers of bus cycles:

Access Size	Bus Cycles	Minimum Time (PAL)	Notes
Byte (`.B`)	1	~564 ns	Upper or lower byte via `/UDS`/`/LDS`
Word (`.W`)	1	~564 ns	Full 16-bit transfer
Long (`.L`)	2	~1.13 µs	Two consecutive word transfers

On 68020/030 with 32-bit bus (A3000, A4000 motherboard bus, or accelerator local bus), a long-word access completes in a single bus cycle. But when these CPUs access the 16-bit motherboard bus, dynamic bus sizing takes over: the bus adapter reports a 16-bit port through the /DSACKx lines, and the CPU automatically splits the 32-bit access into two 16-bit cycles.

§3 — Address Decoding — The Bus Glue Chips

Every bus cycle begins with the CPU placing an address on the bus. Something must decide which device should respond. On the Amiga, this job falls to a hierarchy of bus glue chips — custom gate arrays that decode the upper address bits and assert the appropriate chip-select signal.

The Decode Chain

CPU Address Lines (A17–A23 / A17–A31)
          │
          ▼
    ┌───────────┐
    │   Gary /  │──── /CAS (Chip RAM)
    │ Fat Gary  │──── /ROMEN (Kickstart ROM)
    │           │──── /CIAA, /CIAB (CIA pair)
    │           │──── /OVL (overlay control)
    │           │──── /DTACK (cycle acknowledge)
    │           │──── /EXTRN (expansion bus)
    └───────────┘
          │ (for A3000/A4000)
          ▼
    ┌───────────┐
    │  Buster   │──── Zorro III arbitration
    │           │──── DMA handshake (DMAEN, DSACK)
    └───────────┘
          │
          ▼
    ┌───────────┐
    │  Ramsey   │──── DRAM /RAS, /CAS
    │           │──── Refresh control
    │           │──── Fast RAM page-hit optimization
    └───────────┘

Per-Model Glue Chips

Chip	Models	Role	Key Functions
Gary (CSG 5719)	A500, A2000	Address decoder	A17–A23 decode, chip selects, bus timeout, ROM overlay
Fat Gary (CSG 5391/5393)	A3000, A4000	32-bit address decoder	Gary + SCSI/FPU chip select, Zorro III decode, 32-bit DTACK
Gayle (CSG 391424)	A600, A1200	System controller	IDE, PCMCIA, address decode (replaces Gary on these models)
Buster (CSG 5394/5396)	A3000, A4000	Zorro III controller	DMA arbitration, burst mode, slot negotiation
Ramsey (CSG 3901/3902)	A3000, A4000	DRAM controller	/RAS//CAS generation, refresh, page-mode, burst
Budgie	CDTV	System controller	Gary equivalent + DMAC glue for CD-ROM

Tip

FPGA developers: Gary's decode logic is purely combinational — no internal state. Fat Gary adds a small state machine for SCSI and FPU acknowledgment. Both are well-documented in the dedicated articles: Gary (OCS), Fat Gary (ECS).

The ROM Overlay Mechanism

At power-on, the CPU fetches its initial stack pointer and program counter from address $000000. But Chip RAM lives there — and Chip RAM is empty at boot. Gary solves this with the overlay bit:

At reset, Gary maps Kickstart ROM ($F80000) over Chip RAM at $000000
CPU reads exception vectors from ROM
Kickstart writes to CIA-A port bit 0, clearing the overlay
Gary unmaps ROM from $000000, making Chip RAM visible
Kickstart copies exception vectors into Chip RAM

This mechanism is identical across all Amiga models — only the chip implementing it changes (Gary → Fat Gary → Gayle).
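
The decode chain and the overlay behaviour can be modeled as a pure function of the address and the overlay bit. The sketch below is a simplification under stated assumptions: the address ranges are taken from the A500/A2000 map in §9, the overlay is assumed to cover the first 512 KB, and the real Gary only looks at A17–A23. The type and function names are hypothetical.

```c
#include <stdint.h>

typedef enum {
    TGT_CHIP_RAM, TGT_ZORRO2, TGT_CIA, TGT_SLOW_RAM,
    TGT_CUSTOM, TGT_ROM, TGT_OTHER
} target_t;

/* Simplified A500-style decode of a CPU address.
 * ovl is the overlay bit: nonzero from reset until Kickstart clears it. */
target_t decode_a500(uint32_t addr, int ovl)
{
    addr &= 0xFFFFFFu;                        /* the 68000 drives 24 address bits */

    if (ovl && addr < 0x080000u)              /* assumed 512 KB overlay window */
        return TGT_ROM;
    if (addr < 0x200000u)
        return TGT_CHIP_RAM;
    if (addr < 0xA00000u)
        return TGT_ZORRO2;
    if ((addr & 0xFF0000u) == 0xBF0000u)      /* both CIAs live in $BFxxxx */
        return TGT_CIA;
    if (addr >= 0xC00000u && addr < 0xC80000u)
        return TGT_SLOW_RAM;
    if (addr >= 0xDFF000u && addr < 0xDFF200u)
        return TGT_CUSTOM;
    if (addr >= 0xF80000u)
        return TGT_ROM;
    return TGT_OTHER;
}
```

Because the function keeps no state of its own, it mirrors the combinational nature of the decoder: the same address and overlay bit always select the same target.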

Bus Timeout

If no device asserts /DTACK within the timeout window, Gary generates a /BERR (Bus Error) exception. This protects the system from hanging when accessing non-existent hardware:

Gary (A500/A2000): ~10 µs timeout
Fat Gary (A3000): ~8 µs timeout, configurable
Gayle (A600/A1200): similar timeout with IDE-specific extensions

Bus Arbitration Algorithms

Multiple devices compete for the bus: the CPU, Agnus DMA, expansion cards (Zorro DMA masters), and accelerator bridges. The Amiga uses a layered priority scheme:

Layer 1 — Agnus Chip RAM Arbitration (fixed priority)

Agnus controls Chip RAM access using a deterministic slot-based schedule. Priority is hardwired — not negotiable:

Priority (highest to lowest):
  1. Disk DMA         — guaranteed slots, cannot be preempted
  2. Audio DMA        — 4 channels, guaranteed slots
  3. Sprite DMA       — 8 sprites, guaranteed slots per scanline
  4. Bitplane DMA     — 1–8 bitplanes, consumes proportional slots
  5. Copper DMA       — interleaved with display
  6. Blitter DMA      — fills remaining DMA slots (nasty mode overrides CPU)
  7. CPU              — gets whatever is left

When BLTPRI (Blitter nasty mode) is set in DMACON, the Blitter takes priority over the CPU for all remaining slots, effectively freezing the CPU during Blitter operations. Without nasty mode, CPU and Blitter alternate on available slots.

See dma_architecture.md for the complete slot allocation table and per-scanline bandwidth analysis.
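
The priority ordering above can be expressed as a simple priority encoder. This is only a model of the ordering: real Agnus assigns most DMA channels fixed slot positions within the scanline, and the Blitter/CPU relationship depends on BLTPRI. The names below are illustrative, not taken from any Amiga header.

```c
/* Requesters in the priority order listed above; lower value wins. */
typedef enum {
    REQ_DISK, REQ_AUDIO, REQ_SPRITE, REQ_BITPLANE,
    REQ_COPPER, REQ_BLITTER, REQ_CPU,
    REQ_COUNT,
    REQ_NONE = -1
} requester_t;

/* pending: bit n set means requester n wants the current slot.
 * Returns the highest-priority requester, or REQ_NONE if the slot is idle. */
requester_t grant_slot(unsigned pending)
{
    for (int r = 0; r < REQ_COUNT; r++) {
        if (pending & (1u << r))
            return (requester_t)r;
    }
    return REQ_NONE;
}
```

In this model the CPU is granted a slot only when no DMA channel has its bit set, which is the "gets whatever is left" rule.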

Layer 2 — 68000 Bus Arbitration (3-wire handshake)

External DMA masters (Zorro cards, accelerator bridges) request the bus using the standard 68000 protocol:

Device asserts /BR (Bus Request)
CPU finishes current bus cycle, asserts /BG (Bus Grant)
Device asserts /BGACK (Bus Grant Acknowledge) — CPU tri-states its bus lines
Device performs DMA transfers
Device releases /BR and /BGACK — CPU reclaims bus

This is a daisy-chain priority — multiple requestors are wired in series. The device closest to the CPU in the chain has highest priority. On the A2000, the Zorro II slots are daisy-chained in slot order.

Note

This is the motherboard bus arbitration — separate from Agnus's Chip RAM slot scheduling. A Zorro DMA master can own the motherboard bus, but Agnus still controls Chip RAM access independently. The Zorro master can access Chip RAM only through Agnus's slot allocation.

Layer 3 — Buster Zorro III Arbitration (A3000/A4000)

On Zorro III systems, Buster provides a more sophisticated fair arbitration:

Round-robin among Zorro III slots — no permanent starvation
Burst mode — a master can hold the bus for multiple transfers (up to 256 bytes on rev 11)
Bus tenure — each master gets a maximum time window before forced release
Buster revision matters: rev 9 (A3000) has known arbitration bugs that cause DMA corruption with certain card combinations. Rev 11 (A4000, fixed A3000 boards) resolves these.

§4 — Custom Chip Register Access (`$DFF000`)

The custom chip registers occupy a 512-byte block at $DFF000–$DFF1FE. These are not memory — they are hardware ports with strict access rules. Violating them causes silent corruption, not exceptions.

Register Categories

Category	Access	Count	Examples	Notes
Write-only	Write	~130	`COLOR00`, `BPLCON0`, `BPL1PTH`	Reading returns bus noise or last DMA value
Read-only	Read	~15	`VPOSR`, `VHPOSR`, `DMACONR`, `JOY0DAT`	Writing is silently ignored
Strobe	Write	~10	`COPJMP1`, `COPJMP2`, `STRVBL`	Write triggers action; data value may be irrelevant
Read-clear	Read	~5	`CLXDAT`	Reading returns the latched value, then clears it
Set/Clear	Write	~10	`DMACON`, `INTENA`, `INTREQ`	Bit 15 = SET (1) or CLEAR (0) for remaining bits
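
The set/clear convention in the last row is easy to get wrong, so here is a small host-side model of it. It is a sketch of the bit-15 rule only, not of any particular register; the function name is hypothetical.

```c
#include <stdint.h>

#define SETCLR 0x8000u   /* bit 15: 1 = set the selected bits, 0 = clear them */

/* Returns the new register contents after writing `value` to a
 * set/clear register that currently holds `reg`. */
uint16_t setclr_write(uint16_t reg, uint16_t value)
{
    uint16_t bits = (uint16_t)(value & 0x7FFFu);   /* bits 0-14 select what changes */

    if (value & SETCLR)
        return (uint16_t)(reg | bits);
    return (uint16_t)(reg & ~bits);
}
```

Writing $8020 sets bit 5 and leaves everything else alone; writing $0020 clears the same bit. Writing zero changes nothing, because no bits are selected.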

The Word-Only Rule

Custom chip registers are 16-bit wide and must be accessed with word-sized operations:

; ✓ CORRECT — word write to COLOR00
    MOVE.W  #$0F00,$DFF180      ; Set background to red

; ✗ WRONG — byte write
    MOVE.B  #$0F,$DFF180        ; Unpredictable! Byte lane mismatch

; ✓ CORRECT — long-word write to paired register (high/low pointer)
    MOVE.L  #copperlist,$DFF080 ; Writes COP1LCH, then COP1LCL (two word writes)

Long-word writes to consecutive register pairs (e.g., BPL1PTH/BPL1PTL at $DFF0E0/$DFF0E2) work correctly because the 68000 performs two sequential word writes. This is a common and intentional optimization for pointer registers.
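
What the bus sees for such a long-word write can be sketched in C. The model assumes big-endian order with the high word written first to the lower address; the struct and function names are made up for illustration.

```c
#include <stdint.h>

typedef struct {
    uint32_t addr;   /* register address on the bus */
    uint16_t data;   /* 16-bit value written there  */
} word_write_t;

/* Splits a 32-bit write into two 16-bit bus writes:
 * high word at addr, low word at addr + 2 (assumed order). */
void split_long_write(uint32_t addr, uint32_t value, word_write_t out[2])
{
    out[0].addr = addr;
    out[0].data = (uint16_t)(value >> 16);
    out[1].addr = addr + 2;
    out[1].data = (uint16_t)(value & 0xFFFFu);
}
```

For a pointer value of $00012340 written to $DFF0E0, the high register receives $0001 and the low register at $DFF0E2 receives $2340.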

The `CLR.W` Hazard

The 68000's CLR instruction reads its destination before writing zero to it:

; ✗ DANGEROUS on 68000 — reads the register first!
    CLR.W   $DFF09C             ; INTREQ — the READ has side effects!

; ✓ SAFE — pure write, no read cycle
    MOVE.W  #0,$DFF09C          ; INTREQ — write-only, no read side effects

On the 68000, CLR.W $DFF09C first issues a read cycle to INTREQ, a write-only address, and only then writes zero. The read is an unintended side effect: reading a write-only register has undefined results, and on strobe registers any access triggers the action. On the 68010 and later, CLR is a pure write, so the same instruction behaves differently.

Warning

Portable rule: Never use CLR on hardware registers. Always use MOVE.W #0,addr. This works identically on all 680x0 processors and avoids the read side effect entirely.

C Language Access Pattern

When accessing hardware registers from C, two things are mandatory:

volatile — prevents the compiler from optimizing away or reordering register accesses
Word pointer type — ensures the compiler generates MOVE.W, not MOVE.B or MOVE.L

/* Standard pattern: NDK-style custom base, qualified volatile */
volatile struct Custom *custom = (volatile struct Custom *)0xDFF000;

/* Direct register access (volatile comes from the pointer qualifier) */
custom->color[0] = 0x000;          /* MOVE.W #0,$DFF180 */
custom->intreq = INTF_SETCLR | INTF_VERTB;  /* Request a VERTB interrupt */

/* Raw pointer pattern (for a register not in the struct) */
#define VHPOSR (*(volatile UWORD *)0xDFF006)
UWORD pos = VHPOSR;               /* Read beam position */

Note

The Amiga NDK hardware/custom.h defines the layout of struct Custom. Qualifying the base pointer as volatile makes every field access volatile regardless of how the header declares the fields, so code that goes through such a pointer is safe from optimization bugs.

Timing Semantics

Register writes take effect at the start of the next color clock (CCK) cycle — not instantaneously. A write to COLOR00 at $DFF180 changes the background color starting from the next pixel output cycle. This is why the Copper can create per-scanline color changes with single-cycle precision: each MOVE in the Copper list is timed to a specific horizontal position.

For registers that affect DMA (like DMACON or BPLxPTH), the new value typically takes effect at the next DMA slot boundary. See dma_architecture.md for DMA slot timing.

§5 — CIA Register Access (`$BFx000`)

The two CIA 8520 chips occupy a peculiar position in the Amiga's bus architecture. Unlike the custom chip registers (which are decoded by Agnus and respond at bus speed), the CIAs are 6800-style synchronous peripherals clocked by the E-clock, the slowest clock in the system.

Byte-Lane Placement

The CIAs are 8-bit devices on a 16-bit data bus. Their placement on the bus lanes is asymmetric:

CIA	Base Address	Data Bus Connection	Access Size
CIA-A	`$BFE001`	Odd byte lane (D0–D7, `/LDS`)	`MOVE.B` to odd addresses
CIA-B	`$BFD000`	Even byte lane (D8–D15, `/UDS`)	`MOVE.B` to even addresses

; ✓ Reading CIA-A PRA (OVL, LED, disk status, fire buttons)
    MOVE.B  $BFE001,D0          ; CIA-A at odd address

; ✓ Reading CIA-B PRA (serial and parallel control lines)
    MOVE.B  $BFD000,D0          ; CIA-B at even address

; ✗ WRONG — word read spans both CIAs!
    MOVE.W  $BFD000,D0          ; Reads CIA-B high byte AND CIA-A low byte simultaneously
                                 ; This is a famous source of bugs!

Warning

Never use MOVE.W or MOVE.L on CIA addresses. A word read at $BFD000 reads CIA-B's register on the high byte and CIA-A's register on the low byte simultaneously — potentially triggering unintended side effects (like clearing an ICR flag on the other CIA).
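
A small helper makes the addressing rule explicit. It assumes the base addresses from the table above and a register spacing of $100, which is consistent with the ICR (register $0D) appearing at $BFED01; the names are illustrative.

```c
#include <stdint.h>

typedef enum { CIA_A, CIA_B } cia_t;

/* CIA registers are spaced $100 apart; CIA-A sits on odd addresses,
 * CIA-B on even ones. reg is the CIA register number, 0-15. */
uint32_t cia_reg_addr(cia_t cia, unsigned reg)
{
    uint32_t base = (cia == CIA_A) ? 0xBFE001u : 0xBFD000u;
    return base + ((uint32_t)(reg & 0x0Fu) << 8);
}

/* Nonzero if addr selects the low byte lane (D0-D7), i.e. an odd address. */
int on_low_byte_lane(uint32_t addr)
{
    return (int)(addr & 1u);
}
```

Every address this helper produces is meant for a byte access; the lane check shows that CIA-A results are always odd and CIA-B results always even.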

E-Clock Synchronization

The CIAs are clocked by the E-clock — the 68000's slowest bus signal, running at the CPU clock ÷ 10:

Parameter	PAL	NTSC
E-clock frequency	709.379 kHz	715.909 kHz
E-clock period	~1.41 µs	~1.40 µs
CIA access time	~1 E-cycle ≈ 1.4–2.0 µs	~1 E-cycle ≈ 1.4–2.0 µs

A CIA register access requires the CPU to wait for the E-clock to reach the correct phase and then complete the transfer in step with it, so each access costs on the order of one E-clock period (the 6–10 wait states listed in §2). This makes CIA access the slowest routine bus operation on the Amiga, roughly 2.5–3.5× slower than a best-case Chip RAM access:

CIA access:   ~1,400–2,000 ns  (E-clock synchronized)
Chip RAM:     ~564 ns    (4 CPU clocks, best case)
Fast RAM:     ~40–80 ns  (accelerated, zero wait)

On an accelerated system, the penalty is even more extreme. A 50 MHz 68060 spends roughly 70–100 clock cycles waiting for a single CIA register read.
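
The E-clock frequency and period rows in the table above follow directly from the divide-by-ten relationship. A minimal sketch, assuming CPU clocks of 7,093,790 Hz (PAL) and 7,159,090 Hz (NTSC); the function names are hypothetical:

```c
/* E-clock frequency in Hz: the CPU clock divided by ten. */
unsigned long eclock_hz(unsigned long cpu_hz)
{
    return cpu_hz / 10ul;
}

/* E-clock period in nanoseconds. */
double eclock_period_ns(unsigned long cpu_hz)
{
    return 1e9 / (double)eclock_hz(cpu_hz);
}
```

For PAL this gives 709,379 Hz and a period of about 1.41 µs; for NTSC, 715,909 Hz and about 1.40 µs.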

ICR Read-and-Clear Semantics

The CIA Interrupt Control Register (ICR) at offset $0D uses read-and-clear semantics:

Reading ICR returns the current interrupt flags and clears them atomically
Writing ICR with bit 7 = 1 sets the specified interrupt enable bits
Writing ICR with bit 7 = 0 clears the specified interrupt enable bits

This means you cannot "poll" the ICR without consuming the flags. Code that reads ICR must process all returned flags in that read — a second read will return zero.

; Read and process CIA-A ICR
    MOVE.B  $BFED01,D0          ; Read ICR once: this read clears every flag
    BTST    #0,D0               ; Timer A underflow?
    BEQ.S   .no_timer_a
    BSR     .handle_timer_a     ; Handler must preserve D0
.no_timer_a:
    BTST    #2,D0               ; TOD alarm?
    BEQ.S   .no_tod
    BSR     .handle_tod
.no_tod:
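
The read-and-clear and mask-write rules can be captured in a tiny host-side model. It is a simplification: the real ICR also reports a summary bit in bit 7 on reads, which is left out here, and the struct and function names are invented for the example.

```c
#include <stdint.h>

typedef struct {
    uint8_t flags;   /* pending interrupt sources (bits 0-4) */
    uint8_t mask;    /* enabled interrupt sources (bits 0-4) */
} cia_icr_t;

/* Reading returns the pending flags and clears them in one step. */
uint8_t icr_read(cia_icr_t *icr)
{
    uint8_t value = icr->flags;
    icr->flags = 0;
    return value;
}

/* Writing updates the enable mask: bit 7 = 1 sets, bit 7 = 0 clears. */
void icr_write(cia_icr_t *icr, uint8_t value)
{
    uint8_t bits = (uint8_t)(value & 0x1Fu);

    if (value & 0x80u)
        icr->mask = (uint8_t)(icr->mask | bits);
    else
        icr->mask = (uint8_t)(icr->mask & ~bits);
}
```

With Timer A and TOD alarm both pending (flags = $05), the first `icr_read` returns $05 and the second returns $00, which is why a handler has to act on everything it got from the first read.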

For complete CIA register semantics, see cia_chips.md. For the E-clock's derivation from the video crystal, see video_timing.md §2.

§6 — The Accelerator Bus Bridge

When a 68030/040/060 accelerator is installed, the original 68000 is removed and the accelerator takes over the CPU socket. But the motherboard hardware — Chip RAM, custom registers, CIAs, ROM — still runs at 7 MHz. The bus adapter (also called the bridge or glue logic) is the hardware that translates between the two clock domains.

The Arbitration Sequence

When the accelerated CPU needs to access a motherboard resource:

                  Accelerator                     Motherboard
                  ──────────                      ───────────
    1. CPU issues address     ──────────────►
    2. Bridge detects motherboard target
    3. Bridge asserts /BR      ──────────────►    Bus Request
    4.                         ◄──────────────    /BG (Bus Grant)
    5. Bridge asserts /BGACK   ──────────────►    Bus Grant Acknowledge
    6. Bridge waits for next 7 MHz clock phase
    7. Bridge drives address/data on motherboard bus
    8.                         ◄──────────────    Target asserts /DTACK
    9. Bridge relays data back to CPU
   10. Bridge releases /BGACK  ──────────────►    Bus released

Steps 3–6 are the sync-up penalty — the CPU is frozen (via /DSACK or /STERM withholding) while the bridge waits for the motherboard's clock phase alignment. This penalty varies from 2 to 8+ motherboard clocks depending on where in the 7 MHz cycle the request arrives.

The Bridge as a Speed Limiter

The bridge cannot make motherboard access faster; at best it keeps it from being slower than the motherboard's native speed. From the accelerated CPU's perspective, though, the penalty is enormous because the CPU could have executed many instructions from Fast RAM in the same time:

CPU	Clock	Fast RAM Access	Chip RAM Access	Penalty Ratio
68000 (stock)	7 MHz	N/A	~564 ns	1× (baseline)
68020 (A1200)	14 MHz	~140 ns	~564 ns	~4×
68030 @ 25 MHz	25 MHz	~80 ns	~700 ns	~9×
68030 @ 50 MHz	50 MHz	~40 ns	~700 ns	~18×
68040 @ 25 MHz	25 MHz	~40 ns (burst)	~800 ns	~20×
68060 @ 50 MHz	50 MHz	~20 ns (burst)	~800 ns	~40×

Important

The Chip RAM access time doesn't improve with a faster CPU — it's fixed by the motherboard clock. Only the penalty ratio increases because the CPU can do more useful work per nanosecond from Fast RAM.

Per-CPU Bridge Differences

68020/030 adapters have a relatively straightforward bridge because the 68020/030 bus protocol is similar to the 68000's. The main challenges are:

Dynamic bus sizing (32-bit CPU ↔ 16-bit motherboard)
Cache line fills from motherboard (burst mode negotiation)
Address bit translation (24-bit vs. 32-bit)

68040/060 adapters require much more complex bridge logic because:

The 68040/060 bus protocol is fundamentally different (no dynamic bus sizing, burst line transfers for cache fills and pushes, and on the 68040 separate PCLK/BCLK inputs)
The copyback data cache may hold modified data that DMA hardware needs — the bridge must snoop or flush
MOVE16 instructions generate 16-byte line transfers that must be broken into motherboard-sized chunks
The 68060 adds superscalar execution and branch prediction, making the CPU even more sensitive to stalls

What the CPU "Sees"

From the CPU's perspective, a Chip RAM access looks like a very long wait state:

50 MHz 68060 executing from Fast RAM:
    MOVE.L  (A0),D0     ; A0 points to Fast RAM → 1 cycle = 20 ns
    MOVE.L  (A1),D1     ; A1 points to Chip RAM → ~40 cycles = 800 ns ← CPU frozen here
    ADD.L   D0,D1       ; 1 cycle = 20 ns

The CPU doesn't "know" it's crossing a clock domain — it simply sees /DSACK being withheld for many cycles. The bridge is invisible to software, which is why the same binary runs on stock and accelerated systems. The only difference is speed.
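
The cycle counts in the listing above are just the access time expressed in CPU clocks. A one-line helper, with a hypothetical name, makes the conversion explicit:

```c
/* CPU clock cycles that elapse while one bus access lasting access_ns
 * nanoseconds is outstanding, for a CPU running at cpu_mhz. */
unsigned stall_cycles(unsigned cpu_mhz, unsigned access_ns)
{
    return (cpu_mhz * access_ns) / 1000u;
}
```

At 50 MHz, an 800 ns Chip RAM access costs 40 cycles and a 20 ns Fast RAM access costs one.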

§7 — Cache Coherency with DMA

When the CPU has a data cache, a dangerous situation arises: the CPU may hold a cached copy of Chip RAM data that has been modified by DMA hardware (Blitter, disk DMA), or the CPU may modify cached Chip RAM data that DMA hardware expects to read. This is the cache coherency problem — the cache and main memory disagree about the current state of data.

Per-CPU Cache Characteristics

CPU	Data Cache	Write Policy	Bus Snooping	Coherency Strategy
68000	None	N/A	N/A	No problem — all accesses go to RAM
68020	256-byte instruction cache	N/A	N/A	No data cache — no coherency issue
68030	256-byte data + 256-byte inst.	Write-through	None	Software must invalidate after DMA writes
68040	4 KB data + 4 KB inst.	Copyback	Optional (often not wired)	Software must flush before DMA reads, invalidate after DMA writes
68060	8 KB data + 8 KB inst.	Copyback + store buffer	Optional (rarely wired)	Same as 68040, plus store buffer drain

The Problem in Practice

Scenario: CPU writes a new Copper list to Chip RAM, then triggers Copper restart

CPU cache (copyback):     Contains modified Copper list data
Chip RAM (main memory):   Contains STALE Copper list data
Copper DMA reads from:    Chip RAM (not the CPU cache!)

Result: Copper executes the OLD list → display corruption

The AmigaOS Solution: CachePreDMA / CachePostDMA

AmigaOS provides two exec functions for DMA-safe memory access:

/* DMA device READS from memory (Blitter source, display, audio) */
CachePreDMA(address, &length, DMA_ReadFromRAM);
/* → Flushes CPU data cache for the specified range
      so DMA reads the latest data from RAM */
/* ... run the DMA transfer ... */
CachePostDMA(address, &length, DMA_ReadFromRAM);

/* DMA device WRITES to memory (Blitter dest, disk read): flags = 0 */
CachePreDMA(address, &length, 0);
/* ... run the DMA transfer ... */
CachePostDMA(address, &length, 0);
/* → Invalidates CPU data cache for the specified range
      so CPU reads the updated data from RAM, not stale cache */

When to Flush vs. Invalidate

Scenario	Action	Function	Why
Before Blitter reads source	Flush (push dirty cache lines to RAM)	`CachePreDMA()`	Blitter must see latest CPU-written data
Before audio DMA plays sample	Flush	`CachePreDMA()`	Paula must read correct sample data
Before Copper executes list	Flush	`CachePreDMA()`	Copper reads from RAM, not cache
After Blitter writes destination	Invalidate (discard cached copies)	`CachePostDMA()`	CPU must see Blitter's output, not stale cache
After disk DMA loads data	Invalidate	`CachePostDMA()`	CPU must see disk data, not old cache content
After network DMA receives packet	Invalidate	`CachePostDMA()`	CPU must see received data
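
The difference between flushing and invalidating can be illustrated with a toy model: one word of RAM behind a one-entry copyback cache. This is a host-side simulation, not AmigaOS code, and every name in it is hypothetical.

```c
#include <stdint.h>

typedef struct {
    uint16_t ram;      /* what DMA hardware sees              */
    uint16_t cached;   /* what the CPU sees while valid       */
    int valid;         /* cache holds a copy of this word     */
    int dirty;         /* cached copy is newer than RAM       */
} cached_word_t;

/* Copyback write: only the cache is updated. */
void cpu_write(cached_word_t *m, uint16_t v)
{
    m->cached = v;
    m->valid = 1;
    m->dirty = 1;
}

/* CPU read: served from the cache when it holds a valid copy. */
uint16_t cpu_read(cached_word_t *m)
{
    if (!m->valid) {
        m->cached = m->ram;
        m->valid = 1;
    }
    return m->cached;
}

/* DMA bypasses the cache in both directions. */
uint16_t dma_read(const cached_word_t *m)  { return m->ram; }
void dma_write(cached_word_t *m, uint16_t v) { m->ram = v; }

/* Flush: push dirty data out to RAM. */
void cache_flush(cached_word_t *m)
{
    if (m->valid && m->dirty) {
        m->ram = m->cached;
        m->dirty = 0;
    }
}

/* Invalidate: discard the cached copy so the next read goes to RAM. */
void cache_invalidate(cached_word_t *m)
{
    m->valid = 0;
    m->dirty = 0;
}
```

In this model a DMA read after `cpu_write` still returns the old value until `cache_flush` runs, and a CPU read after `dma_write` still returns the stale cached value until `cache_invalidate` runs. Those are the two rows of behaviour the table distinguishes.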

The Nuclear Option: CacheClearU / CacheClearE

For simpler cases (or legacy code), exec provides brute-force cache management:

CacheClearU();                           /* Flush + invalidate ALL caches */
CacheClearE(address, length, CACRF_ClearD | CACRF_ClearI);  /* Flush + invalidate specific range */

CacheClearU() is expensive — it flushes the entire data cache, losing all cached Fast RAM data. Use CacheClearE() with a specific range when possible.

Warning

Games that "take over the system" must still manage caches if they run on 68040/060. A common pattern is to disable the data cache entirely (CacheControl(0, CACRF_EnableD)) or set all Chip RAM as cache-inhibited via the MMU. Both approaches sacrifice performance but guarantee coherency.

Hardware Registers and Caching

Custom chip registers at $DFF000 and CIA registers at $BFx000 must never be cached. On 68040/060, these address ranges are mapped as cache-inhibited, serialized (CI/S) in the MMU translation tables. This ensures:

Every read goes to the hardware register, not a cached copy
Every write is immediately driven to the bus
Access order is strictly maintained (no write buffer reordering)

The Kickstart ROM's MMU setup handles this automatically. Custom MMU tables must preserve these mappings.

§8 — Cross-Domain Transfer Techniques

The most performance-critical operation on an accelerated Amiga is moving data between Fast RAM (where the CPU works efficiently) and Chip RAM (where DMA hardware needs it). Every screen update, every audio buffer fill, and every Blitter setup involves this cross-domain transfer.

The Fundamental Constraint

The Chip RAM bus runs at ~7 MHz with a 16-bit data path. No software technique can exceed the bus's raw throughput ceiling:

Maximum Chip RAM bandwidth:
  16 bits × 3.55 M memory cycles/s = ~7.1 MB/s (theoretical, all slots to one master; each memory cycle takes two 7.09 MHz clocks)
  CPU share (typical):  ~3.5 MB/s (half the slots, rest to DMA)
  CPU share (heavy display): ~0.5–1 MB/s (most slots consumed by DMA)

All transfer techniques operate within this ceiling. The goal is to reach the ceiling efficiently, not exceed it.
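
To see what these rates mean in practice, it helps to turn them into copy times. The helper below is a plain arithmetic sketch; the function name is hypothetical and 1 MB is taken as 1,000,000 bytes.

```c
/* Time in milliseconds to move `bytes` at a sustained rate of
 * `mb_per_s` megabytes per second (1 MB = 1,000,000 bytes here). */
double copy_time_ms(unsigned long bytes, double mb_per_s)
{
    return (double)bytes / (mb_per_s * 1e6) * 1e3;
}
```

A 320×256 screen with four bitplanes is 40,960 bytes. At 3.5 MB/s the copy takes about 11.7 ms; at 0.5 MB/s it takes roughly 82 ms.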

Transfer Method Comparison

All figures are relative to a baseline MOVE.W loop on a 68000 (= 1.0×). Higher is faster. Actual throughput depends on DMA load, CPU model, and alignment.

Method	Relative Speed	Alignment Req.	CPU Occupied?	Works Fast↔Chip?	Notes
`MOVE.W` loop	1.0× (baseline)	Word	Yes	Yes	Simplest, slowest
`MOVE.L` loop	~1.8×	Long	Yes	Yes	Two words per instruction; fewer loop iterations
`MOVEM.L` (unrolled)	~2.5×	Long	Yes	Yes	Moves up to 14 registers (56 bytes) per instruction
`MOVE16` (68040+)	~3.5×	16-byte aligned	Yes	Yes	16-byte cache-line burst; best for 040/060
`CopyMem()`	~2.0–2.5×	Any	Yes	Yes	Library function; uses best available method internally
`CopyMemQuick()`	~2.5×	Long	Yes	Yes	Optimized path; requires longword-aligned src/dest/size
Blitter	~2.0× (A channel copy)	Word	No (CPU free)	Chip↔Chip only	CPU can work Fast RAM in parallel

Important

The Blitter cannot access Fast RAM. It only copies within Chip RAM. Its advantage is freeing the CPU to do other work in Fast RAM while the copy proceeds in parallel. On a stock 68000, the Blitter is faster than the CPU for large copies; on 68030+, the CPU with MOVEM.L is often faster, but the parallelism benefit remains.

Optimal Transfer Patterns

Pattern 1: MOVEM.L — The Workhorse

; Copy 256 bytes from Fast RAM (A0) to Chip RAM (A1)
; Uses 12 data/address registers = 48 bytes per pair of MOVEM instructions
    MOVEM.L (A0)+,D0-D7/A2-A5   ; Load 48 bytes from Fast RAM (fast)
    MOVEM.L D0-D7/A2-A5,(A1)    ; Store 48 bytes to Chip RAM (slow — bus limited)
    LEA     48(A1),A1
    ; ... repeat for remaining 208 bytes (4 more iterations + remainder)

The load from Fast RAM completes at full CPU speed. The store to Chip RAM is bottlenecked by the 7 MHz bus — but MOVEM.L minimizes instruction fetch overhead by moving 48 bytes per instruction pair.

Pattern 2: MOVE16 — The 68040/060 Burst

; Copy 64 bytes using cache-line burst transfers (68040/060 only)
; Source and destination MUST be 16-byte aligned
    MOVE16  (A0)+,(A1)+          ; 16 bytes in one burst cycle
    MOVE16  (A0)+,(A1)+          ; 16 bytes
    MOVE16  (A0)+,(A1)+          ; 16 bytes
    MOVE16  (A0)+,(A1)+          ; 16 bytes (64 bytes total)

MOVE16 bypasses the data cache and performs a 16-byte line transfer directly on the bus. On 68040/060 with proper Chip RAM alignment, this is the fastest CPU-driven method.

Warning

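MOVE16 requires 16-byte alignment on both source and destination. Misaligned addresses do not trap: the CPU ignores the low four address bits, so the transfer silently lands on the enclosing 16-byte line and copies the wrong bytes. The instruction is not available on 68000/020/030.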
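
Because alignment is the caller's responsibility, a copy routine usually checks it before choosing a line-based path. A minimal sketch of such checks, with illustrative function names:

```c
#include <stdint.h>

/* Nonzero if the address lies on a 16-byte boundary. */
int is_aligned16(uintptr_t addr)
{
    return (addr & 0xFu) == 0;
}

/* Rounds an address up to the next 16-byte boundary. */
uintptr_t align_up16(uintptr_t addr)
{
    return (addr + 0xFu) & ~(uintptr_t)0xFu;
}

/* A copy may take a MOVE16-style path only if both ends are aligned
 * and the length is a whole number of 16-byte lines. */
int can_use_line_copy(uintptr_t src, uintptr_t dst, unsigned long len)
{
    return is_aligned16(src) && is_aligned16(dst) && (len % 16ul) == 0;
}
```

When the check fails, a routine would typically fall back to a MOVE.L or MOVEM.L loop for the whole copy, or for the unaligned head and tail.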

Pattern 3: CPU + Blitter Pipeline

The most efficient approach combines both engines:

1. CPU renders frame data in Fast RAM         (full CPU speed)
2. CPU converts chunky→planar (C2P) in Fast RAM  (full CPU speed)
3. CPU copies planar result to Chip RAM buffer    (slow — bus limited)
4. Blitter copies within Chip RAM for scrolling   (parallel — CPU free)
5. CPU begins next frame in Fast RAM              (while Blitter works)

This double-buffered pipeline keeps both the CPU and Blitter busy simultaneously. The CPU never waits for the Blitter, and the Blitter never waits for the CPU. Games like Doom (Amiga port) use exactly this pattern.

Audio Streaming: The Ping-Pong Buffer

Audio data follows a similar cross-domain pattern:

Fast RAM:   Decode/decompress audio (MP3, MOD, etc.) at CPU speed
            ↓ Copy decoded PCM samples
Chip RAM:   Two small DMA buffers (ping/pong), each ~1–4 KB
            Paula DMA plays from buffer A
            CPU fills buffer B from decoded data in Fast RAM
            At interrupt: swap A↔B, repeat

The key insight: only the two small DMA buffers (2–8 KB in total) consume Chip RAM. The bulk audio data and decompression workspace live in Fast RAM. See memory_types.md for allocation strategy.
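The swap-and-refill logic can be modelled on a host machine without any Amiga hardware. In the sketch below, plain arrays stand in for the two Chip RAM buffers, the 2 KB buffer size is an assumed value inside the range given above, and all names are hypothetical. Real code would also reprogram Paula's pointer and length registers in the interrupt handler, which this model leaves out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DMA_BUF_BYTES 2048u  /* assumed size, within the 1-4 KB range */

struct pingpong {
    int8_t        buf[2][DMA_BUF_BYTES]; /* stand-ins for two MEMF_CHIP buffers */
    int           playing;               /* index the DMA is currently reading */
    const int8_t *src;                   /* decoded PCM (Fast RAM in practice) */
    size_t        src_len;
    size_t        src_pos;
};

/* Copy the next chunk of decoded PCM into buffer `idx`; pad with silence
 * once the source runs out. */
static void pp_fill(struct pingpong *pp, int idx)
{
    size_t left = pp->src_len - pp->src_pos;
    size_t n = left < DMA_BUF_BYTES ? left : DMA_BUF_BYTES;
    memcpy(pp->buf[idx], pp->src + pp->src_pos, n);
    memset(pp->buf[idx] + n, 0, DMA_BUF_BYTES - n);
    pp->src_pos += n;
}

/* Prime both buffers; buffer 0 (A) plays first. */
static void pp_start(struct pingpong *pp, const int8_t *src, size_t len)
{
    assert(src != NULL);
    pp->src = src;
    pp->src_len = len;
    pp->src_pos = 0;
    pp->playing = 0;
    pp_fill(pp, 0);
    pp_fill(pp, 1);
}

/* Audio interrupt model: swap A<->B, then refill the buffer that just
 * finished playing. Returns the index now being played. */
static int pp_on_interrupt(struct pingpong *pp)
{
    int finished = pp->playing;
    pp->playing ^= 1;
    pp_fill(pp, finished);
    return pp->playing;
}
```

The decoded source can be arbitrarily large because it never has to be DMA-visible; only `buf` would need to come from Chip RAM.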

§9 — Peripheral Address Spaces & Per-Model Maps

The Amiga's address space is not static — expansion cards, system controllers, and PCI bridges dynamically insert address windows into the memory map. This section provides annotated per-model maps showing where these windows appear.

A500 / A2000 — The 24-bit Baseline

$000000 ┌──────────────────────────────────┐
        │ Chip RAM (512 KB – 2 MB)         │  DMA-visible
$080000 │ (mirror/wrap if < 2 MB)          │
$200000 ├──────────────────────────────────┤
        │ ◆ Zorro II Fast RAM              │  AutoConfig-assigned (up to 8 MB)
        │   Populated by expansion cards   │  Cards: A2058, A2091, GVP, etc.
$A00000 ├──────────────────────────────────┤
        │ Reserved (CIA decode range)      │  No expansion boards mapped here
$BFD000 ├──────────────────────────────────┤
        │ CIA-B ($BFD000, even bytes)      │  E-clock synchronous
$BFE001 │ CIA-A ($BFE001, odd bytes)       │  E-clock synchronous
$C00000 ├──────────────────────────────────┤
        │ Slow RAM (512 KB, trapdoor)      │  On Chip bus, NOT DMA-visible
$C80000 ├──────────────────────────────────┤
        │ Reserved (Slow RAM expansion)    │  Used by trapdoor boards larger than 512 KB
$DC0000 ├──────────────────────────────────┤
        │ Real-Time Clock (MSM6242B)       │
$DFF000 ├──────────────────────────────────┤
        │ Custom Chip Registers            │  $DFF000–$DFF1FE (Agnus/Denise/Paula)
$E00000 ├──────────────────────────────────┤
        │ Kick mirror / WCS                │
$E80000 ├──────────────────────────────────┤
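        │ ◆ AutoConfig Probe Space         │  Temporary: boards appear here
        │   before relocation (64 KB)      │  during CFGIN/CFGOUT enumeration
$E90000 ├──────────────────────────────────┤
        │ ◆ Zorro II I/O Space             │  Board registers (AutoConfig-assigned)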
$F00000 ├──────────────────────────────────┤
        │ Extended Kickstart ROM (OS 3.1+) │
$F80000 ├──────────────────────────────────┤
        │ Kickstart ROM (512 KB)           │
$FFFFFF └──────────────────────────────────┘

◆ = dynamically populated by AutoConfig or expansion hardware.
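A memory map like this translates directly into a range lookup. The sketch below is a deliberately partial classifier covering only the fixed regions of the map (Chip RAM, Zorro II RAM, CIAs, Slow RAM, custom registers, Kickstart); everything else is reported as "other". The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum region {
    REGION_CHIP_RAM,
    REGION_ZORRO2_RAM,
    REGION_CIA,
    REGION_SLOW_RAM,
    REGION_CUSTOM,
    REGION_KICKSTART,
    REGION_OTHER
};

/* Classify a 24-bit A500/A2000 address. The caller is expected to have
 * masked the address to 24 bits already. */
static enum region classify_a500(uint32_t addr)
{
    assert(addr <= 0xFFFFFFu);
    if (addr < 0x200000u)                       return REGION_CHIP_RAM;
    if (addr < 0xA00000u)                       return REGION_ZORRO2_RAM;
    if (addr >= 0xBFD000u && addr < 0xC00000u)  return REGION_CIA;
    if (addr >= 0xC00000u && addr < 0xC80000u)  return REGION_SLOW_RAM;
    if (addr >= 0xDFF000u && addr <= 0xDFF1FFu) return REGION_CUSTOM;
    if (addr >= 0xF80000u)                      return REGION_KICKSTART;
    return REGION_OTHER;
}

/* Only the Chip RAM range is reachable by custom-chip DMA. */
static bool is_dma_visible(uint32_t addr)
{
    return classify_a500(addr) == REGION_CHIP_RAM;
}
```

Note that Slow RAM at $C00000 classifies as its own region and is not DMA-visible, matching the annotation in the map.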

A3000 / A4000 — 32-bit Extension

$00000000 ┌──────────────────────────────────┐
          │ (24-bit map as above)            │  Identical $000000–$FFFFFF
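$01000000 ├──────────────────────────────────┤
          │ (reserved)                       │
$07000000 ├──────────────────────────────────┤
          │ Motherboard Fast RAM (≤ 16 MB)   │  Ramsey-managed, through $07FFFFFF
$08000000 ├──────────────────────────────────┤
          │ (CPU slot expansion / reserved)  │
$10000000 ├──────────────────────────────────┤
          │ ◆ Zorro III Address Space        │  32-bit, AutoConfig
          │   Fast RAM cards (32–256 MB)     │  e.g., CyberStorm, Fastlane
          │   I/O boards (RTG, SCSI, Net)    │  e.g., CyberVision, Ariadne
          │   Assigned by expansion.library  │
          │ ◆ PCI Bridge Windows (if present)│  Mediator: 8 MB window
          │   Mapped into Z3 space           │  G-REX: linear via CPU slot
$80000000 ├──────────────────────────────────┤
          │ (unused/reserved)                │
$FFFFFFFF └──────────────────────────────────┘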

On A3000/A4000, Ramsey manages on-board Fast RAM (up to 16 MB, separately from Zorro III cards). Buster (rev 11) handles Zorro III DMA and burst negotiation.

A600 — ECS Compact

$000000 ┌──────────────────────────────────┐
        │ Chip RAM (1–2 MB)                │  DMA-visible (ECS Agnus)
$200000 ├──────────────────────────────────┤
        │ (No Zorro slots)                 │
$600000 ├──────────────────────────────────┤
        │ ◆ PCMCIA Common Memory           │  4 MB window ($600000–$9FFFFF)
        │   CompactFlash, network cards    │  (Gayle-managed)
$A00000 ├──────────────────────────────────┤
        │ CIA-B, CIA-A, Slow RAM           │  (standard layout)
$DA0000 ├──────────────────────────────────┤
        │ ◆ Gayle IDE Registers            │  Internal 2.5" IDE
$DFF000 ├──────────────────────────────────┤
        │ Custom Registers, ROM            │  (standard layout)
$FFFFFF └──────────────────────────────────┘

A1200 — AGA with Trapdoor Expansion

$000000 ┌──────────────────────────────────┐
        │ Chip RAM (2 MB, fixed)           │  DMA-visible (Alice)
$200000 ├──────────────────────────────────┤
        │ ◆ Accelerator Fast RAM           │  Trapdoor connector (150-pin)
        │   Mapped by accelerator bridge   │  Blizzard 1230/1260, TF1260, etc.
        │   Typically $200000–$5FFFFF      │  (4 MB PCMCIA-friendly)
        │   or $200000–$07FFFFFF           │  (up to 126 MB, disables PCMCIA)
$600000 ├──────────────────────────────────┤
        │ ◆ PCMCIA Window                  │  4 MB ($600000–$9FFFFF)
        │   Conflicts with Fast >4 MB!     │  Gayle-managed
$A00000 ├──────────────────────────────────┤
        │ CIA-B, CIA-A                     │  (standard layout)
$DA0000 ├──────────────────────────────────┤
        │ ◆ Gayle IDE Registers            │  Internal 2.5" IDE + CF adapter
$DA4000 │ ◆ PCMCIA Attribute Memory        │  Card configuration registers
$DFF000 ├──────────────────────────────────┤
        │ Custom Registers (AGA)           │  Alice/Lisa extended register set
$F00000 ├──────────────────────────────────┤
        │ Extended + Primary Kickstart ROM │
$FFFFFF └──────────────────────────────────┘

Warning

The PCMCIA/Fast RAM conflict: On the A1200, the PCMCIA window occupies $600000–$9FFFFF. An accelerator that maps more than 4 MB of Fast RAM contiguously from $200000 runs past $5FFFFF into this window, and PCMCIA (CF cards, network cards) is unusable for as long as that configuration is active. Most modern accelerators offer a "PCMCIA-friendly" jumper that limits this Fast RAM to 4 MB.
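The conflict is a plain address-range overlap and can be checked as such. A minimal sketch, assuming the Fast RAM block is described by a base address and a size; the function name is illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PCMCIA_BASE 0x600000u
#define PCMCIA_END  0xA00000u  /* exclusive: the window is $600000-$9FFFFF */

/* True when [base, base + size) intersects the A1200 PCMCIA window. */
static bool overlaps_pcmcia(uint32_t base, uint32_t size)
{
    assert(size > 0 && base <= UINT32_MAX - size);  /* no wrap-around */
    uint32_t end = base + size;                     /* exclusive end */
    return base < PCMCIA_END && end > PCMCIA_BASE;
}
```

With a base of $200000, a 4 MB block ends exactly at $5FFFFF and does not overlap, while an 8 MB block covers the entire window.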

CD32 — AGA Console with Akiko

$000000 ┌──────────────────────────────────┐
        │ Chip RAM (2 MB, fixed)           │  DMA-visible (Alice)
$200000 ├──────────────────────────────────┤
        │ (No expansion bus)               │
$B80000 ├──────────────────────────────────┤
        │ ◆ Akiko Chip                     │  Chunky-to-Planar engine
        │   CD-ROM controller              │  1 KB NVRAM interface
$DFF000 ├──────────────────────────────────┤
        │ Custom Registers (AGA)           │
$E00000 ├──────────────────────────────────┤
        │ CD32 Extended ROM                │  CD filesystem, CDDA player
$F00000 ├──────────────────────────────────┤
        │ CD32 Flash ROM                   │  Firmware, SysInfo
$F80000 ├──────────────────────────────────┤
        │ Kickstart 3.1 ROM                │
$FFFFFF └──────────────────────────────────┘

PCI Bridge Address Windowing

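PCI defines a full 32-bit (4 GB) memory address space, but a bridge board can expose only what fits in the address window it is given on the Amiga side: the whole 24-bit map is just 16 MB, and even on 32-bit systems a board receives a bounded region of Zorro III or CPU-local address space. PCI bridges solve this mismatch differently: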

Bridge	Technique	Window Size	Performance
Mediator (Elbox)	Memory windowing: 8 MB window in Zorro III space; driver swaps visible region via bank register	8 MB visible at a time	~20–30 MB/s (window-switching overhead)
G-REX (DCE)	Linear mapping via CPU local bus (CyberStorm/Blizzard PPC): entire PCI space directly addressable	Full 4 GB	~60–80 MB/s (no windowing overhead)
Prometheus	PLX bridge chip, single Zorro III window	Varies	~10–15 MB/s

The Mediator windowing works like a bank-switching scheme: the driver writes a bank register to select which 8 MB slice of PCI memory is visible in the Zorro III window. Accessing a different region requires a register write to switch banks — this adds latency for scattered access patterns but works transparently for contiguous operations like framebuffer blits.
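The bank-switching arithmetic can be sketched as follows. This is a model of the general technique only: the 8 MB window size comes from the table above, while the way a real bridge encodes its bank register is not documented here and is left out. The names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW_BYTES (8u * 1024u * 1024u)  /* 8 MB visible at a time */

/* A PCI address split into the bank to select and the offset to use
 * inside the window once that bank is visible. */
struct window_ref {
    uint32_t bank;
    uint32_t offset;
};

static struct window_ref split_pci_address(uint32_t pci_addr)
{
    struct window_ref r;
    r.bank   = pci_addr / WINDOW_BYTES;
    r.offset = pci_addr % WINDOW_BYTES;
    return r;
}

/* Count the bank-register writes needed to touch `addrs` in order,
 * starting with no bank selected. */
static unsigned count_bank_switches(const uint32_t *addrs, size_t n)
{
    assert(addrs != NULL || n == 0);
    unsigned switches = 0;
    uint32_t current = UINT32_MAX;  /* sentinel: no bank selected yet */
    for (size_t i = 0; i < n; i++) {
        uint32_t bank = addrs[i] / WINDOW_BYTES;
        if (bank != current) {
            current = bank;
            switches++;
        }
    }
    return switches;
}
```

Sequential accesses inside one 8 MB slice cost a single bank write, whereas accesses that alternate between two slices pay one bank write per access, which is the scattered-access penalty described above.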

For PCI card compatibility and driver details, see zorro_bus.md. For AutoConfig protocol mechanics, see autoconfig.md. For detailed address space semantics, see also address_space.md.

§10 — Best Practices & Hazards

Register Access Quick Reference

Target	Access Size	Alignment	Volatility	Special Rules
Custom chip ($DFF000)	Word (long for adjacent register pairs)	2-byte	`volatile` required in C	No `CLR`; no byte access; use long-word access only for adjacent register pairs (e.g., pointer high/low)
CIA-A ($BFE001)	Byte only	Odd addresses	`volatile` required	E-clock sync; ICR is read-and-clear
CIA-B ($BFD000)	Byte only	Even addresses	`volatile` required	E-clock sync; never word-read CIA region
Chip RAM	Any	2-byte for 68000	Not needed (memory)	DMA-visible; may cause wait states under DMA load
Fast RAM	Any	Any (4-byte preferred)	Not needed	CPU-only; zero DMA contention
Zorro II I/O	Word/Long	Card-dependent	`volatile` required	16-bit bus; auto-sized on 32-bit CPU
Zorro III I/O	Long preferred	4-byte	`volatile` required	32-bit bus; burst-capable
IDE (Gayle)	Word	2-byte	`volatile` required	PIO timing requirements
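The access-size and `volatile` rules in the table can be captured in small accessor functions. The sketch below is illustrative only: the function names are hypothetical, and the base pointer is passed in so the same code can be exercised against an ordinary array on a host machine. On real hardware the caller would pass the register block itself, for example `(volatile uint16_t *)0xDFF000` for the custom chips or `(volatile uint8_t *)0xBFE001` for CIA-A.

```c
#include <assert.h>
#include <stdint.h>

/* Word write to a custom-chip register. `offset` is the register's byte
 * offset from the block base (e.g. 0x180 for COLOR00) and must be even. */
static void custom_write16(volatile uint16_t *base, unsigned offset, uint16_t value)
{
    assert((offset & 1u) == 0);
    base[offset / 2u] = value;
}

/* Word read from a readable custom-chip register. */
static uint16_t custom_read16(volatile const uint16_t *base, unsigned offset)
{
    assert((offset & 1u) == 0);
    return base[offset / 2u];
}

/* CIA registers are one byte wide and spaced $100 apart, so register n
 * lives at base + n * $100. Always a byte access. */
static void cia_write8(volatile uint8_t *cia, unsigned reg, uint8_t value)
{
    assert(reg < 16u);
    cia[reg * 0x100u] = value;
}

static uint8_t cia_read8(volatile const uint8_t *cia, unsigned reg)
{
    assert(reg < 16u);
    return cia[reg * 0x100u];
}
```

Keeping the pointer types `volatile` prevents the compiler from merging, reordering, or dropping the accesses, and the separate 8-bit and 16-bit helpers make it hard to issue a word access to a CIA or a byte access to a custom register by accident.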

Memory Placement Strategy

For maximum throughput, place data according to how it will be consumed:

Data Type	Place In	Why
Display bitplanes	Chip RAM (mandatory)	Agnus/Alice fetches bitplane data by DMA, and only from Chip RAM
Copper lists	Chip RAM (mandatory)	Copper DMA reads from Chip RAM
Audio samples (DMA buffers)	Chip RAM (mandatory)	Paula DMA reads from Chip RAM
Sprite data	Chip RAM (mandatory)	Sprite DMA reads from Chip RAM
Blitter source/dest	Chip RAM (mandatory)	Blitter can only access Chip RAM
Program code	Fast RAM (preferred)	Executes at full CPU speed
Variables, stack	Fast RAM (preferred)	No DMA contention
Audio decode workspace	Fast RAM	Decompress at CPU speed, copy to Chip buffer
3D rendering buffer	Fast RAM	Render at CPU speed, C2P to Chip for display
File I/O buffers	Any (`MEMF_ANY`)	Let OS decide based on availability

Cache Management Checklist

For 68040/060 systems:

Before starting DMA that reads Chip RAM → CachePreDMA() or CacheClearE() with CACRF_ClearD
After DMA that writes Chip RAM → CachePostDMA() or CacheClearE() with CACRF_ClearD
After loading code into RAM → CacheClearE() with CACRF_ClearI (instruction cache invalidation)
Hardware register regions → ensure MMU maps $DFF000 and $BFx000 as cache-inhibited, serialized
When in doubt → CacheClearU() (expensive but safe)

Common Antipatterns

Antipattern	Problem	Fix
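`CLR.W $DFF09C`	On the 68000, `CLR` reads the destination before writing it, so it issues an unintended read cycle to the write-only INTREQ register	Use `MOVE.W #0,$DFF09C` (or the intended mask, e.g. `#$7FFF` to clear all requests)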
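`MOVE.W $BFD000,D0`	Word access spans both byte lanes, but only CIA-B (even lane) is decoded at this address, so the low byte is undefined; at addresses where both chip selects are active, a word access touches both CIAs at once	Use `MOVE.B` with correct address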
`MOVE.B #x,$DFF180`	Byte write to 16-bit custom register	Use `MOVE.W`
Blitter source in Fast RAM	Blitter cannot access Fast RAM → silent garbage or hang	Allocate source with `MEMF_CHIP`
No cache flush before display DMA	68040/060 shows stale data on screen	`CachePreDMA()` before Copper restart
`MOVE16` with a misaligned address	The 68040/060 ignores the low four address bits, so the wrong 16-byte line is copied without any exception	Ensure 16-byte alignment on both operands
Polling CIA ICR in a loop	Each read clears the flags — second read returns 0	Save ICR value, process all flags from one read
DMA buffer in Fast RAM (e.g., `AllocMem(size, MEMF_FAST)` for an audio buffer)	Agnus/Alice performs all DMA on behalf of Paula, the Blitter, and the display, and it can address only Chip RAM	Use `MEMF_CHIP` for all DMA buffers
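The ICR row of the table is easiest to get right with a read-once, dispatch-many handler. The sketch below models the read-and-clear behaviour with an ordinary variable so it can run on a host; the bit values follow the 8520 ICR layout, while the counter struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 8520 ICR flag bits */
#define ICR_TA   0x01u  /* Timer A underflow */
#define ICR_TB   0x02u  /* Timer B underflow */
#define ICR_ALRM 0x04u  /* TOD alarm */
#define ICR_SP   0x08u  /* Serial port */
#define ICR_FLG  0x10u  /* FLAG pin */

struct icr_counts {
    unsigned ta, tb, alarm, sp, flag;
};

/* Model of a read-and-clear register: the read returns the pending flags
 * and leaves the register empty, as a real ICR read does. */
static uint8_t icr_read_and_clear(uint8_t *hw_icr)
{
    uint8_t value = *hw_icr;
    *hw_icr = 0;
    return value;
}

/* Read the ICR exactly once, then service every flag from the saved copy. */
static void service_cia(uint8_t *hw_icr, struct icr_counts *counts)
{
    assert(hw_icr != 0 && counts != 0);
    uint8_t pending = icr_read_and_clear(hw_icr);  /* the only read */
    if (pending & ICR_TA)   counts->ta++;
    if (pending & ICR_TB)   counts->tb++;
    if (pending & ICR_ALRM) counts->alarm++;
    if (pending & ICR_SP)   counts->sp++;
    if (pending & ICR_FLG)  counts->flag++;
}
```

Testing each flag with its own register read would lose every source after the first, because that first read already cleared them all.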

§11 — References & See Also

Companion Articles

Article	Relationship
address_space.md	Memory map tables (where things are) — this article covers how the bus reaches them
memory_types.md	Chip/Fast/Slow classification, MEMF flags — this article covers the physical transfer mechanics
dma_architecture.md	DMA slot scheduling and bandwidth — this article covers CPU bus cycles and register access
video_timing.md	Clock tree, video signal timing — this article covers how those clocks drive bus cycles
autoconfig.md	AutoConfig protocol and board enumeration
zorro_bus.md	Zorro II/III specifications, PCI bridge card catalog
Gary (OCS)	Detailed address decode logic and bus timeout
Fat Gary (ECS)	32-bit decode, SCSI, Zorro III glue
cia_chips.md	CIA register semantics, timer programming
ATA/ATAPI Protocol	IDE register access at `$DA0000` (Gayle), task file operations, PIO transfer mechanics

Primary Sources

Amiga Hardware Reference Manual (HRM) — Chapters 1 (System Overview), 7 (Appendix D: Hardware), Appendix B (Custom Registers)
MC68000 User's Manual — Chapter 5 (Signal Description), Chapter 6 (Bus Operation)
MC68040 User's Manual — Chapter 7 (Bus Operation), Chapter 8 (Cache)
MC68060 User's Manual — Chapter 7 (Bus Operation), Chapter 8 (Cache), Chapter 12 (MOVE16)
Amiga ROM Kernel Reference Manual: Libraries — exec.library cache functions
MOS 8520 CIA Datasheet — Timing specifications, ICR operation
