The Amiga's custom chipset displays graphics in **planar** format — each bitplane is a separate contiguous block of memory where pixel information is spread across layers. This was brilliant engineering in 1985: it matched the hardware DMA streaming pattern perfectly, made bandwidth scale linearly with color depth, and enabled Copper palette effects that cost zero memory writes.
But every rendering algorithm developed since 1990 — 3D rasterization, texture mapping, image decompression, PC game ports — produces output in **chunky** format: one byte per pixel, all color information packed contiguously. Converting chunky output to planar display is the single most CPU-intensive bottleneck in Amiga graphics. It consumed thousands of programmer-hours across the demoscene and directly determined which games could run at playable framerates.
On a 7 MHz 68000 with a 16-bit Chip RAM bus, a naive C2P conversion of a single 320×256 frame takes **over one second** — roughly 0.9 FPS. The demoscene's solution — a bit-level butterfly network in hand-tuned assembly — achieves the same conversion in **~35 milliseconds**, a 30× improvement that made 3D gaming possible on stock hardware.
This article covers every known approach — why planar exists, why the conversion is expensive (it is a 90° bit matrix transposition), and how each solution works from naive loops through the Kalms butterfly to Copper Chunky tricks, Akiko hardware, and RTG bypass. It also connects the Amiga's planar/chunky duality to modern concepts every developer already knows: SoA vs AoS layout, GPU texture swizzle, and SIMD transposition.
> The Akiko hardware article covers the CD32's dedicated C2P register interface. This article covers the *algorithm theory* that applies to every Amiga model, and the broader data-layout concepts that connect the Amiga to modern computing.
The Amiga's display DMA engine (Agnus in OCS, Alice in AGA) fetches pixel data from Chip RAM and feeds it to the video encoder in real time, synchronized to the electron beam. The DMA controller fetches 16-bit words on a fixed schedule and shifts bits outward to the video DAC. Each 16-bit word contains 16 bits for 16 adjacent pixels — all from the **same** bitplane. Only after an entire scanline of plane 0 is fetched does the DMA move to plane 1.
This is a **Structure of Arrays (SoA)** access pattern: the hardware streams one field (one bit) across many elements (pixels) sequentially. Planar layout is the natural storage for this — it puts every byte the DMA needs next at consecutive addresses.
### In Memory — Side by Side
Here are the same 8 pixels (16 colors, 4 bpp) stored both ways:
**Chunky — packed pixel:** each pixel is a complete color value. Multiple pixels are packed into bytes at the smallest power-of-2 bit width that fits the color depth. For 16 colors (4 bpp), each byte holds 2 pixels as **nibbles** (4-bit halves).
```
Address: $0000 $0001 $0002 $0003
┌────┬────┐┌────┬────┐┌────┬────┐┌────┬────┐
Byte │$A3 ││$7F ││$10 ││$6C │
└────┴────┘└────┴────┘└────┴────┘└────┴────┘
Nibble: hi lo hi lo hi lo hi lo
$A $3 $7 $F $1 $0 $6 $C
Pixel: p0 p1 p2 p3 p4 p5 p6 p7
```
Reading pixel 4: one read at `$0002`, extract high nibble → `$1`.
**Planar — 4 bitplanes:** each plane is a **separate contiguous memory block**. Planes live at different base addresses.
```
Plane 0 base = $08000 Plane 2 base = $10000
Addr Byte Addr Byte
$08000 ┌────────┐ $10000 ┌────────┐
│ $4D │ ← bit0 of p0–p7 │ $E1 │ ← bit2 of p0–p7
└────────┘ $10001 └────────┘
$08001 ┌────────┐ ...
│ ... │ ← next 8 pixels
└────────┘
Plane 1 base = $0C000 Plane 3 base = $14000
Addr Byte Addr Byte
$0C000 ┌────────┐ $14000 ┌────────┐
│ $B2 │ ← bit1 of p0–p7 │ $2E │ ← bit3 of p0–p7
└────────┘ └────────┘
```
Reading pixel 4: read byte `$00` from plane 0 at `$08000` → bit 4 = `0`; read byte `$00` from plane 1 at `$0C000` → bit 4 = `0`; read byte `$00` from plane 2 at `$10000` → bit 4 = `1`; read byte `$00` from plane 3 at `$14000` → bit 4 = `0`. Collect: `0010` = `$2`. **Four separate memory accesses** vs chunky's one.
> [!NOTE]
> Each bitplane is a standalone byte array of size `(width × height) / 8`. The layout within each plane is linear — plane N, byte 0 is bits N of pixels 0–7. This fundamental indirection means pixel (x,y) lives at address `base[N] + y × (width/8) + x/8` at bit position `7 − (x mod 8)`.
### Why It Was Brilliant
| Advantage | Explanation |
|---|---|
| **Bandwidth efficiency** | Planar allocates exactly the bits needed: 4 colors = 2 bitplanes = 2 bits per pixel. A chunky (packed pixel) format must round up to the next power-of-2 boundary — so 4 colors requires 4 bpp (wasting 2 bits per pixel). DMA fetches only the planes actually used, never padding. This compounds: 32 colors costs 5 bitplanes (5 bpp) vs 8 bpp chunky — a 37% saving. |
| **Scalable color depth** | Adding a bitplane doubles the color count without redesigning the display engine. OCS: 1–6 planes. AGA: 1–8 planes. |
| **Zero-cost color cycling** | Rotating palette indices only requires changing color registers. Copper-driven palette splits re-color large screen regions for free. |
| **Blitter efficiency** | Blitting a masked sprite at 4 colors touches only 2 planes (2 blits), not 4× the data. |
| **Copper integration** | The Copper can change palette registers mid-scanline, multiplying colors without more bitplanes (the basis of HAM modes). |
### Why It Became a Problem
Planar graphics are optimal **when you render directly to bitplanes** — which all Amiga software did through the late 1980s. 2D sprites, tile maps, and vector graphics are all trivially expressible in planar format.
But starting around 1990, three things changed:
1.**3D texture mapping** appeared (demos like *Juggler*, then games like *Hunter*). Perspective-correct texel sampling requires per-pixel color lookups. A planar format means every pixel read requires 8 separate memory accesses (one per bitplane).
2.**PC game ports** became commercially important. PC VGA uses chunky Mode 13h (320×200×256). Porting a DOS game to Amiga requires converting every frame from chunky to planar — or rewriting the entire renderer for planar output.
3.**Real-time effects** like alpha blending, lighting, and particle systems all operate on complete pixel values — you need all 8 bits of a pixel's color to compute the result. Planar storage makes these algorithms hit memory 8× more often.
A chunky buffer is the **natural intermediate format** for a GPU-style rendering pipeline. The problem is getting that buffer onto the planar screen.
Each byte = one pixel. Linear, simple, cache-friendly for rendering. This is how **every modern GPU**, every PC VGA card, every framebuffer since 1990 stores pixels.
Each pixel's color index is **split across N separate memory regions** (bitplanes). For 8-bit pixels (8 bitplanes), each bitplane stores one bit of every pixel:
To read pixel 0's color: collect bit 0 from each of the 8 planes → `10101100` = `$AC`. The 8 planes are **not interleaved** in standard Amiga layout — each is a separate contiguous memory block.
> The Amiga's planar format means memory addresses in bitplane memory don't correspond to pixel positions linearly. Plane 0 byte 0 contains bits for pixels 0–7. Plane 1 byte 0 contains bits for the same pixels 0–7. The byte offset for pixel N is `(N / 8)` in **every** plane. The bit position is `7 - (N mod 8)`. This is the fundamental indirection all planar-format API developers must internalize.
C2P is a **bit matrix transposition**. Given 32 chunky pixels (each 8 bits wide), you have a 32×8 bit matrix (32 rows × 8 columns). C2P transposes this to an 8×32 matrix (8 bitplanes × 32 bits each):
This is equivalent to a 90° bit rotation. On a modern CPU with SIMD, this is trivial. On a 68020 with 8 data registers and no bit-parallel instructions, it is an algorithmic challenge that consumed thousands of programmer-hours across the demoscene.
The fastest software C2P routines (including Kalms' library) use a **butterfly network** — the same structure used in the Fast Fourier Transform (FFT) and Batcher's bitonic sort. The idea: instead of extracting each bit independently, swap bits in pairs of registers at progressively smaller strides until every bit lands in its correct bitplane position.
```mermaid
graph TB
subgraph "8 Chunky Longwords (32 pixels)"
C0["L0: P0-P3"]
C1["L1: P4-P7"]
C2["L2: P8-P11"]
C3["L3: P12-P15"]
C4["L4: P16-P19"]
C5["L5: P20-P23"]
C6["L6: P24-P27"]
C7["L7: P28-P31"]
end
subgraph "Pass 1: 16-bit swap"
P1["Swap word halves<br/>mask=$0000FFFF"]
end
subgraph "Pass 2: 8-bit swap"
P2["Swap bytes<br/>mask=$00FF00FF"]
end
subgraph "Pass 3: 4-bit swap"
P3["Swap nibbles<br/>mask=$0F0F0F0F"]
end
subgraph "Pass 4: 2-bit swap"
P4["Swap bit-pairs<br/>mask=$33333333"]
end
subgraph "Pass 5: 1-bit swap"
P5["Swap single bits<br/>mask=$55555555"]
end
subgraph "8 Planar Longwords"
BP0["D0: Plane 0 bits"]
BP1["D1: Plane 1 bits"]
BP2["D2: Plane 2 bits"]
BP3["D3: Plane 3 bits"]
BP4["D4: Plane 4 bits"]
BP5["D5: Plane 5 bits"]
BP6["D6: Plane 6 bits"]
BP7["D7: Plane 7 bits"]
end
C0 & C1 & C2 & C3 & C4 & C5 & C6 & C7 --> P1
P1 --> P2
P2 --> P3
P3 --> P4
P4 --> P5
P5 --> BP0
P5 --> BP1
P5 --> BP2
P5 --> BP3
P5 --> BP4
P5 --> BP5
P5 --> BP6
P5 --> BP7
```
Each pass uses a specific bit mask and shift distance. After all 5 passes, each data register contains exactly one bitplane's data for 32 pixels. The entire network requires **5 x 4 merges = 20 merge operations** for 8-bitplane conversion.
> Numbers assume 8 bitplanes, naive C code, no caching. On 68000 the inner loop is even slower because BTST/BSET to bitplane memory costs extra cycles on the 16-bit bus.
This is the standard approach used by virtually all serious Amiga C2P routines. Invented independently by several demoscene coders and formalized by **Mikael Kalms** (Kalmalyzer) and others.
Instead of processing one pixel at a time, load **32 pixels** (8 longwords = 256 bits) into CPU registers and perform a series of **bit-level swap operations** (called "merges") that progressively rearrange the bits into planar order. Each merge pass swaps bits at a different granularity: 16-bit blocks, then 8-bit, then 4-bit, 2-bit, and 1-bit.
This is exactly a **butterfly network** — the same structure used in the FFT (Fast Fourier Transform) and Batcher's bitonic sort.
### The Merge Primitive
The fundamental building block is a 2-register swap that exchanges bits at a given stride:
```asm
; merge(d0, d1, mask, shift)
; Exchanges bits between d0 and d1 where mask selects which bits to swap
To understand *why* this works, follow bit 5 of pixel 17 through all 5 passes:
```
Start: P17.b5 in d1 (loaded with pixels 16-19), at bit position 13 in the longword
(bit 5 of pixel 17 = bit 13, since pixels 16-19 = bits 31-0)
Pass 1 (16-bit swap with d3, mask=$0000FFFF):
d1.b13 swaps with d3.b29 → bit moves to d3
Now d3 holds bits for pixels (0,16,1,17...) pattern
Pass 2 (8-bit swap with d4, mask=$00FF00FF):
d3 byte containing our bit swaps with d4 → bit moves to d4
Byte boundaries begin to separate bitplanes from pixels
Pass 3 (4-bit swap, mask=$0F0F0F0F):
d4 nibble containing our bit swaps → bit moves to d5
Pass 4 (2-bit swap, mask=$33333333):
d5 bit-pair containing our bit swaps → bit moves to d6
Pass 5 (1-bit swap, mask=$55555555):
d6 individual bit swaps → bit lands in d6 at position 17
d6 now holds ONLY bitplane 5 bits = p0.b5 p1.b5 ... p31.b5
```
After the network, d6 contains exactly one bitplane worth of data — bit 5 of all 32 pixels. Each register naturally collects all bits of the same bit position. This is why each pass halves the block size: 16→8→4→2→1. At the end, every register is a pure bitplane.
> [!NOTE]
> The register-to-register mapping shown above is conceptual. In real code, the merge operations are optimized so that the final result lands in the correct register without explicit moves between passes. The Kalms routine uses this to avoid intermediate stores to memory.
### Full Working Example — Kalms-Style C2P (68030)
The complete, self-contained C2P routine below is a clean-room implementation based on the Kalms 68030 5-pass merge algorithm. It compiles with `vasm` and can be dropped directly into any Amiga project. For the original production-ready source, grab [c2p1x1_8_c5_030.s](https://github.com/Kalmalyzer/kalms-c2p/blob/main/normal/c2p1x1_8_c5_030.s) from the Kalms repository.
> This is a real, tested routine derived from the Kalms library (Public Domain). It has been simplified slightly for readability — production code from the Kalms archive uses additional tricks: self-modifying code for bitplane size parameters, separate unrolling for 68040/68060 with `MOVE16` writes, and optional Blitter-cooperative variants. For the absolute fastest routines for your specific CPU, clone [kalms-c2p on GitHub](https://github.com/Kalmalyzer/kalms-c2p) and benchmark the variants.
The CD32's Akiko chip implements C2P in dedicated silicon. The CPU feeds 8 longwords of chunky data to register `$B80030` and reads back 8 longwords of planar data from the same address.
Akiko's throughput is approximately the same as optimized software C2P on the 68020 because both are limited by the Chip RAM bus bandwidth (~3.5 MB/s shared). On faster CPUs (68040/060), software C2P **outperforms** Akiko because the CPU can process data faster than the register interface can shuttle it.
> **FPGA Implementation**: On MiSTer, Akiko C2P must be implemented as a state machine triggered by register writes to `$B80030`. The CPU writes 8 longwords to the same address; the state machine reads them sequentially, performs bit transposition in hardware, and presents the 8 planar longwords on subsequent reads from `$B80030`. Throughput is bounded by Chip RAM bus bandwidth (~3.5 MB/s shared), not by the state machine speed — a naive FGPA Akiko implementation that runs at bus speed is already cycle-accurate.
The Blitter can be used as part of a C2P pipeline, but it cannot perform the transposition itself. Typical usage:
1. CPU performs the merge/butterfly in registers → outputs planar longwords to a temporary buffer in Chip RAM
2. Blitter copies the planar data from the temporary buffer to the screen's bitplanes with correct modulo
This approach **overlaps** CPU computation with Blitter DMA — while the Blitter writes frame N's planes to the screen, the CPU computes frame N+1's transposition.
```
Time ──────────────────────────────────────────────────────→
> On 68040/060 systems, the Blitter is often **slower** than letting the CPU do both the merge and the writes via `MOVE16` (68040) or unrolled `MOVEM.L`. The Blitter's 16-bit bus (even in AGA FMODEx4) adds DMA contention that may actually slow down the CPU's merge passes.
**Performance vs baseline:**
| CPU | Naive (baseline) | CPU-only Merge | +Blitter DMA overlap | vs Baseline |
> The Blitter adds ~20% throughput by overlapping the Chip RAM write phase with the next frame's CPU merge. On 68040+, skip the Blitter — MOVE16 is faster.
---
## The Copper Chunky Trick — Pseudo-Chunky Without C2P
### The Idea
There is a radical alternative to C2P that avoids conversion entirely: use the Copper's `MOVE` instruction to write color values directly to a palette register in sync with the electron beam. By changing `COLOR00` at every pixel position on every scanline, you effectively create a **chunky display** with no bitplanes at all.
This technique, known as "Copper Chunky", was used by several influential AGA games:
- **Breathless** (Fields of Vision, 1996) — enhanced version with textured floors
- Various demoscene productions for real-time 3D effects
### How It Works
```
For each scanline y (0..255):
For each pixel x (0..319):
1. WAIT for (x, y) — sync to exact beam position
2. MOVE chunky[x,y] -> COLOR00 — set the pixel color
```
Each pixel requires 2 Copper instructions (WAIT + MOVE). At 320x256 = 81,920 pixels, you need **163,840** instructions. The Copperlist size is 163,840 * 4 bytes = **~640 KB** — larger than typical available Chip RAM.
### Practical Limits
| Constraint | Detail |
|---|---|
| **Resolution** | Practical maximum ~160x128 at full color; 320x256 possible only with pixel doubling (2x1 or 1x2) |
| **Colors** | Only one color register changed per pixel (typically COLOR00) |
| **Copperlist size** | 640 KB for full 320x256 — often exceeds available Chip RAM below 2MB |
| **CPU Cost** | CPU must rebuild the entire Copperlist each frame — effectively a memory copy with format conversion |
**Performance vs baseline (2x1 low-res, 160x128 effective on A1200):**
> At 2x1 low-res, Copper Chunky achieves comparable framerates to software C2P with much less code complexity. At 1x1 full resolution (320x256), the Copperlist is too large to fit in Chip RAM — software C2P wins. See the decision flowchart below.
### Hybrid Approach (Used in Games)
Most games used a hybrid: 1-2 bitplanes for UI/HUD elements, reserving `COLOR00` for the Copper Chunky 3D viewport. This is how Alien Breed 3D displays both a rendered 3D view and on-screen status bar.
### When Copper Chunky Wins
| Scenario | Recommendation |
|---|---|
| Stock A1200, 2x1 low-res 3D viewport | **Copper Chunky** — simple, no assembly C2P code to write |
| Full resolution, any color depth | **Software C2P** — Copperlist too large for 1x1 full res |
| Accelerated Amiga (68040/060) | **Software C2P** — CPU is far faster than building Copperlists |
> [!NOTE]
> Copper Chunky and C2P are not mutually exclusive. Some demos use Copper Chunky for one screen region while simultaneously using C2P for another. The Copperlist can intermix WAIT/MOVE instructions with normal bitplane display controls.
> **FPGA/Emulation Timing Sensitivity**: Copper Chunky is extremely sensitive to Copper timing accuracy. Each `WAIT` must compare against the exact beam counter value, and each `MOVE` to `COLOR00` must take effect at the correct pixel position. DMA contention between Copper and bitplane fetches shifts pixel placement, and emulators must model the Copper's 2-cycle instruction latency (WAIT=2 cycles, MOVE=2 cycles). A one-pixel offset produces visible image shearing. The Minimig-AGA core on MiSTer implements this, but early UAE versions did not — if your Copper Chunky output shows "striped" patterns under emulation, test on MiSTer or real hardware before debugging the algorithm.
AmigaOS 3.0+ provides `WriteChunkyPixels()` in `graphics.library`, which performs C2P conversion internally using the best available method:
```c
#include <graphics/gfx.h>
WriteChunkyPixels(rp,
xstart, ystart, xstop, ystop,
chunky_buffer, chunky_bytes_per_row);
```
On CD32, this function auto-detects Akiko and uses it. On other AGA machines, it uses an internal software C2P. However, the OS implementation is **not** as fast as the best demoscene routines — it prioritises correctness and generality over raw speed.
**Performance vs baseline:** ~20–28x (hardware-dependent). On CD32 with Akiko: ~31x. On stock AGA with internal C2P: ~20x. Still an enormous improvement over the naive loop and requires zero assembly code.
The ultimate solution to C2P is to **not do it at all**. Retargetable Graphics (RTG) cards like the Picasso IV, CyberVision 64, and MiSTer's virtual `uaegfx` provide a chunky framebuffer directly. The rendering engine writes chunky pixels to VRAM, and the card's RAMDAC/scaler converts them to video output.
The irony: RTG cards must perform the **reverse** conversion (P2C — planar-to-chunky) when legacy planar software runs on an RTG screen. The CyberVision 64 included a dedicated **Roxxler** chip for this. Without hardware help, P2C on software is equally expensive.
### uaegfx — The Virtual RTG Card That Makes C2P Optional
**`uaegfx`** is a software-defined RTG card that presents a chunky framebuffer to AmigaOS through the Picasso96 API. It was originally developed for UAE (the Unix Amiga Emulator) and was later ported to WinUAE, FS-UAE, and the MiSTer Minimig-AGA FPGA core.
Instead of a physical RAMDAC, the emulator or FPGA core reads the chunky framebuffer directly from host memory and composites it onto the output display. The Amiga-side Picasso96 driver (`uaegfx.card` / `uaegfx.info`) talks to the emulator through a shared-memory protocol — no C2P, no Blitter, no Copper tricks. The CPU writes RGBA bytes and the screen updates.
On MiSTer, RTG requires a 68020 CPU (TG68K core), Picasso96 installed with the `uaegfx` driver, and the [MiSTer_RTG.lha](https://github.com/MiSTer-devel/Minimig-AGA_MiSTer) package. The framebuffer lives in the FPGA's DDR memory and the scaler reads it directly — no Chip RAM bus contention at all.
> On MiSTer, RTG outputs exclusively through the HDMI scaler by default. To see RTG on the VGA port, set `vga_scaler=1` in `MiSTer.ini`. RTG is also restricted to 68020 CPU mode — it is disabled when 68000 is selected because the TG68K 68000 core lacks the address space to map the framebuffer.
## The Bigger Picture — Data Layout Transformation
C2P is not unique to the Amiga. It is an instance of a fundamental problem in computer architecture: **transforming data layout between Structure-of-Arrays (SoA) and Array-of-Structures (AoS)**.
The Amiga's planar format is **SoA**: each bitplane is an array of one field (one bit) across all pixels. The chunky format is **AoS**: each pixel's fields (all 8 bits) are packed together.
### Where This Problem Appears Today
| Domain | SoA (Planar-Like) | AoS (Chunky-Like) | Conversion |
The Amiga's custom DMA engine streams bitplane data to the display sequentially — plane 0 for the whole line, then plane 1, etc. This is a **SoA access pattern**, perfectly matched by the planar layout. The CPU, which wants to set a single pixel's complete color, has the opposite need — it wants **AoS**.
| **FMODE (fetch width)** | GPU memory bus width (256/384/512-bit) | Wider bus = more data per DMA cycle |
### GPU Texture Swizzle — The Modern Akiko
Modern GPUs store textures in **swizzled** (Morton/Z-order) layouts rather than linear row-major order. This is architecturally identical to what the Amiga does with planar bitmaps: the hardware's memory access pattern doesn't match the CPU's logical layout, so a dedicated hardware unit transparently converts between them.
```
Linear (CPU view): Morton/Z-order (GPU internal):
0 1 2 3 0 1 4 5
4 5 6 7 → 2 3 6 7
8 9 10 11 8 9 12 13
12 13 14 15 10 11 14 15
```
When you call `glTexImage2D()` or `vkCmdCopyBufferToImage()`, the GPU driver performs a layout conversion from linear (CPU-friendly) to swizzled (GPU-cache-friendly). This is the exact same class of operation as Amiga C2P — a hardware-accelerated data layout transformation that is invisible to the application programmer.
---
## Performance Comparison Across Eras
| System | Data Layout Problem | Throughput | Method |
| Apple M2 (2022) | SoA↔AoS for ML tensors | ~100 GB/s | Hardware AMX |
The throughput gap tells the story: what consumed 100% of a 68020's capability is handled by a dedicated hardware unit at 200,000× the bandwidth on modern silicon. But the fundamental problem — **data layout mismatch between producer and consumer** — is identical.
---
## Historical Timeline
| Year | Event |
|---|---|
| 1985 | Amiga launches with planar display. C2P not needed — all software renders directly to bitplanes |
| 1989 | First 3D demos appear (Juggler, etc.). Rendering in chunky buffers starts |
3.**Use a proven C2P library** — Kalms C2P ([GitHub](https://github.com/Kalmalyzer/kalms-c2p) / [lysator](https://www.lysator.liu.se/~mikaelk/c2p/)) is the gold standard
START["I render into a chunky buffer"] --> RTG{"RTG screen?"}
RTG -->|"Yes"| NO_C2P["No C2P — write directly to VRAM"]
RTG -->|"No"| AGA{"AGA hardware?"}
AGA -->|"No (OCS/ECS)"| SW_OCS["Software C2P — Kalms 68000"]
AGA -->|"Yes"| CD32{"CD32 with Akiko?"}
CD32 -->|"Yes"| AKIKO["Akiko hardware C2P"]
CD32 -->|"No"| CPU{"Which CPU?"}
CPU -->|"68020/030"| RES{"Resolution?"}
RES -->|"2x1 low-res"| COPPER["Copper Chunky"]
RES -->|"1x1 full-res"| SW_020["Kalms C2P — 68020"]
CPU -->|"68040/060"| SW_040["Kalms C2P — 68040"]
```
> [!TIP]
> If prototyping, use `WriteChunkyPixels()`. It auto-detects Akiko and uses a decent software C2P. After profiling, switch to the dedicated path.
---
## Named Antipatterns
These are bad habits that compile, produce visible output, and are dangerously easy not to fix.
### 1. "The Bit-by-Bit Beginner" — Per-Pixel RMW on Every Bitplane
**Symptom:** 0.9 FPS. CPU time spent in OR.B/AND.B instructions.
```c
/* BROKEN — read-modify-write per plane per pixel */
for (int i = 0; i <total;i++){
int off = i / 8, bit = 7 - (i & 7);
UBYTE c = chunky[i];
for (int p = 0; p <8;p++){
if (c & (1 <<p))
planes[p][off] |= (1 <<bit);
else
planes[p][off] &= ~(1 <<bit);
}
}
```
**Why it fails:** Each inner loop iteration costs ~140 cycles (read byte, test, set/clear, write). 655,360 RMW operations = ~91 million cycles.
**Fix:** Process 32 pixels at once in registers, write planar longwords in one shot (see Solution 2).
### 2. "The Chip RAM Trap" — Chunky Buffer in Chip RAM
**Symptom:** C2P stalls unpredictably when display DMA is active.
```c
/* BROKEN — chunky buffer competes with bitplane DMA */
UBYTE *chunky = AllocMem(w * h, MEMF_CHIP);
```
**Why it fails:** The CPU reads chunky data during butterfly merge. If in Chip RAM, every read contends with display DMA. Both the CPU and Agnus/Alice stall.
**Fix:**
```c
/* CORRECT — chunky in Fast RAM, only planar output in Chip RAM */
### 3. "The Odd-Width Screen" — Non-Multiple-of-32 Width
**Symptom:** C2P runs at half expected speed.
```c
/* BROKEN — 336 pixels wide */
#define WIDTH 336
```
**Why it fails:** Bitplane row length (WIDTH/8) must be longword-aligned for optimal DMA. Non-aligned rows break caching and add per-line overhead.
**Fix:** Always use widths that are multiples of 32.
### 4. "The Shared Blitter" — Using Blitter on 68040+
**Symptom:** Blitter "assistance" slows down 68040 conversion.
**Why it fails:** The Blitter has a 16-bit interface to Chip RAM. The 68040 MOVE16 moves 16 bytes at once, consuming fewer bus cycles. On 68060, the superscalar core outperforms the Blitter in all scenarios.
**Fix:** On 68040/060, let the CPU handle merge + planar writes. Skip the Blitter entirely.
---
## Pitfalls — Bad Code That Compiles
### 1. Missing Cache Flush on 68040/060 After C2P
On 68040+, CPU caches may hold stale data after DMA writes. If C2P stores planar output via MOVE16 and the display hardware reads those same addresses immediately, stale cache lines may be served.
```asm
; WRONG — no cache flush after C2P
bsr c2p_convert
; display may read stale data
; CORRECT
bsr c2p_convert
moveq #CACRF_ClearD,d0
movec d0,cacr ; flush data cache
; safe to display now
```
### 2. Double Buffering Without Triple Buffering
With a single chunky buffer, the pipeline is fully serial — render, then C2P, then display — and the CPU idles through most of each frame. Even double buffering helps little because the C2P step still stalls everything:
```c
/* BAD — single buffer forces CPU to idle after each step */
render_to(chunky);
c2p_convert(chunky, screen); /* CPU stalled during C2P */
WaitTOF(); /* CPU stalled waiting for vblank */
```
**Result:** ~30% CPU utilization — the CPU spins idle ~70% of the time.
**Fix — triple buffering (good):** Decouple all three stages so they overlap:
- Buffer C is **displayed** by DMA (bitplane fetch)
- Buffer B is being **C2P'd** by the CPU (merge/butterfly)
- Buffer A is being **rendered** by the CPU (game/demo logic)
```c
/* GOOD — three buffers allow full pipelining */
int cur = 0;
while (running) {
c2p_convert(chunky[(cur+2)%3], screen[(cur+1)%3]); /* C2P B → C */
render_to(chunky[cur]); /* render A */
WaitTOF();
set_bitplanes(screen[cur]); /* display C */
cur = (cur + 1) % 3;
}
```
**Result:** ~70% CPU utilization — ~2.3x more work done per frame vs double-buffered.
### 3. Benchmarking Without Forbid()
```c
/* WRONG — includes task switches in measurement */
ULONG start = ReadEClock();
c2p_convert(...);
ULONG end = ReadEClock();
/* CORRECT */
Forbid();
ULONG start = ReadEClock();
c2p_convert(...);
ULONG end = ReadEClock();
Permit();
```
---
## Debugging C2P — Common Visual Artifacts
When your C2P routine produces output but it looks wrong, the artifact pattern usually tells you exactly which stage is broken. Here are the most common failures and how to diagnose them:
**What you see:** Individual pixels or small clusters flicker between correct and wrong colors, sometimes synchronized with scrolling or animation.
**Cause:** On 68040/060, data-cache lines hold stale data after C2P writes. The display DMA reads from Chip RAM but the CPU may still serve cached values if coherency isn't enforced.
**Fix:**
```asm
bsr c2p_convert
moveq #CACRF_ClearD,d0
movec d0,cacr ; flush data cache after C2P
```
### 2. Every Nth Pixel Wrong (Bit Mask Error)
**What you see:** A regular pattern — every 2nd, 4th, 8th, or 16th pixel has the wrong color while neighbors are correct.
**Cause:** One of the merge masks is wrong. If every 16th pixel fails, the 16-bit swap mask (`$0000FFFF`) has a typo. If every 2nd pixel fails, the 1-bit swap mask (`$55555555`) is wrong.
**Fix:** Verify each pass uses the exact mask from the pass structure table above. A single wrong nibble in a mask constant corrupts ONE pass, producing a very regular artifact.
### 3. Horizontal Stripes / Scanline Mismatch
**What you see:** Horizontal bands of correct and corrupted data, often 1–8 scanlines tall.
**Cause:** Bitplane modulo (row-to-row offset) is misconfigured. The C2P routine writes 32 bytes per planar row, but the display fetch expects a different pitch. Common on non-320-width screens.
**Fix:** Ensure `WIDTH/8` is longword-aligned and matches the bitplane modulo in `BPL1MOD`/`BPL2MOD` registers. Always use widths that are multiples of 32.
### 4. Bit-Inverted Color (Complemented Plane)
**What you see:** Colors are mostly correct but "off" — bright where dark should be, or certain color ranges are inverted.
**Cause:** A single bitplane was written with inverted bits (OR used where AND was needed, or EOR instead of OR). This flips all palette indices that have that bit set.
**Fix:** Check the final store loop — ensure MOVE.L writes, not BSET/BCLR. A common mistake is using `BSET` to set bits in pre-cleared bitplane memory, then forgetting to clear the buffer between frames.
### 5. "Garbage Garden" — Random Colored Snow
**What you see:** Entire screen filled with rapidly changing random colors, with occasional flashes of correct data.
**Cause:** Buffer pointer is uninitialized or stale. The C2P routine is reading from the wrong chunky buffer address or writing to the wrong bitplane base.
**Fix:** Trace your buffer pointer arithmetic. Ensure `A0` (chunky) and `A1` (bitplanes) are set correctly before calling `c2p_convert`. Triple-buffer pointer rotation bugs are the most common culprit.
### Quick Diagnostic Table
| Artifact Pattern | Most Likely Cause | Check First |
The Blitter cannot transpose bits — it only manipulates 16-bit words in linear rows. C2P is fundamentally a transposition operation, which requires bit-level swapping that minterms cannot express. The Blitter can help write converted data to bitplanes (Solution 4), but the actual transposition must happen in CPU registers.
### Why are odd screen widths like 336 pixels much slower?
Bitplane modulo calculations on non-aligned rows force the display DMA controller to calculate non-standard memory addresses. At 336 pixels wide, each row is 42 bytes — not longword-aligned, causing extra memory cycles and breaking I-cache patterns in the butterfly merge.
No. Akiko is a custom ASIC that physically only exists in the CD32; it is integrated with the CD-ROM controller on the same die. There is no expansion card addressing `$B80000` on any other Amiga model. On MiSTer, Akiko can be implemented as a soft peripheral in the FPGA core — see the FPGA implementation note in [Solution 3](#solution-3--akiko-hardware-c2p-cd32-only).
### Why doesn't C2P scale linearly with 68060 clock speed?
C2P performance is bounded by Chip RAM bandwidth (~3.5 MB/s shared), not by CPU speed. Once the butterfly merge executes faster than memory can deliver words, bus limitations dominate. On a 50 MHz 68060, the merge takes ~1.3 ms for 320x256, but writing 8 bitplanes to Chip RAM takes ~23 ms — the write phase dominates.
### Does P2C (Planar-to-Chunky) have the same problem?
Yes. Reading planar pixel data requires 8 memory accesses (one per bitplane), then bit-packing these into chunky bytes. The computational complexity is identical because it is the same bit matrix transposition — just in the reverse direction. RTG cards that support legacy planar software include hardware P2C (e.g., CyberVision 64 Roxxler chip).
- Intel — [Structure of Arrays vs Array of Structures](https://www.intel.com/content/www/us/en/developer/articles/technical/memory-layout-transformations.html) — modern SoA/AoS guide