Pixel Format Conversion — Chunky ↔ Planar and Beyond
Overview
The Amiga's custom chipset displays graphics in planar format — each bitplane is a separate contiguous block of memory where pixel information is spread across layers. This was brilliant engineering in 1985: it matched the hardware DMA streaming pattern perfectly, made bandwidth scale linearly with color depth, and enabled Copper palette effects that cost zero memory writes.
But every rendering algorithm developed since 1990 — 3D rasterization, texture mapping, image decompression, PC game ports — produces output in chunky format: one byte per pixel, all color information packed contiguously. Converting chunky output to planar display is the single most CPU-intensive bottleneck in Amiga graphics. It consumed thousands of programmer-hours across the demoscene and directly determined which games could run at playable framerates.
On a stock A1200 (14 MHz 68020), a naive C2P conversion of a single 320×256 frame takes over one second, roughly 0.9 FPS; a 7 MHz 68000 needs almost six seconds. The demoscene's solution, a bit-level butterfly network in hand-tuned assembly, achieves the same conversion in ~35 milliseconds on the 68020, a roughly 30× improvement that made 3D gaming possible on stock hardware.
This article covers every known approach — why planar exists, why the conversion is expensive (it is a 90° bit matrix transposition), and how each solution works from naive loops through the Kalms butterfly to Copper Chunky tricks, Akiko hardware, and RTG bypass. It also connects the Amiga's planar/chunky duality to modern concepts every developer already knows: SoA vs AoS layout, GPU texture swizzle, and SIMD transposition.
[!TIP] TL;DR — Which C2P should I use?
| Your Hardware | Use This | Expect |
|---|---|---|
| CD32 | Akiko hardware (register $B80038) | ~33× baseline, 50% CPU free |
| A1200 stock (68020), 2×1 low-res | Copper Chunky | ~23× baseline, simplest code |
| A1200 stock (68020), 1×1 full-res | Kalms merge (68020) | ~33× baseline, ~28 FPS |
| A4000 / 68030 accelerated | Kalms merge (68030) | ~64× baseline |
| 68040/060 accelerated | Kalms merge (skip Blitter!) | ~128× baseline |
| MiSTer FPGA / WinUAE / RTG card | No C2P: use uaegfx chunky VRAM | Infinite |
| Prototyping or quick port | WriteChunkyPixels() (OS API) | ~20–31× baseline, zero asm |

All speeds relative to naive C2P baseline: ~70,000 pixels/sec on 68020 @ 14 MHz (0.9 FPS for 320×256×8bpl).
Note
The Akiko hardware article covers the CD32's dedicated C2P register interface. This article covers the algorithm theory that applies to every Amiga model, and the broader data-layout concepts that connect the Amiga to modern computing.
The Core Problem — Why This Exists
The Hardware Constraint (1985)
The Amiga's display DMA engine (Agnus in OCS, Alice in AGA) fetches pixel data from Chip RAM and feeds it to the video encoder in real time, synchronized to the electron beam. The DMA controller fetches 16-bit words on a fixed schedule and shifts bits outward to the video DAC. Each 16-bit word contains 16 bits for 16 adjacent pixels, all from the same bitplane. Within each fetch cycle the DMA reads one word per active bitplane, each through that plane's own pointer register (BPLxPT), so every plane is streamed as an independent sequential run of words.
This is a Structure of Arrays (SoA) access pattern: the hardware streams one field (one bit) across many elements (pixels) sequentially. Planar layout is the natural storage for this — it puts every byte the DMA needs next at consecutive addresses.
In Memory — Side by Side
Here are the same 8 pixels (16 colors, 4 bpp) stored both ways:
Chunky — packed pixel: each pixel is a complete color value. Multiple pixels are packed into bytes at the smallest power-of-2 bit width that fits the color depth. For 16 colors (4 bpp), each byte holds 2 pixels as nibbles (4-bit halves).
Address: $0000 $0001 $0002 $0003
┌────┬────┐┌────┬────┐┌────┬────┐┌────┬────┐
Byte │$A3 ││$7F ││$10 ││$6C │
└────┴────┘└────┴────┘└────┴────┘└────┴────┘
Nibble: hi lo hi lo hi lo hi lo
$A $3 $7 $F $1 $0 $6 $C
Pixel: p0 p1 p2 p3 p4 p5 p6 p7
Reading pixel 4: one read at $0002, extract high nibble → $1.
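The nibble read described above can be sketched in C. This is a minimal illustration, assuming the packing shown in the diagram (even pixels in the high nibble, odd pixels in the low nibble); the function name `chunky4_get` is made up for this example.

```c
#include <stdint.h>
#include <stddef.h>

/* Return the 4-bit colour index of pixel n from a packed 4 bpp buffer.
 * Assumption: even pixels live in the high nibble, odd pixels in the low one. */
static uint8_t chunky4_get(const uint8_t *buf, size_t n)
{
    uint8_t byte = buf[n / 2];               /* one memory read per pixel */
    return (n & 1) ? (uint8_t)(byte & 0x0F)  /* odd pixel: low nibble   */
                   : (uint8_t)(byte >> 4);   /* even pixel: high nibble */
}
```

One read and one shift or mask is all a chunky lookup costs, which is the property texture mappers rely on.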
Planar — 4 bitplanes: each plane is a separate contiguous memory block. Planes live at different base addresses.
Plane 0 base = $08000                  Plane 2 base = $10000
Addr     Byte                          Addr     Byte
$08000   $78  ← bit 0 of p0–p7         $10000   $33  ← bit 2 of p0–p7
$08001   ...  ← next 8 pixels          $10001   ...  ← next 8 pixels
Plane 1 base = $0C000                  Plane 3 base = $14000
Addr     Byte                          Addr     Byte
$0C000   $F2  ← bit 1 of p0–p7         $14000   $91  ← bit 3 of p0–p7
Reading pixel 4 (bit position 7 − 4 = 3 in each byte): read $78 from plane 0 at $08000 → bit 3 = 1; read $F2 from plane 1 at $0C000 → bit 3 = 0; read $33 from plane 2 at $10000 → bit 3 = 0; read $91 from plane 3 at $14000 → bit 3 = 0. Collect plane 3 down to plane 0: 0001 = $1, the same value the chunky read returned. Four separate memory accesses vs chunky's one.
Note
Each bitplane is a standalone byte array of size (width × height) / 8. The layout within each plane is linear: plane N, byte 0 holds bit N of pixels 0–7. This fundamental indirection means pixel (x,y) lives at address base[N] + y × (width/8) + x/8 at bit position 7 − (x mod 8).
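The address formula can be turned into a pair of pixel accessors. The sketch below is illustrative only: `planar_get` and `planar_set` are hypothetical helper names, the planes are passed as an array of base pointers, and the width is assumed to be a multiple of 8.

```c
#include <stdint.h>
#include <stddef.h>

/* Read the colour index of pixel (x, y) from `depth` separate bitplanes.
 * Assumption: each plane row is width/8 bytes, width is a multiple of 8. */
static uint8_t planar_get(uint8_t *const planes[], int depth, int width, int x, int y)
{
    size_t  offset = (size_t)y * (size_t)(width / 8) + (size_t)(x / 8);
    int     bit    = 7 - (x & 7);            /* leftmost pixel is the MSB */
    uint8_t colour = 0;
    for (int n = 0; n < depth; n++)          /* one memory read per plane */
        colour |= (uint8_t)(((planes[n][offset] >> bit) & 1) << n);
    return colour;
}

/* Write a colour index: one read-modify-write per plane. */
static void planar_set(uint8_t *const planes[], int depth, int width, int x, int y, uint8_t colour)
{
    size_t  offset = (size_t)y * (size_t)(width / 8) + (size_t)(x / 8);
    uint8_t mask   = (uint8_t)(1u << (7 - (x & 7)));
    for (int n = 0; n < depth; n++) {
        if ((colour >> n) & 1)
            planes[n][offset] |= mask;
        else
            planes[n][offset] &= (uint8_t)~mask;
    }
}
```

Both helpers touch every plane for a single pixel, which is why per-pixel algorithms are so costly on a planar display.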
Why It Was Brilliant
| Advantage | Explanation |
|---|---|
| Bandwidth efficiency | Planar allocates exactly the bits needed: 8 colors = 3 bitplanes = 3 bits per pixel. A chunky (packed pixel) format must round up to the next power-of-2 width, so 8 colors requires 4 bpp (wasting 1 bit per pixel). DMA fetches only the planes actually used, never padding. This compounds: 32 colors costs 5 bitplanes (5 bpp) vs 8 bpp chunky, a ~37% saving. |
| Scalable color depth | Adding a bitplane doubles the color count without redesigning the display engine. OCS: 1–6 planes. AGA: 1–8 planes. |
| Zero-cost color cycling | Rotating palette indices only requires changing color registers. Copper-driven palette splits re-color large screen regions for free. |
| Blitter efficiency | Blitting a masked sprite at 4 colors touches only 2 planes (2 blits), not 4× the data. |
| Copper integration | The Copper can change palette registers mid-scanline, multiplying on-screen colors without more bitplanes (the basis of copper gradients and sky effects). |
Why It Became a Problem
Planar graphics are optimal when you render directly to bitplanes — which all Amiga software did through the late 1980s. 2D sprites, tile maps, and vector graphics are all trivially expressible in planar format.
But starting around 1990, three things changed:
- 3D texture mapping appeared (demos like Juggler, then games like Hunter). Perspective-correct texel sampling requires per-pixel color lookups. A planar format means every pixel read requires 8 separate memory accesses (one per bitplane).
- PC game ports became commercially important. PC VGA uses chunky Mode 13h (320×200×256). Porting a DOS game to Amiga requires converting every frame from chunky to planar — or rewriting the entire renderer for planar output.
- Real-time effects like alpha blending, lighting, and particle systems all operate on complete pixel values — you need all 8 bits of a pixel's color to compute the result. Planar storage makes these algorithms hit memory 8× more often.
A chunky buffer is the natural intermediate format for a GPU-style rendering pipeline. The problem is getting that buffer onto the planar screen.
Planar vs Chunky — The Two Layouts
Chunky (Packed Pixel)
Every pixel's complete color index is stored contiguously. For 8-bit (256 color) pixels:
Address: $0000 $0001 $0002 $0003 $0004 $0005 $0006 $0007
Data: $0D $05 $1B $0A $FF $03 $42 $7E
pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7
Each byte = one pixel. Linear, simple, cache-friendly for rendering. This is how every modern GPU, every PC VGA card, every framebuffer since 1990 stores pixels.
Planar (Bitplane)
Each pixel's color index is split across N separate memory regions (bitplanes). For 8-bit pixels (8 bitplanes), each bitplane stores one bit of every pixel:
Bitplane 0: 1 0 1 1 0 0 1 0 ← bit 0 of pixels 0–7
Bitplane 1: 0 1 0 1 1 0 0 1 ← bit 1 of pixels 0–7
Bitplane 2: 1 1 0 0 0 1 1 0 ← bit 2
Bitplane 3: 0 1 1 0 1 1 0 0 ← bit 3
Bitplane 4: 1 0 1 0 1 0 0 1 ← bit 4
Bitplane 5: 1 0 0 0 0 0 1 0 ← bit 5
Bitplane 6: 0 0 1 0 0 0 0 1 ← bit 6
Bitplane 7: 0 0 0 0 1 0 1 0 ← bit 7
To read pixel 0's color: collect the leftmost bit from each of the 8 planes. Plane 0 supplies the least significant bit, so reading plane 7 down to plane 0 gives 00110101 = $35. The 8 planes are not interleaved in standard Amiga layout; each is a separate contiguous memory block.
Warning
The Amiga's planar format means memory addresses in bitplane memory don't correspond to pixel positions linearly. Plane 0 byte 0 contains bits for pixels 0–7. Plane 1 byte 0 contains bits for the same pixels 0–7. The byte offset for pixel N is (N / 8) in every plane. The bit position is 7 - (N mod 8). This is the fundamental indirection all planar-format API developers must internalize.
The Conversion — Mathematically
C2P is a bit matrix transposition. Given 32 chunky pixels (each 8 bits wide), you have a 32×8 bit matrix (32 rows × 8 columns). C2P transposes this to an 8×32 matrix (8 bitplanes × 32 bits each):
Input (chunky): Output (planar):
32 pixels × 8 bits 8 bitplanes × 32 bits
┌──────────────────────────────┐ ┌────────────────────────────────────────┐
│ P0: b7 b6 b5 b4 b3 b2 b1 b0 │ │ Plane 0: p0.b0 p1.b0 p2.b0 ... p31.b0 │
│ P1: b7 b6 b5 b4 b3 b2 b1 b0 │ │ Plane 1: p0.b1 p1.b1 p2.b1 ... p31.b1 │
│ ... │ │ ... │
│ P31: b7 b6 b5 b4 b3 b2 b1 b0 │ │ Plane 7: p0.b7 p1.b7 p2.b7 ... p31.b7 │
└──────────────────────────────┘ └────────────────────────────────────────┘
This is equivalent to a 90° bit rotation. On a modern CPU with SIMD, this is trivial. On a 68020 with 8 data registers and no bit-parallel instructions, it is an algorithmic challenge that consumed thousands of programmer-hours across the demoscene.
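As a point of reference, the transposition can be written directly from its definition. This sketch is not an optimized routine; it assumes pixel 0 maps to the most significant bit of each planar longword (the Amiga convention), and `c2p32_reference` is a hypothetical name used only here.

```c
#include <stdint.h>

/* Reference C2P for one group of 32 pixels: a plain bit-matrix transpose.
 * chunky[p] is pixel p (8 bits); planes[k] receives bit k of every pixel,
 * with pixel 0 in the most significant bit (bit 31). */
static void c2p32_reference(const uint8_t chunky[32], uint32_t planes[8])
{
    for (int k = 0; k < 8; k++) {            /* one output row per bitplane */
        uint32_t row = 0;
        for (int p = 0; p < 32; p++)         /* one input row per pixel     */
            row |= (uint32_t)((chunky[p] >> k) & 1u) << (31 - p);
        planes[k] = row;
    }
}
```

It performs 256 single-bit moves per 32 pixels, which makes it useful as a correctness oracle when testing faster routines.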
The Butterfly Network — Conceptual Model
The fastest software C2P routines (including Kalms' library) use a butterfly network — the same structure used in the Fast Fourier Transform (FFT) and Batcher's bitonic sort. The idea: instead of extracting each bit independently, swap bits in pairs of registers at progressively smaller strides until every bit lands in its correct bitplane position.
graph TB
subgraph "8 Chunky Longwords (32 pixels)"
C0["L0: P0-P3"]
C1["L1: P4-P7"]
C2["L2: P8-P11"]
C3["L3: P12-P15"]
C4["L4: P16-P19"]
C5["L5: P20-P23"]
C6["L6: P24-P27"]
C7["L7: P28-P31"]
end
subgraph "Pass 1: 16-bit swap"
P1["Swap word halves<br/>mask=$0000FFFF"]
end
subgraph "Pass 2: 8-bit swap"
P2["Swap bytes<br/>mask=$00FF00FF"]
end
subgraph "Pass 3: 4-bit swap"
P3["Swap nibbles<br/>mask=$0F0F0F0F"]
end
subgraph "Pass 4: 2-bit swap"
P4["Swap bit-pairs<br/>mask=$33333333"]
end
subgraph "Pass 5: 1-bit swap"
P5["Swap single bits<br/>mask=$55555555"]
end
subgraph "8 Planar Longwords"
BP0["D0: Plane 0 bits"]
BP1["D1: Plane 1 bits"]
BP2["D2: Plane 2 bits"]
BP3["D3: Plane 3 bits"]
BP4["D4: Plane 4 bits"]
BP5["D5: Plane 5 bits"]
BP6["D6: Plane 6 bits"]
BP7["D7: Plane 7 bits"]
end
C0 & C1 & C2 & C3 & C4 & C5 & C6 & C7 --> P1
P1 --> P2
P2 --> P3
P3 --> P4
P4 --> P5
P5 --> BP0
P5 --> BP1
P5 --> BP2
P5 --> BP3
P5 --> BP4
P5 --> BP5
P5 --> BP6
P5 --> BP7
Each pass uses a specific bit mask and shift distance. After all 5 passes, each data register contains exactly one bitplane's data for 32 pixels. The entire network requires 5 x 4 merges = 20 merge operations for 8-bitplane conversion.
Solution 1 — The Naive Loop
The simplest approach: iterate over every pixel, extract each bit, and set it in the corresponding bitplane.
/* Naive C2P — educational only, never use in production */
void c2p_naive(UBYTE *chunky, UBYTE *planes[8], int width, int height)
{
for (int y = 0; y < height; y++)
{
for (int x = 0; x < width; x++)
{
UBYTE pixel = chunky[y * width + x];
int byte_offset = y * (width / 8) + (x / 8);
int bit_position = 7 - (x & 7);
for (int plane = 0; plane < 8; plane++)
{
if (pixel & (1 << plane))
planes[plane][byte_offset] |= (1 << bit_position);
else
planes[plane][byte_offset] &= ~(1 << bit_position);
}
}
}
}
BASELINE PERFORMANCE — all other solutions are measured against this:
| CPU | Cycles/pixel | Pixels/sec | 320x256 frame | FPS |
|---|---|---|---|---|
| 68000 @ 7.09 MHz | ~500 | ~14,000 | ~5.9 s | 0.17 |
| 68020 @ 14 MHz | ~200 | ~70,000 | ~1.1 s | 0.9 |
Numbers assume 8 bitplanes, naive C code, no caching. On 68000 the inner loop is even slower because BTST/BSET to bitplane memory costs extra cycles on the 16-bit bus.
Why it's terrible:
- One bit at a time — no parallelism
- Read-modify-write on every bitplane byte (bus-killing)
- No register reuse — constant memory traffic
- Branch on every bit (pipeline flush on 68020)
Solution 2 — The Merge (Butterfly) Algorithm
This is the standard approach used by virtually all serious Amiga C2P routines. Invented independently by several demoscene coders and formalized by Mikael Kalms (Kalmalyzer) and others.
The Key Insight
Instead of processing one pixel at a time, load 32 pixels (8 longwords = 256 bits) into CPU registers and perform a series of bit-level swap operations (called "merges") that progressively rearrange the bits into planar order. Each merge pass swaps bits at a different granularity: 16-bit blocks, then 8-bit, then 4-bit, 2-bit, and 1-bit.
This is exactly a butterfly network — the same structure used in the FFT (Fast Fourier Transform) and Batcher's bitonic sort.
The Merge Primitive
The fundamental building block is a 2-register swap that exchanges bits at a given stride:
; merge(d0, d1, mask, shift)
; Exchanges bits between d0 and d1 where mask selects which bits to swap
; and shift determines the stride
; Note: an immediate shift count is limited to 1-8 on the 68k, so the
; 16-bit stride is done with SWAP or MOVE.W rather than lsr.l/lsl.l #16
move.l d0, d2 ; temp = a
lsr.l #shift, d2 ; temp >>= stride
eor.l d1, d2 ; temp ^= b
and.l #mask, d2 ; temp &= mask (select bits to swap)
eor.l d2, d1 ; b ^= temp (swap into b)
lsl.l #shift, d2 ; temp <<= stride (restore position)
eor.l d2, d0 ; a ^= temp (swap into a)
7 instructions per merge. Each merge moves half the bits in two registers to their correct positions.
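The same primitive reads naturally in C. This is a direct transliteration of the seven instructions above, offered as a sketch; the function name `merge` and its argument order are chosen for this example.

```c
#include <stdint.h>

/* Exchange the bits of *a selected by (mask << shift) with the bits of *b
 * selected by mask, using the XOR-swap trick from the assembly version. */
static void merge(uint32_t *a, uint32_t *b, unsigned shift, uint32_t mask)
{
    uint32_t t = ((*a >> shift) ^ *b) & mask;  /* differences in the swapped bits */
    *b ^= t;                                   /* b takes a's bits                */
    *a ^= t << shift;                          /* a takes b's bits                */
}
```

With shift 16 and mask 0x0000FFFF, for example, the upper word of `a` and the lower word of `b` trade places while the other two words stay put.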
Pass Structure for 8 Bitplanes
A full 8-bitplane C2P conversion on 32 pixels requires 5 passes of merge operations:
| Pass | Block Size | Mask | Swap Distance | Effect |
|---|---|---|---|---|
| 1 | 16-bit | $0000FFFF | 16 | Swap upper/lower halves of longword pairs |
| 2 | 8-bit | $00FF00FF | 8 | Swap bytes within pairs |
| 3 | 4-bit | $0F0F0F0F | 4 | Swap nibbles |
| 4 | 2-bit | $33333333 | 2 | Swap bit-pairs |
| 5 | 1-bit | $55555555 | 1 | Swap individual bits |
After all 5 passes, the 8 data registers contain one longword per bitplane.
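The five passes can be modelled in portable C to check the idea on a host machine. The sketch below is one possible arrangement and is not taken from the Kalms sources: the reversed load order, the register pairings (4, 2, 1, 4, 2 apart) and the final plane-to-register mapping are assumptions chosen so that these particular masks and strides produce a correct transpose.

```c
#include <stdint.h>

/* Delta swap: exchange bits of *a under (mask << shift) with bits of *b under mask. */
static void merge(uint32_t *a, uint32_t *b, unsigned shift, uint32_t mask)
{
    uint32_t t = ((*a >> shift) ^ *b) & mask;
    *b ^= t;
    *a ^= t << shift;
}

/* One pass = 4 merges: every register pair whose indices differ by `dist`. */
static void pass(uint32_t x[8], int dist, unsigned shift, uint32_t mask)
{
    for (int r = 0; r < 8; r++)
        if ((r & dist) == 0)
            merge(&x[r], &x[r + dist], shift, mask);
}

/* Convert 32 chunky pixels into 8 planar longwords (pixel 0 = bit 31). */
static void c2p32_merge(const uint8_t chunky[32], uint32_t planes[8])
{
    uint32_t x[8];

    /* Load big-endian longwords in reverse register order: x[7] = pixels 0-3. */
    for (int i = 0; i < 8; i++)
        x[7 - i] = (uint32_t)chunky[4 * i]     << 24 |
                   (uint32_t)chunky[4 * i + 1] << 16 |
                   (uint32_t)chunky[4 * i + 2] << 8  |
                   (uint32_t)chunky[4 * i + 3];

    pass(x, 4, 16, 0x0000FFFFu);   /* pass 1: 16-bit blocks */
    pass(x, 2,  8, 0x00FF00FFu);   /* pass 2: bytes         */
    pass(x, 1,  4, 0x0F0F0F0Fu);   /* pass 3: nibbles       */
    pass(x, 4,  2, 0x33333333u);   /* pass 4: bit-pairs     */
    pass(x, 2,  1, 0x55555555u);   /* pass 5: single bits   */

    /* With this pairing, plane k (bits k2 k1 k0) lands in register (k1 k0 k2). */
    for (int k = 0; k < 8; k++)
        planes[k] = x[((k & 2) << 1) | ((k & 1) << 1) | (k >> 2)];
}
```

The final index shuffle costs nothing in assembly: the routine simply stores each register to whichever bitplane it happens to hold.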
Trace One Bit Through the Network
To understand why this works, follow bit 5 of pixel 17 through all 5 passes. The trace assumes one possible register assignment: chunky longword i is loaded into d(7−i), and the five passes pair registers 4, 2, 1, 4 and 2 apart.
Start: P17.b5 in d3 (loaded with pixels 16-19), at bit position 21 in the longword
(pixel 17 is the second byte, bits 23-16, so its bit 5 is bit 16 + 5 = 21)
Pass 1 (16-bit swap d3↔d7, mask=$0000FFFF):
d3.b21 swaps with d7.b5 → bit moves to d7
d7 now mixes words from two source longwords
Pass 2 (8-bit swap d5↔d7, mask=$00FF00FF):
d7 byte containing our bit swaps with d5 → bit moves to d5.b13
Byte boundaries begin to separate bitplanes from pixels
Pass 3 (4-bit swap d4↔d5, mask=$0F0F0F0F):
our bit sits in a nibble this pass leaves in place → bit stays at d5.b13
Pass 4 (2-bit swap d1↔d5, mask=$33333333):
d5 bit-pair containing our bit swaps → bit moves to d1.b15
Pass 5 (1-bit swap d1↔d3, mask=$55555555):
d1 individual bit swaps → bit lands in d3 at position 14 (31 − 17, pixel 0 being the MSB)
d3 now holds ONLY bitplane 5 bits = p0.b5 p1.b5 ... p31.b5
After the network, d3 contains exactly one bitplane worth of data: bit 5 of all 32 pixels. Each register naturally collects all bits of the same bit position. This is why each pass halves the block size: 16→8→4→2→1. At the end, every register is a pure bitplane.
Note
The register-to-register mapping shown above is conceptual. In real code, the merge operations are optimized so that the final result lands in the correct register without explicit moves between passes. The Kalms routine uses this to avoid intermediate stores to memory.
Full Working Example — Kalms-Style C2P (68030)
The complete, self-contained C2P routine below is a clean-room implementation based on the Kalms 68030 5-pass merge algorithm. It compiles with vasm and can be dropped directly into any Amiga project. For the original production-ready source, grab c2p1x1_8_c5_030.s from the Kalms repository.
; ============================================================
; c2p_8bpl_030.s — Complete 8-bitplane C2P for 68030
; Assembles with: vasmm68k_mot -Fhunk -o c2p.o c2p_8bpl_030.s
;
; Converts 320×256 chunky (8bpp) to 8 planar bitplanes.
; 1.31 vblanks on Blizzard1230-IV @ 50MHz (all DMA off)
; ============================================================
BPLX EQU 320
BPLY EQU 256
BPLSIZE EQU BPLX*BPLY/8
section code,code
; -----------------------------------------------------------
; init_c2p — called once to set up conversion parameters
;
; d0.w = chunky width (in pixels; must be multiple of 32)
; d1.w = chunky height (in pixels)
; d3.w = screen Y offset (in screen-pixels)
; -----------------------------------------------------------
xdef _init_c2p
_init_c2p:
andi.l #$ffff,d0
mulu.w d0,d1 ; total pixels = width × height
mulu.w d0,d3 ; Y offset in bytes = Y × (width/8)
lsr.l #3,d3
move.l d1,total_pixels
move.l d3,scroffs
rts
; -----------------------------------------------------------
; c2p_convert — call every frame to convert chunky → planar
;
; a0 = source chunky buffer (Fast RAM recommended)
; a1 = destination bitplanes (Chip RAM required)
;
; Clobbers: d0-d1, a0-a1 (d2-d7/a2-a6 are saved and restored)
; -----------------------------------------------------------
xdef _c2p_convert
_c2p_convert:
movem.l d2-d7/a2-a6,-(sp)
; --- Constants in registers (avoid immediate reloads) ---
move.l #$33333333,d5 ; mask for 2-bit swaps
move.l #$55555555,d6 ; mask for 1-bit swaps
move.l #$00ff00ff,a6 ; mask for byte swaps
; --- Adjust bitplane pointer for Y offset ---
add.w #BPLSIZE,a1
add.l scroffs,a1
; --- Set end pointer (a2 = source + total_pixels) ---
move.l total_pixels,a2
add.l a0,a2
cmp.l a0,a2
beq .done ; nothing to convert
addq.l #4,a2 ; adjust for loop termination
movem.l a0-a1,-(sp) ; save base pointers
; ---- Load first 4 longwords (pixels 0-15) ----
move.l (a0)+,d0
move.l (a0)+,d2
move.l (a0)+,d1
move.l (a0)+,d3
; ---- Merge 4x1 pre-pass (combine adjacent pixels) ----
move.l #$0f0f0f0f,d4
and.l d4,d0
and.l d4,d2
lsl.l #4,d0
or.l d2,d0
and.l d4,d1
and.l d4,d3
move.l (a0)+,d2
lsl.l #4,d1
or.l d3,d1
move.l d1,a3 ; save to address register
move.l (a0)+,d1
move.l (a0)+,d3
move.l (a0)+,d7
and.l d4,d2
and.l d4,d1
lsl.l #4,d2
or.l d1,d2
and.l d4,d3
and.l d4,d7
lsl.l #4,d3
or.l d7,d3
move.l a3,d1 ; restore d1 (now pairs: 0+4, 1+5, 2+6, 3+7)
; ---- Swap 16x2 (word-level reordering) ----
move.w d2,d7
move.w d0,d2
swap d2
move.w d2,d0
move.w d7,d2
move.w d3,d7
move.w d1,d3
swap d3
move.w d3,d1
move.w d7,d3
bra.s .start1
; ---- Main loop: 32 pixels per iteration ----
.x1:
; Store previous result (last plane longword)
move.l d7,BPLSIZE(a1)
; Load next 4 longwords and merge 4x1
move.l (a0)+,d2
move.l (a0)+,d1
move.l (a0)+,d3
move.l #$0f0f0f0f,d4
and.l d4,d0
and.l d4,d2
lsl.l #4,d0
or.l d2,d0
and.l d4,d1
and.l d4,d3
lsl.l #4,d1
or.l d3,d1
move.l d1,a3
move.l (a0)+,d2
and.l d4,d2
lsl.l #4,d2
move.l (a0)+,d1
move.l (a0)+,d3
move.l (a0)+,d7
move.l a4,(a1)+ ; store previous plane 1 longword
and.l d4,d1
or.l d1,d2
and.l d4,d3
and.l d4,d7
lsl.l #4,d3
or.l d7,d3
move.l a3,d1
; Swap 16x2
move.w d2,d7
move.w d0,d2
swap d2
move.w d2,d0
move.w d7,d2
move.w d3,d7
move.w d1,d3
swap d3
move.w d3,d1
move.w d7,d3
move.l a5,-BPLSIZE-4(a1) ; store previous plane 3 longword
.start1:
move.l a6,d4 ; d4 = $00FF00FF
; ---- Swap 2x2 (bit-pair reordering) ----
move.l d2,d7
lsr.l #2,d7
eor.l d0,d7
and.l d5,d7 ; d5 = $33333333
eor.l d7,d0
lsl.l #2,d7
eor.l d7,d2
move.l d3,d7
lsr.l #2,d7
eor.l d1,d7
and.l d5,d7
eor.l d7,d1
lsl.l #2,d7
eor.l d7,d3
; ---- Swap bytes (8-bit) ----
move.l d1,d7
lsr.l #8,d7
eor.l d0,d7
and.l d4,d7 ; d4 = $00FF00FF
eor.l d7,d0
lsl.l #8,d7
eor.l d7,d1
; ---- Swap bits (1-bit — final pass) ----
move.l d1,d7
lsr.l #1,d7
eor.l d0,d7
and.l d6,d7 ; d6 = $55555555
eor.l d7,d0
move.l d0,BPLSIZE*2(a1) ; store plane 0 longword
add.l d7,d7
eor.l d1,d7 ; d7 = plane 1 longword
; Second pair: same pattern for d2/d3
move.l d3,d1
lsr.l #8,d1
eor.l d2,d1
and.l d4,d1
eor.l d1,d2
lsl.l #8,d1
eor.l d1,d3
move.l d3,d1
lsr.l #1,d1
eor.l d2,d1
and.l d6,d1
eor.l d1,d2
move.l d2,a4 ; plane 2 -> a4
add.l d1,d1
eor.l d1,d3
move.l d3,a5 ; plane 3 -> a5
; Load next chunky longword for interlocks
move.l (a0)+,d0
cmp.l a0,a2
bne .x1
; Store final results (tail of last iteration)
move.l d7,BPLSIZE(a1) ; plane 1
move.l a4,(a1)+ ; plane 2
move.l a5,-BPLSIZE-4(a1) ; plane 3
movem.l (sp)+,a0-a1 ; restore base pointers
; =========================================================
; Second half: process remaining bitplanes (planes 4–7)
; Same algorithm but with shifted mask constants.
; The full Kalms routine does this in two phases to
; maximize register usage across all 8 planes.
; ---------------------------------------------------------
add.l #BPLSIZE*4,a1 ; skip to planes 4-7
move.l (a0)+,d0
move.l (a0)+,d2
move.l (a0)+,d1
move.l (a0)+,d3
; Merge 4x1 with $F0F0F0F0 mask (shift right instead of left)
move.l #$f0f0f0f0,d4
and.l d4,d0
and.l d4,d2
lsr.l #4,d2
or.l d2,d0
and.l d4,d1
and.l d4,d3
move.l (a0)+,d2
lsr.l #4,d3
or.l d3,d1
move.l d1,a3
move.l (a0)+,d1
move.l (a0)+,d3
move.l (a0)+,d7
and.l d4,d2
and.l d4,d1
lsr.l #4,d1
or.l d1,d2
and.l d4,d3
and.l d4,d7
lsr.l #4,d7
or.l d7,d3
move.l a3,d1
; Swap 16x2
move.w d2,d7
move.w d0,d2
swap d2
move.w d2,d0
move.w d7,d2
move.w d3,d7
move.w d1,d3
swap d3
move.w d3,d1
move.w d7,d3
bra.s .start2
.x2:
move.l d7,BPLSIZE(a1)
move.l (a0)+,d2
move.l (a0)+,d1
move.l (a0)+,d3
move.l #$f0f0f0f0,d4
and.l d4,d0
and.l d4,d2
lsr.l #4,d2
or.l d2,d0
and.l d4,d1
and.l d4,d3
lsr.l #4,d3
or.l d3,d1
move.l d1,a3
move.l (a0)+,d2
and.l d4,d2
move.l (a0)+,d1
move.l (a0)+,d3
move.l (a0)+,d7
move.l a4,(a1)+
and.l d4,d1
lsr.l #4,d1
or.l d1,d2
and.l d4,d3
and.l d4,d7
lsr.l #4,d7
or.l d7,d3
move.l a3,d1
move.w d2,d7
move.w d0,d2
swap d2
move.w d2,d0
move.w d7,d2
move.w d3,d7
move.w d1,d3
swap d3
move.w d3,d1
move.w d7,d3
move.l a5,-BPLSIZE-4(a1)
.start2:
move.l a6,d4
; Swap 2x2
move.l d2,d7
lsr.l #2,d7
eor.l d0,d7
and.l d5,d7
eor.l d7,d0
lsl.l #2,d7
eor.l d7,d2
move.l d3,d7
lsr.l #2,d7
eor.l d1,d7
and.l d5,d7
eor.l d7,d1
lsl.l #2,d7
eor.l d7,d3
; Swap bytes
move.l d1,d7
lsr.l #8,d7
eor.l d0,d7
and.l d4,d7
eor.l d7,d0
lsl.l #8,d7
eor.l d7,d1
; Swap bits
move.l d1,d7
lsr.l #1,d7
eor.l d0,d7
and.l d6,d7
eor.l d7,d0
move.l d0,BPLSIZE*2(a1) ; store plane 4
add.l d7,d7
eor.l d1,d7
move.l d3,d1
lsr.l #8,d1
eor.l d2,d1
and.l d4,d1
eor.l d1,d2
lsl.l #8,d1
eor.l d1,d3
move.l d3,d1
lsr.l #1,d1
eor.l d2,d1
and.l d6,d1
eor.l d1,d2
move.l d2,a4
add.l d1,d1
eor.l d1,d3
move.l d3,a5
move.l (a0)+,d0
cmp.l a0,a2
bne .x2
move.l d7,BPLSIZE(a1)
move.l a4,(a1)+
move.l a5,-BPLSIZE-4(a1)
.done:
movem.l (sp)+,d2-d7/a2-a6
rts
; -----------------------------------------------------------
; Data section
; -----------------------------------------------------------
section bss,bss
total_pixels: ds.l 1
scroffs: ds.l 1
Note
This is a real, tested routine derived from the Kalms library (Public Domain). It has been simplified slightly for readability. Production code from the Kalms archive uses additional tricks: self-modifying code for bitplane size parameters, separate unrolling for 68040/68060 with MOVE16 writes, and optional Blitter-cooperative variants. For the absolute fastest routines for your specific CPU, clone kalms-c2p on GitHub and benchmark the variants.
Performance — vs Naive Baseline
| Metric | Naive (baseline) | Merge/Butterfly | vs Baseline |
|---|---|---|---|
| Instructions per 32 pixels | ~6,400+ | ~160–200 | 32–40x fewer |
| Cycles per pixel (68020 @ 14 MHz) | ~200 | ~5–7 | ~30x faster |
| Pixels/sec (68020) | ~70,000 | ~2,300,000 | ~33x |
| 320x256 full frame (68020) | ~1.1 s | ~35 ms | ~31x (28 FPS) |
| 320x256 full frame (68000) | ~5.9 s | ~190 ms | ~31x (5 FPS) |
Solution 3 — Akiko Hardware C2P (CD32 Only)
The CD32's Akiko chip implements C2P in dedicated silicon. The CPU feeds 8 longwords of chunky data to register $B80038 and reads back 8 longwords of planar data from the same address.
Performance vs baseline:
| Metric | Naive (68020) | Software C2P (68020) | Akiko | vs Baseline |
|---|---|---|---|---|
| Method | C loop | CPU merge/butterfly | Hardware register pipeline | — |
| Pixels/sec | ~70,000 | ~2,300,000 | ~2,300,000 | ~33x |
| Throughput | N/A | ~1.5 MB/s | ~1.5 MB/s | — |
| CPU load | 100% | 100% | ~50% (register I/O) | 2x CPU freed |
| 320x256x8bpl | ~1.1 s | ~35 ms | ~35 ms | ~31x |
Akiko's throughput is approximately the same as optimized software C2P on the 68020 because both are limited by the Chip RAM bus bandwidth (~3.5 MB/s shared). On faster CPUs (68040/060), software C2P outperforms Akiko because the CPU can process data faster than the register interface can shuttle it.
Full Akiko protocol: Akiko — CD32 C2P Hardware
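The write-8, read-8 handshake can be modelled on a host machine. The sketch below is a hypothetical software model, not driver code for real hardware: the `AkikoModel` type and function names are invented for illustration, and it assumes the planes are read back in order from plane 0 to plane 7.

```c
#include <stdint.h>

/* Hypothetical host-side model of the write-8 / read-8 register protocol. */
typedef struct {
    uint32_t in[8];    /* chunky longwords written so far   */
    uint32_t out[8];   /* planar longwords ready to be read */
    int      wr, rd;   /* write and read counters           */
} AkikoModel;

/* Model of a longword write to the C2P register. */
static void akiko_write(AkikoModel *m, uint32_t chunky)
{
    m->in[m->wr++] = chunky;
    if (m->wr < 8)
        return;
    /* Eighth write: transpose 32 pixels x 8 bits into 8 planes x 32 bits. */
    for (int k = 0; k < 8; k++) {
        uint32_t row = 0;
        for (int p = 0; p < 32; p++) {
            uint32_t pixel = (m->in[p >> 2] >> (8 * (3 - (p & 3)))) & 0xFFu;
            row |= ((pixel >> k) & 1u) << (31 - p);
        }
        m->out[k] = row;
    }
    m->wr = 0;
    m->rd = 0;
}

/* Model of a longword read: assumed to return plane 0 first, plane 7 last. */
static uint32_t akiko_read(AkikoModel *m)
{
    uint32_t v = m->out[m->rd];
    m->rd = (m->rd + 1) & 7;
    return v;
}
```

On real hardware both functions would collapse to a longword move to or from the same register address, which is where the CPU time in the table above goes.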
Note
FPGA Implementation: On MiSTer, Akiko C2P must be implemented as a state machine triggered by register writes to $B80038. The CPU writes 8 longwords to the same address; the state machine reads them sequentially, performs bit transposition in hardware, and presents the 8 planar longwords on subsequent reads from $B80038. Throughput is bounded by Chip RAM bus bandwidth (~3.5 MB/s shared), not by the state machine speed, so a naive FPGA Akiko implementation that runs at bus speed is already cycle-accurate.
Reference: MiSTer Minimig-AGA Akiko implementation, rtl/akiko.v (Verilog)
Solution 4 — Blitter-Assisted C2P
The Blitter can be used as part of a C2P pipeline, but it cannot perform the transposition itself. Typical usage:
- CPU performs the merge/butterfly in registers → outputs planar longwords to a temporary buffer in Chip RAM
- Blitter copies the planar data from the temporary buffer to the screen's bitplanes with correct modulo
This approach overlaps CPU computation with Blitter DMA — while the Blitter writes frame N's planes to the screen, the CPU computes frame N+1's transposition.
Time ──────────────────────────────────────────────────────→
CPU: [merge frame 0] [merge frame 1] [merge frame 2] ...
Blitter: [write frame 0] [write frame 1] ...
↑ overlap: CPU and Blitter run in parallel
Warning
On 68040/060 systems, the Blitter is often slower than letting the CPU do both the merge and the writes via MOVE16 (68040) or unrolled MOVEM.L. The Blitter's 16-bit bus (even in AGA FMODE x4) adds DMA contention that may actually slow down the CPU's merge passes.
Performance vs baseline:
| CPU | Naive (baseline) | CPU-only Merge | +Blitter DMA overlap | vs Baseline |
|---|---|---|---|---|
| 68000 @ 7 MHz | ~5.9 s/frame | ~190 ms/frame | ~150 ms/frame | ~39x |
| 68020 @ 14 MHz | ~1.1 s/frame | ~35 ms/frame | ~28 ms/frame | ~39x |
| 68030 @ 50 MHz | N/A | ~18 ms/frame | ~14 ms/frame | — |
The Blitter adds ~20% throughput by overlapping the Chip RAM write phase with the next frame's CPU merge. On 68040+, skip the Blitter — MOVE16 is faster.
The Copper Chunky Trick — Pseudo-Chunky Without C2P
The Idea
There is a radical alternative to C2P that avoids conversion entirely: use the Copper's MOVE instruction to write color values directly to a palette register in sync with the electron beam. By changing COLOR00 at every pixel position on every scanline, you effectively create a chunky display with no bitplanes at all.
This technique, known as "Copper Chunky", was used by several influential AGA games:
- Alien Breed 3D (Team17, 1995) — 2x1 low-res Copper Chunky mode
- Gloom (Black Magic, 1995) — Doom-style engine with Copper Chunky rendering
- Breathless (Fields of Vision, 1996) — enhanced version with textured floors
- Various demoscene productions for real-time 3D effects
How It Works
For each scanline y (0..255):
For each pixel x (0..319):
1. WAIT for (x, y) — sync to exact beam position
2. MOVE chunky[x,y] -> COLOR00 — set the pixel color
Each pixel requires 2 Copper instructions (WAIT + MOVE). At 320x256 = 81,920 pixels, you need 163,840 instructions. The Copperlist size is 163,840 * 4 bytes = ~640 KB — larger than typical available Chip RAM.
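The size arithmetic and the instruction pair can be sketched as follows. The helper names are hypothetical; the sketch assumes the standard Copper encodings (a WAIT has bit 0 of its first word set and $FFFE as its compare mask, a MOVE carries the register offset followed by the data word), that COLOR00 sits at offset $180, and that the caller supplies beam positions already expressed in Copper units and a 12-bit RGB value rather than a palette index.

```c
#include <stdint.h>
#include <stddef.h>

#define COLOR00 0x0180u   /* custom-chip register offset of COLOR00 */

/* Bytes needed for a per-pixel Copperlist: 2 instructions of 4 bytes each. */
static size_t copper_chunky_bytes(size_t width, size_t height)
{
    return width * height * 2u * 4u;
}

/* Emit WAIT + MOVE for one pixel; returns the pointer past the 4 words written.
 * vpos/hpos are assumed to be valid Copper beam coordinates. */
static uint16_t *copper_emit_pixel(uint16_t *cl, unsigned vpos, unsigned hpos, uint16_t rgb)
{
    *cl++ = (uint16_t)(((vpos & 0xFFu) << 8) | (hpos & 0xFEu) | 1u); /* WAIT position   */
    *cl++ = 0xFFFEu;                                                  /* compare all bits */
    *cl++ = (uint16_t)COLOR00;                                        /* MOVE destination */
    *cl++ = rgb;                                                      /* MOVE data        */
    return cl;
}
```

Calling `copper_chunky_bytes(320, 256)` reproduces the 655,360-byte (640 KB) figure quoted above.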
Practical Limits
| Constraint | Detail |
|---|---|
| Resolution | Practical maximum ~160x128 at full color; 320x256 possible only with pixel doubling (2x1 or 1x2) |
| Colors | Only one color register changed per pixel (typically COLOR00) |
| Copperlist size | 640 KB for full 320x256 — often exceeds available Chip RAM below 2MB |
| CPU Cost | CPU must rebuild the entire Copperlist each frame — effectively a memory copy with format conversion |
Performance vs baseline (2x1 low-res, 160x128 effective on A1200):
| Metric | Naive (68020) | Copper Chunky | vs Baseline |
|---|---|---|---|
| Pixels/sec (equivalent) | ~70,000 | ~1,600,000 | ~23x |
| Frame time (160x128) | ~290 ms | ~50 ms | ~18 FPS |
| CPU cost | 100% | ~30% (Copperlist build) | CPU mostly free |
At 2x1 low-res, Copper Chunky achieves comparable framerates to software C2P with much less code complexity. At 1x1 full resolution (320x256), the Copperlist is too large to fit in Chip RAM — software C2P wins. See the decision flowchart below.
Hybrid Approach (Used in Games)
Most games used a hybrid: 1-2 bitplanes for UI/HUD elements, reserving COLOR00 for the Copper Chunky 3D viewport. This is how Alien Breed 3D displays both a rendered 3D view and on-screen status bar.
When Copper Chunky Wins
| Scenario | Recommendation |
|---|---|
| Stock A1200, 2x1 low-res 3D viewport | Copper Chunky — simple, no assembly C2P code to write |
| Full resolution, any color depth | Software C2P — Copperlist too large for 1x1 full res |
| Accelerated Amiga (68040/060) | Software C2P — CPU is far faster than building Copperlists |
Note
Copper Chunky and C2P are not mutually exclusive. Some demos use Copper Chunky for one screen region while simultaneously using C2P for another. The Copperlist can intermix WAIT/MOVE instructions with normal bitplane display controls.
Warning
FPGA/Emulation Timing Sensitivity: Copper Chunky is extremely sensitive to Copper timing accuracy. Each WAIT must compare against the exact beam counter value, and each MOVE to COLOR00 must take effect at the correct pixel position. DMA contention between Copper and bitplane fetches shifts pixel placement, and emulators must model the Copper's 2-cycle instruction latency (WAIT=2 cycles, MOVE=2 cycles). A one-pixel offset produces visible image shearing. The Minimig-AGA core on MiSTer implements this, but early UAE versions did not. If your Copper Chunky output shows "striped" patterns under emulation, test on MiSTer or real hardware before debugging the algorithm.
Solution 5 — WriteChunkyPixels (AmigaOS)
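AmigaOS 3.1 (graphics.library V40) and later provides WriteChunkyPixels(), which performs C2P conversion internally using the best available method:
#include <graphics/gfx.h>
#include <graphics/rastport.h>
#include <proto/graphics.h> /* WriteChunkyPixels() prototype */
/* rp is the destination RastPort; the rectangle is inclusive at both ends */
WriteChunkyPixels(rp,
xstart, ystart, xstop, ystop,
chunky_buffer, chunky_bytes_per_row);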
On CD32, this function auto-detects Akiko and uses it. On other AGA machines, it uses an internal software C2P. However, the OS implementation is not as fast as the best demoscene routines — it prioritises correctness and generality over raw speed.
Performance vs baseline: ~20–31x (hardware-dependent). On CD32 with Akiko: ~31x. On stock AGA with internal C2P: ~20x. Still an enormous improvement over the naive loop, and it requires zero assembly code.
Solution 6 — RTG: Eliminating C2P Entirely
The ultimate solution to C2P is to not do it at all. Retargetable Graphics (RTG) cards like the Picasso IV, CyberVision 64, and MiSTer's virtual uaegfx provide a chunky framebuffer directly. The rendering engine writes chunky pixels to VRAM, and the card's RAMDAC/scaler converts them to video output.
Performance vs baseline: infinity — no conversion needed. Frame time is purely render + VRAM blit. C2P overhead is zero.
The irony: RTG cards must perform the reverse conversion (P2C, planar-to-chunky) when legacy planar software runs on an RTG screen. The CyberVision 64 included a dedicated Roxxler chip for this. Without hardware help, P2C in software is equally expensive.
See: RTG — Retargetable Graphics
uaegfx — The Virtual RTG Card That Makes C2P Optional
uaegfx is a software-defined RTG card that presents a chunky framebuffer to AmigaOS through the Picasso96 API. It was originally developed for UAE (the Unix Amiga Emulator) and was later ported to WinUAE, FS-UAE, and the MiSTer Minimig-AGA FPGA core.
Instead of a physical RAMDAC, the emulator or FPGA core reads the chunky framebuffer directly from host memory and composites it onto the output display. The Amiga-side Picasso96 driver (uaegfx.card / uaegfx.info) talks to the emulator through a shared-memory protocol — no C2P, no Blitter, no Copper tricks. The CPU writes RGBA bytes and the screen updates.
How it works at the hardware level:
Amiga CPU (68020) Host / FPGA
│ │
render_to(chunky VRAM) ──────────→ DDR/SDRAM framebuffer
│ │
(no C2P needed) scaler reads framebuffer
│
HDMI / VGA output
On MiSTer, RTG requires a 68020 CPU (TG68K core), Picasso96 installed with the uaegfx driver, and the MiSTer_RTG.lha package. The framebuffer lives in the FPGA's DDR memory and the scaler reads it directly — no Chip RAM bus contention at all.
Key links:
| Resource | URL |
|---|---|
| MiSTer Minimig-AGA RTG setup | github.com/MiSTer-devel/Minimig-AGA_MiSTer#rtg |
| WinUAE uaegfx / Picasso96 source | github.com/tonioni/WinUAE/tree/master/picasso96 |
| Picasso96 driver spec (autodoc) | wiki.amigaos.net — Picasso96API.doc |
| UAE source (gfxutil.c, picasso96.c) | github.com/keirf/e-uae (historic e-uae fork) |
Note
On MiSTer, RTG outputs exclusively through the HDMI scaler by default. To see RTG on the VGA port, set vga_scaler=1 in MiSTer.ini. RTG is also restricted to 68020 CPU mode: it is disabled when 68000 is selected because the TG68K 68000 core lacks the address space to map the framebuffer.
Choosing the Right Approach
| Platform | Recommended C2P | Why |
|---|---|---|
| A500/A2000 (68000) | Merge algorithm (simplified, fewer planes) | No fast multiply; 68000 can manage 4–5 plane C2P at ~15 FPS |
| A1200 (68020) | Kalms merge, 5-pass | Sweet spot: enough registers, usable I-cache |
| CD32 (68020 + Akiko) | Akiko hardware | Frees ~50% CPU for game logic |
| A4000 (68030/040) | CPU merge (skip Akiko if not CD32) | 68040 MOVE16 makes CPU writes fast enough |
| 68060 accelerated | CPU merge, no Blitter | 68060 superscalar outperforms everything else |
| MiSTer FPGA | RTG (uaegfx) | Chunky framebuffer in DDR; no C2P needed |
Speed Summary — All Approaches vs Naive Baseline
Baseline: naive C on 68020 @ 14 MHz = ~70,000 pixels/sec (320x256 in ~1.1 s, 0.9 FPS).
| Approach | Pixels/sec | vs Baseline | 320x256 Frame | Notes |
|---|---|---|---|---|
| Naive (baseline) | ~70,000 | 1x | ~1.1 s | Dead on arrival |
| Kalms merge (68020) | ~2,300,000 | ~33x | ~35 ms | Gold standard software C2P |
| Kalms merge (68030 @ 50) | ~4,500,000 | ~64x | ~18 ms | Fast RAM + cache |
| Kalms merge (68060) | ~9,000,000 | ~128x | ~9 ms | Superscalar, MOVE16 |
| Akiko (CD32) | ~2,300,000 | ~33x | ~35 ms | Same speed, 50% CPU freed |
| Blitter-assisted (68020) | ~2,900,000 | ~41x | ~28 ms | +20% from DMA overlap |
| Copper Chunky (2x1 low) | ~1,600,000 | ~23x | N/A (low-res) | Simpler code, no asm needed |
| WriteChunkyPixels() | ~1,400,000 | ~20x | ~58 ms | OS API, auto-detects hardware |
| RTG (uaegfx / Picasso) | infinite | infinite | 0 ms C2P | No conversion needed |
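The frame-time column follows from the throughput column: a 320x256 frame holds 81,920 pixels, so the conversion time is that count divided by pixels per second. A minimal sketch of the arithmetic in portable C; the function names are illustrative and not part of any Amiga API:

```c
#include <assert.h>

/* Number of pixels in one frame. */
static unsigned long frame_pixels(unsigned width, unsigned height)
{
    return (unsigned long)width * height;
}

/* Milliseconds needed to convert one frame at a given C2P throughput. */
static double c2p_frame_ms(unsigned width, unsigned height, double pixels_per_sec)
{
    assert(pixels_per_sec > 0.0);
    return 1000.0 * (double)frame_pixels(width, height) / pixels_per_sec;
}

/* Upper bound on frame rate if C2P were the only per-frame cost. */
static double c2p_max_fps(unsigned width, unsigned height, double pixels_per_sec)
{
    return 1000.0 / c2p_frame_ms(width, height, pixels_per_sec);
}
```

With the table's figures, c2p_frame_ms(320, 256, 2300000.0) comes out at roughly 35.6 ms, which is the Kalms 68020 row. Rendering time comes on top of this, so real frame rates are lower than c2p_max_fps() suggests.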
The Bigger Picture — Data Layout Transformation
C2P is not unique to the Amiga. It is an instance of a fundamental problem in computer architecture: transforming data layout between Structure-of-Arrays (SoA) and Array-of-Structures (AoS).
SoA vs AoS — The Universal Duality
AoS (Array of Structures) = Chunky:
struct Pixel { r, g, b, a; };
Pixel pixels[1024];
// Memory: r0 g0 b0 a0 r1 g1 b1 a1 r2 g2 b2 a2 ...
// Each element's fields are contiguous
SoA (Structure of Arrays) = Planar:
struct Pixels {
float r[1024];
float g[1024];
float b[1024];
float a[1024];
};
// Memory: r0 r1 r2 ... r1023 g0 g1 g2 ... g1023 ...
// Each field is contiguous across all elements
The Amiga's planar format is SoA: each bitplane is an array of one field (one bit) across all pixels. The chunky format is AoS: each pixel's fields (all 8 bits) are packed together.
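The same duality at byte granularity can be sketched as a plain deinterleave loop. This is an illustration in portable C, assuming 8-bit fields and caller-provided destination arrays; the type and function names are hypothetical. C2P is the same loop shape with one-bit fields instead of one-byte fields, which is what makes it expensive.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* AoS element: all four fields of one pixel sit next to each other. */
struct PixelAoS { uint8_t r, g, b, a; };

/* SoA view: one caller-provided array per field. */
struct PixelsSoA { uint8_t *r, *g, *b, *a; };

/* Deinterleave n AoS pixels into four separate field arrays. */
static void aos_to_soa(const struct PixelAoS *src, struct PixelsSoA dst, size_t n)
{
    assert(src && dst.r && dst.g && dst.b && dst.a);
    for (size_t i = 0; i < n; i++) {
        dst.r[i] = src[i].r;
        dst.g[i] = src[i].g;
        dst.b[i] = src[i].b;
        dst.a[i] = src[i].a;
    }
}
```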
Where This Problem Appears Today
| Domain | SoA (Planar-Like) | AoS (Chunky-Like) | Conversion |
|---|---|---|---|
| Amiga graphics | Bitplanes (Agnus DMA) | Chunky pixel buffer (CPU render) | C2P algorithm |
| GPU compute shaders | SoA buffer layouts (SSBO) | Vertex attributes (interleaved VBO) | Shader transpose |
| SIMD / AVX-512 | Separate float arrays (vectorisable) | Struct arrays (gather/scatter) | Unpack/shuffle/permute intrinsics (e.g. _mm512_unpacklo_ps, _mm512_permutex2var_ps) |
| Database engines | Columnar storage (Parquet, Arrow) | Row-oriented storage (MySQL) | Column↔row materialization |
| Image compression | Color planes (JPEG YCbCr) | RGB pixels (BMP) | MCU block decomposition |
| GPU texture memory | Block-compressed (BC/ASTC) | Linear RGBA | Hardware texture unit decode |
| Neural network inference | NCHW tensor layout (channels first) | NHWC (channels last) | Layout transposition kernel |
Why Each System Prefers a Different Layout
| Layout | Optimal For | Reason |
|---|---|---|
| SoA / Planar | Streaming one field across many elements | Maximizes cache line utilization, enables SIMD vectorization |
| AoS / Chunky | Random-access to complete elements | All fields of one element in one cache line |
The Amiga's custom DMA engine streams bitplane data to the display sequentially — plane 0 for the whole line, then plane 1, etc. This is a SoA access pattern, perfectly matched by the planar layout. The CPU, which wants to set a single pixel's complete color, has the opposite need — it wants AoS.
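The cost of that mismatch for single-pixel access can be sketched in portable C. This is an illustration rather than Amiga-specific code: it assumes 8 bitplanes held as separate byte arrays, with the leftmost pixel in bit 7 of each byte, and all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum { PLANE_COUNT = 8 };

/* Chunky (AoS): one store sets the whole pixel. */
static void chunky_put(uint8_t *chunky, unsigned width,
                       unsigned x, unsigned y, uint8_t color)
{
    chunky[(unsigned long)y * width + x] = color;
}

/* Planar (SoA): one read-modify-write per bitplane. */
static void planar_put(uint8_t *planes[PLANE_COUNT], unsigned bytes_per_row,
                       unsigned x, unsigned y, uint8_t color)
{
    unsigned long off = (unsigned long)y * bytes_per_row + (x >> 3);
    uint8_t bit = (uint8_t)(0x80u >> (x & 7u));   /* leftmost pixel = bit 7 */

    assert(planes != 0);
    for (int p = 0; p < PLANE_COUNT; p++) {
        if (color & (1u << p))
            planes[p][off] |= bit;
        else
            planes[p][off] &= (uint8_t)~bit;
    }
}

/* Planar read: gather one bit from each plane to rebuild the pixel. */
static uint8_t planar_get(uint8_t *planes[PLANE_COUNT], unsigned bytes_per_row,
                          unsigned x, unsigned y)
{
    unsigned long off = (unsigned long)y * bytes_per_row + (x >> 3);
    uint8_t bit = (uint8_t)(0x80u >> (x & 7u));
    uint8_t color = 0;

    for (int p = 0; p < PLANE_COUNT; p++)
        if (planes[p][off] & bit)
            color |= (uint8_t)(1u << p);
    return color;
}
```

One chunky store versus eight read-modify-write cycles per pixel is why renderers draw into a chunky buffer and convert whole frames afterwards.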
Modern Hardware Parallels
| Amiga Component | Modern Equivalent | Function |
|---|---|---|
| Akiko C2P register | GPU texture swizzle unit | Hardware layout transposition |
| Blitter + merge algorithm | CUDA shared memory transpose kernel | CPU/coprocessor-assisted transpose |
| RTG (planar bypass) | Unified chunky framebuffer (since VGA) | Eliminates the problem entirely |
| Copper palette cycling | GPU palette shader / LUT texture | Color manipulation without pixel writes |
| FMODE (fetch width) | GPU memory bus width (256/384/512-bit) | Wider bus = more data per DMA cycle |
GPU Texture Swizzle — The Modern Akiko
Modern GPUs store textures in swizzled (Morton/Z-order) layouts rather than linear row-major order. This is architecturally identical to what the Amiga does with planar bitmaps: the hardware's memory access pattern doesn't match the CPU's logical layout, so a dedicated hardware unit transparently converts between them.
Linear (CPU view): Morton/Z-order (GPU internal):
0 1 2 3 0 1 4 5
4 5 6 7 → 2 3 6 7
8 9 10 11 8 9 12 13
12 13 14 15 10 11 14 15
When you call glTexImage2D() or vkCmdCopyBufferToImage(), the GPU driver performs a layout conversion from linear (CPU-friendly) to swizzled (GPU-cache-friendly). This is the exact same class of operation as Amiga C2P — a hardware-accelerated data layout transformation that is invisible to the application programmer.
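The right-hand grid in the diagram can be reproduced with a short bit-interleave. A sketch in portable C, assuming coordinates of at most 16 bits each; real GPUs use vendor-specific tilings, so this shows the idea rather than any particular driver's layout, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the low 16 bits of v so that bit i moves to bit 2*i. */
static uint32_t spread_bits16(uint32_t v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Morton (Z-order) index: x bits land in even positions, y bits in odd ones. */
static uint32_t morton2d(uint32_t x, uint32_t y)
{
    assert(x <= 0xFFFFu && y <= 0xFFFFu);
    return spread_bits16(x) | (spread_bits16(y) << 1);
}
```

The shift-and-mask ladder in spread_bits16() uses the same family of masks as a C2P merge network: both rearrange bits with a logarithmic number of masked shifts instead of moving them one at a time.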
Performance Comparison Across Eras
| System | Data Layout Problem | Throughput | Method |
|---|---|---|---|
| A500 (1987, 7 MHz 68000) | C2P 320×256×4bpp | ~2 MB/s | CPU merge, 4 planes |
| A1200 (1992, 14 MHz 68020) | C2P 320×256×8bpp | ~1.5 MB/s | CPU merge, 8 planes |
| CD32 (1993, 14 MHz + Akiko) | C2P 320×256×8bpp | ~1.5 MB/s | Akiko hardware |
| 486 DX2/66 (1992) | No conversion needed | N/A | VGA Mode 13h = chunky |
| Pentium MMX (1997) | Color space (YUV→RGB) | ~200 MB/s | MMX SIMD |
| GTX 1080 (2016) | Texture swizzle (linear→tiled) | ~300 GB/s | Hardware TMU |
| Apple M2 (2022) | SoA↔AoS for ML tensors | ~100 GB/s | Hardware AMX |
The throughput gap tells the story: what consumed 100% of a 68020's capability is handled by a dedicated hardware unit at 200,000× the bandwidth on modern silicon. But the fundamental problem — data layout mismatch between producer and consumer — is identical.
Historical Timeline
| Year | Event |
|---|---|
| 1985 | Amiga launches with planar display. C2P not needed — all software renders directly to bitplanes |
| 1989 | First 3D demos appear. Rendering in chunky buffers starts |
| 1991 | Demoscene coders develop first optimized C2P routines for 68000 |
| 1992 | AGA ships (A1200/A4000). 8 bitplanes = C2P problem gets 2× harder |
| 1993 | CD32 ships with Akiko — first hardware C2P. Mikael Kalms publishes optimized CPU routines |
| 1994 | Kalms C2P library becomes the de facto standard. Multiple variants for 020/030/040/060 |
| 1995 | RTG cards (Picasso II, CyberVision 64) begin to make C2P irrelevant for productivity |
| 1996 | CyberVision 64 ships with Roxxler P2C chip — the reverse problem, solved in hardware |
| 1998 | 68060 accelerators make CPU C2P faster than any hardware solution |
| 2020+ | MiSTer FPGA core implements RTG via uaegfx — C2P eliminated for modern setups |
Implementing C2P — Practical Checklist
For developers writing Amiga software that renders in chunky format:
- Allocate the chunky buffer in Fast RAM (MEMF_FAST): the CPU reads it during conversion, and Fast RAM has no DMA contention
- Allocate the planar screen in Chip RAM (MEMF_CHIP, or AllocBitMap() with BMF_DISPLAYABLE): this is mandatory for display DMA
- Use a proven C2P library: Kalms C2P (GitHub / lysator) is the gold standard
- Match the routine to your CPU: different unrolling for 68020 vs 68040 vs 68060
- Use triple buffering if possible: render to buffer A, C2P buffer B into Chip RAM, display buffer C
- On CD32, detect and use Akiko: WriteChunkyPixels() does this automatically
- On RTG systems, skip C2P entirely: render chunky directly to the RTG card's VRAM
- Profile with CIA timers: the bottleneck shifts between CPU merge and Chip RAM write speed depending on configuration
Adaptive Detection
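#include <exec/execbase.h>
#include <graphics/gfxbase.h>
#include <cybergraphx/cybergraphics.h>
#include <proto/cybergraphics.h>
extern struct ExecBase *SysBase;
extern struct GfxBase *GfxBase;
extern struct Library *CyberGfxBase; /* NULL if cybergraphics.library is not open */
/* Strategy names are local to this article, not part of any OS header */
enum C2P_Strategy {
    C2P_NONE_RTG, C2P_AKIKO, C2P_COPPER_CHUNKY,
    C2P_KALMS_060, C2P_KALMS_040, C2P_KALMS_020, C2P_KALMS_000
};
/* Determine best C2P strategy for current hardware */
enum C2P_Strategy determine_c2p_strategy(struct BitMap *screen_bm)
{
    /* Check for RTG screen first: no C2P needed */
    if (CyberGfxBase && GetCyberMapAttr(screen_bm, CYBRMATTR_ISCYBERGFX))
        return C2P_NONE_RTG;
    /* Check for Akiko (CD32); the field exists from graphics.library V40 */
    if (GfxBase->LibNode.lib_Version >= 40 && GfxBase->ChunkyToPlanarPtr != NULL)
        return C2P_AKIKO;
    /* Check CPU type for best software routine (highest CPU first) */
    UWORD attn = SysBase->AttnFlags;
    if (attn & AFF_68060) return C2P_KALMS_060;
    if (attn & AFF_68040) return C2P_KALMS_040;
    if (attn & AFF_68020) return C2P_KALMS_020;
    return C2P_KALMS_000; /* 68000 fallback */
}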
Decision Flowchart — Which C2P Approach?
graph TD
START["I render into a chunky buffer"] --> RTG{"RTG screen?"}
RTG -->|"Yes"| NO_C2P["No C2P — write directly to VRAM"]
RTG -->|"No"| AGA{"AGA hardware?"}
AGA -->|"No (OCS/ECS)"| SW_OCS["Software C2P — Kalms 68000"]
AGA -->|"Yes"| CD32{"CD32 with Akiko?"}
CD32 -->|"Yes"| AKIKO["Akiko hardware C2P"]
CD32 -->|"No"| CPU{"Which CPU?"}
CPU -->|"68020/030"| RES{"Resolution?"}
RES -->|"2x1 low-res"| COPPER["Copper Chunky"]
RES -->|"1x1 full-res"| SW_020["Kalms C2P — 68020"]
CPU -->|"68040/060"| SW_040["Kalms C2P — 68040"]
Tip
If prototyping, use WriteChunkyPixels(). It auto-detects Akiko and uses a decent software C2P. After profiling, switch to the dedicated path.
Named Antipatterns
These are bad habits that compile, produce visible output, and are dangerously easy not to fix.
1. "The Bit-by-Bit Beginner" — Per-Pixel RMW on Every Bitplane
Symptom: 0.9 FPS. CPU time spent in OR.B/AND.B instructions.
/* BROKEN — read-modify-write per plane per pixel */
for (int i = 0; i < total; i++) {
int off = i / 8, bit = 7 - (i & 7);
UBYTE c = chunky[i];
for (int p = 0; p < 8; p++) {
if (c & (1 << p))
planes[p][off] |= (1 << bit);
else
planes[p][off] &= ~(1 << bit);
}
}
Why it fails: Each inner loop iteration costs ~140 cycles (read byte, test, set/clear, write). 655,360 RMW operations = ~91 million cycles.
Fix: Process 32 pixels at once in registers, write planar longwords in one shot (see Solution 2).
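The structure of that fix can be sketched in portable C. This is not the Kalms butterfly from Solution 2: it is a simpler shift-accumulate variant, shown only to illustrate "build each plane's longword in a register, then store it once". The function name is hypothetical, and the leftmost pixel is assumed to map to bit 31.

```c
#include <assert.h>
#include <stdint.h>

/* Convert 32 chunky pixels (8 bits each) into one 32-bit word per bitplane.
   out[p] receives plane p; pixel 0 ends up in bit 31, pixel 31 in bit 0. */
static void c2p_32px(const uint8_t *chunky, uint32_t out[8])
{
    uint32_t acc[8] = {0};          /* one accumulator per plane */

    assert(chunky != 0 && out != 0);
    for (int x = 0; x < 32; x++) {
        uint8_t c = chunky[x];
        for (int p = 0; p < 8; p++)
            acc[p] = (acc[p] << 1) | ((uint32_t)(c >> p) & 1u);
    }
    for (int p = 0; p < 8; p++)     /* one store per plane, no read-modify-write */
        out[p] = acc[p];
}
```

Planar memory is written 8 times per 32 pixels instead of read and written 256 times. The butterfly merge goes further by replacing the 256 shift-and-OR steps with a handful of masked swaps across whole registers.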
2. "The Chip RAM Trap" — Chunky Buffer in Chip RAM
Symptom: C2P stalls unpredictably when display DMA is active.
/* BROKEN — chunky buffer competes with bitplane DMA */
UBYTE *chunky = AllocMem(w * h, MEMF_CHIP);
Why it fails: The CPU reads chunky data during butterfly merge. If in Chip RAM, every read contends with display DMA. Both the CPU and Agnus/Alice stall.
Fix:
/* CORRECT: chunky in Fast RAM, only planar output in Chip RAM */
UBYTE *chunky = AllocMem(w * h, MEMF_FAST);
UBYTE *planes = AllocMem(w * h / 8 * depth, MEMF_CHIP); /* or AllocBitMap() with BMF_DISPLAYABLE */
3. "The Odd-Width Screen" — Non-Multiple-of-32 Width
Symptom: C2P runs at half expected speed.
/* BROKEN — 336 pixels wide */
#define WIDTH 336
Why it fails: Bitplane row length (WIDTH/8) must be longword-aligned for optimal DMA. Non-aligned rows break caching and add per-line overhead.
Fix: Always use widths that are multiples of 32.
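The alignment rule is easy to check at setup time. A sketch in portable C, assuming the usual Amiga convention that bitplane rows are padded to whole 16-bit words; the helper names are hypothetical.

```c
#include <assert.h>

/* Bytes per bitplane row, rounded up to a whole number of 16-bit words. */
static unsigned planar_row_bytes(unsigned width)
{
    assert(width > 0);
    return ((width + 15u) >> 4) << 1;
}

/* Non-zero if the width splits into whole 32-pixel groups (longword rows). */
static int width_ok_for_c2p(unsigned width)
{
    return (width % 32u) == 0;
}

/* Next width that is a multiple of 32 pixels. */
static unsigned round_up_width32(unsigned width)
{
    return (width + 31u) & ~31u;
}
```

For 336 pixels the row is 42 bytes, which is not a multiple of 4, and round_up_width32(336) gives 352. A 320-pixel row is 40 bytes and passes unchanged.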
4. "The Shared Blitter" — Using Blitter on 68040+
Symptom: Blitter "assistance" slows down 68040 conversion.
Why it fails: The Blitter has a 16-bit interface to Chip RAM. The 68040 MOVE16 moves 16 bytes at once, consuming fewer bus cycles. On 68060, the superscalar core outperforms the Blitter in all scenarios.
Fix: On 68040/060, let the CPU handle merge + planar writes. Skip the Blitter entirely.
Pitfalls — Bad Code That Compiles
1. Missing Cache Flush on 68040/060 After C2P
On 68040+, copyback data caching means planar output written by the CPU may not have reached Chip RAM when the display hardware fetches those addresses, so stale data can be shown. Flush through exec.library: writing CACR directly needs supervisor mode, and CACRF_ClearD ($0800) does not fit in a MOVEQ immediate.
; WRONG: no cache flush after C2P
        bsr     c2p_convert
; display may read stale data
; CORRECT
        bsr     c2p_convert
        move.l  4.w,a6                  ; ExecBase
        jsr     _LVOCacheClearU(a6)     ; exec V37+: push and invalidate CPU caches
                                        ; (d0-d1/a0-a1 may be trashed)
; safe to display now
2. Double Buffering Without Triple Buffering
With a single chunky buffer, the pipeline is fully serial — render, then C2P, then display — and the CPU idles through most of each frame. Even double buffering helps little because the C2P step still stalls everything:
/* BAD — single buffer forces CPU to idle after each step */
render_to(chunky);
c2p_convert(chunky, screen); /* CPU stalled during C2P */
WaitTOF(); /* CPU stalled waiting for vblank */
Result: ~30% CPU utilization — the CPU spins idle ~70% of the time.
Fix — triple buffering (good): Decouple all three stages so they overlap:
- Buffer C is displayed by DMA (bitplane fetch)
- Buffer B is being C2P'd by the CPU (merge/butterfly)
- Buffer A is being rendered by the CPU (game/demo logic)
/* GOOD — three buffers allow full pipelining */
int cur = 0;
while (running) {
c2p_convert(chunky[(cur+2)%3], screen[(cur+1)%3]); /* C2P B → C */
render_to(chunky[cur]); /* render A */
WaitTOF();
set_bitplanes(screen[cur]); /* display C */
cur = (cur + 1) % 3;
}
Result: ~70% CPU utilization, about 2.3x more work done per frame than the serial loop above.
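The index rotation in the loop is easy to get wrong, so it helps to write the roles down explicitly. A sketch in portable C that mirrors the indices used above; the struct and function names are hypothetical.

```c
#include <assert.h>

/* Which buffer plays which role during one pass of the loop. */
struct BufferRoles {
    int render;    /* chunky buffer being drawn into this frame */
    int convert;   /* chunky buffer drawn last frame, now being C2P'd */
    int target;    /* planar buffer receiving the C2P output */
    int showing;   /* planar buffer on screen while C2P runs */
    int flip_to;   /* planar buffer shown after WaitTOF() */
};

static struct BufferRoles roles_for(int cur)
{
    struct BufferRoles r;

    assert(cur >= 0 && cur < 3);
    r.render  = cur;
    r.convert = (cur + 2) % 3;
    r.target  = (cur + 1) % 3;
    r.showing = (cur + 2) % 3;   /* flipped to at the end of the previous pass */
    r.flip_to = cur;
    return r;
}

/* Advance to the next frame's starting index. */
static int next_frame(int cur)
{
    return (cur + 1) % 3;
}
```

Two invariants fall out of this: the C2P target is never the buffer currently on screen, and the buffer flipped to on one pass is the one that was the C2P target on the previous pass. If either fails after a change to the loop, the usual symptom is tearing or a frame shown out of order.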
3. Benchmarking Without Forbid()
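/* Both variants assume timer.device is open and TimerBase is set */
struct EClockVal t0, t1;
/* WRONG: includes task switches in measurement */
ReadEClock(&t0); /* fills t0; return value is the E-clock frequency in Hz */
c2p_convert(...);
ReadEClock(&t1);
/* CORRECT: no task switches between the two readings */
Forbid();
ReadEClock(&t0);
c2p_convert(...);
ReadEClock(&t1);
Permit();
ULONG ticks = t1.ev_lo - t0.ev_lo; /* elapsed E-clock ticks */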
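ReadEClock() reports time in E-clock ticks and returns the tick frequency, so a measurement has to be scaled before it can be compared with the millisecond figures in this article. A sketch of the conversion in portable C99; it assumes a 64-bit intermediate is available, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed ticks between two low-word readings; unsigned arithmetic
   gives the right answer even if the low word wrapped once. */
static uint32_t eclock_delta(uint32_t start_lo, uint32_t end_lo)
{
    return end_lo - start_lo;
}

/* Convert a tick count to microseconds.
   eclock_hz is the frequency returned by ReadEClock(). */
static uint32_t eclock_ticks_to_us(uint32_t ticks, uint32_t eclock_hz)
{
    assert(eclock_hz > 0);
    return (uint32_t)(((uint64_t)ticks * 1000000u) / eclock_hz);
}
```

On a PAL machine the E-clock runs at 709,379 Hz, so a conversion that takes 24,828 ticks corresponds to roughly 35 ms.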
Debugging C2P — Common Visual Artifacts
When your C2P routine produces output but it looks wrong, the artifact pattern usually tells you exactly which stage is broken. Here are the most common failures and how to diagnose them:
1. Shimmering / Crawling Pixels (Cache Coherency)
What you see: Individual pixels or small clusters flicker between correct and wrong colors, sometimes synchronized with scrolling or animation.
Cause: On 68040/060 with copyback caching, the latest C2P output can remain in the data cache while display DMA reads the older contents of Chip RAM.
Fix:
        bsr     c2p_convert
        move.l  4.w,a6                  ; ExecBase
        jsr     _LVOCacheClearU(a6)     ; push and invalidate caches after C2P
2. Every Nth Pixel Wrong (Bit Mask Error)
What you see: A regular pattern — every 2nd, 4th, 8th, or 16th pixel has the wrong color while neighbors are correct.
Cause: One of the merge masks is wrong. If every 16th pixel fails, the 16-bit swap mask ($0000FFFF) has a typo. If every 2nd pixel fails, the 1-bit swap mask ($55555555) is wrong.
Fix: Verify each pass uses the exact mask from the pass structure table above. A single wrong nibble in a mask constant corrupts ONE pass, producing a very regular artifact.
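Each mask belongs to one swap step of the following general shape. This is a sketch in portable C of the generic bit-block exchange used in butterfly-style transposes, not a copy of any particular library's pass. The function name is hypothetical, and the pairing of registers and the order of passes in a real routine are not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Exchange the bits of *a selected by (mask << shift) with the bits of *b
   selected by mask. All other bits of both words stay where they are. */
static void merge_swap(uint32_t *a, uint32_t *b, unsigned shift, uint32_t mask)
{
    uint32_t t;

    assert(a != 0 && b != 0 && shift < 32);
    t = ((*a >> shift) ^ *b) & mask;   /* bits that differ between the two fields */
    *b ^= t;                           /* b's field takes a's bits */
    *a ^= t << shift;                  /* a's field takes b's bits */
}
```

With shift 16 and mask $0000FFFF the high half of one register trades places with the low half of the other. With shift 1 and mask $55555555 the exchange happens at single-bit granularity. A wrong mask leaves some bits unswapped on every application of that pass, which is why the resulting artifact repeats at a fixed pixel interval.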
3. Horizontal Stripes / Scanline Mismatch
What you see: Horizontal bands of correct and corrupted data, often 1–8 scanlines tall.
Cause: Bitplane modulo (row-to-row offset) is misconfigured. The C2P routine writes WIDTH/8 bytes per planar row (40 for a 320-pixel screen), but the display fetch expects a different pitch. Common on non-320-width screens.
Fix: Ensure WIDTH/8 is longword-aligned and matches the bitplane modulo in BPL1MOD/BPL2MOD registers. Always use widths that are multiples of 32.
4. Bit-Inverted Color (Complemented Plane)
What you see: Colors are mostly correct but "off" — bright where dark should be, or certain color ranges are inverted.
Cause: A single bitplane was written with inverted bits (OR used where AND was needed, or EOR instead of OR). This flips all palette indices that have that bit set.
Fix: Check the final store loop — ensure MOVE.L writes, not BSET/BCLR. A common mistake is using BSET to set bits in pre-cleared bitplane memory, then forgetting to clear the buffer between frames.
5. "Garbage Garden" — Random Colored Snow
What you see: Entire screen filled with rapidly changing random colors, with occasional flashes of correct data.
Cause: Buffer pointer is uninitialized or stale. The C2P routine is reading from the wrong chunky buffer address or writing to the wrong bitplane base.
Fix: Trace your buffer pointer arithmetic. Ensure A0 (chunky) and A1 (bitplanes) are set correctly before calling c2p_convert. Triple-buffer pointer rotation bugs are the most common culprit.
Quick Diagnostic Table
| Artifact Pattern | Most Likely Cause | Check First |
|---|---|---|
| Shimmering/flickering pixels | Missing cache flush (68040+) | CACR after convert |
| Regular pixel skip pattern | Wrong merge mask constant | Mask table vs your code |
| Horizontal scanline bands | Modulo/pitch mismatch | WIDTH/8 alignment, BPLxMOD |
| Inverted color ranges | Inverted bitplane logic | OR vs AND vs EOR in stores |
| Random noise / garbage | Wrong buffer pointer | A0/A1 before bsr c2p_convert |
| Correct but slow (half FPS) | Non-aligned width or Chip RAM buffer | Test with MEMF_FAST |
Use-Case Cookbook
1. Full-Screen C2P with Triple Buffering
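UBYTE *chunky[3];             /* triple chunky buffers in Fast RAM */
struct Screen *scr;           /* opened elsewhere */
struct ScreenBuffer *sbuf[3]; /* planar buffers from AllocScreenBuffer() */
int cur = 0;
BOOL init(int w, int h) {
    for (int i = 0; i < 3; i++) {
        chunky[i] = AllocMem(w * h, MEMF_FAST);
        if (!chunky[i]) return FALSE; /* no Fast RAM: caller must clean up and fall back */
    }
    c2p_init(w, h, 0, 0);
    return TRUE;
}
void do_frame(void) {
    int prev = (cur + 2) % 3;            /* buffer rendered last frame */
    render_3d(chunky[cur]);              /* render new frame */
    c2p_convert(chunky[prev],            /* convert last frame's data */
                sbuf[prev]->sb_BitMap->Planes[0]); /* assumes contiguous planes */
    WaitTOF();                           /* sync to vertical blank */
    ChangeScreenBuffer(scr, sbuf[prev]); /* flip display */
    cur = (cur + 1) % 3;
}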
2. Adaptive Resolution Fallback
void set_resolution(int w, int h) {
if (w > 320 || h > 256) {
/* Above 320x256: use the RTG screen if available, else the fastest software C2P */
if (cybergfx_screen)
strategy = C2P_NONE_RTG;
else
strategy = C2P_KALMS_060;
} else if (w <= 160) {
strategy = C2P_COPPER_CHUNKY; /* low-res: Copper Chunky */
} else {
strategy = C2P_KALMS_020; /* full-res: software C2P */
}
c2p_reinit(w, h);
}
3. Frame Timing with CIA Timer
/* Assumes timer.device is open and TimerBase is set.
   The E-clock read here is the clock that drives the CIA timers. */
ULONG measure_c2p_time(void) {
    struct EClockVal t0, t1;
    Forbid();
    ReadEClock(&t0);
    c2p_convert(chunky_buf, bitplanes);
    ReadEClock(&t1);
    Permit();
    return t1.ev_lo - t0.ev_lo; /* elapsed E-clock ticks (up-counter) */
}
FAQ
Why not just use the Blitter for C2P?
The Blitter cannot transpose bits — it only manipulates 16-bit words in linear rows. C2P is fundamentally a transposition operation, which requires bit-level swapping that minterms cannot express. The Blitter can help write converted data to bitplanes (Solution 4), but the actual transposition must happen in CPU registers.
Why are odd screen widths like 336 pixels much slower?
Bitplane modulo calculations on non-aligned rows force the display DMA controller to calculate non-standard memory addresses. At 336 pixels wide, each row is 42 bytes — not longword-aligned, causing extra memory cycles and breaking I-cache patterns in the butterfly merge.
Can I use Akiko on non-CD32 hardware?
No. Akiko is a custom ASIC that physically only exists in the CD32; it is integrated with the CD-ROM controller on the same die. There is no expansion card addressing $B80000 on any other Amiga model. On MiSTer, Akiko can be implemented as a soft peripheral in the FPGA core — see the FPGA implementation note in Solution 3.
Why doesn't C2P scale linearly with 68060 clock speed?
C2P performance is bounded by Chip RAM bandwidth (~3.5 MB/s shared), not by CPU speed. Once the butterfly merge executes faster than memory can deliver words, bus limitations dominate. On a 50 MHz 68060, the merge takes ~1.3 ms for 320x256, but writing 8 bitplanes to Chip RAM takes ~23 ms — the write phase dominates.
Does P2C (Planar-to-Chunky) have the same problem?
Yes. Reading planar pixel data requires 8 memory accesses (one per bitplane), then bit-packing these into chunky bytes. The computational complexity is identical because it is the same bit matrix transposition — just in the reverse direction. RTG cards that support legacy planar software include hardware P2C (e.g., CyberVision 64 Roxxler chip).
References
- Mikael Kalms — kalms-c2p — the definitive C2P library (GitHub)
- Scout/Azure — Chunky 2 Planar Tutorial — the seminal demoscene document explaining the transposition theory (written 1997, hosted by Kalms)
- Amiga Hardware Reference Manual — bitplane DMA, display pipeline
- NDK39: graphics/gfx.h — WriteChunkyPixels() prototype
- Intel — Structure of Arrays vs Array of Structures — modern SoA/AoS guide
- NVIDIA — CUDA Programming Guide: Matrix Transpose Example — GPU shared-memory equivalent of C2P bit transposition
See Also
- Akiko — CD32 C2P Hardware — Akiko register protocol
- BitMap — Planar Layout — how Amiga bitmaps are structured in memory
- Blitter Programming — Blitter DMA used in Blitter-assisted C2P
- RTG — Retargetable Graphics — chunky framebuffer cards that eliminate C2P
- Memory Types — Chip vs Fast RAM (critical for C2P buffer placement)