23 KiB
Pixel Format Conversion — Chunky ↔ Planar and Beyond
The Core Problem
Every Amiga programmer eventually hits the same wall: the custom chipset displays graphics in planar format, but nearly every interesting algorithm — 3D rendering, texture mapping, image decompression, PC game ports — produces output in chunky format. Converting between these two layouts is the single most CPU-intensive bottleneck in Amiga graphics programming.
This article covers:
- What planar and chunky formats are, mathematically
- Why the conversion is expensive
- How every known solution works — from naive loops to the Kalms butterfly
- Where this problem appears in broader computing (SoA/AoS, GPU swizzle, SIMD)
Note
The Akiko hardware article covers the CD32's dedicated C2P register interface. This article covers the algorithm theory that applies to every Amiga model, and the broader data-layout concepts that connect the Amiga to modern computing.
Planar vs Chunky — The Two Layouts
Chunky (Packed Pixel)
Every pixel's complete colour index is stored contiguously. For 8-bit (256 colour) pixels:
Address: $0000 $0001 $0002 $0003 $0004 $0005 $0006 $0007
Data: $0D $05 $1B $0A $FF $03 $42 $7E
pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7
Each byte = one pixel. Linear, simple, cache-friendly for rendering. This is how every modern GPU, every PC VGA card, every framebuffer since 1990 stores pixels.
Planar (Bitplane)
Each pixel's colour index is split across N separate memory regions (bitplanes). For 8-bit pixels (8 bitplanes), each bitplane stores one bit of every pixel:
Bitplane 0: 1 0 1 1 0 0 1 0 ← bit 0 of pixels 0–7
Bitplane 1: 0 1 0 1 1 0 0 1 ← bit 1 of pixels 0–7
Bitplane 2: 1 1 0 0 0 1 1 0 ← bit 2
Bitplane 3: 0 1 1 0 1 1 0 0 ← bit 3
Bitplane 4: 1 0 1 0 1 0 0 1 ← bit 4
Bitplane 5: 1 0 0 0 0 0 1 0 ← bit 5
Bitplane 6: 0 0 1 0 0 0 0 1 ← bit 6
Bitplane 7: 0 0 0 0 1 0 1 0 ← bit 7
To read pixel 0's colour: collect bit 0 from each of the 8 planes → 10101100 = $AC. The 8 planes are not interleaved in standard Amiga layout — each is a separate contiguous memory block.
Why the Amiga Uses Planar
The planar format was a brilliant engineering choice in 1985:
| Advantage | Explanation |
|---|---|
| Bandwidth efficiency | A 4-colour screen uses 2 bitplanes = ½ the memory bandwidth of 4bpp chunky. DMA fetches only the planes actually used. |
| Scalable colour depth | Adding a bitplane doubles the colour count without redesigning the display engine. OCS: 1–6 planes. AGA: 1–8 planes. |
| Cheap colour cycling | Rotating palette indices only requires changing colour registers — zero memory writes. |
| Blitter efficiency | Blitting a masked sprite at 4 colours touches only 2 planes (2 blits), not 4× the data. |
| Copper integration | The Copper can change palette registers mid-scanline, effectively multiplying colours without more bitplanes. |
The downside only became critical as rendering algorithms evolved past 2D sprites into 3D, texture mapping, and pixel-level effects that naturally produce chunky output.
The Conversion — Mathematically
C2P is a bit matrix transposition. Given 32 chunky pixels (each 8 bits wide), you have a 32×8 bit matrix (32 rows × 8 columns). C2P transposes this to an 8×32 matrix (8 bitplanes × 32 bits each):
Input (chunky): Output (planar):
32 pixels × 8 bits 8 bitplanes × 32 bits
┌─────────────────┐ ┌──────────────────────────────┐
│ P0: b7 b6 b5 b4 b3 b2 b1 b0 │ │ Plane 0: p0.b0 p1.b0 p2.b0 ... p31.b0 │
│ P1: b7 b6 b5 b4 b3 b2 b1 b0 │ │ Plane 1: p0.b1 p1.b1 p2.b1 ... p31.b1 │
│ ... │ │ ... │
│ P31: b7 b6 b5 b4 b3 b2 b1 b0 │ │ Plane 7: p0.b7 p1.b7 p2.b7 ... p31.b7 │
└─────────────────┘ └──────────────────────────────┘
This is equivalent to a 90° bit rotation. On a modern CPU with SIMD, this is trivial. On a 68020 with 8 data registers and no bit-parallel instructions, it is an algorithmic challenge that consumed thousands of programmer-hours across the demoscene.
Solution 1 — The Naive Loop
The simplest approach: iterate over every pixel, extract each bit, and set it in the corresponding bitplane.
/* Naive C2P — educational only, never use in production */
void c2p_naive(UBYTE *chunky, UBYTE *planes[8], int width, int height)
{
for (int y = 0; y < height; y++)
{
for (int x = 0; x < width; x++)
{
UBYTE pixel = chunky[y * width + x];
int byte_offset = y * (width / 8) + (x / 8);
int bit_position = 7 - (x & 7);
for (int plane = 0; plane < 8; plane++)
{
if (pixel & (1 << plane))
planes[plane][byte_offset] |= (1 << bit_position);
else
planes[plane][byte_offset] &= ~(1 << bit_position);
}
}
}
}
Performance: ~200+ cycles per pixel on 68020. For 320×256 = 81,920 pixels → ~16 million cycles → ~1.1 seconds at 14 MHz. This gives roughly 0.9 FPS. Completely unusable.
Why it's terrible:
- One bit at a time — no parallelism
- Read-modify-write on every bitplane byte (bus-killing)
- No register reuse — constant memory traffic
- Branch on every bit (pipeline flush on 68020)
Solution 2 — The Merge (Butterfly) Algorithm
This is the standard approach used by virtually all serious Amiga C2P routines. Invented independently by several demoscene coders and formalised by Mikael Kalms (Kalmalyzer) and others.
The Key Insight
Instead of processing one pixel at a time, load 32 pixels (8 longwords = 256 bits) into CPU registers and perform a series of bit-level swap operations (called "merges") that progressively rearrange the bits into planar order. Each merge pass swaps bits at a different granularity: 16-bit blocks, then 8-bit, then 4-bit, 2-bit, and 1-bit.
This is exactly a butterfly network — the same structure used in the FFT (Fast Fourier Transform) and Batcher's bitonic sort.
The Merge Primitive
The fundamental building block is a 2-register swap that exchanges bits at a given stride:
; merge(d0, d1, mask, shift)
; Exchanges bits between d0 and d1 where mask selects which bits to swap
; and shift determines the stride
move.l d0, d2 ; temp = a
lsr.l #shift, d2 ; temp >>= stride
eor.l d1, d2 ; temp ^= b
and.l #mask, d2 ; temp &= mask (select bits to swap)
eor.l d2, d1 ; b ^= temp (swap into b)
lsl.l #shift, d2 ; temp <<= stride (restore position)
eor.l d2, d0 ; a ^= temp (swap into a)
7 instructions per merge. Each merge moves half the bits in two registers to their correct positions.
Pass Structure for 8 Bitplanes
A full 8-bitplane C2P conversion on 32 pixels requires 5 passes of merge operations:
| Pass | Block Size | Mask | Swap Distance | Effect |
|---|---|---|---|---|
| 1 | 16-bit | $0000FFFF |
16 | Swap upper/lower halves of longword pairs |
| 2 | 8-bit | $00FF00FF |
8 | Swap bytes within pairs |
| 3 | 4-bit | $0F0F0F0F |
4 | Swap nibbles |
| 4 | 2-bit | $33333333 |
2 | Swap bit-pairs |
| 5 | 1-bit | $55555555 |
1 | Swap individual bits |
After all 5 passes, the 8 data registers contain one longword per bitplane.
Full 8-Bitplane C2P Inner Loop
; Kalms-style C2P inner loop — converts 32 chunky pixels (8 longwords)
; to 8 planar longwords (one per bitplane)
;
; Input: d0-d7 = 8 longwords of chunky data (4 pixels each)
; Output: d0-d7 = 8 longwords of planar data (one per bitplane)
; ---- Pass 1: 16-bit swap ----
swap d0 ; exchange upper/lower words of d0
swap d1
swap d2
swap d3
; (merge d0,d4), (merge d1,d5), (merge d2,d6), (merge d3,d7)
; using mask $0000FFFF, shift 16
move.l d0, a3 ; temp save
move.l d4, d0
move.w a3, d0 ; d0 = d4.hi : d0.lo
move.w d4, a3 ; a3 = d0.hi : d4.lo
move.l a3, d4
move.l d1, a3
move.l d5, d1
move.w a3, d1
move.w d5, a3
move.l a3, d5
move.l d2, a3
move.l d6, d2
move.w a3, d2
move.w d6, a3
move.l a3, d6
move.l d3, a3
move.l d7, d3
move.w a3, d3
move.w d7, a3
move.l a3, d7
; ---- Pass 2: 8-bit swap ----
; mask = $00FF00FF, shift = 8
move.l #$00FF00FF, a3
; merge(d0, d2)
move.l d0, a4
lsr.l #8, a4
eor.l d2, a4
and.l a3, a4
eor.l a4, d2
lsl.l #8, a4
eor.l a4, d0
; merge(d1, d3) ... merge(d4, d6) ... merge(d5, d7) ...
; (same pattern repeated for each pair)
; ---- Pass 3: 4-bit swap ----
; mask = $0F0F0F0F, shift = 4
; merge(d0, d1), merge(d2, d3), merge(d4, d5), merge(d6, d7)
; ---- Pass 4: 2-bit swap ----
; mask = $33333333, shift = 2
; ---- Pass 5: 1-bit swap ----
; mask = $55555555, shift = 1
; Result: d0 = bitplane 0 (32 bits), d1 = bitplane 1, ... d7 = bitplane 7
Note
The above is a pedagogical skeleton. Production C2P routines are heavily unrolled and use every register trick available — address registers as temporary storage, interleaving loads with merges to hide memory latency, and sometimes splitting the conversion across two phases to overlap with Chip RAM writes.
Performance
| Metric | Naive | Merge/Butterfly | Improvement |
|---|---|---|---|
| Instructions per 32 pixels | ~6,400+ | ~160–200 | 32–40× |
| Cycles per pixel (68020 @ 14 MHz) | ~200 | ~5–7 | ~30× |
| 320×256 full frame | ~1.1 s | ~35 ms | ~30× (28 FPS) |
| 320×256 per frame budget | 0.9 FPS | 28 FPS | Playable |
Solution 3 — Akiko Hardware C2P (CD32 Only)
The CD32's Akiko chip implements C2P in dedicated silicon. The CPU feeds 8 longwords of chunky data to register $B80030 and reads back 8 longwords of planar data from the same address.
| Metric | Software C2P (68020) | Akiko |
|---|---|---|
| Method | CPU merge/butterfly | Hardware register pipeline |
| Throughput | ~1.5 MB/s | ~1.5 MB/s |
| CPU cost | 100% | ~50% (register I/O) |
| Availability | All Amigas | CD32 only |
Akiko's throughput is approximately the same as optimised software C2P on the 68020 because both are limited by the Chip RAM bus bandwidth (~3.5 MB/s shared). On faster CPUs (68040/060), software C2P outperforms Akiko because the CPU can process data faster than the register interface can shuttle it.
Full Akiko protocol: Akiko — CD32 C2P Hardware
Solution 4 — Blitter-Assisted C2P
The Blitter can be used as part of a C2P pipeline, but it cannot perform the transposition itself. Typical usage:
- CPU performs the merge/butterfly in registers → outputs planar longwords to a temporary buffer in Chip RAM
- Blitter copies the planar data from the temporary buffer to the screen's bitplanes with correct modulo
This approach overlaps CPU computation with Blitter DMA — while the Blitter writes frame N's planes to the screen, the CPU computes frame N+1's transposition.
Time ──────────────────────────────────────────────────────→
CPU: [merge frame 0] [merge frame 1] [merge frame 2] ...
Blitter: [write frame 0] [write frame 1] ...
↑ overlap: CPU and Blitter run in parallel
Warning
On 68040/060 systems, the Blitter is often slower than letting the CPU do both the merge and the writes via
MOVE16(68040) or unrolledMOVEM.L. The Blitter's 16-bit bus (even in AGA FMODE×4) adds DMA contention that may actually slow down the CPU's merge passes.
Solution 5 — WriteChunkyPixels (AmigaOS)
AmigaOS 3.0+ provides WriteChunkyPixels() in graphics.library, which performs C2P conversion internally using the best available method:
#include <graphics/gfx.h>
WriteChunkyPixels(rp,
xstart, ystart, xstop, ystop,
chunky_buffer, chunky_bytes_per_row);
On CD32, this function auto-detects Akiko and uses it. On other AGA machines, it uses an internal software C2P. However, the OS implementation is not as fast as the best demoscene routines — it prioritises correctness and generality over raw speed.
Solution 6 — RTG: Eliminating C2P Entirely
The ultimate solution to C2P is to not do it at all. Retargetable Graphics (RTG) cards like the Picasso IV, CyberVision 64, and MiSTer's virtual uaegfx provide a chunky framebuffer directly. The rendering engine writes chunky pixels to VRAM, and the card's RAMDAC/scaler converts them to video output.
The irony: RTG cards must perform the reverse conversion (P2C — planar-to-chunky) when legacy planar software runs on an RTG screen. The CyberVision 64 included a dedicated Roxxler chip for this. Without hardware help, P2C on software is equally expensive.
See: RTG — Retargetable Graphics
Choosing the Right Approach
| Platform | Recommended C2P | Why |
|---|---|---|
| A500/A2000 (68000) | Merge algorithm (simplified, fewer planes) | No fast multiply; 68000 can manage 4–5 plane C2P at ~15 FPS |
| A1200 (68020) | Kalms merge, 5-pass | Sweet spot: enough registers, usable I-cache |
| CD32 (68020 + Akiko) | Akiko hardware | Frees ~50% CPU for game logic |
| A4000 (68030/040) | CPU merge (skip Akiko if not CD32) | 68040 MOVE16 makes CPU writes fast enough |
| 68060 accelerated | CPU merge, no Blitter | 68060 superscalar outperforms everything else |
| MiSTer FPGA | RTG (uaegfx) |
Chunky framebuffer in DDR — no C2P needed |
The Bigger Picture — Data Layout Transformation
C2P is not unique to the Amiga. It is an instance of a fundamental problem in computer architecture: transforming data layout between Structure-of-Arrays (SoA) and Array-of-Structures (AoS).
SoA vs AoS — The Universal Duality
AoS (Array of Structures) = Chunky:
struct Pixel { r, g, b, a; };
Pixel pixels[1024];
// Memory: r0 g0 b0 a0 r1 g1 b1 a1 r2 g2 b2 a2 ...
// Each element's fields are contiguous
SoA (Structure of Arrays) = Planar:
struct Pixels {
float r[1024];
float g[1024];
float b[1024];
float a[1024];
};
// Memory: r0 r1 r2 ... r1023 g0 g1 g2 ... g1023 ...
// Each field is contiguous across all elements
The Amiga's planar format is SoA: each bitplane is an array of one field (one bit) across all pixels. The chunky format is AoS: each pixel's fields (all 8 bits) are packed together.
Where This Problem Appears Today
| Domain | SoA (Planar-Like) | AoS (Chunky-Like) | Conversion |
|---|---|---|---|
| Amiga graphics | Bitplanes (Agnus DMA) | Chunky pixel buffer (CPU render) | C2P algorithm |
| GPU compute shaders | SoA buffer layouts (SSBO) | Vertex attributes (interleaved VBO) | Shader transpose |
| SIMD / AVX-512 | Separate float arrays (vectorisable) | Struct arrays (gather/scatter) | _mm512_transpose intrinsics |
| Database engines | Columnar storage (Parquet, Arrow) | Row-oriented storage (MySQL) | Column↔row materialisation |
| Image compression | Colour planes (JPEG YCbCr) | RGB pixels (BMP) | MCU block decomposition |
| GPU texture memory | Block-compressed (BC/ASTC) | Linear RGBA | Hardware texture unit decode |
| Neural network inference | NCHW tensor layout (channels first) | NHWC (channels last) | Layout transposition kernel |
Why Each System Prefers a Different Layout
| Layout | Optimal For | Reason |
|---|---|---|
| SoA / Planar | Streaming one field across many elements | Maximises cache line utilisation, enables SIMD vectorisation |
| AoS / Chunky | Random-access to complete elements | All fields of one element in one cache line |
The Amiga's custom DMA engine streams bitplane data to the display sequentially — plane 0 for the whole line, then plane 1, etc. This is a SoA access pattern, perfectly matched by the planar layout. The CPU, which wants to set a single pixel's complete colour, has the opposite need — it wants AoS.
Modern Hardware Parallels
| Amiga Component | Modern Equivalent | Function |
|---|---|---|
| Akiko C2P register | GPU texture swizzle unit | Hardware layout transposition |
| Blitter + merge algorithm | CUDA shared memory transpose kernel | CPU/coprocessor-assisted transpose |
| RTG (planar bypass) | Unified chunky framebuffer (since VGA) | Eliminates the problem entirely |
| Copper palette cycling | GPU palette shader / LUT texture | Colour manipulation without pixel writes |
| FMODE (fetch width) | GPU memory bus width (256/384/512-bit) | Wider bus = more data per DMA cycle |
GPU Texture Swizzle — The Modern Akiko
Modern GPUs store textures in swizzled (Morton/Z-order) layouts rather than linear row-major order. This is architecturally identical to what the Amiga does with planar bitmaps: the hardware's memory access pattern doesn't match the CPU's logical layout, so a dedicated hardware unit transparently converts between them.
Linear (CPU view): Morton/Z-order (GPU internal):
0 1 2 3 0 1 4 5
4 5 6 7 → 2 3 6 7
8 9 10 11 8 9 12 13
12 13 14 15 10 11 14 15
When you call glTexImage2D() or vkCmdCopyBufferToImage(), the GPU driver performs a layout conversion from linear (CPU-friendly) to swizzled (GPU-cache-friendly). This is the exact same class of operation as Amiga C2P — a hardware-accelerated data layout transformation that is invisible to the application programmer.
Performance Comparison Across Eras
| System | Data Layout Problem | Throughput | Method |
|---|---|---|---|
| A500 (1987, 7 MHz 68000) | C2P 320×256×4bpp | ~2 MB/s | CPU merge, 4 planes |
| A1200 (1992, 14 MHz 68020) | C2P 320×256×8bpp | ~1.5 MB/s | CPU merge, 8 planes |
| CD32 (1993, 14 MHz + Akiko) | C2P 320×256×8bpp | ~1.5 MB/s | Akiko hardware |
| 486 DX2/66 (1992) | No conversion needed | N/A | VGA Mode 13h = chunky |
| Pentium MMX (1997) | Colour space (YUV→RGB) | ~200 MB/s | MMX SIMD |
| GTX 1080 (2016) | Texture swizzle (linear→tiled) | ~300 GB/s | Hardware TMU |
| Apple M2 (2022) | SoA↔AoS for ML tensors | ~100 GB/s | Hardware AMX |
The throughput gap tells the story: what consumed 100% of a 68020's capability is handled by a dedicated hardware unit at 200,000× the bandwidth on modern silicon. But the fundamental problem — data layout mismatch between producer and consumer — is identical.
Historical Timeline
| Year | Event |
|---|---|
| 1985 | Amiga launches with planar display. C2P not needed — all software renders directly to bitplanes |
| 1989 | First 3D demos appear (Juggler, etc.). Rendering in chunky buffers starts |
| 1991 | Demoscene coders develop first optimised C2P routines for 68000 |
| 1992 | AGA ships (A1200/A4000). 8 bitplanes = C2P problem gets 2× harder |
| 1993 | CD32 ships with Akiko — first hardware C2P. Mikael Kalms publishes optimised CPU routines |
| 1994 | Kalms C2P library becomes the de facto standard. Multiple variants for 020/030/040/060 |
| 1995 | RTG cards (Picasso II, CyberVision 64) begin to make C2P irrelevant for productivity |
| 1996 | CyberVision 64 ships with Roxxler P2C chip — the reverse problem, solved in hardware |
| 1998 | 68060 accelerators make CPU C2P faster than any hardware solution |
| 2020+ | MiSTer FPGA core implements RTG via uaegfx — C2P eliminated for modern setups |
Implementing C2P — Practical Checklist
For developers writing Amiga software that renders in chunky format:
- Allocate the chunky buffer in Fast RAM (
MEMF_FAST) — the CPU reads it during conversion, and Fast RAM has no DMA contention - Allocate the planar screen in Chip RAM (
MEMF_CHIP | MEMF_DISPLAYABLE) — this is mandatory for display DMA - Use a proven C2P library — Kalms C2P (
kalms-c2pon GitHub/Aminet) is the gold standard - Match the routine to your CPU — different unrolling for 68020 vs 68040 vs 68060
- Use triple buffering if possible — render to buffer A, C2P buffer B into Chip RAM, display buffer C
- On CD32, detect and use Akiko —
WriteChunkyPixels()does this automatically - On RTG systems, skip C2P entirely — render chunky directly to the RTG card's VRAM
- Profile with CIA timers — the bottleneck shifts between CPU merge and Chip RAM write speed depending on configuration
Adaptive Detection
#include <graphics/gfxbase.h>
#include <cybergraphx/cybergraphics.h>
extern struct GfxBase *GfxBase;
/* Determine best C2P strategy for current hardware */
enum C2P_Strategy determine_c2p_strategy(struct BitMap *screen_bm)
{
/* Check for RTG screen first — no C2P needed */
if (GetCyberMapAttr(screen_bm, CYBRMATTR_ISRTG))
return C2P_NONE_RTG;
/* Check for Akiko (CD32) */
if (GfxBase->ChunkyToPlanarPtr != NULL)
return C2P_AKIKO;
/* Check CPU type for best software routine */
UWORD attn = SysBase->AttnFlags;
if (attn & AFF_68060) return C2P_KALMS_060;
if (attn & AFF_68040) return C2P_KALMS_040;
if (attn & AFF_68020) return C2P_KALMS_020;
return C2P_KALMS_000; /* 68000 fallback */
}
References
- Mikael Kalms — kalms-c2p — the definitive C2P library (GitHub)
- Scout/Azure — "Chunky 2 Planar Tutorial" — the seminal demoscene document explaining the transposition theory
- Amiga Hardware Reference Manual — bitplane DMA, display pipeline
- NDK39:
graphics/gfx.h—WriteChunkyPixels()prototype - Intel — Structure of Arrays vs Array of Structures — modern SoA/AoS guide
- NVIDIA — CUDA Programming Guide, "Shared Memory Matrix Transpose" — GPU equivalent of C2P
See Also
- Akiko — CD32 C2P Hardware — Akiko register protocol
- BitMap — Planar Layout — how Amiga bitmaps are structured in memory
- Blitter Programming — Blitter DMA used in Blitter-assisted C2P
- RTG — Retargetable Graphics — chunky framebuffer cards that eliminate C2P
- Memory Types — Chip vs Fast RAM (critical for C2P buffer placement)