# Pixel Format Conversion — Chunky ↔ Planar and Beyond
## The Core Problem
Every Amiga programmer eventually hits the same wall: the custom chipset displays graphics in **planar** format, but nearly every interesting algorithm — 3D rendering, texture mapping, image decompression, PC game ports — produces output in **chunky** format. Converting between these two layouts is the single most CPU-intensive bottleneck in Amiga graphics programming.
This article covers:
1.**What** planar and chunky formats are, mathematically
2.**Why** the conversion is expensive
3.**How** every known solution works — from naive loops to the Kalms butterfly
4.**Where** this problem appears in broader computing (SoA/AoS, GPU swizzle, SIMD)
> [!NOTE]
> The Akiko hardware article covers the CD32's dedicated C2P register interface. This article covers the *algorithm theory* that applies to every Amiga model, and the broader data-layout concepts that connect the Amiga to modern computing.
Each byte = one pixel. Linear, simple, cache-friendly for rendering. This is how **every modern GPU**, every PC VGA card, every framebuffer since 1990 stores pixels.
### Planar (Bitplane)
Each pixel's colour index is **split across N separate memory regions** (bitplanes). For 8-bit pixels (8 bitplanes), each bitplane stores one bit of every pixel:
```
Bitplane 0: 1 0 1 1 0 0 1 0 ← bit 0 of pixels 0–7
Bitplane 1: 0 1 0 1 1 0 0 1 ← bit 1 of pixels 0–7
Bitplane 2: 1 1 0 0 0 1 1 0 ← bit 2
Bitplane 3: 0 1 1 0 1 1 0 0 ← bit 3
Bitplane 4: 1 0 1 0 1 0 0 1 ← bit 4
Bitplane 5: 1 0 0 0 0 0 1 0 ← bit 5
Bitplane 6: 0 0 1 0 0 0 0 1 ← bit 6
Bitplane 7: 0 0 0 0 1 0 1 0 ← bit 7
```
To read pixel 0's colour: collect bit 0 from each of the 8 planes → `10101100` = `$AC`. The 8 planes are **not interleaved** in standard Amiga layout — each is a separate contiguous memory block.
### Why the Amiga Uses Planar
The planar format was a brilliant engineering choice in 1985:
| Advantage | Explanation |
|---|---|
| **Bandwidth efficiency** | A 4-colour screen uses 2 bitplanes = ½ the memory bandwidth of 4bpp chunky. DMA fetches only the planes actually used. |
| **Scalable colour depth** | Adding a bitplane doubles the colour count without redesigning the display engine. OCS: 1–6 planes. AGA: 1–8 planes. |
| **Cheap colour cycling** | Rotating palette indices only requires changing colour registers — zero memory writes. |
| **Blitter efficiency** | Blitting a masked sprite at 4 colours touches only 2 planes (2 blits), not 4× the data. |
| **Copper integration** | The Copper can change palette registers mid-scanline, effectively multiplying colours without more bitplanes. |
The downside only became critical as rendering algorithms evolved past 2D sprites into 3D, texture mapping, and pixel-level effects that naturally produce chunky output.
---
## The Conversion — Mathematically
C2P is a **bit matrix transposition**. Given 32 chunky pixels (each 8 bits wide), you have a 32×8 bit matrix (32 rows × 8 columns). C2P transposes this to an 8×32 matrix (8 bitplanes × 32 bits each):
This is equivalent to a 90° bit rotation. On a modern CPU with SIMD, this is trivial. On a 68020 with 8 data registers and no bit-parallel instructions, it is an algorithmic challenge that consumed thousands of programmer-hours across the demoscene.
---
## Solution 1 — The Naive Loop
The simplest approach: iterate over every pixel, extract each bit, and set it in the corresponding bitplane.
```c
/* Naive C2P — educational only, never use in production */
void c2p_naive(UBYTE *chunky, UBYTE *planes[8], int width, int height)
**Performance:** ~200+ cycles per pixel on 68020. For 320×256 = 81,920 pixels → **~16 million cycles → ~1.1 seconds at 14 MHz**. This gives roughly **0.9 FPS**. Completely unusable.
**Why it's terrible:**
- One bit at a time — no parallelism
- Read-modify-write on every bitplane byte (bus-killing)
- No register reuse — constant memory traffic
- Branch on every bit (pipeline flush on 68020)
---
## Solution 2 — The Merge (Butterfly) Algorithm
This is the standard approach used by virtually all serious Amiga C2P routines. Invented independently by several demoscene coders and formalised by **Mikael Kalms** (Kalmalyzer) and others.
### The Key Insight
Instead of processing one pixel at a time, load **32 pixels** (8 longwords = 256 bits) into CPU registers and perform a series of **bit-level swap operations** (called "merges") that progressively rearrange the bits into planar order. Each merge pass swaps bits at a different granularity: 16-bit blocks, then 8-bit, then 4-bit, 2-bit, and 1-bit.
This is exactly a **butterfly network** — the same structure used in the FFT (Fast Fourier Transform) and Batcher's bitonic sort.
### The Merge Primitive
The fundamental building block is a 2-register swap that exchanges bits at a given stride:
```asm
; merge(d0, d1, mask, shift)
; Exchanges bits between d0 and d1 where mask selects which bits to swap
> The above is a pedagogical skeleton. Production C2P routines are **heavily unrolled** and use every register trick available — address registers as temporary storage, interleaving loads with merges to hide memory latency, and sometimes splitting the conversion across two phases to overlap with Chip RAM writes.
The CD32's Akiko chip implements C2P in dedicated silicon. The CPU feeds 8 longwords of chunky data to register `$B80030` and reads back 8 longwords of planar data from the same address.
| Metric | Software C2P (68020) | Akiko |
|---|---|---|
| Method | CPU merge/butterfly | Hardware register pipeline |
| Throughput | ~1.5 MB/s | ~1.5 MB/s |
| CPU cost | 100% | ~50% (register I/O) |
| Availability | All Amigas | **CD32 only** |
Akiko's throughput is approximately the same as optimised software C2P on the 68020 because both are limited by the Chip RAM bus bandwidth (~3.5 MB/s shared). On faster CPUs (68040/060), software C2P **outperforms** Akiko because the CPU can process data faster than the register interface can shuttle it.
Full Akiko protocol: [Akiko — CD32 C2P Hardware](../01_hardware/aga_a1200_a4000/akiko_cd32.md#chunky-to-planar-c2p-conversion)
---
## Solution 4 — Blitter-Assisted C2P
The Blitter can be used as part of a C2P pipeline, but it cannot perform the transposition itself. Typical usage:
1. CPU performs the merge/butterfly in registers → outputs planar longwords to a temporary buffer in Chip RAM
2. Blitter copies the planar data from the temporary buffer to the screen's bitplanes with correct modulo
This approach **overlaps** CPU computation with Blitter DMA — while the Blitter writes frame N's planes to the screen, the CPU computes frame N+1's transposition.
```
Time ──────────────────────────────────────────────────────→
> On 68040/060 systems, the Blitter is often **slower** than letting the CPU do both the merge and the writes via `MOVE16` (68040) or unrolled `MOVEM.L`. The Blitter's 16-bit bus (even in AGA FMODE×4) adds DMA contention that may actually slow down the CPU's merge passes.
---
## Solution 5 — WriteChunkyPixels (AmigaOS)
AmigaOS 3.0+ provides `WriteChunkyPixels()` in `graphics.library`, which performs C2P conversion internally using the best available method:
```c
#include <graphics/gfx.h>
WriteChunkyPixels(rp,
xstart, ystart, xstop, ystop,
chunky_buffer, chunky_bytes_per_row);
```
On CD32, this function auto-detects Akiko and uses it. On other AGA machines, it uses an internal software C2P. However, the OS implementation is **not** as fast as the best demoscene routines — it prioritises correctness and generality over raw speed.
---
## Solution 6 — RTG: Eliminating C2P Entirely
The ultimate solution to C2P is to **not do it at all**. Retargetable Graphics (RTG) cards like the Picasso IV, CyberVision 64, and MiSTer's virtual `uaegfx` provide a chunky framebuffer directly. The rendering engine writes chunky pixels to VRAM, and the card's RAMDAC/scaler converts them to video output.
The irony: RTG cards must perform the **reverse** conversion (P2C — planar-to-chunky) when legacy planar software runs on an RTG screen. The CyberVision 64 included a dedicated **Roxxler** chip for this. Without hardware help, P2C on software is equally expensive.
| CD32 (68020 + Akiko) | Akiko hardware | Frees ~50% CPU for game logic |
| A4000 (68030/040) | CPU merge (skip Akiko if not CD32) | 68040 `MOVE16` makes CPU writes fast enough |
| 68060 accelerated | CPU merge, no Blitter | 68060 superscalar outperforms everything else |
| MiSTer FPGA | RTG (`uaegfx`) | Chunky framebuffer in DDR — no C2P needed |
---
## The Bigger Picture — Data Layout Transformation
C2P is not unique to the Amiga. It is an instance of a fundamental problem in computer architecture: **transforming data layout between Structure-of-Arrays (SoA) and Array-of-Structures (AoS)**.
The Amiga's planar format is **SoA**: each bitplane is an array of one field (one bit) across all pixels. The chunky format is **AoS**: each pixel's fields (all 8 bits) are packed together.
### Where This Problem Appears Today
| Domain | SoA (Planar-Like) | AoS (Chunky-Like) | Conversion |
| **SoA / Planar** | Streaming one field across many elements | Maximises cache line utilisation, enables SIMD vectorisation |
| **AoS / Chunky** | Random-access to complete elements | All fields of one element in one cache line |
The Amiga's custom DMA engine streams bitplane data to the display sequentially — plane 0 for the whole line, then plane 1, etc. This is a **SoA access pattern**, perfectly matched by the planar layout. The CPU, which wants to set a single pixel's complete colour, has the opposite need — it wants **AoS**.
### Modern Hardware Parallels
| Amiga Component | Modern Equivalent | Function |
| **FMODE (fetch width)** | GPU memory bus width (256/384/512-bit) | Wider bus = more data per DMA cycle |
### GPU Texture Swizzle — The Modern Akiko
Modern GPUs store textures in **swizzled** (Morton/Z-order) layouts rather than linear row-major order. This is architecturally identical to what the Amiga does with planar bitmaps: the hardware's memory access pattern doesn't match the CPU's logical layout, so a dedicated hardware unit transparently converts between them.
```
Linear (CPU view): Morton/Z-order (GPU internal):
0 1 2 3 0 1 4 5
4 5 6 7 → 2 3 6 7
8 9 10 11 8 9 12 13
12 13 14 15 10 11 14 15
```
When you call `glTexImage2D()` or `vkCmdCopyBufferToImage()`, the GPU driver performs a layout conversion from linear (CPU-friendly) to swizzled (GPU-cache-friendly). This is the exact same class of operation as Amiga C2P — a hardware-accelerated data layout transformation that is invisible to the application programmer.
---
## Performance Comparison Across Eras
| System | Data Layout Problem | Throughput | Method |
| Apple M2 (2022) | SoA↔AoS for ML tensors | ~100 GB/s | Hardware AMX |
The throughput gap tells the story: what consumed 100% of a 68020's capability is handled by a dedicated hardware unit at 200,000× the bandwidth on modern silicon. But the fundamental problem — **data layout mismatch between producer and consumer** — is identical.
---
## Historical Timeline
| Year | Event |
|---|---|
| 1985 | Amiga launches with planar display. C2P not needed — all software renders directly to bitplanes |
| 1989 | First 3D demos appear (Juggler, etc.). Rendering in chunky buffers starts |
| 1991 | Demoscene coders develop first optimised C2P routines for 68000 |
| 1992 | AGA ships (A1200/A4000). 8 bitplanes = C2P problem gets 2× harder |
| 1993 | CD32 ships with Akiko — first hardware C2P. Mikael Kalms publishes optimised CPU routines |
| 1994 | Kalms C2P library becomes the de facto standard. Multiple variants for 020/030/040/060 |
| 1995 | RTG cards (Picasso II, CyberVision 64) begin to make C2P irrelevant for productivity |
| 1996 | CyberVision 64 ships with Roxxler P2C chip — the reverse problem, solved in hardware |
| 1998 | 68060 accelerators make CPU C2P faster than any hardware solution |
| 2020+ | MiSTer FPGA core implements RTG via `uaegfx` — C2P eliminated for modern setups |
---
## Implementing C2P — Practical Checklist
For developers writing Amiga software that renders in chunky format:
1.**Allocate the chunky buffer in Fast RAM** (`MEMF_FAST`) — the CPU reads it during conversion, and Fast RAM has no DMA contention
2.**Allocate the planar screen in Chip RAM** (`MEMF_CHIP | MEMF_DISPLAYABLE`) — this is mandatory for display DMA
3.**Use a proven C2P library** — Kalms C2P (`kalms-c2p` on GitHub/Aminet) is the gold standard
4.**Match the routine to your CPU** — different unrolling for 68020 vs 68040 vs 68060
5.**Use triple buffering** if possible — render to buffer A, C2P buffer B into Chip RAM, display buffer C
6.**On CD32, detect and use Akiko** — `WriteChunkyPixels()` does this automatically
7.**On RTG systems, skip C2P entirely** — render chunky directly to the RTG card's VRAM
8.**Profile with CIA timers** — the bottleneck shifts between CPU merge and Chip RAM write speed depending on configuration
### Adaptive Detection
```c
#include <graphics/gfxbase.h>
#include <cybergraphx/cybergraphics.h>
extern struct GfxBase *GfxBase;
/* Determine best C2P strategy for current hardware */
- Intel — [Structure of Arrays vs Array of Structures](https://www.intel.com/content/www/us/en/developer/articles/technical/memory-layout-transformations.html) — modern SoA/AoS guide
- NVIDIA — CUDA Programming Guide, "Shared Memory Matrix Transpose" — GPU equivalent of C2P