# Pixel Format Conversion: Chunky ↔ Planar and Beyond

## The Core Problem

Every Amiga programmer eventually hits the same wall: the custom chipset displays graphics in **planar** format, but nearly every interesting algorithm (3D rendering, texture mapping, image decompression, PC game ports) produces output in **chunky** format. Converting between these two layouts is the single most CPU-intensive bottleneck in Amiga graphics programming.

This article covers:

1. **What** planar and chunky formats are, mathematically
2. **Why** the conversion is expensive
3. **How** every known solution works, from naive loops to the Kalms butterfly
4. **Where** this problem appears in broader computing (SoA/AoS, GPU swizzle, SIMD)

> [!NOTE]
> The Akiko hardware article covers the CD32's dedicated C2P register interface. This article covers the *algorithm theory* that applies to every Amiga model, and the broader data-layout concepts that connect the Amiga to modern computing.
>
> See: [Akiko: CD32 C2P Hardware](../01_hardware/aga_a1200_a4000/akiko_cd32.md)

---

## Planar vs Chunky: The Two Layouts

### Chunky (Packed Pixel)

Every pixel's complete colour index is stored contiguously. For 8-bit (256 colour) pixels:

```
Address: $0000  $0001  $0002  $0003  $0004  $0005  $0006  $0007
Data:    $0D    $05    $1B    $0A    $FF    $03    $42    $7E
         pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7
```

Each byte = one pixel. Linear, simple, cache-friendly for rendering. This is how modern GPUs, VGA's 256-colour Mode 13h, and virtually every framebuffer since the early 1990s store pixels.

### Planar (Bitplane)

Each pixel's colour index is **split across N separate memory regions** (bitplanes).
For 8-bit pixels (8 bitplanes), each bitplane stores one bit of every pixel:

```
Bitplane 0:  1 0 1 1 0 0 1 0   ← bit 0 of pixels 0–7
Bitplane 1:  0 1 0 1 1 0 0 1   ← bit 1 of pixels 0–7
Bitplane 2:  1 1 0 0 0 1 1 0   ← bit 2
Bitplane 3:  0 1 1 0 1 1 0 0   ← bit 3
Bitplane 4:  1 0 1 0 1 0 0 1   ← bit 4
Bitplane 5:  1 0 0 0 0 0 1 0   ← bit 5
Bitplane 6:  0 0 1 0 0 0 0 1   ← bit 6
Bitplane 7:  0 0 0 0 1 0 1 0   ← bit 7
```

To read pixel 0's colour, collect the leftmost bit of each of the 8 planes. Written from plane 7 (most significant) down to plane 0, that gives `00110101` = `$35`. The 8 planes are **not interleaved** in standard Amiga layout: each is a separate contiguous memory block.

### Why the Amiga Uses Planar

The planar format was a brilliant engineering choice in 1985:

| Advantage | Explanation |
|---|---|
| **Bandwidth efficiency** | A 4-colour screen uses 2 bitplanes = ½ the memory bandwidth of 4bpp chunky. DMA fetches only the planes actually used. |
| **Scalable colour depth** | Adding a bitplane doubles the colour count without redesigning the display engine. OCS: 1–6 planes. AGA: 1–8 planes. |
| **Cheap colour cycling** | Rotating palette indices only requires changing colour registers, with zero memory writes. |
| **Blitter efficiency** | Blitting a masked sprite at 4 colours touches only 2 planes (2 blits); the data moved scales with the planes in use, not with a fixed pixel size. |
| **Copper integration** | The Copper can change palette registers mid-scanline, effectively multiplying colours without more bitplanes. |

The downside only became critical as rendering algorithms evolved past 2D sprites into 3D, texture mapping, and pixel-level effects that naturally produce chunky output.

---

## The Conversion, Mathematically

C2P is a **bit matrix transposition**. Given 32 chunky pixels (each 8 bits wide), you have a 32×8 bit matrix (32 rows × 8 columns).
C2P transposes this to an 8×32 matrix (8 bitplanes × 32 bits each):

```
Input (chunky):                       Output (planar):
32 pixels × 8 bits                    8 bitplanes × 32 bits
┌──────────────────────────────┐      ┌────────────────────────────────────────┐
│ P0:  b7 b6 b5 b4 b3 b2 b1 b0 │      │ Plane 0: p0.b0 p1.b0 p2.b0 ... p31.b0  │
│ P1:  b7 b6 b5 b4 b3 b2 b1 b0 │      │ Plane 1: p0.b1 p1.b1 p2.b1 ... p31.b1  │
│ ...                          │      │ ...                                    │
│ P31: b7 b6 b5 b4 b3 b2 b1 b0 │      │ Plane 7: p0.b7 p1.b7 p2.b7 ... p31.b7  │
└──────────────────────────────┘      └────────────────────────────────────────┘
```

Geometrically, a transpose mirrors the bit matrix across its diagonal (a 90° rotation followed by a flip). On a modern CPU with SIMD shuffle instructions this is cheap. On a 68020 with 8 data registers and no bit-parallel instructions, it is an algorithmic challenge that consumed thousands of programmer-hours across the demoscene.

---

## Solution 1: The Naive Loop

The simplest approach: iterate over every pixel, extract each bit, and set it in the corresponding bitplane.

```c
#include <exec/types.h>   /* UBYTE */

/* Naive C2P: educational only, never use in production.
 * width must be a multiple of 8; planes[] are 8 separate bitplanes. */
void c2p_naive(UBYTE *chunky, UBYTE *planes[8], int width, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            UBYTE pixel = chunky[y * width + x];
            int byte_offset = y * (width / 8) + (x / 8);
            int bit_position = 7 - (x & 7);   /* leftmost pixel = bit 7 */

            for (int plane = 0; plane < 8; plane++) {
                if (pixel & (1 << plane))
                    planes[plane][byte_offset] |= (1 << bit_position);
                else
                    planes[plane][byte_offset] &= ~(1 << bit_position);
            }
        }
    }
}
```

**Performance:** ~200+ cycles per pixel on 68020. For 320×256 = 81,920 pixels → **~16 million cycles → ~1.1 seconds at 14 MHz**. This gives roughly **0.9 FPS**. Completely unusable.

**Why it's terrible:**

- One bit at a time, so no parallelism
- Read-modify-write on every bitplane byte (bus-killing)
- No register reuse, so constant memory traffic
- A branch on every bit (each taken branch refills the 68020's instruction pipeline)

---

## Solution 2: The Merge (Butterfly) Algorithm

This is the standard approach used by virtually all serious Amiga C2P routines.
It was invented independently by several demoscene coders and formalised by **Mikael Kalms** (Kalmalyzer) and others.

### The Key Insight

Instead of processing one pixel at a time, load **32 pixels** (8 longwords = 256 bits) into CPU registers and perform a series of **bit-level swap operations** (called "merges") that progressively rearrange the bits into planar order. Each merge pass swaps bits at a different granularity: 16-bit blocks, then 8-bit, then 4-bit, 2-bit, and 1-bit.

This has the structure of a **butterfly network**, the same data-flow pattern used in the FFT (Fast Fourier Transform) and Batcher's bitonic sort.

### The Merge Primitive

The fundamental building block is a 2-register swap that exchanges bits at a given stride:

```asm
; merge(d0, d1, mask, shift)
; Exchanges bits between d0 and d1: mask selects which bits to swap,
; shift determines the stride. d2 is a scratch data register.
; Note: an immediate shift count on the 68k is limited to 1-8, so the
; 16-bit pass uses SWAP (or a count held in a register) instead.
        move.l  d0,d2           ; temp = a
        lsr.l   #shift,d2       ; temp >>= stride
        eor.l   d1,d2           ; temp ^= b
        and.l   #mask,d2        ; temp &= mask (select bits to swap)
        eor.l   d2,d1           ; b ^= temp (swap into b)
        lsl.l   #shift,d2       ; temp <<= stride (restore position)
        eor.l   d2,d0           ; a ^= temp (swap into a)
```

**7 instructions** per merge. Each merge moves half the bits in two registers to their correct positions.

### Pass Structure for 8 Bitplanes

A full 8-bitplane C2P conversion on 32 pixels requires **5 passes** of merge operations:

| Pass | Block Size | Mask | Swap Distance | Effect |
|---|---|---|---|---|
| 1 | 16-bit | `$0000FFFF` | 16 | Swap upper/lower halves of longword pairs |
| 2 | 8-bit | `$00FF00FF` | 8 | Swap bytes within pairs |
| 3 | 4-bit | `$0F0F0F0F` | 4 | Swap nibbles |
| 4 | 2-bit | `$33333333` | 2 | Swap bit-pairs |
| 5 | 1-bit | `$55555555` | 1 | Swap individual bits |

Each pass applies the merge to four register pairs, so the whole transposition is 20 merges. After all 5 passes, the 8 data registers contain one longword per bitplane; which register holds which plane depends on how the registers were paired in each pass.
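The pass table above can be sketched in portable C to make the bit movement easy to check on any machine. This is a minimal illustration and not Kalms's actual routine: the function names (`merge`, `pack_pixels`, `c2p_32px`), the register pairing, and the resulting `plane_of` table are assumptions chosen for this sketch. Other pairings are equally valid and give a different register-to-plane mapping.

```c
#include <stdint.h>

/* One merge step: the masked bits of *b trade places with the bits
 * `shift` positions higher in *a. */
static void merge(uint32_t *a, uint32_t *b, uint32_t mask, unsigned shift)
{
    uint32_t t = ((*a >> shift) ^ *b) & mask;
    *b ^= t;
    *a ^= t << shift;
}

/* Pack 32 chunky bytes into 8 words the way a big-endian 68k longword
 * load would: 4 pixels per word, first pixel in the top byte. */
void pack_pixels(const uint8_t px[32], uint32_t in[8])
{
    for (int r = 0; r < 8; r++)
        in[r] = ((uint32_t)px[4 * r] << 24) | ((uint32_t)px[4 * r + 1] << 16) |
                ((uint32_t)px[4 * r + 2] << 8) | (uint32_t)px[4 * r + 3];
}

/* Transpose 32 chunky pixels into 8 bitplane words.
 * plane[b] bit 31 is pixel 0, bit 0 is pixel 31. */
void c2p_32px(const uint32_t in[8], uint32_t plane[8])
{
    /* shift, mask and register stride for each of the five passes */
    static const struct { unsigned shift; uint32_t mask; int stride; } pass[5] = {
        { 16, 0x0000FFFFu, 4 },
        {  8, 0x00FF00FFu, 2 },
        {  4, 0x0F0F0F0Fu, 1 },
        {  2, 0x33333333u, 4 },
        {  1, 0x55555555u, 2 },
    };
    /* Which bitplane ends up in x[r] for the pairing used below. */
    static const int plane_of[8] = { 7, 3, 6, 2, 5, 1, 4, 0 };
    uint32_t x[8];

    for (int r = 0; r < 8; r++)
        x[r] = in[r];

    /* Each pass merges the four register pairs that differ by `stride`. */
    for (int p = 0; p < 5; p++)
        for (int lo = 0; lo < 8; lo++)
            if ((lo & pass[p].stride) == 0)
                merge(&x[lo | pass[p].stride], &x[lo],
                      pass[p].mask, pass[p].shift);

    for (int r = 0; r < 8; r++)
        plane[plane_of[r]] = x[r];
}
```

The final store through `plane_of` is the C counterpart of writing each data register to the matching bitplane pointer; an assembly routine pays nothing for this because it simply chooses which register goes to which plane.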
### Full 8-Bitplane C2P Inner Loop

```asm
; Kalms-style C2P inner loop: converts 32 chunky pixels (8 longwords)
; to 8 planar longwords (one per bitplane)
;
; Input:  d0-d7 = 8 longwords of chunky data (4 pixels each)
; Output: d0-d7 = 8 longwords of planar data (one per bitplane)
;
; Shifts and logic operations only work on data registers, so whenever a
; scratch register is needed one value is parked in an address register.

; ---- Pass 1: 16-bit swap (mask $0000FFFF, shift 16) ----
; Pairs: (d0,d4) (d1,d5) (d2,d6) (d3,d7)
; Exchanging whole words needs neither a mask nor a scratch register:
        swap    d0              ; bring d0's upper word into the low word
        eor.w   d0,d4           ; three EORs exchange the two low words
        eor.w   d4,d0
        eor.w   d0,d4
        swap    d0              ; d0 = old d4.lo : old d0.lo
                                ; d4 = old d4.hi : old d0.hi
        ; ... same five instructions for (d1,d5), (d2,d6), (d3,d7)

; ---- Pass 2: 8-bit swap (mask $00FF00FF, shift 8) ----
; Pairs: (d0,d2) (d1,d3) (d4,d6) (d5,d7)
        move.l  d7,a0           ; park d7 so it can serve as scratch

        ; merge(d0,d2)
        move.l  d0,d7
        lsr.l   #8,d7
        eor.l   d2,d7
        and.l   #$00FF00FF,d7
        eor.l   d7,d2
        lsl.l   #8,d7
        eor.l   d7,d0

        ; merge(d1,d3), merge(d4,d6): same pattern

        ; merge(d5,d7): d7's value sits in a0, so borrow d0 for it
        move.l  d0,a1           ; park d0
        move.l  a0,d0           ; d0 = original d7 value
        ; ... merge(d5,d0) with d7 as scratch ...
        move.l  d0,d7           ; result back into d7
        move.l  a1,d0           ; restore d0

; ---- Pass 3: 4-bit swap ----
; mask = $0F0F0F0F, shift = 4
; merge(d0,d1), merge(d2,d3), merge(d4,d5), merge(d6,d7)

; ---- Pass 4: 2-bit swap ----
; mask = $33333333, shift = 2

; ---- Pass 5: 1-bit swap ----
; mask = $55555555, shift = 1

; Result: each of d0-d7 holds 32 pixels of one bitplane. Which register
; maps to which plane depends on the pairings chosen in passes 4 and 5,
; so store each register through the matching bitplane pointer.
```

> [!NOTE]
> The above is a pedagogical skeleton. Production C2P routines are **heavily unrolled** and use every register trick available: address registers as temporary storage, interleaving loads with merges to hide memory latency, and sometimes splitting the conversion across two phases to overlap with Chip RAM writes.
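To show where a 32-pixel kernel sits inside a whole-frame conversion, here is a hedged C sketch of the outer loop. The `c2p_kernel` function-pointer interface and the names `c2p_kernel_naive` and `c2p_frame` are hypothetical, invented for this illustration. The sketch assumes the width is a multiple of 32, separate contiguous bitplanes with no modulo, and uses a deliberately slow bit-by-bit reference kernel so the block stays self-contained.

```c
#include <stdint.h>
#include <stddef.h>

/* A 32-pixel kernel: 32 chunky bytes in, 8 plane words out
 * (bit 31 of each word = first pixel). Hypothetical interface. */
typedef void (*c2p_kernel)(const uint8_t px[32], uint32_t plane[8]);

/* Reference kernel, one bit at a time. A merge-based kernel with the
 * same signature can be passed to c2p_frame() instead. */
void c2p_kernel_naive(const uint8_t px[32], uint32_t plane[8])
{
    for (int b = 0; b < 8; b++) {
        uint32_t w = 0;
        for (int i = 0; i < 32; i++)
            w |= (uint32_t)((px[i] >> b) & 1u) << (31 - i);
        plane[b] = w;
    }
}

/* Convert a whole frame, 32 pixels at a time.
 * Assumes width % 32 == 0 and planes[b] holding width/8 bytes per row. */
void c2p_frame(const uint8_t *chunky, uint8_t *planes[8],
               int width, int height, c2p_kernel kernel)
{
    size_t groups = ((size_t)width / 32) * (size_t)height;

    for (size_t g = 0; g < groups; g++) {
        uint32_t w[8];
        kernel(chunky + g * 32, w);          /* 32 chunky bytes in */

        for (int b = 0; b < 8; b++) {        /* 4 bytes out per plane */
            uint8_t *dst = planes[b] + g * 4;
            /* Store big-endian so the byte order matches Amiga bitplanes. */
            dst[0] = (uint8_t)(w[b] >> 24);
            dst[1] = (uint8_t)(w[b] >> 16);
            dst[2] = (uint8_t)(w[b] >> 8);
            dst[3] = (uint8_t)w[b];
        }
    }
}
```

The eight scattered 4-byte stores per group are the Chip RAM writes that production routines try to overlap with the next group's merges.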
### Performance

| Metric | Naive | Merge/Butterfly | Improvement |
|---|---|---|---|
| Instructions per 32 pixels | ~6,400+ | ~160–200 | **32–40×** |
| Cycles per pixel (68020 @ 14 MHz) | ~200 | ~5–7 | **~30×** |
| 320×256 full frame time | ~1.1 s | ~35 ms | **~30×** |
| 320×256 frame rate | ~0.9 FPS | **~28 FPS** | Unusable → playable |

---

## Solution 3: Akiko Hardware C2P (CD32 Only)

The CD32's Akiko chip implements C2P in dedicated silicon. The CPU feeds 8 longwords of chunky data to register `$B80038` and reads back 8 longwords of planar data from the same address.

| Metric | Software C2P (68020) | Akiko |
|---|---|---|
| Method | CPU merge/butterfly | Hardware register pipeline |
| Throughput | ~1.5 MB/s | ~1.5 MB/s |
| CPU cost | 100% | ~50% (register I/O) |
| Availability | All Amigas | **CD32 only** |

Akiko's throughput is approximately the same as optimised software C2P on the 68020 because both are limited by the Chip RAM bus bandwidth (~3.5 MB/s shared). On faster CPUs (68040/060), software C2P **outperforms** Akiko because the CPU can process data faster than the register interface can shuttle it.

Full Akiko protocol: [Akiko: CD32 C2P Hardware](../01_hardware/aga_a1200_a4000/akiko_cd32.md#chunky-to-planar-c2p-conversion)

---

## Solution 4: Blitter-Assisted C2P

The Blitter can be used as part of a C2P pipeline, but it cannot perform the transposition itself. Typical usage:

1. CPU performs the merge/butterfly in registers → outputs planar longwords to a temporary buffer in Chip RAM
2. Blitter copies the planar data from the temporary buffer to the screen's bitplanes with correct modulo

This approach **overlaps** CPU computation with Blitter DMA: while the Blitter writes frame N's planes to the screen, the CPU computes frame N+1's transposition.

```
Time ──────────────────────────────────────────────────────→
CPU:     [merge frame 0] [merge frame 1] [merge frame 2] ...
Blitter:                 [write frame 0] [write frame 1] ...
                         ↑ overlap: CPU and Blitter run in parallel
```

> [!WARNING]
> On 68040/060 systems, the Blitter is often **slower** than letting the CPU do both the merge and the writes via `MOVE16` (68040) or unrolled `MOVEM.L`. The Blitter remains a 16-bit device even on AGA (FMODE widens only bitplane and sprite fetches), and its DMA contention may actually slow down the CPU's merge passes.

---

## Solution 5: WriteChunkyPixels (AmigaOS)

AmigaOS 3.1 (`graphics.library` V40) and later provide `WriteChunkyPixels()`, which performs C2P conversion internally using the best available method:

```c
#include <proto/graphics.h>

/* Writes the chunky rectangle (xstart, ystart)-(xstop, ystop), inclusive,
 * into the RastPort. chunky_bytes_per_row is the source row pitch. */
WriteChunkyPixels(rp, xstart, ystart, xstop, ystop,
                  chunky_buffer, chunky_bytes_per_row);
```

On CD32, this function auto-detects Akiko and uses it. On other AGA machines, it uses an internal software C2P. However, the OS implementation is **not** as fast as the best demoscene routines, because it prioritises correctness and generality over raw speed.

---

## Solution 6: RTG, Eliminating C2P Entirely

The ultimate solution to C2P is to **not do it at all**. Retargetable Graphics (RTG) cards like the Picasso IV and CyberVision 64, and virtual boards such as MiSTer's `uaegfx`, provide a chunky framebuffer directly. The rendering engine writes chunky pixels to VRAM, and the card's RAMDAC/scaler converts them to video output.

The irony: RTG cards must perform the **reverse** conversion (P2C, planar-to-chunky) when legacy planar software runs on an RTG screen. The CyberVision 64 included a dedicated **Roxxler** chip for this. Without hardware help, P2C in software is equally expensive.
See: [RTG: Retargetable Graphics](../16_driver_development/rtg_driver.md#planar-to-chunky-conversion-c2p)

---

## Choosing the Right Approach

| Platform | Recommended C2P | Why |
|---|---|---|
| A500/A2000 (68000) | Merge algorithm (simplified, fewer planes) | Slow multi-bit shifts and no cache; 68000 can manage 4–5 plane C2P at ~15 FPS |
| A1200 (68020) | Kalms merge, 5-pass | Sweet spot: enough registers, usable I-cache |
| CD32 (68020 + Akiko) | Akiko hardware | Frees ~50% CPU for game logic |
| A4000 (68030/040) | CPU merge (no Akiko outside the CD32) | 68040 `MOVE16` makes CPU writes fast enough |
| 68060 accelerated | CPU merge, no Blitter | 68060 superscalar outperforms everything else |
| MiSTer FPGA | RTG (`uaegfx`) | Chunky framebuffer in DDR, so no C2P needed |

---

## The Bigger Picture: Data Layout Transformation

C2P is not unique to the Amiga. It is an instance of a fundamental problem in computer architecture: **transforming data layout between Structure-of-Arrays (SoA) and Array-of-Structures (AoS)**.

### SoA vs AoS: The Universal Duality

```c
/* AoS (Array of Structures) = Chunky */
struct Pixel { float r, g, b, a; };
struct Pixel pixels[1024];
/* Memory: r0 g0 b0 a0 r1 g1 b1 a1 r2 g2 b2 a2 ...
   Each element's fields are contiguous. */

/* SoA (Structure of Arrays) = Planar */
struct Pixels {
    float r[1024];
    float g[1024];
    float b[1024];
    float a[1024];
};
/* Memory: r0 r1 r2 ... r1023 g0 g1 g2 ... g1023 ...
   Each field is contiguous across all elements. */
```

The Amiga's planar format is **SoA**: each bitplane is an array of one field (one bit) across all pixels. The chunky format is **AoS**: each pixel's fields (all 8 bits) are packed together.
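The same duality can be made concrete with a pair of conversion loops. This is an illustrative sketch only: the type names (`PixelAoS`, `PixelsSoA`), the capacity `N_PIXELS`, and both functions are assumptions for this example. `aos_to_soa()` plays the role of C2P at float granularity rather than bit granularity, and `soa_to_aos()` plays the role of P2C.

```c
#include <stddef.h>

#define N_PIXELS 64   /* capacity chosen arbitrarily for the sketch */

struct PixelAoS  { float r, g, b, a; };               /* chunky-like */
struct PixelsSoA { float r[N_PIXELS], g[N_PIXELS],
                         b[N_PIXELS], a[N_PIXELS]; }; /* planar-like */

/* "C2P" at field granularity: scatter each element's fields
 * into one array per field. n must not exceed N_PIXELS. */
void aos_to_soa(const struct PixelAoS *in, struct PixelsSoA *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out->r[i] = in[i].r;
        out->g[i] = in[i].g;
        out->b[i] = in[i].b;
        out->a[i] = in[i].a;
    }
}

/* "P2C": gather one value from each field array back into an element. */
void soa_to_aos(const struct PixelsSoA *in, struct PixelAoS *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i].r = in->r[i];
        out[i].g = in->g[i];
        out[i].b = in->b[i];
        out[i].a = in->a[i];
    }
}
```

With whole floats as fields the transform is a plain strided copy. What makes C2P hard is that its fields are single bits, so the same rearrangement has to happen inside registers.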
### Where This Problem Appears Today

| Domain | SoA (Planar-Like) | AoS (Chunky-Like) | Conversion |
|---|---|---|---|
| **Amiga graphics** | Bitplanes (Agnus DMA) | Chunky pixel buffer (CPU render) | C2P algorithm |
| **GPU compute shaders** | SoA buffer layouts (SSBO) | Vertex attributes (interleaved VBO) | Shader transpose |
| **SIMD / AVX-512** | Separate float arrays (vectorisable) | Struct arrays (gather/scatter) | Unpack, shuffle and permute intrinsics (e.g. `_mm512_unpacklo_ps`) |
| **Database engines** | Columnar storage (Parquet, Arrow) | Row-oriented storage (MySQL) | Column↔row materialisation |
| **Image compression** | Colour planes (JPEG YCbCr) | RGB pixels (BMP) | MCU block decomposition |
| **GPU texture memory** | Block-compressed (BC/ASTC) | Linear RGBA | Hardware texture unit decode |
| **Neural network inference** | NCHW tensor layout (channels first) | NHWC (channels last) | Layout transposition kernel |

### Why Each System Prefers a Different Layout

| Layout | Optimal For | Reason |
|---|---|---|
| **SoA / Planar** | Streaming one field across many elements | Maximises cache line utilisation, enables SIMD vectorisation |
| **AoS / Chunky** | Random-access to complete elements | All fields of one element in one cache line |

The Amiga's custom DMA engine streams bitplane data to the display through one pointer per plane: as the beam advances it fetches a word from each active plane in turn, and every plane is read strictly sequentially. This is a **SoA access pattern**, perfectly matched by the planar layout. The CPU, which wants to set a single pixel's complete colour, has the opposite need: it wants **AoS**.
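The mismatch is easiest to see by writing a single pixel both ways. The two functions below are a minimal sketch with made-up names (`put_pixel_chunky`, `put_pixel_planar`), assuming 8 separate bitplanes, a width that is a multiple of 8, and the leftmost pixel stored in bit 7 of each byte.

```c
#include <stdint.h>
#include <stddef.h>

/* AoS / chunky: one pixel is one byte, so one store does the job. */
void put_pixel_chunky(uint8_t *chunky, int width, int x, int y, uint8_t colour)
{
    chunky[(size_t)y * (size_t)width + (size_t)x] = colour;
}

/* SoA / planar: the same pixel is spread over 8 planes, so setting it
 * costs 8 read-modify-write accesses to 8 different memory regions. */
void put_pixel_planar(uint8_t *planes[8], int width, int x, int y, uint8_t colour)
{
    size_t offset = (size_t)y * (size_t)(width / 8) + (size_t)(x / 8);
    uint8_t bit = (uint8_t)(0x80u >> (x & 7));   /* leftmost pixel = bit 7 */

    for (int p = 0; p < 8; p++) {
        if (colour & (1u << p))
            planes[p][offset] |= bit;
        else
            planes[p][offset] &= (uint8_t)~bit;
    }
}
```

A renderer that touches every pixel individually therefore prefers to work in a chunky buffer and pay for one bulk conversion per frame, rather than eight scattered accesses per pixel.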
### Modern Hardware Parallels

| Amiga Component | Modern Equivalent | Function |
|---|---|---|
| **Akiko C2P register** | GPU texture swizzle unit | Hardware layout transposition |
| **Blitter + merge algorithm** | CUDA shared memory transpose kernel | CPU/coprocessor-assisted transpose |
| **RTG (planar bypass)** | Unified chunky framebuffer (since VGA) | Eliminates the problem entirely |
| **Copper palette cycling** | GPU palette shader / LUT texture | Colour manipulation without pixel writes |
| **FMODE (fetch width)** | GPU memory bus width (256/384/512-bit) | Wider bus = more data per DMA cycle |

### GPU Texture Swizzle: The Modern Akiko

Modern GPUs store textures in **swizzled** (Morton/Z-order) layouts rather than linear row-major order. This is architecturally analogous to the Amiga's planar bitmaps: the hardware's memory access pattern doesn't match the CPU's logical layout, so a dedicated hardware unit transparently converts between them.

```
Linear (CPU view):      Morton/Z-order (GPU internal):
 0  1  2  3              0  1  4  5
 4  5  6  7      →       2  3  6  7
 8  9 10 11              8  9 12 13
12 13 14 15             10 11 14 15
```

When you call `glTexImage2D()` or `vkCmdCopyBufferToImage()`, the driver or the GPU's copy engine typically performs a layout conversion from linear (CPU-friendly) to swizzled (GPU-cache-friendly). This is the same class of operation as Amiga C2P: a hardware-accelerated data layout transformation that is invisible to the application programmer.
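The Morton index in the diagram above can be computed by interleaving the bits of the x and y coordinates. The sketch below uses the classic mask-and-shift bit spreading; the function names (`spread_bits`, `morton_index`) are illustrative. Real GPU tiling formats are vendor-specific and usually more elaborate than plain Z-order, so treat this as the textbook form only.

```c
#include <stdint.h>

/* Spread the low 16 bits of v so that they occupy the even bit
 * positions of the result (bit k moves to bit 2k). */
static uint32_t spread_bits(uint32_t v)
{
    v &= 0x0000FFFFu;
    v = (v | (v << 8)) & 0x00FF00FFu;
    v = (v | (v << 4)) & 0x0F0F0F0Fu;
    v = (v | (v << 2)) & 0x33333333u;
    v = (v | (v << 1)) & 0x55555555u;
    return v;
}

/* Z-order index of texel (x, y): x bits land in the even positions,
 * y bits in the odd positions. Valid for x, y < 65536. */
uint32_t morton_index(uint32_t x, uint32_t y)
{
    return spread_bits(x) | (spread_bits(y) << 1);
}
```

The masks are the same `$00FF00FF`, `$0F0F0F0F`, `$33333333`, `$55555555` sequence used by the C2P merge passes, which is a good hint that both are the same family of bit permutation.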
---

## Performance Comparison Across Eras

| System | Data Layout Problem | Throughput | Method |
|---|---|---|---|
| A500 (1987, 7 MHz 68000) | C2P 320×256×4bpp | ~2 MB/s | CPU merge, 4 planes |
| A1200 (1992, 14 MHz 68020) | C2P 320×256×8bpp | ~1.5 MB/s | CPU merge, 8 planes |
| CD32 (1993, 14 MHz + Akiko) | C2P 320×256×8bpp | ~1.5 MB/s | Akiko hardware |
| 486 DX2/66 (1992) | No conversion needed | N/A | VGA Mode 13h = chunky |
| Pentium MMX (1997) | Colour space (YUV→RGB) | ~200 MB/s | MMX SIMD |
| GTX 1080 (2016) | Texture swizzle (linear→tiled) | ~300 GB/s | Hardware TMU |
| Apple M2 (2022) | SoA↔AoS for ML tensors | ~100 GB/s | Hardware AMX |

The throughput gap tells the story: what consumed 100% of a 68020's capability is handled by a dedicated hardware unit at 200,000× the bandwidth on modern silicon. But the fundamental problem, **data layout mismatch between producer and consumer**, is identical.

---

## Historical Timeline

| Year | Event |
|---|---|
| 1985 | Amiga launches with planar display. C2P not needed: all software renders directly to bitplanes |
| 1989 | First 3D demos appear (Juggler, etc.). Rendering in chunky buffers starts |
| 1991 | Demoscene coders develop first optimised C2P routines for 68000 |
| 1992 | AGA ships (A1200/A4000). 8 bitplanes = C2P problem gets 2× harder |
| 1993 | CD32 ships with Akiko, the first hardware C2P. Mikael Kalms publishes optimised CPU routines |
| 1994 | Kalms C2P library becomes the de facto standard. Multiple variants for 020/030/040/060 |
| 1995 | RTG cards (Picasso II, CyberVision 64) begin to make C2P irrelevant for productivity |
| 1996 | CyberVision 64 ships with Roxxler P2C chip: the reverse problem, solved in hardware |
| 1998 | 68060 accelerators make CPU C2P faster than any hardware solution |
| 2020+ | MiSTer FPGA core implements RTG via `uaegfx`, so C2P is eliminated for modern setups |

---

## Implementing C2P: Practical Checklist

For developers writing Amiga software that renders in chunky format:

1.
   **Allocate the chunky buffer in Fast RAM** (`MEMF_FAST`): the CPU reads it during conversion, and Fast RAM has no DMA contention
2. **Allocate the planar screen in Chip RAM** (`MEMF_CHIP`, or `AllocBitMap()` with `BMF_DISPLAYABLE`): this is mandatory for display DMA
3. **Use a proven C2P library**: Kalms C2P (`kalms-c2p` on GitHub/Aminet) is the gold standard
4. **Match the routine to your CPU**: different unrolling for 68020 vs 68040 vs 68060
5. **Use triple buffering** if possible: render to buffer A, C2P buffer B into Chip RAM, display buffer C
6. **On CD32, detect and use Akiko**: `WriteChunkyPixels()` does this automatically
7. **On RTG systems, skip C2P entirely**: render chunky directly to the RTG card's VRAM
8. **Profile with CIA timers**: the bottleneck shifts between CPU merge and Chip RAM write speed depending on configuration

### Adaptive Detection

```c
#include <exec/execbase.h>
#include <graphics/gfxbase.h>
#include <cybergraphx/cybergraphics.h>
#include <proto/cybergraphics.h>

extern struct ExecBase *SysBase;
extern struct GfxBase *GfxBase;
extern struct Library *CyberGfxBase;   /* NULL if cybergraphics.library is absent */

enum C2P_Strategy {
    C2P_NONE_RTG, C2P_AKIKO,
    C2P_KALMS_060, C2P_KALMS_040, C2P_KALMS_020, C2P_KALMS_000
};

/* Determine best C2P strategy for current hardware */
enum C2P_Strategy determine_c2p_strategy(struct BitMap *screen_bm)
{
    /* Check for RTG screen first: the bitmap is already chunky */
    if (CyberGfxBase && GetCyberMapAttr(screen_bm, CYBRMATTR_ISCYBERGFX))
        return C2P_NONE_RTG;

    /* Check for Akiko (CD32); the field only exists in V40+ */
    if (GfxBase->LibNode.lib_Version >= 40 &&
        GfxBase->ChunkyToPlanarPtr != NULL)
        return C2P_AKIKO;

    /* Check CPU type for best software routine, newest first,
     * because AttnFlags bits are cumulative */
    UWORD attn = SysBase->AttnFlags;
    if (attn & AFF_68060) return C2P_KALMS_060;
    if (attn & AFF_68040) return C2P_KALMS_040;
    if (attn & AFF_68020) return C2P_KALMS_020;
    return C2P_KALMS_000;   /* 68000 fallback */
}
```

---

## References

- Mikael Kalms: [kalms-c2p](https://github.com/Kalmalyzer/kalms-c2p), the definitive C2P library (GitHub)
- Scout/Azure: "Chunky 2 Planar Tutorial", the seminal demoscene document explaining the transposition theory
- *Amiga Hardware Reference Manual*: bitplane DMA, display pipeline
- NDK39: `clib/graphics_protos.h`, `WriteChunkyPixels()` prototype
- Intel: [Structure of Arrays vs Array of
Structures](https://www.intel.com/content/www/us/en/developer/articles/technical/memory-layout-transformations.html), a modern SoA/AoS guide
- NVIDIA: CUDA Programming Guide, "Shared Memory Matrix Transpose", the GPU equivalent of C2P

## See Also

- [Akiko: CD32 C2P Hardware](../01_hardware/aga_a1200_a4000/akiko_cd32.md): Akiko register protocol
- [BitMap: Planar Layout](bitmap.md): how Amiga bitmaps are structured in memory
- [Blitter Programming](blitter_programming.md): Blitter DMA used in Blitter-assisted C2P
- [RTG: Retargetable Graphics](../16_driver_development/rtg_driver.md): chunky framebuffer cards that eliminate C2P
- [Memory Types](../01_hardware/common/memory_types.md): Chip vs Fast RAM (critical for C2P buffer placement)