amiga-bootcamp/08_graphics/pixel_conversion.md
2026-04-25 14:58:33 -04:00

23 KiB
Raw Blame History

← Home · Graphics

Pixel Format Conversion — Chunky ↔ Planar and Beyond

The Core Problem

Every Amiga programmer eventually hits the same wall: the custom chipset displays graphics in planar format, but nearly every interesting algorithm — 3D rendering, texture mapping, image decompression, PC game ports — produces output in chunky format. Converting between these two layouts is the single most CPU-intensive bottleneck in Amiga graphics programming.

This article covers:

  1. What planar and chunky formats are, mathematically
  2. Why the conversion is expensive
  3. How every known solution works — from naive loops to the Kalms butterfly
  4. Where this problem appears in broader computing (SoA/AoS, GPU swizzle, SIMD)

Note

The Akiko hardware article covers the CD32's dedicated C2P register interface. This article covers the algorithm theory that applies to every Amiga model, and the broader data-layout concepts that connect the Amiga to modern computing.

See: Akiko — CD32 C2P Hardware


Planar vs Chunky — The Two Layouts

Chunky (Packed Pixel)

Every pixel's complete colour index is stored contiguously. For 8-bit (256 colour) pixels:

Address:  $0000  $0001  $0002  $0003  $0004  $0005  $0006  $0007
Data:       $0D    $05    $1B    $0A    $FF    $03    $42    $7E
          pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7

Each byte = one pixel. Linear, simple, cache-friendly for rendering. This is how every modern GPU, every PC VGA card, every framebuffer since 1990 stores pixels.

Planar (Bitplane)

Each pixel's colour index is split across N separate memory regions (bitplanes). For 8-bit pixels (8 bitplanes), each bitplane stores one bit of every pixel:

Bitplane 0: 1 0 1 1 0 0 1 0  ← bit 0 of pixels 07
Bitplane 1: 0 1 0 1 1 0 0 1  ← bit 1 of pixels 07
Bitplane 2: 1 1 0 0 0 1 1 0  ← bit 2
Bitplane 3: 0 1 1 0 1 1 0 0  ← bit 3
Bitplane 4: 1 0 1 0 1 0 0 1  ← bit 4
Bitplane 5: 1 0 0 0 0 0 1 0  ← bit 5
Bitplane 6: 0 0 1 0 0 0 0 1  ← bit 6
Bitplane 7: 0 0 0 0 1 0 1 0  ← bit 7

To read pixel 0's colour: collect bit 0 from each of the 8 planes → 10101100 = $AC. The 8 planes are not interleaved in standard Amiga layout — each is a separate contiguous memory block.

Why the Amiga Uses Planar

The planar format was a brilliant engineering choice in 1985:

Advantage Explanation
Bandwidth efficiency A 4-colour screen uses 2 bitplanes = ½ the memory bandwidth of 4bpp chunky. DMA fetches only the planes actually used.
Scalable colour depth Adding a bitplane doubles the colour count without redesigning the display engine. OCS: 16 planes. AGA: 18 planes.
Cheap colour cycling Rotating palette indices only requires changing colour registers — zero memory writes.
Blitter efficiency Blitting a masked sprite at 4 colours touches only 2 planes (2 blits), not 4× the data.
Copper integration The Copper can change palette registers mid-scanline, effectively multiplying colours without more bitplanes.

The downside only became critical as rendering algorithms evolved past 2D sprites into 3D, texture mapping, and pixel-level effects that naturally produce chunky output.


The Conversion — Mathematically

C2P is a bit matrix transposition. Given 32 chunky pixels (each 8 bits wide), you have a 32×8 bit matrix (32 rows × 8 columns). C2P transposes this to an 8×32 matrix (8 bitplanes × 32 bits each):

  Input (chunky):                    Output (planar):
    32 pixels × 8 bits                 8 bitplanes × 32 bits
  ┌──────────────────────────────┐  ┌────────────────────────────────────────┐
  │ P0:  b7 b6 b5 b4 b3 b2 b1 b0 │  │ Plane 0: p0.b0 p1.b0 p2.b0 ... p31.b0  │
  │ P1:  b7 b6 b5 b4 b3 b2 b1 b0 │  │ Plane 1: p0.b1 p1.b1 p2.b1 ... p31.b1  │
  │ ...                          │  │ ...                                    │
  │ P31: b7 b6 b5 b4 b3 b2 b1 b0 │  │ Plane 7: p0.b7 p1.b7 p2.b7 ... p31.b7  │
  └──────────────────────────────┘  └────────────────────────────────────────┘

This is equivalent to a 90° bit rotation. On a modern CPU with SIMD, this is trivial. On a 68020 with 8 data registers and no bit-parallel instructions, it is an algorithmic challenge that consumed thousands of programmer-hours across the demoscene.


Solution 1 — The Naive Loop

The simplest approach: iterate over every pixel, extract each bit, and set it in the corresponding bitplane.

/* Naive C2P — educational only, never use in production */
void c2p_naive(UBYTE *chunky, UBYTE *planes[8], int width, int height)
{
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            UBYTE pixel = chunky[y * width + x];
            int byte_offset = y * (width / 8) + (x / 8);
            int bit_position = 7 - (x & 7);

            for (int plane = 0; plane < 8; plane++)
            {
                if (pixel & (1 << plane))
                    planes[plane][byte_offset] |= (1 << bit_position);
                else
                    planes[plane][byte_offset] &= ~(1 << bit_position);
            }
        }
    }
}

Performance: ~200+ cycles per pixel on 68020. For 320×256 = 81,920 pixels → ~16 million cycles → ~1.1 seconds at 14 MHz. This gives roughly 0.9 FPS. Completely unusable.

Why it's terrible:

  • One bit at a time — no parallelism
  • Read-modify-write on every bitplane byte (bus-killing)
  • No register reuse — constant memory traffic
  • Branch on every bit (pipeline flush on 68020)

Solution 2 — The Merge (Butterfly) Algorithm

This is the standard approach used by virtually all serious Amiga C2P routines. Invented independently by several demoscene coders and formalised by Mikael Kalms (Kalmalyzer) and others.

The Key Insight

Instead of processing one pixel at a time, load 32 pixels (8 longwords = 256 bits) into CPU registers and perform a series of bit-level swap operations (called "merges") that progressively rearrange the bits into planar order. Each merge pass swaps bits at a different granularity: 16-bit blocks, then 8-bit, then 4-bit, 2-bit, and 1-bit.

This is exactly a butterfly network — the same structure used in the FFT (Fast Fourier Transform) and Batcher's bitonic sort.

The Merge Primitive

The fundamental building block is a 2-register swap that exchanges bits at a given stride:

; merge(d0, d1, mask, shift)
; Exchanges bits between d0 and d1 where mask selects which bits to swap
; and shift determines the stride

    move.l  d0, d2          ; temp = a
    lsr.l   #shift, d2      ; temp >>= stride
    eor.l   d1, d2          ; temp ^= b
    and.l   #mask, d2       ; temp &= mask (select bits to swap)
    eor.l   d2, d1          ; b ^= temp (swap into b)
    lsl.l   #shift, d2      ; temp <<= stride (restore position)
    eor.l   d2, d0          ; a ^= temp (swap into a)

7 instructions per merge. Each merge moves half the bits in two registers to their correct positions.

Pass Structure for 8 Bitplanes

A full 8-bitplane C2P conversion on 32 pixels requires 5 passes of merge operations:

Pass Block Size Mask Swap Distance Effect
1 16-bit $0000FFFF 16 Swap upper/lower halves of longword pairs
2 8-bit $00FF00FF 8 Swap bytes within pairs
3 4-bit $0F0F0F0F 4 Swap nibbles
4 2-bit $33333333 2 Swap bit-pairs
5 1-bit $55555555 1 Swap individual bits

After all 5 passes, the 8 data registers contain one longword per bitplane.

Full 8-Bitplane C2P Inner Loop

; Kalms-style C2P inner loop — converts 32 chunky pixels (8 longwords)
; to 8 planar longwords (one per bitplane)
;
; Input:  d0-d7 = 8 longwords of chunky data (4 pixels each)
; Output: d0-d7 = 8 longwords of planar data (one per bitplane)

; ---- Pass 1: 16-bit swap ----
    swap    d0              ; exchange upper/lower words of d0
    swap    d1
    swap    d2
    swap    d3
    ; (merge d0,d4), (merge d1,d5), (merge d2,d6), (merge d3,d7)
    ; using mask $0000FFFF, shift 16

    move.l  d0, a3          ; temp save
    move.l  d4, d0
    move.w  a3, d0          ; d0 = d4.hi : d0.lo
    move.w  d4, a3          ; a3 = d0.hi : d4.lo
    move.l  a3, d4

    move.l  d1, a3
    move.l  d5, d1
    move.w  a3, d1
    move.w  d5, a3
    move.l  a3, d5

    move.l  d2, a3
    move.l  d6, d2
    move.w  a3, d2
    move.w  d6, a3
    move.l  a3, d6

    move.l  d3, a3
    move.l  d7, d3
    move.w  a3, d3
    move.w  d7, a3
    move.l  a3, d7

; ---- Pass 2: 8-bit swap ----
; mask = $00FF00FF, shift = 8
    move.l  #$00FF00FF, a3
    ; merge(d0, d2)
    move.l  d0, a4
    lsr.l   #8, a4
    eor.l   d2, a4
    and.l   a3, a4
    eor.l   a4, d2
    lsl.l   #8, a4
    eor.l   a4, d0
    ; merge(d1, d3) ... merge(d4, d6) ... merge(d5, d7) ...
    ; (same pattern repeated for each pair)

; ---- Pass 3: 4-bit swap ----
; mask = $0F0F0F0F, shift = 4
    ; merge(d0, d1), merge(d2, d3), merge(d4, d5), merge(d6, d7)

; ---- Pass 4: 2-bit swap ----
; mask = $33333333, shift = 2

; ---- Pass 5: 1-bit swap ----
; mask = $55555555, shift = 1

; Result: d0 = bitplane 0 (32 bits), d1 = bitplane 1, ... d7 = bitplane 7

Note

The above is a pedagogical skeleton. Production C2P routines are heavily unrolled and use every register trick available — address registers as temporary storage, interleaving loads with merges to hide memory latency, and sometimes splitting the conversion across two phases to overlap with Chip RAM writes.

Performance

Metric Naive Merge/Butterfly Improvement
Instructions per 32 pixels ~6,400+ ~160200 3240×
Cycles per pixel (68020 @ 14 MHz) ~200 ~57 ~30×
320×256 full frame ~1.1 s ~35 ms ~30× (28 FPS)
320×256 per frame budget 0.9 FPS 28 FPS Playable

Solution 3 — Akiko Hardware C2P (CD32 Only)

The CD32's Akiko chip implements C2P in dedicated silicon. The CPU feeds 8 longwords of chunky data to register $B80030 and reads back 8 longwords of planar data from the same address.

Metric Software C2P (68020) Akiko
Method CPU merge/butterfly Hardware register pipeline
Throughput ~1.5 MB/s ~1.5 MB/s
CPU cost 100% ~50% (register I/O)
Availability All Amigas CD32 only

Akiko's throughput is approximately the same as optimised software C2P on the 68020 because both are limited by the Chip RAM bus bandwidth (~3.5 MB/s shared). On faster CPUs (68040/060), software C2P outperforms Akiko because the CPU can process data faster than the register interface can shuttle it.

Full Akiko protocol: Akiko — CD32 C2P Hardware


Solution 4 — Blitter-Assisted C2P

The Blitter can be used as part of a C2P pipeline, but it cannot perform the transposition itself. Typical usage:

  1. CPU performs the merge/butterfly in registers → outputs planar longwords to a temporary buffer in Chip RAM
  2. Blitter copies the planar data from the temporary buffer to the screen's bitplanes with correct modulo

This approach overlaps CPU computation with Blitter DMA — while the Blitter writes frame N's planes to the screen, the CPU computes frame N+1's transposition.

Time ──────────────────────────────────────────────────────→
CPU:   [merge frame 0] [merge frame 1] [merge frame 2] ...
Blitter:               [write frame 0] [write frame 1] ...
                       ↑ overlap: CPU and Blitter run in parallel

Warning

On 68040/060 systems, the Blitter is often slower than letting the CPU do both the merge and the writes via MOVE16 (68040) or unrolled MOVEM.L. The Blitter's 16-bit bus (even in AGA FMODE×4) adds DMA contention that may actually slow down the CPU's merge passes.


Solution 5 — WriteChunkyPixels (AmigaOS)

AmigaOS 3.0+ provides WriteChunkyPixels() in graphics.library, which performs C2P conversion internally using the best available method:

#include <graphics/gfx.h>

WriteChunkyPixels(rp,
    xstart, ystart, xstop, ystop,
    chunky_buffer, chunky_bytes_per_row);

On CD32, this function auto-detects Akiko and uses it. On other AGA machines, it uses an internal software C2P. However, the OS implementation is not as fast as the best demoscene routines — it prioritises correctness and generality over raw speed.


Solution 6 — RTG: Eliminating C2P Entirely

The ultimate solution to C2P is to not do it at all. Retargetable Graphics (RTG) cards like the Picasso IV, CyberVision 64, and MiSTer's virtual uaegfx provide a chunky framebuffer directly. The rendering engine writes chunky pixels to VRAM, and the card's RAMDAC/scaler converts them to video output.

The irony: RTG cards must perform the reverse conversion (P2C — planar-to-chunky) when legacy planar software runs on an RTG screen. The CyberVision 64 included a dedicated Roxxler chip for this. Without hardware help, P2C on software is equally expensive.

See: RTG — Retargetable Graphics


Choosing the Right Approach

Platform Recommended C2P Why
A500/A2000 (68000) Merge algorithm (simplified, fewer planes) No fast multiply; 68000 can manage 45 plane C2P at ~15 FPS
A1200 (68020) Kalms merge, 5-pass Sweet spot: enough registers, usable I-cache
CD32 (68020 + Akiko) Akiko hardware Frees ~50% CPU for game logic
A4000 (68030/040) CPU merge (skip Akiko if not CD32) 68040 MOVE16 makes CPU writes fast enough
68060 accelerated CPU merge, no Blitter 68060 superscalar outperforms everything else
MiSTer FPGA RTG (uaegfx) Chunky framebuffer in DDR — no C2P needed

The Bigger Picture — Data Layout Transformation

C2P is not unique to the Amiga. It is an instance of a fundamental problem in computer architecture: transforming data layout between Structure-of-Arrays (SoA) and Array-of-Structures (AoS).

SoA vs AoS — The Universal Duality

AoS (Array of Structures) = Chunky:
  struct Pixel { r, g, b, a; };
  Pixel pixels[1024];
  // Memory: r0 g0 b0 a0 r1 g1 b1 a1 r2 g2 b2 a2 ...
  // Each element's fields are contiguous

SoA (Structure of Arrays) = Planar:
  struct Pixels {
    float r[1024];
    float g[1024];
    float b[1024];
    float a[1024];
  };
  // Memory: r0 r1 r2 ... r1023 g0 g1 g2 ... g1023 ...
  // Each field is contiguous across all elements

The Amiga's planar format is SoA: each bitplane is an array of one field (one bit) across all pixels. The chunky format is AoS: each pixel's fields (all 8 bits) are packed together.

Where This Problem Appears Today

Domain SoA (Planar-Like) AoS (Chunky-Like) Conversion
Amiga graphics Bitplanes (Agnus DMA) Chunky pixel buffer (CPU render) C2P algorithm
GPU compute shaders SoA buffer layouts (SSBO) Vertex attributes (interleaved VBO) Shader transpose
SIMD / AVX-512 Separate float arrays (vectorisable) Struct arrays (gather/scatter) _mm512_transpose intrinsics
Database engines Columnar storage (Parquet, Arrow) Row-oriented storage (MySQL) Column↔row materialisation
Image compression Colour planes (JPEG YCbCr) RGB pixels (BMP) MCU block decomposition
GPU texture memory Block-compressed (BC/ASTC) Linear RGBA Hardware texture unit decode
Neural network inference NCHW tensor layout (channels first) NHWC (channels last) Layout transposition kernel

Why Each System Prefers a Different Layout

Layout Optimal For Reason
SoA / Planar Streaming one field across many elements Maximises cache line utilisation, enables SIMD vectorisation
AoS / Chunky Random-access to complete elements All fields of one element in one cache line

The Amiga's custom DMA engine streams bitplane data to the display sequentially — plane 0 for the whole line, then plane 1, etc. This is a SoA access pattern, perfectly matched by the planar layout. The CPU, which wants to set a single pixel's complete colour, has the opposite need — it wants AoS.

Modern Hardware Parallels

Amiga Component Modern Equivalent Function
Akiko C2P register GPU texture swizzle unit Hardware layout transposition
Blitter + merge algorithm CUDA shared memory transpose kernel CPU/coprocessor-assisted transpose
RTG (planar bypass) Unified chunky framebuffer (since VGA) Eliminates the problem entirely
Copper palette cycling GPU palette shader / LUT texture Colour manipulation without pixel writes
FMODE (fetch width) GPU memory bus width (256/384/512-bit) Wider bus = more data per DMA cycle

GPU Texture Swizzle — The Modern Akiko

Modern GPUs store textures in swizzled (Morton/Z-order) layouts rather than linear row-major order. This is architecturally identical to what the Amiga does with planar bitmaps: the hardware's memory access pattern doesn't match the CPU's logical layout, so a dedicated hardware unit transparently converts between them.

Linear (CPU view):             Morton/Z-order (GPU internal):
  0  1  2  3                     0  1  4  5
  4  5  6  7          →          2  3  6  7
  8  9 10 11                     8  9 12 13
 12 13 14 15                    10 11 14 15

When you call glTexImage2D() or vkCmdCopyBufferToImage(), the GPU driver performs a layout conversion from linear (CPU-friendly) to swizzled (GPU-cache-friendly). This is the exact same class of operation as Amiga C2P — a hardware-accelerated data layout transformation that is invisible to the application programmer.


Performance Comparison Across Eras

System Data Layout Problem Throughput Method
A500 (1987, 7 MHz 68000) C2P 320×256×4bpp ~2 MB/s CPU merge, 4 planes
A1200 (1992, 14 MHz 68020) C2P 320×256×8bpp ~1.5 MB/s CPU merge, 8 planes
CD32 (1993, 14 MHz + Akiko) C2P 320×256×8bpp ~1.5 MB/s Akiko hardware
486 DX2/66 (1992) No conversion needed N/A VGA Mode 13h = chunky
Pentium MMX (1997) Colour space (YUV→RGB) ~200 MB/s MMX SIMD
GTX 1080 (2016) Texture swizzle (linear→tiled) ~300 GB/s Hardware TMU
Apple M2 (2022) SoA↔AoS for ML tensors ~100 GB/s Hardware AMX

The throughput gap tells the story: what consumed 100% of a 68020's capability is handled by a dedicated hardware unit at 200,000× the bandwidth on modern silicon. But the fundamental problem — data layout mismatch between producer and consumer — is identical.


Historical Timeline

Year Event
1985 Amiga launches with planar display. C2P not needed — all software renders directly to bitplanes
1989 First 3D demos appear (Juggler, etc.). Rendering in chunky buffers starts
1991 Demoscene coders develop first optimised C2P routines for 68000
1992 AGA ships (A1200/A4000). 8 bitplanes = C2P problem gets 2× harder
1993 CD32 ships with Akiko — first hardware C2P. Mikael Kalms publishes optimised CPU routines
1994 Kalms C2P library becomes the de facto standard. Multiple variants for 020/030/040/060
1995 RTG cards (Picasso II, CyberVision 64) begin to make C2P irrelevant for productivity
1996 CyberVision 64 ships with Roxxler P2C chip — the reverse problem, solved in hardware
1998 68060 accelerators make CPU C2P faster than any hardware solution
2020+ MiSTer FPGA core implements RTG via uaegfx — C2P eliminated for modern setups

Implementing C2P — Practical Checklist

For developers writing Amiga software that renders in chunky format:

  1. Allocate the chunky buffer in Fast RAM (MEMF_FAST) — the CPU reads it during conversion, and Fast RAM has no DMA contention
  2. Allocate the planar screen in Chip RAM (MEMF_CHIP | MEMF_DISPLAYABLE) — this is mandatory for display DMA
  3. Use a proven C2P library — Kalms C2P (kalms-c2p on GitHub/Aminet) is the gold standard
  4. Match the routine to your CPU — different unrolling for 68020 vs 68040 vs 68060
  5. Use triple buffering if possible — render to buffer A, C2P buffer B into Chip RAM, display buffer C
  6. On CD32, detect and use AkikoWriteChunkyPixels() does this automatically
  7. On RTG systems, skip C2P entirely — render chunky directly to the RTG card's VRAM
  8. Profile with CIA timers — the bottleneck shifts between CPU merge and Chip RAM write speed depending on configuration

Adaptive Detection

#include <graphics/gfxbase.h>
#include <cybergraphx/cybergraphics.h>

extern struct GfxBase *GfxBase;

/* Determine best C2P strategy for current hardware */
enum C2P_Strategy determine_c2p_strategy(struct BitMap *screen_bm)
{
    /* Check for RTG screen first — no C2P needed */
    if (GetCyberMapAttr(screen_bm, CYBRMATTR_ISRTG))
        return C2P_NONE_RTG;

    /* Check for Akiko (CD32) */
    if (GfxBase->ChunkyToPlanarPtr != NULL)
        return C2P_AKIKO;

    /* Check CPU type for best software routine */
    UWORD attn = SysBase->AttnFlags;
    if (attn & AFF_68060) return C2P_KALMS_060;
    if (attn & AFF_68040) return C2P_KALMS_040;
    if (attn & AFF_68020) return C2P_KALMS_020;

    return C2P_KALMS_000;  /* 68000 fallback */
}

References

  • Mikael Kalms — kalms-c2p — the definitive C2P library (GitHub)
  • Scout/Azure — "Chunky 2 Planar Tutorial" — the seminal demoscene document explaining the transposition theory
  • Amiga Hardware Reference Manual — bitplane DMA, display pipeline
  • NDK39: graphics/gfx.hWriteChunkyPixels() prototype
  • Intel — Structure of Arrays vs Array of Structures — modern SoA/AoS guide
  • NVIDIA — CUDA Programming Guide, "Shared Memory Matrix Transpose" — GPU equivalent of C2P

See Also