rslr/amiga-bootcamp

Fork 0

mirror of https://github.com/alfishe/amiga-bootcamp.git synced 2026-06-13 00:26:28 +00:00

Ilia Sharin 2eb8321d00 Minor fixes in layout and indexing

2026-04-25 14:58:33 -04:00

23 KiB

Raw Blame History

← Home · Graphics

Pixel Format Conversion — Chunky ↔ Planar and Beyond

The Core Problem

Every Amiga programmer eventually hits the same wall: the custom chipset displays graphics in planar format, but nearly every interesting algorithm — 3D rendering, texture mapping, image decompression, PC game ports — produces output in chunky format. Converting between these two layouts is the single most CPU-intensive bottleneck in Amiga graphics programming.

This article covers:

What planar and chunky formats are, mathematically
Why the conversion is expensive
How every known solution works — from naive loops to the Kalms butterfly
Where this problem appears in broader computing (SoA/AoS, GPU swizzle, SIMD)

Note

The Akiko hardware article covers the CD32's dedicated C2P register interface. This article covers the algorithm theory that applies to every Amiga model, and the broader data-layout concepts that connect the Amiga to modern computing.

See: Akiko — CD32 C2P Hardware

Planar vs Chunky — The Two Layouts

Chunky (Packed Pixel)

Every pixel's complete colour index is stored contiguously. For 8-bit (256 colour) pixels:

Address:  $0000  $0001  $0002  $0003  $0004  $0005  $0006  $0007
Data:       $0D    $05    $1B    $0A    $FF    $03    $42    $7E
          pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7

Each byte = one pixel. Linear, simple, cache-friendly for rendering. This is how every modern GPU, every PC VGA card, every framebuffer since 1990 stores pixels.

Planar (Bitplane)

Each pixel's colour index is split across N separate memory regions (bitplanes). For 8-bit pixels (8 bitplanes), each bitplane stores one bit of every pixel:

Bitplane 0: 1 0 1 1 0 0 1 0  ← bit 0 of pixels 0–7
Bitplane 1: 0 1 0 1 1 0 0 1  ← bit 1 of pixels 0–7
Bitplane 2: 1 1 0 0 0 1 1 0  ← bit 2
Bitplane 3: 0 1 1 0 1 1 0 0  ← bit 3
Bitplane 4: 1 0 1 0 1 0 0 1  ← bit 4
Bitplane 5: 1 0 0 0 0 0 1 0  ← bit 5
Bitplane 6: 0 0 1 0 0 0 0 1  ← bit 6
Bitplane 7: 0 0 0 0 1 0 1 0  ← bit 7

To read pixel 0's colour: collect bit 0 from each of the 8 planes → 10101100 = $AC. The 8 planes are not interleaved in standard Amiga layout — each is a separate contiguous memory block.

Why the Amiga Uses Planar

The planar format was a brilliant engineering choice in 1985:

Advantage	Explanation
Bandwidth efficiency	A 4-colour screen uses 2 bitplanes = ½ the memory bandwidth of 4bpp chunky. DMA fetches only the planes actually used.
Scalable colour depth	Adding a bitplane doubles the colour count without redesigning the display engine. OCS: 1–6 planes. AGA: 1–8 planes.
Cheap colour cycling	Rotating palette indices only requires changing colour registers — zero memory writes.
Blitter efficiency	Blitting a masked sprite at 4 colours touches only 2 planes (2 blits), not 4× the data.
Copper integration	The Copper can change palette registers mid-scanline, effectively multiplying colours without more bitplanes.

The downside only became critical as rendering algorithms evolved past 2D sprites into 3D, texture mapping, and pixel-level effects that naturally produce chunky output.

The Conversion — Mathematically

C2P is a bit matrix transposition. Given 32 chunky pixels (each 8 bits wide), you have a 32×8 bit matrix (32 rows × 8 columns). C2P transposes this to an 8×32 matrix (8 bitplanes × 32 bits each):

  Input (chunky):                    Output (planar):
    32 pixels × 8 bits                 8 bitplanes × 32 bits
  ┌──────────────────────────────┐  ┌────────────────────────────────────────┐
  │ P0:  b7 b6 b5 b4 b3 b2 b1 b0 │  │ Plane 0: p0.b0 p1.b0 p2.b0 ... p31.b0  │
  │ P1:  b7 b6 b5 b4 b3 b2 b1 b0 │  │ Plane 1: p0.b1 p1.b1 p2.b1 ... p31.b1  │
  │ ...                          │  │ ...                                    │
  │ P31: b7 b6 b5 b4 b3 b2 b1 b0 │  │ Plane 7: p0.b7 p1.b7 p2.b7 ... p31.b7  │
  └──────────────────────────────┘  └────────────────────────────────────────┘

This is equivalent to a 90° bit rotation. On a modern CPU with SIMD, this is trivial. On a 68020 with 8 data registers and no bit-parallel instructions, it is an algorithmic challenge that consumed thousands of programmer-hours across the demoscene.

Solution 1 — The Naive Loop

The simplest approach: iterate over every pixel, extract each bit, and set it in the corresponding bitplane.

/* Naive C2P — educational only, never use in production */
void c2p_naive(UBYTE *chunky, UBYTE *planes[8], int width, int height)
{
    for (int y = 0; y < height; y++)
    {
        for (int x = 0; x < width; x++)
        {
            UBYTE pixel = chunky[y * width + x];
            int byte_offset = y * (width / 8) + (x / 8);
            int bit_position = 7 - (x & 7);

            for (int plane = 0; plane < 8; plane++)
            {
                if (pixel & (1 << plane))
                    planes[plane][byte_offset] |= (1 << bit_position);
                else
                    planes[plane][byte_offset] &= ~(1 << bit_position);
            }
        }
    }
}

Performance: ~200+ cycles per pixel on 68020. For 320×256 = 81,920 pixels → ~16 million cycles → ~1.1 seconds at 14 MHz. This gives roughly 0.9 FPS. Completely unusable.

Why it's terrible:

One bit at a time — no parallelism
Read-modify-write on every bitplane byte (bus-killing)
No register reuse — constant memory traffic
Branch on every bit (pipeline flush on 68020)

Solution 2 — The Merge (Butterfly) Algorithm

This is the standard approach used by virtually all serious Amiga C2P routines. Invented independently by several demoscene coders and formalised by Mikael Kalms (Kalmalyzer) and others.

The Key Insight

Instead of processing one pixel at a time, load 32 pixels (8 longwords = 256 bits) into CPU registers and perform a series of bit-level swap operations (called "merges") that progressively rearrange the bits into planar order. Each merge pass swaps bits at a different granularity: 16-bit blocks, then 8-bit, then 4-bit, 2-bit, and 1-bit.

This is exactly a butterfly network — the same structure used in the FFT (Fast Fourier Transform) and Batcher's bitonic sort.

The Merge Primitive

The fundamental building block is a 2-register swap that exchanges bits at a given stride:

; merge(d0, d1, mask, shift)
; Exchanges bits between d0 and d1 where mask selects which bits to swap
; and shift determines the stride

    move.l  d0, d2          ; temp = a
    lsr.l   #shift, d2      ; temp >>= stride
    eor.l   d1, d2          ; temp ^= b
    and.l   #mask, d2       ; temp &= mask (select bits to swap)
    eor.l   d2, d1          ; b ^= temp (swap into b)
    lsl.l   #shift, d2      ; temp <<= stride (restore position)
    eor.l   d2, d0          ; a ^= temp (swap into a)

7 instructions per merge. Each merge moves half the bits in two registers to their correct positions.

Pass Structure for 8 Bitplanes

A full 8-bitplane C2P conversion on 32 pixels requires 5 passes of merge operations:

Pass	Block Size	Mask	Swap Distance	Effect
1	16-bit	`$0000FFFF`	16	Swap upper/lower halves of longword pairs
2	8-bit	`$00FF00FF`	8	Swap bytes within pairs
3	4-bit	`$0F0F0F0F`	4	Swap nibbles
4	2-bit	`$33333333`	2	Swap bit-pairs
5	1-bit	`$55555555`	1	Swap individual bits

After all 5 passes, the 8 data registers contain one longword per bitplane.

Full 8-Bitplane C2P Inner Loop

; Kalms-style C2P inner loop — converts 32 chunky pixels (8 longwords)
; to 8 planar longwords (one per bitplane)
;
; Input:  d0-d7 = 8 longwords of chunky data (4 pixels each)
; Output: d0-d7 = 8 longwords of planar data (one per bitplane)

; ---- Pass 1: 16-bit swap ----
    swap    d0              ; exchange upper/lower words of d0
    swap    d1
    swap    d2
    swap    d3
    ; (merge d0,d4), (merge d1,d5), (merge d2,d6), (merge d3,d7)
    ; using mask $0000FFFF, shift 16

    move.l  d0, a3          ; temp save
    move.l  d4, d0
    move.w  a3, d0          ; d0 = d4.hi : d0.lo
    move.w  d4, a3          ; a3 = d0.hi : d4.lo
    move.l  a3, d4

    move.l  d1, a3
    move.l  d5, d1
    move.w  a3, d1
    move.w  d5, a3
    move.l  a3, d5

    move.l  d2, a3
    move.l  d6, d2
    move.w  a3, d2
    move.w  d6, a3
    move.l  a3, d6

    move.l  d3, a3
    move.l  d7, d3
    move.w  a3, d3
    move.w  d7, a3
    move.l  a3, d7

; ---- Pass 2: 8-bit swap ----
; mask = $00FF00FF, shift = 8
    move.l  #$00FF00FF, a3
    ; merge(d0, d2)
    move.l  d0, a4
    lsr.l   #8, a4
    eor.l   d2, a4
    and.l   a3, a4
    eor.l   a4, d2
    lsl.l   #8, a4
    eor.l   a4, d0
    ; merge(d1, d3) ... merge(d4, d6) ... merge(d5, d7) ...
    ; (same pattern repeated for each pair)

; ---- Pass 3: 4-bit swap ----
; mask = $0F0F0F0F, shift = 4
    ; merge(d0, d1), merge(d2, d3), merge(d4, d5), merge(d6, d7)

; ---- Pass 4: 2-bit swap ----
; mask = $33333333, shift = 2

; ---- Pass 5: 1-bit swap ----
; mask = $55555555, shift = 1

; Result: d0 = bitplane 0 (32 bits), d1 = bitplane 1, ... d7 = bitplane 7

Note

The above is a pedagogical skeleton. Production C2P routines are heavily unrolled and use every register trick available — address registers as temporary storage, interleaving loads with merges to hide memory latency, and sometimes splitting the conversion across two phases to overlap with Chip RAM writes.

Performance

Metric	Naive	Merge/Butterfly	Improvement
Instructions per 32 pixels	~6,400+	~160–200	32–40×
Cycles per pixel (68020 @ 14 MHz)	~200	~5–7	~30×
320×256 full frame	~1.1 s	~35 ms	~30× (28 FPS)
320×256 per frame budget	0.9 FPS	28 FPS	Playable

Solution 3 — Akiko Hardware C2P (CD32 Only)

The CD32's Akiko chip implements C2P in dedicated silicon. The CPU feeds 8 longwords of chunky data to register $B80030 and reads back 8 longwords of planar data from the same address.

Metric	Software C2P (68020)	Akiko
Method	CPU merge/butterfly	Hardware register pipeline
Throughput	~1.5 MB/s	~1.5 MB/s
CPU cost	100%	~50% (register I/O)
Availability	All Amigas	CD32 only

Akiko's throughput is approximately the same as optimised software C2P on the 68020 because both are limited by the Chip RAM bus bandwidth (~3.5 MB/s shared). On faster CPUs (68040/060), software C2P outperforms Akiko because the CPU can process data faster than the register interface can shuttle it.

Full Akiko protocol: Akiko — CD32 C2P Hardware

Solution 4 — Blitter-Assisted C2P

The Blitter can be used as part of a C2P pipeline, but it cannot perform the transposition itself. Typical usage:

CPU performs the merge/butterfly in registers → outputs planar longwords to a temporary buffer in Chip RAM
Blitter copies the planar data from the temporary buffer to the screen's bitplanes with correct modulo

This approach overlaps CPU computation with Blitter DMA — while the Blitter writes frame N's planes to the screen, the CPU computes frame N+1's transposition.

Time ──────────────────────────────────────────────────────→
CPU:   [merge frame 0] [merge frame 1] [merge frame 2] ...
Blitter:               [write frame 0] [write frame 1] ...
                       ↑ overlap: CPU and Blitter run in parallel

Warning

On 68040/060 systems, the Blitter is often slower than letting the CPU do both the merge and the writes via MOVE16 (68040) or unrolled MOVEM.L. The Blitter's 16-bit bus (even in AGA FMODE×4) adds DMA contention that may actually slow down the CPU's merge passes.

Solution 5 — WriteChunkyPixels (AmigaOS)

AmigaOS 3.0+ provides WriteChunkyPixels() in graphics.library, which performs C2P conversion internally using the best available method:

#include <graphics/gfx.h>

WriteChunkyPixels(rp,
    xstart, ystart, xstop, ystop,
    chunky_buffer, chunky_bytes_per_row);

On CD32, this function auto-detects Akiko and uses it. On other AGA machines, it uses an internal software C2P. However, the OS implementation is not as fast as the best demoscene routines — it prioritises correctness and generality over raw speed.

Solution 6 — RTG: Eliminating C2P Entirely

The ultimate solution to C2P is to not do it at all. Retargetable Graphics (RTG) cards like the Picasso IV, CyberVision 64, and MiSTer's virtual uaegfx provide a chunky framebuffer directly. The rendering engine writes chunky pixels to VRAM, and the card's RAMDAC/scaler converts them to video output.

The irony: RTG cards must perform the reverse conversion (P2C — planar-to-chunky) when legacy planar software runs on an RTG screen. The CyberVision 64 included a dedicated Roxxler chip for this. Without hardware help, P2C on software is equally expensive.

See: RTG — Retargetable Graphics

Choosing the Right Approach

Platform	Recommended C2P	Why
A500/A2000 (68000)	Merge algorithm (simplified, fewer planes)	No fast multiply; 68000 can manage 4–5 plane C2P at ~15 FPS
A1200 (68020)	Kalms merge, 5-pass	Sweet spot: enough registers, usable I-cache
CD32 (68020 + Akiko)	Akiko hardware	Frees ~50% CPU for game logic
A4000 (68030/040)	CPU merge (skip Akiko if not CD32)	68040 `MOVE16` makes CPU writes fast enough
68060 accelerated	CPU merge, no Blitter	68060 superscalar outperforms everything else
MiSTer FPGA	RTG (`uaegfx`)	Chunky framebuffer in DDR — no C2P needed

The Bigger Picture — Data Layout Transformation

C2P is not unique to the Amiga. It is an instance of a fundamental problem in computer architecture: transforming data layout between Structure-of-Arrays (SoA) and Array-of-Structures (AoS).

SoA vs AoS — The Universal Duality

AoS (Array of Structures) = Chunky:
  struct Pixel { r, g, b, a; };
  Pixel pixels[1024];
  // Memory: r0 g0 b0 a0 r1 g1 b1 a1 r2 g2 b2 a2 ...
  // Each element's fields are contiguous

SoA (Structure of Arrays) = Planar:
  struct Pixels {
    float r[1024];
    float g[1024];
    float b[1024];
    float a[1024];
  };
  // Memory: r0 r1 r2 ... r1023 g0 g1 g2 ... g1023 ...
  // Each field is contiguous across all elements

The Amiga's planar format is SoA: each bitplane is an array of one field (one bit) across all pixels. The chunky format is AoS: each pixel's fields (all 8 bits) are packed together.

Where This Problem Appears Today

Domain	SoA (Planar-Like)	AoS (Chunky-Like)	Conversion
Amiga graphics	Bitplanes (Agnus DMA)	Chunky pixel buffer (CPU render)	C2P algorithm
GPU compute shaders	SoA buffer layouts (SSBO)	Vertex attributes (interleaved VBO)	Shader transpose
SIMD / AVX-512	Separate float arrays (vectorisable)	Struct arrays (gather/scatter)	`_mm512_transpose` intrinsics
Database engines	Columnar storage (Parquet, Arrow)	Row-oriented storage (MySQL)	Column↔row materialisation
Image compression	Colour planes (JPEG YCbCr)	RGB pixels (BMP)	MCU block decomposition
GPU texture memory	Block-compressed (BC/ASTC)	Linear RGBA	Hardware texture unit decode
Neural network inference	NCHW tensor layout (channels first)	NHWC (channels last)	Layout transposition kernel

Why Each System Prefers a Different Layout

Layout	Optimal For	Reason
SoA / Planar	Streaming one field across many elements	Maximises cache line utilisation, enables SIMD vectorisation
AoS / Chunky	Random-access to complete elements	All fields of one element in one cache line

The Amiga's custom DMA engine streams bitplane data to the display sequentially — plane 0 for the whole line, then plane 1, etc. This is a SoA access pattern, perfectly matched by the planar layout. The CPU, which wants to set a single pixel's complete colour, has the opposite need — it wants AoS.

Modern Hardware Parallels

Amiga Component	Modern Equivalent	Function
Akiko C2P register	GPU texture swizzle unit	Hardware layout transposition
Blitter + merge algorithm	CUDA shared memory transpose kernel	CPU/coprocessor-assisted transpose
RTG (planar bypass)	Unified chunky framebuffer (since VGA)	Eliminates the problem entirely
Copper palette cycling	GPU palette shader / LUT texture	Colour manipulation without pixel writes
FMODE (fetch width)	GPU memory bus width (256/384/512-bit)	Wider bus = more data per DMA cycle

GPU Texture Swizzle — The Modern Akiko

Modern GPUs store textures in swizzled (Morton/Z-order) layouts rather than linear row-major order. This is architecturally identical to what the Amiga does with planar bitmaps: the hardware's memory access pattern doesn't match the CPU's logical layout, so a dedicated hardware unit transparently converts between them.

Linear (CPU view):             Morton/Z-order (GPU internal):
  0  1  2  3                     0  1  4  5
  4  5  6  7          →          2  3  6  7
  8  9 10 11                     8  9 12 13
 12 13 14 15                    10 11 14 15

When you call glTexImage2D() or vkCmdCopyBufferToImage(), the GPU driver performs a layout conversion from linear (CPU-friendly) to swizzled (GPU-cache-friendly). This is the exact same class of operation as Amiga C2P — a hardware-accelerated data layout transformation that is invisible to the application programmer.

Performance Comparison Across Eras

System	Data Layout Problem	Throughput	Method
A500 (1987, 7 MHz 68000)	C2P 320×256×4bpp	~2 MB/s	CPU merge, 4 planes
A1200 (1992, 14 MHz 68020)	C2P 320×256×8bpp	~1.5 MB/s	CPU merge, 8 planes
CD32 (1993, 14 MHz + Akiko)	C2P 320×256×8bpp	~1.5 MB/s	Akiko hardware
486 DX2/66 (1992)	No conversion needed	N/A	VGA Mode 13h = chunky
Pentium MMX (1997)	Colour space (YUV→RGB)	~200 MB/s	MMX SIMD
GTX 1080 (2016)	Texture swizzle (linear→tiled)	~300 GB/s	Hardware TMU
Apple M2 (2022)	SoA↔AoS for ML tensors	~100 GB/s	Hardware AMX

The throughput gap tells the story: what consumed 100% of a 68020's capability is handled by a dedicated hardware unit at 200,000× the bandwidth on modern silicon. But the fundamental problem — data layout mismatch between producer and consumer — is identical.

Historical Timeline

Year	Event
1985	Amiga launches with planar display. C2P not needed — all software renders directly to bitplanes
1989	First 3D demos appear (Juggler, etc.). Rendering in chunky buffers starts
1991	Demoscene coders develop first optimised C2P routines for 68000
1992	AGA ships (A1200/A4000). 8 bitplanes = C2P problem gets 2× harder
1993	CD32 ships with Akiko — first hardware C2P. Mikael Kalms publishes optimised CPU routines
1994	Kalms C2P library becomes the de facto standard. Multiple variants for 020/030/040/060
1995	RTG cards (Picasso II, CyberVision 64) begin to make C2P irrelevant for productivity
1996	CyberVision 64 ships with Roxxler P2C chip — the reverse problem, solved in hardware
1998	68060 accelerators make CPU C2P faster than any hardware solution
2020+	MiSTer FPGA core implements RTG via `uaegfx` — C2P eliminated for modern setups

Implementing C2P — Practical Checklist

For developers writing Amiga software that renders in chunky format:

Allocate the chunky buffer in Fast RAM (MEMF_FAST) — the CPU reads it during conversion, and Fast RAM has no DMA contention
Allocate the planar screen in Chip RAM (MEMF_CHIP | MEMF_DISPLAYABLE) — this is mandatory for display DMA
Use a proven C2P library — Kalms C2P (kalms-c2p on GitHub/Aminet) is the gold standard
Match the routine to your CPU — different unrolling for 68020 vs 68040 vs 68060
Use triple buffering if possible — render to buffer A, C2P buffer B into Chip RAM, display buffer C
On CD32, detect and use Akiko — WriteChunkyPixels() does this automatically
On RTG systems, skip C2P entirely — render chunky directly to the RTG card's VRAM
Profile with CIA timers — the bottleneck shifts between CPU merge and Chip RAM write speed depending on configuration

Adaptive Detection

#include <graphics/gfxbase.h>
#include <cybergraphx/cybergraphics.h>

extern struct GfxBase *GfxBase;

/* Determine best C2P strategy for current hardware */
enum C2P_Strategy determine_c2p_strategy(struct BitMap *screen_bm)
{
    /* Check for RTG screen first — no C2P needed */
    if (GetCyberMapAttr(screen_bm, CYBRMATTR_ISRTG))
        return C2P_NONE_RTG;

    /* Check for Akiko (CD32) */
    if (GfxBase->ChunkyToPlanarPtr != NULL)
        return C2P_AKIKO;

    /* Check CPU type for best software routine */
    UWORD attn = SysBase->AttnFlags;
    if (attn & AFF_68060) return C2P_KALMS_060;
    if (attn & AFF_68040) return C2P_KALMS_040;
    if (attn & AFF_68020) return C2P_KALMS_020;

    return C2P_KALMS_000;  /* 68000 fallback */
}

References

Mikael Kalms — kalms-c2p — the definitive C2P library (GitHub)
Scout/Azure — "Chunky 2 Planar Tutorial" — the seminal demoscene document explaining the transposition theory
Amiga Hardware Reference Manual — bitplane DMA, display pipeline
NDK39: graphics/gfx.h — WriteChunkyPixels() prototype
Intel — Structure of Arrays vs Array of Structures — modern SoA/AoS guide
NVIDIA — CUDA Programming Guide, "Shared Memory Matrix Transpose" — GPU equivalent of C2P

23 KiB Raw Blame History Unescape Escape