amiga-bootcamp/08_graphics/pixel_conversion.md

495 lines
23 KiB
Markdown
Raw Normal View History

[← Home](../README.md) · [Graphics](README.md)
# Pixel Format Conversion — Chunky ↔ Planar and Beyond
## The Core Problem
Every Amiga programmer eventually hits the same wall: the custom chipset displays graphics in **planar** format, but nearly every interesting algorithm — 3D rendering, texture mapping, image decompression, PC game ports — produces output in **chunky** format. Converting between these two layouts is the single most CPU-intensive bottleneck in Amiga graphics programming.
This article covers:
1. **What** planar and chunky formats are, mathematically
2. **Why** the conversion is expensive
3. **How** every known solution works — from naive loops to the Kalms butterfly
4. **Where** this problem appears in broader computing (SoA/AoS, GPU swizzle, SIMD)
> [!NOTE]
> The Akiko hardware article covers the CD32's dedicated C2P register interface. This article covers the *algorithm theory* that applies to every Amiga model, and the broader data-layout concepts that connect the Amiga to modern computing.
>
> See: [Akiko — CD32 C2P Hardware](../01_hardware/aga_a1200_a4000/akiko_cd32.md)
---
## Planar vs Chunky — The Two Layouts
### Chunky (Packed Pixel)
Every pixel's complete colour index is stored contiguously. For 8-bit (256 colour) pixels:
```
Address: $0000 $0001 $0002 $0003 $0004 $0005 $0006 $0007
Data: $0D $05 $1B $0A $FF $03 $42 $7E
pixel0 pixel1 pixel2 pixel3 pixel4 pixel5 pixel6 pixel7
```
Each byte = one pixel. Linear, simple, cache-friendly for rendering. This is how **every modern GPU**, every PC VGA card, every framebuffer since 1990 stores pixels.
### Planar (Bitplane)
Each pixel's colour index is **split across N separate memory regions** (bitplanes). For 8-bit pixels (8 bitplanes), each bitplane stores one bit of every pixel:
```
Bitplane 0: 1 0 1 1 0 0 1 0 ← bit 0 of pixels 07
Bitplane 1: 0 1 0 1 1 0 0 1 ← bit 1 of pixels 07
Bitplane 2: 1 1 0 0 0 1 1 0 ← bit 2
Bitplane 3: 0 1 1 0 1 1 0 0 ← bit 3
Bitplane 4: 1 0 1 0 1 0 0 1 ← bit 4
Bitplane 5: 1 0 0 0 0 0 1 0 ← bit 5
Bitplane 6: 0 0 1 0 0 0 0 1 ← bit 6
Bitplane 7: 0 0 0 0 1 0 1 0 ← bit 7
```
To read pixel 0's colour: collect bit 0 from each of the 8 planes → `10101100` = `$AC`. The 8 planes are **not interleaved** in standard Amiga layout — each is a separate contiguous memory block.
### Why the Amiga Uses Planar
The planar format was a brilliant engineering choice in 1985:
| Advantage | Explanation |
|---|---|
| **Bandwidth efficiency** | A 4-colour screen uses 2 bitplanes = ½ the memory bandwidth of 4bpp chunky. DMA fetches only the planes actually used. |
| **Scalable colour depth** | Adding a bitplane doubles the colour count without redesigning the display engine. OCS: 16 planes. AGA: 18 planes. |
| **Cheap colour cycling** | Rotating palette indices only requires changing colour registers — zero memory writes. |
| **Blitter efficiency** | Blitting a masked sprite at 4 colours touches only 2 planes (2 blits), not 4× the data. |
| **Copper integration** | The Copper can change palette registers mid-scanline, effectively multiplying colours without more bitplanes. |
The downside only became critical as rendering algorithms evolved past 2D sprites into 3D, texture mapping, and pixel-level effects that naturally produce chunky output.
---
## The Conversion — Mathematically
C2P is a **bit matrix transposition**. Given 32 chunky pixels (each 8 bits wide), you have a 32×8 bit matrix (32 rows × 8 columns). C2P transposes this to an 8×32 matrix (8 bitplanes × 32 bits each):
```
2026-04-25 14:58:33 -04:00
Input (chunky): Output (planar):
32 pixels × 8 bits 8 bitplanes × 32 bits
┌──────────────────────────────┐ ┌────────────────────────────────────────┐
│ P0: b7 b6 b5 b4 b3 b2 b1 b0 │ │ Plane 0: p0.b0 p1.b0 p2.b0 ... p31.b0 │
│ P1: b7 b6 b5 b4 b3 b2 b1 b0 │ │ Plane 1: p0.b1 p1.b1 p2.b1 ... p31.b1 │
│ ... │ │ ... │
│ P31: b7 b6 b5 b4 b3 b2 b1 b0 │ │ Plane 7: p0.b7 p1.b7 p2.b7 ... p31.b7 │
└──────────────────────────────┘ └────────────────────────────────────────┘
```
This is equivalent to a 90° bit rotation. On a modern CPU with SIMD, this is trivial. On a 68020 with 8 data registers and no bit-parallel instructions, it is an algorithmic challenge that consumed thousands of programmer-hours across the demoscene.
---
## Solution 1 — The Naive Loop
The simplest approach: iterate over every pixel, extract each bit, and set it in the corresponding bitplane.
```c
/* Naive C2P — educational only, never use in production */
void c2p_naive(UBYTE *chunky, UBYTE *planes[8], int width, int height)
{
for (int y = 0; y < height; y++)
{
for (int x = 0; x < width; x++)
{
UBYTE pixel = chunky[y * width + x];
int byte_offset = y * (width / 8) + (x / 8);
int bit_position = 7 - (x & 7);
for (int plane = 0; plane < 8; plane++)
{
if (pixel & (1 << plane))
planes[plane][byte_offset] |= (1 << bit_position);
else
planes[plane][byte_offset] &= ~(1 << bit_position);
}
}
}
}
```
**Performance:** ~200+ cycles per pixel on 68020. For 320×256 = 81,920 pixels → **~16 million cycles → ~1.1 seconds at 14 MHz**. This gives roughly **0.9 FPS**. Completely unusable.
**Why it's terrible:**
- One bit at a time — no parallelism
- Read-modify-write on every bitplane byte (bus-killing)
- No register reuse — constant memory traffic
- Branch on every bit (pipeline flush on 68020)
---
## Solution 2 — The Merge (Butterfly) Algorithm
This is the standard approach used by virtually all serious Amiga C2P routines. Invented independently by several demoscene coders and formalised by **Mikael Kalms** (Kalmalyzer) and others.
### The Key Insight
Instead of processing one pixel at a time, load **32 pixels** (8 longwords = 256 bits) into CPU registers and perform a series of **bit-level swap operations** (called "merges") that progressively rearrange the bits into planar order. Each merge pass swaps bits at a different granularity: 16-bit blocks, then 8-bit, then 4-bit, 2-bit, and 1-bit.
This is exactly a **butterfly network** — the same structure used in the FFT (Fast Fourier Transform) and Batcher's bitonic sort.
### The Merge Primitive
The fundamental building block is a 2-register swap that exchanges bits at a given stride:
```asm
; merge(d0, d1, mask, shift)
; Exchanges bits between d0 and d1 where mask selects which bits to swap
; and shift determines the stride
move.l d0, d2 ; temp = a
lsr.l #shift, d2 ; temp >>= stride
eor.l d1, d2 ; temp ^= b
and.l #mask, d2 ; temp &= mask (select bits to swap)
eor.l d2, d1 ; b ^= temp (swap into b)
lsl.l #shift, d2 ; temp <<= stride (restore position)
eor.l d2, d0 ; a ^= temp (swap into a)
```
**7 instructions** per merge. Each merge moves half the bits in two registers to their correct positions.
### Pass Structure for 8 Bitplanes
A full 8-bitplane C2P conversion on 32 pixels requires **5 passes** of merge operations:
| Pass | Block Size | Mask | Swap Distance | Effect |
|---|---|---|---|---|
| 1 | 16-bit | `$0000FFFF` | 16 | Swap upper/lower halves of longword pairs |
| 2 | 8-bit | `$00FF00FF` | 8 | Swap bytes within pairs |
| 3 | 4-bit | `$0F0F0F0F` | 4 | Swap nibbles |
| 4 | 2-bit | `$33333333` | 2 | Swap bit-pairs |
| 5 | 1-bit | `$55555555` | 1 | Swap individual bits |
After all 5 passes, the 8 data registers contain one longword per bitplane.
### Full 8-Bitplane C2P Inner Loop
```asm
; Kalms-style C2P inner loop — converts 32 chunky pixels (8 longwords)
; to 8 planar longwords (one per bitplane)
;
; Input: d0-d7 = 8 longwords of chunky data (4 pixels each)
; Output: d0-d7 = 8 longwords of planar data (one per bitplane)
; ---- Pass 1: 16-bit swap ----
swap d0 ; exchange upper/lower words of d0
swap d1
swap d2
swap d3
; (merge d0,d4), (merge d1,d5), (merge d2,d6), (merge d3,d7)
; using mask $0000FFFF, shift 16
move.l d0, a3 ; temp save
move.l d4, d0
move.w a3, d0 ; d0 = d4.hi : d0.lo
move.w d4, a3 ; a3 = d0.hi : d4.lo
move.l a3, d4
move.l d1, a3
move.l d5, d1
move.w a3, d1
move.w d5, a3
move.l a3, d5
move.l d2, a3
move.l d6, d2
move.w a3, d2
move.w d6, a3
move.l a3, d6
move.l d3, a3
move.l d7, d3
move.w a3, d3
move.w d7, a3
move.l a3, d7
; ---- Pass 2: 8-bit swap ----
; mask = $00FF00FF, shift = 8
move.l #$00FF00FF, a3
; merge(d0, d2)
move.l d0, a4
lsr.l #8, a4
eor.l d2, a4
and.l a3, a4
eor.l a4, d2
lsl.l #8, a4
eor.l a4, d0
; merge(d1, d3) ... merge(d4, d6) ... merge(d5, d7) ...
; (same pattern repeated for each pair)
; ---- Pass 3: 4-bit swap ----
; mask = $0F0F0F0F, shift = 4
; merge(d0, d1), merge(d2, d3), merge(d4, d5), merge(d6, d7)
; ---- Pass 4: 2-bit swap ----
; mask = $33333333, shift = 2
; ---- Pass 5: 1-bit swap ----
; mask = $55555555, shift = 1
; Result: d0 = bitplane 0 (32 bits), d1 = bitplane 1, ... d7 = bitplane 7
```
> [!NOTE]
> The above is a pedagogical skeleton. Production C2P routines are **heavily unrolled** and use every register trick available — address registers as temporary storage, interleaving loads with merges to hide memory latency, and sometimes splitting the conversion across two phases to overlap with Chip RAM writes.
### Performance
| Metric | Naive | Merge/Butterfly | Improvement |
|---|---|---|---|
| Instructions per 32 pixels | ~6,400+ | ~160200 | **3240×** |
| Cycles per pixel (68020 @ 14 MHz) | ~200 | ~57 | **~30×** |
| 320×256 full frame | ~1.1 s | ~35 ms | **~30× (28 FPS)** |
| 320×256 per frame budget | 0.9 FPS | **28 FPS** | Playable |
---
## Solution 3 — Akiko Hardware C2P (CD32 Only)
The CD32's Akiko chip implements C2P in dedicated silicon. The CPU feeds 8 longwords of chunky data to register `$B80030` and reads back 8 longwords of planar data from the same address.
| Metric | Software C2P (68020) | Akiko |
|---|---|---|
| Method | CPU merge/butterfly | Hardware register pipeline |
| Throughput | ~1.5 MB/s | ~1.5 MB/s |
| CPU cost | 100% | ~50% (register I/O) |
| Availability | All Amigas | **CD32 only** |
Akiko's throughput is approximately the same as optimised software C2P on the 68020 because both are limited by the Chip RAM bus bandwidth (~3.5 MB/s shared). On faster CPUs (68040/060), software C2P **outperforms** Akiko because the CPU can process data faster than the register interface can shuttle it.
Full Akiko protocol: [Akiko — CD32 C2P Hardware](../01_hardware/aga_a1200_a4000/akiko_cd32.md#chunky-to-planar-c2p-conversion)
---
## Solution 4 — Blitter-Assisted C2P
The Blitter can be used as part of a C2P pipeline, but it cannot perform the transposition itself. Typical usage:
1. CPU performs the merge/butterfly in registers → outputs planar longwords to a temporary buffer in Chip RAM
2. Blitter copies the planar data from the temporary buffer to the screen's bitplanes with correct modulo
This approach **overlaps** CPU computation with Blitter DMA — while the Blitter writes frame N's planes to the screen, the CPU computes frame N+1's transposition.
```
Time ──────────────────────────────────────────────────────→
CPU: [merge frame 0] [merge frame 1] [merge frame 2] ...
Blitter: [write frame 0] [write frame 1] ...
↑ overlap: CPU and Blitter run in parallel
```
> [!WARNING]
> On 68040/060 systems, the Blitter is often **slower** than letting the CPU do both the merge and the writes via `MOVE16` (68040) or unrolled `MOVEM.L`. The Blitter's 16-bit bus (even in AGA FMODE×4) adds DMA contention that may actually slow down the CPU's merge passes.
---
## Solution 5 — WriteChunkyPixels (AmigaOS)
AmigaOS 3.0+ provides `WriteChunkyPixels()` in `graphics.library`, which performs C2P conversion internally using the best available method:
```c
#include <graphics/gfx.h>
WriteChunkyPixels(rp,
xstart, ystart, xstop, ystop,
chunky_buffer, chunky_bytes_per_row);
```
On CD32, this function auto-detects Akiko and uses it. On other AGA machines, it uses an internal software C2P. However, the OS implementation is **not** as fast as the best demoscene routines — it prioritises correctness and generality over raw speed.
---
## Solution 6 — RTG: Eliminating C2P Entirely
The ultimate solution to C2P is to **not do it at all**. Retargetable Graphics (RTG) cards like the Picasso IV, CyberVision 64, and MiSTer's virtual `uaegfx` provide a chunky framebuffer directly. The rendering engine writes chunky pixels to VRAM, and the card's RAMDAC/scaler converts them to video output.
The irony: RTG cards must perform the **reverse** conversion (P2C — planar-to-chunky) when legacy planar software runs on an RTG screen. The CyberVision 64 included a dedicated **Roxxler** chip for this. Without hardware help, P2C on software is equally expensive.
See: [RTG — Retargetable Graphics](../16_driver_development/rtg_driver.md#planar-to-chunky-conversion-c2p)
---
## Choosing the Right Approach
| Platform | Recommended C2P | Why |
|---|---|---|
| A500/A2000 (68000) | Merge algorithm (simplified, fewer planes) | No fast multiply; 68000 can manage 45 plane C2P at ~15 FPS |
| A1200 (68020) | Kalms merge, 5-pass | Sweet spot: enough registers, usable I-cache |
| CD32 (68020 + Akiko) | Akiko hardware | Frees ~50% CPU for game logic |
| A4000 (68030/040) | CPU merge (skip Akiko if not CD32) | 68040 `MOVE16` makes CPU writes fast enough |
| 68060 accelerated | CPU merge, no Blitter | 68060 superscalar outperforms everything else |
| MiSTer FPGA | RTG (`uaegfx`) | Chunky framebuffer in DDR — no C2P needed |
---
## The Bigger Picture — Data Layout Transformation
C2P is not unique to the Amiga. It is an instance of a fundamental problem in computer architecture: **transforming data layout between Structure-of-Arrays (SoA) and Array-of-Structures (AoS)**.
### SoA vs AoS — The Universal Duality
```
AoS (Array of Structures) = Chunky:
struct Pixel { r, g, b, a; };
Pixel pixels[1024];
// Memory: r0 g0 b0 a0 r1 g1 b1 a1 r2 g2 b2 a2 ...
// Each element's fields are contiguous
SoA (Structure of Arrays) = Planar:
struct Pixels {
float r[1024];
float g[1024];
float b[1024];
float a[1024];
};
// Memory: r0 r1 r2 ... r1023 g0 g1 g2 ... g1023 ...
// Each field is contiguous across all elements
```
The Amiga's planar format is **SoA**: each bitplane is an array of one field (one bit) across all pixels. The chunky format is **AoS**: each pixel's fields (all 8 bits) are packed together.
### Where This Problem Appears Today
| Domain | SoA (Planar-Like) | AoS (Chunky-Like) | Conversion |
|---|---|---|---|
| **Amiga graphics** | Bitplanes (Agnus DMA) | Chunky pixel buffer (CPU render) | C2P algorithm |
| **GPU compute shaders** | SoA buffer layouts (SSBO) | Vertex attributes (interleaved VBO) | Shader transpose |
| **SIMD / AVX-512** | Separate float arrays (vectorisable) | Struct arrays (gather/scatter) | `_mm512_transpose` intrinsics |
| **Database engines** | Columnar storage (Parquet, Arrow) | Row-oriented storage (MySQL) | Column↔row materialisation |
| **Image compression** | Colour planes (JPEG YCbCr) | RGB pixels (BMP) | MCU block decomposition |
| **GPU texture memory** | Block-compressed (BC/ASTC) | Linear RGBA | Hardware texture unit decode |
| **Neural network inference** | NCHW tensor layout (channels first) | NHWC (channels last) | Layout transposition kernel |
### Why Each System Prefers a Different Layout
| Layout | Optimal For | Reason |
|---|---|---|
| **SoA / Planar** | Streaming one field across many elements | Maximises cache line utilisation, enables SIMD vectorisation |
| **AoS / Chunky** | Random-access to complete elements | All fields of one element in one cache line |
The Amiga's custom DMA engine streams bitplane data to the display sequentially — plane 0 for the whole line, then plane 1, etc. This is a **SoA access pattern**, perfectly matched by the planar layout. The CPU, which wants to set a single pixel's complete colour, has the opposite need — it wants **AoS**.
### Modern Hardware Parallels
| Amiga Component | Modern Equivalent | Function |
|---|---|---|
| **Akiko C2P register** | GPU texture swizzle unit | Hardware layout transposition |
| **Blitter + merge algorithm** | CUDA shared memory transpose kernel | CPU/coprocessor-assisted transpose |
| **RTG (planar bypass)** | Unified chunky framebuffer (since VGA) | Eliminates the problem entirely |
| **Copper palette cycling** | GPU palette shader / LUT texture | Colour manipulation without pixel writes |
| **FMODE (fetch width)** | GPU memory bus width (256/384/512-bit) | Wider bus = more data per DMA cycle |
### GPU Texture Swizzle — The Modern Akiko
Modern GPUs store textures in **swizzled** (Morton/Z-order) layouts rather than linear row-major order. This is architecturally identical to what the Amiga does with planar bitmaps: the hardware's memory access pattern doesn't match the CPU's logical layout, so a dedicated hardware unit transparently converts between them.
```
Linear (CPU view): Morton/Z-order (GPU internal):
0 1 2 3 0 1 4 5
4 5 6 7 → 2 3 6 7
8 9 10 11 8 9 12 13
12 13 14 15 10 11 14 15
```
When you call `glTexImage2D()` or `vkCmdCopyBufferToImage()`, the GPU driver performs a layout conversion from linear (CPU-friendly) to swizzled (GPU-cache-friendly). This is the exact same class of operation as Amiga C2P — a hardware-accelerated data layout transformation that is invisible to the application programmer.
---
## Performance Comparison Across Eras
| System | Data Layout Problem | Throughput | Method |
|---|---|---|---|
| A500 (1987, 7 MHz 68000) | C2P 320×256×4bpp | ~2 MB/s | CPU merge, 4 planes |
| A1200 (1992, 14 MHz 68020) | C2P 320×256×8bpp | ~1.5 MB/s | CPU merge, 8 planes |
| CD32 (1993, 14 MHz + Akiko) | C2P 320×256×8bpp | ~1.5 MB/s | Akiko hardware |
| 486 DX2/66 (1992) | No conversion needed | N/A | VGA Mode 13h = chunky |
| Pentium MMX (1997) | Colour space (YUV→RGB) | ~200 MB/s | MMX SIMD |
| GTX 1080 (2016) | Texture swizzle (linear→tiled) | ~300 GB/s | Hardware TMU |
| Apple M2 (2022) | SoA↔AoS for ML tensors | ~100 GB/s | Hardware AMX |
The throughput gap tells the story: what consumed 100% of a 68020's capability is handled by a dedicated hardware unit at 200,000× the bandwidth on modern silicon. But the fundamental problem — **data layout mismatch between producer and consumer** — is identical.
---
## Historical Timeline
| Year | Event |
|---|---|
| 1985 | Amiga launches with planar display. C2P not needed — all software renders directly to bitplanes |
| 1989 | First 3D demos appear (Juggler, etc.). Rendering in chunky buffers starts |
| 1991 | Demoscene coders develop first optimised C2P routines for 68000 |
| 1992 | AGA ships (A1200/A4000). 8 bitplanes = C2P problem gets 2× harder |
| 1993 | CD32 ships with Akiko — first hardware C2P. Mikael Kalms publishes optimised CPU routines |
| 1994 | Kalms C2P library becomes the de facto standard. Multiple variants for 020/030/040/060 |
| 1995 | RTG cards (Picasso II, CyberVision 64) begin to make C2P irrelevant for productivity |
| 1996 | CyberVision 64 ships with Roxxler P2C chip — the reverse problem, solved in hardware |
| 1998 | 68060 accelerators make CPU C2P faster than any hardware solution |
| 2020+ | MiSTer FPGA core implements RTG via `uaegfx` — C2P eliminated for modern setups |
---
## Implementing C2P — Practical Checklist
For developers writing Amiga software that renders in chunky format:
1. **Allocate the chunky buffer in Fast RAM** (`MEMF_FAST`) — the CPU reads it during conversion, and Fast RAM has no DMA contention
2. **Allocate the planar screen in Chip RAM** (`MEMF_CHIP | MEMF_DISPLAYABLE`) — this is mandatory for display DMA
3. **Use a proven C2P library** — Kalms C2P (`kalms-c2p` on GitHub/Aminet) is the gold standard
4. **Match the routine to your CPU** — different unrolling for 68020 vs 68040 vs 68060
5. **Use triple buffering** if possible — render to buffer A, C2P buffer B into Chip RAM, display buffer C
6. **On CD32, detect and use Akiko**`WriteChunkyPixels()` does this automatically
7. **On RTG systems, skip C2P entirely** — render chunky directly to the RTG card's VRAM
8. **Profile with CIA timers** — the bottleneck shifts between CPU merge and Chip RAM write speed depending on configuration
### Adaptive Detection
```c
#include <graphics/gfxbase.h>
#include <cybergraphx/cybergraphics.h>
extern struct GfxBase *GfxBase;
/* Determine best C2P strategy for current hardware */
enum C2P_Strategy determine_c2p_strategy(struct BitMap *screen_bm)
{
/* Check for RTG screen first — no C2P needed */
if (GetCyberMapAttr(screen_bm, CYBRMATTR_ISRTG))
return C2P_NONE_RTG;
/* Check for Akiko (CD32) */
if (GfxBase->ChunkyToPlanarPtr != NULL)
return C2P_AKIKO;
/* Check CPU type for best software routine */
UWORD attn = SysBase->AttnFlags;
if (attn & AFF_68060) return C2P_KALMS_060;
if (attn & AFF_68040) return C2P_KALMS_040;
if (attn & AFF_68020) return C2P_KALMS_020;
return C2P_KALMS_000; /* 68000 fallback */
}
```
---
## References
- Mikael Kalms — [kalms-c2p](https://github.com/Kalmalyzer/kalms-c2p) — the definitive C2P library (GitHub)
- Scout/Azure — "Chunky 2 Planar Tutorial" — the seminal demoscene document explaining the transposition theory
- *Amiga Hardware Reference Manual* — bitplane DMA, display pipeline
- NDK39: `graphics/gfx.h``WriteChunkyPixels()` prototype
- Intel — [Structure of Arrays vs Array of Structures](https://www.intel.com/content/www/us/en/developer/articles/technical/memory-layout-transformations.html) — modern SoA/AoS guide
- NVIDIA — CUDA Programming Guide, "Shared Memory Matrix Transpose" — GPU equivalent of C2P
## See Also
- [Akiko — CD32 C2P Hardware](../01_hardware/aga_a1200_a4000/akiko_cd32.md) — Akiko register protocol
- [BitMap — Planar Layout](bitmap.md) — how Amiga bitmaps are structured in memory
- [Blitter Programming](blitter_programming.md) — Blitter DMA used in Blitter-assisted C2P
- [RTG — Retargetable Graphics](../16_driver_development/rtg_driver.md) — chunky framebuffer cards that eliminate C2P
- [Memory Types](../01_hardware/common/memory_types.md) — Chip vs Fast RAM (critical for C2P buffer placement)