# Gemma 4 12B MTP drafter

Multi-Token Prediction (MTP) drafter for `unsloth/gemma-4-12B-it-qat-GGUF`. It runs as a speculative draft model that shares the target's KV cache and speeds up text generation, with no change to the output because the target verifies every drafted token.

Verified on a single B200 with the `gemma-4-12B-it-qat-UD-Q4_K_XL.gguf` target: 163 tok/s without MTP, 316 tok/s with MTP, 0.82 draft acceptance.

MTP was merged into llama.cpp on 2026-06-07 (PR ggml-org/llama.cpp#23398). You need a llama.cpp build from after that date. Older builds cannot load these (arch `gemma4-assistant`).

## Files

- `gemma-4-12B-it-MTP-Q8_0.gguf` (smallest, recommended)
- `gemma-4-12B-it-MTP-BF16.gguf`
- `gemma-4-12B-it-MTP-F16.gguf`

## Build llama.cpp

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# CUDA build. Set the arch for your GPU: 89 (RTX 4090), 90 (H100), 100 (B200).
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=90
cmake --build build --config Release -j --target llama-server
```

## Run

```bash
hf download unsloth/gemma-4-12B-it-qat-GGUF gemma-4-12B-it-qat-UD-Q4_K_XL.gguf --local-dir .
hf download unsloth/gemma-4-12B-it-qat-GGUF MTP/gemma-4-12B-it-MTP-Q8_0.gguf --local-dir .

./build/bin/llama-server \
  -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \
  --model-draft MTP/gemma-4-12B-it-MTP-Q8_0.gguf \
  --spec-type draft-mtp --spec-draft-n-max 4 \
  -ngl 999 -fa on
```

Multi GPU: add `--spec-draft-device CUDA0 -sm layer`.

The drafter pairs with any quant of the 12B. Quantized KV cache works.