# Gemma 4 12B MTP drafter

Multi-Token Prediction (MTP) drafter for `unsloth/gemma-4-12B-it-qat-GGUF`. It runs as a speculative draft model that shares the target's KV cache and speeds up text generation. The output does not change, because the target verifies every drafted token.

Verified on a single B200 with the `gemma-4-12B-it-qat-UD-Q4_K_XL.gguf` target: 163 tok/s without MTP, 316 tok/s with MTP, 0.82 draft acceptance.

MTP was merged into llama.cpp on 2026-06-07 (PR ggml-org/llama.cpp#23398). You need a llama.cpp build from after that date. Older builds cannot load these files (arch `gemma4-assistant`).

## Files

- `gemma-4-12B-it-MTP-Q8_0.gguf` (smallest, recommended)
- `gemma-4-12B-it-MTP-BF16.gguf`
- `gemma-4-12B-it-MTP-F16.gguf`

## Build llama.cpp

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# CUDA build. Set the arch for your GPU: 89 (RTX 4090), 90 (H100), 100 (B200).
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=90
cmake --build build --config Release -j --target llama-server
```

## Run

```bash
hf download unsloth/gemma-4-12B-it-qat-GGUF gemma-4-12B-it-qat-UD-Q4_K_XL.gguf --local-dir .
hf download unsloth/gemma-4-12B-it-qat-GGUF MTP/gemma-4-12B-it-MTP-Q8_0.gguf --local-dir .

./build/bin/llama-server \
  -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \
  --model-draft MTP/gemma-4-12B-it-MTP-Q8_0.gguf \
  --spec-type draft-mtp --spec-draft-n-max 4 \
  -ngl 999 -fa on
```

Multi-GPU: add `--spec-draft-device CUDA0 -sm layer`.

The drafter pairs with any quant of the 12B target. Quantized KV cache works.
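
To put the benchmark figures from the introduction in perspective, the speedup can be computed directly from the two throughput numbers. This is a minimal sketch: the `speedup` helper is a hypothetical name introduced here for illustration and is not part of llama.cpp.

```bash
# Hypothetical helper: ratio of MTP throughput to baseline throughput.
# $1 = baseline tok/s (no MTP), $2 = tok/s with MTP
speedup() {
  awk -v base="$1" -v mtp="$2" 'BEGIN { printf "%.2f\n", mtp / base }'
}

# Figures quoted above for the single-B200 run.
speedup 163 316
```

With the B200 figures this prints `1.94`, roughly a 1.9x throughput gain. Substitute your own measured tok/s values to compare on other hardware.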