💻 Gemma4-12B-Coder (GGUF) — Composer 2.5 × Fable 5 ✨
🐣 Tiny footprint, big brain — a local coding model for everyone
Whatever GPU you have: if you can free up about 4.5 GB of VRAM or unified memory for the smallest quant, plus some headroom for context (see the cheat-sheet below), you can run your own private, offline coding assistant right now. 🚀 This is the v1 code edition, distilled from real chain-of-thought so it thinks through a problem before writing the solution. 🧠💻 All local, all yours, no API, no cloud.
🎯 What it is
A focused fine-tune of Gemma 4 12B on verifiable Python coding data — every training example's reasoning leads to code that actually passed its tests. The result reasons in the open (edge cases, complexity, approach) and then emits a clean, runnable solution. 💚
📚 Training data (the interesting part 🍳)
This is a distillation of two complementary chain-of-thought sources, both over verifiable Python coding tasks (algorithmic / function-level problems that come with deterministic tests):
- 🥇 Main set — Composer 2.5 real CoT. Genuine, model-authored reasoning traces. The teacher solved each problem, its code was run against the task's tests, and only the passing solutions were kept. So the reasoning you're learning from leads to code that actually works.
- 🥈 Aux set — Fable 5 (released today! 🎉). A clever twist: we took the problems where Composer 2.5 got it wrong and handed them to Fable 5 to redo — re-deriving a fresh, self-consistent chain-of-thought and a correct solution, again gated on passing the tests. This recovers the hard cases the main teacher missed. These traces are synthetic (rationalized CoT), and are tagged separately so the two sources stay distinguishable.
The recipe: real CoT for the bulk of solid coverage, plus synthetic "second-attempt" CoT to patch the failures — both verified by execution before anything entered training. ✅
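The execution gate described above can be sketched roughly as follows. This is only an illustration of the idea, not the actual training pipeline: the `Attempt` record, the `retry_teacher` callback, and the source tags are hypothetical names, and each task's tests are assumed to be plain `assert` statements that can be appended to the candidate code.

```python
import os
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Attempt:
    """One teacher attempt (hypothetical record layout)."""
    problem_id: str
    reasoning: str
    code: str
    source: str  # tag that keeps the two CoT sources distinguishable


def passes_tests(code: str, tests: str, timeout: float = 10.0) -> bool:
    """Run the candidate code followed by its tests in a fresh interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(code + "\n\n" + tests + "\n")
        try:
            result = subprocess.run(
                [sys.executable, path], capture_output=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            return False
        # A failing assert or any other exception gives a non-zero exit code.
        return result.returncode == 0


def build_dataset(
    main_attempts: List[Attempt],
    retry_teacher: Callable[[str], Optional[Attempt]],
    tests_by_id: Dict[str, str],
) -> List[Attempt]:
    """Keep passing main-teacher attempts; send failures to the second teacher."""
    kept: List[Attempt] = []
    failed: List[Attempt] = []
    for attempt in main_attempts:
        ok = passes_tests(attempt.code, tests_by_id[attempt.problem_id])
        (kept if ok else failed).append(attempt)
    for attempt in failed:
        retry = retry_teacher(attempt.problem_id)
        # The second attempt is gated on the same tests before it is kept.
        if retry is not None and passes_tests(retry.code, tests_by_id[retry.problem_id]):
            kept.append(retry)
    return kept
```

Running untrusted generated code this way is only reasonable inside a sandbox; the sketch leaves isolation out for brevity.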
🗺️ Roadmap — v2 (if there's interest! 💚)
This is v1. If the likes / downloads add up, I'll ship a v2 that:
- leans harder into the Fable 5 data as the primary signal,
- keeps a portion of Composer 2.5 real CoT for coverage,
- and pushes for the benchmarks 🏁.
⭐ Like & download if you'd like to see v2 — that's the signal I'm watching!
🐢 Upload status — sorry, and a heartfelt PSA 🙏
I'm very sorry the upload has been so slow — as of right now, not all files have finished uploading yet.
But please don't worry: I will get everything up. 💪
✅ Update: all files are up — every quant (Q2_K / Q4_K_M / Q6_K / Q8_0) is fully uploaded. Enjoy! 🎉
And a sincere plea while I'm at it: please, do NOT use any Verizon WiFi. I happen to be on their WiFi, and my uploads keep stalling. I've tried to fix it many, many times and it's still broken. So let me say it once more, loud and clear: stay away from Verizon WiFi. 📵 Thank you so much for your patience! 💚
📦 Pick your size (GGUF quants)
| Quant | Size | Vibe |
|---|---|---|
| 🟢 Q2_K | 4.5 GB | tiniest — runs almost anywhere |
| 🔵 Q4_K_M | 6.87 GB | the sweet spot 👌 (recommended) |
| 🟣 Q6_K | 9.11 GB | near-lossless |
| ⚪ Q8_0 | 11.8 GB | basically full quality |
🧮 "Will it fit?" — context length cheat-sheet
Rough estimates 🤓 (assumes q8_0 KV cache + ~1.5 GB overhead; use q4_0 KV cache for ≈2× more context!).
Max context is 131K. "—" = won't fit, pick a smaller quant. ✂️
| Your VRAM / unified mem | 🟢 Q2_K (4.5G) | 🔵 Q4_K_M (6.87G) | 🟣 Q6_K (9.11G) | ⚪ Q8_0 (11.8G) |
|---|---|---|---|---|
| 8 GB | ~16K ctx | tight (~2–4K) | — | — |
| 12 GB | ~48K | ~30K | ~12K | — |
| 16 GB | ~80K | ~64K | ~44K | ~22K |
| 24 GB | 131K (max) 🎉 | ~128K | ~110K | ~88K |
| 32 GB | 131K | 131K | 131K | 131K |
💡 Apple Silicon / integrated GPUs with unified memory count too: same numbers, just slower than a dGPU.

💡 Low on room? Drop a quant or switch KV cache to `q4_0` and your context roughly doubles.
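For memory sizes between the rows of the table, a back-of-the-envelope estimator might look like the sketch below. The quant sizes, the ~1.5 GB overhead, and the "≈2× with `q4_0` KV" rule come from this card; the figure of about 8,192 tokens per free GB is an assumption inferred from the table rows (it reproduces most of them), and "131K" is assumed to mean 131,072 tokens. Treat the output as a rough guide only.

```python
from typing import Optional

# File sizes from the quant table above, in GB.
QUANT_SIZES_GB = {"Q2_K": 4.5, "Q4_K_M": 6.87, "Q6_K": 9.11, "Q8_0": 11.8}
OVERHEAD_GB = 1.5            # overhead stated in the cheat-sheet
TOKENS_PER_GB_Q8_KV = 8192   # ASSUMPTION: inferred from the table, q8_0 KV cache
MAX_CONTEXT = 131072         # ASSUMPTION: "131K" read as 131,072 tokens


def estimate_context(memory_gb: float, quant: str, kv_q4: bool = False) -> int:
    """Estimate how many context tokens fit next to the given quant."""
    free_gb = memory_gb - QUANT_SIZES_GB[quant] - OVERHEAD_GB
    if free_gb <= 0:
        return 0  # the table marks these cells as "tight" or "won't fit"
    tokens = free_gb * TOKENS_PER_GB_Q8_KV * (2 if kv_q4 else 1)
    return min(int(tokens), MAX_CONTEXT)


def largest_quant_for(memory_gb: float, wanted_context: int) -> Optional[str]:
    """Pick the biggest quant that still leaves room for the wanted context."""
    for quant in sorted(QUANT_SIZES_GB, key=QUANT_SIZES_GB.get, reverse=True):
        if estimate_context(memory_gb, quant) >= wanted_context:
            return quant
    return None
```

For example, `estimate_context(12, "Q2_K")` lands on the table's ~48K cell, and `largest_quant_for(16, 32768)` suggests Q6_K.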
🚀 How to run it (super easy)
Option A — llama.cpp (recommended) 🦙
- Grab a quant above (e.g. `…-Q4_K_M.gguf`) and `llama-server` from llama.cpp.
  ⚠️ Needs a recent llama.cpp (this is the `gemma4_unified` architecture; older builds won't load it).
- Run a server (Windows `.bat` shown; tweak `--port`, `--ctx-size` to taste):
@echo off
cd /d C:\llama.cpp
llama-server.exe ^
-m C:\models\gemma4-coding-Q4_K_M.gguf ^
--ctx-size 16384 ^
--n-gpu-layers 99 ^
--no-mmap ^
-fa on ^
--cache-type-k q8_0 --cache-type-v q8_0 ^
--temp 1.0 --top-p 0.95 --top-k 64 ^
--host 0.0.0.0 --port 18080
pause
- Open `http://localhost:18080` and chat. 🎉 (Tip: bump `--ctx-size` per the table; use `q4_0` KV for more.)
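Besides the web UI, the server can also be called from a script. The sketch below assumes the server from the `.bat` above is running on port 18080 and uses llama-server's OpenAI-compatible `/v1/chat/completions` route; the helper names `build_chat_request` and `ask` are made up for this example, and sampling is left to the defaults set on the server command line.

```python
import json
import urllib.request

# Matches --port 18080 in the .bat above; adjust if you changed it.
BASE_URL = "http://localhost:18080/v1"


def build_chat_request(prompt: str, max_tokens: int = 1024) -> dict:
    """Build a minimal OpenAI-style chat payload (hypothetical helper)."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def ask(prompt: str, base_url: str = BASE_URL, timeout: float = 300.0) -> str:
    """Send one prompt to the local server and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = json.load(response)
    return data["choices"][0]["message"]["content"]


# Usage, with the server running:
#   print(ask("Write a Python function that reverses a linked list."))
```

Only the standard library is used, so nothing extra needs installing; any OpenAI-compatible client pointed at the same base URL should work as well.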
Option B — one-click apps 🖱️
Works in LM Studio, Jan, Ollama, etc. — just import the GGUF, pick your quant, go. 🐾
🧠 Thinking mode
This model thinks in Gemma's native thought channel before answering, exactly how it was trained. Keep `enable_thinking=true` (the default chat template handles it). Recommended sampling: temp 1.0, top_p 0.95, top_k 64.
For coding you can also go greedy (temp 0) for more deterministic solutions.
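If you would rather set sampling per request than on the server command line, the two modes above could be captured as small presets like this. It is a sketch under assumptions: the function names are hypothetical, and `top_k` is not part of the standard OpenAI schema, so whether your server honours it in the request body is something to verify.

```python
def sampling_params(greedy: bool = False) -> dict:
    """Return request fields for the recommended or the greedy preset."""
    if greedy:
        # Greedy decoding: more deterministic solutions for coding tasks.
        return {"temperature": 0.0}
    # Recommended sampling from this card.
    # ASSUMPTION: the server accepts a non-standard "top_k" request field.
    return {"temperature": 1.0, "top_p": 0.95, "top_k": 64}


def with_sampling(payload: dict, greedy: bool = False) -> dict:
    """Return a copy of a chat payload with the chosen preset merged in."""
    merged = dict(payload)
    merged.update(sampling_params(greedy))
    return merged
```

A payload such as `{"messages": [...]}` passed through `with_sampling(payload, greedy=True)` gains `temperature: 0.0` and is otherwise unchanged.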
⚠️ Good to know
- Reduced refusals: the training data is task-focused with no safety hedging, so this refuses less than the base model. It is not safety-aligned — add your own guardrails for production. Use responsibly. 🙏
- Specialized for Python / algorithmic coding. Reasoning quality is strongest in that domain; general-knowledge facts/numbers should still be double-checked.
- English-centric.
📚 Base & License
- Base model: `google/gemma-4-12B-it`. Subject to the Gemma Terms of Use (derivatives must comply).
- Personal/hobby project, shared as-is, no warranty. Have fun, and happy hacking! 🐾✨