Instructions to use dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix",
	filename="gemma-4-12B-it-qat-ja-IQ4_NL.gguf",
)

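response = llm.create_chat_completion(
	# messages must be a list of role/content dicts, not a bare string.
	messages=[
		{"role": "user", "content": "日本の首都はどこですか？"},
	]
)
print(response["choices"][0]["message"]["content"])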

Local Apps

llama.cpp

How to use dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix:UD-Q4_K_XL
# Run inference directly in the terminal:
llama-cli -hf dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix:UD-Q4_K_XL

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix:UD-Q4_K_XL
# Run inference directly in the terminal:
llama-cli -hf dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix:UD-Q4_K_XL

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix:UD-Q4_K_XL
# Run inference directly in the terminal:
./llama-cli -hf dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix:UD-Q4_K_XL

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix:UD-Q4_K_XL
# Run inference directly in the terminal:
./build/bin/llama-cli -hf dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix:UD-Q4_K_XL

Use Docker

docker model run hf.co/dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix:UD-Q4_K_XL

Ollama
How to use dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix with Ollama:
```
ollama run hf.co/dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix:UD-Q4_K_XL
```

Unsloth Studio

How to use dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix to start chatting

How to use dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix:UD-Q4_K_XL

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix:UD-Q4_K_XL"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix:UD-Q4_K_XL

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix:UD-Q4_K_XL

Run Hermes

hermes

Docker Model Runner
How to use dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix with Docker Model Runner:
```
docker model run hf.co/dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix:UD-Q4_K_XL
```

Lemonade

How to use dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix:UD-Q4_K_XL

Run and chat with the model

lemonade run user.gemma-4-12B-it-qat-UD-japanese-imatrix-UD-Q4_K_XL

List all available models

lemonade list

gemma-4-12B-it-qat-UD-japanese-imatrix developed by dahara1@webbigdata

google/gemma-4-12B-it-qat-q4_0-unquantized を日本語能力保持を念頭に量子化してサイズを圧縮したモデル
This model is a quantized, size-reduced version of google/gemma-4-12B-it-qat-q4_0-unquantized, built with the aim of preserving Japanese language proficiency.

💼 導入をご検討の企業様へ / For Enterprise Decision Makers

データは社外に出ません: 本モデルは自社サーバー・PC上で完結して動作します。クラウドへの送信は一切不要です。
商用利用可能: Apache 2.0 ライセンス（Google Gemma 4 と同一）。法務確認が容易です。
開発元によるサポート: 動作検証済み構成のご提供、本番導入支援、年間サポート契約を有償でご用意しています。→ お問い合わせ
Your data stays on-premise: This model runs entirely on your own servers or PCs. No cloud transmission required.
Commercial use permitted: Apache 2.0 license (same as Google Gemma 4).
Developer support available: Verified configurations, production deployment assistance, and annual support contracts. → Contact us

特徴 / Features

一言で言えば沢山の細かい改善をしてサイズを1/4に圧縮しつつ日本語能力を強化した強力なモデルです。CPUのみでも動かす事ができます。
In short, it's a powerful model that has undergone numerous minor improvements, compressing its size to one-quarter while enhancing its Japanese language capabilities. It can even run on a CPU alone.

このモデルの特徴

日本語性能を重点的に保持するように独自の動的量子化をしています
ベンチマークをしっかりと行って堅牢性・優位性を確かめています

Features of this gguf

We apply our own dynamic quantization, designed to prioritize preserving Japanese-language performance.
We run thorough benchmarks to confirm its robustness and advantages.

クイックスタート

起動サンプル / Startup example

GPUがなくても動きますが、推奨サイズであるgemma-4-12B-it-qat-ja-UD-Q4_K_XL.ggufではシステムメモリは12GB以上、ディスク容量が7GB以上必要です。
It will run without a GPU, but the recommended model, gemma-4-12B-it-qat-ja-UD-Q4_K_XL.gguf, requires at least 12GB of system memory and at least 7GB of disk space.
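
The memory and disk figures above can be checked before downloading. The following is a minimal sketch, assuming "GB" means 10^9 bytes and that total RAM is read through `os.sysconf` (POSIX only, so the helper returns `None` on Windows). The function names are hypothetical and are not part of llama.cpp.

```python
import os
import shutil

# Requirements quoted in this card for gemma-4-12B-it-qat-ja-UD-Q4_K_XL.gguf.
REQUIRED_RAM_GB = 12
REQUIRED_DISK_GB = 7


def total_ram_gb():
    """Return total system RAM in GB, or None where os.sysconf is unavailable."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9
    except (AttributeError, ValueError, OSError):
        return None


def free_disk_gb(path="."):
    """Return free disk space in GB on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


def meets_requirements(ram_gb, disk_gb,
                       need_ram=REQUIRED_RAM_GB, need_disk=REQUIRED_DISK_GB):
    """True when both the RAM and the disk figures reach the stated minimums."""
    return ram_gb >= need_ram and disk_gb >= need_disk


ram = total_ram_gb()
disk = free_disk_gb(".")
if ram is None:
    print(f"RAM: unknown on this platform, free disk: {disk:.1f} GB")
else:
    print(f"RAM: {ram:.1f} GB, free disk: {disk:.1f} GB, "
          f"enough: {meets_requirements(ram, disk)}")
```

If the check fails on RAM, a smaller quantization is the usual fallback.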

llama.cppを使います。直近でGemma 4対応のアップデートがいくつかありました。常に最新版を使う事をおすすめします。(本件の動作確認はversion: 9556 (19bba67c1) で行っています)
llama.cpp以外のツールでも動く可能性はありますが、他のツールは製作者が意図していない設定で誤動作をする場合があるので留意してください

We use llama.cpp. Several updates for Gemma 4 support have landed recently, so we recommend always using the latest version. (This setup was verified with version: 9556 (19bba67c1).)
Other tools may also work, but please note that they can malfunction when they apply settings the author did not intend.

llama.cppからお使いのハードウェア用のZIPファイルをダウンロードして設定します。
沢山種類があるので迷うかもしれませんが、ChatGPTなりGeminiなりClaudeなりに聞いて適切なものを選んでください
Download the ZIP file for your hardware from llama.cpp and set it up.
There are many variants, which can be confusing; ask ChatGPT, Gemini, or Claude to help you choose the right one.

ダウンロードしたzipを解凍し、Macターミナル、Windows CMD(PowerShell)、Linux端末から以下のコマンドを打ち込んで起動します
After unzipping the downloaded ZIP file, open Mac Terminal, Windows CMD (PowerShell), or a Linux terminal and run the following command.

./llama-cli \
  -hf dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix:gemma-4-12B-it-qat-ja-UD-Q4_K_XL \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64 \
  --min-p 0.0 \
  --ctx-size 32000 \
  -hfd dahara1/gemma-4-12B-it-qat-assistant:gemma-4-423M-12b-it-qat-unquantized-assistant-Q4_0 \
  --spec-type draft-mtp \
  --reasoning on

ctx-sizeが扱える文章の長さです。長くすると複数ターンの長い会話も扱えるようになりますが、必要メモリ量も増えます。
ctx-size sets the length of text (in tokens) that the model can handle. Increasing it supports longer multi-turn conversations, but also increases the memory required.
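
To illustrate why memory grows with ctx-size, here is a rough sketch of the textbook KV-cache size formula for a plain full-attention transformer. The layer count, KV-head count, and head dimension below are hypothetical placeholders, not Gemma 4's real values. Sliding-window layers or a quantized KV cache would lower the actual figure, so treat this only as a picture of the linear relationship.

```python
def kv_cache_bytes(ctx_size, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV-cache size for full attention.

    The leading 2 counts the key and value tensors; bytes_per_value=2
    assumes 16-bit cache entries.
    """
    return 2 * n_layers * n_kv_heads * head_dim * ctx_size * bytes_per_value


# Hypothetical architecture numbers, chosen only for illustration.
EXAMPLE = dict(n_layers=48, n_kv_heads=8, head_dim=128)

for ctx in (4096, 32000):
    gib = kv_cache_bytes(ctx, **EXAMPLE) / 2**30
    print(f"ctx-size {ctx:>6}: about {gib:.2f} GiB of KV cache")
```

Under these assumptions the cache scales linearly, so halving ctx-size halves this part of the memory footprint.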

メモリが足りない場合はgemma-4-12B-it-qat-ja-UD-Q3_K_XLなどのよりサイズが小さい版を使用してください
If you don't have enough memory, please use a smaller version such as gemma-4-12B-it-qat-ja-UD-Q3_K_XL.

LinuxでNvidiaのGPUをお使いの場合は、後述の手順を参考に自分でコンパイルする事を推奨します。
If you are using an Nvidia GPU with Linux, we recommend that you compile it yourself by following the instructions below.

🏆 GENIAC PRIZE 2026 参加チームの皆様へ / For GENIAC PRIZE 2026 Teams

本モデルは最高賞金１億円の懸賞金活用型コンペGENIAC PRIZEの国産基盤モデルに登録申請中です（2026年6月末に登録可否が決定予定。決定次第、本欄を更新します）
GENIAC 2026では国産基盤モデルを使った実証が必須となっています。本モデルはスペックが低いハードウェアでも動き(遅くなるがCPUのみも可)、且つ画像認識/音声認識/AIエージェント動作、高い汎用的能力、と実証時に利用しやすい条件がそろっています。(更にGENIAC 2025では国産基盤モデルを採用した提案は加点されるルールがありました)
This model has been submitted for registration as a domestically developed foundation model for the GENIAC PRIZE (https://geniac-prize.nedo.go.jp/#outreachEvent), a prize-based competition with a top award of 100 million yen. (A decision on registration is expected at the end of June 2026; this section will be updated once it is made.)

GENIAC 2026 requires demonstrations that use a domestically developed foundation model. This model runs on low-spec hardware (CPU-only is possible, though slower) and offers image recognition, speech recognition, AI-agent operation, and strong general-purpose capability, which makes it easy to use in demonstrations.
(In addition, GENIAC 2025 had a rule that awarded extra points to proposals that adopted a domestically developed foundation model.)

ご利用時のクレジット表記 / How to Credit

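応募書類・成果物に記載時は以下の表記をご利用ください（コピペ可）:
When citing the model in application documents or deliverables, please use the following wording (copy and paste is fine):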

本提案は dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix
（開発: dahara1@webbigdata / GIPU Limited, https://huggingface.co/dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix）を使用しています。

@misc{gemma4_japanese_imatrix_2026,
  author = {dahara1@webbigdata},
  title  = {gemma-4-12B-it-qat-UD-japanese-imatrix: Japanese-optimized GGUF quantization of Gemma 4 12B},
  year   = {2026},
  url    = {https://huggingface.co/dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix}
}

📣 ご利用の際は一報ください（任意・無償特典あり） / Let Us Know You're Using It (Optional)

UD-japanese-imatrixシリーズは累計ダウンロードが15万件を超える人気シリーズですが、ユーザ実体が我々の方で把握できていないために、対外的に実績PRをしても常に過小評価される傾向があります。
そのため、是非とも以下の登録をご一考ください (無償特典あり)

特に本モデルをコンペで採用される場合、こちらのフォームまたは HF Community で一言お知らせいただけると、コンペ期間中、以下を無償で提供します。:

お使いのハードウェアに合わせた推奨推論設定のアドバイス
既知の問題・回避策の優先共有
簡単な技術Q&A（ベストエフォート）

The UD-japanese-imatrix series is popular, with over 150,000 cumulative downloads, but because we cannot see who the actual users are, our track record tends to be undervalued whenever we present it publicly.
We would therefore appreciate it if you would consider registering as described below (free benefits included).

If you adopt this model for the competition, let us know via this form or the HF Community. Registered teams receive the following free of charge, on a best-effort basis, during the competition period: advice on recommended inference settings for your hardware, priority sharing of known issues and workarounds, and basic technical Q&A.

🤝 開発支援が必要なチームへ / Need Implementation Support?

GENIAC PRIZE応募テーマの本番実装、推論の高速化、業務データへの適応（カスタム量子化・チューニング）が必要な場合は、有償の開発支援を承ります。
ユーザー企業様からのご相談はもちろん、チームに参加されている開発企業様からの技術委託のご相談も歓迎します（裏方としての支援も可能です）。→ GIPUフォーム.

We provide paid development support: production implementation, inference optimization, and adaptation to your domain data.
Inquiries from both user companies and their development partners are welcome. → Google Form.

ベンチマーク結果 / Benchmark results

全てRTX 4060 Ti (16GB)で実行したベンチマーク結果です。
All benchmark results were obtained on an RTX 4060 Ti (16GB).

shisa-ai/M-IFEval

shisa-ai/M-IFEval を使って計測した日本語における指示追従性能は以下です。
Japanese instruction-following performance, measured with shisa-ai/M-IFEval, is shown below.

Unslothは量子化モデルで世界的に有名であるため、今回、彼らのモデルに挑戦しました。
英語をメインに使用する場合はUnslothのモデルの方が性能が高いと思われるので留意してください。

Unsloth is world-renowned for its quantized models, so this time we benchmarked against their model.
Please note that their models are likely to perform better if you primarily use English.

Model Name	Strict Prompt	Strict Inst	Loose Prompt	Loose Inst	File Size
google/gemma-4-12b-it-qat-q4_0.gguf	0.8314	0.8673	0.8488	0.8805	6.5GB
unsloth/gemma-4-12B-it-qat-UD-Q4_K_XL.gguf	0.8314	0.8717	0.8605	0.8938	6.3GB
dahara1/gemma-4-12B-it-qat-ja-UD-Q4_K_XL.gguf	0.8372	0.8717	0.8779	0.9070	6.9GB
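
To make the gaps in the table easier to read, the sketch below subtracts each baseline's scores from this model's scores. The numbers are copied from the table above; the dictionary layout and the `score_deltas` helper are illustrative assumptions and are not part of M-IFEval.

```python
METRICS = ("Strict Prompt", "Strict Inst", "Loose Prompt", "Loose Inst")

# Scores copied from the M-IFEval table in this card.
SCORES = {
    "google/gemma-4-12b-it-qat-q4_0.gguf": (0.8314, 0.8673, 0.8488, 0.8805),
    "unsloth/gemma-4-12B-it-qat-UD-Q4_K_XL.gguf": (0.8314, 0.8717, 0.8605, 0.8938),
    "dahara1/gemma-4-12B-it-qat-ja-UD-Q4_K_XL.gguf": (0.8372, 0.8717, 0.8779, 0.9070),
}


def score_deltas(model, baseline):
    """Per-metric difference (model minus baseline), rounded to 4 decimals."""
    return {
        metric: round(ours - base, 4)
        for metric, ours, base in zip(METRICS, SCORES[model], SCORES[baseline])
    }


OURS = "dahara1/gemma-4-12B-it-qat-ja-UD-Q4_K_XL.gguf"
for baseline in SCORES:
    if baseline != OURS:
        print(baseline, score_deltas(OURS, baseline))
```

The largest differences appear in the two Loose columns, and Strict Inst is tied with the Unsloth build.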

Tau2-bench

tau2-benchはLLMのエージェント動作能力(問い合わせ対応など)を計測するベンチマークです。
tau2-bench is a benchmark that measures the agentic capabilities of LLMs (such as handling customer inquiries).
ランダムに抽出した20タスクのみ。本タスクは英語実行であり、全パターンは網羅できていませんが、英語によるエージェント能力が落ちていない事が実証できています
Only 20 randomly sampled tasks were run. The tasks were run in English and do not cover every pattern, but the results show that English-language agent capability has not degraded.

Telecom

-	dahara1 version	Google official qat	result
Average Reward (Pass^1)	45.0% (9/20)	45.0% (9/20)	even
DB Match（データベース変更成功率）	40.0% (✓ 6 / ✗ 9)	31.2% (✓ 5 / ✗ 11)	dahara1
Normal Stop（正常終了数）	15タスク	16タスク	google
Max Steps（無限ループによる脱落）	5タスク	4タスク	google

Airline

-	dahara1 version	Google official qat	result
Average Reward (Pass^1)	55.0% (11/20)	45.0% (9/20)	dahara1
📖 Read Actions（情報取得成功率）	94.1% (32/34)	73.5% (25/34)	dahara1
✏️ Write Actions（書き込み成功率）	13.0% (3/23)	13.0% (3/23)	even
🗄️ DB Match（データベース変更成功率）	55.0% (✓ 11 / ✗ 9)	45.0% (✓ 9 / ✗ 11)	dahara1
🛑 Normal Stop（正常終了数）	20タスク (👤 20 / 🤖 0)	20タスク (👤 20 / 🤖 0)	even
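
The percentages in the two tau2-bench tables follow directly from the success and failure counts shown next to them. The sketch below recomputes them as a sanity check. The `pct` helper is hypothetical, and the counts are copied from the tables above.

```python
def pct(success, total):
    """Format success/total as a percentage with one decimal place."""
    return f"{100 * success / total:.1f}%"


# Telecom, DB Match: the denominator is successes plus failures.
print("dahara1 DB Match:", pct(6, 6 + 9))        # 40.0%
print("google  DB Match:", pct(5, 5 + 11))       # 31.2%

# Airline, Average Reward and Read Actions.
print("dahara1 Pass^1:  ", pct(11, 20))          # 55.0%
print("google  Pass^1:  ", pct(9, 20))           # 45.0%
print("dahara1 Read:    ", pct(32, 34))          # 94.1%
print("google  Read:    ", pct(25, 34))          # 73.5%
print("Write (both):    ", pct(3, 23))           # 13.0%
```

With only 20 tasks per domain, a single task shifts Pass^1 by 5 points, so small gaps between the two columns should be read with that in mind.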

その他の動かし方 / Other Ways to Run

Windows(AMD CPU with iGPU) Sample

サーバー起動時のサンプル(ブラウザ/スクリプト利用可能) Server startup example (browser/script compatible)

AMD Ryzen 9 7940HS w/ Radeon 780M Graphics搭載マシン(システムメモリ32GBのミニPC。GPUには8GBを割り当て済み) のコマンド例。ドラフトモデル(MTP)による高速化も設定しています
メモリギリギリのため画像認識機能をOFF(--no-mmproj)にしています

Command examples for a machine equipped with an AMD Ryzen 9 7940HS with Radeon 780M Graphics (a mini PC with 32GB of system memory, 8GB already allocated to the GPU).
Speed optimization using the Draft Model (MTP) is also enabled.
Due to limited memory, image recognition functionality is turned OFF (--no-mmproj).

.\llama-server ^
  -hf dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix:gemma-4-12B-it-qat-ja-UD-Q3_K_XL ^
  --host 0.0.0.0 ^
  --port 8080 ^
  --temp 1.0 ^
  --top-p 0.95 ^
  --top-k 64 ^
  --min-p 0.0 ^
  --ctx-size 4096 ^
  --no-mmproj ^
  -hfd dahara1/gemma-4-12B-it-qat-assistant:gemma-4-423M-12b-it-qat-unquantized-assistant-Q4_0 ^
  --spec-type draft-mtp ^
  --reasoning on ^
  -ub 1024 ^
  -b 1024

サーバーが立ち上がったら、以下をブラウザで開いてください
http://127.0.0.1:8080/
Once the server is up and running, please open the following in your browser:
http://127.0.0.1:8080/
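
Loading the model can take a while, so a script may want to wait until the server is ready before sending requests. The following is a minimal sketch that polls llama-server's `/health` endpoint, assuming it answers 200 once the model is loaded and a non-200 status (or no connection) before that. The helper names, retry count, and delay are illustrative choices.

```python
import time
import urllib.error
import urllib.request


def fetch_status(url, timeout=2.0):
    """Return the HTTP status code for `url`, or None if nothing is listening."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (e.g. still loading).
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def wait_until_ready(base_url="http://127.0.0.1:8080", attempts=30, delay=1.0,
                     fetch=fetch_status, sleep=time.sleep):
    """Poll <base_url>/health until it returns 200; give up after `attempts` tries."""
    url = base_url.rstrip("/") + "/health"
    for _ in range(attempts):
        if fetch(url) == 200:
            return True
        sleep(delay)
    return False


# Example usage (requires a running llama-server):
#   if wait_until_ready():
#       print("Server is ready at http://127.0.0.1:8080/")
```

The `fetch` and `sleep` parameters exist only so the polling loop can be exercised without a live server.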

Nvidia GPU搭載Linux コンパイル/スクリプト事例 / Build and script examples for Linux with an Nvidia GPU

OpenAI API形式でスクリプト経由でアクセスする事もできます。
You can also access it from a script via the OpenAI-compatible API.

コンパイル / Build

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# CUDA build. For CPU-only, remove -DGGML_CUDA=ON.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j --target llama-server

Script Sample

Text-only simple server example:

./build/bin/llama-server \
  -hf dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix \
  -hff gemma-4-12B-it-qat-ja-UD-Q4_K_XL.gguf \
  --alias gemma-4-12b-it-ja \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 32768 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64 \
  --min-p 0.0 \
  --reasoning off \
  --no-mmproj

client sample (Thinking off)

#!/usr/bin/env python3
import os
import sys
from openai import OpenAI

BASE_URL = os.getenv("OPENAI_BASE_URL", "http://127.0.0.1:8080/v1")
API_KEY = os.getenv("OPENAI_API_KEY", "sk-no-key-required")
MODEL = os.getenv("OPENAI_MODEL", "gemma-4-12b-it-ja")

prompt = " ".join(sys.argv[1:]) or "魔法少女まどかマギカで一番可愛いのは誰ですか？"

client = OpenAI(
    base_url=BASE_URL,
    api_key=API_KEY,
)

completion = client.chat.completions.create(
    model=MODEL,
    messages=[
        {
            "role": "system",
            "content": "あなたは日本語で簡潔かつ正確に回答するアシスタントです。",
        },
        {
            "role": "user",
            "content": prompt,
        },
    ],
    temperature=1.0,
    top_p=0.95,
    max_tokens=1024,

    # llama-server specific options can be passed via extra_body.
    # If you want to explicitly disable thinking in supported templates,
    # set enable_thinking to False.
    extra_body={
        "top_k": 64,
        "min_p": 0.0,
        "chat_template_kwargs": {
            "enable_thinking": False
        },
    },
)

print(completion.choices[0].message.content)

text only server with mtp(faster)

./build/bin/llama-server \
  -hf dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix \
  -hff gemma-4-12B-it-qat-ja-UD-Q4_K_XL.gguf \
  --alias gemma-4-12b-it-ja \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 32768 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64 \
  --min-p 0.0 \
  --reasoning on \
  --no-mmproj \
  -hfd dahara1/gemma-4-12B-it-qat-assistant:gemma-4-423M-12b-it-qat-unquantized-assistant-Q4_0 \
  --spec-type draft-mtp

client sample 2 (streaming, Thinking on)

#!/usr/bin/env python3
import os
import sys
import time
from openai import OpenAI

BASE_URL = os.getenv("OPENAI_BASE_URL", "http://127.0.0.1:8080/v1")
API_KEY = os.getenv("OPENAI_API_KEY", "sk-no-key-required")
MODEL = os.getenv("OPENAI_MODEL", "gemma-4-12b-it-ja")

# 0.0 prints text at the speed it arrives from the server.
# A value such as 0.005 gives a typewriter effect.
TYPE_DELAY = float(os.getenv("TYPE_DELAY", "0.0"))

RESET = "\033[0m"
THINK_COLOR = "\033[90m"      # dark gray
ANSWER_COLOR = "\033[97m"     # bright white
LABEL_COLOR = "\033[36m"      # cyan


def get_delta_field(delta, name: str):
    value = getattr(delta, name, None)
    if value is not None:
        return value

    model_extra = getattr(delta, "model_extra", None)
    if isinstance(model_extra, dict) and name in model_extra:
        return model_extra[name]

    try:
        dumped = delta.model_dump()
        return dumped.get(name)
    except Exception:
        return None


def print_chars(text: str, color: str):
    if not text:
        return

    print(color, end="", flush=True)
    for ch in text:
        print(ch, end="", flush=True)
        if TYPE_DELAY > 0:
            time.sleep(TYPE_DELAY)
    print(RESET, end="", flush=True)


def main():
    prompt = " ".join(sys.argv[1:]) or "世の中の不条理をぶっ壊せ！"

    client = OpenAI(
        base_url=BASE_URL,
        api_key=API_KEY,
    )

    stream = client.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "system",
                "content": "あなたは日本語で簡潔に回答するアシスタントです。",
            },
            {
                "role": "user",
                "content": prompt,
            },
        ],
        stream=True,
        temperature=1.0,
        top_p=0.95,
        max_tokens=2048,
        extra_body={
            "top_k": 64,
            "min_p": 0.0,

            # Equivalent to Gemma 4's apply_chat_template(..., enable_thinking=True).
            # llama-server passes this value to the Jinja chat template.
            "chat_template_kwargs": {
                "enable_thinking": True
            },
        },
    )

    phase = None
    thinking_parts = []
    answer_parts = []

    for chunk in stream:
        if not chunk.choices:
            continue

        delta = chunk.choices[0].delta

        reasoning = get_delta_field(delta, "reasoning_content")
        content = get_delta_field(delta, "content")

        if reasoning:
            if phase != "thinking":
                phase = "thinking"
                print_chars("\n[thinking]\n", LABEL_COLOR)
            thinking_parts.append(reasoning)
            print_chars(reasoning, THINK_COLOR)

        if content:
            if phase != "answer":
                phase = "answer"
                print_chars("\n\n[answer]\n", LABEL_COLOR)
            answer_parts.append(content)
            print_chars(content, ANSWER_COLOR)

    print("\n", flush=True)


if __name__ == "__main__":
    main()

画像認識サンプルスクリプト/Image recognition sample script

本モデルは画像認識、音声認識も実行可能です。
Gemma 4公式カードでは multimodal input の順序として、画像はテキストより前、音声はテキストより後が推奨されています。
This model can also perform image recognition and speech recognition.
The official Gemma 4 card recommends the following order for multimodal input: images before text, and audio after text.

画像認識時は３つのパラメーターが重要になります
Three parameters are important when performing image recognition.

画像の倍率的なパラメーター --image-min-tokens / --image-max-tokens
Gemma 4公式カードでは、画像1枚あたりの visual token budget として 70 / 140 / 280 / 560 / 1120 が提示されています。
OCRや小さい文字読み取りは高め、つまり 1120 推奨です。
Image scaling parameters --image-min-tokens / --image-max-tokens
The official Gemma 4 cards suggest a visual token budget of 70 / 140 / 280 / 560 / 1120 per image.
For OCR and small text recognition, a higher value, i.e., 1120, is recommended.

-ub / --ubatch-size が必要
llama-server の -ub, --ubatch-size は physical maximum batch size で、現行デフォルトは 512 です。
OCR向けに --image-max-tokens 1120 を指定すると、デフォルトの -ub 512 を超えるので、-ub 2048 くらいに上げるのが安全です。

-ub / --ubatch-size is required
In llama-server, -ub / --ubatch-size sets the physical maximum batch size; the current default is 512.
Specifying --image-max-tokens 1120 for OCR exceeds the default -ub 512, so raising it to around -ub 2048 is safer.
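
The relationship between the visual token budget and -ub can be sketched as a small helper. The rule used here (start from the default 512 and double until the value is at least --image-max-tokens) is an assumption chosen because it reproduces the pairings mentioned in this card (560 with -ub 1024, 1120 with around -ub 2048). It is not an official llama.cpp rule, and `recommend_ubatch` is a hypothetical name.

```python
# Visual token budgets listed in the Gemma 4 card, as quoted above.
TOKEN_BUDGETS = (70, 140, 280, 560, 1120)


def recommend_ubatch(image_max_tokens, default=512):
    """Smallest value reached by doubling `default` that covers image_max_tokens."""
    ubatch = default
    while ubatch < image_max_tokens:
        ubatch *= 2
    return ubatch


for budget in TOKEN_BUDGETS:
    print(f"--image-max-tokens {budget:>4}  ->  -ub {recommend_ubatch(budget)}")
```

Budgets up to 280 fit inside the default, so -ub only needs changing for the 560 and 1120 settings.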

Start llama-server for general image understanding

../llama.cpp/build/bin/llama-server \
  -hf dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix \
  -hff gemma-4-12B-it-qat-ja-UD-Q4_K_XL.gguf \
  --alias gemma-4-12b-it-ja \
  --host 127.0.0.1 \
  --port 8080 \
  --ctx-size 32768 \
  --temp 1.0 \
  --top-p 0.95 \
  --top-k 64 \
  --min-p 0.0 \
  --reasoning off \
  --image-min-tokens 560 \
  --image-max-tokens 560 \
  -ub 1024

client sample

#!/usr/bin/env python3
import base64
import mimetypes
import sys
from pathlib import Path

from openai import OpenAI


BASE_URL = "http://127.0.0.1:8080/v1"
API_KEY = "sk-no-key-required"
MODEL = "gemma-4-12b-it-ja"


def image_to_data_url(path: str) -> str:
    image_path = Path(path)
    if not image_path.exists():
        raise FileNotFoundError(f"Image file not found: {image_path}")

    mime_type, _ = mimetypes.guess_type(image_path)
    if mime_type is None:
        mime_type = "image/png"

    encoded = base64.b64encode(image_path.read_bytes()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def main() -> None:
    if len(sys.argv) < 2:
        print("Usage: python image_chat_llama_server.py <image_path> [prompt]")
        sys.exit(1)

    image_path = sys.argv[1]
    prompt = " ".join(sys.argv[2:]) or "この画像の内容を日本語で詳しく説明してください。"

    client = OpenAI(
        base_url=BASE_URL,
        api_key=API_KEY,
    )

    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "user",
                "content": [
                    # Gemma 4 works best when image content comes before text.
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": image_to_data_url(image_path),
                        },
                    },
                    {
                        "type": "text",
                        "text": prompt,
                    },
                ],
            }
        ],
        temperature=1.0,
        top_p=0.95,
        max_tokens=1024,
        extra_body={
            "top_k": 64,
            "min_p": 0.0,
            "chat_template_kwargs": {
                "enable_thinking": False
            },
        },
    )

    print(response.choices[0].message.content)


if __name__ == "__main__":
    main()

音声認識 / 自動音声翻訳サンプル / Automatic Speech Recognition (ASR) / Automatic Speech Translation (AST) Sample

Gemma 4公式カードでは multimodal input の順序として、画像はテキストより前、音声はテキストより後が推奨されています。
The official Gemma 4 card recommends the following order for multimodal input: images before text, and audio after text.

音声認識は推奨プロンプトが以下のように定義されています
The recommended prompts for speech recognition are defined as follows:

音声認識（ASR）

Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.

Follow these specific instructions for formatting the answer:
*   Only output the transcription, with no newlines.
*   When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three.

自動音声翻訳（AST）

Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate it into {TARGET_LANGUAGE}.
When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, then one newline, then output the string '{TARGET_LANGUAGE}: ', then the translation in {TARGET_LANGUAGE}.
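
The two templates above can be filled in programmatically. This is a minimal sketch using `str.format` with the same placeholder names as the templates. The helper names `build_asr_prompt` and `build_ast_prompt` are hypothetical.

```python
ASR_TEMPLATE = (
    "Transcribe the following speech segment in {LANGUAGE} into {LANGUAGE} text.\n"
    "\n"
    "Follow these specific instructions for formatting the answer:\n"
    "*   Only output the transcription, with no newlines.\n"
    "*   When transcribing numbers, write the digits, i.e. write 1.7 and not "
    "one point seven, and write 3 instead of three."
)

AST_TEMPLATE = (
    "Transcribe the following speech segment in {SOURCE_LANGUAGE}, then translate "
    "it into {TARGET_LANGUAGE}.\n"
    "When formatting the answer, first output the transcription in {SOURCE_LANGUAGE}, "
    "then one newline, then output the string '{TARGET_LANGUAGE}: ', then the "
    "translation in {TARGET_LANGUAGE}."
)


def build_asr_prompt(language):
    """Fill the ASR template for a single language."""
    return ASR_TEMPLATE.format(LANGUAGE=language)


def build_ast_prompt(source_language, target_language):
    """Fill the AST template for a source/target language pair."""
    return AST_TEMPLATE.format(
        SOURCE_LANGUAGE=source_language, TARGET_LANGUAGE=target_language
    )


print(build_ast_prompt("Japanese", "English"))
```

The resulting string can be passed as the text part of the user message, placed before the audio part.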

また、対応音声の最大長は 30 秒とされています。
Additionally, the maximum length of the supported audio is 30 seconds.
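
Because of the 30-second limit, it can help to check a clip's duration before sending it. The sketch below handles WAV files only, using the standard-library `wave` module, and computes the time ranges a longer recording would need to be split into. The helper names are hypothetical, and the actual splitting of audio data is left out.

```python
import wave

MAX_AUDIO_SECONDS = 30.0  # limit stated in this card


def wav_duration_seconds(path):
    """Duration of a WAV file in seconds (frame count divided by sample rate)."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def chunk_bounds(duration, max_seconds=MAX_AUDIO_SECONDS):
    """Return (start, end) offsets in seconds, each at most max_seconds long."""
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + max_seconds, duration)
        bounds.append((start, end))
        start = end
    return bounds


# A 65-second recording would need three requests.
print(chunk_bounds(65.0))  # [(0.0, 30.0), (30.0, 60.0), (60.0, 65.0)]
```

Cutting at fixed offsets can split a word in half, so cutting at pauses is likely to give better transcripts.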

client sample

#!/usr/bin/env python3
import base64
import sys
from pathlib import Path

from openai import OpenAI


BASE_URL = "http://127.0.0.1:8080/v1"
API_KEY = "sk-no-key-required"
MODEL = "gemma-4-12b-it-ja"


DEFAULT_ASR_PROMPT = """Transcribe the following speech segment in Japanese into Japanese text.

Follow these specific instructions for formatting the answer:
* Only output the transcription, with no newlines.
* When transcribing numbers, write the digits, i.e. write 1.7 and not one point seven, and write 3 instead of three."""


def audio_to_base64(path: str) -> tuple[str, str]:
    audio_path = Path(path)
    if not audio_path.exists():
        raise FileNotFoundError(f"Audio file not found: {audio_path}")

    suffix = audio_path.suffix.lower().lstrip(".")
    if suffix not in {"wav", "mp3"}:
        raise ValueError("Please use a .wav or .mp3 file.")

    encoded = base64.b64encode(audio_path.read_bytes()).decode("utf-8")
    return encoded, suffix


def main() -> None:
    if len(sys.argv) < 2:
        print("Usage: python audio_chat_llama_server.py <audio_path> [prompt]")
        sys.exit(1)

    audio_path = sys.argv[1]
    prompt = " ".join(sys.argv[2:]) or DEFAULT_ASR_PROMPT

    encoded_audio, audio_format = audio_to_base64(audio_path)

    client = OpenAI(
        base_url=BASE_URL,
        api_key=API_KEY,
    )

    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "user",
                "content": [
                    # Gemma 4 works best when audio content comes after text.
                    {
                        "type": "text",
                        "text": prompt,
                    },
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": encoded_audio,
                            "format": audio_format,
                        },
                    },
                ],
            }
        ],
        temperature=1.0,
        top_p=0.95,
        max_tokens=2048,
        extra_body={
            "top_k": 64,
            "min_p": 0.0,
            "chat_template_kwargs": {
                "enable_thinking": False
            },
        },
    )

    print(response.choices[0].message.content)


if __name__ == "__main__":
    main()

💼 商用利用とサポート / Commercial Use & Support

本モデルは Apache 2.0 ライセンスの範囲で、どなたでも無償で商用利用いただけます。本カード記載の情報・無償サポートはベストエフォートでの提供です。

企業導入向けには、以下を有償でご提供しています:

メニュー	内容
検証済み構成パッケージ	貴社ハードウェア構成での動作検証と推奨設定レポート
推論高速化チューニング	ドラフトモデル（MTP/speculative decoding）設定、バッチ・KVキャッシュ最適化など、環境に合わせた高速化
複数ユーザー・本番運用設計	同時アクセス対応のサーバー構成設計、スループット要件に応じたサイジング
カスタム imatrix キャリブレーション	貴社の業務文書・ドメインデータで校正した専用量子化版の作成。貴社の用途における精度を最大化します
年間サポート契約	ベースモデル/プロンプトテンプレート更新への追従、障害時対応、技術Q&A

→ お問い合わせ: GIPU Limited GIPUフォーム

This model is free for commercial use under Apache 2.0. Information and free support on this card are provided on a best-effort basis.
For enterprise deployment, we offer paid services: verified configuration packages for your hardware, inference speed optimization (draft model / speculative decoding, batching, KV-cache tuning), multi-user production server design, custom imatrix calibration using your domain data (a dedicated quantization tuned for your use case), and annual support contracts. → Contact: GIPU Limited

FAQ

(1)何故、元モデルにオリジナルのBF16版を使わずにQAT版を使ったのですか？
Why did you use the QAT version instead of the original BF16 version for the source model?

オリジナルのBF16版の量子化は現時点では繰り返し出力が発生する割合が高いです。
これは今後のパラメータチューニングやツール改善で修正される可能性がありますが、現時点では解決策が見つかっていません
Quantizing the original BF16 version currently results in a high rate of repetitive output.
This may be corrected with future parameter tuning or tool improvements, but a solution has not yet been found.

update info

2026/06/10 first release

謝辞 / Acknowledgments

Google
Unsloth
llama.cpp
Thank you to all AI researchers and practitioners.

作成者 / Developer

開発：dahara1@Webbigdata / Developed by dahara1@Webbigdata
お問い合わせ / For inquiries

Downloads last month: 11,347

GGUF
Model size: 12B params
Architecture: gemma4


Model tree for dahara1/gemma-4-12B-it-qat-UD-japanese-imatrix

Base model: google/gemma-4-12B
Finetuned: google/gemma-4-12B-it
Finetuned: google/gemma-4-12B-it-qat-q4_0-unquantized
Quantized (23): this model