How to use from the
Use from the
Transformers library
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="KDEGroup/UI-AGILE-3B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("KDEGroup/UI-AGILE-3B")
model = AutoModelForMultimodalLM.from_pretrained("KDEGroup/UI-AGILE-3B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
Quick Links

UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

🔥 Overview

UI-AGILE is a framework designed to enhance Graphical User Interface (GUI) agents at both training and inference stages. It addresses common challenges in Multimodal Large Language Models (MLLMs) such as reasoning designs, ineffective rewards, and visual noise.

Key Features

  • Training Enhancements:
    • Continuous Reward Function: Incentivizes high-precision grounding.
    • "Simple Thinking" Reward: Balances planning depth with execution speed and grounding accuracy.
    • Cropping-based Resampling: Mitigates the sparse reward problem and improves learning on complex tasks.
  • Inference Enhancements:
    • Decomposed Grounding with Selection: Dramatically improves grounding accuracy on high-resolution displays by breaking the image into smaller, manageable parts.

UI-AGILE-7B achieves state-of-the-art grounding performance on benchmarks like ScreenSpot-Pro and ScreenSpot-v2 while maintaining strong general agent capabilities.

⭐️ Citation

If you find this project useful, please cite:

@misc{lian2025uiagileadvancingguiagents,
      title={UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding}, 
      author={Shuquan Lian and Yuhang Wu and Jia Ma and Zihan Song and Bingqi Chen and Xiawu Zheng and Hui Li},
      year={2025},
      eprint={2507.22025},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2507.22025}, 
}
Downloads last month
5
Safetensors
Model size
4B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for KDEGroup/UI-AGILE-3B

Quantizations
2 models

Paper for KDEGroup/UI-AGILE-3B