Instructions for using Paranioar/NEO1_0-2B-PT with libraries and local apps.

Libraries

How to use Paranioar/NEO1_0-2B-PT with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Paranioar/NEO1_0-2B-PT", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("Paranioar/NEO1_0-2B-PT", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use Paranioar/NEO1_0-2B-PT with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Paranioar/NEO1_0-2B-PT"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Paranioar/NEO1_0-2B-PT",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
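
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `chat` are hypothetical (they are not part of vLLM or of this repository), and the response path `choices[0].message.content` assumes the server follows the OpenAI chat-completions response format.

```python
import json
import urllib.request


def build_chat_payload(model: str, prompt: str, image_url: str) -> dict:
    """Build an OpenAI-style chat payload with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(base_url: str, payload: dict, timeout: float = 60.0) -> str:
    """POST the payload to /v1/chat/completions and return the first reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumes an OpenAI-compatible response body.
    return body["choices"][0]["message"]["content"]
```

With the server from the commands above running, `chat("http://localhost:8000", build_chat_payload("Paranioar/NEO1_0-2B-PT", "Describe this image in one sentence.", image_url))` should return the model's reply as a string. Pointing `base_url` at another OpenAI-compatible endpoint only requires changing the host and port.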

Use Docker

docker model run hf.co/Paranioar/NEO1_0-2B-PT

SGLang

How to use Paranioar/NEO1_0-2B-PT with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Paranioar/NEO1_0-2B-PT" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Paranioar/NEO1_0-2B-PT",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Paranioar/NEO1_0-2B-PT" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use Paranioar/NEO1_0-2B-PT with Docker Model Runner:
```
docker model run hf.co/Paranioar/NEO1_0-2B-PT
```

From Pixels to Words -- Towards Native Vision-Language Primitives at Scale
| Paper | Code |

🌟🌟 Motivation

Two open questions still hold back the widespread exploration and promotion of native VLMs:

What fundamental constraints set native VLMs apart from modular ones, and to what extent can these barriers be overcome?
How can research in native VLMs be made more accessible and democratized, thereby accelerating progress in the field?

We construct native VLMs from first principles, whose primitives should:

effectively align pixel and word representations within a shared semantic space;
seamlessly integrate the strengths of separate vision and language modules;
inherently embody various cross-modal properties that support unified vision-language encoding, aligning, and reasoning.

🚀🚀 Highlight

With only 390M image-text examples, NEO develops strong visual perception from scratch inside a dense and monolithic model via elaborate primitives.
NEO serves as a cornerstone for scalable and powerful native VLMs, paired with reusable components that foster a cost-effective and extensible ecosystem.

🧑‍🎨🧑‍🎨 Model Overview

NEO1_0-2B has the following features:

Model Type: Native Vision-Language Models
Model Mode: Mixed Native-Attn & Native-RoPE
Layer Parameters: 56M per layer (vs. 50M in Qwen3-1.7B)
Model Parameters: 2.2B (Non-Embedding)
Number of Layers: 40 (12 for Pre-Buffer & 28 for Post-LLM)
Number of Heads: 16 for Q and 8 for KV (GQA)
Head Dimensions: 128 * 2 for QK and 128 for V
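
The layer and model parameter figures above can be roughly reproduced from the head configuration. The sketch below is a back-of-the-envelope check, not the authors' code. It assumes a hidden size of 2048 and an MLP intermediate size of 6144 (borrowed from the public Qwen3-1.7B configuration, not stated in this card), a three-matrix gated MLP, and no biases or normalization weights.

```python
def layer_params(hidden, intermediate, q_heads, kv_heads, qk_head_dim, v_head_dim):
    """Count linear-projection weights in one decoder layer (no biases, no norms)."""
    q_proj = hidden * q_heads * qk_head_dim
    k_proj = hidden * kv_heads * qk_head_dim
    v_proj = hidden * kv_heads * v_head_dim
    o_proj = q_heads * v_head_dim * hidden
    mlp = 3 * hidden * intermediate  # gate, up, and down projections
    return q_proj + k_proj + v_proj + o_proj + mlp


# ASSUMPTION: hidden and MLP sizes follow the public Qwen3-1.7B config.
HIDDEN, INTERMEDIATE = 2048, 6144

# NEO: 16 Q heads, 8 KV heads, QK head dim 128 * 2, V head dim 128.
neo = layer_params(HIDDEN, INTERMEDIATE, 16, 8, qk_head_dim=256, v_head_dim=128)
# Qwen3-1.7B baseline: same heads, QK and V head dim 128.
qwen = layer_params(HIDDEN, INTERMEDIATE, 16, 8, qk_head_dim=128, v_head_dim=128)

print(f"NEO layer:   {neo / 1e6:.1f}M")        # 56.6M
print(f"Qwen3 layer: {qwen / 1e6:.1f}M")       # 50.3M
print(f"NEO total:   {40 * neo / 1e9:.2f}B")   # 2.26B over 40 layers
```

Under these assumptions, the wider QK head dimension accounts for the gap between 56M and 50M per layer, and 40 such layers give about 2.26B non-embedding parameters, consistent with the 2.2B listed above.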

🔥🔥 Model Performance

📚📚 Model Weights

We release the 2B weights of NEO1_0 at three stages: Pre-Training (PT), Mid-Training (MT), and Supervised Fine-Tuning (SFT).

| Model name | Weight |
| --- | --- |
| NEO-2B-PT | 🤗 NEO-2B-PT HF link |
| NEO-2B-MT | 🤗 NEO-2B-MT HF link |
| NEO-2B-SFT | 🤗 NEO-2B-SFT HF link |

✒️✒️ Citation

If NEO is helpful for your research, please consider giving it a star ⭐ and citing it 📝:

@article{Diao2025NEO,
  title        = {From Pixels to Words--Towards Native Vision-Language Primitives at Scale},
  author       = {Diao, Haiwen and Li, Mingxuan and Wu, Silei and Dai, Linjun and Wang, Xiaohua and Deng, Hanming and Lu, Lewei and Lin, Dahua and Liu, Ziwei},
  journal      = {arXiv preprint arXiv:2510.14979},
  year         = {2025}
}

Safetensors

Model size: 3B params
Tensor type: BF16

Paper: From Pixels to Words -- Towards Native Vision-Language Primitives at Scale (arXiv:2510.14979, published Oct 16, 2025)
