VLM Batch Inference — 4-bit Quantization Deep Dive

Apr 04, 2026

Date: 4 April 2026
Model: Qwen3-VL-8B-Instruct
Instance: ml.g4dn.xlarge (NVIDIA T4, 16 GB VRAM)
Batch size: 4,942 images

📌 TL;DR

FP16 “fit” → but caused CPU offloading
CPU offloading → 100x slowdown per layer
4-bit NF4 → solved memory + performance
bitsandbytes → made it practical
Result → 13x faster, 12x cheaper, production viable

🚨 The Problem: When “Fits in VRAM” Still Fails

At first glance, everything looked fine.

8B parameter model
FP16 → ~16 GB
GPU → 16 GB VRAM

It should fit. It didn’t work.

We observed:

~8 minutes per image
GPU underutilized
CPU spikes
Warning:

Some parameters are on the meta device because they were offloaded to the cpu.

👉 The system was technically “working” — but practically unusable.

🔍 Root Cause: The Hidden Cost of “Almost Fitting”

The issue wasn’t model size — it was runtime memory dynamics:

What actually happens:

Weights (FP16) → ~16 GB
KV cache + activations → +2–4 GB
Available VRAM → 0 headroom
Accelerate kicks in → CPU offloading
Result → GPU waits on CPU → massive slowdown

💡 Key insight:

Even 10–15% of layers on CPU → entire pipeline becomes CPU-bound

That’s how you get:

GPU sitting idle
8-minute inference times

⚙️ The Fix: 4-bit Quantization

Instead of scaling hardware, we changed representation.

Quantization reduces precision:

PrecisionBitsMemoryFP16161x4-bit4~0.25x

👉 Result: ~4x reduction in VRAM

🧠 What is NF4 Quantization (and why it matters)

NF4 (Normal Float 4-bit) is not just “smaller numbers”.

It’s distribution-aware quantization.

Why this matters for transformers:

Neural network weights follow a normal (Gaussian-like) distribution:

Most values near 0
Few extreme values

NF4 optimizes for this:

Non-uniform buckets
Higher resolution near 0
Lower resolution at extremes

👉 Compared to naive FP4 / INT4:

Better information retention
Lower accuracy loss
Stable inference behavior

Intuition:

NF4 spends precision where the model actually “cares”.

🔧 bitsandbytes: The Enabler

The real hero here is bitsandbytes.

What it provides:

Drop-in quantization for Hugging Face models
4-bit (NF4, FP4) + 8-bit support
GPU-optimized kernels
Seamless integration with transformers + accelerate

Why it’s widely used:

No retraining required
Works at load time
Production-friendly
Proven across LLM + VLM workloads

💡 In practice:

bitsandbytes is what makes “run 8B models on small GPUs” actually feasible.

⚙️ Configuration

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

Key design decisions:

NF4 over FP4 → better accuracy
FP16 compute → T4 compatibility
Double quantization → extra memory savings

📊 Results: Before vs After

📈 Quality Impact

The natural question: what did we lose?

Observations:

Score variance: ~1–3%
Ranking preserved ✅
Distribution intact ✅
No collapse in quality bands ✅

👉 For curation pipelines:

Relative ranking > absolute precision

NF4 preserves what actually matters.

🧪 Real Output Behavior

Post-quantization:

Stable latency (~38s ± small variance)
Healthy score distribution
No long-tail slowdowns
Zero CPU fallback

This is what “production-ready” looks like.

🛠️ Engineering Journey (Condensed)

This wasn’t a one-step fix.

We hit:

Docker manifest incompatibility
IAM permission gaps
Decimal serialization bugs
Instance quota limits
Transformers version breakage
Torchvision dependency issues
Finally → FP16 memory wall

👉 The key lesson:

Infrastructure failures often mask core ML system design issues

💰 Cost Implications

🚀 What This Unlocks

With 4-bit quantization:

Run 8B VLMs on T4
Eliminate CPU bottlenecks
Maintain quality
Achieve production-scale throughput

🔮 Future Optimizations

A10G (24 GB) → FP16 viable
Flash Attention → 2–3x speedup
Batched inference → better utilization
GPTQ / AWQ → offline quantization
Horizontal scaling → linear throughput

🧠 Final Takeaway

This wasn’t just a performance fix.

It was a mindset shift:

❌ “Does the model fit in memory?”
✅ “Does the system run efficiently under load?”

And the answer came from representation, not hardware.

If you’re running mid-sized models on constrained GPUs — this is not an optimization.
It’s a requirement.

📚 Further Reading

QLoRA: Efficient Finetuning of Quantized LLMs — foundation of NF4 https://arxiv.org/abs/2305.14314
bitsandbytes documentation — practical 4-bit implementation https://github.com/bitsandbytes-foundation/bitsandbytes
Hugging Face Accelerate — device mapping and offloading https://huggingface.co/almog2441Org/Transformers
FlashAttention — memory-efficient attention https://github.com/dao-ailab/flash-attention
GPTQ / AWQ — next-gen quantization approaches (Furure versions) https://arxiv.org/abs/2210.17323 https://github.com/mit-han-lab/llm-awq

Jagadeesh Rampam

Discussion about this post

Ready for more?