VLM Batch Inference — 4-bit Quantization Deep Dive
🧠 From 8 Minutes to 38 Seconds: Fixing VLM Inference with 4-bit NF4 Quantization
Date: 4 April 2026
Model: Qwen3-VL-8B-Instruct
Instance: ml.g4dn.xlarge (NVIDIA T4, 16 GB VRAM)
Batch size: 4,942 images
📌 TL;DR
FP16 “fit” → but caused CPU offloading
CPU offloading → 100x slowdown per layer
4-bit NF4 → solved memory + performance
bitsandbytes → made it practical
Result → 13x faster, 12x cheaper, production viable
🚨 The Problem: When “Fits in VRAM” Still Fails
At first glance, everything looked fine.
8B parameter model
FP16 → ~16 GB
GPU → 16 GB VRAM
It should fit. It didn’t work.
We observed:
~8 minutes per image
GPU underutilized
CPU spikes
Warning:
Some parameters are on the meta device because they were offloaded to the cpu.👉 The system was technically “working” — but practically unusable.
🔍 Root Cause: The Hidden Cost of “Almost Fitting”
The issue wasn’t model size — it was runtime memory dynamics:
What actually happens:
Weights (FP16) → ~16 GB
KV cache + activations → +2–4 GB
Available VRAM → 0 headroom
Accelerate kicks in → CPU offloading
Result → GPU waits on CPU → massive slowdown
💡 Key insight:
Even 10–15% of layers on CPU → entire pipeline becomes CPU-bound
That’s how you get:
GPU sitting idle
8-minute inference times
⚙️ The Fix: 4-bit Quantization
Instead of scaling hardware, we changed representation.
Quantization reduces precision:
PrecisionBitsMemoryFP16161x4-bit4~0.25x
👉 Result: ~4x reduction in VRAM
🧠 What is NF4 Quantization (and why it matters)
NF4 (Normal Float 4-bit) is not just “smaller numbers”.
It’s distribution-aware quantization.
Why this matters for transformers:
Neural network weights follow a normal (Gaussian-like) distribution:
Most values near 0
Few extreme values
NF4 optimizes for this:
Non-uniform buckets
Higher resolution near 0
Lower resolution at extremes
👉 Compared to naive FP4 / INT4:
Better information retention
Lower accuracy loss
Stable inference behavior
Intuition:
NF4 spends precision where the model actually “cares”.
🔧 bitsandbytes: The Enabler
The real hero here is bitsandbytes.
What it provides:
Drop-in quantization for Hugging Face models
4-bit (NF4, FP4) + 8-bit support
GPU-optimized kernels
Seamless integration with
transformers+accelerate
Why it’s widely used:
No retraining required
Works at load time
Production-friendly
Proven across LLM + VLM workloads
💡 In practice:
bitsandbytesis what makes “run 8B models on small GPUs” actually feasible.
⚙️ Configuration
from transformers import BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
)Key design decisions:
NF4 over FP4 → better accuracy
FP16 compute → T4 compatibility
Double quantization → extra memory savings
📊 Results: Before vs After
📈 Quality Impact
The natural question: what did we lose?
Observations:
Score variance: ~1–3%
Ranking preserved ✅
Distribution intact ✅
No collapse in quality bands ✅
👉 For curation pipelines:
Relative ranking > absolute precision
NF4 preserves what actually matters.
🧪 Real Output Behavior
Post-quantization:
Stable latency (~38s ± small variance)
Healthy score distribution
No long-tail slowdowns
Zero CPU fallback
This is what “production-ready” looks like.
🛠️ Engineering Journey (Condensed)
This wasn’t a one-step fix.
We hit:
Docker manifest incompatibility
IAM permission gaps
Decimal serialization bugs
Instance quota limits
Transformers version breakage
Torchvision dependency issues
Finally → FP16 memory wall
👉 The key lesson:
Infrastructure failures often mask core ML system design issues
💰 Cost Implications
🚀 What This Unlocks
With 4-bit quantization:
Run 8B VLMs on T4
Eliminate CPU bottlenecks
Maintain quality
Achieve production-scale throughput
🔮 Future Optimizations
A10G (24 GB) → FP16 viable
Flash Attention → 2–3x speedup
Batched inference → better utilization
GPTQ / AWQ → offline quantization
Horizontal scaling → linear throughput
🧠 Final Takeaway
This wasn’t just a performance fix.
It was a mindset shift:
❌ “Does the model fit in memory?”
✅ “Does the system run efficiently under load?”
And the answer came from representation, not hardware.
If you’re running mid-sized models on constrained GPUs — this is not an optimization.
It’s a requirement.
📚 Further Reading
QLoRA: Efficient Finetuning of Quantized LLMs — foundation of NF4 https://arxiv.org/abs/2305.14314
bitsandbytes documentation — practical 4-bit implementation https://github.com/bitsandbytes-foundation/bitsandbytes
Hugging Face Accelerate — device mapping and offloading https://huggingface.co/almog2441Org/Transformers
FlashAttention — memory-efficient attention https://github.com/dao-ailab/flash-attention
GPTQ / AWQ — next-gen quantization approaches (Furure versions) https://arxiv.org/abs/2210.17323 https://github.com/mit-han-lab/llm-awq



