From 1/4 to 3/4: Re-architecting an ML Pipeline for Graviton
How I replaced fragile ML stages, eliminated x86 dependencies, and moved production workloads to ARM64
Graviton adoption is not a hardware decision. It is an ecosystem decision.
A few weeks ago, I wrote that I could only run 1 of my 4 ML Lambdas on AWS Graviton. At that point, the pipeline had four compute stages and only one—the deduplication stage—was cleanly portable to ARM64. The other three were blocked by native dependencies, ML scoring instability, or CUDA hard requirements.
Since then, I shipped a major re-architecture. I am now at 3 of 4 instances migrated to Graviton and an additional orchestration layer on ECS using Graviton
This is technical deep dive on what changed, what stayed stubborn, and how the cost model looks after the migration.
The original four functions
Before the rework, the image pipeline had four compute stages:
Only the deduplication function was small, CPU-bound, and dependency-light—the exact profile Graviton rewards. Everything else was tied to the ML ecosystem: compiled C extensions, missing ARM64 wheels, or a hard CUDA dependency.
What made it to the final cut!
Final changes were not a simple port. It was a structural rewrite of the pipeline.
1) I consolidated CLIP-IQA and dedup into a single technical validator
Originally, I had two separate concerns:
CLIP-IQA for visual quality scoring
Perceptual hashing for duplicate detection
That split created two problems. First, CLIP-IQA had ARM64 compatibility gaps because the required wheel chain was incomplete. Second, the model was miscalibrated for my dataset: upon a research, learned the CLIP-IQA was trained on old stock photos, which doesn’t fit the bill for my usecase (it accepted only about 1% of professional RAW captures, which made it unusable for real curation.)
Rather than keep forcing an ML model into a role that did not justify its operational cost, I replaced both stages with a lightweight rule-based Lambda:
exposure clipping detection
blur and sharpness scoring
perceptual hashing for deduplication
no ML forward pass
no compiled extensions in the hot path
This is no longer just a design simplification. Deduplication is now live in production, computing perceptual hashes and performing ANN-style lookups in DynamoDB. The validator is not only replacing the old stages—it is actively enforcing them.
That became parjanya-technical_validator, and it runs on ARM64.
Architecture: arm64
Memory: 512 MB
Runtime: python3.12This was the biggest unlock in the whole migration. I did not port a fragile ML stage to ARM64. I replaced it with something that never needed x86 in the first place.
2) I moved VLM inference off Lambda entirely
The GPU inference stage was never a candidate for Lambda ARM64. CUDA remains the hard boundary.
Instead of trying to force the issue, I separated control plane from execution:
the scheduling logic runs in a lightweight Lambda
the actual VLM inference runs on EC2 g5 instances in an Auto Scaling Group
the worker image runs
Qwen3-VL-8B-Instruct
That means the Lambda only decides when to scale and when to drain. It does not need ML libraries, CUDA, or any heavy inference runtime. That Lambda is ARM64.
This separation matters. I gained another Graviton function by moving the intelligence of orchestration into ARM64 and leaving the GPU workload where it belongs.
3) I moved the ECS backend to ARM64
The backend API tier also moved to ARM64.
It handles uploads, scoring orchestration, gallery queries, and DynamoDB reads. None of that needs x86. It is I/O-heavy, memory-efficient, and a good fit for Graviton.
The current state: 3 of 4
Three Lambda functions now run on Graviton. The ECS backend is on ARM64 too. The only x86 compute left is the GPU worker fleet, and that is an intentional architectural boundary, not an unfinished migration.
The technical validator is now running deduplication in production with DynamoDB-backed ANN lookup.
Current state:
~714 images processed
2,856 phash records
2,856 dhash records
5,712 total hash entries
Each 64-bit perceptual hash is split into 4 × 16-bit chunks and indexed via GSI, enabling efficient Hamming-distance queries without full table scans.
Duplicate detection is executed at ingest time using a Hamming threshold of 8. This is not approximate in theory—it is running against real image uploads and returning consistent matches.
Why the GPU stage will stay off Graviton
This is the part people usually ask about.
The VLM worker runs on NVIDIA g5 instances, which means CUDA. CUDA is the boundary, not convenience. The ARM64 story for GPU inference is still not a drop-in replacement for a standard PyTorch + CUDA workflow.
The alternatives exist, but they come with real trade-offs:
AWS Inferentia requires model compilation and Neuron SDK integration
operator compatibility can become a bottleneck
debugging and deployment pipelines become more specialized
parity testing is mandatory if scores are used for curation decisions
For a variable-resolution VLM workload, I decided the migration effort was not worth the operational complexity today.
How I think about ARM64 wheels
This is the practical section that matters if you are trying to do the same migration.
An ARM64 wheel is the precompiled binary distribution of a Python package for the aarch64/arm64 architecture. If a package publishes an ARM64 wheel, I can usually install it without compiling from source. If it does not, I may fall back to source builds, missing dependencies, or outright installation failures.
That is why ARM64 adoption is often not blocked by AWS itself. It is blocked by the package ecosystem.
What I check before switching a workload to ARM64
I treat every dependency as one of four categories:
pure Python, safe everywhere
ARM64 wheel available, usually safe
source build required, needs validation
native dependency missing or unstable, likely blocker
For a Lambda or container image, I check:
whether the package publishes
manylinux_aarch64wheelswhether transitive dependencies also support ARM64
whether C extensions link cleanly on Graviton
whether the package is pinned to a version that still ships ARM64 wheels
How I track ARM64 readiness for my use case
My workflow is simple:
I maintain a lockfile or constraints file for the exact runtime image
I test installs inside an ARM64 build environment, not just my laptop
I verify wheel availability against the exact Python version I deploy
I keep a dependency matrix for each service:
pure python,wheel available,source build,blockerI add a deployment gate that fails fast if a native package starts compiling from source unexpectedly
For ML and image-processing workloads, this matters more than people expect. A single dependency such as Pillow, OpenCV, raw image parsers, or a scientific library can flip a Lambda from “portable” to “fragile.”
A useful rule of thumb
If I cannot answer these three questions quickly, I do not move the workload yet:
Does every native dependency have an ARM64 wheel for my Python version?
If not, can I build from source reliably in CI?
If not, can I replace the dependency or the stage entirely?
In my case, the answer to the last question was the real unlock.
Cost analysis
VPC Gateway Endpoints
I added four VPC gateway endpoints to stop routing S3 and DynamoDB traffic through NAT.
Gateway endpoints are free. That means the traffic path to S3 and DynamoDB no longer pays NAT tax.
For an image pipeline moving large media sets, that is not noise. It is structural savings.
NAT Gateway cost sink
My production environment still uses NAT Gateways for general outbound traffic. That cost persists because the system still needs egress, but the higher-volume internal traffic no longer needs to pass through NAT.
A realistic next step is to collapse production NAT from 3 gateways to 1, which would save roughly $65/month in the current setup.
GPU backfill cost
ARM64 Lambda savings
AWS ARM64 Lambda pricing is typically around 20% lower than x86 for equivalent duration.
That matters because my architecture is event-driven:
the EXIF extractor runs on every ingest
the technical validator runs on every upload
the scheduler ticks every 2 minutes
Those invocations compound. The bigger the event volume, the more the ARM64 delta matters.
A stronger way to express savings
The right way to think about the savings is not just “cheaper compute.” It is:
Savings = avoided x86 Lambda spend + avoided NAT data-path spend + reduced ML over-processing + lower operational overhead from simpler packaging
That is the real gain.
In practical terms, the migration gives me four layers of savings:
lower per-invocation Lambda cost on ARM64
fewer NAT-passed requests to S3 and DynamoDB
deduplication alone prevents re-processing identical or near-identical RAW frames, directly reducing GPU inference cost
fewer operational failures caused by missing native wheels
The most important one is the third. Every frame rejected by the technical validator is a frame that never reaches the GPU queue. That reduces both cost and latency.
Performance and latency
Technical validator latency
Replacing CLIP-IQA and the old dedup path with a rule-based validator removed model loading, warmup, and forward-pass overhead.
The new validator runs in pure Python and is typically sub-100 ms per image.
This includes hash computation and ANN indexing, which now runs inline without introducing measurable latency overhead.
VLM throughput
The Qwen3-VL-8B-Instruct worker processed 320 images in the backfill run with clean score clustering and no rejects on this dataset. The reject threshold is set to 0.4, but the dataset was too clean to trigger it.
That is fine for now. I still need a more difficult validation set before I can trust the threshold under harsher conditions.
ECS backend behaviour
The ARM64 backend shows the expected pattern for a control-plane service: stable, low-error, and efficient for I/O-bound work.
The ecosystem lesson
The lesson is the same one I started with, but sharper now:
The fastest path to ARM64 is sometimes not to migrate the workload. It is to replace the workload.
I did not port a broken ML stage to ARM64.
I removed it.I did not force the GPU inference stack onto Lambda.
I isolated it.I did not keep a dedup stage alive because it existed historically.
I collapsed it into the validator where it belonged.
That is what made the architecture cleaner and cheaper.
The validator is not just simpler—it is now measurably correct under production load.
What is still ahead
Terraform reconciliation: my live state now diverges from the earlier config, so I need to align infra state with reality.
NAT reduction: production can likely collapse from 3 NAT gateways to 1 after reconciliation.
Better validation data: I need deliberately low-quality images to test the reject threshold properly.
Inferentia2 feasibility: worth revisiting only if VLM inference becomes steady-state instead of bursty.
TL;DR
Three of four Lambda functions now run on Graviton. The remaining x86 workload is the one that genuinely belongs there: GPU inference.
How I track ARM64 wheel readiness before I migrate a service
The main point is that ARM64 migration is usually decided by wheel availability, transitive dependencies, and build reproducibility—not by the chip itself.
Core structure
1) Start with the dependency graph
I map every package into one of four states:
pure Python
ARM64 wheel available
source-build required
blocker
2) Test inside an ARM64 build environment
I do not trust local x86 installs for this. I build in CI or in an ARM64 container so I can see what actually happens on Graviton.
3) Pin Python and package versions
ARM64 support can appear or disappear across versions. I lock both the Python runtime and the package set.
4) Track failure modes
I record whether the failure is:
missing wheel
missing system library
compiler issue
ABI mismatch
runtime behaviour difference
5) Decide whether to port or replace
If the package stack is fragile, I replace the workload. That is often cheaper than fighting the dependency chain.
The chip is rarely the real problem. The wheel matrix is!
Further Reading
AWS Graviton (Official)
Migrating AWS Lambda functions to ARM-based Graviton
Covers practical considerations when moving Lambda workloads to ARM64, including compatibility and performance trade-offs. AWS reports up to 34% better price-performance, reinforcing why this shift compounds over time.AWS Graviton Technical Guide
A deep reference on runtime support, Python wheels, and ecosystem maturity. Useful for understanding why some workloads “just work” while others fail due to native dependencies.
ARM64 in Practice (Real-world engineering)
Migrating a GenAI pipeline to AWS Graviton (ARM blog)
A strong parallel to this post: highlights the need to build from source when wheels don’t exist and to restructure pipelines for ARM64 compatibility rather than forcing migration.AWS Graviton usage guide (2025)
A broader view of where Graviton fits: microservices, data pipelines, and cost optimization. Also calls out the exact blockers I hit—x86-native binaries and dependency gaps.
Python + ARM64 (Critical for ML workloads)
AWS Graviton Python getting started guide
This is the underrated one. Explains why:Python version matters (wheel availability changes across versions)
GLIBC mismatches break installs
Some libraries require source builds or distro packages
If you’re debugging “why does this work on x86 but not ARM64?” — this is where the answer usually lives.
Closing thought
If there’s a single takeaway across all of these:
ARM64 adoption fails less because of compute—and more because of dependencies.
That’s the thread connecting every resource above—and the reason this migration worked only after changing the pipeline itself.






