From 1/4 to 3/4: Re-architecting an ML Pipeline for Graviton

Apr 20, 2026

Graviton adoption is not a hardware decision. It is an ecosystem decision.

A few weeks ago, I wrote that I could only run 1 of my 4 ML Lambdas on AWS Graviton. At that point, the pipeline had four compute stages and only one—the deduplication stage—was cleanly portable to ARM64. The other three were blocked by native dependencies, ML scoring instability, or CUDA hard requirements.

Jagadeesh Rampam

Why We Could Only Use AWS Graviton for 1 of Our 4 ML Lambdas

A simple rule of thumb: Graviton works beautifully when your dependencies are simple. It struggles when your stack isn’t…

4 months ago · Jagadeesh Rampam

Since then, I shipped a major re-architecture. I am now at 3 of 4 instances migrated to Graviton and an additional orchestration layer on ECS using Graviton

This is technical deep dive on what changed, what stayed stubborn, and how the cost model looks after the migration.

The original four functions

Before the rework, the image pipeline had four compute stages:

Only the deduplication function was small, CPU-bound, and dependency-light—the exact profile Graviton rewards. Everything else was tied to the ML ecosystem: compiled C extensions, missing ARM64 wheels, or a hard CUDA dependency.

What made it to the final cut!

Final changes were not a simple port. It was a structural rewrite of the pipeline.

1) I consolidated CLIP-IQA and dedup into a single technical validator

Originally, I had two separate concerns:

CLIP-IQA for visual quality scoring
Perceptual hashing for duplicate detection

That split created two problems. First, CLIP-IQA had ARM64 compatibility gaps because the required wheel chain was incomplete. Second, the model was miscalibrated for my dataset: upon a research, learned the CLIP-IQA was trained on old stock photos, which doesn’t fit the bill for my usecase (it accepted only about 1% of professional RAW captures, which made it unusable for real curation.)

Rather than keep forcing an ML model into a role that did not justify its operational cost, I replaced both stages with a lightweight rule-based Lambda:

exposure clipping detection
blur and sharpness scoring
perceptual hashing for deduplication
no ML forward pass
no compiled extensions in the hot path

This is no longer just a design simplification. Deduplication is now live in production, computing perceptual hashes and performing ANN-style lookups in DynamoDB. The validator is not only replacing the old stages—it is actively enforcing them.

That became parjanya-technical_validator, and it runs on ARM64.

Architecture: arm64
Memory: 512 MB
Runtime: python3.12

This was the biggest unlock in the whole migration. I did not port a fragile ML stage to ARM64. I replaced it with something that never needed x86 in the first place.

2) I moved VLM inference off Lambda entirely

The GPU inference stage was never a candidate for Lambda ARM64. CUDA remains the hard boundary.

Instead of trying to force the issue, I separated control plane from execution:

the scheduling logic runs in a lightweight Lambda
the actual VLM inference runs on EC2 g5 instances in an Auto Scaling Group
the worker image runs Qwen3-VL-8B-Instruct

That means the Lambda only decides when to scale and when to drain. It does not need ML libraries, CUDA, or any heavy inference runtime. That Lambda is ARM64.

This separation matters. I gained another Graviton function by moving the intelligence of orchestration into ARM64 and leaving the GPU workload where it belongs.

3) I moved the ECS backend to ARM64

The backend API tier also moved to ARM64.

It handles uploads, scoring orchestration, gallery queries, and DynamoDB reads. None of that needs x86. It is I/O-heavy, memory-efficient, and a good fit for Graviton.

The current state: 3 of 4

Three Lambda functions now run on Graviton. The ECS backend is on ARM64 too. The only x86 compute left is the GPU worker fleet, and that is an intentional architectural boundary, not an unfinished migration.

The technical validator is now running deduplication in production with DynamoDB-backed ANN lookup.

Current state:

~714 images processed
2,856 phash records
2,856 dhash records
5,712 total hash entries

Each 64-bit perceptual hash is split into 4 × 16-bit chunks and indexed via GSI, enabling efficient Hamming-distance queries without full table scans.

Duplicate detection is executed at ingest time using a Hamming threshold of 8. This is not approximate in theory—it is running against real image uploads and returning consistent matches.

Why the GPU stage will stay off Graviton

This is the part people usually ask about.

The VLM worker runs on NVIDIA g5 instances, which means CUDA. CUDA is the boundary, not convenience. The ARM64 story for GPU inference is still not a drop-in replacement for a standard PyTorch + CUDA workflow.

The alternatives exist, but they come with real trade-offs:

AWS Inferentia requires model compilation and Neuron SDK integration
operator compatibility can become a bottleneck
debugging and deployment pipelines become more specialized
parity testing is mandatory if scores are used for curation decisions

For a variable-resolution VLM workload, I decided the migration effort was not worth the operational complexity today.

How I think about ARM64 wheels

This is the practical section that matters if you are trying to do the same migration.

An ARM64 wheel is the precompiled binary distribution of a Python package for the aarch64/arm64 architecture. If a package publishes an ARM64 wheel, I can usually install it without compiling from source. If it does not, I may fall back to source builds, missing dependencies, or outright installation failures.

That is why ARM64 adoption is often not blocked by AWS itself. It is blocked by the package ecosystem.

What I check before switching a workload to ARM64

I treat every dependency as one of four categories:

pure Python, safe everywhere
ARM64 wheel available, usually safe
source build required, needs validation
native dependency missing or unstable, likely blocker

For a Lambda or container image, I check:

whether the package publishes manylinux_aarch64 wheels
whether transitive dependencies also support ARM64
whether C extensions link cleanly on Graviton
whether the package is pinned to a version that still ships ARM64 wheels

How I track ARM64 readiness for my use case

My workflow is simple:

I maintain a lockfile or constraints file for the exact runtime image
I test installs inside an ARM64 build environment, not just my laptop
I verify wheel availability against the exact Python version I deploy
I keep a dependency matrix for each service: pure python, wheel available, source build, blocker
I add a deployment gate that fails fast if a native package starts compiling from source unexpectedly

For ML and image-processing workloads, this matters more than people expect. A single dependency such as Pillow, OpenCV, raw image parsers, or a scientific library can flip a Lambda from “portable” to “fragile.”

A useful rule of thumb

If I cannot answer these three questions quickly, I do not move the workload yet:

Does every native dependency have an ARM64 wheel for my Python version?
If not, can I build from source reliably in CI?
If not, can I replace the dependency or the stage entirely?

In my case, the answer to the last question was the real unlock.

Cost analysis

VPC Gateway Endpoints

I added four VPC gateway endpoints to stop routing S3 and DynamoDB traffic through NAT.

Gateway endpoints are free. That means the traffic path to S3 and DynamoDB no longer pays NAT tax.

For an image pipeline moving large media sets, that is not noise. It is structural savings.

NAT Gateway cost sink

My production environment still uses NAT Gateways for general outbound traffic. That cost persists because the system still needs egress, but the higher-volume internal traffic no longer needs to pass through NAT.

A realistic next step is to collapse production NAT from 3 gateways to 1, which would save roughly $65/month in the current setup.

GPU backfill cost

ARM64 Lambda savings

AWS ARM64 Lambda pricing is typically around 20% lower than x86 for equivalent duration.

That matters because my architecture is event-driven:

the EXIF extractor runs on every ingest
the technical validator runs on every upload
the scheduler ticks every 2 minutes

Those invocations compound. The bigger the event volume, the more the ARM64 delta matters.

A stronger way to express savings

The right way to think about the savings is not just “cheaper compute.” It is:

Savings = avoided x86 Lambda spend + avoided NAT data-path spend + reduced ML over-processing + lower operational overhead from simpler packaging

That is the real gain.

In practical terms, the migration gives me four layers of savings:

lower per-invocation Lambda cost on ARM64
fewer NAT-passed requests to S3 and DynamoDB
deduplication alone prevents re-processing identical or near-identical RAW frames, directly reducing GPU inference cost
fewer operational failures caused by missing native wheels

The most important one is the third. Every frame rejected by the technical validator is a frame that never reaches the GPU queue. That reduces both cost and latency.

Performance and latency

Technical validator latency

Replacing CLIP-IQA and the old dedup path with a rule-based validator removed model loading, warmup, and forward-pass overhead.

The new validator runs in pure Python and is typically sub-100 ms per image.

This includes hash computation and ANN indexing, which now runs inline without introducing measurable latency overhead.

VLM throughput

The Qwen3-VL-8B-Instruct worker processed 320 images in the backfill run with clean score clustering and no rejects on this dataset. The reject threshold is set to 0.4, but the dataset was too clean to trigger it.

That is fine for now. I still need a more difficult validation set before I can trust the threshold under harsher conditions.

ECS backend behaviour

The ARM64 backend shows the expected pattern for a control-plane service: stable, low-error, and efficient for I/O-bound work.

The ecosystem lesson

The lesson is the same one I started with, but sharper now:

The fastest path to ARM64 is sometimes not to migrate the workload. It is to replace the workload.

I did not port a broken ML stage to ARM64.
I removed it.
I did not force the GPU inference stack onto Lambda.
I isolated it.
I did not keep a dedup stage alive because it existed historically.
I collapsed it into the validator where it belonged.

That is what made the architecture cleaner and cheaper.

The validator is not just simpler—it is now measurably correct under production load.

What is still ahead

Terraform reconciliation: my live state now diverges from the earlier config, so I need to align infra state with reality.
NAT reduction: production can likely collapse from 3 NAT gateways to 1 after reconciliation.
Better validation data: I need deliberately low-quality images to test the reject threshold properly.
Inferentia2 feasibility: worth revisiting only if VLM inference becomes steady-state instead of bursty.

TL;DR

Three of four Lambda functions now run on Graviton. The remaining x86 workload is the one that genuinely belongs there: GPU inference.

How I track ARM64 wheel readiness before I migrate a service

The main point is that ARM64 migration is usually decided by wheel availability, transitive dependencies, and build reproducibility—not by the chip itself.

Core structure

1) Start with the dependency graph

I map every package into one of four states:

pure Python
ARM64 wheel available
source-build required
blocker

2) Test inside an ARM64 build environment

I do not trust local x86 installs for this. I build in CI or in an ARM64 container so I can see what actually happens on Graviton.

3) Pin Python and package versions

ARM64 support can appear or disappear across versions. I lock both the Python runtime and the package set.

4) Track failure modes

I record whether the failure is:

missing wheel
missing system library
compiler issue
ABI mismatch
runtime behaviour difference

5) Decide whether to port or replace

If the package stack is fragile, I replace the workload. That is often cheaper than fighting the dependency chain.

The chip is rarely the real problem. The wheel matrix is!

Closing thought

If there’s a single takeaway across all of these:

ARM64 adoption fails less because of compute—and more because of dependencies.

That’s the thread connecting every resource above—and the reason this migration worked only after changing the pipeline itself.

Jagadeesh Rampam

Discussion about this post

Ready for more?

Jagadeesh Rampam

From 1/4 to 3/4: Re-architecting an ML Pipeline for Graviton

The original four functions

What made it to the final cut!

1) I consolidated CLIP-IQA and dedup into a single technical validator

2) I moved VLM inference off Lambda entirely

3) I moved the ECS backend to ARM64

The current state: 3 of 4

Why the GPU stage will stay off Graviton

How I think about ARM64 wheels

What I check before switching a workload to ARM64

How I track ARM64 readiness for my use case

A useful rule of thumb

Cost analysis

VPC Gateway Endpoints

NAT Gateway cost sink

GPU backfill cost

ARM64 Lambda savings

A stronger way to express savings

Performance and latency

Technical validator latency

VLM throughput

ECS backend behaviour

The ecosystem lesson

What is still ahead

TL;DR

How I track ARM64 wheel readiness before I migrate a service

Core structure

1) Start with the dependency graph

2) Test inside an ARM64 build environment

3) Pin Python and package versions

4) Track failure modes

5) Decide whether to port or replace

Further Reading

AWS Graviton (Official)

ARM64 in Practice (Real-world engineering)

Python + ARM64 (Critical for ML workloads)

Closing thought

Discussion about this post

Ready for more?