
The 90% Discount: How to Run Production on Spot Instances Without Crashing

Ankush

Chief Technology Officer, GYSP.tech

1 June 2025 · 9 min read

What you'll take away

The Workloads That Are Spot-Ready Without Architecture Changes
Making Stateful Production Workloads Spot-Tolerant
Validated Outcomes
The Spot Mix Strategy for Kubernetes

Spot instances are one of the most underused cost levers in cloud infrastructure. AWS offers EC2 Spot instances at 60–90% discounts versus on-demand pricing. GCP's Spot VMs (and the older Preemptible VMs) offer similar discounts. The catch: the cloud provider can reclaim the instance at short notice when capacity is needed elsewhere. AWS gives a two-minute warning; GCP gives roughly 30 seconds. For most teams, that constraint puts spot instances firmly in the 'non-production only' category. The teams that have engineered past that constraint are running production workloads at dramatically lower compute costs.

The architecture that makes production spot viable isn't exotic — it's a set of engineering decisions that make your system resilient to instance loss in a way that pays dividends beyond just enabling spot. Instance loss tolerance is a forcing function for building distributed systems that are genuinely fault-tolerant.

The Workloads That Are Spot-Ready Without Architecture Changes

Some workloads are inherently tolerant of instance interruption because they are stateless, parallelisable, and designed to restart cleanly. If an instance is reclaimed mid-job, the job restarts on another instance and completes. The economics are compelling: a batch processing workload that takes two hours on on-demand instances might take two hours and fifteen minutes on spot with occasional interruptions — at 70% lower cost.

Batch data processing jobs with restartable or idempotent operations
Machine learning training jobs with checkpointing enabled
CI/CD build runners where a failed job simply retries
Video transcoding and media processing pipelines
Web scraping and data collection workloads
Development and testing environments where occasional interruption is acceptable
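
The batch-job economics described above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the 70% figure is the hourly spot discount and that the only cost of interruptions is the extra fifteen minutes of runtime; the function name and parameters are illustrative, not part of any AWS tooling.

```python
def spot_saving(on_demand_hours, spot_hours, spot_discount):
    """Fraction of job cost saved by running on spot instead of on-demand.

    spot_discount is the hourly discount (0.70 means spot costs 30% of the
    on-demand rate). The on-demand hourly price cancels out of the ratio,
    so it is normalised to 1.0 here.
    """
    on_demand_cost = on_demand_hours * 1.0
    spot_cost = spot_hours * (1.0 - spot_discount)
    return 1.0 - spot_cost / on_demand_cost


# Two hours on-demand versus two hours fifteen minutes on spot.
saving = spot_saving(on_demand_hours=2.0, spot_hours=2.25, spot_discount=0.70)
print(f"{saving:.2%}")  # 66.25%
```

Under these assumptions the retries cost only a few percentage points: a 70% hourly discount still nets roughly a 66% lower bill for the job.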

Making Stateful Production Workloads Spot-Tolerant

Instance Diversification

The primary risk with spot instances is over-concentrating in a single instance type within a single Availability Zone. When spot capacity for that configuration tightens, you can lose all your instances simultaneously. The solution is instance diversification: configuring your Auto Scaling Group or managed node group to use multiple instance families across multiple AZs. Each combination of instance type and Availability Zone is a separate Spot capacity pool, and pools with different demand patterns are unlikely to be interrupted at the same time, so diversifying across five or more configurations reduces interruption risk dramatically.
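
As one possible illustration of diversification, the sketch below builds the `MixedInstancesPolicy` structure that an EC2 Auto Scaling group accepts. The helper name, the launch template name, the instance types, and the on-demand base of two instances are all assumptions chosen for the example; the right mix depends on your workload's CPU and memory profile.

```python
def mixed_instances_policy(launch_template_name, instance_types,
                           on_demand_base=2, on_demand_pct_above_base=0):
    """Build a MixedInstancesPolicy dict for an EC2 Auto Scaling group.

    instance_types should span several families so the group can draw
    from many Spot capacity pools rather than one.
    """
    unique_types = list(dict.fromkeys(instance_types))  # drop duplicates, keep order
    if len(unique_types) < 2:
        raise ValueError("diversification needs at least two instance types")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # One override per instance type the group is allowed to launch.
            "Overrides": [{"InstanceType": t} for t in unique_types],
        },
        "InstancesDistribution": {
            # A small on-demand floor; everything above it runs on spot.
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }


policy = mixed_instances_policy(
    "web-spot",  # hypothetical launch template name
    ["m5.large", "m5a.large", "m6i.large", "c5.large", "c6i.large"],
)
```

The resulting dict can be passed as the `MixedInstancesPolicy` argument of the boto3 Auto Scaling client's `create_auto_scaling_group` call, alongside a `VPCZoneIdentifier` that lists subnets in several AZs so that each instance type maps to more than one pool.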

Graceful Interruption Handling

AWS provides a two-minute interruption notice via the instance metadata service and Amazon EventBridge. A well-designed application will: detect the interruption notice as soon as it arrives, stop accepting new requests, complete in-flight requests if they can complete within two minutes, and drain any local state to durable storage before the instance terminates. This requires application-level support — the application must be built to handle graceful shutdown, not just be killed mid-operation.
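
The detection step above can be sketched as a small polling loop against the instance metadata service. This is a minimal example rather than a production agent: it assumes IMDSv2 is enabled, the five-second poll interval is arbitrary, and `watch_for_interruption` and its `on_notice` callback are hypothetical names standing in for whatever drain logic your application provides.

```python
import json
import time
import urllib.error
import urllib.request

IMDS_BASE = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=1.0):
    """Return the pending Spot interruption notice as a dict, or None."""
    # IMDSv2 requires a short-lived session token before reading metadata.
    token_req = urllib.request.Request(
        f"{IMDS_BASE}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()

    action_req = urllib.request.Request(
        f"{IMDS_BASE}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption is scheduled
            return None
        raise


def watch_for_interruption(on_notice, fetch=fetch_instance_action,
                           poll_seconds=5.0, sleep=time.sleep, max_polls=None):
    """Poll until a notice appears, then call on_notice(notice) once."""
    polls = 0
    while max_polls is None or polls < max_polls:
        notice = fetch()
        if notice is not None:
            on_notice(notice)  # e.g. stop accepting requests, flush local state
            return notice
        polls += 1
        sleep(poll_seconds)
    return None
```

In practice `on_notice` would fail the load balancer health check, let in-flight requests finish, and write any local state to durable storage. The `fetch` and `sleep` parameters exist so the loop can be exercised off EC2 with a fake metadata source.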

Checkpointing for Long-Running Jobs

For workloads that run longer than can complete within a two-minute shutdown window, checkpointing is essential. The application periodically saves its progress to durable storage (S3, DynamoDB, RDS). If interrupted, the job restarts from the last checkpoint rather than from the beginning. Good checkpoint design minimises the loss on interruption: a batch processing job that checkpoints every thousand records loses at most a thousand records' worth of work on interruption, regardless of how far through the job it was.
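
A minimal sketch of the checkpoint-every-thousand-records pattern follows. To stay self-contained it writes the checkpoint to a local JSON file, which is an assumption made for illustration only; on a spot instance the checkpoint would go to durable storage such as S3 or DynamoDB. The function names are hypothetical.

```python
import json
import os


def load_checkpoint(path):
    """Return the index of the next record to process (0 if no checkpoint)."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_index):
    """Write the checkpoint atomically so a crash never leaves a torn file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)


def run_job(records, handle, checkpoint_path, every=1000):
    """Process records, resuming from the last checkpoint if one exists.

    Returns the index the run started from.
    """
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(records)):
        handle(records[i])
        if (i + 1) % every == 0:
            save_checkpoint(checkpoint_path, i + 1)
    save_checkpoint(checkpoint_path, len(records))  # mark the job complete
    return start
```

Records handled after the last checkpoint but before the interruption are processed again on resume, so `handle` should be idempotent. That is the same property that makes a workload spot-ready in the first place.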


The cost of building graceful interruption handling and checkpointing is a one-time engineering investment. Once built, it enables spot usage across multiple workloads and also makes your system more resilient to AZ outages and instance hardware failures, benefits that extend well beyond the cost savings.

Validated Outcomes

Netflix is the canonical reference for production spot instance operations at scale. Netflix has published that their global streaming infrastructure runs a significant percentage of compute on EC2 Spot, with their batch processing workloads — encoding, recommendation model training, data pipeline jobs — running almost entirely on Spot. Netflix engineering blog posts have documented how their Chaos Engineering practice (including Chaos Monkey) was partly motivated by the operational discipline that Spot interruption handling requires: designing for failure makes interruptions routine rather than exceptional. The result: Netflix achieves 60–80% compute cost reduction on Spot-eligible workloads compared to equivalent On-Demand pricing.

GYSP's Spot implementation engagements focus on the two engineering investments that make production Spot viable: interruption-aware pod configuration in Kubernetes (using node affinity, pod disruption budgets, and graceful termination handling) and instance diversification across at least 3–4 instance families to reduce simultaneous interruption probability. Clients who complete these two steps consistently achieve 40–60% compute cost reduction on Spot-eligible workloads with reliability metrics unchanged from their On-Demand baseline.

The Spot Mix Strategy for Kubernetes

Kubernetes clusters offer a clean way to implement spot instance strategies through node group configuration. The standard approach: a small on-demand node group that handles critical system workloads and provides baseline cluster capacity; a large spot node group with aggressive instance diversification that runs the majority of application workloads; pod disruption budgets that specify the minimum number of replicas that must remain running during node interruptions; and pod topology spread constraints that distribute replicas across nodes and AZs.

With this configuration, Kubernetes handles instance interruption events by draining the affected node and rescheduling pods to other available nodes. Applications must handle the brief interruption during rescheduling, but for web services with multiple replicas and proper health check configuration, user-visible impact is minimal.
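
The pod disruption budget and topology spread pieces of this setup can be sketched as follows. The manifests are built as Python dicts and can be serialised to JSON, which `kubectl apply -f` accepts alongside YAML. The `app` label, the resource names, and the choice of `ScheduleAnyway` are assumptions for the example, not a recommended configuration for every service.

```python
import json


def pod_disruption_budget(name, app_label, min_available):
    """PodDisruptionBudget keeping at least min_available replicas running
    while nodes are drained."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


def topology_spread_constraints(app_label):
    """Constraints for a pod spec that spread replicas across AZs and nodes."""
    selector = {"matchLabels": {"app": app_label}}
    return [
        {   # spread across Availability Zones
            "maxSkew": 1,
            "topologyKey": "topology.kubernetes.io/zone",
            "whenUnsatisfiable": "ScheduleAnyway",
            "labelSelector": selector,
        },
        {   # spread across individual nodes
            "maxSkew": 1,
            "topologyKey": "kubernetes.io/hostname",
            "whenUnsatisfiable": "ScheduleAnyway",
            "labelSelector": selector,
        },
    ]


pdb_manifest = json.dumps(pod_disruption_budget("web-pdb", "web", 2), indent=2)
```

The list returned by `topology_spread_constraints` belongs under `spec.template.spec.topologySpreadConstraints` of the Deployment. `ScheduleAnyway` keeps pods schedulable when spot capacity is uneven across zones; `DoNotSchedule` is the stricter alternative if an even spread matters more than getting every replica placed.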

GYSP's Cloud & DevOps Engineering practice has implemented spot instance strategies for clients running production Kubernetes clusters and batch processing pipelines. The typical outcome: 40–60% reduction in EC2 compute costs for workloads where spot tolerance is achievable, with no meaningful increase in reliability incidents when the architecture is done correctly.

“Spot tolerance is not a risky architecture choice — it's a resilience engineering exercise that happens to have a large cost benefit. The companies running production on spot aren't cutting corners; they've built better infrastructure.”
— Ankush, Chief Technology Officer — GYSP.tech
