What you'll take away
Spot instances are one of the most underused cost levers in cloud infrastructure. AWS offers EC2 Spot instances at 60–90% discounts versus on-demand pricing. GCP's Preemptible VMs and Spot VMs offer similar discounts. The catch: the cloud provider can reclaim the instance at short notice when capacity is needed elsewhere. AWS gives a two-minute warning; GCP gives only about 30 seconds. For most teams, that constraint puts spot instances firmly in the 'non-production only' category. The teams that have engineered past that constraint are running production workloads at dramatically lower compute costs.
The architecture that makes production spot viable isn't exotic — it's a set of engineering decisions that make your system resilient to instance loss in a way that pays dividends beyond just enabling spot. Instance loss tolerance is a forcing function for building distributed systems that are genuinely fault-tolerant.
The Workloads That Are Spot-Ready Without Architecture Changes
Some workloads are inherently tolerant of instance interruption because they are stateless, parallelisable, and designed to restart cleanly. If an instance is reclaimed mid-job, the job restarts on another instance and completes. The economics are compelling: a batch processing workload that takes two hours on on-demand instances might take two hours and fifteen minutes on spot with occasional interruptions. At an hourly rate 70% lower, the job still comes out roughly two-thirds cheaper overall.
- Batch data processing jobs with restartable or idempotent operations
- Machine learning training jobs with checkpointing enabled
- CI/CD build runners where a failed job simply retries
- Video transcoding and media processing pipelines
- Web scraping and data collection workloads
- Development and testing environments where occasional interruption is acceptable
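The batch-job economics described above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the 70% figure is a discount on the hourly rate; the function names and the normalised on-demand rate of 1.00 per hour are illustrative, not taken from any provider's pricing.

```python
def job_cost(hours, hourly_rate, discount=0.0):
    """Total cost of a job that runs for `hours` at a discounted hourly rate."""
    return hours * hourly_rate * (1.0 - discount)


def total_saving(on_demand_hours, spot_hours, discount, hourly_rate=1.0):
    """Fraction saved by running on spot, including the extra runtime
    caused by interruptions and restarts."""
    on_demand = job_cost(on_demand_hours, hourly_rate)
    spot = job_cost(spot_hours, hourly_rate, discount)
    return 1.0 - spot / on_demand


# Two hours on-demand versus two hours fifteen minutes on spot
# at an assumed 70% hourly discount: roughly a 66% saving overall.
saving = total_saving(on_demand_hours=2.0, spot_hours=2.25, discount=0.70)
```

The extra fifteen minutes of runtime costs about four percentage points of the saving, which is why restart overhead matters far less than the discount itself for this class of workload.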
Making Stateful Production Workloads Spot-Tolerant
Instance Diversification
The primary risk with spot instances is over-concentrating in a single instance type within a single Availability Zone. When spot capacity for that configuration tightens, you can lose all your instances simultaneously. The solution is instance diversification: configuring your Auto Scaling Group or managed node group to use multiple instance families across multiple AZs. Each combination of instance type and AZ is a separate Spot capacity pool, and pools with different demand patterns are unlikely to be interrupted simultaneously, so diversification across five or more configurations reduces interruption risk dramatically.
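As an illustration of the diversification described above, the sketch below builds the MixedInstancesPolicy structure that an EC2 Auto Scaling group accepts. The launch template name, the instance types, and the AZ names are hypothetical placeholders, and the helper functions are not part of any SDK; only the dictionary keys follow the Auto Scaling API.

```python
def mixed_instances_policy(launch_template, instance_types,
                           on_demand_base=0, on_demand_pct_above_base=0):
    """Build a MixedInstancesPolicy dict for an EC2 Auto Scaling group."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template,
                "Version": "$Latest",
            },
            # One override per instance type the group may launch.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": on_demand_pct_above_base,
            # Prefer pools with spare capacity while keeping price low.
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }


def pool_count(instance_types, availability_zones):
    """Each (instance type, AZ) pair is one Spot capacity pool."""
    return len(set(instance_types)) * len(set(availability_zones))


# Illustrative values only: similar-sized types from several families.
TYPES = ["m5.large", "m5a.large", "m6i.large", "c5.large", "r5.large"]
AZS = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]

policy = mixed_instances_policy("web-template", TYPES)
pools = pool_count(TYPES, AZS)  # 5 types x 3 AZs = 15 pools
```

With boto3 installed, a policy like this would typically be passed as the MixedInstancesPolicy argument of the autoscaling client's create_auto_scaling_group call, with the subnets for each AZ supplied through VPCZoneIdentifier.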
Graceful Interruption Handling
AWS provides a two-minute interruption notice via the instance metadata service and Amazon EventBridge. A well-designed application will: detect the interruption notice as soon as it arrives, stop accepting new requests, complete in-flight requests if they can complete within two minutes, and drain any local state to durable storage before the instance terminates. This requires application-level support — the application must be built to handle graceful shutdown, not just be killed mid-operation.
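The detection step can be sketched as a small polling loop against the instance metadata service. The example below assumes IMDSv2 and the spot/instance-action metadata path, which returns 404 until an interruption is scheduled; the function names, the five-second poll interval, and the drain callback are assumptions for illustration, and the application's real drain logic (deregistering from the load balancer, finishing in-flight requests, flushing local state) would go inside that callback.

```python
import json
import time
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=1.0):
    """Return the pending interruption notice as a dict, or None if there is none."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        action_req = urllib.request.Request(
            f"{IMDS}/meta-data/spot/instance-action",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with urllib.request.urlopen(action_req, timeout=timeout) as resp:
            return json.loads(resp.read())
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def watch_and_drain(poll, drain, sleep=time.sleep, interval=5.0, max_polls=None):
    """Poll until a notice arrives, then call drain(notice) exactly once.

    `poll` and `drain` are injected so the loop can be tested off EC2.
    Returns the notice, or None if max_polls is reached first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        notice = poll()
        if notice is not None:
            drain(notice)
            return notice
        polls += 1
        sleep(interval)
    return None
```

On an instance this would run in a background thread or sidecar as watch_and_drain(fetch_instance_action, my_drain), where my_drain is the application's own shutdown routine. Subscribing to the EventBridge event is an alternative when the reaction should happen outside the instance.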
Checkpointing for Long-Running Jobs
For workloads that cannot finish within the two-minute shutdown window, checkpointing is essential. The application periodically saves its progress to durable storage (S3, DynamoDB, RDS). If interrupted, the job restarts from the last checkpoint rather than from the beginning. Good checkpoint design bounds the loss on interruption: a batch processing job that checkpoints every thousand records loses at most a thousand records' worth of work, regardless of how far through the job it was.
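A minimal sketch of the checkpoint-every-N-records pattern follows. It uses a local JSON file as a stand-in for durable storage such as S3 or DynamoDB, and the function names and checkpoint format are illustrative assumptions. Records processed after the last checkpoint are processed again on restart, so the per-record operation is assumed to be idempotent.

```python
import json
import os


def load_checkpoint(path):
    """Return the index of the next record to process (0 if no checkpoint)."""
    try:
        with open(path) as f:
            return json.load(f)["next_index"]
    except FileNotFoundError:
        return 0


def save_checkpoint(path, next_index):
    """Write the checkpoint atomically so a kill mid-write cannot corrupt it."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)


def run_job(records, process, checkpoint_path, every=1000):
    """Process records in order, resuming from the last checkpoint.

    Returns the index the run started from.
    """
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(records)):
        process(records[i])
        if (i + 1) % every == 0:
            save_checkpoint(checkpoint_path, i + 1)
    save_checkpoint(checkpoint_path, len(records))
    return start
```

If the instance is reclaimed while record 2,300 is being processed, the checkpoint still reads 2,000, so the replacement instance redoes 300 records rather than 2,300. Choosing the interval is a trade-off between checkpoint write overhead and the amount of work repeated after an interruption.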
The infrastructure cost of graceful interruption handling and checkpointing is a one-time engineering investment. Once built, it enables spot usage across multiple workloads and also makes your system more resilient to AZ outages and instance hardware failures — benefits that extend well beyond the cost savings.
The Spot Mix Strategy for Kubernetes
Kubernetes clusters offer a clean way to implement spot instance strategies through node group configuration. The standard approach combines four elements:
- A small on-demand node group that handles critical system workloads and provides baseline cluster capacity
- A large spot node group with aggressive instance diversification that runs the majority of application workloads
- Pod disruption budgets that specify the minimum number of replicas that must remain running during node interruptions
- Pod topology spread constraints that distribute replicas across nodes and AZs
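With this configuration, an interruption notice leads to the affected node being cordoned and drained, and its pods being rescheduled onto other available nodes. Kubernetes does not react to spot notices by itself: EKS managed node groups handle this automatically, while self-managed nodes typically rely on a component such as the AWS Node Termination Handler or Karpenter. Applications must handle the brief interruption during rescheduling, but for web services with multiple replicas and proper health check configuration, user-visible impact is minimal.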
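The pod disruption budget and topology spread constraints mentioned above could be expressed as the following manifests, built here as Python dictionaries. This is a sketch under stated assumptions: the app label, the "checkout" service name, the replica floor of 2, and the helper functions are hypothetical; the field names and the two topology keys are standard Kubernetes ones.

```python
import json


def pod_disruption_budget(app, min_available):
    """PodDisruptionBudget keeping at least `min_available` pods up during drains."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": f"{app}-pdb"},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app}},
        },
    }


def spread_constraints(app, max_skew=1):
    """topologySpreadConstraints for a pod spec: spread across AZs and nodes."""
    selector = {"matchLabels": {"app": app}}
    return [
        {
            "maxSkew": max_skew,
            "topologyKey": "topology.kubernetes.io/zone",
            # Prefer an even spread but still schedule if capacity is uneven.
            "whenUnsatisfiable": "ScheduleAnyway",
            "labelSelector": selector,
        },
        {
            "maxSkew": max_skew,
            "topologyKey": "kubernetes.io/hostname",
            "whenUnsatisfiable": "ScheduleAnyway",
            "labelSelector": selector,
        },
    ]


pdb = pod_disruption_budget("checkout", min_available=2)
constraints = spread_constraints("checkout")
pdb_manifest = json.dumps(pdb, indent=2)  # kubectl also accepts JSON manifests
```

The constraints list belongs under spec.template.spec.topologySpreadConstraints in the Deployment. ScheduleAnyway is chosen here so that a tight spot pool in one AZ does not leave pods unschedulable; DoNotSchedule is the stricter alternative when even spread matters more than availability of capacity.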
GYSP's Cloud & DevOps Engineering practice has implemented spot instance strategies for clients running production Kubernetes clusters and batch processing pipelines. The typical outcome: 40–60% reduction in EC2 compute costs for workloads where spot tolerance is achievable, with no meaningful increase in reliability incidents when the architecture is done correctly.
“Spot tolerance is not a risky architecture choice — it's a resilience engineering exercise that happens to have a large cost benefit. The companies running production on spot aren't cutting corners; they've built better infrastructure.”
— Ankush, Chief Technology Officer — GYSP.tech
