AWS – introducing "spot instances"
"I will give you a discount of about 70% on your virtual server, but in return I reserve the right to shut it down at any time." This is the somewhat strange-sounding offer from AWS called "spot instances" (Azure: low-priority VMs). But why would AWS offer such a product? Doesn't it undercut their on-demand business?
The reason behind the spot instance offering is to make money from the spare capacity AWS has to keep anyway. When you work in the cloud, you are used to being able to provision and start a virtual server within minutes. That means the hardware on which the VMs are started is already there, unused, waiting for a customer to start a VM on it. That hardware has been paid for and keeps costing money: amortization, power, software licenses, maintenance and so on. And as long as no customer starts a VM on it, it generates no revenue at all. So AWS can use this spare hardware to run other VMs for as long as no customer wants to start an on-demand instance on it. And that is how the idea of spot instances was born: AWS lets you run VMs on its spare capacity but powers them off as soon as it needs the hardware for a customer's on-demand instance.
Of course, having to deal with a VM that can be switched off at any time is annoying, so AWS has to give a steep discount on the on-demand price to make the spot offer attractive. No problem for AWS: they have to pay the full cost of this hardware in any case, even while it sits unused. So every dollar of revenue they make with spot instances is essentially pure profit. It looks like a clear win-win, and it is.
So, as an AWS customer looking at spot instances you have to ask yourself: What applications do I have that do not suffer much when a VM gets switched off?
There are 3 types of application that are suited for spot: 1) cloud native apps, 2) batch processing like EMR, rendering, scientific computing, video (re-)encoding, 3) apps/systems that only run for a short period of time (and therefore run a low risk of being interrupted) like CI/CD pipelines, some test systems, large-scale tests on many nodes, autoscaling.
So what is 1), a "cloud native app"? This just means that it scales horizontally (by adding more VMs to the system). A good example is the Google search engine. It runs on thousands of nodes, and at almost any moment some node is down somewhere, simply because of the sheer number of nodes. But Google itself is never down, as it was designed from scratch for scalability and reliability, both accomplished by scaling horizontally.
The second group is about batch processing. Here any number of nodes do some number crunching like MapReduce (big data), raytracing or video encoding. If a node gets shut down, the others can take over its work. The processing will take longer, but this rarely matters. And if it does, you can fire up another node (spot or on-demand).
The third group is about VMs needed for a short period of time. For example, test nodes that are only needed when a new software release is ready to be tested. Or tests that involve a lot of nodes (always test at production scale!), but only for a few hours or days. Or the instances that are added when autoscaling scales out (adds nodes). You know it will scale in again sooner or later. And for this short period you can use spot instances, as the likelihood of them getting switched off is of course lower the shorter their lifetime is.
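To make the "shorter lifetime, lower risk" argument concrete, here is a rough sketch. It assumes, purely for illustration, that interruptions arrive at a constant average rate; real interruption behaviour varies by instance type, region and time of day, and the rate used here is a made-up number, not an AWS figure.

```python
import math

# Hypothetical average interruption rate (interruptions per instance-hour).
# This is an illustrative assumption, not a published AWS number.
ASSUMED_INTERRUPTIONS_PER_HOUR = 0.01

def interruption_probability(hours, rate=ASSUMED_INTERRUPTIONS_PER_HOUR):
    """Probability of at least one interruption within `hours`,
    under the simplifying assumption of a constant interruption rate."""
    if hours < 0 or rate < 0:
        raise ValueError("hours and rate must be non-negative")
    return 1.0 - math.exp(-rate * hours)
```

Under this model, a test node that lives for two hours has a much smaller chance of being interrupted than one that runs for a week, which is the reasoning behind the third group.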
And what about SQL databases like Oracle, MS SQL or MySQL? They are almost never suited, because users are usually working on them and the cost of some users being unable to work is far higher than the savings you can achieve with spot. Furthermore, databases can run into problems when they are not given enough time to shut down cleanly.
NoSQL databases tend not to have those problems, but even here you have to keep an eye on the costs that downtime creates. As a rule of thumb, do not use spot for anything with interactive users or for databases.
OK, then let's go back to the recommended use cases, for example batch processing. Say you have hundreds of MP4 videos from which you want to extract the audio and save it as MP3. A great job for spot. Fire up a few dozen spot instances (remember that in the cloud using 10 instances for 1 hour costs the same as using 1 instance for 10 hours) and let them process the videos. If one spot instance gets shut down, just fire up another (preferably of a different instance type, since spare capacity for the first type may be short at that moment).
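The arithmetic behind this can be sketched in a few lines. The on-demand price below is a hypothetical placeholder, the 70% discount is the rough figure quoted at the top of this article, and the sketch assumes billing is proportional to running time and that the job parallelizes cleanly.

```python
# Hypothetical on-demand price in USD per instance-hour (not a real AWS quote).
ON_DEMAND_PRICE_PER_HOUR = 0.10
# "About 70%" discount, as quoted at the start of the article.
SPOT_DISCOUNT = 0.70

def instance_hours(instances, hours_each):
    """Total instance-hours consumed by a job."""
    return instances * hours_each

def job_cost(instances, hours_each, price_per_hour):
    """Cost of a job, assuming billing is proportional to instance-hours."""
    return instance_hours(instances, hours_each) * price_per_hour

def spot_price(on_demand_price, discount=SPOT_DISCOUNT):
    """Spot price per hour after applying the assumed discount."""
    return on_demand_price * (1.0 - discount)
```

With these assumptions, job_cost(10, 1, ON_DEMAND_PRICE_PER_HOUR) and job_cost(1, 10, ON_DEMAND_PRICE_PER_HOUR) are identical, so spreading the work over more nodes only shortens the wall-clock time. Running the same job at spot_price(ON_DEMAND_PRICE_PER_HOUR) then cuts the bill by the discount.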
However, this way, when a node gets shut down you lose all the progress it had made, all the processing/audio extraction it had done so far. To avoid this, you can use checkpointing. Every minute or so your node saves its work to durable storage like S3, along with a little bookkeeping information such as "the audio of the first 150 seconds of video xyz is saved in file xyz-123". Another instance can then continue from this checkpoint.
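A minimal sketch of this idea follows. To keep it runnable anywhere, a local file stands in for durable storage such as S3, the actual audio extraction is replaced by a placeholder, and all function names, the checkpoint layout and the interrupt_after parameter (which simulates a spot interruption) are made up for illustration.

```python
import json
import os

def load_checkpoint(path):
    """Return the number of seconds already processed, or 0 if there is no checkpoint."""
    try:
        with open(path) as f:
            return json.load(f)["seconds_done"]
    except FileNotFoundError:
        return 0

def save_checkpoint(path, video, seconds_done, output_file):
    """Write the checkpoint via a temp file and rename, so that an
    interruption never leaves a half-written checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"video": video,
                   "seconds_done": seconds_done,
                   "output_file": output_file}, f)
    os.replace(tmp, path)

def extract_audio(video, total_seconds, checkpoint_path,
                  chunk_seconds=60, interrupt_after=None):
    """Process a video chunk by chunk, checkpointing after every chunk.

    interrupt_after simulates a spot interruption after that many chunks.
    Returns the number of seconds processed so far.
    """
    done = load_checkpoint(checkpoint_path)  # resume where the last node stopped
    chunks = 0
    while done < total_seconds:
        if interrupt_after is not None and chunks == interrupt_after:
            return done  # simulated shutdown of the spot instance
        # Placeholder for the real work, e.g. encoding this chunk to MP3.
        done = min(done + chunk_seconds, total_seconds)
        save_checkpoint(checkpoint_path, video, done, video + "-audio")
        chunks += 1
    return done
```

If a first call for a 300-second video is interrupted after two chunks, it returns 120. A second call with the same checkpoint path, standing in for the replacement instance, picks up at second 120 and finishes the remaining 180 seconds instead of starting over.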
This is a simple way of checkpointing. Systems like EMR or grid computing frameworks have more sophisticated ways of starting the desired number of instances, distributing the work and checkpointing.
What to keep in mind when considering spot instances
Do not use them for databases or interactive users. This would result in high downtime costs, far higher than the savings gained with spot. Use them for your cloud native app, maybe with some on-demand instances providing the basic, always-needed capacity and spot instances for additional capacity, similar to autoscaling.
And use them for any kind of non-time-critical batch processing (a bank's end-of-day processing is batch processing, but it is time-critical!). And almost always use them for test systems (without interactive users), additional short-lived capacity like autoscaling, CI/CD, maybe containers.
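The mix of on-demand base capacity and spot capacity on top, as suggested above for cloud native apps, can be sketched as a simple split. The function name and the node counts are hypothetical. In practice this decision would be made by whatever scaling mechanism you use, not by hand-written code like this.

```python
def split_capacity(required_nodes, baseline_nodes):
    """Return (on_demand, spot) node counts.

    The baseline is the always-needed capacity and stays on on-demand
    instances. Anything required beyond it is placed on spot instances.
    """
    if required_nodes < 0 or baseline_nodes < 0:
        raise ValueError("node counts must be non-negative")
    on_demand = baseline_nodes
    spot = max(0, required_nodes - baseline_nodes)
    return on_demand, spot
```

For example, with a baseline of 4 nodes and a current demand of 10, the split is 4 on-demand and 6 spot nodes. When demand drops back to 3, the spot share falls to zero while the on-demand baseline stays in place.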