
The Hunt for Cheap Deep Learning GPUs

Eliot Andres · July 12, 2020

Recently, I have been struggling to find cheap and reliable GPUs to train deep learning models. In this article, I will summarize the options available for running deep learning computations on GPUs.

Not too long ago, you could rent a beefy GPU machine for 100€/month. Hetzner, a German server provider, offered a dedicated server with a consumer GPU at around that price.

It was fast and reliable. The good times. However, they discontinued this offering. Nowadays, if you want to get a GPU for deep learning, you have several options:

  • Use a cloud provider (GCP, AWS, Azure)

  • Use a cloud provider with preemptible machines

  • Rent a bare metal machine

  • Build your own

Foreword

Hetzner offered cheap and reliable servers, and they had a good reputation. Why did they stop? While there is no official reason, it is likely due to a change in NVIDIA's licensing: NVIDIA updated their driver license to ban the use of consumer GPUs (e.g. the 1080 and 2080 models) in data centers. As a result, most large server providers stopped offering cheap GPU servers.

Using a cloud provider

Google Cloud, AWS, and Azure all offer GPU machines. This is the most expensive option on this list, but in theory you can scale your cluster's size on demand. They offer GPUs for training (e.g. the V100) and for inference (e.g. the T4).

My experience: some providers run unscheduled maintenance on your machine, meaning they kill your instance to migrate it to another host (the content of the disk is kept). You get a 1 hour termination notice on GCP, more on the others. It is very inconvenient to start a large training run over the weekend, only to realize that your machine was killed on Friday evening. On top of that, some regions occasionally run out of GPUs, in which case attempts to create a machine simply fail. This does not happen often, but when it does it is very annoying.
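If you run on GCP, one partial mitigation is to watch the metadata server for an upcoming maintenance event and checkpoint before the instance goes down. Here is a minimal Python sketch: the metadata path is GCP's documented maintenance-event endpoint, but the save_checkpoint callback is a hypothetical hook into your own training code.

```python
import time

import requests

# GCP publishes upcoming host maintenance on the instance metadata server.
# GPU instances cannot live-migrate, so a pending event means the VM will
# be terminated (with roughly one hour of notice).
METADATA_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                "instance/maintenance-event")


def maintenance_pending() -> bool:
    resp = requests.get(METADATA_URL,
                        headers={"Metadata-Flavor": "Google"}, timeout=5)
    resp.raise_for_status()
    return resp.text != "NONE"


def watch(save_checkpoint, poll_seconds=60):
    """Poll for maintenance and checkpoint before the VM is killed."""
    while True:
        if maintenance_pending():
            save_checkpoint()  # hypothetical hook into your training loop
            return
        time.sleep(poll_seconds)
```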

Pros:

  • Scaling on demand (limited by quota and availability)

  • Can pick any number of CPUs (useful for preprocessing-intensive jobs)

Cons:

  • Unscheduled maintenance is a pain (1 hour notice for GCP, ~24 hours for AWS, can happen once a week)

  • Expensive

Using preemptible instances

Most cloud providers offer preemptible machines at a significant discount (at least 50%, often more). In exchange, you accept that your machine can be killed at any moment. This is inconvenient for training: you have to save checkpoints every epoch and resume automatically after each preemption, and working around that takes a fair amount of engineering (a sketch of the core pattern follows below).

My experience: my instances were sometimes killed in less than an hour, which made them unusable. Try it out and see if it works for you (it might depend on the region).
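The core of the workaround is just disciplined checkpointing. Below is a minimal sketch in PyTorch; the framework choice, the checkpoint path, and the train_one_epoch helper are assumptions for illustration, not something from my setup.

```python
import os

import torch

# Assumed path on a disk that survives preemption (e.g. a persistent disk).
CKPT_PATH = "/mnt/persistent/checkpoint.pt"


def save_checkpoint(model, optimizer, epoch):
    # Write to a temp file then rename, so a preemption mid-write
    # cannot leave a corrupted checkpoint behind.
    tmp = CKPT_PATH + ".tmp"
    torch.save({"epoch": epoch,
                "model_state": model.state_dict(),
                "optimizer_state": optimizer.state_dict()}, tmp)
    os.replace(tmp, CKPT_PATH)


def load_checkpoint(model, optimizer):
    # Returns the epoch to resume from (0 on a fresh start).
    if not os.path.exists(CKPT_PATH):
        return 0
    state = torch.load(CKPT_PATH)
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["epoch"] + 1


# The loop itself: a restarted instance picks up where the last one left off.
# start_epoch = load_checkpoint(model, optimizer)
# for epoch in range(start_epoch, num_epochs):
#     train_one_epoch(model, optimizer)  # hypothetical helper
#     save_checkpoint(model, optimizer, epoch)
```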

Pros:

  • Cheaper

  • Scalable

Cons:

  • Machine can be killed at any moment

Renting a bare metal machine

Some providers still offer servers with consumer GPUs, officially not for deep learning. A Google search will yield plenty of them, and prices vary from provider to provider.

My experience: reliability is not great. I made the mistake of using one of those servers as a production server, and it went down on a Saturday at 1 am.

YMMV: you have to make your own trade-off between price and reliability.

Pros:

  • Cheap and plentiful

  • No weekly maintenance events

Cons:

  • Sometimes unreliable (YMMV)

  • Does not scale as quickly as a regular cloud provider (you need to order the machine, and sometimes commit to a month)

Subletting a server

I never tried this, but vast.ai is a marketplace offering very affordable prices. Anyone can list their GPU there, so I am not exactly sure how reliable it is.

Building your own GPU server

If you have the time and the rack space, building your own GPU machine might be the cheapest option. Depending on how cheap you need to go, keep an eye out for used GPUs on eBay. Keep in mind that you will have to pay for electricity, and that having a noisy machine heating your office in the middle of summer is the best way to turn your colleagues into enemies.
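To put a rough number on the electricity point, here is a back-of-the-envelope calculation. Every figure in it (card wattage, utilization, price per kWh) is an assumption; substitute your own hardware and local rates.

```python
# Every number here is an assumption; plug in your own hardware and rates.
gpu_watts = 2 * 250         # two 2080 Ti class cards, ~250 W each under load
rest_of_system_watts = 150  # CPU, drives, fans, PSU losses
utilization = 0.7           # fraction of the month spent training
price_per_kwh = 0.30        # EUR, ballpark German residential rate in 2020

kwh_per_month = (gpu_watts + rest_of_system_watts) / 1000 * 24 * 30 * utilization
print(f"~{kwh_per_month:.0f} kWh/month, ~{kwh_per_month * price_per_kwh:.0f} EUR/month")
# -> ~328 kWh/month, ~98 EUR/month
```

Under these assumptions, electricity alone approaches what Hetzner used to charge for the whole server.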

Pros:

  • Cheapest option (depending on electricity cost)

  • Custom specs (useful if you need plenty of storage)

Cons:

  • Time consuming

  • Not convenient (noise, heat)

What we ended up doing at Photoroom

For training, we built our own machine (using 2080 Tis). For larger training runs, we use GCP with V100s and cross our fingers that there will not be a maintenance event. For inference, we use GCP's T4 GPUs in a managed instance group: if a machine has to be killed for maintenance, a new one is automatically spun up to replace it. A sketch of how an inference worker can handle that replacement gracefully follows below.
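If you serve from a managed instance group, it is worth handling the shutdown side as well. Here is a minimal sketch, assuming your server process receives a SIGTERM when the instance is being replaced; the get_request and run_inference helpers are hypothetical.

```python
import signal
import sys

shutting_down = False


def handle_sigterm(signum, frame):
    # Stop accepting new work; the load balancer should already be
    # draining traffic away from this instance.
    global shutting_down
    shutting_down = True


signal.signal(signal.SIGTERM, handle_sigterm)


def serve_forever(get_request, run_inference):  # hypothetical helpers
    while not shutting_down:
        request = get_request(timeout=1.0)
        if request is not None:
            run_inference(request)
    sys.exit(0)  # exit cleanly once in-flight work is done
```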

Conclusion

Please keep in mind that I am not endorsing any of these options; pick one at your own risk. In the end, it is a trade-off between price, convenience, reliability, and scalability. Also note that running inference on CPUs can be cheaper.

Any ideas on how to improve this? Any comments? Reach out on Twitter.

Eliot Andres, CTO & Co-founder @ Photoroom
