So You Suddenly Need to Reduce Your AWS Bill: 4 Things We Did
By Matt Simoneau and Daniel Cohen
With nationwide COVID lockdowns and shelter-in-place orders, people are spending less money and businesses are losing revenue. Its impact varies dramatically between industries. Travel and dining? Terrible. Logistics? Bad. Streaming? Incredible. Either way, the pandemic has forced everyone, from businesses to individuals, to reevaluate expenses. As engineering leaders, we can create major cost savings for our organizations simply by tackling something we constantly say we'll get to later but never have time for: tech debt. And as the name implies, it can have a large effect on operational costs in the cloud.
Here’s what we did to save 30-40% on our infrastructure costs without compromising on developer or client experience:
#1 CloudTrail
AWS charges per management API event recorded in CloudTrail. Events are triggered any time a change is made to your infrastructure, whether you know it or not. For us, tens of millions of events get created every month, and it's easy for a misconfiguration to cause you to be charged for each of them multiple times. In our case, we had two trails configured to capture the same management API events, increasing fees by thousands of dollars with no added benefit. Luckily, once we discovered and fixed the issue, AWS was able to refund us for part of the wasted spend!
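A quick way to spot this class of problem is to check how many of your trails capture management events; any beyond the first is billable duplication. Here's a minimal sketch of that check. The trail dicts and names are illustrative; in a real audit you'd pull them via boto3's `cloudtrail` client (`describe_trails` plus `get_event_selectors`).

```python
# Sketch: flag CloudTrail trails that each record the same management
# API events, causing duplicate per-event charges. Trail data below is
# illustrative; in practice, fetch it with boto3's cloudtrail client.

def duplicate_management_trails(trails):
    """Return every trail beyond the first that captures management events."""
    capturing = [
        t for t in trails
        if t.get("IncludeManagementEvents", True)  # capturing is the default
    ]
    return capturing[1:]  # the first trail is the one worth keeping

trails = [
    {"Name": "org-audit-trail", "IncludeManagementEvents": True},
    {"Name": "legacy-trail", "IncludeManagementEvents": True},
    {"Name": "s3-data-events-only", "IncludeManagementEvents": False},
]

for t in duplicate_management_trails(trails):
    print(f"{t['Name']} duplicates management events; consider removing it")
```

In our case, the fix was simply deleting the redundant trail; the remaining one still captured everything we needed for auditing.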
#2 EC2 Instances
This is another quick and easy one, and you'll find it on every top-5 list for reducing AWS costs: make sure your usage is appropriately sized and not over-provisioned. Companies routinely overestimate their processing needs when they move their infrastructure to AWS. Using Amazon's Compute Optimizer tool, you can compare historical usage against how much you provisioned. Compute Optimizer will tell you the margin, allowing you to safely downsize and create cost savings.
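The core idea is simple enough to sketch yourself: if even peak utilization sits far below capacity, the instance is a downsizing candidate. The 40% threshold below is our own assumption, not a Compute Optimizer value, and the CPU samples stand in for metrics you'd pull from CloudWatch.

```python
# Sketch: decide whether an instance looks over-provisioned from its
# CPU utilization history, similar in spirit to what Compute Optimizer
# reports. The 40% peak threshold is an assumption, not an AWS default.

def is_overprovisioned(cpu_samples, peak_threshold=40.0):
    """True if even peak CPU utilization stays well below capacity."""
    return max(cpu_samples) < peak_threshold

# Hourly CPU utilization (%) pulled from CloudWatch for one instance.
cpu_history = [3.1, 5.4, 12.0, 8.7, 22.5, 18.3, 9.9]

if is_overprovisioned(cpu_history):
    print("Peak CPU under 40%: consider a smaller instance type")
```

In practice you'd want memory and network metrics too, and a window long enough to capture your busiest days, which is exactly what Compute Optimizer's historical analysis gives you for free.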
#3 S3 Storage Classes
Do you have your assets in the right storage class? WELL, DO YOU?! (Sorry… I'll calm down.)
Here's where the 'Spring Cleaning' really comes into play. We went in and found a ton of duplicate logs and data that we didn't need to retain anymore. For data you don't need to access often, consider moving it into Infrequent Access or even Glacier. The costs can be half to a quarter of what you're paying in the Standard tier, and you can use Lifecycle rules to tell AWS when to move objects into cheaper storage classes automatically. Here's a real-world example: every time we generate a bill for a patient, we render and store a PDF for later use. These rendered bills are much more likely to be accessed soon after they're generated, and much less likely to be accessed later on. So after a few months, we transition the PDFs to Infrequent Access, which costs us less per month but more per retrieval. This trade-off is ideal for situations like this, where data still needs to be available but is unlikely to be accessed.
#4 Unused Services
Like many young companies, we move very quickly, deploying fixes and features to production dozens of times a day. While most of these tools and features are used every day by thousands of users across the country, some never make it past the testing phase. As we audited our AWS account, we found several unused resources, like DynamoDB tables with provisioned capacity. This meant we were getting charged for that capacity even though it wasn't being used. Just deleting these tables and their reserved capacity saved us nearly $800 a month. That's a lot of Chick-fil-A!
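It's worth doing this arithmetic for your own idle tables; the numbers add up faster than you'd expect. A minimal sketch, assuming approximate us-east-1 list prices per capacity-unit-hour (your region and current pricing may differ):

```python
# Sketch: estimate the monthly cost of an idle DynamoDB table's
# provisioned capacity. Rates are approximate us-east-1 list prices
# per capacity-unit-hour; check current pricing for your region.

HOURS_PER_MONTH = 730
RCU_HOURLY = 0.00013   # read capacity unit
WCU_HOURLY = 0.00065   # write capacity unit

def monthly_capacity_cost(rcu, wcu):
    return HOURS_PER_MONTH * (rcu * RCU_HOURLY + wcu * WCU_HOURLY)

# A forgotten test table provisioned generously "just in case":
cost = monthly_capacity_cost(rcu=500, wcu=500)
print(f"${cost:.2f}/month for capacity nobody is using")
```

For tables you do need but whose traffic is spiky or unpredictable, switching to on-demand capacity mode is another way to stop paying for idle throughput.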
In addition to DynamoDB, a large portion of our AWS spend goes to Relational Databases (RDS). For some of our lower-traffic services, with lower availability requirements, we chose to host multiple schemas on a single cluster, saving several hundred dollars a month in RDS costs.
What we could have done but didn’t do
Astute readers might have a few questions at this point, like why we don't use spot instances, or turn down unused capacity at night. The answer begins with Kubernetes, which we use to orchestrate our workloads.
Kubernetes doesn't like being turned off; its state-storage layer, etcd, depends on the Raft consensus algorithm to maintain consistency. If too many nodes are lost, it requires manual intervention to restart, making a nightly cluster shutdown nearly impossible. Instead, we've used horizontal and vertical autoscaling on our clusters for several years, allowing us to automatically reduce cluster capacity during periods of low load and get most of the benefits of turning off instances on a schedule.
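For the horizontal side, a HorizontalPodAutoscaler manifest is all it takes. This is a minimal sketch; the deployment name, replica bounds, and 70% CPU target are illustrative values, not our production configuration.

```yaml
# Illustrative HPA: scale a deployment between 2 and 20 replicas,
# targeting 70% average CPU utilization across pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: billing-api        # placeholder name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: billing-api      # placeholder deployment
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Pairing this with the cluster autoscaler means that when pods scale down overnight, the underlying EC2 nodes scale down with them, which is where the actual cost savings come from.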
Spot instances are another common technique for saving money on EC2. However, when we provisioned our instances, we purchased reserved instances, saving us money each month at the cost of committing to a particular instance type for a year. As those contracts expire, we'll evaluate the suitability of spot instances for our workloads.
The biggest takeaway is that when engineering teams move fast, we often forget things along the way. Big market shifts like COVID are a good opportunity for us to tackle tech debt, and prepare our businesses for future success.
To hear from us from time to time, subscribe below.