There are only two ways to make money: increase your income or reduce your expenses. This is true for individuals as well as organizations. In my current organization we were facing a huge challenge of unnecessary spend on AWS; a big chunk of our income was going straight to AWS bills. In this post I am going to dissect some of the critical problems we found and the optimizations we did to reduce our AWS cost.

The Beginning
After recognizing the problem, we formed a team to focus solely on AWS cost optimization. I was part of this team, and my job was to find the critical bottlenecks and solve them.

We started by checking our AWS bills and found that the majority of the cost came from RDS. That was very evident, since we were running an 8xlarge RDS instance to process around 13K applications daily. So we shifted our focus to optimizing RDS usage.

**Milestone 1**

We started with AWS Performance Insights to get an idea of what was consuming most of the RDS CPU. We found that the top 10 queries were a handful of database functions heavily used by the system. To solve this, we decided to convert these functions and procedures to Java code.

We wanted a benchmark to verify our changes, so we built a performance test setup and first measured the existing code against it. Then we converted these procedures to Java code, completely eliminating the functions, and saw a huge performance boost and much lower DB load. A simplified sketch of what such a conversion can look like is shown below.
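The post does not include our actual functions, so here is a minimal, hypothetical sketch of the kind of conversion: a PL/pgSQL function that aggregated rows and applied branching logic on the DB is replaced by a plain JDBC query plus the same logic in the application tier. The table and column names (`application_steps`, `completed`) are made up for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ApplicationStatusService {

    private final javax.sql.DataSource dataSource;

    public ApplicationStatusService(javax.sql.DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Before: SELECT compute_application_status(?) -- a PL/pgSQL function doing
    // the aggregation and branching on the DB, burning RDS CPU on every call.
    // After: pull only the raw rows and do the logic in Java.
    public String computeApplicationStatus(long applicationId) throws SQLException {
        String sql = "SELECT completed FROM application_steps WHERE application_id = ?";
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, applicationId);
            int total = 0;
            int done = 0;
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    total++;
                    if (rs.getBoolean("completed")) {
                        done++;
                    }
                }
            }
            // Branching that previously lived inside the stored function.
            if (total == 0) return "NOT_STARTED";
            if (done == total) return "COMPLETED";
            return "IN_PROGRESS";
        }
    }
}
```

The DB now only returns raw rows over an indexed lookup, and the per-call CPU moves to the application servers, which are much cheaper to scale than an 8xlarge RDS instance.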

After observing the DB for a few days, we took a call to downsize the instance from 8xlarge to 4xlarge. To our surprise, it held up well.

**Milestone 2**

The challenge was still not over; we knew RDS could be optimized further considering the load we were handling. So our next target was DB storage. We observed that 80% of our storage was consumed by one of our audit tables, which stored the request and response of each API call.

The challenge here was that we could not completely eliminate this table, as many of our teams used it for analysis. We started evaluating alternative databases and storage options like DynamoDB, EFS, ClickHouse, and S3. After some analysis, we decided to keep the table in RDS itself, but move the request/response payloads to S3 and store only a reference to the S3 object in the table. This reduced our table size to a good extent. A rough sketch of this pattern is shown below.
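This is a minimal sketch of the offloading idea, assuming the AWS SDK for Java v2 and made-up names (`my-audit-payloads` bucket, `api_audit` table, `payload_s3_key` column); our production code also handles JSON serialization and error handling properly.

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.UUID;

public class AuditWriter {

    private static final String BUCKET = "my-audit-payloads"; // hypothetical bucket name

    private final S3Client s3;
    private final javax.sql.DataSource dataSource;

    public AuditWriter(S3Client s3, javax.sql.DataSource dataSource) {
        this.s3 = s3;
        this.dataSource = dataSource;
    }

    public void writeAudit(String apiName, String requestJson, String responseJson) throws SQLException {
        // 1. Push the heavy payload to S3 instead of storing it in the audit table.
        String key = "audit/" + apiName + "/" + UUID.randomUUID() + ".json";
        String payload = "{\"request\":" + requestJson + ",\"response\":" + responseJson + "}";
        s3.putObject(PutObjectRequest.builder().bucket(BUCKET).key(key).build(),
                RequestBody.fromString(payload));

        // 2. Keep only a small row in RDS with a pointer to the S3 object.
        String sql = "INSERT INTO api_audit (api_name, payload_s3_key, created_at) VALUES (?, ?, now())";
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, apiName);
            ps.setString(2, key);
            ps.executeUpdate();
        }
    }
}
```

The RDS row shrinks from potentially hundreds of kilobytes of JSON to a short string key, while analysts can still fetch the full payload from S3 when they need it; S3 storage is also far cheaper per GB than RDS storage.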

While analyzing the storage issue of the audit table, we realized that its insert was synchronous and consuming a good amount of CPU. So we optimized it to insert asynchronously and in batches (a sketch of this pattern follows). To our surprise, after releasing this to production we saw no significant reduction in CPU utilization.
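A minimal sketch of async, batched auditing, assuming an in-memory queue flushed on a timer and the same hypothetical `api_audit` table as above; the queue bound, batch size, and flush interval are illustrative numbers, not our production values.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class AsyncAuditInserter {

    record AuditRow(String apiName, String payloadS3Key) {}

    private final LinkedBlockingQueue<AuditRow> queue = new LinkedBlockingQueue<>(100_000);
    private final javax.sql.DataSource dataSource;
    private final ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();

    public AsyncAuditInserter(javax.sql.DataSource dataSource) {
        this.dataSource = dataSource;
        // Flush whatever has accumulated every second instead of one INSERT per API call.
        flusher.scheduleWithFixedDelay(this::flush, 1, 1, TimeUnit.SECONDS);
    }

    /** Called from the request path: non-blocking, just enqueue. */
    public void audit(String apiName, String payloadS3Key) {
        queue.offer(new AuditRow(apiName, payloadS3Key)); // drops on overflow rather than blocking
    }

    private void flush() {
        List<AuditRow> batch = new ArrayList<>();
        queue.drainTo(batch, 1_000);
        if (batch.isEmpty()) {
            return;
        }
        String sql = "INSERT INTO api_audit (api_name, payload_s3_key, created_at) VALUES (?, ?, now())";
        try (Connection conn = dataSource.getConnection();
             PreparedStatement ps = conn.prepareStatement(sql)) {
            for (AuditRow row : batch) {
                ps.setString(1, row.apiName());
                ps.setString(2, row.payloadS3Key());
                ps.addBatch();
            }
            ps.executeBatch(); // one round trip for up to 1,000 rows
        } catch (SQLException e) {
            // In a real system, retry or write to a dead-letter store; omitted here.
            e.printStackTrace();
        }
    }
}
```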

Alongside this activity we were optimizing one of our DB functions, and in the process we realized that a lot of frequently used tables were missing indexes. We added indexes on those tables and got a pleasant surprise: DB utilization dropped drastically. So we took a call to downsize the DB from 4xlarge to 2xlarge. A rough sketch of one way to hunt for missing indexes follows.
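We found our missing indexes while reworking a DB function, but PostgreSQL's statistics views can surface the same candidates. This is a hedged sketch, not our exact process: it queries the standard `pg_stat_user_tables` view for tables that are sequentially scanned far more often than they are index-scanned, with arbitrary thresholds chosen for illustration.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class MissingIndexReport {

    // Lists tables that are scanned sequentially far more often than via an index,
    // which is usually the first hint that an index is missing.
    public static void printCandidates(javax.sql.DataSource dataSource) throws SQLException {
        String sql = """
                SELECT relname, seq_scan, idx_scan, n_live_tup
                FROM pg_stat_user_tables
                WHERE seq_scan > 1000 AND (idx_scan IS NULL OR seq_scan > 10 * idx_scan)
                ORDER BY seq_scan DESC
                """;
        try (Connection conn = dataSource.getConnection();
             Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            while (rs.next()) {
                System.out.printf("%s: seq_scan=%d idx_scan=%d rows=%d%n",
                        rs.getString("relname"), rs.getLong("seq_scan"),
                        rs.getLong("idx_scan"), rs.getLong("n_live_tup"));
            }
        }
    }
}
```

Each candidate still needs an EXPLAIN on the actual slow queries before adding an index; creating indexes with `CREATE INDEX CONCURRENTLY` avoids blocking writes on a live table.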

**Milestone 3**

We were using Postgres version 13. After further discussions with the AWS team, we got two suggestions:

  • Upgrade to Postgres version 15, which would give us a performance boost.

  • Move to an IO-optimized instance class, considering our workload and activity.

As part of Milestone 2 we had also figured out that we needed to move older audit data out of the active DB, both to reduce our storage cost further and to shrink our active DB size.

This was a large activity and needed a long downtime window. We took the downtime and executed the first two activities (the Postgres upgrade and the instance change), but moving the audit data was too slow and did not complete in time. So we wrote a script that moved this data in batches; a simplified sketch is below. After the migration completed and the old data was deleted, our active DB size was reduced by 70%.
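A minimal sketch of batched archival, again using the hypothetical `api_audit` table and an assumed `api_audit_archive` table with the same columns. It uses a data-modifying CTE so each batch deletes from the live table and inserts into the archive in one short transaction, keeping lock times and WAL pressure manageable while the system stays online.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.time.LocalDateTime;

public class AuditArchiver {

    private static final int BATCH_SIZE = 5_000; // illustrative batch size

    // Moves rows older than the cutoff into an archive table in small batches,
    // repeating until no old rows remain.
    public static void archiveOldRows(javax.sql.DataSource dataSource, LocalDateTime cutoff) throws SQLException {
        String sql = """
                WITH moved AS (
                    DELETE FROM api_audit
                    WHERE id IN (
                        SELECT id FROM api_audit WHERE created_at < ? LIMIT ?
                    )
                    RETURNING *
                )
                INSERT INTO api_audit_archive SELECT * FROM moved
                """;
        int moved;
        do {
            try (Connection conn = dataSource.getConnection();
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setTimestamp(1, Timestamp.valueOf(cutoff));
                ps.setInt(2, BATCH_SIZE);
                moved = ps.executeUpdate(); // rows archived in this batch
            }
        } while (moved > 0);
    }
}
```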

Our Learnings:

  1. Don't put load on the DB by using lots of DB procedures and functions; DB CPU is costly.

  2. The first thing to look for when doing DB optimization is indexing.

  3. Don't just focus on the DB's CPU; IOPS and storage cost the most.