Cutting Our AWS Bill by a Third in Two Weeks

Paul Pounder
April 18, 2026

When you're building a platform with real ambition but a startup budget, every dollar matters. Our AWS bill had settled at around $129/month — not terrible for a live sports broadcasting platform with 14 ETL jobs, a NAT Gateway, ECS Fargate containers, and a fleet of Lambdas. But 87% of that cost was fixed, regardless of whether we were covering 10 fixtures or zero.

So we commissioned a cost review, identified three waves of optimisation, and shipped Wave 1 in a single session. Here's what happened.

What We Built

Wave 1 was designed as "bill hygiene + quick wins" — low engineering effort, low risk, shipped before the English football season ends so we can validate under live fixture load.

10 of 11 tasks are complete; the last one, the exit metric, needs 7 days of data before it can be measured. The main changes:

  • Migrated the SportMonks API token from Secrets Manager to SSM Parameter Store SecureString. Sounds boring, but Secrets Manager charges per secret and per API call. SSM SecureString is free at our scale. The live poller fetches the token every minute, so those pennies add up.
  • Slimmed the SportMonks API call by dropping ballCoordinates and formations from the include string. These are large payloads we're not using yet (Phase 3 will add them back selectively). Less data over the NAT Gateway, smaller DynamoDB writes.
  • Added adaptive polling — when every active fixture is in a break state (half-time, awaiting extra time, etc.), the poller widens from 10-second to 15-second intervals. A 33% reduction in API calls during dead time.
  • Killed a duplicate EventBridge cron rule — the Daily Schedule Analyzer was running at both 23:01 UTC and 00:01 UTC. The analyzer is idempotent, so the second run was pure waste.
  • Turned off S3 versioning on Silver and Gold data lake buckets. These are deterministically rebuildable from Bronze, so versioning was just burning storage. Added Intelligent-Tiering to Silver for good measure.
  • Set up 8 CloudWatch billing alarms — per-service alerts for Glue ($50), ECS ($40), NAT Gateway ($40), CloudWatch ($10), and Bedrock ($15), plus total spend thresholds at $75, $125, and $200.
  • Confirmed no orphan Kinesis streams — the cost review flagged a $0.57 charge, but it turned out to be a trailing billing artefact.
  • Cleaned up dead code — three Lambdas were creating boto3.client('secretsmanager') without ever calling it. Removed.
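
The eight billing alarms above can be written down as data. What follows is a minimal sketch, assuming the standard AWS/Billing EstimatedCharges metric (published in us-east-1 once billing alerts are enabled). The thresholds come from the list above; the ServiceName dimension values, the alarm names, and the helpers build_billing_alarms and create_alarms are illustrative assumptions, not the configuration actually deployed.

```python
# Thresholds in USD (month-to-date), taken from the list above.
# The ServiceName values are assumptions: check which dimensions your
# account actually publishes before relying on them.
SERVICE_THRESHOLDS = {
    "AWSGlue": 50,
    "AmazonECS": 40,
    "AmazonEC2": 40,  # assumption: NAT Gateway charges are reported under EC2
    "AmazonCloudWatch": 10,
    "AmazonBedrock": 15,
}
TOTAL_THRESHOLDS = [75, 125, 200]


def build_billing_alarms(sns_topic_arn):
    """Return one kwargs dict per alarm, shaped for CloudWatch put_metric_alarm."""

    def alarm(name, threshold, dimensions):
        return {
            "AlarmName": name,
            "Namespace": "AWS/Billing",
            "MetricName": "EstimatedCharges",
            "Dimensions": dimensions,
            "Statistic": "Maximum",
            "Period": 21600,  # 6 hours; billing metrics update only a few times a day
            "EvaluationPeriods": 1,
            "Threshold": float(threshold),
            "ComparisonOperator": "GreaterThanThreshold",
            "AlarmActions": [sns_topic_arn],
        }

    currency = {"Name": "Currency", "Value": "USD"}
    alarms = [
        alarm(f"billing-{svc}-{usd}", usd, [{"Name": "ServiceName", "Value": svc}, currency])
        for svc, usd in SERVICE_THRESHOLDS.items()
    ]
    alarms += [alarm(f"billing-total-{usd}", usd, [currency]) for usd in TOTAL_THRESHOLDS]
    return alarms


def create_alarms(cloudwatch_client, sns_topic_arn):
    # cloudwatch_client is expected to be boto3.client("cloudwatch", region_name="us-east-1")
    for kwargs in build_billing_alarms(sns_topic_arn):
        cloudwatch_client.put_metric_alarm(**kwargs)
```

Keeping the definitions in a pure function means the alarm set can be checked in a unit test without touching AWS; only create_alarms needs credentials.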

Key Decisions

Why only migrate SportMonks, not Confluent? The Confluent Kafka credentials are a JSON object with multiple fields (bootstrap servers, API key, API secret) and are consumed by ECS Fargate, which has native Secrets Manager integration. Breaking them into separate SSM parameters would be messier than the ~$0.40/month saving justifies.

Why not just delete the Secrets Manager secret manually? We didn't have to — CDK handled it. When we replaced the secretsmanager.Secret construct with an ssm.StringParameter reference, CloudFormation automatically deleted the old secret during deploy. Clean cutover with zero manual steps.

Why adaptive polling instead of just widening the interval permanently? Because during active play the polling interval bounds how stale our data can be: at 10 seconds a goal surfaces within about 10 seconds, while at 15 seconds it can go unseen for up to 15. The break-state detection is cheap (checking developer_name against a set of known states) and the benefit is targeted: we only slow down when nothing can possibly happen.
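
The 33% figure follows directly from the two intervals. A quick worked check, using only the 10-second and 15-second values from the post (the helper name polls_per_hour is made up for illustration):

```python
def polls_per_hour(interval_seconds):
    """Number of poll cycles that fit in one hour at a fixed interval."""
    return 3600 // interval_seconds


active = polls_per_hour(10)    # normal cadence
in_break = polls_per_hour(15)  # widened cadence during breaks
reduction = (active - in_break) / active

print(f"{active} -> {in_break} polls/hour, {reduction:.0%} fewer")
# prints "360 -> 240 polls/hour, 33% fewer"
```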

How It Works

The core change to the poller is simple. After processing all fixtures in each cycle, we check if every active fixture is in a break state:

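# developer_name values that mean no play is in progress
BREAK_STATE_NAMES = {'HT', 'HALF_TIME', 'FT', 'AET', 'FT_PEN',
                     'ET_BREAK', 'AWAITING_ET', 'AWAITING_PEN', 'BREAK'}

# True only when there is at least one fixture and all of them are in a break state
all_in_break = bool(fixtures) and all(
    # "or {}" guards against a fixture whose 'state' is present but None
    (f.get('state') or {}).get('developer_name', '') in BREAK_STATE_NAMES
    for f in fixtures
)
# Widen the interval (15 s) during breaks; otherwise keep the normal cadence (10 s)
sleep_interval = BREAK_POLL_INTERVAL_SECONDS if all_in_break else POLL_INTERVAL_SECONDS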
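
To see how that check behaves on edge cases, here is the same logic wrapped in a standalone function. This is a sketch, not the poller's actual code: the function name choose_sleep_interval and the sample state INPLAY_1ST_HALF are assumptions, and the interval values are the 10 and 15 seconds quoted earlier.

```python
POLL_INTERVAL_SECONDS = 10        # normal cadence, from the post
BREAK_POLL_INTERVAL_SECONDS = 15  # widened cadence, from the post

BREAK_STATE_NAMES = {'HT', 'HALF_TIME', 'FT', 'AET', 'FT_PEN',
                     'ET_BREAK', 'AWAITING_ET', 'AWAITING_PEN', 'BREAK'}


def choose_sleep_interval(fixtures):
    """Return the widened interval only if every fixture is in a break state."""
    all_in_break = bool(fixtures) and all(
        (f.get('state') or {}).get('developer_name', '') in BREAK_STATE_NAMES
        for f in fixtures
    )
    return BREAK_POLL_INTERVAL_SECONDS if all_in_break else POLL_INTERVAL_SECONDS


half_time = {'state': {'developer_name': 'HT'}}
in_play = {'state': {'developer_name': 'INPLAY_1ST_HALF'}}  # hypothetical live state

print(choose_sleep_interval([half_time, half_time]))  # 15: everything is paused
print(choose_sleep_interval([half_time, in_play]))    # 10: one match is still live
print(choose_sleep_interval([]))                      # 10: no fixtures, stay on the default
```

The bool(fixtures) guard matters because all() over an empty list is True; without it an empty cycle would be treated as "everything is in a break".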

For the SSM migration, both the live_delta_poller and batch_ingest Lambdas swap one API call:

# Before (Secrets Manager)
response = secrets_client.get_secret_value(SecretId=SPORTMONKS_SECRET_NAME)
token = response.get('SecretString')

# After (SSM Parameter Store)
response = ssm_client.get_parameter(Name=SPORTMONKS_SSM_PARAM, WithDecryption=True)
token = response['Parameter']['Value']
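
Because the poller runs every minute, one optional refinement is to cache the decrypted token in the warm execution environment so that not every invocation hits SSM. The post does not say whether the Lambdas do this; the sketch below is an assumption-laden illustration, with get_token, the TTL, and the injected clock all hypothetical. Only the get_parameter call shape is taken from the snippet above.

```python
import time

# Module-level cache: survives between invocations in a warm Lambda environment.
_TOKEN_CACHE = {}


def get_token(ssm_client, name, ttl_seconds=300, now=time.monotonic):
    """Fetch a SecureString parameter, reusing a cached value until the TTL expires."""
    cached = _TOKEN_CACHE.get(name)
    if cached is not None and now() - cached[1] < ttl_seconds:
        return cached[0]

    response = ssm_client.get_parameter(Name=name, WithDecryption=True)
    token = response['Parameter']['Value']
    _TOKEN_CACHE[name] = (token, now())
    return token
```

In a Lambda this would be called as get_token(boto3.client('ssm'), SPORTMONKS_SSM_PARAM). Passing the client in keeps the function testable with a stub, and the TTL bounds how long a rotated token would go unnoticed.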

The CDK stack changes were more involved — renaming environment variables, swapping IAM grants, updating all 14 entity ingestion pipeline constructs in the Medallion stack — but the pattern is mechanical.

What I Learned

Deploy, then watch the logs. We caught a scoping bug in the adaptive polling code during live validation — fixtures was only defined inside an if sm_data block but referenced outside it. The fix was a one-liner (fixtures = [] before the block), but it would have been invisible without checking CloudWatch logs within minutes of deploying. The old Lambda execution environments cached the buggy code for a few minutes before being replaced — a good reminder that Lambda deploys aren't instant.
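
A stripped-down reproduction of that class of bug, for illustration only. The function names and the shape of sm_data (a dict with a "data" list) are assumptions; the real poller code is not shown in this post.

```python
def count_fixtures_buggy(sm_data):
    if sm_data:
        fixtures = sm_data.get("data", [])
    # When sm_data is empty or None, fixtures was never assigned.
    return len(fixtures)


def count_fixtures_fixed(sm_data):
    fixtures = []  # the one-line fix: bind the name before the conditional
    if sm_data:
        fixtures = sm_data.get("data", [])
    return len(fixtures)


try:
    count_fixtures_buggy(None)
except UnboundLocalError as exc:
    print("buggy:", type(exc).__name__)  # prints "buggy: UnboundLocalError"

print("fixed:", count_fixtures_fixed(None))  # prints "fixed: 0"
```

The buggy version works whenever the API returns data, which is why it only shows up in the logs on a cycle with an empty response.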

CDK is surprisingly good at cleanup. I expected to manually delete the old Secrets Manager secret and the duplicate EventBridge rule. CloudFormation handled both automatically during the stack update — it even cleaned up the DLQ attached to the removed rule.

Fixed costs are the enemy. 87% of our bill is fixed regardless of traffic. The real savings come from Waves 2 and 3: removing the NAT Gateway entirely (~$35/mo), downsizing Glue jobs to Python Shell for low-volume entities (~$20/mo), and making Fargate scaling fixture-aware instead of time-based (~$15/mo).
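
A rough worked version of those numbers, using only figures quoted in this post and assuming a 30-day month. It ignores Wave 1 savings and any change in usage-based costs, so treat it as a back-of-the-envelope check rather than a forecast.

```python
MONTHLY_BILL = 129   # USD/month, from the introduction
FIXED_SHARE = 0.87   # share of the bill that does not vary with traffic

# Estimated monthly savings for Waves 2 and 3, as listed above
PLANNED_SAVINGS = {
    "remove_nat_gateway": 35,
    "glue_python_shell": 20,
    "fixture_aware_fargate": 15,
}

fixed_cost = MONTHLY_BILL * FIXED_SHARE
total_savings = sum(PLANNED_SAVINGS.values())
remaining = MONTHLY_BILL - total_savings

print(f"fixed ~ ${fixed_cost:.0f}/mo, planned savings ${total_savings}/mo, "
      f"remaining ~ ${remaining}/mo (${remaining / 30:.2f}/day)")
# prints "fixed ~ $112/mo, planned savings $70/mo, remaining ~ $59/mo ($1.97/day)"
```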

What's Next

Wave 2 (days 15–45) is the big one: de-VPC-ing 8 stateless Lambdas so we can delete the NAT Gateway entirely. That's $35/month gone. Then downsizing 7 low-volume Glue ETL jobs from PySpark (G.1X, $0.44/DPU-hour) to Python Shell, which can run on 1/16th of a DPU, or about $0.0275 per hour at the same DPU-hour rate. And making Fargate scaling fixture-aware: why run the agent hub container on a quiet Monday when there are no matches?
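
Glue bills by the DPU-hour, so a job's hourly cost is its DPU count times the rate. A small worked comparison, using the $0.44 rate quoted above and assuming the smallest Spark configuration is 2 G.1X workers at 1 DPU each (check the current Glue pricing and limits for your region):

```python
DPU_HOUR_RATE = 0.44  # USD per DPU-hour, from the post


def hourly_cost(dpus, rate=DPU_HOUR_RATE):
    """Cost of one hour of runtime at a given DPU allocation."""
    return dpus * rate


spark_min = hourly_cost(2)          # assumption: 2 x G.1X workers, 1 DPU each
python_shell = hourly_cost(1 / 16)  # Python Shell at 1/16 DPU
ratio = spark_min / python_shell

print(f"Spark ${spark_min:.4f}/h vs Python Shell ${python_shell:.4f}/h ({ratio:.0f}x)")
# prints "Spark $0.8800/h vs Python Shell $0.0275/h (32x)"
```

Actual savings depend on how long each job runs; a Python Shell job that takes much longer than its Spark equivalent gives back some of that difference.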

The English football season ends in early May. We need Wave 2 shipped and validated under live load before then, or we're waiting until July pre-season for the next test window.

Target: daily bill dashboard showing ~$1.80/day by day 45 (down from ~$4.30 today).
