Quiet Sunday, Loud Edge Cases: Fixing the Vidiprinter and Retiring the NAT Gateway

Paul Pounder
April 19, 2026
6 min read

Sunday afternoons on a live football platform are usually just monitoring — watch the scores tick through, watch the WebSocket frames arrive, confirm nothing's on fire. Today was mostly that. But two things surfaced that turned into proper fixes, and on the back of a clean weekend of live data I was finally able to pull the trigger on the biggest cost-reduction in the Wave 2 cost-optimisation plan.

This post is a write-up of all three.

What We Built

Three things shipped today:

  1. A canonical events snapshot in the live WebSocket frame — the frontend can now drop VAR-disallowed goals and rescinded cards without any new backend signal.
  2. A one-line fix for a latent Vidiprinter ordering bug that hides in plain sight until two goals happen either side of half-time.
  3. Deletion of the NAT Gateway and its Elastic IP, finishing Wave 2 Stream A of the cost-optimisation programme and realising about $32/month in savings.

The Vidiprinter Bug That Looked Like a Hang

Everton scored against Liverpool. The goal flashed into the Vidiprinter. Then VAR stepped in, ruled offside, and the goal was disallowed — scores corrected back to 0-0. Seconds later Liverpool scored at the other end. The live score ticker updated correctly to 0-1. But the Vidiprinter looked frozen. Refresh the page and it was fine.

The diagnosis wasn't a hang at all. It was a stale entry.

Our live_delta_poller Lambda watches SportMonks' events array, compares the ID set to the previous poll, and emits NEW_EVENT deltas for anything new. That's a perfectly good pattern until events start disappearing from the upstream array — which is exactly what happens when VAR disallows a goal. SportMonks just drops the event ID. Our poller only knows how to detect additions. The frontend hook only knows how to append. So the Everton goal lived on in the ticker's homeGoalscorers array with nothing to remove it, and the running-score calculation (which recomputes from the goalscorer array) started producing wrong values.
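In sketch form, the append-only diff looks something like this (names and shapes here are illustrative, not the actual live_delta_poller code):

```typescript
type MatchEvent = { id: number; type: string; minute: string };

// Compare the current events array against the previous poll's ID set.
// Only *additions* produce deltas — an ID that disappears from curr
// (a VAR-disallowed goal, a rescinded card) produces nothing at all.
function newEventDeltas(prev: MatchEvent[], curr: MatchEvent[]): MatchEvent[] {
  const seen = new Set(prev.map((e) => e.id));
  return curr.filter((e) => !seen.has(e.id));
}
```

Run the second case — an ID vanishing upstream — and the function returns an empty array, which is exactly the silence the frontend saw.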

I considered three fix paths:

  1. Emit REMOVED_EVENT deltas — detect ID set shrinkage in the poller, have the frontend strip matching entries.
  2. Ship a canonical events_snapshot in COMPOSITE_UPDATE — frontend replaces the ticker arrays instead of appending.
  3. A narrow VAR-specific patch — when we see GOAL_DISALLOWED, emit a targeted removal.

I went with option 2. It's slightly more bandwidth than options 1 or 3, but it kills a whole class of bug in one move — VAR disallowals, rescinded cards, and anything SportMonks might decide to correct in the future. The payload impact turned out to be tiny (~1-2 KB per frame for a typical late-in-match fixture, well under API Gateway's 128 KB WebSocket frame limit) and we gate it on the event-ID set actually having changed, so score-only and pressure-only cycles carry nothing extra.
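The gating is the part worth showing. A minimal sketch, assuming a hypothetical helper and a simplified event shape rather than the real COMPOSITE_UPDATE schema:

```typescript
type MatchEvent = { id: number; type: string; minute: string };

// Return the snapshot only when the event-ID set has actually changed
// since the previous poll; undefined means the frame carries nothing extra.
function eventsSnapshotIfChanged(
  prevIds: Set<number>,
  curr: MatchEvent[],
): MatchEvent[] | undefined {
  const currIds = new Set(curr.map((e) => e.id));
  const changed =
    currIds.size !== prevIds.size ||
    [...currIds].some((id) => !prevIds.has(id));
  // Score-only and pressure-only cycles fall through to undefined here.
  return changed ? curr : undefined;
}
```

Both additions and removals flip the gate, which is the point: a disallowed goal shrinks the ID set and forces a snapshot out.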

The cleanest piece of the implementation was almost accidental: the REST bootstrap endpoint had been doing the same events-to-ticker-buckets projection inline, so I extracted it into lib/ball-watching/project-events.ts and both the REST route and the WebSocket hook now call the same function. The REST path lost about 60 lines of inline logic as a bonus.
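In simplified form, the shared projection might look like this — a stand-in sketch, not the actual contents of lib/ball-watching/project-events.ts, with assumed field names:

```typescript
type MatchEvent = {
  id: number;
  type: "goal" | "card";
  team: "home" | "away";
  player: string;
  minute: string;
};
type TickerBuckets = { homeGoalscorers: MatchEvent[]; awayGoalscorers: MatchEvent[] };

// Rebuild the ticker buckets from the canonical events array each time.
// Because it starts from scratch, a snapshot replace automatically drops
// anything SportMonks has removed upstream.
function projectEvents(events: MatchEvent[]): TickerBuckets {
  return {
    homeGoalscorers: events.filter((e) => e.type === "goal" && e.team === "home"),
    awayGoalscorers: events.filter((e) => e.type === "goal" && e.team === "away"),
  };
}
```

With both the REST bootstrap and the WebSocket hook calling one pure function, the two paths can't drift apart again.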

The Second Bug, Uncovered by the First

With the VAR fix deployed, I was watching the next fixture when I noticed something else. Burnley scored against Nottingham Forest in the 2nd minute of first-half injury time — displayed as "45+2'". A few minutes later, Morgan Rogers scored for Aston Villa in the first minute of the second half — "46'". Both games were on the same matchday.

The Burnley entry was sitting above the Villa one in the Vidiprinter. Which is wrong — Villa's goal happened after Burnley's in real time: Burnley's came in first-half stoppage time, while Villa's came in the opening minute of the second half, after the half-time break.

Hunting through the Vidiprinter sort logic, I found this:

const totalMatchMinutes = (time) => {
    const match = time.match(/(\d+)(?:\+(\d+))?/);
    const base = parseInt(match[1], 10);
    const extra = match[2] ? parseInt(match[2], 10) : 0;
    return base + extra;
};

"45+2" → 47 minutes. "46" → 46 minutes. So 45+2 sorts after 46 by this calculation, which is backwards. The function was used as a cross-fixture wall-clock approximation, and the code comment even claimed it "preserves ordering within a fixture" — but that has never been true for events that cross the 45/90 boundary. It's been latent since the function was written, and only surfaces when two goals happen in quick succession right around half-time. Today was probably the first Saturday or Sunday it has been triggered.

The fix was a one-line change that took longer to reason about than to write. Treat stoppage time as a fractional minute within the base, so "45+2" becomes 45.02 (strictly between 45 and 46), and "90+6" becomes 90.06 (strictly between 90 and 91). Cross-fixture ordering is still dominated by the kickoff offset, so the sub-minute distortion doesn't affect anything.

// before: return base + extra;
// after:  return base + extra / 100;
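Put back into the function, the corrected version reads as below (with type annotations and a defensive null check added that the original snippet omitted):

```typescript
const totalMatchMinutes = (time: string): number => {
  const match = time.match(/(\d+)(?:\+(\d+))?/);
  if (!match) return 0; // defensive: unparseable minute strings sort first
  const base = parseInt(match[1], 10);
  const extra = match[2] ? parseInt(match[2], 10) : 0;
  // "45+2" → 45.02, strictly between 45 and 46; "90+6" → 90.06
  return base + extra / 100;
};
```

Now "45+2" sorts before "46", as it should, and any stoppage-time value up to +99 stays inside its base minute.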

I wondered whether this was a regression — it felt like the kind of thing we might have fixed before. Git history said no; this was the first time the function had been touched since creation. The feeling of familiarity was real, but it was a different ordering issue that got fixed previously.

Retiring the NAT Gateway

With both Vidiprinter fixes deployed and validated against today's live matches, and with the Fargate Agent Hub running cleanly in public subnets for the full Sunday fixture window (task A.2 of the Wave 2 cost optimisation), the precondition for the big one was met.

Task 58.4 of the cost plan: delete the NAT Gateway and its Elastic IP.

NAT Gateways are about $32/month in fixed charges before any data processing. For a platform this size, where the entire business is a few Lambdas and a single Fargate task talking to public AWS endpoints (DynamoDB, S3, Bedrock, SSM, SportMonks, Confluent Cloud), the NAT was the single largest unnecessary line item on the AWS bill.

The earlier phases had already done the groundwork:

  • A.1 took live_delta_poller out of the VPC (it was the only Lambda still using VPC networking).
  • A.2 moved the Fargate Agent Hub from PRIVATE_WITH_EGRESS subnets to PUBLIC subnets with assign_public_ip=True, so its outbound Kafka/Bedrock traffic went directly through the Internet Gateway instead of the NAT.

By the time I ran the CDK diff for A.3, nothing was actually using the NAT anymore. It was just running, burning $1.10/day, waiting to be removed.

The code change was two lines:

# before
nat_gateways=1,
subnet_configuration=[
    ec2.SubnetConfiguration(name="Private", subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS, ...),
    ...
]

# after
nat_gateways=0,
subnet_configuration=[
    ec2.SubnetConfiguration(name="Private", subnet_type=ec2.SubnetType.PRIVATE_ISOLATED, ...),
    ...
]

The private subnets stayed in the VPC even though nothing is using them right now — I wanted them there as an anchor for the S3 and DynamoDB Gateway endpoints, in case a future workload needs to attach to the VPC. That's defensive, but the marginal cost is zero and the optionality is nice.

CDK diff: one NAT Gateway, one Elastic IP, and two private-subnet default routes destroyed. Two subnet tags updated. The VPC endpoint RouteTableIds were reordered cosmetically. No replacements of anything in use.

Deploy ran in 118 seconds. Fargate didn't even notice — it was in a public subnet and its outbound path didn't change. AWS confirmed zero NAT Gateways and zero Elastic IPs in eu-west-2 post-deploy. Saving is live as of now.

Key Decisions

The biggest decision in the session was the Option 2 fix for the VAR bug. The payload-size concern was explicit — because we're sending one COMPOSITE_UPDATE per fixture per poll cycle, and because Saturdays have a dozen concurrent live fixtures, anything that bloats the frame compounds across the whole system.

The mitigation that made Option 2 workable was the gated snapshot. The snapshot only appears in the COMPOSITE_UPDATE payload when the event-ID set has actually changed since the previous poll. Score-only, pressure-only, and state-only cycles (which are the vast majority) carry nothing extra. That plus a slim projection (only the fields the frontend ticker needs) brought the worst-case per-fixture overhead down to single-digit kilobytes, and the typical overhead to zero.

The smaller decision on the Vidiprinter ordering — treating stoppage minutes as a fraction — took a minute to settle. The alternative was using parseMinute (which already handled this correctly for intra-fixture sorts) as the primary key and giving up on cross-fixture wall-clock ordering. But that would have meant rebuilding the sort logic entirely, and the fractional approach preserved everything the function was trying to do while fixing the ordering issue in one line.

On the NAT deletion, the decision was mostly about when, not how. The cost plan had always said "after one full fixture window of clean A.2 telemetry." Saturday and Sunday were both clean. The diff was reassuring. I wanted to ship it before Monday so it showed up on the 20-April daily cost report.

What I Learned

The VAR bug was a useful reminder that append-only data models are a tempting shortcut that eventually break. The platform had been running an append-only events model for months without anyone noticing, because VAR disallowals are rare and rescinded cards are rarer still. It took an Everton-Liverpool match on a high-traffic Sunday to surface the issue, and another match on the same day to surface the ordering bug. Production load has a way of finding the edges.

The NAT deletion reinforced something the cost review had been saying all along: 83% of the AWS bill was fixed cost, not per-fixture cost. Adding more matches, more viewers, more presenters wouldn't change the bill. But removing one NAT Gateway and reorganising two subnets would. A lot of "scale" discussions assume growth is the lever — for a platform like this, the lever has been architectural rightsizing.

What's Next

Wave 2 Stream A is now complete. The remaining Wave 2 work splits into:

  • Stream B: converting seven low-volume Glue ETL jobs (leagues, seasons, stages, rounds, venues, referees, coaches) from PySpark to Glue Python Shell or DuckDB-in-Lambda. Each one is a ~100-line runtime rewrite, worth validating independently.
  • Stream C: fixture-aware Fargate scaling — replacing the hardcoded 06:00 UTC up / 00:00 UTC down schedule with UpdateService calls emitted by DailyScheduleAnalyzer around actual kickoff times.
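Stream C hasn't shipped yet, but the decision at its core is pure and testable without touching ECS. A sketch under assumptions — the helper name, warm-up lead, and match-window duration are all hypothetical, and the real DailyScheduleAnalyzer interface may differ; the output would feed the desiredCount of an ECS UpdateService call:

```typescript
// Decide how many Agent Hub tasks should be running at a given moment,
// based on the day's kickoff times instead of a fixed 06:00–00:00 schedule.
function desiredTaskCount(
  now: Date,
  kickoffs: Date[],
  warmupMinutes = 30,
  matchWindowMinutes = 120,
): number {
  const inWindow = kickoffs.some((k) => {
    const start = k.getTime() - warmupMinutes * 60_000;
    const end = k.getTime() + matchWindowMinutes * 60_000;
    return now.getTime() >= start && now.getTime() <= end;
  });
  return inWindow ? 1 : 0; // single Agent Hub task, up only around fixtures
}
```

On a no-fixture day this returns 0 all day, which is the whole saving.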

That's the plan for the week ahead. Wave 3 (DuckDB medallion pilot, Fargate → Lambda rewrite scoping) goes into May. The season ends in early May — anything that needs live fixture load to validate has to ship before then.

In the meantime: two quieter Vidiprinters and one smaller AWS bill.
