The Compound Bug: When Two Quiet Flaws Conspire

Paul Pounder
April 18, 2026
7 min read

Saturday afternoon, 18 April 2026. I was watching our live TV schedule widget and noticed it listed Aston Villa vs Sunderland and Nottingham Forest vs Burnley as today's fixtures. Neither game was being played today — both had been moved to Sunday for TV. More interestingly, Tottenham vs Brighton, which was actively live on my screen via our live poller, was nowhere to be seen in the TV schedule.

One fixture widget, two wrong fixtures showing, and the real evening game missing. The asymmetry of it — some wrong ones present, some right ones absent — is what tipped me off that this wasn't a simple data freshness issue.

What We Built

The fix landed as three code changes in one deploy, plus a Wave 2 infrastructure step that had been planned for this week anyway:

  • Silver schedules dedup in the PySpark ETL — one row per fixture, keeping the newest snapshot.
  • Athena pagination in the daily schedule analyzer — loop on NextToken instead of single-call get_query_results.
  • Cron moved from 23:01 UTC to 00:01 UTC for the schedule analyzer — correct in both BST and GMT.
  • Wave 2 A.1 → A.2: removed the live poller from the VPC, then moved the Fargate Agent Hub to public subnets with assign_public_ip=True. NAT Gateway deletion (A.3) is gated on one fixture window of observation.

While verifying A.2 I also surfaced a pre-existing bug: the agent container had been crash-looping for 24+ hours with a langgraph._internal import error that had nothing to do with my infrastructure change. Fixed that too with a proper == pin across all four langgraph-* packages.

Key Decisions

Investigating the wrong fixtures first, not the deploy

My instinct was to push on with Wave 2 A.1 (the one-line VPC detachment on the poller) and look at the schedule widget later. I flipped the order after thinking about it:

The match is happening right now. Once it finishes, the evidence trail gets colder: the pressure index stops, STATUS#LIVE transitions to terminal, logs rotate. And the same root cause is suspected for both symptoms: a missing real fixture plus wrong stale fixtures is the asymmetric signature of a dedup bug. Diagnosing it live will tell me whether the dedup fix is sufficient or whether there's a second issue.

I'm glad I did. The live signal is what surfaced the pagination interaction — if I'd shipped the dedup fix in isolation on Monday with no fixtures running, the 1,000-row pagination cliff might have stayed hidden for months until the next Bronze data anomaly.

The count told the whole story

Before the fix, SELECT COUNT(*) FROM gold_schedules_view WHERE starting_at_datetime LIKE '2026-04-18%' returned 1,724. A Premier League matchday plus the Championship, League One, League Two and National League cards is roughly 50 fixtures. 1,724 divided by 50 is roughly 34, meaning each fixture had accumulated ≈34 daily Bronze snapshots. The Silver ETL read them all and unioned them with no dedup step.

The Athena analyzer then issued a single get_query_results() call, which caps at 1,000 rows per page. With ORDER BY starting_at_datetime ASC, the early-kick-off fixtures (14:00 UTC games) made the cut — including 7 stale rows pinning Villa and Forest to April 18. The evening games at 16:30+ UTC (Tottenham, Chelsea) fell off the 1,000-row cliff entirely.

Neither bug alone would have caused a visible problem. With dedup in place, the Silver ETL would have emitted ≈50 rows and the 1,000-row page limit would never have been reached. With pagination in place, all 1,724 rows would have come through, duplicates and all, and dedup would still have been needed somewhere downstream. It took both bugs together to produce the symptom, which is why neither had been spotted in 10 months of live operation.
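To make the interaction concrete, here's a toy simulation. Fixture ids and kick-off times are invented; only the ~34× duplication and the 1,000-row page cap mirror the real incident:

```python
# Toy reproduction of the compound bug (illustrative, not project code).
fixtures = [(f"fixture_{i:02d}", f"2026-04-18T{14 + i // 25}:00") for i in range(50)]

# No Silver dedup: every fixture appears once per daily Bronze snapshot.
snapshots = [row for row in fixtures for _ in range(34)]   # 1,700 rows
snapshots.sort(key=lambda r: r[1])                          # ORDER BY kick-off ASC

# No pagination: a single get_query_results call returns at most 1,000 rows.
page = snapshots[:1000]
visible = {fixture_id for fixture_id, _ in page}

print(len(snapshots))             # 1700 rows instead of ~50
print(len(visible))               # 30 -- early kick-offs survive
print("fixture_49" in visible)    # False -- evening games fall off the cliff
```

Fix either flaw and the symptom disappears: dedup shrinks `snapshots` to 50 rows, pagination makes `page` cover everything.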

Not rebuilding a shadow stack for A.1 + A.2

The original Wave 2 plan included deploying eight Lambdas to a parallel shadow stack for 48 hours of shadow-mode observation before cutting over. When I actually audited the eight Lambdas, only one (live_delta_poller) was in the VPC at all; the other seven were already running outside it. That made the shadow stack enormously disproportionate to the risk. I went with a canary-style rollout via min_healthy_percent=100/max_healthy_percent=200 on the Fargate service instead: ECS brings up the new task in the new subnet, waits for it to pass health checks, then drains the old task. Rollback is a one-line CDK revert.
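The rollout knob is small enough to show. A sketch in CDK v2 Python, assuming an existing cluster and task definition; the construct id here is illustrative, not the project's:

```python
from aws_cdk import aws_ec2 as ec2, aws_ecs as ecs

service = ecs.FargateService(
    self, "AgentHub",                     # illustrative construct id
    cluster=cluster,
    task_definition=task_definition,
    assign_public_ip=True,                # public subnet, no NAT path needed
    vpc_subnets=ec2.SubnetSelection(subnet_type=ec2.SubnetType.PUBLIC),
    min_healthy_percent=100,              # never drop below current capacity
    max_healthy_percent=200,              # allow old and new task side by side
)
```

On deploy, ECS starts the replacement task in the new subnet, waits for it to become healthy, then stops the old one; reverting these lines and redeploying walks it back the same way.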

Pinning all four langgraph-* packages with ==, not ~=

The crash-loop root cause was subtle: langgraph~=0.4.0 let pip install langgraph 0.4.x as the core package, but langgraph-prebuilt (a separate PyPI package that LangChain split out) got resolved to ≥0.6.0 as a transitive dependency. Prebuilt 0.6+ imports from langgraph._internal._runnable, which doesn't exist in langgraph 0.4.x. Classic PyPI split-package skew.

I was tempted to just bump langgraph to ~=0.6.0 and move on. But pip's resolver does that to you every time a transitive dep gets bumped — I'd have been back here in three months. Pinning all four (langgraph, langgraph-prebuilt, langgraph-checkpoint, langgraph-sdk) with == feels heavy but it's the only way to make rebuilds deterministic.
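In requirements terms it ends up looking like this. The version numbers below are placeholders; the real pins are whatever combination was actually tested together:

```
langgraph==0.4.10            # placeholder versions -- pin what you tested
langgraph-prebuilt==0.1.8
langgraph-checkpoint==2.0.10
langgraph-sdk==0.1.61
```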

How It Works

The Silver dedup pattern is stock Spark:

from pyspark.sql.functions import col, input_file_name, row_number
from pyspark.sql.window import Window

# Tag every Bronze row with the S3 path it was read from.
df_bronze = df_bronze.withColumn("_source_file", input_file_name())
# ... explode fixtures, preserving _source_file through each explode ...
# Newest snapshot first per fixture (paths sort lexicographically by run timestamp).
dedup_window = Window.partitionBy("fixture_id").orderBy(col("_source_file").desc())
df_silver = df_flat.withColumn("_rn", row_number().over(dedup_window)) \
    .filter(col("_rn") == 1) \
    .drop("_rn", "_source_file")

input_file_name() gives the full S3 path of each row's source file. Bronze paths contain year=YYYY/month=MM/day=DD/season_<id>_run_<unix_ts>.json, so lexicographic DESC ordering puts the newest snapshot first. The window function then keeps rank 1 per fixture_id. Clean.
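A quick sanity check of that ordering claim, with made-up paths. The trick relies on the Unix timestamps all being ten digits wide, which holds for dates between 2001 and 2286:

```python
# Invented Bronze-style keys; a plain string sort agrees with chronological
# order because the run_<unix_ts> suffix is fixed-width.
paths = [
    "year=2026/month=04/day=16/season_42_run_1776662461.json",
    "year=2026/month=04/day=18/season_42_run_1776835261.json",
    "year=2026/month=04/day=17/season_42_run_1776748861.json",
]
newest_first = sorted(paths, reverse=True)
print(newest_first[0])   # the day=18 snapshot sorts first
```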

The pagination fix is a small loop wrapped around the existing single call:

# Athena returns at most 1,000 rows per get_query_results call,
# so loop on NextToken until the result set is exhausted.
all_rows, next_token, metadata = [], None, None
while True:
    kwargs = {'QueryExecutionId': execution_id, 'MaxResults': 1000}
    if next_token:
        kwargs['NextToken'] = next_token
    page = athena.get_query_results(**kwargs)
    if metadata is None:
        metadata = page['ResultSet'].get('ResultSetMetadata')
    all_rows.extend(page['ResultSet']['Rows'])
    next_token = page.get('NextToken')
    if not next_token:
        break
# Reassemble a response shaped like the single-call original.
return {'ResultSet': {'Rows': all_rows, 'ResultSetMetadata': metadata}}

The only subtlety: the first page's Rows[0] is the column header; subsequent pages are data-only. The caller uses Rows[1:] to skip the header, and that still works correctly because we preserve the header once at position 0.

What I Learned

Asymmetric symptoms are a diagnostic gift. “Wrong things showing” alone looks like a date-parsing bug. “Right things missing” alone looks like a filter bug or a missing ingest. But “some wrong things showing AND some right things missing, from the same query” is specific enough to narrow the search immediately. I spent less time in grep -n "starting_at" than I expected because the shape of the symptom told me what to look for.

Count the query result before trusting it. The line of code that would have caught this bug years earlier, if we'd written it, is literally a COUNT(*) logged alongside the Athena query. Pagination silently truncating to 1,000 is invisible to anything that doesn't check the total rowcount.
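As a sketch of what that guard could look like; the helper name, logger name and threshold here are my inventions, not the codebase's:

```python
import logging

log = logging.getLogger("schedule_analyzer")

ATHENA_PAGE_CAP = 1000  # get_query_results hard page limit

def check_rowcount(rows, expected_max=200):
    """Log the rowcount and flag the two silent failure shapes:
    suspiciously round truncation, and duplicate bloat.
    (Illustrative helper -- names and threshold are assumptions.)"""
    n = len(rows)
    log.info("athena query returned %d rows", n)
    if n == ATHENA_PAGE_CAP:
        log.warning("rowcount equals the page cap; results may be truncated")
    if n > expected_max:
        log.warning("rowcount %d exceeds expected %d; possible duplicate rows",
                    n, expected_max)
    return n
```

Called on the pre-fix result, this would have fired on both counts at once: 1,724 rows against an expected ~50, truncated to a suspiciously exact 1,000.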

ENI orphaning is a thing. After removing the Lambda from the VPC, CloudFormation failed twice to delete the old security group because the detached ENIs still referenced it. AWS's Lambda ENI reaper eventually gets to them but can take hours. Manually deleting the two orphaned ENIs unblocked CFN instantly. Worth knowing for future VPC detach operations.
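For next time, the manual unblock is two CLI calls (the ids below are placeholders):

```shell
# Find ENIs still attached to the security group CloudFormation can't delete
aws ec2 describe-network-interfaces \
    --filters Name=group-id,Values=<sg-id> \
    --query 'NetworkInterfaces[].NetworkInterfaceId'

# Delete each orphaned ENI (only safe once the Lambda is out of the VPC)
aws ec2 delete-network-interface --network-interface-id <eni-id>
```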

Pre-existing production bugs hide in plain sight when nobody's watching. The langgraph crash-loop had been running for 24+ hours with no alerts because we don't have a CloudWatch alarm on the agent hub's ContainerExited metric. The only reason I found it was that I was actively tailing the logs to verify A.2. I'll add a proper alarm as part of Wave 2 cleanup.

What's Next

  • Sunday 19 April: observe A.2 (Fargate in public subnet) through the PL matchday. Watch Bedrock errors, Kafka consumer lag, DynamoDB throttles. If clean, the NAT Gateway can be deleted Monday.
  • Monday 20 April: Wave 2 A.3 — delete the NAT Gateway, set nat_gateways=0, switch PRIVATE_WITH_EGRESS subnets to PUBLIC. The first concrete line item from the cost review to come off the bill.
  • Wave 2 Stream B: convert seven low-volume Silver ETL jobs from Glue PySpark to Glue Python Shell (1/16 DPU). Low risk, moderate saving.
  • Agent hub CloudWatch alarm on ContainerExited to catch the next silent crash-loop inside 10 minutes instead of 24 hours.
