A healthcare client's partner integration layer had been dark for seven days when I got the call. Nobody had noticed because there were no alerts on that subsystem. The AWS bill wasn't screaming - it was quietly accumulating costs in places that only show up when you go looking.
That weekend, I went looking.
The incident: what I walked into
The setup was a fairly standard AWS footprint for a regulated client: a fleet of Lambda functions handling data exchange with external partners, object storage behind them, scheduled cron jobs, and a shared EC2 host coordinating the pipeline. The full incident recovery case study is in my work archive if you want the metrics without the commentary.
The host had rebooted at some point in the recent past. Outbound connections were failing silently because the OS network stack had wedged during the reboot. Every integration that depended on that box had been returning empty results without surfacing an error anywhere a human would see it.
By the time I was brought in, thousands of records had gone unprocessed. The system was running - technically - but accomplishing nothing.
That was problem one.
Timeline
The incident split into two phases. The first was diagnosing and fixing the network stack. The second was cleaning up the credential problem that the fix accidentally triggered.
- T+0 min - Received access. First action: EBS snapshot before touching anything. Insurance against data loss if a subsequent action made things worse.
- T+15 min - Confirmed the OS network stack was wedged on the shared host. Outbound connections failing silently. No external errors surfacing.
- T+48 min - Rebooted the instance. Over the following 48 minutes, verified that 11 of 12 scheduled jobs came back cleanly. One required manual intervention.
- T+3 hr - A credential reset in the upstream system (initiated by someone else, different issue entirely) hit the wrong account due to two nearly identical usernames in the system. This took the integrations back offline immediately after I'd brought them up.
- T+3.5 hr - Mapped blast radius: roughly a dozen hardcoded credential files on the EC2 host plus another dozen Lambda deployment packages, each holding encrypted credentials in the deployment package itself rather than in environment variables or a secrets manager.
- T+5 hr - Patched all credential files in place with timestamped backups. Downloaded, decrypted, re-encrypted, and redeployed all Lambda functions via CLI.
- T+6 hr - Full integration layer restored. Began cleanup pass.
- T+7 hr - Discovered three dead Django daemons from the prior reboot, including one that had been returning 502 errors for nine consecutive days with nobody noticing.
- T+8 hr - Recovered roughly 2 GB of disk space by truncating bloated log files while preserving file handles. Added swap space - the host had been running with none. Verified end-to-end with 16+ clean Lambda runs and three consecutive cron ticks.
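The log-truncation step above works because emptying a file in place preserves the inode, so daemons holding open file descriptors keep writing to the same file. Deleting the file instead would leave them writing to an unlinked inode that still consumes disk. A minimal sketch (the path is a throwaway stand-in):

```shell
# Stand-in for the bloated log file (path is illustrative).
LOG=$(mktemp)
printf 'old log data\n' >> "$LOG"

# rm would leave running processes writing to a deleted inode that still
# holds its disk space; truncating in place keeps every open handle valid.
: > "$LOG"    # equivalent: truncate -s 0 "$LOG"

wc -c < "$LOG"    # 0 bytes, handles intact
rm -f "$LOG"
```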
Root cause: two layers, one weekend
The immediate cause was a wedged network stack after an unattended host reboot. Standard enough.
The compounding cause was credential architecture. When credentials lived hardcoded in deployment packages rather than in a secrets manager, rotating them meant downloading every package, decrypting it, updating the credential, re-encrypting, and redeploying. For a dozen functions, that's a half-day of methodical work that should have been a five-minute parameter store update.
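The manual-rotation cost scales linearly with fleet size, which is what made this a half-day job rather than a five-minute one. A back-of-the-envelope sketch (the per-function time is an illustrative assumption, not a measured figure):

```python
def manual_rotation_hours(n_functions: int, minutes_per_function: float = 25.0) -> float:
    """Download, decrypt, edit, re-encrypt, redeploy: the per-function time adds up."""
    return n_functions * minutes_per_function / 60.0

# A dozen functions at ~25 minutes each is roughly a half-day of work,
# versus a single parameter-store update when secrets live centrally.
print(manual_rotation_hours(12))  # → 5.0
```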
Two distinct failure modes, both predictable, neither monitored.
Where the cost was actually hiding
The bill review after the fact was instructive. The visible compute cost was not the main story.
The bigger items, in rough order of magnitude:
Lambda invocations that returned nothing. Every scheduled job had been running on schedule, executing successfully in Lambda's view, and producing zero output. Lambda charges per invocation and per compute duration. A fleet of functions running on a 15-minute schedule for seven days accumulates real charges even when doing nothing useful.
Log storage that nobody was pruning. CloudWatch logs had grown without any retention policy. Logs from functions that had been deprecated months earlier were still accumulating storage charges. The bloated log files on the EC2 host itself were a symptom of the same pattern - no housekeeping, so nothing gets cleaned up.
Dead daemons consuming memory on a paid instance. Three background processes were running on the host, doing nothing, but keeping the instance loaded enough that it never triggered any cost-saving scaling behavior.
What I changed
The immediate fixes were straightforward once the incident was contained:
Credential rotation path. Every hardcoded credential got flagged for migration to a secrets manager. That work wasn't in scope for the weekend, but the incident report called it out explicitly with the blast radius math: N functions times the hours to rotate manually equals a significant time cost every time credentials need to change. This kind of infrastructure debt is what I look for in a cloud infrastructure review - it's not dramatic until it causes a three-hour outage.
Log retention policies. Set retention windows on the CloudWatch log groups for every function in the fleet. Old log groups from deprecated functions were deleted after confirming no active consumers were reading them.
Dead process cleanup. Identified and removed the three stale daemons. Added them to a startup inventory so future reboots wouldn't leave orphaned processes running invisibly.
Alerting coverage. The subsystem that went dark had no alerting. Added a basic health check: if the integration layer hasn't produced a record in N minutes, alert. Simple threshold, not sophisticated monitoring. It would have caught this within an hour instead of seven days.
The incident report. This is the part that has the most long-term value. The report documented the timeline, the blast radius, the data loss assessment (zero, in this case - the upstream source had retained everything), and five prioritized infrastructure improvements. That document became the roadmap for the next quarter's infrastructure work.
What I'd do differently
A few things stand out looking back.
“Seven days of silent failure is seven days of costs that should have been caught in under an hour.”
The alerting gap was the most expensive mistake. A basic health check - even a cron job that emails if no records are processed in a window - would have caught this before it became a weekend project.
The credential architecture was a latent risk that turned an afternoon fix into a full evening of deployment work. Secrets manager migration is not glamorous work, but it makes every future change cheaper. When credentials live in deployment packages, you can't rotate them without touching every function.
The lack of swap space was a small thing that contributed to the host's instability. Adding it should have been part of the original provisioning.
None of these are novel insights. They're documented best practices. Best practices require someone to look at the system and ask whether they're actually in place - and in a busy regulated environment, that review gets deferred until something breaks.
FAQ
How do you approach an AWS cost audit if there's no incident?
Start with Cost Explorer filtered by service and by resource tag. Look for services with steady costs that you can't immediately explain - those are usually either orphaned resources or legitimate but underdocumented infrastructure. Then pull CloudWatch metrics to see what's actually running versus what's provisioned. The discrepancy is usually where the money is. If you want a structured checklist for running this kind of audit, the DTC Stack Audit covers the tracking and infrastructure layers in detail.
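One way to operationalize "steady costs you can't explain" is to flag services whose daily spend is meaningful but nearly flat: orphaned resources bill like clockwork, while real traffic is usually bursty. A sketch over exported daily cost data (the input shape and thresholds are assumptions for illustration):

```python
from statistics import mean, pstdev

def flag_steady_spend(daily_costs: dict[str, list[float]],
                      min_daily: float = 1.0,
                      max_rel_stdev: float = 0.05) -> list[str]:
    """Return services with nontrivial, near-constant daily cost."""
    flagged = []
    for service, costs in daily_costs.items():
        avg = mean(costs)
        # Check the floor first so we never divide by a near-zero average.
        if avg >= min_daily and pstdev(costs) / avg <= max_rel_stdev:
            flagged.append(service)
    return flagged

costs = {
    "AmazonCloudWatch": [4.1, 4.1, 4.2, 4.1, 4.2],  # flat: worth explaining
    "AWSLambda":        [0.9, 3.5, 0.2, 5.1, 1.0],  # bursty: likely real traffic
}
print(flag_steady_spend(costs))  # → ['AmazonCloudWatch']
```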
What's the fastest thing to fix in a Lambda fleet audit?
Log retention. Lambda functions write to CloudWatch Logs by default with no expiry set, so every function that has ever run is accumulating storage charges unless you've explicitly set a retention window. Setting a 30- or 90-day window across a fleet takes about an hour, and the savings show up on the next bill.
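The savings from a retention window are easy to estimate. Assuming CloudWatch log storage at roughly $0.03 per GB-month (an approximate published rate; check current pricing), a sketch:

```python
def retention_savings_per_month(ingest_gb_per_day: float, accumulated_days: int,
                                retention_days: int,
                                storage_per_gb_month: float = 0.03) -> float:
    """Monthly storage cost eliminated once logs older than the window expire.

    storage_per_gb_month is an approximate rate; verify against current pricing.
    """
    excess_days = max(accumulated_days - retention_days, 0)
    return ingest_gb_per_day * excess_days * storage_per_gb_month

# ~2 GB/day of logs, accumulated for 400 days, capped to a 90-day window:
print(round(retention_savings_per_month(2.0, 400, 90), 2))  # → 18.6
```

Setting the window itself is one CLI call per log group (`aws logs put-retention-policy`), which is why this is the fastest fix in the list.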
How do you do credential rotation without taking everything offline?
It depends on where the credentials live. If they're in AWS Secrets Manager or Parameter Store, you update the secret and the functions pick it up on the next invocation with no downtime. If they're baked into deployment packages - as they were in this case - you need to rotate one function at a time, verify, then proceed. It's slower but safe. The real answer is: migrate to a secrets manager so you never have to do the manual rotation process again.
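The one-function-at-a-time approach described above is just a verify-before-proceed loop: if a rotation breaks something, exactly one function is in a bad state instead of the whole fleet. A sketch with the rotate and verify steps left as injectable callables (all names are illustrative):

```python
from typing import Callable, Iterable

def rotate_fleet(functions: Iterable[str],
                 rotate: Callable[[str], None],
                 verify: Callable[[str], bool]) -> list[str]:
    """Rotate credentials function by function; halt at the first failed check."""
    done = []
    for fn in functions:
        rotate(fn)
        if not verify(fn):
            # Stop here so the blast radius is one function, not the fleet.
            raise RuntimeError(f"verification failed after rotating {fn}")
        done.append(fn)
    return done
```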
What should every Lambda fleet have that most don't?
Three things: a retention policy on every log group, an alarm on error rate (not just 5XX from the API side, but Lambda errors surfaced via CloudWatch metrics), and a dead-letter queue for any function that handles critical data. None of these are complicated. All of them require someone to have gone through and deliberately set them up.
How long does a proper cloud infrastructure audit take?
For a small-to-medium fleet, a weekend is enough to identify the main issues and fix the quick wins. A full remediation - migrating credentials, setting up proper secrets management, adding alerting coverage across all subsystems - is typically a few weeks of focused work, usually done in parallel with ongoing work rather than as a dedicated sprint.
Sources and specifics
- Incident occurred in a regulated healthcare client's AWS environment (NDA). Client anonymized throughout.
- Integration layer had been dark for approximately 7 days before diagnosis began.
- Lambda fleet: roughly a dozen functions with credentials in deployment packages rather than in a secrets manager.
- Recovery timeline: full integration layer restored within approximately 8 hours of starting the audit.
- Records backfilled: thousands of records processed within the first two hours post-recovery, zero data loss on primary pipeline.
- Log bloat: approximately 2 GB reclaimed on the EC2 host alone, separate from CloudWatch retention cleanup.
- For the full case study with metrics, see the incident recovery case study.
- If you're evaluating whether your own cloud infrastructure has similar issues, the cloud operations work section shows how I approach these engagements.
