WordPress fleets on AWS rot from the database up
Posted in AWS, WordPress, PHP
By Dušan Dželebdžić

Every dashboard was green and the site didn't load.
Web server CPU: bored. Memory: fine. Apache happily handling whatever Cloudflare forwarded its way. Pages took eight or ten seconds to start rendering, then stalled, then maybe came back. Like Wi-Fi at a busy coffee shop.
The database lived on its own RDS instance, and that was the part that confused everyone. It had been split off the EC2 box two years earlier specifically for reliability. MySQL on the same EC2 box as PHP kept saturating things during traffic spikes, so the operator did the responsible thing: peeled it off, put it on a gp2-backed db.t3 RDS, slept better at night. Problem solved.
Two years later, that exact instance was the one that went silent. The RDS storage burst balance had drained to zero somewhere around 9am, and from then on every query was waiting on disk IOPS the instance no longer had. Web server: bored. RDS: gasping. Dashboards: green, apart from one graph nobody was watching.
The fix from two years ago had quietly become the bug. That sentence summarizes most of what's wrong with WordPress fleets on AWS.
How these fleets accumulate
An aging WordPress fleet on AWS is almost never something somebody designed. It's something that happened.
You start with one site on a small EC2. WordPress is light, one box handles it fine. Then you add a second site, then a third. The agency takes on a client whose old site just needs to stay online for another year. A WooCommerce install lands. A dev environment becomes a staging environment becomes a "we don't really know what this one does anymore." Different developers install different caching plugins. PHP versions drift between sites that share a box. Cron jobs pile up. Backups multiply, then overlap, then nobody's sure which ones still work. One client launches a campaign and melts shared resources for everyone else for forty minutes on a Tuesday morning.
None of these decisions was wrong on the day it was made. Together, they build a system nobody can confidently change.
The database is doing more than you think
Most WordPress performance complaints are really database complaints in a costume.
Pages look slow because PHP is sitting on its hands waiting for MySQL. Admin panels drag because half the popular plugins do queries that would embarrass a junior developer. Random timeouts show up because the storage burst balance on the RDS volume is gone and every query is now waiting on disk IOPS the instance simply doesn't have anymore.
That last one is the trap. AWS sells burstable resources, things like db.t3 instances and gp2 volumes, that perform beautifully right up until the moment they don't. The credits last for hours of moderate use, then run out, then performance falls off a cliff at exactly the wrong time. To the customer it looks random. To the operator it looks like a haunted machine.
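The cliff is predictable if you do the arithmetic. A rough sketch of the gp2 burst-credit model, using the numbers AWS publishes (a 5.4 million I/O credit bucket, baseline of 3 IOPS per GiB with a 100 IOPS floor, burst up to 3,000 IOPS for volumes under 1 TiB):

```python
BUCKET_CREDITS = 5_400_000  # maximum (and initial) I/O credit balance

def baseline_iops(size_gib: int) -> int:
    """gp2 baseline: 3 IOPS per GiB, floor of 100."""
    return max(100, 3 * size_gib)

def hours_until_empty(size_gib: int, sustained_iops: int) -> float:
    """How long a full credit bucket lasts at a sustained I/O rate.

    Credits drain at (sustained - baseline) per second; a workload at or
    below baseline never empties the bucket.
    """
    drain_per_sec = sustained_iops - baseline_iops(size_gib)
    if drain_per_sec <= 0:
        return float("inf")
    return BUCKET_CREDITS / drain_per_sec / 3600

# A 100 GiB gp2 volume (baseline 300 IOPS) under a 1,000 IOPS morning rush:
print(round(hours_until_empty(100, 1000), 1))  # ≈ 2.1 hours, then the cliff
```

Two hours is about the distance between "traffic picked up at 7am" and "everything died at 9am." The haunted machine has a schedule; it's just written in I/O credits.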
Cheap infrastructure that collapses under normal usage isn't actually cheap. It's just billed differently.
Cloudflare hides a sick origin
Cloudflare in front of WordPress is genuinely useful. Edge caching, bot filtering, the SSL pipework, a cheap layer of armor against script kiddies. None of that is wrong.
What it also does, and what nobody mentions in the marketing, is hide how sick the origin really is. From the outside the site looks mostly fine because cached pages render in milliseconds. The logged-in users, the checkout flow, the editors hitting wp-admin, every API call, every cache miss. All of those are talking to the dying origin and getting the dying-origin experience.
Then traffic spikes, or the cache gets purged, or a campaign tag changes the URL pattern enough that nothing's in cache anymore, and the illusion drops. The thing was on fire the whole time. Cloudflare was just very good at the smoke screen.
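One way to see through the smoke screen is to tally Cloudflare's cf-cache-status header from your logs: anything that isn't a HIT went to the origin. A minimal sketch, assuming a made-up log format where the status is the last field on each line (adapt the parsing to whatever your logs actually look like):

```python
from collections import Counter

# cf-cache-status values that mean the origin did the work.
ORIGIN_STATUSES = {"MISS", "EXPIRED", "BYPASS", "DYNAMIC"}

def origin_hit_ratio(log_lines):
    """Fraction of requests the origin actually served."""
    statuses = Counter(line.rsplit(" ", 1)[-1] for line in log_lines)
    total = sum(statuses.values())
    origin = sum(v for k, v in statuses.items() if k in ORIGIN_STATUSES)
    return origin / total if total else 0.0

# Fabricated sample: one cached page, everything else talking to the origin.
sample = [
    "GET /about HIT",
    "GET /wp-admin/ BYPASS",
    "GET /?utm_campaign=spring MISS",
    "GET /checkout DYNAMIC",
]
print(origin_hit_ratio(sample))  # 0.75 -- the "cached" site is mostly origin
```

Run that over a real day of logs and the flattering edge-cache numbers usually stop being flattering.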
The AWS bill is paying for caution, not performance
The bill on these setups grows in places that have very little to do with how the sites actually perform.
EC2 instances stay one or two sizes too big because nobody trusts downsizing without a load test that nobody has time to run. RDS instances carry years of inefficient queries the team has learned to live with. gp2 volumes sit there bursting and then not-bursting; nobody upgrades to gp3 because nobody knows the calculator off the top of their head. Snapshots accumulate at one per day per instance and nobody has audited the retention policy since 2021. There's an idle Application Load Balancer that used to front something. There's a NAT Gateway moving the tiniest amount of traffic at the steepest possible per-GB rate. Two of the environments are duplicates of each other "just in case." A third-party APM is collecting metrics from a service that was decommissioned eighteen months ago.
None of these line items will hurt you on their own. Together they're often a third of the bill.
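Two of those line items take thirty seconds of arithmetic. A back-of-envelope sketch using approximate us-east-1 list prices (check the AWS pricing pages before trusting any of these constants):

```python
# Approximate us-east-1 prices -- verify against current AWS pricing.
GP2_PER_GIB = 0.10   # $/GiB-month
GP3_PER_GIB = 0.08   # $/GiB-month, with 3,000 IOPS and 125 MB/s included
NAT_HOURLY = 0.045   # $/hour, always on
NAT_PER_GB = 0.045   # $/GB processed
HOURS_PER_MONTH = 730

def gp2_to_gp3_monthly_savings(total_gib: int) -> float:
    """Savings from converting gp2 volumes to gp3 at the same capacity."""
    return total_gib * (GP2_PER_GIB - GP3_PER_GIB)

def nat_gateway_monthly(gb_processed: float) -> float:
    """What an always-on NAT Gateway costs per month."""
    return NAT_HOURLY * HOURS_PER_MONTH + NAT_PER_GB * gb_processed

print(round(gp2_to_gp3_monthly_savings(2000), 2))  # 40.0 -- for 2 TiB of volumes
print(round(nat_gateway_monthly(1), 2))            # ≈ $33/month, nearly idle
```

Nobody needs to know the calculator off the top of their head. They need to run it once.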
You're not managing traffic, you're managing entropy
People think a fleet of fifty low-traffic WordPress sites is easier than one busy app. It isn't.
A busy app is one codebase, one deploy pipeline, one PHP version, one caching strategy, one set of cron jobs. You can know that app. Fifty WordPress sites are fifty plugin lists, fifty theme variants, fifty backup schedules, fifty sets of email quirks, fifty sets of client-specific surprises that nobody documented. A handful of them are running PHP versions you'd rather not name out loud. Two of them are quietly out of disk. One has been compromised for a month and is mining something.
You're not managing traffic. You're managing entropy.
That's why agencies feel exhausted even when no single site is "big." The exhaustion isn't proportional to traffic. It's proportional to the number of independent things that can go wrong overnight without warning.
What I'd actually do if I inherited one
If somebody handed me a tangled WordPress fleet on AWS tomorrow, I wouldn't start with Kubernetes, six new services, or any of the architecture cosplay that gets pitched at this stage. I'd start by knowing what's there.
First, an honest inventory. Which sites still make money? Which ones are abandoned? Which ones are legacy liabilities that nobody owns and nobody wants to admit they don't own? Most operators, when pressed, can't actually answer this. That's the real starting point.
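The inventory doesn't need tooling, it needs verdicts. A toy sketch of the triage pass; every site name, number, and threshold here is invented, and the only real rule is that each entry gets an owner and a decision:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Site:
    name: str
    monthly_revenue: float    # what it actually brings in
    last_deploy_days: int     # days since anyone shipped a change
    owner: Optional[str]      # None is the honest answer more often than not

def triage(site: Site) -> str:
    if site.owner is None:
        return "find-owner-or-retire"
    if site.monthly_revenue == 0 and site.last_deploy_days > 365:
        return "retire"
    if site.monthly_revenue > 0:
        return "keep-and-isolate"
    return "consolidate"

# Fabricated fleet, for shape only.
fleet = [
    Site("shop.example.com", 12_000, 3, "alice"),
    Site("brochure-2018.example.com", 0, 1100, None),
    Site("blog.example.com", 0, 40, "bob"),
]
for site in fleet:
    print(site.name, "->", triage(site))
```

The uncomfortable part isn't the code, it's filling in the `owner` column truthfully.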
Then, separation. The site that pays the rent shouldn't be sharing a database with a brochure site from 2018 that nobody's logged into in three years. Carve the important workloads off, even if it means duplicating a bit of infrastructure. Noisy neighbors are cheap to fix once you know who they are.
Then the database. Right-size the instances, move gp2 volumes to gp3, kill the queries that have been embarrassing themselves for years, fix the indexes that should have been there from the start. This usually moves the needle more than anything you can do to PHP.
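Finding the embarrassing queries is mostly aggregation over the MySQL slow query log: collapse literals into a fingerprint, sum time per fingerprint, read the top of the list. A minimal sketch over a fabricated log excerpt (real slow logs carry more header fields; the parsing here is deliberately simplified):

```python
import re
from collections import defaultdict

# Fabricated slow-log excerpt. The wp_postmeta scan is the classic offender:
# core ships that table with no index on meta_value.
LOG = """\
# Query_time: 4.1
SELECT * FROM wp_postmeta WHERE meta_value = 'abc';
# Query_time: 3.8
SELECT * FROM wp_postmeta WHERE meta_value = 'xyz';
# Query_time: 0.2
SELECT option_value FROM wp_options WHERE option_name = 'siteurl';
"""

def fingerprint(sql: str) -> str:
    """Collapse string literals so identical query shapes group together."""
    return re.sub(r"'[^']*'", "?", sql.strip())

def worst_offenders(log: str):
    totals = defaultdict(float)
    pending = None
    for line in log.splitlines():
        m = re.match(r"# Query_time: ([\d.]+)", line)
        if m:
            pending = float(m.group(1))
        elif pending is not None and line.strip():
            totals[fingerprint(line)] += pending
            pending = None
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

top_query, total_time = worst_offenders(LOG)[0]
print(top_query, round(total_time, 1))  # the wp_postmeta scan, 7.9s total
```

Once the offender has a name, the fix is usually boring: an index, a transient, or deleting the plugin that wrote it.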
Then standardize the runtime. One PHP version, one deploy pattern, one caching approach, one monitoring stack. Mixed environments create the kind of friction that quietly burns a senior engineer's afternoon every other day.
Then question every component that exists "because it sounded smart in 2022." If the load balancer is fronting one instance, it isn't load balancing. If the NAT Gateway exists for one outbound API call a day, it's a $30/month rental for a single phone call.
Finally, decide what should leave WordPress entirely. Some sites should become static. Some should be rebuilt against something more boring. Some should be quietly retired and replaced with a 301 redirect.
Takeaway
WordPress isn't the villain in any of these stories. AWS isn't either. The villain is the slow accumulation of decisions that were each individually reasonable, made by people who weren't there for the previous ones, on top of infrastructure that will quietly let you build whatever you want.
At some point the cheapest optimization isn't another instance upgrade. It's stepping back, looking at what you actually have, and putting it back into a shape somebody can understand.
Inherited a tangled WordPress fleet on AWS? Send me the details and I'll take a look.