Scaling Problems Are Architecture Problems in Disguise

Dan Maby · 6 min read

The 3am wake-up call is a design document you never wrote

Last month a founder rang us about a search box. Their directory had crossed a threshold where the autocomplete was timing out, the database was buckling under a query pattern that worked fine at 500 records and fell over at 50,000, and customer support was drowning. They wanted us to make the search faster.

We ended up rewriting the data model.

This is the pattern. Founders treat scaling failures as operational surprises, the sort of thing you respond to with a bigger server and an apology email. They almost never are. The failure was baked in at the architecture stage, often years earlier, and the load just got large enough to expose it. Jon Hyman, CTO of Braze, put it crisply in a recent Stack Overflow piece on the limits of AI-assisted development:

You can't vibe code scale.

You cannot vibe code it, you cannot retainer your way out of it, and you cannot bolt it on after launch. Scale is a property of the decisions you made before you had any users at all.

Rare becomes routine

The most useful mental model we know for this comes from Jason Cohen's essay on how rare things become common at scale. The argument is simple and uncomfortable: things happen with 2,000 servers that you never saw even once with 50 servers, and things which used to happen once in a blue moon now happen every week, or every day. A manual reboot every six months was a perfectly reasonable process. A manual reboot every six hours is a fire.

This is not a server problem. It is a thinking problem. The same shape appears everywhere in production software:

  • The edge case in a form validator that one user hit in 2023 becomes forty support tickets a week once you have 100,000 users.
  • The cron job that occasionally fails silently becomes a daily data-integrity incident.
  • The payment webhook that drops one event in ten thousand starts dropping fifty a day, and now finance can't reconcile.
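
The arithmetic behind that last bullet is worth making explicit. The sketch below is ours, not something from Cohen's essay: it treats each webhook event as an independent trial with the one-in-ten-thousand drop rate from the example, and the daily volumes are assumptions chosen so that the largest one reproduces the fifty-a-day figure.

```python
def expected_per_day(failure_rate: float, events_per_day: int) -> float:
    """Expected number of failures in a day, assuming independent events."""
    return failure_rate * events_per_day


def chance_of_at_least_one(failure_rate: float, events_per_day: int) -> float:
    """Probability that a given day contains one or more failures."""
    return 1 - (1 - failure_rate) ** events_per_day


DROP_RATE = 1 / 10_000  # the rate quoted in the webhook example

# Assumed daily volumes: early product, growing product, the "fifty a day" stage.
for volume in (1_000, 50_000, 500_000):
    per_day = expected_per_day(DROP_RATE, volume)
    any_today = chance_of_at_least_one(DROP_RATE, volume)
    print(f"{volume:>7} events/day: ~{per_day:.1f} dropped, "
          f"{any_today:.0%} chance of at least one today")
```

The failure rate never changes between the rows. Only the volume does, which is why the problem looks like a surprise when it finally arrives.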

The Atlassian engineering team described exactly this dynamic while debugging Jira Cloud, where they found themselves watching a one-in-four-billion event surface every week. Their conclusion is the one every growing product eventually arrives at: the issues we encounter change with scale, and as we grow we need to find and fix new problems.

The trap is that the founder who built version one optimised, quite reasonably, for the problems they had then. Very few design a system for a million users when they have ten. The mistake is not making early decisions; it is failing to revisit them. The early-stage shortcut becomes a load-bearing wall, and by the time anyone notices, ripping it out costs ten times what it would have cost to do properly the first time.

Brittle points hide in plain sight

If rare-becomes-common is the first half of the story, brittle points are the second. Jason Cohen also writes well on this, framing brittleness as any place in your system where one failure causes disproportionate collapse. The classic version is the single database server with no read replica. The interesting version is the one nobody flags as critical until it breaks.

We see the same handful of brittle points across most growing products:

  • A single background worker doing everything from email sending to image processing to report generation, with no queue isolation. One slow job blocks the others.
  • A search index that only rebuilds nightly, so any data correction takes up to 24 hours to appear in the UI. Support knows this. Users don't.
  • One person on the team who understands the deployment pipeline. They take a fortnight off and everyone holds their breath.
  • A third-party API treated as if it cannot fail. It will.
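
To make the first of these concrete, here is a minimal sketch of queue isolation, assuming an in-process setup with one thread per job type. The names (`start_worker`, `email_jobs`, `report_jobs`) are hypothetical, and a real product would more likely use a proper job system with separate named queues, but the principle is the same: a slow report must not be able to sit in front of an email.

```python
import queue
import threading


def start_worker(jobs: queue.Queue) -> threading.Thread:
    """Run callables from one queue on a dedicated thread until None arrives."""
    def run() -> None:
        while True:
            job = jobs.get()
            if job is None:  # sentinel: shut this worker down
                break
            try:
                job()
            except Exception as exc:  # one bad job must not kill the worker
                print(f"job failed: {exc!r}")

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread


def demo() -> bool:
    """Return True if an email is sent while a slow report is still running."""
    email_jobs: queue.Queue = queue.Queue()
    report_jobs: queue.Queue = queue.Queue()
    workers = [start_worker(email_jobs), start_worker(report_jobs)]

    report_may_finish = threading.Event()
    email_sent = threading.Event()

    # The "slow" report occupies its worker until we explicitly release it.
    report_jobs.put(lambda: report_may_finish.wait(timeout=5))
    # The email goes to its own queue, so it does not wait behind the report.
    email_jobs.put(email_sent.set)

    sent_while_report_running = email_sent.wait(timeout=2)

    report_may_finish.set()
    for jobs in (email_jobs, report_jobs):
        jobs.put(None)
    for worker in workers:
        worker.join(timeout=5)
    return sent_while_report_running


print(demo())
```

Put both jobs on a single queue with a single worker and the email waits until the report is released, which in production means it waits for as long as the slowest job takes.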

None of these are exotic. They are the boring middle of any system audit. And almost all of them are cheap to fix at year one and expensive at year three, because by then the brittle thing has grown tendrils into ten other things.

This is why we are sceptical of the "ship fast and refactor later" doctrine in its strong form. We are entirely in favour of shipping fast. We are not in favour of shipping without a mental model of where the load-bearing walls are, because you will not refactor later. You will sprint to keep up with growth, and the refactor will keep slipping until it becomes a rewrite.

Site search is the canary

If you want a single diagnostic for whether a content-heavy product has been built with scale in mind, look at the internal search box. Smashing Magazine ran a piece earlier this year called The Site-Search Paradox on why users abandon site search and just open Google instead. The reason, almost universally, is that site search was built as a feature rather than as a product.

A CMS default search is fine when you have a hundred items. It is embarrassing when you have ten thousand. It is a business problem when you have a hundred thousand, because by then search is how users actually navigate, and a bad search box is indistinguishable from a broken product.

We built All Counseling's directory specifically because off-the-shelf search collapses at the scale they operate at: over ten thousand professional profiles, faceted by speciality, geography, insurance, modality, availability. The naive approach (LIKE queries against a relational database, or a generic plugin) works for the first six months and then very visibly stops working. Doing this properly means choosing the right index, structuring the data so the queries you actually run are the queries the index is optimised for, and treating relevance tuning as an ongoing product discipline rather than a one-off setup task.
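
A small illustration of why the naive approach stops working, using SQLite from the Python standard library. This is not the All Counseling implementation: the `profiles` table, its columns, and the index name are invented for the example. It asks the database how it plans to run a substring LIKE query versus an exact facet match on an indexed column.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str) -> list[str]:
    """Return the human-readable steps SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the detail text is the last column


conn = sqlite3.connect(":memory:")
# Hypothetical schema, for illustration only.
conn.execute(
    "CREATE TABLE profiles (id INTEGER PRIMARY KEY, name TEXT, speciality TEXT)"
)
conn.execute("CREATE INDEX idx_profiles_speciality ON profiles (speciality)")

# Leading wildcard: the index cannot narrow the search, so every row is visited.
like_plan = query_plan(
    conn, "SELECT id FROM profiles WHERE speciality LIKE '%anxiety%'"
)

# Exact facet match: the index takes the query straight to the matching rows.
exact_plan = query_plan(
    conn, "SELECT id FROM profiles WHERE speciality = 'anxiety'"
)

print("LIKE :", like_plan)
print("exact:", exact_plan)
```

The first plan is a scan and the second is an index search. At a few hundred rows nobody can tell the difference. At tens of thousands of rows, with several facets combined on every keystroke, the scan is the timeout.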

The broader point: search is the place where architectural debt becomes user-visible fastest. If your search is slow, your architecture is the problem, not your search.

AI did not change this. It made it sharper.

The obvious objection today: doesn't AI change this? The cheapest way to ship a brittle system in 2026 is to vibe code it. The tooling is genuinely good for prototyping, and we use it daily. But the Stack Overflow piece linked above makes the right point: the AI explosion actually makes senior engineering judgment more valuable, not less, because someone still has to own the consequences of what gets built and whether it can function at scale.

An LLM will happily generate a schema, an API, and a frontend that all work beautifully on the demo dataset. It will not tell you that the schema makes a particular query pattern impossible to index, or that the API has no idempotency story, or that the auth flow has a race condition that surfaces once every ten thousand logins. Those are the decisions that determine whether your product survives its first growth spurt. They require someone who has watched systems fail before and is willing to argue, on day one, for the boring choice.
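
For readers who have not met the term, an "idempotency story" means that delivering the same request or event twice has the same effect as delivering it once. The sketch below is a deliberately small, in-memory illustration under our own assumptions: `PaymentHandler`, `event_id`, and the pence amounts are hypothetical names, and a production version would need a durable store and an atomic check-and-write rather than a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentHandler:
    """Hypothetical handler that applies each payment event at most once."""

    balance_pence: int = 0
    # Maps an event ID to the balance recorded when it was first applied.
    _seen: dict = field(default_factory=dict, repr=False)

    def handle(self, event_id: str, amount_pence: int) -> int:
        """Apply the event, or replay the stored result if it was seen before."""
        if event_id in self._seen:
            return self._seen[event_id]  # duplicate delivery: no second charge
        self.balance_pence += amount_pence
        self._seen[event_id] = self.balance_pence
        return self.balance_pence


handler = PaymentHandler()
handler.handle("evt_1", 500)
handler.handle("evt_1", 500)  # the provider retried the same event
handler.handle("evt_2", 250)
print(handler.balance_pence)
```

Without the check on `event_id`, the retried event is counted twice, and nothing in a demo dataset will ever show it because demo datasets do not contain retries.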

Our take

We build software at Blue 37 with a specific bias: we assume the system will be in production for at least five years, and we assume it will grow. That is not a guess about your roadmap. It is a discipline. If we are wrong and the product gets shelved in eighteen months, the cost of the discipline is small. If we are right and the product succeeds, the cost of not having it is enormous, because every shortcut compounds.

This is what we mean when we talk about product-minded engineering rather than agency delivery. An agency ships what was specified. A product-minded team asks what happens in year three: what the search box looks like at fifty times the current data volume, where the brittle points will emerge, which rare events will become routine, and which decisions made this week will be the ones a future engineer curses. We argue about those things before we write the code, because that is the only point at which arguing is cheap.

You cannot architect away every problem. New scale brings new problems by definition. What you can do is stop shipping the same predictable failures that every growing SaaS company has shipped before you. Most scaling crises are not novel. They are the second-order consequences of decisions that looked sensible in week one and were never reviewed.

If you are staring at a system that is starting to creak, or you are about to build something you intend to grow, let's have the conversation early, before the 3am wake-up call.