Designing for Failure: How to Build Resilient Systems

Founders often ask me why their new system, perfect for a month, suddenly fell apart.

My answer is that the initial assumption was wrong. The goal isn’t a perfect system that never fails. It’s a resilient one that anticipates failure and handles it.

The pursuit of perfect efficiency creates fragility. We design for the ideal path, the expected inputs, and correct user behavior. But the real world is messy. It’s filled with unexpected edge cases, human error, and shifting conditions. When a system is designed only for perfection, its first encounter with reality is often its last.

This isn’t a niche problem. In my experience, a large share of automation projects fail outright, and many of the ones that survive still need regular manual intervention. The problem is not the technology. It’s the philosophy. We are building glass houses when we should be building with steel and shock absorbers.

This is about the mindset shift from designing for perfection to designing for failure. These principles, learned from my own project failures at Mehrklicks and JvG Labs, help me build systems that last.

An Analysis of Our Biggest Automation Failure at Mehrklicks (And the Single Point of Failure We Missed)

A few years ago at Mehrklicks, we built a complex piece of marketing automation. It was designed to sync customer data from our CRM with a third-party email platform, tag users based on purchase behavior, and trigger targeted campaigns. For the first few weeks, it was flawless. Efficiency was up, engagement looked good. A win.

Then, one Tuesday, it all broke.

The trigger was a minor, unannounced API update from the email platform. A single data field name was changed, from customer_id to customer-id. Our system, built on the assumption that this field name was static, could no longer match records. It didn’t just stop working. It began creating thousands of duplicate, untagged contacts, polluting our database and firing off the wrong emails.

The immediate aftermath was chaos. Once the fire was out, the real work began: a blameless post-mortem. The goal is not to find someone to blame, but to understand the sequence of events and identify the weak points in the system itself. Blame stops inquiry. Analysis drives improvement.

Our post-mortem revealed the root cause wasn’t the API change. That was just the trigger. The true failure was our assumption that a third-party data schema would never change. We had built a system with a single point of failure. It was only a matter of time.
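One way to make such an assumption explicit is to check every incoming record against the fields the sync depends on, and to stop the whole batch when they are missing. The following is a minimal Python sketch for illustration, not the original Mehrklicks implementation. The names `REQUIRED_FIELDS`, `SchemaDriftError`, and `sync_batch`, and the `email` field, are hypothetical.

```python
# Hypothetical: the fields the sync relies on from the email platform.
REQUIRED_FIELDS = {"customer_id", "email"}


class SchemaDriftError(Exception):
    """Raised when an incoming record no longer matches the expected schema."""


def validate_record(record: dict) -> dict:
    """Return the record unchanged if every required field is present, else raise."""
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        unexpected = sorted(set(record) - REQUIRED_FIELDS)
        raise SchemaDriftError(
            f"missing fields {sorted(missing)}; unexpected fields {unexpected}"
        )
    return record


def sync_batch(records: list) -> list:
    """Validate the whole batch before anything is written, so a renamed
    field halts the sync instead of producing duplicate contacts."""
    return [validate_record(record) for record in records]
```

With a check like this, a record carrying `customer-id` instead of `customer_id` raises an error that names both the missing and the unexpected field, which is usually enough to spot a rename at a glance.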

Every assumption is a potential point of failure. The key is to identify them and build in mechanisms to handle them when they prove false. I’ve since refined our internal worksheet into a public template.

⬇ Download: The Blameless Post-Mortem: My Framework for Turning Failure into Insight

The ‘Circuit Breaker’ Concept: Building Kill Switches into Every Automated System

The Mehrklicks failure wasn’t just a data problem; it was a cascading failure. The initial error should have been caught, but instead propagated through the system, causing more damage at each step. This led me to adopt a concept from electrical engineering: the circuit breaker.

In a house, a circuit breaker detects a power surge and cuts the connection, protecting appliances. In a business system, it’s a pre-built mechanism that detects when a process is operating outside safe parameters and automatically stops it.

A system without a circuit breaker is a runaway train. Think about automated ad spend. If a campaign is misconfigured, a circuit breaker could pause all spending if the CPC spikes 200% above the historical average. Without one, you might burn an entire monthly budget in hours.
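The ad-spend example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: "200% above the historical average" is read here as three times the baseline, and the class name `CircuitBreaker` and its parameters are hypothetical rather than taken from any ad platform's API.

```python
from statistics import mean


class CircuitBreaker:
    """Trips once the CPC exceeds its ceiling and stays open until reset."""

    def __init__(self, historical_cpcs: list, max_increase: float = 2.0):
        # max_increase=2.0 is read as "200% above average", i.e. 3x the baseline.
        self.baseline = mean(historical_cpcs)
        self.ceiling = self.baseline * (1 + max_increase)
        self.tripped = False

    def check(self, current_cpc: float) -> bool:
        """Return True if spending may continue, False if the breaker is open."""
        if current_cpc > self.ceiling:
            self.tripped = True
        return not self.tripped

    def reset(self) -> None:
        """Manual reset only: a human confirms the campaign is safe again."""
        self.tripped = False
```

The breaker deliberately stays open after the CPC drops back to normal. Spending resumes only once someone calls `reset()`, which keeps a person in the loop after every trip.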

This directly addresses the Automation Paradox: the more reliable an automated system is, the less human operators pay attention to it, leaving them unprepared when failure inevitably occurs. Circuit breakers are the system’s way of raising its hand and saying, “A human needs to look at this. Now.”

We now build these safety mechanisms into every critical system. Here’s the checklist we use.

Safety Checklist for High-Risk Automations

Volume Threshold: Does the system have a rule to halt if the number of items it processes (e.g., emails sent, records updated) in an hour exceeds a predefined maximum?
- Example: Pause an email campaign if it tries to send to more than 110% of the target list size.
Financial Guardrails: Are there absolute limits on financial transactions or ad spend?
- Example: A hard stop on any ad campaign that spends more than $500 in a single day without manual approval.
Manual Override/Kill Switch: Is there a clear, accessible “big red button” that a non-technical user can press to immediately halt the entire process?
Data Validation Trigger: Will the system stop if the input data doesn’t match the expected format or quality?
- Example: In our Mehrklicks case, a circuit breaker would have paused the sync the moment it received records without a valid customer_id.
Heartbeat Monitoring: Does the system send a regular “I’m alive and healthy” signal to a monitoring dashboard? If the signal stops, an alert is triggered.
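Several of these checks are only a few lines of logic each. The Python sketch below combines the volume threshold, the financial guardrail, and the kill switch into one small object. It is an illustration under assumptions: the class `Guardrails`, the exception `AutomationHalted`, and all field names are hypothetical, and the 110% and $500 values are simply the examples from the checklist.

```python
from dataclasses import dataclass


class AutomationHalted(Exception):
    """Raised when a guard stops the run; a human reviews before resuming."""


@dataclass
class Guardrails:
    target_list_size: int
    daily_spend_cap: float = 500.0   # hard stop from the checklist example
    volume_tolerance: float = 1.10   # 110% of the target list size
    kill_switch: bool = False        # the "big red button"
    sent: int = 0
    spent_today: float = 0.0

    def record_send(self, count: int = 1) -> None:
        """Count outgoing emails; refuse any send that would exceed the limit."""
        self._check_kill_switch()
        if self.sent + count > self.target_list_size * self.volume_tolerance:
            raise AutomationHalted("volume threshold exceeded")
        self.sent += count

    def record_spend(self, amount: float) -> None:
        """Count ad spend; refuse any charge that would exceed the daily cap."""
        self._check_kill_switch()
        if self.spent_today + amount > self.daily_spend_cap:
            raise AutomationHalted("daily spend cap exceeded")
        self.spent_today += amount

    def _check_kill_switch(self) -> None:
        if self.kill_switch:
            raise AutomationHalted("kill switch engaged")
```

Each guard checks before it counts, so a rejected send or charge never gets recorded. In a real system the `kill_switch` flag would live somewhere a non-technical user can flip it, such as a shared settings page.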

These aren’t about adding complexity. They are about building the stops that prevent a small error from becoming a disaster.
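Heartbeat monitoring differs from the other checks because it watches for silence rather than excess. A minimal Python sketch, assuming a hypothetical `HeartbeatMonitor` class that the automation calls on every successful cycle and that a dashboard polls separately:

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Tracks the last 'I'm alive' signal and reports when it is overdue."""

    def __init__(self, timeout_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock            # injectable so tests can fake time
        self._last_beat = clock()

    def beat(self) -> None:
        """Called by the automation after each healthy cycle."""
        self._last_beat = self._clock()

    def is_stale(self) -> bool:
        """True when no heartbeat arrived within the timeout: raise an alert."""
        return self._clock() - self._last_beat > self.timeout_seconds
```

The clock is passed in as a parameter so the timeout logic can be tested without waiting. `time.monotonic` is the default because it is not affected by system clock adjustments.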

A Reflection on Redundancy: Why ‘Inefficient’ Backups Are the Key to Long-Term Stability

An obsession with pure efficiency leads to eliminating anything that looks like waste. We want the fastest process, the leanest workflow. But in resilient systems, some forms of “inefficiency” are not bugs. They are critical features. This is strategic redundancy.

Redundancy is a backup. Critical flight systems have triple redundancy. If one fails, another takes over instantly. We may not be building airplanes, but the concept is just as vital for business continuity.

It’s important to distinguish between “bad” and “good” inefficiency.

Bad Inefficiency: A 15-step approval process for a $50 expense. This is bureaucracy that adds friction without safety.
Good Inefficiency: Running a secondary, off-site backup of all critical company data every night. It consumes resources and doesn’t contribute to daily output, but it saves the company when the primary server fails.

At JvG Technology, we build complex production lines for solar modules, where a single component failure can halt a multi-million-dollar operation. We strategically build in redundancies: backup power supplies, duplicate sensors on critical machinery, and cross-trained operators. This “inefficiency” is just insurance against catastrophic downtime.

In a digital marketing context, this could mean:

Data Backups: Maintaining a separate, clean copy of your customer list before running a major data-syncing operation.
Manual Cross-Checks: For a system that automatically calculates sales commissions, having a human spot-check 5% of the payouts. This small manual step can catch a systemic error that might go unnoticed for months.
Parallel Systems: Running an old, stable system in parallel with a new one for a month before decommissioning it. It feels inefficient, but it provides a verified safety net.
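The manual cross-check is the easiest of these to automate halfway: the system picks the sample and a human does the checking. A minimal Python sketch, assuming payouts are identified by string IDs; the function name `pick_spot_checks` and its parameters are hypothetical.

```python
import math
import random
from typing import Optional


def pick_spot_checks(payout_ids: list, fraction: float = 0.05,
                     seed: Optional[int] = None) -> list:
    """Pick a random sample of payouts for a human to verify by hand."""
    if not payout_ids:
        return []
    # Always review at least one payout, even for tiny batches.
    sample_size = max(1, math.ceil(len(payout_ids) * fraction))
    # A dedicated Random instance keeps the draw reproducible when seeded.
    return random.Random(seed).sample(payout_ids, sample_size)
```

Sampling at random matters here. Checking the same few payouts every month only proves that those few are right, while a fresh random draw gives a systemic error a real chance of being caught.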

Be intentional. Ask: “What is the single process that, if it failed, would cause the most damage?” That is where you need to invest in redundancy.

My Process for Stress-Testing a New Operational System Before Full Rollout

Instead of waiting for a system to fail in the wild, you have to try to break it yourself in a controlled environment. We do this with a thought exercise called a “pre-mortem” and with practical simulations.

The Pre-Mortem: Failing on Purpose

A post-mortem happens after a failure. A pre-mortem happens before you launch.

The process is simple. Gather the project team and open with this prompt: “Imagine it’s six months from now. The project was a complete disaster. It failed spectacularly. Now, write down every possible reason why.”

This exercise liberates the team from groupthink. Suddenly, potential issues that felt too pessimistic to raise are the entire point of the meeting. You’ll uncover dozens of failure points: “The third-party API we rely on went down,” “User adoption was zero because the interface was confusing,” or “The database couldn’t handle the load during peak season.”

With this list, you can prioritize the risks and build mitigations before you start the project.

Running Practical Simulations

After the pre-mortem identifies likely failure scenarios, we design tests to simulate them. This is not simple QA. It’s about testing the system’s limits and resilience. Our stress-testing usually includes:

Spike Tests: What happens if we hit the system with 10x the expected traffic or data volume in five minutes? Does it crash, or does it slow down gracefully and recover? This is critical for systems facing seasonal demand.
Soak Tests: We run the system at a normal load but for a prolonged period, like 48 hours straight. This is designed to uncover memory leaks or performance degradation that only appear over time.
Exception Handling: We intentionally feed the system garbage data. What if a user uploads a PDF instead of a JPG? Or if a required field in an API call is blank? Does it return a clear error, or does it crash the whole process? The system must fail predictably and informatively.
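The exception-handling test lends itself to a small harness: feed a handler deliberately bad inputs and list every case that was either accepted or crashed in an uncontrolled way. The Python sketch below is an illustration under assumptions. The handler `accept_upload`, the exception `InvalidInput`, and the `email` field are hypothetical stand-ins for whatever the system under test exposes.

```python
from pathlib import Path

ALLOWED_IMAGE_TYPES = {".jpg", ".jpeg"}


class InvalidInput(ValueError):
    """A predictable, descriptive failure the caller can show to the user."""


def accept_upload(filename: str, payload: dict) -> str:
    """Hypothetical handler under test: reject bad input with a clear error."""
    suffix = Path(filename).suffix.lower()
    if suffix not in ALLOWED_IMAGE_TYPES:
        raise InvalidInput(f"unsupported file type {suffix!r}; expected JPG")
    email = payload.get("email")
    if not isinstance(email, str) or not email.strip():
        raise InvalidInput("required field 'email' is missing or blank")
    return "accepted"


def find_unhandled(cases: list) -> list:
    """Feed garbage to the handler and return every case that was accepted
    or that crashed with anything other than InvalidInput."""
    problems = []
    for filename, payload in cases:
        try:
            accept_upload(filename, payload)
            problems.append((filename, "accepted bad input"))
        except InvalidInput:
            pass  # the predictable failure we want
        except Exception as exc:  # any other crash is a finding
            problems.append((filename, f"crashed: {type(exc).__name__}"))
    return problems
```

An empty result means every bad input failed predictably. Anything in the list is a case where the system either swallowed garbage or fell over, and both are worth fixing before rollout.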

This process of failing on purpose, first in theory and then in practice, is the single most effective way to build resilient systems.

Your Blueprint for Building Systems That Last

If you’re asking, “What did I do wrong?” after a failure, you’re on the right track. The mistake isn’t the failure itself; it’s the belief that it could have been avoided entirely.

Resilient design is a philosophy. It’s the acceptance that chaos is the default state and our job is to build systems that navigate it.

Here is the blueprint I work from:

Assume Failure: Design every system as if its components—people, software, data—will eventually fail.
Conduct Blameless Analysis: When failure occurs, investigate the system, not the person. The goal is to learn.
Install Circuit Breakers: Build automatic stops and manual overrides for critical processes to contain the blast radius.
Embrace Strategic Redundancy: Identify single points of failure and build in “good inefficiency” to protect them. A backup is only a waste of resources until you need it.
Fail on Purpose: Use pre-mortems and stress tests to find weaknesses before your customers do.

When you shift your focus from creating the “perfect” system to building a resilient one, you stop fighting reality. You start working with it. The result is systems that are more stable, more reliable, and better able to evolve.
