Psychological Safety as an enabler of adaptability and resilience in Complex Systems

… there was no such thing as absolute control, not in a fully functioning universe. There was just a variable amount of lack of control.

— Terry Pratchett, The Science of Discworld III: Darwin’s Watch

When we break down, it all breaks down.

— Terry Pratchett, Night Watch

There are different kinds of rules. From the simple comes the complex, and from the complex comes a different kind of simplicity. Chaos is order in a mask.

— Terry Pratchett, Thief of Time

Is it frivolous to start with some Pratchett quotes? Perhaps, but it’s also pertinent. Part of what makes Pratchett’s writing so loved and interesting is the way it expresses the complexity and absurdity of humans and the systems within which they exist.

As a journalist, and later a press officer for an electricity service region that included three nuclear power stations, Pratchett is said to have joked that he would “write a book about his experiences if he thought anyone would actually believe them”.

I always appreciated his writing because it expressed the complexity of systems, group dynamics and human nature in an honest and irreverent way.

As a fellow observer of people, I found his observations reassuring.

Complex Systems

How Complex Systems Fail by Richard Cook is a great treatise on complexity, the complex nature of failure, and the role humans play in those systems. It is a fascinating read, and a short one too, clocking in at only four pages or so.

It makes some great call-outs that question how we as human beings perceive and respond to failures, and it turns some conventional wisdom on its head.

I’d advise anyone to read it in full, but I’ll list the 19 key points below:

  1. Complex systems are intrinsically hazardous systems — The complex systems we build inherently contain multiple failure modes.

  2. Complex systems are heavily defended against failure — They have multiple layers of defense and redundancy built in through hard-won lessons.

  3. Catastrophe requires multiple failures — Single failures are typically not enough to cause accidents in well-designed systems.

  4. Complex systems contain changing mixtures of failures latent within them — There are always hidden defects and vulnerabilities present.

  5. Complex systems run in degraded mode — Systems frequently operate with multiple known deficiencies while maintaining acceptable performance.

  6. Catastrophe is always just around the corner — The potential for catastrophic failure is always present in complex systems.

  7. Post-accident attribution of cause is fundamentally wrong — Hindsight bias leads to oversimplified explanations that don’t capture true complexity.

  8. Hindsight bias remains the primary obstacle to accident investigation — People tend to oversimplify and judge past events based on their outcomes.

  9. Human operators have dual roles: as producers & defenders against failure — Operators both run the system and prevent accidents.

  10. All practitioner actions are gambles — Every decision involves tradeoffs and uncertain outcomes.

  11. Actions at the sharp end resolve all ambiguity — Frontline operators must make decisions even with incomplete information.

  12. Human practitioners are the adaptable element of complex systems — People adapt to handle varying conditions and maintain safety.

  13. Human expertise in complex systems is constantly changing — Expertise has to evolve as technology and conditions change.

  14. Change introduces new forms of failure — System modifications always bring new potential failure modes.

  15. Views of ‘cause’ limit the effectiveness of defenses against future events — Oversimplified explanations lead to inadequate solutions.

  16. Safety is a characteristic of systems and not of their components — Safety emerges from interactions, not individual parts.

  17. People continuously create safety — Safety requires active effort and isn’t just built into systems.

  18. Failure free operations require experience with failure — Understanding how systems fail helps prevent accidents.

  19. Some systems are better positioned to catch drift toward failure — Organizations vary in their ability to detect and correct developing problems.

A key point here is that because a system is complex, it will always have unknowns, and the boundaries of the system also encompass the human practitioners who maintain and operate it; they, we, I are part of that system.

It’s easy to think of a system as just its components, but a system includes the human elements within it, and those are the most adaptable parts of any system.

But we humans are biased towards a simple root cause for an issue; it’s nice to think we understand the thing, that there was one event that caused a failure. We are also cognitively time constrained: immediacy is a natural bias.

Recent experiences have reminded me that you cannot subdivide complexity in such a simplistic way. Systems don’t just fail; their failures are the descendants of a multitude of events, decisions, non-decisions, and external and internal pressures within the system, human or otherwise.

I’m reminded of a large-scale outage last year, and the immediate flood of hot takes online:

  • “If only they’d done X, this would not have happened!”

  • “It is clear that this was a problem with Y.”

  • “They clearly didn’t do TDD.” *

I remember thinking and posting at the time (before a post-mortem had even dropped from the company involved) that it was ridiculous to speculate on a cause, and certainly on a single root cause, because systems are complex.

There is no defining root-cause, but a series of cascading, cumulative inputs, decisions, trade-offs, pressures and events that push a system beyond even the boundaries within which it can operate in a degraded state.

There are different kinds of rules. From the simple comes the complex, and from the complex comes a different kind of simplicity. Chaos is order in a mask.

Life is a messy, iterative, recursive function, with an endless stream of ever-changing inputs. If X then Y is a gross oversimplification of how a system works.

To put it another way:

It’s chaos turtles all the way down.

To put it another way (and misquote Carl Sagan — if only incidents were apple pie):

To understand root-cause, you must first create the universe.

Apologies for my sense of humor coming through, but it really gets to points 7 and 15 that Cook makes:

Post-accident attribution of cause is fundamentally wrong

Views of ‘cause’ limit the effectiveness of defenses against future events
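
To make that a little more concrete, here is a deliberately contrived sketch. The component names and the “three latent conditions at once” rule are invented for illustration; the point is only that when several individually survivable conditions line up, no single one of them is any more “the” root cause than the others.

```python
# Deliberately contrived sketch: no single "root cause", only combinations.
# Fault names and the outage rule are invented for illustration.
import itertools

# Latent conditions that are individually survivable (defences absorb them).
latent_faults = {"stale_config", "retry_storm", "expired_cert", "thin_on_call_cover"}

def outage(active_faults: set) -> bool:
    # Invented rule: an outage only happens when three or more latent
    # conditions line up; any one or two are handled in degraded mode.
    return len(active_faults & latent_faults) >= 3

# Every single fault on its own: no outage.
for fault in latent_faults:
    assert not outage({fault})

# But several combinations do cause one, and none of the contributing
# conditions is the root cause by itself.
combos = [set(c) for c in itertools.combinations(latent_faults, 3) if outage(set(c))]
print(f"{len(combos)} different three-fault combinations produce the same outage")
```

Pick any one of those conditions as “the cause”, and the fix that follows will only ever cover a fraction of the ways the same outage could happen again, which is exactly what point 15 is warning about.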

I’ve recently become interested in the field of resilience engineering. One of the things you have to get your head around is how complex systems truly are, and that you can never guard against or prevent every failure. The key becomes how we deal with, learn from, and adapt to changes in a system in such a way that the system as a whole is more resilient when failure inevitably strikes.

This is a place I’ve come to more naturally over the years. I have a tendency to want to control every aspect of something that I can; the learning for my order-seeking brain has been that this is impossible.

I can’t control every variable. I don’t even know what all the variables are. No one really does, and that is the point.

Tomorrow’s variables will not be the same as today’s.

David Woods distills this down to a few base concepts in his paper on graceful extensibility, and they really resonated with me:

  • Resources are finite

  • Surprise is fundamental

  • Change never stops

It’s a much longer read, but also well worth it.

When you create something, you can’t foresee all of the interactions and changes that system will undergo, or predict the ways in which external systems will interact with it and affect it, nor how the trade-offs made to realize the system will play out over time.

Woods made the case that complex systems exist within a boundary, or competence envelope. When a system is well away from that boundary it behaves as expected; the closer it gets to the boundary, however, the more brittle it becomes.

Brittleness, conceptually, is the sudden collapse of a system as it is pushed beyond the point at which it can adapt to changing circumstances and variance.

In this paper, he introduced the concept of graceful extensibility. Put simply, as the opposite of brittleness, it reflects the ability of a system to adapt to the fundamental surprises that push it to its boundaries.
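
To make the idea a little more concrete, here is a toy sketch. This is not Woods’ formalism; the envelope, demand values and “adaptive reserve” are invented numbers for illustration. It contrasts a brittle system, which performs fine right up to its boundary and then collapses, with one that can stretch a little past it by drawing on spare adaptive capacity.

```python
# Toy illustration only: the competence envelope, demand values and
# "adaptive reserve" below are invented, not Woods' formal model.

def brittle_response(demand: float, envelope: float) -> float:
    """A brittle system: fine inside its envelope, sudden collapse beyond it."""
    return 1.0 if demand <= envelope else 0.0


def extensible_response(demand: float, envelope: float, reserve: float) -> float:
    """A system with adaptive reserve degrades gradually past its envelope."""
    if demand <= envelope:
        return 1.0
    overshoot = demand - envelope
    # Performance decays smoothly while the reserve absorbs the surprise,
    # rather than dropping straight to zero.
    return max(0.0, 1.0 - overshoot / reserve)


if __name__ == "__main__":
    for demand in (0.5, 0.9, 1.1, 1.4, 2.0):
        print(
            f"demand={demand:.1f}  "
            f"brittle={brittle_response(demand, 1.0):.2f}  "
            f"extensible={extensible_response(demand, 1.0, reserve=1.0):.2f}"
        )
    # The brittle system falls off a cliff at the boundary; the extensible one
    # stretches, buying time for the humans in the loop to adapt.
```

In real systems that “reserve” isn’t a number in a function; it is largely the people around the system and the room they have to adapt.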

So, back to point 12 of “How Complex Systems Fail”: if adaptability is a key component of that graceful extensibility, then:

Human practitioners are the adaptable element of complex systems

And that, in a rather zig-zaggy way, brings me to my thoughts on the last four points in Cook’s treatise:

Safety is a characteristic of systems and not of their components

People continuously create safety

Failure free operations require experience with failure

Some systems are better positioned to catch drift toward failure

If you’ve got this far, I appreciate your patience with me.

Why Psychological Safety is a key plank in adaptable and resilient systems

Hollnagel defined four key capabilities of a resilient system:

  • Ability to respond to current challenges.

  • Ability to monitor incoming critical situations.

  • Ability to anticipate the occurrence of future events.

  • Ability to learn from the past.

Stepping back from these four capabilities, it’s hard not to see how important the human practitioner is to a system’s resilience and adaptability.

Psychological Safety as an enabler of adaptability and resilience

Psychological safety is the shared belief among team members that they can take interpersonal risks without facing negative consequences. In the context of adaptable and resilient systems, it acts as a fundamental enabler that allows teams to navigate challenges and change effectively.

In terms of adaptability at the team level, this translates to:

  • Admitting mistakes and discussing failures openly, which accelerates learning and prevents similar issues from recurring

  • Challenging established practices and suggesting novel approaches, even if they might not work

  • Asking questions or seeking help when needed, rather than hiding knowledge gaps

  • Giving and receiving constructive feedback (even across levels of hierarchy)

  • Expressing dissenting views during critical discussions without fear of repercussion.

Having honest and open conversations around failure (and successes!), and openly sharing knowledge, perspectives and learnings, builds a collective intelligence that improves the team’s capability to handle and respond to surprises, and to adapt to novel surprises in future.

It also encourages innovative thinking, in which unconventional ideas can be proposed that may help in responding to failure or in considering new approaches. This extends the boundaries of possibility in understanding the system, regardless of whether individual ideas or proposals succeed.

That openness also enables what has been termed “requisite imagination”, or the ability to anticipate potential failures, scenarios and challenges that haven’t happened yet, but could affect the safety and performance of a system.

Psychological safety allows people to fearlessly ask “What if?”, and to raise anomalies or concerning things they see before they lead to surprise.

Some failure signals start out weak within systems, but they are there once we learn to identify the signs (pattern recognition). Raising and investigating them earlier may guard against larger-scale issues.
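
As a small, deliberately simplified illustration of catching a weak signal early, here is a sketch that smooths an error-rate series with an exponentially weighted moving average and flags a slow drift above a baseline. The series, smoothing factor and threshold are all invented for the example; real monitoring is messier than this.

```python
# Toy sketch: flag a slow upward drift in an error rate before it becomes an
# obvious incident. The series, smoothing factor and threshold are invented.

def ewma(values, alpha=0.3):
    """Exponentially weighted moving average of a sequence."""
    smoothed, current = [], values[0]
    for v in values:
        current = alpha * v + (1 - alpha) * current
        smoothed.append(current)
    return smoothed


error_rates = [0.010, 0.010, 0.012, 0.011, 0.015, 0.018, 0.022, 0.030, 0.045]
baseline = sum(error_rates[:4]) / 4      # early "normal" behaviour
drift_threshold = 1.5 * baseline         # arbitrary choice for the example

for i, value in enumerate(ewma(error_rates)):
    if value > drift_threshold:
        print(f"weak signal at sample {i}: smoothed error rate {value:.3f} "
              f"is above {drift_threshold:.3f}; worth raising before it cascades")
        break
```

The interesting part isn’t the arithmetic, it’s the culture: someone has to feel safe enough to say “this looks slightly off” while the graph still looks mostly fine.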

This also includes the ability to run through various scenarios and their potential consequences, and to consider the cascading, complex effects and interactions that may occur.

Having this level of openness allows teams to design more robust processes and systems, develop more considered and flexible contingency plans, think through potential failure modes, have thoughtful conversations about what to monitor and how to monitor it, and build the adaptive capacity to handle surprises better.

I hope this made sense, and thanks for reading.

It was a distillation of some recent learning, and also an expression of why psychological safety is a cornerstone of adaptable and resilient organizations and systems.

Resources / Readings

Some resources, gathered for reference:

* Yes, someone actually said this…
