Incident response often looks weaker in practice than it does in the org chart.

The issue is not always the absence of a formal incident program. Many young teams do not need a heavy process to start improving. The sharper signal is that recovery lives in people’s heads.

Ask a few people the same situational question:

“The app is down. What do you normally do first?”

If every answer points to a different path, or if most answers eventually point to the same person, the process is not really shared. It may be working only because someone remembers the system well enough to keep it moving.

Memory Is Not an Operating Model

This is not always a leadership failure. Small teams are often pulled across product work, infrastructure, support, security, releases, and customer pressure. DevOps practices are connective-tissue work: they hold the other functions together without belonging to any one of them, which makes them easy to underfund until something breaks.

The risk starts when the organization treats that constraint as a stable operating model.

Someone knows which dashboard matters. Someone knows which service usually fails first. Someone knows who to message. Someone remembers the sequence that worked last time.

That can hold together for a while. It fails when that person is unavailable, burned out, on vacation, in another time zone, or no longer with the company. Incident response that depends on one person’s availability is not reliability. It is unpriced operational debt.

The False Fix Is More Tooling

Teams often try to solve this with a tool purchase or an infrastructure change. That can help, but only if the operating model changes with it.

An observability platform can expose useful signals. Kubernetes can support strong reliability patterns when a team is ready to operate it. Automation can reduce repetitive manual work.

None of those create shared understanding by themselves.

If the team does not know who acknowledges an issue, who communicates impact, where facts are captured, or how follow-up work gets assigned, new tools can turn one undocumented process into several undocumented systems.

Consultants can help surface the gaps and shape the roadmap. The team still has to adopt the habits. Diagnosis is not the same as operational change.

The First Response Should Not Be Silent

When an alert fires or the application is down, the first response is not only technical triage. It is awareness, ownership, and communication.

A useful first message can be simple:

“The app is down. We’re investigating.”

That message does not claim root cause. It does not overpromise resolution. It does tell the organization that the issue is known and someone has taken ownership.

The necessary audience may include engineering, SRE, leadership, support, and client services. Customer-facing teams often need early context because they are the people answering questions while engineering is investigating. If they are left out, the organization creates a second incident around communication.

Good updates do not need to be elaborate. They need to be honest:

  • We’re looking at this.
  • We do not have a solution yet.
  • We think we have identified the process that failed.
  • We have restored service and are monitoring.

The point is not to put on a show of confidence. The point is to keep the response from becoming a silent technical silo.
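As an illustration of how little structure an honest update needs, here is a minimal Python sketch. The `StatusUpdate` type, the `format_update` function, the owner label, and the timestamp are all hypothetical names and values chosen for this example, not part of any particular chat or paging tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class StatusUpdate:
    owner: str          # person who has taken ownership (hypothetical field)
    summary: str        # what is known right now, stated plainly
    sent_at: datetime   # when the update was sent, in UTC


def format_update(update: StatusUpdate) -> str:
    """Render an update as one plain line suitable for a shared channel."""
    stamp = update.sent_at.strftime("%Y-%m-%d %H:%M UTC")
    return f"[{stamp}] {update.summary} (owner: {update.owner})"


# Example values are made up for illustration.
first = StatusUpdate(
    owner="on-call engineer",
    summary="The app is down. We're investigating.",
    sent_at=datetime(2024, 1, 6, 22, 15, tzinfo=timezone.utc),
)
print(format_update(first))
```

The message carries only three things: a time, a plain statement of what is known, and a named owner. It makes no claim about root cause or time to resolution.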

Write It Down Before It Becomes a Story

After the immediate issue is mitigated, capture what happened while the details are fresh.

That record can be a Jira ticket, a Markdown note, or another lightweight artifact the team already uses. It should capture the issue, the response path, the mitigation, and the information that needs to be shared afterward.
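One way to make that capture cheap is a small template that renders to a Markdown note. The sketch below assumes a hypothetical `IncidentNote` type whose four fields mirror the items listed above; the field names and the sample incident are illustrative, not a standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentNote:
    issue: str                                              # what went wrong
    response_path: List[str] = field(default_factory=list)  # steps taken, in order
    mitigation: str = ""                                    # what restored service
    to_share: List[str] = field(default_factory=list)       # what others need to know

    def to_markdown(self) -> str:
        """Render the note as plain Markdown text."""
        lines = [f"# Incident: {self.issue}", "", "## Response path"]
        lines += [f"{i}. {step}" for i, step in enumerate(self.response_path, 1)]
        lines += ["", "## Mitigation", self.mitigation or "(not yet recorded)"]
        lines += ["", "## To share afterward"]
        lines += [f"- {item}" for item in self.to_share]
        return "\n".join(lines)


# Sample content is invented for illustration.
note = IncidentNote(
    issue="App unavailable",
    response_path=["Alert acknowledged", "First update posted", "Failed process restarted"],
    mitigation="Service restored; monitoring.",
    to_share=["Support: customers may have seen errors during the outage"],
)
print(note.to_markdown())
```

An empty mitigation field renders as "(not yet recorded)", so a note can be started during the incident and filled in as facts arrive.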

This does not need to wait for a formal retrospective. If the outage happens on Saturday night and the review is Monday morning, details will disappear. People will remember the broad shape, but not always the sequence, decisions, or handoffs that mattered.

The first capture does not need to be polished. It needs to exist.

Reviews Need Outcomes

A useful post-incident review is not a blame session. It should not begin from “this would not have happened if that person had done something.”

That framing makes people defensive and teaches the team to survive the meeting instead of learning from the incident.

A useful review produces outcomes. If follow-up work comes out of the meeting, it needs an owner. Otherwise the review becomes paperwork: emotionally costly, operationally familiar, and easy to repeat after the next incident.
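The ownership rule can be checked mechanically before the meeting ends. This is a minimal sketch under assumed names: `ActionItem` and `unowned` are hypothetical, and in practice the same check might be a filter in whatever ticket tracker the team already uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None  # None or blank means nobody has taken it


def unowned(items: List[ActionItem]) -> List[ActionItem]:
    """Return the follow-up items that still have no named owner."""
    return [item for item in items if not (item.owner and item.owner.strip())]


# Sample items are invented for illustration.
follow_ups = [
    ActionItem("Document the restart sequence", owner="Dana"),
    ActionItem("Add an alert for the failed process"),
    ActionItem("Share a summary with support", owner="  "),
]
for item in unowned(follow_ups):
    print(f"Needs an owner: {item.description}")
```

If the list is not empty when the review closes, the review has produced paperwork rather than outcomes.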

The goal is not to prove the team had a bad week. The goal is to change what happens next time.

What Leaders Should Notice

CTOs and VPs of Engineering should look for hero dependency before it becomes a crisis.

In The Phoenix Project, the classic version of this person is Brent: the person everyone needs because so much critical knowledge routes through them. The problem is not that a Brent is hiding information. The problem is that the organization depends on Brent as part of normal operation.

Useful questions for leaders:

  • Are incidents routinely routed through one or two specific people?
  • Can the team function when those people are unavailable?
  • During an incident, are facts written down somewhere others can follow?
  • Can a less technical leader find the documents or tickets being referenced?
  • Can the organization explain what happened without asking the hero to retell the story?

If the answer to the first question is yes, or the answer to any of the others is no, the incident process is still living in memory.
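The first of those questions can be roughly quantified if the team records who resolved each incident. The sketch below is an assumption-laden illustration: `responder_share` is a hypothetical helper, the names are invented, and what share counts as "too concentrated" is a judgment the article does not specify.

```python
from collections import Counter
from typing import Dict, List


def responder_share(resolvers: List[str]) -> Dict[str, float]:
    """Map each person to the fraction of incidents they resolved."""
    total = len(resolvers)
    if total == 0:
        return {}
    counts = Counter(resolvers)
    # most_common() orders people from most to fewest incidents resolved.
    return {name: count / total for name, count in counts.most_common()}


# One entry per past incident: the person who ended up resolving it.
history = ["brent", "brent", "dana", "brent"]
for name, share in responder_share(history).items():
    print(f"{name}: {share:.0%} of incidents")
```

A single name holding most of the share does not mean that person is doing anything wrong. It shows where the organization depends on one person as part of normal operation.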

A Little Organization Is Allowed

Young teams do not need to start with a heavyweight enterprise incident program. They do need permission to be a little organized.

It is okay to write things down. It is okay to have a few known steps. It is okay to be transparent with the team and the company. It is okay to say, “We know there is an issue, we are investigating, and we will keep updating you.”

Incident response maturity starts when the organization stops depending on private memory and starts building shared operating habits.

If recurring incidents, unclear ownership, or hero-dependent recovery are starting to affect your team, a focused DevOps assessment can help turn those patterns into a practical roadmap.