Q: When people hear “Tier-3 incident response at AWS,” they probably imagine alarms, dashboards, and high-drama breach response. What is the part they don’t picture?
A: An incident creates different realities for different teams. Engineering sees system behavior. Security sees risk. Legal sees exposure. Leadership sees decision pressure. Customers may see impact before anyone has the full picture.
The responder’s job is not just to know what happened. It is to help the organization know what happened in a way it can survive.
That means precision matters. Timing matters. Words matter. Saying too much too early creates noise. Saying too little creates a vacuum. Saying the wrong thing sends people in the wrong direction.
A lot of incident response is not heroics. It is disciplined sense-making under pressure.
Q: What separates a good incident responder from a great one?
A: A good responder can investigate.
A great responder can create order without pretending there is certainty.
And that is an important distinction. During a serious incident, everyone wants the clean answer. They want root cause. They want scope. They want confidence. They want to know if it is contained. But early on, you usually do not have the full picture. You have fragments.
The best responders are honest about uncertainty, but they do not let uncertainty become paralysis.
They can say: here is what we know, here is what we think, here is what we are testing, here is what would change our mind, and here is what we are doing next.
Structure gives people something to stand on.
Q: You helped create the framework Amazon used for Tier-3 incident response. What problem were you really solving?
A: When something is bad enough to hit Tier-3, you cannot rely on personality, heroics, memory, or whoever happens to be in the room. You need a system that helps responders orient quickly, preserve context, ask better questions, and communicate clearly while the environment is changing.
The framework had to support judgment, not replace it. Amazon has smart people everywhere, so talent was not the problem. The problem was repeatability under stress.
That is important. Bad frameworks try to automate expertise out of the process. Good frameworks give expertise a structure so it can move faster and survive handoff.
Q: What does “handoff” mean in a global incident response environment?
A: Handoff is where a lot of response quality gets lost.
In a small incident, one person or one team can hold the story in their head. At global scale, that does not work. The incident crosses time zones, teams, services, systems, and decision layers. People rotate in and out. New evidence arrives. Old assumptions expire.
If the incident only exists in Slack threads, tribal memory, or the mind of the person who has been awake for 18 hours, you are in trouble.
The work has to be structured so the next person can inherit not just the facts, but the reasoning.
What did we know?
Why did we believe it?
What did we rule out?
What is still open?
What are we watching?
What decision is blocked on what evidence?
That is the key difference between documentation and operational memory.
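As an illustration only, the six handoff questions above could be captured in a small structured record. This is a minimal sketch, not the framework Amazon used, which the interview does not describe in detail. The names `Finding`, `HandoffNote`, and `render` are hypothetical, and the sample incident data is invented.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Finding:
    """One claim plus the reasoning behind it (hypothetical structure)."""
    claim: str
    evidence: str  # why we believe it, or why we ruled it out


@dataclass
class HandoffNote:
    """Hypothetical handoff record; fields mirror the six questions."""
    known: list[Finding] = field(default_factory=list)
    ruled_out: list[Finding] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    watching: list[str] = field(default_factory=list)
    # Maps a pending decision to the evidence it is waiting on.
    blocked_decisions: dict[str, str] = field(default_factory=dict)

    def render(self) -> str:
        """Produce a plain-text summary the next responder can read cold."""
        lines = ["What we know, and why we believe it:"]
        lines += [f"  - {i.claim} (because: {i.evidence})" for i in self.known]
        lines.append("What we ruled out:")
        lines += [f"  - {i.claim} (because: {i.evidence})" for i in self.ruled_out]
        lines.append("What is still open:")
        lines += [f"  - {q}" for q in self.open_questions]
        lines.append("What we are watching:")
        lines += [f"  - {w}" for w in self.watching]
        lines.append("Decisions blocked on evidence:")
        lines += [
            f"  - {decision} (waiting on: {needed})"
            for decision, needed in self.blocked_decisions.items()
        ]
        return "\n".join(lines)


# Invented example data, for illustration only.
note = HandoffNote(
    known=[Finding("Errors began after a config push", "deploy log timestamps")],
    ruled_out=[Finding("Network partition", "cross-zone probes stayed healthy")],
    open_questions=["Were any customer writes lost?"],
    watching=["Error rate on the affected service"],
    blocked_decisions={"Customer notification": "confirmed scope of impact"},
)
print(note.render())
```

The design choice worth noting is that every finding carries its evidence alongside the claim, so the next person inherits the reasoning and not just the conclusion.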
Q: What is the most misunderstood part of “scale” in incident response?
A: People often hear “scale” and think “more alerts.”
Sure. But that is not the interesting part.
Scale means blast radius gets harder to reason about. Dependencies get harder to see. Normal behavior becomes harder to define. Communication paths multiply. The number of people who need different levels of truth increases.
At small scale, an incident can be a technical problem with communication around it.
At large scale, communication becomes part of the technical system.
A bad update can create work. A vague statement can send teams chasing ghosts. An imprecise category can change the response path. The language you use becomes operational.
That is one of the lessons I carried into later work: naming is not cosmetic. Naming affects what people do next.
Q: What did Tier-3 response teach you about confidence?
A: Confidence is not a feeling. It is a debt you owe to evidence.
That is probably one of the things I carried forward from the military. You learn pretty quickly that people can confuse command presence with certainty. They are not the same thing.
You do not get to sound confident because people are scared. You get to sound confident when the evidence supports it.
In incident response, false certainty is dangerous. It can close off lines of inquiry too early. It can make people stop collecting the data they still need. It can create a story that is emotionally satisfying but technically wrong.
But the opposite is also dangerous. If you communicate every uncertainty with equal weight, you create fog. People cannot act inside fog.
So you learn to separate confidence levels.
Known.
Likely.
Possible.
Unproven.
Ruled out.
No evidence at this time.
These are control surfaces.
The job is not to perform confidence. The job is to earn it, preserve it, and make sure the people around you understand exactly what it is based on.
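One way to picture the confidence levels above as "control surfaces" is to make them an explicit, closed vocabulary that every statement in an update must carry. The sketch below is an assumption about how that could look, not a description of any real AWS tooling. `Confidence` and `tag` are hypothetical names, and the rule that every statement needs a stated basis is this sketch's own choice, reflecting the point that confidence has to rest on evidence.

```python
from enum import Enum


class Confidence(Enum):
    """The six levels named in the interview, as a fixed vocabulary."""
    KNOWN = "Known"
    LIKELY = "Likely"
    POSSIBLE = "Possible"
    UNPROVEN = "Unproven"
    RULED_OUT = "Ruled out"
    NO_EVIDENCE = "No evidence at this time"


def tag(statement: str, level: Confidence, basis: str) -> str:
    """Label a statement with its confidence level and what it rests on.

    Rejecting an empty basis is an assumption of this sketch: it forces
    the writer to say what the confidence is based on.
    """
    if not basis.strip():
        raise ValueError("every statement needs a stated basis")
    return f"[{level.value}] {statement} (basis: {basis})"


# Invented example statements, for illustration only.
print(tag("Impact is limited to one region", Confidence.LIKELY,
          "no errors observed elsewhere in the last hour"))
print(tag("Customer data was accessed", Confidence.NO_EVIDENCE,
          "access logs reviewed so far show no reads"))
```

Using a fixed set of labels keeps readers from having to interpret tone: the level is stated, and so is the evidence behind it.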
Q: What did AWS teach you that later shows up in FT3?
A: That language has to survive handoff.
At AWS scale, if a concept only works when one expert explains it live, it does not scale. If the meaning breaks when it moves between teams, it does not scale. If it cannot be operationalized, queried, tagged, mapped, or connected to action, it has limited value.
FT3 comes from a different domain, but the lesson is similar.
Fraud teams are often dealing with behavior that crosses boundaries: accounts, merchants, cards, identities, infrastructure, platforms, institutions. If the language cannot travel, the defense cannot travel either.
That is why I place so much weight on operating language. Not terminology. Operating language.
Language that helps people act.
Q: Was there a moment when you realized incident response was really systems work?
A: Probably the first time I watched a technical fact change meaning as it moved through the organization.
A log entry is a log entry to one person. To another, it is customer impact. To another, it is legal exposure. To another, it is a product issue. To another, it is an executive decision.
The fact did not change. The system around the fact changed.
That is when you realize the responder is not just investigating a machine. They are operating inside a human and technical system at the same time.
So you are managing evidence, interpretation, timing, decision pressure, and trust.
That is systems work.
Q: What should the audience understand about your world that they probably do not?
A: The worst day is not just a test of your tools.
It is a test of your language, your systems, your memory, your trust, and your ability to think clearly while the facts are still arriving.
We cannot make uncertainty disappear, so the work is to build systems that can keep moving through it.
Q: What is the through-line from Amazon Tier-3 IR to the work you are doing now?
A: The through-line is making hard things operational: reducing the distance between seeing something and being able to act on it.
That distance is where harm compounds.