Our third purple team exercise wrapped at 4 pm on a Friday. The red team had executed twelve TTPs. The blue team had documented seventeen gaps. By Tuesday, none of it had turned into a detection.
I'd built the program from scratch nine months earlier and pitched it to leadership as a way to close the gap between what the SOC thought it could detect and what it actually could. The exercises ran well, the reports looked professional, and the detection backlog never changed.
Year two was different, and the difference wasn't effort or budget. It was that year one treated purple teaming as a calendared exercise, and exercises produced documented gaps that nobody turned into detections. Year two treated it as a continuous detection validation pipeline, and the pipeline produced shipped detections that were version-controlled and regression-tested on a schedule.
This article walks through the specific failure modes I lived, the structural diagnosis, and what I'd do differently if I got to skip year one entirely.
In brief:
- Purple teaming isn't a tabletop exercise with a red team in the room. It's a continuous detection validation pipeline that turns adversary TTPs into tested, asserted, version-controlled detections.
- Year-one purple programs fail the same way: random TTPs, no assertions, slide-deck deliverables, no connection to the detection backlog.
- A purple test card with explicit detection assertions is the unit of work. Anything looser produces demos, not detections.
- Treat purple findings like code: version-controlled, CI-tested, tracked as tickets that fail until the detection fires.
How I think about purple teaming after two years of running it
After two years of running this program, my working definition is different from the textbook one. Purple teaming is the continuous detection validation pipeline that turns adversary TTPs into tested, asserted, version-controlled detections.
The "purple" part is the integration between offense and defense in code, not the calendar invite. The textbook framing (red executes, blue observes, both share notes) describes the collaboration and stops there, which is why standard purple programs ship meetings instead of detections.
The exercise model is what most teams build first because it's the obvious shape for a new program: a handful of calendared events per year, broad TTP selection, multi-day engagements, reports shared afterward. It's also why most year-one programs produce documented gaps instead of closed-loop detection work.
As CyberCX describes, purple team scenarios are crafted based on current threat actor TTPs, prior incidents, or red team findings, and used to test the effectiveness of existing detection and preventive controls. The exercise surfaces the problem, and the program model has no mechanism to fix it.
Year one looked great on slides and changed nothing in the SIEM
Year one followed the playbook every new purple program follows. We ran big exercises across the calendar. Each one mapped to MITRE ATT&CK, ran over multiple days, and ended with a closeout call where the red team reported findings, the blue team documented gaps, and leadership saw heatmaps in a deck.
From the outside, the program looked healthy. Internally, the same gaps appeared at the next exercise because no one had built the detection, validated the telemetry, or even confirmed the SIEM was parsing the relevant log source correctly.
The moment that broke the model for me came during exercise four. Mid-run, we discovered that the domain controller and the AD sync server weren't onboarded to the EDR, which meant no logs were being collected from the highest-value targets in the environment.
Claranet documented the same failure pattern in a client engagement: "Critical servers such as a domain controller and the AD sync server were not onboarded with the client's EDR tool, meaning no logs were collected and monitored from them." The "gap" wasn't a detection gap. It was an instrumentation gap that detection engineers couldn't fix because it required infrastructure access, a different team, and a different change approval process.
The exercise surfaced it, and the exercise model had no way to route it to the right owner. Running better exercises wouldn't have helped, because exercises were the wrong unit of work.
Cool TTPs, no assertions, and instrumentation we'd assumed existed
The four symptoms below look like the problem, but they are consequences of the exercise model. Fixing any one of them in isolation doesn't fix year one.
- Cool TTPs over threat-model alignment: My red team ran a Kerberoasting demo against a domain controller configuration we didn't actually use in production. It made a great slide and tested nothing relevant to the business. Picking ATT&CK techniques because they demo well, rather than because they map to actual threats against business-critical assets, is the default when exercises are disconnected from threat intelligence inputs.
- No test cards or detection assertions: Our success criterion for every scenario was "see if the SOC catches this," which is vague, non-reproducible, and non-automatable. A Sumo Logic purple team exercise in Azure found that existing password spray detections failed entirely because the rules assumed a single source IP, while the actual technique distributed attempts across many IPs. Without an explicit assertion ("rule X fires within Y minutes of command Z"), I couldn't tell a detection gap from a rule logic mismatch or a missing log source.
- Under-instrumented environments: At exercise three, I discovered our EDR wasn't forwarding command-line arguments to the SIEM, so what looked like full detection coverage on paper had a hole running straight through it.
- No detection-engineering integration: Purple findings landed in a deck, and the detection backlog never saw them. SpecterOps argues that detection engineering should be prioritized using inputs from the detection and response program, or resources may be directed inefficiently. Next quarter, the same gaps surfaced again.
All four symptoms come from the same root cause, which is why the SCYTHE Purple Team Exercise Framework v4 emphasizes tracking findings, action items, and lessons learned through tools such as JIRA, PlexTrac, VECTR, or even spreadsheets.
Without a tracking layer that owns the follow-through, every exercise restarts from zero.
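One way to make that routing concrete is to classify every failed run before it reaches the tracker. The sketch below is a minimal illustration, not something taken from any of the frameworks named above: the `Outcome` labels, the `triage` function, and its three boolean inputs are all hypothetical names chosen for this example. It separates the three failure types that "see if the SOC catches this" tends to blur together.

```python
from enum import Enum


class Outcome(Enum):
    # Hypothetical labels for illustration; use whatever your tracker calls them.
    PASS = "detection fired"
    INSTRUMENTATION_GAP = "telemetry never reached the SIEM"
    DETECTION_GAP = "telemetry arrived but no rule covers it"
    RULE_LOGIC_MISMATCH = "a rule exists but did not fire"


def triage(telemetry_seen: bool, rule_exists: bool, rule_fired: bool) -> Outcome:
    """Classify one test run so the finding can be routed to the right owner."""
    if rule_fired:
        return Outcome.PASS
    if not telemetry_seen:
        # Onboarding or log-forwarding problem: an infrastructure ticket,
        # not something a detection engineer can fix by editing a rule.
        return Outcome.INSTRUMENTATION_GAP
    if not rule_exists:
        # The data is there, but nobody has written the detection yet.
        return Outcome.DETECTION_GAP
    # Data is there and a rule exists, so the rule's assumptions are wrong
    # (for example, a threshold keyed on a single source IP).
    return Outcome.RULE_LOGIC_MISMATCH
```

The order of the checks matters: a missing log source is ruled out first, because without telemetry there is no way to tell whether the rule logic is sound.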
Year two: purple teaming as a detection pipeline
The shift was structural, not incremental. We moved from a few big exercises per year to many smaller, version-controlled scenarios, where each scenario was a test card with explicit prerequisites, exact commands, expected telemetry, and binary detection assertions. In our model, the detection engineer owned the response to a failed test, not the red team lead.
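A test card of this kind can be sketched as a small data structure plus one pass/fail check. This is a minimal sketch under assumptions: the field names, the `PurpleTestCard` class, the `assertion_passes` function, the rule ID `RULE-0001`, and the placeholder command are all hypothetical and only mirror the four elements described above (prerequisites, exact command, expected telemetry, binary assertion).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Tuple


@dataclass(frozen=True)
class PurpleTestCard:
    card_id: str
    attack_technique: str            # ATT&CK technique ID for the procedure
    prerequisites: Tuple[str, ...]   # what must be true before the test runs
    command: str                     # exact command the operator executes
    expected_telemetry: str          # log source that must record the activity
    expected_rule_id: str            # the rule that must fire
    max_delay: timedelta             # how long the rule has to fire


def assertion_passes(
    card: PurpleTestCard,
    executed_at: datetime,
    alert_rule_id: Optional[str],
    alert_at: Optional[datetime],
) -> bool:
    """Binary check: the expected rule fired within the allowed window."""
    if alert_rule_id is None or alert_at is None:
        return False  # nothing fired
    if alert_rule_id != card.expected_rule_id:
        return False  # something fired, but not the detection under test
    return timedelta(0) <= alert_at - executed_at <= card.max_delay


# Illustrative card; every value here is a placeholder, not a real rule or command.
card = PurpleTestCard(
    card_id="card-001",
    attack_technique="T1558.003",
    prerequisites=("target host onboarded to EDR",),
    command="<exact command from the red team runbook>",
    expected_telemetry="EDR process events with command-line arguments",
    expected_rule_id="RULE-0001",
    max_delay=timedelta(minutes=5),
)
```

Because the check returns only `True` or `False`, it can run unattended, and a failure can be handed to the detection engineer without a meeting to interpret it.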
The tools that carry this pipeline, with my operational notes on each:
- Atomic Red Team for the atomic technique library. Each test is defined in YAML with exact commands and cleanup steps that you can reference in your detection workflow. We used it when we wanted portable, repeatable tests we could run frequently. Limitation: tests are structurally isolated and don't chain into campaigns.
- MITRE Caldera for full attack chains and operator emulation. We used it for multi-step operations that simulated an adversary moving through the network. SANS documents a specific workflow of translating red team findings into Caldera abilities, then running those abilities as continuous campaigns. Caldera carries more operational overhead than atomic tests, so we ran it only when we needed to test detection across a sequence of actions.
- VECTR for tracking test cards, results, and detection coverage across time. It doesn't execute attacks, which is the point. Its job is downstream aggregation: every test case maps to an ATT&CK technique at the procedure level, captures exact commands, records Detected/Not Detected outcomes, and renders a heatmap where grey cells are untested techniques. That heatmap became our visible detection backlog.
- Stratus Red Team for cloud-native TTPs that the on-prem tools don't cover. As creator Christophe Tafani-Dereeper explained, it is "focused on emulating common attack techniques in cloud environments." It is designed to fit scripted testing workflows and supports major cloud and Kubernetes environments.
The CI-style loop ties these together. The scythe-io/sigma-regression-testing repository is a public implementation in which detection rules are written in Sigma YAML, stored in Git, validated by GitHub Actions on commits and pull requests, automatically converted to Splunk SPL, and gated so deployment is blocked when validation fails.
Rules that lack a corresponding Atomic Red Team test mapping sit in an unmapped_rules/ directory, which makes coverage debt visible in the repository structure itself.
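The same coverage-debt idea can be expressed as a small check. The sketch below is not the repository's actual script; it is a hypothetical illustration that assumes each Sigma rule has already been parsed into a dictionary and that technique tags follow the usual `attack.tNNNN` or `attack.tNNNN.NNN` form. The function names and the sample rule file names are invented for this example.

```python
import re
from typing import Dict, Iterable, List, Set

# Matches technique tags such as "attack.t1110.003"; tactic tags are ignored.
TECHNIQUE_TAG = re.compile(r"^attack\.(t\d{4}(?:\.\d{3})?)$", re.IGNORECASE)


def rule_techniques(rule: dict) -> Set[str]:
    """Return the ATT&CK technique IDs a parsed rule is tagged with."""
    found = set()
    for tag in rule.get("tags") or []:
        match = TECHNIQUE_TAG.match(str(tag))
        if match:
            found.add(match.group(1).upper())
    return found


def unmapped_rules(rules: Dict[str, dict], tested_techniques: Iterable[str]) -> List[str]:
    """List rules with no technique in common with the set of available tests."""
    tested = {t.upper() for t in tested_techniques}
    return sorted(
        name for name, rule in rules.items()
        if not (rule_techniques(rule) & tested)
    )


# Illustrative input: file names and tags are placeholders.
rules = {
    "kerberoast.yml": {"tags": ["attack.credential_access", "attack.t1558.003"]},
    "spray.yml": {"tags": ["attack.t1110.003"]},
    "untagged.yml": {},
}
debt = unmapped_rules(rules, tested_techniques=["T1558.003"])
# debt == ["spray.yml", "untagged.yml"]
```

A rule with no technique tag at all is reported as unmapped, which is deliberate: an untagged rule can't be tied to any test, so it counts as coverage debt.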
Every failed assertion creates a ticket in the detection backlog, tickets stay open until the detection fires, and regression tests catch when a working detection breaks because a schema changed, a parser or API contract drifted, or another configuration change affected behavior.
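The ticket lifecycle in that sentence is simple enough to sketch as a state machine. This is a hypothetical in-memory model, not an integration with any real ticketing system: the `Backlog` class, its `record` method, and the returned status strings are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Backlog:
    # card_id -> why the ticket was opened ("gap" or "regression")
    open_tickets: Dict[str, str] = field(default_factory=dict)
    # card_id -> result of the most recent run
    last_result: Dict[str, bool] = field(default_factory=dict)

    def record(self, card_id: str, fired: bool) -> str:
        """Apply one scheduled test result and report what happened to the ticket."""
        previously_fired = self.last_result.get(card_id)
        self.last_result[card_id] = fired

        if fired:
            if card_id in self.open_tickets:
                del self.open_tickets[card_id]
                return "closed"        # the detection fires, so the ticket can close
            return "pass"

        if card_id in self.open_tickets:
            return "still-open"        # tickets stay open until the detection fires

        # A detection that worked on the last run and fails now is a regression
        # (for example, after a schema or parser change); otherwise it is a new gap.
        reason = "regression" if previously_fired else "gap"
        self.open_tickets[card_id] = reason
        return "opened:" + reason
```

Keeping the previous result per card is what lets a scheduled run distinguish "never detected" from "used to detect and broke," which usually go to different owners.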
This is a different operating model from year one, not the same model with better tooling.
What we'd run on day one if we were starting over
Year two's playbook, compressed into the first week:
- Pick three threats from the actual threat model, not from a "top ATT&CK techniques" list: Write the test cards for those three before running a single exercise. Each card names the exact command, the expected telemetry source, and the detection assertion.
- Audit instrumentation before testing detections: If the EDR isn't forwarding command-line arguments to the SIEM, or if the domain controller isn't onboarded, that's a P0 infrastructure ticket. It has nothing to do with red-blue coordination and everything to do with whether the pipeline can function.
- Make detection assertions explicit and binary: A test card says "EDR generates alert with rule ID X within Y minutes of command Z." Alfie Champion's Practical Purple Teaming emphasizes validating whether activity was detectable and whether the resulting alerting is accurate and useful. An alert that fires but misidentifies the technique still fails the assertion.
- Version-control everything: Test cards, expected telemetry, detection rules. If it lives in a Confluence page, it dies in a Confluence page. NVISO Labs documents GitHub Flow for detection engineering: branches named detection/brute-force-rules, PR reviews gate merge, deployment triggers automatically.
- Connect purple findings directly to the detection backlog on day one: Not through a quarterly review. Through a ticket. SCYTHE's PTEF v4 describes a detection engineering lifecycle and tracking measures such as time to detect. A quarterly review cycle struggles to provide the turnaround a validation pipeline is meant to support.
- Run small and often: A weekly five-test sprint beats a quarterly fifty-test exercise. The detection surface changes continuously, so smaller recurring tests give you more chances to catch drift before the next large exercise.
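The instrumentation audit from the list above is the step most worth automating first, because it has to pass before any detection test means anything. The sketch below is a minimal illustration under assumptions: the asset names, the telemetry labels, and the `instrumentation_gaps` function are hypothetical, and in practice the `observed` mapping would come from whatever inventory your SIEM or EDR can export.

```python
from typing import Dict, List, Set, Tuple


def instrumentation_gaps(
    required: Dict[str, Set[str]],
    observed: Dict[str, Set[str]],
) -> List[Tuple[str, str]]:
    """Return (asset, missing telemetry source) pairs, sorted for stable output."""
    gaps: List[Tuple[str, str]] = []
    for asset in sorted(required):
        # An asset absent from `observed` is treated as sending nothing at all.
        missing = required[asset] - observed.get(asset, set())
        gaps.extend((asset, source) for source in sorted(missing))
    return gaps


# Illustrative inventory; host names and source labels are placeholders.
required = {
    "dc01": {"edr", "process_command_line"},
    "adsync01": {"edr"},
}
observed = {
    "dc01": {"edr"},
}
p0_tickets = instrumentation_gaps(required, observed)
# p0_tickets == [("adsync01", "edr"), ("dc01", "process_command_line")]
```

Each pair in the result is an infrastructure ticket rather than a detection ticket, and any test card that depends on a missing source should be skipped, not counted as a failed detection.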
If your purple program is built around exercises, treat year one as the cost of figuring out the model is wrong. I paid that cost, and I'd rather you didn't. Year two doesn't require a bigger budget.
It requires a different unit of work: the test card with an explicit detection assertion, version-controlled, CI-tested, and tracked as a ticket that fails until the detection fires. Anything looser produces the slide deck I spent year one filing.
Frequently asked questions about purple teaming
What is purple teaming in cybersecurity, in practice?
Purple teaming is the continuous detection validation pipeline that turns adversary TTPs into tested, asserted, version-controlled detections. The textbook definition (red team executes, blue team observes, both share notes) describes the collaboration.
The operational definition describes the output: shipped detections that are regression-tested on a schedule and tracked as code artifacts, not slide-deck findings.
How is purple teaming different from red teaming and blue teaming?
Red teaming tests whether attacks succeed. Blue teaming tests whether the SOC responds. Purple teaming tests whether specific detection logic fires against specific adversary procedures and produces the correct alert.
The red and blue teams are inputs, and the detection assertion is the output. Without that assertion, you're running a demo.
How often should you run purple team exercises?
The exercise cadence question is the wrong frame. Mature programs run many small, automated test scenarios rather than a few large exercises per year. The SANS "Always-On Purple Team" RSAC series emphasizes continuous, automated validation over one-time or periodic purple team exercises.
Smaller recurring test sprints catch detection drift sooner than quarterly exercises do.
What's the difference between purple teaming and detection engineering?
Detection engineering builds the rules. Purple teaming validates whether those rules fire against real adversary procedures in your actual environment with your actual telemetry. Purple teaming without detection engineering produces documented gaps; detection engineering without purple teaming produces untested rules.