What we got wrong in our first 100 detections
I shipped our hundredth detection, tagged it to MITRE ATT&CK T1053.005, and turned the cell green in our ATT&CK Navigator heatmap. By Monday, fourteen of those production rules had silently stopped working and three more rested on week-stale indicators. The heatmap stayed green through all of it.
Those hundred detections were a fidelity problem dressed up as a coverage win. We counted techniques covered when we should have counted detections that actually fire.
Every failure here is one I caused or caught too late, all tracing to one root: we treated detection engineering as rule writing and skipped maintenance.
In brief:
- Coloring an ATT&CK cell green after writing one rule for one procedure gave us coverage on paper and blind spots in production.
- Detections built on hash values and IP addresses decayed quickly, consistent with David Bianco's Pyramid of Pain, which treats those indicators as easy for adversaries to change.
- We shipped rules we never tested against real attacker telemetry, and a broken log pipeline hid the gap for months.
- Exclusion lists grew until they hollowed out the detections they were meant to preserve.
- At any given time, a share of production security information and event management (SIEM) rules is broken, according to CardinalOps production data, and ours were no exception.
The ATT&CK heatmap went green before the coverage was real
The coverage win cracked on rule twelve, a single Sigma rule matching Mimikatz arguments and mapped to T1003 (OS Credential Dumping). The cell went green and I moved on. I missed that an adversary dumping LSASS directly through comsvcs.dll MiniDump would bypass the rule entirely: one procedure was covered, and the green cell hid the rest.
Enterprise SIEMs leave large ATT&CK gaps despite ingesting the data to cover far more, as CardinalOps production data shows and a peer-reviewed USENIX Security 2024 study confirms for commercial vendor rulesets. We chased a green heatmap with the wrong denominator. Settle the measurement model first: count a technique as covered only when its detections are validated across procedures and tested against real telemetry.
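To make the denominator problem concrete, here is a minimal sketch that scores one technique two ways: the way a heatmap cell turns green, and as the fraction of known procedures with a rule that fired in testing. The Procedure fields, the coverage helper, and the two-entry inventory are hypothetical and chosen for illustration; a real procedure list for T1003 would be longer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Procedure:
    technique: str   # ATT&CK technique ID
    name: str        # one concrete way to perform the technique
    has_rule: bool   # a rule exists that targets this procedure
    validated: bool  # that rule fired on emulated telemetry


def coverage(procedures, technique):
    """Return (naive, validated) coverage for one technique.

    naive: 1.0 if any rule exists at all (how a heatmap cell turns green).
    validated: fraction of known procedures with a rule that fired in testing.
    """
    procs = [p for p in procedures if p.technique == technique]
    if not procs:
        return 0.0, 0.0
    naive = 1.0 if any(p.has_rule for p in procs) else 0.0
    validated = sum(p.has_rule and p.validated for p in procs) / len(procs)
    return naive, validated


# Hypothetical inventory: illustrative only, not an exhaustive procedure list.
inventory = [
    Procedure("T1003", "Mimikatz command-line arguments", True, True),
    Procedure("T1003", "comsvcs.dll MiniDump of LSASS", False, False),
]

naive, validated = coverage(inventory, "T1003")
print(f"T1003 naive={naive:.0%} validated={validated:.0%}")
```

The naive score is what the heatmap shows; the validated score is the number worth tracking.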
We built detections on indicators that decayed in weeks
Depth wasn't the only thing the heatmap hid. Twelve of those rules ran on indicators of compromise (IOCs): hashes, IPs, and domains pulled from feeds and incident reports. David Bianco's Pyramid of Pain explains why that fails, since hashes sit at the base where any file change creates a new one and IPs are barely harder to rotate.
Most IOCs lose value within hours or days of first sighting, as an Internet Storm Center diary notes, so our hash rules were dead within a week. The threat hunting program we built later pushed us toward behavioral detections at the tactics, techniques, and procedures (TTP) level, where evasion costs the adversary most.
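One way to keep indicator-based rules from lingering is to give each indicator type a review window and flag rules that outlive it. The sketch below assumes a simple rule record with a type and a first-seen date; the window lengths, field names, and rule names are hypothetical and should be tuned to your own feeds.

```python
from datetime import date, timedelta

# Assumed review windows per indicator type (hypothetical values).
MAX_AGE = {
    "hash": timedelta(days=7),
    "ip": timedelta(days=7),
    "domain": timedelta(days=30),
}


def stale_rules(rules, today):
    """Return names of rules whose indicator is older than its review window.

    Rules of any other type (for example "behavior") have no window here
    and are never flagged.
    """
    stale = []
    for rule in rules:
        limit = MAX_AGE.get(rule["type"])
        if limit is not None and today - rule["first_seen"] > limit:
            stale.append(rule["name"])
    return stale


# Hypothetical rule inventory.
rules = [
    {"name": "dropper_hash", "type": "hash", "first_seen": date(2024, 1, 1)},
    {"name": "c2_domain", "type": "domain", "first_seen": date(2024, 1, 1)},
    {"name": "lsass_minidump", "type": "behavior", "first_seen": date(2024, 1, 1)},
]

print(stale_rules(rules, today=date(2024, 1, 15)))
```

Behavioral rules are deliberately exempt from expiry in this sketch; they need validation rather than a calendar.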
We shipped rules we never tested against attacker telemetry
Behavioral detections only help if they fire, and rule forty-seven taught me they don't always. It was an analytic for net view execution mapped to T1135 (Network Share Discovery); the logic matched the right event IDs, so I approved and shipped it. Four months later, an Atomic Red Team exercise showed it had never fired.
The host's telemetry pipeline was broken and its logs were never exported, exactly the kind of gap MITRE built adversary emulation to surface. We wired validation into deployment with Palantir's ADS framework, which won't let a detection ship until its Validation section is done. No true positive in a test environment, no ship.
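The "no true positive, no ship" rule can be sketched as a small deployment gate. This is an illustration under stated assumptions, not part of the ADS framework itself: the ready_to_ship function and the validation, steps, and true_positive_observed field names are hypothetical.

```python
def ready_to_ship(detection):
    """Return (ok, reasons) for a detection record.

    The record layout is hypothetical: a dict with an optional "validation"
    entry holding the test steps and whether a true positive was observed.
    """
    reasons = []
    validation = detection.get("validation") or {}
    if not validation.get("steps"):
        reasons.append("validation steps missing")
    if not validation.get("true_positive_observed"):
        reasons.append("no true positive observed in test environment")
    return (not reasons, reasons)


# Hypothetical draft: the logic looks right, but nobody has seen it fire.
draft = {
    "name": "net_view_share_discovery",
    "technique": "T1135",
    "validation": {
        "steps": ["run net view on a test host"],
        "true_positive_observed": False,
    },
}

ok, reasons = ready_to_ship(draft)
print(ok, reasons)
```

A check like this would run in review or CI, so a rule that has never fired on test telemetry cannot be approved on logic alone.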
We buried false positives under exclusions
A rule that ships and fires can still fail quietly, by drowning in exclusions. One of ours flagged suspicious PowerShell and fired 400 times a day, so an analyst excluded the three noisiest service accounts, another engineer excluded a monitoring tool, and the rule kept accreting exclusions for documented automation.
Each exclusion was justified on its own, but six weeks later the rule covered nothing, a sign that we had written the wrong detection. As Ryan McGeehan has written, poor practices generate noise that buries teams. We scrapped it and put detection engineers on call, since answering your own rules' alerts at 2 a.m. cures noisy detections fast.
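Exclusion creep is easier to catch when it is measured. A minimal sketch, assuming you can replay the events a rule's base logic matches and express each exclusion as a predicate: compute the share of matches the exclusions suppress and flag the rule for redesign past a threshold. The account names, the monitor_agent.exe parent, and the 0.5 threshold are hypothetical.

```python
def exclusion_ratio(events, exclusions):
    """Fraction of rule-matching events that at least one exclusion drops."""
    if not events:
        return 0.0
    suppressed = sum(1 for e in events if any(ex(e) for ex in exclusions))
    return suppressed / len(events)


# Hypothetical exclusions, each written as a predicate over one event.
exclusions = [
    lambda e: e["user"] in {"svc_backup", "svc_deploy", "svc_patch"},
    lambda e: e["parent"] == "monitor_agent.exe",
]

# Hypothetical sample of events the rule's base logic matched.
events = [
    {"user": "svc_backup", "parent": "taskeng.exe"},
    {"user": "svc_deploy", "parent": "taskeng.exe"},
    {"user": "alice", "parent": "monitor_agent.exe"},
    {"user": "bob", "parent": "winword.exe"},
]

REVIEW_THRESHOLD = 0.5  # assumed cutoff; pick one that fits your environment

ratio = exclusion_ratio(events, exclusions)
if ratio > REVIEW_THRESHOLD:
    print(f"exclusions suppress {ratio:.0%} of matches: redesign the rule")
```

Tracking this ratio over time shows when a rule is being tuned into silence rather than fixed.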
Dead detections looked exactly like working ones
An exclusion at least leaves a trace; the worst failures left none. The fourteen broken rules went unnoticed for weeks, failing for ordinary reasons: a log source stopped forwarding, a schema update renamed a field, and PowerShell 6+ moved to pwsh.exe, so a rule watching only powershell.exe went silent. None threw an error.
Rules that never fire produce no alerts and consume no analyst time, so they stay invisible to every standard SOC metric: the dashboard stays green while the capability rots. CardinalOps reports find a meaningful share of production rules never fire at all. We now monitor pipeline health itself and rerun known-bad simulations against it, in line with SCYTHE Labs' continuous validation.
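A rough sketch of what monitoring for silent rules could look like, assuming each rule record carries the timestamp of the last event seen from its log source and the last time the rule fired. The rule_health function, field names, and both time limits are hypothetical. A "silent" result is not proof of breakage, only a prompt to rerun a known-bad simulation.

```python
from datetime import datetime, timedelta


def rule_health(rule, now,
                max_silence=timedelta(days=30),
                max_source_lag=timedelta(hours=24)):
    """Classify a rule as "source_down", "silent", or "ok".

    source_down: the log source has not delivered events recently.
    silent: the source is alive but the rule has not fired within max_silence.
    """
    if now - rule["source_last_event"] > max_source_lag:
        return "source_down"
    last_fired = rule.get("last_fired")
    if last_fired is None or now - last_fired > max_silence:
        return "silent"
    return "ok"


now = datetime(2024, 6, 1, 12, 0)

# Hypothetical rule records.
rules = [
    {"name": "ps_encoded",
     "source_last_event": now - timedelta(minutes=5),
     "last_fired": now - timedelta(days=90)},
    {"name": "share_discovery",
     "source_last_event": now - timedelta(days=3),
     "last_fired": None},
    {"name": "lsass_access",
     "source_last_event": now - timedelta(minutes=1),
     "last_fired": now - timedelta(days=2)},
]

report = {r["name"]: rule_health(r, now) for r in rules}
print(report)
```

Checking the source first matters: a rule on a dead log source is broken regardless of its firing history.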
Our alerts fired without the context to act on them
Even detections that fired often arrived without the context needed to act on them. Lack of context topped the SOC challenges in the SANS 2023 SOC Survey, and I saw it in ours: alerts carried a technique ID and process name but no asset owner, no baseline, and no sign of whether it was the first hit or the hundredth. Analysts burned triage time decoding alerts before responding.
Palantir's ADS framework answers this: a Technical Context field gives the responder a self-contained reference, and a Response field puts triage steps in the detection, not a separate runbook. I rebuilt our templates so every rule ships with a baseline, asset owner, and first three investigation steps. Detections that graduate to alerting also carry an incident response feedback loop that captures what the analyst needed.
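The missing context can be attached at alert time. A minimal sketch, assuming an asset inventory keyed by host and a counter of prior hits per rule and host; the enrich function, field names, rule ID, and host name are all hypothetical.

```python
def enrich(alert, asset_owners, prior_hits):
    """Return a copy of the alert with owner and hit-history context added."""
    key = (alert["rule_id"], alert["host"])
    seen = prior_hits.get(key, 0)
    return {
        **alert,
        "asset_owner": asset_owners.get(alert["host"], "unknown"),
        "prior_hits": seen,
        "first_hit": seen == 0,
    }


# Hypothetical inputs.
alert = {"rule_id": "R-047", "technique": "T1135",
         "host": "fin-ws-12", "process": "net.exe"}
owners = {"fin-ws-12": "finance-it"}
history = {("R-047", "fin-ws-12"): 99}

enriched = enrich(alert, owners, history)
print(enriched["asset_owner"], enriched["prior_hits"], enriched["first_hit"])
```

With those three fields present, the analyst can tell the hundredth hit from the first without leaving the alert.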
Every one of these was a fidelity problem
Each of these failures was the same one in different clothes, a fidelity problem rather than a coverage one. I worked through all of them in real incidents, on real shifts, with attackers moving while I searched for rule docs that didn't exist. The first hundred taught me detection engineering lives or dies on fidelity and lifecycle.
The team that builds 200 detections and validates ten loses to the team that builds 50 and maintains them. The detection that counts is the rule in production: tested against telemetry, enriched with context, monitored for drift, owned by an engineer on call. The next hundred we ship will be fewer, slower, better.
Frequently asked questions about detection engineering lessons
Where should a small team start with detection engineering?
Start with telemetry, not rules, because you can only detect what you reliably log, and our most wasted early effort went into writing detections for data we were not actually collecting end to end. Confirm your endpoint, identity, cloud, and network logging is complete and consistent first, then build detections for the attacks you can realistically see in that data. Get that foundation right and the rule writing gets far easier, because you stop debugging detections that were never going to fire.
How many detection rules does a SOC actually need?
Fewer than most backlogs assume, because the count is the wrong target. One of our earliest mistakes was treating rule volume as progress while half of what we shipped quietly rotted. Build a short, prioritized set that maps to the techniques you actually see in your telemetry, then add new detections only as fast as you can validate and maintain them, since a small library you trust beats a large one nobody has audited in a year.
Should you write your own detection rules or use Sigma and vendor rules?
Most of your library should come from maintained vendor and community sources like Sigma, with custom detections reserved for the behavior in your own environment that off-the-shelf rules cannot see. The lesson we learned the hard way is that every homegrown rule is something you then own forever, including its false positives and its decay, so rebuilding what a maintained source already covers is rarely worth the long-term cost. Write your own where you have real signal nobody else has, and borrow the rest.
Do small teams really need detection-as-code?
Yes, sooner than most small teams expect, even though you do not need a heavyweight pipeline on day one. The real value is putting detection logic in version control, so every change is reviewed and revertable, which is what saved us when a rule edit quietly broke coverage. A single repository with pull requests and a basic test that confirms a rule still fires gives you most of that benefit long before full CI/CD.
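As an illustration of the "basic test that confirms a rule still fires", here is a sketch with the rule written as a plain Python predicate and a handful of known-bad sample events stored next to it in the repository. The rule logic, event fields, and sample events are hypothetical; a real setup would more likely test Sigma rules or SIEM queries against recorded telemetry.

```python
# Hypothetical known-bad samples kept in the repository beside the rule.
KNOWN_BAD = [
    {"image": r"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe",
     "command_line": "powershell.exe -enc SQBFAFgA"},
    {"image": r"C:\Program Files\PowerShell\7\pwsh.exe",
     "command_line": "pwsh.exe -enc SQBFAFgA"},
]


def rule_encoded_powershell(event):
    """Match encoded-command use from either PowerShell executable name."""
    image = event["image"].lower().rsplit("\\", 1)[-1]
    is_powershell = image in {"powershell.exe", "pwsh.exe"}
    return is_powershell and " -enc" in event["command_line"].lower()


def test_rule_still_fires():
    """Fail the build if any known-bad sample no longer matches the rule."""
    missed = [e for e in KNOWN_BAD if not rule_encoded_powershell(e)]
    assert not missed, f"rule missed {len(missed)} known-bad sample(s)"


test_rule_still_fires()
print("rule fires on all known-bad samples")
```

Run on every pull request, a test like this turns a silent regression, such as dropping pwsh.exe from the match, into a failed build.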
How often should you review and retire detections?
Put detections on a recurring review cycle (quarterly works for most teams) and treat retirement as a normal outcome rather than an admission of failure. A rule that no longer fires against current telemetry, or that has not produced a true positive in months while generating noise, is a candidate to cut rather than tune forever. The library you never prune fills with dead and low-value rules that slowly erode trust in every alert it produces.