Active Directory Incident Response: Containment & Recovery
When Active Directory (AD) is compromised, the attacker isn’t just “in one server” — they often have the keys to your identity kingdom. The difference between a chaotic scramble and a controlled recovery is a clear, tested response playbook.
- What an AD compromise playbook is (and isn’t)
- Core principles for AD incident response
- Roles & decision rights (RACI-lite)
- Preparation phase (before you’re breached)
- Triggers & severity levels
- First 60 minutes: triage checklist
- Containment actions (without breaking the business)
- Eradication: remove persistence and regain trust
- Recovery: rebuild, restore, validate
- Post-incident hardening and lessons learned
- Copy/paste templates: runbook structure & comms
- FAQ
What an AD compromise playbook is (and isn’t)
A response playbook is a decision and action map for a specific scenario: “We believe AD is compromised.” It defines who decides, what to do, in what order, how to validate, and how to communicate.
It is not a generic incident response policy document. It’s the operational “muscle memory” your team follows under pressure.
Core principles for AD incident response
1) Preserve evidence while you contain
Don’t wipe the scene. Capture logs, volatile context, and a timeline. You can contain aggressively, but do it with a plan that keeps proof intact.
2) Assume identity infrastructure is “tainted”
In a real AD compromise, trust is the problem. Your playbook must define how you regain trust: clean admin workstations, known-good accounts, controlled password resets, and validated restores.
3) Minimize blast radius fast
Stop lateral movement, credential reuse, and replication abuse. Tighten control over privileged groups and permissions. (If you need a refresher on how AD permissions work, see Active Directory permissions explained.)
4) Prefer “known-good admin pathways”
Use dedicated admin workstations, separate accounts, and a break-glass process that’s pre-approved and practiced.
Roles & decision rights (RACI-lite)
Your playbook should name roles, not individuals (people go on leave). At minimum:
| Role | Owns | Key Decisions |
|---|---|---|
| Incident Commander | Orchestration, priorities, approvals | Severity level, containment scope, business downtime tradeoffs |
| AD Lead | Directory operations | Account disable/reset strategy, DC isolation, GPO emergency changes |
| Security Lead / SOC | Detection, investigation, threat intel | IOCs, scope confirmation, monitoring rules |
| Forensics (internal or vendor) | Evidence and timeline | What to image, what to collect, chain-of-custody |
| Comms / Legal / HR | Messaging and regulatory steps | Notifications, employee guidance, disclosure requirements |
| App/Infra Owners | Dependent systems | Service account rotation, outage coordination, validation |
Preparation phase (before you’re breached)
The best response playbook is 70% preparation. Here’s what to pre-stage so you can move fast without improvising.
Preparation checklist
- Asset & identity inventory: domain controllers, AD sites, trusts, Tier-0 assets, PKI/AD CS, ADFS, sync tools, admin workstations.
- Logging baseline: security logs retention, centralized forwarding, and time sync across DCs and critical servers.
- Detection tooling: confirm coverage for identity threat detections (for example, Microsoft Defender for Identity). Start with Microsoft Defender for Identity: overview, and then validate event collection for Defender for Identity.
- Break-glass plan: offline-stored credentials, MFA requirements (where applicable), and step-by-step access procedure.
- Privileged access model: separate admin accounts, tiering, and “no email/web browsing” from admin workstations.
- Backups: validated System State backups for DCs, plus documented restore steps and test results.
- Credential rotation map: what to rotate first (KRBTGT, DA accounts, service accounts, app secrets, certificates), and how; a rotation-order sketch follows this checklist.
- Communication templates: internal “do/do not” guidance (e.g., don’t reboot suspected systems; report suspicious prompts).
- Tabletop exercises: at least quarterly for Tier-0 compromise.
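The credential rotation map above is easiest to act on under pressure when the order is written down as data rather than tribal knowledge. Below is a minimal sketch of one way to capture it; the phase numbers, owners, and item lists are illustrative placeholders, not a prescribed standard.

```python
# rotation_map.py - illustrative ordering for a credential rotation map.
# Phase order, owners, and item lists are placeholders; adapt them to your environment.
from dataclasses import dataclass

@dataclass
class RotationPhase:
    order: int      # lower numbers rotate first
    name: str
    items: list     # what gets rotated in this phase
    owner: str      # a role from the RACI table, not a person
    notes: str = ""

ROTATION_MAP = [
    RotationPhase(1, "Tier-0 privileged accounts",
                  ["Domain Admins", "Enterprise Admins", "break-glass accounts"],
                  owner="AD Lead"),
    RotationPhase(2, "KRBTGT",
                  ["krbtgt (two-step reset, replication-aware)"],
                  owner="AD Lead",
                  notes="Follow the documented two-step procedure; do not rush the second reset."),
    RotationPhase(3, "Tier-0 service accounts and app secrets",
                  ["sync/federation service accounts", "application secrets"],
                  owner="App/Infra Owners"),
    RotationPhase(4, "Server local admins and certificates",
                  ["local administrator passwords", "certificates issued in the incident window"],
                  owner="App/Infra Owners"),
]

def print_rotation_plan(phases):
    """Print the rotation order so it can be pasted into the incident log."""
    for phase in sorted(phases, key=lambda p: p.order):
        print(f"Phase {phase.order}: {phase.name} (owner: {phase.owner})")
        for item in phase.items:
            print(f"  - {item}")
        if phase.notes:
            print(f"  note: {phase.notes}")

if __name__ == "__main__":
    print_rotation_plan(ROTATION_MAP)
```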
If your team needs foundational clarity on how authentication flows work during investigations, revisit NTLM and Kerberos Authentication Protocols Explained. It makes scoping issues like ticket abuse and credential replay much easier.
Triggers & severity levels
Define what “AD compromise” means in your environment. Otherwise you’ll debate severity while the attacker moves.
| Severity | Definition (examples) | Default Response |
|---|---|---|
| SEV-1 | Evidence of Domain Admin / Enterprise Admin theft, suspicious DC activity, replication abuse indicators, widespread lateral movement | Activate full playbook, isolate Tier-0, exec/legal comms, rotate critical secrets |
| SEV-2 | Compromised admin workstation, suspicious GPO change, credential dumping suspected, multiple failed logons with privilege context | Targeted containment, fast investigation, elevate if scope expands |
| SEV-3 | Single account compromise or localized host compromise without privileged indicators | Standard IR with AD-focused checks |
Include operational triggers too (e.g., a high-confidence Microsoft Defender for Identity alert, unexpected account lockouts at scale). For lockout investigations, see Account Lockout Event ID: how to find account lockouts.
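As one concrete way to check the "lockouts at scale" trigger, the sketch below counts recent Event ID 4740 (account lockout) events using the built-in wevtutil tool. It assumes it runs with rights to read the Security log on a domain controller (ideally the PDC emulator, which processes lockouts); the threshold and event count are illustrative.

```python
# lockout_trigger.py - rough check for "account lockouts at scale" (Event ID 4740).
# Assumes local read access to the Security log on a DC; the threshold is illustrative.
import subprocess
import xml.etree.ElementTree as ET
from collections import Counter

LOCKOUT_THRESHOLD = 10  # alert if more than this many lockouts in the queried window

def recent_lockouts(max_events=200):
    """Return a Counter of locked-out account names from recent 4740 events."""
    xml_out = subprocess.run(
        ["wevtutil", "qe", "Security",
         "/q:*[System[(EventID=4740)]]",
         f"/c:{max_events}", "/rd:true", "/f:xml"],
        capture_output=True, text=True, check=True,
    ).stdout
    # wevtutil emits a flat sequence of <Event> elements, so wrap them in a root node.
    root = ET.fromstring(f"<Events>{xml_out}</Events>")
    locked = Counter()
    for event in root:
        for data in event.iter("{*}Data"):
            if data.get("Name") == "TargetUserName" and data.text:
                locked[data.text] += 1
    return locked

if __name__ == "__main__":
    lockouts = recent_lockouts()
    total = sum(lockouts.values())
    print(f"{total} lockouts across {len(lockouts)} accounts")
    for account, count in lockouts.most_common(10):
        print(f"  {account}: {count}")
    if total > LOCKOUT_THRESHOLD:
        print("TRIGGER: lockouts at scale - escalate per the severity table")
```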
First 60 minutes: triage checklist
0–15 minutes: stabilize
- Declare incident severity and assign an Incident Commander.
- Start an incident log (time-stamped decisions, actions, and owners); a minimal logging sketch follows this list.
- Confirm safe communications channel (out-of-band if you suspect email compromise).
- Freeze non-essential changes (pause planned GPO/AD changes, deployments, and admin projects).
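For the incident log, an append-only, UTC-timestamped file is enough to start with; the point is to capture who decided what, and when. A minimal sketch (the file name and field names are illustrative, not a required schema):

```python
# incident_log.py - minimal append-only incident log (one JSON object per line).
# File name and field names are illustrative; adapt to your tooling.
import json
from datetime import datetime, timezone

LOG_PATH = "incident-log.jsonl"

def log_entry(action, owner, decision=None, notes=None):
    """Append a time-stamped entry: what was done, by which role, and why."""
    entry = {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "action": action,
        "owner": owner,          # role, not a person, per the RACI table
        "decision": decision,    # e.g. "IC approved Tier-0 isolation"
        "notes": notes,
    }
    with open(LOG_PATH, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")

# Example usage during triage:
# log_entry("Declared SEV-1", owner="Incident Commander",
#           decision="Activate full AD compromise playbook")
```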
15–30 minutes: scope and evidence
- Identify suspected entry points (phished admin, compromised endpoint, exposed service, vendor access).
- Pull high-value logs centrally (DC Security logs, workstation logs, VPN/proxy, EDR telemetry, identity detections).
- Preserve key hosts (likely compromised admin workstation(s), jump boxes, DCs if indicated).
- Check for privileged group changes: Domain Admins, Enterprise Admins, Schema Admins, Built-in Administrators.
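To make the privileged-group check above repeatable, the sketch below pulls the current members of the key groups over LDAP and diffs them against a baseline saved during preparation. It uses the third-party ldap3 package; the server name, credentials, and base DN are placeholders, and in production you would bind over LDAPS or a signed/sealed connection.

```python
# priv_group_check.py - snapshot privileged group members and diff against a baseline.
# Requires the third-party "ldap3" package; server, credentials, and DNs are placeholders.
import json
from ldap3 import Server, Connection, NTLM, SUBTREE

BASE_DN = "DC=example,DC=com"
GROUPS = ["Domain Admins", "Enterprise Admins", "Schema Admins", "Administrators"]
BASELINE_FILE = "priv-groups-baseline.json"   # saved earlier, during preparation

def current_members(conn):
    """Return {group_name: sorted list of member DNs} for the watched groups."""
    members = {}
    for group in GROUPS:
        conn.search(BASE_DN, f"(&(objectClass=group)(cn={group}))",
                    search_scope=SUBTREE, attributes=["member"])
        if conn.entries:
            attrs = conn.entries[0].entry_attributes_as_dict
            members[group] = sorted(attrs.get("member", []))
        else:
            members[group] = []
    return members

if __name__ == "__main__":
    server = Server("dc01.example.com")   # prefer LDAPS in production
    conn = Connection(server, user="EXAMPLE\\ir-reader", password="...",
                      authentication=NTLM, auto_bind=True)
    now = current_members(conn)
    with open(BASELINE_FILE, encoding="utf-8") as handle:
        baseline = json.load(handle)
    for group in GROUPS:
        for dn in sorted(set(now[group]) - set(baseline.get(group, []))):
            print(f"UNEXPECTED member of {group}: {dn}")
```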
30–60 minutes: initial containment (minimal disruption)
- Disable or isolate the suspected compromised admin workstation(s) from the network (EDR containment preferred).
- Disable obviously malicious accounts and revoke sessions where possible.
- Block known malicious IPs/domains at perimeter and proxy.
- Start a “watchlist” for suspicious logons on DCs and Tier-0 systems.
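The watchlist itself can start as a simple allowlist comparison: any interactive or remote-interactive logon to a Tier-0 host by an account outside the approved admin set gets flagged. Below is a minimal sketch over already-parsed logon records; the field names, host list, and allowlist are illustrative, and in practice the records would come from forwarded 4624 events or your SIEM rather than the hard-coded samples.

```python
# dc_logon_watchlist.py - flag logons to Tier-0 hosts by accounts not on the approved list.
# Field names, hosts, and the allowlist are illustrative; feed it parsed 4624 events.
from dataclasses import dataclass

APPROVED_ADMINS = {"EXAMPLE\\adm-alice", "EXAMPLE\\adm-bob"}   # hardened-workstation admins only
TIER0_HOSTS = {"DC01", "DC02", "ADFS01", "PKI01"}
WATCHED_LOGON_TYPES = {2, 10}   # 2 = interactive, 10 = remote interactive (RDP)

@dataclass
class LogonRecord:
    account: str        # e.g. TargetUserName from Event ID 4624
    host: str           # computer where the logon occurred
    logon_type: int     # LogonType field from the event
    source: str         # workstation/IP the logon came from

def flag_suspicious(records):
    """Yield logons to Tier-0 hosts by accounts outside the approved admin set."""
    for rec in records:
        if (rec.host in TIER0_HOSTS
                and rec.logon_type in WATCHED_LOGON_TYPES
                and rec.account not in APPROVED_ADMINS):
            yield rec

if __name__ == "__main__":
    sample = [
        LogonRecord("EXAMPLE\\adm-alice", "DC01", 10, "ADMIN-WKS-01"),
        LogonRecord("EXAMPLE\\svc-backup", "DC02", 2, "FILESRV-07"),   # should be flagged
    ]
    for rec in flag_suspicious(sample):
        print(f"WATCHLIST: {rec.account} logged on to {rec.host} "
              f"(type {rec.logon_type}) from {rec.source}")
```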
Containment actions (without breaking the business)
Containment is where playbooks win. You need pre-approved options that scale from “surgical” to “emergency shutdown.”
Surgical containment (preferred first)
- Disable or reset the specific suspected accounts (prioritize privileged identities); see the disable sketch after this list.
- Remove suspicious principals from privileged groups (document everything).
- Apply temporary conditional access / sign-in restrictions in hybrid environments (where applicable).
- Restrict admin logon paths (only from hardened admin workstations / jump hosts).
- Increase auditing and forward logs centrally (don’t rely on local retention).
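For the account-disable step at the top of this list, the sketch below flips the ACCOUNTDISABLE bit (0x2) in userAccountControl over LDAP, which is the same change "Disable Account" makes in the admin console. It uses the ldap3 package again; the DN, server, and credentials are placeholders, and every disable should also be written to the incident log.

```python
# disable_account.py - disable a suspected-compromised account by setting the
# ACCOUNTDISABLE bit (0x2) in userAccountControl. Server, credentials, and DNs
# are placeholders; record each action in the incident log.
from ldap3 import Server, Connection, NTLM, MODIFY_REPLACE, BASE

ACCOUNTDISABLE = 0x2

def disable_account(conn, user_dn):
    """Read the current userAccountControl, set the disable bit, and write it back."""
    conn.search(user_dn, "(objectClass=user)", search_scope=BASE,
                attributes=["userAccountControl"])
    if not conn.entries:
        raise ValueError(f"Account not found: {user_dn}")
    uac = int(conn.entries[0].userAccountControl.value)
    if uac & ACCOUNTDISABLE:
        print(f"Already disabled: {user_dn}")
        return
    conn.modify(user_dn, {"userAccountControl": [(MODIFY_REPLACE, [uac | ACCOUNTDISABLE])]})
    print(f"Disabled {user_dn}: {conn.result['description']}")

if __name__ == "__main__":
    server = Server("dc01.example.com")                 # prefer LDAPS in production
    conn = Connection(server, user="EXAMPLE\\ir-admin", password="...",
                      authentication=NTLM, auto_bind=True)
    disable_account(conn, "CN=Suspect User,OU=Staff,DC=example,DC=com")
```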
Emergency containment (if SEV-1 expands)
- Isolate Tier-0 subnets (DCs, ADFS, PKI, identity tooling) with firewall rules.
- Disable non-essential trusts or restrict access if a partner domain is involved.
- Implement “deny by default” for privileged logons except from approved admin jump boxes.
- Temporarily disable high-risk legacy protocols/services if they are being abused.
Containment guardrails (avoid self-inflicted outages)
- Track business-critical service accounts before mass resets; a blind identity outage can be worse than the intrusion itself (see the inventory sketch after this list).
- Don’t remove permissions or groups without recording: “what changed, by whom, and why.”
- Prefer staged password resets (tiered) rather than “reset everyone immediately,” unless the situation truly requires it.
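One quick way to build the service-account list from the first guardrail is to enumerate accounts that carry a servicePrincipalName, since those are the ones most likely to break applications if reset blindly. A sketch using ldap3 (server, credentials, and the base DN are placeholders; paged search is omitted for brevity):

```python
# service_account_inventory.py - list accounts with a servicePrincipalName before any
# mass reset, so application owners can be consulted first. Placeholders as before;
# large directories will need a paged search, omitted here for brevity.
from ldap3 import Server, Connection, NTLM, SUBTREE

BASE_DN = "DC=example,DC=com"

def service_accounts(conn):
    """Return (sAMAccountName, SPN list) for every user object carrying an SPN."""
    conn.search(BASE_DN,
                "(&(objectCategory=person)(objectClass=user)(servicePrincipalName=*))",
                search_scope=SUBTREE,
                attributes=["sAMAccountName", "servicePrincipalName"])
    return [(entry.sAMAccountName.value, list(entry.servicePrincipalName.values))
            for entry in conn.entries]

if __name__ == "__main__":
    conn = Connection(Server("dc01.example.com"), user="EXAMPLE\\ir-reader",
                      password="...", authentication=NTLM, auto_bind=True)
    for name, spns in service_accounts(conn):
        print(f"{name}: {', '.join(spns)}")
```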
Eradication: remove persistence and regain trust
Attackers who compromise AD often establish persistence through backdoor accounts, delegated rights, scheduled tasks, GPO abuse, credential material, and ticket-based persistence.
Eradication checklist
- Hunt persistence: new users/groups, unexpected admin delegations, suspicious ACL changes, modified GPOs, new scripts in SYSVOL (see the query sketch after this checklist).
- Validate DC integrity: compare configurations to known-good baselines; check for unauthorized services, drivers, tasks.
- Rotate privileged credentials: Domain Admins, Enterprise Admins, server local admin passwords, break-glass credentials (in a controlled order).
- KRBTGT reset plan: document prerequisites and do it carefully. Most orgs do a two-step reset, waiting for replication between the two resets, because Kerberos keeps the two most recent KRBTGT keys valid; a single reset leaves tickets issued with the old key usable.
- Service account rotation: prioritize Tier-0 service accounts and any accounts with broad delegation.
- Certificate/PKI review: if AD CS is present, validate templates, enrollment permissions, and recent certificate issuance.
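The first hunt in the checklist (new users and unexpected admin delegations) maps to a couple of simple LDAP queries: accounts created since the suspected intrusion window, and accounts still carrying adminCount=1 (current or past protected-group membership). A sketch with ldap3 follows; the cutoff date, server, and credentials are placeholders, and a hit is a lead to investigate, not proof of compromise.

```python
# persistence_hunt.py - look for accounts created in the intrusion window and accounts
# marked adminCount=1 (current or past protected-group membership). Placeholders as before.
from ldap3 import Server, Connection, NTLM, SUBTREE

BASE_DN = "DC=example,DC=com"
CREATED_SINCE = "20240501000000.0Z"   # generalized time; set to the suspected intrusion start

def hunt(conn):
    # 1) Users created since the cutoff - review each against legitimate change records.
    conn.search(BASE_DN,
                f"(&(objectClass=user)(whenCreated>={CREATED_SINCE}))",
                search_scope=SUBTREE, attributes=["sAMAccountName", "whenCreated"])
    for entry in conn.entries:
        print(f"NEW ACCOUNT: {entry.sAMAccountName.value} created {entry.whenCreated.value}")

    # 2) adminCount=1 users - compare against the expected privileged-account list.
    conn.search(BASE_DN,
                "(&(objectClass=user)(adminCount=1))",
                search_scope=SUBTREE, attributes=["sAMAccountName"])
    for entry in conn.entries:
        print(f"ADMINCOUNT=1: {entry.sAMAccountName.value}")

if __name__ == "__main__":
    conn = Connection(Server("dc01.example.com"), user="EXAMPLE\\ir-reader",
                      password="...", authentication=NTLM, auto_bind=True)
    hunt(conn)
```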
Recovery: rebuild, restore, validate
Recovery is not “systems are back online.” Recovery is “we re-established trust in identity and can prove it.”
Recovery steps
- Choose recovery strategy: clean-in-place vs restore from backup vs rebuild domain (worst case).
- Restore carefully: validate backup integrity, isolate restore environment if possible, and keep forensic copies.
- Reintroduce DCs safely: ensure patching, secure configuration, and monitored rejoin to production.
- Re-enable services in tiers: Tier-0 first, then Tier-1 (servers), then Tier-2 (endpoints/users).
- Validate authentication flows: normal logons, service tickets, replication health, time sync, DNS health (see the validation sketch after this list).
- Continuous monitoring: heightened detections for weeks (not days). Attackers often attempt re-entry.
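For the validation step, the built-in tools already cover replication, DC health, and time sync; the sketch below simply runs them in sequence and captures the output in one file that can be attached to the incident record. It assumes it runs on a DC where repadmin, dcdiag, and w32tm are available; the command list is illustrative, not exhaustive, and the exit-code status is only a rough heuristic, so read the captured output.

```python
# recovery_validation.py - run standard health checks after DCs are reintroduced
# and capture the output for the incident record. Assumes it runs on a domain
# controller with repadmin, dcdiag, and w32tm available; the check list is illustrative.
import subprocess
from datetime import datetime, timezone

CHECKS = [
    ("replication summary", ["repadmin", "/replsummary"]),
    ("dc diagnostics",      ["dcdiag", "/q"]),           # /q prints errors only
    ("time sync status",    ["w32tm", "/query", "/status"]),
]

def run_checks(report_path="recovery-validation.txt"):
    with open(report_path, "w", encoding="utf-8") as report:
        report.write(f"Validation run (UTC): {datetime.now(timezone.utc).isoformat()}\n")
        for name, cmd in CHECKS:
            result = subprocess.run(cmd, capture_output=True, text=True)
            # Exit code is a heuristic only; the captured output is what matters.
            status = "OK" if result.returncode == 0 else f"EXIT {result.returncode}"
            report.write(f"\n=== {name} [{status}] ===\n")
            report.write(result.stdout or "")
            report.write(result.stderr or "")
            print(f"{name}: {status}")

if __name__ == "__main__":
    run_checks()
```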
Post-incident hardening and lessons learned
This is where you reduce the chance of a repeat event and shorten the next response cycle.
Security improvements
- Implement or tighten privileged tiering and admin workstation standards.
- Reduce standing privilege (JIT/JEA where possible) and review delegation regularly.
- Harden GPO change controls and audit SYSVOL changes.
- Improve identity detections and centralize key events.
Process improvements
- Update the playbook with what actually happened (not what you wish happened).
- Add “decision timestamps” to remove ambiguity next time.
- Run a tabletop focused on the weakest points found during the incident.
- Store final artifacts: timeline, IOCs, actions taken, and hardening backlog.
Copy/paste templates: runbook structure & comms
Playbook structure (recommended headings)
Playbook: AD Compromise
1. Scope and assumptions
2. Severity definitions and triggers
3. Roles and decision rights
4. Evidence collection (what/where/how long)
5. Triage (first 60 minutes)
6. Containment (surgical → emergency)
7. Eradication (persistence removal + secret rotation plan)
8. Recovery (restore/rebuild + validation gates)
9. Communications (internal/external templates)
10. Post-incident actions (hardening + backlog)
11. Appendix: critical contacts, systems, scripts, and checklists
Internal “all-hands” message (short)
Subject: Security incident – temporary access changes
We are responding to a security incident affecting identity systems.
You may notice password prompts or access restrictions while we investigate.
Do not reboot devices if you see unusual login prompts.
Report suspicious emails, MFA prompts you did not initiate, or access issues to: [CHANNEL].
We will provide updates at: [CADENCE / STATUS PAGE].
Admin instruction (break-glass reminder)
Use only the approved admin workstation/jump host.
Do not sign in to DCs from standard endpoints.
Record every privileged action in the incident log (time, action, account, system).
If unsure, stop and escalate to the Incident Commander.
FAQ
How do I know it’s really an “AD compromise” and not a single endpoint issue?
Your triggers should focus on privileged identity impact: admin credential theft indicators, DC logon anomalies, privileged group changes, replication/identity detections, and widespread lateral movement. If you can’t confidently scope it within an hour, treat it as SEV-1 until proven otherwise.
Should we reset everyone’s password immediately?
Not by default. Mass resets can cause business outages and don’t guarantee the attacker is out. Prefer tiered resets: privileged accounts first, then critical service accounts, then broader user base once containment and monitoring are in place.
What logging should be “must have” before an incident?
DC security logs with sufficient retention, centralized forwarding, endpoint telemetry (EDR), and identity detections. If you’re using Defender for Identity, ensure event collection is correctly configured and validated.
How often should we test this playbook?
Run a tabletop at least quarterly for Tier-0 compromise. Also test the operational “hard parts” (break-glass access, evidence collection, restore validation) at least twice a year.
