Using AI to Automate Windows Root Cause Analysis
Root cause analysis in Windows environments is still largely a manual process. An alert fires, a sysadmin remotes in, checks the usual suspects — Task Manager, Event Viewer, maybe Resource Monitor — and starts piecing together what happened. For straightforward issues, this takes 15-30 minutes. For complex, intermittent problems, it can take days.
The bottleneck isn't data. Windows generates rich diagnostic telemetry through Event Tracing for Windows (ETW). The bottleneck is interpretation. A single performance incident might involve 50,000 kernel events spanning disk I/O, memory allocation, network activity, and process scheduling. Extracting the causal chain from that volume of data requires deep expertise and significant time.
AI changes this equation.
From Events to Explanations
Large language models are remarkably good at pattern recognition in structured data. When you give an LLM a correlated set of ETW events — process creation timestamps, disk latency measurements, memory pressure indicators, network connection states — it can identify the causal relationships that a human analyst would spend hours finding.
The key word is "correlated." You can't feed an LLM 50,000 raw kernel events. Context windows have limits, and raw events contain too much noise. The effective approach is a two-stage pipeline:
Stage 1: Local correlation — A lightweight agent on the Windows machine subscribes to relevant ETW providers, detects anomalous patterns (CPU spikes, disk latency increases, memory pressure escalation), and extracts just the events related to the anomaly. This typically reduces 50,000 events down to 30-50 key events that tell the story.
Stage 2: AI analysis — The correlated event summary (usually 5-20 KB of structured data) goes to an LLM with a prompt engineered for Windows diagnostic reasoning. The model returns a plain-language explanation of what happened, why, and what to do about it.
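A minimal sketch of the Stage 1 correlation step, assuming events arrive as timestamped records (the field names, thresholds, and padding here are illustrative, not a real agent API):

```python
from dataclasses import dataclass

@dataclass
class Event:
    ts: float          # seconds since some epoch
    provider: str      # e.g. "Kernel-Process", "Kernel-Disk"
    detail: dict       # provider-specific fields

def find_anomaly_window(samples, threshold=90.0, min_len=3):
    """Return (start_ts, end_ts) of the first run of CPU samples above
    threshold lasting at least min_len samples, else None."""
    run = []
    for ts, cpu in samples:
        if cpu > threshold:
            run.append(ts)
        else:
            if len(run) >= min_len:
                return (run[0], run[-1])
            run = []
    if len(run) >= min_len:
        return (run[0], run[-1])
    return None

def correlate(events, window, pad=5.0):
    """Stage 1 reduction: keep only events inside the anomaly window,
    plus a small padding so the trigger (e.g. a recycle just before the
    spike) is not cut off."""
    start, end = window
    return [e for e in events if start - pad <= e.ts <= end + pad]
```

Real agents do more than a time filter (they follow process and thread relationships across providers), but the shape is the same: detect the anomaly locally, then ship only the slice of events around it.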
What AI Analysis Actually Produces
Here's a real-world example. A monitoring alert triggers: "CPU exceeded 95% on APPSERVER-07 for 3 minutes."
The correlation engine extracts these key events:
- Process w3wp.exe (PID 4820) CPU jumped from 12% to 97% at 14:23:41
- 47 new threads created in the w3wp process between 14:23:38 and 14:23:45
- Disk read I/O on E:\inetpub\wwwroot\app\bin\LargeAssembly.dll spiked to 340 MB/s
- .NET CLR JIT compilation events for 2,847 methods in rapid succession
- IIS request queue depth went from 2 to 89 in 8 seconds
- Application pool recycle event at 14:23:38 (reason: scheduled recycle)
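In structured form, a correlated summary like this might be serialized as compact JSON before going to the model. The schema below is purely illustrative (there is no fixed format); timestamps not stated in the event list above are omitted:

```python
import json

# Illustrative Stage 1 output; field names are assumptions, not a spec.
summary = {
    "alert": "CPU exceeded 95% on APPSERVER-07 for 3 minutes",
    "events": [
        {"t": "14:23:38", "type": "apppool_recycle", "reason": "scheduled"},
        {"t": "14:23:38", "type": "thread_creation", "process": "w3wp.exe",
         "count": 47, "until": "14:23:45"},
        {"t": "14:23:41", "type": "cpu_spike", "process": "w3wp.exe",
         "pid": 4820, "from_pct": 12, "to_pct": 97},
        {"type": "disk_read_burst", "mb_per_s": 340,
         "path": "E:\\inetpub\\wwwroot\\app\\bin\\LargeAssembly.dll"},
        {"type": "clr_jit", "methods": 2847},
        {"type": "iis_queue_depth", "from": 2, "to": 89, "over_s": 8},
    ],
}
payload = json.dumps(summary, separators=(",", ":"))
# A payload like this is under 1 KB -- comfortably in the 5-20 KB range
# cited above, and trivially within any model's context window.
```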
The AI analysis returns:
Root Cause: Scheduled IIS Application Pool recycle caused cold-start performance degradation.

The application pool DefaultAppPool performed a scheduled recycle at 14:23:38. The new worker process had to JIT-compile 2,847 .NET methods on first request, causing sustained CPU usage while simultaneously servicing incoming requests. The 340 MB/s disk read burst was the CLR loading assemblies from disk. Request queue depth spiked because the worker process couldn't service requests fast enough during JIT compilation.

Recommended Fix: Enable .NET ReadyToRun (R2R) compilation or NGen pre-compilation for the application's assemblies to eliminate cold-start JIT overhead. Alternatively, configure application pool recycling during off-peak hours and enable overlapping recycling so the old worker process continues servicing requests until the new one is warm.
A senior Windows administrator would reach the same conclusion — eventually. The AI reached it in 3 seconds from the correlated event data.
Why This Works Better Than Rule-Based Alerting
Traditional monitoring uses threshold-based rules: if CPU > 90% for 5 minutes, alert. If disk latency > 100ms, alert. If memory > 85%, alert. These rules detect symptoms but can't diagnose causes.
You could write more sophisticated rules — "if CPU > 90% AND w3wp.exe is the top process AND an app pool recycle happened in the last 60 seconds, then it's a cold-start issue." But you'd need thousands of such rules to cover even the most common scenarios, and they'd be brittle and maintenance-heavy.
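That hypothetical rule could be written as code, which makes the scaling problem visible (the snapshot fields here are illustrative):

```python
def is_cold_start_issue(snapshot: dict) -> bool:
    """One hand-written rule: a w3wp CPU spike shortly after an
    app pool recycle. Field names are illustrative."""
    return (
        snapshot["cpu_pct"] > 90
        and snapshot["top_process"] == "w3wp.exe"
        and snapshot["seconds_since_recycle"] < 60
    )

# Every other failure mode needs its own predicate like this one:
# thousands of rules, each brittle to threshold tuning, process
# renames, and scenarios the author never anticipated.
```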
AI excels here because it can reason about novel combinations of events it hasn't seen before. It doesn't need a pre-written rule for every failure mode. It understands the relationships between Windows subsystems — that JIT compilation causes CPU usage, that app pool recycles cause cold starts, that assembly loading causes disk I/O — and can chain these relationships together for scenarios the rule author never anticipated.
The Privacy Question
A reasonable concern with sending diagnostic data to AI services: what exactly are you sending, and who can see it?
The approach that preserves privacy is to send only correlated summaries, never raw telemetry. The correlated output contains event types, timing, resource measurements, and process names — but not file contents, user data, network payloads, or credentials. It's the diagnostic equivalent of sending your car's OBD error codes to a mechanic, not a copy of your dashboard camera footage.
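One way to enforce that boundary is an allowlist in the agent: fields are dropped unless explicitly marked safe, so payloads, command lines, and credentials never leave the machine by default. A minimal sketch (the field names are illustrative):

```python
# Allowlist of fields describing system behavior, not business content.
SAFE_FIELDS = {"t", "type", "process", "pid", "count",
               "mb_per_s", "latency_ms", "from_pct", "to_pct"}

def scrub(event: dict) -> dict:
    """Keep only allowlisted keys. Anything unlisted -- command lines,
    payloads, user names -- is stripped before the summary is sent."""
    return {k: v for k, v in event.items() if k in SAFE_FIELDS}

raw = {
    "t": "14:23:41",
    "type": "cpu_spike",
    "process": "w3wp.exe",
    "pid": 4820,
    "command_line": "w3wp.exe -ap DefaultAppPool",  # dropped, never sent
}
clean = scrub(raw)
```

The allowlist direction matters: a blocklist fails open when a provider adds a new sensitive field, while an allowlist fails closed.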
Additionally, major AI providers offer data handling agreements that prevent training on customer inputs. Anthropic's Claude API, for example, provides a commitment that API inputs are not used for model training and are not retained beyond the request lifecycle.
The practical test: could someone reconstruct sensitive business data from the AI input? For correlated ETW summaries, the answer is no. The data describes system behavior patterns, not business content.
Limitations to Be Honest About
AI-powered root cause analysis isn't magic. There are cases where it struggles:
Hardware failures — If a disk is developing bad sectors or a memory DIMM has intermittent errors, the ETW events show symptoms (retried I/O, corrected ECC errors) but the AI might attribute the symptoms to software causes without hardware context.
Complex multi-system issues — If the root cause spans multiple machines (a database server causing application server timeouts), single-agent ETW analysis only sees one side. Multi-agent correlation helps but adds complexity.
Novel or rare issues — The AI's reasoning is based on patterns in its training data. Truly novel failure modes — a new driver bug, an unusual interaction between two specific software versions — may get a plausible but incorrect explanation. The correlated events are still valuable for manual investigation in these cases.
Configuration drift — The AI analyzes what happened but may not know what should have happened. If a registry setting is wrong or a group policy is misconfigured, the AI sees the behavioral impact but may not identify the configuration as the root cause without additional context.
Where This Is Heading
The current state of AI-powered diagnostics is roughly equivalent to a knowledgeable junior administrator who can analyze data quickly and explain findings clearly, but occasionally misses nuance that an expert would catch. That's already enormously valuable — it handles the 80% of incidents that are straightforward, freeing senior staff for the 20% that actually require deep expertise.
The trajectory points toward AI agents that can not only diagnose but remediate — identifying that an app pool needs reconfiguration and applying the fix, or detecting that a scheduled task is the trigger and adjusting its timing. We're not there yet in a way that's safe for production environments, but the diagnostic accuracy is already at a level where it saves real time.
If you're managing Windows infrastructure and spending hours per week on manual troubleshooting, AI-powered ETW analysis is worth evaluating. The combination of kernel-level telemetry with automated interpretation fills a gap that no amount of traditional monitoring dashboards can address.
Ready to try ET Ducky?
Deploy an agent in minutes and see AI-powered ETW diagnostics in action.
Get Started Free