How to Troubleshoot Windows Performance Issues with ETW
If you've ever stared at Task Manager trying to figure out why a server is slow, you know the frustration. CPU is high, but which process? Disk is busy, but doing what? Memory is climbing, but where's the leak?
Traditional monitoring tools show you symptoms. Event Tracing for Windows (ETW) shows you causes.
What ETW Actually Is
ETW is a kernel-level tracing framework built into every version of Windows since Windows 2000. Unlike Performance Monitor counters or Event Log entries, ETW captures real-time events from the kernel, drivers, and applications at microsecond granularity. When a process opens a file, allocates memory, sends a network packet, or throws an exception, ETW can record it, provided the relevant provider is enabled.
Windows ships with hundreds of event providers (logman query providers lists them), and the kernel's own providers make every disk I/O operation, context switch, and page fault traceable. This is the same data that tools like Windows Performance Recorder (wpr), Windows Performance Analyzer (WPA), and xperf are built on.
The challenge has always been that raw ETW data is overwhelming. A busy server can generate millions of events per minute. The signal-to-noise ratio makes manual analysis impractical for most administrators.
The Traditional ETW Workflow
Here's how ETW troubleshooting typically works:
Start a trace session using logman, xperf, or wpr:
# Define one trace session per provider (process and disk events), then start both
logman create trace "PerfTrace" -p "Microsoft-Windows-Kernel-Process" -o C:\Traces\perf.etl
logman create trace "DiskTrace" -p "Microsoft-Windows-Kernel-Disk" -o C:\Traces\disk.etl
logman start "PerfTrace"
logman start "DiskTrace"
Reproduce the problem while the trace is running.
Stop the trace and analyze with Windows Performance Analyzer:
logman stop "PerfTrace"
logman stop "DiskTrace"
# Open .etl files in WPA for analysis
Manually correlate events across providers to find the root cause.
This works, but it requires deep ETW expertise, access to the machine, and significant time to analyze the results. For a fleet of hundreds or thousands of Windows machines, it doesn't scale.
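If you repeat this workflow often, it can help to generate the command lines instead of retyping them. The following is a minimal Python sketch, not part of any official tooling: the helper name `logman_commands` and the choice to name each .etl file after its session are assumptions made for illustration. It only builds argument lists using the same `logman create trace`, `start`, and `stop` forms shown above; it doesn't run anything.

```python
from pathlib import PureWindowsPath


def logman_commands(session, provider, out_dir=r"C:\Traces"):
    """Return the create/start/stop argument lists for one trace session.

    Hypothetical helper: the .etl file is named after the session,
    which is an assumption of this sketch, not a logman requirement.
    """
    etl_path = PureWindowsPath(out_dir) / f"{session}.etl"
    return {
        "create": ["logman", "create", "trace", session,
                   "-p", provider, "-o", str(etl_path)],
        "start": ["logman", "start", session],
        "stop": ["logman", "stop", session],
    }


if __name__ == "__main__":
    sessions = {
        "PerfTrace": "Microsoft-Windows-Kernel-Process",
        "DiskTrace": "Microsoft-Windows-Kernel-Disk",
    }
    for name, provider in sessions.items():
        for step, argv in logman_commands(name, provider).items():
            print(f"{step}: {' '.join(argv)}")
```

Each argument list could be handed to `subprocess.run` on the target machine, typically from an elevated prompt. Keeping command construction separate from execution makes it easy to review what will run before it does.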
Key ETW Providers for Performance Troubleshooting
Not all providers are equally useful. Here are the ones that give you the most diagnostic value:
Microsoft-Windows-Kernel-Process: Process creation, termination, thread activity, and CPU consumption. This is your starting point for "what's using all the CPU" questions.
Microsoft-Windows-Kernel-Disk: Disk reads and writes with byte offsets, transfer sizes, and latency. Essential for diagnosing slow disk performance. Pair it with Microsoft-Windows-Kernel-File to map I/O back to file paths and identify which application is hammering storage.
Microsoft-Windows-Kernel-Network: TCP/UDP connections and per-process data transfer. Reveals unexpected network activity or connection failures. DNS lookups are logged separately by Microsoft-Windows-DNS-Client.
Microsoft-Windows-Kernel-Memory: Page faults, working set changes, and memory allocation patterns. Critical for memory leak investigation.
Microsoft-Windows-Kernel-Registry: Registry reads and writes. Useful when applications are slow due to excessive registry access.
Microsoft-Windows-TCPIP: Lower-level network diagnostics including retransmissions, connection resets, and MTU issues.
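One way to make this list actionable is to encode it as a lookup from symptom to providers. The sketch below is only an illustration: the symptom keys, the pairings, and the `providers_for` helper are assumptions, and you should adjust them to your own environment. The provider names are the ones from the list above.

```python
# Hypothetical symptom-to-provider mapping; the pairings are assumptions
# made for illustration, not an official recommendation.
PROVIDERS_BY_SYMPTOM = {
    "high_cpu": ["Microsoft-Windows-Kernel-Process"],
    "slow_disk": ["Microsoft-Windows-Kernel-Disk",
                  "Microsoft-Windows-Kernel-Process"],
    "memory_growth": ["Microsoft-Windows-Kernel-Memory",
                      "Microsoft-Windows-Kernel-Process"],
    "network_timeouts": ["Microsoft-Windows-Kernel-Network",
                         "Microsoft-Windows-TCPIP"],
    "slow_registry": ["Microsoft-Windows-Kernel-Registry",
                      "Microsoft-Windows-Kernel-Process"],
}


def providers_for(symptoms, limit=3):
    """Return up to `limit` distinct providers for the given symptoms,
    preserving the order in which they are first suggested."""
    chosen = []
    for symptom in symptoms:
        if symptom not in PROVIDERS_BY_SYMPTOM:
            raise ValueError(f"unknown symptom: {symptom!r}")
        for provider in PROVIDERS_BY_SYMPTOM[symptom]:
            if provider not in chosen:
                chosen.append(provider)
    return chosen[:limit]


if __name__ == "__main__":
    for provider in providers_for(["slow_disk", "memory_growth"]):
        print(provider)
```

The default limit of three keeps a trace focused on a small set of providers, so there is less data to wade through afterwards.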
Common Patterns and What They Mean
After analyzing thousands of ETW traces across enterprise Windows environments, we see certain patterns emerge repeatedly:
High context switch rate with low CPU — Usually indicates thread contention. Applications are spending more time waiting for locks than doing actual work. Look at the Kernel-Process provider for thread wait chains.
Disk I/O spikes correlating with memory pressure — The system is paging. When physical memory fills up, Windows writes memory pages to disk, causing massive I/O. The fix is usually more RAM or finding the memory-hungry process.
Periodic CPU spikes every 15-30 minutes — Almost always a scheduled task, antivirus scan, or Windows Update check. The Kernel-Process provider will show the exact process and timing.
Gradual performance degradation over days — Classic memory leak pattern. A process slowly consumes more memory until the system starts paging. Track working set growth over time using the Kernel-Memory provider.
Network timeout errors with no packet loss — Often caused by DNS resolution delays or certificate validation checks reaching out to revocation lists. The TCPIP and DNS providers reveal the actual bottleneck.
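Two of these patterns lend themselves to simple checks once you have exported numbers from a trace. The sketch below assumes you already have working set samples as (hours, megabytes) pairs and CPU spike times in seconds. Both input shapes, the function names, and the 10% tolerance are assumptions for illustration, and the sample values are synthetic.

```python
from statistics import mean, pstdev


def growth_per_hour(samples):
    """Least-squares slope of (hours, megabytes) samples.

    A steadily positive slope over days is consistent with the
    memory leak pattern; it does not prove one on its own.
    """
    xs = [t for t, _ in samples]
    ys = [mb for _, mb in samples]
    x_bar, y_bar = mean(xs), mean(ys)
    denominator = sum((x - x_bar) ** 2 for x in xs)
    if denominator == 0:
        raise ValueError("need samples taken at two or more distinct times")
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in samples)
    return numerator / denominator


def spike_period(spike_times, tolerance=0.1):
    """Return the mean gap between spikes if the gaps are regular
    (standard deviation within `tolerance` of the mean), else None."""
    gaps = [b - a for a, b in zip(spike_times, spike_times[1:])]
    if len(gaps) < 2:
        return None
    average = mean(gaps)
    return average if pstdev(gaps) <= tolerance * average else None


if __name__ == "__main__":
    # Synthetic sample values, for illustration only.
    print(growth_per_hour([(0, 100), (1, 110), (2, 120), (3, 130)]))
    print(spike_period([0, 900, 1800, 2705]))
```

A regular period points toward something scheduled, which you can then match against the process start times in the trace.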
Scaling ETW Beyond Single Machines
The fundamental limitation of traditional ETW troubleshooting is that it's machine-by-machine. You SSH or RDP into a box, start a trace, wait for the problem, analyze locally. For enterprise environments with hundreds of Windows endpoints, this approach falls apart.
Modern approaches use lightweight agents that run ETW collection continuously, correlate events locally to reduce data volume, and surface only the diagnostically relevant information. Instead of capturing everything and analyzing later, you filter and correlate in real time.
The key insight is that you don't need to ship raw ETW events to a central location. A local correlation engine can reduce millions of raw events to the handful that actually matter: the process that started consuming CPU, the disk operation that's blocking others, the network connection that's timing out. That can mean a 99%+ reduction in data while preserving the diagnostic signal.
This is the approach we built into ET Ducky. The agent runs ETW collection locally, correlates events on the machine, and only sends summarized diagnostics to the cloud. An AI analysis layer then explains what the correlated events mean in plain language — turning "50,000 kernel events" into "IIS application pool is recycling every 3 minutes due to a memory limit configuration at 1.7GB, causing request queuing."
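To make the idea of local reduction concrete, here is a deliberately simple Python sketch. It is not ET Ducky's implementation. It uses plain aggregation and a threshold as a stand-in for correlation, and the `Event` fields, the `summarize` function, and the 100 ms cutoff are all assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds since trace start
    provider: str
    process: str
    cost_ms: float    # hypothetical per-event cost (CPU time, I/O latency, ...)


def summarize(events, min_total_ms=100.0):
    """Collapse raw events into one finding per (provider, process),
    keeping only groups whose total cost reaches the threshold."""
    totals = defaultdict(lambda: [0, 0.0])  # key -> [event count, total ms]
    for ev in events:
        entry = totals[(ev.provider, ev.process)]
        entry[0] += 1
        entry[1] += ev.cost_ms
    findings = [
        {"provider": provider, "process": process,
         "events": count, "total_ms": total}
        for (provider, process), (count, total) in totals.items()
        if total >= min_total_ms
    ]
    # Most expensive groups first.
    findings.sort(key=lambda f: f["total_ms"], reverse=True)
    return findings


if __name__ == "__main__":
    # Synthetic events, for illustration only.
    raw = [Event(i * 0.001, "disk", "idle.exe", 0.01) for i in range(1000)]
    raw += [Event(i * 0.02, "disk", "w3wp.exe", 5.0) for i in range(50)]
    summary = summarize(raw)
    print(f"{len(raw)} raw events -> {len(summary)} finding(s)")
    print(summary)
```

Only the summary rows would need to leave the machine. A real engine would also relate events across providers and over time rather than just summing per process.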
Getting Started
If you're new to ETW, start with these steps:
Pick a specific problem: Don't try to trace everything. Focus on one symptom: high CPU, slow disk, network timeouts.
Enable the right providers: Use the provider list above to select 2-3 providers relevant to your symptom.
Keep traces short: 30-60 seconds during the problem is usually enough. Long traces just create more data to wade through.
Look for correlations: The root cause is usually visible when you overlay events from multiple providers on the same timeline.
Consider automation: If you're troubleshooting the same types of issues repeatedly across multiple machines, agent-based ETW monitoring with automated correlation will save you significant time.
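The "overlay on the same timeline" step can be sketched in a few lines of Python once events have been exported from each trace. The tuple layout (timestamp, provider, description), the function names, and the half-second window are assumptions for illustration, and each input list is assumed to be sorted by timestamp already.

```python
import heapq


def merged_timeline(*streams):
    """Merge per-provider event lists (each already sorted by timestamp)
    into a single timeline ordered by timestamp."""
    return list(heapq.merge(*streams, key=lambda ev: ev[0]))


def events_near(timeline, t, window=0.5):
    """Return events whose timestamp is within `window` seconds of t."""
    return [ev for ev in timeline if abs(ev[0] - t) <= window]


if __name__ == "__main__":
    # Synthetic events, for illustration only.
    disk = [(1.0, "disk", "large write"), (3.0, "disk", "large write")]
    memory = [(2.9, "memory", "working set jump")]
    timeline = merged_timeline(disk, memory)
    for event in events_near(timeline, 3.0):
        print(event)
```

Looking at what other providers logged around a suspicious timestamp is often enough to tell whether, for example, a disk spike lines up with memory pressure.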
ETW is the most powerful diagnostic tool built into Windows. The challenge has always been making it accessible without requiring kernel-level expertise. That's changing.