Why Your RMM Tool Is Missing What ETW Can See
Remote Monitoring and Management (RMM) tools are standard in every IT environment. They collect system health metrics, deploy patches, run scripts remotely, and alert you when something goes wrong. But there's a fundamental gap in what they can actually tell you about why something went wrong.
Most RMM platforms rely on two data sources: WMI queries and the Windows Event Log. Both are useful, but neither gives you the diagnostic depth needed for real root cause analysis.
The WMI Polling Problem
WMI (Windows Management Instrumentation) is the backbone of RMM health monitoring. Your RMM agent polls WMI every 30-60 seconds to check CPU usage, memory consumption, disk space, and service status. This gives you a sampled snapshot of system state.
The problem is sampling. If a process spikes CPU to 100% for 8 seconds every minute, a 60-second WMI poll might catch it or might miss it entirely. You see the symptom in aggregate metrics — "average CPU was 85%" — but you don't see what caused the spike, when exactly it happened, or what other system activity correlated with it.
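The sampling blind spot is easy to demonstrate. Below is a toy simulation (not real WMI, just arithmetic) of a process that pegs the CPU for 8 seconds out of every 60 while a poller samples instantaneous CPU once per minute; depending on where the poll lands relative to the spike, it reports either a constant 100% or a constant idle reading, and neither matches the true time-weighted average:

```python
# Toy simulation: an 8-second CPU spike every 60 seconds, sampled by a
# once-per-minute poller. All numbers here are illustrative.

SPIKE_START, SPIKE_LEN, PERIOD = 10, 8, 60  # spike runs t=10s..18s of each minute

def cpu_at(t: int) -> int:
    """Instantaneous CPU% at second t: 100 during the spike, 5 otherwise."""
    return 100 if SPIKE_START <= t % PERIOD < SPIKE_START + SPIKE_LEN else 5

def poll(offset: int, minutes: int = 60) -> list[int]:
    """Sample once per 60s, starting `offset` seconds past the minute."""
    return [cpu_at(offset + m * PERIOD) for m in range(minutes)]

# A poller phased inside the spike reports constant 100%; one phased
# outside reports constant 5%. Neither sees the true average.
print(max(poll(offset=12)), max(poll(offset=30)))  # 100 5
true_avg = (SPIKE_LEN * 100 + (PERIOD - SPIKE_LEN) * 5) / PERIOD
print(round(true_avg, 1))  # 17.7
```

An event-driven subscriber, by contrast, would see every transition into and out of the spike regardless of phase.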
WMI also has significant overhead for complex queries. Asking WMI to enumerate all running processes with their thread counts and I/O statistics can itself consume noticeable CPU and memory, especially on busy servers. This is why most RMM tools keep their WMI queries simple and infrequent.
The Event Log Gap
Windows Event Logs capture application errors, security events, and system warnings. They're invaluable for audit trails and compliance, but they're written after the fact by applications that choose to log. They tell you that something happened, not why.
A classic example: Event ID 1000 tells you that app.exe crashed with an access violation. It gives you the faulting module and offset. But it doesn't tell you that the crash was preceded by 3 seconds of escalating memory pressure, 47 failed page allocations, and a cascading thread pool starvation — all of which were visible in ETW but never made it to the Event Log.
What ETW Captures That RMM Can't
Event Tracing for Windows operates at the kernel level. It's not polling — it's subscribing to real-time event streams. Every disk I/O, network connection, process creation, memory allocation, and driver operation can be traced with microsecond timestamps.
Here's what that means in practice:
Disk latency attribution — Not just "disk is busy" but "sqlserver.exe is writing 4KB blocks to tempdb at 150ms latency because the storage controller queue depth hit 32." WMI tells you disk utilization percentage. ETW tells you which process, which file, what operation size, and exact latency per I/O.
Thread-level CPU analysis — Not just "this process uses 80% CPU" but "thread 0x1A4C in worker pool 3 is spinning on a lock held by thread 0x1B20 which is blocked on a synchronous DNS lookup." WMI gives you process-level CPU. ETW gives you thread-level execution flow.
Network connection lifecycle — Not just "port 443 is open" but the full TCP handshake timing, TLS negotiation duration, retransmission count, and exact bytes transferred per connection. Event Logs tell you a connection failed. ETW tells you it failed at the TLS certificate validation step because an OCSP revocation check to ocsp.example.com timed out after 15 seconds.
Memory allocation patterns — Not just "memory is high" but a timeline of heap allocations showing which code path is allocating without freeing, at what rate the working set is growing, and when the system transitions from normal operation to memory pressure.
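To make the first bullet concrete, here is a minimal sketch of the kind of per-process, per-file latency attribution that per-I/O events enable. The event records are synthetic and the field names are illustrative assumptions, not the actual ETW DiskIo event schema:

```python
from collections import defaultdict
from statistics import mean

# Synthetic per-I/O records shaped like the information ETW's disk I/O
# events expose. Field names are illustrative, not the real ETW schema.
events = [
    {"process": "sqlserver.exe", "file": "tempdb.mdf", "bytes": 4096,  "latency_ms": 150.0},
    {"process": "sqlserver.exe", "file": "tempdb.mdf", "bytes": 4096,  "latency_ms": 142.0},
    {"process": "backup.exe",    "file": "full.bak",   "bytes": 65536, "latency_ms": 12.0},
]

# Aggregate latency per (process, file) — the attribution a bare
# "disk utilization %" counter cannot give you.
by_target = defaultdict(list)
for e in events:
    by_target[(e["process"], e["file"])].append(e["latency_ms"])

for (proc, path), lats in sorted(by_target.items()):
    print(f"{proc} -> {path}: {len(lats)} I/Os, avg {mean(lats):.0f} ms")
# backup.exe -> full.bak: 1 I/Os, avg 12 ms
# sqlserver.exe -> tempdb.mdf: 2 I/Os, avg 146 ms
```

The same grouping pattern extends to the other bullets: key thread-time samples by thread ID, connection events by socket, allocations by call stack.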
The Data Volume Challenge
The reason RMM tools don't use ETW is data volume. A single Windows server under moderate load generates 10,000-100,000 ETW events per second across all providers. At an average of 200 bytes per event, that's 2-20 MB per second of raw telemetry. For a fleet of 500 machines, you're looking at 1-10 GB per second of data.
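The arithmetic behind those figures is straightforward — events per second times bytes per event, scaled to the fleet:

```python
# Back-of-envelope check on the volumes above: events/sec x bytes/event
# per machine, then scaled to a 500-machine fleet.
EVENT_BYTES = 200
FLEET = 500

for events_per_sec in (10_000, 100_000):
    per_host_mb = events_per_sec * EVENT_BYTES / 1_000_000  # MB/s per machine
    fleet_gb = per_host_mb * FLEET / 1_000                  # GB/s fleet-wide
    print(f"{events_per_sec} ev/s -> {per_host_mb:.0f} MB/s per host, "
          f"{fleet_gb:.0f} GB/s across the fleet")
# 10000 ev/s -> 2 MB/s per host, 1 GB/s across the fleet
# 100000 ev/s -> 20 MB/s per host, 10 GB/s across the fleet
```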
No centralized monitoring platform can ingest and store that volume cost-effectively. This is why traditional approaches involve starting traces on-demand, capturing for short windows, and analyzing locally.
But there's a middle ground: local correlation.
Local Correlation as the Bridge
Instead of shipping raw ETW data to a central platform (impractical) or ignoring ETW entirely (leaving diagnostic value on the table), you can process ETW events locally on each machine and only transmit the diagnostically relevant output.
A local correlation engine subscribes to ETW providers, watches for patterns in real-time, and generates compact summaries. When CPU spikes, it captures the thread stacks. When disk latency increases, it identifies the processes and files involved. When a process crashes, it captures the preceding seconds of kernel activity that explain why.
The result is a 99%+ reduction in data volume — from megabytes per second of raw events to kilobytes per minute of correlated diagnostics — while preserving the information that matters for troubleshooting.
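The shape of that reduction can be sketched in a few lines. This is a deliberately simplified illustration — the event shape, the 200-byte raw size, and the 90% threshold are assumptions, and a real correlation engine would track far richer patterns — but it shows how a minute of raw events collapses into one compact record:

```python
import json

RAW_EVENT_BYTES = 200  # assumed average raw event size, per the figures above

def summarize(window: list[dict]) -> dict:
    """Collapse a one-minute window of raw events into one diagnostic record."""
    cpu = [e["value"] for e in window if e["type"] == "cpu"]
    return {
        "events_seen": len(window),
        "cpu_peak": max(cpu, default=0),
        "cpu_spike": any(v > 90 for v in cpu),  # pattern worth escalating
    }

# Roughly one minute of raw events at 10k events/sec: a few CPU samples
# plus a flood of disk events (synthetic, for illustration).
window = [{"type": "cpu", "value": v} for v in (40, 95, 97, 42)] \
       + [{"type": "disk", "value": 8}] * 600_000

summary = summarize(window)
raw_bytes = len(window) * RAW_EVENT_BYTES
out_bytes = len(json.dumps(summary).encode())
print(f"raw: {raw_bytes / 1e6:.0f} MB/min -> summary: {out_bytes} bytes")
# well over a 99.9% reduction: megabytes per minute in, tens of bytes out
```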
This correlated output can then flow to a central dashboard just like traditional RMM health metrics, but with dramatically richer diagnostic context. Instead of an alert that says "CPU exceeded 90% on SERVER-12," you get "CPU exceeded 90% on SERVER-12 because the .NET garbage collector ran a full Gen 2 collection that took 4.2 seconds, triggered by WorkerService.exe exceeding its 2GB heap limit."
What This Looks Like in Practice
Consider a common scenario: users report that a web application is slow between 2-3 PM daily.
Traditional RMM approach: Check CPU, memory, disk metrics during that window. Everything looks normal — CPU around 60%, plenty of memory, disk utilization moderate. Open a ticket, schedule a call, maybe RDP in during the window and start clicking around.
ETW-based approach: The agent's local correlation engine has been watching continuously. It shows that between 2:00-2:05 PM, the IIS worker process experiences a burst of 2,300 concurrent requests (3x normal), triggering ASP.NET thread pool starvation. Thread pool growth is throttled by the default injection rate of roughly one thread per 500ms, so it takes nearly 30 seconds to scale from 25 to 80 threads. During that window, request queue depth hits 150 and average response time jumps from 120ms to 8.4 seconds. The burst correlates with a scheduled task on an upstream application server that triggers bulk API calls.
Same problem. One approach gives you "maybe it's slow sometimes." The other gives you the root cause, the trigger, and enough detail to fix it.
The Evolution of Windows Monitoring
The monitoring industry has largely treated Windows as a black box — collect a few counters, watch the Event Log, alert on thresholds. That made sense when the alternative was manually running PerfMon and WPA on individual machines.
But ETW has always been there, capturing everything the kernel does, waiting for someone to build accessible tooling around it. The combination of local correlation engines, lightweight agents, and AI-powered analysis is finally making that kernel-level telemetry usable at scale.
The result isn't a replacement for your RMM — it's a complement. Keep your RMM for patching, remote access, and basic health monitoring. But when you need to understand why something is happening, you need the diagnostic depth that only ETW provides.
ET Ducky was built specifically to bridge this gap. Lightweight agents collect ETW telemetry, correlate it locally with 99.6% bandwidth reduction, and deliver AI-analyzed diagnostics through a cloud dashboard. If you're tired of staring at flat-line health metrics while users report problems, it's worth a look.
Ready to try ET Ducky?
Deploy an agent in minutes and see AI-powered ETW diagnostics in action.
Get Started Free