Tips on diagnosing system hangs?
You should probably be using ETW tracing, which in its (really) easiest form
uses the TraceLogging APIs. If the system crashes and writes a crash dump (i.e.,
you're past the boot-critical OS startup), you can extract any unwritten part of
an ETW log from the memory image. ETW is usually good up to a million events/sec
or so, and the OS uses per-core log buffers so you're not thrashing a shared
lock. Having a debugger connected to see kernel DbgPrints is usually the easiest
way to get small amounts of data right up to a crash.
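To give a flavor of how little code TraceLogging takes, here's a minimal sketch
(compiles in kernel or user mode; the provider name, GUID, and event fields are
made up for illustration):

#include <TraceLoggingProvider.h>

// Hypothetical provider; generate your own GUID with uuidgen.
TRACELOGGING_DEFINE_PROVIDER(
    g_hProvider,
    "MyCompany.MyDriver",
    (0xaaaaaaaa, 0xbbbb, 0xcccc, 0xd0, 0xd1, 0xd2, 0xd3, 0xd4, 0xd5, 0xd6, 0xd7));

// Call TraceLoggingRegister(g_hProvider) once at startup (DriverEntry)
// and TraceLoggingUnregister(g_hProvider) at unload.

VOID LogIoStart(ULONG RequestId, ULONG LengthBytes)
{
    // Each TraceLoggingWrite emits a self-describing structured event;
    // no manifest to author or compile, which is why adding new events
    // is so cheap.
    TraceLoggingWrite(
        g_hProvider,
        "IoStart",
        TraceLoggingUInt32(RequestId, "RequestId"),
        TraceLoggingUInt32(LengthBytes, "LengthBytes"));
}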
If your system fully boots, but will not write a crash dump, there's a fair
chance you have a hung processor or something. Crash dumps run with all
interrupts disabled, with a single processor polling the disk controller for
completions (one reason the storage miniport model is strange is so it still
works in this extremely limited execution environment).
A plus of an ETW trace is you can have a bunch of structured data, and if you
only want to see the data from shortly before a crash, you can specify circular
in-memory buffers: new events just keep overwriting old events, and the
circular buffers show up in the crash dump. You can run a system for days/weeks
logging to the circular buffer. You can also flip a switch and log many
gigabytes of events to disk, saving every event. You will also need some of the
tools to decode ETW traces. Even though it's a bit slow, Microsoft's Message
Analyzer has among the best filtering (expressions). MA is kind of memory
inefficient too; million-event traces are ok, but 100-million-event traces
aren't. The Windows Performance Toolkit tools also analyze ETW traces, and
sometimes they have a really useful view of the data, and other times they seem
to have almost nothing useful for your particular problem.
A HUGE plus of ETW traces is also that the OS has hundreds of event providers
built in, so you can get unified traces of OS events along with your driver
events. I'm a huge fan of ETW tracing, and even though the TraceLogging flavor
has more overhead than the manifest flavor, I generally use the TraceLogging
flavor because it's just a lot easier to add new events. ETW also includes the
correlation (activity) ID feature, which is not well explained, but is
incredibly useful for making logical sense of the raw events (like associating
a specific interrupt with its originating specific I/O operation).
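A sketch of the correlation idea, assuming the provider from the earlier
snippet; EtwActivityIdControl and TraceLoggingWriteActivity are the real APIs,
but the request structure and helper names here are made up:

#include <ntddk.h>
#include <evntprov.h>              // EVENT_ACTIVITY_CTRL_* control codes
#include <TraceLoggingProvider.h>

TRACELOGGING_DECLARE_PROVIDER(g_hProvider);  // registered elsewhere

// Hypothetical per-request context carrying the correlation GUID.
typedef struct _MY_REQUEST {
    GUID ActivityId;
    ULONG RequestId;
} MY_REQUEST, *PMY_REQUEST;

VOID TraceIoIssued(PMY_REQUEST Req)
{
    // Ask ETW for a fresh activity GUID for this I/O. (This is the
    // kernel-mode API; user mode has EventActivityIdControl.)
    EtwActivityIdControl(EVENT_ACTIVITY_CTRL_CREATE_ID, &Req->ActivityId);

    TraceLoggingWriteActivity(
        g_hProvider, "IoIssued",
        &Req->ActivityId, NULL,    // activity id, no parent id
        TraceLoggingUInt32(Req->RequestId, "RequestId"));
}

VOID TraceIoInterrupt(PMY_REQUEST Req)
{
    // Stamping the interrupt-side event with the same GUID lets the
    // decoder tie the interrupt back to the I/O that caused it.
    TraceLoggingWriteActivity(
        g_hProvider, "IoInterrupt",
        &Req->ActivityId, NULL,
        TraceLoggingUInt32(Req->RequestId, "RequestId"));
}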
Forcing a crash dump on command is also important, and you know about the
keyboard sequence. Many server motherboards can also generate an NMI
(non-maskable interrupt), which sometimes can be initiated from a server
management interface (like on many Dell servers). Some servers also have the
hardware to generate an NMI, but you have to add a little switch to the
motherboard header to trigger the NMI (many SuperMicro motherboards). Some
motherboards know how to generate an NMI on a watchdog timeout too. An NMI has
the advantage that it will interrupt through an interrupt storm or a looping
IPI, although it will not interrupt a processor hung on a bus access (like an
infinite PCIe config retry). It would take a PCIe analyzer to verify an infinite
config retry, and even then only if the PCIe device is in a slot you can put the
PCIe interposer on. At one point, there were JTAG-like debug interfaces for
special x86 motherboards. It's always seemed odd that a $2 ARM microcontroller
has a better low-level debugger interface (that uses a $15 interface tool) than
a $2000 Intel server processor.
If you're looking for CHEAP ways to get really low-level debugging info out:
1) if your motherboard supports an old legacy parallel or serial interface,
these were accessed with port I/O instructions. You can figure out the port
address, which often would not change between boots, and make your driver
toggle pins on the port (like DTR for a serial port). I suppose you could even
do bit-banged serial, and use an oscilloscope ($350, or a $15 Arduino) to read
the serial stream. PCs also had a debug port at 0x80, and for PCI (and maybe
PCIe too) you could get cheap little boards that would read the debug port
value and display it on a seven-segment LED. I've also connected logic
analyzers to parallel ports.
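A minimal sketch of the port 0x80 checkpoint trick from a kernel-mode driver
(the checkpoint values are arbitrary):

#include <ntddk.h>

// POST-code-style checkpoints; any byte values you'll recognize later.
#define CHECKPOINT_ISR_ENTER 0xA1
#define CHECKPOINT_ISR_EXIT  0xA2

VOID EmitCheckpoint(UCHAR Value)
{
    // Port 0x80 is the legacy POST/debug port; a cheap POST card in a
    // PCI slot latches the last byte written and shows it on its LEDs.
    WRITE_PORT_UCHAR((PUCHAR)(ULONG_PTR)0x80, Value);
}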
These methods all require that your code somehow gets a CPU, which can be via
an interrupt or a crash dump callback. Sometimes you can also monopolize a
processor at high IRQL for a while, like making a worker thread that raises
IRQL to some higher level for a millisecond, and periodically lowers it back
down to PASSIVE_LEVEL to avoid hanging any other threads scheduled on that
core; a sketch follows below. If you forever monopolize a CPU at elevated IRQL,
you will get a DPC watchdog crash. Monopolizing a CPU is not good in a
production driver (proper power management is toast), but for debugging use it
does give you control of a processor a large percentage of the time.
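A hedged sketch of that monopolize-then-yield pattern (the burst length and
pause are guesses, not tuned values):

#include <ntddk.h>

// Debug-only: hog this core at DISPATCH_LEVEL in ~1 ms bursts, dropping
// back to PASSIVE_LEVEL between bursts so other threads (and the DPC
// watchdog) stay happy. Not for production drivers.
VOID MonopolizeCoreWorker(PVOID Context)
{
    UNREFERENCED_PARAMETER(Context);

    for (;;) {
        KIRQL oldIrql;
        LARGE_INTEGER pause;

        KeRaiseIrql(DISPATCH_LEVEL, &oldIrql);

        // ~1 ms of busy time; KeStallExecutionProcessor is meant for
        // short stalls, so take it in 50 us slices.
        for (int i = 0; i < 20; i++) {
            KeStallExecutionProcessor(50);
            // ... poke your hardware / emit checkpoints here ...
        }

        KeLowerIrql(oldIrql);

        // Yield the core briefly at PASSIVE_LEVEL.
        pause.QuadPart = -10 * 1000;  // 1 ms, relative, 100 ns units
        KeDelayExecutionThread(KernelMode, FALSE, &pause);
    }
}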
ETW tracing has an option to use the TSC timestamp instead of the QPC
timestamp, although I don't remember if that actually works on current OS
versions and CPUs. My fuzzy memory is the TSC timestamp works, but is not so
useful, because writing the event takes a few hundred cycles, and if it used a
synchronizing RDTSCP you'd end up with only a little better resolution, not
1024x better.
If you need really fast timing measurements, Intel has a great white paper on
how to get the very best resolution from the TSC. For example, it talks about
the pros and cons of using RDTSC, which does not synchronize the pipeline, vs.
RDTSCP, which does. I read that even for the synchronizing RDTSCP, the
processor will start executing new instructions before the timer finishes
reading, so you can't fully escape the timing fuzziness of out-of-order
execution.
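A small user-mode sketch of the two flavors using the MSVC intrinsics (this is
just one common pattern, not the white paper's exact recipe):

#include <intrin.h>
#include <stdio.h>

int main(void)
{
    unsigned int aux;

    // RDTSC: cheap, but can be reordered around neighboring instructions.
    unsigned __int64 t0 = __rdtsc();

    // ... code under test ...

    // RDTSCP: waits for prior instructions to complete before reading,
    // though later instructions may already be starting.
    unsigned __int64 t1 = __rdtscp(&aux);

    printf("delta = %llu cycles\n", t1 - t0);
    return 0;
}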
I think one of the reasons for QPC being TSC/1024 is that on modern processors,
which execute out of order and speculatively, with potentially close to 100
instructions in some state of execution, TSC/1024 is a number that is above the
jitter. RDTSC is NOT a synchronizing instruction, so the value is just
someplace in the out-of-order queue. The OS will also choose an "optimal"
high-resolution time source on a particular system. On some processor models,
the TSC tracks power-management clock speed changes; on those systems, the OS
may choose a global hardware timer for QPC, like the high-precision event timer
(HPET) in the chipset. I've often found QPC is sufficient resolution for
performance analysis of most C code. For tight inner loops, even the TSC is
insufficient, and you need to look at things more statistically (like timing
artificially large test cases). For inner-loop optimization, it's still useful
to look at the instruction stream and manually add up instruction throughput
cycles, result latencies, and dependency stalls. You generally don't directly
write assembler, but you certainly can carefully tune your C code to persuade
the compiler to generate the code you want.
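For the QPC-is-usually-enough case, a minimal user-mode sketch (Win32 API):

#include <windows.h>
#include <stdio.h>

int main(void)
{
    LARGE_INTEGER freq, start, stop;

    QueryPerformanceFrequency(&freq);   // counts per second
    QueryPerformanceCounter(&start);

    Sleep(10);  // stand-in for the code being measured

    QueryPerformanceCounter(&stop);

    double us = (stop.QuadPart - start.QuadPart) * 1e6 / freq.QuadPart;
    printf("elapsed: %.1f us\n", us);
    return 0;
}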
For fine-grained performance analysis, Intel's VTune is usually my tool of
choice, which answers questions like: how long does that interlocked memory
operation really take (an enormous amount of time if it's a cache miss to a
different CPU's cache)?
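A toy benchmark illustrating that interlocked cost (assumes a multi-core
machine; the thread and iteration counts are arbitrary):

#include <windows.h>
#include <stdio.h>

#define ITERS 10000000

static volatile LONG g_shared;   // contended: bounces between CPU caches

static DWORD WINAPI Hammer(LPVOID arg)
{
    (void)arg;
    for (long i = 0; i < ITERS; i++) {
        InterlockedIncrement(&g_shared);  // cross-cache miss when contended
    }
    return 0;
}

int main(void)
{
    LARGE_INTEGER f, t0, t1;
    HANDLE threads[2];

    QueryPerformanceFrequency(&f);
    QueryPerformanceCounter(&t0);

    // Two threads fighting over one cache line make each increment far
    // more expensive than the uncontended case.
    for (int i = 0; i < 2; i++)
        threads[i] = CreateThread(NULL, 0, Hammer, NULL, 0, NULL);
    WaitForMultipleObjects(2, threads, TRUE, INFINITE);

    QueryPerformanceCounter(&t1);
    printf("~%.0f ns/increment (contended, rough)\n",
           (t1.QuadPart - t0.QuadPart) * 1e9 / f.QuadPart / (2.0 * ITERS));
    return 0;
}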
From: firstname.lastname@example.org [mailto:email@example.com] On Behalf Of
Sent: Saturday, March 3, 2018 10:07 PM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] Tips on diagnosing system hangs?
Correction -- I meant R-CTL+ScrLock+ScrLock to create a bug check.
7. If I put DbgPrint calls in my driver, can you show me an easy tutorial on how
to use the output. In particular, I'd want to save the data QUICKLY to disk in
case of a crash or hangup.