The NT Insider:You've Typed !analyze -v, Now What? -- Next Steps in Debugging

Everything Windows Driver Development

Thu, 14 Mar 2019 118020 members

Online Dump Analyzer
OSR Dev Blog
The NT Insider
The Basics
File Systems
Downloads
ListServer / Forum


	Express Links

	·	The NT Insider Digital Edition - May-June 2016 Now Available!
	·	Windows 8.1 Update: VS Express Now Supported
	·	HCK Client install on Windows N versions
	·	There's a WDFSTRING?
	·	When CAN You Call WdfIoQueueP...ously

THE NT INSIDER

You've Typed !analyze -v, Now What? -- Next Steps in Debugging
The NT Insider, Vol 11, Issue 3&4, May-August 2004 | Published: 18-Aug-04| Modified: 18-Aug-04

We often see people asking what to do when their driver doesn’t work – this question arises on an almost daily basis. While the actual setup of the debugger now seems to be reasonably well understood, the steps beyond that appear to be shrouded in mystery. This article will warn about common issues and delve into some of the basic steps that you should follow when debugging.

For the purposes of this article, we will assume that you either have a live system debugging set-up (and have setup the appropriate symbols - see Side Bar Accessing the Microsoft Symbol Server at conclusion of article) or you are looking at a post-mortem dump in the debugger. For this article we are using the current (as of this article) publicly available version of the Windows debugger (WinDBG) from Microsoft (6.3.17.0, May 27, 2004). If you are using a different version of the debugger, the details may vary from what we describe in this article.

The first place to start with any crash is the information provided using the command "!analyze –v". This command invokes the "analyze" routine embedded within the ext.dll library shipped with the current debugger. This library encapsulates a tremendous body of knowledge for the causes of most common failures – and each new version of the debugger reflects the continuous learning process at Microsoft on how to perform automated fault isolation.

For example, we looked through our archives and found a recent crash dump to describe the use of the analyze utility as the starting point of our investigation, see Figure 1.

kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

UNEXPECTED_KERNEL_MODE_TRAP (7f)
This means a trap occurred in kernel mode, and it's a trap of a kind
that the kernel isn't allowed to have/catch (bound trap) or that
is always instant death (double fault). The first number in the
bugcheck params is the number of the trap (8 = double fault, etc)
Consult an Intel x86 family manual to learn more about what these
traps are. Here is a *portion* of those codes:
If kv shows a taskGate
        use .tss on the part before the colon, then kv.
Else if kv shows a trapframe
        use .trap on that value
Else
        .trap on the appropriate frame will show where the trap was taken
        (on x86, this will be the ebp that goes with the procedure KiTrap)
Endif
kb will then show the corrected stack.
Arguments:
Arg1: 00000008, EXCEPTION_DOUBLE_FAULT
Arg2: 80042000
Arg3: 00000000
Arg4: 00000000

Debugging Details:
------------------

BUGCHECK_STR: 0x7f_8

TSS: 00000028 -- (.tss 28)
eax=00030e01 ebx=812ca368 ecx=8117f300 edx=00000001 esi=8117f3ac edi=00010000
eip=baf3f9d7 esp=f5031fc0 ebp=f503205c iopl=0         nv up ei ng nz na po nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000             efl=00010286
atapi!IdeSendCommand+0x3:
baf3f9d7 53               push    ebx
Resetting default scope

DEFAULT_BUCKET_ID: DRIVER_FAULT

LAST_CONTROL_TRANSFER: from baf40776 to baf3f9d7

STACK_TEXT:
f503201c baf40776 812ca368 8117f3ac 00010000 atapi!IdeSendCommand+0x3
f503205c baf4291c 812ca368 8117f3ac 812ca1c8 atapi!AtapiStartIo+0x22b
f5032088 80530f57 012ca030 00000002 0dd36002 atapi!IdeStartIoSynchronized+0x162
812ca030 00000000 812fdc18 00000000 00000000 nt!KeSynchronizeExecution+0x17

FOLLOWUP_IP:
atapi!IdeSendCommand+3
baf3f9d7 53 push ebx

SYMBOL_STACK_INDEX: 0

FOLLOWUP_NAME: MachineOwner

SYMBOL_NAME: atapi!IdeSendCommand+3

MODULE_NAME: atapi

IMAGE_NAME: atapi.sys

DEBUG_FLR_IMAGE_TIMESTAMP: 3d6ddb04

STACK_COMMAND: .tss 28 ; kb

BUCKET_ID: 0x7f_8_atapi!IdeSendCommand+3

Followup: MachineOwner
---------

Figure 1

Note that the output from !analyze -v does not tell us the specific cause of the crash, but it does give us a place to start looking. First, make sure to read the text that describes the bug check code. It is also worthwhile to read the online documentation with WinDBG in order to better understand the bug check code.

In this example, the bug check code (0x7F) indicates that this is a "kernel mode trap". Note that the example text indicates 8, which represents a double fault – exactly the case in our crash example (this is also a hint to us that this is probably a common case, since it was used as the example). The documentation here also refers us out to the Intel reference manual – these manuals are useful when debugging (on the x86) and can be found on the Intel web site (as of the writing of this article the Pentium 4 documentation can be found at http://developer.intel.com/design/Pentium4/documentation.htm). This includes information about the various types of traps supported by the x86 processor family. It confirms that trap 8 is the "double fault" trap.

The WinDBG documentation also provides further insight here:

0x00000008, or Double Fault, is when an exception occurs while trying to call the handler for a prior exception. Normally, the two exceptions can be handled serially. However, there are several exceptions that cannot be handled serially, and in this situation the processor signals a double fault. There are two common causes of a double fault:

A kernel stack overflow. This occurs when a guard page is hit, and then the kernel tries to push a trap frame. Since there is no stack left, a stack overflow results, causing the double fault. If you suspect this has occurred, use !thread to determine the stack limits, and then use kb (Display Stack Backtrace) with a large parameter (for example, kb 100) to display the full stack.

A hardware problem. This also points out an important attribute of analyzing any crash: understand the context in which the crash occurred. If this were a machine running your experimental driver, it would seem likely that a software bug, and not a hardware problem was the cause. That allows us to ignore the less likely cases.

Continuing on, let’s follow the advice of the documentation. So, in our sample crash we do as we’re told, see Figure 2.

kd> !thread
THREAD ffb6fda8 Cid 0314.0330 Teb: 7ffdb000 Win32Thread: e1785120 RUNNING on processor 0
IRP List:
    81662e28: (0006,01d8) Flags: 40000404 Mdl: 00000000
    81714f68: (0006,0094) Flags: 40000000 Mdl: 00000000
    81d1af68: (0006,0094) Flags: 40000000 Mdl: ffab9d78
    81d5af68: (0006,0094) Flags: 40000000 Mdl: ffabbde0
    81784f68: (0006,0094) Flags: 40000000 Mdl: ffabbe60
    81990f68: (0006,0094) Flags: 40000000 Mdl: ffabc6f8
    81760f68: (0006,0094) Flags: 40000900 Mdl: 811b3bd0
Not impersonating
DeviceMap                 e1001098
Owning Process            ffb750b0
Wait Start TickCount      36960          Elapsed Ticks: 0
Context Switch Count      251                 LargeStack
UserTime                  00:00:00.0100
KernelTime                00:00:00.0130
Start Address 0x77e7d342
Win32 Start Address 0x77f95b06
Stack Init f5035000 Current f50326f0 Base f5035000 Limit f5032000 Call 0
Priority 8 BasePriority 8 PriorityDecrement 0 DecrementCount 16
ChildEBP RetAddr Args to Child
805367e0 8052f283 0000007f 00000008 80042000 nt!KeBugCheckEx+0x19 (FPO: [Non-Fpo])
805367e0 baf3f9d7 0000007f 00000008 80042000 nt!KiTrap08+0x52 (FPO: TaskGate 28:0)
f503201c baf40776 812ca368 8117f3ac 00010000 atapi!IdeSendCommand+0x3 (FPO: [EBP 0xf503205c] [2,24,4])
f503205c baf4291c 812ca368 8117f3ac 812ca1c8 atapi!AtapiStartIo+0x22b (FPO: [Non-Fpo])
f5032088 80530f57 012ca030 00000002 0dd36002 atapi!IdeStartIoSynchronized+0x162 (FPO: [Non-Fpo])
812ca030 00000000 812fdc18 00000000 00000000 nt!KeSynchronizeExecution+0x17

Figure 2

We have emphasized the line that contains the actual stack limits. It is important to remember that successive addresses on the stack are lower than the previous limit, so that we should think of stacks as "growing down". Thus, the address of the stack base is greater than the limit of the stack.

If we look at the stack information provided by the !thread command we can also see that the stack addresses (along the far left edge the "ChildEBP" values) end with 0x805367e0 but a few lines below are 0xf503201c – the first address is not within the stack range for this thread (0xf5032000 to 0xf5035000) but that later address is – and just above the stack limit. This does seem to suggest some sort of stack overflow.

The debugger documentation suggests using kb with a large numeric value. We tried this, see Figure 3.

kd> kb 100

ChildEBP RetAddr Args to Child

805367e0 8052f283 0000007f 00000008 80042000 nt!KeBugCheckEx+0x19

805367e0 baf3f9d7 0000007f 00000008 80042000 nt!KiTrap08+0x52

f503201c baf40776 812ca368 8117f3ac 00010000 atapi!IdeSendCommand+0x3

f503205c baf4291c 812ca368 8117f3ac 812ca1c8 atapi!AtapiStartIo+0x22b

f5032088 80530f57 012ca030 00000002 0dd36002 atapi!IdeStartIoSynchronized+0x162

812ca030 00000000 812fdc18 00000000 00000000 nt!KeSynchronizeExecution+0x17

Figure 3

However, this doesn’t provide us with any more information than we had with the !thread command. At this point it would be easy to give up and say, "I don’t know. This documentation is wrong, the debugger is broken, and this is too hard!" DO NOT GIVE UP YET!

Note that we can still exploit the steps suggested by the debugger. We wandered off looking at the documentation, but the debugger suggested that we look at the results from the kv command, see Figure 4.

kd> kv

ChildEBP RetAddr Args to Child

805367e0 8052f283 0000007f 00000008 80042000 nt!KeBugCheckEx+0x19 (FPO: [Non-Fpo])

805367e0 baf3f9d7 0000007f 00000008 80042000 nt!KiTrap08+0x52 (FPO: TaskGate 28:0)

f503201c baf40776 812ca368 8117f3ac 00010000 atapi!IdeSendCommand+0x3 (FPO: [EBP 0xf503205c] [2,24,4])

f503205c baf4291c 812ca368 8117f3ac 812ca1c8 atapi!AtapiStartIo+0x22b (FPO: [Non-Fpo])

f5032088 80530f57 012ca030 00000002 0dd36002 atapi!IdeStartIoSynchronized+0x162 (FPO: [Non-Fpo])

812ca030 00000000 812fdc18 00000000 00000000 nt!KeSynchronizeExecution+0x17

Figure 4

Note that this is similar in output to the kb command, but includes additional information about how the particular call frame was constructed. We can see that the call to KiTrap08 includes a TaskGate – and the debugger advised that if we found a task gate we should use .tss and then obtain a second stack list. This gives us what we see in Figure 5.

kd> kv

*** Stack trace for last set context - .thread/.cxr resets it

ChildEBP RetAddr Args to Child

f503201c baf40776 812ca368 8117f3ac 00010000 atapi!IdeSendCommand+0x3 (FPO: [EBP 0xf503205c] [2,24,4])

f503205c baf4291c 812ca368 8117f3ac 812ca1c8 atapi!AtapiStartIo+0x22b (FPO: [Non-Fpo])

f5032088 80530f57 012ca030 00000002 0dd36002 atapi!IdeStartIoSynchronized+0x162 (FPO: [Non-Fpo])

812ca030 00000000 812fdc18 00000000 00000000 nt!KeSynchronizeExecution+0x17

Figure 5

Alas, this doesn’t seem to tell us much more than we knew before. Again, it would be easy for us to quit at this point, but we can still look farther and see if we can find the underlying cause. Unfortunately, once the automated part of the analysis is done, we have to actually start poking around – and most of the time, what we write about in The NT Insider is that process of "poking around".

So, what do we do next? We already suspect that this is a stack overflow, so it would be nice to see if we can find potential culprits for this stack consumption. Such large consumers would show up as big gaps between the ChildEBP values we see on the stack – and would suggest where to look in the guilty driver to resolve the underlying problem.

Unfortunately, sometimes the debugger cannot unwind the stack. In such cases, we must manually walk the stack looking for another call frame and see if we can continue the stack unwind. We normally analyze the stack from the last valid frame (in this case the third line, not that final line). The third line lists a different stack value for the ChildEBP and thus is suspect. We sometimes see this in the debugger output and have learned over the years to treat that final entry with suspicion.

We’re specifically looking for a pair of addresses: the first will be a ChildEBP (pushed on the stack upon entry to a function) and the second will be a return address for the function that called. We can use the information from these two to reconstruct the call sequence. In the listing below we highlighted the prospective addresses – keep in mind that a ChildEBP would normally be an address above the current location, but still within the stack – so it will be numerically close, see Figure 6.

kd> dd f5032088

f5032088 812ca030 80530f57 012ca030 00000002

f5032098 0dd36002 baf42b57 812fd970 baf427ba

f50320a8 812ca030 ffa50e94 813034c0 f5032104

f50320b8 0dd36000 baf43827 00000000 00000000

f50320c8 00000000 812ca030 f9ad3c75 812ca030

f50320d8 00000000 806adee0 81303408 00000000

f50320e8 812ddbd8 813034c0 ffa25000 806ab698

f50320f8 00004000 81322220 00000000 f5032134

Figure 6

Thus, our first candidate is f5032104. The corresponding candidate return address is actually 0dd36000, but we know that can’t be a valid kernel address (since it is below the kernel address space). Interestingly, when we first did this we accidentally skipped this value and used the next stack value - baf43827. As it turns out, this was the correct value, but we arrived at it accidentally.

To determine if a value is a valid return address, we must remember that the return address will be the instruction after the call. So, in order to see the call instruction we must back up in the instruction stream. In general, we try to back up five or six bytes from the return address in order to hit the call – presumably, using that Intel manual, one can figure out the various possible sizes for a call operation, but nobody around here has done that analysis yet! This yields the results seen in Figure 7.

kd> u baf43827-5

atapi!IdePortAllocateAccessToken+0x12:

baf43822 e803f3ffff call atapi!CallIdeStartIoSynchronized (baf42b2a)

baf43827 eb0d jmp atapi!IdePortAllocateAccessToken+0x26 (baf43836)

baf43829 682a2bf4ba push 0xbaf42b2a

baf4382e ffb084000000 push dword ptr [eax+0x84]

baf43834 ffd1 call ecx

baf43836 c20400 ret 0x4

baf43839 cc int 3

atapi!IdeProcessCompletedRequest:

baf4383a 55 push ebp

Figure 7

So since we can see a call, we really do have a return address. We can feed this into WinDBG using the kb or kv command plus the optional arguments (the ESP EBP and EIP values at that point in the call sequence). We know the EBP – that was on the stack (f5032104), and we know the EIP – that was the call instruction (baf43822). The stack pointer must have been the location on the stack of the return address, since the processor pushed the return address onto the stack at the current value of ESP. Thus, ESP was f50320bc (we took the address of the first location listed – f50320b8 and added 4 to it – and this is when we noticed we grabbed the wrong address earlier!).

Just to touch on our erroneous guess before (and confirm that we "got lucky") we looked at the code of the called function seen in Figure 8.

kd> u baf42b2a

atapi!CallIdeStartIoSynchronized:

baf42b2a 53 push ebx

baf42b2b 55 push ebp

baf42b2c 8b6c2418 mov ebp,[esp+0x18]

baf42b30 56 push esi

baf42b31 8b7528 mov esi,[ebp+0x28]

baf42b34 57 push edi

baf42b35 8dbee0000000 lea edi,[esi+0xe0]

baf42b3b 8bcf mov ecx,edi

Figure 8

Note that the first value pushed was the value in EBX and the second was the value in EBP. While we are not sure why the debugger could not unwind this, we know that we can do so – and we’ll capitalize on our earlier mistake! This gives us the results seen in Figure 9.

kd> kv = f5032104 f50320bc baf43822

ChildEBP RetAddr Args to Child

f50320b8 baf43827 00000000 00000000 00000000 atapi!IdePortAllocateAccessToken+0x12 (FPO: [1,0,0])

f50320cc f9ad3c75 812ca030 00000000 806adee0 atapi!IdePortAllocateAccessToken+0x17 (FPO: [1,0,0])

f50320d8 806adee0 81303408 00000000 812ddbd8 PCIIDEX!BmReceiveScatterGatherList+0x21 (FPO: [4,0,1])

f5032104 806ae0b4 00000000 81303408 ffa50e68 hal!HalBuildScatterGatherList+0x19c (FPO: [Non-Fpo])

f5032134 f9ad3d17 81322220 81303408 ffa50e68 hal!HalGetScatterGatherList+0x24 (FPO: [Non-Fpo])

f5032168 baf44114 813034c0 ffa25000 00004000 PCIIDEX!BmSetup+0x5d (FPO: [Non-Fpo])

f50321a0 804eb04f 812ca030 81ab6f20 812fd578 atapi!IdePortStartIo+0xe6 (FPO: [Non-Fpo])

f50321c0 baf434fc 812ca030 81ab6f20 00000000 nt!IoStartPacket+0x7b (FPO: [Non-Fpo])

f50321ec 804ea221 812ca030 00ab6f20 806ac214 atapi!IdePortDispatch+0x4c0 (FPO: [Non-Fpo])

f50321fc 8062c190 81ab6f20 bafb910c 00000000 nt!IopfCallDriver+0x31 (FPO: [0,0,1])

f5032220 bafb9133 bafb910c bafb9898 812fd2a8 nt!IovCallDriver+0x9e (FPO: [Non-Fpo])

f5032228 bafb9898 812fd2a8 81ab6f20 812fd2a8 ACPI!ACPIDispatchForwardIrp+0x27 (FPO: [2,0,1])

f5032258 804ea221 812fd2a8 bafcd68c 806ac214 ACPI!ACPIDispatchIrp+0x158 (FPO: [Non-Fpo])

f5032268 8062c190 81ab6fd8 8117f300 812fd2a8 nt!IopfCallDriver+0x31 (FPO: [0,0,1])

f503228c f98a4028 ffa25000 81a18e28 00000000 nt!IovCallDriver+0x9e (FPO: [Non-Fpo])

f503229c f98a3c4b 8117f300 81328998 81a18f28 CLASSPNP!SubmitTransferPacket+0x7e (FPO: [1,0,3])

f50322c8 f98a3b5a 00004000 00004000 813288e0 CLASSPNP!ServiceTransferRequest+0xe0 (FPO: [Non-Fpo])

f50322ec 804ea221 813288e0 00000000 806ac214 CLASSPNP!ClassReadWrite+0xfd (FPO: [Non-Fpo])

f50322fc 8062c190 813286b8 812fd040 81328770 nt!IopfCallDriver+0x31 (FPO: [0,0,1])

f5032320 f9adb36c 812f7b00 81a18f4c f5032364 nt!IovCallDriver+0x9e (FPO: [Non-Fpo])

Figure 9

Of course, this still isn’t enough stack information. The previous trick (of providing the stack frame count) won’t work this time around because we overrode the stack parameters. Thus, we must use a different debugger trick – the .kframes directive. That changes the default so that the debugger shows us more frames by default. We set it to 0x100 and repeated the stack backtrace and we now have what you see in Figure 10.

kd> .kframes 100

Default stack trace depth is 0n256 frames