The NT Insider

Debugging 103: Where To Go With A System Crash
(By: The NT Insider, Vol 15, Issue 2, July-August 2008 | Published: 22-Aug-08 | Modified: 22-Aug-08)


In my experience over the years, I've found there are two radically different communities when it comes to using the kernel debugger. One is the development community: people building extensions for the Windows OS who are intimately familiar with their code and how it works. They have private PDBs for their own code, they have the source code and they have tremendous insight into what their code is doing. The other is the support community: people who are presented with crashes in customer situations, often involving complex "real world" installations that include multiple third party products. They live outside the comfort zone of having both source code and private PDBs.

Of these two communities, it is the support community that has the more challenging task - working with less specific knowledge, fewer support mechanisms and the intense pressure that only comes from a customer that is both unhappy and looking to them to fix the problem.

In many of our debugging articles we actually take interesting crash dumps and describe how we analyzed them - sometimes this is based on our own code, sometimes it is based upon random situations that we've previously debugged. However, the harsh reality is that numerous crash dumps we analyze leave us unsure of exactly what went wrong. The challenge in that case is to determine the best "next step." Sometimes that can be to pass it off to someone with greater specific knowledge (moving it from the "support" community into the "development" community).

Let's Get To It

This issue's crash dump is certainly one of those head scratchers - yet we think that showing what we did and how we analyzed it (as much as possible) will still help our readers see the process that we followed to reach our own conclusions on how to move forward with it.

This crash came from a Windows Server 2008 system. The only "interesting" bit on it was the final release candidate for Hyper-V (the new Server 2008 virtualization component). During system shutdown the system crashed, so I took the opportunity to grab the crash dump file and set it aside for a bit more analysis. I've only ever seen this crash the one time, but this installation was purpose built and I only used it for a week (so, I saw the crash once in a half-dozen reboots).

The hardware was certainly on the low end of the server spectrum (its primary purpose is testing, not deployment). It was outfitted with a single quad-core Intel Pentium processor, 4GB physical memory, and a single 300GB 7200 rpm SATA hard drive in a Shuttle configuration. Certainly this is more than sufficient to run Server 2008 and one or maybe two virtual machines.

So, looking at the crash dump in the debugger, the first "interesting" bit I saw was a bug check code that I don't typically see (basically a "Memory Manager is not happy" bug check).

Microsoft (R) Windows Debugger Version 6.9.0003.113 X86
Copyright (c) Microsoft Corporation. All rights reserved.

Loading Dump File [E:\dumps\wam08srv\memory-061308-01.dmp]
Kernel Summary Dump File: Only kernel address space is available

Symbol search path is: srv*e:\symbols\websymbols*http://msdl.microsoft.com/download/symbols
Executable search path is:
Windows Server 2008 Kernel Version 6001 (Service Pack 1) MP (4 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Built by: 6001.18000.amd64fre.longhorn_rtm.080118-1840
Kernel base = 0xfffff800`01614000 PsLoadedModuleList = 0xfffff800`017d9db0
Debug session time: Fri Jun 13 13:35:21.424 2008 (GMT-7)
System Uptime: 0 days 3:45:47.453
Loading Kernel Symbols
..............................................................................................................................
Loading User Symbols
PEB is paged out (Peb.Ldr = 000007ff`fffd6018). Type ".hh dbgerr001" for details
Loading unloaded module list
.....
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck 1A, {41790, fffffa8000c005d0, ffff, 0}

Page b9af6 not present in the dump file. Type ".hh dbgerr004" for details
Page b9c9c not present in the dump file. Type ".hh dbgerr004" for details
PEB is paged out (Peb.Ldr = 000007ff`fffd6018). Type ".hh dbgerr001" for details
PEB is paged out (Peb.Ldr = 000007ff`fffd6018). Type ".hh dbgerr001" for details
Probably caused by : ntkrnlmp.exe ( nt! ?? ::FNODOBFM::`string'+1d093 )

Followup: MachineOwner

---------

One Step Back

In our debug classes I often have people ask me what some of these messages mean, so let's take a quick detour and look at what each of these lines is telling us:

Microsoft (R) Windows Debugger Version 6.9.0003.113 X86
Copyright (c) Microsoft Corporation. All rights reserved.

This tells us the debugger version - 6.9.3.113, x86 version (I have both x86 and x64 installed on my system, but I have found a few minor oddities with the x64 version of the debugger).

The next lines tell me about the actual crash dump:

Loading Dump File [E:\dumps\wam08srv\memory-061308-01.dmp]
Kernel Summary Dump File: Only kernel address space is available

The first is where the dump file is located (yep, it's on my E drive). The "kernel summary dump" means that only the portions of physical memory that were actively in-use by the kernel were actually written to the crash dump file.

Then we get information about the actual search paths:

Symbol search path is: srv*e:\symbols\websymbols*http://msdl.microsoft.com/download/symbols
Executable search path is:

The symbol search path is critical to correct behavior of the debugger - if it doesn't point to a valid location from which it can retrieve OS symbols, the debugger won't provide useful information (ok, there are exceptions, but they are very rare). This is the number one problem we see when people first start using WinDBG - not having the symbols set up properly. It's so bad that the debugger contains special commands to "fix" your symbol path (".symfix") or to provide copious information about the actual symbols being loaded ("!itoldyouso" - so named because the debugger team was tired of the usual exchange in which they'd explain that not having the right symbols did not constitute a bug in the debugger).

The executable search path is useful when the dump does not contain the actual code pages for the items being debugged. For example, if you debug a mini-dump none of the code pages are present, so the debugger loads the actual code pages from the original executable image.

Then we have the relevant version information:

Windows Server 2008 Kernel Version 6001 (Service Pack 1) MP (4 procs) Free x64
Product: Server, suite: TerminalServer SingleUserTS
Built by: 6001.18000.amd64fre.longhorn_rtm.080118-1840
Kernel base = 0xfffff800`01614000 PsLoadedModuleList = 0xfffff800`017d9db0

That first line tells us the major product version. It used to be that version numbers actually corresponded to meaningful information. Unfortunately, the marketing team at Microsoft has latched onto what was the build number and changed it into something devoid of useful information. Perhaps they wanted people to believe that there was one build of the code base between Vista and Server 2008 ("hey look, it builds! Let's ship it!"). I digress.

The rest of the line indicates this is running "Service Pack 1" (a surprise to me, since I didn't realize there was a Server 2008 service pack yet). All kernel builds since Vista shipped are multi-processor (which is what the "MP" means). This reports four processors (separate cores and hyperthreads are reported as "processors" here). The "Free" merely says that it was built without the special internal checks found in the "Checked" build, and doesn't mean Microsoft has turned into a charity. The "x64" indicates the binary was built (and is running) in the 64 bit extended mode of current AMD64 and EM64T processors.

The product moniker really tells us more about the features included (or more precisely "enabled") on this build: it's a server configuration, and it supports the "log on for administrative purposes" mode of terminal server (with some simple licensing sleight of hand this machine could be an actual Terminal Server, in fact).

The actual build information is still there lurking in the "Built by" line.

The final line describes where the kernel has been loaded (aren't those 64 bit addresses impressive?), and where the "loaded module list" is located. That latter value is critical for using the "lm" command to obtain a list of all the pieces loaded into memory.

Next we move along to some basic information about this crash dump:

Debug session time: Fri Jun 13 13:35:21.424 2008 (GMT-7)

This is the time the crash dump was written and not the time when we ran the kernel debugger.

Then we can see how long the system has been running:

System Uptime: 0 days 3:45:47.453

From this we can conclude that the system didn't just start up (it's been running for almost 4 hours!)

There are a few more interesting messages after this point:

BugCheck 1A, {41790, fffffa8000c005d0, ffff, 0}

Aha! This tells us why the system crashed. It was also the first thing that really "caught my eye." I truthfully don't remember ever seeing an 0x1A bugcheck before.

Before diving into the analysis, though, let's look at the last couple of messages since they look like they could cause problems (people routinely ask me about these lines):

Page b9af6 not present in the dump file. Type ".hh dbgerr004" for details
Page b9c9c not present in the dump file. Type ".hh dbgerr004" for details

Normally I ignore these messages because they don't appear to create any sort of issue. This time, though, I decided to look and see what (if anything) I could learn from the page table entries:

3: kd> !pte b9af6
VA 00000000000b9af6

PXE @ FFFFF6FB7DBED000 PPE at FFFFF6FB7DA00000 PDE at FFFFF6FB40000000 PTE at FFFFF680000005C8
contains 02900000A5263867 contains 05900000A5727867 contains 0000000000000000
pfn a5263 ---DA--UWEV pfn a5727 ---DA--UWEV

This tells us that while part of the page table necessary to access this entry is present (PXE and PPE) the page directory and page table are not present in the crash dump. My suspicion is that this is something that the debugger is trying to find in the user portion of the address space. Such data is never included in a crash dump of this type.
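The four addresses that "!pte" printed are not looked up anywhere; they follow arithmetically from the virtual address. Here is a minimal Python sketch of that calculation. The base constants are simply read off the output above (this address has index zero at every level but the last), so treat them as an assumption that holds for this particular build and not as universal values. The function name is mine.

```python
# Self-map bases, read off the !pte output for VA 0xb9af6.
# ASSUMPTION: these fixed bases apply to this build; they are not universal.
PXE_BASE = 0xFFFFF6FB7DBED000
PPE_BASE = 0xFFFFF6FB7DA00000
PDE_BASE = 0xFFFFF6FB40000000
PTE_BASE = 0xFFFFF68000000000

VA_MASK = (1 << 48) - 1  # only 48 bits of the virtual address are translated


def table_entry_addresses(va):
    """Return the (PXE, PPE, PDE, PTE) entry addresses for a virtual address.

    Each level consumes 9 more bits of the address, and every entry is
    8 bytes wide, hence the shift left by 3.
    """
    va &= VA_MASK
    return (
        PXE_BASE + ((va >> 39) << 3),
        PPE_BASE + ((va >> 30) << 3),
        PDE_BASE + ((va >> 21) << 3),
        PTE_BASE + ((va >> 12) << 3),
    )


pxe, ppe, pde, pte = table_entry_addresses(0xB9AF6)
print(f"PXE @ {pxe:016X}  PPE at {ppe:016X}  PDE at {pde:016X}  PTE at {pte:016X}")
```

For 0xb9af6 this reproduces the four addresses in the listing, including the PTE at FFFFF680000005C8.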

The last two messages merely tell us that indeed, the debugger cannot access the "Process Environment Block" (PEB):

PEB is paged out (Peb.Ldr = 000007ff`fffd6018). Type ".hh dbgerr001" for details
PEB is paged out (Peb.Ldr = 000007ff`fffd6018). Type ".hh dbgerr001" for details

This makes sense to me because the PEB is in the user portion of the address space. In a kernel summary dump, such information is not present. So neither of these appear to be anything to worry about.

Moving Forward

Generally at this point I'll use "!analyze -v". While it does not resolve all issues successfully, it often provides useful insight:

3: kd> !analyze -v

*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************

MEMORY_MANAGEMENT (1a)
# Any other values for parameter 1 must be individually examined.
Arguments:
Arg1: 0000000000041790, The subtype of the bugcheck.
Arg2: fffffa8000c005d0
Arg3: 000000000000ffff
Arg4: 0000000000000000

Debugging Details:
------------------
Page b9af6 not present in the dump file. Type ".hh dbgerr004" for details
Page b9c9c not present in the dump file. Type ".hh dbgerr004" for details
PEB is paged out (Peb.Ldr = 000007ff`fffd6018). Type ".hh dbgerr001" for details
PEB is paged out (Peb.Ldr = 000007ff`fffd6018). Type ".hh dbgerr001" for details

BUGCHECK_STR: 0x1a_41790

DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT

PROCESS_NAME: SLsvc.exe

CURRENT_IRQL: 0

LAST_CONTROL_TRANSFER: from fffff80001647181 to fffff80001669390

STACK_TEXT:

fffffa60`03b506f8 fffff800`01647181 : 00000000`0000001a 00000000`00041790 fffffa80`00c005d0 00000000`0000ffff : nt!KeBugCheckEx
fffffa60`03b50700 fffff800`0164d696 : fffff6fb`7dbed011 00000001`24310fff fffffa80`00000000 fffffa80`00000000 : nt! ?? ::FNODOBFM::`string'+0x1d093
fffffa60`03b50890 fffff800`0191184c : fffff880`0494f060 fffff800`018a1440 00000000`00000000 00000000`00000000 : nt!MmCleanProcessAddressSpace+0x6ae
fffffa60`03b508e0 fffff800`018a145d : fffffa60`000000ff fffffa80`0516d001 000007ff`fffde000 fffffa80`78457350 : nt!PspExitThread+0x4f8
fffffa60`03b509a0 fffff800`0169354b : fffffa60`03286a01 fffffa80`0516d010 00000000`00000103 fffff800`0161b783 : nt!PsExitSpecialApc+0x1d
fffffa60`03b509d0 fffff800`01697215 : fffffa60`03b50ca0 fffffa60`03b50a70 fffff800`018a2680 00000000`00000001 : nt!KiDeliverApc+0x3bb
fffffa60`03b50a70 fffff800`01668edd : fffff880`0494ca01 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiInitiateUserApc+0x75
fffffa60`03b50bb0 00000000`774c5ada : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : nt!KiSystemServiceExit+0xa2
00000000`0028f9a8 00000000`00000000 : 00000000`00000000 00000000`00000000 00000000`00000000 00000000`00000000 : 0x774c5ada

STACK_COMMAND: kb

FOLLOWUP_IP:
nt! ?? ::FNODOBFM::`string'+1d093
fffff800`01647181 cc int 3

SYMBOL_STACK_INDEX: 1
SYMBOL_NAME: nt! ?? ::FNODOBFM::`string'+1d093

FOLLOWUP_NAME: MachineOwner

MODULE_NAME: nt

IMAGE_NAME: ntkrnlmp.exe

DEBUG_FLR_IMAGE_TIMESTAMP: 479192b7

FAILURE_BUCKET_ID: X64_0x1a_41790_nt!_??_::FNODOBFM::_string_+1d093

BUCKET_ID: X64_0x1a_41790_nt!_??_::FNODOBFM::_string_+1d093

Followup: MachineOwner

---------

This particular analysis demonstrates a habit the debugger has shown in recent versions of displaying odd names for out-of-line code fragments. The name "::FNODOBFM::`string'+1d093" certainly does not sound like anything I would normally expect to see called from the Memory Manager. My experience has been that this happens for code that has been optimized by Microsoft - error conditions and code that is not executed frequently are moved "out of line" so that the code that is executed frequently stays in-line. Fortunately, with a bit of digging we can easily find the real function name.

Judicious use of the disassembly window (in WinDBG) along with the ".frame" command makes it easy to identify the actual routine that was called. Displaying this using the command line "u" command we see:

3: kd> u fffff800`0164d691
nt!MmCleanProcessAddressSpace+0x6a9:
fffff800`0164d691 e89afc0400 call nt!MiDeleteVirtualAddresses (fffff800`0169d330)
fffff800`0164d696 488bc3 mov rax,rbx
fffff800`0164d699 f0480fc18648030000 lock xadd qword ptr [rsi+348h],rax
fffff800`0164d6a2 a802 test al,2
fffff800`0164d6a4 0f85f8b9fcff jne nt! ?? ::FNODOBFM::`string'+0x20f76 (fffff800`016190a2)
fffff800`0164d6aa 80a720040000f7 and byte ptr [rdi+420h],0F7h
fffff800`0164d6b1 668387b601000001 add word ptr [rdi+1B6h],1
fffff800`0164d6b9 0f854dfbffff jne nt!MmCleanProcessAddressSpace+0x224 (fffff800`0164d20c)

Thus, the actual function in question that ultimately did the bug check was MiDeleteVirtualAddresses (even though the debugger doesn't show it on the stack backtrace).

By using the "uf" command on that function, I can find all the various blocks of code conveniently sorted by their address in memory. Since the function has been optimized, pieces of it have been scattered around memory. For the sake of brevity, I'll omit the complete listing of this function. The block of code shown by the uf command that invoked bugcheck in this case is:

nt! ?? ::FNODOBFM::`string'+0x1d078:
fffff800`01647166 440fb7c8 movzx r9d,ax
fffff800`0164716a 33c0 xor eax,eax
fffff800`0164716c 4d8bc7 mov r8,r15
fffff800`0164716f 8d481a lea ecx,[rax+1Ah]
fffff800`01647172 ba90170400 mov edx,41790h
fffff800`01647177 4889442420 mov qword ptr [rsp+20h],rax
fffff800`0164717c e80f220200 call nt!KeBugCheckEx (fffff800`01669390)
fffff800`01647181 cc int 3
fffff800`01647182 90 nop
fffff800`01647183 90 nop
fffff800`01647184 90 nop
fffff800`01647185 90 nop
fffff800`01647186 90 nop
fffff800`01647187 90 nop
fffff800`01647188 fff3 push rbx
fffff800`0164718a 4883ec20 sub rsp,20h
fffff800`0164718e 440f20c3 mov rbx,cr8
fffff800`01647192 b80c000000 mov eax,0Ch
fffff800`01647197 440f22c0 mov cr8,rax
fffff800`0164719b 65488b0c2528000000 mov rcx,qword ptr gs:[28h]
fffff800`016471a4 488bd1 mov rdx,rcx
fffff800`016471a7 488b4108 mov rax,qword ptr [rcx+8]
fffff800`016471ab 488710 xchg rdx,qword ptr [rax]
fffff800`016471ae 4885d2 test rdx,rdx
fffff800`016471b1 750b jne nt!KiAcquireDispatcherLockRaiseToSynch+0x36 (fffff800`016471be)

From this we can track back through the flow of control in the function by searching for a branch that targets this block. We find:

nt!MiDeleteVirtualAddresses+0x3ef:
fffff800`0169d71f 664183471cff add word ptr [r15+1Ch],0FFFFh
fffff800`0169d725 410fb7471c movzx eax,word ptr [r15+1Ch]
fffff800`0169d72a 663d0002 cmp ax,200h
fffff800`0169d72e 0f83329afaff jae nt! ?? ::FNODOBFM::`string'+0x1d078 (fffff800`01647166)

The first instruction appears to take a field of a data structure (the 1Ch being the offset and r15 being the address of that structure) and subtracts one from it (an "add" of 0xFFFF to a 16 bit value being the same as a subtract of one, using two's complement arithmetic). We then look at the resulting value and if it is greater than or equal to 0x200 (512 decimal), compared as an unsigned value, we jump to the bug check code.
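If the wraparound is not obvious, a few lines of Python make it concrete. This is only a sketch of the arithmetic performed by the three instructions, not of anything else the Memory Manager does; the function names are mine.

```python
def add16(value, addend):
    """16-bit add with wraparound, as 'add word ptr [...],imm16' behaves."""
    return (value + addend) & 0xFFFF


def trips_bugcheck(count_before):
    """Mirror the decrement followed by 'cmp ax,200h' and 'jae'.

    Returns the new count and whether the branch to the bug check is taken.
    """
    after = add16(count_before, 0xFFFF)  # same as subtracting one, modulo 2**16
    return after, after >= 0x200         # jae is an unsigned >= comparison


for before in (0x200, 1, 0):
    after, bugcheck = trips_bugcheck(before)
    print(f"before={before:#x} after={after:#x} bugcheck={bugcheck}")
```

The interesting case is the last one: decrementing a count that is already zero wraps around to 0xFFFF, which is far above 0x200 when treated as unsigned, so the branch is taken.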

So now let's look at that bugcheck and see what we can determine from it. One reason this case piqued my curiosity was that "!analyze" itself seemed a bit lost with respect to this particular situation. Here again is the relevant part of the "!analyze -v" output we saw earlier:

MEMORY_MANAGEMENT (1a)
# Any other values for parameter 1 must be individually examined.
Arguments:
Arg1: 0000000000041790, The subtype of the bugcheck.
Arg2: fffffa8000c005d0
Arg3: 000000000000ffff
Arg4: 0000000000000000

In this specific case, the debugger does not provide much useful information about this specific bug check. Similarly, the debugger documentation does not provide much useful information either. Of course, what would have helped in this case is to know what those parameters represented. As is often the case with "unusual" bug check codes, the specific meaning of the parameters is not documented. While it is often tempting at a point like this to "give up", you can actually dig further into this and figure out what these values might be. I use a couple different commands to do this including "!pool" and "ln" to see if I can gain some context for the operations. In this case none of these tools provided any useful information (!pool gives an error and ln simply returns no information).

This left me scratching my head as to what this thing might actually be. I decided that it would be instructive to actually figure out where it originated - either that or the meaning of the mysterious r15 register. Looking more carefully at the bugcheck code again I can decipher the source of these four arguments. You may recall that on the x64 platform there is a standard calling convention so that RCX is the first parameter (the bug check code):

fffff800`0164716f 8d481a lea ecx,[rax+1Ah]

Personally, I've always found explaining the LEA instruction ("load effective address") to be a bit confusing. The simplest paradigm that seems to work is to consider this an "add without side effects." In other words, none of the RFLAGS register bits are set as a result of this addition. A few lines earlier we zeroed out the RAX register (on x64, writing to the lower 32 bits of a register also clears the upper 32 bits):

fffff800`0164716a 33c0 xor eax,eax

So the LEA is effectively loading 0x1A into the ECX register. While initially confusing, this makes sense (and matches the behavior we see) once we've unraveled it.

The second function parameter is always passed in the RDX register. This corresponds to that first argument we look at in the bug check code:

fffff800`01647172 ba90170400 mov edx,41790h

And indeed that's the value we see for that first bug check parameter.

Similarly, the second bugcheck parameter (third function parameter) is in the R8 register:

fffff800`0164716c 4d8bc7 mov r8,r15

And then the third bugcheck parameter (fourth function parameter) is in the R9 register:

fffff800`01647166 440fb7c8 movzx r9d,ax

Of course, this raises the question of what was in the AX register at this point (we zero it right afterwards, but at this stage it still holds whatever was set before we jumped to this block of code). Recall that when we found the code that "jumped" to this bugcheck block, we also saw it set the AX register:

fffff800`0169d725 410fb7471c movzx eax,word ptr [r15+1Ch]

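Ok, it really set the EAX register, but it did a 16 bit zero extended move in doing so (thus the upper 16 bits are always zero after this operation). Interesting, because that R15 is what we see as the second bugcheck parameter. (The fourth bugcheck parameter is the fifth function parameter, which is passed on the stack: it is the zero that "mov qword ptr [rsp+20h],rax" stores just before the call.)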
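To tie the pieces together, here is a small Python sketch that replays the relevant instructions by hand and produces the same values the debugger reported in the "BugCheck" line. It is a hand-written model of just these instructions, not an emulator, and the function and variable names are mine.

```python
MASK64 = (1 << 64) - 1


def bugcheck_arguments(r15, used_count):
    """Replay the out-of-line block; returns KeBugCheckEx's five parameters."""
    eax = used_count & 0xFFFF        # movzx eax,word ptr [r15+1Ch] (in the caller)
    r9 = eax & 0xFFFF                # movzx r9d,ax
    rax = 0                          # xor eax,eax (clears all 64 bits of RAX)
    r8 = r15 & MASK64                # mov r8,r15
    rcx = (rax + 0x1A) & 0xFFFFFFFF  # lea ecx,[rax+1Ah]
    rdx = 0x41790                    # mov edx,41790h
    stack_arg = rax                  # mov qword ptr [rsp+20h],rax
    return rcx, rdx, r8, r9, stack_arg


code, *params = bugcheck_arguments(0xFFFFFA8000C005D0, 0xFFFF)
print(f"BugCheck {code:X}, {{{', '.join(f'{p:x}' for p in params)}}}")
```

Feeding in the R15 value and the 0xFFFF count from this dump gives bug check 1A with parameters 41790, fffffa8000c005d0, ffff and 0, matching the debugger's summary line.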

At this point you might seriously consider giving up, since we really haven't figured out much about why this system died - but we are surprisingly close to at least having some useful information.

So I stepped back from the crash at this stage and asked myself, "So what happens when we delete an address space?" The obvious (to me at least) answer was, "We delete all the PTEs and drop references on the physical pages!" We were looking at offset 0x1C in whatever this structure was - which suggests to me that it isn't a PTE (since those are 64 bits long) and thus I wondered, "Could this be a PFN entry?" The debugger provides us with a nice little extension command for looking at this. Surprisingly, this yielded some interesting information:

3: kd> !pfn fffffa80`00c005d0
PFN 0004001F at address FFFFFA8000C005D0
flink 00000032 blink / share count 00000001 pteaddress FFFFF6FB40004908
reference count 0001 used entry count FFFF Cached color 0 Priority 5
restore pte 00000080 containing page 040B1E Active M
Modified

The actual fields are also available from the "dt" command:

3: kd> dt nt!_MMPFN fffffa80`00c005d0
+0x000 u1 : <unnamed-tag>
+0x008 u2 : <unnamed-tag>
+0x010 PteAddress : 0xfffff6fb`40004908 _MMPTE
+0x010 VolatilePteAddress : 0xfffff6fb`40004908
+0x018 u3 : <unnamed-tag>
+0x01c UsedPageTableEntries : 0xffff
+0x01e VaType : 0 ''
+0x01f ViewCount : 0 ''
+0x020 OriginalPte : _MMPTE
+0x020 AweReferenceCount : 128
+0x028 u4 : <unnamed-tag>
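The address in R15 and the PFN reported by "!pfn" are related by simple array arithmetic, which is a useful sanity check that this really is an entry in the PFN database. The Python sketch below makes two assumptions that are not stated anywhere in the dump output: that the PFN database starts at FFFFFA8000000000 on this build, and that each entry is 0x30 bytes (the last field shown is at +0x028, and I am assuming it is 8 bytes wide). The names are mine.

```python
PFN_DATABASE = 0xFFFFFA8000000000  # ASSUMPTION: base of the PFN database on this build
MMPFN_SIZE = 0x30                  # ASSUMPTION: u4 at +0x028 is the last, 8 byte field


def pfn_from_entry_address(address):
    """Convert the address of a PFN database entry into its page frame number."""
    offset = address - PFN_DATABASE
    if offset < 0 or offset % MMPFN_SIZE != 0:
        raise ValueError("address does not fall on a PFN entry boundary")
    return offset // MMPFN_SIZE


def entry_address_from_pfn(pfn):
    """Inverse mapping: the address of the database entry for a page frame."""
    return PFN_DATABASE + pfn * MMPFN_SIZE


print(f"PFN {pfn_from_entry_address(0xFFFFFA8000C005D0):08X}")
```

With those assumptions the second bugcheck parameter maps to PFN 0004001F, the same value "!pfn" printed, and the address lands exactly on an entry boundary. An arbitrary pointer would be unlikely to do both.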

And following that back to the controlling page table entry:

3: kd> !pte 0xfffff6fb`40004908
VA fffff68000921000

PXE @ FFFFF6FB7DBEDF68 PPE at FFFFF6FB7DBED000 PDE at FFFFF6FB7DA00020 PTE at FFFFF6FB40004908
contains 00000000A5205863 contains 02900000A5263867 contains 3DA0000040B1E867 contains 3DB000004001F867
pfn a5205 ---DA--KWEV pfn a5263 ---DA--UWEV pfn 40b1e ---DA--UWEV pfn 4001f ---DA--UWEV

Notice that the PFN in the final entry (4001f) corresponds to the PFN where we started - so indeed, the structure that R15 points to looks like a PFN entry.
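For readers who want to check the "contains" values by hand, here is a minimal Python sketch that pulls the page frame number and a few flag bits out of a raw entry. It uses only the architectural x64 bit positions (bit 0 valid, bit 1 writable, bit 2 user, bit 5 accessed, bit 6 dirty, bits 12 through 51 page frame, bit 63 no-execute) and deliberately ignores the remaining bits, which Windows uses for its own bookkeeping. The function name is mine.

```python
PFN_FIELD_MASK = (1 << 40) - 1  # architectural page frame field, bits 12..51


def decode_pte(value):
    """Decode the architectural fields of a 64 bit x64 page table entry."""
    return {
        "valid": bool(value & 0x01),
        "writable": bool(value & 0x02),
        "user": bool(value & 0x04),
        "accessed": bool(value & 0x20),
        "dirty": bool(value & 0x40),
        "no_execute": bool(value >> 63),
        "pfn": (value >> 12) & PFN_FIELD_MASK,
    }


for raw in (0x00000000A5205863, 0x3DB000004001F867):
    fields = decode_pte(raw)
    mode = "U" if fields["user"] else "K"
    print(f"{raw:016X}: pfn {fields['pfn']:x} mode {mode}")
```

The first value decodes as a kernel-mode entry for pfn a5205 and the last as a user-mode entry for pfn 4001f, in line with the K and U flags the debugger displays.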

And the Answer Is??

From here, it becomes fairly obvious why the memory manager is not happy in this case - the "used entry count" field has underflowed. Effectively, it looks like something in the system has broken the reference counting model here - note that the actual reference count field is still non-zero, as is the share count (so someone is still using these).

This certainly does not provide me with a "smoking gun," but unfortunately it is often the case that when we analyze a crash, we don't get a definitive answer. This is especially true with reference count issues - even if you do have the source code, it is often impossible to figure out why the reference counts are not being properly maintained because the actual "bug" has already transpired and we're merely detecting the error condition much later in the process.

My "best guess" in this situation is that it is probably some issue with the (at the time) pre-release Hyper-V support. At this point I decided to stop investigating further - I don't have any of my own code running on this system and while it left me scratching my head, I concluded that it doesn't suggest any issues with the hardware either - sure looks like a garden variety reference counting bug inside the OS.

 

This article was printed from OSR Online http://www.osronline.com

Copyright 2017 OSR Open Systems Resources, Inc.