  Message 1 of 22  
01 Jun 18 14:02
Burrr
xxxxxx@outlook.com
Join Date: 01 Jun 2018
Posts To This List: 9
System Lockup - NMI Crash Dump

Hi,

I am debugging a rather nasty issue that locks up the system (no response from the keyboard or mouse) and happens on some systems but not others. This is a Windows 10 system with several of our drivers, both PCIe and USB. I've tried several experiments to isolate the issue without much success.

One of the methods we landed on to narrow down the issue is an NMI jumper on the motherboard, which can force a crash dump:
https://blogs.technet.microsoft.com/askperf/2009/01/23/two-minute-drill-nmi/

We can create a crash dump by asserting the NMI while the system is running normally; however, I can't seem to create a crash dump once the system has locked up. I also tried forcing one of my drivers to lock up by creating an artificial condition, and I was able to create a crash dump using the NMI in that scenario.

My questions are:

1. What would cause a lockup where the NMI does not respond?
2. Would a driver be able to cause a lockup that would block the NMI from responding to the OS?

Any insights would be appreciated.

Thanks,
Burrr
  Message 2 of 22  
02 Jun 18 02:27
Jan Bottorff
xxxxxx@pmatrix.com
Join Date: 16 Apr 2013
Posts To This List: 434
System Lockup - NMI Crash Dump

I'd suggest reading the thread at
http://www.osronline.com/showThread.CFM?link=288112
It has some new and really interesting strategies to debug hard lockups.

Jan
  Message 3 of 22  
03 Jun 18 06:55
Pavel A
xxxxxx@fastmail.fm
Join Date: 21 Jul 2008
Posts To This List: 2422
System Lockup - NMI Crash Dump

> I'd suggest reading the thread at
> http://www.osronline.com/showThread.CFM?link=288112 It has some new and really
> interesting strategies to debug hard lockups.

Only the "IPC" there was probably a typo; it should be IPI (inter-processor interrupt)?

-- pa
  Message 4 of 22  
03 Jun 18 11:30
anton bassov
xxxxxx@hotmail.com
Join Date: 16 Jul 2006
Posts To This List: 4487
System Lockup - NMI Crash Dump

> Only the IPC thing there probably was a typo, it should be IPI (inter-processor interrupt)

Of course it was very obviously a typo, but I was literally floored by the OP's reaction to Mark's statement, particularly by the part concerning "debug stub sending NMI to all other processors". Look what he said...

<quote>
Mark, I don't understand your message. What is an IPC? Inter process communication? Are you explaining that if any one processor doesn't respond to the debug stub, then the stub cannot break in? And why would NMI be any different? Wouldn't the debug stub send an NMI to all the other processors to stop them?
</quote>

Anton Bassov
  Message 5 of 22  
04 Jun 18 07:55
Burrr
xxxxxx@outlook.com
Join Date: 01 Jun 2018
Posts To This List: 9
System Lockup - NMI Crash Dump

Thanks much for the information.

On my system I am already able to generate an NMI when the system is working normally. However, when the system locks up, it does not work.

I have attached a PCIe analyzer to see if there is anything weird going on, but did not find anything useful. My PCIe driver uses 7 MSI interrupts and DMA transactions. There are also several USB drivers that were used on prior systems and redeployed here.

I am floored as to why the OS would not respond to the NMI. What can cause such an event?

Burrr

On 6/2/2018 2:27 AM, Jan Bottorff wrote:
> I'd suggest reading the thread at http://www.osronline.com/showThread.CFM?link=288112 It has some new and really interesting strategies to debug hard lockups.
>
> Jan
  Message 6 of 22  
04 Jun 18 08:10
Burrr
xxxxxx@outlook.com
Join Date: 01 Jun 2018
Posts To This List: 9
System Lockup - NMI Crash Dump

From the thread, I gather the possible causes of a freeze where the NMI doesn't respond are:

Bus Freeze
Rogue DMA request
Interrupt Storm

Is that right?

Burrr

On 6/2/2018 2:27 AM, Jan Bottorff wrote:
> I'd suggest reading the thread at http://www.osronline.com/showThread.CFM?link=288112 It has some new and really interesting strategies to debug hard lockups.
>
> Jan
  Message 7 of 22  
04 Jun 18 08:24
Vijayabhaskarreddy CH
xxxxxx@gmail.com
Join Date: 03 Jan 2017
Posts To This List: 2
System Lockup - NMI Crash Dump

On Mon, Jun 4, 2018, 5:27 PM xxxxx@outlook.com <xxxxx@lists.osr.com> wrote:

> Thanks much for the information.
>
> On my system I am already able to generate a NMI when the system is
> working normally. However when the system locks up, it does not work.
>
> I have attached a PCIe analyzer to see if there are any weird things
> going on, but did not find anything useful.
> There are 7 MSI interrupts and DMA transactions that are being used on
> my PCIe driver. There are several other USB drivers that are used on
> prior systems and redeployed here.
<...excess quoted lines suppressed...>
  Message 8 of 22  
04 Jun 18 11:05
Mark Roddy
xxxxxx@gmail.com
Join Date: 25 Feb 2000
Posts To This List: 4090
System Lockup - NMI Crash Dump

Rogue DMA and Bus Freeze are likely going to resolve to "Bus Freeze". If your resources include access to a PCI(e) bus analyzer, that is the best path forward in my opinion, although plain old debug console logging can also be fruitful and is way less expensive.

Mark Roddy

On Mon, Jun 4, 2018 at 8:11 AM, xxxxx@outlook.com <xxxxx@lists.osr.com> wrote:

> From the thread, I gather possible choices for a freeze where the NMI
> doesn't respond, are:
>
> Bus Freeze
> Rogue DMA request
> Interrupt Storm
>
> Is that right?
>
> Burrr
<...excess quoted lines suppressed...>
  Message 9 of 22  
04 Jun 18 11:18
Bob Ammerman
xxxxxx@ramsystems.biz
Join Date: 05 Jun 2016
Posts To This List: 56
System Lockup - NMI Crash Dump

Or the OS is so corrupted by an overwrite that it can't handle the NMI.

Bob Ammerman

-----Original Message-----
From: xxxxx@lists.osr.com <xxxxx@lists.osr.com> On Behalf Of xxxxx@outlook.com
Sent: Monday, June 4, 2018 8:11 AM
To: Windows System Software Devs Interest List <xxxxx@lists.osr.com>
Subject: Re:[ntdev] System Lockup - NMI Crash Dump

From the thread, I gather possible choices for a freeze where the NMI doesn't respond, are:

Bus Freeze
Rogue DMA request
Interrupt Storm

Is that right?

Burrr
  Message 10 of 22  
04 Jun 18 12:14
Peter Viscarola
xxxxxx@osr.com
Join Date:
Posts To This List: 6183
List Moderator
System Lockup - NMI Crash Dump

+1 for what Mr. Roddy said, above. It is *exactly* what I was going to post.

Nothing substitutes for a bus analyzer, which often will let you root-cause a complex problem like this in minutes (or, you know, at least kick it back to the FPGA guys)... but it is amazing how very much you can discern using DbgPrint.

Another avenue, if you can work WITH your FPGA people, is to have them help with ChipScope or SignalTap... A good FPGA guy can get almost as much out of this as a proper bus analyzer.

Peter
OSR
@OSRDrivers
  Message 11 of 22  
04 Jun 18 12:26
Burrr
xxxxxx@outlook.com
Join Date: 01 Jun 2018
Posts To This List: 9
System Lockup - NMI Crash Dump

We've captured a few PCIe analyzer traces, but none of them point to anything specific or show bus-level errors. We've also captured lots of traces from the FPGA with ChipScope, with nothing specific or apparent that points to the issue.

Burrr

On 6/4/2018 12:13 PM, xxxxx@osr.com wrote:
> +1 for what Mr. Roddy said, above. It is *exactly* what I was going to post.
>
> Nothing substitutes for a bus analyzer, which often will let you root-cause a complex problem like this in minutes (or, you know, at least kick it back to the FPGA guys)... but it is amazing how very much you can discern using DbgPrint.
>
> Another avenue, if you can work WITH your FPGA people, is to have them help with ChipScope or SignalTap... A good FPGA guy can get almost as much out of this as a proper bus analyzer.
>
> Peter
> OSR
> @OSRDrivers
<...excess quoted lines suppressed...>
  Message 12 of 22  
05 Jun 18 09:26
Scott Noone
xxxxxx@osr.com
Join Date:
Posts To This List: 1377
List Moderator
System Lockup - NMI Crash Dump

Does the problem reproduce pretty quickly/easily? Also, do you have a kernel debugger attached?

If you add enough DbgPrints you should be able to figure out the last things your driver(s) did before the hang. Pay particular attention to the DMA transfers that you performed prior to the hang (particularly offsets, lengths, and physical addresses). I've definitely debugged problems like this that way; usually it ends up being a particular set of arguments that triggers an edge condition in my code or the hardware.

-scott
OSR
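As a rough illustration of the kind of logging described above, the sketch below formats one trace line per DMA transfer with its offset, length, and physical address. Everything in it is hypothetical: the `DMA_XFER_INFO` structure, its field names, and `FormatDmaTrace` are invented for this example and do not come from the thread. It is plain user-mode C so that it compiles anywhere; in a real driver the same values would more likely be passed straight to `DbgPrint` or `DbgPrintEx` rather than formatted into a local buffer.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-transfer record; the field names are assumptions. */
typedef struct DMA_XFER_INFO {
    uint32_t sequence;      /* running transfer counter */
    uint64_t physical_addr; /* bus/physical address handed to the device */
    uint32_t offset;        /* byte offset into the driver's buffer */
    uint32_t length;        /* byte count of the transfer */
} DMA_XFER_INFO;

/*
 * Format one trace line into `out`. Returns the number of characters
 * written (excluding the terminator), or -1 if the buffer is too small
 * or formatting failed.
 */
static int FormatDmaTrace(char *out, size_t out_size, const DMA_XFER_INFO *x)
{
    int n = snprintf(out, out_size,
                     "DMA #%" PRIu32 " pa=0x%016" PRIx64
                     " off=0x%08" PRIx32 " len=0x%08" PRIx32,
                     x->sequence, x->physical_addr, x->offset, x->length);
    if (n < 0 || (size_t)n >= out_size) {
        return -1;
    }
    return n;
}
```

Logging every field in fixed-width hex keeps the lines easy to compare, so the last few entries before a hang can be checked for an unusual offset, length, or address.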
  Message 13 of 22  
05 Jun 18 11:23
Peter Viscarola
xxxxxx@osr.com
Join Date:
Posts To This List: 6183
List Moderator
System Lockup - NMI Crash Dump

And to add to what Mr. Noone said... If this problem is DMA related, enable DMA Verification in Driver Verifier, as well.

Peter
OSR
@OSRDrivers
  Message 14 of 22  
05 Jun 18 13:13
Burrr
xxxxxx@outlook.com
Join Date: 01 Jun 2018
Posts To This List: 9
System Lockup - NMI Crash Dump

The problem is hard to reproduce and takes anywhere from 2 hrs to 18 hrs to occur.

I don't have a kernel debugger attached. The reason is that if I do anything to slow down the operation of the system, the problem takes several days to occur.

Also, I've noticed that the problem sometimes occurs when no DMA operation is going on.

Burrr

On 6/5/2018 9:26 AM, xxxxx@osr.com wrote:
> Does the problem reproduce pretty quickly/easily? Also, do you have a kernel debugger attached?
>
> If you add enough DbgPrints you should be able to figure out the last things your driver(s) did before the hang. Pay particular attention to the DMA transfers that you performed prior to the hang (particularly offsets, lengths, and physical addresses). I've definitely debugged problems like this that way, usually it ends up being a particular set of arguments that triggers an edge condition in my code or the hardware.
>
> -scott
> OSR
  Message 15 of 22  
05 Jun 18 14:23
Peter Viscarola
xxxxxx@osr.com
Join Date:
Posts To This List: 6183
List Moderator
System Lockup - NMI Crash Dump

> The problem is hard to create and takes anywhere from 2 hrs to 18 hrs

Ugh. My condolences.

> I don't have a kernel debugger attached

Regardless, I would recommend you test your driver with Driver Verifier DMA verification enabled. IF you have a problem with the DMA APIs, this will usually catch it quickly.

> Also I've noticed that the problem occurs sometimes
> when no DMA operation is going on.

Yes, but that doesn't rule out the DMA. As Mr. Ammerman suggested, this could be a corrupted DMA operation resulting in an overwrite.

Peter
OSR
@OSRDrivers
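One driver-side guard against the overwrite scenario mentioned above is to check each transfer's offset and length against the size of the buffer it is supposed to land in before programming the device. The sketch below is only an illustration under assumptions: `DmaRangeIsValid` is a hypothetical helper, not a WDK API and not something proposed in the thread, and rejecting zero-length transfers is a design choice made for this example. It is a separate check from whatever Driver Verifier does.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true when [offset, offset + length) lies entirely inside a
 * buffer of `buffer_size` bytes. The comparison is arranged so that the
 * arithmetic cannot wrap around.
 */
static bool DmaRangeIsValid(uint64_t buffer_size, uint64_t offset, uint64_t length)
{
    if (length == 0) {
        return false;   /* assumption: zero-length transfers are treated as suspect */
    }
    if (offset > buffer_size) {
        return false;   /* start is already past the end of the buffer */
    }
    /* Safe subtraction: offset <= buffer_size was established above. */
    return length <= buffer_size - offset;
}
```

A driver could log and refuse any request that fails such a check, which turns a silent overwrite into a visible event.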
  Message 16 of 22  
05 Jun 18 17:56
Mark Roddy
xxxxxx@gmail.com
Join Date: 25 Feb 2000
Posts To This List: 4090
System Lockup - NMI Crash Dump

And debug console logging, even if it slows down reproduction to have the debugger attached, would at least give you clues about what your driver was doing around the time of the failure. I'd dedicate a test system just to running with the debugger attached and your driver logging its operations. Meanwhile pursue other paths.

Mark Roddy

On Tue, Jun 5, 2018 at 2:23 PM xxxxx@osr.com <xxxxx@lists.osr.com> wrote:

> >The problem is hard to create and takes anywhere from 2 hrs to 18 hrs
>
> Ugh. My condolences.
>
> >I don't have a kernel debugger attached
>
> Regardless, I would recommend you test your driver with Driver Verifier
> DMA verification enabled. IF you have a problem with the DMA APIs, this
> will usually catch it quickly.
>
<...excess quoted lines suppressed...>
  Message 17 of 22  
06 Jun 18 10:57
Burrr
xxxxxx@outlook.com
Join Date: 01 Jun 2018
Posts To This List: 9
System Lockup - NMI Crash Dump

Before I saw this message I had started a test with the debugger attached to a system, to see whether I would be able to break into the debugger when the lockup occurred. A lockup did occur, but I was unable to break into the debugger.

I have restarted the test with some minimal logging from my main driver and with the debugger attached.

Burrr

On 6/5/2018 5:57 PM, xxxxx@gmail.com wrote:
> And debug console logging, even if it slows down reproduction to have the debugger attached, would at least give you clues about what your driver was doing around the time of the failure. I'd dedicate a test system just to running with the debugger attached and your driver logging its operations. Meanwhile pursue other paths.
>
> Mark Roddy
  Message 18 of 22  
07 Jun 18 14:51
Burrr
xxxxxx@outlook.com
Join Date: 01 Jun 2018
Posts To This List: 9
System Lockup - NMI Crash Dump

I've tried all these, but it did not yield any red flags.

Any other ideas?

Burrr

On 6/5/2018 5:57 PM, xxxxx@gmail.com wrote:
> And debug console logging, even if it slows down reproduction to have the debugger attached, would at least give you clues about what your driver was doing around the time of the failure. I'd dedicate a test system just to running with the debugger attached and your driver logging its operations. Meanwhile pursue other paths.
>
> Mark Roddy
  Message 19 of 22  
07 Jun 18 22:49
anton bassov
xxxxxx@hotmail.com
Join Date: 16 Jul 2006
Posts To This List: 4487
System Lockup - NMI Crash Dump

> Would a driver be able to cause a lockup that would block the NMI from responding to the OS?

Of course. For example, consider what happens if it somehow corrupts the memory region occupied by the IDTs: in that case the NMI, just like any other interrupt, is out of luck completely.

IIRC, every CPU has its own IDT under Windows, but these IDTs must, apparently, be located in the same memory region. For example, all theoretically possible IDTs (256 possible IDTs * 256 IDT entries * 16 bytes per entry on a 64-bit system) would occupy only 1 MB, i.e. they would fit in the same large page. Therefore, a single relatively large write to the target area is going to screw up all of them in one go, and at that point you get exactly the scenario that you are describing: a sudden freeze of the system that cannot be resolved even by an NMI.

A driver can do this either directly from the CPU, if the target area is not marked read-only in its PTE, or indirectly through a wrong DMA operation.

In general, I would suggest taking the Occam's razor approach and investigating the most likely reasons and simplest theories before proceeding to more complex ones. In this particular case I would start with the theory of IDT corruption (first direct, then indirect) before proceeding to more complex scenarios (like a hardware-caused lockup which is, in turn, caused by a driver incorrectly programming its device).

Anton Bassov
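The size estimate in the post above can be checked with a few lines of arithmetic. The sketch below only reproduces the post's numbers (256 tables, 256 entries each, 16 bytes per 64-bit gate descriptor) and compares the total with a 2 MiB x64 large page. The constant and function names are made up for this example, and whether Windows really places all per-CPU IDTs contiguously in one region is the post's assumption, not something this code establishes.

```c
#include <stdint.h>

#define MAX_IDTS         256u                  /* upper bound on tables used in the post */
#define IDT_ENTRIES      256u                  /* interrupt vectors per IDT */
#define IDT_ENTRY_BYTES  16u                   /* size of one 64-bit gate descriptor */
#define LARGE_PAGE_BYTES (2u * 1024u * 1024u)  /* 2 MiB large page on x64 */

/* Total bytes occupied if every possible IDT were laid out back to back. */
static uint32_t IdtRegionBytes(void)
{
    return MAX_IDTS * IDT_ENTRIES * IDT_ENTRY_BYTES;
}
```

The product is 2^8 * 2^8 * 2^4 = 2^20 bytes, i.e. 1 MiB, which is half of a 2 MiB large page.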
  Message 20 of 22  
08 Jun 18 11:40
Burrr
xxxxxx@outlook.com
Join Date: 01 Jun 2018
Posts To This List: 9
System Lockup - NMI Crash Dump

Thanks for the explanation.

On 6/7/2018 10:48 PM, xxxxx@hotmail.com wrote:
>
>> Would a driver be able to cause a lockup that would block the NMI from responding to the OS?
>
> Of course. For example, consider what happens if it somehow corrupts the memory region that is occupied by IDTs - in such case NMI, just like any other interrupt, seems to be out of luck completely.
>
> IIRC, every CPU has its own IDT under Windows, but still these IDTs must be, apparently, located in the same memory region. For example, all theoretically possible IDTs ( 256 possible IDTs * 256 IDT entries * 16 bytes per entry on a 64-bit system) would occupy only 1M, i.e. fit in the same large page. Therefore, a single relatively large write to the target area is going to screw up all of them in one go, and, at this point, you are going to get exactly the scenario that you are describing, i.e. a sudden freeze of the system that cannot get resolved even by NMI.
>
> A driver can do it either directly by the CPU if the target area is not marked as a read-only one
<...excess quoted lines suppressed...>
  Message 22 of 22  
08 Jun 18 15:36
Mark Roddy
xxxxxx@gmail.com
Join Date: 25 Feb 2000
Posts To This List: 4090
System Lockup - NMI Crash Dump

More logging to console.

Mark Roddy

On Thu, Jun 7, 2018 at 2:50 PM xxxxx@outlook.com <xxxxx@lists.osr.com> wrote:

> I've tried all these, but it did not yield any red flags.
>
> Any other ideas?
>
> Burrr
>
> On 6/5/2018 5:57 PM, xxxxx@gmail.com wrote:
>
> And debug console logging, even if it slows down reproduction to have the
> debugger attached, would at least give you clues about what your driver was
<...excess quoted lines suppressed...>


Copyright ©2015, OSR Open Systems Resources, Inc.
Based on vBulletin Copyright ©2000 - 2005, Jelsoft Enterprises Ltd.
Modified under license