  Message 1 of 10  
10 Nov 08 17:26
ntdev member 40949
xxxxxx@scalent.com
Join Date:
Posts To This List: 7
NDISPROT 5x cancel code out of date?

We have a driver that's based on NDISPROT. We originally got the sample code from 3790.1830. However, recently we've discovered that we have a race condition in NdisProtCancelWrite. I have written some sample code that can get my driver to blue screen just about whenever I want.

Looking around I found the following discussion (and I can't seem to reply to it): http://www.tech-archive.net/Archive/Development/microsoft.public.development.device.drivers/2005-01/0979.html

I know that it is referring to reads, and I'm referring to write, and he is referring to deadlocks, while I experience a blue screen... but I think it's the same issue with cancellation synchronization. In it Stephan Wolf says that the code in question is up-to-date, and that has been resolved. However, if I understand it correctly, 3790.1830 NDISPROT, and 6001.18001 NDISPROT's 5x\sys directory are both out-of-date even in the read path. But the 60\sys directory appears to have correctly implemented cancel synchronization (maybe with the exception of a bug? http://www.osronline.com/showthread.cfm?link=142676)

Has anyone else found this problem with the cancellation code? I'm about to attempt to integrate the changes in NDISPROT for 60 into my code and see if this fixes my race condition...
  Message 2 of 10  
11 Nov 08 15:24
ntdev member 40949
xxxxxx@scalent.com
Join Date:
Posts To This List: 7
NDISPROT 5x cancel code out of date?

In case anyone is interested, I back-ported cancel routine changes that were made in the NDIS 6.0 version of NDISPROT into the NDIS 5.x version; this seems to have solved my blue screen problems. Is anyone else using NDISPROT derived drivers in Win2k3 and earlier?
  Message 3 of 10  
11 Nov 08 15:42
ntdev member 38316
xxxxxx@open1x.org
Join Date:
Posts To This List: 3
NDISPROT 5x cancel code out of date?

I am using an NDISPROT derived driver on Windows XP for an open source project. I would be interested in your patch if you are willing/able to share.

xxxxx@scalent.com wrote:
> In case anyone is interested, I back-ported cancel routine changes that were made in the NDIS 6.0 version of NDISPROT into the NDIS 5.x version; this seems to have solved my blue screen problems.
>
> Is anyone else using NDISPROT derived drivers in Win2k3 and earlier?
  Message 4 of 10  
11 Nov 08 15:45
Thomas Divine
xxxxxx@pcausa.com
Join Date: 05 Aug 2010
Posts To This List: 635
NDISPROT 5x cancel code out of date?

I use NDISPROT as the base for some drivers on Win2K3 and earlier.

Shouldn't the reference counting logic ensure that the memory exists long enough? The memory should persist until the final NPROT_DEREF_SEND_PKT call. Perhaps there is a small reference counting bug here.

Thomas F. Divine

> -----Original Message-----
> From: xxxxx@lists.osr.com [mailto:bounce-343409-xxxxx@lists.osr.com] On Behalf Of xxxxx@scalent.com
> Sent: Tuesday, November 11, 2008 3:22 PM
> To: Windows System Software Devs Interest List
> Subject: RE:[ntdev] NDISPROT 5x cancel code out of date?
>
> In case anyone is interested, I back-ported cancel routine changes that were made in the NDIS 6.0 version of NDISPROT into the NDIS 5.x <...excess quoted lines suppressed...>
  Message 5 of 10  
11 Nov 08 21:07
ntdev member 40949
xxxxxx@scalent.com
Join Date:
Posts To This List: 7
NDISPROT 5x cancel code out of date?

In reply to Thomas Divine, at first I also thought it was a reference count bug. But I believe what's happening is that the IRP object is disappearing beneath the cancellation routine, not the driver's context object disappearing. Here are my reasons to support this:

* In WDK 6001.18001, the ndisprot\60\sys directory is different from the ndisprot\5x\sys directory. Much of that is because NDIS 6.0 is different from NDIS 5.1. However, some of those differences are IRP cancellation, which should be the same between Vista and earlier Windows. As reasoned below, through empirical evidence and my own code review, those changes should have been integrated into the 5x directory.

* The IRP cancellation code in the 60 directory closely approximates the pseudo code in MSDN regarding IRP cancellation for those "Using a Driver-Supplied Spin Lock": http://msdn.microsoft.com/en-us/library/ms796240.aspx

* Without the changes in the 60 directory, my driver based on code from 3790.1830 (which in terms of IRP handling is identical to 6001.18001 in the 5x\sys directory) would blue screen usually within 30 seconds under my stress executable. After back-porting the changes from 60, I have not had a blue screen after hours of stress executable testing.

* Reviewing the changes by hand, here are the relevant changes and how I believe this fixes the driver:

** In recv.c NdisProtRead, in 5x code, it first marks the IRP as pending, and then sets the cancel routine, and then carries on. In 60 code, first it sets the cancel routine, and then immediately checks to see if pIrp->Cancel is set. If so, it then checks to see if it can disable the cancel routine. If that works, it does not mark the IRP pending, and ends the request with STATUS_CANCELLED. Otherwise it marks the IRP pending and allows something else to complete the IRP. In NdisProtCancelRead, the change was made to process the IRP and remove it from the PendedReads list without checking to see if it was there or without checking any other conditions. I assume this was done to crash there in case of bugs in the IRP completions--the scheme for the read path is that if the cancellation routine will run or has run, it takes care of IoCompleteRequest. If the other path gets to it, it prevents the cancellation routine from running at all. The code in the write path takes a different approach.

I don't understand the significance of the read path change, but I think it has to do with NdisProtCancelRead having a race with the other places that completed the IRP. Before I made that change I would blue screen in the read path. It was rare, but it would happen, and now it doesn't. However, the change that made the most difference was the next one:

** In send.c, NdisProtWrite sets the context in DriverContext[0] and sets the cancel routine. This happens the same in 5x and 60. However, the protections occur in NdisProtCancelWrite and NdisProtSendComplete. In NdisProtSendComplete, the cancellation spin lock is acquired around the code that disables the cancel routine, and sets DriverContext[0] and DriverContext[1] to NULL. Meanwhile in NdisProtCancelWrite, while holding the cancellation spin lock DriverContext[0] is checked for NULL; if it's NULL the cancellation routine knows that NdisProtSendComplete got there first and there's nothing to do. There are further protections to protect against a race with the PendedWrites list, which I believe is correct (with the possible exception of: http://www.osronline.com/showthread.cfm?link=142676). Note that NdisProtCancelWrite never completes an IRP, it just calls NDIS to cancel the send(s) if it can, and always allows NdisProtSendComplete to complete it. But the most important change is that NdisProtCancelWrite increments a reference on the context object if it can, while holding the cancellation spin lock.

I've integrated the relevant 60 changes into my 5x based ndisprot (with the addition of the change I describe here: http://www.osronline.com/showthread.cfm?link=142676) and now I can no longer get a blue screen.

Without the change to the send path, what happens is that when NdisProtCancelWrite is called often it will crash when attempting to dereference the context object from DriverContext[0]. It's not because the context object has gone away, it's because the IRP itself has been completed and used for something else during the time it took the IO system to call the cancel routine and NdisProtCancelWrite to start executing that code--therefore DriverContext[0] no longer points to my context object. The protections described above avoid this possibility. The reason for believing that the IRP went away is:

* The DriverContext[0] is not even close to a correct context object; in fact, the context object in question should remain constant throughout the lifetime of the driver, as that represents the binding to the NIC, which would not change. In my blue screen dump, DriverContext[0] pointer was something other than my expected context object pointer.

* I modified my version to instrument DriverContext[2] and DriverContext[3] as the IRP went through different points in its lifecycle. When I blue screen'd in NdisProtCancelWrite, the IRP did not have the values that I was inserting in DriverContext[2] or DriverContext[3]; they were all something else, I assume because the IRP memory was reused for another operation.

Please let me know if any of my reasoning is wrong...
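
[Editorial sketch] The read-path handshake described in message 5 (set the cancel routine, check the cancel flag, try to take the routine back) can be modeled in ordinary user-mode C. This is not the NDISPROT source and does not use the WDK; every name below (ModelIrp, queue_read, complete_read, io_cancel and so on) is hypothetical, the driver's spin lock and the PendedReads list are left out, and the plain pointer swap stands in for the interlocked exchange that IoSetCancelRoutine performs. It is a minimal sketch, assuming the ordering the poster describes for the 60 code, of why exactly one path ends up completing the request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-mode stand-in for an IRP; only fields the pattern touches. */
typedef struct ModelIrp ModelIrp;
typedef void (*CancelFn)(ModelIrp *);

struct ModelIrp {
    CancelFn cancel_routine; /* models Irp->CancelRoutine */
    bool     cancel;         /* models Irp->Cancel */
    bool     pending;        /* models the effect of IoMarkIrpPending */
    int      completions;    /* times the request was "completed" */
};

typedef enum { MODEL_PENDING, MODEL_CANCELLED } ModelStatus;

/* Models IoSetCancelRoutine: install fn, return the previous routine.
 * The kernel does this as an interlocked exchange; a plain swap is
 * enough for this single-threaded model. */
static CancelFn set_cancel_routine(ModelIrp *irp, CancelFn fn) {
    CancelFn old = irp->cancel_routine;
    irp->cancel_routine = fn;
    return old;
}

/* Stands in for IoCompleteRequest; completing twice is the bug to avoid. */
static void complete_once(ModelIrp *irp) {
    assert(irp->completions == 0);
    irp->completions++;
}

/* Cancel routine: once it is called, it owns completion. */
static void cancel_read(ModelIrp *irp) {
    complete_once(irp);
}

/* Dispatch side, in the order the 60 code is described as using. */
static ModelStatus queue_read(ModelIrp *irp) {
    set_cancel_routine(irp, cancel_read);
    if (irp->cancel && set_cancel_routine(irp, NULL) != NULL) {
        /* Cancelled before queuing, and we took the routine back:
         * nobody else will complete it, so finish here, not pending. */
        complete_once(irp);
        return MODEL_CANCELLED;
    }
    irp->pending = true; /* data arrival or the cancel routine completes it */
    return MODEL_PENDING;
}

/* Normal completion side: proceed only if the cancel routine was still set. */
static bool complete_read(ModelIrp *irp) {
    if (set_cancel_routine(irp, NULL) == NULL)
        return false; /* the cancel routine will run or has run */
    complete_once(irp);
    return true;
}

/* Models what IoCancelIrp does: set Cancel, take the routine, call it. */
static bool io_cancel(ModelIrp *irp) {
    irp->cancel = true;
    CancelFn fn = set_cancel_routine(irp, NULL);
    if (fn == NULL)
        return false;
    fn(irp);
    return true;
}
```

Whichever side swaps the cancel routine to NULL first owns the completion: calling queue_read, then io_cancel, then complete_read completes the request once in the cancel routine, while queue_read, then complete_read, then io_cancel completes it once on the normal path. The write-path protection built around DriverContext[0] and the cancellation spin lock is a separate mechanism and is not modeled here.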
  Message 6 of 10  
12 Nov 08 00:24
Thomas Divine
xxxxxx@pcausa.com
Join Date: 05 Aug 2010
Posts To This List: 635
NDISPROT 5x cancel code out of date?

It certainly looks like you have made a careful analysis of the behavior. I will look at it in more depth myself. If it turns out to be as you say I'll make a note of it on NDIS.com - and point out that you did the detective work.

This wouldn't be the first bug in Microsoft driver samples.

Your application must be doing something a little unusual to bring out this flaw. The NDISPROT sample has been the baseline for quite a few drivers for quite a few years. If it was as easy as you say to reproduce I would have expected screams years ago. Perhaps your stress application just happens to exercise this weak point.

Warm regards,

Thomas F. Divine

> -----Original Message-----
> From: xxxxx@lists.osr.com [mailto:bounce-343437-xxxxx@lists.osr.com] On Behalf Of xxxxx@scalent.com
> Sent: Tuesday, November 11, 2008 9:06 PM
> To: Windows System Software Devs Interest List
> Subject: RE:[ntdev] NDISPROT 5x cancel code out of date?
>
> In reply to Thomas Divine, at first I also thought it was a reference count bug. But I believe what's happening is that the IRP object is <...excess quoted lines suppressed...>
  Message 7 of 10  
12 Nov 08 13:48
ntdev member 40949
xxxxxx@scalent.com
Join Date:
Posts To This List: 7
NDISPROT 5x cancel code out of date?

Thanks for your replies,

Yes, I believe that the blue screen that we saw was an extremely rare occurrence. We have had this driver code running for 2 or 3 years, unmodified in any way, doing operations several times a second, 24/7 on many machines. This was the first time we saw it blue screen, and of course it did not happen in our lab. That's why I wrote the stress executable (again, something that would have been a good idea 2 or 3 years ago).

Right now I believe that this happens because the process performing the ioctl exits; I will have to do more careful testing with my stress executable to determine if this can be caused by merely closing the ioctl handle in a separate thread. My stress executable right now does both, closing ioctl handles at random times in other threads, and exiting the process at random times, to be restarted by a batch file.
  Message 8 of 10  
12 Nov 08 13:53
Pavel A
xxxxxx@fastmail.fm
Join Date: 21 Jul 2008
Posts To This List: 2401
NDISPROT 5x cancel code out of date?

xxxxx@scalent.com wrote:
> Thanks for your replies,
>
> Yes, I believe that the blue screen that we saw was an extremely rare occurrence. We have had this driver code running for 2 or 3 years, unmodified in any way, doing operations several times a second, 24/7 on many machines. This was the first time we saw it blue screen, and of course it did not happen in our lab. That's why I wrote the stress executable (again, something that would have been a good idea 2 or 3 years ago).
>
> Right now I believe that this happens because the process performing the ioctl exits; I will have to do more careful testing with my stress executable to determine if this can be caused by merely closing the ioctl handle in a separate thread. My stress executable right now does both, closing ioctl handles at random times in other threads, and exiting the process at random times, to be restarted by a batch file.

Why exit the process? Can CancelIo() be used instead?

--PA
  Message 9 of 10  
12 Nov 08 14:01
ntdev member 40949
xxxxxx@scalent.com
Join Date:
Posts To This List: 7
NDISPROT 5x cancel code out of date?

I'm sure CancelIo (or even better, CancelIoEx) might be a good way to do it as well; but the condition we saw in the real world was a process exiting, so I included it in my testing.
  Message 10 of 10  
12 Nov 08 14:08
Michal Vodicka
xxxxxx@upek.com
Join Date: 02 Apr 2004
Posts To This List: 1612
NDISPROT 5x cancel code out of date?

> -----Original Message-----
> From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@scalent.com
> Sent: Wednesday, November 12, 2008 7:46 PM
> To: Windows System Software Devs Interest List
> Subject: RE:[ntdev] NDISPROT 5x cancel code out of date?
>
> Yes, I believe that the blue screen that we saw was an extremely rare occurrence. We have had this driver code <...excess quoted lines suppressed...>

Please note that this kind of problem (a race condition, if I understand your explanation) occurs more often now than it did a few years ago, because of new fast multi-core machines. This is my experience from USB driver development (I stopped working in the NDIS area years ago), but I believe it is the common case. It means nothing that the driver was used on millions of machines before. The environment changes rapidly and old bugs emerge faster (I had to solve a 6-year-old one recently...).

The only cure is what you're already doing: write special apps for stress tests and try to simulate all possible cases. A coverage tool can help.

Best regards,

Michal Vodicka
UPEK, Inc.
[xxxxx@upek.com, http://www.upek.com]
All times are GMT -5.


Copyright ©2015, OSR Open Systems Resources, Inc.