NDISPROT 5x cancel code out of date?

OSR_Community_User · November 10, 2008, 5:26pm

We have a driver that’s based on NDISPROT. We originally got the sample code from 3790.1830. However, recently we’ve discovered that we have a race condition in NdisProtCancelWrite. I have written some sample code that can get my driver to blue screen just about whenever I want.

Looking around I found the following discussion (and I can’t seem to reply to it):

http://www.tech-archive.net/Archive/Development/microsoft.public.development.device.drivers/2005-01/0979.html

I know that it is referring to reads, and I’m referring to write, and he is referring to deadlocks, while I experience a blue screen… but I think it’s the same issue with cancellation synchronization.

In it Stephan Wolf says that the code in question is up-to-date, and that has been resolved. However, if I understand it correctly, 3790.1830 NDISPROT, and 6001.18001 NDISPROT’s 5x\sys directory are both out-of-date even in the read path. But the 60\sys directory appears to have correctly implemented cancel synchronization (maybe with the exception of a bug? http://www.osronline.com/showthread.cfm?link=142676)

Has anyone else found this problem with the cancellation code? I’m about to attempt to integrate the changes in NDISPROT for 60 into my code and see if this fixes my race condition…

OSR_Community_User · November 11, 2008, 3:24pm

In case anyone is interested, I back-ported cancel routine changes that were made in the NDIS 6.0 version of NDISPROT into the NDIS 5.x version; this seems to have solved my blue screen problems.

Is anyone else using NDISPROT derived drivers in Win2k3 and earlier?

OSR_Community_User · November 11, 2008, 3:42pm

I am using an NDISPROT derived driver on Windows XP for an open source
project. I would be interested in your patch if you are willing/able to
share.

xxxxx@scalent.com wrote:

In case anyone is interested, I back-ported cancel routine changes that were made in the NDIS 6.0 version of NDISPROT into the NDIS 5.x version; this seems to have solved my blue screen problems.

Is anyone else using NDISPROT derived drivers in Win2k3 and earlier?

NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

Thomas_Divine-2 · November 11, 2008, 3:45pm

I use NDISPROT as the base for some drivers on Win2K3 and earlier.

Shouldn’t the reference counting logic insure that the memory exists long
enough? The memory should persist until the final NPROT_DEREF_SEND_PKT call.

Perhaps a small reference counting bug here.

Thomas F. Divine

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:bounce-343409-
xxxxx@lists.osr.com] On Behalf Of xxxxx@scalent.com
Sent: Tuesday, November 11, 2008 3:22 PM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] NDISPROT 5x cancel code out of date?

In case anyone is interested, I back-ported cancel routine changes that
were made in the NDIS 6.0 version of NDISPROT into the NDIS 5.x
version; this seems to have solved my blue screen problems.

Is anyone else using NDISPROT derived drivers in Win2k3 and earlier?

NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

OSR_Community_User · November 11, 2008, 9:07pm

In reply to Thomas Divine, at first I also thought it was a reference count bug. But I believe what’s happening is that the IRP object is disappearing beneath the cancellation routine, not the driver’s context object disappearing. Here are my reasons to support this:

* In WDK 6001.18001, the ndisprot\60\sys directory is different from the ndisprot\5x\sys directory. Much of that is because NDIS 6.0 is different from NDIS 5.1. However, some of those differences are IRP cancellation, which should be the same between Vista and earlier Windows. As reasoned below, through empirical evidence and my own code review, those changes should have been integrated into the 5x directory.

* The IRP cancellation code in the 60 directory closely approximates the pseudo code in MSDN regarding IRP cancellation for those “Using a Driver-Supplied Spin Lock”: http://msdn.microsoft.com/en-us/library/ms796240.aspx

* Without the changes in the 60 directory, my driver based on code from 3790.1830 (which in terms of IRP handling is identical to 6001.18001 in the 5x\sys directory) would blue screen usually within 30 seconds under my stress executable. After back-porting the changes from 60, I have not had a blue screen after hours of stress executable testing.

* Reviewing the changes by hand, here are the relevant changes and how I believe this fixes the driver:
** In recv.c NdisProtRead, in 5x code, it first marks the IRP as pending, and then sets the cancel routine, and then carries on. In 60 code, first it sets the cancel routine, and then immediately checks to see if pIrp->Cancel is set. If so, it then checks to see if it can disable the cancel routine. If that works, it does not mark the IRP pending, and ends the request with STATUS_CANCELLED. Otherwise it marks the IRP pending and allows something else to complete the IRP. In NdisProtCancelRead, the change was made to process the IRP and remove it from the PendedReads list without checking to see if it was there or without checking any other conditions. I assume this was done to crash there in case of bugs in the IRP completions–the scheme for the read path is that if the cancellation routine will run or has run, it takes care of IoCompleteRequest. If the other path gets to it, it prevents the cancellation routine from running at all. The code in the write path takes a different approach.

I don’t understand the significance of the read path change, but I think it has to do with NdisProtCancelRead having a race with the other places that completed the IRP. Before I made that change I would blue screen in the read path. It was rare, but it would happen, and now it doesn’t. However, the change that made the most difference was the next one:

** In send.c, NdisProtWrite sets the context in DriverContext[0] and sets the cancel routine. This happens the same in 5x and 60. However, the protections occur in NdisProtCancelWrite and NdisProtSendComplete. In NdisProtSendComplete, the cancellation spin lock is acquired around the code that disables the cancel routine, and sets DriverContext[0] and DriverContext[1] to NULL. Meanwhile in NdisProtCancelWrite, while holding the cancellation spin lock DriverContext[0] is checked for NULL; if it’s NULL the cancellation routine knows that NdisProtSendComplete got there first and there’s nothing to do. There are further protections to protect against a race with the PendedWrites list, which I believe is correct (with the possible exception of: http://www.osronline.com/showthread.cfm?link=142676). Note that NdisProtCancelWrite never completes an IRP, it just calls NDIS to cancel the send(s) if it can, and always allows NdisProtSendComplete to complete it. But the most important change is that NdisProtCancelWrite increments a reference on the context object if it can, while holding the cancellation spin lock.

I’ve integrated the relevant 60 changes into my 5x based ndisprot (with the addition of the change I describe here: http://www.osronline.com/showthread.cfm?link=142676) and now I can no longer get a blue screen.

Without the change to the send path, what happens is that when NdisProtCancelWrite is called often it will crash when attempting to dereference the context object from DriverContext[0]. It’s not because the context object has gone away, it’s because the IRP itself has been completed and used for something else during the time it took the IO system to call the cancel routine and NdisProtCancelWrite to start executing that code–therefore DriverContext[0] no longer points to my context object. The protections described above avoid this possibility. The reason for believing that the IRP went away is:
* The DriverContext[0] is not even close to a correct context object; in fact, the context object in question should remain constant throughout the lifetime of the driver, as that represents the binding to the NIC, which would not change. In my blue screen dump, DriverContext[0] pointer was something other than my expected context object pointer.
* I modified my version to instrument DriverContext[2] and DriverContext[3] as the IRP went through different points in its lifecycle. When I blue screen’d in NdisProtCancelWrite, the IRP did not have the values that I was inserting in DriverContext[2] or DriverContext[3]; they were all something else, I assume because the IRP memory was reused for another operation.

Please let me know if any of my reasoning is wrong…

Thomas_Divine-2 · November 12, 2008, 12:24am

It certainly looks like you have made a careful analysis of the behavior. I
will look at it in more depth myself.

If it turns out to be as you say I’ll make a note of it on NDIS.com - and
point out that you did the detective work.

This wouldn’t be the first bug in Microsoft driver samples. Your application
must be doing something a little unusual to bring out this flaw. The
NDISPROT sample has been the baseline for quite a few drivers for quite a
few years. If it was as easy as you say to reproduce I would have expected
screams years ago. Perhaps your stress application just happens to exercise
this weak point.

Warm regards,

Thomas F. Divine

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:bounce-343437-
xxxxx@lists.osr.com] On Behalf Of xxxxx@scalent.com
Sent: Tuesday, November 11, 2008 9:06 PM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] NDISPROT 5x cancel code out of date?

In reply to Thomas Divine, at first I also thought it was a reference
count bug. But I believe what’s happening is that the IRP object is
disappearing beneath the cancellation routine, not the driver’s context
object disappearing. Here are my reasons to support this:

* In WDK 6001.18001, the ndisprot\60\sys directory is different from
the ndisprot\5x\sys directory. Much of that is because NDIS 6.0 is
different from NDIS 5.1. However, some of those differences are IRP
cancellation, which should be the same between Vista and earlier
Windows. As reasoned below, through empirical evidence and my own code
review, those changes should have been integrated into the 5x
directory.

* The IRP cancellation code in the 60 directory closely approximates
the pseudo code in MSDN regarding IRP cancellation for those “Using a
Driver-Supplied Spin Lock”: http://msdn.microsoft.com/en-
us/library/ms796240.aspx

* Without the changes in the 60 directory, my driver based on code from
3790.1830 (which in terms of IRP handling is identical to 6001.18001 in
the 5x\sys directory) would blue screen usually within 30 seconds under
my stress executable. After back-porting the changes from 60, I have
not had a blue screen after hours of stress executable testing.

* Reviewing the changes by hand, here are the relevant changes and how
I believe this fixes the driver:
** In recv.c NdisProtRead, in 5x code, it first marks the IRP as
pending, and then sets the cancel routine, and then carries on. In 60
code, first it sets the cancel routine, and then immediately checks to
see if pIrp->Cancel is set. If so, it then checks to see if it can
disable the cancel routine. If that works, it does not mark the IRP
pending, and ends the request with STATUS_CANCELLED. Otherwise it marks
the IRP pending and allows something else to complete the IRP. In
NdisProtCancelRead, the change was made to process the IRP and remove
it from the PendedReads list without checking to see if it was there or
without checking any other conditions. I assume this was done to crash
there in case of bugs in the IRP completions–the scheme for the read
path is that if the cancellation routine will run or has run, it takes
care of IoCompleteRequest. If the other path gets to it, it prevents
the cancellation routine from running at all. The code in the write
path takes a different approach.

I don’t understand the significance of the read path change, but I
think it has to do with NdisProtCancelRead having a race with the other
places that completed the IRP. Before I made that change I would blue
screen in the read path. It was rare, but it would happen, and now it
doesn’t. However, the change that made the most difference was the next
one:

** In send.c, NdisProtWrite sets the context in DriverContext[0] and
sets the cancel routine. This happens the same in 5x and 60. However,
the protections occur in NdisProtCancelWrite and NdisProtSendComplete.
In NdisProtSendComplete, the cancellation spin lock is acquired around
the code that disables the cancel routine, and sets DriverContext[0]
and DriverContext[1] to NULL. Meanwhile in NdisProtCancelWrite, while
holding the cancellation spin lock DriverContext[0] is checked for
NULL; if it’s NULL the cancellation routine knows that
NdisProtSendComplete got there first and there’s nothing to do. There
are further protections to protect against a race with the PendedWrites
list, which I believe is correct (with the possible exception of:
http://www.osronline.com/showthread.cfm?link=142676). Note that
NdisProtCancelWrite never completes an IRP, it just calls NDIS to
cancel the send(s) if it can, and always allows NdisProtSendComplete to
complete it. But the most important change is that NdisProtCancelWrite
increments a reference on the context object if it can, while holding
the cancellation spin lock.

I’ve integrated the relevant 60 changes into my 5x based ndisprot (with
the addition of the change I describe here:
http://www.osronline.com/showthread.cfm?link=142676) and now I can no
longer get a blue screen.

Without the change to the send path, what happens is that when
NdisProtCancelWrite is called often it will crash when attempting to
dereference the context object from DriverContext[0]. It’s not because
the context object has gone away, it’s because the IRP itself has been
completed and used for something else during the time it took the IO
system to call the cancel routine and NdisProtCancelWrite to start
executing that code–therefore DriverContext[0] no longer points to my
context object. The protections described above avoid this possibility.
The reason for believing that the IRP went away is:
* The DriverContext[0] is not even close to a correct context object;
in fact, the context object in question should remain constant
throughout the lifetime of the driver, as that represents the binding
to the NIC, which would not change. In my blue screen dump,
DriverContext[0] pointer was something other than my expected context
object pointer.
* I modified my version to instrument DriverContext[2] and
DriverContext[3] as the IRP went through different points in its
lifecycle. When I blue screen’d in NdisProtCancelWrite, the IRP did not
have the values that I was inserting in DriverContext[2] or
DriverContext[3]; they were all something else, I assume because the
IRP memory was reused for another operation.

Please let me know if any of my reasoning is wrong…

NTDEV is sponsored by OSR

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

OSR_Community_User · November 12, 2008, 1:48pm

Thanks for your replies,

Yes, I believe that the blue screen that we saw was an extremely rare occurrence. We have had this driver code running for 2 or 3 years, unmodified in any way, doing operations several times a second, 24/7 on many machines. This was the first time we saw it blue screen, and of course it did not happen in our lab. That’s why I wrote the stress executable (again, something that would have been a good idea 2 or 3 years ago).

Right now I believe that this happens because the process performing the ioctl exits; I will have to do more careful testing with my stress executable to determine if this can be caused by merely closing the ioctl handle in a separate thread. My stress executable right now does both, closing ioctl handles at random times in other threads, and exiting the process at random times, to be restarted by a batch file.

Pavel_A1 · November 12, 2008, 1:53pm

xxxxx@scalent.com wrote:

Thanks for your replies,

Yes, I believe that the blue screen that we saw was an extremely rare occurrence. We have had this driver code running for 2 or 3 years, unmodified in any way, doing operations several times a second, 24/7 on many machines. This was the first time we saw it blue screen, and of course it did not happen in our lab. That’s why I wrote the stress executable (again, something that would have been a good idea 2 or 3 years ago).

Right now I believe that this happens because the process performing the ioctl exits; I will have to do more careful testing with my stress executable to determine if this can be caused by merely closing the ioctl handle in a separate thread. My stress executable right now does both, closing ioctl handles at random times in other threads, and exiting the process at random times, to be restarted by a batch file.

Why to exit the process? Can CancelIo() be used instead?

–PA

OSR_Community_User · November 12, 2008, 2:01pm

I’m sure CancelIo (or even better, CancelIoEx) might be a good way to do it as well; but the condition we saw in the real world was a process exiting, so I included it in my testing.

Michal_Vodicka-2 · November 12, 2008, 2:08pm

> -----Original Message-----

From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of
xxxxx@scalent.com
Sent: Wednesday, November 12, 2008 7:46 PM
To: Windows System Software Devs Interest List
Subject: RE:[ntdev] NDISPROT 5x cancel code out of date?

Yes, I believe that the blue screen that we saw was an
extremely rare occurrence. We have had this driver code
running for 2 or 3 years, unmodified in any way, doing
operations several times a second, 24/7 on many machines.
This was the first time we saw it blue screen, and of course
it did not happen in our lab. That’s why I wrote the stress
executable (again, something that would have been a good idea
2 or 3 years ago).

Please note this kind of problem (race condition if I understand your
explanation) occurs more often now than few years before. It is because
of new fast multi-core machines. It is my experience from USB driver
development (I stopped working in NDIS area years before) but I believe
it is common case. It means nothing if the driver was used on millions
machines before. The environment changes rapidly and old bugs emerge
faster (had to solve one 6 years old recently…).

The only cure is what you’re already doing. Write special apps for
stress tests and try to simulate all possible cases. Coverage tool can
help.

Best regards,

Michal Vodicka
UPEK, Inc.
[xxxxx@upek.com, http://www.upek.com]