NDISPROT 5x cancel code out of date?
In reply to Thomas Divine, at first I also thought it was a reference count bug.
But I believe what's happening is that the IRP object is disappearing beneath
the cancellation routine, not the driver's context object disappearing. Here are
my reasons to support this:
* In WDK 6001.18001, the ndisprot\60\sys directory is different from the
ndisprot\5x\sys directory. Much of that is because NDIS 6.0 is different from
NDIS 5.1. However, some of those differences are IRP cancellation, which should
be the same between Vista and earlier Windows. As reasoned below, through
empirical evidence and my own code review, those changes should have been
integrated into the 5x directory.
* The IRP cancellation code in the 60 directory closely approximates the pseudo
code in MSDN regarding IRP cancellation for those "Using a Driver-Supplied Spin
* Without the changes in the 60 directory, my driver based on code from
3790.1830 (which in terms of IRP handling is identical to 6001.18001 in the
5x\sys directory) would blue screen usually within 30 seconds under my stress
executable. After back-porting the changes from 60, I have not had a blue screen
after hours of stress executable testing.
* Reviewing the changes by hand, here are the relevant changes and how I believe
this fixes the driver:
** In recv.c NdisProtRead, in 5x code, it first marks the IRP as pending, and
then sets the cancel routine, and then carries on. In 60 code, first it sets the
cancel routine, and then immediately checks to see if pIrp->Cancel is set. If
so, it then checks to see if it can disable the cancel routine. If that works,
it does not mark the IRP pending, and ends the request with STATUS_CANCELLED.
Otherwise it marks the IRP pending and allows something else to complete the
IRP. In NdisProtCancelRead, the change was made to process the IRP and remove it
from the PendedReads list without checking to see if it was there or without
checking any other conditions. I assume this was done to crash there in case of
bugs in the IRP completions--the scheme for the read path is that if the
cancellation routine will run or has run, it takes care of IoCompleteRequest. If
the other path gets to it, it prevents the cancellation routine from running at
all. The code in the write path takes a different approach.
I don't understand the significance of the read path change, but I think it has
to do with NdisProtCancelRead having a race with the other places that completed
the IRP. Before I made that change I would blue screen in the read path. It was
rare, but it would happen, and now it doesn't. However, the change that made the
most difference was the next one:
** In send.c, NdisProtWrite sets the context in DriverContext and sets the
cancel routine. This happens the same in 5x and 60. However, the protections
occur in NdisProtCancelWrite and NdisProtSendComplete. In NdisProtSendComplete,
the cancellation spin lock is acquired around the code that disables the cancel
routine, and sets DriverContext and DriverContext to NULL. Meanwhile in
NdisProtCancelWrite, while holding the cancellation spin lock DriverContext
is checked for NULL; if it's NULL the cancellation routine knows that
NdisProtSendComplete got there first and there's nothing to do. There are
further protections to protect against a race with the PendedWrites list, which
I believe is correct (with the possible exception of:
http://www.osronline.com/showthread.cfm?link=142676). Note that
NdisProtCancelWrite never completes an IRP, it just calls NDIS to cancel the
send(s) if it can, and always allows NdisProtSendComplete to complete it. But
the most important change is that NdisProtCancelWrite increments a reference on
the context object if it can, while holding the cancellation spin lock.
I've integrated the relevant 60 changes into my 5x based ndisprot (with the
addition of the change I describe here:
http://www.osronline.com/showthread.cfm?link=142676) and now I can no longer get
a blue screen.
Without the change to the send path, what happens is that when
NdisProtCancelWrite is called often it will crash when attempting to dereference
the context object from DriverContext. It's not because the context object
has gone away, it's because the IRP itself has been completed and used for
something else during the time it took the IO system to call the cancel routine
and NdisProtCancelWrite to start executing that code--therefore DriverContext
no longer points to my context object. The protections described above avoid
this possibility. The reason for believing that the IRP went away is:
* The DriverContext is not even close to a correct context object; in fact,
the context object in question should remain constant throughout the lifetime of
the driver, as that represents the binding to the NIC, which would not change.
In my blue screen dump, DriverContext pointer was something other than my
expected context object pointer.
* I modified my version to instrument DriverContext and DriverContext as
the IRP went through different points in its lifecycle. When I blue screen'd in
NdisProtCancelWrite, the IRP did not have the values that I was inserting in
DriverContext or DriverContext; they were all something else, I assume
because the IRP memory was reused for another operation.
Please let me know if any of my reasoning is wrong...