In reply to Thomas Divine, at first I also thought it was a reference count bug. But I believe what’s happening is that the IRP object is disappearing beneath the cancellation routine, not the driver’s context object disappearing. Here are my reasons to support this:
* In WDK 6001.18001, the ndisprot\60\sys directory is different from the ndisprot\5x\sys directory. Much of that is because NDIS 6.0 is different from NDIS 5.1. However, some of those differences are IRP cancellation, which should be the same between Vista and earlier Windows. As reasoned below, through empirical evidence and my own code review, those changes should have been integrated into the 5x directory.
* The IRP cancellation code in the 60 directory closely approximates the pseudo code in MSDN regarding IRP cancellation for those “Using a Driver-Supplied Spin Lock”: http://msdn.microsoft.com/en-us/library/ms796240.aspx
* Without the changes in the 60 directory, my driver based on code from 3790.1830 (which in terms of IRP handling is identical to 6001.18001 in the 5x\sys directory) would blue screen usually within 30 seconds under my stress executable. After back-porting the changes from 60, I have not had a blue screen after hours of stress executable testing.
* Reviewing the changes by hand, here are the relevant changes and how I believe this fixes the driver:
** In recv.c NdisProtRead, in 5x code, it first marks the IRP as pending, and then sets the cancel routine, and then carries on. In 60 code, first it sets the cancel routine, and then immediately checks to see if pIrp->Cancel is set. If so, it then checks to see if it can disable the cancel routine. If that works, it does not mark the IRP pending, and ends the request with STATUS_CANCELLED. Otherwise it marks the IRP pending and allows something else to complete the IRP. In NdisProtCancelRead, the change was made to process the IRP and remove it from the PendedReads list without checking to see if it was there or without checking any other conditions. I assume this was done to crash there in case of bugs in the IRP completions–the scheme for the read path is that if the cancellation routine will run or has run, it takes care of IoCompleteRequest. If the other path gets to it, it prevents the cancellation routine from running at all. The code in the write path takes a different approach.
I don’t understand the significance of the read path change, but I think it has to do with NdisProtCancelRead having a race with the other places that completed the IRP. Before I made that change I would blue screen in the read path. It was rare, but it would happen, and now it doesn’t. However, the change that made the most difference was the next one:
** In send.c, NdisProtWrite sets the context in DriverContext[0] and sets the cancel routine. This happens the same in 5x and 60. However, the protections occur in NdisProtCancelWrite and NdisProtSendComplete. In NdisProtSendComplete, the cancellation spin lock is acquired around the code that disables the cancel routine, and sets DriverContext[0] and DriverContext[1] to NULL. Meanwhile in NdisProtCancelWrite, while holding the cancellation spin lock DriverContext[0] is checked for NULL; if it’s NULL the cancellation routine knows that NdisProtSendComplete got there first and there’s nothing to do. There are further protections to protect against a race with the PendedWrites list, which I believe is correct (with the possible exception of: http://www.osronline.com/showthread.cfm?link=142676). Note that NdisProtCancelWrite never completes an IRP, it just calls NDIS to cancel the send(s) if it can, and always allows NdisProtSendComplete to complete it. But the most important change is that NdisProtCancelWrite increments a reference on the context object if it can, while holding the cancellation spin lock.
I’ve integrated the relevant 60 changes into my 5x based ndisprot (with the addition of the change I describe here: http://www.osronline.com/showthread.cfm?link=142676) and now I can no longer get a blue screen.
Without the change to the send path, what happens is that when NdisProtCancelWrite is called often it will crash when attempting to dereference the context object from DriverContext[0]. It’s not because the context object has gone away, it’s because the IRP itself has been completed and used for something else during the time it took the IO system to call the cancel routine and NdisProtCancelWrite to start executing that code–therefore DriverContext[0] no longer points to my context object. The protections described above avoid this possibility. The reason for believing that the IRP went away is:
* The DriverContext[0] is not even close to a correct context object; in fact, the context object in question should remain constant throughout the lifetime of the driver, as that represents the binding to the NIC, which would not change. In my blue screen dump, DriverContext[0] pointer was something other than my expected context object pointer.
* I modified my version to instrument DriverContext[2] and DriverContext[3] as the IRP went through different points in its lifecycle. When I blue screen’d in NdisProtCancelWrite, the IRP did not have the values that I was inserting in DriverContext[2] or DriverContext[3]; they were all something else, I assume because the IRP memory was reused for another operation.
Please let me know if any of my reasoning is wrong…