IoCancelIrp jumping off into the weeds

Hi all,

I’ve been chasing a BSOD that has recently started cropping up in our NDIS filter driver. It seems to have appeared in newer kernel versions, so I suspect we’re violating some constraint that wasn’t strongly checked previously.

The filter sits near the bottom of the stack, intercepts GigE Vision Streaming Protocol (GVSP) packets, and coalesces them into a single image frame. This way the client can ask for a frame at a time and avoid an extra memcpy to reassemble the packets.

This portion of the filter works fine, but we also have additional functionality built in that allows packet capture à la PCAP. A client app can queue up buffers that the driver fills in with the network packet info. Then the client can dump those packets to a .pcap file which can be read by Wireshark.

This code has been around, mostly unchanged, for years, but it recently started crashing when the client stops recording. From the client’s point of view, there is a thread which uses ioctls to queue and read buffers in a loop until the user stops capturing, which causes the thread to finish reading and exit. The crash looks like this:

Child-SP RetAddr Call Site

00 fffff88006a78f08 fffff80002b6d662 nt!DbgBreakPointWithStatus
01 fffff88006a78f10 fffff80002b6e44e nt!KiBugCheckDebugBreak+0x12
02 fffff88006a78f70 fffff80002a7cf04 nt!KeBugCheck2+0x71e
03 fffff88006a79640 fffff80002a7c3a9 nt!KeBugCheckEx+0x104
04 fffff88006a79680 fffff80002a7bcfc nt!KiBugCheckDispatch+0x69
05 fffff88006a797c0 fffff80002aa8d8d nt!KiSystemServiceHandler+0x7c
06 fffff88006a79800 fffff80002aa7b65 nt!RtlpExecuteHandlerForException+0xd
07 fffff88006a79830 fffff80002ab90bd nt!RtlDispatchException+0x415
08 fffff88006a79f10 fffff80002a7c48e nt!KiDispatchException+0x17d
09 fffff88006a7a5a0 fffff80002a7a5df nt!KiExceptionDispatch+0xce
0a fffff88006a7a780 fffffa80067258f0 nt!KiInvalidOpcodeFault+0x11f
0b fffff88006a7a918 fffff80002a2db40 0xfffffa80067258f0
0c fffff88006a7a920 fffff80002d53dc4 nt!IoCancelIrp+0x60
0d fffff88006a7a960 fffff80002d5477d nt!IoCancelThreadIo+0x44
0e fffff88006a7a990 fffff80002d54ded nt!PspExitThread+0x58d
0f fffff88006a7aa50 fffff80002d54ef5 nt!PspTerminateThreadByPointer+0x4d
10 fffff88006a7aaa0 fffff80002a7c093 nt!NtTerminateThread+0x45
11 fffff88006a7aae0 0000000076d2c26a nt!KiSystemServiceCopyEnd+0x13
12 0000000001cffd68 0000000076d219e8 ntdll!ZwTerminateThread+0xa
13 0000000001cffd70 0000000076bd59d5 ntdll!RtlExitUserThread+0x48
14 0000000001cffdb0 0000000076d0a561 kernel32!BaseThreadInitThunk+0x15
15 0000000001cffde0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d

It seems like we have a non-NULL yet invalid IRP->CancelRoutine but for the life of me I can’t see how this is happening. We don’t seem to ever set the cancel routine to anything other than NULL within the driver. I suspect we might be seeing the IRP equivalent of a double-free but I don’t even know if that’s possible. The other question might be why we’re seeing IRPs getting canceled at thread exit in the first place. From my reading it seems the client code should have already cleaned them all up.

I’m not really sure what is the best way to chase this. I’m signed up for the Advanced Debugging course next month but I’d really like to make some progress before then. Can anyone suggest some possible starting points to investigate? TIA.

cheers,

Chris Warkentin

Step 1: Did you run the driver under Driver Verifier?

Step 2: Did you run the driver under the Checked Build of Windows (kernel, HAL, NDIS)?

Peter
OSR
@OSRDrivers

Step 1: Yes. I encountered a potential deadlock (TODO: FIXME) in an unrelated area of code but nothing in the PCAP code.

Step 2: No, haven’t tried that yet. Good idea.

I’d fix that and re-try. Or, be more lazy and disable deadlock detection in Driver Verifier and retry.

IF this is a “double free” of an IRP… Driver Verifier is likely to be your fastest and bestest way to run this problem down. Be sure to turn on all the options that might help, and disable the ones (like the low resource simulations, and I guess deadlock detection in your case) that don’t matter.

Peter
OSR
@OSRDrivers

A mildly interesting thing that I just confirmed is that the behavior has evolved with kernel versions.

I had reverted to an older Win10 snapshot and the symptoms were different than they are with the Fall Creators Update.

The main thread does a WaitForSingleObject on the child thread before finishing up and writing out the buffers. Pre-Creators Update this would just hang: the child thread would never seem to finish and signal the waiting thread. The child didn’t seem to actually exist, though, at least as far as WinDbg is concerned. Like a zombie thread or something.

With the newer builds it displays the crashing behavior as described above. Anyway, off to figure out how to get the checked build going.

That’s a good idea too. Verifier is certainly easier to start with than checked build. The deadlock issue is unfortunately fairly complicated and seems like it will take a fair bit of rework to get to the bottom of.

I’ve inherited a pretty big pile of very old code here and we’re trying to ‘take it back’, establish better practices, etc. I’ve been adding SAL, running static analysis, setting up HLK/HCK tests, Verifier, etc. It’s a big hill to climb.

Is your code using the MSFT cancel safe queue mechanism or is it so old
that it rolled its own cancel processing?

Mark Roddy

On Mon, Apr 23, 2018 at 11:48 AM, xxxxx@pleora.com <
xxxxx@lists.osr.com> wrote:

That’s a good idea too. Verifier is certainly easier to start with than
checked build. The deadlock issue is unfortunately fairly complicated and
seems like it will take a fair bit of rework to get to the bottom of.

I’ve inherited a pretty big pile of very old code here and we’re trying to
‘take it back’, establish better practices, etc. I’ve been adding SAL,
running static analysis, setting up HLK/HCK tests, Verifier, etc. It’s a
big hill to climb.


NTDEV is sponsored by OSR

Visit the list online at: http://www.osronline.com/showlists.cfm?list=ntdev

MONTHLY seminars on crash dump analysis, WDF, Windows internals and software drivers! Details at http:

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer

>It’s a big hill to climb.

I sympathize.

Free hint: Check the Cleanup vs Cancel logic. Over the past few years, I’ve seen a whole raft of drivers that confuse Cleanup and Cancel… and in fact use Cleanup INSTEAD of Cancel.

https://www.osr.com/nt-insider/2017-issue2/handling-cleanup-close-cancel-wdf-driver/

Peter
OSR
@OSRDrivers

In addition to the other comments…

If you have Verifier enabled you can try finding the IRP in the IRP
alloc/complete/cancel log:

!verifier 100 address

Also, can you do a “dt nt!_IRP address” and show the IRP? Lastly, what does
“!irp address 1” show? My goal with these is to categorize the state of the
IRP a little better (e.g. is the whole thing garbage or does it look
“normal”)

-scott
OSR
@OSRDrivers

> Is your code using the MSFT cancel safe queue mechanism or is it so old

that it rolled its own cancel processing?

Definitely doing its own cancel processing. One strange thing that I don’t understand yet is why the cancel handler never gets set for the IRPs of this particular filter.

The driver is architected as three filters in one: a receiver, a transmitter and a PCAP recorder. They seem to follow slightly different IRP handling paths, though, so maybe I need to try to reconcile what they’re doing.

Thanks. I’ll check that article out.

Free hint: Check the Cleanup vs Cancel logic. Over the past few years, I’ve
seen a whole raft of drivers that confuse Cleanup and Cancel… and in fact use
Cleanup INSTEAD of Cancel.

How do I figure out what the IRP address is?

If you have Verifier enabled you can try finding the IRP in the IRP
alloc/complete/cancel log:

!verifier 100 address

Also, can you do a “dt nt!_IRP address” and show the IRP? Lastly, what does
“!irp address 1” show? My goal with these is to categorize the state of the
IRP a little better (e.g. is the whole thing garbage or does it look
“normal”)

NM, I think I found it. It was at the top of the list (unsurprising I guess).
2: kd> !verifier 0x100 fffffd80d5412ea0


The log entries are displayed in reverse chronological order.

IRP fffffd80d5412ea0, Thread ffffa38df4b0c700, IRQL = 1, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833afdb2 nt!IovCancelIrp+0xe
fffff80082dcab13 nt!IoCancelIrp+0xe5553
fffff80083151043 nt!IoCancelThreadIo+0x5b
fffff8008315090a nt!PspExitThread+0x266
fffff800831512a6 nt!PspTerminateThreadByPointer+0x96
fffff8008315166e nt!NtTerminateThread+0x4a
fffff80082d83553 nt!KiSystemServiceCopyEnd+0x13

IRP fffffd80d5412ea0, Thread ffffa38df4b0c700, IRQL = 0, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833af650 nt!IovAllocateIrp+0x30
fffff80082d9e567 nt!IopAllocateIrpExReturn+0x151c67
fffff800830e0b1c nt!IopXxxControlFile+0x3ec
fffff800830e0716 nt!NtDeviceIoControlFile+0x56
fffff80082d83553 nt!KiSystemServiceCopyEnd+0x13

IRP fffffd80d5412ea0, Thread ffffa38df4334700, IRQL = 0, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833afe16 nt!IovCompleteRequest+0x2e
fffff80082daaa63 nt!IofCompleteRequest+0x1304f3
fffff80d26f1866d VerifierExt!IofCompleteRequest_wrapper+0xbd
fffff800833b4b30 nt!VerifierIofCompleteRequest+0x10
fffff80d28d56e1b ebUniversalProForEthernet!PIG_UnmapRequest+0x9b
fffff80d28d56338 ebUniversalProForEthernet!PIG_IoControlWaitRequests+0x198
fffff80d28d55728 ebUniversalProForEthernet!PIG_IoControl+0x98
fffff80d28d420c8 ebUniversalProForEthernet!DIS_PlugInIoControl+0x28
fffff80d28d5972e ebUniversalProForEthernet!WFT_DispatchIoControl+0x1ee
fffff80d27cc368b ndis!ndisDummyIrpHandler+0x8b
fffff800833cd548 nt!ViGenericDispatchHandler+0x54
fffff800833cd4b6 nt!ViGenericDeviceControl+0x16
fffff80082d2dcfa nt!IopfCallDriver+0x56

IRP fffffd80d5412ea0, Thread ffffa38df4334700, IRQL = 0, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833af650 nt!IovAllocateIrp+0x30
fffff80082d9e567 nt!IopAllocateIrpExReturn+0x151c67
fffff800830e0b1c nt!IopXxxControlFile+0x3ec
fffff800830e0716 nt!NtDeviceIoControlFile+0x56
fffff80082d83553 nt!KiSystemServiceCopyEnd+0x13
Parsing entry 0xd88/0x4000
IRP fffffd80d5412ea0, Thread ffffa38df4334700, IRQL = 0, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833afe16 nt!IovCompleteRequest+0x2e
fffff80082daaa63 nt!IofCompleteRequest+0x1304f3
fffff80d26f1866d VerifierExt!IofCompleteRequest_wrapper+0xbd
fffff800833b4b30 nt!VerifierIofCompleteRequest+0x10
fffff80d28d597c7 ebUniversalProForEthernet!WFT_DispatchIoControl+0x287
fffff80d27cc368b ndis!ndisDummyIrpHandler+0x8b
fffff800833cd548 nt!ViGenericDispatchHandler+0x54
fffff800833cd4b6 nt!ViGenericDeviceControl+0x16
fffff80082d2dcfa nt!IopfCallDriver+0x56
fffff800833afd5d nt!IovCallDriver+0x275
fffff80082d9e54b nt!IofCallDriver+0x151ccb
fffff800830e155e nt!IopSynchronousServiceTail+0x19e
fffff800830e0d9c nt!IopXxxControlFile+0x66c

IRP fffffd80d5412ea0, Thread ffffa38df4334700, IRQL = 0, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833af650 nt!IovAllocateIrp+0x30
fffff80082d9e567 nt!IopAllocateIrpExReturn+0x151c67
fffff800830e0b1c nt!IopXxxControlFile+0x3ec
fffff800830e0716 nt!NtDeviceIoControlFile+0x56
fffff80082d83553 nt!KiSystemServiceCopyEnd+0x13

IRP fffffd80d5412ea0, Thread ffffa38df1b10080, IRQL = 0, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833afe16 nt!IovCompleteRequest+0x2e
fffff80082daaa63 nt!IofCompleteRequest+0x1304f3
fffff80d28f225ee nsiproxy!NsippDispatch+0xae
fffff80082d2dcfa nt!IopfCallDriver+0x56
fffff800833afd5d nt!IovCallDriver+0x275
fffff80082d9e54b nt!IofCallDriver+0x151ccb
fffff800830e155e nt!IopSynchronousServiceTail+0x19e
fffff800830e0d9c nt!IopXxxControlFile+0x66c
fffff800830e0716 nt!NtDeviceIoControlFile+0x56
fffff80082d83553 nt!KiSystemServiceCopyEnd+0x13

IRP fffffd80d5412ea0, Thread ffffa38df1b10080, IRQL = 0, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833af650 nt!IovAllocateIrp+0x30
fffff80082d9e567 nt!IopAllocateIrpExReturn+0x151c67
fffff800830e0b1c nt!IopXxxControlFile+0x3ec
fffff800830e0716 nt!NtDeviceIoControlFile+0x56
fffff80082d83553 nt!KiSystemServiceCopyEnd+0x13

IRP fffffd80d5412ea0, Thread ffffa38df4205080, IRQL = 0, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833afe16 nt!IovCompleteRequest+0x2e
fffff80082daaa63 nt!IofCompleteRequest+0x1304f3
fffff80d26f1866d VerifierExt!IofCompleteRequest_wrapper+0xbd
fffff800833b4b30 nt!VerifierIofCompleteRequest+0x10
fffff80d28d59ba6 ebUniversalProForEthernet!WFT_DispatchIoControl+0x666
fffff80d27cc368b ndis!ndisDummyIrpHandler+0x8b
fffff800833cd548 nt!ViGenericDispatchHandler+0x54
fffff800833cd4b6 nt!ViGenericDeviceControl+0x16
fffff80082d2dcfa nt!IopfCallDriver+0x56
fffff800833afd5d nt!IovCallDriver+0x275
fffff80082d9e54b nt!IofCallDriver+0x151ccb
fffff800830e155e nt!IopSynchronousServiceTail+0x19e
fffff800830e0d9c nt!IopXxxControlFile+0x66c

IRP fffffd80d5412ea0, Thread ffffa38df4205080, IRQL = 0, KernelApcDisable = 0, SpecialApcDisable = 0
fffff800833af650 nt!IovAllocateIrp+0x30
fffff80082d9e567 nt!IopAllocateIrpExReturn+0x151c67
fffff800830e0b1c nt!IopXxxControlFile+0x3ec
fffff800830e0716 nt!NtDeviceIoControlFile+0x56
fffff80082d83553 nt!KiSystemServiceCopyEnd+0x13
Parsing entry 0x3fff/0x4000

2: kd> dt nt!_IRP fffffd80d5412ea0
+0x000 Type : 0n6
+0x002 Size : 0x160
+0x004 AllocationProcessorNumber : 0
+0x006 Reserved : 0
+0x008 MdlAddress : (null)
+0x010 Flags : 0x40060070
+0x018 AssociatedIrp :
+0x020 ThreadListEntry : _LIST_ENTRY [0xfffffd80d272cec0 - 0xffffa38df4b0cd70]
+0x030 IoStatus : _IO_STATUS_BLOCK
+0x040 RequestorMode : 1 ‘’
+0x041 PendingReturned : 0 ‘’
+0x042 StackCount : 1 ‘’
+0x043 CurrentLocation : 1 ‘’
+0x044 Cancel : 0x1 ‘’
+0x045 CancelIrql : 0x1 ‘’
+0x046 ApcEnvironment : 0 ‘’
+0x047 AllocationFlags : 0x1 ‘’
+0x048 UserIosb : 0x000001af04a7b468 _IO_STATUS_BLOCK
+0x050 UserEvent : (null)
+0x058 Overlay : <unnamed-tag>
+0x068 CancelRoutine : (null)
+0x070 UserBuffer : 0x00000071fd8ff648 Void
+0x078 Tail :

2: kd> !irp fffffd80d5412ea0
Irp is active with 1 stacks 1 is current (= 0xfffffd80d5412f70)
No Mdl: System buffer=fffffd80d511cff0: Thread ffffa38df4b0c700: Irp stack trace.
cmd flg cl Device File Completion-Context
>[IRP_MJ_DEVICE_CONTROL(e), N/A(0)]
0 1 ffffa38df1794460 ffffa38df4253790 00000000-00000000 pending
\Driver\ebUniversalProForEthernet
Args: 00000004 00000010 0x172334 00000000

Irp Extension present at 0xfffffd80d5412fb8:

By the way, I’m now sitting here with my mind completely blown at the depth to which one can drill down with windbg and verifier.

I REALLY want to take that advanced debugging course now. I sure hope you guys get enough people to run it.

It is highly likely that code not using cancel safe queues is implemented
incorrectly.

Mark Roddy

On Tue, Apr 24, 2018 at 10:47 AM, xxxxx@pleora.com <
xxxxx@lists.osr.com> wrote:

> Is your code using the MSFT cancel safe queue mechanism or is it so old
> that it rolled its own cancel processing?

Definitely doing its own cancel processing. One strange thing that I
don’t understand yet is why the cancel handler never gets set for the IRPs
of this particular filter.

The driver is architected like 3 filters in one. A receiver, transmitter
and pcap recorder. They seem to follow slightly different IRP handling
paths though so maybe I need to try to reconcile what they’re doing.



From what I’ve been reading, I tend to agree. This is a very old driver and it’s not using the WDF at all unfortunately. To add insult to injury, it’s ALSO a “cross-platform” driver in that most of the back end data processing code runs on Linux, Windows and OSX using abstractions for the OS specific primitives and structures. You can imagine there are some subtleties lost when the only lock being used is a spinlock. I recently fixed some IRQL_NOT_LESS_OR_EQUAL issues where a mutex was needed instead.

In any event, rewriting it for WDF isn’t in the cards so I’m stuck with it.

It is highly likely that code not using cancel safe queues is implemented
incorrectly.

A few things seem odd to me.

IRQL is 1. Relevant?

Our driver doesn’t seem to be in the stack of the last request. I guess that means we either never completed the request or never got it at all. I’m assuming the IRP gets handed to the driver pretty much instantly after it’s allocated, though.

The IRP itself has a NULL CancelRoutine, so now I REALLY don’t understand why the IP is jumping off into a random memory location, based on my reading of https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/wdm/nf-wdm-iocancelirp

Cancel safe queues are implemented in WDM. See:
https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/cancel-safe-irp-queues

Mark Roddy

On Tue, Apr 24, 2018 at 12:37 PM, xxxxx@pleora.com <
xxxxx@lists.osr.com> wrote:

From what I’ve been reading, I tend to agree. This is a very old driver
and it’s not using the WDF at all unfortunately. To add insult to injury,
it’s ALSO a “cross-platform” driver in that most of the back end data
processing code runs on Linux, Windows and OSX using abstractions for the
OS specific primitives and structures. You can imagine there are some
subtleties lost when the only lock being used is a spinlock. I recently
fixed some IRQL_NOT_LESS_OR_EQUAL issues where a mutex was needed instead.

In any event, rewriting it for WDF isn’t in the cards so I’m stuck with it.

> It is highly likely that code not using cancel safe queues is implemented
> incorrectly.



Ah! Very interesting. I still don’t fully understand the code but I’m not convinced we’re queuing IRPs at all. It looks like we just keep it until we’re done with it and don’t expect any more to come in afterwards. I’m beginning to think this driver is not exceptionally well implemented.

You’ve given me some really good reading material. Thanks.

Cancel safe queues are implemented in WDM. See:
https://docs.microsoft.com/en-us/windows-hardware/drivers/kernel/cancel-safe-irp-queues

The Irp CancelRoutine field is set to null before your cancel routine is
called.

Mark Roddy

On Tue, Apr 24, 2018 at 1:36 PM, xxxxx@pleora.com <
xxxxx@lists.osr.com> wrote:

A few things seem odd to me.

IRQL is 1. Relevant?

The fact that our driver doesn’t seem to be in the stack of the last
request. I guess that means that either we never completed the request or
never even got it at all. I’m assuming the IRP gets handed to the driver
pretty much instantly after it’s allocated though.

The IRP itself has a NULL CancelRoutine so now I REALLY don’t understand
why the IP is jumping off into a random memory location based on my reading
of https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/wdm/nf-wdm-iocancelirp

