Hi all,
I’ve been chasing a BSOD that has recently started cropping up in our NDIS filter driver. It seems to have appeared with newer kernel versions, so I suspect we’re violating some constraint that wasn’t strictly checked previously.
The filter sits near the bottom of the stack, intercepts GigE Vision Stream Protocol (GVSP) packets, and coalesces them into a single image frame. This way the client can ask for a frame at a time and avoid an extra memcpy to reassemble the packets.
This portion of the filter works fine, but we also have an additional feature built in that allows packet capture à la PCAP. A client app can queue up buffers that the driver fills in with the network packet info. The client can then dump those packets to a .pcap file, which can be read by Wireshark.
This code has been around, mostly unchanged, for years, but it has recently started crashing when the client stops recording. On the client side there is a thread which uses ioctls to queue and read buffers in a loop until the user stops capturing, at which point the thread finishes reading and exits. The crash looks like this:
   Child-SP          RetAddr           Call Site
00 fffff880`06a78f08 fffff800`02b6d662 nt!DbgBreakPointWithStatus
01 fffff880`06a78f10 fffff800`02b6e44e nt!KiBugCheckDebugBreak+0x12
02 fffff880`06a78f70 fffff800`02a7cf04 nt!KeBugCheck2+0x71e
03 fffff880`06a79640 fffff800`02a7c3a9 nt!KeBugCheckEx+0x104
04 fffff880`06a79680 fffff800`02a7bcfc nt!KiBugCheckDispatch+0x69
05 fffff880`06a797c0 fffff800`02aa8d8d nt!KiSystemServiceHandler+0x7c
06 fffff880`06a79800 fffff800`02aa7b65 nt!RtlpExecuteHandlerForException+0xd
07 fffff880`06a79830 fffff800`02ab90bd nt!RtlDispatchException+0x415
08 fffff880`06a79f10 fffff800`02a7c48e nt!KiDispatchException+0x17d
09 fffff880`06a7a5a0 fffff800`02a7a5df nt!KiExceptionDispatch+0xce
0a fffff880`06a7a780 fffffa80`067258f0 nt!KiInvalidOpcodeFault+0x11f
0b fffff880`06a7a918 fffff800`02a2db40 0xfffffa80`067258f0
0c fffff880`06a7a920 fffff800`02d53dc4 nt!IoCancelIrp+0x60
0d fffff880`06a7a960 fffff800`02d5477d nt!IoCancelThreadIo+0x44
0e fffff880`06a7a990 fffff800`02d54ded nt!PspExitThread+0x58d
0f fffff880`06a7aa50 fffff800`02d54ef5 nt!PspTerminateThreadByPointer+0x4d
10 fffff880`06a7aaa0 fffff800`02a7c093 nt!NtTerminateThread+0x45
11 fffff880`06a7aae0 00000000`76d2c26a nt!KiSystemServiceCopyEnd+0x13
12 00000000`01cffd68 00000000`76d219e8 ntdll!ZwTerminateThread+0xa
13 00000000`01cffd70 00000000`76bd59d5 ntdll!RtlExitUserThread+0x48
14 00000000`01cffdb0 00000000`76d0a561 kernel32!BaseThreadInitThunk+0x15
15 00000000`01cffde0 00000000`00000000 ntdll!RtlUserThreadStart+0x1d
Frame 0b shows IoCancelIrp calling into a raw address rather than a function in any module, so it seems like we have a non-NULL yet invalid IRP->CancelRoutine, but for the life of me I can’t see how this is happening. We don’t seem to ever set the cancel routine to anything other than NULL within the driver. I suspect we might be seeing the IRP equivalent of a double-free (for example, the thread still tracking an IRP we have already completed), but I don’t even know if that’s possible. The other question is why we’re seeing IRPs getting canceled at thread exit in the first place. From my reading of the client code, it should have already cleaned them all up.
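To make that reading of the stack concrete, here is a rough user-mode toy model of how I understand the cancel-routine handoff in IoCancelIrp: mark the IRP as canceled, swap the cancel routine out for NULL, and call whatever was there if it was non-NULL. This is only a sketch, not kernel code. All of the toy_* names are made up for illustration, and the real routine also deals with the cancel spin lock and IRQL, which I've left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; NOT the real definitions. */
struct toy_irp;
typedef void (*toy_cancel_fn)(struct toy_irp *irp);

struct toy_irp {
    bool          cancel;         /* models Irp->Cancel */
    toy_cancel_fn cancel_routine; /* models Irp->CancelRoutine */
    int           cancel_calls;   /* bookkeeping for this sketch only */
};

/* Models IoSetCancelRoutine: install a new routine, return the old one. */
toy_cancel_fn toy_set_cancel_routine(struct toy_irp *irp, toy_cancel_fn fn)
{
    toy_cancel_fn old = irp->cancel_routine;
    irp->cancel_routine = fn;
    return old;
}

/* Models the core of IoCancelIrp: set the cancel flag, take ownership of
 * the cancel routine by swapping in NULL, and call it if one was set.
 * Returns true only when a routine was found and called. */
bool toy_cancel_irp(struct toy_irp *irp)
{
    irp->cancel = true;
    toy_cancel_fn fn = toy_set_cancel_routine(irp, NULL);
    if (fn == NULL) {
        return false;  /* nothing registered, nothing to call */
    }
    fn(irp);           /* ANY non-NULL value is called, valid or not */
    return true;
}

/* Example cancel routine: just counts how often it was invoked. */
void toy_on_cancel(struct toy_irp *irp)
{
    irp->cancel_calls++;
}
```

The point is that the only check is "non-NULL". If the memory behind the IRP had been freed or reused while the thread still had it on its pending list, whatever bytes sit in that field would be treated as a function pointer, which would fit the jump to a bare address followed by KiInvalidOpcodeFault.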
I’m not really sure of the best way to chase this down. I’m signed up for the Advanced Debugging course next month, but I’d really like to make some progress before then. Can anyone suggest some possible starting points to investigate? TIA.
cheers,
Chris Warkentin