MUP / RDBSS / Mm Deadlock

Matt_Klein · April 23, 2009, 3:54pm

Hi All,

Every once in a while when I run my filter tests against MUP/SMB I am seeing a deadlock between an RDBSS worker thread and a handle close operation. Also note that memory mapped IO is always the type of IO that causes this deadlock.

Here is the RDBSS thread:

8b403998 818c52ff nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])
8b4039dc 81862cc8 nt!KiSwapThread+0x44f
8b403a34 8188d9b0 nt!KeWaitForSingleObject+0x492
8b403b64 8188d277 nt!MiFlushSectionInternal+0x1db
8b403bc4 8184db13 nt!MmFlushSection+0xd5
8b403c58 8be2585e nt!CcFlushCache+0x239
8b403cac 8be0cf4e rdbss!RxChangeBufferingState+0x211 (FPO: [SEH])
8b403cd8 8be0d0bc rdbss!RxpProcessChangeBufferingStateRequests+0x285 (FPO: [3,3,4])
8b403cf0 8be0d469 rdbss!RxProcessChangeBufferingStateRequests+0x2a (FPO: [1,0,4])
8b403d14 8be05212 rdbss!RxDispatchChangeBufferingStateRequests+0x96 (FPO: [1,3,4])
8b403d6c 8be330b6 rdbss!RxpWorkerThreadDispatcher+0x138 (FPO: [SEH])
8b403d7c 819e3b18 rdbss!RxBootstrapWorkerThreadDispatcher+0xf (FPO: [1,0,0])
8b403dc0 8183ca2e nt!PspSystemThreadStartup+0x9d
00000000 00000000 nt!KiThreadStartup+0x16

Here is the close call that comes down from user mode:

93eed810 818c52ff nt!KiSwapContext+0x26 (FPO: [Uses EBP] [0,0,4])
93eed854 81862cc8 nt!KiSwapThread+0x44f
93eed8ac 81846834 nt!KeWaitForSingleObject+0x492
93eed8e0 8185fdca nt!ExpWaitForResource+0xbd
93eed904 8be4d6d8 nt!ExAcquireResourceSharedLite+0xe3
93eed924 8be4d329 csc!CscSurrogateIsPreIoFileObject+0x57 (FPO: [2,1,4])
93eed994 841612c7 csc!CscSurrogatePreProcess+0x418 (FPO: [1,21,4])
93eed9b4 841611c3 mup!MupCallSurrogatePrePost+0xd9 (FPO: [1,2,4])
93eed9cc 84161a93 mup!MupStateMachine+0xb1 (FPO: [1,1,0])
93eed9e4 81af06be mup!MupFsdIrpPassThrough+0xc8 (FPO: [2,0,0])
93eeda08 818c9f8a nt!IovCallDriver+0x23f
93eeda1c 807a7ba7 nt!IofCallDriver+0x1b
93eeda40 807a7d64 fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted+0x251 (FPO: [3,4,4])
93eeda78 81af06be fltmgr!FltpDispatch+0xc2 (FPO: [2,6,4])
93eeda9c 818c9f8a nt!IovCallDriver+0x23f
93eedab0 8184e9fa nt!IofCallDriver+0x1b
93eedac4 8188e156 nt!IoSynchronousPageWrite+0x10b
93eedbf4 81878c65 nt!MiFlushSectionInternal+0x97f
93eedc38 81879032 nt!MiCleanSection+0x32
93eedc50 81878950 nt!MiCheckControlArea+0x227
93eedc68 81a52898 nt!MiDereferenceControlAreaBySection+0x2d
93eedc8c 81a525e7 nt!MiSectionDelete+0x101
93eedca8 8185f8c9 nt!ObpRemoveObjectRoutine+0x13d
93eedcd0 81a2b4ca nt!ObfDereferenceObject+0xa1
93eedd14 81a2b6c0 nt!ObpCloseHandleTableEntry+0x24e
93eedd44 81a2b8e5 nt!ObpCloseHandle+0x73
93eedd58 81865a1a nt!NtClose+0x20
93eedd58 77da9a94 nt!KiFastCallEntry+0x12a (FPO: [0,3] TrapFrame @ 93eedd64)
001af8b8 77da7f54 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
001af8bc 76a3cc2e ntdll!ZwClose+0xc (FPO: [1,0,0])
001af8cc 001d1f3b kernel32!CloseHandle+0x40 (FPO: [1,0,4])

The first thread holds the resource that the second thread is waiting for. I’m guessing the first thread, in turn, is waiting on some lock that guards the flush section routines in Mm.

I am definitely getting called in this path, so I’m sure I am causing this somehow, though it’s not clear to me what I am doing wrong.

Does this ring any bells for anyone?

Thanks,
Matt

Pavel_Lebedinsky · April 26, 2009, 3:24am

> Every once in a while when I run my filter tests against MUP/SMB
> I am seeing a deadlock between an RDBSS worker thread and a
> handle close operation. Also note that memory mapped IO is always
> the type of IO that causes this deadlock.
>
> I am definitely getting called in this path, so I’m sure I am causing this
> somehow, though it’s not clear to me what I am doing wrong.

This might actually be an OS bug. What OS/service pack are you
seeing this on? If WS03, you can try this hotfix:

http://support.microsoft.com/kb/960092

If that doesn’t help I’d recommend contacting MS support.

–
Pavel Lebedinsky/Windows Kernel Test
This posting is provided “AS IS” with no warranties, and confers no rights.

Bronislav_Gabrhelik · April 27, 2009, 3:37am

Do you delegate FastIo calls? Something like Acquire/ReleaseForCcFlush comes to my mind. I cannot recall the exact name.

Before a paging I/O is sent to the FSD, the executive resources in the FCB are acquired through a Cc callback or FastIo, and TopLevelIrp is set, so the FSD is aware that resources were acquired for this operation. It seems the close operation didn’t acquire the file lock before flushing, so it didn’t block the RDBSS thread.

Bronislav Gabrhelik

Matt_Klein · April 27, 2009, 1:43pm

The target OS is Vista SP1. I have not tried other OS versions yet.

I “fixed” this problem by trying to acquire the FCB resource shared in the dispatch routine for paging IO against MUP. If I cannot acquire it shared, I fail the write with STATUS_FILE_LOCK_CONFLICT under the assumption that Mm will retry. Otherwise I hold it shared until the completion routine to prevent RDBSS from acquiring it exclusive in between.
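
The control flow of this workaround can be sketched as a small user-mode model in C. This is only an illustration under stated assumptions, not the actual filter code: `ModelResource`, `ModelWrite`, and the `model_*` functions are hypothetical stand-ins. In a real driver the resource would be the FCB's ERESOURCE, and the try-acquire would be a non-blocking `ExAcquireResourceSharedLite(Resource, FALSE)`. Only the status values match the real NTSTATUS codes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same numeric values as the real NTSTATUS codes, renamed for user mode. */
#define MODEL_STATUS_PENDING            0x00000103u
#define MODEL_STATUS_FILE_LOCK_CONFLICT 0xC0000054u

/* Hypothetical stand-in for the FCB's ERESOURCE. */
typedef struct {
    bool exclusive_held; /* models RDBSS holding the resource exclusive */
    int  shared_count;   /* number of shared holders */
} ModelResource;

/* Hypothetical stand-in for the per-IRP state the filter tracks. */
typedef struct {
    bool paging_io;      /* models the IRP_PAGING_IO flag */
    bool holds_shared;   /* set in dispatch, consumed in completion */
} ModelWrite;

/* Non-blocking shared acquire: fails instead of waiting. */
bool model_try_acquire_shared(ModelResource *r) {
    if (r->exclusive_held) {
        return false;
    }
    r->shared_count++;
    return true;
}

void model_release_shared(ModelResource *r) {
    assert(r->shared_count > 0);
    r->shared_count--;
}

/* Dispatch: for paging writes, take the resource shared or fail the write. */
uint32_t model_dispatch_write(ModelResource *fcb, ModelWrite *w) {
    w->holds_shared = false;
    if (w->paging_io) {
        if (!model_try_acquire_shared(fcb)) {
            /* Assumption from the post: Mm retries the write later. */
            return MODEL_STATUS_FILE_LOCK_CONFLICT;
        }
        w->holds_shared = true;
    }
    /* Models forwarding the write down; completion follows later. */
    return MODEL_STATUS_PENDING;
}

/* Completion: drop the shared hold taken in dispatch, if any. */
void model_complete_write(ModelResource *fcb, ModelWrite *w) {
    if (w->holds_shared) {
        model_release_shared(fcb);
        w->holds_shared = false;
    }
}
```

The key point in the model is that the paging write never waits on the resource: it either gets it shared immediately or backs out, so it cannot become the waiting half of the cycle shown in the two stacks.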

I’m sure this “fix” is horribly wrong. Pile on! But it’s the best I could come up with.

I will look into contacting MS support. If any MS people reading this list would like to take a look contact me and I can provide more information. I can reproduce pretty easily with my stress tests if I remove my workaround.

OSR_Community_User · April 27, 2009, 3:17pm

> IO against MUP. If I cannot acquire shared, I fail the write with STATUS_FILE_LOCK_CONFLICT
> under the assumption that Mm will retry.

Amazing way of getting rid of deadlocks in the FSD/FSF world. Is it really working? I’m sure CcFlushCache will retry, but I’m not sure about Mm in general.

–
Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

Matt_Klein · April 27, 2009, 4:34pm

>> is it really working?

As far as I can tell it is working under my unit and stress tests.

I agree that this is very shady, but as I am pretty sure this is an OS bug, I’m not sure what else I can do. The reason I figured this *might* work is that synchronous paging IO should never post, correct? I guess I should extend the check so that I do not take the resource for async paging IO, since this bug does not show up there.
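
A minimal sketch of that extended check, written as a user-mode C predicate over an IRP's flags. The function name and the `MODEL_` prefixes are hypothetical; the numeric values are the ones wdm.h gives for `IRP_PAGING_IO` and `IRP_SYNCHRONOUS_PAGING_IO`. Both bits are tested together because 0x40 is shared with `IRP_INPUT_OPERATION`, so it only means "synchronous paging IO" when the paging bit is also set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same values as IRP_PAGING_IO and IRP_SYNCHRONOUS_PAGING_IO in wdm.h,
   renamed so the sketch compiles without kernel headers. */
#define MODEL_IRP_PAGING_IO             0x00000002u
#define MODEL_IRP_SYNCHRONOUS_PAGING_IO 0x00000040u

/* Hypothetical helper: true only for synchronous paging IO, which is the
   only case where the workaround would take the FCB resource shared. */
bool model_should_take_fcb_shared(uint32_t irp_flags) {
    const uint32_t need =
        MODEL_IRP_PAGING_IO | MODEL_IRP_SYNCHRONOUS_PAGING_IO;
    return (irp_flags & need) == need;
}
```

In a real dispatch routine the argument would be `Irp->Flags`; async paging writes (paging bit set, synchronous bit clear) would then pass through without touching the resource.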