Strange deadlock from the Lazy Writer on Windows 8.1 x64

Hey,

I have an "isolation filter" that has encountered the following deadlock:

Thread "A" is a private thread that acquires my FCB resources, calls CcCoherencyFlushAndPurgeCache() and releases the locks. This thread is stuck in CcCoherencyFlushAndPurgeCache():

nt!KiSwapContext+0x76
nt!KiSwapThread+0x14e
nt!KiCommitThreadWait+0x129
nt!KeWaitForGate+0x10b
nt!MiWaitForFlushInProgress+0x94
nt!MiFlushSectionInternal+0x398
nt!MmFlushSection+0x1a2
nt!CcFlushCachePriv+0x616
nt!CcCoherencyFlushAndPurgeCache+0x60
mydriver!mydriverApiTryCcCoherencyFlushAndPurgeCache+0x71
mydriver!mydriverWorkItemRoutine+0x3ed

On the other hand, I have the lazy writer thread that is stuck in my paging-io write dispatch routine while trying to acquire the same lock:

nt!KiSwapContext+0x76
nt!KiSwapThread+0x14e
nt!KiCommitThreadWait+0x129
nt!ExpWaitForResource+0x29f
nt!ExAcquireResourceExclusiveLite+0x1da
mydriver!mydriverFsWritePagingIo+0x41d
mydriver!mydriverWrite+0x2e
fltmgr!FltpPerformPreCallbacks+0x29f
fltmgr!FltpPassThroughInternal+0x8c
fltmgr!FltpPassThrough+0x2be
fltmgr!FltpDispatch+0x9a
nt!IoSynchronousPageWrite+0x138
nt!MiIssueSynchronousFlush+0x66
nt!MiFlushSectionInternal+0x775
nt! ?? ::FNODOBFM::string'+0x5147d
nt! ?? ::FNODOBFM::string'+0xd92a
nt!ObpRemoveObjectRoutine+0x64
nt!ObfDereferenceObjectWithTag+0x8f
nt!CcDeleteSharedCacheMap+0x101
nt!CcWriteBehindInternal+0x330
nt!ExpWorkerThread+0x28c
nt!PspSystemThreadStartup+0x58
nt!KiStartSystemThread+0x16

At this point I would expect the lazy writer to pre-acquire locks through the CC callbacks or the Fast-Io callbacks, but according to my logs the same thread performs the following operations prior to sending the write request:

  1. calls AcquireForLazyWrite, I return TRUE
  2. calls ReleaseFromLazyWrite
  3. calls AcquireForSectionSynchronization, I return STATUS_SUCCESS
  4. calls ReleaseForSectionSynchronization
  5. Issues the write request that hangs.

Between #4 and #5, my private thread (thread "A") takes the resource and calls CcCoherencyFlushAndPurgeCache(), which never returns.

The odd thing is that I would expect the write request to arrive between #1 and #2 or between #3 and #4. Instead it arrives afterwards, when no locks are pre-acquired.

I should add that I also implement AcquireForCcFlush and AcquireForModifiedPageWriter. According to my logs, these callbacks were not called during this sequence.

Has anyone ever encountered this scenario?

I thought about returning STATUS_FILE_LOCK_CONFLICT in this case to prevent the deadlock but I'm not sure that would be the right solution.

Thanks!

> At this point I would expect the lazy writer to pre-acquire locks through
> the CC callbacks or the Fast-Io callbacks,

In that case I would expect to see an Acquire for Flush, which you are not
seeing, which is strange.

> I thought about returning STATUS_FILE_LOCK_CONFLICT in this case to
> prevent the deadlock but I'm not sure that would be the right solution.

Me too (on both counts). In general I try to avoid relying on that, but in
some cases you have to.

Having said that,

  1. in the plethora of hangs which CcPurge provokes, this is a new one on me
  2. I have never been able to get CcCoherencyFlushAndPurgeCache to work
    successfully, particularly on the network, and have pretty much come to the
    conclusion that unless your locking exactly matches that of NTFS (which of
    course is opaque) you will be SOL. This is not such a big deal since I have
    a single driver policy, but I have to support down level OS versions so I
    have to live with the timing window.

/Rod

Thanks Rod,

I ended up returning STATUS_FILE_LOCK_CONFLICT. The Mm was just looping on the lazy-writer thread until it managed to write something. Luckily the Mm releases its "flush lock" between the iterations so I could get the deadlock solved that way.

I don't know if that's the right approach but couldn't find any resource about it…
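To make the shape of that workaround concrete, here is a minimal user-mode model in C. It is only a sketch, not the driver's code: every `model_*` name is a hypothetical stand-in, the "resource" is a single owner field rather than an ERESOURCE, and recursion and shared access are left out. In a real driver, the non-blocking acquire would presumably be ExAcquireResourceExclusiveLite() with Wait set to FALSE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* User-mode model only: all names below are hypothetical stand-ins. */
typedef uint32_t model_status;
#define MODEL_STATUS_SUCCESS            0x00000000u
#define MODEL_STATUS_FILE_LOCK_CONFLICT 0xC0000054u /* same value as the NTSTATUS */

#define MODEL_NO_OWNER (-1)

/* Stand-in for the FCB resource: at most one exclusive owner, no recursion. */
typedef struct {
    int owner; /* id of the owning thread, or MODEL_NO_OWNER */
} model_resource;

/* Non-blocking exclusive acquire: returns false instead of waiting. */
static bool model_try_acquire(model_resource *res, int thread_id)
{
    if (res->owner != MODEL_NO_OWNER)
        return false;
    res->owner = thread_id;
    return true;
}

/* Release; only the current owner may call this. */
static void model_release(model_resource *res, int thread_id)
{
    assert(res->owner == thread_id);
    res->owner = MODEL_NO_OWNER;
}

/* Paging write path: never block on the resource. If it is busy, fail the
   request with the conflict status so the caller can back off and retry. */
static model_status model_paging_write(model_resource *res, int thread_id,
                                       int *pages_written)
{
    if (!model_try_acquire(res, thread_id))
        return MODEL_STATUS_FILE_LOCK_CONFLICT;

    (*pages_written)++; /* the actual write would be issued here */

    model_release(res, thread_id);
    return MODEL_STATUS_SUCCESS;
}
```

In this model, while thread "A" (say id 1) owns the resource, `model_paging_write()` called for the lazy writer (id 2) returns the conflict status immediately and writes nothing. Once "A" releases, a retry of the same call succeeds, which mirrors the looping behaviour I saw from the Mm.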

Regarding your second remark, I didn't quite understand what it means. Aren't you using the function on your controlled file objects?

Roei

> Regarding your second remark, I didn't quite understand what it means.
> Aren't you using the function on your controlled file objects?

No, because of down level considerations I do a CcFlushCache followed by a
CcPurgeCacheSection.