thread stuck on NTFS write: Windows 7 64 bit

Hello ntfsd,

We have a legacy filter driver that keeps track of changes happening
on files for a volume. It writes things back to a log file especially
when a file handle is closed.

We are observing an occasional hang in the system, it happens when a
new file has been created, here is how it looks.

Child-SP RetAddr Call Site

01 fffff8800919ed50 fffff800036e5d8f nt!KiCommitThreadWait+0x1d2
02 fffff8800919ede0 fffff8800165850a nt!KeWaitForSingleObject+0x19f
<– (stuck for a wait here)
03 fffff8800919ee80 fffff88001002bcf Ntfs!NtfsFsdWrite+0x26a
04 fffff8800919ef40 fffff880010016df
fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted+0x24f
05 fffff8800919efd0 fffff88001247b0b fltmgr!FltpDispatch+0xcf
06 fffff8800919f030 fffff88001248852 mfl!call_driver_with_sync+0xdb
<– (We create our own Write IRP and send it down.)
… internal bookkeeping done in these frames …
12 fffff8800919f600 fffff880012686c2 mfl!event_handle+0x96
13 fffff8800919f670 fffff8800127309c mfl!hook_cleanup_cmpl+0x122
14 fffff8800919f6d0 fffff880012314b9 mfl!fs_hook_cleanup_cmpl+0x3c
… More internal book keeping …
1a fffff8800919f8b0 fffff800039e25ef mfl!fs_flt_cleanup+0x89 <–
(we see a cleanup IRP)
1b fffff8800919f8f0 fffff800039d1d14 nt!IopCloseFile+0x11f
1c fffff8800919f980 fffff800039d1ad1 nt!ObpDecrementHandleCount+0xb4
1d fffff8800919fa00 fffff800039d2094 nt!ObpCloseHandleTableEntry+0xb1
1e fffff8800919fa90 fffff800036de253 nt!ObpCloseHandle+0x94
1f fffff8800919fae0 0000000077b4140a nt!KiSystemServiceCopyEnd+0x13
20 000000000008e318 0000000000000000 0x77b4140a

IRP that is stuck:

0: kd> !irp 0xfffffa80`074f5bd0
Irp is active with 11 stacks 10 is current (= 0xfffffa80074f5f28)
Mdl=fffffa800d996bf0: No System Buffer: Thread fffffa800bdf2b50: Irp
stack trace.
cmd flg cl Device File Completion-Context
[0, 0] 0 0 00000000 00000000 00000000-00000000

… skipped empty stack locations for brevity

Args: 00000000 00000000 00000000 00000000
[0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000

[4, 0] 0 e0 fffffa80098dc030 fffffa800c731a10 fffff88001005e40-fffffa800e0e6cb0 Success Error Cancel
\FileSystem\Ntfs fltmgr!FltpPassThroughCompletion
Args: 00000004 00000000 0000027e 00000000
[4, 0] 0 e1 fffffa8009654de0 fffffa800c731a10
fffff88001247b60-fffffa800c8cead0 Success Error Cancel pending
\FileSystem\FltMgr mfl!completion_routine
Args: 00000004 00000000 ffffffffffffffff 00000000

The associated file object

0: kd> !fileobj fffffa800c731a10

\sc\sclog

Device Object: 0xfffffa800961cb10 \Driver\volmgr
Vpb: 0xfffffa800961ca50
Event signalled
Access: Write SharedRead SharedWrite SharedDelete

Flags: 0x40042
Synchronous IO
Cache Supported
Handle Created

FsContext: 0xfffff8a01c114c70 FsContext2: 0xfffff8a017bc1b40
CurrentByteOffset: 0
Cache Data:
Section Object Pointers: fffffa800b9050c8
Shared Cache Map: 00000000

I tried to make a guess about what is going on by disassembling bits
but couldn’t find much, here is a pieced up block of what
code was executed

Ntfs!NtfsFsdWrite+0x164:
fffff88001658404 81a330010000fffbffff and dword ptr [rbx+130h],0FFFFFBFFh fffff8800165840e 8b8330010000 mov eax,dword ptr [rbx+130h]
fffff88001658414 098334010000 or dword ptr [rbx+134h],eax fffff8800165841a 488b4620 mov rax,qword ptr [rsi+20h]
fffff880`0165841e 48894340 mov qword ptr [rbx+40h],rax

Ntfs!NtfsFsdWrite+0x182:
fffff88001658422 4584ff test r15b,r15b fffff88001658425 7546 jne Ntfs!NtfsFsdWrite+0x1cd
(fffff880`0165846d)

Ntfs!NtfsFsdWrite+0x187:
fffff88001658427 81ff880100c0 cmp edi,0C0000188h \<-- this is false fffff8800165842d 0f8499000000 je Ntfs!NtfsFsdWrite+0x22c
(fffff880016584cc) Ntfs!NtfsFsdWrite+0x193: fffff88001658433 81ff67030000 cmp edi,367h
fffff88001658439 0f849a000000 je Ntfs!NtfsFsdWrite+0x239 (fffff880016584d9)

Ntfs!NtfsFsdWrite+0x239:
fffff880016584d9 498b8c24b8000000 mov rcx,qword ptr [r12+0B8h] fffff880016584e1 0fb64103 movzx eax,byte ptr [rcx+3]
fffff880016584e5 83e0fe and eax,0FFFFFFFEh fffff880016584e8 884103 mov byte ptr [rcx+3],al
fffff880016584eb 488b4b78 mov rcx,qword ptr [rbx+78h] fffff880016584ef 4883c108 add rcx,8
fffff880016584f3 48c744242000000000 mov qword ptr [rsp+20h],0 fffff880016584fc 4533c9 xor r9d,r9d
fffff880016584ff 4533c0 xor r8d,r8d fffff88001658502 33d2 xor edx,edx
fffff88001658504 ff153ee70200 call qword ptr [Ntfs!_imp_KeWaitForSingleObject (fffff88001686c48)] <– we are
waiting here


fffff8800165850a 894504 mov dword ptr [rbp+4],eax fffff8800165850d 48837b7800 cmp qword ptr [rbx+78h],0
fffff88001658512 7422 je Ntfs!NtfsFsdWrite+0x296 (fffff88001658536)

Do let me know if anyone knows anything about it. or how to go about
root causing this.

TIA,
Johri

“Amritanshu” wrote in message news:xxxxx@ntfsd…

Ntfs!NtfsFsdWrite+0x187:
cmp edi,0C0000188h <– this is false
je Ntfs!NtfsFsdWrite+0x22c
Ntfs!NtfsFsdWrite+0x193:
cmp edi,367h
je Ntfs!NtfsFsdWrite+0x239

Assuming that EDI contains an NTSTATUS value:

E:\>winerror -s C0000188
1502 ERROR_LOG_FILE_FULL <–> c0000188 STATUS_LOG_FILE_FULL

E:\>winerror -s 367
765 ERROR_WAIT_FOR_OPLOCK <–> 00000367 STATUS_WAIT_FOR_OPLOCK

So, seems like NTFS waiting for an oplock break. Any other threads out there
stuck in your driver or NTFS? (“!stacks 2 mfl!”, “!stacks 2 ntfs!”). It’s
also possible that the cleanup you’re currently blocking is the one that
needs to trigger the oplock break.

-scott
OSR

Hello ntfsd,

We have a legacy filter driver that keeps track of changes happening
on files for a volume. It writes things back to a log file especially
when a file handle is closed.

We are observing an occasional hang in the system, it happens when a
new file has been created, here is how it looks.

Child-SP RetAddr Call Site

01 fffff8800919ed50 fffff800036e5d8f nt!KiCommitThreadWait+0x1d2
02 fffff8800919ede0 fffff8800165850a nt!KeWaitForSingleObject+0x19f
<– (stuck for a wait here)
03 fffff8800919ee80 fffff88001002bcf Ntfs!NtfsFsdWrite+0x26a
04 fffff8800919ef40 fffff880010016df
fltmgr!FltpLegacyProcessingAfterPreCallbacksCompleted+0x24f
05 fffff8800919efd0 fffff88001247b0b fltmgr!FltpDispatch+0xcf
06 fffff8800919f030 fffff88001248852 mfl!call_driver_with_sync+0xdb
<– (We create our own Write IRP and send it down.)
… internal bookkeeping done in these frames …
12 fffff8800919f600 fffff880012686c2 mfl!event_handle+0x96
13 fffff8800919f670 fffff8800127309c mfl!hook_cleanup_cmpl+0x122
14 fffff8800919f6d0 fffff880012314b9 mfl!fs_hook_cleanup_cmpl+0x3c
… More internal book keeping …
1a fffff8800919f8b0 fffff800039e25ef mfl!fs_flt_cleanup+0x89 <–
(we see a cleanup IRP)
1b fffff8800919f8f0 fffff800039d1d14 nt!IopCloseFile+0x11f
1c fffff8800919f980 fffff800039d1ad1 nt!ObpDecrementHandleCount+0xb4
1d fffff8800919fa00 fffff800039d2094 nt!ObpCloseHandleTableEntry+0xb1
1e fffff8800919fa90 fffff800036de253 nt!ObpCloseHandle+0x94
1f fffff8800919fae0 0000000077b4140a nt!KiSystemServiceCopyEnd+0x13
20 000000000008e318 0000000000000000 0x77b4140a

IRP that is stuck:

0: kd> !irp 0xfffffa80`074f5bd0
Irp is active with 11 stacks 10 is current (= 0xfffffa80074f5f28)
Mdl=fffffa800d996bf0: No System Buffer: Thread fffffa800bdf2b50: Irp
stack trace.
cmd flg cl Device File Completion-Context
[0, 0] 0 0 00000000 00000000 00000000-00000000

… skipped empty stack locations for brevity

Args: 00000000 00000000 00000000 00000000
[0, 0] 0 0 00000000 00000000 00000000-00000000

Args: 00000000 00000000 00000000 00000000

[4, 0] 0 e0 fffffa80098dc030 fffffa800c731a10
fffff88001005e40-fffffa800e0e6cb0 Success Error Cancel
\FileSystem\Ntfs fltmgr!FltpPassThroughCompletion
Args: 00000004 00000000 0000027e 00000000
[4, 0] 0 e1 fffffa8009654de0 fffffa800c731a10
fffff88001247b60-fffffa800c8cead0 Success Error Cancel pending
\FileSystem\FltMgr mfl!completion_routine
Args: 00000004 00000000 ffffffffffffffff 00000000

The associated file object

0: kd> !fileobj fffffa800c731a10

\sc\sclog

Device Object: 0xfffffa800961cb10 \Driver\volmgr
Vpb: 0xfffffa800961ca50
Event signalled
Access: Write SharedRead SharedWrite SharedDelete

Flags: 0x40042
Synchronous IO
Cache Supported
Handle Created

FsContext: 0xfffff8a01c114c70 FsContext2: 0xfffff8a017bc1b40
CurrentByteOffset: 0
Cache Data:
Section Object Pointers: fffffa800b9050c8
Shared Cache Map: 00000000

I tried to make a guess about what is going on by disassembling bits
but couldn’t find much, here is a pieced up block of what
code was executed

Ntfs!NtfsFsdWrite+0x164:
fffff88001658404 81a330010000fffbffff and dword ptr [rbx+130h],0FFFFFBFFh fffff8800165840e 8b8330010000 mov eax,dword ptr [rbx+130h]
fffff88001658414 098334010000 or dword ptr [rbx+134h],eax fffff8800165841a 488b4620 mov rax,qword ptr [rsi+20h]
fffff880`0165841e 48894340 mov qword ptr [rbx+40h],rax

Ntfs!NtfsFsdWrite+0x182:
fffff88001658422 4584ff test r15b,r15b fffff88001658425 7546 jne Ntfs!NtfsFsdWrite+0x1cd
(fffff880`0165846d)

Ntfs!NtfsFsdWrite+0x187:
fffff88001658427 81ff880100c0 cmp edi,0C0000188h \<-- this is false fffff8800165842d 0f8499000000 je Ntfs!NtfsFsdWrite+0x22c
(fffff880016584cc) Ntfs!NtfsFsdWrite+0x193: fffff88001658433 81ff67030000 cmp edi,367h
fffff88001658439 0f849a000000 je Ntfs!NtfsFsdWrite+0x239 (fffff880016584d9)

Ntfs!NtfsFsdWrite+0x239:
fffff880016584d9 498b8c24b8000000 mov rcx,qword ptr [r12+0B8h] fffff880016584e1 0fb64103 movzx eax,byte ptr [rcx+3]
fffff880016584e5 83e0fe and eax,0FFFFFFFEh fffff880016584e8 884103 mov byte ptr [rcx+3],al
fffff880016584eb 488b4b78 mov rcx,qword ptr [rbx+78h] fffff880016584ef 4883c108 add rcx,8
fffff880016584f3 48c744242000000000 mov qword ptr [rsp+20h],0 fffff880016584fc 4533c9 xor r9d,r9d
fffff880016584ff 4533c0 xor r8d,r8d fffff88001658502 33d2 xor edx,edx
fffff88001658504 ff153ee70200 call qword ptr [Ntfs!_imp_KeWaitForSingleObject (fffff88001686c48)] <– we are
waiting here


fffff8800165850a 894504 mov dword ptr [rbp+4],eax fffff8800165850d 48837b7800 cmp qword ptr [rbx+78h],0
fffff88001658512 7422 je Ntfs!NtfsFsdWrite+0x296 (fffff88001658536)

Do let me know if anyone knows anything about it. or how to go about
root causing this.

TIA,
Johri

Hi Scott,
Thank you for helping out on this one. You were right, there was
another operation happening on the file that was resulting in a create
‘collision’. Based on our logging in one instance that create was
waiting for the OPLOCK break, in most instances the create had just
completed so no ntfs stack was around, But in all cases the original
thread was stuck on IO in the same path.

On further analysis it was found that in certain condition a return
status of STATUS_OPLOCK_BREAK_IN_PROGRESS during create was not
resulting FSCTL_NOTIFY_BREAK being issued and the driver was going
ahead. Perhaps this was locking up things.

Do let me know if it doesn’t makes sense.
Thanks a ton.
Regards,
Johri

“Amritanshu” wrote in message news:xxxxx@ntfsd…

On further analysis it was found that in certain condition a return
status of STATUS_OPLOCK_BREAK_IN_PROGRESS during create was not
resulting FSCTL_NOTIFY_BREAK being issued and the driver was going
ahead. Perhaps this was locking up things.

Do you mean that you were not specifically handling a return of
STATUS_OPLOCK_BREAK_IN_PROGRESS from an open with FILE_COMPLETE_IF_OPLOCKED
set? That could certainly be problematic.

-scott
OSR

“Amritanshu” wrote in message news:xxxxx@ntfsd…
Hi Scott,
Thank you for helping out on this one. You were right, there was
another operation happening on the file that was resulting in a create
‘collision’. Based on our logging in one instance that create was
waiting for the OPLOCK break, in most instances the create had just
completed so no ntfs stack was around, But in all cases the original
thread was stuck on IO in the same path.

On further analysis it was found that in certain condition a return
status of STATUS_OPLOCK_BREAK_IN_PROGRESS during create was not
resulting FSCTL_NOTIFY_BREAK being issued and the driver was going
ahead. Perhaps this was locking up things.

Do let me know if it doesn’t makes sense.
Thanks a ton.
Regards,
Johri