Possible causes for ERROR_NOT_ENOUGH_MEMORY

Hello there,

I’m toying with a Dokan-based user-mode file system. While testing with a high-load production scenario, I ran into the following: tools performing IO in the space served by my driver occasionally fail with ERROR_NOT_ENOUGH_MEMORY. The cases I’ve seen so far come from C#'s System.IO.File.Copy ~ System.IO.IOException: Not enough storage is available to process this command. These errors do not seem to correlate with any errors reported by the driver/user mode component (I’m 99% sure about user mode, slightly less so regarding the driver, since my means of error reporting from there are underdeveloped, even more so in release mode).

I had some initial ideas tied to things preceding the errors (many relatively small files written in a short period of time), the way my user mode part works, and the machine I’m testing this on - under some circumstances, the user mode component would leave files open after writing arbitrary amounts of data to them, closing them only after a few minutes. Fixing this didn’t fix the issue, though.

Now I’m in the process of establishing a reasonable reproduction scenario (since the entire production scenario is rather difficult to work with), and hence I’m yet to try some of the more obvious things (running the driver in debug mode/with Verifier and checking what’s going on down there when the error occurs - the problem being that in the test the issue starts to appear only after maybe 30 minutes and many GBs read/written, so I’m afraid debug features will make this a “wait a day and maybe you’ll see”).
I was able to reproduce it with a simplified scenario briefly - copying a large number of files sequentially from the space served by the driver - but the repro ceased to work after an unrelated user mode component change and some reboots, and I’m not sure about the conditions preceding the successful repro, specifically whether the machine was freshly rebooted or whether it was “tired” already. Such a thing suggests I might be leaking something, but the driver was OK the last time I used Verifier with it (then again, I haven’t done that in a while and I have made some changes to the driver since) - and the test machine itself seems to be doing fine.

I’d be grateful for any insights into possible causes for this. The machine has plenty of available physical memory, and there should not be any large amount of pending IO going on. What resource am I running out of? What should I be monitoring?

Thanks for help,
L.

Have you considered the possibility of resource exhaustion in the application? Another possibility is that whatever heap is being used (in user mode) is of finite size and being exhausted. Thus, things to look for might include memory leaks in the user mode components. Have you monitored the size of the address space of the running application? For example, I’ve seen 32-bit applications exhaust their (~1.8GB) address space, which can be quite frustrating on a machine with 128GB of RAM (“how can I be out of memory?”)
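
As an illustration, here’s a minimal probe sketch - assuming it runs inside the affected process, since GlobalMemoryStatusEx reports the calling process’s own virtual address space. In a 32-bit process you would watch ullAvailVirtual head toward zero while physical memory stays plentiful:

#include <windows.h>
#include <stdio.h>

/* Periodically sample the calling process's virtual address space headroom. */
int main(void)
{
    MEMORYSTATUSEX ms;

    for (;;) {
        ms.dwLength = sizeof(ms);
        if (!GlobalMemoryStatusEx(&ms))
            return 1;
        printf("VA total: %llu MB, VA free: %llu MB, phys free: %llu MB\n",
               ms.ullTotalVirtual >> 20,
               ms.ullAvailVirtual >> 20,
               ms.ullAvailPhys >> 20);
        Sleep(5000); /* sample every 5 seconds */
    }
}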

In kernel mode it’s possible that you’re getting a quota exhaustion issue as well. This can happen, for example, when trying to lock down too many user mode buffers. I presume your drivers have some sort of tracing facility enabled, so you might want to see if there are any errors (like STATUS_QUOTA_EXCEEDED or STATUS_INSUFFICIENT_RESOURCES, etc.) that could correspond to this behavior as well.

Tony
OSR

Hi Tony,

thanks a lot for the response!

I do believe the applications getting these errors are “correct” on their own. We’re using them (the entire “test case”) for production purposes, where they run atop a standard file system in 100% the same manner without issues (and the ones giving the error are 64-bit).

I have a rudimentary error counter implemented in the driver, and it hasn’t picked anything up during these tests yet. But - I only added it a few days ago while investigating this particular issue (not a thing to be proud of, I know), and it is very much possible I didn’t plug it into all the places necessary. And I don’t have a full trace, since I didn’t dare run the driver in debug mode with the only reliable repro I have (= the big one :-)). However - the user mode buffer locking does ring a bell here! I switched large read response IOCTLs from user mode from buffered to direct IO quite recently, so there is probably more locking than before, plus the driver keeps the read target buffer locked too. I will focus on this in the driver, and I will get more educated re. quotas in general.

Many thanks!
L.

Try to repro without .NET first, in a native C/C++ app.

Then try to use native APIs instead of Win32, or port the test to kernel mode.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

MmProbeAndLockPages can raise a STATUS_WORKING_SET_QUOTA exception if locking the pages fails for certain reasons. This can cause IO to fail with the same status code, though it doesn’t actually have anything to do with quotas of any kind.
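
For reference, the usual way a driver encounters that raise is around a manual lock of a user buffer - a minimal sketch with hypothetical names, not from any particular codebase:

/* Lock a user buffer for direct IO. MmProbeAndLockPages raises an
   exception on failure (e.g. STATUS_WORKING_SET_QUOTA or
   STATUS_ACCESS_VIOLATION) rather than returning a status, so the
   raised code must be caught and propagated by hand. */
NTSTATUS LockUserBuffer(PVOID UserBuffer, ULONG Length, PMDL *LockedMdl)
{
    NTSTATUS status = STATUS_SUCCESS;
    PMDL mdl = IoAllocateMdl(UserBuffer, Length, FALSE, FALSE, NULL);

    if (mdl == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    __try {
        MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
        *LockedMdl = mdl;
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        status = GetExceptionCode();
        IoFreeMdl(mdl);
    }
    return status;
}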

If you can reproduce this under a debugger (or capture a kernel dump after the problem has occurred), can you run these debugger commands and include the output?

dd nt!MiProbeRaises l28
vertarget

Thanks a lot for the responses! Based on Tony’s suggestion regarding exhausting the quota with too many locks (and the nature of the circumstances under which this happens in the production test ~ the number of concurrently running copies), I tortured the driver on the production test machine with some 8 concurrent “copy directory tree” batches, and I did get multiple failures at some point. I’ll proceed with a more tailored directory to copy (the current one was a mixture of small and large files).

>If you can reproduce this under a debugger
Not sure this will count, but I ran LiveKd afterwards and ran the suggested commands:

0: kd> dd nt!MiProbeRaises 128 L8
fffff8000344ebc0 00000000 0013533e 00000000 00000000
fffff8000344ebd0 00000000 00000000 00000000 00000000

0: kd> vertarget
Windows 7 Kernel Version 7601 (Service Pack 1) MP (8 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 7601.18798.amd64fre.win7sp1_gdr.150316-1654
Machine Name:
Kernel base = 0xfffff80003209000 PsLoadedModuleList = 0xfffff8000344e890
Debug session time: Tue Sep 1 02:36:48.181 2015 (UTC + 2:00)
System Uptime: 1 days 2:09:03.996

That should have been l28 (or L28), not 128. Can you try this exact command?

dd nt!MiProbeRaises L28

Ok, I think I might be on the right track, finally. I checked old test logs, and it seems the issue may be tied to my switching “large read response IOCTLs from user mode from buffered to direct IO”, as mentioned above. I did that since the response IOCTLs were prone to failing with ERROR_NO_SYSTEM_RESOURCES when dealing with large data, and it seemed like a good idea documentation/performance-wise too. I’m calling MmGetSystemAddressForMdlSafe to get the buffers coming with the response, and I didn’t realize I would be increasing the pressure inside too (the driver already has the buffer coming with the original read IRP locked in the same manner, and this lock may be held for a non-trivial amount of time). I’ll proceed with driver error & status monitoring to get a better idea of how things look when this happens.
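
For the record, the mapping in question looks roughly like this (a sketch with hypothetical names, not the actual driver code). Each such call pins the locked pages and consumes system PTEs for as long as the mapping lives, on top of the already-locked buffer of the original read IRP:

/* METHOD_OUT_DIRECT IOCTL path: the IO manager has already probed and
   locked the user buffer described by Irp->MdlAddress; map it into
   system space so the driver can fill it with the response data. */
NTSTATUS MapReadResponseBuffer(PIRP Irp, PVOID *SystemBuffer)
{
    if (Irp->MdlAddress == NULL)
        return STATUS_INVALID_PARAMETER;

    /* The "Safe" variant returns NULL instead of bugchecking when
       system PTEs run out - that must be treated as a failure. */
    *SystemBuffer = MmGetSystemAddressForMdlSafe(Irp->MdlAddress,
                                                 NormalPagePriority);
    if (*SystemBuffer == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;

    return STATUS_SUCCESS;
}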

Sure, sorry, didn’t realize that.

0: kd> dd nt!MiProbeRaises L28
fffff8000344ebc0 00000000 0013533e 00000000 00000000
fffff8000344ebd0 00000000 00000000 00000000 00000000
fffff8000344ebe0 00000000 00000000 00000000 00000000
fffff8000344ebf0 00000000 0000017f 00000000 00000000
fffff8000344ec00 00000000 00000000 00000000 00000000
fffff8000344ec10 0000017f 00000000 00000000 00000000
fffff8000344ec20 00000000 00000000 00000000 00000000
fffff8000344ec30 00000000 00000000 00000000 00000000
fffff8000344ec40 00000000 00000000 00000000 00000000
fffff8000344ec50 00000000 00000000 00000000 00000000

There are some non-zero failure counters here, but these failures are likely caused by application problems (like passing invalid buffers to system calls). I think your problem is probably not related to buffer locking.

Thanks for the check, Pavle. I’ll proceed with the repro to get better idea.

So I have some sort of leak/corruption. While working on the repro, I encountered a more obvious issue than the original one. The repro writes N files of size M to driver-managed space, then creates X threads and uses File.Copy to copy each file to a thread-specific location outside the driver-managed space. When I run this with 10 threads and 4kB files, the repro application goes nuts as soon as the File.Copy stage begins: Process Explorer says the process’s private bytes jump to several gigabytes (10-22) out of nothing. I have no idea what the memory is so far, but the repro app itself doesn’t allocate a thing during the File.Copy phase, and the allocation tracking I have in the driver says there were but a couple hundred allocations of blocks up to around 4kB (which seems correct given the data being processed). I tried dumping the sizes of the MDLs I get with IRPs and haven’t seen anything but 4kB (once again - seems correct); I’m running with Verifier now too, with no difference/stops so far, but I’ll keep on digging :-)
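
For completeness, a native equivalent of the repro (in the spirit of Maxim’s earlier suggestion to drop .NET) is small. Paths, counts and sizes below are placeholders, and the destination directories are assumed to exist:

#include <windows.h>
#include <stdio.h>

#define NUM_THREADS 10
#define NUM_FILES   1000

/* Each worker copies the same N files out of the driver-managed space
   (X: here) into its own destination directory. */
static DWORD WINAPI CopyWorker(LPVOID param)
{
    int thread = (int)(INT_PTR)param;
    WCHAR src[MAX_PATH], dst[MAX_PATH];

    for (int i = 0; i < NUM_FILES; i++) {
        swprintf_s(src, MAX_PATH, L"X:\\testdata\\file%05d.bin", i);
        swprintf_s(dst, MAX_PATH, L"C:\\repro\\t%02d\\file%05d.bin",
                   thread, i);
        if (!CopyFileW(src, dst, FALSE))
            wprintf(L"thread %d: copy %d failed, error %lu\n",
                    thread, i, GetLastError());
    }
    return 0;
}

int main(void)
{
    HANDLE threads[NUM_THREADS];

    for (int t = 0; t < NUM_THREADS; t++) {
        threads[t] = CreateThread(NULL, 0, CopyWorker,
                                  (LPVOID)(INT_PTR)t, 0, NULL);
        if (threads[t] == NULL)
            return 1;
    }
    WaitForMultipleObjects(NUM_THREADS, threads, TRUE, INFINITE);
    return 0;
}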

And here’s the happy end. I feel silly, but let’s say the happiness prevails. In the appropriate circumstances, the driver returned uninitialized memory for FileEaInformation, which the native CopyFile uses during copying -> randomly sized allocation -> ERROR_NOT_ENOUGH_MEMORY from the user-mode process heap in a 32-bit process, crazy allocations in a 64-bit process. Thanks again for the help!
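
The fix itself amounts to explicitly filling the structure in the FileEaInformation branch of the query-information handler - a sketch with hypothetical names, not the actual code:

/* CopyFile reads EaSize to decide how large a buffer to allocate for
   the extended attributes, so it must be set explicitly - never left
   as whatever the output buffer happened to contain. */
NTSTATUS QueryEaInformation(PIRP Irp, PVOID Buffer, ULONG BufferLength)
{
    PFILE_EA_INFORMATION eaInfo = (PFILE_EA_INFORMATION)Buffer;

    if (BufferLength < sizeof(FILE_EA_INFORMATION))
        return STATUS_BUFFER_TOO_SMALL;

    eaInfo->EaSize = 0; /* no extended attributes supported */
    Irp->IoStatus.Information = sizeof(FILE_EA_INFORMATION);
    return STATUS_SUCCESS;
}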

I’m one of the maintainers of Dokany, a fork of Dokan created to keep it alive.

I was watching your thread with interest - great that you resolved your issue. You said your driver is based on Dokan; is this issue coming from your own changes, or does it exist in the base?
If it is the latter - and because it is an open source project - please share your changes so we can review and fix it appropriately. Thanks!

Hi Maxime,

no, this issue should not be part of the original codebase - I believe I introduced it with changes to where and how the request buffer gets zeroed in the user mode native module. That said - great to see Dokan is alive and well. I’ll check Dokany out (soon™) and let you folks know if I have anything you might want to integrate into the codebase (or vice versa, which is more likely :)).

Cheers,
L.