A persistent problem that plagues filter drivers of all sorts is the limitation on the size of the kernel stack - a svelte 12KB in the x86 kernel mode environment. When this is coupled with the re-entrant nature of the storage stack in particular, it can lead to stack overflow conditions. In this article we will suggest several different strategies you can consider when trying to minimize the stack usage within your driver.
Meet the Stack
No doubt, the basic nature of the stack is familiar to most kernel level developers, given that it has the execution history of the processor at hand and is thus one of the essential elements used when debugging. The boundaries of the individual stack are found in the KTHREAD structure used by the Windows kernel to track the InitialStack, StackLimit, and KernelStack fields. Of course, the actual current location of the "top" of the stack is in the stack pointer register (ESP on x86, RSP on AMD-64, and SP or R12 on the IA-64). Because of the way that stacks are managed on all three processors - with the stack "growing down" - debugging normally starts at the current stack location and displays information at successively increasing stack addresses.
On the x86 and AMD-64, the stack is used for parameters, return addresses, local variables, etc. The IA-64 circumvents much of this stack usage by taking advantage of its numerous registers (a topic unto itself!). For example, someone recently asked us about debugging a stack overflow they were seeing - by the time we explained what was happening they said "never mind, my disk is now so full that SR (the system restore filter driver) has stopped doing anything so I am not stack overflowing." What we found when looking at the stack usage was that the developer had coded large character arrays out of stack space! With several re-entrant calls he quickly exhausted the stack and the system crashed.
Normally, the manifestation of a stack overflow is an UNEXPECTED_KERNEL_MODE_TRAP error (0x7F) and the first parameter is 0x8 (a "double fault"). Typically the instruction will be something innocuous, such as "push esi" or other stack manipulation, although once in a while we’ve seen memory references cause this (something like "mov [ebp-0x3c], eax") because a field in a stack-based data structure is being set. In either case, it causes a page fault. The page fault handler in the CPU attempts to push the CS, EFLAGS and EIP value onto the stack, but since the stack is not valid it causes the double fault handler to be invoked. Trips through the double fault handler on Windows are one-way - they always result in a termination of the OS (the "blue screen of death") so of course it is best programming practice to try and ensure they don’t happen.
Basically, there are two different mechanisms that we can use in the kernel environment in order to eliminate stack overflow conditions:
In the balance of this article we will discuss both of these techniques and provide some basic guidance on how to implement these techniques in your driver.
Minimizing Stack Usage
The most important technique for minimizing stack utilization is to use the kernel memory allocation functions for anything large, with the most likely candidates being data structures and character buffers.
The simplest way to achieve this is to simply use ExAllocatePoolWithTag. It is simple but does require that you ensure your code always frees the memory prior to exiting the function (or, in the worst case, prior to unloading your driver). One trick for ensuring that you always free the memory is to wrap the code within your function using the __try/__finally operation(s). This ensures that all exit paths within the __try block must always execute the code within the __finally block. The only downside to this is that (of course) this uses additional stack space itself (since the termination block is stored on the stack). While this will ensure that you do not leak memory, you can also do so through carefully testing and checking your code. This might be necessary in extreme cases where the stack utilization must be minimized as much as possible.
The most extreme case of stack minimization, and one often taken when all unnecessary stack usage must be eliminated, is to create a structure definition for all of the variables within a given routine. Upon entry to the routine, you would allocate the necessary storage from the pool (ideally in a register defined variable, so there is no additional stack space used) , use the individual entries within your structure (rather than local variables) for temporary storage, and then upon exit from the function free the stack space.
A few notes here are in order:
This is an extreme approach, generally reserved only for those cases where stack is at a total premium. It does work reasonably well and given modern processor and cache designs works reasonably well in comparison to the use of local variables.
You must be prepared for memory allocation failures. In other words, a call to ExAllocatePoolWithTag might fail. Your driver must handle this potential failure or otherwise you might cause the system to crash. In the case where your allocation fails, your driver should return STATUS_INSUFFICIENT_RESOURCES, just like any other memory allocation failure.
If you call from a pageable code path, you should use paged pool. Note that except under very unusual circumstances the memory that you allocate in this fashion will not actually get paged out, but it will come from the larger paged pool address region in the kernel.
If you call from a non-pageable code path, you must use non-paged pool. Similarly, if you are in a storage driver and your driver can be called to access the paging file, you must use non-paged pool (even though your driver might be called at IRQL PASSIVE_LEVEL or APC_LEVEL).
For frequent allocation/free operations on fixed size structures (e.g., some fixed size data structure) it is best to use a lookaside list (check the DDK documentation for ExInitializedPagedLookasideList or ExAllocateFromNPagedLookasideList as appropriate for your particular driver). In this case, the lookaside list that you create will manage a list of available buffers of the given size. If there are no available buffers, one will be allocated from pool (paged or non-paged, depending upon the way the lookaside list was initialized). Periodically, a background thread in the OS will trim the buffer list so that it does not become too large, with the goal of the OS to ensure you have a regular supply of appropriately sized buffers and are not using too much memory in doing so.
Lookaside lists are not as useful for drivers that allocate variable sized buffers. The choices in this case are either to use lookaside lists with buffers that are large enough for all (or most) cases, or to use ExAllocatePoolWithTag. Of course, if you use a buffer that is large enough for "most" cases, your driver will need to fall back to allocating from pool when the buffer must be larger. Still, this can be used to optimize the "common case" while still supporting the extreme case.
For example, in a file system driver, the maximum path name is 65534 bytes long. However, the Win32 API itself has a much smaller internal limit (around 1KB in current versions) and thus we might use fixed size 2KB buffers (1024 16-bit character buffers) and allocate larger buffers as needed. Or perhaps we’d instrument our driver to figure out what the 90% size is and use that as our common buffer allocation size. In some versions of Windows the size of lookaside lists is dynamically resized in order to improve the overall performance of the system - a technique that would certainly work within your driver as well.
Finally, it is always worth noting if your driver is using a recursive procedure, it is ill-advised in the kernel environment because of the scarcity of the stack environment. Instead, you are better advised to implement your functions iteratively (note that recursion and iteration can be substituted for one another). In other words: write your code to use a loop, rather than to use recursive function calls!
Stack Overflow Detection
The other technique for handling stack overflow conditions is to detect when the stack is not large enough to handle subsequent calls down to lower level functions within this driver, or out to other drivers. The key function call here is IoGetRemainingStackSize. This function provides the caller with information on the amount of stack space that is remaining on the stack for the current thread (remember, stack space is maintained on a per thread basis, so once we switch context to a different thread we will have a different stack - we’ll exploit this in just a minute!).
So, prior to performing some stack-consuming complex operation, your driver can check to ensure there is sufficient stack space. If the remaining stack is not enough (and "not enough" is a value likely to be adjusted based upon your experience running your driver in a variety of environments) then you can take appropriate measures. Such "appropriate measures" might include:
Failing the request (STATUS_ INSUFFICIENT_RESOURCES) and having the caller perform recovery, much like they would need to do in any other resource exhaustion case.
Posting the work item to a worker thread; this trick essentially leverages the fact that a different thread will have a new stack to work with.
Making an alternative implementation path that hoards stack space.
In addition, we’ve seen one or two cases where kernel drivers allocate and switch to their own stack. This approach is one we do not encourage - it is difficult to implement and has some serious restrictions on calling back into the OS (the rest of the OS can’t use this alternative stack safely). But, because you will sometimes see it used we wanted to mention that it is an option other developers have pursued.
While failing the request is simple, it often leads to unsatisfactory results because the error might manifest to the user in a rather unsavory fashion - for example, some application might terminate prematurely, or print some unfathomable error code. If this is a rare circumstance, this can be an excellent suggestion, but if stack overflow is a common occurrence in your driver environment, we suggest that this is not a good solution.
The second approach of posting to a worker thread can be very effective, although we note that it should not be used to post from a worker thread into the same work queue because that can lead to deadlocks, thus, this might be a case where the cure is as bad as the cause! For example, both NTFS and FAT will post work items to a worker thread when they detect low stack conditions. There is a dedicated thread in the file system runtime library (the File System Stack Overflow Thread) that handles these requests. You could construct such a worker thread within your own driver as well, spawing it in your DriverEntry function, for instance. Another alternative is to use one of the system work queues (again, provided that your driver cannot already be using this queue for the code where you check for stack overflow). To use a work item a driver:
Allocates the work item by calling IoAllocateWorkItem;
Allocate the context structure (if needed) to pass to the work routine;
Waits for the work item to complete (this is optional but generally the case used when a work item is queued to handle stack overflow);
Frees the work item in the work routine by calling IoFreeWorkItem;
A file system or file system filter driver may use the Ex versions of the work routines, but because they are not safe for unloadable drivers, they are deprecated for use in normal drivers.
Once running inside the work routine, your driver will complete the processing. Keep in mind that there are some potential complications when posting to a worker thread in this fashion including:
Allocating memory when remaining stack is low might trigger stack overflow. We’ve seen cases where a driver has itself caused a stack overflow when trying to create the needed work item. This can happen in particular when driver verifier is enabled, because driver verifier uses additional stack space when your driver calls ExAllocatePoolWithTag!
Serialization (locking) can become more complex in cases where your driver posts to a worker thread. One technique we’ve used in such cases is to indicate in the work item that routines should operate with the understanding that the necessary locks are held (by the original thread). That original thread is blocked waiting for the work item to complete execution. In this fashion we preserve the interlocking guarantees and can still handle stack overflow conditions. Getting this correct is not impossible, but may require some additional coding to ensure that it works. The typical symptom of incorrect serialization is either deadlock or data corruption, neither of which is desirable.