The NT Insider: Living in Harmony -- File System Filter-to-Filter Interaction


Living in Harmony -- File System Filter-to-Filter Interaction
The NT Insider, Vol 12, Issue 2, March-April 2005 | Published: 15-Mar-05| Modified: 22-Mar-05

We recently had an opportunity to aid in debugging and resolving a series of crashes in what turned out to be a complex file system filter-to-filter interaction issue. While we have seen many interesting interaction issues in the past, this particular issue was actually one we had not seen before. In the process of analyzing this issue we made several useful observations on cache, memory manager, and file system interaction issues that also furthered our understanding of the underlying system (and its complex interactions).

The Background

We periodically work with customers who are experiencing problems with drivers that they are developing. Generally, we recommend a comprehensive review of the design and the code--a time-consuming process, because it starts off with a thorough code reading. From that review we extract information about data structures and their interaction patterns. Once we've constructed a "big picture" model (the design of the as-implemented code) we go back through the original source code to ensure that the implementation fits into this model. This technique allows us to identify low level coding errors (e.g., "you are modifying this field without holding the appropriate lock") as well as higher level logical errors (e.g., "you aren't handling this range of conditions").

Having reviewed many filters, we find that the general "holes" turn up in the more esoteric cases that must be handled properly--even though they seldom happen on the developer's test machines or in the test lab, unless specific tests have been written to create these circumstances.

In this particular case, however, the client was experiencing a series of mysterious bug checks. The most pronounced of these manifested as a Cache Manager crash while accessing the file object referenced by the shared cache map structure.

Figure 1: File Object Access

Here, the file object in use by the Cache Manager was different from the file object in use by the Memory Manager (see Figure 1). Although unusual, this condition does happen. We've never seen it cause a problem in the past, but in this instance we observed that the file object referenced by the Shared Cache Map did not actually point back to the shared cache map (see the sidebar, Examining the Cache Map). This ultimately caused the Cache Manager to dereference a NULL pointer, and the system crashed.

Examining the Cache Map

To "peer into" the shared cache map we relied upon the kernel debugger, which makes it simple to examine these key data structures because the necessary type information is included in the PDB files that Microsoft publishes (available on the public symbol server). Keep in mind that this information is not available for older O/S versions.

To look at the shared cache map we use the debugger command "dt nt!_SHARED_CACHE_MAP". To look at a specific instance, we provide the address of the shared cache map as well. For any file on which caching has been established, this address is stored in the SharedCacheMap field of the SectionObjectPointers structure.

The other key data structure we peek into is the "control area."
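The consistency condition at the heart of this crash (the file object recorded in a shared cache map should itself lead back to that same shared cache map) can be modeled in a few lines. The following is a minimal user-mode sketch, not kernel code: the MODEL_* types and ModelCacheMapPointsBack are hypothetical stand-ins that keep only the pointer fields discussed here, and the real structure layouts must be taken from the debugger symbols.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. The real
 * layouts are version-specific and should be read from the symbols. */
struct model_shared_cache_map;

typedef struct model_section_object_pointers {
    struct model_shared_cache_map *SharedCacheMap;  /* NULL: no caching set up */
} MODEL_SECTION_OBJECT_POINTERS;

typedef struct model_file_object {
    MODEL_SECTION_OBJECT_POINTERS *SectionObjectPointer;
} MODEL_FILE_OBJECT;

typedef struct model_shared_cache_map {
    MODEL_FILE_OBJECT *FileObject;  /* file object the cache map refers to */
} MODEL_SHARED_CACHE_MAP;

/* Returns true only when the file object recorded in the cache map leads
 * back to that same cache map. */
bool ModelCacheMapPointsBack(const MODEL_SHARED_CACHE_MAP *map)
{
    const MODEL_SECTION_OBJECT_POINTERS *sop;

    if (map == NULL || map->FileObject == NULL) {
        return false;
    }
    sop = map->FileObject->SectionObjectPointer;
    if (sop == NULL) {
        return false;  /* assumed model of a scrubbed file object */
    }
    return sop->SharedCacheMap == map;
}
```

In the failing dumps described above, the equivalent of this check would have returned false: the cache map still named a file object, but that file object no longer led back to the cache map.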

Our initial theory posited that there was some sort of reference counting problem--after all, the file object should not "go away" or be scrubbed until after the file has been closed. We turned our attention to the use of "stream file objects" that was made by this particular file system filter driver. While we couldn't find anything obviously wrong with the use of stream file objects, we did not have much else to "grasp onto."

What we did determine is that if all the reads from this filter were done using non-cached I/O, the problem did not manifest--but the performance of the system suffered dramatically as a result. Clearly, our theory that this was a cache interaction issue "made sense" given that the problem went away when we disabled caching. This meant we needed to find the source of the underlying problem, and find a way to provide correct behavior without sacrificing performance.

Our next step was to add logic to perform non-cached I/O when caching had not been previously established for the file; this ensured that the stream file object was not used to back the cache. Initial testing of this solution suggested it resolved the problem. Unfortunately, subsequent testing indicated there were still some cases in which this undesirable condition would occur.
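The decision just described, reading non-cached unless caching already exists for the file, might look roughly like the following. This is a user-mode sketch under stated assumptions: the MODEL_* types and ModelChooseReadKind are hypothetical names, and "caching established" is modeled simply as a non-NULL SharedCacheMap pointer. In a real filter the non-cached choice would be expressed by setting IRP_NOCACHE on the read IRP.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
typedef struct model_section_object_pointers {
    void *SharedCacheMap;  /* non-NULL once caching has been established */
} MODEL_SECTION_OBJECT_POINTERS;

typedef struct model_file_object {
    MODEL_SECTION_OBJECT_POINTERS *SectionObjectPointer;
} MODEL_FILE_OBJECT;

typedef enum {
    MODEL_READ_CACHED,
    MODEL_READ_NONCACHED
} MODEL_READ_KIND;

/* Use a cached read only when someone else has already set up caching, so
 * that our own read never becomes the reason a cache map is created. */
MODEL_READ_KIND ModelChooseReadKind(const MODEL_FILE_OBJECT *fo)
{
    bool cachingEstablished =
        fo != NULL &&
        fo->SectionObjectPointer != NULL &&
        fo->SectionObjectPointer->SharedCacheMap != NULL;

    return cachingEstablished ? MODEL_READ_CACHED : MODEL_READ_NONCACHED;
}
```

As the surrounding text notes, this check narrowed the problem but did not eliminate it, so it should be read as one mitigation rather than the fix.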

Further analysis finally showed that the earlier presumption--that the problem involved the stream file objects--was likely faulty. All testing had been done with various anti-virus products installed (a mandatory configuration for this particular project), and after additional observation we ascertained that the cause was the read operations performed in the IRP_MJ_CLEANUP handler. The filter layering was: the anti-virus filter first, followed by the driver under study, and finally the file system driver.

After considerable study, we determined that the anti-virus product was sending IRP_MJ_CLEANUP as part of a file open cancellation. The filter under study would then perform a read against the file object, which established caching. Upon return to the anti-virus filter, that filter would issue the IRP_MJ_CLOSE and "scrub" the file object. Afterwards, when the Cache Manager background thread accessed that file object, it would crash unless caching had been re-established on the newly scrubbed file object.

Having identified the specific issue, the resolution was reasonably simple (we forced cache teardown on the file object) and eliminated that class of crash issues. However, this led us to revisit the numerous issues around canceling an IRP_MJ_CREATE, as well as our previous observations about performing I/O using a passed-in file object. If anything, this experience reinforced our fundamental belief that a file system filter that wishes to perform I/O--particularly in the IRP_MJ_CREATE, IRP_MJ_CLEANUP or IRP_MJ_CLOSE paths--must do so using its own file object.
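The order of events matters more than any single call here, so a tiny state model may help. The sketch below is a user-mode illustration only: the MODEL_FILE_OBJECT fields and the Model* functions are hypothetical, and "teardown" is reduced to clearing a flag rather than naming any particular kernel routine.

```c
#include <stdbool.h>

/* Hypothetical two-flag model of the passed-in file object. */
typedef struct model_file_object {
    bool CacheRefersToThisObject;  /* a cached read was done through it */
    bool Scrubbed;                 /* IRP_MJ_CLOSE has been processed */
} MODEL_FILE_OBJECT;

/* Cleanup handler of the filter under study: it reads through the passed-in
 * file object, which establishes caching as a side effect. */
void ModelFilterCleanup(MODEL_FILE_OBJECT *fo, bool forceTeardown)
{
    fo->CacheRefersToThisObject = true;
    if (forceTeardown) {
        fo->CacheRefersToThisObject = false;  /* undo before returning */
    }
}

/* The upper filter then sends the close, which scrubs the file object. */
void ModelClose(MODEL_FILE_OBJECT *fo)
{
    fo->Scrubbed = true;
}

/* The background cache thread is only in trouble when the cache still
 * refers to a file object that has already been scrubbed. */
bool ModelBackgroundAccessIsSafe(const MODEL_FILE_OBJECT *fo)
{
    return !(fo->CacheRefersToThisObject && fo->Scrubbed);
}

/* Runs the cancelled-open sequence: cleanup, then close, then later access. */
bool ModelRunCancelledOpen(bool forceTeardown)
{
    MODEL_FILE_OBJECT fo = { false, false };

    ModelFilterCleanup(&fo, forceTeardown);
    ModelClose(&fo);
    return ModelBackgroundAccessIsSafe(&fo);
}
```

Calling ModelRunCancelledOpen(false) models the crash sequence and ModelRunCancelledOpen(true) models the sequence after the fix. Doing the read through a file object the filter owns avoids the question entirely, which is the broader recommendation above.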

Figure 1, below, shows a code fragment that we wrote while constructing a test scenario. [Please note that it is provided as a sample. Try to understand the underlying mechanism before using the "cut and paste" programming technique. In short, while we believe this code sample should work as written, it has not been exhaustively tested in the "real world."] We wanted to validate that the basic idea of using a stream file object always works for cached I/O (see the sidebar, Unrelated DDK Doc Bug, for a bug we found in the DDK documentation while writing our test scenario).

This situation has reinforced our conviction that filter-to-filter interactions remain some of the most complex to properly code, handle, and resolve. Avoiding such interactions requires a thorough understanding of the underlying Windows system, particularly the interaction between file system drivers, cache manager, and memory manager (see sidebar, Speaking of Understanding Interactions...). Frequently, drivers are tested in the absence of all other drivers and such interoperability errors are only detected during beta testing or after release testing.

Speaking of Understanding Interactions...

One particularly curious interaction we observed had little to do with file systems, but rather with a different technique used by this particular product to ensure that their process was not prematurely terminated. Specifically, they used a "system call hooking" mechanism to trap calls to NtOpenProcess. Under most circumstances this worked fine. However, when running in the presence of a particular anti-virus product, the system would crash during the "automatic update" process performed by said anti-virus product.

After tracking this down, we could see in the debugger how this automatic update process would also "hook" the NtOpenProcess entry point. It turns out that it was not the hooking that generated the problem, but rather the fact that the Microsoft Installer would scan the list of open/running processes. Since this was perceived as an intrusion attempt on the product under study, it rejected the open attempt (returning STATUS_ACCESS_DENIED, although any error code caused the same problem). Subsequently, this standard anti-virus product would crash--the stack was overrun as the system entered a re-entrant exception handling sequence (the exception handler caused an exception, which triggered another call into the exception handling chain, and so forth until there was no more stack space available).

Thus, it appears that the anti-virus product--at least in this case--did not properly handle the error return (likely because it had never seen this call fail). This off-the-shelf anti-virus product is one that has been in the market for many years and is thus "mature" by software standards. Even so, this is a case that hadn't been previously seen.

The conclusion we draw from this is: no matter how much "field time" a particular product has, there is always the possibility that there are further bugs lurking, just waiting to be discovered. Or--as we say here at OSR--think of this as job security.
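The failure described in this sidebar comes down to an error return that was never expected. As a small illustration, here is a user-mode sketch of a process scan that tolerates a refused open. Everything prefixed MODEL_ or Model is hypothetical, including the stand-in open routine and its fake handle values; only the numeric value of STATUS_ACCESS_DENIED and the "non-negative means success" convention mirror the real NTSTATUS definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-mode stand-ins for NTSTATUS and NT_SUCCESS. */
typedef int32_t MODEL_NTSTATUS;
#define MODEL_STATUS_SUCCESS        ((MODEL_NTSTATUS)0)
#define MODEL_STATUS_ACCESS_DENIED  ((MODEL_NTSTATUS)0xC0000022)
#define MODEL_NT_SUCCESS(s)         ((MODEL_NTSTATUS)(s) >= 0)

/* Stand-in for a hooked open call: one protected process refuses opens. */
MODEL_NTSTATUS ModelOpenProcess(uint32_t pid, uint32_t protectedPid,
                                uint32_t *handle)
{
    if (pid == protectedPid) {
        return MODEL_STATUS_ACCESS_DENIED;  /* *handle is left untouched */
    }
    *handle = pid + 0x1000;  /* arbitrary fake handle value */
    return MODEL_STATUS_SUCCESS;
}

/* Scans a process list and counts the processes that could be opened.
 * A failed open is skipped; its handle is never used. */
size_t ModelCountOpenableProcesses(const uint32_t *pids, size_t count,
                                   uint32_t protectedPid)
{
    size_t opened = 0;

    for (size_t i = 0; i < count; i++) {
        uint32_t handle = 0;
        MODEL_NTSTATUS status = ModelOpenProcess(pids[i], protectedPid, &handle);

        if (!MODEL_NT_SUCCESS(status)) {
            continue;  /* expected possibility, not an exceptional one */
        }
        opened++;
    }
    return opened;
}
```

The point is not the counting but the shape of the loop: every status is checked before the output parameter is trusted, so a refusal from another product is just one more iteration.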

To protect against this class of problems, we suggest that you take the following steps:

Check to make sure your driver works on its own. That way, when you encounter subsequent problems, you'll have some confidence that this is an interoperability issue.

Always use the available tools for analyzing your driver. Use Driver Verifier. Also use PREfast for drivers; the generic PREfast template is helpful, but the new driver template for PREfast finds interesting and unusual errors.

Use PC-lint if you have it available. (See All About Lint, in the September-October 2002 issue of The NT Insider.)

Use the IFS Kit tests. They are necessary, but not sufficient.

Test against all kernel drivers shipped by Microsoft. This includes Services For Unix, network access (SMB/CIFS/LanManagerServer), DFS, SIS (Single Instance Store), RSS (the HSM product), and even client-side caching for SMB. Each of these exhibits different behavior.

Build your own tests for unusual situations. The available tests do not test all of the newest features. Are you sure you work in the presence of the USN journal? What about encrypted or compressed files? Files that do "open by object ID?" Files that manipulate streams? All of these are common trouble spots.

Attend Microsoft PlugFest. This gives you an opportunity to test interoperability with other third-party products.

Identify other file system filter drivers likely to be present in your environment. It is likely that your target systems will have an anti-virus filter installed, for example.

PFILE_OBJECT sfo;

sfo = IoCreateStreamFileObjectLite(NULL, IoGetBaseFileSystemDeviceObject(currentIrpStackLocation->FileObject));

if (sfo) {

   LONG_PTR refCount;
   PIO_STACK_LOCATION iosl;
   KEVENT event;
   CHAR buffer[20];
   ULONG oldFlags = Irp->Flags;
   IO_STACK_LOCATION savedIOSL;
   KPROCESSOR_MODE savedMode;

   //
   // clear the cleanup bit here...
   //
   sfo->Flags &= ~FO_CLEANUP_COMPLETE;

   //
   // Set the Stream File Object name up correctly
   //
   sfo->FileName.Buffer = (PWCHAR) ExAllocatePoolWithTag(PagedPool,
        currentIrpStackLocation->FileObject->FileName.Length,
        'Tmaw');

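   if (NULL == sfo->FileName.Buffer) {

     //
     // Test code only: a production driver must fail the request
     // gracefully here rather than bug check.
     //
     KeBugCheck(0);

   }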

   sfo->FileName.MaximumLength = sfo->FileName.Length = currentIrpStackLocation->FileObject->FileName.Length;

   RtlCopyMemory(sfo->FileName.Buffer, currentIrpStackLocation->FileObject->FileName.Buffer, sfo->FileName.Length);

   IoCopyCurrentIrpStackLocationToNext(Irp);

   KeInitializeEvent(&event, NotificationEvent, FALSE);

   IoSetCompletionRoutine(Irp, WamTestComplete1, (PVOID) &event, TRUE, TRUE, TRUE);

   iosl = IoGetNextIrpStackLocation(Irp);

   //
   // Save existing stack location setup
   //
   RtlCopyMemory(&savedIOSL, iosl, sizeof(IO_STACK_LOCATION));

   iosl->FileObject = sfo;

   (void) IoCallDriver(DeviceObject, Irp);

   (void) KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);

   //
   // Free buffer
   //
   ExFreePool(sfo->FileName.Buffer);
   sfo->FileName.Buffer = 0;
   sfo->FileName.Length = sfo->FileName.MaximumLength = 0;

   if (NT_SUCCESS(Irp->IoStatus.Status)) {

     //
     // Now I've opened the file, let's read from it.
     //

     iosl->MajorFunction = IRP_MJ_READ;
     iosl->FileObject = sfo;
     Irp->UserBuffer = buffer;
     Irp->MdlAddress = NULL;
     Irp->Flags = IRP_READ_OPERATION | IRP_DEFER_IO_COMPLETION;
     iosl->Parameters.Read.Length = 10;
     iosl->Parameters.Read.Key = 0;
     iosl->Parameters.Read.ByteOffset.QuadPart = 0;
     savedMode = Irp->RequestorMode;
     Irp->RequestorMode = KernelMode; // so we can use KM buffer

     //
     // CACHED READ
     //
     KeInitializeEvent(&event, NotificationEvent, FALSE);

     IoSetCompletionRoutine(Irp, WamTestComplete1, &event, TRUE, TRUE, TRUE);

     (void) IoCallDriver(DeviceObject, Irp);

     (void) KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);

     //
     // Use %p for the pointer arguments so they print correctly on 64-bit systems
     //
     DbgPrint("SFO (0x%p) -> SectionObjectPointer (0x%p) -> SharedCacheMap (0x%p), status 0x%x\n",
       sfo,
       sfo ? sfo->SectionObjectPointer : 0 ,
       (sfo && sfo->SectionObjectPointer) ? sfo->SectionObjectPointer->SharedCacheMap :0,
       Irp->IoStatus.Status);

     //
     // Amazing thing is that I don't CARE about the data, I just want to initiate
     // the caching.
     //
     iosl->MajorFunction = IRP_MJ_CLEANUP;
     iosl->FileObject = sfo;
     Irp->UserBuffer = 0;
     Irp->RequestorMode = savedMode;
     Irp->Flags = IRP_SYNCHRONOUS_API | IRP_CLOSE_OPERATION;

     KeInitializeEvent(&event, NotificationEvent, FALSE);

     IoSetCompletionRoutine(Irp, WamTestComplete1, &event, TRUE, TRUE, TRUE);

     (void) IoCallDriver(DeviceObject, Irp);

     (void) KeWaitForSingleObject(&event, Executive, KernelMode, FALSE, NULL);

     //
     // Restore IRP flags, 'cuz we're done here
     //
     Irp->Flags = oldFlags;

   } else {

     //
     // Create failed
     //
     DbgPrint("create failed, status 0x%x\n", Irp->IoStatus.Status);

   }

   //
   // set the cleanup bit here...
   //
   sfo->Flags |= FO_CLEANUP_COMPLETE;

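   //
   // Drop reference count on the sfo. Capturing the return value relies on
   // the behavior described in the sidebar, Unrelated DDK Doc Bug.
   //
   refCount = ObDereferenceObject(sfo);

   DbgPrint("Reference count reported when dropping ref count: %ld\n", (LONG) refCount);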


   //
   // Restore next stack location setup
   //
   RtlCopyMemory(iosl, &savedIOSL, sizeof(IO_STACK_LOCATION));

}

Figure 1 -- Code Fragment for Test Scenario

Unrelated DDK Doc Bug

In the process of writing our code sample, we did find a bug in the DDK documentation regarding ObReferenceObject and ObDereferenceObject. These functions are documented as being VOID, yet they actually return the value of the reference count prior to the operation being performed.

We have logged this documentation bug with Microsoft and expect it to be fixed in a subsequent release of the DDK.
