The NT Insider

Proper Completion -- Resubmitting IRPs from within a Completion Routine
(By: The NT Insider, Vol 12, Issue 3, May-June 2005 | Published: 19-Apr-05 | Modified: 25-Apr-05)

Let's face the facts: Handling IRPs can be a pain, and sometimes it's about as close to black magic as you can get.  One problem in IRP handling that's commonly encountered by driver writers is dealing with resubmitting IRPs from a driver's I/O completion routine.  That's the topic this article examines.

Some examples of situations where you might want to use IRP resubmitting are:

  • Fault tolerant devices where a failed I/O request needs to be resubmitted to the device in a way that is seamless to the application
  • Filter drivers that want to "fake out" drivers that use the hanging IRP model. In these cases, the application pends an I/O operation and waits for its completion as notification of an external event. By resubmitting the IRP back to the lower driver from within the completion routine, the filter consumes the event without the application ever being notified
  • Supporting devices that continuously stream data, such as an isochronous USB device

The concept of the technique is simple: When the completion routine runs, you simply forward the IRP to the lower driver, using IoCallDriver, and return STATUS_MORE_PROCESSING_REQUIRED. The lower driver now owns the IRP again, and the return status indicates to the I/O Manager that the IRP has been reclaimed and that completion processing should abort. There are a few issues here that people are quick to overlook, so we figured we'd do our part in contributing to driver quality and address those issues.

You DO Know How Completion Processing Works, Right?
There's a certain amount of mystery surrounding I/O completion, but once you understand all of the fundamental rules and why they exist, you're sitting pretty. However, covering all aspects of I/O completion in detail is not a goal of this article. If it was, I'd never get around to watching that episode of Fear Factor that's been sitting on my Tivo for weeks.

So, if you think that checking for Irp->PendingReturned is required for all I/O completion routines or can't recite the circumstances under which it is required, you need to read the tour de force that is Secrets of the Universe Revealed! -- How NT Handles I/O Completion.

The Dispatch Entry Point
You've decided that, for whatever reason, you might be taking an IRP that arrives at your dispatch entry point and resubmitting it from within the completion routine. By deciding to do this, you've broken any sort of guarantee that you might have had about the "quickness" of this I/O operation. Because we're all good little driver writers, this means that the IRP really needs to be pended from within the dispatch entry point. Therefore, whenever you receive an IRP that may be resubmitted, you need to mark the IRP as pending with IoMarkIrpPending and return STATUS_PENDING.

The Completion Side of Things
As an example, we'll use a fault-tolerant device. If an error occurs on the device while processing a request, we'll resubmit the request a predetermined number of times. If after the maximum number of attempts the device is still not responding properly, we'll allow the request to propagate up to the application.

Let's start with some infrastructure that looks like it does exactly what we want.

Completion Routine Context
Because we want to support multiple fault-tolerant requests at once, we'll need a context that we can associate with each request. This context will allow us to keep track of the number of times a request has been resubmitted on a per-IRP basis:

typedef struct _NOTHING_FT_REQUEST_CONTEXT {
    //
    // Our device extension
    //
    PNOTHING_DEVICE_EXT DevExt;
 
    //
    // The number of times that we have attempted
    //  this request
    //
    ULONG               AttemptedCount;
 
}NOTHING_FT_REQUEST_CONTEXT, *PNOTHING_FT_REQUEST_CONTEXT;

Every time we receive a request that we'll be resubmitting, we allocate one of these structures, initialize it properly, and set it as the context parameter for our completion routine.
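
Pulling these pieces together, here is a sketch of what such a dispatch entry point might look like. This is not code from the original driver; the routine name, the pool tag, and the DeviceToSendIrpsTo field are assumed placeholders chosen to be consistent with the other snippets in this article:

NTSTATUS
NothingDispatchDeviceControl(
    PDEVICE_OBJECT DeviceObject,
    PIRP Irp
    ) {
 
    PNOTHING_DEVICE_EXT devExt =
              (PNOTHING_DEVICE_EXT)DeviceObject->DeviceExtension;
    PNOTHING_FT_REQUEST_CONTEXT ftContext;
 
    //
    // One context per request, freed later by our completion routine
    //
    ftContext = (PNOTHING_FT_REQUEST_CONTEXT)
                    ExAllocatePoolWithTag(NonPagedPool,
                                          sizeof(NOTHING_FT_REQUEST_CONTEXT),
                                          'tFtN');
 
    if (ftContext == NULL) {
 
        //
        // No context, no fault tolerance. Fail the request.
        //
        Irp->IoStatus.Status = STATUS_INSUFFICIENT_RESOURCES;
        Irp->IoStatus.Information = 0;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
        return STATUS_INSUFFICIENT_RESOURCES;
 
    }
 
    //
    // Zeroing the context starts AttemptedCount at zero (and, for the
    //  work item version shown later, leaves WorkItem NULL). The Irp
    //  field added later in this article would also be set here.
    //
    RtlZeroMemory(ftContext, sizeof(NOTHING_FT_REQUEST_CONTEXT));
    ftContext->DevExt = devExt;
 
    //
    // This IRP may be resubmitted from our completion routine, so we
    //  can no longer promise a quick completion. Mark it pending and
    //  return STATUS_PENDING.
    //
    IoMarkIrpPending(Irp);
 
    //
    // Set up the lower driver's stack location and our completion
    //  routine, passing the context we just allocated
    //
    IoCopyCurrentIrpStackLocationToNext(Irp);
    IoSetCompletionRoutine(Irp,
                           NothingIoctlCompletionRoutine,
                           ftContext,
                           TRUE, TRUE, TRUE);
 
    (VOID)IoCallDriver(devExt->DeviceToSendIrpsTo, Irp);
 
    return STATUS_PENDING;
}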

Completion Routine: Attempt #1
Here's an attempt at a completion routine that looks like it probably does the right thing. Note that we did not allocate this IRP ourselves; it was handed to us by the I/O Manager, and we followed our rule of calling IoMarkIrpPending and returning STATUS_PENDING in our dispatch entry point:

NTSTATUS
NothingIoctlCompletionRoutine(
    PDEVICE_OBJECT DeviceObject,
    PIRP Irp,
    PVOID Context
    ) {
 
    PNOTHING_FT_REQUEST_CONTEXT ftContext 
                  = (PNOTHING_FT_REQUEST_CONTEXT)Context;
 
    PNOTHING_DEVICE_EXT devExt = ftContext->DevExt;
 
    //
    // If you think that this is necessary you haven't
    //  done the required reading!
    //
    //if (Irp->PendingReturned) {
    //
    //    IoMarkIrpPending(Irp);
    //
    //}
    
 
    //
    // Did the IRP succeed?
    //
    if (NT_SUCCESS(Irp->IoStatus.Status)) {
 
        //
        // Our work here is done! Free our context
        //  and allow completion processing to 
        //  proceed
        //
        ExFreePool(ftContext);
        
        return STATUS_SUCCESS;
 
    }
 
    //
    // The IRP has failed, we need to figure out if
    //  we're going to abort or retry the request
    //
 
    //
    // That's one more attempt under our belt.
    //  Note that we OWN the IRP and its context right now,
    //  so there's no need for synchronization
    //
    ftContext->AttemptedCount++;
 
    //
    // Have we exhausted our retry count? Also, if this IRP
    //  has been cancelled we should honor that request.
    //
    if (ftContext->AttemptedCount == NOTHING_FT_REQUEST_MAX_ATTEMPTS ||
        Irp->Cancel) {
        
        //
        // Nothing to do but free our context and return
        //  STATUS_SUCCESS. This will allow I/O completion
        //  processing to finish and the app will be notified
        //  of the error
        //
        ExFreePool(ftContext);
        
        return STATUS_SUCCESS;
        
    }
 
    //
    // We failed, but we're willing to try it again. Call the
    //  lower driver and return STATUS_MORE_PROCESSING_REQUIRED.
    //  This special return tells the I/O Manager that it can no
    //  longer make any assumptions about the state of the IRP
    //  and causes it to abort completion processing.
    //
    (VOID)IoCallDriver(devExt->DeviceToSendIrpsTo,
                       Irp);
 
    return STATUS_MORE_PROCESSING_REQUIRED;
 
}

I can see at least three bugs in this code, one of which will show up pretty quickly if you actually try to run it. I'll even give you the symptom: The first time that an IRP is resubmitted using this routine it will appear at the IRP_MJ_CREATE handler of the called driver.

A Bit About IoCompleteRequest
When you call IoCompleteRequest, it does two things that are of interest to our conversation:

  • It walks back up the I/O stack locations and directly calls the completion routines as subroutines
  • It zeroes out the stack locations when it is done with them

From those two points you can deduce that when your completion routine runs, you are in the same call stack as the caller of IoCompleteRequest, and the stack location below you, i.e., the next stack location, will be zero-filled. Taking a look at the definition of IRP_MJ_CREATE will probably give you the answer to why the resubmitted request is always going to the create handler of the called driver.
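
For those without the DDK headers in front of them, here's the punch line. IRP_MJ_CREATE is major function code zero, so a zero-filled stack location looks exactly like a create request:

//
// From the DDK headers: create is major function code zero, so an
//  IO_STACK_LOCATION that IoCompleteRequest has zero-filled carries
//  MajorFunction == IRP_MJ_CREATE. That's why the resubmitted IRP
//  lands in the lower driver's IRP_MJ_CREATE handler.
//
#define IRP_MJ_CREATE                   0x00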

So, we can fix one bug in this completion routine by setting up the next stack location in the IRP.

    //
    // The I/O manager zeroed out the lower driver's 
    // stack location, so we need to reinitialize it
    //
    IoCopyCurrentIrpStackLocationToNext(Irp);
    IoSetCompletionRoutine(Irp, 
                           NothingIoctlCompletionRoutine,
                           ftContext,
                           TRUE, TRUE, TRUE);
 
    (VOID)IoCallDriver(devExt->DeviceToSendIrpsTo,
                       Irp);
 
    return STATUS_MORE_PROCESSING_REQUIRED;

Notice the Other Bugs Yet?
The other two bugs are a bit more subtle than the previous one, but are also related to the IoCompleteRequest implementation details that we outlined.

If you look at the IRQL restrictions in the documentation for IoCompleteRequest, it notes that IoCompleteRequest is callable at IRQL <= DISPATCH_LEVEL. Because we know that IoCompleteRequest directly calls the completion routines within the IRP, we can assume that if IoCompleteRequest is called at IRQL DISPATCH_LEVEL our completion routine will also be called at IRQL DISPATCH_LEVEL.

A general rule of device drivers is that dispatch entry points are only callable at IRQL PASSIVE_LEVEL unless explicitly noted otherwise. So, what happens if our completion routine running at DISPATCH_LEVEL sends the IRP to a driver that expects its dispatch entry point to be called at PASSIVE_LEVEL?  Hopefully we'll get lucky and the system will die immediately during testing, but Murphy's Law of Programming says that only when your driver is deployed at a customer site will bugs like this show up.

This means that we're going to need a way to lower our IRQL in the cases where we are called at DISPATCH_LEVEL so that we can safely submit the request to the lower driver.

Lowering IRQL
If anyone out there just said, "Well, I'll just call KeLowerIrql if I notice that my IRQL is too high," please back away slowly from The NT Insider. But, I'm a bit of an optimist so I'll guess that most of you immediately thought of using a work item as a means to lower your IRQL.

For those of you who don't know, work items provide a way to be called back within the context of a system worker thread at IRQL PASSIVE_LEVEL. From within this work item, we'll be free to pass the IRP to the next driver without any worry of violating IRQL restrictions. Note that the fact that we'll be resubmitting the IRP from a context other than the original caller's is another reason why it is important to mark the IRP pending.

To implement our new and improved, work-item-using completion routine, we'll need to add some additional fields to our completion context:

typedef struct _NOTHING_FT_REQUEST_CONTEXT {
    ...
 
    //
    // We need a pointer to the IRP so that we
    // can get it back from within our work item
    //
    PIRP                Irp;
 
    //
    // Our work item for this request
    //
    PIO_WORKITEM        WorkItem;
 
}NOTHING_FT_REQUEST_CONTEXT, *PNOTHING_FT_REQUEST_CONTEXT;

We'll also need to add a work item routine to do the actual resubmission of the IRP:

VOID
NothingIoctlWorkItem(
    PDEVICE_OBJECT DeviceObject,
    PVOID Context 
    ) {
 
    PNOTHING_FT_REQUEST_CONTEXT ftContext 
                  = (PNOTHING_FT_REQUEST_CONTEXT)Context;
    
    PNOTHING_DEVICE_EXT devExt = ftContext->DevExt;
 
 
    //
    // Simply pass the IRP on to the next device
    //
    (VOID)IoCallDriver(devExt->DeviceToSendIrpsTo,
                       ftContext->Irp);
 
    return;
 
}

Completion Routine: Attempt #2
Now let's add our fix to our completion routine, and try to be clever in order to save some cycles:

    IoCopyCurrentIrpStackLocationToNext(...);
    IoSetCompletionRoutine(...);
 
    if (KeGetCurrentIrql() != PASSIVE_LEVEL) {
    
        //
        // We need to allocate a work item if we haven't
        // done so for this FT operation
        //
        if (ftContext->WorkItem == NULL) {
        
            ftContext->WorkItem = IoAllocateWorkItem(DeviceObject);
 
            if (!ftContext->WorkItem) {
                //
                // Error condition handling removed...
                //
            }
            
        }
        
        //
        // Queue our work item.
        //
        IoQueueWorkItem(ftContext->WorkItem,
                        NothingIoctlWorkItem,
                        DelayedWorkQueue,
                        ftContext);
                        
    } else {
            
        //
        // We're already at PASSIVE_LEVEL, no point
        //  in adding the work item overhead.
        //
        (VOID)IoCallDriver(devExt->DeviceToSendIrpsTo,
                           Irp);
                           
    }
 
    return STATUS_MORE_PROCESSING_REQUIRED;

For the sake of brevity, the rest of the changes to the completion routine are not shown here. The only other modification needed is a call to IoFreeWorkItem to free the work item before freeing the overall context structure.
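
For instance, the two paths that give up on the request (success, or retries exhausted) would now look something like this sketch, building on the context and work item infrastructure shown above:

        //
        // Done with this request. Free the work item, if we ever
        //  needed one, then free our context and allow completion
        //  processing to proceed
        //
        if (ftContext->WorkItem != NULL) {
 
            IoFreeWorkItem(ftContext->WorkItem);
 
        }
 
        ExFreePool(ftContext);
 
        return STATUS_SUCCESS;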

The attempt at avoiding the overhead of queuing the work item, along with our knowledge of how IoCompleteRequest works, should expose the final bug in this completion routine.

Beware Recursion in the Kernel!
We all know now that a driver calls IoCompleteRequest, and IoCompleteRequest calls the completion routine. There are many cases where a driver may complete a request from within its dispatch entry point, which will result in a direct call to our completion routine. If our completion routine then resubmits the IRP to the driver's dispatch entry point, the driver might again complete the request immediately. This would then cause our completion routine to run on the same call stack and resubmit the IRP to the driver again.

If our resubmission threshold was particularly high or if one of these routines happened to be using a large amount of the execution stack, we could easily recurse enough times to exhaust the entire 12KB assigned to the kernel-mode thread stack. Therefore, this attempt at an optimization is really just a juicy piece of bug bait and should be entirely removed:

    IoCopyCurrentIrpStackLocationToNext(...);
    IoSetCompletionRoutine(...);
 
    //
    // We need to allocate a work item if we haven't
    //  done so for this FT operation
    //
    if (ftContext->WorkItem == NULL) {
 
        ftContext->WorkItem = IoAllocateWorkItem(DeviceObject);
 
        if (!ftContext->WorkItem) {
            //
            // Error condition handling removed...
            //
        }
 
    }
 
    //
    // Queue our work item.
    //
    IoQueueWorkItem(ftContext->WorkItem,
                    NothingIoctlWorkItem,
                    DelayedWorkQueue,
                    ftContext);
                        
    return STATUS_MORE_PROCESSING_REQUIRED;

A Special Case: Allocating Your Own IRPs
This article has only discussed how to resubmit IRPs that were passed to you, not how you would resubmit IRPs that you have allocated. This is perfectly legal, but comes with a few extra caveats:

  • Do not attempt to call IoMarkIrpPending on an IRP that you have allocated. IoMarkIrpPending sets the SL_PENDING_RETURNED bit in the current stack location, and newly allocated IRPs do not have a current stack location.
  • Always use IoAllocateIrp to allocate IRPs that you will be resubmitting from within your completion routine. IRPs built with one of the IoBuildXxxRequest routines are tied to the current thread, which could lead to subtle, hard-to-find bugs.
  • Before you resubmit an IRP that you have allocated yourself, you'll want to make a call to IoReuseIrp. This call takes care of clearing out any remnants of the last IRP completion, like, say, that nasty Irp->Cancel flag. A short sketch of this pattern follows.
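
To tie those caveats together, here's a rough sketch of the reuse pattern for a driver-allocated IRP. This is not from the fault-tolerant example above; the IRP_MJ_DEVICE_CONTROL setup and the surrounding names are illustrative placeholders, and the code for the first submission is omitted:

    PIRP irp;
    PIO_STACK_LOCATION nextSp;
 
    //
    // Allocate the IRP ourselves -- never with one of the
    //  IoBuildXxxRequest routines, since those IRPs are tied
    //  to the current thread
    //
    irp = IoAllocateIrp(devExt->DeviceToSendIrpsTo->StackSize, FALSE);
 
    if (irp == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }
 
    //
    // ... first submission (stack location setup, buffers,
    //  completion routine, IoCallDriver) omitted ...
    //
 
    //
    // Before each resubmission: wipe out the remnants of the
    //  previous completion, including that nasty Cancel flag.
    //  Note that we do NOT call IoMarkIrpPending on this IRP.
    //
    IoReuseIrp(irp, STATUS_SUCCESS);
 
    //
    // Set up the (freshly reinitialized) next stack location,
    //  reinstall our completion routine, and send it back down
    //
    nextSp = IoGetNextIrpStackLocation(irp);
    nextSp->MajorFunction = IRP_MJ_DEVICE_CONTROL;
    // ... request-specific parameters go here ...
 
    IoSetCompletionRoutine(irp,
                           NothingIoctlCompletionRoutine,
                           ftContext,
                           TRUE, TRUE, TRUE);
 
    (VOID)IoCallDriver(devExt->DeviceToSendIrpsTo, irp);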

There's No Escaping I/O Completion
You can see from the content of this article that the bugs in the initial implementation could only come from a misunderstanding of the I/O completion process or of IRP handling rules in general. I highly suggest reading all literature available on the I/O subsystem within Windows; otherwise, you're going to be tracking down mysterious hangs, crashes, and IRP_MJ_CREATE requests for years to come.


Copyright 2017 OSR Open Systems Resources, Inc.