Shared Buffer Between Lower Filter Driver and Application during IO Completion

Hi Everyone, Tim,

I have a functional lower filter driver capable of sending inverted calls to an application, using a non-PnP device object for direct access to the application.

Is there a way I can map a huge buffer allocated by the upper-layer function driver into the application's address space? Or, alternatively, a buffer allocated by the filter driver in the context of the internal IOCTL received from the application?

The flow is going to be something like the following (this is only for testing purposes, not for a project!)

Device sends data to a buffer -> filter driver IO completion routine -> filter driver maps buffer to user space -> user space does some processing -> the request is completed and sent to the function driver.

Any suggestions, alternative approaches, or conditions the allocation must meet?

Thanks !
Arya

Sorry, a correction :

"or, alternatively, a buffer allocated by the filter driver in the context of the internal IOCTL received from the application?"

should read

"or, alternatively, a buffer allocated by the filter driver in the context of the internal IOCTL received from the Function Driver?"

Well, in order to realize that your design is flawed, all you have to do is consider a situation
(which, BTW, will be encountered in the vast majority of cases) where your IO completion routine
gets invoked in the context of a thread that has nothing to do with your app whatsoever. Are you going to map your buffer into an arbitrary address space, or what? Therefore, your shared buffer has to be allocated and mapped in advance. MmMapLockedPagesSpecifyCache() is your friend here. However, please note that, for security reasons, it is better to allocate the buffer in userland and subsequently map its locked MDL into the kernel address space, rather than doing things the other way around.
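Anton's "userland-first" approach might look roughly like the WDM sketch below. This is illustrative only and untested: `UserBuffer` and `Length` are assumed to come from an IOCTL the app sent, the code must run in the context of the owning process (e.g. in the IOCTL dispatch routine), and error handling is abbreviated.

```
/* Illustrative WDM sketch (untested): lock down an app-allocated
 * buffer and obtain a kernel-mode alias for it. The resulting
 * KernelVa is then usable from arbitrary thread context, e.g. in
 * a completion routine. */
PMDL Mdl = IoAllocateMdl(UserBuffer, Length, FALSE, FALSE, NULL);
if (Mdl == NULL)
    return STATUS_INSUFFICIENT_RESOURCES;

__try {
    /* Must be called in the context of the process owning UserBuffer. */
    MmProbeAndLockPages(Mdl, UserMode, IoWriteAccess);
} __except (EXCEPTION_EXECUTE_HANDLER) {
    IoFreeMdl(Mdl);
    return GetExceptionCode();
}

/* Maps the locked pages into system space (internally via
 * MmMapLockedPagesSpecifyCache). */
PVOID KernelVa = MmGetSystemAddressForMdlSafe(Mdl, NormalPagePriority);
if (KernelVa == NULL) {
    MmUnlockPages(Mdl);
    IoFreeMdl(Mdl);
    return STATUS_INSUFFICIENT_RESOURCES;
}
```

On teardown the driver would call MmUnlockPages() and IoFreeMdl(); the mapping obtained via MmGetSystemAddressForMdlSafe() is released automatically when the MDL is freed.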

The way I understand it, you want to move some processing to userland instead of doing everything in your driver. Therefore, your whole scheme should work more or less the following way. A driver for the underlying PDO will complete a request, so that your IO completion routine gets invoked. Your driver will fill a buffer, complete an outstanding app's request, and return STATUS_MORE_PROCESSING_REQUIRED, effectively terminating the callback chain.
As a result of that request completion, your user app will know that it is time to do some processing. It will process the buffer and submit an IOCTL to your driver. At this point your driver will already be in a position to complete the original request, whose processing it had earlier postponed by returning STATUS_MORE_PROCESSING_REQUIRED. As a result of your call to IoCompleteRequest(), the IO completion routine of the function driver (i.e. the one above yours on the stack) will get invoked, i.e. the IRP processing chain will keep on running as if nothing had happened…

Please note that on the way down the stack your driver should return STATUS_PENDING, no matter what IOCompleteRequest() returns. In order to understand why it has to be done this way, consider a situation when the underlying PDO driver completes the request on the spot, in the context of the original thread…
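A minimal WDM sketch of this flow, incorporating Anton's later correction (return STATUS_PENDING regardless of what IoCallDriver() returns). It is illustrative and untested; `SavePendingIrp()`, `CompleteOutstandingAppRequest()`, and `LowerDeviceObject` are hypothetical placeholders, not real APIs (in a real driver the lower device object would come from the device extension):

```
/* Illustrative, untested sketch of the postpone-and-resume scheme. */

NTSTATUS FilterCompletion(PDEVICE_OBJECT DeviceObject, PIRP Irp,
                          PVOID Context)
{
    /* Park the IRP, wake the app via the inverted call, and stop
     * the completion chain for now. Both helpers are hypothetical. */
    SavePendingIrp(DeviceObject, Irp);
    CompleteOutstandingAppRequest(DeviceObject);
    return STATUS_MORE_PROCESSING_REQUIRED;
}

NTSTATUS FilterDispatch(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    IoCopyCurrentIrpStackLocationToNext(Irp);
    IoSetCompletionRoutine(Irp, FilterCompletion, NULL,
                           TRUE, TRUE, TRUE);
    IoMarkIrpPending(Irp);
    IoCallDriver(LowerDeviceObject, Irp);
    return STATUS_PENDING;   /* regardless of IoCallDriver's result */
}

/* Later, in the IOCTL handler, once the app has processed the data,
 * resume completion of the parked IRP so the upper drivers' completion
 * routines run: */
VOID ResumeParkedIrp(PIRP Irp)
{
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
}
```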

Anton Bassov

>Please note that on the way down the stack your driver should return STATUS_PENDING,

no matter what IOCompleteRequest() returns

A correction - as you may have guessed, I meant IoCallDriver() and not IOCompleteRequest() here…

Anton Bassov

Thanks Anton! That makes sense: because of the arbitrary context of the completion routine, the buffers should be pre-allocated and mapped.

Since this involves a copy (allocate buffers, swap the pointers, and copy back after processing), I am thinking of a possible optimization in which the function driver's buffer itself is mapped to user space when the internal IOCTL is caught. For synchronous IO in this case, are there any guarantees that the driver, within this internal IOCTL dispatch routine, is running in the calling app's address space? If so, it saves a copy and should have significant performance benefits, given the sheer size of the buffer.

function driver (buffer) -> internal IOCTL -> filter (2nd buffer alloc and map) -> lower stack
Instead of the above, would it be possible to just map the function driver's buffer?

>processing), I am thinking of a possible optimization in which the function driver's buffer itself

can be mapped to user space when the internal IOCTL is caught.

The WaveRT audio driver class is probably the only inbox Windows component that works this way.

How it works:

  • the DMA common buffer of the hardware is mapped into user space.
  • user space uses it as a bytewise circular buffer with head/tail pointers.
  • after user space has put more bytes into the buffer, it calls a special IOCTL to push the new tail pointer to the driver, which then pushes it to the hardware.
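As a rough illustration of the head/tail scheme in the bullets above, here is a plain userland C sketch. Everything here is hypothetical: the shared DMA buffer is an ordinary struct, and the special IOCTL is replaced by a plain function call; this is not WaveRT's actual interface.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 16  /* tiny for illustration; real buffers are large */

/* Hypothetical stand-in for the shared circular buffer. */
typedef struct {
    unsigned char data[RING_SIZE];
    size_t head;   /* consumed by the "hardware" */
    size_t tail;   /* produced by the app */
} ring_t;

/* Bytes currently queued for the consumer. */
static size_t ring_used(const ring_t *r)
{
    return (r->tail + RING_SIZE - r->head) % RING_SIZE;
}

/* App side: copy bytes in and advance the tail. Returns bytes written.
 * One slot is kept free to distinguish full from empty. */
static size_t ring_put(ring_t *r, const unsigned char *src, size_t n)
{
    size_t free_bytes = RING_SIZE - 1 - ring_used(r);
    size_t i;
    if (n > free_bytes)
        n = free_bytes;
    for (i = 0; i < n; i++)
        r->data[(r->tail + i) % RING_SIZE] = src[i];
    r->tail = (r->tail + n) % RING_SIZE;
    return n;
}

/* Stand-in for the "push new tail" IOCTL: the driver learns the new
 * tail and the "hardware" consumes everything up to it. Returns the
 * number of bytes consumed. */
static size_t push_tail_to_driver(ring_t *r)
{
    size_t consumed = ring_used(r);
    r->head = r->tail;
    return consumed;
}
```

In the real design the app and driver see the same physical pages, so only the tail pointer crosses the user/kernel boundary, not the audio data itself.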

Also (Tim can correct me, since he is a great audio guy here) it only works for playback and not recording, i.e. for writes to the device and not reads.

I don't know how good it is, or whether there are real performance or latency gains from this. If it is good, then why is it not widely used in Windows? Why is having a queue of IRPs, converting their MDLs to SGLs, and feeding the SGL chain to the DMA hardware the recommended approach?

I have a strong suspicion that WaveRT is only beneficial because its data buffer is the target of heavy calculations by the mixer and sound effects, so having this single large buffer simplifies their implementation.

Or maybe this is because Intel High Definition Audio probably has no SGL support?


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

xxxxx@gmail.com wrote:

I have a functional lower filter driver capable of sending inverted calls to an application, using a non-PnP device object for direct access to the application.

Is there a way I can map a huge buffer allocated by the upper-layer function driver into the application's address space? Or, alternatively, a buffer allocated by the filter driver in the context of the internal IOCTL received from the application?

The flow is going to be something like the following (this is only for testing purposes, not for a project!)

Device sends data to a buffer -> filter driver IO completion routine -> filter driver maps buffer to user space -> user space does some processing -> the request is completed and sent to the function driver.

Not the way you have described it. Once you have been given an IRP, you
can’t assign a different buffer to the IRP. The usual way to do this
would be to have the application allocate the large buffer and pass it
to you in the IRP that you hold. Your driver can then fill it and
complete the request.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

xxxxx@gmail.com wrote:

Since this involves a copy (allocate buffers, swap the pointers, and copy back after processing), I am thinking of a possible optimization in which the function driver's buffer itself is mapped to user space when the internal IOCTL is caught.

You are overestimating the cost of a copy. Today’s processors can copy
a gigabyte per second using less than 10% CPU load. If the copy design
makes the driver simpler and more reliable, it’s well worth the price.
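Tim's figure is easy to sanity-check on a particular machine with a throwaway micro-benchmark like the one below (entirely illustrative; as the thread notes later, the number depends heavily on the memory subsystem and cache behavior, so treat the result as machine-specific):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical micro-benchmark: memcpy a buffer repeatedly and derive
 * MB/s from CPU time. Returns -1.0 on allocation failure or if the
 * elapsed time is below the clock's resolution. */
double copy_rate_mb_per_s(size_t bytes, int iterations)
{
    unsigned char *src = malloc(bytes);
    unsigned char *dst = malloc(bytes);
    double rate = -1.0;

    if (src != NULL && dst != NULL) {
        memset(src, 0xAB, bytes);

        clock_t start = clock();
        for (int i = 0; i < iterations; i++)
            memcpy(dst, src, bytes);
        clock_t end = clock();

        /* Verify the copy actually happened before trusting the rate. */
        assert(memcmp(src, dst, bytes) == 0);

        double seconds = (double)(end - start) / CLOCKS_PER_SEC;
        double mb = (double)bytes * iterations / (1024.0 * 1024.0);
        if (seconds > 0.0)
            rate = mb / seconds;
    }
    free(src);
    free(dst);
    return rate;
}
```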


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> xxxxx@gmail.com wrote:
>> Since this involves a copy (allocate buffers, swap the pointers, and copy back after processing), I am thinking of a possible optimization in which the function driver's buffer itself is mapped to user space when the internal IOCTL is caught.
>
> You are overestimating the cost of a copy. Today's processors can copy a gigabyte per second using less than 10% CPU load. If the copy design makes the driver simpler and more reliable, it's well worth the price.

Are there any decent papers on this subject? I’ve always been a bit curious about the overhead of a copy, including the cost of mapping and unmapping memory, and the effect on your cpu cache (having just pumped a gigabyte of data through it). It’s easy to do isolated testing on a small scale and see that a copy is very fast, but not so easy to test on an otherwise busy system in a controlled way and measure the impact on the system as a whole…

James

> Today’s processors can copy a gigabyte per second using less than 10% CPU load.

It is not as simple as that …

In this respect you should always look for the "weakest link in the chain", so you cannot take the memory controller and FSB (whatever that term may mean for, say, a NUMA system implemented as a collection of separate UMA nodes) out of the equation. Therefore, you have
to mention the specific system architecture and configuration, as well as the specific conditions under which such a result was obtained - I suspect it may vary wildly, depending on the above-mentioned factors…

Anton Bassov

xxxxx@hotmail.com wrote:

It is not as simple as that …

In this respect you should always look for the "weakest link in the chain", so you cannot take the memory controller and FSB (whatever that term may mean for, say, a NUMA system implemented as a collection of separate UMA nodes) out of the equation. Therefore, you have
to mention the specific system architecture and configuration, as well as the specific conditions under which such a result was obtained - I suspect it may vary wildly, depending on the above-mentioned factors…

I will grant you all of that. Nevertheless, many people immediately
discard the "copy" option unnecessarily. If it can simplify the design,
it's worth considering.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Thanks for posting this! I have something similar, and for me latencies are critical. I am handling around 5-30 MB of data within a short time frame, similar to a video frame.

Is there any alternative way of mapping the top-level driver's buffer directly, without any more copies? Maybe there can be a check using the IRP parameters: if the current context is actually that of the calling application, only then try mapping the incoming buffer in the dispatch routine; otherwise, handle it using the pre-allocated buffers.

As a second option, how about a UMDF lower filter that can link against the library that processes the frames?

-Eli

UMDF v2, if possible, seems like a good idea! The entire approach has some limitations compared to KMDF. Something for me to look into, since I have never created one before.

By context checking, are you referring to something like:
Irp->Tail.Overlay.Thread == PsGetCurrentThread() ?
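For reference, that check would look like the sketch below (illustrative only). Note that even when it matches, a driver lower in the stack is not generally guaranteed to run in the originating process for every request type, so a match is a necessary but not sufficient condition for safely touching user addresses.

```
/* Sketch: does the current thread match the one that issued the IRP? */
BOOLEAN IsRequestorContext(PIRP Irp)
{
    return (BOOLEAN)(Irp->Tail.Overlay.Thread == PsGetCurrentThread());
}
```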

xxxxx@gmail.com wrote:

Thanks for posting this ! So, I have something similar and for me latencies are critical. I am handling around 5 - 30 MB of data with a short time frame say similar to a video frame.

Is there any alternative way of mapping the top level driver buffer directly without any more copies ? Maybe there can be a check using the IRP parameters, if the current context is actually the same as the IO calling application ? Only then we try mapping the incoming buffer in the dispatch routine, else handle using the pre-allocated drivers.

As long as you use an ioctl with METHOD_OUT_DIRECT, your output buffer
will be mapped into kernel mode for you, and that kernel address is
available in an arbitrary context.
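In a METHOD_OUT_DIRECT handler this might look like the following fragment (illustrative, untested):

```
/* For METHOD_OUT_DIRECT, the I/O manager has already probed and
 * locked the output buffer and attached an MDL to the IRP. */
PVOID KernelVa = MmGetSystemAddressForMdlSafe(Irp->MdlAddress,
                                              NormalPagePriority);
if (KernelVa == NULL)
    return STATUS_INSUFFICIENT_RESOURCES;

/* KernelVa remains valid from arbitrary thread context (e.g. a
 * completion routine) until the IRP is completed. */
```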


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.