Yet another - user mode/kernel mode sharing

Hi,

I use the inverted call model in order to pass data from the driver. The specifics of this are somewhat particular, I would say. The protocol is as follows:

  1. I pass an IOCTL (METHOD_OUT_DIRECT) where user mode configures an output buffer (a memory pool that makes it possible for user mode to receive all driver-written data there; signaling is handled by using read IRPs).

After driver startup, user mode passes the output buffer through the default I/O queue (associated with read and control IRPs), which is configured as parallel, with the execution level not set (and not synchronized). The documentation says that if I use WdfRequestRetrieveOutputBuffer the buffer is locked down and I obtain a kernel-mode address, so I can copy to that buffer at a later time.

In order for that buffer to stay locked down and usable for as long as the driver is running, I keep the control IRP pending until application shutdown, when it gets canceled.

  2. In the read callback (in the same default I/O queue), every time I receive a ReadFile from user mode I get an IRP (this gets called, so it’s working), and instead of writing the actual data to the IRP’s output buffer I write the result to the previously configured global buffer.

RtlCopyMemory is called, but when, in user mode, after ReadFile returns (it also reports to user mode how many bytes have been written, which is fine), I look in the global buffer, I don’t see the data written.

I know this design is a little weird as described (I will have to see if performance tests make it obsolete), because I don’t use the request’s output buffer for the data - but should that work at all, considering that I wait for the IRP to complete before reading the global buffer? A stripped-down sketch of the two callbacks is below.
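Roughly, this is what I mean by the two callbacks. This is only a minimal sketch: names are shortened, error handling is trimmed, and ProduceData() is just a stand-in for whatever actually fills the pool.

```c
/* Minimal sketch of the two callbacks described above. Names are
 * illustrative, error handling is trimmed, and ProduceData() is a
 * placeholder for whatever actually fills the pool. */
#include <ntddk.h>
#include <wdf.h>

typedef struct _DEVICE_CONTEXT {
    WDFREQUEST PendedIoctl;   /* the METHOD_OUT_DIRECT request kept pending */
    PVOID      SharedBuffer;  /* kernel-mode mapping of the user-mode pool  */
    size_t     SharedLength;
} DEVICE_CONTEXT, *PDEVICE_CONTEXT;

WDF_DECLARE_CONTEXT_TYPE_WITH_NAME(DEVICE_CONTEXT, GetDeviceContext)

size_t ProduceData(PVOID Pool, size_t PoolLength);  /* placeholder */

/* Step 1: the configuration IOCTL. Retrieve and remember the output
 * buffer, then keep the request pending so its MDL stays locked. */
VOID EvtIoDeviceControl(WDFQUEUE Queue, WDFREQUEST Request,
                        size_t OutputBufferLength, size_t InputBufferLength,
                        ULONG IoControlCode)
{
    PDEVICE_CONTEXT ctx = GetDeviceContext(WdfIoQueueGetDevice(Queue));
    NTSTATUS status;

    UNREFERENCED_PARAMETER(InputBufferLength);
    UNREFERENCED_PARAMETER(IoControlCode);

    status = WdfRequestRetrieveOutputBuffer(Request, OutputBufferLength,
                                            &ctx->SharedBuffer,
                                            &ctx->SharedLength);
    if (!NT_SUCCESS(status)) {
        WdfRequestComplete(Request, status);
        return;
    }

    /* Do not complete: the buffer stays mapped and locked until this
     * request is completed or canceled at application shutdown. A real
     * driver would also register an EvtRequestCancel callback here. */
    ctx->PendedIoctl = Request;
}

/* Step 2: each ReadFile is only a signal. The payload goes into the
 * shared pool; the read completes with the number of bytes placed there. */
VOID EvtIoRead(WDFQUEUE Queue, WDFREQUEST Request, size_t Length)
{
    PDEVICE_CONTEXT ctx = GetDeviceContext(WdfIoQueueGetDevice(Queue));
    size_t produced = 0;

    UNREFERENCED_PARAMETER(Length);

    if (ctx->SharedBuffer != NULL) {
        produced = ProduceData(ctx->SharedBuffer, ctx->SharedLength);
    }

    WdfRequestCompleteWithInformation(Request, STATUS_SUCCESS, produced);
}
```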

I’m not sure what you want to achieve and why you’re doing that, but if you only want to pass data back, just use READ requests.

@ Alex Grig

Long story short, one WDM driver that I saw was using this “questionable” method, probably because they found it easier to manage buffer reuse like that.

I did some measurements, though, using a single thread (as a proof of concept) and blocking ReadFile I/O, and was able to transfer about 550 MB of data per second to user mode like that. My test TCP packets (buffers) were 1.6 KB each, which yields more than 300,000 kernel/user-mode transitions (completed read IRPs) per second - at 100% thread utilization (one core/CPU).

Considering that, I find it hard to believe that my user-mode app won’t be the limiting factor of the whole system.

xxxxx@gfi.com wrote:

> I use the inverted call model in order to pass data from the driver. The specifics of this are somewhat particular, I would say. The protocol is as follows:

> 1. I pass an IOCTL (METHOD_OUT_DIRECT) where user mode configures an output buffer (a memory pool that makes it possible for user mode to receive all driver-written data there; signaling is handled by using read IRPs).

Why would you do this, instead of just transferring data using the
normal ReadFile process? What do you think you have gained?

> 2. In the read callback (in the same default I/O queue), every time I receive a ReadFile from user mode I get an IRP (this gets called, so it’s working), and instead of writing the actual data to the IRP’s output buffer I write the result to the previously configured global buffer.

So, you are passing a buffer with the ReadFile call, but you aren’t
using it? Then I’m afraid your design is idiotic. You are paying the
penalty of user/kernel transitions and the cost of locking down that
buffer anyway. You have gained nothing except complexity.

> RtlCopyMemory is called, but when, in user mode, after ReadFile returns (it also reports to user mode how many bytes have been written, which is fine), I look in the global buffer, I don’t see the data written.

Your description is a little bit vague. Are you saying that you can
write into the original ioctl buffer from your driver, but you don’t see
anything change in the user mapping? How are you checking that?
Debugger, or printfs?


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Why don’t you pass larger buffers in your ReadFile calls, instead of single-packet-sized ones?

@ Tim

I was using printf’s.
I was trying that complex design because it was something that people who “had more driver experience” had done. I didn’t quite follow their rationale, considering that locking down memory happens in both cases (their rationale would only make sense if the size of the locked-down memory mattered in some meaningful way).

Anyway, as I said, the numbers are quite convincing about why you wouldn’t want to complicate things by any means - just use plain read IRPs and you are good.

> So, you are passing a buffer with the ReadFile call, but you aren’t using it? Then I’m afraid
> your design is idiotic.

???

Look - if the amounts of data to be transferred are expressed in terms of megabytes, this approach in itself is perfectly reasonable. You don’t want to make such an allocation behind the scenes every time you want to transfer data, right? Therefore, if you use a tiny bogus buffer for the IRP while doing the actual IO into a pre-allocated and locked buffer, you can combine the advantages of memory sharing with the ease/safety of “regular” IO processing. This could be particularly helpful if done in an asynchronous fashion, i.e. you read data from the shared buffer only when the event in the OVERLAPPED structure gets signalled…
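Just to illustrate - something along these lines on the user-mode side. The device name, the IOCTL code and the meaning of the tiny buffer are all made up here; it is only a sketch of the idea, not anyone’s actual implementation.

```c
/* User-mode side of the scheme, purely illustrative: device name, IOCTL
 * code and the meaning of the tiny per-read buffer are all invented. */
#include <windows.h>
#include <winioctl.h>
#include <stdio.h>

#define IOCTL_MYDEV_SET_POOL \
    CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_OUT_DIRECT, FILE_ANY_ACCESS)
#define SHARED_POOL_SIZE (4 * 1024 * 1024)

int main(void)
{
    HANDLE dev = CreateFileW(L"\\\\.\\MyPacketDevice",
                             GENERIC_READ | GENERIC_WRITE, 0, NULL,
                             OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (dev == INVALID_HANDLE_VALUE) return 1;

    /* The big buffer handed to the driver once, via the pended IOCTL. */
    BYTE *pool = (BYTE *)VirtualAlloc(NULL, SHARED_POOL_SIZE,
                                      MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    if (pool == NULL) return 1;

    OVERLAPPED ovCfg = { 0 };
    ovCfg.hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);
    /* Expected to stay pending for the life of the session; the driver
     * keeps the request (and therefore the locked buffer) until shutdown. */
    DeviceIoControl(dev, IOCTL_MYDEV_SET_POOL, NULL, 0,
                    pool, SHARED_POOL_SIZE, NULL, &ovCfg);

    /* Tiny per-read buffer: it only carries a notification/byte count. */
    DWORD newBytes = 0;
    OVERLAPPED ovRead = { 0 };
    ovRead.hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);

    for (;;) {
        DWORD got = 0;
        if (!ReadFile(dev, &newBytes, sizeof(newBytes), NULL, &ovRead) &&
            GetLastError() != ERROR_IO_PENDING)
            break;
        /* Consume data from 'pool' only after the read's event fires. */
        if (!GetOverlappedResult(dev, &ovRead, &got, TRUE))
            break;
        printf("driver reports %lu new bytes in the shared pool\n",
               (unsigned long)newBytes);
        /* ... parse packets out of 'pool' here ... */
    }

    CloseHandle(dev);
    return 0;
}
```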

Anton Bassov

@ anton

Of course the read’s output buffer was used - in this kind of design it indicates where the data is, e.g. the index from which user mode has to read. Presumably it helps the memory-pool reuse strategy, but it adds overhead in understanding, and the same memory-pool design can be accomplished by creating and reusing multiple buffers as needed. Actually, associating individual buffers with each request makes the design more flexible, easier to reason about, more scalable, and less bug-prone.

xxxxx@hotmail.com wrote:

> > So, you are passing a buffer with the ReadFile call, but you aren’t using it? Then I’m afraid
> > your design is idiotic.
>
> ???
>
> Look - if the amounts of data to be transferred are expressed in terms of megabytes, this approach in itself is perfectly reasonable. You don’t want to make such an allocation behind the scenes every time you want to transfer data, right?

What allocation? All that happens is the map-and-lock. That’s why I
made the comment: if he is reading a megabyte and passing a
megabyte-sized buffer in the ReadFile call that is ignored, then he is
paying the map-and-lock penalty anyway.

> Therefore, if you use a tiny bogus buffer for the IRP while doing the actual IO into a pre-allocated and locked buffer, you can combine the advantages of memory sharing with the ease/safety of “regular” IO processing.

You are assuming that he is passing a tiny buffer in the ReadFile call.
Maybe he is. In that case, he has the problem of returning a transfer
size in IoStatus.Information that is larger than the buffer.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> Actually, associating individual buffers with each request makes the design more flexible, easier to reason about, more scalable, and less bug-prone.

Sure, but everything depends on the data rate. As long as it is sufficiently low, you DEFINITELY don’t need any of the extra complications that shared memory implies. Otherwise, you can implement the shared-buffer approach as a pool of separate pre-allocated and pre-locked buffers that get associated with multiple IO requests, but in that case your data rate should be more or less constant…
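For instance, something like this on the user-mode side. The number of requests, the buffer size and the way completions are harvested are arbitrary here - it is only meant to show the idea of one pre-allocated buffer per outstanding read.

```c
/* Illustration only: a fixed pool of buffers, each tied to its own
 * outstanding overlapped read; counts and sizes are arbitrary. */
#include <windows.h>

#define NUM_REQUESTS 8
#define BUF_SIZE     (64 * 1024)

typedef struct {
    OVERLAPPED ov;
    BYTE       data[BUF_SIZE];
} PENDING_READ;

void run_reader(HANDLE dev)
{
    static PENDING_READ slot[NUM_REQUESTS];   /* pre-allocated once */
    HANDLE events[NUM_REQUESTS];
    DWORD i;

    /* Keep NUM_REQUESTS reads in flight, each with its own buffer; each
     * buffer is locked by the I/O manager only while its read is pending. */
    for (i = 0; i < NUM_REQUESTS; i++) {
        ZeroMemory(&slot[i].ov, sizeof(OVERLAPPED));
        slot[i].ov.hEvent = events[i] = CreateEventW(NULL, TRUE, FALSE, NULL);
        ReadFile(dev, slot[i].data, BUF_SIZE, NULL, &slot[i].ov);
    }

    for (;;) {
        DWORD idx = WaitForMultipleObjects(NUM_REQUESTS, events, FALSE,
                                           INFINITE) - WAIT_OBJECT_0;
        DWORD got = 0;

        if (idx >= NUM_REQUESTS ||
            !GetOverlappedResult(dev, &slot[idx].ov, &got, FALSE))
            break;

        /* ... consume slot[idx].data[0..got) here ... */

        /* Recycle the same buffer by resubmitting the read. */
        ReadFile(dev, slot[idx].data, BUF_SIZE, NULL, &slot[idx].ov);
    }
}
```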

Anton Bassov

> What allocation? All that happens is the map-and-lock.

Duh!!! I just overlooked the fact that he uses direct IO, rather than METHOD_BUFFERED.
However, in this case, probing and locking an MDL by the IO manager every time a request gets submitted seems to imply even larger overhead, compared to a simple buffer allocation.

Actually, what I meant here was using multiple tiny METHOD_BUFFERED requests that indicate the filling of a shared buffer that had been submitted earlier with a METHOD_OUT_DIRECT request that got indefinitely pended by the driver…

> You are assuming that he is passing a tiny buffer in the ReadFile call. Maybe he is. In that case,
> he has the problem of returning a transfer size in IoStatus.Information that is larger than the buffer.

He does not need to put this info into IoStatus.Information, does he? Instead, he can use the tiny buffer exactly for this purpose (in fact, as its sole purpose), and IoStatus.Information will then represent the number of bytes that he transfers in this tiny buffer in order to convey that info…
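In other words, something along these lines on the driver side. POOL_DESCRIPTOR and FillSharedPool() are invented here, purely to show where the numbers go; this is a sketch, not anybody’s actual code.

```c
/* Driver-side sketch: the tiny read buffer carries only a descriptor of
 * what was placed into the shared pool. POOL_DESCRIPTOR and
 * FillSharedPool() are invented for illustration. */
#include <ntddk.h>
#include <wdf.h>

typedef struct _POOL_DESCRIPTOR {
    ULONG Offset;   /* where the new data starts in the shared pool */
    ULONG Length;   /* how many payload bytes were placed there     */
} POOL_DESCRIPTOR, *PPOOL_DESCRIPTOR;

/* Placeholder: copies payload into the pended, locked METHOD_OUT_DIRECT
 * buffer and reports where it put it. */
VOID FillSharedPool(WDFQUEUE Queue, ULONG *Offset, ULONG *Length);

VOID EvtIoRead(WDFQUEUE Queue, WDFREQUEST Request, size_t Length)
{
    PPOOL_DESCRIPTOR desc;
    NTSTATUS status;

    UNREFERENCED_PARAMETER(Length);

    status = WdfRequestRetrieveOutputBuffer(Request, sizeof(POOL_DESCRIPTOR),
                                            (PVOID *)&desc, NULL);
    if (!NT_SUCCESS(status)) {
        WdfRequestComplete(Request, status);
        return;
    }

    FillSharedPool(Queue, &desc->Offset, &desc->Length);

    /* Information describes the tiny descriptor, not the payload; the
     * payload size travels inside the descriptor itself. */
    WdfRequestCompleteWithInformation(Request, STATUS_SUCCESS,
                                      sizeof(POOL_DESCRIPTOR));
}
```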

Anton Bassov

@ anton

Actually the data rate will be high - think pacepkets going through a server - but at 550 MB per second serviced by a fully loaded thread, packet by packet, I think that is quite high, even if it’s all kernel-time CPU. Packet analysis would certainly add much more than that.

> Actually the data rate will be high - think pacepkets going through a server - but at 550 MB per second
> serviced by a fully loaded thread, packet by packet, I think that is quite high, even if it’s all kernel-time
> CPU. Packet analysis would certainly add much more than that.

I dunno, but I would rather first try to make it as “stupid and simple” as possible just in order to see how it works. If the performance is not satisfactory you can start thinking about optimizing things. Otherwise, just keep in mind Knuth’s famous statement about premature optimization that happens to be “the mother of all evil in all but 5% of cases”…

Anton Bassov

xxxxx@gfi.com wrote:

> Actually the data rate will be high - think pacepkets going through a server - but at 550 MB per second serviced by a fully loaded thread, packet by packet, I think that is quite high

I’m not sure what “pacepkets” are, but copying 550 MB/s requires 3% CPU
load on a modern processor.

> even if it’s all kernel-time CPU.

You do understand that the CPU runs at exactly the same speed in kernel
mode and user mode, right?

> Packet analysis would certainly add much more than that.

Yes, but that was not the question.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> Otherwise, just keep in mind Knuth’s famous statement about premature optimization that happens
> to be “the mother of all evil in all but 5% of cases”…

Oh yes, at least it is a mother of spaghetti code.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

Just to answer some of the questions here.

  1. Yes, CPU time is something universal, but in my test app, which was doing nothing but the transfer from the driver, almost all of the CPU time was spent in the kernel servicing the reads, which were copying the data.

  2. In stream mode with 8 KB buffers, one fully loaded thread gets 2.5 GB per second, so, as usual, large buffers are what counts for truly high speeds (roughly the kind of loop sketched below).
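For reference, a stripped-down version of the kind of blocking-ReadFile loop I mean. The device name, the buffer size and the time window are arbitrary here; it is only a sketch of the measurement idea.

```c
/* Rough blocking-ReadFile throughput loop; device name, buffer size and
 * the 10-second window are arbitrary. */
#include <windows.h>
#include <stdio.h>

#define BUF_SIZE (8 * 1024)

int main(void)
{
    HANDLE dev = CreateFileW(L"\\\\.\\MyPacketDevice", GENERIC_READ, 0, NULL,
                             OPEN_EXISTING, 0, NULL);
    if (dev == INVALID_HANDLE_VALUE) return 1;

    static BYTE buf[BUF_SIZE];
    unsigned long long total = 0;
    LARGE_INTEGER freq, t0, t1;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);

    for (;;) {
        DWORD got = 0;
        if (!ReadFile(dev, buf, BUF_SIZE, &got, NULL) || got == 0)
            break;
        total += got;

        QueryPerformanceCounter(&t1);
        double secs = (double)(t1.QuadPart - t0.QuadPart) /
                      (double)freq.QuadPart;
        if (secs >= 10.0) {
            printf("%.1f MB/s\n", (double)total / (1024.0 * 1024.0) / secs);
            break;
        }
    }

    CloseHandle(dev);
    return 0;
}
```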

If anyone has used WFP to intercept packets at the network layer and reinject them: do I have to worry about possible packet reordering while capturing them, passing them to user mode, and reinjecting them?

I use an external lib that does the reordering and inspection, and I’m afraid that in user mode I will do the injection in the same mangled order in which the packets arrived.