Massive data exchange between User and Kernel spaces. Best practice question

Hello there,

We have a kernel driver and a system service which currently communicate via IOCTL requests (direct method). That's fine for now, since the data size is not very big (about 20-30 MB).

But what if we need to get 100-150 MB of data from kernel space? How should we act in such a situation? I guess the memory allocation should be done on the driver side and then we should map this memory into user space, but I'm not sure whether this is correct or not.

And what if we need to pass about 1 GB of data? Allocating 1 GB is obviously not a very good idea, is it? Should I use memory-mapped files in that case?

Thanks in advance,
Vitaly

What Mr. “M M” said.

You should continue to allocate the memory in user mode. Allocating the memory from kernel mode gets you no advantage whatsoever. None. And it gets you complications and potential security issues.

As the amount of data you need to transfer increases, you have to think about whether you *really* want to attempt to lock all the pages containing that data in memory simultaneously. And you have to give careful thought to whether the user-mode app *really* needs all that data returned in a single chunk, or whether it's a better idea overall to return that data in smaller chunks.
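
To make that concrete, here is a minimal, hedged sketch of the chunked approach as seen from the user-mode service: the service owns the buffer and pulls the data a chunk at a time over a METHOD_OUT_DIRECT IOCTL. The device name, IOCTL code, and chunk size below are made up for illustration.

    /* Hedged sketch: the user-mode service allocates the buffer and pulls
     * the data in chunks. The device name and IOCTL code are hypothetical. */
    #include <windows.h>
    #include <winioctl.h>
    #include <stdlib.h>

    /* Hypothetical IOCTL; METHOD_OUT_DIRECT means the driver receives an MDL
     * for our output buffer and copies into those pages directly. */
    #define IOCTL_DPI_GET_CHUNK \
        CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_OUT_DIRECT, FILE_READ_ACCESS)

    #define CHUNK_SIZE (32 * 1024 * 1024)   /* 32 MB per call; tune as needed */

    int main(void)
    {
        HANDLE h = CreateFileW(L"\\\\.\\MyDpiDevice", GENERIC_READ, 0, NULL,
                               OPEN_EXISTING, 0, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        BYTE *chunk = (BYTE *)malloc(CHUNK_SIZE);  /* plain user-mode allocation */
        if (chunk == NULL) { CloseHandle(h); return 1; }

        DWORD bytes = 0;
        /* Keep asking the driver for the next chunk until it returns no data. */
        while (DeviceIoControl(h, IOCTL_DPI_GET_CHUNK, NULL, 0,
                               chunk, CHUNK_SIZE, &bytes, NULL) && bytes != 0)
        {
            /* process 'bytes' bytes of captured data here */
        }

        free(chunk);
        CloseHandle(h);
        return 0;
    }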

Hand in hand with the size of the data transfer goes the frequency with which these transfers take place. 1GB once a day? As long as you can get it to work reliably, who cares how you do it? 1GB 100 times per second? Well, now you’re talking about an entirely different class of problem. I suspect your requirements lie somewhere between those two.

It’s an interesting problem. We discuss problems like this quite extensively in our Advanced Implementation Techniques seminar.

There are a lot of trade-offs… if you tell us more about the characteristics of your transfers/workload we can give you some more advice.

Peter
OSR
@OSRDrivers

100-150 MB is simply 20-30 MB five times over. Do you really need to get all that in one call, or are five calls OK?

Unless you really want to share the kernel side’s data with the user side (and handle the fun cases that happen when the kernel side is modifying the data while the user side is reading it) you will still need to copy the data from the kernel buffers into user buffers. The best option there is METHOD_DIRECT, since you copy your data directly to user-side pages (rather than METHOD_BUFFERED which would indirect through an additional kernel buffer).

If your kernel side is taking data from a device and putting it into the user-side buffers, then METHOD_DIRECT is also the best option. Whether you're doing this with PIO or DMA, the handling is identical to how you handle copying your own data.
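
For the copy-your-own-data case, the driver side of a METHOD_OUT_DIRECT transfer looks roughly like the sketch below (WDM style; the dispatch routine is assumed to complete the IRP, and CollectCapturedData is a made-up placeholder for whatever produces your data):

    NTSTATUS
    HandleGetChunk(PIRP Irp, PIO_STACK_LOCATION IrpSp)
    {
        ULONG outLen = IrpSp->Parameters.DeviceIoControl.OutputBufferLength;
        PVOID sysVa;
        ULONG copied;

        /* The I/O manager already probed and locked the caller's buffer and
         * described it with Irp->MdlAddress; map it and copy straight into
         * the caller's pages -- no intermediate system buffer as with
         * METHOD_BUFFERED. */
        sysVa = MmGetSystemAddressForMdlSafe(Irp->MdlAddress, NormalPagePriority);
        if (sysVa == NULL) {
            Irp->IoStatus.Information = 0;
            return STATUS_INSUFFICIENT_RESOURCES;
        }

        copied = CollectCapturedData(sysVa, outLen);   /* hypothetical helper */

        Irp->IoStatus.Information = copied;            /* bytes returned      */
        return STATUS_SUCCESS;
    }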

If you really have huge amounts of data that's contiguously mapped in the kernel, and which remains static so that you don't need any kernel/user synchronization (or is a stream where you're writing and the user is reading, though you may still need to coordinate around changes to the ring pointers), then you would consider using shared memory. In that case you would probably allocate the kernel buffer with an MDL (using MmAllocatePagesForMdl to back it with pages) rather than from pool, and you would have to handle mapping it partially or completely into kernel address space and user address space. If your buffers are really big, then you may have trouble mapping them into kernel space unless you're on a 64-bit machine, since you still need to find a contiguous KVA range.
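
If you do go the shared-memory route, the setup might look something like the sketch below. This is a hedged outline only: error handling, cleanup of the MDL and its pages, and the IOCTL that hands the user-mode address back to the service are all omitted.

    /* Hedged sketch: MDL-backed shared buffer, mapped into both kernel and
     * user address space. Assumes a 64-bit machine and that the user mapping
     * is created while running in the service's process context (e.g. while
     * handling one of its IOCTLs). */
    PMDL  g_Mdl;
    PVOID g_KernelVa;   /* driver's view of the buffer                */
    PVOID g_UserVa;     /* service's view; only valid in that process */

    NTSTATUS
    CreateSharedBuffer(SIZE_T Size)
    {
        PHYSICAL_ADDRESS low = {0}, skip = {0}, high;
        high.QuadPart = -1;                    /* no upper physical limit */

        /* Back the buffer with pages described by an MDL, not with pool.
         * Note this call may return an MDL describing fewer bytes than
         * requested; a real driver should check MmGetMdlByteCount(). */
        g_Mdl = MmAllocatePagesForMdl(low, high, skip, Size);
        if (g_Mdl == NULL)
            return STATUS_INSUFFICIENT_RESOURCES;

        /* Kernel mapping -- needs a contiguous KVA range of Size bytes. */
        g_KernelVa = MmGetSystemAddressForMdlSafe(g_Mdl, NormalPagePriority);

        /* User mapping -- can raise an exception, so wrap it. */
        __try {
            g_UserVa = MmMapLockedPagesSpecifyCache(g_Mdl, UserMode, MmCached,
                                                    NULL, FALSE,
                                                    NormalPagePriority);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            g_UserVa = NULL;
        }

        return (g_KernelVa != NULL && g_UserVa != NULL)
                   ? STATUS_SUCCESS
                   : STATUS_INSUFFICIENT_RESOURCES;
    }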

METHOD_DIRECT is your friend.

-p


Hello everyone!

First of all, thanks for your replies! They are really helpful.

Here are the details of our project. We are working on a Windows Filtering Platform callout driver, which intercepts IP packets and their payload, as required for deep packet inspection functionality. The user's connection is usually about 100 Mb/s, which means a 20-30 MB buffer is more than enough to transfer the data from the driver to the user-mode service. Currently the transfer happens once per second via the direct method, and it works fine.

Since bandwidth is increasing significantly and a 1 Gb/s network connection on a home machine is not a miracle at all (and 10 Gb/s connections will probably be available soon even for regular users), we need to make some optimizations to avoid possible performance issues.

Sure, I realize that such massive network activity does not happen all the time. However, applications such as BitTorrent are able to saturate even a 1 Gb/s connection, actively sending and receiving small chunks of data. That is why we were trying to avoid fetching data from kernel space very often (it seems that switching the context is quite an expensive operation), and our idea was to collect some data in kernel space and then transfer it to user space as a single block.

To optimize this bottleneck, I was thinking about how to organize two shared buffers, which would be backed by the pagefile (since the amount of data may be quite big), and have the driver fill one of them while the other is being processed by the user-mode service, and vice versa.

But now I see that it's probably not a good idea, and it will be better if I transfer the data to user mode divided into small chunks.

Thanks,
Vitaly

>more than enough to transfer the data from the driver to the user-mode service. Currently the transfer happens once per second via the direct method, and it works fine.

Do you understand that you introduce a 1-second delay into the packet path, which will decrease your TCP performance a lot?

TCP throughput is roughly WindowSize / RTT, and you have just increased the RTT to 1 second.
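
To put numbers on that: with a 64 KB receive window and a 1 ms RTT the sender can keep roughly 64 KB / 0.001 s ≈ 64 MB/s in flight; stretch the RTT to 1 second and the same window caps the connection at about 64 KB/s.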

>receiving small chunks of data. That is why we were trying to avoid fetching data from kernel space very often (it seems that switching the context is quite an expensive operation)

A user/kernel transition is not a context switch. A context switch is from one thread to another, and that is exactly what you get with your inverted-call-based server process.

Also: please define "seems". Have you done any measurements? Have you really identified your bottlenecks?

>and our idea was to collect some data in kernel space and then transfer it to user space as a single block.

What about implementing your analysis fully in the kernel?

"Single block" means "large delay and RTT increase", just due to the wait to fully fill this single block. That means a performance drop, probably bigger than the one from lots of U/K transitions.

>To optimize this bottleneck, I was thinking about how to organize two shared buffers, which would be backed by the pagefile (since the amount of data may be quite big)

Oh my God. You've just suggested decreasing TCP performance (and increasing the RTT) by a disk page read delay.

This idea itself (with a kernel-allocated, locked memory buffer) is quite usable for some scenarios; for instance, the OS-provided HD Audio support (WaveRT) uses it.

But there is still the option of implementing the whole algo in kernel mode.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com

Hello Maxim,

Thanks for your reply.

We probably have a misunderstanding here. Let me explain.
We do not work with the original packets; we clone them. So our driver does not affect the main network data path.

Regarding the context switching, do I understand you correctly that IOCTL requests are not very expensive and I can use them rather often? Also, here is another question: will the driver receive the IOCTL request in the context of the same thread that initiated the request?

Implementing the algo in kernel mode is actually possible, but I'm afraid that further modification of the algorithm may become a problem later. So I'm trying to avoid this.

Thanks,
Vitaly

Oh yes, with cloning you do not need to worry this much about perf hits.

>Regarding the context switching, do I understand you correctly that IOCTL requests are not very expensive and I can use them rather often?

Some years ago there was a discussion on it here. IIRC the outcome was: you can do 1M IOCTLs per second, provided they use FastIo.
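
For reference, wiring up the fast I/O path for device control in a WDM driver looks roughly like this (hedged sketch; the IOCTL code is the hypothetical one from earlier in the thread, and the normal IRP_MJ_DEVICE_CONTROL dispatch routine is still needed as a fallback):

    /* The I/O manager calls this directly, without building an IRP.
     * Returning FALSE falls back to the regular IRP path. */
    BOOLEAN
    MyFastIoDeviceControl(
        PFILE_OBJECT FileObject, BOOLEAN Wait,
        PVOID InputBuffer, ULONG InputBufferLength,
        PVOID OutputBuffer, ULONG OutputBufferLength,
        ULONG IoControlCode, PIO_STATUS_BLOCK IoStatus,
        PDEVICE_OBJECT DeviceObject)
    {
        if (IoControlCode != IOCTL_DPI_GET_CHUNK)    /* hypothetical code */
            return FALSE;

        /* Buffers here are raw caller addresses: probe and capture them
         * yourself, just as with METHOD_NEITHER. */
        IoStatus->Status = STATUS_SUCCESS;
        IoStatus->Information = 0;                   /* bytes returned */
        return TRUE;
    }

    static FAST_IO_DISPATCH g_FastIoDispatch;

    VOID
    InitFastIo(PDRIVER_OBJECT DriverObject)          /* call from DriverEntry */
    {
        RtlZeroMemory(&g_FastIoDispatch, sizeof(g_FastIoDispatch));
        g_FastIoDispatch.SizeOfFastIoDispatch = sizeof(FAST_IO_DISPATCH);
        g_FastIoDispatch.FastIoDeviceControl  = MyFastIoDeviceControl;
        DriverObject->FastIoDispatch = &g_FastIoDispatch;
    }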

Anyway: such things require measurements. First measure, then make a decision.

I would really expect that the context (thread) switch to your server process will cost much more than an IOCTL.

>Will the driver receive the IOCTL request in the context of the same thread that initiated the request?

A driver without filters on top? Surely yes; otherwise METHOD_NEITHER would not work.
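
That's because with METHOD_NEITHER the driver receives raw user-mode pointers and has to probe and access them itself, which is only meaningful in the issuing process's address space. A hedged sketch of what that looks like:

    NTSTATUS
    HandleNeitherIoctl(PIRP Irp, PIO_STACK_LOCATION IrpSp)
    {
        PVOID userOut = Irp->UserBuffer;   /* raw user-mode address */
        ULONG outLen  = IrpSp->Parameters.DeviceIoControl.OutputBufferLength;
        NTSTATUS status = STATUS_SUCCESS;

        __try {
            /* Only valid while we are still in the requester's context. */
            ProbeForWrite(userOut, outLen, sizeof(UCHAR));
            /* ... copy data to userOut here ... */
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            status = GetExceptionCode();
        }
        return status;
    }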

>Implementing the algo in kernel mode is actually possible, but I'm afraid that further modification of the algorithm may become a problem later. So I'm trying to avoid this.

This is not about the algo. Any algo code can be trivially compiled for the kernel (provided it does not eat a lot of stack).

This is about external libraries used by the algo.


Maxim S. Shatskih
Microsoft MVP on File System And Storage
xxxxx@storagecraft.com
http://www.storagecraft.com