Kernel DMA buffer copy to user buffer too slow

Hi,
I allocated a 20 MB common buffer for PCIe DMA. I call RtlZeroMemory to initialize this buffer before triggering the DMA; that takes about 2 ms. Then I call WdfMemoryCopyToBuffer to copy the data to the user buffer; that takes about 16 ms.
My question is: why does Windows take so long to initialize memory and copy data from kernel space to user space? Is there any way to avoid these operations or decrease the copy time?
Thanks

On Jan 2, 2017, at 6:49 PM, xxxxx@hotmail.com wrote:

I allocated a 20 MB common buffer for PCIe DMA. I call RtlZeroMemory to initialize this buffer before triggering the DMA; that takes about 2 ms. Then I call WdfMemoryCopyToBuffer to copy the data to the user buffer; that takes about 16 ms.
My question is: why does Windows take so long to initialize memory and copy data from kernel space to user space?

Have you done the math on this? This is not a Windows thing; this is simply how long those operations take. We have become spoiled by our fast processors, but instructions take time, and memory is fast but not infinitely fast.

Consider that clearing 20MB in 2ms means 100 picoseconds per byte, which is a frighteningly fast rate. Copying 20MB in 16ms is a rate of just over 1 gigabyte per second. A “rep movsq” instruction, once it is rolling, can move 8 bytes per cycle, but in this case I suspect you’re being limited by memory bandwidth. Every byte has to pass through twice – once for read, once for write.
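To make the arithmetic concrete, here is a trivial user-mode snippet (an illustration only, plugging in the 20 MB / 2 ms / 16 ms figures quoted above) that prints the implied per-byte times and throughput:

#include <stdio.h>

int main(void)
{
    /* Figures quoted above: 20 MB buffer, ~2 ms to zero it, ~16 ms to copy it. */
    const double bytes  = 20.0 * 1024.0 * 1024.0;
    const double zero_s = 0.002;
    const double copy_s = 0.016;

    printf("zeroing: %.0f ps/byte, %.2f GB/s\n",
           zero_s / bytes * 1e12, bytes / zero_s / 1e9);
    printf("copying: %.0f ps/byte, %.2f GB/s\n",
           copy_s / bytes * 1e12, bytes / copy_s / 1e9);

    /* The copy reads and writes every byte, so the memory-bus traffic is
       roughly double the printed copy rate. */
    return 0;
}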

Is there any way to avoid these operations or decrease the copy time?

The fastest option is to have the hardware DMA directly into the user-mode buffer. Does your hardware not support scatter/gather? If not, then you have no alternative but to copy, as you are doing. It’s up to you to decide whether you need to clear the memory or not.
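For reference, if the hardware did support scatter/gather, the supported KMDF path would look roughly like this. This is a minimal, untested sketch with error handling trimmed; MyEvtProgramDma and MyGetDmaEnabler are made-up names for the callback that programs the hardware and the accessor that fetches the enabler from the device context:

#include <ntddk.h>
#include <wdf.h>

EVT_WDF_PROGRAM_DMA MyEvtProgramDma;               /* hypothetical: writes the SG list into the device's descriptors */
WDFDMAENABLER MyGetDmaEnabler(WDFDEVICE Device);   /* hypothetical: returns the enabler kept in the device context   */

NTSTATUS MyCreateDmaEnabler(WDFDEVICE Device, WDFDMAENABLER *Enabler)
{
    WDF_DMA_ENABLER_CONFIG config;

    /* Called during device initialization. */
    WDF_DMA_ENABLER_CONFIG_INIT(&config,
                                WdfDmaProfileScatterGather64,
                                20 * 1024 * 1024);          /* maximum transfer length */
    return WdfDmaEnablerCreate(Device, &config, WDF_NO_OBJECT_ATTRIBUTES, Enabler);
}

VOID MyEvtIoRead(WDFQUEUE Queue, WDFREQUEST Request, size_t Length)
{
    WDFDEVICE device = WdfIoQueueGetDevice(Queue);
    WDFDMATRANSACTION transaction = NULL;
    NTSTATUS status;

    UNREFERENCED_PARAMETER(Length);

    /* The framework builds the scatter/gather list directly from the
       caller's buffer, so the device DMAs straight into user memory
       and no intermediate copy is needed. */
    status = WdfDmaTransactionCreate(MyGetDmaEnabler(device),
                                     WDF_NO_OBJECT_ATTRIBUTES, &transaction);
    if (NT_SUCCESS(status)) {
        status = WdfDmaTransactionInitializeUsingRequest(transaction, Request,
                                                         MyEvtProgramDma,
                                                         WdfDmaDirectionReadFromDevice);
    }
    if (NT_SUCCESS(status)) {
        status = WdfDmaTransactionExecute(transaction, NULL);
    }
    if (NT_SUCCESS(status)) {
        return;                       /* the request is completed later, from the DPC */
    }

    if (transaction != NULL) {
        WdfObjectDelete(transaction);
    }
    WdfRequestComplete(Request, status);
}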

Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Hi, Tim:

Have you done the math on this?

No, I don't do anything with the buffer except initialize it and copy from the common buffer (allocated by WdfCommonBufferCreate) to the user application's buffer.

The fastest option is to have the hardware DMA directly into the user-mode buffer. Does your hardware not support scatter/gather? If not, then you have no alternative but to copy, as you are doing. It’s up to you to decide whether you need to clear the memory or not.

Sigh, our hardware team doesn’t support scatter/gather on this PCIe device. I know that on Linux a kernel driver can allocate a physical buffer and the user application can then mmap a virtual address onto that physical address, so the application can access the DMA buffer through that virtual address and avoid copying the data out of the kernel. Does Windows provide a similar method? Or is using a scatter/gather list the only way?

Thank you very much!

On Jan 2, 2017, at 9:56 PM, xxxxx@hotmail.com wrote:

> Have you done the math on this?

No, I don't do anything with the buffer except initialize it and copy from the common buffer (allocated by WdfCommonBufferCreate) to the user application's buffer.

That’s not what I meant. Often, when people complain about slow performance, they haven’t done the mathematics to figure out whether the times are reasonable or not. I’m guessing you have not done that.

Sigh, our hardware team doesn’t support scatter/gather on this PCIe device.

That’s a design flaw that they will come to regret.

I know that on Linux a kernel driver can allocate a physical buffer and the user application can then mmap a virtual address onto that physical address, so the application can access the DMA buffer through that virtual address and avoid copying the data out of the kernel. Does Windows provide a similar method? Or is using a scatter/gather list the only way?

Well, it’s not impossible to allocate a common buffer and then map it into user space, but such a technique is error prone and considered dangerous. What makes you think the copy time is going to impact you? Before you embark on a wild path, you need to be absolutely sure that the safe and supported path is not doing the job.

Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Though it is not the same as the Linux mmap interface, it is possible to map locked pages to a user-space address with MmMapLockedPagesSpecifyCache( Mdl, UserMode, … ).

PVOID MmMapLockedPagesSpecifyCache(
    _In_     PMDLX                MemoryDescriptorList,
    _In_     KPROCESSOR_MODE      AccessMode,
    _In_     MEMORY_CACHING_TYPE  CacheType,
    _In_opt_ PVOID                BaseAddress,
    _In_     ULONG                BugCheckOnFailure,
    _In_     MM_PAGE_PRIORITY     Priority
);
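For the archives, here is roughly what that call sequence looks like when the pages come from a WDF common buffer. This is a hedged, untested sketch, NOT a recommendation; MyMapCommonBufferToUser is a made-up helper, and all the caveats raised elsewhere in this thread (cleanup timing, process context, handle duplication, wrapping the map call in try/except) apply:

#include <ntddk.h>

/* Rough sketch of the technique: map an existing common buffer
   (nonpaged memory) into the current process. */
PVOID MyMapCommonBufferToUser(
    _In_  PVOID  CommonBufferVa,   /* e.g. from WdfCommonBufferGetAlignedVirtualAddress */
    _In_  ULONG  Length,
    _Out_ PMDL  *MdlOut)
{
    PVOID userVa = NULL;
    PMDL  mdl = IoAllocateMdl(CommonBufferVa, Length, FALSE, FALSE, NULL);

    *MdlOut = NULL;
    if (mdl == NULL) {
        return NULL;
    }

    /* The common buffer is nonpaged, so the pages don't need to be locked. */
    MmBuildMdlForNonPagedPool(mdl);

    __try {
        /* Must run in the context of the process that will use the mapping. */
        userVa = MmMapLockedPagesSpecifyCache(mdl, UserMode, MmCached,
                                              NULL, FALSE, NormalPagePriority);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        userVa = NULL;
    }

    if (userVa == NULL) {
        IoFreeMdl(mdl);
        return NULL;
    }

    /* Keep the MDL; later call MmUnmapLockedPages(userVa, mdl) and then
       IoFreeMdl(mdl) -- before the process goes away, not after. */
    *MdlOut = mdl;
    return userVa;
}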

Hi, Tim:
Thank you for your kind response. It is very helpful.

Well, it’s not impossible to allocate a common buffer and then map it into user space, but such a technique is error prone and considered dangerous.

Could you please show me how this wild path works? Thank you!

What makes you think the copy time is going to impact you? Before you embark on a wild path, you need to be absolutely sure that the safe and supported path is not doing the job.

Well, let me describe my device in more detail. It is a PCIe 3.0 device. It has 8 MSI-X interrupts, as mentioned in another thread I posted on the OSR website. It exposes 8 DMA descriptors in BAR0 space, and the driver programs these registers to trigger a DMA operation, but the 8 descriptors cannot be chained together, so I need to allocate a contiguous DMA buffer for each descriptor. For the read (device-to-host) direction, the driver gets an interrupt about 4~5 ms after triggering a 20 MB DMA transfer. If I don’t copy the data from the common buffer to the user buffer (i.e. I complete the read request immediately after receiving the read-complete interrupt), the user thread sees a read speed of about 4000 MB/s. But when I copy the data from the common buffer to the user buffer (which takes about 16 ms or more) before completing the read request, the user thread sees only about 800 MB/s. The driver’s users want a higher read speed, which is why I am trying to reduce the copy time.

Hi, Slava:
Thank you!

Though it is not the same as the Linux mmap interface, it is possible to map locked pages to a user-space address with MmMapLockedPagesSpecifyCache( Mdl, UserMode, … )

From the documentation of this API, I can’t figure out how to map the common buffer to a user-space address. Could you please show me more details or a simple sample?
Thank you very much!

Look at http://www.osronline.com/article.cfm?article=39

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com


Is your common buffer cached or non-cached?

Well, the obvious first thing to try is having the hardware DMA directly into the buffer from your UM application instead of into a buffer you allocate in your driver. This eliminates the need to copy anything, but you do need to ensure that you have enough buffers available and handle the case where you run out (do you queue the data in your driver or just throw it away?). This means that the application must be using overlapped I/O and should pend several ReadFile calls so that you have a queue of buffers to fill.

The best design will depend on how your application is designed and what kind of data this is. Most importantly, what do you do when your application can't keep up with the data from the device?
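For what it’s worth, the application side of that pattern might look something like this. A minimal, untested user-mode sketch; the device name "\\.\MyDmaDevice", the buffer count, and the 20 MB size are all assumptions, and most error handling is omitted:

#include <windows.h>
#include <stdio.h>

#define NUM_BUFFERS 4
#define BUF_SIZE    (20 * 1024 * 1024)

int main(void)
{
    HANDLE dev = CreateFileA("\\\\.\\MyDmaDevice", GENERIC_READ, 0, NULL,
                             OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (dev == INVALID_HANDLE_VALUE) {
        return 1;
    }

    OVERLAPPED ov[NUM_BUFFERS] = {0};
    HANDLE     events[NUM_BUFFERS];
    BYTE      *buf[NUM_BUFFERS];

    /* Pend several reads up front so the driver always has a buffer to fill. */
    for (int i = 0; i < NUM_BUFFERS; i++) {
        buf[i] = (BYTE *)VirtualAlloc(NULL, BUF_SIZE, MEM_COMMIT, PAGE_READWRITE);
        events[i] = CreateEvent(NULL, FALSE, FALSE, NULL);
        ov[i].hEvent = events[i];
        ReadFile(dev, buf[i], BUF_SIZE, NULL, &ov[i]);   /* expect ERROR_IO_PENDING */
    }

    for (;;) {
        /* Wait for whichever read finishes first, consume it, then re-pend it. */
        DWORD i = WaitForMultipleObjects(NUM_BUFFERS, events, FALSE, INFINITE) - WAIT_OBJECT_0;
        DWORD got = 0;

        if (i >= NUM_BUFFERS || !GetOverlappedResult(dev, &ov[i], &got, FALSE)) {
            break;
        }

        /* ... process buf[i][0 .. got) here ... */

        ReadFile(dev, buf[i], BUF_SIZE, NULL, &ov[i]);
    }

    CloseHandle(dev);
    return 0;
}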


Hi,

Here you have two options, namely:

  1. ReadFile/WriteFile (buffered)
  2. IOCTL (buffered, direct, and neither)

With the first option your capabilities are quite limited, as you have noticed.
But with the second option you have a good alternative, namely direct or
neither.

I suppose that if you change your I/O scheme to direct, for example, you can
get a better result.

The main benefits of IOCTLs are:

  • full-duplex I/O, whereas ReadFile/WriteFile performs a request in only
    one of the two possible directions
  • a choice of I/O schemes (buffered, direct, and neither)

Below is a link that explains the direct I/O method quite well: Using Direct I/O with DMA
https:

The neither method is more exotic, but it provides more flexibility and lets
you manage all of the I/O yourself: Using Neither Buffered Nor Direct I/O
https:

Certainly, it might be due to other reasons; some of them have already been
mentioned by others. It's just a suggestion from my side. Maybe you've
already tried this.

Best Regards,
Dmitry


xxxxx@hotmail.com wrote:

Well, let me describe my device in more detail. It is a PCIe 3.0 device. It has 8 MSI-X interrupts, as mentioned in another thread I posted on the OSR website. It exposes 8 DMA descriptors in BAR0 space, and the driver programs these registers to trigger a DMA operation, but the 8 descriptors cannot be chained together, so I need to allocate a contiguous DMA buffer for each descriptor. For the read (device-to-host) direction, the driver gets an interrupt about 4~5 ms after triggering a 20 MB DMA transfer. If I don’t copy the data from the common buffer to the user buffer (i.e. I complete the read request immediately after receiving the read-complete interrupt), the user thread sees a read speed of about 4000 MB/s. But when I copy the data from the common buffer to the user buffer (which takes about 16 ms or more) before completing the read request, the user thread sees only about 800 MB/s. The driver’s users want a higher read speed, which is why I am trying to reduce the copy time.

What we have here is a research-quality case study of a project which
failed to get the commitment of the hardware design team to read and
adhere to the user requirements. They designed whatever was fun and
easy to design, totally without regard for any actual customer needs.
And now you’re paying the price. This is a failure of project management.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> Though it is not the same as the Linux mmap interface, it is possible to map locked pages
> to a user-space address with MmMapLockedPagesSpecifyCache( Mdl, UserMode, … ).

Well, you can accuse me of trolling if you wish, but such a suggestion is normally just bound
to result in a “strong reaction” from certain posters. However, up to this point it has not yet happened, even though the “usual suspects” are participating in this thread. Bizarre…

Anyway, I can do their job if you wish. The above suggestion is, indeed, pretty unsafe in a Windows environment. For example, consider what happens if the target app terminates abnormally, or, even better, some other app that runs under the same account calls WriteProcessMemory() on the target range…

Anton Bassov

It’s reached epidemic levels: Engineers who can figure out JUST ENOUGH to throw some IP blocks together, but don’t really understand how to design a proper device. I see it all the time, and it drives me nuts. The give-away in this design is the eight MSI-X interrupts. Easy in Verilog, and sounds ever so useful when reading the Express spec. In real life? Not so much. It’s no different from software devs who grab and hack some sample from the web, without knowing in depth what they’re actually doing. Good enough. Just hack some shit. “I think I’ll map this buffer back into user virtual address space,” with no earthly clue as to the complexities involved or problems this can cause.

Arrrgh.

Peter
OSR
@OSRDrivers

OTOH, I don’t see any indication that this is, indeed, a final and immutable release-stage version of the card. Probably this is just a prototype that is meant to check whether such a hardware design is feasible in terms of interfacing the device to the software, i.e. exactly the question that the OP’s assignment is meant to answer…

Anton Bassov

Would someone say what the danger is, assuming that in the kernel+user
mapped area there are just data from some machine?

Hi, Don:

Look at http://www.osronline.com/article.cfm?article=39

I read the article and downloaded the sample code. In the sample, the driver calls MmAllocatePagesForMdl to allocate physical pages. But according to MSDN, MmAllocatePagesForMdl/MmAllocatePagesForMdlEx can’t guarantee a physically contiguous buffer, so I can’t use those pages for my DMA transfer.
Thank you!

MM_ALLOCATE_REQUIRE_CONTIGUOUS_CHUNKS
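i.e. pass that flag to MmAllocatePagesForMdlEx instead of calling MmAllocatePagesForMdl. Something along these lines might work (an untested sketch; double-check the SkipBytes and flag semantics in the WDK documentation before relying on it):

#include <ntddk.h>

/* Sketch: request physically contiguous pages from MmAllocatePagesForMdlEx.
   Verify the SkipBytes / MM_ALLOCATE_REQUIRE_CONTIGUOUS_CHUNKS semantics
   against the WDK docs before relying on this. */
PMDL MyAllocateContiguousMdl(SIZE_T Bytes)
{
    PHYSICAL_ADDRESS low, high, skip;

    low.QuadPart  = 0;
    high.QuadPart = (LONGLONG)-1;     /* no upper bound on physical address */
    skip.QuadPart = 0;

    return MmAllocatePagesForMdlEx(low, high, skip, Bytes, MmCached,
                                   MM_ALLOCATE_FULLY_REQUIRED |
                                   MM_ALLOCATE_REQUIRE_CONTIGUOUS_CHUNKS);
}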

Justin Schoenwald wrote:

Would someone say what the danger is, assuming that in the kernel+user
mapped area there are just data from some machine?

If the shared memory contains something like a linked list with pointers that the kernel side relies upon, usermode can tamper with these pointers and cause the kernel side to crash or access wrong data. Or, as Anton mentioned, someone can call WriteProcessMemory() on it.

The driver must validate such shared structures all the time, which hits performance.

Regards,

- pa

> Would someone say what the danger is, assuming that in the kernel+user
> mapped area there are just data from some machine?

We’ve done this on this list about a zillion times. In brief:

Let’s say you map a chunk of memory into user address space:

Don’t allocate the space from non-paged pool. Please. Allocate the space with MmAllocatePagesForMdl or something similar.

When do you do the unmap? The article at http://www.osronline.com/article.cfm?article=39 (which I wrote) says “in cleanup”… then notes the problem with this is “what happens if somebody duplicates the handle” – For years, I advocated ignoring this hole as one that’s unlikely to cause problems. I even argued here that it didn’t matter. I’ve changed my mind: It matters. If you’re going to map the section back to an app, you have to figure out, specifically, how you’re going to handle the fact that you can get the cleanup at the wrong time, and in the context of the wrong application. It’s not impossibly hard to handle… it’s just non-trivial, and something that a lot of devs don’t fully understand and therefore feel comfortable ignoring. Doing so creates security vulnerabilities.

Also note that the example in the cited article fails to wrap its call to MmMapLockedPagesSpecifyCache in a try/except. Ugh.

There IS an EASIER way to share a chunk of memory between user mode and kernel mode: Have the APP allocate the memory, then have the app send a pointer to this memory using a METHOD_OUT_DIRECT IOCTL. The user already has the buffer in memory. Windows will create an MDL that describes the buffer, and the driver can then easily access it (in WDF you’d just call WdfRequestRetrieveOutputBuffer; in WDM, MmGetSystemAddressForMdlSafe). The trick is to keep the request pending while the buffer is needed and shared.
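A bare-bones KMDF sketch of that pattern, for the archives (the IOCTL code and the MyStashPendingRequest helper are made up, and error handling is minimal):

#include <ntddk.h>
#include <wdf.h>

#define IOCTL_MYDEV_MAP_BUFFER \
    CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_OUT_DIRECT, FILE_ANY_ACCESS)

/* Hypothetical helper: remembers the pending request and the buffer so the
   driver can use the buffer until the request is completed or canceled. */
VOID MyStashPendingRequest(WDFQUEUE Queue, WDFREQUEST Request,
                           PVOID Buffer, size_t Length);

VOID MyEvtIoDeviceControl(WDFQUEUE Queue, WDFREQUEST Request,
                          size_t OutputBufferLength, size_t InputBufferLength,
                          ULONG IoControlCode)
{
    UNREFERENCED_PARAMETER(InputBufferLength);

    if (IoControlCode == IOCTL_MYDEV_MAP_BUFFER) {
        PVOID    buffer;
        size_t   length;
        NTSTATUS status;

        /* METHOD_OUT_DIRECT: the I/O manager has already probed and locked
           the app's pages and built an MDL; this gives us a kernel VA for it. */
        status = WdfRequestRetrieveOutputBuffer(Request, OutputBufferLength,
                                                &buffer, &length);
        if (!NT_SUCCESS(status)) {
            WdfRequestComplete(Request, status);
            return;
        }

        /* Keep the request pending for as long as the buffer is shared; when
           it is completed (or canceled), the pages go back to the app. */
        MyStashPendingRequest(Queue, Request, buffer, length);
        return;
    }

    WdfRequestComplete(Request, STATUS_INVALID_DEVICE_REQUEST);
}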

Of course, this doesn’t help the OP who wants physically contiguous memory.

Peter
OSR
@OSRDrivers