
OSR Online Lists > ntdev
  Message 1 of 42  
02 Jan 17 21:50
Abei Liu
xxxxxx@hotmail.com
Join Date: 18 Jan 2011
Posts To This List: 40
Kernel DMA buffer copy to user buffer too slow

Hi, I allocated a 20 MB common buffer for PCIe DMA operations. I call RtlZeroMemory to initialize this buffer before triggering the DMA; this takes about 2 ms. Then I call WdfMemoryCopyToBuffer to copy the data to the user buffer; this takes about 16 ms. My question is: why does Windows need so long to initialize memory and copy data from kernel space to user space? Is there any way to avoid these operations or reduce the copy time? Thanks
  Message 2 of 42  
02 Jan 17 23:17
Tim Roberts
xxxxxx@probo.com
Join Date: 28 Jan 2005
Posts To This List: 11622
Kernel DMA buffer copy to user buffer too slow

On Jan 2, 2017, at 6:49 PM, xxxxx@hotmail.com wrote:

> I allocated a 20 MB common buffer for PCIe DMA operations. I call RtlZeroMemory to initialize this buffer before triggering the DMA; this takes about 2 ms. Then I call WdfMemoryCopyToBuffer to copy the data to the user buffer; this takes about 16 ms.
> My question is: why does Windows need so long to initialize memory and copy data from kernel space to user space?

Have you done the math on this? This is not a Windows thing; this is simply how long those operations take. We have become spoiled by our fast processors, but instructions take time, and memory is fast but not infinitely fast. Consider that clearing 20MB in 2ms means 100 picoseconds per byte, which is a frighteningly fast rate. Copying 20MB in 16ms is a rate of just over 1 gigabyte per second. A "rep movsq" instruction, once it is rolling, can move 8 bytes per cycle, but in this case I suspect you're being limited by memory bandwidth. Every byte has to pass through twice -- once for read, once for write.

> Is there any way to avoid these operations or decrease the time of copy?

The fastest option is to have the hardware DMA directly into the user-mode buffer. Does your hardware not support scatter/gather? If not, then you have no alternative but to copy, as you are doing. It's up to you to decide whether you need to clear the memory or not.

Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
  Message 3 of 42  
03 Jan 17 00:57
Abei Liu
xxxxxx@hotmail.com
Join Date: 18 Jan 2011
Posts To This List: 40
Kernel DMA buffer copy to user buffer too slow

Hi, Tim:

> Have you done the math on this?

No. I don't do anything with the buffer except the initialization and the copy from the common buffer (allocated by WdfCommonBufferCreate) to the user application buffer.

> The fastest option is to have the hardware DMA directly into the user-mode buffer. Does your hardware not support scatter/gather? If not, then you have no alternative but to copy, as you are doing. It's up to you to decide whether you need to clear the memory or not.

Sigh, our hardware team doesn't support scatter/gather for this PCIe device. I know that on Linux a kernel driver can allocate a physical buffer and then mmap a virtual address to that physical address in the user application. The user application can then use the virtual address to access the DMA buffer directly, avoiding the copy from kernel to its own buffer. Does Windows provide a similar method? Or is a scatter/gather list the only way? Thank you very much!
  Message 4 of 42  
03 Jan 17 02:39
Tim Roberts
xxxxxx@probo.com
Join Date: 28 Jan 2005
Posts To This List: 11622
Kernel DMA buffer copy to user buffer too slow

On Jan 2, 2017, at 9:56 PM, xxxxx@hotmail.com wrote:

>> Have you done the math on this?
>
> No. I don't do anything with the buffer except the initialization and the copy from the common buffer (allocated by WdfCommonBufferCreate) to the user application buffer.

That's not what I meant. Often, when people complain about slow performance, they haven't done the mathematics to figure out whether the times are reasonable or not. I'm guessing you have not done that.

> Sigh, our hardware team doesn't support scatter/gather for this PCIe device.

That's a design flaw that they will come to regret.

> I know that on Linux a kernel driver can allocate a physical buffer and then mmap a virtual address to that physical address in the user application. ... Does Windows provide a similar method? Or is a scatter/gather list the only way?

Well, it's not impossible to allocate a common buffer and then map it into user space, but such a technique is error prone and considered dangerous. What makes you think the copy time is going to impact you? Before you embark on a wild path, you need to be absolutely sure that the safe and supported path is not doing the job.

Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
  Message 5 of 42  
03 Jan 17 07:19
Slava Imameev
xxxxxx@hotmail.com
Join Date: 13 Sep 2013
Posts To This List: 204
Kernel DMA buffer copy to user buffer too slow

> Linux can malloc a physical buffer in a kernel driver and then mmap a virtual address to this physical address in the user application. Then the user application can use the virtual address to access the DMA buffer to avoid copying data from kernel to its buffer. Does Windows provide a similar method?

Though it is not similar to the Linux mmap interface, it is possible to map locked pages to a user-space address with MmMapLockedPagesSpecifyCache( Mdl, UserMode, ... ):

    PVOID
    MmMapLockedPagesSpecifyCache(
        _In_     PMDLX               MemoryDescriptorList,
        _In_     KPROCESSOR_MODE     AccessMode,
        _In_     MEMORY_CACHING_TYPE CacheType,
        _In_opt_ PVOID               BaseAddress,
        _In_     ULONG               BugCheckOnFailure,
        _In_     MM_PAGE_PRIORITY    Priority
        );
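[Editor's aside: a minimal sketch of the mapping Slava describes, for a nonpaged common buffer. The function name MapCommonBufferToUser is hypothetical and error handling is reduced to the essentials. Note that MmMapLockedPagesSpecifyCache raises an exception on failure when AccessMode is UserMode, so the call must be wrapped in __try/__except. This requires the WDK and is not runnable outside kernel mode.]

```c
/* Hypothetical sketch: map a driver-owned (nonpaged) common buffer into the
 * calling process. Requires the WDK; not a complete, hardened implementation. */
#include <ntddk.h>

PVOID
MapCommonBufferToUser(_In_ PVOID KernelVa, _In_ SIZE_T Length, _Out_ PMDL *MdlOut)
{
    PVOID userVa = NULL;
    PMDL mdl = IoAllocateMdl(KernelVa, (ULONG)Length, FALSE, FALSE, NULL);
    if (mdl == NULL) {
        return NULL;
    }

    /* The common buffer is nonpaged, so no MmProbeAndLockPages is needed. */
    MmBuildMdlForNonPagedPool(mdl);

    __try {
        /* Raises an exception on failure when mapping to UserMode. */
        userVa = MmMapLockedPagesSpecifyCache(mdl, UserMode, MmCached,
                                              NULL, FALSE, NormalPagePriority);
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        userVa = NULL;
    }

    if (userVa == NULL) {
        IoFreeMdl(mdl);
    } else {
        *MdlOut = mdl;
    }
    return userVa;
}
```
The mapping must later be undone, in the context of the same process, with MmUnmapLockedPages followed by IoFreeMdl -- which is exactly the cleanup-timing problem discussed later in this thread.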
  Message 6 of 42  
03 Jan 17 07:38
Abei Liu
xxxxxx@hotmail.com
Join Date: 18 Jan 2011
Posts To This List: 40
Kernel DMA buffer copy to user buffer too slow

Hi, Tim: Thank you for your kind response. It is very helpful for me.

> Well, it's not impossible to allocate a common buffer and then map it into user space, but such a technique is error prone and considered dangerous.

Could you please show me how this wild path works? Thank you!

> What makes you think the copy time is going to impact you? Before you embark on a wild path, you need to be absolutely sure that the safe and supported path is not doing the job.

Well, let me describe my device in more detail. It is a PCIe 3.0 device. It has 8 MSI-X interrupts, as I said in another thread I posted on the OSR website before. It supports 8 DMA descriptors in BAR0 space. The driver can configure these registers to trigger a DMA operation, but the 8 descriptors can't work together like a chain, so I need to allocate a contiguous DMA buffer for each descriptor. In the read-from-device direction, the driver gets an interrupt about 4~5ms after triggering a 20MB DMA transfer. If I don't copy the data from the common buffer to the user buffer (I just complete the read request immediately after receiving the read-complete interrupt), the user thread sees a read speed of about 4000 MB/s. But when I copy the data from the common buffer to the user buffer (which takes about 16ms or more) before completing the read request, the user thread sees a read speed of about 800 MB/s. Our users want a higher read speed; that is why I am trying to reduce the copy time.
  Message 7 of 42  
03 Jan 17 08:00
Abei Liu
xxxxxx@hotmail.com
Join Date: 18 Jan 2011
Posts To This List: 40
Kernel DMA buffer copy to user buffer too slow

Hi, Slava: Thank you!

> Though it is not similar to the Linux mmap interface, it is possible to map locked pages to a user-space address with MmMapLockedPagesSpecifyCache( Mdl, UserMode, ... )

From the documentation of this API, I can't work out how to map the common buffer to a user-space address. Could you please show me more details, or a simple sample? Thank you very much!
  Message 8 of 42  
03 Jan 17 08:25
Don Burn
xxxxxx@windrvr.com
Join Date: 23 Feb 2011
Posts To This List: 1349
Kernel DMA buffer copy to user buffer too slow

Look at http://www.osronline.com/article.cfm?article=39

Don Burn
Windows Driver Consulting
Website: http://www.windrvr.com
  Message 9 of 42  
03 Jan 17 09:10
Alex Grig
xxxxxx@broadcom.com
Join Date: 14 Apr 2008
Posts To This List: 3216
Kernel DMA buffer copy to user buffer too slow

Is your common buffer cached or non-cached?
  Message 10 of 42  
03 Jan 17 09:24
M M
xxxxxx@hotmail.com
Join Date: 21 Oct 2010
Posts To This List: 744
Kernel DMA buffer copy to user buffer too slow

Well, the obvious first thing to try is having the hardware DMA directly into the buffer from your UM application instead of into a buffer you allocate in your driver. This eliminates the need to copy anything, but you do need to ensure that you have enough buffers available and handle the case where you run out (do you queue the data in your driver or just throw it away?). This means that the application must be using overlapped I/O and should pend several ReadFile calls so that you have a queue of buffers to fill.

The best design will depend on how your application is designed and what kind of data this is. Most importantly: what to do when your application can't keep up with the data from the device.
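[Editor's aside: a user-mode sketch of the approach above, assuming a hypothetical device name \\.\MyDmaDevice and a driver that completes reads by DMAing directly into the caller's buffer. Error handling is largely omitted; this is Win32-only and won't build outside the Windows SDK.]

```c
/* Hypothetical sketch: keep several overlapped ReadFile requests pending so
 * the driver always has a user buffer available to DMA into. */
#include <windows.h>
#include <stdio.h>

#define NUM_BUFFERS 4
#define BUF_SIZE    (20 * 1024 * 1024)

int main(void)
{
    /* "\\.\MyDmaDevice" is a hypothetical device name. */
    HANDLE dev = CreateFileW(L"\\\\.\\MyDmaDevice", GENERIC_READ, 0, NULL,
                             OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (dev == INVALID_HANDLE_VALUE) {
        return 1;
    }

    BYTE *buf[NUM_BUFFERS];
    OVERLAPPED ov[NUM_BUFFERS];
    ZeroMemory(ov, sizeof(ov));

    /* Pend several reads up front, building a queue of buffers for the driver. */
    for (int i = 0; i < NUM_BUFFERS; i++) {
        buf[i] = (BYTE *)VirtualAlloc(NULL, BUF_SIZE, MEM_COMMIT, PAGE_READWRITE);
        ov[i].hEvent = CreateEventW(NULL, TRUE, FALSE, NULL);
        ReadFile(dev, buf[i], BUF_SIZE, NULL, &ov[i]);
    }

    /* Service completions round-robin, re-pending each buffer as it drains. */
    for (int n = 0; n < 100; n++) {
        int i = n % NUM_BUFFERS;
        DWORD got = 0;
        if (!GetOverlappedResult(dev, &ov[i], &got, TRUE)) {
            break;
        }
        /* ... consume got bytes from buf[i] here ... */
        ResetEvent(ov[i].hEvent);
        ReadFile(dev, buf[i], BUF_SIZE, NULL, &ov[i]);
    }
    CloseHandle(dev);
    return 0;
}
```
The depth of the queue (NUM_BUFFERS here) is what absorbs bursts when the application briefly falls behind the device.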
  Message 11 of 42  
03 Jan 17 14:15
Dzmitry Altukhou
xxxxxx@gmail.com
Join Date: 24 Jun 2016
Posts To This List: 12
Kernel DMA buffer copy to user buffer too slow

Hi, you have two options here, namely:

1. ReadFile/WriteFile (buffered I/O)
2. IOCTL (buffered, direct, or neither)

With the first option your capabilities are quite limited, as you have noted. But with the second option you have a good alternative, namely direct or neither I/O. I suppose if you change your I/O scheme to direct, for example, you can get a better result. The main benefits of IOCTLs are:

- full-duplex I/O, whereas ReadFile/WriteFile performs a request in only one of the two possible directions
- different I/O schemes (buffered, direct, and neither)

Below is a link that explains the direct I/O method quite well: Using Direct I/O with DMA <https://msdn.microsoft.com/en-us/library/windows/hardware/ff565374(v=vs.85).aspx>. The neither method is more exotic but provides more flexibility and allows you to manage all the I/O operations on your own: Using Neither Buffered Nor Direct I/O <https://msdn.microsoft.com/en-us/library/windows/hardware/ff565432(v=vs.85).aspx>.

Certainly it might be due to other reasons; some of them have already been mentioned by others. It's just a suggestion from my side. Maybe you've already tried this.

Best Regards,
Dmitry
  Message 12 of 42  
03 Jan 17 19:11
Tim Roberts
xxxxxx@probo.com
Join Date: 28 Jan 2005
Posts To This List: 11622
Kernel DMA buffer copy to user buffer too slow

xxxxx@hotmail.com wrote:

> Well, let me describe my device in more detail. It is a PCIe 3.0 device. It has 8 MSI-X interrupts [...] It supports 8 DMA descriptors in BAR0 space. The driver can configure these registers to trigger a DMA operation, but the 8 descriptors can't work together like a chain [...] Our users want a higher read speed; that is why I am trying to reduce the copy time.

What we have here is a research-quality case study of a project which failed to get the commitment of the hardware design team to read and adhere to the user requirements. They designed whatever was fun and easy to design, totally without regard for any actual customer needs. And now you're paying the price. This is a failure of project management.

--
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
  Message 13 of 42  
03 Jan 17 19:54
anton bassov
xxxxxx@hotmail.com
Join Date: 16 Jul 2006
Posts To This List: 4384
Kernel DMA buffer copy to user buffer too slow

> Though it is not similar to the Linux mmap interface, it is possible to map locked pages to a user space address with MmMapLockedPagesSpecifyCache( Mdl, UserMode, ... ).

Well, you can accuse me of trolling if you wish, but such a suggestion is normally just bound to result in a "strong reaction" from certain posters. However, up to this point it has not happened, although the "usual suspects" are participating in this thread. Bizarre...

Anyway, I can do their job if you wish. The above suggestion is, indeed, pretty unsafe in a Windows environment. For example, consider what happens if the target app terminates abnormally, or, even better, some other app that runs under the same account calls WriteProcessMemory() on the target range...

Anton Bassov
  Message 14 of 42  
03 Jan 17 20:00
Peter Viscarola (OSR)
xxxxxx@osr.com
Join Date:
Posts To This List: 5949
List Moderator
Kernel DMA buffer copy to user buffer too slow

<quote> a research-quality case study of a project which failed to get the commitment of the hardware design team to read and adhere to the user requirements. </quote> It's reached epidemic levels: Engineers who can figure out JUST ENOUGH to throw some IP blocks together, but don't really understand how to design a proper device. I see it all the time, and it drives me nuts. The give-away in this design is the eight MSI-X interrupts. Easy in Verilog, and sounds ever so useful when reading the Express spec. In real life? Not so much. It's no different from software devs who grab and hack some sample from the web, without knowing in depth what they're actually doing. Good enough. Just hack some shit. "I think I'll map this buffer back into user virtual address space," with no earthly clue as to the complexities involved or problems this can cause. Arrrgh. Peter OSR @OSRDrivers
  Message 15 of 42  
03 Jan 17 20:01
anton bassov
xxxxxx@hotmail.com
Join Date: 16 Jul 2006
Posts To This List: 4384
Kernel DMA buffer copy to user buffer too slow

<quote> What we have here is a research-quality case study of a project which failed to get the commitment of the hardware design team to read and adhere to the user requirements. They designed whatever was fun and easy to design, totally without regard for any actual customer needs. And now you're paying the price. This is a failure of project management. </quote>

OTOH, I don't see any indication that this is, indeed, a final and immutable release-stage version of the card. Probably this is just a prototype that is meant to check whether such a hardware design is feasible in terms of interfacing the device to the software, i.e. exactly the question that the OP's assignment is meant to answer...

Anton Bassov
  Message 16 of 42  
03 Jan 17 21:11
Justin Schoenwald
xxxxxx@cornell.edu
Join Date:
Posts To This List: 21
Kernel DMA buffer copy to user buffer too slow

Would someone explain what the danger is, assuming that the kernel+user mapped area contains just data from some device?
  Message 17 of 42  
04 Jan 17 01:55
Abei Liu
xxxxxx@hotmail.com
Join Date: 18 Jan 2011
Posts To This List: 40
Kernel DMA buffer copy to user buffer too slow

Hi, Don:

> Look at http://www.osronline.com/article.cfm?article=39

I read the article and downloaded the sample code. In the sample, the driver calls MmAllocatePagesForMdl to allocate physical pages. But according to MSDN, MmAllocatePagesForMdl/MmAllocatePagesForMdlEx do not allocate a physically contiguous buffer, so I can't use those pages for my DMA transfer. Thank you!
  Message 18 of 42  
04 Jan 17 04:47
msr
xxxxxx@yahoo.com
Join Date: 03 Feb 2006
Posts To This List: 301
Kernel DMA buffer copy to user buffer too slow

MM_ALLOCATE_REQUIRE_CONTIGUOUS_CHUNKS
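[Editor's aside: a minimal sketch of the flag msr is pointing at, which makes MmAllocatePagesForMdlEx return physically contiguous pages (Windows 7 and later) without consuming kernel VA. The function name AllocateContiguousDmaPages is hypothetical; requires the WDK.]

```c
/* Hypothetical sketch: allocate physically contiguous pages suitable for a
 * single DMA descriptor, without mapping them into kernel VA. Requires WDK. */
#include <ntddk.h>

PMDL
AllocateContiguousDmaPages(_In_ SIZE_T Length)
{
    PHYSICAL_ADDRESS low, high, skip;
    low.QuadPart  = 0;
    high.QuadPart = -1;   /* all ones: no upper limit; restrict this if the
                             device cannot address the full 64-bit range */
    skip.QuadPart = 0;

    /* MM_ALLOCATE_REQUIRE_CONTIGUOUS_CHUNKS (Win7+) forces physical
     * contiguity; MM_ALLOCATE_FULLY_REQUIRED fails rather than returning a
     * partial allocation. */
    return MmAllocatePagesForMdlEx(low, high, skip, Length, MmCached,
                                   MM_ALLOCATE_REQUIRE_CONTIGUOUS_CHUNKS |
                                   MM_ALLOCATE_FULLY_REQUIRED);
}
```
The returned pages are freed with MmFreePagesFromMdl followed by ExFreePool on the MDL itself.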
  Message 19 of 42  
04 Jan 17 06:07
Pavel A
xxxxxx@fastmail.fm
Join Date: 21 Jul 2008
Posts To This List: 2401
Kernel DMA buffer copy to user buffer too slow

Justin Schoenwald wrote: > Would someone say what the danger is, assuming that in the kernel+user > mapped area there are just data from some machine? If the shared memory contains something like a linked list with pointers that the kernel side relies upon, usermode can tamper with these pointers and cause the kernel side to crash or access wrong data. Or, as Anton mentioned, someone can call WriteProcessMemory() on it. The driver must validate such shared structures all the time, which hits performance. Regards, - - pa
  Message 20 of 42  
04 Jan 17 08:11
Peter Viscarola (OSR)
xxxxxx@osr.com
Join Date:
Posts To This List: 5949
List Moderator
Kernel DMA buffer copy to user buffer too slow

> Would someone say what the danger is, assuming that in the kernel+user mapped area there are just data from some machine?

We've done this on this list about a zillion times. In brief, let's say you map a chunk of memory into user address space:

Don't allocate the space from non-paged pool. Please. Allocate the space with MmAllocatePagesForMdl or something similar.

When do you do the unmap? The article at http://www.osronline.com/article.cfm?article=39 (which I wrote) says "in cleanup"... then notes that the problem with this is "what happens if somebody duplicates the handle". For years, I advocated ignoring this hole as one that's unlikely to cause problems. I even argued here that it didn't matter. I've changed my mind: it matters. If you're going to map the section back to an app, you have to figure out, specifically, how you're going to handle the fact that you can get the cleanup at the wrong time, and in the context of the wrong application. It's not impossibly hard to handle... it's just non-trivial, and something that a lot of devs don't fully understand and therefore feel comfortable ignoring. Doing so creates security vulnerabilities.

Also note that the example in the cited article fails to wrap its call to MmMapLockedPagesSpecifyCache in a try/except. Ugh.

There IS an EASIER way to share a chunk of memory between user mode and kernel mode: have the APP allocate the memory, then have the app send a pointer to this memory using a METHOD_OUT_DIRECT IOCTL. The user already has the buffer in memory. Windows will create an MDL that describes the buffer, and the driver can then easily access it (in WDF you'd just call WdfRequestRetrieveOutputBuffer; in WDM, MmGetSystemAddressForMdlSafe). The trick is to keep the request pending while the buffer is needed and shared. Of course, this doesn't help the OP, who wants physically contiguous memory.

Peter
OSR
@OSRDrivers
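[Editor's aside: a minimal KMDF sketch of the inverted-call sharing Peter describes. The IOCTL code, the device-context layout, and the manual HoldQueue are hypothetical, and the driver-side creation of that queue is omitted; requires the WDK.]

```c
/* Hypothetical sketch: share an app-allocated buffer via a pended
 * METHOD_OUT_DIRECT IOCTL, per the description above. Requires WDK/KMDF. */
#include <ntddk.h>
#include <wdf.h>

#define IOCTL_SHARE_BUFFER \
    CTL_CODE(FILE_DEVICE_UNKNOWN, 0x800, METHOD_OUT_DIRECT, FILE_ANY_ACCESS)

typedef struct _DEVICE_CONTEXT {
    WDFQUEUE HoldQueue;     /* manual queue that parks the shared request */
    PVOID    SharedBuffer;  /* system-space address of the user's buffer */
    size_t   SharedLength;
} DEVICE_CONTEXT, *PDEVICE_CONTEXT;
WDF_DECLARE_CONTEXT_TYPE_WITH_NAME(DEVICE_CONTEXT, GetDeviceContext)

VOID
EvtIoDeviceControl(WDFQUEUE Queue, WDFREQUEST Request,
                   size_t OutputBufferLength, size_t InputBufferLength,
                   ULONG IoControlCode)
{
    UNREFERENCED_PARAMETER(InputBufferLength);

    if (IoControlCode == IOCTL_SHARE_BUFFER) {
        PDEVICE_CONTEXT ctx = GetDeviceContext(WdfIoQueueGetDevice(Queue));
        PVOID buffer;
        size_t length;

        /* Windows has already probed and locked the user buffer and built an
         * MDL; this returns a system-space mapping the driver can use. */
        NTSTATUS status = WdfRequestRetrieveOutputBuffer(
            Request, OutputBufferLength, &buffer, &length);
        if (!NT_SUCCESS(status)) {
            WdfRequestComplete(Request, status);
            return;
        }

        ctx->SharedBuffer = buffer;
        ctx->SharedLength = length;

        /* Keep the request pending for as long as the sharing lasts;
         * completing it later ends the sharing safely, even if the app dies. */
        status = WdfRequestForwardToIoQueue(Request, ctx->HoldQueue);
        if (!NT_SUCCESS(status)) {
            WdfRequestComplete(Request, status);
        }
        return;
    }
    WdfRequestComplete(Request, STATUS_INVALID_DEVICE_REQUEST);
}
```
Because the framework cancels the pended request on process exit, the driver's EvtRequestCancel (not shown) is where the shared pointers must be invalidated.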
  Message 21 of 42  
04 Jan 17 19:49
Abei Liu
xxxxxx@hotmail.com
Join Date: 18 Jan 2011
Posts To This List: 40
Kernel DMA buffer copy to user buffer too slow

Hi, MSR:

> MM_ALLOCATE_REQUIRE_CONTIGUOUS_CHUNKS

Thank you very much! I will try this approach, since our customers use Windows 7 or later.
  Message 22 of 42  
04 Jan 17 20:55
Abei Liu
xxxxxx@hotmail.com
Join Date: 18 Jan 2011
Posts To This List: 40
Kernel DMA buffer copy to user buffer too slow

Hi, All: Thank you all for your kind help and advice.

Firstly, I know this is a fault of the hardware design. I have emphasized many times that the hardware engineers should support SGL, but obviously I failed. Anyway, I need to work with the hardware design as it is and try to meet the customer's performance requirement.

Secondly, the security problems caused by mapping physical memory to user space matter to me. I will not implement that in the official version; it is just for test purposes, for performance comparison.

Finally, sincerely speaking, I am a newbie in Windows kernel programming even though I have spent about 10 years on it. Every time I encounter problems with Windows driver development, I post a thread asking for advice on OSR, and every time you all help me so much. I remember you: Tim, Don, Peter, Pavel... Thank you!!!
  Message 23 of 42  
05 Jan 17 07:54
msr
xxxxxx@yahoo.com
Join Date: 03 Feb 2006
Posts To This List: 301
Kernel DMA buffer copy to user buffer too slow

Assuming sharing driver-allocated memory with user mode (rather than the other way around) is needed, why is the below an issue?

<< http://www.osronline.com/article.cfm?article=39 ... Allocating pages from main memory [1] is inherently more secure than using paged or non-paged pool [2], which is never a good idea. >>

>> Don't allocate the space from non-paged pool. Please. Allocate the space with AllocatePagesForMdl or something similar. <<

Sorry if this was explained already or is obvious, but why is the ExAllocatePool(NonPaged) + MmBuildMdlForNonPagedPool() combination not recommended? Is it because it:

- uses the scarce non-paged blocks? But even the other methods consume the same non-paged pages. OR
- consumes extra map buffers? Not sure that is the case either, as MmGetSystemAddressForMdlSafe() is a no-op in this case (and the eventual MmMapLockedPagesSpecifyCache() will have AccessMode = UserMode).

Not sure what security the above is referring to; won't both ways have the same security issues? And MSDN on MmMapLockedPagesSpecifyCache(Mdl, ...) says: "A pointer to the MDL that is to be mapped. This MDL must describe physical pages that are locked down. A locked-down MDL can be built by the MmProbeAndLockPages or MmAllocatePagesForMdlEx routine. *** For mappings to user space, MDLs that are built by the MmBuildMdlForNonPagedPool routine can be used. ***"
  Message 24 of 42  
05 Jan 17 07:56
M M
xxxxxx@hotmail.com
Join Date: 21 Oct 2010
Posts To This List: 744
Kernel DMA buffer copy to user buffer too slow

The only avenue that you can pursue from here is concurrency. This will be highly dependent on the design of the UM application, but if you need the CPU to copy, then try to have many CPUs copy smaller chunks in parallel.
  Message 25 of 42  
05 Jan 17 08:03
Bob Ammerman
xxxxxx@ramsystems.biz
Join Date: 05 Jun 2016
Posts To This List: 50
Kernel DMA buffer copy to user buffer too slow

Marion Bond said:

> The only avenue that you can pursue from here is concurrency. This will be highly dependent on the design of the UM application, but if you need the CPU to copy, then try to have many CPUs copy smaller chunks in parallel

I would expect less than a stellar improvement by doing this. Wouldn't main memory bandwidth be the limiting factor? If so, multiple CPUs would not help that much.

* Bob
  Message 26 of 42  
05 Jan 17 08:27
M M
xxxxxx@hotmail.com
Join Date: 21 Oct 2010
Posts To This List: 744
Kernel DMA buffer copy to user buffer too slow

The OP doesn't have a lot of choices. It really depends on what his application needs to do with the data, and the parallelism needs to be driven from UM.
  Message 27 of 42  
05 Jan 17 12:51
Peter Viscarola (OSR)
xxxxxx@osr.com
Join Date:
Posts To This List: 5949
List Moderator
Kernel DMA buffer copy to user buffer too slow

<quote> but why it (ExAlloc(NPaged), MmBuildMdlforNPagedPool()) is not recommended </quote>

Because non-paged pool is commonly used for storage of lots of secure kernel "stuff" and the memory isn't cleared before allocation -- the risk of an information disclosure vulnerability is greater, and needless.

Also, because if you get the whole cleanup process wrong (or just don't do it), you wind up with a user-mode process that has a mapping into blocks of non-paged pool that have been freed and are subject to subsequent reuse for secure kernel "stuff"... again, risking an information disclosure vulnerability.

Peter
OSR
@OSRDrivers
  Message 28 of 42  
05 Jan 17 13:39
Peter Wieland
xxxxxx@microsoft.com
Join Date: 16 Jul 2009
Posts To This List: 303
Kernel DMA buffer copy to user buffer too slow

The kernel has a virtual address space, which is divided up in a number of ways. One of those ways is the pool (paged & non-paged), which is intended for dynamic allocation by the kernel and by drivers. This is very similar to how your process has an address space, some of which is allocated to heaps (e.g. the Win32 heap & the CRT heap) for dynamic allocations.

Kernel virtual address space is a shared resource, and it can run low. Less so on 64-bit machines, but you still have a bunch of kernel components fighting over it along with drivers. It also has a cost, because it requires MM to find free page-table entries, which might require allocating page tables, and then assign them to your memory. It's better to avoid taking it up if you can.

Fortunately, in the kernel you have the option to allocate physical pages without having them mapped into kernel virtual address space. That's what MmAllocatePagesForMdl() does - it finds pages on the free list that meet your criteria, locks them, and then gives you back a list of them (in the MDL), but doesn't map them into KVA. That leaves you free to decide how to use them - you can DMA into them, or map just a portion into the kernel, or the whole thing into the kernel, or map them into user mode, etc...

-p

-----Original Message-----
From: xxxxx@lists.osr.com On Behalf Of xxxxx@yahoo.com
Sent: Thursday, January 5, 2017 4:53 AM
Subject: RE:[ntdev] Kernel DMA buffer copy to user buffer too slow

Assuming sharing driver-allocated memory with user mode (rather than the other way) is needed, why is the below an issue?

<< http://www.osronline.com/article.cfm?article=39
... Allocating pages from main memory [1] is inherently more secure than using paged or non-paged pool [2], which is never a good idea. >>

>> Don't allocate the space from non-paged pool. Please. Allocate the space with AllocatePagesForMdl or something similar. <<

Sorry if this was explained already or obvious, but why is it (ExAllocatePool(NonPaged), MmBuildMdlForNonPagedPool()) not recommended? Is it because it
   - uses the scarce non-paged blocks? But even the other methods consume the same non-paged blocks. OR
   - consumes extra map buffers? Not sure if that is the case either, as MmGetSystemAddressForMdlSafe() is a no-op in this case (and an eventual MmMapLockedPagesSpecifyCache() will have AccessMode = User).

Not sure what security the above is referring to; both ways will have the same security issues?

And MSDN, MmMapLockedPagesSpecifyCache(Mdl..): "A pointer to the MDL that is to be mapped. This MDL must describe physical pages that are locked down. A locked-down MDL can be built by the MmProbeAndLockPages or MmAllocatePagesForMdlEx routine. *** For mappings to user space, MDLs that are built by the MmBuildMdlForNonPagedPool routine can be used. ***"
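A minimal sketch of the allocate-without-mapping pattern Peter describes, using the documented WDK calls. The 20 MB size mirrors the OP's buffer; the flag choice and error handling are illustrative assumptions, not a complete driver:

```c
/* Sketch: allocate locked, zeroed physical pages described only by an MDL,
 * without consuming kernel virtual address space. Kernel-mode WDK code. */
#include <wdm.h>

#define DMA_BUFFER_SIZE (20 * 1024 * 1024)  /* 20 MB, as in the OP */

PMDL AllocateDmaPages(void)
{
    PHYSICAL_ADDRESS low, high, skip;

    low.QuadPart  = 0;
    high.QuadPart = ~0ULL;   /* any physical page is acceptable */
    skip.QuadPart = 0;

    /* Pages come back zeroed and locked; nothing is mapped into KVA yet. */
    return MmAllocatePagesForMdlEx(low, high, skip,
                                   DMA_BUFFER_SIZE,
                                   MmCached,
                                   MM_ALLOCATE_FULLY_REQUIRED);
    /* May return NULL; without the flag it may describe fewer bytes. */
}

VOID FreeDmaPages(PMDL mdl)
{
    /* The MDL itself is the handle used to free the pages... */
    MmFreePagesFromMdl(mdl);
    /* ...and the caller frees the MDL structure afterwards, per MSDN. */
    ExFreePool(mdl);
}
```

MM_ALLOCATE_FULLY_REQUIRED makes the call fail outright rather than return a partial allocation, which is usually what a fixed-size DMA buffer wants.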
  Message 29 of 42  
05 Jan 17 16:33
msr
xxxxxx@yahoo.com
Join Date: 03 Feb 2006
Posts To This List: 301
Kernel DMA buffer copy to user buffer too slow

O.k., so MmAllocatePagesForMdlEx() will not by default consume KVA, and is sufficient for the case when only DMA is needed.

I need to check, though, what the values of the below MDL fields will be, particularly the second one:

   PVOID MappedSystemVa;
   PVOID StartVa;

Isn't a KVA required by ntoskrnl.exe (if not by driver.sys) in some form or other when it is time to free the memory?

Of course KVA (and PTEs) is wasted if a faulty driver.sys repeatedly calls MmMapLockedPagesSpecifyCache(AccessMode = KernelMode), where we end up with multiple KVAs pointing to the same PTE/PFN. Not sure when/why any driver would do that.

But either method looks like it will have the same (and only) info-disclosure issue (ignoring the unnecessary KVA taken by ExAllocatePool() if not really needed by the driver) unless the below are taken care of explicitly:
   - maps to user without MdlMappingNoWrite: the user can do exactly the same damage in both cases.
   - the user of ExAllocatePool/MmBuildMdlForNonPagedPool() doesn't zero-init the memory (and the additional explicit zero-init cost is exactly the same/negligible as it was for the zero-filled blocks returned by MmAllocatePagesForMdlEx() - surely somebody/somewhere before did zero-init those pages).
  Message 30 of 42  
05 Jan 17 17:12
anton bassov
xxxxxx@hotmail.com
Join Date: 16 Jul 2006
Posts To This List: 4384
Kernel DMA buffer copy to user buffer too slow

> Wouldn't main memory bandwidth be the limiting factor?
> If so, multiple CPUs would not help that much.

It depends on your definition of what "main memory" (as well as FSB) is...

Although it is quite easy to define it on a "classical" Intel-based (i.e. UMA) system with FSB and Northbridge, things are not necessarily that easy on the AMD-based (as well as "newer" higher-end Intel) NUMA ones, with every CPU core potentially having its own memory controller, as well as different bus agents relying upon point-to-point links between one another. On such a system, the operation that MM mentioned may be more efficient if performed by CPU core X rather than Y...

Anton Bassov
  Message 31 of 42  
05 Jan 17 17:33
Bob Ammerman
xxxxxx@ramsystems.biz
Join Date: 05 Jun 2016
Posts To This List: 50
Kernel DMA buffer copy to user buffer too slow

The OP is doing a transfer from a contiguous, physically addressed kernel buffer to a likely non-contiguous user buffer.

In a NUMA world, it is a fair bet that the contiguous physical buffer is all on the same node. Thus, only CPUs on that node would be able to do a good job with it, leaving us with a situation no better than the UMA case. It can get even worse if the user-mode buffer isn't all on the same node as the kernel-mode buffer.

- Bob

Bob Ammerman
xxxxx@ramsystems.biz
716.864.8337
138 Liston St
Buffalo, NY 14223
www.ramsystems.biz
  Message 32 of 42  
05 Jan 17 18:08
Peter Viscarola (OSR)
xxxxxx@osr.com
Join Date:
Posts To This List: 5949
List Moderator
Kernel DMA buffer copy to user buffer too slow

<quote>
Isn't a KVA required by ntoskrnl.exe (if not a driver.sys) in some form or other when it is time to free the memory?
</quote>

Nope. Well, except for the MDL. The MDL describes the allocated physical pages.

<quote>
But either method looks like will have the same (and only) info disclosure issue
</quote>

While it's true that there's a potential information-disclosure vulnerability in all cases if the code is not written correctly, the fact that non-paged pool is THE scratch storage region used by drivers makes it a more likely area for storage of system-wide sensitive information. What are the chances of finding something sensitive in the non-paged pool versus in the (much larger and holding everything) random pages of memory? That's the main point I'm trying to make.

Peter
OSR
@OSRDrivers
  Message 33 of 42  
05 Jan 17 18:09
ntdev member 167022
xxxxxx@gmail.com
Join Date:
Posts To This List: 79
Kernel DMA buffer copy to user buffer too slow

>It really depends on what his application needs to do with the data and the parallelism needs to be driven from UM

In the general case, a driver is not designed to service a single app. For example, if the device is a PCIe SCSI or SATA controller, and if the system partition is located on a drive that is connected to the controller, then virtually all running applications on the system will directly or indirectly send data to, or receive data from, the driver.

>Secondly, the security problem caused by mapping physical memory to user space is important for me. I will not consider to implement it into offical version, and just for test purpose for performance comparing.

Why should your driver do this? When an app requests data from the device, a buffer is provided by the app and the driver fills this buffer with the data transferred by the device. When an app sends data to a device, again a buffer is provided by the app and the driver transfers the data from the buffer to the device.

You can use direct I/O with DMA. When a device object is configured to do direct I/O, the I/O manager prepares an MDL that represents the user buffer. This MDL can be used for DMA, as explained in the following page:

https://msdn.microsoft.com/en-us/library/windows/hardware/ff565374(v=vs.85).aspx

Note that this MDL can also be used with MmMapLockedPagesSpecifyCache to obtain a system mapping of the user buffer. You would then be able to use the buffer in an arbitrary context (ISR or DPC routine).
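A minimal sketch of the direct-I/O approach described above, in WDF terms (the callback context and error handling are illustrative assumptions):

```c
/* Sketch: in a WDF EvtIoRead callback with direct I/O, the framework
 * already holds an MDL describing the locked user buffer. It can be
 * handed to the DMA machinery, or given a system-space alias that is
 * valid at DISPATCH_LEVEL (DPC) as the text describes. */
NTSTATUS status;
PMDL mdl = NULL;

status = WdfRequestRetrieveOutputWdmMdl(Request, &mdl);
if (NT_SUCCESS(status)) {
    /* Optional system mapping of the same pages, usable in arbitrary
     * thread context; returns NULL on resource exhaustion. */
    PVOID sysVa = MmGetSystemAddressForMdlSafe(mdl, NormalPagePriority);
    if (sysVa == NULL) {
        status = STATUS_INSUFFICIENT_RESOURCES;
    }
}
```

MmGetSystemAddressForMdlSafe is preferred over MmMapLockedPagesSpecifyCache here because it caches the mapping in the MDL and is safe to call more than once.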
  Message 34 of 42  
05 Jan 17 20:02
Peter Wieland
xxxxxx@microsoft.com
Join Date: 16 Jul 2009
Posts To This List: 303
Kernel DMA buffer copy to user buffer too slow

When you allocate memory from pool, you use the returned virtual address to refer to the memory (for example, to free it).

When you allocate pages into an MDL, you use the MDL to refer to the memory. MmFreePagesFromMdl will free them.

By default, MmAllocatePagesForMdl (and the Ex version) will allocate you zeroed pages that you own. Nothing else in the kernel will write to them (unless you give that other thing the address of your pages), and they won't contain stale passwords or other secrets. So no disclosure up to user mode.

The big question, if you're going to preallocate large physically contiguous data buffers on behalf of your application, is whether your hardware and your app can safely share the buffers. For example, if the buffer contains physical addresses that the hardware will read from or write to, you should not share it into user mode (since that would allow user mode to read or write any physical page it wanted). As long as it's just where the device fetches or dumps its data, it's reasonable to share it into user mode. Not a best practice, but it might be the only option for your non-SG hardware.

-p
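A sketch of how such a driver-owned, MDL-described buffer might be shared into the requesting process, including the MdlMappingNoWrite flag discussed earlier in the thread. The variable names are illustrative; the call must run in the context of the target process and can raise an exception, hence the __try/__except:

```c
/* Sketch: map pages already described by Mdl (e.g. from
 * MmAllocatePagesForMdlEx) into the current process, read-only. */
PVOID userVa = NULL;

__try {
    userVa = MmMapLockedPagesSpecifyCache(
                 Mdl,
                 UserMode,
                 MmCached,
                 NULL,                 /* let MM choose the user VA   */
                 FALSE,                /* don't bugcheck on failure   */
                 NormalPagePriority | MdlMappingNoWrite);
} __except (EXCEPTION_EXECUTE_HANDLER) {
    userVa = NULL;  /* quota or user address-space exhaustion */
}

/* Later, still in the same process context:
 *     MmUnmapLockedPages(userVa, Mdl);
 */
```

MdlMappingNoWrite (Windows 8 and later) gives user mode a read-only view, which closes off the "user scribbles on the shared buffer" half of the damage msr describes.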
  Message 35 of 42  
05 Jan 17 21:28
M M
xxxxxx@hotmail.com
Join Date: 21 Oct 2010
Posts To This List: 744
Kernel DMA buffer copy to user buffer too slow

You are missing the point: the OP has a bad HW design he cannot change. If he has a general-purpose device, he is totally sunk anyway. If he has any chance of doing anything, he must have control over the UM design. If he does, then even all of these problems may yet be mitigated. If he has not even this, then he is sunk no matter what we might suggest.
  Message 36 of 42  
05 Jan 17 22:56
anton bassov
xxxxxx@hotmail.com
Join Date: 16 Jul 2006
Posts To This List: 4384
Kernel DMA buffer copy to user buffer too slow

> Thus, only CPUs on that node would be able to do a good job with it, leaving us
> with a situation no better than the UMA case.

Actually, as long as you are able to enforce a strict task-to-CPU-core/memory-controller relationship, using a pre-defined CPU core for this task may be much more efficient compared to a UMA system, due to the proximity of the CPU core and its corresponding memory controller. OTOH, this is not exactly what MM was speaking about...

Anton Bassov
  Message 37 of 42  
06 Jan 17 08:57
Peter Viscarola (OSR)
xxxxxx@osr.com
Join Date:
Posts To This List: 5949
List Moderator
Kernel DMA buffer copy to user buffer too slow

Most of this entire thread -- my own posts included -- is off into the weeds with respect to the OP's issue.

I guess, going back to first principles, I would ask: "How much faster than 20MB in 16ms does this have to be?" As Mr. Roberts observed eons ago, that's already pretty fast.

Peter
OSR
@OSRDrivers
  Message 38 of 42  
06 Jan 17 11:20
Scott Noone
xxxxxx@osr.com
Join Date:
Posts To This List: 1334
List Moderator
Kernel DMA buffer copy to user buffer too slow

Pool also allows you to allocate less than a page at a time. Say, for example, you allocate a 1K non-paged pool buffer and map it to user mode. The user now has access to the other 3K of privileged memory on that page as well, because the mapping covers the entire page, not just the logically valid portion. This is solvable, of course (just always allocate in page-size chunks), but it's another unnecessary thing to worry about when mapping pool memory to user mode.

-scott
OSR
@OSRDrivers

"Peter Wieland" wrote in message news:221876@ntdev...
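Scott's page-granularity point can be sketched in two lines; if pool memory really must be mapped to user mode, round the allocation up to whole pages and zero it so the mapping exposes nothing beyond the allocation (the tag and pool type here are illustrative assumptions):

```c
/* Sketch: avoid exposing neighboring pool allocations when a pool
 * buffer will be mapped to user mode. (Allocating pages directly
 * into an MDL sidesteps the issue entirely.) */
SIZE_T requested = 1024;                       /* caller wanted 1 KB  */
SIZE_T mapped    = ROUND_TO_PAGES(requested);  /* 4 KB on x86/x64     */

PVOID buffer = ExAllocatePoolWithTag(NonPagedPoolNx, mapped, 'pamU');
if (buffer != NULL) {
    RtlZeroMemory(buffer, mapped);  /* no stale pool contents disclosed */
}
```

ROUND_TO_PAGES is the wdm.h macro for exactly this rounding, so the mapped region contains only this allocation.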
  Message 39 of 42  
06 Jan 17 14:20
M M
xxxxxx@hotmail.com
Join Date: 21 Oct 2010
Posts To This List: 744
Kernel DMA buffer copy to user buffer too slow

I think we can safely assume two things here about the OP's problem:

1. There is no answer to the question "how fast does it have to be?" other than "as fast as possible"; and
2. The problem has nothing to do with how fast a CPU can copy a 20 MB block of memory, but rather with how to achieve application throughput.

OP: you may not know anything about the threading and I/O model used by the UM application. If you do, please tell us. If not, you will need to find out before you can make any improvements to your driver. The most important thing is what this application will do with the data. Does it process a series of blocks of independent data where loss/reordering is irrelevant (like a DNS server)? Does it process a stream of coordinated data where loss/reordering is very important (like a database transaction log)? Does it save this data to disk? Or process it in memory and generate some analytics?

I am making the assumption that you don't have a general-purpose device here and have a particular UM application in mind. If that is wrong, then please let us know that too.
  Message 40 of 42  
11 Jan 17 06:43
Abei Liu
xxxxxx@hotmail.com
Join Date: 18 Jan 2011
Posts To This List: 40
Kernel DMA buffer copy to user buffer too slow

Hi, All:

Thank you very much for all the kind responses.

@Peter Viscarola: Yes, the problem in this thread is basically how I can get data out of my driver faster. There are two ways to look at it. One is to improve the memory copy speed from the driver to user space; I passed on that because I had no clue how to improve it. The other is to eliminate the copy operation altogether, which improves the data transport speed. So I wanted to learn how to map kernel pages into user space, the way the Linux driver samples do.

I have implemented the second way in my driver: it can allocate some pages in the driver and map them into user space, and I can use these pages to share data between the driver and the user application. But a new problem is the number of pages. When I try to allocate 20M, Windows BSODs. When I try to allocate 1M, the driver always returns failure. Even 512K, 256K, and 128K allocations all fail. I can always allocate 1 page successfully (I have not tried other small sizes), but 4K is not enough for my DMA transfer.
  Message 41 of 42  
11 Jan 17 06:59
Abei Liu
xxxxxx@hotmail.com
Join Date: 18 Jan 2011
Posts To This List: 40
Kernel DMA buffer copy to user buffer too slow

>> You can use direct I/O with DMA. When a device object is configured to do direct I/O, the I/O manager prepares a MDL that represents the user buffer. This MDL can be used for DMA as explained in the following page. https://msdn.microsoft.com/en-us/library/windows/hardware/ff565374(v=vs.85).aspx Note that this MDL could be used with MmMapLockedPagesSpecifyCache to obtain a system mapping of the user buffer. You would then be able to use the buffer in an arbitrary context (Isr or Dpc routine).

I considered this case before. But with my device hardware, I need to get the physical address to trigger a DMA transfer. Yes, I can get the MDL from the user buffer, but I don't know how to trigger the DMA. The MapTransfer API doesn't provide any place to let me start the DMA.

And there is another question: how can I design my hardware to meet the MapTransfer or bus-master requirements? What principles can I follow?
  Message 42 of 42  
11 Jan 17 11:51
Peter Wieland
xxxxxx@microsoft.com
Join Date: 16 Jul 2009
Posts To This List: 303
Kernel DMA buffer copy to user buffer too slow

If your device is a bus master, then your driver tells your device to trigger the DMA transfer. There will generally be a DMA start bit that you set after you've programmed the parameters (addresses and length(s)) for the transfer.

So you call Windows's DMA APIs to get your physical addresses set up, then you tell your device to do the transfer.

-p
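The program-then-kick flow Peter describes might look like the sketch below. The register offsets and bit meanings here are entirely hypothetical -- every bus-master device defines its own BAR layout -- but the shape (write address and length, then set the start bit last) is the standard one:

```c
/* Sketch: program a hypothetical bus-master device's DMA registers.
 * bar0 is a mapped device register space; deviceLogical is the device
 * logical address obtained from the Windows DMA APIs (e.g. MapTransfer). */
#define REG_DMA_ADDR_LO   0x00   /* hypothetical register offsets */
#define REG_DMA_ADDR_HI   0x04
#define REG_DMA_LENGTH    0x08
#define REG_DMA_CONTROL   0x0C
#define DMA_CTRL_START    0x01   /* hypothetical start bit */

VOID StartDmaTransfer(volatile ULONG *bar0,
                      PHYSICAL_ADDRESS deviceLogical,
                      ULONG length)
{
    WRITE_REGISTER_ULONG(&bar0[REG_DMA_ADDR_LO / 4],
                         deviceLogical.LowPart);
    WRITE_REGISTER_ULONG(&bar0[REG_DMA_ADDR_HI / 4],
                         (ULONG)deviceLogical.HighPart);
    WRITE_REGISTER_ULONG(&bar0[REG_DMA_LENGTH / 4], length);

    /* The last write kicks off the transfer; completion is then
     * signaled by the device, typically via an interrupt. */
    WRITE_REGISTER_ULONG(&bar0[REG_DMA_CONTROL / 4], DMA_CTRL_START);
}
```

Writing the control register last matters: WRITE_REGISTER_ULONG enforces ordering, so the device never sees the start bit before the address and length are in place.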



Copyright ©2015, OSR Open Systems Resources, Inc.