  Message 1 of 26  
13 Jul 18 02:12
Meenakshikumar Somasundaram
xxxxxx@gmail.com
Join Date: 24 Jan 2011
Posts To This List: 19
DMA Memory - MmAllocateContiguousNodeMemory

Hi,

I need to allocate shared contiguous memory (for DMA) on a particular NUMA node, to transfer data directly between a user-space application and a PCIe device. I have implemented the following in the kernel driver for buffer allocation; it is called in the user application's context.

1) MmAllocateContiguousNodeMemory with the "PAGE_READWRITE | PAGE_NOCACHE" flags.
2) Allocated an MDL using IoAllocateMdl with the address returned from step 1, and built the MDL using MmBuildMdlForNonPagedPool.
3) MmMapLockedPagesSpecifyCache with the UserMode and MmNonCached flags, to get the user-space virtual address of the allocated memory.
4) MmGetPhysicalAddress to get the starting physical address of the allocated memory.

With the above, I am able to transfer data between the application and the device. The issue is that when the application copies data to this DMA memory, it takes a lot of time. For example, copying 1024 bytes from a local malloc() buffer to this DMA memory takes 34us. Behaviour was much the same on different test systems.

I then reviewed my implementation and changed the flag "PAGE_NOCACHE" to "PAGE_WRITECOMBINE" in step 1. Now, copying 1024 bytes takes just 0.03us.

https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/content/wdm/ne-wdm-_memory_caching_type says:

MmNonCached - The requested memory should not be cached by the processor.
MmWriteCombined - The requested memory should not be cached by the processor, "but writes to the memory can be combined by the processor".

Is "PAGE_WRITECOMBINE" the right way to allocate DMA memory (app to device)? Also, with MmMapLockedPagesSpecifyCache I am setting MmNonCached instead of MmWriteCombined - is this OK? Basically, I don't want any data that my user application copies to this DMA memory to be cached; in that case my device would see incorrect data.

Regards,
MK
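A minimal sketch of the four steps above, with the error handling and cleanup the description omits. BUFFER_SIZE and the return paths are illustrative assumptions; this sketches the sequence as described, not a recommended design (see the replies below):

PHYSICAL_ADDRESS low = { 0 }, high = { 0 }, boundary = { 0 };
PVOID systemVa = NULL;
PVOID userVa = NULL;
PMDL mdl = NULL;
PHYSICAL_ADDRESS physAddr;

high.QuadPart = 0xFFFFFFFFFFFFFFFFULL;

/* Step 1: contiguous, non-cached allocation on NUMA node 0. */
systemVa = MmAllocateContiguousNodeMemory(BUFFER_SIZE, low, high, boundary,
                                          PAGE_READWRITE | PAGE_NOCACHE, 0);
if (systemVa == NULL) {
    return STATUS_INSUFFICIENT_RESOURCES;
}

/* Step 2: describe the buffer with an MDL. */
mdl = IoAllocateMdl(systemVa, BUFFER_SIZE, FALSE, FALSE, NULL);
if (mdl == NULL) {
    MmFreeContiguousMemory(systemVa);
    return STATUS_INSUFFICIENT_RESOURCES;
}
MmBuildMdlForNonPagedPool(mdl);

/* Step 3: map into the requesting process. This must run in that
   process's context, and the call can raise an exception on failure. */
__try {
    userVa = MmMapLockedPagesSpecifyCache(mdl, UserMode, MmNonCached,
                                          NULL, FALSE, HighPagePriority);
} __except (EXCEPTION_EXECUTE_HANDLER) {
    IoFreeMdl(mdl);
    MmFreeContiguousMemory(systemVa);
    return GetExceptionCode();
}

/* Step 4: starting physical address (but see the replies below on why
   handing this to the device bypasses the Windows DMA abstraction). */
physAddr = MmGetPhysicalAddress(systemVa);

/* Teardown, in reverse order:
   MmUnmapLockedPages(userVa, mdl);
   IoFreeMdl(mdl);
   MmFreeContiguousMemory(systemVa); */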
  Message 2 of 26  
13 Jul 18 03:06
Tim Roberts
xxxxxx@probo.com
Join Date: 28 Jan 2005
Posts To This List: 12008
DMA Memory - MmAllocateContiguousNodeMemory

On Jul 12, 2018, at 11:11 PM, xxxxx@gmail.com <xxxxx@lists.osr.com> wrote:
> The issue is that when the application copies data to this DMA memory, it takes a lot of time. For example, copying 1024 bytes from a local malloc() buffer to this DMA memory takes 34us.
> ...
> Is "PAGE_WRITECOMBINE" the right way to allocate DMA memory (app to device)?
> Also, with MmMapLockedPagesSpecifyCache I am setting MmNonCached instead of MmWriteCombined - is this OK?
> Basically, I don't want any data that my user application copies to this DMA memory to be cached; in that case my device would see incorrect data.

No, it won't. On non-exotic Intel-based architectures, DMA transfers are cache-coherent. However, you should look into KeFlushIoBuffers and FlushAdapterBuffers so that your driver works on other architectures.

Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
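For reference, a hedged sketch of where those two calls sit around a bus-master transfer. Here 'mdl', 'dmaAdapter', and 'mapRegisterBase' are assumed to come from the driver's existing DMA setup (IoGetDmaAdapter / AllocateAdapterChannel):

/* Before starting the transfer: flushes processor caches where the
   architecture needs it (a no-op macro on x86/x64, as discussed below).
   ReadOperation == FALSE means the device reads from memory. */
KeFlushIoBuffers(mdl, FALSE, TRUE);

/* ...program the device, let the bus-master transfer complete... */

/* After the transfer: flush any data still held in adapter/map-register
   caches. WriteToDevice == TRUE for a memory-to-device transfer. */
dmaAdapter->DmaOperations->FlushAdapterBuffers(dmaAdapter,
                                               mdl,
                                               mapRegisterBase,
                                               MmGetMdlVirtualAddress(mdl),
                                               MmGetMdlByteCount(mdl),
                                               TRUE);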
  Message 3 of 26  
13 Jul 18 03:25
Meenakshikumar Somasundaram
xxxxxx@gmail.com
Join Date: 24 Jan 2011
Posts To This List: 19
DMA Memory - MmAllocateContiguousNodeMemory

In my case, the kernel driver does just the memory allocation, handing the buffer to the user application. Data is transferred directly between the application and the device. Also, the PCIe device is the bus master here; no system DMA is used.

Doesn't the MmNonCached setting in the MmMapLockedPagesSpecifyCache call disable caching for this reserved memory area?
  Message 4 of 26  
13 Jul 18 03:44
Dzmitry Altukhou
xxxxxx@gmail.com
Join Date: 24 Jun 2016
Posts To This List: 16
DMA Memory - MmAllocateContiguousNodeMemory

> Doesn't the MmNonCached setting in the MmMapLockedPagesSpecifyCache call disable caching for this reserved memory area?

No, it doesn't. I suspect you are digging in the wrong direction: instead of fighting with the system API, it would be better to check how you configure your HW DMA controller.

With respect to the caching type, in your case (bus mastering, HW DMA) MmNonCached is the correct setting. MmWriteCombined is more relevant to the streaming (non-DMA) approach to data transmission.

Regards,
Dzmitry Altukhou
  Message 5 of 26  
13 Jul 18 05:04
Meenakshikumar Somasundaram
xxxxxx@gmail.com
Join Date: 24 Jan 2011
Posts To This List: 19
DMA Memory - MmAllocateContiguousNodeMemory

So, with MmNonCached in both MmAllocateContiguousNodeMemory and MmMapLockedPagesSpecifyCache, is the huge latency I am seeing (34us for 1024 bytes) expected?
  Message 6 of 26  
13 Jul 18 05:25
maik peterson
xxxxxx@googlemail.com
Join Date: 17 Nov 2016
Posts To This List: 7
DMA Memory - MmAllocateContiguousNodeMemory

How can you program a DMA without any clue about the architecture/busses? Please read about PCI/QPI/UPI first and see what these flags are all about, and why.

mp
  Message 7 of 26  
13 Jul 18 06:58
Peter Viscarola
xxxxxx@osr.com
Join Date:
Posts To This List: 6243
List Moderator
DMA Memory - MmAllocateContiguousNodeMemory

Because Windows provides an architecture that specifically provides this feature, and a set of standardized interfaces. Read:

<http://online.osr.com/article.cfm?article=539>

There is, or was, a long Microsoft white paper on this topic (that I wrote) that describes all the function calls used in both WDM and WDF as well.

Peter
OSR
@OSRDrivers
  Message 8 of 26  
13 Jul 18 09:18
maik peterson
xxxxxx@googlemail.com
Join Date: 17 Nov 2016
Posts To This List: 7
DMA Memory - MmAllocateContiguousNodeMemory

Yes, software and abstraction are always nice to have, but you should have at least a minimal clue about what you are doing and what the hardware is all about.
  Message 9 of 26  
13 Jul 18 09:33
Meenakshikumar Somasundaram
xxxxxx@gmail.com
Join Date: 24 Jan 2011
Posts To This List: 19
DMA Memory - MmAllocateContiguousNodeMemory

Thanks for the information and the DMA article. I had a look at it and am summarizing my findings. Please correct me if I am wrong.

1) If the device supports scatter/gather DMA and bus mastering, then Device Bus Logical Address = Physical Address. In this case, using the physical address of the (physically contiguous) buffer, obtained from MmGetPhysicalAddress(), for DMA operations should work.

2) x64-based systems are cache-coherent, hence there is no need to flush for DMA (KeFlushIoBuffers is an empty macro). My environment is x64.

If the above two hold good, should I be worried about transferring data between application and device using the approach I followed, with PAGE_WRITECOMBINE?
  Message 10 of 26  
13 Jul 18 10:25
Peter Viscarola
xxxxxx@osr.com
Join Date:
Posts To This List: 6243
List Moderator
DMA Memory - MmAllocateContiguousNodeMemory

<quote> but should have at least a minimal clue about what you are doing and what the hardware is all about </quote>

Yes. CLEARLY this is required.

<quote> If the device supports scatter/gather DMA and bus mastering, then Device Bus Logical Address = Physical Address </quote>

Yes... IF the bus master is natively capable of DMA operations that reach all of memory. So, for a conventional PC architecture, the bus master is natively capable of 64-bit addressing.

But, of course, this does not relieve you of the architectural requirement to utilize the Windows APIs for your DMA operations... which is still required. You may *not* ever make code assumptions that Physical Address == Device Bus Logical Address. You may *never* call MmGetPhysicalAddress and stuff that address into your DMA adapter, as doing so violates Windows OS architecture rules (and the contract that your driver has with the kernel and the HAL).

<quote> x64-based systems are cache-coherent, hence there is no need to flush for DMA </quote>

Yes... but this does not relieve you of the architectural requirement to call KeFlushIoBuffers... which is still required. The fact that this does or does not generate an actual flush is a matter of implementation, not of architecture.

<quote> should I be worried about transferring data between application and device using the approach I followed with PAGE_WRITECOMBINE </quote>

I wouldn't use PAGE_WRITECOMBINE. The only place I've ever heard of using WRITECOMBINE is on a video frame buffer. So I, personally, wouldn't be comfortable using WRITECOMBINE for another sort of DMA mapping... and CERTAINLY not one that was shared with user mode. But, that's just me... Perhaps you're smarter than I am and understand all the ramifications of using WRITECOMBINE.

Peter
OSR
@OSRDrivers
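For the record, a minimal KMDF sketch of that architecturally correct path: a common buffer whose *logical* address comes from the OS rather than from MmGetPhysicalAddress. 'device' and BUFFER_SIZE are assumptions standing in for the driver's own WDFDEVICE and buffer length:

WDF_DMA_ENABLER_CONFIG dmaConfig;
WDFDMAENABLER dmaEnabler;
WDFCOMMONBUFFER commonBuffer;
NTSTATUS status;

/* Describe the device's DMA capabilities (64-bit scatter/gather here). */
WDF_DMA_ENABLER_CONFIG_INIT(&dmaConfig, WdfDmaProfileScatterGather64,
                            BUFFER_SIZE);
status = WdfDmaEnablerCreate(device, &dmaConfig,
                             WDF_NO_OBJECT_ATTRIBUTES, &dmaEnabler);
if (!NT_SUCCESS(status)) {
    return status;
}

/* The framework allocates the buffer and maps it through the HAL. */
status = WdfCommonBufferCreate(dmaEnabler, BUFFER_SIZE,
                               WDF_NO_OBJECT_ATTRIBUTES, &commonBuffer);
if (!NT_SUCCESS(status)) {
    return status;
}

/* The driver uses the virtual address; the device gets the logical one. */
PVOID va = WdfCommonBufferGetAlignedVirtualAddress(commonBuffer);
PHYSICAL_ADDRESS deviceLa = WdfCommonBufferGetAlignedLogicalAddress(commonBuffer);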
  Message 11 of 26  
13 Jul 18 12:36
anton bassov
xxxxxx@hotmail.com
Join Date: 16 Jul 2006
Posts To This List: 4513
DMA Memory - MmAllocateContiguousNodeMemory

<quote> If the device supports scatter/gather DMA and bus mastering, then Device Bus Logical Address = Physical Address </quote>

...which does not necessarily hold true for a system with an IOMMU.

In general, you once again have to stop making any assumptions about the target architecture - these days NT once again supports more than just x86 and x86_64, and even the above-mentioned "conventional" architectures may still have "not-so-standard" features like an IOMMU....

Anton Bassov
  Message 12 of 26  
13 Jul 18 13:57
Peter Viscarola
xxxxxx@osr.com
Join Date:
Posts To This List: 6243
List Moderator
DMA Memory - MmAllocateContiguousNodeMemory

> ...which does not necessarily hold true for a system with an IOMMU

Outstanding point. And absolutely correct.

(Holy SHIT! Mr. Bassov just made a post that "added value" to a thread! Mr. Bassov... you may take the weekend off, in celebration of this great event!)

Peter
OSR
@OSRDrivers
  Message 13 of 26  
13 Jul 18 14:52
Tim Roberts
xxxxxx@probo.com
Join Date: 28 Jan 2005
Posts To This List: 12008
DMA Memory - MmAllocateContiguousNodeMemory

xxxxx@gmail.com wrote:
> So, with MmNonCached in both MmAllocateContiguousNodeMemory and MmMapLockedPagesSpecifyCache, is the huge latency I am seeing (34us for 1024 bytes) expected?

Yes, absolutely. Without caching, every dword you write to that region blocks until the write gets out to RAM. With caching, your writes go to the cache, which takes one cycle, and then your app can move on. The cache can take its time flushing to RAM later, in big block writes, instead of blocking one dword at a time.

It's easy to underestimate the enormous effect caching has on system performance. About 10 years ago, I did some work for a patent attorney who wanted to know if it was practical to run Windows without caching. When I turned off caching on all of RAM, it took 9 minutes for Vista to boot.

--
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
  Message 14 of 26  
13 Jul 18 16:09
anton bassov
xxxxxx@hotmail.com
Join Date: 16 Jul 2006
Posts To This List: 4513
DMA Memory - MmAllocateContiguousNodeMemory

<quote> (Holy SHIT! Mr. Bassov just made a post that "added value" to a thread! Mr. Bassov... you may take the weekend off, in celebration of this great event!) </quote>

Are you still desperate to port my account to the new (and, scary, at least for me) platform in Troll Mode???

Anton Bassov
  Message 15 of 26  
13 Jul 18 21:49
Peter Viscarola
xxxxxx@osr.com
Join Date:
Posts To This List: 6243
List Moderator
DMA Memory - MmAllocateContiguousNodeMemory

> Are you still desperate to port my account...

One post is not quite enough to earn you redemption, Mr. Bassov.

Peter
OSR
@OSRDrivers
  Message 16 of 26  
13 Jul 18 22:10
anton bassov
xxxxxx@hotmail.com
Join Date: 16 Jul 2006
Posts To This List: 4513
DMA Memory - MmAllocateContiguousNodeMemory

> One post is not quite enough to earn you redemption, Mr. Bassov.

Oh, come on - I have mainly behaved recently, with only some VERY infrequent trolling attempts (like the one on the C++ thread).....

Anton Bassov
  Message 17 of 26  
14 Jul 18 18:15
Jan Bottorff
xxxxxx@pmatrix.com
Join Date: 16 Apr 2013
Posts To This List: 440
DMA Memory - MmAllocateContiguousNodeMemory

> In general, you once again have to stop making any assumptions about the target architecture - these days NT once again supports more than just x86 and x86_64

+1

I'm working on Windows ARM64 servers now (Cavium/Marvell ThunderX2). It's taking a while to discover and unlearn the sometimes subtle x64 assumptions I unconsciously make. The typical server I work on now has 2 sockets, 64 cores, and 256 logical processors (4-way hyperthreading).

Jan
  Message 18 of 26  
14 Jul 18 20:43
anton bassov
xxxxxx@hotmail.com
Join Date: 16 Jul 2006
Posts To This List: 4513
DMA Memory - MmAllocateContiguousNodeMemory

> It's taking a while to discover and unlearn the sometimes subtle x64 assumptions
> I unconsciously make.

The very first example that comes to mind is an atomic operation like incrementing, adding, exchanging, or any other atomic operation whose functionality goes beyond bit-test-and-set.....

On x86, performing such an operation may be used as an optimisation in some cases, because it is implemented in hardware. However, some other architectures may not provide hardware support for anything atomic apart from a simple bit-test-and-set. What do you think using a function like, say, atomic_increment() may be like on such an arch? Although such an atomic function may be available on an architecture that does not support atomic addition in hardware, it is going to be implemented in the arch-specific OS layer as a simple addition, guarded by a spinlock that is built around the atomic bit-test-and-set. Although its semantics are going to be exactly the same as on x86, using it may, in actuality, result in a performance penalty rather than the optimisation it offers on x86.....

Anton Bassov
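A hedged illustration, in portable C11 (the arch-specific OS layer in question varies), of the fallback described above: an "atomic" increment emulated with a spinlock built from the one hardware primitive assumed available, test-and-set:

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter;

long atomic_increment(void)
{
    long value;

    /* Spin on test-and-set until we own the lock. */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;

    value = ++counter;    /* the "atomic" operation, now just a plain add */

    atomic_flag_clear_explicit(&lock, memory_order_release);
    return value;
}

Same semantics as a native lock-add instruction, but every increment now pays for a lock acquire/release - the performance penalty described above.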
  Message 19 of 26  
15 Jul 18 18:56
Pavel A
xxxxxx@fastmail.fm
Join Date: 21 Jul 2008
Posts To This List: 2423
DMA Memory - MmAllocateContiguousNodeMemory

Jan Bottorff wrote:
> I'm working on Windows ARM64 servers now (Cavium/Marvell ThunderX2)

Great. Could you please check: do these creatures have PAGE_WRITECOMBINE at all, or is it specific to the Intel arch?

Regards,
-- pa
  Message 20 of 26  
16 Jul 18 03:53
Dzmitry Altukhou
xxxxxx@gmail.com
Join Date: 24 Jun 2016
Posts To This List: 16
DMA Memory - MmAllocateContiguousNodeMemory

It's intended for Intel's architecture streaming (SIMD) load/store instructions. To be precise, a store instruction requires support for SSE2, and a load, consequently, support for the SSE4.1 extension.

In the beginning, the main goal of the write-combine cache type for MMIO was speeding up the execution of store instructions. With the introduction of SSE4.1 ("using a non-temporal memory hint"), it became possible for load instructions as well.

I'm using those instructions for the so-called non-DMA approach to data transmission. Certainly it is slower than normal HW DMA, but it is very useful when your HW has no DMA controller. Nowadays this technique can transfer up to 512 bits of data per instruction (128/256/512), given CPU support for the AVX and AVX-512 extensions respectively.

Regards,
Dzmitry
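A hedged sketch of that streaming copy, assuming 16-byte-aligned buffers and a length that is a multiple of 16 (function name and parameters are illustrative):

#include <stddef.h>
#include <emmintrin.h>   /* SSE2: _mm_stream_si128 */
#include <smmintrin.h>   /* SSE4.1: _mm_stream_load_si128 */

void stream_copy(void *dst, void *src, size_t bytes)
{
    __m128i *d = (__m128i *)dst;
    __m128i *s = (__m128i *)src;
    size_t i;

    for (i = 0; i < bytes / 16; i++) {
        __m128i v = _mm_stream_load_si128(&s[i]);  /* non-temporal load   */
        _mm_stream_si128(&d[i], v);                /* write-combined store */
    }

    _mm_sfence();   /* drain the WC buffers before anyone else reads dst */
}

Note that _mm_stream_load_si128 only behaves non-temporally on WC-mapped memory; on ordinary write-back memory it acts as a regular load.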
  Message 21 of 26  
16 Jul 18 11:39
Tim Roberts
xxxxxx@probo.com
Join Date: 28 Jan 2005
Posts To This List: 12008
DMA Memory - MmAllocateContiguousNodeMemory

xxxxx@gmail.com wrote:
> It's intended for Intel's architecture streaming (SIMD) load/store
> instructions. To be precise, a store instruction requires support
> for SSE2, and a load, consequently, support for the SSE4.1 extension.

Write combining started way before that. It was originally designed as a way to speed up graphics operations, by allowing bitmap writes to the frame buffer to exploit bus bursting, but without turning on full caching. Without WC, each write becomes a complete bus cycle.

--
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
  Message 22 of 26  
16 Jul 18 17:35
M M
xxxxxx@hotmail.com
Join Date: 21 Oct 2010
Posts To This List: 770
DMA Memory - MmAllocateContiguousNodeMemory

I think the concept of write combining is a few decades older than video hardware.

Sent from Mail for Windows 10
  Message 23 of 26  
17 Jul 18 01:22
Jan Bottorff
xxxxxx@pmatrix.com
Join Date: 16 Apr 2013
Posts To This List: 440
DMA Memory - MmAllocateContiguousNodeMemory

The hardware does have something similar to write combining; it's called gathering: https://developer.arm.com/products/architecture/a-profile/docs/100941/latest/memory-types. I could not say for sure whether the proper bits get set to enable this if you use the PAGE_WRITECOMBINE flag on MmMapIoSpaceEx. If not, it seems like a bug. Write combining, I know, makes a significant difference in performance on a simple video frame buffer.

Jan
  Message 24 of 26  
17 Jul 18 02:14
Meenakshikumar Somasundaram
xxxxxx@gmail.com
Join Date: 24 Jan 2011
Posts To This List: 19
DMA Memory - MmAllocateContiguousNodeMemory

I have come across the Write Combining Usage Guidelines, which are quite old, at http://download.intel.com/design/PentiumII/applnots/24442201.pdf. The document states the following:

"WC is a weakly ordered memory type. System memory locations are not cached and coherency is not enforced by the processor's bus coherency protocol. Speculative reads are allowed. Writes may be delayed and combined in the write combining buffer to reduce memory accesses."

Hence, I can't use PAGE_WRITECOMBINE for system memory used for DMA. Better, I would use PAGE_NOCACHE.

I had an interesting observation, as below. If I do a memcpy() of 1024 bytes to the DMA memory, it takes 33us. But if I do a memcpy() of 16 bytes in a loop 64 times for a 1024-byte copy, it takes 1.9us. For cached and write-combined, it takes 0.04us and 0.03us respectively.

So, I am planning to go with the 16-byte chunk copy for now, with the PAGE_NOCACHE flag.
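A sketch of the two copy variants being compared, for clarity (buffer pointers and the CHUNK size are illustrative):

#include <string.h>

#define CHUNK 16

/* Variant 1, one call:  memcpy(dmaBuf, src, 1024);  -> 33us measured. */

/* Variant 2, chunked:   -> 1.9us measured. */
static void chunked_copy(unsigned char *dmaBuf,
                         const unsigned char *src, size_t bytes)
{
    size_t off;
    for (off = 0; off < bytes; off += CHUNK) {
        memcpy(dmaBuf + off, src + off, CHUNK);  /* 64 iterations for 1 KB */
    }
}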
  Message 25 of 26  
17 Jul 18 12:28
Tim Roberts
xxxxxx@probo.com
Join Date: 28 Jan 2005
Posts To This List: 12008
DMA Memory - MmAllocateContiguousNodeMemory

xxxxx@gmail.com wrote:
> I have come across the Write Combining Usage Guidelines, which are quite old, at http://download.intel.com/design/PentiumII/applnots/24442201.pdf. The document states the following:
>
> "WC is a weakly ordered memory type. System memory locations are not cached and coherency is not enforced by the processor's bus coherency protocol. Speculative reads are allowed. Writes may be delayed and combined in the write combining buffer to reduce memory accesses."
>
> Hence, I can't use PAGE_WRITECOMBINE for system memory used for DMA. Better, I would use PAGE_NOCACHE.

How on earth did you come to that conclusion? The original Pentium had three caching options for each region, from least to most performant: uncached, write-combined, fully cached. This is what the MTRR tables specify. In virtually every case of DMA on an x86 or x64 architecture, you want fully cached.

> I had an interesting observation, as below.
>
> If I do a memcpy() of 1024 bytes to the DMA memory, it takes 33us. But if I do a memcpy() of 16 bytes in a loop 64 times for a 1024-byte copy, it takes 1.9us. For cached and write-combined, it takes 0.04us and 0.03us respectively.

I'd like to see your code, because I don't believe your first two numbers.

--
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
  Message 26 of 26  
Yesterday 02:56
Meenakshikumar Somasundaram
xxxxxx@gmail.com
Join Date: 24 Jan 2011
Posts To This List: 19
DMA Memory - MmAllocateContiguousNodeMemory

The contiguous memory is allocated using the following code snippet. The target system is x64 (Asus Z170 Deluxe).

PHYSICAL_ADDRESS LowestAcceptableAddress = { 0 }, HighestAcceptableAddress = { 0 }, BoundaryAddressMultiple = { 0 };

HighestAcceptableAddress.QuadPart = 0xFFFFFFFFFFFFFFFF;
Protect = PAGE_READWRITE | PAGE_NOCACHE;
ChunkSize = 32*1024*1024;
NumaNode = 0;

SystemVA = MmAllocateContiguousNodeMemory(ChunkSize,
                                          LowestAcceptableAddress,
                                          HighestAcceptableAddress,
                                          BoundaryAddressMultiple,
                                          Protect,
                                          NumaNode);

pMdl = IoAllocateMdl(SystemVA, ChunkSize, FALSE, FALSE, NULL);
MmBuildMdlForNonPagedPool(pMdl);

UserVA = (((ULONG_PTR)PAGE_ALIGN(MmMapLockedPagesSpecifyCache(pMdl,
                                                              UserMode,
                                                              MmNonCached,
                                                              NULL,
                                                              FALSE,
                                                              HighPagePriority)))
          + MmGetMdlByteOffset(pMdl));

MappedPhyAddr = MmGetPhysicalAddress(SystemVA);

The 64-bit application pseudocode is as follows:

1) Get the UserVA from the driver through an IOCTL call.
2) Allocate temp memory of 32MB and initialize it with zeros.
3) Do the memcpy in the mentioned chunks (16-byte or 1024-byte) for the 32MB.

Profiling is done using QueryPerformanceCounter() wrapped around the memcpy().

As you mentioned, since the system is x64, I would definitely want to take advantage of the cache if the system is cache-coherent for DMA operations. I will have to test the system for this. The contiguous buffer I allocated is used by both the user-space application and the device for reads/writes simultaneously, but at any point in time only one of them accesses a given memory range in the buffer.
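A sketch of the profiling harness described by that pseudocode, assuming userVa is the pointer returned by the IOCTL (function and variable names are illustrative):

#include <windows.h>
#include <stdio.h>
#include <string.h>

void time_copy(void *userVa, const void *src, size_t bytes)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&t0);
    memcpy(userVa, src, bytes);           /* the copy under test */
    QueryPerformanceCounter(&t1);

    double us = (double)(t1.QuadPart - t0.QuadPart) * 1e6
                / (double)freq.QuadPart;
    printf("%zu bytes: %.2f us\n", bytes, us);
}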