How to map several common buffer pages to one contiguous virtual memory range?

Dear group members,

Another DMA buffer memory question…: How is it possible to ‘present’ several AllocateCommonBuffer blocks to a user-mode app as one contiguous virtual memory range?

I want to extend an existing WDM driver for a PCI frame grabber device. Currently the driver allocates one buffer using AllocateCommonBuffer and creates an MDL using IoAllocateMdl and MmBuildMdlForNonPagedPool to map it into virtual memory.

My device is capable of copying the requested data (a raw bitmap image) to several physical data blocks. To avoid fragmentation I want to pre-allocate a ‘list of AllocateCommonBuffer pages’ (all of the same size) once, for example each 512 kB. When the user-mode app requests an image of, for example, 1280x1024 pixels (1.3 MB), the driver ‘takes 3 pre-allocated pages’ and passes their physical addresses to the PCI device. The image is split by the PCI device into 3 512 kB ‘pages’ - the first 512 kB are copied to the first block, the second 512 kB to the second, and the remaining roughly 260 kB to the third block. The user-mode app should be able to access the complete image as one contiguous virtual memory range.

(FYI: in Linux, I use pci_alloc_consistent to allocate the ‘pages’ and remap_pfn_range
to map several of them into one contiguous virtual memory range.)
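To make the block arithmetic concrete, here is a small plain-C sketch of the splitting I have in mind (the 512 kB block size and the 1 byte/pixel raw format are my assumptions, nothing mandated by the hardware):

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE (512u * 1024u)   /* size of each pre-allocated common buffer */

/* Number of fixed-size blocks needed to hold an image of img_bytes bytes. */
static unsigned blocks_needed(size_t img_bytes, size_t block_size)
{
    return (unsigned)((img_bytes + block_size - 1) / block_size);
}

/* Bytes the device must copy into block 'i' (0-based). */
static size_t block_fill(size_t img_bytes, size_t block_size, unsigned i)
{
    size_t offset = (size_t)i * block_size;
    if (offset >= img_bytes)
        return 0;
    size_t rest = img_bytes - offset;
    return rest < block_size ? rest : block_size;
}
```

For a 1280x1024 image at 1 byte/pixel (1,310,720 bytes) this gives 3 blocks, the last one only partially filled - the situation described above.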

best regards,

Thomas

> My device is capable to copy the requested data (a raw bitmap image) to several physical datablocks.

Before we proceed further, could you please explain why you think you need common-buffer DMA rather than scatter-gather DMA under these circumstances…

Anton Bassov

Aren’t you approaching this backwards? Why not have the user allocate the required virtually contiguous space and then use standard scatter-gather DMA to move the data into that region of memory?

Mark Roddy

On Sat, Jun 6, 2009 at 6:34 AM, wrote:
> Another DMA buffer memory question…: How is it possible to ‘present’ several AllocateCommonBuffer blocks to a user-mode app as one contiguous virtual memory range?
> [snip]

Hi Anton and Mark,

Is it correct that scatter-gather DMA requires the device to be able to write the data to a list of blocks of arbitrary size and arbitrary number? How is this information usually sent to a PCI device?

Is this the ‘normal’ way: the user-mode part allocates a block of virtual memory, the driver locks the pages, determines the physical addresses of the pages, and sends the list to the device? (I think the page size is 4 kB - so for a 5 MB block some 1,280 pages would be required…)
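To check my arithmetic, here is a small plain-C helper that, as far as I understand, mirrors what the WDK’s ADDRESS_AND_SIZE_TO_SPAN_PAGES macro computes (a 4 kB page size is assumed):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Worst-case number of 4 kB pages spanned by a buffer of 'size' bytes
 * starting at virtual address 'va'. A misaligned start can add one
 * extra page compared to the aligned case. */
static size_t span_pages(uintptr_t va, size_t size)
{
    return ((va & (PAGE_SIZE - 1)) + size + PAGE_SIZE - 1) / PAGE_SIZE;
}
```

A page-aligned 5 MB buffer spans 1,280 pages; if the buffer starts in the middle of a page it spans 1,281.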

I think that my device supports something like ‘multiple common buffers’ - all the buffers have to be of the same size (512 kB) and there is a maximum total number of blocks (10).

Is it theoretically possible to allocate several contiguous blocks of memory using, for example, AllocateCommonBuffer and present them to the user-mode app as one virtual block?

bye,

Thomas

> How is this information usually sent to a PCI device?

Usually, the buffer list (as a chain of device-defined descriptor structures) is placed in the common buffer, from where the device reads it. This is called “chain DMA” or a “channel program”.

A classic approach - modern ATA, USB, 1394, SCSI and Ethernet controllers all work this way.
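As a plain-C illustration - the descriptor layout below is purely hypothetical; the real layout is dictated by your hardware (cf. ATA PRD tables or USB EHCI transfer descriptors) - such a chain can be built inside the common buffer like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device-defined transfer descriptor. */
typedef struct dma_desc {
    uint64_t phys_addr;   /* physical address of one data block      */
    uint32_t length;      /* bytes to transfer to/from that block    */
    uint32_t next;        /* bus address of next descriptor, 0 = end */
} dma_desc;

/* Build a descriptor chain. 'descs' points into the common buffer;
 * 'desc_bus' is the device-visible bus address of that buffer, which
 * the driver would finally write to a device register to start DMA. */
static void build_chain(dma_desc *descs, uint64_t desc_bus,
                        const uint64_t *block_addrs,
                        const uint32_t *block_lens, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        descs[i].phys_addr = block_addrs[i];
        descs[i].length    = block_lens[i];
        descs[i].next      = (i + 1 < n)
            ? (uint32_t)(desc_bus + (i + 1) * sizeof(dma_desc))
            : 0;   /* terminate the chain */
    }
}
```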

> Is this the ‘normal’ way: the user-mode part allocates a block of virtual memory, the driver locks the pages,

IO manager will do this for you and provide your driver with the MDL.

> determines the physical addresses of the pages

Sort of. You need to pass the MDL to the DMA APIs, which will convert it to an SGL, which in turn contains the addresses to send to the device.

> and sends the list to the device?

Sort of.

> Is it theoretically possible to allocate several contiguous blocks of memory using, for example, AllocateCommonBuffer and present them to the user-mode app as one virtual block?

Without undocumented manual hacking of the MDLs - no. This is not the Windows way of doing things.


Maxim S. Shatskih
Windows DDK MVP
xxxxx@storagecraft.com
http://www.storagecraft.com

> Is it theoretically possible to allocate several contiguous blocks of memory using, for example, AllocateCommonBuffer and present them to the user-mode app as one virtual block?

Although, as Max told you already, this is not really the Windows way of doing things, you can do a bit of experimentation. Try to replace the call to MmBuildMdlForNonPagedPool() with a call to MmProbeAndLockPages() - after all, the documentation clearly states that the target buffer does not necessarily have to be pageable. As a result, you will be able to subsequently use the MDL in a call to MmMapLockedPagesSpecifyCache(). Allocate a virtual range with the MEM_RESERVE flag in user space (this range should be sufficient for mapping all buffers at once), and try to map your target MDL to the appropriate offset in that range with MmMapLockedPagesSpecifyCache()…

Such an approach does not seem to involve hackery of any description. Therefore, if it works, it should be just fine…

Anton Bassov

On Sat, Jun 6, 2009 at 12:26 PM, wrote:
> Is this the ‘normal’ way: the user-mode part allocates a block of virtual memory, the driver locks the pages, determines the physical addresses of the pages, and sends the list to the device? (I think the page size is 4 kB - so for a 5 MB block some 1,280 pages would be required…)

For your 5 MB virtual memory buffer you might in the worst case have 1281 discontiguous physical pages, but typically you will have runs of multiple contiguous pages. See the SCATTER_GATHER_LIST and SCATTER_GATHER_ELEMENT structures and the related interfaces and callback function definitions.
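As a rough plain-C model (the real structures live in the WDK and are filled in by the DMA APIs, not by your driver), coalescing per-page physical addresses into contiguous runs looks like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u

/* Minimal stand-in for the WDK's SCATTER_GATHER_ELEMENT. */
typedef struct {
    uint64_t address;  /* device-visible start of a physically contiguous run */
    uint32_t length;   /* length of the run in bytes */
} sg_element;

/* Coalesce an array of per-page physical addresses (the kind of
 * information an MDL describes) into runs of contiguous pages;
 * returns the number of scatter-gather elements produced. */
static size_t build_sg_list(const uint64_t *page_addrs, size_t npages,
                            sg_element *out)
{
    size_t n = 0;
    for (size_t i = 0; i < npages; i++) {
        if (n > 0 && out[n - 1].address + out[n - 1].length == page_addrs[i]) {
            out[n - 1].length += PAGE_SIZE;   /* extend the current run */
        } else {
            out[n].address = page_addrs[i];   /* start a new run */
            out[n].length  = PAGE_SIZE;
            n++;
        }
    }
    return n;
}
```

This is why a 5 MB buffer rarely needs 1281 elements in practice: adjacent physical pages collapse into a single element.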

Buffers on the order of 5 MB are easily accommodated by the standard DMA architecture in Windows and do not require the driver writer to jump through hoops of hackery.

Mark Roddy

Hi Mark,

> Buffers on the order of 5 MB are easily accommodated by the standard DMA architecture in Windows and do not require the driver writer to jump through hoops of hackery.

First - thanks for your fast help. My ‘experiment’ is intended to run on an ‘embedded PC’ (and never on any consumer PC). My ‘strange paradigm’ stems from an experiment with an existing Linux driver…

The 5 MB is not easily allocatable in Linux: pci_alloc_consistent allocates at most 4 MB blocks - and I need more than that. I split the allocation into 512 kB blocks and changed my PCI device so that it sends the 5 MB of data in 10 blocks - let’s say as a proof of concept.

Now I am trying the same experiment in an existing WDM driver for the device. At the moment this driver uses one 1 MB AllocateCommonBuffer block that is filled by the PCI device.

I was thinking of statically pre-allocating a pool of 512 kB blocks and reusing them. I would never run into any kind of fragmentation, since I would not allocate anything at runtime.

> For your 5 MB virtual memory buffer you might in the worst case have 1281 discontiguous physical pages, but typically you will have runs of multiple contiguous pages. See the SCATTER_GATHER_LIST and SCATTER_GATHER_ELEMENT structures and the related interfaces and callback function definitions.

On my PCI card there is a controller with local memory, and I can write to this memory. My way of communicating with the PCI device is to write data into its memory and trigger an interrupt on this controller. In my current setup I would have to reserve 1281 SCATTER_GATHER_ELEMENTs per 5 MB block (and I can have several such blocks) to be prepared for the ‘worst case’. The 5 MB is also just an example - it can be more.

But I think that in ‘the normal way’ of thinking this is also the other way round - correct? Is it true that the ‘normal way’ is to send the address of the scatter-gather list to the device, and the device reads the complete list by DMA and starts sending data?

best regards,

Thomas

Hi Anton,

thanks for your input.

> … replace a call to MmBuildMdlForNonPagedPool() with the
> one to MmProbeAndLockPages() - …
> As a result, you will be able to subsequently use MDL in a call to
> MmMapLockedPagesSpecifyCache(). Allocate a virtual range with
> MEM_RESERVE flag … map your target MDL to
> the appropriate offset in a range with MmMapLockedPagesSpecifyCache()…

Since it is a rainy weekend I tried your suggestion, but I couldn’t finish it.
Maybe it’s just one more newbie bug that stops it. So here is a description of my experiment:

1.) At the moment I call AllocateCommonBuffer in the DriverEntry function (maybe not the correct place) for each of my test buffers. There are 4 test buffers, each 16 bytes (too small?).

2.) If AllocateCommonBuffer succeeds, I call IoAllocateMdl for each buffer and store its return value in a static variable.

3.) In a user-mode app, I call VirtualAlloc with MEM_RESERVE and PAGE_READWRITE for 4 x 16 bytes.

4.) Using a DeviceIoControl call, I send the address returned by VirtualAlloc to the driver.

5.) In the driver I use the statically stored MDL (returned by IoAllocateMdl in DriverEntry) to call MmProbeAndLockPages. AccessMode: UserMode, Operation: IoModifyAccess.

6.) After that, MmMapLockedPagesSpecifyCache with the same MDL and AccessMode: UserMode, CacheType: MmCached, BaseAddress: the value returned by VirtualAlloc.

7.) In the user-mode app, when I access the pointer returned by VirtualAlloc, I get an ‘access violation’.

If I understand it correctly, VirtualAlloc doesn’t allocate any physical memory when called with MEM_RESERVE. But shouldn’t MmMapLockedPagesSpecifyCache do this?

Do you see any obvious error?

bye,

Thomas

> Do you see any obvious error?

The most obvious error is that you allocate an MDL smaller than PAGE_SIZE. You just cannot map less than PAGE_SIZE into the virtual address space - this is an architectural feature that does not depend on the OS: whenever you map anything, in actuality you map a whole page. If your MDL describes just 16 bytes (i.e. the rest of the page may belong to someone else) and you map the whole page into user space, it means the app is able to write to memory that belongs to someone else, effectively screwing up the system…

What is the return value of MmMapLockedPagesSpecifyCache()? I strongly suspect it is NULL, i.e. your operation simply fails. This is why accessing the address results in an exception…

Try it with MDLs of at least PAGE_SIZE. If it still does not work, try OR-ing MEM_RESERVE with MEM_PHYSICAL in the flags, so that the OS knows the range is reserved for something other than VirtualAlloc() with the MEM_COMMIT flag (I don’t know whether it plays any role, but anyway). If it still does not work, then you have to accept that the experiment is unsuccessful and give it up…

Anton Bassov

> Allocate a virtual range with MEM_RESERVE flag in the user space
> (this range should be sufficient for mapping all buffers at once), and
> try to map your target MDL to the appropriate offset in a range with
> MmMapLockedPagesSpecifyCache()…

This won’t work. MmMapLockedPagesSpecifyCache(UserMode)
always allocates a new VAD, so if there is an existing overlapping
VAD it will fail with STATUS_CONFLICTING_ADDRESSES.


Pavel Lebedinsky/Windows Kernel Test
This posting is provided “AS IS” with no warranties, and confers no rights.

Thomas,

As Pavel says, the whole thing does not work because MmMapLockedPagesSpecifyCache(UserMode) always allocates a new VAD anyway. Therefore, you have to apply a small modification to your UM code - instead of reserving memory with VirtualAlloc(), discover a free region that is sufficient for mapping your buffer with VirtualQuery(), and pass it to the driver.

However, with this approach you will have to call VirtualQuery() immediately before submitting the IOCTL and map all MDLs before returning to UM - otherwise, since you cannot reserve the range, a call to a memory-management function like VirtualAlloc() or MapViewOfFile() that is made behind the scenes by other API calls could potentially take the range you planned to use for your buffer.

In any case, ensure that your common buffers are rounded up to a PAGE_SIZE boundary…
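A sketch of that rounding - the same arithmetic as the WDK’s ROUND_TO_PAGES macro, assuming a 4 kB page:

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Round a requested buffer size up to a whole number of pages. */
static size_t round_to_pages(size_t size)
{
    return (size + PAGE_SIZE - 1) & ~((size_t)PAGE_SIZE - 1);
}
```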

Anton Bassov

> But I think that in ‘the normal way’ of thinking this is also the other way round - correct? Is it true that the
> ‘normal way’ is to send the address of the scatter-gather list to the device, and the device reads the
> complete list by DMA and starts sending data?

Yes.


Maxim S. Shatskih
Windows DDK MVP

xxxxx@storagecraft.com

http://www.storagecraft.com

Hi Anton, Pavel and Maxim

thanks again… I have checked my test code and found 2 bugs. Due to that, the observations I posted yesterday could be wrong on some points. At the moment the test IOCTL crashes the PC quite spectacularly - most likely due to another stupid bug.

I think that I have to improve my debug environment before I can go on. I’ll try to make the VirtualQuery/MmMapLockedPagesSpecifyCache idea work anyway (most likely not before next weekend…). But I think I have to look at scatter-gather DMA in detail, and the most reasonable way might be to re-implement the driver that way and change the PCI device. If scatter-gather is the way to go, it is for sure also available in Linux…

Can you recommend a starting point for scatter-gather DMA from a PCI device? Is there some kind of tutorial or demo code available?

best regards,

Thomas