Help with large DMA transfers

Lubo · August 2, 2017, 10:45pm

Hi all, first post here, please be gentle…

I am maintaining a PCIe device driver for an imaging device that transfers a lot of data to the host via DMA. Currently we use a circular buffer of up to 2GB and the transfer rates are around 800MB/s. The device has an Altera FPGA and we have full control over the firmware. The plan is to quadruple the data rate and switch to PCIe x8 or Gen 3.

The typical scenario is as follows:

User application allocates the buffer, up to 2GB in size
User application passes the buffer pointer and size to the device driver
Device driver creates MDL, locks the memory
Device driver creates a single DMA transaction and executes it
Device driver obtains the OS SGDMA descriptors, programs them to the device
Device is told to start imaging, continuously fills in data into the circular buffer
Each time an image is transferred, the device interrupts, the driver notifies user application about new image in the circular buffer

To me this approach is very efficient. There is no memcpy anywhere and the images simply “appear” in the buffer on the application side. The buffer is also used in non-circular fashion, the application simply says: here is a 2GB buffer, fill it with images. And the transfer is fast and efficient.

The problem is, the number of SGDMA descriptors to “describe” a 2GB user buffer can be up to 524288 if I count right (2GB/4kB page size). Given a SGDMA descriptor for our device is 16B large we’d need about 524288x16=8MB of memory on the device. To reduce device resource usage we created only a small memory on the device’s FPGA and we share that memory with the driver, using MmMapIoSpace. We fill only what we can and during acquisition the device interrupts the driver when it depletes the current batch of descriptors - and the driver fills in another batch, on and on.
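
The arithmetic above can be checked with a short sketch in plain C. It assumes 4 KiB pages and the 16-byte descriptor size stated in the post, and it reads "2GB" as 2 x 2^30 bytes; the function names are made up for illustration. One detail it makes visible: a buffer that does not start on a page boundary touches one extra page, so the worst case for an unaligned 2GB buffer is 524289 descriptors rather than 524288.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u   /* assumption: 4 KiB pages */
#define DESC_SIZE_BYTES 16u     /* from the post: 16-byte SGDMA descriptors */

/* Number of pages touched by the byte range [va, va + len).
 * This is the worst-case descriptor count: one descriptor per page
 * when no two neighbouring pages are physically contiguous. */
uint64_t worst_case_descriptors(uint64_t va, uint64_t len)
{
    assert(len <= UINT64_MAX - va);   /* range must not wrap around */
    if (len == 0)
        return 0;
    uint64_t first_page = va / PAGE_SIZE_BYTES;
    uint64_t last_page  = (va + len - 1) / PAGE_SIZE_BYTES;
    return last_page - first_page + 1;
}

/* Memory needed to hold every descriptor for the range at once. */
uint64_t descriptor_table_bytes(uint64_t va, uint64_t len)
{
    return worst_case_descriptors(va, len) * DESC_SIZE_BYTES;
}
```

By the same arithmetic, a page-aligned 60GB buffer (taken as 60 x 2^30 bytes) needs up to 15728640 descriptors, which is 240MB of descriptor memory if all of them have to exist at once.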

This approach worked fine but with increasing data rates the number of interrupts per second increased and sometimes we cannot update the shared memory fast enough. Expanding the FPGA’s memory to 8MB to fit all the descriptors is out of the question.

So I came across Common Buffers. I’ve created an 8MB common buffer with WdfCommonBufferCreate, filled it with all the descriptors and gave the logical address to the device. The device seems to be able to read the common buffer just fine but we have not completely finished the firmware side yet.

However, I realized this is probably not a future-proof solution either. I think an 8MB common buffer is already quite large and I’ve already seen STATUS_INSUFFICIENT_RESOURCES after my system had been running for a few days and I had reloaded the driver several times. I cannot imagine asking the system for a 128MB common buffer for DMA descriptors if we’d like to use huge DMA transfers.

Since I am not very experienced in this area, I’d like to kindly ask you experts for advice.

In general, what is the correct approach to efficiently handle large DMA transfers? Is it even possible? We’d really like to support huge user buffers. We’d like our application to simply say: here is a 60GB buffer, fill it with images and let me know when it’s done - and the only bottleneck would be the PCIe to RAM bandwidth.

I can think of a few options:

A - Use shared memory on the device, program it continuously during acquisition

As it is now. Requires frequent interrupts and somewhat tricky on-the-fly descriptor updating, consumes resources on the device, we try to avoid this.

B - Use “common buffer”

A common buffer seems to be unreliable: even if I shrink it to 1MB and somehow continuously update it with new descriptors on the fly (in a similar way as we do now), I can always get STATUS_INSUFFICIENT_RESOURCES at some point. Yes, I can allocate the common buffer during boot (right now our driver is loaded during boot but there is an effort to support PCIe hot plug and so I can imagine that our driver can be loaded any time after boot, when the memory is already fragmented). I could also create a helper “boot” driver that will allocate the common buffer for later use but that sounds like another workaround.

C - Separate DMA buffer

Have a small DMA buffer in the driver itself, say, 128MB DMA buffer and use RtlCopyMemory() to copy data to user buffer when each image arrives. I am reluctant to do this, the DMA is so efficient and I will kill the performance with another memory copy. It would solve all the problems though.

D - Use device’s own DDR memory

Before starting the acquisition, use DMA to transfer all the descriptors to the device’s own DDR memory. Doable if nothing else works.

E - Two ongoing DMA transfers

One to transfer SGDMA descriptors themselves from RAM to the device, another one to transfer the actual images from the device to RAM (as done now). That sounds quite crazy to me but would probably allow us to have a huge array of descriptors and would be future-proof. No common buffer needed, the descriptors themselves would be in RAM but would need to be DMA’d to the device on the fly.

Any inputs and recommendations are very appreciated, thank you!

Lubo

Tim_Roberts · August 3, 2017, 1:31pm

xxxxx@gmail.com xxxxx@lists.osr.com wrote:

The problem is, the number of SGDMA descriptors to “describe” a 2GB user buffer can be up to 524288 if I count right (2GB/4kB page size). Given a SGDMA descriptor for our device is 16B large we’d need about 524288x16=8MB of memory on the device.

That’s the worst case, yes.
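
To illustrate why this is only the worst case: whenever neighbouring pages of the buffer happen to be physically contiguous, they can share one descriptor. The sketch below merges such runs while building a descriptor table from a list of page addresses. The 16-byte `sg_desc` layout, the `build_descriptors` name, and the 4 KiB page size are assumptions for illustration; the real device format is not given in this thread. On Windows the scatter/gather list handed to the driver is typically already expressed as address/length runs, so this is mainly a model of the counting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u   /* assumption: 4 KiB pages */

/* Hypothetical 16-byte descriptor layout (not the actual device format). */
struct sg_desc {
    uint64_t phys_addr;   /* start of a physically contiguous run */
    uint32_t length;      /* run length in bytes */
    uint32_t flags;       /* unused in this sketch */
};

/* Build descriptors from an array of page-aligned physical addresses,
 * merging physically contiguous neighbours into a single descriptor.
 * Returns the number of descriptors written, or SIZE_MAX if 'out' is
 * too small to hold them. */
size_t build_descriptors(const uint64_t *pages, size_t n_pages,
                         struct sg_desc *out, size_t out_cap)
{
    size_t n = 0;
    for (size_t i = 0; i < n_pages; i++) {
        if (n > 0 &&
            out[n - 1].phys_addr + out[n - 1].length == pages[i] &&
            out[n - 1].length <= UINT32_MAX - PAGE_SIZE_BYTES) {
            /* Page directly follows the previous run: extend it. */
            out[n - 1].length += PAGE_SIZE_BYTES;
        } else {
            if (n == out_cap)
                return SIZE_MAX;
            out[n].phys_addr = pages[i];
            out[n].length    = PAGE_SIZE_BYTES;
            out[n].flags     = 0;
            n++;
        }
    }
    return n;
}
```

Six pages that form runs of three, two and one pages need three descriptors instead of six. A heavily fragmented system gives no such savings, which is why the table still has to be sized for one descriptor per page.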

So I came across Common Buffers. I’ve created an 8MB common buffer with WdfCommonBufferCreate, filled it with all the descriptors and gave the logical address to the device. The device seems to be able to read the common buffer just fine but we have not completely finished the firmware side yet.

That’s a common design. The FPGA pulls in a set of S/G descriptors, and
when it runs low it DMAs the next set.

However, I realized this is probably not a future-proof solution either. I think an 8MB common buffer is already quite large and I’ve already seen STATUS_INSUFFICIENT_RESOURCES after my system had been running for a few days and I had reloaded the driver several times. I cannot imagine asking the system for a 128MB common buffer for DMA descriptors if we’d like to use huge DMA transfers.

You can get large common buffers at boot time, before physical space
gets fragmented. There’s usually no reason why your driver would have
to restart in a production environment.

A - Use shared memory on the device, program it continuously during acquisition

As it is now. Requires frequent interrupts and somewhat tricky on-the-fly descriptor updating, consumes resources on the device, we try to avoid this.

Yes, this is icky.

B - Use “common buffer”

A common buffer seems to be unreliable: even if I shrink it to 1MB and somehow continuously update it with new descriptors on the fly (in a similar way as we do now), I can always get STATUS_INSUFFICIENT_RESOURCES at some point. Yes, I can allocate the common buffer during boot (right now our driver is loaded during boot but there is an effort to support PCIe hot plug and so I can imagine that our driver can be loaded any time after boot, when the memory is already fragmented). I could also create a helper “boot” driver that will allocate the common buffer for later use but that sounds like another workaround.

I’ve done several designs like this. Your concerns are valid, although
in my opinion the scenarios you describe are uncommon.

C - Separate DMA buffer

Have a small DMA buffer in the driver itself, say, 128MB DMA buffer and use RtlCopyMemory() to copy data to user buffer when each image arrives. I am reluctant to do this, the DMA is so efficient and I will kill the performance with another memory copy. It would solve all the problems though.

I’ve done this as well. Memory copies are pretty danged fast these
days, and it does simplify the design, but there is a measurable
performance cost.

D - Use device’s own DDR memory

Before starting the acquisition, use DMA to transfer all the descriptors to the device’s own DDR memory. Doable if nothing else works.

This is another common design that I’ve done several times. A few
megabytes of RAM is a pretty cheap investment.

E - Two ongoing DMA transfers

One to transfer SGDMA descriptors themselves from RAM to the device, another one to transfer the actual images from the device to RAM (as done now). That sounds quite crazy to me but would probably allow us to have a huge array of descriptors and would be future-proof. No common buffer needed, the descriptors themselves would be in RAM but would need to be DMA’d to the device on the fly.

I’ve actually done this one, as well. You’d have a set of descriptors
for the descriptor buffer, and have some kind of level trigger to load
the next set. As long as your DMA engine is not 100% busy, this is
simple from the driver’s side, although it adds complexity in the FPGA.

–
Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Tomas_Whitlock · August 5, 2017, 1:35pm

An approach that I have used is as follows:

Have a common buffer able to hold N descriptors (e.g. N = 256). With 32-byte descriptors, this is a mere 4 kiB.

Break the large DMA transfer into chunks, each chunk using no more than N descriptors. In practice, this means you have to avoid spanning more than N pages with a single DMA chunk, because in the worst case, 1 page = 1 descriptor = 1 “map register”. You can tune the chunk size to get an acceptable interrupt rate for the sustained DMA throughput that you can achieve with the hardware.

In the driver, for each chunk, call AllocateAdapterChannel etc. or Get/BuildScatterGatherList.

In the AdapterControl / AdapterListControl routine, write values into the descriptors.

In the (hardware) DMA completion ISR, schedule a DPC.

In the DPC, do the necessary, i.e. (FlushAdapterBuffers + FreeMapRegisters) OR PutScatterGatherList. Then go and do another chunk.
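
The chunk-splitting rule in the steps above can be sketched in plain C, independent of any Windows DMA API. This is only a model of the arithmetic, assuming 4 KiB pages; `next_chunk_len` and `count_chunks` are hypothetical names. The first function returns the largest length starting at a given virtual address that touches no more than N pages, so each chunk fits in N descriptors even when every page needs its own.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u   /* assumption: 4 KiB pages */

/* Largest number of bytes starting at 'va' that spans at most
 * 'max_pages' pages, capped at 'remaining'. */
uint64_t next_chunk_len(uint64_t va, uint64_t remaining, uint32_t max_pages)
{
    assert(max_pages > 0);
    uint64_t offset_in_page = va % PAGE_SIZE_BYTES;
    uint64_t max_len = (uint64_t)max_pages * PAGE_SIZE_BYTES - offset_in_page;
    return remaining < max_len ? remaining : max_len;
}

/* Count how many chunks a transfer is split into. In a driver, each
 * loop iteration would correspond to one map / program / complete cycle. */
uint64_t count_chunks(uint64_t va, uint64_t len, uint32_t max_pages)
{
    uint64_t chunks = 0;
    while (len > 0) {
        uint64_t n = next_chunk_len(va, len, max_pages);
        va  += n;
        len -= n;
        chunks++;
    }
    return chunks;
}
```

With N = 256 and a page-aligned buffer, every chunk is 1 MiB, so a 2GB transfer becomes 2048 chunks. An unaligned start shortens the first chunk and adds a short final one.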

A variation on this, which is admittedly pretty complex and tricky to get right, is to overlap the transfer on the hardware with setting up the next DMA chunk. In other words, you have 2 or 3 sets of N descriptors, and rotate them, and by the time the DMA transfer has finished, you have everything ready to tell the hardware to execute another set of descriptors. This is probably not worth doing unless you need every last drop of performance.
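
The rotation of descriptor sets can be modelled as a small state machine. The sketch below is an assumption about one possible bookkeeping scheme, not a description of any particular driver: the names (`desc_ring`, `ring_prepare`, `ring_start`, `ring_complete`) are hypothetical, and the locking a real driver would need between its DPC and its other code paths is left out.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SETS 2u   /* assumption: two sets (double buffering) */

enum set_state { SET_FREE, SET_READY, SET_ACTIVE };

struct desc_ring {
    enum set_state state[NUM_SETS];
    uint32_t fill_idx;   /* next set the driver will fill */
    uint32_t run_idx;    /* next set the hardware will execute */
};

void ring_init(struct desc_ring *r)
{
    for (uint32_t i = 0; i < NUM_SETS; i++)
        r->state[i] = SET_FREE;
    r->fill_idx = 0;
    r->run_idx  = 0;
}

/* Driver side: claim a free set to hold the next chunk's descriptors.
 * Returns the set index, or -1 if every set is still in use. */
int ring_prepare(struct desc_ring *r)
{
    if (r->state[r->fill_idx] != SET_FREE)
        return -1;
    int idx = (int)r->fill_idx;
    r->state[idx] = SET_READY;
    r->fill_idx = (r->fill_idx + 1) % NUM_SETS;
    return idx;
}

/* Hand the oldest prepared set to the hardware. Returns the set index,
 * or -1 if nothing is ready or a set is still executing. */
int ring_start(struct desc_ring *r)
{
    if (r->state[r->run_idx] != SET_READY)
        return -1;
    r->state[r->run_idx] = SET_ACTIVE;
    return (int)r->run_idx;
}

/* Completion side: retire the executing set so it can be refilled. */
void ring_complete(struct desc_ring *r)
{
    assert(r->state[r->run_idx] == SET_ACTIVE);
    r->state[r->run_idx] = SET_FREE;
    r->run_idx = (r->run_idx + 1) % NUM_SETS;
}
```

The point of the scheme is that `ring_prepare` can succeed while another set is active, so on completion the next set only needs `ring_start`, with no mapping work left on the critical path.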

The drawback of this technique is that there is generally a gap of a few microseconds between chunks of the DMA transfer, owing to the interrupt response time, DPC scheduling latency, etc. Obviously, a larger chunk size means that the inter-chunk gap matters less in terms of performance loss.
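
A rough way to put numbers on this, assuming a fixed idle gap after every chunk and a constant raw transfer rate while a chunk is moving (both simplifications; `effective_throughput` is a made-up name):

```c
#include <assert.h>

/* Effective throughput in bytes/s when each chunk of 'chunk_bytes' is
 * transferred at 'raw_bytes_per_s' and followed by an idle gap of
 * 'gap_s' seconds before the next chunk starts. */
double effective_throughput(double chunk_bytes, double raw_bytes_per_s,
                            double gap_s)
{
    assert(chunk_bytes > 0 && raw_bytes_per_s > 0 && gap_s >= 0);
    double transfer_s = chunk_bytes / raw_bytes_per_s;
    return chunk_bytes / (transfer_s + gap_s);
}
```

As an illustration with assumed figures: at a raw rate of 3.2GB/s and a 5 microsecond gap, a 1 MiB chunk takes about 328 microseconds to move, so roughly 1.5% of the bandwidth is lost. With 64 KiB chunks the same gap costs close to 20%.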

However, we’ve got some hardware at the moment that we’re working on which allows more than one DMA operation to be queued up at a given moment. This more or less eliminates the inter-chunk gap.

One potential advantage of this technique is that you’re not mapping a huge DMA transfer in one go, so the delay from the user-mode program saying “please do this DMA transfer” to it starting in hardware is lower. That fact can be useful if your hardware device has a FIFO and data is coming in at a steady rate, which cannot be stopped. Windows is not an RTOS, though, so unless the system as a whole is fault-tolerant to an occasional FIFO overflow, it’s not a particularly good system design.

Tomas_Whitlock · August 5, 2017, 1:37pm

Oops - 32 * 256 = 8 kiB.

Lubo · August 7, 2017, 4:49pm

Thank you Tim and Tomas for taking the time to provide suggestions.

Right now we seem to have success with the 8MB Common Buffer approach and have accepted the fact that in a rare corner case the buffer might not get allocated and an OS reboot will be required.

Once we get the large buffers requirement on our plates we will switch to one of those more complicated proposals - and thanks to your answers we seem to have several feasible options.

L.