Help with large DMA transfers
Hi all, first post here, please be gentle...
I am maintaining a PCIe device driver for an imaging device that transfers a lot
of data to host via DMA. Currently we use a circular buffer of up to 2GB and the
transfer rates are around 800MB/s. The device has an Altera FPGA and we have
full control over the firmware. The plan is to quadruple the data rate, switch
to PCIe x8 or Gen 3.
The typical scenario is as follows:
1) User application allocates the buffer, up to 2GB in size
2) User application passes the buffer pointer and size to the device driver
3) Device driver creates MDL, locks the memory
4) Device driver creates a single DMA transaction and executes it
5) Device driver obtains the OS SGDMA descriptors, programs them to the device
6) Device is told to start imaging and continuously fills the circular buffer
7) Each time an image is transferred, the device interrupts, the driver notifies
user application about new image in the circular buffer
To me this approach is very efficient. There is no memcpy anywhere and the
images simply "appear" in the buffer on the application side. The buffer is also
used in non-circular fashion, the application simply says: here is a 2GB buffer,
fill it with images. And the transfer is fast and efficient.
The problem is, the number of SGDMA descriptors to "describe" a 2GB user buffer
can be up to 524288 if I count right (2GB/4kB page size). Given that a SGDMA
descriptor for our device is 16B large, we'd need about 524288x16=8MB of memory
on the device. To reduce device resource usage we created only a small memory on
the device's FPGA and we share that memory with the driver, using MmMapIoSpace.
We fill only what we can and during acquisition the device interrupts the driver
when it depletes the current batch of descriptors - and the driver fills in
another batch, on and on.
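The arithmetic above can be written down as a small sketch. It assumes a page-aligned buffer, 4kB pages and one 16B descriptor per page (the worst case, where no physically contiguous pages get merged into one descriptor). The function and macro names are made up for illustration and are not part of any driver API.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES   4096u  /* 4kB page size, as assumed in the post */
#define DESCRIPTOR_BYTES  16u    /* per-descriptor size quoted in the post */

/* Worst-case descriptor count for a page-aligned buffer: one per page. */
static uint64_t worst_case_descriptors(uint64_t buffer_bytes)
{
    return (buffer_bytes + PAGE_SIZE_BYTES - 1) / PAGE_SIZE_BYTES;
}

/* Bytes of descriptor storage needed to describe the whole buffer. */
static uint64_t descriptor_table_bytes(uint64_t buffer_bytes)
{
    uint64_t count = worst_case_descriptors(buffer_bytes);
    assert(count <= UINT64_MAX / DESCRIPTOR_BYTES);  /* guard against overflow */
    return count * DESCRIPTOR_BYTES;
}
```

For a 2GB buffer this gives 524288 descriptors and 8MB of descriptor storage. Under the same assumptions a 60GB buffer would need 15728640 descriptors, or 240MB.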
This approach worked fine but with increasing data rates the number of
interrupts per second increased and sometimes we cannot update the shared memory
fast enough. Expanding the FPGA's memory to 8MB to fit all the descriptors is
out of the question.
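To put a rough number on the interrupt load, here is a minimal sketch. It assumes every descriptor covers exactly one 4kB page and that the device interrupts once per depleted batch. The batch size is a hypothetical parameter (the actual size of the shared FPGA memory is not stated here), and the function name is made up.

```c
#include <assert.h>

#define PAGE_SIZE_BYTES 4096.0  /* assumed: one 4kB page per descriptor */

/* Refill interrupts per second for a given data rate and batch size. */
static double refill_interrupts_per_second(double bytes_per_second,
                                           unsigned descriptors_per_batch)
{
    assert(descriptors_per_batch > 0);
    /* Each batch describes descriptors_per_batch * 4kB of image data. */
    return bytes_per_second / (descriptors_per_batch * PAGE_SIZE_BYTES);
}
```

With a hypothetical batch of 1024 descriptors, 800MB/s works out to roughly 191 refills per second, and four times the data rate means four times as many, unless the batch grows accordingly.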
So I came across Common Buffers. I've created an 8MB common buffer with
WdfCommonBufferCreate, filled it with all the descriptors and gave the logical
address to the device. The device seems to be able to read the common buffer
just fine but we have not completely finished the firmware side yet.
However, I realized this is probably not a future-proof solution either. I
think an 8MB common buffer is already quite large and I've already seen
STATUS_INSUFFICIENT_RESOURCES after my system had been running for a few days
and I had reloaded the driver several times. I cannot imagine asking the system
for a 128MB common buffer for DMA descriptors if we'd like to use huge DMA
transfers.
Since I am not really super experienced in this area, I'd like to kindly ask
you, experts, for advice.
In general, what is the correct approach to efficiently handle large DMA
transfers? Is it even possible? We'd really like to support huge user buffers.
We'd like our application to simply say: here is a 60GB large buffer, fill it
with images and let me know when it's done - and the only bottleneck would be
the PCIe to RAM bandwidth.
I can think of a few options:
A - Use shared memory on the device, program it continuously during acquisition
- As it is now. Requires frequent interrupts and somewhat tricky on-the-fly
descriptor updating, consumes resources on the device, we try to avoid this.
B - Use "common buffer"
- Common buffer seems to be unreliable, even if I shrink it to 1MB and somehow
continuously update it with new descriptors on the fly (in a similar way as we
do now). I can always get the STATUS_INSUFFICIENT_RESOURCES at some point. Yes, I can
allocate common buffer during boot (right now our driver is loaded during boot
but there is an effort to support PCIe hot plug and so I can imagine that our
driver can be loaded any time after boot, when the memory is already
fragmented). I could also create a helper "boot" driver that will allocate the
common buffer for later use but that sounds like another workaround.
C - Separate DMA buffer
- Have a small DMA buffer in the driver itself, say, a 128MB DMA buffer and use
RtlCopyMemory() to copy data to user buffer when each image arrives. I am
reluctant to do this, the DMA is so efficient and I will kill the performance
with another memory copy. It would solve all the problems though.
D - Use device's own DDR memory
- Before starting the acquisition, use DMA to transfer all the descriptors to
the device's own DDR memory. Doable if nothing else works.
E - Two ongoing DMA transfers
- One to transfer SGDMA descriptors themselves from RAM to the device, another
one to transfer the actual images from the device to RAM (as done now). That
sounds quite crazy to me but would probably allow us to have a huge array of
descriptors and would be future-proof. No common buffer needed, the descriptors
themselves would be in RAM but would need to be DMA'd to the device on the fly.
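Options A, B and E all come down to the same mechanism: a small window of descriptors that the driver keeps topping up while the device drains it. Below is a minimal sketch of that idea as a plain ring with free-running head and tail counters. Everything in it is hypothetical (the descriptor field layout, the ring size, the names), and the device side is only simulated. A real driver would also need memory barriers and some way to tell the device that the head has moved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_ENTRIES 8u  /* hypothetical size; must be a power of two */

typedef struct {
    uint64_t bus_address;  /* logical address of one fragment */
    uint32_t length;       /* fragment length in bytes */
    uint32_t flags;        /* hypothetical control bits */
} sg_descriptor;

/* 16 bytes per descriptor, matching the size quoted earlier in the post. */
_Static_assert(sizeof(sg_descriptor) == 16, "descriptor must be 16 bytes");

typedef struct {
    sg_descriptor slots[RING_ENTRIES];
    uint32_t head;  /* next slot the driver writes (free-running counter) */
    uint32_t tail;  /* next slot the device reads (free-running counter) */
} descriptor_ring;

/* Number of slots the driver may still fill without overwriting unread ones. */
static uint32_t ring_free_slots(const descriptor_ring *r)
{
    return RING_ENTRIES - (r->head - r->tail);
}

/* Driver side: queue up to count descriptors; returns how many were queued. */
static size_t ring_refill(descriptor_ring *r, const sg_descriptor *src,
                          size_t count)
{
    size_t queued = 0;
    assert(r != NULL && src != NULL);
    while (queued < count && ring_free_slots(r) > 0) {
        r->slots[r->head % RING_ENTRIES] = src[queued++];
        r->head++;
    }
    return queued;
}

/* Device side (simulated): take one descriptor; returns 0 if the ring is empty. */
static int ring_consume(descriptor_ring *r, sg_descriptor *out)
{
    if (r->head == r->tail)
        return 0;
    *out = r->slots[r->tail % RING_ENTRIES];
    r->tail++;
    return 1;
}
```

The same bookkeeping would apply whether the ring lives in FPGA memory (A), in a small common buffer (B), or is fetched by the device with its own DMA reads (E). What differs is who moves the descriptors and how often the driver has to be interrupted to refill.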
Any input and recommendations are much appreciated, thank you!