Hi all, first post here, please be gentle…
I am maintaining a PCIe device driver for an imaging device that transfers a lot of data to the host via DMA. Currently we use a circular buffer of up to 2GB and the transfer rates are around 800MB/s. The device has an Altera FPGA and we have full control over the firmware. The plan is to quadruple the data rate by switching to PCIe x8 or Gen 3.
The typical scenario is as follows:
- User application allocates the buffer, up to 2GB in size
- User application passes the buffer pointer and size to the device driver
- Device driver creates MDL, locks the memory
- Device driver creates a single DMA transaction and executes it
- Device driver obtains the OS SGDMA descriptors, programs them to the device
- Device is told to start imaging and continuously writes data into the circular buffer
- Each time an image is transferred, the device interrupts and the driver notifies the user application about the new image in the circular buffer
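To make the "obtains the OS SGDMA descriptors, programs them to the device" step concrete, here is a minimal, portable sketch of the translation. It is not our real driver code: `sg_element_t` is a hypothetical stand-in for one OS scatter/gather element, and the 16-byte `dev_desc_t` layout and the `DESC_FLAG_WRAP` bit are assumptions made only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of one OS scatter/gather element. */
typedef struct {
    uint64_t bus_addr;   /* device-visible (logical) address */
    uint32_t length;     /* segment length in bytes */
} sg_element_t;

/* Assumed 16-byte device descriptor; the real FPGA layout may differ. */
typedef struct {
    uint64_t bus_addr;   /* where the device writes image data */
    uint32_t length;     /* bytes in this segment */
    uint32_t flags;      /* DESC_FLAG_* bits */
} dev_desc_t;

static_assert(sizeof(dev_desc_t) == 16, "descriptor must be 16 bytes");

/* Assumed flag: after this descriptor, continue with the first one. */
#define DESC_FLAG_WRAP 0x1u

/*
 * Translate an SG list into device descriptors and mark the last one so
 * the device treats the buffer as circular. Returns the number of
 * descriptors written, or 0 if the list is empty or does not fit.
 */
static size_t build_descriptors(const sg_element_t *sg, size_t sg_count,
                                dev_desc_t *out, size_t out_capacity)
{
    if (sg_count == 0 || sg_count > out_capacity)
        return 0;

    for (size_t i = 0; i < sg_count; i++) {
        out[i].bus_addr = sg[i].bus_addr;
        out[i].length   = sg[i].length;
        out[i].flags    = 0;
    }
    out[sg_count - 1].flags |= DESC_FLAG_WRAP;
    return sg_count;
}
```

In the driver the input would come from the scatter/gather list handed to the program-DMA callback; here it is just an array so the logic can be checked in user mode.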
To me this approach is very efficient. There is no memcpy anywhere and the images simply "appear" in the buffer on the application side. The buffer is also used in a non-circular fashion: the application simply says "here is a 2GB buffer, fill it with images", and the transfer is fast and efficient.
The problem is that the number of SGDMA descriptors needed to "describe" a 2GB user buffer can be up to 524288, if I count right (2GB / 4kB page size). Given that an SGDMA descriptor for our device is 16B, we'd need about 524288 x 16B = 8MB of memory on the device. To reduce device resource usage we created only a small memory on the device's FPGA and we share that memory with the driver using MmMapIoSpace. We fill in only as many descriptors as fit, and during acquisition the device interrupts the driver when it depletes the current batch of descriptors; the driver then fills in another batch, and so on.
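The arithmetic above, as a small checkable sketch. It assumes the worst case where no two pages of the user buffer are physically contiguous and the buffer starts on a page boundary (an unaligned buffer can span one extra page; in the driver, ADDRESS_AND_SIZE_TO_SPAN_PAGES would account for that). The function names are mine, for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u   /* 4kB pages */
#define DESC_SIZE_BYTES 16u     /* size of one descriptor on our device */

static_assert((PAGE_SIZE_BYTES & (PAGE_SIZE_BYTES - 1u)) == 0,
              "page size must be a power of two");

/* Worst-case descriptor count: one descriptor per page, rounded up. */
static uint64_t worst_case_desc_count(uint64_t buffer_bytes)
{
    return (buffer_bytes + PAGE_SIZE_BYTES - 1u) / PAGE_SIZE_BYTES;
}

/* Memory needed to hold every descriptor for the buffer at once. */
static uint64_t desc_table_bytes(uint64_t buffer_bytes)
{
    return worst_case_desc_count(buffer_bytes) * DESC_SIZE_BYTES;
}
```

For a 2GB buffer, `worst_case_desc_count(2ull << 30)` gives 524288 and `desc_table_bytes(2ull << 30)` gives 8388608 bytes, i.e. 8MB.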
This approach worked fine, but with increasing data rates the number of interrupts per second increased and sometimes we cannot update the shared memory fast enough. Expanding the FPGA's memory to 8MB to fit all the descriptors is out of the question.
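For reference, this is roughly how the batch refill can be modelled in plain C. It is a simplified sketch, not our production code: `WINDOW_SLOTS`, the struct names and the descriptor layout are hypothetical, and the "window" is an ordinary array standing in for the FPGA memory mapped with MmMapIoSpace. The refill takes the number of slots the device reports as consumed, so the same routine would also work if the device interrupted at a half-empty watermark instead of on full depletion (that variation is an assumption, it is not what our firmware does today).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WINDOW_SLOTS 8u  /* hypothetical; limited by on-chip FPGA memory */

typedef struct {
    uint64_t bus_addr;
    uint32_t length;
    uint32_t flags;
} dev_desc_t;

typedef struct {
    dev_desc_t       *window;     /* descriptor window in device memory */
    const dev_desc_t *all;        /* full descriptor list kept in host RAM */
    size_t            total;      /* number of entries in 'all' */
    size_t            next_src;   /* next entry of 'all' to hand out */
    size_t            write_slot; /* next window slot to overwrite */
} desc_window_t;

/*
 * Refill the slots the device reported as consumed. The source index
 * wraps around so a circular image buffer keeps cycling through the
 * same descriptors. Returns the number of slots written.
 */
static size_t window_refill(desc_window_t *w, size_t freed)
{
    assert(w->total > 0);
    if (freed > WINDOW_SLOTS)
        freed = WINDOW_SLOTS;

    for (size_t i = 0; i < freed; i++) {
        w->window[w->write_slot] = w->all[w->next_src];
        w->write_slot = (w->write_slot + 1) % WINDOW_SLOTS;
        w->next_src   = (w->next_src + 1) % w->total;
    }
    return freed;
}
```

The trouble described above shows up here as the time budget for one `window_refill` call: the smaller the window and the higher the data rate, the less time the driver has before the device runs dry.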
So I came across common buffers. I've created an 8MB common buffer with WdfCommonBufferCreate, filled it with all the descriptors and gave the logical address to the device. The device seems to be able to read the common buffer just fine, but we have not completely finished the firmware side yet.
However, I realized this is probably not a future-proof solution either. I think an 8MB common buffer is already quite large, and I've already seen STATUS_INSUFFICIENT_RESOURCES after my system had been running for a few days and I had reloaded the driver several times. I cannot imagine asking the system for a 128MB common buffer for DMA descriptors if we'd like to use huge DMA transfers.
Since I am not really experienced in this area, I'd like to kindly ask you experts for advice.
In general, what is the correct approach to efficiently handle large DMA transfers? Is it even possible? We'd really like to support huge user buffers. We'd like our application to simply say "here is a 60GB buffer, fill it with images and let me know when it's done", with the only bottleneck being the PCIe-to-RAM bandwidth.
I can think of a few options:
A - Use shared memory on the device, program it continuously during acquisition
- As it is now. Requires frequent interrupts and somewhat tricky on-the-fly descriptor updating, and consumes resources on the device, which we try to avoid.
B - Use "common buffer"
- Common buffer seems to be unreliable. Even if I shrink it to 1MB and continuously update it with new descriptors on the fly (in a similar way as we do now), I can still get STATUS_INSUFFICIENT_RESOURCES at some point. Yes, I can allocate the common buffer during boot. Right now our driver is loaded during boot, but there is an effort to support PCIe hot plug, so I can imagine our driver being loaded any time after boot, when the memory is already fragmented. I could also create a helper "boot" driver that allocates the common buffer for later use, but that sounds like another workaround.
C - Separate DMA buffer
- Have a small DMA buffer in the driver itself, say 128MB, and use RtlCopyMemory() to copy data to the user buffer when each image arrives. I am reluctant to do this: the DMA is so efficient and I would kill the performance with another memory copy. It would solve all the problems though.
D - Use device’s own DDR memory
- Before starting the acquisition, use DMA to transfer all the descriptors to the device’s own DDR memory. Doable if nothing else works.
E - Two ongoing DMA transfers
- One to transfer the SGDMA descriptors themselves from RAM to the device, another to transfer the actual images from the device to RAM (as done now). That sounds quite crazy to me, but it would probably allow us to have a huge array of descriptors and would be future-proof. No common buffer needed; the descriptors themselves would be in RAM but would need to be DMA’d to the device on the fly.
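To illustrate what "descriptors in RAM, fetched by the device on the fly" could look like without one large contiguous allocation, here is a rough sketch of a chained layout: the descriptors are spread over page-sized chunks and the last used slot of each chunk is a link entry holding the address of the next chunk. This is only my assumption of how it might be done; the `DESC_FLAG_LINK` bit, the chunk size and all names are hypothetical, and our firmware does not support anything like this today.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_BYTES    4096u
#define DESC_PER_CHUNK (CHUNK_BYTES / 16u)     /* 256 slots per chunk */
#define DATA_PER_CHUNK (DESC_PER_CHUNK - 1u)   /* one slot reserved for the link */
#define DESC_FLAG_LINK 0x2u                    /* assumed: bus_addr is the next chunk */

typedef struct {
    uint64_t bus_addr;
    uint32_t length;
    uint32_t flags;
} dev_desc_t;

typedef struct {
    dev_desc_t *cpu;       /* driver's pointer to the chunk */
    uint64_t    bus_addr;  /* address the device uses to fetch the chunk */
} desc_chunk_t;

/* Number of chunks needed to hold 'desc_count' data descriptors. */
static size_t chunks_needed(size_t desc_count)
{
    return (desc_count + DATA_PER_CHUNK - 1u) / DATA_PER_CHUNK;
}

/*
 * Spread 'src' over the chunks. Each chunk ends with a link descriptor;
 * the last chunk links back to the first so the chain is circular.
 * Returns the number of chunks used, or 0 on empty input or lack of space.
 */
static size_t chain_descriptors(const dev_desc_t *src, size_t count,
                                desc_chunk_t *chunks, size_t chunk_count)
{
    size_t needed = chunks_needed(count);
    if (count == 0 || needed > chunk_count)
        return 0;

    size_t s = 0;
    for (size_t c = 0; c < needed; c++) {
        size_t n = count - s;
        if (n > DATA_PER_CHUNK)
            n = DATA_PER_CHUNK;
        for (size_t i = 0; i < n; i++)
            chunks[c].cpu[i] = src[s++];

        dev_desc_t *link = &chunks[c].cpu[n];
        link->bus_addr = chunks[(c + 1) % needed].bus_addr;
        link->length   = 0;
        link->flags    = DESC_FLAG_LINK;
    }
    return needed;
}
```

With this layout the device would only need to hold the chunk it is currently walking, and the host side would need many small allocations rather than one big one, which I would expect to be easier to satisfy on a fragmented system. Whether the firmware can follow such links fast enough at the target data rate is exactly what I am unsure about.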
Any input and recommendations are much appreciated, thank you!
Lubo