Data streaming software architecture

Hi,

I’m trying to make a driver and a user-land application for streaming data to or from a PCI Express card (but never in both directions at the same time). I’ll focus on the data flow in the read case:
Camera -> DMA1 -> board RAM -> DMA2 -> PCI Express bus -> PC RAM -> PC hard disk
There are multiple cameras and the maximum amount of data they would generate is 240MB, three times per second (720MB/s in total).
So, before investing time, I wanted to ask for suggestions on how to design my software on the PC side for this kind of streaming.

DMA on the PCIe card supports scatter/gather mode and, although it supports 64-bit addresses, that part is not so great because I can’t really use a 64-bit address directly. I have to split it, write the upper 32 bits into a register, and then provide the lower 32 bits on the bus. So S/G DMA can work properly as long as the upper 32 bits are constant; otherwise I’d have to stop the DMA and set that register again, which defeats the purpose of S/G, I guess.

Anyway, I had this idea:

  • allocate two 240MB buffers in the application
  • use the IO completion ports API, ReadFile/WriteFile, and switch between the two buffers
  • since the hard disk I have is capable of reads and writes up to 120MB/s, I’d just save the first 25MB or so from the stream buffer on each WriteFile until I get a better disk (rough sketch of the loop just below)
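
Roughly, I imagine the completion-port loop like this (the device name, and the assumption that my driver exposes the stream as plain overlapped reads, are made up for the sketch; error checks omitted):

#include <windows.h>

#define STREAM_BUF_SIZE (240u * 1024u * 1024u)   /* one 240MB block */

int main(void)
{
    /* hypothetical device name my driver would expose */
    HANDLE dev = CreateFileA("\\\\.\\PcieStream", GENERIC_READ, 0, NULL,
                             OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    HANDLE iocp = CreateIoCompletionPort(dev, NULL, 0, 0);

    void *buf[2];
    OVERLAPPED ov[2] = { 0 };
    for (int i = 0; i < 2; i++) {
        buf[i] = VirtualAlloc(NULL, STREAM_BUF_SIZE,
                              MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
        /* queue both reads up front; completions arrive 3 times per second */
        ReadFile(dev, buf[i], STREAM_BUF_SIZE, NULL, &ov[i]);
    }

    for (;;) {
        DWORD bytes;
        ULONG_PTR key;
        OVERLAPPED *done;
        if (!GetQueuedCompletionStatus(iocp, &bytes, &key, &done, INFINITE))
            break;
        int i = (done == &ov[1]);   /* which buffer just filled */
        /* ...WriteFile the first ~25MB of buf[i] to disk here... */
        ReadFile(dev, buf[i], STREAM_BUF_SIZE, NULL, &ov[i]);   /* re-arm */
    }
    return 0;
}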

Initially I thought I’d issue many small read requests (32KB or so) to the driver, but then it would always be creating S/G lists, so perhaps it’s better to have the application pass the two pointers before starting the stream; the driver then locks the memory, creates two S/G lists and sends them to the device. That way the S/G lists are transferred only once, and the memory is unlocked after the stream stops. Then I found the SetFileIoOverlappedRange function, which does the memory locking, so if I got it right, I don’t have to call MmGetSystemAddressForMdlSafe inside my driver?

It’s kind of a messy concept in my head, so I hope someone can clear this up for me.

Thanks.

xxxxx@gmail.com wrote:

I’m trying to make a driver and a user-land application for streaming data to or from a PCI Express card (but never in both directions at the same time). I’ll focus on the data flow in the read case:
Camera -> DMA1 -> board RAM -> DMA2 -> PCI Express bus -> PC RAM -> PC hard disk
There are multiple cameras and the maximum amount of data they would generate is 240MB, three times per second (720MB/s in total).

What kind of cameras are these?  Do they really have the ability to
transfer into board RAM at 720MB/s?

So, before investing time, I wanted to ask for suggestions on how to design my software on the PC side for this kind of streaming.

Do you plan to write an AVStream driver, so that any video capture
application could use it?  Which PCIe generation is your board, and how
many lanes?

DMA on the PCIe card supports scatter/gather mode and, although it supports 64-bit addresses, that part is not so great because I can’t really use a 64-bit address directly. I have to split it, write the upper 32 bits into a register, and then provide the lower 32 bits on the bus. So S/G DMA can work properly as long as the upper 32 bits are constant; otherwise I’d have to stop the DMA and set that register again, which defeats the purpose of S/G, I guess.

Yes, it certainly does.  That hardly qualifies as 64-bit scatter/gather
support.  Why do the hardware engineers do this?  It isn’t any harder to
create 64-bit registers in Verilog than it is to create 32-bit registers.

Anyway, I had this idea:

  • allocate two 240MB buffers in the application
  • use the IO completion ports API, ReadFile/WriteFile, and switch between the two buffers
  • since the hard disk I have is capable of reads and writes up to 120MB/s, I’d just save the first 25MB or so from the stream buffer on each WriteFile until I get a better disk

Initially I thought I’d issue many small read requests (32KB or so) to the driver…

No, that’s a horrible idea.  At 720MB/s, your 32kB buffer lasts about 40µs.
You certainly do not want to have user/kernel/user transitions every 40µs.

but then it would always be creating S/G lists, so perhaps it’s better to have the application pass the two pointers before starting the stream; the driver then locks the memory, creates two S/G lists and sends them to the device. That way the S/G lists are transferred only once, and the memory is unlocked after the stream stops.

You could just submit two overlapped ReadFile requests for 240MB each. 
The driver locks them, generates an S/G list and submits it.  When the
frame completes, you unlock and complete the request.  The cost of
creating an S/G list is not high, especially since you only do it 3
times a second.
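
In a WDM driver with DO_DIRECT_IO, the read dispatch could look roughly like this (names are placeholders; the I/O manager has already probed and locked the buffer and built the MDL for you):

NTSTATUS MyReadDispatch(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    PMY_DEVICE_EXT ext = (PMY_DEVICE_EXT)DeviceObject->DeviceExtension;
    PIO_STACK_LOCATION sp = IoGetCurrentIrpStackLocation(Irp);
    KIRQL oldIrql;
    NTSTATUS status;

    IoMarkIrpPending(Irp);

    /* GetScatterGatherList must be called at DISPATCH_LEVEL */
    KeRaiseIrql(DISPATCH_LEVEL, &oldIrql);
    status = ext->DmaAdapter->DmaOperations->GetScatterGatherList(
        ext->DmaAdapter,
        DeviceObject,
        Irp->MdlAddress,
        MmGetMdlVirtualAddress(Irp->MdlAddress),
        sp->Parameters.Read.Length,
        MyAdapterListControl,   /* callback: program the board, start DMA */
        Irp,                    /* context handed to the callback */
        FALSE);                 /* WriteToDevice = FALSE: device -> memory */
    KeLowerIrql(oldIrql);

    if (!NT_SUCCESS(status)) {
        Irp->IoStatus.Status = status;
        IoCompleteRequest(Irp, IO_NO_INCREMENT);
    }
    return STATUS_PENDING;
}

When the board signals frame-done, the interrupt/DPC path calls PutScatterGatherList and completes the IRP.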

However, I don’t see how you can do this with your hardware’s addressing
bug.  If the application allocates the buffer, you have no control over
the physical addresses, so you can’t guarantee they’re all in the same
4GB block.  I suspect you’re going to have to allocate a common buffer
inside your driver to receive the DMA, and copy the frames to the
user-mode request when the frame is complete.
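
A minimal sketch of that, assuming you obtained a DMA adapter from IoGetDmaAdapter during start-device (note that a single contiguous 240MB common buffer may simply fail to allocate, so you may need several smaller chunks):

PHYSICAL_ADDRESS logical;
PVOID va = ext->DmaAdapter->DmaOperations->AllocateCommonBuffer(
    ext->DmaAdapter,
    240 * 1024 * 1024,   /* this large a contiguous block may well fail... */
    &logical,            /* ...so consider N smaller blocks in the S/G chain */
    TRUE);               /* CacheEnabled: PCIe DMA on x86/x64 is cache-coherent */
if (va == NULL) {
    /* fall back to smaller blocks, or fail start-device */
}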

Then I found the SetFileIoOverlappedRange function, which does the memory locking…

No, you’re reading too much into that.  SetFileIoOverlappedRange is used
to declare a region of memory that holds OVERLAPPED structures for doing
asynchronous I/O.  It is useless except in very, very special situations.

so if I got it right, I don’t have to call MmGetSystemAddressForMdlSafe inside my driver?

If you get an MDL, you ALWAYS have to call MmGetSystemAddressForMdlSafe,
because you need a kernel-mode address for the buffer. 
Fortunately, if the buffer is locked, that call is about 3 instructions,
because the kernel address is already present in the MDL.
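
That is:

/* cheap when the MDL's pages are locked and already mapped */
PVOID sysVa = MmGetSystemAddressForMdlSafe(Irp->MdlAddress, NormalPagePriority);
if (sysVa == NULL) {
    /* system is out of PTEs: fail with STATUS_INSUFFICIENT_RESOURCES */
}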


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Those are OV10640 image sensors; there are 9 of them and each streams 30 frames per second. The size of a single frame is 1280*1080 * 2 bytes per pixel = 2700KB. The camera signals are fed into a Xilinx Zynq-7 and the video frames are stored in RAM using VDMA IPs. I set each VDMA to interrupt the ARM CPU after 10 transferred frames. Then, in a FreeRTOS task, after all VDMAs have transferred 10 frames, I start transferring 10*size_of_frame * 9 cameras ~= 240MB over PCIe with the CDMA IP. During the PCIe transfer, all VDMAs stream into another block of 240MB in the board RAM, and the process repeats… So I’m not transferring frame by frame over PCIe but 10 frames for all cameras, 3 times per second. That’s just what I came up with because it seemed more practical than having interrupts and so on for each frame. I’m open to suggestions if this seems like a bad design.
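
In rough pseudo-C, the FreeRTOS side of that design looks like this (the ISR hookup, the retarget/kick-off helpers and the semaphore setup are placeholders, not the actual Xilinx driver calls):

#include "FreeRTOS.h"
#include "semphr.h"

#define NUM_CAMERAS 9

void retarget_vdmas(int bank);        /* placeholder */
void start_cdma_transfer(int bank);   /* placeholder */

static SemaphoreHandle_t vdma_done;  /* counting semaphore created at init   */
static int bank;                     /* which 240MB half the VDMAs fill next */

/* registered as each VDMA's "10 frames done" interrupt handler */
void vdma_irq_handler(void *arg)
{
    BaseType_t woken = pdFALSE;
    xSemaphoreGiveFromISR(vdma_done, &woken);
    portYIELD_FROM_ISR(woken);
}

void stream_task(void *arg)
{
    for (;;) {
        /* wait until all 9 VDMAs have deposited their 10 frames */
        for (int i = 0; i < NUM_CAMERAS; i++)
            xSemaphoreTake(vdma_done, portMAX_DELAY);

        bank ^= 1;
        retarget_vdmas(bank);        /* VDMAs now fill the other half */
        start_cdma_transfer(!bank);  /* push the finished 240MB over PCIe */
    }
}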
Board supports PCIe generation 2.0 with 4 lanes.
The reason behind the mess with 64-bit addresses is actually the AXI Memory Mapped to PCI Express IP, not the CDMA itself. However, I think I found a solution: the only way to use 64-bit addresses with S/G is to double the regular number of descriptors. For each PC memory page (one regular S/G transfer), the first descriptor would transfer 4 bytes into the PCIe IP register I mentioned earlier, and the second descriptor would then transfer the actual 4K of data.
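
Building the chain would then look something like this (the descriptor layout, register names and addresses are all made up for illustration; the real CDMA descriptor and AXI-PCIe register map come from the Xilinx docs):

#include <stdint.h>
#include <stddef.h>

#define PCIE_BRIDGE_BASE      0x50000000u  /* made-up AXI base of the PCIe IP  */
#define AXIBAR2PCIEBAR_UPPER  0x208u       /* made-up upper-bits register      */
#define AXIBAR_WINDOW         0x60000000u  /* made-up AXI window onto PCIe     */

/* simplified 16-byte descriptor, NOT the real CDMA layout */
typedef struct {
    uint32_t next;   /* board address of the next descriptor */
    uint32_t src;    /* AXI address to read from             */
    uint32_t dst;    /* AXI address to write to              */
    uint32_t len;    /* bytes to move                        */
} desc_t;

/* pc_page[i] = 64-bit PCIe address of the i-th 4KB page of the host buffer;
   hi_words[] lives in board RAM at AXI address hi_base, d[] at d_base */
void build_chain(desc_t *d, uint32_t d_base, uint32_t *hi_words,
                 uint32_t hi_base, const uint64_t *pc_page,
                 size_t npages, uint32_t src_base)
{
    for (uint32_t i = 0; i < npages; i++) {
        hi_words[i] = (uint32_t)(pc_page[i] >> 32);

        /* descriptor 2i: poke the upper 32 bits into the bridge register */
        d[2*i].src  = hi_base + 4 * i;
        d[2*i].dst  = PCIE_BRIDGE_BASE + AXIBAR2PCIEBAR_UPPER;
        d[2*i].len  = 4;
        d[2*i].next = d_base + sizeof(desc_t) * (2*i + 1);

        /* descriptor 2i+1: move the actual 4KB page through the window
           (real code masks the offset to the AXIBAR window size) */
        d[2*i+1].src  = src_base + 4096 * i;
        d[2*i+1].dst  = AXIBAR_WINDOW + (uint32_t)pc_page[i];
        d[2*i+1].len  = 4096;
        d[2*i+1].next = d_base + sizeof(desc_t) * (2*i + 2);
    }
    d[2*npages - 1].next = 0;   /* or set the IP's end-of-chain marker */
}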
I don’t plan to write an AVStream driver since I only want to store the video on disk for later processing.

xxxxx@gmail.com wrote:

Those are OV10640 image sensors; there are 9 of them and each streams 30 frames per second. The size of a single frame is 1280*1080 * 2 bytes per pixel = 2700KB. The camera signals are fed into a Xilinx Zynq-7 and the video frames are stored in RAM using VDMA IPs. I set each VDMA to interrupt the ARM CPU after 10 transferred frames. Then, in a FreeRTOS task, after all VDMAs have transferred 10 frames, I start transferring 10*size_of_frame * 9 cameras ~= 240MB over PCIe with the CDMA IP. During the PCIe transfer, all VDMAs stream into another block of 240MB in the board RAM, and the process repeats… So I’m not transferring frame by frame over PCIe but 10 frames for all cameras, 3 times per second. That’s just what I came up with because it seemed more practical than having interrupts and so on for each frame. I’m open to suggestions if this seems like a bad design.

I don’t think there’s anything wrong with the design.  Cameras typically
interrupt once per frame, because that is the natural division.  Your
plan is going to require at least 500MB of onboard RAM.  You don’t have
any guarantee that the 9 cameras are going to stay in sync, so you’ll
have to make sure you have time after the last camera finishes to send
the frame before the first camera bounces back to the first buffer.  If
you transferred after every frame, you’d only need about 50MB.

Board supports PCIe generation 2.0 with 4 lanes.
The reason behind the mess with 64-bit addresses is actually the AXI Memory Mapped to PCI Express IP, not the CDMA itself. However, I think I found a solution: the only way to use 64-bit addresses with S/G is to double the regular number of descriptors. For each PC memory page (one regular S/G transfer), the first descriptor would transfer 4 bytes into the PCIe IP register I mentioned earlier, and the second descriptor would then transfer the actual 4K of data.

You’d have to think about the timing of that.  Remember that PCIe
exchanges are not instantaneous.  Do you plan to pull the entire set of
descriptors into board RAM?  If so, then you may have time for that.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

I’m aware of the 500MB requirement, but the board has 1GB of RAM so that’s not a problem.

And yes, that’s exactly why I wrote earlier that I wanted to lock the buffers before starting the stream, send all descriptors to the board, and unlock the buffers once the stream is stopped. Otherwise I’d have to copy roughly 2MB of descriptors per second, which seems inefficient to me.

I started writing code, and this is how I get the S/G list: allocate a buffer in user space with VirtualAlloc, pass it to my driver via DeviceIoControl (METHOD_NEITHER), then ProbeForRead/Write, IoAllocateMdl, MmProbeAndLockPages and finally GetScatterGatherList.
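
The relevant fragment of the IOCTL handler currently looks roughly like this (trimmed; "ext" is my device context):

/* IRP_MJ_DEVICE_CONTROL, METHOD_NEITHER: this runs in the caller's
   process context, which is required for probing and locking its buffer */
PIO_STACK_LOCATION sp = IoGetCurrentIrpStackLocation(Irp);
PVOID userBuf = sp->Parameters.DeviceIoControl.Type3InputBuffer;
ULONG length  = sp->Parameters.DeviceIoControl.InputBufferLength;
PMDL  mdl     = NULL;

__try {
    ProbeForWrite(userBuf, length, sizeof(UCHAR));   /* device writes here */
    mdl = IoAllocateMdl(userBuf, length, FALSE, FALSE, NULL);
    if (mdl == NULL)
        return STATUS_INSUFFICIENT_RESOURCES;
    MmProbeAndLockPages(mdl, UserMode, IoWriteAccess);
} __except (EXCEPTION_EXECUTE_HANDLER) {
    if (mdl != NULL)
        IoFreeMdl(mdl);
    return GetExceptionCode();
}

ext->StreamMdl = mdl;   /* remembered so it can be unlocked at cleanup */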
I’m still trying to figure out how to get some kind of notification when the application exits because, if someone forgets to call the cleanup function before exiting, they’ll get a PROCESS_HAS_LOCKED_PAGES BSoD.

Thanks for your answers, Tim. I really appreciate it, since no one around me has ever written a single driver for Windows and I still have much to learn, so it’s nice to talk to someone experienced.

xxxxx@gmail.com wrote:

And yes, that’s exactly why I wrote earlier that I wanted to lock the buffers before starting the stream, send all descriptors to the board, and unlock the buffers once the stream is stopped. Otherwise I’d have to copy roughly 2MB of descriptors per second, which seems inefficient to me.

That’s an increase in bus traffic of 0.3%.  Hardly worth worrying about,
but fetching them all first should also be OK.

 

I started writing code, and this is how I get the S/G list: allocate a buffer in user space with VirtualAlloc, pass it to my driver via DeviceIoControl (METHOD_NEITHER), then ProbeForRead/Write, IoAllocateMdl, MmProbeAndLockPages and finally GetScatterGatherList.
I’m still trying to figure out how to get some kind of notification when the application exits because, if someone forgets to call the cleanup function before exiting, they’ll get a PROCESS_HAS_LOCKED_PAGES BSoD.

Assuming you are using KMDF, you need to catch the EvtFileClose
callback.  To get the file object callbacks, you need to call
WdfDeviceInitSetFileObjectConfig during initialization.  If you’re
actually using WDM, you need IRP_MJ_CLOSE.
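
Something like this, then (a sketch; StreamMdl is whatever context field you stashed the MDL in):

NTSTATUS MyClose(PDEVICE_OBJECT DeviceObject, PIRP Irp)
{
    PMY_DEVICE_EXT ext = (PMY_DEVICE_EXT)DeviceObject->DeviceExtension;

    if (ext->StreamMdl != NULL) {
        /* make sure the board's DMA is stopped before releasing the pages */
        MmUnlockPages(ext->StreamMdl);
        IoFreeMdl(ext->StreamMdl);
        ext->StreamMdl = NULL;
    }

    Irp->IoStatus.Status = STATUS_SUCCESS;
    Irp->IoStatus.Information = 0;
    IoCompleteRequest(Irp, IO_NO_INCREMENT);
    return STATUS_SUCCESS;
}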


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

I am actually using WDM. And here I was, thinking it had to be more complicated than that. Ha!

Thanks again, Tim!