Infinite DMA transmitting

Hi,

My device supports only WdfDmaProfilePacket64, and it sends a continuous data stream.
The target architectures are x86 and x86_64.

I use the following scheme for DMA:

  1. Allocate a 1 MB buffer with WdfCommonBufferCreate. This is my circular buffer for the device: the device writes to it, and a user-mode application reads from it.

  2. Send the buffer’s logical address (WdfCommonBufferGetAlignedLogicalAddress) to the device.

  3. Create an MDL for the buffer and map the memory into user mode (WdfCommonBufferGetAlignedVirtualAddress -> IoAllocateMdl -> MmBuildMdlForNonPagedPool -> MmMapLockedPagesSpecifyCache); a sketch of this setup follows the list.

  4. Start the data transfer:
    a) the device writes a few packets to the buffer and generates an MSI
    b) the driver sets an event for the user-mode application in the DPC each time the device raises an MSI
    c) the application waits on the event and reads data from the buffer
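
For illustration, here is a condensed C sketch of steps 1-3. Error handling is trimmed, and DEVICE_CONTEXT, WriteRingBase, and RING_SIZE are placeholder names rather than code from the real driver:

    #include <ntddk.h>
    #include <wdf.h>

    #define RING_SIZE (1024 * 1024)          // 1 MB circular buffer

    typedef struct _DEVICE_CONTEXT {         // minimal placeholder context
        WDFDMAENABLER    DmaEnabler;
        WDFCOMMONBUFFER  Ring;
        PVOID            RingVa;             // kernel virtual address
        PHYSICAL_ADDRESS RingLa;             // logical address for the device
        PMDL             RingMdl;
        PVOID            UserVa;             // user-mode mapping
    } DEVICE_CONTEXT, *PDEVICE_CONTEXT;

    // Hypothetical helper that programs the device's ring-base register.
    VOID WriteRingBase(PDEVICE_CONTEXT DevCtx, PHYSICAL_ADDRESS La);

    NTSTATUS SetUpRing(PDEVICE_CONTEXT DevCtx)
    {
        // 1. Allocate the common buffer the device streams into.
        NTSTATUS status = WdfCommonBufferCreate(DevCtx->DmaEnabler,
                                                RING_SIZE,
                                                WDF_NO_OBJECT_ATTRIBUTES,
                                                &DevCtx->Ring);
        if (!NT_SUCCESS(status)) {
            return status;
        }
        DevCtx->RingVa = WdfCommonBufferGetAlignedVirtualAddress(DevCtx->Ring);
        DevCtx->RingLa = WdfCommonBufferGetAlignedLogicalAddress(DevCtx->Ring);

        // 2. Program the logical address into the device. The register
        //    layout is device-specific; WriteRingBase is hypothetical.
        WriteRingBase(DevCtx, DevCtx->RingLa);

        // 3. Build an MDL over the buffer and map it into the current
        //    (requesting) process. MmMapLockedPagesSpecifyCache raises an
        //    exception on failure for UserMode, so wrap it in __try.
        DevCtx->RingMdl = IoAllocateMdl(DevCtx->RingVa, RING_SIZE,
                                        FALSE, FALSE, NULL);
        if (DevCtx->RingMdl == NULL) {
            return STATUS_INSUFFICIENT_RESOURCES;
        }
        MmBuildMdlForNonPagedPool(DevCtx->RingMdl);

        __try {
            DevCtx->UserVa = MmMapLockedPagesSpecifyCache(DevCtx->RingMdl,
                                                          UserMode,
                                                          MmCached,
                                                          NULL,    // no fixed address
                                                          FALSE,   // no bugcheck
                                                          NormalPagePriority);
        } __except (EXCEPTION_EXECUTE_HANDLER) {
            return GetExceptionCode();
        }
        return STATUS_SUCCESS;
    }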

Such a transfer has been running for a few hours at a time, without any additional work with DMA transactions. The device and the application work with the one buffer, and the driver does nothing with the device except handle the MSIs.

  1. Is this scheme OK? Can I say it is good practice for transmitting streaming data?
  2. Is it OK to use a common buffer without a DMA transaction? I have no transactions; the device always writes to the same buffer. I have not found any example that uses a common buffer this way.
  3. Should I call KeFlushIoBuffers in the DPC each time before setting the event for user mode? The buffer is cached memory, but the application should always read up-to-date data when the event is set. I have read a lot about cache coherency and common buffers, but did not find an answer.

xxxxx@gmail.com wrote:

  4. Start the data transfer:
    a) the device writes a few packets to the buffer and generates an MSI
    b) the driver sets an event for the user-mode application in the DPC each time the device raises an MSI
    c) the application waits on the event and reads data from the buffer

Such a transfer has been running for a few hours at a time, without any additional work with DMA transactions. The device and the application work with the one buffer, and the driver does nothing with the device except handle the MSIs.

  1. Is this scheme OK? Can I say it is good practice for transmitting streaming data?

This is generally how circular buffer schemes work.  You will need to be
aware of the possibility that the circular buffer can wrap around before
your user-mode process wakes up.  That’s a difficult thing to detect and
handle, so you just need to be aware that the latency to wake a
user-mode sleeper is higher than the latency to wake up an ISR/DPC.
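
A sketch of one way to make a wrap detectable, assuming (hypothetically; this is not from the original design) that the device or the DPC maintains a 64-bit running count of units written in a shared header at the start of the buffer:

    // Hypothetical shared header at the start of the circular buffer.
    // The producer advances WriteCount once per slot written; it is
    // 64-bit, so it never wraps in practice.
    typedef struct _RING_HEADER {
        volatile unsigned long long WriteCount;
    } RING_HEADER;

    #define RING_SLOTS 4096   // placeholder ring capacity in slots

    // Reader side: if the producer got more than one full ring ahead of
    // the reader's own running count, data was overwritten unseen.
    int RingOverrun(const RING_HEADER *Hdr, unsigned long long ReadCount)
    {
        unsigned long long written = Hdr->WriteCount;  // one snapshot
        return (written - ReadCount) > RING_SLOTS;
    }

On overrun, the reader treats the missed slots as lost data and resynchronizes to the current write position rather than trying to recover them.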

  2. Is it OK to use a common buffer without a DMA transaction? I have no transactions; the device always writes to the same buffer. I have not found any example that uses a common buffer this way.

Absolutely.  The DMA transaction abstraction has two fundamental
purposes: to serialize access to your DMA engine, and to get you the
logical address of your buffer.  With a common buffer, you already have
the logical address.  As long as you handle serialization yourself, you
don’t need the DMA transaction.

  3. Should I call KeFlushIoBuffers in the DPC each time before setting the event for user mode? The buffer is cached memory, but the application should always read up-to-date data when the event is set. I have read a lot about cache coherency and common buffers, but did not find an answer.

Is the CPU ever going to be writing to your circular buffer?  If not,
then you don’t need KeFlushIoBuffers.  It flushes the CPU caches, so if
there’s no possibility that relevant data will be in the CPU cache, you
don’t need to flush it.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

> Is the CPU ever going to be writing to your circular buffer?

No, the CPU doesn’t write to the buffer, but what about the read cache?
I mean the case where the application reads stale data from the buffer because the CPU has cached those bytes (e.g., the application reads the same offset in the buffer twice without a thread context switch, and the device updates that offset between the first and second read). Will the application always read the latest data?
Should I use KeFlushIoBuffers to flush the read caches before setting the event?

> Should I use KeFlushIoBuffers to flush the read caches before setting the event?

I certainly would.

In the common case (x86/x64) it’ll do nothing and so therefore you’ll be fine.

If you’re running on ARM, it’ll guarantee you’re doing the right thing.

Having SAID that, I wouldn’t go nuts and insert KeFlushIoBuffers everywhere you can possibly think of. On ARM there can be a real cost to this call. So… use as needed.
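
For reference, in the DPC the call might look like this (a sketch; GetDeviceContext, RingMdl, and UserEvent are placeholders, with UserEvent assumed to be a KEVENT the driver previously captured from an app-supplied handle via ObReferenceObjectByHandle, not shown):

    // EvtInterruptDpc sketch: flush, then wake the reader.
    // ReadOperation = TRUE because the DMA moves data from the device
    // into host memory; DmaOperation = TRUE because it is DMA. On
    // x86/x64 this macro expands to nothing; on ARM it does real work.
    VOID EvtRingDpc(WDFINTERRUPT Interrupt, WDFOBJECT AssociatedObject)
    {
        PDEVICE_CONTEXT devCtx =
            GetDeviceContext(WdfInterruptGetDevice(Interrupt));

        UNREFERENCED_PARAMETER(AssociatedObject);

        KeFlushIoBuffers(devCtx->RingMdl, TRUE, TRUE);

        // Safe at DISPATCH_LEVEL because Wait is FALSE.
        KeSetEvent(devCtx->UserEvent, IO_NO_INCREMENT, FALSE);
    }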

Peter
OSR
@OSRDrivers

xxxxx@gmail.com wrote:

>> Is the CPU ever going to be writing to your circular buffer?
> No, the CPU doesn’t write to the buffer, but what about the read cache?
> I mean the case where the application reads stale data from the buffer because the CPU has cached those bytes (e.g., the application reads the same offset in the buffer twice without a thread context switch, and the device updates that offset between the first and second read). Will the application always read the latest data?
> Should I use KeFlushIoBuffers to flush the read caches before setting the event?

The documentation for KeFlushIoBuffers seems to say that it is not
necessary.  On x86 and x64, it’s certainly not; those DMA writes are
cache-coherent.  However, Peter’s natural caution is probably why he’s
still successful…


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

On ARM64, PCIe bus masters will generally be cache coherent too, although I believe there are hardware options for devices that are not cache coherent and require manual cache flushing. I think this is controlled by the IOMMU.

I seem to remember something about mapped GPU memory not being guaranteed coherent: if you have a device with cacheable BARs, the memory on the device visible through that BAR will not cause CPU snoop requests. I think this is true on Intel processors, as every time the GPU writes to GPU memory, it does not cause a cache snoop to the processor(s), so the processor is responsible for cache flushing. Perhaps a GPU driver developer here could clarify. It seems like there are other devices with cached BARs that might also be in this category, such as NV storage devices that map their memory instead of using I/O requests. For peer-to-peer bus-master transfers between, say, an RDMA NIC and a non-coherent mapped BAR on NV storage, it sounds like care is needed.

If you follow the Windows packet DMA model, optimal flushing (or none at all) should be handled transparently by the DMA adapter object. Windows common buffer memory, I believe, is defined to be cache coherent, so the question may only apply if you allocate some random block of memory and treat it as common memory without using the common-buffer APIs.

Jan


if you follow the Windows packet DMA model, optimal flushing or not should
be handled transparently by the DMA adapter object

ONLY if you use the “new” DMA Adapter Structure V3. Otherwise… not so much.
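
For anyone following along, a driver opts into the V3 adapter at the WDM level by asking for it in the device description (a sketch; the settings shown are assumptions for a generic PCI bus master, and Pdo is a placeholder for the physical device object):

    // Request a version-3 DMA adapter, whose DMA_OPERATIONS include the
    // extended routines (e.g., FlushAdapterBuffersEx) that let the HAL
    // apply, or skip, cache flushing as the platform actually requires.
    DEVICE_DESCRIPTION desc;
    ULONG mapRegisters = 0;

    RtlZeroMemory(&desc, sizeof(desc));
    desc.Version           = DEVICE_DESCRIPTION_VERSION3;
    desc.Master            = TRUE;
    desc.ScatterGather     = TRUE;
    desc.Dma64BitAddresses = TRUE;
    desc.InterfaceType     = PCIBus;
    desc.MaximumLength     = 1024 * 1024;

    PDMA_ADAPTER adapter = IoGetDmaAdapter(Pdo, &desc, &mapRegisters);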

On ARM64, PCIe bus masters will generally be cache coherent too

That would make sense, thanks. Not a platform I’ve personally thought much about.

When I mentioned ARM, I was thinking 32-bit, which does this:

https:

Windows common buffer memory, I believe, is defined to be cache coherent

If you mean “Common Buffer” memory… I was going to write “it depends and it’s changed over time” … but, honestly, I can’t find a case where what you’ve written is not true (and that is not what I expected).

And on ARM, I just learned, it depends on the device, per the docs for AllocateCommonBuffer.

Recall that the CacheEnabled parameter to AllocateCommonBuffer has *always* been ignored. Always.

And the docs for AllocateCommonBufferEx tie this to the device’s _CCA attribute.

I had never heard of the _CCA attribute… So I learned something today. Thanks!

Peter
OSR
@OSRDrivers