WdfDmaProfilePacket questions

Hi,

I’m working on adding DMA support to a driver for a custom PCI Express data acquisition board. After reading the “DMA support in KMDF drivers” whitepaper and talking with the hardware engineer, I am trying to make a go of it using WdfDmaProfilePacket64, as the board doesn’t really have hardware S-G support. The board does have multiple DMA channels, however: one channel for each device on the board. So, for example, if the board has 10 A/D converters, there will be 10 independent DMA channels, one per A/D.

The goal is to be able to use all devices on the board simultaneously if requested. I’ve written my driver so far to create a separate DMA transaction object for each channel on the board. I do this in the EvtDevicePrepareHardware callback. This seems to work and I can successfully DMA to/from any 1 device. When I try to execute a second DMA transaction on another channel (while the first is still active on a different channel), WdfDmaTransactionExecute fails with STATUS_WDF_BUSY. The KMDF documentation pretty clearly points out that this will result from using a packet profile and attempting to execute a Transaction while another is still executing.

Is it possible to use multiple transaction objects when you’ve selected a packet profile?

The answer seems obvious based on the Microsoft docs, but some of the advice given in the thread below seemed to indicate otherwise:

http://www.osronline.com/showthread.cfm?link=166607

Thanks in advance.

If you have 10 independent DMA controllers, you should look at creating 10 separate DMA adapters.

-p

-----Original Message-----
From: xxxxx@lists.osr.com [mailto:xxxxx@lists.osr.com] On Behalf Of xxxxx@gmail.com
Sent: Monday, January 19, 2015 2:53 PM
To: Windows System Software Devs Interest List
Subject: [ntdev] WdfDmaProfilePacket questions




Thanks Peter.

I’m currently creating a DMA enabler object for each independent channel on the board. I then create a DMA transaction object for each enabler. When I look at the enablers with the debugger, I see that each enabler corresponds to a different WDM AdapterObject.

A couple of things do look a little weird:

  1. !wdfdmaenabler claims my profile is WdfDmaProfileSystem.
  2. !dma does not work at all, either on its own or with the address given in the output of !wdfdmaenabler. I get an “Error retrieving adapter list offset” message.

Could those both just be issues with my environment? I’m looking into this, but I’m still not sure why I’m getting the STATUS_WDF_BUSY error, since it appears each of my enablers is associated with a different adapter.

We designed the DMA engine in-house and have the ability to change it, if needed. Is there a guideline or documentation for hardware designers on how to best design the hardware to support DMA in Windows? I have heard of such a document on this forum but have been unable to find it anywhere.

On 21-Jan-2015 16:03, compeng03@ wrote:

Is there a guideline or documentation for hardware designers on how to best design the hardware to support DMA in Windows?

Have you looked at the Hardware Certification Requirements, in sections
relevant for your device type?
https://msdn.microsoft.com/en-us/library/windows/hardware/dn423132

– pa

Pavel,

Thank you for the link. I completely forgot about the hardware certification requirements. After going through anything there that seems applicable to our boards, I still can’t find mention of just some fundamental DMA design guidelines targeting Windows.

What is the “best of breed” for a DMA engine design that works well with Windows?

> What is the “best of breed” for a DMA engine design that works well with Windows?

An optimal DMA engine design depends on the type of device. For example, a
NIC device had better be prepared to handle small fragments on arbitrary
byte alignment. A storage device can get away with 4 or perhaps even 8
byte alignment. There are apps that might try to do unbuffered transfers
to 4-byte-aligned buffers, ignoring what the reported alignment requirement
is. Apps sometimes are designed based on observing what current hardware
does, not based on how they are supposed to be designed, and many current
storage controllers require no more than 4 byte alignment.

Maximum transfer size is very device dependent; a NIC transferring an
Ethernet packet is constrained by the maximum jumbo packet size, although
you might have a fragment per header layer (ideally not, but packet
filters do what they do), and you also have to handle huge offloaded TCP
segmented packets (where the hardware does the final packetization).
Storage devices can be asked to make multi-megabyte transfers.

Another area is memory mapping, which can get complex. An SR-IOV device
has to translate guest logical addresses into actual hardware physical
addresses, which also might be translated by IOMMUs.

There also are devices that have unusual granularity of protection, for
example an Infiniband HCA typically supports mapping the command ring and
status ring into user mode apps, on a per connection basis, and they
support I/O requests that contain pre-mapped memory regions. This is
arranged such that an app can map buffers and do I/O without making
user<->kernel transitions, allowing microsecond node-to-node message
latency. The hardware register layout and DMA capabilities are designed
such that the driver can expose a safe interface to an app, mapped into
user mode address space.

There also is a question of how you implement interrupts, as MSI(X)
interrupts are essentially DMA memory writes. I know there have been
proprietary (patented) ways of DMAing coalesced interrupt status for
things like SR-IOV NIC cards.

I think “hardware designs that work well on Windows” are part of the
intellectual property many companies view as proprietary. I also am a
believer that Microsoft should publish some minimum requirements for
different device types, so new companies can create new devices that at
least work without initial drivers having performance robbing things like
buffer copies. The reality may be that to create a competitive product, you have to
deeply understand what competitive products do.

It can be really educational to run a competing product with a PCI(e) bus
analyzer connected.

I don’t think you will find a Microsoft document titled “How to design
hardware that instantly works as well as hardware from long time IHVs”.

xxxxx@gmail.com wrote:

What is the “best of breed” for a DMA engine design that works well with Windows?

All we can tell you is the things that have caused us pain in the past.

Make sure it supports 64-bit physical addresses.

Make sure it doesn’t have wild alignment requirements. If you have
scatter/gather support and you plan to use buffers from user mode, you
are going to get every conceivable buffer alignment. If your hardware
requires 32-byte alignment on all transfers, you can’t really use
scatter/gather. You’ll have to copy in and out.
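
The alignment point can be sketched as a per-fragment check. This is only an illustration: the 32-byte figure is the example from the paragraph above, and the names `HW_DMA_ALIGNMENT`, `dma_is_aligned`, and `dma_choose_path` are hypothetical, not part of KMDF or any real driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hardware constraint, taken from the 32-byte example above. */
#define HW_DMA_ALIGNMENT 32u

typedef enum { DMA_DIRECT, DMA_BOUNCE_COPY } dma_path;

/* True when both the start address and the length are multiples of
 * `alignment`. `alignment` must be a nonzero power of two. */
bool dma_is_aligned(uint64_t address, size_t length, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    uint64_t mask = (uint64_t)alignment - 1;
    return (address & mask) == 0 && ((uint64_t)length & mask) == 0;
}

/* Decide whether a fragment can go straight to the hardware or must be
 * staged through an aligned, driver-owned buffer (the "copy in and out"
 * case). */
dma_path dma_choose_path(uint64_t address, size_t length)
{
    return dma_is_aligned(address, length, HW_DMA_ALIGNMENT)
               ? DMA_DIRECT
               : DMA_BOUNCE_COPY;
}
```

With user-mode buffers, most fragments will fail a strict check like this, which is why a wide alignment requirement tends to force the copy path.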

You’ll need a completion interrupt, of course. If your hardware has a
queue of requests or a chain of buffers, make sure you have a way to ask
how far it got. Nothing is more frustrating than writing an ISR and
realizing there is no way to know which buffers are done and which are
yet to come. “What am I supposed to do, guess?” Remind your hardware
people that there is NO GUARANTEE that every interrupt results in exactly
one firing of the ISR.

There are a lot of ways to implement the “how far” thing. There can be
a register that has a processed byte count, or a register that has the
address of the last completed buffer, or a queue of completed requests,
etc. All of those are good strategies.
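
As a rough sketch of the “how far” idea, assume the hardware exposes a free-running 32-bit count of completed buffers (that register, and the names `dma_progress` and `dma_progress_update`, are assumptions for illustration only). The driver then works out how many buffers finished since it last looked, instead of assuming one completion per ISR invocation.

```c
#include <assert.h>
#include <stdint.h>

/* Driver-side bookkeeping for one DMA channel (hypothetical layout). */
typedef struct {
    uint32_t last_seen; /* counter value observed on the previous call */
} dma_progress;

/* Returns how many buffers completed since the previous call.
 * `hw_completed_count` is the value read from the assumed free-running
 * hardware counter. Unsigned subtraction keeps the result correct when
 * the counter wraps past 0xFFFFFFFF. */
uint32_t dma_progress_update(dma_progress *p, uint32_t hw_completed_count)
{
    uint32_t newly_done = hw_completed_count - p->last_seen;
    p->last_seen = hw_completed_count;
    return newly_done;
}
```

A return value of 3 covers the case where several interrupts were coalesced into one ISR run, and a return value of 0 covers an ISR run with nothing new to process.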

If your hardware accepts a list of buffers, it’s nice to have the
flexibility to choose whether you get an interrupt after every buffer,
or only an interrupt now and then. My favorite implementations have a
bit in each scatter/gather descriptor that says whether it should
interrupt after this transfer.
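
A minimal sketch of the per-descriptor interrupt bit, assuming a made-up descriptor layout. The struct fields, the flag values, and `dma_set_interrupt_flags` are hypothetical and do not describe any particular device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DESC_FLAG_INTERRUPT 0x1u /* hypothetical: interrupt when this one completes */
#define DESC_FLAG_LAST      0x2u /* hypothetical: end of the chain */

typedef struct {
    uint64_t address; /* physical address of the buffer */
    uint32_t length;  /* bytes to transfer */
    uint32_t flags;   /* DESC_FLAG_* bits */
} dma_descriptor;

/* Request an interrupt after every `interval`-th descriptor, and always
 * after the final one so the end of the chain is never silent. */
void dma_set_interrupt_flags(dma_descriptor *descs, size_t count, size_t interval)
{
    assert(interval > 0);
    for (size_t i = 0; i < count; i++) {
        descs[i].flags = 0;
        if ((i + 1) % interval == 0) {
            descs[i].flags |= DESC_FLAG_INTERRUPT;
        }
    }
    if (count > 0) {
        descs[count - 1].flags |= DESC_FLAG_INTERRUPT | DESC_FLAG_LAST;
    }
}
```

Passing an interval of 1 gives an interrupt per buffer. A larger interval trades latency for fewer interrupts.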

Make sure your hardware is reasonable about error conditions. One of
the DMA engines I used had something like 12 different error conditions,
virtually none of which would ever really occur. If DMA stops because
of an error, I need to know that, and I need to know where to go to
restart the transfer, but after hardware experiments are done, I don’t
really want to query 12 different error sources.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.

Thanks for the replies Tim and Jan.

Jan, I was preparing for an eventual discussion where the hardware guys ask how they can better support Windows and I’d be able to show them some “official” guidelines that had more than just my opinions behind them. Your response makes a lot of sense; of course, the specific details would vary wildly depending on what your hardware needs to do.

Tim, your response makes me feel a lot better about what we do have. I think every suggestion you made is currently implemented in some form. I think we’re in pretty good shape from a status/control standpoint, but I am struggling with the basic approach to take to actually get data to/from the board.

Each of our DMA channels (a board can theoretically have many such channels) is made up of some number of buffers, typically 7. Each buffer can hold up to 16MB. So, in a way we have hardware scatter gather, but we can support a maximum of 7 elements per transfer and each element could be 16MB in size. Each buffer can independently interrupt or loop back to the first buffer.

My initial approach was to use a packet profile with a DMA enabler and transaction object on each buffer. This was simplest to implement and meshed nicely with the common library API shared with our Linux driver package. Since it’s giving me the STATUS_WDF_BUSY when attempting a second transaction, I’m wondering if it makes sense to try a scatter/gather profile instead, which would seemingly remove that restriction while also reducing the number of enablers and transactions allocated.

Would scatter gather make sense with a small number of large elements like that?

Thanks,
Matt

xxxxx@gmail.com wrote:

Each of our DMA channels (a board can theoretically have many such channels) is made up of some number of buffers, typically 7. Each buffer can hold up to 16MB. So, in a way we have hardware scatter gather, but we can support a maximum of 7 elements per transfer and each element could be 16MB in size. Each buffer can independently interrupt or loop back to the first buffer.

My initial approach was to use a packet profile with a DMA enabler and transaction object on each buffer. This was simplest to implement and meshed nicely with the common library API shared with our Linux driver package. Since it’s giving me the STATUS_WDF_BUSY when attempting a second transaction, I’m wondering if it makes sense to try a scatter/gather profile instead, which would seemingly remove that restriction while also reducing the number of enablers and transactions allocated.

Would scatter gather make sense with a small number of large elements like that?

Since you can only handle 7 scatter/gather entries, you would either
have to specify a limit of 28k bytes per transfer, or chop the transfer
up into 7 entries at a time in your driver. Can you live with that?
It’s really just a little extra bookkeeping. If not, you may have to
use a common buffer implementation, and copy the user buffers to/from
the common buffers.


Tim Roberts, xxxxx@probo.com
Providenza & Boekelheide, Inc.
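
For reference, the 28k figure and the “7 entries at a time” bookkeeping can be sketched as below. The sketch assumes 4 KB pages and that, in the worst case, every scatter/gather element of a user buffer is a single physically discontiguous page. The helper names are hypothetical. In a KMDF driver the element cap itself would normally be declared with WdfDmaEnablerSetMaximumScatterGatherElements, and this arithmetic only illustrates what chopping a longer list into passes looks like.

```c
#include <assert.h>
#include <stddef.h>

#define HW_MAX_ELEMENTS 7u    /* from the thread: 7 buffers per channel */
#define PAGE_BYTES      4096u /* assumption: 4 KB pages */

/* Largest transfer guaranteed to fit in one hardware pass when every
 * element may be as small as one page: 7 * 4096 = 28672 bytes (28k). */
size_t dma_guaranteed_max_transfer(void)
{
    return (size_t)HW_MAX_ELEMENTS * PAGE_BYTES;
}

/* Number of hardware passes needed to cover `element_count` elements. */
size_t dma_pass_count(size_t element_count)
{
    return (element_count + HW_MAX_ELEMENTS - 1) / HW_MAX_ELEMENTS;
}

/* Number of elements to program in pass `pass` (0-based), or 0 once the
 * list has been fully consumed. */
size_t dma_elements_in_pass(size_t element_count, size_t pass)
{
    size_t start = pass * HW_MAX_ELEMENTS;
    if (start >= element_count) {
        return 0;
    }
    size_t remaining = element_count - start;
    return remaining < HW_MAX_ELEMENTS ? remaining : HW_MAX_ELEMENTS;
}
```

For example, a 20-element list would take three passes of 7, 7, and 6 elements, with the completion interrupt of each pass used to program the next one.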