Driver Basics - DMA Concepts
The NT Insider, Volume 16, Issue 3, Sept-Oct 2009 | Published: 29-Sep-09 | Modified: 29-Sep-09
Sixteen years after the introduction of Windows NT, we continue to encounter numerous fundamental errors in drivers for Direct Memory Access (DMA) devices. Most of these errors appear to be the result of two problems:
Windows' use of unique, and somewhat outdated, terminology in describing DMA device designs;
Driver developers' lack of understanding about the Windows DMA abstraction, what advantages it provides, and how to manipulate the Windows DMA system to get it to do what they need.
The WDK documentation doesn't include a lot of background about DMA. So where would a good developer learn this stuff? Several years ago, we here at OSR wrote a white paper for WHDC that described the fundamentals of Windows DMA and provided code samples for both WDM and KMDF drivers. It was a darn good paper too. What survives from this paper after random editing changes and the removal of all mentions of WDM can be found in the paper "DMA Support In KMDF Drivers." It's still not a bad paper, and if you're interested in DMA (in either WDM or KMDF) we recommend you read it just for the concepts it describes.
To help you even further, this article will discuss the key concepts involved in Windows DMA, as part of our continuing series on the basics of Windows driver development. We'll discuss the architectural concepts behind Direct Memory Access (DMA), starting with some basic definitions and general information about DMA device architectures. The article will then discuss the Windows DMA architecture and the implementation of that architecture.
As you read this article, keep in mind that the I/O Subsystem, HAL, and bus drivers (as well as the WDF framework for KMDF drivers) work together to create the overall environment in which drivers for DMA devices exist in Windows. Even though there's not a single, specific, Windows component responsible for implementing DMA in Windows, we'll refer to the responsible components collectively as the DMA subsystem.
Definitions and Device Basics
So, what is DMA? You probably know the answer, but we have to start somewhere, right? DMA is the ability of a device to read and/or write main memory without using the host CPU. Even more specifically, what we're interested in for this article is Bus Master DMA. Bus Master DMA is the ability of a device to perform DMA operations autonomously, without the need for any external DMA control logic in the system. You might also hear Bus Master DMA referred to as "first party DMA."
In order to perform autonomous transfers between itself and main memory, a DMA device needs to be able to reference the main memory address of the requestor's data buffer that's to be used for the transfer. And, of course, even though the requestor's data buffer is virtually contiguous, it is not necessarily contiguous in main memory. The virtually contiguous requestor's buffer might exist in unique pages, or clusters of pages, scattered throughout main memory. And DMA transfers are always performed according to physical memory layout, not virtual memory organization.
Simple, traditional, DMA devices deal with the problem of a physically fragmented data buffer by performing one transfer for each physical buffer fragment. In such devices, the driver is responsible for determining (by asking the DMA subsystem in Windows) the base and length of each physical fragment and programming the device to perform a separate DMA transfer for each of those fragments. This can result in lower device utilization and more transfer overhead than desired.
Drivers for more complex (and most modern) DMA devices also are responsible for determining the base and length of each buffer fragment. However, drivers for these devices are able to program the device with the entire list of buffer fragments before starting the DMA operation. The device is then responsible for chaining the transfers together to automatically complete the requested DMA transaction. The driver just needs to program the device once, and the device does the work of transferring all the buffer fragments. Devices with this ability are referred to as supporting "Scatter/Gather" or "DMA chaining."
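To make the idea of "buffer fragments" concrete, here is a minimal user-mode sketch of how a virtually contiguous buffer breaks down into a list of base/length pairs. It is an illustration only, not how Windows builds its lists: the `fragment_t` type, the `build_fragments` helper, and the page addresses are all hypothetical, and we assume 4KB pages. In a real driver the fragment list comes from the DMA subsystem.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u   /* assumed page size for this sketch */

/* One contiguous piece of the requestor's buffer (hypothetical type). */
typedef struct {
    uint64_t base;    /* address the device would be programmed with */
    uint32_t length;  /* number of bytes in this fragment */
} fragment_t;

/*
 * Hypothetical helper: walk a virtually contiguous buffer page by page and
 * emit one fragment per run of adjacent pages.
 *   page_addr[i]  page-aligned address backing the i-th page of the buffer
 *   offset        byte offset of the buffer start within its first page
 *   length        total transfer length in bytes
 * The caller must supply enough pages to cover offset + length.
 * Returns the number of fragments written, or 0 if max_frags is too small.
 */
size_t build_fragments(const uint64_t *page_addr, size_t page_count,
                       uint32_t offset, uint32_t length,
                       fragment_t *frags, size_t max_frags)
{
    size_t n = 0;
    size_t i;

    assert(offset < PAGE_SIZE);
    for (i = 0; i < page_count && length > 0; i++) {
        uint32_t in_page = PAGE_SIZE - offset;
        uint32_t chunk = length < in_page ? length : in_page;
        uint64_t addr = page_addr[i] + offset;

        if (n > 0 && frags[n - 1].base + frags[n - 1].length == addr) {
            /* This page directly follows the previous fragment: extend it. */
            frags[n - 1].length += chunk;
        } else {
            if (n == max_frags)
                return 0;
            frags[n].base = addr;
            frags[n].length = chunk;
            n++;
        }
        length -= chunk;
        offset = 0;   /* only the first page starts mid-page */
    }
    return n;
}
```

A simple device would be programmed once per entry in `frags`, while a scatter/gather device would be handed the whole array before the transfer starts.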
Windows characterizes DMA device and driver architectures as being either "packet-based" or "common buffer." When a driver programs a device with the base and length of each buffer fragment in response to a transfer request from a user, and instructs the device to begin the transfer, that driver is implementing a "packet-based" DMA scheme.
Contrast this with the approach in which the driver and the device share a (physically contiguous) buffer in host memory, referred to as a "common buffer." The layout and contents of this buffer are understood by both the driver and the firmware running in the device. In a pure common buffer approach, the driver copies write data to the shared buffer area, typically to a series of ring buffers that also contain control information. To process this write request, the device transfers the data from that common buffer (and not from the original requestor's buffer) to the device. Reads are handled similarly, with the driver being responsible for moving the data received into the common buffer to the requestor's data buffer.
One of the hallmarks of a common buffer design is that, after the shared data area has been established in main memory and the device has been apprised of its location, the device performs transfers to/from this buffer as needed, without any specific instruction being required from the driver. The device uses control information stored in the common buffer to know what to transfer and when.
More common these days is the hybrid packet-based and common buffer design. In this design, which is used by most network cards and many storage controllers, the driver allocates a (physically contiguous) block of host memory that it shares with the device, and that block of memory has a mutually agreed structure, just as in the common buffer approach. However, in a hybrid design, the driver does not need to move the data between the requestor's buffer and the common buffer as it would in a pure common buffer design. Rather, when it receives a request to perform a transfer operation, the driver places into the common buffer information that describes the base and length of each fragment of the requestor's buffer, just like it would program the device in a pure packet-based design. The device then performs the transfer directly into or out of the requestor's data buffer using the information it reads from the common buffer.
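As a rough illustration of the hybrid design, the sketch below shows what a "mutually agreed structure" in a common buffer might look like: a ring of descriptors, each holding the base and length of one fragment of the requestor's buffer. Everything here is hypothetical (the descriptor format, the ownership flag, the ring size, and the `ring_post` helper); real devices define their own layouts, and a real driver would also need memory barriers and a doorbell write that this user-mode model leaves out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RING_ENTRIES         8u    /* hypothetical ring size */
#define DESC_OWNED_BY_DEVICE 0x1u  /* hypothetical ownership flag */

/* Hypothetical descriptor format understood by driver and device. */
typedef struct {
    uint64_t buffer_addr;  /* address of one fragment of the requestor's buffer */
    uint32_t length;       /* length of that fragment in bytes */
    uint32_t flags;
} dma_descriptor_t;

/* Hypothetical layout of the shared common buffer. */
typedef struct {
    dma_descriptor_t ring[RING_ENTRIES];
    uint32_t producer;  /* count of descriptors posted by the driver */
    uint32_t consumer;  /* count of descriptors completed by the device */
} common_buffer_t;

/*
 * Post one fragment to the ring. Only the description of the fragment is
 * written to the common buffer; the data itself is never copied.
 * Returns 0 on success, -1 if the ring is full.
 */
int ring_post(common_buffer_t *cb, uint64_t fragment_addr, uint32_t length)
{
    uint32_t slot;

    if (cb->producer - cb->consumer == RING_ENTRIES)
        return -1;

    slot = cb->producer % RING_ENTRIES;
    cb->ring[slot].buffer_addr = fragment_addr;
    cb->ring[slot].length = length;
    cb->ring[slot].flags = DESC_OWNED_BY_DEVICE;
    cb->producer++;
    return 0;
}
```

In a pure common buffer design the ring entries (or an adjacent area) would carry the data itself; here they carry only pointers to it, which is what lets the device transfer directly to or from the requestor's buffer.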
Why An Abstraction?
Windows is designed to be an extraordinarily flexible operating system. In fact, Windows' view of DMA is so flexible that it doesn't presume much of anything about the underlying machine topology. But instead of forcing the DMA device driver to deal directly with all this flexibility and the details of the underlying machine topology, Windows nicely abstracts many of the details for the driver developer. Thus, when you write a driver for a DMA device on Windows, you write your driver not to the specific details of the underlying machine hardware, but rather to match the non-changing Windows DMA abstraction.
The advantages of this approach are plain: When you write a Windows driver for a DMA device, you get to concentrate on the details of programming your device. You don't have to worry about the hardware details of the system that your driver might be running on, such as the physical path your data takes as it travels from your device to main memory. You don't end up with conditionals in your code to handle various system hardware layouts.
You write your driver once. You write it to the Windows DMA abstraction. After that, your driver just needs to be compiled for each processor architecture (x86, AMD-64, and Itanic) and it'll work regardless of the underlying system hardware configuration. Sweet, huh?
What the Windows approach does require -- first and foremost -- is that before developing a driver for a DMA device, a developer very clearly understand the Windows DMA abstraction. After all, the DMA abstraction defines the environment in which your driver will run.
In the Windows DMA abstraction, devices attach to a "device bus." This device bus is connected only indirectly to the bus that is used to access main memory. The connection between the device bus and the main memory bus is handled by a set of map registers. See Figure 1 for a diagram that illustrates this abstraction.
Figure 1 - Conceptual connection between Device Bus and Main Memory Bus...via Map Registers
Architecturally, map registers perform much the same function as memory management Page Table Entries (PTEs). However, instead of converting from virtual address space to physical address space like PTEs, map registers convert from device bus logical address space to physical address space in main memory. The DMA subsystem (mostly the HAL) is responsible for allocating and programming map registers on your driver's behalf.
While it might not be immediately obvious, this part of the DMA abstraction has some very important and far-reaching implications for driver developers. The single most important of these is that under Windows, DMA is never performed directly using physical memory addresses. Not ever. Rather, when your driver programs your device with the base and length of each buffer fragment in preparation for a DMA transfer, the base address provided to your device must be a device bus logical address that you have received from the DMA subsystem. This means that it is a serious design error to determine the physical address of a user data buffer (such as by calling MmGetPhysicalAddress) and program your DMA device with that address. If you haven't gotten the address from the DMA subsystem, it's not a valid device bus logical address!
So, in Windows, the steps a driver will typically take to perform a packet-based DMA transfer are as follows:
1. The driver receives a transfer request, with the requestor data buffer described using an MDL.
2. The driver calls an appropriate Windows function that will logically: (a) allocate a set of map registers, (b) program those map registers with the base address of each physical memory fragment comprising the requestor's data buffer, and (c) return to the driver the device bus logical addresses of the map registers that will be used to relocate this transfer.
3. The driver programs the DMA device with the device bus logical addresses provided by the DMA subsystem.
4. The driver instructs the device to perform the DMA transfer.
5. When the transfer is completed, the driver informs the DMA subsystem, and the DMA subsystem frees any map registers that were used.
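The allocate, program, translate, and free cycle in these steps can be modeled in a few lines of user-mode C. This is a toy model of the abstraction, not Windows code: the `map_regs_t` table and the `map_alloc`, `map_program`, `map_translate`, and `map_free` helpers are hypothetical names, and we assume each map register relocates exactly one 4KB page.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define MAP_REGS   16   /* hypothetical number of map registers */

/* Toy map register file: one register relocates one page. */
typedef struct {
    uint64_t phys_page[MAP_REGS];  /* page-aligned physical address */
    int      in_use[MAP_REGS];
} map_regs_t;

/* (a) Allocate `count` contiguous registers; returns base index or -1. */
int map_alloc(map_regs_t *m, int count)
{
    int base, i;

    if (count <= 0)
        return -1;
    for (base = 0; base + count <= MAP_REGS; base++) {
        for (i = 0; i < count && !m->in_use[base + i]; i++)
            ;
        if (i == count) {
            for (i = 0; i < count; i++)
                m->in_use[base + i] = 1;
            return base;
        }
    }
    return -1;
}

/* (b) Program one register; (c) return its device bus logical address. */
uint32_t map_program(map_regs_t *m, int reg, uint64_t phys_page)
{
    m->phys_page[reg] = phys_page;
    return (uint32_t)reg << PAGE_SHIFT;
}

/* What the abstraction does when the device presents a logical address. */
uint64_t map_translate(const map_regs_t *m, uint32_t logical)
{
    uint32_t reg = logical >> PAGE_SHIFT;

    assert(reg < (uint32_t)MAP_REGS && m->in_use[reg]);
    return m->phys_page[reg] | (logical & (PAGE_SIZE - 1));
}

/* Final step: release the registers once the transfer has completed. */
void map_free(map_regs_t *m, int base, int count)
{
    int i;

    for (i = 0; i < count; i++)
        m->in_use[base + i] = 0;
}
```

Note that the device in this model only ever sees the 32-bit values returned by `map_program`, never the physical addresses stored in the table.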
All of Memory, Always Accessible
Another interesting feature of the map register design is the fact that, in the Windows DMA abstraction, all devices can always reach all of physical memory. Because device bus logical addresses are "relocated" to physical memory addresses using map registers, your device that only supports 32-bit DMA operations will still work on systems with 4GB or more of physical memory. Cool, huh?
Architecturally, this is no different than how PAE works in memory management on an x86 system. Even though an application is limited to 32 bits of virtual address space, the memory management registers on a system supporting PAE can relocate that 32-bit address space to a larger physical address space.
Map registers can also be used to increase the throughput (and decrease the complexity) of drivers for DMA devices that do not support scatter/gather. If, as part of initialization, a driver indicates that its device does not have scatter/gather support, the DMA subsystem will try to allocate a block of map registers that are contiguous in device bus logical address space. If this contiguous map register allocation is successful, the driver will receive a single base device bus logical address with which the device can be programmed. This single base address will be sufficient to describe the entire transfer. This saves the driver from having to break the overall DMA transaction (for a buffer that spans multiple, non-contiguous, physical fragments) into multiple individual DMA transfers. Not only is this convenient, but it also increases device throughput and utilization.
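How large a contiguous block of map registers would such a transfer need? Assuming one map register relocates one page (an assumption of this sketch, along with the 4KB page size and the hypothetical `span_pages` name), the answer is the number of pages the buffer touches, which depends on where in its first page the buffer starts. The WDK provides a macro with a similar purpose, `ADDRESS_AND_SIZE_TO_SPAN_PAGES`; the function below is our own illustration of the arithmetic rather than a copy of it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  ((size_t)1 << PAGE_SHIFT)   /* assumed 4KB pages */

/*
 * Number of pages touched by a buffer of `size` bytes starting at virtual
 * address `va`. `size` is expected to be nonzero.
 */
size_t span_pages(uintptr_t va, size_t size)
{
    size_t offset = (size_t)va & (PAGE_SIZE - 1);

    /* Round (offset + size) up to a whole number of pages. */
    return (offset + size + (PAGE_SIZE - 1)) >> PAGE_SHIFT;
}
```

For example, a 2-byte buffer that starts on the last byte of a page spans two pages, so it would need two map registers in this model even though it is tiny.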
Again, this is architecturally analogous to how a data buffer in an application can be virtually contiguous, but not physically contiguous.
The Overall Impact
As a driver writer, the impact of using the Windows DMA abstraction is that your driver is simpler to write and significantly more reliable.
Don't believe me? Well, how about if you write a driver for a device that only supports 32-bit DMA. Your customer buys this device and installs your driver knowing full-well that it only has 32-bit DMA support. Things work fine. The customer is happy: The device is relatively inexpensive, and it works just fine in his Windows server system with 2GB of main memory.
But, you know how these things go. Time passes. Demands increase. Memory prices continue to plummet. The customer decides to add more memory, and replaces his two 1GB DIMMS with two 4GB DIMMS. Now the system has 8GB of physical memory.
And, what happens to support for your device as a result of adding memory above the 4GB mark? Well, one of three things:
1. If you ignored the Windows DMA abstraction, and didn't worry about supporting address spaces greater than 32 bits, your device fails to operate properly. Worse, your customer's system ceases to work properly. Oops! Hope he didn't throw away those old DIMMs.
2. If you ignored the Windows DMA abstraction, but being clever, you wrote code to handle 64-bit transfers manually by allocating an intermediate buffer, you're in luck! Well, SORT of. Up to this point, your driver hasn't actually been running any of that 64-bit handling, intermediate buffering, code. Hope you tested it really thoroughly, huh? Because if you didn't... yeah... same result as item 1, above.
3. If you USED the Windows DMA abstraction, your device continues to function without any problems. Problem solved, with no effort at all on your part.
I don't know about you, but I'm going for number 3.
There's one more reason why using the Windows DMA abstraction contributes to driver reliability. If you enable DMA verification in Windows Driver Verifier, it will automatically enable checks that will help ensure that your driver correctly interprets scatter/gather lists and properly programs your device's DMA transfers. That sounds like a major feature to me.
By now, you're probably thinking to yourself "They say the best things in life are free. But I doubt that one of those things is the Windows DMA abstraction. How is it implemented?"
Well, of course, how it's implemented depends on the underlying processor hardware (much hand waving and histrionics here). I mean, it is possible that some motherboard somewhere could, you know, maybe, possibly, implement map registers in hardware... as a direct physical translation of the Windows abstraction. There have been systems with hardware map registers in history. Both the PDP-11 and the VAX worked this way (surely a coincidence). In fact, there have even been a few systems that ran Windows NT that had hardware map registers.
But, enough nonsense. Without exception on modern hardware, the Windows DMA subsystem implements map registers entirely in software.
The way they work is pretty simple: Either the DMA subsystem just gives you the physical address of the user's data buffer and simply calls that the device bus logical address, or the DMA subsystem allocates an intermediate buffer (a "bounce buffer") in system memory (below 4GB) and gives you the base address of this buffer as the device bus logical address.
Which you get depends on several details, as shown in Table 1.
The DMA subsystem will allocate an intermediate buffer, and use that buffer, when either (a) your driver indicates that your device does not support scatter/gather, or (b) the device your driver supports is not capable of directly accessing all fragments of the user's data buffer (because the device has only 32-bit DMA support and one or more of the buffer fragments are located at physical address 4GB or above). When such an intermediate buffer is used, your device winds up transferring data to/from the intermediate buffer (the "map register", so to speak), and the Windows DMA subsystem is responsible for moving the data between that intermediate buffer and the actual requestor's data buffer. As we often say in class when we describe this, "this results in you getting all the cost and complexity of DMA, with all the CPU overhead of programmed I/O!"
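The two conditions above can be written down as a small predicate. This is only a restatement of conditions (a) and (b) as described in this article, not the HAL's actual logic; the `device_caps_t` and `fragment_t` types and the `needs_bounce_buffer` name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FOUR_GB 0x100000000ULL

/* One physical fragment of the requestor's buffer (hypothetical type). */
typedef struct {
    uint64_t base;
    uint32_t length;
} fragment_t;

/* What the driver told the DMA subsystem about its device (hypothetical). */
typedef struct {
    bool scatter_gather;  /* device accepts a list of fragments */
    bool dma_64bit;       /* device can address memory at or above 4GB */
} device_caps_t;

/* Returns true when an intermediate ("bounce") buffer would be used. */
bool needs_bounce_buffer(const device_caps_t *caps,
                         const fragment_t *frags, size_t count)
{
    size_t i;

    /* Condition (a): no scatter/gather support. */
    if (!caps->scatter_gather)
        return true;

    /* A 64-bit capable device can reach every fragment directly. */
    if (caps->dma_64bit)
        return false;

    /* Condition (b): 32-bit device and some byte lies at 4GB or above. */
    for (i = 0; i < count; i++) {
        if (frags[i].base + frags[i].length > FOUR_GB)
            return true;
    }
    return false;
}
```

When the predicate is false, the "device bus logical address" handed to the driver in this model is simply the physical address of the fragment.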
But seriously... when the alternative is having a device that doesn't work, the Windows DMA abstraction makes a great deal of sense.
One detail with which some developers choose to take issue is the use of intermediate buffering for all DMA operations on devices that don't support scatter/gather. Well, this is something about which reasonable engineers might reasonably agree to disagree. Fortunately, you're not forced to accept the Windows DMA abstraction's way of doing things in this regard. If you don't want to avail yourself of this feature, simply have your driver indicate that your device supports scatter/gather! The result: You'll get all the other benefits of the abstraction, but the DMA subsystem will give you the list of device bus logical addresses which you can use to program your device directly. You can perform the transfer by programming multiple, individual, DMA transactions on your device.
And In The End
So, there you have it: A brief introduction to the key concepts of DMA in general, and Windows DMA in particular. Armed with this information, you should be able to read the WDK documentation (KMDF or WDM) that describes DMA operations much more easily.
"Compelling argument to use the DMA API"
Before I give my $0.02, I wanted to thank NTInsider for posting this good article and working to clarify WDM DMA API. Regarding, "you write your driver to match the non-changing Windows DMA abstraction. The advantages of this approach are plain", I don't feel that this article's argument is compelling. Driver writers should understand both their device and target system architecture. Who wants to complicate their life writing to the packet-based DMA abstraction, when the benefit isn't clear? When I first described the packet-based API to a HW engineer many years ago, they laughed and said, "What map registers?! This isn't a mainframe." Allocate memory your device can touch, lock it down, get the physical pages and go. Wintel is cache-coherent without any map registers or IO buffering to worry about... or it was...
What compelled me to consider the DMA API? Not PCIe Isoch... I was compelled by IO-MMUs (VT-d, DMAr, ...). They're here now in new servers and workstations, and they're supported by Server 2008. I recommend the following articles for motivation:
* "DMA Directions And Windows": http://download.microsoft.com/download/a/f/d/afdfd50d-6eb9-425e-84e1-b4085a80e34e/SYS-T304_WH07.pptx
* "Windows Virtualization Best Practices And Future Hardware Directions": http://download.microsoft.com/download/5/b/9/5b97017b-e28a-4bae-ba48-174cf47d23cd/VIR124_WH06.ppt
Happy coding!
08-Oct-09, Jeremiah Cox
"intermediate, double-buffer performance"
Regarding, "this results in you getting all the cost and complexity of DMA, with all the CPU overhead of programmed I/O!" - This could be notably better than PIO for big transfers. The mirror buffer is probably in cached memory so the CPU reads will likely operate in cache line chunks with prefetch. We don't have to stall the CPU while we round-trip to our device in register-sized (generally 32-bit) chunks, flushing PCI post buffers as we go. Furthermore, a device can burst the data into memory making more efficient use of the bus. The CPU, device, and bus cycle savings may add up for folks on a power budget.
But you're right; it isn't fast enough for some of us...
Mirror buffer performance unacceptable? Don't need physically contiguous pages as provided by AllocateCommonBuffer()? Consider MmAllocatePagesForMdl() as suggested to me by Peter several years ago at WinHEC. Watch out for that IO-MMU pitfall...
07-Oct-09, Jeremiah Cox