Higher performance is always a good thing. But, let’s face it, many devices – perhaps even most devices – just don’t warrant the amount of work necessary to develop and support a device driver that achieves the ultimate in device performance.
Then again, there are those devices that do require the highest level of performance that you can wring out of the system. In this article, we’ll briefly review the Win2K/WinXP busmaster DMA architecture, and explore the constraints on Windows 2000 and Windows XP DMA performance. We’ll also discuss methods for achieving higher performance for busmaster DMA devices on Win2K and WinXP. Note that this article isn’t meant to be a tutorial on how to implement busmaster DMA on Win2K or WinXP. We assume you already know that. It also applies only to packet-based busmaster devices. It doesn’t apply to devices that implement system DMA, or devices that utilize common buffer DMA schemes.
For more information on this topic, consider attendance at one of OSR’s Advanced Driver Development for Windows 2000 seminars.
Building a Mystery
Windows 2000 and XP have a specific architecture for implementing DMA. In fact, this architecture falls within the Hardware Abstraction Layer’s set of abstractions. We’ve discussed this before here in The NT Insider (A New Way to DMA, Volume 6, Issue 4). Basically, the Win2K/WinXP driver architecture specifies the following steps for implementing DMA transfers:
1. During device initialization, the driver describes the characteristics of the DMA device it supports, and gets a pointer to a DMA Adapter. This is done by calling IoGetDmaAdapter().
2. Before starting a DMA transfer, the driver is responsible for ensuring that the system’s hardware cache and main memory are coherent for the DMA usage. Drivers do this by calling KeFlushIoBuffers(). Before you send me email, note that this function call is a noop on Win2K and WinXP. Check out the definition for this function in NTDDK.H and/or WDM.H.
3. DMA transfers between a device and host memory require the use of “Map Registers”. According to the HAL’s abstraction, these Map Registers are used to translate from device bus logical addresses to main memory physical addresses. Before a transfer can be started, sufficient Map Registers must be allocated by the operating system to describe the DMA transfer that’s being requested. In addition, the device’s DMA Adapter must also be allocated and available. These are both accomplished by the driver calling either GetScatterGatherList() (if the driver implements New Model DMA) or AllocateAdapterChannel() (if the driver implements Classic Model DMA). During this call, the driver provides a pointer to a function that is to be called-back when the operating system has amassed the resources required for the transfer.
4. When the required resources are available, the driver is called back at its callback function. If using Classic Model DMA, it then retrieves one or more device bus logical addresses that describe the user data buffer by calling MapTransfer(). If using New Model DMA those addresses are provided in the supplied scatter/gather list. In the Win2K/XP DMA model, it is typically at this time that the driver programs its device with the device bus logical address(es) that describes the user buffer, sets the transfer length, sets the direction of the transfer, and tells the device to start the transfer.
5. After the DMA transfer has been requested, the device interrupts to indicate that the transfer is complete (either successfully or with error). Within the interrupt service routine, the driver saves the reason for the interrupt (if required) and requests a callback to IRQL DISPATCH_LEVEL for interrupt service routine completion. This callback is to the driver’s DpcForIsr.
6. The driver’s DpcForIsr runs, and the driver examines the reason for the interrupt. Assuming the interrupt indicates completion of a DMA transfer (and use of Classic Model DMA), the driver must flush any remaining buffers by calling FlushAdapterBuffers(). If the driver is using New Model DMA, the scatter/gather list needs to be returned by calling PutScatterGatherList(). The driver then determines if this transfer was sufficient to complete the entire operation requested by the user. If not, the remainder of the user-requested operation is processed (typically by calling GetScatterGatherList() or AllocateAdapterChannel() again). If the entire operation is completed, the driver calls IoCompleteRequest().
7. At some point after a transfer has been completed, the driver must propagate device execution by attempting to start the next user request that may be queued to the device. This is typically done in the DpcForIsr, immediately before calling IoCompleteRequest() (doing it before the call to IoCompleteRequest() is a common optimization that provides a small increase in device utilization). Propagating device execution typically involves once again calling GetScatterGatherList() or AllocateAdapterChannel().
The bottom line is: Windows 2000/XP defines a specific architecture that drivers performing DMA must follow. For any solution to be acceptable, it must follow the Windows 2000/XP architecture as outlined above.
Take A Look Around
Given the model just presented, some potential issues for drivers that want to implement truly high-performance DMA can be identified. For example:
The latency involved in awaiting resources to be assembled, and awaiting the subsequent callback, after calling (for example) GetScatterGatherList().
The classic latency issue of awaiting a DpcForIsr callback from the ISR.
There are other potential issues as well, involved in precisely how a given HAL might implement parts of the DMA architecture, particularly the Map Register abstraction. However, we’ve dealt with these before in The NT Insider. Let’s assume those are non-issues (because, when you come down to it, they are) at least for the purposes of this discussion.
So, it appears that the architecture needs to be followed, and doing so may lead to some potential issues. Does this perhaps mean that the Win2K/XP architecture shouldn’t be followed after all?
No! Wrong answer. Get a clue. I am seriously tired of debating the issue of whether, or by how much, the Win2K/XP DMA architecture needs to be followed. Mostly, I get into this discussion with luddites, *nix bigots, and other random semi-literates who think that they know better than the HAL developers. To these folks all I can say is: Look, many years ago I used to agree with you. But after several years of painful experience trying to upgrade drivers that didn’t follow the abstraction, and after going through the transition to 64-bit PCI and support for PAE, I’m now firmly and irretrievably in the “follow the architecture and let the HAL deal with the details” camp. Everything they say about converts is true.
Following the architecture means that doing DMA by calling MmGetPhysicalAddress() and taking the result and stuffing it into your device is not an option. Sure, it’s fast. But it also won’t work on all the platforms that can support Win2K/XP. Driver writers that do this should be shot, or at least have their driver-writing license revoked.
Take It Easy
So, what’s a driver writer who wants to squeeze out maximum performance from their device to do? Well, believe it or not, there are some very good options.
The first step, and what is perhaps the easiest option, is to be sure that your driver is using New Model DMA. The only drivers that should be calling AllocateAdapterChannel() are drivers that need to run on both Win2K/XP and Windows 98.
If you haven’t looked at updating your driver to New Model DMA, you should. The interface is faster, easier, and less error prone than the Classic Model DMA APIs. No more calling nasty-old MapTransfer(). No more risk of crashing the system if you block awaiting DMA resources a second time. New Model DMA is cool. Too bad they didn’t add it to Win9x. But what do you expect? It is Win9x after all, isn’t it?
Let Me Take You Higher
But how about the cases where you really need to maximize device utilization? Surely there are approaches beyond “update your driver to call GetScatterGatherList()”, right?
Indeed there are. One of my favorites is queuing requests awaiting device availability after the Map Registers have been allocated. Let me explain further.
The typical DMA driver that uses New Model DMA, in following the Win2K/XP architecture, receives an IRP in its dispatch routine, validates it, and then checks to see if the device is available to start a DMA transfer. If the device is not available, the IRP is queued. If the device is available, the driver calls GetScatterGatherList(). When resources for the DMA operation are available, the driver’s execution routine is called back, and the device is programmed for the transfer.
A cool alternative to this is to always call GetScatterGatherList() directly from your dispatch routine, immediately after you validate the IRP. In this approach you check to see if the device is available from your driver’s execution routine call-back, when you have all the resources “ready to go” for your transfer. If during the callback your device is available, you immediately program the device to start the transfer. If the device is busy (or if there are other requests on the queue ahead of you) you store a pointer to the provided scatter/gather list someplace in the IRP (perhaps one of the DriverContext locations), and queue the IRP internally in your driver. When the device is free, POW! You’re ready to immediately program the device and start the transfer.
The advantage to this approach is that it moves the overhead of amassing resources (Map Registers or whatever) to the front end of the process. Therefore, there’s less to do between the completion of one DMA transfer and the start of the next. We’re getting there!
The Little Things
Of course, there’s nothing in this business that doesn’t cost you in one way or another. You gain performance. So what’s the cost?
Well, by allocating the resources necessary to perform the transfer before the actual device is ready you could be tying up Map Registers. And, on most systems, Map Registers are scarce resources that are supposed to be shared among drivers. When you call GetScatterGatherList(), if no Map Registers are available, your callback will not occur until those registers are free. So, this scheme won’t lead to system failure. However, it certainly does give the device that implements this scheme an unfair advantage relative to other busmaster DMA devices on the system that need Map Registers.
The problem really only arises in the following cases:
Your device does not implement scatter/gather in hardware, and you choose (as almost everybody should) to let the HAL provide “system scatter/gather support” for your device, or
Your device is a 32-bit device, and it’s running on a system with more than 4GB of physical memory.
If your hardware implements scatter/gather (as all good Win2K/XP hardware should), and if it’s capable of 64-bit DMA when running on a system with more than 4GB of physical memory (as all good Win2K/XP hardware should), then you have no worries.
One Step Closer
The approach of queuing your IRPs after resource allocation can be further enhanced by reprogramming the device for the next DMA transfer within the ISR. This should be relatively simple, assuming you manage your IRP queue using the interlocked queue functions (for example, ExInterlockedInsertTailList() and ExInterlockedRemoveHeadList()). Remember, when you use these functions for queuing, and you access a queue at IRQLs > DISPATCH_LEVEL, it is absolutely vital that the queue only be manipulated with these functions. No calling the ordinary queue functions, such as InsertTailList(), is allowed.
With this last alteration, the driver’s IRP processing path now proceeds as follows:
1. The driver is called at its dispatch entry point with an IRP. You validate the IRP, and assuming it’s valid, you mark the IRP pending, call KeFlushIoBuffers(), and then GetScatterGatherList(). You return STATUS_PENDING from your dispatch routine.
2. When your driver’s execution routine (the callback supplied to GetScatterGatherList()) is called, you check to see if your device is currently busy. If the device is busy, you store a pointer to the provided scatter/gather list in the IRP, and queue the IRP on your device’s “work to do” queue using ExInterlockedInsertTailList(). If your device is available to start a new transfer, you program the device and queue the IRP on the “waiting for device” queue using ExInterlockedInsertTailList().
3. The device interrupts. In your ISR you match the interrupt to the IRP (which is pretty simple if your device has only one request outstanding). You store the completion information into the IRP and move the IRP from the “waiting for device” queue to the “ready to complete” queue, all the time using only interlocked queue functions. You then attempt to remove an entry from the “work to do” queue, and if you were successful at doing that, you program the device, and place the just dequeued IRP on the “waiting for device” queue (again, all using interlocked functions). You request a DpcForIsr, probably by calling IoRequestDpc().
4. In your DpcForIsr, you iteratively remove IRPs from the “ready to complete” queue, using the interlocked queue functions, and call PutScatterGatherList() followed by IoCompleteRequest() for each IRP removed.
Pretty straight-forward, eh? Yeah, well, I thought so. The nicest part of this approach is that it can increase device utilization right up to the max. Assuming there’s work to do, the only time the device isn’t busy is the time (a) between when the device generates an interrupt and the start of the driver’s ISR, and (b) between when the device’s ISR starts and the device is once again programmed within the ISR. Both of these times should be very small.
So, there you have it. A scheme that can dramatically increase DMA performance, if that’s what you need, and is also 100% compliant with the Win2K/XP DMA architecture. Yes, if there are multiple DMA devices in the system that don’t do scatter/gather or you’ve got several 32-bit DMA devices on a system with more than 4GB of physical memory, it can put unusual strain on sharable resources. That just means that this isn’t a scheme for every DMA driver. Just the DMA drivers that really must have the ultimate performance.
So, if your driver doesn’t need ultimate performance, keep away from this algorithm. Otherwise, give it a try and keep your device rollin’… Although I can’t promise, everything should be wonderful. If it doesn’t work, pardon me, you can say it was all my fault. If it does work, who knows? You might make yourself totally immortal.