Here’s a hot tip I bet you didn’t know: Windows NT® and Windows 2000® are only supported on x86 architecture processors. OK, you probably did know it already, but I figured that was a more interesting way to start my article than saying “You already know that…”.
One of the historic issues in NT has been the need for kernel-mode modules (other than the Kernel itself and the HAL) to scrupulously maintain processor independence. Device drivers, for example, must only access hardware via the HAL, and must strictly adhere to the HAL’s model for DMA operations. This cramps the style of some driver writers, especially those who are intimately familiar with the details of the x86 architecture and want to make use of those details.
So, given that NT (V4 and Win2K) now works only on x86 architecture systems, can we finally kiss processor-architecture independence in our device drivers goodbye?
The answer is a resounding NO! It’s true that relying on the HAL’s abstractions can be complex and even annoying at times. However, there are still more than enough reasons why it’s not “safe” or appropriate to fall back into the swamp of processor-architecture specific development practices. Let me try to enumerate a few of my favorite reasons why the HAL is still your friend.
All X86 Platforms Are Not Created Equal
Without a doubt, one of the most confusing parts of the HAL’s abstraction for device driver writers is the DMA model. Hell, it seems that at least half the people writing DMA drivers don’t even understand it, never mind use it properly. Unsafe use of the DMA model is one of the most common errors I see in NT drivers that support DMA.
What sorts of issues am I talking about? Well, for one, despite books and DDK descriptions to the contrary, some DMA driver writers insist that it’s safe to use physical addresses for DMA operations. Here’s a quote I often hear: “The Pentium uses a bridge to connect the memory bus and the PCI bus, so when a device does a DMA operation it always references physical memory addresses.” (See Figure 1.)
Figure 1 – A Pentium Architecture Utilizing a 440BX NorthBridge
Well, that’s all well and good. But remember that “x86” is merely a processor architecture. Even in x86 systems, there are many possible ways to connect one or more PCI buses to the memory and CPU bus. Sure, bridging all these addresses together as shown in Figure 1 is one very common way of doing this. But it’s not the only way.
Let me state for the record that there are x86 architecture systems that run NT that do not bridge together memory and PCI Bus address spaces. Take for example the imaginary architecture shown in Figure 2.
Figure 2 – Where Physical Addresses Cannot Be Used for DMA
This shows a system that is 100% compliant with NT’s and the HAL’s DMA model. In fact, I use a diagram similar to this in our driver seminars to illustrate the concepts in that model. The system shown in the figure uses hardware map registers to translate between physical memory addresses and device addresses (or “logical” addresses, as they are sometimes called) on the PCI bus. Note that there’s no reason the CPU in Figure 2 couldn’t be an x86 architecture processor. However, the way the CPU, memory, and I/O bus are connected is different from how one tends to envision a typical PC. The system in Figure 2 won’t run Windows 95. However, it would certainly be capable of running NT/Win2K (with the right HAL provided by the hardware vendor).
The point of Figure 2 is that if you attempt to DMA from a device that’s connected to the PCI bus, to the physical address of a user’s buffer in main memory, it’s not going to work. All DMA addresses to memory in Figure 2 are translated by map registers. If you don’t work and play well with the HAL, or you try to bypass NT and the HAL’s DMA model, you’re in for nothing but trouble.
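To make the contrast concrete, here’s a hedged sketch of getting a device address the HAL’s way, using the Win2K DMA adapter object rather than MmGetPhysicalAddress(). This is illustrative only: error handling and the AllocateAdapterChannel(...) plumbing that normally supplies MapRegisterBase are elided, and the function name and parameters are my own invention.

```c
/*
 * Sketch only: obtaining the *device logical* address for a transfer
 * through the Win2K DMA abstraction.  On a Figure 1 system this may
 * equal the physical address; on a Figure 2 system it won't -- and
 * your driver never needs to know which kind it's running on.
 * DevicePdo, Mdl, and MapRegisterBase are assumed to come from the
 * driver's normal PnP/adapter-channel plumbing (elided here).
 */
#include <wdm.h>

PHYSICAL_ADDRESS
GetDeviceLogicalAddress(PDEVICE_OBJECT DevicePdo, PMDL Mdl,
                        PVOID MapRegisterBase, PULONG Length)
{
    DEVICE_DESCRIPTION desc;
    ULONG mapRegisterCount;
    PDMA_ADAPTER adapter;

    RtlZeroMemory(&desc, sizeof(desc));
    desc.Version = DEVICE_DESCRIPTION_VERSION;
    desc.Master = TRUE;             // bus-master device
    desc.InterfaceType = PCIBus;
    desc.Dma32BitAddresses = TRUE;  // 32-bit PCI device
    desc.MaximumLength = *Length;

    // The adapter object (backed by the HAL) knows whether this
    // platform has hardware map registers, and programs them if so.
    adapter = IoGetDmaAdapter(DevicePdo, &desc, &mapRegisterCount);

    // MapTransfer returns the address to program into your hardware.
    return adapter->DmaOperations->MapTransfer(
        adapter, Mdl, MapRegisterBase,
        MmGetMdlVirtualAddress(Mdl), Length,
        TRUE /* WriteToDevice */);
}
```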
So, just because you’re sure your driver is going to be running on an x86 processor, you can’t be sure that your driver will be running on what you might think of as a conventional PC motherboard layout. Unless, of course, you can dictate the specific environment in which your driver will run.
Your Driver Runs on 16GB Systems...Right?
Maybe you’re not convinced by my architecture arguments so far. “There may be some x86 systems that have an architecture similar to that in Figure 2,” you say, “but I’ve never seen one, and I’m not going to worry about some ‘weirdo’ system that can’t run Win9x and that my customers will probably never use.” Fair enough. But are you going to limit the amount of memory your customers can install on their systems, too?
You might have noticed that Win2K supports systems with more than 4GB of physical memory. And, last time I checked, just under 4GB (0xFFFFFFFF) was the highest address that you can generate from a 32-bit PCI device.
Insist on performing DMA operations using the physical addresses of user buffers and you have three choices:
1) Insist your users locate the data buffers in their programs entirely within the lower 4GB of physical memory space. This is not likely to fly, considering NT is a demand-paged virtual memory system that does not allow users to control the physical placement of their process pages.
2) Write all sorts of nasty driver code that intermediately buffers transfers for those fragments of your user’s data buffers that are above the holy 4GB mark. That’s code that I prefer not to have to test, benchmark, and get right. It can be done, but it doesn’t sound like an appetizing use of a few days to me.
3) Withdraw support for your device on systems with more than 4GB of physical memory.
Of course, systems with more than 4GB of physical memory are pretty rare today. Maybe none of your customers will get one. Maybe restricting them to running your device on systems with 4GB or less of main memory is OK. For now. But there is no doubt that we’re going to see more and more of these systems over time.
The point here is that if your driver uses the HAL’s processor independent DMA model, your 32-bit Busmaster DMA device works on x86 architecture systems with more than 4GB of physical memory with no changes to the driver. You just “Let the HAL worry about it.”
Now, obviously, the HAL isn’t going to alter the laws of physics for you and make the PCI bus suddenly able to DMA to addresses above 4GB. What the HAL will do is intermediately buffer your DMA transfers, using DMA buffers below the 4GB address mark, and copying the data between those intermediate buffers and the actual user’s data buffers. While this isn’t the most efficient I/O scheme one could invent, it is probably optimal given the constraints of a 32-bit device on a system that can address more than 32 bits of memory. And, the best part of it is, you didn’t have to do anything special to make your device work. You just had to adhere to the HAL’s guidelines.
Are You Sure You're Up on Your X86 Architecture?
One nice aspect of the HAL is that it frees us from having to worry about all manner of details of the underlying processor and platform architecture. You want to write to a 4-byte device register that’s located in Port I/O space? You call WRITE_PORT_ULONG(...). You want to read from a 2-byte device register that’s in memory space? You call READ_REGISTER_USHORT(...). The only folks who have to care what these functions do are the (ever shrinking cadre of) poor souls who write HALs for a living. As long as the HAL does it correctly, and does it relatively efficiently, I don’t give a fat frog’s ass what specific instructions it uses to accomplish my port and register accesses.
This concept of “Let the HAL worry about it” is a very liberating concept for driver writers. Instead of spending your days studying the latest Pentium Developer’s Manual set, you get to actually spend time worrying about the details of your device. Every time a new CPU is released, or a new platform is developed using an existing CPU, you don’t have to worry about whether or not your driver will understand the nuances of the new processor and platform architecture. You let the HAL worry about it.
Take, for example, the issue of write posting. Let’s say you’ve got an intelligent interface board, with shared memory on the board, that both your driver and a CPU on the board read and write. Maybe there are some ring-buffer structures in that on-board memory that you and the device both manipulate. The issue? When you write to this shared memory from your driver, how do you know that when you’ve completed the write request, the data has actually been written to the board? If the system on which you’re running implements write posting, the write can actually be delayed. If you use the HAL, you know, because according to the HAL’s model of device access, you only write to the on-board shared memory via the appropriate HAL functions: WRITE_REGISTER_Xxx(...) or WRITE_REGISTER_BUFFER_Xxx(...). And, as long as you’ve scrupulously used these functions, the HAL guarantees that what you attempt to write will get written out to the board. Again, you’ve simply let the HAL worry about it. The HAL is responsible for understanding what has to be done in order to ensure that the write posting registers are flushed.
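Here’s a hedged sketch of what that looks like in practice for the ring-buffer scenario above. The ring layout, the function, and its parameters are all hypothetical; SharedMem is assumed to be what MmMapIoSpace(...) returned for the board’s memory-space BAR. The point is simply that every access to the on-board memory goes through a HAL function:

```c
/*
 * Sketch only: publishing a ring-buffer entry to on-board shared
 * memory strictly through the HAL's register-access functions.
 * SharedMem is assumed to come from MmMapIoSpace(); the RING_HEADER
 * layout and the entry placement are hypothetical and simplified
 * (real code would compute the entry slot from the producer index).
 */
#include <wdm.h>

typedef struct _RING_HEADER {
    ULONG ProducerIndex;
    ULONG ConsumerIndex;
} RING_HEADER;

VOID
PublishEntry(PUCHAR SharedMem, PUCHAR Entry, ULONG EntrySize,
             ULONG NewProducerIndex)
{
    RING_HEADER *hdr = (RING_HEADER *)SharedMem;

    /* Copy the entry body with the HAL -- never with
     * RtlCopyMemory() or memcpy() on device memory. */
    WRITE_REGISTER_BUFFER_UCHAR(SharedMem + sizeof(RING_HEADER),
                                Entry, EntrySize);

    /* Only then publish the new producer index.  The HAL ensures
     * this write (and the posted writes ahead of it) actually reach
     * the board, whatever flushing the platform requires. */
    WRITE_REGISTER_ULONG(&hdr->ProducerIndex, NewProducerIndex);
}
```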
What? You say you’ve ignored the HAL’s model and accessed device memory using RtlMoveMemory(...), RtlCopyMemory(...), or memmove/memcpy? But, you say, your device works! Sure it does. Well, it works most of the time. Except in those very rare cases when the CPU is fast, and the bus and processor are stressed, and… who knows what else. I don’t know about you, but I’m not spending my weekend studying the latest x86 architecture developments. I’m going to rely on the HAL.
Expert In Madison and Deerfield?
Finally, there’s the issue of “future” processors. I’m thinkin’ that I’m not really that much of an expert on Madison and Deerfield yet. How about you?
In case, like me, you’re not one of the “in crowd”: I searched the Intel website and it seems that Madison and Deerfield are the successors to Itanium and McKinley. They’re IA64 processors due for release in, oh, 2002 or so. And I don’t have the foggiest notion of how they’re going to work, beyond the fact that they’re 64-bit processors. Or what their architecture features are going to be. Or those of their support chipsets. Or whether any of those features will impact driver writers.
So, to be on the safe side and to protect my driver development investment, I’m sticking to using the HAL’s processor and platform abstractions. That way, I’ll just (you already know what I’m going to say) “Let the HAL worry about it.”
In For a Penny...
So, you can be sure your drivers are both architecturally correct and as future-proof as they can be by following the HAL guidelines. Utilize the published DMA model. Only access hardware with the HAL functions. And scrupulously follow all the other HAL processes. Yes, that even means calling KeFlushIoBuffers(...) (with the correct parameters – this is one of my favorite macros) before you set up those DMA transfers.
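For the record, here’s a hedged sketch of that KeFlushIoBuffers(...) call with its parameters spelled out; the wrapper function and its arguments are my own, assumed to come from the driver’s dispatch path:

```c
/*
 * Sketch only: flushing processor cache state for an MDL before
 * starting a DMA transfer.  On many x86 HALs this macro expands to
 * nothing -- which is exactly why you let the HAL decide, instead of
 * hard-coding that assumption into your driver.
 */
#include <wdm.h>

VOID
PrepareForDma(PMDL Mdl, BOOLEAN WriteToDevice)
{
    /* ReadOperation is from memory's point of view: TRUE when the
     * device will be writing INTO memory (a device-to-memory read
     * I/O), i.e. the opposite of WriteToDevice. */
    KeFlushIoBuffers(Mdl,
                     (BOOLEAN)!WriteToDevice, /* ReadOperation */
                     TRUE);                   /* DmaOperation  */
}
```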
Lest you think I’ve lost my mind and become like one of the Stepford wives, let me assure you: There is plenty of room to innovate, push the envelope, maximize performance, and garner unfair advantages over your competitors, all while still slavishly adhering to the HAL’s processor abstraction and DMA model. The key is in understanding how the HAL works, and what you can safely “get away with” within the model when you need to push the conventional limits. This is all a bit much to go into here. But, until I have the time to write it up for an article in The NT Insider, you can find out this information in our Advanced Drivers seminar.
So, when you’re doing the next revision of your driver, spend some time checking to see that you’re doing things the HAL’s way. Heck, the changes you need will likely be as simple as calling AllocateCommonBuffer(...) (good) instead of MmAllocateContiguousMemory(...) followed by MmGetPhysicalAddress(...) (very bad). And if that’s all it takes to make the HAL happy, I don’t see why anyone should complain.
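To close, here’s a hedged sketch of that “good” path: getting a common buffer, and the device logical address to program into your hardware, from the Win2K DMA adapter object. The wrapper function is hypothetical; DmaAdapter is assumed to have been obtained earlier via IoGetDmaAdapter(...).

```c
/*
 * Sketch only: allocating a DMA common buffer through the adapter
 * object.  The returned LogicalAddress is what you program into the
 * device -- and on a platform like Figure 2, it is NOT the same as
 * the physical address of the buffer, which is the whole point.
 */
#include <wdm.h>

PVOID
GetSharedDmaBuffer(PDMA_ADAPTER DmaAdapter, ULONG Length,
                   PHYSICAL_ADDRESS *DeviceLogicalAddress)
{
    return DmaAdapter->DmaOperations->AllocateCommonBuffer(
        DmaAdapter, Length, DeviceLogicalAddress,
        FALSE /* CacheEnabled */);
}
```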