Although Windows is designed to run on multiple platforms, the reality is that most of us are running 32-bit x86-based systems. As we saw in a recent issue of The NT Insider (“Don’t Call Us – Calling Conventions for the x86”, V10N1), having a strong understanding of how call frames are built on the x86 can be indispensable when trying to decipher a crash dump, especially one that involves interaction with components that you do not have symbols for. When writing a driver you become part of the operating system, and this is one example of how a strong knowledge of the platform on which the operating system runs can do nothing but help you in your development and debugging experience.
Remember, you should always write your code to be platform independent by using the HAL. However, it doesn’t necessarily follow that you should be ignorant of the platform architecture on which you’re running. The following article is the first in a series that will explore the inner workings of some platform specific topics that we find interesting and that we think every driver writer should know a little bit more about. We assume you know the basics about devices, and also something about Windows drivers: Interrupt Service Routines, IRQLs, and other associated stuff.
Without input and output devices there really wouldn’t be much work for a CPU to do (duh! A computer without input or output wouldn’t really be that interesting). When a device changes its state, either because it has data to transfer or because of some other external situation that needs attention, it sets some device-specific bit in some device-specific hardware register. In order to detect this device state change, a driver could repeatedly test (poll) the bit. But that's not too efficient. Another alternative is that the device can asynchronously notify the system when its state changes (and hence its device-specific bit gets set) by generating an interrupt. By using interrupts, we can essentially ignore devices when they don’t have any work for us to do and therefore greatly improve overall system performance by not wasting cycles looking for things to do.
Because interrupts are such an integral part of devices and device driver development, we’re going to explore the inner workings of the interrupt. At the same time, we’ll hopefully demystify the process of how we get from a device generating an interrupt to a device driver being notified about it. We’re going to do this from the point of view of a driver writer, and that means we’re going to take some liberties with the infinite levels of hardware detail. So, if you’re a hardware person and wiring up interrupt controllers is near and dear to your heart, please don’t write to us and complain that we left out the details such as how the 8259 is different from the 8259A-2, when and how you need to program OCW4, or what data is exchanged when the CPU asserts INTA the second time. It’s not that we don’t know (well, not all the time, anyways). It’s that for the purposes of this article, we don’t care.
Interrupt Descriptor Table
To understand interrupts, it’s important to understand the “Interrupt Descriptor Table” (IDT). A trivial definition of the IDT is that it is an array of function pointers. The function that each array element points to is either an interrupt handler (or “Interrupt Service Routine” as we’d say in the driver world) or an exception handler. Today we’re only concerned with interrupts, so we’ll ignore the exception handlers. The IDT is indexed by “interrupt vector”, which is a UCHAR value. Note that this restricts the total number of interrupt and exception handlers to 256. Each CPU has its own IDT, and they can be viewed with the WinDBG command !kdex2x86.idt. The listing below shows a portion of the output of this command from one of my test systems.
1: kd> !kdex2x86.idt
IDT for processor #0
dd: 80ac2ac2 (nt!_KiUnexpectedInterrupt173)
de: 80ac2acc (nt!_KiUnexpectedInterrupt174)
df: 80ac2ad6 (nt!_KiUnexpectedInterrupt175)
e0: 80ac2ae0 (nt!_KiUnexpectedInterrupt176)
e1: 804e0084 (HAL!HalpIpiHandler)
e2: 80ac2af4 (nt!_KiUnexpectedInterrupt178)
e3: 804dfdd8 (HAL!HalpLocalApicErrorService)
e4: 80ac2b08 (nt!_KiUnexpectedInterrupt180)
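To make the “array of function pointers” description concrete, here’s a minimal C sketch of the idea. It’s a toy model only: a real IDT entry is an 8-byte gate descriptor rather than a plain pointer, and the names idt_init, dispatch and last_handled are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler signature; a real IDT entry is a gate descriptor,
   not a plain C function pointer. */
typedef void (*handler_fn)(void);

#define IDT_ENTRIES 256            /* a UCHAR vector can index 0..255 */

static handler_fn idt[IDT_ENTRIES];
static int last_vector_handled = -1;   /* bookkeeping for the illustration */

static void unexpected_interrupt(void) { last_vector_handled = -2; }
static void ipi_handler(void)          { last_vector_handled = 0xe1; }

/* Fill every slot with a default handler, then install one "real" one,
   loosely mirroring the mix of entries in the debugger output. */
void idt_init(void)
{
    for (size_t i = 0; i < IDT_ENTRIES; i++)
        idt[i] = unexpected_interrupt;
    idt[0xe1] = ipi_handler;
}

/* What the CPU conceptually does with a vector: index the table and call. */
void dispatch(unsigned char vector)
{
    assert(idt[vector] != NULL);   /* idt_init() must have run first */
    idt[vector]();
}

/* Reports which handler ran last: 0xe1 for the IPI handler, -2 for the default. */
int last_handled(void)
{
    return last_vector_handled;
}
```

Because the index is an unsigned char, there is no way to ask for entry 256, which is exactly the 256-handler limit mentioned above.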
So, when a device generates an interrupt, how does that interrupt result in a particular vector being used to index into a particular IDT on a particular CPU, to call a particular Interrupt Service Routine? Well, that’s where the Interrupt Controller hardware enters the picture. The remainder of this article is going to look at the specific implementations of two Intel “Programmable Interrupt Controllers” (PICs), the traditional 8259 PIC and the later and greater “Advanced PIC” (APIC). We’ll also delve a bit into how Windows specifically handles interrupts, so that we can see how the PIC and APIC are used in the real world.
The original IBM PC contained an Intel 8259 PIC. The 8259 only supports uniprocessor systems and only supports handling interrupts from up to eight devices by supplying eight “Interrupt Request Lines” (IRQs), labeled IRQ0-IRQ7, for devices to connect their interrupts to. Later, a second 8259 was added to the system by “cascading” it to the primary 8259 through the three cascade lines (CAS0-CAS2). It was also chained through its INT line to the primary 8259’s IRQ2, adding another set of eight IRQs but losing one for the chaining, leaving a total of fifteen IRQs. Figure 1 is a simplified diagram of the arrangement of the 8259s.
All modern motherboards are now required to either contain two physical 8259 chips wired in this way, or to virtualize the presence of the two chips within their hardware, no matter what other PIC hardware they may have. Now that the history lesson is over, we can get a bit deeper into the details.
In our current simplified view of the world, each device that wants an interrupt will “claim” one unique IRQ and will then be connected to the 8259. To generate an interrupt at a particular IRQ, the device asserts the associated IRQ line on the bus.
The 8259 classifies each IRQ as a priority, starting with IRQ0 as the highest priority and going downwards in priority as the numbers grow higher. So IRQ0 is the highest and IRQ15 is the lowest, right? Well, no, not really. Remember how the second 8259 is wired into IRQ2? The actual progression of IRQ priority is:
IRQ0, IRQ1, (now off to the second 8259) IRQ8-IRQ15, (now back to the first 8259) IRQ3-IRQ7
Not entirely obvious at first glance, but it sort of kind of almost makes sense if you tilt your head and squint when you look at it.
Each one of the IRQs is individually “maskable”, meaning it can be programmatically disabled via the 8259’s “Interrupt Mask Register” (IMR). If an IRQ is masked, interrupt requests from any device connected to that IRQ are ignored. Also, higher priority IRQs will be serviced before lower priority IRQs and higher priority IRQs can “interrupt” lower priority IRQs. So if an IRQ1 request is being serviced and an IRQ0 request comes in, the processing of the IRQ1 interrupt is stopped and the IRQ0 interrupt is sent to the CPU.
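The priority progression and the masking rule can be modelled in a few lines of C. This is a sketch under assumptions, not how the chips are really programmed: the two 8-bit IMRs are folded into one hypothetical 16-bit map, and the function names are ours.

```c
#include <assert.h>
#include <stdint.h>

/* Service order from the text: IRQ0, IRQ1, IRQ8-IRQ15, IRQ3-IRQ7.
   IRQ2 is consumed by the cascade, so it never appears as a device IRQ. */
static const int service_order[] = {
    0, 1, 8, 9, 10, 11, 12, 13, 14, 15, 3, 4, 5, 6, 7
};
#define ORDER_LEN ((int)(sizeof service_order / sizeof service_order[0]))

/* Rank 0 is the highest priority; returns -1 for IRQ2 or an invalid IRQ. */
int irq_rank(int irq)
{
    for (int i = 0; i < ORDER_LEN; i++)
        if (service_order[i] == irq)
            return i;
    return -1;
}

/* pending and imr are 16-bit maps where bit n stands for IRQn (an assumed
   representation; each real 8259 has its own 8-bit IMR). A set bit in imr
   means that IRQ is masked. Returns the IRQ that would be presented to the
   CPU next, or -1 if nothing is deliverable. */
int next_irq(uint16_t pending, uint16_t imr)
{
    uint16_t deliverable = (uint16_t)(pending & ~imr);
    for (int i = 0; i < ORDER_LEN; i++)
        if (deliverable & (1u << service_order[i]))
            return service_order[i];
    return -1;
}

/* Nonzero if a request on `incoming` would interrupt servicing of `current`. */
int preempts(int incoming, int current)
{
    assert(irq_rank(incoming) >= 0 && irq_rank(current) >= 0);
    return irq_rank(incoming) < irq_rank(current);
}
```

With IRQ3 and IRQ9 both pending and nothing masked, next_irq picks IRQ9, because the second 8259 sits in the IRQ2 slot. Mask IRQ9 and IRQ3 is chosen instead.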
It’s important to note here that although this hardware priority does exist, a Windows system utilizing an 8259 doesn’t actually use it. Windows will impose its own priority scheme on the 8259 by directly manipulating the IMR (see sidebar at the conclusion of the article).
Now that we know how the devices are connected to the 8259 and how the 8259s are connected together, how is the 8259 connected to the CPU?
An x86 architecture CPU has two interrupt lines, LINT0 and LINT1. In an 8259 configuration, LINT1 is wired to the “Non-Maskable Interrupt” (NMI). An NMI is generated when serious, potentially non-recoverable errors are detected. It is called the “Non-Maskable Interrupt” because there is no way to prevent it from being delivered: it can never be “masked off” by anything but the processor itself. An example of an event that can typically cause an NMI is a memory parity error.
LINT0 is used as the “Interrupt Input Line” (INTR), which is connected to the INT pin on the primary 8259. The 8259 asserts INT to notify the CPU of an interrupt. When the CPU acknowledges the interrupt, the 8259 sends an 8-bit value (which was previously programmed into the PIC by the O/S) over the data bus to the CPU. This 8-bit value is the interrupt vector associated with the IRQ. The interrupt vector is used to index into the IDT to determine the starting address of the Interrupt Service Routine (ISR). The CPU then jumps to the ISR, which does whatever processing is necessary to service the interrupt.
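One hardware detail the description above skips: the O/S does not program a separate vector for every IRQ. On the 8259A it programs a vector base per chip, and the chip supplies the low three bits from the input line that fired. Here’s a small sketch of that arithmetic. The function names are ours, and the bases 0x30 and 0x38 mentioned afterwards are example values only, not a claim about what any particular HAL uses.

```c
#include <assert.h>
#include <stdint.h>

/* Vector presented by one 8259: the programmed base supplies the upper
   five bits, the input line number (0-7) supplies the lower three. */
uint8_t pic_vector(uint8_t base, unsigned line)
{
    assert(line < 8);
    return (uint8_t)((base & 0xF8u) | line);
}

/* For the cascaded pair: IRQ0-7 live on the primary chip, IRQ8-15 on the
   secondary. Both bases are parameters because the O/S picks them. */
uint8_t irq_to_vector(unsigned irq, uint8_t primary_base, uint8_t secondary_base)
{
    assert(irq < 16);
    return irq < 8 ? pic_vector(primary_base, irq)
                   : pic_vector(secondary_base, irq - 8);
}
```

For instance, with a primary base of 0x30 and a secondary base of 0x38, IRQ1 would arrive as vector 0x31 and IRQ12 as vector 0x3C.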
Well, this doesn’t seem so bad. But what happens if you have more than fifteen devices that need interrupts? You’re going to need to start sharing interrupts, which is where the “chaining” of ISRs comes into play. In this scenario, the OS calls the first ISR registered for a given vector to notify it of the interrupt. Assuming the interrupt is a level-triggered interrupt (such as line-based interrupts on the PCI bus), the OS will call all ISRs associated with the given interrupt vector until it finds one that returns TRUE, indicating that the interrupt was for that device. Sharing interrupts can be problematic for several reasons, not the least of which is that a poorly written ISR could potentially hang the system. In fact, even a well-written ISR can hang the system. For an explanation of the issues involved you should read the article located at http://www.microsoft.com/hwdev/platform/proc/apic.asp. Also, the 8259 will not work in multiprocessor systems, which immediately makes it obsolete because nowadays most of us are running dual procs on our desktops. Enter the APIC.
What’s commonly referred to as the APIC is actually made up of two components: The “Local APIC” (LAPIC) and the “I/O APIC” (IOAPIC). Each (logical) CPU in the system will typically have an on-chip LAPIC. So, if your system has four CPUs you will have four LAPICs (note that because there is one LAPIC per logical processor, you will also have four LAPICs if you have two HyperThreaded processors with two logical processors each). The IOAPIC is part of the Intel system support chipset, and a system with a Pentium IV or later can have any number of IOAPICs. Processors prior to the Pentium IV were limited to a total of eight. An IOAPIC can be designed to support up to 64 “Interrupt Input Lines” (INTINs), but most standard systems contain IOAPICs with 24 INTINs. The INTINs serve the same purpose as the IRQs in the 8259. In other words, every device that wants to generate interrupts will claim an INTIN. As you can see, in an APIC configuration you can end up with enough INTINs to eliminate the interrupt sharing problem that was present in the 8259. Well, for the time being at least. Let's look at the two APIC chip components in detail.
As I mentioned earlier, the LAPIC is typically located on the actual CPU and has been present on all Intel CPUs since the Pentium. Processors from other vendors may or may not contain LAPICs as part of the CPU, incidentally. Anyhow, remember those LINT0 and LINT1 lines? Well in an APIC configuration these lines are connected to the LAPIC, which is then connected to the system bus (Intel processors before the Pentium IV actually had a separate APIC bus, but that’s one of those hardware details we’re going to ignore). All processors in the system are configured this way, which allows for "Interprocessor Interrupts" (IPIs) to be sent from one LAPIC across the system bus to another CPU’s LAPIC, which you can see in Figure 2.
Figure 2 – The LAPIC and IOAPIC
An important register to note in the LAPIC is the “Task Priority Register” (TPR). By setting the TPR to an appropriate value, the operating system can set the “priority level” at which a CPU is running. These priority levels can be anywhere from 0-15, with zero being the lowest priority. When an interrupt is delivered to the LAPIC from the IOAPIC for an INTIN with a vector in the range of 16-255, its priority is determined by the following algorithm:
priority = floor(vector / 16)
Vectors in the range of 0-31 have predefined uses in the x86, so OS defined device interrupts start after 31. When an interrupt is requested, if the resulting priority is less than or equal to the current TPR value in the LAPIC of a destination processor, the interrupt is not made active on the CPU. This allows the OS to control its interrupt priorities by controlling the vectors that it assigns to interrupt handlers in the IDT and by writing appropriate values out to the TPR in the LAPIC when executing these handlers. The higher the vector the OS assigns to an interrupt, the higher the interrupt’s priority.
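Here’s the TPR check as a small C sketch. It assumes the priority of a vector is its upper four bits (vector divided by 16, rounded down), which is how the Intel manuals define the priority class. The function names are made up for the example, and the TPR is reduced to a plain 0-15 level.

```c
#include <assert.h>
#include <stdint.h>

/* Priority of a vector: its upper four bits, i.e. vector / 16 rounded down. */
unsigned vector_priority(uint8_t vector)
{
    return (unsigned)vector >> 4;
}

/* tpr_priority is the 0-15 level the O/S has written to the TPR.
   Returns nonzero if the LAPIC would make the interrupt active on this CPU:
   the interrupt's priority must be strictly greater than the TPR level. */
int lapic_accepts(uint8_t vector, unsigned tpr_priority)
{
    assert(tpr_priority <= 15);
    return vector_priority(vector) > tpr_priority;
}
```

So a device on vector 0x42 has priority 4. It gets through while the TPR holds 3, and it is held off once the TPR is raised to 4 or higher.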
In order to keep our conversation relatively simple, we’ll only discuss a system with a single IOAPIC that has 24 INTINs. The IOAPIC is also connected to the system bus, but not directly. It is actually connected to a bridge (also part of the Intel system support chipset) which in turn is connected to the system bus (see Figure 3). Just as in the 8259 case, all the devices in the system that require interrupts are connected to one of the INTIN lines on the bus. But, unlike the 8259, there is no implied priority associated with each INTIN. Remember, with the APIC, priority is handled by the vector number and the TPR in the LAPIC.
In the IOAPIC, there is a register called the “I/O Redirection Table” (IOREDTBL). There is one 64-bit IOREDTBL entry for each INTIN and each entry describes the interrupt in gory detail. Among other things, the IOREDTBL entry describes the processor affinity for the interrupt, whether the interrupt is edge or level triggered, the interrupt polarity, whether or not the interrupt is masked, and the vector associated with the interrupt. Of course, the OS is responsible for filling in this table with appropriate values to match the IDT and the characteristics of the devices that are connected to the particular INTINs.
There are some interesting issues about the IOREDTBL, such as the fact that the bit mask that it uses to describe the CPU affinity for each interrupt is only 8 bits wide. Sigh! Let’s not worry about those complications in this article. We’ll limit our discussion to systems with 8 or fewer CPUs.
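For the curious, here’s a sketch of how such a 64-bit entry could be packed. The bit positions follow the Intel 82093AA I/O APIC datasheet; the helper names are ours, the delivery mode is left at zero (“fixed”), and the destination is treated as a logical-mode CPU bit mask, which is the case the 8-bit affinity remark refers to.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions per the Intel 82093AA I/O APIC datasheet. */
#define IOREDTBL_VECTOR_MASK    0xFFull
#define IOREDTBL_DEST_LOGICAL   (1ull << 11)  /* destination is a CPU bit mask */
#define IOREDTBL_POLARITY_LOW   (1ull << 13)  /* set = active low              */
#define IOREDTBL_TRIGGER_LEVEL  (1ull << 15)  /* set = level, clear = edge     */
#define IOREDTBL_MASKED         (1ull << 16)  /* set = interrupt masked        */
#define IOREDTBL_DEST_SHIFT     56            /* 8-bit destination field       */

/* Build one redirection entry. cpu_mask has one bit per CPU, which is why
   this scheme tops out at eight processors. */
uint64_t ioredtbl_make(uint8_t vector, int level, int active_low,
                       int masked, uint8_t cpu_mask)
{
    assert(vector >= 0x10);        /* vectors 0-15 are not valid APIC vectors */
    uint64_t entry = vector;
    entry |= IOREDTBL_DEST_LOGICAL;
    if (active_low) entry |= IOREDTBL_POLARITY_LOW;
    if (level)      entry |= IOREDTBL_TRIGGER_LEVEL;
    if (masked)     entry |= IOREDTBL_MASKED;
    entry |= (uint64_t)cpu_mask << IOREDTBL_DEST_SHIFT;
    return entry;
}

uint8_t ioredtbl_vector(uint64_t entry)
{
    return (uint8_t)(entry & IOREDTBL_VECTOR_MASK);
}

int ioredtbl_is_masked(uint64_t entry)
{
    return (entry & IOREDTBL_MASKED) != 0;
}

uint8_t ioredtbl_dest(uint64_t entry)
{
    return (uint8_t)(entry >> IOREDTBL_DEST_SHIFT);
}
```

A level-triggered, active-low, unmasked interrupt on vector 0x42 that may go to CPU 0 or CPU 1 would be built with ioredtbl_make(0x42, 1, 1, 0, 0x03).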
Aside: PCI to IOAPIC
Before we discuss how interrupts are processed, we probably should take a quick detour to hand-wave over the issue of how the PCI bus is connected to the IOAPIC. This would make an interesting article in itself, but suffice it to say that each of the PCI Interrupt Request lines (PIRQxs) is connected to one of the IOAPIC INTIN lines. This could be a hard-wired connection, or a connection that’s made more dynamically via the BIOS.
Handling APIC-Based Interrupts In Windows
The details that I have outlined so far apply to any operating system written to take advantage of the PIC or APIC. All OSs will have an IDT, all of them will have to program the PIC to reflect the IDT and all of them will have to have ISRs that handle the interrupts. Because this is The NT Insider, this article just wouldn’t be complete unless we tied this all in to how Windows running on the x86 actually handles interrupts. Because the 8259 is going the way of the dodo, we’ll limit our discussion to the APIC.
OK, so what exactly happens when a device signals an interrupt? Say you have a level-triggered device that is connected to INTIN16 and assigned by the OS to have vector number 0x42. When the device wants to generate an interrupt, it will assert INTIN16 and then…
The assertion of the INTIN16 line will bring us into the IOREDTBL for INTIN16 on the IOAPIC. If the IOAPIC sees that your interrupt is not masked, it will generate a message on the system bus via the bridge chipset.
In our simple case, there’s luckily an idle CPU that sees the interrupt and then jumps to the interrupt handler found at index 0x42 of its IDT. Remember, it does not need to go back out to the IOAPIC to find out which vector the ISR is at like it did with the 8259, because the vector was part of the message sent out on the system bus.
Because it knows we’re handling an interrupt, the CPU will push the EFLAGS, CS and EIP registers (plus SS and ESP when the interrupt causes a change of privilege level) onto the stack for us before calling the Windows general interrupt dispatcher that’s pointed to by IDT[0x42].
The general Windows interrupt dispatcher saves the contents of the EBP, EAX, EBX, ECX, EDX, EDI, ESI, ES, DS, FS, and GS registers onto the stack.
To verify that the above steps are correct, we’ll set a breakpoint in one of the routines that the general dispatcher will call once it has pushed all the necessary registers. It turns out that there’s one routine that it will call for IDT entries that have chained ISRs, KiChainedDispatch, and one for non-chained, KiInterruptDispatch (these names were figured out by some not so clever disassembly: just set a breakpoint in your ISR and you’ll see one of them on the stack). A portion of the resulting stack follows:
nt!KiInterruptDispatch+0x89 (FPO: [0,2] TrapFrame @ f42f3c30)
HAL!KfLowerIrql+0x35 (FPO: [0,0,0])
nt!KeSetPriorityThread+0xc2 (FPO: [Non-Fpo])
nt!PspExitThread+0x9c (FPO: [Non-Fpo])
So what exactly is going on here? We can see one of the routines that the generic Windows interrupt handler calls, KiInterruptDispatch, but what’s that other stuff on the stack? Remember, an interrupt can happen at any time and this interrupt happened to occur when a thread was exiting. We didn’t directly talk about what happens to the interrupted code on the CPU when an interrupt fires, so now is as good a time as any. When the interrupt is delivered a “trap frame” is created, which stores the state the CPU was in before the interrupt occurred (“Aha! So that’s why the CPU pushed registers for me before it called my ISR!” you say…). In WinDBG, the “.trap” command will allow us to set our context to that of a supplied trap frame. Let’s do that now with the value next to TrapFrame on the KiInterruptDispatch line from our stack…
1: kd> .trap f42f3c30
ErrCode = 00000000
eax=00000000 ebx=00000000 ecx=00000000 edx=ffdff538 esi=81e32da8 edi=00000009
eip=804e049d esp=f42f3ca4 ebp=f42f3cb4 iopl=0 nv up ei pl zr na po nc
cs=0008 ss=0010 ds=01ff es=01ff fs=077a gs=7f30 efl=00000246
804e049d 3bc8 cmp ecx,eax
1: kd> kb
*** Stack trace for last set context - .thread/.cxr resets it
ChildEBP RetAddr Args to Child
f42f3ca0 80a35ad8 8214c228 81e32da8 00000000 HAL!KfLowerIrql+0x35
f42f3cb4 80bb295a 00e32da8 00000010 81e32da8 nt!KeSetPriorityThread+0xc2
f42f3d40 80bb3360 00000000 00000000 f6a79680 nt!PspExitThread+0x9c
Well what do you know, we’re exactly where we were before the interrupt occurred! So I guess when we return from the interrupt, if we restore the CPU context to that of this trap frame no one will be any the wiser.
Once the appropriate registers are stored and we’re in one of the KiXxx routines, Windows retrieves the priority of the interrupt. In Windows speak this is the “Interrupt Request Level” (IRQL – pronounced ER-QUEL) of the interrupt. The HAL decided when it assigned your device resources what IRQL your device’s interrupt would be assigned. Note that this has nothing to do with the physical INTIN line used by your device to generate the interrupt, and there is nothing you can do to influence it. Your Device IRQL (DIRQL) is related to the interrupt priority of your device (a higher IRQL means a higher interrupt priority) and hence to the value that will be stuffed into the LAPIC TPR.
The next thing for Windows to do then is to raise the IRQL of the current CPU (IRQL is a per-CPU concept) to the DIRQL that the HAL has assigned to you. After this, the CPU is in a state where it can only be interrupted by interrupts of a higher IRQL.
Now that we’re at DIRQL, we’re guaranteed to not be interrupted by the same interrupt becoming active on the current CPU and therefore the execution of the ISR is synchronized with respect to the current CPU. Windows must now acquire a lock to ensure that the execution of the ISR is synchronized with respect to other CPUs in the system. As there may be more than one ISR associated with this interrupt because of INTIN sharing, each ISR has its own lock that must be acquired before execution of the ISR can occur and subsequently released after the ISR has executed. Once Windows has acquired this lock we’re ready to execute the ISR.
For interrupts in Windows, each IDT vector will also have a list of PKINTERRUPT objects associated with it. Because this structure is opaque, you can’t access any of its fields from your driver, but you can dump it in the debugger with the “dt” command:
1: kd> dt nt!_KINTERRUPT
+0x000 Type : Int2B
+0x002 Size : Int2B
+0x004 InterruptListEntry : _LIST_ENTRY
+0x00c ServiceRoutine : Ptr32
+0x010 ServiceContext : Ptr32 Void
+0x014 SpinLock : Uint4B
+0x018 TickCount : Uint4B
+0x01c ActualLock : Ptr32 Uint4B
+0x020 DispatchAddress : Ptr32
+0x024 Vector : Uint4B
+0x028 Irql : UChar
+0x029 SynchronizeIrql : UChar
+0x02a FloatingSave : UChar
+0x02b Connected : UChar
+0x02c Number : Char
+0x02d ShareVector : UChar
+0x030 Mode : _KINTERRUPT_MODE
+0x034 ServiceCount : Uint4B
+0x038 DispatchCount : Uint4B
+0x03c DispatchCode : Uint4B
You can see from the structure definition that these are actually linked together with the InterruptListEntry field. Also, you’ll notice that from here you can see what vector your ISR was assigned, the corresponding DIRQL, and the spinlock that must be acquired before the ISR can be executed. Because we’ve already acquired this lock in the previous step, Windows will now start at the first PKINTERRUPT and call its ISR (the ServiceRoutine field of the structure).
The ISR will check its hardware to see if it’s interrupting, and we’ll say that in this case it is. The ISR will tell its device to stop interrupting, possibly queue a DpcForIsr, and return TRUE to indicate that the interrupt was handled. Remember, we’re level-triggered so processing can stop because we’ve found an ISR that will handle the interrupt. If the ISR had returned FALSE, Windows would assume that there is more than one device on the same INTIN, release the spinlock and move to the next PKINTERRUPT in the list. It would then grab its lock and call its ISR, repeating until it found someone that returned TRUE.
After the ISR has returned, Windows will proceed to do the following:
- Release the ISR lock
- Restore the IRQL of the CPU to the IRQL it was running at before the interrupt occurred
- Restore any previously saved registers and issue an “Interrupt Return” (IRET). The reason that we use an IRET here instead of a regular RET is that we are letting the CPU know that we’re returning from an ISR so that it will restore the registers it pushed on entry for us: EFLAGS, CS and EIP, plus SS and ESP if they were pushed (thus restoring the “trap frame” we discussed earlier).
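Pulling the last few steps together, the raise/lock/call/release/lower sequence might be sketched like this. To be clear, this is a toy model and not Windows source: interrupt_object is a hypothetical stand-in for the opaque KINTERRUPT, the “spin lock” is a plain flag, and current_irql is an ordinary variable rather than per-CPU state backed by the TPR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef bool (*isr_fn)(void *context);

/* Hypothetical stand-in for KINTERRUPT; only the fields the walkthrough
   mentions are modelled. */
struct interrupt_object {
    struct interrupt_object *next;   /* InterruptListEntry, simplified    */
    isr_fn service_routine;          /* the driver's ISR                  */
    void *service_context;
    int lock;                        /* 0 = free, 1 = held (toy spinlock) */
    unsigned char irql;              /* DIRQL assigned to this interrupt  */
};

static unsigned char current_irql;   /* per-CPU in real life */

/* Walk the chain for a level-triggered interrupt; stop at the first ISR
   that claims it. Returns true if some ISR handled the interrupt. */
bool dispatch_interrupt(struct interrupt_object *chain)
{
    assert(chain != NULL && chain->irql >= current_irql);
    unsigned char old_irql = current_irql;
    bool handled = false;

    current_irql = chain->irql;                    /* raise to DIRQL       */
    for (struct interrupt_object *io = chain;
         io != NULL && !handled; io = io->next) {
        assert(io->lock == 0);
        io->lock = 1;                              /* acquire the ISR lock */
        handled = io->service_routine(io->service_context);
        io->lock = 0;                              /* release the ISR lock */
    }
    current_irql = old_irql;                       /* restore the old IRQL */
    return handled;
}

/* Two toy ISRs: each counts its call through the context pointer. */
bool isr_not_mine(void *ctx) { ++*(int *)ctx; return false; }
bool isr_mine(void *ctx)     { ++*(int *)ctx; return true;  }

/* Usage: three devices share a vector and the second one is interrupting.
   Returns how many ISRs were called, or -1 if nobody claimed the interrupt. */
int demo_shared_vector(void)
{
    int calls = 0;
    struct interrupt_object c = { NULL, isr_mine,     &calls, 0, 9 };
    struct interrupt_object b = { &c,   isr_mine,     &calls, 0, 9 };
    struct interrupt_object a = { &b,   isr_not_mine, &calls, 0, 9 };

    current_irql = 2;
    bool handled = dispatch_interrupt(&a);
    assert(current_irql == 2);       /* IRQL is back where it started */
    return handled ? calls : -1;
}
```

In the demo the first ISR returns FALSE, the second returns TRUE, and the third is never called, which matches the level-triggered behaviour described above.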
And that’s it! We’ve just traced the handling of an interrupt from when it is first signaled on one of the IOAPIC’s INTINs by the device to the running of an associated driver’s ISR.
There’s so much more that could be said about the topics discussed in this article, but I guess I’ll have to wait for us to write The NT Insider Bathroom Bible before I can explain every last detail. If your interest at this point is piqued, I strongly suggest grabbing copies of the Intel architecture manuals and sitting down with your favourite debugger. You just never know what you might figure out!
WAIT! Doesn’t IRQ == Interrupt Priority?
Not in Windows it doesn’t, even if the hardware documentation tells you differently. Depending on your configuration, you are either using a programmable interrupt controller (PIC) or an advanced programmable interrupt controller (APIC). If you have read the article on interrupts in this issue, you already know that the INTIN lines of the APIC have no implied priority, but that the IRQ lines of the PIC do have an implied priority. This would probably lead you to believe that your device’s IRQ (and therefore your PIC priority) has a direct relation to your IRQL (and thus your device’s interrupt priority). But, you’d be wrong. This is a common misconception and it’s time for everyone out there to forget it! Where, how, when, and why your device connects to a PIC makes absolutely no difference in terms of your device’s priority in Windows.
For example, say you’re running on a system with a PIC. Device X connects to IRQ3 and device Y connects to IRQ7. If Windows just let the PIC be, X’s interrupts would always take precedence over Y’s because its IRQ has a higher (hardware interrupt) priority. However, Windows does not use the implied hardware interrupt priorities in the PIC to prioritize its interrupts. So, it could very well be that IRQ7 has a higher IRQL than IRQ3 on your system.
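One way to picture how software can override the PIC’s built-in ordering is a mask computed from IRQLs instead of IRQ numbers. The sketch below is only an illustration of that idea and not the HAL’s actual code: the IRQL values 5 and 9 are made up, the two IMRs are folded into one 16-bit image, and the cascade line is ignored for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_IRQS 16

/* irql_of_irq[n] is the IRQL the O/S assigned to IRQn (0 = unused).
   Returns the IMR image (bit n set = IRQn masked) to load while the CPU
   runs at `irql`: every IRQ whose assigned IRQL is less than or equal to
   the current IRQL gets masked, whatever its IRQ number is. */
uint16_t imr_for_irql(const uint8_t irql_of_irq[NUM_IRQS], uint8_t irql)
{
    assert(irql_of_irq != NULL);
    uint16_t imr = 0;
    for (int irq = 0; irq < NUM_IRQS; irq++)
        if (irql_of_irq[irq] <= irql)
            imr |= (uint16_t)(1u << irq);
    return imr;
}

/* The example from the text, with made-up IRQLs: device X on IRQ3 gets
   IRQL 5 and device Y on IRQ7 gets IRQL 9. */
uint16_t demo_imr(uint8_t irql)
{
    uint8_t table[NUM_IRQS] = { 0 };
    table[3] = 5;    /* device X */
    table[7] = 9;    /* device Y */
    return imr_for_irql(table, irql);
}
```

While X’s ISR runs at IRQL 5, IRQ3 is masked but IRQ7 is not, so Y can interrupt X even though the hardware would have ranked IRQ3 above IRQ7.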
On Windows, the IRQ of your device never dictates the associated level of urgency of your device’s interrupt. Period. Full stop. And this is true whether you’re running on a system with a PIC or you’re on a system with an APIC. What can you do to influence this? Well, really, nothing (short of writing your own HAL, and don’t even try to go there, my friend). It’s just the way Windows works.