The NT Insider: Tools of the Trade - A Catalog of Synchronization Mechanisms

Although Windows NT provides a very rich set of synchronization primitives, we’ve found that device driver developers are usually only aware of a few. Spin locks and Events are the tools that driver writers typically stick in their synchronization tool kit, and many times they have relied on them for so long that they have forgotten what other nifty synchronization pliers, lug wrenches and drill bits NT makes available. Always relying on spin locks to protect access to shared data may be overkill, and if you don’t know what other synchronization mechanisms exist, you won’t know if you’ve developed the best synchronization architectures for your projects. This article will review NT’s major synchronization objects, including spin locks, dispatcher objects, executive mutexes and resources, and discuss when and why you should use one type and not another, as well as point out the cases where you can’t use certain types.

If you’ve written a device driver and you didn’t just blindly cut-and-paste code from a DDK example or the company driver template that was written by Joe (whatever happened to him?) a few years back, then you already have an idea of what spin locks are. Spin locks are commonly used in drivers to protect data that will be accessed by multiple driver routines running at varying IRQLs. Perhaps the most frequent function served by spin locks is to protect data that can be referenced from a driver’s dispatch functions and its DPC(s). First, let’s review the theory and implementation of spin locks, and then we’ll describe NT specifics.

Spin locks are the simplest of locking mechanisms. If a spin lock has been acquired, a second attempt to acquire the spin lock will result in the acquiring thread busy-waiting until the spin lock is released. The basic spin lock algorithm is:
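A minimal C sketch of that algorithm, assuming the lock is a plain integer where 0 means free and 1 means held. The names `naive_lock_t`, `naive_acquire`, and `naive_release` are illustrative, not NT routines, and the sketch is deliberately naive:

```c
/* Naive spin lock: 0 = free, 1 = held. Illustrative only; the test and
 * the set are two separate steps, so this is NOT safe on a multiprocessor. */
typedef volatile int naive_lock_t;

void naive_acquire(naive_lock_t *lock)
{
    while (*lock != 0) {
        /* busy-wait: keep re-reading the lock word until it is free */
    }
    *lock = 1;  /* claim the lock (a separate step from the test above) */
}

void naive_release(naive_lock_t *lock)
{
    *lock = 0;  /* mark the lock free again */
}
```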

The term busy-wait is used to describe the while-loop, which has the thread that is trying to acquire the spin lock continuously check its status in a tight loop. The code snippet shown would compile to assembly language like this:

and is not sufficient to correctly implement a spin lock on a multiprocessor. Consider the following sample execution, where the interleaving of instruction executions (which is arbitrary on an MP machine) of two threads acquiring the same unlocked spin lock provokes a subtle bug in the code:
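One such interleaving can be replayed deterministically. The sketch below is a single-threaded simulation, with hypothetical names, that splits the naive acquire into its test step and its set step and runs them in the problematic order for two threads, A and B:

```c
#include <stdbool.h>

static int lock_word = 0;           /* 0 = free, 1 = held */

/* Step 1 of the naive acquire: test whether the lock looks free. */
static bool step_test(void) { return lock_word == 0; }

/* Step 2 of the naive acquire: mark the lock as held. */
static void step_set(void) { lock_word = 1; }

/* Replays the order: A tests, B tests, A sets, B sets.
 * Returns how many threads believe they own the lock. */
int replay_bad_interleaving(void)
{
    int owners = 0;
    lock_word = 0;

    bool a_saw_free = step_test();   /* thread A: lock is free */
    bool b_saw_free = step_test();   /* thread B: lock is still free */

    if (a_saw_free) { step_set(); owners++; }  /* A takes the lock */
    if (b_saw_free) { step_set(); owners++; }  /* B takes it too */

    return owners;
}
```

Because B performed its test before A performed its set, both threads pass the test and the function reports two owners.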

In this case, both threads have acquired the spin lock, which will almost certainly lead to major trouble down the road. A correct spin lock implementation requires either a complex software algorithm, or a simple atomic test-and-set instruction implemented by the CPU. You can guess which one NT uses. An atomic test-and-set instruction allows a thread to simultaneously read and modify a value, without fear of the two operations being interrupted by an instruction executing on another CPU. Atomic test-and-set instructions come in a variety of flavors on different CPUs. The x86 actually has several test-and-set operations, any one of which can be made atomic by prefixing it with a LOCK byte.

Because an MP bus must be locked whenever an atomic instruction is executed, these instructions are somewhat expensive. Spin lock implementations usually try to minimize bus locking by actually using two loops: an outer one that performs the atomic operation, and an inner loop that simply tests the lock’s state.

Pseudocode for NT’s lock acquiring code is below, with the ‘&’ representing an atomic combination of the test-and-set.
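The same two-loop structure can also be sketched with C11 atomics. This is not NT's actual code: `ttas_lock_t` and the function names are hypothetical, and `atomic_exchange` stands in for the CPU's atomic test-and-set.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int held; } ttas_lock_t;   /* 0 = free, 1 = held */

/* One atomic test-and-set attempt: returns true if we took the lock. */
bool ttas_try_acquire(ttas_lock_t *lock)
{
    return atomic_exchange(&lock->held, 1) == 0;
}

void ttas_acquire(ttas_lock_t *lock)
{
    /* Outer loop: the expensive atomic (bus-locking) operation. */
    while (!ttas_try_acquire(lock)) {
        /* Inner loop: plain reads only, until the lock looks free. */
        while (atomic_load_explicit(&lock->held, memory_order_relaxed) != 0) {
            /* spin */
        }
    }
}

void ttas_release(ttas_lock_t *lock)
{
    atomic_store(&lock->held, 0);
}
```

The inner loop only reads, so a waiting CPU issues the locked operation again only once the lock has been observed free.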

Because a spin lock can only be acquired once, a thread cannot try to reacquire a spin lock (known as recursive lock acquisition) without deadlocking itself. Deadlock is a term that describes a condition where no forward progress is possible, and is applicable in such cases because the thread will end up waiting for itself to release the lock, which of course it never will (it’ll be too busy spinning!).

Another form of deadlock occurs when two threads each hold locks that the other is trying to acquire. This is demonstrated in the following sample execution:

This type of deadlock is commonly referred to as a "deadly embrace", and can involve any number of threads, as long as a locking dependency exists that will never be resolved.

While the NT DDK consistently warns driver writers away from ever acquiring more than one spin lock at a time, as the example threads attempt to, having different locks protect various data structures can improve a driver’s performance, and simplify its design (or not).

For example, let’s say you have two data structures in your driver, and most of the time threads will process items on only one of them. Occasionally however, a thread will have to move an item from one data structure to the other. The easy approach of having one spin lock protect both structures is clearly not the most efficient solution if threads can be working in parallel on each structure. To enhance parallelism, the driver should protect each structure with its own spin lock. If the data transfer is always in the same direction, that’s all there is to it. But what happens when the data transfer can go in either direction? The driver could end up in the deadlock example shown above.

Deadlock prevention, avoidance, and detection have been the subject of much academic research. As it turns out, the easiest approach to dealing with deadlock is also the most efficient. First, take all the locks that can be acquired at one time and put them in a list starting with the ones that will be accessed most frequently on the left. Then use the following rule: when your driver needs to acquire more than one lock in the list, make sure that it acquires the locks in left-to-right order. That’s it. What you’ve done when you made the list is impose a locking hierarchy on your driver. If all threads in your driver obtain the locks in the correct order it is impossible for a deadlock condition to be created.

In the example above with two data structures that can have data transferred from one to the other and vice versa, the structures would be ordered according to frequency of use. When an inter-structure data transfer is required, a thread must make sure it acquires the lock on the left before acquiring the lock on the right.
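The ordering rule can be sketched in portable C. Here each lock carries a hypothetical `rank` field giving its position in the list (a lower rank is further left), and a helper always takes the lower-ranked lock first regardless of the order in which the caller names them. The lock itself is a simple C11 `atomic_flag` spin lock standing in for an NT spin lock; none of these names come from the DDK.

```c
#include <stdatomic.h>

typedef struct {
    atomic_flag flag;   /* the lock itself */
    int rank;           /* position in the hierarchy; lower = acquire first */
} ranked_lock_t;

static void ranked_acquire(ranked_lock_t *l)
{
    while (atomic_flag_test_and_set(&l->flag)) { /* spin */ }
}

static void ranked_release(ranked_lock_t *l)
{
    atomic_flag_clear(&l->flag);
}

/* Acquire two distinct locks in hierarchy order, whatever order the caller
 * passes them in. Returns the lock that was taken first. */
ranked_lock_t *acquire_pair(ranked_lock_t *x, ranked_lock_t *y)
{
    ranked_lock_t *first  = (x->rank < y->rank) ? x : y;
    ranked_lock_t *second = (first == x) ? y : x;
    ranked_acquire(first);
    ranked_acquire(second);
    return first;
}

/* Release in the reverse of the acquisition order. */
void release_pair(ranked_lock_t *x, ranked_lock_t *y)
{
    ranked_lock_t *first  = (x->rank < y->rank) ? x : y;
    ranked_lock_t *second = (first == x) ? y : x;
    ranked_release(second);
    ranked_release(first);
}
```

Routing every two-lock acquisition through one helper means the hierarchy is enforced in a single place rather than by convention at each call site.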

Busy-waiting on a uniprocessor is an expensive proposition. It wastes precious CPU cycles that could otherwise be used for productive work. Like virtually all uniprocessor operating systems, the uniprocessor version of NT disables thread preemption while a spin lock is held, rather than having to worry about busy-waits sucking up CPU cycles.

On NT, every spin lock has associated with it an IRQL that is at least DISPATCH_LEVEL. When a thread acquires a spin lock, the IRQL is raised to that of the spin lock. Because NT dispatcher (scheduler) preemption is disabled at IRQLs of DISPATCH_LEVEL and above, a thread that has acquired a spin lock will not be preempted. Further, a thread owning a spin lock that has an IRQL in the DIRQL range will not be interrupted by interrupts associated with IRQLs less than or equal to the spin lock DIRQL.

What does this accomplish? It means that on a uniprocessor there is no need for the busy wait part of the spin lock code shown earlier, because spin locks are implemented by NT’s IRQL architecture! Thus, on a uniprocessor, NT’s spin lock acquire pseudocode looks like:
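As a rough model only, with the processor's IRQL reduced to a plain variable and hypothetical names throughout, the uniprocessor acquire and release reduce to saving and restoring the IRQL. The numeric levels below are illustrative constants defined locally, not taken from the DDK headers:

```c
/* Toy model of uniprocessor spin lock behaviour: no lock word, no spinning. */
enum { MODEL_PASSIVE_LEVEL = 0, MODEL_DISPATCH_LEVEL = 2 };

static int model_current_irql = MODEL_PASSIVE_LEVEL;

/* "Acquire": raise the IRQL to the lock's level and return the old level. */
int model_up_acquire(int lock_irql)
{
    int old_irql = model_current_irql;
    model_current_irql = lock_irql;   /* preemption is now blocked */
    return old_irql;
}

/* "Release": drop back to the IRQL saved by the matching acquire. */
void model_up_release(int old_irql)
{
    model_current_irql = old_irql;
}

/* Reports the modeled processor's current IRQL. */
int model_get_irql(void)
{
    return model_current_irql;
}
```

The caller keeps the value returned by the acquire and hands it back on release, which mirrors how a driver must save the old IRQL from an acquire and pass it to the matching release.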

Running at an elevated IRQL can have a noticeably negative impact on a system’s performance and is the reason why the DDK is adamant that drivers should hold spin locks for no more than 25 microseconds. Where does 25 microseconds come from? Who knows. And how many instructions is that? How about on a 500MHz Alpha? Okay, how about on a 486/66? What I’m getting at is that you should use your own judgment (which should always override the DDK) about how long your drivers hold spin locks. Have them hold locks for as long as they need to and no longer.

On the multiprocessor version of NT, the IRQL of the CPU where a spin lock is being acquired is raised to a spin lock’s IRQL before the attempt to acquire is made. This results in the uniprocessor spin lock behavior being preserved within each individual CPU. Nifty, huh?

As I alluded to above, spin locks have associated IRQLs. However, the NT API hides this association. There are three types of spin locks used by device drivers: standard spin locks, default ISR (Interrupt Service Routine) spin locks, and ISR synchronization spin locks. Each type has its own IRQL associations.

Standard spin locks are created with KeInitializeSpinLock(...). Its sole parameter is a pointer to driver allocated spin lock storage which must be in non-paged memory. Most drivers use storage in device object extensions, non-paged pool, or non-paged driver global memory. A spin lock is acquired with KeAcquireSpinLock(...), which returns the IRQL of the processor prior to the acquire. A driver is responsible for keeping this somewhere and passing it to a corresponding KeReleaseSpinLock(...). The generic locking hierarchy does not require that you release locks in the same order that you acquire them, but nesting is a requirement for multiple spin lock acquisition in NT due to the fact that the IRQLs must properly nest (i.e. after you release the last spin lock, you must be back at the IRQL you started at).

The IRQL associated with standard spin locks is DISPATCH_LEVEL. This means that your driver will be running at IRQL equal to DISPATCH_LEVEL while it has spin locks acquired, and absolutely, positively must not attempt to touch paged code or data. Doing so may work sometimes, but the first time that the referenced code or data is not resident, you’ll get an IRQL_NOT_LESS_OR_EQUAL BugCheck.

Running at DISPATCH_LEVEL also means that you have to be careful about using other kernel-mode routines. Many functions have the restriction that they cannot be called at IRQLs greater than PASSIVE_LEVEL. For instance, we’ll see when we look at dispatcher objects that a thread cannot block on a dispatcher object at DISPATCH_LEVEL or higher.

Two special routines can be used for enhanced performance in cases where you know the IRQL at the time a thread acquires a spin lock will already be DISPATCH_LEVEL (like in a DPC): KeAcquireSpinLockAtDpcLevel(...) and KeReleaseSpinLockFromDpcLevel(...). These simply skip the IRQL manipulations performed by the other routines (KeAcquireSpinLockAtDpcLevel(...) is a no-op on a uniprocessor!).

Default ISR spin locks allow for data-access synchronization between an ISR executing at DIRQL, and other driver routines like driver dispatch functions and DPCs that execute at lower IRQLs. Your driver will indicate to the I/O manager that it wants to use a default ISR spin lock when it calls IoConnectInterrupt(...). This function takes an optional spin lock parameter, which if NULL, means that a spin lock will be created for the ISR with an associated IRQL equal to what you pass in the Irql parameter. When the ISR is executed, the I/O manager will automatically acquire the ISR’s spin lock before the ISR is called, and release it after the ISR is completed.

ISR synchronization spin locks allow multiple ISRs that run at different DIRQLs to synchronize access to shared data. A driver directs the I/O Manager to create one by passing a pointer to a spin lock in the IoConnectInterrupt(...) SpinLock parameter. In this case the driver is also required to specify a DIRQL to associate with the spin lock, which must be equal to the highest DIRQL returned by the calls to HalGetInterruptVector(...) for the ISRs that are involved in the synchronization.

The fact that the I/O manager creates spin locks for ISRs does not mean that your other driver routines should call KeAcquireSpinLock(...) on the spin locks to acquire them - not even for ISR synchronization spin locks. Remember that KeAcquireSpinLock(...) automatically puts the current IRQL at DISPATCH_LEVEL, which would not result in correct synchronization since ISRs execute at IRQL > DISPATCH_LEVEL. Instead, you call KeSynchronizeExecution(...) with a pointer to a routine that must be synchronized with the ISR(s). Before and after the synchronization routine is called, NT will acquire and release on your behalf the spin lock associated with the specified ISR, be it a default ISR spin lock or an ISR synchronization spin lock.

Similarly, your driver must not acquire any spin locks from your ISR. The ISR spin lock will already have been acquired, and attempting to acquire another spin lock (of any type) will almost certainly guarantee that the IRQL will get hosed (IRQL_NOT_LESS_OR_EQUAL).

With the exception of Events, dispatcher objects are often overlooked by NT driver writers. This is often because drivers are simple enough to only require a spin lock or KeSynchronizeExecution(...) or two, and because the I/O manager forces a wait on an event if synchronous I/O is desired. This section will provide a background on the common infrastructure that supports dispatcher objects, and then proceed to describe their many flavors: Events, Mutexes, Mutants, Semaphores, Timers, Threads and Processes, and Files.

Spin locks are absolutely necessary when data synchronization is required by any code that may run at an elevated IRQL. But the golden rule in NT is to spend as little time at IRQLs greater than PASSIVE_LEVEL as possible, which makes spin lock avoidance a high priority. NT’s dispatcher objects are a collection of synchronization primitives, built upon a common infrastructure, that can in general only be acquired at IRQL PASSIVE_LEVEL. Thus, they are very useful for synchronizing data access in worker threads and driver dispatch functions where (with the exception of lowest level disk drivers) the IRQL is guaranteed to be below DISPATCH_LEVEL. The wide variety of dispatcher objects means that one probably exists to specifically solve whatever synchronization problem you encounter in your driver. NT itself makes heavy use of dispatcher objects throughout all of its code, particularly in its file systems.

The common dispatcher object infrastructure is the DISPATCHER_HEADER data structure (listed in NTDDK.H) and its support routines. The fundamental attributes shared by all dispatcher objects, be they mutexes, semaphores, or a new one Microsoft makes up for NT 5.0, are that they can have one of two states, Signaled and Non-Signaled, and that threads waiting to acquire them are blocked on a wait list located in the object’s DISPATCHER_HEADER.

The only thing that differentiates dispatcher objects from one another is the set of rules used to determine when they are in a signaled or non-signaled state. Table 1, which is based on a similar table in Helen Custer’s "Inside Windows NT" (Microsoft Press, 1992), summarizes the signaling rules.

Object Type	Rule for Becoming Signaled	Effect on Waiting Threads
Mutex	Thread releases mutex	One thread is released
Semaphore	Semaphore count becomes nonzero	Waiting threads are released, up to the count
Synchronization Event	Thread sets the event	One thread is released
Notification Event	Thread sets the event	All threads are released
Synchronization Timer	Time arrives or interval elapses	One thread is released
Notification Timer	Time arrives or interval elapses	All threads are released
Process	Last thread terminates	All threads are released
Thread	Thread terminates	All threads are released
File	I/O completes	All threads are released

A powerful characteristic of dispatcher objects is the fact that it is possible to name them using functions that wrap around Object Manager support routines. This permits different drivers or applications to synchronize without the use of a custom interface. The Win32 API exposes naming ability for dispatcher objects, but the equivalent kernel-mode interface is undocumented. Calls like NtCreateMutex(...), NtCreateSemaphore(...), and so on are used to create named dispatcher objects, and like the well known NtCreateFile(...), allocate a handle in the current process’ handle table to represent it. Device drivers typically use the raw kernel interface because there is no need for naming, so I won’t cover the named-interface.

Each type of dispatcher object has its own initialization (e.g. KeInitializeMutex(...)) and release routine (e.g. KeReleaseMutex(...)), but all use common acquire functions (namely KeWaitForSingleObject(...) and KeWaitForMultipleObjects(...)).

KeWaitForSingleObject(...) and KeWaitForMultipleObjects(...) are the functions that are used to acquire dispatcher objects. If there is a possibility that the object or objects being acquired are in non-signaled states, meaning that the thread will block while attempting the acquire, then the functions must be called at IRQL below DISPATCH_LEVEL. Otherwise they can be used at IRQL less than or equal to DISPATCH_LEVEL.

The DDK does a fairly good job of documenting these functions, but a few parameters warrant some clarification. The DDK states that the WaitReason parameter should be Executive or UserRequest, but it does not explain the implications of each choice. A driver could actually pass in 0x12 for the WaitReason and NT would be perfectly happy. It turns out that the selection is used as nothing more than a debugging aid: when you dump a thread’s state in WinDbg, it indicates the reason the thread is blocked on a dispatcher object - guess where this reason comes from? At OSR we use custom values for this parameter to help us see what’s going on when something goes wrong.

Another parameter, WaitMode, can be either KernelMode or UserMode, and the DDK explains that this affects the delivery of user-mode APCs. It does not mention that it also determines whether the thread’s stack becomes eligible for paging during the wait (UserMode) or stays resident (KernelMode).

Kernel mutexes (as opposed to Executive Mutexes, which are not dispatcher objects) are the dispatcher object equivalent of spin locks. However, one important thing to remember about mutex objects is that unlike spin locks, their acquisition is thread-context specific. That is, a thread that has acquired a kernel mutex is the owner of the mutex, and can in fact recursively acquire it. This also means that if your driver acquires a mutex in a particular thread context, it must release it in the same context or you’ll get a BSOD (Blue Screen Of Death).

Mutexes are initialized with KeInitializeMutex(), which requires that the driver pass a pointer to a driver allocated, non-paged mutex data structure. The call also takes a parameter called Level. This is used for kernel enforcement of a locking hierarchy like the one described earlier. If a driver needs to acquire more than one mutex at a time, it must do so in increasing order according to the assigned mutex levels.

A mutex’s state can be checked with KeReadStateMutex(...), and mutexes are released with KeReleaseMutex(...). KeReleaseMutex(...) takes a parameter called Wait that is used as a hint to the kernel that the thread plans on immediately acquiring another dispatcher object. If Wait is FALSE, then KeReleaseMutex(...), and in general any other dispatcher object signaling function, can be called at IRQL less than or equal to DISPATCH_LEVEL. Remember that a thread that may block while acquiring another object must be at IRQL equal to PASSIVE_LEVEL, and therefore Wait can only be TRUE at elevated IRQL if the subsequent KeWait call will not result in the thread blocking. If Wait is TRUE, the release and acquire are performed atomically and without dropping the IRQL. One other note about acquiring mutex objects versus other dispatcher objects: the wait mode must be KernelMode.

Semaphores are a more flexible form of mutexes, and are often referred to as counting mutexes. Unlike mutexes, a driver has control over how many threads may simultaneously acquire a semaphore. When one is initialized via KeInitializeSemaphore(...), two parameters (besides the pointer to a non-paged semaphore structure), Count and Limit, must be supplied. The Limit specifies the maximum value the semaphore’s count can reach, and therefore how many threads can concurrently acquire it. The Count specifies the initial count, that is, how many acquisitions are available at the start: each successful wait decrements the count, each release increments it, and the semaphore is signaled whenever the count is greater than zero.

Semaphores are released with KeReleaseSemaphore(...), which has the same rules of use as KeReleaseMutex(...), minus the wait mode restriction that applies for mutexes.

The fact that multiple threads can simultaneously acquire the same semaphore makes them ideal for protecting access to multiple identical resources. For example, say your driver has allocated 4 large buffers that are used for device I/O. A simple way to control access to the buffers is to create a semaphore with a limit and initial count of 4, and to have the driver’s appropriate dispatch function acquire the semaphore. If there is an available buffer, the thread making the request will be allowed to continue. If the buffers are all in use, the semaphore count will be 0, and the next thread will have to wait until the semaphore, and therefore one of the buffers, is released.
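The buffer-pool idea can be modeled in a few lines of portable C. This is a single-threaded toy with hypothetical names that tracks only how many of the buffers are currently claimed. It is not an NT semaphore and it never blocks; it simply reports whether a caller would have to wait.

```c
#include <stdbool.h>

/* Toy counting gate for a pool of identical buffers (not an NT semaphore). */
typedef struct {
    int limit;    /* how many holders are allowed at once */
    int in_use;   /* how many holders there are right now */
} pool_gate_t;

void gate_init(pool_gate_t *g, int limit)
{
    g->limit = limit;
    g->in_use = 0;
}

/* Returns true if a buffer was claimed, false if the caller would wait. */
bool gate_try_acquire(pool_gate_t *g)
{
    if (g->in_use == g->limit)
        return false;
    g->in_use++;
    return true;
}

/* Hands one buffer back so the next waiter can proceed. */
void gate_release(pool_gate_t *g)
{
    if (g->in_use > 0)
        g->in_use--;
}
```

With a limit of 4, the first four claims succeed, the fifth would wait, and a single release lets exactly one more claim through.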

Synchronization events are also known as auto-reset events because when they are released, one waiting thread is awoken and their state is atomically returned to non-signaled. They are initialized with KeInitializeEvent(...), which is also used to initialize notification events. The State parameter allows events to be created in a signaled or non-signaled state. Signaling a synchronization event is performed with KeSetEvent(...), with the same IRQL restrictions as KeReleaseSemaphore(...).

Synchronization events are commonly used in drivers so that hardware initialization routines can wait for their hardware to signal with interrupts. The driver’s initialization routine waits on a synchronization event, and its DpcForIsr signals the event, allowing the initialization routine to proceed with setup.

Notification events are more commonly used than synchronization events. When signaled, all waiting threads are released. Drivers typically use notification events to wait for IRP completion when they pass an event object to an I/O function like IoBuildDeviceIoControlRequest(...), and then wait for a lower level driver to signal the event after the IRP is sent with IoCallDriver(...).

Unlike auto-resetting synchronization events, a notification event is signaled with KeSetEvent(...), and it remains signaled until it is explicitly reset with KeResetEvent(...).
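The difference between the two event types can be captured in a small state model. The sketch below is a single-threaded illustration with hypothetical names (`ev_model_t`, `ev_set`, `ev_reset`). It models only the signaling rules described here, not the kernel's wait lists or scheduling.

```c
#include <stdbool.h>

typedef enum { EV_SYNCHRONIZATION, EV_NOTIFICATION } ev_type_t;

typedef struct {
    ev_type_t type;
    bool signaled;
    int waiters;      /* threads currently blocked on the event */
} ev_model_t;

/* Models setting the event; returns how many waiters are released. */
int ev_set(ev_model_t *e)
{
    int released;
    if (e->type == EV_NOTIFICATION) {
        released = e->waiters;      /* everyone wakes up */
        e->waiters = 0;
        e->signaled = true;         /* stays signaled until ev_reset() */
    } else if (e->waiters > 0) {
        released = 1;               /* exactly one waiter wakes up */
        e->waiters--;
        e->signaled = false;        /* auto-reset */
    } else {
        released = 0;
        e->signaled = true;         /* stays set for the next waiter */
    }
    return released;
}

/* Models an explicit reset of the event to non-signaled. */
void ev_reset(ev_model_t *e)
{
    e->signaled = false;
}
```

With three waiters, setting a synchronization event in this model releases one and leaves the event non-signaled, while setting a notification event releases all three and leaves it signaled until `ev_reset` is called.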

Timer dispatcher objects can also be Synchronization or Notification objects. KeInitializeTimer(...) can be used to initialize notification timers, and KeInitializeTimerEx(...) is used to initialize either synchronization or notification timers. Timers, which are started with KeSetTimer(...) or KeSetTimerEx(...), are configured to fire once or periodically, and the first instant of expiration can be expressed relative to the current time or in absolute terms. In addition, an optional DPC routine can be passed to either function: when the timer object is signaled, the DPC specified will be scheduled for execution. Both KeInitializeTimer(...) and KeSetTimer(...) can be invoked at IRQL less than or equal to DISPATCH_LEVEL.

The distinction between synchronization and notification timers is useful because a notification timer, once signaled, remains signaled until it is explicitly reset.

Process and thread objects are an example of how NT embeds dispatcher objects within other objects. They are initialized and signaled by NT, so drivers can only wait for them to become signaled using KeWaitFor..(). Process objects are signaled when all of a process’ threads have exited, and thread objects are signaled when the thread has terminated.

Here’s another dispatcher object embedded within another data structure. File objects are also dispatcher objects that can be waited on. They are signaled, and auto-reset, when file I/O completes.

Most device driver writers are unfamiliar with resources, and it’s no wonder since the DDK documents them only in the Kernel-Mode Reference. No mention of them is made in the Design Guide, even though resources are alleged to be Dave Cutler’s favorite synchronization primitive. Resources are not dispatcher objects, so there is no way to name them. What they provide is a form of mutex that can be acquired in two different modes: exclusive and shared. When acquired in exclusive mode, resources behave like a standard mutex: another thread trying to acquire the resource, for either shared or exclusive access, will block. Resources acquired in shared mode can be simultaneously acquired by any number of threads.

Resources are ideal for protecting data structures that can be read concurrently by several threads, but that must be exclusively accessed for modification. Resources are initialized with the function ExInitializeResourceLite(...). Three calls acquire them for shared access: ExAcquireResourceSharedLite(...), ExAcquireSharedWaitForExclusive(...), and ExAcquireSharedStarveExclusive(...). ExAcquireResourceSharedLite(...) will only grant access to the calling thread if it already has shared or exclusive access, or if no other thread has the resource acquired for exclusive access and no other thread is already waiting for exclusive access. ExAcquireSharedWaitForExclusive(...) is identical (at least as far as we can tell). ExAcquireSharedStarveExclusive(...) is also the same, except that if there is a thread waiting to gain exclusive access, the caller of ExAcquireSharedStarveExclusive(...) will be given priority.

Exclusive access of a resource is gained through a call to ExAcquireResourceExclusiveLite(...). Access is granted only when the resource has not already been acquired. This means that counter to intuition, it is not possible for a thread to first acquire a resource for shared access, and then upgrade the access to exclusive without first releasing the resource. However, it is possible using the ExConvertExclusiveToSharedLite(...) function to change a thread’s exclusive access to shared.
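The shared and exclusive rules can be summarized in a small state model. This is a single-threaded sketch with hypothetical names. It deliberately ignores recursive acquisition by an existing owner and the priority given to waiters, and models only who may get in.

```c
#include <stdbool.h>

typedef struct {
    bool exclusive;   /* held by one exclusive owner */
    int shared;       /* number of shared holders */
} res_model_t;

/* Shared access is granted unless an exclusive owner exists. */
bool res_try_shared(res_model_t *r)
{
    if (r->exclusive)
        return false;
    r->shared++;
    return true;
}

/* Exclusive access is granted only if nobody holds the resource at all,
 * so a shared holder cannot upgrade in place. */
bool res_try_exclusive(res_model_t *r)
{
    if (r->exclusive || r->shared > 0)
        return false;
    r->exclusive = true;
    return true;
}

/* Downgrade: the exclusive owner becomes a shared holder. */
void res_exclusive_to_shared(res_model_t *r)
{
    if (r->exclusive) {
        r->exclusive = false;
        r->shared = 1;
    }
}

/* Drops one hold: the exclusive one if present, otherwise one shared hold. */
void res_release(res_model_t *r)
{
    if (r->exclusive)
        r->exclusive = false;
    else if (r->shared > 0)
        r->shared--;
}
```

In this model a downgrade never fails, because the exclusive owner is by definition the only holder, whereas an upgrade would have to wait for every other shared holder to leave.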

All resource acquisition functions take a parameter called Wait that, if FALSE, specifies that the thread attempting to gain the resource will not block if it cannot obtain it immediately. Using ExTryToAcquireResourceExclusiveLite(...) is equivalent to calling ExAcquireResourceExclusiveLite(...) with a Wait of FALSE, but it performs better. The resource API allows threads to check the number of waiters of a resource with ExGetExclusiveWaiterCount(...) and ExGetSharedWaiterCount(...). Finally, resources are released with ExReleaseResourceForThreadLite(...).

Resource functions are more restrictive than the dispatcher object functions because except for ExReleaseResourceForThreadLite(...) (which can be called at DISPATCH_LEVEL), they cannot be called at DISPATCH_LEVEL or above.

We’re not going to spend much time on executive mutexes (also known as Fast Mutexes), because under NT 3.51 they have a bug that makes it impossible to acquire them at IRQLs above PASSIVE_LEVEL without causing an IRQL BugCheck. In addition, under 3.51 they use APCs to implement blocking. Under NT 4.0 the IRQL bug has been corrected and the mutex support functions have been rewritten to use Event dispatcher objects instead of APCs for blocking, but unlike Kernel Mutexes, they are still not recursively acquirable. Most of us would like our driver code to work under both versions of NT, so executive mutexes are not as attractive as other synchronization primitives.

Executive mutexes (which are, interestingly enough, implemented in HAL.DLL) are initialized with the call ExInitializeFastMutex(...), which takes a pointer to a non-paged fast mutex object. They are acquired with ExAcquireFastMutex(...), ExAcquireFastMutexUnsafe(...), or ExTryToAcquireFastMutex(...). Typically, a driver will use ExAcquireFastMutex(...), which blocks the calling thread with APCs disabled until the mutex can be acquired. If a thread is running with APCs already disabled, as would be the case if it was in a critical section, it can use ExAcquireFastMutexUnsafe(...). ExTryToAcquireFastMutex(...) immediately returns a FALSE if the mutex cannot be acquired, and a TRUE if it was successfully acquired. It should be used if the caller does not want to block if it cannot immediately acquire the mutex, but can get other useful work done before it tries again.

Mutexes are released with ExReleaseFastMutex(...). The only Fast mutex function that can be called at DISPATCH_LEVEL is ExInitializeFastMutex(...).

So you can see that NT provides a wide array of synchronization mechanisms, each of which is aimed at solving different problems. Table 2 shows the IRQL restrictions for each type of primitive.

Now that you’ve seen the tools that NT provides, you’ll be sure that when you select a Phillips-head, it will be the best fit for the screw you’re device driving.

Object Type	Acquire, No Block	Acquire, Block	Release/Signal
Standard Spin Lock	<= DISPATCH_LEVEL	-	DISPATCH_LEVEL
Default ISR Spin Lock	<= DIRQL	-	DIRQL
ISR Synchronize Spin Lock	<= Specified DIRQL	-	Specified DIRQL
Mutex	<= DISPATCH_LEVEL	< DISPATCH_LEVEL	<= DISPATCH_LEVEL
Semaphore	<= DISPATCH_LEVEL	< DISPATCH_LEVEL	<= DISPATCH_LEVEL
Synchronization Event	<= DISPATCH_LEVEL	< DISPATCH_LEVEL	<= DISPATCH_LEVEL
Notification Event	<= DISPATCH_LEVEL	< DISPATCH_LEVEL	<= DISPATCH_LEVEL
Synchronization Timer	<= DISPATCH_LEVEL	< DISPATCH_LEVEL	-
Notification Timer	<= DISPATCH_LEVEL	< DISPATCH_LEVEL	-
Process	<= DISPATCH_LEVEL	< DISPATCH_LEVEL	-
Thread	<= DISPATCH_LEVEL	< DISPATCH_LEVEL	-
File	<= DISPATCH_LEVEL	< DISPATCH_LEVEL	-
Resources	< DISPATCH_LEVEL	< DISPATCH_LEVEL	<= DISPATCH_LEVEL