InterlockedRead / InterlockedWrite

Below is the MSDN documentation for InterlockedExchange():

http://msdn.microsoft.com/en-us/library/windows/desktop/ms683590(v=vs.85).aspx

It says “This function is atomic with respect to calls to other interlocked functions.” That seems reasonable because the caller should not make any assumptions about the implementation. On most architectures there is HW support for the interlocked functions and they are also atomic with respect to simple reads and writes. However, if HW support is unavailable, the interlocked functions would need to use spinlocks. Why, then, is there no InterlockedRead() or InterlockedWrite()? Consider the following example:

NTSTATUS try_lock(ULONG *x)
{
return(InterlockedExchange(x, 1) == 0 ? STATUS_SUCCESS : STATUS_UNSUCCESSFUL);
}

void release_lock(ULONG *x)
{
*x = 0; // InterlockedWrite() is not available
}

Now suppose the follow sequence of events from two different processes (P2 already owns the lock):

P1: calls try_lock()
P1: acquires spinlock
P1: reads *x == 1
P2: calls release_lock()
P2: writes *x = 0
P1: writes *x = 1
P1: releases spinlock
P1: try_lock() returns STATUS_UNSUCCESSFUL

Now the lock is left in a bad state and all future calls to try_lock() will fail.

You might want to look at Doron’s blog from 8 years ago, in particular
http://blogs.msdn.com/b/doronh/archive/2006/12/06/creating-your-own-interloc
kedxxx-operation.aspx

Don Burn
Windows Filesystem and Driver Consulting
Website: http://www.windrvr.com
Blog: http://msmvps.com/blogs/WinDrvr

-----Original Message-----
From: xxxxx@lists.osr.com
[mailto:xxxxx@lists.osr.com] On Behalf Of
xxxxx@qualcomm.com
Sent: Tuesday, April 01, 2014 2:35 PM
To: Windows System Software Devs Interest List
Subject: [ntdev] InterlockedRead / InterlockedWrite

Below is the MSDN documentation for InterlockedExchange():

http://msdn.microsoft.com/en-us/library/windows/desktop/ms683590(v=vs.85).as
px

It says “This function is atomic with respect to calls to other interlocked
functions.” That seems reasonable because the caller should not make any
assumptions about the implementation. On most architectures there is HW
support for the interlocked functions and they are also atomic with respect
to simple reads and writes. However, if HW support is unavailable, the
interlocked functions would need to use spinlocks. Why, then, is there no
InterlockedRead() or InterlockedWrite()? Consider the following example:

NTSTATUS try_lock(ULONG *x)
{
return(InterlockedExchange(x, 1) == 0 ? STATUS_SUCCESS :
STATUS_UNSUCCESSFUL); }

void release_lock(ULONG *x)
{
*x = 0; // InterlockedWrite() is not available }

Now suppose the follow sequence of events from two different processes (P2
already owns the lock):

P1: calls try_lock()
P1: acquires spinlock
P1: reads *x == 1
P2: calls release_lock()
P2: writes *x = 0
P1: writes *x = 1
P1: releases spinlock
P1: try_lock() returns STATUS_UNSUCCESSFUL

Now the lock is left in a bad state and all future calls to try_lock() will
fail.


NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

> Why, then, is there no InterlockedRead() or InterlockedWrite()

Simply because interlocking does not apply to simple reads and writes…

It applies to operations that can be described by the following sequence that has to appear atomic

A. Read data from memory to some internal temporary register
B. Do some operation on this temporary register ( increment, decrement, test-and-set, exchange with some other register, etc)
C. Write register contents back to memory

As you can see, it applies only to those to operations that involve both read and write cycles to memory that have to appear atomic, and simple reads and writes are not among them…

Anton Bassov

@Don, thanks for the link.
@Anton, did you read my entire post?

>However, if HW support is unavailable, the interlocked functions would need to use spinlocks.

How would you implement spinlocks, if interlocked HW commands are not available?

> @Anton, did you read my entire post?

Well, to be honest, I did not understand it at all. For example,could you please explain the statements below:

You seem to be considering the scenario when P1 acquires a spinlock that is already owned by P2…

Anton Bassov

@Alex, there might be HW support sufficient for a spinlock but insufficient to directly implement all of the interlocked APIs. Or there might be no HW support at all and mutual exclusion would need to be implemented using something like Peterson’s algorithm.

@Anton, in my example there are two locks. One that is acquired/released via try_lock()/release_lock() and another that is used internally by the interlocked functions to guarantee atomicity between themselves.

>> owever, if HW support is unavailable, the interlocked functions would need to use spinlocks.

How would you implement spinlocks, if interlocked HW commands are not available?

Please note that an arch may limit itself to providing HW support only for a single type of interlocked operations ( test-and-set is the most likely candidate). Interlocked ops like, say, increment or compare_exchange have to implement bus locking semantics in a software using a spinlock abstraction that, in turn, is implemented on top of test-and-set( i.e. interlocked operation that is implemented by hardware), on such an arch. Check Linux sources for arches other than x86 and x86_64, and you will see it with your own eyes. Therefore, in this respect the OP’s question is perfectly reasonable…

Anton Bassov

> @Anton, in my example there are two locks. One that is acquired/released via try_lock()/release_lock()

and another that is used internally by the interlocked functions to guarantee atomicity between themselves.

If you think about it carefully you will (hopefully) realize that a single operand that is guarded by 2 separate
locks is equivalent to the one that is not guarded by any locks at all. Therefore, your original assumption is just logically faulty, and you prove it yourself by pointing out how such a “guard” can be defeated .

In terms of stupidity such a scheme is equivalent to the “masterpiece” below:

if(irql< DISPATCH_LEVEL)
{

ExAcquireFastMutex();

do_things_with_varianble_X();

ExReleaseFastMutex ();
}

else

{

KeAcquireSpinLock();

do_things_with_varianble_X();

KeReleaseSpinLock();

}

With such a “guard” variable X may be update by two separate paths simultaneously. The scheme that you have mentioned earlier is from the same field…

Anton Bassov

Your example is very confusing, if you’re talking about two different spinlocks. I suggest you rewrite it more explicitly.

If we’re talking about Windows, there is an assumption of some hardware guarantees. For example, cache coherency, and interlocked compare-exchange availability.

Interlocked compare-exchange allows you to implement the full range of atomic interlocked operations - increment, decrement, add, OR, AND, XOR, you name it.

> Below is the MSDN documentation for InterlockedExchange():

http://msdn.microsoft.com/en-us/library/windows/desktop/ms683590(v=vs.85).aspx

It says “This function is atomic with respect to calls to other
interlocked functions.” That seems reasonable because the caller should
not make any assumptions about the implementation. On most architectures
there is HW support for the interlocked functions and they are also atomic
with respect to simple reads and writes. However, if HW support is
unavailable, the interlocked functions would need to use spinlocks. Why,
then, is there no InterlockedRead() or InterlockedWrite()? Consider the
following example:

NTSTATUS try_lock(ULONG *x)
{
return(InterlockedExchange(x, 1) == 0 ? STATUS_SUCCESS :
STATUS_UNSUCCESSFUL);
}

void release_lock(ULONG *x)
{
*x = 0; // InterlockedWrite() is not available
}

Now suppose the follow sequence of events from two different processes (P2
already owns the lock):

P1: calls try_lock()
P1: acquires spinlock

The above sequence makes no sense, because it requires a read, whose
outcome is untested, so it is just a way to waste time. It accomplishes
nothing. In fact, whether it gets the lock or not in no way impacts any
of the remaining code.

P1: reads *x == 1
P2: calls release_lock()
P2: writes *x = 0
P1: writes *x = 1
P1: releases spinlock
Just to clarify: processes are irrelevant to this discussion. Only
threads matter, and they can belong to the same process. Note that there
are two-stage acquisition algorithms that work even on machines with no
interlocks, where an interrupt and context switch can occur between any
two instructions. It was even a question on a final exam in the OS course
I took in 1968, but I’m not going to try to rediscover it 46 years later.
But in the above code, you are right; the state of x is wrong. The reason
it is wrong is that your code is wrong; it makes no sense to allow thread
P2 to write to x without any synchronization, when thread P1believes it
has exclusive access. You are releasing te lock before doing te write.

There is this about sychronization: it doesn’t do any good if you don’t
use it correctly. Back in 1968, Edsgar Dijkstra wrote a now-famous paper
on the use of synchronization primitives P() and V(), and we have made
virtually no progress in understanding how to use these in a robust
fashion in the intervening 46 years. The “synchronized” attribute of Javs
and C# is just a P/V in sheep’s clothing. A tiny number of
mostly-academic languages have introduced meta-concepts such as “condition
variables”, but these languages tend to exist in only one compiler in one
university. The closest language to mainstream that exhibited anything
more sophisticated than P/V was Ada, with its “rendezvous” mechanism.

The reason your code example doesn’t work is simply that it is wrong, so
of course it looks like nonsense. That’s because it is. Consider the
race condition where P2 sets the value to 0 and P1sets it to 1. If those
threads are running on distinct cores, a timing change of a couple hundred
PICOseconds makes the difference between the sequence you show and a
sequence which leaves *x as 0. When you see something like this, you know
immediately that your code is just plain WRONG.

If, to get te address in the pointer x, you use &Schroedinger as the name
of the variable, this should give you a hint as to what you have done
wrong. The first access to the variable collapses the quantum states and
gives you A value, you just don’t know which one you have until you read
it.

Generally, for one value, InterlockedCompareAndExchange is all the locking
you need. No locking is required for a store. But when you have a tuple
of values, you want to change all members of the tuple to maintain some
correctness invariant. In this case you need a lock, because you want to
treat the N non-atomic modifications as a single atomic transaction. So
unlinking a block from a doubly-linked list requires a lock that prohibits
any other thread from either reading information for which the correctness
inariant is false, or concurrently modifying the tuple such that two
threads each think they have each left the tuple in a correct state. You
demonstrate none of that in your example. Note that you can create
locking structures that prevent concurrent access while still creating
completely erroneous results, such as the example you have given.

P1: try_lock() returns STATUS_UNSUCCESSFUL

Now the lock is left in a bad state and all future calls to try_lock()
will fail.


NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

>>However, if HW support is unavailable, the interlocked functions would

> need to use spinlocks.

How would you implement spinlocks, if interlocked HW commands are not
available?

We had to implement them using a hypothtical “multithreaded Algol
compiler” which had absolutely no guarantees of atomicity. It can be done
with a two-phase acquisition strategy. But it’s simpler and more
efficient if you have interlocked instructions.
joe


NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

> Your example is very confusing, if you’re talking about two different

spinlocks. I suggest you rewrite it more explicitly.

If we’re talking about Windows, there is an assumption of some hardware
guarantees. For example, cache coherency, and interlocked compare-exchange
availability.

Interlocked compare-exchange allows you to implement the full range of
atomic interlocked operations - increment, decrement, add, OR, AND, XOR,
you name it.

And if you really want to see an amazing sequence of code, look at the
code the C compiler creates for an atomic *= operation

#pragma omp atomic
x *= y;

It uses a fascinating variation on the concept of “spin lock” (in reading
the code, ignore the code executed if te iperands are not DWORD-aligned;
it is obvious and trivial, albeit slow with potential high lock conflict;
look at the main path for DWORD-aligned operands)
joe


NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

It is surprising how much time some people will spend writing responses and how little time reading and understanding the original question.

There are TWO locks in the example. The first lock (which I refer to as “lock”) is a trylock. This is what try_lock() and release_lock() manipulate. The second lock (which I refer to as “spinlock”) is a *hypothetical* lock that the interlocked functions *may* be using to guarantee atomicity between each other. It’s hypothetical because we don’t know how the interlocked functions are implemented on every architecture we may encounter.

The try_lock() and release_lock() functions would be safe and correct if I used InterlockedWrite() instead of a simple write inside release_lock(). But InterlockedWrite() doesn’t exist.

> It is surprising how much time some people will spend writing responses

and how little time reading and understanding the original question.

I read the original question. It was largely nonsense, because it did a
try-lock without testing to see if it succeeded or not, and there is, as
has been pointed out, absolutely no need for either an InterlockedRead or
InterlockedWrite, since that is how memory works anyway.
joe

There are TWO locks in the example. The first lock (which I refer to as
“lock”) is a trylock. This is what try_lock() and release_lock()
manipulate. The second lock (which I refer to as “spinlock”) is a
*hypothetical* lock that the interlocked functions *may* be using to
guarantee atomicity between each other. It’s hypothetical because we
don’t know how the interlocked functions are implemented on every
architecture we may encounter.

The try_lock() and release_lock() functions would be safe and correct if I
used InterlockedWrite() instead of a simple write inside release_lock().
But InterlockedWrite() doesn’t exist.


NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at
http://www.osronline.com/page.cfm?name=ListServer

Joe, if you didn’t comprehend the original question then you should have asked for clarification or kept quiet.

> It is surprising how much time some people will spend writing responses and how little time

reading and understanding the original question

Well, we are just trying to explain to you that the original question is nonsensical in itself, but you don’t want to listen to us, do you…

But InterlockedWrite() doesn’t exist.

OK, let put it simply:

A. InterlockedWrite(int * x, int val)
{

*x=val;

}

B. InterlockedWrite(int * x, int val, spinlock_t lock)
{

spin_lock(lock);
*x=val;
spin_unlock(lock);
}

Could you please explain the purpose of a spinlock in example B. If you somehow realize that it serves no purpose whatsoever and that, for all practical purposes, A and B are equivalent, you will (hopefully) realize that there is no reason for interlocked writes.

The try_lock() and release_lock() functions would be safe and correct if I used InterlockedWrite()
instead of a simple write inside release_lock().

If you handle the first part well, you will also realize that try_lock() and release_lock() functions would NOT be safe and correct in multithreaded environment, no matter how you implement your hypothetical InterlockedWrite()…

Anton Bassov

@Anton:

Acquiring the InterlockedXxx spinlock (if it exists) before the write operation in release_lock() guarantees it will occur before or after (not during) the read/write operation in try_lock().

  1. If the write in release_lock() happens before the read/write in try_lock() then try_lock() will return STATUS_SUCCESS and *x==1 (GOOD).

  2. If the write in release_lock() happens after the read/write in try_lock() then try_lock() will return STATUS_UNSUCCESSFUL and *x==0 (GOOD).

  3. If the write in release_lock happens during the read/write in try_lock() then try_lock() will return STATUS_UNSUCCESSFUL and *x==1 (BAD).

To be pedantic, ARM has non-cache-coherent DMA that may require flushes & also lacks hardware compare-exchange primitives as amd64 or x86 have (the compiler will synthesize the requisite intrinsic interlocked operation using an appropriate ldrex/strex sequence).

It would be fair to say that we presently require hardware upon which an interlocked compare exchange can be synthesized.

  • Ken (MSFT)
    (Occasional ARM kernel guy.)
    TwC Security

From: xxxxx@broadcom.commailto:xxxxx
Sent: ?4/?1/?2014 13:58
To: Windows System Software Devs Interest Listmailto:xxxxx
Subject: RE:[ntdev] InterlockedRead / InterlockedWrite

Your example is very confusing, if you’re talking about two different spinlocks. I suggest you rewrite it more explicitly.

If we’re talking about Windows, there is an assumption of some hardware guarantees. For example, cache coherency, and interlocked compare-exchange availability.

Interlocked compare-exchange allows you to implement the full range of atomic interlocked operations - increment, decrement, add, OR, AND, XOR, you name it.


NTDEV is sponsored by OSR

Visit the list at: http://www.osronline.com/showlists.cfm?list=ntdev

OSR is HIRING!! See http://www.osr.com/careers

For our schedule of WDF, WDM, debugging and other seminars visit:
http://www.osr.com/seminars

To unsubscribe, visit the List Server section of OSR Online at http://www.osronline.com/page.cfm?name=ListServer</mailto:xxxxx></mailto:xxxxx>

OK, let’s ask that again:

  1. On what lock try_lock() and release_lock() operate?
  2. What is *x in pseudo-code? Is it related in any way to the lock for try_lock() and release_lock()?
  3. If *x is the same lock as try_lock() and release_lock() operate on, WHY THE HELL you read and modify it outside of those functions?