Oplocks: how can I downgrade an oplock, and why is no acknowledgment required for RW->RW?

Here is the pseudocode I use to learn in practice how oplocks work (all from the same thread, on the same file handle):

  1. Open file with async access for read/write (for the CSV volume)
  2. Get the R oplock for the file above
  3. Get the R oplock (again for the same file)
  4. Get the RW oplock
  5. Get the RW oplock
  6. Get the R oplock

After reading the MSDN documentation, my expectations were:

  1. Requesting an R oplock when I already own R should not break my R oplock:
    https://msdn.microsoft.com/en-us/library/windows/hardware/ff551011(v=vs.85).aspx (Oplocks Overview): “A Read (R) oplock (shared) indicates that there are multiple readers of a stream and no writers”
    https://msdn.microsoft.com/en-us/library/windows/hardware/ff549334(v=vs.85).aspx (Breaking oplocks, IRP_MJ_READ): “Read.
    The oplock is not broken, no acknowledgment is required, and the operation proceeds immediately.”
  2. RW -> RW (“An acknowledgment must be received before the operation continues.”)
  3. RW -> R (“An acknowledgment must be received before the operation continues.”, from Breaking oplocks, IRP_MJ_READ: https://msdn.microsoft.com/en-us/library/windows/hardware/ff549334(v=vs.85).aspx).
    But that looks to be in conflict with https://msdn.microsoft.com/en-us/library/windows/hardware/ff547373(v=vs.85).aspx (Granting Oplocks): “Level 1, Batch, Filter, Read-Write, Read-Write-Handle: STATUS_OPLOCK_NOT_GRANTED is returned.” If the latter is correct, how do I downgrade an oplock from RW to R?
    And in reality I see totally different behavior. (I am using the following code to get the oplock: http://www.osronline.com/showthread.cfm?link=268983)

a) R->R: the read break routine is called. So the oplock looks broken from R to R, NewOplockLevel = 0x1.
b) R->RW: the read break routine is called. So the oplock is broken to NewOplockLevel = 0x5 (= RW). No ack required.
c) RW->RW: this is the first strange case. I see that the REQUEST_OPLOCK_OUTPUT_FLAG_ACK_REQUIRED flag is not set in OutputBuffer.Flags, and I expected to see that flag so that I could flush data and acknowledge the break. Why no REQUEST_OPLOCK_OUTPUT_FLAG_ACK_REQUIRED?
d) RW->R (downgrading): the read break routine is called, and not the write break routine as expected. Why so? The output also looks incorrect: OriginalOplockLevel = 0x1 (should be 5 = RW), NewOplockLevel = 0x1.
Then FltPerformAsynchronousIo finally returns STATUS_OPLOCK_NOT_GRANTED.

So how can I downgrade an oplock without closing the handle, and why is no acknowledgment required for RW->RW?

First, to answer the topic question: You can’t downgrade your own oplock, at least not directly. The model is that you request an oplock that represents the kind of caching you want to do, then you do your file operations. If some other client shows up and tries to do something with the file you’re working with, *their* operations may break *your* oplock to notify you that you need to take action to make your cache coherent. More precisely, if you’ve requested an oplock on FILE_OBJECT FO1, and an I/O arrives on FILE_OBJECT FO2 for the same file (assuming it was not opened using the same oplock key as FO1, or that no oplock keys were used), that I/O operation might break your oplock according to the rules described in the “Breaking Oplocks” section of the “Oplock Semantics” documentation at https://msdn.microsoft.com/en-us/library/windows/hardware/ff538991(v=vs.85).aspx.

Given the steps you outline, the behaviour you’re seeing is correct. You’re not issuing any I/O that would break an oplock using the rules in the “Breaking Oplocks” section of “Oplock Semantics”. What you’re doing is covered in the “Granting Oplocks” section here: https://msdn.microsoft.com/en-us/library/windows/hardware/ff547373(v=vs.85).aspx

I suggest you carefully read the entire “Oplock Semantics” document starting with the Overview: https://msdn.microsoft.com/en-us/library/windows/hardware/ff551011(v=vs.85).aspx

  1. Open file with async access for read/write (for the CSV volume)
  2. Get the R oplock for the file above
  3. Get the R oplock (again for the same file)

Since you’re requesting R (again) on the same handle, the behaviour you should see is this (from the “Granting Oplocks” doc):

Multiple Level 2 (but not Read) oplocks can even exist on the same handle.
If a Read oplock is requested on a handle that already has a Read oplock granted to it, the first Read oplock’s IRP is completed with STATUS_OPLOCK_SWITCHED_TO_NEW_HANDLE before the second Read oplock is granted.

  4. Get the RW oplock

This is an upgrade from R to RW. From the “Granting Oplocks” doc:

Be aware that if the current oplock state is:
* No Oplock: the request is granted.
* Read or Read-Write and the existing oplock has the same oplock key as the request: the existing oplock’s IRP is completed with STATUS_OPLOCK_SWITCHED_TO_NEW_HANDLE, the request is granted.
Else STATUS_OPLOCK_NOT_GRANTED is returned.

The second time you request RW (after you already have RW) that same part of the doc applies. No acknowledgement is required because that only applies when the oplock is actually being broken. In your case it is not being broken, it is being “upgraded” to the same level.

When you request R when you have RW the call fails with STATUS_OPLOCK_NOT_GRANTED per the doc: If you already have “Level 1, Batch, Filter, Read-Write, Read-Write-Handle: STATUS_OPLOCK_NOT_GRANTED is returned.”

Another thing to note: I think you may be confusing yourself a bit by referring to your async IO completion routine as “OplockBreakRoutine”. That routine is called when your oplock request completes for any reason, not just for oplock break. The request completes if you request another oplock on the same file object, as you’re seeing here.
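The granting rules quoted above can be replayed as a small toy model. This is only a sketch in Python, not kernel code: the names `ToyHandle`, `Status` and `replay` are made up for illustration, only the R and RW levels on a single handle are modelled, and it assumes that a refused request leaves the currently held oplock untouched.

```python
from enum import Enum


class Status(Enum):
    GRANTED = "request granted (its IRP stays pending)"
    SWITCHED = "STATUS_OPLOCK_SWITCHED_TO_NEW_HANDLE"
    NOT_GRANTED = "STATUS_OPLOCK_NOT_GRANTED"


class ToyHandle:
    """One file handle that keeps requesting oplocks on itself."""

    def __init__(self):
        self.held = None       # oplock level currently held: None, "R" or "RW"
        self.completed = []    # (level, status) of earlier requests that completed

    def request(self, wanted):
        # Asking for R while holding RW is refused: no self-downgrade.
        # Assumption of this model: the held RW oplock stays in place.
        if wanted == "R" and self.held == "RW":
            return Status.NOT_GRANTED
        # Otherwise the earlier request (if any) completes with "switched",
        # which is why the completion routine runs without a real break.
        if self.held is not None:
            self.completed.append((self.held, Status.SWITCHED))
        self.held = wanted
        return Status.GRANTED


def replay(steps):
    """Issue the requests in order; return their statuses and the handle."""
    handle = ToyHandle()
    return [handle.request(level) for level in steps], handle


if __name__ == "__main__":
    results, handle = replay(["R", "R", "RW", "RW", "R"])
    for level, status in zip(["R", "R", "RW", "RW", "R"], results):
        print(level, "->", status.name)
    print("earlier requests completed:", [(l, s.name) for l, s in handle.completed])
```

In this model none of the five requests produces an acknowledgment-required break, because no other file object ever issues I/O against the file.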

Christian [MSFT]
This posting is provided “AS IS” with no warranties, and confers no rights.

Christian, thanks a lot!
One more question that is still not clear to me.
When I get the REQUEST_OPLOCK_OUTPUT_FLAG_ACK_REQUIRED flag in the RW break, I need to flush data and only AFTER that acknowledge the oplock break, so that the cache stays coherent.
But there are strict synchronization limitations on PFLT_COMPLETED_ASYNC_IO_CALLBACK: https://msdn.microsoft.com/en-us/library/windows/hardware/ff551067(v=vs.85).aspx
“Because the PFLT_COMPLETED_ASYNC_IO_CALLBACK routine can be called at IRQL DISPATCH_LEVEL, it is subject to the following constraints:
-It cannot safely call any kernel-mode routines that require a lower IRQL.
-Any data structures used in this routine must be allocated from nonpaged pool.
-It cannot be made pageable.
-It cannot acquire resources, mutexes, or fast mutexes. However, it can acquire spin locks.”

And flushing is a pageable operation by its nature. At first I thought about an event (I can call KeSetEvent with Wait = FALSE in the WriteOplockRequestCompletion routine at IRQL <= DISPATCH_LEVEL), so that I can queue, via FltQueueGenericWorkItem, a flushing routine for the target file that waits for that event. But how can I know that flushing is done? Spin locks do not look very appropriate for this case.
What is the best practice for this (apparently common) case?

It looks like I can do this after PFLT_COMPLETED_ASYNC_IO_CALLBACK completes, so I can just call FltQueueGenericWorkItem from PFLT_COMPLETED_ASYNC_IO_CALLBACK.
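The hand-off idea above (do nothing heavy in the callback, queue the flush and the acknowledgment to a worker) can be sketched as follows. This is a user-mode Python illustration of the ordering only, not WDK code: `work_items` stands in for the work item queue, and `flush_cached_data`, `acknowledge_break`, `handle_break` and `oplock_request_completed` are hypothetical placeholders.

```python
import queue
import threading

events = []                  # ordered log of what happened
work_items = queue.Queue()   # stand-in for the deferred work item queue


def flush_cached_data():
    events.append("flush")   # placeholder for the real (pageable) flush


def acknowledge_break():
    events.append("ack")     # placeholder for the real break acknowledgment


def handle_break():
    # Runs on the worker, where blocking is allowed: flush first, then ack.
    flush_cached_data()
    acknowledge_break()


def oplock_request_completed(ack_required):
    """Stand-in for the completion callback: no blocking work, just hand off."""
    events.append("callback")
    if ack_required:
        work_items.put(handle_break)


def worker():
    while True:
        item = work_items.get()
        if item is None:     # sentinel: stop the worker
            break
        item()


def run_demo():
    events.clear()
    thread = threading.Thread(target=worker)
    thread.start()
    oplock_request_completed(ack_required=True)
    work_items.put(None)
    thread.join()
    return list(events)


if __name__ == "__main__":
    print(run_demo())
```

Because the flush and the acknowledgment run sequentially in the same work item, nothing has to wait on an event to learn that the flush has finished.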

But I found another problem. If we can't downgrade an oplock, then the orchestration looks unclear.
Let's talk in the context of a metadata manager. We have a metadata file.
We open this file on two nodes, A and B, with R.
Then node A upgrades the oplock to RW and makes changes to the metadata file.
Oplock downgrade is not possible, so how will node B read the changed metadata? And node B can’t request an R oplock while there is an RW oplock:

Granting Oplocks: “Read-Write, Read-Write-Handle: STATUS_OPLOCK_NOT_GRANTED is returned.” A vicious circle.
Maybe I can close the file for this, but that is inconvenient and looks strange to me.

Another critical thing. I see that the only way for node B to be notified when node A finishes modifying the metadata file is to request an RW oplock instead of R and wait for the acknowledgment. But node B needs only read access, other nodes should have read access at the same time, and RW does not allow this.
Maybe I can request RW from B and, after the acknowledgment, close the file and request R, and A should also close the file after sending the acknowledgment and ask for R. But all that looks like magic and workarounds.
Can you please explain what the right strategy is?

You can’t downgrade an oplock. As the oplock requester you can upgrade an oplock. An oplock only “downgrades” when it is broken by an I/O operation.

We open this file on two nodes, A and B, with R.
Then node A upgrades the oplock to RW

Node A can’t upgrade its oplock to RW if the file is open by somebody else at the same time. You can only have RW and RWH oplocks if your handle is the only one open to the file. Besides, if node A is updating the file directly (i.e. not locally caching the writes to send to the file later) it doesn’t need an RW oplock.

I think you might be somewhat misunderstanding what oplocks are for. Oplocks are a notification and signalling mechanism, not an actual resource locking mechanism, designed for use in two scenarios. The first, and the reason they were originally invented, is for cache coherency. An example application is the SMB/CIFS server and client. The second scenario is when you want to be unobtrusive, i.e. you need to be informed when you should close your handle to get out of the way. An example application is the Windows search indexer.

-=Scenario 1 (Cache Coherency)=-
The SMB client (“C1”) requests an oplock that describes the kind of caching it wants to do. For example, it opens a file on the server and requests an RW oplock. This allows C1 to cache the data it reads from the server, and modify its cache without writing the changes back to the server, which reduces network traffic. If another client (“C2”) opens the file on the server, then C1’s RW oplock breaks to R and C2’s open blocks. The oplock break tells C1 that it must write its cached changes back to the server, then acknowledge the oplock break. Acknowledging the break unblocks C2’s open. C1 can continue to cache the data it read from the server. If C2 then writes to the file, C1’s R oplock breaks to None. This tells C1 that its cache is no longer current, so it should discard the cache and re-fetch the data from the server. If C1 can’t get an RW (or R) oplock when it opens the file, that means that it cannot cache but must read/write across the network directly on the server. The client periodically re-requests the oplock so that it can start caching again. As long as the oplock is not granted the client must send all its I/O across the network. Note that C1 is not directly notified when C2 is finished making its modifications. C1 has to periodically request an oplock until it gets one.
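The sequence in Scenario 1 can be written down as a toy state machine. This is a Python sketch for illustration only: the class `ToyShare` and its fields are made up, and it models just the two breaks described above (RW to R when C2 opens, R to None when C2 writes).

```python
class ToyShare:
    """Toy model: C1 holds an oplock on one server file while C2 does I/O."""

    def __init__(self):
        self.c1_oplock = "RW"        # C1 opened first and was granted RW
        self.c1_dirty = True         # C1 has cached writes not yet on the server
        self.c1_cache_valid = True
        self.log = []

    def c2_open(self):
        if self.c1_oplock == "RW":
            self.log.append("C2 open blocks")
            self.log.append("C1 oplock breaks RW -> R")
            if self.c1_dirty:
                self.log.append("C1 writes cached changes to server")
                self.c1_dirty = False
            self.log.append("C1 acknowledges break")
            self.c1_oplock = "R"     # C1 may keep its read cache
        self.log.append("C2 open proceeds")

    def c2_write(self):
        if self.c1_oplock == "R":
            self.log.append("C1 oplock breaks R -> None")
            self.c1_oplock = None
            self.c1_cache_valid = False   # C1 must discard and re-fetch
        self.log.append("C2 write proceeds")


if __name__ == "__main__":
    share = ToyShare()
    share.c2_open()
    share.c2_write()
    print("\n".join(share.log))
```

After both calls C1 holds no oplock and no valid cache, so in this model it has to go to the server for every read and write until a later oplock request is granted.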

-=Scenario 2 (Unobtrusive Service)=-
The second scenario is for a service such as the Windows indexer. The indexer opens files for read access to read and index the contents. If an application, for example Word, comes along and wants to write to the file, it is possible that the presence of the indexer would cause Word to get a sharing violation. The indexer wants to be invisible, so while it is reading it takes out an RH oplock on the file. When Word comes along its open causes a sharing violation. However because of the indexer’s RH oplock the sharing violation is not returned to Word. Instead Word’s open operation blocks and the indexer’s RH oplock breaks. That tells the indexer that it should close its handle and try again later. When the indexer closes its handle that serves to acknowledge the RH oplock break. Word’s open proceeds and it gets to use the file, blissfully unaware that something else was working on it.
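Scenario 2 can be sketched the same way. Again this is a toy Python model with made-up names (`ToyFile`, `word_open_for_write`), and it assumes the indexer's open would conflict with Word's requested sharing, which is the case the paragraph above describes.

```python
class ToyFile:
    """Toy model: an indexer holds a file open, then Word opens it for write."""

    def __init__(self, indexer_has_rh_oplock):
        self.indexer_open = True
        self.indexer_oplock = "RH" if indexer_has_rh_oplock else None
        self.log = []

    def indexer_close(self):
        # Closing the handle is what acknowledges the RH break.
        self.log.append("indexer closes handle")
        self.indexer_open = False
        self.indexer_oplock = None

    def word_open_for_write(self):
        if self.indexer_open and self.indexer_oplock == "RH":
            self.log.append("Word open blocks")
            self.log.append("indexer RH oplock breaks")
            self.indexer_close()
            self.log.append("Word open proceeds")
            return "opened"
        if self.indexer_open:
            # No oplock: the conflicting open fails straight away.
            self.log.append("Word open fails")
            return "sharing violation"
        self.log.append("Word open proceeds")
        return "opened"


if __name__ == "__main__":
    for has_oplock in (True, False):
        f = ToyFile(indexer_has_rh_oplock=has_oplock)
        print(has_oplock, f.word_open_for_write(), f.log)
```

The two runs show the difference the RH oplock makes: with it, Word's open is delayed and then succeeds; without it, Word sees the sharing violation.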

Christian [MSFT]