Some thoughts on oplocks

Brief Introduction

I think one of the least understood, or most misunderstood, concepts when it comes to file systems (FSDs) and file system filter drivers (FSFDs) is oplocks and oplock semantics in general. This is probably because documentation is scarce and, in my opinion, assumes the reader already knows some of the concepts behind these semantics, which can lead to confusion. Also, if you look at the Microsoft samples, they don’t really deal with oplocks. For example, one of the more complex samples, AvScan, which has support for transactions and some bits and pieces for CsvFs, does not attach to network file systems and does not really take oplocks into account.

Oplocks, as I’ve seen from many driver writers’ point of view, are one of those concepts you don’t really want to hear about or deal with until they hit you. At that point you have no idea how to fix your driver to handle the many scenarios in which you could deadlock the system or a single application by misusing or misbehaving around oplocks.

Overview and documentation

This blog post assumes you have read Microsoft's documentation on oplocks and are familiar with the basic concepts. To read more on oplocks and their semantics, see Microsoft's overview on its oplock semantics page.

In this post I want to touch on some oplock concepts that may be confusing, that are not that clear in the documentation, or that you may never have dealt with and would find useful.

Oplock package and implementation

You may have wondered how FSDs implement and support oplocks. Well, the fact is that they don’t. The oplock package is implemented separately in the file system run-time library (the FsRtlXxx functions). All FSDs do, if they want to support oplocks, is call into the FsRtl oplock library to request an oplock or update its status. FSDs treat the OPLOCK structure as opaque, and only the oplock library updates it as needed. An FSD that wants to support oplocks typically needs to call a handful of routines:

  1. FsRtlInitializeOplock is typically called by FSDs during stream (FCB/SCB) creation. The oplock library initializes the OPLOCK structure here.
  2. FsRtlOplockFsctrl is the heart and soul of oplock management in FSDs. The FSD uses it to serve oplock requests and to acknowledge oplock breaks. This call implements all the FSCTLs related to requesting oplocks and to acknowledging and being notified of oplock breaks. If you request an oplock and your FO is opened for synchronous I/O, this function will not grant the oplock. This function is also used to upgrade an oplock (this works only for the enhanced oplocks, Windows 7 and above). With this call the oplock package also associates an oplock key with the OPLOCK in the CREATE path, if the oplock owner has specified one.
  3. FsRtlCheckOplock is probably the most called function throughout the FSD. Every time the FSD processes an operation that could cause an oplock break (see the documentation for the list), it calls this function to check whether the current operation breaks the oplock. If the operation should break the oplock, this call can either block until the break is acknowledged, or the FSD can provide a routine for the oplock package to call after the break is acknowledged, in which case the FSD gives up ownership of the IRP to the oplock package at this point. This function performs the oplock break itself and, depending on the operation and the type of oplock, breaks the oplock to a new state, which could be none or Level 2. See the documentation for more info. There are a few exceptions where the FSD requests the break itself directly (by calling functions such as FsRtlOplockBreakH or FsRtlOplockBreakToNoneEx). These cases all happen during IRP_MJ_CREATE, and here they are:
    1. Request a break if this is a network query open and a KTM transaction is present. Otherwise, do not request a break on network query open.
    2. If a SUPERSEDE, OVERWRITE or OVERWRITE_IF operation is performed on an alternate data stream and FILE_SHARE_DELETE is not specified and there is a Batch or Filter oplock on the primary data stream, request a break of the Batch or Filter oplock on the primary data stream.
    3. If a SUPERSEDE, OVERWRITE or OVERWRITE_IF operation is performed on the primary data stream and DELETE access has been requested and there are Batch or Filter oplocks on any alternate data stream, request a break of the Batch or Filter oplocks on all alternate data streams that have them.
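The pend-and-resume contract of FsRtlCheckOplock described in item 3 can be modelled in a few lines. What follows is a toy in Python, not kernel code: ToyOplock, check and acknowledge are hypothetical names, the status strings only imitate the NTSTATUS names, and the break rule is deliberately simplified (a write breaks to none, anything else breaks an exclusive oplock to Level 2).

```python
class ToyOplock:
    """Toy stand-in for the opaque OPLOCK structure (illustrative only)."""

    def __init__(self):
        self.level = None        # None, "level2" or "exclusive"
        self.in_break = False    # True while a break awaits its acknowledge
        self.breaking_to = None  # level the oplock will settle at after the ack
        self.waiters = []        # resume callbacks parked until the acknowledge

    def grant(self, level):
        self.level = level

    def check(self, operation, resume):
        """Rough analogue of FsRtlCheckOplock called with a completion routine."""
        if self.in_break:
            # A break is already outstanding: park this operation as well.
            self.waiters.append(resume)
            return "STATUS_PENDING"
        if self.level == "level2":
            if operation == "write":
                self.level = None  # Level 2 breaks need no acknowledgement
            return "STATUS_SUCCESS"
        if self.level != "exclusive":
            return "STATUS_SUCCESS"  # no oplock held, caller simply continues
        # Exclusive oplock: start the break and hand the operation over.
        self.in_break = True
        self.breaking_to = None if operation == "write" else "level2"
        self.waiters.append(resume)
        return "STATUS_PENDING"

    def acknowledge(self):
        """Owner acknowledges: settle the new level, then resume parked work."""
        self.level = self.breaking_to
        self.in_break = False
        parked, self.waiters = self.waiters, []
        for resume in parked:
            resume()
```

In this model a read against an exclusive oplock returns "STATUS_PENDING" and its callback runs only once acknowledge() is called, which mirrors the point where the FSD has given up the IRP to the oplock package.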

What is an oplock upgrade ? Is there such thing as a downgrade ?

You may have read in the documentation something about an oplock being upgraded from one level to another, typically from shared (Level 2) to exclusive (Level 1), or from Read to ReadWrite, ReadHandle or ReadWriteHandle. What does that mean, and how does one upgrade an oplock?

For an oplock to be upgraded, an oplock must already be in place. If a client wants to upgrade the oplock, all it has to do is request another, higher-level oplock. Depending on the type of the existing oplock, the upgrade is done in one of 2 ways:

  1. Legacy or old style oplocks. Let’s suppose you have a Level 2/Shared oplock and you want to upgrade it to a Level 1/Exclusive oplock. When you send down the oplock request, the initial IRP gets completed indicating the oplock broke to NONE, and the new IRP gets granted a Level 1 oplock, becoming the new pended IRP.
  2. Enhanced oplocks. Let’s suppose you have a READ oplock acquired and you want to upgrade it to an RWH/Exclusive oplock. When you send down the oplock request, the oplock package actually upgrades the oplock: the old IRP gets completed with the new oplock level and the status STATUS_OPLOCK_SWITCHED_TO_NEW_HANDLE, while the new IRP gets pended, having been granted the new oplock level.
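To make the two flows easier to compare, here is a small Python model of what happens to the old and the new IRP. It only encodes the behaviour described in the list above; ToyIrp and upgrade are hypothetical names, and no completion status is modelled for the legacy case because the text does not name one.

```python
from dataclasses import dataclass


@dataclass
class ToyIrp:
    """Illustrative stand-in for an oplock request IRP."""
    name: str
    state: str = "pended"  # "pended" or "completed"
    status: str = ""       # completion status, when one is modelled
    level: str = ""        # level held (pended) or reported (completed)


def upgrade(style, old_irp, new_irp, requested_level):
    """Apply the upgrade flow for 'legacy' or 'enhanced' oplocks to two toy IRPs."""
    if style not in ("legacy", "enhanced"):
        raise ValueError("style must be 'legacy' or 'enhanced'")
    old_irp.state = "completed"
    if style == "legacy":
        # The old request completes reporting a break to NONE.
        old_irp.level = "NONE"
    else:
        # The old request completes carrying the new level and a switch status.
        old_irp.status = "STATUS_OPLOCK_SWITCHED_TO_NEW_HANDLE"
        old_irp.level = requested_level
    # In both cases the new request becomes the pended IRP that owns the oplock.
    new_irp.state = "pended"
    new_irp.level = requested_level
    return old_irp, new_irp
```

For example, upgrade("enhanced", ToyIrp("old", level="R"), ToyIrp("new"), "RWH") leaves the old IRP completed with level "RWH" and the new IRP pended at "RWH".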

How about downgrading ?

There is no such thing. You cannot downgrade an oplock. Let’s suppose your application has been granted an exclusive oplock. If you send the fsctl to request a shared oplock, hoping that the exclusive one will be downgraded, it is not going to happen. The FSD, or more precisely the FsRtl package, will return an error, usually 0x0000012C (ERROR_OPLOCK_NOT_GRANTED, "The oplock request is denied"). To “downgrade” an oplock you simply have to break it to a lower level or to none. If you really must use the term, “downgrading the oplock” = breaking the oplock.

There is however a difference between the 2 types of oplocks and how the break happens:

  1. Legacy or old style oplocks. If you have a Level 1/Exclusive oplock granted and an application breaks that oplock to Level 2/Shared, then when you acknowledge the break with FSCTL_OPLOCK_BREAK_ACKNOWLEDGE the acknowledge IRP is pended and now owns the new Level 2 shared oplock. This IRP will be completed when the Level 2 oplock is broken to none. If you acknowledge the break with FSCTL_OPLOCK_BREAK_ACK_NO_2, the oplock just breaks to None. For more details, see Microsoft's documentation on these FSCTLs.
  2. Enhanced or new style oplocks. If you have an RWH/Exclusive oplock and an application breaks that oplock to RH/Shared, then when you acknowledge the break the IRP simply completes, in contrast with the legacy oplocks, which may pend the acknowledge. You should check the output of the oplock fsctl to know whether you must acknowledge the break or not.
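The rules in this section fit in a short Python model. It encodes only what is stated above and is not the real FsRtl logic: the function names are hypothetical, enhanced levels are represented as sets of the letters R, W and H, and any case the text does not cover is reported as None or rejected.

```python
ERROR_OPLOCK_NOT_GRANTED = 0x12C  # the error value quoted in the text above


def request_while_holding(held, requested):
    """Model a new oplock request on top of a held one.

    held and requested are sets drawn from {"R", "W", "H"}.
    """
    if requested < held:  # strict subset: an attempted "downgrade"
        return ERROR_OPLOCK_NOT_GRANTED
    return None  # every other case is outside this sketch


def acknowledge_break(style, fsctl, breaking_to):
    """Return what happens to the acknowledge IRP and the resulting level."""
    if style == "legacy" and fsctl == "FSCTL_OPLOCK_BREAK_ACK_NO_2":
        return {"ack_irp": "completed", "level": "NONE"}
    if style == "legacy" and fsctl == "FSCTL_OPLOCK_BREAK_ACKNOWLEDGE":
        if breaking_to == "LEVEL2":
            # The acknowledge IRP stays pended and now owns the Level 2 oplock.
            return {"ack_irp": "pended", "level": "LEVEL2"}
        return {"ack_irp": "completed", "level": "NONE"}
    if style == "enhanced" and fsctl == "FSCTL_REQUEST_OPLOCK":
        # As described above, the enhanced acknowledge simply completes.
        return {"ack_irp": "completed", "level": breaking_to}
    raise ValueError("combination not covered by this sketch")
```

Calling request_while_holding({"R", "W", "H"}, {"R"}) returns 0x12C, which is the "you cannot downgrade, you can only break" rule in code form.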

Things to consider in file system filters

What filter drivers usually do is provide extra functionality to the system without (and this is important) altering the existing features. In our context this means that filters should never break oplocks. To a client using oplocks, a filter that breaks them can make it look as if oplocks are not working as the documentation states, can keep the client from implementing its functionality correctly because it experiences unexpected oplock breaks, or, in the worst case, can cause deadlocks. FSFDs should not worry about how to avoid a deadlock after they have broken an oplock, but rather about how to implement their functionality without breaking oplocks in the first place. Here are a few things to consider:

  1. The FILE_COMPLETE_IF_OPLOCKED flag is set in the CREATE path and is a way for the calling thread to avoid hanging while waiting for an oplock break, if the CREATE operation breaks an existing oplock. In your filter, if you see this flag and the CREATE completes with the status STATUS_OPLOCK_BREAK_IN_PROGRESS, you should NOT keep the request from completing, either by trying to read from the stream or by waiting for an acknowledge in post-CREATE. The same thread could be the one doing the acknowledge. What you should do in your filter is one of 2 things:
    1. Let the CREATE request go and wait for an acknowledge on that stream. Listen for the oplock break acknowledge in your filter for this stream; after that you can access the stream yourself to read from it and so on.
    2. Alternatively, from your create routine, hand the FO you want to, let’s say, scan to a worker thread and let the CREATE request go. Because of how oplocks are implemented, your worker thread should be able to access the file as soon as the oplock break gets acknowledged. Be sure to synchronize everything correctly.
  2. FSCTL_OPLOCK_BREAK_ACKNOWLEDGE is NOT the only way oplock breaks are acknowledged by their owners. In fact, on Windows 7 and above you might rarely see it. Since the enhanced oplocks were introduced, you are more likely to see the new FSCTL_REQUEST_OPLOCK with the Flags member of REQUEST_OPLOCK_INPUT_BUFFER set to REQUEST_OPLOCK_INPUT_FLAG_ACK.
  3. FSCTL_OPLOCK_BREAK_NOTIFY is another fsctl that I have seen cause issues in filters. This control code is used to get notified when an oplock break is being acknowledged. I stress “oplock break” because if you send this down before the oplock starts breaking, or after the acknowledgement is over, this fsctl will do nothing. Keep in mind that this control code will pend the request until the oplock break is acknowledged. Let’s suppose you are in the CLEANUP path and you want to scan the file before cleanup. You issue a CreateFile and you see it complete with the oplock break in progress status. Sending this fsctl down now could prove useless, in the sense that the CLEANUP itself would acknowledge the oplock break, and since you are blocking it, this will never happen. Instead, use the worker thread trick I suggested at point 1.
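Points 1 and 2 can be combined into one pattern: never hold the CREATE, park the scan on a worker, and wake it when either style of acknowledge goes by. The Python sketch below models that pattern with threads. It is not filter manager code: ToyStreamContext, post_create and post_fsctl are hypothetical names, the FSCTLs are plain strings, and the buffer layout and the 0x2 flag value are my recollection of the public headers, so treat them as assumptions to verify.

```python
import struct
import threading

# Flag value as I recall it from the public Windows headers (an assumption).
REQUEST_OPLOCK_INPUT_FLAG_ACK = 0x2
# REQUEST_OPLOCK_INPUT_BUFFER modelled as USHORT version, USHORT length,
# ULONG requested level, ULONG flags.
INPUT_BUFFER_LAYOUT = "<HHII"
LEGACY_ACKS = {"FSCTL_OPLOCK_BREAK_ACKNOWLEDGE", "FSCTL_OPLOCK_BREAK_ACK_NO_2"}


def is_break_ack(fsctl, input_buffer=b""):
    """True if this FSCTL acknowledges a break, legacy or enhanced style."""
    if fsctl in LEGACY_ACKS:
        return True
    size = struct.calcsize(INPUT_BUFFER_LAYOUT)
    if fsctl == "FSCTL_REQUEST_OPLOCK" and len(input_buffer) >= size:
        _ver, _len, _level, flags = struct.unpack_from(INPUT_BUFFER_LAYOUT, input_buffer)
        return bool(flags & REQUEST_OPLOCK_INPUT_FLAG_ACK)
    return False


class ToyStreamContext:
    """Hypothetical per-stream state a filter might keep."""

    def __init__(self):
        self.break_acknowledged = threading.Event()


def post_create(status, ctx, scan, results):
    """Never hold the CREATE: defer the scan when a break is in progress."""
    if status != "STATUS_OPLOCK_BREAK_IN_PROGRESS":
        results.append(scan())
        return None

    def deferred():
        ctx.break_acknowledged.wait()  # parked off the CREATE thread
        results.append(scan())

    worker = threading.Thread(target=deferred)
    worker.start()
    return worker  # the CREATE itself is released right away


def post_fsctl(fsctl, input_buffer, ctx):
    """Wake the deferred scan once any form of acknowledge has gone through."""
    if is_break_ack(fsctl, input_buffer):
        ctx.break_acknowledged.set()
```

The design point is that the thread that received STATUS_OPLOCK_BREAK_IN_PROGRESS is never the one that waits, so it stays free to issue the acknowledge itself. A real filter would also have to handle acknowledgement by handle close, cancellation and teardown, which this sketch ignores.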

Debugging oplock related issues

Until Windows 8 you needed the private symbols for the FsRtl library to look into the internal OPLOCK structure. As you have probably noticed, all the FsRtl oplock functions take this opaque structure as a parameter. It is up to the FSD where to keep its OPLOCK structure if it wants to support oplocks, but at least starting with Windows 8 it lives in the FSRTL_ADVANCED_FCB_HEADER. If you look at the FAT sample you can see that prior to Windows 8 it is embedded in the FCB in a union called “Specific”.

But what is really in this opaque structure? Since it is undocumented we can only guess, but starting with Windows 8 there is a debugger command we can use:

kd> !fltkd.oplock
oplock [addr] [flags]                     Dump oplock given address of owning FILE_OBJECT or stuck IRP (tells you what oplock it is waiting for)
where [flags] specify what information to dump:
0x00000001 [addr] points to an OPLOCK (default is FILE_OBJECT or IRP)

This is useful to know because, along with this command, a new structure appeared in the public symbols of the NT module:

4: kd> dt ntkrnlmp!_NONOPAQUE_OPLOCK
+0x000 IrpExclusiveOplock : Ptr64 _IRP
+0x008 FileObject       : Ptr64 _FILE_OBJECT
+0x010 ExclusiveOplockOwner : Ptr64 _EPROCESS
+0x018 ExclusiveOplockOwnerThread : Ptr64 _ETHREAD
+0x020 WaiterPriority   : UChar
+0x028 IrpOplocksR      : _LIST_ENTRY
+0x038 IrpOplocksRH     : _LIST_ENTRY
+0x048 RHBreakQueue     : _LIST_ENTRY
+0x058 WaitingIrps      : _LIST_ENTRY
+0x068 DelayAckFileObjectQueue : _LIST_ENTRY
+0x078 AtomicQueue      : _LIST_ENTRY
+0x088 DeleterParentKey : Ptr64 _GUID
+0x090 OplockState      : Uint4B
+0x098 FastMutex        : Ptr64 _FAST_MUTEX

To support the !fltkd.oplock command this structure had to be exposed publicly. From playing around with it I have noticed a few things, but take nothing for granted, since as far as I know this is not documented anywhere.

I am guessing the IrpExclusiveOplock member is a pointer to the IRP that is currently pended and was granted an exclusive type of oplock. I have noticed that this member is not populated if you request a shared oplock, and I am guessing the LIST_ENTRYs below keep that kind of information.

The FileObject member is straightforward: it is the FileObject representing the stream for which the oplock is granted, requested and so on.

As you can see, there is also a process and a thread describing the exclusive oplock owner. These members are also populated only when an exclusive oplock is granted. Keep in mind though that even if your process owns an exclusive oplock, it could still break it internally if it does not comply with the oplock breaking rules mentioned in the documentation.

I am going to guess that the IrpOplocksR, IrpOplocksRH and AtomicQueue members are lists of IRPs that have been granted shared oplocks on this stream, of the type their names suggest. As you know, you can have multiple shared oplocks on the same stream. Atomic oplocks are the ones you get using the FILE_OPEN_REQUIRING_OPLOCK option in IoCreateFile.

The WaitingIrps member probably contains a list of IRPs waiting to do their I/O that first need an oplock break acknowledge. Remember that the FSD gives up its I/O IRPs to the FsRtl oplock library by calling FsRtlCheckOplock.

The DeleterParentKey member has to do with directory oplocks, in my opinion.

The OplockState member is a flags field keeping track of the current state of the oplock: whether it is Shared or Exclusive, whether it is breaking, and to what level. You can see that from the !fltkd.oplock command output.

The last member, the mutex, is a way to synchronize access to this structure.

You can see a pretty similar story using the !fltkd.oplock command. Here is an example:

6: kd> !fltkd.oplock 0xfffffa80`375563a0
Oplock: fffff8a002a09e40 RH Granted/Atomic Requested
State Flags              : [00013000] CacheReadLevel CacheHandleLevel AtomicOplockRequest
Excl. Oplock Request Irp : <not exclusive oplock>
Excl. Oplock File Object : <not exclusive oplock>
Excl. Oplock Owning Proc.: <not exclusive oplock>

RH Oplocks               : (fffff8a002a09e78)  Count=2

And the corresponding structure:

6: kd> dt ntkrnlmp!_NONOPAQUE_OPLOCK fffff8a0`02a09e40
+0x000 IrpExclusiveOplock : (null)
+0x008 FileObject       : (null)
+0x010 ExclusiveOplockOwner : 0xfffffa80`31ce0940 _EPROCESS
+0x018 ExclusiveOplockOwnerThread : (null)
+0x020 WaiterPriority   : 0 ''
+0x028 IrpOplocksR      : _LIST_ENTRY [ 0xfffff8a0`02a09e68 - 0xfffff8a0`02a09e68 ]
+0x038 IrpOplocksRH     : _LIST_ENTRY [ 0xfffff8a0`0fbb4570 - 0xfffff8a0`10e37a60 ]
+0x048 RHBreakQueue     : _LIST_ENTRY [ 0xfffff8a0`02a09e88 - 0xfffff8a0`02a09e88 ]
+0x058 WaitingIrps      : _LIST_ENTRY [ 0xfffff8a0`02a09e98 - 0xfffff8a0`02a09e98 ]
+0x068 DelayAckFileObjectQueue : _LIST_ENTRY [ 0xfffff8a0`02a09ea8 - 0xfffff8a0`02a09ea8 ]
+0x078 AtomicQueue      : _LIST_ENTRY [ 0xfffff8a0`0fbb45a8 - 0xfffff8a0`0fbb45a8 ]
+0x088 DeleterParentKey : (null)
+0x090 OplockState      : 0x13000
+0x098 FastMutex        : 0xfffffa80`33f98f60 _FAST_MUTEX

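As you can see, the oplock is an RH oplock that was requested atomically. Since this is a shared oplock, !fltkd.oplock reports no IRP and no owning process (the raw dt output still shows a non-null ExclusiveOplockOwner, which the extension does not report).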

Now for an exclusive oplock:

Oplock: fffff8a002a09e40 RWH Granted
State Flags              : [00007040] Exclusive CacheReadLevel CacheHandleLevel CacheWriteLevel
Excl. Oplock Request Irp : fffff980208b4c60
Excl. Oplock File Object : fffffa803766c760
Excl. Oplock Owning Proc.: fffffa8031ce0940

And the structure below:
0: kd> dt ntkrnlmp!_NONOPAQUE_OPLOCK fffff8a002a09e40
+0x000 IrpExclusiveOplock : 0xfffff980`208b4c60 _IRP
+0x008 FileObject       : 0xfffffa80`3766c760 _FILE_OBJECT
+0x010 ExclusiveOplockOwner : 0xfffffa80`31ce0940 _EPROCESS
+0x018 ExclusiveOplockOwnerThread : 0xfffffa80`31c6f080 _ETHREAD
+0x020 WaiterPriority   : 0 ''
+0x028 IrpOplocksR      : _LIST_ENTRY [ 0xfffff8a0`02a09e68 - 0xfffff8a0`02a09e68 ]
+0x038 IrpOplocksRH     : _LIST_ENTRY [ 0xfffff8a0`02a09e78 - 0xfffff8a0`02a09e78 ]
+0x048 RHBreakQueue     : _LIST_ENTRY [ 0xfffff8a0`02a09e88 - 0xfffff8a0`02a09e88 ]
+0x058 WaitingIrps      : _LIST_ENTRY [ 0xfffff8a0`02a09e98 - 0xfffff8a0`02a09e98 ]
+0x068 DelayAckFileObjectQueue : _LIST_ENTRY [ 0xfffff8a0`02a09ea8 - 0xfffff8a0`02a09ea8 ]
+0x078 AtomicQueue      : _LIST_ENTRY [ 0xfffff8a0`02a09eb8 - 0xfffff8a0`02a09eb8 ]
+0x088 DeleterParentKey : (null)
+0x090 OplockState      : 0x7040
+0x098 FastMutex        : 0xfffffa80`33f98f60 _FAST_MUTEX

Now if you break it by reading:

Oplock: fffff8a002a09e40 RWH Breaking to RH
State Flags              : [00507040] Exclusive CacheReadLevel CacheHandleLevel CacheWriteLevel BreakToCacheRead BreakToCacheHandle
Excl. Oplock Request Irp : 0000000000000000
Excl. Oplock File Object : fffffa803766c760
Excl. Oplock Owning Proc.: fffffa8031ce0940

I/O Requests Awaiting Oplock Break Acknowledgement: (fffff8a002a09e98)  Count=1
Irp Address           Completion Rtn.     Completion Ctx.     BreakAllRH?
----------------      ----------------    ----------------    -----------
Could not read offset of field "Links" from type nt!_WAITING_IRP

6: kd> dt ntkrnlmp!_NONOPAQUE_OPLOCK fffff8a002a09e40
+0x000 IrpExclusiveOplock : (null)
+0x008 FileObject       : 0xfffffa80`3766c760 _FILE_OBJECT
+0x010 ExclusiveOplockOwner : 0xfffffa80`31ce0940 _EPROCESS
+0x018 ExclusiveOplockOwnerThread : 0xfffffa80`31c6f080 _ETHREAD
+0x020 WaiterPriority   : 0xa ''
+0x028 IrpOplocksR      : _LIST_ENTRY [ 0xfffff8a0`02a09e68 - 0xfffff8a0`02a09e68 ]
+0x038 IrpOplocksRH     : _LIST_ENTRY [ 0xfffff8a0`02a09e78 - 0xfffff8a0`02a09e78 ]
+0x048 RHBreakQueue     : _LIST_ENTRY [ 0xfffff8a0`02a09e88 - 0xfffff8a0`02a09e88 ]
+0x058 WaitingIrps      : _LIST_ENTRY [ 0xfffff8a0`10f21720 - 0xfffff8a0`10f21720 ]
+0x068 DelayAckFileObjectQueue : _LIST_ENTRY [ 0xfffff8a0`02a09ea8 - 0xfffff8a0`02a09ea8 ]
+0x078 AtomicQueue      : _LIST_ENTRY [ 0xfffff8a0`02a09eb8 - 0xfffff8a0`02a09eb8 ]
+0x088 DeleterParentKey : (null)
+0x090 OplockState      : 0x507040
+0x098 FastMutex        : 0xfffffa80`33f98f60 _FAST_MUTEX

As you can see, I have a problem with my symbols here (Could not read offset of field "Links" from type nt!_WAITING_IRP), but the WaitingIrps list now contains one entry of type nt!_WAITING_IRP. Among other things, that entry holds an IRP which is blocked in a READ, waiting for the oplock break acknowledge.
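Since the OplockState bits are undocumented, it is worth checking how much the three dumps above actually pin down. The Python sketch below groups every bit and every flag name by the set of dumps it appears in; a bit can only correspond to a name with the same appearance pattern. The values and names are copied from the !fltkd.oplock output in this post, while signature_groups is a hypothetical helper written for this illustration.

```python
from collections import defaultdict

# (OplockState value, flag names printed by !fltkd.oplock), from the dumps above.
SAMPLES = [
    (0x13000, {"CacheReadLevel", "CacheHandleLevel", "AtomicOplockRequest"}),
    (0x7040, {"Exclusive", "CacheReadLevel", "CacheHandleLevel", "CacheWriteLevel"}),
    (0x507040, {"Exclusive", "CacheReadLevel", "CacheHandleLevel", "CacheWriteLevel",
                "BreakToCacheRead", "BreakToCacheHandle"}),
]


def signature_groups(samples):
    """Group bits and flag names by which samples they appear in.

    Returns {signature: (bits, names)}, where signature is a tuple of booleans,
    one per sample, and bits/names are the sets sharing that signature.
    """
    all_bits = 0
    all_names = set()
    for value, names in samples:
        all_bits |= value
        all_names |= names

    bits_by_sig = defaultdict(set)
    bit = 1
    while bit <= all_bits:
        if all_bits & bit:
            sig = tuple(bool(value & bit) for value, _ in samples)
            bits_by_sig[sig].add(bit)
        bit <<= 1

    names_by_sig = defaultdict(set)
    for name in all_names:
        sig = tuple(name in names for _, names in samples)
        names_by_sig[sig].add(name)

    return {sig: (bits, names_by_sig[sig]) for sig, bits in bits_by_sig.items()}


if __name__ == "__main__":
    for sig, (bits, names) in sorted(signature_groups(SAMPLES).items()):
        print(sig, sorted(hex(b) for b in bits), sorted(names))
```

With only these three samples, a single flag is identified unambiguously: 0x10000 is AtomicOplockRequest. The other bits come out in pairs (0x1000/0x2000 for the read and handle cache levels, 0x40/0x4000 for Exclusive and CacheWriteLevel, 0x100000/0x400000 for the two BreakTo flags), and these dumps alone do not say which bit within a pair is which. More dumps in different states would narrow that down.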

Here is the IRP waiting for the acknowledge. I will only paste the relevant info:

6: kd> !irp fffff980`14192c60
Irp is active with 10 stacks 8 is current (= 0xfffff98014192f28)
Mdl=fffffa80375822d0: No System Buffer: Thread fffffa80323665c0:  Irp stack trace.
cmd  flg cl Device   File     Completion-Context

……..

            Args: 00000000 00000000 00000000 00000000
>[IRP_MJ_READ(3), N/A(0)]
0 e1 fffffa8033913030 fffffa80374e1910 fffff880012049a0-fffffa8036f5c890 Success Error Cancel pending
\FileSystem\Ntfs    fltmgr!FltpPassThroughCompletion
Args: 00010000 00000000 00060000 00000000

………

How to not break oplocks in your filter, some rules of thumb:

  1. Try riding the file object used by the client. Don’t issue your own CREATE if you can help it, but rather use the original FO in as many cases as you can.
  2. If you must issue your own CREATE, use the oplock key. Using an oplock key will “trick” the oplock package into thinking you are the cache client owner. The key is passed via an ECP, which is described in Microsoft's documentation. It is only available starting with Windows 7.
  3. Avoid doing too many things in the pre-CREATE path. Depending on the oplock status of the file you are about to process, even calling FltGetFileNameInformation could break an oplock.
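The oplock key rule in point 2 boils down to a comparison. Here is a minimal Python model of it, with Python UUIDs standing in for the GUID keys. The function name is hypothetical, and the real oplock package takes the operation and the oplock type into account as well, so this only shows why a matching key keeps the owner's oplock intact.

```python
import uuid


def conflicting_open_breaks(owner_key, opener_key):
    """Would an otherwise conflicting open break the owner's oplock?

    Keys are uuid.UUID values, or None when no oplock key was supplied.
    """
    if owner_key is None or opener_key is None:
        return True  # nothing ties the two opens together
    # Same key: the opener is treated as the same cache client as the owner.
    return owner_key != opener_key


client_key = uuid.uuid4()
# A filter CREATE that carries the client's key does not break the oplock...
filter_with_key = conflicting_open_breaks(client_key, client_key)
# ...while the same CREATE sent without a key does.
filter_without_key = conflicting_open_breaks(client_key, None)
```

To my knowledge the real mechanism is an OPLOCK_KEY_ECP_CONTEXT attached to the CREATE under GUID_ECP_OPLOCK_KEY. Check the current headers before relying on those names.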

Where to from here ?

If you want to play more with oplocks yourself, I would suggest using the filetest utility alongside filespy to track oplock requests, breaks and acknowledges, and to learn more about how oplocks work.

As always, if you have any questions for me, or you want more details regarding this article, just comment below. Any suggestions are welcome.
