NetApp Data ONTAP 8.1.1 – SMB2.1 oplocks

On a NetApp FAS2240-2 filer with CIFS enabled, I ran into performance problems with Windows file share clients and Citrix XenApp servers running Windows Server 2008 R2 SP1.

After some investigation I narrowed the problem down to user data, but it only affected seemingly random users. Hmm, OK, not good. Further investigation led to the real cause: oplocks on the NetApp filer. Oplocks exist to improve performance and should not be a problem!

So what was happening?

The NetApp was running Data ONTAP 8.1.1, which should be able to talk SMB2. And it does! But Windows Server 2008 R2 SP1 talks SMB2.1, and Data ONTAP 8.1.1 does not!

Aha, so that’s where my oplock errors and unrecognized commands were coming from.

Solution:

I upgraded the NetApp filer to Data ONTAP 8.1.2 (which has SMB2.1 disabled by default), and all my errors and problems went away.

 

This is also discussed on the NetApp forum under: https://forums.netapp.com/thread/35860

 

More Info on Oplocks:

Opportunistic locking (oplocks) is a Windows-specific client/server mechanism that lets multiple processes lock the same file while still allowing local (client-side) data caching, which improves performance over Windows networks.

Microsoft’s documentation states “An opportunistic lock (also called an oplock) is a lock placed by a client on a file residing on a server. In most cases, a client requests an oplock so it can cache data locally, thus reducing network traffic and improving apparent response time. Oplocks are used by network redirectors on clients with remote servers, as well as by client applications on local servers” and “Oplocks are requests from the client to the server. From the point of view of the client, they are opportunistic. In other words, the server grants such locks whenever other factors make the locks possible.”.
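To make the grant-and-break idea concrete, here is a minimal toy model in Python. It is only a sketch under simplifying assumptions: the class name ToyOplockServer and its methods are hypothetical, and it models a single exclusive oplock per file, whereas real SMB servers support several oplock levels and much more state.

```python
class ToyOplockServer:
    """Toy model: one exclusive oplock per file, granted opportunistically."""

    def __init__(self):
        self.openers = {}  # path -> set of clients that have the file open
        self.holder = {}   # path -> client currently holding the oplock
        self.events = []   # log of (action, path, client) tuples

    def open(self, client, path):
        """Open path for client; return True if the client may cache locally."""
        others = self.openers.setdefault(path, set()) - {client}
        self.openers[path].add(client)
        if not others:
            # Nobody else has the file open, so the server can grant the oplock.
            self.holder[path] = client
            self.events.append(("grant", path, client))
            return True
        # Another client already has the file open: break any existing oplock
        # so that its holder flushes cached data and stops caching.
        holder = self.holder.pop(path, None)
        if holder is not None:
            self.events.append(("break", path, holder))
        return False


server = ToyOplockServer()
print(server.open("clientA", "report.doc"))  # first opener gets the oplock
print(server.open("clientB", "report.doc"))  # second opener triggers a break
print(server.events)
```

The point of the sketch is that the lock is "opportunistic": the server hands it out only while no one else is using the file, and takes it back as soon as that stops being true.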

You can read more about oplocks in Microsoft’s documentation:

 

 

NetApp Performance Monitoring

 

NetApp sysstat reports filer performance statistics such as CPU utilization, the amount of disk traffic, and cache utilization. When run without options, sysstat prints a new line of basic information every 15 seconds. To stop it, press control-C (^C) or set an iteration count (-c count) so that it ends after that many lines. For more detailed information, use the -u option. For information specific to one protocol, use the other options.

 

More info: http://www.wafl.co.uk/sysstat/

 

Synopsis:

sysstat [ interval ]

sysstat [ -c count ] [ -s ] [ -u | -x | -m | -f | -i | -b ] [ interval ]

  • -c count

    Terminate the output after count number of iterations. The count is a positive, nonzero integer; values larger than LONG_MAX will be truncated to LONG_MAX.

  • -s

    Display a summary of the output columns upon termination. Descriptive columns such as `CP ty’ will not have summaries printed. Note that, with the exception of `Cache hit’, the `Avg’ summary for percentage values is an average of percentages, not a true mean of the underlying data. The `Avg’ is only intended as a gross indicator of performance. For more detailed information use tools such as nfsstat, netstat, or statit.

  • -f

    For the default format display FCP statistics.

  • -i

    For the default format display iSCSI statistics.

  • -b

    Display the SAN extended statistics instead of the default display.

  • -u

    Display the extended utilization statistics instead of the default display.

  • -x

    Displays the extended output format instead of the default display. This includes all available output fields. Be aware that this produces output that is longer than 80 columns and is generally intended for “offline” types of analysis and not for “realtime” viewing.

  • -m

    Displays multi-processor CPU utilization statistics. In addition to the percentage of the time that one or more CPUs were busy (ANY), the average (AVG) is displayed, as well as the individual utilization of each processor.

  • interval

    A positive, non-zero integer that represents the reporting interval in seconds. If not provided, the default is 15 seconds.
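The note under -s about `Avg’ being an average of percentages rather than a true mean is easy to miss, so here is a small Python illustration. The sample numbers and both function names are made up for the example; they are not sysstat output.

```python
def average_of_percentages(samples):
    """Mean of the per-interval percentages (what an 'Avg' of percentages does)."""
    return sum(100.0 * part / whole for part, whole in samples) / len(samples)


def true_mean_percentage(samples):
    """Percentage computed from the summed underlying data."""
    total_part = sum(part for part, _ in samples)
    total_whole = sum(whole for _, whole in samples)
    return 100.0 * total_part / total_whole


# Hypothetical (part, whole) pairs for two intervals: a busy one and a quiet one.
samples = [(90, 100), (1, 10)]
print(average_of_percentages(samples))  # both intervals count equally
print(true_mean_percentage(samples))    # the busy interval dominates
```

When the amount of underlying work differs a lot between intervals, the two numbers can be far apart, which is why the summary is only a gross indicator.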

     

Here are explanations of some of the columns in the NetApp sysstat output.

 

Cache age : The age in minutes, or in seconds when the value is followed by an s, of the oldest read-only blocks in the buffer cache. Data in this column indicates how fast read operations are cycling through system memory; when the filer is reading very large files, buffer cache age will be very low. The cache age will also be low if reads are random. If you have a performance problem where read performance is poor, this number may indicate that you need a system with more memory, or that you should analyze the application to reduce the randomness of the workload.

 

Cache hit : This is the WAFL cache hit rate percentage: the percentage of times WAFL tried to read a data block from disk and found the data already cached in memory. A dash in this column indicates that WAFL did not attempt to load any blocks during the measurement interval.

 

CP Ty : Consistency Point (CP) type is the reason that a CP started in that interval. The CP types are:

 


  • -

    No CP started during sampling interval

  • number

    Number of CPs started during sampling interval, if greater than one

  • B

    Back to back CPs (CP generated CP)

  • b

    Deferred back to back CPs (CP generated CP)

  • F

    CP caused by full NVLog

  • H

    A type H CP is a CP from high watermark in modified buffers. If a CP is not in progress, and the number of buffers holding data that has been modified but not yet written to disk exceeds a threshold, then a CP from high watermark is triggered.

  • L

    A type L CP is a CP from low watermark in available buffers. If a CP is not in progress, and the number of buffers available goes below a threshold, then a CP from low watermark is triggered.

  • S

    CP caused by snapshot operation

  • T

    CP caused by timer

  • U

    CP caused by flush

  • Z

    CP caused by internal sync

  • V

    CP caused by low virtual buffers

  • M

    CP caused by low mbufs

  • D

    CP caused by low datavecs

  • :

    continuation of CP from previous interval

  • #

    continuation of CP from previous interval, and the NVLog for the next CP is now full, so that the next CP will be of type B.

 

The type character is followed by a second character which indicates the phase of the CP at the end of the sampling interval. If the CP completed during the sampling interval, this second character will be blank. The phases are:

 

  • 0

    Initializing

  • n

    Processing normal files

  • s

    Processing special files

  • q

    Processing quota files

  • f

    Flushing modified data to disk

  • v

    Flushing modified superblock to disk
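The type and phase characters above can be turned into a small lookup. The following Python sketch decodes one `CP ty’ field. The function name decode_cp_ty is hypothetical, and treating "-" as "no CP started" is an assumption about how sysstat prints that case.

```python
CP_TYPES = {
    "B": "back to back CPs",
    "b": "deferred back to back CPs",
    "F": "full NVLog",
    "H": "high watermark in modified buffers",
    "L": "low watermark in available buffers",
    "S": "snapshot operation",
    "T": "timer",
    "U": "flush",
    "Z": "internal sync",
    "V": "low virtual buffers",
    "M": "low mbufs",
    "D": "low datavecs",
    ":": "continuation of CP from previous interval",
    "#": "continuation of CP from previous interval, NVLog for next CP is full",
}

CP_PHASES = {
    "0": "initializing",
    "n": "processing normal files",
    "s": "processing special files",
    "q": "processing quota files",
    "f": "flushing modified data to disk",
    "v": "flushing modified superblock to disk",
}


def decode_cp_ty(field):
    """Translate one 'CP ty' field such as 'Tf' into a readable description."""
    field = field.strip()
    if field == "-":  # assumption: '-' is printed when no CP started
        return "no CP started"
    if field.isdigit():  # a bare number: several CPs started in the interval
        return f"{field} CPs started"
    if not field or field[0] not in CP_TYPES:
        raise ValueError(f"unknown CP type field: {field!r}")
    reason = CP_TYPES[field[0]]
    phase = field[1:]
    if not phase:  # blank second character: the CP finished in this interval
        return f"{reason}; completed within the interval"
    if phase not in CP_PHASES:
        raise ValueError(f"unknown CP phase: {phase!r}")
    return f"{reason}; phase at end of interval: {CP_PHASES[phase]}"


print(decode_cp_ty("Tf"))
print(decode_cp_ty("F"))
```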

     

CP util : The Consistency Point (CP) utilization, the percentage of time spent in a CP. 100% time in CP is a good thing: it means that all of the CPU time set aside for writing data was actually used for it. 75% means that only 75% of the time allocated to writing data was used, so 25% of that time was wasted. A good CP percentage is at or near 100%.

 

Examples:

 

sysstat
Display the default output every 15 seconds, requires control-C to terminate.

sysstat 1
Display the default output every second, requires control-C to terminate.

sysstat -s 1
Display the default output every second, upon control-C termination print out the summary statistics.

sysstat -c 10
Display the default output every 15 seconds, stopping after the 10th iteration.

sysstat -c 10 -s -u 2

sysstat -u -c 10 -s 2
Display the utilization output format, every 2 seconds, stopping after the 10th iteration, upon completion print out the summary statistics.

sysstat -x -s 5
Display the extended (full) output, every 5 seconds, upon control-C termination print out the summary statistics.
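
If you script these calls, for example to send them to the filer over a remote shell, a small helper can assemble the argument list according to the synopsis. This is a sketch only: the function name build_sysstat_command and its parameters are hypothetical, and it only builds the command line, it does not run anything.

```python
FORMATS = {"u", "x", "m", "f", "i", "b"}  # display options from the synopsis


def build_sysstat_command(interval=None, count=None, summary=False, fmt=None):
    """Return the sysstat argument list for the given options."""
    args = ["sysstat"]
    if count is not None:
        if count <= 0:
            raise ValueError("count must be a positive, nonzero integer")
        args += ["-c", str(count)]
    if summary:
        args.append("-s")
    if fmt is not None:
        if fmt not in FORMATS:
            raise ValueError(f"format must be one of {sorted(FORMATS)}")
        args.append(f"-{fmt}")  # only one display option at a time
    if interval is not None:
        if interval <= 0:
            raise ValueError("interval must be a positive, nonzero integer")
        args.append(str(interval))
    return args


# Utilization format, every 2 seconds, 10 iterations, with summary.
print(" ".join(build_sysstat_command(interval=2, count=10, summary=True, fmt="u")))
```

Accepting a single fmt value mirrors the synopsis, where the display options are listed as alternatives.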