Wednesday, December 26, 2018

GPFS Key Tuning Parameters

GPFS Tuning Parameters

This section describes some of the configuration parameters available in GPFS, along with notes on how they may affect performance.

These are GPFS configuration parameters that can be set cluster wide, on a specific node, or on sets of nodes.
To view the configuration parameters that have been changed from the default: mmlsconfig
To view the active value of any of these parameters (v3.4 and later): mmdiag --config
To change any of these parameters use mmchconfig.

For example, to change the pagepool setting on all nodes:
 mmchconfig pagepool=256M
Some options take effect immediately when using the -i or -I flag to mmchconfig; others take effect only after the node is restarted. Use -i to make the change permanent and affect the running GPFS daemon immediately. Use -I to affect the GPFS daemon only (the change reverts to the saved settings on restart). Refer to the GPFS documentation for details. In addition, some parameters have a section called Tuning Guidelines.

These are general guidelines that can be used to determine a starting point for tuning a parameter.
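As an illustrative sketch of the workflow described above (the parameter names, values, and node names are placeholders, not recommendations):
 mmlsconfig
 mmdiag --config | grep pagepool
 mmchconfig pagepool=1G -i
 mmchconfig maxFilesToCache=10000 -N node1,node2
The first two commands show what has been changed from the defaults and the currently active value on the local node; the mmchconfig examples apply a change cluster wide (taking effect immediately with -i) and to a specific list of nodes with -N.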
  
 GPFSCmdPortRange 
 leaseRecoveryWait 
logfile 
maxBufferDescs 
maxFilesToCache 
maxMBpS 
maxMissedPingTimeout 
maxReceiverThreads 
maxStatCache 
minMissedPingTimeout 
nfsPrefetchStrategy 
nsdBufSpace 
nsdInlineWriteMax 
nsdMaxWorkerThreads 
nsdMultiQueue 
nsdSmallBufferSize 
nsdSmallThreadRatio 
nsdThreadMethod 
nsdThreadsPerQueue 
numaMemoryInterleave 
opensslLibName 
pagepool 
prefetchPct 
prefetchThreads 
privateSubnetOverride 
readReplicaPolicy 
scatterBuffers 
scatterBufferSize 
seqDiscardThreshold 
sharedMemLimit 
socketMaxListenConnections 
socketRcvBufferSize 
socketSndBufferSize 
statCacheLimit 
tokenMemLimit 
verbsLibName 
verbsRdmaQpRtrSl 
verbsrdmasperconnection 
verbsrdmaspernode 
worker1Threads 
worker3Threads 
writebehindThreshold 
ignorePrefetchLUNCount    
leaseRecoveryWait  

The leaseRecoveryWait parameter defines how long the FS manager of a filesystem will wait after the last known lease expiration of any failed nodes before running recovery. 

A failed node cannot reconnect to the cluster before recovery is finished. 
The leaseRecoveryWait parameter value is in seconds and the default is 35. Making this value smaller increases the risk that there may be IO in flight from the failing node to the disk/controller when recovery starts running. 

This may result in out of order IOs between the FS manager and the dying node. 

 In most cases where a node is expelled from the cluster there is either a problem with the network or the node is running out of resources, such as paging. For example, if an application running on a node is paging the machine to death or overrunning network capacity, GPFS may not have a chance to contact the Cluster Manager node to renew its lease within the timeout period.
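A hedged example of raising leaseRecoveryWait on a cluster where storage failover is slow (the value 60 is illustrative only, not a recommendation) and then checking the active value:
 mmchconfig leaseRecoveryWait=60
 mmdiag --config | grep leaseRecoveryWait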

GPFSCmdPortRange  
When GPFS administration commands are executed they may use one or more TCP/IP ports to complete the command. For example, when using standard ssh, an admin command opens a connection using port 22. In addition to the remote shell or file copy command ports, additional ports are opened to pass data to and from remote GPFS daemons. By default a GPFS command uses one of the ephemeral ports, and the remote node handling the command (typically the Cluster Manager node or one of the File System Manager nodes) connects back to the node originating the command. In some environments you may want to limit the range of ports used by GPFS administration commands. You can control the ports used by the remote shell and file copy commands by using different tools or by configuring those tools to use different ports. The ports used by the GPFS daemon for administrative command execution can be defined using the GPFS configuration parameter GPFSCmdPortRange.
 mmchconfig GPFSCmdPortRange=lowport-highport 

 This allows you to limit the ports used for GPFS administration mm* command execution. You need enough ports to support several concurrent commands from a single node, so you should define 20 or more ports for this purpose. 
Example: mmchconfig GPFSCmdPortRange=30000-30100

logfile
The log file size should be larger for high metadata rate systems so there are fewer glitches when the log has to wrap. It can be as large as 16MB on large blocksize file systems. To set this parameter use the -L flag on mmcrfs.
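A hedged sketch of setting a larger log file size at file system creation time (the device name, stanza file, and blocksize are placeholders):
 mmcrfs gpfs1 -F /tmp/nsd.stanza -B 4M -L 16M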

minMissedPingTimeout  
The minMissedPingTimeout and maxMissedPingTimeout parameters set limits on the calculation of missedPingTimeout (MPT), which is the allowable time for pings to fail from the Cluster Manager (CM) to a node that has not renewed its lease. The default MPT is leaseRecoveryWait minus 5 seconds. The CM will wait MPT seconds after the lease has expired before declaring a node out of the cluster. The minMissedPingTimeout and maxMissedPingTimeout values are in seconds and the defaults are 3 and 60 respectively. If these values are changed, only GPFS on the quorum nodes (from which the CM is elected) needs to be recycled for them to take effect. This can be used to cover over something like a central network switch failure timeout (or other network glitches) that may be longer than leaseRecoveryWait. It may prevent false node-down conditions, but it will extend the time for node recovery to finish, which may block other nodes from making progress if the failing node held tokens for many shared files. Just as in the case of leaseRecoveryWait, in most cases where a node is expelled from the cluster there is either a problem with the network or the node is running out of resources, such as paging. For example, if an application running on a node is paging the machine to death or overrunning network capacity, GPFS may not have a chance to contact the Cluster Manager node to renew its lease within the timeout period.
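A hedged example of widening the ping timeout window to ride out longer network glitches (the values are illustrative only); remember that only the quorum nodes need GPFS recycled for the change to take effect:
 mmchconfig minMissedPingTimeout=60
 mmchconfig maxMissedPingTimeout=120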

maxMissedPingTimeout  
See: minMissedPingTimeout

maxReceiverThreads  
The maxReceiverThreads parameter is the number of threads used to handle incoming TCP packets. These threads gather the packets until there are enough bytes for the incoming RPC (or RPC reply) to be handled. For some simple RPCs the receiver thread handles the message immediately; otherwise it hands the message off to handler threads. maxReceiverThreads defaults to the number of CPUs in the node, up to 16. It can be configured higher if necessary, up to 128 for very large clusters.

pagepool  
The pagepool parameter determines the size of the GPFS file data block cache. Unlike local file systems that use the operating system page cache to cache file data, GPFS allocates its own cache called the pagepool. The GPFS pagepool is used to cache user file data and file system metadata. The old default pagepool size of 64MB is too small for many applications, so this is a good place to start looking for performance improvement. In release 3.5 the default is 1GB for new installs; when upgrading, the old setting is kept. Along with file data, the pagepool supplies memory for various types of buffers such as prefetch and write behind.

Sequential IO
The default pagepool size may be sufficient for sequential IO workloads, however a value of 256MB is known to work well in many cases. To change the pagepool size, use the mmchconfig command. For example, to change the pagepool size to 2GB on all nodes in the cluster: mmchconfig pagepool=2G [-i] If the file system blocksize is larger than the default (256K), the pagepool size should be scaled accordingly to allow the same number of buffers to be cached.

Random IO
The default pagepool size will likely not be sufficient for random IO or workloads involving a large number of small files. In some cases allocating 4GB, 8GB or more of memory can improve workload performance.

Random Direct IO
For database applications that use Direct IO, the pagepool is not used for any user data. Its main purpose in this case is for system metadata and caching the indirect blocks for the files.

NSD servers
Assuming no applications or File System Manager services are running on traditional NSD servers (not GNR or GSS servers), the pagepool is only used transiently by the NSD worker threads to gather data from client nodes and write the data to disk. The NSD server does not cache any of the data. Each NSD worker needs just one pagepool buffer per operation, and the buffer can potentially be as large as the largest blocksize of any file system the disks belong to. With the default NSD configuration there will be 3 NSD worker threads per LUN (nsdThreadsPerDisk, pre GPFS 3.5) or per queue (GPFS 3.5 and later) that the node services, so the amount of memory needed in the pagepool will be 3 * #LUNs * maxBlockSize. The target amount of space in the pagepool for NSD workers is controlled by nsdBufSpace, which defaults to 30%. So the pagepool should be large enough that 30% of it provides enough buffers (see the worked example below).

32-bit operating systems
On 32-bit operating systems the pagepool is limited by the GPFS daemon's address space. This means that it cannot exceed 4GB in size and is often much smaller due to other limitations.
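A hedged worked example of the NSD server sizing rule above, assuming 24 LUNs served and a 4MiB maximum file system blocksize (the numbers and node class name are illustrative): 3 * 24 * 4MiB = 288MiB of buffer space is needed for the NSD workers. With nsdBufSpace at its default of 30%, the pagepool should be at least 288MiB / 0.30, roughly 1GiB:
 mmchconfig pagepool=1G -N nsdservers
 mmchconfig nsdBufSpace=30 -N nsdservers
Here nsdservers is assumed to be a node class or node list containing the NSD server nodes.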

opensslLibName  
To initialize multi-cluster communications GPFS uses openssl. When initializing openssl, GPFS looks for these ssl libraries: libssl.so:libssl.so.0:libssl.so.4 (as of GPFS 3.4.0.4). If you are using a newer version of openssl the filename may not match one in the list (for example libssl.so.6). You can use the opensslLibName parameter to tell GPFS to look for the newer version instead. mmchconfig opensslLibName="libssl.so.6"

readReplicaPolicy
Options: default, local

default
By default, when data is replicated GPFS spreads the reads over all of the available failure groups. This configuration is typically best when the nodes running GPFS have equal access to both copies of the data.

local
A value of local has two effects on reading data in a replicated storage pool. Data is read from: 1. A local block device 2. A "local" NSD server. A local block device means that the path to the disk is through a block special device, for example /dev/sd* on Linux or a /dev/hdisk device on AIX. GPFS does not do any further determination, so if disks at two sites are connected with a long distance fiber connection GPFS cannot distinguish what is local. To use this option, connect the sites using the NSD protocol over TCP/IP or InfiniBand Verbs (Linux only). Further, GPFS uses the subnets configuration setting to determine which NSD servers are "local" to an NSD client. For NSD clients to benefit from "local" read access, the NSD servers supporting the local disk need to be on the same subnet as the NSD clients accessing the data, and that subnet needs to be defined using the "subnets" configuration parameter. This parameter is useful when GPFS replication is used to mirror data across sites and there are NSD clients in the cluster. It keeps read access requests from being sent over the WAN.
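A hedged example of enabling local read replica selection for NSD clients at one site (the subnet value is a placeholder and must match the subnet shared by the clients and their local NSD servers):
 mmchconfig readReplicaPolicy=local
 mmchconfig subnets="10.10.1.0"
As described above, the subnets parameter is what GPFS uses to decide which NSD servers are "local" to a client.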

scatterBuffers  
The scatterBuffers parameter affects how GPFS organizes file data in the pagepool. The default is scatterBuffers=yes (starting in GPFS 3.5). The scatterBuffers parameter was introduced in GPFS 3.5 as a method to better handle fragmented pagepool memory. It behaves differently depending on the operating system and drivers you are using. It is best to test different settings of scatterBuffers and scatterBufferSize (see the scatterBufferSize section) to see what works best for your application. Tuning Guidelines: If you are on AIX and your workload is mostly sequential, disable the scatterBuffers feature by setting scatterBuffers=no. If you are not observing full blocksize IOs being sent to the storage during sequential IO operations, disabling scatterBuffers or increasing scatterBufferSize may help (see scatterBufferSize).

scatterBufferSize
The scatterBufferSize parameter sets the size of the scatter buffer used by GPFS. The default is 32KiB (starting in GPFS 3.5). Tuning Guidelines: When tuning for sequential IO workloads it may help to increase scatterBufferSize to be the same as the file system blocksize. If you are not observing full blocksize IOs being sent to the storage during sequential IO operations, disabling scatterBuffers or increasing scatterBufferSize may help.

seqDiscardThreshold  
The seqDiscardThreshold parameter affects what happens when GPFS detects a sequential read (or write) access pattern and has to decide what to do with the pagepool buffer after it is consumed (or flushed by writebehind threads). Discarding the buffer is the highest performing option for the case where a very large file is read (or written) sequentially. The default for this value is 1MB, which means that if a file is sequentially read and is greater than 1MB, GPFS does not keep the data in cache after consumption. There are some instances where large files are reread often by multiple processes, data analytics for example. In some cases you can improve the performance of these applications by increasing seqDiscardThreshold to be larger than the set of files you would like to cache. Increasing seqDiscardThreshold tells GPFS to attempt to keep as much data in cache as possible for files below that threshold. The value of seqDiscardThreshold is a file size in bytes. The default is 1MB (1048576 bytes). Tuning Guidelines: Increase this value if you want to cache most files that are sequentially read or written and are larger than 1MB in size. Make sure there are enough buffer descriptors to cache the file data (see: maxBufferDescs).
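A hedged example for a workload that rereads files of up to about 10GiB: raise seqDiscardThreshold above the largest file to be cached, and make sure the pagepool and maxBufferDescs are large enough to actually hold the data (the values are illustrative; if the G suffix is not accepted for seqDiscardThreshold, specify the value in bytes):
 mmchconfig seqDiscardThreshold=16G
 mmchconfig pagepool=32G
 mmchconfig maxBufferDescs=32k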

sharedMemLimit 
The sharedMemLimit parameter allows you to increase the amount of memory available to store various GPFS structures, including the inode cache and tokens. When the value of sharedMemLimit is set to 0, GPFS automatically determines a value. The default value varies by platform; in GPFS 3.4 the default on Linux and Windows is 256MB. In GPFS 3.4 on Windows, sharedMemLimit can only be used to decrease the size of the shared segment. To determine whether increasing sharedMemLimit may help you can use the mmfsadm dump fs command. For example, if you run mmfsadm dump fs and see that you are not getting the desired levels of maxFilesToCache (aka fileCacheLimit) or maxStatCache (aka statCacheLimit) you can try increasing sharedMemLimit.
 # mmfsadm dump fs | head -8
 Filesystem dump:
   FSP 0x18051D75AB0
   UMALLOC limits:
     bufferDescLimit    40000 desired  40000
     fileCacheLimit      4000 desired   4000
     statCacheLimit      1000 desired   1000
     diskAddrBuffLimit    200 desired    200
The sharedMemLimit parameter is set in bytes. As of release 3.4 the largest sharedMemLimit on Windows is 256M. On Linux and AIX the largest setting is 256G on 64-bit architectures and 2047M on 32-bit architectures. Larger values may not work on some platforms/GPFS code versions. The actual sharedMemLimit on Linux may be reduced to a percentage of the kernel vmalloc space limit.

socketMaxListenConnections  
The socketMaxListenConnections parameter sets the number of TCP/IP sockets that the daemon can listen on in parallel. This tunable was introduced in 3.4.0.7 specifically for large clusters, where an incast message to a manager node from a large number of client nodes may require multiple listen() calls and time out. To be effective, the Linux tunable /proc/sys/net/core/somaxconn must also be modified from its default of 128; the effective value is the smaller of the GPFS tunable and the kernel tunable. Incoming connection requests may be silently dropped by the kernel networking component if the GPFS listen queue backlog is exceeded. When many nodes are TCP connecting to a node, a TCP connect may fail if the connection request is dropped too many times; at this point the GPFS node calling connect sends an expel request.
Parameter Values: Versions prior to 3.4.0.7 are fixed at 128. The default remains 128 on Linux and 1024 on AIX. The Linux kernel tunable also defaults to 128. The minimum and maximum values are 1 and 65536.
Tuning Guidelines: Set the value of socketMaxListenConnections greater than or equal to the number of nodes that will create a TCP connection to any one node. For clusters under 500 nodes, tuning this value should not be required. For larger clusters it should be set to approximately the number of nodes in the GPFS cluster.
Example:
 mmchconfig socketMaxListenConnections=1500
 echo 1500 > /proc/sys/net/core/somaxconn (or) sysctl -w net.core.somaxconn=1500
AIX: The no command (for example, no -p -o somaxconn=1500) must also be used to increase the value of somaxconn to a value greater than or equal to the value of socketMaxListenConnections.
Linux: The sysctl.conf file must be modified to increase the value of net.core.somaxconn to a value greater than or equal to socketMaxListenConnections.

socketRcvBufferSize  
The socketRcvBufferSize parameter sets the size of the TCP/IP receive buffer used for NSD data communication. This parameter is in bytes.

socketSndBufferSize
The socketSndBufferSize parameter sets the size of the TCP/IP send buffer used for NSD data communication. This parameter is in bytes.

tokenMemLimit
The tokenMemLimit parameter sets the size of the memory available for manager nodes to use for caching tokens. The default is to use one memory segment (different on each operating system). To allow nodes acting as token managers to cache more tokens, increase the value of tokenMemLimit. You only need to set this parameter on manager nodes that may be doing token management. This parameter is in bytes. (See maxFilesToCache)

maxBufferDescs
The value of maxBufferDescs defaults to 10 * maxFilesToCache, up to pagepool size / 16K. When caching small files it does not actually need to be more than a small multiple of maxFilesToCache, since only OpenFile objects (not stat cache objects) can cache data blocks. If an application needs to cache very large files you can tune maxBufferDescs to ensure there are enough descriptors to cache them. To see the current value use the mmfsadm command:
 # mmfsadm dump fs | head -8
 [statistics never reset]
 Filesystem dump:
   FSP 0x18051D75AB0
   UMALLOC limits:
     bufferDescLimit    40000 desired  40000
     fileCacheLimit      4000 desired   4000
     statCacheLimit      1000 desired   1000
     diskAddrBuffLimit    200 desired    200
The bufferDescLimit line shows how many buffer descriptors are configured. If you have a 1MiB file system blocksize and want to cache a 20GiB file, you need at least 20,480 buffer descriptors (20GiB / 1MiB = 20,480); if the configured limit is below that, the file cannot be fully cached. The mapping is not exactly one to one, so a value of 32k may be appropriate. mmchconfig maxBufferDescs=32k

maxFilesToCache  
The maxFilesToCache (MFTC) parameter controls how many files each node can cache. Each file cached requires memory for the inode and a token (lock). In addition to this parameter, the maxStatCache (MSC) parameter controls how many files are partially cached. In GPFS 3.5 and earlier the default value of maxStatCache is 4 * maxFilesToCache; in GPFS 4.1 it is the opposite, with maxFilesToCache defaulting to 4000 and maxStatCache to 1000. The Token Managers (TM) for a cluster have to keep token state for all nodes in the cluster and for nodes in remote clusters that mount the file systems. A Token Manager uses roughly 400 bytes of memory to manage one token for one node. The amount of memory available for caching tokens on each Token Manager node is controlled by the tokenMemLimit parameter, which defaults to one memory segment (this varies per operating system). In a large cluster, a change in the value of maxFilesToCache is greatly magnified: increasing maxFilesToCache from the default of 4000 by a factor of 2 in a cluster with 200 nodes increases the number of tokens a token manager needs to store by approximately 800,000. Therefore on large clusters it is recommended to only increase maxFilesToCache where needed. This is usually on a subset of nodes, for example login nodes where multiple users are concurrently doing directory listings. On these nodes you should increase the maxFilesToCache parameter to 60k to 100k. Nodes that may benefit from increasing maxFilesToCache include: login nodes, NFS/CIFS exporters, email servers and other file servers. For systems where applications use a large number of files, of any size, increasing the value of maxFilesToCache may prove beneficial. This is particularly true for systems where a large number of small files are accessed. The increased value should be large enough to handle the number of concurrently open files plus allow caching of recently used files. You can use mmpmon (see monitoring) to measure the number of files opened and closed on a GPFS file system. Changing the value of maxFilesToCache affects the amount of memory used on the node as well. The amount of memory required for inodes and control data structures can be calculated as: maxFilesToCache × 3.5 KB, where 3.5 KB = 3 KB + 512 bytes for an inode. If you have larger inodes, the size gets larger. Valid values of maxFilesToCache range from 1 to 100,000,000. In some rare cases there are additional consumers of this memory space, including byte-range locks, which means you may not always have the full segment of memory to use. If you need additional memory space you can increase the amount of memory for inode caching by increasing the value of sharedMemLimit. Note: prior to release 3.5 the default maxFilesToCache and maxStatCache were 1000 and 4000. As of release 3.5, the default values are 4000 and 1000. If you change the maxFilesToCache value but not the maxStatCache value, then maxStatCache defaults to 4 * maxFilesToCache. Tuning Guidelines: The increased value should be large enough to handle the number of concurrently open files plus allow caching of recently used files. Increasing maxFilesToCache can improve the performance of user interactive operations like running ls. Don't increase the value of maxFilesToCache on all nodes in a large cluster without ensuring you have sufficient token manager memory to support the possible number of outstanding tokens.
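A hedged worked example of the token memory math above: raising maxFilesToCache to 60,000 on 20 login nodes means the token managers need to hold roughly 20 x 60,000 = 1,200,000 tokens for those nodes, about 480MB of token manager memory at roughly 400 bytes per token. The setting can be applied to just those nodes (the node class name loginNodes is a placeholder), and changes of this kind typically take effect after GPFS is restarted on the affected nodes:
 mmchconfig maxFilesToCache=60000 -N loginNodes
 mmchconfig maxStatCache=60000 -N loginNodes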
maxMBpS
The maxMBpS option is an indicator of the maximum throughput in megabytes per second that can be submitted by GPFS into or out of a single node. It is not a hard limit; rather, the maxMBpS value is a hint GPFS uses to calculate how many prefetch/writebehind threads should be scheduled (up to the prefetchThreads setting) for sequential file access. In GPFS 3.3 the default maxMBpS value is 150, and in GPFS 3.5 it defaults to 2048. The maximum value is 100,000. The maxMBpS value should be adjusted to match the IO throughput the node is expected to support. For example, you should adjust maxMBpS for nodes that are directly attached to storage. A good rule of thumb is to set maxMBpS to twice the IO throughput required of a system. For example, if a system has two 4Gbit HBAs (400MB/sec per HBA), maxMBpS should be set to 1600 (a hedged command example follows the maxStatCache discussion below). If the maxMBpS value is set too low, sequential IO performance may be reduced. This setting is not used by NSD servers; it is only used on application nodes doing sequential access to files.

maxStatCache
The maxStatCache parameter sets aside pageable memory to cache attributes of files that are not currently in the regular file cache. This can be useful to improve the performance of stat() calls for applications with a working set that does not fit in the regular file cache. The memory occupied by the stat cache can be calculated as: maxStatCache × 176 bytes. Valid values of maxStatCache range from 0 to 10,000,000. For systems where applications test the existence of files, or the properties of files, without actually opening them (as backup applications do), increasing the value of maxStatCache may prove beneficial. The default value is 1,000. On systems where maxFilesToCache is greatly increased it is recommended that this value be manually set to something less than 4 * maxFilesToCache. For example, if you set maxFilesToCache to 30,000 you may want to set maxStatCache to 30,000 as well. On compute nodes this can usually be set much lower, since they typically only have a few active files in use for any one job. The way Linux handles inodes makes maxStatCache generally ineffective, so on Linux systems leave maxStatCache at the default of 1000 and modify maxFilesToCache as needed. Note: prior to release 3.5 the default maxFilesToCache and maxStatCache were 1000 and 4000. The size of the GPFS shared segment can limit the maximum setting of maxStatCache.
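For the maxMBpS rule of thumb described above, a hedged example for an application node with two 4Gbit HBAs (about 800MB/sec of usable bandwidth, so twice that is 1600); the node name is a placeholder:
 mmchconfig maxMBpS=1600 -N ioNode1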

 nfsPrefetchStrategy  
The nfsPrefetchStrategy parameter tells GPFS to optimize prefetching for NFS style file access patterns. It defines a window, in blocks, around the current position that is treated as "fuzzy sequential" access. This can improve performance when reading big files sequentially, because due to kernel scheduling some of the read requests reach GPFS out of order and therefore do not look strictly sequential. If the file system blocksize is small relative to the read request sizes, making this value bigger provides a bigger window of blocks. The default is 0. Tuning Guidelines: Setting nfsPrefetchStrategy to 1 can improve sequential read performance when large files are accessed using NFS and the file system blocksize is small relative to the NFS transfer block size.

nsdBufSpace  
The parameter nsdBufSpace specifies the percent of pagepool which can be utilized for NSD IO buffers. In GPFS 3.5, nsdBufSpace places an indirect maximum limit on the number of NSD threads at startup time, by limiting the available space for the buffers dedicated to NSD threads. In GPFS 3.4, nsdBufSpace was more of a dynamic limit as threads used buffers. In GPFS 3.5 nsdBufSpace is a limit imposed when the queues and threads are laid out at server startup time. 

 nsdInlineWriteMax  
The nsdInlineWriteMax parameter specifies the maximum transaction size that can be sent as embedded data in an NSD write RPC. In most cases the NSD write RPC exchange uses two steps: 1. An initial RPC from client to server requesting a write and describing it, so the server can prepare to receive it. 2. A GetData RPC back from the server to the client, requesting the data. For data smaller than nsdInlineWriteMax, GPFS sends the write data directly in the first RPC, avoiding step 2. It may be a good idea to increase this value when, for example, the configuration uses a 4k inode size or the workload consists of many small writes. The default value in GPFS 3.5 is 1KiB.
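A hedged example for a file system with 4k inodes or a small-write-heavy workload (the value is illustrative; if the k suffix is not accepted for this parameter, specify the value in bytes, 4096):
 mmchconfig nsdInlineWriteMax=4k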

nsdMaxWorkerThreads  
The nsdMaxWorkerThreads parameter sets the maximum number of NSD threads on an NSD server that will be concurrently transferring data with NSD clients. The maximum value is bounded by the constraint worker1Threads + prefetchThreads + nsdMaxWorkerThreads < 8192 on 64-bit architectures. The default is 64 (in 3.4) or 512 (in 3.5), with a minimum of 8 and a maximum of 8,192. This default works well in many clusters. In some cases it may help to increase nsdMaxWorkerThreads for large clusters. Scale this with the number of LUNs, not the number of clients. These threads are needed to manage flow control on the network between the clients and the servers.
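A hedged example for NSD servers that each serve a large number of LUNs (the value and node class name are placeholders, not recommendations):
 mmchconfig nsdMaxWorkerThreads=1024 -N nsdservers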

nsdMultiQueue  
The nsdMultiQueue parameter sets the maximum number of NSD server IO queues (small + large). The default is 256.

nsdSmallBufferSize
The nsdSmallBufferSize parameter specifies the largest IO request size that is considered "small" and thus placed in a "small" IO queue. IO requests larger than this value are sent to a large IO queue. The default value is 65536. This may need to be changed for different workloads: if, for example, the maxBlockSize is small (64k or so) it may help to set nsdSmallBufferSize lower (perhaps 16KB). In most cases the default works well.

nsdSmallThreadRatio (New in GPFS 3.5)
The nsdSmallThreadRatio parameter determines the ratio of NSD server queues for small IOs (by default less than 64KiB) to the number of NSD server queues that handle large IOs (greater than 64KiB). The default is to have more small queues than large queues. This may work well when there is a high number of small-file or metadata IO operations, though on clusters with a high percentage of large IO operations there are often not enough large queues and threads to keep the storage busy. In these cases you need to modify these parameters to provide more IO processing capability. See: NSD Server Tuning for more details.

nsdThreadMethod  
The nsdThreadMethod parameter controls the heuristic used to determine NSD queue allocations. In earlier versions of 3.5 this was set to zero; the related heuristic was not very effective, especially for clusters that had been upgraded from 3.4. An improved heuristic (selected by setting nsdThreadMethod=1) is the default in later versions of 3.5. The default value in later versions of GPFS 3.5 is 1; prior to that it was 0. This parameter should be set to 1 in GPFS 3.5.

nsdThreadsPerQueue (New in GPFS 3.5)   
The nsdThreadsPerQueue parameter determines the number of threads assigned to process each NSD server IO queue. This value is applied to both small IO and large IO queues (see nsdSmallThreadRatio for a discussion of IO queues). See: NSD Server Tuning for more details.

numaMemoryInterleave
On Linux, setting numaMemoryInterleave to yes starts mmfsd with numactl --interleave=all. Enabling this parameter may improve the performance of GPFS running on NUMA based systems, for example systems based on an Intel Nehalem processor. For this parameter to work you need to have the Linux numactl utility installed. (A hedged example follows the prefetchThreads section below.)

prefetchPct
prefetchPct defaults to 20% of pagepool. GPFS uses this as a guideline to limit how much pagepool space will be used for prefetch or writebehind buffers in the case of active sequential streams. The default works well for many applications. On the other hand, if the workload is mostly sequential (video serving/ingest) with very little caching of small files or random IO, then this number should be increased up to its 60% maximum, so that each stream can have more buffers cached for prefetch and write behind operations.

prefetchThreads
To see how many prefetchThreads are in use, use the mmfsadm command: mmfsadm dump fs | egrep "nPrefetchThreads:|total wait" Tuning Guidelines: You usually don't need prefetchThreads to be more than twice the number of LUNs available to the node (see ignorePrefetchLUNCount). Any more than that typically do nothing but wait in queues. The maximum value is bounded by the constraint worker1Threads + prefetchThreads + nsdMaxWorkerThreads < 8192 on 64-bit architectures.
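For the numaMemoryInterleave and prefetchPct settings described above, a hedged example for a Linux NUMA node dedicated to streaming IO (the values and node name are illustrative; numaMemoryInterleave takes effect when the GPFS daemon is restarted):
 rpm -q numactl   (verify the numactl utility is installed on RPM-based distributions)
 mmchconfig numaMemoryInterleave=yes -N streamNode1
 mmchconfig prefetchPct=40 -N streamNode1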

privateSubnetOverride 
The privateSubnetOverride parameter tells GPFS to allow the use of multiple networks, or communication between multiple clusters, when private IP addresses are in use. When using multiple networks in a GPFS cluster, the primary cluster IP address (the address displayed when running the mmlscluster command) should not be a private IP address. A private TCP/IP address is defined in RFC 1597 as: 10.0.0.0 - 10.255.255.255, 172.16.0.0 - 172.31.255.255, 192.168.0.0 - 192.168.255.255. By default you cannot use multiple TCP/IP interfaces in a cluster, or mount a file system across clusters, if the daemon node name is a private IP address. If you need to use private IP addresses with multiple interfaces, or when using multi-cluster, you can tell GPFS to allow a mount to another private subnet by setting the privateSubnetOverride parameter. Setting privateSubnetOverride to 1 instructs GPFS to allow the use of multiple private subnets. The default for privateSubnetOverride is 0.

verbsLibName
To initialize IB RDMA, GPFS looks for a file called libverbs.so. If that file name is different on your system, libverbs.so.1.0 for example, you can change this parameter to match. Example: mmchconfig verbsLibName=libverbs.so.1.0

verbsRdmaQpRtrSl
Use verbsRdmaQpRtrSl to set the InfiniBand quality of service level for GPFS communication. This value needs to match the quality of service level defined for GPFS in your InfiniBand subnet manager. Example: If you define a service level of 2 for GPFS in the InfiniBand subnet manager, set verbsRdmaQpRtrSl to 2. mmchconfig verbsRdmaQpRtrSl=2

verbsrdmasperconnection
This is the maximum number of RDMAs that can be outstanding on any single RDMA connection. The default value is 8. Tuning Guidelines: In testing, the default was more than enough on SDR. All performance testing of these parameters was done on OFED 1.1 IB SDR.

verbsrdmaspernode
This is the maximum number of RDMAs that can be outstanding from the node. The default value is 0 (which means the built-in default of 32). Tuning Guidelines: In testing, the default was more than enough to keep adapters busy on SDR. All performance testing of these parameters was done on OFED 1.1 IB SDR.

worker1Threads  
The worker1Threads parameter represents the total number of concurrent application requests that can be processed at one time. This may include metadata operations like file stat() requests, open or close, as well as data operations. The worker1Threads parameter can be reduced without having to restart the GPFS daemon; increasing the value of worker1Threads requires a restart of the GPFS daemon. To determine whether you have a sufficient number of worker1Threads configured you can use the mmfsadm dump mb command:
 # mmfsadm dump mb | grep Worker1
 Worker1Threads: max 48 current limit 48 in use 0 waiting 0
 PageDecl: max 131072 in use 0
Using the mmfsadm command you can see how many threads are "in use" and how many application requests are "waiting" for a worker1 thread. Tuning Guidelines: The default is good for most workloads. You may want to increase worker1Threads if your application uses many threads and does Asynchronous IO (AIO) or Direct IO (DIO); in these cases the worker1Threads are doing the IO operations. A good place to start is to have worker1Threads set to approximately 2 times the number of LUNs in the file system so GPFS can keep the disks busy with parallel requests. The maximum value is bounded by the constraint worker1Threads + prefetchThreads + nsdMaxWorkerThreads < 8192 on 64-bit architectures. Do not use excessive values of worker1Threads, since that may cause contention on common mutexes and locks.
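A hedged example for an AIO/DIO database node whose file system spans 48 LUNs (2 x 48 = 96; the value and node name are illustrative, and an increase takes effect only after GPFS is restarted on the node):
 mmchconfig worker1Threads=96 -N dbNode1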

worker3Threads  
The worker3Threads parameter specifies the number of threads to use for inode prefetch. A value of zero disables inode prefetch. The default is 8. Tuning Guidelines: The default is good for most workloads.

writebehindThreshold
The writebehindThreshold parameter determines at what point GPFS starts flushing newly written data out of the pagepool for a file being sequentially written. Until the file size reaches this threshold, no writebehind is started as full blocks are filled. Increasing this value defers writebehind for new larger files. This can be useful, for example, if your workload contains temp files that are smaller than writebehindThreshold and are deleted before they are flushed from cache. The default is 512k (524288 bytes). If the value is too large, there may be too many dirty buffers that the sync thread has to flush at the next sync interval, causing a surge in disk IO; keeping it small ensures a smooth flow of dirty data to disk. Tuning Guidelines: The default is good for most workloads. Increase this value if you have a workload where not flushing newly written files larger than 512k would be beneficial.
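A hedged example for a workload whose temporary files are up to about 8MB and are usually deleted before they would otherwise be flushed (the value is illustrative; if the M suffix is not accepted for this parameter, specify the value in bytes, 8388608):
 mmchconfig writebehindThreshold=8M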

ignorePrefetchLUNCount

NOTE: This does not apply to an NSD server doing IO on behalf of other nodes. It also does not affect random access to files or files smaller than a full block. On a client node, GPFS calculates how many sequential access prefetch/writebehind threads to run concurrently for each file system by using the count of the number of LUNs in the file system and the maxMBpS setting. However, if the LUNs being used are really composed of many physical disks, this calculation can underestimate how much IO can be done concurrently. For example, GNR (GSS), XIV, or SVC disk subsystems may logically stripe a LUN across hundreds of disks. (As of 3.4.0.21 or 3.5.0.10) Setting ignorePrefetchLUNCount=yes ignores the LUN count and uses only the maxMBpS setting to dynamically determine how many threads to schedule, up to the prefetchThreads setting. Prefetching may become much more aggressive because it depends on the maxMBpS setting and the actual IO times of the last 16 full block IOs for each file system. Under heavy loads the IO times will increase due to queuing on the disks or NSD servers, resulting in GPFS doing more prefetching to try to attain maxMBpS. So set maxMBpS to a reasonable expectation of how much IO bandwidth a client node can get, either to directly attached disks or over the network to NSD servers. prefetchThreads should then be set as a cap on the number of concurrent prefetch/writebehind threads when the maxMBpS calculation tries too hard. Tuning Guidelines: The default (no) is good for traditional LUNs where one LUN maps to a single disk or an n+mP array. Use "yes" when the LUNs presented to GPFS are made up of a large number of physical disks.
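A hedged example for client nodes whose LUNs come from a wide-striped subsystem such as XIV or SVC (the maxMBpS value and node class name are placeholders and should reflect the bandwidth the clients can realistically get):
 mmchconfig ignorePrefetchLUNCount=yes -N clientNodes
 mmchconfig maxMBpS=2000 -N clientNodes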