Introduction:
Currently, it is observed that OpenSSH does not handle the krb5_kuserok call properly.
Impacted:
- OpenSSH <= 6.1p1
Details:
The krb5_kuserok(authctxt->krb5_ctx, authctxt->krb5_user, client) call verifies if principal "authctxt->krb5_user" is allowed to login as local user "client".
However, if AUTH_DOMAIN is set, "client" will be of the form USER@REALM, which breaks the call. As a result, the last parameter should always be the local user name (authctxt->pw->pw_name), as it was before the change for Bugzilla item # 1583.
Resolution:
- Bugzilla item # 2032 was created to address this issue. It contains a patch to the source which solves the problem.
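To check whether a given host runs an affected release, the OpenSSH version string can be queried (shown for illustration only):
# ssh -V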
Saturday, August 18, 2012
Saturday, October 2, 2010
Kerberos : Authentication flow
When initially logging on to a network, users must negotiate access by providing a log-in name and password in order to be verified by the Authentication Server (AS) portion of a Key Distribution Center (KDC) within their domain. Once successfully authenticated, the user is granted a Ticket Granting Ticket (TGT) that is valid for the local domain. The TGT has a default lifetime of 10 hours and may be renewed throughout the user's session without requiring the user to re-enter his password. The TGT is cached on the local machine and is used to request sessions with services throughout the network. The following example explains the TGT/TGS retrieval process (assume user JOHNDOE belongs to Kerberos realm MYREALM1.COM):
- User JOHNDOE enters his password on a client machine.
- The client performs a one-way hash function on the entered password, which becomes the secret key of JOHNDOE.
- The client sends a message (AS-REQ) to the AS portion of the KDC, requesting a TGT on behalf of user JOHNDOE. The message contains the following:
- (Cleartext) Client Principal Name : JOHNDOE
- (Cleartext) Realm : MYREALM1.COM
- (Cleartext) Server Principal name : krbtgt/MYREALM1.COM
- (Encrypted) PA-ENC-TIMESTAMP -> A timestamp encrypted with the secret key obtained from (2). The purpose of preauthentication is to force JOHNDOE to prove that he really is JOHNDOE (as he is the only one who knows the password) and to prevent a replay attack. Without preauthentication, the KDC would send the AS-REP message to anyone claiming to be JOHNDOE. As a result, an offline brute force attack on JOHNDOE's password could be carried out. Preauthentication alleviates this.
- When receiving the AS-REQ message from the client, the AS will first check whether client principal JOHNDOE is known in the KDC database. If not, the AS will return the error KRB5KDC_ERR_C_PRINCIPAL_UNKNOWN in the AS-REP message. If JOHNDOE exists, the AS will look up his secret key in the KDC database (derived with the same one-way hash function as in (2)) and use it to decrypt the PA-ENC-TIMESTAMP from the AS-REQ. If it can be decrypted successfully, the AS will send an AS-REP message back to the client, containing the following:
- (Encrypted) (A) A session key for client/TGS communication, encrypted with the secret key of JOHNDOE.
- (Encrypted) (B) A TGT which includes the client ID (JOHNDOE), the client network address, the ticket validity period and the client/TGS session key. The TGT is encrypted using the secret key of the TGS.
- Once the client receives the AS-REP message, it will decrypt message A to obtain the client/TGS session key. This session key is used for further communication with the TGS. The client also caches the TGT, which it cannot decrypt as it is encrypted with the secret key of the TGS.
- When requesting service myservice/abc.realm1.com@MYREALM1.COM, the client sends the following TGS-REQ message to the TGS:
- (Cleartext) Server Principal name : myservice/abc.realm1.com@MYREALM1.COM
- (Encrypted) The cached TGT
- (Encrypted) An authenticator, which is composed of the client ID (JOHNDOE) and a timestamp, encrypted using the client/TGS session key.
- When receiving the TGS-REQ message from the client, the TGS will decrypt the TGT using its secret key and will extract the client/TGS session key from it. Subsequently, the TGS will decrypt the authenticator. If the decryption is successful, it proves JOHNDOE's identity to the TGS. The TGS will send a TGS-REP message to the client containing the following:
- (Encrypted) (C) A client/server ticket which includes the client ID (JOHNDOE), client network address, ticket validity period and the client/server session key. The client/server ticket is encrypted using the service's secret key.
- (Encrypted) (D) A client/server session key encrypted with the client/TGS session key.
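For reference, the flow above can be observed from a client machine with the standard MIT Kerberos tools (a minimal sketch, assuming kinit, klist and kvno are available and krb5.conf points to a KDC for MYREALM1.COM):
# kinit JOHNDOE@MYREALM1.COM
(AS-REQ/AS-REP: prompts for the password and caches the TGT)
# klist
(shows the cached krbtgt/MYREALM1.COM@MYREALM1.COM ticket)
# kvno myservice/abc.realm1.com@MYREALM1.COM
(TGS-REQ/TGS-REP: requests a service ticket, which klist will now also show)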
Friday, August 6, 2010
AIX : LVM mirror pools
LVM mirror pools in AIX 6 make it possible to divide the physical volumes of a scalable volume group into separate pools. When creating a logical volume, each copy of the logical volume being created can be assigned to a mirror pool. Logical volume copies that are assigned to a mirror pool will only allocate partitions from the physical volumes in that mirror pool. This provides the ability to restrict the disks that a logical volume copy can use. Without mirror pools, the only way to restrict which physical volume is used for allocation when creating or extending a logical volume is to use a map file. Thus, using mirror pools greatly simplifies this process.
Assume a scalable volume group 'myvg' consisting of the following:
- physical volumes hdisk2 and hdisk3, which originate from a storage box in computer room ABC.
- physical volumes hdisk4 and hdisk5, which originate from a storage box in computer room DEF.
- logical volume 'mylv', which uses a parallel write with sequential read scheduling policy. This means that all mirrors are written in parallel, but reads are always served from the first mirror if it is available. This setting can improve sequential data access on the backend storage box significantly. With this scheduling policy, there is a primary copy and one (or possibly two) secondary copies of each physical partition.
- log logical volume loglv01.
Converting a scalable volume group / logical volumes to use mirror pools can be done as follows:
- Attach a mirror pool to the hdisks.
# chpv -p ABC hdisk2
# chpv -p ABC hdisk3
# chpv -p DEF hdisk4
# chpv -p DEF hdisk5
- Next, add the respective mirror pools to each existing logical volume in the volume group.
# chlv -m copy1=ABC -m copy2=DEF mylv
# chlv -m copy1=ABC -m copy2=DEF loglv01
- Set the mirror pool strictness to 'super'. This means that each newly created logical volume in the volume group must be assigned to a mirror pool and that each mirror pool must contain at least one copy of each logical volume in the volume group.
# chvg -M s myvg
- Check the status of the mirror pools
# lsvg -m myvg
Logical Volume Copy 1 Copy 2 Copy 3
loglv01 ABC DEF None
mylv ABC DEF None
# lsmp myvg
VOLUME GROUP: myvg Mirror Pool Super Strict: yes
MIRROR POOL: ABC Mirroring Mode: SYNC
MIRROR POOL: DEF Mirroring Mode: SYNC
Now when increasing the size of the file system built on top of mylv, there is no need to worry about the remote mirroring anymore.
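From this point on, logical volume creation and file system growth automatically respect the mirror pools. A minimal sketch (the new logical volume name, size and file system paths are made up for illustration; the mklv -p copyn=mirrorpool flag is assumed to be available at this AIX level):
# mklv -t jfs2 -c 2 -p copy1=ABC -p copy2=DEF -y newlv myvg 32
(creates a new mirrored LV with one copy in each pool, as required by super strictness)
# chfs -a size=+1G /myfs
(grows the file system on top of mylv; each copy is allocated from its own pool)
# lsvg -m myvg
(verifies that all copies are still mapped to pools ABC and DEF)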
Saturday, July 31, 2010
AIX : Cross realm Kerberos TGT verification does not work
Introduction
Currently, the AIX Kerberos load module KRB5 cannot handle cross realm Ticket Granting Ticket verification.
Details
Assume the following setup:
- An Active Directory domain ABC.COM which contains all user principals.
- A Kerberos/NAS realm XYZ.COM which contains all service principals, including host service principals.
- Active Directory user foo which is mapped to AIX user foo using the following generic mapping rule in krb5.conf
auth_to_local = RULE:[1:$1@$0](.*@ABC.COM)s/@.*//
- The default Kerberos realm in krb5.conf is XYZ.COM
- The AIX user foo has the following properties:
- SYSTEM=KRB5, which identifies that the Kerberos load module KRB5 needs to be used for authentication
- auth_domain=ABC.COM, which identifies that an Active Directory KDC in domain ABC.COM should be contacted for authentication. This information is present in the [realms] section of krb5.conf.
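For reference, these attributes could be set and verified with the standard AIX user management commands (a sketch; depending on the environment, the registry attribute may need to be adjusted as well):
# chuser SYSTEM=KRB5 auth_domain=ABC.COM foo
# lsuser -a SYSTEM auth_domain registry foo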
By default, the following Kerberos authentication flow is followed when Active Directory user foo wishes to connect to AIX host bar.xyz.com:
- The KRB5 load module on bar.xyz.com sends an AS-REQ message to a KDC in Active Directory domain ABC.COM. This is regulated by the auth_domain user attribute of the local AIX user foo to which foo@ABC.COM is mapped using the auth_to_local rule. If auth_domain isn't set, the default realm in krb5.conf is used.
- The Active Directory KDC returns an AS-REP message to bar.xyz.com, containing the TGT and the TGS session key.
- Now bar.xyz.com needs to verify that the TGT is genuine by actually using it to see if it can access services locally with it. The idea behind this is that only the "real" KDC would be able to produce a valid service ticket, encrypted with the service's secret key. In order to perform this TGT verification, a keytab must be installed on bar.xyz.com which contains the host/bar.xyz.com@XYZ.COM host service principal. This keytab contains the secret key of the host service principal, known only to the TGS and the service itself.
Apparently, the KRB5 load module sends the TGT verification TGS-REQ message to the Active Directory KDC in ABC.COM instead of to the NAS KDC in XYZ.COM, which contains the service principal. Since the Active Directory KDC does not know the host/bar.xyz.com@ABC.COM host service principal, it will return KRB5KDC_ERR_S_PRINCIPAL_UNKNOWN, resulting in user authentication failure. The problem lies in the fact that the KRB5 Kerberos load module was never designed to handle cross realm authentication properly. Therefore, the TGT verification code does not use the krb5_verify_init_creds() call.
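Whether the required host keytab is in place on bar.xyz.com can be checked with the Kerberos client tools (a sketch; /usr/krb5/bin and /etc/krb5/krb5.keytab are the usual IBM NAS locations and may differ):
# /usr/krb5/bin/klist -k /etc/krb5/krb5.keytab
(the output should list an entry for host/bar.xyz.com@XYZ.COM)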
Solution
- Disable TGT verification in /usr/lib/security/methods.cfg
KRB5:
program = /usr/lib/security/KRB5
program_64 = /usr/lib/security/KRB5_64
options = authonly,tgt_verify=no
- A DCR was created and accepted for this issue, which should be fixed in AIX 6.1 TL6 and AIX 7.1.
DCR MR082009165 : TGT verification for user defined on a KDC does not work when host principals are defined on a different KDC in a cross realm setup.
APAR IZ78593 : ENABLE CROSS REALM TGT VERIFICATION IN KRB5A
Note: Starting AIX 6.1 TL2, the KRB5A and KRB5 load modules have been merged. APAR IZ26224 addresses this.
Thursday, July 22, 2010
VIO : Change default value of hcheck_interval
Currently, the default value of hcheck_interval for VSCSI hdisks is set to 0, meaning that health checking is disabled. The hcheck_interval attribute of an hdisk can only be changed online if the volume group to which the hdisk belongs is not active. If the volume group is active, the ODM value of hcheck_interval can be altered in the CuAt class, as shown in the following example:
chdev -l hdisk0 -a hcheck_interval=60 -P
The change will then be applied once the system is rebooted. However, it is possible to change the default value of the hcheck_interval attribute in the PdAt ODM class. As a result, you won't have to worry about its value anymore and newly discovered hdisks will automatically get the new default value, as illustrated in the example below:
odmget -q 'attribute = hcheck_interval AND uniquetype = "PCM/friend/vscsi"' PdAt | sed 's/deflt = "0"/deflt = "60"/' | odmchange -o PdAt -q 'attribute = hcheck_interval AND uniquetype = "PCM/friend/vscsi"'
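The result can be verified as follows (a sketch; hdisk0 is just an example device):
# lsattr -D -l hdisk0 -a hcheck_interval
(shows the default value, now taken from the modified PdAt entry)
# lsattr -El hdisk0 -a hcheck_interval
(shows the currently effective value for the device)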
Wednesday, July 21, 2010
AIX : Examining ulimits of running process
Currently, AIX does not allow examining the ulimits of a running process through the /proc file system. In Linux, for example, this is possible, as shown in the following example:
# cat /proc/4121/limits
Limit | Soft Limit | Hard Limit | Units |
Max cpu time | unlimited | unlimited | ms |
Max file size | unlimited | unlimited | bytes |
Max data size | unlimited | unlimited | bytes |
Max stack size | 8388608 | unlimited | bytes |
Max core file size | unlimited | unlimited | bytes |
Max resident set | unlimited | unlimited | bytes |
. . . |
The only alternative currently available in AIX is to use dbx and attach to the running process, as shown in the following example:
# dbx -a 700464
(dbx) proc rlimit
rlimit name | rlimit_cur | rlimit_max | (units) |
RLIMIT_CPU: | (unlimited) | (unlimited) | sec |
RLIMIT_FSIZE: | (unlimited) | (unlimited) | bytes |
RLIMIT_DATA: | 67108864 | 100663296 | bytes |
RLIMIT_STACK: | 33554432 | 4294967296 | bytes |
RLIMIT_CORE: | (unlimited) | (unlimited) | bytes |
RLIMIT_RSS: | (unlimited) | (unlimited) | bytes |
. . . |
(dbx) quit
The only major drawback of this alternative is that dbx interrupts the process by sending a SIGTRAP signal. After the dbx session finishes, the process terminates. A Design Change Request was made to address this.
Thursday, December 31, 2009
GPFS : Tuning recommendations
A few words on important GPFS tunables.
Pagepool & SeqDiscardThreshold
GPFS does not use the regular file buffer cache of the operating system (e.g. non-computational memory in AIX) but uses its own caching mechanism. GPFS maintains its file buffer cache, called the pagepool, in pinned computational memory; it is used to cache user file data and file system metadata. The default pagepool size is 64 MB, which is too small for many applications. Applications that re-use files a lot and perform sequential reads benefit from the pagepool. Non-DIO writes also go through the pagepool. For sequential writes to the pagepool, write-behind improves overall performance. For random I/O, GPFS cannot use read-ahead or write-behind techniques and instead relies on striping for improved performance.
A parameter that affects how data is cached in the pagepool is SeqDiscardThreshold. Its default value is 1 MB, which means that if a file larger than 1 MB is read sequentially, GPFS will not keep its data in the pagepool. For applications in which large files are often re-read by multiple processes, setting this tunable to a higher value can improve performance.
It should also be noted that NSD servers do not cache anything for their NSD clients. If both NSD client A and NSD client B request the same file from an NSD server, the NSD server will fetch the data from disk twice. As a result, increasing the pagepool on an NSD server has no effect.
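For illustration, the pagepool can be inspected and changed with the standard GPFS administration commands (a sketch; the 1 GB value is an arbitrary example, not a recommendation):
# mmlsconfig pagepool
# mmchconfig pagepool=1G -i
(-i makes the change take effect immediately and persist across restarts)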
Block size
This is one of the most important decisions when designing a GPFS file system. After creating the file system with a specific block size, there is no way back other than recreating the file system with the new block size. Choosing the optimal GPFS block size is not a straightforward exercise, since it depends on several other factors:
- Physical disk block size
- LUN segment size, which is the maximum amount of data that is written or read from a disk per operation before the next disk in the array is used
- Application block size (e.g. the DB block size for an RDBMS application)
For example, with RAID 5 (4+1) and a 128 KB LUN segment size, the LUN stripe size is 512 KB.
In that case, a GPFS block size of 512 KB (or a multiple thereof) would be a good choice.
A GPFS block size of 256 KB will almost certainly lead to reduced performance, because the disk subsystem would have to read the remaining 256 KB of the 512 KB stripe in order to calculate parity for a write operation. Summarized, the operations on the disk subsystem for 256 KB and 512 KB block sizes would look like this:
- GPFS write (256 KB) = Write LUN Segment #1 (128 KB) + Write LUN Segment #2 (128 KB) + Read LUN Segment #3 (128 KB) + Read LUN Segment #4 (128 KB) + Calculate Parity + Write LUN segment #5 (128 KB)
- GPFS write (512 KB) = Write LUN Segment #1 (128 KB) + Write LUN Segment #2 (128 KB) + Write LUN Segment #3 (128 KB) + Write LUN Segment #4 (128 KB) + Calculate Parity + Write LUN segment #5 (128 KB)
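The block size is fixed at file system creation time, for example (a sketch; the mount point, device and disk descriptor file are hypothetical, and the exact mmcrfs syntax varies between GPFS releases):
# mmcrfs /gpfs/fs1 fs1 -F disks.lst -B 512K -A yes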
Split data / metadata
Splitting data and metadata is one of the most underestimated design questions.
The actual division can be changed online, so unlike changing the GPFS file system block size, there is no downtime involved. If metadata (inodes + indirect blocks) cannot be accessed fast enough, overall performance will degrade severely. Metadata access can be compared to a seek operation on a normal hard drive. Generally, it is a good idea to do the following:
- Metadata (RAID 1 + Enable read/write cache on the disk subsystem)
Try to keep as much metadata as possible in the cache of the disk subsystem so that every node that looks for it will find it in the cache.
Write operations on metadata are generally random and small. Moreover, these write operations should be as fast as possible.
As a result, using the write cache for metadata will be very beneficial.
Finally, since metadata is read more than it is written, RAID 1 is a better (though more costly) backend for metadata than RAID 5.
- Data (RAID 5 + Disable read/write cache on the disk subsystem)
Try to protect metadata in the cache as much as possible by disabling caching on data.
As a result, nodes that are reading a lot of data don't thrash the cache, which can be used efficiently for nodes that need access to metadata instead. Sometimes it can even be beneficial to dedicate a disk controller to metadata LUNs (caching enabled on that controller).
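Whether an NSD carries data, metadata or both is controlled by its disk usage, for example in the NSD stanzas passed to mmcrnsd/mmcrfs (a sketch in the stanza format of more recent GPFS releases; device and NSD names are made up):
%nsd: device=/dev/hdisk10 nsd=nsd_meta01 usage=metadataOnly failureGroup=1
%nsd: device=/dev/hdisk11 nsd=nsd_data01 usage=dataOnly failureGroup=2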
As already stated, the pagepool is GPFS's file buffer cache in pinned computational memory, which caches user file data and file system metadata. On the other hand, GPFS uses regular computational memory to maintain its inode and stat cache (user file metadata). The inode cache (controlled by the maxFilesToCache tunable, default 1000) contains copies of inodes for open files and for some recently used files that are no longer open. Storing a file's inode in cache permits faster re-access to that file. The stat cache (controlled by the maxStatCache tunable, default 4 * maxFilesToCache) contains enough information to open the file and satisfy a stat() call. It is intended to help functions such as ls -l, du, and certain backup programs that scan entire directories looking for modification times and file sizes. However, a stat cache entry does not contain enough information to read from or write to the file, since it does not contain the indirect block references (unlike a regular inode). A stat cache entry consumes significantly less memory than a full inode.
It is possible that the number of currently open files is larger than the size of the inode cache. In that case, if a node wishes to read a file whose inode is not in the inode cache, the inode needs to be retrieved from disk first. Therefore, it is very important that metadata can be accessed as fast as possible (see above).
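Both caches are sized with mmchconfig (a sketch; the values are arbitrary examples, and the change only takes effect after the GPFS daemon is restarted on the affected nodes):
# mmchconfig maxFilesToCache=4000,maxStatCache=16000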
prefetchThreads & worker1Threads & maxMBpS
The prefetchThreads tunable (default 72) controls the maximum possible number of threads dedicated to prefetching data for files that are read sequentially, or to handling sequential write-behind. On the other hand, the worker1Threads tunable (default 48) controls the maximum number of concurrent file operations at any one time. If there are more requests than worker1Threads, the excess will wait until a previous request has finished. Its primary use is for random read or write requests that cannot be prefetched, random I/O requests, or small file activity. The maximum value of prefetchThreads plus worker1Threads is 550 (64-bit kernels) or 164 (32-bit kernels). These values sometimes need tuning, e.g. in an Oracle RAC environment. Oracle does not need many prefetchThreads, since Oracle does its own prefetching and does not use the GPFS pagepool (Oracle uses DIO to access files on a GPFS file system). However, Oracle does need a high number of worker1Threads to allow as many Oracle AIO threads as possible to work in parallel.
The maxMBpS tunable (default 150) is used for estimating the amount of I/O triggered by sequential read-ahead and write-behind. Setting this value higher than the default yields more parallelism if there are many LUNs. By lowering this value, the load on the disk subsystem can be limited artificially. Setting it too high usually does not cause problems, because other factors are limiting, such as the size of the pagepool and the number of prefetch threads.
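These tunables are also set with mmchconfig, for example for a database-style workload as described above (a sketch; the values are illustrative only, and prefetchThreads/worker1Threads changes require a restart of the GPFS daemon):
# mmchconfig prefetchThreads=72,worker1Threads=128,maxMBpS=400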