tag:blogger.com,1999:blog-13612123884728746852024-03-06T06:32:24.153+01:00Miguel's SpotMiguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.comBlogger19125tag:blogger.com,1999:blog-1361212388472874685.post-24163741593036153082012-08-18T12:35:00.002+02:002012-08-18T12:35:57.256+02:00OpenSSH : Local user name in krb5_kuserok call <em><strong>Introduction:</strong></em><br />
Currently, OpenSSH does not handle the krb5_kuserok call properly.<br />
<br />
<em><strong>Impacted:</strong></em><br />
- OpenSSH <= 6.1p1<br />
<br />
<em><strong>Details:</strong></em><br />
The krb5_kuserok(authctxt->krb5_ctx, authctxt->krb5_user, client) call verifies if principal "authctxt->krb5_user" is allowed to log in as local user "client".
However, if AUTH_DOMAIN is set, "client" will be of the form USER@REALM, which breaks the call. Instead, the last parameter should always be the local user name (authctxt->pw->pw_name), as it was before the implementation of <a href="https://bugzilla.mindrot.org/show_bug.cgi?id=1583" target="_blank">Bugzilla item # 1583</a>. <br /><br />
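The mismatch can be illustrated with a small Python model. This is a toy sketch, not the actual OpenSSH or libkrb5 code: the kuserok() stand-in and the .k5login-style table below are invented for illustration, while the real krb5_kuserok() consults the local account's authorization data.

```python
# Toy model of the krb5_kuserok() contract (illustrative only, not the
# real libkrb5 implementation): the third argument must be a plain local
# account name, never a USER@REALM string.

def kuserok(principal, luser, k5login):
    """Return True if 'principal' may log in as local user 'luser'."""
    if "@" in luser:
        # "USER@REALM" can never match a local account name, so the
        # authorization check silently fails -- this is the reported bug.
        return False
    return principal in k5login.get(luser, [])

# Hypothetical authorization table, mimicking ~johndoe/.k5login.
k5login = {"johndoe": ["johndoe@MYREALM1.COM"]}

# Broken call: with AUTH_DOMAIN set, "client" carries the realm suffix.
assert not kuserok("johndoe@MYREALM1.COM", "johndoe@MYREALM1.COM", k5login)

# Fixed call: pass the local user name (authctxt->pw->pw_name) instead.
assert kuserok("johndoe@MYREALM1.COM", "johndoe", k5login)
```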
<em><strong>Resolution:</strong></em><br />
- <a href="https://bugzilla.mindrot.org/show_bug.cgi?id=2032" target="_blank">Bugzilla item # 2032</a> was created to address this issue. The item contains a patch to the source which solves the issue.Miguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.com0tag:blogger.com,1999:blog-1361212388472874685.post-21826220014378022892010-10-02T11:09:00.013+02:002010-10-02T12:44:10.711+02:00Kerberos : Authentication flowWhen initially logging on to a network, users must negotiate access by providing a log-in name and password in order to be verified by the <span style="font-weight:bold;">Authentication Server</span> (AS) portion of a <span style="font-weight:bold;">Key Distribution Center</span> (KDC) within their domain. Once successfully authenticated, the user is granted a <span style="font-weight:bold;">Ticket Granting Ticket</span> (TGT) that is valid for the local domain. The TGT has a default lifetime of 10 hours and may be renewed throughout the user's session without requiring the user to re-enter his password. The TGT is cached on the local machine and is used to request sessions with services throughout the network. The following example explains the TGT/TGS retrieval process (assume user JOHNDOE belongs to Kerberos realm MYREALM1.COM):<br /><ul><li>User JOHNDOE enters his password on a client machine.</li><li>The client performs a one-way hash function on the entered password, which becomes the secret key of JOHNDOE.</li><li>The client sends a message (AS-REQ) to the AS portion of the KDC, requesting a TGT on behalf of user JOHNDOE. The message contains the following:<br /> - (Cleartext) Client Principal Name : JOHNDOE<br /> - (Cleartext) Realm : MYREALM1.COM<br /> - (Cleartext) Server Principal name : krbtgt/MYREALM1.COM<br /> - (Encrypted) PA-ENC-TIMESTAMP -> A time stamp encrypted with the secret key obtained from (2). 
The purpose of preauthentication is to force JOHNDOE to prove that he really is JOHNDOE (as he is the only one who knows the password) and to prevent a replay attack. Without preauthentication, the KDC will send the AS-REP message to anyone claiming to be JOHNDOE. As a result, an offline brute-force attack on JOHNDOE's password would be possible. Preauthentication alleviates this.</li><li>When receiving the AS-REQ message from the client, the AS will first check if client principal JOHNDOE is known in the KDC database. If not, the AS will return the error KRB5KDC_ERR_C_PRINCIPAL_UNKNOWN in the AS-REP message. If JOHNDOE exists, the AS will look up his password in the KDC database and perform the same one-way hash function as in (2). This secret key will be used to decrypt the PA-ENC-TIMESTAMP from the AS-REQ. If it can be decrypted successfully, the AS will send an AS-REP message back to the client, containing the following:<br /> - (Encrypted) (A) A session key for client/TGS communication, encrypted with the secret key of JOHNDOE.<br /> - (Encrypted) (B) A TGT which includes the client ID (JOHNDOE), client network address, ticket validity period and the client/TGS session key. The TGT is encrypted using the secret key of the TGS.</li><li>Once the client receives the AS-REP message, it will decrypt message A to obtain the client/TGS session key. This session key is used for further communication with the TGS. 
The client also caches the TGT, which it cannot decrypt as it is encrypted with the secret key of the TGS.</li><li>When requesting service myservice/abc.realm1.com@REALM1.COM, the client sends the following TGS-REQ message to the TGS:<br /> - (Cleartext) Server Principal name : myservice/abc.realm1.com@REALM1.COM<br /> - (Encrypted) The cached TGT<br /> - (Encrypted) An authenticator, which is composed of the client ID (JOHNDOE) and a timestamp, encrypted using the client/TGS session key.</li><li>When receiving the TGS-REQ message from the client, the TGS will decrypt the TGT using its secret key and will extract the client/TGS session key from it. Subsequently, the TGS will decrypt the authenticator. If the decryption is successful, it proves JOHNDOE's identity to the TGS. The TGS will send a TGS-REP message to the client containing the following:<br /> - (Encrypted) (C) A client/server ticket which includes the client ID (JOHNDOE), client network address, ticket validity period and the client/server session key. The client/server ticket is encrypted using the service's secret key.<br /> - (Encrypted) (D) A client/server session key encrypted with the client/TGS session key.</li><br /></ul>Miguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.com0tag:blogger.com,1999:blog-1361212388472874685.post-4611185700243594512010-08-06T14:31:00.014+02:002010-08-06T22:23:46.201+02:00AIX : LVM mirror poolsLVM mirror pools in AIX 6 make it possible to divide the physical volumes of a scalable volume group into separate pools. When creating a logical volume, each copy of the logical volume being created can be assigned to a mirror pool. Logical volume copies that are assigned to a mirror pool will only allocate partitions from the physical volumes in that mirror pool. This provides the ability to restrict the disks that a logical volume copy can use. 
Without mirror pools, the only way to restrict which physical volume is used for allocation when creating or extending a logical volume is to use a map file. Thus, using mirror pools greatly simplifies this process.<br /><br />Assume a scalable volume group 'myvg' consisting of the following:<br />- physical volumes hdisk2 and hdisk3, which originate from a storage box in computer room ABC.<br />- physical volumes hdisk4 and hdisk5, which originate from a storage box in computer room DEF.<br />- logical volume 'mylv', which uses a parallel write with sequential read scheduling policy. This means that all mirrors are written in parallel but reads are always served from the first mirror if it is available. This setting can improve sequential data access on the backend storage box significantly. Using this scheduling policy, there is a primary and one (or possibly two) secondary copy (copies) of a physical partition.<br />- log logical volume loglv01.<br /><br />Converting a scalable volume group / logical volumes to use mirror pools can be done as follows:<br />- Attach a mirror pool to the hdisks. <br /><span style="font-weight: bold;font-size:85%;" ># chpv -p ABC hdisk2<br /># chpv -p ABC hdisk3<br /># chpv -p DEF hdisk4<br /># chpv -p DEF hdisk5 </span><span style="font-weight: bold;"> </span><br /><br />- Next, add the respective mirror pools to each existing logical volume in the volume group. <br /><span style="font-weight: bold;font-size:85%;" ># chlv -m copy1=ABC -m copy2=DEF mylv <br /># chlv -m copy1=ABC -m copy2=DEF loglv01</span><br /><br />- Set the mirror pool strictness to 'super'. 
This means that each newly created logical volume in the volume group must be assigned to a mirror pool and that each mirror pool must contain at least one copy of each logical volume in the volume group.<br /><span style="font-weight: bold;font-size:85%;" ># chvg -M s myvg</span><br /><br />- Check the status of the mirror pools<br /><span style="font-weight: bold;font-size:85%;" ># lsvg -m myvg<br />Logical Volume Copy 1 Copy 2 Copy 3<br />loglv01 ABC DEF None<br />mylv ABC DEF None</span><br /><br /><span style="font-weight: bold;font-size:85%;" ># lsmp myvg<br />VOLUME GROUP: myvg Mirror Pool Super Strict: yes<br /><br />MIRROR POOL: ABC Mirroring Mode: SYNC<br />MIRROR POOL: DEF Mirroring Mode: SYNC</span><br /><br />Now when increasing the size of the file system built on top of mylv, there is no need to worry about the remote mirroring anymore.Miguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.com0tag:blogger.com,1999:blog-1361212388472874685.post-22545329484896538172010-07-31T17:43:00.012+02:002010-07-31T18:04:39.146+02:00AIX : Cross realm Kerberos TGT verification does not work<span style="font-weight: bold;"><span style="font-style: italic;">Introduction</span></span><br /><br />Currently, the AIX Kerberos load module KRB5 cannot handle cross realm Ticket Granting Ticket verification.<br /><br /><span style="font-weight: bold;"><span style="font-style: italic;">Details</span></span><br /><br />Assume the following setup:<br />- An Active Directory domain ABC.COM which contains all user principals.<br />- A Kerberos/NAS realm XYZ.COM which contains all service principals, including host service principals.<br />- Active Directory user foo which is mapped to AIX user foo using the following generic mapping rule in krb5.conf<br /> <span style="font-weight: bold;"><span style="font-style: italic;">auth_to_local = RULE:[1:$1@$0](.*@ABC.COM)s/@.*//</span></span><br />- The default Kerberos realm in krb5.conf is XYZ.COM<br />- The AIX user 
foo has the following properties:<br /> - <span style="font-weight: bold;"><span style="font-style: italic;">SYSTEM=KRB5</span></span>, which identifies that the Kerberos load module KRB5 needs to be used for authentication<br /> - <span style="font-weight: bold;"><span style="font-style: italic;">auth_domain=ABC.COM</span></span>, which identifies that an Active Directory KDC in domain ABC.COM should be contacted for authentication. This information is present in the [realms] section of krb5.conf.<br /><br />By default, the following Kerberos authentication flow is followed when Active Directory user foo wishes to connect to AIX host bar.xyz.com:<br /><br />- The KRB5 load module on bar.xyz.com sends an <span style="font-weight: bold;">AS-REQ</span> message to a KDC in Active Directory domain ABC.COM. This is regulated by the <span style="font-weight: bold;"><span style="font-style: italic;">auth_domain</span></span> user attribute of the local AIX user foo to which foo@ABC.COM is mapped using the <span style="font-weight:bold;"><span style="font-style:italic;">auth_to_local</span></span> rule. If <span style="font-weight:bold;"><span style="font-style:italic;">auth_domain</span></span> isn't set, the default realm in krb5.conf is used.<br />- The Active Directory KDC returns an <span style="font-weight: bold;">AS-REP</span> message to bar.xyz.com, containing the TGT and TGS session key.<br />- Now bar.xyz.com needs to verify that the TGT is genuine by actually using it to access a local service. The idea behind this is that only the "real" KDC would be able to produce a valid service ticket, encrypted with the service's private key. In order to perform this TGT verification, a keytab must be installed on bar.xyz.com which contains the host/bar.xyz.com@XYZ.COM host service principal. 
This keytab contains the secret key of the host service principal, only known to the TGS and the service itself.<br /><br />Apparently, the KRB5 load module sends the TGT verification <span style="font-weight: bold;">TGS-REQ</span> message to the Active Directory KDC in ABC.COM instead of to the NAS KDC in XYZ.COM, which contains the service principal. Since the Active Directory KDC does not know the host/bar.xyz.com@ABC.COM host service principal, it will return <span style="font-weight: bold;">KRB5KDC_ERR_S_PRINCIPAL_UNKNOWN</span>, resulting in user authentication failure. The problem lies in the fact that the KRB5 Kerberos load module was never designed to handle cross realm authentication properly. Therefore, the TGT verification code does not use the <span style="font-weight: bold;"><span style="font-style: italic;">krb5_verify_init_creds()</span></span> call.<br /><br /><span style="font-weight: bold;"><span style="font-style: italic;">Solution</span></span><br /><br />- Disable TGT verification in /usr/lib/security/methods.cfg<br /><span style="font-weight: bold;font-size:85%;" > KRB5:<br /> program = /usr/lib/security/KRB5<br /> program_64 = /usr/lib/security/KRB5_64<br /> options = authonly,tgt_verify=no</span><br /><br />- A DCR was created and accepted for this issue, which should be fixed in AIX 6.1 TL6 and AIX 7.1.<br />DCR <a href="http://users.telenet.be/miguelsanders/MR082009165.pdf" target="_blank">MR082009165 </a>: TGT verification for user defined on a KDC does not work when host principals are defined on a different KDC in a cross realm setup.<br />APAR <a href="http://www-01.ibm.com/support/docview.wss?uid=isg1IZ78593" target="_blank">IZ78593 </a>: ENABLE CROSS REALM TGT VERIFICATION IN KRB5A<br /><br /><span style="font-weight: bold;">Note: Starting AIX 6.1 TL2, the KRB5A and KRB5 load modules have been merged. 
APAR <a href="http://www-01.ibm.com/support/docview.wss?uid=isg1IZ26224" target="_blank">IZ26224 </a>addresses this.</span>Miguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.com0tag:blogger.com,1999:blog-1361212388472874685.post-668615703464201652010-07-22T13:31:00.011+02:002010-07-22T13:44:20.803+02:00VIO : Change default value of hcheck_intervalCurrently, the default value of hcheck_interval for VSCSI hdisks is set to 0, meaning that healthchecking is disabled. The hcheck_interval attribute of a hdisk can only be changed online if the volume group to which the hdisk belongs is not active. If the volume group is active, the ODM value of the hcheck_interval can be altered in the CuAt class, as shown in the following example:<br /><br /><strong><span style="font-size:85%;">chdev -l hdisk0 -a hcheck_interval=60 -P</span></strong><br /><br />The change will then be applied once the system is rebooted. However, it is possible to change the default value of the hcheck_interval attribute in the PdAt ODM class. As a result, you won't have to worry about its value anymore and newly discovered hdisks will automatically get the new default value, as illustrated in the example below:<br /><br /><strong><span style="font-size:85%;">odmget -q 'attribute = hcheck_interval AND uniquetype = "PCM/friend/vscsi"' PdAt | sed 's/deflt = "0"/deflt = "60"/' | odmchange -o PdAt -q 'attribute = hcheck_interval AND uniquetype = "PCM/friend/vscsi"'</span></strong>Miguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.com0tag:blogger.com,1999:blog-1361212388472874685.post-29511254626100573092010-07-21T19:26:00.017+02:002010-07-22T11:49:46.875+02:00AIX : Examining ulimits of running processCurrently, AIX does not allow examining the ulimits of a running process through the /proc file system. For example, 
in Linux this is possible, as shown in the following example:<br /><br /># cat /proc/4121/limits<br /><table><tbody><tr><td style="FONT-WEIGHT: bold"><span style="font-size:100%;">Limit</span></td><td style="FONT-WEIGHT: bold"><span style="font-size:100%;">Soft Limit</span></td><td style="FONT-WEIGHT: bold"><span style="font-size:100%;">Hard Limit</span></td><td style="FONT-WEIGHT: bold"><span style="font-size:100%;">Units</span></td></tr><tr><td><span style="font-size:100%;">Max cpu time</span></td><td><span style="font-size:100%;">unlimited</span></td><td><span style="font-size:100%;">unlimited</span></td><td><span style="font-size:100%;">ms</span></td></tr><tr><td><span style="font-size:100%;">Max file size</span></td><td><span style="font-size:100%;">unlimited</span></td><td><span style="font-size:100%;">unlimited</span></td><td><span style="font-size:100%;">bytes</span></td></tr><tr><td><span style="font-size:100%;">Max data size</span></td><td><span style="font-size:100%;">unlimited</span></td><td><span style="font-size:100%;">unlimited</span></td><td><span style="font-size:100%;">bytes</span></td></tr><tr><td><span style="font-size:100%;">Max stack size</span></td><td><span style="font-size:100%;">8388608</span></td><td><span style="font-size:100%;">unlimited</span></td><td><span style="font-size:100%;">bytes</span></td></tr><tr><td><span style="font-size:100%;">Max core file size</span></td><td><span style="font-size:100%;">unlimited</span></td><td><span style="font-size:100%;">unlimited</span></td><td><span style="font-size:100%;">bytes</span></td></tr><tr><td><span style="font-size:100%;">Max resident set</span></td><td><span style="font-size:100%;">unlimited</span></td><td><span style="font-size:100%;">unlimited</span></td><td><span style="font-size:100%;">bytes</span></td></tr><tr><td>. . 
.</td></tr></tbody></table><br />The only alternative currently available in AIX is to use dbx and attach to the running process, as shown in the following example:<br /><br /># dbx -a 700464<br />(dbx) <span style="FONT-WEIGHT: bold">proc rlimit</span><br /><table><tbody><tr><td><strong>rlimit name</strong></td><td><strong>rlimit_cur</strong></td><td><strong>rlimit_max</strong></td><td><strong>(units)</strong></td></tr><tr><td>RLIMIT_CPU:</td><td>(unlimited)</td><td>(unlimited)</td><td>sec</td></tr><tr><td>RLIMIT_FSIZE:</td><td>(unlimited)</td><td>(unlimited)</td><td>bytes</td></tr><tr><td>RLIMIT_DATA:</td><td>67108864</td><td>100663296</td><td>bytes</td></tr><tr><td>RLIMIT_STACK:</td><td>33554432</td><td>4294967296</td><td>bytes</td></tr><tr><td>RLIMIT_CORE:</td><td>(unlimited)</td><td>(unlimited)</td><td>bytes</td></tr><tr><td>RLIMIT_RSS:</td><td>(unlimited)</td><td>(unlimited)</td><td>bytes</td></tr><tr><td>. . .</td></tr></tbody></table>(dbx) quit<br /><br />The only major drawback of this alternative is that dbx interrupts the process by sending a SIGTRAP signal. After the dbx session finishes, the process terminates. A Design Change Request was made to address this.Miguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.com0tag:blogger.com,1999:blog-1361212388472874685.post-24681900834572276532009-12-31T12:57:00.005+01:002009-12-31T13:12:19.826+01:00GPFS : Tuning recommendationsA few words on important GPFS tunables.<br /><br /><span style="font-size:130%;"><b>Pagepool & SeqDiscardThreshold</b><br /></span>GPFS does not use the regular file buffer cache of the operating system (e.g. non-computational memory in AIX) but uses its own mechanism to implement caching. GPFS uses pinned computational memory to maintain its file buffer cache, called the <strong>pagepool</strong>, which is used to cache user file data and file system metadata. The default pagepool size is 64MB, which is too small for many applications most of the time. 
Applications that re-use files a lot and perform sequential reads will benefit from the pagepool. Non-DIO writes will also be done to the pagepool. For a sequential write operation to the pagepool, write-behind will improve overall performance. For random I/O, GPFS will not be able to use read-ahead or write-behind techniques and will have to rely on striping for improved performance.<br /><br />A parameter that affects how data is cached in the pagepool is <b>SeqDiscardThreshold</b>, which determines whether sequentially read file data is kept in the pagepool. The default for this value is 1 MB, which means that if a file greater than 1 MB is read sequentially, GPFS will not keep the data in the pagepool. There might be applications in which large files are often re-read by multiple processes, which can lead to improved performance if this tunable is set to a higher value.<br />It should also be noted that NSD servers do not cache anything for their NSD clients. If both NSD client A and NSD client B request the same file from an NSD server, the NSD server will get the data twice from disk. As a result, increasing the pagepool on an NSD server would have no effect.<br /><br /><b><span style="font-size:130%;">Block size</span></b><br />This is one of the most important things to think about when designing a GPFS file system. After creating the file system with a specific block size, there's no way back other than recreating the file system with the new block size. Choosing the optimal GPFS block size is not a straightforward exercise since it relies on several other factors:<br /><ul><li>Physical disk block size</li><li>LUN segment size, which is the maximum amount of data that is written or read from a disk per operation before the next disk in the array is used</li><li>Application block size (e.g. 
DB block size for an RDBMS application)</li></ul>The following example will calculate the (theoretical) optimal GPFS block size, without taking the application block size into account:<br /><br />RAID 5 4+1, 128 KB LUN segment size = 512 KB LUN stripe size.<br />As a result, a GPFS block size of 512 KB (or a multiple thereof) would be good.<br />A GPFS block size of 256 KB will almost certainly lead to reduced performance because the disk subsystem would have to read the remaining 256 KB of the 512 KB stripe in order to calculate parity in a write operation. Summarized, the operations for both 256 KB and 512 KB block sizes on the disk subsystem would look like:<br /><ol><li><strong>GPFS write (256 KB)</strong> = Write LUN Segment #1 (128 KB) + Write LUN Segment #2 (128 KB) + Read LUN Segment #3 (128 KB) + Read LUN Segment #4 (128 KB) + Calculate Parity + Write LUN segment #5 (128 KB)</li><li><strong>GPFS write (512 KB)</strong> = Write LUN Segment #1 (128 KB) + Write LUN Segment #2 (128 KB) + Write LUN Segment #3 (128 KB) + Write LUN Segment #4 (128 KB) + Calculate Parity + Write LUN segment #5 (128 KB)</li></ol>Considering the possible use of a write cache on the disk subsystem, 1 is certainly more costly than 2.<br /><br /><span style="font-size:130%;"><b>Split data / metadata</b><br /></span>Splitting data and metadata is one of the most underestimated design questions.<br />The actual division can be changed online, so unlike changing the GPFS file system block size, there is no downtime involved. If metadata (inodes + indirect data blocks) cannot be accessed fast enough, overall performance will degrade severely. Metadata access can be compared to a seek operation on a normal hard drive. 
Generally, it's a good idea to do the following:<br /><ul><li><strong>Metadata (RAID 1 + Enable read/write cache on the disk subsystem)<br /></strong>Try to keep as much metadata as possible in the cache of the disk subsystem so that every node that looks for it will find it in the cache.<br />Write operations on metadata are generally random and small. Moreover, these write operations should be as fast as possible.<br />As a result, using the write cache for metadata will be very beneficial.<br />Finally, since metadata is more read from than written to, RAID 1 is a better (though more costly) backend for metadata than RAID 5.</li><br /><li><strong>Data (RAID 5 + Disable read/write cache on the disk subsystem)<br /></strong>Try to protect metadata in the cache as much as possible by disabling caching on data.<br />As a result, nodes that are reading a lot of data don't thrash the cache, which can be used efficiently for nodes that need access to metadata instead. Sometimes it can even be beneficial to dedicate a disk controller to metadata LUNs (caching enabled on that controller).</li></ul><span style="font-size:130%;"><b>maxFilesToCache & maxStatCache</b><br /></span>As already stated, the pagepool is GPFS's file buffer cache in pinned computational memory, which caches user file data and file system metadata. On the other hand, GPFS uses regular computational memory to maintain its inode and stat cache (user file metadata). The inode cache (controlled by the <strong>maxFilesToCache</strong> tunable, default 1000) contains copies of inodes for open files and for some recently used files that are no longer open. Storing a file's inode in cache permits faster re-access to that file. The stat cache (controlled by the <strong>maxStatCache</strong> tunable, default 4 * maxFilesToCache) contains enough information to open the file and satisfy a stat() call. 
It is intended to help functions such as ls -l, du, and certain backup programs that scan entire directories looking for modification times and file sizes. However, the stat cache entry does not contain enough information to read from or write to the file since it does not contain the indirect block references (unlike a regular inode). A stat cache entry consumes significantly less memory than a full inode.<br /><br />It is possible that the number of currently opened files is larger than the size of the inode cache. In that case, the inode first needs to be retrieved from disk when a node wishes to read a file whose inode is not in the inode cache. Therefore, it's very important that metadata can be accessed as fast as possible (see above).<br /><br /><span style="font-size:130%;"><b>prefetchThreads & worker1Threads & maxMBps</b><br /></span>The <strong>prefetchThreads</strong> tunable (default 72) controls the maximum possible number of threads dedicated to prefetching data for files that are read sequentially, or to handle sequential write-behind. On the other hand, the <strong>worker1Threads</strong> tunable (default 48) controls the maximum number of concurrent file operations. If there are more requests than the number of worker1Threads, the excess will wait until a previous request has finished. The primary use is for random read or write requests that cannot be prefetched, random I/O requests, or small file activity. The maximum value of prefetchThreads plus worker1Threads is 550 (64-bit kernels) or 164 (32-bit kernels). These values sometimes need tuning, e.g. in an Oracle RAC environment. Oracle does not need many prefetchThreads, since Oracle does its own prefetching and does not use the GPFS pagepool (Oracle uses DIO to access files on a GPFS filesystem). 
However, Oracle does need a high amount of worker1Threads to allow as many Oracle AIO threads as possible to work in parallel.<br /><br />The <strong>maxMBps</strong> tunable (default 150) is used for estimating the amount of I/O triggered by sequential read-ahead and write-behind. Setting this value higher than the default will get more parallelism if there are many LUNs. By lowering this value, the load on the disk subsystem can be limited artificially. Setting this value too high usually does not cause problems because of other limiting factors, such as the size of the pagepool and the number of prefetch threads.Miguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.com1tag:blogger.com,1999:blog-1361212388472874685.post-68602127044963963102009-12-19T12:00:00.013+01:002009-12-19T12:23:50.016+01:00AIX : Naming resolutionFollowing a recent conversation with IBM L2 support concerning general naming resolution, all affected parameter files and environment variables (and their relationship) will be explained.<br /><br />AIX supports several mechanisms for naming resolution of hosts, networks, protocols, services, netgroups and rpc:<br /><ul><li><b>dns</b> - Domain Name Service</li><li><b>nis</b> - Network Information Service</li><li><b>nis+</b> - Network Information Service Plus</li><li><b>local</b> - Local naming service. Searches the files in /etc directory for resolving</li><li><b>nis_ldap</b> - Provides naming resolution for host, networks, protocols, rpc, services, and netgroups. This mechanism works with any directory server which stores entity data using a schema defined in RFC 2307. Although the name of the mechanism is nis_ldap, this mechanism does not use or require any NIS services!</li></ul>AIX can be configured to use a combination of the above services for naming resolution. There is a sequential order that AIX follows to use these services. 
The default ordering can be overridden in several ways:<br /><ul><li><b>NSORDER</b> environment variable</li><li><b>/etc/netsvc.conf</b> configuration file</li><li><b>/etc/irs.conf</b> configuration file</li></ul><b>NSORDER</b><br />NSORDER is an environment variable that can be used to specify the order for resolving host names to addresses (gethostbyname) and vice versa (gethostbyaddr). NSORDER overrides the host settings in the netsvc.conf and irs.conf files. The supported mechanisms for NSORDER are <b>bind, nis, local</b>, which is also the default order.<br /><br /><b>/etc/netsvc.conf</b><br />The netsvc.conf file specifies the sequential order for resolving host names and aliases. It should be noted that sendmail ONLY uses netsvc.conf for resolution of host names and aliases. Other configuration files or environment variables are not consulted. The environment variable NSORDER overrides the host settings in the netsvc.conf file, which in turn overrides the host settings in the irs.conf file.<br /><br /><b>/etc/irs.conf</b><br />The irs.conf file is used to control the order of mechanisms that the resolver libraries use in searching for network-related data, including the resolving of host names, networks, services, protocols, and netgroups. The default order for resolving host names and networks is <b>dns, nis, local</b>. The default order for resolving services, protocols, and netgroups is <b>nis, local</b>. The order defined in irs.conf will override the default values. The settings in the netsvc.conf configuration file override the settings in the irs.conf file. The NSORDER environment variable overrides the settings in the irs.conf and netsvc.conf files.<br /><br /><b>Note:</b><br />AIX offers two LDAP naming services, <b>ldap</b> and <b>nis_ldap</b>. The <b>ldap</b> naming service uses the IBM specific schema and supports host name resolution only. 
The <b>nis_ldap</b> naming service, implemented since AIX 5.2, uses the RFC 2307 schema and supports name resolution of hosts, services, networks, protocols, and netgroups.<br /><br /><b>Summary</b><br /><br /><table border="1"><tr><td><b>Service</b></td><td><b>Precedence</b></td></tr><tr><td><b>hosts</b></td><td>NSORDER, netsvc.conf, /etc/irs.conf</td></tr><tr><td><b>networks</b></td><td>irs.conf</td></tr><tr><td><b>protocols</b></td><td>irs.conf</td></tr><tr><td><b>services</b></td><td>irs.conf</td></tr><tr><td><b>netgroups</b></td><td>irs.conf</td></tr></table>Miguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.com0tag:blogger.com,1999:blog-1361212388472874685.post-29564405318080720692009-12-08T17:03:00.012+01:002009-12-09T07:47:28.848+01:00NIM : Replication issue<strong><em>Introduction:</em></strong><br />Since AIX 5.3 TL5, Network Installation Manager supports replication of NIM objects from the NIM master to the alternate NIM master (<a href="http://www-01.ibm.com/support/docview.wss?uid=isg1IY81860" target="_blank">APAR IY81860</a>). 
Apparently, this feature does not function properly.<br /><br /><strong><em>Impacted:</em></strong><br />- All AIX versions to date<br /><a href="http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/53" target="_blank">http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/53</a><br /><a href="http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/61" target="_blank">http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/61</a><br /><br /><strong><em>Details:</em></strong><br />The setup consists of the following two nodes: <br />master (NIM master) and alternate (Alternate NIM master) <br /><br />When issuing a regular sync operation on the NIM master, the operation is successful: <br /><strong><span style="font-size:85%;"># nim -Fo sync alternate<br />...<br />nim_master_recover Complete<br /></span></strong><br />When issuing a sync operation on the NIM master with the replicate option (this will copy all resources that are not present on the alternate NIM master), the following error is observed:<br /><span style="font-size:85%;"><strong># nim -Fo sync -a replicate=yes alternate<br />...<br />nim_master_recover Complete<br />error replicating resources: unable to /usr/lpp/bos.sysmgt/nim/methods/c_rsh master<br />Finished Replicating NIM resources <br />... 
<br />Finished checking SPOTs <br />nim_master_recover Complete<br /></span></strong><br />The replicate operation fails because of the broken c_rsh utility.<br />Further debugging of c_rsh on the NIM master revealed that there are several ODM lookups prior to the error.<br /><strong><span style="font-size:85%;"># truss /usr/lpp/bos.sysmgt/nim/methods/c_rsh master date 2>&1 | grep objrepos<br />...<br />statx("/etc/objrepos/nim_object", 0x2FF1FD70, 76, 0) = 0<br />kopen("/etc/objrepos/nim_object", O_RDONLY) = 5<br />kopen("/etc/objrepos/nim_attr", O_RDONLY) = 5<br />kopen("/etc/objrepos/nim_attr.vc", O_RDONLY) = 6<br />...<br /></span></strong> <br /><strong><em>Resolution:</em></strong><br />Following PMR 25293.300.624, the IBM lab stated that a failing ODM lookup is the root cause of the issue. As a result, a particular data structure is not populated and a signal 11 (segmentation fault) occurs when trying to copy a string to this structure.<br /><br /><a href="http://www-01.ibm.com/support/docview.wss?uid=isg1IZ66255" target="_blank">APAR IZ66255</a> was created to address this issue.Miguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.com0tag:blogger.com,1999:blog-1361212388472874685.post-74617845288580957952009-11-24T19:27:00.025+01:002009-11-25T10:23:58.833+01:00VIO : Client path failoverFollowing a recent discussion with IBM L2 support, all parameters that affect VIO client path failover will be explained briefly.<br /><br /><span style="font-size:85%;"><strong># lsattr -El hdisk0<br />...<br />algorithm fail_over<br />hcheck_interval 60<br />hcheck_mode nonactive<br />...<br /></strong></span><br />Currently, MPIO on the VIO client only supports failover from one VSCSI client adapter to another (fail_over algorithm). Load balancing over multiple VSCSI client adapters is currently not supported. <br />The heartbeat check interval for each disk using MPIO should be configured so that the path status is updated automatically.
Specifying hcheck_mode=nonactive means that healthcheck commands are sent down paths that have no active I/O, including paths with a state of "Failed". The hcheck_interval attribute defines how often the healthcheck is performed. In the client partition, the hcheck_interval for virtual SCSI devices is set to 0 by default, which means healthchecking is disabled.<br /><br /><span style="font-size:85%;"><strong># lsattr -El vscsi2<br />vscsi_err_recov fast_fail <br />vscsi_path_to 30 </strong></span><br /><br />vscsi_path_to, when enabled, allows the virtual client adapter driver to determine the health of the VIO Server to improve and expedite path failover processing.<br />A value of 0 (default) disables it, while any other value defines the number of seconds the VSCSI client adapter will wait for commands issued to the VSCSI server adapter that have not been serviced in the meantime. If that time is exceeded, the VSCSI client adapter attempts the commands again and waits up to 60 seconds before failing the outstanding requests. An error will be written to the error log and, if MPIO is used, another path to the disk will be tried to service the requests. Therefore, this parameter should only be set for MPIO installations with dual VIO servers.<br /><br />Similar to the attribute fc_error_recov for real FC adapters, the attribute vscsi_err_recov is used by the VSCSI adapter driver. When this parameter is set to fast_fail, the VIO client adapter will send a FAST_FAIL datagram to the VIO server and will subsequently fail the I/O immediately rather than after a delay.
This may help to improve MPIO failover.<br /><br />vscsi_err_recov was added in AIX 5.3 TL9 (APAR <a href="http://www-01.ibm.com/support/docview.wss?uid=isg1IZ28537" target="_blank">IZ28537</a>) and AIX 6.1 TL2 (APAR <a href="http://www-01.ibm.com/support/docview.wss?uid=isg1IZ28554" target="_blank">IZ28554</a>).<br />It requires VIO server 2.1.Miguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.com0tag:blogger.com,1999:blog-1361212388472874685.post-73521305025411525592009-05-20T20:21:00.007+02:002009-05-20T20:37:15.972+02:00OpenSSH : Kerberos user principal name incorrect on AIX<em><strong>Introduction:</strong></em><br />Currently, it is observed that password-based Kerberos authentication in OpenSSH does not function properly on AIX. Even though AIX can authenticate a user via Kerberos (using the KRB5/KRB5A load module), OpenSSH cannot.<br /><br /><em><strong>Impacted:</strong></em><br />- OpenSSH <= 5.2p1<br /><br /><em><strong>Details:</strong></em><br />This issue is caused by the fact that an AIX user has two attributes which OpenSSH doesn't take into account when forming the principal name of the user (attributes auth_name and auth_domain). If AIX user myuser has the attributes auth_name=someone and auth_domain=SOMEWHERE, then the Kerberos principal name would be someone@SOMEWHERE instead of myuser@DEFAULTREALM. By employing the auth_domain attribute, requests are sent to the SOMEWHERE realm instead of the default realm DEFAULTREALM, which is listed in the libdefaults section of the krb5.conf configuration file.
<br /><br />The following can be seen in the OpenSSH code (auth-krb5.c on line 88):<br /><br /><span style="font-size:85%;"><strong>problem = krb5_parse_name(authctxt->krb5_ctx,authctxt->pw->pw_name,&authctxt->krb5_user);<br /></strong></span><br />Since authctxt->pw->pw_name contains only the user name (without a realm), the default realm will be automatically appended according to the documentation of the krb5_parse_name call. Since this isn't the correct realm name (the overwritten auth_domain is the correct one), Kerberos authentication will fail. If the auth_domain attribute is not set, the default realm name will be used.<br /><br /><em><strong>Resolution:</strong></em><br />- <a href="https://bugzilla.mindrot.org/show_bug.cgi?id=1583" target="_blank">Bugzilla item # 1583</a> was created to address this issue. The item contains a patch to the source which solves the issue.Miguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.com0tag:blogger.com,1999:blog-1361212388472874685.post-79678800927440140352009-05-06T11:34:00.008+02:002009-05-06T12:56:09.922+02:00Samba : DFS does not work on AIX<em><strong>Introduction:</strong></em><br />Currently, there is a minor bug in Samba which makes DFS unusable on AIX. <br /><br /><em><strong>Impacted:</strong></em><br />- Samba <= 3.3.4<br /><br /><em><strong>Details:</strong></em><br />The issue is caused by the behaviour of the <a href="http://publib.boulder.ibm.com/infocenter/systems/index.jsp?topic=/com.ibm.aix.basetechref/doc/basetrf2/readlink.htm" target="_blank">readlink</a> system call on AIX. If the size of the buffer cannot contain the entire symbolic link, the ERANGE error is returned. Other UNIX and Linux distributions will never return an error if the size of the buffer is too small. Instead, only a part of the symbolic link will be written in the buffer.<br /><br />In msdfs.c, the character array 'link_target_buf' is defined with size 7 (size of "msdfs:" + 1). 
Since the DFS link is larger than that, the readlink system call on AIX returns ERANGE. In order to resolve this issue, the array should be of size PATH_MAX (defined in /usr/include/sys/limits.h). <br /><br />A proposed patch looks like:<br /><br /><span style="font-size:85%;"><strong>--- msdfs.c 2009-05-06 08:36:00.000000000 +0200<br />+++ msdfs.new.c 2009-05-06 08:36:44.000000000 +0200<br />@@ -400,11 +400,15 @@<br /> char **pp_link_target,<br /> SMB_STRUCT_STAT *sbufp)<br /> {<br /> SMB_STRUCT_STAT st;<br /> int referral_len = 0;<br />+#ifdef AIX<br />+ char link_target_buf[PATH_MAX];<br />+#else<br /> char link_target_buf[7];<br />+#endif<br /> size_t bufsize = 0;<br /> char *link_target = NULL;<br /><br /> if (pp_link_target) {<br /> bufsize = 1024;</strong></span><br /><br /><em><strong>Resolution:</strong></em><br />- <a href="https://bugzilla.samba.org/show_bug.cgi?id=6330" target="_blank">Bugzilla item # 6330</a> was created to address this issue.Miguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.com0tag:blogger.com,1999:blog-1361212388472874685.post-17936605783545363302009-05-03T18:25:00.008+02:002009-05-20T20:43:57.427+02:00OpenSSH : Server option PrintLastLog does not work on AIX<em><strong>Introduction:</strong></em><br />Currently, the OpenSSH server option "PrintLastLog" does not work on AIX. The last login time is always displayed, disregarding the option. <br /><br /><em><strong>Impacted:</strong></em><br />- OpenSSH <= 5.2p1<br /><br /><em><strong>Details:</strong></em><br />When browsing the source, several functions in loginrec.c were found which solely handle the processing of the last login info (login_get_lastlog, getlast_entry).<br />Since AIX does not provide such a function natively, the configure script sets the DISABLE_LASTLOG define. 
A small code snippet from getlast_entry in loginrec.c shows this:<br /><br /><span style="font-size:85%;"><strong>#if defined(DISABLE_LASTLOG)<br /> /* On some systems we shouldn't even try to obtain last login<br /> * time, e.g. AIX */<br /> return (0);</strong></span><br /><br />On the other hand, when issuing the AIX loginsuccess() call (which writes a new login record), the last login record can be retrieved by that very same call.<br />Looking at port-aix.c, the following can be seen:<br /><br /><span style="font-size:85%;"><strong>if (loginsuccess((char *)user, (char *)host, (char *)ttynm, &msg) == 0) {<br /> success = 1;<br /> if (msg != NULL && loginmsg != NULL && !msg_done) {<br /> debug("AIX/loginsuccess: msg %s", msg);<br /> buffer_append(loginmsg, msg, strlen(msg));<br /> xfree(msg);<br /> msg_done = 1;<br /> }<br />}</strong></span><br /><br />Pointer "msg" points to the new last login info for the user and is always appended to the loginmsg buffer. The buffer_append call should only be made if options.print_lastlog is set.<br /><br /><em><strong>Resolution:</strong></em><br />- <a href="https://bugzilla.mindrot.org/show_bug.cgi?id=1595" target="_blank">Bugzilla item # 1595</a> was created to address this issue. The item contains patches to the source which solve the issue.Miguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.com0tag:blogger.com,1999:blog-1361212388472874685.post-18187924856310973722009-04-20T10:00:00.017+02:002009-05-06T13:00:50.407+02:00EtherChannel : Issue with backup virtual adapter<strong><em>Introduction:</em></strong><br />Currently, there is something very odd going on when using EtherChannel (Network Interface Backup mode) if the backup adapter is a virtual adapter. PMR 68839.300.624 clarified that it is by design that the backup virtual adapter receives traffic, even though it is in backup mode.
However, this introduces an additional problem: even though the backup virtual adapter is receiving traffic, it is not replying to it. It is the primary channel that responds, which creates an unbalanced situation on the physical network, resulting in flooding. <br /><br /><strong><em>Impacted:</em></strong><br />- All AIX versions to date<br /><a href="http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/53" target="_blank">http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/53</a><br /><a href="http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/61" target="_blank">http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/61</a><br />- POWER5(+) firmware <= SF240_320<br /><br /><strong><em>Details:</em></strong><br /><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyFhwkFctaPa5PcimTWvL2sV5phM446BFnvyY_pc7FDV8pNDJ2YxLKv5mVsSmYJVC-t7H0JbZOVqFXMmwGBy9-IbOhL2TMGXjbT9oQVTCCrLOZAfn1_pd-H-bWpckBkJOjNVfmIp9LeGU/s1600-h/drawing.GIF"><img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 220px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjyFhwkFctaPa5PcimTWvL2sV5phM446BFnvyY_pc7FDV8pNDJ2YxLKv5mVsSmYJVC-t7H0JbZOVqFXMmwGBy9-IbOhL2TMGXjbT9oQVTCCrLOZAfn1_pd-H-bWpckBkJOjNVfmIp9LeGU/s320/drawing.GIF" border="0" alt="" id="BLOGGER_PHOTO_ID_5326692782138108226" /></a><br /><br />Consider that the ARP tables on client LPAR A and server LPAR B are empty, as well as the MAC table on the Ethernet Switch. Client LPAR A wishes to send data to LPAR B.
<br /><br /><span style="font-size:85%;"><strong>LPAR A: <br />ent1 (Virtual Ethernet) - MAC address 22:f1:30:00:70:06 <br />en1 - IP address 10.226.32.145<br /><br />LPAR B: <br />ent3 (EtherChannel in NIB mode (Active/Passive)) - MAC address 00:14:5e:c6:46:80 <br />en3 - IP address 10.226.32.139<br />Primary Channel: ent2 (Physical Ethernet) <br />Backup Channel: ent1 (Virtual Ethernet)<br /><br />VIO: <br />ent3 (Shared Ethernet Adapter) - MAC address 00:14:5e:48:2c:7a<br />Physical Ethernet: ent2 - MAC address 00:14:5e:48:2c:7a<br />Virtual Ethernet: ent1 - MAC address 22:f1:30:00:30:06 <br /> <br />Source IP address: 10.226.32.145 <br />Destination IP address: 10.226.32.139 <br /> <br />Source MAC address: 22:f1:30:00:70:06 <br />Destination MAC address: unknown <br /></strong></span> <br />Since client LPAR A does not know the destination MAC address of server LPAR B, client LPAR A is broadcasting an ARP request (Who has 10.226.32.139, tell 10.226.32.145) on the internal Layer 2 PHYP switch. Even though the EtherChannel on server LPAR B is in Primary Channel Mode, the PHYP delivers this packet to the backup Virtual Ethernet adapter of the EtherChannel and also delivers the broadcast to the SEA for bridging. As a result, the MAC table on the physical switch is updated with the MAC address of client LPAR A, located on physical port X. Server LPAR B will form a unicast reply, but sends this reply via the Primary Channel to the Ethernet Switch. The Ethernet Switch receives the unicast reply on port Y and links the source MAC address of server LPAR B to port Y in the MAC table. Since the frame contains a destination MAC address which has a valid MAC table entry on the physical switch, it is delivered to port X and ultimately received by client LPAR A through the SEA. Client LPAR A updates its ARP table with the MAC address of server LPAR B.<br /><br />Now client LPAR A can start communicating with server LPAR B since it now knows the destination MAC address.
The PHYP is delivering the packets via the backup Virtual Ethernet adapter of the EtherChannel. After the TTL of the MAC table entry for client LPAR A expires, flooding is observed on the physical switch, meaning that the switch will act as a simple repeater for all communication from server LPAR B to client LPAR A, thereby sending it to all trunk ports and access ports defined in the same VLAN. Of course, the frames are also forwarded to port X (it's in the same VLAN) and are ultimately received by client LPAR A through the SEA. <br /><br />When client LPAR A is sending jumbo frames (data) to server LPAR B, approximately 2 Mbit/s of TCP ACK flooding was observed. It gets really bad when the process is reversed, in which server LPAR B is sending data to client LPAR A. As a result, all data will be flooded on the switch and only the TCP ACKs are delivered via the backup Virtual Ethernet Adapter.<br /><br />According to IBM, this is working as designed and a DCR was created to address this issue.<br /><br /><strong><em>Resolution:</em></strong><br />- Reduce the ARP table TTL on the LPARs (arpt_killc network tunable) OR<br />- Increase the MAC table TTL on the physical switch OR<br />- Replace the Virtual Ethernet adapter by a Physical Ethernet adapter for the EtherChannel backup channel.Miguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.com0tag:blogger.com,1999:blog-1361212388472874685.post-47566547242359931532009-04-01T13:08:00.006+02:002009-05-03T18:50:08.101+02:00Quorum active or not?AIX 5.3 TL7 introduces concurrent quorum changes on a volume group. Prior to that version, the quorum change only becomes active after a varyoff/varyon operation on that specific volume group. This also means that, whenever the ODM value is changed, there is no easy way to know whether quorum is currently active or not, since lsvg displays the values of ODM attributes, not real-time values.<br />Fortunately, there is a way to figure out whether quorum is active or not.
This involves debugging the running kernel using kdb. The procedure to do this is as follows:<br /><br />- Determine the major number of the volume group in /dev and convert it to hexadecimal. For example, rootvg will always have a major number of 10 (hexadecimal A) and all logical volumes will have a sequential minor number starting at 1.<br /><br /><span style="font-size:85%;"><strong># ls -al /dev/rootvg<br />crw-rw---- 1 root system 10, 0 Apr 24 2008 /dev/rootvg</strong><br /></span><br />- List the device switch table entry for the volume group, based on the hexadecimal major number, and track the effective address of the volgrp structure in memory (dsdptr)<br /><br /><strong><span style="font-size:85%;"># echo 'devsw 0xA' | kdb<br />The specified kernel file is a 64-bit kernel<br />Preserving 1402949 bytes of symbol table<br />First symbol __mulh<br />START END <name><br />0000000000001000 0000000003DDF050 start+000FD8<br />F00000002FF47600 F00000002FFDC920 __ublock+000000<br />000000002FF22FF4 000000002FF22FF8 environ+000000<br />000000002FF22FF8 000000002FF22FFC errno+000000<br />F100070F00000000 F100070F10000000 pvproc+000000<br />F100070F10000000 F100070F18000000 pvthread+000000<br />PFT:<br />PVT:<br />id....................0002<br />raddr.....000000000A000000 eaddr.....F200800040000000<br />size..............00080000 align.............00001000<br />valid..1 ros....0 fixlmb.1 seg....0 wimg...2<br />(0)> devsw 0xA<br />Slot address F1000100101AA500<br />MAJOR: 00A<br /> open: 04165624<br /> close: 04164EC8<br /> read: 04164738<br /> write: 04164638<br /> ioctl: 04162960<br /> strategy: 04180E9C<br /> ttys: 00000000<br /> select: .nodev (00196AE4)<br /> config: 041588F8<br /> print: .nodev (00196AE4)<br /> dump: 04181E68<br /> mpx: .nodev (00196AE4)<br /> revoke: .nodev (00196AE4)<br /> dsdptr: F100010032BA2000<br /> selptr: 00000000<br /> opts: 0000012A DEV_DEFINED DEV_MPSAFE DEV_EXTBUF</span></strong><br /><br />- Determine the flags attribute of the
volgrp structure. The last bit is about quorum (1 -> quorum disabled)<br /><br /><strong><span style="font-size:85%;"># echo 'volgrp F100010032BA2000' | kdb | grep flags | awk '{print $4}'<br />00000001</span></strong>Miguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.com2tag:blogger.com,1999:blog-1361212388472874685.post-578787064547595922009-03-30T08:51:00.047+02:002009-05-03T18:33:08.268+02:00TCP issue in IBM NAS 1.4.0.8<em><strong>Introduction:</strong></em><br />Ever since IBM NAS (Network Authentication Service) 1.4.0.8 with TCP support (RFC1510 compliant) was released, an issue was found in the TCP reception of fragmented payloads. As a result, TCP connections will never be closed properly (they remain in the TCP state CLOSE_WAIT) and pose an mbuf depletion threat.<br /><br /><strong><em>Impacted:</em></strong><br />- IBM Network Authentication Service 1.4.0.8<br /><br /><strong><em>Details:</em></strong><br />NAS server: MTU 1500 bytes, IP x.y.z.u<br />NAS client: MTU 576 bytes, IP a.b.c.d<br /><br />When the NAS client is in a LAN segment with MTU 576 bytes, the TCP issue on the server occurs. Once the client gets a cross realm ticket from an Active Directory domain controller, a service ticket is requested from the NAS server. The following tcpdump trace shows the TGS exchange:<br /><br /><span style="font-size:85%;">(1)09:40:54.892621 IP a.b.c.d.1250 > x.y.z.u.88: S 1586082305:1586082305(0) win 64512 <mss 536,nop,nop,sackOK><br />(2)09:40:54.892816 IP x.y.z.u.88 > a.b.c.d.1250: S 3658439259:3658439259(0) ack 1586082306 win 65535 <mss 1460><br />(3)09:40:54.893145 IP a.b.c.d.1250 > x.y.z.u.88: . ack 1 win 64856<br />(4)09:40:54.893338 IP a.b.c.d.1250 > x.y.z.u.88: . 1:537(536) ack 1 win 64856<br />(5)09:40:54.893471 IP a.b.c.d.1250 > x.y.z.u.88: . 537:1073(536) ack 1 win 64856<br />(6)09:40:54.893743 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 1073 win 65535<br />(7)09:40:54.894292 IP a.b.c.d.1250 > x.y.z.u.88: . 
1073:1609(536) ack 1 win 64856<br />(8)09:40:54.894310 IP a.b.c.d.1250 > x.y.z.u.88: . 1609:2145(536) ack 1 win 64856<br />(9)09:40:54.894320 IP a.b.c.d.1250 > x.y.z.u.88: P 2145:2307(162) ack 1 win 64856<br />(10)09:40:55.070688 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 2307 win 65535<br />(11)09:40:59.878565 IP a.b.c.d.1250 > x.y.z.u.88: . 2307:2843(536) ack 1 win 64856<br />(12)09:40:59.878649 IP a.b.c.d.1250 > x.y.z.u.88: . 2843:3379(536) ack 1 win 64856<br />(13)09:40:59.878658 IP a.b.c.d.1250 > x.y.z.u.88: . 3379:3915(536) ack 1 win 64856<br />(14)09:40:59.878720 IP a.b.c.d.1250 > x.y.z.u.88: . 3915:4451(536) ack 1 win 64856<br />(15)09:40:59.884118 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 4451 win 65535<br />(16)09:40:59.884567 IP a.b.c.d.1250 > x.y.z.u.88: P 4451:4613(162) ack 1 win 64856<br />(17)09:41:00.084446 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 4613 win 65535<br />(18)09:41:04.878515 IP a.b.c.d.1250 > x.y.z.u.88: F 4613:4613(0) ack 1 win 64856<br />(19)09:41:04.878592 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 4614 win 65535<br /><br />(1)First step in the TCP handshake, in which the NAS client sends a SYN packet with TCP sequence number 1586082305, a TCP window size of 64512 bytes and a maximum TCP payload size (MSS) of 536 bytes.<br />(2)The NAS server replies by acknowledging the SYN packet from the NAS client and sending its own SYN packet with TCP sequence number 3658439259, a TCP window size of 65535 bytes and a maximum TCP payload size (MSS) of 1460 bytes.<br />(3)The NAS client acknowledges the SYN packet of the NAS server. The connection is now in the TCP state ESTABLISHED on both sides. The maximum TCP payload size (MSS) will be 536 bytes.<br />(4)The NAS client wants to send its TGS-REQ packet, but it has a total TCP payload of 2306 bytes. The large size of this payload can be explained by the inclusion of the PAC in the user's TGT. Due to the large payload, TCP fragmentation needs to be done.
Since the agreed MSS size is 536 bytes, 5 fragments need to be transmitted.<br />(5)The NAS client sends the second fragment.<br />(6)The NAS server acknowledges the first two fragments.<br />(7-8-9) The NAS client sends the remaining three fragments.<br />(10)The NAS server acknowledges the reception and reassembly of the remaining fragments. Normally, the NAS server should start sending the TGS-REP now but refuses to do so.<br />(11)After a 5 second timeout, the NAS client hasn't received the TGS-REP from the NAS server and starts retransmitting the first fragment of the TGS-REQ.<br />(12-13-14)The NAS client retransmits fragments #2, #3 and #4.<br />(15)The NAS server acknowledges the reception of the first 4 fragments.<br />(16)The NAS client sends its final fragment.<br />(17)The NAS server acknowledges the reception and reassembly of the remaining fragments. Once again, the NAS server doesn't start sending the TGS-REP.<br />(18)After an additional 5 second wait interval, the NAS client gives up and performs an active close on its end by sending a FIN packet to the NAS server. The NAS client is now in the TCP state FIN_WAIT_1.<br />(19)The NAS server acknowledges the FIN of the NAS client. The NAS server is now in the TCP state CLOSE_WAIT and the NAS client is now in the TCP state FIN_WAIT_2. Normally, the NAS server should now send a FIN packet to the NAS client to close the TCP connection, but refuses to do so.</span><br /><br />As a result, netstat on the NAS server shows TCP connections stuck in the TCP state CLOSE_WAIT.
As long as the NAS server is active, those TCP connections will never be freed and pose a potential mbuf depletion threat.<br /><br />After further investigation, the following truss output of the NAS server revealed the problem.<br /><span style="font-size:85%;">0.0000: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,<br />0x00000000) (sleeping...)<br />0.0000: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,<br />0x00000000) = 0<br />0.7132: yield() =<br />0.7136: thread_waitact(400) = 1<br />1.7665: naccept(75, 0x2FF21938, 0x2FF2198C) = 99<br />1.7669: ngetsockname(99, 0x2FF21998, 0x2FF21990) = 0<br />1.7673: kfcntl(99, F_GETFL, 0x00000000) = 6<br />1.7680: kfcntl(99, F_SETFL, 0x00000006) = 0<br />1.7684: kioctl(99, -2147195266, 0x10038260, 0x00000000) = 0<br />1.7688: setsockopt(99, 65535, 128, 0x10038268, 8) = 0<br />1.7691: __libc_sbrk(0x00000000) = 0x215E9520<br />1.7697: thread_setmystate(0x00000000, 0x2FF210B0) = 0<br />1.7700: mprotect(0x216C8000, 4096, 0) = 0<br />1.7704: thread_twakeup(3473645, 268435456) = 0<br />1.7707: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,<br />0x00000000) = 268435456<br />= 1<br />1.7715: thread_setmystate(0x216E13D0, 0x216E16D8) = 0<br />1.7720: yield() =<br />1.7724: thread_waitact(400) = 1<br />1.7727: yield() =<br />3.7745: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,<br />0x00000000) (sleeping...)<br />3.7745: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,<br />0x00000000) = 1<br />kread(99, "\0\0\b ?", 4) = 4<br />4.7437: _select(100, 0x216DFAC8, 0x216E0AC8, 0x00000000,<br />0x00000000) = 1<br /><strong>kread(99, " l82\b 082\b ? 
?030201".., 2302) = 532 </strong><br />4.7464: kthread_ctl(2, 0x00000000) = 0<br />4.7467: thread_setmystate_fast(0x4000000C, 0x00000000,<br />0x00000000, 0x00000000, 0x40000000, 0x00000158, 0x00000000, 0x00000000)<br />= 0x00000000<br />4.7472: thread_setmystate_fast(0x4000000D, 0x00000000,<br />0x00000000, 0x00000000, 0x40000000, 0x103500ED, 0x103500ED, 0x00000000)<br />= 0x00000000<br />4.7477: thread_setmystate_fast(0x4000000C, 0x00000000,<br />0x00000000, 0x00000000, 0x40000000, 0x00000176, 0x00000000, 0x00000000)<br />= 0x00000000<br />4.7481: thread_setmystate_fast(0x4000000D, 0x00000000,<br />0x00000000, 0x00000000, 0x40000000, 0x103500ED, 0x103500ED, 0x00000000)<br />= 0x00000000<br />4.7486: sigprocmask(0, 0xF08C77A8, 0x20366CEC) = 0<br />4.7489: thread_setmystate(0x203665F8, 0x00000000) = 0<br />4.7492: thread_tsleep(0, 0x20009100, 0x00000000, 0x00000000) = 0<br />5.7808: mprotect(0x216C8000, 4096, 3) = 0<br />5.7813: yield() =<br />...<br /></span><br />The bold part of the truss output shows that, of the requested 2302 additional bytes (remember the TGS-REQ is 2306 bytes), only 532 bytes were read because of fragmentation. After that, NAS doesn't even attempt to read the remaining fragments. It simply gives up without even properly closing the socket, keeping the connections on the server in the TCP state CLOSE_WAIT.<br /><br /><strong><em>Resolution:</em></strong><br />NAS L3 support states that they will provide a fix for this issue, which will be incorporated in the next version of NAS (1.4.0.9).Miguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.com0tag:blogger.com,1999:blog-1361212388472874685.post-54329643034949578672009-03-29T19:13:00.015+02:002009-03-30T10:46:09.466+02:00Identifying memory leaks in AIXDynamic memory allocation happens at run time rather than at the creation of the process. While giving more flexibility to the programmer, it also requires a lot more housekeeping.
In large programs, memory leaks are a very common issue with very unpleasant consequences. Side effects of poor dynamic memory allocation management can include:<br />- malloc returns with errno set to ENOMEM<br />- process working segment is growing over time (detected with either ps gv or svmon)<br />- core dump of the process which has malloc in the stack trace<br /><br />While a growing process working segment is an indication of a memory leak, it doesn't necessarily mean there is one. It might be perfectly normal for a process to allocate additional memory during its lifetime. However, at some point in time, dynamic memory has to be freed by the application. The following graphs show both a normal and abnormal memory evolution of a process (X-axis -> time, Y-axis -> allocated memory).<br /><br /><p><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiedzhmaDZNNxnYsGvxhqo8SkFnKi3VES1adozw3UStMNS9crj_DCzv_B6V5-HaqhQatmreCNXkf1PP9kaVE5OPJU7EnrgKOJYhLLmnahsuMm8Z28tPc2n9-v_NCjx7TlF0dqNKqk2vV2k/s1600-h/memok.jpg"><img id="BLOGGER_PHOTO_ID_5318659893944510658" style="FLOAT: left; MARGIN: 0px 10px 10px 0px; WIDTH: 200px; CURSOR: hand; HEIGHT: 103px" alt="" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiedzhmaDZNNxnYsGvxhqo8SkFnKi3VES1adozw3UStMNS9crj_DCzv_B6V5-HaqhQatmreCNXkf1PP9kaVE5OPJU7EnrgKOJYhLLmnahsuMm8Z28tPc2n9-v_NCjx7TlF0dqNKqk2vV2k/s200/memok.jpg" border="0" /></a><img id="BLOGGER_PHOTO_ID_5318659591143855330" style="DISPLAY: block; MARGIN: 0px auto 10px; WIDTH: 200px; CURSOR: hand; HEIGHT: 104px; TEXT-ALIGN: center" alt="" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7yGP_VUYQlk4ut3ltfhn1Uu0N2YjQpj25RxIv_1J1V6GjlDBVOHtDp7msQeUgAD3BlXE7Z1XgGTjoYncfDekNVVlbrUGUY4yq8qaTe4M4LDQDB3tBt6urgX00oilMhTheT2GeQVPsErY/s200/memnok.jpg" border="0" /> </p><p>Since dynamic memory gets allocated on the process heap, it is automatically cleaned up when the process terminates.
This also means that a memory leak isn't that harmful for a process with a short lifetime. However, for daemons it is potentially much more harmful! </p><p>How can memory leaks be tracked down? There are some commercial products available that examine the memory allocation of processes (e.g. IBM Rational PurifyPlus), but AIX provides a subsystem capable of determining this out of the box. Let's consider the following C code, which has a few memory leaks.</p><p><span style="font-size:85%;"><strong>#include <stdio.h><br />#include <stdlib.h><br />void routineA(){<br /> char *test=malloc(4);<br /> fprintf(stdout,"routineA\n");<br /> fprintf(stdout,"pointer residing at address %p\n",&test);<br /> fprintf(stdout,"value of pointer %p\n",test);<br />}<br />void routineB(){<br /> char *test=malloc(4);<br /> fprintf(stdout,"routineB\n");<br /> fprintf(stdout,"pointer residing at address %p\n",&test);<br /> fprintf(stdout,"value of pointer %p\n",test);<br /> free(test);<br />}<br />void routineC(){<br /> char *test=malloc(8);<br /> fprintf(stdout,"routineC\n");<br /> fprintf(stdout,"pointer residing at address %p\n",&test);<br /> fprintf(stdout,"value of pointer %p\n",test);<br />}<br />int main(){<br /> char *test=malloc(4);<br /> fprintf(stdout,"main\n");<br /> fprintf(stdout,"pointer residing at address %p\n",&test);<br /> fprintf(stdout,"value of pointer %p\n",test);<br /> routineA();<br /> routineB();<br /> routineC();<br />}</strong></span><br /><br />Here we can clearly see that the memory allocations in main, routineA and routineC don't get freed. Using the malloc debug subsystem, we are also made aware of this.
Moreover, even if we don't have the source, the malloc debug subsystem will give the stack trace.<strong><span style="font-size:85%;"><br />#export MALLOCTYPE=debug<br />#export MALLOCDEBUG=report_allocations<br />#./memleak<br />main<br />pointer residing at address 2ff22b10<br />value of pointer 2000eff8<br />routineA<br />pointer residing at address 2ff22ac0<br />value of pointer 20010ff8<br />routineB<br />pointer residing at address 2ff22ac0<br />value of pointer 20012ff8<br />routineC<br />pointer residing at address 2ff22ac0<br />value of pointer 20012ff8<br />Current allocation report:<br /><br /> Allocation #0: 0x2000EFF8<br /> Allocation size: 0x4<br /> Allocation traceback:<br /> 0xD03EA170 malloc<br /> 0xD036C260 init_malloc<br /> 0xD036D434 malloc<br /> 0x10000540 main<br /><br /> Allocation #1: 0x20010FF8<br /> Allocation size: 0x4<br /> Allocation traceback:<br /> 0xD03EA170 malloc<br /> 0x10000360 routineA<br /> 0x1000058C main<br /> 0x100001C4 __start<br /><br /> Allocation #2: 0x20012FF8<br /> Allocation size: 0x8<br /> Allocation traceback:<br /> 0xD03EA170 malloc<br /> 0x100003FC routineC<br /> 0x10000594 main<br /> 0x100001C4 __start<br /><br /> Total allocations: 3.<br /></span></strong><br />The malloc debug subsystem states there are three memory leaks in the program. The first one is in the main routine (at address 0x2000EFF8, size 4 bytes), the second one is in routineA (at address 0x20010FF8, size 4 bytes) and the last one is located in routineC (at address 0x20012FF8, size 8 bytes).<br /><br />Whenever you wish to open a PMR for a memory leak, be sure to add the malloc trace as well.
If it's not a known issue, you will be redirected to L3 support quite fast :)<br /></p>Miguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.com0tag:blogger.com,1999:blog-1361212388472874685.post-67776137789942194942009-03-28T20:36:00.020+01:002009-12-08T17:02:59.432+01:00Memory leak KRB5A & libkrb5.a<strong><em>Introduction:</em></strong><br />Currently, there is a memory leak in both the KRB5A load module and the libkrb5.a library in AIX. The KRB5A load module is shipped with AIX, whereas the libkrb5.a library is shipped with the krb5.client.rte fileset of NAS (Network Authentication Service), which is IBM's version of Kerberos.<br /><br /><strong><em>Impacted:</em></strong><br />- All AIX versions to date<br /><a href="http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/53" target="_blank">http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/53</a><br /><a href="http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/61" target="_blank">http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/61</a><br />- IBM Network Authentication Service <= 1.4.0.8<br /><br /><strong><em>Details:</em></strong><br /><span style="font-size:85%;"><strong># lsuser -a SYSTEM sidsmig<br />sidsmig SYSTEM=KRB5A<br /># cat /usr/lib/security/methods.cfg | grep -ip KRB5A<br />KRB5A:<br /> program = /usr/lib/security/KRB5A<br /> program_64 = /usr/lib/security/KRB5A_64<br /> options = authonly</strong></span><br /><br />The following C test program was used in PMR 69409.300.624:<br /><br /><span style="font-size:85%;"><strong>#include <stdio.h><br />#include <stdlib.h><br />#include <usersec.h><br /><br />int main(int argc,char** argv){<br /> while(1){<br /> int reenter;<br /> char* msg = NULL;<br /> authenticate(argv[1], argv[2], &reenter, &msg);<br /> if(msg) {<br /> free(msg);<br /> break;<br /> }<br /> }<br />}</strong></span><br /><br />An increasing process working segment can be observed with either ps gv or svmon -P while user sidsmig authenticates to 
the system.<br /><br /><strong><em>Resolution:</em></strong><br />- <a href="http://www-01.ibm.com/support/docview.wss?uid=isg1IZ43820" target="_blank">APAR IZ43820</a> was created to address this issue.Miguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.com0tag:blogger.com,1999:blog-1361212388472874685.post-80328647736991529062009-03-27T14:53:00.001+01:002009-03-31T14:11:58.523+02:00First postThere always has to be a first post, wouldn't you think? Well, this is it!<br />I've never used a blog before, so I have no clue how things will work out in the end. I mainly started this blog to record everything I am currently working on and to archive the things I have worked on professionally, so that other people can benefit from it. You'll mainly see AIX (IBM's version of UNIX) related things here.<br /><br />See you around!<br /><br />MiguelMiguelhttp://www.blogger.com/profile/08194328849636582330noreply@blogger.com0