Following a recent discussion with IBM L2 support, this post briefly explains the parameters that affect VIO client path failover.
# lsattr -El hdisk0
...
algorithm fail_over
hcheck_interval 60
hcheck_mode nonactive
...
Currently, MPIO on the VIO client only supports failover from one VSCSI client adapter to another (fail_over algorithm). Load balancing over multiple VSCSI client adapters is not supported.
The health check interval for each MPIO disk should be configured so that the path status is updated automatically. Specifying hcheck_mode=nonactive means that health check commands are sent down paths that have no active I/O, including paths with a state of "Failed". The hcheck_interval attribute defines how often the health check is performed. In the client partition, the hcheck_interval of virtual SCSI disks is set to 0 by default, which means health checking is disabled.
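For example, both attributes could be set with chdev as sketched below (hdisk0 is simply the example disk from the lsattr output above; the -P flag only updates the ODM, so the change becomes active at the next reboot since the disk is in use):
# chdev -l hdisk0 -a hcheck_interval=60 -a hcheck_mode=nonactive -P
hdisk0 changed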
# lsattr -El vscsi2
vscsi_err_recov fast_fail
vscsi_path_to 30
vscsi_path_to, when enabled, allows the virtual client adapter driver to determine the health of the VIO server, improving and expediting path failover processing.
A value of 0 (the default) disables it, while any other value defines the number of seconds the VSCSI client adapter will wait for commands issued to the VSCSI server adapter that have not yet been serviced. If that time is exceeded, the VSCSI client adapter retries the commands and waits up to another 60 seconds before failing the outstanding requests. An error is written to the error log and, if MPIO is used, another path to the disk is tried to service the requests. Therefore, this parameter should only be set in MPIO configurations with dual VIO servers.
Similar to the attribute fc_err_recov for real FC adapters, the attribute vscsi_err_recov is used by the VSCSI adapter driver. When this parameter is set to fast_fail, the VIO client adapter sends a FAST_FAIL datagram to the VIO server and subsequently fails the I/O immediately instead of after a delay. This may help to improve MPIO failover.
vscsi_err_recov was added in AIX 5.3 TL9 (APAR IZ28537) and AIX 6.1 TL2 (APAR IZ28554).
It requires VIO server 2.1.
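As a sketch (vscsi2 is the example adapter from the lsattr output above; vscsi3 stands in for the second client adapter of a dual-VIOS setup), the attributes could be set as follows, again with -P to apply the change at the next reboot:
# chdev -l vscsi2 -a vscsi_err_recov=fast_fail -a vscsi_path_to=30 -P
vscsi2 changed
# chdev -l vscsi3 -a vscsi_err_recov=fast_fail -a vscsi_path_to=30 -P
vscsi3 changed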
Tuesday, 24 November 2009
Wednesday, 20 May 2009
OpenSSH : Kerberos user principal name incorrect on AIX
Introduction:
Currently, password-based Kerberos authentication in OpenSSH does not function properly on AIX. Even though AIX can authenticate a user via Kerberos (using the KRB5/KRB5A load module), OpenSSH cannot.
Impacted:
- OpenSSH <= 5.2p1
Details:
This issue is caused by two AIX user attributes, auth_name and auth_domain, which OpenSSH doesn't take into account when forming the user's principal name. If the AIX user myuser has the attributes auth_name=someone and auth_domain=SOMEWHERE, then the Kerberos principal name would be someone@SOMEWHERE instead of myuser@DEFAULTREALM. By employing the auth_domain attribute, requests are sent to the SOMEWHERE realm instead of the default realm DEFAULTREALM, which is listed in the libdefaults section of the krb5.conf configuration file.
The following can be seen in the OpenSSH code (auth-krb5.c on line 88):
problem = krb5_parse_name(authctxt->krb5_ctx,authctxt->pw->pw_name,&authctxt->krb5_user);
Since authctxt->pw->pw_name contains only the user name (without a realm), the default realm is automatically appended, according to the documentation of the krb5_parse_name call. Because this isn't the correct realm (the overriding auth_domain is the correct one), Kerberos authentication will fail. If the auth_domain attribute is not set, the default realm name is used and the problem does not occur.
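To check whether a particular user is affected, the relevant attributes can be listed with lsuser (myuser and the attribute values shown here are illustrative):
# lsuser -a auth_name auth_domain SYSTEM registry myuser
myuser auth_name=someone auth_domain=SOMEWHERE SYSTEM=KRB5files registry=KRB5files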
Resolution:
- Bugzilla item # 1583 was created to address this issue. The item contains a patch to the source which solves the issue.
Wednesday, 6 May 2009
Samba : DFS does not work on AIX
Introduction:
Currently, there is a minor bug in Samba which makes DFS unusable on AIX.
Impacted:
- Samba <= 3.3.4
Details:
The issue is caused by the behaviour of the readlink system call on AIX. If the buffer cannot hold the entire symbolic link target, the ERANGE error is returned. Other UNIX systems and Linux distributions never return an error if the buffer is too small; instead, only part of the symbolic link target is written to the buffer.
In msdfs.c, the character array 'link_target_buf' is defined with size 7 (the size of "msdfs:" + 1). Since a DFS link target is longer than that, the readlink system call on AIX returns ERANGE. To resolve this issue, the array should be of size PATH_MAX (defined in /usr/include/sys/limits.h).
A proposed patch looks like:
--- msdfs.c 2009-05-06 08:36:00.000000000 +0200
+++ msdfs.new.c 2009-05-06 08:36:44.000000000 +0200
@@ -400,11 +400,15 @@
char **pp_link_target,
SMB_STRUCT_STAT *sbufp)
{
SMB_STRUCT_STAT st;
int referral_len = 0;
+#ifdef AIX
+ char link_target_buf[PATH_MAX];
+#else
char link_target_buf[7];
+#endif
size_t bufsize = 0;
char *link_target = NULL;
if (pp_link_target) {
bufsize = 1024;
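For reference, a DFS link on a Samba msdfs root is simply a symbolic link whose target starts with "msdfs:", which is why the target is always longer than the 7-byte buffer. A minimal sketch (share name, path and target server are made up):
[global]
   host msdfs = yes
[dfsroot]
   path = /export/dfsroot
   msdfs root = yes
# cd /export/dfsroot
# ln -s 'msdfs:fileserver\data' datalink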
Resolution:
- Bugzilla item # 6330 was created to address this issue.
Sunday, 3 May 2009
OpenSSH : Server option PrintLastLog does not work on AIX
Introduction:
Currently, the OpenSSH server option "PrintLastLog" does not work on AIX. The last login time is always displayed, regardless of the option.
Impacted:
- OpenSSH <= 5.2p1
Details:
Browsing the source reveals several functions in loginrec.c that handle the processing of the last login info (login_get_lastlog, getlast_entry).
Since AIX does not provide a native lastlog facility, the configure script sets the DISABLE_LASTLOG define. A small code snippet from getlast_entry in loginrec.c shows this:
#if defined(DISABLE_LASTLOG)
/* On some systems we shouldn't even try to obtain last login
* time, e.g. AIX */
return (0);
On the other hand, when issuing the AIX loginsuccess() call (which writes a new login record), the last login record can be retrieved by that very same call.
Looking at port-aix.c, the following can be seen:
if (loginsuccess((char *)user, (char *)host, (char *)ttynm, &msg) == 0) {
success = 1;
if (msg != NULL && loginmsg != NULL && !msg_done) {
debug("AIX/loginsuccess: msg %s", msg);
buffer_append(loginmsg, msg, strlen(msg));
xfree(msg);
msg_done = 1;
}
}
Pointer "msg" points to the new last login info for the user and it always appended to the loginmsg buffer. The buffer_append call should only be called if options.print_lastlog is set.
Resolution:
- Bugzilla item # 1595 was created to address this issue. The item contains patches to the source which solve the issue.
Monday, 20 April 2009
EtherChannel : Issue with backup virtual adapter
Introduction:
Currently, there is something very odd going on when using EtherChannel in Network Interface Backup (NIB) mode if the backup adapter is a virtual adapter. PMR 68839.300.624 clarified that, by design, the backup virtual adapter receives traffic even though it is in backup mode. However, this introduces an additional problem: even though the backup virtual adapter receives traffic, it does not reply to it. It is the primary channel that responds, which creates an unbalanced situation on the physical network, resulting in flooding.
Impacted:
- All AIX versions to date
http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/53
http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/61
- POWER5(+) firmware <= SF240_320
Details:

Consider that the ARP tables on client LPAR A and server LPAR B are empty, as well as the MAC table on the Ethernet switch. Client LPAR A wishes to send data to server LPAR B.
LPAR A:
ent1 (Virtual Ethernet) - MAC address 22:f1:30:00:70:06
en1 - IP address 10.226.32.145
LPAR B:
ent3 (EtherChannel in NIB mode (Active/Passive)) - MAC address 00:14:5e:c6:46:80
en3 - IP address 10.226.32.139
Primary Channel: ent2 (Physical Ethernet)
Backup Channel: ent1 (Virtual Ethernet)
VIO:
ent3 (Shared Ethernet Adapter) - MAC address 00:14:5e:48:2c:7a
Physical Ethernet: ent2 - MAC address 00:14:5e:48:2c:7a
Virtual Ethernet: ent1 - MAC address 22:f1:30:00:30:06
Source IP address: 10.226.32.145
Destination IP address: 10.226.32.139
Source MAC address: 22:f1:30:00:70:06
Destination MAC address: unknown
Since client LPAR A does not know the destination MAC address of server LPAR B, client LPAR A broadcasts an ARP request (Who has 10.226.32.139, tell 10.226.32.145) on the internal Layer 2 PHYP switch. Even though the EtherChannel on server LPAR B is in primary channel mode, the PHYP delivers this packet to the backup Virtual Ethernet adapter of the EtherChannel and also delivers the broadcast to the SEA for bridging. As a result, the MAC table on the physical switch is updated with the MAC address of client LPAR A, located on physical port X. Server LPAR B forms a unicast reply but sends it via the primary channel to the Ethernet switch. The Ethernet switch receives the unicast reply on port Y and links the source MAC address of server LPAR B to port Y in its MAC table. Since the frame contains a destination MAC address that has a valid MAC table entry on the physical switch, it is delivered to port X and is ultimately received by client LPAR A through the SEA. Client LPAR A updates its ARP table with the MAC address of server LPAR B.
Now client LPAR A can start communicating with server LPAR B, since it knows the destination MAC address. The PHYP delivers the packets via the backup Virtual Ethernet adapter of the EtherChannel. After the TTL of the MAC table entry for client LPAR A expires, flooding is observed on the physical switch, meaning that the switch acts as a simple repeater for all communication from server LPAR B to client LPAR A, sending it to all trunk ports and access ports defined in the same VLAN. Of course, the frames are also forwarded to port X (it is in the same VLAN) and are ultimately received by client LPAR A through the SEA.
When client LPAR A is sending jumbo frames (data) to server LPAR B, approximately 2 Mbit/s of TCP ACK flooding was observed. It gets really bad when the direction is reversed and server LPAR B is sending data to client LPAR A: all data is then flooded on the switch and only the TCP ACKs are delivered via the backup Virtual Ethernet adapter.
According to IBM, this is working as designed and a DCR was created to address this issue.
Resolution:
- Reduce the ARP table TTL on the LPARs (arpt_killc network tunable, see the example below) OR
- Increase the MAC table TTL on the physical switch OR
- Replace the Virtual Ethernet adapter with a physical Ethernet adapter for the EtherChannel backup channel.
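A rough sketch of the first workaround: lowering arpt_killc (expressed in minutes) from its default to, say, 5 minutes on the affected LPARs, with -p to make the change persistent across reboots. Output is illustrative:
# no -o arpt_killc
arpt_killc = 20
# no -p -o arpt_killc=5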
Wednesday, 1 April 2009
Quorum active or not?
AIX 5.3 TL7 introduces concurrent quorum changes on a volume group. Prior to that version, a quorum change only becomes active after a varyoff/varyon of that specific volume group. This also means that, whenever the ODM value is changed, there is no easy way to know whether quorum is currently active or not, since lsvg displays the values of ODM attributes, not real-time values.
Fortunately, there is a way to figure out whether quorum is active or not. This involves inspecting the running kernel using kdb. The procedure is as follows:
- Determine the major number of the volume group device in /dev and convert it to hexadecimal. For example, rootvg always has a major number of 10 (hexadecimal A), and all its logical volumes have sequential minor numbers starting at 1.
# ls -al /dev/rootvg
crw-rw---- 1 root system 10, 0 Apr 24 2008 /dev/rootvg
- List the device switch table entry for the volume group, based on the hexadecimal major number, and note the effective address of the volgrp structure in memory (dsdptr):
# echo 'devsw 0xA' | kdb
The specified kernel file is a 64-bit kernel
Preserving 1402949 bytes of symbol table
First symbol __mulh
START END
0000000000001000 0000000003DDF050 start+000FD8
F00000002FF47600 F00000002FFDC920 __ublock+000000
000000002FF22FF4 000000002FF22FF8 environ+000000
000000002FF22FF8 000000002FF22FFC errno+000000
F100070F00000000 F100070F10000000 pvproc+000000
F100070F10000000 F100070F18000000 pvthread+000000
PFT:
PVT:
id....................0002
raddr.....000000000A000000 eaddr.....F200800040000000
size..............00080000 align.............00001000
valid..1 ros....0 fixlmb.1 seg....0 wimg...2
(0)> devsw 0xA
Slot address F1000100101AA500
MAJOR: 00A
open: 04165624
close: 04164EC8
read: 04164738
write: 04164638
ioctl: 04162960
strategy: 04180E9C
ttys: 00000000
select: .nodev (00196AE4)
config: 041588F8
print: .nodev (00196AE4)
dump: 04181E68
mpx: .nodev (00196AE4)
revoke: .nodev (00196AE4)
dsdptr: F100010032BA2000
selptr: 00000000
opts: 0000012A DEV_DEFINED DEV_MPSAFE DEV_EXTBUF
- Determine the flags attribute of the volgrp structure. The least significant bit indicates the quorum state (1 means quorum is disabled):
# echo 'volgrp F100010032BA2000' | kdb | grep flags | awk '{print $4}'
00000001
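The steps above can be wrapped in a small script. The following is only a rough sketch: it assumes a 64-bit kernel, the kdb output format shown above (dsdptr and flags in the same field positions), and that it is run as root.
#!/usr/bin/ksh
# vgquorum.sh <vgname> - report the live quorum state of a volume group (sketch)
vg=${1:-rootvg}
# major number of the VG device, e.g. "10," -> 10 -> hex A
major=$(ls -l /dev/$vg | awk '{print $5}' | tr -d ',')
hexmajor=$(echo "obase=16; $major" | bc)
# effective address of the volgrp structure (dsdptr) from the devsw entry
dsdptr=$(echo "devsw 0x$hexmajor" | kdb | awk '/dsdptr:/ {print $2}')
[ -z "$dsdptr" ] && { echo "could not determine dsdptr for $vg"; exit 1; }
# flags attribute of the volgrp structure; last bit set -> quorum disabled
flags=$(echo "volgrp $dsdptr" | kdb | grep flags | awk '{print $4}')
case $flags in
*[13579BDFbdf]) echo "$vg: quorum is disabled (flags=$flags)";;
*)              echo "$vg: quorum is active (flags=$flags)";;
esac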
Monday, 30 March 2009
TCP issue in IBM NAS 1.4.0.8
Introduction:
Ever since IBM NAS (Network Authentication Service) 1.4.0.8 with TCP support (RFC 1510 compliant) was released, an issue has been observed in the TCP reception of fragmented payloads (payloads that span multiple TCP segments). As a result, TCP connections are never closed properly (they remain in the TCP state CLOSE_WAIT) and pose an mbuf depletion threat.
Impacted:
- IBM Network Authentication Service 1.4.0.8
Details:
NAS server: MTU 1500 bytes, IP x.y.z.u
NAS client: MTU 576 bytes, IP a.b.c.d
When the NAS client is in a LAN segment with an MTU of 576 bytes, the TCP issue on the server occurs. Once the client obtains a cross-realm ticket from an Active Directory domain controller, it requests a service ticket from the NAS server. The following tcpdump trace shows the TGS exchange:
(1)09:40:54.892621 IP a.b.c.d.1250 > x.y.z.u.88: S 1586082305:1586082305(0) win 64512 <mss 536,nop,nop,sackOK>
(2)09:40:54.892816 IP x.y.z.u.88 > a.b.c.d.1250: S 3658439259:3658439259(0) ack 1586082306 win 65535 <mss 1460>
(3)09:40:54.893145 IP a.b.c.d.1250 > x.y.z.u.88: . ack 1 win 64856
(4)09:40:54.893338 IP a.b.c.d.1250 > x.y.z.u.88: . 1:537(536) ack 1 win 64856
(5)09:40:54.893471 IP a.b.c.d.1250 > x.y.z.u.88: . 537:1073(536) ack 1 win 64856
(6)09:40:54.893743 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 1073 win 65535
(7)09:40:54.894292 IP a.b.c.d.1250 > x.y.z.u.88: . 1073:1609(536) ack 1 win 64856
(8)09:40:54.894310 IP a.b.c.d.1250 > x.y.z.u.88: . 1609:2145(536) ack 1 win 64856
(9)09:40:54.894320 IP a.b.c.d.1250 > x.y.z.u.88: P 2145:2307(162) ack 1 win 64856
(10)09:40:55.070688 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 2307 win 65535
(11)09:40:59.878565 IP a.b.c.d.1250 > x.y.z.u.88: . 2307:2843(536) ack 1 win 64856
(12)09:40:59.878649 IP a.b.c.d.1250 > x.y.z.u.88: . 2843:3379(536) ack 1 win 64856
(13)09:40:59.878658 IP a.b.c.d.1250 > x.y.z.u.88: . 3379:3915(536) ack 1 win 64856
(14)09:40:59.878720 IP a.b.c.d.1250 > x.y.z.u.88: . 3915:4451(536) ack 1 win 64856
(15)09:40:59.884118 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 4451 win 65535
(16)09:40:59.884567 IP a.b.c.d.1250 > x.y.z.u.88: P 4451:4613(162) ack 1 win 64856
(17)09:41:00.084446 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 4613 win 65535
(18)09:41:04.878515 IP a.b.c.d.1250 > x.y.z.u.88: F 4613:4613(0) ack 1 win 64856
(19)09:41:04.878592 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 4614 win 65535
(1)First step in the TCP handshake in which the NAS client sends a SYN packet with TCP sequence number 1586082305, a TCP window size of 64512 bytes and a maximum TCP payload size (MSS) of 536 bytes.
(2)The NAS server replies by acknowledging the SYN packet from the NAS client and sending his own SYN packet with TCP sequence number 3658439259, a TCP window size of 65535 bytes and a maximum TCP payload size (MSS) of 1460 bytes.
(3)The NAS client acknowledges the SYN packet of the NAS server. The connection is now in the TCP state ESTABLISHED on both sides. The maximum TCP payload size (MSS) will be 536 bytes.
(4)The NAS client wants to send his TGS-REQ packet, but it has a total TCP payload of 2306 bytes. The large size of this payload can be explained by the inclusion of the PAC in the user's TGT. Due to the large payload, TCP fragmentation needs to be done. Since the agreed MSS size is 536 bytes, 5 fragments need to be transmitted.
(5)The NAS client sends the second fragment.
(6)The NAS server acknowledges the first two fragments.
(7-8-9)The NAS client sends the remaining three fragments.
(10)The NAS server acknowledges the reception and reassembly of the remaining fragments. Normally, the NAS server should start sending the TGS-REP now but refuses to do so.
(11)After a 5 second timeout, the NAS client hasn't received the TGS-REP from the NAS server and starts retransmitting the first fragment of the TGS-REQ.
(12-13-14)The NAS client retransmits fragments #2,#3 and #4.
(15)The NAS server acknowledges the reception of the first 4 fragments.
(16)The NAS client sends his final fragment.
(17)The NAS server acknowledges the reception and reassembly of the remaining fragments. Once again, the NAS server doesn't start sending the TGS-REP.
(18)After an additional 5 second wait interval, the NAS client gives up and performs an active close on his end by sending a FIN packet to the NAS server. The NAS client is now in the TCP state FIN_WAIT_1.
(19)The NAS server acknowledges the FIN of the NAS client. The NAS server is now in the TCP state CLOSE_WAIT and the NAS client is now in the TCP state FIN_WAIT_2. Normally, the NAS server should now send a FIN packet to the NAS client to close the TCP connection, but refuses to do so.
As a result, netstat on the NAS server shows TCP connections stuck in the TCP state CLOSE_WAIT. As long as the NAS server is active, those TCP connections will never be freed and pose a potential mbuf depletion threat.
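The stuck connections can be spotted on the NAS server with netstat (output illustrative; port 88 is the KDC port used in the trace above):
# netstat -an | grep CLOSE_WAIT
tcp4       0      0  x.y.z.u.88          a.b.c.d.1250          CLOSE_WAIT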
After further investigation, the following truss output of the NAS server revealed the problem.
0.0000: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) (sleeping...)
0.0000: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) = 0
0.7132: yield() =
0.7136: thread_waitact(400) = 1
1.7665: naccept(75, 0x2FF21938, 0x2FF2198C) = 99
1.7669: ngetsockname(99, 0x2FF21998, 0x2FF21990) = 0
1.7673: kfcntl(99, F_GETFL, 0x00000000) = 6
1.7680: kfcntl(99, F_SETFL, 0x00000006) = 0
1.7684: kioctl(99, -2147195266, 0x10038260, 0x00000000) = 0
1.7688: setsockopt(99, 65535, 128, 0x10038268, 8) = 0
1.7691: __libc_sbrk(0x00000000) = 0x215E9520
1.7697: thread_setmystate(0x00000000, 0x2FF210B0) = 0
1.7700: mprotect(0x216C8000, 4096, 0) = 0
1.7704: thread_twakeup(3473645, 268435456) = 0
1.7707: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) = 268435456
= 1
1.7715: thread_setmystate(0x216E13D0, 0x216E16D8) = 0
1.7720: yield() =
1.7724: thread_waitact(400) = 1
1.7727: yield() =
3.7745: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) (sleeping...)
3.7745: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) = 1
kread(99, "\0\0\b ?", 4) = 4
4.7437: _select(100, 0x216DFAC8, 0x216E0AC8, 0x00000000,
0x00000000) = 1
kread(99, " l82\b 082\b ? ?030201".., 2302) = 532
4.7464: kthread_ctl(2, 0x00000000) = 0
4.7467: thread_setmystate_fast(0x4000000C, 0x00000000,
0x00000000, 0x00000000, 0x40000000, 0x00000158, 0x00000000, 0x00000000)
= 0x00000000
4.7472: thread_setmystate_fast(0x4000000D, 0x00000000,
0x00000000, 0x00000000, 0x40000000, 0x103500ED, 0x103500ED, 0x00000000)
= 0x00000000
4.7477: thread_setmystate_fast(0x4000000C, 0x00000000,
0x00000000, 0x00000000, 0x40000000, 0x00000176, 0x00000000, 0x00000000)
= 0x00000000
4.7481: thread_setmystate_fast(0x4000000D, 0x00000000,
0x00000000, 0x00000000, 0x40000000, 0x103500ED, 0x103500ED, 0x00000000)
= 0x00000000
4.7486: sigprocmask(0, 0xF08C77A8, 0x20366CEC) = 0
4.7489: thread_setmystate(0x203665F8, 0x00000000) = 0
4.7492: thread_tsleep(0, 0x20009100, 0x00000000, 0x00000000) = 0
5.7808: mprotect(0x216C8000, 4096, 3) = 0
5.7813: yield() =
...
The second kread call in the truss output shows that, of the requested 2302 additional bytes (remember that the TGS-REQ is 2306 bytes), only 532 bytes were read because of fragmentation. After that, NAS doesn't even attempt to read the remaining fragments; it simply gives up and doesn't even properly close the socket, keeping the connections on the server in the TCP state CLOSE_WAIT.
Resolution:
NAS L3 support states that they will provide a fix for this issue, which will be incorporated in the next version of NAS (1.4.0.9).