Monday, March 30, 2009

TCP issue in IBM NAS 1.4.0.8

Introduction:
Ever since IBM NAS (Network Authentication Service) 1.4.0.8 was released with TCP support (RFC 1510 compliant), it has had an issue in the TCP reception of fragmented payloads, i.e. requests that arrive split across multiple TCP segments. As a result, the affected TCP connections are never closed properly (they remain in the TCP state CLOSE_WAIT) and pose an mbuf depletion threat.

Impacted:
- IBM Network Authentication Service 1.4.0.8

Details:
NAS server: MTU 1500 bytes, IP x.y.z.u
NAS client: MTU 576 bytes, IP a.b.c.d

The TCP issue on the server occurs when the NAS client sits in a LAN segment with an MTU of 576 bytes. Once the client has obtained a cross-realm ticket from an Active Directory domain controller, it requests a service ticket from the NAS server. The following tcpdump trace shows the TGS exchange:

(1)09:40:54.892621 IP a.b.c.d.1250 > x.y.z.u.88: S 1586082305:1586082305(0) win 64512 <mss 536,nop,nop,sackOK>
(2)09:40:54.892816 IP x.y.z.u.88 > a.b.c.d.1250: S 3658439259:3658439259(0) ack 1586082306 win 65535 <mss 1460>
(3)09:40:54.893145 IP a.b.c.d.1250 > x.y.z.u.88: . ack 1 win 64856
(4)09:40:54.893338 IP a.b.c.d.1250 > x.y.z.u.88: . 1:537(536) ack 1 win 64856
(5)09:40:54.893471 IP a.b.c.d.1250 > x.y.z.u.88: . 537:1073(536) ack 1 win 64856
(6)09:40:54.893743 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 1073 win 65535
(7)09:40:54.894292 IP a.b.c.d.1250 > x.y.z.u.88: . 1073:1609(536) ack 1 win 64856
(8)09:40:54.894310 IP a.b.c.d.1250 > x.y.z.u.88: . 1609:2145(536) ack 1 win 64856
(9)09:40:54.894320 IP a.b.c.d.1250 > x.y.z.u.88: P 2145:2307(162) ack 1 win 64856
(10)09:40:55.070688 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 2307 win 65535
(11)09:40:59.878565 IP a.b.c.d.1250 > x.y.z.u.88: . 2307:2843(536) ack 1 win 64856
(12)09:40:59.878649 IP a.b.c.d.1250 > x.y.z.u.88: . 2843:3379(536) ack 1 win 64856
(13)09:40:59.878658 IP a.b.c.d.1250 > x.y.z.u.88: . 3379:3915(536) ack 1 win 64856
(14)09:40:59.878720 IP a.b.c.d.1250 > x.y.z.u.88: . 3915:4451(536) ack 1 win 64856
(15)09:40:59.884118 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 4451 win 65535
(16)09:40:59.884567 IP a.b.c.d.1250 > x.y.z.u.88: P 4451:4613(162) ack 1 win 64856
(17)09:41:00.084446 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 4613 win 65535
(18)09:41:04.878515 IP a.b.c.d.1250 > x.y.z.u.88: F 4613:4613(0) ack 1 win 64856
(19)09:41:04.878592 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 4614 win 65535

(1) First step of the TCP handshake: the NAS client sends a SYN packet with TCP sequence number 1586082305, a TCP window size of 64512 bytes and a maximum TCP payload size (MSS) of 536 bytes.
(2) The NAS server replies by acknowledging the SYN packet from the NAS client and sending its own SYN packet with TCP sequence number 3658439259, a TCP window size of 65535 bytes and a maximum TCP payload size (MSS) of 1460 bytes.
(3) The NAS client acknowledges the SYN packet of the NAS server. The connection is now in the TCP state ESTABLISHED on both sides. The effective MSS will be 536 bytes, the lower of the two advertised values.
(4) The NAS client starts sending its TGS-REQ, which has a total TCP payload of 2306 bytes. The large size of this payload can be explained by the inclusion of the PAC in the user's TGT. Since the agreed MSS is 536 bytes, the request has to be split into 5 fragments (the segment arithmetic is worked out in the sketch after this list). This packet carries the first fragment.
(5) The NAS client sends the second fragment.
(6) The NAS server acknowledges the first two fragments.
(7-8-9) The NAS client sends the remaining three fragments.
(10) The NAS server acknowledges the reception of the remaining fragments. Normally, the NAS server should start sending the TGS-REP now, but it never does.
(11) After a 5 second timeout, the NAS client still hasn't received the TGS-REP from the NAS server and starts retransmitting the first fragment of the TGS-REQ. Note that this is an application-level retry: the data is sent in new TCP sequence space, since the original bytes were already acknowledged at the TCP level.
(12-13-14) The NAS client retransmits fragments #2, #3 and #4.
(15) The NAS server acknowledges the reception of the first four retransmitted fragments.
(16) The NAS client sends the final fragment of the retry.
(17) The NAS server acknowledges the reception of the final fragment. Once again, the NAS server doesn't start sending the TGS-REP.
(18) After an additional 5 second wait interval, the NAS client gives up and performs an active close on its end by sending a FIN packet to the NAS server. The NAS client is now in the TCP state FIN_WAIT_1.
(19) The NAS server acknowledges the FIN of the NAS client. The NAS server is now in the TCP state CLOSE_WAIT and the NAS client is now in the TCP state FIN_WAIT_2. Normally, the NAS server should now send its own FIN to close the TCP connection, but it never does.
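
To make the fragment boundaries in the trace easier to verify, here is a small sketch (not NAS code; the values from this trace are hard-coded) that reproduces the relative sequence ranges tcpdump shows for the data packets (4), (5) and (7) through (9):

#include <stdio.h>

int main(void)
{
    const int payload = 2306;   /* total TCP payload of the TGS-REQ exchange */
    const int mss     = 536;    /* MSS agreed on during the handshake */

    /* Print the relative sequence ranges the way tcpdump shows them. */
    for (int off = 0; off < payload; off += mss) {
        int len = (payload - off < mss) ? payload - off : mss;
        printf("%d:%d(%d)\n", off + 1, off + 1 + len, len);
    }
    return 0;
}
/* Prints 1:537(536), 537:1073(536), 1073:1609(536),
   1609:2145(536) and 2145:2307(162). */

The 2306-byte payload is the 4-byte Kerberos-over-TCP length prefix plus the 2302-byte TGS-REQ itself, which matches the 4-byte and 2302-byte kread() calls in the truss output further down.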


As a result, netstat on the NAS server shows TCP connections stuck in the TCP state CLOSE_WAIT. A socket only leaves CLOSE_WAIT when the owning application closes it, so as long as the NAS server process keeps running, those TCP connections will never be freed and pose a potential mbuf depletion threat.

After further investigation, the following truss output of the NAS server revealed the problem.
0.0000: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) (sleeping...)
0.0000: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) = 0
0.7132: yield() =
0.7136: thread_waitact(400) = 1
1.7665: naccept(75, 0x2FF21938, 0x2FF2198C) = 99
1.7669: ngetsockname(99, 0x2FF21998, 0x2FF21990) = 0
1.7673: kfcntl(99, F_GETFL, 0x00000000) = 6
1.7680: kfcntl(99, F_SETFL, 0x00000006) = 0
1.7684: kioctl(99, -2147195266, 0x10038260, 0x00000000) = 0
1.7688: setsockopt(99, 65535, 128, 0x10038268, 8) = 0
1.7691: __libc_sbrk(0x00000000) = 0x215E9520
1.7697: thread_setmystate(0x00000000, 0x2FF210B0) = 0
1.7700: mprotect(0x216C8000, 4096, 0) = 0
1.7704: thread_twakeup(3473645, 268435456) = 0
1.7707: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) = 268435456
= 1
1.7715: thread_setmystate(0x216E13D0, 0x216E16D8) = 0
1.7720: yield() =
1.7724: thread_waitact(400) = 1
1.7727: yield() =
3.7745: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) (sleeping...)
3.7745: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) = 1
kread(99, "\0\0\b ?", 4) = 4
4.7437: _select(100, 0x216DFAC8, 0x216E0AC8, 0x00000000,
0x00000000) = 1
kread(99, " l82\b 082\b ? ?030201".., 2302) = 532
4.7464: kthread_ctl(2, 0x00000000) = 0
4.7467: thread_setmystate_fast(0x4000000C, 0x00000000,
0x00000000, 0x00000000, 0x40000000, 0x00000158, 0x00000000, 0x00000000)
= 0x00000000
4.7472: thread_setmystate_fast(0x4000000D, 0x00000000,
0x00000000, 0x00000000, 0x40000000, 0x103500ED, 0x103500ED, 0x00000000)
= 0x00000000
4.7477: thread_setmystate_fast(0x4000000C, 0x00000000,
0x00000000, 0x00000000, 0x40000000, 0x00000176, 0x00000000, 0x00000000)
= 0x00000000
4.7481: thread_setmystate_fast(0x4000000D, 0x00000000,
0x00000000, 0x00000000, 0x40000000, 0x103500ED, 0x103500ED, 0x00000000)
= 0x00000000
4.7486: sigprocmask(0, 0xF08C77A8, 0x20366CEC) = 0
4.7489: thread_setmystate(0x203665F8, 0x00000000) = 0
4.7492: thread_tsleep(0, 0x20009100, 0x00000000, 0x00000000) = 0
5.7808: mprotect(0x216C8000, 4096, 3) = 0
5.7813: yield() =
...

The kread() calls in the truss output show the problem: after reading the 4-byte length prefix, NAS asks for the remaining 2302 bytes of the TGS-REQ (remember the total TCP payload is 2306 bytes), but the read returns only 532 bytes, the rest of the first 536-byte fragment. After that, NAS doesn't even attempt to read the remaining fragments. It just freaks out and doesn't even properly close the socket, keeping the connections on the server in the TCP state CLOSE_WAIT.
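
For comparison, a receiver that copes with this situation has to loop until the full length-prefixed message has arrived, because a single read() may legitimately return fewer bytes than requested when the peer's segments are small. The following is a minimal sketch of such a loop (the names read_full and recv_krb_msg are illustrative, not NAS internals); on any failure the caller is expected to close() the descriptor so the socket doesn't linger in CLOSE_WAIT:

#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include <arpa/inet.h>   /* ntohl() */

/* Read exactly len bytes from fd, looping over partial reads. */
static ssize_t read_full(int fd, void *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = read(fd, (char *)buf + done, len - done);
        if (n == 0)                /* peer closed the connection */
            return (ssize_t)done;
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal, retry */
            return -1;             /* real error */
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Receive one Kerberos-over-TCP message: a 4-byte big-endian length
   followed by the encoded request (2302 bytes for the TGS-REQ above). */
static unsigned char *recv_krb_msg(int fd, uint32_t *msglen)
{
    uint32_t netlen;
    unsigned char *buf;

    if (read_full(fd, &netlen, 4) != 4)
        return NULL;
    *msglen = ntohl(netlen);

    buf = malloc(*msglen);
    if (buf == NULL)
        return NULL;
    if (read_full(fd, buf, *msglen) != (ssize_t)*msglen) {
        free(buf);                 /* short read: caller must close(fd) */
        return NULL;
    }
    return buf;
}

With a loop like this, the 532-byte partial read seen in the trace would simply be followed by further read() calls until all 2302 bytes are in, and a short read or EOF would lead to an explicit close() instead of a connection parked in CLOSE_WAIT.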

Resolution:
NAS L3 support states that they will provide a fix for this issue, which will be incorporated in the next version of NAS (1.4.0.9).
