maandag 30 maart 2009
TCP issue in IBM NAS 1.4.0.8
Ever since IBM NAS (Network Authentication Service) 1.4.0.8 with TCP support (RFC1510 compliant) was released, an issue was found in the TCP reception of fragmented payloads. As a result, TCP connections will never be closed properly (they remain in the TCP state CLOSE_WAIT) and pose an mbuf depletion threat.
Impacted:
- IBM Network Authentication Service 1.4.0.8
Details:
NAS server: MTU 1500 bytes, IP x.y.z.u
NAS client: MTU 576 bytes, IP a.b.c.d
When the NAS client is in a LAN segment with MTU 576 bytes, the TCP issue on the server occurs. Once the client gets a cross realm ticket from an Active Directory domain controller, a service ticket is requested from the NAS server. The following tcpdump trace shows the TGS exchange:
(1)09:40:54.892621 IP a.b.c.d.1250 > x.y.z.u.88: S 1586082305:1586082305(0) win 64512 <mss 536,nop,nop,sackOK>
(2)09:40:54.892816 IP x.y.z.u.88 > a.b.c.d.1250: S 3658439259:3658439259(0) ack 1586082306 win 65535 <mss 1460>
(3)09:40:54.893145 IP a.b.c.d.1250 > x.y.z.u.88: . ack 1 win 64856
(4)09:40:54.893338 IP a.b.c.d.1250 > x.y.z.u.88: . 1:537(536) ack 1 win 64856
(5)09:40:54.893471 IP a.b.c.d.1250 > x.y.z.u.88: . 537:1073(536) ack 1 win 64856
(6)09:40:54.893743 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 1073 win 65535
(7)09:40:54.894292 IP a.b.c.d.1250 > x.y.z.u.88: . 1073:1609(536) ack 1 win 64856
(8)09:40:54.894310 IP a.b.c.d.1250 > x.y.z.u.88: . 1609:2145(536) ack 1 win 64856
(9)09:40:54.894320 IP a.b.c.d.1250 > x.y.z.u.88: P 2145:2307(162) ack 1 win 64856
(10)09:40:55.070688 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 2307 win 65535
(11)09:40:59.878565 IP a.b.c.d.1250 > x.y.z.u.88: . 2307:2843(536) ack 1 win 64856
(12)09:40:59.878649 IP a.b.c.d.1250 > x.y.z.u.88: . 2843:3379(536) ack 1 win 64856
(13)09:40:59.878658 IP a.b.c.d.1250 > x.y.z.u.88: . 3379:3915(536) ack 1 win 64856
(14)09:40:59.878720 IP a.b.c.d.1250 > x.y.z.u.88: . 3915:4451(536) ack 1 win 64856
(15)09:40:59.884118 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 4451 win 65535
(16)09:40:59.884567 IP a.b.c.d.1250 > x.y.z.u.88: P 4451:4613(162) ack 1 win 64856
(17)09:41:00.084446 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 4613 win 65535
(18)09:41:04.878515 IP a.b.c.d.1250 > x.y.z.u.88: F 4613:4613(0) ack 1 win 64856
(19)09:41:04.878592 IP x.y.z.u.88 > a.b.c.d.1250 : . ack 4614 win 65535
(1)First step in the TCP handshake in which the NAS client sends a SYN packet with TCP sequence number 1586082305, a TCP window size of 64512 bytes and a maximum TCP payload size (MSS) of 536 bytes.
(2)The NAS server replies by acknowledging the SYN packet from the NAS client and sending his own SYN packet with TCP sequence number 3658439259, a TCP window size of 65536 bytes and a maximum TCP payload size (MSS) of 1460 bytes.
(3)The NAS client acknowledges the SYN packet of the NAS server. The connection is now in the TCP state ESTABLISHED on both sides. The maximum TCP payload size (MSS) will be 536 bytes.
(4)The NAS client wants to send his TGS-REQ packet, but it has a total TCP payload of 2306 bytes. The large size of this payload can explained by the inclusion of the PAC in the user's TGT. Due to the large payload, TCP fragmentation needs to be done. Since the agreed MSS size is 536 bytes, 5 fragments need to be transmitted.
(5)The NAS client sends the second fragment
(6)The NAS server acknowledges the first two fragments.
(7-8-9) The NAS client send the remaining three fragments.
(10)The NAS server acknowledges the reception and reassembly of the remaining fragments. Normally, the NAS server should start sending the TGS-REP now but refuses to do so.
(11)After a 5 second timeout, the NAS client hasn't received the TGS-REP from the NAS server and starts retransmitting the first fragment of the TGS-REQ.
(12-13-14)The NAS client retransmits fragments #2,#3 and #4.
(15)The NAS server acknowledges the reception of the first 4 fragments.
(16)The NAS client sends his final fragment.
(17)The NAS server acknowledges the reception and reassembly of the remaining fragments. Once again, the NAS server doesn't start sending the TGS-REP.
(18)After an additional 5 second wait interval, the NAS client gives up and performs an active close on his end by sending a FIN packet to the NAS server. The NAS client is now in the TCP state FIN_WAIT_1.
(19)The NAS server acknowledges the FIN of the NAS client. The NAS server is now in the TCP state CLOSE_WAIT and the TCP client is now in the TCP state FIN_WAIT_2. Normally, the NAS server should now send a FIN packet to the NAS client for closing the TCP connection, but refuses to do so.
As a result, netstat on the NAS server shows TCP connections stuck in the TCP state CLOSE_WAIT. As long as the NAS server is active, those TCP connections will never be freed and pose a potential mbuf depletion threat.
After further investigation, the following truss output of the NAS server revealed the problem.
0.0000: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) (sleeping...)
0.0000: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) = 0
0.7132: yield() =
0.7136: thread_waitact(400) = 1
1.7665: naccept(75, 0x2FF21938, 0x2FF2198C) = 99
1.7669: ngetsockname(99, 0x2FF21998, 0x2FF21990) = 0
1.7673: kfcntl(99, F_GETFL, 0x00000000) = 6
1.7680: kfcntl(99, F_SETFL, 0x00000006) = 0
1.7684: kioctl(99, -2147195266, 0x10038260, 0x00000000) = 0
1.7688: setsockopt(99, 65535, 128, 0x10038268, 8) = 0
1.7691: __libc_sbrk(0x00000000) = 0x215E9520
1.7697: thread_setmystate(0x00000000, 0x2FF210B0) = 0
1.7700: mprotect(0x216C8000, 4096, 0) = 0
1.7704: thread_twakeup(3473645, 268435456) = 0
1.7707: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) = 268435456
= 1
1.7715: thread_setmystate(0x216E13D0, 0x216E16D8) = 0
1.7720: yield() =
1.7724: thread_waitact(400) = 1
1.7727: yield() =
3.7745: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) (sleeping...)
3.7745: _select(80, 0x2FF21A38, 0x00000000, 0x00000000,
0x00000000) = 1
kread(99, "\0\0\b ?", 4) = 4
4.7437: _select(100, 0x216DFAC8, 0x216E0AC8, 0x00000000,
0x00000000) = 1
kread(99, " l82\b 082\b ? ?030201".., 2302) = 532
4.7464: kthread_ctl(2, 0x00000000) = 0
4.7467: thread_setmystate_fast(0x4000000C, 0x00000000,
0x00000000, 0x00000000, 0x40000000, 0x00000158, 0x00000000, 0x00000000)
= 0x00000000
4.7472: thread_setmystate_fast(0x4000000D, 0x00000000,
0x00000000, 0x00000000, 0x40000000, 0x103500ED, 0x103500ED, 0x00000000)
= 0x00000000
4.7477: thread_setmystate_fast(0x4000000C, 0x00000000,
0x00000000, 0x00000000, 0x40000000, 0x00000176, 0x00000000, 0x00000000)
= 0x00000000
4.7481: thread_setmystate_fast(0x4000000D, 0x00000000,
0x00000000, 0x00000000, 0x40000000, 0x103500ED, 0x103500ED, 0x00000000)
= 0x00000000
4.7486: sigprocmask(0, 0xF08C77A8, 0x20366CEC) = 0
4.7489: thread_setmystate(0x203665F8, 0x00000000) = 0
4.7492: thread_tsleep(0, 0x20009100, 0x00000000, 0x00000000) = 0
5.7808: mprotect(0x216C8000, 4096, 3) = 0
5.7813: yield() =
...
The bold part of the truss output shows that of the requested 2302 additional bytes (remember the TGS-REQ is 2306 bytes) only 532 bytes were read because of fragmentation. After that, NAS doesn't even attempt to read the remaining fragments. It just freaks out and doesn't even proper close the socket, keeping the connections on the server in the TCP state CLOSE_WAIT.
Resolution:
NAS L3 support states that they will provide a fix for this issue, which will be incorporated in the next version of NAS (1.4.0.9).
zondag 29 maart 2009
Identifying memory leaks in AIX
- malloc returns with errno set to ENOMEM
- process working segment is growing over time (detected with either ps gv or svmon)
- core dump of the process which has malloc in the stack trace
While an growing process working segment is an indication of a memory leak, it doesn't necessarily mean there is one. It might be perfectly normal for a process to allocate additional memory during its lifetime. However, at some point in time, dynamic memory has to be freed by the application. The following graphs show both a normal and abnormal memory evolution of a process (X-axis -> time , Y-axis -> allocated memory)
Since dynamic memory gets allocated on the process heap, it is automatically cleaned up when the process terminates. This also means that a memory leak isn't that harmful for a process with a short lifetime. However for daemons it is potentially more harmful!
How can memory leaks be tracked down? There are some commercial products available that examine of the memory allocation of processes (f.e. IBM Rational PurifyPlus) but AIX provides a subsystem capable of determining this out of the box. Let's consider the following C code which has a few memory leaks.
#include <stdio.h>
void routineA(){
char *test=malloc(4);
fprintf(stdout,"routineA\n");
fprintf(stdout,"pointer residing at address %p\n",&test);
fprintf(stdout,"value of pointer %p\n",test);
}
void routineB(){
char *test=malloc(4);
fprintf(stdout,"routineB\n");
fprintf(stdout,"pointer residing at address %p\n",&test);
fprintf(stdout,"value of pointer %p\n",test);
free(test);
}
void routineC(){
char *test=malloc(8);
fprintf(stdout,"routineC\n");
fprintf(stdout,"pointer residing at address %p\n",&test);
fprintf(stdout,"value of pointer %p\n",test);
}
int main(){
char *test=malloc(4);
fprintf(stdout,"main\n");
fprintf(stdout,"pointer residing at address %p\n",&test);
fprintf(stdout,"value of pointer %p\n",test);
routineA();
routineB();
routineC();
}
Here we can clearly see that the memory allocations in main,routineA and routineC don't get freed. Using the malloc debug subsystem, we are also made aware of this. Moreover, even if we don't have the source, the malloc debug subsystem will give the stack trace.
#export MALLOCTYPE=debug
#export MALLOCDEBUG=report_allocations
#./memleak
main
pointer residing at address 2ff22b10
value of pointer 2000eff8
routineA
pointer residing at address 2ff22ac0
value of pointer 20010ff8
routineB
pointer residing at address 2ff22ac0
value of pointer 20012ff8
routineC
pointer residing at address 2ff22ac0
value of pointer 20012ff8
Current allocation report:
Allocation #0: 0x2000EFF8
Allocation size: 0x4
Allocation traceback:
0xD03EA170 malloc
0xD036C260 init_malloc
0xD036D434 malloc
0x10000540 main
Allocation #1: 0x20010FF8
Allocation size: 0x4
Allocation traceback:
0xD03EA170 malloc
0x10000360 routineA
0x1000058C main
0x100001C4 __start
Allocation #2: 0x20012FF8
Allocation size: 0x8
Allocation traceback:
0xD03EA170 malloc
0x100003FC routineC
0x10000594 main
0x100001C4 __start
Total allocations: 3.
The malloc debug subsystem states there are three memory leaks in the program. The first one is the main routine (at address 0x2000EFF8, size 4 bytes), the second one is in routineA (at address 0x20010FF8, size 4 bytes) and the last one is located in routineC (at address 0x20012FF8, size 8 bytes).
Whenever you wish to open a PMR for a memory leak, be sure to add the malloc trace aswell. If it's not a known issue, you will be redirected to L3 support quite fast :)
zaterdag 28 maart 2009
Memory leak KRB5A & libkrb5.a
Currently there is a memory leak in both the KRB5A load module and the libkrb5 library in AIX. The KRB5A load module is shipped with AIX whereas the libkrb5.a library is shipped with the krb5.client.rte fileset in NAS (Network Authentication Service), which is IBM's version of Kerberos
Impacted:
- All AIX versions up till now
http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/53
http://www-933.ibm.com/eserver/support/fixes/fixcentral/pfixpacks/61
- IBM Network Authentication Service <= 1.4.0.8
Details:
# lsuser -a SYSTEM sidsmig
sidsmig SYSTEM=KRB5A
# cat /usr/lib/security/methods.cfg | grep -ip KRB5A
KRB5A:
program = /usr/lib/security/KRB5A
program_64 = /usr/lib/security/KRB5A_64
options = authonly
The following C test program was used in PMR 69409.300.624
#include <stdio.h>
#include <usersec.h>
int main(int argc,char** argv){
while(1){
int reenter;
char* msg;
authenticate(argv[1], argv[2], &reenter, &msg);
if(msg) {
free(msg)
break;
}
}
}
An increasing process working segment could be noticed with either ps gv or svmon -P when user sidsmig authenticates to the system.
Resolution:
- APAR IZ43820 was created to address this issue.
vrijdag 27 maart 2009
First post
I've never been using a blog before so I have no clue how things will work out in the end. I mainly chose to start this blog in order to memorize everything I am currently working on and to archive the things I worked on (both from a professional point of view) so that other people can benefit from it. You'll mainly see AIX (IBM's version of UNIX) related things here.
See you around!
Miguel