Debugging a crash in a getaddrinfo call

A couple of days ago, production reported a crash. After getting hold of the core file, the backtrace looked like this:
(gdb) bt
#0  0x00007f64a3651aff in raise () from /lib64/libc.so.6
#1  0x00007f64a3624ea5 in abort () from /lib64/libc.so.6
#2  0x00007f64a3694097 in __libc_message () from /lib64/libc.so.6
#3  0x00007f64a369415a in __libc_fatal () from /lib64/libc.so.6
#4  0x00007f64a374fc44 in __netlink_assert_response () from /lib64/libc.so.6
#5  0x00007f64a374c762 in __netlink_request () from /lib64/libc.so.6
#6  0x00007f64a374c901 in getifaddrs_internal () from /lib64/libc.so.6
#7  0x00007f64a374d608 in getifaddrs () from /lib64/libc.so.6
#8  0x00007f64a47ecdd0 in bsd_localinfo (return_result=0x7f649d12a6b8, hints=0x7f649d12a6f0) at su_localinfo.c:1167
#9  su_getlocalinfo (hints=hints@entry=0x7f649d12a7d0, return_localinfo=return_localinfo@entry=0x7f649d12a7c8) at su_localinfo.c:242
#10 0x00007f64a47ca9ea in soa_init_sdp_connection_with_session (ss=ss@entry=0x7f64880603a0, c=0x7f649d12a940, buffer=buffer@entry=0x7f649d12a9a0 "10.10.50.52", sdp=sdp@entry=0x7f649d12a9e0) at soa.c:2326
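......
It looks like something went wrong inside the getifaddrs call. The production system logs were not available. Fortunately, the stack still held some information, so I switched to frame 4 and looked at the disassembly: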
(gdb) f 4
#4  0x00007f64a374fc44 in __netlink_assert_response () from /lib64/libc.so.6
(gdb) disassemble 

Judging from this, some message should have been printed before the crash was triggered. Dump the memory the register points to:
(gdb) x/s $r12
0x7f649d129380:	"Unexpected error 9 on netlink descriptor 19.\n"
That is the message glibc printed: "Unexpected error 9 on netlink descriptor 19.\n". So the error number is 9 (EBADF), and the file descriptor being operated on is 19.
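To double-check the reading of that message, here is a minimal Python sketch that pulls the two numbers out of the string and maps the error number to its symbolic name. The helper name decode_netlink_message is made up for this illustration, and the errno-to-name mapping is the one of the machine running the script (assumed to be Linux, like the crashing host).

```python
import errno
import os
import re

# Pattern for the message text recovered from the core file.
NETLINK_MSG = re.compile(r"Unexpected error (\d+) on netlink descriptor (\d+)")


def decode_netlink_message(text):
    """Return (errno_value, errno_name, description, fd), or None if no match.

    Hypothetical helper, not part of glibc or gdb.
    """
    match = NETLINK_MSG.search(text)
    if match is None:
        return None
    err = int(match.group(1))
    fd = int(match.group(2))
    name = errno.errorcode.get(err, "UNKNOWN")  # symbolic name on this platform
    return err, name, os.strerror(err), fd


if __name__ == "__main__":
    print(decode_netlink_message("Unexpected error 9 on netlink descriptor 19.\n"))
```

The same lookup can be done by hand in errno.h; the script only saves a trip to the headers.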
When in doubt, ask Google. I found this:
https://stackoverflow.com/questions/58827641/getaddrinfo-calls-assert-in-the-program/59615786#59615786
It seems to match this explanation:
This is a file descriptor race in the application. The typical scenario for error 9 (EBADF) looks like this:
- Thread A closes a file descriptor.
- Thread B calls getaddrinfo and opens a Netlink socket. It happens to receive the same descriptor value.
- Due to a bug, thread A closes the same file descriptor again. Normally, that would be benign, but due to the concurrent execution, the Netlink socket created by glibc is closed.
- Thread B attempts to use the Netlink socket descriptor and receives the EBADF error.
The key to fixing such bugs is figuring out where exactly the double-close happens.
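The sequence in that answer can be replayed deterministically in a single thread, since the kernel hands out the lowest free descriptor number. The following Python sketch is only an illustration of the mechanism under that assumption: the "threads" are simulated one after the other, a plain UDP socket stands in for glibc's Netlink socket, and double_close_demo is a made-up name.

```python
import errno
import os
import socket


def double_close_demo():
    """Replay the race sequentially; return (reused, errno_seen_by_b)."""
    # "Thread A" owns a descriptor and closes it once, legitimately.
    fd_a = os.open(os.devnull, os.O_RDONLY)
    os.close(fd_a)

    # "Thread B" opens a socket. The lowest free descriptor number is
    # normally the one A just released, so B receives the same value.
    sock_b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    reused = sock_b.fileno() == fd_a
    if not reused:
        # Something else took the number first; nothing to demonstrate.
        sock_b.close()
        return False, None

    # The bug: A closes its stale descriptor value a second time,
    # which silently closes B's socket.
    os.close(fd_a)

    # B now uses its socket and finds that the descriptor is gone.
    seen = None
    try:
        sock_b.getsockname()
    except OSError as exc:
        seen = exc.errno
    finally:
        sock_b.detach()  # the descriptor is already closed; do not close it again
    return reused, seen


if __name__ == "__main__":
    reused, seen = double_close_demo()
    print(reused, errno.errorcode.get(seen))
```

In the real crash the second close and the socket creation happen in different threads, so the window is timing dependent, which is why the failure only shows up occasionally.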
I tried to reproduce the problem, tracing the system calls with strace:
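 strace -o output.txt -T -tt -e trace=all -fp 1039
In the command above, output.txt is the name of the output file and 1039 is the PID of the process.
After reproducing the crash, I opened output.txt and, sure enough, found an error like this: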

That settles it: FD 19 is being closed twice.
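Searching a large trace by eye is tedious, so a small script can help. The Python sketch below scans strace output for close() calls that failed with EBADF and reports the last successful close of the same descriptor number. It assumes the usual one-line layout that strace writes with -f, -tt and -o (PID, timestamp, call, result); calls split into "unfinished"/"resumed" pairs are not handled, and find_ebadf_closes is a made-up name.

```python
import re
import sys

# Expected shape: "<pid> <time> close(<fd>) = <ret> [ERRNO (text)] <duration>"
CLOSE_RE = re.compile(
    r"^(?:\[pid\s+)?(\d+)\]?\s+(\S+)\s+close\((\d+)\)\s+=\s+(-?\d+)(?:\s+(\w+))?"
)


def find_ebadf_closes(lines):
    """Return one dict per close() that failed with EBADF."""
    hits = []
    last_ok = {}  # fd -> (pid, timestamp) of the latest successful close
    for line in lines:
        match = CLOSE_RE.match(line)
        if match is None:
            continue
        pid, stamp, fd, ret, err = match.groups()
        fd = int(fd)
        if ret == "0":
            last_ok[fd] = (pid, stamp)
        elif err == "EBADF":
            hits.append(
                {"fd": fd, "pid": pid, "time": stamp,
                 "previous_close": last_ok.get(fd)}
            )
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python scan_close.py output.txt
    with open(sys.argv[1]) as trace:
        for hit in find_ebadf_closes(trace):
            print(hit)
```

A failed close is only the harmless half of the bug. When the number has been reused in between, the second close succeeds, so it is also worth looking for two successful closes of the same number with a socket() or open() returning that number between them.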
What remains is to review the code and fix the place where the double close happens.
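One common way to make a double close harmless is to let a single owner hold the descriptor and invalidate the stored value on the first close. The crashing application is native code, so the Python sketch below only illustrates the idea; OwnedFd is a hypothetical class, not something taken from the application or from glibc.

```python
import os
import threading


class OwnedFd:
    """Holds one descriptor and guarantees it is closed at most once."""

    def __init__(self, fd):
        self._fd = fd
        self._lock = threading.Lock()

    def fileno(self):
        return self._fd

    def close(self):
        """Close the descriptor once; return False if it was already closed."""
        with self._lock:
            # Swap in -1 atomically so a second caller sees an invalid value.
            fd, self._fd = self._fd, -1
        if fd < 0:
            return False
        os.close(fd)
        return True


if __name__ == "__main__":
    owned = OwnedFd(os.open(os.devnull, os.O_RDONLY))
    print(owned.close())  # first close really closes the descriptor
    print(owned.close())  # second close is a no-op
```

The equivalent habit in C is to set the variable to -1 right after close() and to check for a negative value before closing, so that a stale number can never be passed to close() again.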


