Debugging a crash in a getaddrinfo call

Two days ago, production reported a crash. With the core file in hand, the stack unwound as follows:

(gdb) bt
#0  0x00007f64a3651aff in raise () from /lib64/libc.so.6
#1  0x00007f64a3624ea5 in abort () from /lib64/libc.so.6
#2  0x00007f64a3694097 in __libc_message () from /lib64/libc.so.6
#3  0x00007f64a369415a in __libc_fatal () from /lib64/libc.so.6
#4  0x00007f64a374fc44 in __netlink_assert_response () from /lib64/libc.so.6
#5  0x00007f64a374c762 in __netlink_request () from /lib64/libc.so.6
#6  0x00007f64a374c901 in getifaddrs_internal () from /lib64/libc.so.6
#7  0x00007f64a374d608 in getifaddrs () from /lib64/libc.so.6
#8  0x00007f64a47ecdd0 in bsd_localinfo (return_result=0x7f649d12a6b8, hints=0x7f649d12a6f0) at su_localinfo.c:1167
#9  su_getlocalinfo (hints=hints@entry=0x7f649d12a7d0, return_localinfo=return_localinfo@entry=0x7f649d12a7c8) at su_localinfo.c:242
#10 0x00007f64a47ca9ea in soa_init_sdp_connection_with_session (ss=ss@entry=0x7f64880603a0, c=0x7f649d12a940, buffer=buffer@entry=0x7f649d12a9a0 "10.10.50.52", sdp=sdp@entry=0x7f649d12a9e0) at soa.c:2326
......

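It looks like something went wrong inside the getifaddrs call. The system logs from the production machine were not available. Fortunately, the stack still held some information, so I switched to frame 4 and looked at the disassembly: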

(gdb) f 4
#4  0x00007f64a374fc44 in __netlink_assert_response () from /lib64/libc.so.6
(gdb) disassemble

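Judging from the disassembly, a message should have been printed before the crash was triggered, so let's dump the memory that the register points to: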

(gdb) x/s $r12
0x7f649d129380:	"Unexpected error 9 on netlink descriptor 19.\n"

This is the message glibc printed: "Unexpected error 9 on netlink descriptor 19.\n". So the error number is 9 (EBADF), and the file descriptor being operated on is 19.
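As a small aid for reading such messages, the following Python sketch pulls the two numbers out of the text and maps the error number to its symbolic name. The helper name `decode_netlink_message` is made up for illustration, and errno values are platform-specific: the mapping of 9 to EBADF assumes Linux.

```python
import errno
import os
import re


def decode_netlink_message(message):
    """Return (errno name, errno description, fd) parsed from the abort message."""
    match = re.search(r"Unexpected error (\d+) on netlink descriptor (\d+)", message)
    if match is None:
        raise ValueError("not a netlink assertion message")
    err = int(match.group(1))
    fd = int(match.group(2))
    # errno.errorcode maps numeric values to names such as "EBADF".
    name = errno.errorcode.get(err, "UNKNOWN")
    return name, os.strerror(err), fd


name, description, fd = decode_netlink_message(
    "Unexpected error 9 on netlink descriptor 19.\n"
)
print(name, description, fd)
```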

When in doubt, ask Google. The search turned up this:

https://stackoverflow.com/questions/58827641/getaddrinfo-calls-assert-in-the-program/59615786#59615786

The following explanation from that answer seems to match:

This is a file descriptor race in the application. The typical scenario for error 9 (EBADF) looks like this:

Thread A closes a file descriptor.
Thread B calls getaddrinfo and opens a Netlink socket. It happens to receive the same descriptor value.
Due to a bug, thread A closes the same file descriptor again. Normally, that would be benign, but due to the concurrent execution, the Netlink socket created by glibc is closed.
Thread B attempts to use the Netlink socket descriptor and receives the EBADF error.

The key to fixing such bugs is figuring out where exactly the double-close happens.
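The four steps above can be replayed deterministically in a single thread, which makes the descriptor reuse easy to see. This is only a sketch: `/dev/null` stands in for the Netlink socket, the function name is invented, and the two "threads" are simulated by sequential statements rather than real concurrency.

```python
import errno
import os


def simulate_double_close():
    """Replay the race sequentially; returns (fd_reused, errno seen by 'thread B')."""
    fd_a = os.open(os.devnull, os.O_RDONLY)  # descriptor owned by "thread A"
    os.close(fd_a)                           # step 1: A closes it (legitimate)

    # Step 2: "thread B" opens a descriptor, standing in for the Netlink socket.
    # POSIX hands out the lowest free number, so B normally gets A's old value.
    fd_b = os.open(os.devnull, os.O_RDONLY)
    fd_reused = (fd_b == fd_a)

    try:
        os.close(fd_a)                       # step 3: A's buggy second close
    except OSError:
        pass                                 # no reuse: the double close was harmless

    try:
        os.read(fd_b, 1)                     # step 4: B uses its descriptor
    except OSError as exc:
        return fd_reused, exc.errno          # B's descriptor was closed behind its back
    os.close(fd_b)
    return fd_reused, 0


reused, seen_errno = simulate_double_close()
print(reused, errno.errorcode.get(seen_errno, seen_errno))
```

In a real program the second close only does damage when it lands between another thread's open and use of the same number, which is why the crash is intermittent.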

I tried to reproduce the problem while tracing system calls with strace:

 strace -o output.txt -T -tt -e trace=all -fp 1039

In the command above, output.txt is the output file name and 1039 is the PID of the process.

After reproducing the crash I opened output.txt and, sure enough, found the error.

That settles it: FD 19 is being closed twice.
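Picking the relevant lines out of a large trace by hand is tedious, so a short script can help. The sketch below reports every system call that failed with EBADF on a descriptor, together with the thread that last closed that descriptor successfully. It rests on assumptions: it expects the line layout strace writes with `-f -tt -T -o` (PID, timestamp, call), it ignores calls split into `<unfinished ...>` and `resumed` pairs, and the sample lines are invented for illustration rather than taken from the real trace.

```python
import re

# A successful close:  "<pid>  <time> close(<fd>) = 0 ..."
CLOSE_OK = re.compile(r"^(\d+)\s+([\d:.]+)\s+close\((\d+)\)\s+=\s+0\b")
# Any call whose first argument is an fd and which failed with EBADF.
FD_EBADF = re.compile(r"^(\d+)\s+([\d:.]+)\s+(\w+)\((\d+)\b.*\s=\s+-1\s+EBADF\b")


def find_ebadf_culprits(lines):
    """Return (fd, failing pid, syscall, last closer) for each EBADF failure."""
    last_close = {}  # fd -> (pid, timestamp) of the latest successful close
    findings = []
    for line in lines:
        ok = CLOSE_OK.match(line)
        if ok:
            last_close[int(ok.group(3))] = (int(ok.group(1)), ok.group(2))
            continue
        bad = FD_EBADF.match(line)
        if bad:
            fd = int(bad.group(4))
            findings.append((fd, int(bad.group(1)), bad.group(3), last_close.get(fd)))
    return findings


# Invented sample lines in strace's layout (not from the real trace).
sample = [
    "1041  10:15:02.000100 close(19) = 0 <0.000012>",
    "1047  10:15:02.000150 socket(AF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, NETLINK_ROUTE) = 19 <0.000020>",
    "1041  10:15:02.000180 close(19) = 0 <0.000009>",
    "1047  10:15:02.000260 recvmsg(19, 0x7ffc0a1b2c30, 0) = -1 EBADF (Bad file descriptor) <0.000008>",
]
for finding in find_ebadf_culprits(sample):
    print(finding)
```

On a real trace, pass the lines of output.txt instead of `sample`; the reported closer is a lead to follow up in the code, not proof on its own.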

What remains is to review the code and fix the place where the descriptor is closed twice.
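One common way to make a double close harmless is to have a single owner forget the descriptor number as part of closing it, so a second call has nothing left to close. The sketch below shows the idea in Python; the class name `OwnedFd` is hypothetical, and it is not the fix applied to the code in this post. In C the same pattern is usually written as resetting the variable to -1 right after `close()`.

```python
import os
import threading


class OwnedFd:
    """Hypothetical wrapper that owns one descriptor and closes it at most once."""

    def __init__(self, fd):
        self._fd = fd
        self._lock = threading.Lock()

    def fileno(self):
        if self._fd < 0:
            raise ValueError("descriptor already closed")
        return self._fd

    def close(self):
        # Forget the number under the lock, so a second call sees -1.
        with self._lock:
            fd, self._fd = self._fd, -1
        if fd >= 0:
            os.close(fd)


owner = OwnedFd(os.open(os.devnull, os.O_RDONLY))
owner.close()
other = os.open(os.devnull, os.O_RDONLY)  # likely reuses the same number
owner.close()                             # second close is now a no-op
print(os.read(other, 1))                  # still readable
os.close(other)
```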
