Nikki Chumakov
2013-01-30 17:14:19 UTC
Hi,
I'm using boost-1.49 (also tried boost-1.53b1), RedHat5.8, kernel
2.6.32.26-17.el5, glibc 2.5-81.el5_8.7
Have a strange bug in async tcp client, when handle_connect is called
too early, before the tcp sockets is actually connected.
The trivial async http client example does not demonstrate this problem.
My code is too big to be posted here. It's a framework that utilize
custom invocation handlers, allocators, exception transporting between
threads, boost-log, et cetera, et cetera.
The scenario was trimmed to following conditions:
- just two threads was used: one for io_services::run and other for
creating client requests (objects).
- client opens the connection to http server, and closes after that
(technically it has the "return;" in handle_connect.
- the client requests runs one-by-one in a loop.
The handle_connect is simple:
void
session::handle_connect (const boost::system::error_code& err,
future::promise<void> promise)
{
if (! err)
{
promise.set (); // signal the caller about success
return;
}
else
{
std::cerr << "connect error: " << err.message () << '\n';
abort ();
}
}
Below is the listing of "strace -ff -e network,desc -e \!bind,stat,futex"
normal connect
~~~~~~~~~~~~~~
[pid 25441] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
[pid 25442] epoll_wait(5, <unfinished ...>
[pid 25441] epoll_ctl(5, EPOLL_CTL_ADD, 7,
{EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLET, {u32=1811954288,
u64=140515962010224}}) = 0
[pid 25442] <... epoll_wait resumed> {{EPOLLOUT|EPOLLHUP,
{u32=1811954288, u64=140515962010224}}}, 128, 300000) = 1
[pid 25441] ioctl(7, FIONBIO <unfinished ...>
[pid 25442] kill(1, SIG_0 <unfinished ...>
[pid 25441] <... ioctl resumed> , [1]) = 0
[pid 25442] <... kill resumed> ) = -1 EPERM (Operation not permitted)
[pid 25441] connect(7, {sa_family=AF_INET, sin_port=htons(80),
sin_addr=inet_addr("xxx.xxx.193.11")}, 16 <unfinished ...>
[pid 25442] epoll_wait(5, <unfinished ...>
[pid 25441] <... connect resumed> ) = -1 EINPROGRESS (Operation now
in progress)
[pid 25441] epoll_ctl(5, EPOLL_CTL_MOD, 7,
{EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLET, {u32=1811954288,
u64=140515962010224}}) = 0
[pid 25442] <... epoll_wait resumed> {{EPOLLOUT, {u32=1811954288,
u64=140515962010224}}}, 128, 300000) = 1
[pid 25442] kill(1, SIG_0) = -1 EPERM (Operation not permitted)
*********** calling handle_connect
[pid 25442] getsockopt(7, SOL_SOCKET, SO_ERROR, [7782263930402897920],
[4]) = 0
[pid 25442] getpeername(7, {sa_family=AF_INET, sin_port=htons(80),
sin_addr=inet_addr("xxx.xxx.193.11")}, [140514150055952]) = 0
[pid 25442] getsockname(7, {sa_family=AF_INET, sin_port=htons(47210),
sin_addr=inet_addr("xxx.xxx.145.88")}, [140514150055952]) = 0
[pid 25442] setsockopt(7, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid 25442] epoll_wait(5, {}, 128, 0) = 0
[pid 25442] kill(1, SIG_0) = -1 EPERM (Operation not permitted)
[pid 25442] close(7) = 0
early connect
~~~~~~~~~~~~~
[pid 25441] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 9
[pid 25441] epoll_ctl(5, EPOLL_CTL_ADD, 9,
{EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLET, {u32=1811961744,
u64=140515962017680}} <unfinished ...>
[pid 25442] epoll_wait(5, <unfinished ...>
[pid 25441] <... epoll_ctl resumed> ) = 0
[pid 25442] <... epoll_wait resumed> {{EPOLLOUT, {u32=1811958752,
u64=140515962014688}}, {EPOLLIN, {u32=1811943784, u64=140515961999720}},
{EPOLLOUT|EPOLLHUP, {u32=1811961744, u64=140515962017680}}}, 128, 0) = 3
[pid 25441] ioctl(9, FIONBIO <unfinished ...>
[pid 25442] kill(1, SIG_0 <unfinished ...>
[pid 25441] <... ioctl resumed> , [1]) = 0
[pid 25442] <... kill resumed> ) = -1 EPERM (Operation not permitted)
[pid 25441] connect(9, {sa_family=AF_INET, sin_port=htons(80),
sin_addr=inet_addr("xxx.xxx.193.11")}, 16) = -1 EINPROGRESS
(Operation now in progress)
*********** no epoll_wait after connect **********
[pid 25441] epoll_ctl(5, EPOLL_CTL_MOD, 9,
{EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLET, {u32=1811961744,
u64=140515962017680}} <unfinished ...>
[pid 25442] kill(1, SIG_0 <unfinished ...>
[pid 25441] <... epoll_ctl resumed> ) = 0
[pid 25442] <... kill resumed> ) = -1 EPERM (Operation not permitted)
[pid 25442] kill(1, SIG_0) = -1 EPERM (Operation not permitted)
*********** calling handle_connect
[pid 25442] getsockopt(9, SOL_SOCKET, SO_ERROR, [7782250667543887872],
[4]) = 0
[pid 25442] getpeername(9, 0x40450360, [140514150055964]) = -1 ENOTCONN
(Transport endpoint is not connected)
[pid 25442] write(2, "connect error: ", 15connect error: ) = 15
[pid 25442] write(2, "Transport endpoint is not connec"..., 35Transport
endpoint is not connected) = 35
[pid 25442] write(2, "\n", 1
) = 1
[pid 25442] rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
[pid 25442] tgkill(25441, 25442, SIGABRT) = 0
[pid 25442] --- SIGABRT (Aborted) @ 0 (0) ---
Process 25442 detached
+++ killed by SIGABRT +++
I inserted "kill (1, SIG_0)" in asio/detail/impl/epoll_reactor.ipp,
function epoll_reactor::descriptor_state::do_complete, just before
"op->complete(...)" line, to indicate in strace output when the handlers
are going to be called.
So, in abnormal case, asio calls "::connect" and then immediately calls
user handle_connect handler _without_calling_(!!!) (and waiting for)
epoll_wait between ::connect and handle_connect. Thus handle_connect is
called before the socket was actually connected.
Tcpdump packet sniffer shows the normal TCP handshake, just
handle_connected is called before it was completed.
Any ideas of what can be the reason of this bug, what should I check or
debug here?
Regards,
Nikki
I'm using boost-1.49 (also tried boost-1.53b1), RedHat5.8, kernel
2.6.32.26-17.el5, glibc 2.5-81.el5_8.7
Have a strange bug in async tcp client, when handle_connect is called
too early, before the tcp sockets is actually connected.
The trivial async http client example does not demonstrate this problem.
My code is too big to be posted here. It's a framework that utilize
custom invocation handlers, allocators, exception transporting between
threads, boost-log, et cetera, et cetera.
The scenario was trimmed to following conditions:
- just two threads was used: one for io_services::run and other for
creating client requests (objects).
- client opens the connection to http server, and closes after that
(technically it has the "return;" in handle_connect.
- the client requests runs one-by-one in a loop.
The handle_connect is simple:
void
session::handle_connect (const boost::system::error_code& err,
future::promise<void> promise)
{
if (! err)
{
promise.set (); // signal the caller about success
return;
}
else
{
std::cerr << "connect error: " << err.message () << '\n';
abort ();
}
}
Below is the listing of "strace -ff -e network,desc -e \!bind,stat,futex"
normal connect
~~~~~~~~~~~~~~
[pid 25441] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
[pid 25442] epoll_wait(5, <unfinished ...>
[pid 25441] epoll_ctl(5, EPOLL_CTL_ADD, 7,
{EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLET, {u32=1811954288,
u64=140515962010224}}) = 0
[pid 25442] <... epoll_wait resumed> {{EPOLLOUT|EPOLLHUP,
{u32=1811954288, u64=140515962010224}}}, 128, 300000) = 1
[pid 25441] ioctl(7, FIONBIO <unfinished ...>
[pid 25442] kill(1, SIG_0 <unfinished ...>
[pid 25441] <... ioctl resumed> , [1]) = 0
[pid 25442] <... kill resumed> ) = -1 EPERM (Operation not permitted)
[pid 25441] connect(7, {sa_family=AF_INET, sin_port=htons(80),
sin_addr=inet_addr("xxx.xxx.193.11")}, 16 <unfinished ...>
[pid 25442] epoll_wait(5, <unfinished ...>
[pid 25441] <... connect resumed> ) = -1 EINPROGRESS (Operation now
in progress)
[pid 25441] epoll_ctl(5, EPOLL_CTL_MOD, 7,
{EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLET, {u32=1811954288,
u64=140515962010224}}) = 0
[pid 25442] <... epoll_wait resumed> {{EPOLLOUT, {u32=1811954288,
u64=140515962010224}}}, 128, 300000) = 1
[pid 25442] kill(1, SIG_0) = -1 EPERM (Operation not permitted)
*********** calling handle_connect
[pid 25442] getsockopt(7, SOL_SOCKET, SO_ERROR, [7782263930402897920],
[4]) = 0
[pid 25442] getpeername(7, {sa_family=AF_INET, sin_port=htons(80),
sin_addr=inet_addr("xxx.xxx.193.11")}, [140514150055952]) = 0
[pid 25442] getsockname(7, {sa_family=AF_INET, sin_port=htons(47210),
sin_addr=inet_addr("xxx.xxx.145.88")}, [140514150055952]) = 0
[pid 25442] setsockopt(7, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
[pid 25442] epoll_wait(5, {}, 128, 0) = 0
[pid 25442] kill(1, SIG_0) = -1 EPERM (Operation not permitted)
[pid 25442] close(7) = 0
early connect
~~~~~~~~~~~~~
[pid 25441] socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 9
[pid 25441] epoll_ctl(5, EPOLL_CTL_ADD, 9,
{EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLET, {u32=1811961744,
u64=140515962017680}} <unfinished ...>
[pid 25442] epoll_wait(5, <unfinished ...>
[pid 25441] <... epoll_ctl resumed> ) = 0
[pid 25442] <... epoll_wait resumed> {{EPOLLOUT, {u32=1811958752,
u64=140515962014688}}, {EPOLLIN, {u32=1811943784, u64=140515961999720}},
{EPOLLOUT|EPOLLHUP, {u32=1811961744, u64=140515962017680}}}, 128, 0) = 3
[pid 25441] ioctl(9, FIONBIO <unfinished ...>
[pid 25442] kill(1, SIG_0 <unfinished ...>
[pid 25441] <... ioctl resumed> , [1]) = 0
[pid 25442] <... kill resumed> ) = -1 EPERM (Operation not permitted)
[pid 25441] connect(9, {sa_family=AF_INET, sin_port=htons(80),
sin_addr=inet_addr("xxx.xxx.193.11")}, 16) = -1 EINPROGRESS
(Operation now in progress)
*********** no epoll_wait after connect **********
[pid 25441] epoll_ctl(5, EPOLL_CTL_MOD, 9,
{EPOLLIN|EPOLLPRI|EPOLLOUT|EPOLLERR|EPOLLHUP|EPOLLET, {u32=1811961744,
u64=140515962017680}} <unfinished ...>
[pid 25442] kill(1, SIG_0 <unfinished ...>
[pid 25441] <... epoll_ctl resumed> ) = 0
[pid 25442] <... kill resumed> ) = -1 EPERM (Operation not permitted)
[pid 25442] kill(1, SIG_0) = -1 EPERM (Operation not permitted)
*********** calling handle_connect
[pid 25442] getsockopt(9, SOL_SOCKET, SO_ERROR, [7782250667543887872],
[4]) = 0
[pid 25442] getpeername(9, 0x40450360, [140514150055964]) = -1 ENOTCONN
(Transport endpoint is not connected)
[pid 25442] write(2, "connect error: ", 15connect error: ) = 15
[pid 25442] write(2, "Transport endpoint is not connec"..., 35Transport
endpoint is not connected) = 35
[pid 25442] write(2, "\n", 1
) = 1
[pid 25442] rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
[pid 25442] tgkill(25441, 25442, SIGABRT) = 0
[pid 25442] --- SIGABRT (Aborted) @ 0 (0) ---
Process 25442 detached
+++ killed by SIGABRT +++
I inserted "kill (1, SIG_0)" in asio/detail/impl/epoll_reactor.ipp,
function epoll_reactor::descriptor_state::do_complete, just before
"op->complete(...)" line, to indicate in strace output when the handlers
are going to be called.
So, in abnormal case, asio calls "::connect" and then immediately calls
user handle_connect handler _without_calling_(!!!) (and waiting for)
epoll_wait between ::connect and handle_connect. Thus handle_connect is
called before the socket was actually connected.
Tcpdump packet sniffer shows the normal TCP handshake, just
handle_connected is called before it was completed.
Any ideas of what can be the reason of this bug, what should I check or
debug here?
Regards,
Nikki