This is the mail archive of the ecos-discuss@sourceware.org mailing list for the eCos project.



RE: Possible sockets/fd race condition


On Wed, 2005-11-16 at 08:16 -0800, Jay Foster wrote:
> Do you have ASSERTions turned on?  This could help identify a problem
> earlier.
> 
> I have seen other, similar problems (with assertions turned on), but
> have not yet spent any time trying to resolve them.  Assertions such as:
> 
> eCos panic: cyg_fp_free() will close fd
> eCos panic: sbdrop
> eCos panic: soaccept: !NOFDREF
> eCos panic: sofree: not queued
> eCos panic: soclose: NOFDREF
> eCos panic: Locking mutex I already own
> eCos panic: sbflush: locked
> 
> I did a search of the eCos archives for these panics, and came across this
> (http://sourceware.org/ml/ecos-discuss/2000-04/msg00153.html)
> 
> I could find no evidence that this issue was ever addressed.
> 

While there may not be any explicit mention of these issues being
addressed, let me assure you that they were, at great length.  After
that message was written, we imported a whole new stack based on
FreeBSD and also made considerable changes in how the whole thing fits
into eCos.  The end result is that the issues Grant was experiencing
were, hopefully, handled.

The race mentioned below may indeed exist.  If protecting the basic call
with splnet() works, I don't see a problem.
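
In this stack, protecting the call with splnet() means bracketing the
body of socreate() the same way cyg_sosend() already does (note the
cyg_splx() frame in the HTTP thread's backtrace below).  A minimal
sketch of the pattern; the actual socreate() body is elided:

    int spl = splnet();   /* raise to "network" level, remember old level */
    /* ... existing socreate() work: allocation, protocol attach, ... */
    splx(spl);            /* restore the previous level */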

Note: this particular problem may only show up because of the way Ross's
program works.  It is unusual to create and destroy a socket for every
packet (which is what the pseudo-code implies), and doing so exercises
any failures which might exist in the socreate() path.
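
The more common structure creates the socket once and reuses it for
every packet, so the socreate()/soclose() path is exercised only at
startup and shutdown.  A minimal sketch, in which wait_for_send_flag(),
the port and the group address are hypothetical stand-ins for the
application's own pieces:

    #include <network.h>        /* eCos BSD socket API */
    #include <string.h>

    extern void wait_for_send_flag(void);   /* application-specific */

    static void udp_sender(void)
    {
        char buf[64] = { 0 };
        struct sockaddr_in dest;
        int s = socket(AF_INET, SOCK_DGRAM, 0);     /* created once */

        memset(&dest, 0, sizeof(dest));
        dest.sin_family      = AF_INET;
        dest.sin_port        = htons(5000);             /* hypothetical */
        dest.sin_addr.s_addr = inet_addr("239.1.1.1");  /* hypothetical */

        for (;;) {
            wait_for_send_flag();       /* block until a packet is due */
            sendto(s, buf, sizeof(buf), 0,
                   (struct sockaddr *)&dest, sizeof(dest));
        }
        /* close(s) would only happen at shutdown */
    }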

> Jay
> 
> -----Original Message-----
> From: Ross Wightman [mailto:rwightman@gmail.com]
> Sent: Wednesday, November 16, 2005 12:45 AM
> To: ecos-discuss@ecos.sourceware.org
> Subject: [ECOS] Possible sockets/fd race condition
> 
> 
> Another engineer and I have recently encountered what appears to be
> some sort of sockets (BSD stack) or fd-related race condition.
> 
> We have two threads relevant to the problem.  One of them is sending
> out small (~64-byte) UDP multicast packets at a configurable rate of
> 1-100 per second.  The other thread is running a basic single-threaded
> HTTP server that serves one connection at a time (but serves many
> small requests quite rapidly).
> 
> Running by itself, either thread can send out hundreds of thousands
> of UDP packets or serve millions of HTTP requests with no problems.
> However, as soon as both threads are running, they last anywhere from
> a few hundred to a few thousand served requests before the UDP thread
> dies a horrible death trying to close its socket.
> 
> The exact crash occurs when tcp_usr_detach in tcp_usrreq.c tries to
> dereference a NULL pointer (which should be pointing to a TCP control
> block).  This is very bizarre, because we are supposed to be closing a
> UDP socket.  If we look at the socket itself a few stack frames down,
> it is an inactive TCP socket with the SS_NOFDREF and SS_INCOMP flags
> set and no TCP control block.  It is certainly not the active UDP
> socket it should be.
> 
> Every time the crash has occurred, the HTTP server has been waiting in
> unlock_inner, reached from an splx call made by various TCP functions.
> 
> At the time of the crash, the file descriptor numbers for the UDP
> socket and the TCP socket are 4 and 5, respectively.
> 
> We have yet to pinpoint exactly when things start to go wrong.  We
> have, however, looked for possible race conditions, and we noticed that
> the socreate function does not have any splnet protection, unlike most
> of the other functions at that level.  So we added that, and after half
> a day of running, the problem has not recurred.  However, it's not
> clear whether we fixed the problem correctly or just smashed it with a
> blunt hammer and masked the real cause.
> 
> Any thoughts would be much appreciated. More details follow below.
> 
> Regards,
> 
> Ross
> 
> 
> 
> This is the pseudo-code sequence of the UDP thread (a concrete sketch
> follows the loop):
> 
> {
>     socket s;
> 
>     loop
>     {
>         wait on flag()
> 
>         s = socket(...);
> 
>         setsockopt(s, ...);
> 
>         ioctl(s, ...)    // for address change check
> 
>         sendto(s, ...)
> 
>         close(s)
>     }
> }
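> 
> In concrete terms, the body of the loop is along these lines (a
> hypothetical filling-in: the real option, ioctl and address arguments
> are elided above, and ifr, buf, len and dest are declared elsewhere):
> 
>     int s = socket(AF_INET, SOCK_DGRAM, 0);
>     unsigned char ttl = 1;                      /* hypothetical TTL */
>     setsockopt(s, IPPROTO_IP, IP_MULTICAST_TTL, &ttl, sizeof(ttl));
>     ioctl(s, SIOCGIFADDR, &ifr);                /* address change check */
>     sendto(s, buf, len, 0, (struct sockaddr *)&dest, sizeof(dest));
>     close(s);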
> 
> The HTTP server thread is running a typical TCP serving loop and
> shares nothing with the UDP thread.  The only interaction between the
> two threads is via whatever locking or data sharing occurs inside the
> TCP/IP stack.
> 
> The relevant portion of the UDP thread stack is as follows:
> 
> #0  0x001f01b0 in tcp_disconnect (tp=0x0)
>     at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/netinet/tcp_usrreq.c:1070
> #1  0x001eedb4 in tcp_usr_detach (so=0x2cc010)
>     at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/netinet/tcp_usrreq.c:181
> #2  0x001c33ec in cyg_soclose (so=0x2cc010)
>     at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/uipc_socket.c:321
> #3  0x0016f244 in bsd_close (fp=0x2b99ec)
>     at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/sockio.c:730
> #4  0x0019c17c in fp_ucount_dec (fp=0x2b99ec)
>     at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/fd.cxx:160
> #5  0x0019c284 in fd_close (fd=5)
>     at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/fd.cxx:189
> #6  0x0019c55c in cyg_fd_free (fd=5)
>     at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/fd.cxx:260
> #7  0x0019f8ac in close (fd=5)
>     at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/io.cxx:216
> #8  0x000524b0 in Packet::PacketSend() (this=0x32d5a0)
>     at src/App/Packet.cpp:141
> 
> The printout of the socket (so) is below.  This should be a UDP
> socket, but instead it is an inactive TCP socket:
> 
> (gdb) print/x *(struct socket*)(0x2cc010)
> $2 = {so_zone = 0x2cc180, so_type = 0x1, so_options = 0x20, so_linger = 0x0,
>   so_state = 0x801, so_pcb = 0x2cb478, so_proto = 0x2a51a0,
>   so_head = 0x2cc0c0, so_incomp = {tqh_first = 0x0, tqh_last = 0x0},
>   so_comp = {tqh_first = 0x0, tqh_last = 0x0},
>   so_list = {tqe_next = 0x0, tqe_prev = 0x2cc0d8},
>   so_qlen = 0x0, so_incqlen = 0x0, so_qlimit = 0x0,
>   so_timeo = 0x0, so_error = 0x0, so_sigio = 0x0, so_oobmark = 0x0,
>   so_aiojobq = {tqh_first = 0x0, tqh_last = 0x2cc054},
>   so_rcv = {sb_cc = 0x0, sb_hiwat = 0x4470, sb_mbcnt = 0x0,
>     sb_mbmax = 0x22380, sb_lowat = 0x1, sb_mb = 0x0,
>     sb_sel = {si_info = 0x0, si_waitFlag = 0x0}, sb_flags = 0x0,
>     sb_timeo = 0x0},
>   so_snd = {sb_cc = 0x40, sb_hiwat = 0x4470, sb_mbcnt = 0x80,
>     sb_mbmax = 0x22380, sb_lowat = 0x800, sb_mb = 0x2ccf80,
>     sb_sel = {si_info = 0x0, si_waitFlag = 0x0}, sb_flags = 0x0,
>     sb_timeo = 0x0},
>   so_upcall = 0x0, so_upcallarg = 0x0, so_cred = 0x0,
>   cyg_so_gencnt = 0x8646, so_emuldata = 0x0, so_accf = 0x0}
> 
> The printout of the inpcb (so_pcb) is:
> 
> (gdb) print/x *(struct inpcb*)(0x2cb478)
> $3 = {inp_hash = {le_next = 0x0, le_prev = 0x0}, inp_fport = 0x0,
>   inp_lport = 0x0, inp_list = {le_next = 0x0, le_prev = 0x2a6e18},
>   inp_flow = 0x0,
>   inp_dependfaddr = {inp46_foreign = {ia46_pad32 = {0x0, 0x0, 0x0},
>       ia46_addr4 = {s_addr = 0x0}},
>     inp6_foreign = {__u6_addr = {__u6_addr8 = {0x0 <repeats 16 times>},
>         __u6_addr16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
>         __u6_addr32 = {0x0, 0x0, 0x0, 0x0}}}},
>   inp_dependladdr = {inp46_local = {ia46_pad32 = {0x0, 0x0, 0x0},
>       ia46_addr4 = {s_addr = 0x0}},
>     inp6_local = {__u6_addr = {__u6_addr8 = {0x0 <repeats 16 times>},
>         __u6_addr16 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
>         __u6_addr32 = {0x0, 0x0, 0x0, 0x0}}}},
>   inp_ppcb = 0x0, inp_pcbinfo = 0x308310, inp_socket = 0x2cc010,
>   inp_flags = 0x0,
>   inp_dependroute = {inp4_route = {ro_rt = 0x0,
>       ro_u = {rou_dst_ss = {ss_len = 0x0, ss_family = 0x0,
>           __ss_pad1 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, __ss_align = 0x0,
>           __ss_pad2 = {0x0 <repeats 112 times>}},
>         ro_dst_sa = {sa_len = 0x0, sa_family = 0x0,
>           sa_data = {0x0 <repeats 30 times>}}}},
>     inp6_route = {ro_rt = 0x0,
>       ro_u = {rou_dst_ss = {ss_len = 0x0, ss_family = 0x0,
>           __ss_pad1 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, __ss_align = 0x0,
>           __ss_pad2 = {0x0 <repeats 112 times>}},
>         ro_dst_sa = {sa_len = 0x0, sa_family = 0x0,
>           sa_data = {0x0 <repeats 30 times>}}}}},
>   inp_sp = 0x0, inp_vflag = 0x1, inp_ip_ttl = 0x40, inp_ip_p = 0x0,
>   inp_depend4 = {inp4_ip_tos = 0x0, inp4_options = 0x0,
>     inp4_moptions = 0x0},
>   inp_depend6 = {inp6_options = 0x0, inp6_inputopts = 0x0,
>     inp6_outputopts = 0x0, inp6_moptions = 0x0, inp6_icmp6filt = 0x0,
>     inp6_cksum = 0x0, inp6_ifindex = 0x0, inp6_hops = 0x0},
>   inp_portlist = {le_next = 0x0, le_prev = 0x0}, inp_phd = 0x0,
>   inp_gencnt = 0x8d9d}
> 
> 
> The relevant portion of the HTTP serving thread's stack is:
> 
> #0  Cyg_Scheduler::unlock_inner(unsigned) (new_lock=0)
>     at /cygdrive/d/ecos/ecos-cvs/packages/kernel/current/src/sched/sched.cxx:220
> #1  0x00266858 in Cyg_Scheduler::unlock() ()
>     at /cygdrive/d/work/firmware/micro/ecos/conf/avppc1/ecos/install/include/cyg/kernel/sched.inl:100
> #2  0x001868cc in Cyg_Mutex::unlock() (this=0x2ff580)
>     at /cygdrive/d/ecos/ecos-cvs/packages/kernel/current/src/sync/mutex.cxx:445
> #3  0x0017e470 in cyg_mutex_unlock (mutex=0x2ff580)
>     at /cygdrive/d/ecos/ecos-cvs/packages/kernel/current/src/common/kapi.cxx:1081
> #4  0x001a6048 in cyg_splx (old_state=0)
>     at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/ecos/synch.c:209
> #5  0x001c43d8 in cyg_sosend (so=0x2cbeb0, addr=0x0, uio=0x2b8310,
>     top=0x2cce80, control=0x0, flags=0, p=0x0)
>     at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/uipc_socket.c:622
> #6  0x0016f9dc in bsd_sendit (fp=0x2b99cc, mp=0x2b83b0, flags=0, retsize=0x2b83a0)
>     at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/sockio.c:1039
> #7  0x0016eb64 in bsd_sendmsg (fp=0x2b99cc, m=0x2b83b0, flags=0, retsize=0x2b83a0)
>     at /cygdrive/d/ecos/ecos-cvs/packages/net/bsd_tcpip/current/src/sys/kern/sockio.c:572
> #8  0x001a2450 in sendto (s=4, buf=0x32d7fc, len=543, flags=0, to=0x0, tolen=0)
>     at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/socket.cxx:612
> #9  0x001a251c in send (s=4, buf=0x32d7fc, len=543, flags=0)
>     at /cygdrive/d/ecos/ecos-cvs/packages/io/fileio/current/src/socket.cxx:629
> #10 0x00143694 in fsend (pData=0x32d5e8, len=543)
>     at src/Interface/serv.cpp:518
> 

-- 
------------------------------------------------------------
Gary Thomas                 |  Consulting for the
MLB Associates              |    Embedded world
------------------------------------------------------------


-- 
Before posting, please read the FAQ: http://ecos.sourceware.org/fom/ecos
and search the list archive: http://ecos.sourceware.org/ml/ecos-discuss

