[ECOS] TCP/IP preemption fix

Grant Edwards grante@visi.com
Thu Apr 13 20:03:00 GMT 2000


On Thu, Apr 13, 2000 at 04:50:46PM -0600, Gary Thomas wrote:

> >> Can you see if these patches fix [at least] the sockbuf corruption
> >> problem you were seeing?
> > 
> > After some additional testing, it seems the problem is still
> > there.  It looks like there are routines that access sb structs
> > without calling sblock/sbunlock.
> > 
> > The ones we've found are in code called by the network task:
> > tcp_input, tcp_output, etc.  There are calls to sbappend and
> > similar functions/macros that result in unprotected accesses to
> > sb struct fields.
> > 
> > My original e-mail pointing out the unreliability of sblock and
> > sbunlock didn't identify the entire problem.
> > 
> >> The basic idea I've incorporated is to use the eCos scheduler
> >> lock to emulate the user/kernel behaviour from the BSD world
> >> (i.e. kernel code cannot be preempted)
> > 
> > I think that sblock and sbunlock should work now, but I don't
> > think they're called in enough places.
> 
> So this seems to be a start.  I'll try and investigate more.
> BTW did you see any improvement at all (just to make sure we are
> hunting the right fox)?

Well, some of the throughput tests showed an improvement of about 5%, but I
don't know why the changes should have done that.  Now that we've spent some
more time analyzing things, it looks like the sb corruptions have to be
caused by conflicts between our user tasks and functions called from the
network task.

AFAICT, the sblock/sbunlock calls are in routines called from user tasks, but
not in the functions called by the network task.  So the patch should
prevent conflicts between user tasks.  We have two user tasks that do
TCP/IP via a single socket, but one handles input and the other handles
output, so it doesn't appear they can be conflicting with each other, since
there are separate input and output sb structs.

The conflict we seem to be running into is between our user tasks and the
code run from the network task such as tcp_input and tcp_output.
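
To make the suspected conflict concrete, here is a toy model of it.  This
is only a sketch, not the real stack code: toy_sockbuf, toy_sblock,
toy_sbunlock and toy_sbappend_unlocked are made-up names, and the struct
keeps just a flag word and a byte count.  It shows that a lock taken on
the user-task side does nothing to stop an append path that never checks
it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a sockbuf: only a flag word and a byte count. */
#define TOY_SB_LOCK 0x01u

struct toy_sockbuf {
    unsigned int  sb_flags;   /* lock bit lives here */
    unsigned long sb_cc;      /* bytes currently queued */
};

/* Non-blocking lock for the model: 0 on success, -1 if already held. */
static int toy_sblock(struct toy_sockbuf *sb)
{
    assert(sb != NULL);
    if (sb->sb_flags & TOY_SB_LOCK)
        return -1;
    sb->sb_flags |= TOY_SB_LOCK;
    return 0;
}

/* Release the lock; the caller must hold it. */
static void toy_sbunlock(struct toy_sockbuf *sb)
{
    assert(sb != NULL);
    assert(sb->sb_flags & TOY_SB_LOCK);
    sb->sb_flags &= ~TOY_SB_LOCK;
}

/* Models the network-task path: it updates the sockbuf without ever
 * looking at the lock bit, so the lock holder gets no protection. */
static void toy_sbappend_unlocked(struct toy_sockbuf *sb, unsigned long n)
{
    assert(sb != NULL);
    sb->sb_cc += n;
}
```

If a "user task" calls toy_sblock() and the "network task" then calls
toy_sbappend_unlocked(), the byte count changes while the lock bit is
still set, which is the kind of unprotected access described above.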

> Do you have a good way to duplicate the failures?

If we set our user tasks to a higher priority than the network task, we will
almost always see a panic within a minute or two.  Sometimes it will run for
as long as several minutes.

> Is it something I can set up here? 

Not really.  The application that fails most predictably depends on custom
target hardware and some specific software on a host for it to talk to.  I've
been trying to come up with a simple test configuration based on one of the
existing eCos test programs that demonstrates the problem, and will
continue to try.  But I haven't been able to come up with the right
combination of CPU loading and TCP/IP traffic patterns to make it fail
predictably, so I have to build a library, throw it over the wall and let
the application development guy try it out.

> Any/all information on this will be useful.

I'm trying to come up with a simple test case -- I've got one more idea to
try out tomorrow.  I might also add some sblock/sbunlock pairs to some of
the functions like sbappend to see if that has an effect.  (I've convinced
myself it's got to.)
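
Roughly what I have in mind, as a toy sketch rather than a patch: the
names below (toy_sockbuf, toy_sblock, toy_sbunlock, toy_sbappend_locked)
are made up for illustration and are not the stack's real functions.  The
append takes the lock itself and drops it when done.  What the real code
should do when the lock is already held (wait, defer, or something else)
is left open here; the toy just reports "busy".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a sockbuf: only a flag word and a byte count. */
#define TOY_SB_LOCK 0x01u

struct toy_sockbuf {
    unsigned int  sb_flags;   /* lock bit lives here */
    unsigned long sb_cc;      /* bytes currently queued */
};

/* Non-blocking lock for the model: 0 on success, -1 if already held. */
static int toy_sblock(struct toy_sockbuf *sb)
{
    assert(sb != NULL);
    if (sb->sb_flags & TOY_SB_LOCK)
        return -1;
    sb->sb_flags |= TOY_SB_LOCK;
    return 0;
}

/* Release the lock; the caller must hold it. */
static void toy_sbunlock(struct toy_sockbuf *sb)
{
    assert(sb != NULL);
    assert(sb->sb_flags & TOY_SB_LOCK);
    sb->sb_flags &= ~TOY_SB_LOCK;
}

/* Append wrapped in a lock/unlock pair.  Returns 0 on success, or -1
 * (leaving the sockbuf untouched) if someone else holds the lock. */
static int toy_sbappend_locked(struct toy_sockbuf *sb, unsigned long n)
{
    if (toy_sblock(sb) != 0)
        return -1;            /* busy: real code would need a policy here */
    sb->sb_cc += n;           /* the protected update */
    toy_sbunlock(sb);
    return 0;
}
```

With this shape, an append that arrives while another task holds the lock
can no longer touch the byte count behind that task's back.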

-- 
Grant Edwards
grante@visi.com

