[ECOS] TCP/IP preemption fix

Gary Thomas gthomas@redhat.com
Fri Apr 14 03:41:00 GMT 2000


On 14-Apr-00 Grant Edwards wrote:
> On Thu, Apr 13, 2000 at 04:50:46PM -0600, Gary Thomas wrote:
> 
>> >> Can you see if these patches fix [at least] the sockbuf corruption
>> >> problem you were seeing?
>> > 
>> > After some additional testing, it seems the problem is still
>> > there.  It looks like there are routines that access sb structs
>> > without calling sblock/sbunlock.
>> > 
>> > The ones we've found are in code called by the network task:
>> > tcp_input, tcp_output, etc.  There are calls to sbappend and
>> > similar functions/macros that result in unprotected accesses to
>> > sb struct fields.
>> > 
>> > My original e-mail pointing out the unreliability of sblock and
>> > sbunlock didn't identify the entire problem.
>> > 
>> >> The basic idea I've incorporated is to use the eCos scheduler
>> >> lock to emulate the user/kernel behaviour from the BSD world
>> >> (i.e. kernel code cannot be preempted)
>> > 
>> > I think that sblock and sbunlock should work now, but I don't
>> > think they're called in enough places.
>> 
>> So this seems to be a start.  I'll try and investigate more.
>> BTW did you see any improvement at all (just to make sure we are
>> hunting the right fox)?
> 
> Well, some of the throughput tests showed an improvement of about 5%, but I
> don't know why the changes should have done that. When we spent some more
> time analyzing things it looks like the sb corruptions have to be happening
> due to conflicts between our user tasks and functions called from the
> network task.
> 
> AFAICT, the sblock/unlock calls are in routines called from user tasks, but
> not in the functions called by the network task.  So the patch should
> prevent conflicts between user tasks.  We have two user tasks that do
> TCP/IP via a single socket, but one handles input and the other handles
> output, so it doesn't appear they can be conflicting with each other, since
> there are separate input and output sb structs.
> 
> The conflict we seem to be running into is between our user tasks and the
> code run from the network task such as tcp_input and tcp_output.
> 
>> Do you have a good way to duplicate the failures?
> 
> If we set our user tasks to a higher priority than the network task, we will
> almost always see a panic within a minute or two.  Sometimes it will run for
> as long as several minutes.
> 
>> Is it something I can set up here? 
> 
> Not really.  The application that fails most predictably depends on custom
> target hardware and some specific software on a host for it to talk to. I've
> been trying to come up with a simple test configuration based on one of the
> existing eCos test programs that demonstrates the problem, and will
> continue to try.  But, I haven't been able to come up with the right
> combination of cpu loading and TCP/IP traffic patterns to make it fail
> predictably, so I have to build a library, throw it over the wall and let
> the application development guy try it out.
> 
>> Any/all information on this will be useful.
> 
> I'm trying to come up with a simple test case -- I've got one more idea to
> try out tomorrow.  I might also add some sblock/sbunlock pairs to some of
> the functions like sbappend to see if that has an effect.  (I've convinced
> myself it's got to.)
> 

If it is caused by interaction between the network task and your threads, try
this patch:

cvs diff: Diffing net/tcpip/current/src/ecos
Index: net/tcpip/current/src/ecos/support.c
===================================================================
RCS file: /local/cvsfiles/ecc/ecc/net/tcpip/current/src/ecos/support.c,v
retrieving revision 1.7
diff -u -5 -p -r1.7 support.c
--- net/tcpip/current/src/ecos/support.c        2000/03/08 19:11:20     1.7
+++ net/tcpip/current/src/ecos/support.c        2000/04/14 10:38:37
@@ -494,10 +494,11 @@ cyg_netint(cyg_addrword_t param)
 {
     cyg_flag_value_t curisr;
     while (true) {
         curisr = cyg_flag_wait(&netint_flags, NETISR_ANY, 
                                CYG_FLAG_WAITMODE_OR|CYG_FLAG_WAITMODE_CLR);
+        cyg_scheduler_lock();  // This code should not be preempted
 #ifdef INET
         if (curisr & (1 << NETISR_ARP)) {
             // Pending ARP requests
             arpintr();
         }
@@ -510,10 +511,11 @@ cyg_netint(cyg_addrword_t param)
         if (curisr & (1 << NETISR_IPV6)) {
             // Pending IPv6 input
             ip6intr();
         }
 #endif
+        cyg_scheduler_unlock();
     }
 }
 
 //
 // Network initialization


Basically, this just keeps other threads from running while the network "task"
performs its housekeeping.
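
For anyone who wants to see the bracketing idea in isolation, here is a rough
host-side sketch. It is not eCos code. sim_scheduler_lock() and
sim_scheduler_unlock() are hypothetical stand-ins for cyg_scheduler_lock()
and cyg_scheduler_unlock() that only count nesting. struct toy_sockbuf,
toy_sbappend() and toy_netint_pass() are made-up names, not the BSD sockbuf
code. The sketch only shows that a buffer update made inside the locked
region is seen as protected, while a bare call from elsewhere is not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the eCos scheduler lock.  On a real target
 * these would be cyg_scheduler_lock()/cyg_scheduler_unlock(); here they
 * only track the nesting depth so the sketch can run on a host. */
static int sched_lock_depth = 0;

static void sim_scheduler_lock(void)
{
    sched_lock_depth++;
}

static void sim_scheduler_unlock(void)
{
    assert(sched_lock_depth > 0);   /* unlock without lock is a bug */
    sched_lock_depth--;
}

/* Toy socket buffer (assumption: NOT the BSD struct sockbuf). */
struct toy_sockbuf {
    size_t cc;                      /* bytes currently queued */
    size_t hiwat;                   /* maximum bytes allowed */
    int    unprotected_accesses;    /* updates made with no lock held */
};

/* Models an sbappend-like update.  It records whether it was entered
 * while the (simulated) scheduler lock was not held. */
static int toy_sbappend(struct toy_sockbuf *sb, size_t len)
{
    if (sched_lock_depth == 0) {
        sb->unprotected_accesses++;
    }
    if (sb->cc + len > sb->hiwat) {
        return -1;                  /* no room, buffer left unchanged */
    }
    sb->cc += len;
    return 0;
}

/* One pass of a network-task loop body, bracketed the same way the
 * patch brackets the handlers in cyg_netint(). */
static void toy_netint_pass(struct toy_sockbuf *sb, size_t len)
{
    sim_scheduler_lock();           /* this code should not be preempted */
    (void)toy_sbappend(sb, len);
    sim_scheduler_unlock();
}
```

Calling toy_netint_pass() leaves unprotected_accesses at zero, while calling
toy_sbappend() directly bumps it. That second case is the kind of access
Grant describes: sb fields touched without sblock/sbunlock.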

