[ECOS] Network code unstable (Solved for real this time).

Thu Mar 7 00:58:00 GMT 2002

On Wed, 2002-03-06 at 18:38, Gary Thomas wrote:
> On Wed, 2002-03-06 at 10:11, Pieter Truter wrote:
> > 
> > After a lot of testing and debugging I found out that the CS8900 is losing
> > interrupts under heavy network load. This is more prominent when running
> > from flash which is slower.
> > 
> > Looking at if_cs8900a.c I think I found the cause of my problem. The time
> > between the interrupt and acknowledge() is too long. I then moved the
> > acknowledge() in cs8900a_deliver() to cs8900a_isr() just after the mask()
> > and now everything works great.
> 
> So, this was a case of new interrupts from the device not causing the ISR
> to run, possibly because of edge triggering.  I don't understand why the
> 'while()' loop in the interrupt handling routine doesn't cause this to be
> retriggered, but maybe it's just a chip problem.
> 
> > 
> > I am still concerned about masking the interrupt for so long but I
> > understand that this is probably done to be able to use the BSD stack with a
> > realtime OS.
> 
> It's only the device interrupt which is masked.  I don't see how you can
> avoid that - you've got to keep the device from [re]interrupting the driver 
> while it handles the current one.  Also note that the "deliver" function gets 
> called from a network processing thread, not directly by the DSR code.  This 
> probably accounts for most of the delay.
> 
> > 
> > The big problem with losing an interrupt from the CS8900a chip is that you
> > have to clean up all the info in the chip, otherwise it will not generate
> > any other interrupts. And if you do not know that you missed an interrupt,
> > you don't know when to clean up. ;-(
> 
> Every ethernet device seems to have these quirks and, sadly, we have to deal
> with them all, each in their own way :-(
> 

Please correct me if I am wrong, but I don't think Ethernet devices play
a special role in this kind of problem; it seems to be a common pattern
in interrupt-driven device drivers. Given the pseudo code below:

ISR/DSR:
	mask_dev_interrupt()
	wake_up_thread()

thread:
	status = clear_device_status()
	while (work_to_do(status)) {
		...
		status = clear_device_status()
	}
	ack_dev_interrupt()		/* <== bad guy */
	unmask_dev_interrupt()

the thread code will lose interrupts when new events happen on the
device after the thread has left the 'while' loop but before it has
executed ack_dev_interrupt(): the ack then discards the interrupt that
the new event raised.

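But if we move 'ack_dev_interrupt()' either to the beginning of the
thread code or into the 'while' loop, before reading the device's
status, the problem goes away: an event that arrives after the ack
leaves the interrupt pending, so the ISR runs again once the interrupt
is unmasked.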
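
To make the race concrete, here is a small toy model in C. It is only a
sketch under assumptions: it models an edge-latched interrupt line as a
single flag, it does not model masking at all, and every name in it
(struct sim, device_event(), service_thread(), missed_event()) is made up
for illustration and is not taken from if_cs8900a.c or the eCos API. The
new event is injected at the worst possible moment, right after the
'while' loop has exited.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; all names here are hypothetical. */
struct sim {
    int  dev_status;   /* events waiting in the device's status register */
    bool irq_latched;  /* edge latch in the interrupt controller */
    int  handled;      /* events the thread has processed */
};

/* The device raises an event: status is set and the edge is latched. */
void device_event(struct sim *s)
{
    s->dev_status++;
    s->irq_latched = true;
}

/* Reading the status register also clears it. */
int clear_device_status(struct sim *s)
{
    int status = s->dev_status;
    s->dev_status = 0;
    return status;
}

/* Acknowledging drops whatever edge is currently latched. */
void ack_dev_interrupt(struct sim *s)
{
    s->irq_latched = false;
}

/* The thread body from the pseudo code, with the ack either first or last. */
void service_thread(struct sim *s, bool ack_first)
{
    assert(s->irq_latched);      /* the thread is only woken by an interrupt */

    if (ack_first)
        ack_dev_interrupt(s);

    int status = clear_device_status(s);
    while (status > 0) {
        s->handled += status;
        status = clear_device_status(s);
    }

    device_event(s);             /* new event lands in the race window */

    if (!ack_first)
        ack_dev_interrupt(s);    /* the "bad guy": wipes the new edge */
    /* unmask_dev_interrupt() would go here */
}

/* True if the device holds an event but no interrupt is pending for it. */
bool missed_event(bool ack_first)
{
    struct sim s = {0};
    device_event(&s);            /* the event that woke the thread */
    service_thread(&s, ack_first);
    return s.dev_status > 0 && !s.irq_latched;
}
```

In this model missed_event(false) reports a stuck device (an event is
waiting but nothing will ever signal it), while missed_event(true) leaves
the latch set, so the ISR would run again after the unmask. The price of
acking first is an occasional spurious wake-up where the thread finds no
work to do, which is harmless.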

Robin
