[ECOS] do not use the ARM FIQ: there's a bug in the code

Jürgen Lambrecht jurgen.lambrecht2@telenet.be
Fri Sep 14 22:27:00 GMT 2007


I was using the FIQ pin of my AT91 ARM microcontroller (uC) as the Ethernet 
interrupt. When connected to a network, our platforms (2 different ones) 
crash after a while. The debugger always gives "scheduler lock not zero" 
(recently reported here: 
http://sourceware.org/ml/ecos-discuss/2007-07/msg00169.html; also some 
info here: http://sourceware.org/ml/ecos-discuss/2006-11/msg00094.html).
I now use another IRQ pin, and the problems are gone.

We (Wim Dumon and I, working for Televic) were able to track down the 
bug, but not (yet) able to solve it.
Here is a first report. Wim will mail a more detailed report when he has time.
The bug appears randomly: sometimes it takes an hour, sometimes 10 
seconds before it appears.
With UDP traffic as a test, the bug shows up as "scheduler lock not 
zero" in the idle thread. This is probably because UDP is rather simple, 
so the processor spends much of its time in the idle thread (no 
other application threads are running during the tests).
With TCP traffic as a test, the bug shows up as various other weird 
errors (I have a test report of it).

Because the bug always shows up with UDP as "scheduler lock not zero", 
we could track it down by setting a breakpoint there and toggling a pin 
of the uC just before it. That pin was connected to a logic analyzer as 
trigger input. The 16 lowest address pins (going to the uC's SRAM where 
the code runs) were monitored by the logic analyzer, and everything 
before the trigger was stored in the analyzer's memory. This way Wim 
traced the software execution backwards from the failure.
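
To make the trigger trick concrete, here is a rough sketch in C. It is not 
our actual code: the names trigger_port, TRIGGER_PIN_MASK, 
fire_analyzer_trigger and check_sched_lock are made up for illustration, 
and a plain variable stands in for the AT91 PIO register so the sketch also 
compiles on a PC. The idea is only that the pin toggles just before the 
point where the breakpoint sits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the memory-mapped GPIO output register.
 * On real hardware this would be a volatile pointer to a PIO register;
 * a plain variable keeps the sketch runnable on a host PC. */
static volatile uint32_t trigger_port = 0;

/* Hypothetical pin wired to the logic analyzer's trigger input. */
#define TRIGGER_PIN_MASK (UINT32_C(1) << 5)

/* Toggle the trigger pin so the analyzer stops capturing and keeps
 * the address trace recorded before this moment. */
static void fire_analyzer_trigger(void)
{
    trigger_port ^= TRIGGER_PIN_MASK;
}

/* Returns true when the lock value is sane. On a non-zero value the
 * trigger fires first, so the trace leading up to the failure is
 * preserved before the debugger halts the target. */
static bool check_sched_lock(uint32_t sched_lock)
{
    if (sched_lock != 0) {
        fire_analyzer_trigger();
        return false;   /* the debugger breakpoint would sit here */
    }
    return true;
}
```

With the analyzer set to store everything before the trigger, the capture 
then ends at the pin toggle, and the address trace can be read backwards 
from there.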
Wim found that whenever the bug occurs, the first 5 registers of the ARM 
are wrong, and always with the same values. Those values come from some 
stack: the bug causes those 5 registers not to be restored correctly 
at a context switch. Register 2 (r2) contains the scheduler lock, and is 
indeed not zero when the bug occurs (as reported by the assertion), but 
when reading the scheduler lock from its address in SRAM, it was correctly 
zero! r2 was always 0xFFDF_FFDF.
Wim is convinced that the /hal/arm/arch/.../src/vectors.S code contains 
the bug(s).
Our eCos tree is from 2006-02-15, so it already includes the bugfix of 
2006-02-06 from Sergei Organov. But we think a similar bug is still present.

Mark: the comment in /hal/arm/at91/var/.../cdl/hal_arm_at91.cdl about 
"CYGHWR_HAL_ARM_AT91_FIQ" is wrong. This is (more or less) the correct 
text:
"           Enable this option if you want to use the FIQ. Interrupts in 
            may not be interrupted. Therefore, it is needed to handle FIQ
            interrupts in the normal way, i.e. a FIQ interrupt must be 
            as a normal IRQ using the highest priority"

During debugging with the JTAG monitor BDI2000, we often saw spurious 
interrupts. But we checked the hardware with an oscilloscope, and the 
interrupts are clean. I see 3 possible reasons for the spurious interrupts:
 - caused by the monitoring,
 - or caused by that bug,
 - or caused by using a level-sensitive interrupt.
When I have some time, I will check the last one by using an 
edge-sensitive IRQ instead, and I will check the second one now that we 
are not using the FIQ anymore.

It is up to my boss to decide if he wants to spend any more money trying 
to solve this bug. It will also depend on whether I am able to use the 
workaround for all versions of our platforms...

I could also try to solve it for fun in my free time of course ;-),
kind regards,
