[ECOS] do not use the ARM FIQ: there's a bug in the code
Fri Sep 14 22:27:00 GMT 2007
I was using the FIQ pin of my AT91 ARM microcontroller (uC) as the Ethernet
interrupt. When connected to a network, our platforms (2 different ones)
crash after a while. The debugger always gives "scheduler lock not zero"
(recently reported here:
http://sourceware.org/ml/ecos-discuss/2007-07/msg00169.html; also some
info here: http://sourceware.org/ml/ecos-discuss/2006-11/msg00094.html).
I now use another IRQ pin, and the problems are gone.
We (me and Wim Dumon working for Televic) were able to track down the
bug, but not (yet) able to solve it.
Here is a first report. Wim will mail a more detailed report when he has time.
The bug appears randomly: sometimes it takes an hour, sometimes 10
seconds before it appears.
With UDP traffic as a test, the bug shows up as "scheduler lock not
zero" in the idle thread. This is probably because UDP is rather simple,
and the processor has much time free to spend in the idle thread (as no
other application threads are running during the tests).
With TCP traffic as a test, the bug shows up as various other weird
errors (I have a test report of it).
Because the bug always shows up with UDP as "scheduler lock not zero",
we could track it down by setting a breakpoint there and toggling a pin
of the uC just before it. That pin was connected to a logic analyzer as
the trigger input. The 16 lowest address pins (going to the uC's SRAM,
where the code runs) were monitored by the logic analyzer, and everything
before the trigger was stored in the analyzer's memory. This way Wim
traced the software backwards from the failure.
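
A minimal sketch of the trigger idea, not the code we actually ran. The
names (fake_pio_output, TRIGGER_PIN_MASK, check_lock_and_trigger) are
hypothetical, and the pin is modelled as a plain variable so that the
sketch builds on a host; on the real board it would be a write to the
uC's memory-mapped pin output register.

```c
#include <stdint.h>

/* Assumption: stands in for the memory-mapped pin output latch of the uC. */
static volatile uint32_t fake_pio_output;

/* Assumption: hypothetical pin wired to the logic analyzer's trigger input. */
#define TRIGGER_PIN_MASK (1u << 5)

/* Drive the trigger pin low so that the next rising edge is unambiguous. */
static void trigger_arm(void)
{
    fake_pio_output &= ~TRIGGER_PIN_MASK;
}

/* Drive the trigger pin high; the analyzer stops its pre-trigger capture. */
static void trigger_fire(void)
{
    fake_pio_output |= TRIGGER_PIN_MASK;
}

/*
 * Called at the point where the "scheduler lock not zero" check fails.
 * lock_as_seen is the value the failing code observed.
 * Returns 1 if the trigger was fired, 0 otherwise.
 */
static int check_lock_and_trigger(uint32_t lock_as_seen)
{
    if (lock_as_seen != 0) {
        trigger_fire();   /* pulse first, so the capture ends right here */
        return 1;         /* the debugger breakpoint would sit on this line */
    }
    return 0;
}
```

Firing the pin before the breakpoint is hit matters: the analyzer then
holds the address trace leading up to the failure, not the debugger's
own activity.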
Wim found that at bug time the first 5 registers of the ARM were always
wrong, and always held the same values. Those values come from some
stack: the bug causes those 5 registers not to be restored correctly
at a context switch. Register 2 (r2) contains the scheduler lock, and is
indeed not zero at bug time (as reported by the assertion), but when
reading the scheduler lock from its address in SRAM, it was correctly
zero! r2 was always 0xFFDF_FFDF.
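
The reasoning above (register copy wrong, memory copy right) can be
written down as a small check. This is only an illustrative sketch: the
names lock_diag and diagnose_lock are hypothetical, and the two values
would have to be read out by hand with the debugger.

```c
#include <stdint.h>

/* Value observed in r2 at bug time, as reported in this post. */
#define BAD_R2_PATTERN 0xFFDFFFDFu

typedef enum {
    LOCK_OK,             /* register and memory agree, lock is zero      */
    LOCK_REALLY_HELD,    /* register and memory agree, lock is non-zero  */
    REGISTER_CORRUPTED   /* register disagrees with the value in memory  */
} lock_diag;

/*
 * Compare the lock value held in a register with the value read back
 * from the lock's address in SRAM. If the two differ, the register copy
 * cannot be trusted, which points at a save/restore problem rather than
 * at a genuinely held scheduler lock.
 */
static lock_diag diagnose_lock(uint32_t r2_value, uint32_t lock_in_memory)
{
    if (r2_value != lock_in_memory) {
        return REGISTER_CORRUPTED;
    }
    return (lock_in_memory == 0) ? LOCK_OK : LOCK_REALLY_HELD;
}
```

With the values we saw, diagnose_lock(BAD_R2_PATTERN, 0) falls in the
REGISTER_CORRUPTED case.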
Wim is convinced that the /hal/arm/arch/.../src/vectors.S code contains
the bug.
Our eCos tree is from 2006-02-15, so after the bugfix of 2006-02-06 from
Sergei Organov. But we think a similar bug is still present.
Mark: the comment in /hal/arm/at91/var/.../cdl/hal_arm_at91.cdl about
"CYGHWR_HAL_ARM_AT91_FIQ" is wrong. This is (more or less) the correct
" Enable this option if you want to use the FIQ. Interrupts in
may not be interrupted. Therefore, it is needed to handle FIQ
interrupts in the normal way, i.e. a FIQ interrupt must be
as a normal IRQ using the highest priority"
During debugging with the JTAG monitor BDI2000, we often saw spurious
interrupts. But we checked the hardware with an oscilloscope, and the
interrupts are clean. I see 3 possible reasons for the spurious interrupts:
- caused by the monitoring,
- or caused by that bug,
- or caused by using a level-sensitive interrupt.
When I have some time, I will check the last one by using an
edge-sensitive IRQ instead, and I will check the second one now that we
are not using the FIQ anymore.
It is up to my boss to decide if he wants to spend any more money trying
to solve this bug. It will also depend on whether I will be able to use
the workaround for all versions of our platforms...
I could also try to solve it for fun in my free time of course ;-)