[ECOS] Re: RedBoot DHCP failure due to race condition.

Grant Edwards grant.b.edwards@gmail.com
Fri Mar 18 21:19:00 GMT 2011


On 2011-03-18, Gary Thomas <gary@mlbassoc.com> wrote:
> On 03/18/2011 02:39 AM, Tarmo Kuuse wrote:
>> On 18.03.2011 01:18, Grant Edwards wrote:
>>> What I'm having problems with is the DHCP implementation in the
>>> Redboot bootloader. Redboot has its own stripped-down polled-mode
>>> network "stack" and implementations of things like BOOTP/DHCP client,
>>> Telnet server, and TFTP client.
>>>
>>> Redboot's "DHCP" code started out as purely a BOOTP client, and it
>>> looks like when DHCP was added a few obscure failure modes were
>>> overlooked.
>>
>> Ok, I was not paying enough attention. Sorry.
>>
>> Yes, I remember seeing similar problems in RedBoot DHCP client over a year ago. It's not exactly production quality code (to put it mildly).
>>
>> Looking at some old stuff I see following major problems:
>>
>> 1. The race. Any received BOOTREPLY packets are copied into the internal
>>    buffer before they are verified to be valid and destined for us. A
>>    BOOTREPLY packet sent to somebody else will overwrite our current
>>    packet. Lines 92-97 need to be moved 10 lines lower, after the
>>    verification code. This may not fix all problems, but it should fix
>>    the race that you described.
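A minimal sketch of the reordering described in item 1, assuming a simplified header struct: validate the reply first, and only then copy it into the shared buffer. `struct bootp_hdr`, `rx_reply`, `reply_is_ours` and `handle_bootp_packet` are hypothetical names, not RedBoot's own, and the sname, file and options fields are omitted.

```c
#include <stdint.h>
#include <string.h>

#define BOOTREPLY 2

/* Hypothetical, simplified view of the BOOTP/DHCP fixed header
 * (field order follows RFC 951 / RFC 2131; sname, file, options omitted). */
struct bootp_hdr {
    uint8_t  op;          /* 1 = BOOTREQUEST, 2 = BOOTREPLY */
    uint8_t  htype;       /* hardware type, 1 = Ethernet */
    uint8_t  hlen;        /* hardware address length, 6 for Ethernet */
    uint8_t  hops;
    uint32_t xid;         /* transaction ID */
    uint16_t secs;
    uint16_t flags;
    uint32_t ciaddr, yiaddr, siaddr, giaddr;
    uint8_t  chaddr[16];  /* client hardware address */
};

/* Shared buffer that the polling loop reads (hypothetical). */
static struct bootp_hdr rx_reply;
static int rx_reply_valid;

/* True only for a BOOTREPLY carrying our transaction ID and our MAC. */
static int reply_is_ours(const struct bootp_hdr *p, uint32_t xid,
                         const uint8_t mac[6])
{
    return p->op == BOOTREPLY &&
           p->htype == 1 && p->hlen == 6 &&
           p->xid == xid &&
           memcmp(p->chaddr, mac, 6) == 0;
}

/* Packet handler: verify first, copy second. */
static void handle_bootp_packet(const struct bootp_hdr *p, uint32_t xid,
                                const uint8_t mac[6])
{
    if (!reply_is_ours(p, xid, mac))
        return;                      /* someone else's reply: buffer untouched */
    memcpy(&rx_reply, p, sizeof rx_reply);
    rx_reply_valid = 1;
}
```

With this ordering a BOOTREPLY addressed to another client is dropped before it can touch `rx_reply`, so the polling loop only ever sees a reply that passed the checks.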

There's a different race condition that moving the copy won't fix.  It
involves two DHCP servers, one of which ACKs the REQUEST while the
other NAKs it 1ms later.  To really fix the problem, the copy code needs
to be state-aware and packet-type-aware. [I don't know why there are
two DHCP servers or why they act like that -- but they do -- and
Redboot seems to be the only client that has problems with that
setup.]

Is there any reason why the packet handler can't do things like

 1) send the REQUEST packet upon receipt of the OFFER?

 2) unregister the packet handler upon receipt of the ACK?

If the state machine were handled in one place instead of two, it would
make things a lot simpler.
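One way to get a single state machine, sketched under assumptions: the packet handler has already checked xid and chaddr, and it can read the DHCP message type (option 53) of each reply. The enum and function names are hypothetical. Treating the first ACK or NAK as final, so that a NAK arriving 1ms after an ACK is dropped, is a policy choice for the two-server setup described above rather than something RFC 2131 requires.

```c
/* DHCP message types (option 53 values from RFC 2132) used here. */
enum dhcp_msg   { DHCPOFFER = 2, DHCPACK = 5, DHCPNAK = 6 };
enum dhcp_state { DHCP_SELECTING, DHCP_REQUESTING, DHCP_BOUND, DHCP_FAILED };

/* What the (hypothetical) packet handler should do with the packet. */
enum dhcp_action {
    DHCP_ACT_IGNORE,        /* drop it, do not copy it */
    DHCP_ACT_SEND_REQUEST,  /* copy the OFFER and transmit DHCPREQUEST now */
    DHCP_ACT_FINISH         /* copy the ACK/NAK and unregister the handler */
};

/* The only place where the client state advances. Called from the packet
 * handler for each reply that already passed the xid/chaddr checks. */
static enum dhcp_action dhcp_on_reply(enum dhcp_state *st, int msg_type)
{
    switch (*st) {
    case DHCP_SELECTING:
        if (msg_type == DHCPOFFER) {
            *st = DHCP_REQUESTING;
            return DHCP_ACT_SEND_REQUEST;
        }
        break;              /* stray ACK/NAK: no REQUEST has been sent yet */
    case DHCP_REQUESTING:
        if (msg_type == DHCPACK) {
            *st = DHCP_BOUND;
            return DHCP_ACT_FINISH;
        }
        if (msg_type == DHCPNAK) {
            *st = DHCP_FAILED;
            return DHCP_ACT_FINISH;
        }
        break;              /* late OFFERs from other servers */
    case DHCP_BOUND:
    case DHCP_FAILED:
        break;              /* first ACK/NAK wins; later replies are dropped */
    }
    return DHCP_ACT_IGNORE;
}
```

The handler would copy the packet and send the REQUEST on `DHCP_ACT_SEND_REQUEST`, copy it and unregister itself on `DHCP_ACT_FINISH`, and do nothing on `DHCP_ACT_IGNORE`; the polling loop then only waits for `DHCP_BOUND`, `DHCP_FAILED` or a timeout.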

>> 2. The retry counter counts remaining retry attempts. I vaguely
>>    recall that it does not work, mostly because it doesn't really
>>    care: line 232 always tops up the "remaining retries" counter
>>    even if the last attempt failed, causing a never-ending loop.

Yup.  Seen that one happen too.

>>    There used to be a similar assignment before line 234, but I don't
>>    see it in current CVS.
>>
>> I have forgotten the details by now so don't take this as reliable
>> information.

That pretty much agrees with the behavior I've seen and my analysis of
the code.
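For item 2, a minimal sketch of a retry loop whose counter can only go down, so a failed attempt never tops it up again. `dhcp_with_retries`, `attempt_fn` and the `ATTEMPT_*` results are hypothetical names; `scripted_attempt` merely replays canned results so the loop can be exercised without a network.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome of one DISCOVER/REQUEST exchange. */
enum attempt_result { ATTEMPT_OK, ATTEMPT_TIMEOUT, ATTEMPT_NAK };

typedef enum attempt_result (*attempt_fn)(void *ctx);

/* Try at most max_tries times. 'remaining' is decremented before every
 * attempt and never reset inside the loop, so the loop always terminates. */
static bool dhcp_with_retries(attempt_fn attempt, void *ctx,
                              int max_tries, int *tries_used)
{
    int remaining = max_tries;

    *tries_used = 0;
    while (remaining > 0) {
        --remaining;
        ++*tries_used;
        if (attempt(ctx) == ATTEMPT_OK)
            return true;
        /* TIMEOUT or NAK: try again, but do NOT refill 'remaining' */
    }
    return false;
}

/* Stand-in for the network exchange: replays a fixed list of results,
 * then keeps reporting timeouts. */
struct script {
    const enum attempt_result *results;
    size_t n, pos;
};

static enum attempt_result scripted_attempt(void *ctx)
{
    struct script *s = ctx;
    return s->pos < s->n ? s->results[s->pos++] : ATTEMPT_TIMEOUT;
}
```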

>> Other, minor issues are:
>>
>> 3. DHCP transaction ID (XID) is not unique. If you start up multiple devices simultaneously, they can end up with the same XID and their transactions get mixed up.
>>
>> 4. XID is not verified. It accepts any response it gets, as long as the MAC address matches.
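Items 3 and 4 could be addressed roughly as follows. RFC 2131 describes the xid as a random number chosen by the client; in this sketch `ticks` stands for whatever free-running timer count the platform offers, the function names are hypothetical, and FNV-1a is just one cheap mixing function picked for illustration. Mixing in the MAC makes collisions between boards unlikely, not impossible.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 32-bit FNV-1a over a byte buffer, continuing from hash state 'h'. */
static uint32_t fnv1a(uint32_t h, const uint8_t *p, size_t n)
{
    while (n--) {
        h ^= *p++;
        h *= 16777619u;              /* FNV 32-bit prime */
    }
    return h;
}

/* New transaction ID: differs between boards (MAC) and between
 * transactions on one board (timer ticks plus a per-call sequence number). */
static uint32_t dhcp_new_xid(const uint8_t mac[6], uint32_t ticks)
{
    static uint32_t seq;
    uint32_t s = ++seq;
    uint8_t t[8];

    for (int i = 0; i < 4; i++) {
        t[i]     = (uint8_t)(ticks >> (8 * i));
        t[4 + i] = (uint8_t)(s >> (8 * i));
    }
    uint32_t h = fnv1a(2166136261u, mac, 6);   /* FNV offset basis */
    return fnv1a(h, t, sizeof t);
}

/* A reply is ours only if BOTH the xid and the MAC match. */
static bool dhcp_reply_matches(uint32_t reply_xid, const uint8_t reply_chaddr[6],
                               uint32_t our_xid, const uint8_t our_mac[6])
{
    return reply_xid == our_xid && memcmp(reply_chaddr, our_mac, 6) == 0;
}
```

A fresh xid would be drawn for each new DISCOVER, and every reply would have to pass `dhcp_reply_matches` before it is accepted.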
>
> Feel free to contribute "production quality code" to the project, rather
> than just moaning about it!

I'm working on it.

I was just looking for some confirmation that the problems I'm
perceiving aren't just due to my lack of understanding of how the code
is supposed to work.

-- 
Grant Edwards               grant.b.edwards        Yow! My uncle Murray
                                  at               conquered Egypt in 53 B.C.
                              gmail.com            And I can prove it too!!
