This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: Ignoring failures and altering behavior
On 17/10/2018 11:03, Florian Weimer wrote:
> * Adhemerval Zanella:
>
>>> But I think there is a larger question here: Should we keep running at
>>> all cost, possibly giving quite different results, or is it better to
>>> actually report the errors we encounter and stop?
>
>> My wild guess is that these kinds of errors are usually contingency ones that
>> are usually taken for granted or handled as intermittent. Do you have any
>> bug reports with such issues?
>
> I extracted this from a public Google bug report:
>
> <https://sourceware.org/bugzilla/show_bug.cgi?id=22041>
>
> Based on the information they provided, it matches the failure more
> closely than the bug they fixed with the patches they posted.
>
> I need to think about it for a bit, but I don't immediately recall that
> I have encountered the issue myself.
>
>> In any case, not reporting issues to the user is not a good policy, IMHO.
>> It implies an API contract that cannot fail, when in fact the API just
>> changes to different semantics in case of failure.
>
> The downside could be that if you have an unreachable (NFS) directory on
> your ld.so search path, you can't launch any programs anymore. Or if
> /etc/nsswitch.conf is corrupted (leading to EIO errors etc. when reading
> it), you can no longer log in over SSH. So in some cases, the right
> choice could be tough.
>
> But in general, my feeling is that we paper over far too many errors.
I think *silently* changing the API semantics should be avoided. Either
we document the fallback as the default behaviour and log it (even if only
in debug mode), or we keep it as the default and add an option to
assert or return an error in the failure case. Both examples you cited
show how difficult it can be for a system administrator to debug such
failures without direct error indications.
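
As a rough sketch of the second option (a fallback that stays the default,
plus a strict mode that reports the error), under the assumption that a
missing file means "not configured" while any other errno is a real
failure. All names here (load_config, load_policy, load_result) are
hypothetical and are not glibc interfaces:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical policy knob, not a glibc interface.  */
enum load_policy { LOAD_FALLBACK_AND_LOG, LOAD_STRICT };

/* Hypothetical result codes.  */
enum load_result { LOAD_OK, LOAD_USED_DEFAULTS, LOAD_FAILED };

/* Try to open PATH.  A missing file (ENOENT) is treated as "not
   configured" and falls back to defaults without a message.  Any other
   error is either logged and papered over, or reported to the caller,
   depending on POLICY.  The errno value seen is stored in *SAVED_ERRNO
   (0 on success).  */
static enum load_result
load_config (const char *path, enum load_policy policy, int *saved_errno)
{
  FILE *fp = fopen (path, "r");
  if (fp != NULL)
    {
      /* Real parsing would go here.  */
      fclose (fp);
      *saved_errno = 0;
      return LOAD_OK;
    }

  *saved_errno = errno;
  if (*saved_errno == ENOENT)
    return LOAD_USED_DEFAULTS;

  if (policy == LOAD_STRICT)
    return LOAD_FAILED;

  /* Default behaviour: keep running, but leave a trace for the admin.  */
  fprintf (stderr, "load_config: %s: %s; using built-in defaults\n",
           path, strerror (*saved_errno));
  return LOAD_USED_DEFAULTS;
}
```

With LOAD_FALLBACK_AND_LOG the caller still gets a working result, but the
administrator has something to grep for; with LOAD_STRICT the same
condition becomes a visible error.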
>
>> GNU guidelines usually do not set hard limits on APIs, so I think it is a
>> fair expectation that, depending on how a function is used, resource
>> acquisition may fail. I usually see giving the user an option to actually
>> handle these issues as better than silently ignoring them (we might still
>> keep the current policy of ignoring certain issues as the default semantics).
>
> Sure, but if we are too picky, then the user might not see anything
> because the system does not boot. 8-)
Yes, but I think this is usually an indication of a fragile setup of system
defaults and/or organization. Taking the example of the unreachable NFS
directory, if a system administrator relies on such a configuration, it is
expected that it might fail due to a myriad of issues. The best course of
action, IMHO, is to at least give an easier way to *debug* it.
>
>> I also think each error case might require a different answer, depending
>> on the case. For NSS service loading, for instance, one option might be to
>> syslog failures (as for NIS) and add a config option to always return
>> failure. The gconv/iconv case might be more tricky, since some uses on top
>> of the cache conf load define their semantics as 'no failure is expected'.
>
> For gconv/iconv, these appear to be bugs. We need to fix fwide to be
> able to return errors. At least POSIX clearly describes how to
> communicate errors to the caller. For everything else, we already have
> a clear way to report errors, I think. Downwards from gconv/iconv, the
> issue is dlopen, of course, where we cannot tell a resource allocation
> failure from a missing DSO (as in the NSS bug mentioned above).
>
> Thanks,
> Florian
>
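
To illustrate the dlopen point from the caller's side: on failure dlopen
returns NULL, and the only detail available is the human-readable string
from dlerror. The sketch below uses a hypothetical helper, try_load, which
is not a glibc interface; it only shows that the caller receives text, not
an error code that would separate a missing DSO from a resource allocation
failure. On glibc older than 2.34 this needs to be linked with -ldl.

```c
#include <dlfcn.h>
#include <stdio.h>

/* Hypothetical helper, not a glibc interface.  Returns 0 if NAME could
   be loaded and -1 otherwise.  On failure the dlerror text is copied
   into MSG; that string is all the caller learns about the cause.  */
static int
try_load (const char *name, char *msg, size_t msglen)
{
  void *handle = dlopen (name, RTLD_NOW);
  if (handle != NULL)
    {
      dlclose (handle);
      if (msglen > 0)
        msg[0] = '\0';
      return 0;
    }

  /* No error code is available here, only a message.  */
  const char *err = dlerror ();
  snprintf (msg, msglen, "%s",
            err != NULL ? err : "unknown dlopen failure");
  return -1;
}
```

A caller that wants to fall back only when the DSO is absent has nothing
reliable to test here, short of parsing the message.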