This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: NSS error reporting (bug 20532)

From: Carlos O'Donell <carlos at redhat dot com>
To: Florian Weimer <fweimer at redhat dot com>, GNU C Library <libc-alpha at sourceware dot org>
Date: Fri, 4 Aug 2017 09:52:02 -0400
Subject: Re: NSS error reporting (bug 20532)
Authentication-results: sourceware.org; auth=none
References: <c128f474-992e-c7de-5fae-a2437778c1eb@redhat.com> <87e3d20e-5a51-93ba-6276-0f5305836b2c@redhat.com> <02cabe4c-f2e8-03f7-8307-71cf2aedace5@redhat.com> <a2d5b334-e02f-b999-6605-c1d4ff114250@redhat.com> <7f631668-2ac2-ebfe-1552-b94479027cd4@redhat.com>

On 08/04/2017 05:54 AM, Florian Weimer wrote:
> On 08/03/2017 09:28 PM, Carlos O'Donell wrote:
>> On 08/03/2017 01:18 PM, Florian Weimer wrote:
>>> On 08/03/2017 06:49 PM, Carlos O'Donell wrote:
>>>> On 08/03/2017 12:28 PM, Florian Weimer wrote:
>>>>> (L) Carlos added NSS_STATUS_NOTFOUND with *errnop equals 0 as a
>>>>> documented special case to the manual, in commit
>>>>> d4e301c5c65393837e438b6d81feabfbfde7b9c7.  This contradicts (A).
>>>>> NSS_STATUS_NOTFOUND is handled implicitly by __nss_next2, which does not
>>>>> have access to the errno value, so I do not understand how this could work.
>>>
>>>> I must assume that (A) is not quite correct. I had two reproducers where
>>>> errno was propagated to the caller, and did result in observable differences?
>>>
>>> Can you dig up the details?  And what was fixed in response to this
>>> clarification?
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=988068
>>
>> SSSD, installed, client setup, but no server running was returning
>> NSS_STATUS_UNAVAIL / ENOENT, and that results in the ENOENT being
>> propagated to the caller. This was not the intent. The intent was
>> for the SSSD NSS service module to be transparent to the user if
>> sssd was not running. So I had to clarify a class of NSS service module
>> was was intalled, answering replies, lacked a connected data source,
>> but wanted to return a no-error result that would be retried later.
>>
>> The example reproducer was as simple as:
>>
>> #include <sys/types.h>
>> #include <pwd.h>
>> #include <stdio.h>
>>
>> int main(int argc, char* argv[]) {
>>   struct passwd pwd;
>>   char buf[4096];
>>   int err;
>>   struct passwd *res;
>>
>>   err = getpwnam_r(argv[1], &pwd, buf, 4096, &res);
>>
>>   printf("<%s> err: <%d>\n", argv[1], err);
>>   return 0;
>> }
> 
> This reproducer does not match the output below.

I took liberties to cut-and-paste from private notes, and I had a different
reproducer at the time, but it illustrates the issue.

> I think there are several issues in play here.
> 
> (i) NSS_STATUS_UNAVAIL must come with an errno error code on the NSS
> module interface.  If there is no error code, NSS_STATUS_UNAVAIL is
> likely the wrong result code.

Correct, UNAVAIL is a system configuration problem, and needs an error
code, and this is not a configuration problem, it is simply that the
data source is transiently not available.

> (ii) getpwnam_r (and all of get*_r) uses &errno as the errnop pointer,
> and calls __set_errno explicitly (even to 0).  A lot of code relies on
> the errno update, which is not required by POSIX.

I think this is OK, as long as the NSS framework saves and restores errno
if there was no error, making the usage of errno transparent to the user,
and semantically equivalent to the POSIX requirements.

> (iii) As a result, getpwnam (without _r) does not leave errno untouched
> on success, which is required to tell the cause of the getpwnam result:
> error lack of user.

This is also a bug, and we should save/restore errno, and modify it only 
in the event of an error, which would leave us with a POSIX correct 
implementation, because if errno was zero, and we had no error, it would
remain zero.

> (getpwnam follows the readdir interface convention here: Set errno to
> zero before the call, and if it is still zero, no error happened.)

Agreed.

>> With SSSD using status==NSS_STATUS_TRYAGAIN errno==EAGAIN:
>>
>> getgrent() OK group(0) = root 
>> getgrent() OK group(1) = bin 
>> getgrent() OK group(2) = daemon 
>> ...
>> getgrent: Resource temporarily unavailable
>> getgrent error 11 
> 
> Note: Now we are talking about getgrent.
> 
> get*ent (in the form of __nss_getent_r) has both bugs (ii) and (iii).

OK.

> For getgrent, the case for readdir-like behavior is even clearer.

OK.

>> It should have been clarified to say "errno is left untouched."
> 
> Untouched by whom?  The NSS module or the framework?  The framework
> currently clobbers errno unconditionally for NSS_STATUS_SUCCESS and
> NSS_STATUS_NOTFOUND in a few places.

Assertions:

* The NSS module refers to an implementation of the NSS services in
  a dynamically loaded library. When we talk about module we talk
  only about the code that implements the low-level functions.

* The NSS framework refers to the implementation of the API wrapper
  which knows how to lookup the required NSS services and call their
  modules.

Let me clarify:

* The NSS framework should save and restore errno in the event there
  is no error, and should only set outbound errno to an error when
  the internal code or NSS service returns a non-zero errno as part
  of an external API contract.

* Internally the NSS framework should start operation with a zero
  errno (ignoring whatever value the user set). This zero errno is
  important such that NSS services can decide internally if errors
  occurred or not. For example we could set errno to zero, call the
  NSS service module function, and observe if it failed or not

This is overly simplistic I think, but I'm trying to explain and
expand my expectations of the interface as I've learned them over
the years.

We should continue to flesh this out until we understand all the cases.

>> Should we have saved and restored errno?
> 
> We save and restore errno in the NSS framework for NSS_STATUS_NOTFOUND,
> so that we get POSIX semantics for errno.

I think that should be expanded to always save and restore errno, to
isolate the framework and service from user-set non-zero errno values.

> We could also pass a local int variable instead of &errno for errnop,
> and leave it to the NSS module not to clobber errno, but that will
> result in a lot of code duplication, and the NSS_STATUS_NOTFOUND code is
> already unambiguous in this regard.  I now think we should not try to
> separate errno and *errnop because so much code relies on errno updates
> even for the *_r functions (see glob and nscd for examples), and keeping
> the aliasing (even extending it to getaddrinfo) will increase
> compatibility with existing NSS Modules.

I agree, but I also think this will not solve all of our problems.

Keep in mind that on the horizon we have static/dynamic namespace issues
coming, which mean that at the boundary between a static namespace and
the dynamic namespace we will need some new API to help us carry the global
data, like errno, across the boundary. If such an API served dual duty
to isolate errno, then it might be a win-win. In NSS we are in a position
where the NSS module boundary happens to be the same static/dynamic boundary
case we want to fix later.

> The errno saving cannot happen at the level of the *_r functions because
> of the ERANGE case, which requires an *errnop update.  In particular,
> what can happen is that we get NSS_STATUS_TRYAGAIN/ERANGE first,
> followed by NSS_STATUS_NOTFOUND once we retry with a larger buffer.
> This can happen with nss_files and long comment lines, for instance.

The level of the *_r functions is nss/getXXbyYY_r.c, and I don't see why
we could not intelligently save and restore errno at that level?

Given your example it would look like this, and some of this is defensive
errno saving because we are going to have flawed service modules that
set errno explicitly.

nss/getXXbyYY_r.c
* save errno.
* set errno to zero
* resolve service to nss_files
* call nss_files low-level functions: e.g. _nss_files_getpwnam_r (nss_files/files-pwd.c)
  ... and get back NSS_STATUS_TRYAGAIN/ERANGE.
* Since ERANGE we flag that error has occured
* if (errno == 0) then restore errno.

nss/getXXbyYY.c
* call reentrant version

In the non-reentrant version, either errno is seen as unmodified in the event
of success, or errno is set to the error returned by the underlying NSS module.

In this kind of defensive design we can tolerate incorrectly written NSS modules
that write 0 to errno without ruining the POSIX requirements that errno be
unmodified until the next function modifies it according to its API contract.

We could also detect modules setting errno to zero by setting a non-zero errno
and seeing what fails.

In summary
==========

- My opinion is that we need to focus on some high-level design goals and see if
  we can apply those to the framework and modules with success.

- How do we come up with the high-level design goals? Probably a combination of
  looking at the flaws you drafted, and the API contracts we have (Zack's suggestion),
  and harmonizing those in some way.

-- 
Cheers,
Carlos.

Follow-Ups:
- Re: NSS error reporting (bug 20532)
  - From: Florian Weimer

References:
- NSS error reporting (bug 20532)
  - From: Florian Weimer
- Re: NSS error reporting (bug 20532)
  - From: Carlos O'Donell
- Re: NSS error reporting (bug 20532)
  - From: Florian Weimer
- Re: NSS error reporting (bug 20532)
  - From: Carlos O'Donell
- Re: NSS error reporting (bug 20532)
  - From: Florian Weimer

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]