Missing <aio_misc.h> exported header ?

Adhemerval Zanella adhemerval.zanella@linaro.org
Sat Sep 7 18:56:00 GMT 2019



On 06/09/2019 08:25, Xavier Roche wrote:
> Hi!,
> 
> On Wed, Sep 4, 2019 at 9:28 PM Adhemerval Zanella
> <adhemerval.zanella@linaro.org> wrote:
>> Because it is an internal-only implementation file.  The Linux one, for
>> instance, calls syscall directly using internal macros (INTERNAL_SYSCALL)
>> which are not meant to be exported.
> 
> My understanding is that on Linux the kernel version was bypassed, and
> the pure glibc version was used?

Internally, the POSIX AIO code is organized so that an environment which
supports it natively can reimplement the required internal API instead of
using the generic pthread-based one.  But currently only Hurd and Linux are
actually supported, which means only the pthread version is used.  The Linux
version also assumes NPTL, so for some primitives it calls futex operations
directly as an optimization.

Now, regarding the code organization: the Linux aio_misc.h implementation
uses include_next, which in turn includes sysdeps/nptl/aio_misc.h, which
then includes sysdeps/pthread/aio_misc.h (the order is determined by the
sysdeps mechanism through the Implies files).
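
Roughly, the chain looks like this (a paraphrased sketch, not the literal
glibc headers; only the #include_next pattern matters here):

---
/* sysdeps/unix/sysv/linux/aio_misc.h: Linux-specific bits, then fall
   through to the next aio_misc.h in the sysdeps search path, which is
   sysdeps/nptl/aio_misc.h.  */
#include_next <aio_misc.h>

/* sysdeps/nptl/aio_misc.h: NPTL-specific bits, then fall through
   again, this time to the generic sysdeps/pthread/aio_misc.h.  */
#include_next <aio_misc.h>
---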

> 
> Could the glibc version expose this feature? The fsync call is
> surprisingly missing in the lio_listio() flavor.

It would be possible to include them as a GNU extension, but since our
current implementation is just synchronous I/O dispatched to helper threads,
you won't gain much given the lack of proper kernel support (aio(7) briefly
describes the issues with a userland AIO implementation like the one done
by glibc).
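
Conceptually, every queued request ends up being serviced by a helper thread
doing the plain synchronous call, roughly like the sketch below (an
illustration only, not the actual glibc source; service_request is a made-up
name):

---
#include <aio.h>
#include <unistd.h>
#include <sys/types.h>

/* Illustration only: roughly what a worker thread ends up doing for
   each queued control block.  */
static ssize_t
service_request (struct aiocb *cb, int opcode)
{
  switch (opcode)
    {
    case LIO_READ:
      return pread (cb->aio_fildes, (void *) cb->aio_buf,
                    cb->aio_nbytes, cb->aio_offset);
    case LIO_WRITE:
      return pwrite (cb->aio_fildes, (const void *) cb->aio_buf,
                     cb->aio_nbytes, cb->aio_offset);
    default:
      /* The internal sync requests boil down to plain fsync/fdatasync.  */
      return fsync (cb->aio_fildes);
    }
}
---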

Even the kernel support has its drawbacks [1], and I am not sure how much it
has matured over the last three years.  You can use it through libaio [2],
and if I recall correctly some RH-based distros applied out-of-tree patches
to hook it into the glibc POSIX AIO implementation.
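
If you want to experiment with the kernel interface anyway, a minimal libaio
sketch for batching fsync requests looks roughly like this (the scratch file
name and NFDS are made up for illustration; older kernels reject the fsync
commands with EINVAL, and you need to link with -laio):

---
#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NFDS 10

int
main (void)
{
  io_context_t ctx = 0;
  struct iocb cbs[NFDS];
  struct iocb *cbp[NFDS];
  struct io_event events[NFDS];

  int ret = io_setup (NFDS, &ctx);
  if (ret < 0)
    {
      /* libaio wrappers return negative errno values.  */
      fprintf (stderr, "io_setup: %s\n", strerror (-ret));
      return EXIT_FAILURE;
    }

  /* For the sketch, queue NFDS fsync requests against one scratch
     file; a real program would use its own descriptors.  */
  int fd = open ("aio_fsync_demo.tmp", O_RDWR | O_CREAT, 0600);
  if (fd < 0)
    {
      perror ("open");
      return EXIT_FAILURE;
    }

  for (int i = 0; i < NFDS; i++)
    {
      io_prep_fsync (&cbs[i], fd);
      cbp[i] = &cbs[i];
    }

  /* Submit all requests at once, then wait for all completions.  */
  int nr = io_submit (ctx, NFDS, cbp);
  if (nr < 0)
    fprintf (stderr, "io_submit: %s\n", strerror (-nr));
  else if ((ret = io_getevents (ctx, nr, NFDS, events, NULL)) < 0)
    fprintf (stderr, "io_getevents: %s\n", strerror (-ret));

  close (fd);
  io_destroy (ctx);
  return 0;
}
---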


> 
>> The LIO_DSYNC/LIO_SYNC are used to call the internal __aio_enqueue_request,
>> external usage should use aio_fsync. By "providing much better performances when
>> syncing a lot of small files", exactly which usage pattern are you referring to?
> 
> The program attached is a small benchmark program that, despite its
> rather basic nature, allows one to spot differences between different
> strategies:
> - sync every file with fsync or fdatasync
> - use aio
> - sync with a global syncfs call
> 
> For example, with 10 files of 10KB,
> g++ -Wall -Wextra -std=c++17 -O3 -g3 fsynctest.cpp -o fsynctest -lrt
> ./fsynctest 10000 10
> 
> I get the following consistent results with a standard consumer SSD:
> AVERAGE fsync: 22ms
> AVERAGE fdatasync: 21ms
> AVERAGE parallel(fsync): 3ms
> AVERAGE parallel(fdatasync): 3ms
> AVERAGE syncfs: 5ms
> AVERAGE sync: 5ms
> 
> The idea being that a single lio_listio() is better than several aio_fsync().

Which is not true for the glibc implementation: both calls are essentially the same.
For instance, using your benchmark as a base, with the following extra strategy:

---
                case SyncMode::ParallelAioFsync:
                case SyncMode::ParallelAioFdatasync: {
                    const bool isFsync = mode == SyncMode::ParallelAioFsync;
                    smode = isFsync ? "aio_parallel(fsync)" : "aio_parallel(fdatasync)";

                    std::vector<struct aiocb> syncs;
                    syncs.resize(nbfiles);

                    std::vector<struct aiocb*> psyncs;
                    psyncs.resize(nbfiles);

                    for (size_t i = 0; i < nbfiles; i++) {
                        syncs[i].aio_fildes = files[i].fd;
                        syncs[i].aio_sigevent.sigev_notify = SIGEV_NONE;
                        if (aio_fsync (isFsync ? O_SYNC : O_DSYNC, &syncs[i]) < 0) {
                            assert (!"aio_fsync failed");
                        }
                        psyncs[i] = &syncs[i];
                    }

                    bool go_on;
                    do {
                        aio_suspend(psyncs.data(), psyncs.size(), nullptr);
                        go_on = false;
                        for (auto& sync : psyncs) {
                            if (sync == nullptr) {
                                continue;  // result already collected
                            }
                            if (aio_error(sync) == EINPROGRESS) {
                                go_on = true;
                            } else {
                                // Collect the final status exactly once per request.
                                if (aio_return(sync) < 0) {
                                    assert(!"aio_fsync request failed");
                                }
                                // NULL entries are ignored by aio_suspend.
                                sync = nullptr;
                            }
                        }
                    } while (go_on);
                } break;
---

I see on my machine (also with a consumer grade SSD):

AVERAGE fsync: 75ms
AVERAGE fdatasync: 81ms
AVERAGE parallel(fsync): 8ms
AVERAGE parallel(fdatasync): 8ms
AVERAGE aio_parallel(fsync): 8ms
AVERAGE aio_parallel(fdatasync): 8ms
AVERAGE syncfs: 9ms
AVERAGE sync: 8ms

Basically, in the glibc implementation you have:

---
lio_listio (...) {
  return lio_listio_internal (...);
}

static int
lio_listio_internal (int mode, struct aiocb *const list[], int nent,
                     struct sigevent *sig)
{
  [...]

  /* Request the mutex.  */
  pthread_mutex_lock (&__aio_requests_mutex);

  for (cnt = 0; cnt < nent; ++cnt)
    if (list[cnt] != NULL && list[cnt]->aio_lio_opcode != LIO_NOP)
      { 
	[...]
	requests[cnt] = __aio_enqueue_request (...);
	[...]
      }

  [...]

  struct waitlist waitlist[nent];

  [...]

  else if (LIO_MODE (mode) == LIO_WAIT)
    {
      for (cnt = 0; cnt < nent; ++cnt)
	/* Add each element to the waitlist.  */
	[...]

      AIO_MISC_WAIT (result, total, NULL, 0);
    }

  /* Release the mutex.  */
  pthread_mutex_unlock (&__aio_requests_mutex);

  return result;
}
---

While aio_fsync and aio_suspend are:

---
int
aio_fsync (int op, struct aiocb *aiocbp)
{
  [...]
  return (__aio_enqueue_request (...)
          ? -1 : 0);
}

int
aio_suspend (const struct aiocb *const list[], int nent,
             const struct timespec *timeout)
{
  struct waitlist waitlist[nent];
  [...]

  pthread_mutex_lock (&__aio_requests_mutex);

  [...]

  for (cnt = 0; cnt < nent; ++cnt)
    if (list[cnt] != NULL)
      {
	/* Update the waitlist.  */
	[...]
      }

  result = do_aio_misc_wait (&cntr, timeout);
  /* ... which boils down to AIO_MISC_WAIT (result, *cntr, timeout, 1);  */

  /* Release the mutex.  */
  pthread_mutex_unlock (&__aio_requests_mutex);
---

You might see slightly less contention in the lio_listio case, because
__aio_enqueue_request needs a bit less synchronization there (the mutex is
recursive, so reacquiring it may issue fewer atomic operations), plus
slightly less CPU time due to fewer function calls, but for I/O-bound
workloads I do not think it will matter.
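
For completeness, wrapping that pattern into a portable batch-sync helper on
top of the existing interfaces takes only a few lines.  A minimal sketch
(fsync_all is a made-up name, not a glibc function):

---
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stddef.h>

/* Made-up helper: issue an aio_fsync for every descriptor and wait for
   all of them, mimicking what a LIO_WAIT-style batched sync would do.
   OP is O_SYNC or O_DSYNC; CBS must hold at least NFDS control blocks.  */
static int
fsync_all (int op, const int *fds, size_t nfds, struct aiocb *cbs)
{
  const struct aiocb *wait[nfds];

  for (size_t i = 0; i < nfds; i++)
    {
      cbs[i] = (struct aiocb) { .aio_fildes = fds[i] };
      cbs[i].aio_sigevent.sigev_notify = SIGEV_NONE;
      if (aio_fsync (op, &cbs[i]) < 0)
        return -1;
      wait[i] = &cbs[i];
    }

  for (;;)
    {
      /* Keep suspending until no request is still in progress.  */
      int pending = 0;
      for (size_t i = 0; i < nfds; i++)
        if (aio_error (&cbs[i]) == EINPROGRESS)
          pending = 1;
      if (!pending)
        break;
      if (aio_suspend (wait, (int) nfds, NULL) < 0 && errno != EINTR)
        return -1;
    }

  /* Collect the final status of each request exactly once.  */
  for (size_t i = 0; i < nfds; i++)
    if (aio_return (&cbs[i]) < 0)
      return -1;

  return 0;
}
---

With the current implementation this should behave essentially the same as a
hypothetical lio_listio sync extension, since both paths end up in
__aio_enqueue_request.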

I really don't think it is worth the development time to add a sync operation
extension to lio_listio in its current state.  You may try to bring it to the
Austin Group, but I think POSIX AIO will only be a viable API once we get
proper kernel support.

[1] https://lwn.net/Articles/671649/
[2] https://pagure.io/libaio


