This is the mail archive of the mailing list for the glibc project.

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

[PATCH] Documentation: update about adding syscalls

Add additional information on how to ensure that syscalls with structure
arguments are extensible and add a section about naming conventions to
follow when adding revised versions of already existing syscalls.

Co-Developed-by: Aleksa Sarai <>
Signed-off-by: Aleksa Sarai <>
Signed-off-by: Christian Brauner <>
 Documentation/process/adding-syscalls.rst | 82 +++++++++++++++++++----
 1 file changed, 70 insertions(+), 12 deletions(-)

diff --git a/Documentation/process/adding-syscalls.rst b/Documentation/process/adding-syscalls.rst
index 1c3a840d06b9..93e0221fbb9a 100644
--- a/Documentation/process/adding-syscalls.rst
+++ b/Documentation/process/adding-syscalls.rst
@@ -79,7 +79,7 @@ flags, and reject the system call (with ``EINVAL``) if it does::
 For more sophisticated system calls that involve a larger number of arguments,
 it's preferred to encapsulate the majority of the arguments into a structure
 that is passed in by pointer.  Such a structure can cope with future extension
-by including a size argument in the structure::
+by either including a size argument in the structure::
     struct xyzzy_params {
         u32 size; /* userspace sets p->size = sizeof(struct xyzzy_params) */
@@ -87,20 +87,56 @@ by including a size argument in the structure::
         u64 param_2;
         u64 param_3;
+    int sys_xyzzy(struct xyzzy_params __user *uarg);
+    /* in case of -E2BIG, p->size is set to the in-kernel size and thus all
+       extensions after that offset are unsupported. */
-As long as any subsequently added field, say ``param_4``, is designed so that a
-zero value gives the previous behaviour, then this allows both directions of
-version mismatch:
+or by including a separate argument that specifies the size::
- - To cope with a later userspace program calling an older kernel, the kernel
-   code should check that any memory beyond the size of the structure that it
-   expects is zero (effectively checking that ``param_4 == 0``).
- - To cope with an older userspace program calling a newer kernel, the kernel
-   code can zero-extend a smaller instance of the structure (effectively
-   setting ``param_4 = 0``).
+    struct xyzzy_params {
+        u32 param_1;
+        u64 param_2;
+        u64 param_3;
+    };
+    /* userspace sets @usize = sizeof(struct xyzzy_params) */
+    int sys_xyzzy(struct xyzzy_params __user *uarg, size_t usize);
+    /* in case of -E2BIG, userspace has to attempt smaller @usize values
+       to figure out which extensions are unsupported. */
+Which model you choose to use is down to personal taste. However, please only
+pick one (for a counter-example, see :manpage:`sched_getattr(2)`).
+Then, any extensions can be implemented by appending fields to the structure.
+However, all new fields must be designed such that their zero value results in
+identical behaviour to the pre-extension syscall. This allows for compatibility
+between different-vintage kernels and userspace, no matter which is newer:
+ - If the userspace is newer, then the older kernel can check whether the
+   fields it doesn't understand are zero-valued. If they are, then it can
+   safely ignore them (since any future extensions will be backwards-compatible
+   as described above). If they aren't, then the kernel doesn't support the
+   feature and can give an error (``-E2BIG`` is traditional).
+ - If the userspace is older, then the kernel can treat all extensions that
+   userspace is unaware of as having their zero-value (and effectively
+   zero-extend the userspace structure when copying it for in-kernel usage).
-See :manpage:`perf_event_open(2)` and the ``perf_copy_attr()`` function (in
-``kernel/events/core.c``) for an example of this approach.
+ - If they're the same version, just use the structure as-is.
+As with the simpler flag-only syscalls, you must always check that any unknown
+values for flag-like parameters in the passed structure are zeroed.
+It is also critical to ensure your syscall handles larger-sized arguments from
+the outset, otherwise userspace will have to do additional (fairly pointless)
+fallbacks for some old kernels. An example where this mistake was made is
+:manpage:`rt_sigprocmask(2)` (where any unknown-sized arguments are
+unconditionally rejected).
+To help implement this correctly, there is a helper function
+``copy_struct_from_user()`` which handles the compatibility requirements for
+you. For examples using this helper, see :manpage:`perf_event_open(2)` (which
+uses the embedded-size model) and :manpage:`clone3(2)` (which uses the
+separate-argument model).
 Designing the API: Other Considerations
@@ -173,6 +209,28 @@ registers.  (This concern does not apply if the arguments are part of a
 structure that's passed in by pointer.)
+Designing the API: Revisions of syscalls
+Syscalls that were not designed to be extensible or syscalls that use a flag
+argument for extensions running out of bits (e.g. :manpage:`clone(2)`)
+sometimes need to be replaced.
+If the revised syscall provides a superset (or a reasonably large subset, such
+as when a feature that turned out to be a design mistake is dropped) of the
+features of the old syscall it is common practice to give it the same name with
+a number appended. Examples for this include ``dup2``/``dup3``,
+``epoll_create``/``epoll_create1`` and others.
+For some syscalls the appended number indicates the number of arguments
+(``accept``/``accept4``) for others the number of the revision
+(``clone``/``clone3``, ``epoll_create``/``epoll_create1``). Recent discussions
+around new syscalls (such as ``clone3``) have pointed to a consensus around
+naming syscalls revisions by appending the number of the revision to the
+syscall name. That means, if you were to post a revised version of the
+``openat`` syscall it should be named ``openat2``.
 Proposing the API

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]