This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Re: [PATCH RFC] explicit_bzero, again


On 08/26/2015 05:34 PM, Adhemerval Zanella wrote:
> On 26-08-2015 18:13, Zack Weinberg wrote:
>> The *optimization* (replacing `explicit_bzero` with `memset` +
>> vacuous use) is limited to C programs and GCC.  The *API* works
>> just fine regardless of compiler.  I believe this is sufficient as
>> a starting point.  As and when appropriate ways to express a
>> vacuous use become available in other compilers, we can add them.
>
> Right, but do you know what kind of optimization the 'memory'
> clobber prevents that your suggestion allows?  I do understand that
> the 'memory' clobber is indeed a more restrictive memory barrier,
> but on most targets, avoiding a function call where possible is a
> much bigger gain than having some memory operations handled with
> more restrictive scheduling.

I found some time to look into this in detail, by doing experiments on
LibreSSL, which makes extensive use of explicit_bzero.  There are 77
object files in the LibreSSL 2.2.3-portable release which contain at
least one call to explicit_bzero, either directly or via the wrapper
OPENSSL_cleanse, not counting the definition of OPENSSL_cleanse
itself.  I compiled the entire library three times against glibc 2.19
(therefore, using LibreSSL's internal definition of explicit_bzero),
using gcc 5.2, with --disable-shared, as follows:

1) "funcall" -- baseline: all internal uses of the wrapper were
   replaced with direct calls to explicit_bzero (using a macro), but
   there were no other changes.

2) "memread" -- add to LibreSSL's compat/string.h my original proposal
   for the bits/string2.h optimization of explicit_bzero, which uses
   the asm ("" :: "m" (*(struct { char x[size]; } *)ptr)) construct.
   This is expected to produce better local optimization than either
   (1) or (3) when it works, but degenerates to (1) when it doesn't.

3) "memclob" -- instead of the above optimization, use
   asm ("" :: "r" (ptr) : "memory").  This can be applied
   unconditionally, but is expected to produce worse code than (2) in
   the cases where (2) works.  (Both constructs are sketched just
   below.)
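
For concreteness, here is a minimal sketch of how the constructs in
(2) and (3) might be wrapped around memset in GNU C.  The macro names
are mine, and this is only an illustration of the two barrier styles,
not the patch text:

    #include <string.h>

    /* (2) "memread": feed the cleared bytes to an empty asm as an
       input, so the memset cannot be deleted as a dead store, while
       the compiler's knowledge of all other memory is left intact.
       Requires GCC extended asm.  */
    #define BZERO_MEMREAD(ptr, size)                                \
      do {                                                          \
        memset ((ptr), 0, (size));                                  \
        __asm__ ("" : : "m" (*(struct { char x[size]; } *)(ptr)));  \
      } while (0)

    /* (3) "memclob": a full memory clobber.  Always applicable, but
       it tells the compiler that any memory may have changed.  */
    #define BZERO_MEMCLOB(ptr, size)                                \
      do {                                                          \
        memset ((ptr), 0, (size));                                  \
        __asm__ ("" : : "r" (ptr) : "memory");                      \
      } while (0)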

In the baseline compilation, 37 objects make a call to memset.  In
the "memread" compilation, the number of objects that call
explicit_bzero drops from 77 to 38, but the number of objects that
call memset only goes up by one.  This is partly because some of the
changed objects already called memset, and partly because the
"memread" optimization works in exactly the cases where GCC is most
likely to replace a call to memset with inline code (i.e. when the
size parameter is constant).  In the "memclob" compilation, there are
no surviving calls to explicit_bzero (as expected) and 63 object
files call memset.
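
In terms of the macros sketched above, the constant-size distinction
is simply this (illustrative; buf, ptr and len are placeholder names):

    char buf[40];
    BZERO_MEMREAD (buf, sizeof buf);  /* constant size: GCC can inline the clear     */
    BZERO_MEMREAD (ptr, len);         /* runtime length: usually stays a memset call */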

First off, I may have missed something, but it appears to me that the
"memclob" approach does not replace any *more* explicit_bzero calls with
inline code than "memread" does.  Again, this is expected, since the
"memread" optimization applies to the cases where GCC is most likely
to inline a call to memset.

Now let's look at a typical case where "memread" enables the compiler
to replace memset with inline code: this is _rs_stir in arc4random.c:

        movq    $0, 96(%rax)
        movq    rs(%rip), %rax
        movq    $984, (%rax)
 .L23:
-       movq    %rsp, %rdi
-       movl    $40, %esi
-       call    explicit_bzero
+       movq    $0, (%rsp)
+       movq    $0, 8(%rsp)
+       movq    $0, 16(%rsp)
+       movq    $0, 24(%rsp)
+       movq    $0, 32(%rsp)
        movq    rs(%rip), %rax
        movq    $0, (%rax)

With "memclob" instead, the same code is generated for the actual
memory clear, but we get slightly worse code for the rest of the
function, because it's now considered to require a frame pointer.
(- memread, + memclob; debug information stripped, slightly
descheduled to emphasize the semantic change)

 _rs_stir:
+       pushq   %r12
        pushq   %rbp
        pushq   %rbx
        movl    $40, %esi
-       subq    $56, %rsp
+       subq    $48, %rsp
        movq    %rsp, %rdi
+       movq    %rsp, %rbp
        movq    %fs:40, %rax
        movq    %rax, 40(%rsp)
        xorl    %eax, %eax
        call    getentropy
        cmpl    $-1, %eax
@@
        movq    rsx(%rip), %rdi
        leaq    64(%rdi), %rsi
        movq    %rsi, %rdx
        call    chacha_encrypt_bytes.constprop.2
        movq    rsx(%rip), %rbx
        xorl    %eax, %eax
 .L24:
-       movzbl  (%rsp,%rax), %edx
+       movzbl  0(%rbp,%rax), %edx
        xorb    %dl, 64(%rbx,%rax)
        addq    $1, %rax
        cmpq    $40, %rax
        jne     .L24
        cmpq    $0, rs(%rip)
-       leaq    64(%rbx), %rbp
+       leaq    64(%rbx), %r12
        movq    %rbx, %rdi
        je      .L46
 .L25:
-       movq    %rbp, %rsi
+       movq    %r12, %rsi
        call    chacha_keysetup.isra.0.constprop.3

and so on.
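
The register shuffling above is the generic cost of the full clobber:
the "memory" clobber invalidates everything the compiler knows about
memory, while the "m" input only adds a read of the cleared bytes.  A
toy example of that difference, using the macros sketched earlier
(get_secret and counter are hypothetical, not LibreSSL code):

    extern int counter;               /* hypothetical unrelated global  */
    extern void get_secret (char *);  /* hypothetical: fills the buffer */

    int
    after_memclob (void)
    {
      char buf[16];
      get_secret (buf);
      int before = counter;
      BZERO_MEMCLOB (buf, sizeof buf);
      /* counter must be reloaded: the clobber says any memory may
         have changed.  */
      return before + counter;
    }

    int
    after_memread (void)
    {
      char buf[16];
      get_secret (buf);
      int before = counter;
      BZERO_MEMREAD (buf, sizeof buf);
      /* counter can stay in a register: the asm only declares a read
         of buf's bytes.  */
      return before + counter;
    }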

Here's another example where memclob causes slightly worse register
allocation than memread.  In this case, I think it's mostly because
memset takes one more argument than explicit_bzero.  This is
tls_config_set_key_mem in tls/tls_config.c.

 tls_config_set_key_mem:
+       pushq   %r13
        pushq   %r12
        pushq   %rbp
        pushq   %rbx
+       movq    %rdx, %r13
-       movq    %rdx, %r12
+       movq    %rsi, %r12
        movq    %rdi, %rbx
+       subq    $8, %rsp
-       movq    80(%rdi), %rdi
+       movq    80(%rdi), %rbp
-       movq    %rsi, %rbp
-       testq   %rdi, %rdi
+       testq   %rbp, %rbp
        je      .L105
-       movq    88(%rbx), %rsi
+       movq    88(%rdi), %rdx
+       xorl    %esi, %esi
+       movq    %rbp, %rdi
-       call    explicit_bzero
+       call    memset
 .L105:
+       addq    $8, %rsp
        leaq    88(%rbx), %rsi
        leaq    80(%rbx), %rdi
-       movq    %r12, %rcx
+       movq    %r13, %rcx
+       movq    %r12, %rdx
        popq    %rbx
-       movq    %rbp, %rdx
        popq    %rbp
        popq    %r12
+       popq    %r13
        jmp     set_mem
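
For reference, the "one more argument" is just the fill value; on
x86-64 the two call shapes map onto argument registers like this
(key and keylen are placeholder names):

    explicit_bzero (key, keylen);  /* key in %rdi, keylen in %rsi            */
    memset (key, 0, keylen);       /* key in %rdi, 0 in %esi, keylen in %rdx */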

Finally, the large, complicated apps/ca.c reads in part

    pkey = load_key(bio_err, keyfile, keyform, 0, key, e, "CA private key");
    if (key)
            OPENSSL_cleanse(key, strlen(key));
    if (pkey == NULL) {
            /* load_key() has already printed an appropriate message */
            goto err;
    }

With "memread", GCC decides that the OPENSSL_cleanse operation belongs
out-of-line in .text.unlikely.  With either "memclob" or "funcall",
this does not happen.  This is a mis-optimization: I believe control
cannot actually reach this point with a NULL key, so the cleanse path
is not unlikely at all.  Still, it illustrates how "memread" exposes
more information to the compiler and lets it make more sophisticated
decisions.

I've tarred up all three build trees (with intermediate .s files
included) and put them on my website at
https://hacks.owlfolio.org/scratchpad/libressl-ebzero-experiments.tar.xz.
I invite other people to dig into them and discuss the effects of the
various choices.  (Please note that it is 53MB and I can't leave it
there forever due to hosting costs.)

It remains my opinion that the patch as I posted it (with an update to
the manual -- I will get to that in the next few days) is Good Enough
and we should not hold off on making the API available until the ideal
bits/string2.h optimization for it is possible.  I will be sending a
message to the gcc and clang mailing lists in the next few days to
open a discussion of that ideal, but I don't have any plans to
implement it myself.

zw

