[PATCH RFC] explicit_bzero, again
Zack Weinberg
zackw@panix.com
Mon Sep 7 17:19:00 GMT 2015
On 08/26/2015 05:34 PM, Adhemerval Zanella wrote:
> On 26-08-2015 18:13, Zack Weinberg wrote:
>> The *optimization* (replacing `explicit_bzero` with `memset` +
>> vacuous use) is limited to C programs and GCC. The *API* works
>> just fine regardless of compiler. I believe this is sufficient as
>> a starting point. As and when appropriate ways to express a
>> vacuous use become available in other compilers, we can add them.
>
> Right, but do you know what kind of optimization the 'memory'
> clobber avoids and your suggestion allows? I do understand that the
> 'memory' clobber is indeed a more restrictive memory barrier, but
> for most targets avoiding a function call where possible is a much
> bigger gain than having some memory operations handled with
> restrictive scheduling.
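(Before diving in, a recap of why any barrier is needed at all:
without one, GCC's dead-store elimination will simply delete a memset
of an object that is about to go out of scope. A minimal sketch, with
an illustrative function name:

    #include <string.h>

    void
    handle_secret (void)
    {
      char buf[64];
      /* ... read a secret into buf and use it ... */
      memset (buf, 0, sizeof buf);  /* dead store: GCC deletes it */
    }

explicit_bzero exists precisely to make that final store
unremovable.)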
I found some time to look into this in detail, by doing experiments on
LibreSSL, which makes extensive use of explicit_bzero. There are 77
object files in the LibreSSL 2.2.3-portable release which contain at
least one call to explicit_bzero, either directly or via the wrapper
OPENSSL_cleanse, not counting the definition of OPENSSL_cleanse
itself. I compiled the entire library three times against glibc 2.19
(therefore, using LibreSSL's internal definition of explicit_bzero),
using gcc 5.2, with --disable-shared, as follows:
1) "funcall" -- baseline: all internal uses of the wrapper were
replaced with direct calls to explicit_bzero (using a macro), but
there were no other changes.
2) "memread" -- add to libressl's compat/string.h my original proposal
for bits/string2.h optimization of explicit_bzero, which uses the
asm ("" :: "m" (*(struct { char x[size]; } *)ptr)) construct. This
is expected to produce better local optimization than either (1) or
(3) when it works, but degenerates to (1) when it doesn't work.
3) "memclob" -- instead of the above optimization, use
asm ("" :: "r" (ptr) : "memory"). This can be applied
unconditionally, but is expected to produce worse code than (2) in
the cases where (2) works.
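For reference, here is roughly what the two constructs look like when
wrapped around memset. This is a sketch under illustrative names, not
the text of the actual patch:

    #include <string.h>

    /* (2) "memread": after the memset, pretend to read the cleared
       bytes, so GCC cannot delete the store but remains free to
       optimize everything around it.  The cast to a struct with a
       variable-length member is a GNU extension. */
    static inline void
    bzero_memread (void *ptr, size_t size)
    {
      memset (ptr, 0, size);
      __asm__ ("" :: "m" (*(struct { char x[size]; } *)ptr));
    }

    /* (3) "memclob": a full memory clobber.  Always applicable, but
       it forces GCC to assume that any object in memory may have
       been read or modified. */
    static inline void
    bzero_memclob (void *ptr, size_t size)
    {
      memset (ptr, 0, size);
      __asm__ ("" :: "r" (ptr) : "memory");
    }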
In the baseline compilation, there are 37 objects that make a call to
memset. In the "memread" compilation, the number of objects that make
a call to explicit_bzero drops from 77 to 38, but the number of
objects that make a call to memset goes up by only one; this is
partly because some of the changed objects already called memset, and
partly because the "memread" optimization works in exactly the cases
where GCC is most likely to replace a call to memset with inline code
(i.e. where the size parameter is constant). In the "memclob"
compilation, there are no surviving calls to explicit_bzero (as
expected) and 63 object files call memset.
First off, I may have missed something, but it appears to me that the
"memclob" approach does not replace any *more* explicit_bzero calls with
inline code than "memread" does. Again, this is expected, since the
"memread" optimization applies to the cases where GCC is most likely
to inline a call to memset.
Now let's look at a typical case where "memread" enables the compiler
to replace memset with inline code: this is _rs_stir in arc4random.c:
movq $0, 96(%rax)
movq rs(%rip), %rax
movq $984, (%rax)
.L23:
- movq %rsp, %rdi
- movl $40, %esi
- call explicit_bzero
+ movq $0, (%rsp)
+ movq $0, 8(%rsp)
+ movq $0, 16(%rsp)
+ movq $0, 24(%rsp)
+ movq $0, 32(%rsp)
movq rs(%rip), %rax
movq $0, (%rax)
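The source pattern here, paraphrased from arc4random.c (the error
handling and the actual stirring are elided), is a wipe of a stack
buffer whose size is a compile-time constant, which is exactly the
case "memread" targets:

    #include <string.h>

    #define KEYSZ 32
    #define IVSZ   8

    extern int getentropy (void *, size_t);
    extern void explicit_bzero (void *, size_t);

    /* Paraphrase of _rs_stir: sizeof rnd is the constant 40, so with
       "memread" GCC expands the cleanse into the five 8-byte stores
       shown above instead of emitting a call. */
    static void
    rs_stir_sketch (void)
    {
      unsigned char rnd[KEYSZ + IVSZ];

      if (getentropy (rnd, sizeof rnd) == -1)
        return;                       /* the real code aborts here */
      /* ... mix rnd into the generator state ... */
      explicit_bzero (rnd, sizeof rnd);
    }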
With "memclob" instead, the same code is generated for the actual
memory clear, but we get slightly worse code for the rest of the
function, because it's now considered to require a frame pointer.
(- memread, + memclob; debug information stripped, slightly
descheduled to emphasize the semantic change)
_rs_stir:
+ pushq %r12
pushq %rbp
pushq %rbx
movl $40, %esi
- subq $56, %rsp
+ subq $48, %rsp
movq %rsp, %rdi
+ movq %rsp, %rbp
movq %fs:40, %rax
movq %rax, 40(%rsp)
xorl %eax, %eax
call getentropy
cmpl $-1, %eax
@@
movq rsx(%rip), %rdi
leaq 64(%rdi), %rsi
movq %rsi, %rdx
call chacha_encrypt_bytes.constprop.2
movq rsx(%rip), %rbx
xorl %eax, %eax
.L24:
- movzbl (%rsp,%rax), %edx
+ movzbl 0(%rbp,%rax), %edx
xorb %dl, 64(%rbx,%rax)
addq $1, %rax
cmpq $40, %rax
jne .L24
cmpq $0, rs(%rip)
- leaq 64(%rbx), %rbp
+ leaq 64(%rbx), %r12
movq %rbx, %rdi
je .L46
.L25:
- movq %rbp, %rsi
+ movq %r12, %rsi
call chacha_keysetup.isra.0.constprop.3
and so on.
Here's another example where memclob causes slightly worse register
allocation than memread. In this case, I think it's mostly because
memset takes one more argument than explicit_bzero. This is
tls_config_set_key_mem in tls/tls_config.c.
tls_config_set_key_mem:
+ pushq %r13
pushq %r12
pushq %rbp
pushq %rbx
+ movq %rdx, %r13
- movq %rdx, %r12
+ movq %rsi, %r12
movq %rdi, %rbx
+ subq $8, %rsp
- movq 80(%rdi), %rdi
+ movq 80(%rdi), %rbp
- movq %rsi, %rbp
- testq %rdi, %rdi
+ testq %rbp, %rbp
je .L105
- movq 88(%rbx), %rsi
+ movq 88(%rdi), %rdx
+ xorl %esi, %esi
+ movq %rbp, %rdi
- call explicit_bzero
+ call memset
.L105:
+ addq $8, %rsp
leaq 88(%rbx), %rsi
leaq 80(%rbx), %rdi
- movq %r12, %rcx
+ movq %r13, %rcx
+ movq %r12, %rdx
popq %rbx
- movq %rbp, %rdx
popq %rbp
popq %r12
+ popq %r13
jmp set_mem
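For context, the underlying C is roughly the following (paraphrased
from memory, so take the struct layout and the set_mem signature as
illustrative):

    #include <stddef.h>
    #include <stdint.h>

    struct tls_config {
      /* ... */
      uint8_t *key_mem;
      size_t key_len;
    };

    extern int set_mem (uint8_t **dst, size_t *dst_len,
                        const uint8_t *src, size_t src_len);
    extern void explicit_bzero (void *, size_t);

    /* memset (p, 0, n) needs one more argument register than
       explicit_bzero (p, n), which is enough to perturb register
       allocation around the tail call to set_mem. */
    int
    tls_config_set_key_mem (struct tls_config *config,
                            const uint8_t *key, size_t len)
    {
      if (config->key_mem != NULL)
        explicit_bzero (config->key_mem, config->key_len);
      return set_mem (&config->key_mem, &config->key_len, key, len);
    }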
Finally, the large, complicated apps/ca.c reads, in part:

    pkey = load_key(bio_err, keyfile, keyform, 0, key, e, "CA private key");
    if (key)
        OPENSSL_cleanse(key, strlen(key));
    if (pkey == NULL) {
        /* load_key() has already printed an appropriate message */
        goto err;
    }
With "memread", GCC decides that the OPENSSL_cleanse operation belongs
out of line in .text.unlikely. With either "memclob" or "funcall",
this does not happen. This is a mis-optimization; I think control
can't actually reach this point with a NULL key. However, it
illustrates how "memread" exposes more information to the compiler
and lets it make more sophisticated decisions.
I've tarred up all three build trees (with intermediate .s files
included) and put them on my website at
https://hacks.owlfolio.org/scratchpad/libressl-ebzero-experiments.tar.xz.
I invite other people to dig into them and discuss the effects of the
various choices. (Please note that it is 53MB and I can't leave it
there forever due to hosting costs.)
It remains my opinion that the patch as I posted it (with an update to
the manual -- I will get to that in the next few days) is Good Enough
and we should not hold off on making the API available until the ideal
bits/string2.h optimization for it is possible. I will be sending a
message to the gcc and clang mailing lists in the next few days to
open a discussion of that ideal, but I don't have any plans to
implement it myself.
zw