[PATCH 5/7] x86: Add AVX2 optimized chacha20

Adhemerval Zanella <adhemerval.zanella@linaro.org>
Thu Apr 14 17:16:55 GMT 2022



On 13/04/2022 20:04, Noah Goldstein wrote:
> On Wed, Apr 13, 2022 at 1:27 PM Adhemerval Zanella via Libc-alpha
> <libc-alpha@sourceware.org> wrote:
>>
>> +       .text
> 
> section avx2
> 

Ack, I changed to '.section .text.avx2, "ax", @progbits'.

>> +       .align 32
>> +chacha20_data:
>> +L(shuf_rol16):
>> +       .byte 2,3,0,1,6,7,4,5,10,11,8,9,14,15,12,13
>> +L(shuf_rol8):
>> +       .byte 3,0,1,2,7,4,5,6,11,8,9,10,15,12,13,14
>> +L(inc_counter):
>> +       .byte 0,1,2,3,4,5,6,7
>> +L(unsigned_cmp):
>> +       .long 0x80000000
>> +
>> +ENTRY (__chacha20_avx2_blocks8)
>> +       /* input:
>> +        *      %rdi: input
>> +        *      %rsi: dst
>> +        *      %rdx: src
>> +        *      %rcx: nblks (multiple of 8)
>> +        */
>> +       vzeroupper;
> 
> vzeroupper needs to be replaced with VZEROUPPER_RETURN
> and we need a transaction safe version unless this can never
> be called during a transaction.

I think you meant VZEROUPPER here (VZEROUPPER_RETURN seems to trigger
test case failures). What do you mean by a 'transaction safe version'?
An extra __chacha20_avx2_blocks8 implementation to handle it? Or disable
it if RTM is enabled?

>> +
>> +       /* clear the used vector registers and stack */
>> +       vpxor X0, X0, X0;
>> +       vmovdqa X0, (STACK_VEC_X12)(%rsp);
>> +       vmovdqa X0, (STACK_VEC_X13)(%rsp);
>> +       vmovdqa X0, (STACK_TMP)(%rsp);
>> +       vmovdqa X0, (STACK_TMP1)(%rsp);
>> +       vzeroall;
> 
> Do you need vzeroall?
> Why not vzeroupper? Is it a security concern to leave info in the xmm pieces?

I would assume so, since it is in the original libgcrypt optimization.  As
with the ssse3 version, I am not sure if we really need that level of
hardening, but it would be good to keep the initial revision as close
as possible to libgcrypt.

> 
> 
>> +
>> +       /* eax zeroed by round loop. */
>> +       leave;
>> +       cfi_adjust_cfa_offset(-8)
>> +       cfi_def_cfa_register(%rsp);
>> +       ret;
>> +       int3;
> 
> Why do we need int3 here?

I think the ssse3 answer applies here as well.

>> +END(__chacha20_avx2_blocks8)
>> diff --git a/sysdeps/x86_64/chacha20_arch.h b/sysdeps/x86_64/chacha20_arch.h
>> index 37a4fdfb1f..7e9e7755f3 100644
>> --- a/sysdeps/x86_64/chacha20_arch.h
>> +++ b/sysdeps/x86_64/chacha20_arch.h
>> @@ -22,11 +22,25 @@
>>
>>  unsigned int __chacha20_ssse3_blocks8 (uint32_t *state, uint8_t *dst,
>>                                        const uint8_t *src, size_t nblks);
>> +unsigned int __chacha20_avx2_blocks8 (uint32_t *state, uint8_t *dst,
>> +                                     const uint8_t *src, size_t nblks);
>>
>>  static inline void
>>  chacha20_crypt (struct chacha20_state *state, uint8_t *dst, const uint8_t *src,
>>                 size_t bytes)
>>  {
>> +  const struct cpu_features* cpu_features = __get_cpu_features ();
> 
> Can we do this with an ifunc and take the cpufeature check off the critical
> path?

Ditto.
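
Something along the lines of the sketch below is what I have in mind to
experiment with (hypothetical names only, not the actual patch; it just
illustrates hoisting the cpu_features check into a one-time resolver,
using the same glibc-internal cpu_features interface the patch already
uses):

  #include <stddef.h>
  #include <stdint.h>
  #include <cpu-features.h>   /* glibc-internal: __get_cpu_features.  */

  /* Hypothetical full-buffer variants; each one would carve off the
     block counts it can handle, as chacha20_crypt does today.  */
  extern void __chacha20_crypt_generic (uint32_t *, uint8_t *,
                                        const uint8_t *, size_t);
  extern void __chacha20_crypt_ssse3 (uint32_t *, uint8_t *,
                                      const uint8_t *, size_t);
  extern void __chacha20_crypt_avx2 (uint32_t *, uint8_t *,
                                     const uint8_t *, size_t);

  typedef void (*chacha20_crypt_t) (uint32_t *, uint8_t *,
                                    const uint8_t *, size_t);

  /* Runs once at relocation time, so the call itself no longer touches
     cpu_features.  */
  static chacha20_crypt_t
  chacha20_crypt_resolver (void)
  {
    const struct cpu_features *cpu_features = __get_cpu_features ();
    if (CPU_FEATURE_USABLE_P (cpu_features, AVX2))
      return __chacha20_crypt_avx2;
    if (CPU_FEATURE_USABLE_P (cpu_features, SSSE3))
      return __chacha20_crypt_ssse3;
    return __chacha20_crypt_generic;
  }

  void chacha20_crypt (uint32_t *state, uint8_t *dst, const uint8_t *src,
                       size_t bytes)
       __attribute__ ((ifunc ("chacha20_crypt_resolver")));

Alternatively the selection could simply be done once when the arc4random
state is initialized, which avoids a proper ifunc altogether.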

>> +
>> +  if (CPU_FEATURE_USABLE_P (cpu_features, AVX2) && bytes >= CHACHA20_BLOCK_SIZE * 8)
>> +    {
>> +      size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
>> +      nblocks -= nblocks % 8;
>> +      __chacha20_avx2_blocks8 (state->ctx, dst, src, nblocks);
>> +      bytes -= nblocks * CHACHA20_BLOCK_SIZE;
>> +      dst += nblocks * CHACHA20_BLOCK_SIZE;
>> +      src += nblocks * CHACHA20_BLOCK_SIZE;
>> +    }
>> +
>>    if (CPU_FEATURE_USABLE_P (cpu_features, SSSE3) && bytes >= CHACHA20_BLOCK_SIZE * 4)
>>      {
>>        size_t nblocks = bytes / CHACHA20_BLOCK_SIZE;
>> --
>> 2.32.0
>>
> 
> Do you want optimization comments or do that later?

Ideally I would like to check if the proposed arc4random implementation
is what we want (with the current approach of using atfork handlers and the
key reschedule).  The cipher itself is not the most important part, in the
sense that it is transparent to the user and we can eventually replace it
if there is any issue with or attack on ChaCha20.  Initially I won't add
any arch-specific optimization, but since libgcrypt provides some that fit
the current approach I thought it would be a nice thing to have.
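
To make the atfork part concrete, the idea is roughly the following
(illustrative only, with made-up names, not the actual patch): detect the
fork in a child handler and force a key reschedule before the next output.

  #include <pthread.h>
  #include <stdbool.h>

  struct chacha20_state;                                /* as in the series  */
  /* Hypothetical helper: derives a fresh key/nonce (e.g. from getrandom)
     and discards the state inherited from the parent.  */
  extern void chacha20_reschedule_key (struct chacha20_state *);

  /* Set in the child after fork so parent and child never reuse the same
     ChaCha20 keystream.  */
  static bool arc4random_fork_detected;

  static void
  arc4random_atfork_child (void)
  {
    arc4random_fork_detected = true;
  }

  static void
  arc4random_init (void)
  {
    /* Inside libc the internal fork-handler registration would be used
       instead of the public pthread_atfork.  */
    pthread_atfork (NULL, NULL, arc4random_atfork_child);
  }

  static void
  arc4random_maybe_rekey (struct chacha20_state *state)
  {
    if (arc4random_fork_detected)
      {
        arc4random_fork_detected = false;
        chacha20_reschedule_key (state);
      }
  }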

For optimization comments it would be good to sync with libgcrypt as well;
I think that project will be interested in any performance improvements
you might have for the ChaCha20 implementations.

