[PATCH v6 5/5] riscv: Add and use alignment-ignorant memcpy
enh
enh@google.com
Thu Aug 17 16:37:23 GMT 2023
On Thu, Aug 17, 2023 at 9:27 AM Evan Green <evan@rivosinc.com> wrote:
>
> On Wed, Aug 16, 2023 at 4:18 PM enh <enh@google.com> wrote:
> >
> > On Tue, Aug 15, 2023 at 4:02 PM Evan Green <evan@rivosinc.com> wrote:
> > >
> > > On Tue, Aug 15, 2023 at 2:54 PM enh <enh@google.com> wrote:
> > > >
> > > > On Tue, Aug 15, 2023 at 9:41 AM Evan Green <evan@rivosinc.com> wrote:
> > > > >
> > > > > On Fri, Aug 11, 2023 at 5:01 PM enh <enh@google.com> wrote:
> > > > > >
> > > > > > On Mon, Aug 7, 2023 at 5:01 PM Evan Green <evan@rivosinc.com> wrote:
> > > > > > >
> > > > > > > On Mon, Aug 7, 2023 at 3:48 PM enh <enh@google.com> wrote:
> > > > > > > >
> > > > > > > > On Mon, Aug 7, 2023 at 3:11 PM Evan Green <evan@rivosinc.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Aug 3, 2023 at 3:30 PM Richard Henderson
> > > > > > > > > <richard.henderson@linaro.org> wrote:
> > > > > > > > > >
> > > > > > > > > > On 8/3/23 11:42, Evan Green wrote:
> > > > > > > > > > > On Thu, Aug 3, 2023 at 10:50 AM Richard Henderson
> > > > > > > > > > > <richard.henderson@linaro.org> wrote:
> > > > > > > > > > >> Outside libc something is required.
> > > > > > > > > > >>
> > > > > > > > > > >> An extra parameter to ifunc is surprising though, and clearly not ideal per the extra
> > > > > > > > > > >> hoops above. I would hope for something with hidden visibility in libc_nonshared.a that
> > > > > > > > > > >> could always be called directly.
> > > > > > > > > > >
> > > > > > > > > > > My previous spin took that approach, defining a
> > > > > > > > > > > __riscv_hwprobe_early() in libc_nonshared that could route to the real
> > > > > > > > > > > function if available, or make the syscall directly if not. But that
> > > > > > > > > > > approach had the drawback that ifunc users couldn't take advantage of
> > > > > > > > > > > the vDSO, and then all users had to comprehend the difference between
> > > > > > > > > > > __riscv_hwprobe() and __riscv_hwprobe_early().
> > > > > > > > > >
> > > > > > > > > > I would define __riscv_hwprobe such that it could take advantage of the vDSO once
> > > > > > > > > > initialization reaches a certain point, but cope with being run earlier than that point by
> > > > > > > > > > falling back to the syscall.
> > > > > > > > > >
> > > > > > > > > > That constrains the implementation, I guess, in that it can't set errno, but just
> > > > > > > > > > returning the negative errno from the syscall seems fine.
> > > > > > > > > >
> > > > > > > > > > It might be tricky to get a reference to GLRO(dl_vdso_riscv_hwprobe) very early, but I
> > > > > > > > > > would hope that some application of __attribute__((weak)) might correctly get you a NULL
> > > > > > > > > > prior to full relocations being complete.
> > > > > > > > >
> > > > > > > > > Right, this is what we had in the previous iteration of this series,
> > > > > > > > > and it did work ok. But it wasn't as good since it meant ifunc
> > > > > > > > > selectors always got stuck in the null/fallback case and were forced
> > > > > > > > > to make the syscall. With this mechanism they get to take advantage of
> > > > > > > > > the vDSO.
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > In contrast, IMO this approach is much nicer. Ifunc writers are
> > > > > > > > > > > already used to getting hwcap info via a parameter. Adding this second
> > > > > > > > > > > parameter, which also provides hwcap-like things, seems like a natural
> > > > > > > > > > > extension. I didn't quite follow what you meant by the "extra hoops
> > > > > > > > > > > above".
> > > > > > > > > >
> > > > > > > > > > The check for null function pointer, for sure. But also consider how __riscv_hwprobe is
> > > > > > > > > > going to be used.
> > > > > > > > > >
> > > > > > > > > > It might be worth defining some helper functions for probing a single key or a single
> > > > > > > > > > field. E.g.
> > > > > > > > > >
> > > > > > > > > > uint64_t __riscv_hwprobe_one_key(int64_t key, unsigned int flags)
> > > > > > > > > > {
> > > > > > > > > > struct riscv_hwprobe pair = { .key = key };
> > > > > > > > > > int err = __riscv_hwprobe(&pair, 1, 0, NULL, flags);
> > > > > > > > > > if (err)
> > > > > > > > > > return err;
> > > > > > > > > > if (pair.key == -1)
> > > > > > > > > > return -ENOENT;
> > > > > > > > > > return pair.value;
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > This implementation requires that no future hwprobe key define a value which as a valid
> > > > > > > > > > value in the errno range (or better, bit 63 unused). Alternately, or additionally:
> > > > > > > > > >
> > > > > > > > > > bool __riscv_hwprobe_one_mask(int64_t key, uint64_t mask, uint64_t val, int flags)
> > > > > > > > > > {
> > > > > > > > > > struct riscv_hwprobe pair = { .key = key };
> > > > > > > > > > return (__riscv_hwprobe(&pair, 1, 0, NULL, flags) == 0
> > > > > > > > > > && pair.key != -1
> > > > > > > > > > && (pair.value & mask) == val);
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > These yield either
> > > > > > > > > >
> > > > > > > > > > int64_t v = __riscv_hwprobe_one_key(CPUPERF_0, 0);
> > > > > > > > > > if (v >= 0 && (v & MISALIGNED_MASK) == MISALIGNED_FAST)
> > > > > > > > > > return __memcpy_noalignment;
> > > > > > > > > > return __memcpy_generic;
> > > > > > > > > >
> > > > > > > > > > or
> > > > > > > > > >
> > > > > > > > > > if (__riscv_hwprobe_one_mask(CPUPERF_0, MISALIGNED_MASK, MISALIGNED_FAST, 0))
> > > > > > > > > > return __memcpy_noalignment;
> > > > > > > > > > return __memcpy_generic;
> > > > > > > > > >
> > > > > > > > > > which to my mind looks much better for a pattern you'll be replicating so very many times
> > > > > > > > > > across all of the ifunc implementations in the system.
> > > > > > > > >
> > > > > > > > > Ah, I see. I could make a static inline function in the header that
> > > > > > > > > looks something like this (mangled by gmail, sorry):
> > > > > > > > >
> > > > > > > > > /* Helper function usable from ifunc selectors that probes a single key. */
> > > > > > > > > static inline int __riscv_hwprobe_one(__riscv_hwprobe_t hwprobe_func,
> > > > > > > > > signed long long int key,
> > > > > > > > > unsigned long long int *value)
> > > > > > > > > {
> > > > > > > > > struct riscv_hwprobe pair;
> > > > > > > > > int rc;
> > > > > > > > >
> > > > > > > > > if (!hwprobe_func)
> > > > > > > > > return -ENOSYS;
> > > > > > > > >
> > > > > > > > > pair.key = key;
> > > > > > > > > rc = hwprobe_func(&pair, 1, 0, NULL, 0);
> > > > > > > > > if (rc) {
> > > > > > > > > return rc;
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > if (pair.key < 0) {
> > > > > > > > > return -ENOENT;
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > *value = pair.value;
> > > > > > > > > return 0;
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > The ifunc selector would then be significantly cleaned up, looking
> > > > > > > > > something like:
> > > > > > > > >
> > > > > > > > > if (__riscv_hwprobe_one(hwprobe_func, RISCV_HWPROBE_KEY_CPUPERF_0, &value))
> > > > > > > > > return __memcpy_generic;
> > > > > > > > >
> > > > > > > > > if (value & RISCV_HWPROBE_MISALIGNED_MASK) == RISCV_HWPROBE_MISALIGNED_FAST)
> > > > > > > > > return __memcpy_noalignment;
> > > > > > > >
> > > > > > > > (Android's libc maintainer here, having joined the list just to talk
> > > > > > > > about risc-v ifuncs :-) )
> > > > > > > >
> > > > > > > > has anyone thought about calling ifunc resolvers more like this...
> > > > > > > >
> > > > > > > > --same part of the dynamic loader that caches the two getauxval()s for arm64--
> > > > > > > > static struct riscv_hwprobe probes[] = {
> > > > > > > > {.value = RISCV_HWPROBE_KEY_MVENDORID},
> > > > > > > > {.value = RISCV_HWPROBE_KEY_MARCHID},
> > > > > > > > {.value = RISCV_HWPROBE_KEY_MIMPID},
> > > > > > > > {.value = RISCV_HWPROBE_KEY_BASE_BEHAVIOR},
> > > > > > > > {.value = RISCV_HWPROBE_KEY_IMA_EXT},
> > > > > > > > {.value = RISCV_HWPROBE_KEY_CPUPERF_0},
> > > > > > > > ... // every time a new key is added to the kernel, we add it here
> > > > > > > > };
> > > > > > > > __riscv_hwprobe(...); // called once
> > > > > > > >
> > > > > > > > --part of the dynamic loader that calls ifunc resolvers--
> > > > > > > > (*ifunc_resolver)(sizeof(probes)/sizeof(probes[0]), probes);
> > > > > > > >
> > > > > > > > this is similar to what we already have for arm64 (where there's a
> > > > > > > > getauxval(AT_HWCAP) and a pointer to a struct for AT_HWCAP2 and
> > > > > > > > potentially others), but more uniform, and avoiding the source
> > > > > > > > (in)compatibility issues of adding new fields to a struct [even if it
> > > > > > > > does have a size_t to "version" it like the arm64 ifunc struct].
> > > > > > > >
> > > > > > > > yes, it means everyone pays to get all the hwprobes, but that gets
> > > > > > > > amortized. and lookup in the ifunc resolver is simple and quick. if we
> > > > > > > > know that the keys will be kept dense, we can even have code in ifunc
> > > > > > > > resolvers like
> > > > > > > >
> > > > > > > > if (probes[RISCV_HWPROBE_BASE_BEHAVIOR_IMA].value & RISCV_HWPROBE_IMA_V) ...
> > > > > > > >
> > > > > > > > though personally for the "big ticket items" that get a letter to
> > > > > > > > themselves like V, i'd be tempted to pass `(getauxval(AT_HWCAP),
> > > > > > > > probe_count, probes_ptr)` to the resolver, but i hear that's
> > > > > > > > controversial :-)
> > > > > > >
> > > > > > > Hello, welcome to the fun! :)
> > > > > >
> > > > > > (sorry for the delay. i've been thinking :-) )
> > > > > >
> > > > > > > What you're describing here is almost exactly what we did inside the
> > > > > > > vDSO function. The vDSO function acts as a front for a handful of
> > > > > > > probe values that we've already completed and cached in userspace. We
> > > > > > > opted to make it a function, rather than exposing the data itself via
> > > > > > > vDSO, so that we had future flexibility in what elements we cached in
> > > > > > > userspace and their storage format. We can update the kernel as needed
> > > > > > > to cache the hottest things in userspace, even if that means
> > > > > > > rearranging the data format, passing through some extra information,
> > > > > > > or adding an extra snip of code. My hope is callers can directly
> > > > > > > interact with the vDSO function (though maybe as Richard suggested
> > > > > > > maybe with the help of a tidy inline helper), rather than trying to
> > > > > > > add a second layer of userspace caching.
> > > > > >
> > > > > > on reflection i think i might be too focused on the FMV use case, in
> > > > > > part because we're looking at those compiler-generated ifuncs for
> > > > > > arm64 on Android atm. i think i'm imagining a world where there's a
> > > > > > lot of that, and worrying about having to pay for the setup, call, and
> > > > > > loop for each ifunc, and wondering why we don't just pay once instead.
> > > > > > (as a bit of background context, Android "app" start is actually a
> > > > > > dlopen() in a clone of an existing zygote process, and in general app
> > > > > > launch time is one of the key metrics anyone who's serious is
> > > > > > optimizing for. you'd be surprised how much of my life i spend
> > > > > > explaining to people that if they want dlopen() to be faster, maybe
> > > > > > they shouldn't ask us to run thousands of ELF constructors.)
> > > > > >
> > > > > > but... the more time i spend looking at what we actually need in
> > > > > > third-party open source libraries right now i realize that libc and
> > > > > > FMV (which is still a future thing for us anyway) are really the only
> > > > > > _actual_ ifunc users. perhaps in part because macOS/iOS don't have
> > > > > > ifuncs, all the libraries that are part of the OS itself, for example,
> > > > > > are just doing their own thing with function pointers and
> > > > > > pthread_once() or whatever.
> > > > > >
> > > > > > (i have yet to try to get any data on actual apps. i have no reason to
> > > > > > think they'll be very different, but that could easily be skewed by
> > > > > > popular middleware or a popular game engine using ifuncs, so i do plan
> > > > > > on following up on that.)
> > > > > >
> > > > > > "how do they decide what to set that function pointer to?". well, it
> > > > > > looks like in most cases cpuid on x86 and calls to getauxval()
> > > > > > everywhere else. in some cases that's actually via some other library:
> > > > > > https://github.com/pytorch/cpuinfo or
> > > > > > https://github.com/google/cpu_features for example. so they have a
> > > > > > layer of caching there, even in cases where they don't have a single
> > > > > > function that sets all the function pointers.
> > > > >
> > > > > Right, function multi-versioning is just the sort of spot where we'd
> > > > > imagine hwprobe gets used, since it's providing similar/equivalent
> > > > > information to what cpuid does on x86. It may not be quite as fast as
> > > > > cpuid (I don't know how fast cpuid actually is). But with the vDSO
> > > > > function+data in userspace it should be able to match getauxval() in
> > > > > performance, as they're both a function pointer plus a loop. We're
> > > > > sort of planning for a world in which RISC-V has a wider set of these
> > > > > values to fetch, such that a ifunc selector may need a more complex
> > > > > set of information. Hwprobe and the vDSO gives us the ability both to
> > > > > answer multiple queries fast, and freely allocate more keys that may
> > > > > represent versioned features or even compound features.
> > > >
> > > > yeah, my incorrect mental model was that -- primarily because of
> > > > x86-64 and cpuid -- every function would get its own ifunc resolver
> > > > that would have to make a query. but the [in progress] arm64
> > > > implementation shows that that's not really the case anyway, and we
> > > > can just cache __riscv_hwprobe() in the same [one] place that
> > > > getauxval() is already being cached for arm64.
> > >
> > > Sounds good.
> > >
> > > >
> > > > > > so assuming i don't find that apps look very different from the OS
> > > > > > (that is: that apps use lots of ifuncs), i probably don't care at all
> > > > > > until we get to FMV. and i probably don't care for FMV, because
> > > > > > compiler-rt (or gcc's equivalent) will be the "caching layer" there.
> > > > > > (and on Android it'll be a while before i have to worry about libc's
> > > > > > ifuncs because we'll require V and not use ifuncs there for the
> > > > > > foreseeable future.)
> > > > > >
> > > > > > so, yeah, given that i've adopted the "pass a null pointer rather than
> > > > > > no arguments" convention you have, we have room for expansion if/when
> > > > > > FMV is a big thing, and until then -- unless i'm shocked by what i
> > > > > > find looking at actual apps -- i don't think i have any reason to
> > > > > > believe that ifuncs matter that much, and if compiler-rt makes one
> > > > > > __riscv_hwprobe() call per .so, that's probably fine. (i already spend
> > > > > > a big chunk of my life advising people to just have one .so file,
> > > > > > exporting nothing but a JNI_OnLoad symbol, so this will just make that
> > > > > > advice even better advice :-) )
> > > > >
> > > > > Just to confirm, by "pass a null pointer", you're saying that the
> > > > > Android libc also passes NULL as the second ifunc selector argument
> > > > > (or first)?
> > > >
> > > > #elif defined(__riscv)
> > > > // This argument and its value is just a placeholder for now,
> > > > // but it means that if we do pass something in future (such as
> > > > // getauxval() and/or hwprobe key/value pairs), callees will be able to
> > > > // recognize what they're being given.
> > > > typedef ElfW(Addr) (*ifunc_resolver_t)(void*);
> > > > return reinterpret_cast<ifunc_resolver_t>(resolver_addr)(nullptr);
> > > >
> > > > it's arm64 that has the initial getauxval() argument:
> > > >
> > > > #if defined(__aarch64__)
> > > > typedef ElfW(Addr) (*ifunc_resolver_t)(uint64_t, __ifunc_arg_t*);
> > > > static __ifunc_arg_t arg;
> > > > static bool initialized = false;
> > > > if (!initialized) {
> > > > initialized = true;
> > > > arg._size = sizeof(__ifunc_arg_t);
> > > > arg._hwcap = getauxval(AT_HWCAP);
> > > > arg._hwcap2 = getauxval(AT_HWCAP2);
> > > > }
> > > > return reinterpret_cast<ifunc_resolver_t>(resolver_addr)(arg._hwcap
> > > > | _IFUNC_ARG_HWCAP, &arg);
> > > >
> > > > https://android.googlesource.com/platform/bionic/+/main/libc/bionic/bionic_call_ifunc_resolver.cpp
> > > >
> > > > > That's good. It sounds like you're planning to just
> > > > > continue passing NULL for now, and wait for people to start clamoring
> > > > > for this in android libc?
> > > >
> > > > yeah, and i'm assuming there will never be any clamor ... yesterday
> > > > and today i actually checked a bunch of popular apks, and didn't find
> > > > any that were currently using ifuncs.
> > > >
> > > > the only change i'm thinking of making right now is that "there's a
> > > > single argument, and it's null" should probably be the default.
> > > > obviously since Android doesn't add new architectures very often, this
> > > > is only likely to affect x86/x86-64 for the foreseeable future, but
> > > > being able to recognize at a glance "am i running under a libc new
> > > > enough to pass me arguments?" would certainly have helped for arm64.
> > > > even if x86/x86-64 never benefit, it seems like the right default for
> > > > the #else clause...
> > >
> > > Sounds good, thanks for the pointers. The paranoid person in me would
> > > also add a comment in the risc-v section that if a pointer to hwprobe
> > > is added, it should be added as the second argument, behind hwcap as
> > > the first (assuming this change lands).
> > >
> > > Come to think of it, the static inline helper I'm proposing in my
> > > discussion with Richard needs to take both arguments, since callers
> > > need to check both ((arg1 != 0) && (arg2 != NULL)) to safely know that
> > > arg2 is a pointer to __riscv_hwprobe().
> >
> > presumably not `(arg1 != 0)` but `(arg1 & _IFUNC_ARG_HWCAP)` to match arm64?
>
> It looks like we didn't do that _IFUNC_ARG_HWCAP bit on riscv.
> Actually, looking at the history of sysdeps/riscv/dl-irel.h, hwcap has
> always been passed as the first argument. So I think I don't need to
> check it in the (glibc-specific) inline helper function, I can safely
> assume it's there and go straight to checking the second argument.
oh i misunderstood what you were saying earlier.
> If you were coding this directly in a library or application, you
> would need to check the first arg to be compatible with other libcs
> like Android's.
we haven't shipped yet, so if you're telling me glibc is passing
`(getauxval(AT_HWCAP), nullptr)`, i'll change bionic to do the same
today ... the whole reason i'm here on this thread is to ensure source
compatibility for anyone writing ifuncs :-)
> I think checking against zero should be ok since bits
> like I, M, A should always be set. (I didn't dig through the kernel
> history, so maybe not on really old kernels? But you're not going to
> get hwprobe on those anyway either so the false bailout is correct).
>
> -Evan
More information about the Libc-alpha
mailing list