Vector registers perform better than scalar register pairs for copying
data so prefer them instead. This results in a time reduction of over
50% (i.e. 2x speed improvemnet) for some smaller sizes for memcpy-walk.
Larger sizes show improvements of around 1% to 2%. memcpy-random shows
a very small improvement, in the range of 1-2%.
* sysdeps/aarch64/multiarch/memcpy_falkor.S (__memcpy_falkor):
Use vector registers.