On Mon, 2018-10-01 at 19:22 +0300, Anton Youdkevitch wrote:
+L(dst_unaligned):
+ /* For the unaligned store case the code loads two
+ aligned chunks and then merges them using ext
+ instruction. This can be up to 30% faster than
+ the simple unaligned store access.
+
+ Current state: tmp1 = dst % 16; C_q, D_q, E_q
+ contain data yet to be stored. src and dst point
+ to next-to-be-processed data. A_q, B_q contain
+ data already stored before, count = bytes left to
+ be loaded, decremented by 64.
+
+ Control is passed here if at least 64 bytes are left
+ to be loaded. The code does two aligned loads and then
+ extracts (16-tmp1) bytes from the first register and
+ tmp1 bytes from the next register forming the value
+ for the aligned store.
+
+ As the ext instruction can only have its index encoded
+ as an immediate, 15 code chunks process each possible
+ index value. Computed goto is used to reach the
+ required code. */
+
+ /* Store the 16 bytes to dst and align dst for further
+ operations; several bytes will be stored at this
+ address once more. */
+ str C_q, [dst], #16
+ ldp F_q, G_q, [src], #32
+ bic dst, dst, 15
+ adr tmp2, L(load_and_merge)
+ add tmp2, tmp2, tmp1, LSL 7
+ sub tmp2, tmp2, 128
+ br tmp2
Anton,

As far as the actual code goes, my only concern is the use of a
'computed goto' to jump to one of the extract sections. It seems very
brittle, since a change in the alignment of the various sections or a
change in the size of any of them would silently send the branch to
the wrong address. Would the code be any slower if you used a jump
table instead of a computed goto?
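
To illustrate the concern, here is a rough C model of the two dispatch
schemes. It is only a sketch, not the assembly itself: the names
SECTION_SIZE, NUM_SECTIONS, computed_offset, build_table and
table_offset are made up for this example, and build_table stands in
for what the assembler would do when it fills a table with the label
addresses of the sections. The only values taken from the patch are
the 128-byte stride (LSL 7) and the 15 sections.

```c
#include <assert.h>
#include <stddef.h>

#define SECTION_SIZE 128   /* stride assumed by "add tmp2, tmp2, tmp1, LSL 7" */
#define NUM_SECTIONS 15    /* one section per ext index, tmp1 = 1..15 */

/* Computed goto: offset from L(load_and_merge) is tmp1 * 128 - 128.
   This is right only while every section is exactly SECTION_SIZE bytes. */
static size_t computed_offset(unsigned tmp1)
{
    assert(tmp1 >= 1 && tmp1 <= NUM_SECTIONS);
    return (size_t)tmp1 * SECTION_SIZE - SECTION_SIZE;
}

/* Model of a jump table: record where each section really starts,
   given the actual size of every section (hypothetical helper). */
static void build_table(size_t table[NUM_SECTIONS],
                        const size_t sizes[NUM_SECTIONS])
{
    size_t offset = 0;
    for (unsigned i = 0; i < NUM_SECTIONS; i++) {
        table[i] = offset;
        offset += sizes[i];
    }
}

/* Jump table dispatch: look the start offset up instead of computing it. */
static size_t table_offset(const size_t table[NUM_SECTIONS], unsigned tmp1)
{
    assert(tmp1 >= 1 && tmp1 <= NUM_SECTIONS);
    return table[tmp1 - 1];
}
```

As long as every section is exactly 128 bytes the two schemes agree.
If one section grows by even a single instruction, the table still
points at the real start of each later section, while the computed
offset lands a few bytes short of it, inside the previous section. The
table costs an extra load before the branch, which is why I am asking
about the performance.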