Optimizing a Program Through SVE2 Auto-Vectorization
gus
Posted on April 21, 2022
Today I'm going to be taking another look at the volume scaling algorithms we benchmarked in my last post with the goal of adding SVE2 optimization and further improving the runtime. Because we're using SVE2 we need to make these changes on either vol4.c or vol5.c, as those are the AArch64-specific algorithms that take advantage of inline assembly and intrinsics, respectively.
To make things simple I'll use the first candidate, vol4.c, which uses inline assembly. The full code is as follows:
int main() {
#ifndef __aarch64__
printf("Wrong architecture - written for aarch64 only.\n");
#else
// these variables will also be accessed by our assembler code
int16_t* in_cursor; // input cursor
int16_t* out_cursor; // output cursor
int16_t vol_int; // volume as int16_t
int16_t* limit; // end of input array
int x; // array interator
int ttl=0 ; // array total
// ---- Create in[] and out[] arrays
int16_t* in;
int16_t* out;
in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
// ---- Create dummy samples in in[]
vol_createsample(in, SAMPLES);
// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
// set vol_int to fixed-point representation of the volume factor
// Q: should we use 32767 or 32768 in next line? why?
vol_int = (int16_t)(VOLUME/100.0 * 32767.0);
// Q: what is the purpose of these next two lines?
in_cursor = in;
out_cursor = out;
limit = in + SAMPLES;
// Q: what does it mean to "duplicate" values in the next line?
__asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h
while ( in_cursor < limit ) {
__asm__ (
"ldr q0, [%[in_cursor]], #16 \n\t"
// load eight samples into q0 (same as v0.8h)
// from [in_cursor]
// post-increment in_cursor by 16 bytes
// and store back into the pointer register
"sqrdmulh v0.8h, v0.8h, v1.8h \n\t"
// with 32 signed integer output,
// multiply each lane in v0 * v1 * 2
// saturate results
// store upper 16 bits of results into
// the corresponding lane in v0
"str q0, [%[out_cursor]],#16 \n\t"
// store eight samples to [out_cursor]
// post-increment out_cursor by 16 bytes
// and store back into the pointer register
// Q: What do these next three lines do?
: [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
: "r"(in_cursor),"r"(out_cursor)
: "memory"
);
}
// --------------------------------------------------------------------
for (x = 0; x < SAMPLES; x++) {
ttl=(ttl+out[x])%1000;
}
// Q: are the results usable? are they correct?
printf("Result: %d\n", ttl);
return 0;
#endif
}
To start, we need to include the relevant library by adding an include.
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"
#include <time.h>
#include <arm_sve.h>
#ifndef __aarch64__
printf("Wrong architecture- written for aarch64 only.\n");
Next, I changed the duplicate instruction's destination to the z register as per the SVE2 standard.
__asm__ ("dup z1.h,%w0"::"r"(vol_int)); //duplicate vol_int into z1.h
...
"sqrdmulh z0.h, z0.h, z1.h \n\t"
Next the makefile that we use to build the program needs to be changed to trigger the use of SVE2 by the compiler.
vol4: vol4.c vol_createsample.o vol.h
gcc ${CCOPTS} vol4.c -march=armv8-a+sve2 vol_createsample.o -o vol4
And finally, when running it we need to make sure to add the qemu-aarch64
argument to specify that we'll be emulating the appropriate hardware to run SVE2, as the real thing isn't available to us yet. I ran it with the following command and confirmed it worked as intended.
qemu-aarch64 ./vol4
This has been a quick exploration of making use of autovectorization to implement SVE2 in a program. Enjoy!
Posted on April 21, 2022
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.