Optimizing a Program Through SVE2 Auto-Vectorization

Today I'm going to be taking another look at the volume scaling algorithms we benchmarked in my last post with the goal of adding SVE2 optimization and further improving the runtime. Because we're using SVE2 we need to make these changes on either vol4.c or vol5.c, as those are the AArch64-specific algorithms that take advantage of inline assembly and intrinsics, respectively.

To make things simple I'll use the first candidate, vol4.c, which uses inline assembly. The full code is as follows:

int main() {

#ifndef __aarch64__
        printf("Wrong architecture - written for aarch64 only.\n");
#else


        // these variables will also be accessed by our assembler code
        int16_t*        in_cursor;              // input cursor
        int16_t*        out_cursor;             // output cursor
        int16_t         vol_int;                // volume as int16_t

        int16_t*        limit;                  // end of input array

        int             x;                      // array interator
        int             ttl=0 ;                 // array total

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]


        // set vol_int to fixed-point representation of the volume factor
        // Q: should we use 32767 or 32768 in next line? why?
        vol_int = (int16_t)(VOLUME/100.0 * 32767.0);

        // Q: what is the purpose of these next two lines?
        in_cursor = in;
        out_cursor = out;
        limit = in + SAMPLES;

        // Q: what does it mean to "duplicate" values in the next line?
        __asm__ ("dup v1.8h,%w0"::"r"(vol_int)); // duplicate vol_int into v1.8h

        while ( in_cursor < limit ) {
                __asm__ (
                        "ldr q0, [%[in_cursor]], #16    \n\t"
                        // load eight samples into q0 (same as v0.8h)
                        // from [in_cursor]
                        // post-increment in_cursor by 16 bytes
                        // and store back into the pointer register


                        "sqrdmulh v0.8h, v0.8h, v1.8h   \n\t"
                        // with 32 signed integer output,
                        // multiply each lane in v0 * v1 * 2
                        // saturate results
                        // store upper 16 bits of results into
                        // the corresponding lane in v0

                        "str q0, [%[out_cursor]],#16            \n\t"
                        // store eight samples to [out_cursor]
                        // post-increment out_cursor by 16 bytes
                        // and store back into the pointer register

                        // Q: What do these next three lines do?
                        : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor)
                        : "r"(in_cursor),"r"(out_cursor)
                        : "memory"
                        );
        }

// --------------------------------------------------------------------

        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

        // Q: are the results usable? are they correct?
        printf("Result: %d\n", ttl);

        return 0;

#endif
}

To start, we need to include the relevant library by adding an include.

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include "vol.h"
#include <time.h>
#include <arm_sve.h>

#ifndef __aarch64__
        printf("Wrong architecture- written for aarch64 only.\n");

Next, I changed the duplicate instruction's destination to the z register as per the SVE2 standard.

__asm__ ("dup z1.h,%w0"::"r"(vol_int)); //duplicate vol_int into z1.h
...
"sqrdmulh z0.h, z0.h, z1.h      \n\t"

Next the makefile that we use to build the program needs to be changed to trigger the use of SVE2 by the compiler.

vol4:    vol4.c vol_createsample.o vol.h
         gcc ${CCOPTS} vol4.c -march=armv8-a+sve2 vol_createsample.o -o vol4

And finally, when running it we need to make sure to add the qemu-aarch64 argument to specify that we'll be emulating the appropriate hardware to run SVE2, as the real thing isn't available to us yet. I ran it with the following command and confirmed it worked as intended.

qemu-aarch64 ./vol4

This has been a quick exploration of making use of autovectorization to implement SVE2 in a program. Enjoy!

Blog

Optimizing a Program Through SVE2 Auto-Vectorization

gus

Join Our Newsletter. No Spam, Only the good stuff.

Related