Introduction

Welcome back to part 2 of the SVE2 (Scalable Vector Extension version 2) lab. If you are not sure what this post is about, see part 1 for background.

Source code (vol1.c) to be converted to use SVE2

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>
#include "vol.h"

int16_t scale_sample(int16_t sample, int volume) {

        return ((((int32_t) sample) * ((int32_t) (32767 * volume / 100) <<1) ) >> 16);
}

int main() {
        int             x;
        int             ttl=0;

// ---- Create in[] and out[] arrays
        int16_t*        in;
        int16_t*        out;
        in=(int16_t*) calloc(SAMPLES, sizeof(int16_t));
        out=(int16_t*) calloc(SAMPLES, sizeof(int16_t));

// ---- Create dummy samples in in[]
        vol_createsample(in, SAMPLES);

// ---- This is the part we're interested in!
// ---- Scale the samples from in[], placing results in out[]
        for (x = 0; x < SAMPLES; x++) {
                out[x]=scale_sample(in[x], VOLUME);
        }

// ---- This part sums the samples.

        for (x = 0; x < SAMPLES; x++) {
                ttl=(ttl+out[x])%1000;
        }

// ---- Print the sum of the samples.

        printf("Result: %d\n", ttl);

        return 0;

}

As you can tell, this is vol1 from the previous post about algorithm selection.
Note that vol1 uses a fixed-point calculation. This avoids the cost of repeatedly converting between integer and floating-point values.

Converting

C Compiler Options

Most compilers do not have a specific target for Armv9 systems. Therefore, to build code that includes SVE2 instructions, we need to instruct the compiler to emit code for an Armv8-A processor that also understands the SVE2 instructions. On the GCC compiler, this is done with the -march= option.

To get SVE2 instructions out of plain C code, we also need to invoke the autovectorizer. In GCC version 11, that means using -O3 or the appropriate feature options (such as -ftree-vectorize):

gcc -O3 -march=armv8-a+sve2

In our case, we will be working with vol1:

gcc -O3 -march=armv8-a+sve2 vol1.c vol_createsample.c -o vol1

Then, we can run the program under the QEMU usermode emulator. This will trap SVE2 instructions and emulate them in software, while executing Armv8-A instructions directly on the hardware:

qemu-aarch64 ./vol1

Result:

Converted code

.arch armv8-a+sve2
        .file   "vol1.c"
        .text
        .align  2
        .p2align 4,,11
        .global scale_sample
        .type   scale_sample, %function
scale_sample:
.LFB24:
        .cfi_startproc
        lsl     w2, w1, 15
        mov     w3, 34079
        sub     w1, w2, w1
        movk    w3, 0x51eb, lsl 16
        sxth    w0, w0
        smull   x3, w1, w3
        asr     x3, x3, 37
        sub     w1, w3, w1, asr 31
        lsl     w1, w1, 1
        mul     w0, w1, w0
        lsr     w0, w0, 16
        ret
        .cfi_endproc
.LFE24:
        .size   scale_sample, .-scale_sample
        .section        .rodata.str1.8,"aMS",@progbits,1
        .align  3
.LC0:
        .string "Total Time: %2.9f\n"

Understanding converted code

Instructions generated for scale_sample

        .cfi_startproc
        lsl     w2, w1, 15
        mov     w3, 34079
        sub     w1, w2, w1
        movk    w3, 0x51eb, lsl 16
        sxth    w0, w0
        smull   x3, w1, w3
        asr     x3, x3, 37
        sub     w1, w3, w1, asr 31
        lsl     w1, w1, 1
        mul     w0, w1, w0
        lsr     w0, w0, 16
        ret
        .cfi_endproc

Corresponding C code

return ((((int32_t) sample) * ((int32_t) (32767 * volume / 100) <<1) ) >> 16);

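‘mov w3, 34079’ followed by ‘movk w3, 0x51eb, lsl 16’ builds the constant 0x51eb851f in w3: movk writes the 16-bit value 0x51eb into bits 16-31 (that is what ‘lsl 16’ selects) and keeps the lower bits. This constant is the fixed-point reciprocal used to divide by 100.
‘lsl w2, w1, 15’ and ‘sub w1, w2, w1’ compute volume * 32768 - volume, that is, 32767 * volume.
‘sxth w0, w0’ sign-extends the low 16 bits (the halfword) of w0 to 32 bits, which is the (int32_t) cast of ‘sample’.
‘smull x3, w1, w3’ is a signed 32-bit by 32-bit multiply with a 64-bit result. It multiplies 32767 * volume by the reciprocal constant; together with ‘asr x3, x3, 37’ and ‘sub w1, w3, w1, asr 31’ it performs the division by 100 without a divide instruction.
‘lsl w1, w1, 1’ is the shift left by one bit (<<1) applied to the volume factor.
‘mul w0, w1, w0’ multiplies the sign-extended sample by the volume factor, giving a 32-bit result.
‘lsr w0, w0, 16’ shifts the result right by 16 bits, leaving the scaled sample in the low 16 bits of w0.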

Conclusion

We have finished experimenting with SVE2 and the volume-adjusting algorithm (vol1). Because SVE2 is very new and practically no systems are available for it yet, we must use an emulator to run the program, so I wasn't able to find a way to test the SVE2 performance of the assembly code.
The most challenging part of the lab was, after compiling the C code with SVE2 enabled, relating the generated instructions back to the code in the original C file.

Source: SVE2

SVE2 (Scalable Vector Extension version 2) - LAB 6 part 2

Tecca Yu