Adding SVE2 Support to an Open Source Library - Part II

Part 1
Part 2
Part 3

In the last entry in this series I found a library called opus which currently uses SIMD by way of compiler intrinsics. Today I'm implementing SVE2 optimization in this library.

My first step will be swapping out the SIMD intrinsics in a file for their SVE2 counterparts. Then I can modify the makefile to detect when it's appropriate to use those enhancements and compile them accordingly. If a machine can't support SVE2, there's no use compiling that code.

By performing a search for "neon" in the package we get the following results:

find | grep neon

./celt/arm/pitch_neon_intr.lo
./celt/arm/celt_neon_intr.lo
./celt/arm/celt_neon_intr.c
./celt/arm/pitch_neon_intr.o
./celt/arm/pitch_neon_intr.c
./celt/arm/celt_neon_intr.o
./celt/arm/.libs/pitch_neon_intr.o
./celt/arm/.libs/celt_neon_intr.o
./celt/arm/.deps/pitch_neon_intr.Plo
./celt/arm/.deps/celt_neon_intr.Plo
./silk/fixed/arm/.deps/warped_autocorrelation_FIX_neon_intr.Plo
./silk/fixed/arm/warped_autocorrelation_FIX_neon_intr.c
./silk/arm/biquad_alt_neon_intr.lo
./silk/arm/NSQ_neon.c
./silk/arm/NSQ_del_dec_neon_intr.o
./silk/arm/LPC_inv_pred_gain_neon_intr.c
./silk/arm/NSQ_neon.lo
./silk/arm/NSQ_neon.h
./silk/arm/LPC_inv_pred_gain_neon_intr.o
./silk/arm/.libs/NSQ_del_dec_neon_intr.o
./silk/arm/.libs/LPC_inv_pred_gain_neon_intr.o
./silk/arm/.libs/biquad_alt_neon_intr.o
./silk/arm/.libs/NSQ_neon.o
./silk/arm/LPC_inv_pred_gain_neon_intr.lo
./silk/arm/.deps/NSQ_neon.Plo
./silk/arm/.deps/NSQ_del_dec_neon_intr.Plo
./silk/arm/.deps/LPC_inv_pred_gain_neon_intr.Plo
./silk/arm/.deps/biquad_alt_neon_intr.Plo
./silk/arm/biquad_alt_neon_intr.o
./silk/arm/biquad_alt_neon_intr.c
./silk/arm/NSQ_del_dec_neon_intr.c
./silk/arm/NSQ_del_dec_neon_intr.lo
./silk/arm/NSQ_neon.o

It looks like there's a lot to work with here - unfortunately we don't have time to add SVE2 intrinsics to all these files so we'll have to narrow in on one file or even a section of a file to start with, which the maintainers can use as a jumping off point for future optimization. In the last post I'd mentioned one file in particular, opus/celt/arm/pitch_neon_intr.c. I'll start there and see what I can do.

First we'll include the appropriate header:

#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#endif /* __ARM_FEATURE_SVE */

Starting with the first loop we encounter, the code is as follows:

opus_val32 celt_inner_prod_neon(const opus_val16 *x, const opus_val16 *y, int N)
{

int i;
    opus_val32 xy;
    int16x8_t x_s16x8, y_s16x8;
    int32x4_t xy_s32x4 = vdupq_n_s32(0);
    int64x2_t xy_s64x2;
    int64x1_t xy_s64x1;

    for (i = 0; i < N - 7; i += 8) {
        x_s16x8  = vld1q_s16(&x[i]);
        y_s16x8  = vld1q_s16(&y[i]);
        xy_s32x4 = vmlal_s16(xy_s32x4, vget_low_s16 (x_s16x8), vget_low_s16 (y_s16x8));
        xy_s32x4 = vmlal_s16(xy_s32x4, vget_high_s16(x_s16x8), vget_high_s16(y_s16x8));
    }

for (; i < N; i++) {
        xy = MAC16_16(xy, x[i], y[i]);
    }

By looking up the intrinsics in the instruction set arm provides, we can quickly find out what the Neon intrinsics represent and determine their SVE2 counterparts.

We start with initializations, including one initialization to the result of vdupq_n_s32 - which sets all lanes of the register to the same value. The SVE2 version of this is svdup_lane.

The first intrinsic in the loop, vld1q_s16, can load multiple elements to multiple registers. In this case, it loads x_s16x8 with the value from &x[i]. It's followed by another of the same type which loads y_s16x8 with the value from &y[i]. The SVE2 version of this is svldnf1sh_32. Next there are two multiplications between the low portions of x and y and then the high portions using the vmlal_s16 instruction. The SVE versions of these are svpmullb and svpmullt respectively, for the bottom and top halves. We also need to call vget_low_s16 and vget_high_s16, or rather their SVE2 counterparts: svunpklo and svunpkhi.

After making all the aforementioned adjustments, here's what we get:

#ifdef __ARM_FEATURE_SVE2
pus_val32 celt_inner_prod_neon(const opus_val16 *x, const opus_val16 *y, int N)
{
    int i;
    opus_val32 xy;
    svint16_t x_s16x8, y_s16x8;
    svint32_t xy_s32x4 = svdup_lane(0);
    svint64_t xy_s64x2;
    svint64_t xy_s64x1;

    for (i = 0; i < N - 7; i += 8) {
        x_s16x8  = svldnf1sh_s32(&x[i]);
        y_s16x8  = svldnf1sh_s32(&y[i]);
        xy_s32x4 = svpmullb(xy_s32x4, svunpklo (x_s16x8), svunpklo (y_s16x8));
        xy_s32x4 = svpmullb(xy_s32x4, svunpkhi (x_s16x8), svunpkhi (y_s16x8));
    }

    if (N - i >= 4) {
        const int16x4_t x_s16x4 = vld1_s16(&x[i]);
        const int16x4_t y_s16x4 = vld1_s16(&y[i]);
        xy_s32x4 = vmlal_s16(xy_s32x4, x_s16x4, y_s16x4);
        i += 4;
    }

    xy_s64x2 = vpaddlq_s32(xy_s32x4);
    xy_s64x1 = vadd_s64(vget_low_s64(xy_s64x2), vget_high_s64(xy_s64x2));
    xy      = vget_lane_s32(vreinterpret_s32_s64(xy_s64x1), 0);

    for (; i < N; i++) {
        xy = MAC16_16(xy, x[i], y[i]);
    }
#endif

Now all we have to do is see if we can compile and run it.

CCASFLAGS = -g -O3 -march=armv8-a+sve2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes
CCDEPMODE = depmode=gcc3
CFLAGS = -g -O3 -march=armv8-a+sve2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes -fvisibility=hidden -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes

I added the relevant compile flags to turn on SVE2 optimization and gave it a go - unfortunately there were some build errors that would have to be dealt with so in my next post I'll go over next steps to solve those and continue building SVE2 optimizations into this package. More on that soon!

Blog

Adding SVE2 Support to an Open Source Library - Part II

gus

Join Our Newsletter. No Spam, Only the good stuff.

Related