Implementing SVE2 for Open Source Project

Introduction

In the last post, we explored and implemented Scalable Vector Extension 2 (SVE2) code for the volume adjusting algorithm. Now, we will do the same process but in a much bigger scale - by actually trying to contribute SVE2 code for the ongoing open source project.

Searching for a package

As we learned before, SVE2 is best suitable for processing large amount of data such as:

Computer vision
Multimedia
Long-Term Evolution (LTE) baseband processing
Genomics
In-memory database
Web serving
Cryptography
And so on...

And we know the vectorization can be implemented in 3 different ways:

Auto-vectorization
Compiler Intrinsics
Inline Assembler

Since we already have the experience of intrinsics, we will try our best to search packages that already use them.

We also have to consider if a package supports for our machine (Fedora 35 running on Aarch64 Architecture) as we have to install the program. For this, we will use the Fedora's package manager DNF and run the following commands:

$dnf search search_keyword
$dnf info package_name

By using $dnf search, the keyword is searched in both name and description of every package. Once we find a name of package, we can display the detailed description of that package with $dnf info. We also have to be careful to only choose open-source project.

List of Possible Candidates

With the aforementioned strategy, we can find some possible candidates as follows:

Let's see each package together!

libjpeg-turbo

libjpeg-turbo is a JPEG image codec that utilizes SIMD instructions to perform JPEG compression and decompression. When we inspect the package, we can find a list of promising files as follows:

$ find . -name "*neon*"
./jidctfst-neon.c
./jcsample-neon.c
./aarch32/jchuff-neon.c
./aarch32/jsimd_neon.S
./aarch32/jccolext-neon.c
./jfdctfst-neon.c
./neon-compat.h.in
./aarch64/jchuff-neon.c
./aarch64/jsimd_neon.S
./aarch64/jccolext-neon.c
./jidctred-neon.c
./jfdctint-neon.c
./jdmerge-neon.c
./jidctint-neon.c
./jccolor-neon.c
./jdsample-neon.c
./jdcolor-neon.c
./jdmrgext-neon.c
./jcgryext-neon.c
./jcphuff-neon.c
./jcgray-neon.c
./jdcolext-neon.c
./jquanti-neon.c

// jquanti-neon.c
// ...

#if defined(__clang__) && (defined(__aarch64__) || defined(_M_ARM64))
#pragma unroll
#endif
  for (i = 0; i < DCTSIZE; i += DCTSIZE / 2) {
    /* Load DCT coefficients. */
    int16x8_t row0 = vld1q_s16(workspace + (i + 0) * DCTSIZE);
    int16x8_t row1 = vld1q_s16(workspace + (i + 1) * DCTSIZE);
    int16x8_t row2 = vld1q_s16(workspace + (i + 2) * DCTSIZE);
    int16x8_t row3 = vld1q_s16(workspace + (i + 3) * DCTSIZE);
    /* Load reciprocals of quantization values. */
    uint16x8_t recip0 = vld1q_u16(recip_ptr + (i + 0) * DCTSIZE);
    uint16x8_t recip1 = vld1q_u16(recip_ptr + (i + 1) * DCTSIZE);
    uint16x8_t recip2 = vld1q_u16(recip_ptr + (i + 2) * DCTSIZE);
    uint16x8_t recip3 = vld1q_u16(recip_ptr + (i + 3) * DCTSIZE);
    uint16x8_t corr0 = vld1q_u16(corr_ptr + (i + 0) * DCTSIZE);
    uint16x8_t corr1 = vld1q_u16(corr_ptr + (i + 1) * DCTSIZE);
    uint16x8_t corr2 = vld1q_u16(corr_ptr + (i + 2) * DCTSIZE);
    uint16x8_t corr3 = vld1q_u16(corr_ptr + (i + 3) * DCTSIZE);
    int16x8_t shift0 = vld1q_s16(shift_ptr + (i + 0) * DCTSIZE);
    int16x8_t shift1 = vld1q_s16(shift_ptr + (i + 1) * DCTSIZE);
    int16x8_t shift2 = vld1q_s16(shift_ptr + (i + 2) * DCTSIZE);
    int16x8_t shift3 = vld1q_s16(shift_ptr + (i + 3) * DCTSIZE);

// ...

As we can see, vld1q_s16 intrinsic is used to load a vector from memory. Furthermore, the package does not yet use SVE or SVE2 implementation. This indicates this project is a good candidate where we can contribute our knowledge of SVE2 for this project.

SoundTouch

Soundtouch is an audio-processing library that allows changing the sound tempo, pitch and playback rate parameters. This sounds familiar to us as we dealt with a simple audio algorithm before and maybe another good candidate for us.

$grep -ir neon .
./configure.ac:AC_CHECK_HEADERS([arm_neon.h])
./configure.ac:AC_ARG_ENABLE([neon-optimizations],
./configure.ac:              [AS_HELP_STRING([--enable-neon-optimizations],
./configure.ac:                              [use ARM NEON optimization [default=yes]])],[enable_neon_optimizations="${enableval}"],

# configure.ac 
if test "x$enable_neon_optimizations" = "xyes" -a "x$ac_cv_header_arm_neon_h" = "xyes"; then

        # Check for ARM NEON support
        original_saved_CXXFLAGS=$CXXFLAGS
        have_neon=no
        CXXFLAGS="-mfpu=neon -march=native $CXXFLAGS"

        # Check if can compile neon code using intrinsics, require GCC >= 4.3 for autovectorization.
        AC_COMPILE_IFELSE([AC_LANG_SOURCE([[
        #if defined(__GNUC__) && (__GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 3))
        #error "Need GCC >= 4.3 for neon autovectorization"
        #endif
        #include <arm_neon.h>
        int main () {
                int32x4_t t = {1};
                return vaddq_s32(t,t)[0] == 2;
        }]])],[have_neon=yes])
        CXXFLAGS=$original_saved_CXXFLAGS
        if test "x$have_neon" = "xyes" ; then
                echo "****** NEON support enabled ******"
                CPPFLAGS="-mfpu=neon -march=native -mtune=native $CPPFLAGS"
                AC_DEFINE(SOUNDTOUCH_USE_NEON,1,[Use ARM NEON extension])
        fi
fi

The package does not contain any files that has simd or neon in their names. However, it does have a file that contains neon in its content. When we open that file, we can see this package utilizes the auto-vectorization feature by the compiler. As we can see, the package prompts a message saying that it cannot perform the auto-vectorization when it is compiled by GCC with a version less than 4.3.

Opus

Opus is a audio codec for interactive speech and audio transmission across the Internet with compression algorithms. It can support a wide rage of interactive audio applications such as Voice Over IP (VoIP), remote live music performance, and video conferencing. As similar to the last one, this may be a good candidate for us.

$ find | grep -i neon
./celt/arm/celt_neon_intr.c
./celt/arm/pitch_neon_intr.c
./silk/fixed/arm/warped_autocorrelation_FIX_neon_intr.c
./silk/arm/NSQ_neon.c
./silk/arm/LPC_inv_pred_gain_neon_intr.c
./silk/arm/NSQ_neon.h
./silk/arm/biquad_alt_neon_intr.c
./silk/arm/NSQ_del_dec_neon_intr.c

// celt_neon_intr.c
#include <arm_neon.h>
#include "../pitch.h"

#if defined(FIXED_POINT)
void xcorr_kernel_neon_fixed(const opus_val16 * x, const opus_val16 * y, opus_val32 sum[4], int len)
{
   int j;
   int32x4_t a = vld1q_s32(sum);
   /* Load y[0...3] */
   /* This requires len>0 to always be valid (which we assert in the C code). */
   int16x4_t y0 = vld1_s16(y);
   y += 4;

   for (j = 0; j + 8 <= len; j += 8)
   {
      /* Load x[0...7] */
      int16x8_t xx = vld1q_s16(x);
      int16x4_t x0 = vget_low_s16(xx);
      int16x4_t x4 = vget_high_s16(xx);
      /* Load y[4...11] */
      int16x8_t yy = vld1q_s16(y);
      int16x4_t y4 = vget_low_s16(yy);
      int16x4_t y8 = vget_high_s16(yy);
      int32x4_t a0 = vmlal_lane_s16(a, y0, x0, 0);
      int32x4_t a1 = vmlal_lane_s16(a0, y4, x4, 0);

      int16x4_t y1 = vext_s16(y0, y4, 1);
      int16x4_t y5 = vext_s16(y4, y8, 1);
      int32x4_t a2 = vmlal_lane_s16(a1, y1, x0, 1);
      int32x4_t a3 = vmlal_lane_s16(a2, y5, x4, 1);

      int16x4_t y2 = vext_s16(y0, y4, 2);
      int16x4_t y6 = vext_s16(y4, y8, 2);
      int32x4_t a4 = vmlal_lane_s16(a3, y2, x0, 2);
      int32x4_t a5 = vmlal_lane_s16(a4, y6, x4, 2);

      int16x4_t y3 = vext_s16(y0, y4, 3);
      int16x4_t y7 = vext_s16(y4, y8, 3);
      int32x4_t a6 = vmlal_lane_s16(a5, y3, x0, 3);
      int32x4_t a7 = vmlal_lane_s16(a6, y7, x4, 3);

      y0 = y8;
      a = a7;
      x += 8;
      y += 8;
   }
// ...

When searched with neon, we can see a list of promising files that potentially deal with simd instructions. In celt_neon_intr.c file, we can see xcorr_kernel_neon_fixed function executes a loop with SIMD instructions.

Result

We have a pretty good open-source projects to implement SVE2. Amongst them, we will choose Opus project for several reasons. First of all, this project is still well and actively maintained by developers. As a matter of fact, it is standardized by the Internet Engineering Task Force IETF and unmatched for interactive audio transmission over the Internet. Besides, the package is well-documented to understand the code thoroughly. Lastly, and most importantly, the code is written to be more readable by new developers as compared to the first two projects. As we can see, the author kindly commented the purpose of variables and functions. Thus, we will choose Opus project to contribute our SVE2 knowledge.

Contributions

The way to contribute for Opus project is well-explained in its wiki page. Thankfully, the wiki page states that one of ways to contribute to Opus development is by doing optimizations (assembly/intrinsics). To do this, we can easily approach to the developers on the mailing list or through the IRC channel.

Conclusion

In this post, we explored some of the open-source projects where we could contribute our SVE2 knowledge. As it turned out, Opus project is most suitable for us. In the following post, we will start implementing SVE2 codes in the project.

Blog