Adding SVE2 Support to an Open Source Library - Part III

gusmccallum

gus

Posted on April 22, 2022



In my last post I ran into some snags at the end when building opus: some of the intrinsics I wrote for the file I modified produced errors, so I wasn't able to build and test the library. In this post, I'm going to change tactics and try autovectorization to see if I can successfully build and test the library, after which I'll analyze the results.

First off, I'll clear my work so far and download a fresh copy of the library. At this point I need to configure and build, but to prevent the NEON intrinsics from conflicting with the autovectorization I'm going to use, I need to turn off NEON support in the configure.ac file. I searched for mentions of intrinsics and turned them off, then ran autogen.sh and configure to get the build configured. The output confirms that intrinsics are now turned off:

------------------------------------------------------------------------
  opus 1.3.1-107-gccaaffa9-dirty:  Automatic configuration OK.

    Compiler support:

    C99 var arrays: ................ yes
    C99 lrintf: .................... yes
    Use alloca: .................... no (using var arrays)

    General configuration:

    Floating point support: ........ yes
    Fast float approximations: ..... no
    Fixed point debugging: ......... no
    Inline Assembly Optimizations: . No inline ASM for your platform, please send patches
    External Assembly Optimizations:  
    Intrinsics Optimizations: ...... no
    Run-time CPU detection: ........ no
    Custom modes: .................. no
    Assertion checking: ............ no
    Hardening: ..................... yes
    Fuzzing: ....................... no
    Check ASM: ..................... no

    API documentation: ............. yes
    Extra programs: ................ yes
------------------------------------------------------------------------


Now, by substituting the CFLAGS mentioned in the last post (-O3 -march=armv8-a+sve2) into the makefile and taking care to run the build with the qemu-aarch64 argument, we can see that the build succeeds and 8 of the 14 tests pass.

FAIL: celt/tests/test_unit_cwrs32
./test-driver: line 107: 448983 Illegal instruction     (core dumped) "$@" > $log_file 2>&1
FAIL: celt/tests/test_unit_dft
PASS: celt/tests/test_unit_entropy
PASS: celt/tests/test_unit_laplace
PASS: celt/tests/test_unit_mathops
./test-driver: line 107: 449031 Illegal instruction     (core dumped) "$@" > $log_file 2>&1
FAIL: celt/tests/test_unit_mdct
./test-driver: line 107: 449046 Illegal instruction     (core dumped) "$@" > $log_file 2>&1
FAIL: celt/tests/test_unit_rotation
PASS: celt/tests/test_unit_types
./test-driver: line 107: 449072 Illegal instruction     (core dumped) "$@" > $log_file 2>&1
FAIL: silk/tests/test_unit_LPC_inv_pred_gain
PASS: tests/test_opus_api
PASS: tests/test_opus_decode
PASS: tests/test_opus_encode
PASS: tests/test_opus_padding
./test-driver: line 107: 449716 Illegal instruction     (core dumped) "$@" > $log_file 2>&1
FAIL: tests/test_opus_projection
======================================================
   opus 1.3.1-107-gccaaffa9-dirty: ./test-suite.log
======================================================

# TOTAL: 14
# PASS:  8
# SKIP:  0
# XFAIL: 0
# FAIL:  6
# XPASS: 0
# ERROR: 0

.. contents:: :depth: 2

FAIL: celt/tests/test_unit_cwrs32
=================================

FAIL celt/tests/test_unit_cwrs32 (exit status: 132)

FAIL: celt/tests/test_unit_dft
==============================

FAIL celt/tests/test_unit_dft (exit status: 132)

FAIL: celt/tests/test_unit_mdct
===============================

FAIL celt/tests/test_unit_mdct (exit status: 132)

FAIL: celt/tests/test_unit_rotation
===================================

FAIL celt/tests/test_unit_rotation (exit status: 132)

FAIL: silk/tests/test_unit_LPC_inv_pred_gain
============================================

FAIL silk/tests/test_unit_LPC_inv_pred_gain (exit status: 132)

FAIL: tests/test_opus_projection
================================

FAIL tests/test_opus_projection (exit status: 132)

============================================================================
Testsuite summary for opus 1.3.1-107-gccaaffa9-dirty
============================================================================
# TOTAL: 14
# PASS:  8
# SKIP:  0
# XFAIL: 0
# FAIL:  6
# XPASS: 0
# ERROR: 0
============================================================================


Let's take a closer look at one of the tests that ran successfully with SVE2 enabled:

Running Opus Encode Test

./test_opus_encode
Testing libopus 1.3.1-107-gccaaffa9-dirty encoder. Random seed: 3135156945 (95E3)
Running simple tests for bugs that have been fixed previously
  Encode+Decode tests.
    Mode    LP FB encode  VBR,  11318 bps OK.
    Mode    LP FB encode  VBR,  14930 bps OK.
    Mode    LP FB encode  VBR,  67659 bps OK.
    Mode Hybrid FB encode  VBR,  17712 bps OK.
    Mode Hybrid FB encode  VBR,  51200 bps OK.
    Mode Hybrid FB encode  VBR,  80954 bps OK.
    Mode Hybrid FB encode  VBR, 127480 bps OK.
    Mode   MDCT FB encode  VBR, 752629 bps OK.
    Mode   MDCT FB encode  VBR,  25609 bps OK.
    Mode   MDCT FB encode  VBR,  33107 bps OK.
    Mode   MDCT FB encode  VBR,  78592 bps OK.
    Mode   MDCT FB encode  VBR,  73157 bps OK.
    Mode   MDCT FB encode  VBR, 137477 bps OK.
    Mode    LP FB encode CVBR,  11480 bps OK.
    Mode    LP FB encode CVBR,  21257 bps OK.
    Mode    LP FB encode CVBR,  63201 bps OK.
    Mode Hybrid FB encode CVBR,  25583 bps OK.
    Mode Hybrid FB encode CVBR,  36126 bps OK.
    Mode Hybrid FB encode CVBR,  54107 bps OK.
    Mode Hybrid FB encode CVBR, 108482 bps OK.
    Mode   MDCT FB encode CVBR, 934758 bps OK.
    Mode   MDCT FB encode CVBR,  25111 bps OK.
    Mode   MDCT FB encode CVBR,  33929 bps OK.
    Mode   MDCT FB encode CVBR,  52270 bps OK.
    Mode   MDCT FB encode CVBR,  79059 bps OK.
    Mode   MDCT FB encode CVBR, 117366 bps OK.
    Mode    LP FB encode  CBR,   7432 bps OK.
    Mode    LP FB encode  CBR,  16781 bps OK.
    Mode    LP FB encode  CBR,  90950 bps OK.
    Mode Hybrid FB encode  CBR,  18257 bps OK.
    Mode Hybrid FB encode  CBR,  37925 bps OK.
    Mode Hybrid FB encode  CBR,  56473 bps OK.
    Mode Hybrid FB encode  CBR,  78233 bps OK.
    Mode   MDCT FB encode  CBR, 780220 bps OK.
    Mode   MDCT FB encode  CBR,  20668 bps OK.
    Mode   MDCT FB encode  CBR,  38398 bps OK.
    Mode   MDCT FB encode  CBR,  74376 bps OK.
    Mode   MDCT FB encode  CBR,  68468 bps OK.
    Mode   MDCT FB encode  CBR, 141108 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,   4884 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,  18110 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,  44628 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,  15245 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,  26620 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,  61885 bps OK.
    Mode    LP NB dual-mono MS encode  VBR,  86977 bps OK.
    Mode    LP NB dual-mono MS encode  VBR, 119885 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,   7123 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,  19106 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,  41453 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,  10135 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,  19040 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,  57693 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR,  77731 bps OK.
    Mode   MDCT NB dual-mono MS encode  VBR, 165272 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,   7245 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,  16460 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,  56065 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,  13411 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,  28783 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,  61638 bps OK.
    Mode    LP NB dual-mono MS encode CVBR,  92219 bps OK.
    Mode    LP NB dual-mono MS encode CVBR, 110936 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,   4047 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,  21622 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,  43253 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,  12557 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,  28091 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,  57473 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR,  77203 bps OK.
    Mode   MDCT NB dual-mono MS encode CVBR, 154714 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,   4000 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,  12396 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,  56699 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,  10327 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,  19576 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,  36651 bps OK.
    Mode    LP NB dual-mono MS encode  CBR,  50625 bps OK.
    Mode    LP NB dual-mono MS encode  CBR, 122376 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,   4916 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  14647 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  55741 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  12307 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  23408 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  62311 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR,  54876 bps OK.
    Mode   MDCT NB dual-mono MS encode  CBR, 104358 bps OK.
    All framesize pairs switching encode, 9810 frames OK.
Running fuzz_encoder_settings with 5 encoder(s) and 40 setting change(s) each.
Tests completed successfully.

Now we can inspect the built executables and see which of them make use of SVE2 instructions.

find . -type f -executable -print | while read X ; do echo ======== $X ; objdump -d $X | grep whilelo ; done

The lines in question are too numerous to put here but the files affected are:

======== ./tests/test_opus_projection
======== ./tests/.libs/test_opus_encode
======== ./tests/.libs/test_opus_api
======== ./tests/.libs/test_opus_decode
======== ./celt/tests/test_unit_entropy
======== ./celt/tests/test_unit_cwrs32
======== ./celt/tests/test_unit_mathops
======== ./celt/tests/test_unit_rotation
======== ./celt/tests/test_unit_dft
======== ./celt/tests/test_unit_mdct
======== ./.libs/opus_demo
======== ./.libs/libopus.so.0.8.0
======== ./.libs/trivial_example
======== ./opus_compare
======== ./silk/tests/test_unit_LPC_inv_pred_gain

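Appending a line count to the same pipeline:

find . -type f -executable -print | while read X ; do echo ======== $X ; objdump -d $X 2> /dev/null | grep whilelo ; done | wc -l

returns 2903 lines. That figure includes one ======== header line per executable, so the number of whilelo instructions is slightly lower, but still in the thousands. I'll zero in on one of these files to see how it makes use of its SVE2 instructions.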

Analyzing Opus Encode Test

I'll go back to the encode test I ran before and take a look at how it's using its SVE2 instructions now.

objdump -d test_opus_encode > ~/opus_encode_objdump

Searching through the output, I can find 6 instances of whilelo, the first 2 of which are in this <generate_music> section.

00000000004016b0 <generate_music>:
  4016b0:       d2800002        mov     x2, #0x0                        // #0
  4016b4:       d282d003        mov     x3, #0x1680                     // #5760
  4016b8:       2538c000        mov     z0.b, #0
  4016bc:       25631fe0        whilelo p0.h, xzr, x3
  4016c0:       e4a24000        st1h    {z0.h}, p0, [x0, x2, lsl #1]
  4016c4:       0470e3e2        inch    x2
  4016c8:       25631c40        whilelo p0.h, x2, x3
  4016cc:       54ffffa1        b.ne    4016c0 <generate_music+0x10>  // b.any
  4016d0:       712d003f        cmp     w1, #0xb40
  4016d4:       54000e4d        b.le    40189c <generate_music+0x1ec>
  4016d8:       a9bb7bfd        stp     x29, x30, [sp, #-80]!
  4016dc:       f000017e        adrp    x30, 430000 <memcpy@GLIBC_2.17>
  4016e0:       910593de        add     x30, x30, #0x164
  4016e4:       910003fd        mov     x29, sp
  4016e8:       a90153f3        stp     x19, x20, [sp, #16]
  4016ec:       d285a002        mov     x2, #0x2d00                     // #11520
  4016f0:       52955571        mov     w17, #0xaaab                    // #43691
  4016f4:       294093d4        ldp     w20, w4, [x30, #4]
  4016f8:       52955550        mov     w16, #0xaaaa                    // #43690
  4016fc:       8b020002        add     x2, x0, x2
  401700:       52800006        mov     w6, #0x0                        // #0

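So let's break down what it's doing here. whilelo is not a loop in itself; it builds the predicate (a per-lane mask) that controls the loop. Its first argument, the scalable predicate register p0.h, is the destination. Starting from the value in the second argument (here xzr, the zero register) and counting up by one per halfword lane, it marks a lane active for as long as that count is lower, as an unsigned comparison, than the value in the third argument, x3 (5760).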

  4016bc:       25631fe0        whilelo p0.h, xzr, x3

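The program then performs a st1h, a contiguous store of halfwords from a vector: the lanes of z0 that are active in p0 are written to memory at the address in x0 plus the scalar index x2, scaled by 2 bytes per halfword. Since z0 was set to zero just before the loop, this writes zeros into the buffer.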

 4016c0:    e4a24000        st1h    {z0.h}, p0, [x0, x2, lsl #1]

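It then increments x2 by the number of halfword lanes in a vector (inch). In the full listing above, a second whilelo recomputes the predicate from the new index, and b.ne branches back to the store as long as any lane is still active.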

  4016c4:       0470e3e2        inch    x2
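To make the whilelo / st1h / inch sequence above concrete, here is a rough scalar model of it in C. This is only a sketch of the behaviour, not what the compiler or hardware literally does: the names whilelo_h, zero_fill_model and MAX_LANES are made up for illustration, and the lanes parameter stands in for the hardware vector length, which the real code never needs to know.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound: SVE vectors are at most 2048 bits,
   which is 128 halfword (16-bit) lanes. */
#define MAX_LANES 128

/* Model of `whilelo p.h, i, n`: lane k is active while i + k < n.
   Returns how many lanes ended up active. */
size_t whilelo_h(uint8_t pred[], size_t lanes, uint64_t i, uint64_t n)
{
    size_t active = 0;
    for (size_t k = 0; k < lanes; k++) {
        pred[k] = (uint8_t)(i + k < n);
        active += pred[k];
    }
    return active;
}

/* Model of the loop at the top of generate_music: store zero into n
   halfwords, `lanes` elements per pass. Returns the number of passes. */
size_t zero_fill_model(int16_t *buf, uint64_t n, size_t lanes)
{
    uint8_t pred[MAX_LANES];
    uint64_t i = 0;                                /* plays the role of x2 */
    size_t passes = 0;

    assert(lanes > 0 && lanes <= MAX_LANES);
    size_t active = whilelo_h(pred, lanes, i, n);  /* whilelo p0.h, xzr, x3 */
    while (active != 0) {                          /* b.ne (b.any) */
        for (size_t k = 0; k < lanes; k++)         /* st1h {z0.h}, p0, [x0, x2, lsl #1] */
            if (pred[k])
                buf[i + k] = 0;                    /* inactive lanes store nothing */
        i += lanes;                                /* inch x2 */
        active = whilelo_h(pred, lanes, i, n);     /* whilelo p0.h, x2, x3 */
        passes++;
    }
    return passes;
}
```

Calling zero_fill_model(buf, 5760, 8) models a 128-bit vector (8 halfword lanes) and takes 720 passes; with 16 lanes it takes 360. The predicate is what lets the final pass stop exactly at the end of the buffer without a separate scalar cleanup loop.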

While this helps us understand the mechanics of what's being called and why, what function does this serve in the program? The source code can give us some clues in a language that's easier to parse:

   /* Generate input data */
   inbuf = (opus_int16*)malloc(sizeof(*inbuf)*SSAMPLES);
   generate_music(inbuf, SSAMPLES/2);

We can see here that generate_music is a function that, much like the vol_createsample function in lab 5, creates dummy data to operate on and test the encoding utility. Looking at the function definition in full:

void generate_music(short *buf, opus_int32 len)
{
   opus_int32 a1,b1,a2,b2;
   opus_int32 c1,c2,d1,d2;
   opus_int32 i,j;
   a1=b1=a2=b2=0;
   c1=c2=d1=d2=0;
   j=0;
   /*60ms silence*/
   for(i=0;i<2880;i++)buf[i*2]=buf[i*2+1]=0;
   for(i=2880;i<len;i++)
   {
    opus_uint32 r;
    opus_int32 v1,v2;
    v1=v2=(((j*((j>>12)^((j>>10|j>>12)&26&j>>7)))&128)+128)<<15;
    r=fast_rand();v1+=r&65535;v1-=r>>16;
    r=fast_rand();v2+=r&65535;v2-=r>>16;
    b1=v1-a1+((b1*61+32)>>6);a1=v1;
    b2=v2-a2+((b2*61+32)>>6);a2=v2;
    c1=(30*(c1+b1+d1)+32)>>6;d1=b1;
    c2=(30*(c2+b2+d2)+32)>>6;d2=b2;
    v1=(c1+128)>>8;
    v2=(c2+128)>>8;
    buf[i*2]=v1>32767?32767:(v1<-32768?-32768:v1);
    buf[i*2+1]=v2>32767?32767:(v2<-32768?-32768:v2);
    if(i%6==0)j++;
   }
}

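We can see that the entire function is essentially two loops. The whilelo instructions shown above belong to the first one, the 60ms of silence: 0x1680 is 5760, or 2 × 2880 halfwords, and z0 holds zeros. Every store in that loop is independent of the others, so the compiler can write a whole vector of zeros per pass instead of one sample at a time, which should speed it up considerably. The second loop is a harder target: each iteration depends on the values of a1, b1, c1, d1 and j left by the previous one, and it calls fast_rand(), so its iterations can't simply run side by side.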
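To illustrate how differently the two loops in generate_music look to a vectorizer, here is a minimal sketch that isolates one pattern from each. The function names fill_silence and filter_recurrence are hypothetical; the recurrence itself is copied from the b1/a1 line of generate_music, with everything else stripped away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Independent iterations: element i never reads anything another
   iteration wrote, so many elements can be stored per instruction. */
void fill_silence(int16_t *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        buf[i] = 0;
}

/* Loop-carried dependency, same shape as
   b1=v1-a1+((b1*61+32)>>6);a1=v1; in generate_music:
   iteration i needs the a and b produced by iteration i-1. */
void filter_recurrence(const int32_t *in, int32_t *out, size_t n)
{
    int32_t a = 0, b = 0;
    assert((in != NULL && out != NULL) || n == 0);
    for (size_t i = 0; i < n; i++) {
        b = in[i] - a + ((b * 61 + 32) >> 6);  /* reads the previous b */
        a = in[i];                             /* becomes the next a */
        out[i] = b;
    }
}
```

Feeding {64, 64, 64} through filter_recurrence yields 64, 61, 58: each output can only be computed once the previous one is known, which is the kind of dependency that keeps a loop from being spread across vector lanes.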

With that in mind, it would be interesting to see if there are loops in the source code that didn't get converted to SVE2 instructions and ascertain why. One such example is in main, which I'll show the first part of for context:

int main(int _argc, char **_argv)
{
   int args=1;
   char * strtol_str=NULL;
   const char * oversion;
   const char * env_seed;
   int env_used;
   int num_encoders_to_fuzz=5;
   int num_setting_changes=40;

   env_used=0;
   env_seed=getenv("SEED");
   if(_argc>1)
    iseed=strtol(_argv[1], &strtol_str, 10);  /* the first input argument might be the seed */
   if(strtol_str!=NULL && strtol_str[0]=='\0')   /* iseed is a valid number */
    args++;
   else if(env_seed) {
    iseed=atoi(env_seed);
    env_used=1;
   }
   else iseed=(opus_uint32)time(NULL)^(((opus_uint32)getpid()&65535)<<16);
   Rw=Rz=iseed;

   while(args<_argc)
   {
    if(strcmp(_argv[args], "-fuzz")==0 && _argc==(args+3)) {
        num_encoders_to_fuzz=strtol(_argv[args+1], &strtol_str, 10);
        if(strtol_str[0]!='\0' || num_encoders_to_fuzz<=0) {
            print_usage(_argv);
            return EXIT_FAILURE;
        }
        num_setting_changes=strtol(_argv[args+2], &strtol_str, 10);
        if(strtol_str[0]!='\0' || num_setting_changes<=0) {
            print_usage(_argv);
            return EXIT_FAILURE;
        }
        args+=3;
    }
    else {
        print_usage(_argv);
        return EXIT_FAILURE;
    }
   }

The while loop here iterates through the command line arguments in _argv, and the logic within checks their validity. The correct way to call the encoding test is in the format ./test_opus_encode [<seed>] [-fuzz <num_encoders> <num_settings_per_encoder>]. Disassembled, the first section of the loop looks like this:

  4012f4:       97ffff7f        bl      4010f0 <strcmp@plt>
  4012f8:       350001e0        cbnz    w0, 401334 <main+0x134>
  4012fc:       11000e73        add     w19, w19, #0x3
  401300:       6b14027f        cmp     w19, w20
  401304:       54000181        b.ne    401334 <main+0x134>  // b.any

We can tell from the reference to <strcmp@plt> that this is where the first condition inside the loop is evaluated, with the string comparison between the current command line argument and "-fuzz" taking place. So why isn't this loop vectorized? Let's break it down.

while(args<_argc)
   {

args is initialized to 1. The while loop executes as long as args is less than _argc (the number of command line arguments provided when invoking the program).

    if(strcmp(_argv[args], "-fuzz")==0 && _argc==(args+3)) {

The first condition evaluated is if the argument is the string "-fuzz".

        num_encoders_to_fuzz=strtol(_argv[args+1], &strtol_str, 10);

If it is, and exactly two more arguments follow it (_argc equals args+3), the number of encoders to fuzz is set from the next argument and execution moves on to the next condition.

        if(strtol_str[0]!='\0' || num_encoders_to_fuzz<=0) {

If strtol_str[0] (the character following a number from the _argv[args+1] string that was just parsed) is not a null terminating character or the num_encoders_to_fuzz is less than or equal to zero - that is to say there are characters in the arguments when there should only be numbers at this point, or the number of encoders to fuzz was improperly set - then print the proper usage of the invocation arguments and exit.

        if(strtol_str[0]!='\0' || num_encoders_to_fuzz<=0) {
            print_usage(_argv);
            return EXIT_FAILURE;
        }
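The end-pointer check described above is a common strtol idiom, and it can be sketched on its own. The helper name parse_positive is hypothetical and not part of opus; it also adds one check the test program doesn't spell out (end == s, meaning no digits were consumed), although the <= 0 test already rejects that case because strtol returns 0 for it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: returns 1 and stores the value in *out when s is
   made up entirely of a positive decimal number, otherwise returns 0
   and leaves *out untouched. */
int parse_positive(const char *s, long *out)
{
    char *end = NULL;
    long v;

    assert(s != NULL && out != NULL);
    v = strtol(s, &end, 10);   /* end points at the first unparsed character */
    if (end == s               /* no digits were consumed at all */
        || *end != '\0'        /* trailing junk, e.g. "5x" */
        || v <= 0)             /* zero or negative is not a valid count */
        return 0;
    *out = v;
    return 1;
}
```

With this sketch, "5" and "40" are accepted, while "5x", "-3" and the empty string are rejected, which mirrors the cases where the test program prints its usage message and exits.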

Otherwise, continue evaluating the command line arguments and check if the num_setting_changes is set properly by the third argument using the same logic of the previous condition.

        num_setting_changes=strtol(_argv[args+2], &strtol_str, 10);
        if(strtol_str[0]!='\0' || num_setting_changes<=0) {
            print_usage(_argv);
            return EXIT_FAILURE;
        }

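If both values are valid, increment args by 3. If the first condition failed instead (the argument isn't "-fuzz", or the argument count is wrong), print the usage and exit.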

        args+=3;
    }
    else {
        print_usage(_argv);
        return EXIT_FAILURE;
    }

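Because the -fuzz branch is only entered when _argc equals args+3, the args increment at the end always makes the while condition evaluate false. All this to say: the loop body runs at most once, and it consists of calls to strcmp and strtol plus early returns rather than arithmetic over an array, so it makes sense that SVE2 instructions wouldn't apply here. There would be no benefit to vectorizing a loop that can only execute once.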

Conclusion

In conclusion, it's been interesting looking at how SVE2 optimization can benefit an open source library. This is a cool technology that will likely become widespread and should benefit data-heavy libraries such as this one in particular. I explored different ways to make use of it, through compiler intrinsics as well as autovectorization. Some attempts were challenging and less fruitful, while others seemed to find purchase and successfully vectorize parts of opus' encoding test. I broke down some code that was vectorized and some that wasn't, along with the reasons why, and compared the disassembled code to its source to see how the compiler implements SVE2 for us.

I hope my work can be useful to those interested in implementing SVE2 in their own projects, or to the maintainers of the opus project. The latter might find the tests that I couldn't get to pass with autovectorization to be a good place to start. Each failure is an "Illegal instruction" crash with exit status 132 (128 plus signal 4, SIGILL), which is what I'd expect if those test binaries ran directly on the host CPU rather than under qemu-aarch64; I couldn't determine how to apply the emulator in those cases. Doing so would likely cause all tests to pass and allow the entire library to take advantage of SVE2.

This project and this course at large have been very useful in changing my perspective on programming and allowed me to get much closer to the metal than I have before. It's cleared up many misconceptions about how computers treat data - to paraphrase my professor, "Your other teachers probably told you variables are stored in memory - they lied." This project and course have been full of little epiphanies like that that I think have been influential in refining my concept of programming and I'm glad I was able to have this experience before graduating. Thanks for reading.
