Adding SVE2 Support to an Open Source Library - Part III
gus
Posted on April 22, 2022
In my last post I ran into some snags at the end when building opus: some of the intrinsics I wrote for the file I modified produced errors, so I wasn't able to build and test the library. In this post I'm going to change tactics and try autovectorization to see if I can successfully build and test the library, after which I'll analyze the results.
First off, I'll start by clearing my work so far and downloading a fresh copy of the library. At this point I need to configure and build, but to prevent the NEON intrinsics from conflicting with the autovectorization I'm going to use, I'll need to turn off NEON support in the configure.ac file. I searched for mentions of intrinsics and turned them off, and then ran autogen.sh and configure to get the build configured. We can confirm intrinsics are now turned off by the output:
------------------------------------------------------------------------
opus 1.3.1-107-gccaaffa9-dirty: Automatic configuration OK.
Compiler support:
C99 var arrays: ................ yes
C99 lrintf: .................... yes
Use alloca: .................... no (using var arrays)
General configuration:
Floating point support: ........ yes
Fast float approximations: ..... no
Fixed point debugging: ......... no
Inline Assembly Optimizations: . No inline ASM for your platform, please send patches
External Assembly Optimizations:
Intrinsics Optimizations: ...... no
Run-time CPU detection: ........ no
Custom modes: .................. no
Assertion checking: ............ no
Hardening: ..................... yes
Fuzzing: ....................... no
Check ASM: ..................... no
API documentation: ............. yes
Extra programs: ................ yes
------------------------------------------------------------------------
Now, by substituting the CFLAGS mentioned in the last post (-O3 -march=armv8-a+sve2) into the makefile and taking care to run the build with the qemu-aarch64 argument, we can see that the build succeeds and 8 of the 14 tests pass:
FAIL: celt/tests/test_unit_cwrs32
./test-driver: line 107: 448983 Illegal instruction (core dumped) "$@" > $log_file 2>&1
FAIL: celt/tests/test_unit_dft
PASS: celt/tests/test_unit_entropy
PASS: celt/tests/test_unit_laplace
PASS: celt/tests/test_unit_mathops
./test-driver: line 107: 449031 Illegal instruction (core dumped) "$@" > $log_file 2>&1
FAIL: celt/tests/test_unit_mdct
./test-driver: line 107: 449046 Illegal instruction (core dumped) "$@" > $log_file 2>&1
FAIL: celt/tests/test_unit_rotation
PASS: celt/tests/test_unit_types
./test-driver: line 107: 449072 Illegal instruction (core dumped) "$@" > $log_file 2>&1
FAIL: silk/tests/test_unit_LPC_inv_pred_gain
PASS: tests/test_opus_api
PASS: tests/test_opus_decode
PASS: tests/test_opus_encode
PASS: tests/test_opus_padding
./test-driver: line 107: 449716 Illegal instruction (core dumped) "$@" > $log_file 2>&1
FAIL: tests/test_opus_projection
======================================================
opus 1.3.1-107-gccaaffa9-dirty: ./test-suite.log
======================================================
# TOTAL: 14
# PASS: 8
# SKIP: 0
# XFAIL: 0
# FAIL: 6
# XPASS: 0
# ERROR: 0
.. contents:: :depth: 2
FAIL: celt/tests/test_unit_cwrs32
=================================
FAIL celt/tests/test_unit_cwrs32 (exit status: 132)
FAIL: celt/tests/test_unit_dft
==============================
FAIL celt/tests/test_unit_dft (exit status: 132)
FAIL: celt/tests/test_unit_mdct
===============================
FAIL celt/tests/test_unit_mdct (exit status: 132)
FAIL: celt/tests/test_unit_rotation
===================================
FAIL celt/tests/test_unit_rotation (exit status: 132)
FAIL: silk/tests/test_unit_LPC_inv_pred_gain
============================================
FAIL silk/tests/test_unit_LPC_inv_pred_gain (exit status: 132)
FAIL: tests/test_opus_projection
================================
FAIL tests/test_opus_projection (exit status: 132)
============================================================================
Testsuite summary for opus 1.3.1-107-gccaaffa9-dirty
============================================================================
# TOTAL: 14
# PASS: 8
# SKIP: 0
# XFAIL: 0
# FAIL: 6
# XPASS: 0
# ERROR: 0
============================================================================
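The exit status of 132 reported for each failing test lines up with the "Illegal instruction" messages in the log. Shells conventionally report a process killed by signal N as status 128 + N, and SIGILL is signal 4 on Linux. A minimal sketch of that arithmetic, where the helper name shell_status_for_signal is made up for illustration:

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical helper: shells conventionally report a process that was
   killed by signal N with an exit status of 128 + N. */
static int shell_status_for_signal(int signo)
{
    assert(signo > 0);
    return 128 + signo;
}
```

Calling shell_status_for_signal(SIGILL) gives the status seen in test-suite.log, which fits the picture of the test binaries hitting an instruction the CPU running them didn't recognize.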
Let's take a closer look at one of the tests that passed with SVE2 enabled:
Running Opus Encode Test
./test_opus_encode
Testing libopus 1.3.1-107-gccaaffa9-dirty encoder. Random seed: 3135156945 (95E3)
Running simple tests for bugs that have been fixed previously
Encode+Decode tests.
Mode LP FB encode VBR, 11318 bps OK.
Mode LP FB encode VBR, 14930 bps OK.
Mode LP FB encode VBR, 67659 bps OK.
Mode Hybrid FB encode VBR, 17712 bps OK.
Mode Hybrid FB encode VBR, 51200 bps OK.
Mode Hybrid FB encode VBR, 80954 bps OK.
Mode Hybrid FB encode VBR, 127480 bps OK.
Mode MDCT FB encode VBR, 752629 bps OK.
Mode MDCT FB encode VBR, 25609 bps OK.
Mode MDCT FB encode VBR, 33107 bps OK.
Mode MDCT FB encode VBR, 78592 bps OK.
Mode MDCT FB encode VBR, 73157 bps OK.
Mode MDCT FB encode VBR, 137477 bps OK.
Mode LP FB encode CVBR, 11480 bps OK.
Mode LP FB encode CVBR, 21257 bps OK.
Mode LP FB encode CVBR, 63201 bps OK.
Mode Hybrid FB encode CVBR, 25583 bps OK.
Mode Hybrid FB encode CVBR, 36126 bps OK.
Mode Hybrid FB encode CVBR, 54107 bps OK.
Mode Hybrid FB encode CVBR, 108482 bps OK.
Mode MDCT FB encode CVBR, 934758 bps OK.
Mode MDCT FB encode CVBR, 25111 bps OK.
Mode MDCT FB encode CVBR, 33929 bps OK.
Mode MDCT FB encode CVBR, 52270 bps OK.
Mode MDCT FB encode CVBR, 79059 bps OK.
Mode MDCT FB encode CVBR, 117366 bps OK.
Mode LP FB encode CBR, 7432 bps OK.
Mode LP FB encode CBR, 16781 bps OK.
Mode LP FB encode CBR, 90950 bps OK.
Mode Hybrid FB encode CBR, 18257 bps OK.
Mode Hybrid FB encode CBR, 37925 bps OK.
Mode Hybrid FB encode CBR, 56473 bps OK.
Mode Hybrid FB encode CBR, 78233 bps OK.
Mode MDCT FB encode CBR, 780220 bps OK.
Mode MDCT FB encode CBR, 20668 bps OK.
Mode MDCT FB encode CBR, 38398 bps OK.
Mode MDCT FB encode CBR, 74376 bps OK.
Mode MDCT FB encode CBR, 68468 bps OK.
Mode MDCT FB encode CBR, 141108 bps OK.
Mode LP NB dual-mono MS encode VBR, 4884 bps OK.
Mode LP NB dual-mono MS encode VBR, 18110 bps OK.
Mode LP NB dual-mono MS encode VBR, 44628 bps OK.
Mode LP NB dual-mono MS encode VBR, 15245 bps OK.
Mode LP NB dual-mono MS encode VBR, 26620 bps OK.
Mode LP NB dual-mono MS encode VBR, 61885 bps OK.
Mode LP NB dual-mono MS encode VBR, 86977 bps OK.
Mode LP NB dual-mono MS encode VBR, 119885 bps OK.
Mode MDCT NB dual-mono MS encode VBR, 7123 bps OK.
Mode MDCT NB dual-mono MS encode VBR, 19106 bps OK.
Mode MDCT NB dual-mono MS encode VBR, 41453 bps OK.
Mode MDCT NB dual-mono MS encode VBR, 10135 bps OK.
Mode MDCT NB dual-mono MS encode VBR, 19040 bps OK.
Mode MDCT NB dual-mono MS encode VBR, 57693 bps OK.
Mode MDCT NB dual-mono MS encode VBR, 77731 bps OK.
Mode MDCT NB dual-mono MS encode VBR, 165272 bps OK.
Mode LP NB dual-mono MS encode CVBR, 7245 bps OK.
Mode LP NB dual-mono MS encode CVBR, 16460 bps OK.
Mode LP NB dual-mono MS encode CVBR, 56065 bps OK.
Mode LP NB dual-mono MS encode CVBR, 13411 bps OK.
Mode LP NB dual-mono MS encode CVBR, 28783 bps OK.
Mode LP NB dual-mono MS encode CVBR, 61638 bps OK.
Mode LP NB dual-mono MS encode CVBR, 92219 bps OK.
Mode LP NB dual-mono MS encode CVBR, 110936 bps OK.
Mode MDCT NB dual-mono MS encode CVBR, 4047 bps OK.
Mode MDCT NB dual-mono MS encode CVBR, 21622 bps OK.
Mode MDCT NB dual-mono MS encode CVBR, 43253 bps OK.
Mode MDCT NB dual-mono MS encode CVBR, 12557 bps OK.
Mode MDCT NB dual-mono MS encode CVBR, 28091 bps OK.
Mode MDCT NB dual-mono MS encode CVBR, 57473 bps OK.
Mode MDCT NB dual-mono MS encode CVBR, 77203 bps OK.
Mode MDCT NB dual-mono MS encode CVBR, 154714 bps OK.
Mode LP NB dual-mono MS encode CBR, 4000 bps OK.
Mode LP NB dual-mono MS encode CBR, 12396 bps OK.
Mode LP NB dual-mono MS encode CBR, 56699 bps OK.
Mode LP NB dual-mono MS encode CBR, 10327 bps OK.
Mode LP NB dual-mono MS encode CBR, 19576 bps OK.
Mode LP NB dual-mono MS encode CBR, 36651 bps OK.
Mode LP NB dual-mono MS encode CBR, 50625 bps OK.
Mode LP NB dual-mono MS encode CBR, 122376 bps OK.
Mode MDCT NB dual-mono MS encode CBR, 4916 bps OK.
Mode MDCT NB dual-mono MS encode CBR, 14647 bps OK.
Mode MDCT NB dual-mono MS encode CBR, 55741 bps OK.
Mode MDCT NB dual-mono MS encode CBR, 12307 bps OK.
Mode MDCT NB dual-mono MS encode CBR, 23408 bps OK.
Mode MDCT NB dual-mono MS encode CBR, 62311 bps OK.
Mode MDCT NB dual-mono MS encode CBR, 54876 bps OK.
Mode MDCT NB dual-mono MS encode CBR, 104358 bps OK.
All framesize pairs switching encode, 9810 frames OK.
Running fuzz_encoder_settings with 5 encoder(s) and 40 setting change(s) each.
Tests completed successfully.
Now we can inspect the encoding program and see how it makes use of SVE2 instructions.
find . -type f -executable -print | while read X ; do echo ======== $X ; objdump -d $X | grep whilelo ; done
The lines in question are too numerous to put here but the files affected are:
======== ./tests/test_opus_projection
======== ./tests/.libs/test_opus_encode
======== ./tests/.libs/test_opus_api
======== ./tests/.libs/test_opus_decode
======== ./celt/tests/test_unit_entropy
======== ./celt/tests/test_unit_cwrs32
======== ./celt/tests/test_unit_mathops
======== ./celt/tests/test_unit_rotation
======== ./celt/tests/test_unit_dft
======== ./celt/tests/test_unit_mdct
======== ./.libs/opus_demo
======== ./.libs/libopus.so.0.8.0
======== ./.libs/trivial_example
======== ./opus_compare
======== ./silk/tests/test_unit_LPC_inv_pred_gain
And a line count with find . -type f -executable -print | while read X ; do echo ======== $X ; objdump -d $X 2> /dev/null | grep whilelo ; done | wc -l returns 2903 instances of whilelo. I'll zero in on one of these files to see how it makes use of its SVE2 instructions.
Analyzing Opus Encode Test
I'll go back to the encode test I ran before and take a look at how it's using its SVE2 instructions now.
objdump -d test_opus_encode > ~/opus_encode_objdump
Searching through the output, I can find 6 instances of whilelo at play here, the first 2 being in this <generate_music> section.
00000000004016b0 <generate_music>:
4016b0: d2800002 mov x2, #0x0 // #0
4016b4: d282d003 mov x3, #0x1680 // #5760
4016b8: 2538c000 mov z0.b, #0
4016bc: 25631fe0 whilelo p0.h, xzr, x3
4016c0: e4a24000 st1h {z0.h}, p0, [x0, x2, lsl #1]
4016c4: 0470e3e2 inch x2
4016c8: 25631c40 whilelo p0.h, x2, x3
4016cc: 54ffffa1 b.ne 4016c0 <generate_music+0x10> // b.any
4016d0: 712d003f cmp w1, #0xb40
4016d4: 54000e4d b.le 40189c <generate_music+0x1ec>
4016d8: a9bb7bfd stp x29, x30, [sp, #-80]!
4016dc: f000017e adrp x30, 430000 <memcpy@GLIBC_2.17>
4016e0: 910593de add x30, x30, #0x164
4016e4: 910003fd mov x29, sp
4016e8: a90153f3 stp x19, x20, [sp, #16]
4016ec: d285a002 mov x2, #0x2d00 // #11520
4016f0: 52955571 mov w17, #0xaaab // #43691
4016f4: 294093d4 ldp w20, w4, [x30, #4]
4016f8: 52955550 mov w16, #0xaaaa // #43690
4016fc: 8b020002 add x2, x0, x2
401700: 52800006 mov w6, #0x0 // #0
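So let's break down what it's doing here. whilelo is not itself a loop: it builds a predicate, a per-lane mask that controls which vector lanes are active. It takes the scalable predicate register p0.h as its first argument (the destination register) and marks each halfword lane active while the second argument, here the zero register xzr, plus that lane's index is lower than the value in register x3 (5760).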
4016bc: 25631fe0 whilelo p0.h, xzr, x3
With that predicate in place, the program performs a st1h, a contiguous store of halfwords from a vector using a scalar index. Here it writes the zeros held in z0 to the active lanes, starting x2 halfwords past the address in x0.
4016c0: e4a24000 st1h {z0.h}, p0, [x0, x2, lsl #1]
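It then increments x2 by the number of halfword lanes in a vector, after which the second whilelo recomputes the predicate from x2 and b.ne branches back to the store as long as any lane is still active.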
4016c4: 0470e3e2 inch x2
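To make the predicate mechanics concrete, here is a plain C emulation of the four-instruction loop above. It is only a sketch: the SVE vector length is chosen by the hardware, so the LANES value of 8 (a 128-bit vector of halfwords) is an assumption, and the function names whilelo_h and zero_fill_predicated are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed lane count: SVE vector length is implementation-defined, so
   8 halfword lanes (a 128-bit vector) is only a stand-in here. */
#define LANES 8

/* Emulates `whilelo p.h, base, limit`: lane k is active while
   base + k < limit. Returns nonzero if any lane is active, which is
   the condition the b.ne (b.any) branch tests. */
static int whilelo_h(int pred[LANES], uint64_t base, uint64_t limit)
{
    int any = 0;
    for (int k = 0; k < LANES; k++) {
        pred[k] = (base + (uint64_t)k < limit);
        any |= pred[k];
    }
    return any;
}

/* Emulates the zero-fill loop: stores 0 to buf[0..count-1], LANES
   elements per step. Returns the number of vector-sized steps taken. */
static size_t zero_fill_predicated(int16_t *buf, uint64_t count)
{
    int pred[LANES];
    uint64_t i = 0;                        /* x2 in the disassembly */
    size_t steps = 0;
    assert(buf != NULL || count == 0);

    int any = whilelo_h(pred, 0, count);   /* whilelo p0.h, xzr, x3 */
    while (any) {
        for (int k = 0; k < LANES; k++)    /* st1h {z0.h}, p0, [x0, x2, lsl #1] */
            if (pred[k])
                buf[i + (uint64_t)k] = 0;  /* inactive lanes store nothing */
        i += LANES;                        /* inch x2 */
        steps++;
        any = whilelo_h(pred, i, count);   /* whilelo p0.h, x2, x3 */
    }
    return steps;
}
```

With a count that isn't a multiple of LANES, the final step has only some lanes active, so the loop never writes past the end of the buffer and needs no separate scalar tail. That is the main convenience of predicate-driven loops like this one.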
While this helps us understand the mechanics of what's being called and why, what function does this serve in the program? The source code can give us some clues in a language that's easier to parse:
/* Generate input data */
inbuf = (opus_int16*)malloc(sizeof(*inbuf)*SSAMPLES);
generate_music(inbuf, SSAMPLES/2);
We can see here that generate_music is a function that, much like the vol_createsample function in lab 5, creates dummy data to operate on and test the encoding utility. Looking at the function definition in full:
void generate_music(short *buf, opus_int32 len)
{
opus_int32 a1,b1,a2,b2;
opus_int32 c1,c2,d1,d2;
opus_int32 i,j;
a1=b1=a2=b2=0;
c1=c2=d1=d2=0;
j=0;
/*60ms silence*/
for(i=0;i<2880;i++)buf[i*2]=buf[i*2+1]=0;
for(i=2880;i<len;i++)
{
opus_uint32 r;
opus_int32 v1,v2;
v1=v2=(((j*((j>>12)^((j>>10|j>>12)&26&j>>7)))&128)+128)<<15;
r=fast_rand();v1+=r&65535;v1-=r>>16;
r=fast_rand();v2+=r&65535;v2-=r>>16;
b1=v1-a1+((b1*61+32)>>6);a1=v1;
b2=v2-a2+((b2*61+32)>>6);a2=v2;
c1=(30*(c1+b1+d1)+32)>>6;d1=b1;
c2=(30*(c2+b2+d2)+32)>>6;d2=b2;
v1=(c1+128)>>8;
v2=(c2+128)>>8;
buf[i*2]=v1>32767?32767:(v1<-32768?-32768:v1);
buf[i*2+1]=v2>32767?32767:(v2<-32768?-32768:v2);
if(i%6==0)j++;
}
}
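We can see that the entire function is essentially two loops, so it makes sense that the compiler would be able to take advantage of whilelo to squeeze some more performance out of it. The two whilelo instructions in the disassembly above belong to the first loop, the one that writes the silence: 2880 iterations storing two values each comes to 5760 halfwords, which matches the #5760 loaded into x3. Using SIMD in this way allows multiple iterations of that loop to run simultaneously, which should speed it up considerably. The second loop is a harder target, since each iteration depends on values computed by the previous one (a1, b1, c1 and so on).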
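One way to see why the silence loop maps so neatly onto a single run of vector stores is that its interleaved writes cover every index from 0 to 5759 exactly once. The sketch below compares the loop as written in generate_music with a flat zero-fill over the same range. The function names and the SILENCE_FRAMES macro are made up for this illustration and are not part of opus.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical name for the 2880 constant used in generate_music. */
#define SILENCE_FRAMES 2880

/* The loop as written in generate_music: two stores per iteration. */
static void silence_interleaved(short *buf)
{
    int i;
    assert(buf != NULL);
    for (i = 0; i < SILENCE_FRAMES; i++) buf[i*2] = buf[i*2+1] = 0;
}

/* The same effect written as one contiguous fill of 2 * 2880 elements,
   which is the shape the whilelo/st1h loop in the disassembly has. */
static void silence_flat(short *buf)
{
    int i;
    assert(buf != NULL);
    for (i = 0; i < 2 * SILENCE_FRAMES; i++) buf[i] = 0;
}
```

Both functions leave a buffer in the same state, and 2 * SILENCE_FRAMES is 5760, or 0x1680, the loop bound that appears in the disassembly.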
With that in mind, it would be interesting to see if there are loops in the source code that didn't get converted to SVE2 instructions and ascertain why. One such example is in main, which I'll show the first part of for context:
int main(int _argc, char **_argv)
{
int args=1;
char * strtol_str=NULL;
const char * oversion;
const char * env_seed;
int env_used;
int num_encoders_to_fuzz=5;
int num_setting_changes=40;
env_used=0;
env_seed=getenv("SEED");
if(_argc>1)
iseed=strtol(_argv[1], &strtol_str, 10); /* the first input argument might be the seed */
if(strtol_str!=NULL && strtol_str[0]=='\0') /* iseed is a valid number */
args++;
else if(env_seed) {
iseed=atoi(env_seed);
env_used=1;
}
else iseed=(opus_uint32)time(NULL)^(((opus_uint32)getpid()&65535)<<16);
Rw=Rz=iseed;
while(args<_argc)
{
if(strcmp(_argv[args], "-fuzz")==0 && _argc==(args+3)) {
num_encoders_to_fuzz=strtol(_argv[args+1], &strtol_str, 10);
if(strtol_str[0]!='\0' || num_encoders_to_fuzz<=0) {
print_usage(_argv);
return EXIT_FAILURE;
}
num_setting_changes=strtol(_argv[args+2], &strtol_str, 10);
if(strtol_str[0]!='\0' || num_setting_changes<=0) {
print_usage(_argv);
return EXIT_FAILURE;
}
args+=3;
}
else {
print_usage(_argv);
return EXIT_FAILURE;
}
}
The while loop here iterates through the command line arguments in _argv, and the logic within checks the validity of the arguments. The correct way to call the encoding test is in the format ./test_opus_encode [<seed>] [-fuzz <num_encoders> <num_settings_per_encoder>]. Disassembled, the first loop section looks like this:
4012f4: 97ffff7f bl 4010f0 <strcmp@plt>
4012f8: 350001e0 cbnz w0, 401334 <main+0x134>
4012fc: 11000e73 add w19, w19, #0x3
401300: 6b14027f cmp w19, w20
401304: 54000181 b.ne 401334 <main+0x134> // b.any
We can tell from the reference to <strcmp@plt> that this is where the loop's first condition is evaluated, with the string comparison between the current command line argument and "-fuzz" taking place. So why isn't this loop vectorized? Let's break it down.
while(args<_argc)
{
args is initialized to 1, and is bumped to 2 if a seed was given as the first argument. The while loop executes as long as args is less than _argc (_argc is the number of command line arguments provided when invoking the program).
if(strcmp(_argv[args], "-fuzz")==0 && _argc==(args+3)) {
The first condition evaluated is if the argument is the string "-fuzz".
num_encoders_to_fuzz=strtol(_argv[args+1], &strtol_str, 10);
If it is, and exactly three arguments remain (-fuzz plus its two values), the number of encoders to fuzz is set from the next argument and execution moves on to the next condition.
if(strtol_str[0]!='\0' || num_encoders_to_fuzz<=0) {
If strtol_str[0] (the first character that strtol could not parse in the _argv[args+1] string) is not a null terminator, or num_encoders_to_fuzz is less than or equal to zero, that is to say there are other characters in the argument when there should only be digits, or the number of encoders to fuzz was improperly set, then print the proper usage of the invocation arguments and exit.
if(strtol_str[0]!='\0' || num_encoders_to_fuzz<=0) {
print_usage(_argv);
return EXIT_FAILURE;
}
Otherwise, continue evaluating the command line arguments and check whether num_setting_changes is set properly by the following argument, using the same logic as the previous condition.
num_setting_changes=strtol(_argv[args+2], &strtol_str, 10);
if(strtol_str[0]!='\0' || num_setting_changes<=0) {
print_usage(_argv);
return EXIT_FAILURE;
}
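If both values are valid, increment args by 3. If the outer condition failed instead (the argument wasn't "-fuzz", or the argument count was wrong), print the usage and exit.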
args+=3;
}
else {
print_usage(_argv);
return EXIT_FAILURE;
}
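The args increment at the end makes the while condition evaluate false, since args now equals _argc. All this to say: the loop body completes at most once, so it makes sense that SVE2 instructions wouldn't apply here. There would be no benefit to vectorizing a loop that can only execute once, and its body is made up of library calls (strcmp, strtol) and early returns, which is not the kind of code an autovectorizer can turn into vector instructions.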
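The "at most once" claim can be checked with a stripped-down copy of the loop that counts its completed iterations. This is a simplified stand-in rather than the code from test_opus_encode: the function name count_arg_loop_iterations is invented, and returning -1 replaces the print_usage and EXIT_FAILURE paths.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified, hypothetical stand-in for the argument loop in main.
   Returns the number of completed iterations, or -1 where the original
   would print the usage message and exit. */
static int count_arg_loop_iterations(int argc, char **argv, int args)
{
    int iterations = 0;
    char *end = NULL;
    assert(args >= 1);

    while (args < argc) {
        if (strcmp(argv[args], "-fuzz") == 0 && argc == (args + 3)) {
            long n = strtol(argv[args + 1], &end, 10);
            if (end[0] != '\0' || n <= 0) return -1;
            n = strtol(argv[args + 2], &end, 10);
            if (end[0] != '\0' || n <= 0) return -1;
            args += 3;      /* args now equals argc, so the loop ends */
            iterations++;
        } else {
            return -1;
        }
    }
    return iterations;
}
```

For a valid invocation such as "prog -fuzz 5 40" with args starting at 1, the function returns 1. With no extra arguments it returns 0, and any malformed input takes one of the early exits before a second iteration could begin.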
Conclusion
In conclusion, it's been interesting looking at how SVE2 optimization can benefit an open source library. This is a cool technology that will no doubt become pervasive very quickly and have widespread benefits, especially for large data processing libraries such as this. I explored some different ways to make use of it, through compiler intrinsics as well as autovectorization. Some attempts were challenging and less fruitful, while others seemed to find purchase and successfully optimize opus' encoding functionality. I broke down some code that was optimized and some that wasn't, along with the reasons why, and took a closer look at the disassembled code compared to its source to see how the compiler implements SVE2 for us.
I hope my work can be useful to those interested in implementing SVE2 in their own projects, or to the maintainers of the opus project. The latter might find the tests that I couldn't get to pass with autovectorization a good place to start. The "Illegal instruction (core dumped)" message suggests that those tests ran directly on the host CPU rather than under qemu-aarch64, as I couldn't determine how to apply it in those cases. Running them under the emulator would likely let them pass and allow the entire library to take advantage of SVE2.
This project and this course at large have been very useful in changing my perspective on programming and allowed me to get much closer to the metal than I have before. It's cleared up many misconceptions about how computers treat data - to paraphrase my professor, "Your other teachers probably told you variables are stored in memory - they lied." This project and course have been full of little epiphanies like that that I think have been influential in refining my concept of programming and I'm glad I was able to have this experience before graduating. Thanks for reading.