SVE2 Implementation for Opus Codec Library Analysis
Seung Woo (Paul) Ji
Posted on April 22, 2022
Introduction
Previously, we successfully added SVE2 code to the Opus codec library by using the compiler's auto-vectorization. In this post, we will analyze the result to check that the SVE2 code works correctly and to estimate its possible impact on the software's performance.
SVE2 Code Analysis
As we explored in the previous post, the compiler auto-vectorized many parts of the package. Let's take a look at one of them to see where SVE2 code is used.
The Opus codec uses CELT as one of its ways to encode and decode audio. In the opus/celt directory, we can see the following files:
$ ls
arch.h celt.o entenc.o mdct.c quant_bands.lo
arm cpu_support.h fixed_c5x.h mdct.h quant_bands.o
bands.c cwrs.c fixed_c6x.h mdct.lo rate.c
bands.h cwrs.h fixed_debug.h mdct.o rate.h
bands.lo cwrs.lo fixed_generic.h meson.build rate.lo
bands.o cwrs.o float_cast.h mfrngcod.h rate.o
celt.c dump_modes kiss_fft.c mips stack_alloc.h
celt_decoder.c ecintrin.h _kiss_fft_guts.h modes.c static_modes_fixed_arm_ne10.h
celt_decoder.lo entcode.c kiss_fft.h modes.h static_modes_fixed.h
celt_decoder.o entcode.h kiss_fft.lo modes.lo static_modes_float_arm_ne10.h
celt_encoder.c entcode.lo kiss_fft.o modes.o static_modes_float.h
celt_encoder.lo entcode.o laplace.c opus_custom_demo.c tests
celt_encoder.o entdec.c laplace.h os_support.h vq.c
celt.h entdec.h laplace.lo pitch.c vq.h
celt.lo entdec.lo laplace.o pitch.h vq.lo
celt_lpc.c entdec.o mathops.c pitch.lo vq.o
celt_lpc.h entenc.c mathops.h pitch.o x86
celt_lpc.lo entenc.h mathops.lo quant_bands.c
celt_lpc.o entenc.lo mathops.o quant_bands.h
The celt_encoder.c file contains many for loops that may benefit from SVE2. The following code is one example:
// celt_encoder.c
// ...
1100 /* For non-transient CBR/CVBR frames, halve the dynalloc contribution */
1101 if ((!vbr || constrained_vbr)&&!isTransient)
1102 {
1103 for (i=start;i<end;i++)
1104 follower[i] = HALF16(follower[i]);
1105 }
1106 for (i=start;i<end;i++)
1107 {
1108 if (i<8)
1109 follower[i] *= 2;
1110 if (i>=12)
1111 follower[i] = HALF16(follower[i]);
// ...
In the code, we can see loops that iterate from start to end. Depending on the value of i, the i-th element of the follower array is either halved or multiplied by two. The loops involve no complex logic and apply the same simple operation to each element, so they are good candidates for auto-vectorization by the compiler.
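To make the access pattern concrete, here is a minimal standalone sketch of the quoted lines. It assumes the floating-point build, where HALF16(x) reduces to a multiplication by 0.5. The function name adjust_follower, its parameter list, and the halve_all flag are hypothetical and do not exist in Opus; the rest of the original loop body is elided in the excerpt and is not reproduced here.

```c
#include <assert.h>

/* Assumption: float build, where HALF16 is a multiply by 0.5. */
#define HALF16(x) (0.5f * (x))

/* Hypothetical wrapper around the quoted lines 1101-1111.
   halve_all stands in for ((!vbr || constrained_vbr) && !isTransient). */
void adjust_follower(float *follower, int start, int end, int halve_all)
{
    int i;
    assert(start <= end);
    if (halve_all) {
        /* Same operation on every element: easy to vectorize. */
        for (i = start; i < end; i++)
            follower[i] = HALF16(follower[i]);
    }
    for (i = start; i < end; i++) {
        /* The branches depend only on the index, not on the data. */
        if (i < 8)
            follower[i] *= 2;
        if (i >= 12)
            follower[i] = HALF16(follower[i]);
    }
}
```

Each iteration touches only follower[i], so no iteration depends on another. That independence is what lets the compiler process several elements at once.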
As expected, when we disassemble celt_encoder.o, it contains multiple SVE-specific whilelo instructions.
$ objdump -d celt_encoder.o | grep whilelo
174: 25a30fe0 whilelo p0.s, wzr, w3
198: 25a30c00 whilelo p0.s, w0, w3
1e8: 25b40fe0 whilelo p0.s, wzr, w20
200: 25b40c00 whilelo p0.s, w0, w20
418: 25bc0fe0 whilelo p0.s, wzr, w28
430: 25bc0c00 whilelo p0.s, w0, w28
498: 25bc0fe0 whilelo p0.s, wzr, w28
4b0: 25bc0c20 whilelo p0.s, w1, w28
# ...
57ac: 25a10c00 whilelo p0.s, w0, w1
5844: 25a10fe0 whilelo p0.s, wzr, w1
585c: 25a10c00 whilelo p0.s, w0, w1
5ae0: 25a10fe0 whilelo p0.s, wzr, w1
5b00: 25a10c00 whilelo p0.s, w0, w1
5ea8: 25a10fe0 whilelo p0.s, wzr, w1
5ebc: 25a10c00 whilelo p0.s, w0, w1
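The whilelo instructions come in pairs because of how a predicated loop typically works: the first one (starting from wzr, that is, zero) builds the initial predicate, and the second one recomputes it after the index advances. The following is a scalar C model of that idea, not the code the compiler generated. The names whilelo_model and halve_all_model are hypothetical, and LANES is fixed at 4 only for illustration, since the real vector length is chosen by the hardware.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* assumption: 128-bit vectors, four 32-bit lanes */

/* Scalar model of whilelo: lane k is active while base + k < limit
   (unsigned comparison). */
void whilelo_model(unsigned base, unsigned limit, int pred[LANES])
{
    unsigned k;
    for (k = 0; k < LANES; k++)
        pred[k] = ((unsigned long long)base + k) < limit;
}

/* Predicated loop: halve n floats, LANES at a time, without a scalar tail.
   For simplicity the model assumes n is far below UINT_MAX. */
void halve_all_model(float *x, unsigned n)
{
    int pred[LANES];
    unsigned i = 0, k;
    assert(x != NULL || n == 0);
    whilelo_model(i, n, pred);            /* like: whilelo p0.s, wzr, w3 */
    while (pred[0]) {                     /* continue while the first lane is active */
        for (k = 0; k < LANES; k++)
            if (pred[k])                  /* inactive lanes are left untouched */
                x[i + k] *= 0.5f;
        i += LANES;                       /* advance by one vector */
        whilelo_model(i, n, pred);        /* like: whilelo p0.s, w0, w3 */
    }
}
```

Because the predicate masks off the lanes past the end of the array, the same loop handles any length, including lengths that are not a multiple of the vector size.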
However, this only shows that celt_encoder.o contains SVE instructions somewhere. How can we know whether the code we are interested in uses them?
Let's look at this from a different angle: how does the compiler determine whether code is suitable for auto-vectorization? To find out, we can add an extra option to CFLAGS when we run the configure script.
$ ./configure CFLAGS="-g -O3 -fopt-info-vec-all -march=armv8-a+sve2 -fvisibility=hidden -D_FORTIFY_SOURCE=2 -W -Wall -Wextra -Wcast-align -Wnested-externs -Wshadow -Wstrict-prototypes"
$ make -j24 |& tee make.log
The -fopt-info option adds optimization diagnostics to the compiler output. We specifically ask for all information about vectorization by appending -vec-all. When we compile the package again using make, this option tells us why the compiler did (or did not) vectorize each loop.
Once we run the make command as above, we have a make.log file that contains all the information we need.
$ ll make.log
-rw-r--r--. 1 swji1 swji1 2831714 Apr 22 14:21 make.log
Let's narrow down the result by searching only for messages about celt/celt_encoder.c:
$ grep "celt/celt_encoder" make.log
celt/celt_encoder.c:1810:22: optimized: loop vectorized using variable length vectors
celt/celt_encoder.c:1780:16: optimized: loop vectorized using variable length vectors
celt/celt_encoder.c:1780:16: optimized: loop vectorized using variable length vectors
celt/celt_encoder.c:1778:40: missed: couldn't vectorize loop
celt/celt_encoder.c:1778:40: missed: not vectorized: number of iterations cannot be computed.
celt/celt_encoder.c:1756:17: missed: couldn't vectorize loop
celt/celt_encoder.c:1761:20: missed: not vectorized: complicated access pattern.
The log shows which loops were vectorized and which were not. Let's check whether the loop at line 1106, which we examined above, was vectorized as well.
$ grep "celt/celt_encoder.c:1106" make.log
celt/celt_encoder.c:1106:21: celt/pitch.h:143:14: optimized: loop vectorized using variable length vectors
celt/celt_encoder.c:1106:21: optimized: loop vectorized using variable length vectors
As we expected, the loop is vectorized by the compiler.
Now, we may wonder which code the compiler could not auto-vectorize, and why. Let's take a look at one example.
celt/celt_encoder.c:1922:39: missed: not vectorized: complicated access pattern.
// celt_encoder.c
// ...
do {
1915 for (i=start;i<end;i++)
1916 {
1917 /* When the energy is stable, slightly bias energy quantization towards
1918 the previous error to make the gain more stable (a constant offset is
1919 better than fluctuations). */
1920 if (ABS32(SUB32(bandLogE[i+c*nbEBands], oldBandE[i+c*nbEBands])) < QCONST16(2.f, DB_SHIFT))
1921 {
1922 bandLogE[i+c*nbEBands] -= MULT16_16_Q15(energyError[i+c*nbEBands], QCONST16(0.25f, 15));
1923 }
1924 }
1925 } while (++c < C);
// ...
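The if statement inside the loop reads bandLogE and oldBandE at index i+c*nbEBands, and line 1922 then conditionally updates bandLogE using energyError at the same index. The compiler reports this statement as a complicated access pattern, which means it could not find a way to turn these memory accesses into vector loads and stores, so the loop stays scalar.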
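The loop quoted above can be isolated into a small test case for experimenting with -fopt-info-vec-all. The sketch below assumes the floating-point build and replaces the Opus macros with simple stand-ins; the two function names are hypothetical. The second function is one possible rewrite (per-channel base pointers and an unconditional store) that computes the same result. Whether GCC actually vectorizes it has not been verified here.

```c
#include <assert.h>

/* Assumption: simplified float-build stand-ins for the Opus macros. */
#define ABS32(x)            ((x) < 0 ? -(x) : (x))
#define SUB32(a, b)         ((a) - (b))
#define QCONST16(x, bits)   (x)
#define MULT16_16_Q15(a, b) ((a) * (b))
#define DB_SHIFT            10

/* Hypothetical wrapper around the quoted lines 1915-1925. */
void bias_energy_original(float *bandLogE, const float *oldBandE,
                          const float *energyError,
                          int start, int end, int C, int nbEBands)
{
    int i, c = 0;
    assert(C >= 1);
    do {
        for (i = start; i < end; i++) {
            if (ABS32(SUB32(bandLogE[i+c*nbEBands], oldBandE[i+c*nbEBands]))
                    < QCONST16(2.f, DB_SHIFT))
                bandLogE[i+c*nbEBands] -=
                    MULT16_16_Q15(energyError[i+c*nbEBands], QCONST16(0.25f, 15));
        }
    } while (++c < C);
}

/* Hypothetical variant: hoist the channel offset and always store. */
void bias_energy_variant(float *bandLogE, const float *oldBandE,
                         const float *energyError,
                         int start, int end, int C, int nbEBands)
{
    int i, c;
    assert(C >= 1);
    for (c = 0; c < C; c++) {
        float *log_e = bandLogE + c * nbEBands;
        const float *old_e = oldBandE + c * nbEBands;
        const float *err = energyError + c * nbEBands;
        for (i = start; i < end; i++) {
            float diff = log_e[i] - old_e[i];
            /* Subtracting 0 when the energy is not stable keeps the value. */
            float delta = (ABS32(diff) < 2.f) ? 0.25f * err[i] : 0.f;
            log_e[i] -= delta;
        }
    }
}
```

Note that the variant reads energyError for every band, not only for the stable ones, so it is only equivalent when the whole array holds valid values.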
Performance Prediction
Unfortunately, we cannot benchmark the package at the moment because we lack hardware that supports SVE2. However, we do know that SVE2 can potentially improve performance, because vectorized loops process several elements per iteration, which matters for large datasets such as audio and video. For this reason, we will use the number of SVE2 instructions as a rough proxy for the possible performance gain, keeping in mind that the count alone does not prove a speedup.
Before we begin, we also need to consider that the opus package contains multiple unit tests, which can inflate the totals. Thus, we have to be careful to exclude them.
Let's count the total number of optimizations that are done by the compiler.
$ grep -v "test" make.log | grep "optimized" -c
632
The compiler reported 632 "optimized" vectorization messages. Let's take a look at how many SVE-specific whilelo instructions and registers (i.e., predicate registers and scalable vector registers) appear in the opus codec shared library, libopus.
$ objdump -d libopus.so.0.8.0 | grep whilelo -c
671
$ objdump -d libopus.so.0.8.0 | grep whilelo
2ef0: 25a40fe1 whilelo p1.s, wzr, w4
2f28: 25a40c60 whilelo p0.s, w3, w4
2f7c: 25a40fe1 whilelo p1.s, wzr, w4
2fa0: 25a40c60 whilelo p0.s, w3, w4
3314: 25b80fe0 whilelo p0.s, wzr, w24
3344: 25b80c20 whilelo p0.s, w1, w24
# ...
47b38: 25a50fe0 whilelo p0.s, wzr, w5
47b3c: 25a80c23 whilelo p3.s, w1, w8
47b4c: 25aa0c24 whilelo p4.s, w1, w10
47b54: 25250c26 whilelo p6.b, w1, w5
47b5c: 25a60c22 whilelo p2.s, w1, w6
47b68: 25a50c25 whilelo p5.s, w1, w5
47b98: 25a50c20 whilelo p0.s, w1, w5
$ objdump -d libopus.so.0.8.0 | egrep "[^[:alpha:]]z[[:digit:]]|[^[:alpha:]]p[[:digit:]]" -c
5274
$ objdump -d libopus.so.0.8.0 | egrep "[^[:alpha:]]z[[:digit:]]|[^[:alpha:]]p[[:digit:]]"
2ef0: 25a40fe1 whilelo p1.s, wzr, w4
2ef8: 04a34801 index z1.s, #0, w3
2f0c: 25814420 mov p0.b, p1.b
2f18: 856140a0 ld1w {z0.s}, p0/z, [x5, z1.s, sxtw #2]
2f1c: e54340c0 st1w {z0.s}, p0, [x6, x3, lsl #2]
2f28: 25a40c60 whilelo p0.s, w3, w4
# ...
47f48: 6594a000 scvtf z0.s, p0/m, z0.s
47f4c: 25886100 mov p0.b, p8.b
47f50: e544e4a2 st1w {z2.s}, p1, [x5, #4, mul vl]
47f54: e546e0a1 st1w {z1.s}, p0, [x5, #6, mul vl]
47f58: 25896520 mov p0.b, p9.b
47f5c: e547e0a0 st1w {z0.s}, p0, [x5, #7, mul vl]
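The egrep pattern above is a heuristic: it matches any line in which a non-letter character is followed by z or p and then a digit, which is how the scalable vector registers (z0, z1, ...) and predicate registers (p0, p1, ...) appear in the disassembly. The C function below is a small sketch of the same test, written only to make the matching rule explicit; the name mentions_sve_register is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if the line contains a non-letter, then 'z' or 'p',
   then a digit (for example " z1.s", "{z0.s}" or " p0/z"). */
int mentions_sve_register(const char *line)
{
    size_t i;
    assert(line != NULL);
    for (i = 0; line[i] != '\0' && line[i + 1] != '\0' && line[i + 2] != '\0'; i++) {
        unsigned char before = (unsigned char)line[i];
        char reg = line[i + 1];
        unsigned char after = (unsigned char)line[i + 2];
        if (!isalpha(before) && (reg == 'z' || reg == 'p') && isdigit(after))
            return 1;
    }
    return 0;
}
```

Two caveats follow from the rule itself: the count is a number of lines rather than a number of registers, and a symbol name that happens to contain something like _p1 would also match.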
As we can see, auto-vectorization generated a substantial amount of SVE2-specific code. Therefore, we can expect that the opus library may benefit from it and improve its overall performance.
Things that Can Further Improve the Performance
We already know that the compiler auto-vectorizes a large portion of the code. However, we have to admit that this method has limits. As we found earlier, the compiler cannot auto-vectorize some loops. This does not mean they cannot be vectorized at all. In some cases, SVE2 code could be generated if the loop were written differently. For example, as this article suggested, we may use restrict qualifiers to inform the compiler that the arrays do not overlap.
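As a minimal illustration of the restrict idea, the two functions below perform the same scaling, and only the second promises that its arrays do not overlap. The function names are hypothetical and are not taken from Opus. For a loop this simple, GCC can often vectorize the first version as well by adding a runtime overlap check, so the benefit of restrict tends to show up in more complicated loops.

```c
#include <assert.h>
#include <stddef.h>

/* Without restrict, the compiler must assume dst and src may overlap. */
void scale_may_alias(float *dst, const float *src, float gain, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = gain * src[i];
}

/* With restrict (C99), the caller promises that dst and src do not overlap,
   so the compiler does not need to guard against it. */
void scale_no_alias(float *restrict dst, const float *restrict src,
                    float gain, size_t n)
{
    size_t i;
    assert(dst != src || n == 0);
    for (i = 0; i < n; i++)
        dst[i] = gain * src[i];
}
```

The promise is the caller's responsibility: passing overlapping arrays to the restrict version is undefined behavior, so the qualifier should only be added where the data layout guarantees it.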
Original and SVE2 Implementation Comparison
Now we know that auto-vectorization successfully generated SVE2 code. However, this is meaningless if the SVE2 build of the library does not produce the same results as the original library. Let's examine whether the SVE2 version of the program works as well as the original version.
# original file
$ ll libopus.so.0.8.0
-rwxr-xr-x. 1 swji1 swji1 1498808 Apr 13 20:16 libopus.so.0.8.0
# SVE2 implemented file
$ ll libopus.so.0.8.0
-rwxr-xr-x. 1 swji1 swji1 1684704 Apr 22 14:21 libopus.so.0.8.0
The SVE2 version is slightly larger (by 185,896 bytes, about 0.18 MiB or roughly 12%), which is not a significant change.
Let's run the unit tests provided by the package authors. As we know from the previous post, we have to run them under emulation with the qemu-aarch64 command. Unlike the previous post, this time we will run several unit tests to see whether the SVE2 code works correctly.
$ ./test_opus_api
Testing the libopus 1.3.1-107-gccaaffa9-dirty API deterministically
Decoder basic API tests
---------------------------------------------------
opus_decoder_get_size(0)=0 ................... OK.
opus_decoder_get_size(1)=18228 ............... OK.
opus_decoder_get_size(2)=26996 ............... OK.
opus_decoder_get_size(3)=0 ................... OK.
opus_decoder_create() ........................ OK.
opus_decoder_init() .......................... OK.
OPUS_GET_FINAL_RANGE ......................... OK.
OPUS_UNIMPLEMENTED ........................... OK.
OPUS_GET_BANDWIDTH ........................... OK.
OPUS_GET_SAMPLE_RATE ......................... OK.
OPUS_GET_PITCH ............................... OK.
OPUS_GET_LAST_PACKET_DURATION ................ OK.
OPUS_SET_GAIN ................................ OK.
OPUS_GET_GAIN ................................ OK.
OPUS_RESET_STATE ............................. OK.
opus_{packet,decoder}_get_nb_samples() ....... OK.
opus_packet_get_nb_frames() .................. OK.
opus_packet_get_bandwidth() .................. OK.
opus_packet_get_samples_per_frame() .......... OK.
opus_decode() ................................ OK.
opus_decode_float() .......................... OK.
All decoder interface tests passed
(1219433 API invocations)
# ...
Repacketizer tests
---------------------------------------------------
opus_repacketizer_get_size()=496 ............. OK.
opus_repacketizer_init ....................... OK.
opus_repacketizer_create ..................... OK.
opus_repacketizer_get_nb_frames .............. OK.
opus_repacketizer_cat ........................ OK.
opus_repacketizer_out ........................ OK.
opus_repacketizer_out_range .................. OK.
opus_packet_pad .............................. OK.
opus_packet_unpad ............................ OK.
opus_multistream_packet_pad .................. OK.
opus_multistream_packet_unpad ................ OK.
All repacketizer tests passed
(6713561 API invocations)
malloc() failure tests
---------------------------------------------------
opus_decoder_create() ................... SKIPPED.
opus_encoder_create() ................... SKIPPED.
opus_repacketizer_create() .............. SKIPPED.
opus_multistream_decoder_create() ....... SKIPPED.
opus_multistream_encoder_create() ....... SKIPPED.
(Test only supported with GLIBC and without valgrind)
All API tests passed.
The libopus API was invoked 115421979 times.
$ ./test_opus_decode
Testing libopus 1.3.1-107-gccaaffa9-dirty decoder. Random seed: 2918850151 (76BD)
Starting 10 decoders...
opus_decoder_create(48000,1) OK. Copy OK.
opus_decoder_create(48000,2) OK. Copy OK.
opus_decoder_create(24000,1) OK. Copy OK.
opus_decoder_create(24000,2) OK. Copy OK.
opus_decoder_create(16000,1) OK. Copy OK.
opus_decoder_create(16000,2) OK. Copy OK.
opus_decoder_create(12000,1) OK. Copy OK.
opus_decoder_create(12000,2) OK. Copy OK.
opus_decoder_create( 8000,1) OK. Copy OK.
opus_decoder_create( 8000,2) OK. Copy OK.
dec[all] initial frame PLC OK.
dec[all] all 2-byte prefix for length 3 and PLC, all modes (64) OK.
dec[ 5] all 3-byte prefix for length 4, mode 28 OK.
dec[ 0] all 3-byte prefix for length 4, mode 4 OK.
dec[all] random packets, all modes (64), every 8th size from from 7 bytes to maximum OK.
dec[all] random packets, all mode pairs (4096), 145 bytes/frame OK.
dec[ 3] random packets, all mode pairs (4096)*10, 81 bytes/frame OK.
dec[ 0] pre-selected random packets OK.
Decoders stopped.
Testing opus_pcm_soft_clip... OK.
$ ./test_opus_encode
Testing libopus 1.3.1-107-gccaaffa9-dirty encoder. Random seed: 2953257216 (421F)
Running simple tests for bugs that have been fixed previously
Encode+Decode tests.
Mode LP FB encode VBR, 9119 bps OK.
Mode LP FB encode VBR, 13234 bps OK.
Mode LP FB encode VBR, 64668 bps OK.
Mode Hybrid FB encode VBR, 28306 bps OK.
Mode Hybrid FB encode VBR, 54852 bps OK.
Mode Hybrid FB encode VBR, 55130 bps OK.
Mode Hybrid FB encode VBR, 96362 bps OK.
Mode MDCT FB encode VBR, 893620 bps OK.
Mode MDCT FB encode VBR, 25608 bps OK.
Mode MDCT FB encode VBR, 29011 bps OK.
Mode MDCT FB encode VBR, 93628 bps OK.
Mode MDCT FB encode VBR, 93328 bps OK.
Mode MDCT FB encode VBR, 160982 bps OK.
# ...
Mode LP NB dual-mono MS encode CBR, 21883 bps OK.
Mode LP NB dual-mono MS encode CBR, 60566 bps OK.
Mode LP NB dual-mono MS encode CBR, 76774 bps OK.
Mode LP NB dual-mono MS encode CBR, 167879 bps OK.
Mode MDCT NB dual-mono MS encode CBR, 6953 bps OK.
Mode MDCT NB dual-mono MS encode CBR, 12756 bps OK.
Mode MDCT NB dual-mono MS encode CBR, 60193 bps OK.
Mode MDCT NB dual-mono MS encode CBR, 14915 bps OK.
Mode MDCT NB dual-mono MS encode CBR, 16946 bps OK.
Mode MDCT NB dual-mono MS encode CBR, 34028 bps OK.
Mode MDCT NB dual-mono MS encode CBR, 86938 bps OK.
Mode MDCT NB dual-mono MS encode CBR, 172977 bps OK.
All framesize pairs switching encode, 9683 frames OK.
Running fuzz_encoder_settings with 5 encoder(s) and 40 setting change(s) each.
Tests completed successfully.
As we can see, the SVE2 build passes all the unit tests, which confirms that it works as well as the original program.
Conclusion
In this post, we found that the compiler successfully vectorized much of the code, and the substantial number of SVE2-specific instructions and registers suggests that a significant performance improvement is likely. We also checked that the SVE2 code does not break the program, which runs as well as the original. These findings suggest that the authors of the opus package may benefit greatly from vectorization once SVE2 hardware becomes publicly available in the near future.