Implementing SVE2 for Open Source Project
Seung Woo (Paul) Ji
Posted on March 29, 2022
Introduction
In the last post, we explored and implemented Scalable Vector Extension 2 (SVE2) code for the volume adjusting algorithm. Now, we will do the same process but in a much bigger scale - by actually trying to contribute SVE2 code for the ongoing open source project.
Searching for a package
As we learned before, SVE2 is best suitable for processing large amount of data such as:
- Computer vision
- Multimedia
- Long-Term Evolution (LTE) baseband processing
- Genomics
- In-memory database
- Web serving
- Cryptography
- And so on...
And we know the vectorization can be implemented in 3 different ways:
- Auto-vectorization
- Compiler Intrinsics
- Inline Assembler
Since we already have the experience of intrinsics, we will try our best to search packages that already use them.
We also have to consider if a package supports for our machine (Fedora 35 running on Aarch64 Architecture) as we have to install the program. For this, we will use the Fedora's package manager DNF and run the following commands:
$dnf search search_keyword
$dnf info package_name
By using $dnf search
, the keyword is searched in both name and description of every package. Once we find a name of package, we can display the detailed description of that package with $dnf info
. We also have to be careful to only choose open-source project.
List of Possible Candidates
With the aforementioned strategy, we can find some possible candidates as follows:
Let's see each package together!
libjpeg-turbo
libjpeg-turbo is a JPEG image codec that utilizes SIMD instructions to perform JPEG compression and decompression. When we inspect the package, we can find a list of promising files as follows:
$ find . -name "*neon*"
./jidctfst-neon.c
./jcsample-neon.c
./aarch32/jchuff-neon.c
./aarch32/jsimd_neon.S
./aarch32/jccolext-neon.c
./jfdctfst-neon.c
./neon-compat.h.in
./aarch64/jchuff-neon.c
./aarch64/jsimd_neon.S
./aarch64/jccolext-neon.c
./jidctred-neon.c
./jfdctint-neon.c
./jdmerge-neon.c
./jidctint-neon.c
./jccolor-neon.c
./jdsample-neon.c
./jdcolor-neon.c
./jdmrgext-neon.c
./jcgryext-neon.c
./jcphuff-neon.c
./jcgray-neon.c
./jdcolext-neon.c
./jquanti-neon.c
// jquanti-neon.c
// ...
#if defined(__clang__) && (defined(__aarch64__) || defined(_M_ARM64))
#pragma unroll
#endif
for (i = 0; i < DCTSIZE; i += DCTSIZE / 2) {
/* Load DCT coefficients. */
int16x8_t row0 = vld1q_s16(workspace + (i + 0) * DCTSIZE);
int16x8_t row1 = vld1q_s16(workspace + (i + 1) * DCTSIZE);
int16x8_t row2 = vld1q_s16(workspace + (i + 2) * DCTSIZE);
int16x8_t row3 = vld1q_s16(workspace + (i + 3) * DCTSIZE);
/* Load reciprocals of quantization values. */
uint16x8_t recip0 = vld1q_u16(recip_ptr + (i + 0) * DCTSIZE);
uint16x8_t recip1 = vld1q_u16(recip_ptr + (i + 1) * DCTSIZE);
uint16x8_t recip2 = vld1q_u16(recip_ptr + (i + 2) * DCTSIZE);
uint16x8_t recip3 = vld1q_u16(recip_ptr + (i + 3) * DCTSIZE);
uint16x8_t corr0 = vld1q_u16(corr_ptr + (i + 0) * DCTSIZE);
uint16x8_t corr1 = vld1q_u16(corr_ptr + (i + 1) * DCTSIZE);
uint16x8_t corr2 = vld1q_u16(corr_ptr + (i + 2) * DCTSIZE);
uint16x8_t corr3 = vld1q_u16(corr_ptr + (i + 3) * DCTSIZE);
int16x8_t shift0 = vld1q_s16(shift_ptr + (i + 0) * DCTSIZE);
int16x8_t shift1 = vld1q_s16(shift_ptr + (i + 1) * DCTSIZE);
int16x8_t shift2 = vld1q_s16(shift_ptr + (i + 2) * DCTSIZE);
int16x8_t shift3 = vld1q_s16(shift_ptr + (i + 3) * DCTSIZE);
// ...
As we can see, vld1q_s16
intrinsic is used to load a vector from memory. Furthermore, the package does not yet use SVE or SVE2 implementation. This indicates this project is a good candidate where we can contribute our knowledge of SVE2 for this project.
SoundTouch
Soundtouch is an audio-processing library that allows changing the sound tempo, pitch and playback rate parameters. This sounds familiar to us as we dealt with a simple audio algorithm before and maybe another good candidate for us.
$grep -ir neon .
./configure.ac:AC_CHECK_HEADERS([arm_neon.h])
./configure.ac:AC_ARG_ENABLE([neon-optimizations],
./configure.ac: [AS_HELP_STRING([--enable-neon-optimizations],
./configure.ac: [use ARM NEON optimization [default=yes]])],[enable_neon_optimizations="${enableval}"],
# configure.ac
if test "x$enable_neon_optimizations" = "xyes" -a "x$ac_cv_header_arm_neon_h" = "xyes"; then
# Check for ARM NEON support
original_saved_CXXFLAGS=$CXXFLAGS
have_neon=no
CXXFLAGS="-mfpu=neon -march=native $CXXFLAGS"
# Check if can compile neon code using intrinsics, require GCC >= 4.3 for autovectorization.
AC_COMPILE_IFELSE([AC_LANG_SOURCE([[
#if defined(__GNUC__) && (__GNUC__ < 4 || (__GNUC__ == 4 && __GNUC_MINOR__ < 3))
#error "Need GCC >= 4.3 for neon autovectorization"
#endif
#include <arm_neon.h>
int main () {
int32x4_t t = {1};
return vaddq_s32(t,t)[0] == 2;
}]])],[have_neon=yes])
CXXFLAGS=$original_saved_CXXFLAGS
if test "x$have_neon" = "xyes" ; then
echo "****** NEON support enabled ******"
CPPFLAGS="-mfpu=neon -march=native -mtune=native $CPPFLAGS"
AC_DEFINE(SOUNDTOUCH_USE_NEON,1,[Use ARM NEON extension])
fi
fi
The package does not contain any files that has simd
or neon
in their names. However, it does have a file that contains neon
in its content. When we open that file, we can see this package utilizes the auto-vectorization feature by the compiler. As we can see, the package prompts a message saying that it cannot perform the auto-vectorization when it is compiled by GCC
with a version less than 4.3.
Opus
Opus is a audio codec for interactive speech and audio transmission across the Internet with compression algorithms. It can support a wide rage of interactive audio applications such as Voice Over IP (VoIP), remote live music performance, and video conferencing. As similar to the last one, this may be a good candidate for us.
$ find | grep -i neon
./celt/arm/celt_neon_intr.c
./celt/arm/pitch_neon_intr.c
./silk/fixed/arm/warped_autocorrelation_FIX_neon_intr.c
./silk/arm/NSQ_neon.c
./silk/arm/LPC_inv_pred_gain_neon_intr.c
./silk/arm/NSQ_neon.h
./silk/arm/biquad_alt_neon_intr.c
./silk/arm/NSQ_del_dec_neon_intr.c
// celt_neon_intr.c
#include <arm_neon.h>
#include "../pitch.h"
#if defined(FIXED_POINT)
void xcorr_kernel_neon_fixed(const opus_val16 * x, const opus_val16 * y, opus_val32 sum[4], int len)
{
int j;
int32x4_t a = vld1q_s32(sum);
/* Load y[0...3] */
/* This requires len>0 to always be valid (which we assert in the C code). */
int16x4_t y0 = vld1_s16(y);
y += 4;
for (j = 0; j + 8 <= len; j += 8)
{
/* Load x[0...7] */
int16x8_t xx = vld1q_s16(x);
int16x4_t x0 = vget_low_s16(xx);
int16x4_t x4 = vget_high_s16(xx);
/* Load y[4...11] */
int16x8_t yy = vld1q_s16(y);
int16x4_t y4 = vget_low_s16(yy);
int16x4_t y8 = vget_high_s16(yy);
int32x4_t a0 = vmlal_lane_s16(a, y0, x0, 0);
int32x4_t a1 = vmlal_lane_s16(a0, y4, x4, 0);
int16x4_t y1 = vext_s16(y0, y4, 1);
int16x4_t y5 = vext_s16(y4, y8, 1);
int32x4_t a2 = vmlal_lane_s16(a1, y1, x0, 1);
int32x4_t a3 = vmlal_lane_s16(a2, y5, x4, 1);
int16x4_t y2 = vext_s16(y0, y4, 2);
int16x4_t y6 = vext_s16(y4, y8, 2);
int32x4_t a4 = vmlal_lane_s16(a3, y2, x0, 2);
int32x4_t a5 = vmlal_lane_s16(a4, y6, x4, 2);
int16x4_t y3 = vext_s16(y0, y4, 3);
int16x4_t y7 = vext_s16(y4, y8, 3);
int32x4_t a6 = vmlal_lane_s16(a5, y3, x0, 3);
int32x4_t a7 = vmlal_lane_s16(a6, y7, x4, 3);
y0 = y8;
a = a7;
x += 8;
y += 8;
}
// ...
When searched with neon
, we can see a list of promising files that potentially deal with simd
instructions. In celt_neon_intr.c
file, we can see xcorr_kernel_neon_fixed
function executes a loop with SIMD instructions.
Result
We have a pretty good open-source projects to implement SVE2. Amongst them, we will choose Opus
project for several reasons. First of all, this project is still well and actively maintained by developers. As a matter of fact, it is standardized by the Internet Engineering Task Force IETF and unmatched for interactive audio transmission over the Internet. Besides, the package is well-documented to understand the code thoroughly. Lastly, and most importantly, the code is written to be more readable by new developers as compared to the first two projects. As we can see, the author kindly commented the purpose of variables and functions. Thus, we will choose Opus
project to contribute our SVE2 knowledge.
Contributions
The way to contribute for Opus
project is well-explained in its wiki page. Thankfully, the wiki page states that one of ways to contribute to Opus
development is by doing optimizations (assembly/intrinsics). To do this, we can easily approach to the developers on the mailing list or through the IRC channel.
Conclusion
In this post, we explored some of the open-source projects where we could contribute our SVE2 knowledge. As it turned out, Opus
project is most suitable for us. In the following post, we will start implementing SVE2 codes in the project.
Posted on March 29, 2022
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.
Related
November 29, 2024