SPO 600 project part 3 - Analysis

Introduction

This is the final part of our project to add SVE2 instructions to an open-source project. Thank you for reading.

If you haven't read the second part yet, please see this link.

Also, you can see the link to the repo here.

https://github.com/aserputov/std-simd

Pull Request: https://github.com/VcDevel/std-simd/pull/35

For this project, I tried two different ways of working with SVE2:

Autovectorization
Intrinsics

I also have two different types of machines that help me test the library.

We can't apply SVE2 instructions to every file, but I made a start on some of them.

We don't have Armv9 hardware yet, so we won't see many differences.

I added several SVE2-related variables to the project.

Update: after receiving feedback on project part 2.

My initial idea was to work with autovectorization.

As I see now, there is no way to add autovectorization to my header library.

That's why, in the second part of my project, I decided to use intrinsics with SVE2.
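For contrast with intrinsics, here is a minimal sketch of the kind of loop a compiler can autovectorize on its own. The function name add_arrays is hypothetical and is not part of std-simd. The restrict qualifiers tell the compiler that the arrays do not overlap, which is usually what it needs to vectorize safely. With GCC, building with something like -O3 -march=armv8-a+sve2 should allow the vectorizer to emit SVE/SVE2 instructions, although the exact output depends on the compiler version.

```c
#include <stddef.h>

/* Hypothetical example: element-wise addition of two float arrays.
   No intrinsics are used; vectorization is left to the compiler. */
void add_arrays(size_t n,
                const float *restrict a,
                const float *restrict b,
                float *restrict out)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The source stays portable: the same function builds unchanged on x86 and on arm64, and only the compiler flags decide which SIMD instructions are generated.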

You can track my progress here:
https://github.com/VcDevel/std-simd/pull/35

But now more on my idea with intrinsics. (Of course, I found it very hard to do and it is still in progress, but I think I have made a little progress.)

First step in my analysis: https://developer.arm.com/documentation/100748/0616/SVE-Coding-Considerations-with-Arm-Compiler/Using-SVE-and-SVE2-intrinsics-directly-in-your-C-code

Intrinsics are C or C++ pseudo-function calls that the compiler replaces with the appropriate SIMD instructions. These intrinsics let you use the data types and operations available in the SIMD implementation, while allowing the compiler to handle instruction scheduling and register allocation.

My code steps:

Header file inclusion

#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#endif /* __ARM_FEATURE_SVE */ 

// All functions and types that are defined in the header file have the prefix sv, to reduce the chance of collisions with other extensions.

In my code it's on line 51 https://github.com/VcDevel/std-simd/blob/5a1ef0e3c5ccc36fbc4af8fb1ad8c89e6e6d0dd4/experimental/bits/simd.h#L51
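As a small sketch of why the guard matters: code that calls sv-prefixed functions has to sit behind the same __ARM_FEATURE_SVE check, with a fallback for builds without SVE. The helper name sve_vector_bytes is hypothetical and does not exist in std-simd; svcntb() is the ACLE intrinsic that returns the number of 8-bit lanes in a vector.

```c
#include <stddef.h>

#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#endif /* __ARM_FEATURE_SVE */

/* Hypothetical helper: returns the SVE vector length in bytes,
   or 0 when this translation unit is compiled without SVE. */
size_t sve_vector_bytes(void)
{
#ifdef __ARM_FEATURE_SVE
    return (size_t)svcntb();   /* 8-bit lanes per vector == bytes per vector */
#else
    return 0;                  /* no SVE: nothing from arm_sve.h is referenced */
#endif
}
```

On a machine without SVE the function simply returns 0, so the rest of the code can still be built and tested there.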

SVE vector types.
arm_sve.h defines C types to represent values in SVE vector registers. Each type describes the type of the elements within the vector (for example, svint8_t, svfloat32_t, and svfloat64_t).

I have updated a few variable types to SVE2 types so far.
This is still in progress, because the amount of work turned out to be large.
(Note: I also created a constexpr for SVE2 that may support future logic in the std-simd library. I thought, and still think, that it won't help a lot, but it will help at some points for SVE2.)

!Docs
Common situations where SVE types might be used include:

As the type of an object with automatic storage duration.
As a function parameter or return type.
As the type in a (type) {value} compound literal.
As the target of a pointer or reference type.
As a template type argument.
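Some of the situations listed above can be sketched in C as follows. The function names scale, scale_in_place, and scale_first are hypothetical and chosen only for illustration. The SVE path is guarded, so on a machine without SVE the plain C fallback is used instead. The template case is C++ only and is not shown.

```c
#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>

/* SVE type as a function parameter and as a return type. */
static svfloat64_t scale(svfloat64_t v, double factor)
{
    return svmul_x(svptrue_b64(), v, factor);
}

/* SVE type as the target of a pointer type. */
static void scale_in_place(svfloat64_t *v, double factor)
{
    *v = scale(*v, factor);
}
#endif /* __ARM_FEATURE_SVE */

/* Hypothetical wrapper: multiplies x by factor through an SVE vector
   when SVE is available, and with plain C otherwise. */
double scale_first(double x, double factor)
{
#ifdef __ARM_FEATURE_SVE
    svfloat64_t v = svdup_f64(x);   /* object with automatic storage duration */
    scale_in_place(&v, factor);
    /* Only lane 0 is active in this predicate, so the last active
       element is lane 0. */
    return svlastb_f64(svptrue_pat_b64(SV_VL1), v);
#else
    return x * factor;
#endif
}
```

For example, scale_first(2.0, 3.0) gives 6.0 on either path.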

SVE2 intrinsics

!Docs reference
To enable only the base SVE2 instructions, use the +sve2 option with the armclang -march or -mcpu options. To enable additional optional SVE2 instructions, use the following armclang options:

+sve2-aes to enable scalable vector forms of AESD, AESE, AESIMC, AESMC, PMULLB, and PMULLT instructions.
+sve2-bitperm to enable the BDEP, BEXT, and BGRP instructions.
+sve2-sha3 to enable scalable vector forms of the RAX1 instruction.
+sve2-sm4 to enable scalable vector forms of SM4E and SM4EKEY instructions.
You can use one or more of these options. Each option also implies +sve2. For example, +sve2-aes+sve2-bitperm+sve2-sha3+sve2-sm4 enables all base and optional instructions. For clarity, you can include +sve2 if necessary.
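To check from code which of these options a build was compiled with, the ACLE feature macros can be tested at compile time. This is only a sketch: the enum names and the function sve2_enabled_extensions are hypothetical, while the __ARM_FEATURE_SVE2* macros come from the Arm C Language Extensions. A build line such as armclang -march=armv8-a+sve2-aes+sve2-bitperm would be expected to set the SVE2, AES, and BITPERM bits.

```c
/* Hypothetical bit flags, one per SVE2 feature macro. */
enum {
    EXT_SVE2    = 1u << 0,
    EXT_AES     = 1u << 1,
    EXT_BITPERM = 1u << 2,
    EXT_SHA3    = 1u << 3,
    EXT_SM4     = 1u << 4
};

/* Returns a bit mask of the SVE2 features this translation unit
   was compiled with; 0 when SVE2 is not enabled at all. */
unsigned sve2_enabled_extensions(void)
{
    unsigned mask = 0;
#ifdef __ARM_FEATURE_SVE2
    mask |= EXT_SVE2;
#endif
#ifdef __ARM_FEATURE_SVE2_AES
    mask |= EXT_AES;
#endif
#ifdef __ARM_FEATURE_SVE2_BITPERM
    mask |= EXT_BITPERM;
#endif
#ifdef __ARM_FEATURE_SVE2_SHA3
    mask |= EXT_SHA3;
#endif
#ifdef __ARM_FEATURE_SVE2_SM4
    mask |= EXT_SM4;
#endif
    return mask;
}
```

Because each optional extension implies +sve2, any optional bit in the mask should come together with the EXT_SVE2 bit.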

I have a rough idea of how to use the intrinsics, but I'm still looking for where I can add them in this header library.
The example I use (first the scalar version, then the same function written with SVE intrinsics):

void daxpy_1_1(int64_t n, double da, double *dx, double *dy)
{
    for (int64_t i = 0; i < n; ++i) {
        dy[i] = dx[i] * da + dy[i];
    }
}

void daxpy_1_1(int64_t n, double da, double *dx, double *dy)
{
    int64_t i = 0;
    svbool_t pg = svwhilelt_b64(i, n);                              // [1]
    do {
        svfloat64_t dx_vec = svld1(pg, &dx[i]);                     // [2]
        svfloat64_t dy_vec = svld1(pg, &dy[i]);                     // [2]
        svst1(pg, &dy[i], svmla_x(pg, dy_vec, dx_vec, da));         // [3]
        i += svcntd();                                              // [4]
        pg = svwhilelt_b64(i, n);                                   // [1]
    }
    while (svptest_any(svptrue_b64(), pg));                         // [5]
}
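The numbered markers in the intrinsics version mean: [1] svwhilelt_b64 builds a predicate that is active for the lanes where i is still below n, [2] svld1 loads only the active lanes, [3] svmla_x computes dy + dx * da and svst1 stores the active lanes back, [4] svcntd gives the number of 64-bit lanes per vector, and [5] svptest_any keeps the loop running while at least one lane is active. The sketch below puts both versions in one file under different names (daxpy_scalar and daxpy_sve, both hypothetical, as is the self-check daxpy_versions_agree) and falls back to the scalar loop when SVE is not available, so it can also be built on x86.

```c
#include <stdint.h>

#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#endif /* __ARM_FEATURE_SVE */

/* Plain C reference version. */
void daxpy_scalar(int64_t n, double da, const double *dx, double *dy)
{
    for (int64_t i = 0; i < n; ++i) {
        dy[i] = dx[i] * da + dy[i];
    }
}

/* SVE intrinsics version; uses the scalar loop when built without SVE. */
void daxpy_sve(int64_t n, double da, const double *dx, double *dy)
{
#ifdef __ARM_FEATURE_SVE
    int64_t i = 0;
    svbool_t pg = svwhilelt_b64(i, n);              /* active lanes: i < n */
    do {
        svfloat64_t dx_vec = svld1(pg, &dx[i]);     /* predicated loads */
        svfloat64_t dy_vec = svld1(pg, &dy[i]);
        svst1(pg, &dy[i], svmla_x(pg, dy_vec, dx_vec, da));
        i += svcntd();                              /* 64-bit lanes per vector */
        pg = svwhilelt_b64(i, n);
    } while (svptest_any(svptrue_b64(), pg));       /* any lane still active? */
#else
    daxpy_scalar(n, da, dx, dy);
#endif
}

/* Hypothetical self-check: runs both versions on the same small input
   (n must be between 0 and 64) and returns 1 when every element matches. */
int daxpy_versions_agree(int64_t n)
{
    double x[64], y_scalar[64], y_sve[64];
    if (n < 0 || n > 64) {
        return 0;
    }
    for (int64_t i = 0; i < n; ++i) {
        x[i] = (double)(i + 1);
        y_scalar[i] = y_sve[i] = (double)(2 * i);
    }
    daxpy_scalar(n, 2.0, x, y_scalar);
    daxpy_sve(n, 2.0, x, y_sve);
    for (int64_t i = 0; i < n; ++i) {
        if (y_scalar[i] != y_sve[i]) {
            return 0;
        }
    }
    return 1;
}
```

The test data uses small whole numbers so that the fused multiply-add in the SVE path and the separate multiply and add in the scalar path give exactly the same results.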

https://developer.arm.com/documentation/100987/0000/
I will continue this work and update the post.

Also, because I didn't demonstrate auto-vectorization in practice in the std-simd library and didn't actually build it, I decided to show that I know how it works by applying it to one extra library.

I chose this project:
https://github.com/cisco/openh264

It was easy to build and test on different machines because, besides the Israel and Portugal servers, I also had my own arm64 and x86 local machines.

Here is the build command: make OS=ios ARCH=**ARCH**

!DOCS

Valid values for **ARCH** are the normal iOS architecture names such as armv7, armv7s, arm64, and i386 and x86_64 for the simulator. Another settable iOS specific parameter is SDK_MIN, specifying the minimum deployment target for the built library. For other details on building using make on the command line, see 'For All Platforms' below.

After the build, the find command turned up these static library files:


-rw-r--r--   1 anatoliyserputov  staff   443656 libcommon.a
-rw-r--r--   1 anatoliyserputov  staff     8160 libconsole_common.a
-rw-r--r--   1 anatoliyserputov  staff  1604152 libdecoder.a
-rw-r--r--   1 anatoliyserputov  staff  2019888 libencoder.a
-rw-r--r--   1 anatoliyserputov  staff  4444584 libopenh264.a
-rw-r--r--   1 anatoliyserputov  staff   377176 libprocessing.a

Also, I built it using make ARCH=arm64