Project stage 3 - Analysis

pykedot

Tecca Yu

Posted on April 19, 2022

Introduction

Hi, this is Tecca. This post is a summary of the project; for more details, see my previous posts about stage 1 and stage 2.

In stage 2, I applied auto-vectorization to the project. In this post I go over some details and look at whether there are places where auto-vectorization was not applied, and why.

Recap

Running djpeg before applying auto-vectorization:

Image description

Running djpeg on qemu-aarch64 after applying auto-vectorization:

Image description

Number of whilelo instructions (an SVE/SVE2 predicate-generating instruction) after applying auto-vectorization:

Image description

I take the screenshots above as evidence that auto-vectorization was successfully applied to the project and that djpeg still works without crashing. What I don't know yet is whether every location that could be vectorized actually was.

I expect the vectorized build to run slower than the original code when I test it on qemu-aarch64. As I understand it, regular code runs at full speed on the processor, while SVE2 instructions are emulated by qemu-aarch64 and run at a slower speed, so timings taken there say little about performance on real SVE2 hardware.

Anyway, to get a log of what was vectorized and what was not, I need to rebuild the project the way I did in stage 2.

make -j$((`nproc`+1)) |& tee make.log

This stores the full output of the build in the make.log file.

Image description

The log shows that there are definitely places that auto-vectorization "missed".

Details of the files that were not vectorized

Image description

Image description

As the screenshots above show, far fewer files were vectorized than were not.

My guess is that the compiler only vectorizes loops where it expects a benefit, which in practice means loops that process a large amount of data; loops that process little data are very unlikely to gain anything from vectorization. It makes sense that these important loops are far fewer than the less important ones.
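The same comparison can be pulled out of the log instead of being read off screenshots. The sketch below assumes make.log contains GCC-style vectorization reports (the kind printed by an option such as -fopt-info-vec-all, which is an assumption about how the stage 2 build was configured). The function names are hypothetical helpers, not part of the project.

```shell
# Assumed message shapes (GCC vectorization reports):
#   file.c:LINE:COL: optimized: loop vectorized using ...
#   file.c:LINE:COL: missed: couldn't vectorize loop

# List each source file that has at least one vectorized loop.
vectorized_files() {
    grep 'optimized: loop vectorized' "$1" | cut -d: -f1 | sort -u
}

# List each source file that has at least one "missed" report.
missed_files() {
    grep ': missed:' "$1" | cut -d: -f1 | sort -u
}

# Example use once make.log exists:
#   vectorized_files make.log | wc -l
#   missed_files make.log | wc -l
```

Note that a file can show up in both lists, because one source file may contain some loops that were vectorized and others that were not.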

Different types of vectorization

Two types of vectors were used: variable-length vectors and vectors with a specified byte width.
Image description
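To see how often each type was used, the wording of the report lines can be tallied. This is a sketch that assumes the log lines read "loop vectorized using variable length vectors" or "loop vectorized using N byte vectors", as in GCC's vectorization reports; vector_kinds is a hypothetical helper name.

```shell
# Tally how many loops were vectorized with each kind of vector.
# Assumes lines such as:
#   file.c:LINE:COL: optimized: loop vectorized using variable length vectors
#   file.c:LINE:COL: optimized: loop vectorized using 16 byte vectors
vector_kinds() {
    grep 'optimized: loop vectorized using' "$1" \
        | sed 's/.*loop vectorized using //' \
        | sort | uniq -c
}

# Example use once make.log exists:
#   vector_kinds make.log
```

Each output line is a count followed by the vector kind. Variable-length vectors are what I would expect from SVE/SVE2 code, while a fixed byte width most likely corresponds to fixed-size SIMD registers.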

Looking into the non-vectorized code and trying modifications

I tried modifying the code that was not vectorized to see whether I could get the compiler to auto-vectorize it. None of the methods I tried worked. There could be various reasons for that, and they are actually explained in make.log.

I did a bit of research, and some sources say that in most cases a C/C++ compiler cannot vectorize a for loop because it cannot match the loop's structure to a predefined vectorization template.
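One way to find those reasons is to group the "missed" lines in make.log by their message text. This is a minimal sketch, assuming GCC-style ": missed: " report lines; missed_reasons is a hypothetical helper name.

```shell
# Group the "missed" reports by message text, most frequent first.
missed_reasons() {
    grep ': missed:' "$1" \
        | sed 's/.*: missed: *//' \
        | sort | uniq -c | sort -rn
}

# Example use once make.log exists:
#   missed_reasons make.log | head
```

The most frequent messages are a reasonable place to start when deciding which loops are worth trying to restructure.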

Conclusion

Over the three stages, I selected a candidate open-source package for optimization. In stage 2, I successfully added auto-vectorization by modifying the compiler options, presumably covering all the necessary locations in the library. In stage 3, I tried adding SVE2 support manually, but I could not get any further vectorization by modifying the source code.
