Perl, Inline::CPP and the need for speed (sometimes).

TLDR;
V1 Perl original regex character strip loop: 11 Seconds
V2 Perl manual character strip loop: 40 Seconds
V3 Perl + Inline::CPP character strip loop: 0.92 Seconds
/TLDR

Summary: A dabble with Perl and the great Inline::CPP module to speed up a bit of old code.

Inline::CPP is a Perl module which makes it really easy to incorporate bits of C++ with Perl. (Why aren't more people using this? You can even create standalone modules without the need for Inline::CPP.)

You can find it here

So, I've been trying to optimise some bits of Perl code recently. We all know premature optimisation is evil, but there are times when it's worth benchmarking your code, and it's sometimes surprising where the slowdowns occur. Perl is fantastic for speed of programming but, as with any language, sometimes we grab the wrong solution, or things just grow without anyone checking in on the code.

We do a lot of text processing, and this IS what Perl is good for, isn't it? But as long as the code works, it's easy to get lazy about keeping an eye on it.

The original idea of this code (to be fair, it was written about 15 years ago by someone else!) was to sanitise some text: strip out punctuation, HTML and other clutter, collapse excess whitespace down to a single space and make everything lower case, so that duplicates could be compared.

This is naturally a simplified example.

# get rid of non-alphanumerics and excess spaces and make lower case
sub perl_strip {
        my ( $string ) = @_;
        $string =~ s/[^0-9a-zA-Z]/ /g;
        $string =~ s/ +/ /g;
        $string =~ s/^ +//;
        $string =~ s/ +$//;
        lc $string;
}

Ok, immediately, I suspect there's some better way within regex itself, but as we add more clauses it gets messy (simplified example again).

Anyway, after benchmarking, this looked pretty slow, and it kinda felt dumb looping over the same text repeatedly, even with regex. The regex engine is written in C afaik, so it's fast, but this approach doesn't feel optimal.

So let's try and come up with a general solution that doesn't do that.

At this point, out of interest, I actually asked ChatGPT (I find it good for suggesting approaches). It tried the regex-only route first, but failed on exactness, and with some coaxing towards what was on my mind, "we" came up with a simple loop with a check on the previous character/state...

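# single pass: copy alphanumerics, collapse everything else to one space
# (the sub name perl_loop_strip is just a label for this example)
sub perl_loop_strip {
    my ($str) = @_;

    my $result = '';
    my $last_char_was_space = 1;    # start true so leading clutter adds no space

    for my $c (split //, $str) {
        if ($c =~ /[a-zA-Z0-9]/) {
            $result .= $c;
            $last_char_was_space = 0;
        } elsif (!$last_char_was_space) {
            $result .= ' ';
            $last_char_was_space = 1;
        }
    }
    # drop the one trailing space left by trailing clutter, to match perl_strip
    chop $result if $last_char_was_space && length $result;
    return lc $result;
}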

Ok, that makes sense logically, but is it faster? Actually, no: it's a lot slower (see the benchmarks further down). I'm guessing (but don't know) that the optimised C code behind the regex engine easily outweighs any gain from making a single pass.

However, what about spending a little time optimising that loop with some C++? It's pretty straightforward, and some Perl genius at https://github.com/daoswald/Inline-CPP has made it easy to merge C++ with Perl (note: I think RPerl also uses Inline::CPP). Install as usual via

cpan Inline::CPP

So, I've fluffed about with Inline::CPP before a little (not a lot! So let me know flaws here), and thought this was a prime candidate. After all, we do this processing A LOT on a lot of text, ALL the time.

So next step, get my CPP hat on (erm ok, well I don't have one, but I can dig out the basics)...and have a play with Perl's Inline::CPP.

First of all, the setup clutter. It's a one-off, so I don't care too much. Setting up the basics is not too complex, but you may need a little digging on some setups and compilers (you will need a C++ compiler on your system).

# set up basic Perl CPP compile/lib options
use Inline CPP => config => typemaps => './typemap'; ## cppxs will look for typemap if you use that
use Inline CPP => config => ccflags => '-Wall -c -std=c++11 -I/usr/local/include';
use Inline CPP => config => inc => '-I/usr/local/include';
use Inline CPP => config => cc => '/usr/bin/g++';
use Inline CPP => <<'END';

#define extract_string_from_scalar_value SvPV_nolen
#define set_string_value_of_scalar_value sv_setpv

Now some C++. It's doing exactly the same logic as the last Perl example, just in C++, and we then call it from Perl as a normal Perl subroutine.

// We will call cpp_strip from Perl!! All the subs are exactly the same logic as the Perl code

#include <string>
#include <cctype>
#include <iostream> 
#include <algorithm>
#include <cstring> // for access to std::strlen

// so we know how to convert std::string to Perl strings, see lower down
typedef std::string cppstring;

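cppstring cpp_strip(cppstring str) {
        bool last_char_was_space = true;

        std::string result;
        result.reserve(str.size());

        for (char c : str) {
                // cast first: passing a negative char (any non-ASCII byte) to isalnum/tolower is undefined
                unsigned char uc = static_cast<unsigned char>(c);
                if (std::isalnum(uc)) {
                        result += static_cast<char>(std::tolower(uc));
                        last_char_was_space = false;
                } else if (!last_char_was_space) {
                        result += ' ';
                        last_char_was_space = true;
                }
        }

        // drop the single trailing space left when the input ends in a non-alphanumeric
        if (!result.empty() && result.back() == ' ') {
                result.pop_back();
        }

        return result;
}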

I then call it from Perl like this:

$s = cpp_strip( $string );

I also have a typemap file, which was referenced earlier with

use Inline CPP => config => typemaps => './typemap';

Note the two #define statements from earlier (repeated below). I added them only to make the typemap easier to read: you could use SvPV_nolen and sv_setpv directly, but their intent isn't obvious to me at a glance, and going into them is beyond the scope of this post. The typemap itself just tells Perl how to convert between Perl types and C++ (or other language) types.

#define extract_string_from_scalar_value SvPV_nolen
#define set_string_value_of_scalar_value sv_setpv


TYPEMAP
  cppstring T_CPPSTRING

INPUT
T_CPPSTRING
        $var = ($type)extract_string_from_scalar_value($arg)

OUTPUT
T_CPPSTRING
        set_string_value_of_scalar_value($arg, $var.c_str());

With the typemap in place, the call from Perl is, as before, simply

$s = cpp_strip( $string );

Benchmarks!

So, I ran each version 100,000 times over a largeish lorem ipsum string. Benchmarks as follows.

V1 Perl original regex loop: 11 Seconds

V2 Perl manual strip loop: 42 Seconds

V3 Perl + Inline::CPP string loop: 0.92 Seconds
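
For anyone wanting to reproduce the pure-Perl half of these numbers, here is a minimal sketch using the core Benchmark module. It is not the harness behind the figures above. The sample text, the timing window and the sub name perl_loop_strip are assumptions for illustration. The loop version here also drops its one trailing space so that both subs return identical output, and the Inline::CPP version is left out so the script runs without a compiler.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# V1: the original multi-pass regex version
sub perl_strip {
    my ($string) = @_;
    $string =~ s/[^0-9a-zA-Z]/ /g;
    $string =~ s/ +/ /g;
    $string =~ s/^ +//;
    $string =~ s/ +$//;
    return lc $string;
}

# V2: the single-pass character loop (name is illustrative)
sub perl_loop_strip {
    my ($str) = @_;
    my $result = '';
    my $last_char_was_space = 1;
    for my $c (split //, $str) {
        if ($c =~ /[a-zA-Z0-9]/) {
            $result .= $c;
            $last_char_was_space = 0;
        } elsif (!$last_char_was_space) {
            $result .= ' ';
            $last_char_was_space = 1;
        }
    }
    # drop the one trailing space so the output matches perl_strip
    chop $result if $last_char_was_space && length $result;
    return lc $result;
}

# assumed sample input: a short lorem ipsum line with some clutter, repeated
my $text = 'Lorem ipsum,  dolor <b>sit</b> amet... Consectetur 123! ' x 20;

# check the two versions agree before timing them
die "outputs differ\n" unless perl_strip($text) eq perl_loop_strip($text);

# time each sub for at least 1 CPU second and print a comparison table
cmpthese(-1, {
    regex => sub { perl_strip($text) },
    loop  => sub { perl_loop_strip($text) },
});
```

The negative count tells Benchmark to run each sub for at least that many CPU seconds rather than a fixed number of iterations, which avoids the "too few iterations" warning on the faster sub.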

Final thoughts

I do feel like one of Perl's strengths is the ability to combine different languages for performance and flexibility, and that this should be used more often!

I'm also interested in using Rust and Golang, probably using FFI to integrate with Perl. I feel like there are a lot of possibilities with mixing languages we don't often explore (we can also use this for any APIs Perl doesn't have solutions for, but other languages do), and I'm hoping to have a dig into that at some point!

If anyone can do it in pure regex, I'm very interested how it will perform.
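
Two pure-Perl candidates that might be worth adding to the benchmark, as sketches only: a regex version that folds the first two substitutions into one by putting + on the character class, and a tr/// version using the /c (complement) and /s (squeeze) flags, which strictly speaking is not a regex at all. The sub names are made up for this example, and I'm not claiming any timings for either.

```perl
use strict;
use warnings;

# regex version: each run of non-alphanumerics becomes one space in a single substitution
sub perl_strip_regex {
    my ($string) = @_;
    $string =~ s/[^0-9a-zA-Z]+/ /g;
    $string =~ s/^ //;    # at most one leading space can remain
    $string =~ s/ $//;    # and at most one trailing space
    return lc $string;
}

# tr version: /c complements the character list, /s squeezes each translated run to one space
sub perl_strip_tr {
    my ($string) = @_;
    $string =~ tr/0-9a-zA-Z/ /cs;
    $string =~ s/^ //;
    $string =~ s/ $//;
    return lc $string;
}

my $sample = "  Hello,   World!! <b>42</b>\n";
print perl_strip_regex($sample), "\n";
print perl_strip_tr($sample), "\n";
```

Both still make more than one pass over the string, but the later passes only touch the two ends, so most of the work happens in a single sweep.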

Full code, including benchmarks, and an even slightly faster (but more complex) solution can be found here

Note: We can also use InlineX::CPP2XS to convert the C++ code to a Perl XS module to include, if you then want to remove the dependency on Inline::CPP.

Ian
