Playing with JavaScript performances and DNA

linediconsine

Marco

Posted on December 5, 2021

Playing with JavaScript performances and DNA

Biology is an interesting world and bioinformatic is where computer science meets biology,
Today I will describe a simple yet interesting bioinformatic problem from an algorithmic prospective.: Calculate the reverse complement of DNA bases using Javascript.

In this article, I am using a bioinformatic problem because is fun and interesting, but I will be mostly talking about JavaScript performance.
We will

  • Start describing how DNA works (with some big simplifications… Eih! I am not a biologist!),
  • Propose some implementations, and then
  • try to archive the best time performance, comparing the time for completing the task.

heads-up: A basic knowledge of JavaScript language is required to follow along.

What’s the reverse complement?

Before explaining it, bear with me for a small tour of how DNA looks like.

Let start with some concepts,
DNA Macrostructure

The DNA helix is composed of two strands like in the image above.
A strand is a long sequence of this for letters ATGC ( each letter is a specific nucleotide Adenine, Thymidine, Guanidine, Cytidine ) in some order.
There is a specific relation between what is the first strand and what there is on the second strand: for each A in the first sequence there is a T on the other strand and vice versa, and for each G a C will be on the other strand.

The conversion from map DNA strand to complementary strang would be something like:

'A': 'T',
'G': 'C',
'T': 'A',
'C': 'G'

Here is an example:

Example of DNA

I often hear these two sequences named 5’ to 3’’ ( 3’ end) and the second string is named 3’ to 5’ (5’’ end). The direction of reading is in both from 5’ to 3’’ and this means that a sequence is read from left to right but the other one (the complementary) is read from right to left.

In most of the file formats and web APIs that I worked with since the complementary DNA strand can be calculated from the first strand sequence, only one DNA strand is provided ( 3’ end) and is up to us to calculate the complementary.

Now, we have enough for our small challenge:

How can I generate a complementary strand?

Given an input:
TTATACGACTCACTATAGGGAGACTTAAGAAG
The expected output should look like this:
CTTCTTAAGTCTCCCTATAGTGAGTCGTATAA

Remember: we are reading the complementary in reverse order so the DNA sequence starts TT the complementary will end with AA.

Input:

TT ATACGACTCACTATAGGGAGACTTAAGAAG
Output:

CTTCTTAAGTCTCCCTATAGTGAGTCGTAT AA

Ok, let make the code talks for us:

Let start with a modern approach, a map

const map = (sequence) => {
    const map = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    return sequence.split('').reverse().map(bp => map[bp]).join("")
}
Enter fullscreen mode Exit fullscreen mode

This is “easy” to read and the steps are :

We take the input
“TACGA”
We separate each char and create an array
[ ‘T’ , ’A’ , ’C’ , ’G’ , ’A’]
Then map each char into his complementary
[ ‘A’ , ’T’ , ’G’ , ’C’ , ’T’]
We reverse
[ ‘T’ , ’C’ , ’G’ , ’T’ , ’A’]
And the join in a string
“TCGTA”

That’s it… right?

In most cases, yes, but today we are a bit more stubborn and we will try to find the best performance time for this job.
Why? Well even a small bacterial DNA can range in size anywhere from 130 kbp to over 14 Mbp ( a bp is a single letter/Nucleotide) so being fast could be important.

Ok, we have the rules, now let me present our players :

Player1:
We just saw the map implementation, let call map,

const map = (sequence) => {
    const map = {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
    return sequence.split('')
                   .reverse()
                   .map(bp => map[bp])
                   .join("")
}
Enter fullscreen mode Exit fullscreen mode

Player2: for loop and with if:

const forLoop = (sequence) => {
    let complement = ''
    for (let idx = 0; idx < sequence.length; idx++) {
        if (sequence[idx] === 'A') {
            complement = 'T' + complement
        } else if (sequence[idx] === 'T') {
            complement = 'A' + complement
        } else if (sequence[idx] === 'G') {
            complement = 'C' + complement
        } else if (sequence[idx] === 'C') {
            complement = 'G' + complement
        }
    }
    return complement
}
Enter fullscreen mode Exit fullscreen mode

Player3: A for with a switch case:

const forSwitch = (sequence) => {
    let complement = '';

    for (let idx = 0, sL = sequence.length; idx < sL; idx++) {
        switch (sequence[idx]) {
            case 'A':
                complement = 'T' + complement
                break;
            case 'T':
                complement = 'A' + complement
                break;
            case 'G':
                complement = 'C' + complement
                break;
            case 'C':
                complement = 'G' + complement
                break;
        }
    }
    return complement
}
Enter fullscreen mode Exit fullscreen mode

We will run these implementations ( and some small variations, github for more details ), 10000 times on a 35752 long DNA sequence and record the best time, the worst time, and the overall average time.

Ready go!

Performance Graph

This graph is not that easy to read, let me provide a table ordered by

Code Average Best Worst
For (optimized) with switch case 0.9446 0.4836 99258,00
For with multiple if 21564,00 0.5540 867263,00
For (optimized) with each if 11737,00 0.6480 98886,00
For with dictionary 15038,00 11097,00 83742,00
ForEach with dictionary 23381,00 17202,00 70510,00
Big Map with regular expression 29884,00 23477,00 103878,00
Map with dictionary 34595,00 26937,00 137978,00
Replace with dictionary 237074,00 51751,00 3951461,00

It looks like “replace with dictionary” is the worst in timing, and “optimized switch case” is the best.

Wrapping up,
In this implementation I can see that:

  • The regular expression and the dictionary are slower than if and switch case
  • For is the faster loop
  • switch case wins on if else if
  • The optimized of ‘for loop’ give some small improvements

Bonus, (what optimized for means):

Maybe you already noted the ‘switch case’ implementation. During my review of this topic, I fell on this website ( https://browserdiet.com/ ) and learned something interesting about the for loop that I didn’t know.

for ([initialization]; [condition]; [final-expression]){
        Statement
}
Enter fullscreen mode Exit fullscreen mode

Every time a ‘statement’ gets executed the condition block runs again.
This sounds clear, but also sequence.length will re-calculate each interaction, consuming more time, and this is bad!

And there is a simple solution,

We can instance a variable with the value of sequence.length
in the initialization block:

for (let idx = 0; idx < sequence.length; idx++) {
    // sequence.length is calculated every interaction
}

for (let idx = 0, sL = sequence.length; idx < sL; idx++) {
    // sequence.length is calculated only 1 time
}
Enter fullscreen mode Exit fullscreen mode

Thanks for taking the time to read this article and let me know any feedback, have a great day!

References:
https://en.wikipedia.org/wiki/Bacterial_genome
https://en.wikipedia.org/wiki/DNA
https://leanylabs.com/blog/js-forEach-map-reduce-vs-for-for_of/
DNA image from https://commons.wikimedia.org/wiki/File:0321_DNA_Macrostructure.jpg

💖 💪 🙅 🚩
linediconsine
Marco

Posted on December 5, 2021

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related