Quick and easy way of counting UTF-8 characters in Javascript

coolgoose

Alexandru Bucur

Posted on June 8, 2018

Reading the following tutorial regarding a VueJS component that displays the character count for a textarea got me thinking.

You see, the problem is that when Javascript was first created it didn't have proper support for the full Unicode range. Javascript's internal encoding is UCS-2 or UTF-16, depending on which articles you find on the internet (there's actually an awesome article from 2012 that explains this in detail).

What does that mean, you ask? Well, it's rather straightforward: if you read the length property of a string containing characters that need 4 bytes in UTF-8 (the ones that become surrogate pairs in UTF-16), length counts 2 for each of those characters. Characters that need 3 bytes or fewer in UTF-8 fit in a single UTF-16 code unit and still count as 1.

This might not usually be an issue, but it's a big issue if you have a password policy of 8 characters that can be satisfied by just 4 emoji, "😹🐢😹🐢" (ok, not the best example, but everybody likes cats and turtles).

let lengthTest = "😹🐢😹🐢";
console.log(lengthTest.length);
// will display 8
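To see where the extra count comes from, here is a small sketch that looks at a single emoji from the example. The variable names are only for illustration; charCodeAt, codePointAt and length are the standard string APIs.

```javascript
// Sketch: inspect how one emoji is stored in a Javascript string.
const cat = "😹"; // U+1F639, outside the Basic Multilingual Plane

// length counts UTF-16 code units, so the surrogate pair counts twice.
const unitCount = cat.length;

// The two code units are a high surrogate followed by a low surrogate.
const codeUnits = [cat.charCodeAt(0), cat.charCodeAt(1)].map((unit) =>
  unit.toString(16)
);

// codePointAt reads the pair back as a single code point.
const codePoint = cat.codePointAt(0).toString(16);

// The euro sign needs 3 bytes in UTF-8 but still fits in one code unit.
const euroCount = "€".length;

console.log(unitCount, codeUnits, codePoint, euroCount);
```

The emoji reports a length of 2 because it is stored as the pair d83d / de39, while the euro sign reports 1.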

Now the fix with modern Javascript is rather easy: the string iterator walks a string by code point instead of by UTF-16 code unit, so spreading the string into an array with the spread syntax makes it a quick and easy one liner.

let lengthTest = "😹🐢😹🐢";
console.log([...lengthTest].length);
// will display 4
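If you want to reuse this for something like the password policy above, a minimal sketch could look like the following. Both countCodePoints and meetsMinLength are hypothetical helper names made up for this example, and the default minimum of 8 is just the number from the policy mentioned earlier.

```javascript
// Hypothetical helper: counts Unicode code points instead of UTF-16 code units.
function countCodePoints(text) {
  let count = 0;
  // for...of uses the string iterator, which yields one code point per step.
  for (const _codePoint of text) {
    count += 1;
  }
  return count;
}

// Hypothetical password rule built on the helper above.
function meetsMinLength(password, minLength = 8) {
  return countCodePoints(password) >= minLength;
}

console.log(countCodePoints("😹🐢😹🐢")); // 4
console.log(meetsMinLength("😹🐢😹🐢")); // false
console.log(meetsMinLength("😹🐢😹🐢abcd")); // true
```

Keep in mind that this counts code points, not what a user sees as one character: a letter followed by a separate combining accent still counts as two.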

I'm interested in knowing if you've had any weird/interesting experiences with UTF-8.

PS: Use this link for a nice simple-ish explanation of Unicode encodings
