Dealing with Unicode strings, done right and better.
Hyeseong Kim
Posted on June 16, 2024
TL;DR
I created a library called "unicode-segmenter" to handle Unicode grapheme clusters with good performance and reasonable bundle size. Check it out!
This article is not just about promoting my library but also about the process of creating a high-quality library, including optimization and testing.
Updated: unicode-segmenter is now e18e recommendation!
Intro
Have you ever heard of the term "grapheme cluster"? If you're using non-English languages or supporting emoji in your service, you probably have.
One key lesson from Unicode is that strings often require more space than they appear to.
If we used UCS-4 (or UTF-32), it would be sufficient to represent every character ever invented, making it easy to count each character. However, this would waste a lot of space.
Therefore, most programming languages use more efficient encodings like UTF-8 or UTF-16. JavaScript uses UTF-16 to represent characters like "👋" as surrogate pairs (two characters). So in JavaScript, its .length
property is 2.
Some characters can be even longer by using special characters called "joiners".
- "🇰🇷" displays as 1, but has a
.length
of 4 - "👩👩👦👦" displays as 1, but has a
.length
of 11 - "अनुच्छेद" displays as 4, but has a
.length
of 8 - Some complex characters displays as 6, but has a
.length
of 75! 🤯
A unit internally expressed as multiple characters but logically treated as one character is called a "Grapheme Cluster."
Unicode® standardizes how to handle these grapheme clusters in UAX#29, the "Unicode Segmentation" rules.
The problem
Handling graphemes is crucial when creating custom inputs, text editors, or code analyzers that count characters.
There is the Web's Intl.Segmenter
API that supports the Unicode Segmentation standard. It's available not only in Web browsers like Chrome and Safari, but also in JS runtimes like Node.js, Deno, and Bun.
However, it's so new that I couldn't use it for my company's app, Karrot. For more context, Karrot is a massive app with over 18M monthly active users in South Korea, so supporting users on older versions is essential. Given the update frequency of mobile users, versions like Chrome 87 and Safari 14.5 are still not sufficient.
One day, my colleague was analyzing the bundle size of their production and reported an issue with our design system library as it seemed unexpectedly large.
It was the "graphemer" library inside, more than 95% of the size of the problematic input component, and is a large enough portion of the overall app that it's even visually noticeable.
Today, graphemer is the most popular way to handle grapheme clusters in JavaScript. (20M weekly downloads on NPM !!)
But graphemer is an old library and is no longer actively maintained. This may result in differences with the latest Unicode version (mostly in Hindi), and no ESM support.
Custom implementations for Unicode are not trivial in bundle size because they include Unicode data. Also, graphemer generates tons of if-else statements for performance reasons. (spoiler: it's not performant)
Well, isn't there a better alternative?
Can WebAssembly (and Rust) save us?
Probably YES!
I already knew that there was a good quality library unicode-segmentation in Rust, and Rust has a great WebAssembly toolchain called wasm-bindgen.
So I created a simple binding using wasm-bindgen.
use unicode_segmentation::UnicodeSegmentation;
use wasm_bindgen::prelude::*;
#[wasm_bindgen]
pub fn count(val: &str) -> usize {
val.graphemes(true).count()
}
#[wasm_bindgen]
pub fn collect(val: &str) -> Vec<String> {
val.graphemes(true).map(|s| s.to_string()).collect()
}
The above code generates a WASM binary with JavaScript binding. It was surprisingly easy.
After a few manual optimizations, I got a WASM binary in 52KB size. When combined with the bindings, it totals around 55KB. Also, WASM binaries compress very well, coming in at a total of 23KB when compressed by gzip!
This is already smaller than the parsed size of the graphemer library, and since WASM supports streaming compilation, it appeared to be a better option to me.
However, feedback from my colleagues wasn’t positive. The reason was that the size was still too large, and using WebAssembly was not yet feasible for us.
RIIJS (Rewrite It In JavaScript)
Rust people have a cultural idiom called RIIR (Rewrite It In Rust).
Although WebAssembly is known for its performance, its size can sometimes be unacceptable in mobile apps, or the performance might not meet expectations due to binding overhead.
A high-quality JavaScript library is still valuable, so I decided to do the opposite: Rewrite It in JavaScript.
First, I reviewed the Rust library’s license. It's dual-licensed under MIT and Apache 2.0, I chose MIT.
Then I began by modifying its code generator, which was written in Python, to produce JavaScript instead of Rust. Since I’m not familiar with Python code, this was the most time-consuming task.
By utilizing my knowledge of Rust and JavaScript, I ported the implementation of its Iterator
trait into JavaScript’s iteration protocol.
Since JavaScript doesn't have a pattern-matching like Rust, it could be hard to replicate the same logic. I used the ReScript compiler to maintain the original logic as much as possible. It made me able to port it confidently. Specifically check_pair
function can be converted into this.
After some test-fix loops, I got the first working version and the result was way better than expected. It was 3x smaller in size and 2.5x faster than the graphemer 😲
(It was the initial result and even better now!)
Reducing bundle size
The major win of the library comes from choosing the simplest structure to store the Unicode data table. This wins in both size and performance. Not only that but there are also various techniques and efforts to achieve an even smaller size.
Using JavaScript generator
JavaScript has a syntax for stateful logic known as "Generator". Generator is highly expressive, making them 40% smaller (1682 bytes into 1011 bytes) than an equivalent class with the same functionality!
You can see both class version and generator version in old commits.
Considering compression
JavaScript code is mostly transmitted with compression like gzip or brotli (zstd soon).
The minified graphemer library is 89.08 KB, but after gzip compression, it reduces to 13.15 KB. This is because the library size is significantly inflated by a bunch of if-else statements, and compression algorithms excel at such repetitive stuff.
unicode-segmenter aims to be small even before compression, but it also carefully considers the size after compression.
For instance, when dealing with character data like “\u{1F680}”
, a minifier like Terser tries to unescape it to "🚀"
. Unescaping can reduce code size since escape sequences require additional characters. However, in the case of data tables with a significant number of escaped characters, gzip compression is highly effective, making it better not to unescape them. The results are summarized in this pull request.
Afterward, It was switched to use int32 codepoint directly for performance reasons. And it compresses better.
Even further, I compressed the number strings to Base36 (0~9, a~z), which is easily reversible with toString
/parseInt
, and still benefits from additional compression from gzip.
initUnicodeRangeBuffer(
new Uint32Array(bufferSize),
',9,a,,b,1,d,,e,h,3j,w,4p,,4t,, ...', // encoded Unicode ranges
);
function initUnicodeRangeBuffer(buffer, value) {
let nums = value.split(',').map(s => s ? parseInt(s, 36) : 0);
for (let i = 0, n = 0; i < nums.length; i++)
buffer[i] = i % 2 ? n + nums[i] : (n = nums[i]);
return buffer;
};
This technique is inspired by Marijin's code, the author of CodeMirror
Enabling tree-shaking
JavaScript production apps use bundlers. These tools can exclude unused modules from the final bundle or completely remove unused code through static analysis.
The one goal of unicode-segmenter is to provide a complete polyfill for Intl.Segmenter
. However, if only grapheme segmentation is needed, it adopts a well-considered modular structure to ensure that unnecessary parts are not included.
The exposed APIs have independent entries based on their respective topics.
-
unicode-segmenter/grapheme
(5.3 KB): Provides text segmentation by extended grapheme cluster -
unicode-segmenter/intl-polyfill
(5.7 KB): ProvideIntl.Segmenter
(currently grapheme only) polyfill -
unicode-segmenter/emoji
(825 B): Provides emoji matchers -
unicode-segmenter/general
(4.3 KB): Provides alphanumeric matchers -
unicode-segmenter/utils
(353 B): Provide utilities for UTF-16 and Unicode code points.
Optimizing performance
Thanks to the unicode-rs' excellent implementation I referenced, it was already 2.5x better than graphemer in its first version. It's mostly a hand-written binary search in a compressed Unicode data table. It's simple but very efficient.
However, for even better results, it is important to fully consider the characteristics of the language.
Leverage the compiler
As mentioned above, the ReScript compiler was very helpful in the initial porting process.
I ended up rewriting it by hand for the best performance, but until then it allowed me to focus on other areas first, as it always ensures accuracy and sufficient efficiency.
Using the optimal data type
JavaScript engines also have a compilation process called JIT(Just-In-Time compilation). One well-known thing is that they are very efficient when dealing with 32-bit integers (aka SMI; small integers).
A Unicode data is a list of ranges of Unicode code points, which are 32-bit integers. Therefore, I was able to make all operations internally based on SMI.
Unlike Rust, where the compiler checks data types, in JavaScript, this responsibility falls entirely on the developer. I managed to do it with careful attention.
As a result, performance more than doubled in the second release.
Explicit bindings
When using generators or regular functions instead of classes, it is natural to implicitly capture external variables through "closures". However, implicit bindings can negatively impact optimization and garbage collection.
To avoid this, simply hoisting functions resulted in an additional 10% improvement. (See the change)
Avoiding copying and constructing structs
When dealing with more than one piece of data, there’s a desire to handle it as a structure.
However, JavaScript lacks efficient records/tuples, and both Object
and Array
always come with a cost. While this isn’t an issue for most applications, it can be critical in the micro-benchmarks of libraries.
It is also very important to avoid implicit copying and unnecessary GCs. Pass references directly and perform the copy explicitly where it really needed.
// retained `cache` reference
function cat(cp, cache) {
...
if (cp < cache[0] || cp > cache[1]) {
let result = searchGraphemeCategory(cp);
// update its values by explicit copying
cache[0] = result[0];
cache[1] = result[1];
cache[2] = result[2];
}
return cache[2];
}
};
Continuous benchmarking and efforts!
What helped getting faster mostly was the benchmark tool that was set up from the very beginning and the continuous measurements.
I use mitata (pretiously tinybench). If you care about performance, I highly recommend you do it.
However, poorly designed benchmarks can be as dangerous as having none. They might be completely invalidated by the compiler, too small to reflect real-world scenarios, or fail to identify scalability issues.
To avoid these pitfalls, it is crucial to design benchmarks to be as close to the real world as possible. unicode-segmenter gets help from ChatGPT to construct tests that resemble real code!
Benchmarks can yield very different results depending on the environment. Therefore, I run it not only in my local environment but also across various versions of Node.js and Bun, on various OS and CPU architectures, and I also configure them to run in browsers to verify performance in both Chrome and Safari. Through this process, I confirmed that the performance of Intl.Segmenter
has improved significantly in recent versions of Node.js, Chrome, and Safari. This is very exciting news!
But efforts are still worth it since not everyone is using the latest version or ideal environment :)
Testing strategy
Testing is crucial. Since Unicode involves characters that might be unfamiliar, extensive testing was necessary.
I set up initial test suites with Node.js test runner (It's great!) and manually wrote a few known cases, but it was never enough.
Property-based testing
I tried a more precise method called "Property-based testing".
fast-check is the most popular PBT tool in the JavaScript ecosystem. With fast-check, I can easily write test code that defines properties and verifies them based on automated inputs.
Then, how is the verification done? The idea is to check if the results match those of Intl.Segmenter
, which is already available at the Node.js runtime!
test('graphemeSegments', async t => {
await t.test('unicode string', () => {
// Intl.Segmenter is available on Node.js!
const intlSegmenter = new Intl.Segmenter();
fc.assert(
fc.property(fc.fullUnicodeString(), input => {
// assert through automated input
assertObjectContaining(
[...graphemeSegments(input)],
[...intlSegmenter.segment(input)],
);
}),
);
});
});
Through PBT, I found and fixed numerous bugs in the initial implementation. The first bug PBT identified was an empty string (""
). 😅 This is a surprisingly common mistake.
I completely switched my primary testing method to PBT and set up additional tests to debug the counterexamples found.
Generating test suites
Unicode provides official test data. I used the data to generate test suites and check for spec compliance.
During this process, I discovered that the Rust library I referenced had not yet implemented the “GB9c” rule, so I implemented it myself. (It seems they implemented it recently)
Production testing
I understand that despite all these efforts, it might still be insufficient. Reality can sometimes be more extreme than theory.
I tried to create migration PRs directly to major dependents of graphemer. I found a few issues related to implementation, and many others related to TypeScript configuration, sourcemap setup, etc.
By supporting both large and small-scale projects, I was able to update the library and gain confidence in its final quality.
Conclusion
As a result, my app achieved a smaller bundle size and better performance.
Updated: It's now 17.3 KB (stats), 12.8 KB (parsed), 5.3 KB (gzipped) in v0.11
I also learned that what it does is not that simple. This should be a temporary alternative to Intl.Segmenter
.
I recommend to use Intl.Segmenter
where possible. The unicode-segmenter library might become another graphemer after 10 years. Who knows!
But if you are in a special environment where Intl.Segmenter
is not available, try unicode-segmenter. It provides good performance in a reasonable size.
Especially if you build something on the React Native and the Hermes engine. unicode-segmenter supports it pretty well!
Also, there are more coming shortly
- It will soon provide a full
Intl.Segmenter
polyfill, including word/sentence support. (issue #25) - It will support more advanced patterns that
Intl.Segmenter
doesn't, like backward iteration. (issue #26)
Are you struggling with the quality of the library in another area? Consider taking some time to investigate. We still have many opportunities for further improvement in the JavaScript ecosystem.
Posted on June 16, 2024
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.