Lua strings are not strings

samyeyo

Sam

Posted on February 21, 2021

Yes, you read that right. It could have been worthy of a Shakespeare play, but strings in Lua can turn into a nightmare if you don't pay attention to the true definition of what a string is...

"Wait, what do you mean by strings are not strings?"

In Lua, strings could have been called buffers, arrays, or even containers. Indeed, strings are only containers. But they do not contain characters. In fact, Lua has no idea what a character is.

"So what does a string contain?"

They simply contain bytes. In Lua, strings can therefore hold lots of things: an image, a digitized sound, a database... and characters too.
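To make the "container of bytes" idea concrete, here is a minimal sketch. The variable name blob and its content are made up for illustration; the point is only that a Lua string happily stores any byte values, including a zero byte.

```lua
-- A string built from arbitrary bytes (hypothetical data, not a real file).
-- "\0", "\255" and "\137" are decimal byte escapes.
local blob = "\0\255\137PNG"

-- The length operator counts bytes, not characters.
print(#blob)                    --> 6

-- string.byte returns the numeric value of each requested byte.
print(string.byte(blob, 1, 3))  --> 0  255  137
```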

This is where things get complicated: Lua's string functions assume that a single byte corresponds to a single character. This is fine as long as you are using a single-byte encoding such as ASCII, but a byte can take only 256 distinct values (and ASCII itself defines just 128 characters).

But in the age of the Internet, when the whole world communicates in all languages, that seems rather restrictive! Fortunately, encodings other than ASCII exist to extend the number of usable characters: UTF-8, UCS-2 LE, UCS-2 BE...
They encode a character over several bytes.
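As a rough illustration of "several bytes per character", the sketch below encodes a few code points as UTF-8 and prints how many bytes each one takes. It assumes Lua 5.3 or later, where utf8 is a standard library; the samples table is just an illustrative choice of code points.

```lua
-- Assumption: Lua 5.3+ (the utf8 library does not exist in 5.1/5.2).
local samples = {
  { label = "U+0074 (t)",         cp = 0x74   },
  { label = "U+00E9 (e acute)",   cp = 0xE9   },
  { label = "U+20AC (euro sign)", cp = 0x20AC },
}

for _, sample in ipairs(samples) do
  -- utf8.char turns a code point into its UTF-8 byte sequence.
  local encoded = utf8.char(sample.cp)
  print(sample.label, #encoded .. " byte(s)")
end
--> U+0074 (t)          1 byte(s)
--> U+00E9 (e acute)    2 byte(s)
--> U+20AC (euro sign)  3 byte(s)
```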

Multibyte characters

Multibyte characters can be stored in Lua strings after all, since strings in Lua contain bytes.
Yes, that's right. But that does not mean you can use them!

One rule to rule them all

All Lua string functionality (concatenation, length calculation, string.find, string.gmatch, string.sub...) operates on bytes, as if every character were a single byte: the same rule again!
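Here is a small sketch of what the rule means in practice. The string is written with byte escapes so the result does not depend on how the script file is saved, and utf8.len (Lua 5.3 or later assumed) is used only as a validity check.

```lua
-- "été" in UTF-8, spelled out byte by byte.
local s = "\195\169t\195\169"

-- string.sub counts bytes: asking for "the first character" cuts "é" in half.
local broken = string.sub(s, 1, 1)
print(#broken)                 --> 1
print(utf8.len(broken))        -- falsy result: no longer valid UTF-8

-- string.reverse reverses bytes, which also scrambles multibyte sequences.
local reversed = string.reverse(s)
print(utf8.len(reversed))      -- falsy result again
```

Both results are perfectly legal Lua strings; they just no longer decode as UTF-8 text.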

Here is an example that illustrates the problem (the script must be saved with UTF-8 encoding, with or without a BOM):

local summer_infrench = "été"

-- outputs 5 !?
print(string.len(summer_infrench))

-- pos = 3 !?
local pos = string.find(summer_infrench, "t")

What's going on?

Remember the rule: strings are treated as containers of bytes. The UTF-8 string "été" (which means 'summer' in French) is 3 characters long, but occupies 5 bytes in memory:

é t é = 3 characters
0xC3 0xA9 0x74 0xC3 0xA9 = 5 bytes

That's why string.len returns 5 and not 3.
The same goes for string.find: the "t" character sits at byte position 3, even though it is only the 2nd character.
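You can check the byte layout yourself with a short sketch like the following. It writes the string with byte escapes so it behaves the same whatever the file encoding, and the variable names are only illustrative.

```lua
-- "été" encoded as UTF-8: 0xC3 0xA9 for each "é", and a single byte for "t".
local summer = "\195\169t\195\169"

-- Collect every byte of the string as a hexadecimal literal.
local hex = {}
for i = 1, #summer do
  hex[#hex + 1] = string.format("0x%02X", string.byte(summer, i))
end

print(#summer)                 --> 5
print(table.concat(hex, " "))  --> 0xC3 0xA9 0x74 0xC3 0xA9
```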

Is there a workaround?

Fortunately, yes. Since Lua 5.3, a standard "utf8" module is available to help developers with UTF-8 encoded strings. But it complicates the use of UTF-8 strings, because it relies on its own set of functions, a kind of overlay on top of strings. Not very friendly: in many other modern programming languages, strings are containers of characters and natively support multibyte encodings.

Here is the previous example using the "utf8" module:

local utf8 = require "utf8"

local summer_infrench = "été"

-- yes ! outputs 3 !
print(utf8.len(summer_infrench))

-- Still pos = 3: string.find works on bytes, and the utf8 module has no direct equivalent
local pos = string.find(summer_infrench, "t")
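That said, a byte position can be converted into a character position after the fact. The sketch below is one possible approach, not an official recipe: it assumes Lua 5.3 or later and a valid UTF-8 string, and relies on utf8.len accepting a byte range (it counts the characters that start between the two byte positions).

```lua
local summer = "\195\169t\195\169"  -- "été" as UTF-8 byte escapes

-- Plain (non-pattern) search; the result is a byte offset.
local byte_pos = string.find(summer, "t", 1, true)

-- Count the characters that start between byte 1 and byte_pos.
local char_pos = utf8.len(summer, 1, byte_pos)

print(byte_pos, char_pos)  --> 3  2
```

It works, but it is exactly the kind of extra step that a character-aware string type would spare you.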

What if I want to use UTF-8 strings with Lua?

It's part of the Lua philosophy: if Lua lacks something, implement it using binary modules or Lua modules. Search the net and you will find some of them. But again, for such basic functionality, this adds a certain degree of complication, especially for beginners.
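To give an idea of what such a Lua module has to do, here is a minimal sketch of a character-aware substring function. The name utf8_sub is hypothetical (it is not part of any standard library), it assumes Lua 5.3 or later, and it only handles positive indices on valid UTF-8 input.

```lua
-- Hypothetical helper: substring by character index instead of byte index.
local function utf8_sub(s, i, j)
  local len = utf8.len(s)
  if not len then error("invalid UTF-8 string") end
  j = math.min(j or len, len)
  if i < 1 then i = 1 end
  if i > j then return "" end
  local first = utf8.offset(s, i)         -- byte where character i starts
  local last = utf8.offset(s, j + 1) - 1  -- last byte of character j
  return string.sub(s, first, last)
end

local summer = "\195\169t\195\169"        -- "été" as UTF-8 byte escapes
print(#utf8_sub(summer, 1, 2))            --> 3 (bytes of "ét")
print(utf8_sub(summer, 2, 2))             --> t
```

A real module would also need character-aware versions of find, reverse, upper and so on, which is where the complication comes from.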

Conclusion

I hope this article has helped you better understand how strings work in Lua. This is also the main reason why I decided to implement UTF-8 strings natively in my LuaRT project.
