Michael Chaney
Posted on June 25, 2024
I've been writing about Unicode on Twitter over the last week, and specifically handling Unicode in Ruby. Ruby has robust Unicode support, along with robust support for the older code pages.
In my work in the music publishing industry I have to write code to process all manner of spreadsheets, typically in the form of CSV files. CSV files can be a crap shoot in terms of encoding. Thankfully, everything I've had to deal with up to now has been either Unicode, Latin-1 (ISO-8859-1), or its Windows-1252 variant.
I created a piece of code some years back to handle the issue of determining the encoding of a file and coercing the bytes into a standard Unicode format, specifically UTF-8.
module FixEncoding
  def FixEncoding.fix_encoding(str)
    # The "b" method returns a copied string with encoding ASCII-8BIT
    str = str.b
    # Strip UTF-8 BOM if it's at start of file
    if str =~ /\A\xEF\xBB\xBF/n
      str = str.gsub(/\A\xEF\xBB\xBF/n, '')
    end
    if str =~ /([\xc0-\xff][\x80-\xbf]{1,3})+/n
      # String has actual UTF-8 characters
      str.force_encoding('UTF-8')
    elsif str =~ /[\x80-\xff]/n
      # Get rid of Microsoft stupid quotes
      if str =~ /[\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94]/n
        str = str.tr("\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94".b, "''''''\"\"\"")
      end
      # There was no UTF-8, but there are high characters. Assume to
      # be Latin-1, and then convert to UTF-8
      str.force_encoding('ISO-8859-1').encode('UTF-8')
    else
      # No high characters, just mark as UTF-8
      str.force_encoding('UTF-8')
    end
  end
end
There it is in all its glory. I realized after looking at it that it's not in great shape. I'm going to refactor it and talk about my decisions.
There are a few things that stick out:
- I'm making extensive use of =~ instead of String#match. In Ruby, =~ causes a performance hit due to the fact that it sets various globals (a la Perl) after the match.
- I'm using regular expressions where I don't need to - specifically when checking for the BOM (Unicode byte order mark) at the start of the string. Some of these strings are many megabytes, so there can be performance gains.
- I realized that I'm using a regular expression to check for high (128 and above) characters. Ruby has String#ascii_only? to do that.
- The logic can be changed around to handle the faster cases first.
So, let's first talk about what this does.
# The "b" method returns a copied string with encoding ASCII-8BIT
str = str.b
I'm telling you right there - this gets a copy of the string with the encoding set to ASCII-8BIT. That's basically "no encoding", which is what we want. The string becomes a plain sequence of bytes, where the collation order is the byte value and there are 26 uppercase and 26 lowercase letters. This is essentially what Ruby used in version 1.8.
With the string in this encoding, we can look at individual bytes regardless of whether they're part of a UTF-8 set.
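To see what that looks like in practice, here's a quick irb sketch (the sample string is just an illustration):

s = "caf\u00e9"                            # UTF-8 string; the "é" takes two bytes
s.encoding                                 # => #<Encoding:UTF-8>
s.bytesize                                 # => 5
b = s.b
b.encoding                                 # => #<Encoding:ASCII-8BIT>
b.bytes.map { |c| format("%02X", c) }      # => ["63", "61", "66", "C3", "A9"]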
# Strip UTF-8 BOM if it's at start of file
if str =~ /\A\xEF\xBB\xBF/n
str = str.gsub(/\A\xEF\xBB\xBF/n, '')
end
(note that the "n" flag on the regular expression is a Rubyism that makes the regular expression have the ASCII-8BIT encoding)
The Unicode Byte Order Mark can optionally occur at the start of a file. Microsoft software adds these, and some other software has no idea what they are. If you understand UTF-8 encoding you can see that this BOM is really the character U+FEFF, which oddly is a zero-width non-breaking space.
The BOM isn't needed, and you'll notice that I do nothing but remove it. That's because I've received files that have a BOM at the start, and Latin-1 characters later on in the same file. There's no reason to "believe" the BOM.
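If you want to see the connection between U+FEFF and those three bytes for yourself, a one-liner does it:

"\uFEFF".bytes.map { |c| format("%02X", c) }   # => ["EF", "BB", "BF"]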
if str =~ /([\xc0-\xff][\x80-\xbf]{1,3})+/n
# String has actual UTF-8 characters
str.force_encoding('UTF-8')
Now we're getting to the meat of it. That regexp will find real UTF-8 characters in the binary byte stream. It'll really find the first one, but that's all I care about. I could make that regexp more precise, although it's of limited value to do so.
In this tweet I cover the format of a UTF-8 character in depth. Here are the basics, though. A UTF-8 character will take one of these forms:
0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
The first form is just ASCII. Any code point above 127 will always occupy two, three, or four bytes in UTF-8 and will take one of the forms shown above. The lead byte will always be in the ranges (in hex) C0-DF, E0-EF, or F0-F7, and the continuation bytes will always be in the range 80-BF. A lead byte that's not followed by a byte in the 80-BF range is invalid, and a byte in the 80-BF range that's not preceded by a lead byte or another 80-BF byte is also invalid.
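A few concrete examples make the pattern visible (the characters are arbitrary):

"é".bytes.map { |c| format("%08b", c) }    # => ["11000011", "10101001"]
"€".bytes.map { |c| format("%08b", c) }    # => ["11100010", "10000010", "10101100"]
"😀".bytes.map { |c| format("%08b", c) }   # => ["11110000", "10011111", "10011000", "10000000"]

You can see the 110, 1110, and 11110 lead bytes, each followed by 10xxxxxx continuation bytes.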
In the referenced tweet I include a large regexp that will determine if a given string is fully valid as UTF-8:
str =~ /\A(?:\xef\xbb\xbf)?
(?:
(?:[\x00-\x7f]) |
(?:[\xc0-\xdf][\x80-\xbf]) |
(?:[\xe0-\xef][\x80-\xbf]{2}) |
(?:[\xf0-\xf7][\x80-\xbf]{3})
)*
\z/nx
That's a beauty, but it's overkill for what I'm doing here. I'll assume that if there's a single valid UTF-8 character then the string is UTF-8. I'm willing to take that risk.
The only change that I see is that the first byte should match [\xc0-\xf7] instead of [\xc0-\xff] (note the final "7" in the former). I can also use "match" here to speed it up.
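To illustrate the trade-off, here's a sketch with contrived bytes - one valid UTF-8 "é" followed by a stray Latin-1 "é":

mixed = "caf\xC3\xA9 \xE9clair".b

# Single-character check: finds the valid pair and calls the string UTF-8
mixed.match(/[\xc0-\xf7][\x80-\xbf]/n)   # => MatchData (truthy)

# Full validation (BOM prefix omitted): the stray \xE9 fails, so this isn't clean UTF-8
mixed.match(/\A(?:
  (?:[\x00-\x7f]) |
  (?:[\xc0-\xdf][\x80-\xbf]) |
  (?:[\xe0-\xef][\x80-\xbf]{2}) |
  (?:[\xf0-\xf7][\x80-\xbf]{3})
)*\z/nx)                                 # => nil

That's exactly the risk the single-character check accepts.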
elsif str =~ /[\x80-\xff]/n
  # Get rid of Microsoft stupid quotes
  if str =~ /[\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94]/n
    str = str.tr("\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94".b, "''''''\"\"\"")
  end
  # There was no UTF-8, but there are high characters. Assume to
  # be Latin-1, and then convert to UTF-8
  str.force_encoding('ISO-8859-1').encode('UTF-8')
Okay, lots going on here. We first check if the string has any "high characters", defined as a character code greater than 127. Put another way - the high bit is set. Standard ASCII goes from 0 to 127. When I was a kid the high bit was often used as a parity bit, which isn't needed now - at least for most applications. I'm sure someone's still using a parity bit.
We've already ruled out this being a UTF-8 string, so if there are high characters we're in either Latin-1 or its inbred cousin Windows-1252.
In Latin-1 the character codes 80-9F were reserved as extended "control codes", kind of mirroring the control code concept in the first 32 ASCII characters. I'm not sure they were ever used as such, and interestingly the character table in Wikipedia simply shows them as "undefined".
Microsoft and Apple both had an idea of what to put in that range, and this caused calamity 20+ years ago as a text file that looked great on Windows or Mac would be full of weird question marks when viewed elsewhere.
Microsoft referred to this "feature" as "smart quotes", so we usually referred to them as "stupid quotes" (as a side note, if you think I'm the only one who refers to them as such, GitHub Copilot knew what to do when I created the "has_stupid_quotes?" method). There are a few other characters in there as well which don't have Latin-1 equivalents, including the Euro sign "€".
Anyway, the next chunk replaces the fancy quote characters with the standard ASCII equivalents.
One possible change to this piece of code would be to force the encoding to Windows-1252 and then transcode to UTF-8, which would preserve the fancy quotation marks and apostrophes. I don't do that simply because I prefer to standardize quote marks to the ASCII versions. In other contexts, that standardization might be the wrong choice.
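For illustration, here's what each choice does to a made-up Windows-1252 snippet:

bytes = "It\x92s \x93quoted\x94".b   # Windows-1252 curly quotes

# Preserve the typographic quotes by transcoding from Windows-1252
bytes.encode("UTF-8", "Windows-1252")
# => It's "quoted" (curly quotes kept)

# Or flatten them to plain ASCII first, which is what I do
bytes.tr("\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94".b, "''''''\"\"\"")
# => It's "quoted" with straight quotes (still a binary string at this point)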
Here, I use a regular expression to find them and, if found, use the "tr" method to replace them.
str = str.tr("\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94".b, "''''''\"\"\"")
Finally, I force the string to Latin-1 encoding, then transcode to UTF-8:
str.force_encoding('ISO-8859-1').encode('UTF-8')
A better way to do this would be to check for the presence of characters in the 80-9F range and use Windows-1252 instead of ISO-8859-1 (aka "Latin-1"). That would pick up Euro signs and such.
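The Euro sign is the easiest way to see the difference:

"\x80".b.encode("UTF-8", "Windows-1252")   # => "€"
"\x80".b.encode("UTF-8", "ISO-8859-1")     # => "\u0080" (a C1 control character, not a Euro sign)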
In the last part, there were no high characters at all, so we force the encoding to be UTF-8. A regular ASCII string is also a standard UTF-8 or Latin-1 string as well.
else
# No high characters, just mark as UTF-8
str.force_encoding('UTF-8')
So, let's turn this on its head. First, we need to start out with the check for high characters, and there's good reason for that. Ruby has a built-in method "String#ascii_only?". I'm going to give some high praise here - this method is written how I would write it. Go ahead, have a look:
https://github.com/ruby/ruby/blob/bed34b3a52afde6d98fcef19e199d0af293577be/string.c#L618
That's actually the opposite - "search_nonascii" - but that's ultimately what "ascii_only?" uses. Why do I like it? It is as fast as the CPU can perform this check. Instead of looking byte by byte, it looks at entire 32 or 64-bit words and checks whether any high bit is set.
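If you want to convince yourself, a rough benchmark sketch looks like this (the sample string and size are arbitrary, and the numbers will vary by machine):

require 'benchmark'

str = ("plain ascii text " * 1_000_000).b   # roughly 17 MB of pure ASCII

puts Benchmark.realtime { str.ascii_only? }            # word-at-a-time scan
puts Benchmark.realtime { str.match(/[\x80-\xff]/n) }  # byte-by-byte regexp scan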
So, that's way preferable to using a regular expression. Better yet, if the string has no high characters there's no reason to even continue with the rest of this.
module FixEncoding
  def FixEncoding.fix_encoding(str)
    if str.ascii_only?
      return str.force_encoding('UTF-8')
    else
      str = str.b
      # Rest of code
    end
  end
end
Putting that check first will short-circuit the rest of our checks if there are no high characters. And since that's the fastest check we have, it should speed this up dramatically in that case. Note that this also precludes the string copy, so there's an even bigger win.
Next, we need to strip the BOM. This can be done without a regular expression to speed it up. Here's the old way again:
if str =~ /\A\xEF\xBB\xBF/n
str = str.gsub(/\A\xEF\xBB\xBF/n, '')
end
We can use String#byteslice in both places to make this faster:
def FixEncoding.remove_bom(str)
  if str.byteslice(0..2) == "\xEF\xBB\xBF".b
    return str.byteslice(3..-1)
  else
    return str
  end
end
So, this is very different. First, we take the first 3 bytes with byteslice and compare them to the BOM (also as a binary string). If they match, we replace str with all but the first three bytes of str. Both parts of this are much faster than the original, and String#byteslice is the fastest way to handle it.
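A quick sanity check with a made-up header line:

FixEncoding.remove_bom("\xEF\xBB\xBFName,Title\n".b)   # => "Name,Title\n"
FixEncoding.remove_bom("Name,Title\n".b)               # => "Name,Title\n" (unchanged)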
Next, we check for UTF-8 characters:
def FixEncoding.has_utf8?(str)
  str.match(/[\xc0-\xf7][\x80-\xbf]/n)
end
This is mostly the same, but I've simplified the regexp by removing the extraneous capture and repetition.
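For example, with contrived bytes again:

FixEncoding.has_utf8?("Zo\xC3\xAB".b)   # => MatchData (UTF-8 "ë" found)
FixEncoding.has_utf8?("Zo\xEB".b)       # => nil (Latin-1 bytes, no lead/continuation pair)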
Next, we can check for stupid quotes:
def FixEncoding.has_stupid_quotes?(str)
  str.match(/[\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94]/n)
end
and replace them if we find them:
def FixEncoding.replace_stupid_quotes(str)
  str.tr("\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94".b, "''''''\"\"\"")
end
This remains unchanged, save for moving it to its own function.
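Used together they look like this (sketch data):

s = "It\x92s fine".b
s = FixEncoding.replace_stupid_quotes(s) if FixEncoding.has_stupid_quotes?(s)
s   # => "It's fine" - still binary, but now pure ASCII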
The final piece of the puzzle is to determine what the likely encoding is, force it to that encoding, then transcode to UTF-8.
def FixEncoding.has_win1252?(str)
  str.match(/[\x80-\x9f]/n)
end

def FixEncoding.likely_8bit_encoding(str)
  if str.ascii_only?
    "ASCII-8BIT"
  elsif has_win1252?(str)
    "WINDOWS-1252"
  else
    "ISO-8859-1"
  end
end
Note that we again do the "ascii_only?" check. Why? I've replaced the high quote marks with standard ASCII equivalents, so we may well have an ASCII string again. That's faster than the regular expression check, so we're again looking for a short-circuit.
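A few examples of what it decides (sketch data):

FixEncoding.likely_8bit_encoding("plain text".b)       # => "ASCII-8BIT"
FixEncoding.likely_8bit_encoding("price \x80 100".b)   # => "WINDOWS-1252" (0x80 falls in 80-9F)
FixEncoding.likely_8bit_encoding("na\xEFve".b)         # => "ISO-8859-1" (0xEF is above 9F)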
With that, we can write our final helper:
def FixEncoding.transcode_to_utf8(str)
  str.encode("UTF-8", likely_8bit_encoding(str))
end
Note that using the encode method like that is the equivalent of using force_encoding with the second argument followed by encode with the first argument:
str.force_encoding(likely_8bit_encoding(str)).encode("UTF-8")
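A tiny demonstration of the equivalence:

bytes = "caf\xE9".b   # Latin-1 "café"
bytes.encode("UTF-8", "ISO-8859-1")                  # => "café"
bytes.force_encoding("ISO-8859-1").encode("UTF-8")   # => "café"

One practical difference: force_encoding changes the receiver's encoding tag in place, while the two-argument encode leaves the original string alone.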
At this point, our fix_encoding function is simpler, fully testable, and pretty much all acting at the same semantic level:
def FixEncoding.fix_encoding(str)
  if str.ascii_only?
    return str.force_encoding('UTF-8')
  else
    str = str.b
    str = remove_bom(str)
    if has_utf8?(str)
      return str.force_encoding('UTF-8')
    else
      if has_stupid_quotes?(str)
        str = replace_stupid_quotes(str)
      end
      return transcode_to_utf8(str)
    end
  end
end
The entire thing is longer now, but in reality there's no more code than before, and what code there is will run faster. While I don't normally worry too much about the speed of Ruby code, this is often used in processing multi-megabyte files where any speed improvement is appreciated.
The complete code is available here:
https://gist.github.com/mdchaney/e2b05eafab81cbdc4dfed6dd2f8e69a6
That's not tested, though. Next time, I'll create some tests and find out how I did.