Refactoring fix_encoding

mdchaney

Michael Chaney

Posted on June 25, 2024

I've been writing about Unicode on Twitter over the last week, and specifically handling Unicode in Ruby. Ruby has robust Unicode support, along with robust support for the older code pages.

In my work in the music publishing industry I have to write code to process all manner of spreadsheets, typically in the form of CSV files. CSV files can be a crapshoot in terms of encoding. Thankfully, everything I've had to deal with up to now has been either Unicode, Latin-1 (ISO-8859-1), or its Windows-1252 variant.

I created a piece of code some years back to handle the issue of determining the encoding of a file and coercing the bytes into a standard Unicode format, specifically UTF-8.

module FixEncoding
  def FixEncoding.fix_encoding(str)
    # The "b" method returns a copied string with encoding ASCII-8BIT
    str = str.b
    # Strip UTF-8 BOM if it's at start of file
    if str =~ /\A\xEF\xBB\xBF/n
      str = str.gsub(/\A\xEF\xBB\xBF/n, '')
    end
    if str =~ /([\xc0-\xff][\x80-\xbf]{1,3})+/n
      # String has actual UTF-8 characters
      str.force_encoding('UTF-8')
    elsif str =~ /[\x80-\xff]/n
      # Get rid of Microsoft stupid quotes
      if str =~ /[\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94]/n
        str = str.tr("\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94".b, "''''''\"\"\"")
      end
      # There was no UTF-8, but there are high characters.  Assume to
      # be Latin-1, and then convert to UTF-8
      str.force_encoding('ISO-8859-1').encode('UTF-8')
    else
      # No high characters, just mark as UTF-8
      str.force_encoding('UTF-8')
    end
  end
end

There it is in all its glory. I realized after looking at it that it's not in great shape. I'm going to refactor it and talk about my decisions.

There are a few things that stick out:

  1. I'm making extensive use of =~ instead of using String#match?. In Ruby, =~ causes a performance hit because it sets various globals (a la Perl) after the match. String#match (without the question mark) sets them too, while String#match? only returns true or false.
  2. I'm using regular expressions where I don't need to - specifically when checking for the BOM (Unicode byte order mark) at the start of the string. Some of these strings are many megabytes, so there can be performance gains.
  3. I realized that I'm using a regular expression to check for high (128 and above) characters. Ruby has String#ascii_only? to do that.
  4. The logic can be changed around to handle the faster cases first.
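
To make the first point concrete, here's a small sketch of the difference. The two helper names are made up for illustration. Ruby keeps $~ local to each method call, so each helper starts with it unset:

```ruby
# Hypothetical helpers, only here to show which call populates $~.
def last_match_after_operator(str)
  str =~ /b/        # on success this sets $~ (and friends such as $1)
  $~
end

def last_match_after_predicate(str)
  str.match?(/b/)   # returns true/false and leaves $~ alone
  $~
end

p last_match_after_operator("abc").is_a?(MatchData)  # true
p last_match_after_predicate("abc").nil?             # true
```

How much time this saves depends on the Ruby version and the size of the input, so it's worth measuring on real files.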

So, let's first talk about what this does.

    # The "b" method returns a copied string with encoding ASCII-8BIT
    str = str.b

I'm telling you right there - this gets a copy of the string with the encoding set to ASCII-8BIT. That's basically "no encoding", which is what we want. The string is mostly a string of boring bytes, where the collation order is the character code and there are 26 upper and lowercase letters. This is what Ruby essentially used in version 1.8.

With the string in this encoding, we can look at individual bytes regardless of whether they're part of a UTF-8 set.
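
A quick illustration of what that copy looks like, assuming a UTF-8 input string (the variable names are just for the example):

```ruby
text = "caf\u00E9"   # UTF-8: the last character is the two bytes C3 A9
raw  = text.b        # a copy tagged ASCII-8BIT; text itself is untouched

p text.encoding == Encoding::UTF_8       # true
p raw.encoding == Encoding::ASCII_8BIT   # true
p text.length                            # 4 (characters)
p raw.length                             # 5 (bytes)
p raw.bytes.last(2)                      # [195, 169]
```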

    # Strip UTF-8 BOM if it's at start of file
    if str =~ /\A\xEF\xBB\xBF/n
      str = str.gsub(/\A\xEF\xBB\xBF/n, '')
    end

(note that the "n" flag on the regular expression is a Rubyism that makes the regular expression have the ASCII-8BIT encoding)

The Unicode Byte Order Mark can optionally occur at the start of a file. Microsoft software adds these, and some other software has no idea what they are. If you understand UTF-8 encoding you can see that this BOM is really character FEFF, which oddly is a zero-width non-breaking space.

The BOM isn't needed, and you'll notice that I do nothing but remove it. That's because I've received files that have a BOM at the start, and Latin-1 characters later on in the same file. There's no reason to "believe" the BOM.
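
If you want to convince yourself that those three bytes really are U+FEFF, a short check along these lines should do it:

```ruby
bom = "\uFEFF"   # U+FEFF written as a UTF-8 string

p bom.bytes.map { |byte| format("%02X", byte) }   # ["EF", "BB", "BF"]
p bom.b == "\xEF\xBB\xBF".b                       # true
```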

    if str =~ /([\xc0-\xff][\x80-\xbf]{1,3})+/n
      # String has actual UTF-8 characters
      str.force_encoding('UTF-8')

Now we're getting to the meat of it. That regexp will find real UTF-8 characters in the binary byte stream. It'll really find the first one, but that's all I care about. I could make that regexp more precise, although it's of limited value to do so.

In this tweet I cover the format of a UTF-8 character in depth. Here are the basics, though. A UTF-8 character will be of one of these forms:

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The first form is just ASCII. Any ordinal value above 127 will always occupy two, three, or four bytes in UTF-8 and will be one of the forms shown above. The first byte will always be in the ranges (in hex) C0-DF, E0-EF, or F0-F7. The subsequent bytes will always be in the range 80-BF. Having a byte in the first range that's not followed by a byte in the 80-BF range would be invalid, and having a byte in the 80-BF range that's not preceded by one of the first bytes or another 80-BF is also not valid.
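
The bit patterns above are easy to see from Ruby itself. Here's a small sketch, where utf8_bits is a hypothetical helper name:

```ruby
# Return each byte of a character's UTF-8 encoding as an 8-digit bit string.
def utf8_bits(char)
  char.encode("UTF-8").bytes.map { |byte| byte.to_s(2).rjust(8, "0") }
end

p utf8_bits("A")          # ["01000001"]
p utf8_bits("\u00E9")     # ["11000011", "10101001"]
p utf8_bits("\u20AC")     # ["11100010", "10000010", "10101100"]
p utf8_bits("\u{1F496}")  # ["11110000", "10011111", "10010010", "10010110"]
```

The leading bits of the first byte (0, 110, 1110, 11110) give the length, and every following byte starts with 10.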

In the referenced tweet I include a large regexp that will determine if a given string is fully valid as UTF-8:

str =~ /\A(?:\xef\xbb\xbf)?
  (?:
  (?:[\x00-\x7f]) |
  (?:[\xc0-\xdf][\x80-\xbf]) |
  (?:[\xe0-\xef][\x80-\xbf]{2}) |
  (?:[\xf0-\xf7][\x80-\xbf]{3})
  )*
\z/nx

That's a beauty, but it's overkill for what I'm doing here. I'll assume that if there's a single valid UTF-8 character then the string is UTF-8. I'm willing to take that risk.
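
For comparison, Ruby can do a full validity check without a hand-written regexp by way of String#valid_encoding?. This is only a sketch of that alternative, not what fix_encoding does, and strictly_valid_utf8? is a made-up name. Ruby's check is stricter than the regexp above: it also rejects overlong forms such as C0 80.

```ruby
# Tag a copy of the bytes as UTF-8 and ask Ruby whether they are well formed.
def strictly_valid_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

p strictly_valid_utf8?("caf\xC3\xA9".b)  # true
p strictly_valid_utf8?("caf\xE9".b)      # false: a lone Latin-1 byte
p strictly_valid_utf8?("\xC0\x80".b)     # false: overlong encoding
```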

The only change that I see is that the first byte should match [\xc0-\xf7] instead of [\xc0-\xff] (note the final "7" in the former). I can also use "match?" here to speed it up.

    elsif str =~ /[\x80-\xff]/n
      # Get rid of Microsoft stupid quotes
      if str =~ /[\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94]/n
        str = str.tr("\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94".b, "''''''\"\"\"")
      end
      # There was no UTF-8, but there are high characters.  Assume to
      # be Latin-1, and then convert to UTF-8
      str.force_encoding('ISO-8859-1').encode('UTF-8')

Okay, lots going on here. We first check if the string has any "high characters", defined as a character code greater than 127. Put another way - the high bit is set. Standard ASCII goes from 0 to 127. When I was a kid the high bit was often used as a parity bit, which isn't needed now. For most applications. I'm sure someone's still using a parity bit.

We've already ruled out this being a UTF-8 string, so if there are high characters we're in either Latin-1 or its inbred cousin Windows-1252.

In Latin-1 the character codes 80-9F were reserved as extended "control codes", kind of mirroring the control code concept in the first 32 ASCII characters. I'm not sure they were ever used as such, and interestingly the character table in Wikipedia simply shows them as "undefined".

Microsoft and Apple both had an idea of what to put in that range, and this caused calamity 20+ years ago as a text file that looked great on Windows or Mac would be full of weird question marks when viewed elsewhere.

Microsoft referred to this "feature" as "smart quotes", so we usually referred to them as "stupid quotes" (as a side note, if you think I'm the only one who refers to them as such, GitHub Copilot knew what to do when I created the "has_stupid_quotes?" method). There are a few other characters in there as well which don't have Latin-1 equivalents, including the Euro sign "€".

Anyway, the next chunk replaces the fancy quote characters with the standard ASCII equivalents.

One possible change to this piece of code would be to force the encoding to Windows-1252 and then transcode to UTF-8, which would preserve the fancy quotation marks and apostrophes. I don't do that simply because I prefer to standardize quote marks to the ASCII versions. In some other contexts that might be a less preferred choice.

Here, I use a regular expression to find them and, if found, use the "tr" method to replace them.

        str = str.tr("\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94".b, "''''''\"\"\"")
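
Here's that tr call on a tiny sample so the mapping is visible. The sample string is made up. The first six bytes in the list map to an apostrophe and the last three to a double quote:

```ruby
fancy    = "\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94".b  # bytes to replace
straight = "''''''\"\"\""                            # six ' then three "

raw     = "\x93It\x92s fine\x94".b   # curly quotes as single bytes
cleaned = raw.tr(fancy, straight)

p cleaned              # "\"It's fine\""
p cleaned.ascii_only?  # true
```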

Finally, I force the string to Latin-1 encoding, then transcode to UTF-8:

      str.force_encoding('ISO-8859-1').encode('UTF-8')

A better way to do this would be to check for the presence of characters in the 80-9F range and use Windows-1252 instead of ISO-8859-1 (aka "Latin-1"). That would pick up Euro signs and such.

In the last part, there were no high characters at all, so we force the encoding to be UTF-8. A regular ASCII string is also a valid UTF-8 or Latin-1 string.

    else
      # No high characters, just mark as UTF-8
      str.force_encoding('UTF-8')

So, let's turn this on its head. First, we need to start out with the check for high characters, and there's good reason for that. Ruby has a built-in method "String#ascii_only?". I'm going to give some high praise here - this method is written how I would write it. Go ahead, have a look:

https://github.com/ruby/ruby/blob/bed34b3a52afde6d98fcef19e199d0af293577be/string.c#L618

That's actually the opposite - "search_nonascii", but that's ultimately what "ascii_only?" uses. Why do I like it? It is as fast as the CPU can perform this check. It looks at each word and sees if any of the high bits are set. Instead of looking byte by byte, it looks at entire 32 or 64-bit words.

So, that's way preferable to using a regular expression. Better yet, if the string has no high characters there's no reason to even continue with the rest of this.
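
To illustrate the word-at-a-time idea (and only the idea), here's a rough Ruby sketch. It is not how you'd do this in practice, since the built-in ascii_only? does it in C and will be far faster. The name ascii_only_by_words? is made up:

```ruby
def ascii_only_by_words?(str)
  high_bits = 0x8080808080808080   # top bit of each byte in a 64-bit word
  bytes = str.b
  whole = bytes.bytesize - (bytes.bytesize % 8)

  # Check 8 bytes at a time: one AND per word shows if any high bit is set.
  words = bytes.byteslice(0, whole).unpack("Q*")
  return false if words.any? { |word| (word & high_bits) != 0 }

  # Check the leftover tail (fewer than 8 bytes) one byte at a time.
  bytes.byteslice(whole, bytes.bytesize - whole).each_byte.none? { |b| b >= 0x80 }
end

p ascii_only_by_words?("plain ascii text here")  # true
p ascii_only_by_words?("plain ascii caf\u00E9")  # false
```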

module FixEncoding
  def FixEncoding.fix_encoding(str)
    if str.ascii_only?
      return str.force_encoding('UTF-8')
    else
      str = str.b
      # Rest of code
    end
  end
end

Putting that check first will short-circuit the rest of our checks if there are no high characters. And since that's the fastest check we have, it should speed this up dramatically in that case. Note that this also precludes the string copy, so there's an even bigger win.
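
One consequence of skipping the copy is worth keeping in mind: force_encoding changes the string it's called on, so on this path the caller's own string gets retagged, and a frozen string will raise. A small sketch, with ascii_fast_path as a made-up stand-in for just that branch:

```ruby
# Stand-in for the ASCII-only branch; other inputs are out of scope here.
def ascii_fast_path(str)
  return str.force_encoding("UTF-8") if str.ascii_only?
  nil
end

input  = "plain ascii".dup.force_encoding("US-ASCII")
result = ascii_fast_path(input)

p result.equal?(input)                # true: the same object comes back
p input.encoding == Encoding::UTF_8   # true: the caller's string was retagged

begin
  ascii_fast_path("abc".freeze)
rescue FrozenError => e
  p e.class                           # FrozenError
end
```

If that matters for your callers, using str.dup on this path trades the copy back for safety.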

Next, we need to strip the BOM. This can be done without a regular expression to speed it up. Here's the old way again:

    if str =~ /\A\xEF\xBB\xBF/n
      str = str.gsub(/\A\xEF\xBB\xBF/n, '')
    end

We can use String#byteslice in both places to make this faster:

  def FixEncoding.remove_bom(str)
    if str.byteslice(0..2) == "\xEF\xBB\xBF".b
      return str.byteslice(3..-1)
    else
      return str
    end
  end

So, this is very different. First, we're slicing the first 3 bytes off and comparing to the BOM (also as a binary string). If they match, we return all but the first three bytes of str; otherwise we return str unchanged. Neither step needs the regular expression engine, and String#byteslice works directly on byte offsets.
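
A few spot checks of remove_bom, including the edge cases of a string shorter than three bytes and one that is nothing but a BOM. The helper is repeated here so the snippet runs on its own:

```ruby
module FixEncoding
  def FixEncoding.remove_bom(str)
    if str.byteslice(0..2) == "\xEF\xBB\xBF".b
      return str.byteslice(3..-1)
    else
      return str
    end
  end
end

p FixEncoding.remove_bom("\xEF\xBB\xBFa,b,c".b)  # "a,b,c"
p FixEncoding.remove_bom("a,b,c".b)              # "a,b,c"
p FixEncoding.remove_bom("ab".b)                 # "ab"
p FixEncoding.remove_bom("\xEF\xBB\xBF".b)       # ""
```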

Next, we check for UTF-8 characters:

  def FixEncoding.has_utf8?(str)
    str.match?(/[\xc0-\xf7][\x80-\xbf]/n)
  end

This is mostly the same, but I've simplified the regexp by removing the extraneous capture and repetition.
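
Here's how that check behaves on a few made-up inputs, written with match? so it returns a plain boolean. The last case shows the risk I'm accepting: two Latin-1 characters in a row can look like one UTF-8 character.

```ruby
# Standalone copy of the check for experimenting; expects a binary string.
def has_utf8?(str)
  str.match?(/[\xc0-\xf7][\x80-\xbf]/n)
end

p has_utf8?("caf\xC3\xA9".b)       # true: real UTF-8
p has_utf8?("caf\xE9 au lait".b)   # false: Latin-1, lead byte then a space
p has_utf8?("\xC9\xAB".b)          # true: Latin-1 bytes that look like UTF-8
```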

Next, we can check for stupid quotes:

  def FixEncoding.has_stupid_quotes?(str)
    str.match?(/[\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94]/n)
  end

and replace them if we find them:

  def FixEncoding.replace_stupid_quotes(str)
    str.tr("\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94".b, "''''''\"\"\"")
  end

This remains unchanged, save for moving it to its own function.

The final piece of the puzzle is to determine what the likely encoding is, force it to that encoding, then transcode to UTF-8.

  def FixEncoding.has_win1252?(str)
    str.match?(/[\x80-\x9f]/n)
  end

  def FixEncoding.likely_8bit_encoding(str)
    if str.ascii_only?
      "ASCII-8BIT"
    elsif has_win1252?(str)
      "WINDOWS-1252"
    else
      "ISO-8859-1"
    end
  end

Note that we again do the "ascii_only?" check. Why? I've replaced the high quote marks with standard ASCII equivalents, so we may well have an ASCII string again. That's faster than the regular expression check, so we're again looking for a short-circuit.
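
A standalone sketch of the three outcomes, using top-level copies of the two helpers and made-up sample bytes:

```ruby
def has_win1252?(str)
  str.match?(/[\x80-\x9f]/n)
end

def likely_8bit_encoding(str)
  if str.ascii_only?
    "ASCII-8BIT"
  elsif has_win1252?(str)
    "WINDOWS-1252"
  else
    "ISO-8859-1"
  end
end

p likely_8bit_encoding("plain".b)     # "ASCII-8BIT"
p likely_8bit_encoding("\x80 100".b)  # "WINDOWS-1252"
p likely_8bit_encoding("caf\xE9".b)   # "ISO-8859-1"
```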

With that, we can write our final helper:

  def FixEncoding.transcode_to_utf8(str)
    str.encode("UTF-8", likely_8bit_encoding(str))
  end

Note that using the encode method like that is the equivalent of using force_encoding with the second argument followed by encode with the first argument:

   str.force_encoding(likely_8bit_encoding(str)).encode("UTF-8")

At this point, our fix_encoding function is simpler, fully testable, and pretty much all acting at the same semantic level:

  def FixEncoding.fix_encoding(str)
    if str.ascii_only?
      return str.force_encoding('UTF-8')
    else
      str = str.b

      str = remove_bom(str)

      if has_utf8?(str)
        return str.force_encoding('UTF-8')
      else
        if has_stupid_quotes?(str)
          str = replace_stupid_quotes(str)
        end

        return transcode_to_utf8(str)
      end
    end
  end

The entire thing is longer now, but in reality there's no more code than before, and what code there is will run faster. While I don't normally worry too much about the speed of Ruby code, this is often used in processing multi-megabyte files where any speed improvement is appreciated.
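
For anyone who wants to poke at it, here's the whole module in one runnable piece with a few spot checks. It's a condensed sketch of the code above (guard clauses instead of nested if/else, and match? in the predicates), and the sample inputs are made up. It's no substitute for real tests.

```ruby
module FixEncoding
  def FixEncoding.remove_bom(str)
    str.byteslice(0..2) == "\xEF\xBB\xBF".b ? str.byteslice(3..-1) : str
  end

  def FixEncoding.has_utf8?(str)
    str.match?(/[\xc0-\xf7][\x80-\xbf]/n)
  end

  def FixEncoding.has_stupid_quotes?(str)
    str.match?(/[\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94]/n)
  end

  def FixEncoding.replace_stupid_quotes(str)
    str.tr("\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94".b, "''''''\"\"\"")
  end

  def FixEncoding.has_win1252?(str)
    str.match?(/[\x80-\x9f]/n)
  end

  def FixEncoding.likely_8bit_encoding(str)
    if str.ascii_only?
      "ASCII-8BIT"
    elsif has_win1252?(str)
      "WINDOWS-1252"
    else
      "ISO-8859-1"
    end
  end

  def FixEncoding.transcode_to_utf8(str)
    str.encode("UTF-8", likely_8bit_encoding(str))
  end

  def FixEncoding.fix_encoding(str)
    # Fast path: pure ASCII needs no copy and no scanning beyond this check.
    return str.force_encoding("UTF-8") if str.ascii_only?

    str = remove_bom(str.b)
    return str.force_encoding("UTF-8") if has_utf8?(str)

    str = replace_stupid_quotes(str) if has_stupid_quotes?(str)
    transcode_to_utf8(str)
  end
end

p FixEncoding.fix_encoding("plain text".dup) == "plain text"               # true
p FixEncoding.fix_encoding("\xEF\xBB\xBFcaf\xC3\xA9".b) == "caf\u00E9"     # true
p FixEncoding.fix_encoding("caf\xE9".b) == "caf\u00E9"                     # true
p FixEncoding.fix_encoding("\x93hi\x94".b) == "\"hi\""                     # true
p FixEncoding.fix_encoding("\x80 5".b) == "\u20AC 5"                       # true
```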

The complete code is available here:

https://gist.github.com/mdchaney/e2b05eafab81cbdc4dfed6dd2f8e69a6

That's not tested, though. Next time, I'll create some tests and find out how I did.
