Troubleshooting encoding errors in Ruby
Honeybadger Staff
Posted on July 17, 2020
This article was originally written by Jose Manuél on the Honeybadger Developer Blog.
You have thought about all the edge cases of your code, writing unit and integration tests for them, and yet, when you least expect it, you see a notification that an exception occurred. So, you investigate the problem, discover that it's an encoding error, and see some odd characters, such as "�" in the error message. How do you resolve this error?
For many of us, string encoding is like car maintenance: we only think about it when it breaks. If you're reading this, it's very likely that you spend a lot of time writing software, so let's get ready for encoding errors and leave the cars for later.
The good news is that encoding errors are fixable. However, first, we need to understand what they mean and, more importantly, how we can detect them before they happen.
An additional benefit of understanding encoding errors is that you'll have deeper knowledge of how encoding works and what can go wrong.
A quick review of encoding
If you have a string in your Ruby program, it's stored internally as a sequence of bytes. We can see this with the bytes method:
"H".bytes
=> [72]
And, if we transform it to binary, it will look like this:
"H".bytes.map {|e| e.to_s 2}
=> ["1001000"]
Encoding is the process of transforming the characters that we see into bytes so that the computer can process or store them internally.
Let's see a complete word now:
"House".bytes
=> [72, 111, 117, 115, 101]
And in binary:
"House".bytes.map {|e| e.to_s 2}
=> ["1001000", "1101111", "1110101", "1110011", "1100101"]
Here, we can see that each character in the string becomes a single byte whose value fits in 7 bits (to_s(2) simply omits the leading zero).
ASCII was one of the first ways of encoding these characters. It's still relevant because UTF-8, the current de facto standard, uses the same encoding for the first 128 characters. Here's the original ASCII table:
(From http://www.plcdev.com/ascii_chart)
So, when we see a string in our program, we need to keep in mind that, apart from its value, encoding transforms it into a visible representation, mostly in the form of readable characters.
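As a quick check of the overlap between ASCII and UTF-8, here's a small sketch. The variable names are only for illustration, and the accented character is written as an escape so the snippet doesn't depend on how the file itself is saved:

```ruby
# The first 128 characters are encoded identically in US-ASCII and UTF-8.
word = "House"
ascii_bytes = word.encode("US-ASCII").bytes
utf8_bytes  = word.encode("UTF-8").bytes
puts ascii_bytes == utf8_bytes # true

# Outside that range, UTF-8 needs more than one byte per character.
accented = "\u00EA" # e with circumflex
p accented.bytes    # [195, 170]
```

So a single visible character doesn't always map to a single byte, which is where most of the errors below come from.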
What's an encoding error?
An encoding error happens when our program can't correctly transform the string into its representation with a given encoding. I know this doesn't give us the full picture, so let's explore, one-by-one, the four kinds of exceptions we can find in Ruby.
All of these exceptions share the same parent, EncodingError, whose parent is StandardError.
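We can confirm this hierarchy from Ruby itself. The following sketch just walks up the superclass chain of the four exception classes covered in this article:

```ruby
encoding_errors = [
  Encoding::ConverterNotFoundError,
  Encoding::CompatibilityError,
  Encoding::UndefinedConversionError,
  Encoding::InvalidByteSequenceError
]

# Print each class followed by its parent and grandparent.
encoding_errors.each do |error_class|
  parent = error_class.superclass
  puts "#{error_class} < #{parent} < #{parent.superclass}"
end
```

In practice, this means a single rescue EncodingError catches all four, although handling them separately usually gives better results.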
Converter not found error
The simplest way to convert a string to another encoding is the encode method of the String class. It can be called in several ways, but with a single parameter, we just pass our target encoding:
"abc".encode("US-ASCII")
=> "abc"
Note that I'll use "US-ASCII" encoding in this article as an example of limited encoding (compared to UTF-8), but everything can be equally applied to any encoding you work with.
When does ConverterNotFound happen?
We've used "US-ASCII" encoding, but this exception will be raised if we use one that doesn't exist in Ruby:
"abc".encode("test")
...
# Encoding::ConverterNotFoundError (code converter not found (UTF-8 to test))
How can we fix the ConverterNotFound error?
If you're using an encoding that you're not familiar with, you can begin by listing all the encodings available in Ruby:
Encoding.name_list
=> ["ASCII-8BIT", "UTF-8", "US-ASCII", "UTF-16BE", "UTF-16LE", "UTF-32BE", "UTF-32LE", "UTF-16", "UTF-32", "UTF8-MAC", "EUC-JP", "Windows-31J", "Big5", "Big5-HKSCS", "Big5-UAO", "CP949", "Emacs-Mule", "EUC-KR", ...]
Also, make sure yours is listed there. I'd also recommend getting some context on the encoding from Wikipedia and other sources since it's going to be useful later to see what we can and can't do with it.
For example, if you're using ASCII, which only has 128 characters, it's impossible to represent a character like "Å" or "ê".
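If the encoding name comes from outside our program (a config file or a header, for example), we could check it before calling encode. This is only a sketch: encoding_known? is a hypothetical helper name, and it relies on Encoding.find raising ArgumentError for names Ruby doesn't recognize.

```ruby
# Returns true when Ruby recognizes the given encoding name.
def encoding_known?(name)
  Encoding.find(name)
  true
rescue ArgumentError
  false
end

puts encoding_known?("US-ASCII") # true
puts encoding_known?("test")     # false
```

Note that this only validates the name. It doesn't tell us whether our particular string can be represented in that encoding.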
Compatibility error
This is a slightly more complex error since it involves valid encodings and might not be easy to spot.
When does it happen?
When you try to compare two strings, you need to also be aware of their encoding. The program will try to transform what you see on the screen to bytes, and as we've seen, the way to do this is with its encoding. If we have different encodings, we can run into this issue:
string_in_utf8 = "Löve"
string_in_ascii = "Löve".force_encoding('US-ASCII')
string_in_utf8.start_with?(string_in_ascii)
...
# Encoding::CompatibilityError (incompatible character encodings: UTF-8 and US-ASCII)
It's important to note that if we try to compare them with '==', it doesn't raise this exception, but it returns false:
string_in_utf8 = "Löve"
string_in_ascii = "Löve".force_encoding('US-ASCII')
string_in_utf8 == string_in_ascii
=> false
How can we fix the compatibility error?
If you're dealing with strings in different encodings, convert them to the same encoding first and then compare them. Otherwise, you might be confused by the results. We'll see in the next exceptions how to convert them safely.
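One way to spot the problem before it raises is Encoding.compatible?, which returns nil when two strings can't be combined. The sketch below assumes we know the second string really contains UTF-8 bytes that were mislabeled, so restoring the label is enough; the variable names are mine:

```ruby
utf8_string = "L\u00F6ve"
mislabeled  = "L\u00F6ve".dup.force_encoding("US-ASCII")

# nil means Ruby can't compare or concatenate these two strings.
p Encoding.compatible?(utf8_string, mislabeled) # nil

# Restore the real encoding on a copy, then compare.
relabeled = mislabeled.dup.force_encoding("UTF-8")
puts utf8_string.start_with?(relabeled) # true
puts utf8_string == relabeled           # true
```

If the bytes had really been written in another encoding, we'd need encode rather than force_encoding to bring both strings to a common encoding.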
Undefined conversion error
Sometimes, "translating" the character into bytes is impossible with the current encoding.
Here's a string with a character outside the ASCII table: "hellÔ!". If we use it in Ruby, we don't usually have any issues since it's in UTF-8 by default. However, if we try to use another encoding, it might be impossible to "fit" it into that encoding.
"hellÔ!".encode("US-ASCII")
...
# Encoding::UndefinedConversionError (U+00D4 from UTF-8 to US-ASCII)
We can force it, but the result won't be what we want:
"hellÔ!".force_encoding("US-ASCII")
=> "hell\xC3\x94!"
This happens because the ASCII encoding doesn't know how to handle those bytes; they're outside its range of values. Note that force_encoding doesn't convert anything: it only changes the encoding the string is tagged with and leaves the bytes untouched, so Ruby just shows the raw byte values.
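The difference between relabeling and converting is easier to see by looking at the bytes. In this sketch, ISO-8859-1 is just an example of a target encoding that does contain the character:

```ruby
original = "hell\u00D4!" # O with circumflex, stored as two bytes in UTF-8

# force_encoding only changes the label; the bytes stay the same.
relabeled = original.dup.force_encoding("US-ASCII")
puts original.bytes == relabeled.bytes # true
puts relabeled.valid_encoding?         # false

# encode actually rewrites the bytes for the target encoding.
latin1 = original.encode("ISO-8859-1")
p original.bytes # [104, 101, 108, 108, 195, 148, 33]
p latin1.bytes   # [104, 101, 108, 108, 212, 33]
```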
When does it happen?
Strings don't always come from controlled sources. If you're importing a CSV or spreadsheet file, for example, then it might have a different encoding. If we try to convert it to the encoding we use in our application or database, we can run into this error.
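When we do know the encoding of an imported file, we can tell Ruby about it while reading, so the conversion happens in one place. The following is a minimal sketch that assumes a hypothetical legacy file saved as ISO-8859-1; a temporary file stands in for the real import:

```ruby
require "tempfile"

# "cafe" with an acute accent, as a legacy ISO-8859-1 file would store it.
legacy_bytes = "caf\xE9\n".b

text = Tempfile.create("legacy") do |file|
  file.binmode
  file.write(legacy_bytes)
  file.flush
  # "ISO-8859-1:UTF-8" reads as: the file is ISO-8859-1, return UTF-8 strings.
  File.read(file.path, encoding: "ISO-8859-1:UTF-8")
end

puts text.encoding        # UTF-8
puts text.valid_encoding? # true
```

If the declared source encoding is wrong, we won't necessarily get an exception; we may just get the wrong characters, so it's worth confirming the encoding with whoever produces the file.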
How can the UndefinedConversion error be fixed?
Sadly, no method can tell us which encoding a string's bytes were originally written in. We can call encoding, but it only reports the encoding the string is currently tagged with:
"dog".encoding
=> #<Encoding:UTF-8>
However, it doesn't guarantee that it's a valid and "representable" string in UTF-8. If we have a string like:
"abc\xCF\x88\xCF\x88"
We don't really know its encoding unless we have a reliable source, such as the metadata in a file or a header in a website. If we get this wrong, we'd have a situation like this, where we create the string as a literal value in Ruby and it'll tell us it's UTF-8:
"abc\xCF\x88\xCF\x88".encoding
=> #<Encoding:UTF-8>
In reality, though, it could just as well be the result of forcing "abcψψ" from UTF-8 into ASCII:
"abcψψ".force_encoding("US-ASCII")
=> "abc\xCF\x88\xCF\x88"
So, the real solution is to check whether the string has a valid encoding or, more simply, if it can be correctly represented with the encoding it has. We can use "valid_encoding?" for this purpose:
"abcψψ".force_encoding("US-ASCII").valid_encoding?
=> false
If we try to convert the string to the encoding we need with the encode method, we get the following exception:
"abcψψ".encode("US-ASCII", "UTF-8")
...
# Encoding::UndefinedConversionError (U+03C8 from UTF-8 to US-ASCII)
To deal with these characters, we can tell Ruby what to do with some extra parameters. Let's say we're happy with just removing them in the final string:
"abcψψ".encode("US-ASCII", "UTF-8", invalid: :replace, undef: :replace, replace: "")
=> "abc"
The order of the parameters is not intuitive, so we need to remember that the target encoding comes first and the source encoding second. If we don't pass the source encoding, Ruby uses the string's current encoding, which for string literals has been UTF-8 by default since Ruby 2.0.
"abcψψ".encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
=> "abc"
So, the conclusion is that if you need to transform your strings into a different encoding, use the valid_encoding? method to check whether everything is okay and then the options parameter in the encode method to make sure you don't get an exception in your code.
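Removing characters isn't the only option. The encode method also accepts a fallback option that lets us choose a substitute for each character that has no equivalent in the target encoding. The substitution table below is purely illustrative:

```ruby
greek = "abc\u03C8\u03C8" # "abc" followed by two Greek small letter psi

# Hypothetical table: spell out the characters we know about.
substitutions = { "\u03C8" => "psi" }
spelled_out = greek.encode("US-ASCII", fallback: substitutions)
puts spelled_out # abcpsipsi

# A lambda can cover characters that aren't known in advance.
questioned = greek.encode("US-ASCII", fallback: ->(_char) { "?" })
puts questioned # abc??
```

Keep in mind that fallback only applies to undefined conversions; invalid bytes still need invalid: :replace or scrub.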
Invalid byte sequence error
When does it happen?
Next, let's focus on another very common encoding error in Ruby, our last one. Here, we have an invalid set of bytes that don't represent anything in the encoding we're using.
"abc\xA1z".encode("US-ASCII")
...
# Encoding::InvalidByteSequenceError ("\xA1" on UTF-8)
From the official docs:
Raised by Encoding and String methods when the string being transcoded contains a byte that is invalid for either the source or target encoding.
Invalid bytes might be the result of network errors, corrupted files, or using the wrong encoding in our program.
How can we fix the InvalidByteSequence error?
In this case, we can use an approach similar to the one for UndefinedConversionError and replace what's invalid with what we want, this time with the scrub method:
"abc\xA1z".force_encoding("US-ASCII").scrub("*")
=> "abc*z"
We had to use force_encoding, so we don't get an exception calling encode. We could also remove the character as follows:
"abc\xA1z".force_encoding("US-ASCII").scrub("")
=> "abcz"
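When we'd rather keep a trace of what was wrong instead of silently dropping it, scrub also accepts a block that receives the invalid bytes. This sketch works on the string in its current encoding (UTF-8), and the angle-bracket format is just an arbitrary choice of mine:

```ruby
damaged = "abc\xA1z"

# Replace each run of invalid bytes with its hex value in angle brackets.
marked = damaged.scrub { |bytes| "<#{bytes.unpack1('H*')}>" }
puts marked # abc<a1>z
```

This can be handy in logs, where knowing which bytes were invalid often hints at the real source encoding.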
We can also check whether converting it to our target encoding will result in an error with:
"abc\xA1z".force_encoding("US-ASCII").valid_encoding?
=> false
So, a working solution might be:
received_string = "abc\xA1z"
string_in_target_encoding = received_string.force_encoding("US-ASCII")
unless string_in_target_encoding.valid_encoding?
  string_in_target_encoding.scrub!("_")
end
puts string_in_target_encoding
# abc_z
However, the problem with this method is that we don't know whether we're fixing an InvalidByteSequenceError or an UndefinedConversionError. A better way would be to rescue from those exceptions and treat them separately; next, we'll see how it's done.
A more complete solution to handle encoding issues in Ruby
With the approach of using scrub for InvalidByteSequence and encode with replace for UndefinedConversion, we can get to the following solution:
def safe_encode(string, target_encoding)
  string.encode(target_encoding)
rescue Encoding::InvalidByteSequenceError
  string.force_encoding(target_encoding).scrub!("")
rescue Encoding::UndefinedConversionError
  string.encode(target_encoding, invalid: :replace, undef: :replace, replace: "")
end
puts safe_encode("abc\xA1z", "US-ASCII") # => abcz
puts safe_encode("abcψψ", "US-ASCII") # => abc
Let's see what it does line by line:
First, we try to encode the string to our target_encoding.
string.encode(target_encoding)
If this method doesn't raise an exception, this will be the last line executed in the method and, therefore, our returned value. If it raised an exception, we have two potential cases.
If we have an InvalidByteSequenceError, it means the string contains bytes that aren't valid in its current encoding, so we relabel it with the target encoding and remove whatever is still invalid.
rescue Encoding::InvalidByteSequenceError
string.force_encoding(target_encoding).scrub!("")
If we get an UndefinedConversionError, it means we can't convert our string to the target encoding, so we also remove these characters.
rescue Encoding::UndefinedConversionError
string.encode(target_encoding, invalid: :replace, undef: :replace, replace: "")
This solution catches the two kinds of exceptions we've seen, but it's very simple since it just removes the characters that can't be encoded. Therefore, prior to using this method in production, I'd recommend trying to understand more deeply why the encoding doesn't work in the first place.
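One caveat of safe_encode is that force_encoding modifies the string it receives, which also fails on frozen strings. A possible variant, sketched here under the assumption that we'd rather get a new string back, could look like this (the name safe_encode_copy and the replacement keyword are my own):

```ruby
# Like safe_encode, but never modifies the string passed in.
def safe_encode_copy(string, target_encoding, replacement: "")
  string.encode(target_encoding)
rescue Encoding::InvalidByteSequenceError
  # Work on a copy so the caller's string keeps its bytes and label.
  string.dup.force_encoding(target_encoding).scrub(replacement)
rescue Encoding::UndefinedConversionError
  string.encode(target_encoding, invalid: :replace, undef: :replace, replace: replacement)
end

input = "abc\xA1z".freeze
puts safe_encode_copy(input, "US-ASCII")                   # abcz
puts safe_encode_copy(input, "US-ASCII", replacement: "_") # abc_z
puts input.encoding                                        # UTF-8
```

The behavior is otherwise the same, so the earlier advice still applies: find out why the conversion fails before deciding to drop characters.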
You can start by checking whether you can get the source encoding (e.g., from a file, a form, or a network message) and find out why it can't be converted to the target encoding. This is usually because the target is a more limited encoding (i.e., it has fewer characters than the source encoding, as when encoding UTF-8 into ASCII).
It's also important to keep an eye on their compatibility by, first, understanding their context. In some cases, the same bytes represent different characters in each encoding.
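To illustrate how the same byte can mean different things, here's a small sketch that interprets one byte under two single-byte encodings. The two encodings are just examples I picked:

```ruby
raw = "\xE9".b # one byte, no encoding assumed yet

# encode(target, source) interprets the bytes as "source" before converting.
as_latin1   = raw.encode("UTF-8", "ISO-8859-1")
as_cyrillic = raw.encode("UTF-8", "Windows-1251")

p as_latin1.codepoints   # [233], Latin small letter e with acute
p as_cyrillic.codepoints # [1081], Cyrillic small letter short i
```

Neither conversion raises an error, which is why knowing the real source encoding matters more than catching exceptions.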
Other encoding issues
Besides the issues we've seen, we might encounter others in the future. Encoding is a complex problem, and we haven't covered all the details in this article.
For example, when we have a character like Å, it can be represented as the code point "\u00c5" or as the composition of the letter "A" and an accent: "A\u030A". Depending on the situation, this might generate some issues, such as showing "A" and the accent characters as two separate elements. See our article on the topic for further details.
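Ruby can bring both forms to a common one with unicode_normalize. A minimal sketch:

```ruby
precomposed = "\u00C5"  # single code point
decomposed  = "A\u030A" # letter "A" plus a combining ring above

puts precomposed == decomposed # false
puts precomposed.length        # 1
puts decomposed.length         # 2

# NFC composes the pair into the single code point.
puts decomposed.unicode_normalize(:nfc) == precomposed # true
```

Normalizing both sides before comparing or storing them avoids treating the two forms as different strings.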
Another issue we haven't covered is what happens when two strings look the same but are internally represented by different code points. For example, " " and "\u00A0": the regular space and the no-break space. This is known as "visual spoofing" or a "homograph attack," and we need to be aware of it so that we only show the right data to the user.
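Looking at the code points makes the difference visible. In this sketch, NFKC normalization is one possible way to fold the no-break space into a regular one, although it won't catch every look-alike character:

```ruby
regular_space  = " "
no_break_space = "\u00A0"

puts regular_space == no_break_space # false
p regular_space.codepoints           # [32]
p no_break_space.codepoints          # [160]

# Compatibility normalization maps the no-break space to a plain space.
puts no_break_space.unicode_normalize(:nfkc) == regular_space # true
```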
Conclusion
As you can see, there are many aspects to keep in mind when handling encoding errors. However, now we're more familiar with some of the issues in Ruby and how to handle them, so the next time we spot an encoding exception, we'll be better equipped to fix it.
Note on the Ruby version used
I've used Ruby 2.6.5p114 for the examples, but any version from 2.4 should give us the same results.