Convert legacy Japanese encoding with Red
koba-yu
Posted on August 6, 2022
In this post, I talk about coding with the Red programming language. The repo is here. Red has great features for parsing and converting string/binary data. I hope this post helps you understand what Red can do!
Encoding is a curse in Japan
This section explains the Japanese cultural context rather than the code. Skip it if you are not interested.
Japanese people are probably among those who suffer most from encoding trouble in the world. We have 3 traditional character types - hiragana, katakana, and kanji - and even unique symbols like ♨ (hot spring) and emojis! Kanji in particular has a huge number of characters. For example, 辺, 邊, and 邉 have the same meaning, but when one of them is used in a surname we need to pick the right one, especially in a formal context like a wedding. Which one is right depends on which character the person's ancestor chose when registering the surname with the government in the Meiji era (around 1900).
Anyway, we needed a custom Japanese encoding until Unicode was established. In those days, Japanese people mainly used an encoding called "ShiftJIS", and many legacy systems still output their files in this encoding. To make matters worse, Microsoft Excel in Japan opens a file as UTF-8 only when it has a BOM. Without a BOM, its default choice is "ShiftJIS" (and files rarely have a BOM...). If you open a UTF-8 file without a BOM in Japanese Excel, it cannot show the characters correctly. This keeps spreading confusion among people unfamiliar with IT, and many people still think "a system should be able to output a ShiftJIS file". This is why even nowadays Japanese programmers (like me) need to deal with "ShiftJIS"-encoded files.
Let's convert ShiftJIS to UTF-8 with Red
Motivation
I often use the Red programming language for my data processing. Though Red is a fantastic language, it is young and does not offer easy encoding conversion, at least for now. Of course, most popular languages can read a ShiftJIS file and convert it to UTF-8, so I could handle the encoding with one of them. However, it is more useful for me if I can do the conversion with Red alone, without another language. Therefore I tried to implement it. This post explains how.
About the code
The code I made is here. In this post I explain what I did and some key points of the code.
1st Step, make a map for byte mapping (by C#)
Unfortunately, there is no logical codepoint conversion rule between ShiftJIS and UTF-8. Therefore I needed a byte map from the bytes of each ShiftJIS character to the bytes of the corresponding UTF-8 character. However, I could not find any reliable comparison table on the internet, so I wrote C# code to make one. You can run this code in LINQPad, a well-known C# scratch pad whose basic features are free. The code loops through every possible int value from the first ShiftJIS codepoint to the last, gets a char instance from the int value, and then gets a byte array from the char.
You think it is easy, don't you? But there are a lot of "pitfalls" - ShiftJIS has "undefined" codepoints everywhere! I have to skip them, but there is no logical way to judge whether a certain codepoint is "defined" or not, so I had to hard-code the specific codepoints to skip... here is a list of the value ranges I had to skip. There were too many to type.
shift-jis-utf8-bytes.txt is the resulting output of the code. Each line of this file holds a pair of byte expressions separated by |. The left part is a hexadecimal ShiftJIS byte expression and the right is the corresponding UTF-8 one. Some of the lines are listed below as an example.
80|C280
8150|EFBFA3
8151|EFBCBF
8152|E383BD
8153|E383BE
・
・
・
EAA0|E6A787
EAA1|E98199
EAA2|E791A4
EAA3|E5879C
EAA4|E78699
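To make the table format concrete, here is a minimal sketch in Python (not C# or Red). It is not the LINQPad script described above: instead of hard-coding the undefined ranges, it assumes Python's built-in cp932 codec is an acceptable stand-in for ShiftJIS and skips every byte sequence the codec refuses to decode. The name build_table is hypothetical, and the result may differ from shift-jis-utf8-bytes.txt for some edge codepoints.

```python
def build_table():
    """Return {ShiftJIS hex: UTF-8 hex} for every sequence the cp932 codec decodes."""
    # Single-byte half-width katakana range.
    candidates = [bytes([b]) for b in range(0xA1, 0xE0)]

    # Two-byte characters: lead byte 0x81-0x9F or 0xE0-0xFC, trail byte 0x40-0xFC.
    lead_bytes = list(range(0x81, 0xA0)) + list(range(0xE0, 0xFD))
    for lead in lead_bytes:
        for trail in range(0x40, 0xFD):
            if trail != 0x7F:  # 0x7F is never a valid trail byte
                candidates.append(bytes([lead, trail]))

    table = {}
    for raw in candidates:
        try:
            text = raw.decode("cp932")
        except UnicodeDecodeError:
            continue  # "undefined" codepoint: skip it
        table[raw.hex().upper()] = text.encode("utf-8").hex().upper()
    return table


table = build_table()
# Print a few entries in the same "SJIS|UTF-8" format as the file.
for key in ("8150", "8152", "EAA0"):
    print(f"{key}|{table[key]}")
```

Letting the codec report undefined codepoints avoids typing the skip list by hand, at the cost of trusting that codec's idea of which codepoints exist.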
2nd Step, make Red map! value as a conversion table.
Now that I have the byte table, I can convert it to a Red map! value. This is very easy. The code is here. The steps are just:
- Split each line by |
- Convert each hexadecimal string to a binary! value by debase/base xxxx 16
The resulting file is bytemap.red.
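The two steps can be sketched as follows. This is an illustration in Python rather than the Red code in the repo: a dict with bytes keys plays the role of the map! value, and bytes.fromhex plays the role of debase/base. The name parse_table and the three sample lines (copied from the listing earlier in this post) are only for demonstration.

```python
def parse_table(lines):
    """Turn 'SJIS|UTF8' hex lines into a {sjis_bytes: utf8_bytes} mapping."""
    bytemap = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        sjis_hex, utf8_hex = line.split("|")  # step 1: split by |
        # step 2: hexadecimal string -> binary value
        bytemap[bytes.fromhex(sjis_hex)] = bytes.fromhex(utf8_hex)
    return bytemap


sample_lines = ["80|C280", "8150|EFBFA3", "EAA0|E6A787"]
bytemap = parse_table(sample_lines)
print(bytemap)
```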
Final Step, parse and replace binary value for converting.
Finally, I can write the actual conversion code. shiftjis-to-utf8.red is the final file. The code is really short and neat thanks to Red's parse feature.
In line 11, I include the bytemap.red file that I created in the previous step. Then I make a rule block, special-bytes, to judge whether the current bytes are ShiftJIS-defined bytes or not. If not, the bytes must be inside the ASCII code range, where ShiftJIS has the same codepoints as UTF-8. The special-bytes block looks like this:
[
#{80} |
#{8150} |
#{8151} |
#{8152} |
#{8153} |
#{8154} |
 ・
 ・
 ・
#{EAA0} |
#{EAA1} |
#{EAA2} |
#{EAA3} |
#{EAA4}
]
This line is the part that processes ShiftJIS bytes. If the current bytes match any of the special-bytes, the code (select bytemap sb) is executed and, as a result, the ShiftJIS bytes are replaced with the UTF-8 ones. If the bytes do not match as ShiftJIS bytes, this line is executed and the code (make binary! reduce [ascii]) just keeps the byte unchanged, since it is an ASCII byte.
In the last line, I insert the binary! value #{EFBBBF} at the head of the converted binary. This is the BOM. Because I am a poor Japanese programmer, I need to add it so that Excel opens the file correctly...
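For readers who do not know Red's parse, the same logic can be sketched as a plain loop. This is a Python illustration under stated assumptions, not a translation of shiftjis-to-utf8.red: the function name convert is hypothetical, the mapping is a dict from ShiftJIS bytes to UTF-8 bytes, and the tiny bytemap below holds only one entry for demonstration. Membership in the dict stands in for the special-bytes rule, and any byte that does not match is kept as-is, like the ASCII branch.

```python
BOM = b"\xef\xbb\xbf"  # UTF-8 byte order mark, #{EFBBBF} in Red


def convert(data, bytemap):
    """Replace mapped ShiftJIS bytes with UTF-8 bytes and prepend a BOM."""
    out = bytearray(BOM)
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        single = data[i:i + 1]
        if len(pair) == 2 and pair in bytemap:
            out += bytemap[pair]    # two-byte ShiftJIS character
            i += 2
        elif single in bytemap:
            out += bytemap[single]  # one-byte ShiftJIS character
            i += 1
        else:
            out += single           # ASCII byte: keep unchanged
            i += 1
    return bytes(out)


# Demonstration with a single mapping: ShiftJIS 82A0 -> UTF-8 E38182.
tiny_bytemap = {bytes.fromhex("82A0"): bytes.fromhex("E38182")}
converted = convert(b"A\x82\xa0B", tiny_bytemap)
print(converted.hex().upper())
```

Trying the two-byte match first is safe here because a ShiftJIS lead byte is never a complete one-byte character on its own.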
How is the code? It is very clear!
I love the code style of Red. Binary processing code tends to be complicated in many languages, but in Red I could write it as very declarative and readable code. If you are interested in Red or have any questions about the code, feel free to ask!