TAHRI Ahmed R.
Posted on September 3, 2019
There is a very old problem, detecting the character encoding of a text file, that programs like Chardet have only partially solved. I did not like the idea of a single prober per encoding table, which tends to lead to hard-coded specifications.
I wanted to challenge the existing methods of discovering the originating encoding.
You could consider this issue obsolete because of current standards:
You should declare the charset encoding in use, as described in the relevant standards.
But the reality is different: a huge part of the internet still serves content with an undeclared encoding. (SubRip subtitles (SRT) are one example.)
This is why a popular package like Requests embeds Chardet to guess the apparent encoding of remote resources.
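You can see this in Requests itself; a minimal illustration, where the URL is only a placeholder:

```python
import requests

# Placeholder URL -- any page served without an explicit charset will do.
response = requests.get("https://example.com/legacy-page")

print(response.encoding)           # what the Content-Type header declared, if anything
print(response.apparent_encoding)  # Chardet's guess, computed from the raw body bytes
```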
You should know that:
- You should not care about the originating charset encoding, because two different tables can produce two identical files.
- The BOM (byte order mark) is not universal; it concerns only a tiny number of encodings, and not only Unicode ones! (See the sketch below.)
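Both points are easy to verify with the standard library alone. A small sketch; the signature list is deliberately minimal:

```python
import codecs
from typing import Optional

# Two different tables, one identical file: 'é' is byte 0xE9 in latin-1
# and in cp1252, so the originating table cannot be recovered from the bytes.
assert "café".encode("latin-1") == "café".encode("cp1252")  # both b'caf\xe9'

# A BOM sniffer only ever covers a handful of encodings. Order matters:
# the UTF-32-LE mark starts with the UTF-16-LE mark.
SIGNATURES = [
    ("utf-32-be", codecs.BOM_UTF32_BE),
    ("utf-32-le", codecs.BOM_UTF32_LE),
    ("utf-8-sig", codecs.BOM_UTF8),
    ("utf-16-be", codecs.BOM_UTF16_BE),
    ("utf-16-le", codecs.BOM_UTF16_LE),
    ("gb18030", "\ufeff".encode("gb18030")),  # not a UTF, yet it has a signature
]

def sniff_bom(payload: bytes) -> Optional[str]:
    """Return an encoding name only if a known BOM prefixes the payload."""
    for name, mark in SIGNATURES:
        if payload.startswith(mark):
            return name
    return None

print(sniff_bom(codecs.BOM_UTF8 + b"hello"))  # 'utf-8-sig'
print(sniff_bom("héllo".encode("utf-8")))     # None -- most files carry no BOM
```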
I brute-force on three premises (in that order), sketched right after this list:
- The binaries fit the encoding table
- Chaos
- Coherence
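In condensed form, the loop looks like this; `measure_chaos` and `measure_coherence` are hypothetical stand-ins for the real heuristics described next:

```python
from typing import Optional
from encodings.aliases import aliases

def measure_chaos(text: str) -> float:
    """Hypothetical stand-in: 0.0 for clean text, up to 1.0 for gibberish."""
    suspicious = sum(1 for c in text if not c.isprintable() and c not in "\n\r\t")
    return suspicious / max(len(text), 1)

def measure_coherence(text: str) -> float:
    """Hypothetical stand-in: 0.0 for noise, up to 1.0 for natural language."""
    return sum(1 for c in text if c.isalpha()) / max(len(text), 1)

def guess_encoding(payload: bytes) -> Optional[str]:
    """Brute-force every table the bytes fit, then rank by chaos and coherence."""
    candidates = []
    for encoding in sorted(set(aliases.values())):
        # Premise 1: the binaries must fit the encoding table at all.
        try:
            text = payload.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # LookupError skips non-text codecs like base64_codec
        # Premises 2 and 3: least chaos first, most coherence as tie-breaker.
        candidates.append((measure_chaos(text), -measure_coherence(text), encoding))
    return min(candidates)[2] if candidates else None

# Prints some plausible table -- these toy scorers are far too naive to be trusted.
print(guess_encoding("Comment ça va ?".encode("cp1252")))
```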
Chaos: I opened hundreds of text files, written by humans, with the wrong encoding table. I observed, then established some ground rules about what is obviously a mess. I know my interpretation of what is chaotic is very subjective; feel free to contribute to improve or rewrite it.
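A toy version of such rules; the categories and the example are mine, picked for illustration only:

```python
import unicodedata

# Unicode categories that almost never appear in honest human text:
# controls, format marks, private use, unassigned code points.
SUSPECT_CATEGORIES = {"Cc", "Cf", "Co", "Cn"}

def mess_ratio(text: str) -> float:
    """Toy chaos measure: the share of characters that look like decoding damage."""
    if not text:
        return 0.0
    suspicious = 0
    for character in text:
        if character in "\n\r\t ":
            continue  # ordinary whitespace is never evidence of a mess
        if character == "\ufffd" or unicodedata.category(character) in SUSPECT_CATEGORIES:
            suspicious += 1
    return suspicious / len(text)

print(mess_ratio("An ordinary English sentence."))  # 0.0
# cp850's 'ç' (0x87) read as latin-1 becomes a C1 control character:
print(mess_ratio("Comment \x87a va ?"))             # > 0
```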
Coherence: For as many of the world's languages as we could, we computed a ranking of letter frequency. I thought that intelligence was worth something here, so I check those records against the decoded text to see whether I can detect intelligent design.
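A toy version of that check; the abbreviated ranking below is illustrative, while the real records cover many languages and far more letters:

```python
from collections import Counter

# Abbreviated, illustrative ranking: the most frequent letters in English prose.
ENGLISH_RANK = "etaoinshrdlcumwf"

def coherence_ratio(text: str, ranking: str = ENGLISH_RANK) -> float:
    """Toy coherence measure: how much of the text uses the language's top letters."""
    letters = [c for c in text.lower() if c.isalpha()]
    if not letters:
        return 0.0
    counts = Counter(letters)
    popular = sum(counts[letter] for letter in ranking)
    return popular / len(letters)

print(coherence_ratio("The quick brown fox jumps over the lazy dog"))  # high (~0.71)
print(coherence_ratio("Ã©Ã¨Ã§Ã Ã¹"))                                  # 0.0 -- mojibake
```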
So I present to you Charset Normalizer. The Real First Universal Charset Detector.
Feel free to help through testing or contributing.