Sebastian Rapetti
Posted on December 9, 2017
Before starting, the reader considers this my personal opinion, how I see the things and not the only way resolve the problem.
I am working on a validator library and one of the various implemented filters do HTML escaping (replacing special characters with the correct HTML entities). For escape special char I need to know name or number of the html entity. Not all entities have a name then use the number is oblugatory.
For get the number of char, PHP offer a function called ord(). Unfortunately ord() does not work with UTF-8 multibyte characters and if multi-byte functions are not availables need to find another solution.
To be honest I don't like use multi-byte function on a filter because I cannot presume that all strings or character passed to the filter are multi-byte. In this case, defensive programming help us and we consider that all passed can be dangerous.
In comments of ord() function documentations I found a little snippet of code that seem the solution for my problem. It return the char number for all multi-byte chars.
Original code
This is the original version of the function (it's five years old code), prensent in PHP docs comments:
<?php
function ordutf8($string, &$offset)
{
$code = ord(substr($string, $offset, 1));
if ($code >= 128) {//otherwise 0xxxxxxx
if ($code < 224)
$bytesnumber = 2;//110xxxxx
else if ($code < 240)
$bytesnumber = 3;//1110xxxx
else if ($code < 248)
$bytesnumber = 4;//11110xxx
$codetemp = $code - 192 - ($bytesnumber > 2 ? 32 : 0) - ($bytesnumber > 3 ? 16 : 0);
for ($i = 2; $i <= $bytesnumber; $i++) {
$offset ++;
$code2 = ord(substr($string, $offset, 1)) - 128;//10xxxxxx
$codetemp = $codetemp * 64 + $code2;
}
$code = $codetemp;
}
$offset += 1;
if ($offset >= strlen($string))
$offset = -1;
return $code;
}
Code is thinked to manage a phrase, not only one character at time. Inside this function, there are two IF, five IF-ELSE and one FOR statements. Too complicated.
Step 1
We are in 2018 and PHP7 (7.2 current major version) provide a lot of new features, the one I love most is type checking!
Next, I need to manage one char at time, offset argument is no longer required.
<?php
function ordutf8(string $char) : int
{
$code = ord(substr($char, 0, 1));
if ($code >= 128) {
if ($code < 224)
$bytesnumber = 2;
else if ($code < 240)
$bytesnumber = 3;
else if ($code < 248)
$bytesnumber = 4;
$codetemp = $code - 192 - ($bytesnumber > 2 ? 32 : 0) - ($bytesnumber > 3 ? 16 : 0);
$offset = 0;
for ($i = 2; $i <= $bytesnumber; $i++) {
$offset ++;
$code2 = ord(substr($char, $offset, 1)) - 128;
$codetemp = $codetemp * 64 + $code2;
}
$code = $codetemp;
}
return $code;
}
One of two IF statement was removed.
Step 2
This code piece contains two IF-ELSE written using ternary operators:
<?php
$codetemp = $code - 192 - ($bytesnumber > 2 ? 32 : 0) - ($bytesnumber > 3 ? 16 : 0);
We replace it with static values, provide directly the correct number without check it every time.
<?php
function ordutf8(string $char) : int
{
$code = ord(substr($char, 0, 1));
if ($code >= 128) {
$count = 0;
if ($code < 224) {
$bytesnumber = 2;
} else if ($code < 240) {
$bytesnumber = 3;
$count = 32;
} else if ($code < 248) {
$bytesnumber = 4;
$count = 48;
}
$codetemp = $code - 192 - $count;
$offset = 0;
for ($i = 2; $i <= $bytesnumber; $i++) {
$offset ++;
$code2 = ord(substr($char, $offset, 1)) - 128;
$codetemp = $codetemp * 64 + $code2;
}
$code = $codetemp;
}
return $code;
}
Two of five IF-ELSE statements was removed.
Step 3
We can refactor FOR statements for remove variables and reduce lines of code.
<?php
function ordutf8(string $char) : int
{
$code = ord(substr($char, 0, 1));
if ($code >= 128) {
$count = 0;
if ($code < 224) {
$bytes = 2;
} else if ($code < 240) {
$bytes = 3;
$count = 32;
} else if ($code < 248) {
$bytes = 4;
$count = 48;
}
$temp = $code - 192 - $count;
for ($i = 1; $i < $bytes; $i++) {
$code = $temp = $temp * 64 + ord(substr($char, $i, 1)) - 128;
}
}
return $code;
}
In details from this:
<?php
$codetemp = $code - 192 - $count;
$offset = 0;
for ($i = 2; $i <= $bytesnumber; $i++) {
$offset ++;
$code2 = ord(substr($char, $offset, 1)) - 128;
$codetemp = $codetemp * 64 + $code2;
}
$code = $codetemp;
To this:
<?php
$temp = $code - 192 - $count;
for ($i = 1; $i < $bytes; $i++) {
$code = $temp = $temp * 64 + ord(substr($char, $i, 1)) - 128;
}
Step 4
Now we need to try to remove a little bit of IF-ELSE statements. This step seem more complicated than previous, but let us reflect on the conditions used in the IF statements.
we now have four conditions:
- >=128 2 bytes
- <224 2 bytes
- <240 3 bytes
- <248 4 bytes
we can rewrite them using only three:
- >127 2 bytes
- >223 3 bytes
- >239 4 bytes
without using IF-ELSE statements and overwriting the variables when to the next condition is true.
<?php
function ordutf8(string $char) : int
{
$code = ord(substr($char, 0, 1));
if ($code > 127) {
$bytes = 2;
$count = 0;
if ($code > 223){
$bytes = 3;
$count = 32;
}
if ($code > 239){
$bytes = 4;
$count = 48;
}
$temp = $code - 192 - $count;
for ($i = 1; $i < $bytes; $i++) {
$code = $temp = $temp * 64 + ord(substr($char, $i, 1)) - 128;
}
}
return $code;
}
Five of five IF-ELSE statements was removed. Remain 3 IF and one FOR statements.
Step 5
We now try to remove FOR statement and rewrite all as mathematics formula.
<?php
function ordutf8(string $char) : int
{
$code = ord(substr($char, 0, 1));
if ($code > 239){
return ((/*($code - 240) * 64 + */ord(substr($char, 1, 1)) - 128) *
64 + ord(substr($char, 2, 1)) - 128) *
64 + ord(substr($char, 3, 1)) - 128;
}
if ($code > 223){
return (($code - 224) * 64 + ord(substr($char, 1, 1)) - 128)
* 64 + ord(substr($char, 2, 1)) - 128;
}
if ($code > 127) {
return ($code - 192) * 64 + ord(substr($char, 1, 1)) - 128;
}
return $code;
}
Now only three IF statement remain.
UPDATE: Commented code in first IF statement return always zero. Useless code.
Conclusion
I hope this article can help you understand that you do not have to stop at the moment the code works. The code can always be improved, simplified and made more efficient.
Code improvement can be appreciated running little test on https://3v4l.org/TWFLX.
Code is also available on my Gist
Posted on December 9, 2017
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.