Handling text in programming, where to start from as a student

From the customary “Hello, World!” that opens nearly every programming tutorial to the day you become a regex enthusiast, you’ll always be handling text one way or another. You’ll also notice that text is a subject of its own once you start using a somewhat lower-level language like Java, as I did when I replaced JavaScript with it.

This article is for beginners who want to know which basic topics of text handling in programming are worth studying.

Terminology

First of all, we need to know what each part of a text/character is called. In other words, we need to know some terms that surround this subject.

Diacritic marks: Marks placed above or below (or sometimes next to) a letter in a word to indicate a particular pronunciation — in regard to accent, tone, or stress — as well as meaning, especially when a homograph exists without the marked letter or letters;
Scripts: A writing system or an orthography, consisting of a set of visible marks, forms, or structures called characters or graphs that are related to some structure in the linguistic system;
Character: Single visual object used to represent text, numbers, or symbols;
String: Data values that are made up of ordered sequences of characters, such as "hello world". A string can contain any sequence of characters, visible or invisible, and characters may be repeated;
Character set: Collection of characters that might be used by multiple languages. Example: The Latin character set is used by English and most European languages, though the Greek character set is used only by the Greek language;
Coded character set: Character set in which each character corresponds to a unique number;
Code point: Any allowed value in the character set or code space;
Code space: Range of integers whose values are code points. Some sources say this range of integers is directly related to the number of possible characters of a given encoding system;
Code unit: “Word size” of the character encoding scheme, such as 7-bit, 8-bit, 16-bit. In some schemes, some characters are encoded using multiple code units, resulting in a variable-length encoding. A code unit is referred to as a code value in some documents;
Octet: A group of exactly eight bits. The term exists because a byte has not been eight bits on every computer system.
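To make the last few terms concrete, here is a minimal sketch in Java. The class and method names are mine, made up for illustration. It relies on the fact that a Java String is stored as a sequence of 16-bit code units (UTF-16), and it counts the same text in three ways: code points, code units, and octets after encoding to UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class TextUnits {
  // Number of code points in the string
  static int codePoints(String s) {
    return s.codePointCount(0, s.length());
  }

  // Number of UTF-16 code units, which is what String.length() counts in Java
  static int utf16CodeUnits(String s) {
    return s.length();
  }

  // Number of octets the string takes once encoded as UTF-8
  static int utf8Octets(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  public static void main(String[] args) {
    // A, Æ, 狈 and an emoji (U+1F600), written as escapes to avoid source-encoding issues
    String[] samples = {"A", "\u00C6", "\u72C8", "\uD83D\uDE00"};
    for (String s : samples) {
      System.out.printf("U+%04X: %d code point(s), %d UTF-16 code unit(s), %d UTF-8 octet(s)%n",
          s.codePointAt(0), codePoints(s), utf16CodeUnits(s), utf8Octets(s));
    }
  }
}
```

The emoji is the interesting case: it is one code point, but it takes two UTF-16 code units and four UTF-8 octets, so "length" depends on which unit you are counting.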

Standards and Encoding

In the mid-1960s, the US settled on the ASCII (American Standard Code for Information Interchange) standard to define the characters and their encoding for any teleprinter made to write in the English language. ASCII has a 7-bit code unit; in other words, it has room for 128 possible characters (code points 0 to 127), covering letters, digits, punctuation, and some control characters like backspace or new line.

ASCII wasn’t the only standard for character encoding, though. Some countries made their standards based on ASCII, and other countries whose alphabets had nothing to do with English’s made their standards from scratch. Then computers happened, and soon, though it was not common at all, people had the opportunity to send documents across different countries. This was so much of a mess that Japan, which had 4 different encoding systems completely incompatible with each other, even invented a word for what happens when you try to read a document written in a different encoding system and the characters come out garbled: Mojibake (or 文字化け). For sending documents across the world, a fax was simply the better choice.

Then the Internet happened, and suddenly it was easy to even have a live conversation with someone in another country. For that reason, the encoding system that ran on your computer became an important subject of discussion. At this point, we needed a standard and an encoding system that were fully compatible with any language, and that could also meet some computers’ specific criteria, as some of them at the time interpreted eight zero bits in a row (a null byte) as the end of a string. For that, Unicode was created.

Unicode is a standard (and only a standard, as opposed to ASCII) that defines well over a hundred thousand characters (for now), covering 159 modern and historic scripts, as well as symbols, emojis, and non-visual control and formatting codes. To encode all these characters, the UTF-8 encoding system was created, which was not the first, but is the most popular to this day, accounting for 98% of all web pages, and up to 100.0% for some languages (as of 2022).

As opposed to ASCII, which simply assigns a character to every natural number that fits in 7 bits, UTF-8 takes a different and beautifully thought-out approach. To begin with, each character is assigned a natural number, like uppercase A to 65 and lowercase a to 97 (the same as ASCII, for compatibility purposes). A character in the ASCII range takes a single byte that starts with 0. Any other character takes 2 to 4 bytes, and each byte is divided into sections. The leading bits of the first byte define how many bytes the character takes: if the first byte starts with 110 (the number of 1s is the byte count, and the zero means “stop counting”), the character takes 2 bytes. Every byte after the first must start with 10, meaning it is a continuation of the previous one. All the remaining bits, joined in order, form a binary number that is the code point. Written in hexadecimal with the prefix U+, that number identifies the character, like so:

Character	Unicode code point	Decimal value	Binary UTF-8
:	U+003A	58	00111010
狈	U+72C8	29384	11100111 10001011 10001000
Æ	U+00C6	198	11000011 10000110
§	U+00A7	167	11000010 10100111
∑	U+2211	8721	11100010 10001000 10010001
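The bit layout described above can be sketched by hand in a few lines of Java. This is only an illustration under my own naming (Utf8Sketch, encode, and toBinary are made up), not how you should encode text in practice: String.getBytes(StandardCharsets.UTF_8) already does it for you, and the main method uses it to check the hand-made result.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Sketch {
  // Encodes a single code point following the UTF-8 bit layout
  static byte[] encode(int cp) {
    if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      throw new IllegalArgumentException("Not an encodable code point: " + cp);
    if (cp < 0x80)      // 0xxxxxxx
      return new byte[] {(byte) cp};
    if (cp < 0x800)     // 110xxxxx 10xxxxxx
      return new byte[] {(byte) (0xC0 | (cp >> 6)), (byte) (0x80 | (cp & 0x3F))};
    if (cp < 0x10000)   // 1110xxxx 10xxxxxx 10xxxxxx
      return new byte[] {(byte) (0xE0 | (cp >> 12)),
          (byte) (0x80 | ((cp >> 6) & 0x3F)), (byte) (0x80 | (cp & 0x3F))};
    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return new byte[] {(byte) (0xF0 | (cp >> 18)), (byte) (0x80 | ((cp >> 12) & 0x3F)),
        (byte) (0x80 | ((cp >> 6) & 0x3F)), (byte) (0x80 | (cp & 0x3F))};
  }

  // Renders bytes as groups of eight binary digits separated by spaces
  static String toBinary(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) {
      if (sb.length() > 0) sb.append(' ');
      sb.append(String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0'));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // Code points of ':', 'Æ', '狈' and '∑'
    int[] samples = {0x3A, 0xC6, 0x72C8, 0x2211};
    for (int cp : samples) {
      byte[] mine = encode(cp);
      byte[] builtIn = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
      System.out.printf("U+%04X  %s  matches built-in: %b%n",
          cp, toBinary(mine), Arrays.equals(mine, builtIn));
    }
  }
}
```

The binary strings it prints are the same ones shown in the table.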

There are other encoding systems, such as UTF-16, UTF-32, and UCS-2, each with its advantages and disadvantages, so it’s worth reading about them once you have to choose or work with one. I’m not covering all of them here because that's not my objective.

Recommended reading: Unicode at Wikipedia, UTF-8 at Wikipedia, UTF-16 at Wikipedia, UTF-32 at Wikipedia.

Unicode APIs

Character set

Most languages have APIs to handle code points, and that’s all we need. As you’ll probably be using UTF-8 or UTF-16 anyway, there are not many reasons to detect the encoding system of a string (or character set, as some languages call it). If you want to, and you are using Java 11, you can use a library like Apache Tika to detect the encoding of a String:

import org.apache.tika.parser.txt.CharsetDetector;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetHandler {
  // Get default text encoding for this JVM, if you wish
  public static final String DEFAULT_CHARSET = Charset.defaultCharset().name();

  public static void main(String[] args) {
     CharsetDetector detector = new CharsetDetector();

     String ascii = "Test";
     String unicode = "狈";

     // Encode explicitly as UTF-8 so the bytes do not depend on the platform default
     detector.setText(ascii.getBytes(StandardCharsets.UTF_8));
     System.out.println(detector.detect().getName()); // Output: "ISO-8859-2"

     detector.setText(unicode.getBytes(StandardCharsets.UTF_8));
     System.out.println(detector.detect().getName()); // Output: "UTF-8"
  }
}

If you take the time to test it yourself, you’ll notice that this is not consistent at all. The string “English” gives you the output “UTF-8”, the string “Test” gives you “ISO-8859-2”, and the string “Testenglish” gives you “ISO-8859-1”. All three strings contain only ASCII characters, whose bytes are identical in UTF-8 and in the ISO-8859 family, so the detector can only guess. This is an inherent issue of this kind of operation. Tika’s documentation explicitly says:

Character set detection is at best an imprecise operation. The detection process will attempt to identify the charset that best matches the characteristics of the byte data, but the process is partly statistical in nature, and the results can not be guaranteed to always be correct.

If your application depends on a specific text encoding, you can set it on your database, JVM (if you use Java), toolkit (like GTK, for GUI applications), web browser (by specifying it in your HTML file), etc. If your language does not support custom encodings at runtime, you can take text in a foreign encoding, convert it to your language’s encoding, and export it back to the original encoding if you ever need to. As your language’s default encoding will almost certainly be UTF-8 or UTF-16 (depending on your language and operating system), there’s little need to be afraid of incompatibility between encodings. As I’ve already mentioned, Unicode has more than a hundred thousand characters available, and this number is far below the limit of UTF-8 and UTF-16, which can both address 1,114,112 code points.
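The "convert in, convert out" idea can be sketched like this in Java. The class and method names (Transcode, convert) are hypothetical, and the sample bytes are just an example I picked: the characters Æ and § stored as ISO-8859-1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Transcode {
  // Decodes bytes from one charset and re-encodes the text in another
  static byte[] convert(byte[] input, Charset from, Charset to) {
    String text = new String(input, from); // now in Java's own String representation
    return text.getBytes(to);
  }

  public static void main(String[] args) {
    // "Æ§" as stored in ISO-8859-1: one byte per character
    byte[] latin1 = {(byte) 0xC6, (byte) 0xA7};

    byte[] utf8 = convert(latin1, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
    System.out.println(utf8.length); // 4, because each character takes two bytes in UTF-8

    // Export back to the original encoding
    byte[] back = convert(utf8, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
    System.out.println(Arrays.equals(latin1, back)); // true
  }
}
```

The round trip is only lossless when the target encoding can represent every character. With String.getBytes, a character the target charset cannot represent is silently replaced (typically by "?"), so going from Unicode back to a smaller character set deserves a check.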

Code unit

Knowing that detecting a character set is at best an imprecise operation, and that you probably already know the encoding if you are working with a database or a GUI framework (thus knowing its code unit), there are not many reasons to get this information through code. If you want it anyway, use a method provided by your language or check its documentation. There’s no reason to do something like this:

public static long[] minAndMaxCodeUnits(String input) {
  char lowerCodePointChar =
        Character.toChars(input.codePoints().min().getAsInt())[0];
  char higherCodePointChar =
        Character.toChars(input.codePoints().max().getAsInt())[0];

  long[] result = {
        InstrumentationAgent.getObjectSize(lowerCodePointChar),
        InstrumentationAgent.getObjectSize(higherCodePointChar)
  };

  return result;
}

This is pointless, first of all, because your language probably can’t handle a per-string character set, so every string has the same one and therefore the same code unit. It is also pointless because some languages, like Java, treat some or all values as objects, so the sizes you measure in memory are those of the objects and not of the code units.

Code point and Code space

Though a code point is usually written as a hexadecimal number, it is just an integer, and the whole Unicode code space (U+0000 to U+10FFFF) fits in a 4-byte integer, which is the data type used by code point operations. For example, we can make a method that applies the Caesar cipher to a string:

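public static String caesarCipher(String input, int shift) {
  // Normalize the shift so that negative and large values wrap around the alphabet
  int offset = Math.floorMod(shift, 26);

  int[] shifted = input.codePoints().map(codepoint -> {
     // Only the basic Latin letters are rotated; everything else is kept as-is
     if (codepoint >= 'A' && codepoint <= 'Z')
        return 'A' + (codepoint - 'A' + offset) % 26;
     else if (codepoint >= 'a' && codepoint <= 'z')
        return 'a' + (codepoint - 'a' + offset) % 26;
     else
        return codepoint;
  }).toArray();

  // The count is the number of code points, which can be smaller than input.length()
  return new String(shifted, 0, shifted.length);
}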

Or a method that swaps the case of every letter:

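public static String revertCase(String input) {
  int[] swapped = input.codePoints().map(codepoint -> {
     if (codepoint >= 'a' && codepoint <= 'z')
        return codepoint - 32; // lower case to upper case
     else if (codepoint >= 'A' && codepoint <= 'Z')
        return codepoint + 32; // upper case to lower case
     else
        return codepoint;      // anything else, including non-English letters, is kept
  }).toArray();

  // The count is the number of code points, which can be smaller than input.length()
  return new String(swapped, 0, swapped.length);
}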

The latter example could be done in other ways, such as this one. Like every single mathematical operation in programming, the limit is your creativity (check out the Fast Inverse Square Root from Quake III to know what I’m talking about).

Both examples are also good samples of controlling a string’s code space. In them, what we needed to do was keep each character’s code point within the ranges where the basic Latin letters lie (65 <= x <= 90 for upper case, and 97 <= x <= 122 for lower case).

Regular Expressions

A regular expression is like a language built into another language. It works by interpreting a string of specific characters in a specific order as a whole complex operation that returns matches for a given pattern in a given string. You can then choose what to do with these substrings.

Here’s an operation with regular expressions that removes all emojis from a string. It does so by removing every character that is not a letter, a number, a punctuation mark, or a separator, so other symbols are removed as well:
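As a minimal sketch of that "find the matches, then decide what to do with them" flow, the snippet below collects every match of a pattern into a list using Java's Pattern and Matcher classes. The class name, the method name, and the sample sentence are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindMatches {
  // Returns every substring of input that matches the regular expression
  static List<String> findAll(String regex, String input) {
    List<String> matches = new ArrayList<>();
    Matcher matcher = Pattern.compile(regex).matcher(input);
    while (matcher.find()) {
      matches.add(matcher.group());
    }
    return matches;
  }

  public static void main(String[] args) {
    // \d+ means "one or more digits in a row"
    System.out.println(findAll("\\d+", "Order 66 shipped 3 items in 2022")); // [66, 3, 2022]
  }
}
```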

public String removeEmoji(String input) {
  String regex = "[^\\p{L}\\p{N}\\p{P}\\p{Z}]";
  return Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS)
        .matcher(input)
        .replaceAll("");
}

Then an operation that replaces every whitespace character with an underscore:

public String replaceWhitespaceByUnderline(String input) {
  String regex = "[\\s]";
  return input.replaceAll(regex, "_");
}

And a more complex operation that takes any kebab-case words and turns them into camel-case words:

public String kebabCase2CamelCase(String input) {
  String regex = "(?:([\\p{IsAlphabetic}]*)(-[\\p{IsAlphabetic}]+))+";

  Matcher m = Pattern.compile(regex).matcher(input);
  boolean hasSubSequence = m.find();

  if (hasSubSequence) {
     Matcher kebabCaseMatches = Pattern.compile(regex).matcher(input);
     while (kebabCaseMatches.find()) {
        String currentOccurence = kebabCaseMatches.group();

        while (currentOccurence.contains("-")) {

           // Indented for better understanding
           currentOccurence = currentOccurence.replaceFirst("-[\\p{IsAlphabetic}]",
                 Character.toString(
                       Character.toUpperCase(
                             currentOccurence.charAt(currentOccurence.indexOf("-") + 1)
                       )
                 )
           );
        }

        input = input.replaceFirst(regex, currentOccurence);
     }
  }

  return input;
}
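For comparison, here is a shorter sketch of the same idea. It assumes Java 9 or newer, where Matcher.replaceAll accepts a function that computes each replacement, and the class and method names are hypothetical. It matches a dash followed by one alphabetic character and replaces the pair with the upper-case character.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class KebabSketch {
  // A dash followed by one alphabetic character, captured as group 1
  private static final Pattern DASH_LETTER = Pattern.compile("-(\\p{IsAlphabetic})");

  static String kebabToCamel(String input) {
    return DASH_LETTER.matcher(input).replaceAll(match ->
        // quoteReplacement keeps the result from being read as a group reference
        Matcher.quoteReplacement(match.group(1).toUpperCase(Locale.ROOT)));
  }

  public static void main(String[] args) {
    System.out.println(kebabToCamel("background-color and font-size"));
    // backgroundColor and fontSize
  }
}
```

One difference worth knowing: String.toUpperCase can turn one character into several (ß becomes SS), while Character.toUpperCase, used in the longer version, always returns a single character.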

Recommended reading: Regular-Expressions.info, or regexr.com if you want to practice.

Conclusion

Handling text in programming is a complex task, as it is easy to overcomplicate it from a design standpoint, and even the tools created to make it easy are not simple either.

Depending on what you want to do, you can get away with a very simple regular expression, and to some extent (a really small one) it’s not so complicated. Code point operations, though, require some creativity to be done, and even more to be done efficiently.

I’d like to quote u/blablahblah on Reddit:

Keep in mind this all becomes way more complicated if you deal with non-English text.

For example, the German letter ß was historically capitalized as "SS", not exactly the same thing as subtracting 32 from the code point. They have a capital version of that letter now, but it's not 32 before the lower case version.

For a Caesar cypher, what do you do if you get Spanish text with an á or ñ in it? Does is differ if the à is represented as one code point (U+00E1) or two (U+0301 U+0061)? Unicode normalization — identifying that those mean the same thing — is a huge and complicated thing.
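Both points from the quote can be seen in a few lines of Java. This is a minimal sketch with made-up names, using java.text.Normalizer from the standard library: it compares á written as one code point against the letter a followed by a combining acute accent, and it upper-cases ß.

```java
import java.text.Normalizer;
import java.util.Locale;

public class NormalizationSketch {
  // Compares two strings after bringing both to the same normalization form (NFC)
  static boolean sameText(String a, String b) {
    return Normalizer.normalize(a, Normalizer.Form.NFC)
        .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
  }

  public static void main(String[] args) {
    String precomposed = "\u00E1"; // á as a single code point
    String decomposed = "a\u0301"; // a followed by a combining acute accent

    System.out.println(precomposed.equals(decomposed));   // false
    System.out.println(sameText(precomposed, decomposed)); // true

    // Upper-casing ß as a String gives two letters, not "code point minus 32"
    System.out.println("\u00DF".toUpperCase(Locale.ROOT)); // SS
  }
}
```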

Note that I’m not trying to make it look easy. If you ever need to choose or work directly with an encoding system or standard, or make complex operations with text, it is highly recommended that you study this subject, as it is for every subject in programming.
