UTF-8 Glyphs and Graphemes

In the previous posts I freely referred to →T← as a "character". To move forward we need a more precise definition. What you see between these arrows is:

A character...
represented by a grapheme...
having a specific Unicode code point...
rendered as a glyph...
in a given typeface/font.

A grapheme is the smallest functional unit of a writing system. It means that T cannot be split.

Code points were explained previously.

A glyph refers to a shape. It means that T is recognizable as one vertical bar with a horizontal bar on top.

A typeface refers to the font used to present a glyph to the reader. These are different typefaces of the same glyph: T, 𝑇, 𝖳, 𝙏.

Is every code point a grapheme?

No.

Most people agree that whitespace characters are graphemes. Youcannotdenythatspaceisafunctionalunitofawritingsystem, can you?

But there are non-printable characters used to control text flow, like right-to-left or left-to-right marks, or invisible zero-width joiners that glue something together. Those can barely be considered functional units on their own.

The ASCII control characters described previously certainly have code points, but they are not functional units of a writing system and therefore are not graphemes.
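As a quick check that such invisible characters really do occupy code points, here is a minimal sketch that looks three of them up by their Unicode names. The three names are picked for illustration only, and it assumes a Raku version that provides `uniparse`.

```raku
# Each of these characters is invisible, yet has its own code point.
for "ZERO WIDTH JOINER", "RIGHT-TO-LEFT MARK", "LEFT-TO-RIGHT MARK" -> $name {
    say $name, " is U+", uniparse( $name ).ord.base( 16 );
}
# ZERO WIDTH JOINER is U+200D
# RIGHT-TO-LEFT MARK is U+200F
# LEFT-TO-RIGHT MARK is U+200E
```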

Things go crazy when you start decomposing graphemes. For example, ̨ in ę has its own code point, U+328:

$ raku -e 'uniparse( "COMBINING OGONEK" ).ord.base( 16 ).say'
328

Raku note: Some terminals do not allow pasting bare combining characters, so I forced creation of ̨ by parsing its Unicode name. An alternative method is to use string interpolation like "\c[COMBINING OGONEK]".

"Ogonek" means tiny tail in Polish :) But without being attached to another grapheme this tiny tail is not a functional unit from a linguistic point of view, so it is not a grapheme.

A grapheme cannot be split, but the answer above shows that it can?

Split does not mean decomposed. You cannot have a meaningful "half of T". But some graphemes can be composed from smaller parts, such as a base letter and a combining mark. There will be a separate post about it in the future.
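Until then, a minimal sketch of the difference between "one grapheme" and "several code points". It assumes Raku's `.chars` counts graphemes and that `.NFD` returns the canonically decomposed code points. The variable name is made up for this example.

```raku
# One grapheme, but more than one code point after decomposition.
my $e-ogonek = "e\c[COMBINING OGONEK]";   # base letter + combining tail

say $e-ogonek.chars;        # graphemes: 1
say $e-ogonek.NFD.codes;    # code points in decomposed form: 2

# Names of the decomposed code points.
.chr.uniname.say for $e-ogonek.NFD.list;
# LATIN SMALL LETTER E
# COMBINING OGONEK
```

The tail can be taken off at the code point level, yet Raku still counts the whole thing as a single character.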

Can the same glyph represent two different graphemes?

Yes.

For example, A and Α are neither the same grapheme nor the same code point:

$ raku -e '"AΑ".uninames>>.say'
LATIN CAPITAL LETTER A
GREEK CAPITAL LETTER ALPHA

Those are called "homoglyphs" and will be described in a separate post.

Is typeface/font defined in Unicode?

It is complicated :)

Unicode does not specify which font to use. You cannot force something to be rendered in Arial purely by providing a given Unicode code point.

However, Latin letters do have some typeface variants defined under separate code points.

Take for example P and 𝘗. The first is the U+50 Latin letter P, wrapped in a markdown directive that causes it to be rendered in italic. The second is U+1D617, a Latin letter P presented in a sans-serif italic typeface. Both produce similar glyphs to represent the grapheme, but they are achieved in different ways.

Those typefaces defined at the Unicode level are almost exclusively used in math/physics formulas.

The tricky thing is that, despite both being Latin P letters, you cannot compare them directly:

$ raku -e 'say "𝘗" eq "P"'
False
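One possible way to compare them anyway is to apply compatibility decomposition (NFKD) first. This is only a sketch, not something this series has covered yet. It assumes that U+1D617 carries a compatibility mapping to plain P, and the variable names are hypothetical.

```raku
# Compare after compatibility decomposition (NFKD).
my $math-p  = "𝘗";   # U+1D617
my $latin-p = "P";   # U+50

say $math-p.uniname;                 # MATHEMATICAL SANS-SERIF ITALIC CAPITAL P
say $math-p eq $latin-p;             # False
say $math-p.NFKD.Str eq $latin-p;    # True
```

Note that this kind of normalization throws away the typeface information, so it fits comparison and search better than storage.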

Coming up next: Fun with browsing code point namespace (optional). Codepoint properties.

Paweł bbkr Pabian