Back in 2014 I decided that a blog that almost no one reads wasn’t good enough, so I created a blog that no one reads, my computer programming blog, The Hard-Core Coder. (I was afraid the term “hard-core” would attract all sorts of the wrong attention, but apparently those fears were for naught. No one has ever even noticed, let alone commented. Yay?)
In the seven years since, I've only published 83 posts, so the lack of traffic or followers isn't too surprising. (Lately I've been trying to devote more time to it.) There's also the fact that the subject matter is usually fairly arcane.
But not always. For instance, today’s post about Unicode.
Although, to be honest, that post is written for computer users who've gotten their hands dirty dealing with code page or character set issues, so it's still a bit esoteric and technical. Here I thought I'd point to that post for those interested and provide a less technical overview of Unicode.
Because, whether you realize it or not, Unicode isn't just something very important for the internet; it's something you probably use every day (assuming you use a computer, tablet, or phone every day).
Unicode is the standard that computers worldwide use to represent text. The goal of Unicode is to include every alphabet in the world, including some historical ones. Unicode also includes lots of special characters (such as ℜ, ∞, ∅, ⇔, ⊗), and it includes the emoji set (😎🧛🏼♀️🎃🍕🚓💥).
So you can perhaps see its importance. Every text, every email, every webpage: they all use the Unicode standard. For most users Unicode is generally something they don't ever have to think about, but there is one situation where it's helpful to understand what's going on.
If you’ve ever had your text change from this:
I didn’t imagine that!
To this (what happened to that nice apostrophe?):
I didnâ€™t imagine that!
Or rather than the apostrophe, perhaps it was the quotes:
He asked, ‟Did you find it?”
And they turned into something like:
He asked, â€źDid you find it?â€ť
In both cases, you’ve run afoul of a (very common) Unicode issue — specifically one involving UTF-8, the most common form of Unicode users encounter. It usually happens when pasting copied text into something that doesn’t recognize the text as UTF-8, but as some form of ASCII (a much older way of encoding text).
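For the curious, it's easy to reproduce this mangling yourself. A minimal Python sketch, assuming the misbehaving program treats the bytes as Windows-1252 (one common flavor of extended ASCII):

```python
# The curly apostrophe (U+2019) becomes three bytes in UTF-8.
original = "I didn\u2019t imagine that!"
utf8_bytes = original.encode("utf-8")  # contains ... e2 80 99 ...

# A program expecting extended ASCII reads each of those bytes
# as its own one-byte character, and the apostrophe becomes three.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # I didnâ€™t imagine that!
```

The same three-bytes-read-as-three-characters mistake produces all the examples above.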
I’ll give you a couple of solutions after I’ve made you read the rest of the post.
So what, from a casual user’s point of view, is Unicode? I said above that it seeks to include every alphabet in the world (and a lot more). As succinctly as possible (in less than 25 words):
Unicode seeks to map all glyphs of all languages to unique integer code points along with various 32-bit, 16-bit and 8-bit physical encodings.
It actually does say it all, although it might take a bit of unpacking.
Firstly, a glyph is the visual representation of a letter (or, in some languages, part of one). There is, on the one hand, the letter A as an abstraction with no physical representation — it’s simply the first letter of the English alphabet.
On the other hand is the letter’s representation on the page or screen, which depends on its font (as in the examples of A shown here). Unicode does not concern itself with fonts, only with the abstraction. As far as Unicode is concerned, there is only one glyph for the letter A (and one for a — Unicode does recognize case, because that’s a formal part of language).
While we don’t see it in English, in other languages the letter A can also represent as À, Á, Â, Ã, Å, Ā, or others, and these are all distinct glyphs to Unicode (likewise their lowercase versions).
Secondly, a code point is just a unique number assigned to a glyph. Code points start at zero and count upwards. They are not necessarily contiguous — there are ranges of code points with no glyphs (yet) assigned. Unicode code points currently go up to U+10FFFF, which is just over 1.1 million possible values.
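Most programming languages let you inspect code points directly. A quick Python illustration (its ord and chr functions convert between characters and code points):

```python
import sys

# Code points are just integers assigned to characters.
print(ord("A"))         # 65 (hex 0x41)
print(hex(ord("😂")))   # 0x1f602

# chr() goes the other way, from code point to character.
print(chr(0x2205))      # ∅

# Python's character range tops out at the Unicode maximum, U+10FFFF.
print(hex(sys.maxunicode))  # 0x10ffff
```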
Finally, the physical encodings pack code points into the word sizes used by nearly all modern computing devices. Unicode is currently a 21-bit standard (a code point needs at most 21 bits), so while it fits comfortably in 32 bits, it requires some gyrations to fit in 16 or 8. That's what standards like UTF-8 are for.
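Those gyrations are easy to see if you encode the same characters at each width. A Python sketch (the "-be" suffix just pins the byte order so no byte-order mark is added):

```python
# Three characters of increasing code point: ASCII letter,
# euro sign, and an emoji from beyond the 16-bit range.
for ch in ["A", "€", "😀"]:
    print(f"U+{ord(ch):04X}:",
          len(ch.encode("utf-32-be")), "bytes in UTF-32,",
          len(ch.encode("utf-16-be")), "in UTF-16,",
          len(ch.encode("utf-8")), "in UTF-8")
# U+0041: 4 bytes in UTF-32, 2 in UTF-16, 1 in UTF-8
# U+20AC: 4 bytes in UTF-32, 2 in UTF-16, 3 in UTF-8
# U+1F600: 4 bytes in UTF-32, 4 in UTF-16, 4 in UTF-8
```

UTF-32 is simple but fat; UTF-16 and UTF-8 use variable-length sequences to stay compact.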
Getting a bit more detailed, Unicode divides into four basic sections, two of them abstract, two of them concrete.
Firstly, there is the Abstract Character Repertoire (ACR), which is the collection of glyphs Unicode includes. There is no order other than within an alphabet or character set that has one. All the special characters and emojis are included in the ACR. Note that these entries are just text — for instance: “LATIN CAPITAL LETTER A” or “FACE WITH TEARS OF JOY”
Secondly, there is the Coded Character Set (CCS), which is also abstract, but bridges towards a concrete implementation. The CCS is the set of non-negative integers, the code points, assigned to each member of the ACR. Note that these entries make no reference to physical size — bits, let alone bytes.
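Both abstract layers are queryable from Python's standard unicodedata module, which makes a nice demonstration: the official name is the ACR entry, and the integer is the CCS entry.

```python
import unicodedata

# Each abstract character has an official name (its ACR entry)...
print(unicodedata.name("A"))   # LATIN CAPITAL LETTER A
print(unicodedata.name("😂"))  # FACE WITH TEARS OF JOY

# ...and an assigned code point (its CCS entry),
# conventionally written U+ followed by hex digits.
print(f"U+{ord('A'):04X}")     # U+0041
print(f"U+{ord('😂'):04X}")    # U+1F602
```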
[See the Unicode Code Charts for listings of the ACR-CCS “scripts” (language related groups of glyphs).]
Thirdly, there are the Character Encoding Forms (CEF), and now the rubber starts to meet the road. These describe how the CCS code points fit into physical machine widths of 32, 16, and 8 bits. A CEF defines how a machine or database handles Unicode internally.
Lastly, there are the Character Encoding Schemes (CES), which map the CCS code points to octets (bytes). These schemes are important for byte-based storage (e.g. disks) and especially for the internet, which is based on octets. UTF-8 is a CES — probably the most important and common one. Most webpages use UTF-8.
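You can watch a CES at work by asking for the actual octets. In Python, str.encode applies the scheme; note how the ASCII letter keeps its single-byte value under UTF-8, which is one big reason UTF-8 won on the web:

```python
text = "A€😀"
octets = text.encode("utf-8")

# A stays one byte (0x41); € takes three; the emoji takes four.
print(octets.hex(" "))  # 41 e2 82 ac f0 9f 98 80
```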
[See Unicode Technical Report #17 for a more detailed account of these four sections]
The USA never needed much more than ASCII (the American Standard Code for Information Interchange), but much of the rest of the world doesn't have character sets that fit into 8-bit bytes (ASCII, a 7-bit code, fits with room to spare).
Asian languages, however, have large character sets — thousands, if not tens of thousands, of distinct glyphs. Historically, countries found their own ways of dealing with this (transmission and storage have long been byte-based), and there existed a metaphoric (but very real) Tower of Babel situation that made life hell for administrators and users.
Unicode embraces all the glyphs of the world. Problem at long last solved.
As for the UTF-8 issues above, there are two possible solutions, depending on where the problem is happening.
The fancy apostrophe and quote characters aren't ASCII. Before Unicode they were defined in different versions of extended ASCII (and not always with the same encoding). When you paste them, the multi-byte UTF-8 sequences that encode those fancy characters get read as individual extended-ASCII characters (see my other post for details).
The solution is to either edit the pasted text to restore the fancy characters, or replace those fancy characters in the original with their plain brown ASCII versions.
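If you'd rather not retype anything, the damage can often be reversed in code by undoing the wrong decode. A Python sketch, assuming the text was mis-read as Windows-1252 (a different extended-ASCII variant would need its own codec name):

```python
garbled = "I didnâ€™t imagine that!"

# Turn the characters back into the original bytes,
# then decode those bytes the way they were meant to be read.
fixed = garbled.encode("cp1252").decode("utf-8")
print(fixed)  # I didn’t imagine that!
```

This only works if no bytes were lost or substituted along the way, but in the common copy-paste case it restores the text exactly.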
In an HTML context, one can also use HTML codes such as &ldquo; (left double-quote) and &rdquo; (right double-quote). The fancy apostrophe is &rsquo; (right single-quote). One can also use the &#…; form to insert any Unicode character given its code point.
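Python's standard html module understands those HTML character references, so you can check what any of them expands to:

```python
import html

# Named references expand to the fancy Unicode characters...
print(html.unescape("&ldquo;Hi&rdquo;"))  # “Hi”

# ...and the numeric form works for any code point
# (8217 is decimal for U+2019, the curly apostrophe).
print(html.unescape("&#8217;"))  # ’
```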
Since the problem usually happens when pasting nicely formatted text (say from a Word document) to some web-based source (comment box or whatever), either solution may be helpful.
This is the month of giving thanks, and I’m hardly the first to give thanks for Unicode. It really has made my life as a computer programmer (and as a user) a lot better. Standards are sometimes more of a pain than a blessing, but I can’t say I’ve ever found any downside in Unicode.
Stay Unicoded, my friends! Go forth and spread beauty and light.