Why UTF-8 is a train wreck (or: UNIX Doesn’t Represent Everyone)

This post won’t go into the gory details of Unicode or the UTF-8 encoding; that ground has been covered elsewhere better than I could ever hope to cover it here. What we’re looking at today is almost as much political as technical, although technical decisions play a huge part in the tragedy. My claim is that UTF-8, for all its lofty compatibility goals, fails miserably in the realm of actual, meaningful compatibility.

The supposed brilliance of UTF-8 is that code points 0-127 are encoded exactly as they are in 7-bit ASCII, so a data stream containing purely ASCII data never needs more than one byte per character. This is all well and good, but outside of UNIX and its derivatives, the vast majority of ASCII-based hardware and software made heavy use of the high-order bit, assigning characters to byte values 128-255 through 8-bit extensions such as the PC code pages. UTF-8 reserves that bit for its own purposes: a byte with the high-order bit set is either part of a multi-byte sequence, where the leading bits of the first byte encode how many more bytes follow, or it is rejected as invalid. This makes 7-bit ASCII (as well as encodings touting 7-bit ASCII compatibility) little more than a mental exercise for most systems: like it or not, the standard for end-user systems was set by x86 PCs and MS-DOS, not UNIX, and MS-DOS and its derivatives make heavy use of the high-order bit.

UNIX maintained 7-bit purity in most implementations, as mandated by its own portability goals, and UTF-8’s design was sketched out on a New Jersey diner placemat by Ken Thompson, co-creator of UNIX, and Rob Pike, one of its most prolific contributors. UTF-8 effectively solved the problem for most UNIX systems, which were pure 7-bit systems from the beginning. But why should UTF-8’s massive shortcomings have been foisted upon everyone else, as if UNIX, like many of its proponents, were some playground bully, shoving its supposed superiority down everyone else’s throats?
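
To make the high-order-bit complaint concrete, here is a minimal Python sketch. The helper name `utf8_sequence_length` is my own invention for illustration, and CP437 (the original IBM PC code page) is only a stand-in for "an 8-bit ASCII extension"; any other code page would make the same point.

```python
from typing import Optional


def utf8_sequence_length(lead: int) -> Optional[int]:
    """Return how many bytes a UTF-8 sequence starting with `lead` occupies.

    Returns None when `lead` cannot start a sequence (a continuation byte,
    or a value that well-formed UTF-8 never uses as a lead byte).
    Hypothetical helper for illustration, not a library function.
    """
    if lead < 0x80:            # 0xxxxxxx: plain 7-bit ASCII, one byte
        return 1
    if 0xC2 <= lead <= 0xDF:   # 110xxxxx: one continuation byte follows
        return 2
    if 0xE0 <= lead <= 0xEF:   # 1110xxxx: two continuation bytes follow
        return 3
    if 0xF0 <= lead <= 0xF4:   # 11110xxx: three continuation bytes follow
        return 4
    return None                # 10xxxxxx continuation bytes and unused values


# Pure ASCII data is already valid UTF-8, byte for byte.
assert "cafe".encode("ascii").decode("utf-8") == "cafe"

# The same word with an accent, as an MS-DOS program would have stored it.
dos_bytes = "café".encode("cp437")    # b'caf\x82'

# A UTF-8 decoder sees 0x82 as a stray continuation byte and gives up.
decode_failed = False
try:
    dos_bytes.decode("utf-8")
except UnicodeDecodeError:
    decode_failed = True

# UTF-8 itself needs two bytes for the character CP437 stored in one.
utf8_bytes = "café".encode("utf-8")   # b'caf\xc3\xa9'
```

Nothing in the byte 0x82 tells the decoder which code page it came from; all UTF-8 can say is that the byte does not fit its own pattern.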

It should not. The UNIX philosophy, like functional programming, microkernels, role-based access control, and RISC, has its merits, but it is not the only kid on the block, and solutions like UTF-8 that just happen to work well in UNIX shouldn’t be forced upon environments where they only break things. Better to make a clean break to a sane, fixed-width encoding like UTF-32, perhaps providing runtimes for both ASCII (including its 8-bit extensions) and the new encoding so that software can be ported piecemeal. At least with something like UTF-32, data from other encodings can be programmatically converted to it, whereas with UTF-8 and its two-bit meddling with the 8th bit, there’s no way of knowing whether you’re dealing with invalid code points, kludgey shift characters, or some ASCII extension that was used for a meaningful purpose.
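
As a rough sketch of the kind of programmatic conversion I have in mind, the Python below turns bytes from an 8-bit code page into fixed-width UTF-32. The function name `legacy_to_utf32` is hypothetical, and the sketch assumes the caller already knows which code page the data was written in (CP437 here); the bytes themselves do not carry that information.

```python
def legacy_to_utf32(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known 8-bit code page as little-endian UTF-32.

    Hypothetical helper for illustration. `source_encoding` must be supplied
    by the caller; it cannot be recovered from `data` alone.
    """
    # "utf-32-le" writes no byte-order mark, so every code point
    # occupies exactly four bytes in the output.
    return data.decode(source_encoding).encode("utf-32-le")


dos_bytes = b"caf\x82"                        # "café" as stored under CP437
utf32_bytes = legacy_to_utf32(dos_bytes, "cp437")

# Fixed width: four output bytes for every input character.
assert len(utf32_bytes) == 4 * len(dos_bytes)

# Fixed width also means character i is a simple slice, with no scanning.
index = 3
code_point = int.from_bytes(utf32_bytes[4 * index:4 * index + 4], "little")
assert chr(code_point) == "é"
```

The trade-off is size: every character costs four bytes, ASCII included.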
