Why UTF-8 is a train wreck (or: UNIX Doesn’t Represent Everyone)

This post won’t go into the gory details of Unicode or the UTF-8 encoding; that ground has been covered elsewhere, and far better than I could ever hope to cover it here. What we’re looking at today is almost as much political as technical, although technical decisions play a huge part in the tragedy. What I am positing today is that UTF-8–for all its lofty compatibility goals–fails miserably in the realm of actual, meaningful compatibility.

The supposed brilliance of UTF-8 is that its code points numbered 0-127 are encoded exactly as in 7-bit ASCII, so that a data stream containing purely ASCII data will never need more than one byte per encoded character. This is all well and good, but the problem is that aside from UNIX and its derivatives, the vast majority of ASCII-capable hardware and software made heavy use of the high-order bit, assigning printable characters to byte values 128-255. UTF-8, however, claims that high-order bit for its own structure: a set high bit marks a byte as either the lead byte or a continuation byte of a multi-byte sequence, so legacy 8-bit data is either misinterpreted or rejected outright as invalid. This makes 7-bit ASCII (as well as encodings touting 7-bit ASCII compatibility) little more than a mental exercise for most systems: like it or not, the standard for end-user systems was set by x86 PCs and MS-DOS, not UNIX, and MS-DOS and its derivatives make heavy use of the high-order bit. UNIX maintained 7-bit purity in most implementations, as mandated by its own portability goals, and UTF-8’s ultimate specification was sketched out on a New Jersey diner placemat by Ken Thompson, co-creator of UNIX, and Rob Pike, one of its earliest and most prolific contributors. UTF-8 effectively solved the problem for most UNIX systems, which were pure 7-bit systems from the beginning. But why should UTF-8’s massive shortcomings have been foisted upon everyone else, as if UNIX–like many of its proponents–was some playground bully, shoving its supposed superiority down everyone else’s throats?
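
To make that concrete, here’s a minimal sketch (plain C; the function name and the sample bytes are mine, chosen purely for illustration) of how UTF-8 classifies each byte by its high-order bits. Anything from 0x80 upward is structural, so a legacy code-page stream that uses those values as printable characters can no longer stand on its own:

    #include <stdio.h>

    /* Classify a single byte according to UTF-8's bit patterns. */
    static const char *utf8_byte_kind(unsigned char b)
    {
        if ((b & 0x80) == 0x00) return "plain ASCII (0xxxxxxx): stands alone";
        if ((b & 0xC0) == 0x80) return "continuation (10xxxxxx): only valid inside a multi-byte sequence";
        if ((b & 0xE0) == 0xC0) return "lead byte of a 2-byte sequence (110xxxxx)";
        if ((b & 0xF0) == 0xE0) return "lead byte of a 3-byte sequence (1110xxxx)";
        if ((b & 0xF8) == 0xF0) return "lead byte of a 4-byte sequence (11110xxx)";
        return "never valid in UTF-8";
    }

    int main(void)
    {
        /* 0x82 is 'é' in code page 437 (MS-DOS); UTF-8 sees nothing
           but a stray continuation byte. */
        unsigned char samples[] = { 'A', 0x82, 0xC3, 0xFF };
        for (size_t i = 0; i < sizeof samples; i++)
            printf("0x%02X: %s\n", samples[i], utf8_byte_kind(samples[i]));
        return 0;
    }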

It should not. The UNIX philosophy, like functional programming, microkernels, role-based access control, and RISC, has its merits, but it is not the only kid on the block, and solutions like UTF-8 that just happen to work well in UNIX shouldn’t be forced upon environments where they only break things. Better to make a clean break to a sane, fixed-width encoding like UTF-32, perhaps providing runtimes for both ASCII (including its 8-bit extensions) and the new encoding so that software can be ported to it piecemeal. At least with something like UTF-32, data from other encodings can be programmatically converted to it, whereas with UTF-8 and its two-bit 8th-bit meddling, there’s no way of knowing whether you’re dealing with invalid byte sequences, kludgey shift bytes, or some 8-bit ASCII extension whose characters actually meant something.
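
And that conversion really is as mechanical as it sounds. Here’s a minimal sketch, assuming the source encoding is ISO-8859-1 (whose 256 byte values map one-to-one onto the first 256 Unicode code points); any other 8-bit code page would just need a 256-entry lookup table in place of the direct cast:

    #include <stdint.h>
    #include <stdio.h>

    /* Convert ISO-8859-1 (Latin-1) bytes to UTF-32 code points. Latin-1 is
       the trivial case: every byte value maps directly to the Unicode code
       point of the same number. Other 8-bit code pages would substitute a
       256-entry mapping table, but the conversion is just as mechanical. */
    static size_t latin1_to_utf32(const unsigned char *in, size_t len, uint32_t *out)
    {
        for (size_t i = 0; i < len; i++)
            out[i] = (uint32_t)in[i];   /* 0x00-0xFF -> U+0000-U+00FF */
        return len;
    }

    int main(void)
    {
        const unsigned char latin1[] = { 'n', 0xE9, 'e' };  /* "née" in Latin-1 */
        uint32_t utf32[sizeof latin1];
        size_t n = latin1_to_utf32(latin1, sizeof latin1, utf32);
        for (size_t i = 0; i < n; i++)
            printf("U+%04X\n", (unsigned)utf32[i]);
        return 0;
    }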


Now I’ve seen everything…

The president and CEO of OSEHRA recently posted the following announcement:

The Department of Veterans Affairs yesterday announced a decision to select a new electronic health record system based on the same platform that DoD purchased a couple of years ago. The announcement recognizes many unique needs of VA that differ from the DoD. VA would thus not be implementing an identical EHR. VA is trying to create a future health IT ecosystem that takes advantage of previous investments with this new platform, as well as connections with many other IT systems in the private sector. The industry trend toward open platforms, open APIs, and open source software is expected to remain integral to VA’s strategy to build a new and interoperable ecosystem. OSEHRA provides a valuable link joining VA to the broad health IT community. This activity will remain critical to the success of VA’s transition strategy by eliminating future gaps and conflicts in an ever more complex ecosystem. Transition to a new EHR system will require years of efforts and in-depth expertise in VistA that currently resides mostly in the OSEHRA community. Innovations in health IT such as cloud-based implementations, analytics, clinical decision support systems, community-based care, and connected health will come from domains external to traditional EHR systems. Recent VA investments in eHMP and DHP are examples of open source innovations external to traditional EHRs, and they are expected to evolve as new platforms within the VA’s emerging health IT ecosystem.

Seong K. Mun, PhD
President and CEO

I suppose if we have our heads in such a place where the sun doesn’t reach, we can pretend that the VA’s adoption of a proprietary EHR is somehow a victory for open source.

I suppose, however, that I shouldn’t be surprised, considering that OSEHRA is just a dog-and-pony show to allow the government to pretend that it supports open source while doing exactly the opposite.

It helps little that large and critical components of eHMP–which is admittedly an extremely impressive project–aren’t even published in OSEHRA’s code-in-flight releases.

In the sand hast thou buried thine own heads, OSEHRA. An ally you are not.

Hasta la VistA, Baby!


UPDATE

This article implies that VA dropping VistA would be good for VistA. This makes the assumption that the extra-governmental VistA community and private vendors (like MedSphere and DSS) would step in to fill the void left by VA’s departure from VistA development. If, instead, this community continues to expect salvation from within the VA bureaucracy, VistA will die.

Also, please remember that I do not in any way fault individual VA developers for the bumbling mismanagement of the product.

It brings me no joy to express the grim reality, but I believe that at least someone needs to speak the difficult truth: politicians have never been friendly to VistA, government cannot effectively manage software projects, and the only bright path forward for VistA is to get it out of the hands of corrupt government cronies like Shulkin.


I’m not going to wring my hands today.

Instead, I’d like to extend my sincerest good wishes to Secretary Shulkin and his team as they embark upon what is sure to be a long and difficult transition to the Cerner EHR. I really do hope it works out for them.

I’m also hardly able to contain my excitement for what this could mean for the future of VistA. Provided the VA stays the course with this plan, its future has never been brighter.

The VA has been trying to get out of software development for years, and has had VistA limping along on life support the whole time. Private-sector vendors outside the VA have been understandably hesitant to make major changes to the VistA codebase, because they haven’t wanted to break compatibility with the VA’s patch stream. But now, there’s a chance that the patch stream will dry up, along with the stream of bad code infected with the virus of Cache ObjectScript, and the VA’s marked indifference towards fixing structural problems with core modules like Kernel and FileMan. The VA always hated VistA, and they were atrociously incompetent custodians of it, from the moment it emerged from the rather offensively-named “underground railroad”. They suck at software development, so they should get out of that business and let the open source community take the reins.

This is not to say that there weren’t or aren’t good programmers at the VA: far from it, but VA’s bumbling, incompetent, top-heavy management bureaucracy forever hobbled their best programmers’ best intentions. And let’s be real: had Secretary Shulkin announced that VA was keeping VistA, it would be status quo, business as usual. VistA would still be VA’s redheaded stepchild, and the bitrot already plaguing it would get even worse. There was never the tiniest chance that the VA would wake up and start managing VistA well, much less innovating with it. And even if this Cerner migration fails (which is not at all unlikely), there will never be such a chance. VistA’s successes stem entirely from its origins as an unauthorized, underground skunkworks project by those great VistA pioneers who courageously thumbed their noses at bureaucratic stupidity. VistA only ever succeeded in spite of the VA, not because of it.

But, what about patient care? Won’t it get worse as a result of dropping such a highly-rated EHR?

Worse than what? VA sucks at that too, and always has. Long waiting lists, poor quality of care, bad outcomes, scheduling fraud, skyrocketing veteran suicides: none of this is related in any way to the VA’s technology, for better or worse. It’s just that pouring money into IT changes is a quick way for a bureaucrat, whose maximum career span is far too short to effect any real change, to appear to be doing something. When IT projects fail, they can dump the mess in their successors’ laps, or blame the contractor, and go on their merry way visiting fraud, waste, and abuse upon the taxpayer, while those who committed to making the ultimate sacrifice in service of king and country are left wondering why it still takes them months just to be seen.

So I sincerely do wish the VA the best of luck in its witless endeavor, and hope that they succeed, by whatever comical measure of success their bumbling allows. Hopefully, this will open the door for the open-source community to take the awesomeness that is VistA and bring it forward into a brighter and happier future.

Feel free to join me. Virtual popcorn and soda are free.

The Problem With Package Managers

As Linux moves farther away from its UNIX roots and more towards being yet another appliance for the drooling masses (the same drooling masses who just five years ago couldn’t grok the difference between a CD-ROM tray and a cup holder), our once-great proliferation of usable choices has dwindled, thanks to a tendency on the part of developers to target only Debian- or Red Hat-based distributions, with a strong bias towards Ubuntu on the Debian side. A few of the more generous developers will also target SuSE, and fewer still will distribute software as a distribution-agnostic tarball. This situation leaves users of other distributions in a precarious position, especially those of us who–like the author of this article–believe that systemd is a baroque, labyrinthine monument to bogosity (how Lennart Poettering manages to get hired by any reputable software development firm is an atrocity that boggles the mind–his other big “hit” is a three-coil, peanut-laden steamer of a solution-looking-for-a-problem called PulseAudio), and who would seek out one of the increasingly rare sysvinit-based distributions to get away from it.

This is a problem mostly due to package managers. If you’re on a Debian-based system, you get apt. Red Hat, yum. SuSE, zypper. These utilities should need no introduction, and are often praised by Linux users: a single command will install a package and all of its required shared libraries and dependencies, and another command will upgrade packages to the latest and greatest versions, all from a centralized, cloud-based repository or list of repositories. They do provide some convenience, but at a cost: the days of reliably being able to find a simple tarball that will work with the incantation of ./configure; make; make install seem to be numbered. This was a nice, cross-platform solution, and had the added benefit of producing binaries that were well-optimized for your particular machine.

One bright light in all this darkness is NetBSD’s pkgsrc tool: you check out a full source tree from a CVS repository, which creates a directory structure of categories (editors, databases, utilities, etc.), each containing further subdirectories representing individual packages. All you need to do is descend into the desired subdirectory and type an appropriate make incantation to download the package and its dependencies, build them, and install them to your system. Updates are similar: fetch the latest changes from the CVS repo and repeat the process.

However, not even pkgsrc has solved the other big problem with most package managers: the politics of getting new packages into the repositories. The Node.js package manager, npm, is the only one that comes close to doing this correctly (in the FOSS sense): you go to the npmjs.org website, create an account, choose a package name (and hope it hasn’t already been taken by another developer), and you are in charge of that little corner of the npm world. You manage your dependencies, your release schedule, your version scheme, the whole nine yards. With Linux distributions, it seems that only a blood sacrifice to the gatekeepers will allow you to contribute your own packages, and even when you get past their arcane requirements, it is still a mass of red tape just to publish patches and updated versions of your software. Node.js, for instance, has not been updated in the mainline distribution repositories since v0.10, which is by all measures an antique.

To meet my standards, three solutions should be employed together:

  • Publicly and brutally shame developers who release only deb and rpm packages but no ./configure; make; make install tarball until they are so insecure that they cry into their chocolate milk and do the right thing (or strengthen the developer gene pool by quitting altogether and opting for a job wiping viruses for drooling PC users with The Geek Squad)
  • Push the Linux distributions to abandon the brain-dead cathedral approach to repo management and opt for a more bazaar-like egalitarian approach like npm
  • Make countless, humiliating memes of Lennart Poettering in embarrassing and compromising contexts (this bit is more for the health of UNIX as a whole than for package managers, but it’s the duty of every good UNIX citizen)

 

ArcaOS 5.0: UNIAUD Update

I have discovered that part of my UNIAUD audio driver problem can be solved. Using the command-line UNIMIX.EXE utility, I can manually set the speaker output level. It turns out that sound was actually being generated, but only to the headphone jack.

There’s still another problem, however: desktop sounds repeat and are very noisy and filled with static.

I will be publishing a few screenshots of ArcaOS in the coming days.

Why C is Almost Always the Wrong Choice

C has no true string data type.
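
For readers who have only ever had a real string type, here’s a minimal sketch of what C offers instead: a bare char array with a zero byte marking the end, and library routines that simply walk memory until they find it (the variable names are mine, for illustration only):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* The closest thing C has to a string: a char array whose end is
           marked by a zero byte. */
        char greeting[6] = { 'h', 'e', 'l', 'l', 'o', '\0' };

        /* strlen() knows nothing about the array's size; it just walks
           memory until it happens to find a zero byte. */
        printf("%zu characters\n", strlen(greeting));   /* prints 5 */

        /* Overwrite the terminator and every string routine will march
           straight past the end of the array into whatever lies beyond. */
        greeting[5] = '!';
        /* strlen(greeting) is now undefined behavior. */
        return 0;
    }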

The common arguments defending this as a feature rather than a shortcoming go something like this:

  • Performance. The argument here is that statically-allocated, null-terminated char arrays are faster than accessing the heap, and by forcing the programmer to manage his own memory, huge performance gains will result.
  • Portability. This one goes along the lines that introducing a string type could introduce portability problems, as the semantics of such a type could be wildly different from architecture to architecture.
  • Closeness to the machine. C is intended to be as “close to the machine” as possible, providing minimal abstraction: since the machine has no concept of a string, neither should C.

If these arguments are true, then we shouldn’t be using C for more than a tiny fraction of what it is being used for today. The reality of these arguments is more like this:

  • Performance: I’m a programmer of great hubris who actually believes that I can reinvent the manual memory management wheel better than the last million programmers before me (especially those snooty implementers of high-level languages), and I think that demonstrating my use of pointers, malloc(), gdb, and valgrind makes me look cooler than you.
  • Portability: I’m actually daft enough to think that the unintelligible spaghetti of preprocessor macros in this project constitutes some example of elegant, portable code, and that such things make me look cooler than you.
  • Closeness to the machine: I’ve never actually developed anything that runs in ring zero, but using the same language that Linus Torvalds does makes me look cooler than you.

The technical debt this attitude has incurred is tremendous: nearly every application that gets a steady stream of security vulnerability patches is written in C, and the vast majority of those vulnerabilities are buffer overflows made possible by bugs in manual memory management code. How many times have BIND and sendmail been patched for exactly these problems?
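
To make the pattern concrete, here’s a minimal, deliberately contrived sketch of the kind of bug behind so many of those advisories (not code from any of those programs): a fixed-size buffer, a copy routine with no bounds check, and attacker-supplied input:

    #include <stdio.h>
    #include <string.h>

    /* The archetypal bug: a fixed-size stack buffer and copy routines
       that blindly trust the input to fit. */
    static void greet(const char *name)
    {
        char buf[16];
        strcpy(buf, "Hello, ");   /* 7 characters plus the terminator */
        strcat(buf, name);        /* no bounds check: a name longer than 8
                                     characters overruns buf and tramples
                                     the rest of the stack */
        printf("%s\n", buf);
    }

    int main(void)
    {
        greet("Ken");                                  /* fine */
        greet("AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA");     /* smashes the stack */
        return 0;
    }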

The truth is that most software will work better and run faster with the dynamic memory management provided by high-level language runtimes: the best algorithms for most common cases are well-known and have been implemented better than most programmers could ever do. For most corner cases, writing a shared library in C and linking it into your application (written in a high-level language) is a better choice than going all-in on C. This provides isolation of unsafe code, and results in the majority of your application being easier to read, and easier for open-source developers to contribute to. And most applications won’t even need any C code at all. Let’s face it: the majority of us are not writing kernels, database management systems, compilers, or graphics-intensive code (and in the latter case, C’s strengths are very much debatable).
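
A minimal sketch of that division of labor, assuming the performance-critical corner case is something like comparing two sample buffers (the function name, file names, and use case are invented for illustration): one small, auditable C function compiled as a shared library, with everything else living in the high-level language and calling in through its foreign-function interface.

    #include <stddef.h>
    #include <stdint.h>

    /* A small, self-contained hot loop that might genuinely be worth writing
       in C: the sum of squared differences between two sample buffers.
       Parsing, I/O, and error handling all stay in the high-level language. */
    uint64_t sum_sq_diff(const int16_t *a, const int16_t *b, size_t n)
    {
        uint64_t acc = 0;
        for (size_t i = 0; i < n; i++) {
            int64_t d = (int64_t)a[i] - (int64_t)b[i];
            acc += (uint64_t)(d * d);
        }
        return acc;
    }

    /* Build it as a shared library (GCC or Clang):
           cc -O2 -shared -fPIC -o libssd.so ssd.c
       then load libssd.so from the high-level side through whatever FFI the
       runtime provides (ctypes, cffi, N-API, JNA, and so on). */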

The long and short of it is that most software today is I/O-bound and not CPU-bound: almost every single one of the old network services (DNS servers, mail servers, IRC servers, HTTP servers, etc.) stands to gain absolutely nothing from being implemented in C, and they should all be written in high-level languages so that they can benefit from run-time bounds checking, type checking, and leak-free memory management.

Can I put out a CVE on this?

ArcaOS 5.0: Initial Impressions of the Latest OS/2 Distribution

I’ve been a fan of OS/2 since the 1995 release of OS/2 Warp 3.0. I stuck with it as my main operating system for around five years, by which time the difficulties inherent in making the sparsely-updated OS run on modern hardware forced me towards Linux, MacOS, and other operating systems for daily computing tasks. But, I’ve always tried to keep an OS/2 system around, even if not as my primary machine. An IBM Aptiva machine running OS/2 Warp Connect 3.0 was my OS/2 system from 2000-2004, and too many others to mention filled its shoes from then on.

I learned about eComStation a rather long time after its 2001 release, but was impressed with it right away, especially with the many improvements to its installer. I bought new licenses for eCS every time a new version was released, but always had to dig up relatively old hardware to support its limited selection of drivers. Laptops were typically a no-go.

When ArcaOS 5.0 was released, I picked up a license right away. Arca Noae’s online shop is quite straightforward: I simply added the personal license to my shopping cart, checked out, and waited to receive an e-mail saying that my personalized ISO image was ready for download. It was delivered as a compact 7-zip archive, which expanded to approximately 1.1GiB. That is too large for a CD-R, so I burned the image to DVD-R media and proceeded to install it on my Lenovo ThinkPad T530i (2.4GHz dual-core i3, 16GiB RAM, 1TiB SSHD storage, Intel HD3000 graphics).

My first look at the updated installer revealed that the ridiculously long product keys of the eComStation days have mercifully been dropped in favor of the personalized ISO approach, which embeds the license holder’s name in the ACPI driver (and possibly elsewhere, though I haven’t looked). I noticed that even in the installer, ArcaOS was running my laptop at full 1080p resolution. The installer detected the T530i’s on-board Ethernet NIC and automatically set it up for DHCP configuration. Sadly, the on-board Wi-Fi was neither detected nor supported.

When I got to the part of the installer where you select a destination volume for ArcaOS to format and install itself onto, it became clear that something was wrong: my 1TB hybrid hard drive was being reported as 511GiB with a corrupt partition table. It took two more false starts to correct this; the solution was changing the SATA controller mode in the BIOS from AHCI to compatibility mode and disabling UEFI entirely.

Once the system was installed–a process which took about half an hour–ArcaOS booted up to the familiar Workplace Shell UI in glorious 1080p. I immediately set about doing my usual customizations to make it feel more like my beloved OS/2 Warp Connect. However, not everything was perfect:

  • The MS-DOS and Win-OS/2 subsystems seem to be completely broken, both in full-screen and windowed mode. The command prompts for the former only produce a blinking cursor, and the latter locks up the system entirely. I’ve had problems before attempting to run DOS and Windows 3.x apps under OS/2 and eCS when they are installed on a volume larger than 2GiB, but have never seen them freeze entirely under any IBM release of OS/2 or in any release of eComStation.
  • While the Panorama VESA video driver supports my Intel HD 3000 GPU in 1080p resolution, attempting to increase color depth from 64K colors to 16M colors results in a complete system lock-up, requiring me to power cycle the machine in order to recover.
  • If the eCS-era tools for switching between Panorama, SNAP, and other video drivers are present in ArcaOS, I cannot find them. At one point, I tried to add OS components using Selective Install, which ended up resetting my display drivers to vanilla VGA: 640×480 and 16 colors. The only way I could find to fix this was to reinstall the OS from scratch.
  • The UniAud driver shows up in hardware manager, as does my on-board sound, but there is no sound from the speakers, save for the occasional high-pitched chirp.
  • MultiMac does not yet support my Intel Centrino 2200 Wi-Fi adapter. I expected this, as Wi-Fi chipsets are notoriously proprietary. I have high hopes for the forthcoming FreeBSD driver ports.

The improvements in video card support and Ethernet support, as well as in the installer, make ArcaOS a compelling update for any OS/2 user. However, Arca Noae has some non-trivial work to do in order to bring the product up to the same level of polish as eComStation 2.1.