Unicode in Haxe 4

We finally stopped doing pointless stuff and focused on emojis

Article by Simon Krajewski on 2018-09-07.


If you listened to any of Nicolas' keynotes during the last 5 or so years, you know that the next Haxe version at the time would finally have Unicode support. And yet, that never really worked out. Now, with Haxe 4 on the horizon, we can finally look forward to having this shortcoming addressed.

Why do we need Unicode?

There's a lot of variety in natural languages, even at the most basic level: their symbols. "Symbol" is a broad term and its meaning depends on the language. For western languages, we can equate "symbol" with "letter". What you're reading right now are words consisting of letters. These letters come from the Latin alphabet, and many European languages extend this alphabet. For instance, the German alphabet has the umlauts Ä, Ö and Ü, while the French like to put accents on their e to turn it into è, é or ê.

There are, of course, other alphabets. The name alphabet itself comes from the Greek "alpha beta", i.e. the first two letters α and β of the Greek alphabet. Everyone is familiar with π, too. Then there's the Cyrillic script used in Eastern Europe. There's also the Arabic and Hebrew alphabets, and more.

If we enumerated all the letters in these alphabets, we would already get a lot of symbols that we have to deal with. And then we look at Asia. We can start with Japanese and two of its writing systems: hiragana and katakana. They are syllabaries, which means their symbols are not letters but more akin to syllables. Japanese also uses kanji, which are symbols adopted from Chinese. If you add Chinese, Japanese and Korean, you get CJK and according to the official Unicode FAQ, Unicode "supports over 80,000 CJK characters".

The takeaway here is that there are a lot of symbols we have to represent.

So, what is Unicode?

Let's ask Wikipedia:

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text

Historically, the development of computer technology was US-centric. It is thus not surprising that the "American Standard Code for Information Interchange", or ASCII, was developed in the 1960s and became dominant. ASCII covers exactly what was needed at the time: The Latin alphabet and common punctuation. The International Organization for Standardization (ISO) extended this in their ISO 8859 standards to include letters used in European languages. For instance, ISO 8859-1 covers German, French and others, while ISO 8859-3 supports Turkish.

It is important to understand that these are distinct standards; you had to pick one or the other. This means that a document could support either French or Turkish symbols, but not both at the same time. The reason for that is that memory is limited, and used to be limited much more. While we're used to gigabytes of RAM and terabytes of hard drive storage, the situation was more dire in the 1990s. Hence, it was desirable to represent a character as exactly one byte. A byte, as we know, has 256 possible values - not nearly enough to represent every symbol we need, even if we ignore restrictions like compliance with the ASCII standard.

Unicode acknowledges this problem and offers various solutions. We can categorize these solutions like so:

  • we use more bytes per character
  • we use a variable encoding length

UCS-2 and UTF-8

Nowadays, UTF-8 is the most used encoding across the web. It is a variable width encoding, which means that the number of bytes per character may vary from character to character. What we do when dealing with a Unicode string is the following:

  • Read a byte.
  • Is it < 128? If yes, we have our character.
  • Otherwise, read the next byte and somehow add it together with the previous one.
  • Is it < ...? If yes, we have our character.
  • Otherwise, read the next byte and ...

This goes on for up to 4 bytes. I omit the details of how exactly the total value is calculated, but the idea should be obvious regardless. Likewise, the pros and cons are easy to see:

  • good: We save a lot of space because more than one byte is only used if necessary.
  • good: We can represent any character.
  • bad: We can't tell at which position within a UTF-8 string character number X starts.

The latter means that an operation such as string.charAt(position) is not performed in constant time (O(1)), but in linear time (O(n), with n being the number of characters in the string). For large strings, this can easily lead to performance problems.
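
To make that cost concrete, here is a minimal sketch of the scanning such a lookup has to do. The helper byteOffsetOfChar and the example string are made up purely for illustration; a real decoder would also validate its input:

import haxe.io.Bytes;

class Utf8Scan {
    // Sketch only: find the byte offset at which character number `index`
    // starts in a UTF-8 encoded buffer. We have to walk the bytes from the
    // beginning, which is why this lookup is linear in the string length.
    static function byteOffsetOfChar(b:Bytes, index:Int):Int {
        var offset = 0;
        for (i in 0...index) {
            var first = b.get(offset);
            // The first byte of each character tells us how many bytes it occupies.
            var size = if (first < 0x80) 1 // 0xxxxxxx: one byte (ASCII)
                else if (first < 0xE0) 2   // 110xxxxx: two bytes
                else if (first < 0xF0) 3   // 1110xxxx: three bytes
                else 4;                    // 11110xxx: four bytes
            offset += size;
        }
        return offset;
    }

    static public function main() {
        var b = Bytes.ofString("aäb"); // UTF-8 bytes: 61 c3 a4 62
        trace(byteOffsetOfChar(b, 2)); // 3: "b" only starts after 1 + 2 bytes
    }
}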

In contrast, UCS-2 is a fixed width encoding that uses 2 bytes per character. The pros and cons are just the opposite of UTF-8:

  • good: We can tell at which position within a UCS-2 string character number X starts.
  • bad: We can only represent 65,536 characters (known as the Basic Multilingual Plane).
  • bad: We waste some space because we use 2 bytes even if we don't need them.

It's the classic tradeoff: speed vs. memory. UCS-2 strings may use twice as much memory as UTF-8 ones, but operations such as charAt are performed in constant time.

For the sake of completeness, UTF-32 has to be mentioned. It's a fixed width encoding that uses 4 bytes per character, which means that it uses a lot of space even by today's standards.
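
For comparison, here is what the same lookup looks like with a fixed 2-byte encoding. The helper, the manually built buffer and the little-endian byte order are assumptions made purely for illustration:

import haxe.io.Bytes;

class Ucs2Scan {
    // Sketch only: with 2 bytes per character, character number `index` always
    // starts at byte offset index * 2, so no scanning is needed.
    static function charCodeAtUcs2(b:Bytes, index:Int):Int {
        var offset = index * 2;
        return b.get(offset) | (b.get(offset + 1) << 8); // assuming little-endian
    }

    static public function main() {
        // Manually build a 2-byte-per-character buffer holding "aä".
        var b = Bytes.alloc(4);
        b.set(0, 0x61); b.set(1, 0x00); // 'a' = U+0061
        b.set(2, 0xE4); b.set(3, 0x00); // 'ä' = U+00E4
        trace(charCodeAtUcs2(b, 1)); // 228, i.e. 0xE4, found in constant time
    }
}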

Unicode in Haxe

For targets where we handle the string implementation ourselves (hxcpp, HashLink, Eval), we decided to use UCS-2 encoding. Initial benchmarks showed that UTF-8 encoding was too slow in many cases. This is, in part, due to our String API. Operations such as charAt(offset) and charCodeAt(offset) are very common in Haxe code. In fact, the only way to iterate over a String in Haxe 3 is to loop over its length and get each individual character. We hope to address this with support for key => value iteration. Of course, our String API originates from the Flash/JavaScript APIs, and it is probably not a coincidence that JavaScript engines also use UCS-2 encoding.
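
For reference, this is the per-character loop the paragraph above refers to; with a UTF-8 backed string every charCodeAt call would have to scan from the start, while with UCS-2 it is a constant-time lookup:

class IterateString {
    static public function main() {
        var s = "häxe";
        // The Haxe 3 way of iterating a String: loop over its length and
        // fetch each individual character by index.
        for (i in 0...s.length) {
            trace(s.charCodeAt(i));
        }
    }
}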

So far, the only API change we have made is an additional encoding argument to the following functions:

  • haxe.io.Bytes.ofString
  • haxe.io.Bytes.getString
  • haxe.io.BytesBuffer.addString

There are currently two supported values:

  • UTF8: The String is encoded as UTF-8. This is the default.
  • RawNative: The String is encoded using the native representation, which might vary per target.

We can see this by compiling the following example to e.g. JavaScript:

import haxe.io.Bytes;

class Main {
    static public function main() {
        var s = "ä";
        var bUtf8 = Bytes.ofString(s, UTF8);
        var bNative = Bytes.ofString(s, RawNative);
        trace(bUtf8.toHex()); // c3a4
        trace(bNative.toHex()); // e400
    }
}

The UTF8 version is guaranteed to give the same result on all targets. However, the RawNative version might give different results on different targets. For instance, compiling this to Python yields c3a4 for both traces, which means Python uses UTF-8 as its native encoding. We do, however, guarantee that an ofString to getString roundtrip with RawNative yields a string that is equal to the original input.
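
A small sketch of that roundtrip (the class name is chosen arbitrarily): encoding with RawNative and decoding with the same argument gives back the original string, even though the bytes in between differ per target:

import haxe.io.Bytes;

class RoundTrip {
    static public function main() {
        var s = "ä";
        // Encode using the target's native representation...
        var b = Bytes.ofString(s, RawNative);
        // ...and decode with the same encoding: the result equals the input.
        var s2 = b.getString(0, b.length, RawNative);
        trace(s == s2); // true
    }
}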

Further Unicode work

We're currently still ironing out some things, but there will be a solid first iteration of Unicode support in Haxe 4. Beyond that, we have plans to provide more API to convert between various formats. There is already an open pull request which we will look into for Haxe 4.1. As usual, we're always happy to see more issues being reported on our issue tracker because we clearly don't have enough problems as it is!

Overall, it's a lot of work to get everything right, but that is a small price to pay for the ability to have emojis in your string literals! 😒
