Unicode, UTF-8, and 💖
I recently ran into a problem with one of my side projects and, in the process of fixing it, learned quite a bit about Unicode. Croniker (“cron” + “moniker”) is an app I wrote that lets you schedule changes to your Twitter name in advance. It’s like Buffer, but for your Twitter name.
So one day, I scheduled an all-emoji name (like “💖🎉”) in Croniker. The next day, I noticed that it hadn’t been sent to Twitter and checked the logs. There was a semi-helpful error message from Twitter’s API:
Account update failed: Name can't be blank.
When twitter dot com gave me the same error message, I knew the problem was on Twitter’s side. So: what the heck was happening? I was really confused for weeks until I learned more about how Unicode works and which Unicode characters Twitter allows. Let’s learn about Unicode!
What is Unicode even
Unicode is a set of numbers, called codepoints, and each codepoint maps to a character. For example, the codepoint 0x61, usually written as “U+0061”, maps to a lowercase “a”. Codepoints are conventionally written in hexadecimal, so U+0061 is actually 97 in base 10. The Unicode standard doesn’t say anything about how those codepoints are encoded on disk. It just says: “Hey, U+0061 is LATIN SMALL LETTER A, so when you see some bits and interpret them as U+0061, make sure they come out looking like ‘a’, OK?”
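In Python, for example, the built-ins `ord` and `chr` hop between characters and codepoints (a quick sketch):

```python
# A codepoint is just a number; hexadecimal is only the conventional
# way of writing it down.
codepoint = ord("a")   # character → codepoint
print(hex(codepoint))  # 0x61
print(codepoint)       # 97 in base 10
print(chr(0x61))       # codepoint → character: a
```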
When a codepoint is stored on disk, it’s encoded as one or more code units. In UTF-8¹, which is the most popular Unicode encoding, each code unit is 8 bits. A codepoint is encoded as more than one UTF-8 code unit if its number doesn’t fit in 7 bits (like U+1F43C “PANDA FACE”, which requires four code units). You may have also heard of ASCII, which was the most widely used encoding scheme before UTF-8.
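You can watch the code units pile up in Python by encoding single characters as UTF-8 and counting the bytes (each byte is one code unit):

```python
# "a" is U+0061, which fits in 7 bits, so UTF-8 needs one code unit.
print(len("a".encode("utf-8")))           # 1

# U+1F43C "PANDA FACE" is a much bigger number, so UTF-8 spreads it
# across four code units.
print("\U0001F43C".encode("utf-8"))       # b'\xf0\x9f\x90\xbc'
print(len("\U0001F43C".encode("utf-8")))  # 4
```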
Unlike UTF-8, which uses a variable number of code units and can represent every Unicode codepoint, ASCII encodes each of its characters in a fixed 7 bits. This means it can only represent 128 (2^7) characters, and since ASCII was originally designed by English-speaking people, those 128 characters are mostly English letters, digits, and punctuation. This lack of support for other languages is one of the reasons that UTF-8 was created. An interesting and intentional property of UTF-8 is that it’s backwards-compatible with ASCII: a string that’s encoded in ASCII can be read by a program that’s assuming UTF-8 and it will work just fine.
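Here’s a small Python sketch of that backwards compatibility:

```python
# Bytes produced by an ASCII encoder...
ascii_bytes = "hello".encode("ascii")

# ...are byte-for-byte identical to the UTF-8 encoding of the same
# string, so a UTF-8-aware program reads them just fine.
assert ascii_bytes == "hello".encode("utf-8")
print(ascii_bytes.decode("utf-8"))  # hello
```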
🛩 Planes 🛩
Now that we know how Unicode is (usually) encoded, let’s talk about how the standard is structured. For convenience, Unicode is divided into planes, which are contiguous groups of 65,536 codepoints. The first plane, Plane 0, is called the Basic Multilingual Plane (or BMP) and runs from U+0000 to U+FFFF. Almost all modern languages are represented in the BMP, but there are 17 total planes. Unicode is big!
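Because each plane holds exactly 65,536 (2^16) codepoints, you can find a character’s plane with integer division, as in this Python sketch:

```python
def plane(character):
    """Return the number of the Unicode plane a character lives in."""
    return ord(character) // 0x10000  # 0x10000 == 65,536

print(plane("a"))           # 0: the Basic Multilingual Plane
print(plane("\u2764"))      # 0: ❤ is also in the BMP
print(plane("\U0001F4F2"))  # 1: 📲 is in the next plane up
```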
Now, some emoji are in the BMP, but when emoji got popular, a lot more emoji were added to Unicode. In fact, they added so many emoji that they ran out of space for them in the BMP. So newer emoji are in Plane 1, the Supplementary Multilingual Plane. It’d be nice if all of the emoji were in the same plane, but to be fair, the Unicode committee didn’t know that emoji would become a Thing. When they added a heart symbol in 1993, they had no idea that they also needed to save space for 📲 (“Mobile Phone With Rightwards Arrow at Left”).
Bringing it back 💫
So why couldn’t I set my username to “💖🎉”? It turns out that Twitter strips out all characters that aren’t in the Basic Multilingual Plane. That means ❤ (U+2764) is allowed in usernames, but new emoji like 💖 (U+1F496) aren’t, because they’re in the Supplementary Multilingual Plane.
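I don’t know exactly what Twitter’s code looks like, but a filter with the same effect is easy to sketch in Python: keep only the characters whose codepoints fit in the BMP (U+0000 through U+FFFF). The function name here is my own invention:

```python
def strip_non_bmp(name):
    """Drop every character outside the Basic Multilingual Plane."""
    return "".join(c for c in name if ord(c) <= 0xFFFF)

print(strip_non_bmp("\u2764hi"))              # ❤hi: everything is in the BMP
print(strip_non_bmp("\U0001F496\U0001F389"))  # 💖🎉 → empty string! Both
                                              # characters get stripped, so the
                                              # name really is blank.
```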
Discovering exactly what was happening with my Twitter username took weeks, but it was so satisfying! I had to read a lot of blog posts from a lot of people to figure it out. I linked to some of those posts below, and I recommend reading them all to get even more detail on Unicode and emoji. Now I feel like I can talk a little bit more about How Modern Computers Work.
Here are some of the helpful resources I used to learn more about Unicode!
- Monica Dinculescu wrote a truly excellent post about how an emoji gets rendered. Thank you, Monica, for teaching me about codepoints and text shaping!
- Joel Spolsky’s article on Unicode is a deserved classic, and taught me about the history of character encodings.
- Eevee wrote about the technical aspects of Unicode.
- @FakeUnicode is an excellent Twitter account that taught me about the BMP.
- Listen to me talk about emoji and Unicode on the Bikeshed podcast.
¹ UTF-16 is similar in spirit to UTF-8, but each of its code units is 16 bits instead of 8. For text that’s mostly ASCII, that means UTF-16 takes up about twice as much space, which is one reason most people use UTF-8 instead. You can read more about UTF-8 vs UTF-16 vs UTF-32.
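For example, comparing byte counts in Python (utf-16-le is UTF-16 without the two-byte byte-order mark that Python’s plain utf-16 codec prepends):

```python
# ASCII-range text: UTF-8 spends one byte per character, UTF-16 two.
print(len("hello".encode("utf-8")))       # 5
print(len("hello".encode("utf-16-le")))   # 10

# Outside the BMP the sizes even out: UTF-8 uses four 8-bit code
# units, UTF-16 uses a surrogate pair (two 16-bit code units).
print(len("\U0001F496".encode("utf-8")))      # 4
print(len("\U0001F496".encode("utf-16-le")))  # 4
```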