Unicode Transformation Format (UTF-8)

So you’ve heard of UTF-8, but what exactly is it?

UTF stands for Unicode Transformation Format. UTF-8 is one way to represent Unicode characters.

Setup

Ever wondered how emojis are represented in your computer?

If my computer only works with 1s and 0s, how is 🤯 represented?

This is the purpose of UTF.

Previously, computers mostly used ASCII, a limited set of 128 characters covering the English alphabet, digits, punctuation, and some control characters.

If someone wanted to send you a message in Japanese or send emojis, this was simply not possible!

Enter Unicode.

Unicode assigns a unique number, called a code point, to each emoji, letter, and anything else you need to write on your computer.
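A quick way to peek at these codes in Python is ord(), which returns a character's Unicode code point:

print(ord('A'))        # 65
print(ord('🤯'))       # 129327
print(hex(ord('🤯')))  # 0x1f92f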

How does it work?

UTF-8 encodes each Unicode code point into between 1 and 4 bytes.

For example, here is the 🤯 from earlier:

input_str = "🤯"
utf8_encoded = input_str.encode('utf-8')  # a bytes object, one integer per byte
utf8_bytes = [f"{byte:08b}" for byte in utf8_encoded]  # format each byte as 8 binary digits
print(utf8_bytes)
# ['11110000', '10011111', '10100100', '10101111']

The prefix of the first byte tells us how many bytes the sequence contains, and every continuation byte starts with 10 to mark it as part of an ongoing sequence.

The 11110 prefix on the first byte above indicates a 4-byte sequence. If the sequence were only one byte long, the first byte would start with 0 (which is also why UTF-8 is backward compatible with ASCII); 2- and 3-byte sequences start with 110 and 1110 respectively.

To recover the actual code point, we strip the 11110 prefix from the first byte and the 10 prefix from each continuation byte. Concatenating the remaining bits gives us the code point.

So the value is:

000 + 011111 + 100100 + 101111
= 000011111100100101111
= 129327 (base 10)

This value, 0x1F92F in hexadecimal (written U+1F92F in Unicode notation), corresponds to the exploding head emoji 🤯.
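We can verify the arithmetic in Python. Here is a minimal sketch that assumes a well-formed 4-byte sequence; a real decoder would also validate the continuation bytes and reject overlong encodings:

raw_bytes = [0b11110000, 0b10011111, 0b10100100, 0b10101111]

# Keep the low 3 bits of the lead byte (everything after the 11110 prefix)
code_point = raw_bytes[0] & 0b00000111
# Shift in the low 6 bits of each continuation byte (everything after the 10 prefix)
for byte in raw_bytes[1:]:
    code_point = (code_point << 6) | (byte & 0b00111111)

print(code_point)       # 129327
print(hex(code_point))  # 0x1f92f
print(chr(code_point))  # 🤯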

Other UTF-X

There are other ways to encode Unicode characters like UTF-16 and UTF-32.

Here is the same emoji as earlier, but in UTF-16.

input_str = "🤯"
utf16_encoded = input_str.encode('utf-16')  # native byte order, with a BOM prepended
utf16_bytes = [f"{byte:08b}" for byte in utf16_encoded]
print(utf16_bytes)
# ['11111111', '11111110', '00111110', '11011000', '00101111', '11011101']
# (output shown for a little-endian machine)

This one is a bit different.

While UTF-8 works one byte at a time, UTF-16 works two bytes at a time. That is where the names come from: their code units are 8 and 16 bits wide, respectively.

The first two bytes of the UTF-16 encoded string are the byte order mark (BOM), which indicates the endianness of the code units that follow:

  • 11111111 11111110 (or 0xFFFE in hexadecimal) indicates little-endian
  • 11111110 11111111 (or 0xFEFF in hexadecimal) indicates big-endian
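As a rough sketch, we can inspect the BOM ourselves and decode the rest with Python's explicit-endian codecs (utf-16-le and utf-16-be do not emit a BOM of their own):

data = "🤯".encode('utf-16')  # native byte order, BOM included
if data[:2] == b'\xff\xfe':
    print("little-endian:", data[2:].decode('utf-16-le'))
elif data[:2] == b'\xfe\xff':
    print("big-endian:", data[2:].decode('utf-16-be'))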

Converting UTF-16 back to a code point is a bit more involved: characters beyond the 16-bit range, like our emoji, are split across two code units called a surrogate pair. We won't cover every detail here, but the sketch below shows the core of the arithmetic.
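This assumes a valid surrogate pair; a real decoder would also check that each half falls in the expected range:

high, low = 0xD83E, 0xDD2F  # the two 16-bit code units from the bytes above

# Each surrogate contributes 10 bits; the 0x10000 offset restores the full range
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(code_point))  # 0x1f92f
print(chr(code_point))  # 🤯

UTF-32, by contrast, avoids surrogate pairs entirely by storing every code point directly in four bytes:

print([f"{byte:08b}" for byte in "🤯".encode('utf-32-be')])
# ['00000000', '00000001', '11111001', '00101111']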

Interestingly, UTF-16 is what JavaScript uses internally for its strings.

Conclusion

Unicode and the UTF encodings are very interesting. It's cool to understand how my characters are actually represented on the machine.