Python strings represent text using a scheme known as Unicode. The way this works for most basic programs is pretty transparent, so if you’d like, you could skip this section for now and read up on the topic as needed. However, as string and text file handling is one of the main uses of Python code, it probably wouldn’t hurt to at least skim this section.
Abstractly, each Unicode character is represented by a so-called code point, which is simply its number in the Unicode standard. This allows you to refer to more than 120,000 characters in 129 writing systems in a way that should be recognizable by any modern software. Of course, your keyboard won’t have hundreds of thousands of keys, so there are general mechanisms for specifying Unicode characters, either by 16- or 32-bit hexadecimal literals (prefixing them with \u or \U, respectively) or by their Unicode name (using \N{name}).
>>> "\u00C6"
'Æ'
>>> "\U0001F60A"
'😊'
>>> "This is a cat: \N{Cat}"
'This is a cat: 🐈'
You can find the various code points and names by searching the Web, using a description of the character you need, or you can use a specific site such as http://unicode-table.com.

The idea of Unicode is quite simple, but it comes with some challenges, one of which is the issue of encoding. All objects are represented in memory or on disk as a series of binary digits—zeroes and ones—grouped in chunks of eight, or bytes, and strings are no exception. In programming languages such as C, these bytes are completely out in the open. Strings are simply sequences of bytes. To interoperate with C, for example, and to write text to files or send it through network sockets, Python has two similar types, the immutable bytes and the mutable bytearray. If you wanted, you could produce a bytes object directly, instead of a string, by using the prefix b:
>>> b'Hello, world!'
b'Hello, world!'
However, a byte can hold only 256 values, quite a bit less than what the Unicode standard requires. Python bytes literals permit only the 128 characters of the ASCII standard, with the remaining 128 byte values requiring escape sequences like \xf0 for the hexadecimal value 0xf0 (that is, 240).
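As a small sketch of what this means in practice (the variable name is mine), a bytes literal accepts only ASCII characters directly, anything above that range must be written with an escape, and indexing a bytes object yields plain integers rather than one-character strings:

```python
# Bytes literals may contain only ASCII characters directly;
# byte values above 127 need an \x escape.
data = b'caf\xe9'          # the Latin-1 encoding of "café"

# Indexing a bytes object yields plain integers, not characters.
print(data[3])             # 233, i.e. 0xe9
print(data[:3])            # b'caf'

# Writing the character directly in the literal is a SyntaxError:
#     b'café'  ->  SyntaxError: bytes can only contain ASCII literal characters
```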
It might seem the only difference here is the size of the alphabet available to us. That’s not really accurate, however. At a glance, it might seem like both ASCII and Unicode refer to a mapping between nonnegative integers and characters, but there is a subtle difference: where Unicode code points are defined as integers, ASCII characters are defined both by their number and by their binary encoding. One reason this seems completely unremarkable is that the mapping between the integers 0–255 and an eight-digit binary numeral is completely standard, and there is little room to maneuver. The thing is, once we go beyond the single byte, things aren’t that simple. The direct generalization of simply representing each code point as the corresponding binary numeral may not be the way to go. Not only is there the issue of byte order, which one bumps up against even when encoding integer values, there is also the issue of wasted space: if we use the same number of bytes for encoding each code point, all text will have to accommodate the fact that you might want to include a few Anatolian hieroglyphs or a smattering of Imperial Aramaic. There is a standard for such an encoding of Unicode, which is called UTF-32 (for Unicode Transformation Format 32 bits), but if you’re mainly handling text in one of the more common languages of the Internet, for example, this is quite wasteful.
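The byte-order issue mentioned above is easy to observe directly. As a sketch, here is the same code point encoded with the explicitly little-endian and big-endian variants of UTF-32, which Python exposes as separate codec names:

```python
# The same code point, U+00C6, laid out with opposite byte orders.
le = "Æ".encode("UTF-32-LE")   # least significant byte first
be = "Æ".encode("UTF-32-BE")   # most significant byte first
print(le)                      # b'\xc6\x00\x00\x00'
print(be)                      # b'\x00\x00\x00\xc6'

# Plain "UTF-32" resolves the ambiguity by prefixing a byte order
# mark (BOM); on little-endian machines it is b'\xff\xfe\x00\x00'.
print("Æ".encode("UTF-32")[:4])
```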
There is an absolutely brilliant alternative, however, devised in large part by computing pioneer Kenneth Thompson. Instead of using the full 32 bits, it uses a variable-width encoding, with fewer bytes for some scripts than others. Assuming that you’ll use these scripts more often, this will save you space overall, similar to how Morse code saves you effort by using fewer dots and dashes for the more common letters.11 In particular, the ASCII encoding is still used for single-byte encoding, retaining compatibility with older systems. However, characters outside this range use multiple bytes (up to four, in the current standard). Let’s try to encode a string into bytes, using the ASCII, UTF-8, and UTF-32 encodings.
>>> "Hello, world!".encode("ASCII")
b'Hello, world!'
>>> "Hello, world!".encode("UTF-8")
b'Hello, world!'
>>> "Hello, world!".encode("UTF-32")
b'\xff\xfe\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00 \x00\x00\x00w\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00!\x00\x00\x00'
As you can see, the first two are equivalent, while the last one is quite a bit longer. Here’s another example:
>>> len("How long is this?".encode("UTF-8"))
17
>>> len("How long is this?".encode("UTF-32"))
72

11. This is an important method of compression in general, used for example in Huffman coding, a component of several modern compression tools.
The difference between ASCII and UTF-8 appears once we use some slightly more exotic characters:
>>> "Hællå, wørld!".encode("ASCII") Traceback (most recent call last): ... UnicodeEncodeError: 'ascii' codec can't encode character '\xe6' in position 1: ordinal not in range(128)
The Scandinavian letters here have no encoding in ASCII. If we really need ASCII encoding (which can certainly happen), we can supply another argument to encode, telling it what to do with errors. The normal mode here is 'strict', but there are others you can use to ignore or replace the offending characters.
>>> "Hællå, wørld!".encode("ASCII", "ignore")
b'Hll, wrld!'
>>> "Hællå, wørld!".encode("ASCII", "replace")
b'H?ll?, w?rld!'
>>> "Hællå, wørld!".encode("ASCII", "backslashreplace")
b'H\\xe6ll\\xe5, w\\xf8rld!'
>>> "Hællå, wørld!".encode("ASCII", "xmlcharrefreplace")
b'Hællå, wørld!'
In almost all cases, though, you’ll be better off using UTF-8, which is in fact the default encoding.
>>> "Hællå, wørld!".encode()
b'H\xc3\xa6ll\xc3\xa5, w\xc3\xb8rld!'
This is slightly longer than for the "Hello, world!" example, whereas the UTF-32 encoding would be of exactly the same length in both cases.
Just like strings can be encoded into bytes, bytes can be decoded into strings.
>>> b'H\xc3\xa6ll\xc3\xa5, w\xc3\xb8rld!'.decode()
'Hællå, wørld!'
As before, the default encoding is UTF-8. We can specify a different encoding, but if we use the wrong one, we’ll either get an error message or end up with a garbled string. The bytes object itself doesn’t know about encoding, so it’s your responsibility to keep track of which one you’ve used.
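To see what “using the wrong encoding” looks like in practice, here is a small sketch that decodes UTF-8 bytes as if they were Latin-1. Because Latin-1 assigns a character to every byte value, the decode “succeeds” but produces classic gibberish (mojibake) rather than raising an error:

```python
raw = "Hællå, wørld!".encode("UTF-8")

# Correct decoding round-trips the original string.
print(raw.decode("UTF-8"))     # Hællå, wørld!

# Latin-1 maps every byte value to some character, so decoding
# "succeeds" silently but yields gibberish.
print(raw.decode("latin-1"))   # HÃ¦llÃ¥, wÃ¸rld!

# ASCII, on the other hand, rejects bytes above 127 outright.
try:
    raw.decode("ASCII")
except UnicodeDecodeError as e:
    print("ASCII failed:", e.reason)
```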
Rather than using the encode and decode methods, you might want to simply construct the bytes and str (i.e., string) objects, as follows:
>>> bytes("Hællå, wørld!", encoding="utf-8")
b'H\xc3\xa6ll\xc3\xa5, w\xc3\xb8rld!'
>>> str(b'H\xc3\xa6ll\xc3\xa5, w\xc3\xb8rld!', encoding="utf-8")
'Hællå, wørld!'
Using this approach is a bit more general and works better if you don’t know exactly the class of the stringlike or bytes-like objects you’re working with—and as a general rule, you shouldn’t be too strict about that.

One of the most important uses for encoding and decoding is when storing text in files on disk. However, Python’s mechanisms for reading and writing files normally do the work for you! As long as you’re okay with having your files in UTF-8 encoding, you don’t really need to worry about it. But if you end up seeing gibberish where you expected text, perhaps the file was actually in some other encoding, and then it can be useful to know a bit about what’s going on. If you’d like to know more about Unicode in Python, check out the HOWTO on the subject.12
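For instance, the built-in open accepts an encoding argument, so the encode and decode steps happen behind the scenes. A sketch (the file name here is hypothetical):

```python
# The encoding argument tells open() how to translate between the
# str you work with and the bytes actually stored on disk.
with open("somefile.txt", "w", encoding="utf-8") as f:
    f.write("Hællå, wørld!")

# Reading the file back with the same encoding recovers the text;
# reading it with another encoding would garble it or raise an error.
with open("somefile.txt", encoding="utf-8") as f:
    print(f.read())            # Hællå, wørld!
```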
■ Note Your source code is also encoded, and the default there is UTF-8 as well. If you want to use some other encoding (for example, if your text editor insists on saving as something other than UTF-8), you can specify the encoding with a special comment.
# -*- coding: encoding name -*-
Replace encoding name with whatever encoding you’re using (uppercase or lowercase), such as utf-8 or, perhaps more likely, latin-1.

Finally, we have bytearray, a mutable version of bytes. In a sense, it’s like a string where you can modify the characters—which you can’t do with a normal string. However, it’s really designed to be used behind the scenes and isn’t exactly user-friendly if used as a string-alike. For example, to replace a character, you have to assign an int in the range 0…255 to it. So if you want to actually insert a character, you have to get its ordinal value, using ord.
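A brief sketch of the bytearray behavior just described: indexing yields plain integers, and assignment expects an integer in the range 0…255, so inserting a character goes through ord.

```python
x = bytearray(b"Hello!")

# Indexing a bytearray yields ints, just as with bytes.
print(x[0])          # 72, the ordinal value of 'H'

# Assignment expects an int in 0..255, so to put a character
# in, you have to go via ord().
x[0] = ord("J")
print(x)             # bytearray(b'Jello!')

# Unlike bytes (and str), bytearray is mutable, so in-place
# methods work too.
x.append(ord("!"))
print(x.decode())    # Jello!!
```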