
Unicode, bytes, and bytearray | Core Python 3.8

Python strings represent text using a scheme known as Unicode. The way this works for most basic programs is pretty transparent, so if you’d like, you could skip this section for now and read up on the topic as needed. However, as string and text file handling is one of the main uses of Python code, it probably wouldn’t hurt to at least skim this section.
    Abstractly, each Unicode character is represented by a so-called code point, which is simply its number in the Unicode standard. This allows you to refer to more than 120,000 characters in 129 writing systems in a way that should be recognizable by any modern software. Of course, your keyboard won’t have hundreds of thousands of keys, so there are general mechanisms for specifying Unicode characters, either by 16- or 32-bit hexadecimal literals (prefixing them with \u or \U, respectively) or by their Unicode name (using \N{name}).

>>> "\u00C6" 

'Æ' 

>>> "\U0001F60A" 

'' 

>>> "This is a cat: \N{Cat}"

'This is a cat: 🐈'

You can find the various code points and names by searching the Web, using a description of the character you need, or you can use a specific site such as http://unicode-table.com.
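As a small sketch of how these pieces relate: the built-in functions ord and chr convert between a character and its code point, and the standard unicodedata module translates between characters and their official names. The variable names below are only for illustration.

```python
import unicodedata

# The escape forms are just different spellings of the same character.
ae = "\u00C6"
assert ae == "\N{LATIN CAPITAL LETTER AE}" == chr(0xC6)

# ord gives the code point of a character; chr goes the other way.
code_point = ord(ae)            # 198, that is, 0xC6
smiley = chr(0x1F60A)           # same as "\U0001F60A"

# unicodedata translates between characters and their Unicode names.
cat_name = unicodedata.name("\N{CAT}")
cat = unicodedata.lookup("CAT")

print(hex(code_point), smiley, cat_name, cat)
```

If you know a character but not its name, unicodedata.name is often quicker than searching the Web.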

    The idea of Unicode is quite simple, but it comes with some challenges, one of which is the issue of encoding. All objects are represented in memory or on disk as a series of binary digits (zeroes and ones) grouped in chunks of eight, or bytes, and strings are no exception. In programming languages such as C, these bytes are completely out in the open. Strings are simply sequences of bytes. To interoperate with C, for example, and to write text to files or send it through network sockets, Python has two similar types, the immutable bytes and the mutable bytearray. If you wanted, you could produce a bytes object directly, instead of a string, by using the prefix b:

>>> b'Hello, world!'

b'Hello, world!'

However, a byte can hold only 256 values, quite a bit less than what the Unicode standard requires. Python bytes literals permit only the 128 characters of the ASCII standard, with the remaining 128 byte values requiring escape sequences like \xf0 for the hexadecimal value 0xf0 (that is, 240). 
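The following is a minimal sketch of what that means in practice. The names data, first, last, and all_bytes are made up for the example.

```python
# A bytes literal may contain ASCII characters directly; other byte values
# need an escape such as \xf0.
data = b"caf\xf0"

# Indexing a bytes object yields an int in range(256), not a character.
first = data[0]                 # 99, the ASCII code for "c"
last = data[-1]                 # 240, that is, 0xf0

# Every possible byte value fits in a bytes object, and there are only 256.
all_bytes = bytes(range(256))

print(first, last, len(all_bytes))
```

Trying to build a bytes object from a value outside this range, as in bytes([256]), raises a ValueError.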
    It might seem the only difference here is the size of the alphabet available to us. That’s not really accurate, however. At a glance, it might seem like both ASCII and Unicode refer to a mapping between nonnegative integers and characters, but there is a subtle difference: where Unicode code points are defined as integers, ASCII characters are defined both by their number and by their binary encoding. One reason this seems completely unremarkable is that the mapping between the integers 0–255 and an eight-digit binary numeral is completely standard, and there is little room to maneuver. The thing is, once we go beyond the single byte, things aren’t that simple. The direct generalization of simply representing each code point as the corresponding binary numeral may not be the way to go. Not only is there the issue of byte order, which one bumps up against even when encoding integer values, there is also the issue of wasted space: if we use the same number of bytes for encoding each code point, all text will have to accommodate the fact that you might want to include a few Anatolian hieroglyphs or a smattering of Imperial Aramaic. There is a standard for such an encoding of Unicode, which is called UTF-32 (for Unicode Transformation Format 32 bits), but if you’re mainly handling text in one of the more common languages of the Internet, for example, this is quite wasteful. 
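To make the byte-order issue concrete, here is a small sketch that lays out the code point of "A" as a 32-bit integer in both orders and compares the result with the explicit little-endian and big-endian variants of UTF-32. The variable names are illustrative only.

```python
# The code point of "A" is 65 (0x41). As a 32-bit integer it can be laid out
# with the least significant byte first (little-endian) or last (big-endian).
little = (0x41).to_bytes(4, "little")    # b'A\x00\x00\x00'
big = (0x41).to_bytes(4, "big")          # b'\x00\x00\x00A'

# The explicit UTF-32 variants produce exactly these layouts.
assert "A".encode("utf-32-le") == little
assert "A".encode("utf-32-be") == big

# Plain "utf-32" prepends a 4-byte byte order mark (BOM) so that a reader
# can tell which layout was used; every character then takes 4 bytes.
with_bom = "A".encode("utf-32")
print(len(with_bom))                     # 8: 4 for the BOM, 4 for "A"
```

Three of the four bytes used for "A" are zero, which is the wasted space in question.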
    There is an absolutely brilliant alternative, however, called UTF-8, devised in large part by computing pioneer Kenneth Thompson. Instead of using the full 32 bits for every code point, it uses a variable-length encoding, with fewer bytes for some scripts than others. Assuming that you’ll use these scripts more often, this will save you space overall, similar to how Morse code saves you effort by using fewer dots and dashes for the more common letters. In particular, the ASCII encoding is still used for single-byte encoding, retaining compatibility with older systems. However, characters outside this range use multiple bytes (up to four in the current standard, although the original design allowed up to six). Let’s try to encode a string into bytes, using the ASCII, UTF-8, and UTF-32 encodings.

>>> "Hello, world!".encode("ASCII") 

b'Hello, world!' 

>>> "Hello, world!".encode("UTF-8") 

b'Hello, world!' 

>>> "Hello, world!".encode("UTF-32") 

b'\xff\xfe\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00,\x00\x00\x00 \x00\x00\x00w\x00\x00\x00o\x00\x00\x00r\x00\x00\x00l\x00\x00\x00d\x00\x00\x00!\x00\x00\x00'

As you can see, the first two are equivalent, while the last one is quite a bit longer. Here’s another example:

>>> len("How long is this?".encode("UTF-8"))

17

>>> len("How long is this?".encode("UTF-32"))

72

Using shorter codes for the more common symbols is an important method of compression in general, used for example in Huffman coding, a component of several modern compression tools.
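The variable-length nature of UTF-8 can be seen by encoding single characters from different parts of Unicode. This is just a sketch; the four sample characters are an arbitrary choice.

```python
# Characters from different parts of Unicode need different numbers of
# bytes in UTF-8, but always exactly four in UTF-32 (BOM excluded here).
samples = ["a", "æ", "€", "\U0001F60A"]

for ch in samples:
    utf8_len = len(ch.encode("utf-8"))
    utf32_len = len(ch.encode("utf-32-le"))   # the "-le" variant adds no BOM
    print(f"U+{ord(ch):04X}: {utf8_len} byte(s) in UTF-8, {utf32_len} in UTF-32")

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
```

The fixed width also accounts for the 72 bytes of the UTF-32 example: 17 characters at four bytes each, plus a 4-byte byte order mark.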

The difference between ASCII and UTF-8 appears once we use some slightly more exotic characters:

>>> "Hællå, wørld!".encode("ASCII")

Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character '\xe6' in position 1: ordinal not in range(128)

The Scandinavian letters here have no encoding in ASCII. If we really need ASCII encoding (which can certainly happen), we can supply another argument to encode, telling it what to do with errors. The normal mode here is 'strict', but there are others you can use to ignore or replace the offending characters.

>>> "Hællå, wørld!".encode("ASCII", "ignore") 

b'Hll, wrld!'

>>> "Hællå, wørld!".encode("ASCII", "replace")

b'H?ll?, w?rld!' 

>>> "Hællå, wørld!".encode("ASCII", "backslashreplace") 

b'H\\xe6ll\\xe5, w\\xf8rld!' 

>>> "Hællå, wørld!".encode("ASCII", "xmlcharrefreplace") 

b'H&#230;ll&#229;, w&#248;rld!'
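If you stay with the strict mode, you can also catch the exception and inspect it. The following sketch uses the start, end, object, and encoding attributes that UnicodeEncodeError provides; the names bad_char, position, and replaced are made up for the example.

```python
text = "Hællå, wørld!"

try:
    text.encode("ascii")            # same as passing errors="strict"
except UnicodeEncodeError as err:
    # The exception records where the trouble is.
    bad_char = err.object[err.start:err.end]
    position = err.start
    print(f"cannot encode {bad_char!r} at position {position} as {err.encoding}")

# The error handler can also be passed by keyword, which reads a little clearer.
replaced = text.encode("ascii", errors="replace")
```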

In almost all cases, though, you’ll be better off using UTF-8, which is in fact even the default encoding.

>>> "Hællå, wørld!".encode() 

b'H\xc3\xa6ll\xc3\xa5, w\xc3\xb8rld!'

This is slightly longer than for the "Hello, world!" example, whereas the UTF-32 encoding would be of exactly the same length in both cases. 
    Just like strings can be encoded into bytes, bytes can be decoded into strings.

>>> b'H\xc3\xa6ll\xc3\xa5, w\xc3\xb8rld!'.decode() 

'Hællå, wørld!'

As before, the default encoding is UTF-8. We can specify a different encoding, but if we use the wrong one, we’ll either get an error message or end up with a garbled string. The bytes object itself doesn’t know about encoding, so it’s your responsibility to keep track of which one you’ve used. 
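Here is a sketch of both failure modes, using Latin-1 and ASCII as the "wrong" encodings. Which encodings to try is an arbitrary choice for the example.

```python
original = "Hællå, wørld!"
data = original.encode("utf-8")

# Decoding with the wrong single-byte encoding "succeeds" but gives gibberish,
# because each of the two bytes for æ, å, and ø becomes a character of its own.
garbled = data.decode("latin-1")
print(garbled)

# Decoding as ASCII fails outright, since bytes such as 0xc3 are above 127.
try:
    data.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# In this particular case the damage can be undone by reversing the wrong step.
repaired = garbled.encode("latin-1").decode("utf-8")
```

The repair works only because Latin-1 maps every byte value to some character, so no information was lost. That is not true of encodings in general.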
    Rather than using the encode and decode methods, you might want to simply construct the bytes and str (i.e., string) objects, as follows:

>>> bytes("Hællå, wørld!", encoding="utf-8") 

b'H\xc3\xa6ll\xc3\xa5, w\xc3\xb8rld!' 

>>> str(b'H\xc3\xa6ll\xc3\xa5, w\xc3\xb8rld!', encoding="utf-8") 

'Hællå, wørld!'

    Using this approach is a bit more general and works better if you don’t know exactly the class of the string-like or bytes-like objects you’re working with (and as a general rule, you shouldn’t be too strict about that).
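As an illustration of that generality, consider a memoryview, which is bytes-like but has no decode method. This is only a sketch, and memoryview is just one example of such an object.

```python
raw = bytearray(b"H\xc3\xa6ll\xc3\xa5")
view = memoryview(raw)

# A memoryview has no decode method, but str() accepts any bytes-like object.
has_decode = hasattr(view, "decode")
text = str(view, encoding="utf-8")

# bytes() with an encoding turns a string into bytes, just like str.encode.
encoded = bytes(text, encoding="utf-8")

print(has_decode, text, encoded)
```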
One of the most important uses for encoding and decoding is when storing text in files on disk. However, Python’s mechanisms for reading and writing files normally do the work for you! As long as you’re okay with having your files in UTF-8 encoding, you don’t really need to worry about it (although the default encoding for files can depend on your platform, so you may want to pass encoding="utf-8" to open to be sure). But if you end up seeing gibberish where you expected text, perhaps the file was actually in some other encoding, and then it can be useful to know a bit about what’s going on. If you’d like to know more about Unicode in Python, check out the Unicode HOWTO in the official Python documentation (https://docs.python.org/3/howto/unicode.html).
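A minimal sketch of a round trip through a file follows. The file name greeting.txt and the temporary directory are placeholders for the example.

```python
import os
import tempfile

text = "Hællå, wørld!"
path = os.path.join(tempfile.mkdtemp(), "greeting.txt")   # throwaway location

# Text mode encodes on write and decodes on read; naming the encoding
# explicitly avoids relying on whatever the platform default happens to be.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, encoding="utf-8") as f:
    round_trip = f.read()

# Binary mode skips the decoding step and hands back the raw bytes.
with open(path, "rb") as f:
    raw = f.read()

print(round_trip, raw)
```

Reading in binary mode and decoding by hand can be a useful debugging step when a file shows up as gibberish.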

 ■ Note  Your source code is also encoded, and the default there is UTF-8 as well. If you want to use some other encoding (for example, if your text editor insists on saving as something other than UTF-8), you can specify the encoding with a special comment on the first or second line of the file.
# -*- coding: encoding name -*-
    Replace encoding name with whatever encoding you’re using (uppercase or lowercase), such as utf-8 or, perhaps more likely, latin-1, for example.

    Finally, we have bytearray, a mutable version of bytes. In a sense, it’s like a string where you can modify the characters, which you can’t do with a normal string. However, it’s really designed more to be used behind the scenes and isn’t exactly user-friendly if used as a string-alike. For example, to replace a character, you have to assign an int in the range 0…255 to it. So if you want to actually insert a character, you have to get its ordinal value, using ord.

>>> x = bytearray(b"Hello!") 

>>> x[1] = ord(b"u") 

>>> x 

bytearray(b'Hullo!')
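Beyond single-item assignment, a bytearray supports slice assignment and list-like methods. The sketch below shows a few of them; the particular edits are arbitrary.

```python
x = bytearray(b"Hello!")

# Single items are ints, so assign an int (here via ord) ...
x[1] = ord("u")

# ... but slices take bytes-like objects, which may change the length.
x[5:] = b", world"

# bytearray also has list-like methods such as append and extend.
x.append(0x21)                   # 0x21 is the ASCII code for "!"

# Decode to get an ordinary (immutable) string back.
result = x.decode("ascii")
print(result)
```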

