Files and Stuff | Core Python 3.8

So far, we’ve mainly been working with data structures that reside in the interpreter itself. What little interaction our programs have had with the outside world has been through input and print. In this chapter, we go one step further and let our programs catch a glimpse of a larger world: the world of files and streams. The functions and objects described in this chapter will enable you to store data between program invocations and to process data from other programs.

Opening Files 

You can open files with the open function, which lives in the io module but is automatically imported for you. It takes a file name as its only mandatory argument and returns a file object. Assuming that you have a text file (created with your text editor, perhaps) called somefile.txt stored in the current directory, you can open it like this:

>>> f = open('somefile.txt')

You can also specify the full path to the file, if it’s located somewhere else. If it doesn’t exist, however, you’ll see an exception traceback like this:

Traceback (most recent call last):

  File "<stdin>", line 1, in <module> 

FileNotFoundError: [Errno 2] No such file or directory: 'somefile.txt'

If you wanted to create the file by writing text to it, this isn’t entirely satisfactory. The solution is found in the second argument to open.

File Modes 

If you use open with only a file name as a parameter, you get a file object you can read from. If you want to write to the file, you must state that explicitly, supplying a mode. The mode argument to the open function can have several values.

Most Common Values for the Mode Argument of the open Function

'r' Read mode (default)
'w' Write mode
'x' Exclusive write mode
'a' Append mode
'b' Binary mode (added to other mode)
't' Text mode (default, added to other mode)
'+' Read/write mode (added to other mode)

Explicitly specifying read mode has the same effect as not supplying a mode string at all. The write mode enables you to write to the file and will create the file if it does not exist. The exclusive write mode goes further and raises a FileExistsError if the file already exists. If you open an existing file in write mode, the existing contents will be deleted, or truncated, and writing starts afresh from the beginning of the file; if you’d rather just keep writing at the end of the existing file, use append mode.
    The '+' can be added to any of the other modes to indicate that both reading and writing is allowed. So, for example, 'r+' can be used when opening a text file for reading and writing. (For this to be useful, you will probably want to use seek as well; see the sidebar “Random Access” later in this chapter.) Note that there is an important difference between 'r+' and 'w+': the latter will truncate the file, while the former will not.
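As a rough illustration of how the modes described so far differ, here is a minimal sketch that works in a throwaway temporary directory. The file name demo.txt and the variable names are made up for this example.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'demo.txt')  # hypothetical file name

    # 'w' creates the file (or truncates an existing one).
    with open(path, 'w') as f:
        f.write('first\n')

    # 'a' keeps the existing contents and writes at the end.
    with open(path, 'a') as f:
        f.write('second\n')

    # 'x' refuses to touch a file that already exists.
    try:
        f = open(path, 'x')
    except FileExistsError:
        exclusive_failed = True
    else:
        f.close()
        exclusive_failed = False

    # 'r+' allows reading and writing without truncating.
    with open(path, 'r+') as f:
        after_r_plus = f.read()

    # 'w+' also allows both, but empties the file first.
    with open(path, 'w+') as f:
        after_w_plus = f.read()

print(repr(after_r_plus))
print(repr(after_w_plus))
print(exclusive_failed)
```

The file survives being opened with 'r+' but is empty after being opened with 'w+', which is the difference the paragraph above warns about.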
    The default mode is 'rt', which means your file is treated as encoded Unicode text. Decoding and encoding are then performed automatically, using a platform-dependent default encoding (whatever locale.getpreferredencoding(False) returns; often, but not always, UTF-8). Other encodings and Unicode error-handling strategies may be set using the encoding and errors keyword arguments. (See Chapter 1 for more on Unicode.) There is also some automatic translation of newline characters. By default, lines are ended by '\n'. Other line endings ('\r' or '\r\n') are automatically replaced on reading. On writing, '\n' is replaced by the system’s default line ending (os.linesep).
    Normally, Python uses what is called universal newline mode, where any valid newline ('\n', '\r', or '\r\n') is recognized, for example, by the readlines method, discussed later. If you wish to keep this mode but want to prevent automatic translation to and from '\n', you can supply an empty string to the newline keyword argument, as in open(name, newline=''). If you want to specify that only '\r' or '\r\n' is to be treated as a valid line ending, supply your preferred line ending instead. In this case, the line ending is not translated when reading, but '\n' will be replaced by the proper line ending when writing.
    If your file contains nontextual, binary data, such as a sound clip or image, you certainly wouldn’t want any of these automatic transformations to be performed. In that case, you simply use binary mode ('rb', for example) to turn off any text-specific functionality.
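The following sketch puts the encoding and newline arguments next to binary mode. It assumes you want to pass the encoding explicitly rather than rely on the default. The file name and the sample text are invented for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'encoded.txt')  # hypothetical file name

    # Write with an explicit encoding and no newline translation,
    # so the '\r\n' below reaches the file unchanged.
    with open(path, 'w', encoding='utf-8', newline='') as f:
        f.write('caf\u00e9\r\nmenu\n')

    # Default newline handling: '\r\n' is translated to '\n' on reading.
    with open(path, encoding='utf-8') as f:
        translated = f.read()

    # newline='' still recognizes every line ending but leaves
    # the endings untranslated.
    with open(path, encoding='utf-8', newline='') as f:
        untranslated = f.readlines()

    # Binary mode returns the raw, still-encoded bytes.
    with open(path, 'rb') as f:
        raw = f.read()

print(repr(translated))
print(untranslated)
print(raw)
```

In text mode you get str objects with the accented character decoded. In binary mode you get a bytes object in which that character occupies two bytes.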
    There are a few other, slightly more advanced optional arguments as well, for controlling buffering and working more directly with file descriptors. See the Python documentation, or run help(open) in the interactive interpreter, to find out more.

The Basic File Methods 

Now you know how to open files. The next step is to do something useful with them. In this section, you learn about some basic methods of file objects and about some other file-like objects, sometimes called streams. A file-like object is simply one supporting a few of the same methods as a file, most notably either read or write or both. The objects returned by urlopen (see Chapter 14) are a good example of this. They support methods such as read and readline, but not methods such as write and isatty, for example.

THREE STANDARD STREAMS

In Chapter 10, in the section about the sys module, I mentioned three standard streams. These are file-like objects, and you can apply most of what you learn about files to them. A standard source of data input is sys.stdin. When a program reads from standard input, you can supply text by typing it, or you can link it with the standard output of another program, using a pipe, as demonstrated in the section “Piping Output.” The text you give to print appears in sys.stdout. The prompts for input also go there. Data written to sys.stdout typically appears on your screen but can be rerouted to the standard input of another program with a pipe, as mentioned. Error messages (such as stack traces) are written to sys.stderr, which is similar to sys.stdout but can be rerouted separately.
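A small sketch of the two output streams in use might look like this. The messages are placeholders, and the redirection mentioned in the comment is ordinary shell syntax rather than anything specific to Python.

```python
import sys

# Ordinary output goes to standard output ...
print('result: 42')

# ... while diagnostics can go to standard error, which a shell
# can redirect separately (for example with 2> errors.log).
print('warning: something looks odd', file=sys.stderr)

# Both are file-like objects, so write works as it does for files
# and returns the number of characters written.
written = sys.stdout.write('done\n')
```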

Reading and Writing 

The most important capabilities of files are supplying and receiving data. If you have a file-like object named f, you can write data with f.write and read data with f.read. As with most Python functionality, there is some flexibility in what you use as data, but the basic classes used are str and bytes, for text and binary mode, respectively. Each time you call f.write(string), the string you supply is written to the file after those you have written previously.

>>> f = open('somefile.txt', 'w') 

>>> f.write('Hello, ') 

7 

>>> f.write('World!') 

6 

>>> f.close()

Notice that I call the close method when I’m finished with the file. You’ll learn more about it in the section “Closing Files” later in this chapter. Reading is just as simple. Just remember to tell the stream how many characters (or bytes, in binary mode) you want to read. Here’s an example (continuing where I left off):

>>> f = open('somefile.txt', 'r') 

>>> f.read(4) 

'Hell' 

>>> f.read() 

'o, World!'

First I specify how many characters to read (4), and then I simply read the rest of the file (by not supplying a number). Note that I could have dropped the mode specification from the call to open because 'r' is the default.

Piping Output 

In a shell such as bash, you can write several commands after one another, linked together with pipes, as in this example:

$ cat somefile.txt | python somescript.py | sort

This pipeline consists of three commands.
 •    cat somefile.txt: This command simply writes the contents of the file somefile.txt to standard output (sys.stdout).
 •    python somescript.py: This command executes the Python script somescript. The script presumably reads from its standard input and writes the result to standard output.
 •    sort: This command reads all the text from standard input (sys.stdin), sorts the lines alphabetically, and writes the result to standard output.

But what is the point of these pipe characters (|), and what does somescript.py do? The pipes link up the standard output of one command with the standard input of the next. Clever, eh? So you can safely guess that somescript.py reads data from its sys.stdin (which is what cat somefile.txt writes) and writes some result to its sys.stdout (which is where sort gets its data).
    A simple script (somescript.py) that uses sys.stdin is shown in Listing 11-1. The contents of the file somefile.txt are shown in Listing 11-2.
Listing 11-1. Simple Script That Counts the Words in sys.stdin

# somescript.py 

import sys 

text = sys.stdin.read() 

words = text.split() 

wordcount = len(words) 

print('Wordcount:', wordcount)

Listing 11-2. A File Containing Some Nonsensical Text
Your mother was a hamster and your father smelled of elderberries.

Here are the results of cat somefile.txt | python somescript.py:

Wordcount: 11

RANDOM ACCESS

In this chapter, I treat files only as streams: you can read data only from start to finish, strictly in order. In fact, you can also move around a file, accessing only the parts you are interested in (called random access) by using the two file-object methods seek and tell. The method seek(offset[, whence]) moves the current position (where reading or writing is performed) to the position described by offset and whence. offset is a byte (character) count. whence defaults to io.SEEK_SET or 0, which means that the offset is from the beginning of the file (the offset must be nonnegative). whence may also be set to io.SEEK_CUR or 1 (move relative to current position; the offset may be negative) or io.SEEK_END or 2 (move relative to the end of the file). Consider this example:

>>> f = open(r'C:\text\somefile.txt', 'w') 

>>> f.write('01234567890123456789') 

20 

>>> f.seek(5) 

5 

>>> f.write('Hello, World!') 

13 

>>> f.close() 

>>> f = open(r'C:\text\somefile.txt') 

>>> f.read() 

'01234Hello, World!89'

The method tell() returns the current file position, as in the following example:

>>> f = open(r'C:\text\somefile.txt') 

>>> f.read(3) 

'012' 

>>> f.read(2) 

'34' 

>>> f.tell() 

5
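The other whence values can be sketched as follows. This example uses binary mode on purpose, because text files accept only a zero offset when seeking relative to the current position or the end. The file name is hypothetical.

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'digits.bin')  # hypothetical file name

    with open(path, 'wb') as f:
        f.write(b'0123456789')

    with open(path, 'rb') as f:
        f.seek(-3, io.SEEK_END)   # three bytes before the end
        tail = f.read()           # everything from there to the end
        f.seek(2)                 # absolute position 2
        f.seek(3, io.SEEK_CUR)    # three bytes further on
        position = f.tell()
        middle = f.read(2)

print(tail, position, middle)
```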

Reading and Writing Lines 

Actually, what I’ve been doing until now is a bit impractical. I could just as well be reading in the lines of a stream as reading letter by letter. You can read a single line (text from where you have come so far, up to and including the first line separator you encounter) with the readline method. You can use this method either without any arguments (in which case a line is simply read and returned) or with a nonnegative integer, which is then the maximum number of characters that readline is allowed to read. So if some_file.readline() returns 'Hello, World!\n', then some_file.readline(5) returns 'Hello'. To read all the lines of a file and have them returned as a list, use the readlines method.
    The method writelines is the opposite of readlines: give it a list (or, in fact, any sequence or iterable object) of strings, and it writes all the strings to the file (or stream). Note that newlines are not added; you need to add those yourself. Also, there is no writeline method because you can just use write. 
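Here is a minimal sketch of these line-oriented methods working together. It uses a temporary file with made-up contents.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'lines.txt')  # hypothetical file name

    # writelines adds no newlines, so they are supplied explicitly.
    with open(path, 'w') as f:
        f.writelines(['Hello, World!\n', 'second line\n', 'third line\n'])

    with open(path) as f:
        partial = f.readline(5)    # at most five characters
        rest = f.readline()        # the remainder of the first line
        remaining = f.readlines()  # all the lines that are left

print(repr(partial), repr(rest), remaining)
```

Note that a size-limited readline does not skip the rest of the line. The next call simply continues where the previous one stopped.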

Closing Files 

You should remember to close your files by calling their close method. Usually, a file object is closed automatically when you quit your program (and possibly before that), and not closing files you have been reading from isn’t really that important. However, closing those files can’t hurt and might help to avoid keeping the file uselessly “locked” against modification in some operating systems and settings. It also avoids using up any quotas for open files your system might have. 
    You should always close a file you have written to because Python may buffer (keep stored temporarily somewhere, for efficiency reasons) the data you have written, and if your program crashes for some reason, the data might not be written to the file at all. The safe thing is to close your files after you’re finished with them. 
    If you want to reset the buffering and make your changes visible in the actual file on disk but you don’t yet want to close the file, you can use the flush method. Note, however, that flush might not allow other programs running at the same time to access the file because of locking considerations that depend on your operating system and settings. Whenever you can conveniently close the file, that is preferable. 
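The effect of flush can be sketched like this. It assumes the operating system lets a second handle read the file while the first is still open, which is normally the case. The file name and text are invented.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'buffered.txt')  # hypothetical file name

    writer = open(path, 'w')
    try:
        writer.write('not yet closed')
        writer.flush()  # push the buffered data out to the file

        # A second, independent handle can now see the text,
        # even though the writer has not been closed.
        with open(path) as reader:
            visible = reader.read()
    finally:
        writer.close()

print(repr(visible))
```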
    If you want to be certain that your file is closed, you could use a try/finally statement with the call to close in the finally clause.

f = open('somefile.txt', 'w')  # Open your file here 

try:

    f.write('Hello, World!')  # Write data to your file 

finally:

    f.close()

There is, in fact, a statement designed specifically for this kind of situation—the with statement.

with open("somefile.txt") as somefile:

     do_something(somefile)

The with statement lets you open a file and assign it to a variable name (in this case, somefile). You then write data to your file (and, perhaps, do other things) in the body of the statement, and the file is automatically closed when the end of the statement is reached, even if that is caused by an exception.

CONTEXT MANAGERS

The with statement is actually a quite general construct, allowing you to use so-called context managers. A context manager is an object that supports two methods: __enter__ and __exit__. The __enter__ method takes no arguments. It is called when entering the with statement, and the return value is bound to the variable after the as keyword. The __exit__ method takes three arguments: an exception type, an exception object, and an exception traceback. It is called when leaving the statement (with any exception raised supplied through the parameters). If __exit__ returns a true value, the exception is suppressed; if it returns a false value, the exception propagates. Files may be used as context managers. Their __enter__ methods return the file objects themselves, while their __exit__ methods close the files. For more information about this powerful, yet rather advanced, feature, check out the description of context managers in the Python Reference Manual. Also see the sections on context manager types and on contextlib in the Python Library Reference.
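To make the protocol concrete, here is a sketch of a hand-written context manager. The Tracker class is purely hypothetical. It records when it is entered and left, and it suppresses ValueError (and nothing else) by returning a true value from __exit__.

```python
class Tracker:
    """A hypothetical context manager that records what happens to it."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append('enter')
        return self  # bound to the name after "as"

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append('exit')
        # A true return value suppresses the exception;
        # here that happens only for ValueError.
        return exc_type is not None and issubclass(exc_type, ValueError)


with Tracker() as tracker:
    raise ValueError('swallowed by __exit__')

events = tracker.events
print(events)
```

Any other exception type raised in the body would still propagate, after __exit__ has run.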

Using the Basic File Methods 

Assume that somefile.txt contains the text in Listing 11-3. What can you do with it?
Listing 11-3. A Simple Text File

Welcome to this file
There is nothing here except
This stupid haiku

Let’s try the methods you know, starting with read(n).

>>> f = open(r'C:\text\somefile.txt') 

>>> f.read(7) 

'Welcome' 

>>> f.read(4) 

' to ' 

>>> f.close()

Next up is read():

>>> f = open(r'C:\text\somefile.txt') 

>>> print(f.read()) 

Welcome to this file
There is nothing here except
This stupid haiku

>>> f.close()

Here’s readline():

>>> f = open(r'C:\text\somefile.txt') 

>>> for i in range(3):

        print(str(i) + ': ' + f.readline(), end='') 

0: Welcome to this file 

1: There is nothing here except 

2: This stupid haiku 

>>> f.close()

And here’s readlines():

>>> import pprint 

>>> pprint.pprint(open(r'C:\text\somefile.txt').readlines()) 

['Welcome to this file\n',
 'There is nothing here except\n',
 'This stupid haiku']

Note that I relied on the file object being closed automatically in this example. Now let’s try writing, beginning with write(string).

>>> f = open(r'C:\text\somefile.txt', 'w') 

>>> f.write('this\nis no\nhaiku') 

16 

>>> f.close()

After running this, the file contains the text in Listing 11-4.
Listing 11-4. The Modified Text File
this
is no
haiku

Finally, here’s writelines(list):

>>> f = open(r'C:\text\somefile.txt') 

>>> lines = f.readlines() 

>>> f.close() 

>>> lines[1] = "isn't a\n" 

>>> f = open(r'C:\text\somefile.txt', 'w') 

>>> f.writelines(lines) 

>>> f.close()

After running this, the file contains the text in Listing 11-5.
Listing 11-5. The Text File, Modified Again
this
isn't a
haiku

Iterating over File Contents 

Now you’ve seen some of the methods file objects present to us, and you’ve learned how to acquire such file objects. One of the common operations on files is to iterate over their contents, repeatedly performing some action as you go. There are many ways of doing this, and you can certainly just find your favorite and stick to that. However, others may have done it differently, and to understand their programs, you should know all the basic techniques. 
    In all the examples in this section, I use a fictitious function called process to represent the processing of each character or line. Feel free to implement it in any way you like. Here’s one simple example:

def process(string):

    print('Processing:', string)

More useful implementations could do such things as storing data in a data structure, computing a sum, replacing patterns with the re module, or perhaps adding line numbers. Also, to try out the examples, you should set the variable filename to the name of some actual file.

One Character (or Byte) at a Time 

One of the most basic (but probably least common) ways of iterating over file contents is to use the read method in a while loop. For example, you might want to loop over every character (or, in binary mode, every byte) in the file. You could do that as shown in Listing 11-6. If you’d rather read chunks of several characters or bytes, supply the desired length to read.
Listing 11-6. Looping over Characters with read

with open(filename) as f:

    char = f.read(1)

    while char:

        process(char)

        char = f.read(1)

This program works because when you have reached the end of the file, the read method returns an empty string, but until then, the string always contains one character (and thus has the Boolean value true). As long as char is true, you know that you aren’t finished yet. As you can see, I have repeated the assignment char = f.read(1), and code repetition is generally considered a bad thing. (Laziness is a virtue, remember?) To avoid that, we can use the while True/break technique introduced in Chapter 5. The resulting code is shown in Listing 11-7.
Listing 11-7. Writing the Loop Differently

with open(filename) as f:

    while True:

        char = f.read(1)

        if not char: break

        process(char)

As mentioned in Chapter 5, you shouldn’t use the break statement too often (because it tends to make the code more difficult to follow). Even so, the approach shown in Listing 11-7 is usually preferred to that in Listing 11-6, precisely because you avoid duplicated code.
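The same while True/break pattern works for chunks larger than one character. The sketch below wraps it in a generator. The function name read_in_chunks and the chunk size are assumptions made for illustration, not something defined elsewhere in this chapter.

```python
import os
import tempfile


def read_in_chunks(f, size=4):
    """Yield successive chunks of at most `size` characters from f."""
    while True:
        chunk = f.read(size)
        if not chunk:  # an empty string means end of file
            break
        yield chunk


with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, 'chunks.txt')  # hypothetical file name
    with open(filename, 'w') as f:
        f.write('abcdefghij')

    with open(filename) as f:
        chunks = list(read_in_chunks(f))

print(chunks)
```

The last chunk may be shorter than the requested size. Only an empty result signals that the file is exhausted.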

One Line at a Time 

When dealing with text files, you are often interested in iterating over the lines in the file, not each individual character. You can do this easily in the same way as we did with characters, using the readline method (described earlier, in the section “Reading and Writing Lines”), as shown in Listing 11-8.
Listing 11-8. Using readline in a while Loop

with open(filename) as f:

    while True:

        line = f.readline()

        if not line: break

        process(line)

Reading Everything 

If the file isn’t too large, you can just read the whole file in one go, using the read method with no parameters (to read the entire file as a string) or the readlines method (to read the file into a list of strings, in which each string is a line). Listings 11-9 and 11-10 show how easy it is to iterate over characters and lines when you read the file like this. Note that reading the contents of a file into a string or a list like this can be useful for other things besides iteration. For example, you might apply a regular expression to the string, or you might store the list of lines in some data structure for further use.
Listing 11-9. Iterating over Characters with read

with open(filename) as f:

    for char in f.read():

        process(char)

Listing 11-10. Iterating over Lines with readlines

with open(filename) as f:

    for line in f.readlines():

        process(line)

Lazy Line Iteration with fileinput 

Sometimes you need to iterate over the lines in a very large file, and readlines would use too much memory. You could use a while loop with readline, of course, but in Python, for loops are preferable when they are available. It just so happens that they are in this case. You can use a method called lazy line iteration; it’s lazy because it reads only the parts of the file actually needed (more or less). You have already encountered fileinput in Chapter 10. Listing 11-11 shows how you might use it. Note that the fileinput module takes care of opening the file. You just need to give it a file name.
Listing 11-11. Iterating over Lines with fileinput

import fileinput 

for line in fileinput.input(filename):

    process(line)
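fileinput can also chain several files and report a running line number. The sketch below assumes two small temporary files with made-up names. It uses fileinput.input as a context manager so the files are closed when the loop is done.

```python
import fileinput
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    # Create two small files to iterate over (hypothetical names).
    names = []
    for name, text in [('a.txt', 'one\ntwo\n'), ('b.txt', 'three\n')]:
        path = os.path.join(tmpdir, name)
        with open(path, 'w') as f:
            f.write(text)
        names.append(path)

    numbered = []
    with fileinput.input(files=names) as lines:
        for line in lines:
            # lineno() counts lines cumulatively across all the files.
            numbered.append((fileinput.lineno(), line.rstrip('\n')))

print(numbered)
```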

File Iterators 

It’s time for the coolest (and the most common) technique of all. Files are actually iterable, which means that you can use them directly in for loops to iterate over their lines. See Listing 11-12 for an example.
Listing 11-12. Iterating over a File

with open(filename) as f:

    for line in f:

        process(line)

In these iteration examples, I have used the files as context managers, to make sure my files are closed. Although this is generally a good idea, it’s not absolutely critical, as long as I don’t write to the file. If you are willing to let Python take care of the closing, you could simplify the example even further, as shown in Listing 11-13. Here, I don’t assign the opened file to a variable (like the variable f I’ve used in the other examples), and therefore I have no way of explicitly closing it.
Listing 11-13. Iterating over a File Without Storing the File Object in a Variable

for line in open(filename):

    process(line)

Note that sys.stdin is iterable, just like other files, so if you want to iterate over all the lines in standard input, you can use this form:

import sys 

for line in sys.stdin:

    process(line)

Also, you can do all the things you can do with iterators in general, such as converting them into lists of strings (by using list(open(filename))), which would simply be equivalent to using readlines.

>>> f = open('somefile.txt', 'w') 

>>> print('First', 'line', file=f) 

>>> print('Second', 'line', file=f) 

>>> print('Third', 'and final', 'line', file=f) 

>>> f.close() 

>>> lines = list(open('somefile.txt')) 

>>> lines 

['First line\n', 'Second line\n', 'Third and final line\n'] 

>>> first, second, third = open('somefile.txt') 

>>> first 

'First line\n' 

>>> second 

'Second line\n' 

>>> third 

'Third and final line\n'

In this example, it’s important to note the following:
 •    I’ve used print to write to the file. This automatically adds newlines after the strings I supply.
 •    I use sequence unpacking on the opened file, putting each line in a separate variable. (This isn’t exactly common practice because you usually won’t know the number of lines in your file, but it demonstrates the “iterability” of the file object.)
 •    I close the file after having written to it, to ensure that the data is flushed to disk. (As you can see, I haven’t closed it after reading from it. Sloppy, perhaps, but not critical.)
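Because a file is an ordinary iterable, it also combines with tools such as enumerate. The following sketch adds line numbers to each line, one of the uses suggested earlier for process. It works on a temporary file with invented contents and uses with so that the file is closed afterward.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, 'somefile.txt')  # hypothetical location
    with open(filename, 'w') as f:
        print('First line', file=f)
        print('Second line', file=f)

    # enumerate pairs each line with a running number, starting at 1.
    with open(filename) as f:
        numbered = ['{}: {}'.format(n, line.rstrip('\n'))
                    for n, line in enumerate(f, start=1)]

print(numbered)
```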
