Files

One source of input for computer programs is a file stored in the local file system. For example, you could write a program that checks a text document for spelling errors or, as you did in Project 2, a program that applies filters to an image file.

There are two types of files:

Text files contain one or more lines of text characters, encoded according to a character encoding (like UTF-8). Each line in a text file ends with a control character. Unfortunately, the character varies based on the operating system. On Unix-like systems, including Linux and modern macOS, lines end with "\n". On classic Mac OS (before OS X), lines ended with "\r". And annoyingly, on Windows, lines end with both characters, "\r\n". We'll need to keep this in mind when we read a file, if we're trying to break it up into lines.
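As a small illustration of the three line endings, the sketch below splits the same two lines of text written with each ending. The sample strings and the labels LF, CR, and CRLF are made up for this example; splitlines is a standard string method that recognizes all three endings.

```python
# The same two lines of text, written with each line-ending convention.
# These sample strings are invented for illustration.
samples = {
    "LF": "first line\nsecond line\n",
    "CR": "first line\rsecond line\r",
    "CRLF": "first line\r\nsecond line\r\n",
}

# str.splitlines() splits on any of the three endings and drops them.
for name, text in samples.items():
    print(name, text.splitlines())
```

All three samples produce the same list of two lines, which is why splitlines is a safer choice than splitting on one specific character.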

Anything that's not a text file is a binary file. The bytes in a binary file are intended to be interpreted in some other way, according to the file's format, which is typically indicated by the file extension. For example, all of these are binary files:

Images (with extensions such as GIF/JPG/PNG)
Audio (e.g. WAV/MP3)
Video (e.g. MOV/MP4)
Compressed files (e.g. ZIP/RAR)

Handling files

This is the general 3-step process for handling files in Python:

Open the file
Either read data from file or write data to file
Close the file

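It's important to explicitly close the file at the end. Until you do, data you've written may still be sitting in a buffer rather than saved to the file, your program keeps holding on to an operating system resource, and on some systems (Windows in particular) other processes may be blocked from changing or deleting that file.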
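Here is a minimal sketch of the three steps together. The file name example.txt and the temporary folder are assumptions made for this example, so that it doesn't touch any real files. The try/finally is one way to make sure the close step runs even if something goes wrong in the middle.

```python
import os
import tempfile

# Hypothetical file in a temporary folder, so the sketch doesn't touch real files.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# Step 1: open the file (write mode creates it).
my_file = open(path, mode="w")
try:
    # Step 2: write data to the file.
    my_file.write("hello\n")
finally:
    # Step 3: close the file, even if the write raised an error.
    my_file.close()

print(my_file.closed)  # True
```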

Let's see how to write the code to make those steps happen.

Opening files

The global built-in function open(file, mode) opens a file and returns a file object. The first argument is the path to the file, while the second argument specifies what we want to do with that file, like read or write it. The following code opens "myinfo.txt" in a read-only mode:

open("myinfo.txt", mode="r")

Here are the most common modes:

Mode  Meaning  Description
r     Read     If the file does not exist, this raises an error.
w     Write    If the file does not exist, this mode creates it. If the file already exists, this mode deletes all its data.
a     Append   If the file does not exist, this mode creates it. If the file already exists, this mode appends data to the end of the existing data.
b     Binary   Use for binary files, combined with another mode such as r or w (for example "rb" or "wb").

The function also takes additional optional arguments and modes, described in the docs.
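To see the modes in action, here is a small sketch that runs each one against a throwaway file. The file name modes_demo.txt and the temporary folder are assumptions for this example only.

```python
import os
import tempfile

# Hypothetical file in a temporary folder, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

# "r" on a file that doesn't exist raises FileNotFoundError.
try:
    open(path, mode="r")
    missing_file_error = False
except FileNotFoundError:
    missing_file_error = True

# "w" creates the file (or empties it if it already exists).
f = open(path, mode="w")
f.write("one\n")
f.close()

# "a" keeps the existing data and adds to the end.
f = open(path, mode="a")
f.write("two\n")
f.close()

# "r" gives back a string, while "rb" gives back raw bytes.
f = open(path, mode="r")
as_text = f.read()
f.close()

f = open(path, mode="rb")
as_bytes = f.read()
f.close()

print(missing_file_error, repr(as_text), type(as_bytes))
```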

Reading the file

Once we have a file object, there are several methods that we can use to read the contents of the file.

The read method reads the entire contents of the file into a string:

my_file = open("myinfo.txt", mode="r")
my_info = my_file.read()

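The readlines method reads the contents into a list of strings, where each string is a line of the file, including its line ending. That's handy since, when a file is opened in text mode, Python by default translates the different line endings ("\r", "\n", and "\r\n") into a single "\n", which takes care of the cross-OS issues.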

my_file = open("myinfo.txt", mode="r")
file_lines = my_file.readlines()

for line in file_lines:
    # Each line already ends with "\n", so tell print not to add another one.
    print(line, end="")
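Since each string returned by readlines still ends with "\n", it's common to strip that off before using the lines. The sketch below shows one way to do it; the file name myinfo_demo.txt and its contents are made up for this example.

```python
import os
import tempfile

# Hypothetical file with two lines, created just for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "myinfo_demo.txt")
setup = open(path, mode="w")
setup.write("Name: Rosarita\nSpecies: Rabbit\n")
setup.close()

my_file = open(path, mode="r")
file_lines = my_file.readlines()
my_file.close()

# Each string still ends with "\n".
print(file_lines)

# rstrip("\n") removes the trailing newline from each line.
clean_lines = [line.rstrip("\n") for line in file_lines]
print(clean_lines)
```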

Python also provides an option for reading a file lazily - just one line at a time - by using a for loop.

rows = []
my_file = open("longbook.txt", mode="r")
for line in my_file:
    rows.append(line)

If we allow that loop to iterate all the way to the end of the file, then there's no difference between that and readlines. However, we could break out of the loop once we've found something in the file, like so:

rows = []
my_file = open("longbook.txt", mode="r")
for line in my_file:
    rows.append(line)
    # Stop reading as soon as we reach the line that mentions Chapter 2.
    if "Chapter 2" in line:
        break

This is a great approach for very long text files, since it means you don't have to read the whole darn file into memory.
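The same idea can be wrapped in a function. This is only a sketch: the function name read_until, the file tiny_book.txt, and its contents are all invented for the example.

```python
import os
import tempfile

def read_until(path, marker):
    """Collect lines from the start of the file through the first line containing marker."""
    rows = []
    my_file = open(path, mode="r")
    for line in my_file:
        rows.append(line)
        if marker in line:
            break  # Lines after this one are never collected.
    my_file.close()
    return rows

# Hypothetical five-line "book" standing in for a very long file.
book_path = os.path.join(tempfile.mkdtemp(), "tiny_book.txt")
book = open(book_path, mode="w")
book.write("Chapter 1\nIt begins.\nChapter 2\nIt continues.\nChapter 3\n")
book.close()

rows = read_until(book_path, "Chapter 2")
print(rows)  # ['Chapter 1\n', 'It begins.\n', 'Chapter 2\n']
```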

Writing files

To write a file, we need to first open it in either the "w" mode, which will empty out the file upon opening it, or the "a" mode, which will keep the prior contents but append additional data to the end.

Overwriting the entire file:

my_file = open("myinfo.txt", mode="w")
# write() doesn't add a line ending for us, so we include "\n" ourselves.
my_file.write("Birth city: Pasadena, CA\n")

Appending to the existing file contents:

my_file = open("myinfo.txt", mode="a")
# End with "\n" so that anything appended later starts on its own line.
my_file.write("First pet: Rosarita (Rabbit)\n")
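One thing to watch for: the write method writes exactly the string it's given and never adds a line ending. The sketch below overwrites and then appends to a throwaway file twice, once without "\n" and once with it. The helper name write_then_append and the file myinfo_demo.txt are assumptions for this example.

```python
import os
import tempfile

def write_then_append(path, first, second):
    """Overwrite path with first, append second, and return the file's contents."""
    my_file = open(path, mode="w")
    my_file.write(first)
    my_file.close()

    my_file = open(path, mode="a")
    my_file.write(second)
    my_file.close()

    my_file = open(path, mode="r")
    contents = my_file.read()
    my_file.close()
    return contents

# Hypothetical file in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "myinfo_demo.txt")

# Without "\n", the appended text is glued onto the end of the first line.
glued = write_then_append(path, "Birth city: Pasadena, CA", "First pet: Rosarita (Rabbit)")
print(repr(glued))

# With "\n", each piece of information lands on its own line.
separate = write_then_append(path, "Birth city: Pasadena, CA\n", "First pet: Rosarita (Rabbit)\n")
print(repr(separate))
```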

Closing files

Finally, once we're done reading or writing, we need to close the file. The close() method closes the file, ending all operations and freeing up resources.

my_file = open("myinfo.txt", mode="r")
my_file.close()

A fairly different approach is to use a with statement to open the file, and then put all the reading and writing calls inside the body of the statement.

with open("myinfo.txt", mode="r") as my_file:
    # No close() call is needed: the with statement closes the file for us.
    lines = my_file.readlines()

print(lines)

Once all the statements indented inside the with block are executed, Python takes care of closing the file for you. Any code that runs after the with block would not be able to read or write that file, since it's no longer open.
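We can check that behavior with a small sketch. The file name with_demo.txt is made up for this example. A file object's closed attribute reports whether it has been closed, and trying to read from a closed file raises a ValueError.

```python
import os
import tempfile

# Hypothetical file created just for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")
with open(path, mode="w") as setup:
    setup.write("line one\nline two\n")

with open(path, mode="r") as my_file:
    lines = my_file.readlines()
    open_inside_block = not my_file.closed

# After the block, Python has already closed the file.
print(open_inside_block, my_file.closed)  # True True

# Reading from the closed file raises ValueError.
try:
    my_file.read()
    read_after_close_failed = False
except ValueError:
    read_after_close_failed = True

print(read_after_close_failed)  # True
```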

Some programmers prefer the second approach since you only need to remember to open the file, and Python closes it even if an error occurs inside the block. But either approach is fine, whatever floats your boat! 🛶

Online files

The open(path) function only works for opening local files in the file system. What if there's a file online that you want to work with? I actually work more with online files than local files, myself!

In the Python standard library, the urllib.request module includes a urlopen(url) function that can open a file at a URL and return a file-like object.

This code opens a text file that contains an entire book, The Count of Monte Cristo:

import urllib.request

text_file = urllib.request.urlopen('https://www.gutenberg.org/cache/epub/1184/pg1184.txt')

Once the file is opened, we can use similar methods as discussed above.

This line of code reads the whole book into a single string:

whole_book = text_file.read()
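If you'd like to try urlopen without depending on a network connection, here is a sketch that runs offline. It assumes a stand-in: a small local file (tiny_book.txt, invented for this example) addressed through a file:// URL, which urlopen also accepts. A real web address would be used the same way.

```python
import os
import tempfile
import urllib.request
from pathlib import Path

# Hypothetical local file standing in for a file hosted online.
path = os.path.join(tempfile.mkdtemp(), "tiny_book.txt")
local = open(path, mode="wb")
local.write("Alexandre Dumas, père\n".encode("utf-8"))
local.close()

# Turn the local path into a file:// URL so the sketch works offline.
url = Path(path).as_uri()

with urllib.request.urlopen(url) as text_file:
    raw = text_file.read()

# read() on the object returned by urlopen gives back bytes, not a string.
print(type(raw))

text = raw.decode("utf-8")
print(text)
```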

However, there's one significant difference between that string and the string returned when reading a local file. The variable above now stores a byte string, which Python displays with a lowercase b in front:

b'\xef\xbb\xbfThe Count of Monte Cristo, by Alexandre Dumas, p\xc3\xa8re.'

A byte string is a series of bytes (8-bit values), which is how computers actually store the data behind strings. Python displays bytes that correspond to printable ASCII characters (such as unaccented English letters, digits, and punctuation) as those characters, and shows the rest as escape codes like \xc3. But there are many thousands of characters beyond ASCII. That's why character encodings exist, to specify how a sequence of bytes corresponds to a particular character. The most common encoding is UTF-8, which can represent every Unicode character and is the dominant encoding on the web.
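To make the link between characters and bytes concrete, this short sketch encodes a single word with UTF-8 and looks at the result. The variable names are just for illustration.

```python
word = "père"

# encode() turns the string's characters into bytes using the given encoding.
encoded = word.encode("utf-8")
print(encoded)  # b'p\xc3\xa8re'

# The string has 4 characters, but "è" takes 2 bytes in UTF-8, so there are 5 bytes.
print(len(word), len(encoded))  # 4 5

# decode() reverses the process.
decoded = encoded.decode("utf-8")
print(decoded == word)  # True
```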

In order to translate that byte string into a string of characters, we must know the encoding of the original data, and then call decode(encoding) on the byte string.

Since that book file was indeed encoded with UTF-8, we can decode it like this:

whole_book = whole_book.decode('utf-8')

Now, instead of seeing b'p\xc3\xa8re', we'll see 'père' in the string.
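One more detail from the byte string shown earlier: the three bytes \xef\xbb\xbf at the very start are a byte order mark (BOM), which some programs add to the beginning of UTF-8 files. Decoding with 'utf-8' keeps it as an invisible character at the start of the string. As a sketch of one way to handle that, Python's 'utf-8-sig' codec decodes the same way but drops a leading BOM if there is one. The short byte string here is a shortened stand-in for the book's first line.

```python
# Shortened stand-in for the start of the downloaded book.
raw = b"\xef\xbb\xbfThe Count of Monte Cristo"

# With "utf-8", the BOM survives as the character U+FEFF at the start.
plain = raw.decode("utf-8")
print(repr(plain[0]))  # '\ufeff'

# With "utf-8-sig", the leading BOM is removed during decoding.
cleaned = raw.decode("utf-8-sig")
print(cleaned)  # The Count of Monte Cristo
```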

Once decoded, we can use string operations on it, such as using split to turn it into a list of lines.
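A caution when splitting decoded text into lines: unlike reading a local file in text mode, decode does not translate line endings. If the original file used "\r\n", splitting on "\n" leaves a stray "\r" on every line. The sketch below uses a made-up two-line byte string to compare that with splitlines, which handles all the line endings.

```python
# Made-up bytes that use Windows-style "\r\n" line endings.
raw = b"Chapter 1\r\nChapter 2\r\n"
text = raw.decode("utf-8")

# Splitting on "\n" leaves "\r" behind, plus an empty string at the end.
by_newline = text.split("\n")
print(by_newline)  # ['Chapter 1\r', 'Chapter 2\r', '']

# splitlines() removes whichever line ending each line has.
by_lines = text.splitlines()
print(by_lines)  # ['Chapter 1', 'Chapter 2']
```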
