Avoiding Windows backslash problems with Python’s raw strings

I’m a Unix guy, but the participants in my Python classes overwhelmingly use Windows. Inevitably, when we get to talking about working with files in Python, someone will want to open a file using the complete path to the file.  And they’ll end up writing something like this:

filename = 'c:\abc\def\ghi.txt'

But when my students try to open the file, they discover that Python gives them an error, indicating that the file doesn’t exist!  In other words, they write:

for one_line in open(filename):
    print(one_line)

What’s the problem?  This seems like pretty standard Python, no?

Remember that strings in Python normally contain characters. Those characters are normally printable, but there are times when you want to include a character that isn’t really printable, such as a newline.  In those cases, Python (like many programming languages) includes special codes that will insert the special character.

The best-known example is newline, aka ‘\n’, or ASCII 10. If you want to insert a newline into your Python string, then you can do so with ‘\n’ in the middle.  For example:

s = 'abc\ndef\nghi'

When we print the string, we’ll see:

>>> print(s)
abc
def
ghi

What if you want to print a literal ‘\n’ in your code? That is, you want a backslash, followed by an “n”?  Then you’ll need to double the backslash: The “\\” in a string will result in a single backslash character. The following “n” will then be normal. For example:

s = 'abc\\ndef\\nghi'

When we print this string, we see a literal backslash followed by “n”:

>>> print(s)
abc\ndef\nghi

It’s pretty well known that you have to guard against this translation when you’re working with \n. But what other characters require it? It turns out, more than many people might expect:

  • \a — alarm bell (ASCII 7)
  • \b — backspace (ASCII 8)
  • \f — form feed
  • \n — newline
  • \r — carriage return
  • \t — tab
  • \v — vertical tab
  • \ooo —  character with octal value ooo
  • \xhh — character with hex value hh
  • \N{name} — Unicode character {name}
  • \uxxxx — Unicode character with 16-bit hex value xxxx
  • \Uxxxxxxxx — Unicode character with 32-bit hex value xxxxxxxx
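To convince yourself that each of these escapes really collapses into a single character, here is a small check you can run in any Python 3 session. The dictionary and variable names are just for illustration.

```python
# Each escape sequence below is ONE character, not two.
escapes = {
    'alarm bell': '\a',
    'backspace': '\b',
    'form feed': '\f',
    'newline': '\n',
    'carriage return': '\r',
    'tab': '\t',
    'vertical tab': '\v',
}

for name, char in escapes.items():
    # ord() returns the character's numeric code point
    print(f'{name:16} length={len(char)} code={ord(char)}')

# Doubling the backslash gives two characters: a backslash, then "n"
literal = '\\n'
print(len(literal), literal)
```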

In my experience, you’re extremely unlikely to use some of these on purpose. I mean, when was the last time you needed to use a form feed character? Or a vertical tab?  I know — it was roughly the same day that you drove your dinosaur to work, after digging a well in your backyard for drinking water.

But nearly every time I teach Python — which is, every day — someone in my class bumps up against one of these characters by mistake. That’s because the combination of the backslashes used by these characters and the backslashes used in Windows paths makes for inevitable, and frustrating, bugs.

Remember that path I mentioned at the top of the blog post, which seems so innocent?

filename = 'c:\abc\def\ghi.txt'

It contains a “\a” character. Which means that when we print it:

>>> print(filename)
c:bc\def\ghi.txt

See? The “\a” is gone, replaced by an alarm bell character. If you’re lucky.
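Because the alarm bell is invisible, print() hides the damage. A quick way to see what is really in the string is repr(). The sketch below uses only the first part of the path, so that every backslash in it is a recognized escape; the "\d" and "\g" in the full path are not recognized escapes, and recent Python versions warn about them.

```python
broken = 'c:\abc'       # \a becomes the alarm bell, ASCII 7
fixed = 'c:\\abc'       # a doubled backslash stays a backslash

print(repr(broken))     # repr() shows invisible characters as escapes
print(len(broken), len(fixed))   # the broken string is one character shorter
print('\a' in broken)   # the bell character is in there
print('\\' in broken)   # ...and the backslash is not
```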

So, what can we do about this? Double the backslashes, of course. Strictly speaking, you only need to double those that would otherwise be turned into special characters, from the list above. But come on, are you really likely to remember that “\f” is special, but “\g” is not?  Probably not.

So my general rule, and what I tell my students, is that they should always double the backslashes in their Windows paths. In other words:

>>> filename = 'c:\\abc\\def\\ghi.txt'

>>> print(filename)
c:\abc\def\ghi.txt

It works!

But wait: No one wants to really wade through their pathnames, doubling every backslash, do they?  Of course not.

That’s where Python’s raw strings can help. I think of raw strings in two different ways:

  • what-you-see-is-what-you-get strings
  • automatically doubled backslashes in strings

Either way, the effect is the same: All of the backslashes are doubled, so all of these pesky and weird special characters go away.  Which is great when you’re working with Windows paths.

All you need to do is put an “r” before the opening quotes (single or double):

>>> filename = r'c:\abc\def\ghi.txt'

>>> print(filename)
c:\abc\def\ghi.txt

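Note that a “raw string” isn’t really a different type of string at all. It’s just another way of entering a string into Python.  If you check, type(filename) will still be “str”, and the string is identical to the one you would get by doubling every backslash yourself: each backslash in the raw literal is stored as one ordinary backslash character.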
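Here is a short check of that claim, comparing the raw spelling with the doubled-backslash spelling of the same path. The variable names are mine, for illustration only.

```python
raw = r'c:\abc\def\ghi.txt'
doubled = 'c:\\abc\\def\\ghi.txt'

print(raw == doubled)        # the two spellings produce the same string
print(type(raw).__name__)    # still a plain str
print(raw.count('\\'))       # how many backslash characters are stored
print(repr(raw))             # repr() displays each backslash doubled
```

The doubled backslashes you see in the repr() output are only how Python displays the string; the string itself holds one backslash per path separator.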

Bottom line: If you’re using Windows, then you should just write all of your hard-coded pathname strings as raw strings.  Even if you’re a Python expert, I can tell you from experience that you’ll bump up against this problem sometimes. And even for the best of us, finding that stray “\f” in a string can be time consuming and frustrating.

PS: Yes, it’s true that Windows users can get around this by using forward slashes, like we Unix folks do. But my students find this to be particularly strange looking, and so I don’t see it as a general-purpose solution.
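If you do go the forward-slash route, the standard library’s pathlib module will accept such paths and display them with backslashes. This is only a small sketch (pathlib isn’t otherwise covered in this post); PureWindowsPath is used so that it behaves the same way on any operating system.

```python
from pathlib import PureWindowsPath

# PureWindowsPath parses Windows-style paths without touching the filesystem
p = PureWindowsPath('c:/abc/def/ghi.txt')

print(p)          # displayed with backslashes
print(p.name)     # the final component of the path
print(p == PureWindowsPath(r'c:\abc\def\ghi.txt'))   # same path, either spelling
```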

A quick intro to the Unix “find” utility

One of the most powerful Unix command-line utilities is “find” — but it also has a huge number of options, and most of the documentation I’ve read on “find” is hard to follow and understand.  That’s a shame, because once you understand what “find” does and how it works, you can accomplish quite a bit.  I hope that this post will show you some of the basics of “find”, so that you can take advantage of it in your day-to-day work.

The basic idea is that “find” looks through a directory (and all of its subdirectories), applying one or more filters when deciding which files are interesting, and executing one or more actions on matching files.
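If it helps to think in code, the walk/filter/act model can be sketched in a few lines of Python. This is only an illustration of the idea, not how “find” is implemented; the function name, its parameters, and the temporary demo directory are all my own inventions.

```python
import os
import tempfile

def find(top, tests=(), action=print):
    """Walk top and its subdirectories; call action(path) on every
    path for which all of the tests return True."""
    candidates = [top]   # like find, include the starting directory itself
    for dirpath, dirnames, filenames in os.walk(top):
        candidates.extend(os.path.join(dirpath, n) for n in dirnames + filenames)
    for path in candidates:
        if all(test(path) for test in tests):
            action(path)

# Demo in a throwaway directory: keep only regular files, collect their names
with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, 'sub'))
    for name in ('a.txt', os.path.join('sub', 'b.txt')):
        open(os.path.join(top, name), 'w').close()
    found = []
    find(top, tests=[os.path.isfile], action=found.append)
    names = sorted(os.path.relpath(p, top) for p in found)
    print(names)
```

Each test is just a function that says yes or no to a path, and the action is whatever you want done with the paths that pass every test.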

So, what can you do with “find”?

  • Move any backup log older than 30 days to /tmp/
  • Find all of the MP4 files larger than 100MB
  • Find all of the documents with either “doc” or “docx” extensions anywhere in your home directory
  • In a directory of text files, find those containing the phrase “budget” which have not been touched in the last 30 days

(In these examples, I’m going to use the GNU version of find, which is standard on Linux machines and available for the Mac via Homebrew.  Note that if you use Homebrew on the Mac, then GNU “find” will be installed as “gfind” by default.  Use the “--with-default-names” option to “brew install” if you want to avoid this prefix.)

Note: There is a big difference between “find” and “locate”, which are often confused for one another:

  • “find” looks for files according to a number of criteria, and performs an action on the files matching those criteria. The search takes place when you run the program.
  • “locate” searches a database (typically created with the “updatedb” command) for filenames matching a pattern, and returns those filenames.

So if you know that you have a file named “important.txt” somewhere on your system, then you probably want to use “locate” — assuming, of course, you have been updating your filename database on a regular basis, typically via “cron”.

If you don’t remember the name of the file, but do remember that you modified it in the last 14 days, and that it contains the phrase “very important”, then you can use “find”.

For example, let’s say that I just want to find all of the files in the current directory and all of its subdirectories.  I can say:

find . -print

This means: Look at all files and directories in the current directory (.) and contained within its subdirectories, and then print them.

Now, in GNU find, both of these arguments are optional; you can just say

find

but I don’t recommend doing so, if only because it’s a bit ambiguous.  Moreover, the longer version emphasizes that “find” looks through a directory, filters through the results (although we don’t have any filters here), and then executes something (in this case, “print”).  The filters and actions are specified using command-line arguments; thus, we say “-print” if we want to print the name of the file.  Note that it’s not “--print” (i.e., with two “-” characters before “print”), which we might expect.

Also notice that the result includes all files, including directories and special Unix files (e.g., device files).  If you want to only look at files, then you can specify the “-type” filter.  For example, the following command shows all files (i.e., not subdirectories, symbolic links, or the like) under the current directory:

find . -type f -print    # find regular files

What if you want to find directories?  Then instead of using “-type f”, specify “-type d”:

find . -type d -print    # find directories

What if I only want to find files that match a certain pattern?  Then I can filter using the “-name” test and the shell’s standard wildcard characters (quoted, so that the shell hands the pattern to “find” rather than expanding it itself).  For example, let’s say I want to find all of the files that end with “.txt”.  I can then say:

find . -type f -name "*.txt" -print

The above applies two tests: only regular files (i.e., not directories or the like) whose names match the pattern “*.txt” will be printed.

What if I want to find files that end with “.txt” or “.text”?  In such cases, it might be easiest to use the “or” option, written as “-o”, that combines two tests.  For example:

find . -type f \( -name '*.txt' -o -name '*.text' \) -print

The “-o” option (for logical “or”; and yes, there is also a “-a” option that’s logical “and”) allows either of the tests to succeed in order for it to declare success. However, the items on either side of “-o” must be grouped inside of parentheses, because “-o” binds more loosely than the implicit “and” between adjacent tests; without them, “-type f” would apply only to the first pattern, and “-print” only to the second.  Since parentheses in the Unix shell have their own uses, we need to preface them with backslashes, to avoid clashes between the levels of parsing.  But wait: if the “\(” and “\)” are touching the arguments, then you’ll get hard-to-understand errors.  So make sure that “\(” and “\)” are surrounded by whitespace, if you want to avoid trouble.
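The grouping matters for the same reason that it matters in Python, where “and” binds more tightly than “or”. The sketch below models the two tests as Python booleans; the function names are hypothetical, and fnmatchcase stands in for the “-name” test (I leave “-print” out of the model).

```python
from fnmatch import fnmatchcase

def with_parens(is_file, name):
    # Models: -type f \( -name '*.txt' -o -name '*.text' \)
    return is_file and (fnmatchcase(name, '*.txt') or fnmatchcase(name, '*.text'))

def without_parens(is_file, name):
    # Models: -type f -name '*.txt' -o -name '*.text'
    # "and" binds tighter than "or", so the file check covers only '*.txt'
    return is_file and fnmatchcase(name, '*.txt') or fnmatchcase(name, '*.text')

# A directory that happens to be called "old.text"
print(with_parens(False, 'old.text'))      # rejected: it is not a regular file
print(without_parens(False, 'old.text'))   # accepted: the file check was skipped
```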

Let’s say that I want to find old files on my system. Unix filesystems keep track of file ages in three different ways:

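  • ctime (change time): when were the file’s contents or metadata (permissions, owner, and so on) last changed? Despite the name, this is not the creation time.
  • mtime (modification time): when were the file’s contents last modified?
  • atime (access time): when was the file last accessed/read?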

Let’s say that I want to find files in the current directory (and below) that were last accessed 7 days ago.  I can say:

find . -atime 7 -print

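The “atime” is measured in whole 24-hour periods, counting back from the moment you run the command (GNU find’s “-daystart” option counts from the start of today instead), and any fractional part is ignored. So “-atime 7” means, “last accessed at least 7*24 hours ago, but less than 8*24 hours ago.”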

But wait a second — when was the last time you wanted to find files that were accessed exactly 7 days ago?  It’s far more likely that you want to find files that were last accessed less than 7 days ago. In order to do that, you need to preface the number with a “-” sign:

find . -atime -7 -print

By contrast, if you want to find all of those files that were accessed more than 7 days ago, you’ll want to preface the number with a “+” sign:

find . -atime +7 -print

And of course, if you want to find files that were accessed more than 2 days ago, but less than 9 days ago, you can say:

find . -atime +2 -atime -9 -print

Depending on your needs, it might well be better to use “mtime” rather than “atime”. I’m often interested in finding files I changed recently, rather than those I read recently. The same rules apply; here’s how I would find all of those files that I last modified more than two days ago but less than 9 days ago:

find . -mtime +2 -mtime -9 -print

Notice that I’m able to combine two rules (i.e., two “atime” or “mtime” rules) without using “-a” to join them together with a logical “and”; when no operator appears between two tests, “find” assumes “and”.
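The arithmetic behind the bare, “-”, and “+” forms can be sketched in Python. This is my reading of the GNU find manual for the default case (ages counted back from the moment “find” runs, fractions of a day discarded), not find’s actual code; the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def age_in_days(seconds_since_access):
    """Whole 24-hour periods elapsed; the fractional part is discarded."""
    return int(seconds_since_access // SECONDS_PER_DAY)

def atime_matches(seconds_since_access, spec):
    """Mimic -atime for a spec such as '7', '-7' or '+7'."""
    days = age_in_days(seconds_since_access)
    n = int(spec.lstrip('+-'))
    if spec.startswith('+'):
        return days > n      # more than n whole days
    if spec.startswith('-'):
        return days < n      # fewer than n whole days
    return days == n         # exactly n whole days

# A file read three days ago passes both halves of "-atime +2 -atime -9"
three_days = 3 * SECONDS_PER_DAY
print(atime_matches(three_days, '+2') and atime_matches(three_days, '-9'))

# A file read 7.5 days ago counts as 7 days old, so it matches 7 but not +7
print(atime_matches(7.5 * SECONDS_PER_DAY, '7'))
print(atime_matches(7.5 * SECONDS_PER_DAY, '+7'))
```

One consequence of the rounding: under this reading, “+7” only starts matching once a file is at least 8 full days old.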

Another useful thing to look for is big files. What files, for example, are bigger than 2 GB? I can say the following:

find . -size +2G -print

(I believe that this “-size” option only works this way on GNU find. Other versions might well require that you specify the file size in blocks. It has been a while since I used non-GNU versions.)

Look familiar? That’s right; the “+” prefix means “greater than”, and the “G” suffix means “GB” (strictly speaking, units of 1024*1024*1024 bytes).  You can use a bunch of suffixes to the number, to indicate just how big the file should be.  As you might have guessed, you can say “-2M” to mean “less than 2 MB”, which on a modern computer is just about everything, to be honest.

We can also combine these, just as we did with “atime” and “mtime”: What files are bigger than 500 MB and smaller than 5 GB?

find . -size +500M -size -5G -print
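One subtlety worth knowing: as I read the GNU find manual, “-size” rounds each file’s size up to a whole number of the units you named before comparing. Here is a rough Python sketch of that rule; the function name and the table of units are my own, and only three of the available suffixes are covered.

```python
UNIT_BYTES = {'k': 1024, 'M': 1024 ** 2, 'G': 1024 ** 3}

def size_matches(size_in_bytes, spec):
    """Mimic -size for specs such as '+2G', '-5G' or '500M'."""
    unit = UNIT_BYTES[spec[-1]]
    n = int(spec[:-1].lstrip('+-'))
    units_used = -(-size_in_bytes // unit)   # ceiling division: round up
    if spec.startswith('+'):
        return units_used > n
    if spec.startswith('-'):
        return units_used < n
    return units_used == n

one_gib = 1024 ** 3

# A 1 GiB file passes both halves of "-size +500M -size -5G"
in_range = size_matches(one_gib, '+500M') and size_matches(one_gib, '-5G')
print(in_range)

# A file one byte over 1 MiB rounds up to 2 units, so "-2M" rejects it
print(size_matches(1024 ** 2 + 1, '-2M'))
```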

We can combine these filters with others. What files are bigger than 500 MB and smaller than 5 GB, and were last accessed no more than 30 days ago?

find . -size +500M -size -5G -atime -30 -print

You can imagine using this sort of command (with “-atime +30” instead) to find large, unused files, such as old videos that you had forgotten are on your filesystem. Indeed, what if I’m only interested in finding MP4 files that are larger than 500 MB, smaller than 5 GB, and accessed in the last 30 days? I can add another condition:

find . -size +500M -size -5G -atime -30 -name "*.mp4" -print

There are lots of other filters you can apply, and GNU find is especially full of them. There are alternative ways to specify dates. You can search for particular types of special files.  You can search for certain permissions. And so forth.  But the ones I’ve shown you are the ones I’ve used most often.

But the tests are only the first part of using “find”: Once you’ve gotten a list of files, what can you do with them?

So far, we’ve seen a single action, namely “-print”.  There are a few others that you might find useful.

The first is “-ls”, which lists each matching file in the same format as the Unix “ls -dils” command (showing size, permissions, and more):

find . -size +500M -size -5G -atime -30 -name "*.mp4" -ls

The above will not only print the filename (like “-print”), but will also show lots of other information about the files we’ve found. What if you want to write this list to a file? Then just use the “-fls” option, and give it a filename:

find . -size +500M -size -5G -atime -30 -name "*.mp4" -fls big-movies.txt

It’s pretty common to want to delete files. So you can use the “-delete” option to do so.  Warning: Running a program that automatically deletes files can be very dangerous. I almost never do this, because I’m always so worried that something will go wrong.  Here’s how I can remove all of the backup files in my Linux /var/log directory that are more than 21 days old:

find /var/log -name '*.gz' -mtime +21 -delete -print

Note that you can have more than one action; in this case, my first action was “-delete”, and my second was “-print”.

It’s pretty common for me to want to search through an entire directory for a file that contains particular text. In other words, I want to run the “grep” utility on each file. I can do that by using the all-purpose “-exec” action.  The basic idea is as follows: You hand “-exec” a command, and the command is then ended with \; (yes, backslash + semicolon). In between, you can write whatever Unix command you want, including options. The current filename can be put into the command with the special placeholder {} (i.e., empty curly braces).  For example, I can say:

find . -name "*.txt" -exec grep Reuven {} \;

The above will show all lines containing my name, from all of the “.txt” files. (Of course, a regular expression can be far more complex than this; if you aren’t familiar with grep or regexps, you can take my free “regular expressions crash course.”) But the output only shows the lines we would get from “grep”, which (by default) doesn’t show the name of the current file if you’re running it one file at a time. For this reason, we would be wise to include the “-H” option:

find . -name "*.txt" -exec grep -H Reuven {} \;
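For comparison, here is roughly what that “find … -exec grep -H” combination does, written as a Python sketch. It does a plain substring search rather than a regular-expression match, and the function name and demo files are made up for illustration.

```python
import os
import tempfile
from fnmatch import fnmatchcase

def grep_tree(top, pattern, text):
    """Return (path, line) pairs for every line containing text, in
    files under top whose names match the wildcard pattern."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            if not fnmatchcase(name, pattern):
                continue
            path = os.path.join(dirpath, name)
            with open(path, errors='replace') as f:
                for line in f:
                    if text in line:
                        hits.append((path, line.rstrip('\n')))
    return hits

# Demo in a throwaway directory
with tempfile.TemporaryDirectory() as top:
    with open(os.path.join(top, 'notes.txt'), 'w') as f:
        f.write('hello\nwritten by Reuven\n')
    with open(os.path.join(top, 'notes.md'), 'w') as f:
        f.write('Reuven again, but not in a .txt file\n')
    hits = grep_tree(top, '*.txt', 'Reuven')
    for path, line in hits:
        print(f'{path}:{line}')   # filename first, like grep -H
```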

While “grep” is the most common command that I run via “-exec”, you can use any program you want, including programs that you’ve written.  In this way, you can really make “find” work for you, and execute custom code for each file that fits your criteria. Combine “find” with “cron”, and you have an easy way to identify files that need your attention, or that should be removed, or that you’ve been looking for and otherwise cannot find.

If there’s one drawback to “find”, it’s that the search happens in real time. There is no database through which it runs. Which means that if you’re going through a very large directory structure, you might discover that “find” takes quite a while.

And that’s about it! If you’re like me, then you’ll find (no pun intended) that these use cases cover most of what you need with the “find” utility. The documentation is extremely long, but only because “find” has many other tests and actions that you can mix and match in a variety of ways.