Want to understand Python’s comprehensions? Think in Excel or SQL.

Comprehensions are among the most useful constructs in Python. They merge the old, trusty “map” and “filter” functions into a single piece of compact, elegant syntax, allowing us to express complex ideas in a minimum of code. Comprehensions are one of the most important tools in a Pythonista’s toolbox.

And yet, I have found that a very large number of Python programmers, including some experienced developers, are not completely comfortable with comprehensions. There are two reasons for this: First, it’s not obvious when to use them, and what sorts of problems they solve. The second problem, which is at least as important, is that the syntax is hard for people to remember and understand.

I’ve started to use a new explanation and introduction to comprehensions in my Python classes, and have found that it helps to lower the learning curve to some degree. In this post, I’m publicizing this explanation, in the hopes that it’ll help Python developers to understand when, where, and how to use comprehensions.

Let’s take a simple problem: I want to take a list of five integers, and get a list of their squares. If you give this problem to a new (or even intermediate) Python programmer, the answer would look something like this:

numbers = range(5)
output = [ ]
for number in numbers:
    output.append(number * number)
print(output)

Now, the thing is that this does work. (In my courses, I often use the phrase, “Unfortunately, this works.”) Often, when I talk about comprehensions, I talk about functional programming, the idea of immutable data structures, the idea that we don’t want to change things, and the benefits of thinking in terms of mapreduce.

But let’s ignore all of that, and ask a simpler question: If you were to give this problem to your accountant, how would they solve the problem?

Almost certainly, an accountant would fire up Excel, and put the numbers in a column:

A
-
0
1
2
3
4

Let’s assume that the above numbers are in the spreadsheet’s column A. The Excel user would, given this task, then tell Excel that column B should be calculated as A*A. And it would be done:

A  B
-  -
0  0
1  1
2  4
3  9
4  16

You could argue that the difference here is that Excel has a GUI, and Python doesn’t. But that’s missing the point. The real difference is that our accountant told Excel how to transform the first column into the second column, whereas our Python developer wrote a program that describes how to carry out that transformation.

We can think about this in a different way, too: Rather than solving the problem serially, as in the above for loop, the accountant is thinking in a parallel manner, applying a single expression to a large data set. The Excel user doesn’t care, or even know, the order in which the numbers are handed to the expression. The important thing is that the expression is applied once to each of the numbers, and that the final result appears in the correct order.

We might laugh at Excel, and dismiss its users as technical neophytes. And certainly, many users of Excel would deny that they possess serious programming chops. But this sort of thinking, which is so fundamental and natural to Excel users, is alien to many programmers. Which is a shame, because it allows us to express a very large number of ideas in a simple way.

To summarize this approach:

  • Think of your input as an iterable source of data
  • Think of what operation you want to apply to each element of that source
  • Get a new sequence out

That’s what the traditional “map” function does. Python does have a “map” function, but today, we typically use list comprehensions instead.

Let’s try to make this a bit more concrete, using the example that I used above: Let’s say that we have a list of five numbers, and we want to turn that list into a list of its squares. The list-comprehension syntax looks as follows:

[number * number for number in range(5) ]

Yikes. No wonder people are scared off by this syntax.  Let’s take the above syntax apart:

  • First of all, we’re going to get a list back. (It’s called a “list comprehension” for a reason.) That’s because of the square brackets, which are mandatory, and which tell Python what sort of object to create.
  • The data source will be “range(5),” which produces the integers 0 through 4. (In Python 2 it returns a list; in Python 3, a lazy range object.)
  • Each element in the data source will be assigned, in turn, to the iteration variable “number.”
  • We’ll invoke the operation “number * number” on each element of the data source.

In other words, we’re creating a new list, the elements of which are the result of applying our expression to each element of the source. This sounds suspiciously like what our accountant did above, using Excel: We’re telling Python what we want, and how to transform our source to that result. But how are things done internally? How is the list created? We neither know nor care.
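As a rough sketch of how the approaches line up, the following compares the explicit loop, the “map” function, and the comprehension. The variable names are chosen only for illustration; all three produce the same list.

```python
numbers = range(5)

# The explicit loop: we tell Python *how* to build the list.
squares_loop = []
for number in numbers:
    squares_loop.append(number * number)

# The traditional "map" function: apply one function to every element.
squares_map = list(map(lambda number: number * number, numbers))

# The list comprehension: describe the result, and let Python build it.
squares_comprehension = [number * number
                         for number in numbers]

print(squares_loop == squares_map == squares_comprehension)
```

The comprehension says the same thing as the loop, but without the bookkeeping of creating an empty list and appending to it.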

List-comprehension syntax can be daunting for people to understand, in part because the order of the operations seems unusual. I’ve found that it can help to rewrite list comprehensions in the following way:

[number * number
 for number in range(5) ]

Yes, that’s right — I now spread list comprehensions across two lines; the first describes the operation I want to invoke, and the second line describes the data source. If this still seems unfamiliar, let’s try to bring it into a context with which you might have some experience:

[number * number           # SELECT
 for number in range(5) ]  # FROM

While they’re not directly equivalent, there are a fair number of similarities between a SELECT query in SQL, the placement of its SELECT expression and FROM clause, and our list comprehension.  The FROM clause in an SQL query describes our data source, which is typically going to be a table, but can also be a view or even the result of a function call. And the initial part of the SELECT is often the name of a column, but  can include function calls and operators.

On the one hand, the SELECT-FROM combination seems almost too simple to mention, in that you’re just retrieving a selected set of values from a data source.  On the other hand, such queries form the backbone of the database industry. In the same way, such functionality forms the backbone of many Python programs, iterating over a data structure, and plucking out part of it, transforming that part, and then returning a new list.

One of my favorite examples (and an exercise in my ebook, “Practice Makes Python“) is to take the /etc/passwd file used in Unix, and get the usernames contained within that file. /etc/passwd consists of one record per line, and the fields are separated by colons. Here are several lines from the /etc/passwd on my computer:

nobody:*:-2:-2::0:0:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0::0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1::0:0:System Services:/var/root:/usr/bin/false
_uucp:*:4:4::0:0:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico

We might normally think of a file as a collection of bytes, to which we give semantic meaning when we read it. But in Python, we’re encouraged to see a file as an ordered, iterable collection of lines of text. True, I can read from a file based on bytes, but it’s so common to want to read files by line that the language provides several constructs to do so.

We know that we can iterate over the lines of a file:

for line in open('/etc/passwd'):
    print(line)

This demonstrates that a file is iterable, which means that it can serve as a data source for a list comprehension. This means that the above code can be rewritten as:

[line
 for line in open('/etc/passwd')]

Again, the first line in our list comprehension represents the expression we want to apply to every element of our data source. In this case, the expression is just the line.  If we want to get the username from each of  these lines, we just need to apply the “split” method on the string, returning a list — and then retrieve index 0 from the resulting list.  For example:

[line.split(":")[0]
 for line in open('/etc/passwd')]

Again, we can think of it in terms of an SQL query:

SELECT username
FROM users

But of course, “username” in the above is a column name.  A closer equivalent to my list comprehension would query a “Users” table with an “info” column, as follows:

SELECT split_part(info, ':', 1)
FROM users;

Note that in this case, I’m using PostgreSQL’s built-in “split_part” function to perform the equivalent operation to the str.split method in Python. (split_part counts fields starting from 1, so field 1 corresponds to index 0 in Python.)

Remember that in the case of my SQL query, the result of a query always looks and acts like a table. The number and types of columns returned will depend on the number and types of expressions that I have in the SELECT  statement.  But the result set will have one or more columns, and zero or more rows.

In the same way, the result of a list comprehension is always going to be a list.  You can have whatever expression you want inside of the list comprehension; the expression represents one item in a list, not the list itself.

For example, let’s assume that I want to turn the usernames in /etc/passwd into a list of dictionaries. This doesn’t require a dictionary comprehension, which creates a single dictionary.  Rather, it requires a list  comprehension, in which the expression creates a dictionary.  Here’s a simple-minded such list comprehension:

[ {'name':line.split(":")[0]}
   for line in open('/etc/passwd')]

The above will work, in that it creates a list of dictionaries. And each dictionary has a single key-value pair.  But it seems a bit silly to do the above.  Rather, I’d probably want to have a dictionary containing the username and the numeric user ID, which is at index 2. I can then write:

[ {'name':line.split(":")[0], 'id':line.split(":")[2]}
for line in open('/etc/passwd')]

Again, we can think about this in terms of Excel, or even in terms of SQL: My query now produces a single column of results, but each column contains a text string. Or we can even say that the query produces two columns of results, which is not at all unusual in the world of SQL.

Let’s ignore the efficiency (or lack thereof) of invoking str.split twice in one comprehension: When I run this code on my Mac, it results in an exception, claiming that an index is out of range.

The reason is simple: I split each line into a list. But if there’s a line that doesn’t contain any : characters, it’ll be turned into a single-element list. I thus need to weed out any lines that won’t conform. Specifically, on my Mac at least, I need to remove any lines in /etc/passwd that are comments, meaning that they start with the ‘#’ character.

In the world of list comprehensions, I say the following:

[ {'name':line.split(":")[0], 'id':line.split(":")[2]}
for line in open('/etc/passwd')
if not line.startswith("#")]
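To see both the failure and the fix without depending on a real /etc/passwd, here is a small sketch that uses a few sample lines. The list “sample_lines” is a made-up stand-in for the file, and the comment line in it is invented for the illustration.

```python
# Hypothetical stand-in for the lines of /etc/passwd, including a comment line.
sample_lines = [
    "# User Database\n",
    "nobody:*:-2:-2::0:0:Unprivileged User:/var/empty:/usr/bin/false\n",
    "root:*:0:0::0:0:System Administrator:/var/root:/bin/sh\n",
]

# Without a filter, the comment line splits into a one-element list,
# so asking for index 2 raises IndexError.
try:
    [{'name': line.split(":")[0], 'id': line.split(":")[2]}
     for line in sample_lines]
    failed = False
except IndexError:
    failed = True

# With the "if" clause, the comment line never reaches the expression.
users = [{'name': line.split(":")[0], 'id': line.split(":")[2]}
         for line in sample_lines
         if not line.startswith("#")]

print(failed)
print(users)
```

Note that the IDs come back as strings, since str.split always returns strings; convert them with int if you need numbers.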

Let’s extend our earlier SQL analogy further, adding the equivalent SQL syntax in comments after our Python code:

[ {'name':line.split(":")[0], 'id':line.split(":")[2]}    # SELECT
for line in open('/etc/passwd')                           # FROM
if not line.startswith("#")]                              # WHERE

Of course, when the first line of our comprehension becomes this long, it’s often a good idea to move the work into a function. Since the first line can be any legitimate Python expression, a function call fits there perfectly well:

def get_user_info(line):
    name, passwd, id, rest = line.split(":", 3)   # max 4 fields
    return {'name':name, 'id':id}

[ get_user_info(line)             # SELECT
for line in open('/etc/passwd')   # FROM
if not line.startswith("#")]      # WHERE
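If you would rather not define a helper function, one possible alternative (a sketch, not the only way to do it) is to split each line once in an inner generator expression, and then pick out the fields in the outer comprehension. The “sample_lines” list below is a made-up stand-in for the file.

```python
# Hypothetical stand-in for the lines of /etc/passwd.
sample_lines = [
    "# User Database\n",
    "root:*:0:0::0:0:System Administrator:/var/root:/bin/sh\n",
    "daemon:*:1:1::0:0:System Services:/var/root:/usr/bin/false\n",
]

# The inner generator filters out comments and splits each line once;
# the outer comprehension only selects fields from the already-split list.
users = [{'name': fields[0], 'id': fields[2]}
         for fields in (line.split(":")
                        for line in sample_lines
                        if not line.startswith("#"))]

print(users)
```

Whether this reads better than a named function is a matter of taste; once the nesting gets any deeper, the function is usually the clearer choice.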

A list comprehension thus gives you power similar to an SQL SELECT query — except that you’re not querying data in a table, but rather any object that conforms to Python’s iteration protocol, which includes a very  large number of built-in and custom-made objects.

Now, when would you want to use a list comprehension? And how does it differ from a for loop?

Using a list comprehension is appropriate whenever you want to transform data. That is, you have an iterable data source, and you want to create a new list whose elements are based on those of the data source. For  example, let’s assume that (for some reason) I want to find out how many times each character is used in /etc/passwd.  I can thus do the following, using collections.Counter:

from collections import Counter
counts = [Counter(line)
          for line in open('/etc/passwd')
          if not line.startswith("#")]

We know that “counts” is a list, because I used a list comprehension to create it. It is a list containing many Counter objects, one for each non-comment line in /etc/passwd. What if I want to find out what the most  popular character is in each line? I can modify my expression, asking the Counter object for the most common character:

counts = [Counter(line).most_common(1)
          for line in open('/etc/passwd')
          if not line.startswith("#")]

I can extend my expression even more, to get the most popular character from each line (inside of a two-element tuple in a one-element list):

counts = [Counter(line).most_common(1)[0][0]
          for line in open('/etc/passwd')
          if not line.startswith("#")]

And now I can find out how many times each most-popular character appears:

Counter([Counter(line).most_common(1)[0][0]
          for line in open('/etc/passwd')
          if not line.startswith("#")])

On my computer, the answer is:

Counter({':': 71, 'e': 4, 's': 1})

Meaning that in 71 non-comment lines, “:” is the most common, but in 4 lines it’s “e”, and in one line it’s “s”.  Now, could I have done this with a for loop?  Yes, of course — but because I’m dealing with iterables, and  because I’m using objects that work with such iterables, I can chain them together to get an answer in a way that doesn’t require me to tell Python how to do its job. I’m doing things like our accountant did, back at the  start of this article — I’m saying what I want, and letting Python do the hard work of dealing with this for me.
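The same chain can be tried on a handful of lines held in memory. This is only a sketch: “sample_lines” is invented for the illustration, and its lines are deliberately short so that the most common character in each is unambiguous.

```python
from collections import Counter

# Made-up lines standing in for a file; the first is a comment.
sample_lines = [
    "# comment\n",
    "a:b:c:d\n",   # ':' appears three times
    "x:y:z:w\n",   # ':' appears three times
    "eee:f\n",     # 'e' appears three times
]

# Inner Counter: most common character per line.
# Outer Counter: how often each "winner" occurs across the lines.
most_popular = Counter([Counter(line).most_common(1)[0][0]
                        for line in sample_lines
                        if not line.startswith("#")])

print(most_popular)
```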

When would I use a for loop, then? The distinction is between whether you want to get a list back, and whether you want to execute a command a number of times.  If you want to build a list, and if it’s built on an iterable that already exists, then I’d say a list comprehension is almost certainly going to be the best bet.  But if you want to execute something a number of times without creating a list, then a comprehension is a bad way to do it; you should use a “for” loop, instead.

It’s true that list comprehensions are often faster than equivalent for loops that build a list with append. But most of the time, for loops are used for different things than list comprehensions. “for” loops shouldn’t be used when you want to turn one iterable structure into another; that’s for comprehensions. And you shouldn’t execute something (e.g., print) many times via a list comprehension, even if you can do so via a called function.  I’ve found that the dividing line between when to use a “for” loop, and when to use a comprehension, is clearly delineated in the minds of experienced Python developers, but very hazy among newcomers to the language, and to these ideas.
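To make the distinction concrete, here is a small sketch (the names are illustrative). Calling print inside a comprehension does run print once per element, but it also builds a list of print’s return values, which are all None and which nobody wants.

```python
words = ['this', 'is', 'a', 'test']

# Misuse: the comprehension runs print for its side effect,
# and builds a throwaway list of None values along the way.
throwaway = [print(word) for word in words]

# Better: a "for" loop for executing something repeatedly...
for word in words:
    print(word)

# ...and a comprehension for building a new list from an iterable.
lengths = [len(word) for word in words]
```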

So, to summarize:

  • If you want to execute a command numerous times, use a “for” loop.
  • If you have an iterable, and want to create a new iterable, then a list comprehension is probably your best bet.
  • Building a list comprehension is sort of like working in Excel: You start with a set of data, and you create a new set of data. Any expression can be used to map from one to the other.  You don’t care about how Python does things behind the scenes; you just want to get your new data back.
  • A list comprehension consists of either two or three parts, which are often easier to understand if you put them on separate lines: (1) the expression, (2) the data source, and (3) an optional “if” clause.
  • These three lines are analogous to SQL’s SELECT, FROM, and WHERE clauses in a query.  And just as each of those (SELECT, FROM, and WHERE) can use arbitrary expressions, so too can Python’s list comprehensions use arbitrary expressions. A list comprehension will always return a list, though — just as a SELECT will always return a table-like result set.
  • Do you want to create a set, or perhaps a dictionary, rather than a list?  Then you can use a set comprehension or a dict comprehension. The idea is the same as everything I’ve said about list comprehensions, except that your result will be a single set or a single dictionary.
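As a brief sketch of that last point, here are a dict comprehension and a set comprehension over the same kind of data. The “sample_lines” list is a made-up stand-in for /etc/passwd, and the variable names are illustrative.

```python
# Hypothetical stand-in for the lines of /etc/passwd.
sample_lines = [
    "# User Database\n",
    "nobody:*:-2:-2::0:0:Unprivileged User:/var/empty:/usr/bin/false\n",
    "root:*:0:0::0:0:System Administrator:/var/root:/bin/sh\n",
    "daemon:*:1:1::0:0:System Services:/var/root:/usr/bin/false\n",
]

# Dict comprehension: one dictionary mapping each username to its user ID.
ids_by_name = {line.split(":")[0]: line.split(":")[2]
               for line in sample_lines
               if not line.startswith("#")}

# Set comprehension: the distinct login shells (the last field of each line).
shells = {line.strip().split(":")[-1]
          for line in sample_lines
          if not line.startswith("#")}

print(ids_by_name)
print(shells)
```

The shape is the same as before (expression, data source, optional condition); only the brackets, and the key-value colon in the dict case, change.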

Do you find it difficult to work with list comprehensions?  If so, what’s hard for you about them?  And does the above help to make their use, and their syntax easier to remember?  I’m eager to hear your reactions, so that I can improve these explanations even further.

Why you should almost never use “is” in Python

It’s really tempting, when you first start to use Python, to use “is” rather than “==”.  It’s a bit more readable, and it feels like it should just work, especially when you’re dealing with integers. In a language that uses “or” and “and” instead of “||” and “&&”, it seems logical to use “is” instead of “==”. And if you try “is” with small integers, or even with short strings, you might be lulled into thinking that you should use “is” in lots of places.

But you shouldn’t.  Really, in almost no case, should you use “is”; rather, you should almost certainly use “==”.  In fact, there’s only one case in which most Python programmers should be using “is”, and that’s to check to see if something is None.

In this blog post, which is the result of many questions and discussions I’ve had with students in my Python classes, I’m going to try to describe the reasons for this — and along the way, describe some parts of how Python’s objects are allocated, and what we mean when we say that two objects are “the same.”

Let’s start with the basics: Everything in Python is an object. Every object in Python has a unique ID number, which we can retrieve from an object by using the built-in “id” function:

>>> id(5)
140236457829784

>>> id('abc')
4503718088

>>> id([1,2,3])
4504494160

Now, if two variables are pointing to the same object, they will (not surprisingly) return the same ID:

>>> x = [1,2,3]
>>> y = x
>>> id(x)
4504494160
>>> id(y)
4504494160

Given that x and y point to the same list, changes to the list will be reflected in both variables:

>>> x[0] = '!'
>>> y[1] = '?'
>>> x
['!', '?', 3]
>>> y
['!', '?', 3]

In such a case, it’s pretty clear that x and y are both pointing to precisely the same object. They aren’t just equal in value; they are one and the same — aliases for one another.

We can ask Python if this is true by using the “is” operator, also known as the “identity operator.” “is” doesn’t compare the values of x and y. Rather, it checks to see if x and y have the same ID. If so, then they are the same object. If not, then they aren’t. It’s as simple as that. Perhaps it goes without saying, but two objects that “is” each other are also “==” to each other, since an object’s value should be equal to itself:

>>> x == y
True

>>> x is y
True

>>> id(x) == id(y)
True

The above code shows that x and y have the same ID. This means that they “is” each other; we’re dealing with two names for the same object. Their values are thus equal, which is what “==” checks.

Again: The “is” operator returns “True” if two names are referring to the same object. And the “==” operator returns “True” if two names point to objects that contain the same value.
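The flip side is worth seeing as well. In this minimal sketch (the names are illustrative), two separately created lists hold the same values, so they are “==” to one another, but they are not the same object:

```python
a = [1, 2, 3]
b = [1, 2, 3]   # a separate list that happens to hold the same values
c = a           # a second name for the very same list

print(a == b)   # True: same value
print(a is b)   # False: two distinct objects
print(a is c)   # True: two names, one object
```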

The most common usage of “is”, by far, is when we want to know if something is None. True, we could use “==”. But in both readability and speed, “is None” trumps “== None”. So your code should generally say:

if x is None:
    print("x is None!")

It shouldn’t surprise us to find out that “is” is faster than “==”. After all, “is” is implemented in C, and is a simple comparison of the IDs of the two objects. No function call is needed, and we certainly don’t need to compare the values of the two objects, which can also take some time.

The use of “is None” works because the None object is a singleton in Python. No matter what you do, id(None) will always return the same value. (Note that this value won’t stay constant across different invocations of Python.)  In other words:

>>> id(None)
4315260920

>>> id(None)
4315260920

>>> x = None
>>> id(x)
4315260920

What happens if you try to create a new instance of None? Well, we would first have to find out None’s type:

>>> type(None)
<type 'NoneType'>

Unfortunately, NoneType isn’t a defined identifier in Python:

>>> NoneType
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'NoneType' is not defined

So if we want to create a new instance of None, we’ll need to do it ourselves:

>>> type(None)()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: cannot create 'NoneType' instances

Aha.  Well, that’s a shame. But I was using Python 2.7 in the above example. What if I try Python 3?

>>> type(None)
<class 'NoneType'>

>>> NoneType()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'NoneType' is not defined

>>> x = type(None)()
>>> type(x)
<class 'NoneType'>
>>> x is None
True

So no matter how you slice it, None is a singleton. Which is why you can (and should) use “is None”, rather than “== None”, in your code.
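There is also a correctness argument for “is None”, sketched below with a hypothetical class (“AlwaysEqual” is invented for this illustration): since any class can redefine “==”, a comparison with “== None” can be fooled, while “is None” cannot.

```python
class AlwaysEqual(object):
    """Hypothetical class that claims to be equal to everything."""
    def __eq__(self, other):
        return True

obj = AlwaysEqual()

equal_to_none = (obj == None)   # True, because __eq__ says so
is_none = (obj is None)         # False: obj is not the None singleton

print(equal_to_none, is_none)
```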

But what happens if you decide that you want to use “is” in other places? The problem is that it will sometimes work. That “sometimes” is because “is” exposes some of Python’s internal optimizations in ways that can be a bit surprising.

Strings are how I was initially introduced to the difference between “==” and “is”, and the danger of using “is” over-zealously. Two equal strings should be “==”, but are they “is”?

>>> x = 'a' * 5
>>> y = 'a' * 5
>>> x == y
True
>>> x is y
True

Well, that’s interesting — and I got the same result in Python 2.7, 3.4, and also in PyPy. But why should this be the case? One possibility is that strings are immutable, and that having Python use a single object for each string that we create, would be efficient. And indeed, this is true — so long as the string is short:

>>>> x = 'a' * 5000
>>>> y = 'a' * 5000
>>>> x == y
True
>>>> x is y
False

The above, which works the same in Python 2.7, 3.4, and in PyPy, demonstrates that Python won’t reuse just any string that we have created. There is a limit.  I experimented with things a bit, and I found that 21 is the magic length at which strings are no longer “is” to one another. That is:

>>> x = 'a' * 20
>>> y = 'a' * 20
>>> x is y
True

>>> x = 'a' * 21
>>> y = 'a' * 21
>>> x is y
False

The above was true in Python 2.7 and 3.4, and also in PyPy. However, I also found some seemingly weird behavior, which is undoubtedly because of the way in which Python byte-compiles and then executes for loops:

>>> for i in range(15,25):
...     x = 'a' * i
...     y = 'a' * i
...     print("[{0}] x is y: {1}".format(i, x is y))
...
[15] x is y: False
[16] x is y: False
[17] x is y: False
[18] x is y: False
[19] x is y: False
[20] x is y: False
[21] x is y: False
[22] x is y: False
[23] x is y: False
[24] x is y: False

Wow, that’s kind of strange, no? Indeed, in a for loop, I found that the only number for which the two strings were “is” to one another was 1:

>>> for i in range(0,10):
...     x = 'a' * i
...     y = 'a' * i
...     print("[{0}] x is y: {1}".format(i, x is y))
...
[0] x is y: False
[1] x is y: True
[2] x is y: False
[3] x is y: False
[4] x is y: False
[5] x is y: False
[6] x is y: False
[7] x is y: False
[8] x is y: False
[9] x is y: False

At the same time, if you create a long literal string and assign it to a variable, you’ll likely find that the strings are “is” to one another:

>>> x = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'

>>> y = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'

>>> x is y
True

(Forgive the re-formatting that WordPress did to the above assignments; in Python, they were both on one line.)

I’m not sure what is going on here, but it just goes to show that you really shouldn’t use “is” unless you know what you’re doing.  And even if you think that you know what you’re doing, you might still be surprised!  Bottom line: Using “is” on strings is almost always a bad idea.

Now, this is generally something that we don’t need to think or care about very much. But let’s say that you’re working with large strings, and that these strings might repeat themselves on occasion. In such a case, you will end up with many copies of the same string. Python helps us to solve this problem by “interning” strings. Interning is a technique that has been around for many years in the programming world, which allows us to store only one copy of any given string. In Python 2, we use the built-in “intern” function. In Python 3, we must use sys.intern; intern is no longer a builtin.

“intern” takes a string (and only a string) as a parameter. It returns a reference — either to a new string that was created, or to a string that was already allocated. Thus, the length of the string doesn’t matter; even in the case of a long string, it will only be allocated a single time:

>>> from sys import intern     # Python 3 only
>>> x = intern('a' * 5000)
>>> y = intern('a' * 5000)
>>> x is y
True

As you can see, using “intern” guarantees that every unique string is allocated only once. If you use “intern” on the same string a second time, Python returns a reference to the first string.

Python uses “intern” internally for a variety of purposes.  If you’re working with long strings that repeat themselves, then it might be worth using intern. But for the most part, Python creates and allocates so many objects that a few strings here and there are probably not going to make a difference.    Certainly, you should only use “intern” once you have identified bottlenecks.

You might think that even if strings are allocated multiple times, and are thus not “is” to one another, at least integers are going to be identical. After all, Python wouldn’t allocate new objects for numbers, would it?

We can test this pretty easily, of course:

>>> x = 200
>>> y = 200
>>> x is y
True

Well, that’s encouraging, right?  Let’s try something bigger:

>>> x = 2000
>>> y = 2000
>>> x is y
False

So yes, it turns out that even integers that are equal aren’t necessarily pointing to the same object.   As Amy Hanlon pointed out in her fantastic talk about Python “wats”, this is because Python pre-allocates a number of small integers (in CPython, -5 through 256). If your integers are within that range, then they will use the same object, and be “is” to one another. But if you’re outside of that range, then you’ll have two separate objects. Unless, of course, you allocate them in the same line of code:

>>> x = 2000; y = 2000
>>> x is y
True

Have I mentioned that you really shouldn’t use “is” to compare objects except for None? I hope that you’re increasingly convinced.
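Whatever the interpreter decides to cache, “==” gives the answer we actually mean. The sketch below builds the integers at run time, so that nothing depends on how literals are compiled; the result of “is” here is an implementation detail, so the code deliberately makes no claim about it.

```python
# Build the integers at run time, rather than from literals in the source.
x = int("2000")
y = int("2000")

same_value = (x == y)    # always True: the values are equal
same_object = (x is y)   # implementation detail; do not rely on it

print(same_value)
```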

I’ll close this post with a bit of mischief: In theory, if two objects are “is”, then they’re pointing to the same object, which means that they should be identical to one another, and thus also give us a True response to “==”.  While Python doesn’t allow us to redefine “is”, we can redefine what an object says when we compare it using “==”:

>>> class Foo(object):
...     def __eq__(self, other):
...         return False
...
>>> f1 = Foo()
>>> f2 = f1
>>> f1 is f2
True
>>> f1 == f2
False

I cannot think of a situation in which this would be a desirable thing to do. But it’s fun, and allows us to sharpen our understanding of the difference between “==” and “is”.

If you liked this explanation, then you’ll likely also enjoy my ebook, “Practice Makes Python,” with 50 exercises meant to improve your Python fluency.

Free Webinar on June 23rd: Introduction to Regular Expressions

If you’re a programmer, then you have likely heard about regular expressions (“regexps”) before. However, it’s also likely that you have tried to learn them, and have found them to be completely confusing. That’s not unusual; while regular expressions provide us with a powerful tool for analyzing text, their terse, dense, and cryptic syntax can make the effort not seem worthwhile.

On June 23rd, I’m going to be offering a one-hour free Webinar introducing regular expressions, showing how they can make your code more powerful and expressive.

While I’ll mostly be using Python, I’ll also show some other languages and platforms (e.g., Ruby, JavaScript, and the Unix “grep” command).

My demo and discussion will be about an hour long, and will be followed by ample time for Q&A.  My previous Webinars have been lots of fun; I hope that you’ll join in!  You can get (free) tickets at EventBrite.

And hey, if you’re an independent consultant, you can get a double dose of me on that same day; we Freelancers Show panelists will be doing our monthly Q&A just beforehand.  Come and get your questions about consulting answered by our panel of experts!

I look forward to seeing you at one or both of these events!  If you have any questions, you can e-mail me or contact me on Twitter as @reuvenmlerner.

Free one-hour Webinar about Python’s magic methods on May 6th

I’ll be giving another free one-hour Webinar about Python — this time, about the magic methods that Python offers developers who want to change the ways in which their objects work.

We’ll start off with some of the simplest magic methods, but will quickly move onto some of the more interesting and advanced ones — affecting the way that our objects are compared, formatted, hashed, pickled, and sauteed.  (OK, maybe not sauteed.)  Some familiarity with Python objects is expected, but not too much advanced knowledge is necessary.

Register now at EventBrite; if you have any questions, please contact me at reuven@lerner.co.il, or as @reuvenmlerner on Twitter.  It should be a lot of fun; I hope to see you there!

Is it hashable? Fun and games with hashing in Python

One of the basic data types that Python developers learn to use, and to appreciate, is the dictionary, or “dict.” This is the Python term for what other languages call hashes, associative arrays, hashmaps, or hash tables. Dictionaries are pervasive in Python, both in the programs that we write, and in the implementation of the language; every namespace or object has at least one dictionary working behind the scenes.

Dictionaries are fairly easy to use, once you get used to the rules of the road:

  1. A dictionary contains pairs, not individual elements. Each pair has two elements, a “key” and a “value.” So given a dictionary d, len(d) will return the number of pairs, not the number of individual elements.
  2. You can think of the key as a sort of index. Just as we use numeric indexes to retrieve elements of a string, list, or tuple, we use a dict’s keys to retrieve its values.
  3. The retrieval is one-way. You can get a value via its key, but you cannot get a key via its value.
  4. The retrieval takes constant time on average, aka O(1). You can use the “in” operator to find out if a key exists in a dictionary. If you try to retrieve the value for a key that doesn’t exist, you’ll get a KeyError exception.
  5. The key must be hashable, and (if a container, such as a tuple) may only contain other hashable objects.
  6. The values may be any Python types or sizes. You can have a dict of strings, but also a dict of lists, tuples, dicts, modules, or any other objects.
  7. The keys of a dictionary are unique. If you assign d[‘a’]=1 to the dict “d”, the key “a” now exists, with a value of 1. If the key “a” already existed, then its previous value is lost.
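  8. The key-value pairs in a dictionary are not ordered in any meaningful way. Do not depend on the order of the pairs in a dictionary. (This was true of the Python versions used in this post; since Python 3.7, dicts are guaranteed to preserve insertion order.)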
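
Several of these rules are easy to see in a short session. The following is only an illustrative sketch; the dict and its keys are made up for the purpose:

```python
# A small, made-up dict to illustrate the rules above
d = {'a': 1, 'b': 2, 'c': 3}

print(len(d))        # number of pairs, not individual elements: 3
print(d['b'])        # retrieve a value via its key: 2
print('a' in d)      # "in" checks keys: True
print(1 in d)        # ... and not values: False

d['a'] = 100         # reusing a key replaces the old value
print(d['a'])        # 100
print(len(d))        # still 3

try:
    d['z']           # no such key
except KeyError as e:
    print('KeyError:', e)
```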

To anyone familiar with dicts, or with hash tables in other languages, most of the above rules make a great deal of sense. Indeed, most of them follow naturally from the implementation of dicts: When you store d[‘a’] = 1, the dict “d” takes the key “a” and invokes the hash function on it. The result of the hash function is a number, which indicates where in the hash table the key-value pair should be stored. This is the key (no pun intended) advantage of a dictionary, and the secret of its lookup speed: The result of applying the “hash” function on our key determines where the key-value pair will be stored. Python can then jump to that location in memory, and retrieve the value associated with the key.

This also explains why you can use keys to retrieve values, but not the reverse: The location of a value in memory depends completely on its key. Moreover, while keys must be unique, values don’t have to be.

Furthermore, this explains why pairs in a dict don’t seem to be ordered in any predictable way; their order is determined by the hash function, which is deliberately designed to provide hard-to-predict results.

For example, I can create a simple dictionary:

>>> d = {'a':1, 'b':2, 'c':3}
>>> d
{'a': 1, 'c': 3, 'b': 2}

As you can see, the printed representation of our dictionary shows the keys in the order ‘a’, ‘c’, and ‘b’, rather than the order in which they were inserted. Assigning to the dictionary either replaces an existing pair (if I reuse a key) or adds a new pair:

>>> d['a'] = 100
>>> d
{'a': 100, 'c': 3, 'b': 2}
>>> d['z'] = [1,2,3]
>>> d
{'a': 100, 'c': 3, 'b': 2, 'z': [1, 2, 3]}
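
To make the earlier description of hash-based storage more concrete, here is a toy hash table. It is a deliberately simplified sketch and not how CPython’s dict is really implemented; the class name “ToyDict”, the fixed number of buckets, and the list-of-pairs layout are all assumptions made for illustration. The point is only that hash(key) decides where a pair lives, which is why lookups go from key to value and not the other way around:

```python
class ToyDict(object):
    """A deliberately simplified hash table; NOT how CPython's dict really works."""

    def __init__(self, size=8):
        # Each slot ("bucket") holds a list of (key, value) pairs
        self.buckets = [[] for _ in range(size)]

    def _bucket_for(self, key):
        # hash(key) is a number; use it to pick a bucket
        return self.buckets[hash(key) % len(self.buckets)]

    def __setitem__(self, key, value):
        bucket = self._bucket_for(key)
        for index, (existing_key, _) in enumerate(bucket):
            if existing_key == key:          # key already present: replace
                bucket[index] = (key, value)
                return
        bucket.append((key, value))          # new key: add a pair

    def __getitem__(self, key):
        for existing_key, value in self._bucket_for(key):
            if existing_key == key:
                return value
        raise KeyError(key)


t = ToyDict()
t['a'] = 1
t['b'] = 2
t['a'] = 100       # replaces the earlier value
print(t['a'])      # 100
print(t['b'])      # 2
```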

Almost all of this matches the rules for dict-like structures in other languages — except for rule #5, the requirement that the keys be hashable. (Or if we’re dealing with container objects, that the contained elements be hashable.) It’s reasonable to ask why this is forbidden. There aren’t a lot of times when I would like to use a list, set, or dict as a dictionary key, but it does happen. Why does Python prevent me from doing so?

The answer has to do with predictability: If I could use a list as my dictionary key, then there would be the chance of the list changing after storing it. In such a case, the list’s current hash value will be different than its previous hash value — meaning that the list will be located somewhere other than where it should be. In such a case, the key-value pair would be “lost” inside of the dict.

This can actually happen in Ruby, which doesn’t restrict the data types which can be used as keys. For example:

myarray = [1,2,3]    # create a Ruby Array 
h = {myarray => 1}   # use the array as a key

h[myarray]           # What value is associated with myarray?
   => 1              # We get 1 back, as expected! Yay!

Ruby stores our key-value pair inside of the hash, its equivalent of a dict. However, I can modify the array that is being used as a hash key:

myarray << 4        # append 4 to myarray

    => [ 1, 2, 3, 4 ]

When we stored the key-value pair in h, the Array had three elements. Now it has four, thus giving it a new hash value. After modifying myarray, we can ask Ruby to retrieve the value associated with it in h:

h[myarray]         # Get the value for key "myarray"
     => nil        # nil means non-existent key

In other words, the hash still has the key “myarray”, but the key-value pair is stored in the location determined by myarray.hash when we first stored it, not in its current incarnation.

Ruby’s solution to this problem is to provide a “rehash” method, which tells a hash to go through its contents, and recalculate the keys’ hash values and locations. Once you do this, the data returns:

h.rehash  # recalculate positions
    =>  { [ 1, 2, 3, 4 ] => 1 }

h[myarray]
    => 1

We thus see that in Ruby, we’re allowed to use mutable data structures as keys. The advantage is that we’re not limited, but the disadvantage is that keys might get changed, and thus provide incorrect search results.

Python, many years ago, solved this problem a different way: Instead of allowing us complete flexibility in our hash keys, Python restricts us to (largely) immutable ones. Thus, we can use None, True, False, integers, floats, strings, and tuples, although ints and strings are the most common, in my experience. If we try to store a key-value pair in a dictionary, Python checks to make sure that the key is a hashable type:

>>> mylist = [1,2,3]

>>> d = {mylist:1}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

The “unhashable” error message that we get isn’t from the assignment to d, but rather from the call to hash() that Python makes on “mylist”. We can see this if we try to invoke the hash function directly:

>>> hash(mylist)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

It’s not really true, as we’ve seen, that lists are inherently unhashable. Rather, Python decided long ago that it would refuse to hash anything whose value might be subject to change, to avoid elements getting lost.
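
A quick way to explore which objects Python is willing to hash is to try it and catch the TypeError. The “is_hashable” helper below is a hypothetical name of my own, not something from the standard library; it also shows the earlier rule that a container such as a tuple is only hashable if everything inside it is:

```python
def is_hashable(obj):
    """Hypothetical helper: report whether hash(obj) succeeds."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable('abc'))            # True
print(is_hashable((1, 2, 3)))        # True: a tuple of hashable items
print(is_hashable([1, 2, 3]))        # False: lists are mutable
print(is_hashable((1, 2, [3])))      # False: the tuple contains a list
print(is_hashable(frozenset([1])))   # True: the immutable cousin of set
```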

The hash function in Python has been described before, in more detail (and with greater knowledge) than I could provide. There are two aspects to Python’s hash function, which I’m not in a position to criticize, but which do seem strange to me:

  1. Hash functions, by their very nature, are supposed to be one-way, deterministic, but fairly unpredictable. That is, if I know the output of hash(‘a’), I shouldn’t be able to easily know what hash(‘b’) or hash(‘c’) is. But hash(‘a’) should always return the same value. In Python, this is the case for strings and tuples. But hash(), when handed an int, returns that int. Thus, hash(1) is 1, hash(100) is 100, and hash(255) is 255. This strikes me as a bit strange, and seems to violate one of the basic rules of hashing. I can only conclude that either I don’t know much about hash functions (which is quite possible), that Python doesn’t expect us to use many integers as dictionary keys, or that it just doesn’t matter that much.
  2. In CPython, the hash function reserves -1 as an internal error indicator, so no object may have -1 as its hash value. Thus, the hash values of both -1 and -2 are -2.
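
Both observations can be checked directly. The values in the comments reflect CPython; other implementations are free to behave differently. Note that two keys sharing a hash value are still distinct keys, because the dict also compares them for equality:

```python
# These results reflect CPython; other implementations may differ.
print(hash(1))      # 1
print(hash(100))    # 100
print(hash(255))    # 255
print(hash(-1))     # -2, because -1 is reserved
print(hash(-2))     # -2

# Equal hash values don't make the keys equal; both still work as keys
d = {-1: 'minus one', -2: 'minus two'}
print(d[-1])        # minus one
print(d[-2])        # minus two
```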

The result of a hash function doesn’t need to be unique, but it does need to evenly distribute the results, such that we’ll minimize collisions. That is, it’s possible that hash(‘a’) and hash(‘b’) will return the same value — but it should be hard to figure out which values will give us the same results. If, by some chance, all of your keys have the same hash value, then you end up with a “collision.” This is invisible to the user of the dict, except that the lookups suddenly become much slower. Imagine a dict in which 100 keys all have the same hash value; our lookup speed suddenly becomes O(n), like a list, rather than O(1), which is theoretically possible in a dict.
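
One way to see the cost of collisions, without relying on timing, is to count how many equality comparisons a single lookup needs. The sketch below uses a hypothetical “Key” class whose hash is either constant (so every key collides) or based on its number. The exact counts are an implementation detail, so I only compare the two cases:

```python
class Key(object):
    """Hypothetical key class that counts how often __eq__ is called."""
    eq_calls = 0

    def __init__(self, n, constant_hash):
        self.n = n
        self.constant_hash = constant_hash

    def __hash__(self):
        # Either every key collides (always 1), or each gets its own hash
        return 1 if self.constant_hash else hash(self.n)

    def __eq__(self, other):
        Key.eq_calls += 1
        return self.n == other.n


def comparisons_for_lookup(constant_hash, size=100):
    keys = [Key(n, constant_hash) for n in range(size)]
    d = {key: key.n for key in keys}
    Key.eq_calls = 0            # only count the lookup itself
    assert d[keys[-1]] == size - 1
    return Key.eq_calls


print(comparisons_for_lookup(constant_hash=False))  # few or no comparisons
print(comparisons_for_lookup(constant_hash=True))   # many comparisons
```

The lookup still returns the right value in both cases. With colliding keys, the dict just has to check one candidate after another, which is the O(n) behavior described above.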

This apparently became an issue several years ago, when there were some attacks against Web sites running Python. Web applications often use dicts to pass incoming parameters, which means that if you choose your keys cleverly enough, you can cause a massive slowdown on a site, in a denial-of-service attack.

The solution is to add a random seed to the hashing algorithm. In Python 2.7, this isn’t enabled by default, but can easily be turned on by invoking Python with the -R command-line parameter. In Python 3.3 and later, it is enabled by default, and can be controlled with the PYTHONHASHSEED environment variable.

Thus, in Python 2.7:

$ python -R
>>> hash('a')
-5027793331667802690
>>> hash('b')
-5027793332354350531
>>>

$ python -R
>>> hash('a')
-4154372447873558006
>>> hash('b')
-4154372448337085303
>>>

Notice how, thanks to the -R parameter, we force Python to re-seed its hash function, thus reducing the chance of a successful attack.

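Python 3 took this a step further: As of Python 3.3, randomization is on by default, and the PYTHONHASHSEED environment variable controls it. If you set PYTHONHASHSEED to “random” (or leave it unset), then it behaves like Python 2 with -R, above. But if you set PYTHONHASHSEED to a numeric value, then the hash function is seeded with the number you provide, and setting it to 0 disables randomization. This makes it possible to get repeatable results when testing your code, while still enjoying the extra security that randomized hashes provide in production.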
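
Because the seed is chosen when the interpreter starts, the effect is easiest to see by launching fresh Python processes. Here is a small sketch that does so with the standard “subprocess” module; the helper name “hash_in_new_process” and the seed value 42 are arbitrary choices of mine:

```python
import os
import subprocess
import sys

def hash_in_new_process(seed):
    """Run a fresh Python with PYTHONHASHSEED=seed; return hash('a') there."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    output = subprocess.check_output(
        [sys.executable, '-c', "print(hash('a'))"], env=env)
    return int(output.decode().strip())

# The same numeric seed gives the same hash value in every process
print(hash_in_new_process(42) == hash_in_new_process(42))   # True

# With "random", each process picks its own seed, so values normally differ
print(hash_in_new_process('random'))
print(hash_in_new_process('random'))
```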

Now, you would think from everything I wrote above that if I write my own class, it won’t be hashable. But it turns out that this is not the case:

>>> class Foo(object):
        pass

>>> f = Foo()
>>> hash(f)
273483861

According to the Python documentation, user-defined classes are hashable by default; the hash value of such an object depends on the object’s unique ID number, which we can get via the built-in “id” function. This means that the hash value of a user-defined object won’t change, regardless of any changes you might make to its attributes.
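
A short check of that last claim, using the same kind of empty class as above: the attributes change, but the hash value (and thus our ability to find the object in a dict) does not.

```python
class Foo(object):
    pass

f = Foo()
before = hash(f)

f.x = 'abc'                  # change the object's attributes...
f.y = [1, 2, 3]
print(hash(f) == before)     # ... and the hash stays the same: True

d = {f: 1}
f.x = 'xyz'
print(d[f])                  # still retrievable: 1
```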

But let’s say that I want to have the hash reflect the attributes. According to the Python documentation, this means that I should define both the __hash__ method (which the built-in “hash” function will call on our object, and which must return an integer) and the __eq__ method, to check if two things are equal (since two equal objects should have equal hashes, too). I’m going to define a simple class, along with a __hash__ method (leaving out __eq__ for now, to keep the example short):

>>> class Foo(object):
        def __init__(self, x):
            self.x = x
        def __hash__(self):
            return hash(self.x)

>>> f = Foo('a')
>>> hash(f)
12416037344
>>> hash('a')
12416037344

As the above code demonstrates, our object now returns the hash value of whatever is set on its “x” attribute. There is no difference between invoking hash(‘a’) and hash(f), assuming that f.x is ‘a’.

So now let’s put our object in a dictionary:

>>> d = {f:1}
>>> d[f]
1

So far so good, right? But now let’s be a bit evil, and change the value of f.x:

>>> f.x = 'abc'
>>> d[f]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: <__main__.Foo object at 0x109987590>

What happened? Well, we forced Python to be like the Ruby example that I provided earlier: When we created d, and used f as a key, Python stored the pair {f:1} based on the value of hash(f). But then we changed the value of f.x, which means that the value of hash(f) has changed. And that means that now, when we invoke d[f], Python complains that there is no such key. The only way to get this key back is to use d.keys() or d.items(), which will return the key. But our ability to retrieve our value via the key, or even to check if our key exists, is now gone:

>>> f in d
False
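
If you do want the hash to reflect an object’s attributes, one way to avoid losing keys like this is to define __eq__ alongside __hash__, and to keep the attributes involved from being reassigned. The following is a minimal sketch of that idea, not the only way to do it; the “Point” class and its read-only properties are my own illustration:

```python
class Point(object):
    """Hypothetical example: hash and equality both derive from (x, y)."""

    def __init__(self, x, y):
        self._x = x
        self._y = y

    # Read-only properties: no setters, so p.x = 5 raises AttributeError
    @property
    def x(self):
        return self._x

    @property
    def y(self):
        return self._y

    def __eq__(self, other):
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        return hash((self.x, self.y))


p = Point(1, 2)
d = {p: 'here'}
print(d[Point(1, 2)])    # an equal object finds the same pair: here

try:
    p.x = 100            # attempts to change the key are rejected
except AttributeError:
    print('x is read-only')
```

A determined programmer can still reach in and change p._x, of course. The leading underscore is only a convention, but it makes accidents much less likely.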

We can have even more fun, by ensuring that the value of __hash__ changes every time we invoke it:

>>> import random
>>> class LoseMe(object):
        def __hash__(self):
            return random.randint(1,1000)

>>> x = LoseMe()
>>> hash(x)
374
>>> hash(x)
50

Now, the odds are pretty good that when I stick this object into a dictionary, I won’t be able to get it back:

>>> x = LoseMe()
>>> d = {x: 1}
>>> x in d
False
>>> x in d
False

But of course:

>>> list(d.keys())
[<__main__.LoseMe object at 0x109997a10>]

Should you ever define __hash__ to return a random value? Almost certainly not. And yet, I’d like to think that knowing how to do such a thing is both interesting and provides insights into how Python implements one of its core features.

 

A quick introduction to implementing Python iterators

When you put a piece of Python data into a “for” loop, the loop doesn’t execute on the data itself.  Rather, it executes on the data’s “iterator.”  An iterator is an object that knows how to behave inside a loop.

Let’s take that apart.  First, let’s assume that I say:

for letter in 'abc':
    print(letter)

I’m not really iterating over ‘abc’.  Rather, I’m iterating over the iterator object that I got from ‘abc’.  That is invisible and behind the scenes, but it happens all the same.  We can get the iterator of any object with the iter() function:

>>> s = 'abc'

>>> iter(s)
<iterator at 0x10a47f150>

>>> iter(s)
<iterator at 0x10a47f190>

>>> iter(s)
<iterator at 0x10a47f050>

Notice that each time we invoke iter(s), we get back a new and different object.  (We can tell, because there is a different address in memory for each one.)  That’s because each iterator is used only once.  Once you get to the end of an iterator object, the object is thrown out, and you need to get a new one.

OK, so what can we do with this iterator object?  Why do we care about it so much?  Because we can invoke the next() function on it.  Each time we do so, we’re basically telling the object that we want to get the next piece of data that it’s providing:

>>> i = iter(s)

>>> next(i)
'a'

>>> next(i)
'b'

>>> next(i)
'c'

So far, so good: Each time we invoke next(i), we ask our iterator object (i) to give us the next element.  But there are only three elements in s, which raises the question of what we’ll get when we invoke next() another time:

>>> next(i)
StopIteration

In other words, Python raises an exception (StopIteration) when we get to the end.  We can now invoke next(i) as many times as we want; we’ll always get StopIteration, which indicates that there is nothing more to get.

You can thus think of a “for” loop as a “while” loop that catches the StopIteration exception, and then leaves the loop when it happens. Consider this function:

def myfor(data):
    i = iter(data)
    while True:
        try:
            print(next(i))
        except StopIteration:
            break

Now, this “myfor” function only prints the elements of the sequence it was given, so it’s not really a replacement for the “for” loop.  But it’s not a bad way to begin to understand how these things work. Our function starts off by getting an iterator for our data.  It then assumes that we are going to iterate forever on the object, using the “while True” infinite loop. However, we know that when next(i) is done providing elements of data, it will raise StopIteration.  At that point, we’ll catch the exception and break out of the loop, which ends the function.
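
A slightly more general sketch takes the loop “body” as a function, so that it can do something other than print. The extra “action” parameter is my own addition for illustration:

```python
def myfor(data, action):
    """Call action(item) for each item in data, much as a "for" loop would."""
    i = iter(data)
    while True:
        try:
            item = next(i)
        except StopIteration:
            break
        action(item)      # the loop "body"; kept outside the try block

collected = []
myfor('abc', collected.append)
print(collected)          # ['a', 'b', 'c']
```

Calling action(item) outside of the “try” block is a deliberate choice: If the body itself happened to raise StopIteration, we wouldn’t want to mistake that for the end of the data.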

Let’s assume that you want to make instances of your class iterable. This means that when we invoke iter() on an instance of your class, we’ll want to get back an iterator.  Which means that we’ll want to get back an object on which we can invoke next(), and either get the next object or the StopIteration exception.

The easiest way to do this is to define both __iter__ (which is invoked when you run iter() on an object) and __next__ (which is invoked when you run next() on an iterator) within your class object. That is, you’ll define __iter__ to return self, because the object is its own iterator.  And you’ll define __next__ to return the next piece of data in turn, or to raise StopIteration if there is no more data.

Remember that in an iterator, there is no “previous” or “reset” or anything of the sort.  All you can do is move forward, one item at a time, until you get to the end.

So let’s say that I want to define a simple iterator, one that returns the elements of a piece of data.  (Yes, basically what you already get built in by Python.)  We can say:

class MyIter(object):
    def __init__(self, data):
        self.data = data
        self.index = 0
    def __iter__(self):
        return self
    def __next__(self):   # In Python 2, this method is named "next"
        if self.index >= len(self.data):
            raise StopIteration
        value = self.data[self.index]
        self.index += 1
        return value

Now I can say

>>> m = MyIter('abc')
>>> for letter in m:
        print(letter)

and it will work!

You can take any class you want, and make it into an iterator by adding the __iter__ method (which returns self) and the __next__ (or in Python 2, “next”) method.  Once you have done that, instances of your class can be put inside of “for” loops, list comprehensions, or anything else that expects an “iterable” type of data.
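
Here is the same MyIter class used in a list comprehension, along with a reminder that an iterator only moves forward: once it has been exhausted, looping over it again produces nothing. The “next = __next__” line is an addition of mine, so that the sketch also runs under Python 2.

```python
class MyIter(object):
    def __init__(self, data):
        self.data = data
        self.index = 0
    def __iter__(self):
        return self
    def __next__(self):
        if self.index >= len(self.data):
            raise StopIteration
        value = self.data[self.index]
        self.index += 1
        return value
    next = __next__       # Python 2 spelling of the same method

m = MyIter('abc')
print([letter.upper() for letter in m])   # ['A', 'B', 'C']
print(list(m))                            # []: m is already exhausted
print(list(MyIter('abc')))                # a fresh instance starts over
```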

If you don’t use “with”, when does Python close files? The answer is: It depends.

One of the first things that Python programmers learn is that you can easily read through the contents of an open file by iterating over it:

f = open('/etc/passwd')
for line in f:
    print(line)

Note that the above code is possible because our file object “f” is an iterator. In other words, f knows how to behave inside of a loop — or any other iteration context, such as a list comprehension.

Most of the students in my Python courses come from other programming languages, in which they are expected to close a file when they’re done using it. It thus doesn’t surprise me when, soon after I introduce them to files in Python, they ask how we’re expected to close them.

The simplest answer is that we can explicitly close our file by invoking f.close(). Once we have done that, the object continues to exist — but we can no longer read from it, and the object’s printed representation will also indicate that the file has been closed:

>>> f = open('/etc/passwd')
>>> f
<open file '/etc/passwd', mode 'r' at 0x10f023270>
>>> f.read(5)
'##\n# '

>>> f.close()
>>> f
<closed file '/etc/passwd', mode 'r' at 0x10f023270>

>>> f.read(5)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-ef8add6ff846> in <module>()
----> 1 f.read(5)
ValueError: I/O operation on closed file

But here’s the thing: When I’m programming in Python, it’s pretty rare for me to explicitly invoke the “close” method on a file. Moreover, the odds are good that you probably don’t want or need to do so, either.

The preferred, best-practice way of opening files is with the “with” statement, as in the following:

with open('/etc/passwd') as f:
    for line in f:
        print(line)

The “with” statement invokes what Python calls a “context manager” on f. That is, it assigns f to be the new file instance, pointing to the contents of /etc/passwd. Within the block of code opened by “with”, our file is open, and can be read from freely.

However, once Python exits from the “with” block, the file is automatically closed. Trying to read from f after we have exited from the “with” block will result in the same ValueError exception that we saw above. Thus, by using “with”, you avoid the need to explicitly close files. Python does it for you, in a somewhat un-Pythonic way, magically, silently, and behind the scenes.
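
To take some of the magic out of this, here is a sketch of the protocol that “with” relies on: it calls __enter__ at the start of the block and __exit__ at the end. The “Noisy” class is a hypothetical example of mine that just records those calls; the second half shows a real file object being closed in the same way (using a temporary file, so as not to touch anything important).

```python
import os
import tempfile

class Noisy(object):
    """Hypothetical context manager that records when it is entered and exited."""
    def __init__(self):
        self.events = []
    def __enter__(self):
        self.events.append('enter')
        return self
    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append('exit')
        return False      # don't suppress exceptions

with Noisy() as n:
    n.events.append('body')
print(n.events)           # ['enter', 'body', 'exit']

# File objects follow the same protocol; their __exit__ closes the file
path = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(path, 'w') as f:
    f.write('abc\n')
print(f.closed)           # True

try:
    f.write('def\n')
except ValueError as e:
    print('ValueError:', e)
```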

But what if you don’t explicitly close the file? What if you’re a bit lazy, and neither use a “with” block nor invoke f.close()?  When is the file closed?  When should the file be closed?

I ask this, because I have taught Python to many people over the years, and am convinced that trying to teach “with” and/or context managers, while also trying to teach many other topics, is more than students can absorb. While I touch on “with” in my introductory classes, I normally tell them that at this point in their careers, it’s fine to let Python close files, either when the reference count to the file object drops to zero, or when Python exits.

In my free e-mail course about working with Python files, I took a similarly with-less view of things, and didn’t use it in all of my proposed solutions. Several people challenged me, saying that not using “with” is showing people a bad practice, and runs the risk of having data not saved to disk.

I got enough e-mail on the subject to ask myself: When does Python close files, if we don’t explicitly do so ourselves or use a “with” block? That is, if I let the file close automatically, then what can I expect?

My assumption was always that Python closes files when the object’s reference count drops to zero, and thus is garbage collected. This is hard to prove or check when we have opened a file for reading, but it’s trivially easy to check when we open a file for writing. That’s because when you write to a file, the contents aren’t immediately flushed to disk (unless you turn off buffering by passing 0 as the third, optional “buffering” argument to “open”, which Python 3 only allows for binary-mode files), but are only flushed when the buffer fills up, when you flush explicitly, or when the file is closed.

I thus decided to conduct some experiments, to better understand what I can (and cannot) expect Python to do for me automatically. My experiment consisted of opening a file, writing some data to it, deleting the reference, and then exiting from Python. I was curious to know when the data would be written, if ever.

My experiment looked like this:

f = open('/tmp/output', 'w')
f.write('abc\n')
f.write('def\n')
# check contents of /tmp/output (1)
del(f)
# check contents of /tmp/output (2)
# exit from Python
# check contents of /tmp/output (3)

In my first experiment, conducted with Python 2.7.9 on my Mac, I can report that at stage (1) the file existed but was empty, and at stages (2) and (3), the file contained all of its contents. Thus, it would seem that in CPython 2.7, my original intuition was correct: When a file object is garbage collected, its __del__ (or the equivalent thereof) flushes and closes the file. And indeed, invoking “lsof” on my IPython process showed that the file was closed after the reference was removed.

What about Python 3?  I ran the above experiment under Python 3.4.2 on my Mac, and got identical results. Removing the final (well, only) reference to the file object resulted in the file being flushed and closed.
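
If you would like to repeat the experiment without checking the file by hand, a script along the following lines reports the file’s size at stages (1) and (2). This is a sketch of mine rather than the exact procedure I followed above: it writes to a temporary directory instead of /tmp/output, and it uses os.path.getsize in place of looking at the file’s contents. On CPython, I would expect the size to be zero before the “del” and nonzero after it; on other implementations, the second number may well still be zero.

```python
import os
import platform
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'output')

f = open(path, 'w')
f.write('abc\n')
f.write('def\n')
size_before_del = os.path.getsize(path)   # stage (1)

del f
size_after_del = os.path.getsize(path)    # stage (2)

print(platform.python_implementation())
print(size_before_del)
print(size_after_del)
```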

This is good for 2.7 and 3.4.  But what about alternative implementations, such as PyPy and Jython?  Perhaps they do things differently.

I thus tried the same experiment under PyPy 2.7.8. And this time, I got different results!  Deleting the reference to our file object (that is, stage (2)) did not result in the file’s contents being flushed to disk. I have to assume that this has to do with differences in the garbage collector, or something else that works differently in PyPy than in CPython. But if you’re running programs in PyPy, then you should definitely not expect files to be flushed and closed, just because the final reference pointing to them has gone out of scope. lsof showed that the file stuck around until the Python process exited.

For fun, I decided to try Jython 2.7b3. And Jython exhibited the same behavior as PyPy: Deleting the reference didn’t flush the file, but exiting from Python always ensured that the data was flushed from the buffers, and stored to disk.

I repeated these experiments, but instead of writing “abc\n” and “def\n”, I wrote “abc\n” * 1000 and “def\n” * 1000.

In the case of Python 2.7, nothing was written after the “abc\n” * 1000. But when I wrote “def\n” * 1000, the file contained 4096 bytes — which probably indicates the buffer size. Invoking del(f) to remove the reference to the file object resulted in its being flushed and closed, with a total of 8,000 bytes. So in the case of Python 2.7, the behavior is basically the same regardless of string size; the only difference is that if you exceed the size of the buffer, then some data will be written to disk before the final flush + close.

In the case of Python 3, the behavior was different: No data was written after either of the 4,000 byte outputs written with f.write. But as soon as the reference was removed, the file was flushed and closed. This might point to a larger buffer size. But still, it means that removing the final reference to a file causes the file to be flushed and closed.

In the case of PyPy and Jython, the behavior with a large file was the same as with a small one: The file was flushed and closed when the PyPy or Jython process exited, not when the last reference to the file object was removed.

Just to double check, I also tried these using “with”. In all of these cases, it was easy to predict when the file would be flushed and closed: When the block exited, and the context manager fired the appropriate method behind the scenes.

In other words: If you don’t use “with”, then your data isn’t necessarily in danger of disappearing, at least not in simple situations. However, you cannot know for sure when the data will be saved, whether it’s when the final reference is removed, or when the program exits. If you’re assuming that files will be closed when functions return, because the only reference to the file is in a local variable, then you might be in for a surprise. And if you have multiple processes or threads writing to the same file, then you’re really going to want to be careful here.

Perhaps this behavior could be specified better, and thus work similarly or identically on different platforms? Perhaps we could even see the start of a Python specification, rather than pointing to CPython and saying, “Yeah, whatever that version does is the right thing.”

I still think that “with” and context managers are great. And I still think that it’s hard for newcomers to Python to understand what “with” does. But I also think that I’ll have to start warning new developers that if they decide to use alternative versions of Python, there are all sorts of weird edge cases that might not work identically to CPython, and that might bite them hard if they’re not careful.

If you enjoyed this explanation, check out my free e-mail course on working with files in Python, or my e-book, “Practice Makes Python,” with 50 battle-tested exercises in Python programming!

All 50 exercises in “Practice Makes Python” are complete! (Plus, a sample exercise about taxes in Freedonia)

The latest draft of “Practice Makes Python,” my ebook intended to sharpen your Python programming skills, is now out. This draft includes all 50 exercises, solutions, and explanations that I had hoped to include in the book.

I’m very excited to have reached this milestone, and appreciate the input from my many students and colleagues who have provided feedback.

The next steps in my release of the book are: Use a different toolchain that will allow for internal hyperlinks in the PDF, generate epub and mobi formats, and then start on the video explanations that will be included in a higher-tier version of the book package. Even without these steps, the content of the book is ready, and is a great way for you to improve your Python skills. The book is not meant to teach you Python, and assumes that you are familiar with the basics.

Please check out the latest version of Practice Makes Python. In case you’re not sure whether the book is for you, I am enclosing another sample exercise, this time from the chapter on modules and packages. As always, comments and suggestions are welcome.

Sales tax

The Republic of Freedonia has a strange tax system. To help businesses calculate their sales taxes, the government has decided to provide a Python software library.

Sales tax on a purchase depends on where the purchase was made, as well as the time of the purchase. Freedonia has four provinces, each of which charges a different percentage of tax:

  • Chico: 50%
  • Groucho: 70%
  • Harpo: 50%
  • Zeppo: 40%

Yes, the taxes are quite high in Freedonia. (So high, in fact, that they are said to have a Marxist government.) However, these taxes rarely apply in full. That’s because the amount of tax applied depends on the hour at which the purchase takes place. The tax percentage is always multiplied by the hour at which the purchase was made, divided by 24. At midnight, there is no sales tax. From 12 noon until 1 p.m., only 50% (12/24) of the tax applies. And from 11 p.m. until midnight, about 96% (i.e., 23/24) of the tax applies.

Your job is to implement that Python module, “freedonia.py”. It should provide a function, “calculate_tax”, which takes three arguments: The amount of the purchase, the province in which the purchase took place, and the hour (using 24-hour notation) at which it happened. The “calculate_tax” function should return the final price.

Thus, if I were to invoke

calculate_tax(500, 'Harpo', 12)

I would get back 625.0. A $500 purchase in Harpo province (with 50% tax) would normally cost $750. However, because the purchase was made at 12 noon, the tax is only half of its usual amount, or $125, for a total of $625. If the purchase were made at 9 p.m. (i.e., 21:00 on a 24-hour clock), then the tax would be 87.5% of its full rate, or 43.75%, for a total price of $718.75.

Note that while you can still use a single file, exercises such as this one lend themselves to having two files, one of which (“use_freedonia.py”) imports and then uses “freedonia.py”.

Solution

# freedonia.py
rates = {
 'Chico': 0.5,
 'Groucho': 0.7,
 'Harpo': 0.5,
 'Zeppo': 0.4
}

def time_percentage(hour):
    return hour / 24.0

def calculate_tax(amount, state, hour):
    return amount + (amount * rates[state] * time_percentage(hour))

And now, the program that uses it:

from freedonia import calculate_tax

print("You owe a total of: {}".format(calculate_tax(100, 'Harpo', 12)))

print("You owe a total of: {}".format(calculate_tax(100, 'Harpo', 21)))

Discussion

The “freedonia” module does precisely what a Python module should do: Namely, it defines data structures and functions that provide functionality to one or more other programs. By providing this layer of abstraction, it allows a programmer to focus on what is important to him or her, such as the implementation of an online store, without having to worry about the nitty-gritty of particular details.

While some countries have extremely simple systems for calculating sales tax, others (such as the United States) have many overlapping jurisdictions, each of which applies its own sales tax, often at different rates and on different types of goods. Thus, while the Freedonia example is somewhat contrived, it is not unusual to purchase or use libraries of this sort for calculating sales taxes.

Our module defines a dictionary (“rates”), in which the keys are the provinces of Freedonia, and the values are the taxation rates that should be applied there. Thus, we can find out the rate of taxation in Groucho province with “rates[‘Groucho’]”. Or we can ask the user to enter a province name in the “province” variable, and then get “rates[province]”. Either way, that will give us a floating-point number which we can use to calculate the tax.

A wrinkle in the calculation of Freedonian taxation is the fact that taxes get progressively higher as the day goes on. In order to make this calculation easier, I wrote a “time_percentage” function, which simply takes the hour and returns it as a percentage of 24 hours. In Python 2, integer division always returns an integer, even when that means throwing away the remainder. Thus, we divide the current hour not by “24” (an int) but by “24.0” (a float), which ensures that the result will be a floating-point number.
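
In Python 3, by contrast, the “/” operator always performs true division, and “//” is the operator that throws away the remainder. A Python 3 version of the function could thus be sketched as:

```python
def time_percentage(hour):
    # In Python 3, "/" always performs true division, so 24.0 isn't needed
    return hour / 24

print(time_percentage(12))    # 0.5
print(time_percentage(21))    # 0.875
print(12 // 24)               # floor division still truncates: 0
```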

Finally, the “calculate_tax” function takes three parameters (the amount of the sale, the name of the province in which the sale is taking place, and the hour at which the sale happened) and returns a floating-point number indicating the final price, including tax.

It should be noted that if you’re actually doing calculations involving serious money, you should almost certainly *not* be using floats. Rather, you should use integers, and then calculate everything in terms of cents, rather than dollars. This avoids the fact that floating-point numbers are not completely accurate on computers. (Try to add “0.1” and “0.7” in Python, and see what the result is.) However, for the purposes of this example, and given the current state of the Freedonian economy in any event, this is an acceptable risk for us to take.
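
For the curious, here is a rough sketch of what an integer-only version might look like. The names “RATES_PERCENT” and “calculate_tax_cents” are my own, and the decision to round the tax down to a whole cent is an assumption; a real tax authority would specify its own rounding rules.

```python
# Rates as whole percentages, so no floats are involved
RATES_PERCENT = {
    'Chico': 50,
    'Groucho': 70,
    'Harpo': 50,
    'Zeppo': 40,
}

def calculate_tax_cents(amount_cents, province, hour):
    """Return the final price in cents, rounding the tax down to a whole cent."""
    # tax = amount * (rate / 100) * (hour / 24), computed in one integer division
    tax_cents = amount_cents * RATES_PERCENT[province] * hour // (100 * 24)
    return amount_cents + tax_cents

print(0.1 + 0.7 == 0.8)                          # False
print(calculate_tax_cents(50000, 'Harpo', 12))   # 62500, i.e. $625.00
print(calculate_tax_cents(50000, 'Harpo', 21))   # 71875, i.e. $718.75
```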

“Practice Makes Python” is now available for early-bird purchase

My first ebook, “Practice Makes Python” (containing 50 exercises that will help to sharpen your Python skills), is now available for early-bird purchase!

The book is already about 130 pages (and 26,000 words) long, containing about 40 exercises on such subjects as basic data structures, working with files, functional programming, and object-oriented development. But it’s not quite done, and thus I’m calling this an “early-bird” purchase of the book: Not all of the exercises are ready, the formatting isn’t quite there yet, and PDF is the only format available for now. That said, even in this draft version, there is more than enough here to help many Python developers to gain fluency and improve their skills with the language.

Anyone who purchases the book now can use the coupon code EARLY to get a 10% discount. Perhaps it goes without saying, but anyone buying the book now will also get all updates and improvements, free of charge, as they occur over the coming weeks. And anyone who finds that they didn’t get value from the book is welcome to e-mail me and say so — and I’ll refund 100 percent of your purchase price.

The basic idea behind “Practice Makes Python” is that learning Python — or any language — is a long, slow process. Even the best courses cannot possibly give you enough practice with the language for it to feel natural. That only comes with practice. Most people end up practicing, as it were, on projects at work. My goal with this book is to give people who have taken Python courses a chance to become more familiar with the language.

My PhD studies in Learning Sciences taught me a great deal about how people learn, and one of the most important lessons was that of “constructionism” — that one of the best ways to learn is through the creation of things that are important to the individual. I have tried to make the exercises in “Practice Makes Python” interesting and fun, as well as relevant to what people do with Python on a day-to-day basis. Perhaps you won’t be creating Pig Latin translation programs in your day job, but the techniques that you learn from writing such programs in the book will undoubtedly help you out. Certainly, by working through the exercises — not by reading the answers and discussions! — you will learn a great deal about Python programming.

If you recently took a course in Python, or even if you have been working with it for up to a year, I believe that “Practice Makes Python” will give you the knowledge and confidence you need to master this fun and interesting language. These exercises are based on the many Python courses I have taught in the United States, Europe, Israel, and China over the years, and have proven themselves to help programmers start to really “get” Python.

I’d be delighted to hear what you think about “Practice Makes Python,” and how it can help to improve people’s Python programming skills even more. Contact me at reuven@lerner.co.il if you have thoughts or ideas.

In Python, it’s all about the attributes

Newcomers to Python are often amazed to discover how easily we can create new classes. For example:

class Foo(object): 
    pass

is a perfectly valid (if boring) class. We can even create instances of this class:

f = Foo()

This is a perfectly valid instance of Foo. Indeed, if we ask it to identify itself, we’ll find that it’s an instance of Foo:

>>> type(f)
<class '__main__.Foo'>

Now, while “f” might be a perfectly valid and reasonable instance of Foo, it’s not very useful. It’s at this point that many people who have come to Python from another language expect to learn where they can define instance variables. They’re relieved to know that they can write an  __init__ method, which is invoked on a new object immediately after its creation. For example:

class Foo(object):
    def __init__(self, x, y):
        self.x = x
        self.y = y

>>> f = Foo(100, 'abc')

>>> f.x
100
>>> f.y
'abc'

On the surface, it might seem like we’re setting two instance variables, x and y, on f, our new instance of Foo. And indeed, the behavior is something like that, and many Python programmers think in these terms. But that’s not really the case, and the sooner that Python programmers stop thinking in terms of “instance variables” and “class variables,” the sooner they’ll understand how much of Python works, why objects work in the ways that they do, and how “instance variables” and “class variables” are specific cases of a more generalized system that exists throughout Python.

The bottom line is that inside of __init__, we’re adding new attributes to self, the local reference to our newly created object. Attributes are a fundamental part of objects in Python. Heck, attributes are fundamental to everything in Python. The sooner you understand what attributes are, and how they work, the sooner you’ll have a deeper understanding of Python.

Every object in Python has attributes. You can get a list of those attributes using the built-in “dir” function. For example:

>>> s = 'abc'
>>> len(dir(s))
71
>>> dir(s)[:5]
['__add__', '__class__', '__contains__', '__delattr__', '__doc__']

>>> i = 123
>>> len(dir(i))
64
>>> dir(i)[:5]
['__abs__', '__add__', '__and__', '__class__', '__cmp__']

>>> t = (1,2,3)
>>> len(dir(t))
32
>>> dir(t)[:5]
['__add__', '__class__', '__contains__', '__delattr__', '__doc__']

As you can see, even the basic data types in Python have a large number of attributes. We can see the first five attributes by limiting the output from “dir”; you can look at them yourself inside of your Python environment.

The thing is, these attribute names returned by “dir” are strings. How can I use this string to get or set the value of an attribute? We somehow need a way to translate between the world of strings and the world of attribute names.

Fortunately, Python provides us with several built-in functions that do just that. The “getattr” function lets us get the value of an attribute. We pass “getattr” two arguments: The object whose attribute we wish to read, and the name of the attribute, as a string:

>>> getattr(t, '__class__')
tuple

This is equivalent to:

>>> t.__class__
tuple

In other words, the dot notation that we use in Python all of the time is nothing more than syntactic sugar for “getattr”. Each has its uses; dot notation is far easier to read, and “getattr” gives us the flexibility to retrieve an attribute value with a dynamically built string.
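To illustrate the "dynamically built string" case, here is a small sketch in which the attribute names come from a list rather than being typed after a dot. The list of method names is just something I picked for the example.

```python
s = 'abc'
results = {}

# The attribute name is only known at runtime, so dot notation won't do.
for method_name in ['upper', 'title', 'swapcase']:
    method = getattr(s, method_name)     # same as s.upper, s.title, ...
    results[method_name] = method()

print(results['upper'])      # ABC
print(results['title'])      # Abc
print(results['swapcase'])   # ABC

# An optional third argument is returned if the attribute doesn't exist.
print(getattr(s, 'no_such_method', 'missing'))   # missing
```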

Python also provides us with “setattr”, a function that takes three arguments: An object, a string indicating the name of the attribute, and the new value of the attribute. There is no difference between “setattr” and using the dot-notation on the left side of the = assignment operator:

>>> f = Foo(100, 'abc')
>>> setattr(f, 'x', 5)
>>> getattr(f, 'x')
5
>>> f.x
5
>>> f.x = 100
>>> f.x
100

As with all assignments in Python, the new value can be any legitimate Python object. In this case, we’ve assigned f.x to be 5 and 100, both integers, but there’s no reason why we couldn’t assign a tuple, dictionary, file, or even a more complex object. From Python’s perspective, it really doesn’t matter.

In the above case, I used “setattr” and the dot notation (f.x) to assign a new value to the “x” attribute. f.x already existed, because it was set in __init__. But what if I were to assign an attribute that didn’t already exist?

The answer: It would work just fine:

>>> f.new_attrib = 'hello'
>>> f.new_attrib
'hello' 

>>> f.favorite_number = 72
>>> f.favorite_number
72

In other words, we can create and assign a new attribute value by … well, by assigning to it, just as we can create a new variable by assigning to it. (There are some exceptions to this rule, mainly in that you cannot add new attributes to many built-in classes.) Python is much less forgiving if we try to retrieve an attribute that doesn’t exist:

>>> f.no_such_attribute
AttributeError: 'Foo' object has no attribute 'no_such_attribute'

So, we’ve now seen that every Python object has attributes, that we can retrieve existing attributes using dot notation or “getattr”, and that we can always set attribute values. If the attribute didn’t exist before our assignment, then it certainly exists afterwards.
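The exception I mentioned for built-in classes, and two ways of coping with an attribute that might not exist, can be sketched as follows. This is a small illustration that I put together for this explanation, and it redefines an empty Foo so that it stands on its own.

```python
class Foo(object):
    pass

f = Foo()
f.favorite_number = 72           # fine: instances of Foo accept new attributes

s = 'abc'
str_rejected = False
try:
    s.favorite_number = 72       # instances of the built-in str do not
except AttributeError as e:
    str_rejected = True
    print('Cannot set attribute: %s' % e)

# Checking first, or supplying a default, avoids the AttributeError on retrieval.
print(hasattr(f, 'no_such_attribute'))          # False
print(getattr(f, 'no_such_attribute', 'n/a'))   # n/a
```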

We can assign new attributes to nearly any object in Python. For example:

def hello():
    return "Hello"

>>> hello.abc_def = 'hi there!'

>>> hello.abc_def
'hi there!'

Yes, Python functions are objects. And because they’re objects, they have attributes. And because they’re objects, we can assign new attributes to them, as well as retrieve the values of those attributes.

So the first thing to understand about these “instance variables” that we oh-so-casually create in our __init__ methods is that we’re not creating variables at all. Rather, we’re adding one or more additional attributes to the particular object (i.e., instance) that has been passed to __init__. From Python’s perspective, there is no difference between saying “self.x = 5” inside of __init__, or “f.x = 5” outside of __init__. We can add new attributes whenever we want, and the fact that we do so inside of __init__ is convenient, and makes our code easier to read.
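One way to convince yourself of this is to compare an object whose attribute was set inside of __init__ with one whose attribute was set from the outside. The class names here are hypothetical, and I'm using the built-in "vars" function, which returns the dictionary in which an instance's own attributes are stored.

```python
class WithInit(object):
    def __init__(self):
        self.x = 5          # attribute added inside of __init__

class Empty(object):
    pass

a = WithInit()
b = Empty()
b.x = 5                     # attribute added from outside

print(vars(a))              # {'x': 5}
print(vars(b))              # {'x': 5}
print(vars(a) == vars(b))   # True
```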

This is one of those conventions that is really useful to follow: Yes, you can create and assign object attributes wherever you want. But it makes life so much easier for everyone if you assign all of an object’s attributes in __init__, even if it’s just to give it a default value, or even None. Just because you can create an attribute whenever you want doesn’t mean that you should do such a thing.

Now that you know every object has attributes, it’s time to consider the fact that classes (i.e., user-defined types) also have attributes. Indeed, we can see this:

>>> class Foo(object):
        pass

Can we assign an attribute to a class?  Sure we can:

>>> Foo.bar = 100
>>> Foo.bar
100

Classes are objects, and thus classes have attributes. But it seems a bit annoying and roundabout for us to define attributes on our class in this way. We can define attributes on each individual instance inside of __init__. When is our class defined, and how can we stick attribute assignments in there?

The answer is easier than you might imagine. That’s because there is a fundamental difference between the body of a function definition (i.e., the block under a “def” statement) and the body of a class definition (i.e., the block under a “class” statement). A function’s body is only executed when we invoke the function. However, the body of a class definition is executed immediately, and only once, when we define the class. We can execute code in our class definitions:

class Foo(object):
    print("Hello from inside of the class!")

Of course, you should never do this, but this is a byproduct of the fact that class definitions execute immediately. What if we put a variable assignment in the class definition?

class Foo(object):
    x = 100

If we assign a variable inside of the class definition, it turns out that we’re not assigning a variable at all. Rather, we’re creating (and then assigning to) an attribute. The attribute is on the class object. So immediately after executing the above, I can say:

Foo.x

and I’ll get the integer 100 returned back to me.

Are you a little surprised to discover that variable assignments inside of the class definition turn into attribute assignments on the class object? Many people are. They’re even more surprised, however, when they think a bit more deeply about what it must mean to have a function (or “method”) definition inside of the class:

>>> class Foo(object):
        def blah(self):
            return "blah"

>>> Foo.blah
<unbound method Foo.blah>

Think about it this way: If I define a new function with “def”, I’m defining a new variable in the current scope (usually the global scope). But if I define a new function with “def” inside of a class definition, then I’m really defining a new attribute with that name on the class.
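Here is a quick sketch that shows both kinds of class-body definitions ending up as attributes on the class object. I filter out the double-underscore names that Python adds on its own, since those vary between versions.

```python
class Foo(object):
    x = 100
    def blah(self):
        return "blah"

# Both the assignment and the "def" became attributes on the class object.
class_attrs = sorted(name for name in vars(Foo) if not name.startswith('__'))
print(class_attrs)          # ['blah', 'x']

f = Foo()
print('blah' in vars(f))    # False: the instance itself holds no "blah"
```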

In other words: Instance methods sit on a class in Python, not on an instance. When you invoke “f.blah()” on an instance of Foo, Python is actually invoking the “blah” method on Foo, and passing f as the first argument. Which is why it’s important that Python programmers understand that there is no difference between “f.blah()” and “Foo.blah(f)”, and that this is why we need to catch the object with “self”.
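The equivalence of the two calls is easy to check for yourself. In this sketch, I've changed the return value of "blah" slightly so that it reports the class of whatever object arrived in "self".

```python
class Foo(object):
    def blah(self):
        return "blah from %s" % type(self).__name__

f = Foo()
print(f.blah())        # blah from Foo
print(Foo.blah(f))     # blah from Foo
```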

But wait a second: If I invoke “f.blah()”, then how does Python know to invoke “Foo.blah”?  f and Foo are two completely different objects; f is an instance of Foo, whereas Foo is an instance of type. Why is Python even looking for the “blah” attribute on Foo?

The answer is that Python has different rules for variable and attribute scoping. With variables, Python follows the LEGB rule: Local, Enclosing, Global, and Builtin. (See my free, five-part e-mail course on Python scopes, if you aren’t familiar with them.)  But with attributes, Python follows a different set of rules: First, it looks on the object in question. Then, it looks on the object’s class. Then it follows the inheritance chain up from the object’s class, until it hits “object” at the top.

Thus, in our case, we invoke “f.blah()”. Python looks on the instance f, and doesn’t find an attribute named “blah”. Thus, it looks on f’s class, Foo. It finds the attribute there, and performs some Python method rewriting magic, thus invoking “Foo.blah(f)”.
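The lookup order (instance, then class, then up the inheritance chain) can be sketched with a pair of hypothetical classes, Base and Child, which I've invented for this illustration:

```python
class Base(object):
    greeting = 'hello from Base'
    color = 'Base color'

class Child(Base):
    color = 'Child color'

c = Child()
c.size = 'instance size'

print(c.size)       # found on the instance itself
print(c.color)      # not on the instance; found on Child, so Base is never consulted
print(c.greeting)   # not on the instance or on Child; found on Base

# The chain of classes that Python searches, in order
print([cls.__name__ for cls in type(c).__mro__])   # ['Child', 'Base', 'object']
```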

So Python doesn’t really have “instance variables” or “class variables.”  Rather, it has objects with attributes. Some of those attributes are defined on class objects, and others are defined on instance objects. (Of course, class objects are just instances of “type”, but let’s ignore that for now.)  This also explains why people sometimes think that they can or should define attributes on a class (“class variables”), because they’re visible to the instances. Yes, that is true, but it sometimes makes more sense than others to do so.

What you really want to avoid is creating an attribute on the instance that has the same name as an attribute on the class. For example, imagine this:

class Person(object):
    population = 0
    def __init__(self, first, last):
        self.first = first        
        self.last = last
        self.population += 1

p1 = Person('Reuven', 'Lerner')
p2 = Person('foo', 'bar')

This looks all nice, until you actually try to run it. You’ll quickly discover that Person.population remains stuck at 0, but p1.population and p2.population are both set to 1. What’s going on here?

The answer is that the line

self.population += 1

can be turned into

self.population = self.population + 1

As always, the right side of an assignment is evaluated before the left side. Thus, on the right side, we say “self.population”. Python looks at the instance, self, and looks for an attribute named “population”. No such attribute exists. It thus goes to Person, self’s class, and does find an attribute by that name, with a value of 0. It thus returns 0, and executes 0 + 1. That gives us the answer 1, which is then passed to the left side of the assignment. The left side says that we should store this result in self.population; in other words, in an attribute on the instance! This works, because we can always assign any attribute. But in this case, we will now get different results for Person.population (which will remain at 0) and the individual instance values of population, such as p1.population and p2.population.
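One possible fix, sketched here, is to name the class explicitly on the left side of the assignment, so that the new value is stored on Person rather than on the instance:

```python
class Person(object):
    population = 0
    def __init__(self, first, last):
        self.first = first
        self.last = last
        Person.population += 1     # assign to the class attribute, not to self

p1 = Person('Reuven', 'Lerner')
p2 = Person('foo', 'bar')

print(Person.population)           # 2
print(p1.population)               # 2, found on the class via attribute lookup
print('population' in vars(p1))    # False: no instance attribute was created
```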

We can see which attributes were actually set on the instance and on the class, using a list comprehension:

class Foo(object):
    def blah(self):
        return "blah"

>>> f = Foo()
>>> [attr_name for attr_name in dir(f) if attr_name not in dir(Foo)]
[]

>>> [attr_name for attr_name in dir(Foo) if attr_name not in dir(object)]
['__dict__', '__module__', '__weakref__', 'blah']

In the above, we first define “Foo”, with a method “blah”. That method definition, as we know, is stored in the “blah” attribute on Foo. We haven’t assigned any attributes to f, which means that the only attributes available to f are those in its class.
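To see the comprehension report something, we can assign an attribute to an instance and run it again. This sketch repeats the class definition and creates a fresh instance so that it stands on its own; the attribute name "x" is arbitrary.

```python
class Foo(object):
    def blah(self):
        return "blah"

f = Foo()
f.x = 5     # now the instance has an attribute of its own

instance_only = [attr_name for attr_name in dir(f)
                 if attr_name not in dir(Foo)]
print(instance_only)   # ['x']
print(vars(f))         # {'x': 5}
```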

If you liked this explanation, then you’ll likely also enjoy my ebook, “Practice Makes Python,” with 50 exercises meant to improve your Python fluency.