2+2 might be 4, but what is True+True?

In Python, we know that we can add two integers together:

>>> 2 + 2
4

And of course, we can add two strings together:

>>> 'a' + 'b'
'ab'

We can even add two lists together:

>>> [1,2,3] + [4,5,6]
[1, 2, 3, 4, 5, 6]

But what happens when you add two booleans together?

>>> True + True
2

Huh?  What’s going on here?

This would seem to be the result of several different things going on.

First of all, booleans in Python are actually equal to 1 and 0. We know that in a boolean context (i.e., in an “if” statement or “while” loop), 1 is considered to be True, and 0 is considered to be False. But then again, 10, 100, and ‘abc’ are also considered to be True, while the empty string is considered to be False. All values are turned into booleans (using __nonzero__ in Python 2, and __bool__ in Python 3) when they’re in a place that requires a boolean value.
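
For example, we can ask for the boolean interpretation of a value explicitly with the "bool" builtin, which invokes those methods behind the scenes:

>>> bool(10), bool('abc'), bool([1, 2, 3])
(True, True, True)
>>> bool(0), bool(''), bool([])
(False, False, False)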

But whereas 10 and the empty string are turned into True and False (respectively) in a boolean context, 1 and 0 are special. True and False are really equal to 1 and 0, as we can see here:

>>> True == 1
True
>>> True == 10
False
>>> False == 0
True
>>> False == ''
False

So, it would seem that while it’s preferable to use True and False, rather than 1 and 0, the difference is a semantic one, for programmers.  Behind the scenes, they’re really numbers.

But how can that be?  Why in the world would Python consider two objects of completely different types to have equal values?

The answer is that in Python, the “bool” type is actually a subclass of “int”, as we can see here:

>>> bool.__bases__
(<class 'int'>,)

So while it’s still a bit weird for two values with different types to be considered equal, it’s not totally weird, given that booleans are basically a specialized form of integers.  And indeed, if we look at the documentation for booleans, we find that bool implements very few methods of its own; most are inherited from int.  One of the many inherited methods is __eq__, which determines whether two objects are equal.  This means that when we compare 1 and True, we’re really just comparing 1 (of type int) with 1 (of type bool, a subclass of int, which means that it behaves like an int).

This, then, explains what’s going on: When we say “True + True”, the inherited int.__add__ method is invoked. It gets two instances of bool, each with a value of 1, each of which knows how to behave like a regular int. So we get 2 back.
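
One place where this integer behavior comes in handy (a small aside, with made-up sample data): since each True counts as 1 and each False as 0, we can use "sum" to count how many items in a sequence satisfy a condition:

>>> numbers = [10, 15, 20, 25, 30]
>>> [n > 18 for n in numbers]
[False, False, True, True, True]
>>> sum(n > 18 for n in numbers)
3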

Just for kicks, what happens if we say the following?

True += 1

In Python 3, we get an error, saying that we can’t assign to a keyword. That’s because in Python 3, True and False are true keywords, rather than names defined in the “builtins” namespace. But in Python 2, they’re just built-in names. Which means that this is what happens in Python 2:

>>> True += 1
>>> True
2
>>> type(True)
<type 'int'>

Look at what we did: Not only did we add True + 1, but we then assigned this integer back to the name True!  Which means that the name True now refers to an integer, rather than to the instance of “bool” in the “builtins” namespace.

How did this happen? Because of Python’s scoping rules, known as “LEGB” for local-enclosing-globals-builtins.  When we assigned to True with +=, we didn’t change the value of True in builtins. Rather, we created a new global variable. The builtin value is still there, but because the global namespace gets priority, its value (2) masks the actual value (True, or 1) in the builtins namespace.

Indeed, look at this:

>>> True += 1
>>> True == __builtins__.True
False

How can we get out of this situation? By embracing nihilism, deleting truth (i.e., the True name we’ve defined in the global namespace).  Once we’ve done that, the True in the builtins namespace will be visible once again:

>>> del(True)
>>> True
True
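
By the way, the same masking happens with any name from the builtins namespace, not just True, and this part works in Python 3 as well. A quick sketch:

>>> len = 5           # creates a global "len" that masks the builtin
>>> len('abc')        # raises TypeError: 'int' object is not callable
>>> del(len)          # delete the global; the builtin is visible again
>>> len('abc')
3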

I’d like to think that none of us will ever be doing mathematical operations with booleans, or trying to assign to them. But understanding how these things work does provide a clearer picture of Python scoping rules, which govern how everything else in the language functions.

What other weird boolean/integer behavior have you noticed? I’m always curious to find, and then understand, these somewhat dark (and useless) aspects of Python!

Unfamiliar with Python scoping rules, or just want to brush up on them? Take my free, five-part “Python variable scoping” e-mail course: http://lerner.co.il/e-mail-courses/variable-scoping-in-python/

Python function brain transplants

What happens when we define a function in Python?

The “def” keyword does two things: It creates a function object, and then assigns a variable (our function name) to that function object.  So when I say:

def foo():
      return "I'm foo!"

Python creates a new function object.  Inside of that object, we can see the bytecodes, the arity (i.e., number of parameters), and a bunch of other stuff having to do with our function.

Most of those things are located inside of the function object’s __code__ attribute.  Indeed, looking through __code__ is a great way to learn how Python functions work.  The arity of our function “foo”, for example, is available via foo.__code__.co_argcount.  And the byte codes are in foo.__code__.co_code.
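
For example, here’s a quick look at a few of the attributes on a trivial function’s __code__ object; the exact values you see will depend on your Python version:

def foo():
    return "I'm foo!"

print(foo.__code__.co_argcount)    # 0 -- foo takes no parameters
print(foo.__code__.co_consts)      # the constants the function uses, including "I'm foo!"
print(foo.__code__.co_code)        # the raw bytecodes, as a bytes object (a str in Python 2)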

The individual attributes of the __code__ object are read-only.  But the __code__ attribute itself isn’t!  We can thus be a bit mischievous, and perform a brain transplant on our function:

def foo():
      return "I'm foo!"

def bar():
      return "I'm bar!"

foo.__code__ = bar.__code__

Now, when we run foo(), we’re actually running the code that was defined for “bar”, and we get:

"I'm in bar!"

This is not likely to be something you want to put in your actual programs, but it does demonstrate some of the power of the __code__ object, and how much it can tell us about our functions.  Indeed, I’ve found over the last few years that exploring the __code__ object can be quite interesting!

The five-minute guide to setting up a Jupyter notebook server

Nearly every day, I teach a course in Python.  And nearly every day, I thus use the Jupyter notebook: I do my live-coding demos in it, answer students’ questions using it, and also send it to my students at the end of the day, so that they can review my code without having to type furiously or take pictures of my screen.

I also encourage my students to install the notebook — not just so that they can open the file I send them at the end of each day, but also so that they can learn to experiment, iterate, and play with their code using a modern interactive environment.

My two-day class in regular expressions uses Python, but as a means to an end: I teach the barest essentials of Python in an hour or so, just so that we can read from files and search inside of them using regular expressions.  For this class, it’s overkill to ask the participants to install Python, especially if they plan to use other languages in their work.

When I teach this class, then, I set up a server that is only used for my course. Students can all log into the Jupyter notebook, create and use their own notebook files, and avoid installing anything new on their own computers.

Moreover, this whole setup now takes me just a few minutes. And since the class lasts only two days, I’m paying a few dollars for the use of a server that doesn’t contain any valuable data on it. So if someone happens to break into the server or ruin it, or if one of my students happens to mess around with things, nothing from my own day-to-day work will be ruined or at risk.

How do I do this? It’s pretty simple, actually:

  • Create a new server.  I use Digital Ocean, because they’re cheap and easy to set up. When I teach a class in Israel or Europe, I use one of their European data centers (or should I say “centres”), because it’ll be a bit faster.  If a large number of students will participate, then I might get a server with a lot of RAM, but I normally get one of the most basic DO plans.  I normally set up an Ubuntu server, but you’re welcome to use any version of Linux you want.
  • I log in as root, and add a bunch of packages, starting with “python-pip”.  (Note that this assumes you’re willing to use Python 2.  I don’t really care which version I use for this class; when Ubuntu switches to Python 3, I’ll gladly use that instead.) This ensures that the Python “pip” program is installed, allowing me to install Python packages:
apt-get install unzip emacs wget curl python-pip
  • I then use “pip” to install the “ipython[notebook]” package, which in turn downloads and installs everything else I’ll need:
pip install -U 'ipython[notebook]'   # yes, you need the quotes
  • Next, I create a “student” user.  It is under this user that I’ll run the notebook.
  • As the “student” user, I run the notebook command with the “--generate-config” option. This creates a configuration directory (~student/.jupyter) with a configuration file (jupyter_notebook_config.py) inside of it:
jupyter notebook --generate-config
  • Open the jupyter_notebook_config.py file, which is (as the suffix indicates) a Python file, defining a number of variables.  You’ll need to set a few of these in order for things to work; from my experience, changing four lines is sufficient:
    c.NotebookApp.open_browser = False    # no browser needed on a server
    c.NotebookApp.ip = '0.0.0.0'          # listen on the network
    c.NotebookApp.password = u''          # don't require a password
    c.NotebookApp.token = ''              # don't require a security token


  • These final two settings, which remove password protection and the security token, are debatable.  On the one hand, my server will exist for two days at most, and won’t have any important data on it.  On the other hand, you could argue that I’ve provided a new entry point for bad people in the world to attack the rest of the Internet.
  • Once that’s done, go back to the Unix shell, and launch the notebook using “nohup”, so that even if/when you log out, the server will still keep running:
    nohup jupyter notebook

    Once this is done, your server should be ready!  If you’re at the IP address 192.168.1.1, then you should at this point be able to point your browser to http://192.168.1.1:8888, and you’ll see the Jupyter notebook.  You can, of course, change the port that’s being used; I’ve found that when working from a cafe, non-standard ports can sometimes be blocked. Do remember that low-numbered ports, such as 80, can only be used by root, so you’ll need something higher, such as 8000 or 8080.

Also note that working this way means that all users have identical permissions. This means that in theory, one student can delete, change, or otherwise cause problems in another student’s notebook.  In practice, I’ve never found this to be a problem.

A bigger problem is that the notebook’s back end can sometimes fail. In such cases, you’ll need to ssh back into the server and restart the Jupyter notebook process. It’s frustrating, but relatively rare.  By using “nohup”, I’ve been able to avoid the server going down whenever my ssh connection died and/or I logged out.

I’ve definitely found that using the Jupyter notebook has improved my teaching.  Having my students use it in my courses has also improved their learning!   And as you can see, setting it up is a quick operation, requiring no more than a few minutes just before class starts.  Just don’t forget to shut down the server when you’re done, or you’ll end up paying for a server you don’t need.

Comprehensive documentation for setting up a Jupyter server, including security considerations, is available at the (excellent) Jupyter documentation site.

Suggestions for how to improve these instructions are quite welcome, of course!

A short Python class puzzle

Here’s a short Python puzzle that I use in many of my on-site courses, and which I have found to be useful and instructive: Given the following short snippet of Python, which letters will be printed, and in which order?

print("A")
class Person(object):
    print("B")
    def __init__(self, name):
        print("C")
        self.name = name
    print("D")
print("E")

p1 = Person('name1')
p2 = Person('name2')

Think about this for a little bit before looking at the answer, or executing it on your computer.

Let’s start with the answer, then:

A
B
D
E
C
C

Let’s go through these one by one, to understand what’s happening.

The initial “A” is printed because… well, because it’s the first line of the program. And in a Python program, the lines are executed in order — starting with the first, then the second, and so forth, until we reach the end.  So it shouldn’t come as a surprise that the “A” line is printed first.

But what is surprising to many people — indeed, the majority of people who take my courses — is that we next see “B” printed out.  They are almost always convinced that the “B” won’t be printed when the class is defined, but rather when… well, they’re really not sure when “B” will be printed.

Part of the problem is that Python’s keywords “class” and “def” look very similar. The former defines a new class, and the latter defines a new function. And both start with a keyword, and take a block of code.  They should work the same, right?

Except that “class” and “def” actually work quite differently: The “def” keyword creates a new function, that’s true.  However, the body of the function doesn’t get executed right away. Instead, the function body is byte compiled, and we end up with a function object.

This means that so long as the function body doesn’t contain any syntax errors (including indentation errors), the function object will be created. Indeed, the following function will produce an error when run, but won’t cause a problem when being defined:

def foo():
    asdfafsafasfsafavasdvadsvas

This is not at all the case for the “class” keyword, which creates a new class. And if we think about it a bit, it stands to reason that “class” cannot work this way. Consider this: Immediately after I’ve defined a class, the methods that I’ve defined are already available. This means that the “def” keyword inside of the class must have executed. And if “def” executes inside of a class body, then everything else in the class body executes, too.

Now, under most normal circumstances, you’re not going to be invoking “print” from within your class definition. Rather, you’ll be using “def” to define methods, and assignment to create class attributes. But both “def” and assignment need to execute if they are to create those methods and attributes! Their execution cannot wait until after the class is already defined.

I should also add that in Python, a class definition operates much like a module: What looks like an assignment to a global variable inside of the class body is actually the creation of an attribute on the class itself. And of course, the fact that the body of a class definition executes line by line makes it possible to use decorators such as @staticmethod and @property.

In short, “def” inside of a class definition creates a new function object, and then creates a class-level attribute with the same name as your function.
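
To make this concrete, here’s a tiny made-up class (not from the original puzzle); both the assignment and the “def” execute at class-definition time, and both end up as attributes on the class:

class Greeter(object):
    greeting = 'Hello'              # executes now, creating the class attribute "greeting"

    def greet(self, name):          # executes now, creating the class attribute "greet"
        return '{} {}'.format(self.greeting, name)

print(Greeter.greeting)             # Hello
print(Greeter().greet('world'))     # Hello world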

So, when you define a class, Python executes each line, in sequence, at the time of definition.  That’s why “B” is printed next.

Why is “C” not printed next? Because the body of a function isn’t executed when the function is defined. So when we define our __init__  method, “print” doesn’t run right away.  It does, however, run one time for each of the two instances of “Person” we create at the end of the program, which is why “C” is printed twice.

However, before “C”  is printed, the class definition still needs to finish running.  So we first see “D” (inside of the class definition), and then “E” (just after the class is defined).

Over the years, I’ve found that understanding what goes on inside of a class definition helps to understand many of the fundamental principles of Python, and how the language is implemented. At the same time, I’ve found that even in my advanced Python classes, a large number of developers don’t answer this puzzle correctly, which means that even if you work with Python for a long time, you might not have grasped the difference between “class” and “def”.

Data science + machine learning + Python course in Shanghai

Data science is changing our lives in dramatic ways, and just about every company out there wants to take advantage of the insights that we can gain from analyzing our data — and then making predictions based on that analysis.

Python is a leading language (and some would say the leading language) for data scientists. So it shouldn’t come as a surprise that in addition to teaching intro and advanced Python courses around the world, I’m increasingly also teaching courses in how to use Python for data science and machine learning.  (Within the next few weeks, I expect to release a free e-mail course on how to use Pandas, an amazing Python library for manipulating data.   I’ve already discussed it on my mailing list, with more discussion of the subject to come in the future.)

Next month (i.e., February 2017), I’ll be teaching a three-day course in the subject in Shanghai, China.  The course will be in English (since my Chinese is not nearly good enough for giving lectures at this point), and will involve a lot of hands-on exercises, as well as lecture and discussion.  And lots of bad jokes, of course.

Here’s the advertisement (in Chinese); if you’re interested in attending, contact me or the marketing folks at Trig, the company bringing me to China:

http://mp.weixin.qq.com/s/kNwRpwEdhwqjL22e4TdgLA

Can’t make it to Shanghai? That’s OK; maybe I can come to your city/country to teach!  Just contact me at reuven@lerner.co.il, and we can chat about it in greater depth.

Free download: Cheat sheet for Python data structures

Get the Python data manipulations cheat sheet

Just about every day of every week, I teach Python. I teach not only in Israel, but also in Europe, China, and the US.  While I do teach newcomers to programming, the overwhelming majority of my students are experienced developers in another language — most often C, C++, Java, or C#.

For many of these people, I find that it’s hard to keep track of which data structure does what — when do you use lists vs. dicts vs. tuples vs. sets.  Moreover, it’s hard for them to remember the most common methods and operators we use on these data structures.

Perhaps the most common question I get is, “How do I add a new element to a dictionary?”  They’re often looking for an “append” method, and are surprised to find that one doesn’t exist.
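
(For the record, the answer is plain assignment: assigning to a key that doesn’t yet exist adds it to the dictionary.)

>>> d = {'a': 1}
>>> d['b'] = 2       # no "append" needed; assigning to a new key adds it
>>> d
{'a': 1, 'b': 2}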

That, and other questions, led me to create this “cheat sheet for Python data structures.”  It’s not meant to be all-encompassing, but rather to provide some insights and reminders about the most common tasks you’ll want to do with lists, tuples, dicts, and sets.

My students have found this to be helpful — and I hope that it’ll be useful to other Python developers out there, as well!  Feedback is, of course, warmly welcome.

Get the Python data manipulations cheat sheet

The (lack of a) case against Python 3

A few days ago, well-known author and developer Zed Shaw wrote a blog post, “The Case Against Python 3.”   I have a huge amount of respect for Zed’s work, and his book (Learn Python the Hard Way) takes an approach similar to mine — so much so that I often tell people who are about to take my course to read it in preparation, and tell those who want more practice after finishing my course to read it afterwards.

It was thus disappointing for me to see Zed’s post about Python 3, with which I disagree.

Let’s make it clear: About 90% of my work is as a Python trainer at various large companies; my classes range from “Python for non-programmers” and “Intro Python” to “Data science and machine learning in Python,” with a correspondingly wide range of backgrounds. I would estimate that at least 95% of the people I teach are using Python 2 in their work.

In my own development work, I switch back and forth between Python 2 and 3, depending on whether it’s for a client, for myself, and what I plan to do with it.

So I’m far from a die-hard “Python 3 or bust” person. I recognize that there are reasons to use either 2 or 3.  And I do think that if there’s a major issue in the Python world today, it’s in the world of 2 vs. 3.

But there’s a difference between recognizing a problem, and saying that Python 3 is a waste of time — or, as Zed is saying, that it’s a mistake to teach Python 3 to new developers today.  Moreover, I think that the reasons he gives aren’t very compelling, either for newcomers to programming in general, or to experienced programmers moving to Python.

Zed’s argument seems to boil down to:

  • Implementing Unicode in Python 3 has made things harder, and
  • The fact that you cannot run Python 2 programs in the Python 3 environment, but instead need to translate them semi-automatically with a combination of 2to3 and manual intervention, is crazy and broken.

I think that the first is a bogus argument, and the second is overstating the issues by a lot.

As for Unicode: This was painful. It was going to be painful no matter what.  Maybe the designers got some things wrong, but on the whole, Unicode works well (I think) in Python 3.

In my experience, 90% of programmers don’t need to think about Unicode, because so many programmers use ASCII in their work.  For them, Python 3 works just fine, no better (and no worse) than Python 2 on this front.

For people who do need Unicode, Python 3 isn’t perfect, but it’s far, far better than Python 2. And given that some huge proportion of the world doesn’t speak English, the notion that a modern language won’t natively support Unicode strings is just nonsense.

This does mean that code needs to be rewritten, and that people need to think more before using strings that contain Unicode.  Yes, those are problems.  And Zed points out some issues with the implementation that can be painful for people.

But again, the population that will be affected is the 10% who deal with Unicode.  That generally doesn’t include new developers — and if it does, everything is hard for them anyway.  So the notion that Unicode problems make Python 3 impossible to use is just silly.  And the notion that Python can simply ignore Unicode, or treat non-English characters as an afterthought, is laughable in the modern world.

The fact that you cannot run Python 2 programs in the Python 3 VM might have been foolish in hindsight.  But if the migration from Python 2 to 3 is slow now, imagine how much slower it would be if companies never needed to migrate at all.  Heck, that might still happen come 2020, when large companies don’t migrate.  I actually believe that large companies won’t ever translate their Python 2 code into Python 3.  It’s cheaper and easier for them to pay people to keep maintaining Python 2 code than to move mission-critical code to a new platform.  So new stuff will be in Python 3, and old stuff will be in Python 2.

I’m not a language designer, and I’m not sure how hard it would have been to allow both 2 and 3 to run on the same system.  I’m guessing that it would have been quite hard; had it been feasible, it probably would have been done, since it would have saved a great deal of pain and angst among Python developers — and I do think that the Python developers have gone out of their way to make the transition easier.

Let’s consider who this lack of v2 backward compatibility affects, and what a compatible VM might have meant to them:

  • For new developers using Python 3, it doesn’t matter.
  • For small (and individual) shops that have some software in Python 2 and want to move to 3, this is frustrating, but it’s doable to switch, albeit incrementally.  This switch wouldn’t have been necessary if the VM were multi-version capable.
  • For big shops, they won’t switch no matter what. They are fully invested in Python 2, and it’s going to be very hard to convince them to migrate their code — in 2016, in 2020, and in 2030.

(PS: I sense a business opportunity for consultants who will offer Python 2 maintenance support contracts starting in 2020.)

So the only losers here are legacy developers, who will need to switch in the coming three years.  That doesn’t sound so catastrophic to me, especially given how many new developers are learning Python 3, the growing library compatibility with 3, and the fact that 3 increasingly has features that people want. With libraries such as six, making your code run in both 2 and 3 isn’t so terrible; it’s not ideal, but it’s certainly possible.

One of Zed’s points strikes me as particularly silly: The lack of Python 3 adoption doesn’t mean that Python 3 is a failure.  It means that Python users have entrenched business interests, and would rather stick with something they know than upgrade to something they don’t.  This is a natural way to do things, and you see it all the time in the computer industry.  (Case in point: Airlines and banks, which run on mainframes with software from the 1970s and 1980s.)

Zed does have some fair points: Strings are more muddled than I’d like (with too many options for formatting, especially in the next release), and some of the core libraries do need to be updated and/or documented better. And maybe some of those error messages you get when mixing Unicode and bytestrings could be improved.

But to say that the entire language is a failure because you get weird results when combining a (Unicode) string and a bytestring using str.format… in my experience, if someone is doing such things, then they’re no longer a newcomer, and know how to deal with some of these issues.
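
To see the sort of thing Zed is complaining about, here’s roughly what happens in Python 3 when strings and bytestrings meet; the exact error message varies by version:

>>> 'abc' + b'def'            # raises a TypeError; you can't mix str and bytes
>>> 'abc {}'.format(b'def')   # no error, but probably not what you wanted
"abc b'def'"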

Python 3 isn’t a failure, but it’s not a massive success, either.  I believe that the reasons for that are (1) the Python community is too nice, and has allowed people to delay upgrading, and (2) no one ever updates anything unless they have a super-compelling reason to do so and they can’t afford not to.  There is a growing number of super-compelling reasons, but many companies are still skeptical of the advantages of upgrading. I know of people who have upgraded to Python 3 for its async capabilities.

Could the Python community have handled the migration better? Undoubtedly. Would it be nice to have more, and better, translation tools?  Yes.  Is Unicode a bottomless pit of pain, no matter how you slice it, with Python 3’s implementation being a pretty good one, given the necessary trade-offs? Yes.

At the same time, Python 3 is growing in acceptance and usage. Oodles of universities now teach Python 3 as an introductory language, which means that in the coming years, a new generation of developers will graduate and expect/want to use Python 3. People in all sorts of fields are using Python, and many of them are switching to Python 3.

The changes are happening: Slowly, perhaps, but they are happening. And it turns out that Python 3 is just as friendly to newbies as Python 2 was. Which doesn’t mean that it’s wart-free, of course — but as time goes on, the inertia keeping people from upgrading will wane.

I doubt that we’ll ever see everyone in the Python world using Python 3. But to dismiss Python 3 as a grave error, and to say that it’ll never catch on, is far too sweeping, and ignores trends on the ground.

Enjoyed this article? Subscribe to my free weekly newsletter; every Monday, I’ll send you new ideas and insights into programming — typically in Python, but with some other technologies thrown in, as well!  Subscribe at http://lerner.co.il/newsletter.

Implementing “zip” with list comprehensions

I love Python’s “zip” function. I’m not sure just what it is about zip that I enjoy, but I have often found it to be quite useful. Before I describe what “zip” does, let me first show you an example:

>>> s = 'abc'
>>> t = (10, 20, 30)

>>> zip(s,t)
[('a', 10), ('b', 20), ('c', 30)]

As you can see, the result of “zip” is a sequence of tuples. (In Python 2, you get a list back.  In Python 3, you get a “zip object” back.)  The tuple at index 0 contains s[0] and t[0]. The tuple at index 1 contains s[1] and t[1].  And so forth.  You can use zip with more than one iterable, as well:

>>> s = 'abc'
>>> t = (10, 20, 30)
>>> u = (-5, -10, -15)

>>> list(zip(s,t,u))
[('a', 10, -5), ('b', 20, -10), ('c', 30, -15)]

(You can also invoke zip with a single iterable, thus ending up with a bunch of one-element tuples, but that seems a bit weird to me.)

I often use “zip” to turn parallel sequences into dictionaries. For example:

>>> names = ['Tom', 'Dick', 'Harry']
>>> ages = [50, 35, 60]

>>> dict(zip(names, ages))
{'Harry': 60, 'Dick': 35, 'Tom': 50}

In this way, we’re able to quickly and easily produce a dict from two parallel sequences.

Whenever I mention “zip” in my programming classes, someone inevitably asks what happens if one argument is shorter than the other. Simply put, the shortest one wins:

>>> s = 'abc'
>>> t = (10, 20, 30, 40)
>>> list(zip(s,t))
[('a', 10), ('b', 20), ('c', 30)]

(If you want zip to return one tuple for every element of the longest iterable, then use “zip_longest” (called “izip_longest” in Python 2) from the “itertools” module.)
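
For example, here’s zip_longest in Python 3, with a made-up fill value:

>>> from itertools import zip_longest    # izip_longest in Python 2
>>> s = 'abc'
>>> t = (10, 20, 30, 40)
>>> list(zip_longest(s, t, fillvalue='*'))
[('a', 10), ('b', 20), ('c', 30), ('*', 40)]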

Now, if there’s something I like even more than “zip”, it’s list comprehensions. So last week, when a student of mine asked if we could implement “zip” using list comprehensions, I couldn’t resist.

So, how can we do this?

First, let’s assume that we have our two equal-length sequences from above, s (a string) and t (a tuple). We want to get a list of three tuples. One way to do this is to say:

[(s[i], t[i])              # produce a two-element tuple
 for i in range(len(s))]   # from index 0 to len(s) - 1

To be honest, this works pretty well! But there are a few ways in which we could improve it.

First of all, it would be nice to make our comprehension-based “zip” alternative handle inputs of different sizes.  What that means is not just running range(len(s)), but running range(len(x)), where x is the shorter sequence. We can do this via the “sorted” builtin function, telling it to sort the sequences by length, from shortest to longest. For example:

>>> s = 'abcd'
>>> t = (10, 20, 30)

>>> sorted((s,t), key=len)
[(10, 20, 30), 'abcd']

In the above code, I create a new tuple, (s,t), and pass that as the first parameter to “sorted”. Given these inputs, we will get a list back from “sorted”. Because we pass the builtin “len” function to the “key” parameter, “sorted” will return [s,t] if s is shorter, and [t,s] if t is shorter.  This means that the element at index 0 is guaranteed not to be longer than any other sequence. (If all sequences are the same size, then we don’t care which one we get back.)

Putting this all together in our comprehension, we get:

>>> [(s[i], t[i])    
    for i in range(len(sorted((s,t), key=len)[0]))]

This is getting a wee bit complex for a single list comprehension, so I’m going to break off part of the second line into a function, just to clean things up a tiny bit:

>>> def shortest_sequence_range(*args):
        return range(len(sorted(args, key=len)[0]))

>>> [(s[i], t[i])     
    for i in shortest_sequence_range(s,t) ]

Now, our function takes *args, meaning that it can accept any number of sequences. The sequences are sorted by length, the first (shortest) sequence is passed to “len”, and the resulting length is handed to “range”, whose result we return.

So if the shortest sequence is ‘abc’, we’ll end up returning range(3), giving us indexes 0, 1, and 2 — perfect for our needs.

Now, there’s one thing left to do here to make it a bit closer to the real “zip”: As I mentioned above, Python 2’s “zip” returns a list, but Python 3’s “zip” returns an iterator object. This means that even if the resulting list would be extremely long, we won’t use up tons of memory by returning it all at once. Can we do that with our comprehension?

Yes, but not if we use a list comprehension, which always returns a list. If we use a generator expression, by contrast, we’ll get an iterator back, rather than the entire list. Fortunately, creating such a generator expression is a matter of just replacing the [ ] of our list comprehension with the ( ) of a generator expression:

>>> def shortest_sequence_range(*args):
      return range(len(sorted(args, key=len)[0]))

>>> g = ((s[i], t[i])
         for i in shortest_sequence_range(s,t) )

>>> for item in g:
        print(item)
('a', 10)
('b', 20)
('c', 30)

And there you have it!  Further improvements on these ideas are welcome — but as someone who loves both “zip” and comprehensions, it was fun to link these two ideas together.

Fun with floats

I’m in Shanghai, and before I left to teach this morning, I decided to check the weather.  I knew that it would be hot, but I wanted to double-check that it wasn’t going to rain — a rarity during Israeli summers, but not too unusual in Shanghai.

I entered “shanghai weather” into DuckDuckGo, and got the following:

Never mind that it gave me a weather report for the wrong Chinese city. Take a look at the humidity reading!  What’s going on there?  Am I supposed to worry that it’s ever-so-slightly more humid than 55%?

The answer, of course, is that many programming languages have problems with floating-point numbers.  Just as there’s no terminating decimal representation of 1/3, lots of numbers (0.1 among them) have no terminating representation in binary, which is what computers use.

As a result, floats are inaccurate.  Just add 0.1 and 0.2 in many programming languages, and prepare to be astonished.  Wait, you don’t want to fire up a lot of languages? Here, someone has done it for you: http://0.30000000000000004.com/ (I really love this site.)
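
In Python, it looks like this:

>>> 0.1 + 0.2
0.30000000000000004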

If you’re working with numbers that are particularly sensitive, then you shouldn’t be using floats. Rather, you should use integers, or use something like Python’s decimal.Decimal, which guarantees accuracy at the expense of time and space. For example:

>>> from decimal import Decimal
>>> x = Decimal('0.1')
>>> y = Decimal('0.2')
>>> x + y
Decimal('0.3')
>>> float(x+y)
0.3

Of course, you should be careful not to create your decimals with floats:

>>> x = Decimal(0.1)
>>> y = Decimal(0.2)
>>> x + y
Decimal('0.3000000000000000166533453694')

Why is this the case? Let’s take a look:

>>> x
Decimal('0.1000000000000000055511151231257827021181583404541015625')

>>> y
Decimal('0.200000000000000011102230246251565404236316680908203125')

So, if you’re dealing with sensitive numbers, be sure not to use floats! And if you’re going outside in Shanghai today, it might be ever-so-slightly less humid than your weather forecast reports.

Speedy string concatenation in Python

As many people know, one of the mantras of the Python programming language is, “There should be one-- and preferably only one --obvious way to do it.”  (Use “import this” in your Python interactive shell to see the full list.)  However, there are often times when you could accomplish something in any of several ways. In such cases, it’s not always obvious which is the best one.

A student of mine recently e-mailed me, asking which is the most efficient way to concatenate strings in Python.

The results surprised me a bit — and gave me an opportunity to show her (and others) how to test such things.  I’m far from a benchmarking expert, but I do think that what I found gives some insights into concatenation.

First of all, let’s remember that Python provides us with several ways to concatenate strings.  We can use the + operator, for example:

>>> 'abc' + 'def'
'abcdef'

We can also use the % operator, which can do much more than just concatenation, but which is a legitimate option:

>>> "%s%s" % ('abc', 'def')
'abcdef'

And as I’ve mentioned in previous blog posts, we also have a more modern way to do this, with the str.format method:

>>> '{0}{1}'.format('abc', 'def')
'abcdef'

As with the % operator, str.format is far more powerful than simple concatenation requires. But I figured that this would give me some insights into the relative speeds.

Now, how do we time things? In Jupyter (aka IPython), we can use the “%timeit” magic command to time code.  I thus wrote four functions, each of which concatenates in a different way. I purposely used global variables (named “x” and “y”) to contain the original strings, and a local variable “z” in which to put the result.  The result was then returned from the function.  (We’ll play a bit with the values and definitions of “x” and “y” in a little bit.)

def concat1(): 
    z = x + y 
    return z 

def concat2(): 
    z = "%s%s" % (x, y) 
    return z 

def concat3(): 
    z = "{}{}".format(x, y) 
    return z 

def concat4(): 
    z = "{0}{1}".format(x, y) 
    return z

I should note that concat3 and concat4 are almost identical, in that they both use str.format. The first uses the implicit locations of the parameters, and the second uses the explicit locations.  I decided that if I’m already benchmarking string concatenation, I might as well also find out if there’s any difference in speed when I give the parameters’ indexes explicitly.

I then defined the two global variables:

x = 'abc' 
y = 'def'

Finally, I timed running each of these functions:

%timeit concat1()
%timeit concat2()
%timeit concat3()
%timeit concat4()

The results were as follows:

  • concat1: 153ns/loop
  • concat2: 275ns/loop
  • concat3: 398ns/loop
  • concat4: 393ns/loop

From this benchmark, we can see that concat1, which uses +, is significantly faster than any of the others.  Which is a bit sad, given how much I love to use str.format — but it also means that if I’m doing tons of string processing, I should stick to +, which might have less power, but is far faster.

The thing is, the above benchmark might be a bit problematic, because we’re using short strings.  Very short strings in Python are “interned,” meaning that they are defined once and then kept in a table so that they need not be allocated and created again.  After all, since strings are immutable, why would we create “abc” more than once?  We can just reference the first “abc” that we created.
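
Here’s a quick illustration of interning; note that this is a CPython implementation detail, and the exact behavior can vary between versions:

>>> a = 'abc'
>>> b = 'abc'
>>> a is b            # short, identifier-like literals are typically interned
True
>>> c = 'abc' * 10000
>>> d = 'abc' * 10000
>>> c is d            # long, computed strings are separate objects
False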

This might mess up our benchmark a bit.  And besides, it’s good to check with something larger. Fortunately, we used global variables — so by changing those global variables’ definitions, we can run our benchmark and be sure that no interning is taking place:

x = 'abc' * 10000 
y = 'def' * 10000

Now, when we benchmark our functions again, here’s what we get:

  • concat1: 2.64µs/loop
  • concat2: 3.09µs/loop
  • concat3: 3.33µs/loop
  • concat4: 3.48µs/loop

Each loop took a lot longer — but we see that our + operator is still the fastest.  The difference isn’t as great, but it’s still pretty obvious and significant.

What about if we no longer use global variables, and if we allocate the strings within our function?  Will that make a difference?  Almost certainly not, but it’s worth a quick investigation:

def concat1(): 
     x = 'abc' * 10000 
     y = 'def' * 10000 
     z = x + y 
     return z 

def concat2(): 
     x = 'abc' * 10000 
     y = 'def' * 10000 
     z = "%s%s" % (x, y) 
     return z 

def concat3(): 
     x = 'abc' * 10000 
     y = 'def' * 10000 
     z = "{}{}".format(x, y) 
     return z 

def concat4(): 
     x = 'abc' * 10000 
     y = 'def' * 10000 
     z = "{0}{1}".format(x, y) 
     return z 

And our final results are:

  • concat1: 4.89µs/loop
  • concat2: 5.78µs/loop
  • concat3: 6.22µs/loop
  • concat4: 6.19µs/loop

Once again, we see that + is the big winner here, but (again) by less of a margin than was the case with the short strings.  str.format is clearly slower.  And we can see that in all of these tests, the difference between “{0}{1}” and “{}{}” in str.format is basically zero.

Upon reflection, this shouldn’t be a surprise. After all, + is a pretty simple operator, whereas % and str.format do much more.  Moreover, str.format is a method, which means that it’ll have greater overhead.
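
If you want to see the difference for yourself, then in Python 3 the “dis” module will happily compile and disassemble a string of source code (the opcode names vary from version to version):

import dis

dis.dis("x + y")                   # a few name loads and a single add opcode
dis.dis("'{}{}'.format(x, y)")     # a method lookup and a call, before any formatting happens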

Now, there are a few more tests that I could have run — for example, with more than two strings.  But I do think that this demonstrates to at least some degree that + is the fastest way to achieve concatenation in Python.  Moreover, it shows that we can do simple benchmarking quickly and easily, conducting experiments that help us to understand which is the best way to do something in Python.