The relative speeds of str.format and %

My most recent blog post talked about the use of str.format instead of the % operator for interpolating values into strings. Some people who read the post wondered about their relative speeds.

I should first note that my first response to this is: I don’t really care that much. I’m not saying that speed isn’t important, or that optimization should never be done. Rather, my philosophy is that people are expensive and computers are cheap — and thus, anything we do to make people more productive, even if that comes at the expense of program speed, is probably fine.

Of course, that’s not always going to be true. Sometimes, you need (or just want) to squeeze more out of your computer. And to be a good programmer, you also need to know the relative advantages and disadvantages of the techniques you’re using.

So I decided to run a few, quick benchmarks on the relative speeds of str.format and %.  Sure enough, the % operator was a lot faster.  I ran my benchmarks the magic %timeit command that is built into the IPython interactive shell.  (If you’re using Python and aren’t using IPython, you should really switch ASAP.)  Note that in order to make things easier to read, I’m removing the IPython input and output prompts, and using >>> to denote where I entered text.

>>> name = 'Reuven'
>>> %timeit 'Hello there, {}'.format(name)
1000000 loops, best of 3: 243 ns per loop

>>> %timeit 'Hello there, %s' % name
10000000 loops, best of 3: 147 ns per loop

Wow.  As you can see, %timeit executed each of these lines of code 1,000,000 times. It then gave the average speed per loop. The % operator was, on average, about 100 ns faster than str.format. That shouldn’t come as a huge surprise, given that % is an operator (and thus doesn’t require a method call), doesn’t handle indexes and attributes, and can (I’m guessing) pass a great deal of its work off to C’s printf function.

Then again, is 100 ns really that long to wait for a formatted string?  I’m not so sure, to be honest.

What happens if we perform more than one interpolation?

>>> first = 'Reuven'
>>> last = 'Lerner'
>>> %timeit 'Hello there, {} {}'.format(first, last)
1000000 loops, best of 3: 371 ns per loop

>>> %timeit 'Hello there, %s %s' % (first, last)
1000000 loops, best of 3: 243 ns per loop

Each of these takes significantly longer to run than was the case with a single replacement. The difference between them continues to be about 120 ns per assignment — still not something to worry about too much, but the difference does exist.

What if I make the strings space-padded?

>>> %timeit 'Hello there, {:10} {:15}' % (first, last)
1000000 loops, best of 3: 459 ns per loop

>>> %timeit 'Hello there, %10s %15s' % (first, last)
1000000 loops, best of 3: 254 ns per loop

Now we see an even starker difference between the two ways of handling things. What about something like floating-point math, which takes longer?

>>> import math
>>> %timeit 'I love to eat {}'.format(math.pi)
1000000 loops, best of 3: 587 ns per loop

>>> %timeit 'I love to eat %f' % math.pi
1000000 loops, best of 3: 354 ns per loop

Limiting the number of decimals shown doesn’t seem to change the outcome very much:

>>> %timeit 'I love to eat {:.3}'.format(math.pi)
1000000 loops, best of 3: 582 ns per loop

>>>%timeit 'I love to eat %.3f' % math.pi
1000000 loops, best of 3: 329 ns per loop

UPDATE: Several people on Reddit took me to task for failing to consider the overhead of the str.format method call.  I mentioned this briefly above, but should have realized that there was an easy to to avoid this, namely aliasing the attributes (the method str.format and the float math.pi) to local variables:

>>> f = 'I love to eat {:.3}'.format
>>> p = math.pi
>>> %timeit f(p)
1000000 loops, best of 3: 489 ns per loop

>>> %timeit 'I love to eat %f' % p
1000000 loops, best of 3: 370 ns per loop

We still see significant overhead. Again, I’m guessing that a lot of this has to do with the overhead of a method vs. an operator. I’m not about to start looking at the bytecodes; this wasn’t meant to be a super-deep investigation or benchmark, but rather a quick check and comparison, and I think that on that front, it did the trick.

So, what have we learned?

  • Yes, str.format is slower than %.
  • The number of parameters you pass to str.format, and whether you then adjust the output with padding or a specified number of decimals, significantly influences the output speed.
  • That said, in many programs, the difference in execution speed is often 100 ns, which is not enough to cause trouble in many systems.

If speed is really important to you, then you should probably use %, and not str.format. However, if speed is more important than the maintainability or readability of your code, then I’d argue that Python is probably a poor choice of programming language.

Teaching an old dog new tricks — or, how I learned to love Python’s str.format, and gave up on %

I have been programming in Python for many years. One of the things that I wondered, soon after starting to work in Python, was how you can get Perl-style variable interpolation. After all, Perl (like the Unix shell) supports two types of quotes — single quotes (in which everything is taken literally) and double quotes (in which variables’ values are inserted). Thus, in Perl, you can do something like:

$name = 'Reuven';
print "Hello, $name\n";

And sure enough, it’ll print “Hello, Reuven”.

Because single and double quotes are equivalent in Python (so long as they come as a matched set), there is no variable interpolation. The technique that I learned years ago, when I started with Python, was that you could use the % operator on a string. In this context, % looks to the string on its left, determines how many values within the string need to be replaced, and then looks right to find those values. It then returns a new string, effectively interpolating the values. For example:

>>> name = 'Reuven'
>>> "Hello, %s" % name

'Hello, Reuven'

The above Python code works just fine, returning a string with a nice, personalized greeting. And indeed, for the length of my time working with Python, I have enjoyed using this % syntax. Yes, it’s a bit weird. And no, I cannot ever remember more than the absolute basics of printf’s various % codes, meaning that I either make everything a string (with %s), or I guess, or I look up the printf codes and what they do. But to be honest, I normally just use %s, and thus benefit additionally from the fact that Python will silently invoke “str” on the parameter.

The thing is, % is supposedly going away, or is at least deprecated. (A note on the python-dev list indicates that % will go away no sooner than 2022, which is a heckuva long time from now.) As of Python 2.6, not to mention Python 3.x, we have been told that it will eventually disappear, and that we shouldn’t use % any more. Instead, we should use the str.format method.

I have always mentioned str.format to my Python classes, but to be honest, I’ve usually relied upon % when giving live demonstrations and answering questions. And I would even encourage my students to use the % syntax, in part because I found it to be so much easier.

And yet.  I knew that I was doing something wrong, and I knew that I was probably misleading my students to some degree. Thus, in the last three classes I taught, I started to push harder in the direction of str.format. And that’s when I realized two things: (1) It’s just as easy as %, and actually easier in some ways, and (2) I hadn’t learned enough about str.format to use it, beyond the simplest ways. I thus spent a great deal of time researching it — and found out that str.format, while it takes some getting used to, is more than worth the effort.

Let’s start with the simplest case. I’d like to be able to say “Good morning” to someone, using both their first and last names. Assuming that I have variables named “first” and “last”, I can do this with the old syntax as follows:

>>> first = 'Reuven'
>>> last = 'Lerner'
>>> "Good morning, %s %s" % (first, last)

'Good morning, Reuven Lerner'

In this example, we already see one of the problems with the % syntax — namely, that if we have more than one value, we need to put it into a tuple. Perhaps this is logical and reasonable from Python’s perspective, but I can assure you that it surprises many of my students.

So, how would we do it using str.format? Pretty similarly, in many ways:

>>> "Good morning, {} {}".format(first, last)

'Good morning, Reuven Lerner'

Notice that we’ve changed things a bit here. No longer are we invoking a binary operator (%) on the string. Rather, we’re invoking a string method that takes a set of parameters. This is more logical and consistent. I can’t tell you how many of my students think that % is somehow connected to “print”, when in fact it’s connected (in the case of string formatting) to strings. Having to use put the “.format” at the end of the string makes the method call more obvious.

As you might already know, the “{} {}” in the string tells str.format to take its two parameters, and to insert them, in order, into the string. Because there are two arguments, we can only have two {} inside of the string. This is a bit harder to understand, both because having {} in Python reminds many people of a dictionary, and because the empty curly braces look a bit weird. But fine, I can live with that, and got used to it very quickly.

Where str.format quickly shows its advantages over %, however, is if I want to display the input parameters in reverse order. When I use %, there is no real way to do that. Plus, if I want to reuse a value passed to %, I cannot do so. With str.format, I can swap the order in which the inputs are displayed:

>>> "Good morning, {1} {0}".format(first, last)

'Good morning, Lerner Reuven'

Notice what happened in the above: If I just use “{} {}”, then str.format uses the two parameters in order. However, I’m also able to treat the parameters as a sequence, with indexes starting at 0. I can then insert them in reverse order, as I did above, or in the regular order:

>>> "Good morning, {0} {1}".format(first, last)

'Good morning, Reuven Lerner'

Note that if you explicitly state the field numbers, then you cannot rely on the automatic numbering.

Of course, this lets me also pass a sequence of values to be inserted, so long as we then use the splat (*) operator on it, to turn it into a parameter list:

>>> names = ('Reuven', 'Lerner')
>>> "Good morning, {} {}".format(*names)

'Good morning, Reuven Lerner'

You can also call str.format with keyword arguments. When you do this, you can then put a keyword name within the {}:

>>> "Good morning, {first} {last}".format(first='Reuven', last='Lerner')

'Good morning, Reuven Lerner'

The above really appeals to me. The named parameters are explicit (if a bit long), and the use of {first} and {last} is quite readable — certainly more so than %(first)s ever was with the % operator!

I can, of course, also pass a dictionary of names, using the ** operator to turn it into a set of keyword arguments:

>>> person = {'first':'Reuven', 'last':'Lerner'}
>>> "Good morning, {first} {last}".format(**person)

'Good morning, Reuven Lerner'

I described all of these to my students in the last month, and I was pleasantly surprised to see how comfortable they were with the syntax. I’m sure that this reflects, to some degree, my comfort with the syntax, as well.

I should note that you can combine numeric and keyword arguments when working with str.format. I really suggest that you not do so. The results would look like this:

>>> person = {'first':'Reuven', 'last':'Lerner'}
>>> "Good {0}, {first} {last}".format('morning', **person)

'Good morning, Reuven Lerner'

Yukko.

Now, the one thing that would appear to be missing from str.format is… well, formatting! The bad news is that str.format has a completely and different way of indicating how you want to format output. The good news is that it’s not too hard to learn and understand.

Let’s start with the easiest part: If you want to display a string within a fixed-width field, then you can do so by adding a colon (:) and then a number.  So to put my name in a fixed-width field of 10 spaces, we would say:

>>> "Your name is {name:10}".format(name="Reuven")

'Your name is Reuven    '

(Notice the trailing spaces after my name.)

In the above example, my name is left-justified. If I want it to be right-justified, I could use a > sign between the : and the number:

>>> "Your name is {name:>10}".format(name="Reuven")

'Your name is     Reuven'

And yes, I could have used an optional < symbol to say that my name should be left-justified within the field of 10 spaces in the first example.  Or I could center the text in a field of 10 spaces with the ^ specifier instead of < or >.

To pad the string with something other than a space, we specify it before the <, >, or ^ character. For example, if I’m moving to Hollywood, then perhaps I should do something like this:

>>> "Your name is {name:*^10}".format(name="Reuven")

'Your name is **Reuven**'

If I want to put the string in the (default) left-most position of the string, filling with characters on the right, then I must use the < specifier, so that the text will be on the left, and the stars on the right.

So it’s pretty clear that str.format is pretty snazzy when it comes to text. How about numbers? I wasn’t really sure how things would work here, but it turns out that they’re also quite straightforward. If you’re displaying integers, then you can go ahead and say:

>>> "The price is ${number}.".format(number=123)

'The price is $123.'

So far, we don’t see any difference between passing an integer and a string. And indeed, they share many characteristics. However, we might want to display an integer in a different way. We can do that using one of the (many) modifiers that str.format provides — letters placed just before the end of the closing } character. For example, we can get the price in binary (with a trailing “b”), or in hexadecimal (with a trailing “x”), as in the following example:

>>> "The price is ${number:x}.".format(number=123)

'The price is $7b.'

Of course, we can also zero-pad the number, such that it will always take up a fixed width. Just place a 0 between the colon and the width:

>>> "Your call is important to us. You are call #{number:05}.".format(number=123)

'Your call is important to us. You are call #00123.'

Notice that inside of the {}, we cannot put executable Python code. Instead, there is a mini-language that is separate and different from Python. However, there are two small exceptions to this rule: (1) We can retrieve any attribute with the standard . notation, and (2) we can retrieve a single item with the [] notation.

For example:

>>> class Foo(object):
        def __init__(self):
        self.x = 100
>>> f = Foo()
>>> 'Your number is {o.x}'.format(o=f)

'Your number is 100'n

Notice how we were able to retrieve the “x” attribute from the “f” object, which we mapped to “o” within the string. However, while you can retrieve an attribute, you cannot execute it. Thus, the following will not work:

>>> "Your name is {name.upper()}".format(name="Reuven")

AttributeError: 'str' object has no attribute 'upper()'

See what happened? I said “name.upper()”, in order to execute the method “str.upper” on “name”.  However, Python doesn’t want me to execute code there. So it takes the name of the attribute literally — and thus complained that there is no attribute “upper()”, with the parentheses. Of course, if you try it without the parentheses, it’ll work, for some value of “work”:

>>> "Your name is {name.upper}".format(name="Reuven")

'Your name is <built-in method upper of str object at 0x1028bf2a0>'

Similarly, we can retrieve an individual element of a sequence or mapping with []. However, we cannot use the slice notation for more than one element. For example:

>>> "Your favorite number is {n[3]}.".format(n=numbers)

'Your favorite number is 3.'

However:

>>> "Your favorite numbers are {n[2:4]}.".format(n=numbers)

ValueError: Missing ']' in format string

The “:” character, which we use for slices, isn’t available in format strings, because it’s used to control the formatting of the output.

You can, of course, use [] on a dictionary, as well. However — and this is a bit weird for Python — we omit the quote marks, even when our key is a string. For example:

>>> person = {'first':'Reuven', 'last':'Lerner'}
>>> "Your name is {p[first]}.".format(p=person)

'Your name is Reuven.'

If we were to include the quotes…

>>> "Your name is {p['first']}.".format(p=person)

KeyError: "'first'"

There is actually a lot more to str.format than what I have shared here. In particular,  each type has its own format specifications, which means that you can do certain things with floats (e.g., setting the precision) that you cannot do with strings.

You can even add formatting functionality to your own Python classes, such that they’ll be displayed in the way that you want, along with format specifiers that you define.

If you want to learn more about this, I’d definitely suggest reading PEP 3101, which describes str.format. I’d also suggest a slide show by Eric Smith, which summarizes things nicely. Finally, the Python documentation has some excellent examples, including a guide for moving from % to str.format.

I hope that this was helpful and useful! If you enjoyed this blog post, check out my many other resources, including my free e-mail course on Python scoping, and my free Webinar on functional programming in Python.

Three Pythonic products: A (free) Webinar, a course, and an ebook

I love developing software.  I also love helping people to learn how to develop better. That’s why I have been teaching programming classes for more than a decade, and why I write about programming. There is so much to learn; it’s a rare day on which I don’t learn something new, and it’s a rare week in which I don’t apply some new understanding to a problem that I’m solving for a client.

Now that I have finished my PhD, I have some more time to focus on creating products that I believe will help people to program better. I’m pleased to announce three initial such products, all aimed at Python developers who want to improve their skills:

  1. A free Webinar (“How functional programming will make you a better Python programmer“), scheduled for Monday, September 15th, at 2 p.m. Eastern Time. The Webinar will consist of a 45-minute presentation, followed by a Q&A period. My goal is to describe functional programming, and then demonstrate some of the functional techniques that are available in Python. If you are a Python programmer who has heard about functional programming but doesn’t really know what it is, then I think you’ll get a lot out of this free seminar. Please register and attend! Even if you’re not interested in my other products, I’d love to see you at this Webinar — if only to learn what topics people do want to learn, and what I should address in future Webinars, books, and online classes.
  2. For those who want to dig deeper into functional programming in Python, I’ll be giving a full-day, live, online course on the subject. This course, which will include lectures, demos, and many exercises, begins with an introduction to functions in Python, then describes how to pass functions as parameters, list/set/dict comprehensions, customizing list.sort and sorted, lambda, map/filter/zip, reduce, and the “functools” module. This class, like all of my classes, will be intense and highly interactive — but will give you a new perspective on programming in general, and particularly in Python. You can register for this one-day course at EventBrite; it’ll be given on Sunday, September 21st and again on Tuesday, September 23rd.
  3. If you have taken a Python course, but feel that you need to practice Python more thoroughly before you feel like you’re truly fluent in the language, then you might be interested in my ebook, “Practice Makes Python.” The book will contain about 50 exercises in various aspects of Python, with the aim of helping you to learn through doing. I hope to release the ebook about a month from now, and am basing many of the exercises on my courses, as well as comments I’ve received from many of my students. If you’re interested in the book, you should sign up for the announcement list, where I’ll be providing sample content and behind-the-scenes information.

These products (well, the paid ones, anyway) come with a full, 100% money-back guarantee. And of course, if you have questions, you can always contact me via e-mail.

The easiest way to return to the last Git branch

I don’t know about you, but it’s common for me to switch between branches in Git.  After all, that’s one of the main advantages of using Git — the incredible ease with which you can create and merge branches.

Just a few minutes ago, I was in the “adwords” branch of an application I’m working on.  I wanted to go back to “master”, make sure that I hadn’t missed any commits from the central repository, and then go back to “adwords”.  If there were any commits in “master” that I was missing in “adwords”, I figured that I would just rebase from “master” to “adwords”.

So I did a “git checkout master”, and found that I was up to date with the central Git server. For reasons that I can’t explain, I then decided to try something out: Instead of returning to the previous branch with “git checkout adwords”, I instead typed

git checkout -

(That’s a minus sign after the word “checkout.”)

Sure enough, I returned to the “adwords” branch!

Now, there is a fair amount of logic to this: In the Unix shell, you can typically return to the previous directory with “cd -“, which has proven to be quite useful over the years.  In Git, of course, branches are just aliases to commits.  So “git checkout -” is returning you to the previous branch, but it’s really just taking you back to whatever the last commit was that you worked on.

I just checked the Git documentation (“git checkout –help”), and it seems that this is a special case of a more generalizable command structure:

As a special case, the "@{-N}" syntax for the N-th last branch/commit checks out 
branches (instead of detaching). You may also specify - which is synonymous
with "@{-1}".

I can’t imagine wanting to tell Git to return to the 5th-most-recent branch that I worked on, so this generalized formula seems a bit much to me.

I predict that this trick will save me precious seconds every day, all of which I squandered in writing this blog post.  But I do think that this is a super-cool trick and feature, and demonstrates once again how clever and useful Git is.

Remember: Don’t mix = and NULL

I was recently performing an analysis on the database for my dissertation software, and wanted to look at one particular set of rows.  The table I was looking at is called Nodes, and because each node (represented by a row in the table) can optionally belong to a group, the group_id column allows for NULL values.

So, if I want to find all of the nodes in group 2, I can say

SELECT * FROM Nodes WHERE group_id = 2;

My data consists (of course) of several tables, including Versions, which has a many-to-one relationship with Nodes.  If I want all of the versions for nodes in group 2, I can say:

SELECT COUNT(*) FROM Versions WHERE node_id in (SELECT id FROM Nodes WHERE group_id = 2);

Now, let’s say that I want to get all of the versions that are not from group 2.  How would I do that?  Well, one way would be:

SELECT COUNT(*) FROM Versions WHERE node_id 
   NOT IN (SELECT id FROM Nodes WHERE group_id = 2);

But at some point, I flipped the logic, turning the NOT IN into IN, and = into <>:

SELECT COUNT(*) FROM Versions WHERE node_id 
  IN (SELECT id FROM Nodes WHERE group_id <> 2);

That, of course, was a bug.  That’s because SQL uses trinary (or ternary) logic, not the binary true-false logic to which we’re accustomed.  The result of comparing anything with NULL results in a NULL.  For example, consider the following table:

select * from truth_test ;
+--------+
| val    |
+--------+
| t      |
| f      |
| [null] |
+--------+

I can create a truth table pretty easily:

SELECT val, val = true as is_true, val=false as is_false, 
       val=null as equal_null, val is null as is_null 
FROM truth_test;
+--------+---------+----------+------------+---------+
| val    | is_true | is_false | equal_null | is_null |
+--------+---------+----------+------------+---------+
| t      | t       | f        | [null]     | f       |
| f      | f       | t        | [null]     | f       |
| [null] | [null]  | [null]   | [null]     | t       |
+--------+---------+----------+------------+---------+

You can see, then, that TRUE is equal to TRUE, and FALSE is equal to FALSE, which we would expect.  But using the = operator with NULL is always going to result in NULL — even with itself!

If you want to compare with NULLs, then you need to use IS NULL.  (Or you can use a function such as COALESCE, which returns the first non-NULL value passed to it.)  And that’s the mistake that I made in my queries yesterday: I was trying to get all of the nodes for which group_id was not 2.  But by using =, I inadvertently excluded all of the rows for which group_id was NULL — in other words, all of the nodes that weren’t assigned to a group.  Which, it turns out, was the majority of them.  Whoops.

My solution was to create a view containing all of the IDs that I want to exclude:

CREATE OR REPLACE VIEW group2_model_ids
AS
SELECT nodes.id
FROM nodes
WHERE nodes.group_id = 2;

Now, if I want the number of versions that aren’t in group 2, I can use a subselect, as follows:

SELECT COUNT(*) 
FROM Versions 
WHERE node_id NOT IN (SELECT id FROM group2_model_ids);

Of course, I could have simply rewritten the query, either to use COALESCE or to check for both NULL and not-NULL values.  But I prefer this way, especially since I’m only dealing with a few tens of thousands of rows of data in this table, in part because it gives me a name and abstraction that I can use in future queries.

Book Review: The Undercover Economist Strikes Back

In order to get an undergraduate degree from MIT, at least when I was there, you needed to take a certain number of humanities and social-science courses.  This was to stop you from coming out a complete one-dimensional student; the idea was that rounding out your education with knowledge from other fields was good for you as a person, and also good for you as an engineer or scientist.  (And yes, I realize that not everyone at MIT studied science or engineering, but those were the overwhelming favorites.)  One of the most popular social sciences that people took was economics — which is a social science, although MIT’s version actually included a great deal of math.

At the time, I tended to be quite dismissive of economics.  I didn’t think that it could possibly be interesting, or why so many of my friends were taking so many courses in that field.  What insights could they gain?

And then, just before I graduated, MIT Press had one of its amazing sales on overstock books.  I bought a book by a then-professor at MIT, named Paul Krugman, called “The Age of Diminished Expectations.”  Reading this book was quite a revelation for me; I suddenly realized that economics was complex, fascinating, and described the world in all sorts of interesting ways.  For years, I read and followed Krugman’s writing, in Slate and then (of course) the New York Times, gleaning what I could about economics.  (I also happen to subscribe to many of his political views, but that’s secondary.)  Whenever I could find an interesting and well-written book about economics, I would get it, because I found it to be so interesting and compelling.

Several years ago, a friend asked if I had read “The Undercover Economist,” by Tim Harford.  I hadn’t, but decided that perhaps it was worth a read, and found it to be delightful, but in a different way from Krugman. Harford isn’t an economics researcher, but he knows just how to put economics research into words and a perspective that everyone can understand.  His examples are often drawn from pop culture, and he’s able to distill the academic debates and intrigue to their essence.  The fact that he’s very funny only adds to his appeal.  I’ve since become quite a fan of Harford’s, listening (among other things) to the “More or Less” podcast from the BBC that he hosts, a sort of Mythbusters of statistics (Mathbusters?).

So it should come as no surprise that I ordered his latest book, “The Undercover Economist Strikes Back,” almost as soon as it came out earlier this year.  I just read it cover to cover over the weekend, and came away delighted.  As someone who has been reading Krugman’s work for years, and who also listens to NPR’s excellent Planet Money podcast, I can’t say that there was a huge amount of new information in this book.  But it was written so well, and put things into such nice context, that this doesn’t matter.

Harford has a gift for making economics not only understandable, but also interesting and relevant to our own lives.  In The Undercover Economist, he describes microeconomics, which describes how businesses and individuals respond to incentives.  In this new book, he describes macroeconomics, which is a different kettle of fish altogether — it’s about how governments and economies work.  If you think of macroeconomics as a complex system, then it’s no surprise that the aggregate behaves differently from its individual, constituent agents. (This, it took me many years to learn, is a much better explanation than what economics instructors tell their students, which is simply that “macro is different from micro.”)

The book talks about all sorts of great stuff, starting with recessions, moving onto unemployment, and covering a large number of topics that are in the newspaper each day, that affect each and every one of us, and which probably seem strange or detached from our reality, but which are actually quite important — particularly if you’re in a democracy, and need to separate real economics from crazy talk.

Harford includes a great definition and discussion of what money is, and brings up the famous story of the island of Yap, which used huge, largely immovable stones as money.  He also introduces the different schools of thought on the subject, and where (and how) they differ — and how much of what politicians in the US and Europe have been saying and doing over the last five years has been foolish or misplaced.

The question-and-answer format in which he wrote the book is a little tedious, but much less than I expected it to be.  Really?  Yes, really.

In my mind, perhaps the topic that was most obviously missing from the book was a discussion of currency, and how that can affect an economy.  If you live in the US, or even in England or Europe, you can largely ignore currency issues.  Sure, there are exchange rates, and yes, they affect you to some degree, but it’s not a huge deal.

In Israel, by contrast, the exchange rate is  a huge deal, because Israel imports and exports so much.  The dollar’s rise and fall affects everyone, from high-tech software companies to people shopping at the supermarket. The ways in which the Bank of Israel played with exchange rates and buying dollars, just to keep things relatively stable (while claiming that they were doing no such thing) are impressive, and point to the sorts of challenges that small, trade-oriented economies have but that large ones don’t.  I’m not sure if this was an omission due to time or space constraints, or if as someone living in England, Harford hasn’t had to think or worry much about currency issues.

I’ve changed my tune 100 percent since I was an undergrad; i now find economics to be totally fascinating, and very much enjoy reading the sorts of book that Harford has put out.  If you’ve always wondered what macroeconomics is, or what the newspapers is talking about when they mentioned recessions, or whether the politicians suggesting budget cuts during the latest recession were saying the most obviously brilliant thing or the most foolish thing imaginable, Harford’s book is a fun, interesting read, and is highly recommended.

 

Turning a PostgreSQL array to rows

I’ve spent the last few blog posts in this series talking about PostgreSQL’s arrays — how to create them, query them, and even create them with the ARRAY function.  But there are cases in which I would like to do the opposite — turn a PostgreSQL array to rows.

There are a number of reasons why I might want to do this. For example, perhaps I want to find the intersection between two arrays.  The INTERSECT operator in PostgreSQL can easily do that with two sets of rows — but there isn’t any equivalent for arrays.  Similarly, the UNION operator lets me join two sets of rows, but nothing like that exists for arrays.  And of course, there are times when I might want to take the elements of an array and run a JOIN between them and the rows of an existing table.

The key to all of these is the UNNEST function.  You need to be a bit careful when you’re using UNNEST, because (as always with computers) PostgreSQL will do what you tell it to do, not necessarily what you want it to do.  For example, if I have the following table:

select * from foo;
+----+---------------+
| id | stuff         |
+----+---------------+
| 1  | {abc,def,abc} |
| 2  | {abc,xyz}     |
+----+---------------+

Let’s say that I use UNNEST on our first row:

SELECT UNNEST(stuff) FROM Foo where id = 1;[local]/reuven=# SELECT UNNEST(stuff) FROM Foo where id = 1;
+--------+
| unnest |
+--------+
| abc    |
| def    |
| abc    |
+--------+

Sure enough, the three elements of “stuff” where id is 1 have now become three rows. We get the results as a set of rows, which we can then capture in a table, query in a subquery, or turn back into an array, if we want:

SELECT ARRAY(SELECT UNNEST(stuff) FROM Foo where id = 1);
+---------------+
| array         |
+---------------+
| {abc,def,abc} |
+---------------+

This example might seem pointless, but it raises some additional possibilities.  For example, let’s say that I want to ensure that the elements of my array are unique.  I can do that by applying the DISTINCT operation on the result of UNNEST, and then packing those distinct rows back into an array:

SELECT ARRAY(SELECT DISTINCT UNNEST(stuff) FROM Foo where id = 1);
+-----------+
| array     |
+-----------+
| {abc,def} |
+-----------+

It’s this sort of operation that has led me to really love PostgreSQL arrays.  If I use them as actual arrays, then I feel bad, as though I’ve betrayed the entire world of normalized, relational databases. But if I think of arrays as frozen sets of rows, which I can then unfreeze (with UNNEST), manipulate, join, or otherwise modify, and then re-freeze (with ARRAY) as necessary, then things suddenly get much clearer and easier.

I’m in the last weeks of my PhD research and writing, and I’ve been using no small number of arrays in my research and analysis.  I wouldn’t dream of using arrays to store actual combinations of data that could otherwise be normalized and stored in a reasonable way.  But I am constantly creating views that turn complex queries into arrays, which I can then further query, nest, and unnest as possible.

I’m just about done with this series; next time, I’ll show how I can combine arrays and CTEs to perform the sorts of operations that I described above, without having to pollute my database namespace with lots of unnecessary views.

Turning PostgreSQL rows into arrays

So far in this series, I have described how you can create arrays and retrieve information from them — both the actual data stored in the array, and information about the array, such as its length. But the coolest trick, or set of tricks, that I use in PostgreSQL is the ability to turn arrays into rows, and vice versa.  This is where, in my mind, arrays start to get truly powerful, and allow us to do all sorts of things that would otherwise be quite difficult, or require another language. For example, let’s say that I want to keep track of my appointments in a database.  I create an Appointments table:

CREATE TABLE Appointments (id SERIAL,
 meeting_at TIMESTAMP,
 description TEXT
 );

And then I enter information into this table, such that I have the following appointments:

[local]/reuven=# select * from appointments order by meeting_at;
 +----+---------------------+---------------+
 | id | meeting_at          | description   |
 +----+---------------------+---------------+
 | 5 | 2014-05-22 12:00:00  | Dissertation  |
 | 4 | 2014-05-22 14:00:00  | Devel meeting |
 | 1 | 2014-05-23 15:00:00  | Dissertation  |
 | 2 | 2014-05-23 17:00:00  | Dinner        |
 | 3 | 2014-05-23 23:00:00  | Sleep         |
 +----+---------------------+---------------+
 (5 rows)

If I want to find out what I’m doing on May 23rd, I can simply say:

[local]/reuven=# select * from appointments where meeting_at::date = '2014-may-23';
 +----+---------------------+--------------+
 | id | meeting_at          | description  |
 +----+---------------------+--------------+
 | 1 | 2014-05-23 15:00:00  | Dissertation |
 | 2 | 2014-05-23 17:00:00  | Dinner       |
 | 3 | 2014-05-23 23:00:00  | Sleep        |
 +----+---------------------+--------------+

Notice that in the above example, I’m using PostgreSQL’s casting operator (::) to turn the meeting_at column, which contains a TIMESTAMP value, into a DATE value.  Without that,  the comparison would have been at midnight. So far, this isn’t very exciting or surprising.  But let’s say that I want to get all of the descriptions of today’s appointments in a comma-separated list.

Normally, I would need to retrieve all of the rows, and then use another language to piece them together.  With PostgreSQL’s arrays, though, we can do this all in the database — which executes much more quickly, among other things. The way to accomplish this is with the ARRAY function.  Now, don’t confuse the ARRAY[] constructor with the ARRAY function, although it’s hard not to.  (To be honest, I’m not sure how PostgreSQL can tell the difference; it might be the square brackets vs. parentheses that tell it to do different things.)

The ARRAY function is built into PostgreSQL, and turns a set of rows into an array.  That is, ARRAY expects to get rows as its input, and returns a single row, containing a single array, as its output. Let’s start with a set of rows, containing the descriptions of today’s appointments.  I’m going to use PostgreSQL’s concatenation operator (||) to combine the time (using the ::time cast) of the appointment with the appointment itself.  My query looks like:

SELECT meeting_at::time || ' ' || description
FROM Appointments
WHERE meeting_at::date = '2014-may-23'
ORDER BY meeting_at;

The result I get is:

 +-----------------------+
 | ?column?              |
 +-----------------------+
 | 15:00:00 Dissertation |
 | 17:00:00 Dinner       |
 | 23:00:00 Sleep        |
 +-----------------------+

Now that I have my rows, I can turn them into an array.  Now, the ARRAY function needs to be invoked via a SELECT.  Thus, using ARRAY means that we’re using a subselect.  The inner SELECT is what we did above.  The outer one is just our call to ARRAY:

 

SELECT ARRAY(SELECT meeting_at::time || ' ' || description 
FROM Appointments 
WHERE meeting_at::date = '2014-may-23'
ORDER BY meeting_at);

And sure enough, we get a one-row, one-column result:

 +--------------------------------------------------------------+
 | array                                                        |
 +--------------------------------------------------------------+
 | {"15:00:00 Dissertation","17:00:00 Dinner","23:00:00 Sleep"} |
 +--------------------------------------------------------------+

You can see that our three rows have been turned into three elements of the array.  I do this all of the time when creating reports or summaries of other data: I write a view or CTE that grabs the data from the rows that are of interest to me, and stuffs that data into an array.  I can then check the array’s contents, length, or anything else that might be useful in my analysis. But earlier, I said that I want to have a comma-separated list of the appointment.  I currently have them in an array.  Fortunately, PostgreSQL’s array_to_string can save the day, acting similarly to the “join” method that I know and love from such languages as Ruby and Python.  This function, wrapped around a call to ARRAY, wrapped around our subselect, will then look like:

SELECT array_to_string( 
  ARRAY(SELECT meeting_at::time || ' ' || description
FROM Appointments
WHERE meeting_at::date = '2014-may-23'ORDER BY meeting_at), ', ');

And sure enough, the result is as follows:

 +--------------------------------------------------------+
 | array_to_string                                        |
 +--------------------------------------------------------+
 | 15:00:00 Dissertation, 17:00:00 Dinner, 23:00:00 Sleep |
 +--------------------------------------------------------+

Not bad, for one query; we’ve taken our three rows, and turned them into a string, without having to use anything other than SQL!  There’s just one final problem here that I’d like to fix, namely that when you invoke a function as we did here, the function’s name becomes the name of the column.  I’d like to change that, so the column has a more reasonable name.  This is particularly important if you’re planning to read the column using a high-level language; your code will look awfully funny if it includes statements like:

puts row['array_to_string']

when it could instead read

puts row['appointments_for_today']

Here’s how I can use “AS” to change the name of the column, making it easier to read and work with:

SELECT array_to_string( 
  ARRAY(SELECT meeting_at::time || ' ' || description
FROM Appointments
WHERE meeting_at::date = '2014-may-23'ORDER BY meeting_at), ', ') 
AS appointments_for_today;

It is in these sorts of uses, to aggregate data and create reports, that I’ve found PostgreSQL’s arrays to be particularly useful and powerful. Next time, we’ll see how we can do the opposite, turning an array back into rows — and then, we’ll see how the combination of these conversions lets us perform all sorts of tricks.

Looking in PostgreSQL arrays with ANY

So far, this series has looked at how to create PostgreSQL arrays, how to retrieve data from them, and how to get the length of an array’s outer and inner dimensions. But one of the most common actions that we’ll want to do with an array is look inside to see if a particular value is in there.  Today, I’ll show you how to do that with the PostgreSQL ANY operator and its cousins. For example, given my table “foo”:

[local]/reuven=# select * from foo;
+----+---------------------+
| id | stuff               |
+----+---------------------+
| 8  | {abc}               |
| 9  | {abc,def}           |
| 10 | {abc,def,"ghi jkl"} |
+----+---------------------+
(3 rows)

We’ve already seen (in my last post on this topic) how we can find all of the rows in which the “stuff” array is of a certain length.  But sometimes — more often, perhaps — I want all of the rows in which a particular value is somewhere in the array.  (You can think of this as the SQL equivalent of Ruby’s “select” on an enumerable, or Python’s “filter” on a sequence.) For this, we need to use PostgreSQL’s ANY operator.  I have always found the syntax of ANY (and its synonym, SOME) to be a bit hard to understand.  The IN operator works by itself, but ANY only works with an = sign.  For example, if I want all of the rows from the above “foo” table in which the “stuff” array contains an element ‘abc’, I can say:

[local]/reuven=# select * from foo where 'abc' = ANY(stuff);
+----+---------------------+
| id | stuff               |
+----+---------------------+
| 8  | {abc}               |
| 9  | {abc,def}           |
| 10 | {abc,def,"ghi jkl"} |
+----+---------------------+
(3 rows)

Sure enough, because each of these three arrays contains an element ‘abc’, we get all three rows back.  What if I ask for ‘def’?

[local]/reuven=# select * from foo where 'def' = ANY(stuff);
+----+---------------------+
| id | stuff               |
+----+---------------------+
| 9  | {abc,def}           |
| 10 | {abc,def,"ghi jkl"} |
+----+---------------------+
(2 rows)

Again, not a surprise.  What if I search for ‘ghi’?

[local]/reuven=# select * from foo where 'ghi' = ANY(stuff);
+----+-------+
| id | stuff |
+----+-------+
+----+-------+
(0 rows)

This is important to remember, actually: ANY looks at the elements of the array, and no further.  So the string ‘ghi’ was compared with the elements of “stuff”, which in the case of row ID 10, was the three strings ‘abc’, ‘def’, and ‘ghi jkl’.  PostgreSQL compared ‘ghi’ with each of these, found them to be unequal, and thus return 0 rows.  Remember this when you’re trying to search through your text[] array column; you can look for specific elements of the array, but no further. What I personally find confusing, and easy to forget, is the fact that using ANY requires that I use an = sign before the call to ANY.  Moreover, it’s important to remember that ANY and the array go on the right-hand side of the equal sign, for reasons that I don’t quite understand.  Thus, the following will not work:

[local]/reuven=# select * from foo where ANY(stuff) = 'abc';
ERROR: 42601: syntax error at or near "ANY"
LINE 1: select * from foo where ANY(stuff) = 'abc';
                                ^

I’ve never had any occasion to use it, but you should know that PostgreSQL also offers an ALL operator.  It works the same way as ANY, except that it only returns a TRUE value if all of the elements in the array are equal to the value on the left-hand side.  For example:

[local]/reuven=# select * from foo where 'abc' = ALL(stuff);
+----+-------+
| id | stuff |
+----+-------+
| 8  | {abc} |
+----+-------+
(1 row)

In this example, we only get one row back, because the stipulation is that all of the array elements need to be equal to ‘abc’. I’ve got a few more posts planned in this series about PostgreSQL arrays.  If there is something about this subject that you have always wanted to know, please contact me, and I’ll do what I can to answer.

The real questions to ask offshore developers

Friends of mine, who are not software developers, have a small, retail Internet business.  The original developers created the application in Python, and my friends are looking for a full-stack Web/Python developer to help them.  Frustrated with their inability to find someone who can commit to their project, my friends have decided to hire offshore developers, which is another way of saying, “cheap programmers in Eastern Europe or India.”

Earlier this week, these friends e-mailed me the resumes of three Ukrainian programmers, asking me which seemed most appropriate, and what questions they should be asking.

The resumes were, from my perspective, largely identical.  All the programmers declared themselves to be “experts,” or “very experienced,” or just “experienced,” at Python, JavaScript, Web development, SQL, and many other technologies.  And the fact is, they probably are quite skilled at all of these technologies; the Ukrainians with whom I have worked — as well as the Indians, Chinese, and Romanians — have all been quite skilled, technically.

But here’s the thing: Technical skill isn’t the primary consideration when hiring a developer. This is doubly true when hiring an offshore developer.  That’s because the problems that I’ve seen with offshore programmers aren’t technical, but managerial.  As I told my friends, you would much rather have a so-so Python programmer who is reliable, and communicative than a genius programmer who is unreliable, or uncommunicative.  The sad fact is that many of the offshore outsourcing companies have talented programmers, but poor management and leadership, leading to breakdowns in communication, transparency, and scheduling, rather than technology.

Sure, a developer might know the latest object-oriented techniques, or know how to create a RESTful JSON API in his or her sleep.  But the programmer’s job isn’t to do those things. Rather, the programmer’s job is to do whatever the business needs to grow and improve.  If that requires fancy-shmancy programming techniques and algorithms, then great.  But most of the time, it just requires someone willing to pay attention to the project’s needs and schedule, writing simple and reliable code that’s necessary for the business to succeed.

The questions that you should be asking an offshore developer aren’t that different from the ones that you should be asking a developer in your own country, who speaks your language, and lives in your time zone.  Specifically, you should be asking about their communication patterns and processes.  Of course, you don’t want a dunce working on your programming project — but good communication and processes will smoke out such a person very quickly.

If there are no plans or expectations for communication, then you’re basically hoping that the developer knows what you want, that he or she will do it immediately, and that things won’t change — a situation that is pretty much impossible.

Good processes and a good developer will lead to a successful project.  Good processes and a bad developer will make it clear that the developer needs to go, and soon.  Bad processes and a developer of any sort will make it hard to measure performance, leading to frustrating on everyone’s part — and probably missed deadlines, overspent budgets, and more.

So I told my friends that they should get back to these Ukrainian programmers, and ask them the following questions:

  • What task tracking system do you prefer to use, in order to know what needs to be done, what has been done, and who has taken responsibility for each task?
  • How often do you want to meet to review progress?
  • Do you use automated testing to ensure that when we make progress, we can be sure that it works, and that we haven’t introduced regressions?
  • How easily will a third party be able to download the repository from Git (or whatever version-control system you’re using), and run those tests to verify that everything is working?

The answers to these questions are far, far more important than the technical skills of the person you’re hiring. Moreover, these are things that we can test empirically: If the developer doesn’t do one or more of them, we’ll know right away, and can find out what is going wrong.

If the developer is good, then he or she will encourage you to set up a task tracker, meet every day (or at least, every other day) to review where things are.  You’ll hear that automated testing is part of the development progress, and that of course it’s possible to download, install, and run the application on any compatible computer.

If the developer hedges on these things, or if he or she asks you to trust him, then that’s a bad sign.  Truth be told, the developer might be fantastic, brilliant, and do everything you want.  But do you want to take that risk?

If the developer has regular communication with you, tests their code, and allows you to download and run the application on your own, then you’re in a position to either praise them and keep the relationship going — or discover that things aren’t good, and shut it down right away.

Which brings me to my final point: With these sorts of communication practices in place, you’ll very quickly discover if the developers are doing what they promised.  If so, then that’s great for everything.  But if not, then you’ll know this within a week or less — and then you can get rid of them.

There are plenty of talented software developers in the world, but there are many fewer who both understand your business and make its success a priority.  A developer who values your business will want to demonstrate value and progress on a very regular basis.  Someone who cannot demonstrate value and progress probably isn’t deserving of your attention or money, regardless of where they live or what language they speak.  But if you can find someone excellent, who values you and your business, and who wants to help you succeed?  Then by all means, hire them — and it doesn’t matter whether they’re in Ukraine, or anywhere else.

What questions do you ask offshore developers before hiring them?