Free Webinar: Data science with Python on December 8th

It’s time for me to do another free one-hour Webinar, this time about data science with Python. It’ll be on December 8th, at 9 p.m. GMT.

Data science is all the rage, and rightly so — and Python is one of the best-known and best-equipped languages in which to do it.  In this Webinar, I’ll review some of the most popular packages used for analysis, including NumPy, SciPy, Pandas, and matplotlib, and will show how they can be used to answer questions that we have about our data.

As always, I hope that there will be lots of questions — and if we’re lucky, I’ll be able to provide answers, too!  Please come prepared for a highly interactive and fun event.  I’ll e-mail all registered participants about an hour before the Webinar with links for participating.

You can register at Eventbrite:

I look forward to seeing you there!  If you have any questions, please let me know via e-mail at or on Twitter as @reuvenmlerner.

Shark Tank increased our traffic by 1000x. Here’s how we handled it.

More than five years ago, I started work to with a small company with a simple idea. Rent Like a Champion was founded by alumni of the University of Notre Dame, which has a very strong culture of college football games. (Note to my non-American readers: We’re talking here about American football, rather than “real” football.)  However, South Bend, Indiana doesn’t have hotels close to the stadium, meaning that people don’t have a nearby, convenient place to stay when they come into town. RLAC thus offers homeowners the chance to rent out their homes on football weekends to visitors. Homeowners make some extra money, and football fans get to see the games they wanted while staying nearby.

I took over for a previous programmer, improving the site’s management and e-commerce features. RLAC has since grown massively, and my role has grown along with them. (And, I hasten to add, I’m no longer doing day-to-day coding on the site; my employee Genadi is doing a fantastic job of that, leaving me to handle more architectural, strategic, and managerial issues on the technology side.) There are still vestiges of the code that I inherited in 2010, but a great deal has changed and improved since then. We are using the latest version of both Ruby and Ruby on Rails, the latest version of MySQL, have both staging and production servers, and have a large number of automated tests. The site rarely goes down or is overwhelmed by traffic.

On the business side, we’re now on several dozen US college campuses, handle incoming and outgoing payments automatically, and even allow people to negotiate on the dates and prices of where they’re staying. Really, I couldn’t be prouder to be associated with this company.

A few months ago, we got some big news: Rent Like a Champion was going to be on Shark Tank, a reality TV show that allows you to pitch your company to investors. We didn’t know what this would mean for the company’s future, but we did know that it would mean lots of additional traffic. The numbers we heard were in the area of 10-20 thousand requests per second. This, from a site that had only 10-20 requests per second on our busiest days. Meaning, we had to figure out a reasonable way to scale our site up so that it could handle about 1,000 times the traffic we were used to seeing.

Now, Rails has a reputation for not being scalable. But it’s clear that Rails can scale, given enough computers. When someone says that it’s “not scalable,” what they probably mean is that given a certain amount of traffic, Rails requires more servers than other languages and frameworks. And yes, that’s probably true. But we clearly weren’t going to change our entire technology stack just for a few hours of television time.

We thus took a multi-pronged approach, working on every part of the site — from the infrastructure, to the application, to the servers we were using.

The bottom line, by the way, was that we more than survived the ordeal: Not only did we handle about 8,000 simultaneous requests/second for an extended period of time, but our servers were barely breaking a sweat. And as if that weren’t enough, we got investments from both Mark Cuban and Chris Sacca, two well-known billionaire investors.

So, what did we do to scale up? And what does this mean for projects you’re doing?

No state on the server

First and foremost, modern Web applications can be inherently scalable, if you design them correctly. Rails, like many other modern frameworks, assumes a “zero state” situation on the Web application server. This means that no user-related state is stored on the server itself. Instead, we store all such state in the database. This means that we can add as many servers as we want, because the actual data won’t be stored on the Web server.

One of the potentially tricky parts has to do with user sessions. Every modern Web framework offers developers the chance to use sessions; the user’s cookie contains an ID (typically encrypted) that allows us to look on disk, in memory, or in a database for the user’s session information.

If the session information is stored in a disk file, then you can only use a single Web server. That’s because if the same user visits several servers in the same day (which happens all of the time, if you have several servers behind a load balancer), and the session info is spread across multiple servers, they’ll effectively be logged out when they reach the second server. You can avoid this by storing session information in the database, but then that’ll slow down the application.

We decided to use cookie-based session information, in which the cookie itself contains the user’s session info. That means we don’t have to worry about how many servers we’re using, because the session info will be available to all of them. If you only store a user ID in the session, then the cookie will be particularly tiny (albeit encrypted), and thus you don’t have to worry about the size.

One part of our scaling was thus to ensure that users could move freely from one server to another. We then put all of the Web servers behind a single load balancer, giving the illusion that was a single machine, but actually distributing the load among many. The user’s cookie would go from their browser, to the load balancer, to the ultimate machine. From there, an ActiveRecord lookup via the user ID gave the user’s information.

Bottom line: Your app might already be more scalable than you think, if session information is stored in cookies. If you’re using files to store session information, though, your application is inherently non-scalable; once you use more than one Web server, the user might end up being logged out.

Use VMs, but not necessarily AWS

Several of the biggest trends we’ve seen in the last few years involve the combination of cloud computing and virtual machines. Thanks to VMs, we can spin up as many servers as we need, only when we need them.

I’ve long been of the opinion that putting a server “in the cloud,” as everyone likes to say nowadays, is often unnecessary. There’s nothing wrong with a plain ol’ server, after all, especially since such servers are often going to be cheaper and easier to maintain. (I’m getting to the point where I might be changing my tune, as I see the advantages of deploying to a new VM or container with each release, rather than upgrading each existing release.)

However, in the case of RLAC, moving to the cloud was a no-brainer — not because we need it for our permanent infrastructure, but because the key worry we had was being able to scale up quickly. We didn’t know how many requests we would get, and putting in place a set of “real” servers that could handle that capacity would cost a fortune. Besides, we knew that we needed to scale up quickly for (and during) the Shark Tank airing, but then scale down just as quickly following the show’s broadcast.

Amazon Web Services is the first name that people think of in this space. We decided to go a different route, in no small part thanks to the suggestions and connections of RLAC’s CIO, Mike Hostetler: We went with Server Central, a Chicago-based company that has massive bandwidth, and lots of experience configuring virtual machines for this kind of work. Sure, AWS might be better known. But one of the advantages we got from Server Central was actual, in-person help — something that Amazon wouldn’t provide for a player as small as ourselves.

I have to say that Server Central’s staff amazed and impressed me at every turn: They were available, helpful, and polite, and knew a ton about how to configure and tune our VMs for maximum benefit. They set up our MariaDB cluster (more on that below), as well as a load balancer. They helped us to clone VMs ahead of schedule, and to connect them (virtually, of course) with our network.

In the end, we didn’t go with a Chef or Puppet configuration of our VM, but rather used a combination of Capistrano to deploy our software our main VMs, which we then cloned for some backup VMs. Not super elegant, I’ll admit, but it worked just fine, and meant that we could eliminate another learning curve. Mike H. worked extensively with Server Central, and really got the hang of configuring and deploying those VMs, such that we had more than 20 available for the night of broadcast.

Bottom line: If you want to scale up and/or down quickly, then VMs are almost certainly the way to go. And if you’re looking to get actual service, rather than a faceless SaaS company, I’d definitely suggest speaking with Server Central.

Switching from Apache to nginx

I’ve said it on many occasions: I have a warm spot in my heart for htApache httpd. I’ve been doing Web development since before the first version of Apache was released, much less the Apache Foundation was founded, and I have always found it to be an easy to use, flexible HTTP server.

However, there’s no doubt that another server, nginx, offers greater performance and scalability than Apache. In the case of RLAC and Shark Tank, we were far more interested in scalability than ease of use. However, we didn’t want to have a super-hard learning curve and transition, which we feared would be the case if we moved to a combination of nginx and Unicorn, a popular duo in the Ruby on Rails world. We thus settled on using nginx and Phusion Passenger, a plugin that provides Ruby on Rails applications which we had previously used with Apache.

I have to admit that I was pleasantly surprised on all fronts by nginx. It installed without a hitch, thanks in no small part to Passenger’s super-easy installation process on Linux boxes. The configuration is quite different from Apache, but not as difficult as I would have expected. It allowed us to use our existing SSL certificates quite easily. And the performance, without a doubt, was quite impressive. Indeed, any performance issues we saw were the result of Passenger, rather than nginx; Ruby and Rails both use lots of memory, and thus there is a limit to the number of simultaneous requests you can handle per machine.

Bottom line: I hate to say it, but I don’t see a big advantage to Apache any more. nginx documentation and tutorials are quite good, support for Rails applications is excellent with Passenger, and the performance we saw quite very good. Configuration of Apache is still easier for me, but that’s going to be true for anything I’ve been doing for 20 years, I expect.

Database cluster

With the Web servers scaling up nicely, the biggest potential bottleneck suddenly became our database. The database on RLAC, as with most Web applications, is involved in every single page displayed, from showing the user’s name to listing homes for a particular football game, to letting homeowners set the prices for individual games and events.

The problem is that the database is a finite resource; if too many people come and visit the site at the same time, the database will cease responding to some of them, causing a domino effect that will cause many people to get errors or timeouts.

I’m a big fan of PostgreSQL, which I have been using for about 20 years. But when I inherited the RLAC code, we were already using MySQL on the site — and switching technologies is almost never a good idea if things are already working, so I didn’t move things around.

While PostgreSQL’s master-master replication is still being discussed and designed, MySQL has master-master, high-availability clustering working already. Even better, following the purchase of MySQL by Oracle (via their purchase by Sun), Monty Widenius, the original author of MySQL, has been working on MariaDB, a MySQL-compatible plugin that offers superior performance.

Server Central suggested that we use a HA master-master cluster of VMs, all running MariaDB. From our perspective as developers, it looked and felt like a normal MySQL database. But the performance was rock solid and super fast, and it meant that we didn’t need to worry about the database being a bottleneck, unless we were crushed by people actually trying to use the application.

Importing our old database (from the MySQL server on Rackspace) to our new MariaDB cluster was fast and easy. The two database servers feel almost identical, and the dump-restore cycle took a matter of minutes.

Bottom line: I feel somewhat chastened; after years of telling people how much better it is to use PostgreSQL, and how master-slave fallover is probably fine for most purposes, I was pleasantly surprised to use a master-master cluster that more than suited our needs. PostgreSQL could probably have handled the load just as well, of course, but switching to a different database just before a major PR event would have been a very bad idea.


The above was a great way to get our system to work under high loads. But it’s always possible to optimize things further, and one of the best-known ways to make Web applications increase their speed is to add caching. If you can cache a page, then everyone benefits: The user gets a must faster response, and the server doesn’t have to spend its time running Ruby and SQL, because the request has been served without even touching the server.

On RLAC, we used three different types of caches. These worked together spectacularly well, improving our performance massively. Our working assumption was that we would have a number of VMs on standby in case we would need them, and that most users would look at a few pages before bouncing off of our site. But we still knew that we would need to server thousands (and many tens of thousands) of requests per second, many of which just wanted to see our home page while watching Shark Tank on their TVs.

Our caches were as follows:
  • In our application, we cached many of our SQL queries. (Ruby on Rails lets you do this inside of your application with very little fuss.) Thus, every time we asked the database for the list of homes at event #12345, we stored that information in a cache, in case that same event would be requested again in the near future.
  • On each of our servers, we (mostly Mike Hostetler) installed and configured Varnish, which provided us with caching of static assets, such as images, CSS, and JavaScript. Moreover, the configuration meant that if someone came to the RLAC home page without being logged in, they would get a cached version of the page. If, however, they came after being logged in, they would hit our Rails app. This was necessary to ensure that people who had logged. We had a separate copy of Varnish running on each of our VMs, which was admittedly not the best solution, but it worked more than well enough for our purposes. It meant that every HTTP request coming to a VM from the load balancer would first go through Varnish, and only if necessary go to the Rails server.  I should add that Varnish is one of the most impressive pieces of software I’ve seen in a long time; it’s very good at what it does, and has an incredibly powerful configuration language that lets you examine and rewrite HTTP queries in amazing ways.  It’s definitely worth looking into, if you haven’t already.
  • Finally, we used CloudFlare, a content distribution network (CDN) that did initial caching of static assets, including all of our images, JavaScript, and CSS. One of the nicest parts of CloudFlare is that they have an “emergency page” feature, so that if your site isn’t available, you can at least show a decent static page, and ask people to return later.

Bottom line: We could have optimized our caching even more than this. But after a lot of configuration of Varnish, CloudFlare, nginx, and the Rails app’s HTTP response headers, we got to the point where each of our VMs could handle about 2,000 simultaneous requests that hit the database, and far more than that if they were static and only required the home page. But the caching was incredibly helpful.

Images on S3

This isn’t directly related to the move to Server Central and getting ready for Shark Tank, but it certainly helped: Many of our images, and especially photos of homes available for rent, are now stored on Amazon’s S3, rather than our own server. This reduces the load on our VM and nginx, and provides users with much faster retrieval speeds than we can provide. Moving things to S3 turns out to be quite easy to do; most of the image-uploading gems now available for Rails have an option to store things in S3; once we set up the appropriate buckets, this was a piece of cake.

Bottom line: The cost of S3 storage is laughably low, because you’re just paying for the bandwidth, and the per-byte cost is basically a joke. It’s totally reasonable to do this, and you’ll end up spending far less than you ever thought.

Our biggest mistake: delayed jobs

Remember how I said every VM was stateless, and that it didn’t matter how many VMs we would run? Yeah, that’s what we thought — and then we started to migrate our systems, and remembered that our Web server was configured to run Delayed Jobs. DJ is a Ruby gem that works with Rails, which does just as the name describes — it allows you to offload things to the background, so that your users (and server) don’t sit around. For example, RLAC processes many photos from homeowners. That processing, and subsequent copying of images to S3, can take some time. So we have offloaded it to DJ, which runs in the background all of the time, processing images.

About three days before we were supposed to switch over to Server Central, with our many VMs, we realized that Delayed Jobs expected files to be uploaded to a specific server. That is, we had state — files on the system that DJ was going to look for, scoop up, and process. DJ uses the database, which meant that it seemed we had to choose between rewriting our app such that all image uploads would go to a single, designated server, or that we would upload to many servers, with DJ (on all but the server that received the file) and the database going crazy because it cannot find the uploaded file on the local filesystem.

After some discussion (and a bit of panic), my developer Genadi found the answer: DJ lets you designate a queue name when the system starts up. We thus set up our DJ startup script, such that it grabbed the hostname on which it was running, and used that hostname as the name of its DJ queue. Thus, with 20 servers, we had 20 separate DJ queues, each of which was from a particular machine. This wasn’t the most efficient solution; if we had thought about it earlier, we probably would have centralized things, with a single DJ server. But it worked beautifully, required very little configuration (except for figuring out how to start up DJ with this option via our startup scripts), and pulled us through.

Bottom line: Think very hard about whether you have any extra services running that might involve state.

Cron jobs

Finally, the RLAC system has a large number of processes that are run via cron jobs. These are crucial for the business, in that they send out reminder e-mail messages to homeowners and renters, handle credit cards, and make homeowner payments. You don’t want these to go down, but you also don’t want these to run on every machine. Clearly, in contrast with DJ, we didn’t want these cron jobs to run on each of our Web servers.

The solution was to designate one of our servers, and to set it up to use cron. Thus, while our servers are all on identical VMs, you could say that one is more equal than others, in that it is running a variety of rake tasks via cron to deal with system maintenance. Perhaps we’ll pull these tasks onto a separate VM one of these days, but for now, things are working quite well.

The only (minor) hitch was that RLAC’s e-mail servers are run by Google, and Gmail started to tag all of the e-mail produced by our cron jobs as spam, and rejecting it. The solution was to configure Postfix (our SMTP server) to use SendGrid, our outgoing e-mail provider. The moment we did that (which was shockingly easy to do, I must admit), I started to get reports from our cron jobs once again.

Bottom line: If you have regularly scheduled cron jobs, either for system features or for maintenance (e.g., checking disk space or making database backups), make sure that you know in advance where you’ll be putting them. And test things, to make sure that you’re able to receive e-mail from your cron jobs, in case of success or failure.

Team effort

Perhaps the most important aspect of what we did was that it was a team effort.  None of us could have done this by ourselves; even on a relatively small application, there were too many types of responsibility and expertise for anyone to have done it on their own. I really enjoyed getting to work with not only Genadi Reznichenko (the amazing developer who has worked for me for more than three years), but also Mike Hostetler (who took on the role of CIO over the last year), and Trevan Hetzel (our front-end developer), as well as the RLAC team led by CEO Mike Doyle.  Communication and trust are the most important aspects of any software project, and we did well on both fronts, thanks to GitHub, Slack, weekly group phone calls, and lots of pair-programming sessions.

We worked hard to get things up and running; it was a team effort, but one which paid off in spades for the company. I hope that this description helps some of you to think about scaling up your Web applications in various ways; if you have stories to share, or questions (or comments) about what I’ve written here, please let me know. I’d love to swap stories and/or hear what you went through!

Updated ebook: Jewish Guide to Visiting China

cover-loresI’m happy to announce that my ebook, “The Jewish Guide to Visiting China,” has been updated quite a bit — and now has its own Web site,

If you’re Jewish and planning to visit China in the near future, I’m sure that my book will help to remove some of your concerns and fears, while giving you lots of practical advice that I’ve gleaned over my 10 trips to Beijing, Shanghai, and Nanjing.

In addition to updating the book itself a great deal, I have also added a new deluxe package, which includes not only the book, but a 30-minute Skype/phone call with me, in which I’ll address the questions and concerns you have about keeping kosher and Shabbat while in China.

In addition to this ebook, I continue to publish Mandarin Weekly, a free weekly newsletter containing the best links I can find for people (like me!) studying Mandarin Chinese.  You can get Mandarin Weekly sent to you via e-mail at the Web site, or follow it on Twitter, at @MandarinWeekly.

Python’s objects and classes — a visual guide

Python developers love to say that “everything is an object.” And indeed, when I teach Python classes, I say this several times, and many people nod in agreement, assuming that I’m merely repeating something they’ve heard before.  After all, people often say that everything in Java is an object (except for the things that aren’t), and that everything in .NET is an object.

But when we say that everything in Python is an object, we really mean everything, including — much to the surprise of my students — classes. This makes enormous sense, and it makes the entire object system easier to understand. And yet, it is still hard to put things in perspective.

In this blog post, I want to walk through some of the connections that we have among objects in Python, in the hopes that it’ll help to cement some of the ideas that stem from this “everything is an object” idea. It’ll also demonstrate some of the fun that happens when you’re creating an object hierarchy, and how things can get a bit weird.

Let’s start with a simple class (MyClass), and a simple instance of that class (m). In Python, we would write:

class MyClass(object):

m = MyClass()

In Python 3, we don’t need to explicitly say that MyClass inherits from object, since that’s true for all classes. But in Python 2, we have to inherit from object; if we don’t, then we get old-style classes, which we really don’t want.

Let’s see how this looks visually, with the arrow indicating that m is an instance of MyClass:

Python objectsSo far, that’s not very exciting. But let’s remember that everything in Python is an object. Thus, it’s true that m is an instance of MyClass; we can learn this by using the type function:

>>> type(m)

What happens if we ask MyClass about its type?

>>> type(MyClass)

Yes, MyClass is an instance of type — just as str, int, bool, and other Python classes are instances of type. Our diagram has just gotten a bit more complex:

Python objects 2

In the above diagram, we see that m is an instance of MyClass, and MyClass is an instance of type.

One main difference between regular objects and classes is that classes have a __bases__ attribute, a tuple, which indicates from which other class(es) this class inherits. MyClass, like all classes, should really have two pointers in our diagram — one representing its type, and another representing from which class (object) it inherits:

Python objects 3Many of the people to whom I teach Python are confused by the distinction between type and object, and what roles they play in the object’s life. Consider this:

  • Because MyClass is an instance of type, type.__init__ determines what happens to our class when it is created.
  • Because MyClass inherits from object, invoking a method on m will result in first looking for that method on MyClass. If the method doesn’t exist on MyClass, then Python will look on object.

All of this is well and good, but let’s take it a bit further: We know that MyClass is an instance of type. But this means that type itself is a class, right? What is the type of this type class?

>>> type(type)

Yes, in one of my favorite parts of Python, the type of type is type. In other words, type is an instance of itself. Pretty cool, eh? Let’s see how that fits into our diagram:

Python objects 4

If type is a class, then we know it must have two pointers in our diagram — one pointing to its class (type, aka itself), but the one to the class from which it inherits. What does type inherit from?

>>> type.__bases__

Let’s thus update our diagram, to show that type inherits from object. This makes sense, since if we invoke str(MyClass), we can rely on the inherited  implementation of object.__str__, without having to create a separate type.__str__.  And indeed, it would seem that this is what happens:

>>> type.__str__ is object.__str__

Let’s now update our diagram to indicate that type inherits from object:

Python objects 5

Finally, let’s not neglect our object class. As an object, it too must have a type. And as a class, we know that its type is type. Let’s add that to our diagram:

Python objects 6Remember that object is at the top of our inheritance hierarchy. This is represented in Python by an empty tuple:

>>> object.__bases__

We can represent this in our diagram in the following way:

Python objects 7Finally, let’s see what happens when we add a new class to this hierarchy, subclassing from MyClass. MySubClass inherits from MyClass, but is still an instance of type:

Python objects 8

If you’re an experienced Python developer, then the above may well be second nature for you. But if you’re new to the language, and particularly to the ways in which the various objects and classes interact, then I hope this has provided you with some additional clarity. Please let me know if there are additional aspects that you find confusing, and I’ll try to clarify them in future blog posts.

If you liked this explanation, then you’ll likely also enjoy my ebook, “Practice Makes Python,” with 50 exercises meant to improve your Python fluency.

Registration is open for my October Webinars (about regexps and technical training)

September has been busy with work and holidays, but I’m gearing up for an exciting and busy October. Among other things, I’m giving two (free) Webinars in that month, and you can already register for them:

  • Intermediate Regular expressions: In my previous Webinar about regular expressions, I covered the basics.  In this one, we’ll go further, spending a great deal of time talking about groups, backreferences, and some other topics that tend to confuse people.  I’ll be using Python to demonstrate regexps, but this Webinar isn’t only aimed at Python developers. Registration is free; sign up here.
  • Technical training: I’m starting to spend more and more time helping people to become technical trainers, teaching programming in high-tech companies. (Indeed, I’m in the process of starting my coaching program for people interested in improving their training skills.) In this Webinar, which I expect will be the first of many, I’ll give an overview of the technical-training landscape, how it works, and why you should seriously consider providing training services. I’ll briefly review pedagogical, logistical, and business considerations for the aspiring technical trainer. Registration for this training is free; you can sign up here.

Now available: My interview on “Developer on Fire” about training programmers

I was recently interviewed by Dave Rael for the “Developer on Fire” podcast. That episode has now been released, at Enjoy!

In PostgreSQL, as in life, don’t wait too long to commit

I recently helped a client with serious PostgreSQL problems. Their database is part of an application that configures and monitors large networks; each individual network node reports its status to the database every five minutes. The client was having problems with one table in particular, containing about 25,000 rows — quite small, by modern database standards, and thus unlikely to cause problems.

However, things weren’t so simple: This table was growing to more than 20 GB every week, even though autovacuum was running. In response, the company established a weekly ritual: Shut down the application, run a VACUUM FULL, and then restart things. This solved the problem in the short term — but by the end of the week, the database had returned to be about 20 GB in size.  Clearly, something needed to be done about it.

I’m happy to say that I was able to fix this problem, and that the fix wasn’t so difficult. Indeed, the source of the problem might well be obvious to someone with a great deal of PostgreSQL experience. That’s because the problem mostly stemmed from a failure to understand how PostgreSQL’s transactions work, and how they can influence the functioning of VACUUM, the size of your database, and the reliability of your system. I’ve found that this topic is confusing to many people who are new to PostgreSQL, and I thus thought that it would be a useful to walk others through the problem, and then describe how we solved it.

PostgreSQL, like many other modern relational databases, uses a technology known as multiversion concurrency control (“MVCC”).  MVCC reduces the number of exclusive locks placed on a database’s rows, by keeping around multiple versions of each row. If that sounds weird, then I’ll try to explain. Let’s start with a very simple table:


Now let’s add 1 million rows to that table:

INSERT INTO Foo (x) VALUES (generate_series(1,1000000));

Now, how much space does this table take up on disk? We can use the built-in pg_relation_size function, but that’ll return a large integer representing a number of bytes. Fortunately, PostgreSQL also include pg_size_pretty, which turns a number into something understandable by computer-literate humans:

SELECT pg_size_pretty(pg_relation_size('foo'));

35 MB

What happens to the size of our table if we UPDATE each row, incrementing x by 1?

UPDATE Foo SET x = x + 1;
SELECT pg_size_pretty(pg_relation_size('foo'));

69 MB

If you’re new to MVCC, then the above is probably quite surprising. After all, you might expect an UPDATE query to change each of the existing 1 million rows, keeping the database size roughly identical.

However, this is not at all the case: Rather than replace or update existing rows, UPDATE in an MVCC system adds new rows. Indeed, in an MVCC-based system, our queries never directly change or remove any rows. INSERT adds new rows, DELETE marks rows as no longer needed, and UPDATE performs both an INSERT and a DELETE.

That’s right: In an MVCC system, UPDATE doesn’t change existing rows, although we might have the illusion of this occurring.  Every time you UPDATE a table, you’re doubling its size.

At this point, many newcomers to PostgreSQL are even more confused: How can it be that deleted rows don’t go away? And why would UPDATE work in such a crazy way? And how can all of this possibly help me?

The key to understanding what’s happening is to realize that once a row has been added to the system, it cannot be changed. However, its space can be reused by new rows, if the old row is no longer accessible by any transaction.

This starts to make a bit more sense if you realize that every transaction in PostgreSQL has a unique ID number, known as the “transaction ID.” This number constantly increases, providing us with a running count of transactions in the system. Every time we INSERT a row, PostgreSQL records the first transaction ID from which the row should now be visible, in a column known as “xmin.” You an think of xmin as the row’s birthday. You cannot normally see xmin, but PostgreSQL will show it to you, if you ask for it explicitly:

SELECT  xmin, * FROM Foo WHERE id < 5 ORDER BY id;

│  xmin  │ id │ x │

│ 187665 │  1 │ 2 │
│ 187665 │  2 │ 3 │
│ 187665 │  3 │ 4 │
│ 187665 │  4 │ 5 │

In an MVCC-based system, DELETE doesn’t really delete a row.  Rather, it indicates that a row is no longer available to future transactions. We can see an initial hint of this if, when we SELECT the contents of a table, we ask for not only xmin, but also xmax — the final transaction (or “expiration date,” if you like) for which a row should be visible:

SELECT  xmin, xmax, * FROM Foo WHERE id < 5 ORDER BY id;

│  xmin  │ xmax │ id │ x │

│ 187656 │    0 │  1 │ 2 │
│ 187656 │    0 │  2 │ 3 │
│ 187656 │    0 │  3 │ 4 │
│ 187656 │    0 │  4 │ 5 │

We can see that each of these rows was added in the same transaction ID (187656). Moreover, all of these rows are visible, since their xmax value is 0. (Don’t worry; we’ll see other xmax values in a little bit.)  If I update some  of these rows, we’ll see that their xmin value will be different:

UPDATE Foo SET x = x + 1 WHERE id IN (1,3);
SELECT  xmin, * FROM Foo WHERE id < 5 ORDER BY id;

│  xmin  │ id │ x │

│ 187668 │  1 │ 3 │
│ 187665 │  2 │ 3 │
│ 187668 │  3 │ 5 │
│ 187665 │  4 │ 5 │

The rows with ID 1 and 3 were added in transaction 187668, whereas the rows with ID 2 and 4 were added in an earlier transaction, number 187665.

Since I’ve already told you that rows cannot be changed by a query, this raises an interesting question: What happened to the rows with IDs 1 and 3 that were added in the first transaction, number 187665?

The answer is: Those rows still exist in the system! They have not been modified or removed. But the system did mark them as having an xmax of 187668, the transaction ID in which the new, replacement rows are now available.

Transaction IDs only go up. But you can imagine a hypothetical time-traveler’s database, in which you could move the transaction ID down, and see rows from an earlier era. (Somehow, I doubt that this would be a very popular television series.)

The thing is, we don’t need a hypothetical, time-traveler’s database in order to see such behavior. All we need is to open two separate sessions to our PostgreSQL database at the same time, and have each session open a transaction.

You see, when you BEGIN a new transaction, your actions are hidden from the rest of the world — and similarly, modifications made elsewhere in the database are hidden from you.  You can thus have a situation in two different sessions see completely different versions of the same row, thanks to their different transaction IDs.  This might sound crazy, but it actually helps the database to run smoothly, with a minimum of conflicts and locks.

So, let’s open two different PostgreSQL sessions, and take a look at what happens.  In order to distinguish between them, I used the \set command in psql to make PROMPT1 either “Session A>” or “Session B>”. Here’s session A:

Session A> BEGIN;
Session A> select  xmin, xmax, * from foo where id < 5 order by id;

│  xmin  │ xmax │ id │ x │

│ 187668 │    0 │  1 │ 3 │
│ 187665 │    0 │  2 │ 3 │
│ 187668 │    0 │  3 │ 5 │
│ 187665 │    0 │  4 │ 5 │

And here is session B:

Session B> BEGIN;
Session B> select  xmin, xmax, * from foo where id < 5 order by id;

│  xmin  │ xmax │ id │ x │

│ 187668 │    0 │  1 │ 3 │
│ 187665 │    0 │  2 │ 3 │
│ 187668 │    0 │  3 │ 5 │
│ 187665 │    0 │  4 │ 5 │

At this point, they are identical; both sessions see precisely the same rows, with the same xmin and xmax. Now let’s modify the values of two rows in session A:

Session A> update foo set x = x + 1 where id in (1,3);

What do we see now in session A?

Session A> select xmin, xmax, * from foo where id < 5 order by id;

│  xmin  │ xmax │ id │ x │

│ 187670 │    0 │  1 │ 4 │
│ 187665 │    0 │  2 │ 3 │
│ 187670 │    0 │  3 │ 6 │
│ 187665 │    0 │  4 │ 5 │

Just as we would expect, we have two new rows (IDs 1 and 3), and two old rows (IDs 2 and 4), which we can distinguish via the xmin value.  In all cases, the rows have an xmax value of 0.

Let’s return to session B, and see what things look like there.  (Note that I haven’t committed any transactions here!)

Session B> select  xmin, xmax, * from foo where id < 5 order by id;

│  xmin  │  xmax  │ id │ x │

│ 187668 │ 187670 │  1 │ 3 │
│ 187665 │      0 │  2 │ 3 │
│ 187668 │ 187670 │  3 │ 5 │
│ 187665 │      0 │  4 │ 5 │

That’s right — the rows that session B sees are no longer the rows that session A sees.  Session B continues to see the same rows as were visible when it started the transaction. We can see that the xmax of session B’s old rows is the same as the xmin of session A’s new rows. That’s because any transaction after 187670 will see the new row (i.e., with an xmin of 187670), but any transaction started before 187670 will see the old row (i.e., with an xmax of 187670).

I hope that the picture is now coming together: When you INSERT a new row, its xmin is the current transaction ID, and its xmax is 0.  When you DELETE a row, its xmax is changed to be the current transaction ID.  And when you UPDATE a row, PostgreSQL does both of these actions.  That’s why the UPDATE I did earlier, on all of the rows in the table, doubled its size.  (More on that in a moment.)

This allows different transactions to work independently, without having to wait for locks. Imagine if session B wants to look at row with ID 3; because of MVCC, it doesn’t need to wait to read it.  Heck, session B can even change the row with ID 3, although it’ll hang until session A commits or rolls back.

And indeed, another benefit of MVCC is that rollbacks are trivially easy to implement: If we COMMIT session A and roll back session B, then session A’s view of the world will be the current one, for all following transactions.  But if we ROLLBACK session A, then everything is fine; PostgreSQL keeps track of the fact that session A was rolled back, and that we should actually be using (until further notice) the rows with an xmax of 187670.

As you can imagine, this can lead to some trouble. If every UPDATE increases the number of rows, and every DELETE fails to free up any space, every PostgreSQL database will eventually run out of room, no? Well, yes — except for VACUUM, which is the garbage-collecting mechanism for PostgreSQL’s rows. VACUUM can be run manually, but it’s normally run via “autovacuum,” a program that runs in the background, and invokes VACUUM on a regular basis.

VACUUM doesn’t free space from your PostgreSQL database. Rather, it identifies rows whose xmin is greater than the current transaction ID, and thus marks those rows as available for re-use. PostgreSQL thus keeps track of not only the current transaction ID, but also the ID of the earliest still-open transaction. So long as that transaction is open, it needs to be able to see the rows from that point in time.

Don’t confuse VACUUM with VACUUM FULL; the later does indeed clear out space from a PostgreSQL database, but locks the database while it’s doing so. If you have a serious 24/7 application, then VACUUM FULL is a very bad idea. The regular autovacuum daemon, albeit perhaps with a bit of configuration, should take care of identifying which rows can and should be marked for deletion.

In the case of my client’s problem, my initial thought was that autovacuum wasn’t running. However, a quick look at the pg_stat_user_tables view showed that autovacuum was running on a regular basis.

However, VACUUM will mark a row as eligible for reuse if no other transactions can see it. (After all, if a session remains open, then it still needs to see those old row versions.)  Meaning that if a transaction remains open from the time the system starts up, VACUUM will never identify any rows as eligible for reuse, because there is still an open transaction whose ID is lower than the xmax of the deleted rows.

And this is precisely what happened to my client: It turns out that their application, at startup, was opening several transactions with BEGIN statements… and then never committing those transactions or rolling them back.  None of those transactions led to a lock in PostgreSQL, because they weren’t executing any query that would lead to a conflict. And autovacuum was both running and reporting that it ran on each table — because it had indeed done so.

But when it came time to mark the old rows as eligible for reuse, VACUUM ignored all of the rows in the entire database, including our 23,000-row table — because of the transaction that the application had opened at startup. It didn’t matter how many times we can VACUUM, whether it was done manually or automatically, or what else was running at the time: All of the old versions of every row in our table were kept around by PostgreSQL, leading to a massive disk usage and database slowness. Stopping the application had the effect of closing those transactions, and thus allowing VACUUM FULL to do its magic, albeit at the cost of system downtime. And because the application opened the transactions at startup, any fix was bound to be a temporary one.

My client was a bit surprised to discover that the entire database could encounter such problems from an open transaction that was neither committed nor rolled back. However, I hope that you now see why the space wasn’t reclaimed, and why the database acted as it did.

The solution, of course, was to get rid of those three open, empty transactions at startup; once we did that, everything ran smoothly.

Transactions are a brilliant and useful invention, but in PostgreSQL (and other MVCC systems), you shouldn’t be holding onto open transactions for too long. Doing so might have surprisingly unpleasant consequences. Commit or rollback as soon as you can — or you might find yourself with a database accumulating gigabytes full of old, unnecessary rows.

Announcing: Coaching for technical trainers

What, exactly, do I do for a living?

Yes, I’m a programmer — or as I’m supposed to say nowadays, a “full-stack Web developer.” And yes, I’m a lead developer/CTO.  And yes, I’m a writer, with two ebooks written (about Python and visiting China), two more on the way, and my monthly Linux Journal column now in its 20th (!) year.

But over the last few years, another role has increasingly dominated the others: Much of my time is now spent as a technical trainer, teaching a variety of open-source technologies — Python, Ruby, Git, and PostgreSQL — to companies around the world.

Some software developers believe that training is less demanding, less fulfilling, or even less lucrative than creating software. And for some of them, that might well be true.

I have personally found training to be no less demanding than developing software — but I have also found that it is extremely fulfilling, and that it does a more-than-adequate job of paying the bills. We all know that high-tech companies are desperately looking for high-quality developers; when I do my job right, I provide them with such developers, ready to use the latest technologies to solve problems more efficiently and reliably than would otherwise have been possible.

The good news is that my training business is going quite well; I’m now booked solid for several months in advance, and I get to work with some great companies and very bright engineers. Part of my motivation for writing ebooks is now to reach the people whom I cannot possibly teach in person.

Being a trainer requires a number of skills beyond knowing how to program: You also need to know how to organize a syllabus, prepare exercises, prepare slides and printed materials, speak in front of a group, and answer questions. Beyond that, you also need to understand the business side of training — what are companies looking for, how do you approach them, how much do you charge them, and how can you grow your training business when companies are satisfied with your work. These skills, like many others, take time develop, and everyone has their own strengths and weaknesses. But I believe that if you’re willing to put in the effort, then you can learn how to become a technical trainer, and have the same sense of job satisfaction that I do.

If you are such a person, interested in offering technical training — on any subject, not just the technologies in which I specialize — then I invite you to consider joining my coaching program for technical trainers. This isn’t a course; it’s a personalized program that will help you to improve, month by month, in all of the ways that you need to become successful. You’ll have access not just to me, but to other trainers with varying levels of experience, who will provide you with feedback — just as I hope you’ll help them. The program is still in its infancy, but I believe that I’ve put together a combination of resources that can help everyone to become a trainer.

I’m launching the coaching program in two weeks, on October 1st, 2015. There aren’t any formal start of finish times; if you want to start later, then that’s fine, as well. My hope is that you’ll stay in the program for as long as you need to improve, getting feedback from me and others.  I also hope and expect that the program will more than pay for itself.

I’ll be holding a free Webinar on the subject of technical training on October 14th, at which I’ll also be taking questions from anyone who might be new to this field, or be curious about what it involves. I invite you to read more about my coaching program, to contact me if you have any questions about it, and to register for the Webinar that I’ll be holding next month!

Understanding nested list comprehensions in Python

In my last blog post, I discussed list comprehensions, and how to think about them. Several people suggested (via e-mail, and in comments on the blog) that I should write a follow-up posting about nested list comprehensions.

I must admit that nested list comprehensions are something that I’ve shied away from for years. Every time I’ve tried to understand them, let alone teach them, I’ve found myself stumbling for words, without being clear about what was happening, what the syntax is, or where I would want to use them. I managed to use them on a few occasions, but only after a great deal of trial and error, and without really understanding what I was doing.

Fortunately, the requests that I received, asking how to work with such nested list comprehensions, forced me to get over my worries. I’ve figured out what’s going on, and even think that I understand what my problem was with understanding them before.

The key thing to remember is that in a list comprehension, we’re dealing with an iterable. So when I say:

[ len(line) 
for line in open('/etc/passwd') ]

I’m saying that I want to iterate over the file object we got from opening /etc/passwd. There will be one element in the output list for each element in the input iterable — aka, every line in the file.

That’s great if I want my list comprehension to return something based on each line of /etc/passwd. But each line of /etc/passwd is a string, and thus also iterable. Maybe I want to return something not based on the lines of the file, but on the characters of each line.

Were I to use a “for” loop to process the file, I would use a nested loop — i.e., one loop inside of the other, with the outer loop iterating over lines and the inner loop iterating over consonants. It turns out that we can use a nested list comprehension, too. Here’s a simple example of a nested list comprehension:

[(x,y) for x in range(5) for y in range(5)]

If your reaction to this is, “What in the blazes does that mean?!?” then you’re not alone. Until just recently, that’s what I thought, too.

However: If we rewrite the above nested list comprehension using my preferred (i.e., multi-line) list-comprehension style, I think that things become a bit clearer:

 for x in range(5)  
 for y in range(5)]

Let’s take this apart:

  • Our output expression is the tuple (x,y). That is, this list comprehension will produce a list of two-element tuples.
  • We first run over the source range(5), giving x the values 0 through 4.
  • For each value in x, we run through the source range(5), giving y the values 0 through 4.
  • The number of values in the output depends on the number of runs of  the final (second) “for” line.
  • The output, not surprisingly, will be all of the two-element tuples from (0,0) to (4,4).

Now, let’s mix things up by changing them a bit:

  for x in range(5)  
  for y in range(x+1)]

Notice that now, the maximum value of y will vary according to the value of x. So we’ll get from (0,0) to (4,4), but we won’t see such things as (2,4) because y will never be larger than x.

Again, it’s important to understand several things here:

  • Our “for y” loop will execute once for each iteration over x.
  • In our “for y” loop, we have access to the variable x.
  • In our “for x” loop, we don’t have access to y (unless you consider the last value of y to be useful, but you really shouldn’t).
  • Our (x,y) tuple is output once for each iteration of the *final* loop, at the bottom.

Here’s another example: Assume that we have a few friends over, and that we have decided to play several games of Scrabble. Being Python programmers, we have stored our scores in a dictionary:

{'Reuven':[300, 250, 350, 400], 
 'Atara':[200, 300, 450, 150], 
 'Shikma':[250, 380, 420, 120], 
 'Amotz':[100, 120, 150, 180] }

I want to know each player’s average score, so I write a little function:

def average(scores):  
    return sum(scores) / len(scores)

If we want to find out each individual’s average score, we can use our function and a standard comprehension — in this case, a dict comprehension, to preserve the names:

 >>> { name : average(score)  
       for name, score in scores.items() }

{'Amotz': 137, 'Atara': 275, 'Reuven': 325, 'Shikma': 292}

But what if I want to get the average score, across all of the players? In such a case, I will need to grab each of the scores from inside of the inner lists. To do that, I can use a nested list comprehension:

>>> average([ one_score  
              for one_player_scores in scores.values()  
              for one_score in one_player_scores ])


What if I’m only interested (for whatever reason) in including scores that were above 200? As with all list comprehensions, I can use the “if” clause to weed out values that I don’t want. That condition can use any and all of the values that I have picked out of the various “for” lines:

>>> [ one_score      
      for one_player_scores in scores.values()     
      for one_score in one_player_scores
      if one_score > 200]

[300, 250, 350, 400, 300, 450, 250, 380, 420]

If I want to put these above-200 scores into a CSV file of some sort, I could do the following:

>>> ','.join([ str(one_score)  
               for one_player_scores in scores.values() 
               for one_score in one_player_scores  
               if one_score > 200])


Here’s one final example that I hope will drive these points home: Let’s assume that I have information about a hotel. The hotel has stored its information in a Python list. The list contains lists (representing rooms), and each sublist contains one or more dictionaries (representing people). Here’s our data structure:

rooms = [[{'age': 14, 'hobby': 'horses', 'name': 'A'},  
          {'age': 12, 'hobby': 'piano', 'name': 'B'},  
          {'age': 9, 'hobby': 'chess', 'name': 'C'}],  
         [{'age': 15, 'hobby': 'programming', 'name': 'D'}, 
          {'age': 17, 'hobby': 'driving', 'name': 'E'}],  
         [{'age': 45, 'hobby': 'writing', 'name': 'F'},  
          {'age': 43, 'hobby': 'chess', 'name': 'G'}]]

What are the names of the people staying at our hotel?

 >>> [ person['name']      
       for room in rooms
       for person in room ]

['A', 'B', 'C', 'D', 'E', 'F', 'G']

How about the names of people staying in our hotel who enjoy chess?

>>> [ person['name']  
      for room in rooms  
      for person in room  
      if person['hobby'] == 'chess' ]

['C', 'G']

Basically, every “for” line flattens the items over which you’re iterating by one more level, gives you access to that level in both the output expression (i.e., first line) and in the condition (i.e., optional final line).

I hope that this helps you to understand nested list comprehensions. If it did, please let me know! (And if it didn’t, please let me know that, as well!)

Want to understand Python’s comprehensions? Think in Excel or SQL.

Comprehensions are among the most useful constructs in Python. They merge the old, trusty “map” and “filter” functions into a single piece of compact, elegant syntax, allowing us to expression complex ideas in a minimum of code. Comprehensions are one of the most important tools in a Pythonista’s toolbox.

And yet, I have found that a very large number of Python programmers, including some experienced developers, are not completely comfortable with comprehensions. There are two reasons for this: First, it’s not obvious when to use them, and what sorts of problems they solve. The second problem, which is at least as important, is that the syntax is hard for people to remember and understand.

I’ve started to use a new explanation and introduction to comprehensions in my Python classes, and have found that it helps to lower the learning curve to some degree. In this post, I’m publicizing this explanation, in the hopes that it’ll help Python developers to understand when, where, and how to use comprehensions.

Let’s take a simple problem: I want to take a list of five integers, and get a list of their squares. If you give this problem to a new (or even intermediate) Python programmer, the answer would look something like this:

numbers = range(5)
output = [ ]
for number in numbers:
    output.append(number * number)

Now, the thing is that this does work. (In my courses, I often use the phrase, “Unfortunately, this works.”) Often, when I talk about comprehensions, I talk about functional programming, the idea of immutable data structures, the idea that we don’t want to change things, and the benefits of thinking in terms of mapreduce.

But let’s ignore all of that, and ask a simpler question: If you were to give this problem to your accountant, how would they solve the problem?

Almost certainly, an accountant would fire up Excel, and put the numbers in a column:


Let’s assume that the above numbers are in the spreadsheet’s column A. The Excel user would, given this task, then tell Excel that column B should be calculated as A*A. And it would be done:

A  B
-  -
0  0
1  1
2  4
3  9
4  16

You could argue that the difference here is that Excel has a GUI, and Python doesn’t. But that’s missing the point. The real difference is that our accountant told Excel how to transform the first column into the second column, whereas our Python developer wrote a program that describe how to carry out that transformation.

We can think about this in a different way, too: Rather than solving the problem serially, as in the above for loop, the accountant is thinking in a parallel manner, applying a single expression to a large data set. The Excel user doesn’t care, or even know, the order in which the numbers are handed to the expression. The important thing is that the expression is applied once to each of the numbers, and that the final result appears in the correct order.

We might laugh at Excel, and dismiss its users as technical neophytes. And certainly, many users of Excel would deny that they possess serious programming chops. But this sort of thinking, which is so fundamental and natural to Excel users, is alien to many programmers. Which is a shame, because it allows us to express a very large number of ideas in a simple way.

To summarize this approach:

  • Think of your input as an iterable source of data
  • Think of what operation you want to apply to each element of that source
  • Get a new sequence out

That’s what the traditional “map” function does. Python does have a “map” function, but today, we typically use list comprehensions instead.

Let’s try to make this a bit more concrete, using the example that I used above: Let’s say that we have a list of five numbers, and we want to turn that list into a list of its squares. The list-comprehension syntax looks as follows:

[number * number for number in range(5) ]

Yikes. No wonder people are scared off by this syntax.  Let’s take the above syntax apart:

  • First of all, we’re going to get a list back. (It’s called a “list comprehension” for a reason.) That’s because of the square brackets, which are mandatory, and which tell Python what sort of object to create.
  • The data source will be “range(5),” which returns a list.
  • Each element in the data source will be assigned, in turn, to the iteration variable “number.”
  • We’ll invoke the operation “number * number” on each element of the data source.

In other words, we’re creating a new list, the elements of which are the result of applying our expression to each element of the source. This sounds suspiciously like what our accountant did above, using Excel: We’re telling Python what we want, and how to transform our source to that result. But how are things done internally? How is the list created? We neither know nor care.

List-comprehension syntax can be daunting for people to understand, in part because the order of the operations seems unusual. I’ve found that it can help to rewrite list comprehensions in the following way:

[number * number
 for number in range(5) ]

Yes, that’s right — I now spread list comprehensions across two lines; the first describes the operation I want to invoke, and the second line describes the data source. If this still seems unfamiliar, let’s try to bring it into a context with which you might have some experience:

[number * number           # SELECT
 for number in range(5) ]  # FROM

While they’re not directly equivalent, there are a fair number of similarities between a SELECT query in SQL, the placement of its SELECT expression and FROM clause, and our list comprehension.  The FROM clause in an SQL query describes our data source, which is typically going to be a table, but can also be a view or even the result of a function call. And the initial part of the SELECT is often the name of a column, but  can include function calls and operators.

On the one hand, the SELECT-FROM combination seems almost too simple to mention, in that you’re just retrieving a selected set of values from a data source.  On the other hand, such queries form the backbone of the database industry. In the same way, such functionality forms the backbone of many Python programs, iterating over a data structure, and plucking out part of it, transforming that part, and then returning a new list.

One of my favorite examples (and an exercise in my ebook, “Practice Makes Python“) is to take the /etc/passwd file used in Unix, and get the usernames contained within that file. /etc/passwd consists of one record per line, and the fields are separated by colons. Here are several lines from the /etc/passwd on my computer:

nobody:*:-2:-2::0:0:Unprivileged User:/var/empty:/usr/bin/false
root:*:0:0::0:0:System Administrator:/var/root:/bin/sh
daemon:*:1:1::0:0:System Services:/var/root:/usr/bin/false
_uucp:*:4:4::0:0:Unix to Unix Copy Protocol:/var/spool/uucp:/usr/sbin/uucico

We might normally think of a file as a collection of bytes, to which we give semantic meaning when we read it. But in Python, we’re encouraged to see a file as an ordered, iterable collection of lines of text. True, I can read from a file based on bytes, but it’s so common to want to read files by line that the language provides several constructs to do so.

We know that we can iterate over the lines of a file:

for line in open('/etc/passwd'):

This demonstrates that a file is iterable, which means that it can serve as a data source for a list comprehension. This means that the above code can be rewritten as:

 for line in open('/etc/passwd')]

Again, the first line in our list comprehension represents the expression we want to apply to every element of our data source. In this case, the expression is just the line.  If we want to get the username from each of  these lines, we just need to apply the “split” method on the string, returning a list — and then retrieve index 0 from the resulting list.  For example:

 for line in open('/etc/passwd')]

Again, we can think of it in terms of an SQL query:

SELECT username
FROM users

But of course, “username” in the above is a column name.  A more equivalent query to my list comprehension would be a “Users” table with an “info” column, queried as follows:

SELECT split_part(info, ':', 1)
FROM users;

Note that in this case, I’m using the built-in PostgreSQLsplit_part” operator to perform the equivalent operation to the str.split method in Python.

Remember that in the case of my SQL query, the result of a query always looks and acts like a table. The number and types of columns returned will depend on the number and types of expressions that I have in the SELECT  statement.  But the result set will have one or more columns, and zero or more rows.

In the same way, the result of a list comprehension is always going to be a list.  You can have whatever expression you want inside of the list comprehension; the expression represents one item in a list, not the list itself.

For example, let’s assume that I want to turn the usernames in /etc/passwd into a list of dictionaries. This doesn’t require a dictionary comprehension, which creates a single dictionary.  Rather, it requires a list  comprehension, in which the expression creates a dictionary.  Here’s a simple-minded such list comprehension:

[ {'name':line.split(":")[0]}
   for line in open('/etc/passwd')]

The above will work, in that it creates a list of dictionaries. And each dictionary has a single key-value pair.  But it seems a bit silly to do the above.  Rather, I’d probably want to have a dictionary containing the username and the numeric user ID, which is at index 2. I can then write:

[ {'name':line.split(":")[0], 'id':line.split(":")[2]}
for line in open('/etc/passwd')]

Again, we can think about this in terms of Excel, or even in terms of SQL: My query now produces a single column of results, but each column contains a text string. Or we can even say that the query produces two columns of results, which is not at all unusual in the world of SQL.

Let’s ignore the efficiency (or lack thereof) of invoking str.split twice in one comprehension: When I run this code on my Mac, it results in an exception, claiming that an index is out of range.

The reason is simple: I split each line into a list. But if there’s a line that doesn’t contain any : characters, it’ll be turned into a single-element list. I thus need to weed out any lines that won’t conform. Specifically, on my Mac at least, I need to remove any lines in /etc/passwd that are comments, meaning that they start with the ‘#’ character.

In the world of list comprehensions, I say the following:

[ {'name':line.split(":")[0], 'id':line.split(":")[2]}
for line in open('/etc/passwd')
if not line.startswith("#")]

Let’s extend our earlier SQL analogy further, adding the equivalent SQL syntax in comments after our Python code:

[ {'name':line.split(":")[0], 'id':line.split(":")[2]}    # SELECT
for line in open('/etc/passwd')                           # FROM
if not line.startswith("#")]                              # WHERE

Of course, when the first line of our comprehension becomes this long, it’s often a good idea to use a function, instead. And since the first line can be any legitimate Python expression, a function is often a good idea:

def get_user_info(line):
    name, passwd, id, rest = line.split(":", 3)   # max 4 fields
    return {'name':name, 'id':id}

[ get_user_info(line)             # SELECT
for line in open('/etc/passwd')   # FROM
if not line.startswith("#")]      # WHERE

A list comprehension thus gives you power similar to an SQL SELECT query — except that you’re not querying data in a table, but rather any object that conforms to Python’s iteration protocol, which includes a very  large number of built-in and custom-made objects.

Now, when would you want to use a list comprehension? And how does it differ from a for loop?

Using a list comprehension is appropriate whenever you want to transform data. That is, you have an iterable data source, and you want to create a new list whose elements are based on those of the data source. For  example, let’s assume that (for some reason) I want to find out how many times each character is used in /etc/passwd.  I can thus do the following, using collections.Counter:

from collections import Counter
counts = [Counter(line)
          for line in open('/etc/passwd')
          if not line.startswith("#")]

We know that “counts” is a list, because I used a list comprehension to create it. It is a list containing many Counter objects, one for each non-comment line in /etc/passwd. What if I want to find out what the most  popular character is in each line? I can modify my expression, asking the Counter object for the most common character:

counts = [Counter(line).most_common(1)
          for line in open('/etc/passwd')
          if not line.startswith("#")]

I can extend my expression even more, to get the most popular character from each line (inside of a two-element tuple in a one-element list):

counts = [Counter(line).most_common(1)[0][0]
          for line in open('/etc/passwd')
          if not line.startswith("#")]

And now I can find out how many times each most-popular character appears:

          for line in open('/etc/passwd')
          if not line.startswith("#")])

On my computer, the answer is:

Counter({':': 71, 'e': 4, 's': 1})

Meaning that in 71 non-comment lines, “:” is the most common, but in 4 lines it’s “e”, and in one line it’s “s”.  Now, could I have done this with a for loop?  Yes, of course — but because I’m dealing with iterables, and  because I’m using objects that work with such iterables, I can chain them together to get an answer in a way that doesn’t require me to tell Python how to do its job. I’m doing things like our accountant did, back at the  start of this article — I’m saying what I want, and letting Python do the hard work of dealing with this for me.

When would I use a for loop, then? The distinction is between whether you want to get a list back, and whether you want to execute a command a number of times.  If you want to build a list, and if it’s built on an iterable that already exists, then I’d say a list comprehension is almost certainly going the be the best bet.  But if you want to execute something a number of times without creating a list, then a comprehension is the a bad way to do it; you should use a “for” loop, instead.

It’s true that list comprehensions are faster than for loops. But most of the time, for loops are used for different things than list comprehensions. “for” loops shouldn’t be used when you want to turn one iterable structure into another; that’s for comprehensions. And you shouldn’t execute something (e.g., print) many times via a list comprehension, even if you can do so via a called function.  I’ve found that the dividing line between when to use a “for” loop, and when to use a comprehension, is clearly delineated in the minds of experienced Python developers, but very hazy among newcomers to the language, and to these ideas.

So, to summarize:

  • If you want to execute a command numerous times, use a “for” loop.
  • If you have an iterable, and want to create a new iterable, then a list comprehension is probably your best bet.
  • Building a list comprehension is sort of like working in Excel: You start with a set of data, and you create a new set of data. Any expression can be used to map from one to the other.  You don’t care about how Python does things behind the scenes; you just want to get your new data back.
  • A list comprehension consists of either two or three parts, which are often easier to understand if you put them on separate lines: (1) the expression, (2) the data source, and (3) an optional “if” statement.
  • These three lines are analogous to SQL’s SELECT, FROM, and WHERE clauses in a query.  And just as each of those (SELECT, FROM, and WHERE) can use arbitrary expressions, so too can Python’s list comprehensions use arbitrary expressions. A list comprehension will always return a list, though — just as a SELECT will always return a table-like result set.
  • Do you want to create a set, or perhaps a dictionary, rather than a list?  Then you can use a set comprehension or a dict comprehension. The idea is the same as everything I’ve said about list comprehensions, except that your result will be a single set or a single dictionary.

Do you find it difficult to work with list comprehensions?  If so, what’s hard for you about them?  And does the above help to make their use, and their syntax easier to remember?  I’m eager to hear your reactions, so that I can improve these explanations even further.