Implementing “zip” with list comprehensions

zipperI love Python’s “zip” function. I’m not sure just what it is about zip that I enjoy, but I have often found it to be quite useful. Before I describe what “zip” does, let me first show you an example:

>>> s = 'abc'
>>> t = (10, 20, 30)

>>> zip(s,t)
[('a', 10), ('b', 20), ('c', 30)]

As you can see, the result of “zip” is a sequence of tuples. (In Python 2, you get a list back.  In Python 3, you get a “zip object” back.)  The tuple at index 0 contains s[0] and t[0]. The tuple at index 1 contains s[1] and t[1].  And so forth.  You can use zip with more than one iterable, as well:

>>> s = 'abc'
>>> t = (10, 20, 30)
>>> u = (-5, -10, -15)

>>> list(zip(s,t,u))
[('a', 10, -5), ('b', 20, -10), ('c', 30, -15)]

(You can also invoke zip with a single iterable, thus ending up with a bunch of one-element tuples, but that seems a bit weird to me.)

I often use “zip” to turn parallel sequences into dictionaries. For example:

>>> names = ['Tom', 'Dick', 'Harry']
>>> ages = [50, 35, 60]

>>> dict(zip(names, ages))
{'Harry': 60, 'Dick': 35, 'Tom': 50}

In this way, we’re able to quickly and easily product a dict from two parallel sequences.

Whenever I mention “zip” in my programming classes, someone inevitably asks what happens if one argument is shorter than the other. Simply put, the shortest one wins:

>>> s = 'abc'
>>> t = (10, 20, 30, 40)
>>> list(zip(s,t))
[('a', 10), ('b', 20), ('c', 30)]

(If you want zip to return one tuple for every element of the longer iterable, then use “izip_longest” from the “itertools” package.)

Now, if there’s something I like even more than “zip”, it’s list comprehensions. So last week, when a student of mine asked if we could implement “zip” using list comprehensions, I couldn’t resist.

So, how can we do this?

First, let’s assume that we have our two equal-length sequences from above, s (a string) and t (a tuple). We want to get a list of three tuples. One way to do this is to say:

[(s[i], t[i])              # produce a two-element tuple
 for i in range(len(s))]   # from index 0 to len(s) - 1

To be honest, this works pretty well! But there are a few ways in which we could improve it.

First of all, it would be nice to make our comprehension-based “zip” alternative handle inputs of different sizes.  What that means is not just running range(len(s)), but running range(len(x)), where x is the shorter sequence. We can do this via the “sorted” builtin function, telling it to sort the sequences by length, from shortest to longest. For example:

>>> s = 'abcd'
>>> t = (10, 20, 30)

>>> sorted((s,t), key=len)
[(10, 20, 30), 'abcd']

In the above code, I create a new tuple, (s,t), and pass that as the first parameter to “sorted”. Given these inputs, we will get a list back from “sorted”. Because we pass the builtin “len” function to the “key” parameter, “sorted” will return [s,t] if s is shorter, and [t,s] if t is shorter.  This means that the element at index 0 is guaranteed not to be longer than any other sequence. (If all sequences are the same size, then we don’t care which one we get back.)

Putting this all together in our comprehension, we get:

>>> [(s[i], t[i])    
    for i in range(len(sorted((s,t), key=len)[0]))]

This is getting a wee bit complex for a single list comprehension, so I’m going to break off part of the second line into a function, just to clean things up a tiny bit:

>>> def shortest_sequence_range(*args):
        return range(len(sorted(args, key=len)[0]))

>>> [(s[i], t[i])     
    for i in shortest_sequence_range(s,t) ]

Now, our function takes *args, meaning that it can take any number of sequences. The sequences are sorted by length, and then the first (shortest) sequence is passed to “len”, which calculates the length and then returns the result of running “range”.

So if the shortest sequence is ‘abc’, we’ll end up returning range(3), giving us indexes 0, 1, and 2 — perfect for our needs.

Now, there’s one thing left to do here to make it a bit closer to the real “zip”: As I mentioned above, Python 2’s “zip” returns a list, but Python 3’s “zip” returns an iterator object. This means that even if the resulting list would be extremely long, we won’t use up tons of memory by returning it all at once. Can we do that with our comprehension?

Yes, but not if we use a list comprehension, which always returns a list. If we use a generator expression, by contrast, we’ll get an iterator back, rather than the entire list. Fortunately, creating such a generator expression is a matter of just replacing the [ ] of our list comprehension with the ( ) of a generator expression:

>>> def shortest_sequence_range(*args):
      return range(len(sorted(args, key=len)[0]))

>>> g = ((s[i], t[i])
         for i in shortest_sequence_range(s,t) )

>>> for item in g:
        print(item)
('a', 10)
('b', 20)
('c', 30)

And there you have it!  Further improvements on these ideas are welcome — but as someone who loves both “zip” and comprehensions, it was fun to link these two ideas together.

Fun with floats

I’m in Shanghai, and before I left to teach this morning, I decided to check the weather.  I knew that it would be hot, but I wanted to double-check that it wasn’t going to rain — a rarity during Israeli summers, but not too unusual in Shanghai.

I entered “shanghai weather” into DuckDuckGo, and got the following:

Never mind that it gave me a weather report for the wrong Chinese city. Take a look at the humidity reading!  What’s going on there?  Am I supposed to worry that it’s ever-so-slightly more humid than 55%?

The answer, of course, is that many programming languages have problems with floating-point numbers.  Just as there’s no terminating decimal number to represent 1/3, lots of numbers are non-terminating when you use binary, which computers do.

As a result floats are inaccurate.  Just add 0.1 + 0.2 in many programming languages, and prepare to be astonished.  Wait, you don’t want to fire up a lot of languages? Here, someone has done it for you: http://0.30000000000000004.com/ (I really love this site.)

If you’re working with numbers that are particularly sensitive, then you shouldn’t be using floats. Rather, you should use integers, or use something like Python’s decimal.Decimal, which guarantees accuracy at the expense of time and space. For example:

>> from decimal import Decimal
>>> x = Decimal('0.1')
>>> y = Decimal('0.2')
>>> x + y
Decimal('0.3')
>>> float(x+y)
0.3

Of course, you should be careful not to create your decimals with floats:

>> x = Decimal(0.1)
>>> y = Decimal(0.2)
>>> x + y
Decimal('0.3000000000000000166533453694')

Why is this the case? Let’s take a look:

>> x
Decimal('0.1000000000000000055511151231257827021181583404541015625')

>>> y
Decimal('0.200000000000000011102230246251565404236316680908203125')

So, if you’re dealing with sensitive numbers, be sure not to use floats! And if you’re going outside in Shanghai today, it might be ever-so-slightly less humid than your weather forecast reports.

Speedy string concatenation in Python

As many people know, one of the mantras of the Python programming language is, “There should be one, and only one, way to do it.”  (Use “import this” in your Python interactive shell to see the full list.)  However, there are often times when you could accomplish something in any of several ways. In such cases, it’s not always obvious which is the best one.

A student of mine recently e-mailed me, asking which is the most efficient way to concatenate strings in Python.

The results surprised me a bit — and gave me an opportunity to show her (and others) how to test such things.  I’m far from a benchmarking expert, but I do think that what I found gives some insights into concatenation.

First of all, let’s remember that Python provides us with several ways to concatenate strings.  We can use the + operator, for example:

>> 'abc' + 'def'
'abcdef'

We can also use the % operator, which can do much more than just concatenation, but which is a legitimate option:

>>> "%s%s" % ('abc', 'def')
'abcdef'

And as I’ve mentioned in previous blog posts, we also have a more modern way to do this, with the str.format method:

>>> '{0}{1}'.format('abc', 'def')
'abcdef'

As with the % operator, str.format is far more powerful than simple concatenation requires. But I figured that this would give me some insights into the relative speeds.

Now, how do we time things? In Jupyter (aka IPython), we can use the magic “timeit” command to run code.  I thus wrote four functions, each of which concatenates in a different way. I purposely used global variables (named “x” and “y”) to contain the original strings, and a local variable “z” in which to put the result.  The result was then returned from the function.  (We’ll play a bit with the values and definitions of “x” and “y” in a little bit.)

def concat1(): 
    z = x + y 
    return z 

 def concat2(): 
    z = "%s%s" % (x, y) 
    return z 

def concat3(): 
    z = "{}{}".format(x, y) 
    return z 

def concat4(): 
    z = "{0}{1}".format(x, y) 
    return z

I should note that concat3 and concat4 are almost identical, in that they both use str.format. The first uses the implicit locations of the parameters, and the second uses the explicit locations.  I decided that if I’m already benchmarking string concatenation, I might as well also find out if there’s any difference in speed when I give the parameters’ iindexes.

I then defined the two global variables:

x = 'abc' 
y = 'def'

Finally, I timed running each of these functions:

%timeit concat1()
%timeit concat2()
%timeit concat3()
%timeit concat4()

The results were as follows:

  • concat1: 153ns/loop
  • concat2: 275ns/loop
  • concat3: 398ns/loop
  • concat4: 393ns/loop

From this benchmark, we can see that concat1, which uses +, is significantly faster than any of the others.  Which is a bit sad, given how much I love to use str.format — but it also means that if I’m doing tons of string processing, I should stick to +, which might have less power, but is far faster.

The thing is, the above benchmark might be a bit problematic, because we’re using short strings.  Very short strings in Python are “interned,” meaning that they are defined once and then kept in a table so that they need not be allocated and created again.  After all, since strings are immutable, why would we create “abc” more than once?  We can just reference the first “abc” that we created.

This might mess up our benchmark a bit.  And besides, it’s good to check with something larger. Fortunately, we used global variables — so by changing those global variables’ definitions, we can run our benchmark and be sure that no interning is taking place:

x = 'abc' * 10000 
y = 'def' * 10000

Now, when we benchmark our functions again, here’s what we get:

  • concat1: 2.64µs/loop
  • concat2: 3.09µs/loop
  • concat3: 3.33µs/loop
  • concat4: 3.48µs/loop

Each loop took a lot longer — but we see that our + operator is still the fastest.  The difference isn’t as great, but it’s still pretty obvious and significant.

What about if we no longer use global variables, and if we allocate the strings within our function?  Will that make a difference?  Almost certainly not, but it’s worth a quick investigation:

def concat1(): 
     x = 'abc' * 10000 
     y = 'def' * 10000 
     z = x + y 
     return z 

def concat2(): 
     x = 'abc' * 10000 
     y = 'def' * 10000 
     z = "%s%s" % (x, y) 
     return z 

def concat3(): 
     x = 'abc' * 10000 
     y = 'def' * 10000 
     z = "{}{}".format(x, y) 
     return z 

def concat4(): 
     x = 'abc' * 10000 
     y = 'def' * 10000 
     z = "{0}{1}".format(x, y) 
     return z 

And our final results are:

  • concat1: 4.89µs/loop
  • concat2: 5.78µs/loop
  • concat3: 6.22µs/loop
  • concat4: 6.19µs/loop

Once again, we see that + is the big winner here, but (again) but less of a margin than was the case with the short strings.  str.format is clearly shorter.  And we can see that in all of these tests, the difference between “{0}{1}” and “{}{}” in str.format is basically zero.

Upon reflection, this shouldn’t be a surprise. After all, + is a pretty simple operator, whereas % and str.format do much more.  Moreover, str.format is a method, which means that it’ll have greater overhead.

Now, there are a few more tests that I could have run — for example, with more than two strings.  But I do think that this demonstrates to at least some degree that + is the fastest way to achieve concatenation in Python.  Moreover, it shows that we can do simple benchmarking quickly and easily, conducting experiments that help us to understand which is the best way to do something in Python.

Another free regexp Q&A webinar!

The last Webinar I did, with Q&A about regular expressions, was great fun — so much, that I’ve decided to do another one.

So, if you have questions (big or little) about regular expressions in Python, Ruby, JavaScript, and/or PostgreSQL, sign up for this free Webinar on Monday, April 11th: https://www.crowdcast.io/e/regexpqa2

If you already have questions, you can leave them in advance using the Crowdcast Q&A system.  (Or just surprise me during the Webinar itself.)

I look forward to seeing you there!

Free Webinar: Regexp Q&A

practice-makes-regexp-coverTo celebrate the publication of my new ebook, Practice Makes Regexp, my upcoming Webinar (on March 22nd) is all about regular expressions (“regexps”) in Python, Ruby, JavaScript, and PostgreSQL, as well as the Unix “grep” command.

Unlike previous Webinars, in which I gave a presentation and then took Q&A, this time will be all about Q&A: I want you to come with your questions about regular expressions, or even projects that you’re wondering how to attack using them.

I’ll do my best to answer your questions, whether they be about regexp syntax, differences between implementations and languages, how to debug hairy regexps, and even when they might not be the most appropriate tool for the job.

Please join me on March 22nd by signing up here:

http://ccst.io/e/regexpqa

And when you sign up, please don’t forget to ask a question or two!  (You can do that it advance — and doing so will really help me to prepare detailed answers.)

I look forward to your questions on the 22nd!

Reuven

Yes, you can master regular expressions!

Announcing: My new book, “Practice Makes Regexp,” with 50 exercises meant to help you learn and master regular expressions. With explanations and code in Python, Ruby, JavaScript, and PostgreSQL.

I spend most of my time nowadays going to high-tech companies and training programmers in new languages and techniques. Actually, many of the things I teach them aren’t really new; rather, they’re new to the participants in my training. Python has been around for 25 years, but for my students, it’s new, and even a bit exciting.

I tell participants that my job is to add tools to their programming toolbox, so that if they encounter a new problem, they’ll have new and more appropriate or elegant ways to attack and solve it. Moreover, I tell them, once you are intimately familiar with a tool or technique, you’ll suddenly discover opportunities to use it.
Earlier this week, I was speaking with one of my consulting clients, who was worried that some potentially sensitive information had been stored in their Web application’s logfiles — and they weren’t sure if they had a good way to search through the logs.

 

I suggested the first solution that came to mind: Regular expressions.

Regular expressions are a lifesaver for anyone who works with text.  We can use them to search for patterns in files, in network data, and in databases. We can use them to search and replace.  To handle protocols that have changed ever so slightly from version to version. To handle human input, which is always messier than what we get from other computers.

Regular expressions are one of the most critical tools I have in my programming toolbox.  I use them at least a few times each day, and sometimes even dozens of times in a given day.

So, why don’t all developers know and use regular expressions? Quite simply, because the learning curve is so steep. Regexps, as they’re also known, are terse and cryptic. Changing one character can have a profound impact on what text a regexp matches, as well as its performance. Knowing which character to insert where, and how to build up your regexps, is a skill that takes time to learn and hone.

Many developers say, “If I have a problem that involves regular expressions, I’ll just go to Stack Overflow, where my problem has likely been addressed already.” And in many cases, they’re right.

But by that logic, I shouldn’t learn any French before I go to France, because I can always use a phrasebook.  Sure, I could work that way — but it’s far less efficient, and I’ll miss many opportunities that would come my way if I knew French.

Moreover, relying on Stack Overflow means that you never get a full picture of what you can really do with regular expressions. You get specific answers, but you don’t have a fully formed mental model of what they are and how they work.

But wait, it gets worse: If you’re under the gun, trying to get something done for your manager or a big client, you can’t spend time searching through Stack Overflow. You need to bring your best game to the table, demonstrating fluency in regular expressions.  Without that fluency, you’ll take longer to solve the problem — and possibly, not manage to solve it at all.

Believe me, I understand — my first attempt at learning regular expressions was a complete failure. I read about them in the Emacs manual, and thought to myself, “What could this seemingly random collection of characters really do for me?”  I ignored them for a few more years, until I started to program in Perl — a language that more or less expected you to use regexps.

So I spent some time learning regexp syntax.  The more I used them,  the more opportunities I found to use them.  And the more I found that they made my life easier, better, and more convenient.  I was able to solve problems that others couldn’t — or even if they could, they took much longer than I did.  Suddenly, processing text was a breeze.

I was so excited by what I had learned that when I started to teach advanced programming courses, I added regexps to the syllabus.  I figured that I could figure out a way to make regexps understandable in an hour or two.

But boy, was I wrong: If there’s something that’s hard for programmers to learn, it’s regular expressions.  I’ve thus created a two-day course for people who want to learn regular expressions.  I not only introduce the syntax, but I have them practice, practice, and practice some more.  I give them situations and tasks, and their job is to come up with a regexp that will solve the problem I’ve given them.  We discuss different solutions, and the way that different languages might go about solving the problem.

After lots of practice, my students not only know regexp syntax — they know when to use it, and how to use it.  They’re more efficient and valuable employees. They become the person to whom people can turn with tricky text-processing problems.  And when the boss is pressuring them for a

ImageAnd so, I’m delighted to announce the launch of my second ebook, “Practice Makes Regexp.”  This book contains 50 tasks for you to accomplish using regular expressions.  Once you have solved the problem, I present the solution, walking you through the general approach that we would use in regexps, and then with greater depth (and code) to solve the problem in Python, Ruby, JavaScript, and PostgreSQL.  My assumption in the book is that you have already learned regexps elsewhere, but that you’re not quite sure when to use them, how to apply them, and when each metacharacter is most appropriate.

After you go through all 50 exercises, I’m sure that you’ll be a master of regular expressions.  It’ll be tough going, but the point is to sweat a bit working on the exercises, so that you can worry a lot less when you’re at work. I call this “controlled frustration” — better to get frustrated working on exercises, than when the boss is demanding that you get something done right away.

Right now, the book is more than 150 pages long, with four complete chapters (including 17 exercises).  Within two weeks, the remaining 33 exercises will be done.  And then I’ll start work on 50 screencasts, one for each of the exercises, in which I walk you through solutions in each of Python, Ruby, JavaScript, and PostgreSQL.  If my previous ebook is any guide, there will be about 5 hours (!) of screencasts when I’m all done.

If you have always shied away from learning regular expressions, or want to harness their power, Practice Makes Regexp is what you have been looking for.  It’s not a tutorial, but it will help you to understand and internalize regexps, helping you to master a technology that frustrates many people.

To celebrate this launch, I’m offering a discount of 10%.  Just use the “regexplaunch” offer code, and take 10% off of any of the packages — the book, the developer package (which includes the solutions in separate program files, as well as the 300+ slides from the two-day regexp course I give at Fortune 100 companies), or the consultant package (which includes the screencasts, as well as what’s in the developer package).

I’m very excited by this book.  I think that it’ll really help a lot of people to understand and use regular expressions.  And I hope that you’ll find it makes you a more valuable programmer, with an especially useful tool in your toolbox.

All 50 “Practice Makes Python” screencasts are complete!

I’m delighted to announce that I’ve completed a screencast for every single one of the 50 exercises in my ebook, “Practice Makes Python.”  This is more than 300 minutes (5 hours!) of Python instruction, helping you to become a more expert Python programmer.

Each screencast consists of me solving one of the exercises in real-time, describing what I’m doing and what I’m doing it.   They range in length from 4 to 10 minutes.  The idea is that you’ll do the exercise, and then watch my video to compare your answer (and approach) with mine.

If you enjoy my Webinars or in-person courses, then I think you’ll also enjoy these videos.

The screencasts, available with the two higher-tier “Practice Makes Python” packages,  can be streamed in HD video quality, or can be downloaded (DRM-free) to your computer for more convenient viewing.

To celebrate finally finishing these videos, I’m offering the two higher-end packages at 20% off for the coming week, until February 18th. Just use the offer code “videodone” with either the “consultant” or “developer” package, and enjoy a huge amount of Python video.

You can explore these packages at the “Practice Makes Python” Web site.

Not interested in my book, but still want to improve your Python skills?  You can always take one of my two free e-mail courses, on Python variable scoping and working with files. Those are and will remain free forever. And of course, there’s my free Webinar on Python and data science next week.

Free Webinar: Pandas and Matplotlib

It’s time for another free hour-long Webinar! This time, I’ll be talking about the increasingly popular tools for data science in Python, namely Pandas and Matplotlib. How can you read data into Pandas, manipulate it, and then plot it? I’ll show you a large number of examples and use cases, and we’ll also have lots of time for Q&A. Previous Webinars have been lots of fun, and I expect that this one will be, too!

Register (for free) to participate here:

https://www.eventbrite.com/e/analzying-and-viewing-data-with-pandas-and-matplotlib-tickets-21198157259

If you aren’t sure whether you’ll be able to make it, you can still sign up; I’ll be sending information, and a URL with the recording afterwards, soon after the Webinar concludes.

I look forward to seeing you there; if you have any questions, please feel free to contact me at reuven@lerner.co.il or on Twitter as @reuvenmlerner.

Reminder: Free Webinar on data science in Python

There’s still time to register for my free, one-hour Webinar on data science in Python, which will be tomorrow (Tuesday).  There’s clearly too much material for me to give just one Webinar, so this will be the first in a series that I’ll be offering over the coming months.  But if you’re interested in hearing how Python fits into the world of data science, or how you can use free, open-source tools to do lots of great analysis work, then I invite you to join me for what should be a fun time:

https://www.eventbrite.com/e/data-analysis-with-python-tickets-19543502141

There will be plenty of live-coding demos, bad jokes, and chances for you to ask questions. And it should be lots of fun, besides!

Free Webinar: Data science with Python on December 8th

It’s time for me to do another free one-hour Webinar, this time about data science with Python. It’ll be on December 8th, at 9 p.m. GMT.

Data science is all the rage, and rightly so — and Python is one of the best-known and best-equipped languages in which to do it.  In this Webinar, I’ll review some of the most popular packages used for analysis, including NumPy, SciPy, Pandas, and matplotlib, and will show how they can be used to answer questions that we have about our data.

As always, I hope that there will be lots of questions — and if we’re lucky, I’ll be able to provide answers, too!  Please come prepared for a highly interactive and fun event.  I’ll e-mail all registered participants about an hour before the Webinar with links for participating.

You can register at Eventbrite: https://www.eventbrite.com/e/data-analysis-with-python-tickets-19543502141

I look forward to seeing you there!  If you have any questions, please let me know via e-mail at reuven@lerner.co.il or on Twitter as @reuvenmlerner.