The (lack of a) case against Python 3

A few days ago, well-known author and developer Zed Shaw wrote a blog post, “The Case Against Python 3.”  I have a huge amount of respect for Zed’s work, and his book (Learn Python the Hard Way) takes an approach similar to mine — so much so that I often tell people who are about to take my course to read it in preparation, and tell people who want more practice after finishing my course to read it afterwards.

It was thus disappointing for me to see Zed’s post about Python 3, with which I disagree.

Let’s make it clear: About 90% of my work is as a Python trainer at various large companies; my classes range from “Python for non-programmers” and “Intro Python” to “Data science and machine learning in Python,” with a correspondingly wide range of backgrounds. I would estimate that at least 95% of the people I teach are using Python 2 in their work.

In my own development work, I switch back and forth between Python 2 and 3, depending on whether it’s for a client, for myself, and what I plan to do with it.

So I’m far from a die-hard “Python 3 or bust” person. I recognize that there are reasons to use either 2 or 3.  And I do think that if there’s a major issue in the Python world today, it’s in the world of 2 vs. 3.

But there’s a difference between recognizing a problem, and saying that Python 3 is a waste of time — or, as Zed is saying, that it’s a mistake to teach Python 3 to new developers today.  Moreover, I think that the reasons he gives aren’t very compelling, either for newcomers to programming in general, or for experienced programmers moving to Python.

Zed’s argument seems to boil down to:

  • Implementing Unicode in Python 3 has made things harder, and
  • The fact that you cannot run Python 2 programs in the Python 3 environment, but instead need to translate them semi-automatically with a combination of 2to3 and manual intervention, is crazy and broken.

I think that the first is a bogus argument, and the second is overstating the issues by a lot.

As for Unicode: This was painful. It was going to be painful no matter what.  Maybe the designers got some things wrong, but on the whole, Unicode works well (I think) in Python 3.

In my experience, 90% of programmers don’t need to think about Unicode, because so many programmers use ASCII in their work.  For them, Python 3 works just fine, no better (and no worse) than Python 2 on this front.

For people who do need Unicode, Python 3 isn’t perfect, but it’s far, far better than Python 2. And given that some huge proportion of the world doesn’t speak English, the notion that a modern language won’t natively support Unicode strings is just nonsense.

This does mean that code needs to be rewritten, and that people need to think more before using strings that contain Unicode.  Yes, those are problems.  And Zed points out some issues with the implementation that can be painful for people.

But again, the population that will be affected is the 10% who deal with Unicode.  That generally doesn’t include new developers — and if it does, everything is hard for them.  So the notion that Unicode problems make Python 3 impossible to use is just silly.  And the notion that Python can simply ignore Unicode needs, or treat non-English characters as an afterthought, is laughable in the modern world.

The decision not to let Python 2 programs run in the Python 3 VM might look foolish in hindsight.  But if the migration from Python 2 to 3 is slow now, imagine how much slower it would be if companies never needed to migrate at all.  Heck, that might still happen come 2020, if large companies simply don’t migrate.  I actually believe that large companies won’t ever translate their Python 2 code into Python 3.  It’s cheaper and easier for them to pay people to keep maintaining Python 2 code than to move mission-critical code to a new platform.  So new stuff will be in Python 3, and old stuff will be in Python 2.

I’m not a language designer, and I’m not sure how hard it would have been to allow both 2 and 3 to run on the same VM. I’m guessing that it would have been quite hard — if it had been easy, the core developers presumably would have done it, since it would have saved a great deal of pain and angst among Python developers — and I do think that the Python developers have gone out of their way to make the transition easier.

Let’s consider who this lack of v2 backward compatibility affects, and what a compatible VM might have meant to them:

  • For new developers using Python 3, it doesn’t matter.
  • For small (and individual) shops that have some software in Python 2 and want to move to 3, this is frustrating, but it’s doable to switch, albeit incrementally.  This switch wouldn’t have been necessary if the VM were multi-version capable.
  • For big shops, they won’t switch no matter what. They are fully invested in Python 2, and it’s going to be very hard to convince them to migrate their code — in 2016, in 2020, and in 2030.

(PS: I sense a business opportunity for consultants who will offer Python 2 maintenance support contracts starting in 2020.)

So the only losers here are legacy developers, who will need to switch in the coming three years.  That doesn’t sound so catastrophic to me, especially given how many new developers are learning Python 3, the growing library compatibility with 3, and the fact that 3 increasingly has features that people want. With libraries such as six, making your code run in both 2 and 3 isn’t so terrible; it’s not ideal, but it’s certainly possible.
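
For instance, here’s a minimal sketch of the kind of compatibility shim that six makes possible. The helper function name is my own invention; six.PY2 and six.text_type are real attributes of the library:

import six

def to_text(value, encoding='utf-8'):
    """Return a text (Unicode) string under both Python 2 and Python 3."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return six.text_type(value)

if six.PY2:
    print('Running under Python 2')
else:
    print('Running under Python 3')

print(to_text(b'hello'))   # prints 'hello' as a text string in both versions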

One of Zed’s points strikes me as particularly silly: The lack of Python 3 adoption doesn’t mean that Python 3 is a failure.  It means that Python users have entrenched business interests, and would rather stick with something they know than upgrade to something they don’t.  This is a natural way to do things, and you see it all the time in the computer industry.  (Case in point: Airlines and banks, which run on mainframes with software from the 1970s and 1980s.)

Zed does have some fair points: Strings are more muddled than I’d like (with too many options for formatting, especially in the next release), and some of the core libraries do need to be updated and/or documented better. And maybe some of those error messages you get when mixing Unicode and bytestrings could be improved.

But to say that the entire language is a failure because you get weird results when combining a (Unicode) string and a bytestring using str.format… in my experience, if someone is doing such things, then they’re no longer a newcomer, and know how to deal with some of these issues.
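
To give a sense of the sort of “weird results” in question, here is a small illustration of my own (not Zed’s example); the exact TypeError wording varies across Python 3 versions:

>>> '{}'.format(b'hello')     # the bytes object's repr leaks into the result
"b'hello'"

>>> 'abc' + b'def'            # mixing str and bytes fails outright
Traceback (most recent call last):
  ...
TypeError: can only concatenate str (not "bytes") to str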

Python 3 isn’t a failure, but it’s not a massive success, either.  I believe that the reasons for that are (1) the Python community is too nice, and has allowed people to delay upgrading, and (2) no one ever updates anything unless they have a super-compelling reason to do so and they can’t afford not to.  There is a growing number of super-compelling reasons, but many companies are still skeptical of the advantages of upgrading. I know of people who have upgraded to Python 3 for its async capabilities.

Could the Python community have handled the migration better? Undoubtedly. Would it be nice to have more, and better, translation tools?  Yes.  Is Unicode a bottomless pit of pain, no matter how you slice it, with Python 3’s implementation being a pretty good one, given the necessary trade-offs? Yes.

At the same time, Python 3 is growing in acceptance and usage. Oodles of universities now teach Python 3 as an introductory language, which means that in the coming years, a new generation of developers will graduate and expect/want to use Python 3. People in all sorts of fields are using Python, and many of them are switching to Python 3.

The changes are happening: Slowly, perhaps, but they are happening. And it turns out that Python 3 is just as friendly to newbies as Python 2 was. Which doesn’t mean that it’s wart-free, of course — but as time goes on, the inertia keeping people from upgrading will wane.

I doubt that we’ll ever see everyone in the Python world using Python 3. But to dismiss Python 3 as a grave error, and to say that it’ll never catch on, is far too sweeping, and ignores trends on the ground.

Enjoyed this article? Subscribe to my free weekly newsletter; every Monday, I’ll send you new ideas and insights into programming — typically in Python, but with some other technologies thrown in, as well!  Subscribe at http://lerner.co.il/newsletter.

Why you should read “Weapons of Math Destruction”

Review of “Weapons of Math Destruction: How big data increases inequality and threatens democracy,” by Cathy O’Neil

Over the last few years, the study of statistics has taken on new meaning and importance with the emergence of “data science,” an imprecisely defined discipline that merges aspects of statistics with computer science. Google, Amazon, and Facebook are among the most famous companies putting data science to use, looking for trends among their users.

How does Facebook know who your friends might be, or which advertisements you want to watch? Data science. How does Amazon know which books you’re likely to buy, or how much to charge you for various products? Data science. How does Google know which search results to show you? Data science. How does Uber know when to implement surge pricing, and how much to charge? Data science.

A key part of data science is machine learning, in which the computer is trained to identify the factors that might lead to problems. If you have ever tried to make a legitimate credit-card payment, but your card has been denied because it looked suspicious, you can be sure that it didn’t “look” bad to a human. Rather, a machine-learning system, having been trained on millions of previous transactions, did its best to put you into the “good” or “bad” category.

Today, machine learning affects everything from what advertisements we see to translation algorithms to automatic driving systems to the ways in which politicians contact voters. Indeed, the secret weapon of Barack Obama’s two presidential campaigns was apparently his finely tuned data science system, which provided a shockingly accurate picture of which voters might change their minds, and what the best way to reach them would be. (A great book on the subject is The Victory Lab, by Sasha Issenberg.)

I’ve been getting both excited and optimistic about the ability of data science to improve our lives. Every week, I hear (often on the Partially Derivative podcast) and read amazing, new stories about how data science helped to solve problems that would otherwise be difficult, time-consuming, or impossible to deal with.

Cathy O’Neil’s new book, “Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy,” has dampened my optimism and enthusiasm — and for that, I thank her.  The book came out at just the right time, pointing out where data science is causing more harm than good, and warning us to think, and even regulate it, before we let it take over our lives in too many more ways.

Even before reading this book, I admired O’Neil: Her blog, mathbabe.org, has useful insights about the use of math in everyday life, her book “Doing Data Science” is a great introduction to the subject, and she’s a panelist on the “Slate Money” podcast, which I thoroughly enjoy each week.

While O’Neil is easygoing and funny in her writing and speaking, her book is deadly serious. In it, she says that the widespread use of data science is leading to serious problems in society.  She blames a number of different things for these failures. In particular, the opacity of many of the algorithms used makes them impossible to understand or evaluate. Their widespread use across huge populations for important decisions, and the frequent inability to find and appeal to a human, to insert some (pardon the term) common sense into the equation, means that mistakes can have far-reaching effects. Even if the results are good for most of the people most of the time, they can be bad for some of the people (and sometimes even most of the people) quite a bit of the time.

In statistics, you’re always dealing with averages, generalities, and degrees of confidence. When you’re letting a computer make decisions about people’s jobs, health, education, and court cases, you need to err on the safe side. Otherwise, many people could end up having their lives destroyed because they were statistical outliers, or didn’t quite match the profile you intended.

O’Neil points out, early on in the book, that data science involves creating statistical models. Models represent a form of reality, and help us to understand reality, but they aren’t themselves reality. The designer of a model needs to decide which factors to include and exclude. This decision-making process is, as O’Neil points out, riddled with the potential for error. This is particularly true if the thing you’re trying to measure isn’t easily quantified; in such cases, it’s common to use a proxy value.

For example, let’s say that you want to know how happy people are. You can’t directly measure that, so you use a proxy value for it — say, how much money people spend on luxury items. Not only is this a lousy proxy because there are lots of other reasons to buy luxury goods, but it’s likely to show that poor people are never happy. By choosing a bad proxy, you have made the model worthless. Combine a few bad proxy values, and unleash it on a large population, and you’re likely to do real harm.

Even if you choose your inputs (direct and proxies) correctly, your model will still likely have mistakes. That’s why it’s crucial to refine and improve the model over time, checking it against real-world data. As O’Neil points out in the book, this is why it makes sense for sports teams to model their players’ techniques; over time, they will analyze lots of players and games, and find out which factors are correlated with winning and losing. But in the case of a classroom teacher’s performance, how many inputs do you have? And how often does a fired teacher’s performance at other schools get factored into the model? Moreover, what if the inputs aren’t reliable? Put all three of these factors together, and you end up with a model that’s effectively random — but that still ends up getting good teachers fired, while bad teachers remain.

(I should point out that the software I developed for my PhD dissertation, the Modeling Commons, is a collaborative, Web-based system for modeling with NetLogo. I developed it with the hope and expectation that by sharing models and discussing them, quality and understanding will both improve over time.)

As O’Neil points out, updates to models based on empirical data are rare, often because it is hard or impossible to collect such information. But as she points out, that’s no excuse; if you don’t update a model, it’s basically useless. If you give it a tiny number of inputs, its training is useless. And if your input data has the potential of being fudged, then you’re truly in terrible trouble. Given the choice between no model and a bad model, you’re probably better off with no model.

The thing is, these sorts of poorly designed, never-updated algorithms are playing a larger and larger role in our lives.  They’re being used to determine whether people are hired and fired, whether insurance companies accept or reject applications, and how people’s work schedules are determined.

Some of O’Neil’s most damning statements have to do with race, poverty, and discrimination in the United States. By using inappropriate proxies, police departments might reduce crime, but they do so by disproportionately arresting blacks.   And indeed, O’Neil isn’t saying that these data science algorithms aren’t efficient. But their efficiency is leading to behavior and outcomes that are bad for many individuals, and also for the long-term effects on society.

Sure, the “broken windows” form of policing might bring police to a neighborhood where they’re needed — but it will also result in more arrests in that neighborhood, leading to more residents being in trouble with the law simply because there are police officers in view of the perpetrators. Add to that the fact that many courts give longer sentences to those who are likely to return to a life of crime, and that they measure this likelihood based on the neighborhood in which the defendant was raised — and you can easily see how good intentions lead to a disturbing outcome.

Moreover, we’ve gotten to the point in which no one knows or understands how many of these models work. This leads to the absurd situation in which everyone assumes the computer is doing a good job because it’s neutral. But it’s not neutral; it reflects the programmers’ understanding of its various inputs. The fact that no one knows what the model does, and that the public isn’t allowed to try to look at them, means that we’re being evaluated in ways we don’t even know. And these evaluations are affecting millions of people’s lives.

O’Neil suggests some ways of fixing this problem; conservatives will dislike her suggestions, which include government monitoring of data usage, and stopping organizations from sharing their demographic data. In Europe, for example, she points out that companies not only have to tell you what information they have about you, but are also prohibited from sharing such information with other companies. She also says that data scientists have the potential to do great harm, and even kill people — and that it’s thus high time for data scientists to have a “Hippocratic oath” for data, mirroring the famous oath that doctors take. And the idea that many more of these algorithms should be open to public scrutiny and criticism is a very wise one, even if I believe that it’s unrealistic.

Now, I think that some of O’Neil’s targets don’t deserve her scorn. For example, I continue to think that it’s fascinating and impressive that a modern political party can model a country’s citizens in such detail, and then use that data to decide whom to target, and how. But her point about how US elections now effectively come down to a handful of areas in a handful of states, because only those are likely to decide the election, did give me pause.

I read a lot, and I try to read things that will impress and inform me. But “Weapons of Math Destruction” is the first book in a while to really shake me up, forcing me to reassess my enthusiasm for the increasingly widespread use of data science. She convinced me that I fell into the same trap that has lured so many technologists before me — namely, that a technology that makes us more efficient, and that can do new things that help so many, doesn’t have a dark side.  I’m not a luddite, and neither is O’Neil, but it is crucial that we consider the positive and negative influences of data science, and work to decrease the negative influences as much as possible.

The main takeaway from the book is that we shouldn’t get rid of data science or machine learning. Rather, we should think more seriously about where it can help, what sorts of models we’re building, what inputs and outcomes we’re measuring, whether those measures accurately reflect our goals, and whether we can easily check and improve our models. These are tools, and like all tools, they can be used for good and evil. Moreover, because of the mystique and opacity associated with computers and math, it’s easy for people to be lured into thinking that these models are doing things that they aren’t.

If you’re a programmer or data scientist, then you need to read this book, if only to think more deeply about what you’re doing. If you’re a manager planning to incorporate data science into your organization’s work, then you should read this book, to increase the chances that you’ll end up having a net positive effect. And if you’re a policymaker, then you should read this book, to consider ways in which data science is changing our society, and how you can (and should) ensure that it is a net positive.

In short, you should read this book. Even if you don’t agree with all of it, you’ll undoubtedly find it thought-provoking, and a welcome counterbalance to our all-too-frequent, unchecked cheerleading of technological change.

Implementing “zip” with list comprehensions

I love Python’s “zip” function. I’m not sure just what it is about zip that I enjoy, but I have often found it to be quite useful. Before I describe what “zip” does, let me first show you an example:

>>> s = 'abc'
>>> t = (10, 20, 30)

>>> zip(s,t)
[('a', 10), ('b', 20), ('c', 30)]

As you can see, the result of “zip” is a sequence of tuples. (In Python 2, you get a list back.  In Python 3, you get a “zip object” back.)  The tuple at index 0 contains s[0] and t[0]. The tuple at index 1 contains s[1] and t[1].  And so forth.  You can use zip with more than one iterable, as well:

>>> s = 'abc'
>>> t = (10, 20, 30)
>>> u = (-5, -10, -15)

>>> list(zip(s,t,u))
[('a', 10, -5), ('b', 20, -10), ('c', 30, -15)]

(You can also invoke zip with a single iterable, thus ending up with a bunch of one-element tuples, but that seems a bit weird to me.)

I often use “zip” to turn parallel sequences into dictionaries. For example:

>>> names = ['Tom', 'Dick', 'Harry']
>>> ages = [50, 35, 60]

>>> dict(zip(names, ages))
{'Harry': 60, 'Dick': 35, 'Tom': 50}

In this way, we’re able to quickly and easily produce a dict from two parallel sequences.

Whenever I mention “zip” in my programming classes, someone inevitably asks what happens if one argument is shorter than the other. Simply put, the shortest one wins:

>>> s = 'abc'
>>> t = (10, 20, 30, 40)
>>> list(zip(s,t))
[('a', 10), ('b', 20), ('c', 30)]

(If you want zip to return one tuple for every element of the longest iterable, then use “izip_longest” from the “itertools” module in Python 2, or “zip_longest” in Python 3.)
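
For example, here’s what the Python 3 version looks like; the optional “fillvalue” argument determines what goes into the missing slots:

>>> from itertools import zip_longest
>>> s = 'abc'
>>> t = (10, 20, 30, 40)

>>> list(zip_longest(s, t, fillvalue='-'))
[('a', 10), ('b', 20), ('c', 30), ('-', 40)]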

Now, if there’s something I like even more than “zip”, it’s list comprehensions. So last week, when a student of mine asked if we could implement “zip” using list comprehensions, I couldn’t resist.

So, how can we do this?

First, let’s assume that we have our two equal-length sequences from above, s (a string) and t (a tuple). We want to get a list of three tuples. One way to do this is to say:

[(s[i], t[i])              # produce a two-element tuple
 for i in range(len(s))]   # from index 0 to len(s) - 1

To be honest, this works pretty well! But there are a few ways in which we could improve it.

First of all, it would be nice to make our comprehension-based “zip” alternative handle inputs of different sizes.  What that means is not just running range(len(s)), but running range(len(x)), where x is the shorter sequence. We can do this via the “sorted” builtin function, telling it to sort the sequences by length, from shortest to longest. For example:

>>> s = 'abcd'
>>> t = (10, 20, 30)

>>> sorted((s,t), key=len)
[(10, 20, 30), 'abcd']

In the above code, I create a new tuple, (s,t), and pass that as the first parameter to “sorted”. Given these inputs, we will get a list back from “sorted”. Because we pass the builtin “len” function to the “key” parameter, “sorted” will return [s,t] if s is shorter, and [t,s] if t is shorter.  This means that the element at index 0 is guaranteed not to be longer than any other sequence. (If all sequences are the same size, then we don’t care which one we get back.)

Putting this all together in our comprehension, we get:

>>> [(s[i], t[i])    
    for i in range(len(sorted((s,t), key=len)[0]))]

This is getting a wee bit complex for a single list comprehension, so I’m going to break off part of the second line into a function, just to clean things up a tiny bit:

>>> def shortest_sequence_range(*args):
        return range(len(sorted(args, key=len)[0]))

>>> [(s[i], t[i])     
    for i in shortest_sequence_range(s,t) ]

Now, our function takes *args, meaning that it can take any number of sequences. The sequences are sorted by length, the shortest one is passed to “len”, and the resulting length is passed to “range”, whose result is then returned.

So if the shortest sequence is ‘abc’, we’ll end up returning range(3), giving us indexes 0, 1, and 2 — perfect for our needs.

Now, there’s one thing left to do here to make it a bit closer to the real “zip”: As I mentioned above, Python 2’s “zip” returns a list, but Python 3’s “zip” returns an iterator object. This means that even if the resulting list would be extremely long, we won’t use up tons of memory by returning it all at once. Can we do that with our comprehension?

Yes, but not if we use a list comprehension, which always returns a list. If we use a generator expression, by contrast, we’ll get an iterator back, rather than the entire list. Fortunately, creating such a generator expression is a matter of just replacing the [ ] of our list comprehension with the ( ) of a generator expression:

>>> def shortest_sequence_range(*args):
      return range(len(sorted(args, key=len)[0]))

>>> g = ((s[i], t[i])
         for i in shortest_sequence_range(s,t) )

>>> for item in g:
        print(item)
('a', 10)
('b', 20)
('c', 30)

And there you have it!  Further improvements on these ideas are welcome — but as someone who loves both “zip” and comprehensions, it was fun to link these two ideas together.
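
As one possible next step (my own sketch, not part of the standard library), the same ideas can be wrapped in a generator function that, like the real “zip”, accepts any number of sequences. Like the comprehensions above, it works on sequences that support len and indexing, not on arbitrary iterables:

>>> def shortest_sequence_range(*args):
        return range(len(sorted(args, key=len)[0]))

>>> def my_zip(*sequences):
        return (tuple(seq[i] for seq in sequences)
                for i in shortest_sequence_range(*sequences))

>>> list(my_zip('abc', (10, 20, 30, 40)))
[('a', 10), ('b', 20), ('c', 30)]

The name my_zip is hypothetical, of course; in real code, you would just use the builtin.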

Fun with floats

I’m in Shanghai, and before I left to teach this morning, I decided to check the weather.  I knew that it would be hot, but I wanted to double-check that it wasn’t going to rain — a rarity during Israeli summers, but not too unusual in Shanghai.

I entered “shanghai weather” into DuckDuckGo, and got the following:

Never mind that it gave me a weather report for the wrong Chinese city. Take a look at the humidity reading!  What’s going on there?  Am I supposed to worry that it’s ever-so-slightly more humid than 55%?

The answer, of course, is that many programming languages have problems with floating-point numbers.  Just as there’s no terminating decimal representation of 1/3, lots of decimal fractions (0.1, for example) have no terminating representation in binary, which is what computers use.

As a result, floats are inaccurate.  Just add 0.1 + 0.2 in many programming languages, and prepare to be astonished.  Wait, you don’t want to fire up a lot of languages? Here, someone has done it for you: http://0.30000000000000004.com/ (I really love this site.)
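
In Python, for instance, the standard float repr makes the problem easy to see:

>>> 0.1 + 0.2
0.30000000000000004

>>> 0.1 + 0.2 == 0.3
False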

If you’re working with numbers that are particularly sensitive, then you shouldn’t be using floats. Rather, you should use integers, or use something like Python’s decimal.Decimal, which represents decimal fractions exactly, at the expense of time and space. For example:

>>> from decimal import Decimal
>>> x = Decimal('0.1')
>>> y = Decimal('0.2')
>>> x + y
Decimal('0.3')
>>> float(x+y)
0.3

Of course, you should be careful not to create your decimals with floats:

>>> x = Decimal(0.1)
>>> y = Decimal(0.2)
>>> x + y
Decimal('0.3000000000000000166533453694')

Why is this the case? Let’s take a look:

>>> x
Decimal('0.1000000000000000055511151231257827021181583404541015625')

>>> y
Decimal('0.200000000000000011102230246251565404236316680908203125')

So, if you’re dealing with sensitive numbers, be sure not to use floats! And if you’re going outside in Shanghai today, it might be ever-so-slightly less humid than your weather forecast reports.

Announcing: An online community for technical trainers

Over the last few years, my work has moved away from day-to-day software development, and more in the direction of technical training: Helping companies (and individuals) by teaching people how to solve problems in new ways.  Nowadays, I spend most of my time teaching courses in Python (at a variety of levels), regular expressions, data science, Git, and PostgreSQL.

And I have to say: I love it. I love helping people to do things they couldn’t do before.  I love meeting smart and interesting people who want to do their jobs better.  I love helping companies to become more efficient, and to solve problems they couldn’t solve before.  And I love the travel; next week, I leave for my 16th trip to China, and I’ll likely teach 5-6 classes in Europe before the year is over.

The thing is, I’m not alone: There are other people out there who do training, and who have the same feeling of excitement and satisfaction.

At the same time, trainers are somewhat lonely: To whom do we turn to improve our skills? Not our technical skills, but our skills as trainers? And our business skills as consultants who are looking to improve our knowledge of the training market?

Over the last year, I’ve started to help more and more people who are interested in becoming trainers. I’ve started a coaching practice. I’ve given Webinars and talks at conferences. I’ve started to work on a book on the subject.

But as of last week, I’ve also started a new, free community for technical trainers on Facebook. If you engage in training, or have always wanted to do so, then I invite you to join us at http://facebook.com/groups/techtraining.

I should note that this group is not for people running training businesses. Rather, it’s for the trainers themselves — the people who spend several days each month in a classroom, trying to get their ideas across in the best possible ways.

In this group, we’ll share ideas about (among other things):

  • How to find clients
  • How to prepare courses
  • What a good syllabus and/or proposals look like
  • How to decide whether a course is worth doing
  • How to price courses
  • Working on your own vs. via training companies
  • How to upsell new courses to your clients
  • How education research can help us to teach better

If you are a trainer, or want to be one, then I urge you to join our new community, at http://facebook.com/groups/techtraining.  We’ve already had some great exchanges of ideas that will help us all to learn, grow, and improve. Join us, and contribute your voice to our discussion!

Speedy string concatenation in Python

As many people know, one of the mantras of the Python programming language is, “There should be one — and preferably only one — obvious way to do it.”  (Use “import this” in your Python interactive shell to see the full list.)  However, there are often times when you could accomplish something in any of several ways. In such cases, it’s not always obvious which is the best one.

A student of mine recently e-mailed me, asking which is the most efficient way to concatenate strings in Python.

The results surprised me a bit — and gave me an opportunity to show her (and others) how to test such things.  I’m far from a benchmarking expert, but I do think that what I found gives some insights into concatenation.

First of all, let’s remember that Python provides us with several ways to concatenate strings.  We can use the + operator, for example:

>>> 'abc' + 'def'
'abcdef'

We can also use the % operator, which can do much more than just concatenation, but which is a legitimate option:

>>> "%s%s" % ('abc', 'def')
'abcdef'

And as I’ve mentioned in previous blog posts, we also have a more modern way to do this, with the str.format method:

>>> '{0}{1}'.format('abc', 'def')
'abcdef'

As with the % operator, str.format is far more powerful than simple concatenation requires. But I figured that this would give me some insights into the relative speeds.

Now, how do we time things? In Jupyter (aka IPython), we can use the magic “timeit” command to run code.  I thus wrote four functions, each of which concatenates in a different way. I purposely used global variables (named “x” and “y”) to contain the original strings, and a local variable “z” in which to put the result.  The result was then returned from the function.  (We’ll play a bit with the values and definitions of “x” and “y” in a little bit.)

def concat1():
    z = x + y
    return z

def concat2():
    z = "%s%s" % (x, y)
    return z

def concat3():
    z = "{}{}".format(x, y)
    return z

def concat4():
    z = "{0}{1}".format(x, y)
    return z

I should note that concat3 and concat4 are almost identical, in that they both use str.format. The first uses the implicit locations of the parameters, and the second uses the explicit locations.  I decided that if I’m already benchmarking string concatenation, I might as well also find out if there’s any difference in speed when I give the parameters’ indexes.

I then defined the two global variables:

x = 'abc' 
y = 'def'

Finally, I timed running each of these functions:

%timeit concat1()
%timeit concat2()
%timeit concat3()
%timeit concat4()

The results were as follows:

  • concat1: 153ns/loop
  • concat2: 275ns/loop
  • concat3: 398ns/loop
  • concat4: 393ns/loop

From this benchmark, we can see that concat1, which uses +, is significantly faster than any of the others.  Which is a bit sad, given how much I love to use str.format — but it also means that if I’m doing tons of string processing, I should stick to +, which might have less power, but is far faster.

The thing is, the above benchmark might be a bit problematic, because we’re using short strings.  Very short strings in Python are “interned,” meaning that they are defined once and then kept in a table so that they need not be allocated and created again.  After all, since strings are immutable, why would we create “abc” more than once?  We can just reference the first “abc” that we created.
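
As a quick, CPython-specific illustration (interning is an implementation detail, so treat this as a sketch rather than a guarantee):

>>> a = 'abc'
>>> b = 'abc'
>>> a is b          # short, identifier-like literals are typically interned
True

>>> n = 10000
>>> x = 'abc' * n
>>> y = 'abc' * n
>>> x is y          # long strings built at runtime are distinct objects
False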

This might mess up our benchmark a bit.  And besides, it’s good to check with something larger. Fortunately, we used global variables — so by changing those global variables’ definitions, we can run our benchmark and be sure that no interning is taking place:

x = 'abc' * 10000 
y = 'def' * 10000

Now, when we benchmark our functions again, here’s what we get:

  • concat1: 2.64µs/loop
  • concat2: 3.09µs/loop
  • concat3: 3.33µs/loop
  • concat4: 3.48µs/loop

Each loop took a lot longer — but we see that our + operator is still the fastest.  The difference isn’t as great, but it’s still pretty obvious and significant.

What if we no longer use global variables, and instead allocate the strings within our functions?  Will that make a difference?  Almost certainly not, but it’s worth a quick investigation:

def concat1():
    x = 'abc' * 10000
    y = 'def' * 10000
    z = x + y
    return z

def concat2():
    x = 'abc' * 10000
    y = 'def' * 10000
    z = "%s%s" % (x, y)
    return z

def concat3():
    x = 'abc' * 10000
    y = 'def' * 10000
    z = "{}{}".format(x, y)
    return z

def concat4():
    x = 'abc' * 10000
    y = 'def' * 10000
    z = "{0}{1}".format(x, y)
    return z

And our final results are:

  • concat1: 4.89µs/loop
  • concat2: 5.78µs/loop
  • concat3: 6.22µs/loop
  • concat4: 6.19µs/loop

Once again, we see that + is the big winner here, though by less of a margin than was the case with the short strings.  str.format is clearly slower.  And we can see that in all of these tests, the difference between “{0}{1}” and “{}{}” in str.format is basically zero.

Upon reflection, this shouldn’t be a surprise. After all, + is a pretty simple operator, whereas % and str.format do much more.  Moreover, str.format is a method, which means that it’ll have greater overhead.

Now, there are a few more tests that I could have run — for example, with more than two strings.  But I do think that this demonstrates to at least some degree that + is the fastest way to achieve concatenation in Python.  Moreover, it shows that we can do simple benchmarking quickly and easily, conducting experiments that help us to understand which is the best way to do something in Python.
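
If you aren’t working in Jupyter, a rough equivalent of the %timeit comparisons above can be put together with the standard library’s timeit module. Here is a minimal sketch; the function names are my own, and the numbers will vary by machine and Python version:

import timeit

x = 'abc' * 10000
y = 'def' * 10000

def concat_plus():
    return x + y

def concat_format():
    return '{}{}'.format(x, y)

# Run each function 100,000 times and report the average time per call
for func in (concat_plus, concat_format):
    total = timeit.timeit(func, number=100000)
    print(func.__name__, total / 100000, 'seconds per call')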

Want to learn Chinese?

I run a side project that has nothing to do with computers or programming: Every Monday, I publish “Mandarin Weekly,” a curated collection of links and resources for people learning Chinese.  I’ve been studying Chinese for nearly two years now, and it is one of the most interesting and fun (and challenging!) things I’ve ever done.

Mandarin Weekly is running a giveaway for six months of free Yoyo Chinese, a great online school that teaches Chinese vocabulary, grammar, pronunciation, listening comprehension, and even reading characters.  On Sunday, we’ll be giving away two premium six-month memberships to Yoyo Chinese, each worth $100.

If you’ve always wanted to learn Chinese, then this is a great way to do it.  And if you are studying Chinese, then Yoyo is a great way to supplement or improve on your formal classroom studies.  Indeed, I think that anyone learning Chinese can benefit from this course.  To enter the giveaway, just sign up here:

http://mandarinweekly.com/giveaways/win-a-6-month-premium-membership-to-yoyo-chinese-a-99-value/

But wait, it gets better: If you enter the giveaway, you get one chance to win.  For every friend you refer to the giveaway, you get an additional three chances.  So if you enter, and then get five friends to enter, you will have 16 chances to win!

And now, back to your regularly scheduled technical blog…

Another free regexp Q&A webinar!

The last Webinar I did, with Q&A about regular expressions, was great fun — so much so that I’ve decided to do another one.

So, if you have questions (big or little) about regular expressions in Python, Ruby, JavaScript, and/or PostgreSQL, sign up for this free Webinar on Monday, April 11th: https://www.crowdcast.io/e/regexpqa2

If you already have questions, you can leave them in advance using the Crowdcast Q&A system.  (Or just surprise me during the Webinar itself.)

I look forward to seeing you there!

Free Webinar: Regexp Q&A

To celebrate the publication of my new ebook, Practice Makes Regexp, my upcoming Webinar (on March 22nd) is all about regular expressions (“regexps”) in Python, Ruby, JavaScript, and PostgreSQL, as well as the Unix “grep” command.

Unlike previous Webinars, in which I gave a presentation and then took Q&A, this time will be all about Q&A: I want you to come with your questions about regular expressions, or even projects that you’re wondering how to attack using them.

I’ll do my best to answer your questions, whether they be about regexp syntax, differences between implementations and languages, how to debug hairy regexps, and even when they might not be the most appropriate tool for the job.

Please join me on March 22nd by signing up here:

http://ccst.io/e/regexpqa

And when you sign up, please don’t forget to ask a question or two!  (You can do that in advance — and doing so will really help me to prepare detailed answers.)

I look forward to your questions on the 22nd!

Reuven

Yes, you can master regular expressions!

Announcing: My new book, “Practice Makes Regexp,” with 50 exercises meant to help you learn and master regular expressions. With explanations and code in Python, Ruby, JavaScript, and PostgreSQL.

I spend most of my time nowadays going to high-tech companies and training programmers in new languages and techniques. Actually, many of the things I teach them aren’t really new; rather, they’re new to the participants in my training. Python has been around for 25 years, but for my students, it’s new, and even a bit exciting.

I tell participants that my job is to add tools to their programming toolbox, so that if they encounter a new problem, they’ll have new and more appropriate or elegant ways to attack and solve it. Moreover, I tell them, once you are intimately familiar with a tool or technique, you’ll suddenly discover opportunities to use it.

Earlier this week, I was speaking with one of my consulting clients, who was worried that some potentially sensitive information had been stored in their Web application’s logfiles — and they weren’t sure if they had a good way to search through the logs.

I suggested the first solution that came to mind: Regular expressions.

Regular expressions are a lifesaver for anyone who works with text.  We can use them to search for patterns in files, in network data, and in databases. We can use them to search and replace.  To handle protocols that have changed ever so slightly from version to version. To handle human input, which is always messier than what we get from other computers.
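
To give one small, invented example (using Python’s standard re module), here’s the flavor of that kind of searching and scrubbing; the log line and patterns are mine, for illustration only:

import re

line = 'user=alice action=login ip=192.168.0.7'

# Search for an IP-address-like pattern in the line
match = re.search(r'ip=(\d{1,3}(?:\.\d{1,3}){3})', line)
if match:
    print('Found IP address:', match.group(1))

# Replace the IP address with a placeholder before sharing the logs
scrubbed = re.sub(r'ip=\S+', 'ip=REDACTED', line)
print(scrubbed)   # user=alice action=login ip=REDACTED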

Regular expressions are one of the most critical tools I have in my programming toolbox.  I use them at least a few times each day, and sometimes even dozens of times in a given day.

So, why don’t all developers know and use regular expressions? Quite simply, because the learning curve is so steep. Regexps, as they’re also known, are terse and cryptic. Changing one character can have a profound impact on what text a regexp matches, as well as its performance. Knowing which character to insert where, and how to build up your regexps, is a skill that takes time to learn and hone.

Many developers say, “If I have a problem that involves regular expressions, I’ll just go to Stack Overflow, where my problem has likely been addressed already.” And in many cases, they’re right.

But by that logic, I shouldn’t learn any French before I go to France, because I can always use a phrasebook.  Sure, I could work that way — but it’s far less efficient, and I’ll miss many opportunities that would come my way if I knew French.

Moreover, relying on Stack Overflow means that you never get a full picture of what you can really do with regular expressions. You get specific answers, but you don’t have a fully formed mental model of what they are and how they work.

But wait, it gets worse: If you’re under the gun, trying to get something done for your manager or a big client, you can’t spend time searching through Stack Overflow. You need to bring your best game to the table, demonstrating fluency in regular expressions.  Without that fluency, you’ll take longer to solve the problem — and possibly, not manage to solve it at all.

Believe me, I understand — my first attempt at learning regular expressions was a complete failure. I read about them in the Emacs manual, and thought to myself, “What could this seemingly random collection of characters really do for me?”  I ignored them for a few more years, until I started to program in Perl — a language that more or less expected you to use regexps.

So I spent some time learning regexp syntax.  The more I used them,  the more opportunities I found to use them.  And the more I found that they made my life easier, better, and more convenient.  I was able to solve problems that others couldn’t — or even if they could, they took much longer than I did.  Suddenly, processing text was a breeze.

I was so excited by what I had learned that when I started to teach advanced programming courses, I added regexps to the syllabus.  I figured that I could figure out a way to make regexps understandable in an hour or two.

But boy, was I wrong: If there’s something that’s hard for programmers to learn, it’s regular expressions.  I’ve thus created a two-day course for people who want to learn regular expressions.  I not only introduce the syntax, but I have them practice, practice, and practice some more.  I give them situations and tasks, and their job is to come up with a regexp that will solve the problem I’ve given them.  We discuss different solutions, and the way that different languages might go about solving the problem.

After lots of practice, my students not only know regexp syntax — they know when to use it, and how to use it.  They’re more efficient and valuable employees. They become the person to whom people can turn with tricky text-processing problems.  And when the boss is pressuring them for a quick answer, they can deliver one.

And so, I’m delighted to announce the launch of my second ebook, “Practice Makes Regexp.”  This book contains 50 tasks for you to accomplish using regular expressions.  Once you have solved the problem, I present the solution, walking you through the general approach that we would use in regexps, and then going into greater depth (with code) to solve the problem in Python, Ruby, JavaScript, and PostgreSQL.  My assumption in the book is that you have already learned regexps elsewhere, but that you’re not quite sure when to use them, how to apply them, and when each metacharacter is most appropriate.

After you go through all 50 exercises, I’m sure that you’ll be a master of regular expressions.  It’ll be tough going, but the point is to sweat a bit working on the exercises, so that you can worry a lot less when you’re at work. I call this “controlled frustration” — better to get frustrated working on exercises, than when the boss is demanding that you get something done right away.

Right now, the book is more than 150 pages long, with four complete chapters (including 17 exercises).  Within two weeks, the remaining 33 exercises will be done.  And then I’ll start work on 50 screencasts, one for each of the exercises, in which I walk you through solutions in each of Python, Ruby, JavaScript, and PostgreSQL.  If my previous ebook is any guide, there will be about 5 hours (!) of screencasts when I’m all done.

If you have always shied away from learning regular expressions, or want to harness their power, Practice Makes Regexp is what you have been looking for.  It’s not a tutorial, but it will help you to understand and internalize regexps, helping you to master a technology that frustrates many people.

To celebrate this launch, I’m offering a discount of 10%.  Just use the “regexplaunch” offer code, and take 10% off of any of the packages — the book, the developer package (which includes the solutions in separate program files, as well as the 300+ slides from the two-day regexp course I give at Fortune 100 companies), or the consultant package (which includes the screencasts, as well as what’s in the developer package).

I’m very excited by this book.  I think that it’ll really help a lot of people to understand and use regular expressions.  And I hope that you’ll find it makes you a more valuable programmer, with an especially useful tool in your toolbox.