The easiest way to return to the last Git branch

I don’t know about you, but it’s common for me to switch between branches in Git.  After all, that’s one of the main advantages of using Git — the incredible ease with which you can create and merge branches.

Just a few minutes ago, I was in the “adwords” branch of an application I’m working on.  I wanted to go back to “master”, make sure that I hadn’t missed any commits from the central repository, and then go back to “adwords”.  If there were any commits in “master” that I was missing in “adwords”, I figured that I would just rebase from “master” to “adwords”.

So I did a “git checkout master”, and found that I was up to date with the central Git server. For reasons that I can’t explain, I then decided to try something out: Instead of returning to the previous branch with “git checkout adwords”, I instead typed

git checkout -

(That’s a minus sign after the word “checkout.”)

Sure enough, I returned to the “adwords” branch!

Now, there is a fair amount of logic to this: In the Unix shell, you can typically return to the previous directory with “cd -”, which has proven to be quite useful over the years.  In Git, of course, branches are just aliases to commits.  So “git checkout -” is returning you to the previous branch, but it’s really just taking you back to whatever the last commit was that you worked on.

I just checked the Git documentation (“git checkout --help”), and it seems that this is a special case of a more general syntax:

As a special case, the "@{-N}" syntax for the N-th last branch/commit checks out 
branches (instead of detaching). You may also specify - which is synonymous
with "@{-1}".

I can’t imagine wanting to tell Git to return to the 5th-most-recent branch that I worked on, so this generalized formula seems a bit much to me.
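If you want to try the round trip without touching a real project, here is a sandbox version (a sketch of my own; the “adwords” branch name just mirrors the story above, and the repo is a throwaway):

```shell
# Throwaway repo to demonstrate "git checkout -"
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com" && git config user.name "You"
git checkout -q -b master             # pin the branch name, whatever the default is
git commit -q --allow-empty -m init   # a branch exists only once it has a commit
git checkout -q -b adwords            # work on a topic branch...
git checkout -q master                # ...pop back to master to check for new commits
git checkout -q -                     # and "-" returns us to the previous branch
git rev-parse --abbrev-ref HEAD       # prints: adwords
git checkout -q '@{-1}'               # the long-hand equivalent: back to master
git rev-parse --abbrev-ref HEAD       # prints: master
```

The last two lines show the generalized form from the documentation: “-” is just shorthand for “@{-1}”.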

I predict that this trick will save me precious seconds every day, all of which I squandered in writing this blog post.  But I do think that this is a super-cool trick and feature, and demonstrates once again how clever and useful Git is.

Remember: Don’t mix = and NULL

I was recently performing an analysis on the database for my dissertation software, and wanted to look at one particular set of rows.  The table I was looking at is called Nodes, and because each node (represented by a row in the table) can optionally belong to a group, the group_id column allows for NULL values.

So, if I want to find all of the nodes in group 2, I can say

SELECT * FROM Nodes WHERE group_id = 2;

My data consists (of course) of several tables, including Versions, which has a many-to-one relationship with Nodes.  If I want all of the versions for nodes in group 2, I can say:

SELECT COUNT(*) FROM Versions WHERE node_id in (SELECT id FROM Nodes WHERE group_id = 2);

Now, let’s say that I want to get all of the versions that are not from group 2.  How would I do that?  Well, one way would be:

SELECT COUNT(*) FROM Versions WHERE node_id 
   NOT IN (SELECT id FROM Nodes WHERE group_id = 2);

But at some point, I flipped the logic, turning the NOT IN into IN, and = into <>:

SELECT COUNT(*) FROM Versions WHERE node_id 
  IN (SELECT id FROM Nodes WHERE group_id <> 2);

That, of course, was a bug.  That’s because SQL uses three-valued (ternary) logic, not the binary true-false logic to which we’re accustomed.  Comparing anything with NULL yields NULL.  For example, consider the following table:

select * from truth_test;
| val    |
| t      |
| f      |
| [null] |

I can create a truth table pretty easily:

SELECT val, val = true as is_true, val=false as is_false, 
       val=null as equal_null, val is null as is_null 
FROM truth_test;
| val    | is_true | is_false | equal_null | is_null |
| t      | t       | f        | [null]     | f       |
| f      | f       | t        | [null]     | f       |
| [null] | [null]  | [null]   | [null]     | t       |

You can see, then, that TRUE is equal to TRUE, and FALSE is equal to FALSE, which we would expect.  But using the = operator with NULL is always going to result in NULL — even with itself!

If you want to compare with NULLs, then you need to use IS NULL.  (Or you can use a function such as COALESCE, which returns the first non-NULL value passed to it.)  And that’s the mistake that I made in my queries yesterday: I was trying to get all of the nodes for which group_id was not 2.  But by using =, I inadvertently excluded all of the rows for which group_id was NULL — in other words, all of the nodes that weren’t assigned to a group.  Which, it turns out, was the majority of them.  Whoops.
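The whole pitfall fits in a few lines.  Here it is reproduced in miniature (I’m using SQLite via Python purely for convenience; the NULL semantics are the same as PostgreSQL’s, and the table contents are invented for illustration):

```python
import sqlite3

# A miniature version of the Nodes table above; group_id may be NULL.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Nodes (id INTEGER PRIMARY KEY, group_id INTEGER);
    INSERT INTO Nodes (id, group_id) VALUES (1, 2), (2, 2), (3, 5), (4, NULL), (5, NULL);
""")

# NULL = NULL is NULL, not true -- the heart of the problem:
print(db.execute("SELECT NULL = NULL").fetchone()[0])   # None (i.e., NULL)

# The buggy "flipped" version: <> 2 silently drops the NULL rows.
buggy = db.execute("SELECT COUNT(*) FROM Nodes WHERE group_id <> 2").fetchone()[0]
print(buggy)   # 1 -- only node 3; nodes 4 and 5 vanish

# Two correct ways to say "not in group 2":
explicit = db.execute(
    "SELECT COUNT(*) FROM Nodes WHERE group_id <> 2 OR group_id IS NULL").fetchone()[0]
coalesced = db.execute(
    "SELECT COUNT(*) FROM Nodes WHERE COALESCE(group_id, -1) <> 2").fetchone()[0]
print(explicit, coalesced)   # 3 3
```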

My solution was to create a view containing all of the IDs that I want to exclude:

CREATE OR REPLACE VIEW group2_model_ids AS
SELECT id
FROM Nodes
WHERE group_id = 2;

Now, if I want the number of versions that aren’t in group 2, I can use a subselect, as follows:

SELECT COUNT(*)
FROM Versions 
WHERE node_id NOT IN (SELECT id FROM group2_model_ids);

Of course, I could have simply rewritten the query, either to use COALESCE or to check for both NULL and not-NULL values.  But I prefer the view, in part because I’m only dealing with a few tens of thousands of rows in this table, and in part because it gives me a name and an abstraction that I can use in future queries.

Book Review: The Undercover Economist Strikes Back

In order to get an undergraduate degree from MIT, at least when I was there, you needed to take a certain number of humanities and social-science courses.  This was to keep you from graduating as a completely one-dimensional student; the idea was that rounding out your education with knowledge from other fields was good for you as a person, and also good for you as an engineer or scientist.  (And yes, I realize that not everyone at MIT studied science or engineering, but those were the overwhelming favorites.)  One of the most popular social sciences was economics — which at MIT, despite being a social science, included a great deal of math.

At the time, I tended to be quite dismissive of economics.  I didn’t think that it could possibly be interesting, and couldn’t understand why so many of my friends were taking so many courses in that field.  What insights could they gain?

And then, just before I graduated, MIT Press had one of its amazing sales on overstock books.  I bought a book by a then-professor at MIT, named Paul Krugman, called “The Age of Diminished Expectations.”  Reading this book was quite a revelation for me; I suddenly realized that economics was complex, fascinating, and described the world in all sorts of interesting ways.  For years, I read and followed Krugman’s writing, in Slate and then (of course) the New York Times, gleaning what I could about economics.  (I also happen to subscribe to many of his political views, but that’s secondary.)  Whenever I could find an interesting and well-written book about economics, I would get it, because I found it to be so interesting and compelling.

Several years ago, a friend asked if I had read “The Undercover Economist,” by Tim Harford.  I hadn’t, but decided that perhaps it was worth a read, and found it to be delightful, but in a different way from Krugman. Harford isn’t an economics researcher, but he knows just how to put economics research into words and a perspective that everyone can understand.  His examples are often drawn from pop culture, and he’s able to distill the academic debates and intrigue to their essence.  The fact that he’s very funny only adds to his appeal.  I’ve since become quite a fan of Harford’s, listening (among other things) to the “More or Less” podcast from the BBC that he hosts, a sort of Mythbusters of statistics (Mathbusters?).

So it should come as no surprise that I ordered his latest book, “The Undercover Economist Strikes Back,” almost as soon as it came out earlier this year.  I just read it cover to cover over the weekend, and came away delighted.  As someone who has been reading Krugman’s work for years, and who also listens to NPR’s excellent Planet Money podcast, I can’t say that there was a huge amount of new information in this book.  But it was written so well, and put things into such nice context, that this doesn’t matter.

Harford has a gift for making economics not only understandable, but also interesting and relevant to our own lives.  In The Undercover Economist, he described microeconomics, which deals with how businesses and individuals respond to incentives.  In this new book, he takes on macroeconomics, which is a different kettle of fish altogether — it’s about how governments and entire economies work.  If you think of an economy as a complex system, then it’s no surprise that the aggregate behaves differently from its individual, constituent agents. (This, it took me many years to learn, is a much better explanation than what economics instructors tell their students, which is simply that “macro is different from micro.”)

The book talks about all sorts of great stuff, starting with recessions, moving on to unemployment, and covering a large number of topics that are in the newspaper each day; topics that affect each and every one of us, that probably seem strange or detached from our reality, but that are actually quite important, particularly if you’re in a democracy and need to separate real economics from crazy talk.

Harford includes a great definition and discussion of what money is, and brings up the famous story of the island of Yap, which used huge, largely immovable stones as money.  He also introduces the different schools of thought on the subject, and where (and how) they differ — and how much of what politicians in the US and Europe have been saying and doing over the last five years has been foolish or misplaced.

The question-and-answer format in which he wrote the book is a little tedious, but much less than I expected it to be.  Really?  Yes, really.

In my mind, perhaps the topic that was most obviously missing from the book was a discussion of currency, and how that can affect an economy.  If you live in the US, or even in England or Europe, you can largely ignore currency issues.  Sure, there are exchange rates, and yes, they affect you to some degree, but it’s not a huge deal.

In Israel, by contrast, the exchange rate is a huge deal, because Israel imports and exports so much.  The dollar’s rise and fall affects everyone, from high-tech software companies to people shopping at the supermarket. The ways in which the Bank of Israel has played with exchange rates, buying dollars just to keep things relatively stable (while claiming to be doing no such thing), are impressive, and point to the sorts of challenges that small, trade-oriented economies have but that large ones don’t.  I’m not sure if this was an omission due to time or space constraints, or if, as someone living in England, Harford hasn’t had to think or worry much about currency issues.

I’ve changed my tune 100 percent since I was an undergrad; I now find economics to be totally fascinating, and very much enjoy reading the sorts of books that Harford has put out.  If you’ve always wondered what macroeconomics is, or what the newspapers are talking about when they mention recessions, or whether the politicians suggesting budget cuts during the latest recession were saying the most obviously brilliant thing or the most foolish thing imaginable, Harford’s book is a fun, interesting read, and is highly recommended.


Turning a PostgreSQL array to rows

I’ve spent the last few blog posts in this series talking about PostgreSQL’s arrays — how to create them, query them, and even create them with the ARRAY function.  But there are cases in which I would like to do the opposite — turn a PostgreSQL array to rows.

There are a number of reasons why I might want to do this. For example, perhaps I want to find the intersection between two arrays.  The INTERSECT operator in PostgreSQL can easily do that with two sets of rows — but there isn’t any equivalent for arrays.  Similarly, the UNION operator lets me join two sets of rows, but nothing like that exists for arrays.  And of course, there are times when I might want to take the elements of an array and run a JOIN between them and the rows of an existing table.
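To give a taste of where this is going, an intersection can be written by unnesting both sides and letting INTERSECT do its usual work on the resulting rows.  This is only a sketch: the table “pairs” and its two text[] columns, a and b, are invented for illustration, and UNNEST itself is explained below.

```sql
-- Unnest each array into rows, intersect the row sets, and re-pack the result:
SELECT ARRAY(SELECT UNNEST(a)
             INTERSECT
             SELECT UNNEST(b)) AS common_elements
FROM pairs;
```

Swapping INTERSECT for UNION gives the array-union equivalent in the same way.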

The key to all of these is the UNNEST function.  You need to be a bit careful when you’re using UNNEST, because (as always with computers) PostgreSQL will do what you tell it to do, not necessarily what you want it to do.  For example, if I have the following table:

select * from foo;
| id | stuff         |
| 1  | {abc,def,abc} |
| 2  | {abc,xyz}     |

Let’s say that I use UNNEST on our first row:

SELECT UNNEST(stuff) FROM Foo where id = 1;
| unnest |
| abc    |
| def    |
| abc    |

Sure enough, the three elements of “stuff” where id is 1 have now become three rows. We get the results as a set of rows, which we can then capture in a table, query in a subquery, or turn back into an array, if we want:

SELECT ARRAY(SELECT UNNEST(stuff) FROM Foo where id = 1);
| array         |
| {abc,def,abc} |

This example might seem pointless, but it raises some additional possibilities.  For example, let’s say that I want to ensure that the elements of my array are unique.  I can do that by applying the DISTINCT operation on the result of UNNEST, and then packing those distinct rows back into an array:

SELECT ARRAY(SELECT DISTINCT UNNEST(stuff) FROM Foo where id = 1);
| array     |
| {abc,def} |

It’s this sort of operation that has led me to really love PostgreSQL arrays.  If I use them as actual arrays, then I feel bad, as though I’ve betrayed the entire world of normalized, relational databases. But if I think of arrays as frozen sets of rows, which I can then unfreeze (with UNNEST), manipulate, join, or otherwise modify, and then re-freeze (with ARRAY) as necessary, then things suddenly get much clearer and easier.

I’m in the last weeks of my PhD research and writing, and I’ve been using no small number of arrays in my analysis.  I wouldn’t dream of using arrays to store data that could otherwise be normalized and stored in a reasonable way.  But I am constantly creating views that turn complex queries into arrays, which I can then further query, nest, and unnest as needed.

I’m just about done with this series; next time, I’ll show how I can combine arrays and CTEs to perform the sorts of operations that I described above, without having to pollute my database namespace with lots of unnecessary views.

Turning PostgreSQL rows into arrays

So far in this series, I have described how you can create arrays and retrieve information from them — both the actual data stored in the array, and information about the array, such as its length. But the coolest trick, or set of tricks, that I use in PostgreSQL is the ability to turn arrays into rows, and vice versa.  This is where, in my mind, arrays start to get truly powerful, and allow us to do all sorts of things that would otherwise be quite difficult, or require another language.

For example, let’s say that I want to keep track of my appointments in a database.  I create an Appointments table:

CREATE TABLE Appointments (id SERIAL,
 meeting_at TIMESTAMP,
 description TEXT
);

And then I enter information into this table, such that I have the following appointments:

[local]/reuven=# select * from appointments order by meeting_at;
 | id | meeting_at          | description   |
 | 5  | 2014-05-22 12:00:00 | Dissertation  |
 | 4  | 2014-05-22 14:00:00 | Devel meeting |
 | 1  | 2014-05-23 15:00:00 | Dissertation  |
 | 2  | 2014-05-23 17:00:00 | Dinner        |
 | 3  | 2014-05-23 23:00:00 | Sleep         |
 (5 rows)

If I want to find out what I’m doing on May 23rd, I can simply say:

[local]/reuven=# select * from appointments where meeting_at::date = '2014-may-23';
 | id | meeting_at          | description  |
 | 1  | 2014-05-23 15:00:00 | Dissertation |
 | 2  | 2014-05-23 17:00:00 | Dinner       |
 | 3  | 2014-05-23 23:00:00 | Sleep        |

Notice that in the above example, I’m using PostgreSQL’s casting operator (::) to turn the meeting_at column, which contains a TIMESTAMP value, into a DATE value.  Without that, the comparison would only have matched appointments at exactly midnight on May 23rd. So far, this isn’t very exciting or surprising.  But let’s say that I want to get all of the descriptions of today’s appointments in a comma-separated list.

Normally, I would need to retrieve all of the rows, and then use another language to piece them together.  With PostgreSQL’s arrays, though, we can do this all in the database — which executes much more quickly, among other things. The way to accomplish this is with the ARRAY function.  Now, don’t confuse the ARRAY[] constructor with the ARRAY function, although it’s hard not to.  (PostgreSQL tells them apart by the delimiters: ARRAY[...] with square brackets is the constructor, while ARRAY(...) with parentheses takes a subquery.)

The ARRAY function is built into PostgreSQL, and turns a set of rows into an array.  That is, ARRAY expects to get rows as its input, and returns a single row, containing a single array, as its output. Let’s start with a set of rows, containing the descriptions of today’s appointments.  I’m going to use PostgreSQL’s concatenation operator (||) to combine the time (using the ::time cast) of the appointment with the appointment itself.  My query looks like:

SELECT meeting_at::time || ' ' || description
FROM Appointments
WHERE meeting_at::date = '2014-may-23'
ORDER BY meeting_at;

The result I get is:

 | ?column?              |
 | 15:00:00 Dissertation |
 | 17:00:00 Dinner       |
 | 23:00:00 Sleep        |

Now that I have my rows, I can turn them into an array.  The ARRAY function needs to be invoked via a SELECT; thus, using ARRAY means that we’re using a subselect.  The inner SELECT is what we did above.  The outer one is just our call to ARRAY:


SELECT ARRAY(SELECT meeting_at::time || ' ' || description 
FROM Appointments 
WHERE meeting_at::date = '2014-may-23'
ORDER BY meeting_at);

And sure enough, we get a one-row, one-column result:

 | array                                                        |
 | {"15:00:00 Dissertation","17:00:00 Dinner","23:00:00 Sleep"} |

You can see that our three rows have been turned into three elements of the array.  I do this all of the time when creating reports or summaries of other data: I write a view or CTE that grabs the data from the rows that are of interest to me, and stuffs that data into an array.  I can then check the array’s contents, length, or anything else that might be useful in my analysis.

But earlier, I said that I wanted a comma-separated list of the appointments, and I currently have them in an array.  Fortunately, PostgreSQL’s array_to_string can save the day, acting similarly to the “join” method that I know and love from such languages as Ruby and Python.  This function, wrapped around a call to ARRAY, wrapped around our subselect, will then look like:

SELECT array_to_string(
  ARRAY(SELECT meeting_at::time || ' ' || description
        FROM Appointments
        WHERE meeting_at::date = '2014-may-23'
        ORDER BY meeting_at), ', ');

And sure enough, the result is as follows:

 | array_to_string                                        |
 | 15:00:00 Dissertation, 17:00:00 Dinner, 23:00:00 Sleep |
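Incidentally, the comparison with “join” is exact enough that you can read that last step as if it were Python (the strings below are simply copied from the result above):

```python
# array_to_string(arr, ', ') behaves like str.join on a list of strings.
appointments = ["15:00:00 Dissertation", "17:00:00 Dinner", "23:00:00 Sleep"]
result = ", ".join(appointments)
print(result)   # 15:00:00 Dissertation, 17:00:00 Dinner, 23:00:00 Sleep
```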

Not bad, for one query; we’ve taken our three rows, and turned them into a string, without having to use anything other than SQL!  There’s just one final problem here that I’d like to fix, namely that when you invoke a function as we did here, the function’s name becomes the name of the column.  I’d like to change that, so the column has a more reasonable name.  This is particularly important if you’re planning to read the column using a high-level language; your code will look awfully funny if it includes statements like:

puts row['array_to_string']

when it could instead read

puts row['appointments_for_today']

Here’s how I can use “AS” to change the name of the column, making it easier to read and work with:

SELECT array_to_string(
  ARRAY(SELECT meeting_at::time || ' ' || description
        FROM Appointments
        WHERE meeting_at::date = '2014-may-23'
        ORDER BY meeting_at), ', ')
AS appointments_for_today;

It is in these sorts of uses, to aggregate data and create reports, that I’ve found PostgreSQL’s arrays to be particularly useful and powerful. Next time, we’ll see how we can do the opposite, turning an array back into rows — and then, we’ll see how the combination of these conversions lets us perform all sorts of tricks.

Looking in PostgreSQL arrays with ANY

So far, this series has looked at how to create PostgreSQL arrays, how to retrieve data from them, and how to get the length of an array’s outer and inner dimensions. But one of the most common actions that we’ll want to do with an array is look inside to see if a particular value is in there.  Today, I’ll show you how to do that with the PostgreSQL ANY operator and its cousins. For example, given my table “foo”:

[local]/reuven=# select * from foo;
| id | stuff               |
| 8  | {abc}               |
| 9  | {abc,def}           |
| 10 | {abc,def,"ghi jkl"} |
(3 rows)

We’ve already seen (in my last post on this topic) how we can find all of the rows in which the “stuff” array is of a certain length.  But sometimes — more often, perhaps — I want all of the rows in which a particular value is somewhere in the array.  (You can think of this as the SQL equivalent of Ruby’s “select” on an enumerable, or Python’s “filter” on a sequence.)

For this, we need to use PostgreSQL’s ANY operator.  I have always found the syntax of ANY (and its synonym, SOME) to be a bit hard to understand.  The IN operator works by itself, but ANY must be paired with a comparison operator, typically =.  For example, if I want all of the rows from the above “foo” table in which the “stuff” array contains an element ‘abc’, I can say:

[local]/reuven=# select * from foo where 'abc' = ANY(stuff);
| id | stuff               |
| 8  | {abc}               |
| 9  | {abc,def}           |
| 10 | {abc,def,"ghi jkl"} |
(3 rows)

Sure enough, because each of these three arrays contains an element ‘abc’, we get all three rows back.  What if I ask for ‘def’?

[local]/reuven=# select * from foo where 'def' = ANY(stuff);
| id | stuff               |
| 9  | {abc,def}           |
| 10 | {abc,def,"ghi jkl"} |
(2 rows)

Again, not a surprise.  What if I search for ‘ghi’?

[local]/reuven=# select * from foo where 'ghi' = ANY(stuff);
| id | stuff |
(0 rows)

This is important to remember, actually: ANY looks at the elements of the array, and no further.  So the string ‘ghi’ was compared with the elements of “stuff”, which in the case of row ID 10, were the three strings ‘abc’, ‘def’, and ‘ghi jkl’.  PostgreSQL compared ‘ghi’ with each of these, found them to be unequal, and thus returned 0 rows.  Remember this when you’re trying to search through your text[] array column; you can look for specific elements of the array, but no further.

What I personally find confusing, and easy to forget, is the fact that using ANY requires a comparison operator (here, =) before the call to ANY.  Moreover, it’s important to remember that ANY and the array go on the right-hand side of the operator; SQL defines ANY as legal only on that side.  Thus, the following will not work:

[local]/reuven=# select * from foo where ANY(stuff) = 'abc';
ERROR: 42601: syntax error at or near "ANY"
LINE 1: select * from foo where ANY(stuff) = 'abc';

I’ve never had any occasion to use it, but you should know that PostgreSQL also offers an ALL operator.  It works the same way as ANY, except that it only returns a TRUE value if all of the elements in the array are equal to the value on the left-hand side.  For example:

[local]/reuven=# select * from foo where 'abc' = ALL(stuff);
| id | stuff |
| 8  | {abc} |
(1 row)

In this example, we only get one row back, because the stipulation is that all of the array elements need to be equal to ‘abc’.

I’ve got a few more posts planned in this series about PostgreSQL arrays.  If there is something about this subject that you have always wanted to know, please contact me, and I’ll do what I can to answer.
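One postscript of my own: = isn’t special here.  ANY and ALL combine with any comparison operator, and the usual idiom for “this value appears nowhere in the array” is <> ALL.  Against the “foo” table above, this sketch should return only row 8:

```sql
-- Rows whose stuff array does not contain 'def' anywhere:
SELECT * FROM foo WHERE 'def' <> ALL(stuff);
-- Returns only row 8 ({abc}); rows 9 and 10 both contain 'def'.
```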

The real questions to ask offshore developers

Friends of mine, who are not software developers, have a small, retail Internet business.  The original developers created the application in Python, and my friends are looking for a full-stack Web/Python developer to help them.  Frustrated with their inability to find someone who can commit to their project, my friends have decided to hire offshore developers, which is another way of saying, “cheap programmers in Eastern Europe or India.”

Earlier this week, these friends e-mailed me the resumes of three Ukrainian programmers, asking me which seemed most appropriate, and what questions they should be asking.

The resumes were, from my perspective, largely identical.  All the programmers declared themselves to be “experts,” or “very experienced,” or just “experienced,” at Python, JavaScript, Web development, SQL, and many other technologies.  And the fact is, they probably are quite skilled at all of these technologies; the Ukrainians with whom I have worked — as well as the Indians, Chinese, and Romanians — have all been quite skilled, technically.

But here’s the thing: Technical skill isn’t the primary consideration when hiring a developer. This is doubly true when hiring an offshore developer.  That’s because the problems that I’ve seen with offshore programmers aren’t technical, but managerial.  As I told my friends, you would much rather have a so-so Python programmer who is reliable and communicative than a genius programmer who is unreliable or uncommunicative.  The sad fact is that many of the offshore outsourcing companies have talented programmers, but poor management and leadership, leading to breakdowns in communication, transparency, and scheduling.

Sure, a developer might know the latest object-oriented techniques, or know how to create a RESTful JSON API in his or her sleep.  But the programmer’s job isn’t to do those things. Rather, the programmer’s job is to do whatever the business needs to grow and improve.  If that requires fancy-shmancy programming techniques and algorithms, then great.  But most of the time, it just requires someone willing to pay attention to the project’s needs and schedule, writing simple and reliable code that’s necessary for the business to succeed.

The questions that you should be asking an offshore developer aren’t that different from the ones that you should be asking a developer in your own country, who speaks your language, and lives in your time zone.  Specifically, you should be asking about their communication patterns and processes.  Of course, you don’t want a dunce working on your programming project — but good communication and processes will smoke out such a person very quickly.

If there are no plans or expectations for communication, then you’re basically hoping that the developer knows what you want, that he or she will do it immediately, and that things won’t change — a situation that is pretty much impossible.

Good processes and a good developer will lead to a successful project.  Good processes and a bad developer will make it clear that the developer needs to go, and soon.  Bad processes and a developer of any sort will make it hard to measure performance, leading to frustration on everyone’s part — and probably missed deadlines, overspent budgets, and more.

So I told my friends that they should get back to these Ukrainian programmers, and ask them the following questions:

  • What task tracking system do you prefer to use, in order to know what needs to be done, what has been done, and who has taken responsibility for each task?
  • How often do you want to meet to review progress?
  • Do you use automated testing to ensure that when we make progress, we can be sure that it works, and that we haven’t introduced regressions?
  • How easily will a third party be able to download the repository from Git (or whatever version-control system you’re using), and run those tests to verify that everything is working?

The answers to these questions are far, far more important than the technical skills of the person you’re hiring. Moreover, these are things that we can test empirically: If the developer doesn’t do one or more of them, we’ll know right away, and can find out what is going wrong.

If the developer is good, then he or she will encourage you to set up a task tracker, and to meet every day (or at least, every other day) to review where things are.  You’ll hear that automated testing is part of the development process, and that of course it’s possible to download, install, and run the application on any compatible computer.

If the developer hedges on these things, or asks you to simply trust him or her, then that’s a bad sign.  Truth be told, the developer might be fantastic, brilliant, and do everything you want.  But do you want to take that risk?

If the developer has regular communication with you, tests their code, and allows you to download and run the application on your own, then you’re in a position to either praise them and keep the relationship going — or discover that things aren’t good, and shut it down right away.

Which brings me to my final point: With these sorts of communication practices in place, you’ll very quickly discover whether the developers are doing what they promised.  If so, then that’s great for everyone.  But if not, then you’ll know this within a week or less — and then you can get rid of them.

There are plenty of talented software developers in the world, but there are many fewer who both understand your business and make its success a priority.  A developer who values your business will want to demonstrate value and progress on a very regular basis.  Someone who cannot demonstrate value and progress probably isn’t deserving of your attention or money, regardless of where they live or what language they speak.  But if you can find someone excellent, who values you and your business, and who wants to help you succeed?  Then by all means, hire them — and it doesn’t matter whether they’re in Ukraine, or anywhere else.

What questions do you ask offshore developers before hiring them?

PostgreSQL array indexes and length

In my last blog post, I introduced the idea of a PostgreSQL array, and showed how we can insert data into a table using either the curly-brace {} syntax, or the ARRAY construction syntax.  In this post, I want to talk about PostgreSQL array indexes and lengths: what happens when we retrieve from indexes that do (and don’t) exist, how we can construct multidimensional arrays, and how we can ask PostgreSQL for the length of any dimension of a multidimensional array.

First of all, let’s talk about how you can retrieve data from arrays.  If I still have my three-row table from last time:

[local]/reuven=# select id, stuff from foo;
 │ id │ stuff               │
 │ 8  │ {abc}               │
 │ 9  │ {abc,def}           │
 │ 10 │ {abc,def,"ghi jkl"} │

What if I want to get just the first value back from the “stuff” column?  Well, then I have to take the first element of an array.  Most modern languages start to number arrays with index 0; PostgreSQL, by contrast, starts to count them with 1.  I can thus ask for just the first element of the “stuff” array from each row with:

[local]/reuven=# select id, stuff[1] from foo;
 │ id │ stuff │
 │ 8  │ abc   │
 │ 9  │ abc   │
 │ 10 │ abc   │

If we ask for an index that doesn’t exist, we get a NULL value back:

[local]/reuven=# select id, stuff[3] from foo;
 │ id │ stuff   │
 │ 8  │ [null]  │
 │ 9  │ [null]  │
 │ 10 │ ghi jkl │

Note that I have configured my PostgreSQL client to show NULL values as “[null]”, by putting the following line in my ~/.psqlrc file:

\pset null [null]

Without the above line in your .psqlrc (or running that command in psql yourself manually), you might see blank space for row IDs 8 and 9.

Now, it’s pretty rare for me to pull out a particular value from a PostgreSQL array.  Instead, I’m often finding out the lengths of the arrays on which I’m working.  I can do this with the array_length function:

[local]/reuven=# select id, stuff, array_length(stuff, 1) from foo;
 │ id │ stuff               │ array_length │
 │ 8  │ {abc}               │ 1            │
 │ 9  │ {abc,def}           │ 2            │
 │ 10 │ {abc,def,"ghi jkl"} │ 3            │

Notice that array_length is a function that takes two parameters: an array and an integer.  The array is what we want to measure, and the integer indicates which of the array’s dimensions should be measured.  If you’re like me, and come from dynamic languages such as Ruby and Python, in which arrays (or lists) can be of any length, then you should know that PostgreSQL arrays can be multidimensional, but each inner array must be of the same length.  So, for example, I can create a 2×3 array of integers with

[local]/reuven=# select ARRAY[ARRAY[1,1,1], ARRAY[2,2,2]];
│ array             │
│ {{1,1,1},{2,2,2}} │

Trying to have a different inner dimension will not work:

[local]/reuven=# select ARRAY[ARRAY[1,1,1], ARRAY[2,2,2,2]];
ERROR: 2202E: multidimensional arrays must have array expressions with matching dimensions

Assuming that I have a legitimate array, I can get its length:

[local]/reuven=# select array_length(ARRAY[ARRAY[1,1,1], ARRAY[2,2,2]], 1);
│ array_length │
│ 2            │

Or I can get the length of the inner dimension:

[local]/reuven=# select array_length(ARRAY[ARRAY[1,1,1], ARRAY[2,2,2]], 2);
│ array_length │
│ 3 │

So, when retrieving our rows from the “foo” table:

[local]/reuven=# select id, stuff, array_length(stuff, 1) from foo;
│ id │ stuff               │ array_length │
│ 8  │ {abc}               │       1      │
│ 9  │ {abc,def}           │       2      │
│ 10 │ {abc,def,"ghi jkl"} │       3      │

I can get the array length in a separate column, as in this example.  Or I can even sort in descending order of the array length:

[local]/reuven=# select id, stuff, array_length(stuff, 1) from foo order by array_length(stuff, 1) desc;
│ id │ stuff               │ array_length │
│ 10 │ {abc,def,"ghi jkl"} │     3        │
│ 9  │ {abc,def}           │     2        │
│ 8  │ {abc}               │     1        │

Notice that our ORDER BY clause has to repeat the function that we used to create the third column.  Another way to do this is to declare an alias for the output of array_length, and then use the alias in ORDER BY:

[local]/reuven=# select id, stuff, array_length(stuff, 1) len from foo order by len desc;
│ id │ stuff               │ len │
│ 10 │ {abc,def,"ghi jkl"} │ 3   │
│ 9  │ {abc,def}           │ 2   │
│ 8  │ {abc}               │ 1   │

Next time, we’ll look at how we can manipulate arrays using array functions.

Learning to love PostgreSQL arrays

I’ll admit it: When arrays were added to PostgreSQL a number of years ago, I thought that this was a really bad idea.  I’m a firm believer in normalization when it comes to database design and storage; and the idea of putting multiple values inside of a single column struck me as particularly foolish.  Besides, my impression was that PostgreSQL arrays were clumsy to work with, and didn’t really add much to my data model.

Of course, it turns out that arrays are extremely useful in PostgreSQL.  I still cringe when people want to use them for general-purpose storage, instead of working to normalize their database design.  But over the last few months, as I’ve been doing all sorts of complex PostgreSQL queries for my PhD dissertation, I’ve found that PostgreSQL arrays are extremely useful when it comes to aggregating and reporting data.

I’ve thus decided to dedicate a number of blog posts to PostgreSQL arrays: How to create them, use them, manipulate them, and decide when to use them.

Let’s start with the very basics; over the next few blog posts, I’ll try to show how arrays can be interesting — and even useful — and how they fit into more complex queries and needs.

You can create an array of just about any data type in PostgreSQL.  As the documentation says, “Arrays of any built-in or user-defined base type, enum type, or composite type can be created. Arrays of domains are not yet supported.” In other words: integers, text, enums, other arrays (for multidimensional arrays), and even user-defined types are all fair game.  To date, I have generally created arrays of integers and text, but that might not be representative of your use case.
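For instance, an integer array column is declared just like a text one.  Here is a quick sketch — the “bar” table and its “scores” column are made up purely for illustration:

[local]/reuven=# create table bar (id SERIAL PRIMARY KEY, scores INTEGER[]);
CREATE TABLE

Everything we do with text arrays in this post works the same way with integer arrays; only the element type differs.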

To create a table with a text array in one column, just add square brackets ([]) after the type:

[local]/reuven=# create table foo (id SERIAL PRIMARY KEY, stuff TEXT[]);
CREATE TABLE

When I then ask for the definition of my “Foo” table, I see the following:

[local]/reuven=# \d foo
 Table "public.foo"
│ Column │ Type │ Modifiers │
│ id     │ integer │ not null default nextval('foo_id_seq'::regclass) │
│ stuff  │ text[]  │                                                  │
 "foo_pkey" PRIMARY KEY, btree (id)

Notice that the type of the “stuff” column is indeed recorded as “text[]”, showing that it’s an array.  If we try to insert a plain-text value into that column, PostgreSQL will complain:

[local]/reuven=# insert into foo (stuff) values ('abc');
ERROR: 22P02: array value must start with "{" or dimension information
LINE 1: insert into foo (stuff) values ('abc');
                                        ^

One of the many things that I love about PostgreSQL is the attention to detail in the error messages.  Not only does it tell us that the table is expecting an array value, but that the array must begin with a { character.  It also shows us, using a ^ character, where the parser had problems.  That’s not always a perfect indicator of where the problem lies, but it’s a great start.

If I want to insert an array value into my table, I can thus use the literal array syntax that PostgreSQL provides, with (as indicated above) curly braces:

[local]/reuven=# insert into foo (stuff) values ('{abc}');
[local]/reuven=# insert into foo (stuff) values ('{abc,def}');
[local]/reuven=# insert into foo (stuff) values ('{abc,def,ghi jkl}');

The above commands insert three rows into our table.  In each case, the array is written as a single string, surrounded by single quote marks: ‘{abc}’ becomes a one-element array, and ‘{abc,def}’ becomes a two-element array.

What happens when there is a space character inside of the text?  PostgreSQL automatically quotes the value (with double quotes — be careful!).  What happens if you want a comma or single quote as part of the text?  Then things get even uglier.
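To see why the literal syntax gets ugly, consider an element that contains a comma.  With the curly-brace syntax, you must double-quote that element by hand — otherwise PostgreSQL treats the comma as a separator between elements:

[local]/reuven=# insert into foo (stuff) values ('{"ghi, jkl"}');
INSERT 0 1

That inserts a one-element array containing the string “ghi, jkl”.  Forget the inner double quotes, and you silently get a two-element array instead.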

A nice solution, and a better way (I believe) to insert arrays in any event, is to use the built-in ARRAY constructor syntax.  Then you don’t have to worry about such things.  For example, I can rewrite all of the above INSERT commands in what I believe to be a much nicer way:

[local]/reuven=# insert into foo (stuff) values (ARRAY['abc']);

[local]/reuven=# insert into foo (stuff) values (ARRAY['abc', 'def']);

[local]/reuven=# insert into foo (stuff) values (ARRAY['abc', 'def', 'ghi jkl']);

[local]/reuven=# select * from foo;
│ id │ stuff               │
│ 8  │ {abc}               │
│ 9  │ {abc,def}           │
│ 10 │ {abc,def,"ghi jkl"} │
(3 rows)

The same data was inserted into the table, but with less hassle than before.

Now, just because we can insert arrays directly into our tables doesn’t necessarily mean that we should do so.  You’ll see, over the course of this series, that I view arrays as a great way to aggregate and analyze existing data, particularly within the context of a view or a CTE.  So please don’t be tempted to start stuffing all of the data you want and need into a single column; normalization is still a good idea, no matter how tempting arrays might be.  That said, being familiar with the basics of defining and inserting array data is quite useful, and will serve us well throughout the rest of this series.


Control-R, my favorite Unix shell command

If you use a modern, open-source Unix shell — and by that, I basically mean either bash or zsh — then you really should know this shortcut.  Control-R is probably the shell command (or keystroke, to be technical about it) that I use most often, since it lets me search through my command history.

Let’s start with the basics: When you use bash or zsh, your commands are saved into a history file, whose location is typically stored in the environment variable HISTFILE.  I use zsh (thanks to oh-my-zsh), which sets my HISTFILE to ~/.zsh_history.  How many commands does it store?  That depends on the value of the environment variable HISTSIZE, which in my case is 10,000.  Yes, I store the 10,000 most recent commands that I entered into my shell.
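If you want a similarly large history under zsh, a few lines in ~/.zshrc will do it.  (The variable names here are zsh’s; note that SAVEHIST controls how many lines are written to the file, while HISTSIZE controls the in-memory history.  In bash, the corresponding pair is HISTSIZE and HISTFILESIZE.)

HISTFILE=~/.zsh_history
HISTSIZE=10000
SAVEHIST=10000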

Now, before control-R, there were a bunch of ways to search through and use the history.  Each command has its own number, and thus if you want to replay command 5329, you can do so with the shell’s history expansion, by typing

!5329

But this requires that you keep track of the numbers, and while I used to do that, I found it to be more annoying than useful.  What I really wanted was just to repeat a command … you know, the last time I ssh’ed into a server, or something.  So yeah, you can do

!ssh

and you’ll get the most recent “ssh” command that you entered.  But what if you have used ssh lots of times, to lots of servers?  You could start to search for the server name, but then things start to get complicated, messy, and annoying.

What control-R does is search backwards through HISTFILE, looking for a match for what you have entered until now.  If you use Emacs, then this will make perfect sense to you, since control-R is the reverse version of control-S in Emacs.  If you don’t know Emacs, then it’s a crying shame — but I’ll still be your friend, don’t worry.

Let’s say you have ssh’ed into five different servers today, and you want to ssh again into the third server of the bunch.  You type control-R, which puts you into bck-i-search (i.e., “backward incremental search”) mode.  Now type “s” (without enter).  The most recent command that you entered, which contains an “s”, will appear.  Now type another “s” (again, without pressing enter).  The most recent command containing two “s” characters in a row will appear.  Depending on your shell and configuration, the matching text might even be highlighted.

Now enter “h”.   In my case, I got to the most recent call to “ssh” that I made in my shell.  But I don’t want this last (fifth) one; I want the third one.  So I enter control-R again, and then again.  Now I’m at the third time (out of five) that I used ssh today, at the command I want.  I press “enter”, and I’ve now executed the command.

While searching backward, if you miss something because you hit control-R one too many times, you can use control-S to search forward.  You can use the “delete” key to remove characters, one at a time, from the search string.  And you can use “enter”, as described above, to end the search.  I should also note that I’ve modified my zsh prompts such that the matched text in control-R is highlighted, which has made it even more useful to me.

So, when was the last time I entered the full “ssh” command into a client’s server? I dunno, but it was a while ago… since the odds are that within the 10,000 most recent commands, I’ve got a mention of that client’s server.  And if I needed to pass specific options to ssh, such as a port number or a certificate file to get into AWS, that’ll be in the history, too.  By combining a huge history with control-R, you can basically write each command once, and then refer back to it many times.

Now the fact is that control-R isn’t really part of bash, per se.  Rather, it comes from a GNU library called “readline” that is used in a large number of programs; zsh offers the same keystroke via its own line editor (ZLE).  For example, readline is used in IPython, Pry, and the psql command-line client for PostgreSQL.  Everywhere I go, I can use control-R — and I do!  Each program saves its own history, so there’s no danger of mixing shell commands with PostgreSQL queries.