Yes, you can master regular expressions!

Announcing: My new book, “Practice Makes Regexp,” with 50 exercises meant to help you learn and master regular expressions. With explanations and code in Python, Ruby, JavaScript, and PostgreSQL.

I spend most of my time nowadays going to high-tech companies and training programmers in new languages and techniques. Actually, many of the things I teach them aren’t really new; rather, they’re new to the participants in my training. Python has been around for 25 years, but for my students, it’s new, and even a bit exciting.

I tell participants that my job is to add tools to their programming toolbox, so that if they encounter a new problem, they’ll have new and more appropriate or elegant ways to attack and solve it. Moreover, I tell them, once you are intimately familiar with a tool or technique, you’ll suddenly discover opportunities to use it.
Earlier this week, I was speaking with one of my consulting clients, who was worried that some potentially sensitive information had been stored in their Web application’s logfiles — and they weren’t sure if they had a good way to search through the logs.


I suggested the first solution that came to mind: Regular expressions.

Regular expressions are a lifesaver for anyone who works with text.  We can use them to search for patterns in files, in network data, and in databases. We can use them to search and replace.  To handle protocols that have changed ever so slightly from version to version. To handle human input, which is always messier than what we get from other computers.

Regular expressions are one of the most critical tools I have in my programming toolbox.  I use them at least a few times each day, and sometimes even dozens of times in a given day.

So, why don’t all developers know and use regular expressions? Quite simply, because the learning curve is so steep. Regexps, as they’re also known, are terse and cryptic. Changing one character can have a profound impact on what text a regexp matches, as well as its performance. Knowing which character to insert where, and how to build up your regexps, is a skill that takes time to learn and hone.

Many developers say, “If I have a problem that involves regular expressions, I’ll just go to Stack Overflow, where my problem has likely been addressed already.” And in many cases, they’re right.

But by that logic, I shouldn’t learn any French before I go to France, because I can always use a phrasebook.  Sure, I could work that way — but it’s far less efficient, and I’ll miss many opportunities that would come my way if I knew French.

Moreover, relying on Stack Overflow means that you never get a full picture of what you can really do with regular expressions. You get specific answers, but you don’t have a fully formed mental model of what they are and how they work.

But wait, it gets worse: If you’re under the gun, trying to get something done for your manager or a big client, you can’t spend time searching through Stack Overflow. You need to bring your best game to the table, demonstrating fluency in regular expressions.  Without that fluency, you’ll take longer to solve the problem — and possibly, not manage to solve it at all.

Believe me, I understand — my first attempt at learning regular expressions was a complete failure. I read about them in the Emacs manual, and thought to myself, “What could this seemingly random collection of characters really do for me?”  I ignored them for a few more years, until I started to program in Perl — a language that more or less expected you to use regexps.

So I spent some time learning regexp syntax. The more I used regexps, the more opportunities I found to use them, and the more they made my life easier, better, and more convenient. I was able to solve problems that others couldn’t — or that took them far longer to solve than they took me. Suddenly, processing text was a breeze.

I was so excited by what I had learned that when I started to teach advanced programming courses, I added regexps to the syllabus. I figured that I could find a way to make regexps understandable in an hour or two.

But boy, was I wrong: If there’s something that’s hard for programmers to learn, it’s regular expressions.  I’ve thus created a two-day course for people who want to learn regular expressions.  I not only introduce the syntax, but I have them practice, practice, and practice some more.  I give them situations and tasks, and their job is to come up with a regexp that will solve the problem I’ve given them.  We discuss different solutions, and the way that different languages might go about solving the problem.

After lots of practice, my students not only know regexp syntax — they know when to use it, and how to use it. They’re more efficient and valuable employees. They become the person to whom people can turn with tricky text-processing problems. And when the boss is pressuring them for a quick answer, they can deliver one.

And so, I’m delighted to announce the launch of my second ebook, “Practice Makes Regexp.” This book contains 50 tasks for you to accomplish using regular expressions. Once you have solved each problem, I present the solution, first walking you through the general regexp approach, and then going into greater depth (with code) to solve the problem in Python, Ruby, JavaScript, and PostgreSQL. My assumption in the book is that you have already learned regexps elsewhere, but that you’re not quite sure when to use them, how to apply them, and when each metacharacter is most appropriate.

After you go through all 50 exercises, I’m sure that you’ll be a master of regular expressions.  It’ll be tough going, but the point is to sweat a bit working on the exercises, so that you can worry a lot less when you’re at work. I call this “controlled frustration” — better to get frustrated working on exercises, than when the boss is demanding that you get something done right away.

Right now, the book is more than 150 pages long, with four complete chapters (including 17 exercises).  Within two weeks, the remaining 33 exercises will be done.  And then I’ll start work on 50 screencasts, one for each of the exercises, in which I walk you through solutions in each of Python, Ruby, JavaScript, and PostgreSQL.  If my previous ebook is any guide, there will be about 5 hours (!) of screencasts when I’m all done.

If you have always shied away from learning regular expressions, or want to harness their power, Practice Makes Regexp is what you have been looking for.  It’s not a tutorial, but it will help you to understand and internalize regexps, helping you to master a technology that frustrates many people.

To celebrate this launch, I’m offering a discount of 10%.  Just use the “regexplaunch” offer code, and take 10% off of any of the packages — the book, the developer package (which includes the solutions in separate program files, as well as the 300+ slides from the two-day regexp course I give at Fortune 100 companies), or the consultant package (which includes the screencasts, as well as what’s in the developer package).

I’m very excited by this book.  I think that it’ll really help a lot of people to understand and use regular expressions.  And I hope that you’ll find it makes you a more valuable programmer, with an especially useful tool in your toolbox.

Using regexps in PostgreSQL

After months of writing, editing, and procrastinating, my new ebook, “Practice Makes Regexp,” is almost ready. The book (similar to my earlier ebook, “Practice Makes Python”) contains 50 exercises to improve your fluency with regular expressions (“regexps”), with solutions in Python, Ruby, JavaScript, and PostgreSQL.

When I tell people this, they often say, “PostgreSQL?  Really?!?”  Many are surprised to hear that PostgreSQL supports regexps at all.  Others, once they take a look, are surprised by how powerful the engine is.  And even more are surprised by the variety of ways in which they can use regexps from within PostgreSQL.

I’m thus presenting an excerpt from the book, providing an overview of  PostgreSQL’s regexp operators and functions. I’ve used these many times over the years, and it’s quite possible that you’ll also find them to be of assistance when writing queries.

PostgreSQL

PostgreSQL isn’t a language per se, but rather a relational database system. That said, PostgreSQL includes a powerful regexp engine.  It can be used to test which rows match certain criteria, but it can also be used to retrieve selected text from columns inside of a table.  Regexps in PostgreSQL are a hidden gem, one which many people don’t even know exists, but which can be extremely useful.

Defining regexps

Regexps in PostgreSQL are defined using strings.  Thus, you will create a string (using single quotes only; you should never use double quotes in PostgreSQL), and then match that to another string. If there is a match, PostgreSQL returns “true.”

PostgreSQL’s regexp syntax is similar to that of Python and Ruby, in that you use backslashes to neutralize metacharacters. Thus, + is a metacharacter in PostgreSQL, whereas \+ is a plain “plus” character. However, there are some differences among the dialects; for example, PostgreSQL’s word-boundary metacharacter is \y, whereas in Python and Ruby it is \b. (This was likely done to avoid conflicts with \b’s other meaning, the ASCII backspace character.)
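
To make this concrete, here is a small sketch of the word-boundary and escaping behavior described above (these examples are my own, not taken from the book’s exercises):

select 'one two three' ~ '\ytwo\y';   -- returns "true": \y matches at a word boundary
select 'cartwofull' ~ '\ytwo\y';      -- returns "false": "two" is embedded in a longer word
select 'a+b' ~ '\+';                  -- returns "true": \+ matches a literal plus sign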

Where things are truly different in PostgreSQL’s implementation is the set of operators and functions used to work with regexps. PostgreSQL’s operators are generally aimed at finding whether a particular regexp matches text, in order to include or exclude result rows from an SQL query.  By contrast, the regexp functions are meant to retrieve some or all of a string from a column’s text value.

True/false operators

PostgreSQL comes with four regexp operators. In each case, the text string to be matched should be on the left, and the regexp should be on the right.  All of these operators return true or false:

  • ~  case-sensitive match
  • ~*  case-insensitive match
  • !~  case-sensitive non-match
  • !~* case-insensitive non-match

Thus, you can say:

select 'abc' ~ 'a.c';   -- returns "true"
select 'abc' ~ 'A.C';   -- returns "false"
select 'abc' ~* 'A.C';  -- returns "true"

In addition to the standard character classes, we can also use POSIX-style character classes:

select 'abc' ~* '^[[:xdigit:]]$';    -- returns "false"
select 'abc' ~* '^[[:xdigit:]]+$';   -- returns "true"
select 'abcq' ~* '^[[:xdigit:]]+$';  -- returns "false"

These operators, as mentioned above, are often used to include or exclude rows in a query’s WHERE clause:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC'), ('abc'), ('AbC'), ('Abq'), ('ABCq');
SELECT id, thing FROM Stuff WHERE thing ~* '^[abc]{3}$';

This final query should return three rows: those in which thing is equal to abc, AbC, and ABC.

Extracting text

If you’re interested in the text that was actually matched, then you’ll need to use one of the built-in regexp functions that PostgreSQL provides. For example, the regexp_matches function allows us not only to determine whether a regexp matches some text, but also to get the text that was matched. For each match, regexp_matches returns an array of text (even if that array contains a single element). For example:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC'), ('abc'), ('AbC'), ('Abq'), ('ABCq');
SELECT regexp_matches(thing, '^[abc]{3}$') FROM Stuff;

The above will return a single row:

{abc}

As you can see, the above returned only a single column (from the function) and a single row (i.e., the one row whose value matched), because the match is case-sensitive by default. However, when you invoke regexp_matches, you can provide additional flags that modify the way in which it operates. These flags are similar to those used in Python, Ruby, and JavaScript.

For example, we can use the i flag to make regexp_matches case-insensitive:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC'), ('abc'), ('AbC'), ('Abq'), ('ABCq');
SELECT regexp_matches(thing, '^[abc]{3}$', 'i') FROM Stuff;

Now we’ll get three rows back, since we have made the match case-insensitive.  regexp_matches can take several other flags as well, including g (for a global search). For example:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC');
SELECT regexp_matches(thing, '.', 'g') FROM Stuff;

Here is the output from regexp_matches:

{A} 
{B} 
{C}

Notice how regexp_matches, because of the g option, returned three rows, with each row containing a single (one-character) array. This indicates that there were three matches.

Why is each returned row an array, rather than a string? Because if we use groups to capture parts of the text, the array will contain the groups:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC'), ('AqC');
SELECT regexp_matches(thing, '^(A)(..)$', 'ig') FROM Stuff;

Notice that in the above example, I combined the i and g flags, passing them in a single string.  The result is a set of arrays:

| regexp_matches |
|----------------|
| {A,BC}         |
| {A,qC}         |

Splitting

A common function in many high-level languages is split, which takes a string and returns an array of items. PostgreSQL offers something similar with its split_part function, but that splits only on a fixed delimiter string, and returns one field at a time.

However, PostgreSQL also offers two other functions: regexp_split_to_array and regexp_split_to_table. These allow us to split a text string using a regexp, rather than a fixed string. For example, if we say:

select regexp_split_to_array('abc def   ghi   jkl', '\s+');

The above will treat any run of whitespace, of any length, as a delimiter, and will use it to split the string into an array. But you can use any regexp you want to do the splitting.

A similar function is regexp_split_to_table, which returns not a single row containing an array, but rather one row for each element. Repeating the above example:

select regexp_split_to_table('abc def   ghi   jkl', '\s+');

The above would return a table of four rows, with each split text string in its own row.
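
Though I won’t reproduce psql’s formatting exactly, the result should look roughly like this, with one element per row:

 regexp_split_to_table
-----------------------
 abc
 def
 ghi
 jkl
(4 rows)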

Substituting text

The regexp_replace function allows us to create a new text string based on an old one.  For example:

SELECT regexp_replace('The quick brown fox jumped over the lazy dog',
                      '[aeiou]', '_');

The above returns:

Th_ quick brown fox jumped over the lazy dog

Why was only the first vowel replaced? Because we didn’t invoke regexp_replace with the g option, making it global:

SELECT regexp_replace('The quick brown fox jumped over the lazy dog',
                      '[aeiou]', '_', 'g');

Now all occurrences are replaced:

Th_ q__ck br_wn f_x j_mp_d _v_r th_ l_zy d_g
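
One final aside, going slightly beyond this excerpt: the replacement string passed to regexp_replace can also refer back to captured groups with \1, \2, and so on. For example, this (made-up) query reorders the parts of a date:

SELECT regexp_replace('2014-05-23', '(\d{4})-(\d{2})-(\d{2})', '\3/\2/\1');

The above returns:

23/05/2014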

In PostgreSQL, as in life, don’t wait too long to commit

I recently helped a client with serious PostgreSQL problems. Their database is part of an application that configures and monitors large networks; each individual network node reports its status to the database every five minutes. The client was having problems with one table in particular, containing about 25,000 rows — quite small, by modern database standards, and thus unlikely to cause problems.

However, things weren’t so simple: This table was growing to more than 20 GB every week, even though autovacuum was running. In response, the company established a weekly ritual: Shut down the application, run a VACUUM FULL, and then restart things. This solved the problem in the short term — but by the end of the week, the database had grown back to about 20 GB. Clearly, something needed to be done about it.

I’m happy to say that I was able to fix this problem, and that the fix wasn’t so difficult. Indeed, the source of the problem might well be obvious to someone with a great deal of PostgreSQL experience. That’s because the problem mostly stemmed from a failure to understand how PostgreSQL’s transactions work, and how they can influence the functioning of VACUUM, the size of your database, and the reliability of your system. I’ve found that this topic is confusing to many people who are new to PostgreSQL, and I thus thought that it would be useful to walk others through the problem, and then describe how we solved it.

PostgreSQL, like many other modern relational databases, uses a technology known as multiversion concurrency control (“MVCC”).  MVCC reduces the number of exclusive locks placed on a database’s rows, by keeping around multiple versions of each row. If that sounds weird, then I’ll try to explain. Let’s start with a very simple table:

CREATE TABLE Foo (
   id SERIAL PRIMARY KEY,
   x  INTEGER NOT NULL
 );

Now let’s add 1 million rows to that table:

INSERT INTO Foo (x) VALUES (generate_series(1,1000000));

Now, how much space does this table take up on disk? We can use the built-in pg_relation_size function, but that’ll return a large integer representing a number of bytes. Fortunately, PostgreSQL also includes pg_size_pretty, which turns a byte count into something understandable by computer-literate humans:

SELECT pg_size_pretty(pg_relation_size('foo'));

35 MB

What happens to the size of our table if we UPDATE each row, incrementing x by 1?

UPDATE Foo SET x = x + 1;
SELECT pg_size_pretty(pg_relation_size('foo'));

69 MB

If you’re new to MVCC, then the above is probably quite surprising. After all, you might expect an UPDATE query to change each of the existing 1 million rows, keeping the database size roughly identical.

However, this is not at all the case: Rather than replace or update existing rows, UPDATE in an MVCC system adds new rows. Indeed, in an MVCC-based system, our queries never directly change or remove any rows. INSERT adds new rows, DELETE marks rows as no longer needed, and UPDATE performs both an INSERT and a DELETE.

That’s right: In an MVCC system, UPDATE doesn’t change existing rows, although we might have the illusion of this occurring. And every time you UPDATE every row in a table, you roughly double the amount of space it occupies.

At this point, many newcomers to PostgreSQL are even more confused: How can it be that deleted rows don’t go away? And why would UPDATE work in such a crazy way? And how can all of this possibly help me?

The key to understanding what’s happening is to realize that once a row has been added to the system, it cannot be changed. However, its space can be reused by new rows, if the old row is no longer accessible by any transaction.

This starts to make a bit more sense if you realize that every transaction in PostgreSQL has a unique ID number, known as the “transaction ID.” This number constantly increases, providing us with a running count of transactions in the system. Every time we INSERT a row, PostgreSQL records the first transaction ID from which the row should be visible, in a column known as “xmin.” You can think of xmin as the row’s birthday. You won’t normally see xmin, but PostgreSQL will show it to you, if you ask for it explicitly:

SELECT  xmin, * FROM Foo WHERE id < 5 ORDER BY id;

│  xmin  │ id │ x │

│ 187665 │  1 │ 2 │
│ 187665 │  2 │ 3 │
│ 187665 │  3 │ 4 │
│ 187665 │  4 │ 5 │

In an MVCC-based system, DELETE doesn’t really delete a row.  Rather, it indicates that a row is no longer available to future transactions. We can see an initial hint of this if, when we SELECT the contents of a table, we ask for not only xmin, but also xmax — the final transaction (or “expiration date,” if you like) for which a row should be visible:

SELECT  xmin, xmax, * FROM Foo WHERE id < 5 ORDER BY id;

│  xmin  │ xmax │ id │ x │

│ 187665 │    0 │  1 │ 2 │
│ 187665 │    0 │  2 │ 3 │
│ 187665 │    0 │  3 │ 4 │
│ 187665 │    0 │  4 │ 5 │

We can see that each of these rows was added in the same transaction ID (187665). Moreover, all of these rows are visible, since their xmax value is 0. (Don’t worry; we’ll see other xmax values in a little bit.) If I update some of these rows, we’ll see that their xmin value will be different:

UPDATE Foo SET x = x + 1 WHERE id IN (1,3);
SELECT  xmin, * FROM Foo WHERE id < 5 ORDER BY id;

│  xmin  │ id │ x │

│ 187668 │  1 │ 3 │
│ 187665 │  2 │ 3 │
│ 187668 │  3 │ 5 │
│ 187665 │  4 │ 5 │

The rows with ID 1 and 3 were added in transaction 187668, whereas the rows with ID 2 and 4 were added in an earlier transaction, number 187665.

Since I’ve already told you that rows cannot be changed by a query, this raises an interesting question: What happened to the rows with IDs 1 and 3 that were added in the first transaction, number 187665?

The answer is: Those rows still exist in the system! They have not been modified or removed. But the system did mark them as having an xmax of 187668, the transaction ID in which the new, replacement rows are now available.

Transaction IDs only go up. But you can imagine a hypothetical time-traveler’s database, in which you could move the transaction ID down, and see rows from an earlier era. (Somehow, I doubt that this would be a very popular television series.)

The thing is, we don’t need a hypothetical, time-traveler’s database in order to see such behavior. All we need is to open two separate sessions to our PostgreSQL database at the same time, and have each session open a transaction.

You see, when you BEGIN a new transaction, your actions are hidden from the rest of the world — and similarly, modifications made elsewhere in the database are hidden from you. You can thus have a situation in which two different sessions see completely different versions of the same row, thanks to their different transaction IDs. This might sound crazy, but it actually helps the database to run smoothly, with a minimum of conflicts and locks.

So, let’s open two different PostgreSQL sessions, and take a look at what happens.  In order to distinguish between them, I used the \set command in psql to make PROMPT1 either “Session A>” or “Session B>”. Here’s session A:

Session A> BEGIN;
Session A> select  xmin, xmax, * from foo where id < 5 order by id;

│  xmin  │ xmax │ id │ x │

│ 187668 │    0 │  1 │ 3 │
│ 187665 │    0 │  2 │ 3 │
│ 187668 │    0 │  3 │ 5 │
│ 187665 │    0 │  4 │ 5 │

And here is session B:

Session B> BEGIN;
Session B> select  xmin, xmax, * from foo where id < 5 order by id;

│  xmin  │ xmax │ id │ x │

│ 187668 │    0 │  1 │ 3 │
│ 187665 │    0 │  2 │ 3 │
│ 187668 │    0 │  3 │ 5 │
│ 187665 │    0 │  4 │ 5 │

At this point, they are identical; both sessions see precisely the same rows, with the same xmin and xmax. Now let’s modify the values of two rows in session A:

Session A> update foo set x = x + 1 where id in (1,3);

What do we see now in session A?

Session A> select xmin, xmax, * from foo where id < 5 order by id;

│  xmin  │ xmax │ id │ x │

│ 187670 │    0 │  1 │ 4 │
│ 187665 │    0 │  2 │ 3 │
│ 187670 │    0 │  3 │ 6 │
│ 187665 │    0 │  4 │ 5 │

Just as we would expect, we have two new rows (IDs 1 and 3), and two old rows (IDs 2 and 4), which we can distinguish via the xmin value.  In all cases, the rows have an xmax value of 0.

Let’s return to session B, and see what things look like there.  (Note that I haven’t committed any transactions here!)

Session B> select  xmin, xmax, * from foo where id < 5 order by id;

│  xmin  │  xmax  │ id │ x │

│ 187668 │ 187670 │  1 │ 3 │
│ 187665 │      0 │  2 │ 3 │
│ 187668 │ 187670 │  3 │ 5 │
│ 187665 │      0 │  4 │ 5 │

That’s right — the rows that session B sees are no longer the rows that session A sees.  Session B continues to see the same rows as were visible when it started the transaction. We can see that the xmax of session B’s old rows is the same as the xmin of session A’s new rows. That’s because any transaction after 187670 will see the new row (i.e., with an xmin of 187670), but any transaction started before 187670 will see the old row (i.e., with an xmax of 187670).

I hope that the picture is now coming together: When you INSERT a new row, its xmin is the current transaction ID, and its xmax is 0.  When you DELETE a row, its xmax is changed to be the current transaction ID.  And when you UPDATE a row, PostgreSQL does both of these actions.  That’s why the UPDATE I did earlier, on all of the rows in the table, doubled its size.  (More on that in a moment.)

This allows different transactions to work independently, without having to wait for locks. Imagine that session B wants to look at the row with ID 3; because of MVCC, it doesn’t need to wait to read it. Heck, session B can even change the row with ID 3, although that statement will hang until session A commits or rolls back.
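
For example, continuing the two sessions above, here is a sketch of what that blocking would look like:

Session B> UPDATE foo SET x = x + 10 WHERE id = 3;
-- this statement hangs, because session A holds an uncommitted change to that row;
-- it completes as soon as session A issues COMMIT or ROLLBACK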

And indeed, another benefit of MVCC is that rollbacks are trivially easy to implement: If we COMMIT session A and roll back session B, then session A’s view of the world will be the current one, for all following transactions.  But if we ROLLBACK session A, then everything is fine; PostgreSQL keeps track of the fact that session A was rolled back, and that we should actually be using (until further notice) the rows with an xmax of 187670.

As you can imagine, this can lead to some trouble. If every UPDATE increases the number of rows, and every DELETE fails to free up any space, every PostgreSQL database will eventually run out of room, no? Well, yes — except for VACUUM, which is the garbage-collecting mechanism for PostgreSQL’s rows. VACUUM can be run manually, but it’s normally run via “autovacuum,” a program that runs in the background, and invokes VACUUM on a regular basis.

VACUUM doesn’t return space from your PostgreSQL database to the operating system. Rather, it identifies dead row versions that are no longer visible to any transaction (that is, rows whose xmax is older than every still-open transaction), and marks their space as available for re-use. To do this, PostgreSQL keeps track of not only the current transaction ID, but also the ID of the earliest still-open transaction: so long as that transaction is open, it needs to be able to see the rows from that point in time.

Don’t confuse VACUUM with VACUUM FULL; the latter does indeed shrink the database’s files and return space to the operating system, but it takes an exclusive lock on each table while rewriting it. If you have a serious 24/7 application, then VACUUM FULL is a very bad idea. The regular autovacuum daemon, albeit perhaps with a bit of configuration, should take care of identifying which rows can and should be marked for reuse.

In the case of my client’s problem, my initial thought was that autovacuum wasn’t running. However, a quick look at the pg_stat_user_tables view showed that autovacuum was running on a regular basis.
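
If you want to perform the same check on your own system, a query along these lines against pg_stat_user_tables (a sketch; adjust the column list to taste) shows when VACUUM and autovacuum last ran on each table, and how many dead tuples each one is carrying:

SELECT relname, last_vacuum, last_autovacuum, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;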

However, VACUUM will mark a row as eligible for reuse only if no other transaction can still see it. (After all, if an older session remains open, then it still needs to see those old row versions.) This means that if a transaction remains open from the time the system starts up, VACUUM will never identify any rows as eligible for reuse, because there is still an open transaction whose ID is lower than the xmax of the deleted rows.

And this is precisely what happened to my client: It turns out that their application, at startup, was opening several transactions with BEGIN statements… and then never committing those transactions or rolling them back.  None of those transactions led to a lock in PostgreSQL, because they weren’t executing any query that would lead to a conflict. And autovacuum was both running and reporting that it ran on each table — because it had indeed done so.

But when it came time to mark the old rows as eligible for reuse, VACUUM ignored all of the rows in the entire database, including our 23,000-row table — because of the transactions that the application had opened at startup. It didn’t matter how many times we ran VACUUM, whether it was done manually or automatically, or what else was running at the time: All of the old versions of every row in our table were kept around by PostgreSQL, leading to massive disk usage and database slowness. Stopping the application had the effect of closing those transactions, and thus allowing VACUUM FULL to do its magic, albeit at the cost of system downtime. And because the application opened the transactions at startup, any fix was bound to be a temporary one.

My client was a bit surprised to discover that the entire database could encounter such problems from an open transaction that was neither committed nor rolled back. However, I hope that you now see why the space wasn’t reclaimed, and why the database acted as it did.

The solution, of course, was to get rid of those three open, empty transactions at startup; once we did that, everything ran smoothly.
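
If you suspect the same problem on your own system, a query like the following against pg_stat_activity (a sketch; the state column appears in PostgreSQL 9.2 and later) will show sessions that have opened a transaction and then gone idle without committing:

SELECT pid, state, xact_start, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;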

Transactions are a brilliant and useful invention, but in PostgreSQL (and other MVCC systems), you shouldn’t be holding onto open transactions for too long. Doing so might have surprisingly unpleasant consequences. Commit or rollback as soon as you can — or you might find yourself with a database accumulating gigabytes full of old, unnecessary rows.

Remember: Don’t mix = and NULL

I was recently performing an analysis on the database for my dissertation software, and wanted to look at one particular set of rows.  The table I was looking at is called Nodes, and because each node (represented by a row in the table) can optionally belong to a group, the group_id column allows for NULL values.

So, if I want to find all of the nodes in group 2, I can say

SELECT * FROM Nodes WHERE group_id = 2;

My data consists (of course) of several tables, including Versions, which has a many-to-one relationship with Nodes.  If I want all of the versions for nodes in group 2, I can say:

SELECT COUNT(*) FROM Versions WHERE node_id in (SELECT id FROM Nodes WHERE group_id = 2);

Now, let’s say that I want to get all of the versions that are not from group 2.  How would I do that?  Well, one way would be:

SELECT COUNT(*) FROM Versions WHERE node_id 
   NOT IN (SELECT id FROM Nodes WHERE group_id = 2);

But at some point, I flipped the logic, turning the NOT IN into IN, and = into <>:

SELECT COUNT(*) FROM Versions WHERE node_id 
  IN (SELECT id FROM Nodes WHERE group_id <> 2);

That, of course, was a bug. That’s because SQL uses three-valued (sometimes called ternary) logic, rather than the binary true/false logic to which we’re accustomed. Comparing anything with NULL results in NULL. For example, consider the following table:

select * from truth_test ;
+--------+
| val    |
+--------+
| t      |
| f      |
| [null] |
+--------+

I can create a truth table pretty easily:

SELECT val, val = true as is_true, val=false as is_false, 
       val=null as equal_null, val is null as is_null 
FROM truth_test;
+--------+---------+----------+------------+---------+
| val    | is_true | is_false | equal_null | is_null |
+--------+---------+----------+------------+---------+
| t      | t       | f        | [null]     | f       |
| f      | f       | t        | [null]     | f       |
| [null] | [null]  | [null]   | [null]     | t       |
+--------+---------+----------+------------+---------+

You can see, then, that TRUE is equal to TRUE, and FALSE is equal to FALSE, which we would expect.  But using the = operator with NULL is always going to result in NULL — even with itself!

If you want to compare with NULLs, then you need to use IS NULL.  (Or you can use a function such as COALESCE, which returns the first non-NULL value passed to it.)  And that’s the mistake that I made in my queries yesterday: I was trying to get all of the nodes for which group_id was not 2.  But by using =, I inadvertently excluded all of the rows for which group_id was NULL — in other words, all of the nodes that weren’t assigned to a group.  Which, it turns out, was the majority of them.  Whoops.
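
For the record, here is a sketch of the COALESCE version of my query; the -1 is just a placeholder value that I know no real group will ever use:

SELECT COUNT(*)
FROM Versions
WHERE node_id IN (SELECT id FROM Nodes WHERE COALESCE(group_id, -1) <> 2);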

My solution was to create a view containing all of the IDs that I want to exclude:

CREATE OR REPLACE VIEW group2_model_ids
AS
SELECT nodes.id
FROM nodes
WHERE nodes.group_id = 2;

Now, if I want the number of versions that aren’t in group 2, I can use a subselect, as follows:

SELECT COUNT(*) 
FROM Versions 
WHERE node_id NOT IN (SELECT id FROM group2_model_ids);

Of course, I could have simply rewritten the query, either to use COALESCE or to check for both NULL and non-NULL values. But I prefer this approach, in part because it gives me a name and an abstraction that I can use in future queries; and since I’m only dealing with a few tens of thousands of rows in this table, the extra subselect costs very little.

Turning PostgreSQL rows into arrays

So far in this series, I have described how you can create arrays and retrieve information from them — both the actual data stored in the array, and information about the array, such as its length. But the coolest trick, or set of tricks, that I use in PostgreSQL is the ability to turn arrays into rows, and vice versa.  This is where, in my mind, arrays start to get truly powerful, and allow us to do all sorts of things that would otherwise be quite difficult, or require another language.


For example, let’s say that I want to keep track of my appointments in a database.  I create an Appointments table:

CREATE TABLE Appointments (id SERIAL,
 meeting_at TIMESTAMP,
 description TEXT
 );

And then I enter information into this table, such that I have the following appointments:

[local]/reuven=# select * from appointments order by meeting_at;
 +----+---------------------+---------------+
 | id | meeting_at          | description   |
 +----+---------------------+---------------+
 |  5 | 2014-05-22 12:00:00 | Dissertation  |
 |  4 | 2014-05-22 14:00:00 | Devel meeting |
 |  1 | 2014-05-23 15:00:00 | Dissertation  |
 |  2 | 2014-05-23 17:00:00 | Dinner        |
 |  3 | 2014-05-23 23:00:00 | Sleep         |
 +----+---------------------+---------------+
 (5 rows)

If I want to find out what I’m doing on May 23rd, I can simply say:

[local]/reuven=# select * from appointments where meeting_at::date = '2014-may-23';
 +----+---------------------+--------------+
 | id | meeting_at          | description  |
 +----+---------------------+--------------+
 |  1 | 2014-05-23 15:00:00 | Dissertation |
 |  2 | 2014-05-23 17:00:00 | Dinner       |
 |  3 | 2014-05-23 23:00:00 | Sleep        |
 +----+---------------------+--------------+

Notice that in the above example, I’m using PostgreSQL’s casting operator (::) to turn the meeting_at column, which contains a TIMESTAMP value, into a DATE value. Without that cast, the date literal would be treated as a timestamp at midnight, and the comparison would match only appointments at exactly that instant.
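
To see the difference, here is a quick sketch of the same query without the cast:

select * from appointments where meeting_at = '2014-may-23';
-- compares against 2014-05-23 00:00:00, and so returns no rows for this data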

So far, this isn’t very exciting or surprising. But let’s say that I want to get all of the descriptions of today’s appointments in a comma-separated list. Normally, I would need to retrieve all of the rows, and then use another language to piece them together. With PostgreSQL’s arrays, though, we can do this all in the database — which executes much more quickly, among other things. The way to accomplish this is with the ARRAY function. Don’t confuse the ARRAY() function with the ARRAY[] constructor, although it’s easy to do: the constructor uses square brackets and takes a list of values, while the function uses parentheses and takes a subquery.

The ARRAY function is built into PostgreSQL, and turns a set of rows into an array.  That is, ARRAY expects to get rows as its input, and returns a single row, containing a single array, as its output. Let’s start with a set of rows, containing the descriptions of today’s appointments.  I’m going to use PostgreSQL’s concatenation operator (||) to combine the time (using the ::time cast) of the appointment with the appointment itself.  My query looks like:

SELECT meeting_at::time || ' ' || description
FROM Appointments
WHERE meeting_at::date = '2014-may-23'
ORDER BY meeting_at;

The result I get is:

 +-----------------------+
 | ?column?              |
 +-----------------------+
 | 15:00:00 Dissertation |
 | 17:00:00 Dinner       |
 | 23:00:00 Sleep        |
 +-----------------------+

Now that I have my rows, I can turn them into an array. The ARRAY function needs to be invoked via a SELECT; thus, using ARRAY means that we’re using a subselect. The inner SELECT is what we wrote above, and the outer one is just our call to ARRAY:

SELECT ARRAY(SELECT meeting_at::time || ' ' || description 
FROM Appointments 
WHERE meeting_at::date = '2014-may-23'
ORDER BY meeting_at);

And sure enough, we get a one-row, one-column result:

 +--------------------------------------------------------------+
 | array                                                        |
 +--------------------------------------------------------------+
 | {"15:00:00 Dissertation","17:00:00 Dinner","23:00:00 Sleep"} |
 +--------------------------------------------------------------+

You can see that our three rows have been turned into three elements of the array. I do this all the time when creating reports or summaries of other data: I write a view or CTE that grabs the data from the rows that interest me, and stuffs that data into an array. I can then check the array’s contents, length, or anything else that might be useful in my analysis.

But earlier, I said that I wanted a comma-separated list of the appointments, and I currently have them in an array. Fortunately, PostgreSQL’s array_to_string function can save the day, acting similarly to the “join” method that I know and love from such languages as Ruby and Python. This function, wrapped around a call to ARRAY, wrapped around our subselect, looks like this:

SELECT array_to_string(
  ARRAY(SELECT meeting_at::time || ' ' || description
        FROM Appointments
        WHERE meeting_at::date = '2014-may-23'
        ORDER BY meeting_at), ', ');

And sure enough, the result is as follows:

 +--------------------------------------------------------+
 | array_to_string                                        |
 +--------------------------------------------------------+
 | 15:00:00 Dissertation, 17:00:00 Dinner, 23:00:00 Sleep |
 +--------------------------------------------------------+

Not bad, for one query; we’ve taken our three rows, and turned them into a string, without having to use anything other than SQL!  There’s just one final problem here that I’d like to fix, namely that when you invoke a function as we did here, the function’s name becomes the name of the column.  I’d like to change that, so the column has a more reasonable name.  This is particularly important if you’re planning to read the column using a high-level language; your code will look awfully funny if it includes statements like:

puts row['array_to_string']

when it could instead read

puts row['appointments_for_today']

Here’s how I can use “AS” to change the name of the column, making it easier to read and work with:

SELECT array_to_string(
  ARRAY(SELECT meeting_at::time || ' ' || description
        FROM Appointments
        WHERE meeting_at::date = '2014-may-23'
        ORDER BY meeting_at), ', ')
AS appointments_for_today;

It is in these sorts of uses, to aggregate data and create reports, that I’ve found PostgreSQL’s arrays to be particularly useful and powerful. Next time, we’ll see how we can do the opposite, turning an array back into rows — and then, we’ll see how the combination of these conversions lets us perform all sorts of tricks.


Looking in PostgreSQL arrays with ANY

So far, this series has looked at how to create PostgreSQL arrays, how to retrieve data from them, and how to get the length of an array’s outer and inner dimensions. But one of the most common actions that we’ll want to do with an array is look inside to see if a particular value is in there.  Today, I’ll show you how to do that with the PostgreSQL ANY operator and its cousins. For example, given my table “foo”:

[local]/reuven=# select * from foo;
+----+---------------------+
| id | stuff               |
+----+---------------------+
| 8  | {abc}               |
| 9  | {abc,def}           |
| 10 | {abc,def,"ghi jkl"} |
+----+---------------------+
(3 rows)

We’ve already seen (in my last post on this topic) how we can find all of the rows in which the “stuff” array is of a certain length. But sometimes — more often, perhaps — I want all of the rows in which a particular value is somewhere in the array. (You can think of this as the SQL equivalent of Ruby’s “select” on an enumerable, or Python’s “filter” on a sequence.)

For this, we need to use PostgreSQL’s ANY operator. I have always found the syntax of ANY (and its synonym, SOME) to be a bit hard to understand: the IN operator works by itself, but ANY must be paired with a comparison operator, such as =. For example, if I want all of the rows from the above “foo” table in which the “stuff” array contains an element ‘abc’, I can say:

[local]/reuven=# select * from foo where 'abc' = ANY(stuff);
+----+---------------------+
| id | stuff               |
+----+---------------------+
| 8  | {abc}               |
| 9  | {abc,def}           |
| 10 | {abc,def,"ghi jkl"} |
+----+---------------------+
(3 rows)

Sure enough, because each of these three arrays contains an element ‘abc’, we get all three rows back.  What if I ask for ‘def’?

[local]/reuven=# select * from foo where 'def' = ANY(stuff);
+----+---------------------+
| id | stuff               |
+----+---------------------+
| 9  | {abc,def}           |
| 10 | {abc,def,"ghi jkl"} |
+----+---------------------+
(2 rows)

Again, not a surprise.  What if I search for ‘ghi’?

[local]/reuven=# select * from foo where 'ghi' = ANY(stuff);
+----+-------+
| id | stuff |
+----+-------+
+----+-------+
(0 rows)

This is important to remember: ANY looks at the elements of the array, and no further. The string ‘ghi’ was compared with the elements of “stuff”, which in the case of row ID 10 were the three strings ‘abc’, ‘def’, and ‘ghi jkl’. PostgreSQL compared ‘ghi’ with each of these, found them all to be unequal, and thus returned 0 rows. Remember this when you’re searching through a text[] array column; you can look for complete elements of the array, but no further.

What I personally find confusing, and easy to forget, is that ANY must be preceded by a comparison operator, such as =. Moreover, ANY and the array go on the right-hand side of that operator; the syntax simply doesn’t allow them on the left. Thus, the following will not work:

[local]/reuven=# select * from foo where ANY(stuff) = 'abc';
ERROR: 42601: syntax error at or near "ANY"
LINE 1: select * from foo where ANY(stuff) = 'abc';
                                ^
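
As an aside (not something the queries above require), PostgreSQL’s array containment operator @> offers another way to express the same test, without ANY:

select * from foo where stuff @> ARRAY['abc'];

This returns every row whose “stuff” array contains the element ‘abc’, just like the = ANY version above.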

I’ve never had any occasion to use it, but you should know that PostgreSQL also offers an ALL operator.  It works the same way as ANY, except that it only returns a TRUE value if all of the elements in the array are equal to the value on the left-hand side.  For example:

[local]/reuven=# select * from foo where 'abc' = ALL(stuff);
+----+-------+
| id | stuff |
+----+-------+
| 8  | {abc} |
+----+-------+
(1 row)

In this example, we only get one row back, because the stipulation is that all of the array elements need to be equal to ‘abc’. I’ve got a few more posts planned in this series about PostgreSQL arrays.  If there is something about this subject that you have always wanted to know, please contact me, and I’ll do what I can to answer.

PostgreSQL array indexes and length

In my last blog post, I introduced the idea of a PostgreSQL array, and showed how we can insert data into a table using either the curly-brace {} syntax, or the ARRAY construction syntax. In this post, I want to talk about PostgreSQL array indexes and lengths — what happens when we retrieve elements at indexes that do (and don’t) exist, how we can construct multidimensional arrays, and how we can ask PostgreSQL for the length of any dimension of a multidimensional array.

First of all, let’s talk about how you can retrieve data from arrays. If I still have my three-row table from last time:

[local]/reuven=# select id, stuff from foo;
 ┌────┬─────────────────────┐
 │ id │ stuff               │
 ├────┼─────────────────────┤
 │ 8  │ {abc}               │
 │ 9  │ {abc,def}           │
 │ 10 │ {abc,def,"ghi jkl"} │
 └────┴─────────────────────┘

What if I want to get just the first value back from the “stuff” column? Well, then I have to take the first element of the array. Most modern languages number array elements starting at index 0; PostgreSQL, by contrast, starts counting at 1. I can thus ask for just the first element of the “stuff” array from each row with:

[local]/reuven=# select id, stuff[1] from foo;
 ┌────┬───────┐
 │ id │ stuff │
 ├────┼───────┤
 │ 8  │ abc   │
 │ 9  │ abc   │
 │ 10 │ abc   │
 └────┴───────┘

If we ask for an index that doesn’t exist, we get a NULL value back:

[local]/reuven=# select id, stuff[3] from foo;
 ┌────┬─────────┐
 │ id │ stuff   │
 ├────┼─────────┤
 │ 8  │ [null]  │
 │ 9  │ [null]  │
 │ 10 │ ghi jkl │
 └────┴─────────┘

Note that I have configured my PostgreSQL client to show NULL values as “[null]”, by putting the following line in my ~/.psqlrc file:

\pset null [null]

Without the above line in your .psqlrc (or running that command in psql yourself manually), you might see blank space for row IDs 8 and 9.
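
As a related aside, PostgreSQL also supports array slices, using a [lower:upper] range inside the brackets; here is a quick sketch against the same table:

select id, stuff[1:2] from foo;   -- up to the first two elements of each array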

Now, it’s pretty rare for me to pull out a particular value from a PostgreSQL array.  Instead, I’m often finding out the lengths of the arrays on which I’m working.  I can do this with the array_length function:

[local]/reuven=# select id, stuff, array_length(stuff, 1) from foo;
 ┌────┬─────────────────────┬──────────────┐
 │ id │ stuff               │ array_length │
 ├────┼─────────────────────┼──────────────┤
 │ 8  │ {abc}               │ 1            │
 │ 9  │ {abc,def}           │ 2            │
 │ 10 │ {abc,def,"ghi jkl"} │ 3            │
 └────┴─────────────────────┴──────────────┘

Notice that array_length is a function that takes two parameters: an array and an integer. The array is what we want to measure, and the integer indicates which of the array’s dimensions should be measured.

If you’re like me, and come from dynamic languages like Ruby and Python in which arrays (or lists) can be of any length, then you should realize that PostgreSQL arrays can be multidimensional, but each inner array must be of the same length. So, for example, I can create a 2×3 array of integers with:

[local]/reuven=# select ARRAY[ARRAY[1,1,1], ARRAY[2,2,2]];
┌───────────────────┐
│ array             │
├───────────────────┤
│ {{1,1,1},{2,2,2}} │
└───────────────────┘

Trying to have a different inner dimension will not work:

[local]/reuven=# select ARRAY[ARRAY[1,1,1], ARRAY[2,2,2,2]];
ERROR: 2202E: multidimensional arrays must have array expressions with matching dimensions

Assuming that I have a legitimate array, I can get its length:

[local]/reuven=# select array_length(ARRAY[ARRAY[1,1,1], ARRAY[2,2,2]], 1);
┌──────────────┐
│ array_length │
├──────────────┤
│ 2            │
└──────────────┘

Or I can get the length of the inner dimension:

[local]/reuven=# select array_length(ARRAY[ARRAY[1,1,1], ARRAY[2,2,2]], 2);
┌──────────────┐
│ array_length │
├──────────────┤
│ 3            │
└──────────────┘

So, when retrieving our rows from the “foo” table:

[local]/reuven=# select id, stuff, array_length(stuff, 1) from foo;
┌────┬─────────────────────┬──────────────┐
│ id │ stuff               │ array_length │
├────┼─────────────────────┼──────────────┤
│ 8  │ {abc}               │       1      │
│ 9  │ {abc,def}           │       2      │
│ 10 │ {abc,def,"ghi jkl"} │       3      │
└────┴─────────────────────┴──────────────┘

I can get the array length in a separate column, as in this example.  Or I can even sort in descending order of the array length:

[local]/reuven=# select id, stuff, array_length(stuff, 1) from foo order by array_length(stuff, 1) desc;
┌────┬─────────────────────┬──────────────┐
│ id │ stuff               │ array_length │
├────┼─────────────────────┼──────────────┤
│ 10 │ {abc,def,"ghi jkl"} │     3        │
│ 9  │ {abc,def}           │     2        │
│ 8  │ {abc}               │     1        │
└────┴─────────────────────┴──────────────┘

Notice that our ORDER BY clause has to repeat the function that we used to create the third column.  Another way to do this is to declare an alias for the output of array_length, and then use the alias in ORDER BY:

select id, stuff, array_length(stuff, 1) len from foo order by len desc;
┌────┬─────────────────────┬─────┐
│ id │ stuff               │ len │
├────┼─────────────────────┼─────┤
│ 10 │ {abc,def,"ghi jkl"} │ 3   │
│ 9  │ {abc,def}           │ 2   │
│ 8  │ {abc}               │ 1   │
└────┴─────────────────────┴─────┘

Next time, we’ll look at how we can manipulate arrays using array functions.

Learning to love PostgreSQL arrays

I’ll admit it: When arrays were added to PostgreSQL a number of years ago, I thought that this was a really bad idea.  I’m a firm believer in normalization when it comes to database design and storage; and the idea of putting multiple values inside of a single column struck me as particularly foolish.  Besides, my impression was that PostgreSQL arrays were clumsy to work with, and didn’t really add much to my data model.

Of course, it turns out that arrays are extremely useful in PostgreSQL.  I still cringe when people want to use them for general-purpose storage, instead of working to normalize their database design.  But over the last few months, as I’ve been doing all sorts of complex PostgreSQL queries for my PhD dissertation, I’ve found that PostgreSQL arrays are extremely useful when it comes to aggregating and reporting data.

I’ve thus decided to dedicate a number of blog posts to PostgreSQL arrays: How to create them, use them, manipulate them, and decide when to use them.

Let’s start with the very basics; over the next few blog posts, I’ll try to show how arrays can be interesting — and even useful, and fit into more complex queries and needs.

You can create an array of just about any data type in PostgreSQL.  As the documentation says, “Arrays of any built-in or user-defined base type, enum type, or composite type can be created. Arrays of domains are not yet supported.” This means that you can create arrays of just about any data type you want: Integers, text, enums, other arrays (for multidimensional arrays), or even user-defined types.  To date, I have generally created arrays of integers and text, but that might not be representative of your use case.

To create a table with a text array in one column, just add square brackets ([]) after the type:

CREATE TABLE Foo (
id SERIAL NOT NULL,
stuff TEXT[],
PRIMARY KEY(id)
);

When I then ask for the definition of my “Foo” table, I see the following:

[local]/reuven=# \d foo
 Table "public.foo"
┌────────┬─────────┬──────────────────────────────────────────────────┐
│ Column │ Type    │ Modifiers                                        │
├────────┼─────────┼──────────────────────────────────────────────────┤
│ id     │ integer │ not null default nextval('foo_id_seq'::regclass) │
│ stuff  │ text[]  │                                                  │
└────────┴─────────┴──────────────────────────────────────────────────┘
Indexes:
 "foo_pkey" PRIMARY KEY, btree (id)

Notice that the type of the “stuff” column is indeed recorded as “text[]”, showing that it’s an array.  If we try to insert a plain-text value into that column, PostgreSQL will complain:

[local]/reuven=# insert into foo (stuff) values ('abc');
ERROR: 22P02: array value must start with "{" or dimension information
LINE 1: insert into foo (stuff) values ('abc');
                                        ^

One of the many things that I love about PostgreSQL is the attention to detail in the error messages.  Not only does it tell us that the table is expecting an array value, but that the array must begin with a { character.  It also shows us, using a ^ character, where the parser had problems.  That’s not always a perfect indicator of where the problem lies, but it’s a great start.

If I want to insert an array value into my table, I can thus use the literal array syntax that PostgreSQL provides, with (as indicated above) curly braces:

[local]/reuven=# insert into foo (stuff) values ('{abc}');
INSERT 0 1
[local]/reuven=# insert into foo (stuff) values ('{abc,def}');
INSERT 0 1
[local]/reuven=# insert into foo (stuff) values ('{abc,def,ghi jkl}');
INSERT 0 1

The above commands insert three rows into our table.  In all three cases, we are inserting array values into our table.  Notice that in all cases, the array is inserted as a string, surrounded by single quote marks.  Thus, ‘{abc}’ becomes a one-element array, and ‘{abc,def}’ becomes a two-element array.

What happens when there is a space character inside of the text?  PostgreSQL automatically quotes the value (with double quotes — be careful!).  What happens if you want a comma or single quote as part of the text?  Then things get even uglier.
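
Here is a quick sketch of what that quoting looks like, with made-up values purely for illustration:

select '{"ghi jkl","he said \"hi\"","it''s"}'::text[];

The element containing a space must be double-quoted, the embedded double quotes are backslash-escaped, and the single quote is doubled for the sake of the surrounding SQL string literal.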

A nice solution, and a better way (I believe) to insert arrays in any event, is to use the built-in ARRAY constructor syntax.  Then you don’t have to worry about such things.  For example, I can rewrite all of the above INSERT commands in what I believe to be a much nicer way:

[local]/reuven=# insert into foo (stuff) values (ARRAY['abc']);
INSERT 0 1

[local]/reuven=# insert into foo (stuff) values (ARRAY['abc', 'def']);
INSERT 0 1

[local]/reuven=# insert into foo (stuff) values (ARRAY['abc', 'def', 'ghi jkl']);
INSERT 0 1

[local]/reuven=# select * from foo;
┌────┬─────────────────────┐
│ id │ stuff               │
├────┼─────────────────────┤
│ 8  │ {abc}               │
│ 9  │ {abc,def}           │
│ 10 │ {abc,def,"ghi jkl"} │
└────┴─────────────────────┘
(3 rows)

The same data was inserted into the table, but with less hassle than before.

Now, just because we can insert arrays directly into our tables doesn’t necessarily mean that we should do so.  You’ll see, over the course of this series, that I view arrays as a great way to aggregate and analyze existing data, particularly within the context of a view or a CTE.  So please don’t be tempted to start stuffing all of the data you want and need into a single column; normalization is still a good idea, and arrays can be tempting.  However, being familiar with the basics of defining and inserting array data into the database is quite useful, and will serve us well throughout the rest of this series.


Don’t use double quotes in PostgreSQL

I spend a large proportion of my time teaching classes in a variety of open-source technologies — specifically, Ruby, Python, PostgreSQL, and Git.  One of the questions that invariably arises in these classes has to do with the case sensitivity of the technology in question. That is, is the variable “x” the same as the variable “X”?

In nearly every case, the technologies with which I work are case sensitive, meaning that “x” and “X” are considered two completely different identifiers. Indeed, the Ruby language goes so far as to give capitalized identifiers a special status, calling them “constants.” (They’re not really constants, in that you can always redefine a Ruby constant; however, you will get a warning when you reassign one. For this reason, I prefer to call them “stubborns,” so that people don’t get the wrong idea.)

SQL is a completely different story, however: The SQL standard states that SQL queries and identifiers (e.g., table names) aren’t case sensitive.  Thus, there’s no difference between

select id, email from people;

and

SELECT ID, EMAIL FROM PEOPLE;

I find both of these styles to be somewhat unreadable, and over the years have generally followed Joe Celko’s advice for capitalization in SQL queries:

  1. SQL keywords are in ALL CAPS,
  2. Table names have Initial Caps, and
  3. Column names are all in lowercase.

Given that rule, the above query would look like this:

SELECT id, email FROM People;

Again, this capitalization scheme is completely ignored by PostgreSQL.  It’s all for our benefit, as developers, who want to be able to read our code down the road.

Actually, that’s not entirely true: PostgreSQL doesn’t exactly ignore the case, but rather forces all of these names to be lowercase. So if you say

CREATE TABLE People (
  id SERIAL NOT NULL,
  email TEXT NOT NULL,
  PRIMARY KEY(id)
);

PostgreSQL will create a table named “people”, all in lowercase. But because of the way PostgreSQL works, forcing all names to lowercase, I can still say:

SELECT * FROM People;

And it will work just fine.

Now, there is a way around this, namely by using double quotes.  Whereas single quotes in PostgreSQL are used to create a text string, double quotes are used to name an identifier without changing its case.

Let me say that again, because so many people get this wrong: Single quotes and double quotes in PostgreSQL have completely different jobs, and return completely different data types.  Single quotes return text strings.  Double quotes return (if you can really think of them as “returning” anything) identifiers, but with the case preserved.
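
Here is a minimal sketch of the difference, using the people table created above:

SELECT 'email' FROM people;     -- a constant text string, returned once per row
SELECT "email" FROM people;     -- the email column, named as a case-preserved identifier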

Thus, if I were to repeat the above table-creation query, but use double quotes:

CREATE TABLE "People" (
 id SERIAL NOT NULL,
 email TEXT NOT NULL,
 PRIMARY KEY(id)
);

I have now created a table in which the table name has not been forced to lowercase, but which has preserved the capital P.  This means that the following query will now fail:

select * from people;

ERROR: relation "people" does not exist
LINE 1: select * from people;
                      ^

It fails because I have created a table “People”, but I have told PostgreSQL to look for a table “people”.  Confusing?  Absolutely.  If you use double quotes on the name of a table, column, index, or other object when you create it, and if there is even one capital letter in that identifier, you will need to use double quotes every single time you use it.  That’s frustrating for everyone involved — it means that we can’t use the nice capitalization rules that I mentioned earlier, and that various queries will suddenly fail to work.

The bottom line, then, is to avoid using double quotes when creating anything.  Actually, you should avoid double quotes when retrieving things as well — otherwise, you might discover that you’re trying to retrieve a column that PostgreSQL doesn’t believe exists.

Now, let’s say that you like this advice, and you try to take it to heart.  Unfortunately, there are places where you still might get bitten, despite your best efforts.

For example, the GUI tool for PostgreSQL administration, PGAdmin 3, is used by many people.  (I’m an old-school Unix guy, and thus prefer the textual “psql” client.)  I’ve discovered over the years that while PGAdmin might be a useful and friendly way to manage your databases, it also automatically uses double quotes when creating tables.  This means that if you create a table with PGAdmin, you might find yourself struggling to find or query it afterwards.

Another source of frustration is the Active Record ORM (object-relational mapper), most commonly used in Ruby on Rails. Perhaps because Active Record was developed by users of MySQL, whose table and column names are case-sensitive by default, Active Record automatically puts double quotes around all table and column names in queries. This can lead to frustrating incompatibilities — for example, if you want to refer to a column with CamelCase in your Ruby code, but to have it treated case-insensitively in the database.

PostgreSQL is a fabulous database, and has all sorts of great capabilities.  Unless you really want your identifiers to be case-sensitive, though, I strongly suggest that you avoid using double quotes.  And if you encounter problems working with columns, check the database logs to see whether the queries are being sent using double quotes.  You might be surprised, and manage to save yourself quite a bit of debugging time.