Another free regexp Q&A webinar!

The last Webinar I did, with Q&A about regular expressions, was great fun — so much so that I’ve decided to do another one.

So, if you have questions (big or little) about regular expressions in Python, Ruby, JavaScript, and/or PostgreSQL, sign up for this free Webinar on Monday, April 11th: https://www.crowdcast.io/e/regexpqa2

If you already have questions, you can leave them in advance using the Crowdcast Q&A system.  (Or just surprise me during the Webinar itself.)

I look forward to seeing you there!

Free Webinar: Regexp Q&A

To celebrate the publication of my new ebook, Practice Makes Regexp, my upcoming Webinar (on March 22nd) is all about regular expressions (“regexps”) in Python, Ruby, JavaScript, and PostgreSQL, as well as the Unix “grep” command.

Unlike previous Webinars, in which I gave a presentation and then took Q&A, this time will be all about Q&A: I want you to come with your questions about regular expressions, or even projects that you’re wondering how to attack using them.

I’ll do my best to answer your questions, whether they’re about regexp syntax, differences between implementations and languages, how to debug hairy regexps, or even when regexps might not be the most appropriate tool for the job.

Please join me on March 22nd by signing up here:

http://ccst.io/e/regexpqa

And when you sign up, please don’t forget to ask a question or two!  (You can do that in advance — and doing so will really help me to prepare detailed answers.)

I look forward to your questions on the 22nd!

Reuven

Yes, you can master regular expressions!

Announcing: My new book, “Practice Makes Regexp,” with 50 exercises meant to help you learn and master regular expressions. With explanations and code in Python, Ruby, JavaScript, and PostgreSQL.

I spend most of my time nowadays going to high-tech companies and training programmers in new languages and techniques. Actually, many of the things I teach them aren’t really new; rather, they’re new to the participants in my training. Python has been around for 25 years, but for my students, it’s new, and even a bit exciting.

I tell participants that my job is to add tools to their programming toolbox, so that if they encounter a new problem, they’ll have new and more appropriate or elegant ways to attack and solve it. Moreover, I tell them, once you are intimately familiar with a tool or technique, you’ll suddenly discover opportunities to use it.

Earlier this week, I was speaking with one of my consulting clients, who was worried that some potentially sensitive information had been stored in their Web application’s logfiles — and they weren’t sure if they had a good way to search through the logs.

I suggested the first solution that came to mind: Regular expressions.

Regular expressions are a lifesaver for anyone who works with text.  We can use them to search for patterns in files, in network data, and in databases. We can use them to search and replace.  To handle protocols that have changed ever so slightly from version to version. To handle human input, which is always messier than what we get from other computers.

Regular expressions are one of the most critical tools I have in my programming toolbox.  I use them at least a few times each day, and sometimes even dozens of times in a given day.

So, why don’t all developers know and use regular expressions? Quite simply, because the learning curve is so steep. Regexps, as they’re also known, are terse and cryptic. Changing one character can have a profound impact on what text a regexp matches, as well as its performance. Knowing which character to insert where, and how to build up your regexps, is a skill that takes time to learn and hone.

Many developers say, “If I have a problem that involves regular expressions, I’ll just go to Stack Overflow, where my problem has likely been addressed already.” And in many cases, they’re right.

But by that logic, I shouldn’t learn any French before I go to France, because I can always use a phrasebook.  Sure, I could work that way — but it’s far less efficient, and I’ll miss many opportunities that would come my way if I knew French.

Moreover, relying on Stack Overflow means that you never get a full picture of what you can really do with regular expressions. You get specific answers, but you don’t have a fully formed mental model of what they are and how they work.

But wait, it gets worse: If you’re under the gun, trying to get something done for your manager or a big client, you can’t spend time searching through Stack Overflow. You need to bring your best game to the table, demonstrating fluency in regular expressions.  Without that fluency, you’ll take longer to solve the problem — and possibly, not manage to solve it at all.

Believe me, I understand — my first attempt at learning regular expressions was a complete failure. I read about them in the Emacs manual, and thought to myself, “What could this seemingly random collection of characters really do for me?”  I ignored them for a few more years, until I started to program in Perl — a language that more or less expected you to use regexps.

So I spent some time learning regexp syntax.  The more I used regexps, the more opportunities I found to use them.  And the more I found that they made my life easier, better, and more convenient.  I was able to solve problems that others couldn’t — or could only solve far more slowly than I could.  Suddenly, processing text was a breeze.

I was so excited by what I had learned that when I started to teach advanced programming courses, I added regexps to the syllabus.  I assumed that I could find a way to make regexps understandable in an hour or two.

But boy, was I wrong: If there’s something that’s hard for programmers to learn, it’s regular expressions.  I’ve thus created a two-day course for people who want to learn regular expressions.  I not only introduce the syntax, but I have them practice, practice, and practice some more.  I give them situations and tasks, and their job is to come up with a regexp that will solve the problem I’ve given them.  We discuss different solutions, and the way that different languages might go about solving the problem.

After lots of practice, my students not only know regexp syntax — they know when to use it, and how to use it.  They’re more efficient and valuable employees. They become the person to whom people can turn with tricky text-processing problems.  And when the boss is pressuring them for a quick solution, they can deliver one.

And so, I’m delighted to announce the launch of my second ebook, “Practice Makes Regexp.”  This book contains 50 tasks for you to accomplish using regular expressions.  Once you have solved a problem, I present the solution, first walking you through the general regexp approach, and then going into greater depth (with code) to solve the problem in Python, Ruby, JavaScript, and PostgreSQL.  My assumption in the book is that you have already learned regexps elsewhere, but that you’re not quite sure when to use them, how to apply them, and when each metacharacter is most appropriate.

After you go through all 50 exercises, I’m sure that you’ll be a master of regular expressions.  It’ll be tough going, but the point is to sweat a bit working on the exercises, so that you can worry a lot less when you’re at work. I call this “controlled frustration” — better to get frustrated working on exercises, than when the boss is demanding that you get something done right away.

Right now, the book is more than 150 pages long, with four complete chapters (including 17 exercises).  Within two weeks, the remaining 33 exercises will be done.  And then I’ll start work on 50 screencasts, one for each of the exercises, in which I walk you through solutions in each of Python, Ruby, JavaScript, and PostgreSQL.  If my previous ebook is any guide, there will be about 5 hours (!) of screencasts when I’m all done.

If you have always shied away from learning regular expressions, or want to harness their power, Practice Makes Regexp is what you have been looking for.  It’s not a tutorial, but it will help you to understand and internalize regexps, helping you to master a technology that frustrates many people.

To celebrate this launch, I’m offering a discount of 10%.  Just use the “regexplaunch” offer code, and take 10% off of any of the packages — the book, the developer package (which includes the solutions in separate program files, as well as the 300+ slides from the two-day regexp course I give at Fortune 100 companies), or the consultant package (which includes the screencasts, as well as what’s in the developer package).

I’m very excited by this book.  I think that it’ll really help a lot of people to understand and use regular expressions.  And I hope that you’ll find it makes you a more valuable programmer, with an especially useful tool in your toolbox.

Using regexps in PostgreSQL

After months of writing, editing, and procrastinating, my new ebook, “Practice Makes Regexp,” is almost ready.  The book (similar to my earlier ebook, “Practice Makes Python”) contains 50 exercises to improve your fluency with regular expressions (“regexps”), with solutions in Python, Ruby, JavaScript, and PostgreSQL.

When I tell people this, they often say, “PostgreSQL?  Really?!?”  Many are surprised to hear that PostgreSQL supports regexps at all.  Others, once they take a look, are surprised by how powerful the engine is.  And even more are surprised by the variety of ways in which they can use regexps from within PostgreSQL.

I’m thus presenting an excerpt from the book, providing an overview of  PostgreSQL’s regexp operators and functions. I’ve used these many times over the years, and it’s quite possible that you’ll also find them to be of assistance when writing queries.

PostgreSQL

PostgreSQL isn’t a language per se, but rather a relational database system. That said, PostgreSQL includes a powerful regexp engine.  It can be used to test which rows match certain criteria, but it can also be used to retrieve selected text from columns inside of a table.  Regexps in PostgreSQL are a hidden gem, one which many people don’t even know exists, but which can be extremely useful.

Defining regexps

Regexps in PostgreSQL are defined using strings.  Thus, you will create a string (using single quotes, as you always should for strings in PostgreSQL; double quotes are reserved for identifiers, such as table and column names), and then match it against another string. If there is a match, PostgreSQL returns “true.”

PostgreSQL’s regexp syntax is similar to that of Python and Ruby, in that you use backslashes to neutralize metacharacters. Thus, + is a metacharacter in PostgreSQL, whereas \+ is a plain “plus” character. However, there are differences in the regexp syntax: for example, PostgreSQL’s word-boundary metacharacter is \y, whereas in Python and Ruby it is \b.  (This was likely done to avoid conflicts with the ASCII backspace character.)
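
For example, here is a minimal illustration of \y (assuming standard_conforming_strings is enabled, as it is by default in modern PostgreSQL, so that the backslashes reach the regexp engine unchanged):

select 'one two three' ~ '\ytwo\y';   -- returns "true"
select 'one twofold' ~ '\ytwo\y';     -- returns "false", since there is no word boundary after "two"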

Where things are truly different in PostgreSQL’s implementation is the set of operators and functions used to work with regexps. PostgreSQL’s operators are generally aimed at finding whether a particular regexp matches text, in order to include or exclude result rows from an SQL query.  By contrast, the regexp functions are meant to retrieve some or all of a string from a column’s text value.

True/false operators

PostgreSQL comes with four regexp operators. In each case, the text string to be matched should be on the left, and the regexp should be on the right.  All of these operators return true or false:

  • ~  case-sensitive match
  • ~*  case-insensitive match
  • !~  case-sensitive non-match
  • !~* case-insensitive non-match

Thus, you can say:

select 'abc' ~ 'a.c';   -- returns "true"
select 'abc' ~ 'A.C';   -- returns "false"
select 'abc' ~* 'A.C';  -- returns "true"

In addition to the standard character classes, we can also use POSIX-style character classes:

select 'abc' ~* '^[[:xdigit:]]$';    -- returns "false"
select 'abc' ~* '^[[:xdigit:]]+$';   -- returns "true"
select 'abcq' ~* '^[[:xdigit:]]+$';  -- returns "false"

These operators, as mentioned above, are often used to include or exclude rows in a query’s WHERE clause:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC'), ('abc'), ('AbC'), ('Abq'), ('ABCq');
SELECT id, thing FROM Stuff WHERE thing ~* '^[abc]{3}$';

This final query should return three rows, those in which thing is equal to abc, AbC, and ABC.
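
Assuming a freshly created table (so that the SERIAL id values start at 1), the output should look something like this (the row order isn't guaranteed without an ORDER BY):

 id | thing
----+-------
  1 | ABC
  2 | abc
  3 | AbC
(3 rows)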

Extracting text

If you’re interested in the text that was actually matched, then you’ll need to use one of the built-in regexp functions that PostgreSQL provides. For example, the regexp_matches function allows us not only to determine whether a regexp matches some text, but also to get the text that was matched.  For each match, regexp_matches returns an array of text (even if that array contains a single element).  For example:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC'), ('abc'), ('AbC'), ('Abq'), ('ABCq');
SELECT regexp_matches(thing, '^[abc]{3}$') FROM Stuff;

The above will return a single row:

{abc}

As you can see, the above returned only a single column (from the function) and a single row (the one containing abc, the only value that matches our case-sensitive regexp).  That’s the default behavior; when you invoke regexp_matches, you can provide additional flags that modify the way in which it operates. These flags are similar to those used in Python, Ruby, and JavaScript.

For example, we can use the i flag to make regexp_matches case-insensitive:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC'), ('abc'), ('AbC'), ('Abq'), ('ABCq');
SELECT regexp_matches(thing, '^[abc]{3}$', 'i') FROM Stuff;

Now we’ll get three rows back, since we have made the match case-insensitive.  regexp_matches can take several other flags as well, including g (for a global search). For example:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC');
SELECT regexp_matches(thing, '.', 'g') FROM Stuff;

Here is the output from regexp_matches:

{A} 
{B} 
{C}

Notice how regexp_matches, because of the g option, returned three rows, with each row containing a single (one-character) array. This indicates that there were three matches.

Why is each returned row an array, rather than a string? Because if we use groups to capture parts of the text, the array will contain the groups:

CREATE TABLE Stuff (id SERIAL, thing TEXT);
INSERT INTO Stuff (thing) VALUES ('ABC'), ('AqC');
SELECT regexp_matches(thing, '^(A)(..)$', 'ig') FROM Stuff;

Notice that in the above example, I combined the i and g flags, passing them in a single string.  The result is a set of arrays:

| regexp_matches |
|----------------|
| {A,BC}         |
| {A,qC}         |

Splitting

A common function in many high-level languages is split, which takes a string and returns an array of items. PostgreSQL offers something similar with its split_part function, but that splits only on a fixed delimiter string.
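
For example, split_part takes a string, a fixed delimiter, and the (1-based) number of the field you want:

select split_part('abc def ghi', ' ', 2);   -- returns "def"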

However, PostgreSQL also offers two other functions: regexp_split_to_array and regexp_split_to_table. These allow us to split a text string using a regexp, rather than a fixed string.  For example, if we say:

select regexp_split_to_array('abc def   ghi   jkl', '\s+');

The above will treat any run of whitespace, of any length, as the delimiter, and will use it to split the string.  But you can use any regexp you want to split things, getting an array back.
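
In this case, the result is a single row containing a four-element array:

{abc,def,ghi,jkl}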

A similar function is regexp_split_to_table, which returns not a single row containing an array, but rather one row for each element. Repeating the above example:

select regexp_split_to_table('abc def   ghi   jkl', '\s+');

The above would return a table of four rows, with each split text string in its own row.
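
With the same input, we thus get four rows:

abc
def
ghi
jkl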

Substituting text

The regexp_replace function allows us to create a new text string based on an old one.  For example:

SELECT regexp_replace('The quick brown fox jumped over the lazy dog',
                      '[aeiou]', '_');

The above returns:

Th_ quick brown fox jumped over the lazy dog

Why was only the first vowel replaced? Because we didn’t invoke regexp_replace with the g option, making it global:

SELECT regexp_replace('The quick brown fox jumped over the lazy dog',
                      '[aeiou]', '_', 'g');

Now all occurrences are replaced:

Th_ q__ck br_wn f_x j_mp_d _v_r th_ l_zy d_g

All 50 “Practice Makes Python” screencasts are complete!

I’m delighted to announce that I’ve completed a screencast for every single one of the 50 exercises in my ebook, “Practice Makes Python.”  This is more than 300 minutes (5 hours!) of Python instruction, helping you to become a more expert Python programmer.

Each screencast consists of me solving one of the exercises in real time, describing what I’m doing and why I’m doing it.  They range in length from 4 to 10 minutes.  The idea is that you’ll do the exercise, and then watch my video to compare your answer (and approach) with mine.

If you enjoy my Webinars or in-person courses, then I think you’ll also enjoy these videos.

The screencasts, available with the two higher-tier “Practice Makes Python” packages,  can be streamed in HD video quality, or can be downloaded (DRM-free) to your computer for more convenient viewing.

To celebrate finally finishing these videos, I’m offering the two higher-end packages at 20% off for the coming week, until February 18th. Just use the offer code “videodone” with either the “consultant” or “developer” package, and enjoy a huge amount of Python video.

You can explore these packages at the “Practice Makes Python” Web site.

Not interested in my book, but still want to improve your Python skills?  You can always take one of my two free e-mail courses, on Python variable scoping and working with files. Those are and will remain free forever. And of course, there’s my free Webinar on Python and data science next week.

Free Webinar: Pandas and Matplotlib

It’s time for another free hour-long Webinar! This time, I’ll be talking about the increasingly popular tools for data science in Python, namely Pandas and Matplotlib. How can you read data into Pandas, manipulate it, and then plot it? I’ll show you a large number of examples and use cases, and we’ll also have lots of time for Q&A. Previous Webinars have been lots of fun, and I expect that this one will be, too!

Register (for free) to participate here:

https://www.eventbrite.com/e/analzying-and-viewing-data-with-pandas-and-matplotlib-tickets-21198157259

If you aren’t sure whether you’ll be able to make it, you can still sign up; I’ll be sending out information, including a URL with the recording, soon after the Webinar concludes.

I look forward to seeing you there; if you have any questions, please feel free to contact me at reuven@lerner.co.il or on Twitter as @reuvenmlerner.

My interview with the Freelance Transformation podcast

My interview with Matt Inglot, on the Freelance Transformation podcast, about how technical training is a great business for many consultants, is live!  If you’ve ever wanted to know about technical training, how that business works, how you can get started, or some of my strategies for working with clients, then this is your chance:

http://www.freelancetransformation.com/blog/how-to-leverage-your-technical-skills-to-sell-corporate-training-with-reuven-lerner

If you’re at all interested in technical training, then you should take my free, five-part e-mail course on the subject, which will help you to teach better.

Reminder: Free Webinar on data science in Python

There’s still time to register for my free, one-hour Webinar on data science in Python, which will be tomorrow (Tuesday).  There’s clearly too much material for me to give just one Webinar, so this will be the first in a series that I’ll be offering over the coming months.  But if you’re interested in hearing how Python fits into the world of data science, or how you can use free, open-source tools to do lots of great analysis work, then I invite you to join me for what should be a fun time:

https://www.eventbrite.com/e/data-analysis-with-python-tickets-19543502141

There will be plenty of live-coding demos, bad jokes, and chances for you to ask questions. And it should be lots of fun, besides!

Free Webinar: Data science with Python on December 8th

It’s time for me to do another free one-hour Webinar, this time about data science with Python. It’ll be on December 8th, at 9 p.m. GMT.

Data science is all the rage, and rightly so — and Python is one of the best-known and best-equipped languages in which to do it.  In this Webinar, I’ll review some of the most popular packages used for analysis, including NumPy, SciPy, Pandas, and matplotlib, and will show how they can be used to answer questions that we have about our data.

As always, I hope that there will be lots of questions — and if we’re lucky, I’ll be able to provide answers, too!  Please come prepared for a highly interactive and fun event.  I’ll e-mail all registered participants about an hour before the Webinar with links for participating.

You can register at Eventbrite: https://www.eventbrite.com/e/data-analysis-with-python-tickets-19543502141

I look forward to seeing you there!  If you have any questions, please let me know via e-mail at reuven@lerner.co.il or on Twitter as @reuvenmlerner.

Shark Tank increased our traffic by 1000x. Here’s how we handled it.

More than five years ago, I started working with a small company with a simple idea. Rent Like a Champion was founded by alumni of the University of Notre Dame, which has a very strong culture of college football. (Note to my non-American readers: We’re talking here about American football, rather than “real” football.)  However, South Bend, Indiana doesn’t have hotels close to the stadium, meaning that people don’t have a nearby, convenient place to stay when they come into town. RLAC thus offers homeowners the chance to rent out their homes to visitors on football weekends. Homeowners make some extra money, and football fans get to see the games they want while staying nearby.

I took over for a previous programmer, improving the site’s management and e-commerce features. RLAC has since grown massively, and my role has grown along with it. (And, I hasten to add, I’m no longer doing day-to-day coding on the site; my employee Genadi is doing a fantastic job of that, leaving me to handle more architectural, strategic, and managerial issues on the technology side.) There are still vestiges of the code that I inherited in 2010, but a great deal has changed and improved since then. We are using the latest versions of Ruby, Ruby on Rails, and MySQL; we have both staging and production servers, and a large number of automated tests. The site rarely goes down or is overwhelmed by traffic.

On the business side, we’re now on several dozen US college campuses, handle incoming and outgoing payments automatically, and even allow people to negotiate on the dates and prices of where they’re staying. Really, I couldn’t be prouder to be associated with this company.

A few months ago, we got some big news: Rent Like a Champion was going to be on Shark Tank, a reality TV show that allows you to pitch your company to investors. We didn’t know what this would mean for the company’s future, but we did know that it would mean lots of additional traffic. The numbers we heard were in the area of 10-20 thousand requests per second. This, from a site that had only 10-20 requests per second on our busiest days. Meaning, we had to figure out a reasonable way to scale our site up so that it could handle about 1,000 times the traffic we were used to seeing.

Now, Rails has a reputation for not being scalable. But it’s clear that Rails can scale, given enough computers. When someone says that it’s “not scalable,” what they probably mean is that given a certain amount of traffic, Rails requires more servers than other languages and frameworks. And yes, that’s probably true. But we clearly weren’t going to change our entire technology stack just for a few hours of television time.

We thus took a multi-pronged approach, working on every part of the site — from the infrastructure, to the application, to the servers we were using.

The bottom line, by the way, was that we more than survived the ordeal: Not only did we handle about 8,000 requests per second for an extended period of time, but our servers were barely breaking a sweat. And as if that weren’t enough, we got investments from both Mark Cuban and Chris Sacca, two well-known billionaire investors.

So, what did we do to scale up? And what does this mean for projects you’re doing?

No state on the server

First and foremost, modern Web applications can be inherently scalable, if you design them correctly. Rails, like many other modern frameworks, assumes a “zero state” situation on the Web application server. This means that no user-related state is stored on the server itself. Instead, we store all such state in the database. This means that we can add as many servers as we want, because the actual data won’t be stored on the Web server.

One of the potentially tricky parts has to do with user sessions. Every modern Web framework offers developers the chance to use sessions; the user’s cookie contains an ID (typically encrypted) that allows us to look on disk, in memory, or in a database for the user’s session information.

If the session information is stored in a disk file, then you can effectively use only a single Web server. That’s because if the same user’s requests are handled by several servers (which happens all of the time if you have several servers behind a load balancer), but their session file exists on only one of them, they’ll effectively be logged out whenever they reach a server that doesn’t have it. You can avoid this by storing session information in the database, but then every request requires an extra database lookup, which slows down the application.

We decided to use cookie-based session information, in which the cookie itself contains the user’s session info. That means we don’t have to worry about how many servers we’re using, because the session info will be available to all of them. If you only store a user ID in the session, then the cookie will be particularly tiny (albeit encrypted), and thus you don’t have to worry about the size.

One part of our scaling was thus to ensure that users could move freely from one server to another. We then put all of the Web servers behind a single load balancer, giving the illusion that RentLikeAChampion.com was a single machine, but actually distributing the load among many. The user’s cookie would go from their browser, to the load balancer, to the ultimate machine. From there, an ActiveRecord lookup via the user ID gave the user’s information.

Bottom line: Your app might already be more scalable than you think, if session information is stored in cookies. If you’re using files to store session information, though, your application is inherently non-scalable; once you use more than one Web server, the user might end up being logged out.

Use VMs, but not necessarily AWS

Several of the biggest trends we’ve seen in the last few years involve the combination of cloud computing and virtual machines. Thanks to VMs, we can spin up as many servers as we need, only when we need them.

I’ve long been of the opinion that putting a server “in the cloud,” as everyone likes to say nowadays, is often unnecessary. There’s nothing wrong with a plain ol’ server, after all, especially since such servers are often going to be cheaper and easier to maintain. (I’m getting to the point where I might be changing my tune, as I see the advantages of deploying to a new VM or container with each release, rather than upgrading existing servers in place.)

However, in the case of RLAC, moving to the cloud was a no-brainer — not because we need it for our permanent infrastructure, but because the key worry we had was being able to scale up quickly. We didn’t know how many requests we would get, and putting in place a set of “real” servers that could handle that capacity would cost a fortune. Besides, we knew that we needed to scale up quickly for (and during) the Shark Tank airing, but then scale down just as quickly following the show’s broadcast.

Amazon Web Services is the first name that people think of in this space. We decided to go a different route, in no small part thanks to the suggestions and connections of RLAC’s CIO, Mike Hostetler: We went with Server Central, a Chicago-based company that has massive bandwidth, and lots of experience configuring virtual machines for this kind of work. Sure, AWS might be better known. But one of the advantages we got from Server Central was actual, in-person help — something that Amazon wouldn’t provide for a player as small as ourselves.

I have to say that Server Central’s staff amazed and impressed me at every turn: They were available, helpful, and polite, and knew a ton about how to configure and tune our VMs for maximum benefit. They set up our MariaDB cluster (more on that below), as well as a load balancer. They helped us to clone VMs ahead of schedule, and to connect them (virtually, of course) with our network.

In the end, we didn’t go with a Chef or Puppet configuration of our VMs, but rather used Capistrano to deploy our software to our main VMs, which we then cloned for some backup VMs. Not super elegant, I’ll admit, but it worked just fine, and meant that we could eliminate another learning curve. Mike H. worked extensively with Server Central, and really got the hang of configuring and deploying those VMs, such that we had more than 20 available for the night of the broadcast.

Bottom line: If you want to scale up and/or down quickly, then VMs are almost certainly the way to go. And if you’re looking to get actual service, rather than a faceless SaaS company, I’d definitely suggest speaking with Server Central.

Switching from Apache to nginx

I’ve said it on many occasions: I have a warm spot in my heart for Apache httpd. I’ve been doing Web development since before the first version of Apache was released, much less before the Apache Software Foundation was founded, and I have always found it to be an easy-to-use, flexible HTTP server.

However, there’s no doubt that another server, nginx, offers greater performance and scalability than Apache. In the case of RLAC and Shark Tank, we were far more interested in scalability than ease of use. At the same time, we didn’t want a steep learning curve or a difficult transition, which we feared would be the case if we moved to a combination of nginx and Unicorn, a popular duo in the Ruby on Rails world. We thus settled on using nginx with Phusion Passenger, a module that serves Ruby on Rails applications, and which we had previously used with Apache.

I have to admit that I was pleasantly surprised on all fronts by nginx. It installed without a hitch, thanks in no small part to Passenger’s super-easy installation process on Linux boxes. The configuration is quite different from Apache, but not as difficult as I would have expected. It allowed us to use our existing SSL certificates quite easily. And the performance, without a doubt, was quite impressive. Indeed, any performance issues we saw were the result of Passenger, rather than nginx; Ruby and Rails both use lots of memory, and thus there is a limit to the number of simultaneous requests you can handle per machine.

Bottom line: I hate to say it, but I don’t see a big advantage to Apache any more. nginx documentation and tutorials are quite good, support for Rails applications is excellent with Passenger, and the performance we saw was very good. Configuring Apache is still easier for me, but that’s going to be true of anything I’ve been doing for 20 years, I expect.

Database cluster

With the Web servers scaling up nicely, the biggest potential bottleneck suddenly became our database. The database on RLAC, as with most Web applications, is involved in every single page displayed, from showing the user’s name to listing homes for a particular football game, to letting homeowners set the prices for individual games and events.

The problem is that the database is a finite resource; if too many people come and visit the site at the same time, the database will cease responding to some of them, causing a domino effect that will cause many people to get errors or timeouts.

I’m a big fan of PostgreSQL, which I have been using for about 20 years. But when I inherited the RLAC code, we were already using MySQL on the site — and switching technologies is almost never a good idea if things are already working, so I didn’t move things around.

While PostgreSQL’s master-master replication is still being discussed and designed, MySQL has master-master, high-availability clustering working already. Even better, following Oracle’s acquisition of MySQL (via its purchase of Sun), Monty Widenius, the original author of MySQL, has been working on MariaDB, a MySQL-compatible fork that offers superior performance.

Server Central suggested that we use a HA master-master cluster of VMs, all running MariaDB. From our perspective as developers, it looked and felt like a normal MySQL database. But the performance was rock solid and super fast, and it meant that we didn’t need to worry about the database being a bottleneck, unless we were crushed by people actually trying to use the application.

Importing our old database (from the MySQL server on Rackspace) to our new MariaDB cluster was fast and easy. The two database servers feel almost identical, and the dump-restore cycle took a matter of minutes.

Bottom line: I feel somewhat chastened; after years of telling people how much better it is to use PostgreSQL, and how master-slave failover is probably fine for most purposes, I was pleasantly surprised to use a master-master cluster that more than suited our needs. PostgreSQL could probably have handled the load just as well, of course, but switching to a different database just before a major PR event would have been a very bad idea.

Caching

The above was a great way to get our system to work under high loads. But it’s always possible to optimize things further, and one of the best-known ways to speed up a Web application is to add caching. If you can cache a page, then everyone benefits: The user gets a much faster response, and the server doesn’t have to spend its time running Ruby and SQL, because the request can be served without even touching the application.

On RLAC, we used three different types of caches. These worked together spectacularly well, improving our performance massively. Our working assumption was that we would have a number of VMs on standby in case we needed them, and that most users would look at a few pages before bouncing off of our site. But we still knew that we would need to serve thousands (and perhaps tens of thousands) of requests per second, many of them from people who just wanted to see our home page while watching Shark Tank on their TVs.

Our caches were as follows:
  • In our application, we cached many of our SQL queries. (Ruby on Rails lets you do this inside of your application with very little fuss.) Thus, every time we asked the database for the list of homes at event #12345, we stored that information in a cache, in case that same event would be requested again in the near future.
  • On each of our servers, we (mostly Mike Hostetler) installed and configured Varnish, which provided us with caching of static assets, such as images, CSS, and JavaScript. Moreover, the configuration meant that if someone came to the RLAC home page without being logged in, they would get a cached version of the page. If, however, they came after logging in, they would hit our Rails app; this was necessary to ensure that logged-in users saw their own, personalized pages rather than cached ones. We had a separate copy of Varnish running on each of our VMs, which was admittedly not the best solution, but it worked more than well enough for our purposes. It meant that every HTTP request coming to a VM from the load balancer would first go through Varnish, and only if necessary go to the Rails server.  I should add that Varnish is one of the most impressive pieces of software I’ve seen in a long time; it’s very good at what it does, and has an incredibly powerful configuration language that lets you examine and rewrite HTTP requests in amazing ways.  It’s definitely worth looking into, if you haven’t already.
  • Finally, we used CloudFlare, a content distribution network (CDN) that did initial caching of static assets, including all of our images, JavaScript, and CSS. One of the nicest parts of CloudFlare is that they have an “emergency page” feature, so that if your site isn’t available, you can at least show a decent static page, and ask people to return later.

Bottom line: We could have optimized our caching even more than this. But after a lot of configuration of Varnish, CloudFlare, nginx, and the Rails app’s HTTP response headers, we got to the point where each of our VMs could handle about 2,000 simultaneous requests that hit the database, and far more than that if the requests were for static assets or the cached home page. The caching was incredibly helpful.

Images on S3

This isn’t directly related to the move to Server Central and getting ready for Shark Tank, but it certainly helped: Many of our images, and especially photos of homes available for rent, are now stored on Amazon’s S3, rather than our own server. This reduces the load on our VM and nginx, and provides users with much faster retrieval speeds than we can provide. Moving things to S3 turns out to be quite easy to do; most of the image-uploading gems now available for Rails have an option to store things in S3; once we set up the appropriate buckets, this was a piece of cake.

Bottom line: The cost of S3 is laughably low; you pay only for the storage and bandwidth that you actually use, and the per-byte cost is basically a joke. It’s totally reasonable to do this, and you’ll end up spending far less than you ever thought.

Our biggest mistake: delayed jobs

Remember how I said every VM was stateless, and that it didn’t matter how many VMs we would run? Yeah, that’s what we thought — and then we started to migrate our systems, and remembered that our Web server was configured to run Delayed Job. DJ is a Ruby gem that works with Rails, and it does just what the name describes — it allows you to offload work to the background, so that your users (and server) don’t sit around waiting. For example, RLAC processes many photos from homeowners. That processing, and the subsequent copying of images to S3, can take some time. So we have offloaded it to DJ, which runs in the background all of the time, processing images.

About three days before we were supposed to switch over to Server Central, with our many VMs, we realized that Delayed Job expected uploaded files to be on a specific server. That is, we had state after all — files on the filesystem that DJ was going to look for, scoop up, and process. DJ stores its job queue in the database, which is shared by all of the servers, so it seemed we had to choose between two options: rewrite our app so that all image uploads would go to a single, designated server, or upload to many servers and watch DJ workers (on every server except the one that received a given file) go crazy, because they couldn’t find the uploaded file on their local filesystem.

After some discussion (and a bit of panic), my developer Genadi found the answer: DJ lets you designate a queue name when the system starts up. We thus set up our DJ startup script such that it grabbed the hostname of the machine on which it was running, and used that hostname as the name of its DJ queue. Thus, with 20 servers, we had 20 separate DJ queues, each tied to a particular machine. This wasn’t the most efficient solution; if we had thought about it earlier, we probably would have centralized things, with a single DJ server. But it worked beautifully, required very little configuration (except for figuring out how to start up DJ with this option via our startup scripts), and pulled us through.

Bottom line: Think very hard about whether you have any extra services running that might involve state.

Cron jobs

Finally, the RLAC system has a large number of processes that are run via cron jobs. These are crucial for the business, in that they send out reminder e-mail messages to homeowners and renters, handle credit cards, and make homeowner payments. You don’t want these jobs to go down, but in contrast with DJ, you also don’t want them to run on every one of your Web servers.

The solution was to designate one of our servers, and to set it up to use cron. Thus, while our servers are all on identical VMs, you could say that one is more equal than others, in that it is running a variety of rake tasks via cron to deal with system maintenance. Perhaps we’ll pull these tasks onto a separate VM one of these days, but for now, things are working quite well.

The only (minor) hitch was that RLAC’s e-mail servers are run by Google, and Gmail started tagging all of the e-mail produced by our cron jobs as spam, and rejecting it. The solution was to configure Postfix (our SMTP server) to use SendGrid, our outgoing e-mail provider. The moment we did that (which was shockingly easy to do, I must admit), I started to get reports from our cron jobs once again.

Bottom line: If you have regularly scheduled cron jobs, either for system features or for maintenance (e.g., checking disk space or making database backups), make sure that you know in advance where you’ll be putting them. And test things, to make sure that you’re able to receive e-mail from your cron jobs, in case of success or failure.

Team effort

Perhaps the most important aspect of what we did was that it was a team effort.  None of us could have done this by ourselves; even on a relatively small application, there were too many types of responsibility and expertise for anyone to have done it on their own. I really enjoyed getting to work with not only Genadi Reznichenko (the amazing developer who has worked for me for more than three years), but also Mike Hostetler (who took on the role of CIO over the last year), and Trevan Hetzel (our front-end developer), as well as the RLAC team led by CEO Mike Doyle.  Communication and trust are the most important aspects of any software project, and we did well on both fronts, thanks to GitHub, Slack, weekly group phone calls, and lots of pair-programming sessions.

We worked hard to get things up and running; it was a team effort, but one which paid off in spades for the company. I hope that this description helps some of you to think about scaling up your Web applications in various ways; if you have stories to share, or questions (or comments) about what I’ve written here, please let me know. I’d love to swap stories and/or hear what you went through!