The real questions to ask offshore developers

Friends of mine, who are not software developers, have a small, retail Internet business.  The original developers created the application in Python, and my friends are looking for a full-stack Web/Python developer to help them.  Frustrated with their inability to find someone who can commit to their project, my friends have decided to hire offshore developers, which is another way of saying, “cheap programmers in Eastern Europe or India.”

Earlier this week, these friends e-mailed me the resumes of three Ukrainian programmers, asking me which seemed most appropriate, and what questions they should be asking.

The resumes were, from my perspective, largely identical.  All the programmers declared themselves to be “experts,” or “very experienced,” or just “experienced,” at Python, JavaScript, Web development, SQL, and many other technologies.  And the fact is, they probably are quite skilled at all of these technologies; the Ukrainians with whom I have worked — as well as the Indians, Chinese, and Romanians — have all been quite skilled, technically.

But here’s the thing: Technical skill isn’t the primary consideration when hiring a developer, and that’s doubly true when hiring an offshore developer.  The problems that I’ve seen with offshore programmers aren’t technical, but managerial.  As I told my friends, you would much rather have a so-so Python programmer who is reliable and communicative than a genius programmer who is unreliable or uncommunicative.  The sad fact is that many offshore outsourcing companies have talented programmers but poor management and leadership, so the breakdowns are in communication, transparency, and scheduling, rather than in technology.

Sure, a developer might know the latest object-oriented techniques, or know how to create a RESTful JSON API in his or her sleep.  But the programmer’s job isn’t to do those things. Rather, the programmer’s job is to do whatever the business needs to grow and improve.  If that requires fancy-shmancy programming techniques and algorithms, then great.  But most of the time, it just requires someone willing to pay attention to the project’s needs and schedule, writing simple and reliable code that’s necessary for the business to succeed.

The questions that you should be asking an offshore developer aren’t that different from the ones that you should be asking a developer in your own country, who speaks your language, and lives in your time zone.  Specifically, you should be asking about their communication patterns and processes.  Of course, you don’t want a dunce working on your programming project — but good communication and processes will smoke out such a person very quickly.

If there are no plans or expectations for communication, then you’re basically hoping that the developer knows what you want, that he or she will do it immediately, and that things won’t change — a situation that is pretty much impossible.

Good processes and a good developer will lead to a successful project.  Good processes and a bad developer will make it clear that the developer needs to go, and soon.  Bad processes and a developer of any sort will make it hard to measure performance, leading to frustration on everyone’s part — and probably missed deadlines, overspent budgets, and more.

So I told my friends that they should get back to these Ukrainian programmers, and ask them the following questions:

  • What task tracking system do you prefer to use, in order to know what needs to be done, what has been done, and who has taken responsibility for each task?
  • How often do you want to meet to review progress?
  • Do you use automated testing to ensure that when we make progress, we can be sure that it works, and that we haven’t introduced regressions?
  • How easily will a third party be able to download the repository from Git (or whatever version-control system you’re using), and run those tests to verify that everything is working?

The answers to these questions are far, far more important than the technical skills of the person you’re hiring.  Moreover, these are things that you can test empirically: If the developer doesn’t follow through on one or more of them, you’ll know right away, and can find out what is going wrong.

If the developer is good, then he or she will encourage you to set up a task tracker, and to meet every day (or at least, every other day) to review where things stand.  You’ll hear that automated testing is part of the development process, and that of course it’s possible to download, install, and run the application on any compatible computer.

If the developer hedges on these things, or asks you to simply extend your trust, then that’s a bad sign.  Truth be told, the developer might be fantastic, brilliant, and do everything you want.  But do you want to take that risk?

If the developer has regular communication with you, tests their code, and allows you to download and run the application on your own, then you’re in a position to either praise them and keep the relationship going — or discover that things aren’t good, and shut it down right away.

Which brings me to my final point: With these sorts of communication practices in place, you’ll very quickly discover whether the developers are doing what they promised.  If so, then that’s great for everyone.  But if not, then you’ll know within a week or less — and then you can get rid of them.

There are plenty of talented software developers in the world, but there are many fewer who both understand your business and make its success a priority.  A developer who values your business will want to demonstrate value and progress on a very regular basis.  Someone who cannot demonstrate value and progress probably isn’t deserving of your attention or money, regardless of where they live or what language they speak.  But if you can find someone excellent, who values you and your business, and who wants to help you succeed?  Then by all means, hire them — and it doesn’t matter whether they’re in Ukraine, or anywhere else.

What questions do you ask offshore developers before hiring them?

PostgreSQL array indexes and length

In my last blog post, I introduced the idea of a PostgreSQL array, and showed how we can insert data into a table using either the curly-brace {} syntax or the ARRAY construction syntax.  In this post, I want to talk about PostgreSQL array indexes and lengths — what happens when we retrieve elements at indexes that do (and don’t) exist, how we can construct multidimensional arrays, and how we can ask PostgreSQL for the length of any dimension of our multidimensional arrays.

First of all, let’s talk about how you can retrieve data from arrays.  Let’s say I still have my three-row table from last time:

[local]/reuven=# select id, stuff from foo;
 ┌────┬─────────────────────┐
 │ id │ stuff               │
 ├────┼─────────────────────┤
 │ 8  │ {abc}               │
 │ 9  │ {abc,def}           │
 │ 10 │ {abc,def,"ghi jkl"} │
 └────┴─────────────────────┘

What if I want to get just the first value back from the “stuff” column?  Well, then I have to take the first element of an array.  Most modern languages start to number arrays with index 0; PostgreSQL, by contrast, starts to count them with 1.  I can thus ask for just the first element of the “stuff” array from each row with:

[local]/reuven=# select id, stuff[1] from foo;
 ┌────┬───────┐
 │ id │ stuff │
 ├────┼───────┤
 │ 8  │ abc   │
 │ 9  │ abc   │
 │ 10 │ abc   │
 └────┴───────┘

If we ask for an index that doesn’t exist, we get a NULL value back:

[local]/reuven=# select id, stuff[3] from foo;
 ┌────┬─────────┐
 │ id │ stuff   │
 ├────┼─────────┤
 │ 8  │ [null]  │
 │ 9  │ [null]  │
 │ 10 │ ghi jkl │
 └────┴─────────┘

Note that I have configured my PostgreSQL client to show NULL values as “[null]”, by putting the following line in my ~/.psqlrc file:

\pset null [null]

Without the above line in your .psqlrc (or running that command in psql yourself manually), you might see blank space for row IDs 8 and 9.

Now, it’s pretty rare for me to pull out a particular value from a PostgreSQL array.  Instead, I’m often finding out the lengths of the arrays on which I’m working.  I can do this with the array_length function:

[local]/reuven=# select id, stuff, array_length(stuff, 1) from foo;
 ┌────┬─────────────────────┬──────────────┐
 │ id │ stuff               │ array_length │
 ├────┼─────────────────────┼──────────────┤
 │ 8  │ {abc}               │ 1            │
 │ 9  │ {abc,def}           │ 2            │
 │ 10 │ {abc,def,"ghi jkl"} │ 3            │
 └────┴─────────────────────┴──────────────┘

Notice that array_length is a function that takes two parameters: an array and an integer.  The array is what we want to measure; the integer indicates which of the array’s dimensions should be measured.  If you’re like me, and come from dynamic languages like Ruby and Python, in which arrays (or lists) can be of any length, then you should realize that PostgreSQL arrays can be multidimensional, but each inner array must be of the same length.  So, for example, I can create a 2×3 array of integers with:

[local]/reuven=# select ARRAY[ARRAY[1,1,1], ARRAY[2,2,2]];
┌───────────────────┐
│ array             │
├───────────────────┤
│ {{1,1,1},{2,2,2}} │
└───────────────────┘

Trying to create an array whose inner arrays have different lengths will not work:

[local]/reuven=# select ARRAY[ARRAY[1,1,1], ARRAY[2,2,2,2]];
ERROR: 2202E: multidimensional arrays must have array expressions with matching dimensions

Assuming that I have a legitimate array, I can get its length:

[local]/reuven=# select array_length(ARRAY[ARRAY[1,1,1], ARRAY[2,2,2]], 1);
┌──────────────┐
│ array_length │
├──────────────┤
│ 2            │
└──────────────┘

Or I can get the length of the inner dimension:

[local]/reuven=# select array_length(ARRAY[ARRAY[1,1,1], ARRAY[2,2,2]], 2);
┌──────────────┐
│ array_length │
├──────────────┤
│ 3            │
└──────────────┘

So, when retrieving our rows from the “foo” table:

[local]/reuven=# select id, stuff, array_length(stuff, 1) from foo;
┌────┬─────────────────────┬──────────────┐
│ id │ stuff               │ array_length │
├────┼─────────────────────┼──────────────┤
│ 8  │ {abc}               │       1      │
│ 9  │ {abc,def}           │       2      │
│ 10 │ {abc,def,"ghi jkl"} │       3      │
└────┴─────────────────────┴──────────────┘

I can get the array length in a separate column, as in this example.  Or I can even sort in descending order of the array length:

[local]/reuven=# select id, stuff, array_length(stuff, 1) from foo order by array_length(stuff, 1) desc;
┌────┬─────────────────────┬──────────────┐
│ id │ stuff               │ array_length │
├────┼─────────────────────┼──────────────┤
│ 10 │ {abc,def,"ghi jkl"} │     3        │
│ 9  │ {abc,def}           │     2        │
│ 8  │ {abc}               │     1        │
└────┴─────────────────────┴──────────────┘

Notice that our ORDER BY clause has to repeat the function that we used to create the third column.  Another way to do this is to declare an alias for the output of array_length, and then use the alias in ORDER BY:

select id, stuff, array_length(stuff, 1) len from foo order by len desc;
┌────┬─────────────────────┬─────┐
│ id │ stuff               │ len │
├────┼─────────────────────┼─────┤
│ 10 │ {abc,def,"ghi jkl"} │ 3   │
│ 9  │ {abc,def}           │ 2   │
│ 8  │ {abc}               │ 1   │
└────┴─────────────────────┴─────┘

Next time, we’ll look at how we can manipulate arrays using array functions.

Learning to love PostgreSQL arrays

I’ll admit it: When arrays were added to PostgreSQL a number of years ago, I thought that this was a really bad idea.  I’m a firm believer in normalization when it comes to database design and storage; and the idea of putting multiple values inside of a single column struck me as particularly foolish.  Besides, my impression was that PostgreSQL arrays were clumsy to work with, and didn’t really add much to my data model.

Of course, it turns out that arrays are extremely useful in PostgreSQL.  I still cringe when people want to use them for general-purpose storage, instead of working to normalize their database design.  But over the last few months, as I’ve been doing all sorts of complex PostgreSQL queries for my PhD dissertation, I’ve found that PostgreSQL arrays are extremely useful when it comes to aggregating and reporting data.

I’ve thus decided to dedicate a number of blog posts to PostgreSQL arrays: How to create them, use them, manipulate them, and decide when to use them.

Let’s start with the very basics; over the next few blog posts, I’ll try to show how arrays can be interesting and even useful, and how they fit into more complex queries and needs.

You can create an array of just about any data type in PostgreSQL.  As the documentation says, “Arrays of any built-in or user-defined base type, enum type, or composite type can be created. Arrays of domains are not yet supported.” This means that you can create arrays of just about any data type you want: Integers, text, enums, other arrays (for multidimensional arrays), or even user-defined types.  To date, I have generally created arrays of integers and text, but that might not be representative of your use case.

To create a table with a text array in one column, just add square brackets ([]) after the type:

CREATE TABLE Foo (
    id SERIAL NOT NULL,
    stuff TEXT[],
    PRIMARY KEY(id)
);

When I then ask for the definition of my “Foo” table, I see the following:

[local]/reuven=# \d foo
 Table "public.foo"
┌────────┬─────────┬──────────────────────────────────────────────────┐
│ Column │ Type    │ Modifiers                                        │
├────────┼─────────┼──────────────────────────────────────────────────┤
│ id     │ integer │ not null default nextval('foo_id_seq'::regclass) │
│ stuff  │ text[]  │                                                  │
└────────┴─────────┴──────────────────────────────────────────────────┘
Indexes:
 "foo_pkey" PRIMARY KEY, btree (id)

Notice that the type of the “stuff” column is indeed recorded as “text[]”, showing that it’s an array.  If we try to insert a plain-text value into that column, PostgreSQL will complain:

[local]/reuven=# insert into foo (stuff) values ('abc');
ERROR: 22P02: array value must start with "{" or dimension information
LINE 1: insert into foo (stuff) values ('abc');
                                        ^

One of the many things that I love about PostgreSQL is the attention to detail in the error messages.  Not only does it tell us that the table is expecting an array value, but that the array must begin with a { character.  It also shows us, using a ^ character, where the parser had problems.  That’s not always a perfect indicator of where the problem lies, but it’s a great start.

If I want to insert an array value into my table, I can thus use the literal array syntax that PostgreSQL provides, with (as indicated above) curly braces:

[local]/reuven=# insert into foo (stuff) values ('{abc}');
INSERT 0 1
[local]/reuven=# insert into foo (stuff) values ('{abc,def}');
INSERT 0 1
[local]/reuven=# insert into foo (stuff) values ('{abc,def,ghi jkl}');
INSERT 0 1

The above commands insert three rows into our table.  In all three cases, we are inserting array values into our table.  Notice that in all cases, the array is inserted as a string, surrounded by single quote marks.  Thus, ‘{abc}’ becomes a one-element array, and ‘{abc,def}’ becomes a two-element array.

What happens when there is a space character inside of the text?  PostgreSQL automatically quotes the value (with double quotes — be careful!).  What happens if you want a comma or single quote as part of the text?  Then things get even uglier.
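
For example, if one of my values contains a comma, I need to double-quote it within the curly-brace literal; otherwise, PostgreSQL treats the comma as an element separator:

[local]/reuven=# insert into foo (stuff) values ('{abc,"def, ghi"}');
INSERT 0 1
[local]/reuven=# insert into foo (stuff) values ('{abc,def, ghi}');
INSERT 0 1

The first INSERT stores a two-element array whose second element is “def, ghi”; the second stores a three-element array (abc, def, and ghi), which is probably not what was intended.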

A nice solution, and a better way (I believe) to insert arrays in any event, is to use the built-in ARRAY constructor syntax.  Then you don’t have to worry about such things.  For example, I can rewrite all of the above INSERT commands in what I believe to be a much nicer way:

[local]/reuven=# insert into foo (stuff) values (ARRAY['abc']);
INSERT 0 1

[local]/reuven=# insert into foo (stuff) values (ARRAY['abc', 'def']);
INSERT 0 1

[local]/reuven=# insert into foo (stuff) values (ARRAY['abc', 'def', 'ghi jkl']);
INSERT 0 1

[local]/reuven=# select * from foo;
┌────┬─────────────────────┐
│ id │ stuff               │
├────┼─────────────────────┤
│ 8  │ {abc}               │
│ 9  │ {abc,def}           │
│ 10 │ {abc,def,"ghi jkl"} │
└────┴─────────────────────┘
(3 rows)

The same data was inserted into the table, but with less hassle than before.

Now, just because we can insert arrays directly into our tables doesn’t necessarily mean that we should do so.  You’ll see, over the course of this series, that I view arrays as a great way to aggregate and analyze existing data, particularly within the context of a view or a CTE.  So please don’t be tempted to start stuffing all of the data you want and need into a single column; normalization is still a good idea, and arrays can be tempting.  However, being familiar with the basics of defining and inserting array data into the database is quite useful, and will serve us well throughout the rest of this series.

 

Control-R, my favorite Unix shell command

If you use a modern, open-source Unix shell — and by that, I basically mean either bash or zsh — then you really should know this shortcut.  Control-R is probably the shell command (or keystroke, to be technical about it) that I use most often, since it lets me search through my command history.

Let’s start with the basics: When you use bash or zsh, your commands are saved into a history file, whose name is typically stored in the environment variable HISTFILE.  I use zsh (thanks to oh-my-zsh), and my HISTFILE points to ~/.zsh_history.  How many commands does it store?  That depends on the value of the environment variable HISTSIZE, which in my case is 10,000.  Yes, I keep the 10,000 most recent commands that I entered into my shell.
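
For example, here is roughly what the relevant settings might look like in a ~/.zshrc; the exact values are up to you, and SAVEHIST is the zsh variable that controls how many lines actually get written to HISTFILE:

# History settings in ~/.zshrc (illustrative values)
HISTFILE=~/.zsh_history
HISTSIZE=10000    # commands kept in memory
SAVEHIST=10000    # commands written to $HISTFILE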

Now, before control-R, there were a bunch of ways to search through and use the history.  Each command has its own number, and thus if you want to replay command 5329, you can do so by typing

!5329

But this requires that you keep track of the numbers, and while I used to do that, I found it to be more annoying than useful.  What I really wanted was just to repeat a command … you know, the last time I ssh’ed into a server, or something.  So yeah, you can do

!?ssh

and you’ll get the most recent “ssh” command that you entered.  But what if you have used ssh lots of times, to lots of servers?  You could start to search for the server name, but then things start to get complicated, messy, and annoying.

What control-R does is search backwards through HISTFILE, looking for a match for what you have entered until now.  If you use Emacs, then this will make perfect sense to you, since control-R is the reverse version of control-S in Emacs.  If you don’t know Emacs, then it’s a crying shame — but I’ll still be your friend, don’t worry.

Let’s say you have ssh’ed into five different servers today, and you want to ssh again into the third server of the bunch.  You type control-R, which puts you into bck-i-search (i.e., “backward incremental search”) mode.  Now type “s” (without enter).  The most recent command that you entered, which contains an “s”, will appear.  Now type another “s” (again, without pressing enter).  The most recent command containing two “s” characters in a row will appear.  Depending on your shell and configuration, the matching text might even be highlighted.

Now enter “h”.   In my case, I got to the most recent call to “ssh” that I made in my shell.  But I don’t want this last (fifth) one; I want the third one.  So I enter control-R again, and then again.  Now I’m at the third time (out of five) that I used ssh today, at the command I want.  I press “enter”, and I’ve now executed the command.

While searching backward, if you miss something because you hit control-R one too many times, you can use control-S to search forward.  You can use the “delete” key to remove characters, one at a time, from the search string.  And you can use “enter”, as described above, to end the search.  I should also note that I’ve modified my zsh prompts such that the matched text in control-R is highlighted, which has made it even more useful to me.

So, when was the last time I typed out the full “ssh” command for a client’s server?  I dunno, but it was a while ago, since the odds are that within the 10,000 most recent commands, I’ve got a mention of that client’s server.  And if I needed to pass specific options to ssh, such as a port number or a certificate file to get into AWS, that’ll be in the history, too.  By combining a huge history with control-R, you can basically write each command once, and then refer back to it many times.

Now, the fact is that control-R isn’t really part of bash, per se.  Rather, it comes from a GNU library called “readline” that is used in a large number of programs; zsh implements the same behavior in its own line editor (ZLE).  For example, readline is used in IPython, Pry, and the psql command-line client for PostgreSQL.  Everywhere I go, I can use control-R — and I do!  Each program saves its own history, so there’s no danger of mixing shell commands with PostgreSQL queries.

 

Starting a new software project? Don’t start coding right away.

It’s always fun to start a new project. I should know; I’ve been a consultant since 1995, and have started hundreds of projects of various shapes and sizes.  It’s tempting, when I first meet a new client and come to an agreement, to dive right into the code, and start trying to solve their problems.

But that would be a mistake.

More important than code, more important than servers, more important even than finding out what problems I’m supposed to be solving, is the issue of communication.  How will the client communicate their questions and problems to me?  How will I tell them what I am doing?  Even more importantly, how will I tell them where I’m having problems, or need help?

Before you begin to code, you need to set up two things: First, a time and frequency of meeting.  Will it be every day at 8 a.m.?  Every Monday at 2 p.m.?  Tuesdays and Thursdays at 12 noon?  It doesn’t matter that much, although I have found that daily morning meetings are a good way to start the day.  (When you work on an international team, though, someone’s “morning” meeting is someone else’s evening meeting.)  These meetings, whether you want to call them standups, weekly reviews, or something else, are to make sure that everyone is on the same page.  Are there problems?  Issues?  Bugs?  New feature requests?  Is someone stuck, and needs help?  All of that can be discussed in the meeting.  And by setting a regular time for the meeting, you raise the chances that when something goes wrong (and it will), there will be a convenient time and place to discuss the problems.

I’m actually of the opinion that it’s often good to have both a daily meeting (for daily updates) and a weekly one (for review and planning).  Whatever works for you, stick with it.  But you want it to be on everyone’s schedule.

The second thing that you should do is set up a task tracker.  Whether it’s Redmine, Trello, GitHub issues, or even Pivotal Tracker, every software project should have such a task tracker.  They come in all shapes, sizes, and price points, including free.  A task tracker allows you to know, at a glance, what tasks are finished, which are being worked on right now, and which are next in line.  A task tracker lets you prioritize tasks for the coming days.  And it allows you to keep track of who is doing what.

Once you have set up the tracker and meeting times, you can meet to discuss initial priorities, putting these tasks (or “stories,” as the cool agile kids like to say) in the tracker.  Now, when a developer isn’t sure what to work on next, he or she can go to the task tracker and simply pick the top things off of the list.

This isn’t actually all that hard to do.  But it makes a world of difference when working on a project.

Summary of my “reduce” series

I teach Ruby and Python to a lot of people — in formal courses, and in one-on-one pairing sessions, both online and in person.  I’ve found that for many people, the whole notion of functional programming seems strange and difficult, as well as something of a waste of time.  After all, if you have objects, why would you need functional programming?

The answer, of course, is that no single paradigm has all of the answers; each has its strengths and weaknesses.  Understanding how to use them together can provide great benefits.  As a result, I spend time when teaching both Ruby and Python on the basics of functional programming, and then the various functions and methods that each language provides in this area.

Of these, the function that most often causes people to wrinkle their noses and/or get confused is “reduce”.  And to be honest, I often tell students in my classes that “reduce” is one of those functions that is incredibly powerful and clever, but for which you sometimes need to wait in order to find a use case.  I decided to explore some use cases, and ways in which “reduce” could be used — and I hope that these have been useful.

To summarize, here are the posts that I wrote on this topic:

I hope that this series has been useful and interesting, and would appreciate hearing ideas for additional deep-dives into areas of Python, Ruby, or programming in general.  What subjects do you find confusing?  What methods do you think are somewhat useless?  Let me know, and I’ll try to address them in future blog posts!

 

Implementing “filter” with “reduce”, in Ruby and Python

We’re nearly at the end of my tour of the “reduce” function in Ruby and Python.  Just as I showed in the previous installment how we can implement the “map” function using “reduce”, I want to show how we can implement another functional-programming standard, “filter”, using “reduce” as well.  As before, I’ll show examples in both Ruby and Python.

Whereas “map” transforms each element of its input, returning one output value for each input value, “filter” doesn’t change its inputs at all; it simply decides which of them to return.  The combination of “map” and “filter” is a classic and powerful one.  While Python does include “map” and “filter” functions, these operations are more usually expressed using list comprehensions.  The “map” part is the left-hand side of the list comprehension, defining the transformation.  The “filter” part is the (optional) condition at the end of the comprehension, indicating when a value should be placed in the output.
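
For example, this one-line comprehension squares each number (the “map” part), keeping only the odd ones (the “filter” part):

>>> [x * x for x in range(10) if x % 2]
[1, 9, 25, 49, 81]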

Here is a use of “filter” to keep only odd numbers (taking advantage of the fact that in Python, 0 is a false value, and % is the modulus operator on integers):

>>> filter(lambda x: x%2, range(10))
[1, 3, 5, 7, 9]

It turns out that we can use “reduce” to implement our own version of “filter”, which does the same thing. Once again, we’ll define a function that takes two parameters, a sequence and a function:

def myfilter(sequence, f):
    return reduce(lambda total, current: total + ([current] if f(current) else []),
                  sequence,
                  [])

In other words: Our function invokes reduce, initializing it with an empty list ([]).  We go over every element of “sequence”; for each element, we invoke the user-supplied function, f, passing “current” as a parameter.  We always return “total”, which is the list that we have created so far.  The question is whether we also append “current” to that list, and that depends on whether f(current) returns True or False.  If it returns True, then we add “current” to the accumulated list.  If it returns False, then we add [] (i.e., the empty list) to the list.

If we again want to filter a list of numbers, such that we get only the odd ones, we can now say:

>>> myfilter(range(10), lambda x: x%2)
[1, 3, 5, 7, 9]

What can be tricky for many newcomers to Python and functional programming is that the implementation of “myfilter” involves a lambda (i.e., anonymous function) in our invocation of “reduce”, but also a function (often defined as a lambda) that is passed to “myfilter” by the user.  I’ve found in my teaching that “lambda” tends to surprise and confuse many people, and two lambda expressions in the same place can really be a cognitive burden for newcomers.

Now that we’ve implemented a version of “filter” in Python, let’s turn to Ruby.  Here, as with “map”, we’re not going to pass a function (method) as a parameter. Rather, we’ll pass a block to our “myfilter” method, and then yield to the block whenever we want to invoke it.  For example:

def myfilter(enumerable)
  enumerable.reduce([]) {|total, current| yield(current) ? total << current : total }
end

(By the way, “filter” is traditionally called both “select” and “find_all” in Ruby circles.)  Notice that there are two blocks involved here: Enumerable#reduce invokes a block for each element of “enumerable” over which it iterates.  And then within that block, we yield to the block passed to “myfilter”, to check whether the current element should be included in the result.

We can then invoke “myfilter” as follows:

e = (1..10).to_a
=> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

myfilter(e) {|n| n.even?}
=> [2, 4, 6, 8, 10]

Sure enough, we’ve managed to create our own version of “filter”, just as we were able to do for “map”.

Next time: The exciting conclusion of our tour of the “reduce” function.  Same reduce-time, same reduce-channel!

Implementing “map” with “reduce”, in Ruby and Python

This is another installment in my “reduce” series of posts, aimed at helping programmers understand this function, with examples in both Ruby and Python.  So far, we have seen how we can build a number of different types of data structures — integers, strings, arrays, and hashes — using “reduce”.  But the really interesting use of “reduce,” in my mind, is as a way to implement “map” and “filter,” two classic functions from the functional-programming world.  In this post, I show how to implement “map” in both Ruby and Python.

First, let’s remember what “map” does: It takes a sequence, and applies a function to that sequence, one element at a time.  Applying that function to each element produces a new sequence of values, which we then get back from “map”.  In Ruby, we can thus say:

[1,2,3,4,5].map {|n| n*n}

which produces the array:

[1,4,9,16,25]

In Ruby, any block that you want to pass is fair game.  The block takes a single parameter, the current element of the enumerable that is being transformed.  Whatever the block returns is put into the output array; because nearly everything in Ruby is an expression, just about anything can go inside the block.  Moreover, the output array will always consist of the same number of elements as the input array.  So if you run Enumerable#map on an array of 15 items, you can be sure that you’ll get a new, anonymous array of 15 items.  The value of each of these items will depend on what the block returns.

If we want to implement “map” using “reduce,” we know that we need to invoke a block once per element of the enumerable.  The block should add one element to the end of an array that we have created.  The element that we add to the array should be the result of invoking a method on the current element.

For example, if we want to use “reduce” to implement the “map” example from above, in which we get the squares of the input numbers, we could do the following:

[1,2,3,4,5].reduce([]) {|total, current| total << current * current}

and sure enough, we get the following output:

[1, 4, 9, 16, 25]

If squaring numbers were the only thing we were interested in doing, we could write a “mymap” method that would do that:

def mymap(enumerable)
  enumerable.reduce([]) {|total, current| total << current * current}
end

mymap([1,2,3,4,5])

And sure enough, we get the array of squared numbers.  But as you can imagine, this implementation of map is somewhat lacking… we would really like to be able to pass a block, just as we can do with the built-in Enumerable#map method.  In Ruby, we can do this very easily by passing a block to our method, and then yielding to it (i.e., invoking it) once per iteration within the “reduce”.  For example:

def mymap(enumerable)
  enumerable.reduce([]) {|total, current| total << yield(current)}
end

mymap([1,2,3,4,5]) {|n| n*n}
=> [1, 4, 9, 16, 25]

mymap([1,2,3,4,5]) {|n| (n * 30).to_s}
=> ["30", "60", "90", "120", "150"]

Not bad!  Using “reduce”, we’ve managed to implement a version of “map”.  That’s pretty snazzy, if you ask me.

We can do almost exactly the same thing in Python.  Python doesn’t have Ruby-style blocks, but it does have functions — both named functions and anonymous ones created with “lambda” — that can be passed easily as parameters, and then invoked using parentheses. The Python version of “mymap” would thus look something like this:

def mymap(sequence, f):
  return reduce(lambda total, current: total + [f(current)], sequence, [])

In other words, our function takes two parameters, a sequence and a function. We then build the resulting list, one element at a time, based on the output from invoking our “f” function on each element in sequence.  It turns out that “map” is not that different from our use of “reduce” to build lists; the only difference, in fact, is that we’re letting the user pass us a function, which we then invoke on the current item.
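
We can then invoke it much as we did in Ruby.  (I’m assuming Python 2 here, as in the rest of this series; in Python 3, you would first need to run “from functools import reduce”.)

>>> mymap([1,2,3,4,5], lambda n: n*n)
[1, 4, 9, 16, 25]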

It turns out, then, that implementing “map” isn’t so hard or mysterious!  In the next installment of this series, we’ll see how to implement the “filter” function (called “select” or “find_all” in Ruby), another staple of functional programming.

Creating Ruby hashes with “reduce”

In yesterday’s post, I showed how we can use Python’s “reduce” function to create a dictionary.  Ruby, of course, also has dictionaries, but calls them “hashes.”  In this posting, I’m going to show how we can create a hash in Ruby when iterating over an enumerable.  In so doing, we’ll see how we can use the “reduce” method to create interesting hashes, building them up one step at a time.

Let’s start with the simplest and easiest solutions, which will be quite similar to what we did with Python: We’ll take an array of arrays, and turn that into a hash:

[['a', 1], ['b', 2]].to_h

Oh, wait — you’re not using Ruby 2.1?  Too bad; that version added Array#to_h, a method that creates key-value pairs from nested arrays.  The idea is that each inner array should have two elements, the first being a key and the second being a value.  So the above call to Array#to_h will result in:

{"a"=>1, "b"=>2}

Of course, we often want to use symbols for our hash keys, rather than strings.  But let’s ignore that for today, and just concentrate on how we can create hashes, rather than quibble over key types.

If you’re using a version of Ruby earlier than 2.1, and thus don’t have access to Array#to_h, then you can always use Hash[]:

Hash['a', 1, 'b', 2]
=> {"a"=>1, "b"=>2}

Note that Hash[] is more flexible than Array#to_h: You can pass it a flat list of keys and values (as above), or a single array of two-element arrays.  The second form means that if we’re interested in using the “reduce” method to build up a hash, we can do it as follows:

Hash[%w(a b c).reduce([]) {|total, current| total << [current, current.ord]}]

In other words, we start with an array consisting of three strings (‘a’, ‘b’, and ‘c’), reduce it into an array of two-element arrays, and then turn that array of arrays into a hash.  But of course, hashes (like most data structures in Ruby) are mutable, and (more importantly) methods that modify data structures often return the object on which they worked.  So we can actually build up our hash in a more direct way:

%w(a b c).reduce({}) {|total, current| total.update(current => current.ord)}

Here, we once again create an array of “a”, “b”, and “c”.  Then we initialize our call to “reduce” with an empty hash.  Then we reduce, with each call invoking Hash#update.  Hash#update not only merges the parameter’s value (a hash) into “total”, but returns the resulting hash.  Thus, with each invocation, “total” is overwritten with the new hash, to which we have added a new pair.

Of course, this example is a pretty trivial one; you can imagine that instead of invoking current.ord, that you create a new object based on the parameter, or that you do a calculation based on it instead.  The bottom line, though, is that creating hashes incrementally in Ruby is pretty easy to do.  Moreover, if you find that a single-line block is not enough space in which to do all of the calculations you need, you can always use a do-end style of block, which will let you have as many lines as you want.
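
For example, the same hash-building “reduce” can be rewritten with a do-end block, leaving room for intermediate steps (the extra variable here is purely illustrative):

%w(a b c).reduce({}) do |total, current|
  codepoint = current.ord
  total.update(current => codepoint)
end
=> {"a"=>97, "b"=>98, "c"=>99}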

Next time: Implementing “map” and “filter” with “reduce”!  If you understand those, then you’ve totally internalized the power (and dare I say it, fun) of this method.

 

 

 

Creating Python dictionaries with “reduce”

In the last few installments (first, second, third, and fourth) of this series, we saw how “reduce” can be used to build up a scalar value (such as a number or string), or even a simple collection, such as a list (in Python) or an array (in Ruby).  The jewel in the data-structure crown for these high-level languages is known by many names: Dictionary, hash, hash table, hash map, and mapping.  I tend to use these quite a bit, and I know that I’m not alone; hashes are easy to use and work with, have O(1) lookup characteristics, guarantee the uniqueness of their keys, and make the code self-documenting.  What’s not to like?

There are times when you might want to build a dictionary step by step, and “reduce” can help you to do that.  I should note that recent versions of Python offer “dictionary comprehensions,” which are one of my favorite features of the language, and can be used similarly to “reduce” — and probably with less confusion among new programmers.  Nevertheless, I find it interesting and instructive to see how we can use “reduce” to create a data structure that we wouldn’t normally associate with this function.
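
For example, the letters-to-ASCII-codes dictionary that we will construct below with “reduce” can be written as a comprehension in a single line (in Python 2.7 or later):

{letter: ord(letter) for letter in 'abc'}

which produces the same {'a': 97, 'b': 98, 'c': 99} mapping, with far less ceremony.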

I should note that some of the things you’re going to see here are not really recommended coding practices, particularly in Python.  But they will be fun, which is important, no?

Let’s start with Python: If I want to create a dictionary using “reduce”, I have a few different options.  Perhaps the first and easiest one is not exactly what you might have had in mind, namely passing the “dict” function (which creates a new instance of “dict” — i.e., a new dictionary) a list of two-element tuples.  For example, I can say:

dict([('a', 1), ('b', 2)])

and I get:

{'a': 1, 'b': 2}

So if I can tell “reduce” to emit a list of tuples, I can then pass that list of tuples to “dict”, and get a dictionary back.  Let’s try that:

reduce(lambda output, current: output + [(current, ord(current))], 'abc', [])

In the above case (and in all of the examples we’ll use here), I’m trying to build an oh-so-useful dictionary in which the keys are the letters a, b, and c, and the values are the ASCII codes for those letters.  In the above Python code, I iterate over the string ‘abc’, which is a three-element sequence of those three letters.  For each letter, I return the current value of “output” (which is guaranteed to be a list), plus a new, single-element list.  That single-element list is a tuple consisting of the current letter and its ASCII code.  So the output of the above “reduce” call will be:

[('a', 97), ('b', 98), ('c', 99)]

which, when we feed it into dict():

>>> dict([('a', 97), ('b', 98), ('c', 99)])
{'a': 97, 'b': 98, 'c': 99}

Voila!  We’ve created a dictionary.  (By the way, the “dict.items” method, which is often used to iterate over the keys and values of a dictionary, returns a list of tuples in precisely this format.)

So this is nice, but perhaps there’s a way to have “reduce” build up a dictionary all by itself?  The answer is “yes, but” — because a lambda may contain only a single expression, Python makes it hard to modify a data structure within one.  But if we’re willing to be a bit weird, we can get around that.  How?  By taking the dictionary that we received from the previous iteration (i.e., “output”), turning it into a list of tuples with dict.items, adding our current pair to that list, and then turning the whole thing back into a dictionary.

reduce(lambda output, current: dict(output.items() + 
                     [(current, ord(current))]), 'abc', {})

Note that while this is unwieldy, it doesn’t violate one of the key tenets of functional programming, namely treating data as immutable.  However, putting this sort of code in a production system will lead to hatred from your colleagues, job security, or (if you’re really lucky) both.  Oh, and it’s probably rather inefficient as well, although I’m not crazy enough to actually benchmark it.

That said, consider what we’ve done here: We create a new dictionary with each iteration, and that new dictionary is what gets passed along to the next iteration as “output”.  The new dictionary is created by taking the old one, breaking it into a list of tuples, adding a new tuple to that list, and then turning it all into a dictionary again.

There is at least one other way to do this, but it’s going to make the above code seem like beautiful, classic Python in comparison: Remember that we cannot assign to a dictionary within a lambda.  However, we can invoke methods, including the dict.update method, which lets us merge one dictionary into another.  The thing is, dict.update, like most methods in Python that modify data structures, returns None.  If we’re willing to take some risks (and why stop now?), we can use the following code to modify our existing dictionary and then return it to the next iteration:

reduce(lambda d, current: d.update({current : ord(current)}) or d, 'abc', {})

The above code takes advantage of the fact that “and” and “or” are short-circuit operators.  We first invoke d.update on our dictionary, which returns None.  Our “or” operator then says, “Well, I’d better go to the second argument, because the first one returned a false-y value.”  Sure enough, our dictionary — because it isn’t empty — is a true value, and that is what the expression returns, passing our updated dictionary along to the next iteration.

A final way to do this would be to hide the modification of the dictionary behind an external function.  That is, we have a function do the dirty work for us.  That’ll certainly work, although I see it as somewhat less fun than lambdas and trying to work around their limitations.

def update_and_return(d, new_key, new_value):
    d.update({new_key: new_value})
    return d

>>> reduce(lambda output, current: update_and_return(output, current, ord(current)), 'abc', {})

{'a': 97, 'b': 98, 'c': 99}

Fun stuff!  And, perhaps, the worst method ever for getting an ASCII table into a dictionary.

Next time, we’ll see how to do this with Ruby.