The Shoelace

It’s not the large things that send a man to the madhouse. Death he’s
ready for, or murder, incest, robbery, fire, flood… No, it’s the continuing
series of small tragedies that send a man to the madhouse… Not the death
of his love but a shoelace that snaps with no time left…

-Charles Bukowski

I absolutely love the R language; I think it’s both amazingly powerful and very simple to work with. Moreover, R has essentially become the lingua franca of statistical computing. The depth and breadth of its statistical packages are unrivaled.

In the last few years we’ve seen the popularity of R increase quite a bit. Of course, there are a multitude of reasons behind the growth, but at least one reason would have to be the monumentally popular packages by Hadley Wickham. His packages—ggplot2, dplyr, tidyr, to name a few—are among the most popular of any R packages, and have truly changed the way that R code is written.

As the quote above suggests, this post is about minor annoyances. I was recently annoyed by a not-so-obvious feature of the tibble data type (Hadley’s answer to the data.frame). In addition to an odd name, tibbles have a few nice features that traditional R data frames do not (e.g., column type information and sane printing behavior). But as I recently discovered, they also have a surprising nuance. Consider the following code.

R> library(dplyr)

R> dat <- data.frame("x1" = c(31, 43, 59), 
                     "x2" = c("a", "b", "c"),
                     "x3" = c(7, 8, 9))

R> dat_tib <- as_data_frame(dat)

R> dat_tib[1, 1]

What does the code above do, and what makes it annoying? I’m glad you asked about that. The code above creates a data.frame object with three columns that are cleverly called “x1”, “x2”, and “x3”. Next, we use the as_data_frame() function to create an object of type tibble.

Incidentally, I don’t love the fact that the function as_data_frame() returns an object of type tibble, but let’s save that for another time. Instead, I’m bothered by what dat_tib[1, 1] returns. You might have the exceedingly reasonable expectation that dat_tib[1, 1] would return the first element of the first column in dat_tib, which in this case would be the value 31.

Well, I’m afraid to say that your reasonable expectation is not correct. The call dat_tib[1, 1] actually returns a 1-by-1 tibble with a single element: 31. If you want the actual value 31 returned, you would need to use either double brackets commonly used with list objects in R (i.e., dat_tib[[1, 1]]), or you’d need the dollar-sign notation for column indexing (i.e., dat_tib$x1[1]).
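To make the three access patterns concrete, here is a minimal sketch. It assumes the tibble package is installed and builds the tibble directly with tibble() rather than as_data_frame(), which recent tibble releases have deprecated. The names single, double, and dollar are only for illustration.

```r
library(tibble)

# Same three columns as the example above, built directly as a tibble
dat_tib <- tibble(x1 = c(31, 43, 59),
                  x2 = c("a", "b", "c"),
                  x3 = c(7, 8, 9))

single <- dat_tib[1, 1]    # single brackets: still a tibble
double <- dat_tib[[1, 1]]  # double brackets: the bare value
dollar <- dat_tib$x1[1]    # pull the column vector, then index it

is_tibble(single)          # TRUE
dim(single)                # 1 1
is.numeric(double)         # TRUE
identical(double, dollar)  # TRUE
```

Both the double-bracket and dollar-sign forms hand back a plain numeric vector of length one, which is usually what you want when plucking a single cell.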

This nuance isn’t exactly a secret; it’s actually prominently displayed in the docs. But I wasn’t aware of it until recently, and I found myself baffled by the result of an expression along the lines of dat_tib[1, 1] == 31. Of course, the object on the left-hand side was a 1-by-1 tibble rather than the bare value 31, so the expression did not evaluate to the plain TRUE I was expecting. A stricter check such as identical(dat_tib[1, 1], 31) returns FALSE, because a tibble is clearly not the same object as the number 31.
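The exact expression that tripped me up isn’t reproduced here, so the following is only a sketch of the mismatch. It assumes the tibble package and uses identical() as a strict stand-in for the comparison.

```r
library(tibble)

dat_tib <- tibble(x1 = c(31, 43, 59))

# A 1-by-1 tibble is not the same object as the number 31
identical(dat_tib[1, 1], 31)    # FALSE

# Extracting the value first gives the comparison you probably meant
identical(dat_tib[[1, 1]], 31)  # TRUE
dat_tib[[1, 1]] == 31           # TRUE
```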

There is a lot to be said for consistency. And from that perspective, the fact that the single-bracket notation always returns a tibble is quite nice—even if it disagrees with data.frame object conventions. However, from a practical point of view, I would have thought it was widely accepted that when trying to get a single element from an array-like object, you’re interested in the actual value. You’re probably not interested in getting a 1-by-1 slice of the original array.
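For contrast, here is how a plain data.frame behaves, using base R only. By default the single cell is dropped to a bare value, and you have to ask for the 1-by-1 slice explicitly with drop = FALSE. The names value and slice are illustrative.

```r
dat <- data.frame(x1 = c(31, 43, 59),
                  x2 = c("a", "b", "c"),
                  x3 = c(7, 8, 9))

value <- dat[1, 1]                # default: dimensions dropped, bare value
slice <- dat[1, 1, drop = FALSE]  # opt in to keeping a 1-by-1 data.frame

is.numeric(value)     # TRUE
is.data.frame(slice)  # TRUE
dim(slice)            # 1 1
```

So the two types simply pick opposite defaults: data.frame drops to a value unless told otherwise, while the tibble keeps the slice.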

I suppose this is a good reminder to always read the docs. And be careful when you tie your shoes.
