Let us now praise ugly code!

In this blog post I will revisit the first piece of code I wrote with the R Programming language, back in the early part of this decade.

Coming from an Octave/MATLAB background, I really enjoyed the functional nature of R. I imagined flinging vectors into Matrices, collapsing them with dot-products, Tetris like. I refused to write a single for loop... I framed everything as functions and maps. As I gained experience with R, I found pipes and data wrangling libraries, but early on, my code was pretty ugly, as you will see shortly.

I have a project that keeps track of comic books, their publishers, their prices and their customers. The model stores data in excel and to make things readable, I use a columnar store. In this way, I can quickly add new entries to the table by adding columns. Each column has an arbitrary number of rows. I know this might not be the best way to store data, but bear with me here. This blog looks at the processing of that data, not the storing of the data. Besides, in the real world, you sometimes have no choice but to start with ugly data.

The Ugly Way...

Let us proceed. First, take a look at Titles:

> Titles.orig <- data.frame(DC=c('Batman',
                                 'Superman',
                                 'Captain_Marvel',
                                 ''),
                          Image=c('Youngblood',
                                  'Spawn',
                                  '',
                                  ''),
                          Marvel=c('Spiderman',
                                   'Iron_Man',
                                   'Cable',
                                   'Doctor_Strange'),
                          stringsAsFactors = FALSE)

> Titles.orig

              DC      Image         Marvel
1         Batman Youngblood      Spiderman
2       Superman      Spawn       Iron_Man
3 Captain_Marvel                     Cable
4                           Doctor_Strange

Notice that a rotation doesn't really buy us anything. Instead of an arbitrary number of rows for each entry, a rotation gets us an arbitrary number of columns.

> t(Titles.orig)
       [,1]         [,2]       [,3]             [,4]            
DC     "Batman"     "Superman" "Captain_Marvel" ""              
Image  "Youngblood" "Spawn"    ""               ""              
Marvel "Spiderman"  "Iron_Man" "Cable"          "Doctor_Strange"

When I process Titles.orig R, I first transform it to a key-value store. My approach relies on data frame index logic (commands inside the [] brackets).

In my original approach, I create two vectors, one that repeats the column several times, and another that un-packs (unlists) the data. When I put them together, I get key-value pairs (with some empties).

My first vector repeats each column name n times, with n being the number of rows. Since the data frame has four rows, I repeat each column name four times. I first try the rep() function.

> Titles <- Titles.orig
> rep(names(Titles),nrow(Titles))
 [1] "DC"     "Image"  "Marvel" "DC"     "Image"  "Marvel" "DC"     "Image"  "Marvel" "DC"    
[11] "Image"  "Marvel"

This attempt fails. I want it in the form: 'DC, DC, DC, DC, Image, Image etc.'

After a few Google searches, I find that matrix()allows us to stack rows, so I stuff the repeat statement into matrix():

> matrix(rep(names(Titles),nrow(Titles)),nrow=nrow(Titles))

     [,1]     [,2]     [,3]    
[1,] "DC"     "Image"  "Marvel"
[2,] "Image"  "Marvel" "DC"    
[3,] "Marvel" "DC"     "Image" 
[4,] "DC"     "Image"  "Marvel"

Close, but not quite what I need. I then add the byrow flag:

> matrix(rep(names(Titles),nrow(Titles)),nrow=nrow(Titles),byrow='T')

     [,1] [,2]    [,3]    
[1,] "DC" "Image" "Marvel"
[2,] "DC" "Image" "Marvel"
[3,] "DC" "Image" "Marvel"
[4,] "DC" "Image" "Marvel"

From here, we convert to a vector:

> as.vector(matrix(rep(names(Titles),nrow(Titles)),nrow=nrow(Titles),byrow='T'))

 [1] "DC"     "DC"     "DC"     "DC"     "Image"  "Image"  "Image"  "Image"  "Marvel" "Marvel"
[11] "Marvel" "Marvel"

As you can see, vector works "down the column" by default (which makes sense, since columns are vectors).

Let's move past the titles. To create a vector from our data, we need to unlist() the data first and then vectorize it:

> as.vector(unlist(Titles))

 [1] "Batman"         "Superman"       "Captain_Marvel" ""               "Youngblood"    
 [6] "Spawn"          ""               ""               "Spiderman"      "Iron_Man"      
[11] "Cable"          "Doctor_Strange"

I bind these two vectors together as columns and then create a data frame.

> Titles <-  data.frame(cbind(as.vector(matrix(rep(names(Titles),nrow(Titles)),
                                             nrow=nrow(Titles),byrow='T')),
                            as.vector(unlist(Titles))))
> Titles

       X1             X2
1      DC         Batman
2      DC       Superman
3      DC Captain_Marvel
4      DC               
5   Image     Youngblood
6   Image          Spawn
7   Image               
8   Image               
9  Marvel      Spiderman
10 Marvel       Iron_Man
11 Marvel          Cable
12 Marvel Doctor_Strange

I give names to the data:

> names(Titles) <- c('publisher','title')

And then remove the empty rows. A lot of my early code follows this convention. I scan a data frame with index logic, using a comma to separate row and column logic. In the line below, I scan the index to return only rows that have a non-empty title, and return all columns. Such syntax appears a little confusing, as I reference the data frame Titles in three separate parts.

> Titles <- Titles[which(Titles$title != ""),]

> Titles
   publisher          title
1         DC         Batman
2         DC       Superman
3         DC Captain_Marvel
5      Image     Youngblood
6      Image          Spawn
9     Marvel      Spiderman
10    Marvel       Iron_Man
11    Marvel          Cable
12    Marvel Doctor_Strange

The Pretty Way...

Let's recap. We had nested hell to transform the columnar table to a key-value table, and then we needed two more commands to name the data frame columns and remove the empties.

With pipes (dplyr and magrittr) and tidyr, we can produce the same result with one line of code.

> library("dplyr")
> library("magrittr")
> library("tidyr")
> Titles <- Titles.orig
> Titles %>% 
    gather(publisher,title) %>% 
    filter(nzchar(title))

  publisher          title
1        DC         Batman
2        DC       Superman
3        DC Captain_Marvel
4     Image     Youngblood
5     Image          Spawn
6    Marvel      Spiderman
7    Marvel       Iron_Man
8    Marvel          Cable
9    Marvel Doctor_Strange

To dump and then set the variable, we use the %<>% pipe.

> Titles %<>% 
    gather(publisher,title) %>% 
    filter(nzchar(title))

More Pretty Code

Now we have a separate table of customers. This is a more traditional table, and we can arbitrarily add columns and rows as we see fit.

> Customers <- data.frame(title = c('Batman',
                                    'Superman',
                                    'Captain_Marvel',
                                    'Youngblood',
                                    'Spawn',
                                    'Spiderman',
                                    'Iron_Man',
                                    'Cable',
                                    'Doctor_Strange'),
                        Micky = c(2,0,0,0,0,0,2,0,1),
                        Mike = c(5,1,1,1,1,1,1,1,1),
                        Peter = c(1,1,0,0,0,1,1,2,0),
                        Davy = c(2,7,1,5,1,2,0,0,1),
                        stringsAsFactors=FALSE)
> Customers

           title Micky Mike Peter Davy
1         Batman     2    5     1    2
2       Superman     0    1     1    7
3 Captain_Marvel     0    1     0    1
4     Youngblood     0    1     0    5
5          Spawn     0    1     0    1
6      Spiderman     0    1     1    2
7       Iron_Man     2    1     1    0
8          Cable     0    1     2    0
9 Doctor_Strange     1    1     0    1

Let's try the gather function on this table to see what we get. We want each row to contain the comic title, the customer name, and the quantity they want to purchase.

> Customers %>% 
    gather(customer,qty) %>% 
    suppressWarnings %>% 
    head(12)

   customer            qty
1     title         Batman
2     title       Superman
3     title Captain_Marvel
4     title     Youngblood
5     title          Spawn
6     title      Spiderman
7     title       Iron_Man
8     title          Cable
9     title Doctor_Strange
10    Micky              2
11    Micky              0
12    Micky              0

As you can see, this is not what we want. For correct syntax, we need to specify a start and end column.

> Customers %>% 
    gather(customer,qty,Micky:Davy) %>% 
    head(12)

            title customer qty
1          Batman    Micky   2
2        Superman    Micky   0
3  Captain_Marvel    Micky   0
4      Youngblood    Micky   0
5           Spawn    Micky   0
6       Spiderman    Micky   0
7        Iron_Man    Micky   2
8           Cable    Micky   0
9  Doctor_Strange    Micky   1
10         Batman     Mike   5
11       Superman     Mike   1
12 Captain_Marvel     Mike   1

I have an issue with this code in that I need to refactor it each time I add a new customer.

To future proof, we modify the code as follows:

> Customers %>% 
    gather(customer,qty,2:ncol(Customers)) %>% 
    head(12)

In a separate table I have prices for each title.

> Price <- data.frame(title = c('Batman',
                                'Superman',
                                'Captain_Marvel',
                                'Youngblood',
                                'Spawn',
                                'Spiderman',
                                'Iron_Man',
                                'Cable',
                                'Doctor_Strange'), 
                    price = c(1.95,
                              1.95,
                              2.95,
                              2.95,
                              1.75,
                              1.75,
                              3.95,
                              3.95,
                              1.95), 
                    stringsAsFactors = FALSE )
> Price

           title price
1         Batman  1.95
2       Superman  1.95
3 Captain_Marvel  2.95
4     Youngblood  2.95
5          Spawn  1.75
6      Spiderman  1.75
7       Iron_Man  3.95
8          Cable  3.95
9 Doctor_Strange  1.95

We can easily add a price column to Customers with the merge() function:

> Customers %>% 
    merge(Price)

           title Micky Mike Peter Davy price
1         Batman     2    5     1    2  1.95
2          Cable     0    1     2    0  3.95
3 Captain_Marvel     0    1     0    1  2.95
4 Doctor_Strange     1    1     0    1  1.95
5       Iron_Man     2    1     1    0  3.95
6          Spawn     0    1     0    1  1.75
7      Spiderman     0    1     1    2  1.75
8       Superman     0    1     1    7  1.95
9     Youngblood     0    1     0    5  2.95

Pretty Showdown: Hard vs. Easy

How do we find per customer totals? I'll show a hard way and an easy way. Let's look at the pipe/ dplyr/ tydr method first.

First, we narrow the table and merge with price:

> Customers %>% 
    gather(customer,qty,2:ncol(Customers)) %>%
    merge(Price) %>% 
    head(12)

            title customer qty price
1          Batman    Micky   2  1.95
2          Batman     Davy   2  1.95
3          Batman    Peter   1  1.95
4          Batman     Mike   5  1.95
5           Cable     Davy   0  3.95
6           Cable    Peter   2  3.95
7           Cable     Mike   1  3.95
8           Cable    Micky   0  3.95
9  Captain_Marvel     Davy   1  2.95
10 Captain_Marvel    Peter   0  2.95
11 Captain_Marvel     Mike   1  2.95
12 Captain_Marvel    Micky   0  2.95

Then, we add a fifth column that calculates the subtotal:

> Customers %>% 
    gather(customer,qty,2:ncol(Customers)) %>% 
    merge(Price) %>% 
    mutate(subtotal= qty * price) %>% 
    head(12)

            title customer qty price subtotal
1          Batman    Micky   2  1.95     3.90
2          Batman     Davy   2  1.95     3.90
3          Batman    Peter   1  1.95     1.95
4          Batman     Mike   5  1.95     9.75
5           Cable     Davy   0  3.95     0.00
6           Cable    Peter   2  3.95     7.90
7           Cable     Mike   1  3.95     3.95
8           Cable    Micky   0  3.95     0.00
9  Captain_Marvel     Davy   1  2.95     2.95
10 Captain_Marvel    Peter   0  2.95     0.00
11 Captain_Marvel     Mike   1  2.95     2.95
12 Captain_Marvel    Micky   0  2.95     0.00

Then, we sum the subtotal for each customer. We can achieve this with ease using the group_by() and summarize() functions:

> Customers %>% 
    gather(customer,qty,2:ncol(Customers)) %>% 
    merge(Price) %>% 
    mutate(subtotal= qty * price) %>% 
    group_by(customer) %>% 
    summarize(sum(subtotal))

# A tibble: 4 x 2
  customer sum(subtotal)
     <chr>         <dbl>
1     Davy         42.45
2    Micky         13.75
3     Mike         30.95
4    Peter         17.50

POP quiz... did we just execute the hard or easy method to find the totals? I will show you the easy way next and you can decide for yourself. In short, we can solve this problem with simple linear algebra.

We first create our vector

> x <- Price$price

Then our matrix

> A <- Customers %>% 
         select(Micky:Davy) %>% 
         as.matrix()

We do a simple dot product and we're done:

> x %*% A

     Micky  Mike Peter  Davy
[1,] 13.75 30.95  17.5 42.45

We could also do it in one line:

> Price$price %*% 
    (Customers %>% 
       select(Micky:Davy) %>% 
    as.matrix())

     Micky  Mike Peter  Davy
[1,] 13.75 30.95  17.5 42.45

My Octave/ MATLAB experience led me to use linear algebra right out of the gate. Sometimes, even in the face of fancy new functions, it turns out I produce beautiful code on the first try.