Your browser doesn't appear to support the HTML5 canvas element.

Monday 16 July 2007

SORTING DATA FRAMES IN R

This is a grandfathered post copied across from my old blog when I was using MovableType (who remembers MovableType?!)

I frequently find myself having to re-order rows of a data.frame based on the levels of an ordered factor in R.

For example, I want to take this data.frame:

       product store sales
        1       a    s1    12
        2       b    s1    24
        3       a    s2    32
        4       c    s2    12
        5       a    s3     9
        6       b    s3     2
        7       c    s3    29
And sort it so that the sales data from the stores with the most sales occur first:
   product store sales
   3       a    s2    32
   4       c    s2    12
   5       a    s3     9
   6       b    s3     2
   7       c    s3    29
   1       a    s1    12
   2       b    s1    24
I keep forgetting the exact semantics of how its done and Google never offers any assistance on the topic, so here is a quick post to get it down once and for all, both for my own benefit and the greater good. First we need some data:
   productSalesByStore = data.frame(
         product = c('a', 'b', 'a', 'c', 'a', 'b', 'c'),
         store = c('s1', 's1', 's2', 's2', 's3', 's3', 's3'),
         sales = c(12, 24, 32, 12, 9, 2, 29)
      )
Now construct a sorted summary of sales by store:
   storeSalesSummary =
         aggregate(
                  productSalesByStore$sales,
                  list(store = productSalesByStore$store),
         sum)
   storeSalesSummary =
      storeSalesSummary[ 
         order(storeSalesSummary$x, decreasing=TRUE), 
         ]
storeSalesSummary should look like this:
   store  x
    2    s2 44
    3    s3 40
    1    s1 36
Use that summary data to construct an ordered factor of store names:
   storesBySales =
      ordered(
         storeSalesSummary$store,
         levels=storeSalesSummary$store
         )
storesBySales is now an ordered factor that looks like this:
      [1] s2 s3 s1
  Levels: s2 < s3 < s1
Re-construct productSalesByStore$store so that it is an ordered factor with the same levels as storesBySales
   productSalesByStore$store =
      ordered(productSalesByStore$store, levels=storesBySales)
Note that neither the contents nor the order of productSalesByStore has changed (yet). Just the datatype of the store column. Finally, we use the implicit ordering of store to generate an explicit permutation of productSalesByStore so that we can sort the rows in a stable manner:
   productSalesByStore = 
      productSalesByStore[ order(productSalesByStore$store), ]
And we are done!