'Using gtools::mixedsort or alternatives with dplyr::arrange

I am trying to order a dataframe by making use of dplyr::arrange. The issue is that the column I am trying to sort on contains both a fixed string followed by a number, as for instance generated by the dummycode below.

  dummydf<-data.frame(values=rnorm(100),sortcol=paste0("ABC",sample(1:100,100,replace=FALSE)))

By default, using dummydf %>% arrange(sortcol) would generate a df which is sorted alphanumerically (?) but this is of course not the desired result:

values sortcol
0.708081720    ABC1
0.041348322   ABC10
1.730962886  ABC100
0.423480861   ABC11
-1.545837266   ABC12
-1.345539947   ABC13
-0.078998792   ABC14
0.088712174   ABC15
0.670583024   ABC16
1.238837680   ABC17
-1.459044293   ABC18
-2.028535223   ABC19
0.779514385    ABC2
1.360509910   ABC20

In this example, I would like to sort the column as gtools::mixedsort would do, making sure ABC2 follows ABC1 and is not preceed by ABC1-19 and ABC100 mixedsort(as.character(dummydf$sortcol)) would do that trick.

Now, I am aware I could do this by using sub in my arrange argument: dummydf %>% arrange(as.numeric(sub("ABC","",sortcol))) but that is mainly because my string is something fixed (although any regex could be used to capture the last digits following any string I suppose).

I am just wondering: is there a more "elegant" and generic way to get this done with dplyr::arrange, in the same fashion as gtools::mixedsort?

Kind regards,

FM



Solution 1:[1]

Here's a functional solution making use of the mysterious identity order(order(x)) == rank(x).

mixedrank = function(x) order(gtools::mixedorder(x))
dummydf %>% dplyr::arrange(mixedrank(sortcol))

Solution 2:[2]

I don't see this answer posted so I'll throw it out. You can use mixedorder with slice to arrange it.

dummydf %>% 
  slice(mixedorder(sortcol))

Solution 3:[3]

Using data.table

library(data.table)
dummydf = data.table(dummydf)
dummydf[gtools::mixedorder(as.character(sortcol))]

Honestly just copied your example and stuck it in as the select argument in the data.table syntax. You already did all the hard work :).

Solution 4:[4]

Credit to Akhil Nair for his data.table answer which is what the first code snippet derives from. If you like the data.table answer but still want magrittr piping, you can consider calculating a new column and using piping with data.table to get your output:

dummydf %>% 
  dplyr::mutate(row_lookup = gtools::mixedorder(as.character(sortcol))) %>%
  data.table::data.table() %>% 
  .[.$row_lookup]

I think it's debatable whether that helps or detracts from the readability.

If you don't want to call data.table, you can go through some extra contortions to calculate a column you can use dplyr::arrange on. Here's one example:

library(dplyr)
bind_cols(dummydf,
          dummydf %>% 
            tibble::rowid_to_column("order") %>% 
            mutate(rowname = gtools::mixedorder(as.character(sortcol))) %>% 
            arrange(rowname) %>% 
            select(order)) %>% 
arrange(order)

I think this code is more confusing to read and isn't worth those extra contortions to avoid data.table.

Solution 5:[5]

Here is a solution that will allow for sorting if there are repeats and multiple conditions to sort. Most previous answers are not generic: they freeze the ordering at level 1.

df <- data.frame(values = rnorm(100),
                 sortcol1 = paste0("ASORT", sample(1:100, 100, replace = TRUE)),
                 sortcol2 = paste0("BSORT", sample(1:100, 100, replace = TRUE)),
                 stringsAsFactors = F)

df %>%
mutate(
    `sortcol1` = factor(`sortcol1`, ordered = T, levels = unique(gtools::mixedsort(`sortcol1`))),
    `sortcol2` = factor(`sortcol2`, ordered = T, levels = unique(gtools::mixedsort(`sortcol2`)))
) %>%
arrange(`sortcol1`, `sortcol2`)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 bischrob
Solution 3 Akhil Nair
Solution 4
Solution 5 Viktor Horváth