'Coalesce columns and create another column to specify source

I'm using dplyr::coalesce() to combine several columns into one. Originally, across columns, each row has only one column with actual value while the other columns are NA. Based on the coalescing, I want to create an additional column that will specify the source column from which the coalesced value was taken from.

My attempt is inspired by existing functionality in other dplyr functions. For example, dplyr::bind_rows() has .id argument that specifies the source dataframe for each row in the new dataframe.

From bind_rows()'s documentation:

When .id is supplied, a new column of identifiers is created to link each row to its original data frame. The labels are taken from the named arguments to bind_rows(). When a list of data frames is supplied, the labels are taken from the names of the list. If no names are found a numeric sequence is used instead.

Again, my current question is about coalesce(), not bind_rows(), but I just wanted to put it in context.

Data

df <-
  data.frame(
  group_1 = c(NA, NA, NA, NA, 2),
  group_2 = c(NA, 4, NA, NA, NA),
  group_3 = c(NA, NA, 5, NA, NA),
  group_4 = c(1, NA, NA, 2, NA),
  group_5 = c(NA, NA, NA, NA, NA)
)

df

##   group_1 group_2 group_3 group_4 group_5         ## each row
## 1      NA      NA      NA       1      NA         ## has one value
## 2      NA       4      NA      NA      NA         ## and the rest
## 3      NA      NA       5      NA      NA         ## are NAs
## 4      NA      NA      NA       2      NA
## 5       2      NA      NA      NA      NA

Combining the columns into one (additional) column

library(dplyr)

df %>%
  mutate(one_col = coalesce(group_1, group_2, group_3, group_4, group_5))

##   group_1 group_2 group_3 group_4 group_5 one_col
## 1      NA      NA      NA       1      NA       1
## 2      NA       4      NA      NA      NA       4
## 3      NA      NA       5      NA      NA       5
## 4      NA      NA      NA       2      NA       2
## 5       2      NA      NA      NA      NA       2



How can I add yet another column that will specify the "source", i.e., from which column the value in one_col was taken from?

Desired output

  group_1 group_2 group_3 group_4 group_5 one_col source_col
1      NA      NA      NA       1      NA       1    group_4
2      NA       4      NA      NA      NA       4    group_2
3      NA      NA       5      NA      NA       5    group_3
4      NA      NA      NA       2      NA       2    group_4
5       2      NA      NA      NA      NA       2    group_1


EDIT


@Karthik's answer below led me thinking that the example data I used above demonstrates a situation that is too narrow and specific. The solution Karthik offers is independent of the coalescing operation. Thus, the code would still work if we swap the order and create the source_col first and only then coalesce.

However, if the data had more than one NA per row, coalesce would still do its thing, but we could no longer base source_col on finding the single non-missing value. Therefore, I'm revising the question and the data.

Data

df_2 <-
  data.frame(
  group_1 = c(NA, NA, NA, NA, 2),
  group_2 = c(NA, 4, NA, NA, 1),
  group_3 = c(NA, NA, 5, NA, NA),
  group_4 = c(1, NA, NA, 2, NA),
  group_5 = c(NA, 3, NA, NA, NA)
)

> df_2

##   group_1 group_2 group_3 group_4 group_5
## 1      NA      NA      NA       1      NA   ## <--- one non-NA
## 2      NA       4      NA      NA       3   ## <--- *two* non-NA
## 3      NA      NA       5      NA      NA   ## <--- one non-NA
## 4      NA      NA      NA       2      NA   ## <--- one non-NA
## 5       2       1      NA      NA      NA   ## <--- *two* non-NA

Coalescing

> df_2 %>%
   mutate(one_col = coalesce(group_1, group_2, group_3, group_4, group_5))

##   group_1 group_2 group_3 group_4 group_5 one_col
## 1      NA      NA      NA       1      NA       1
## 2      NA       4      NA      NA       3       4
## 3      NA      NA       5      NA      NA       5
## 4      NA      NA      NA       2      NA       2
## 5       2       1      NA      NA      NA       2

How can I add a source column that will match the value chosen by coalesce() with the original column it came from?

Desired Output

  group_1 group_2 group_3 group_4 group_5 one_col source_col
1      NA      NA      NA       1      NA       1    group_4
2      NA       4      NA      NA       3       4    group_2
3      NA      NA       5      NA      NA       5    group_3
4      NA      NA      NA       2      NA       2    group_4
5       2       1      NA      NA      NA       2    group_1


Solution 1:[1]

Does this work:

df %>%
   mutate(one_col = coalesce(group_1, group_2, group_3, group_4, group_5)) %>% 
rowwise() %>% mutate(group_col = names(df)[!is.na(c_across(group_1:group_5))])
# A tibble: 5 x 7
# Rowwise: 
  group_1 group_2 group_3 group_4 group_5 one_col group_col
    <dbl>   <dbl>   <dbl>   <dbl> <lgl>     <dbl> <chr>    
1      NA      NA      NA       1 NA            1 group_4  
2      NA       4      NA      NA NA            4 group_2  
3      NA      NA       5      NA NA            5 group_3  
4      NA      NA      NA       2 NA            2 group_4  
5       2      NA      NA      NA NA            2 group_1  
>

Updated Answer:

df_2 %>% mutate(one_col = coalesce(group_1, group_2, group_3, group_4, group_5)) %>% rowwise() %>% 
   mutate(group_col = names(df_2)[!is.na(c_across(group_1:group_5))][1])
# A tibble: 5 x 7
# Rowwise: 
  group_1 group_2 group_3 group_4 group_5 one_col group_col
    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>    
1      NA      NA      NA       1      NA       1 group_4  
2      NA       4      NA      NA       3       4 group_2  
3      NA      NA       5      NA      NA       5 group_3  
4      NA      NA      NA       2      NA       2 group_4  
5       2       1      NA      NA      NA       2 group_1  

Solution 2:[2]

This appears to be reshaping the data from wide format to long format.

df2 <- reshape(df,
               direction = 'long',
               varying = 1:5,
               v.names = 'one_col',
               timevar = 'source_col',
               times = paste0('group_', 1:5))

Either of these functions will drop the NA columns returned by reshape

df2[!is.na(df2$one_col), ]
na.omit(df2)

          source_col one_col id
5.group_1    group_1       2  5
2.group_2    group_2       4  2
3.group_3    group_3       5  3
1.group_4    group_4       1  1
4.group_4    group_4       2  4

Solution 3:[3]

Here is a quick base solution:

cbind(df_2,
      t(apply(df_2, 1, function(i){
        c(i[ which(!is.na(i))[1] ],
          colnames(df_2)[ which(!is.na(i))[1] ])
      }))
      )
#   group_1 group_2 group_3 group_4 group_5 1       2
# 1      NA      NA      NA       1      NA 1 group_4
# 2      NA       4      NA      NA       3 4 group_2
# 3      NA      NA       5      NA      NA 5 group_3
# 4      NA      NA      NA       2      NA 2 group_4
# 5       2       1      NA      NA      NA 2 group_1

Pretty sure, this can be done using "which.min/is.na/arrayInd" combo without apply loops, no time at the moment to test.

Solution 4:[4]

If you work with the transpose of the dataframe, then the rownames of the which( . , arr.in) turn out to be exactly the right result.

cbind( df, group_col = rownames( which(!is.na(t(df)), arr.ind=TRUE)) )
  group_1 group_2 group_3 group_4 group_5 group_col
1      NA      NA      NA       1      NA   group_4
2      NA       4      NA      NA      NA   group_2
3      NA      NA       5      NA      NA   group_3
4      NA      NA      NA       2      NA   group_4
5       2      NA      NA      NA      NA   group_1

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 manotheshark
Solution 3 zx8754
Solution 4 IRTFM