'Identifying non-sequential tibble columns by index, position or number in dplyr

A fairly common data issue in some circles is coding an instrument, for example this one, where related items are separated in the instrument. The idea is to avoid cuing the respondent that all of these questions - say all those beginning with A or D in this example, are related.

rm(list=ls())
Column_Names = c('A2','B1','D1','A1','B3','B2','D3','A3','A4','D2')
Matrix <- matrix(nrow=10,ncol=10,floor(runif(n=100,min=0, max=10)))
Data <- as_tibble(Matrix)
names(Data) = Column_Names
Identifier = sample(letters, size=10)
Data <- Data %>% mutate(Identifier = Identifier)
rm(Matrix, Column_Names, Identifier)

Scoring these instruments typically requires you to take all the related items (B1, B2, B2,,) and do something with them. This is often something very simple - a mean, and it may be the case that some items have to be recoded (0=9, 1=8, 2-7,3-6 ...). All of that is easy. Typically there are sub-scales (A,B,C,D here), and these are calculated on fixed subsets of the variables - A1,A2,A3 might be one subscale, and D1,D2 and D3 another, for example.

The hard bit is selecting the columns. What happens in real examples, is there might be 50 items, and they should be in a fixed order. (These instruments provide the questions in a fixed order). The tricky bit is that different investigators will use different NAMES for the SAME variables. They don't use names like the ones in my example, where each subscale has a name that refers directly to the subscale.

There are only two sane ways to do this, and get it right (I think). One is to rename the variables to a common set of names, and score these. The other is to pick the variables out by position - so subscale A is columns 1, 3 and 8, subscale B is columns 2,5 and 6.

This is trivial, if you put the variable numbers in your code.

Data %>% select(2,5,6) %>% names()

It's not very generalisable. I have a case where I have 70 variables, 8 subscales with 4 to 8 variables in each, and an overall score with 40 variables. I would like, but can't see a way, to read in the column numbers from a file, and pull them out.

Scales <- c('A','B','D')
Numbers <- c('1,4,8,9','2,5,6','3,7,10')
Scale <- tibble(Scales,Numbers)

I'd like to get them out

Data %>% select(2,5,6) %>% names()

works, but

    Data %>% select(Scale$Numbers[1]) %>% names()
    Data %>% select(as.numeric(Scale$Numbers[1])) %>% names()

don't, as do many efforts involving quo, !!, and the like. I know Hadley Wickham disapproves of using column positions and I get why, but this is a reasonable use case, maybe the only one I've ever come across.

This is what I'm doing for this particular study -

Somatic <- Questionnaire1 %>%
  select(   1,7,16,32,43,
          45,50,51, Identifier) %>%
  rowwise(Identifier) %>%
  mutate(Somatic = mean(c_across(1:8),na.rm=TRUE)) %>%
  select(Identifier, Somatic)
  
Cognitive  <- Questionnaire %>%
  select( 2, 4, 8,11,15,
         21,22,23,36,42,
         46,54,59, Identifier) %>%
  rowwise(Identifier) %>%
  mutate(Cognitive = mean(c_across(1:13),na.rm=TRUE)) %>%
  select(Identifier, Cognitive)

In this application, there are nine of these for 8 subclasses, and one overall score. In this study, there are 5 different instruments being used, with five different sets of variables, and a total of twenty subscales. This is why I want to have a programmatic solution.

Suggestions welcomed! Thanks Anthony

r dplyr

Solution 1:^[1]

It sounds like regardless you are going to need to create some type of mapping. Consider this approach to create a string of attributes that capture the group each question belongs to. I used "I" for the Identifier as the length needs to match the number of columns.



Attributes <- c("A", "B", "D", "A", "B", "B", "D", "A", "A", list(c("B", "D")), "I")
attr(Data, "Group") <- Attributes

Data %>%
    select_if(purrr::map_lgl(attr(Data, "Group"), `%in%`, x = "B")) %>% names()
#[1] "B1" "B3" "B2" "D2"

Solution 2:^[2]

Here is another approach so you don't have to identify the column numbers in advance.

Data %>%
    select(starts_with("B")) %>%
    names()
#[1] "B1" "B3" "B2"

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1
Solution 2	stomper

'Identifying non-sequential tibble columns by index, position or number in dplyr

Solution 1:[1]

Solution 2:[2]

Sources

Related Questions

Solution 1:^[1]

Solution 2:^[2]