'Summarizing repeated items by student attempt

Problem and explanation of the expected output

I have raw data of how many attempts each student had in an exam and which items they responded (input). For this example, I have a pool of 5 items, but students only respond to 3 of those items. They can take the exam multiple times. Hence, I would like to summarize the raw data in a way that I can see all info below:

Trial number (e.g., 1st trial, 2nd trial, 3rd trial...). trial column
How many correct items (e.g., total correct items) How many total items they've responded to (in this case, it is supposed to be 3 for everyone),
Percent of correct. percent column
Main problem: How many of the items had they seen before? repeated column (e.g., if they responded to item 1 in the first trial and also responded to item 1 in the second trial, then I want to know this info)
- If there are repeated items, I want to see which repeated items were there. repeated_items column

Input

As input, I have the raw data, with students name, the number of the trial (e.g., 1, 2, 3...), and all items from the pool i1:i5. The values for the items (0, 1) shows whether the student got it right or wrong (1 = correct, 0 = wrong). And missing values show that the student didn't take that item on that attempt.

library(dplyr)

df <- tibble(name  = c("John", "John", "Mary", "Mary"),
             trial = c(1,  2,  1,  2),
             i1    = c(1,  0,  0,  NA),
             i2    = c(NA, NA, 1,  1),
             i3    = c(NA, 1,  NA, 1),
             i4    = c(0,  1,  1,  NA),
             i5    = c(0,  NA, NA, 1))

# # A tibble: 4 × 7
#   name  trial    i1    i2    i3    i4    i5
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 John      1     1    NA    NA     0     0
# 2 John      2     0    NA     1     1    NA
# 3 Mary      1     0     1    NA     1    NA
# 4 Mary      2    NA     1     1    NA     1

Expected Output

As output I want to see a summary table like this one below, showing all points I've put in the description above. I've created this table manually, and now I'm trying to organize it with functions, so I can speed up this process. My main problem is to generate columns repeated and repeated_items. The first trial of each students will always have NA for repeated_items and zero for repeated given this is the first trial, so they've never seen those items before.

repeated column counts how many items the participant had already seen in a previous attempt. So for instance, in John's first attempt, he responded to items i1, i4, and i5. In his second attempt, he responded to items i1, i3, and i4. Hence, in his second attempt, he responded to two items that were previously seen (i1, and i4). So I want to use the column repeated to count how many items were repeated from the previous attempt.

repeated_item column tracks which specific items were "repeated" in this new trial. So, in John's case, in the second trial, items i1 and i4 were repeated. And in Mary's second trial, only item i2 was repeated when compared to her first trial.

 # # A tibble: 4 × 7
 #   name  trial correct n_items percent repeated repeated_items
 # <chr> <dbl>   <dbl>   <dbl>   <dbl>    <dbl> <chr>         
 # 1 John      1       1       3   0.333        0 NA            
 # 2 John      2       2       3   0.667        2 i1, i4        
 # 3 Mary      1       2       3   0.667        0 NA            
 # 4 Mary      2       3       3   1            1 i2

r dplyr tidyverse data-wrangling apache-spark pyspark amazon-emr

Solution 1:^[1]

library(tidyverse)

df %>% 
  mutate(correct = rowSums(across(starts_with("i")), na.rm = T),
         n_items = rowSums(!is.na(across(starts_with("i")))),
         percent = correct / n_items) %>% 
  group_by(name) %>% 
  mutate(repeated_items = across(starts_with("i")) %>% 
           imap_chr(~ ifelse(!is.na(diff(.)), .y, NA)) %>% 
           na.omit() %>% 
           str_flatten(","), 
         repeated_items = ifelse(row_number() == 1, NA, repeated_items),
         repeated = replace_na(str_count(repeated_items, ",") + 1, 0)) %>% 
  ungroup()

Output

  name  trial    i1    i2    i3    i4    i5 correct n_items percent repeated_items repeated
  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl> <chr>             <dbl>
1 John      1     1    NA    NA     0     0       1       3   0.333 NA                    0
2 John      2     0    NA     1     1    NA       2       3   0.667 i1,i4                 2
3 Mary      1     0     1    NA     1    NA       2       3   0.667 NA                    0
4 Mary      2    NA     1     1    NA     1       3       3   1     i2                    1

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	LMc