'How can I make new observations from columns in R?
I received a dataset from a survey, and it has a rather strange structure. Every row has 90 variables, with the first 15 being general questions and the other 75 being 5 sets of 15 variables, that are actually observations. The concrete case is an employer being asked questions about employees. It holds the following structure:
set.seed(42)
n <- 100
data <- data.frame(
EmployerVar1=1:n,,
EmployerVar2=rnorm(n),
EmployerVar3=rnorm(n),
EmployerVar4=rnorm(n),
EmployerVar5=rnorm(n),
EmployerVar6=rnorm(n),
EmployerVar7=rnorm(n),
EmployerVar8=rnorm(n),
EmployerVar9=rnorm(n),
EmployerVar10=rnorm(n),
EmployerVar11=rnorm(n),
EmployerVar12=rnorm(n),
EmployerVar13=rnorm(n),
EmployerVar14=rnorm(n),
EmployerVar15=rnorm(n),
Employee1Var1=rnorm(n),
Employee1Var2=rnorm(n),
Employee1Var3=rnorm(n),
Employee1Var4=rnorm(n),
Employee1Var5=rnorm(n),
Employee1Var6=rnorm(n),
Employee1Var7=rnorm(n),
Employee1Var8=rnorm(n),
Employee1Var9=rnorm(n),
Employee1Var10=rnorm(n),
Employee1Var11=rnorm(n),
Employee1Var12=rnorm(n),
Employee1Var13=rnorm(n),
Employee1Var14=rnorm(n),
Employee1Var15=rnorm(n),
Employee2Var1=rnorm(n),
Employee2Var2=rnorm(n),
Employee2Var3=rnorm(n),
Employee2Var4=rnorm(n),
Employee2Var5=rnorm(n),
Employee2Var6=rnorm(n),
Employee2Var7=rnorm(n),
Employee2Var8=rnorm(n),
Employee2Var9=rnorm(n),
Employee2Var10=rnorm(n),
Employee2Var11=rnorm(n),
Employee2Var12=rnorm(n),
Employee2Var13=rnorm(n),
Employee2Var14=rnorm(n),
Employee2Var15=rnorm(n),
Employee3Var1=rnorm(n),
Employee3Var2=rnorm(n),
Employee3Var3=rnorm(n),
Employee3Var4=rnorm(n),
Employee3Var5=rnorm(n),
Employee3Var6=rnorm(n),
Employee3Var7=rnorm(n),
Employee3Var8=rnorm(n),
Employee3Var9=rnorm(n),
Employee3Var10=rnorm(n),
Employee3Var11=rnorm(n),
Employee3Var12=rnorm(n),
Employee3Var13=rnorm(n),
Employee3Var14=rnorm(n),
Employee3Var15=rnorm(n),
Employee4Var1=rnorm(n),
Employee4Var2=rnorm(n),
Employee4Var3=rnorm(n),
Employee4Var4=rnorm(n),
Employee4Var5=rnorm(n),
Employee4Var6=rnorm(n),
Employee4Var7=rnorm(n),
Employee4Var8=rnorm(n),
Employee4Var9=rnorm(n),
Employee4Var10=rnorm(n),
Employee4Var11=rnorm(n),
Employee4Var12=rnorm(n),
Employee4Var13=rnorm(n),
Employee4Var14=rnorm(n),
Employee4Var15=rnorm(n),
Employee5Var1=rnorm(n),
Employee5Var2=rnorm(n),
Employee5Var3=rnorm(n),
Employee5Var4=rnorm(n),
Employee5Var5=rnorm(n),
Employee5Var6=rnorm(n),
Employee5Var7=rnorm(n),
Employee5Var8=rnorm(n),
Employee5Var9=rnorm(n),
Employee5Var10=rnorm(n),
Employee5Var11=rnorm(n),
Employee5Var12=rnorm(n),
Employee5Var13=rnorm(n),
Employee5Var14=rnorm(n),
Employee5Var15=rnorm(n))
For analysis the dataset needs to have every observation in a new row, being the employer characteristics with the 15 variables linked to one employee. So the number of rows needs to increase fivefold.
I asked this question once before and while I thought this was the solution, it eventually didn't work. The suggestion was to:
library(tidyr)
X_wide <- data.frame(id = 1:3, P1 = 4:6, P2 = 7:9, P3 = 10:12)
X_long <- pivot_longer(X_wide, cols = P1:P3, names_to = "person", values_to = "score")
X_long <- as.data.frame(X_long)
This does not work as it would collapse all observation-columns in one new variable. All the data needs to remain the same, it only needs to land in a new spot. To clarify, this is the structure that I would need for analysis.
| Col1 | Col2 | Col3 | Col4 | Col5 | Col6 |
|---|---|---|---|---|---|
| EmployerVar1 | EmployerVar2 | EmployerVar3 | Employee1Var1 | Employee1Var2 | Employee1Var3 |
| EmployerVar1 | EmployerVar2 | EmployerVar3 | Employee2Var1 | Employee2Var2 | Employee2Var3 |
| EmployerVar1 | EmployerVar2 | EmployerVar3 | Employee3Var1 | Employee3Var2 | Employee3Var3 |
Solution 1:[1]
You're trying to reshape the data slightly longer.
This is not a simple pivot_longer() because:
- There are some columns that you don't need to reshape - the only columns that need reshaping start with
Employee. You can fix this by usingstarts_with(). - The columns that need reshaping contain 2 parts: the first part (
Employee) needs to create new rows, but the second part (Question) needs to stay in separate columns. For this, you can useseparate()to split the two parts, andpivot_wider().
To show you a possible solution, I recreated a minimum reproducible dataset from scratch.
The problem can be simplified to a minimum dataset of:
- two rows (one for employer)
- two columns that will stay the same
- four columns that will need reshaping (2 per employee and 2 per question)
The sample data you provided is different to your actual data: you used rnorm() to simulate all columns, but you said in the comments that your dataset actually contains mostly categorical data. This is relevant because having different data types will complicate the pivotting. I made an assumption that your dataset contains only categorical or numeric columns.
I also slightly changed Var into .Qst so that it was clearer to me what they meant, and it was easier to split the employee part from the question number part.
library(tidyr)
library(dplyr)
dat <- data.frame(
EmployerVar1 = c("a", "b"),
EmployerVar2 = c("d", "e"),
Employee1.Qst1 = c("A", "B"),
Employee1.Qst2 = c(1.5, 2.5),
Employee2.Qst1 = c("F", "G"),
Employee2.Qst2 = c(-6,-7)
)
dat |>
# Change all numerical columns to character
mutate(across(where(is.double), as.character)) |>
# Pivot longer
pivot_longer(cols = starts_with("Employee"),
names_to = "Employee.Qst") |>
# Split
separate(col = Employee.Qst, c("Employee", "Question")) |>
# Pivot wider
pivot_wider(names_from = Question,
values_from = value)
PS I really recommend going through the pivoting vignette from tidyr, it's really helpful.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
