'Merging rows with same value with conditions for keeping multiple dummies
Creating a subset example of the DF (the code for a part of the actual one is at the end)
ANO_CENSO PK_COD_TURMA PK_COD_ENTIDADE MAIS_ENSINO_FUND MAIS_ENSINO_MED ENSINO_INTEG_FUND ENSINO_INTEG_MED
2011 27 12 1 0 0 1
2011 41 12 1 1 0 0
2011 18 13 0 0 0 1
2011 16 14 1 1 0 1
I want to merge the rows with the same value for PK_COD_ENTIDADE into a single one, and keep the values "1" for the dummies with the same PK_COD_ENTIDADE. I don't care for the different values in PK_COD_TURMA, doesn't matter which one stays at the final DF (27 or 41). MY DF have multiple variables like PK_COD_TURMA that I don't care for the final value, the important one are the PK_COD_ENTIDADE and the dummies with value "1" It would look like this at the end:
ANO_CENSO PK_COD_TURMA PK_COD_ENTIDADE MAIS_ENSINO_FUND MAIS_ENSINO_MED ENSINO_INTEG_FUND ENSINO_INTEG_MED
2011 27 12 1 1 0 1
2011 18 13 0 0 0 1
2011 16 14 1 1 0 1
Look at how I have the values "1" for 2 dummies in the first observation of PK_COD_ENTIDADE = 12 and another value "1" in another dummy with the PK_COD_ENTIDADE = 12, and at the end they merged in a single observation for the same PK_COD_ENTIDADE keeping the different dummies "1" (and the same dummies with 1 for different observations don't sum to 2, because they are dummies)
I have no idea how to do this, I searched for some solutions with dplyr but couldn't apply anything close to working...
Here is the structure of the df with all variables:
dftest2 <- structure(list(ANO_CENSO = c(2011, 2011, 2011, 2011), PK_COD_TURMA = c(27,
41, 18, 16), NU_DURACAO_TURMA = c(250, 255, 255,
255), FK_COD_ETAPA_ENSINO = c(41, 19, 19, 19), PK_COD_ENTIDADE = c(12,
12, 13, 14), FK_COD_ESTADO = c(11, 11, 11,
11), SIGLA = c("RO", "RO", "RO", "RO"), FK_COD_MUNICIPIO = c(1100023,
1100023, 1100023, 1100023), ID_LOCALIZACAO = c(1, 1, 1, 1), ID_DEPENDENCIA_ADM = c(2,
2, 2, 2), MAIS_ENSINO_FUND = c(1, 1, 0, 1), MAIS_ENSINO_MED = c(0,
1, 0, 1), ENSINO_INTEG_FUND = c(0L, 0L, 0L, 0L), ENSINO_INTEG_MED = c(1L,
0L, 1L, 1L)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
Solution 1:[1]
The sample data you give for dftest2 does not match the data you present at the beginning of your post.
In response to your question, an option is to use aggregate
aggregate(
. ~ PK_COD_ENTIDADE,
data = transform(dftest2, SIGLA = as.factor(SIGLA)),
FUN = max)
#P K_COD_ENTIDADE ANO_CENSO PK_COD_TURMA NU_DURACAO_TURMA FK_COD_ETAPA_ENSINO
#1 12 2011 41 255 41
#2 13 2011 18 255 19
#3 14 2011 16 255 19
# FK_COD_ESTADO SIGLA FK_COD_MUNICIPIO ID_LOCALIZACAO ID_DEPENDENCIA_ADM
#1 11 1 1100023 1 2
#2 11 1 1100023 1 2
#3 11 1 1100023 1 2
# MAIS_ENSINO_FUND MAIS_ENSINO_MED ENSINO_INTEG_FUND ENSINO_INTEG_MED
#1 1 1 0 1
#2 0 0 0 1
#3 1 1 0 1
Explanation: We first convert the character column SIGLA to a factor; then we aggregate data in all columns (except PK_COD_ENTIDADE) by PK_COD_ENTIDADE, and return the max value (which should be consistent with your problem statement).
You can do something similar using dplyrs group_by and summarise_all
library(dplyr)
dftest2 %>%
group_by(PK_COD_ENTIDADE) %>%
summarise_all(~ifelse(is.character(.x), last(.x), max(.x))) %>%
ungroup()
# A tibble: 3 x 14
PK_COD_ENTIDADE ANO_CENSO PK_COD_TURMA NU_DURACAO_TURMA FK_COD_ETAPA_EN…
<dbl> <dbl> <dbl> <dbl> <dbl>
1 12 2011 41 255 41
2 13 2011 18 255 19
3 14 2011 16 255 19
# … with 9 more variables: FK_COD_ESTADO <dbl>, SIGLA <chr>,
# FK_COD_MUNICIPIO <dbl>, ID_LOCALIZACAO <dbl>, ID_DEPENDENCIA_ADM <dbl>,
# MAIS_ENSINO_FUND <dbl>, MAIS_ENSINO_MED <dbl>, ENSINO_INTEG_FUND <int>,
# ENSINO_INTEG_MED <int>
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
