Remove duplicate rows based on value of another column
I'm an R newbie and this is my first SO post (though I've been a long-time reader), so sorry if this is a dumb question. Thanks in advance for any assistance.
I've got a large dataset with more than 50 columns. Human error results in some items entered twice, but with different information in a key variable. I have reduced this to a two-column problem for simplicity: teacher-class number (tc_num), exam status (x_gen).
I can't share the actual dataset, unfortunately, but here is essentially what I have:
| tc_num | x_gen |
|---|---|
| 12355 | N |
| 12355 | Y |
| 26421 | Y |
| 26421 | N |
| 78943 | N |
| 45679 | Y |
In the case of duplicate tc_num values (e.g., 12355, 26421), I want to select the row with the "Y" value and discard the row with the "N" value. However, most tc_num values are unique (e.g., 78943, 45679), and I want to keep all of those rows (in other words, I can't just discard all rows where x_gen is "N").
So, I want to keep all rows UNLESS there is a duplicate tc_num value, in which case I want to keep the one with the "Y" value.
Thanks in advance. I appreciate this community, as it's been a big help to me over the years.
Solution 1:[1]
subset(df, x_gen == "Y" | ave(tc_num, tc_num, FUN = length) == 1)
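To see how the one-liner behaves, here is a minimal sketch that rebuilds the question's example table and breaks the condition into steps. The names `n_per_id`, `result`, `ordered`, and `result2` are illustrative and not part of the original answer. It assumes tc_num is numeric (or character, but not a factor).

```r
# Example data mirroring the table in the question
df <- data.frame(
  tc_num = c(12355, 12355, 26421, 26421, 78943, 45679),
  x_gen  = c("N", "Y", "Y", "N", "N", "Y"),
  stringsAsFactors = FALSE
)

# For every row, count how many rows share its tc_num
n_per_id <- ave(df$tc_num, df$tc_num, FUN = length)

# Keep a row if it is "Y", or if its tc_num occurs only once
result <- subset(df, x_gen == "Y" | n_per_id == 1)

# Alternative sketch (an assumption, not from the answer): sort so that "Y"
# rows come first within each tc_num, then keep the first row per tc_num.
# This always keeps exactly one row per tc_num.
ordered <- df[order(df$tc_num, df$x_gen != "Y"), ]
result2 <- ordered[!duplicated(ordered$tc_num), ]
```

Note that the `subset()` version drops every row of a duplicated tc_num whose rows are all "N", and keeps both rows if a tc_num has two "Y" entries. If either case can occur in the real data, the `order()` plus `duplicated()` variant may be the safer choice, at the cost of reordering the rows by tc_num.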
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
