'How to summarize rows of a data frame into one while removing Duplicates in R?

so, I have a data frame with 2 or more rows and different columns (ID, Location, Task, Skill, ...). I want to summarize these rows into (a) one row (dataframe) where different column entries should be joined together (but only if different! i.e. if for two rows the IDs are the same, the final dataframe row should show only one ID not the same twice i.e. "ID1", but if they are different, both should be shown i.e. 'ID1, ID2") and some numerical values should be added (+) together.

df = data.frame("ID" = c(PA1, PA1), "Occupation" = c("PO - react to DCS, initiate corrective measures,  react to changes
", "PO - data based operations"), "Field" = c("PA","PA"), "Work" = c(0.5, 0.1), "Skill1" = c(CRO, CRO), "Skill2" = c(0, PPto), "ds" = c(5, 5))
print(df)

and the output should look like this

df_final = data.frame("ID" = c(PA1), "Occupation" = c("PO - react to DCS, initiate corrective measures,  react to changes, data based operations"), "Field" = c("PA"), "Work" = c(0.6), "Skill1" = c(CRO), "Skill2" = c(PPto), "ds" = c(5))
print(df_final)

Thank you!



Solution 1:[1]

Let's ignore Skill2 for now: How close is the following code to what you want to do?

df2 %>%
  group_by(ID)%>%
  summarise(work = sum(Work), 
            skill1 = unique(Skill1),
            ds = unique(ds), 
            occupation = paste0(Occupation, collapse = " "),
            field = unique(Field))

You can also mutate(occupation = str_replace_all(occupation, "PO - ")) to get rid of the duplicate "PO - "'s.

You're going to run into problems if the variables like Skill1/Skill2/ds are not unique to each ID, as in they have cardinality > 1.

df2 %>%
  group_by(ID)%>%
  summarise(work = sum(Work), 
            skill1 = unique(Skill1),
            skill2 = unique(Skill2),
            ds = unique(ds), 
            occupation = paste0(Occupation, collapse = " "),
            field = unique(Field))

If it's a simple data-entry issue, you could do a bit of wrangling to filter for only Skill2 entries with letters contained, and then join this frame back to your original frame.

You could also use the past0() collapse = trick, but then you'll end up with Skill2 = c(NA, "PPto"), which I'm pretty sure you don't want.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 jpenzer