How to randomly duplicate rows in a data frame to certain criteria in R?

Suppose there is a data frame df containing book categories and the number of occurrences of a word in a particular book:

book_id book_category book_word_hi book_word_bye book_word_yes
1       drama         3            0             4
2       action        1            4             5
3       drama         5            3             2
4       fantasy       5            5             5
5       documentary   4            6             5

Using the code below, we can count the total number of words per book_category:

tapply(rowSums(df[3:5]), df[2], sum)


drama: 17 
action: 10
fantasy: 15
documentary: 15
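
For anyone who wants to reproduce the totals above, here is a minimal sketch that rebuilds the example table as a data frame and runs the same tapply call. The column types are an assumption (the question only shows the printed table), and the name totals is introduced here for illustration. Note that tapply sorts the category names alphabetically, so the printed order differs from the listing above.

```r
# Rebuild the example data frame from the table in the question
# (column types are assumed; only the printed values are given).
df <- data.frame(
  book_id       = 1:5,
  book_category = c("drama", "action", "drama", "fantasy", "documentary"),
  book_word_hi  = c(3, 1, 5, 5, 4),
  book_word_bye = c(0, 4, 3, 5, 6),
  book_word_yes = c(4, 5, 2, 5, 5)
)

# Sum the three word-count columns per row, then add up the rows per category
totals <- tapply(rowSums(df[3:5]), df$book_category, sum)
print(totals)
```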

I want to take the category with the highest total (drama here) as the target, and then randomly select and duplicate rows from the other categories whenever duplicating a row would bring that category's total closer to the target.

For example, the code should not duplicate the fantasy or documentary rows here, because doing so would take the total for each to 30, which is further from 17 (drama) than their current total of 15. However, it should duplicate the action row, because doing so would take the total to 20, which is closer to 17 (drama) than 10 is.

Does anyone know how this would be possible? It is essentially a task of oversampling.
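
One possible approach, sketched under assumptions rather than offered as a definitive answer: for every category below the maximum, keep picking a random row from that category and duplicating it, as long as adding that row's word count moves the category total strictly closer to the maximum. The function name oversample_to_max and its arguments cat_col and word_cols are hypothetical names made up for this sketch, and it assumes the category column contains no missing values or unused factor levels.

```r
# Hypothetical helper: greedily duplicate random rows of under-represented
# categories while each duplication brings the category total closer to the
# largest category total.
oversample_to_max <- function(df, cat_col = "book_category", word_cols = 3:5) {
  row_counts <- rowSums(df[word_cols])              # words per book (row)
  totals <- tapply(row_counts, df[[cat_col]], sum)  # words per category
  target <- max(totals)                             # total of the largest category
  extra <- integer(0)                               # row indices to append as duplicates

  for (category in names(totals)[totals < target]) {
    idx <- which(df[[cat_col]] == category)
    current <- totals[[category]]
    repeat {
      # Rows whose duplication would move the total strictly closer to the target
      improves <- abs(current + row_counts[idx] - target) < abs(current - target)
      candidates <- idx[improves]
      if (length(candidates) == 0) break
      # Index into candidates so a single candidate is not treated as 1:n by sample()
      pick <- candidates[sample.int(length(candidates), 1)]
      extra <- c(extra, pick)
      current <- current + row_counts[[pick]]
    }
  }
  # Original rows followed by the duplicated ones
  df[c(seq_len(nrow(df)), extra), , drop = FALSE]
}

# Usage with the example table from the question
df <- read.table(header = TRUE, text = "
book_id book_category book_word_hi book_word_bye book_word_yes
1       drama         3            0             4
2       action        1            4             5
3       drama         5            3             2
4       fantasy       5            5             5
5       documentary   4            6             5
")

set.seed(1)  # only matters when a category has several candidate rows
result <- oversample_to_max(df)
new_totals <- tapply(rowSums(result[3:5]), result$book_category, sum)
print(result)
print(new_totals)
```

On the example data only the action row is duplicated, taking the action total from 10 to 20, while fantasy and documentary stay at 15. Because each accepted duplication strictly reduces the distance to the target, the loop stops once no row in a category can improve on the current total.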

r dataframe oversampling

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
