Retain observations whose NA is <= 20% of total variables
Suppose we have this data frame with six observations and four variables:
df <- data.frame(a = c(1, NA, NA, 4, NA, 5),
                 b = c(NA, NA, NA, NA, NA, 1),
                 c = c(1, 2, 3, 4, NA, 6),
                 d = c(6, 7, NA, NA, 4, 4))
| a | b | c | d |
|---|---|---|---|
| 1 | NA | 1 | 6 |
| NA | NA | 2 | 7 |
| NA | NA | 3 | NA |
| 4 | NA | 4 | NA |
| NA | NA | NA | 4 |
| 5 | 1 | 6 | 4 |
How can we retain the observations whose NAs do not exceed 50% of the variables? (In this case, each retained observation has at most two NAs, so only four observations are retained.)
Solution 1:[1]
Use rowSums() on is.na(df) to count the NAs in each row, then discard the rows that contain more than threshold * ncol(df) NAs.
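threshold <- 0.5
# Keep rows whose NA count is at most threshold * ncol(df).
# A logical index is used instead of -which(...), because -which(...) drops every row when no row exceeds the threshold.
df <- df[rowSums(is.na(df)) <= threshold * ncol(df), ]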
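As a check on the example data, the following is a minimal sketch that wraps the same idea in a small helper. The function name `keep_rows_by_na` is hypothetical and not part of the original answer, and it uses `rowMeans()` to compare each row's share of NAs directly against the threshold, which should be equivalent to comparing `rowSums()` against `threshold * ncol(df)`.

```r
# Rebuild the example data so the sketch runs on its own.
df <- data.frame(a = c(1, NA, NA, 4, NA, 5),
                 b = c(NA, NA, NA, NA, NA, 1),
                 c = c(1, 2, 3, 4, NA, 6),
                 d = c(6, 7, NA, NA, 4, 4))

# Hypothetical helper (an assumption for illustration): keep the rows
# whose fraction of NA values is at most `threshold`.
keep_rows_by_na <- function(data, threshold = 0.5) {
  na_share <- rowMeans(is.na(data))            # fraction of NAs per row
  data[na_share <= threshold, , drop = FALSE]  # logical index, stays a data frame
}

na_count <- rowSums(is.na(df))                 # number of NAs in each row
result <- keep_rows_by_na(df, threshold = 0.5)

print(na_count)
print(result)
```

On this data the per-row NA counts are 1, 2, 3, 2, 3 and 0, so rows 3 and 5 are dropped and rows 1, 2, 4 and 6 are kept. Because the filter is a logical index, a threshold that no row exceeds (for example `threshold = 1`) leaves the data frame unchanged.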
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Ben Smith |
