'R summary() gives incorrect values for too many NAs
The Setup
I have a data set that consists of 3.5e6 1's, 7.5e6 0's, and 4.4e6 NA's. When I call summary() on it, I get a mean and maximum that are wrong (in disagreement with mean() and max()).
> summary(data, digits = 10)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 0 1 1 1 1 4365239
When mean() is called separately, it returns a reasonable value:
> mean(data, na.rm = T)
[1] 0.6804823
Characterization of the problem
It looks like this problem is generic to any vector with more than 3162277 NA values in it.
With just under the cutoff:
> thingie <- as.numeric(c(rep(0,1e6), rep(1,1e6), rep(NA,3162277)))
> summary(thingie)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 0.0 0.5 0.5 1.0 1.0 3162277
And just over:
> thingie <- as.numeric(c(rep(0,1e6), rep(1,1e6), rep(NA,3162278)))
> summary(thingie)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 0 0 0 1 1 3162278
It doesn't seem to matter how many non-missing values there are either.
> thingie <- as.numeric(c(rep(0,1), rep(1,1), rep(NA,3162277)))
> summary(thingie)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 0.2 0.5 0.5 0.8 1.0 3162277
> thingie <- as.numeric(c(rep(0,1), rep(1,1), rep(NA,3162278)))
> summary(thingie)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 0 0 0 1 1 3162278
Research
- In searching for an answer, I came across the well-known rounding error, but that doesn't affect this behavior (see the first code chunk).
- I thought this might be some sort of bizarre quirk of my environment/machine/planetary alignment, so I had my sister run the same code. She got the same results on her machine.
Closing remarks
Clearly, this isn't a critical problem because the mean() and max() functions can be used instead of summary(), but I'm curious if anyone knows what causes this behavior. Also, neither my sister nor I could find any mention of it, so I figured I'd document it for posterity.
EDIT: I said mean and max the whole post but the max is fine. 1st quantile, median, and 3rd quantile differ.
Solution 1:[1]
Here's some example data:
x <- rep(c(1,0,NA), c(3.5e6,7.5e6,4.4e6))
out <- summary(x)
out
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0 0 0 0 1 1 4400000
mean(x, na.rm=TRUE)
#[1] 0.3181818
The issue can be traced back to zapsmall() as it does some rounding in a line that essentially does:
c(out)
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0.000e+00 0.000e+00 0.000e+00 3.182e-01 1.000e+00 1.000e+00 4.400e+06
round(c(out), max(0L, getOption("digits")-log10(4400000)))
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0 0 0 0 1 1 4400000
The critical turning point here is 3162277 to 3162278 NA values where it tips the rounding threshold from 0 to 1 as it goes across 0.5.
dput(max(0L,getOption("digits")-log10(3162277)))
#0.500000090664876
dput(max(0L,getOption("digits")-log10(3162278)))
#0.499999953328896
out[7] <- 3162277
out
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0.0 0.0 0.0 0.3 1.0 1.0 3162277
out[7] <- 3162278
out
# Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
# 0 0 0 0 1 1 3162278
Solution 2:[2]
An update to @thelatemail 's answer:
R (version 4.1.2, at least) will do this for many fewer than 3162277 NA's if your other values are small. For instance, with only 10 NA's,
bar <- c(1:10/100, rep(x=NA, times=10))
> summary(bar, digits=12)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0100 0.0325 0.0550 0.0550 0.0775 0.1000 10
> print.default(summary(bar), digits=12)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0100 0.0325 0.0550 0.0550 0.0775 0.1000 10.0000
summary() gets it right. But with 10^4 NA's there's some bad rounding,
bar <- c(1:10/100, rep(x=NA, times=1E4))
> summary(bar, digits=12)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.0 0.0 0.1 0.1 0.1 0.1 10000
> print.default(summary(bar), digits=12)
Min. 1st Qu. Median Mean 3rd Qu.
0.0100 0.0325 0.0550 0.0550 0.0775
Max. NA's
0.1000 10000.0000
and with 10^5 NA's everything's rounded to zero:
bar <- c(1:10/100, rep(x=NA, times=1E5))
> summary(bar, digits=12)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0 0 0 0 0 0 100000
> print.default(summary(bar), digits=12)
Min. 1st Qu. Median Mean 3rd Qu.
0.0100 0.0325 0.0550 0.0550 0.0775
Max. NA's
0.1000 100000.0000
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | thelatemail |
| Solution 2 | eac2222 |
