'R summary() gives incorrect values for too many NAs

The Setup

I have a data set that consists of 3.5e6 1's, 7.5e6 0's, and 4.4e6 NA's. When I call summary() on it, I get a mean and maximum that are wrong (in disagreement with mean() and max()).

> summary(data, digits = 10)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
 0       0       1       1       1       1 4365239 

When mean() is called separately, it returns a reasonable value:

> mean(data, na.rm = T)
[1] 0.6804823

Characterization of the problem

It looks like this problem is generic to any vector with more than 3162277 NA values in it.

With just under the cutoff:

> thingie <- as.numeric(c(rep(0,1e6), rep(1,1e6), rep(NA,3162277)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    0.0     0.0     0.5     0.5     1.0     1.0 3162277 

And just over:

> thingie <- as.numeric(c(rep(0,1e6), rep(1,1e6), rep(NA,3162278)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0       0       0       0       1       1 3162278 

It doesn't seem to matter how many non-missing values there are either.

> thingie <- as.numeric(c(rep(0,1), rep(1,1), rep(NA,3162277)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    0.0     0.2     0.5     0.5     0.8     1.0 3162277 
> thingie <- as.numeric(c(rep(0,1), rep(1,1), rep(NA,3162278)))
> summary(thingie)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0       0       0       0       1       1 3162278 

Research

  1. In searching for an answer, I came across the well-known rounding error, but that doesn't affect this behavior (see the first code chunk).
  2. I thought this might be some sort of bizarre quirk of my environment/machine/planetary alignment, so I had my sister run the same code. She got the same results on her machine.

Closing remarks

Clearly, this isn't a critical problem because the mean() and max() functions can be used instead of summary(), but I'm curious if anyone knows what causes this behavior. Also, neither my sister nor I could find any mention of it, so I figured I'd document it for posterity.

EDIT: I said mean and max the whole post but the max is fine. 1st quantile, median, and 3rd quantile differ.

r


Solution 1:[1]

Here's some example data:

x <- rep(c(1,0,NA), c(3.5e6,7.5e6,4.4e6))
out <- summary(x)
out
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#    0       0       0       0       1       1 4400000

mean(x, na.rm=TRUE)
#[1] 0.3181818

The issue can be traced back to zapsmall() as it does some rounding in a line that essentially does:

c(out)
#      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
# 0.000e+00 0.000e+00 0.000e+00 3.182e-01 1.000e+00 1.000e+00 4.400e+06

round(c(out), max(0L, getOption("digits")-log10(4400000)))
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#    0       0       0       0       1       1 4400000 

The critical turning point here is 3162277 to 3162278 NA values where it tips the rounding threshold from 0 to 1 as it goes across 0.5.

dput(max(0L,getOption("digits")-log10(3162277)))
#0.500000090664876

dput(max(0L,getOption("digits")-log10(3162278)))
#0.499999953328896

out[7] <- 3162277
out
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#    0.0     0.0     0.0     0.3     1.0     1.0 3162277 

out[7] <- 3162278
out
#   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#      0       0       0       0       1       1 3162278

Solution 2:[2]

An update to @thelatemail 's answer:

R (version 4.1.2, at least) will do this for many fewer than 3162277 NA's if your other values are small. For instance, with only 10 NA's,

bar <- c(1:10/100, rep(x=NA, times=10))
> summary(bar, digits=12)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0100  0.0325  0.0550  0.0550  0.0775  0.1000      10 
> print.default(summary(bar), digits=12) 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 0.0100  0.0325  0.0550  0.0550  0.0775  0.1000 10.0000 

summary() gets it right. But with 10^4 NA's there's some bad rounding,

bar <- c(1:10/100, rep(x=NA, times=1E4))
> summary(bar, digits=12)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    0.0     0.0     0.1     0.1     0.1     0.1   10000 
> print.default(summary(bar), digits=12) 
      Min.    1st Qu.     Median       Mean    3rd Qu. 
    0.0100     0.0325     0.0550     0.0550     0.0775 
      Max.       NA's 
    0.1000 10000.0000

and with 10^5 NA's everything's rounded to zero:

bar <- c(1:10/100, rep(x=NA, times=1E5))
> summary(bar, digits=12)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
      0       0       0       0       0       0  100000 
> print.default(summary(bar), digits=12) 
       Min.     1st Qu.      Median        Mean     3rd Qu. 
     0.0100      0.0325      0.0550      0.0550      0.0775 
       Max.        NA's 
     0.1000 100000.0000 

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 thelatemail
Solution 2 eac2222