'Question about getting counts in the R survey package
I'm using the 2018 CBECS data set from the Energy Information Administration (available here: https://www.eia.gov/consumption/commercial/data/2018/xls/2018_public_use_data.csv) and I've set up the sample design according to their user guide. I'm noticing a discrepancy when I use the svyby function as opposed to just the svytotal function and I'm hoping somebody can explain what it is I'm seeing and/or what I'm doing wrong.
Here is the set up for the sample design:
library(survey)
library(spatstat)
library(tidyverse)
cbecs2018 <- read_csv(paste0(getwd(), "/2018_public_use_data.csv"))
samp_wts <- cbecs2018$FINALWT
rep_wts <- cbecs2018[, grepl("^FINALWT", names(cbecs2018))]
rep_wts$FINALWT <- NULL
samp_design <- svrepdesign(weights=samp_wts, repweights=rep_wts,
type="JK2", mse=TRUE, data=cbecs2018)
sqftc <- factor(cbecs2018$SQFTC) #this is categorical variable classifying buildings by size
When I run svytotal to get a count of buildings by each category in sqftc, I get the output below, which is consistent with what EIA has:
svytotal(~sqftc, samp_design)
total SE
sqftc2 2836939.2 138709.13
sqftc3 1358439.0 78632.96
sqftc4 966092.8 55503.86
sqftc5 396595.4 23727.58
sqftc6 218416.8 11718.72
sqftc7 93085.9 5179.07
sqftc8 39865.5 1993.62
sqftc9 6664.8 620.07
sqftc10 2111.8 255.25
However, when I try to break it out by census region, I get completely different counts by category. For example, instead of showing 2,836,939 buildings in the second sqftc group, the table below makes it look like there are 3,605,529 buildings in the group.
x <- svyby(~sqftc, ~region, samp_design, svytotal)
> sum(x$sqftc2)
[1] 3605529
print(x)
region sqftc2 sqftc3 sqftc4 sqftc5 sqftc6 sqftc7 sqftc8 sqftc9 sqftc10 se1 se2 se3 se4 se5 se6 se7 se8
1 1 679858.4 382470.2 466330.8 383649.9 638936.3 777312.6 918361.9 220786.7 97105.4 70972.33 58987.22 57377.8 41027.49 79224.73 100678.28 104811.7 26387.60
2 2 1142179.1 634697.1 752421.8 762969.8 929830.8 1107860.2 1382698.4 369059.3 149810.3 131036.12 88954.07 102800.3 120901.81 88769.62 118328.83 146119.8 56056.48
3 3 859228.7 456788.7 521518.6 540952.1 779310.4 912930.2 1062321.1 285638.1 100881.7 86845.98 50065.79 56198.4 53630.90 66850.76 68490.26 87545.5 34443.43
4 4 924262.5 499895.4 541658.9 555604.6 820252.5 927657.6 1205995.5 298595.7 96787.1 96106.38 51019.41 58771.1 58782.50 60113.72 85934.54 134417.5 41790.27
se9
1 14502.07
2 39303.04
3 21410.55
4 13725.39
I feel like whatever I'm doing wrong is probably pretty straightforward, but any pointers would be greatly appreciated.
Solution 1:[1]
maybe review your minimal reproducible example? :-) when i run this, the numbers match
library(survey)
cbecs2018 <- read.csv("https://www.eia.gov/consumption/commercial/data/2018/xls/2018_public_use_data.csv")
samp_design <-
svrepdesign(
weights = ~ FINALWT ,
repweights = "^FINALWT[0-9]" ,
type = 'JK2' ,
mse = TRUE ,
data = cbecs2018
)
samp_design <- update( samp_design , SQFTC = factor( SQFTC ) )
svytotal(~SQFTC, samp_design)
svyby(~SQFTC,~REGION,samp_design,svytotal)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Anthony Damico |
