'Question about getting counts in the R survey package

I'm using the 2018 CBECS data set from the Energy Information Administration (available here: https://www.eia.gov/consumption/commercial/data/2018/xls/2018_public_use_data.csv) and I've set up the sample design according to their user guide. I'm noticing a discrepancy when I use the svyby function as opposed to just the svytotal function and I'm hoping somebody can explain what it is I'm seeing and/or what I'm doing wrong.

Here is the set up for the sample design:

library(survey)
library(spatstat)
library(tidyverse)

cbecs2018 <- read_csv(paste0(getwd(), "/2018_public_use_data.csv"))

samp_wts <- cbecs2018$FINALWT

rep_wts <- cbecs2018[, grepl("^FINALWT", names(cbecs2018))]

rep_wts$FINALWT <- NULL

samp_design <- svrepdesign(weights=samp_wts, repweights=rep_wts,
                         type="JK2", mse=TRUE, data=cbecs2018)

sqftc <- factor(cbecs2018$SQFTC) #this is categorical variable classifying buildings by size

When I run svytotal to get a count of buildings by each category in sqftc, I get the output below, which is consistent with what EIA has:

svytotal(~sqftc, samp_design)

          total        SE
sqftc2  2836939.2 138709.13
sqftc3  1358439.0  78632.96
sqftc4   966092.8  55503.86
sqftc5   396595.4  23727.58
sqftc6   218416.8  11718.72
sqftc7    93085.9   5179.07
sqftc8    39865.5   1993.62
sqftc9     6664.8    620.07
sqftc10    2111.8    255.25

However, when I try to break it out by census region, I get completely different counts by category. For example, instead of showing 2,836,939 buildings in the second sqftc group, the table below makes it look like there are 3,605,529 buildings in the group.

x <- svyby(~sqftc, ~region, samp_design, svytotal)

> sum(x$sqftc2)
[1] 3605529

print(x)
  region    sqftc2   sqftc3   sqftc4   sqftc5   sqftc6    sqftc7    sqftc8   sqftc9  sqftc10       se1      se2      se3       se4      se5       se6      se7      se8
1      1  679858.4 382470.2 466330.8 383649.9 638936.3  777312.6  918361.9 220786.7  97105.4  70972.33 58987.22  57377.8  41027.49 79224.73 100678.28 104811.7 26387.60
2      2 1142179.1 634697.1 752421.8 762969.8 929830.8 1107860.2 1382698.4 369059.3 149810.3 131036.12 88954.07 102800.3 120901.81 88769.62 118328.83 146119.8 56056.48
3      3  859228.7 456788.7 521518.6 540952.1 779310.4  912930.2 1062321.1 285638.1 100881.7  86845.98 50065.79  56198.4  53630.90 66850.76  68490.26  87545.5 34443.43
4      4  924262.5 499895.4 541658.9 555604.6 820252.5  927657.6 1205995.5 298595.7  96787.1  96106.38 51019.41  58771.1  58782.50 60113.72  85934.54 134417.5 41790.27
       se9
1 14502.07
2 39303.04
3 21410.55
4 13725.39

I feel like whatever I'm doing wrong is probably pretty straightforward, but any pointers would be greatly appreciated.



Solution 1:[1]

maybe review your minimal reproducible example? :-) when i run this, the numbers match

library(survey)

cbecs2018 <- read.csv("https://www.eia.gov/consumption/commercial/data/2018/xls/2018_public_use_data.csv")

samp_design <-
    svrepdesign(
        weights = ~ FINALWT ,
        repweights = "^FINALWT[0-9]" ,
        type = 'JK2' ,
        mse = TRUE ,
        data = cbecs2018
    )
    
samp_design <- update( samp_design , SQFTC = factor( SQFTC ) )

svytotal(~SQFTC, samp_design)

svyby(~SQFTC,~REGION,samp_design,svytotal)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Anthony Damico