How can I standardize the unit change of an interval variable with uneven intervals?

I am constructing OLS models in R and I have run into a methodological issue. The main independent variable for the study is "town size," which is coded (in the codebook) as:

  • 1.- Under 2,000
  • 2.- 2,000 - 5,000
  • etc
data$G_TOWNSIZE[data$G_TOWNSIZE == 1] <- "Under 2,000"
data$G_TOWNSIZE[data$G_TOWNSIZE == 2] <- "2,000-5,000"
data$G_TOWNSIZE[data$G_TOWNSIZE == 3] <- "5,000-10,000"
data$G_TOWNSIZE[data$G_TOWNSIZE == 4] <- "10,000-20,000"
data$G_TOWNSIZE[data$G_TOWNSIZE == 5] <- "20,000-50,000"
data$G_TOWNSIZE[data$G_TOWNSIZE == 6] <- "50,000-100,000"
data$G_TOWNSIZE[data$G_TOWNSIZE == 7] <- "100,000-500,000"
data$G_TOWNSIZE[data$G_TOWNSIZE == 8] <- "500,000 and more"

(This data is from the World Values Survey, Wave 7.) I understand that this is really a semi-categorical variable. In fact, the above code was not part of our regression until today. We had been relying only on the 1:8 scale to test for a linear relationship. Yes, sorry. We now know this is wrong (haha).

We want to examine the degree to which political participation is a function of population density. All our dependent variables are ordered categorical variables. In the following model Q221R is self-reported voting tendency that we have re-coded as:

data$Q221R[data$Q221 == 1] <- 3
data$Q221R[data$Q221 == 2] <- 2
data$Q221R[data$Q221 == 3] <- 1
    1. Never
    2. Usually
    3. Always
model6 <- lm(Q221R ~ G_TOWNSIZE + Q262 + Q260 + Q240FR + Q275 + Q288R, data=GER)

Based on our literature review we expect there to be a linear relationship. Indeed, even with how messed up our usage of G_TOWNSIZE is, we do observe a correlation when testing on this subset (Germany). But the unit change observed in town size is obviously arbitrary.

Is there a way to re-code or re-weight town size so that the change between living in a "1" town and living in a "2" town actually makes sense? The data model includes no other variable for population density, and it is beyond the scope of our project to match the geodetic data of each respondent to their respective towns in order to find out the actual population. Assume we are intellectually capable of doing the math; I promise we are. We are just new to statistics. Thank you very much.



Solution 1:[1]

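I'm setting aside the appropriateness of the regression model here, but I think the most statistically robust approach would be to convert G_TOWNSIZE into a factor variable and then include it as a set of dummies. Base R's as.factor() (or as_factor() from the haven or forcats packages) will do this. Each dummy coefficient is then the difference from a reference category, so no spacing between the size bands has to be assumed.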
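
As a rough sketch of what that could look like, the snippet below builds a small made-up data frame (the name `toy`, the new column `G_TOWNSIZE_F`, and the simulated values are assumptions for illustration only, not the real WVS data) and fits the model with town size as a factor. The control variables from the question are left out to keep it short.

```r
# Toy data standing in for the survey subset; values are simulated.
set.seed(1)
toy <- data.frame(
  G_TOWNSIZE = sample(1:8, 200, replace = TRUE),
  Q221R      = sample(1:3, 200, replace = TRUE)
)

# Codebook labels, in the order of the numeric codes 1..8.
size_labels <- c("Under 2,000", "2,000-5,000", "5,000-10,000",
                 "10,000-20,000", "20,000-50,000", "50,000-100,000",
                 "100,000-500,000", "500,000 and more")

# Convert to a factor in a new column (hypothetical name) so the
# original numeric codes are kept. The first level, "Under 2,000",
# becomes the reference category.
toy$G_TOWNSIZE_F <- factor(toy$G_TOWNSIZE, levels = 1:8, labels = size_labels)

# lm() expands the factor into one dummy per non-reference level.
fit <- lm(Q221R ~ G_TOWNSIZE_F, data = toy)
coef(fit)
```

Each reported coefficient is the estimated difference in the outcome between that size band and the "Under 2,000" band, which avoids treating the step from 1 to 2 as equal to the step from 7 to 8.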

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jonathan Graves