'Singularity error when doing Tobit regression
I'm trying to estimate a standard tobit model which is censored left at zero.
Variables are
Dependent variable: Happiness
Independent variable:
- City(Chicago,New York),
- Gender(Man,Woman),
- Employment(0=Unemployed, 1=Employed),
- Worktype(Unemployed, Bluecolor, Whitecolor),
- Holiday(Unemployed, 1day a week, 2days a week)
'Worktype' and 'Holiday' variables are interacted with 'Employment' variable.
I'm using censReg package for tobit regression.
censReg(Happiness ~ City + Gender + Employment:Worktype + Employment:Holiday)
But summary() returns following error.
Error in printCoefmat(coef(x, logSigma = logSigma), digits = digits) :
'x' must be coefficient matrix/data frame
To find out why, I ran OLS regression.
There are some NA values, which I think is because of the model design and variable setting(there seems to be singularities for some variables. And people with 'Employment' = 0 has value 'Worktype' = Unemployed , 'Holidays' = Unemployed. This might be the reason?)
lm(Happiness ~ City + Gender + Employment:Worktype + Employment:Holiday)
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 41.750 9.697 4.305 0.0499 *
CityNew York -44.500 11.197 -3.974 0.0579 .
Gender1 2.750 14.812 0.186 0.8698
Employment:WorktypeUnemployed NA NA NA NA
Employment:WorktypeBluecolor 35.000 17.704 1.977 0.1867
Employment:WorktypeWhitecolor 102.750 14.812 6.937 0.0202 *
Employment:Holiday1 day a week -70.000 22.394 -3.126 0.0889 .
Employment:Holiday2 day a week NA NA NA NA
How can I just ignore NA values and run tobit regression without error?
below are reproducible code.
Happiness <- c(0, 80, 39, 0, 69, 90, 100, 30)
City <- as.factor(c("New York", "Chicago", "Chicago", "New York", "Chicago",
"Chicago", "New York", "New York"))
Gender <- as.factor(c(0, 1, 0, 1, 1, 1, 0, 1)) # 0 = man, 1 = woman.
Employment <- c(0,1, 0, 0, 1 ,1 , 1 , 1) # 0 = unemployed, 1 = employed.
Worktype <- as.factor(c(0, 2, 0, 0, 1, 1, 2,2))
levels(Worktype) <- c("Unemployed", "Bluecolor", "Whitecolor")
Holiday <- as.factor(c(0, 1, 0, 0, 2, 2, 2, 1))
levels(Holiday) <- c("Unemployed", "1 day a week", "2 day a week")
data <- data.frame(Happiness, City, Gender, Employment, Worktype, Holiday)
reg <- lm(Happiness ~ City + Gender + Employment:Worktype +
Employment:Holiday)
summary(reg)
install.packages("censReg")
library(censReg)
tobitreg <- censReg(Happiness ~ City + Gender + Employment:Worktype +
Employment:Holiday)
summary(tobitreg)
Solution 1:[1]
If you debug step by step the call to censReg, you reach following maxLik optimization:
result <- maxLik(censRegLogLikCross, start = start,
yVec = yVec, xMat = xMat, left = left, right = right,
obsBelow = obsBelow, obsBetween = obsBetween, obsAbove = obsAbove,
...)
The initial condition vector start which is determined using an OLS regression contains NA for two coefs as you already found out:
- Employment:WorktypeUnemployed
- Employment:Holiday2 day a week
This causes maxLik to return NULL, with error message :
Return code 100: Initial value out of range.
The summary function gets this NULL which explains the final error message you get.
To override this, you could set the start parameter :
tobitreg <- censReg(formula = Happiness ~ City + Gender + Employment:Worktype +
Employment:Holiday, start = rep(0,9) )
summary(tobitreg)
Call:
censReg(formula = Happiness ~ City + Gender + Employment:Worktype +
Employment:Holiday, start = rep(0, 9))
Observations:
Total Left-censored Uncensored Right-censored
8 2 6 0
Coefficients:
Estimate Std. error t value Pr(> t)
(Intercept) 38.666 Inf 0 1
CityNew York -50.669 Inf 0 1
Gender1 -360.633 Inf 0 1
Employment:WorktypeUnemployed 0.000 Inf 0 1
Employment:WorktypeBluecolor 345.674 Inf 0 1
Employment:WorktypeWhitecolor 56.210 Inf 0 1
Employment:Holiday1 day a week 346.091 Inf 0 1
Employment:Holiday2 day a week 55.793 Inf 0 1
logSigma 1.794 Inf 0 1
Newton-Raphson maximisation, 141 iterations
Return code 1: gradient close to zero
Log-likelihood: -19.35431 on 9 Df
Even though the error message disappeared, results aren't reliable :
- error = Inf
- gradient close to 0 : no optimal value, the solution is an hyperplane
NA coefficients in a regression indicate that the coefficients are lineary linked to the others, so that you need to drop some of them in order to get a unique solution.
As you suspected, the reason for this is that you only have Employement = 0 when worktype = Unemployed, so that the model can't estimate the coefficient for Employment:WorktypeUnemployed. Same problem with Employment:Holiday coefficients.
So I fear there's no single optimal solution to the regression model you're evaluating.
If you get rid of the linked variables, this works:
tobitreg <- censReg(formula = Happiness ~ City + Gender + Employment )
summary(tobitreg)
Call:
censReg(formula = Happiness ~ City + Gender + Employment)
Observations:
Total Left-censored Uncensored Right-censored
8 2 6 0
Coefficients:
Estimate Std. error t value Pr(> t)
(Intercept) 38.6141 5.7188 6.752 1.46e-11 ***
CityNew York -50.1813 6.4885 -7.734 1.04e-14 ***
Gender1 -70.3859 8.2943 -8.486 < 2e-16 ***
Employment 111.5672 10.0927 11.054 < 2e-16 ***
logSigma 1.7930 0.2837 6.320 2.61e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Newton-Raphson maximisation, 8 iterations
Return code 1: gradient close to zero
Log-likelihood: -19.36113 on 5 Df
Solution 2:[2]
I came across a similar issue, i.e., I got inf for all std. errs. in my censreg regression output when using one (two) categorical categories with three or more levels. My data is censored at 0 (lower bound) and 50 (upper bound).
I found two work arounds which are, however, not silver bullets.
Leaving out the upper truncation level let the model run. When doing this deletion of the upper truncation level in a reduced model without the problematic variable(s) this, however, changed the coefficients slightly. Moreover, deleting the truncation limits does arguably not correctly represent your data in that case.
Converting the categorical variables to continuous variables via
as.numeric(). Admittedly, this might conflict with the nature of the underlying phenomenon measured with this respective variable. This seems okay in the case of "education" levels. But in the case of "gender" this solution seems inadequate since there is neither natural ordering in different genders nor any values between "male", "female" and "diverse" (etc.)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | baconandphil |
