Singularity error when doing Tobit regression

I'm trying to estimate a standard tobit model that is left-censored at zero.

Variables are

Dependent variable: Happiness

Independent variables:

  • City (Chicago, New York),
  • Gender (Man, Woman),
  • Employment (0 = Unemployed, 1 = Employed),
  • Worktype (Unemployed, Bluecolor, Whitecolor),
  • Holiday (Unemployed, 1 day a week, 2 days a week)

The 'Worktype' and 'Holiday' variables are interacted with the 'Employment' variable.

I'm using the censReg package for the tobit regression.

censReg(Happiness ~ City + Gender + Employment:Worktype + Employment:Holiday)

But summary() returns the following error.

Error in printCoefmat(coef(x, logSigma = logSigma), digits = digits) : 
  'x' must be coefficient matrix/data frame

To find out why, I ran an OLS regression.

There are some NA coefficients, which I think come from the model design and the variable setup: there seem to be singularities for some variables, and everyone with 'Employment' = 0 also has 'Worktype' = Unemployed and 'Holiday' = Unemployed. Might that be the reason?

lm(Happiness ~ City + Gender + Employment:Worktype + Employment:Holiday)


Coefficients: (2 not defined because of singularities)
                               Estimate Std. Error t value Pr(>|t|)  
(Intercept)                      41.750      9.697   4.305   0.0499 *
CityNew York                    -44.500     11.197  -3.974   0.0579 .
Gender1                           2.750     14.812   0.186   0.8698  
Employment:WorktypeUnemployed        NA         NA      NA       NA  
Employment:WorktypeBluecolor     35.000     17.704   1.977   0.1867  
Employment:WorktypeWhitecolor   102.750     14.812   6.937   0.0202 *
Employment:Holiday1 day a week  -70.000     22.394  -3.126   0.0889 .
Employment:Holiday2 day a week       NA         NA      NA       NA 
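
For what it's worth, running alias() on that fit (reg in the reproducible code below) should list exactly which terms are linearly dependent on the others:

alias(reg)$Complete   # dropped coefficients, written as combinations of the kept terms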

How can I just ignore the NA values and run the tobit regression without an error?

Below is reproducible code.

Happiness <- c(0, 80, 39, 0, 69, 90, 100, 30)

City <- as.factor(c("New York", "Chicago", "Chicago", "New York", "Chicago",
                    "Chicago", "New York", "New York"))
Gender <- as.factor(c(0, 1, 0, 1, 1, 1, 0, 1))  # 0 = man, 1 = woman
Employment <- c(0, 1, 0, 0, 1, 1, 1, 1)         # 0 = unemployed, 1 = employed
Worktype <- as.factor(c(0, 2, 0, 0, 1, 1, 2, 2))
levels(Worktype) <- c("Unemployed", "Bluecolor", "Whitecolor")
Holiday <- as.factor(c(0, 1, 0, 0, 2, 2, 2, 1))
levels(Holiday) <- c("Unemployed", "1 day a week", "2 day a week")

data <- data.frame(Happiness, City, Gender, Employment, Worktype, Holiday)
reg <- lm(Happiness ~ City + Gender + Employment:Worktype + Employment:Holiday)
summary(reg)

install.packages("censReg")
library(censReg)
tobitreg <- censReg(Happiness ~ City + Gender + Employment:Worktype + Employment:Holiday)
summary(tobitreg)


Solution 1:[1]

If you step through the call to censReg in the debugger, you reach the following maxLik optimization:

result <- maxLik(censRegLogLikCross, start = start, 
      yVec = yVec, xMat = xMat, left = left, right = right, 
      obsBelow = obsBelow, obsBetween = obsBetween, obsAbove = obsAbove, 
      ...)

The starting-value vector start, which is determined from an OLS regression, contains NA for two coefficients, as you already found out:

  • Employment:WorktypeUnemployed
  • Employment:Holiday2 day a week

This causes maxLik to return NULL, with the error message:

Return code 100: Initial value out of range.

The summary function then receives this NULL, which explains the final error message you get.
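
Those starting values can be reproduced without stepping through the package: they come from an OLS fit plus a log-sigma term. A rough sketch (an approximation of what censReg computes internally, not its exact code):

ols <- lm(Happiness ~ City + Gender + Employment:Worktype + Employment:Holiday,
          data = data)
start <- c(coef(ols), logSigma = log(summary(ols)$sigma))
start   # the two NA entries are what trigger "Initial value out of range"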

To override this, you can supply the start parameter yourself:

tobitreg <- censReg(formula = Happiness ~ City + Gender + Employment:Worktype +      
                      Employment:Holiday, start = rep(0,9) )
summary(tobitreg)

Call:
censReg(formula = Happiness ~ City + Gender + Employment:Worktype + 
    Employment:Holiday, start = rep(0, 9))

Observations:
         Total  Left-censored     Uncensored Right-censored 
             8              2              6              0 

Coefficients:
                               Estimate Std. error t value Pr(> t)
(Intercept)                      38.666        Inf       0       1
CityNew York                    -50.669        Inf       0       1
Gender1                        -360.633        Inf       0       1
Employment:WorktypeUnemployed     0.000        Inf       0       1
Employment:WorktypeBluecolor    345.674        Inf       0       1
Employment:WorktypeWhitecolor    56.210        Inf       0       1
Employment:Holiday1 day a week  346.091        Inf       0       1
Employment:Holiday2 day a week   55.793        Inf       0       1
logSigma                          1.794        Inf       0       1

Newton-Raphson maximisation, 141 iterations
Return code 1: gradient close to zero
Log-likelihood: -19.35431 on 9 Df

Even though the error message disappeared, the results aren't reliable:

  • all standard errors are Inf
  • "gradient close to zero" with no unique optimum: the solutions form a hyperplane of equally good parameter values

NA coefficients in a regression indicate that the corresponding regressors are linearly dependent on the others, so you need to drop some of them to get a unique solution.

As you suspected, the reason is that Employment = 0 occurs exactly when Worktype = Unemployed, so the model can't estimate the coefficient for Employment:WorktypeUnemployed. The same problem affects the Employment:Holiday coefficients.
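
A quick cross-tabulation of the reproducible data makes this dependence explicit:

with(data, table(Employment, Worktype))   # Employment = 0 only occurs with Worktype = Unemployed
with(data, table(Employment, Holiday))    # the same pattern holds for Holiday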

So I fear there is no single optimal solution to the regression model you're estimating.

If you get rid of the linearly dependent variables, this works:

tobitreg <- censReg(formula = Happiness ~ City + Gender + Employment )
summary(tobitreg)
Call:
censReg(formula = Happiness ~ City + Gender + Employment)

Observations:
         Total  Left-censored     Uncensored Right-censored 
             8              2              6              0 

Coefficients:
             Estimate Std. error t value  Pr(> t)    
(Intercept)   38.6141     5.7188   6.752 1.46e-11 ***
CityNew York -50.1813     6.4885  -7.734 1.04e-14 ***
Gender1      -70.3859     8.2943  -8.486  < 2e-16 ***
Employment   111.5672    10.0927  11.054  < 2e-16 ***
logSigma       1.7930     0.2837   6.320 2.61e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Newton-Raphson maximisation, 8 iterations
Return code 1: gradient close to zero
Log-likelihood: -19.36113 on 5 Df

Solution 2:[2]

I came across a similar issue: I got Inf for all standard errors in my censReg output when using one (or two) categorical variables with three or more levels. My data are censored at 0 (lower bound) and 50 (upper bound). I found two workarounds, which are, however, not silver bullets.

  1. Leaving out the upper censoring limit lets the model run. However, when I dropped the upper limit in a reduced model without the problematic variable(s), the coefficients changed slightly; moreover, dropping the censoring limit arguably no longer represents the data correctly in that case.

  2. Converting the categorical variables to numeric variables via as.numeric(), as sketched below. Admittedly, this may conflict with the nature of the underlying phenomenon measured by that variable. It seems acceptable for something like "education" levels, but for "gender" it is inadequate, since there is neither a natural ordering of genders nor any meaningful value between "male", "female", "diverse", etc.
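
A minimal sketch of workaround 2, applied to the toy data from the question (the name HolidayNum is mine, and treating the ordered Holiday factor as equally spaced is an assumption; with only eight observations the fit may still be fragile):

library(censReg)
# Recode the ordered factor as a plain count of days off (0, 1, 2).
data$HolidayNum <- as.numeric(data$Holiday) - 1
# Left-censoring at 0 is censReg's default; it is spelled out here for clarity.
tobit_num <- censReg(Happiness ~ City + Gender + Employment + HolidayNum,
                     left = 0, data = data)
summary(tobit_num)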

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution sources:
[1] Solution 1
[2] Solution 2: baconandphil