How to make predictions even with NAs using predict()?

I want to use predict() with a polr() model to predict variable z, as per the following code. The first data frame, df, is used to train the model; the second, test, is the test data.

df <- data.frame(x=c(1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2),
                 y=c(32, 67, 12, 89, 45, 78, 43, 47, 14, 67, 16, 36, 25, 23, 56, 26, 35, 79, 13, 44),
                 z=as.factor(c(1, 2, 3, 2, 1, 2, 3, 2, 1, 2, 3, 2, 3, 2, 1, 2, 1, 2, 1, 2)))
test <- data.frame(x=c(1, 2, 1, 1, 2, 1, 2, 2, 1, 1),
                   y=c(34, NA, 78, NA, 89, 17, 27, 83, 23, 48),
                   z=c(1, 2, 3, 1, 2, 3, 1, 2, 3, 1))

This is the polr() model:

library(MASS)  # provides polr()
mod <- polr(z ~ x + y, data = df, Hess = TRUE)

And this is the predict() function with its outcome:

predict(mod, newdata = test)
[1] 2    <NA> 2    <NA> 2    2    2    2    2    2 

My problem is that I want the model to make predictions even when there are NAs, as in the 2nd and 4th cases. I have tried the following, with the same result:

predict(mod, newdata = test, na.action = "na.exclude")
predict(mod, newdata = test, na.action = "na.pass")
predict(mod, newdata = test, na.action = "na.omit")
predict(mod, newdata = test, na.rm=T)
[1] 2    <NA> 2    <NA> 2    2    2    2    2    2 

How can I get the model to make predictions even when there's some missing data?



Solution 1:[1]

This is more of a statistical or mathematical problem than a programming problem. To simplify things a little (and to show that the issue is general), I'll illustrate with a linear regression, but the concept extends to ordinal regression as well.

Suppose I've estimated a linear relationship, say z = 1 + 2*x + 3*y, and I want to predict a response when the predictors are {x=3, y=NA}. I get 1 + 2*3 + 3*NA, which is clearly NA.
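The arithmetic above can be checked directly in R. This is only a sketch of the hypothetical relationship z = 1 + 2*x + 3*y; the names b0, bx, by and predict_z are made up for illustration and are not part of any fitted model.

```r
# Hypothetical coefficients for z = 1 + 2*x + 3*y (illustration only)
b0 <- 1
bx <- 2
by <- 3

# Linear predictor written out by hand
predict_z <- function(x, y) b0 + bx * x + by * y

pred_missing <- predict_z(x = 3, y = NA)  # NA propagates through the arithmetic
pred_known   <- predict_z(x = 3, y = 2)   # 1 + 2*3 + 3*2

is.na(pred_missing)
pred_known
```

Any arithmetic involving NA returns NA, so no na.action setting can turn this into a number; a value for y has to be supplied from somewhere.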

If you want predictions when some of the predictor variables are unknown, you have to make some kind of assumption or decision about what to do; this is a question of interpretation, not mathematics. For example, you could set unknown values of y to the mean of the original data set, the mean of the new data set, or some sensible reference value. Alternatively, you could do multiple imputation, i.e., make several predictions based on several different draws from a reasonable distribution, then average the results. For a linear regression model this gives the same point estimate as using the mean of the distribution, but (1) the results will differ if you have an effectively nonlinear model like an ordinal or generalized linear regression, and (2) multiple imputation will allow you to get sensible standard errors on the prediction.
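As a rough sketch of two of these options using the data from the question: the first fills missing y with the training-set mean, and the second averages predicted class probabilities over repeated draws of y. Drawing from the observed training values of y is an assumption made here for simplicity (it ignores any relationship between x and y), and this is a simplified stand-in for a full multiple-imputation procedure. The names test_mean, n_draws, probs and pred_draws are illustrative.

```r
library(MASS)  # provides polr()

# Training and test data from the question (test$z omitted; it is not
# needed for prediction)
df <- data.frame(
  x = c(1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 2, 2, 1, 2, 1, 1, 2, 2),
  y = c(32, 67, 12, 89, 45, 78, 43, 47, 14, 67,
        16, 36, 25, 23, 56, 26, 35, 79, 13, 44),
  z = as.factor(c(1, 2, 3, 2, 1, 2, 3, 2, 1, 2,
                  3, 2, 3, 2, 1, 2, 1, 2, 1, 2))
)
test <- data.frame(
  x = c(1, 2, 1, 1, 2, 1, 2, 2, 1, 1),
  y = c(34, NA, 78, NA, 89, 17, 27, 83, 23, 48)
)

mod <- polr(z ~ x + y, data = df, Hess = TRUE)

# Option 1: replace missing y with the training-set mean.
# This is a modelling decision, not something the model justifies.
test_mean <- test
test_mean$y[is.na(test_mean$y)] <- mean(df$y)
pred_mean <- predict(mod, newdata = test_mean)

# Option 2: for each row with missing y, draw many plausible y values
# (here: resampled from the training data, an assumption), predict class
# probabilities for each draw, and average them.
set.seed(42)
n_draws <- 500
probs <- predict(mod, newdata = test_mean, type = "probs")
for (i in which(is.na(test$y))) {
  draws <- data.frame(x = test$x[i],
                      y = sample(df$y, n_draws, replace = TRUE))
  probs[i, ] <- colMeans(predict(mod, newdata = draws, type = "probs"))
}

# Most probable class per row after averaging
pred_draws <- factor(colnames(probs)[max.col(probs, ties.method = "first")],
                     levels = levels(df$z))
```

Because the ordinal model is nonlinear in y, the averaged probabilities in probs for the two imputed rows will generally differ from the probabilities obtained by plugging in the mean, even if the most probable class ends up the same.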

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Ben Bolker