R Error: Error in Linear Regression Model Prediction, and Redundancy

R novice here. I'm working on a project to evaluate whether perceived stress differs by gender (Male = 0, Female = 1). I'm learning the statistics and the code at the same time, so I suspect there's some redundancy in my code. I'm using income, education, and activity level as covariates to build a predictive model.

The data set is named Data. Gender is coded 0/1, perceived stress ranges from 0 to 20 (treated as continuous), income has 4 categories (coded 1-4), education is coded 0/1, and activity level is a scale from 0 to 5. I have separate code that compares mean perceived stress between the gender groups with a two-sample t-test. I'm also working on a regression model. I believe linear regression is correct here, but I'm having some issues.
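For reference, the two-sample t-test mentioned above might look like the following minimal sketch. The data frame `toy` and its values are invented for illustration; only the column names `perceivedStress` and `gender` come from the post.

```r
# Hypothetical toy data; values are made up, not from the real data set.
toy <- data.frame(
  gender          = c(0, 0, 0, 0, 1, 1, 1, 1),
  perceivedStress = c(12, 0, 5, 8, 7, 0, 9, 10)
)

# Welch two-sample t-test (the default, var.equal = FALSE) comparing
# mean perceived stress between the two gender groups.
tt <- t.test(perceivedStress ~ gender, data = toy)
tt$estimate  # group means
tt$p.value   # two-sided p-value
```

This only works if perceivedStress is numeric; a factor response would need converting first.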

The error is:

Error: Assigned data predict(full, Data = new, na.omit = TRUE) must be compatible with existing data.
x Existing data has 2653 rows.
x Assigned data has 2243 rows.
Only vectors of size 1 are recycled.
Backtrace:
  1. base::$<-(*tmp*, lmprediction, value = <dbl>)
  2. tibble <fn>(<vctrs___>)

How can I adjust this to run the linear prediction? Also, I know I forgot something, so if you notice anything wrong, missing, or redundant, please let me know! Thanks!
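One likely reading of the row counts in the error: `lm()` drops rows with missing values before fitting (the default `na.action` is normally `na.omit`), so `predict(full)` returns one value per row actually used in the fit (2243), not one per row of Data (2653). The `drop_na()` and `complete.cases()` calls in the code below do not assign their result, so Data still contains the incomplete rows. Also, `predict()` for `lm` objects takes `newdata`, not `Data`, and has no `na.omit` argument; unrecognised arguments are silently ignored. A minimal sketch with invented data (`toy` and its values are hypothetical) showing two ways to get predictions that line up with the original rows:

```r
# Hypothetical toy data with missing values; not the real data set.
toy <- data.frame(
  perceivedStress = c(12, 7, 0, 9, 10, 0),
  gender          = c(0, 1, 1, 1, 1, 0),
  Income          = c(1, 3, 4, 2, NA, NA)
)

# Default fit: rows with NA are dropped before fitting.
fit_omit <- lm(perceivedStress ~ gender + Income, data = toy)
length(predict(fit_omit))  # 4, but nrow(toy) is 6, so assignment would fail

# Option 1: na.exclude pads predictions with NA for the dropped rows.
fit_excl <- lm(perceivedStress ~ gender + Income, data = toy,
               na.action = na.exclude)
toy$pred_excl <- predict(fit_excl)

# Option 2: pass the full data frame as newdata; rows with missing
# predictors get NA (predict.lm defaults to na.action = na.pass).
toy$pred_new <- predict(fit_omit, newdata = toy)
```

A third option is to remove incomplete rows once, with assignment (for example `Data <- Data[complete.cases(Data), ]`), before fitting any model, so that the fitted data and Data have the same rows.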

Data sample (tibble, 6 x 6):
    age Income HSgrad activeIndex perceivedStress gender
  <dbl>  <dbl>  <dbl>       <dbl> <fct>            <dbl>
1  63.4      1      0        1.75 12                   0
2  56.0      3      1        2    7                    1
3  56.5      4      1        2.75 0                    1
4  40.0      2      1        2.75 9                    1
5  47.7      2      0        1    10                   1
6  68.1     NA      0        2.5  0                    0
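Note that the sample shows perceivedStress stored as a factor (`<fct>`). Calling `as.numeric()` directly on a factor returns the internal level codes, not the original scores, so a model fitted on `as.numeric(perceivedStress)` may be using the wrong values. A small sketch using the six values from the sample column (the object names `stress_fct`, `codes`, and `scores` are illustrative):

```r
# perceivedStress values from the sample rows, stored as a factor.
stress_fct <- factor(c(12, 7, 0, 9, 10, 0))

# as.numeric() on a factor gives level codes (positions in levels()).
codes <- as.numeric(stress_fct)

# Going through character first recovers the original scores.
scores <- as.numeric(as.character(stress_fct))

levels(stress_fct)  # "0" "7" "9" "10" "12"
codes               # 5 2 1 3 4 1
scores              # 12 7 0 9 10 0
```

If the outcome is meant to be continuous, the simplest route is probably to skip the `factor()` conversion and keep the column numeric.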


gender <- ifelse(dfJHS$sex == "Male", 0, 1)
dfJHS$gender <- gender
View(dfJHS)
Data<-dfJHS %>% select(-sex)
View(Data)
dim(Data)
Data$perceivedStress <- factor(Data$perceivedStress)
#Remove NA
Data %>% drop_na()
Data[complete.cases(Data),]
#section with data visualizations you probably won't need for this (lots of histograms, shapiro test, and a qq plot)

#check linear fit for two primary variables and perform linear regression.
model <-lm(perceivedStress ~ gender, data = Data)
summary(model)

#Checking this data meets assumptions for a linear regression.
aug <- augment(model)
resids <- residuals(model)
fitted <- fitted(model)
#Convert primary dependent variable to factor for analysis
Data$perceivedStress <- factor(Data$perceivedStress)
#check linear fit for two primary variables and perform  linear regression.
model <-lm(perceivedStress ~ gender, data = Data)
summary(model)

#Checking this data meets assumptions for a linear regression.
aug <- augment(model)
resids <- residuals(model)
fitted <- fitted(model)
##Assumption 1: Residuals Normally Distributed
ggplot(aug) + geom_histogram(aes(x=.resid),
                             bins=15)
ggplot(aug) + geom_qq(aes(sample=.resid))
  
##Assumption 2: Homoscedasticity
ggplot(aug) + geom_point(aes(x=.fitted, y=.resid)) +
  geom_hline(yintercept=0, lty=2)+
  theme_bw()
##Assumption 4: Linear Relationship
ggplot(aug, aes(x=gender, y=perceivedStress)) + geom_point()+
  geom_smooth(method = "lm", se = FALSE)+
  theme_bw()

#determine if glm is better fit - No notable differences due to no change in complexity.
mod <- glm(perceivedStress ~ gender, data=Data)
summary(survive_age)
summary(mod)
aug <- augment(mod)
resids <- residuals(mod)
fitted <- fitted(mod)
## Assumption 1: Residuals Normally Distributed
ggplot(aug) + geom_histogram(aes(x=.resid),
                             bins=15)
ggplot(aug) + geom_qq(aes(sample=.resid))

## Assumption 2: Homoscedasticity
ggplot(aug) + geom_point(aes(x=.fitted, y=.resid)) +
  geom_hline(yintercept=0, lty=2)+
  theme_bw()

## Assumption 4: Linear Relationship
ggplot(aug, aes(x=gender, y=perceivedStress)) + 
  geom_point()+
  geom_smooth(method = "lm", se = FALSE)+
  theme_bw()

#ERROR OCCURS IN THIS CHUNK
#Confirm Model works using predictions and Model Specification
new<-(Data$gender=1)
full <- lm(formula = as.numeric(perceivedStress) ~ gender*age*Income*HSgrad, data=Data)
full
Data$lmprediction<- predict(full, Data = new, na.omit=TRUE)
var<-Data$perceivedStress
Data$lmprediction<- predict(full, Subset)


rmse2 <- function(x=gender, y=perceivedStress, data=Data, na.rm = TRUE){
  res <- sqrt(mean((Data$gender-Data$perceivedStress)^2, na.rm = TRUE))
  return(res)}
#observed RMSE of full model
rmse2(x=gender, y=lmprediction, data=Data)
#test other models

model1 <- lm(formula = perceivedStress~., data=Data)
model1
#Total models include model(only perceivedStress and gender), mod(.), and full(interactions). 
#Model validation through backwards selection
aic.backwards <- step(full, trace=TRUE) 
glance(aic.backwards)
tidy(aic.backwards)
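
The `rmse2()` helper above ignores its arguments and compares gender with perceivedStress rather than observed values with predictions. A hedged sketch of an RMSE helper that takes two numeric vectors (the name `rmse` and the toy numbers are illustrative, not from the post):

```r
# Root mean squared error between observed and predicted values.
# Pairs where either value is NA are dropped when na.rm = TRUE.
rmse <- function(observed, predicted, na.rm = TRUE) {
  sqrt(mean((observed - predicted)^2, na.rm = na.rm))
}

# Toy check: errors are 1, -1, 2, and one missing prediction.
obs  <- c(10, 8, 5, 7)
pred <- c(9, 9, 3, NA)
rmse(obs, pred)  # sqrt((1 + 1 + 4) / 3) = sqrt(2)
```

With the real data this would be called on a numeric outcome column and a prediction column of the same length, for example `rmse(Data$perceivedStress, Data$lmprediction)`, assuming perceivedStress is numeric and the predictions line up row by row with Data.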


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
