'Logit model output is incorrect

I am running a Logit Model on data I found in Kaggle.

https://www.kaggle.com/datasets/leonardopena/top50spotify2019

My goal is to predict which songs will be an international hit (TRUE). The model seems to predicting the songs that not going to be an international hit (FALSE).

Could someone shed some light on why the model is predicting FALSE instead of TRUE? Appreciate all help.

structure(list(bpm = c(105L, 170L, 120L, 87L, 129L, 125L), 
nrgy = c(72L, 
71L, 42L, 38L, 71L, 94L), dnce = c(72L, 74L, 75L, 72L, 58L, 
74L), dB = c(-7L, -4L, -8L, -8L, -8L, -1L), hit = c(TRUE, 
TRUE, TRUE, FALSE, TRUE, FALSE)), row.names = c(8L, 80L, 
15L, 361L, 42L, 185L), class = "data.frame")

dfTop50 <- read.csv("SpotifyTop50country_prepared.csv", 
row.names = 1, stringsAsFactors = FALSE)
train <- 0.7 
nCases <- nrow(dfTop50)

set.seed(123)
trainCases <- sample(1:nCases, floor(train*nCases))

dfTop50Train <- dfTop50[ trainCases ,]
dfTop50Test <- dfTop50[ -trainCases ,]

mdlA <- hit ~ bpm + nrgy + dnce + dB 
str(mdlA)

rsltLogit <- glm(mdlA, data = dfTop50Train, family = 
binomial("logit"))
predLogit <- predict(rsltLogit, dfTop50Test, type = 
"response")


head(cbind(Observed = dfTop50Test$hit, Predicted = 
predLogit))
predLogit <- factor(as.numeric(predLogit > 0.5),
                levels = c(0,1),
                labels=c("FALSE","TRUE"))
accLogit <- mean(predLogit == dfTop50Test$hit)
describe(accLogit)

tblLog <- table(Predicted = predLogit,
  Observed = dfTop50Test$hit)
View(tblLog)


Solution 1:[1]

Without your testing/training data or code to get the Kaggle data into the same format, it's hard to know for sure. However, using the small snippet of data at the start of your code, it is pretty clear that a GLM on those data is predicting hit=TRUE. Note that the predicted probabilities (pr in the data below) are approximately 1 for the hits and 0 for the non-hits.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
dat <- structure(list(bpm = c(105L, 170L, 120L, 87L, 129L, 125L), 
               nrgy = c(72L, 
                        71L, 42L, 38L, 71L, 94L), dnce = c(72L, 74L, 75L, 72L, 58L, 
                                                           74L), dB = c(-7L, -4L, -8L, -8L, -8L, -1L), hit = c(TRUE, 
                                                                                                               TRUE, TRUE, FALSE, TRUE, FALSE)), row.names = c(8L, 80L, 
                                                                                                                                                               15L, 361L, 42L, 185L), class = "data.frame")


g <- glm(hit ~ bpm + nrgy + dnce + dB, data=dat, family=binomial)
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
broom::augment(g) %>% 
  select(hit, .fitted) %>%
  mutate(pr = plogis(.fitted))
#> # A tibble: 6 × 3
#>   hit   .fitted       pr
#>   <lgl>   <dbl>    <dbl>
#> 1 TRUE     25.7 1.00e+ 0
#> 2 TRUE     39.1 1   e+ 0
#> 3 TRUE     23.4 1.00e+ 0
#> 4 FALSE   -23.7 4.95e-11
#> 5 TRUE     24.2 1.00e+ 0
#> 6 FALSE   -25.9 5.87e-12

Created on 2022-04-18 by the reprex package (v2.0.1)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 DaveArmstrong