Before I dive into my favorite method, let’s take a look at the methods we’ve covered thus far.

Models/Methods:

* Regression Trees
* Classification Trees
* Boosting
* Bagging
* Random Forest
* Simple Linear Regression
* Multiple Linear Regression
* Logistic Regression
* k-Nearest Neighbors
* Clustering
* Principal Components Analysis

I think my favorite would be logistic regression, simply because of how useful it is for answering Yes/No questions and evaluating the odds.

We’ll run this example using the Titanic dataset, predicting whether an individual survived based on some of the other variables we have.

library(readr) # read_csv() comes from the readr package
titanicData <- read_csv("_Rmd/_datasets/titanic.csv") # assigning to a different variable
## Rows: 1310 Columns: 14
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): name, sex, ticket, cabin, embarked, boat, home.dest
## dbl (7): pclass, survived, age, sibsp, parch, fare, body
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
attributes(titanicData)$names # take a look at the different variables we have
##  [1] "pclass"    "survived"  "name"      "sex"       "age"       "sibsp"     "parch"     "ticket"   
##  [9] "fare"      "cabin"     "embarked"  "boat"      "body"      "home.dest"

Here are the variables we have access to. I think we will keep pclass, sex, and age as predictors, though I suspect a couple of these variables will have some collinearity.
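If we wanted to check that collinearity hunch, a quick informal sketch like the one below could do it (my own addition, using base R; the cross-tabulation is usually more informative than a correlation when one of the variables is categorical):

table(titanicData$sex, titanicData$pclass) # are the sexes spread evenly across classes?
cor(titanicData$age, titanicData$pclass, use = "complete.obs") # rough age vs. class correlation, ignoring missing ages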

We will also take a quick look at what the table looks like to make sure we don’t miss anything.

knitr::kable(head(titanicData)) # A simple table to see example data
| pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NA | St Louis, MO |
| 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NA | Montreal, PQ / Chesterville, ON |
| 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NA | NA | Montreal, PQ / Chesterville, ON |
| 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NA | 135 | Montreal, PQ / Chesterville, ON |
| 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NA | NA | Montreal, PQ / Chesterville, ON |
| 1 | 1 | Anderson, Mr. Harry | male | 48.0000 | 0 | 0 | 19952 | 26.5500 | E12 | S | 3 | NA | New York, NY |
glmFit <- glm(survived ~ age*sex*pclass, data = titanicData, family = "binomial")
summary(glmFit)
## 
## Call:
## glm(formula = survived ~ age * sex * pclass, family = "binomial", 
##     data = titanicData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0381  -0.7029  -0.4614   0.5132   2.4255  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         6.811494   1.617141   4.212 2.53e-05 ***
## age                -0.030302   0.041356  -0.733 0.463733    
## sexmale            -5.652917   1.771477  -3.191 0.001417 ** 
## pclass             -2.153601   0.574775  -3.747 0.000179 ***
## age:sexmale         0.003032   0.045641   0.066 0.947042    
## age:pclass          0.003824   0.015539   0.246 0.805619    
## sexmale:pclass      1.613719   0.647919   2.491 0.012752 *  
## age:sexmale:pclass -0.012706   0.018117  -0.701 0.483090    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1414.62  on 1045  degrees of freedom
## Residual deviance:  940.68  on 1038  degrees of freedom
##   (264 observations deleted due to missingness)
## AIC: 956.68
## 
## Number of Fisher Scoring iterations: 5

So we can see the most significant terms were the intercept, your class, and your sex. No surprises there, but there was a noteworthy interaction between sex and class. I would imagine that if you were a man in first class you probably knew important people.
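Since part of my pitch for logistic regression was evaluating the odds, here is a small sketch (my addition, not in the original output) of how those coefficients translate into odds ratios:

exp(coef(glmFit)) # exponentiate the log-odds coefficients to get odds ratios
exp(confint.default(glmFit)) # Wald-style 95% intervals on the odds-ratio scale

Next, let’s plug in a few hypothetical passengers and get predicted survival probabilities.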

predict(glmFit, newdata = data.frame(pclass = c(1,2,3), sex = c('male','female','female'), age = c(25,80,30)),
                type = "response", se.fit = TRUE)
## $fit
##         1         2         3 
## 0.4291980 0.6664000 0.4466476 
## 
## $se.fit
##          1          2          3 
## 0.05221614 0.15010290 0.04641147 
## 
## $residual.scale
## [1] 1

Surprisingly, we find that an 80-year-old woman in 2nd class would be predicted as more likely to survive than a 30-year-old woman in 3rd class.
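If we wanted rough uncertainty around those predictions, one common sketch (again my addition, using a hypothetical `newPassengers` data frame that mirrors the passengers above) is to build the interval on the link scale and back-transform to probabilities:

newPassengers <- data.frame(pclass = c(1, 2, 3),
                            sex = c("male", "female", "female"),
                            age = c(25, 80, 30))
linkPred <- predict(glmFit, newdata = newPassengers, type = "link", se.fit = TRUE)
# approximate 95% interval on the log-odds scale, then back-transform with plogis()
cbind(lower = plogis(linkPred$fit - 1.96 * linkPred$se.fit),
      estimate = plogis(linkPred$fit),
      upper = plogis(linkPred$fit + 1.96 * linkPred$se.fit))

We can also visualize the fitted model.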

ggiraphExtra::ggPredict(glmFit)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

Observing our graph, we can also see a significant jump in survival for women in 2nd class in comparison to men in 1st or 3rd class.

This may also be due to how many survivors there were of each sex in each class, but we can delve into that another time.
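If we did want a quick glance at that now, a simple cross-tabulation (a sketch of my own, not part of the original analysis) would show the survivor counts by sex within each class:

xtabs(~ sex + survived + pclass, data = titanicData) # counts of non-survivors (0) and survivors (1) by sex and class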

To sum up why I like logistic regression, and regression in general: in my experience it’s easier to tell what matters and when. I’ve spent more time in other classes looking at the amount of variability explained by adding and removing predictors, and that work was mainly focused on regression examples. So maybe I’m biased, but I find the power of logistic regression in these yes/no type scenarios to be concretely useful.
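For the “adding and removing predictors” part, the logistic regression analogue is comparing nested models by deviance. A sketch of that (my addition, using a hypothetical main-effects-only model) might look like:

glmFitMain <- glm(survived ~ age + sex + pclass, data = titanicData, family = "binomial") # drop the interactions
anova(glmFitMain, glmFit, test = "Chisq") # likelihood-ratio test for whether the interaction terms matter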

R to make this post: # rmarkdown::render("_Rmd/2022-07-10-Project-2-Blog-Post.Rmd", output_format = md_document("markdown_github"), output_dir = "_posts", output_options = list(keep_html = FALSE))

