Favorite Modeling Methods I’ve Learned from ST558
Before I dive into my favorite method, let’s take a look at the methods we’ve covered so far.
Models/Methods:

* Regression Trees
* Classification Trees
* Boosting
* Bagging
* Random Forests
* Simple Linear Regression
* Multiple Linear Regression
* Logistic Regression
* k-Nearest Neighbors
* Clustering
* Principal Components Analysis
I think my favorite would be logistic regression because of how useful it is for answering yes/no questions and evaluating the odds.
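Just to pin down what I mean by odds: logistic regression models the log-odds of the outcome as a linear function of the predictors, which is what makes the yes/no and odds interpretation so natural:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$$

where p is the probability of a “yes” (here, surviving).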
We’ll run this example using the Titanic dataset, predicting whether an individual survived based on some of the other variables we have.
library(readr) # read_csv() comes from readr (loaded here in case it isn't already)
titanicData <- read_csv("_Rmd/_datasets/titanic.csv") # assigning to a separate variable
## Rows: 1310 Columns: 14
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): name, sex, ticket, cabin, embarked, boat, home.dest
## dbl (7): pclass, survived, age, sibsp, parch, fare, body
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
attributes(titanicData)$names # take a look at the different variables we have
## [1] "pclass" "survived" "name" "sex" "age" "sibsp" "parch" "ticket"
## [9] "fare" "cabin" "embarked" "boat" "body" "home.dest"
Here are the variables we have access to. I think we will keep it to pclass, sex, and age for predictors. I think a couple of these variables will have some collinearity.
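If we wanted to actually check that collinearity suspicion, one quick option (not something done in the fit below) is to look at variance inflation factors on a main-effects-only model, assuming the `car` package is available:

``` r
# Hypothetical collinearity check using variance inflation factors (car::vif);
# a main-effects-only fit so each predictor gets its own VIF.
library(car)

mainFit <- glm(survived ~ age + sex + pclass, data = titanicData, family = "binomial")
vif(mainFit) # values much larger than ~5 would flag problematic collinearity
```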
We will also just take a quick look at what the table looks like to make sure we don’t miss anything.
knitr::kable(head(titanicData)) # A simple table to see example data
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NA | St Louis, MO |
1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NA | Montreal, PQ / Chesterville, ON |
1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NA | NA | Montreal, PQ / Chesterville, ON |
1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NA | 135 | Montreal, PQ / Chesterville, ON |
1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NA | NA | Montreal, PQ / Chesterville, ON |
1 | 1 | Anderson, Mr. Harry | male | 48.0000 | 0 | 0 | 19952 | 26.5500 | E12 | S | 3 | NA | New York, NY |
glmFit <- glm(survived ~ age*sex*pclass, data = titanicData, family = "binomial")
summary(glmFit)
##
## Call:
## glm(formula = survived ~ age * sex * pclass, family = "binomial",
## data = titanicData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0381 -0.7029 -0.4614 0.5132 2.4255
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.811494 1.617141 4.212 2.53e-05 ***
## age -0.030302 0.041356 -0.733 0.463733
## sexmale -5.652917 1.771477 -3.191 0.001417 **
## pclass -2.153601 0.574775 -3.747 0.000179 ***
## age:sexmale 0.003032 0.045641 0.066 0.947042
## age:pclass 0.003824 0.015539 0.246 0.805619
## sexmale:pclass 1.613719 0.647919 2.491 0.012752 *
## age:sexmale:pclass -0.012706 0.018117 -0.701 0.483090
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1414.62 on 1045 degrees of freedom
## Residual deviance: 940.68 on 1038 degrees of freedom
## (264 observations deleted due to missingness)
## AIC: 956.68
##
## Number of Fisher Scoring iterations: 5
So we can see that the most significant terms were the intercept, passenger class, and sex. No surprises there, but there was a noteworthy interaction between sex and class. I would imagine that if you were a man in first class you probably knew important people.
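Since the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which I find easier to talk about. A quick sketch of how to get them (not shown in the summary above):

``` r
# Odds ratios with profile-likelihood confidence intervals;
# exp() moves the estimates from the log-odds scale to the odds scale.
exp(cbind(OR = coef(glmFit), confint(glmFit)))
```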
predict(glmFit, newdata = data.frame(pclass = c(1,2,3), sex = c('male','female','female'), age = c(25,80,30)),
type = "response", se.fit = TRUE)
## $fit
## 1 2 3
## 0.4291980 0.6664000 0.4466476
##
## $se.fit
## 1 2 3
## 0.05221614 0.15010290 0.04641147
##
## $residual.scale
## [1] 1
Surprisingly, we find that an 80-year-old woman in 2nd class would be predicted as more likely to survive than a 30-year-old woman in 3rd class.
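Since odds are part of why I like this model, the same predictions can be re-expressed as odds, p / (1 - p). A small sketch using the same hypothetical passengers as above:

``` r
# Re-express the predicted probabilities as odds (p / (1 - p)).
newPassengers <- data.frame(pclass = c(1, 2, 3),
                            sex = c("male", "female", "female"),
                            age = c(25, 80, 30))
probs <- predict(glmFit, newdata = newPassengers, type = "response")
probs / (1 - probs) # e.g. 0.666 / (1 - 0.666) is roughly 2-to-1 odds of surviving
```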
ggiraphExtra::ggPredict(glmFit)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
Observing our graph, we can also see a noticeable jump for 2nd class in female survivors compared to male survivors in 1st or 3rd class.
This may also be due to how many survivors of either gender there were in each class, but we can delve into that another time.
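If we did want to delve into that, a quick cross-tabulation of survivors by sex and class would show the raw counts; a minimal sketch (not part of the analysis above):

``` r
# Count survivors by sex and passenger class (rows with survived == 1 only).
with(subset(titanicData, survived == 1), table(sex, pclass))
```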
To sum up why I like logistic regression, and regression in general: in my experience it’s easier to tell what matters and when. I’ve spent more time in other classes looking at the amount of variability explained by adding and removing predictors, and that was mainly focused on regression examples. So maybe I’m biased, but I find the power of logistic regression in these yes/no type scenarios to be concretely useful.
R to make this post: # rmarkdown::render("_Rmd/2022-07-10-Project-2-Blog-Post.Rmd", output_format = md_document("markdown_github"), output_dir = "_posts", output_options = list(keep_html = FALSE))