# LOAD THE NECESSARY PACKAGES
# CONSIDER SUPPRESSING WARNINGS & MESSAGES
library(tidyverse)
library(skimr)
# for challenge
library(broom)
Basic Regression I
Activity 08
Overview
The focus of Activity 08 will be on data modeling using linear regression. Specifically, we will use simple linear regression: one numerical predictor/explanatory variable. We will also continue to practice our data wrangling and visualization skills.
Needed Packages
The following loads the packages that are needed for this activity. We assume that you have installed the packages. Also note that we have suppressed the messages so the compiled html is less cluttered.
Make sure to load the necessary packages!!
Tasks
Complete the following series of tasks. Remember to render early and render often.
Task 1
Import state SAT data
We will be exploring a relatively small dataset, but hopefully one that you find interesting. The dataset contains educational data for each US state. That is, each observation is a US state.
The following variables are in the dataset:

- state := state’s abbreviation
- region := state’s Census-defined region
- sat_math := state’s mean SAT mathematics score
- teach_pay := state’s median teacher pay (thousands of dollars)
- pct_taking := state’s percentage of juniors/seniors taking the SAT
- division := state’s Census-defined division
Import the state_sat.csv file which should be located in your data/ subdirectory. Store the data as state_sat.
# import and store the data
state_sat <- read_csv("data/state_sat.csv")
# inspect data with glimpse()
glimpse(state_sat)
Rows: 51
Columns: 6
$ state <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL",…
$ region <chr> "South", "West", "West", "South", "West", "West", "Northeas…
$ sat_math <dbl> 558, 513, 521, 550, 511, 538, 504, 495, 473, 496, 477, 510,…
$ teach_pay <dbl> 31.3, 49.6, 32.5, 29.3, 43.1, 35.4, 50.3, 40.5, 43.7, 33.3,…
$ pct_taking <dbl> 8, 47, 28, 6, 45, 30, 79, 66, 50, 48, 63, 54, 15, 14, 57, 5…
$ division <chr> "East South Central", "Pacific", "Mountain", "West South Ce…
Task 2
EDA
In this section we will be performing an Exploratory Data Analysis (EDA).
Let’s go a little deeper and use skim_without_charts() from the skimr package, which was introduced in the reading. It is an extremely useful package for performing a quick EDA! It will provide a nice breakdown of the variables in the data with summary statistics.
# LOAD skimr IN load-pkgs CHUNK: MAY NEED TO INSTALL IT
# skim_without_charts() the data
skim_without_charts(state_sat)
| Name | state_sat |
| Number of rows | 51 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| state | 0 | 1 | 2 | 2 | 0 | 51 | 0 |
| region | 0 | 1 | 4 | 9 | 0 | 4 | 0 |
| division | 0 | 1 | 7 | 18 | 0 | 9 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| sat_math | 0 | 1 | 529.27 | 34.83 | 473.0 | 500.00 | 521 | 557.00 | 600.0 |
| teach_pay | 0 | 1 | 35.89 | 6.23 | 26.3 | 31.55 | 35 | 40.05 | 50.3 |
| pct_taking | 0 | 1 | 35.49 | 26.29 | 4.0 | 9.00 | 30 | 61.00 | 80.0 |
Use this output to describe important features of the dataset such as: missingness for variables, number of region levels/categories, number of division levels/categories, and a quick check of the numeric variables — do the summary statistics imply the data is reasonable.
There is no missing data. There are 4 unique regions and 9 unique divisions. There are no clear outliers or strange values for the numerical variables. The data appears to be reasonable.
We will be exploring the relationship between SAT math scores (sat_math) and teacher pay (teach_pay).
Let’s begin exploring the relationship between sat_math and teach_pay visually.
I’ve provided a scatterplot below - remove eval: false so that it will be included in the output.
ggplot(state_sat, aes(x = teach_pay, y = sat_math)) +
geom_point(shape = 1) +
labs(
title = "Relationship between state SAT math scores and teacher pay",
x = "Median Teacher Pay (in thousands of dollars)",
y = "Mean SAT Math Score"
)
Does there look to be a linear relationship/association between the two variables? Calculate the correlation between these two variables.
# CALCULATE CORRELATION
cor(state_sat$sat_math, state_sat$teach_pay)
[1] -0.4039747
# alternative way
state_sat %>%
  summarise(correlation = cor(sat_math, teach_pay))
# A tibble: 1 × 1
correlation
<dbl>
1 -0.404
Given the graph and correlation, describe the relationship/association between these two variables. Does this relationship seem reasonable or expected? Explain.
It looks like there is a moderate negative linear association between a state’s median teacher pay and its mean SAT math score.
The relationship does not seem reasonable: more pay is associated with lower math performance. We would expect states that pay teachers better to get better results (i.e., higher math scores).
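As an aside, cor() is computing the familiar formula \(r = \frac{cov(x, y)}{s_x s_y}\). A quick sketch verifying that by hand with cov() and sd():

```r
# correlation is the covariance scaled by both standard deviations;
# this should match cor(state_sat$sat_math, state_sat$teach_pay), about -0.404
cov(state_sat$sat_math, state_sat$teach_pay) /
  (sd(state_sat$sat_math) * sd(state_sat$teach_pay))
```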
Copy the graphic above and let’s layer/add on the line of best fit to our graph to help us visualize the relationship. Recall from the book we will want method = "lm" and se = FALSE for this layer.
# BUILD GRAPHIC HERE
ggplot(state_sat, aes(x = teach_pay, y = sat_math)) +
geom_point(shape = 1) +
labs(
title = "Relationship between state SAT math scores and teacher pay",
x = "Median Teacher Pay (in thousands of dollars)",
y = "Mean SAT Math Score"
) +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula 'y ~ x'
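As a side note, the `using formula 'y ~ x'` message can be silenced by supplying the formula to geom_smooth() explicitly. A sketch (same fit, just no message):

```r
# same line of best fit; stating formula = y ~ x suppresses the message
ggplot(state_sat, aes(x = teach_pay, y = sat_math)) +
  geom_point(shape = 1) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    title = "Relationship between state SAT math scores and teacher pay",
    x = "Median Teacher Pay (in thousands of dollars)",
    y = "Mean SAT Math Score"
  )
```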
Task 3
Fit a model
In the last graphic we fit a simple linear regression/model to the data. We want to calculate the equation for this line and interpret its values in the context of our problem. Use lm() to calculate the model fit and store the result in math_pay_model.
#| label: fit-model
# fit & store the model sat_math ~ teach_pay
math_pay_model <- lm(sat_math ~ teach_pay, data = state_sat)
Below the following R chunk is an equation for the line of best fit, but it is missing the estimated intercept (\(b_0\)) and slope (\(b_1\)). We can get these estimates by running/printing math_pay_model. We can also get them as part of a more comprehensive printout that summarizes the fitted model. The code for both is provided below.
## REMOVE eval=FALSE IN R CHUNK OPTIONS
## ASSUMES YOU HAVE CREATED math_pay_model
# just estimated coefficients
math_pay_model
Call:
lm(formula = sat_math ~ teach_pay, data = state_sat)
Coefficients:
(Intercept) teach_pay
610.40 -2.26
# estimated coefficients with std output
summary(math_pay_model)$coefficients
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 610.395612 26.626085 22.924723 7.578726e-28
teach_pay -2.260258 0.731169 -3.091294 3.283602e-03
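Since broom was loaded for the challenge, the same coefficient table can also be pulled out as a tidy tibble, which is convenient for further wrangling. A sketch using broom::tidy() (its output columns are term, estimate, std.error, statistic, and p.value):

```r
# tidy() returns the coefficient table as a tibble instead of a matrix
library(broom)
tidy(math_pay_model)
```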
Replace INTERCEPT and SLOPE with their estimated values in the equation below.
\[\widehat{sat\_math} = 610.40 - 2.26\left(teach\_pay \right)\] Interpret the estimated intercept and slope in context:
- Intercept: For a state with median teacher pay of $0 we would EXPECT its mean SAT math score to be about 610.
- Slope: For each additional $1,000 in median teacher pay a state’s mean SAT math score DECREASES, on average, by about 2.26 points.
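The estimated coefficients can also be extracted programmatically with coef() rather than read off the printout, which avoids transcription errors. A minimal sketch:

```r
# coef() returns a named numeric vector of the estimated coefficients
b <- coef(math_pay_model)
b["(Intercept)"]   # estimated intercept, about 610.40
b["teach_pay"]     # estimated slope, about -2.26
```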
Task 4
Fitted values & residuals
Let’s take our model and calculate fitted/predicted values for each state’s mean SAT math score and compare them to their actual/observed values — residuals.
To do this:
- Begin with state_sat, then
- keep the variables state, teach_pay, and sat_math, then
- create/calculate two new variables called sat_math_hat and residual using math_pay_model with the functions fitted() and residuals() respectively, then
- keep only the following observations/states: South Dakota ("SD"), Illinois ("IL"), and Florida ("FL"). Feel free to add another state of your choice.
# calculate fitted values and residuals
state_sat %>%
select(state, teach_pay, sat_math) %>%
mutate(
sat_math_hat = fitted(math_pay_model),
residual = residuals(math_pay_model)
) %>%
  filter(state %in% c("SD", "IL", "FL"))
# A tibble: 3 × 5
state teach_pay sat_math sat_math_hat residual
<chr> <dbl> <dbl> <dbl> <dbl>
1 FL 33.3 496 535. -39.1
2 IL 40.9 575 518. 57.0
3 SD 26.3 566 551. 15.0
Identify if the model over- or under-estimates the mean SAT math score for each state. Which of the three states does it estimate best? Worst? Does a negative residual correspond with an over- or under-estimate by the model?
The model overestimates for FL and underestimates for IL and SD. It does best at estimating SD’s mean SAT math score and worst for IL. Negative residuals correspond to overestimation by the model.
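As a quick sanity check, a fitted value can be reproduced by hand from the estimated equation. A sketch for South Dakota, using the rounded coefficients:

```r
# fitted value for SD: b0 + b1 * teach_pay, about 551 (matches sat_math_hat above)
610.40 - 2.26 * 26.3

# residual = observed - fitted, about 15; positive, so the model underestimates SD
566 - (610.40 - 2.26 * 26.3)
```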
Task 5
Prediction
Suppose we told you that a mythical 52nd state existed and it paid its teachers a median salary of $47.5 thousand. Using the regression model, what would you estimate its mean SAT math score to be? Your answer should be a complete sentence.
# estimate using the equation above
610.4 - 2.26 * 47.5
[1] 503.05
We would PREDICT the 52nd state to have a mean SAT math score of about 503 (503.05).
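Rather than plugging numbers into the equation by hand, the same prediction can be obtained directly from the fitted model with predict(). A sketch; the result differs slightly from 503.05 because predict() uses the unrounded coefficients stored in the model:

```r
# predict() expects new data with the same predictor name as the model formula
new_state <- tibble(teach_pay = 47.5)
predict(math_pay_model, newdata = new_state)
```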