# LOAD THE NECESSARY PACKAGES
# CONSIDER SUPPRESSING WARNINGS & MESSAGES
library(tidyverse)
library(skimr)
# for challenge
library(broom)
Basic Regression I
Activity 08
Overview
The focus of Activity 08 will be on data modeling using linear regression. Specifically, we will use simple linear regression: one numerical predictor/explanatory variable. We will also continue to practice our data wrangling and visualization skills.
Needed Packages
The following loads the packages that are needed for this activity. We assume that you have installed the packages. Also note that we have suppressed the messages so the compiled html is less cluttered.
Make sure to load the necessary packages!!
Tasks
Complete the following series of tasks. Remember to render early and render often.
Task 1
Import state SAT data
We will be exploring a relatively small dataset, but hopefully one that you find interesting. The dataset contains educational data for each US state. That is, each observation is a US state.
The following variables are in the dataset:

- state := state’s abbreviation
- region := state’s Census-defined region
- sat_math := state’s mean SAT mathematics score
- teach_pay := state’s median teacher pay (thousands of dollars)
- pct_taking := state’s percentage of juniors/seniors taking the SAT
- division := state’s Census-defined division
Import the state_sat.csv file which should be located in your data/ subdirectory. Store the data as state_sat.
# import and store the data
state_sat <- read_csv("data/state_sat.csv")
# inspect data with glimpse()
glimpse(state_sat)
Rows: 51
Columns: 6
$ state <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL",…
$ region <chr> "South", "West", "West", "South", "West", "West", "Northeas…
$ sat_math <dbl> 558, 513, 521, 550, 511, 538, 504, 495, 473, 496, 477, 510,…
$ teach_pay <dbl> 31.3, 49.6, 32.5, 29.3, 43.1, 35.4, 50.3, 40.5, 43.7, 33.3,…
$ pct_taking <dbl> 8, 47, 28, 6, 45, 30, 79, 66, 50, 48, 63, 54, 15, 14, 57, 5…
$ division <chr> "East South Central", "Pacific", "Mountain", "West South Ce…
Task 2
EDA
In this section we will be performing an Exploratory Data Analysis (EDA).
Let’s go a little deeper and use skim_without_charts() from the skimr package, which was introduced in the reading. It is an extremely useful package for performing a quick EDA! It will provide a nice breakdown of the variables in the data with summary statistics.
# LOAD skimr IN load-pkgs CHUNK: MAY NEED TO INSTALL IT
# skim_without_charts() the data
skim_without_charts(state_sat)
| Name | state_sat |
| Number of rows | 51 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| state | 0 | 1 | 2 | 2 | 0 | 51 | 0 |
| region | 0 | 1 | 4 | 9 | 0 | 4 | 0 |
| division | 0 | 1 | 7 | 18 | 0 | 9 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| sat_math | 0 | 1 | 529.27 | 34.83 | 473.0 | 500.00 | 521 | 557.00 | 600.0 |
| teach_pay | 0 | 1 | 35.89 | 6.23 | 26.3 | 31.55 | 35 | 40.05 | 50.3 |
| pct_taking | 0 | 1 | 35.49 | 26.29 | 4.0 | 9.00 | 30 | 61.00 | 80.0 |
Use this output to describe important features of the dataset such as: missingness for variables, number of region levels/categories, number of division levels/categories, and a quick check of the numeric variables — do the summary statistics imply the data is reasonable.
There is no missing data. There are 4 unique regions and 9 unique divisions. There are no clear outliers or strange values for the numerical variables. The data appears to be reasonable.
We will be exploring the relationship between SAT math scores (sat_math) and teacher pay (teach_pay).
Let’s begin exploring the relationship between sat_math and teach_pay visually.
I’ve provided a scatterplot below - remove eval: false so that it will be included in the output.
ggplot(state_sat, aes(x = teach_pay, y = sat_math)) +
geom_point(shape = 1) +
labs(
title = "Relationship between state SAT math scores and teacher pay",
x = "Median Teacher Pay (in thousands of dollars)",
y = "Mean SAT Math Score"
)
Does there look to be a linear relationship/association between the two variables? Calculate the correlation between these two variables.
# CALCULATE CORRELATION
cor(state_sat$sat_math, state_sat$teach_pay)
[1] -0.4039747
# alternative way
state_sat %>%
  summarise(correlation = cor(sat_math, teach_pay))
# A tibble: 1 × 1
correlation
<dbl>
1 -0.404
Given the graph and correlation, describe the relationship/association between these two variables. Does this relationship seem reasonable or expected? Explain.
It looks like there is a moderate negative linear association between a state’s median teacher pay and its mean SAT math score.
The relationship does not seem reasonable: more pay is associated with lower math performance. We would expect states that pay teachers better to get better results (i.e., higher math scores).
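As an aside, cor() is computing the familiar formula \(r = \frac{cov(x, y)}{s_x s_y}\). A quick sketch verifying that by hand with cov() and sd():

```r
# correlation is the covariance scaled by both standard deviations;
# this should match cor(state_sat$sat_math, state_sat$teach_pay), about -0.404
cov(state_sat$sat_math, state_sat$teach_pay) /
  (sd(state_sat$sat_math) * sd(state_sat$teach_pay))
```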
Copy the graphic above and let’s layer/add on the line of best fit to our graph to help us visualize the relationship. Recall from the book we will want method = "lm" and se = FALSE for this layer.
# BUILD GRAPHIC HERE
ggplot(state_sat, aes(x = teach_pay, y = sat_math)) +
geom_point(shape = 1) +
labs(
title = "Relationship between state SAT math scores and teacher pay",
x = "Median Teacher Pay (in thousands of dollars)",
y = "Mean SAT Math Score"
) +
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula 'y ~ x'
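As a side note, the `using formula 'y ~ x'` message can be silenced by supplying the formula to geom_smooth() explicitly. A sketch (same fit, just no message):

```r
# same line of best fit; stating formula = y ~ x suppresses the message
ggplot(state_sat, aes(x = teach_pay, y = sat_math)) +
  geom_point(shape = 1) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(
    title = "Relationship between state SAT math scores and teacher pay",
    x = "Median Teacher Pay (in thousands of dollars)",
    y = "Mean SAT Math Score"
  )
```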
Task 3
Fit a model
In the last graphic we fit a simple linear regression/model to the data. We want to calculate the equation for this line and interpret its values in the context of our problem. Use lm() to calculate the model fit and store the result in math_pay_model.
#| label: fit-model
# fit & store the model sat_math ~ teach_pay
math_pay_model <- lm(sat_math ~ teach_pay, data = state_sat)
Below the following R chunk is an equation for the line of best fit, but it is missing the estimated intercept (\(b_0\)) and slope (\(b_1\)). We can get these estimates by running/printing math_pay_model. We can also get them as part of a more comprehensive printout that summarizes the fitted model. The code for both is provided below.
## REMOVE eval=FALSE IN R CHUNK OPTIONS
## ASSUMES YOU HAVE CREATED math_pay_model
# just estimated coefficients
math_pay_model
Call:
lm(formula = sat_math ~ teach_pay, data = state_sat)
Coefficients:
(Intercept) teach_pay
610.40 -2.26
# estimated coefficients with std output
summary(math_pay_model)$coefficients
              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 610.395612 26.626085 22.924723 7.578726e-28
teach_pay -2.260258 0.731169 -3.091294 3.283602e-03
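Since broom was loaded for the challenge, the same coefficient table can also be pulled out as a tidy tibble, which is convenient for further wrangling. A sketch using broom::tidy() (its output columns are term, estimate, std.error, statistic, and p.value):

```r
# tidy() returns the coefficient table as a tibble instead of a matrix
library(broom)
tidy(math_pay_model)
```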
Replace INTERCEPT and SLOPE with their estimated values in the equation below.
\[\widehat{sat\_math} = 610.40 - 2.26\left(teach\_pay \right)\] Interpret the estimated intercept and slope in context:
- Intercept: For a state with median teacher pay of $0 we would EXPECT its mean SAT math score to be about 610.
- Slope: For each additional $1,000 in median teacher pay a state’s mean SAT math score DECREASES, on average, by about 2.26 points.
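The estimated coefficients can also be extracted programmatically with coef() rather than read off the printout, which avoids transcription errors. A minimal sketch:

```r
# coef() returns a named numeric vector of the estimated coefficients
b <- coef(math_pay_model)
b["(Intercept)"]   # estimated intercept, about 610.40
b["teach_pay"]     # estimated slope, about -2.26
```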
Task 4
Fitted values & residuals
Let’s take our model and calculate fitted/predicted values for each state’s mean SAT math score and compare them to their actual/observed values — residuals.
To do this:
- Begin with state_sat, then
- keep the variables state, teach_pay, and sat_math, then
- create/calculate two new variables called sat_math_hat and residual using math_pay_model with the functions fitted() and residuals() respectively, then
- keep only the following observations/states: South Dakota ("SD"), Illinois ("IL"), and Florida ("FL"). Feel free to add another state of your choice.
# calculate fitted values and residuals
state_sat %>%
select(state, teach_pay, sat_math) %>%
mutate(
sat_math_hat = fitted(math_pay_model),
residual = residuals(math_pay_model)
) %>%
  filter(state %in% c("SD", "IL", "FL"))
# A tibble: 3 × 5
state teach_pay sat_math sat_math_hat residual
<chr> <dbl> <dbl> <dbl> <dbl>
1 FL 33.3 496 535. -39.1
2 IL 40.9 575 518. 57.0
3 SD 26.3 566 551. 15.0
Identify if the model over- or under-estimates the mean SAT math score for each state. Which of the three states does it estimate best? Worst? Does a negative residual correspond with an over- or under-estimate by the model?
The model overestimates for FL and underestimates for IL and SD. It does best at estimating SD’s mean SAT math score and worst for IL. Negative residuals correspond to overestimation by the model.
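As a quick sanity check, a fitted value can be reproduced by hand from the estimated equation. A sketch for South Dakota, using the rounded coefficients:

```r
# fitted value for SD: b0 + b1 * teach_pay, about 551 (matches sat_math_hat above)
610.40 - 2.26 * 26.3

# residual = observed - fitted, about 15; positive, so the model underestimates SD
566 - (610.40 - 2.26 * 26.3)
```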
Task 5
Prediction
Suppose we told you that a mythical 52nd state existed and it paid its teachers a median salary of $47.5 thousand. Using the regression model, what would you estimate its mean SAT math score to be? Your answer should be a complete sentence.
# estimate using the equation above
610.4 - 2.26 * 47.5
[1] 503.05
We would PREDICT the 52nd state to have a mean SAT math score of about 503 (503.05).
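Rather than plugging numbers into the equation by hand, the same prediction can be obtained directly from the fitted model with predict(). A sketch; the result differs slightly from 503.05 because predict() uses the unrounded coefficients stored in the model:

```r
# predict() expects new data with the same predictor name as the model formula
new_state <- tibble(teach_pay = 47.5)
predict(math_pay_model, newdata = new_state)
```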