Basic Regression II

Activity 09

Author

Solutions

Overview

The focus of Activity 09 will be on data modeling using linear regression with a categorical predictor/explanatory variable. We will also continue to practice our data wrangling and visualization skills.


Needed Packages

The following loads the packages that are needed for this activity. You may want to suppress messages so the compiled html is less cluttered.

Make sure to load the necessary packages!!

# LOAD THE NECESSARY PACKAGES
library(tidyverse)
library(skimr)
# CONSIDER SUPPRESSING WARNINGS & MESSAGES
# for challenge
library(cowplot)


Tasks

Complete the following series of tasks. Remember to render early and render often.


Task 1

Import state SAT data

Import the state_sat.csv file which should be located in your data/ subdirectory — from the last activity. Store the data as state_sat. You may code it yourself or use the RStudio interface, either way make sure to include your code below. Otherwise the Rmd won’t knit for later exercises. We have also included a little extra code to make an adjustment to the dataset. We’ve suppressed the messages so the compiled html is less cluttered.

# import and store the data
state_sat <- read_csv("data/state_sat.csv")
# MAKE ADJUSTMENT TO THE state_sat DATASET
# REMOVE THE #'s IN THE NEXT TWO LINES SO THEY WILL RUN
state_sat <- state_sat %>%
  mutate(region = as.factor(region))


Recall the pieced together codebook from last activity:

  • state := state’s abbreviation
  • region := state’s Census defined (4 regions)
  • sat_math := state’s mean SAT mathematics score
  • teach_pay := state’s median teacher pay (thousands of dollars)
  • pct_taking := state’s percentage of seniors/juniors taking the SAT
  • division := state’s Census defined division (9 divisions)

It is always helpful to have an understanding of the variables in our dataset.


Task 2

EDA

We conducted an EDA in Activity 08. For time purposes we will skip skiming the data, however it is expected that you know how to do this. Let’s look at the visualizations and summary statistics.


We will be exploring the relationship between a state’s mean SAT math scores (sat_math) and its region. Phrased a little differently, we are interested in comparing the differences between state mean SAT math scores for different regions.

Let’s begin exploring the relationship between sat_math and region visually using a side-by-side boxplot graphic. Consider using the labs() function to layer on nice labels for your plot.

# CONSTRUCT PLOT
ggplot(state_sat, aes(x = region, y = sat_math)) +
  geom_boxplot() +
  labs(
    title = "Relationship between SAT math scores and region",
    x = "Region",
    y = "Mean SAT Math Score"
    )


What does the graphic tell you about the relationship between these two variables?


The states in the Midwest region tend to have larger mean math SAT scores. There is a lot of variability in mean SAT math scores for states in the South region while this very little variability for states in the northeast region. We need to be a little cautious here because each of the 4 regions probably doesn’t have many observations in it (51 states total broken into 4 groups).


Let’s calculate summary statistics to go along with the graphic.

  1. Start with state_sat, then
  2. per region calculate the summary statistics (count and mean) for sat_math, and then
  3. order the data by the mean of sat_math.
  4. Store the results in by_region.
  5. Print by_region.
# calculate & store the summary stats per region in by_region
by_region <- state_sat %>% 
  group_by(region) %>% 
  summarise(
    count = n(),
    # how you name the the calculated mean here matters for later
    mean = mean(sat_math)
  ) %>% 
  arrange(mean)
# print by_region
by_region
# A tibble: 4 × 3
  region    count  mean
  <fct>     <int> <dbl>
1 Northeast     9  501.
2 South        17  516.
3 West         13  530 
4 Midwest      12  568.
## alternative with skim
# missing count per group though
state_sat %>% 
  group_by(region) %>% 
  skim_without_charts(sat_math)
Data summary
Name Piped data
Number of rows 51
Number of columns 6
_______________________
Column type frequency:
numeric 1
________________________
Group variables region

Variable type: numeric

skim_variable region n_missing complete_rate mean sd p0 p25 p50 p75 p100
sat_math Midwest 0 1 568.42 29.49 494 565.75 570 587.75 600
sat_math Northeast 0 1 500.78 7.05 491 498.00 500 504.00 514
sat_math South 0 1 516.18 32.88 473 495.00 504 550.00 558
sat_math West 0 1 530.00 19.99 507 513.00 521 544.00 575
# quick way to group count
state_sat %>% 
  count(region)
# A tibble: 4 × 2
  region        n
  <fct>     <int>
1 Midwest      12
2 Northeast     9
3 South        17
4 West         13


Confirm that the means tell the same story as the graphic above.


Let’s compare each region’s mean sat_math to the mean of the Midwest region. That is, subtract the mean for the Midwest region from the means of each region. Using midwest_mean calculate the differences by:

  1. Begin with by_region, then
  2. create a new variable diff_with_midwest by taking midwest_mean away from the mean of each region.
# REMOVE eval: false FROM CHUNK OPTIONS
# stores mean of sat_math for midwest region
midwest_mean <- by_region %>% 
  filter(region == "Midwest") %>% 
  pull(mean)
# if you named your variable something other than "mean" you will need to edit the code line above
# calculate the differences in mean from midwest below
by_region %>% 
  mutate(
    diff_with_midwest = mean - midwest_mean
  )
# A tibble: 4 × 4
  region    count  mean diff_with_midwest
  <fct>     <int> <dbl>             <dbl>
1 Northeast     9  501.             -67.6
2 South        17  516.             -52.2
3 West         13  530              -38.4
4 Midwest      12  568.               0  


What do the differences tell you about the comparisons of the each region’s mean sat_math with that of the mean of the Midwest region?


Since all the differences are negative we can quickly determine that the mean state SAT math scores for each region are less than the mean SAT math score for states in the Midwest region.


Keep the above values in mind as we go on to fitting a model.


Task 3

Fit a model

Let’s fit the model where we use region as the predictor and sat_math is the response. Store the fitted model in region_model and then print the estimated coefficients for the model (there are multiple ways to do this — just pick one).

# fit sat_math ~ region
region_model <- lm(sat_math ~ region, data = state_sat)
# print estimated coefficients
region_model

Call:
lm(formula = sat_math ~ region, data = state_sat)

Coefficients:
    (Intercept)  regionNortheast      regionSouth       regionWest  
         568.42           -67.64           -52.24           -38.42  
# another option
summary(region_model)$coefficients
                 Estimate Std. Error   t value     Pr(>|t|)
(Intercept)     568.41667   7.538899 75.397831 1.095715e-50
regionNortheast -67.63889  11.515859 -5.873543 4.180018e-07
regionSouth     -52.24020   9.846514 -5.305451 2.973936e-06
regionWest      -38.41667  10.454573 -3.674628 6.095418e-04
# another option using broom
broom::tidy(region_model)
# A tibble: 4 × 5
  term            estimate std.error statistic  p.value
  <chr>              <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)        568.       7.54     75.4  1.10e-50
2 regionNortheast    -67.6     11.5      -5.87 4.18e- 7
3 regionSouth        -52.2      9.85     -5.31 2.97e- 6
4 regionWest         -38.4     10.5      -3.67 6.10e- 4


What is the baseline region? Midwest

Interpret each of the coefficients. Notice the slope values, they should look familiar!

  • Intercept: The EXPECTED/PREDICTED mean SAT math score for states in the Midwest is about 568.

  • Coefficient for regionNortheast: The mean SAT math scores for states in the northeast are, ON AVERAGE, 67.6 points less than they are for states in the Midwest.

  • Coefficient for regionSouth: The mean SAT math scores for states in the South are, ON AVERAGE, 52.2 points less than they are for states in the Midwest.

  • Coefficient for regionWest: The mean SAT math scores for states in the West are, ON AVERAGE, 38.4 points less than they are for states in the Midwest.

The equations for the model can be written as

\[\widehat{sat\_math} = b_0 + b_1(Northeast) + b_2(South) + b_3(West)\]

where \(Northeast\), \(South\), and \(West\) are indicator variables (for example `Northeast is shorthand for \(1_{Northeast}(x)\)).

Make the appropriate adjustments/substitutions to complete the equation below with the information from the fitted model, region_model.

\[\widehat{sat\_math} = b_0 + b_1(Northeast) + b_2(South) + b_3(West)\]

Compare your model results with your calculations in Task 2 diff_with_midwest. What do you notice?

The intercept is the mean of the Midwest and the other coefficients are the differences from the midwest.


Task 5

Dummy/Indicator Coding Information

Run the below code chunk to import dummy_coded_state_sat.rds.

# import data
dummy_coded_data <- read_rds("data/dummy_coded_state_sat.rds")
# print
dummy_coded_data
# A tibble: 51 × 6
   sat_math Midwest Northeast South  West region   
      <dbl>   <dbl>     <dbl> <dbl> <dbl> <fct>    
 1      558       0         0     1     0 South    
 2      513       0         0     0     1 West     
 3      521       0         0     0     1 West     
 4      550       0         0     1     0 South    
 5      511       0         0     0     1 West     
 6      538       0         0     0     1 West     
 7      504       0         1     0     0 Northeast
 8      495       0         0     1     0 South    
 9      473       0         0     1     0 South    
10      496       0         0     1     0 South    
# … with 41 more rows


This demonstrates the indicator values that are plugged into the above equation in Task 3 for each observation. Variables that are coded 0 for not being in the indicated group and 1 if in the indicated group. For example, region has four categories: Northeast, South, West, Midwest. We take region and create 4 indicator variables, one for each region.

For modeling purposes we only need 3 of the indicator variables to uniquely id what group each observation belongs to and this means the indicator that gets left out is our reference/baseline group for the model.


# fit models
midwest_model <- lm(sat_math ~ Northeast + South + West + Midwest, data = dummy_coded_data) 
west_model <- lm(sat_math ~ Midwest + Northeast + South + West, data = dummy_coded_data)
south_model <- lm(sat_math ~ West + Midwest + Northeast + South, data = dummy_coded_data)
northest_model <- lm(sat_math ~ South + West + Midwest + Northeast, data = dummy_coded_data)
# coefficients for each model
region_model
midwest_model
west_model
south_model
northest_model

Call:
lm(formula = sat_math ~ region, data = state_sat)

Coefficients:
    (Intercept)  regionNortheast      regionSouth       regionWest  
         568.42           -67.64           -52.24           -38.42  


Call:
lm(formula = sat_math ~ Northeast + South + West + Midwest, data = dummy_coded_data)

Coefficients:
(Intercept)    Northeast        South         West      Midwest  
     568.42       -67.64       -52.24       -38.42           NA  


Call:
lm(formula = sat_math ~ Midwest + Northeast + South + West, data = dummy_coded_data)

Coefficients:
(Intercept)      Midwest    Northeast        South         West  
     530.00        38.42       -29.22       -13.82           NA  


Call:
lm(formula = sat_math ~ West + Midwest + Northeast + South, data = dummy_coded_data)

Coefficients:
(Intercept)         West      Midwest    Northeast        South  
     516.18        13.82        52.24       -15.40           NA  


Call:
lm(formula = sat_math ~ South + West + Midwest + Northeast, data = dummy_coded_data)

Coefficients:
(Intercept)        South         West      Midwest    Northeast  
     500.78        15.40        29.22        67.64           NA  


Why do the midwest_model, west_model, south_model, and northeast_model have an NA coefficient?

Identify the reference/baseline group! Everything is interpreted with reference to this group!