Basic Regression II
Activity 09
Overview
The focus of Activity 09 will be on data modeling using linear regression with a categorical predictor/explanatory variable. We will also continue to practice our data wrangling and visualization skills.
Needed Packages
The following loads the packages that are needed for this activity. You may want to suppress messages so the compiled html is less cluttered. Make sure to load the necessary packages!!
# LOAD THE NECESSARY PACKAGES
# CONSIDER SUPPRESSING WARNINGS & MESSAGES
library(tidyverse)
library(skimr)
# for challenge
library(cowplot)
Tasks
Complete the following series of tasks. Remember to render early and render often.
Task 1
Import state SAT data
Import the state_sat.csv file, which should be located in your data/ subdirectory from the last activity. Store the data as state_sat. You may code it yourself or use the RStudio interface; either way, make sure to include your code below, otherwise the Rmd won’t knit for later exercises. We have also included a little extra code to make an adjustment to the dataset. We’ve suppressed the messages so the compiled html is less cluttered.
# import and store the data
state_sat <- read_csv("data/state_sat.csv")
# MAKE ADJUSTMENT TO THE state_sat DATASET
# REMOVE THE #'s IN THE NEXT TWO LINES SO THEY WILL RUN
state_sat <- state_sat %>%
  mutate(region = as.factor(region))
Recall the pieced together codebook from last activity:
- state := state’s abbreviation
- region := state’s Census defined region (4 regions)
- sat_math := state’s mean SAT mathematics score
- teach_pay := state’s median teacher pay (thousands of dollars)
- pct_taking := state’s percentage of seniors/juniors taking the SAT
- division := state’s Census defined division (9 divisions)
It is always helpful to have an understanding of the variables in our dataset.
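One quick way to remind yourself of the variables and their types is a glimpse of the data (a sketch, assuming state_sat has been imported as above):

```r
# compact look at the dataset: one row per variable,
# showing its type and the first few values
glimpse(state_sat)
```

This is a lighter-weight check than a full skim and pairs well with the codebook above.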
Task 2
EDA
We conducted an EDA in Activity 08. For time purposes we will skip skimming the data; however, it is expected that you know how to do this. Let’s look at the visualizations and summary statistics.
We will be exploring the relationship between a state’s mean SAT math scores (sat_math) and its region. Phrased a little differently, we are interested in comparing the differences between state mean SAT math scores for different regions.
Let’s begin exploring the relationship between sat_math and region visually using a side-by-side boxplot graphic. Consider using the labs() function to layer on nice labels for your plot.
# CONSTRUCT PLOT
ggplot(state_sat, aes(x = region, y = sat_math)) +
geom_boxplot() +
labs(
title = "Relationship between SAT math scores and region",
x = "Region",
y = "Mean SAT Math Score"
  )
What does the graphic tell you about the relationship between these two variables?
The states in the Midwest region tend to have larger mean SAT math scores. There is a lot of variability in mean SAT math scores for states in the South region, while there is very little variability for states in the Northeast region. We need to be a little cautious here because each of the 4 regions probably doesn’t have many observations in it (51 states total broken into 4 groups).
Let’s calculate summary statistics to go along with the graphic.
- Start with state_sat, then
- per region calculate the summary statistics (count and mean) for sat_math, and then
- order the data by the mean of sat_math.
- Store the results in by_region.
- Print by_region.
# calculate & store the summary stats per region in by_region
by_region <- state_sat %>%
group_by(region) %>%
summarise(
count = n(),
    # how you name the calculated mean here matters for later
mean = mean(sat_math)
) %>%
arrange(mean)
# print by_region
by_region
# A tibble: 4 × 3
region count mean
<fct> <int> <dbl>
1 Northeast 9 501.
2 South 17 516.
3 West 13 530
4 Midwest 12 568.
## alternative with skim
# missing count per group though
state_sat %>%
group_by(region) %>%
  skim_without_charts(sat_math)
| Name | Piped data |
| Number of rows | 51 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | region |
Variable type: numeric
| skim_variable | region | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|---|
| sat_math | Midwest | 0 | 1 | 568.42 | 29.49 | 494 | 565.75 | 570 | 587.75 | 600 |
| sat_math | Northeast | 0 | 1 | 500.78 | 7.05 | 491 | 498.00 | 500 | 504.00 | 514 |
| sat_math | South | 0 | 1 | 516.18 | 32.88 | 473 | 495.00 | 504 | 550.00 | 558 |
| sat_math | West | 0 | 1 | 530.00 | 19.99 | 507 | 513.00 | 521 | 544.00 | 575 |
# quick way to group count
state_sat %>%
  count(region)
# A tibble: 4 × 2
region n
<fct> <int>
1 Midwest 12
2 Northeast 9
3 South 17
4 West 13
Confirm that the means tell the same story as the graphic above.
Let’s compare each region’s mean sat_math to the mean of the Midwest region. That is, subtract the mean for the Midwest region from the mean of each region. Using midwest_mean, calculate the differences by:
- Begin with by_region, then
- create a new variable diff_with_midwest by taking midwest_mean away from the mean of each region.
# REMOVE eval: false FROM CHUNK OPTIONS
# stores mean of sat_math for midwest region
midwest_mean <- by_region %>%
filter(region == "Midwest") %>%
pull(mean)
# if you named your variable something other than "mean" you will need to edit the code line above
# calculate the differences in mean from midwest below
by_region %>%
mutate(
diff_with_midwest = mean - midwest_mean
  )
# A tibble: 4 × 4
region count mean diff_with_midwest
<fct> <int> <dbl> <dbl>
1 Northeast 9 501. -67.6
2 South 17 516. -52.2
3 West 13 530 -38.4
4 Midwest 12 568. 0
What do the differences tell you about the comparison of each region’s mean sat_math with the mean of the Midwest region?
Since all the differences are negative we can quickly determine that the mean state SAT math scores for each region are less than the mean SAT math score for states in the Midwest region.
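If you like a visual check, these differences can be drawn as a simple bar chart (a sketch, assuming by_region and midwest_mean exist as above):

```r
# bars below zero show regions whose mean SAT math score
# falls short of the Midwest mean
by_region %>%
  mutate(diff_with_midwest = mean - midwest_mean) %>%
  ggplot(aes(x = region, y = diff_with_midwest)) +
  geom_col() +
  labs(x = "Region", y = "Difference from Midwest mean")
```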
Keep the above values in mind as we go on to fitting a model.
Task 3
Fit a model
Let’s fit the model where we use region as the predictor and sat_math as the response. Store the fitted model in region_model and then print the estimated coefficients for the model (there are multiple ways to do this; just pick one).
# fit sat_math ~ region
region_model <- lm(sat_math ~ region, data = state_sat)
# print estimated coefficients
region_model
Call:
lm(formula = sat_math ~ region, data = state_sat)
Coefficients:
(Intercept) regionNortheast regionSouth regionWest
568.42 -67.64 -52.24 -38.42
# another option
summary(region_model)$coefficients
                  Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 568.41667 7.538899 75.397831 1.095715e-50
regionNortheast -67.63889 11.515859 -5.873543 4.180018e-07
regionSouth -52.24020 9.846514 -5.305451 2.973936e-06
regionWest -38.41667 10.454573 -3.674628 6.095418e-04
# another option using broom
broom::tidy(region_model)
# A tibble: 4 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 568. 7.54 75.4 1.10e-50
2 regionNortheast -67.6 11.5 -5.87 4.18e- 7
3 regionSouth -52.2 9.85 -5.31 2.97e- 6
4 regionWest -38.4 10.5 -3.67 6.10e- 4
What is the baseline region? Midwest
Interpret each of the coefficients. Notice the slope values, they should look familiar!
- Intercept: The EXPECTED/PREDICTED mean SAT math score for states in the Midwest is about 568.
- Coefficient for regionNortheast: The mean SAT math scores for states in the Northeast are, ON AVERAGE, 67.6 points less than they are for states in the Midwest.
- Coefficient for regionSouth: The mean SAT math scores for states in the South are, ON AVERAGE, 52.2 points less than they are for states in the Midwest.
- Coefficient for regionWest: The mean SAT math scores for states in the West are, ON AVERAGE, 38.4 points less than they are for states in the Midwest.
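Each region’s mean can be recovered from the coefficients by adding its slope to the intercept (a sketch, assuming region_model has been fit as above):

```r
coefs <- coef(region_model)
# baseline (Midwest) mean, about 568:
coefs["(Intercept)"]
# Northeast mean = intercept + Northeast coefficient, about 501:
coefs["(Intercept)"] + coefs["regionNortheast"]
# South mean, about 516:
coefs["(Intercept)"] + coefs["regionSouth"]
# West mean, about 530:
coefs["(Intercept)"] + coefs["regionWest"]
```

These match the group means stored in by_region.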
The equations for the model can be written as
\[\widehat{sat\_math} = b_0 + b_1(Northeast) + b_2(South) + b_3(West)\]
where \(Northeast\), \(South\), and \(West\) are indicator variables (for example, \(Northeast\) is shorthand for \(1_{Northeast}(x)\)).
Make the appropriate adjustments/substitutions to complete the equation below with the information from the fitted model, region_model.
\[\widehat{sat\_math} = 568.42 - 67.64(Northeast) - 52.24(South) - 38.42(West)\]
Compare your model results with your calculations in Task 2 diff_with_midwest. What do you notice?
The intercept is the mean sat_math for the Midwest region, and the other coefficients are exactly the diff_with_midwest differences we calculated in Task 2.
Task 4
Dummy/Indicator Coding Information
Run the below code chunk to import dummy_coded_state_sat.rds.
# import data
dummy_coded_data <- read_rds("data/dummy_coded_state_sat.rds")
# print
dummy_coded_data
# A tibble: 51 × 6
sat_math Midwest Northeast South West region
<dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 558 0 0 1 0 South
2 513 0 0 0 1 West
3 521 0 0 0 1 West
4 550 0 0 1 0 South
5 511 0 0 0 1 West
6 538 0 0 0 1 West
7 504 0 1 0 0 Northeast
8 495 0 0 1 0 South
9 473 0 0 1 0 South
10 496 0 0 1 0 South
# … with 41 more rows
This demonstrates the indicator values that are plugged into the above equation in Task 3 for each observation. Each indicator variable is coded 0 if the observation is not in the indicated group and 1 if it is. For example, region has four categories: Northeast, South, West, Midwest. We take region and create 4 indicator variables, one for each region.
For modeling purposes we only need 3 of the indicator variables to uniquely identify which group each observation belongs to. The indicator that gets left out becomes our reference/baseline group for the model.
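We can peek at the coding lm() builds internally using model.matrix() (a sketch, assuming region_model from Task 3):

```r
# the design matrix lm() actually uses: an intercept column plus
# one 0/1 indicator per region EXCEPT the baseline (Midwest)
model.matrix(region_model) %>%
  head()
```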
# fit models
midwest_model <- lm(sat_math ~ Northeast + South + West + Midwest, data = dummy_coded_data)
west_model <- lm(sat_math ~ Midwest + Northeast + South + West, data = dummy_coded_data)
south_model <- lm(sat_math ~ West + Midwest + Northeast + South, data = dummy_coded_data)
northeast_model <- lm(sat_math ~ South + West + Midwest + Northeast, data = dummy_coded_data)
# coefficients for each model
region_model
midwest_model
west_model
south_model
northeast_model
Call:
lm(formula = sat_math ~ region, data = state_sat)
Coefficients:
(Intercept) regionNortheast regionSouth regionWest
568.42 -67.64 -52.24 -38.42
Call:
lm(formula = sat_math ~ Northeast + South + West + Midwest, data = dummy_coded_data)
Coefficients:
(Intercept) Northeast South West Midwest
568.42 -67.64 -52.24 -38.42 NA
Call:
lm(formula = sat_math ~ Midwest + Northeast + South + West, data = dummy_coded_data)
Coefficients:
(Intercept) Midwest Northeast South West
530.00 38.42 -29.22 -13.82 NA
Call:
lm(formula = sat_math ~ West + Midwest + Northeast + South, data = dummy_coded_data)
Coefficients:
(Intercept) West Midwest Northeast South
516.18 13.82 52.24 -15.40 NA
Call:
lm(formula = sat_math ~ South + West + Midwest + Northeast, data = dummy_coded_data)
Coefficients:
(Intercept) South West Midwest Northeast
500.78 15.40 29.22 67.64 NA
Why do the midwest_model, west_model, south_model, and northeast_model have an NA coefficient?
Each model includes all 4 indicators plus an intercept, but the four indicators always sum to 1, so the last one listed is redundant (perfectly collinear with the others) and R reports NA for its coefficient. That dropped indicator is the reference/baseline group! Everything is interpreted with reference to this group!
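Rather than hand-building indicator columns, the baseline can be changed by re-leveling the factor before fitting (a sketch, assuming state_sat with region stored as a factor, as in Task 1):

```r
# make West the reference level, then refit; the coefficients are now
# differences from the West mean (same values as west_model above)
state_sat %>%
  mutate(region = fct_relevel(region, "West")) %>%
  lm(sat_math ~ region, data = .) %>%
  coef()
```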