Multiple Regression II

Activity 11

Author

Solutions


Overview

The focus of Activity 11 is data modeling using multiple (linear) regression, specifically with one numerical predictor variable and one categorical predictor variable. We will also continue to practice our data wrangling and visualization skills.


Needed Packages

The following loads the packages that are needed for this activity. You may want to suppress the messages so the compiled HTML is less cluttered.

# LOAD THE NECESSARY PACKAGES
library(tidyverse)
# you might need to install moderndive first: install.packages("moderndive")
library(moderndive)
# CONSIDER SUPPRESSING WARNINGS & MESSAGES
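One way to suppress the startup messages is with chunk options. This is a minimal sketch that assumes the activity is rendered with Quarto (or a recent version of knitr), where lines starting with `#|` at the top of a code chunk are read as options for that chunk:

```r
#| message: false
#| warning: false
# With the two options above, package startup messages and warnings
# are left out of the rendered HTML for this chunk only
library(tidyverse)
library(moderndive)
```

The options apply only to the chunk they appear in, so messages from later chunks would still be shown.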


Tasks

Complete the following series of tasks. Remember to render early and render often.


Task 1

CDC data

The code below loads the cdc data.

load("data/cdc.rda")

We are going to skip the exploratory data analysis to save time. However, I expect that by now you know how to skim and analyze the data.

Task 1

We are going to predict height based on a person’s weight. However, we also believe height likely differs across gender.

Check to make sure there is a linear relationship between height and weight.

ggplot(cdc, aes(x = weight, y = height)) +
  geom_point()

cdc %>% 
  select(weight, height) %>% 
  cor()
          weight    height
weight 1.0000000 0.5553222
height 0.5553222 1.0000000
# not required - break down by gender
cdc %>% 
  filter(gender == "f") %>% 
  ggplot(aes(x = weight, y = height)) +
  geom_point()

cdc %>% 
  filter(gender == "m") %>% 
  ggplot(aes(x = weight, y = height)) +
  geom_point()

There is a moderate positive linear relationship between height and weight. The correlation coefficient is approximately 0.56. [There appears to be a stronger relationship for males than for females.]

Task 2

Same slopes model

Adding a categorical variable to a multiple regression model can be extremely useful if you expect the response to vary by category. Since we have a grasp of how to interpret a categorical variable, we can now build this more complicated model.

Build the multiple regression model to predict height based on weight and gender. Assume the relationship between height and weight (i.e., the slope) is the same for each gender.

# fit the parallel slopes model (stored here as `model_parallel`)
model_parallel <- lm(height ~ weight + gender, data = cdc)
# print estimated coefficients
summary(model_parallel)$coefficients
               Estimate   Std. Error    t value Pr(>|t|)
(Intercept) 64.72845616 0.1060020350  610.63409        0
weight       0.02917341 0.0005405099   53.97387        0
genderf     -4.78532667 0.0433675416 -110.34351        0

Remember we are estimating the coefficients (\(b_0\), \(b_1\), \(b_2\)) for the model equation below:

\[\widehat{height} = b_0 + b_1\left(weight \right) + b_2\left(1_{gender}\right)\]

In the context of the data, provide an interpretation for each of the coefficient estimates:

  • Intercept: the intercept for males (baseline). This is the EXPECTED height for a male when weight is 0 pounds.
  • Slope for weight: For every 1 pound increase in weight, the EXPECTED height of a person, regardless of gender (because there is no interaction term), increases by 0.03 inches.
  • Coefficient for gender: The OFFSET in intercept for females. At any given weight, females are EXPECTED to be 4.79 inches SHORTER than males.
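Substituting the rounded estimates into the model equation gives one line per gender. This is a worked sketch, assuming the indicator \(1_{gender}\) equals 1 for females and 0 for males, as the `genderf` coefficient label suggests:

\[\text{Males: } \widehat{height} = 64.728 + 0.029\left(weight \right)\]

\[\text{Females: } \widehat{height} = \left(64.728 - 4.785\right) + 0.029\left(weight \right) = 59.943 + 0.029\left(weight \right)\]

Both lines share the slope 0.029, so they are parallel and sit 4.785 inches apart at every weight.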

Task 3

Same slopes plot

Plot the parallel slopes model.

# plot the same slopes model
# geom_parallel_slopes() comes from library(moderndive)
# instead of adding `geom_smooth(method = "lm", se = FALSE)`, add the layer `geom_parallel_slopes(se = FALSE)`
ggplot(cdc, aes(x = weight, y = height, color= gender)) +
  geom_point() +
  labs(title = "Parallel Slopes Model")+
  geom_parallel_slopes(se = FALSE)

Task 4

Interaction model

Let’s add the interaction term, which means we expect the relationship between weight and height to vary by gender. Fit this model, store it in model_int, and print its estimated coefficients.

# fit interaction model
model_int <- lm(height ~ weight + gender + weight*gender, data = cdc)
# print estimated coefficients
summary(model_int)$coefficients
                  Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)    63.48201758 0.1453414870 436.77837 0.000000e+00
weight          0.03575709 0.0007537745  47.43738 0.000000e+00
genderf        -2.49860488 0.1882460264 -13.27308 4.899260e-40
weight:genderf -0.01344270 0.0010770868 -12.48061 1.293024e-35


Remember we are estimating the coefficients (\(b_0\), \(b_1\), \(b_2\) & \(b_3\)) for the model equation below:

\[\widehat{height} = b_0 + b_1\left(weight \right) + b_2\left(1_{gender}\right)+ b_3\left(weight \times 1_{gender}\right)\]

What is the baseline/reference gender? males

What does \(b_0\) represent? The intercept for males. This is the EXPECTED height of a male when the weight is 0.

What does \(b_1\) represent? The slope of weight for males. For every 1 pound increase in weight for males, the EXPECTED height increases by 0.036 inches.

What does \(b_2\) represent? The offset in intercept for females. At a weight of 0 pounds, females are EXPECTED to be 2.499 inches shorter than males. Because the model includes an interaction term, the gap at other weights also depends on \(b_3\).

What does \(b_3\) represent? The offset in slope of weight for females. For every 1 pound increase in weight, the EXPECTED female height increases by 0.013 inches less than the EXPECTED male height does.

This model is much more complicated. It allows different genders to have both different intercepts and different slopes, which is important if we suspect the relationship between height and weight varies by gender.

We technically have two different lines of best fit (one for each gender) built into this equation. Use the model coefficients above to calculate the intercept and slope of each line:

  • Male intercept: 63.482

  • Male slope: 0.036

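  • Female intercept: 63.482 - 2.499 = 60.983

  • Female slope: 0.03576 - 0.01344 = 0.02232, or about 0.022 (rounding the coefficients to three decimals before subtracting gives 0.023)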
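The same arithmetic can be done in R so that rounding happens only at the end. This is a minimal sketch that copies the unrounded estimates from the coefficient output above. The names `b0` through `b3` and `gender_lines` are introduced here for illustration only.

```r
# Estimates copied from the summary(model_int)$coefficients output above
b0 <- 63.48201758   # (Intercept): male intercept
b1 <- 0.03575709    # weight: male slope
b2 <- -2.49860488   # genderf: intercept offset for females
b3 <- -0.01344270   # weight:genderf: slope offset for females

# One row per gender; the female line adds the offsets to the male line
gender_lines <- data.frame(
  gender    = c("m", "f"),
  intercept = c(b0, b0 + b2),
  slope     = c(b1, b1 + b3)
)

# Round only for display, after the arithmetic is done
gender_lines$intercept <- round(gender_lines$intercept, 3)
gender_lines$slope     <- round(gender_lines$slope, 3)
print(gender_lines)
```

With `model_int` in the workspace, `coef(model_int)` should return the same estimates as a named vector, which avoids copying numbers by hand.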


Task 5

Interaction plot

Plot the interaction model below. Use the color aesthetic to plot the gender. Add the line of best fit using `geom_smooth()`.

#plot interaction model
ggplot(cdc, aes(x = weight, y = height, color= gender)) +
  geom_point() +
  labs(title = "Interaction Model")+
  geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula 'y ~ x'

Note: generally if the models are very similar you choose the simplest model (parallel slopes) as the “best” model. If the lines are significantly different the interaction model is usually “best” because it is telling us the relationship varies based on a category (in this case gender).
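One rough way to see how much the two fitted models disagree is to compare the male-minus-female gap in expected height that each one implies. The sketch below copies the estimates from the two coefficient tables above. The weights chosen are hypothetical, and the names `gap_parallel`, `b2_int`, `b3_int`, and `gaps` are introduced here for illustration only.

```r
# Estimates copied from the two coefficient tables above
gap_parallel <- 4.78532667   # minus the genderf estimate, parallel slopes model
b2_int <- -2.49860488        # genderf, interaction model
b3_int <- -0.01344270        # weight:genderf, interaction model

# Hypothetical weights (in pounds) at which to compare the models
weights <- c(100, 150, 200, 250)

gaps <- data.frame(
  weight      = weights,
  # parallel slopes: the gap is the same at every weight
  parallel    = rep(gap_parallel, length(weights)),
  # interaction: the gap changes with weight through the slope offset
  interaction = -(b2_int + b3_int * weights)
)
print(round(gaps, 2))
```

Under the parallel slopes model the gap is constant, while under the interaction model it widens as weight increases. How large a difference justifies the more complicated model remains a judgment call.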


Task 6

Prediction

Using the interaction model:

Calculate the ESTIMATED height for a male who weighs 150 pounds.

Calculate the ESTIMATED height for a female who weighs 150 pounds.

#use this as a calculator by plugging in the equation
# 150 pound male
63.482+0.036*150
[1] 68.882
# 150 pound female
63.48+0.036*150-2.499*1 -0.0134*150*1
[1] 64.371

A 150 pound male is EXPECTED to be 68.882 inches tall (on average).

A 150 pound female is EXPECTED to be 64.371 inches tall (on average).
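The hand calculation uses rounded coefficients, so its results differ slightly from what the unrounded estimates give. As a sketch, the interaction model equation can be wrapped in a small function. `predict_height()` is a hypothetical helper written for this example, not part of any package, and the estimates are copied from the coefficient output above.

```r
# Unrounded estimates copied from the summary(model_int)$coefficients output
b <- c(intercept      = 63.48201758,
       weight         = 0.03575709,
       genderf        = -2.49860488,
       weight_genderf = -0.01344270)

# Hypothetical helper: evaluates the interaction model equation.
# `female` is the indicator variable: 1 for female, 0 for male.
predict_height <- function(weight, female) {
  b[["intercept"]] +
    b[["weight"]] * weight +
    b[["genderf"]] * female +
    b[["weight_genderf"]] * weight * female
}

male_150   <- predict_height(150, female = 0)
female_150 <- predict_height(150, female = 1)
print(round(c(male = male_150, female = female_150), 3))
```

This gives about 68.85 inches for the male and 64.33 inches for the female, a few hundredths of an inch away from the hand calculation. With `model_int` in the workspace, `predict(model_int, newdata = data.frame(weight = 150, gender = c("m", "f")))` should give the same values, assuming gender is coded as "m" and "f" as in the filters used earlier.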