Confidence Intervals

Activity 17

Author

Solution

Overview

The focus of Activity 17 will be on using the properties of a sampling distribution to construct confidence intervals for parameters using sample estimates, standard errors, and critical values. That is, building an interval of plausible values from our randomly collected data that we are confident includes the the parameter of interest (for example the mean, $\mu$)

Needed Packages

The following loads the packages that are needed for this activity.

# load packages
library(tidyverse)
library(moderndive)
library(skimr)

Tasks

THIS IS A TWO DAY ACTIVITY

Exercise 1

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 adults in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

The sample data is saved in data/cdc.rda.

genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair, or poor;
exerany variable indicates whether the respondent exercised in the past month (1) or did not (0);
hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0);
smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in their lifetime (1) or not (0);
height is respondent’s height in inches;
weight is respondent’s weight in pounds;
wtdesire is respondent’s desired weight in pounds;
age is respondent’s age in years;
gender coded m for male and f for female.

load("data/cdc.rda")

Question 1

Construct a 92% confidence interval for the average height of US men.

# store dataset of interest
cdc_men <- cdc %>% 
  filter(gender == "m")

t.test(cdc_men$height, conf.level = 0.92)


    One Sample t-test

data:  cdc_men$height
t = 2283.7, df = 9568, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
92 percent confidence interval:
 70.19778 70.30551
sample estimates:
mean of x 
 70.25165

Interpret the confidence interval.

We are 92% confident that the average height of adult men in the US are between 70.2 and 70.3 inches tall.

Question 2

Construct a 99.9% confidence interval for the proportion of the US population that consider themselves to be in at most fair general health.

cdc %>% 
  count(genhlth)

# A tibble: 5 × 2
  genhlth       n
  <fct>     <int>
1 excellent  4657
2 very good  6972
3 good       5675
4 fair       2019
5 poor        677

# at most fair = fair or poor
x1 <- 2019 + 677
# total people surveyed (could have got this from the 20000 obs)
n1 <- 4657 + 6972 + 5675 + 2019 + 677

prop.test(x = x1, n = n1, conf.level = 0.999, correct = FALSE)


    1-sample proportions test without continuity correction

data:  x1 out of n1, null probability 0.5
X-squared = 10670, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
99.9 percent confidence interval:
 0.1270512 0.1429440
sample estimates:
     p 
0.1348

Interpret the confidence interval.

We are 99.9% confident that the proportion of adults in the US that are in at most fair health is between 0.1271 and 0.1429.

Question 3

Construct a 95% confidence interval to determine the difference in the proportion of smokers for US adults considering themselves in excellent general health and the proportion of smokers for US adults considering themselves in poor general health.

cdc %>% 
  count(smoke100, genhlth)

# A tibble: 10 × 3
   smoke100 genhlth       n
   <fct>    <fct>     <int>
 1 no       excellent  2879
 2 no       very good  3758
 3 no       good       2782
 4 no       fair        911
 5 no       poor        229
 6 yes      excellent  1778
 7 yes      very good  3214
 8 yes      good       2893
 9 yes      fair       1108
10 yes      poor        448

x_exc <- 1778
n_exc <- 1778 + 2879

x_poor <- 448
n_poor <- 448 + 229

prop.test(x = c(x_exc, x_poor),
          n = c(n_exc, n_poor),
          conf.level = 0.95,
          correct = FALSE)


    2-sample test for equality of proportions without continuity correction

data:  c(x_exc, x_poor) out of c(n_exc, n_poor)
X-squared = 190.51, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.3182250 -0.2416793
sample estimates:
   prop 1    prop 2 
0.3817909 0.6617430

Interpret the confidence interval.

We are 95% confident that the difference in proportion of US adults in excellent health who smoke and US adults in poor health who smoke is between -0.3182 and -0.2417.

Question 4

Suppose we are interested in examining US women between 35 and 45 years old (including 35 and 45). For US women between 35 and 45 years old, construct a 98% confidence interval for the difference in mean weight for US women considering themselves in good general health and the mean weight of US women considering themselves in fair general health.

# interested in only women between 35 and 45
cdc_women <- cdc %>% 
  filter(gender == "f", age >= 35, age <= 45)

#get dataset1
cdc_good <- cdc_women %>% 
  filter(genhlth == "good")

#get dataset2
cdc_fair <- cdc_women %>% 
  filter(genhlth == "fair")

#t.test for difference
t.test(x = cdc_good$weight, y = cdc_fair$weight, conf.level = 0.98)


    Welch Two Sample t-test

data:  cdc_good$weight and cdc_fair$weight
t = -3.468, df = 330.72, p-value = 0.0005936
alternative hypothesis: true difference in means is not equal to 0
98 percent confidence interval:
 -20.214100  -3.935649
sample estimates:
mean of x mean of y 
 159.0139  171.0888

Interpret the confidence interval.

We are 98% confident that for US women between the ages of 35 and 45 the difference in average weight between those in good health and those in fair health is between -20.214 and-3.935 pounds.

Another way to think of it: We are 98% confident that US women between the ages of 35 and 45 in good health weigh on average 3.935 to 20.214 pounds less than US women between the ages of 35 and 45 in fair health.

Exercise 2

Suppose we have access to a full days worth of Chicago ride share data from Uber, Lyft, etc. Let’s restrict ourselves to only rides that involve the basic fare and no sharing of the ride. This brings us to about 100,000 rides. In the data we have price of the ride, duration (in minutes) of the ride, and wait_time which is the time (in minutes) that a rider waits to be picked up.

In this case we happen to have the whole population. Consider taking a random sample of size 100, stored in ride_sample. You will need to remove the comment (#) and enter a number in the set.seed() function.

# load data
ride_data <- read_rds("data/ride_data.rds")

# enter any number in set.seed(###)
set.seed(202)

# random sample of 100 rides
ride_sample <- ride_data %>%
  sample_n(size = 100)

Question 5

Using your random sample, construct a 90% confidence interval for the average price of a ride.

t.test(ride_sample$price, conf.level = 0.9)


    One Sample t-test

data:  ride_sample$price
t = 42.794, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
 18.65622 20.16238
sample estimates:
mean of x 
  19.4093

Interpret the confidence interval.

We are 90% confident the average price of a ride share in Chicago is between $18.66 and $20.16.

Your answer should be different because we have different random samples! Do NOT use the same seed as me or you are missing the point of this Exercise.

Question 6

Using your random sample, run a simple regression to predict price based on duration. Then construct a 90% confidence interval for the intercept and slope using the confint() function.

#build linear model of ride_sample using lm function
model_ride <- lm(price ~ duration, data = ride_sample)

# This will show your sample estimate (what the CI will center around)
summary(model_ride)


Call:
lm(formula = price ~ duration, data = ride_sample)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7949 -0.6817 -0.1240  0.6660  2.5204 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.542900   0.302511   21.63   <2e-16 ***
duration    0.430801   0.009582   44.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9803 on 98 degrees of freedom
Multiple R-squared:  0.9538,    Adjusted R-squared:  0.9533 
F-statistic:  2021 on 1 and 98 DF,  p-value: < 2.2e-16

#we could calculate the estimate and SE of our intercept and slope by hand but it is much easier to use confint().
confint(model_ride, level = 0.9)

                  5 %      95 %
(Intercept) 6.0405650 7.0452341
duration    0.4148895 0.4467133

What is the 90% CI for B_0 and B_1?

We are 90% confident that the intercept (expected price when duration is 0) is between 6.04 and 7.05.

We are 90% confident that the slope (relationship between duration and price) is between 0.415 and 0.447.

Your answer should be different because we have different random samples! Do NOT use the same seed as me or you are missing the point of this Exercise.

Question 7

We are lucky enough to have the entire population of ride shares for a day in ride_data. Let’s compare our confidence intervals in Question 5 and 6 to the truth.

ride_data %>% 
  summarize(mean_price = mean(price, na.rm = TRUE))

# A tibble: 1 × 1
  mean_price
       <dbl>
1       19.5

The true average price of a ride is $19.50. Did your confidence interval in Question 5 contain the true value?

Since my confidence interval was from $18.66 and $20.16, the CI captured the true population parameter

model_population <- lm(price~duration, data=ride_data)
summary(model_population)$coefficients

             Estimate   Std. Error   t value Pr(>|t|)
(Intercept) 5.9384962 0.0101953125  582.4732        0
duration    0.4504223 0.0003217278 1400.0105        0

The true population intercept is 5.94 and true population slope is 0.450. Do your confidence intervals contain the true values?

My CI for the intercept was 6.04 and 7.05 and my CI for the slope was 0.415 and 0.447. I did NOT capture the true population parameter for either of these!

Every member in our class (100 students) collected a random sample and constructed a confidence interval for the average ride share price, intercept, and slope. How many student’s confidence intervals do you expect to have captured the true population parameter for each of these estimators?

I would expect 90 students to capture the true population parameter and 10 students to not.

Optional Challenge

For Questions 1 - 4, construct the confidence intervals “by hand” using the formulas in Table 10.1 of the textbook.

Q1 by hand

Construct a 92% confidence interval for the average height of US men.

cdc %>% 
  filter(gender == "m") %>% 
  summarize(xbar = mean(height, na.rm = TRUE),
            s = sd(height, na.rm = TRUE),
            n = n(),
            se = s/sqrt(n),
            cv = qt(0.04, df = n-1, lower.tail = FALSE),
            lower = xbar - cv*se,
            upper = xbar + cv*se)

# A tibble: 1 × 7
   xbar     s     n     se    cv lower upper
  <dbl> <dbl> <int>  <dbl> <dbl> <dbl> <dbl>
1  70.3  3.01  9569 0.0308  1.75  70.2  70.3

Q2 by hand:

Construct a 99.9% confidence interval for the proportion of the US population that consider themselves to be in at most fair general health.

cdc %>% 
  summarize(x = sum(genhlth == "fair" | genhlth == "poor"),
            n = n(),
            p = x/n,
            se = sqrt(p*(1-p)/n),
            cv = qnorm(0.0005, lower.tail = FALSE),
            lower = p - cv*se,
            upper = p + cv*se)

# A tibble: 1 × 7
      x     n     p      se    cv lower upper
  <int> <int> <dbl>   <dbl> <dbl> <dbl> <dbl>
1  2696 20000 0.135 0.00241  3.29 0.127 0.143

Q3 by hand:

cdc %>% 
  count(smoke100, genhlth)

# A tibble: 10 × 3
   smoke100 genhlth       n
   <fct>    <fct>     <int>
 1 no       excellent  2879
 2 no       very good  3758
 3 no       good       2782
 4 no       fair        911
 5 no       poor        229
 6 yes      excellent  1778
 7 yes      very good  3214
 8 yes      good       2893
 9 yes      fair       1108
10 yes      poor        448

cdc %>% 
  summarize(x1 = sum(genhlth == "excellent" & smoke100 == "yes"),
            n1 = sum(genhlth == "excellent"),
            x2 = sum(genhlth == "poor" & smoke100 == "yes"),
            n2 = sum(genhlth == "poor" ),
            p1 = x1/n1,
            p2 = x2/n2,
            dif = p1 - p2,
            se = sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2),
            cv = qnorm(0.025, lower.tail = FALSE),
            lower = dif - cv*se,
            upper = dif + cv*se)

# A tibble: 1 × 11
     x1    n1    x2    n2    p1    p2    dif     se    cv  lower  upper
  <int> <int> <int> <int> <dbl> <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>
1  1778  4657   448   677 0.382 0.662 -0.280 0.0195  1.96 -0.318 -0.242

Q4 by hand:

cdc_women <- cdc %>% 
  filter(gender == "f", age >= 35, age <= 45)

cdc_women %>% 
  group_by(genhlth) %>% 
  summarize(xbar = mean(weight, na.rm = TRUE),
            s = sd(weight, na.rm = TRUE),
            n = n())

# A tibble: 5 × 4
  genhlth    xbar     s     n
  <fct>     <dbl> <dbl> <int>
1 excellent  141.  24.3   637
2 very good  153.  33.8   915
3 good       159.  40.2   647
4 fair       171.  45.4   214
5 poor       173.  48.5    60

xbar_dif = 159.0139 - 171.0888
se = sqrt(40.22173^2/647 + 45.37788^2/214)
cv = qt(0.01, df = min(647-1, 214-1))

xbar_dif - cv*se

[1] -3.913718

xbar_dif + cv*se

[1] -20.23608