# load packages
library(tidyverse)
library(moderndive)
library(skimr)Confidence Intervals
Activity 17
Overview
The focus of Activity 17 will be on using the properties of a sampling distribution to construct confidence intervals for parameters using sample estimates, standard errors, and critical values. That is, building an interval of plausible values from our randomly collected data that we are confident includes the the parameter of interest (for example the mean, \(\mu\))
Needed Packages
The following loads the packages that are needed for this activity.
Tasks
THIS IS A TWO DAY ACTIVITY
Exercise 1
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 adults in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.
The sample data is saved in data/cdc.rda.
genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair, or poor;exeranyvariable indicates whether the respondent exercised in the past month (1) or did not (0);hlthplanindicates whether the respondent had some form of health coverage (1) or did not (0);smoke100variable indicates whether the respondent had smoked at least 100 cigarettes in their lifetime (1) or not (0);heightis respondent’s height in inches;weightis respondent’s weight in pounds;wtdesireis respondent’s desired weight in pounds;ageis respondent’s age in years;gendercodedmfor male andffor female.
load("data/cdc.rda")Question 1
Construct a 92% confidence interval for the average height of US men.
# store dataset of interest
cdc_men <- cdc %>%
filter(gender == "m")
t.test(cdc_men$height, conf.level = 0.92)
One Sample t-test
data: cdc_men$height
t = 2283.7, df = 9568, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
92 percent confidence interval:
70.19778 70.30551
sample estimates:
mean of x
70.25165
Interpret the confidence interval.
We are 92% confident that the average height of adult men in the US are between 70.2 and 70.3 inches tall.
Question 2
Construct a 99.9% confidence interval for the proportion of the US population that consider themselves to be in at most fair general health.
cdc %>%
count(genhlth)# A tibble: 5 × 2
genhlth n
<fct> <int>
1 excellent 4657
2 very good 6972
3 good 5675
4 fair 2019
5 poor 677
# at most fair = fair or poor
x1 <- 2019 + 677
# total people surveyed (could have got this from the 20000 obs)
n1 <- 4657 + 6972 + 5675 + 2019 + 677
prop.test(x = x1, n = n1, conf.level = 0.999, correct = FALSE)
1-sample proportions test without continuity correction
data: x1 out of n1, null probability 0.5
X-squared = 10670, df = 1, p-value < 2.2e-16
alternative hypothesis: true p is not equal to 0.5
99.9 percent confidence interval:
0.1270512 0.1429440
sample estimates:
p
0.1348
Interpret the confidence interval.
We are 99.9% confident that the proportion of adults in the US that are in at most fair health is between 0.1271 and 0.1429.
Question 3
Construct a 95% confidence interval to determine the difference in the proportion of smokers for US adults considering themselves in excellent general health and the proportion of smokers for US adults considering themselves in poor general health.
cdc %>%
count(smoke100, genhlth)# A tibble: 10 × 3
smoke100 genhlth n
<fct> <fct> <int>
1 no excellent 2879
2 no very good 3758
3 no good 2782
4 no fair 911
5 no poor 229
6 yes excellent 1778
7 yes very good 3214
8 yes good 2893
9 yes fair 1108
10 yes poor 448
x_exc <- 1778
n_exc <- 1778 + 2879
x_poor <- 448
n_poor <- 448 + 229
prop.test(x = c(x_exc, x_poor),
n = c(n_exc, n_poor),
conf.level = 0.95,
correct = FALSE)
2-sample test for equality of proportions without continuity correction
data: c(x_exc, x_poor) out of c(n_exc, n_poor)
X-squared = 190.51, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
-0.3182250 -0.2416793
sample estimates:
prop 1 prop 2
0.3817909 0.6617430
Interpret the confidence interval.
We are 95% confident that the difference in proportion of US adults in excellent health who smoke and US adults in poor health who smoke is between -0.3182 and -0.2417.
Question 4
Suppose we are interested in examining US women between 35 and 45 years old (including 35 and 45). For US women between 35 and 45 years old, construct a 98% confidence interval for the difference in mean weight for US women considering themselves in good general health and the mean weight of US women considering themselves in fair general health.
# interested in only women between 35 and 45
cdc_women <- cdc %>%
filter(gender == "f", age >= 35, age <= 45)
#get dataset1
cdc_good <- cdc_women %>%
filter(genhlth == "good")
#get dataset2
cdc_fair <- cdc_women %>%
filter(genhlth == "fair")
#t.test for difference
t.test(x = cdc_good$weight, y = cdc_fair$weight, conf.level = 0.98)
Welch Two Sample t-test
data: cdc_good$weight and cdc_fair$weight
t = -3.468, df = 330.72, p-value = 0.0005936
alternative hypothesis: true difference in means is not equal to 0
98 percent confidence interval:
-20.214100 -3.935649
sample estimates:
mean of x mean of y
159.0139 171.0888
Interpret the confidence interval.
We are 98% confident that for US women between the ages of 35 and 45 the difference in average weight between those in good health and those in fair health is between -20.214 and-3.935 pounds.
Another way to think of it: We are 98% confident that US women between the ages of 35 and 45 in good health weigh on average 3.935 to 20.214 pounds less than US women between the ages of 35 and 45 in fair health.
Exercise 2
Suppose we have access to a full days worth of Chicago ride share data from Uber, Lyft, etc. Let’s restrict ourselves to only rides that involve the basic fare and no sharing of the ride. This brings us to about 100,000 rides. In the data we have price of the ride, duration (in minutes) of the ride, and wait_time which is the time (in minutes) that a rider waits to be picked up.
In this case we happen to have the whole population. Consider taking a random sample of size 100, stored in ride_sample. You will need to remove the comment (#) and enter a number in the set.seed() function.
# load data
ride_data <- read_rds("data/ride_data.rds")
# enter any number in set.seed(###)
set.seed(202)
# random sample of 100 rides
ride_sample <- ride_data %>%
sample_n(size = 100)Question 5
Using your random sample, construct a 90% confidence interval for the average price of a ride.
t.test(ride_sample$price, conf.level = 0.9)
One Sample t-test
data: ride_sample$price
t = 42.794, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
18.65622 20.16238
sample estimates:
mean of x
19.4093
Interpret the confidence interval.
We are 90% confident the average price of a ride share in Chicago is between $18.66 and $20.16.
Your answer should be different because we have different random samples! Do NOT use the same seed as me or you are missing the point of this Exercise.
Question 6
Using your random sample, run a simple regression to predict price based on duration. Then construct a 90% confidence interval for the intercept and slope using the confint() function.
#build linear model of ride_sample using lm function
model_ride <- lm(price ~ duration, data = ride_sample)
# This will show your sample estimate (what the CI will center around)
summary(model_ride)
Call:
lm(formula = price ~ duration, data = ride_sample)
Residuals:
Min 1Q Median 3Q Max
-1.7949 -0.6817 -0.1240 0.6660 2.5204
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.542900 0.302511 21.63 <2e-16 ***
duration 0.430801 0.009582 44.96 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9803 on 98 degrees of freedom
Multiple R-squared: 0.9538, Adjusted R-squared: 0.9533
F-statistic: 2021 on 1 and 98 DF, p-value: < 2.2e-16
#we could calculate the estimate and SE of our intercept and slope by hand but it is much easier to use confint().
confint(model_ride, level = 0.9) 5 % 95 %
(Intercept) 6.0405650 7.0452341
duration 0.4148895 0.4467133
What is the 90% CI for B_0 and B_1?
We are 90% confident that the intercept (expected price when duration is 0) is between 6.04 and 7.05.
We are 90% confident that the slope (relationship between duration and price) is between 0.415 and 0.447.
Your answer should be different because we have different random samples! Do NOT use the same seed as me or you are missing the point of this Exercise.
Question 7
We are lucky enough to have the entire population of ride shares for a day in ride_data. Let’s compare our confidence intervals in Question 5 and 6 to the truth.
ride_data %>%
summarize(mean_price = mean(price, na.rm = TRUE))# A tibble: 1 × 1
mean_price
<dbl>
1 19.5
The true average price of a ride is $19.50. Did your confidence interval in Question 5 contain the true value?
Since my confidence interval was from $18.66 and $20.16, the CI captured the true population parameter
model_population <- lm(price~duration, data=ride_data)
summary(model_population)$coefficients Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.9384962 0.0101953125 582.4732 0
duration 0.4504223 0.0003217278 1400.0105 0
The true population intercept is 5.94 and true population slope is 0.450. Do your confidence intervals contain the true values?
My CI for the intercept was 6.04 and 7.05 and my CI for the slope was 0.415 and 0.447. I did NOT capture the true population parameter for either of these!
Every member in our class (100 students) collected a random sample and constructed a confidence interval for the average ride share price, intercept, and slope. How many student’s confidence intervals do you expect to have captured the true population parameter for each of these estimators?
I would expect 90 students to capture the true population parameter and 10 students to not.
Optional Challenge
For Questions 1 - 4, construct the confidence intervals “by hand” using the formulas in Table 10.1 of the textbook.
Q1 by hand
Construct a 92% confidence interval for the average height of US men.
cdc %>%
filter(gender == "m") %>%
summarize(xbar = mean(height, na.rm = TRUE),
s = sd(height, na.rm = TRUE),
n = n(),
se = s/sqrt(n),
cv = qt(0.04, df = n-1, lower.tail = FALSE),
lower = xbar - cv*se,
upper = xbar + cv*se)# A tibble: 1 × 7
xbar s n se cv lower upper
<dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 70.3 3.01 9569 0.0308 1.75 70.2 70.3
Q2 by hand:
Construct a 99.9% confidence interval for the proportion of the US population that consider themselves to be in at most fair general health.
cdc %>%
summarize(x = sum(genhlth == "fair" | genhlth == "poor"),
n = n(),
p = x/n,
se = sqrt(p*(1-p)/n),
cv = qnorm(0.0005, lower.tail = FALSE),
lower = p - cv*se,
upper = p + cv*se)# A tibble: 1 × 7
x n p se cv lower upper
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2696 20000 0.135 0.00241 3.29 0.127 0.143
Q3 by hand:
Construct a 95% confidence interval to determine the difference in the proportion of smokers for US adults considering themselves in excellent general health and the proportion of smokers for US adults considering themselves in poor general health.
cdc %>%
count(smoke100, genhlth)# A tibble: 10 × 3
smoke100 genhlth n
<fct> <fct> <int>
1 no excellent 2879
2 no very good 3758
3 no good 2782
4 no fair 911
5 no poor 229
6 yes excellent 1778
7 yes very good 3214
8 yes good 2893
9 yes fair 1108
10 yes poor 448
cdc %>%
summarize(x1 = sum(genhlth == "excellent" & smoke100 == "yes"),
n1 = sum(genhlth == "excellent"),
x2 = sum(genhlth == "poor" & smoke100 == "yes"),
n2 = sum(genhlth == "poor" ),
p1 = x1/n1,
p2 = x2/n2,
dif = p1 - p2,
se = sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2),
cv = qnorm(0.025, lower.tail = FALSE),
lower = dif - cv*se,
upper = dif + cv*se)# A tibble: 1 × 11
x1 n1 x2 n2 p1 p2 dif se cv lower upper
<int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1778 4657 448 677 0.382 0.662 -0.280 0.0195 1.96 -0.318 -0.242
Q4 by hand:
Suppose we are interested in examining US women between 35 and 45 years old (including 35 and 45). For US women between 35 and 45 years old, construct a 98% confidence interval for the difference in mean weight for US women considering themselves in good general health and the mean weight of US women considering themselves in fair general health.
cdc_women <- cdc %>%
filter(gender == "f", age >= 35, age <= 45)
cdc_women %>%
group_by(genhlth) %>%
summarize(xbar = mean(weight, na.rm = TRUE),
s = sd(weight, na.rm = TRUE),
n = n())# A tibble: 5 × 4
genhlth xbar s n
<fct> <dbl> <dbl> <int>
1 excellent 141. 24.3 637
2 very good 153. 33.8 915
3 good 159. 40.2 647
4 fair 171. 45.4 214
5 poor 173. 48.5 60
xbar_dif = 159.0139 - 171.0888
se = sqrt(40.22173^2/647 + 45.37788^2/214)
cv = qt(0.01, df = min(647-1, 214-1))
xbar_dif - cv*se[1] -3.913718
xbar_dif + cv*se[1] -20.23608