Data Wrangling II

Activity 06

Author

Solutions

Overview

The focus of Activity 06 will be on additional tools for data wrangling. In conjunction with past tools (filter() and summarize()) we will use group_by(), mutate(), arrange(), select(), and maybe a few others. Rarely do we only use one of these tools in isolation. We “pipe” them together to wrangle data.
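To make the idea of piping concrete, here is a minimal sketch on a small made-up dataset. The tibble toy and its columns (group, score, note) are hypothetical and are not part of the cdc data; the point is only to show several tools chained with %>%.

```r
library(dplyr)

# Hypothetical toy data; the column names are invented for illustration
toy <- tibble(
  group = c("a", "a", "b", "b", "b"),
  score = c(10, 14, 7, 9, 11),
  note  = c("v", "w", "x", "y", "z")
)

piped <- toy %>%
  select(group, score) %>%                     # keep only the columns we need
  group_by(group) %>%                          # later steps work within each group
  mutate(centered = score - mean(score)) %>%   # deviation from the group mean
  ungroup() %>%                                # remove the grouping again
  arrange(desc(centered))                      # largest deviation first

piped
```

Each line hands its result to the next one, so the chain can be read top to bottom as "take toy, and then select, and then group, and then ...".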

Needed Packages

The following loads the packages that are needed for this activity. We assume that you have already installed them. Also note that we have suppressed the package messages so the compiled HTML is less cluttered.

library(dplyr)
library(ggplot2)

Tasks

Complete the following series of tasks. Remember to render early and render often.

Background: CDC’s BRFSS

As a reminder, the Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000 and a subset of 9 variables.

# Load the data
load("data/cdc.rda")
cdc_codebook <- read.csv("data/cdc_codebook.csv")

It is a good idea to inspect the data by running View(cdc) in your console, not in the qmd file, because that will produce an error when rendering! Alternatively, you can inspect it by going to the Environment tab in the upper right-hand pane and clicking on the cdc dataset. Look up the meaning of each variable in cdc_codebook.
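If you do want a quick look at a dataset from inside the qmd file, functions that print plain text are one option, because they do not open an interactive viewer. The sketch below uses mtcars, a dataset built into R, as a stand-in so that it runs on its own; with the activity data you would pass cdc instead.

```r
library(dplyr)

# mtcars stands in for cdc so this sketch is self-contained
glimpse(mtcars)   # one line per variable: name, type, and first few values
head(mtcars, 5)   # the first five rows
dim(mtcars)       # number of rows, then number of columns
```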

Task 1

Height by Gender

Let’s explore and compare the distributions of height by gender. That is, let’s explore and compare the heights of males to the heights of females in our dataset.

For each gender calculate the mean height, standard deviation of height, median height, interquartile range of height, and the number of respondents.

# Height summary stats by gender 
cdc %>% 
  group_by(gender) %>% 
  summarize(
    mean    = mean(height),
    std_dev = sd(height),
    med     = median(height),
    iqr     = IQR(height),
    n       = n()
  )

# A tibble: 2 × 6
  gender  mean std_dev   med   iqr     n
  <fct>  <dbl>   <dbl> <dbl> <dbl> <int>
1 m       70.3    3.01    70     4  9569
2 f       64.4    2.79    64     4 10431

Use the summary statistics to describe the distribution of height.

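The mean for male heights is 70.3 inches and the standard deviation is 3.01 inches. The mean for female heights is 64.4 inches and the standard deviation is 2.79 inches. The medians (70 inches for males, 64 inches for females) are close to the means, and both groups have an interquartile range of 4 inches, so the spread of heights is similar for the two genders. We can say the typical male in the sample is about 6 inches taller than the typical female.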

Task 2

Body Mass Index (BMI)

The formula to calculate a person’s body mass index in standard units (pounds and inches) is \[bmi = 703 \times \frac{weight}{height^2}\]. If this notation is difficult to read, click inside it or render the document to see how it prints.

Create a new variable called bmi using the formula above and add it to the cdc dataset.

# Add bmi to cdc
cdc <- cdc %>% 
  mutate(bmi = 703 * weight / height^2)
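As a quick check on the formula, we can work it through by hand for a single person. The respondent below (150 pounds, 65 inches) is hypothetical and is not taken from the cdc data.

```r
# Hypothetical respondent: 150 pounds, 65 inches tall
weight <- 150
height <- 65

# Same formula as in the mutate() call: 703 * weight / height^2
bmi_example <- 703 * weight / height^2
round(bmi_example, 2)
```

Here 703 * 150 = 105450 and 65^2 = 4225, so the BMI is about 24.96.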

Task 3

Female BMI by general health

For women in our sample, let’s investigate the distribution of bmi by genhlth. That is, for women in the cdc dataset let’s explore and compare the distribution of body mass index for each of the five general health groups. Begin by calculating the mean bmi, standard deviation of bmi, median bmi, interquartile range of bmi, and the number of respondents for women broken down by each general health category.

# First access the data, and then
cdc %>% 
  # keep only women, and then
  filter(gender == "f") %>% 
  # group by general health category, and then
  group_by(genhlth) %>% 
  # calculate summary statistics
  summarize(
    mean    = mean(bmi),
    std_dev = sd(bmi),
    med     = median(bmi),
    iqr     = IQR(bmi),
    n       = n()
  )

# A tibble: 5 × 6
  genhlth    mean std_dev   med   iqr     n
  <fct>     <dbl>   <dbl> <dbl> <dbl> <int>
1 excellent  23.8    4.25  23.0  4.93  2359
2 very good  25.4    5.02  24.4  6.26  3590
3 good       26.4    5.86  25.4  7.38  2953
4 fair       28.1    6.87  27.1  8.58  1135
5 poor       28.5    7.49  27.4  8.79   394

Use the summary statistics to describe how the five distributions of bmi relate to one another. Stated another way, what do these summary statistics tell you about the general relationship between general health and bmi for women in our sample?

The typical BMI decreases as the reported general health gets better: the mean falls from 28.5 for poor health to 23.8 for excellent health, and the median falls from 27.4 to 23.0. The spread of BMIs also decreases as the reported general health gets better, with the standard deviation falling from 7.49 to 4.25 and the IQR from 8.79 to 4.93.

Task 4

BMI summary by gender and smoke100

Let’s continue our investigation of bmi by looking at the same summary statistics broken down by gender and smoke100. That is, calculate the same set of summary statistics we have been previously calculating for bmi for men that have smoked 100 cigarettes in their lifetime, men that have not smoked 100 cigarettes in their lifetime, women that have smoked 100 cigarettes in their lifetime, and women that have not smoked 100 cigarettes in their lifetime.

# BMI summary stats by gender and smoke100 
cdc %>% 
  group_by(smoke100, gender) %>% 
  summarize(
    mean    = mean(bmi),
    std_dev = sd(bmi),
    med     = median(bmi),
    iqr     = IQR(bmi),
    n       = n()
  )

# A tibble: 4 × 7
# Groups:   smoke100 [2]
  smoke100 gender  mean std_dev   med   iqr     n
  <fct>    <fct>  <dbl>   <dbl> <dbl> <dbl> <int>
1 no       m       26.9    4.67  26.1  5.30  4547
2 no       f       25.8    5.61  24.6  6.55  6012
3 yes      m       27.0    4.65  26.4  5.34  5022
4 yes      f       25.7    5.63  24.6  6.55  4419

Within each gender group, do the summary statistics suggest that BMI is meaningfully different for those that had smoked 100 cigarettes and those that had not? Explain your answer.

No, within each gender there doesn’t appear to be a meaningful difference in BMI between those who had smoked 100 cigarettes and those who had not. The measures of center (mean and median) and spread (standard deviation and IQR) are very similar between the smoking groups. There certainly are differences between the gender groups, however.
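The "# Groups: smoke100 [2]" line in the output above appears because summarize() by default drops only the last grouping variable, so the result is still grouped by the first one. The sketch below illustrates this on a hypothetical toy tibble (the names toy, smoker, sex, and value are made up) and shows the .groups argument as one way to return an ungrouped result.

```r
library(dplyr)

# Hypothetical toy data, not the cdc dataset
toy <- tibble(
  smoker = c("no", "no", "yes", "yes", "yes"),
  sex    = c("m", "f", "m", "f", "f"),
  value  = c(1, 2, 3, 4, 6)
)

# Default behavior: the result stays grouped by the first variable
still_grouped <- toy %>%
  group_by(smoker, sex) %>%
  summarize(mean = mean(value), n = n())

group_vars(still_grouped)

# .groups = "drop" returns a result with no grouping at all
dropped <- toy %>%
  group_by(smoker, sex) %>%
  summarize(mean = mean(value), n = n(), .groups = "drop")

group_vars(dropped)
```

Leftover grouping matters mostly when you keep piping: a later mutate() or arrange() on a grouped result works within groups, which may or may not be what you want.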

Task 5

Calculate the count of each genhlth group, then print just the 3 largest counts. (Hint: try using two of our “other useful functions” to complete this.)

cdc %>% 
  count(genhlth) %>% 
  top_n(n = 3)

# A tibble: 3 × 2
  genhlth       n
  <fct>     <int>
1 excellent  4657
2 very good  6972
3 good       5675
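In recent versions of dplyr, top_n() is marked as superseded in favor of slice_max(). The sketch below shows the slice_max() form on a hypothetical toy tibble (toy and level are made-up names); it is an alternative, not the approach the activity's hint has in mind. Note one difference: slice_max() returns the rows sorted from largest to smallest, whereas the top_n() output above keeps the original row order.

```r
library(dplyr)

# Hypothetical toy data: four levels with 5, 2, 9, and 7 rows
toy <- tibble(level = rep(c("a", "b", "c", "d"), times = c(5, 2, 9, 7)))

top3 <- toy %>%
  count(level) %>%       # one row per level, with its count in column n
  slice_max(n, n = 3)    # order by column n, keep the 3 largest

top3
```

In slice_max(n, n = 3), the first n is the column created by count() and the second is the argument giving how many rows to keep.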

Optional Challenges

You do not have to complete these.

Challenge 1

We only calculated summary statistics when exploring BMI for women across general health groups. It would be useful to create a grouped boxplot (side-by-side boxplot) and faceted histograms to aid in the exploration.

# bmi by genhlth for females boxplot
cdc %>% 
  filter(gender == "f") %>% 
  ggplot(aes(genhlth, bmi)) +
    geom_boxplot(varwidth = TRUE) +
    coord_flip()

# faceted histograms 
cdc %>% 
  filter(gender == "f") %>% 
  ggplot(aes(bmi)) +
    geom_histogram(bins = 40, color = "white") +
    facet_wrap(~ genhlth, ncol = 1, scales = "free_y")

After you get the plots working, consider adding a couple of extras: set varwidth = TRUE for geom_boxplot(), and in facet_wrap() set scales = "free_y". You probably already know what varwidth = TRUE does for the boxplots. What does scales = "free_y" do, and why is it helpful in this case?

It allows each panel (small multiple) to have its own y-axis scale instead of sharing one. This makes it easier to compare the shapes of the distributions, because a panel for a sub-group with relatively few observations is no longer squashed by the scale needed for the largest sub-group.
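To see the effect of scales = "free_y" in isolation, here is a minimal sketch with simulated data. The tibble toy and its two groups are hypothetical; one group is deliberately made 20 times larger than the other.

```r
library(dplyr)
library(ggplot2)

set.seed(1)  # make the simulated values reproducible

# Hypothetical data: 1000 values in one group, 50 in the other
toy <- tibble(
  group = rep(c("large group", "small group"), times = c(1000, 50)),
  value = rnorm(1050)
)

# Shared y-axis: the small group's bars are squashed near zero
p_fixed <- ggplot(toy, aes(value)) +
  geom_histogram(bins = 30, color = "white") +
  facet_wrap(~ group, ncol = 1)

# Free y-axis: each panel's count axis fits its own data
p_free <- ggplot(toy, aes(value)) +
  geom_histogram(bins = 30, color = "white") +
  facet_wrap(~ group, ncol = 1, scales = "free_y")

p_fixed
p_free
```

With the free y-axis the counts on the two panels are no longer directly comparable, so this option suits comparing shapes rather than comparing group sizes.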

It would be nice if these graphics were near the analysis they belong with. Consider moving the code chunk with these plots to be right after the calculation of the corresponding summary statistics.

Challenge 2

It would be nice to have the measurements of height, weight, and wtdesire (desired weight) also reported in metric units: centimeters (cm) and kilograms (kg). Create three new variables and add them to cdc, named hgt_cm, wgt_kg, and desired_wgt_kg.

# add metric versions of measurements
cdc <- cdc %>% 
  mutate(
    hgt_cm = 2.54 * height,
    wgt_kg = 0.45359237 * weight,
    desired_wgt_kg = 0.45359237 * wtdesire,
  )
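A quick sanity check of the conversion factors on a single hypothetical person (70 inches, 180 pounds; these values are made up) can catch a swapped or mistyped constant before it reaches the whole dataset. One inch is exactly 2.54 cm and one pound is exactly 0.45359237 kg.

```r
library(dplyr)

# Hypothetical one-row tibble using the same column names as cdc
check <- tibble(height = 70, weight = 180) %>%
  mutate(
    hgt_cm = 2.54 * height,        # inches to centimeters
    wgt_kg = 0.45359237 * weight   # pounds to kilograms
  )

check
```

A height of 70 inches should come out as 177.8 cm and a weight of 180 pounds as roughly 81.6 kg.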

Using the metric units, create grouped boxplots for both height and weight by gender.

# grouped boxplot for hgt_cm
ggplot(cdc, aes(gender, hgt_cm)) +
  geom_boxplot() +
  coord_flip()

# grouped boxplot for wgt_kg
ggplot(cdc, aes(gender, wgt_kg)) +
  geom_boxplot() +
  coord_flip()