Data Visualization III

Activity 04

Author

Solutions

Overview

The focus of Activity 04 will be on reinforcing the basics of “The Grammar of Graphics” through the introduction of our last named graphs (boxplots and barplots).

Needed Packages

Load the ggplot2 and dplyr packages in the code chunk below. You cannot access the functions contained in these packages unless you load them!

Also note that we have suppressed the messages so the compiled HTML is less cluttered.

#Load packages here!
library(ggplot2)
library(dplyr)

Tasks

Complete the following series of tasks. Remember to render early and render often.

The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site (http://www.cdc.gov/brfss) contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset. The code below loads this dataset.

# Load the data
load("data/cdc.rda")

Make sure to inspect the data by running View(cdc) in your console, not in the Rmd file, because this will produce an error when knitting! Alternatively, you can inspect it by going to the Environment tab in the upper right-hand pane and clicking on the dataset.

Each variable corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). hlthplan indicates whether the respondent had some form of health coverage (yes) or did not (no). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in their lifetime (yes) or had not (no). The other variables record the respondent’s height in inches, weight in pounds, desired weight in pounds (wtdesire), age in years, and gender (recorded as m or f).
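Unlike View(), a structure summary can be printed inside a chunk without breaking the knit. The sketch below is only an illustration: toy_cdc is a hypothetical three-row data frame with made-up values, and the column types are assumptions based on the description above. With the real data you would call glimpse(cdc) instead.

```r
library(dplyr)

# Hypothetical toy rows mimicking the columns of cdc (all values are made up;
# the yes/no coding of hlthplan and smoke100 follows the description above)
toy_cdc <- data.frame(
  genhlth  = factor(c("good", "excellent", "fair"),
                    levels = c("excellent", "very good", "good", "fair", "poor")),
  exerany  = c(0, 1, 1),
  hlthplan = c("yes", "yes", "no"),
  smoke100 = c("no", "yes", "no"),
  height   = c(70, 64, 67),
  weight   = c(175, 125, 160),
  wtdesire = c(175, 115, 150),
  age      = c(77, 33, 49),
  gender   = c("m", "f", "f")
)

# Print one line per column with its type and first few values
glimpse(toy_cdc)
```

Because glimpse() only prints text, it is safe to leave in the Rmd file.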

Task 1

Boxplots

Let’s begin by constructing a boxplot for the variable weight, which records the weight of each participant in the survey.

ggplot(cdc, aes(y = weight)) +
  geom_boxplot()

Task 2

Use a boxplot to compare the distributions of weight by exerany. Put another way, is the distribution of weight different for participants who have exercised in the past month?

ggplot(cdc, aes(x = factor(exerany), y = weight)) +
  geom_boxplot()

If you are only seeing ONE boxplot, there is an issue! Knowing how to fix this is important.

In a few sentences explain what the graphic is telling us about the weight for participants who exercise and don’t exercise. Is it the same or different?

If an indicator variable is stored as a number, ggplot2 treats it as continuous and draws a single boxplot. You can convert it to a categorical (factor) variable using factor(). Example: factor(exerany).
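To see what factor() does to a 0/1 indicator, here is a small sketch on a made-up vector. The values and the "no"/"yes" labels are assumptions chosen for illustration and are not part of the cdc data.

```r
# Hypothetical 0/1 indicator values, made up for illustration
exerany <- c(1, 0, 1, 1, 0)

# Stored as a number, so a plot would treat it as a continuous variable
class(exerany)

# Convert to a factor; the labels are optional and only make the axis readable
exer_f <- factor(exerany, levels = c(0, 1), labels = c("no", "yes"))
class(exer_f)
levels(exer_f)
```

Calling factor(exerany) without labels, as in the Task 2 code, keeps 0 and 1 as the category names.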

The distribution of weight is similar for participants who have exercised and those who have not, with a median around 175 pounds. The spread, measured by the IQR, is roughly 50 pounds for both groups. Surprisingly, the exercise group has a few higher outliers.
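Medians and IQRs read off a boxplot can be checked numerically with dplyr. The sketch below uses a hypothetical toy data frame with made-up weights, so its numbers say nothing about the real survey. With the real data you would replace toy with cdc.

```r
library(dplyr)

# Hypothetical toy data (made-up values, not taken from cdc)
toy <- data.frame(
  exerany = c(0, 0, 0, 1, 1, 1),
  weight  = c(150, 170, 200, 140, 175, 190)
)

# One row per exercise group with its size, median and IQR
weight_summary <- toy %>%
  group_by(exerany = factor(exerany)) %>%
  summarise(
    n      = n(),
    median = median(weight),
    iqr    = IQR(weight)
  )

weight_summary
```

The median is the line inside each box and the IQR is the height of the box, so this table should agree with what the plot shows.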

Task 3

Construct a barplot for genhlth.

ggplot(data = cdc, mapping = aes(x = genhlth)) +
  geom_bar()

Task 4

Stacked barplot

What if we want to see how smoking relates to each of the genhlth groups while still showing the overall genhlth counts?

ggplot(data = cdc, 
       mapping = aes(x = genhlth, fill = smoke100)) +
  geom_bar()

Proportion of smokers within each category

Add the argument position = "fill" into geom_bar(). Also add the layer ylab("proportion") since it now represents a proportion instead of count.

ggplot(data = cdc, 
       mapping = aes(x = genhlth, fill = smoke100)) +
  geom_bar(position = "fill") +
  ylab("proportion")

This has changed the graph so that each bar has a height of 1 (100%). We should now be able to easily determine what proportion of respondents in each general health category have or have not smoked 100 cigarettes in their lifetime.
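The proportions that position = "fill" displays can also be computed as a table. This is a minimal sketch on a hypothetical toy data frame with made-up rows. It is not how ggplot2 computes the bars internally, only an equivalent calculation with dplyr.

```r
library(dplyr)

# Hypothetical toy data (made-up rows, not taken from cdc)
toy <- data.frame(
  genhlth  = c("good", "good", "good", "good", "poor", "poor"),
  smoke100 = c("yes", "no", "no", "no", "yes", "yes")
)

# Count each genhlth/smoke100 combination, then divide by the genhlth total
smoke_props <- toy %>%
  count(genhlth, smoke100) %>%
  group_by(genhlth) %>%
  mutate(proportion = n / sum(n)) %>%
  ungroup()

smoke_props
```

Within each genhlth category the proportions add up to 1, which is why every bar in the filled barplot has the same height.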

Side by side count of smokers in each category

Instead of stacking smoke100 within genhlth, construct a side-by-side barplot where the shading is still determined by smoke100.

ggplot(data = cdc, 
       mapping = aes(x = genhlth, fill = smoke100)) +
  geom_bar(position = "dodge")

Which of the above barplots do you find most useful when exploring the relationship between genhlth and smoke100? Explain.

I found the stacked barplot using proportion (2nd plot) the easiest to compare since it clearly showed the rate of smokers in each category and was not dependent on count. The side-by-side plot was fairly easy to visualize as well.

Explain what the barplot tells us about the relationship between genhlth and smoke100 for our respondents.

Respondents who have smoked at least 100 cigarettes in their lifetime make up a larger portion of each bar as we go from excellent to poor general health.

Optional Challenges

You do not have to complete these.

Challenge 1

Using the cdc dataset, construct a side-by-side boxplot of weight grouped by genhlth. Arrange the boxplots horizontally and include varwidth = TRUE in the geom_boxplot() layer.

ggplot(cdc, aes(x = genhlth, y = weight)) +
  geom_boxplot(varwidth = TRUE) +
  coord_flip()

Explain what the varwidth = TRUE argument did.

It makes the width of each box proportional to the square root of the number of observations in that category, so larger groups get wider boxes.
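To get a feel for how group size translates into box width, the sketch below computes relative widths from group counts. It assumes widths scale with the square root of the group size and are expressed relative to the largest group. The toy data and the rel_width column are hypothetical, and the code does not call ggplot2 itself.

```r
library(dplyr)

# Hypothetical toy data: 16 "excellent" rows and 4 "good" rows
toy <- data.frame(
  genhlth = c(rep("excellent", 16), rep("good", 4))
)

# Group sizes and the square-root scaling, relative to the largest group
box_widths <- toy %>%
  count(genhlth) %>%
  mutate(rel_width = sqrt(n) / max(sqrt(n)))

box_widths
```

Here the smaller group has a quarter of the observations but half the width, which is the effect of the square root.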

Challenge 2

Sometimes you may want to rearrange the order of the categories of a categorical variable in a plot. Copy one of your plots from Task 4 into the code chunk below. We are going to reverse the order so “poor” is first and “excellent” is last.

Add on the layer scale_x_discrete() because we are rearranging genhlth which is a discrete x variable. Now inside that layer include limits = c("poor", ..., "excellent"). You must type out each level exactly as it appears in the dataset in the order you want them to appear (in place of the “…”).

ggplot(data = cdc, 
       mapping = aes(x = genhlth, fill = smoke100)) +
  geom_bar(position = "fill") +
  ylab("proportion") +
  scale_x_discrete(limits = c("poor", "fair", "good", "very good", "excellent"))
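Typing out every level works, but when the variable is a factor the reversed order can also be generated. The sketch below assumes genhlth is stored as a factor with levels running from excellent to poor. The toy vector is made up for illustration.

```r
# Hypothetical toy factor with the level order assumed for cdc$genhlth
genhlth <- factor(
  c("good", "excellent", "poor", "fair", "very good"),
  levels = c("excellent", "very good", "good", "fair", "poor")
)

# Reverse the existing level order instead of typing each level by hand
reversed_levels <- rev(levels(genhlth))
reversed_levels

# With the real data (assuming genhlth is a factor) the layer would become:
# scale_x_discrete(limits = rev(levels(cdc$genhlth)))
```

This avoids typos in the level names, but it only helps if the existing level order is already meaningful.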