```r
library(ggplot2)
library(dplyr)
```

Data Visualization II
Activity 03
Overview
The focus of Activity 03 will be on reinforcing the basics of “The Grammar of Graphics” through the introduction of two more of our five named graphs (linegraphs and histograms) and the concept of faceting (sometimes called small multiples).
Needed Packages
The setup chunk at the top of this document loads the packages that are needed for this activity. We assume that you have already installed these packages. Also note that we have suppressed the messages so the compiled html is less cluttered.
Tip: Find the Data Visualization with ggplot2 cheat sheet under Help > Cheat Sheets.
Tasks
Complete the following series of tasks. Remember to render early and render often.
Task 1
LEGO sets Linegraphs
Let’s explore the legosets dataset contained in your data/ subdirectory. Recall this dataset contains information about every LEGO set manufactured from 1970 to 2015, a total of 6,172 sets.
The code below loads the legosets data from your data/ subdirectory and then does a little data wrangling/processing to produce yearly_legoset_nonduplo, which contains yearly information about non-Duplo LEGO sets. Don’t be overly concerned if you don’t understand the code used to manipulate the data right now; we will return to it when we get to Chapter 3. The comments in the code, which are indicated by # in R, give a brief description of what each line does, if you are interested.
```r
# Load data from file
load("data/legosets.rda")

# Yearly measures for non-Duplo LEGO sets
yearly_legoset_nonduplo <- legosets %>%
  # keep non-Duplo sets
  filter(Theme != "Duplo") %>%
  # keep interesting variables
  select(Year, Pieces, USD_MSRP, Theme) %>%
  # remove all observations missing any data
  na.omit() %>%
  # do calculations by Year
  group_by(Year) %>%
  # calculate the following
  summarize(
    num_sets = n(),
    num_themes = n_distinct(Theme),
    mean_pieces = mean(Pieces),
    mean_usd_msrp = mean(USD_MSRP),
    median_pieces = median(Pieces),
    median_usd_msrp = median(USD_MSRP)
  )
```

Make sure to inspect the data by running View(yearly_legoset_nonduplo) in your console — not in the qmd file, because this will produce an error when knitting!
The variable names are fairly straightforward, and hopefully you can deduce what each variable measures from its name. For instance, num_sets provides the number of LEGO sets per year, mean_pieces provides the mean number of pieces in a LEGO set per year, and median_usd_msrp provides the median manufacturer’s suggested retail price for LEGO sets per year.
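To see what these summary variables measure, here is a small sketch using a toy dataset (made-up values, not the LEGO data) run through the same group_by()/summarize() pattern:

```r
library(dplyr)

# Toy data: five sets across two years (not the real legosets data)
toy <- tibble(
  Year   = c(2001, 2001, 2001, 2002, 2002),
  Theme  = c("City", "City", "Space", "City", "Space"),
  Pieces = c(100, 200, 300, 50, 150)
)

yearly_toy <- toy %>%
  group_by(Year) %>%
  summarize(
    num_sets      = n(),               # number of sets per year
    num_themes    = n_distinct(Theme), # number of distinct themes per year
    mean_pieces   = mean(Pieces),
    median_pieces = median(Pieces)
  )
yearly_toy
# 2001: 3 sets, 2 themes, mean 200 pieces, median 200
# 2002: 2 sets, 2 themes, mean 100 pieces, median 100
```

Each row of the result summarizes one year, which is exactly the shape of yearly_legoset_nonduplo.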
Aside:
You should already be familiar with the mean (average) and the median, the two most common measures of center; both are typically covered in a high school math class. Please see Appendix A, Basic statistical terms, if you are not familiar with these terms.
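If it helps, here is a quick sketch (with made-up prices, not the LEGO data) of how the two measures react to a skewed set of values:

```r
# One unusually large value pulls the mean upward,
# while the median stays at the middle of the sorted values.
prices <- c(5, 10, 15, 20, 200)

mean(prices)    # (5 + 10 + 15 + 20 + 200) / 5 = 50
median(prices)  # middle of the sorted values = 15
```

This difference is why we look at both mean_pieces and median_pieces: when a distribution is skewed, the median is often the more representative "typical" value.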
Let’s build a series of graphics to help us build a better understanding of LEGO sets through time.
Construct a linegraph of num_sets by Year.
```r
ggplot(data = yearly_legoset_nonduplo,
       mapping = aes(x = Year, y = num_sets)) +
  geom_line()
```

Construct a linegraph of mean_usd_msrp by Year, but let’s get a little fancy by using two geoms. Use both a scatterplot layer and a linegraph layer.
```r
ggplot(data = yearly_legoset_nonduplo,
       mapping = aes(x = Year, y = mean_usd_msrp)) +
  geom_point() +
  geom_line()
```

Construct a linegraph of mean_pieces by Year. Up to you if you want to keep being fancy by using multiple geom layers.
```r
ggplot(data = yearly_legoset_nonduplo,
       mapping = aes(x = Year, y = mean_pieces)) +
  geom_point() +
  geom_line()
```

Construct a linegraph of median_pieces by Year. Up to you if you want to keep being fancy by using multiple geom layers.
```r
ggplot(data = yearly_legoset_nonduplo,
       mapping = aes(x = Year, y = median_pieces)) +
  geom_point() +
  geom_line()
```

In a few sentences explain what the series of graphics tells you about how LEGO sets have changed over time.
The typical LEGO set has generally increased in its number of pieces over time, and its price has generally increased as well.
Task 2
LEGO sets Histogram
Let’s continue to explore LEGO sets, but without focusing on the relationship with Year. In this case we just want to understand the USD_MSRP (in other words, the price in dollars) of Duplo LEGO sets. That is, we want to understand the distribution of USD_MSRP, which we can do using a histogram. The code below does a little data wrangling to produce duplo_legosets, which contains just the LEGO sets that have a Duplo theme and no missing price.
```r
duplo_legosets <- legosets %>%
  filter(Theme == "Duplo", !is.na(USD_MSRP))
```

Construct a histogram to show the distribution of USD_MSRP using the newly constructed duplo_legosets data. Set the border color of the bars to "white" and adjust the number of bins to a number that you think is appropriate (you might have to try out a few until you find one that you think is right).
```r
# construct histogram
ggplot(data = duplo_legosets, mapping = aes(x = USD_MSRP)) +
  geom_histogram(bins = 13, color = "white")
```

Describe the histogram.
The distribution of USD_MSRP appears to be unimodal and right skewed. The center can be estimated at around $20. The spread, in terms of range, is about $0 to $120, with a few outliers. We will learn how to calculate exact values in Chapter 3. (If you chose a slightly higher number of bins, the distribution might have appeared bimodal and right skewed.)
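As an aside, instead of picking a bin count with bins, you can set the width of each bin directly with binwidth. A sketch using made-up prices (not the real Duplo data):

```r
library(ggplot2)

# Toy prices standing in for USD_MSRP (hypothetical values)
toy_prices <- data.frame(USD_MSRP = c(5, 8, 12, 15, 19, 22, 35, 48))

# Each bar now covers a $10 range of prices
p <- ggplot(data = toy_prices, mapping = aes(x = USD_MSRP)) +
  geom_histogram(binwidth = 10, color = "white")
```

Choosing a binwidth that maps to a meaningful unit (here, $10) can be easier to interpret than tuning an abstract bin count.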
Task 3
Movie Lengths Histogram
Let’s explore the movie_lengths dataset contained in your data/ subdirectory. This dataset contains the year and length (in minutes) of all movies on IMDB from 1913 to 2005, a total of 58,425 movies.
```r
# Load data from file
load("data/movie_lengths.rda")
```

The histogram of movie lengths is given below.
```r
ggplot(data = movie_lengths, mapping = aes(x = length)) +
  geom_histogram()
```

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Clearly there is a problem. Can you explain what might be causing it? This issue is quite common when building histograms: extreme values/outliers in the dataset are distorting the graph. These values could be data entry errors or just very unusual cases that should be considered separately.
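Here is a back-of-the-envelope sketch (with simulated lengths, not the real movie data) of why a single extreme value wrecks the graph: geom_histogram() defaults to 30 equal-width bins spread over the full range of the data, so one huge value makes every bin very wide and piles almost all the typical values into the first bar or two.

```r
set.seed(42)

# 1000 typical movie lengths plus one absurd 5000-minute "movie"
run_times <- c(rnorm(1000, mean = 90, sd = 20), 5000)

# Width of each of the 30 default bins
binwidth <- (max(run_times) - min(run_times)) / 30
binwidth  # roughly 165 minutes per bin, wider than almost every movie
```

With bins that wide, the interesting structure around 90 minutes is invisible, which is why filtering the outliers (or shrinking binwidth) fixes the picture.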
So let’s create a subset of the dataset that focuses on movies that are no more than 200 minutes in length. This will have the effect of removing 250 movies which still leaves us 58,175 movies or about 99.5% of the data.
```r
# create modified dataset
mod_movie_lengths <- movie_lengths %>%
  filter(length <= 200)
```

Now construct a new histogram using mod_movie_lengths. Set the border color of the bars to "red" and adjust the number of bins to a number that you think is appropriate (you might have to try out a few until you find one that you think is right).
```r
ggplot(data = mod_movie_lengths, mapping = aes(x = length)) +
  geom_histogram(bins = 40, color = "red")
```

In a few sentences describe what the histogram is telling you about the distribution of movie lengths in the dataset. Important: since we modified the original data for visibility purposes, it is critical that your write-up comments on what you did!
There are two peaks: a minor peak around 10 minutes and a major peak around 90 minutes. The distribution is left skewed. Describing the center and spread is easier once we calculate some basic statistics, which we will do in the next chapter. For now we can estimate the mean to be around 80 minutes, and the spread, as measured by the range, looks to be about 200 minutes (not surprisingly, given our filter).
Task 4
Faceting
Let’s add a new variable called before_1984 to mod_movie_lengths that identifies a movie’s release year as before 1984 or after 1984 (1984 itself is included in the “after 1984” group).
```r
# Create modified dataset
mod_movie_lengths <- mod_movie_lengths %>%
  # add a new variable to the dataset
  mutate(
    before_1984 = if_else(year < 1984, "before 1984", "after 1984")
  )
```

Inspect mod_movie_lengths after adding this variable by running View(mod_movie_lengths) in your console — not in the qmd file, because this will produce an error when knitting!
Take the last histogram you built and add a facet layer which facets/groups by the new variable before_1984.
```r
ggplot(data = mod_movie_lengths, mapping = aes(x = length)) +
  geom_histogram(bins = 40, color = "white") +
  facet_wrap(~ before_1984, scales = "free")
```

You should see two histograms, one for the movies before 1984 and one for the movies from 1984 on. How do these two distributions of movie lengths compare? That is, do they have about the same general shape, center, and spread? Put another way, are movie lengths before and after 1984 distributed similarly?
The distributions are very similar, except that the minor peak in the before-1984 distribution is relatively taller and occurs a little earlier than the minor peak in the after-1984 distribution.