Data Visualization I

Activity 02

Author

Solutions

Overview

Activity 02 focuses on the basics of “The Grammar of Graphics” and on scatterplots, the first of our five named graphs (5NG).


Needed Packages

The following code loads the packages needed for this activity. Note that package startup messages have been suppressed so the compiled HTML is less cluttered.

library(nycflights13)
library(ggplot2)
library(dplyr)


The Grammar of Graphics

“The Grammar of Graphics” defines a set of rules for constructing statistical graphics by combining different types of layers. This is why you may hear it alternatively called “The Layered Grammar of Graphics.”
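As a minimal sketch of the layering idea, the code below builds one plot a piece at a time using the mpg data that ships with ggplot2. The object names (base, with_points, with_labels) and the plot title are illustrative choices, not part of the activity.

```r
library(ggplot2)

# Step 1: data and aesthetic mappings only; printing this draws empty axes
base <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))

# Step 2: add a geometric layer so each row is drawn as a point
with_points <- base + geom_point()

# Step 3: add a labels layer on top of the points
with_labels <- with_points +
  labs(title = "Highway mpg by engine displacement")

with_labels
```

Each `+` adds one more component to the plot that came before it, which is why the individual steps can be saved and reused.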


Helpful Tip

If you are ever unsure of how to use a function, type ? followed by the function name to see a description of the function and useful examples. If you have not loaded the package that contains the function, use ?? instead (e.g., ?ggplot or ??ggplot).


Tasks

Complete the following series of tasks. Remember to render early and render often.


Task 1

Alaska Flights dataset

Some datasets exist in packages while others are stored on your computer (in this case the Cloud) and accessed through the data folder.

The flights dataset is included in the nycflights13 package. To learn more about the dataset you can type ?flights in the console window.


Below is the code to create a dataset that contains only the 714 Alaska Airlines flights that left NYC in 2013. (We used filter() to keep only the flights whose carrier code is AS, which is Alaska Airlines.)

Run the code chunk below.

alaska_flights <- flights %>% 
  filter(carrier == "AS")

Now navigate to the Environment tab in the top right panel. Notice there is a new dataset created named alaska_flights. Click on it.

How many variables and how many observations does the dataset have?

19 variables and 714 observations
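The same counts can be checked in code instead of through the Environment tab. This is a small sketch that recreates alaska_flights from the chunk above and then asks R for its dimensions.

```r
library(nycflights13)
library(dplyr)

# Recreate the subset from the chunk above
alaska_flights <- flights %>%
  filter(carrier == "AS")

# dim() returns the number of rows followed by the number of columns
dim(alaska_flights)

# The same two counts individually
nrow(alaska_flights)  # observations (flights)
ncol(alaska_flights)  # variables
```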


When defining aesthetics in a plot (e.g., x, y, color, shape, size), they must be variables from the dataset you specify. To see the names of the possible variables, you must explore your dataset.
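One way to explore the available variables is sketched below. The vector planned_aes is a hypothetical helper name used only to show how to confirm that the variables you intend to map actually exist.

```r
library(nycflights13)
library(dplyr)

alaska_flights <- flights %>%
  filter(carrier == "AS")

# Column names that can be used inside aes()
names(alaska_flights)

# glimpse() also shows each variable's type and its first few values
glimpse(alaska_flights)

# Check that the variables we plan to map really are in the dataset
planned_aes <- c("dep_delay", "arr_delay")
all(planned_aes %in% names(alaska_flights))
```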


Alaska flights scatterplot

Below is the code that produced Figure 2.4 in the book.

ggplot(data = alaska_flights, 
       mapping = aes(x = dep_delay, y = arr_delay)) + 
  geom_point(alpha = 0.2) 
Warning: Removed 5 rows containing missing values (geom_point).
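The warning appears because a point cannot be drawn when either of its coordinates is missing. As a quick check, the sketch below counts the flights with a missing dep_delay or arr_delay; the result should agree with the number of rows reported in the warning. The name n_missing is illustrative.

```r
library(nycflights13)
library(dplyr)

alaska_flights <- flights %>%
  filter(carrier == "AS")

# Rows where either plotted variable is NA cannot be drawn as points
n_missing <- alaska_flights %>%
  filter(is.na(dep_delay) | is.na(arr_delay)) %>%
  nrow()

n_missing
```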

What does alpha = 0.2 do to the graph?

It changes the transparency of the points to address overplotting issues. A value of 0 is 100% transparent and a value of 1 is 100% opaque.
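To see the effect in isolation, the sketch below draws the same simulated points twice, once fully opaque and once with alpha = 0.2. The toy data frame is made up for illustration and is not part of the activity data.

```r
library(ggplot2)

# Hypothetical example data: 2,000 points concentrated near the origin
set.seed(1)
toy <- data.frame(x = rnorm(2000), y = rnorm(2000))

# Fully opaque points: the dense center becomes a solid blob
p_opaque <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point(alpha = 1)

# Mostly transparent points: darker regions indicate more overlapping points
p_transparent <- ggplot(data = toy, mapping = aes(x = x, y = y)) +
  geom_point(alpha = 0.2)

p_opaque
p_transparent
```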


Describe what the scatterplot is communicating about these 714 Alaska Airlines flights that left NYC in 2013.

There is a clear positive linear association between arr_delay and dep_delay for these flights. We also see that a large portion (perhaps the majority) of these flights departed on time or early and arrived early.


Task 2

Miles per Gallon

Let’s explore the mpg dataset which is loaded with the ggplot2 package. You can learn more about the mpg dataset by running ?mpg in the console.

You can load the mpg dataset into your Environment pane by typing data(mpg) in the Console (not necessary but some students prefer to see their dataset there).

# print the data
mpg
# A tibble: 234 × 11
   manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
   <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
 1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
 2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
 3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
 4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
 5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
 6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
 7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
 8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
 9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
# … with 224 more rows


Below is a scatterplot of hwy (highway miles per gallon) by cty (city miles per gallon). What is wrong with the plot?

Clear overplotting issues.
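Overplotting can also be confirmed numerically. Because cty and hwy are whole numbers, many cars share exactly the same position. The sketch below counts the cars at each (cty, hwy) combination; shared_positions and n_cars are illustrative names.

```r
library(ggplot2)
library(dplyr)

# Count how many cars share each (cty, hwy) combination
shared_positions <- mpg %>%
  count(cty, hwy, name = "n_cars") %>%
  arrange(desc(n_cars))

shared_positions

# Fewer distinct positions than rows means points are stacked on one another
nrow(shared_positions) < nrow(mpg)
```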

Make the appropriate adjustment to the code (Hint: section 2.3.2).

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(aes(color = drv))
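By default, geom_jitter() picks the amount of random noise for you. If the default moves points too far, the width and height arguments can be set by hand. The values of 0.3 and the seed below are assumptions chosen for illustration, not recommended settings.

```r
library(ggplot2)

# Fix the seed so the random jitter is the same on every render
set.seed(2013)

p_jitter <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  # width and height limit how far a point may be nudged in each direction
  geom_jitter(aes(color = drv), width = 0.3, height = 0.3)

p_jitter
```

Keeping the jitter below half a unit means each point stays closer to its true integer value than to any neighboring one.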


Additionally, map the variable drv to the color aesthetic of the points. Note that drv is a character variable. What are the possible drive types?

f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive
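The codes themselves can be listed directly from the data, as in this short sketch (drive_counts is an illustrative name). The meaning of each code comes from the ?mpg help page.

```r
library(ggplot2)
library(dplyr)

# Distinct values of drv and how many cars fall under each one
drive_counts <- mpg %>%
  count(drv)

drive_counts
```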

In a few sentences describe what the scatterplot is communicating about the cars in the mpg dataset.

There is a clear positive linear association between city (cty) and highway (hwy) miles per gallon. It appears that front-wheel drive vehicles generally have better gas mileage in this dataset.
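Both observations can be backed up with simple summaries. The sketch below computes the correlation between cty and hwy and the average mileage within each drive type; the object and column names (cty_hwy_cor, mileage_by_drv, mean_cty, mean_hwy, n_cars) are illustrative.

```r
library(ggplot2)
library(dplyr)

# Pearson correlation between city and highway mileage
cty_hwy_cor <- cor(mpg$cty, mpg$hwy)
cty_hwy_cor

# Average mileage within each drive type
mileage_by_drv <- mpg %>%
  group_by(drv) %>%
  summarise(
    mean_cty = mean(cty),
    mean_hwy = mean(hwy),
    n_cars = n()
  )

mileage_by_drv
```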


Task 3

Blue Jays

Let’s explore the blue_jays dataset which should be in your data/ subdirectory. This dataset contains measurements on 123 blue jays. The code below loads the data from your data subdirectory, stores it in a more useful format, and then finally prints it for us to inspect.

# Load data from file
load("data/blue_jays.rda") 
# Store in a more useful format
blue_jays <- blue_jays %>% 
  as_tibble()
# Print data
blue_jays
# A tibble: 123 × 9
   BirdID     KnownSex BillDepth BillWidth BillLength  Head  Mass Skull   Sex
   <fct>      <fct>        <dbl>     <dbl>      <dbl> <dbl> <dbl> <dbl> <int>
 1 0000-00000 M             8.26      9.21       25.9  56.6  73.3  30.7     1
 2 1142-05901 M             8.54      8.76       25.0  56.4  75.1  31.4     1
 3 1142-05905 M             8.39      8.78       26.1  57.3  70.2  31.2     1
 4 1142-05907 F             7.78      9.3        23.5  53.8  65.5  30.3     0
 5 1142-05909 M             8.71      9.84       25.5  57.3  74.9  31.8     1
 6 1142-05911 F             7.28      9.3        22.2  52.2  63.9  30       0
 7 1142-05912 M             8.74      9.28       25.4  57.1  75.1  31.8     1
 8 1142-05914 M             8.72      9.94       30    60.7  78.1  30.7     1
 9 1142-05917 F             8.2       9.01       22.8  52.8  64    30.0     0
10 1142-05920 F             7.67      9.31       24.6  54.9  67.3  30.3     0
# … with 113 more rows


Construct a scatterplot of Head, distance from tip of bill to back of head (in mm), by Mass, body mass (in grams). Also map the KnownSex, sex of the bird, to both the color and shape aesthetics.

# BUILD THE PLOT IN THIS CODE CHUNK
ggplot(data = blue_jays, mapping = aes(x = Mass, y = Head)) +
  geom_point(aes(color = KnownSex, shape = KnownSex))


In a few sentences describe what the scatterplot is communicating about the blue jays in our blue_jays dataset.

There is a clear positive linear association between the distance from tip of bill to back of the head and body mass of blue jays in this dataset. It also appears that female blue jays in the dataset tend to be smaller in both measurements.


Why might we map a single variable to two aesthetics?

To help readers who are color blind: the groups can still be told apart by shape even when the colors cannot be distinguished.


Task 4

Legosets

Let’s explore the legosets dataset which should be in your data/ subdirectory. This dataset contains information about every Lego set manufactured from 1970 to 2015, a total of 6172 sets. The code below loads the data from your data subdirectory and creates two separate datasets: one containing Duplo Lego sets, which are for toddlers/preschoolers, and one containing all other Lego sets, which are for older kids. Finally, we glimpse one of the datasets because both should have the same structure.

# Load data from file
load("data/legosets.rda")
# Non-duplo (for big kids, like Professor Kuyper)
lego_nonduplo <- legosets %>% 
  filter(Theme != "Duplo")
# Duplo (for young kids, toddlers)
lego_duplo <- legosets %>% 
  filter(Theme == "Duplo")
# Check for variables
glimpse(lego_nonduplo)
Rows: 5,701
Columns: 14
$ Item_Number  <chr> "10246", "10247", "10248", "10249", "10677", "10679", "10…
$ Name         <chr> "Detective's Office", "Ferris Wheel", "Ferrari F40", "Toy…
$ Year         <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 201…
$ Theme        <chr> "Advanced Models", "Advanced Models", "Advanced Models", …
$ Subtheme     <chr> "Modular Buildings", "Fairground", "Vehicles", "Winter Vi…
$ Pieces       <int> 2262, 2464, 1158, 898, 74, 57, 99, 132, 134, 113, 226, 13…
$ Minifigures  <int> 6, 10, NA, NA, 1, 2, 2, 2, 2, 2, 3, NA, NA, NA, NA, NA, N…
$ Image_URL    <chr> "http://images.brickset.com/sets/images/10246-1.jpg", "ht…
$ GBP_MSRP     <dbl> 132.99, 149.99, 69.99, 59.99, 9.99, 9.99, 15.99, 15.99, 1…
$ USD_MSRP     <dbl> 159.99, 199.99, 99.99, 79.99, 9.99, 9.99, 19.99, 19.99, 1…
$ CAD_MSRP     <dbl> 199.99, 229.99, 119.99, NA, 12.99, 12.99, 24.99, 24.99, 2…
$ EUR_MSRP     <dbl> 149.99, 179.99, 89.99, 69.99, 9.99, 9.99, 19.99, 19.99, 2…
$ Packaging    <chr> "Box", "Box", "Box", "Box", "Box", "Box", "Box", "Box", "…
$ Availability <chr> "Retail - limited", "Retail - limited", "LEGO exclusive",…


There are only a few real candidates for constructing a scatterplot. You should be able to identify them quickly. It might be useful to know that MSRP stands for Manufacturer’s Suggested Retail Price.

Using the lego_nonduplo dataset, construct a scatterplot of USD_MSRP, a set’s MSRP in US dollars, by Pieces, the number of pieces in a set.

ggplot(data = lego_nonduplo, mapping = aes(x = Pieces, y = USD_MSRP)) +
  geom_point(alpha = 0.1)
Warning: Removed 419 rows containing missing values (geom_point).


Does the relationship you observe in the scatterplot make sense? Explain.

Yes, it does make sense. We see a positive association, which means that the more pieces a Lego set contains, the higher its price tends to be.

Optional Challenges

You do not have to complete these.

Challenge 1

It is often useful to “add” another geometric layer when exploring data with scatterplots. This additional layer helps us to detect and describe relationships. All we have to do is add geom_smooth(se = FALSE).

Copy and paste the code for one of the completed scatterplots above and simply layer on geom_smooth(se = FALSE) by “adding” it to the code.


ggplot(data = lego_nonduplo, mapping = aes(x = Pieces, y = USD_MSRP)) +
  geom_point(alpha = 0.1) +
  geom_smooth(se = FALSE) +
  labs(title = "Nonduplo Lego Price by Pieces") +
  theme_minimal()
`geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Warning: Removed 419 rows containing non-finite values (stat_smooth).
Warning: Removed 419 rows containing missing values (geom_point).
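The message above shows that geom_smooth() chose a smoothing method automatically. If a straight line is preferred, the method can be requested explicitly with method = "lm". The sketch below uses the built-in mpg data so it runs without the data/ folder; the alpha value and the name p_lm are illustrative.

```r
library(ggplot2)

p_lm <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(alpha = 0.4) +
  # method = "lm" fits a straight line; se = FALSE hides the confidence band
  geom_smooth(method = "lm", se = FALSE)

p_lm
```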


Challenge 2

With an additional layer or two you can make your plot look a lot prettier. Add a title to your plot in the above ‘Challenge 1’ by “adding” on the layer labs(title = "Your title"), where “Your title” is an appropriate title for your plot.

Then “add” on the layer theme_minimal() to make the background and display more visually appealing.
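labs() can set more than the title. As a sketch, the code below also relabels the axes and the legend for the mpg scatterplot from Task 2; the label text and the name p_labeled are illustrative choices.

```r
library(ggplot2)

p_labeled <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(aes(color = drv)) +
  labs(
    title = "Highway vs. city mileage",
    x = "City miles per gallon",
    y = "Highway miles per gallon",
    color = "Drive type"  # legend title for the color aesthetic
  ) +
  theme_minimal()

p_labeled
```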