Data Wrangling I

Activity 05

Author

Solutions

Overview

The focus of Activity 05 is data wrangling. In real data applications, it is rare for data to arrive in the “perfect” form for us to use; we will typically need to manipulate, or wrangle, the data into a useful form. We will begin developing our data wrangling skills by becoming familiar with the “pipe”, %>%, and using it with filter() and summarize(). We will also use our visualization tools from the previous activities.


Needed Packages

The following loads the packages that are needed for this activity. We assume that you have already installed them. Also note that we have suppressed the loading messages so the compiled HTML is less cluttered.

library(dplyr)
library(ggplot2)
library(nycflights13)
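

With these packages loaded, here is a minimal sketch of how the pipe mentioned in the Overview works: it passes the object on its left in as the first argument of the function on its right, so a sequence of wrangling steps reads left to right. (The column name n_flights below is just an illustrative label.)

# Nested call: read from the inside out
summarize(filter(flights, month == 1), n_flights = n())

# Same steps with the pipe: read left to right
flights %>% 
  filter(month == 1) %>% 
  summarize(n_flights = n())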


Tasks

Complete the following series of tasks. Remember to render early and render often.


Background

Chicago flight times

We will be using the familiar flights dataset from nycflights13 for this section. It might be useful to review the dataset a little. Run ?flights in your console and a help file will appear with information about the dataset. You might find the definitions for the variables useful. It is also advisable to take a quick peek at the data by running View(flights) in your console — remember never to run/leave this in your Rmd file because it will produce an error.
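

If you want a quick look that is safe to leave in the Rmd, dplyr’s glimpse() (with the packages above loaded) prints each variable, its type, and the first few values without opening the data viewer; a small sketch:

# Each variable, its type, and the first few values -- safe to include in an Rmd
glimpse(flights)

# Printing the tibble is also safe: only the first 10 rows are shown
flights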


Task 1

O’Hare (ORD)

Let’s focus on flights from New York City to Chicago’s O’Hare (ORD). Create a dataset named ord_flights that contains all flights in 2013 to ORD from New York City’s airports.

# create ord_flights
ord_flights <- flights %>% 
  filter(dest == "ORD")


It is always a good idea to inspect your newly created dataset as a quick spot check. You can do that by printing it (run ord_flights in the console), by View()ing it, and/or checking that it is in the Environment tab in the upper right pane.

How many flights are in ord_flights? Does each flight correspond to an observation or variable?

There are 17,283 flights in ord_flights. Each flight corresponds to an observation (a row in the dataset); the variables are the columns.
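
One quick way to confirm the count (with the packages above loaded and ord_flights created):

# Number of rows (one row per flight) in ord_flights
nrow(ord_flights)   # 17283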


We are interested in the flight time as measured by air_time from NYC to Chicago’s ORD. That is, we want to investigate the distribution of air_time. Begin the investigation by calculating summary statistics: mean, standard deviation, median, interquartile range, minimum, and maximum. Store the results in ord_summary.

# Create ord_summary
ord_summary <- ord_flights %>% 
  summarize(
    avg     = mean(air_time, na.rm = TRUE),
    std_dev = sd(air_time, na.rm = TRUE),
    med     = median(air_time, na.rm = TRUE),
    iqr     = IQR(air_time, na.rm = TRUE),
    minimum = min(air_time, na.rm = TRUE),
    maximum = max(air_time, na.rm = TRUE)
  )
# Print ord_summary
ord_summary
# A tibble: 1 × 6
    avg std_dev   med   iqr minimum maximum
  <dbl>   <dbl> <dbl> <dbl>   <dbl>   <dbl>
1  116.    10.2   114    13      87     198


Explain what these values tell you about the flight times from NYC airports to Chicago’s O’Hare in 2013. That is, what is the typical flight time? How do flights spread out or deviate from the typical flight time?


The typical flight time from an NYC airport to Chicago’s O’Hare in 2013 was about 116 minutes. The flight times have a spread, as measured by the standard deviation, of about 10.2 minutes. Together these suggest that most flights between NYC airports and ORD took roughly 96 to 136 minutes, the endpoints coming from the mean plus or minus two standard deviations (116 - 2(10.2) and 116 + 2(10.2)).
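
The endpoints of that interval can also be computed directly from ord_summary; a small sketch:

# Mean plus/minus two standard deviations
ord_summary$avg - 2 * ord_summary$std_dev   # roughly 96 minutes
ord_summary$avg + 2 * ord_summary$std_dev   # roughly 136 minutes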


It might be useful to also visualize the distribution of air_time. You should use two of the 5NGs (the five named graphs) to do this. Create the two graphs in the code chunk below.

# Construct plot 1
ggplot(ord_flights, aes(x = air_time)) +
  geom_histogram(color = "white", bins = 40) 
Warning: Removed 717 rows containing non-finite values (stat_bin).

# Construct plot 2
ggplot(ord_flights, aes(y = air_time)) +
  geom_boxplot() +
  coord_flip()
Warning: Removed 717 rows containing non-finite values (stat_boxplot).


Do the summary statistics and plots above tell the same story about air_time? Explain.

Yes, the summary statistics and plots tell a similar story. We see the center is about 116 minutes and the spread, as measured by the IQR, is around 13 minutes. The interval mentioned above covers the vast majority of the data. One thing that is easier to see in the graphs is the slight right skew in the distribution, caused by several outliers to the right. The mean being larger than the median is consistent with this right skew (and with the presence of large outliers).


Task 2

The useful is.na() function

When calculating the summary statistics above and building the graphs we had to take care of missing values for air_time. How does a flight have NA for its air_time? Either there was a recording error (unlikely with FAA records) or maybe the flight was cancelled. When analyzing flight time (air_time) do we really care about flights that were cancelled? Probably not, so we could filter them out.

This means we need a way of identifying whether a flight has NA for its air_time measurement. Your first instinct might be to use something like air_time != NA (keep air_time values that are not equal to NA), but that won’t do the trick. Missing values are tricky, and handling them takes a little extra care. We need the is.na() function, which does exactly what its name implies: it answers the question “TRUE or FALSE: is the input NA?”
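
A quick console experiment shows why comparing to NA with == or != does not work, while is.na() does:

# Comparisons with NA return NA, not TRUE or FALSE
NA == NA    # NA
5 != NA     # NA

# is.na() gives a definite answer
is.na(NA)   # TRUE
is.na(5)    # FALSE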

Now, if we used is.na(air_time) in a filter, we would keep the flights that have NA for air_time, which is the opposite of what we want. Therefore, we should use !is.na(air_time). Create a dataset named ord_not_cancelled that contains all flights from NYC to ORD that were not cancelled.

# Create ord_not_cancelled
ord_not_cancelled <- flights %>% 
  filter(dest == "ORD", !is.na(air_time))
# Print ord_not_cancelled
ord_not_cancelled
# A tibble: 16,566 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      554            558        -4      740            728
 2  2013     1     1      558            600        -2      753            745
 3  2013     1     1      608            600         8      807            735
 4  2013     1     1      629            630        -1      824            810
 5  2013     1     1      656            700        -4      854            850
 6  2013     1     1      709            700         9      852            832
 7  2013     1     1      715            713         2      911            850
 8  2013     1     1      739            745        -6      918            930
 9  2013     1     1      749            710        39      939            850
10  2013     1     1      828            830        -2     1027           1012
# … with 16,556 more rows, and 11 more variables: arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>


Now there should be no missing value issues with air_time. Test this out by re-running one of your graphs, but use the ord_not_cancelled dataset. There shouldn’t be a warning message this time because we removed all the flights that had missing values.

ggplot(ord_not_cancelled, aes(y = air_time)) +
  geom_boxplot() +
  coord_flip()


It is important to note that we didn’t just throw out missing values. We thought about them and determined that they represented flights that had no impact on what we were exploring and could be removed.
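
As a quick check, the number of flights we dropped matches the 717 rows flagged in the earlier warning messages:

# Flights removed when filtering out missing air_time values
nrow(ord_flights) - nrow(ord_not_cancelled)   # 717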

This is a very useful function that you will likely use often, so I recommend making a note of it somewhere!


Task 3

Additional practice

Suppose we want to count the total number of flights that were

  • in the month of January,
  • and departed late. Hint: if a flight departed late, think about which dep_delay values you want to keep.

Store your results in delay_jan.

# Create delay_jan
delay_jan <- flights %>% 
  filter(month == 1, dep_delay > 0) %>%
  summarize(count = n())
# Print delay_jan
delay_jan
# A tibble: 1 × 1
  count
  <int>
1  9662


Task 4

More practice

Suppose we want to summarize the average and standard deviation of arr_delay for flights

  • headed to ORD or MDW,
  • and in the month of June

Store your results in arr_june.

# Create arr_june
arr_june <- flights %>% 
  filter(dest == "ORD" | dest == "MDW", month == 6) %>%
  summarize(mean = mean(arr_delay, na.rm = TRUE),
            sd = sd(arr_delay, na.rm = TRUE))
# Print arr_june
arr_june
# A tibble: 1 × 2
   mean    sd
  <dbl> <dbl>
1  14.8  55.9


Optional Challenge

You do not have to complete this challenge.

Copy and paste your code used to create ord_not_cancelled, but remove the ord_not_cancelled <-. We don’t want to store it. What happens when you run the code?

# PLACE CODE HERE
flights %>% 
  filter(dest == "ORD", !is.na(air_time))
# A tibble: 16,566 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      554            558        -4      740            728
 2  2013     1     1      558            600        -2      753            745
 3  2013     1     1      608            600         8      807            735
 4  2013     1     1      629            630        -1      824            810
 5  2013     1     1      656            700        -4      854            850
 6  2013     1     1      709            700         9      852            832
 7  2013     1     1      715            713         2      911            850
 8  2013     1     1      739            745        -6      918            930
 9  2013     1     1      749            710        39      939            850
10  2013     1     1      828            830        -2     1027           1012
# … with 16,556 more rows, and 11 more variables: arr_delay <dbl>,
#   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>


Now build on that code by piping it directly into the histogram code. That is, add a %>% to the end of the code from the previous chunk and paste your code for a histogram of air_time right after it. Remove the data reference from your ggplot() code. That is, remove data = ord_not_cancelled from the base layer so it now looks like ggplot(mapping = aes(x = air_time)). Run the code. What happens?

flights %>% 
  filter(dest == "ORD", !is.na(air_time)) %>% 
  ggplot(aes(x = air_time)) +
  geom_histogram(color = "white", bins = 40) 


Why might this be useful?

We don’t have to waste time saving intermediate datasets, and we can make minor adjustments to the pipeline without having to re-create those intermediate objects.
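
For example (a hypothetical tweak, not part of the tasks), restricting the histogram to June flights only requires editing the filter() line; nothing has to be re-saved:

# Same pipeline, adjusted on the fly to June flights only
flights %>% 
  filter(dest == "ORD", !is.na(air_time), month == 6) %>% 
  ggplot(aes(x = air_time)) +
  geom_histogram(color = "white", bins = 40)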