library(dplyr)
library(ggplot2)
library(nycflights13)
Data Wrangling I
Activity 05
Overview
The focus of Activity 05 will be on data wrangling. In real data applications it is rare that data will be in the “perfect” form for us to use. We will typically need to manipulate or wrangle the data into a form that is useful. We will begin our development of data wrangling skills by becoming familiar with the “pipe”, %>%, and using it with filter() and summarize(). We will also utilize our visualization tools from the previous activities.
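As a small preview of how the pipe reads: it takes the result on its left and passes it as the first argument to the function on its right, so a nested call can be rewritten as a top-to-bottom series of steps. The two calls below are equivalent.

```r
# Nested form: read inside-out
summarize(filter(flights, dest == "ORD"), n = n())

# Piped form: read top-to-bottom, one step per line
flights %>%
  filter(dest == "ORD") %>%
  summarize(n = n())
```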
Needed Packages
The following loads the packages that are needed for this activity. We assume that you have installed the packages. Also note that we have suppressed the messages so the compiled html is less cluttered.
Tasks
Complete the following series of tasks. Remember to render early and render often.
Background
Chicago flight times
We will be using the familiar flights dataset from nycflights13 for this section. It might be useful to review the dataset a little. Run ?flights in your console and a help file will appear with information about the dataset. You might find the definitions for the variables useful. It is also advisable to take a quick peek at the data by running View(flights) in your console — remember never to run/leave this in your Rmd file because it will produce an error.
Task 1
O’Hare (ORD)
Let’s focus on flights from New York City to Chicago’s O’Hare (ORD). Create a dataset named ord_flights that contains all flights in 2013 from New York City’s airports to ORD.
# create ord_flights
ord_flights <- flights %>%
  filter(dest == "ORD")
It is always a good idea to inspect your newly created dataset as a quick spot check. You can do that by printing it (run ord_flights in the console), by View()ing it, and/or by checking that it appears in the Environment tab in the upper right pane.
How many flights are in ord_flights? Does each flight correspond to an observation or variable?
There are 17,283 flights in ord_flights. Each flight corresponds to an observation (a row of the dataset); the variables are the columns.
We are interested in the flight time as measured by air_time from NYC to Chicago’s ORD. That is, we want to investigate the distribution of air_time. Begin the investigation by calculating summary statistics: mean, standard deviation, median, interquartile range, minimum, and maximum. Store the results in ord_summary.
# Create ord_summary
ord_summary <- ord_flights %>%
summarize(
avg = mean(air_time, na.rm = TRUE),
std_dev = sd(air_time, na.rm = TRUE),
med = median(air_time, na.rm = TRUE),
iqr = IQR(air_time, na.rm = TRUE),
minimum = min(air_time, na.rm = TRUE),
maximum = max(air_time, na.rm = TRUE)
)
# Print ord_summary
ord_summary
# A tibble: 1 × 6
avg std_dev med iqr minimum maximum
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 116. 10.2 114 13 87 198
Explain what these values tell you about the flight times from NYC airports to Chicago’s O’Hare in 2013. That is, what is the typical flight time? How do flights spread out or deviate from the typical flight time?
The typical flight time from an NYC airport to Chicago’s O’Hare in 2013 was about 116 minutes. The spread of the distribution, as measured by the standard deviation, was about 10.2 minutes. This indicates that most of the flights between NYC airports and ORD took roughly 96 to 136 minutes (computed as 116 - 2*10 and 116 + 2*10).
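That rough interval can also be computed directly from ord_summary, using the column names defined above (a quick sketch):

```r
# Typical range: mean plus/minus two standard deviations
ord_summary %>%
  summarize(lower = avg - 2 * std_dev,
            upper = avg + 2 * std_dev)
```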
It might be useful to also visualize the distribution of air_time. You should use two of the 5NGs to do this. Create the two graphs in the code chunk below.
# Construct plot 1
ggplot(ord_flights, aes(x = air_time)) +
  geom_histogram(color = "white", bins = 40)
Warning: Removed 717 rows containing non-finite values (stat_bin).
# Construct plot 2
ggplot(ord_flights, aes(y = air_time)) +
geom_boxplot() +
  coord_flip()
Warning: Removed 717 rows containing non-finite values (stat_boxplot).
Do the summary statistics and plots above tell the same story about air_time? Explain.
Yes, the summary statistics and plots tell a similar story. We see the center is about 116 minutes and the spread, as measured by the IQR, is around 13 minutes. The interval we mentioned above covers the vast majority of the data. One thing that is easier to see in the graphs is the slight right skew in the distribution, caused by several outliers on the right. The mean lying to the right of the median is consistent with outliers on the right (and therefore a possible right skew).
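One quick numeric check of that skew, using the ord_summary dataset created earlier (a sketch):

```r
# When the mean sits above the median, a right skew is plausible
ord_summary$avg > ord_summary$med
```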
Task 2
The useful is.na() function
When calculating the summary statistics above and building the graphs we had to take care of missing values for air_time. How does a flight have NA for its air_time? Either there was a recording error (unlikely with FAA records) or maybe the flight was cancelled. When analyzing flight time (air_time) do we really care about flights that were cancelled? Probably not, so we could filter them out.
This means we need a way of identifying if a flight has NA for its air_time measurement. Your first instinct should be to use something like air_time != NA (keep air_time that are not equal to NA), but that won’t do the trick. Missing values are tricky and handling them takes a little extra care. We need the is.na() function which does exactly what it seems to imply. It answers the question “TRUE or FALSE: Is the input NA?”
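A quick console illustration of why comparisons with == fail for missing values (x here is just a made-up vector for the demo):

```r
x <- c(120, NA, 95)

# Any comparison involving NA returns NA, not TRUE/FALSE,
# so filter() would silently drop these rows
x != NA

# is.na() gives a proper logical answer for each element
is.na(x)
!is.na(x)
```

Running this shows x != NA is NA for every element, while is.na(x) returns FALSE TRUE FALSE.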
Now if we use is.na(air_time) in a filter we would be keeping the flights that have NA for air_time which is the opposite of what we want. Therefore we should use !is.na(air_time). Create a dataset named ord_not_cancelled that contains all flights from NYC to ORD that were not cancelled.
# Create ord_not_cancelled
ord_not_cancelled <- flights %>%
filter(dest == "ORD", !is.na(air_time))
# Print ord_not_cancelled
ord_not_cancelled
# A tibble: 16,566 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 554 558 -4 740 728
2 2013 1 1 558 600 -2 753 745
3 2013 1 1 608 600 8 807 735
4 2013 1 1 629 630 -1 824 810
5 2013 1 1 656 700 -4 854 850
6 2013 1 1 709 700 9 852 832
7 2013 1 1 715 713 2 911 850
8 2013 1 1 739 745 -6 918 930
9 2013 1 1 749 710 39 939 850
10 2013 1 1 828 830 -2 1027 1012
# … with 16,556 more rows, and 11 more variables: arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Now there should be no missing value issues with air_time. Test this out by re-running one of your graphs, but use the ord_not_cancelled dataset. There shouldn’t be a warning message this time because we removed all the flights that had missing values.
ggplot(ord_not_cancelled, aes(y = air_time)) +
geom_boxplot() +
  coord_flip()
It is important to note that we didn’t just throw out missing values. We thought about them and determined that they represented flights that had no impact on what we were exploring and could be removed.
This is a very useful function that you will likely use often, so I recommend making a note of it somewhere!
Task 3
Additional practice
Suppose we want to count the total number of flights that were
- in the month of January,
- and departed late.

Hint: if they departed late, think about which dep_delay values you want to keep.

Store your results in delay_jan.
# Create delay_jan
delay_jan <- flights %>%
  filter(month == 1, dep_delay > 0) %>%
summarize(count = n())
# Print delay_jan
delay_jan
# A tibble: 1 × 1
count
<int>
1 9662
Task 4
More practice
Suppose we want to summarize the average and standard deviation of arr_delay for flights
- headed to ORD or MDW,
- and in the month of June.

Store your results in arr_june.
# Create arr_june
arr_june <- flights %>%
  filter((dest == "ORD" | dest == "MDW"), month == 6) %>%
  summarize(mean = mean(arr_delay, na.rm = TRUE),
            sd = sd(arr_delay, na.rm = TRUE))
# Print arr_june
arr_june
# A tibble: 1 × 2
mean sd
<dbl> <dbl>
1 14.8 55.9
Optional Challenge
You do not have to complete this task.
Copy and paste your code used to create ord_not_cancelled, but remove the ord_not_cancelled <-. We don’t want to store it. What happens when you run the code?
# PLACE CODE HERE
flights %>%
  filter(dest == "ORD", !is.na(air_time))
# A tibble: 16,566 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 554 558 -4 740 728
2 2013 1 1 558 600 -2 753 745
3 2013 1 1 608 600 8 807 735
4 2013 1 1 629 630 -1 824 810
5 2013 1 1 656 700 -4 854 850
6 2013 1 1 709 700 9 852 832
7 2013 1 1 715 713 2 911 850
8 2013 1 1 739 745 -6 918 930
9 2013 1 1 749 710 39 939 850
10 2013 1 1 828 830 -2 1027 1012
# … with 16,556 more rows, and 11 more variables: arr_delay <dbl>,
# carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
# air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Now build on that code by piping it directly into the histogram code. That is, add a %>% to the end of the code from the previous chunk and paste your code for a histogram of air_time right after it. Remove the data reference from your ggplot() code; that is, remove data = ord_not_cancelled from the base layer so it now looks like ggplot(mapping = aes(x = air_time)). Run the code. What happens?
flights %>%
filter(dest == "ORD", !is.na(air_time)) %>%
ggplot(aes(x = air_time)) +
  geom_histogram(color = "white", bins = 40)
Why might this be useful?
We don’t have to waste time saving intermediate datasets, and we can make minor adjustments without having to re-create those intermediate datasets.
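For example, swapping the histogram for a boxplot only requires changing the plotting layers; the filtering step stays the same and no intermediate dataset needs to be rebuilt:

```r
flights %>%
  filter(dest == "ORD", !is.na(air_time)) %>%
  ggplot(aes(y = air_time)) +
  geom_boxplot() +
  coord_flip()
```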