Importing Data & Tidy Data

Activity 07

Author

Solution

Overview

Activity 07 focuses on how to import and tidy data. It also offers additional review: in practice, you should almost always use data wrangling in conjunction with visualizations to explain your data.

Needed Packages

The following loads the packages needed for this activity. We assume that you have already installed them. Also note that we have suppressed the startup messages so the compiled HTML is less cluttered.

Make sure to add the necessary packages!!

# LOAD THE NECESSARY PACKAGES
library(tidyverse)

library(skimr)

Task 1

Import data

We will be exploring the mm_data.csv dataset stored in your data subdirectory.

The M&M data was collected by students in past introductory courses and workshops. The dataset contains the count of each color (blue, orange, green, yellow, brown, and red) in 303 bags of milk chocolate M&M’s, each weighing 1.69 oz.

# import data
mm_data <- read_csv("data/mm_data.csv")

Rows: 303 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (7): bag, blue, orange, green, yellow, brown, red

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
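The message above suggests either declaring the column types or setting `show_col_types = FALSE`. As a minimal sketch of the first option, the example below reads a two-row stand-in for the file from literal text, so it runs without `data/mm_data.csv`. The name `mm_sample` and the second data row are made up for illustration, and passing literal text through `I()` assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical stand-in for data/mm_data.csv, supplied as literal text via I().
# The first row matches bag 1 in the activity; the second row is made up.
csv_text <- I("bag,blue,orange,green,yellow,brown,red\n1,6,18,12,6,7,7\n2,15,11,9,1,8,9\n")

# Declaring every column as double means read_csv() does not have to guess
# the types, so it does not print the column specification message.
mm_sample <- read_csv(csv_text, col_types = cols(.default = col_double()))

mm_sample
```

With the real file, the same `col_types` argument could be added to the `read_csv("data/mm_data.csv")` call.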

Begin by inspecting mm_data. Hopefully it is clear that this is not a tidy dataset. Explain why it is not tidy.

[YOUR ANSWER HERE]

Task 2

Tidy mm_data. That is, use pivot_longer() to tidy up the dataset and store it in mm_data_tidy.

# tidy and store m&m data
mm_data_tidy <- mm_data |> 
  pivot_longer(cols = -bag,
               names_to = "color",
               values_to = "n")
# print mm_data_tidy
mm_data_tidy

# A tibble: 1,818 × 3
     bag color      n
   <dbl> <chr>  <dbl>
 1     1 blue       6
 2     1 orange    18
 3     1 green     12
 4     1 yellow     6
 5     1 brown      7
 6     1 red        7
 7     2 blue      15
 8     2 orange    11
 9     2 green      9
10     2 yellow     1
# ℹ 1,808 more rows
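One way to convince yourself that the pivot did what you expect is to check the row count and then pivot back. The sketch below does this on a small toy table (two bags, three colors, with values taken from the first rows printed above). The names `toy_wide`, `toy_long`, and `toy_back` are made up for illustration.

```r
library(tidyr)
library(tibble)

# Toy wide table: one row per bag, one column per color
toy_wide <- tibble(
  bag    = c(1, 2),
  blue   = c(6, 15),
  orange = c(18, 11),
  green  = c(12, 9)
)

# Tidy version: one row per bag-color pair
toy_long <- toy_wide |>
  pivot_longer(cols = -bag, names_to = "color", values_to = "n")

# The tidy table should have (number of bags) x (number of colors) rows
nrow(toy_long) == nrow(toy_wide) * (ncol(toy_wide) - 1)

# Pivoting back to wide should recover the original table
toy_back <- toy_long |>
  pivot_wider(names_from = color, values_from = n)

all.equal(toy_wide, toy_back)
```

The same row-count logic applies to the full dataset: 303 bags times 6 colors gives the 1,818 rows reported for mm_data_tidy.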

Task 3

Use the skim() function to determine the number of variables and observations, identify any missingness issues, and check for any unusual values that could cause problems in the analysis.

# skim dataset
skim(mm_data_tidy)

Data summary

Name                     mm_data_tidy
Number of rows           1818
Number of columns        3
_______________________
Column type frequency:
  character              1
  numeric                2
________________________
Group variables          None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
color                  0              1    3    6      0         6           0

Variable type: numeric

skim_variable  n_missing  complete_rate    mean     sd  p0  p25  p50  p75  p100  hist
bag                    0              1  152.00  87.49   1   76  152  228   303  ▇▇▇▇▇
n                      0              1    9.37   4.04   0    6    9   12    28  ▂▇▃▁▁

[YOUR ANSWER HERE]
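skim() reports missingness and ranges in one pass, but the same checks can also be written by hand with dplyr. The sketch below is one possible approach, not part of the original solution. It counts missing values per column and totals the candies in each bag, since an unusually small or large total can flag a data entry problem. The toy table and the names `na_counts` and `bag_totals` are made up, and one NA is planted deliberately so the check has something to find.

```r
library(dplyr)
library(tibble)

# Toy tidy table; the NA in the last row is planted on purpose
toy_long <- tibble(
  bag   = c(1, 1, 1, 2, 2, 2),
  color = c("blue", "orange", "green", "blue", "orange", "green"),
  n     = c(6, 18, 12, 15, 11, NA)
)

# Number of missing values in every column
na_counts <- toy_long |>
  summarize(across(everything(), \(x) sum(is.na(x))))

na_counts

# Total number of candies per bag, ignoring missing counts
bag_totals <- toy_long |>
  group_by(bag) |>
  summarize(total = sum(n, na.rm = TRUE))

bag_totals
```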

Task 4

Create a plot that shows the distribution of count by color.

# plot of mm color
ggplot(mm_data_tidy, aes(x = n, y = color)) +
  geom_boxplot()
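By default the boxplots are ordered alphabetically by color. As an optional variation (an assumption about what might read better, not a requirement of the task), the sketch below uses fct_reorder() from forcats to order the colors by their median count and adds axis labels. The toy data and the name `p` are made up for illustration.

```r
library(ggplot2)
library(forcats)
library(tibble)

# Toy tidy data: four made-up bags for each of three colors
toy_long <- tibble(
  color = rep(c("blue", "brown", "red"), each = 4),
  n     = c(10, 12, 13, 14,   5, 7, 8, 6,   7, 8, 9, 6)
)

# Order the colors by median count; the first factor level is drawn at the
# bottom of the y axis, so the color with the largest median ends up on top
p <- ggplot(toy_long, aes(x = n, y = fct_reorder(color, n, .fun = median))) +
  geom_boxplot() +
  labs(x = "Count per bag", y = "Color")

p
```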

Calculate the following summary statistics for the count of each color per bag of M&M’s: mean, standard deviation, median, IQR, minimum, and maximum. Ensure that the colors print in descending order of the mean. That is, the color with the largest mean should be at the top and the one with the smallest should be at the bottom.

# summary statistics for the number of each color in a bag,
# sorted so the color with the largest mean is listed first
mm_data_tidy |> 
  group_by(color) |> 
  summarize(mean = mean(n),
            sd = sd(n),
            median = median(n),
            iqr = IQR(n),
            min = min(n),
            max = max(n),
            count = n()) |> 
  arrange(desc(mean))

# A tibble: 6 × 8
  color   mean    sd median   iqr   min   max count
  <chr>  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <int>
1 blue   12.2   4.02     12   4.5     1    28   303
2 orange 11.8   3.51     12   5       0    23   303
3 green  10.6   3.63     11   5       1    20   303
4 red     7.37  2.75      7   3.5     1    16   303
5 brown   7.22  3.00      7   4       0    19   303
6 yellow  7.00  3.17      7   4       0    17   303

In a few sentences, describe what the summary statistics tell you about the distribution of the count of each color per bag of milk chocolate M&M’s in the sample. You should be able to describe what a “typical” bag of M&M’s from the sample would contain in terms of colors.

[YOUR ANSWER HERE]

Optional Challenges

You do not have to complete these.

Challenge 1

It would be useful to save mm_data_tidy so that the next time you import the data you won’t have to tidy it again. This is where the write_* functions come in. Like the read_* functions, there are many of them for writing out (exporting) data in various formats. Let’s use the write_csv() function. Run ?write_csv in the console to access the function’s documentation.

The first argument in the function should be the data you want to write out, mm_data_tidy.

The second argument is where to save the file and what to name it: "data/mm_tidy.csv" tells R to write the file to the data/ subdirectory and name it "mm_tidy.csv".

Try it out and verify that the file is now in your data/ subdirectory.
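The write-then-verify steps described above can be sketched as follows. To keep the example runnable anywhere, it writes a small toy table to a temporary directory instead of data/. The names `toy_tidy`, `out_path`, and `toy_reread` are made up for illustration.

```r
library(readr)
library(tibble)

# Toy stand-in for the tidy M&M data
toy_tidy <- tibble(
  bag   = c(1, 1),
  color = c("blue", "orange"),
  n     = c(6, 18)
)

# First argument: the data to write; second argument: the path and file name
out_path <- file.path(tempdir(), "mm_tidy.csv")
write_csv(toy_tidy, out_path)

# Verify the file exists, then read it back in
file.exists(out_path)
toy_reread <- read_csv(out_path, show_col_types = FALSE)

toy_reread
```

For the activity itself, the path would be "data/mm_tidy.csv" and the data would be the tidy M&M dataset from Task 2.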

# write data out