The focus of Activity 07 is how to import and tidy data. This activity also offers additional review: in practice, you should almost always use data wrangling in conjunction with visualizations to explain your data.
Needed Packages
The following loads the packages needed for this activity. We assume that you have already installed them. Also note that we have suppressed the messages so the compiled HTML is less cluttered.
Make sure to add the necessary packages!!
# LOAD THE NECESSARY PACKAGES
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.2.3
Warning: package 'ggplot2' was built under R version 4.2.3
Warning: package 'tibble' was built under R version 4.2.3
Warning: package 'tidyr' was built under R version 4.2.3
Warning: package 'readr' was built under R version 4.2.3
Warning: package 'purrr' was built under R version 4.2.3
Warning: package 'dplyr' was built under R version 4.2.3
Warning: package 'forcats' was built under R version 4.2.3
Warning: package 'lubridate' was built under R version 4.2.3
library(skimr)
Task 1
Import data
We will be exploring the mm_data.csv dataset stored in your data subdirectory.
The M&M data was collected by students in past introductory courses and workshops. The dataset contains the count for each color (blue, orange, green, yellow, brown, and red) for 303 1.69 oz bags of milk chocolate M&M’s.
Rows: 303 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (7): bag, blue, orange, green, yellow, brown, red
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
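The import chunk itself is not shown in the compiled output. A minimal sketch of what it might look like is given here, assuming the file lives at data/mm_data.csv. To keep the example self-contained, it writes a tiny made-up stand-in file (the name mm_demo.csv and its values are hypothetical) to a temporary directory and reads that instead.

```r
library(readr)

# Hypothetical stand-in for data/mm_data.csv: two made-up bags with the
# same seven columns listed in the column specification message.
demo_path <- file.path(tempdir(), "mm_demo.csv")
writeLines(
  c(
    "bag,blue,orange,green,yellow,brown,red",
    "1,6,18,12,6,7,7",
    "2,15,11,9,1,8,10"
  ),
  demo_path
)

# show_col_types = FALSE quiets the column specification message
mm_demo <- read_csv(demo_path, show_col_types = FALSE)
mm_demo

# In the activity, the analogous call would presumably be:
# mm_data <- read_csv("data/mm_data.csv")
```

Leaving out show_col_types = FALSE prints the column specification, which is a quick way to confirm that every column was read as a number.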
Begin by inspecting mm_data. Hopefully it is clear that this is not a tidy dataset. Explain why it is not tidy.
[YOUR ANSWER HERE]
Task 2
Tidy mm_data. That is, use pivot_longer() to tidy up the dataset and store it in mm_data_tidy.
# tidy and store m&m data
mm_data_tidy <- mm_data |>
  pivot_longer(
    cols = -bag,
    names_to = "color",
    values_to = "n"
  )
# print mm_data_tidy
mm_data_tidy
# A tibble: 1,818 × 3
bag color n
<dbl> <chr> <dbl>
1 1 blue 6
2 1 orange 18
3 1 green 12
4 1 yellow 6
5 1 brown 7
6 1 red 7
7 2 blue 15
8 2 orange 11
9 2 green 9
10 2 yellow 1
# ℹ 1,808 more rows
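The tidy data has 1,818 rows because each of the 303 bags contributes one row per color (303 × 6). The following sketch illustrates the same reshaping on a small made-up table; the names wide_demo and long_demo and the values are hypothetical, chosen only for illustration.

```r
library(tibble)
library(tidyr)

# Made-up wide data: one row per bag, one column per color
wide_demo <- tibble(
  bag    = c(1, 2),
  blue   = c(6, 15),
  orange = c(18, 11),
  green  = c(12, 9)
)

# Every column except bag is stacked into color/n pairs
long_demo <- wide_demo |>
  pivot_longer(cols = -bag, names_to = "color", values_to = "n")

long_demo

# Rows in the long form should equal bags times colors
nrow(long_demo) == nrow(wide_demo) * (ncol(wide_demo) - 1)
```

Checking the row count this way is a cheap sanity test that no column was accidentally left out of, or included in, the pivot.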
Task 3
Use the skim() function to determine the number of variables and observations, identify any missingness issues, and check for any potentially weird values that could cause issues in the analysis.
# skim dataset
skim(mm_data_tidy)
Data summary
Name mm_data_tidy
Number of rows 1818
Number of columns 3
_______________________
Column type frequency:
character 1
numeric 2
________________________
Group variables None
Variable type: character
skim_variable n_missing complete_rate min max empty n_unique whitespace
color 0 1 3 6 0 6 0
Variable type: numeric
skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bag 0 1 152.00 87.49 1 76 152 228 303 ▇▇▇▇▇
n 0 1 9.37 4.04 0 6 9 12 28 ▂▇▃▁▁
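As a cross-check on what skim() reports, missingness and implausible counts can also be examined directly. This is only a sketch on made-up data (check_demo is a hypothetical tibble with one deliberately missing count), not part of the activity's required code.

```r
library(dplyr)
library(tibble)

# Made-up long-format data with one deliberately missing count
check_demo <- tibble(
  bag   = c(1, 1, 2, 2),
  color = c("blue", "red", "blue", "red"),
  n     = c(6, 7, NA, 10)
)

# Number of missing values in each column
missing_counts <- check_demo |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# Rows whose count is negative or not a whole number
# (rows with a missing n are dropped by filter())
odd_rows <- check_demo |>
  filter(n < 0 | n != round(n))

missing_counts
odd_rows
```

For a candy count, values that are negative or fractional would be "weird"; other datasets would need different checks.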
[YOUR ANSWER HERE]
Task 4
Create a plot that shows the distribution of count by color.
# plot of mm color
ggplot(mm_data_tidy, aes(x = n, y = color)) +
  geom_boxplot()
Calculate the following summary statistics for the count of each color per bag of M&M’s: mean, standard deviation, median, IQR, minimum, and maximum. Ensure that the colors print in descending order of the mean. That is, the color with the largest mean should be at the top and the one with the smallest should be at the bottom.
# summary statistics for the number of each color in a bag,
# sorted from largest to smallest mean
mm_data_tidy |>
  group_by(color) |>
  summarize(
    mean = mean(n),
    sd = sd(n),
    median = median(n),
    iqr = IQR(n),
    min = min(n),
    max = max(n),
    count = n()
  ) |>
  arrange(desc(mean))
# A tibble: 6 × 8
color mean sd median iqr min max count
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 blue 12.2 4.02 12 4.5 1 28 303
2 orange 11.8 3.51 12 5 0 23 303
3 green 10.6 3.63 11 5 1 20 303
4 red 7.37 2.75 7 3.5 1 16 303
5 brown 7.22 3.00 7 4 0 19 303
6 yellow 7.00 3.17 7 4 0 17 303
In a few sentences, describe what the summary statistics tell you about the distribution of the count of each color per bag of milk chocolate M&M’s in the sample. You should be able to describe what a “typical” bag of M&M’s from the sample would contain in terms of colors.
[YOUR ANSWER HERE]
Optional Challenges
You do not have to complete these.
Challenge 1
It would be useful to save mm_data_tidy so that the next time you import the data you won’t have to tidy it again. This is where the write_* functions come in. As with the read_* functions, there are many of them for writing out (exporting) data in various formats. Let’s use the write_csv() function. Run ?write_csv() in the console to access the function’s documentation.
The first argument in the function should be the data you want to write out, mm_data_tidy.
The second argument you give it is where to save it and its file name, "data/mm_tidy.csv" which is telling R to write it out to the data/ subdirectory and name the file "mm_tidy.csv".
Try it out and verify that it is now in your data/ subdirectory.
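The two arguments described above can be sketched as follows. To stay self-contained, the example uses a made-up tibble (tidy_demo) and writes to a temporary directory; both names are hypothetical, and in the activity the path would be "data/mm_tidy.csv".

```r
library(readr)
library(tibble)

# Made-up stand-in for the tidied data
tidy_demo <- tibble(
  bag   = c(1, 1, 2, 2),
  color = c("blue", "red", "blue", "red"),
  n     = c(6, 7, 15, 10)
)

# Hypothetical output location; the activity would use "data/mm_tidy.csv"
out_path <- file.path(tempdir(), "mm_tidy_demo.csv")

# First argument: the data; second argument: where to save it
write_csv(tidy_demo, out_path)

# Verify that the file exists and reads back with the same shape
file.exists(out_path)
tidy_back <- read_csv(out_path, show_col_types = FALSE)
dim(tidy_back)
```

Reading the file back in is one way to verify the export beyond simply looking for it in the file browser.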