Data Visualization
Sections 2.4 - 2.6

Today’s goals

Create a linegraph
Create a histogram
Properly describe a linegraph and histogram
Facet graphs based on subgroups

Artwork by @allison_horst

5NG#2: Linegraphs

Linegraphs show the relationship between 2 numerical variables.

The explanatory (x-axis) variable should have a sequential ordering, such as time.

Linegraph syntax in R:

ggplot(data = my_data, mapping = aes(x = var1, y = var2)) +
  geom_line()
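
As a concrete illustration of the template above, here is a minimal sketch using the `economics` data frame that ships with ggplot2. The choice of dataset and the axis labels are assumptions made for this example, not something taken from the slides. The `date` column is sequentially ordered, which is what makes it a reasonable x-axis variable for a linegraph.

```r
library(ggplot2)

# `economics` ships with ggplot2 (monthly US economic data).
# `date` is sequential, so it works as the explanatory (x-axis) variable.
p_line <- ggplot(data = economics, mapping = aes(x = date, y = unemploy)) +
  geom_line() +
  labs(x = "Date", y = "Unemployed (thousands)")

p_line
```

Reading the result from left to right shows how the number of unemployed people changes over time.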

5NG#2: Linegraphs

When describing linegraphs…

Look for the overall pattern going from left to right.
Classify the association as positive, negative, or no association.
Classify the relationship as linear or nonlinear.
Check x and y scales to make sure they are appropriate.

5NG#3: Histograms

Histograms are used to visualize the distribution of a single numerical variable.

Histograms display numerical data by grouping data into bins of equal width.

There is no ‘y’ position aesthetic for geom_histogram() because we are investigating a single variable.

5NG#3: Histograms

Histogram syntax in R:

ggplot(data = my_data, mapping = aes(x = var1)) +
  geom_histogram(color = "white", fill = "blue", bins = 10)

There are 3 things we look for and describe when inspecting a histogram:

shape (skew and modality)
center (mean or median)
spread (range, IQR, or standard deviation)

Not all distributions have a simple recognizable shape!
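
The center and spread read off a histogram can be checked numerically. The sketch below assumes the `penguins` data used later in these slides comes from the palmerpenguins package, and it uses `na.rm = TRUE` because that dataset contains some missing values. Shape (skew and modality) still has to be judged from the plot itself.

```r
library(palmerpenguins)  # assumed source of the penguins data

mass <- penguins$body_mass_g

# Center: mean or median
center <- c(mean   = mean(mass, na.rm = TRUE),
            median = median(mass, na.rm = TRUE))

# Spread: range (max minus min), IQR, or standard deviation
spread <- c(range = diff(range(mass, na.rm = TRUE)),
            IQR   = IQR(mass, na.rm = TRUE),
            sd    = sd(mass, na.rm = TRUE))

center
spread
```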

5NG#3: Histograms

Example 1: Histogram bins

Which number of bins is most appropriate? Describe the distribution of penguin body mass.

Bins = 7
Bins = 18
Bins = 29

library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(color = "white",
                 fill = "lightblue",
                 bins = 7) +
  labs(title = "Palmer Penguins Distribution of Body Mass",
       x = "Body Mass (g)")

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(color = "white",
                 fill = "lightblue",
                 bins = 18) +
  labs(title = "Palmer Penguins Distribution of Body Mass",
       x = "Body Mass (g)")

ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(color = "white",
                 fill = "lightblue",
                 bins = 29) +
  labs(title = "Palmer Penguins Distribution of Body Mass",
       x = "Body Mass (g)")

Example 2: Histogram bins

Which number of bins is most appropriate? Describe the distribution of penguin flipper length.

Bins = 15
Bins = 25
Bins = 35

ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(color = "white",
                 fill = "tomato1",
                 bins = 15) +
  labs(title = "Palmer Penguins Distribution of Flipper Length",
       x = "Flipper Length (mm)")

ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(color = "white",
                 fill = "tomato1",
                 bins = 25) +
  labs(title = "Palmer Penguins Distribution of Flipper Length",
       x = "Flipper Length (mm)")

ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(color = "white",
                 fill = "tomato1",
                 bins = 35) +
  labs(title = "Palmer Penguins Distribution of Flipper Length",
       x = "Flipper Length (mm)")

Faceting

Faceting is used to make the same plot for different subgroups of the dataset.
This is useful for comparing the same variable across different subgroups in the dataset.
facet_wrap(~var) can be added on to ANY plot type (scatterplot, linegraph, histogram, boxplot, barplot)

ggplot(penguins, aes(x = flipper_length_mm)) +
    geom_histogram(color = "white", fill = "tomato1", bins = 11) +
    facet_wrap(~ species)
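
One optional variation on the faceted histogram above: `facet_wrap()` accepts an `ncol` argument, and setting `ncol = 1` stacks the panels vertically so that all three species share the same x-axis position. This is only a sketch of one layout choice, and it assumes `penguins` comes from the palmerpenguins package.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data

# Stack one panel per species in a single column,
# which makes it easier to compare centers and spreads by eye.
p_facet <- ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(color = "white", fill = "tomato1", bins = 11) +
  facet_wrap(~ species, ncol = 1)

p_facet
```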

Common Coding Errors

Which of the following are correct?

a)

ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(aes(color = "white"))

b)

ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(color = "white")

c)

ggplot(penguins) +
  geom_histogram(x = flipper_length_mm, color = "white")

d)

ggplot(penguins) +
  geom_histogram(aes(x = flipper_length_mm) , color = "white")
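
One way to explore the four options is to run them and compare what happens. The sketch below assumes `penguins` comes from the palmerpenguins package, and the object names (`p_a`, `p_b`, `p_d`, `result_c`) are made up for this example. The general idea is that variables from the data frame belong inside `aes()`, while fixed settings such as an outline color belong outside it.

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data

# (b) and (d): x is mapped inside aes(); color is a fixed setting outside aes().
p_b <- ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(color = "white")
p_d <- ggplot(penguins) +
  geom_histogram(aes(x = flipper_length_mm), color = "white")

# (a): runs, but "white" inside aes() is treated as a data value,
# so the outline gets a default scale color and a legend instead of white.
p_a <- ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(aes(color = "white"))

# (c): x is outside aes(), so R looks for a standalone object called
# flipper_length_mm and stops with an error; tryCatch() captures the message.
result_c <- tryCatch(
  ggplot(penguins) +
    geom_histogram(x = flipper_length_mm, color = "white"),
  error = function(e) conditionMessage(e)
)
result_c
```

Printing `p_b` and `p_d` should give the same histogram, which suggests that it does not matter whether the x mapping is placed in `ggplot()` or in the geom.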

Extra information and resources

Wikipedia bin size info

Helpful guidelines:

A larger number of observations generally calls for a larger number of bins.
You will generally need to test several different numbers of bins to learn about the data and find an appropriate value.
Sturges' rule of thumb for unimodal symmetric distributions: bins = 1 + 3.322 * log10(n), which is the same as 1 + log2(n); round up to a whole number.
Sturges' rule does not work well if the data are severely skewed or multimodal, or if there is an extremely large number of observations. It can still give you a starting point; from there, increase the number of bins until you can properly see the shape.
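
Sturges' rule is usually stated with a base-10 logarithm, where 3.322 is approximately 1 / log10(2), so the formula is equivalent to 1 + log2(n). Note that R's `log()` is the natural logarithm, so a direct translation should use `log10()` or `log2()` instead. The sketch below uses a hypothetical helper, `sturges_bins()`, written for this example, and compares it with base R's `grDevices::nclass.Sturges()`. The sample size `n = 100` is an arbitrary illustration.

```r
# Hypothetical helper (not from the slides): Sturges' rule, rounded up.
sturges_bins <- function(n) ceiling(1 + log2(n))

n <- 100  # arbitrary sample size for illustration

1 + 3.322 * log10(n)  # base-10 form of the rule of thumb, before rounding
sturges_bins(n)       # base-2 form, rounded up to a whole number of bins

# Base R ships the same rule; it takes the data vector rather than n.
grDevices::nclass.Sturges(seq_len(n))
```

The result could be passed as a starting value for `bins` in `geom_histogram()` and then adjusted by eye.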
