Data Visualization
Sections 2.4 - 2.6

Today’s goals


  1. Create a linegraph
  2. Create a histogram
  3. Properly describe a linegraph and histogram
  4. Facet graphs based on subgroups

Artwork by @allison_horst

5NG#2: Linegraphs

Linegraphs show the relationship between 2 numerical variables.


The explanatory (x-axis) variable must have a sequential ordering, such as time.


Linegraph syntax in R:

ggplot(data = my_data, mapping = aes(x = var1, y = var2)) +
  geom_line()
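
As a concrete illustration, the sketch below fills in the template with the economics dataset that ships with ggplot2. The dataset is our choice for this example and is not part of these slides. Its date variable is sequential, so it belongs on the x-axis.

```r
library(ggplot2)

# economics ships with ggplot2; date is a sequential (time) variable
line_plot <- ggplot(data = economics, mapping = aes(x = date, y = unemploy)) +
  geom_line() +
  labs(x = "Date", y = "Unemployed (thousands)")

line_plot
```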

When describing linegraphs…

  • Look for the pattern going from left to right.
  • Classify the association as positive, negative, or no association.
  • Classify the relationship as linear or nonlinear.
  • Check x and y scales to make sure they are appropriate.

5NG#3: Histograms

  • Histograms are used to visualize the distribution of a single numerical variable.
  • Histograms display numerical data by grouping data into bins of equal width.
  • There is no ‘y’ position aesthetic for geom_histogram() because we are investigating a single variable.
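
To make "bins of equal width" concrete, here is a minimal sketch in base R. The body masses are made-up values for illustration, and hist() is used only to count observations per bin. geom_histogram() may place its bin boundaries differently.

```r
# Hypothetical body masses in grams (made up for illustration)
body_mass <- c(3200, 3500, 3700, 3900, 4100, 4500, 6000)

# Three bins of equal width (1000 g each)
breaks <- seq(3000, 6000, by = 1000)

# plot = FALSE returns the bin counts without drawing anything
binned <- hist(body_mass, breaks = breaks, plot = FALSE)

binned$counts  # 4 2 1
```

The height of each histogram bar is the count of observations that fall in that bin.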

Histogram syntax in R:

ggplot(data = my_data, mapping = aes(x = var1)) +
  geom_histogram(color = "white", fill = "blue", bins = 10)


There are 3 things we look for and describe when inspecting a histogram:

  • shape (skew and modality)

  • center (mean or median)

  • spread (range, IQR, or standard deviation)

Not all distributions have a simple recognizable shape!
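
The center and spread measures listed above can be computed directly in R. This is a minimal sketch on made-up body masses (hypothetical values, not the penguins data).

```r
# Hypothetical body masses in grams (made up for illustration)
body_mass <- c(3200, 3500, 3700, 3900, 4100, 4500, 6000)

# Center: the mean is pulled toward extreme values, the median is not
center <- c(mean = mean(body_mass), median = median(body_mass))

# Spread: width of the range, interquartile range, standard deviation
spread <- c(range = diff(range(body_mass)),
            IQR   = IQR(body_mass),
            sd    = sd(body_mass))

center
spread
```

Here the mean (about 4129) sits above the median (3900) because the single large value of 6000 pulls it upward, which hints at a right skew.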

Example 1: Histogram bins

Which number of bins is most appropriate? Describe the distribution of penguin body mass.

library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data

# 7 bins
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(color = "white",
                 fill = "lightblue",
                 bins = 7) +
  labs(title = "Palmer Penguins Distribution of Body Mass",
       x = "Body Mass (g)")

# 18 bins
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(color = "white",
                 fill = "lightblue",
                 bins = 18) +
  labs(title = "Palmer Penguins Distribution of Body Mass",
       x = "Body Mass (g)")

# 29 bins
ggplot(penguins, aes(x = body_mass_g)) +
  geom_histogram(color = "white",
                 fill = "lightblue",
                 bins = 29) +
  labs(title = "Palmer Penguins Distribution of Body Mass",
       x = "Body Mass (g)")

Example 2: Histogram bins

Which number of bins is most appropriate? Describe the distribution of penguin flipper length.

ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(color = "white",
                 fill = "tomato1",
                 bins = 15) +
  labs(title = "Palmer Penguins Distribution of Flipper Length",
       x = "Flipper Length (mm)")

ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(color = "white",
                 fill = "tomato1",
                 bins = 25) +
  labs(title = "Palmer Penguins Distribution of Flipper Length",
       x = "Flipper Length (mm)")

ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(color = "white",
                 fill = "tomato1",
                 bins = 35) +
  labs(title = "Palmer Penguins Distribution of Flipper Length",
       x = "Flipper Length (mm)")

Faceting

  • Faceting is used to make the same plot for different subgroups of the dataset.

  • This is useful for comparing the same variable across different subgroups in the dataset.

  • facet_wrap(~var) can be added on to ANY plot type (scatterplot, linegraph, histogram, boxplot, barplot)


ggplot(penguins, aes(x = flipper_length_mm)) +
    geom_histogram(color = "white", fill = "tomato1", bins = 11) +
    facet_wrap(~ species)
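
A variation on the faceting idea, sketched with the mpg dataset that ships with ggplot2 (our choice, so it runs without extra packages). Setting ncol = 1 stacks the panels in a single column, so the histograms line up on the same x-axis and the distributions are easier to compare.

```r
library(ggplot2)

# mpg ships with ggplot2; drv is the drive type ("4", "f", "r")
facet_plot <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(color = "white", fill = "tomato1", bins = 11) +
  facet_wrap(~ drv, ncol = 1)  # one column, so the panels stack vertically

facet_plot
```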

Common Coding Errors

Which of the following are correct?

a)  
ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(aes(color = "white"))
b)  
ggplot(penguins, aes(x = flipper_length_mm)) +
  geom_histogram(color = "white")
c)  
ggplot(penguins) +
  geom_histogram(x = flipper_length_mm, color = "white")
d)  
ggplot(penguins) +
  geom_histogram(aes(x = flipper_length_mm) , color = "white")

Extra information and resources

Wikipedia bin size info

Helpful guidelines:

  • A larger number of observations generally calls for a larger number of bins.

  • You will generally need to try several different numbers of bins to learn about the data and find an appropriate value.

  • Sturges' rule of thumb for unimodal, symmetric distributions: bins = 1 + 3.322*log10(n), where n is the number of observations and log10 is the base-10 logarithm (equivalently, bins = 1 + log2(n)). Round up to a whole number.

  • Sturges' rule does not work well when the data are severely skewed or multimodal, or when the number of observations is extremely large. It can still give you a starting point; from there, increase the number of bins until you can properly see the shape.
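
A short sketch of Sturges' rule in R, reading the logarithm as base 10 (log10() in R). The helper name sturges_bins and the sample size n = 344 are chosen here for illustration. Note that log() in R is the natural logarithm, so 1 + 3.322*log(n) would give about 20.4 for this n instead of the intended value.

```r
# Sturges' rule: 1 + log2(n), rounded up to a whole number of bins
sturges_bins <- function(n) {
  ceiling(1 + log2(n))
}

n <- 344  # sample size chosen for illustration

sturges_bins(n)                # 10
ceiling(1 + 3.322 * log10(n))  # 10, the same rule written with log10
nclass.Sturges(seq_len(n))     # 10, base R's built-in version
```

Treat the result as a starting value for bins and adjust it after looking at the plot.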