Illustrate standard deviation in histogram

Consider the following simple example:

# E. Musk in Grunheide 
set.seed(22032022) 

# generate random numbers 
randomNumbers <- rnorm(n = 1000, mean = 10, sd = 10)

# empirical sd 
sd(randomNumbers)
#> [1] 10.34369

# histogram 
hist(randomNumbers, probability = TRUE, main = "", breaks = 50)

# just for illustration purposes
###
# empirical density 
lines(density(randomNumbers), col = 'black', lwd = 2)
# theoretical density
curve(dnorm(x, mean = 10, sd = 10), col = "blue", lwd = 2, add = TRUE)
###

Created on 2022-03-22 by the reprex package (v2.0.1)

Question: Is there a nice way to illustrate the empirical standard deviation (sd) in the histogram by colour? For example, by drawing the inner bars in a different colour, or by marking the interval [mean +/- sd] on the x-axis?

Note: if ggplot2 offers an easy solution, suggestions along those lines would also be much appreciated.



Solution 1:[1]

This is a ggplot solution similar to Benson's answer, except that we precompute the histogram and draw it with geom_col. Each bar then gets exactly one fill colour, so we don't get any of the unwelcome stacking at the sd boundary:

# E. Musk in Grunheide 
set.seed(22032022) 

# generate random numbers 
randomNumbers <- rnorm(n=1000, mean=10, sd=10)

h <- hist(randomNumbers, breaks = 50, plot = FALSE)

lower <- mean(randomNumbers) - sd(randomNumbers)
upper <- mean(randomNumbers) + sd(randomNumbers)

df <- data.frame(x = h$mids, y = h$density, 
                 fill = h$mids > lower & h$mids < upper)

library(ggplot2)

ggplot(df) +
  geom_col(aes(x, y, fill = fill), width = 1, color = 'black') +
  geom_density(data = data.frame(x = randomNumbers), 
               aes(x = x, color = 'Actual density'),
               key_glyph = 'path') +
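  # normal curve based on the sample mean and sd (not the true parameters 10, 10)
  geom_function(fun = function(x) {
    dnorm(x, mean = mean(randomNumbers), sd = sd(randomNumbers)) },
    aes(color = 'Theoretical density')) +
  scale_fill_manual(values = c('TRUE' = '#FF374A', 'FALSE' = 'gray'), 
                    name = 'within 1 SD') +
  # named values, so the colours do not depend on the order of the labels
  scale_color_manual(values = c('Actual density' = 'black',
                                'Theoretical density' = 'blue'),
                     name = 'Density lines') +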
  labs(x = 'Value of random number', y = 'Density') +
  theme_minimal()
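
The stacking mentioned above appears when the fill is assigned per observation rather than per bin: a bin that straddles mean - sd or mean + sd holds values from both groups, and a histogram with a per-observation fill draws them as two stacked segments. The following base-R sketch is not part of the original answer, and the names bin, inside and mixed are illustrative. It counts how many bins are affected for this data set.

```r
set.seed(22032022)
randomNumbers <- rnorm(n = 1000, mean = 10, sd = 10)

lower <- mean(randomNumbers) - sd(randomNumbers)
upper <- mean(randomNumbers) + sd(randomNumbers)

# same bins as in the answer above
h <- hist(randomNumbers, breaks = 50, plot = FALSE)

# assign every observation to its bin (hist() uses right-closed intervals
# and includes the lowest break by default)
bin <- cut(randomNumbers, breaks = h$breaks, include.lowest = TRUE)

# per-observation classification, as a per-observation fill would do it
inside <- randomNumbers > lower & randomNumbers < upper

# a bin is "mixed" if it holds observations from both classes;
# empty bins yield NA and are ignored below
mixed <- tapply(inside, bin, function(z) length(unique(z)) > 1)

# only the bins containing lower or upper can be mixed, so at most two
sum(mixed, na.rm = TRUE)
```

Classifying by h$mids instead, as the answer does, assigns every bar to a single group.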


Solution 2:[2]

Here is a ggplot solution. First calculate the mean and the sd and store them in separate variables. Then use ifelse() to classify each value as "Within range" or "Outside range", and fill the two groups with different colours.

The blue line is the normal density from your question (mean 10, sd 10), scaled to counts, and the black line is the kernel density estimate of the data shown in the histogram.

library(ggplot2)

set.seed(22032022) 

# generate random numbers 
randomNumbers <- rnorm(n=1000, mean=10, sd=10)

randomNumbers_mean <- mean(randomNumbers)
randomNumbers_sd <- sd(randomNumbers)

ggplot(data.frame(randomNumbers = randomNumbers), aes(randomNumbers)) +
  geom_histogram(aes(
    fill = ifelse(
      randomNumbers > randomNumbers_mean + randomNumbers_sd |
        randomNumbers < randomNumbers_mean - randomNumbers_sd,
      "Outside range",
      "Within range"
    )
  ), 
  binwidth = 1, col = "gray") +
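  # after_stat(count) replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
  geom_density(aes(y = after_stat(count))) + 
  # scale the density to counts: n * binwidth = 1000 * 1
  stat_function(fun = function(x) dnorm(x, mean = 10, sd = 10) * 1000,
                color = "blue") +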
  labs(fill = "Data")

Created on 2022-03-22 by the reprex package (v2.0.1)
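
As a quick plausibility check on the "Within range" colouring, the share of values within one sd of the mean can be compared with what a normal distribution predicts, roughly 68.3%. This is a minimal sketch and not part of the original answer; isWithin, shareWithin and expected are illustrative names.

```r
set.seed(22032022)
randomNumbers <- rnorm(n = 1000, mean = 10, sd = 10)

randomNumbers_mean <- mean(randomNumbers)
randomNumbers_sd <- sd(randomNumbers)

# TRUE for values inside [mean - sd, mean + sd]
isWithin <- abs(randomNumbers - randomNumbers_mean) <= randomNumbers_sd

# empirical share of "Within range" values
shareWithin <- mean(isWithin)

# share predicted by a normal distribution: P(-1 < Z < 1), about 0.683
expected <- pnorm(1) - pnorm(-1)

c(empirical = shareWithin, normal = expected)
```

If the two numbers differ markedly, the data are probably far from normal, and a mean +/- sd band may be a less informative summary.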

Solution 3:[3]

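library(ggplot2)

# reuses randomNumbers from the question
mn <- mean(randomNumbers)
s  <- sd(randomNumbers)

data.frame(rand = randomNumbers,
           # three classes: below mean - sd, within one sd, above mean + sd
           cut = cut(randomNumbers, c(-Inf, mn - s, mn + s, Inf))) |>
  ggplot(aes(x = rand, fill = cut)) +
  geom_histogram()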
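
The question also asks about base graphics, which none of the answers above cover. The following is a minimal sketch of one possible approach, not taken from any of the original answers. It precomputes the bins, colours each bar by its midpoint, and marks the interval [mean - sd, mean + sd] on the x-axis. The names m, s, inside and barCols, as well as the colours, are illustrative choices.

```r
set.seed(22032022)
randomNumbers <- rnorm(n = 1000, mean = 10, sd = 10)

m <- mean(randomNumbers)
s <- sd(randomNumbers)

# precompute the bins so that each bar can be coloured by its midpoint
h <- hist(randomNumbers, breaks = 50, plot = FALSE)
inside <- h$mids > m - s & h$mids < m + s
barCols <- ifelse(inside, "tomato", "grey80")

# plotting a "histogram" object accepts one colour per bar
plot(h, freq = FALSE, col = barCols, main = "", xlab = "randomNumbers")

# mark the interval [mean - sd, mean + sd] along the x-axis
segments(x0 = m - s, y0 = 0, x1 = m + s, y1 = 0, col = "red", lwd = 4)

# dashed vertical lines at both ends of the interval
abline(v = c(m - s, m + s), lty = 2)
```

The lines() and curve() calls from the question can be added afterwards unchanged, because the plot is on the density scale (freq = FALSE).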


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2
Solution 3 Stefano Barbi