Tangible data of normal distributed data

A classical example for a normally distributed variable is height. However, I kept on looking for data as to the mean and sd for some populations, such as Germany. Now I found some reliably looking data here.

We will not question whether the assumption of normality holds, we just assume it.

In the source, we can read that in Germany, the adult men population has the following parameters:

mean: 174cm
sd: 7cm

For women:

mean: 163cm
sd: 7cm

Let’s play around with these numbers and plug them into the normal formula. To keep things straight, we will enjoy the comfort of the xpnorm and xqnorm functions provided by the R package mosaic. We can plug 174 right into xpnorm. The function is an extended version (hence x) of pnorm function, which gives back the percentile or probability (hence p) of some value which is distributed normally (hence norm).

Say, for instance, we wonder how many men are taller than some one (male) of height 174cm:

library(mosaic)
## Loading required package: dplyr
## Warning: package 'dplyr' was built under R version 3.5.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: lattice
## Loading required package: ggformula
## Loading required package: ggplot2
## 
## New to ggformula?  Try the tutorials: 
##  learnr::run_tutorial("introduction", package = "ggformula")
##  learnr::run_tutorial("refining", package = "ggformula")
## Loading required package: mosaicData
## Loading required package: Matrix
## 
## The 'mosaic' package masks several functions from core packages in order to add 
## additional features.  The original behavior of these functions should not be affected by this.
## 
## Note: If you use the Matrix package, be sure to load it BEFORE loading mosaic.
## 
## Attaching package: 'mosaic'
## The following object is masked from 'package:Matrix':
## 
##     mean
## The following object is masked from 'package:ggplot2':
## 
##     stat
## The following objects are masked from 'package:dplyr':
## 
##     count, do, tally
## The following objects are masked from 'package:stats':
## 
##     binom.test, cor, cor.test, cov, fivenum, IQR, median,
##     prop.test, quantile, sd, t.test, var
## The following objects are masked from 'package:base':
## 
##     max, mean, min, prod, range, sample, sum
xpnorm(q = 174, 
       mean = 174,
       sd = 7)
## 
## If X ~ N(174, 7), then
##  P(X <= 174) = P(Z <= 0) = 0.5
##  P(X >  174) = P(Z >  0) = 0.5
## 

## [1] 0.5

Simple, right? The z-value as indicated in the figure is the answer to the question: “Relative to the SD \(sd\), how far are you (\(x_i\)) from the mean value \(\mu\)?”. Mathy: \(z = (x_i-\mu) / sd\). In this case: \(z = (174 - 174) / 7 = 0\). This point falls right on the mean, zero distance from the mean, hence \(z=0\).

As the normal distribution is symmetric, 50% is left of the mean, and 50% right.

Let’s say we would like to know how many women are greater than the average man:

xpnorm(q = 174,
       mean = 163,
       sd = 7)
## 
## If X ~ N(163, 7), then
##  P(X <= 174) = P(Z <= 1.571) = 0.942
##  P(X >  174) = P(Z >  1.571) = 0.05804
## 

## [1] 0.9419584

As the figure tells us, about 6% of the women (values) are greater than 174cm.

Similarly, suppose we find a man of average female size and wonder how many of the men are greater than this guy.

xpnorm(q = 163,
       mean = 174,
       sd = 7)
## 
## If X ~ N(174, 7), then
##  P(X <= 163) = P(Z <= -1.571) = 0.05804
##  P(X >  163) = P(Z >  -1.571) = 0.942
## 

## [1] 0.05804157

About 94%.

We can also go the other way round: How tall must a man be in order to measure up to notorious “top 5 per cent”? Now we plug in a percentage or percentile, and we want back a value, also called a quantile. A quantile is the value of the variable of interest for some given percentile. See:

xqnorm(p = .95,
       mean = 174,
       sd = 7)
## 
## If X ~ N(174, 7), then
##  P(X <= 185.514) = 0.95
##  P(X >  185.514) = 0.05
## 

## [1] 185.514

About 186cm. All guys of this height and the taller ones are part of the 5% tallest men (of this population, given the assumptions hold). Note that the percentile counts from “left” to the “right”. So we have to indicate the “red” area of the figure (the left part).

If a women is of height 186, what’s her percentile?

xpnorm(186, 174, 7)
## 
## If X ~ N(174, 7), then
##  P(X <= 186) = P(Z <= 1.714) = 0.9568
##  P(X >  186) = P(Z >  1.714) = 0.04324
## 

## [1] 0.9567619

Around 96%; 4% are taller. Note taht we do not need to name the arguments of a R function given we stick to the standard order of the arguments (what we did here).