BASIC STATISTICS

Descriptive statistics summarize or describe a set of data. This page covers the basic definitions (mean, median, mode, etc.), standard deviation, the normal distribution, and conversion to percentiles.

Descriptive Statistics

Most statistics is either descriptive or inductive (inferential) analysis. Descriptive statistics are calculations based on the data that describe or summarize that data, for example the mean (arithmetic average). It is assumed that all the data in the sample are related (e.g. height of 20 year olds, maximum day temperature in Sydney for March, etc). Continuous data (e.g. tonnes per hour of coal on a conveyor) are more often used in a trend analysis than in descriptive statistics. Trend analysis is usually used to pick up whether the readings are in danger of drifting out of spec, so it only works if the readings are in consecutive order.

So for Descriptive Statistics, the data is a set of readings or measurements taken on a group of related items in no particular order - like a large barrel of parts produced in one batch (but we don't know the order they were made).

Summary


Basic Terms

Examples below are based on this small set of observations/readings/samples/values: {34, 27, 45, 55, 22, 27}

The set can also be called the population/group/sample/data.

Description | Formula | Example | Excel
The ith value | xi | x3 = 45 | =A3
Count = number of values | n | 6 | count(A1:A6)
Maximum = highest value | xmax | 55 | max(A1:A6)
Minimum = lowest value | xmin | 22 | min(A1:A6)
Range = maximum - minimum | xmax - xmin | 55 - 22 = 33 | -
Mean = sum of all values / count. Common symbols for the mean are x̄ (sample) and μ (population); M is also used on this page. | Σxi / n | (34 + 27 + 45 + 55 + 22 + 27) / 6 = 35 | average(A1:A6)
Median = middle number when listed in order (or average of the middle 2) | - | 22, 27, 27, 34, 45, 55 gives (27 + 34) / 2 = 30.5 | median(A1:A6)
Mode = most frequent value, or range of values (frequency diagram) | - | mode(22, 27, 27, 34, 45, 55) = 27 | mode(A1:A6)
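The same quantities can be checked outside a spreadsheet. The following is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the original page.

```python
import statistics

# The small data set used throughout the Basic Terms table
values = [34, 27, 45, 55, 22, 27]

count = len(values)                    # n
maximum = max(values)                  # xmax
minimum = min(values)                  # xmin
value_range = maximum - minimum        # xmax - xmin
mean = statistics.mean(values)         # sum of all values / count
median = statistics.median(values)     # average of the middle two, since n is even
mode = statistics.mode(values)         # most frequent value

print("count ", count)
print("max   ", maximum)
print("min   ", minimum)
print("range ", value_range)
print("mean  ", mean)
print("median", median)
print("mode  ", mode)
```

Note that the mean (35) and the median (30.5) differ here because the two large values, 45 and 55, pull the mean upward.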

Histograms (Frequency Distribution)

A histogram is a graph of frequencies shown as bars. Each bar or "bin" is a certain range. These intervals (or bands, or bins) are generally of the same size, adjacent and non-overlapping.

The choice of bin size is important. A histogram needs at least 20 or 30 measurements, or the bars will be too crude to make sense. Ideally, a large number of measurements allows the bins to be fairly small (narrow), which gives a smoother and more accurate distribution. With fewer measurements the bins must be larger, otherwise every bin might hold just 0, 1 or 2 values!

Example

Heights of 31 Black Cherry trees. For practical reasons we chose this relatively small sample.
Max = 87, min = 63, range = 24, average = 76, standard deviation = 6.268 feet (population formula)
If this sample is reliable (which it probably isn't) and the heights are roughly normally distributed, we would expect about 68% of Black Cherry trees to be 76 ft +/- 6.3 ft tall.

31 measurements of the Height (ft) of Black Cherry Trees
70   65   63   72   81   83   66   75   80   75   79   76   76   69   75   74   85   86   71   64   78   80   74   72   77   81   82   80   80   80   87  
  1. It is usually easier to list the values in order first.
  2. Split the range (87 - 63 = 24) into convenient bins (5 feet each).
  3. Count the number of trees with a height within each bin (greater than the lower limit, up to and including the upper limit).
   >60 to 65 =  3
   >65 to 70 =  3
   >70 to 75 =  8
   >75 to 80 =  10
   >80 to 85 =  5
   >85 to 90 =  2
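The counting in step 3 can be sketched in a few lines of Python. This is an illustration only; the function name `bin_counts` and the choice of 60 and 90 as the outer edges are assumptions made to reproduce the bins listed above.

```python
# Heights (ft) of the 31 Black Cherry trees listed above
heights = [70, 65, 63, 72, 81, 83, 66, 75, 80, 75, 79, 76, 76, 69, 75, 74,
           85, 86, 71, 64, 78, 80, 74, 72, 77, 81, 82, 80, 80, 80, 87]


def bin_counts(data, low, high, width):
    """Count the values falling in each bin (lower, upper].

    A value belongs to a bin if it is greater than the lower limit
    and less than or equal to the upper limit.
    """
    counts = {}
    for lower in range(low, high, width):
        upper = lower + width
        counts[(lower, upper)] = sum(1 for x in data if lower < x <= upper)
    return counts


counts = bin_counts(heights, low=60, high=90, width=5)

# Print a crude text histogram, one '#' per tree
for (lower, upper), n in counts.items():
    print(f">{lower} to {upper} = {n:2d} " + "#" * n)
```

Making each bin closed on the upper side means a height of exactly 75 ft is counted in ">70 to 75" and not in ">75 to 80", so no tree is counted twice.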


Standard Deviation

Standard Deviation is a measure of the spread or dispersion of the values. It tells you how tightly the values in a set of data are clustered around the mean.

The calculation of standard deviation is actually the root mean square (RMS) of the deviation of the values from the mean.

This can be calculated either for the whole population (population standard deviation), or just a sample (sample standard deviation). Sample SD is commonly used when there are too many items to measure, such as a small sample from a large batch of parts. It is also suitable for polls, market research and experiments where it is not possible to measure the whole population. Sample SD is the default.  

Standard Deviation (Sample)

$$ S = \sqrt{\frac{\sum (X - M)^2}{N - 1}} $$

Excel:
stdev()

Standard Deviation (Population)

$$ \sigma = \sqrt{\frac{\sum (X - M)^2}{N}} $$

Excel:
stdevp()

      where Σ = Sum of
              X = Individual value
              M = Mean of all values
              N = Sample size (Number of values)


Common symbols for Standard Deviation are: SD, S or σ.


Variance is the square of Standard Deviation.
Variance = S²

Examples

Standard Deviation (Population)

Example: To find the Standard deviation of {34, 27, 45, 55, 22, 27}.
1. Calculate the mean
2. Find deviation from mean.
3. Square the deviation
4. Total these squares
5. Divide by N
6. Take square root

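X | M | X-M | (X-M)²
34 | 35 | -1 | 1
27 | 35 | -8 | 64
45 | 35 | 10 | 100
55 | 35 | 20 | 400
22 | 35 | -13 | 169
27 | 35 | -8 | 64
Total of squares = 798
Divide by N: 798 / 6 = 133
Square root: √133 = 11.53
The population SD is the right choice when these six values are the whole population.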


Standard Deviation (Sample)

Example: To find the Standard deviation of {34, 27, 45, 55, 22, 27}.
1. Calculate the mean
2. Find deviation from mean.
3. Square the deviation
4. Total these squares
5. Divide by (n-1)
6. Take square root

X | M | X-M | (X-M)²
34 | 35 | -1 | 1
27 | 35 | -8 | 64
45 | 35 | 10 | 100
55 | 35 | 20 | 400
22 | 35 | -13 | 169
27 | 35 | -8 | 64
Total of squares = 798
Divide by n-1: 798 / 5 = 159.6
Square root: √159.6 = 12.63
The sample SD is the right choice when these values are 6 readings from a batch, and we are trying to get a rough idea about the whole batch.
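The six steps used in both worked examples can be sketched as one Python function, where only the divisor in step 5 changes. The function name and the `sample` flag are illustrative assumptions; the results are cross-checked against `statistics.pstdev` and `statistics.stdev` from the standard library.

```python
import math
import statistics

values = [34, 27, 45, 55, 22, 27]


def standard_deviation(data, sample=True):
    """Standard deviation following the six steps of the worked examples."""
    mean = sum(data) / len(data)                        # 1. calculate the mean
    deviations = [x - mean for x in data]               # 2. deviation from the mean
    squares = [d ** 2 for d in deviations]              # 3. square each deviation
    total = sum(squares)                                # 4. total the squares
    divisor = len(data) - 1 if sample else len(data)    # 5. divide by n-1 or N
    return math.sqrt(total / divisor)                   # 6. take the square root


population_sd = standard_deviation(values, sample=False)
sample_sd = standard_deviation(values, sample=True)

print(f"population SD = {population_sd:.2f}")
print(f"sample SD     = {sample_sd:.2f}")

# Cross-check against the standard library implementations
assert math.isclose(population_sd, statistics.pstdev(values))
assert math.isclose(sample_sd, statistics.stdev(values))
```

The sample SD is always the larger of the two, because dividing by n-1 compensates for a small sample tending to understate the spread of the batch it came from.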


Normal (Gaussian) Distribution


A perfectly smooth histogram is usually just called a frequency distribution, or simply a distribution curve. It can either be generated by a very large number of measurements, or by approximating (smoothing) out a rougher histogram. 



The graph above is called the Normal Distribution (or Gaussian distribution). It is perfectly balanced - where the mean is exactly in the middle (median) and it is also the highest or most common value (mode). The curve has a specific bell shape that might be wider (more spread out) or narrower (closer to the mean). This type of frequency distribution (or very similar) occurs often in real life measurements of large populations. e.g. measurement of stature.

One standard deviation σ is a specific distance from the mean μ. Including every value within 1σ of the mean (μ ± σ, where μ is the arithmetic mean) covers about 68.3% of the population. About 95% of the values are within two standard deviations (μ ± 2σ), and about 99.7% lie within three standard deviations (μ ± 3σ). So common-ness is measured in standard deviations. The percentage of values within n standard deviations of the mean is given by:

$$ \text{percent within } \mu \pm n\sigma = \operatorname{erf}\left(\frac{n}{\sqrt{2}}\right) \times 100\% $$

The related expression erf(n / √2) × 50% + 50% gives something different: the percentile of a value that lies n standard deviations above the mean.
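The 68 / 95 / 99.7 figures can be reproduced with the error function in Python's `math` module. This is a small sketch; the function name `fraction_within` is an illustrative assumption.

```python
import math


def fraction_within(n_sigma):
    """Fraction of a normal population lying within +/- n_sigma of the mean."""
    return math.erf(n_sigma / math.sqrt(2))


for n in (1, 2, 3):
    print(f"within ±{n}σ: {fraction_within(n):.2%}")
```

The result depends only on the number of standard deviations, not on the particular values of μ and σ.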

Z - Score

A Z-score is how many standard deviations a particular score is from the mean.

$$ z = \frac{x - \mu}{\sigma} $$

So a z-score of 1 is 1σ above the average μ, and a z-score of -1 is 1σ below it.
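As a small illustration, the z-score is a one-line calculation. The function name is an assumption, and the numbers are borrowed from the Black Cherry example earlier on this page (mean 76 ft, standard deviation 6.268 ft).

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


mean_height = 76.0   # ft, from the Black Cherry example
sd_height = 6.268    # ft, from the Black Cherry example

tallest = z_score(87, mean_height, sd_height)    # tallest tree in the sample
shortest = z_score(63, mean_height, sd_height)   # shortest tree in the sample

print(f"87 ft tree: z = {tallest:+.2f}")
print(f"63 ft tree: z = {shortest:+.2f}")
```

A tree of exactly average height has a z-score of 0; the sign shows which side of the mean a value is on.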

Conversion to Percentile

A percentile is the value of a variable below which a certain percent of observations fall. So the 20th percentile is the value (or score) below which 20 percent of the observations may be found.
The 25th percentile is also known as the first quartile (Q1); the 50th percentile as the median or second quartile (Q2); the 75th percentile as the third quartile (Q3).

In a normal distribution the average (mean) is the 50th percentile, and so are the median and the mode.
 

Cumulative Probability

This table shows 310 intervals (pink area to the right of the mean), which are the cumulative probability between the mean and the z-score.
For positive z-scores, Percentile = 0.5 + p(z)
For negative z-scores, Percentile = 0.5 - p(z), where p(z) is looked up using the size of z without its sign
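Where a printed table is not to hand, the table value p(z) can be computed directly. The sketch below assumes the standard identity p(z) = 0.5 × erf(|z| / √2) for the area between the mean and z; the function names are illustrative.

```python
import math


def area_mean_to_z(z):
    """p(z): area under the standard normal curve between the mean and z."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2))


def percentile_from_z(z):
    """Convert a z-score to a percentile expressed as a fraction from 0 to 1."""
    p = area_mean_to_z(z)
    return 0.5 + p if z >= 0 else 0.5 - p


for z in (-2.22, -1.0, 0.0, 1.0, 2.22):
    print(f"z = {z:+.2f}  p(z) = {area_mean_to_z(z):.4f}  "
          f"percentile = {percentile_from_z(z):.2%}")
```

Because the curve is symmetric, the percentiles for z and -z always add up to 100%.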





Example:
In a normal distribution of weights of filled cement bags, the sample average (mean) is 20.04 kg and the sample standard deviation is 18 g. If the bags are sold as 20 kg, what percentage of bags are expected to be underweight?


μ = 20040 g, σ = 18 g
A bag is underweight if it is more than (20040 g - 20000 g) = 40 g below the mean.
Number of σ's = 40 / 18 = 2.222
This is a negative z-score (z = -2.222) because the limit is 2.222σ BELOW the mean.
Reading from the table above, to 2 decimal places, p(2.22) = 48.68%
So percentile = 50% - 48.68% = 1.32%
This says 1.32% of bags will be underweight, and 98.68% will be at or above the stated 20 kg.
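The cement bag example can also be checked without a table. This sketch uses `NormalDist` from Python's standard `statistics` module, working in grams (20.04 kg = 20040 g); the variable names are illustrative.

```python
from statistics import NormalDist

# Bag weights in grams: mean 20.04 kg = 20040 g, standard deviation 18 g
bags = NormalDist(mu=20040, sigma=18)
stated_weight = 20000   # bags are sold as 20 kg

# z-score of the stated weight relative to the bag weight distribution
z = (stated_weight - bags.mean) / bags.stdev

# Fraction of bags expected to weigh less than the stated weight
underweight = bags.cdf(stated_weight)

print(f"z = {z:.3f}")
print(f"underweight = {underweight:.2%}")
```

The computed value comes out at about 1.31%, slightly below the 1.32% read from the table, because the table lookup rounds z to 2.22 while the calculation keeps 2.222.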



Normal / Percentile Calculator



[Interactive calculator: inputs Mean μ, Standard deviation σ and Probability; output x]

Based on http://bayes.bgsu.edu/nsf_web/jscript/normal_cdf/normal_icdf.htm
Using this calculator, enter the top 3 values then click "Compute x" button to find the value x.
Percentile is the probability that gives the x you are looking for.

Example (Cement bags)
μ = 20040, σ = 18, Probability = ?, x = 20000
Keep entering Probability (0 = 0% to 1 = 100%) until you get an x of 20000
Solution: Probability = 0.013171 = 1.3171%
This says 1.32% of bags will be underweight, and 98.68% will be overweight.


This calculator does not quite match the Excel function NORMSDIST(), which gives 1.320933881% for z = -2.22, but when rounded to the accuracy of the table (2 decimal places) the results agree.
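The calculator's job, finding x from a mean, a standard deviation and a probability, is the inverse of the cumulative distribution. The sketch below uses `NormalDist.inv_cdf` from Python's standard library as a stand-in; it is not the code behind the calculator above, and the variable names are illustrative.

```python
from statistics import NormalDist

# Cement bag weights in grams (20.04 kg = 20040 g)
bags = NormalDist(mu=20040, sigma=18)

# Forward direction, like the calculator: probability in, x out
x = bags.inv_cdf(0.0132)
print(f"x for probability 0.0132 = {x:.1f} g")

# Reverse direction: x in, probability out, so no trial and error is needed
probability = bags.cdf(20000)
print(f"probability for x = 20000 g: {probability:.6f}")

# Round trip: feeding that probability back in returns the original x
x_back = bags.inv_cdf(probability)
print(f"round trip x = {x_back:.1f} g")
```

Having both directions available removes the need to keep re-entering probabilities until x comes out right.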

Six Sigma 

Six Sigma is a quality improvement strategy.
Sigma means Standard Deviation, so Six Sigma is 6 standard deviations from the mean. Well, sort of. The term "six sigma process" comes from the notion that if one has six standard deviations between the process mean and the nearest specification limit, as shown in the graphic, there will be practically no items that fail to meet specifications. (LSL = Lower Spec. Limit, USL = Upper...)



Using the Normal (Gaussian) distribution, calculation of 6 standard deviations from the mean actually gives only 1 part in 507 million outside the limits!
zσ | percentage within CI | percentage outside CI | ratio outside CI
1σ | 68.2689492% | 31.7310508% | 1 / 3.1514871
1.645σ | 90% | 10% | 1 / 10
1.960σ | 95% | 5% | 1 / 20
2σ | 95.4499736% | 4.5500264% | 1 / 21.977894
2.576σ | 99% | 1% | 1 / 100
3σ | 99.7300204% | 0.2699796% | 1 / 370.398
3.2906σ | 99.9% | 0.1% | 1 / 1000
4σ | 99.993666% | 0.006334% | 1 / 15,788
5σ | 99.9999426697% | 0.0000573303% | 1 / 1,744,278
6σ | 99.9999998027% | 0.0000001973% | 1 / 506,800,000
7σ | 99.9999999997440% | 0.0000000002560% | 1 / 390,600,000,000
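The whole-number rows of this table can be regenerated from the complementary error function. This is a sketch under the assumption that "outside" means both tails of a centered normal distribution; the function name is illustrative. `math.erfc` is used instead of `1 - math.erf` so that the tiny tail values at 6σ and 7σ are not lost to rounding.

```python
import math


def fraction_outside(z):
    """Fraction of a normal population lying beyond +/- z standard deviations."""
    return math.erfc(z / math.sqrt(2))


print("zσ | percentage within | percentage outside | ratio outside")
for z in range(1, 8):
    outside = fraction_outside(z)
    within = 1 - outside
    print(f"{z}σ | {within:.10%} | {outside:.10%} | 1 / {1 / outside:,.1f}")
```

At 6σ the ratio comes out at roughly 1 in 507 million, which is the figure quoted above.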
But the Six Sigma used in industrial process management equates to 3.4 defects per million opportunities (DPMO).
Why?
Sigma level | DPMO | Percent defective | Percentage yield
1 | 691,462 | 69% | 31%
2 | 308,538 | 31% | 69%
3 | 66,807 | 6.7% | 93.3%
4 | 6,210 | 0.62% | 99.38%
5 | 233 | 0.023% | 99.977%
6 | 3.4 | 0.00034% | 99.99966%
7 | 0.019 | 0.0000019% | 99.9999981%
1.5 Sigma Shift
Experience has shown that in the long term, processes deteriorate for a number of reasons. The mean can drift, and the short term standard deviations can expand over time. To account for this real-life increase in process variation over time, a 1.5 sigma shift is introduced into the calculation. So setting up a 6 Sigma process at the start should provide at least a 4.5 Sigma process in the long term. 


Six-Sigma Process with +1.5σ Shift vs. Centered Three-Sigma Process

Despite a shift of 1.5σ in the long term mean (target), a 6σ process has only 3.4 ppm defective (4.5σ), compared to a more typical 3σ process, with a failure rate of 2700 ppm.
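The DPMO figures can be reproduced with a short calculation. The sketch below assumes that only the tail beyond the nearer specification limit is counted after the mean has drifted 1.5σ towards it, which is what the tabulated values appear to use; the function names are illustrative.

```python
import math
from statistics import NormalDist

SHIFT = 1.5   # assumed long-term drift of the mean, in standard deviations


def dpmo(sigma_level, shift=SHIFT):
    """Defects per million beyond the nearer limit after the mean has shifted."""
    # After the shift, the nearer limit is only (sigma_level - shift) away,
    # so the defect fraction is the normal tail beyond that distance.
    tail = NormalDist().cdf(-(sigma_level - shift))
    return tail * 1_000_000


def centered_ppm(sigma_level):
    """Parts per million outside both limits for a centered, unshifted process."""
    return math.erfc(sigma_level / math.sqrt(2)) * 1_000_000


for level in range(1, 8):
    print(f"{level} sigma: {dpmo(level):,.3f} DPMO")

print(f"centered 3 sigma: {centered_ppm(3):,.0f} ppm")
```

A shifted 6 Sigma process and an unshifted 4.5 Sigma process give the same one-sided defect rate, which is where the "really 4.5 Sigma" remark comes from.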

The Six Sigma strategy uses standard statistical tools, but they are employed in a systematic project-oriented fashion through the define, measure, analyze, improve and control (DMAIC) cycle. Plus a bunch of other acronyms we don't have time for.

Criticisms
  • Nothing new, and a risk of creating a cottage industry of training and certification (yet another industrial management fad).
  • Some claim that a strict Six Sigma implementation can stifle creativity by encouraging incremental rather than large innovations.
  • Six Sigma is arbitrary. Why not 5 or 7 Sigma? The 3.4 ppm (which is really 4.5 Sigma) is industry specific. A pacemaker process might need higher standards, a direct mail process lower.
  • 1.5 Sigma shift is arbitrary. Why not 0.5 or 2 Sigma shift? It also gives an overstated appearance (6 Sigma) when it is really only a 4.5 Sigma.