What every intelligent person should know about statistics, and a little more
In 1968, Dr Russell Langley, a dentist and learned gentleman of Melbourne, published Practical Statistics, a handbook of statistics prepared for the use of the public. This book was so popular that the paperback edition was soon followed by a hardback edition. It was part of an effort to make the general public aware of the power of statistics and more able to evaluate claims based on statistical reasoning, that arose after the Second World War, but seems to have abated, certainly in the United States, if not in England and Australia. Efforts to educate the public are doomed to failure if the public do not wish to be educated, as in the United States, where ignorance of all sorts is cherished, enjoyed and sturdily defended. The public is pelted daily by statistical claims, most of them inconclusive or misleading, presented with the lawyer's love of partial and selective truth. A little knowledge of statistics would make much better citizens and wiser consumers.
Dr Langley does not simply give recipes; he begins by explaining how figures can be used to mislead, then introduces probability, probability distributions, sampling, when the arithmetic mean is appropriate, measures of dispersion, logarithmic distributions, and the design of experiments, before presenting statistical tests that can easily be carried out by anyone, including a number of nonparametric tests based on ranking and other concepts. It is essential to understand the basis for statistical tests in order to use them meaningfully.
Since Dr Langley's book, the pocket calculator and the computer have greatly reduced the computational effort required to perform the tests. In fact, many pocket calculators are pre-programmed to carry out statistical tasks such as finding mean and standard deviation, and linear regression. The computer user these days is normally incapable of programming the computer to perform statistical analysis, but statistical programs are available that merely require the insertion of numbers and a few clicks of the mouse to produce large amounts of statistical output, with no knowledge or intelligence required of the user. The fruitlessness of this procedure should be evident. It is still necessary to understand what is being done, and to interpret the results correctly, and this understanding and correct interpretation cannot be supplied by the computer. The result of all this is a mountain of bad statistics and false conclusions.
Bad sampling vitiates any statistical analysis. Politicians and lawyers have long known that a cunningly selected sample can be used to prove any point they wish. A random sample is very difficult and expensive to select, and most poll results presented to the public (if not all) are based on bad samples. In a random sample, each member of the population being studied must have an equal chance of inclusion. If a good sample is used, results obtained from a sample of adequate size are nearly as accurate as a complete count of each member of the population. This has been demonstrated over and over, often in ways involving considerable financial outcomes, but it is difficult to convince politicians and the public. Incidentally, sampling in the census has been opposed really because it would be used to count the fringes of society, and politicians know it would give a more accurate (larger) count, since these individuals are hard to tie down for a census.
An important principle, which is emphasized by Dr Langley, is that averages cannot be compared unless the scatter of the data is known. The Colorado Student Assessment Program produces large tables of averages which are compared in various ways to suit political prejudices, without even a hint of the scatter. These averages are, in fact, useless as quantitative measures and are fodder only for idle contention. An observer notes that since the average math score in the 8th grade has risen from 37.6% to 38.2% there is evidence of progress.
The comparison of means is so important that everyone should know its principles. We generally have a sample of N measurements, and find their mean x and standard deviation s. The pocket calculator can do this nicely. We wish to compare the mean of our sample with some other mean referring to the whole group or population from which our sample can be considered to be randomly drawn. For example, if the average of a class of 24 students on a math exam was 53%, does this really show that the class was worse than average, if the average of all students in the state on the exam was 62%? If all the students in the state got exactly 62%, and all the students in the class exactly 53%, then, yes, the class is substandard. But what if the class grades ranged from 33% to 98%, and the state grades ranged from 25% to 99%? It could just be the luck of the draw, and the case would not be proven. The class could still be substandard, but these figures would not prove it. The class could just have had a bad day, and might really be above average.
It is evident that some assessment of the scatter is necessary before drawing any conclusions. If we know the standard deviation σ of the parent population, then we can see how unlikely our mean is if the sample is random, since the mean x of a sample of size N is distributed normally about the mean X of the population with a standard deviation σ/√N. We calculate the statistic z = √N(x - X)/σ and use the tables of normal probability to find out how unlikely it is. The 1% level is |z| = 2.58. That is, if the magnitude of z is greater than this, the chance of our sample being drawn from the assumed population is less than 1%.
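The z calculation can be sketched in a few lines of Python. This is only an illustration: the function names are invented here, and the state standard deviation of 15 percentage points is an assumed figure, since the text does not supply one.

```python
import math

def z_statistic(sample_mean, pop_mean, sigma, n):
    """z = sqrt(N) (x - X) / sigma for a sample of size n, sigma known."""
    return math.sqrt(n) * (sample_mean - pop_mean) / sigma

def two_tailed_p(z):
    """Chance that a standard normal variate is larger in magnitude than |z|."""
    return math.erfc(abs(z) / math.sqrt(2))

# Class of 24 averaging 53% against a state mean of 62%.
# The state standard deviation (15) is an ASSUMED value for illustration.
z = z_statistic(53.0, 62.0, 15.0, 24)
print(f"z = {z:.2f}, two-tailed probability = {two_tailed_p(z):.4f}")
```

The tail probability comes from the complementary error function, which stands in for the table of normal probability.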
Usually, we do not know the standard deviation of the assumed population. In this case, the best we can do is estimate it from the scatter in our sample. The square of the sample standard deviation, s², is an unbiased estimate of σ², so we calculate the statistic t = √N(x - X)/s. This is not normally distributed like z, but its distribution has been calculated for different N, so that the 1% probability levels are known for any size sample. For N = 9 (8 degrees of freedom), it is 3.36, somewhat larger than the value for N = ∞ of 2.58. This is the famous Student's t-Test. "Student" was William Gosset, a chemist at Guinness Brewery in Dublin, who had to publish under a pseudonym in 1908.
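A minimal sketch of the t calculation, using only Python's standard library, might look like the following. The nine scores are made-up data, and the function name is an assumption of this example; the critical value 3.36 is the one quoted above for N = 9.

```python
import math
import statistics

def t_statistic(sample, pop_mean):
    """t = sqrt(N) (x - X) / s, with s computed using the N - 1 divisor."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (N - 1 divisor)
    return math.sqrt(n) * (x_bar - pop_mean) / s

# Hypothetical exam scores for a sample of N = 9, compared with a mean of 62.
scores = [48, 55, 61, 39, 57, 50, 62, 45, 51]
t = t_statistic(scores, 62.0)

T_CRIT_1PCT_N9 = 3.36  # 1% level for N = 9, as quoted in the text
print(f"t = {t:.2f}; significant at 1%: {abs(t) > T_CRIT_1PCT_N9}")
```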
Sometimes we have two samples of sizes N and M, with means x and y, respectively, and wish to know if the difference in the means could be due to chance, and the two samples really taken from the same population. If the variances of the two samples are given, we can use the t-test for this problem, with n = N + M - 2 degrees of freedom. The statistic is t = (x - y)/[s√(1/N + 1/M)]. The estimate of the population standard deviation s must be calculated from the variances S1 and S2 of the two samples, which can, of course, be found from the standard deviations: s² = [(N - 1)S1 + (M - 1)S2]/(N + M - 2). Here (N - 1)S1 and (M - 1)S2 are the sums of squares of deviations from the respective sample means. This test is not explained by Langley, but is given in Weatherburn, p. 190.
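The two-sample test can be sketched as follows, assuming the pooled variance is the sum of the two sums of squares divided by N + M - 2. The function name and the tiny data sets are illustrative only.

```python
import math
import statistics

def pooled_t(sample_a, sample_b):
    """Return (t, degrees of freedom) for two samples assumed to share one variance."""
    n, m = len(sample_a), len(sample_b)
    x, y = statistics.mean(sample_a), statistics.mean(sample_b)
    # statistics.variance uses the N - 1 divisor, so (n - 1) * variance
    # recovers the sum of squares of deviations from the sample mean.
    var_a = statistics.variance(sample_a)
    var_b = statistics.variance(sample_b)
    s2 = ((n - 1) * var_a + (m - 1) * var_b) / (n + m - 2)
    t = (x - y) / (math.sqrt(s2) * math.sqrt(1 / n + 1 / m))
    return t, n + m - 2

# Hypothetical samples.
t, dof = pooled_t([1, 2, 3], [4, 5, 6])
print(f"t = {t:.3f} with {dof} degrees of freedom")
```

The value of t is then looked up in a t table for the stated degrees of freedom.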
These three tests, easily done with a pocket calculator, will allow you to use the means and standard deviations that are otherwise so easy to find. It is really extraordinary that the way to calculate these figures is solemnly taught to students, and the figures are solemnly recorded in so many places, without any real understanding of how they can be used, what they mean, and what their limitations are.
It is also possible to test the ratio of the variances of two samples to see if they could come from the same population. This statistic, Fisher's F, is used in analysis of variance. Most laymen would consider the comparison of variances instead of means to see if two samples were drawn from the same population as very arcane, but it can be quite useful.
If you make a single measurement from a normal population, it will probably not be too far from the mean M of the population. A more reliable estimate of M is the average of a sample of N measurements, X = (Σx)/N. If σ is the standard deviation of the population, then X is distributed with the smaller standard deviation σ/√N. This fact was made use of in the z test. The range R of a sample of N measurements is the difference between the largest and smallest values in the sample. If we take larger and larger samples, the range increases steadily. The table at the right gives the ratio of the average value of the range of a sample to the standard deviation of the population. This furnishes a quick check on the calculated standard deviation of a sample, since the range is easily found.
As we saw above, the quantity s² = Σ(x - X)²/(N - 1) is an unbiased estimator of the population variance σ². It is slightly larger than the average square of the deviation from the mean. In fact, it is a good estimator only for a sufficiently large N, say 5 or greater. For smaller values of N, the sample range might as well be used instead. This is done in Lord's Range Test for the significance of the difference between the means of two small samples. In this test, we form the statistic L = |X1 - X2|/(R1 + R2), where X1 and X2 are the sample means, and R1 and R2 the sample ranges. Significance levels for L are given in the table at the left. Larger values of L are more significant, of course. Here, the samples are the same size N = 2, 3 or 4. Langley gives a larger table for samples of different sizes, and for more than two samples. This is a very easy test, and about the best that can be done with small samples.
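Lord's statistic is simple enough to compute by hand, but a short sketch makes the definition explicit. The function name and the sample values are invented for illustration; the significance levels must still be taken from a table such as Langley's, and are not reproduced here.

```python
def lord_statistic(sample_1, sample_2):
    """L = |X1 - X2| / (R1 + R2): difference of means over the sum of ranges."""
    mean_1 = sum(sample_1) / len(sample_1)
    mean_2 = sum(sample_2) / len(sample_2)
    range_1 = max(sample_1) - min(sample_1)
    range_2 = max(sample_2) - min(sample_2)
    # If both ranges are zero the statistic is undefined (division by zero).
    return abs(mean_1 - mean_2) / (range_1 + range_2)

# Two hypothetical samples of size N = 3.
L = lord_statistic([1, 2, 3], [4, 6, 8])
print(f"L = {L:.3f}")
```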
Another way of comparing two samples is to pool the measurements, and then to rank them from smallest to largest or vice versa. The sum of the ranks of the measurements in each of the two samples is found, and the smaller value, R, is noted. If ranks are tied, the average rank is used. The smaller the value of R, the less likely that the two samples are drawn from the same population. Langley gives tables of the significant values of R for samples smaller than 20. For larger samples, z = [N_R(N_A + N_B + 1) - 2R]/√[N_A N_B (N_A + N_B + 1)/3], which can be interpreted as in the z-test above. N_A and N_B are the sizes of the two samples, and N_R is the size of the sample with the smaller rank sum R.
For example, suppose we have two samples with N = 5, and all the values in one sample are greater than all the values in the other. Whichever way we rank the values, then, 1 to 5 belong to one sample and 6 to 10 to the other. The smaller rank total is 5 + 4 + 3 + 2 + 1 = 15. The table in Langley gives R = 15 for significance at the 1% level. R = 17 is the 5% level, R = 19 the 10% level. This shows again how difficult it is to obtain statistical significance from small samples. If we calculate z for this example, we find z = 2.61, which is significant at the 1% level. This test is called Wilcoxon's Sum of Ranks Test.
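The ranking procedure and the large-sample z formula can be sketched as below. The function names are assumptions of this example, and the data simply reproduce the case just described, in which every value of one sample lies below every value of the other.

```python
import math

def average_ranks(values):
    """Rank from smallest (rank 1) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # positions i..j are 0-based; ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_test(sample_a, sample_b):
    """Return (R, z): the smaller rank sum and its normal approximation."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    sum_a, sum_b = sum(ranks[:n_a]), sum(ranks[n_a:])
    r, n_r = (sum_a, n_a) if sum_a <= sum_b else (sum_b, n_b)
    z = (n_r * (n_a + n_b + 1) - 2 * r) / math.sqrt(n_a * n_b * (n_a + n_b + 1) / 3)
    return r, z

r, z = rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
print(f"R = {r}, z = {z:.2f}")  # R = 15.0, z = 2.61
```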
The probability of isolated occurrences follows the Poisson distribution. If the probability of occurrence is constant and equal to k dt in a time interval dt, then the probability that an event does not occur in dt is 1 - k dt. Let P(t) be the probability that an event does not occur in a time t. Then, the probability that the event has not occurred by a time dt later will be P + dP = P(1 - k dt) by the laws of probability. Hence, dP = -Pk dt or dP/P = -k dt, from which, on integrating, ln(P/P0) = -kt, or P(t) = e^(-kt), since P0 = 1. Now, m = kt is the average number of events occurring in time t, so the probability of observing no events in time t is e^(-m).
If the probability of 1, 2, ... events in a time in which the average number of events is m is worked out, the result is P(n) = (m^n/n!)e^(-m). Since the sum of m^n/n! over all n is e^m, it is easy to see that Σ(0,∞)P(n) = 1, or that the probability of 0 or more events is unity, which is obvious.
If we observe more events x than the average expected in an interval, or x > m, we may ask if this is consistent with the average. That is, whether the occurrence of x events (or more) may simply be due to chance, and not to an increase in the probability of the events. The null hypothesis is that the increased number of occurrences is simply a chance variation. The probability of observing x or more events is 1 minus the probability of observing 0, 1, ..., x - 1 events. For example, suppose m = 0.82 and we observe 3 events. The probability of 0, 1, or 2 events is e^(-0.82)(1 + 0.82 + 0.82²/2) = 0.95, so the probability of 3 or more events is 1 - 0.95 = 0.05, or 5%. There is one chance in 20 of observing such a number of events, which we regard as probably significant. Langley's E Table for Poisson's Test lists x from 0 to 40, and the average value m corresponding to various probability levels. This table, in effect, solves the summed probability for m when x is known, which is very convenient for us.
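The summed probability is easily computed directly. The following sketch, with function names chosen for this example, repeats the case of m = 0.82 and 3 observed events.

```python
import math

def poisson_pmf(n, m):
    """Probability of exactly n events when the average number is m."""
    return m ** n / math.factorial(n) * math.exp(-m)

def prob_at_least(x, m):
    """Probability of x or more events: 1 minus the chance of 0, 1, ..., x - 1."""
    return 1.0 - sum(poisson_pmf(n, m) for n in range(x))

p = prob_at_least(3, 0.82)
print(f"P(3 or more events | m = 0.82) = {p:.3f}")  # about 0.050
```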
There is something wrong with the part of the table for x < E. The probability P0 of observing 0 events is e^(-m), which we can easily solve for m: m = ln(1/P0). If P0 = .05, then m = 3.00. That is, there is a 1 in 20 chance of observing no events even if the average number of events is 3.00. The table, however, shows 3.6. The other values on this line also do not agree with my calculations. If the average is 5.5, then the probability of 0 or 1 event is e^(-5.5)(1 + 5.5) = 0.0266, not .05, as the table implies. In the absence of this table, tables of the summed Poisson distribution can be entered with the values of x and m, and the probability read off. A computer program will be more convenient than a table, however, if you can find or write one.
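Such a program need not be long. The sketch below is one possible approach, not Langley's method: it sums the Poisson probabilities for 0 to x events and then finds, by bisection, the average m at which that sum equals a chosen probability. The function names and the search limit of 200 are assumptions of this example.

```python
import math

def prob_at_most(x, m):
    """Probability of observing 0, 1, ..., x events when the average is m."""
    term = math.exp(-m)  # P(0)
    total = term
    for n in range(1, x + 1):
        term *= m / n    # P(n) = P(n - 1) * m / n
        total += term
    return total

def mean_for_probability(x, p, upper=200.0, tol=1e-10):
    """Find m with prob_at_most(x, m) = p; the sum falls steadily as m grows."""
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prob_at_most(x, mid) > p:
            lo = mid     # probability still too large, so m must be larger
        else:
            hi = mid
    return (lo + hi) / 2

print(f"m for x = 0 at the 5% level: {mean_for_probability(0, 0.05):.3f}")
print(f"P(0 or 1 event | m = 5.5) = {prob_at_most(1, 5.5):.4f}")
```

For x = 0 the bisection simply recovers m = ln(1/P0), which gives a check on the routine.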
For x ≥ 40, the Poisson distribution is well approximated by a normal distribution with mean m and standard deviation √m. The z statistic is given by z = (|x - m| - c)/√m, where the correction factor c is 0 for x < m and 0.5 for x > m. The 5% value for z is 1.96, the 1% value is 2.58.
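As a brief illustration of this approximation, the following sketch evaluates z with the correction factor as stated. The counts used (60 events observed where 45 are expected) are hypothetical.

```python
import math

def poisson_normal_z(x, m):
    """z = (|x - m| - c) / sqrt(m), with c = 0.5 when x > m and 0 otherwise."""
    c = 0.5 if x > m else 0.0
    return (abs(x - m) - c) / math.sqrt(m)

# Hypothetical case: 60 events observed against an average of 45.
z = poisson_normal_z(60, 45.0)
print(f"z = {z:.2f}; beyond the 5% value of 1.96: {z > 1.96}")
```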
The number of avalanche fatalities in Colorado for the last nine seasons was given in the paper as: 7, 1, 6, 6, 7, 5, 8, 6 and 3. The average per year was 49/9 = 5.4. These are small numbers, and fluctuate greatly. Nevertheless, they are consistent with a Poisson distribution. To make the expected numbers a little larger, we may group the observations as 0-3, 4-6, 7 and greater. The probabilities, calculated from the Poisson distribution, are 0.21, 0.49 and 0.30. The expected numbers for a sample with N = 9 are 1.9, 4.4 and 2.7. The observed numbers were 2, 4 and 3. The observed numbers are, therefore, what we would expect on the basis of a Poisson distribution. To demonstrate this, we can calculate χ² = (0.1)²/1.9 + (0.4)²/4.4 + (0.3)²/2.7 = 0.075, a remarkably small value. For 2 d.f., the value for 5% significance is 5.99. If all the observations fell into the 4-6 group, that is, were "average," then χ² would be 9.4, definitely ruling out a Poisson distribution.
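The avalanche calculation can be repeated by machine, as in the sketch below. It uses the unrounded mean 49/9 and unrounded expected numbers, so the value of χ² it prints differs slightly from the hand calculation with rounded figures, though the conclusion is the same. The variable and function names are choices of this example.

```python
import math

fatalities = [7, 1, 6, 6, 7, 5, 8, 6, 3]
n_seasons = len(fatalities)
m = sum(fatalities) / n_seasons  # 49/9

def poisson_pmf(n, mean):
    """Probability of exactly n events when the average number is mean."""
    return mean ** n / math.factorial(n) * math.exp(-mean)

# Group probabilities for 0-3, 4-6, and 7 or more fatalities.
p_low = sum(poisson_pmf(n, m) for n in range(0, 4))
p_mid = sum(poisson_pmf(n, m) for n in range(4, 7))
p_high = 1.0 - p_low - p_mid

observed = [
    sum(1 for v in fatalities if v <= 3),
    sum(1 for v in fatalities if 4 <= v <= 6),
    sum(1 for v in fatalities if v >= 7),
]
expected = [p * n_seasons for p in (p_low, p_mid, p_high)]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print("observed:", observed)
print("expected:", [round(e, 2) for e in expected])
print(f"chi-squared = {chi2:.3f} (5% value quoted in the text: 5.99)")
```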
The question of the correlation between a pair of measurements is often of interest. The two variables may be connected functionally, as are the current and voltage by Ohm's Law, V = IR, or they may reflect considerable chance variation, as in biometric studies. In the most common case, we look for a linear relation between the variables. This problem is often called regression analysis, from one of its early applications. This arose from a study of the correlation of heights between fathers and sons. All men may have some average height, say 5' 8". If we consider fathers of above-average height, say taller than 6' 0", then their sons also tend to be taller than average, but not by as much as the fathers are. The heights of the sons are said to exhibit a regression to the mean, from which the name is derived.
Let x represent the height of a father (for example) and y the height of his son. A data point is the pair (x,y). The data may be plotted on rectangular x,y axes, as are the points in the diagram at the right. This is a scatter plot, which may suggest a trend to the eye. The average of the fathers' heights is X (denoted x-bar in the diagram), and the average of sons' heights is Y. Explicitly, X = Σx/N and Y = Σy/N, where N is the number of data points. We will need the quantities σx and σy, which are defined by Nσx² = Σ(x - X)² and Nσy² = Σ(y - Y)². These sigmas are the standard deviations calculated with N instead of N - 1.
The line of regression of y on x is shown in the diagram. It passes through the point (X,Y), and its slope is chosen to make the sum of the squares of the lengths of the red lines a minimum. This is easily done by assuming a line y = a + bx and minimizing the sum of the squares of the errors y - a - bx using differential calculus. The line is most clearly expressed by the equation in the diagram, (y - Y) = r(σy/σx)(x - X), where r = (1/N)Σ(x - X)(y - Y)/(σxσy) is the correlation coefficient. This coefficient is chosen so that multiplying x and y by positive constants does not change it.
The average sum of the squares of the "errors" or differences between the actual values of y and the value predicted by a + bx (the red lines in the diagram) is denoted by Sy² = σy²(1 - r²). Since this quantity cannot be negative, -1 ≤ r ≤ +1. If |r| = 1, then the line of regression passes exactly through the points. This is what we are looking for in the case of Ohm's Law. If r = 0, then the average square of the error is the variance of the y variables themselves, so that there is no better predicted value of y than Y. That is, the line of regression is horizontal.
If the variables are interchanged, the solution is the same, except that now we are minimizing the sum of the squares of the horizontal errors in the diagram, not the vertical ones. In general, the line of regression of x on y is different, (x - X) = r(σx/σy)(y - Y). However, if |r| = 1, these lines are the same. If r = 0, the line of regression is now parallel to the y-axis, so the two lines cross at right angles at the mean point (X,Y).
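The quantities entering the two lines of regression can be computed as in the following sketch. The heights are invented data in inches, and the function name and dictionary keys are assumptions of this example. The sigmas use the divisor N, as defined above.

```python
import math

def regression_summary(xs, ys):
    """Means, sigmas (N divisor), correlation r, and both regression slopes."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sigma_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    r = covariance / (sigma_x * sigma_y)  # undefined if either sigma is zero
    return {
        "x_bar": x_bar,
        "y_bar": y_bar,
        "r": r,
        "slope_y_on_x": r * sigma_y / sigma_x,  # (y - Y) = slope (x - X)
        "slope_x_on_y": r * sigma_x / sigma_y,  # (x - X) = slope (y - Y)
        "residual_variance": sigma_y ** 2 * (1 - r ** 2),
    }

# Hypothetical heights of fathers and sons, in inches.
fathers = [65, 67, 68, 70, 72, 74]
sons = [67, 68, 68, 70, 71, 72]
for name, value in regression_summary(fathers, sons).items():
    print(f"{name}: {value:.3f}")
```

The product of the two slopes is r², so the two lines coincide only when |r| = 1.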
The statistic t = r√(N - 2)/√(1 - r²) is distributed like a t-variate with N - 2 degrees of freedom. From this we can estimate the chance that N data pairs chosen at random from uncorrelated, normally-distributed variates will show a correlation of at least r. For small N, a spurious correlation is to be expected. For example, for N = 12, a correlation as large as r = 0.576 will be found one time in twenty, on the average, even from uncorrelated variates. For N = 8, an r of as much as 0.707 will be found one time in twenty, on the average.
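The two examples can be checked by converting r to t, as in this short sketch (the function name is an assumption). The standard two-tailed 5% values of t for 10 and 6 degrees of freedom are 2.228 and 2.447.

```python
import math

def correlation_t(r, n):
    """t = r sqrt(N - 2) / sqrt(1 - r^2), with N - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

print(f"N = 12, r = 0.576: t = {correlation_t(0.576, 12):.3f}")  # about 2.228
print(f"N = 8,  r = 0.707: t = {correlation_t(0.707, 8):.3f}")   # about 2.449
```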
If the probability distribution of variates x and y can be expressed as a product f(x)g(y), the variates are said to be independent. The covariance (1/N)Σ(x - X)(y - Y) of independent variates is zero, so the variates are uncorrelated. Uncorrelated variates, those for which r = 0, may or may not be independent.
Correlation analysis is reliable and useful when either the number of data points is large (N > 100), or when the correlation coefficient is near unity (r > 0.9). It is doubtful when it is employed to show a dependence using a small number of data points with considerable scatter. Abusers of statistics like it, however, since almost any data collection will show a correlation, which can then be called proof. In any case, even a well-established correlation does not by any means show causation.
An easy test for correlation between two variates is Spearman's Correlation Test, which uses ranks instead of data values. Ranks were mentioned above in connection with Wilcoxon's Sum of Ranks Test. Spearman's test, which appeared in 1904, was the first application of ranks in statistical tests. The sets of x and y values are separately ranked from smallest to largest. Then, for any data pair (x,y) the difference between the two ranks is noted. The sum of squares of the differences, D², is then found. A correction for ties T must be added. T is the sum of n(n² - 1)/12 for each tie of n values. That is, T = (1/2)t2 + 2t3 + 5t4 + 10t5 + ... , where t2 is the number of ties of two values, t3 the number of ties of three values, and so on. For N = 5 to 30, tables should be used to find whether D² + T is small enough to be significant, or large enough for significant inverse correlation. For example, D² must be less than 23, or greater than 145, for N = 8. These tables can be found in Langley, Neave and other sources. Otherwise, find z = √(N - 1)|1 - 6(D² + T)/(N³ - N)|, and interpret z the usual way. The expression between absolute value bars is the correlation coefficient r.
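The steps of Spearman's test can be sketched as follows. The function names are invented for this example, and it is assumed here that the tie correction is summed over ties in both the x values and the y values. The data are hypothetical.

```python
import math
from collections import Counter

def average_ranks(values):
    """Rank from smallest (rank 1) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def tie_correction(values):
    """Sum of n(n^2 - 1)/12 over each group of n tied values (0 if no ties)."""
    return sum(k * (k * k - 1) / 12 for k in Counter(values).values())

def spearman(xs, ys):
    """Return (D2, T, r, z) for paired data xs, ys."""
    n = len(xs)
    rank_x, rank_y = average_ranks(xs), average_ranks(ys)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    t = tie_correction(xs) + tie_correction(ys)  # ASSUMED: ties in both sets count
    r = 1 - 6 * (d2 + t) / (n ** 3 - n)
    z = math.sqrt(n - 1) * abs(r)
    return d2, t, r, z

# Hypothetical paired measurements, N = 8.
xs = [12, 15, 9, 20, 17, 11, 14, 18]
ys = [30, 34, 25, 41, 33, 28, 29, 40]
d2, t, r, z = spearman(xs, ys)
print(f"D2 = {d2}, T = {t}, r = {r:.3f}, z = {z:.2f}")
```

For N below about 30, D² + T should still be compared with the tables rather than with z.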
Langley does not include some of the more technical aspects of practical statistics, such as regression analysis or analysis of variance, though he does give Spearman's Correlation Test, and the χ2 test for differences from an expected distribution (goodness of fit). Analysis of Variance is briefly explained in another page of this site, Latin Squares. The useful 2-by-2 contingency table also has a page of its own, Two-by-Two.
Composed by J. B. Calvert
Created 27 November 2000
Last revised 15 March 2005