A useful and simple procedure for detecting the statistical significance of the association of two binary qualities is the 2x2 contingency table. By "binary quality" we mean a quality that can be assigned to an observation in one of two mutually exclusive ways, such as "present" or "not present," or from source A or source B. Let us suppose that one quality is denoted by A or B, and the other by C or D. We take a sample of N observations and tally each observation under AC, AD, BC or BD; the four cell counts are a, b, c and d, as shown in the diagram. The row sums are n1 and n2, and the column sums are n3 and n4, so that N = a + b + c + d = n1 + n2 = n3 + n4.
Our null hypothesis is that the observations are all drawn from the same population, so that the expected number in each of the four classes is the product of its two marginal totals divided by N. That is, the average value of a is Ea = n3 x n1/N, that of b is Eb = n4 x n1/N, that of c is Ec = n3 x n2/N and that of d is Ed = n4 x n2/N. When any one of these average values is known, the others can be obtained by subtraction. For example, if we know a, then c = n3 - a and b = n1 - a, and finally d = n4 - b or n2 - c. That is, when the marginal sums are held constant, all the numbers in the 2x2 table are determined by a single number. Therefore, the table has one degree of freedom.
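These relations are easy to verify in code. The following Python sketch (the function name and cell layout are mine, chosen to match the text) computes the four expected values from the marginal totals:

```python
# Expected (average) cell counts for a 2x2 table under the null
# hypothesis.  Layout follows the text: first row (a, b) sums to n1,
# second row (c, d) to n2; first column (a, c) sums to n3, second
# column (b, d) to n4.  The function name is illustrative only.
def expected_counts(a, b, c, d):
    n1, n2 = a + b, c + d          # row sums
    n3, n4 = a + c, b + d          # column sums
    N = n1 + n2
    return (n3 * n1 / N, n4 * n1 / N,
            n3 * n2 / N, n4 * n2 / N)

# With the margins fixed, one cell determines the other three:
# c = n3 - a, b = n1 - a, d = n4 - b -- hence one degree of freedom.
```

The returned values necessarily add up to N and reproduce the marginal sums, which is a quick check on any hand calculation.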
When a sample of N observations is drawn, the numbers a, b, c and d will differ from the average values due to the chances of sampling. As the observed values depart further from the average values, the chance of drawing that particular sample becomes smaller and smaller. Note that the departure can be characterized by the deviation of one number, say a, from its average value, since all four numbers are determined by this one. If the chance of observing a sample this divergent is small, we may reject the null hypothesis, and conclude that such a sample may well be due to a real difference.
The probability of a certain sample may be determined from the hypergeometric distribution to be P = [(n1! n2! n3! n4!)/(a! b! c! d!)](1/N!). For statistical significance, we must ask, in the usual way, for the probability of drawing a sample at least this far from the average, and so must add the probabilities of samples even less probable. These calculations are tedious, so alternative methods have been found. For values of N greater than 50, and where the expected values are all greater than 5, the χ² statistic with Yates's correction and one degree of freedom gives very good results. For values of N less than 50, tables have been calculated from the exact formula. These tables may be found in Langley.
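The exact probability is straightforward to evaluate by machine; a sketch (the function name is mine) using only the standard library, with exact rational arithmetic to avoid overflow and rounding:

```python
from math import factorial
from fractions import Fraction

# P = n1! n2! n3! n4! / (a! b! c! d! N!), the exact probability of a
# particular 2x2 table when all four marginal totals are held fixed.
def table_probability(a, b, c, d):
    n1, n2, n3, n4 = a + b, c + d, a + c, b + d
    N = n1 + n2
    num = factorial(n1) * factorial(n2) * factorial(n3) * factorial(n4)
    den = (factorial(a) * factorial(b) * factorial(c) * factorial(d)
           * factorial(N))
    return Fraction(num, den)
```

For a significance test, these probabilities are summed over the observed table and every table more extreme, which is just what Fisher's exact test does.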
The χ² statistic is the sum of (|O - E| - 0.5)²/E over the four entries of the table, where O is the observed value, E the average (or expected) value, and the 0.5 is Yates's correction. Because of the relation between the four values, Y = |O - E| - 0.5 turns out to be the same for all four entries. Then, χ² = Y²(1/Ea + 1/Eb + 1/Ec + 1/Ed). The sum of the reciprocals of the average values is easy to find on the calculator, and is then multiplied by Y². The values of χ² for the different levels of significance are: 10%, 2.71; 5%, 3.84; 1%, 6.63; 0.2%, 9.55. As χ² increases, the probability that the differences from the expected average values are due to chance becomes less and less.
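This shortcut is just as easy to program as to do on a calculator; a minimal Python sketch (the function name is mine):

```python
# Yates-corrected chi-squared for a 2x2 table, using the shortcut from
# the text: Y = |O - E| - 0.5 is the same for every cell, so
# chi2 = Y^2 (1/Ea + 1/Eb + 1/Ec + 1/Ed).
def yates_chi2(a, b, c, d):
    n1, n2, n3, n4 = a + b, c + d, a + c, b + d
    N = n1 + n2
    Ea, Eb, Ec, Ed = n3*n1/N, n4*n1/N, n3*n2/N, n4*n2/N
    Y = abs(a - Ea) - 0.5
    return Y * Y * (1/Ea + 1/Eb + 1/Ec + 1/Ed)
```

Applied to the cholera table discussed below (a = 3, b = 276, c = 66, d = 473), this gives χ² of about 28.3, agreeing to rounding with the 28.2 quoted there.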
For N < 50, Fisher calculated the probabilities from the exact formula. Langley contains tables for a = 1 to a = 17. All 2x2 tables with a = 18 or greater have a probability of 1% or less for N up to 50. The table for a = 1 is shown at the left. It may also be used for N greater than 50, where it will be more accurate than the χ² test. Should d fall outside these tables, to the right or below, the probability is greater than 5%. For N = 8 or smaller, a 5% level cannot be reached. These tables are not generally found in sets of statistics tables, unfortunately. Langley's source is given in the References.
In order to use the d tables, the table must be rearranged so that n1 is the smallest of the four marginal totals, and ad > bc. This is done by interchanging the two rows, interchanging the two columns, or exchanging rows with columns (transposing), and amounts only to a relabeling of the rows and columns. For significance, the actual value of d should be equal to or greater than the tabular value. If ad = bc, the observed values are proportional (a/c = b/d), and no difference can be indicated.
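The relabeling can be automated. The sketch below (entirely my own construction, not from Langley) generates every arrangement reachable by the three exchanges and returns one with n1 smallest and ad >= bc:

```python
# Rearrange a 2x2 table (a, b, c, d) so that the first row sum n1 is
# the smallest marginal total and a*d >= b*c, as the d tables require.
# Only relabelings of rows and columns are used, so the table's
# content is unchanged.  A sketch; the approach is my own.
def arrange_for_d_table(a, b, c, d):
    seen = {(a, b, c, d)}
    stack = [(a, b, c, d)]
    while stack:                       # generate all relabelings
        p, q, r, s = stack.pop()
        for t in ((r, s, p, q),        # swap the two rows
                  (q, p, s, r),        # swap the two columns
                  (p, r, q, s)):       # exchange rows with columns
            if t not in seen:
                seen.add(t)
                stack.append(t)
    for p, q, r, s in seen:
        if p + q == min(p + q, r + s, p + r, q + s) and p * s >= q * r:
            return (p, q, r, s)
```

However the original table is labeled, the result can be looked up directly in the d tables.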
As an example, suppose we ask 25 high school students from school B who is buried in Grant's Tomb, and none of them know. Now the same question is asked of 3 students from school A, and 1 knows the answer. Can we conclude that school A offers a better training in this respect? The contingency table, properly arranged for Fisher's Test, is shown at the right. The value from the d Table for a = 1, b = 0 and c = 2 is 57. That is, c would have to be at least 57 for even a 5% significance level, but it is only 25, so any difference between schools A and B is not proved. To achieve a 1% level, at least 297 students from school B would have to not know the answer. This shows that it is very difficult to draw conclusions from such a small sample as 3.
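Lacking the d tables, the one-tailed probability for such a small sample can be summed directly from the exact formula. A sketch (the function name is mine), using the fact that with fixed margins the probability of a table is hypergeometric, P(X = x) = C(n3, x) C(n4, n1 - x) / C(N, n1):

```python
from math import comb

# One-tailed Fisher exact probability: with all margins fixed, sum the
# probabilities of the observed table and of every table with a still
# larger value of a.  Uses only the standard library.
def fisher_one_tail(a, b, c, d):
    n1, n3, n4 = a + b, a + c, b + d
    N = a + b + c + d
    total = sum(comb(n3, x) * comb(n4, n1 - x)
                for x in range(a, min(n1, n3) + 1))
    return total / comb(N, n1)
```

For the table above (a = 1, b = 0, c = 2, d = 25) this gives 3/28, about 0.11, well above the 5% level, in agreement with the d Table.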
Suppose we double our sample size from school A, and find that the ratio of know to don't know remains constant, as shown in the table at the left. The d Table for a = 2, b = 0 and c = 4 gives d = 19 for 5% and d = 50 for 1%. Our value of d of 25 falls between these values, so the difference between school A and school B is probably significant (at a level between 5% and 1%). If we double the sample again and get 4 knows and 8 don't knows, then the value of d for 1% is 23, so the difference is now significant. Tables of d for small values of N can be quite valuable in showing when we have reached a large enough sample for significance.
Now let's consider some larger samples, and analyze them with Yates's chi-square test. In the table on the left, the expected value of a is 27.8 (150 times 62/335). Y = 18.7 for this table, and χ² = 28. This is a very large value, showing that it is very unlikely that students from the two schools are the same in knowledge of this question. In the table on the right, the proportion of correct responses is more nearly equal, 31% to 22%. Here, χ² = 3.53, just below the 5% value of 3.84, so it is not proven that the students from the two schools are unequally prepared.
Note carefully that all we can ever prove with statistics is that the difference between the samples of students from the two schools is unlikely to arise by chance, not that there actually is some difference. If we find that there is a statistically significant difference, then we may find it possible to discover some reason for the difference. Statistics cannot prove that there actually is some such difference; only research into possible explanations can do that.
A laboratory that performs many studies looking for correlations between one thing and another will occasionally come up with significant results. If P = 5% is the criterion of significance, then on the average one in twenty studies of uncorrelated variables will show a significant correlation. If only such "significant" studies are published, then a great deal of error can be propagated. This situation seems to be common in much medical and nutritional work. Any positive result should be repeated, but it is seldom reported that this has been done. This compounds the error of concluding that the variable tested is the actual cause of any observed correlation. The classic example is the correlation of the wearing of winter coats in Germany with low temperatures in England: low temperatures in England do not make Germans put on coats. Nevertheless, whenever eating turnips is found to correlate with plantar warts, it is said that eating turnips causes plantar warts. When the difficulty of choosing random samples is combined with these fallacies, it is a wonder that anything at all can be concluded by using statistics in this way.
A classical example of a 2x2 contingency table given by Langley is the cholera study of 1894-96, shown at the left. 818 people were studied, of whom 279 were inoculated with a new preparation, and 539 were not. This is one binary quality, inoculated or not. Each person was either infected with cholera, or not, which is the other binary quality. In all, 69 were infected: 3 who had been inoculated and 66 who had not. If the inoculation had had no effect, we would expect 69 x (279/818) = 23.5 of the inoculated persons to have contracted cholera, but only 3 did. Then, Y = |23.5 - 3| - 0.5 = 20, and χ² = 28.2. This large value shows that the table is highly significant, and consequently that the inoculation was successful in preventing cholera.
A number of small studies can be combined into one larger study by adding the values of χ² and considering each small study to contribute one degree of freedom. The values of χ² for each study should be calculated without Yates's correction. For this procedure to be valid, the studies combined must not be selected, but included whether or not they are individually significant. Any other procedure will give a biased result.
For example, if a small study gives χ² = 2.0, which is certainly not significant, 14 such studies together give χ² = 28 on 14 degrees of freedom, very nearly the 1% level of significance (29.1). The 1% value of χ² increases by roughly 1.2 to 1.3 per degree of freedom for more than 20 degrees of freedom. This, of course, demonstrates the power of large samples. The values of χ² for various confidence levels are given as a function of the number of degrees of freedom in any set of statistics tables.
M. J. Moroney presents a problem that shows another way to look at 2x2 contingency tables. He attends a Bach concert and counts 7 blondes and 143 brunettes in the audience. Then he attends a jazz concert and counts 14 blondes and 108 brunettes. Is it safe to conclude from these figures that blondes prefer jazz to Bach? Consider the 2x2 contingency table, and find, from the null hypothesis that blondes do not differ from brunettes in their musical preferences, that the expected average number of blondes at the Bach concert is 11.6. Find the other expected average numbers as well. Then, Y = |11.6 - 7| - 0.5 = 4.1, and χ² = 3.51. This is not large enough to be significant at the 5% level. We must conclude that any preference of blondes for jazz is not proven.
R. Langley, Practical Statistics (Newton Abbot: David and Charles, 1970).
P. Armsen, Biometrika 42 (1955), pp. 506-511. d Tables for Fisher's Test.
H. R. Neave, Elementary Statistics Tables (London: George Allen and Unwin, 1979). These tables were recommended to students at the Open University.
M. J. Moroney, Facts from Figures, 3rd ed. (London: Penguin Books, 1956). The blonde-brunette study is on pp. 269-270.
Composed by J. B. Calvert
Created 7 March 2005