- Introduction
- Statistical Distributions
- Mean and Standard Deviation of a Sample
- Linear Regression
- The z- and t-Tests
- The Chi-Square Test
- The Binomial Test
- Analysis of Variance
- References

This article describes the use of the Hewlett-Packard HP-48G calculator for simple statistics computations. The reader should be familiar with the calculator, and with what the Owner's Guide has to say about statistical calculations. Step-by-step instructions are given here, and obscure points are clarified. The HP-48 can do the job of many statistical tables, while easing the computational tasks.

A *variate* is a number that is the result of an observation, event, choice, or similar operation, and whose result may, in general, vary by chance. It isn't a *variable*, which is a more general mathematical object, but a variable can hold the value of a variate. The result of observing a variate is described by a *distribution function*. If the variate is continuous, the probability that it lies between x and x + dx is f(x)dx, where f(x) is the distribution function. The area under f(x) over the whole range of possible values of the variate is unity, so that probability is proportional to area. If the variate is discrete, the probability of a value k is P(k), the distribution function. In this case, ΣP(k) = 1, where the sum is over all possible values of k. A *statistic* is a number depending on observed values of k, and itself may be a variate. Its distribution is a *sampling distribution*, which depends on the distributions of the variates of which it is composed.

The most common continuous distribution is the *normal* distribution f(x) = [1/S√2π]exp[-(M - x)^{2}/2S^{2}], with mean M and variance V = S^{2}. It peaks at x = M, and decreases rapidly to both sides. The area under the curve between x = M - S and x = M + S is 0.683; between x = M - 2S and x = M + 2S the area is 0.954. This can be found by using the HP-48 function UTPN, found in the MTH NXT PROB menu. Push a mean of 0 and a variance of 1 on the stack, then a value of 1, and press the menu key for UTPN, to find 0.1587. This is the area under the curve for x > S. The area we want is 1 - (2)(0.1587) = 0.683. Do the same for 2S. Use UTPN to show that the total area under the curve is 1.0 (find the area for x > 0 and double it). 1 minus the value given by UTPN is the *cumulative distribution function* from x = -∞ to x. It is the probability that a variate is less than or equal to x.

The values of f(x) for the normal distribution are found with NDIST. Push M, V and x on the stack, and press NDIST. The result will be the value of f(x) at that value of x. For M = 0, V = 1, x = 0 we get f(0) = 0.3989. This is the only distribution that the HP-48 will compute automatically. Others, such as the Poisson, can be found by evaluating the functions in the normal way. If you use a distribution a lot, it is worth making a User Function or a program to calculate it. The UTPC, UTPF, UTPN and UTPT are values of the cumulative sampling distributions--chi-square, Snedecor's F, normal and Student's t. We just used UTPN above, and will see how to use the others below. They integrate from x to ∞, so to get the actual cumulative distribution they must be subtracted from 1.
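For readers who want to check these values away from the calculator, here is a sketch in Python of the same two functions. The names `utpn` and `ndist` are stand-ins mirroring the HP-48 commands, implemented with the standard-library complementary error function:

```python
import math

def utpn(m, v, x):
    """Upper-tail normal probability P(X > x) for mean m and variance v,
    the same convention as the HP-48 UTPN function."""
    z = (x - m) / math.sqrt(v)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def ndist(m, v, x):
    """Normal density f(x) for mean m and variance v (like HP-48 NDIST)."""
    s = math.sqrt(v)
    return math.exp(-(x - m) ** 2 / (2 * v)) / (s * math.sqrt(2 * math.pi))

print(round(utpn(0, 1, 1), 4))          # upper tail beyond one sigma: 0.1587
print(round(1 - 2 * utpn(0, 1, 1), 3))  # area within +/- 1 sigma: 0.683
print(round(1 - 2 * utpn(0, 1, 2), 3))  # area within +/- 2 sigma: 0.954
print(round(ndist(0, 1, 0), 4))         # peak of the standard normal: 0.3989
```

The identity used is P(X > x) = ½ erfc[(x - M)/S√2], which is exact for the normal distribution.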

An example of a statistic is the mean m of a sample of n variates, determined by Σx/n. If the variates x are normally distributed about M with variance V, then the mean m is also normally distributed about M with variance V/n. In fact, the mean of a sample is more "normal" than the variates of which it is composed, a fact related to the Central Limit Theorem.

The sample consists of N numbers x_{k}, k = 1 to N, drawn from some population. The mean m is (1/N)Σx, the variance v = [Σ(x - m)^{2}]/(N - 1). The standard deviation s = √v. The variance is divided by N - 1 so that it will be an unbiased estimator of the population variance V. If we are finding the mean M, variance V and standard deviation S of the population itself, then N is used instead of N - 1. When N - 1 is used, the statistic is called the *sample* variance; when N is used, it is called the *population* variance. The HP-48 lets you choose which to use in its calculation. Finding mean and standard deviation is an easy computation, but the HP-48 helps you to avoid errors.

The HP-48 works on an array called ΣDAT, which can be loaded with an array or list, or can be saved in an array. Usually one starts by loading ΣDAT directly. Press → STAT and press OK when the cursor is on Single-Variable. Unless ΣDAT is empty, press DEL and then OK to clear it. When you press EDIT (or →MATRIX), you'll get the form used to input arrays. The next number entered will be stored where the cursor is (a dark field). Use the cursor-moving buttons to move the cursor. Type in the first number and press ENTER. The cursor will move right, but press Cursor Down and it will go beneath the first number. Type in the second number and press Enter. Now the cursor will know where to go automatically, since it realizes you must be typing in a column vector. Repeat for the rest of the numbers in the sample. After you have pressed ENTER for the last number, press it again, and respond with OK to store what you have typed in into ΣDAT. You will then get a normal display of the stack.

Now type → STAT again, and press OK. A form will appear with a number of fields. See that Sample is displayed after TYPE:, which means that N - 1 will be used. Check MEAN and STD DEV or whatever else you would like to see. When you press OK, the mean and standard deviation will be on the stack, labelled. Try the following exercise: x = 56, 51, 63, 60. Mean = 57.5, standard deviation 5.196152. This is very often the first step when you are working with samples of data.
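The same exercise can be reproduced with Python's standard library, whose `statistics.stdev` uses the same N - 1 divisor as the calculator's Sample setting:

```python
import statistics

x = [56, 51, 63, 60]
m = statistics.mean(x)     # Sigma x / N
s = statistics.stdev(x)    # divisor N - 1, the "Sample" convention
print(m, round(s, 6))      # 57.5 5.196152
```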

Variance and standard deviation are very closely related; when you have one, you have the other--they are just square and square root. In the sequel, any mention of variance also includes standard deviation, and vice-versa. Be careful when the HP-48 asks for a variance or standard deviation, and use the correct one.

The "summary statistics" that can be calculated are ΣX, ΣY, ΣX^{2}, ΣY^{2}, ΣXY and N. These were the data originally computed by calculating machines, and can be used in various combinations to calculate other statistics. The mean M = ΣX/N, and the variance (N - 1)s^{2} = Σ(X - M)^{2} = Σ(X^{2} - 2MX + M^{2}) = ΣX^{2} - 2MΣX + NM^{2}. Since M = ΣX/N, this gives (N - 1)s^{2} = ΣX^{2} - (ΣX)^{2}/N. In Langley, ΣX = A and ΣX^{2} = B, so s = √[(B - A^{2}/N)/(N - 1)]. This formula works even when an assumed mean is used, so that A is the sum of the differences from the assumed mean and B is the sum of squares of the differences from the assumed mean. Indeed, if the assumed mean is the actual mean, then A = 0 and B is the sum of squares of deviations from the mean.
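A quick Python check of the shortcut formula against a direct computation, using the four-number sample from the exercise above:

```python
import math, statistics

x = [56, 51, 63, 60]
N = len(x)
A = sum(x)                   # Sigma X = 230
B = sum(v * v for v in x)    # Sigma X^2 = 13306
s = math.sqrt((B - A * A / N) / (N - 1))
print(round(s, 6))                    # 5.196152, from the shortcut formula
print(round(statistics.stdev(x), 6))  # 5.196152, computed directly
```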

The HP-48 does not handle samples expressed as frequencies in bins, but will calculate frequencies and plot a histogram of them.

Let us suppose we have two samples, of sizes N and N', means M and M' and standard deviations s and s'. If we pool the samples to form one sample of size N" = N + N', then the mean of the pooled sample will be M" = (NM + N'M')/(N + N'), or the sum of the means weighted by sample proportion, M" = fM + f'M'. A = NM and A' = N'M' are the sums of the variables. The sum of squares for one sample is B = (N - 1)s^{2} + A^{2}/N, and the sum of squares for the pooled sample will be B" = B + B'. The standard deviation of the pooled sample will then be s" = √[(B" - A"^{2}/N")/(N" - 1)]. If you have the summary statistics for the two samples, then it is very easy to pool the samples. In fact, what we have just done is simply to reconstruct the summary statistics from the mean and standard deviation.
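Here is a sketch of this pooling procedure in Python; the two small samples are invented for illustration, and the result is checked against pooling the raw data directly:

```python
import math, statistics

def pool(N1, M1, s1, N2, M2, s2):
    """Pool two samples given size, mean and sample standard deviation,
    by reconstructing the summary statistics A = Sigma x and B = Sigma x^2."""
    A1, A2 = N1 * M1, N2 * M2
    B1 = (N1 - 1) * s1 ** 2 + A1 ** 2 / N1
    B2 = (N2 - 1) * s2 ** 2 + A2 ** 2 / N2
    N, A, B = N1 + N2, A1 + A2, B1 + B2
    return A / N, math.sqrt((B - A ** 2 / N) / (N - 1))

# made-up samples: pool from summary statistics, then check against raw data
a = [56, 51, 63, 60]
b = [48, 55, 52]
M, s = pool(len(a), statistics.mean(a), statistics.stdev(a),
            len(b), statistics.mean(b), statistics.stdev(b))
print(round(M, 4), round(s, 4))
print(round(statistics.mean(a + b), 4), round(statistics.stdev(a + b), 4))
```

Both lines print the same mean and standard deviation, which is the point: the summary statistics carry all the information needed.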

Suppose N measurements are made of a pair of variables that are supposed to be associated in some way, or *correlated*. Call them x_{i} and y_{i}. There is no need to identify them as the independent and dependent variables, but you may do so if you wish. If they are plotted on rectangular axes, the result is called a *scatter plot*, because the points are usually scattered and do not suggest a smooth curve. Sometimes an approximate linear relation can be guessed by drawing a line with a straightedge that seems to come as close to all the points as possible. This line is called a *line of regression of y on x* if x is assumed to be the independent variable (abscissa). If y is assumed to be the independent variable, then it is the *line of regression of x on y*.

To make this more objective, let's consider the line of regression as supplying estimates Y_{i} of y_{i} for each x_{i}, or Y_{i} = a + bx_{i}. Then, we would like to find the line that minimizes the total squared error E = Σ(y_{i} - Y_{i})^{2}. Taking the derivatives of E with respect to each of the parameters a,b to be determined, and setting them equal to zero, we find the *normal equations*. Solving them, we get Y = a + bX, and b = m_{11}/s_{x}^{2}, where X,Y are the mean values of x_{i} and y_{i}, m_{11} is the average value of the product (x_{i} - X)(y_{i} - Y), called the covariance, and s_{x} is the standard deviation of the x_{i}. All averages are usually calculated using the sample size N in regression analysis.

Exactly the same thing can be done if we consider y the independent variable, and estimate x_{i} by X_{i} = a' + b'y_{i}. In general, a ≠ a', b ≠ b' and the two lines of regression are not the same, but cross at the point X,Y, since both must pass through the means. The *coefficient of correlation*, r = m_{11}/s_{x}s_{y}. It is not hard to show that -1 ≤ r ≤ 1. If r > 0, the line of regression slopes upward to the right, and if r < 0 it slopes downward to the right. If r = 1 or -1, the variates are said to be *perfectly correlated*, and the two lines of regression coincide. If r = 0, the line of regression is horizontal, so the two lines of regression cross at right angles. The variates are said to be *uncorrelated*.
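The regression formulas can be sketched in Python. The data below are invented; y is an exact linear function of x, so the fit should return r = 1 and the two lines of regression coincide:

```python
import math

def regress(x, y):
    """Line of regression of y on x: returns (a, b, r).
    All averages use the sample size N, as is usual in regression."""
    N = len(x)
    X = sum(x) / N
    Y = sum(y) / N
    m11 = sum((xi - X) * (yi - Y) for xi, yi in zip(x, y)) / N  # covariance
    sx2 = sum((xi - X) ** 2 for xi in x) / N                    # variance of x
    sy2 = sum((yi - Y) ** 2 for yi in y) / N
    b = m11 / sx2
    a = Y - b * X                   # the line passes through the means
    r = m11 / math.sqrt(sx2 * sy2)  # coefficient of correlation
    return a, b, r

# perfectly correlated data: y = 1 + 2x
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]
a, b, r = regress(x, y)
print(a, b, r)                      # 1.0 2.0 1.0
```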

The values (x_{i},y_{i}) can be stored in two columns of a matrix using Matrix Writer and saved with a name. To compute the statistics, use the Summary Statistics form displayed by → STAT ↑ OK. With the highlight on ΣDAT, press CHOOS to select the data matrix you have stored and put it into ΣDAT. Put check marks in all the outputs desired, and press OK. From the summary statistics that appear on the stack, all of the quantities you need for regression analysis are available. To find averages, just divide by N, which is one of the statistics returned.

Alternatively, the Fit Data form offers a shortcut. This is brought up with → STAT ↓ ↓ OK. The defaults are x = column 1, y = column 2, and Linear Fit, just what we want here. ΣDAT should, of course, contain the data. If it doesn't, move the cursor onto that field and CHOOS the data. When you press OK, the line of regression, coefficient of correlation, and covariance will be displayed on the stack with their values. To do the regression of x on y, just make x column 2 and y column 1 and repeat. The User's Guide has data to practice on at page 21-3; the coefficient of correlation is 0.8214. The covariance found by the HP-48 is calculated with N - 1 instead of N, so it is larger than the usual statistic. This inflates the correlation coefficient a little, so it is best to stick with covariance calculated with N. Multiply the covariance obtained by the HP-48 by (N - 1)/N to find the covariance as usually defined.

Although this is an excellent and useful feature, grief and disappointment come from considering how it could have been made much more useful with a minor programming addition. There is no way with the HP-48 to *weight* the individual data pairs. Column 3 could easily have been used to hold weights for each point, and then the statistics routines could have handled frequency distributions, which are almost essential for large samples. This would have been very easy to program, but the possibility apparently went right over the heads of the dull suits in control. There is even an HP-48 feature that creates a set of frequencies by putting data in bins, but all it can do is to plot a histogram, not use the frequencies for any good purpose.

It is quite possible for a nonzero correlation coefficient to arise purely by chance. For a correlation to be significant, it must be larger than a certain value that depends on the size of the sample. For N = 100, a coefficient as large as 0.16 can arise one out of 10 times, and a value greater than 0.25 is necessary for a highly significant result (1 in 100 times). For N = 10, a coefficient of 0.5 can arise from chance one in 10 times, and a value of 0.71 is necessary for significance. This shows the danger when small samples are used, which is quite frequent when correlation is looked for. A significance test for r will be mentioned in the next section.

One often wants to know if an observation giving a value x means that the mean value M of a population has changed or not, if x ≠ M. We assume that even if the mean value is still M, observations may give different values depending on chance. Obviously, we can give no objective, unbiased answer to this question knowing only x and M. We can guess, we can judge, but really can form no good opinion. It is necessary to know how much uncertainty there is in any observation x. This can be quantitatively specified if we know the variance or standard deviation of the population, V or S. Then, the statistic z = |M - x|/S is distributed normally with zero mean and unit variance or standard deviation. This is just the difference between the population mean and the observation in units of standard deviations.

If the area under the normal probability curve represents the whole population, then the area under the curve outside the limits -z, +z represents observations further from the mean M than the observation x we have made. An observation giving a value m in this area is very improbable if the area is small compared to the total area under the curve, which is unity. The HP-48 has a built-in function that returns this area for a normal distribution of mean M and variance V, and any observation x. If we use the statistic z, then M = 0, V = 1 and x = z. The function will be found in the menu MTH NXT PROB NXT, and is called UTPN. Push 0, 1 and z on the stack, in that order, and then press the menu key for UTPN. The number that is returned is the area from z to infinity. What we want is twice this, so key in 2 *. The result is the probability that an observation would give a value of z at least this large. It is small if x differs considerably from M. The smaller the probability, the less likely it is that the mean of the population is still M.
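A sketch of the whole z-test in Python, with made-up numbers (population mean M = 100, standard deviation S = 2.5, observation x = 106):

```python
import math

def two_sided_p(z):
    """Probability of an observation at least |z| standard deviations from
    the mean: twice the upper-tail area, i.e. erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# made-up example: population M = 100, S = 2.5, observation x = 106
z = abs(100 - 106) / 2.5        # 2.4 standard deviations away
p = two_sided_p(z)
print(round(p, 4))              # 0.0164: significant, but not highly so
```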

A probability of 1% or less is called *highly significant*. A probability between 5% and 1% is called *significant*, while a probability greater than 5% (one chance in twenty) means *difference not proved*: it may well be that the observation was indeed drawn from a population of mean M. These are just conventional words, with no precise meaning. We actually start by assuming that there is no difference, called the "null hypothesis," and find the probability that the given observation would occur. Unless it is very unlikely that the observation would occur by chance, we must consider that the null hypothesis is not disproved. If the result is highly significant, then there is a good chance that the null hypothesis is false, and there is some difference. This does not prove that the difference was due to any cause, only that it is unlikely to be the result of chance. These terms, and the old argument over one-sided and two-sided probabilities, are really useless when the actual probabilities are available, which they are when you use the HP-48. In olden times, tables only gave the boundaries between "significant" and "highly significant" results, so the terms had some utility. Now they are rather useless, and judgment should be made on the probabilities themselves.

Statistics is often appealed to by the desperate to *prove* some beloved hypothesis, but it can never do so. It can only say that results are probably not due to chance, and may be the result of something else. A "significant" statistical result is only one in which there is enough data to assess the effects of chance, and to show that any difference is not due to chance. Suppose you carefully record the number of people wearing coats and the days on which the water in a certain pond is frozen. You will find a statistically significant correlation between the wearing of coats and the freezing of the pond, so you may announce that wearing of coats keeps body heat in that otherwise would keep the pond unfrozen (or some such absurd hypothesis). Actually, it is just the winter that causes both effects.

Now suppose we draw a sample of N observations. The statistic z then is √N |M - m|/S, where m is the mean of the sample. It is *sharpened* by the factor √N, making z more significant. Taking more than one observation is an excellent way to make the statistic more sensitive. A value of m that is not significant as a single observation may become highly significant when it is the mean of several observations.

A sample opens another possibility. Often we do not know the population standard deviation S. The sample can be used to estimate it, since the sample standard deviation s is such an estimator. The statistic t is expressed in terms of the sample standard deviation as t = √N |M - m|/s. It is very similar to z, but is not distributed according to the normal distribution, because of the effects of chance in using s as an estimator of S. Its distribution is called the t-distribution and is more complicated than the normal distribution. Note that we still use N - 1 in calculating s, but N in t. The number N - 1 is called the number of *degrees of freedom* of t. Push N - 1 and t on the stack, and then use the function UTPT to find the probability. As in the case of z, multiply this probability by 2 to determine significance.

As an exercise, find if the four numbers given above for calculating mean and standard deviation could be drawn from a population with M = 48 by chance. All you have to do is calculate t from the sample mean 57.5, sample standard deviation 5.1962, and N = 4, so there are 3 degrees of freedom. The probability of 3.53% shows that the difference is probably significant.
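Without the calculator, UTPT can be imitated by integrating the t density numerically; a crude Python sketch (the integration cutoff and step count are arbitrary choices, adequate here for a few digits):

```python
import math

def utpt(df, t, upper=60.0, steps=100000):
    """Upper-tail Student t probability (like HP-48 UTPT), by crude
    trapezoidal integration of the t density from t up to a large cutoff."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    pdf = lambda u: c * (1 + u * u / df) ** (-(df + 1) / 2)
    h = (upper - t) / steps
    total = 0.5 * (pdf(t) + pdf(upper))
    for i in range(1, steps):
        total += pdf(t + i * h)
    return total * h

N, m, s, M = 4, 57.5, 5.196152, 48.0
t = math.sqrt(N) * abs(M - m) / s   # about 3.66, with N - 1 = 3 degrees of freedom
p = 2 * utpt(N - 1, t)              # two-sided probability
print(round(100 * p, 1), "%")       # about 3.5 %
```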

If r is the correlation coefficient for a sample with n degrees of freedom, then t = (√n)[r/√(1 - r^{2})] is a t-variate, and the t-test can be used for significance. The result is the probability that a correlation coefficient this large could arise by chance. Since the two parameters of the line of regression are calculated from the data, the degrees of freedom is two less than the number of points. This test should always be applied for small samples, with, say, n < 100.

One good practical lesson that can be drawn from the z- and t-tests is that it is impossible to compare two numbers unless there is some way of estimating the variance. Suppose the class average on a test is 80% and a certain student scores 65%. Does this mean that the 65% student is doing poorly? Well, if the standard deviation is 5%, then most students are in the range 75%-85%, and the student is sucking swamp water. If the standard deviation is 20%, then the student is right in with most of the others. Most numbers presented to the public for their enlightenment are stated with no hint of their variance, and are, therefore, nearly useless.

The chi-square test is used to see if observed data reflect an assumed distribution. If O = observed value and E = expected value, then the chi-squared statistic is the sum of the values of (O - E)^{2}/E, called χ^{2}. If χ^{2} is zero, then O = E and the data are just what is predicted. Any departure of O from E increases χ^{2}. Some difference in O and E is expected due to chance, so chance causes χ^{2} to have a value obeying the χ^{2} distribution. Each term entering into χ^{2} is assumed to be the square of a normally-distributed variate, which is usually quite close to the truth. The theory, which is relatively simple and straightforward, is explained in Weatherburn. We'll use only the results here.

Cholera in Calcutta, 1894 (from Langley, p. 271)

| Treatment | Cholera | No Cholera | Totals |
|---|---|---|---|
| Inoculated | 3 | 276 | 279 |
| Not Inoc. | 66 | 473 | 539 |
| Totals | 69 | 749 | 818 |

The test is best introduced by means of an example. A new inoculation for cholera was tested in Calcutta in 1894, with the results shown in the accompanying table. The study involved 818 individuals, of whom 279 were able to be inoculated, while 539 were not. The null hypothesis is that the inoculation made no difference. Then we can use the combined figures (column totals) to estimate the chance of contracting cholera in the general population, which is 69 out of 818, or 0.08435, about 1 in 12. The expected figures for the 279 who were inoculated are then 23.5 cases of cholera, and for the 539 who were not, 45.5 cases.

Suppose the row and column totals are considered to be given. If the number of cholera cases among the inoculated were 4 instead of 3, then the number of cases among the uninoculated would have to be decreased to 65. There would then be 275 not infected among the inoculated, and 474 among the uninoculated. We see that only one of the four data can be freely changed, while the others follow along. If we changed two data, then the totals would be altered as well. This case is said to have *one degree of freedom*, a very important matter to χ^{2}, since more degrees of freedom introduce more chance variation. A table like this is called a *contingency table*, since the word "contingency" means "uncertainty of occurrence due to chance." It shows a particular outcome of chance happenings. In general, a contingency table has R rows and C columns. Here, R = C = 2. The number of degrees of freedom in such a table, if the totals are to be preserved, is F = (R - 1)(C - 1). Here, F = 1, as we have seen.

Now let's find χ^{2}. In the (1,1) cell, O = 3 while E = 23.5, so this cell contributes (3 - 23.5)^{2}/23.5 = 17.88. Yates found that for 2x2 contingency tables, subtracting 0.5 from |O - E| before squaring it gave more accurate results. Making this correction, we have (20.0)^{2}/23.5 = 17.02 instead. Repeating for the other cells, we get 1.57 for (1,2), 8.79 for (2,1) and 0.81 for (2,2) for a total χ^{2} = 28.19. This happens to be a pretty large value for one degree of freedom, as we can see by using the HP-48 to find the probability of such a large or larger value.

In MTH NXT PROB NXT we find the function UTPC. With the degrees of freedom in level 2 and χ^{2} in level 1 of the stack, press the menu key for UTPC to get 1.1 x 10^{-7}. This is a one-sided probability, so multiply it by 2 for consistency with the other tests. In this case, there could be an argument for using a one-sided probability, but it really makes no practical difference. In this problem, the probability is so small that we see that the null hypothesis is almost definitely wrong. This means that the inoculation must definitely have some effect, and by looking at the data, we see that the effect is positive.
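The whole chi-square calculation can be replayed in Python; for one degree of freedom, the upper-tail probability reduces to erfc(√(χ²/2)). Note that keeping the expected counts exact (23.534 rather than 23.5) gives χ² = 28.27, slightly different from the hand value, with no effect on the conclusion:

```python
import math

# observed 2x2 contingency table from the cholera study (Langley, p. 271)
O = [[3, 276], [66, 473]]
row = [sum(r) for r in O]            # 279, 539
col = [sum(c) for c in zip(*O)]      # 69, 749
total = sum(row)                     # 818

chi2 = 0.0
for i in range(2):
    for j in range(2):
        E = row[i] * col[j] / total  # expected count on the null hypothesis
        chi2 += (abs(O[i][j] - E) - 0.5) ** 2 / E   # Yates's correction

p = math.erfc(math.sqrt(chi2 / 2))   # upper-tail chi-square probability, 1 d.f.
print(round(chi2, 2))                # 28.27 with exact expected counts
print("%.1e" % p)                    # about 1.1e-07, as the text finds
```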

The function UTPC can be used to find the probability of any value of χ^{2} for any number of degrees of freedom from 1 to 499. For one degree of freedom, χ^{2} > 3.84 is significant at the 5% level, or "probably significant," and indicates that the observed data are not consistent with the expected values. Smaller values mean that there is no reason to suspect that the expected values are incorrect.

When the χ^{2}-test is used on discrete data, the expected number in each class should be greater than 5. The number of classes should be greater than 3. The cholera study met these requirements, since the smallest E was 23.5, and there were 4 classes. In fact, with Yates's correction, 2x2 contingency tables are a very useful application of the test.

Let's suppose we have a large number of measurements, and want to find out if they are normally distributed. We can find the mean and standard deviation, and these give us a normal distribution for comparison. The data can be plotted, perhaps on probability paper, and we can look at the plot and see how closely the data are distributed normally. The best way is to plot the cumulative distribution function of the data and compare it with the cumulative distribution function of the assumed normal distribution. There is a very efficient test (Kolmogorov-Smirnov) for seeing if data are distributed normally, but the χ^{2} test can also be used for an objective decision.

To do this, we divide the data into a number N of classes, each holding the data in a certain range of the variate. Then we use the assumed distribution to find the expected number in each class. This number must be greater than 5, and we combine classes to satisfy this requirement. The number of degrees of freedom is the number of classes, less one because the total number of data is fixed, and less two more because the data were used to estimate the mean and standard deviation, which imposes two further conditions. We have turned the continuous distribution of the variate into a discrete distribution by grouping into classes. Now (O - E)^{2}/E can be calculated for each class, and summed to get χ^{2}. The probability of this value is then determined by UTPC for the number of degrees of freedom. We want this probability to be high, confirming the null hypothesis of no difference between observed and expected values. This is called a test of "goodness of fit."

An extended discussion of the chi-square distribution and test can be found in The Chi-Square Statistic.

The binomial distribution was discovered by Jacques Bernoulli, and published in 1713; it was soon applied to games of chance. It is a discrete distribution that deals with events that can go one of two ways, such as the tossing of a coin that can land heads or tails, or contracting or not contracting a disease. In N events or trials, the number K that go one way, called "success," is the thing of interest. In the limit as N → ∞ the ratio K/N → p, the *probability of success*. 0 ≤ p ≤ 1, and 1 - p = q is the *probability of failure*. p = 0 implies impossibility, and p = 1 implies certainty. This is the *frequency* definition of probability, assumed on principle even though an infinite number of trials is impossible in practice. The probability of a sequence of outcomes of successive events is the product of the probabilities of the individual outcomes. The probability that one or the other of two mutually exclusive outcomes of an event occurs is the sum of the probabilities of the outcomes. An example of this is p + q = 1 (an event must either succeed or fail). These are the familiar "laws of probability" that follow directly from the frequency definition of probability. If S = success, and F = failure, then the probability of the sequence SFS is pqp = p^{2}q, where p is the probability of success in one event. The probability of SSS is p^{3}, and so on. In three events, the probability of two successes and one failure, in any order, is the sum of the probabilities of FSS, SFS and SSF, or 3p^{2}q. For an unbiased coin, p = q = 1/2, and this probability becomes 3(1/2)^{2}(1/2) = 3/8, a little better than one chance in 3.

If we expand 1 = (p + q)^{n} by the binomial theorem, we find 1 = p^{n} + np^{n-1}q + [n(n-1)/2]p^{n-2}q^{2} + ... + C(n,k)p^{n-k}q^{k} + ... + q^{n}. The finite sum is the sum of the probabilities of all possible outcomes of n trials, which, of course, is unity. The probability of k failures and n-k successes in any order is multiplied by the number of such outcomes, which is C(n,k) = n!/[k!(n-k)!], the number of combinations of n things taken k at a time. This is a very unpleasant number to evaluate because of the factorials, which retarded the use of the binomial distribution for a very long time. Resort was to laboriously computed tables of these "binomial coefficients." The HP-48 makes it easy with the function COMB in the MTH NXT PROB menu. Put n in level 2 and k in level 1, and press the menu key for COMB, and C(n,k) will be displayed.

Other commands in this menu do things often needed in similar calculations. The function PERM, P(n,k) = n!/(n - k)!, is there, as well as the factorial itself. A random number RAND is available, in the range 0 ≤ RAND < 1; the generator can be randomly seeded by pushing a 0 on the stack and pressing RDZ. A non-zero seed gives you the same sequence of random numbers every time it is used, which is handy for testing routines. The HP-48 does not include any way of generating random normal variates, for example, but these are not needed for the usual applications.

Suppose you want a random sample of 50 from a list of 1000 names. Simply create a list of 50 random numbers, multiply by 1000, and select the individuals with the corresponding numbers. To make a random choice from a list of 10 names, simply use one random number. I seeded with 0, then pressed RAND five times, and took that number, which was .8767. Times 10, this is 8.767, which rounds to 9, so the 9th name is tapped. To overcome the possibility of refusals, make several random choices and try them in order. This will give you an *unbiased* choice, unaffected by any tendency to pick someone you know well, or who is pretty, or who has a short name. The selection procedure must be decided in advance, and not deviated from. Bad samples are the leading cause of error in statistics. Self-selected samples guarantee bias and error (as in phone-in polls). Time spent in selecting a good random sample is richly repaid in validity. We desire a random sample so that the sampling distributions we use will be correct. For other purposes, a proper sample may be made up in another way, according to the problem. A political survey is more useful if it includes voters rather than those who will not influence the outcome.
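In Python the same draw is usually done with `random.sample`, which also avoids a subtle defect of the multiply-by-1000 method: two random numbers can point to the same name. The name list here is a stand-in:

```python
import random

random.seed(0)                            # fixed seed: reproducible choice
names = ["name%03d" % i for i in range(1000)]   # stand-in for the real list
chosen = random.sample(names, 50)         # 50 distinct names, each equally likely
print(len(chosen), len(set(chosen)))      # 50 50: no duplicates
```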

The *binomial test* investigates the probability of an outcome of a binomial sample of K "successes" drawn from N trials with probability p of success in one trial. It is supposed that p < 0.5, so that K is in the smaller class of results. If this is not the case, simply interchange p and q. If K is smaller than average, it should not exceed 4. If larger than average, it should not exceed 20. If it does exceed these limits, approximate Poisson or normal distributions give the results more easily. If p < 0.1, the Poisson distribution can be used instead. These limits are merely to avoid computational tedium; if you are using a computer program, they do not apply. Since we are calculating manually, it may be easier to respect these limits and use the approximations instead if they are violated.

As an example, also taken from Langley, suppose we want to know if Australian aborigines differ from Europeans in allergy to penicillin. Among Australian Europeans, about 0.15 are allergic to penicillin, from studies of large numbers of individuals. However, only one case of allergy resulted in 50 aborigines that were treated with the drug, which would suggest a rate closer to 0.02. However, the sample of 50 was rather small for investigating a rare effect, and we want to know how significant the result was. We assume the null hypothesis that there is no difference in susceptibility of Europeans and aborigines. What, then, would be the probability of only one allergic reaction in 50 trials? This is a binomial question, with the answer C(50,1)pq^{49} = 0.0026.

Before leaping to conclusions, we should add in the probability of no allergic reactions, or C(50,0)q^{50} = 0.0003, since a better question would be: "what is the probability of observing one or less cases of allergy?" In this particular problem, it makes no practical difference. If we had observed 5 cases, we would want to add the probabilities of observing 5, 4, 3, 2, 1 and 0 cases, which would be 0.219. It is clear that the probability of seeing only one case out of 50 is highly significant, and aborigines are not as likely to be allergic to penicillin as Europeans are. If we had seen 5 cases, the result would have been quite probable (one in five) with the same susceptibility of 0.15.
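These numbers are easy to reproduce with Python's exact binomial coefficient `math.comb`:

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 50, 0.15                      # 50 patients, European allergy rate
print(round(binom_pmf(1, n, p), 4))  # 0.0026: exactly one allergic reaction
print(round(binom_pmf(0, n, p), 4))  # 0.0003: none at all
print(round(sum(binom_pmf(k, n, p) for k in range(6)), 3))  # 0.219: five or fewer
```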

To analyze a binomial experiment, all we need to do is to calculate the binomial distribution, which is easily done with the HP-48 (but in the past was very tedious). I suggest that you find the probabilities for K = 0 to 15 in this problem, and make a bar graph. For K > 15, the probabilities are very small, all the way up to K = 50. The interaction of C(50,K) and the probability of a single outcome p^{K}q^{50-K} causes a distinct hump centered on K = 7.5. (Let the value of K correspond to the left-hand edge of the vertical bar.) The width of the hump is about 5 at a height 0.6065 of the maximum (e^{-1/2}); this width corresponds to 2S. The hump looks very much like a normal distribution with M = 7.5 and S = 2.52. If you calculate the height of this normal distribution at maximum, 1/S√2π = 0.158, it will agree very closely (for K = 7, the probability is 0.157 and a maximum). It is slightly skewed towards higher values of K, but this asymmetry only increases its beauty. The normal distribution appears mysteriously all over, and here is another example. Its mean is about Np, and its variance Npq, as you can easily check with N = 50, p = 0.15 and q = 0.85. This approximate normal distribution is often used to make computations easy. I strongly recommend that you carry out this exercise if you have not previously done so, so you will believe the almost unbelievable.
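A sketch of the comparison in Python, putting the exact binomial probabilities beside the approximating normal density with mean Np and standard deviation √(Npq):

```python
import math
from math import comb

n, p = 50, 0.15
q = 1 - p
M = n * p                         # mean of the binomial, 7.5
S = math.sqrt(n * p * q)          # standard deviation, about 2.52

def binom_pmf(k):
    return comb(n, k) * p ** k * q ** (n - k)

def normal_pdf(x):
    return math.exp(-(x - M) ** 2 / (2 * S * S)) / (S * math.sqrt(2 * math.pi))

for k in (5, 7, 10):
    print(k, round(binom_pmf(k), 3), round(normal_pdf(k), 3))
# at the peak, k = 7: binomial 0.157 versus normal 0.155; the slight
# skew shows in k = 5 and k = 10, which the symmetric normal treats alike
```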

If p is small, say p = 0.01, and N is large, say 1000, then the binomial distribution will huddle around K = 0. The average value m = Np can be used in the Poisson expression P(K) = m^{K}e^{-m}/K!. If you want to make a bar plot, calculate the probabilities without using COMB, since the large factorials are indigestible. This is easy to do by canceling.
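On a computer the cancellation trick is unnecessary; Python's exact integer arithmetic digests the large factorials without trouble, so the two distributions can be compared directly:

```python
import math
from math import comb

n, p = 1000, 0.01
m = n * p                           # mean m = Np = 10

def binom_pmf(k):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k):
    return m ** k * math.exp(-m) / math.factorial(k)

# the Poisson values track the exact binomial ones closely
for k in (0, 5, 10, 15):
    print(k, round(binom_pmf(k), 4), round(poisson_pmf(k), 4))
```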

Analysis of Variance (AOV) is a very powerful way to assess the vitality of seeds; the effectiveness of different fertilizers, herbicides or insecticides; or any similar kind of research, not just in agriculture, though it is used extensively there. A description of this procedure will be found in Latin Squares. The significance test used in AOV is Snedecor's F test, called "F" after the eminent statistician R. A. Fisher. The statistic F is the ratio of two variances, the numerator having n1 degrees of freedom and the denominator n2. The HP-48 function UTPF finds the probability for F to be greater than x. n1 is pushed on level 3, n2 on level 2, and x on level 1 of the stack. Then, pressing the menu key for UTPF gives the probability that F is that large on the null hypothesis that the variances are drawn from the same population. A small probability means there probably is a difference, and the null hypothesis is disproved.

R. Langley, *Practical Statistics For Non-Mathematical People* (London: Pan Books, 1968 and Newton Abbot: David and Charles, 1971)

M. J. Moroney, *Facts From Figures* (London: Penguin Books, 1951; 3rd ed. 1956)

S. K. Campbell, *Flaws and Fallacies in Statistical Thinking* (Englewood Cliffs, NJ: Prentice-Hall, 1974)

C. E. Weatherburn, *A First Course in Mathematical Statistics* (Cambridge: C.U.P., 1968)

R. A. Fisher, *Statistical Methods for Research Workers*, 14th ed. (Edinburgh: Oliver and Boyd, 1970).

*HP 48G Series User's Guide* (Corvallis, OR: Hewlett-Packard Co., 1993). Part No. 00048-90126.


Composed by J. B. Calvert

Created 28 April 2003

Last revised 21 February 2005