Understanding what goes on when you fit a straight line to data or find a correlation coefficient

- Introduction
- Regression Analysis
- A Numerical Illustration
- Significance of the Correlation Coefficient
- References

*Regression* means "coming back," a "retreat" from a position. The term comes from biometrics, the science of expressing biological characteristics in numbers, and then subjecting the numbers to statistical investigation. Before the mechanism of heredity was known, it was the only way of studying Mendelian inheritance with any power, when things like wrinkly or smooth peas proved inadequate. It was the only way to study human heredity, where selective breeding was impractical. A famous early study compared the heights of daughters with the heights of their fathers. Tall fathers seldom had short daughters, and short fathers sired few tall daughters. One might think of increased height as "progress" in the common mindset of the time. Tall fathers would have tall daughters who might then cooperate to produce even taller fathers, with even taller daughters, and so on. However, when the daughters of tall fathers were studied, their average height turned out to be less than expected, and even short ones appeared now and then. They had "regressed" from the progress made by the fathers, so that the average height of daughters increased more slowly than expected.

A *scatter plot* could be made of the data, with father's height on the abscissa, and daughter's height on the ordinate. A line could be drawn to express the dependence of the daughter's height on the father's stature. This line would demonstrate any "regression," so it was called a line of regression of daughter's height (y) on father's height (x). The slope of the line was the *coefficient of regression*. If the father's height had no effect whatsoever on the daughter's height, which would mean that the character of height was not inherited, this line would be horizontal at the average height for daughters, and the coefficient of regression would be zero. On the other hand, if the father's height determined the daughter's height (actually, we should consider mothers, but that detail is not important for our purpose), then the slope of the line would be unity. The actual regression would be between these two limits of 0 and 1. Other problems might show a decrease of y with x, and in that case the coefficient would be negative. The variates y and x are said to be *correlated* in either case.

The use of regression analysis goes far beyond the biometric concept of regression, but the name is retained for convenience, even where entirely inappropriate. Engineers usually encounter it in a different form. Suppose we have a resistance R through which we can pass a current of I milliamperes and across which we can measure the voltage drop V. The engineer confidently knows that V = IR, which implies that V is a linear function of I. However, measurement is always accompanied by error, so dividing measured values of V by the corresponding current will not always give the same value of R. A good way to handle this is to make a scatter plot of y = V and x = I for a series of measurements. Such a series might show the values in the table below.

V-I Characteristic

| I, mA | V, V |
|---|---|
| 20 | 1.8 |
| 50 | 5.7 |
| 60 | 6.4 |
| 100 | 10.2 |
| 120 | 12.1 |
| 140 | 13.5 |
| 160 | 15.7 |
| 180 | 18.2 |

Make a scatter plot of these data. Estimate the best line through them, and draw it with a straightedge. Find the slope of the line, ΔV/ΔI, which will be an estimate of the resistance R in kΩ, since V is in volts and I in milliamperes. The result is clearly close to R = 0.1 kΩ or 100 Ω. Try to find a value to two significant figures. If we just wanted the resistance, a different experiment would be to measure V eight times at a constant value of I, and then find the mean of the sample of eight values. However, the experiment recorded in the table finds more than just the resistance. It shows that the dependence is indeed linear, fitting the straight line V = IR pretty well.
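The remark that dividing each measured V by its current does not give one fixed value of R can be checked directly. The following is a minimal sketch using only the tabulated values; the variable names are illustrative and not part of the original text.

```python
# Data from the V-I table: current in mA, voltage in V.
current_mA = [20, 50, 60, 100, 120, 140, 160, 180]
voltage_V = [1.8, 5.7, 6.4, 10.2, 12.1, 13.5, 15.7, 18.2]

# V / I with I in mA gives kilohms, so multiply by 1000 to get ohms.
ratios_ohm = [1000.0 * v / i for i, v in zip(current_mA, voltage_V)]

for i, r in zip(current_mA, ratios_ohm):
    print(f"I = {i:3d} mA   V/I = {r:6.1f} ohm")

print("smallest ratio:", round(min(ratios_ohm), 1), "ohm")
print("largest ratio: ", round(max(ratios_ohm), 1), "ohm")
```

The individual ratios scatter on both sides of 100 Ω, which is why a line fitted to all the points at once is preferred to any single quotient.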

These two examples show that the concept of correlation and regression can be treated by simple methods that do not involve difficult mathematics. So far, however, our judgment enters into the analysis, which, therefore, contains a subjective element. To make the analysis fully objective, we need the machinery of mathematical statistics. Linear regression turns out to be a rather beautiful thing that can yield lots of information. Too often researchers simply fill in the blanks in a program that then grinds out numbers, which are then quoted in ignorance of their meaning and significance. This is fallacious, and error can be avoided by investigating what the statistics mean. We now turn to this question.

When we draw a straight line on a scatter plot, we establish a linear function y' = a + bx that is an estimator of the value of y, the dependent variable, that corresponds to the value x of the independent variable. This line is called the line of regression of y on x. The *error* at that point is y - y', and we wish to select a line that minimizes the errors at all points. One way to do this is to minimize the function Σ(y - y')^{2}, the sum of the squares of each error. This will penalize large errors more than small ones. We could use Σ|y - y'| instead, the sum of the absolute values of the errors, but this is much more difficult mathematically. We will, therefore, look for a *least-squares* approximation.

The constants a and b are to be chosen to minimize the error E = Σ(y - a - bx)^{2}. To find the extremum (which will be a minimum), the partial derivatives ∂E/∂a and ∂E/∂b are set equal to zero, giving two *normal equations*. The first normal equation can be expressed as Y = a + bX, where X and Y are average values, X = (1/N)Σx and Y = (1/N)Σy. This means that the line of regression passes through the center of gravity (X,Y) of the data in the scatter plot. The second normal equation, after substituting a from the first (a = Y - bX) and taking averages, becomes m_{11} - bs_{x}^{2} = 0, which determines b = m_{11}/s_{x}^{2}.
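For readers who want the intermediate steps, the two normal equations can be written out as follows. This is a sketch of the standard least-squares derivation in the notation of the text, not a quotation from the references.

```latex
\begin{aligned}
E &= \sum (y - a - bx)^2 \\
\frac{\partial E}{\partial a} &= -2\sum (y - a - bx) = 0
  \;\Longrightarrow\; Y = a + bX \\
\frac{\partial E}{\partial b} &= -2\sum x\,(y - a - bx) = 0
  \;\Longrightarrow\; \frac{1}{N}\sum xy = aX + b\,\frac{1}{N}\sum x^2 \\
\text{with } a = Y - bX:\quad
\frac{1}{N}\sum xy - XY &= b\left(\frac{1}{N}\sum x^2 - X^2\right)
  \;\Longrightarrow\; m_{11} = b\,s_x^2
\end{aligned}
```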

The quantity m_{11} is (1/N)Σ(y - Y)(x - X), the average value of the products of the departures of x and y from their mean values X and Y. It is called the *covariance*, and is a measure of how much the two variables change in the same direction, or are correlated. It is proportional to the slope of the regression line. This slope, in fact, is the covariance divided by the variance of the independent variable, s_{x}^{2}.

It is just as valid to consider y as the independent variable, and x as the dependent variable, and to look for a line x = a' + b'y. Since the only difference is the exchange of x and y, the normal equations are the same with this change. Therefore, X = a' + b'Y and m_{11} - b's_{y}^{2} = 0, since m_{11} is symmetrical in x and y. This is the line of regression of x on y.

The two lines of regression are y = a + bx and y = -(a'/b') + (1/b')x. If they are the same line, then the slopes must be equal, so b = 1/b' or bb' = 1. If m_{11} = 0, then b = 0 and b' = 0, so y = Y and x = X, the two lines of regression are horizontal and vertical, passing through (X,Y), and bb' = 0. It is quite consistent, then, to let r = ±√(bb') = m_{11}/(s_{x}s_{y}), with the sign of m_{11}, and to call it the *coefficient of correlation*.
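The relation between the two slopes and the coefficient of correlation follows in one line from the expressions for b and b'. The sketch below uses only the definitions already given; the final inequality is the Cauchy-Schwarz inequality, which is a standard fact not stated in the text.

```latex
b = \frac{m_{11}}{s_x^2}, \qquad
b' = \frac{m_{11}}{s_y^2}, \qquad
bb' = \frac{m_{11}^2}{s_x^2\, s_y^2} = r^2, \qquad
0 \le r^2 \le 1 .
```

Since b, b' and m_{11} always have the same sign, r takes that sign as well, and the two lines of regression coincide only when r^2 = 1.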

All the averages we need (the "moments") are about the mean values X,Y. These can be obtained from the sums of values, of the squares of values, and the products of values. We need N, Σx, Σy, Σx^{2}, Σy^{2} and Σxy. Then, using s_{x} as an example, we have s_{x}^{2} = (1/N)Σ(x - X)^{2} = (1/N)Σx^{2} - 2X(1/N)Σx + X^{2} = (1/N)Σx^{2} - X^{2}. There is a similar expression for s_{y}^{2}, and m_{11} = (1/N)Σxy - XY. These expressions for finding moments about the mean from moments about the origin should be familiar.
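The conversion from sums to moments about the mean is easy to express in code. The following is a minimal sketch; the function name `moments_about_mean` and the three-point toy data are illustrative assumptions chosen so that the arithmetic can be checked by hand.

```python
def moments_about_mean(n, sum_x, sum_y, sum_xx, sum_yy, sum_xy):
    """Convert raw sums (moments about the origin) to moments about the mean."""
    mean_x = sum_x / n
    mean_y = sum_y / n
    var_x = sum_xx / n - mean_x ** 2        # s_x^2
    var_y = sum_yy / n - mean_y ** 2        # s_y^2
    cov_xy = sum_xy / n - mean_x * mean_y   # m_11
    return mean_x, mean_y, var_x, var_y, cov_xy


# Toy data: x = 1, 2, 3 and y = 2, 4, 7.
xs = [1, 2, 3]
ys = [2, 4, 7]
mean_x, mean_y, var_x, var_y, cov_xy = moments_about_mean(
    len(xs),
    sum(xs),
    sum(ys),
    sum(x * x for x in xs),
    sum(y * y for y in ys),
    sum(x * y for x, y in zip(xs, ys)),
)
print(mean_x, mean_y, var_x, var_y, cov_xy)
```

For the toy data the deviations of x are -1, 0, 1, so s_x^2 = 2/3 and m_11 = 5/3 can be confirmed directly. When the mean is large compared with the spread, the subtraction in these formulas can lose precision in floating point, in which case summing the squared deviations directly is safer.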

Another simplification useful in deriving results is to note that the average values X,Y do not affect the scatter in any way, and we are quite free to redefine the variables so that X = Y = 0. Doing this, it is easy to see that the average square error between the line of regression bx and a value y is just S_{y}^{2} = (1/N)Σ(y - bx)^{2}. Since b = m_{11}/s_{x}^{2} = rs_{y}/s_{x}, we have S_{y}^{2} = s_{y}^{2}(1 - r^{2}). S_{y} is called the *standard error of estimation* of the line of regression of y on x.
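The step from the definition of S_y^2 to the final formula can be filled in by expanding the square, with the variables measured from their means as described above. This is a sketch of the algebra in the notation of the text.

```latex
\begin{aligned}
S_y^2 &= \frac{1}{N}\sum (y - bx)^2
       = s_y^2 - 2b\,m_{11} + b^2 s_x^2 \\
      &= s_y^2 - \frac{m_{11}^2}{s_x^2}
       \qquad \left(b = \frac{m_{11}}{s_x^2}\right) \\
      &= s_y^2\left(1 - \frac{m_{11}^2}{s_x^2 s_y^2}\right)
       = s_y^2\,(1 - r^2)
\end{aligned}
```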

The above equation can be rearranged to give s_{y}^{2} = S_{y}^{2} + r^{2}s_{y}^{2}. The total variance in y has been split into two parts, the first part the variance of departures from the straight line, and the second part the variance due to the functional dependence on x. If r = 0, the variance is all scatter of y. If |r| = 1, the variance is all functional dependence. This *analysis of variance* is an important and useful concept with wide application.
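The split of the variance can be verified numerically by computing the scatter about the line directly from the residuals and comparing it with s_y^2(1 - r^2). The following is a sketch using the V-I data from the Introduction; the variable names are illustrative.

```python
# Data from the V-I table: x = current in mA, y = voltage in V.
current_mA = [20, 50, 60, 100, 120, 140, 160, 180]
voltage_V = [1.8, 5.7, 6.4, 10.2, 12.1, 13.5, 15.7, 18.2]

n = len(current_mA)
mean_x = sum(current_mA) / n
mean_y = sum(voltage_V) / n
var_x = sum((x - mean_x) ** 2 for x in current_mA) / n
var_y = sum((y - mean_y) ** 2 for y in voltage_V) / n
cov_xy = sum((x - mean_x) * (y - mean_y)
             for x, y in zip(current_mA, voltage_V)) / n

b = cov_xy / var_x                 # slope of the regression of y on x
a = mean_y - b * mean_x            # intercept
r_squared = cov_xy ** 2 / (var_x * var_y)

# Scatter about the line, computed directly from the residuals (S_y^2).
residual_var = sum((y - (a + b * x)) ** 2
                   for x, y in zip(current_mA, voltage_V)) / n
# Part of the variance carried by the functional dependence (r^2 s_y^2).
explained_var = r_squared * var_y

print("total variance     :", var_y)
print("about the line     :", residual_var)
print("due to dependence  :", explained_var)
print("sum of the two     :", residual_var + explained_var)
```

The last two printed lines should reproduce the total variance to within rounding, with almost all of it in the second part because r is so close to 1 for these data.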

We shall analyze the problem given by the table in the Introduction. The calculations are not difficult to do by hand, but I will perform them with the HP-48 to show how the calculator can be used, while indicating how hand calculations proceed.

Enter the 8 x 2 matrix of data into the HP-48 and save it as TEST. The use of the HP-48 for statistical calculations is explained in Statistics with the HP-48. If you are not using an HP-48, sum the values of I, I^{2}, V, V^{2} and IV. If you are using an HP-48, go to → STAT and move the cursor down to Summary Stats and press OK. With the cursor on ΣDAT, CHOOS TEST. Then set X-COL to 1, and Y-COL to 2. Then put checks at all the CALCULATE: selections, and press OK. The stack will now contain, from level 1 up: NΣ: 8, ΣXY: 10855.0000, ΣY2: 1087.1200, ΣX2: 108500.0000, ΣY: 83.6000, ΣX: 830.0000. Use the cursor movement key ↑ to see the upper numbers. Then use ENTER to return to the normal stack display. If you are calculating by hand, check your numbers against these.
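For readers without an HP-48, the six summary quantities can be obtained with a few lines of code instead of by hand. This is a minimal sketch; the dictionary keys are illustrative names for the quantities the calculator labels NΣ, ΣX, ΣY, ΣX2, ΣY2 and ΣXY.

```python
# Data from the V-I table: X = current in mA, Y = voltage in V.
current_mA = [20, 50, 60, 100, 120, 140, 160, 180]
voltage_V = [1.8, 5.7, 6.4, 10.2, 12.1, 13.5, 15.7, 18.2]

sums = {
    "N": len(current_mA),
    "sum_x": sum(current_mA),
    "sum_y": sum(voltage_V),
    "sum_xx": sum(x * x for x in current_mA),
    "sum_yy": sum(y * y for y in voltage_V),
    "sum_xy": sum(x * y for x, y in zip(current_mA, voltage_V)),
}

for name, value in sums.items():
    print(f"{name:7s} {value:.4f}")
```

The printed values can be compared with the calculator's stack listing above.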

For practice, calculate the single-variable statistics for I and V separately, using → STAT but choosing SINGLE-VARIABLE STATISTICS. Set the column number to choose either I or V, check at least MEAN and STD DEV, and find both Sample and Population statistics. The means will, of course, agree with the ones found in the previous paragraph. Find the variance by (1/N)ΣX2 - X^{2}, and compare the square root with the function output. It will be clear that Sample statistics uses N - 1 to average, while Population uses N, and is what you want for regression analysis.
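The difference between Sample and Population statistics can also be seen with Python's standard `statistics` module, where `stdev` divides by N - 1 and `pstdev` divides by N. This is a sketch for comparison only; it is not what the HP-48 does internally.

```python
import statistics

# Data from the V-I table.
current_mA = [20, 50, 60, 100, 120, 140, 160, 180]
voltage_V = [1.8, 5.7, 6.4, 10.2, 12.1, 13.5, 15.7, 18.2]

for label, data in (("I (mA)", current_mA), ("V (V)", voltage_V)):
    mean = statistics.mean(data)
    sample_sd = statistics.stdev(data)        # divides by N - 1
    population_sd = statistics.pstdev(data)   # divides by N
    print(f"{label}: mean = {mean:.4f}, "
          f"sample s.d. = {sample_sd:.4f}, "
          f"population s.d. = {population_sd:.4f}")

# The population variance is the one given by (1/N) * sum(x^2) - mean^2.
n = len(current_mA)
var_from_sums = sum(x * x for x in current_mA) / n - statistics.mean(current_mA) ** 2
print("variance of I from sums:", var_from_sums)
```

Squaring the population standard deviation of I should reproduce the variance found from the sums, while the sample value is larger by the factor N/(N - 1).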

Call up → STAT again, but this time use the Fit data choice. The TEST data should already be in ΣDAT, but if not press CHOOS and select TEST from the alternatives. The cursor should be on the first line, of course. Check that X-COL is 1 and Y-COL is 2, and if not, adjust them. The Linear Fit model, the default, should be selected. Press OK to calculate. After a brief think, the calculator will give: '0.3403 + 0.0974*X', Correlation: 0.9978, and Covariance 311.6429. The line of regression of y on x is given as an expression ready to evaluate. The correlation is very close to 1, as it should be in cases like this. The covariance is calculated with what the HP-48 thinks is the number of degrees of freedom, which here is 8 - 1 = 7. It should be calculated for N = 8, so multiply it by 7/8 to get 272.6875, which is the proper covariance. We used 7/8 and not the square root because it is the product of two standard deviations that is being used.
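The factor 7/8 between the two covariances, and the fact that the fitted line does not depend on which divisor is used, can be checked as follows. This is a sketch with illustrative variable names; the slope is unaffected because the divisor cancels when the covariance is divided by a variance computed the same way.

```python
# Data from the V-I table: x = current in mA, y = voltage in V.
current_mA = [20, 50, 60, 100, 120, 140, 160, 180]
voltage_V = [1.8, 5.7, 6.4, 10.2, 12.1, 13.5, 15.7, 18.2]

n = len(current_mA)
mean_x = sum(current_mA) / n
mean_y = sum(voltage_V) / n

# Sums of products and squares of the departures from the means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(current_mA, voltage_V))
sxx = sum((x - mean_x) ** 2 for x in current_mA)

cov_sample = sxy / (n - 1)      # divisor N - 1, as the calculator reports
cov_population = sxy / n        # divisor N, as used in the text

slope_population = cov_population / (sxx / n)
slope_sample = cov_sample / (sxx / (n - 1))
intercept = mean_y - slope_population * mean_x

print(f"covariance with N - 1: {cov_sample:.4f}")
print(f"covariance with N    : {cov_population:.4f}")
print(f"fitted line          : {intercept:.4f} + {slope_population:.4f}*X")
```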

Now find a, b, r and the covariance using the Summary Statistics and the formulas found above. They should agree exactly. If you repeat the calculation with V as the independent variable, you find '-3.0260 + 10.2178*X', Correlation: 0.9978, and Covariance 311.6429. As a check, bb' = 0.0974*10.2178 = 0.9952, which should equal r^{2} = 0.9978^{2} = 0.9956. There is a little round-off error here; using the full precision, bb' and r^{2} would agree exactly. The square of the standard error of estimation is S_{V}^{2} = 26.6875(1 - 0.9978^{2}) = 0.1173, or S_{V} = 0.3425 V. Similarly, S_{I}^{2} = 2798.4375(1 - 0.9978^{2}) = 12.2996, so S_{I} = 3.5071 mA. These figures use r rounded to four places; because 1 - r^{2} is a small difference, it is sensitive to that rounding, and full precision gives the slightly smaller values S_{V}^{2} = 0.1161 and S_{I}^{2} = 12.17. The best value for the resistance is 97.4428 Ω. The high correlation of 0.9978 shows that the relation is linear to a very good approximation.
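Both lines of regression and the check on the product of the slopes can be reproduced at full precision. The following is a sketch; the helper name `moments` is an illustrative assumption, and divisor N is used throughout as in the text.

```python
import math


def moments(xs, ys):
    """Return means, variances and covariance, all with divisor N."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return mean_x, mean_y, var_x, var_y, cov_xy


current_mA = [20, 50, 60, 100, 120, 140, 160, 180]
voltage_V = [1.8, 5.7, 6.4, 10.2, 12.1, 13.5, 15.7, 18.2]

mean_i, mean_v, var_i, var_v, cov_iv = moments(current_mA, voltage_V)

b = cov_iv / var_i                  # regression of V on I
a = mean_v - b * mean_i
b_prime = cov_iv / var_v            # regression of I on V
a_prime = mean_i - b_prime * mean_v
r = cov_iv / math.sqrt(var_i * var_v)

print(f"V on I: {a:.4f} + {b:.4f}*I")
print(f"I on V: {a_prime:.4f} + {b_prime:.4f}*V")
print(f"r = {r:.4f}, r^2 = {r * r:.6f}, b*b' = {b * b_prime:.6f}")
print(f"S_V = {math.sqrt(var_v * (1 - r * r)):.4f} V")
print(f"S_I = {math.sqrt(var_i * (1 - r * r)):.4f} mA")
print(f"R = {1000 * b:.4f} ohm")
```

At full precision the product of the two slopes agrees with r^2 to within floating-point error, which confirms that the small discrepancy seen with four-figure values is only round-off.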

It is quite possible for a small sample to show a large correlation coefficient even when the data are actually uncorrelated. When a correlation coefficient is calculated, it is necessary to compare it with the largest coefficient that could arise purely by chance from uncorrelated data. This depends on the size of the sample. The number of degrees of freedom for a regression analysis on N data pairs is N - 2, because two parameters are estimated from the data, the coefficients in the line of regression, a and b. Since these are to be constant, the equations that determine them restrict the values of the remaining data. In the present example, N = 8, so d.f. = 6. For 6 degrees of freedom, the table on p. 211 of Fisher shows that a value of |r| as large as 0.6215 can be expected purely by chance from uncorrelated data about once in 10 independent samples, and one as large as 0.8343 about once in a hundred. For high significance, then, r must exceed 0.8343 for a sample of this size before we can be reasonably confident that it is not due to chance. Our value of 0.9978 is comfortably higher, so we are pretty sure that there is a correlation, in agreement with our impressions of the scatter diagram.
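The tabulated thresholds can be made plausible by simulation: draw many samples of eight uncorrelated pairs and count how often |r| reaches each value. The sketch below assumes independent normally distributed variates and a fixed random seed; the function names, the seed and the number of trials are illustrative choices, not part of the original text.

```python
import math
import random


def correlation(xs, ys):
    """Coefficient of correlation r = m11 / (s_x * s_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def chance_exceedance(n_pairs, threshold, trials, seed):
    """Fraction of uncorrelated samples whose |r| reaches the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_pairs)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n_pairs)]
        if abs(correlation(xs, ys)) >= threshold:
            hits += 1
    return hits / trials


frac_10 = chance_exceedance(8, 0.6215, trials=20000, seed=1)
frac_01 = chance_exceedance(8, 0.8343, trials=20000, seed=1)
print("fraction with |r| >= 0.6215:", frac_10)
print("fraction with |r| >= 0.8343:", frac_01)
```

The two fractions should come out near 0.10 and 0.01 respectively, within the sampling error of the simulation.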

It can be shown that t = (√n)[r/√(1 - r^{2})] is a t-variate, where n is the number of degrees of freedom. For the example, t = 36.8665, and the UTPT test gives a probability of 1.32 x 10^{-8}, which is certainly small enough for high significance. The UTPT test is in MTH NXT PROB NXT on the HP-48, and takes the degrees of freedom in level 2, t in level 1. If you are working without an HP-48, a table of the t-variate will do. Tables have been constructed that are entered with the value of r itself, so that the t-statistic need not be calculated.
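The t-statistic and its upper-tail probability can be computed without a calculator or table. The sketch below uses r rounded to 0.9978 as in the text, and evaluates the tail with a standard closed-form series that holds only for an even number of degrees of freedom; the function name `upper_tail_t_even_df` is an illustrative assumption and is not the HP-48's UTPT routine.

```python
import math


def upper_tail_t_even_df(t, df):
    """P(T > t) for Student's t with an even number of degrees of freedom."""
    if df < 2 or df % 2 != 0:
        raise ValueError("this closed form needs an even df >= 2")
    cos2 = df / (df + t * t)                 # cos^2(theta), theta = atan(t / sqrt(df))
    sin_theta = t / math.sqrt(df + t * t)
    term = 1.0
    series = 1.0
    for k in range(1, df // 2):
        term *= (2 * k - 1) / (2 * k) * cos2
        series += term
    inside = sin_theta * series              # P(-t < T < t)
    return 0.5 * (1.0 - inside)


r = 0.9978          # correlation, rounded as in the text
df = 6              # N - 2 for N = 8 data pairs
t = math.sqrt(df) * r / math.sqrt(1.0 - r * r)
p = upper_tail_t_even_df(t, df)

print(f"t = {t:.4f}")
print(f"upper-tail probability = {p:.3e}")
```

The probability comes out on the order of 10^-8, in line with the calculator result quoted above. Because it is formed as a small difference from 1, the last digits are not reliable for very large t.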

The sensitivity of the correlation coefficient to random variations is an excellent thing to keep in mind when pondering the relationship of two variables. With small samples (say, N < 100) chance variations can easily give the appearance of correlation when in fact none exists. This trips up many investigations that continue until a definite result is reached, which is usually the chance happening of a spurious correlation, which is then published. The fact that correlation analysis is often undertaken with small samples adds to the confusion. The only remedy is to consider the significance of the correlation with reference to sample size. Statistics often disappoints the enthusiast at this point, when additional data may be difficult or expensive to acquire.

C. E. Weatherburn, *A First Course in Mathematical Statistics* (Cambridge: Cambridge University Press, 1968). Chapter IV.

R. A. Fisher, *Statistical Methods for Research Workers*, 14th ed. (Edinburgh: Oliver and Boyd, 1970). Chapter VI.

Composed by J. B. Calvert

Created 2 May 2003

Last revised