Estimation: The Ordinary Least Squares (OLS) Method.
The earlier discussion of the scattergram covers both the Problem Description and the Data used in deriving
the results presented here. The estimation method is the classical Ordinary Least Squares (OLS), which is programmed into the
SPSS/win statistical package. The Linear Regression Model (LRM) has the form

Y = A + BX + E

where Y is the DV (in this case, annual Family Food Expenditure), X is the IV (in this case, annual
Family Income), and E is the random error term; it is a proxy for all the uncertain factors
that may also affect family food expenditure. In regression analysis, the Classical Assumptions of the LRM essentially
apply to the error term. A and B are the regression parameters whose numerical values we seek to estimate;
in so doing, we will have succeeded in estimating the underlying Population Regression Line (PRL) using the OLS method.
By using the command sequence presented earlier, SPSS/win automatically implements this method.
Discussion of the Outputs/Results and Related Tests
The results will be discussed in the order in which
SPSS/win generates the outputs. These outputs are presented in the tables below. For
instance, the discussion in part I pertains to the DESCRIPTIVE
STATISTICS table, followed by part II which pertains to
the CORRELATIONS table, and so on. This approach permits a critical analysis of the results and
their implications.
Descriptive Statistics

| | Mean | Std. Deviation | N |
|---|---|---|---|
| Annual Food Expenditure ($000) | 7.965 | 4.664 | 20 |
| Annual Income ($000) | 45.50 | 23.96 | 20 |

Correlations

| | | Annual Food Expenditure ($000) | Annual Income ($000) |
|---|---|---|---|
| Pearson Correlation | Annual Food Expenditure ($000) | 1.000 | .946 |
| | Annual Income ($000) | .946 | 1.000 |
| Sig. (1-tailed) | Annual Food Expenditure ($000) | . | .000 |
| | Annual Income ($000) | .000 | . |
| N | Annual Food Expenditure ($000) | 20 | 20 |
| | Annual Income ($000) | 20 | 20 |

Model Summary (a,b)

| Model | Variables Entered | Variables Removed | R | R Square | Adjusted R Square | Std. Error of the Estimate | Durbin-Watson |
|---|---|---|---|---|---|---|---|
| 1 | Annual Income ($000)(c,d) | . | .946 | .894 | .888 | 1.559 | 2.834 |

a Dependent Variable: Annual Food Expenditure ($000)
b Method: Enter
c Independent Variables: (Constant), Annual Income ($000)
d All requested variables entered.

ANOVA (a)

| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 369.573 | 1 | 369.573 | 151.975 | .000(b) |
| | Residual | 43.773 | 18 | 2.432 | | |
| | Total | 413.346 | 19 | | | |

a Dependent Variable: Annual Food Expenditure ($000)
b Independent Variables: (Constant), Annual Income ($000)

Coefficients (a)

| Model | | Unstandardized Coefficients: B | Unstandardized Coefficients: Std. Error | Standardized Coefficients: Beta | t | Sig. |
|---|---|---|---|---|---|---|
| 1 | (Constant) | -.412 | .764 | | -.539 | .596 |
| | Annual Income ($000) | .184 | .015 | .946 | 12.328 | .000 |

a Dependent Variable: Annual Food Expenditure ($000)

I. Descriptive Statistics Table
1. Annual Food Expenditure
a) The sample mean is 7.965 thousands of dollars. This
means that an average family in the sample spends $7,965 annually on food.
b) The sample standard deviation of 4.664 (thousands of
dollars) is equivalent to one standard deviation of ±$4,664 about the mean value of
$7,965. If expenditure is approximately normally distributed, this implies that about 68.3% of
the families spend between $3,301 and $12,629 annually on food.
2. Annual Income
a) The sample mean is 45.50 thousands of dollars. In terms of
income, this implies that an average family in the sample makes $45,500 annually.
b) The sample standard deviation of 23.96 (thousands of
dollars) is equivalent to ±$23,960 about the mean income of $45,500. Thus, under the same
normality assumption, about 68.3% of the families could be said to make between $21,540 and $69,460 annually.
3. Sample size N (actually 'n') = 20 simply means that all 20 observations were used;
there were no missing values during estimation.
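The one-standard-deviation ranges quoted above can be checked with a few lines of arithmetic. The following is only a minimal sketch in Python (not part of the SPSS/win output); the function name `one_sd_band` is made up for illustration, and the 68.3% reading of the band assumes the data are roughly normally distributed.

```python
# Means and standard deviations read from the Descriptive Statistics table (in $000).
stats = {
    "Annual Food Expenditure": (7.965, 4.664),
    "Annual Income": (45.50, 23.96),
}

def one_sd_band(mean, sd):
    """Return (lower, upper) bounds of mean +/- one standard deviation, in dollars."""
    return round((mean - sd) * 1000), round((mean + sd) * 1000)

# Hypothetical helper applied to both variables.
bands = {name: one_sd_band(m, s) for name, (m, s) in stats.items()}

for name, (lower, upper) in bands.items():
    print(f"{name}: ${lower:,} to ${upper:,}")
```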
II. Correlations Table
This table contains the Pearson sample correlation coefficients of variable i
with variable j (denoted as ri,j), which are the key analytical tools of Correlation Analysis. This
is the same Karl Pearson that I mentioned in the historical footnote
under the discussion of the Chi-square test of Independence (and also in the glossary under regression analysis). Let us
focus for now on the top part of the table. It is a 2 by 2 matrix (i = 1, 2; and j
= 1, 2). The following conclusions are obvious:
1. The correlation of annual Food
Expenditure with itself is perfect, linear, and direct since r1,1 = 1.000.
2. The correlation of annual Food
Expenditure with Income is quite strong, linear and direct because
r1,2 = .946.
3. The 2 x 2 matrix is symmetric about the
main diagonal; hence, all the information about the type and strength of
relationship between the two variables can be obtained from the correlation coefficients
either above the main diagonal or below it.
4. The middle portion of the table contains the p-values (Sig. = significance) for a one-tailed test of
Ho: ρi,j = 0 against Ha: ρi,j > 0, where ρ (rho) is the population correlation coefficient
whose value is unknown. The reported probability or p-value (i.e., the
computed/observed significance level) of .000 means that the p-value is less than .0005, so Ho
can be rejected unequivocally at the critical level of alpha = .01. Thus,
the conclusion in (2) above is indeed valid.
5. Again, N (i.e., 'n') = 20
since all the observations were used in the estimation.
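The one-tailed test of Ho: ρ = 0 can also be checked by hand. The sketch below assumes the standard textbook test statistic t = r·sqrt(n - 2)/sqrt(1 - r²) with n - 2 degrees of freedom, which is not spelled out in this tutorial; the critical value 2.552 is taken from a standard t-table for alpha = .01 and 18 degrees of freedom. SPSS/win reports the p-value directly, so this is only an illustration of the logic.

```python
import math

r = 0.946          # sample correlation from the Correlations table
n = 20             # sample size
T_CRIT_01 = 2.552  # assumed: one-tailed critical t for alpha = .01, df = 18 (t-table)

def corr_t_stat(r, n):
    """t statistic for testing that the population correlation is zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t_obs = corr_t_stat(r, n)
reject_h0 = t_obs > T_CRIT_01  # one-tailed test of Ha: rho > 0

print(f"t = {t_obs:.2f}, reject Ho at alpha = .01: {reject_h0}")
```

Because r is rounded to three decimals, the statistic computed this way is close to, but not exactly, the t value of 12.328 reported in the Coefficients table.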
III. Model Summary Table
This table contains the necessary summary statistics for assessing the
accuracy of the estimated sample regression line (SRL), yi = a + bXi + ei, where 'a' and 'b'
are the estimators of A and B, respectively; and 'e'
denotes the residual as an estimator of the random error term E
previously defined.
Before examining the meaning of the summary statistics, some remarks about the appended
footnotes are in order. Footnotes a and c are self-explanatory. Footnote b relates to the algorithm that SPSS/win uses to
estimate the model/SRL.
'Enter' simply means that annual Family Food Expenditure (DV) is regressed on both the
constant term a and the income (IV) using the OLS method. For those students
who might take Econ. 825, I should mention that SPSS/win can also
implement STEPWISE, BACKWARD, and FORWARD algorithms based on the OLS
estimation procedure. These algorithms are however useful only in the context of multiple
regression analysis where the goal is to select from among many potential IVs those that
significantly influence the DV. That said, let us now discuss the results reported in the
table.
1. R is the sample correlation coefficient (the standard notation is r as discussed earlier). The
meaning of r =.946 is the same as the one
given earlier - that the relationship between annual family Food expenditure and Income is
quite strong, positive and linear.
2. R-square or R² is the sample
Coefficient of Determination (r²
is commonly used in simple regression analysis while R² is appropriately
reserved for multiple regression analysis). It measures the goodness-of-fit
of the estimated SRL in terms of the proportion of the variation in the DV explained by
the fitted sample regression equation or SRL. Thus, the value of r² = .894 simply means that 89.4%
of the variation in annual Family Food Expenditure is explained or accounted for by the
estimated SRL/equation of ýi = -.412 + .184Xi,
which is reported in the Coefficients table (the last one). This information is quite
useful in assessing the overall accuracy of the model. Notice that r² is the square of r =
.946. The implication is that the value of r
can be determined conversely from a simple rule as r = ±√(r²),
where ± is the sign preceding the estimated value of the
slope coefficient b. In this example, the +
sign applies since b = .184.
3. Adjusted R-Square (or r²
with a bar over it) is the sample Coefficient of Determination
after adjusting for the degrees of freedom lost in the process of estimating the
regression parameters. In this case, only two parameters A
and B were estimated; thus, the remaining
degrees of freedom can be determined as v = n - 2, and the adjusted value is
1 - (1 - r²)(n - 1)/(n - 2) = .888. The adjusted r-square
is therefore a better measure of the goodness-of-fit of the estimated SRL than its
nominal/unadjusted counterpart. It is always smaller in value than the unadjusted measure
(unless the fit is perfect). I will
examine the adjusted coefficient of determination in some detail in Econ. 853 and Econ. 976.
4. Standard Error of the Estimate (standard
notation is Se). This summary statistic measures the overall
accuracy or quality of the estimated SRL in terms of the average/standardized unexplained
variation in the DV. Such variation may be due to errors that originate from (i)
chance errors of sampling (sampling errors), which cause the values of a
and b to differ from the true but unknown values of the
parameters A and B; and (ii) possible
variation in the parameters which, according to the Classical Assumptions, are presumed
constant. If these errors are small, on average, then the value of Se
approaches zero (it is exactly equal to zero if the estimated values of the DV, denoted here
as ýi, equal their actual/observed counterparts yi for all i = 1, 2, ..., n).
The larger the errors, the larger Se becomes, with no upper limit; when Se is large relative
to the values of the DV, the estimated SRL must be considered useless, especially if the application involves the
prediction of the DV outside the sample period. Note that Se
is the estimator of the standard deviation
around the true conditional PRL µy/x
= A + BXi, which is denoted as σy/x (strictly speaking, it is Se² that is an unbiased
estimator of the corresponding variance).
In this example, Se = 1.559
means that, on average, the observed values of annual family Food Expenditure
deviate by about ±$1,559 from the estimated regression equation for each value of Income
during the sample period; the deviation could be much larger outside the sample period.
This is why prediction outside the sample period requires the use of the standard errors of the estimators a and b
(denoted, respectively, as Sa and Sb) for
establishing confidence intervals. Note that Sa and Sb take
into account the aforementioned chance errors of sampling. Accounting for parameter
variation would require the application of advanced econometric techniques, which are beyond
the scope of the undergraduate material.
5. Durbin-Watson (DW) Statistic
measures the presence, or lack thereof, of Serial
Correlation (also known as Autocorrelation) among the errors from one
observation (or time period) to other observations (or time periods). Details about the
implications of the existence of autocorrelation will be examined in Econ. 853 and
Econ. 976. For now, suffice it to say that a value of DW =
2.834, being well above 2, means that the residuals ei = yi - ýi (for all i = 1, 2, ..., n) from
the estimated regression model are negatively correlated and strongly so (values below 2 would
instead point to positive correlation). This is
undesirable according to the Classical Assumptions. The ideal value should be 2.00, indicating no autocorrelation.
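As a rough cross-check, the figures in the Model Summary table can be reproduced from the sums of squares reported in the ANOVA table. This is only a sketch: the formulas are the standard textbook ones (r² = ESS/TSS, Se = sqrt(RSS/(n - k))), the variable names are made up, and the last line uses the common approximation DW ≈ 2(1 - ρ), where ρ is the first-order autocorrelation of the residuals; that approximation is an assumption not stated in this tutorial.

```python
import math

# Sums of squares from the ANOVA table.
ESS, RSS, TSS = 369.573, 43.773, 413.346
n = 20       # sample size
k = 2        # number of estimated parameters (A and B)
DW = 2.834   # Durbin-Watson statistic from the Model Summary table

r2 = ESS / TSS                               # coefficient of determination
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)    # adjusted for lost degrees of freedom
se = math.sqrt(RSS / (n - k))                # standard error of the estimate
rho_approx = 1 - DW / 2                      # assumed approximation DW ~ 2(1 - rho)

print(f"r2 = {r2:.3f}, adjusted r2 = {adj_r2:.3f}, Se = {se:.3f}")
print(f"approximate residual autocorrelation = {rho_approx:.3f}")
```

The negative value on the last line is consistent with the reading of DW = 2.834 as indicating negative autocorrelation.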
IV. ANOVA Table
The summary measures reported here are used in the partitioning of the
total variation in the DV according to the identity relation TSS
= ESS + RSS, where TSS is the Total Sum of Squares in the DV, ESS is the
Explained Sum of Squares due to the fitted regression equation or model, and RSS is the
Residual (remaining) Sum of Squares that is unexplained and hence attributable to errors
(i.e., chance sampling errors, and those resulting from parameter variation). Note the
following: (1) The smaller RSS is relative to TSS (or the larger ESS is relative to
TSS), the better the estimated regression equation appears to fit the data. (2) The
underlying principle in the partition of TSS is similar to that of the One-way ANOVA
technique examined earlier. As in that technique, the identity relation carries over to
the associated degrees of freedom in this manner: v = v1
+ v2, where v1 = k - 1
and v2 = n - k, so that v = n - 1; here k denotes the number of parameters that are
estimated. (3) If k is instead defined as the number
of IVs in the model, then v1 = k and v2
= n - k - 1; again, v = v1 + v2 = n - 1.
Note: Some authors use RSS
(regression sum of squares) instead of ESS (explained sum of squares), and ESS (error sum
of squares) instead of RSS (residual sum of squares) so that the identity is stated as TSS
= RSS + ESS. So pay attention to how these acronyms are defined.
From the table, under the df column, v1 = 1, v2
= 18, v = 19, and the observed F value is Fov
= 151.975. In the context of simple
regression analysis, the F-test is not very useful since there is only one IV in the model,
so that assessing the overall significance of the estimated model can be accomplished by
performing a simple t-test on the slope coefficient of the IV. After all, from the
formal conceptual definition of the t- and F-distributions, the value of F = t² (as
a check, Fov = 151.975 and t²
= (12.328)² = 151.97958; the minor
difference in this case is due to rounding during estimation. This is only true in
simple regression; in multiple regression, the F- and t-tests are quite different). Thus,
the t-test of the significance of the causal influence of the only IV provides an adequate
assessment of the significance of the whole model. The next/final discussion presents the
t-test.
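The partition of the sums of squares and the F = t² relation can be verified directly from the numbers in the ANOVA and Coefficients tables. The sketch below is illustrative only; the variable names are made up, and the small gaps between the computed and reported values come from the rounding already present in the SPSS/win output.

```python
# Values from the ANOVA table.
ESS, RSS, TSS = 369.573, 43.773, 413.346
df_reg, df_res = 1, 18
t_obs = 12.328   # t statistic for the slope, from the Coefficients table

identity_gap = abs(TSS - (ESS + RSS))   # TSS = ESS + RSS should hold
msr = ESS / df_reg                      # mean square due to regression
mse = RSS / df_res                      # mean square of the residuals
f_obs = msr / mse                       # observed F statistic
t_squared = t_obs ** 2                  # should match F in simple regression

print(f"TSS - (ESS + RSS) = {identity_gap:.6f}")
print(f"F = {f_obs:.3f}, t squared = {t_squared:.3f}")
```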
V. Coefficients Table
This table contains the estimated regression coefficients (a = -.412, b = .184), and hence
the estimated SRL/equation written as ýi = -.412 + .184Xi.
These interpretations follow:
1. b = .184 represents the marginal effect
of annual family Income on Food Expenditure. The estimated positive sign implies that such
effect is positive while the absolute value implies that Food Expenditure would increase
by $184 for every $1000 increase in Income.
2. a = -.412 has no interpretable meaning
because the average level of family Food Expenditure could not be negative even when no
member of the family is gainfully employed. Relatives, friends, or Uncle Sam can help such a
family. Nonetheless, this value should not be discarded; it plays an important role when
the estimated regression line/equation is used for prediction.
3. Standard errors of the estimators, Sa = .764 and Sb = .015,
measure the precision of the estimated values of a = -.412
and b = .184, respectively, in estimating the true but
unknown values of the corresponding regression parameters A
and B. The closer the values of Sa and Sb are to zero, the higher the precision of
the estimates, suggesting that chance errors due to sampling are not severe. The converse
would suggest the opposite. Thus Sb = .015, which is small relative to b = .184,
implies that b is a precise estimate of the true value of B; whereas Sa = .764, which is large
relative to a = -.412, implies quite the opposite, coupled with the fact that the estimated sign contradicts
common sense or reality.
4. Standardized Coefficient (also called
beta coefficient) for the only IV is the same as the correlation coefficient r = .946. This simply means that family Income is
an important determinant of family Food Expenditure with a strong positive effect.
Application of the standardized coefficient is, however, useful when there are two or more
IVs in the model so that their relative importance can be ranked according to the size
(i.e., the absolute value) of the beta coefficients. See the multiple regression tutorial.
5. Observed/computed t statistic (tov): t-test of the Significance and the Sign of the Regression
Coefficient (B)
As part of investigating the accuracy of the fitted SRL, it is often useful to verify both
the statistical significance and the economic significance (i.e., the sign) of the
regression parameter/coefficient B. For statistical significance, the null
hypothesis is stated as H0: B = 0 against
the alternative Ha: B is not equal to zero. Stated otherwise, H0 says
that Income has no significant causal influence on Food Expenditure; Ha asserts
the contrary.
For alpha = .05 and v = n - k = 20 - 2 = 18, the critical t-value is tcv = t.025,18 = ±2.101. But tov = 12.328
lies far beyond this value; thus, H0
will have to be rejected in favor of Ha,
in which case family Income can be said to have a significant influence on family Food
Expenditure.
An interesting variation of the t-test is to verify the economic significance of the
parameter with respect to the direction of causality of the associated IV. In this
case, the null is phrased as H0: B has a value that is at most zero, against Ha: B > 0
(i.e., its value is strictly positive according to economic theory). At the level of alpha
= .05, the critical t-value is tcv = t.05,18
= +1.734. But tov = 12.328;
thus, H0 of a negative or no effect of Income will have to be rejected unequivocally.
6. Prediction -- using the estimated SRL
Suppose a typical or ith family drawn from
the same population had an annual net Income of $30,000 in 1993 (this is the 8th
family in our sample).
Its estimated annual Food Expenditure, corresponding to Xi
= 30 (i.e., $30,000), would be ýi = -.412 + .184 x 30 = 5.111 thousands of dollars. (With the rounded
coefficients shown, the arithmetic gives 5.108; the figure of 5.111 presumably reflects the unrounded
coefficients.) Thus, $5,111 is the best
estimate of the average annual Food Expenditure for this family. But this family actually
spent 5.8 thousands of dollars or $5,800. Hence, the positive residual of $689 (i.e., e8
= 5800 - 5111) is the amount by which the estimated SRL has underpredicted the annual food
expenditure for this family.
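The t-tests and the prediction discussed in this section can be retraced from the Coefficients table. The following is a minimal sketch, not SPSS/win output: the names `predict`, `T_TWO_TAILED` and `T_ONE_TAILED` are made up for illustration, and the rounded coefficients from the table are used, so the results differ slightly from the reported ones (the t ratio comes out near 12.27 rather than 12.328, and the prediction is 5.108 rather than the 5.111 quoted in the text).

```python
# Estimates and standard errors from the Coefficients table (rounded values).
a, b = -0.412, 0.184
s_a, s_b = 0.764, 0.015

# Critical values quoted in the text for 18 degrees of freedom.
T_TWO_TAILED = 2.101   # alpha = .05, two-tailed
T_ONE_TAILED = 1.734   # alpha = .05, one-tailed

t_a = a / s_a                                 # t ratio for the constant
t_b = b / s_b                                 # t ratio for the slope
reject_two_tailed = abs(t_b) > T_TWO_TAILED   # Ho: B = 0 vs Ha: B != 0
reject_one_tailed = t_b > T_ONE_TAILED        # Ho: B <= 0 vs Ha: B > 0

def predict(income):
    """Estimated food expenditure ($000) for a given income ($000)."""
    return a + b * income

y_hat = predict(30)      # 8th family, income of $30,000
residual = 5.8 - y_hat   # actual minus predicted expenditure ($000)

print(f"t(a) = {t_a:.3f}, t(b) = {t_b:.3f}")
print(f"reject Ho (two-tailed): {reject_two_tailed}, (one-tailed): {reject_one_tailed}")
print(f"predicted = {y_hat:.3f}, residual = {residual:.3f}")
```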
Copyright© 1996, Ebenge Usip, all rights reserved.
Last revised: Monday, November 09, 1998.