Definition 1: We use the same terminology as in Definition 3 of Regression Analysis, except that the degrees of freedom dfRes and dfReg are modified to account for the number k of independent variables: dfReg = k, dfRes = n − k − 1 and dfT = n − 1.

Proof: These properties are the multiple regression counterparts to Properties 2, 3 and 5f of Regression Analysis, respectively, and their proofs are similar.

Observation: From Property 2 and the second assertion of Property 3, R² = SSReg/SST, which is the multivariate version of Property 1 of Basic Concepts of Correlation.

Property 4: MSRes = SSRes/dfRes is an unbiased estimator of σ², where σ² is the variance of the error terms.

Observation: Based on Property 4 and Property 4 of Multiple Regression using Matrices, the covariance matrix of B can be estimated by C = MSRes ∙ (XᵀX)⁻¹. In particular, the diagonal of C = [cij] contains the variances of the bj, and so the standard error of bj can be expressed as s.e.(bj) = √cjj.

Example 1: Calculate the linear regression coefficients and their standard errors for the data in Example 1 of Least Squares for Multiple Regression (repeated below in Figure 1) using matrix techniques.

Figure 1 – Creating the regression line using matrix techniques

The result is displayed in Figure 1. Range E4:G14 contains the design matrix X and range I4:I14 contains Y. The matrix (XᵀX)⁻¹ in range E17:G19 can be calculated using the array formula

=MINVERSE(MMULT(TRANSPOSE(E4:G14),E4:G14))

Per Property 1 of Multiple Regression using Matrices, the coefficient vector B (in range K4:K6) can be calculated using the array formula

=MMULT(E17:G19,MMULT(TRANSPOSE(E4:G14),I4:I14))

The predicted values of Y, i.e. Y-hat, can then be calculated using the array formula

=MMULT(E4:G14,K4:K6)

The standard error of each of the coefficients in B can be calculated as follows. First calculate the array of error terms E (range O4:O14) using the array formula =I4:I14-M4:M14. Then, just as in the simple regression case, SSRes = DEVSQ(O4:O14) = 277.36, dfRes = n − k − 1 = 11 − 2 − 1 = 8 and MSRes = SSRes/dfRes = 34.67 (see Multiple Regression Analysis for more details).

By the Observation following Property 4, MSRes ∙ (XᵀX)⁻¹ is the covariance matrix for the coefficients, and so the square roots of its diagonal terms are the standard errors of the coefficients. In particular, the standard error of the intercept b0 (in cell K9) is given by the formula =SQRT(I17), the standard error of the color coefficient b1 (in cell K10) by =SQRT(J18), and the standard error of the quality coefficient b2 (in cell K11) by =SQRT(K19).
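The matrix steps above can be sketched in Python with NumPy. The diamond data of Example 1 is not reproduced here, so the data below is made up; the cell-range comments only indicate which Excel ranges each quantity corresponds to:

```python
import numpy as np

# Made-up stand-in data: 11 observations, two predictors plus an intercept.
rng = np.random.default_rng(0)
n, k = 11, 2
X = np.column_stack([np.ones(n),                # intercept column
                     rng.integers(1, 10, n),    # e.g. color
                     rng.integers(1, 10, n)])   # e.g. quality
y = X @ np.array([1.75, 4.90, 3.76]) + rng.normal(0, 5, n)

XtX_inv = np.linalg.inv(X.T @ X)   # (X^T X)^-1, as in range E17:G19
b = XtX_inv @ X.T @ y              # coefficient vector B (K4:K6)
y_hat = X @ b                      # predicted values (M4:M14)
e = y - y_hat                      # error terms E (O4:O14)

SS_res = e @ e                     # residuals average 0, so this matches DEVSQ(e)
df_res = n - k - 1                 # 11 - 2 - 1 = 8
MS_res = SS_res / df_res
C = MS_res * XtX_inv               # estimated covariance matrix of B
se = np.sqrt(np.diag(C))           # standard errors of b0, b1, b2
```

The coefficient formula and the covariance estimate follow the text exactly; only the data is synthetic.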
Excel Functions: The functions SLOPE, INTERCEPT, STEYX and FORECAST don't work for multiple regression, but the functions TREND and LINEST do support it, as does the Regression data analysis tool.

TREND works exactly as described in Method of Least Squares, except that the second parameter R2 will now contain the data for all the independent variables.

LINEST works just as in the simple linear regression case, except that instead of a 5 × 2 output region, a 5 × (k + 1) region is required, where k = the number of independent variables. Thus for a model with 3 independent variables you need to highlight an empty 5 × 4 region. As before, you need to manually add the appropriate labels for clarity.

The Regression data analysis tool works exactly as in the simple linear regression case, except that additional charts are produced for each of the independent variables.

Example 2: We revisit Example 1 of Multiple Correlation, analyzing the model in which the poverty rate can be estimated as a linear combination of the infant mortality rate, the percentage of the population that is white and the violent crime rate (per 100,000 people). We need to find the parameters b0, b1, b2 and b3 such that

Poverty (predicted) = b0 + b1 ∙ Infant + b2 ∙ White + b3 ∙ Crime

We illustrate how to use TREND and LINEST in Figure 2.

Figure 2 – TREND and LINEST for data in Example 1

Here we show the data for the first 15 of the 50 states (columns A through E) and the poverty percentage forecasted when infant mortality, the percentage of whites in the population and the crime rate are as indicated (range G6:J8). Highlighting the range J6:J8, we enter the array formula =TREND(B4:B53,C4:E53,G6:I8). As we can see from Figure 2, the model predicts a poverty rate of 12.87% when infant mortality is 7.0, whites make up 80% of the population and violent crime is 400 per 100,000 people.
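What TREND does can be sketched in NumPy: fit the least-squares plane to the known y's and x's, then evaluate it at the new x values. The 50-state poverty data set is not reproduced here, so the data below is a made-up stand-in:

```python
import numpy as np

# Stand-in data: 50 rows, three predictor columns (infant, white, crime).
rng = np.random.default_rng(1)
known_x = rng.normal(size=(50, 3))
known_y = 4.0 + 1.2 * known_x[:, 0] + 0.03 * known_x[:, 1] + rng.normal(0, 1, 50)

A = np.column_stack([np.ones(50), known_x])      # prepend the intercept column
b, *_ = np.linalg.lstsq(A, known_y, rcond=None)  # fit b0, b1, b2, b3

new_x = np.array([[7.0, 80.0, 400.0]])           # one row per prediction, as in G6:I8
pred = np.column_stack([np.ones(len(new_x)), new_x]) @ b
```

With real data, `pred` would play the role of the TREND output in range J6:J8.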
Figure 2 also shows the output from LINEST after we highlight the shaded range H13:K17 and enter =LINEST(B4:B53,C4:E53,TRUE,TRUE). The column headings b1, b2, b3 and intercept refer to the first two rows only (note the order of the coefficients). The remaining three rows have two values each, labeled on the left and the right. Thus we see that the regression line is

Poverty = 0.437 + 1.279 ∙ Infant Mortality + 0.0363 ∙ White + 0.00142 ∙ Crime

where Poverty represents the predicted value. We also see that R Square is .337 (i.e. 33.7% of the variance in the poverty rate is explained by the model), the standard error of the estimate is 2.47, etc.

We can also use the Regression data analysis tool to produce the output in Figure 3.

Figure 3 – Output from Regression data analysis tool

Since the p-value = 0.00026 < .05 = α, we conclude that the regression model is a significantly good fit; i.e. there is only a 0.026% probability of getting a correlation this high (.58) assuming that the null hypothesis is true. Note that the p-values for all the coefficients, with the exception of the coefficient for infant mortality, are bigger than .05. This means that we cannot reject the hypothesis that these coefficients are zero (and so these variables can potentially be eliminated from the model). This is also confirmed by the fact that 0 lies between the lower 95% and upper 95% values (i.e. inside the 95% confidence interval) for each of these coefficients.

If we rerun the Regression data analysis tool using only the infant mortality variable, we get the results shown in Figure 4.

Figure 4 – Reduced regression model for Example 1

Once again we see that the model Poverty = 4.27 + 1.23 ∙ Infant Mortality is a good fit for the data (p-value = 1.96E-05 < .05). We also see that both coefficients are significant. Most importantly, we see that R Square is 31.9%, which is not much smaller than the R Square value of 33.7% that we obtained from the larger model (in Figure 3).
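The full-versus-reduced comparison can be sketched numerically: R² can only decrease when variables are dropped, and the partial-F statistic measures whether the decrease is significant. The data below is made up (only the first predictor matters, echoing the infant mortality situation):

```python
import numpy as np

def ss_res(X, y):
    """Residual sum of squares for an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    e = y - A @ b
    return e @ e

rng = np.random.default_rng(3)
n = 50
X = rng.normal(size=(n, 3))                    # three made-up predictors
y = 4.0 + 1.2 * X[:, 0] + rng.normal(0, 1, n)  # only the first one matters

ss_tot = np.sum((y - y.mean()) ** 2)
ss_full = ss_res(X, y)                         # all three predictors
ss_red = ss_res(X[:, :1], y)                   # first predictor only
r2_full = 1 - ss_full / ss_tot
r2_red = 1 - ss_red / ss_tot

m = 2                                          # number of variables dropped
F = ((ss_red - ss_full) / m) / (ss_full / (n - 3 - 1))
```

A large F (relative to the F(m, n − k − 1) distribution) would mean the dropped variables do matter; a small F supports eliminating them.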
All of this indicates that the White and Crime variables are not contributing much to the model and can be dropped. See Testing the Significance of Extra Variables on the Regression Model for more information about how to test whether independent variables can be eliminated from the model.

Example 3: Determine whether the regression model for the data in Example 1 of Method of Least Squares for Multiple Regression is a good fit using the Regression data analysis tool. The results of the analysis are displayed in Figure 5.

Figure 5 – Output from the Regression data analysis tool

Since the p-value = 0.00497 < .05, we reject the null hypothesis and conclude that the regression model Price = 1.75 + 4.90 ∙ Color + 3.76 ∙ Quality is a good fit for the data. Note that all the coefficients are significant. The fact that R Square = .85 indicates that a good deal of the variability of Price is captured by the model.

Observation: We can calculate all the entries in the Regression data analysis output in Figure 5 using Excel formulas, as follows:

Regression Statistics
ANOVA
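As a sketch of where the ANOVA entries come from, the following snippet (with made-up data for two predictors) computes the SS, df, MS and F values, using dfReg = k and dfRes = n − k − 1:

```python
import numpy as np

# Made-up data echoing the Example 3 layout: n = 11, k = 2 predictors.
rng = np.random.default_rng(4)
n, k = 11, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.75, 4.9, 3.76]) + rng.normal(0, 2, n)

b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b
SS_res = np.sum((y - y_hat) ** 2)
SS_tot = np.sum((y - y.mean()) ** 2)
SS_reg = SS_tot - SS_res                 # regression sum of squares

df_reg, df_res = k, n - k - 1            # 2 and 8 here
MS_reg = SS_reg / df_reg
MS_res = SS_res / df_res
F = MS_reg / MS_res                      # compared against F(df_reg, df_res) for the p-value
```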
Coefficients (in the third table) – we show how to calculate the intercept fields; the color and quality fields are similar
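The coefficient-table columns (coefficient, standard error, t Stat, P-value and the 95% limits) can be sketched as follows. The data is made up, and SciPy's t distribution is assumed to be available for the p-values and critical value:

```python
import numpy as np
from scipy import stats

# Made-up data: 11 observations, intercept plus two predictors.
rng = np.random.default_rng(5)
n = 11
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.75, 4.9, 3.76]) + rng.normal(0, 2, n)

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                        # Coefficients column
e = y - X @ b
df_res = n - 2 - 1                           # n - k - 1
MS_res = (e @ e) / df_res
se = np.sqrt(MS_res * np.diag(XtX_inv))      # Standard Error column

t_stat = b / se                              # t Stat column
p_value = 2 * stats.t.sf(np.abs(t_stat), df_res)   # two-sided P-value column
t_crit = stats.t.ppf(0.975, df_res)
lower95 = b - t_crit * se                    # Lower 95% column
upper95 = b + t_crit * se                    # Upper 95% column
```

Note that a coefficient's 95% interval contains 0 exactly when its p-value exceeds .05, which is the consistency remarked on in Example 2.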
The remaining output from the Regression data analysis is shown in Figure 6.

Figure 6 – Residuals/percentile output from Regression

Residual Output

Observations 1 through 11 correspond to the raw data in A4:C14 (from Figure 5). In particular, the entries for Observation 1 can be calculated as follows:
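A sketch of how the Residual Output columns arise (the data and fitted model below are made up, and dividing by the residuals' sample standard deviation is one common convention for the standard residuals):

```python
import numpy as np

# Made-up design matrix (intercept plus two predictors) and observed y.
X = np.array([[1.0, 3.0, 4.0],
              [1.0, 5.0, 2.0],
              [1.0, 2.0, 6.0],
              [1.0, 7.0, 3.0],
              [1.0, 4.0, 5.0]])
y = np.array([30.0, 32.0, 33.0, 45.0, 40.0])

b = np.linalg.solve(X.T @ X, X.T @ y)
predicted = X @ b                      # "Predicted Price" column
residual = y - predicted               # "Residuals" column
std_residual = residual / np.std(residual, ddof=1)   # "Standard Residuals" column
```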
Probability Output
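The Probability Output table pairs each sorted y value with a percentile; the usual convention is 100(i − 0.5)/n, sketched here with made-up data:

```python
import numpy as np

# Made-up observed y values.
y = np.array([35.0, 30.0, 42.0, 38.0, 33.0, 40.0, 36.0])
n = len(y)

percentile = 100 * (np.arange(1, n + 1) - 0.5) / n   # Percentile column
sorted_y = np.sort(y)                                # sorted y column
pairs = list(zip(percentile, sorted_y))              # rows of the Probability Output
```

Plotting `sorted_y` against `percentile` gives the normal probability plot described next.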
Finally, the data analysis tool produces the following scatter diagrams.

Normal Probability Plot
Figure 7 – Normal Probability Plot

The plot in Figure 7 shows that the data is a reasonable fit with the normality assumption.

Residual Plots
Figure 8 – Residual Plots

The Color Residual plot in Figure 8 shows a reasonable fit with the linearity and homogeneity of variance assumptions. The Quality Residual plot is a little less definitive, but for so few sample points it is not a bad fit.

The two plots in Figure 9 show clear problems. Fortunately, these are not based on the data in Example 3.

Figure 9 – Residual Plots showing violation of assumptions

For the chart on the left of Figure 9, the vertical spread of the dots on the right side is larger than on the left. This is a clear indication that the variances are not homogeneous. For the chart on the right, the dots don't appear to be random, and few of the points lie below the x-axis, which indicates a violation of linearity.

The chart in Figure 10 shows what we are ideally looking for: a random spread of dots, with an equal number above and below the x-axis.

Figure 10 – Residuals and linearity and variance assumptions

Line Fit Plots
Figure 11 – Line fit plots for Example 3

Observation: The results from Example 3 can be reported as follows: Multiple regression analysis was used to test whether certain characteristics significantly predicted the price of diamonds. The results of the regression indicated that the two predictors explained 85% of the variance (R2 = .85, F(2,8) = 22.79, p < .005). It was found that color significantly predicted price (β = 4.90, p < .005), as did quality (β = 3.76, p < .002).

You could express the p-values in other ways, and you could also add the regression equation: price = 1.75 + 4.90 ∙ color + 3.76 ∙ quality