
Definition 1: We use the same terminology as in Definition 3 of Regression Analysis, except that the degrees of freedom dfRes and dfReg are modified to account for the number k of independent variables.

dfReg = k and dfRes = n – k – 1, where n = the number of observations; as before, dfT = n – 1.

Proof: These properties are the multiple regression counterparts to Properties 2, 3 and 5f of Regression Analysis, respectively, and their proofs are similar.

Observation: From Property 2 and the second assertion of Property 3,

R2 = SSReg / SST

which is the multivariate version of Property 1 of Basic Concepts of Correlation.

Property 3


Property 4: MSRes is an unbiased estimator of σ2, where σ2 is the variance of the error terms.

Observation: Based on Property 4 and Property 4 of Multiple Regression using Matrices, the covariance matrix of B can be estimated by

C = MSRes (XTX)-1

In particular, the diagonal of C = [cij] contains the variances of the bj, and so the standard error of bj can be expressed as

s.e.(bj) = √cjj

Example 1: Calculate the linear regression coefficients and their standard errors for the data in Example 1 of Least Squares for Multiple Regression (repeated below in Figure 1) using matrix techniques.


Figure 1 – Creating the regression line using matrix techniques

The result is displayed in Figure 1. Range E4:G14 contains the design matrix X and range I4:I14 contains Y. The matrix (XTX)-1 in range E17:G19 can be calculated using the array formula

=MINVERSE(MMULT(TRANSPOSE(E4:G14),E4:G14))

Per Property 1 of Multiple Regression using Matrices, the coefficient vector B (in range K4:K6) can be calculated using the array formula:

=MMULT(E17:G19,MMULT(TRANSPOSE(E4:G14),I4:I14))

The predicted values of Y, i.e. Y-hat (range M4:M14), can then be calculated using the array formula

=MMULT(E4:G14,K4:K6)

The standard error of each of the coefficients in B can be calculated as follows. First calculate the array of error terms E (range O4:O14) using the array formula =I4:I14-M4:M14. Then, just as in the simple regression case, SSRes = DEVSQ(O4:O14) = 277.36, dfRes = n – k – 1 = 11 – 2 – 1 = 8 and MSRes = SSRes/dfRes = 34.67 (see Multiple Regression Analysis for more details).

By the Observation following Property 4 it follows that MSRes (XTX)-1 is the covariance matrix for the coefficients, and so the square root of the diagonal terms are the standard error of the coefficients. In particular, the standard error of the intercept b0 (in cell K9) is expressed by the formula =SQRT(I17), the standard error of the color coefficient b1 (in cell K10) is expressed by the formula =SQRT(J18), and the standard error of the quality coefficient b2 (in cell K11) is expressed by the formula =SQRT(K19).
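
For readers who want to reproduce this matrix computation outside of Excel, here is a minimal NumPy sketch of the same steps. The color, quality and price arrays are illustrative placeholders, not the actual values from Figure 1; substitute the 11 observations from the worksheet.

import numpy as np

# Illustrative placeholder data (not the Figure 1 values)
color = np.array([3., 4., 5., 3., 6., 7., 5., 8., 2., 6., 9.])
quality = np.array([4., 3., 6., 5., 5., 7., 6., 8., 3., 7., 9.])
price = np.array([30., 35., 45., 38., 50., 60., 52., 70., 28., 58., 80.])

X = np.column_stack([np.ones(len(price)), color, quality])  # design matrix (E4:G14)
XtX_inv = np.linalg.inv(X.T @ X)   # =MINVERSE(MMULT(TRANSPOSE(E4:G14),E4:G14))
B = XtX_inv @ X.T @ price          # =MMULT(E17:G19,MMULT(TRANSPOSE(E4:G14),I4:I14))
y_hat = X @ B                      # =MMULT(E4:G14,K4:K6)

resid = price - y_hat              # =I4:I14-M4:M14
df_res = len(price) - 2 - 1        # dfRes = n - k - 1
ms_res = resid @ resid / df_res    # MSRes = SSRes/dfRes
cov_B = ms_res * XtX_inv           # covariance matrix MSRes (XTX)-1
se_B = np.sqrt(np.diag(cov_B))     # standard errors of b0, b1, b2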

Excel Functions: The functions SLOPE, INTERCEPT, STEYX and FORECAST don’t work for multiple regression, but the functions TREND and LINEST do support multiple regression, as does the Regression data analysis tool.

TREND works exactly as described in Method of Least Squares, except that the second argument R2 (the range containing the values of the independent variables) will now contain the data for all the independent variables.

LINEST works just as in the simple linear regression case, except that instead of using a 5 × 2 region for the output, a 5 × (k+1) region is required, where k = the number of independent variables. Thus, for a model with 3 independent variables, you need to highlight an empty 5 × 4 region. As before, you need to manually add the appropriate labels for clarity.

The Regression data analysis tool works exactly as in the simple linear regression case, except that additional charts are produced for each of the independent variables.

Example 2: We revisit Example 1 of Multiple Correlation, analyzing the model in which the poverty rate can be estimated as a linear combination of the infant mortality rate, the percentage of the population that is white and the violent crime rate (per 100,000 people).

We need to find the parameters b0, b1, b2 and b3 such that

Poverty (predicted) = b0 + b1 ∙ Infant + b2 ∙ White + b3 ∙ Crime.

We illustrate how to use TREND and LINEST in Figure 2.


Figure 2 – TREND and LINEST for data in Example 1

Here we show the data for the first 15 of 50 states (columns A through E) and the percentage of poverty forecasted when infant mortality, percentage of whites in the population and crime rate are as indicated (range G6:I8). Highlighting the range J6:J8, we enter the array formula =TREND(B4:B53,C4:E53,G6:I8). As we can see from Figure 2, the model predicts a poverty rate of 12.87% when infant mortality is 7.0, whites make up 80% of the population and violent crime is 400 per 100,000 people.

Figure 2 also shows the output from LINEST after we highlight the shaded range H13:K17 and enter =LINEST(B4:B53,C4:E53,TRUE,TRUE). The column headings b1, b2, b3 and intercept refer to the first two rows only (note that LINEST returns the coefficients in reverse order). The remaining three rows have two values each, labeled on the left and the right.

Thus, we see that the regression line is

Poverty = 0.437 + 1.279 ∙ Infant Mortality + 0.0363 ∙ White + 0.00142 ∙ Crime

Here Poverty represents the predicted value. We also see that R Square is .337 (i.e. 33.7% of the variance in the poverty rate is explained by the model), the standard error of the estimate is 2.47, etc.
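
As a quick sanity check, plugging the first prediction row of Figure 2 into this equation gives 0.437 + 1.279 ∙ 7.0 + 0.0363 ∙ 80 + 0.00142 ∙ 400 = 0.437 + 8.953 + 2.904 + 0.568 ≈ 12.86, which agrees with the 12.87% produced by TREND (the small discrepancy comes from rounding the displayed coefficients).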

We can also use the Regression data analysis tool to produce the output in Figure 3.


Figure 3 – Output from Regression data analysis tool

Since the p-value = 0.00026 < .05 = α, we conclude that the regression model is a significantly good fit; i.e. there is only a 0.026% possibility of getting a correlation this high (.58) assuming that the null hypothesis is true.

Note that the p-values for all the coefficients, with the exception of the coefficient for infant mortality, are bigger than .05. This means that we cannot reject the hypothesis that these coefficients are zero (and so the corresponding variables can be eliminated from the model). This is also confirmed by the fact that 0 lies in the interval between the Lower 95% and Upper 95% values (i.e. the 95% confidence interval) for each of these coefficients.

If we rerun the Regression data analysis tool only using the infant mortality variable we get the results shown in Figure 4.


Figure 4 – Reduced regression model for Example 1

Once again we see that the model Poverty = 4.27 + 1.23 ∙ Infant Mortality is a good fit for the data (p-value = 1.96E-05 < .05). We also see that both coefficients are significant. Most importantly, we see that R Square is 31.9%, which is not much smaller than the R Square value of 33.7% that we obtained from the larger model (in Figure 3). All of this indicates that the White and Crime variables are not contributing much to the model and can be dropped.
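
This comparison can also be sketched in Python with statsmodels, assuming (hypothetically) that the 50-state data has been exported to a file named poverty.csv with columns Poverty, Infant, White and Crime:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("poverty.csv")  # hypothetical export of columns A through E

# Full model (Infant, White, Crime) vs. reduced model (Infant only)
full = sm.OLS(df["Poverty"], sm.add_constant(df[["Infant", "White", "Crime"]])).fit()
reduced = sm.OLS(df["Poverty"], sm.add_constant(df[["Infant"]])).fit()

print(full.rsquared, reduced.rsquared)  # should be close to .337 and .319 (Figures 3 and 4)
print(full.pvalues)                     # only the Infant coefficient should fall below .05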

See Testing the Significance of Extra Variables on the Regression Model for more information about how to test whether independent variables can be eliminated from the model.

Click here to see an alternative way of determining whether the regression model is a good fit.

Example 3: Determine whether the regression model for the data in Example 1 of Method of Least Squares for Multiple Regression is a good fit using the Regression data analysis tool.

The results of the analysis are displayed in Figure 5.


Figure 5 – Output from the Regression data analysis tool

Since the p-value = 0.00497 < .05, we reject the null hypothesis and conclude that the regression model Price = 1.75 + 4.90 ∙ Color + 3.76 ∙ Quality is a good fit for the data. Note that all the coefficients are significant. The fact that R Square = .85 indicates that a good deal of the variability of Price is captured by the model.

Observation: We can calculate all the entries in the Regression data analysis in Figure 5 using Excel formulas as follows:

Regression Statistics

  • Multiple R – SQRT(F7) or calculate from Definition 1 of Multiple Correlation
  • R Square = G14/G16
  • Adjusted R Square – calculate from R Square using Definition 2 of Multiple Correlation, i.e. 1 – (1 – R Square)(n – 1)/(n – k – 1)
  • Standard Error = SQRT(H15)
  • Observations = COUNT(A4:A14)

ANOVA

  • SST = DEVSQ(C4:C14)
  • SSReg = DEVSQ(M4:M14) from Figure 3 of Method of Least Squares for Multiple Regression
  • SSRes = G16-G14
  • All the other entries can be calculated in a manner similar to how we calculated the ANOVA values for Example 1 of Testing the Fit of the Regression Line (see Figure 1 on that webpage).

Coefficients (in the third table) – we show how to calculate the intercept fields; the color and quality fields are similar. (A Python sketch of the equivalent computation follows this list.)

  • The coefficient and standard error can be calculated as in Figure 3 of Method of Least Squares for Multiple Regression
  • t Stat = F19/G19
  • P-value = T.DIST.2T(ABS(H19),F15)
  • Lower 95% = F19-T.INV.2T(0.05,F15)*G19
  • Upper 95% = F19+T.INV.2T(0.05,F15)*G19
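
The same four formulas can be sketched in Python with SciPy, written as a function so that it can be applied to any coefficient; the inputs correspond to the coefficient (F19), its standard error (G19) and dfRes (F15):

from scipy import stats

def coefficient_stats(b, se, df_res, alpha=0.05):
    t_stat = b / se                                   # t Stat = F19/G19
    p_value = 2 * stats.t.sf(abs(t_stat), df_res)     # = T.DIST.2T(ABS(H19),F15)
    t_crit = stats.t.ppf(1 - alpha / 2, df_res)       # = T.INV.2T(0.05,F15)
    lower, upper = b - t_crit * se, b + t_crit * se   # 95% confidence interval
    return t_stat, p_value, (lower, upper)

# e.g. coefficient_stats(4.90, se, 8) for the Color coefficient of Example 3,
# where se is the standard error taken from Figure 5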

The remaining output from the Regression data analysis is shown in Figure 6.


Figure 6 – Residuals/percentile output from Regression

Residual Output

Observations 1 through 11 correspond to the raw data in A4:C14 (from Figure 5). In particular, the entries for Observation 1 can be calculated as follows (a NumPy sketch of both tables follows the Probability Output list):

  • Predicted Price =F19+A4*F20+B4*F21 (from Figure 5)
  • Residuals =C4-F26
  • Std Residuals =G26/STDEV.S(G26:G36)

Probability Output

  • Percentile: cell J26 contains the formula =100/(2*E36), cell J27 contains the formula =J26+100/E36 (and similarly for cells J28 through J36)
  • Price: these are simply the price values in the range C4:C14 (from Figure 5) in sorted order. E.g. the supplemental array formula =QSORT(C4:C14) can be placed in range K26:K36.
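
Both tables can be reproduced from the observed and predicted values with a small NumPy function, offered here as a sketch:

import numpy as np

def residual_and_probability_output(y, y_hat):
    resid = y - y_hat                             # Residuals = C4-F26
    std_resid = resid / np.std(resid, ddof=1)     # Std Residuals = G26/STDEV.S(G26:G36)
    n = len(y)
    pct = 100 * (np.arange(1, n + 1) - 0.5) / n   # =100/(2*E36), then steps of 100/E36
    return resid, std_resid, pct, np.sort(y)      # np.sort plays the role of QSORT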

Finally, the data analysis tool produces the following scatter diagrams.

Normal Probability Plot

  • This plots the Percentile vs. Price from the table output in Figure 6. The plot is used to determine whether the data fits a normal distribution. It can be helpful to add a trend line to see whether the data fits a straight line; this is done by clicking on the plot and selecting Layout > Analysis|Trendline and choosing Linear Trendline.
  • It plays the same role as the QQ plot. In fact, except for the scale and the switched axes, it generates the same plot as the QQ plot produced by the supplemental data analysis tool.


Figure 7 – Normal Probability Plot

The plot in Figure 7 shows that the data is reasonably consistent with the normality assumption.

Residual Plots

  • One plot is generated for each independent variable. For Example 3, two plots are generated: Color vs. Residuals and Quality vs. Residuals.
  • These plots are used to determine whether the data fits the linearity and homogeneity of variance assumptions. For the homogeneity of variance assumption to be met, each plot should show a random pattern of points. If a definitive shape of dots emerges, or if the vertical spread of points is not constant over similar-length horizontal intervals, then the homogeneity of variances assumption is violated.
  • For the linearity assumption to be met, the residuals should have a mean of 0, which is indicated by an approximately equal spread of dots above and below the x-axis. (A plotting sketch follows this list.)
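
As a rough matplotlib sketch (the variable names are placeholders), plots of this kind can be drawn as follows, with each independent variable plotted against the residuals and a horizontal line at 0 that the points should straddle evenly:

import matplotlib.pyplot as plt

def residual_plots(x_columns, resid, names):
    fig, axes = plt.subplots(1, len(x_columns), squeeze=False)
    for ax, x, name in zip(axes[0], x_columns, names):
        ax.scatter(x, resid)               # a random pattern supports the assumptions
        ax.axhline(0, color="gray", lw=1)  # points should straddle this line evenly
        ax.set_xlabel(name)
        ax.set_ylabel("Residuals")
    plt.show()

# e.g. residual_plots([color, quality], resid, ["Color", "Quality"])
# using the arrays from the NumPy sketch in Example 1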


Figure 8 – Residual Plots

The Color Residual plot in Figure 8 shows a reasonable fit with the linearity and homogeneity of variance assumptions. The Quality Residual plot is a little less definitive, but with so few sample points it is not a bad fit.

The two plots in Figure 9 show clear problems. Fortunately, these are not based on the data in Example 3.


Figure 9 – Residual Plots showing violation of assumptions

For the chart on the left of Figure 9, the vertical spread of dots on the right side of the chart is larger than on the left, a clear indication that the variances are not homogeneous. For the chart on the right, the dots don’t seem to be random, and few of the points are below the x-axis (which indicates a violation of linearity). The chart in Figure 10 is ideally what we are looking for: a random spread of dots, with an equal number above and below the x-axis.


Figure 10 – Residuals and linearity and variance assumptions

Line Fit Plots

  • One plot is generated for each independent variable. For Example 3, two plots are generated: one for Color and one for Quality. For each chart the observed y values (Price) and predicted y values are plotted against the observed values of the independent variable.


Figure 11 – Line fit plots for Example 3

Observation: The results from Example 3 can be reported as follows:

Multiple regression analysis was used to test whether certain characteristics significantly predicted the price of diamonds. The results of the regression indicated that the two predictors explained 85% of the variance (R2 = .85, adjusted R2 = .813, F(2,8) = 22.79, p < .0005). It was found that color significantly predicted price (β = 4.90, p < .005), as did quality (β = 3.76, p < .002).

You could express the p-values in other ways, and you could also add the regression equation: Price = 1.75 + 4.90 ∙ Color + 3.76 ∙ Quality.
