<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script type="text/javascript" id="MathJax-script" async
  src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js">
</script>

## R² (R-squared): Definition, Calculation, and Use in Models

### What is R²?
R² (the coefficient of determination) is a useful metric for assessing how well a model explains the relationship between the predictor variables and the response variable.
R² is calculated by comparing the total variation in the response variable to the variation explained by the model. The formula is:

$$
R^2 = 1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}}
$$
Where:

- **SS_residual** is the sum of squared differences between the observed values \(y_i\) and the values predicted by the model \( \hat{y}_i \) (i.e., the residuals):

$$
SS_{\text{residual}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

- **SS_total** is the total sum of squared differences between the observed values \(y_i\) and the mean of the response variable \( \bar{y} \):

$$
SS_{\text{total}} = \sum_{i=1}^{n} (y_i - \bar{y})^2
$$

In short, \(R^2\) tells you how much of the variability in the response variable is captured by the model.
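The definitions above translate directly into code. Below is a minimal sketch, using a small made-up dataset and a closed-form ordinary least-squares line fit (both are illustrative assumptions, not part of the original text):

```python
# Minimal sketch: R² computed from its definition on hypothetical data.

def r_squared(y, y_hat):
    """R² = 1 - SS_residual / SS_total."""
    y_bar = sum(y) / len(y)
    ss_residual = sum((yi - yhi) ** 2 for yi, yhi in zip(y, y_hat))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_residual / ss_total

# Hypothetical observations and an ordinary least-squares line fit to them.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]

print(r_squared(y, y_hat))  # ≈ 0.6: the line explains 60% of the variation
```

A perfect fit (`y_hat == y`) drives SS_residual to zero and R² to exactly 1, which is a quick sanity check for any implementation.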
### Interpreting R²
An R² close to 1 means the model explains most of the variation in the response; an R² near 0 means it explains very little. A high R² is not proof of a good model, however, and one pitfall to watch for is the Texas Sharpshooter Fallacy. In data analysis, this fallacy occurs when researchers fit multiple models or test many hypotheses and then report only the one with the highest R², making a chance fit look like a genuine discovery.
### Adjusted R²

Adjusted R² is an alternative to R² that adjusts for the number of predictors in the model. Unlike R², which increases whenever a new predictor is added (even if it doesn’t improve the model), adjusted R² only increases if the new predictor improves the model more than would be expected by chance.

$$
R^2_{\text{adj}} = 1 - \left( \frac{(1 - R^2)(n - 1)}{n - p - 1} \right)
$$

Where:

- **n** is the number of observations.
- **p** is the number of predictors in the model.
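The adjustment is a one-line function of R², n, and p. A minimal sketch follows; the numbers (R² = 0.85, 50 observations, 5 predictors) are hypothetical:

```python
# Minimal sketch: adjusted R² from R², n observations, and p predictors.

def adjusted_r_squared(r2, n, p):
    """R²_adj = 1 - (1 - R²)(n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical model: R² = 0.85 from 50 observations and 5 predictors.
print(adjusted_r_squared(0.85, n=50, p=5))   # ≈ 0.833

# Same R² with 20 predictors is penalized harder.
print(adjusted_r_squared(0.85, n=50, p=20))  # ≈ 0.747
```

Note how holding R² fixed while raising p lowers the adjusted value: the penalty grows with the number of predictors, which is exactly what discourages adding predictors that contribute nothing.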
To avoid falling into the Texas Sharpshooter Fallacy with R²:

- **Use R² in context**: Remember that R² only measures how well the model fits the data used in the analysis. Always check other metrics like adjusted R² and p-values to evaluate the significance and generalizability of the model.
- **Report all findings**: Don’t focus solely on high R² models. Even models with lower R² values may provide useful insights, particularly if they are based on a sound hypothesis and are generalizable to new data.
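The danger of reporting only the best-fitting model can be made concrete with a small simulation (entirely hypothetical data): fit fifty unrelated noise predictors to a pure-noise response and keep only the best R². The winner looks respectable even though every predictor is meaningless.

```python
import random

random.seed(42)  # reproducible hypothetical data

def r_squared(y, y_hat):
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def fit_line(x, y):
    # Ordinary least squares for a single predictor with intercept.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [intercept + slope * xi for xi in x]

n = 20
y = [random.gauss(0, 1) for _ in range(n)]  # response is pure noise

# Fit 50 random, unrelated predictors and record each fit's R².
scores = [r_squared(y, fit_line([random.gauss(0, 1) for _ in range(n)], y))
          for _ in range(50)]

print(f"best R² among 50 noise predictors: {max(scores):.3f}")
print(f"median R²: {sorted(scores)[len(scores) // 2]:.3f}")
```

The best score is selected after the fact, so it systematically overstates the fit, which is the sharpshooter fallacy in miniature; out-of-sample validation would expose it.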
By carefully interpreting R² and using it alongside other metrics, researchers can avoid overfitting and misleading conclusions, ensuring their models provide meaningful insights into the data.
### Pseudo-R² for Generalized Linear Models (GLMs)

In some models, such as logistic regression or other Generalized Linear Models (GLMs), the traditional R² does not apply. Instead, pseudo-R² measures are used. Here are three common types:
#### McFadden's Pseudo-R²

McFadden’s pseudo-R² is commonly used for logistic regression models. It is defined as:

$$
R^2_{\text{McFadden}} = 1 - \frac{\ln(L_{\text{full model}})}{\ln(L_{\text{null model}})}
$$

Where:

- \(L_{\text{full model}}\) is the likelihood of the fitted model.
- \(L_{\text{null model}}\) is the likelihood of the null model (a model with only an intercept).
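In practice software reports log-likelihoods rather than raw likelihoods, so the formula is applied to them directly. A minimal sketch, with hypothetical log-likelihood values standing in for fitted-model output:

```python
# Minimal sketch: McFadden's pseudo-R² from the two log-likelihoods.

def mcfadden_r2(ll_full, ll_null):
    """1 - ln(L_full) / ln(L_null), taking log-likelihoods as inputs."""
    return 1 - ll_full / ll_null

# Hypothetical logistic model: log-likelihood -420.5 for the fitted model,
# -680.2 for the intercept-only null model.
print(mcfadden_r2(ll_full=-420.5, ll_null=-680.2))  # ≈ 0.382
```

When the fitted model is no better than the null model the two log-likelihoods coincide and the measure is 0; values in the 0.2–0.4 range are often already considered a good fit for this metric.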
#### Cox & Snell's Pseudo-R²

Cox & Snell’s pseudo-R² is another likelihood-based measure:

$$
R^2_{\text{Cox-Snell}} = 1 - \left( \frac{L_{\text{null model}}}{L_{\text{full model}}} \right)^{2/n}
$$

Where \(n\) is the number of observations.
#### Nagelkerke's Pseudo-R²

Nagelkerke’s pseudo-R² is a modification of Cox & Snell’s pseudo-R² that adjusts for the fact that Cox & Snell’s pseudo-R² cannot reach a maximum value of 1. The formula is:

$$
R^2_{\text{Nagelkerke}} = \frac{R^2_{\text{Cox-Snell}}}{1 - \left( L_{\text{null model}} \right)^{2/n}}
$$
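Both of these measures are also computed from log-likelihoods in practice, since the raw likelihoods are usually far too small to represent as floating-point numbers. A minimal sketch with hypothetical values, using the identity \( (L)^{2/n} = \exp\!\big(\tfrac{2}{n}\ln L\big) \):

```python
import math

# Minimal sketch: Cox & Snell's and Nagelkerke's pseudo-R² computed from
# log-likelihoods to avoid numerical underflow of the raw likelihoods.

def cox_snell_r2(ll_full, ll_null, n):
    # (L_null / L_full)^(2/n) = exp((2/n) * (ll_null - ll_full))
    return 1 - math.exp((2 / n) * (ll_null - ll_full))

def nagelkerke_r2(ll_full, ll_null, n):
    # Rescale Cox & Snell by its maximum attainable value,
    # 1 - L_null^(2/n) = 1 - exp((2/n) * ll_null).
    max_cs = 1 - math.exp((2 / n) * ll_null)
    return cox_snell_r2(ll_full, ll_null, n) / max_cs

# Hypothetical logistic model: n = 1000 observations, log-likelihoods
# -420.5 (full model) and -680.2 (intercept-only null model).
print(cox_snell_r2(-420.5, -680.2, n=1000))   # ≈ 0.405
print(nagelkerke_r2(-420.5, -680.2, n=1000))  # ≈ 0.545
```

Note that Nagelkerke's value is always at least as large as Cox & Snell's on the same model, and it reaches exactly 1 for a perfectly fitting model (log-likelihood 0), which is the point of the rescaling.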
Each pseudo-R² provides an indication of the model fit, with values closer to 1 indicating a better fit. However, unlike traditional R², pseudo-R² values can vary depending on the model and should be interpreted with caution.

By carefully interpreting R², adjusted R², and pseudo-R² values, you can assess how well your models explain the variability in your data while avoiding overfitting and other common pitfalls.