<script src="https://polyfill.io/v3/polyfill.min.js?features=es6"></script>
<script type="text/javascript" id="MathJax-script" async
  src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js">
</script>

## R² (R-squared): Definition, Calculation, and Use in Models

### What is R²?

R² is a useful metric for assessing how well a model explains the relationship between the predictor variables and the response variable.

R² is calculated by comparing the total variation in the response variable to the variation explained by the model. The formula is:

$$
R^2 = 1 - \frac{SSE}{SST}
$$

Where:

- **$SSE$** (Residual Sum of Squares) is the sum of squared differences between the observed values $y_i$ and the values predicted by the model $\hat{y}_i$, i.e., the unexplained variance:

$$
SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

- **$SST$** (Total Sum of Squares) is the total sum of squared differences between the observed values $y_i$ and the mean of the response variable $\bar{y}$, i.e., the total variance:

$$
SST = \sum_{i=1}^{n} (y_i - \bar{y})^2
$$

In short, $R^2$ tells you how much of the total variability in the response variable is explained by the model.
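
As a quick numerical sketch, the definition above can be applied directly to a small set of made-up observations and model predictions (pure Python, no libraries):

```python
# Hypothetical observed values and model predictions (illustrative only).
y = [3.1, 4.0, 5.2, 6.1, 7.3]
y_hat = [3.0, 4.2, 5.0, 6.3, 7.1]

mean_y = sum(y) / len(y)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained variation
sst = sum((yi - mean_y) ** 2 for yi in y)              # total variation around the mean
r_squared = 1 - sse / sst
print(round(r_squared, 4))  # → 0.9846: the predictions track y closely
```

Here almost 98.5% of the variability in `y` is captured by the (hypothetical) model's predictions.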

### Interpreting R²

To avoid falling into the Texas Sharpshooter Fallacy with R²:

- **Avoid overfitting**: Don’t include unnecessary predictors just to boost the R² value. Use adjusted R² or cross-validation to assess the model’s performance.

- **Use R² in context**: Remember that R² only measures how well the model fits the data used in the analysis. Always check other metrics like adjusted R² and p-values to evaluate the significance and generalizability of the model.

- **Report all findings**: Don’t focus solely on high R² models. Even models with lower R² values may provide useful insights, particularly if they are based on a sound hypothesis and are generalizable to new data.
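
The adjusted R² mentioned in these recommendations applies the penalty $R^2_{\text{adj}} = 1 - (1 - R^2)\frac{n-1}{n-k-1}$ for $n$ observations and $k$ predictors. A minimal sketch with hypothetical values:

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R² for n observations and k predictors: penalizes extra predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical comparison: adding 12 weak predictors nudges R² from 0.90 to 0.91,
# but the adjusted value goes DOWN, flagging the extra predictors as not worthwhile.
lean = adjusted_r_squared(0.90, n=50, k=3)
bloated = adjusted_r_squared(0.91, n=50, k=15)
print(round(lean, 4), round(bloated, 4))  # → 0.8935 0.8703
```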

### Pseudo-R² for Generalized Linear Models (GLMs)

In some models, such as logistic regression or other Generalized Linear Models (GLMs), the traditional R² does not apply. Instead, pseudo-R² measures are used. Here are three common types:

#### McFadden's Pseudo-R²

McFadden’s pseudo-R² is commonly used for logistic regression models. It is defined as:

$$
R^2_{\text{McFadden}} = 1 - \frac{\ln(L_{\text{full model}})}{\ln(L_{\text{null model}})}
$$

Where:

- $L_{\text{full model}}$ is the likelihood of the fitted model.
- $L_{\text{null model}}$ is the likelihood of the null model (a model with only an intercept).
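
In practice, statistical software usually reports log-likelihoods ($\ln L$) directly, so McFadden's measure reduces to a simple ratio. A small sketch with hypothetical log-likelihood values:

```python
def mcfadden_r2(ll_full, ll_null):
    """McFadden's pseudo-R² from the log-likelihoods (ln L) of the fitted and null models."""
    return 1 - ll_full / ll_null

# Hypothetical log-likelihoods from a logistic regression fit;
# log-likelihoods are negative, and values nearer 0 indicate a better fit.
ll_null = -120.5  # intercept-only model
ll_full = -85.3   # model with predictors
print(round(mcfadden_r2(ll_full, ll_null), 3))  # → 0.292
```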

#### Cox & Snell's Pseudo-R²

Cox & Snell’s pseudo-R² is another likelihood-based measure:

$$
R^2_{\text{Cox-Snell}} = 1 - \left( \frac{L_{\text{null model}}}{L_{\text{full model}}} \right)^{2/n}
$$

Where $n$ is the number of observations.

#### Nagelkerke's Pseudo-R²

Nagelkerke’s pseudo-R² is a modification of Cox & Snell’s pseudo-R² that rescales it so that a maximum value of 1 can be reached. The formula is:

$$
R^2_{\text{Nagelkerke}} = \frac{R^2_{\text{Cox-Snell}}}{1 - \left( L_{\text{null model}} \right)^{2/n}}
$$
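
Both Cox & Snell's and Nagelkerke's measures can be sketched from the same (hypothetical) log-likelihoods; working on the log scale avoids underflow, since raw likelihoods are often vanishingly small:

```python
import math

def cox_snell_r2(ll_full, ll_null, n):
    # (L_null / L_full)^(2/n) evaluated on the log scale for numerical stability
    return 1 - math.exp((2 / n) * (ll_null - ll_full))

def nagelkerke_r2(ll_full, ll_null, n):
    # Divide by the maximum attainable Cox & Snell value, 1 - L_null^(2/n)
    max_cs = 1 - math.exp((2 / n) * ll_null)
    return cox_snell_r2(ll_full, ll_null, n) / max_cs

ll_null, ll_full, n = -120.5, -85.3, 200  # hypothetical fit on 200 observations
print(round(cox_snell_r2(ll_full, ll_null, n), 3))   # → 0.297
print(round(nagelkerke_r2(ll_full, ll_null, n), 3))  # → 0.424
```

Note how Nagelkerke's rescaling pushes the value up relative to Cox & Snell's, while keeping it below 1.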

---

### Comparing Different Pseudo-R² Measures

Each pseudo-R² provides an indication of the model fit, with values closer to 1 indicating a better fit. However, unlike traditional $R^2$, pseudo-$R^2$ values can vary depending on the type of model, and they don’t have a direct interpretation as the percentage of variance explained like traditional $R^2$ does in linear regression models.

- **McFadden’s pseudo-$R^2$**: Tends to produce values lower than traditional $R^2$, and values between 0.2 and 0.4 are considered good fits in many contexts.

- **Cox & Snell’s pseudo-$R^2$**: Provides a likelihood-based measure but cannot reach 1, making interpretation less straightforward.

- **Nagelkerke’s pseudo-$R^2$**: Adjusts Cox & Snell’s measure to allow for a maximum value of 1, making it easier to interpret, but still not as intuitive as traditional $R^2$.

### Interpreting Pseudo-R² Measures

When comparing models using pseudo-$R^2$, it's important to note that these measures are relative within the context of a specific model. Unlike traditional $R^2$, they are not directly comparable across different types of models or datasets. As such, while pseudo-$R^2$ values can offer insight into model performance, they should be interpreted with caution and supplemented with other evaluation metrics, such as likelihoods or classification accuracy, depending on the model type.