|
|
|
|
|
### How to Calculate the F-value
|
|
|
|
|
|
To calculate the F-value, you need to partition the total variance in the outcome variable into explained and unexplained components. This begins with calculating the **Total Sum of Squares (SST)**, which measures how far the observed data points deviate from the overall mean:
|
|
|
|
|
|
$$
SST = \sum (y_i - \bar{y})^2
$$
|
Where:
|
- $y_i$: The observed value for each data point
|
|
- $\bar{y}$: The mean of the outcome variable
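As a minimal sketch of the definition above (using NumPy and a small made-up outcome vector, purely for illustration), SST can be computed directly:

```python
import numpy as np

# Hypothetical outcome values (illustrative data only)
y = np.array([2.0, 4.0, 6.0, 8.0])

y_bar = y.mean()                 # overall mean of the outcome
sst = np.sum((y - y_bar) ** 2)   # total sum of squares

print(sst)  # 20.0
```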
|
|
|
|
|
|
The total variance is then split into two components:
|
|
|
|
1. **Explained Variance (Regression Sum of Squares - SSR)**: This is the portion of variance explained by the model. It is calculated from the difference between the predicted values and the overall mean of the outcome variable:
|
|
|
|
|
|
|
|
$$
SSR = \sum (\hat{y}_i - \bar{y})^2
$$
|
Where:
|
- $\hat{y}_i$: The predicted value from the model for each data point
|
|
- $\bar{y}$: The mean of the outcome variable
|
|
|
|
|
|
2. **Unexplained Variance (Residual Sum of Squares - SSE)**: This is the variance that remains unexplained by the model, which measures how far the observed values differ from the predicted values:
|
|
|
|
|
|
$$
SSE = \sum (y_i - \hat{y}_i)^2
$$
|
Where:
|
- $y_i$: The observed value for each data point
|
|
- $\hat{y}_i$: The predicted value from the model
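A minimal sketch of this decomposition, using a hypothetical one-predictor dataset (values are made up) and a least-squares line fit with NumPy:

```python
import numpy as np

# Hypothetical data (illustrative only): one predictor, one outcome
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 7.0, 8.0])

# Fit a simple least-squares line and compute predicted values
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

y_bar = y.mean()
ssr = np.sum((y_hat - y_bar) ** 2)  # explained (regression) sum of squares
sse = np.sum((y - y_hat) ** 2)      # unexplained (residual) sum of squares
sst = np.sum((y - y_bar) ** 2)      # total sum of squares

print(round(ssr, 3), round(sse, 3), round(sst, 3))
```

Here SSR ≈ 24.2, SSE ≈ 1.8, and SST = 26.0; the identity $SST = SSR + SSE$ holds because the line was fit by least squares with an intercept.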
|
|
|
|
|
|
Next, you calculate the **Mean Squares** for both the explained and unexplained variances. These values account for the number of predictors and observations in the model. The **Mean Square Between (MSB)**, representing the explained variance, is calculated by dividing the **SSR** by the number of predictors ($p$):
|
|
|
|
|
|
$$
MSB = \frac{SSR}{p}
$$
|
|
|
|
|
|
Where:

- $SSR$: The regression sum of squares (explained variance)
- $p$: The number of predictors in the model

The **Mean Square Within (MSW)**, representing the unexplained variance, is calculated by dividing the **SSE** by its degrees of freedom, $n - p - 1$, where:

- $n$ is the number of observations,
- $p$ is the number of predictors,
- the **-1** accounts for the intercept, which is also estimated as a parameter.
|
|
|
|
|
|
|
|
$$
MSW = \frac{SSE}{n - p - 1}
$$
|
|
|
|
|
|
|
|
Where:

- $SSE$: The residual sum of squares (unexplained variance)
- $n$: The total number of observations
- $p$: The number of predictors in the model

For models **without an intercept**, you don't subtract the 1, so the degrees of freedom become $n - p$. This is because the intercept isn't estimated, so the number of independent data points isn't reduced.
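Continuing the hypothetical one-predictor dataset from above (illustrative values only), the mean squares follow directly from the sums of squares:

```python
import numpy as np

# Same hypothetical data as before (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 7.0, 8.0])

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

ssr = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
sse = np.sum((y - y_hat) ** 2)         # residual sum of squares

n = len(y)   # number of observations
p = 1        # number of predictors (just x)

msb = ssr / p            # mean square between (explained)
msw = sse / (n - p - 1)  # mean square within; the -1 accounts for the intercept

print(round(msb, 3), round(msw, 3))
```

With these numbers, MSB ≈ 24.2 and MSW ≈ 0.9.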
|
|
|
|
|
|
|
|
Finally, the **F-value** is calculated by taking the ratio of the explained variance (MSB) to the unexplained variance (MSW):
|
|
|
|
|
|
$$
F = \frac{MSB}{MSW}
$$
|
|
|
|
|
|
This ratio indicates how much more variance is explained by the model compared to the variance that remains unexplained. A large F-value suggests that the model explains much more variance than what is left unexplained, meaning the model is statistically significant. Conversely, a small F-value indicates that the model is not significantly better than using the mean to predict the outcome.
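Putting the steps together, here is a minimal end-to-end sketch on the same hypothetical dataset (values made up for illustration):

```python
import numpy as np

# Hypothetical data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 7.0, 8.0])

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variance
sse = np.sum((y - y_hat) ** 2)         # unexplained variance

n, p = len(y), 1
f_value = (ssr / p) / (sse / (n - p - 1))  # F = MSB / MSW
print(round(f_value, 2))  # 26.89
```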
|
|
|
|
|
|
### Interpretation
|
|
|
|
|
|
The F-value tests whether the model provides a better fit than using just the mean. A high F-value indicates that the data points are close to the regression line, showing that the model fits well. A low F-value suggests that the data points are scattered, indicating a poor fit.
|
|
|
|
|
|
|
|
In **simple regression** (one predictor), the F-value is related to the t-test, with the F-value being the square of the t-value ($F = t^2$). In this case, both tests give the same information about model significance. In **multiple regression**, the F-value tests the overall model significance, while t-tests assess the individual predictors.
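The identity $F = t^2$ can be checked numerically. This sketch (same hypothetical dataset as above) computes the F-value from the variance decomposition and the t-value for the slope from its standard error:

```python
import numpy as np

# Hypothetical data (illustrative only): simple regression, one predictor
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 3.0, 7.0, 8.0])

n, p = len(y), 1
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

# F-value from the variance decomposition
ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
f_value = (ssr / p) / (sse / (n - p - 1))

# t-value for the slope: estimate divided by its standard error
s2 = sse / (n - p - 1)                               # residual variance (MSW)
se_slope = np.sqrt(s2 / np.sum((x - x.mean()) ** 2))
t_value = slope / se_slope

print(np.isclose(f_value, t_value ** 2))  # True: F = t^2 in simple regression
```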
|
|
In **ANOVA**, the degrees of freedom are calculated similarly, where $n$ is the number of observations and $g$ is the number of groups:
|
|
|
|
- **Between-groups degrees of freedom**: $g - 1$ (analogous to the number of predictors in regression),
|
|
|
|
- **Within-groups degrees of freedom**: $n - g$ (similar to residuals in regression).
|
|
|
|
|
|
|
|
Here, the $-1$ is because the overall mean is estimated, just like the intercept in regression.
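A minimal sketch of a one-way ANOVA F-test using SciPy's `f_oneway`, with made-up measurements for three hypothetical groups:

```python
from scipy import stats

# Hypothetical measurements for three groups (illustrative only)
g1 = [4.1, 5.0, 4.7, 4.4]
g2 = [6.2, 5.8, 6.5, 6.1]
g3 = [5.0, 5.3, 4.8, 5.1]

f_value, p_value = stats.f_oneway(g1, g2, g3)

# Degrees of freedom, as described above
n, g = 12, 3
df_between = g - 1   # 2
df_within = n - g    # 9

print(round(f_value, 2), df_between, df_within)
```

For this toy data the F-value comes out around 29, with a very small p-value, reflecting the clearly separated group means.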
|
|
|
|
|
|
### When to Use the F-value
|
|
|
|
|
|
- **Multiple Regression**: When testing whether the predictors, as a group, significantly explain the variance in the outcome.
|
|
- **ANOVA**: When comparing group means to check for statistically significant differences.
|
|
|
|
|
|
### Example (Good Practice)
|
|
|
|
|
|
Suppose you are modeling plant growth based on factors like sunlight, water, and fertilizer. The F-value will tell you whether these predictors, collectively, significantly explain the variation in plant growth. A large F-value suggests that the model is a good fit.
|
|
|
|
|
|
### Example (Bad Practice)
|
|
|
|
|
|
- **Texas Sharpshooter Fallacy**: Occurs when a researcher looks for patterns after data collection, then reports only significant F-values found by chance. This can lead to **p-hacking**, where multiple tests are conducted but only the significant results are presented.
|
|
|
|
|
|
- **Incorrect Use of ANOVA**: Applying ANOVA to non-normal data or data with unequal group variances can lead to misleading results. For example, using ANOVA without checking assumptions like homogeneity of variance may produce a biased F-value.
|
|
|
|
|
|
### Common Pitfalls
|
|
|
|
|
|
- **Overfitting**: Including too many predictors can inflate the F-value, making the model appear more significant than it really is, leading to poor generalization.
|
|
- **Assumption Violations**: The F-test assumes that:

  - **Residuals are normally distributed**
  - **Homogeneity of variance** (equal variances across groups)

  Violating these assumptions doesn't always invalidate your results, but it can affect the accuracy of the F-test. For example:

  - **Mild violations of normality**: The F-test can be robust to slight deviations from normality, especially with large sample sizes.
  - **Heteroscedasticity**: Unequal variances between groups can lead to an inflated F-value, increasing the chance of a Type I error (false positive). In such cases, transformations of the data or alternative tests like Welch's ANOVA can be applied.
|
|
|
|
|
|
|
### Related Measures
|
|
|
|
|
|
- **p-value**: The p-value associated with the F-value tells you whether the model's F-value is statistically significant. A small p-value (typically < 0.05) indicates that the F-value is significant.
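The p-value for an F-value is the upper-tail probability of the F-distribution with the matching degrees of freedom. A minimal sketch using SciPy (the F-value below is the one from the hypothetical regression example earlier):

```python
from scipy import stats

# Hypothetical regression result (illustrative only):
# F = 26.89 with p = 1 predictor and n = 4 observations
f_value = 26.89
dfn, dfd = 1, 2  # numerator df = p, denominator df = n - p - 1

p_value = stats.f.sf(f_value, dfn, dfd)  # survival function: P(F >= f_value)
print(round(p_value, 3))  # 0.035
```

With so few observations the test barely clears the 0.05 threshold despite the large F-value, which illustrates why the degrees of freedom matter as much as the ratio itself.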