The **F-value** is a test statistic used in ANOVA and regression models to assess whether the model is statistically significant. It compares the variance explained by the model (signal) to the unexplained variance (noise). A high F-value indicates that the model explains a significant amount of the variance, while a low F-value suggests that the model doesn't improve much over using the mean of the outcome.

### How to Calculate the F-value

The F-value is calculated as:

\[
F = \frac{\text{MSB}}{\text{MSW}}
\]

To calculate the F-value, you first partition the total variance in the outcome variable. This total variance is known as the **Total Sum of Squares (SST)**, and it is calculated by summing the squared differences between each observed data point and the overall mean of the outcome:

\[
SST = \sum (Y_i - \bar{Y})^2
\]

Where:

- $Y_i$: The observed value for each data point
- $\bar{Y}$: The mean of the outcome variable
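
As a quick illustration, here is a minimal Python sketch of the SST calculation; the outcome values are made up for the example:

```python
import numpy as np

# Hypothetical observed outcome values (illustrative only)
y = np.array([3.0, 4.5, 5.1, 6.2, 7.8])

# Total Sum of Squares: squared deviations of each observation from the overall mean
sst = np.sum((y - y.mean()) ** 2)
print(sst)  # roughly 13.03 for these made-up numbers
```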

The total variance is then split into two components (a short code sketch after the list illustrates both):

1. **Explained Variance (Regression Sum of Squares - SSR)**: This is the portion of variance explained by the model. It is the sum of the squared differences between the predicted values from the model and the overall mean of the outcome variable:

\[
SSR = \sum (\hat{Y}_i - \bar{Y})^2
\]

Where:

- $\hat{Y}_i$: The predicted value from the model for each data point
- $\bar{Y}$: The mean of the outcome variable

2. **Unexplained Variance (Residual Sum of Squares - SSE)**: This is the variance that remains unexplained by the model. It is the sum of the squared differences between the observed values and the predicted values:

\[
SSE = \sum (Y_i - \hat{Y}_i)^2
\]

Where:

- $Y_i$: The observed value for each data point
- $\hat{Y}_i$: The predicted value from the model
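
Continuing the earlier sketch, here is a minimal Python illustration of both components. The predictions are hypothetical fitted values chosen for the example, not output from any particular library:

```python
import numpy as np

# Illustrative data: observed outcomes and fitted values from a hypothetical model
y = np.array([3.0, 4.5, 5.1, 6.2, 7.8])
y_hat = np.array([3.06, 4.19, 5.32, 6.45, 7.58])

# Explained variance: squared deviations of the predictions from the mean of y
ssr = np.sum((y_hat - y.mean()) ** 2)

# Unexplained variance: squared differences between observed and predicted values
sse = np.sum((y - y_hat) ** 2)

print(ssr, sse)  # roughly 12.77 and 0.26
```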

Next, you calculate the **Mean Squares** for both the explained and unexplained variances. These values account for the number of predictors and observations in the model. The **Mean Square Between (MSB)**, representing the explained variance, is calculated by dividing the **SSR** by the number of predictors ($p$):

\[
MSB = \frac{SSR}{p}
\]

Where:

- $SSR$: The regression sum of squares (explained variance)
- $p$: The number of predictors in the model

The **Mean Square Within (MSW)**, representing the unexplained variance, is calculated by dividing the **SSE** by its degrees of freedom (the number of observations $n$, minus the number of predictors $p$, minus one):

\[
MSW = \frac{SSE}{n - p - 1}
\]

Where:

- $SSE$: The residual sum of squares (unexplained variance)
- $n$: The total number of observations
- $p$: The number of predictors in the model
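
Continuing the same hypothetical numbers (one predictor, five observations), the mean squares are simple divisions; a minimal sketch:

```python
ssr, sse = 12.77, 0.26   # hypothetical sums of squares from the previous step
n, p = 5, 1              # illustrative sample size and number of predictors

msb = ssr / p            # Mean Square Between
msw = sse / (n - p - 1)  # Mean Square Within
print(msb, msw)          # 12.77 and about 0.087
```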

Finally, the **F-value** is calculated by taking the ratio of the explained variance (MSB) to the unexplained variance (MSW):

\[
F = \frac{MSB}{MSW}
\]

Where:

- **MSB (Mean Square Between)**: The variance explained by the model.
- **MSW (Mean Square Within)**: The residual variance, or the variance that remains unexplained.

This ratio indicates how much more variance the model explains than it leaves unexplained. A large F-value suggests that the model explains far more variance than remains unexplained, meaning the model is statistically significant. Conversely, a small F-value indicates that the model is not significantly better than using the mean to predict the outcome.
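
Putting the steps together, here is an end-to-end sketch using the same illustrative data as above; the predictions and the single-predictor setup are assumptions for the example:

```python
import numpy as np

y = np.array([3.0, 4.5, 5.1, 6.2, 7.8])          # observed values (illustrative)
y_hat = np.array([3.06, 4.19, 5.32, 6.45, 7.58])  # hypothetical fitted values
n, p = len(y), 1                                  # assumed: one predictor

ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variance (SSR)
sse = np.sum((y - y_hat) ** 2)         # unexplained variance (SSE)

F = (ssr / p) / (sse / (n - p - 1))    # MSB / MSW
print(F)  # roughly 148 for these numbers
```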

### Interpretation

The F-value tests whether the model provides a better fit than using just the mean. Visually, a high F-value indicates that the data points are close to the regression line, showing that the model fits well. A low F-value suggests that the data points are scattered around the line, indicating a poor fit.

In **simple regression** (one predictor), the F-value is related to the t-test, with the F-value being the square of the t-value ($F = t^2$). In this case, both tests give the same information about model significance. In **multiple regression**, the F-value tests the overall model significance, while t-tests assess the individual predictors.
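
A quick way to see the $F = t^2$ relationship is to fit a simple regression with plain NumPy and compute both statistics by hand. This is a minimal sketch on synthetic data, not output from any particular regression library:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=30)
y = 2.0 * x + rng.normal(size=30)   # synthetic linear data with noise

n, p = len(y), 1
X = np.column_stack([np.ones(n), x])           # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS coefficients
y_hat = X @ beta

ssr = np.sum((y_hat - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)
F = (ssr / p) / (sse / (n - p - 1))

# t-statistic for the slope: estimate divided by its standard error
mse = sse / (n - p - 1)
se_slope = np.sqrt(mse / np.sum((x - x.mean()) ** 2))
t = beta[1] / se_slope

print(F, t ** 2)  # the two numbers agree up to floating-point error
```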

### Assumptions

- **Mild violations of normality**: The F-test can be robust to slight deviations from normality, especially with large sample sizes.

- **Homogeneity of variance**: Unequal variances between groups (heteroscedasticity) can lead to an inflated F-value, increasing the chance of a Type I error (false positive). In such cases, transformations of the data or alternative tests like Welch's ANOVA can be applied.

### Interpreting the F-value

- **High F-value**: Indicates that the model explains a significant amount of variance.

- **Low F-value**: Suggests that the model doesn't explain much variance.

The F-value is compared to a critical value from an F-distribution table. If the F-value is greater than the critical value, the model is considered statistically significant.
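
Rather than a printed table, the critical value can be looked up with SciPy's F-distribution; the F-value and degrees of freedom below carry over the hypothetical numbers from the earlier sketch:

```python
from scipy.stats import f

F_value = 147.9     # hypothetical F-value from the sketch above
dfn, dfd = 1, 3     # degrees of freedom: p and n - p - 1 (illustrative)

critical = f.ppf(0.95, dfn, dfd)     # 95th percentile of the F-distribution
print(critical, F_value > critical)  # about 10.13, True
```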

### Related Measures

- **p-value**: The p-value indicates whether the F-value is statistically significant. A small p-value (typically < 0.05) suggests that the F-value is significant.
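
The p-value is the upper-tail probability of the F-distribution at the observed F-value; continuing the hypothetical numbers above:

```python
from scipy.stats import f

F_value, dfn, dfd = 147.9, 1, 3
p_value = f.sf(F_value, dfn, dfd)  # upper-tail (survival) probability
print(p_value)                     # about 0.001, well below 0.05
```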