gillesc92 · 5fb5491b
--- a/2.-Statistics/overdispersion-underdispersion.md
+++ b/2.-Statistics/overdispersion-underdispersion.md
+## Overdispersion and Underdispersion
+### 1. What are Overdispersion and Underdispersion?
+**Overdispersion** occurs when the variance in a dataset is greater than what is expected under a given statistical model, such as a Poisson or Binomial distribution. For example, in a Poisson model the variance should equal the mean; when the observed variance exceeds the mean, overdispersion is present. If it is ignored, standard errors are underestimated, inference becomes overconfident, and the model fits poorly.
+**Underdispersion** is the opposite issue, where the observed variance is smaller than expected under the assumed model. Both overdispersion and underdispersion distort statistical models' performance by violating their distributional assumptions.
+Overdispersion commonly arises in count data models. Underdispersion is less frequent but can still appear when the data are unusually homogeneous.
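+The definitions above can be stated compactly as variance relations. As a sketch, with $\mu$ the mean, $m$ the number of binomial trials, $\pi$ the success probability, and $\phi$ a dispersion factor (the symbols $m$, $\pi$, and $\phi$ are notation introduced here for illustration):
+$$
+\text{Poisson: } \operatorname{Var}(Y) = \mu, \qquad \text{Binomial: } \operatorname{Var}(Y) = m\,\pi(1 - \pi)
+$$
+$$
+\operatorname{Var}_{\text{observed}}(Y) \approx \phi \cdot \operatorname{Var}_{\text{model}}(Y), \qquad \phi > 1 \text{ (overdispersion)}, \quad \phi < 1 \text{ (underdispersion)}
+$$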
+### 2. How to Calculate (Detect) Overdispersion and Underdispersion
+#### Steps to Detect Overdispersion and Underdispersion:
+1. **Fit a Poisson or Binomial Model**: Start by fitting the appropriate model, which assumes that the variance equals the mean (Poisson) or is fixed by the number of trials and the success probability (Binomial).
+2. **Calculate the Dispersion Statistic**:
+   - For **Poisson regression**, the dispersion statistic is calculated as:
+   $$
+   \text{Dispersion} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / \hat{y}_i}{n - p}
+   $$
+   Where:
+   - $y_i$ is the observed value.
+   - $\hat{y}_i$ is the predicted value.
+   - $n$ is the number of observations.
+   - $p$ is the number of parameters in the model.
+   A dispersion statistic significantly **greater than 1** indicates **overdispersion**, while a value significantly **less than 1** indicates **underdispersion**.
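+3. **Residual Analysis**: Examine the Pearson or deviance residuals from the model. Under a well-specified model, Pearson residuals have variance close to 1. Residuals that are systematically more spread out than this indicate overdispersion, and residuals that are systematically less spread out indicate underdispersion.
+4. **Use Information Criteria**: Compare the original model with one that includes a dispersion parameter (e.g., Poisson versus Negative Binomial) using information criteria such as **AIC** or **BIC**. A clearly lower AIC or BIC for the model that accounts for dispersion suggests that the original variance assumption was inadequate.
+5. **Chi-Squared Goodness of Fit**: Under a correctly specified model, the Pearson chi-squared statistic is roughly equal to its degrees of freedom ($n - p$). A statistic much larger than the degrees of freedom can signal overdispersion. Conversely, a statistic much smaller than the degrees of freedom may signal underdispersion.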
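+As a rough worked example of the dispersion statistic from step 2, consider five hypothetical counts $y = (0, 1, 2, 7, 10)$ (invented here purely for illustration) and an intercept-only Poisson model, so that $p = 1$ and every fitted value equals the sample mean, $\hat{y}_i = 4$:
+$$
+\sum_{i=1}^{5} \frac{(y_i - \hat{y}_i)^2}{\hat{y}_i} = \frac{16 + 9 + 4 + 9 + 36}{4} = 18.5, \qquad \text{Dispersion} = \frac{18.5}{5 - 1} = 4.625
+$$
+The numerator is the Pearson chi-squared statistic from step 5 and $n - p = 4$ is its degrees of freedom, so a ratio well above 1 such as this one points to overdispersion. With so few observations the estimate is very noisy; the example only illustrates the arithmetic.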
+### 3. Common Uses
+Handling overdispersion and underdispersion matters in many applied fields, particularly when modeling count data, binomial outcomes, and large datasets with high variability.
+#### 1. **Overdispersion in Count Data**
+Overdispersion frequently occurs in Poisson models for count data, where the assumption that the variance equals the mean is violated. If not accounted for, this leads to underestimated standard errors and overly narrow confidence intervals.
+##### Example: Species Abundance Models
+In ecology, when modeling species counts in different habitats, Poisson regression is commonly used. If the counts exhibit more variation than expected, switching to a **Negative Binomial regression** model can account for the overdispersion, providing more reliable estimates.
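+One way to see why the Negative Binomial helps is its variance function. In the common NB2 parameterization (an assumption here, since the text does not say which form is meant), with mean $\mu$ and dispersion parameter $\theta > 0$:
+$$
+\operatorname{Var}(Y) = \mu + \frac{\mu^2}{\theta}
+$$
+The variance always exceeds the mean, and the model approaches the Poisson as $\theta \to \infty$. This is also why the Negative Binomial cannot represent underdispersion.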
+#### 2. **Underdispersion in Medical Trials**
+Underdispersion can arise in situations where data is tightly clustered, with less variability than expected. This can occur in tightly controlled experimental conditions, such as some medical trials, where the response variability among subjects is unusually low.
+##### Example: Response to Treatment
+In clinical trials where patients' responses to a drug are more uniform than the assumed model expects, underdispersion might occur. A standard Poisson or Binomial model then overstates the variability, giving confidence intervals that are wider than warranted and potentially misleading conclusions about treatment efficacy.
+#### 3. **Observation-Level Random Effects (OLRE) for Overdispersion**
+In mixed-effects models, **Observation-Level Random Effects (OLRE)** are often used to handle overdispersion caused by unmodeled variability at the observation level. An OLRE adds a random effect with a unique level for each observation, which absorbs excess variance not explained by the fixed effects alone.
+##### Example: Survey Data with Observer Variability
+In wildlife surveys, unmeasured differences between individual counts (e.g., variation in observer detection ability or survey conditions) can lead to overdispersion in the count data. Including an OLRE in the model can help account for this additional variance and improve model fit.
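+A minimal sketch of a Poisson model with an OLRE, assuming a log link and a normally distributed effect (the symbols $\mathbf{x}_i$, $\boldsymbol{\beta}$, $\varepsilon_i$, and $\sigma^2$ are notation introduced here for illustration):
+$$
+y_i \sim \text{Poisson}(\mu_i), \qquad \log \mu_i = \mathbf{x}_i^\top \boldsymbol{\beta} + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)
+$$
+Each observation $i$ gets its own $\varepsilon_i$. An estimated $\sigma^2$ close to zero suggests little extra-Poisson variation.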
+### 4. Issues
+#### 1. **Overdispersion: Inflated Variance**
+Ignoring overdispersion leads to standard errors that are too small, which makes estimates look more precise than they are and raises Type I error rates (false positives). This occurs because the model underestimates the true variability in the data.
+##### Solution:
+- Use **Negative Binomial Regression**: This model adds an extra parameter to account for the additional variability in count data.
+- **Quasi-Likelihood Models**: Use quasi-Poisson models for counts, or quasi-binomial models for proportion data, to scale the variance by an estimated dispersion factor and handle overdispersion.
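+As a sketch of what the quasi-likelihood adjustment does, with $\hat{\phi}$ denoting the estimated dispersion statistic (notation introduced here): the coefficient estimates $\hat{\beta}_j$ are unchanged and their standard errors are rescaled,
+$$
+\operatorname{Var}(Y) = \phi\,\mu \ \ \text{(quasi-Poisson)}, \qquad \operatorname{SE}_{\text{quasi}}(\hat{\beta}_j) = \sqrt{\hat{\phi}} \cdot \operatorname{SE}(\hat{\beta}_j)
+$$
+For example, $\hat{\phi} = 4$ doubles every standard error relative to the plain Poisson fit.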
+#### 2. **Underdispersion: Deflated Variance**
+When data are underdispersed, a standard model assumes more variability than is truly present, so standard errors are too large and confidence intervals too wide. Inference becomes overly conservative, increasing the risk of Type II errors (false negatives) and missed significant effects.
+##### Solution:
+- **Data Transformation**: Transform the data to better fit the model assumptions, or use a model that can account for underdispersion, such as a quasi-Poisson, generalized Poisson, or Conway-Maxwell-Poisson model.
+- **Revisit the Model Assumptions**: Ensure that the assumptions regarding variance are appropriate for the data.
+#### 3. **Incorrect Model Choice**
+If overdispersion or underdispersion is not accounted for, the model chosen may be incorrect, leading to biased conclusions and poor generalization to new data. This can be particularly problematic in predictive modeling.
+##### Solution:
+- **Use Models That Handle Dispersion**: For overdispersion, consider using **Negative Binomial**, **Quasi-Poisson**, or **Quasi-Binomial** models. For underdispersion, ensure the variance structure of the model aligns with the observed data.
+#### 4. **Difficulty in Detecting Underdispersion**
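+Underdispersion is less common and often harder to detect. Its main symptom is conservative inference (e.g., confidence intervals that are wider than necessary) rather than obviously poor fit, so it can go unnoticed, and a dispersion statistic well below 1 can be mistaken for an unusually good fit, leading to incorrect inferences.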
+##### Solution:
+- **Residual Diagnostics**: Pay close attention to residuals and model fit diagnostics to ensure that variance is appropriately modeled. Perform goodness-of-fit tests to assess underdispersion.
+---
+### How to Handle Overdispersion and Underdispersion Effectively
+- **Use the Right Model**: For overdispersion, switch to **Negative Binomial**, **Quasi-Poisson**, or **Quasi-Binomial** models. For underdispersion, check if the variance structure aligns with the data and adjust the model accordingly.
+- **Check Residuals and Diagnostics**: Regularly check residuals, dispersion statistics, and goodness-of-fit tests to detect overdispersion or underdispersion early in the modeling process.
+- **Use OLRE for Overdispersion**: In mixed-effects models, use **Observation-Level Random Effects** to account for overdispersion that arises from observation-level variability.
+- **Monitor Model Fit**: Use information criteria (AIC/BIC) and cross-validation to compare models that account for dispersion versus those that do not.
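+For reference, the two information criteria mentioned above are usually defined as follows, where $k$ is the number of estimated parameters, $n$ the number of observations, and $\hat{L}$ the maximized likelihood ($k$ and $\hat{L}$ are notation introduced here):
+$$
+\text{AIC} = 2k - 2\ln\hat{L}, \qquad \text{BIC} = k\ln n - 2\ln\hat{L}
+$$
+Lower values indicate a better trade-off between fit and complexity. A Negative Binomial model has one more parameter than the corresponding Poisson model, so it needs a sufficiently higher likelihood to be preferred. Quasi-Poisson and quasi-binomial models do not define a full likelihood, so standard AIC and BIC are not available for them.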