# FAQ

Frequently asked questions about combatlearn.

---

## Can I use ComBat in a cross-validation pipeline?

Yes. The `ComBat` class implements scikit-learn's `BaseEstimator` and `TransformerMixin`, so it works with `Pipeline`, `cross_val_score`, and `GridSearchCV`:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from combatlearn import ComBat

pipe = Pipeline([
    ("combat", ComBat(batch=batch)),
    ("scaler", StandardScaler()),
])
pipe.fit_transform(X)
```

**Important:** Batch labels are passed at construction, not at `fit()`. This means the same batch vector is used for both fitting and transforming. In a cross-validation setting, ensure your batch labels are properly aligned with the train/test splits.

---

## Which method should I use?

| Method | When to use |
|--------|-------------|
| `'johnson'` | Simple batch correction without covariates. Good default for most cases. |
| `'fortin'` | When you have biological covariates (e.g., age, sex, diagnosis) that should be preserved during correction. |
| `'chen'` | When batch effects also affect the covariance structure of the data, not just means and variances. Extends Fortin with PCA-based covariance correction. |

Start with `'johnson'` if you have no covariates, or `'fortin'` if you do. Use `'chen'` only if you have evidence of covariance-level batch effects.

---

## Parametric or non-parametric?

- **Parametric** (`parametric=True`, default): Assumes batch effect parameters follow specific distributions (inverse gamma for variance, normal for mean). Faster and works well for most omics data.
- **Non-parametric** (`parametric=False`): Makes no distributional assumptions. Use this when your data violates the parametric assumptions (e.g., heavy-tailed distributions, small sample sizes).

In practice, parametric mode works well for the vast majority of cases.

---

## What does `mean_only` do?

When `mean_only=True`, ComBat only adjusts the **location** (mean) of each batch, leaving the **scale** (variance) unchanged. This is useful when:

- You believe batch effects only shift the means.
- You want to preserve the original variance structure of your data.
- Your variance estimates are unreliable due to small sample sizes.

---

## Can I apply ComBat to new/unseen data?

Yes, via `transform()`. After fitting on your training data, you can transform new data from the **same batches**:

```python
combat = ComBat(batch=batch_train).fit(X_train)
X_test_corrected = combat.transform(X_test)
```

However, ComBat **cannot** handle batches that were not seen during fitting. If `X_test` contains samples from a new batch, you must re-fit the model with data from all batches.

---

## How do I interpret the `summary()` output?

After fitting, call `summary(combat)` (from `combatlearn.inspection`) for a diagnostic report. Key sections:

- **Method/Parametric/Mean only**: Confirms your configuration.
- **Samples per batch**: Check for small or imbalanced batches.
- **Top 5 features by batch effect**: Features most affected by batch - useful for quality control.
- **Diagnostics table**:
  - **Batch var. explained (before/after)**: Fraction of total variance explained by batch. Should decrease substantially after correction.
  - **Design matrix condition number**: Large values (>100) suggest collinearity issues (Fortin/Chen only).
  - **EB convergence**: Whether the iterative estimation converged for each batch.

---

## How do I choose `covbat_cov_thresh`?

This parameter only applies to `method='chen'` (CovBat). It controls how many principal components are used for covariance correction:

- **Float (0, 1]**: Cumulative variance ratio. `0.9` (default) retains PCs explaining 90% of variance. Higher values correct more subtle covariance effects but risk overfitting.
- **Int >= 1**: Fixed number of PCs. Useful when you know the dimensionality of your data's covariance structure.

Start with the default `0.9`. If correction seems insufficient, try `0.95` or `0.99`.