FAQ
Frequently asked questions about combatlearn.
Can I use ComBat in a cross-validation pipeline?
Yes. The ComBat class implements scikit-learn’s BaseEstimator and TransformerMixin, so it works with Pipeline, cross_val_score, and GridSearchCV:
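    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from combatlearn import ComBat

    # X (features) and batch (batch labels) are assumed to be defined already.
    pipe = Pipeline([
        ("combat", ComBat(batch=batch)),  # batch correction runs first
        ("scaler", StandardScaler()),     # then standardize the corrected features
    ])
    pipe.fit_transform(X)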
Important: Batch labels are passed at construction, not at fit(), so the same batch vector is used for both fitting and transforming. In a cross-validation setting, make sure every row in each train/test split can be matched to its own batch label, for example by keeping X and the batch labels on a shared sample index.
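One way to keep labels aligned is to look them up by sample ID instead of by position. The sketch below is plain Python and does not call combatlearn; the sample IDs, the batch_by_id mapping, and the labels_for helper are hypothetical names introduced only for illustration.

```python
import random

# Hypothetical data: eight samples, the first four from batch "A", the rest from "B".
sample_ids = [f"s{i}" for i in range(8)]
batch_by_id = {sid: ("A" if i < 4 else "B") for i, sid in enumerate(sample_ids)}


def labels_for(ids, batch_by_id):
    """Look up batch labels by sample ID; fail loudly if any ID has no label."""
    missing = [sid for sid in ids if sid not in batch_by_id]
    if missing:
        raise KeyError(f"no batch label for samples: {missing}")
    return [batch_by_id[sid] for sid in ids]


# Shuffle before splitting, as many cross-validation splitters do.
rng = random.Random(0)
shuffled = sample_ids[:]
rng.shuffle(shuffled)
train_ids, test_ids = shuffled[:6], shuffled[6:]

# Each split gets exactly the labels that belong to its own rows.
train_batches = labels_for(train_ids, batch_by_id)
test_batches = labels_for(test_ids, batch_by_id)
```

Looking labels up by ID means a shuffled or stratified split cannot silently pair a row with another sample's batch label.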
Which method should I use?
| Method | When to use |
|---|---|
| 'johnson' | Simple batch correction without covariates. Good default for most cases. |
| 'fortin' | When you have biological covariates (e.g., age, sex, diagnosis) that should be preserved during correction. |
| 'chen' | When batch effects also affect the covariance structure of the data, not just means and variances. Extends Fortin with PCA-based covariance correction. |
Start with 'johnson' if you have no covariates, or 'fortin' if you do. Use 'chen' only if you have evidence of covariance-level batch effects.
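The rule of thumb above can be written as a small decision helper. This is only a sketch of the guidance in this FAQ; choose_method and its arguments are hypothetical names and are not part of combatlearn.

```python
def choose_method(has_covariates, covariance_batch_effects=False):
    """Return the method name suggested by the FAQ's rule of thumb."""
    if covariance_batch_effects:
        # Only when there is evidence of covariance-level batch effects.
        return "chen"
    # Otherwise the choice depends on whether covariates must be preserved.
    return "fortin" if has_covariates else "johnson"


default_choice = choose_method(has_covariates=False)
with_covariates = choose_method(has_covariates=True)
covariance_case = choose_method(has_covariates=True, covariance_batch_effects=True)
```

The returned string would then be passed as the method argument when constructing ComBat.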
Parametric or non-parametric?
Parametric (parametric=True, default): Assumes batch effect parameters follow specific distributions (inverse gamma for variance, normal for mean). Faster and works well for most omics data.
Non-parametric (parametric=False): Makes no distributional assumptions. Use this when your data violates the parametric assumptions (e.g., heavy-tailed distributions, small sample sizes).
In practice, parametric mode works well for the vast majority of cases.
What does mean_only do?
When mean_only=True, ComBat only adjusts the location (mean) of each batch, leaving the scale (variance) unchanged. This is useful when:
You believe batch effects only shift the means.
You want to preserve the original variance structure of your data.
Your variance estimates are unreliable due to small sample sizes.
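To make the idea of a location-only adjustment concrete, here is a deliberately simplified sketch for a single feature. It is not combatlearn's implementation: it skips the empirical Bayes shrinkage and covariate handling, and the shift_batch_means helper is a hypothetical name. It only shows that moving each batch mean to the grand mean leaves the within-batch variance untouched.

```python
from statistics import mean, pvariance


def shift_batch_means(values, batches):
    """Move every batch mean to the grand mean without rescaling."""
    grand_mean = mean(values)
    batch_means = {
        b: mean([v for v, vb in zip(values, batches) if vb == b])
        for b in set(batches)
    }
    return [v - batch_means[b] + grand_mean for v, b in zip(values, batches)]


# Toy feature measured in two batches with different means and spreads.
values = [1.0, 2.0, 3.0, 11.0, 13.0, 15.0]
batches = ["A", "A", "A", "B", "B", "B"]

adjusted = shift_batch_means(values, batches)

# Batch means now coincide; within-batch variances are the same as before.
mean_a, mean_b = mean(adjusted[:3]), mean(adjusted[3:])
var_a_before, var_a_after = pvariance(values[:3]), pvariance(adjusted[:3])
var_b_before, var_b_after = pvariance(values[3:]), pvariance(adjusted[3:])
```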
Can I apply ComBat to new/unseen data?
Yes, via transform(). After fitting on your training data, you can transform new data from the same batches:
    # Batch labels are fixed at construction, so batch must cover the rows of both X_train and X_test.
    combat = ComBat(batch=batch).fit(X_train)
    X_test_corrected = combat.transform(X_test)
However, ComBat cannot handle batches that were not seen during fitting. If X_test contains samples from a new batch, you must re-fit the model with data from all batches.
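A quick pre-check can catch unseen batches before calling transform(). The sketch below is plain Python; check_known_batches and the site labels are hypothetical and not part of combatlearn.

```python
def check_known_batches(train_batches, new_batches):
    """Raise if new_batches contains a label that was absent during fitting."""
    unseen = sorted(set(new_batches) - set(train_batches))
    if unseen:
        raise ValueError(f"batches not seen during fitting: {unseen}")


train_batches = ["site1", "site1", "site2", "site2"]

# Every label is already known, so this call returns without raising.
check_known_batches(train_batches, ["site2", "site1"])

# "site3" was never seen during fitting, so the check raises.
try:
    check_known_batches(train_batches, ["site3"])
except ValueError as err:
    message = str(err)
```

If the check fails, the remedy is the one described above: re-fit with data from all batches.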
How do I interpret the summary() output?
After fitting, call summary(combat) (from combatlearn.inspection) for a diagnostic report. Key sections:
Method/Parametric/Mean only: Confirms your configuration.
Samples per batch: Check for small or imbalanced batches.
Top 5 features by batch effect: Features most affected by batch, useful for quality control.
Diagnostics table:
Batch var. explained (before/after): Fraction of total variance explained by batch. Should decrease substantially after correction.
Design matrix condition number: Large values (>100) suggest collinearity issues (Fortin/Chen only).
EB convergence: Whether the iterative estimation converged for each batch.
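As an illustration of the "batch var. explained" idea, the sketch below computes, for one feature, the share of total variance attributable to batch as the between-batch sum of squares divided by the total sum of squares. This is an assumed, textbook-style definition; the exact calculation used by summary() may differ, and batch_variance_explained is a hypothetical helper.

```python
from statistics import mean


def batch_variance_explained(values, batches):
    """Between-batch sum of squares as a fraction of the total sum of squares."""
    grand_mean = mean(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        return 0.0
    ss_between = 0.0
    for b in set(batches):
        group = [v for v, vb in zip(values, batches) if vb == b]
        ss_between += len(group) * (mean(group) - grand_mean) ** 2
    return ss_between / ss_total


batches = ["A", "A", "A", "B", "B", "B"]
before = [1.0, 2.0, 3.0, 11.0, 13.0, 15.0]  # batch means differ strongly
after = [6.5, 7.5, 8.5, 5.5, 7.5, 9.5]      # same data with both batch means moved to 7.5

frac_before = batch_variance_explained(before, batches)
frac_after = batch_variance_explained(after, batches)
```

In this toy case the fraction drops from roughly 0.95 to 0, which is the kind of decrease the diagnostics table is meant to reveal.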
How do I choose covbat_cov_thresh?
This parameter only applies to method='chen' (CovBat). It controls how many principal components are used for covariance correction:
Float in (0, 1]: Cumulative variance ratio. 0.9 (default) retains PCs explaining 90% of variance. Higher values correct more subtle covariance effects but risk overfitting.
Int >= 1: Fixed number of PCs. Useful when you know the dimensionality of your data’s covariance structure.
Start with the default 0.9. If correction seems insufficient, try 0.95 or 0.99.
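The two forms of the threshold can be illustrated with a short sketch that maps a threshold to a number of principal components, given per-component explained variances sorted in decreasing order. The selection rule here (smallest number of PCs whose cumulative ratio reaches the threshold) is an assumption; combatlearn's internal rule may differ in detail, and n_components_for is a hypothetical helper.

```python
def n_components_for(explained_variances, thresh):
    """Translate a float ratio or an int count into a number of PCs."""
    n_available = len(explained_variances)
    if isinstance(thresh, int) and not isinstance(thresh, bool):
        if thresh < 1:
            raise ValueError("an integer threshold must be >= 1")
        return min(thresh, n_available)  # fixed number of PCs
    if not 0 < thresh <= 1:
        raise ValueError("a float threshold must lie in (0, 1]")
    total = sum(explained_variances)
    cumulative = 0.0
    for k, variance in enumerate(explained_variances, start=1):
        cumulative += variance
        if cumulative / total >= thresh:
            return k  # smallest k reaching the requested ratio
    return n_available


# Toy spectrum: cumulative ratios are about 0.60, 0.85, 0.95, 0.98, 1.00.
variances = [6.0, 2.5, 1.0, 0.3, 0.2]

k_default = n_components_for(variances, 0.9)
k_strict = n_components_for(variances, 0.99)
k_fixed = n_components_for(variances, 2)
```

On this toy spectrum, raising the threshold from 0.9 to 0.99 pulls in the two smallest components, which is where the overfitting risk mentioned above comes from.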