FAQ

Frequently asked questions about combatlearn.


Can I use ComBat in a cross-validation pipeline?

Yes. The ComBat class inherits from scikit-learn’s BaseEstimator and TransformerMixin, so it works with Pipeline, cross_val_score, and GridSearchCV:

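from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from combatlearn import ComBat

# X: feature matrix (samples x features); batch: one batch label per sample.
pipe = Pipeline([
    ("combat", ComBat(batch=batch)),  # remove batch effects first
    ("scaler", StandardScaler()),     # then standardize the corrected features
])
X_corrected = pipe.fit_transform(X)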

Important: Batch labels are passed to the constructor, not to fit(), so the same batch vector serves both fitting and transforming. In cross-validation, that vector therefore has to cover every sample that ends up in a fold, and each row of a train or test split must map to its own label.
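One way to keep labels aligned is to slice the feature matrix and the batch vector with the same fold indices, and to spread each batch across the folds so that no test fold contains a batch missing from its training fold. The sketch below uses plain NumPy on toy data. `batch_stratified_folds` is a hypothetical helper, not part of combatlearn, and the sketch does not show how combatlearn itself matches labels to rows.

```python
import numpy as np

def batch_stratified_folds(batch, n_splits, seed=0):
    """Yield (train_idx, test_idx) pairs, dealing each batch round-robin over folds.

    Hypothetical helper for illustration; not part of combatlearn.
    """
    batch = np.asarray(batch)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(batch), dtype=int)
    for label in np.unique(batch):
        # Shuffle the members of this batch, then deal them out round-robin.
        members = rng.permutation(np.flatnonzero(batch == label))
        fold_of[members] = np.arange(len(members)) % n_splits
    for k in range(n_splits):
        yield np.flatnonzero(fold_of != k), np.flatnonzero(fold_of == k)

# Toy data: 12 samples, 3 features, 3 batches of 4 samples each.
X = np.random.default_rng(0).normal(size=(12, 3))
batch = np.array(["A", "B", "C"] * 4)

folds = list(batch_stratified_folds(batch, n_splits=4))
for train_idx, test_idx in folds:
    # The same indices slice both the features and the labels.
    X_train, batch_train = X[train_idx], batch[train_idx]
    X_test, batch_test = X[test_idx], batch[test_idx]
    # Every batch in the test fold was also seen in the training fold.
    assert set(batch_test) <= set(batch_train)
```

The final assertion only holds when every batch has at least two samples; a batch with a single sample would land in one test fold and be absent from that fold's training data.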


Which method should I use?

| Method | When to use |
| --- | --- |
| 'johnson' | Simple batch correction without covariates. Good default for most cases. |
| 'fortin' | When you have biological covariates (e.g., age, sex, diagnosis) that should be preserved during correction. |
| 'chen' | When batch effects also affect the covariance structure of the data, not just means and variances. Extends Fortin with PCA-based covariance correction. |

Start with 'johnson' if you have no covariates, or 'fortin' if you do. Use 'chen' only if you have evidence of covariance-level batch effects.
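The rule of thumb above can be written down as a small decision function. This is only a sketch: `choose_method` and its two boolean arguments are hypothetical and not part of combatlearn; the returned strings are the method names from the table.

```python
def choose_method(has_covariates, covariance_batch_effects=False):
    """Return a ComBat method name following the FAQ's rule of thumb.

    Hypothetical helper for illustration; not part of combatlearn.
    """
    if covariance_batch_effects:
        # 'chen' extends 'fortin' with PCA-based covariance correction.
        return "chen"
    return "fortin" if has_covariates else "johnson"

print(choose_method(has_covariates=False))  # johnson
print(choose_method(has_covariates=True))   # fortin
print(choose_method(has_covariates=True, covariance_batch_effects=True))  # chen
```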


Parametric or non-parametric?

  • Parametric (parametric=True, default): Assumes batch effect parameters follow specific distributions (inverse gamma for variance, normal for mean). Faster and works well for most omics data.

  • Non-parametric (parametric=False): Makes no distributional assumptions. Use this when your data violates the parametric assumptions (e.g., heavy-tailed distributions, small sample sizes).

In practice, parametric mode works well for the vast majority of cases.
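To make the inverse gamma assumption concrete, the sketch below fits an inverse gamma prior to a set of per-feature variance estimates by matching moments, as in the original ComBat formulation. Whether combatlearn computes its priors in exactly this way is an assumption, and `inverse_gamma_moments` is a hypothetical helper name.

```python
import numpy as np

def inverse_gamma_moments(delta_hat):
    """Method-of-moments shape and scale of an inverse gamma prior.

    delta_hat: per-feature variance estimates for one batch.
    Hypothetical helper for illustration; not part of combatlearn.
    """
    m = np.mean(delta_hat)
    s2 = np.var(delta_hat, ddof=1)
    shape = (2 * s2 + m**2) / s2
    scale = (m * s2 + m**3) / s2
    return shape, scale

rng = np.random.default_rng(0)
delta_hat = rng.uniform(0.5, 2.0, size=200)  # toy per-feature variances
shape, scale = inverse_gamma_moments(delta_hat)

# An inverse gamma with these parameters reproduces the sample moments.
prior_mean = scale / (shape - 1)
prior_var = scale**2 / ((shape - 1) ** 2 * (shape - 2))
print(np.isclose(prior_mean, np.mean(delta_hat)))         # True
print(np.isclose(prior_var, np.var(delta_hat, ddof=1)))   # True
```

If a histogram of the variance estimates looks nothing like the fitted prior, that is one informal sign that the non-parametric mode may be the safer choice.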


What does mean_only do?

When mean_only=True, ComBat only adjusts the location (mean) of each batch, leaving the scale (variance) unchanged. This is useful when:

  • You believe batch effects only shift the means.

  • You want to preserve the original variance structure of your data.

  • Your variance estimates are unreliable due to small sample sizes.
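The difference between a location-only and a location-and-scale adjustment can be seen on a toy feature. The sketch below is deliberately simplified: it aligns batches with plain means and standard deviations and omits the empirical Bayes shrinkage that ComBat applies, and `adjust` is a hypothetical function, not combatlearn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy feature measured in two batches with different means and spreads.
x = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(3.0, 2.0, 50)])
batch = np.array([0] * 50 + [1] * 50)

def adjust(x, batch, mean_only):
    """Align each batch to the overall mean and, optionally, to a pooled scale."""
    out = np.empty_like(x)
    labels = np.unique(batch)
    pooled_sd = np.sqrt(np.mean([x[batch == b].var() for b in labels]))
    for b in labels:
        xb = x[batch == b]
        centered = xb - xb.mean()
        if not mean_only:
            # Rescale so every batch ends up with the pooled standard deviation.
            centered = centered / xb.std() * pooled_sd
        out[batch == b] = centered + x.mean()
    return out

loc_only = adjust(x, batch, mean_only=True)
loc_scale = adjust(x, batch, mean_only=False)

for b in (0, 1):
    print(b, loc_only[batch == b].std(), loc_scale[batch == b].std())
```

With `mean_only=True` the two batches share a mean but keep their original spreads; with `mean_only=False` their spreads are equalized as well.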


Can I apply ComBat to new/unseen data?

Yes, via transform(). After fitting on your training data, you can transform new data from the same batches:

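# Labels are given at construction, so `batch` is assumed here to hold a
# label for the samples in X_test as well as for those in X_train.
combat = ComBat(batch=batch).fit(X_train)
X_test_corrected = combat.transform(X_test)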

However, ComBat cannot handle batches that were not seen during fitting. If X_test contains samples from a new batch, you must re-fit the model with data from all batches.
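A simple guard before calling transform() is to compare the batch labels of the new samples with those used for fitting. The helper below is a sketch in plain Python; `unseen_batches` and the site labels are hypothetical and not part of combatlearn.

```python
def unseen_batches(batch_train, batch_new):
    """Return the sorted batch labels in batch_new that are absent from batch_train."""
    return sorted(set(batch_new) - set(batch_train))

batch_train = ["site1", "site1", "site2", "site2"]
batch_new = ["site2", "site3"]

missing = unseen_batches(batch_train, batch_new)
if missing:
    print(f"Re-fit required; unseen batches: {missing}")
```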


How do I interpret the summary() output?

After fitting, call summary(combat) (from combatlearn.inspection) for a diagnostic report. Key sections:

  • Method/Parametric/Mean only: Confirms your configuration.

  • Samples per batch: Check for small or imbalanced batches.

  • Top 5 features by batch effect: The features most affected by batch, which are useful targets for quality control.

  • Diagnostics table:

    • Batch var. explained (before/after): Fraction of total variance explained by batch. Should decrease substantially after correction.

    • Design matrix condition number: Large values (>100) suggest collinearity issues (Fortin/Chen only).

    • EB convergence: Whether the iterative estimation converged for each batch.
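For intuition about two of the diagnostics above, the sketch below computes a batch variance fraction as between-batch sum of squares over total sum of squares, and a design matrix condition number with `numpy.linalg.cond`. These are common definitions; that summary() uses exactly these formulas is an assumption, and `batch_variance_explained` is a hypothetical helper.

```python
import numpy as np

def batch_variance_explained(X, batch):
    """Between-batch sum of squares divided by total sum of squares."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    grand_mean = X.mean(axis=0)
    ss_total = ((X - grand_mean) ** 2).sum()
    ss_between = sum(
        (batch == b).sum() * ((X[batch == b].mean(axis=0) - grand_mean) ** 2).sum()
        for b in np.unique(batch)
    )
    return ss_between / ss_total

rng = np.random.default_rng(0)
batch = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 5)) + 2.0 * batch[:, None]  # additive batch shift

before = batch_variance_explained(X, batch)
# Crude stand-in for a correction: remove each batch's own mean.
X_centered = X.copy()
for b in np.unique(batch):
    X_centered[batch == b] -= X_centered[batch == b].mean(axis=0)
after = batch_variance_explained(X_centered, batch)
print(f"batch variance explained: {before:.2f} -> {after:.2f}")

# Condition number of a toy design matrix: intercept, batch indicator, age.
age = rng.uniform(20, 80, size=60)
design = np.column_stack([np.ones(60), batch, age])
print(f"condition number: {np.linalg.cond(design):.1f}")
```

By default `numpy.linalg.cond` returns the ratio of the largest to the smallest singular value, so covariates on very different scales (such as raw age next to 0/1 indicators) raise it even without true collinearity.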


How do I choose covbat_cov_thresh?

This parameter only applies to method='chen' (CovBat). It controls how many principal components are used for covariance correction:

  • Float in (0, 1]: Cumulative variance ratio. The default, 0.9, retains the leading PCs that together explain 90% of the variance. Higher values correct subtler covariance effects but risk overfitting.

  • Int >= 1: Fixed number of PCs. Useful when you know the dimensionality of your data’s covariance structure.

Start with the default 0.9. If correction seems insufficient, try 0.95 or 0.99.
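To see how a threshold translates into a number of components, the sketch below maps a float or integer value onto a PC count given a set of PCA eigenvalues. `n_pcs_from_thresh` is a hypothetical helper; it assumes the float case selects the smallest number of PCs whose cumulative variance ratio reaches the threshold, which may differ in detail from what combatlearn does internally.

```python
import numpy as np

def n_pcs_from_thresh(explained_variance, thresh):
    """Translate a covbat_cov_thresh-style value into a number of PCs.

    Hypothetical helper for illustration; not part of combatlearn.
    """
    ev = np.sort(np.asarray(explained_variance, dtype=float))[::-1]
    if isinstance(thresh, (int, np.integer)):
        # Integer: a fixed number of components, capped at what is available.
        return min(int(thresh), len(ev))
    cum_ratio = np.cumsum(ev) / ev.sum()
    # Float: smallest k whose first k PCs reach the cumulative ratio.
    k = int(np.searchsorted(cum_ratio, thresh, side="left")) + 1
    return min(k, len(ev))

# Toy eigenvalues; cumulative ratios are 0.5, 0.75, 0.875, 0.9375, 1.0.
eigenvalues = [8, 4, 2, 1, 1]
print(n_pcs_from_thresh(eigenvalues, 0.9))   # 4
print(n_pcs_from_thresh(eigenvalues, 0.95))  # 5
print(n_pcs_from_thresh(eigenvalues, 3))     # 3
```

In this toy spectrum, raising the threshold from 0.9 to 0.95 adds exactly one component, the kind of small step worth checking before jumping to 0.99.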