combatlearn

combatlearn makes the popular ComBat (and CovBat) batch-effect correction algorithm available for use in machine learning frameworks. It lets you harmonize high-dimensional data inside a scikit-learn Pipeline, so that cross-validation and grid-search automatically take batch structure into account, without data leakage.

Features

Six ComBat Methods:
- method="johnson" - Classic ComBat (Johnson et al., 2007)
- method="fortin" - neuroCombat with covariates (Fortin et al., 2018)
- method="chen" - CovBat PCA-based (Chen et al., 2022)
- method="longitudinal" - Longitudinal ComBat for repeated measures (Beer et al., 2020)
- method="gam" - ComBat-GAM, nonlinear (spline) covariate effects (Pomponio et al., 2020)
- method="covbat_gam" - CovBat with the same nonlinear covariate modeling
- Each accepts case- and separator-insensitive literature aliases (e.g. "covbat", "neurocombat", "combat_gam")
Scikit-learn Compatible:
- Works seamlessly in Pipeline objects
- Compatible with GridSearchCV and cross_val_score
- Prevents data leakage during cross-validation
Visualization Tools:
- Built-in plotting with PCA, t-SNE, and UMAP
- Static (matplotlib) and interactive (plotly) visualizations
- Before/after batch effect comparison
Inspection & Metrics:
- feature_batch_diagnostics() identifies which features carry the strongest batch effects, with location (mean shift) and scale (variance) decomposition
- compute_batch_metrics() quantifies correction quality (Silhouette, Davies-Bouldin, kBET, LISI, k-NN preservation, and more)
- summary() prints a diagnostic report for a fitted model
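To make the idea of a batch-separation metric concrete, here is a minimal sketch that scores how strongly samples cluster by batch before and after a correction, using scikit-learn's `silhouette_score` directly. It does not call `compute_batch_metrics()`, whose signature is not shown on this page; the synthetic data and the crude per-batch mean-centering step are assumptions made purely for illustration and are not ComBat.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic data (assumption): 200 samples, 5 features, two batches of 100
batch = np.repeat([0, 1], 100)
X_raw = rng.normal(size=(200, 5))
X_raw[batch == 1] += 3.0  # simulated additive batch shift

# Crude location-only correction for illustration only (NOT ComBat):
# subtract each batch's own feature means
X_adj = X_raw.copy()
for b in np.unique(batch):
    X_adj[batch == b] -= X_adj[batch == b].mean(axis=0)

# Silhouette computed with batch labels as the "clusters":
# values near 1 mean samples separate by batch, values near 0 mean they mix
sil_raw = silhouette_score(X_raw, batch)
sil_adj = silhouette_score(X_adj, batch)

print(f"Silhouette by batch, raw:      {sil_raw:.3f}")
print(f"Silhouette by batch, adjusted: {sil_adj:.3f}")
```

A lower batch silhouette after correction suggests that batches overlap more, but it says nothing on its own about whether biological signal was preserved, which is why several complementary metrics are usually reported together.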

Quick Example

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from combatlearn import ComBat

# Load your data
X = pd.read_csv("data.csv", index_col=0)
y = pd.read_csv("labels.csv", index_col=0).squeeze()
batch = pd.read_csv("batch.csv", index_col=0).squeeze()

# Create pipeline with ComBat
pipe = Pipeline([
    ("combat", ComBat(batch=batch, method="fortin")),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression())
])

# Hyperparameter tuning with grid search
param_grid = {
    "combat__mean_only": [True, False],
    "clf__C": [0.01, 0.1, 1, 10],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
grid.fit(X, y)

print(f"Best CV AUROC: {grid.best_score_:.3f}")

Why combatlearn?

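Batch effects are systematic technical variations that can confound biological signals in high-dimensional data. ComBat is widely regarded as a standard method for batch-effect correction, but traditional implementations typically correct the whole dataset at once rather than being fitted on a training split and applied to held-out data, so they don’t integrate well with machine learning workflows.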

combatlearn solves this by:

- Fitting ComBat parameters on training data only
- Applying the same transformation to test data
- Preventing data leakage in cross-validation
- Supporting hyperparameter tuning of batch correction
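The fit-on-training, apply-to-test behaviour listed above can be illustrated with a deliberately simple stand-in. The sketch below defines a hypothetical `BatchMeanCenterer` that removes only per-batch mean shifts; it is not ComBat and is not part of combatlearn. It assumes, as in the Quick Example, that batch labels are passed at construction as a pandas Series and matched to rows through `X.index`; whether combatlearn works this way internally is an assumption here.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split


class BatchMeanCenterer(BaseEstimator, TransformerMixin):
    """Toy location-only batch adjustment (hypothetical, NOT ComBat)."""

    def __init__(self, batch):
        self.batch = batch  # pandas Series of batch labels, indexed like X

    def fit(self, X, y=None):
        # Only the rows passed to fit (the training split) are used
        labels = self.batch.loc[X.index]
        self.grand_mean_ = X.mean(axis=0)
        # Per-batch offset from the grand mean, one row per batch
        self.offsets_ = X.groupby(labels).mean() - self.grand_mean_
        return self

    def transform(self, X):
        # Reuse the offsets learned in fit; nothing is re-estimated here
        labels = self.batch.loc[X.index]
        shift = self.offsets_.loc[labels.to_numpy()].to_numpy()
        return X - shift


# Synthetic data (assumption): two sites, the second shifted by +2
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
batch = pd.Series(np.repeat(["site_a", "site_b"], 100), index=X.index, name="batch")
X.loc[batch == "site_b"] += 2.0

X_train, X_test = train_test_split(
    X, test_size=0.25, random_state=0, stratify=batch
)

centerer = BatchMeanCenterer(batch=batch).fit(X_train)  # parameters from train only
X_test_adj = centerer.transform(X_test)                 # same parameters on test
```

Because the offsets are estimated inside `fit`, placing such a transformer in a `Pipeline` means each cross-validation fold re-estimates them from that fold's training rows, which is the property that keeps test information out of the correction step.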

Installation

pip install combatlearn

Citation

If combatlearn is useful in your research, please cite the paper introducing this Python package:

Rocchi, E., Nicitra, E., Calvo, M. et al. Combining mass spectrometry and machine learning models for predicting Klebsiella pneumoniae antimicrobial resistance: a multicenter experience from clinical isolates in Italy. BMC Microbiol (2026). https://doi.org/10.1186/s12866-025-04657-2

@article{Rocchi2026,
  author    = {Rocchi, Ettore and Nicitra, Emanuele and Calvo, Maddalena and Cento, Valeria and Peiretti, Laura and Asif, Zian and Menchinelli, Giulia and Posteraro, Brunella and Sala, Claudia and Colosimo, Claudia and Cricca, Monica and Sambri, Vittorio and Sanguinetti, Maurizio and Castellani, Gastone and Stefani, Stefania},
  title     = {Combining mass spectrometry and machine learning models for predicting Klebsiella pneumoniae antimicrobial resistance: a multicenter experience from clinical isolates in Italy},
  journal   = {BMC Microbiology},
  year      = {2026},
  doi       = {10.1186/s12866-025-04657-2},
  url       = {https://doi.org/10.1186/s12866-025-04657-2}
}

Acknowledgements

This project builds on the excellent work of the ComBat family of harmonisation methods. Please consider citing the original papers:

- ComBat - Johnson WE, Li C, Rabinovic A. Biostatistics. 2007. doi: 10.1093/biostatistics/kxj037
- neuroCombat - Fortin JP et al. Neuroimage. 2018. doi: 10.1016/j.neuroimage.2017.11.024
- CovBat - Chen AA et al. Hum Brain Mapp. 2022. doi: 10.1002/hbm.25688
- Longitudinal ComBat - Beer JC et al. Neuroimage. 2020. doi: 10.1016/j.neuroimage.2020.117129
- ComBat-GAM - Pomponio R et al. Neuroimage. 2020. doi: 10.1016/j.neuroimage.2019.116450

Author

Ettore Rocchi @ University of Bologna

Google Scholar | Scopus | GitHub

License

MIT License - see LICENSE for details.