combatlearn#
combatlearn makes the popular ComBat (and CovBat) batch-effect correction algorithms available within machine learning frameworks. It lets you harmonize high-dimensional data inside a scikit-learn Pipeline, so that cross-validation and grid search fit the correction on the training folds only and avoid data leakage.
Features#
Three ComBat Methods:
method="johnson" - Classic ComBat (Johnson et al., 2007)
method="fortin" - neuroCombat with covariates (Fortin et al., 2018)
method="chen" - CovBat PCA-based (Chen et al., 2022)
Scikit-learn Compatible:
Works seamlessly in Pipeline objects
Compatible with GridSearchCV and cross_val_score
Prevents data leakage during cross-validation
Visualization Tools:
Built-in plotting with PCA, t-SNE, and UMAP
Static (matplotlib) and interactive (plotly) visualizations
Before/after batch effect comparison
Feature Importance Analysis (New in v1.2.0):
Identify which features have the strongest batch effects
Location (mean shift) and scale (variance) decomposition
Magnitude and distribution modes for different use cases
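For orientation, the location and scale terms mentioned above can be read off the ComBat model of Johnson et al. (2007), written here for batch i, sample j, and feature g. This is the published model, not a description of how combatlearn computes its importance scores, which is not specified on this page.

```latex
% ComBat model: gamma is the additive (location) batch effect,
% delta is the multiplicative (scale) batch effect.
Y_{ijg} = \alpha_g + X_{ij}\beta_g + \gamma_{ig} + \delta_{ig}\,\varepsilon_{ijg},
\qquad \varepsilon_{ijg} \sim \mathcal{N}(0, \sigma_g^2)

% Standardized data and the batch-adjusted values, using the
% empirical Bayes estimates \hat{\gamma}^{*}_{ig} and \hat{\delta}^{*}_{ig}.
Z_{ijg} = \frac{Y_{ijg} - \hat{\alpha}_g - X_{ij}\hat{\beta}_g}{\hat{\sigma}_g},
\qquad
Y^{*}_{ijg} = \frac{\hat{\sigma}_g}{\hat{\delta}^{*}_{ig}}
              \left( Z_{ijg} - \hat{\gamma}^{*}_{ig} \right)
              + \hat{\alpha}_g + X_{ij}\hat{\beta}_g
```

A feature with a large estimated location term has a batch-dependent mean shift, while a scale term far from 1 indicates a batch-dependent variance.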
Quick Example#
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from combatlearn import ComBat

# Load your data
X = pd.read_csv("data.csv", index_col=0)
y = pd.read_csv("labels.csv", index_col=0).squeeze()
batch = pd.read_csv("batch.csv", index_col=0).squeeze()

# Create pipeline with ComBat
pipe = Pipeline([
    ("combat", ComBat(batch=batch, method="fortin")),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression())
])

# Hyperparameter tuning with grid search
param_grid = {
    "combat__mean_only": [True, False],
    "clf__C": [0.01, 0.1, 1, 10],
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
grid.fit(X, y)
print(f"Best CV AUROC: {grid.best_score_:.3f}")
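To experiment with the pipeline above without real CSV files, a synthetic dataset with known batch effects can be useful. The sketch below is an illustration only: `make_batched_data` and its parameters are hypothetical names, not part of combatlearn, and the simulated shifts and scales are arbitrary choices.

```python
import numpy as np


def make_batched_data(n_per_batch=40, n_features=20, n_batches=3, seed=0):
    """Simulate features with a per-batch shift (location) and spread (scale)."""
    rng = np.random.default_rng(seed)
    n_samples = n_per_batch * n_batches

    # Batch label for every sample: 0, 0, ..., 1, 1, ..., 2, 2, ...
    batch = np.repeat(np.arange(n_batches), n_per_batch)
    # Binary class labels, drawn independently of the batch.
    y = rng.integers(0, 2, size=n_samples)

    # Weak class signal shared by all features.
    signal = 0.5 * y[:, None]
    # One additive and one multiplicative effect per (batch, feature) pair.
    shift = rng.normal(0.0, 2.0, size=(n_batches, n_features))
    scale = rng.uniform(0.5, 2.0, size=(n_batches, n_features))
    noise = rng.normal(size=(n_samples, n_features))

    X = signal + shift[batch] + scale[batch] * noise
    return X, y, batch


X_sim, y_sim, batch_sim = make_batched_data()
```

Wrapping `X_sim` in a `pd.DataFrame` and `y_sim` and `batch_sim` in `pd.Series` objects that share the same index should give inputs that play the same role as the CSV-loaded objects in the example.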
Why combatlearn?#
Batch effects are systematic technical variations that can confound biological signals in high-dimensional data. ComBat is a widely used method for batch effect correction, but traditional implementations typically estimate their parameters on the full dataset at once, so they don’t integrate well with machine learning workflows that keep training and test data separate.
combatlearn solves this by:
Fitting ComBat parameters on training data only
Applying the same transformation to test data
Preventing data leakage in cross-validation
Supporting hyperparameter tuning of batch correction
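The first two points can be sketched with a deliberately simplified aligner. The class below is hypothetical and is not combatlearn's implementation: it uses plain per-batch standardization instead of ComBat's empirical Bayes estimates, and it takes `batch` as an explicit argument for brevity.

```python
import numpy as np


class SimpleBatchAligner:
    """Hypothetical location/scale batch aligner (NOT combatlearn's ComBat).

    Each batch is standardized per feature and mapped onto the pooled
    training mean and standard deviation, without any shrinkage.
    """

    def fit(self, X, batch):
        X = np.asarray(X, dtype=float)
        batch = np.asarray(batch)
        # Pooled reference statistics, estimated on the training rows only.
        self.grand_mean_ = X.mean(axis=0)
        self.grand_std_ = X.std(axis=0) + 1e-12
        # Per-batch location and scale, also from the training rows only.
        self.batch_mean_ = {}
        self.batch_std_ = {}
        for b in np.unique(batch):
            rows = X[batch == b]
            self.batch_mean_[b] = rows.mean(axis=0)
            self.batch_std_[b] = rows.std(axis=0) + 1e-12
        return self

    def transform(self, X, batch):
        X = np.asarray(X, dtype=float)
        batch = np.asarray(batch)
        out = np.empty_like(X)
        for b in np.unique(batch):
            mask = batch == b
            # Reuse the stored training parameters; nothing is re-estimated.
            # A batch that was not seen during fit raises a KeyError here.
            z = (X[mask] - self.batch_mean_[b]) / self.batch_std_[b]
            out[mask] = z * self.grand_std_ + self.grand_mean_
        return out


# Toy data: batch 1 is shifted by +2 on every feature.
rng = np.random.default_rng(0)
batch = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 3)) + np.where(batch[:, None] == 1, 2.0, 0.0)

train = np.arange(100) % 5 != 0  # hold out every fifth row as a test set
aligner = SimpleBatchAligner().fit(X[train], batch[train])
X_train_adj = aligner.transform(X[train], batch[train])
X_test_adj = aligner.transform(X[~train], batch[~train])
```

Because `transform` only reads parameters stored by `fit`, the held-out rows never influence the correction. This is the property that a cross-validation loop relies on when the correction step sits inside the pipeline.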
Installation#
pip install combatlearn
Citation#
If combatlearn is useful in your research, please cite the paper introducing this Python package:
Rocchi, E., Nicitra, E., Calvo, M. et al. Combining mass spectrometry and machine learning models for predicting Klebsiella pneumoniae antimicrobial resistance: a multicenter experience from clinical isolates in Italy. BMC Microbiol (2026). https://doi.org/10.1186/s12866-025-04657-2
@article{Rocchi2026,
  author = {Rocchi, Ettore and Nicitra, Emanuele and Calvo, Maddalena and Cento, Valeria and Peiretti, Laura and Asif, Zian and Menchinelli, Giulia and Posteraro, Brunella and Sala, Claudia and Colosimo, Claudia and Cricca, Monica and Sambri, Vittorio and Sanguinetti, Maurizio and Castellani, Gastone and Stefani, Stefania},
  title = {Combining mass spectrometry and machine learning models for predicting Klebsiella pneumoniae antimicrobial resistance: a multicenter experience from clinical isolates in Italy},
  journal = {BMC Microbiology},
  year = {2026},
  doi = {10.1186/s12866-025-04657-2},
  url = {https://doi.org/10.1186/s12866-025-04657-2}
}
Acknowledgements#
This project builds on the excellent work of the ComBat family of harmonisation methods. Please consider citing the original papers:
ComBat - Johnson WE, Li C, Rabinovic A. Biostatistics. 2007. doi: 10.1093/biostatistics/kxj037
neuroCombat - Fortin JP et al. Neuroimage. 2018. doi: 10.1016/j.neuroimage.2017.11.024
CovBat - Chen AA et al. Hum Brain Mapp. 2022. doi: 10.1002/hbm.25688
License#
MIT License - see LICENSE for details.