combatlearn#

combatlearn brings the ComBat and CovBat batch-effect correction algorithms to scikit-learn. It lets you harmonize high-dimensional data inside a scikit-learn Pipeline, so that cross-validation and grid search fit the correction on each training fold only and apply it to the held-out fold, without data leakage.

Features#

  • Three ComBat Methods:

    • method="johnson" - Classic ComBat (Johnson et al., 2007)

    • method="fortin" - neuroCombat with covariates (Fortin et al., 2018)

    • method="chen" - CovBat PCA-based (Chen et al., 2022)

  • Scikit-learn Compatible:

    • Works seamlessly in Pipeline objects

    • Compatible with GridSearchCV and cross_val_score

    • Prevents data leakage during cross-validation

  • Visualization Tools:

    • Built-in plotting with PCA, t-SNE, and UMAP

    • Static (matplotlib) and interactive (plotly) visualizations

    • Before/after batch effect comparison

  • Feature Importance Analysis (New in v1.2.0):

    • Identify which features have the strongest batch effects

    • Location (mean shift) and scale (variance) decomposition

    • Magnitude and distribution modes for different use cases
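For reference, the location and scale terms in the list above presumably correspond to the additive and multiplicative batch parameters of the ComBat model. The notation below follows Johnson et al. (2007) and is an assumption about how the package decomposes batch effects, not something stated on this page. Here i indexes the batch, j the sample within that batch, and g the feature:

```latex
% Location-scale model for feature g, sample j, batch i
Y_{ijg} = \alpha_g + X\beta_g + \gamma_{ig} + \delta_{ig}\,\varepsilon_{ijg}

% Standardized data
Z_{ijg} = \frac{Y_{ijg} - \hat{\alpha}_g - X\hat{\beta}_g}{\hat{\sigma}_g}

% Batch-adjusted data, using the empirical Bayes estimates
% \hat{\gamma}^{*}_{ig} (location) and \hat{\delta}^{*}_{ig} (scale)
Y^{*}_{ijg} = \frac{\hat{\sigma}_g}{\hat{\delta}^{*}_{ig}}
              \left( Z_{ijg} - \hat{\gamma}^{*}_{ig} \right)
              + \hat{\alpha}_g + X\hat{\beta}_g
```

In this reading, a feature with a large location effect has a large additive shift γ for some batch, and a feature with a large scale effect has a multiplicative factor δ far from 1.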

Quick Example#

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from combatlearn import ComBat

# Load your data
X = pd.read_csv("data.csv", index_col=0)
y = pd.read_csv("labels.csv", index_col=0).squeeze()
batch = pd.read_csv("batch.csv", index_col=0).squeeze()

# Create pipeline with ComBat
pipe = Pipeline([
    ("combat", ComBat(batch=batch, method="fortin")),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression())
])

# Hyperparameter tuning with grid search
param_grid = {
    "combat__mean_only": [True, False],
    "clf__C": [0.01, 0.1, 1, 10],
}

grid = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
grid.fit(X, y)

print(f"Best CV AUROC: {grid.best_score_:.3f}")

Why combatlearn?#

Batch effects are systematic technical variations that can confound biological signals in high-dimensional data. ComBat is one of the most widely used methods for batch-effect correction, but traditional implementations don’t integrate well with machine learning workflows.

combatlearn solves this by:

  • Fitting ComBat parameters on training data only

  • Applying the same transformation to test data

  • Preventing data leakage in cross-validation

  • Supporting hyperparameter tuning of batch correction
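The fit-on-train, apply-to-test idea in the list above can be sketched with a toy transformer. This is a minimal illustration using NumPy only: the class name PerBatchStandardizer is hypothetical, and it performs plain per-batch standardization rather than ComBat (there is no empirical Bayes shrinkage and no covariate handling). It is not combatlearn's implementation.

```python
import numpy as np


class PerBatchStandardizer:
    """Hypothetical stand-in for a batch-correction step (NOT ComBat).

    fit() learns one mean and one standard deviation per batch and feature;
    transform() reuses those stored values and never re-estimates them.
    """

    def fit(self, X, batch):
        X = np.asarray(X, dtype=float)
        batch = np.asarray(batch)
        self.params_ = {}
        for b in np.unique(batch):
            rows = X[batch == b]
            std = rows.std(axis=0, ddof=1)
            std[std == 0] = 1.0  # avoid division by zero for constant features
            self.params_[b] = (rows.mean(axis=0), std)
        return self

    def transform(self, X, batch):
        X = np.asarray(X, dtype=float)
        batch = np.asarray(batch)
        out = np.empty_like(X)
        for b in np.unique(batch):
            mean, std = self.params_[b]  # raises KeyError for an unseen batch
            mask = batch == b
            out[mask] = (X[mask] - mean) / std
        return out


# Synthetic data: two batches, batch "B" shifted by +5 on every feature.
rng = np.random.default_rng(0)
batch = np.repeat(["A", "B"], 50)
X = rng.normal(size=(100, 3))
X[batch == "B"] += 5.0

# Hold out every fifth row; parameters are estimated on the rest only.
train = np.arange(100) % 5 != 0
test = ~train

scaler = PerBatchStandardizer().fit(X[train], batch[train])
X_train_adj = scaler.transform(X[train], batch[train])
X_test_adj = scaler.transform(X[test], batch[test])
```

Because the held-out rows are transformed with parameters learned from the training rows alone, their per-batch means end up close to zero but not exactly zero. That small residual is what a leakage-free estimate looks like; fitting on all rows at once would hide it.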

Installation#

pip install combatlearn

Citation#

If combatlearn is useful in your research, please cite the paper introducing this Python package:

Rocchi, E., Nicitra, E., Calvo, M. et al. Combining mass spectrometry and machine learning models for predicting Klebsiella pneumoniae antimicrobial resistance: a multicenter experience from clinical isolates in Italy. BMC Microbiol (2026). https://doi.org/10.1186/s12866-025-04657-2

@article{Rocchi2026,
  author    = {Rocchi, Ettore and Nicitra, Emanuele and Calvo, Maddalena and Cento, Valeria and Peiretti, Laura and Asif, Zian and Menchinelli, Giulia and Posteraro, Brunella and Sala, Claudia and Colosimo, Claudia and Cricca, Monica and Sambri, Vittorio and Sanguinetti, Maurizio and Castellani, Gastone and Stefani, Stefania},
  title     = {Combining mass spectrometry and machine learning models for predicting Klebsiella pneumoniae antimicrobial resistance: a multicenter experience from clinical isolates in Italy},
  journal   = {BMC Microbiology},
  year      = {2026},
  doi       = {10.1186/s12866-025-04657-2},
  url       = {https://doi.org/10.1186/s12866-025-04657-2}
}

Acknowledgements#

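This project builds on the excellent work of the ComBat family of harmonization methods. Please consider citing the original papers: Johnson et al. (2007) for classic ComBat, Fortin et al. (2018) for neuroCombat, and Chen et al. (2022) for CovBat.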

Author#

Ettore Rocchi @ University of Bologna

License#

MIT License - see LICENSE for details.