Method Guide#
combatlearn implements three variants of the ComBat algorithm. This guide helps you choose the right method for your use case.
Johnson Method (Classic ComBat)#
Reference: Johnson et al. (2007)
The original ComBat algorithm without covariate support.
When to Use#
Simple batch correction scenarios
No biological covariates to preserve
Exploratory data analysis
Fastest computation time
Algorithm#
Standardize features across all samples
Estimate location (γ) and scale (δ) parameters for each batch
Apply empirical Bayes shrinkage
Remove batch effects using adjusted parameters
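The steps above can be sketched in plain NumPy. This is a simplified illustration, not combatlearn's implementation: it skips the empirical Bayes shrinkage step entirely, and the function name `location_scale_adjust` is hypothetical.

```python
import numpy as np

def location_scale_adjust(X, batch):
    """Illustrative location/scale batch adjustment WITHOUT empirical Bayes shrinkage.

    X is (n_samples, n_features); batch has one label per sample.
    Each batch needs at least two samples and no constant features.
    """
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    levels = np.unique(batch)

    # Step 1: standardize each feature with the grand mean and the
    # pooled within-batch standard deviation.
    grand_mean = X.mean(axis=0)
    resid = X.copy()
    for b in levels:
        mask = batch == b
        resid[mask] -= X[mask].mean(axis=0)
    pooled_sd = np.sqrt((resid ** 2).mean(axis=0))
    Z = (X - grand_mean) / pooled_sd

    # Step 2: estimate location (gamma) and scale (delta^2) per batch.
    # Step 3 (shrinkage) is omitted in this sketch.
    # Step 4: remove the batch effect using the raw estimates.
    Z_adj = np.empty_like(Z)
    for b in levels:
        mask = batch == b
        gamma_hat = Z[mask].mean(axis=0)
        delta_sq_hat = Z[mask].var(axis=0, ddof=1)
        Z_adj[mask] = (Z[mask] - gamma_hat) / np.sqrt(delta_sq_hat)

    # Map back to the original scale of the data.
    return Z_adj * pooled_sd + grand_mean
```

Without shrinkage, the per-batch estimates can be noisy for small batches, which is the problem the empirical Bayes step is meant to address.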
Example#
from combatlearn import ComBat

# X (feature data) and batch (batch labels) are assumed to be defined already
combat = ComBat(
    batch=batch,
    method="johnson",
    parametric=True  # or False for non-parametric
)
X_corrected = combat.fit_transform(X)
Advantages#
✅ Simple and fast
✅ No covariate dependencies
✅ Well-established method
Limitations#
❌ Cannot preserve covariate effects
Fortin Method (neuroCombat)#
Reference: Fortin et al. (2018)
Extended ComBat that preserves effects of biological covariates.
When to Use#
Known biological variables (age, sex, diagnosis)
Need to preserve biological variation
Recommended for most applications
Standard choice for neuroimaging
Algorithm#
Build design matrix with batch indicators and covariates
Estimate batch effects while accounting for covariates
Apply empirical Bayes shrinkage
Remove only batch-related variation
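To make the design-matrix idea concrete, here is a minimal NumPy sketch. It is an assumption-laden simplification: it corrects batch means only, applies no shrinkage, and the function name `remove_batch_keep_covariates` is hypothetical rather than part of combatlearn.

```python
import numpy as np

def remove_batch_keep_covariates(X, batch, covariates):
    """Subtract fitted batch offsets while leaving covariate effects in place.

    Mean-only sketch without empirical Bayes shrinkage.
    covariates must be numeric (encode categorical variables beforehand).
    """
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    C = np.asarray(covariates, dtype=float).reshape(len(X), -1)

    # Design matrix: one indicator column per batch, then the covariates.
    levels = np.unique(batch)
    B = (batch[:, None] == levels[None, :]).astype(float)
    design = np.hstack([B, C])

    # Least-squares fit: batch offsets are estimated jointly with covariates.
    coef, *_ = np.linalg.lstsq(design, X, rcond=None)
    batch_coef = coef[: len(levels)]  # shape (n_batches, n_features)

    # Grand mean as the sample-weighted average of the batch offsets.
    weights = B.sum(axis=0) / len(X)
    grand_mean = weights @ batch_coef

    # Remove each sample's batch offset and restore the common level.
    return X - B @ batch_coef + grand_mean
```

Because the covariate columns are in the same regression as the batch indicators, variation explained by age or diagnosis is not attributed to batch and is therefore not removed.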
Example#
from combatlearn import ComBat
import pandas as pd

# Define covariates
age = pd.DataFrame({"age": [25, 30, 45, ...]})
sex = pd.DataFrame({"sex": ["M", "F", "M", ...]})
diagnosis = pd.DataFrame({"dx": ["healthy", "disease", ...]})

combat = ComBat(
    batch=batch,
    method="fortin",
    continuous_covariates=age,
    discrete_covariates=pd.concat([sex, diagnosis], axis=1)
)
X_corrected = combat.fit_transform(X)
Advantages#
✅ Preserves covariate effects
✅ Removes only technical variation
✅ More biologically meaningful
Limitations#
❌ Requires covariate information
❌ Slightly slower than Johnson
Chen Method (CovBat)#
Reference: Chen et al. (2022)
Extends the Fortin method with a PCA step: after the mean and variance adjustment, batch effects are also corrected in principal-component space.
When to Use#
High-dimensional data (many features)
Batch effects vary across features
Feature-specific corrections needed
Extra computation time is acceptable (this is the most computationally intensive of the three methods)
Algorithm#
Apply Fortin method for mean/variance adjustment
Perform PCA on corrected data
Apply batch correction in PC space
Transform back to original space
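Steps 2 to 4 can be sketched with NumPy's SVD. This is an illustration under simplifying assumptions: the input is taken to be already mean/variance corrected (step 1), the scores are harmonized with a plain per-batch center-and-rescale instead of a full ComBat fit, and `harmonize_pc_scores` is a hypothetical name.

```python
import numpy as np

def harmonize_pc_scores(X_corrected, batch, n_components):
    """Correct the leading principal-component scores per batch, then reconstruct.

    Simplification: each batch's scores are centered and rescaled to the
    overall standard deviation of that component (no shrinkage).
    n_components must not exceed the rank of the centered data.
    """
    X = np.asarray(X_corrected, dtype=float)
    batch = np.asarray(batch)

    # Step 2: PCA via SVD of the centered data.
    mean = X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    scores = U * S  # shape (n_samples, n_pcs)

    # Step 3: harmonize the first n_components score columns across batches.
    adjusted = scores.copy()
    for j in range(n_components):
        overall_sd = scores[:, j].std(ddof=1)
        for b in np.unique(batch):
            mask = batch == b
            col = scores[mask, j]
            adjusted[mask, j] = (col - col.mean()) / col.std(ddof=1) * overall_sd

    # Step 4: project back to feature space; untouched components are
    # kept as they were, so they still contribute to the reconstruction.
    return adjusted @ Vt + mean
```

In this sketch the components beyond `n_components` are carried through unchanged, so the output keeps the original number of features.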
Example#
from combatlearn import ComBat
combat = ComBat(
    batch=batch,
    method="chen",
    continuous_covariates=age,
    discrete_covariates=sex,
    covbat_cov_thresh=0.95  # Retain 95% variance
)
X_corrected = combat.fit_transform(X)
Variance Threshold Options#
You can specify the number of principal components in two ways:
Option 1: Cumulative Variance (float)
covbat_cov_thresh=0.95 # Retain 95% of variance
Option 2: Fixed Number (int)
covbat_cov_thresh=50 # Use exactly 50 components
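One way to picture how a float or int threshold could map to a component count is the helper below. It is a hypothetical sketch, not combatlearn's code: the name `resolve_n_components`, the validation rules, and the handling of edge cases are all assumptions.

```python
def resolve_n_components(explained_variance_ratio, thresh):
    """Hypothetical helper: turn a float or int threshold into a number of PCs.

    explained_variance_ratio lists the per-component variance fractions in
    descending order (as produced by a PCA fit).
    """
    ratios = list(explained_variance_ratio)
    if isinstance(thresh, bool):
        raise TypeError("thresh must be an int or a float, not a bool")

    # Int: use exactly that many components.
    if isinstance(thresh, int):
        if not 1 <= thresh <= len(ratios):
            raise ValueError("int thresh must be between 1 and the number of components")
        return thresh

    # Float: smallest number of components reaching the cumulative variance.
    if not 0.0 < thresh <= 1.0:
        raise ValueError("float thresh must be in (0, 1]")
    cumulative = 0.0
    for k, ratio in enumerate(ratios, start=1):
        cumulative += ratio
        if cumulative >= thresh - 1e-12:  # small tolerance for rounding error
            return k
    return len(ratios)
```

Note that under this reading `1` (int) and `1.0` (float) mean different things: one component versus all of the variance.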
Advantages#
✅ Handles high-dimensional data
✅ Feature-specific corrections
✅ Can reduce dimensionality
✅ Preserves covariate effects
Limitations#
❌ Requires covariate information
❌ Most computationally intensive
❌ Information loss in PCA step
Parametric vs Non-Parametric#
All methods support both parametric and non-parametric empirical Bayes:
Parametric (default):
Faster computation
Assumes parametric priors on the batch parameters (normal for location, inverse gamma for scale)
Recommended for most datasets
Non-Parametric:
Iterative scheme
No distribution assumptions
Use when the parametric assumptions are violated
# Parametric (default)
combat = ComBat(batch=batch, method="fortin", parametric=True)
# Non-parametric
combat = ComBat(batch=batch, method="fortin", parametric=False)
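To show what the parametric assumption buys, the sketch below computes the shrunken location estimate for one batch under a normal prior, following the posterior-mean formula from Johnson et al. (2007). It is a single update with the scale estimates held fixed; the full procedure iterates location and scale together, and `shrink_gamma` is a hypothetical name.

```python
import numpy as np

def shrink_gamma(gamma_hat, delta_sq, n_samples):
    """Posterior mean of the batch location for one batch (parametric EB).

    gamma_hat: per-feature location estimates for this batch
    delta_sq:  per-feature scale (variance) estimates, held fixed here
    n_samples: number of samples in this batch
    """
    gamma_hat = np.asarray(gamma_hat, dtype=float)
    delta_sq = np.asarray(delta_sq, dtype=float)

    # Prior hyperparameters are estimated from all features in the batch.
    gamma_bar = gamma_hat.mean()
    tau_sq = gamma_hat.var(ddof=1)

    # Precision-weighted average of the feature's own estimate and the prior mean.
    numerator = n_samples * tau_sq * gamma_hat + delta_sq * gamma_bar
    denominator = n_samples * tau_sq + delta_sq
    return numerator / denominator
```

Each feature's estimate is pulled toward the batch-wide mean; the pull weakens as the batch gets larger or the feature's variance gets smaller.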
Mean-Only Correction#
All methods support mean-only mode, which corrects batch means but leaves each batch's variance unchanged:
combat = ComBat(
    batch=batch,
    method="fortin",
    mean_only=True  # Only correct means
)
Use when: You want to preserve variance structure across batches.
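The effect of a mean-only correction can be illustrated with a few lines of NumPy. This is a bare sketch without covariates or shrinkage, and `mean_only_adjust` is a hypothetical name, not a combatlearn function.

```python
import numpy as np

def mean_only_adjust(X, batch):
    """Shift every batch to the grand mean; within-batch spread is untouched."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)

    grand_mean = X.mean(axis=0)
    out = X.copy()
    for b in np.unique(batch):
        mask = batch == b
        # A pure shift: adding a constant per feature cannot change the variance.
        out[mask] += grand_mean - X[mask].mean(axis=0)
    return out
```

After the shift all batches share the same feature means, while each batch keeps exactly the variance it had before.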
Reference Batch#
Optionally specify a reference batch. Other batches will be adjusted to match it:
combat = ComBat(
    batch=batch,
    method="johnson",
    reference_batch="Batch_A"  # Match to Batch_A
)
Samples in the reference batch remain unchanged after correction.
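The reference-batch idea can be sketched as follows. This is a simplified stand-in that matches raw per-batch means and standard deviations, with no shrinkage or covariates; `adjust_to_reference` is a hypothetical name.

```python
import numpy as np

def adjust_to_reference(X, batch, reference_batch):
    """Rescale every non-reference batch to the reference batch's mean and SD."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)

    ref = batch == reference_batch
    if not ref.any():
        raise ValueError(f"reference batch {reference_batch!r} not found")
    ref_mean = X[ref].mean(axis=0)
    ref_sd = X[ref].std(axis=0, ddof=1)

    out = X.copy()
    for b in np.unique(batch):
        if b == reference_batch:
            continue  # reference samples are left exactly as they are
        mask = batch == b
        z = (X[mask] - X[mask].mean(axis=0)) / X[mask].std(axis=0, ddof=1)
        out[mask] = z * ref_sd + ref_mean
    return out
```

This design is useful when one batch is treated as the standard, for example a site whose data an existing model was trained on.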
Choosing a Method#
Simple Decision Tree:
No covariates? → Use Johnson
Have covariates + low/normal dimensionality? → Use Fortin
Have covariates + high dimensionality? → Use Chen
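The decision tree can also be written as a small helper. The function `choose_method` is hypothetical; only the returned strings correspond to the `method` values used in this guide, and what counts as "high dimensional" is left to the caller.

```python
def choose_method(has_covariates: bool, high_dimensional: bool = False) -> str:
    """Return the ComBat method name suggested by the decision tree."""
    if not has_covariates:
        return "johnson"
    return "chen" if high_dimensional else "fortin"
```

The result can be passed as the `method` argument when constructing `ComBat`.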
Next Steps#
See the API Reference for complete parameter documentation.