API Reference

Complete API documentation for combatlearn.

ComBat

The main scikit-learn compatible transformer for batch effect correction.

class ComBat(batch, *, discrete_covariates=None, continuous_covariates=None, subject_id=None, time_covariate=None, method='johnson', parametric=True, mean_only=False, reference_batch=None, eps=1e-08, covbat_cov_thresh=0.9, smooth_terms=None, spline_df=10, spline_degree=3, smooth_term_bounds=None)

Bases: BaseEstimator, TransformerMixin

Pipeline-friendly wrapper around ComBatModel.

Stores the batch labels (and any optional covariates) passed at construction and applies them consistently in separate fit and transform calls.

Parameters:

batch (array-like of shape (n_samples,)) – Batch labels for each sample.
discrete_covariates (array-like, optional) – Categorical covariates to protect (Fortin/Chen/Longitudinal only).
continuous_covariates (array-like, optional) – Continuous covariates to protect (Fortin/Chen/Longitudinal only).
subject_id (array-like, optional) – Subject/individual labels for the random intercept. Required for method='longitudinal', ignored otherwise.
time_covariate (array-like, optional) – Continuous time variable for repeated measures (Longitudinal only).
method ({'johnson', 'fortin', 'chen', 'longitudinal', 'gam', 'covbat_gam'}, default='johnson') – ComBat variant to use. 'gam' and 'covbat_gam' model the continuous covariates in smooth_terms nonlinearly with B-splines (ComBat-GAM, Pomponio et al. 2020). Literature aliases are also accepted: 'classic_combat' (johnson), 'neurocombat' (fortin), 'covbat' (chen), 'longcombat' (longitudinal), 'combat_gam' (gam).
parametric (bool, default=True) – Use parametric empirical Bayes.
mean_only (bool, default=False) – Adjust only the mean (ignore variance).
reference_batch (str, optional) – Batch level to leave unchanged.
eps (float, default=1e-8) – Numerical jitter for stability.
covbat_cov_thresh (float or int, default=0.9) – CovBat variance threshold for PCs.
smooth_terms (list of str or int, optional) – Continuous covariates to model nonlinearly (gam/covbat_gam only). Default (None) smooths every continuous covariate.
spline_df (int, default=10) – B-spline degrees of freedom per smooth term.
spline_degree (int, default=3) – B-spline degree (3 = cubic).
smooth_term_bounds (tuple of (float, float) or dict, optional) – Boundary knots for the splines; a single (lo, hi) for all terms or a {term: (lo, hi)} dict. Default uses each term’s training min/max.

__init__(batch, *, discrete_covariates=None, continuous_covariates=None, subject_id=None, time_covariate=None, method='johnson', parametric=True, mean_only=False, reference_batch=None, eps=1e-08, covbat_cov_thresh=0.9, smooth_terms=None, spline_df=10, spline_degree=3, smooth_term_bounds=None)

fit(X, y=None)

Fit the ComBat model.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input data to fit.
y (None) – Ignored. Present for API compatibility.

Returns:

self – Fitted estimator.

Return type:

ComBat

transform(X)

Transform the data using fitted ComBat parameters.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input data to transform.

Returns:

X_transformed – Batch-corrected data.

Return type:

pd.DataFrame

get_feature_names_out(input_features=None)

Get output feature names for transform.

Parameters:

input_features (array-like of str or None, default=None) – Ignored. Present for API compatibility.

Returns:

feature_names_out – Feature names.

Return type:

ndarray of str objects

Raises:

sklearn.exceptions.NotFittedError – If the estimator is not fitted.
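A minimal usage sketch of the separate fit and transform calls. It assumes ComBat can be imported from the top-level combatlearn package and that X and batch are aligned through a shared pandas index, as the batch_variance_explained entry describes. The helper name harmonize_split is hypothetical.

```python
def harmonize_split(X_train, X_test, batch):
    """Fit ComBat on the training rows, then correct training and held-out rows.

    X_train, X_test: pandas DataFrames whose indexes are subsets of batch.index.
    batch: pandas Series of batch labels covering every sample (assumed layout).
    """
    # Imported lazily so the sketch can be read without combatlearn installed.
    from combatlearn import ComBat

    combat = ComBat(batch=batch)      # default method='johnson'
    combat.fit(X_train)               # parameters are learned on the training rows only
    return combat.transform(X_train), combat.transform(X_test)
```

Because the batch labels are stored at construction, the same object can presumably be placed in a scikit-learn Pipeline and cross-validated without passing batch to fit or transform.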

NestedComBat

Multi-batch-variable ComBat (Nested / OPNested / GMM ComBat) for harmonizing over several batch variables at once.

class NestedComBat(batch, *, discrete_covariates=None, continuous_covariates=None, method='fortin', optimize_order=True, order_metric='anderson', max_exhaustive_vars=4, gmm=None, gmm_min_cluster_frac=0.25, parametric=True, mean_only=False, reference_batch=None, eps=1e-08, covbat_cov_thresh=0.9, smooth_terms=None, spline_df=10, spline_degree=3, smooth_term_bounds=None, random_state=None)

Bases: BaseEstimator, TransformerMixin

Nested / OPNested / GMM ComBat for multiple batch variables.

Harmonizes over several batch variables at once (e.g. site, scanner, protocol; Horng et al. 2022) by applying single-batch ComBat to each one in sequence, with every step delegating to a ComBatModel. It adds no new empirical-Bayes math: the chosen order, the optional Gaussian-mixture grouping, and every per-step parameter are learned on the training data and frozen for transform, so it is inductive and cross-validation-safe, like ComBat.

Parameters:

batch (pd.DataFrame or list of array-like) – The batch variables to harmonize. A DataFrame uses one column per batch variable (column names become the variable names); a list/tuple provides one array-like per batch variable (named Series keep their name, others are named 'batch0', 'batch1', …).
discrete_covariates (array-like, optional) – Categorical covariates to protect, preserved across every step.
continuous_covariates (array-like, optional) – Continuous covariates to protect, preserved across every step. Required for the GAM engines.
method ({'fortin', 'chen', 'gam', 'covbat_gam'}, default='fortin') – The ComBat engine used for every nested step. Literature aliases ('neurocombat', 'covbat', 'combat_gam', 'covbatgam') are also accepted. 'johnson' and 'longitudinal' are not supported (they do not preserve covariates across the nested steps).
optimize_order (bool, default=True) – If True, select the harmonization order that minimizes the residual batch effect (OPNested); otherwise use the order the batch variables are given.
order_metric ({'anderson'}, default='anderson') – Objective for the order search: the number of features with a significant residual batch effect by the Anderson-Darling k-sample test, summed over all batch variables (lower is better).
max_exhaustive_vars (int, default=4) – Cap on the exhaustive order search. With k <= max_exhaustive_vars batch variables all k! orderings are tried (each ordering fits k ComBat models and scores every feature), so the cost grows factorially; a warning reports the number of fits before a large exhaustive search runs. Above the cap the search falls back to greedy forward selection. Raise this to force the exhaustive search over more variables, at your own cost.
gmm ({None, 'batch', 'covariate'}, default=None) – Optional Gaussian-mixture grouping (GMM ComBat). 'batch' (+GMM) feeds the latent grouping in as an extra batch variable (harmonized away); 'covariate' (-GMM) feeds it in as a protected discrete covariate (preserved as signal). None disables it.
gmm_min_cluster_frac (float, default=0.25) – Minimum fraction of samples each mixture component must hold for a feature to be eligible as the grouping source.
parametric (bool, default=True) – Use parametric empirical Bayes (passed to every step).
mean_only (bool, default=False) – Adjust only the mean (passed to every step).
reference_batch (dict or str, optional) – Reference level per batch variable, as a {batch_variable: level} dict (each nested step leaves its reference level unchanged; variables absent from the dict use the grand mean). A bare string is accepted only when there is a single batch variable. None uses the grand mean throughout.
eps (float, default=1e-8) – Numerical jitter (passed to every step).
covbat_cov_thresh (float or int, default=0.9) – CovBat variance threshold for PCs (chen / covbat_gam steps).
smooth_terms (list of str or int, optional) – Continuous covariates to model nonlinearly (gam / covbat_gam).
spline_df (int, default=10) – B-spline degrees of freedom per smooth term (GAM engines).
spline_degree (int, default=3) – B-spline degree (GAM engines).
smooth_term_bounds (tuple of (float, float) or dict, optional) – Boundary knots for the splines (GAM engines).
random_state (int or None, default=None) – Seed for the Gaussian-mixture initialization (used only when gmm is set). None follows the scikit-learn convention (nondeterministic grouping); pass an int for a reproducible grouping.
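The trade-off behind optimize_order and max_exhaustive_vars can be sketched in plain Python. This is an illustration, not combatlearn's implementation: choose_order, exhaustive_fit_count, and the toy score function are hypothetical, and the greedy step shown (append whichever remaining variable gives the lowest score so far) is an assumed reading of "greedy forward selection".

```python
from itertools import permutations
from math import factorial


def exhaustive_fit_count(k):
    """Single-batch fits needed when all k! orderings are tried (k fits each)."""
    return factorial(k) * k


def choose_order(variables, score, max_exhaustive_vars=4):
    """Return (order, used_greedy); score(order) is lower-is-better."""
    variables = list(variables)
    if len(variables) <= max_exhaustive_vars:
        # Exhaustive: score every permutation and keep the best one.
        return list(min(permutations(variables), key=score)), False
    # Greedy forward selection: extend the order one variable at a time.
    order, remaining = [], variables[:]
    while remaining:
        best = min(remaining, key=lambda v: score(tuple(order) + (v,)))
        order.append(best)
        remaining.remove(best)
    return order, True


# Toy objective standing in for the residual batch effect (hypothetical).
weights = {"site": 3, "scanner": 1, "protocol": 2}


def toy_score(order):
    return sum((i + 1) * weights[v] for i, v in enumerate(order))


exhaustive_order, _ = choose_order(weights, toy_score)
greedy_order, used_greedy = choose_order(weights, toy_score, max_exhaustive_vars=2)
print(exhaustive_fit_count(4))  # 96
```

In this toy case the greedy order differs from the exhaustive optimum, which is the price paid for avoiding the factorial cost.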

order_

The batch variables in the order they were harmonized.

Type:

list of str

used_greedy_

Whether the greedy fallback was used instead of the exhaustive search.

Type:

bool

batch_var_before_

Per-variable fraction of variance explained by batch before correction.

Type:

dict of str to float

batch_var_after_

Per-variable fraction of variance explained by batch after correction.

Type:

dict of str to float

__init__(batch, *, discrete_covariates=None, continuous_covariates=None, method='fortin', optimize_order=True, order_metric='anderson', max_exhaustive_vars=4, gmm=None, gmm_min_cluster_frac=0.25, parametric=True, mean_only=False, reference_batch=None, eps=1e-08, covbat_cov_thresh=0.9, smooth_terms=None, spline_df=10, spline_degree=3, smooth_term_bounds=None, random_state=None)

fit(X, y=None)

Fit the nested model: select an order and fit one step per batch variable.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input data to fit.
y (None) – Ignored. Present for API compatibility.

Returns:

self – Fitted estimator.

Return type:

NestedComBat

transform(X)

Transform new data by replaying the fitted nested sequence.

Parameters:

X (array-like of shape (n_samples, n_features)) – Input data to transform.

Returns:

X_transformed – Batch-corrected data.

Return type:

pd.DataFrame

get_feature_names_out(input_features=None)

Get output feature names for transform.

Parameters:

input_features (array-like of str or None, default=None) – Ignored. Present for API compatibility.

Returns:

feature_names_out – Feature names.

Return type:

ndarray of str objects

Raises:

sklearn.exceptions.NotFittedError – If the estimator is not fitted.

TransductiveComBat

Whole-cohort, fit_transform-only harmonizers whose benefit is realized in-sample (currently Longitudinal ComBat).

class TransductiveComBat(batch, *, method='longitudinal', discrete_covariates=None, continuous_covariates=None, subject_id=None, time_covariate=None, parametric=True, mean_only=False, reference_batch=None, eps=1e-08)

Bases: BaseEstimator, TransformerMixin

Whole-cohort ComBat harmonizers that do not meet the inductive contract.

Some ComBat variants only pay off in-sample: their benefit is tied to the samples present at fit time, so there is no leakage-free way to freeze them on a training split and apply them to held-out data. TransductiveComBat exposes these as a single fit_transform step over a complete cohort; calling transform() on separate held-out data raises, and it is deliberately not meant for a scikit-learn Pipeline.

Currently method='longitudinal' (Longitudinal ComBat, Beer et al. 2020) is available.

Parameters:

batch (array-like of shape (n_samples,)) – Batch labels for each sample.
method ({'longitudinal'}, default='longitudinal') – Transductive engine to use. The alias 'longcombat' is also accepted.
discrete_covariates (array-like, optional) – Categorical covariates to protect.
continuous_covariates (array-like, optional) – Continuous covariates to protect.
subject_id (array-like, optional) – Subject/individual labels for the random intercept. Required for method='longitudinal'.
time_covariate (array-like, optional) – Continuous time variable for repeated measures (Longitudinal only).
parametric (bool, default=True) – Use parametric empirical Bayes.
mean_only (bool, default=False) – Adjust only the mean (ignore variance).
reference_batch (str, optional) – Batch level to leave unchanged.
eps (float, default=1e-8) – Numerical jitter for stability.

Notes

“Transductive” here follows scikit-learn’s glossary sense: the estimator “is designed to model a specific dataset, but not to apply that model to unseen data”, i.e. it is whole-cohort, non-inductive, and fit_transform-only. It is not the strictly supervised Vapnik sense of transduction (predicting labels for specific unlabeled points): ComBat is unsupervised and predicts no labels.

__init__(batch, *, method='longitudinal', discrete_covariates=None, continuous_covariates=None, subject_id=None, time_covariate=None, parametric=True, mean_only=False, reference_batch=None, eps=1e-08)

fit(X, y=None)

Fit the underlying whole-cohort model.

Parameters:

X (array-like of shape (n_samples, n_features)) – The complete cohort to harmonize.
y (None) – Ignored. Present for API compatibility.

Returns:

self – Fitted estimator.

Return type:

TransductiveComBat

fit_transform(X, y=None)

Fit and harmonize the whole cohort in a single pass.

Parameters:

X (array-like of shape (n_samples, n_features)) – The complete cohort to harmonize.
y (None) – Ignored. Present for API compatibility.

Returns:

X_transformed – Batch-corrected data for the whole cohort.

Return type:

pd.DataFrame

transform(X)

Not supported: TransductiveComBat is fit_transform-only.

Raises:

NotImplementedError – Always. The correction is in-sample, so it cannot be frozen at fit and applied to separate held-out data.
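The fit_transform-only contract can be illustrated with a toy class. This is a sketch of the pattern, not combatlearn's code: WholeCohortCenterer is a hypothetical stand-in that merely shifts each batch mean of a single feature to the cohort-wide grand mean.

```python
class WholeCohortCenterer:
    """Toy whole-cohort harmonizer: fit_transform works, transform always raises."""

    def __init__(self, batch):
        self.batch = list(batch)

    def fit_transform(self, x):
        # Shift every batch so that its mean matches the cohort-wide grand mean.
        grand = sum(x) / len(x)
        groups = {}
        for value, label in zip(x, self.batch):
            groups.setdefault(label, []).append(value)
        means = {label: sum(vals) / len(vals) for label, vals in groups.items()}
        return [v - means[b] + grand for v, b in zip(x, self.batch)]

    def transform(self, x):
        # The correction depends on the samples seen together, so there is nothing to replay.
        raise NotImplementedError("in-sample correction; call fit_transform on the whole cohort")


model = WholeCohortCenterer(batch=["a", "a", "b", "b"])
corrected = model.fit_transform([1.0, 3.0, 11.0, 13.0])
print(corrected)  # [6.0, 8.0, 6.0, 8.0]
```

Raising in transform makes accidental use on a held-out split fail loudly instead of silently leaking information.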

Inspection

Functions for inspecting fitted ComBat models.

Standalone inspection functions for fitted ComBat models.

feature_batch_diagnostics(combat, mode='magnitude', weighted=True)

Compute per-feature batch effect magnitude.

Returns a DataFrame with columns location, scale, and combined. Location is the (weighted) RMS of gamma across batches (standardized mean shifts). Scale is the (weighted) RMS of log-delta across batches (log-fold variance change). Combined is the Euclidean norm sqrt(location**2 + scale**2). Using RMS provides L2-consistent aggregation; using log(delta) ensures symmetry.

Parameters:

combat (ComBat) – A fitted ComBat instance.
mode ({'magnitude', 'distribution'}, default='magnitude') –
- 'magnitude': Returns L2-consistent absolute batch effect magnitudes. Suitable for ranking, thresholding, and cross-dataset comparison.
- 'distribution': Returns column-wise normalized proportions (each column sums to 1, values in range [0, 1]), representing the relative contribution of each feature to the total location, scale, or combined batch effect. Note: normalization is applied independently to each column, so the Euclidean relationship (combined**2 = location**2 + scale**2) no longer holds.
weighted (bool, default=True) – If True, compute a weighted RMS where each batch is weighted by its sample size. This gives more influence to larger batches, producing a more statistically representative summary. If False, all batches contribute equally regardless of size.

Returns:

DataFrame with index=feature names, columns=['location', 'scale', 'combined'], sorted by 'combined' descending.

Return type:

pd.DataFrame

Raises:

ValueError – If the model is not fitted or if mode is invalid.
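The aggregation described above can be sketched for a single feature in plain Python. This is an illustration under stated assumptions, not the library's code: the helper names are hypothetical, gamma and delta stand for the per-batch location and scale estimates, and the weighted RMS is taken as sqrt(sum(w * v**2) / sum(w)) with batch sample sizes as weights.

```python
import math


def weighted_rms(values, weights=None):
    """Root mean square of values, optionally weighted (e.g. by batch sample size)."""
    if weights is None:
        weights = [1.0] * len(values)
    return math.sqrt(sum(w * v * v for v, w in zip(values, weights)) / sum(weights))


def feature_magnitude(gamma, delta, batch_sizes=None):
    """Return (location, scale, combined) for one feature across batches."""
    location = weighted_rms(gamma, batch_sizes)
    # log(delta) makes a doubling and a halving of variance equally large.
    scale = weighted_rms([math.log(d) for d in delta], batch_sizes)
    return location, scale, math.hypot(location, scale)


def to_distribution(column):
    """'distribution' mode for one column: rescale magnitudes so they sum to 1."""
    total = sum(column)
    return [v / total for v in column]


loc, scale, combined = feature_magnitude(gamma=[3.0, -3.0], delta=[1.0, 1.0])
print(loc, scale, combined)  # 3.0 0.0 3.0
```

Because to_distribution is applied to each column separately, the normalized combined value is no longer the Euclidean norm of the normalized location and scale.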

batch_variance_explained(combat, X)

Fraction of total variance explained by batch after correcting X.

Transforms X with the fitted model and measures the residual between-batch signal (between-batch sum of squares over total sum of squares). Compare it against combat._batch_var_before_ (recorded at fit on the training data) to gauge how much batch structure the correction removed. Computed on demand, so combatlearn.ComBat.transform() stays free of diagnostic side effects.

Parameters:

combat (ComBat) – A fitted ComBat instance.
X (array-like of shape (n_samples, n_features)) – Data to correct and score. It must align with combat.batch exactly as for transform() (a shared pandas index, or a matching length for arrays).

Returns:

Fraction in [0, 1] of total variance explained by batch after correction; lower means less residual batch structure.

Return type:

float

Raises:

ValueError – If the model is not fitted.
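The ratio itself (between-batch sum of squares over total sum of squares) can be sketched without the library. The function name is hypothetical, and pooling the sums of squares across features before dividing is an assumption; the source does not say how features are aggregated.

```python
def batch_variance_fraction(rows, batch):
    """Between-batch SS divided by total SS, pooled over all features.

    rows: list of samples, each a list of feature values.
    batch: one batch label per sample.
    """
    between = total = 0.0
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        grand_mean = sum(column) / len(column)
        total += sum((x - grand_mean) ** 2 for x in column)
        groups = {}
        for x, label in zip(column, batch):
            groups.setdefault(label, []).append(x)
        # Each batch contributes its size times the squared offset of its mean.
        between += sum(
            len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
        )
    return between / total if total > 0 else 0.0


labels = ["A", "A", "B", "B"]
print(batch_variance_fraction([[1.0], [1.0], [3.0], [3.0]], labels))  # 1.0
print(batch_variance_fraction([[1.0], [3.0], [1.0], [3.0]], labels))  # 0.0
```

A value near 1 means batch membership accounts for almost all variation; a value near 0 means the batch means coincide.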

summary(combat, X=None)

Return a human-readable diagnostic report after fitting.

Parameters:

combat (ComBat) – A fitted ComBat instance.
X (array-like of shape (n_samples, n_features), optional) – If given, the report adds the fraction of variance explained by batch after correcting X (via batch_variance_explained()). Omit it to report only the pre-correction value recorded at fit.

Returns:

Multi-line summary string.

Return type:

str

Raises:

ValueError – If the model is not fitted.

Metrics

Functions for computing batch effect metrics.

Batch effect metrics and diagnostics.

compute_batch_metrics(combat, X, batch=None, *, pca_components=None, k_neighbors=None, kbet_k0=None, lisi_perplexity=30, n_jobs=1, nn_algorithm='auto')

Compute batch effect metrics before and after ComBat correction.

Parameters:

combat (ComBat) – A fitted ComBat instance.
X (array-like of shape (n_samples, n_features)) – Input data to evaluate.
batch (array-like of shape (n_samples,), optional) – Batch labels. If None, uses the batch stored at construction.
pca_components (int, optional) – Number of PCA components for dimensionality reduction before computing metrics. If None (default), metrics are computed in the original feature space. Must be less than min(n_samples, n_features).
k_neighbors (list of int, default=[5, 10, 50]) – Values of k for k-NN preservation metric.
kbet_k0 (int, optional) – Neighborhood size for kBET. Default is 10% of samples.
lisi_perplexity (int, default=30) – Perplexity for LISI computation.
n_jobs (int, default=1) – Number of parallel jobs for neighbor computations.
nn_algorithm ({'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto') – Algorithm used for nearest neighbor computation. Passed to sklearn.neighbors.NearestNeighbors.

Returns:

Dictionary with three main keys:

batch_effect: Silhouette, Davies-Bouldin, kBET, LISI, variance ratio (each with 'before' and 'after' values)
preservation: k-NN preservation fractions, distance correlation
alignment: Centroid distance, Levene statistic (each with 'before' and 'after' values)

Return type:

dict

Raises:

ValueError – If the model is not fitted or if pca_components is invalid.
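To make one of the preservation metrics concrete, here is a plain-Python sketch of a k-NN preservation fraction. The definition used (the average share of each sample's k nearest neighbors that stay the same after correction) is an assumption about what the metric measures, and the function names are hypothetical.

```python
import math


def knn_indices(points, k):
    """For each point, the index set of its k nearest neighbors (ties broken by index)."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (math.dist(p, points[j]), j),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def knn_preservation(before, after, k):
    """Average fraction of k nearest neighbors shared before and after correction."""
    shared = sum(
        len(b & a) for b, a in zip(knn_indices(before, k), knn_indices(after, k))
    )
    return shared / (k * len(before))


original = [(0.0,), (1.0,), (2.0,), (10.0,)]
shifted = [(5.0,), (6.0,), (7.0,), (15.0,)]  # a pure shift keeps every neighborhood
print(knn_preservation(original, shifted, k=2))  # 1.0
```

A value close to 1 suggests the correction left local structure intact, while a low value suggests that neighborhoods were rearranged.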

Visualization

Functions for visualizing batch effects and ComBat corrections.

Visualization utilities for ComBat batch correction.

plot_transformation(combat, X, *, reduction_method='pca', n_components=2, plot_type='static', figsize=(12, 5), alpha=0.7, point_size=50, cmap='Set1', title=None, show_legend=True, return_embeddings=False, **reduction_kwargs)

Visualize the ComBat transformation effect using dimensionality reduction.

Shows a before/after comparison of the data, reduced to two or three dimensions with PCA, t-SNE, or UMAP.

Parameters:

combat (ComBat) – A fitted ComBat instance.
X (array-like of shape (n_samples, n_features)) – Input data to transform and visualize.
reduction_method ({'pca', 'tsne', 'umap'}, default='pca') – Dimensionality reduction method.
n_components ({2, 3}, default=2) – Number of components for dimensionality reduction.
plot_type ({'static', 'interactive'}, default='static') – Visualization type:
- 'static': matplotlib plots (can be saved as images)
- 'interactive': plotly plots (explorable, requires plotly)
figsize (tuple of int, default=(12, 5)) – Figure size in inches (width, height). Only used for static plots.
alpha (float, default=0.7) – Marker transparency. Only used for static plots.
point_size (int, default=50) – Marker size. Only used for static plots.
cmap (str, default='Set1') – Matplotlib colormap name for batch colors.
title (str or None, default=None) – Custom figure title. If None, a default title is generated.
show_legend (bool, default=True) – Whether to display the batch legend.
return_embeddings (bool, default=False) – If True, return embeddings along with the plot.
**reduction_kwargs (dict) – Additional keyword arguments passed to the reduction method (e.g., perplexity for t-SNE, n_neighbors for UMAP).

Returns:

fig (matplotlib.figure.Figure or plotly.graph_objects.Figure) – The figure object containing the plots.
embeddings (dict, optional) – If return_embeddings=True, a dictionary with:
- 'original': embedding of the original data
- 'transformed': embedding of the ComBat-transformed data

Return type:

Any | tuple[Any, dict[str, FloatArray]]

plot_feature_diagnostics(combat, top_n=20, kind='combined', mode='magnitude', layout='grouped', weighted=True, figsize=(8, 10))

Plot top features affected by batch effects.

Parameters:

combat (ComBat) – A fitted ComBat instance.
top_n (int, default=20) – Number of top features to display.
kind ({'location', 'scale', 'combined'}, default='combined') –
- 'location': bar plot of location (mean shift) contribution only
- 'scale': bar plot of scale (variance) contribution only
- 'combined': grouped bar plot showing location and scale side-by-side for each feature (sorted by Euclidean magnitude). In magnitude mode, bars reflect the Euclidean decomposition (combined**2 = location**2 + scale**2). In distribution mode, bars reflect independently normalized contributions (each sums to 1 separately).
mode ({'magnitude', 'distribution'}, default='magnitude') –
- 'magnitude': y-axis shows absolute batch effect magnitude
- 'distribution': y-axis shows relative contribution (proportion) and includes an annotation showing the cumulative contribution of the top_n features (e.g., “Top 20 features explain 75% of total batch effect”)
layout ({'grouped', 'diverging'}, default='grouped') –
Only used when kind='combined'.
- 'grouped': location and scale bars side-by-side on a single shared x-axis.
- 'diverging': back-to-back bars with location growing leftward and scale growing rightward, sharing the feature axis but with an independent x-axis and grid per side. Keeps the absolute (or relative) values of mode while giving each component its own scale, so a small component is not visually crushed by a large one.
weighted (bool, default=True) – If True, batch effects are weighted by batch sample size. Passed to feature_batch_diagnostics().
figsize (tuple, default=(8, 10)) – Figure size (width, height) in inches.

Returns:

The figure object containing the plot.

Return type:

matplotlib.figure.Figure

Raises:

ValueError – If the model is not fitted, or if kind/mode is invalid.

plot_batch_effect_heatmap(combat, top_n=50, weighted=True, figsize=(12, 8))

Plot a heatmap of batch effect parameters across features and batches.

Displays the estimated batch-specific location shifts (gamma) and, unless mean_only=True, log-scale shifts (log delta) for the top_n most affected features.

Parameters:

combat (ComBat) – A fitted ComBat instance.
top_n (int, default=50) – Number of top features (by combined batch effect) to display.
weighted (bool, default=True) – If True, feature ranking uses sample-size-weighted batch effects. Passed to feature_batch_diagnostics().
figsize (tuple of int, default=(12, 8)) – Figure size in inches.

Returns:

Figure containing the heatmap(s).

Return type:

matplotlib.figure.Figure

Raises:

ValueError – If the model is not fitted.
ImportError – If seaborn is not installed.