flagx.gating

This module provides classification methods for automated flow cytometry gating.

SOMClassifier

class flagx.gating.SOMClassifier(*args: Any, **kwargs: Any)

Bases: BaseEstimator, ClassifierMixin

Self-Organizing Map (SOM) classifier with scikit-learn–compatible API.

This classifier uses Somoclu to train a 2D SOM grid in an unsupervised fashion and assigns class labels to SOM units by majority vote across labeled training samples. Predictions are computed using the best-matching unit (BMU) for each sample and the majority class associated with that unit.
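
The BMU lookup and majority-vote annotation described above can be sketched in plain NumPy. This is a minimal illustration, not the library's implementation: the helper names, the flat codebook layout, and the use of -1 for units without labeled samples are assumptions made for the example.

```python
import numpy as np

def bmu_indices(X, codebook):
    """Return the index of the closest codebook vector (BMU) for each sample."""
    # Squared Euclidean distances, shape (n_samples, n_units)
    d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def annotate_units(bmus, y, n_units, n_classes):
    """Build a class histogram per unit and take the majority class.

    Units that are never a BMU for a labeled sample get the label -1
    (a convention chosen for this sketch only).
    """
    counts = np.zeros((n_units, n_classes), dtype=int)
    np.add.at(counts, (bmus, y), 1)
    unit_labels = np.where(counts.sum(axis=1) > 0, counts.argmax(axis=1), -1)
    return unit_labels, counts

# Toy codebook with three units and two classes
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
X_train = np.array([[0.1, 0.0], [0.0, 0.2], [0.9, 1.1], [1.1, 1.0], [1.0, 0.8]])
y_train = np.array([0, 0, 1, 1, 0])

unit_labels, counts = annotate_units(bmu_indices(X_train, codebook), y_train, 3, 2)

# Prediction: look up the majority class of each new sample's BMU
X_new = np.array([[0.05, 0.05], [1.2, 0.9]])
y_pred = unit_labels[bmu_indices(X_new, codebook)]
```

The third unit receives no labeled samples in this toy data, which is the situation the UserWarning entries of annotate_som and predict refer to.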

The classifier supports:

Unsupervised SOM training
Supervised unit annotation
Class probability estimation
Hyperparameter tuning (via GridSearchCV)
SOM quality metrics (quantization error, topographic error)
Visualization-oriented transformations
Model saving and loading

som_

Trained SOM object.

Type:: Somoclu

is_fitted_

Whether the model has been fitted.

Type:: bool

classes_

Class labels after re-indexing to integers starting from 0.

Type:: np.ndarray or None

class_counts_

Class counts from the training data.

Type:: np.ndarray or None

og_classes_

Original class labels before re-indexing.

Type:: np.ndarray or None

class_priors_

Empirical class priors.

Type:: np.ndarray or None

som_unit_labels_

Majority class per SOM unit.

Type:: np.ndarray

class_counts_per_unit_

Class histogram per SOM unit.

Type:: np.ndarray

grid_search_

Grid search results if hyperparameter tuning was performed.

Type:: GridSearchCV or None

Parameters:

som_topology (Literal['planar', 'toroid']) – SOM grid topology. Defaults to 'planar'.
som_grid_type (Literal['rectangular', 'hexagonal']) – Grid layout type. Defaults to 'rectangular'.
som_dimensions (Tuple[int, int]) – Dimensions of the SOM grid (n_columns, n_rows). Defaults to (10, 10).
neighborhood (Literal['gaussian', 'bubble']) – Neighborhood function type. Defaults to 'gaussian'.
gaussian_neighborhood_sigma (float or None) – Sigma for Gaussian neighborhood function. Defaults to 0.1.
initialization (Literal['random', 'pca']) – Codebook initialization method. Defaults to 'pca'.
initial_codebook (np.ndarray or None) – Custom initialization of SOM weights. Defaults to None.
n_epochs (int) – Number of SOM training epochs. Defaults to 100.
radius_0 (float) – Initial neighborhood radius. Negative values are interpreted as fractions of the grid size. Defaults to -0.5.
radius_n (float) – Final neighborhood radius. Defaults to 0.1.
radius_cooling (Literal['linear', 'exponential']) – Radius decay schedule. Defaults to 'exponential'.
learning_rate_0 (float) – Initial learning rate. Defaults to 0.1.
learning_rate_n (float) – Final learning rate. Defaults to 0.001.
learning_rate_decay (Literal['linear', 'exponential']) – Learning rate decay schedule. Defaults to 'exponential'.
unlabeled_label (Any) – Label indicating unlabeled samples. Defaults to -999.
verbosity (int) – Logging level. Defaults to 1.
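
The difference between the 'linear' and 'exponential' options of radius_cooling and learning_rate_decay can be pictured with the sketch below. The formulas are a common textbook form and are an assumption here; the schedules Somoclu actually applies may differ in detail. The exponential form also assumes positive start and end values, so a negative radius_0 would first have to be resolved to an absolute radius.

```python
def linear_schedule(start, end, n_epochs, epoch):
    """Interpolate linearly from start (epoch 0) to end (last epoch)."""
    if n_epochs <= 1:
        return start
    return start + (end - start) * epoch / (n_epochs - 1)

def exponential_schedule(start, end, n_epochs, epoch):
    """Decay geometrically from start (epoch 0) to end (last epoch).

    Requires start > 0 and end > 0 (assumption of this sketch).
    """
    if n_epochs <= 1:
        return start
    return start * (end / start) ** (epoch / (n_epochs - 1))

# Documented defaults: learning_rate_0=0.1, learning_rate_n=0.001, n_epochs=100
n_epochs = 100
lr_linear_mid = linear_schedule(0.1, 0.001, n_epochs, 50)
lr_exp_mid = exponential_schedule(0.1, 0.001, n_epochs, 50)
lr_exp_last = exponential_schedule(0.1, 0.001, n_epochs, n_epochs - 1)
```

Both schedules start and end at the same values; the exponential one drops faster early on, so it spends more epochs fine-tuning with small updates.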

fit(X: numpy.ndarray, y: numpy.ndarray) → typing_extensions.Self

Train the SOM on input data and annotate units if labeled data is provided.

Parameters:

X (np.ndarray) – Training features of shape (n_samples, n_features).
y (np.ndarray) – Training labels. Unlabeled samples must be marked using unlabeled_label.

Returns:

The fitted classifier instance.

Return type:

Self

Raises:

ValueError – If the feature dimension does not match a previous fit call.
UserWarning – If fitting continues from an already-initialized SOM.

predict(X: numpy.ndarray) → numpy.ndarray

Predict labels for new samples using the BMU and unit annotations.

Parameters:

X (np.ndarray) – Input feature matrix.

Returns:

Predicted labels in the original label space.

Return type:

np.ndarray

Raises:

NotFittedError – If the classifier has not been fitted.
UserWarning – If units without labels are BMU for some samples.

predict_proba(X: numpy.ndarray) → numpy.ndarray

Estimate class probabilities based on the class distribution of the BMU.

Parameters:

X (np.ndarray) – Input feature matrix.

Returns:

Class probabilities per sample.

Return type:

np.ndarray

Raises:

NotFittedError – If the classifier has not been fitted.
UserWarning – If no labeled data was provided.

annotate_som(X: numpy.ndarray, y: numpy.ndarray) → typing_extensions.Self

Assign class labels to SOM units by computing the majority class among samples for which the respective unit is the BMU.

Parameters:

X (np.ndarray) – Input features for annotation.
y (np.ndarray) – Labels corresponding to X.

Returns:

Updated classifier instance with unit annotations.

Return type:

Self

Raises:

RuntimeError – If SOM has not been trained prior to annotation.
UserWarning – If some SOM units have no support from labeled samples.

Perform hyperparameter optimization using scikit-learn’s GridSearchCV.

Parameters:

X (np.ndarray) – Feature matrix.
y (np.ndarray) – Labels.
param_grid (dict or None) – Hyperparameter search space.
cv (int or CrossValidator) – Number of folds or cross-validation strategy. Defaults to 5.
scoring (str or callable or None) – Scoring metric. If 'internal', macro-F1 is used. Defaults to 'internal'.
refit (bool or str or callable) – Whether to refit using the best model. Defaults to True.
gridsearchcv_kwargs (dict or None) – Additional parameters for GridSearchCV. Defaults to None.

Returns:

Classifier with updated best-found parameters.

Return type:

Self

Notes

The method updates the instance with GridSearchCV stored in the grid_search_ attribute.

score(X: numpy.ndarray, y: numpy.ndarray, sample_weight: numpy.ndarray | None = None)

Compute macro F1 score on the provided data.

Parameters:

X (np.ndarray) – Feature matrix.
y (np.ndarray) – True labels.
sample_weight (np.ndarray or None) – Optional sample weights.

Returns:

Macro-averaged F1 score.

Return type:

float
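
For reference, the macro-averaged F1 score is the unweighted mean of the per-class F1 scores. The sketch below computes it by hand in NumPy; it ignores sample_weight and is an illustration of the metric, not the code used by score().

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes in y_true or y_pred."""
    classes = np.union1d(y_true, y_pred)
    f1_scores = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))  # true positives
        fp = np.sum((y_pred == c) & (y_true != c))  # false positives
        fn = np.sum((y_pred != c) & (y_true == c))  # false negatives
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        f1_scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(f1_scores))

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally, rare cell populations weigh as much as abundant ones, which is usually the desired behavior for imbalanced gating data.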

activation_frequencies(X: numpy.ndarray)

Compute activation frequencies of each SOM unit on the given data.

Parameters:: X (np.ndarray) – Input features.
Returns:: Array of shape (som_dim0, som_dim1) with normalized activation counts per unit.
Return type:: np.ndarray
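
A minimal sketch of how activation frequencies could be derived from BMU grid coordinates. Normalizing by the total number of samples and indexing the grid as (row, column) are assumptions made for this example.

```python
import numpy as np

def activation_frequencies_sketch(bmu_rows, bmu_cols, n_rows, n_cols):
    """Count how often each unit is a BMU and normalize the counts to sum to 1."""
    counts = np.zeros((n_rows, n_cols), dtype=float)
    np.add.at(counts, (bmu_rows, bmu_cols), 1.0)
    return counts / counts.sum()

# Four samples mapped onto a 2 x 2 grid
freq = activation_frequencies_sketch(
    bmu_rows=np.array([0, 0, 1, 1]),
    bmu_cols=np.array([0, 0, 1, 0]),
    n_rows=2,
    n_cols=2,
)
```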

quantization_error(X: numpy.ndarray) → float

Compute the SOM quantization error.

Quantization error = mean Euclidean distance between samples and the codebook vector of their BMU.

Parameters:: X (np.ndarray) – Input features.
Returns:: Mean quantization error.
Return type:: float
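
The definition above translates directly into NumPy. This sketch assumes the codebook is available as a flat array of shape (n_units, n_features); how the classifier stores it internally is not specified here.

```python
import numpy as np

def quantization_error_sketch(X, codebook):
    """Mean Euclidean distance between each sample and its closest codebook vector."""
    # Pairwise Euclidean distances, shape (n_samples, n_units)
    dists = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
    return float(dists.min(axis=1).mean())

codebook = np.array([[0.0, 0.0], [3.0, 4.0]])
X = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0]])
qe = quantization_error_sketch(X, codebook)
```

Two of the three samples coincide with a codebook vector and the third lies at distance 3 from its BMU, so the mean is 1.0.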

topographic_error(X: numpy.ndarray) → float

Compute the SOM topographic error.

Topographic error = proportion of samples where the 1st and 2nd BMUs are not adjacent on the SOM grid.

Parameters:: X (np.ndarray) – Input feature matrix.
Returns:: Topographic error.
Return type:: float
Raises:: NotImplementedError – If SOM topology is not planar rectangular.
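
A sketch of the topographic error for a planar rectangular grid. Two points are assumptions of this example rather than documented behavior: units are stored in row-major order, and "adjacent" means the 8-neighborhood (grid positions differing by at most one step in each direction).

```python
import numpy as np

def topographic_error_sketch(X, codebook, n_rows, n_cols):
    """Fraction of samples whose first and second BMUs are not grid neighbors.

    codebook has shape (n_rows * n_cols, n_features), units in row-major order.
    """
    dists = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
    order = np.argsort(dists, axis=1)
    first, second = order[:, 0], order[:, 1]
    # Convert flat unit indices to (row, col) grid positions
    r1, c1 = np.divmod(first, n_cols)
    r2, c2 = np.divmod(second, n_cols)
    # Chebyshev distance on the grid; 1 means adjacent under the 8-neighborhood
    grid_dist = np.maximum(np.abs(r1 - r2), np.abs(c1 - c2))
    return float(np.mean(grid_dist > 1))

# A 1 x 4 map that is "folded": the last unit is close to the first in
# feature space but three steps away on the grid.
codebook = np.array([[0.0], [1.0], [2.0], [0.1]])
X = np.array([[0.04], [1.4]])
te = topographic_error_sketch(X, codebook, n_rows=1, n_cols=4)
```

The first sample has BMUs at opposite ends of the grid and counts as an error; the second has neighboring BMUs, giving an error of 0.5.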

unit_impurity(impurity_measure: typing_extensions.Literal['entropy', 'gini'] = 'entropy') → numpy.ndarray

Compute class impurity for each SOM unit.

Parameters:: impurity_measure (Literal['entropy', 'gini']) – Impurity metric. Defaults to 'entropy'.
Returns:: Impurity per SOM unit.
Return type:: np.ndarray
Raises:: UserWarning – If classifier was trained without labeled data.
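
Both impurity measures can be computed from a per-unit class histogram such as class_counts_per_unit_. The sketch below is illustrative only: the base-2 logarithm, the flat (n_units, n_classes) layout, and returning NaN for units without labeled samples are assumptions.

```python
import numpy as np

def unit_impurity_sketch(class_counts_per_unit, impurity_measure="entropy"):
    """Entropy or Gini impurity of the class distribution in each unit."""
    counts = np.asarray(class_counts_per_unit, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        p = counts / totals  # class proportions per unit (NaN for empty units)
        if impurity_measure == "entropy":
            # Treat 0 * log(0) as 0
            terms = np.where(p > 0, -p * np.log2(p), 0.0)
            impurity = terms.sum(axis=1)
        elif impurity_measure == "gini":
            impurity = 1.0 - (p ** 2).sum(axis=1)
        else:
            raise ValueError("impurity_measure must be 'entropy' or 'gini'")
    # Units with no labeled samples have no defined impurity in this sketch
    impurity[totals[:, 0] == 0] = np.nan
    return impurity

hist = [[2, 2], [4, 0], [0, 0]]  # mixed unit, pure unit, empty unit
entropy = unit_impurity_sketch(hist, "entropy")
gini = unit_impurity_sketch(hist, "gini")
```

A perfectly mixed two-class unit has entropy 1 bit and Gini impurity 0.5, while a pure unit scores 0 under both measures.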

mean_impurity(impurity_measure: typing_extensions.Literal['entropy', 'gini'] = 'entropy') → float

Compute the mean impurity across all SOM units.

Parameters:: impurity_measure (Literal['entropy', 'gini']) – Impurity metric. Defaults to 'entropy'.
Returns:: Mean impurity over all units.
Return type:: float

unpredictable_classes() → numpy.ndarray

Identify classes that were seen during training but cannot be predicted because no SOM unit was annotated with those labels.

Returns:: Array of missing/unpredictable classes.
Return type:: np.ndarray
Raises:: UserWarning – If no labeled data was provided.
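
Conceptually, this is a set difference between the classes seen in training and the labels that ended up on at least one unit. A short sketch, assuming -1 marks units without a label (a convention chosen for this example):

```python
import numpy as np

classes = np.array([0, 1, 2])             # classes seen during training
unit_labels = np.array([0, 0, 2, -1, 2])  # majority class per unit; -1 = unlabeled

# Classes that never win the majority vote in any unit cannot be predicted
missing = np.setdiff1d(classes, unit_labels)
```

Here class 1 is never the majority in any unit. Rare populations are the typical case; a larger grid or more training epochs may give them their own units.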

save(filename: str = 'som_classifier.pkl', filepath: str | None = None) → None

Save the trained classifier to disk using pickle.

Parameters:

filename (str) – Output filename.
filepath (str or None) – Directory to save the file. Defaults to CWD.

Returns:

None

classmethod load(filename: str = 'som_classifier.pkl', filepath: str | None = None) → typing_extensions.Self

Load a saved classifier instance from disk.

Parameters:

filename (str) – File to load.
filepath (str or None) – Directory containing the file.

Returns:

Loaded classifier instance.

Return type:

Self

reset()

Reset the classifier to its untrained state, clearing the trained SOM, class annotations, and metadata.

Returns:: None

transform(X: numpy.ndarray) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray]

Project samples onto the SOM grid and generate visualization-friendly scattered BMU coordinates.

Parameters:: X (np.ndarray) – Input data.
Returns::
bmus (np.ndarray) – BMU coordinates for each sample.
bmus_scattered (np.ndarray) – Scattered BMU coordinates for visualization.
som_unit_ids (np.ndarray) – Unit ID of the BMU of each sample, in row-major order.
radii (np.ndarray) – For each sample, a radius proportional to the activation frequency of its BMU across the input data.
Return type:: Tuple
Raises:: NotFittedError – If the classifier has not been trained.

MLPClassifier

class flagx.gating.MLPClassifier(*args: Any, **kwargs: Any)

Bases: BaseEstimator, ClassifierMixin

A multilayer perceptron (MLP) classifier.

This classifier wraps a fully connected neural network implemented in PyTorch while exposing a scikit-learn–style API. The model supports multi-class classification, automatic device selection (CPU or GPU), and provides the methods fit(), predict(), predict_proba(), score(), save(), and load(). Training uses CrossEntropyLoss and the Adam optimizer.

classes_

Class labels after re-indexing to integers starting from 0.

Type:: np.ndarray or None

class_counts_

Class counts from the training data.

Type:: np.ndarray or None

og_classes_

Original class labels before re-indexing.

Type:: np.ndarray or None

class_priors_

Empirical class priors.

Type:: np.ndarray or None

new_to_og_classes_dict_

Mapping from new integer labels back to original labels.

Type:: dict[int, Any] or None

data_set_

PyTorch tensor dataset constructed during fitting.

Type:: TensorDataset or None

data_loader_

PyTorch DataLoader used for minibatch training.

Type:: DataLoader or None

model_

Neural network model.

Type:: nn.Module or None

criterion_

Loss function, PyTorch CrossEntropyLoss.

Type:: nn.Module or None

optimizer_

PyTorch Adam optimizer with learning rate 0.001.

Type:: Optimizer or None

training_log_

Logged losses of the training run.

Type:: dict[str, int | list[int | float]] or None

is_fitted_

Whether the classifier has been fitted.

Type:: bool

Parameters:

layer_sizes (Tuple[int, ...]) – Sizes of the hidden layers in the fully connected neural network.
n_epochs (int) – Number of training epochs.
loss_params (dict[str, Any] or None) – Parameters passed to PyTorch’s CrossEntropyLoss.
optimizer_params (dict[str, Any] or None) – Parameters passed to PyTorch’s Adam optimizer. If None, defaults to {'lr': 0.001}.
data_loader_params (dict[str, Any] or None) – Parameters passed to the PyTorch DataLoader. If None, defaults to {'batch_size': 128, 'shuffle': True, 'num_workers': 1}.
validation_fraction (float) – Fraction of the training data used as validation set. Defaults to 0.1.
early_stopping (bool) – Whether early stopping is used or not. If early_stopping is True and validation_fraction=0.0, the training loss is used as an early stopping criterion. Defaults to False.
tol (float) – Tolerance for early stopping. When the validation/training loss is not improving by at least tol for n_iter_no_change consecutive iterations, training is stopped early.
n_iter_no_change (int) – Number of consecutive epochs without an improvement of at least tol after which training is stopped early.
device (str or None) – Device to use for training (e.g., 'cpu', 'cuda', 'cuda:0'). If None, CUDA is used when available, otherwise falls back to CPU.
verbosity (int) – Verbosity level for training logs.
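
The interplay of early_stopping, tol, and n_iter_no_change can be sketched as a small patience loop over per-epoch losses. The function below is a hypothetical helper written for illustration; the classifier's exact bookkeeping (for example, how the best loss is tracked) may differ.

```python
def early_stopping_epoch(losses, *, tol, n_iter_no_change):
    """Return the epoch index at which training would stop, or None if it never does.

    An epoch counts as an improvement only if the monitored loss drops
    below the best loss so far by more than tol.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(losses):
        if loss < best - tol:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= n_iter_no_change:
            return epoch
    return None

# Validation loss (or training loss when validation_fraction=0.0) per epoch
losses = [1.0, 0.8, 0.7, 0.69995, 0.69999, 0.7, 0.5]
stop_at = early_stopping_epoch(losses, tol=1e-3, n_iter_no_change=3)
never = early_stopping_epoch(losses, tol=1e-3, n_iter_no_change=4)
```

With a patience of 3 the plateau after epoch 2 stops training at epoch 5; with a patience of 4 the late drop to 0.5 arrives in time and training runs to completion.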

fit(X: numpy.ndarray, y: numpy.ndarray) → typing_extensions.Self

Fit the MLP classifier to the provided training data.

Parameters:

X (np.ndarray) – Feature matrix of shape (n_samples, n_features).
y (np.ndarray) – Target labels of shape (n_samples,).

Returns:

The fitted classifier instance.

Return type:

Self

Raises:

ValueError – If X and y have incompatible shapes.

predict(X: numpy.ndarray) → numpy.ndarray

Predict class labels for the given input samples.

Parameters:: X (np.ndarray) – Feature matrix of shape (n_samples, n_features).
Returns:: Predicted class labels using the original label encoding.
Return type:: np.ndarray
Raises:: NotFittedError – If predict() is used before calling fit().

predict_proba(X: numpy.ndarray) → numpy.ndarray

Predict class probabilities for the given samples.

Parameters:: X (np.ndarray) – Feature matrix of shape (n_samples, n_features).
Returns:: Array of shape (n_samples, n_classes) containing class probabilities.
Return type:: np.ndarray
Raises:: NotFittedError – If predict_proba() is used before calling fit().

score(X: numpy.ndarray, y: numpy.ndarray, sample_weight: numpy.ndarray | None = None)

Compute the macro F1 score of the classifier on the given dataset.

Parameters:

X (np.ndarray) – Feature matrix of shape (n_samples, n_features).
y (np.ndarray) – True labels.
sample_weight (np.ndarray or None) – Optional sample weights.

Returns:

Macro-averaged F1 score.

Return type:

float

Raises:

NotFittedError – If score() is used before calling fit().

save(filename: str = 'mlp_classifier.pkl', filepath: str | None = None) → None

Save the fitted classifier to disk using torch.save.

Parameters:

filename (str) – Name of the file to save the model to.
filepath (str or None) – Directory where the file will be saved. Defaults to current working directory.

Returns:

None

classmethod load(filename: str = 'mlp_classifier.pkl', filepath: str | None = None, map_location: str | torch.device = 'cpu') → typing_extensions.Self

Load a previously saved classifier from disk.

Parameters:

filename (str) – Name of the saved file.
filepath (str or None) – Directory containing the saved file. Defaults to current working directory.
map_location (str or torch.device) – Device mapping for loading the model (e.g., 'cpu' or 'cuda').

Returns:

The loaded classifier instance.

Return type:

Self

Neural Network Models

class flagx.gating.FCNNModel(*args: Any, **kwargs: Any)

Bases: Module

Fully connected neural network with an arbitrary number of hidden linear layers of arbitrary size.

All layers except the output layer use ReLU activations. Softmax is intentionally omitted from the final layer because torch.nn.CrossEntropyLoss expects raw logits.

The default parameters follow the configuration described in:

DGCyTOF: Deep learning with graphic cluster visualization to predict cell types of single cell mass cytometry data (Cheng et al., 2022).

Their implementation can be found at:

https://github.com/lijcheng12/DGCyTOF/blob/main/Code_Study/DGCyTOF/CyTOF2/CyTOF2.ipynb (22/27/2025).

Parameters:

in_size (int) – Number of input features.
out_size (int) – Number of output classes.
layer_sizes (Tuple[int, ...], optional) – Sizes of the hidden layers. Defaults to (128, 64, 32).

layers

List of fully connected linear layers.

Type:: nn.ModuleList

forward(x: torch.Tensor) → torch.Tensor

Forward pass of the FCNN model. ReLU activation is applied after each layer except after the output layer.

Parameters:: x (torch.Tensor) – Input data tensor.
Returns:: Raw output logits with shape (batch_size, out_size).
Return type:: torch.Tensor
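
To make the layer structure concrete, here is a framework-free NumPy sketch of the same forward pass: ReLU after every layer except the last, raw logits as output, and softmax applied separately when probabilities are needed. The random weight initialization and helper names are assumptions for this example and do not mirror the PyTorch module's internals.

```python
import numpy as np

def init_layers(in_size, out_size, layer_sizes=(128, 64, 32), seed=0):
    """Create (weight, bias) pairs for in_size -> hidden layers -> out_size."""
    rng = np.random.default_rng(seed)
    sizes = (in_size, *layer_sizes, out_size)
    return [
        (rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]

def forward(x, layers):
    """Apply each linear layer; ReLU follows all layers except the output layer."""
    for i, (W, b) in enumerate(layers):
        x = x @ W + b
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x  # raw logits, shape (batch_size, out_size)

def softmax(logits):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

layers = init_layers(in_size=10, out_size=4)
logits = forward(np.ones((5, 10)), layers)
probs = softmax(logits)
```

Keeping softmax out of the model and applying it only when probabilities are requested matches the note above about CrossEntropyLoss expecting raw logits.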