Normalizing CITE-Seq Data

Installation

To use ImmunoPheno, first install it using pip:

(.venv) $ pip install git+https://github.com/CamaraLab/ImmunoPheno.git@package-dev

ImmunoPhenoData

This class holds all single-cell or cytometry data for analysis.

class immunopheno.data_processing.ImmunoPhenoData(protein_matrix: str | DataFrame | None = None, gene_matrix: str | DataFrame | None = None, cell_labels: str | DataFrame | None = None, spreadsheet: str | None = None, scanpy: AnnData | None = None, scanpy_labels: str | None = None)

A class to hold single-cell data (CITE-Seq, etc.) and cytometry data.

Fits Gaussian or negative binomial mixture models to the antibodies present in a protein dataset and normalizes them. Protein data must be supplied through the protein_matrix or scanpy field.

Parameters:
  • protein_matrix (str | pd.DataFrame) – file path or dataframe to ADT count/protein matrix. Format: Row (cells) x column (antibodies/proteins).

  • gene_matrix (str | pd.DataFrame) – file path or dataframe to UMI count matrix. Format: Row (cells) x column (genes).

  • cell_labels (str | pd.DataFrame) – file path or dataframe to cell type labels. Format: Row (cells) x column (cell type such as Cell Ontology ID). Must contain a column called “labels”.

  • spreadsheet (str) – name of csv file containing a spreadsheet with information about the experiment and antibodies.

  • scanpy (anndata.AnnData) – scanpy AnnData object used to load in protein and gene data.

  • scanpy_labels (str) – location of cell labels inside a scanpy object. Format: scanpy is an AnnData object containing an 'obs' field. Ex: AnnData.obs['scanpy_labels']
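As a minimal sketch of how these inputs fit together, the snippet below checks that a cell-label table has the required "labels" column and then passes file paths to the constructor. The helper names has_labels_column and load_citeseq are hypothetical; only the class path and the keyword names come from the signature above.

```python
import csv
import io
from typing import Optional


def has_labels_column(csv_text: str) -> bool:
    """Return True if the CSV header contains the required 'labels' column."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    return "labels" in header


def load_citeseq(protein_csv: str,
                 gene_csv: Optional[str] = None,
                 labels_csv: Optional[str] = None):
    """Hypothetical wrapper that builds an ImmunoPhenoData object from file paths."""
    # Imported lazily so has_labels_column works without ImmunoPheno installed.
    from immunopheno.data_processing import ImmunoPhenoData

    return ImmunoPhenoData(
        protein_matrix=protein_csv,  # rows: cells, columns: antibodies/proteins
        gene_matrix=gene_csv,        # rows: cells, columns: genes
        cell_labels=labels_csv,      # must contain a column called "labels"
    )
```

Only protein data is required, so gene_csv and labels_csv default to None here, mirroring the optional arguments of the constructor.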

__getitem__(index: Index | list)

Allows instances of ImmunoPhenoData to use the indexing operator.

Parameters:

index (pd.Index | list) – list or pandas index of cell names. Returns a new ImmunoPhenoData object in which every dataframe contains only those cells.

Returns:

ImmunoPhenoData object whose dataframes are restricted to the provided row (cell) names.

Return type:

ImmunoPhenoData
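The indexing behaviour described above can be pictured with a small stand-in. TinyData is a hypothetical class written only for illustration, with plain dictionaries in place of dataframes; it is not part of ImmunoPheno.

```python
class TinyData:
    """Hypothetical stand-in that mimics subsetting by cell name."""

    def __init__(self, protein: dict, rna: dict):
        self.protein = protein  # {cell name: {antibody: count}}
        self.rna = rna          # {cell name: {gene: count}}

    def __getitem__(self, index):
        wanted = set(index)
        # Build a new object; every table is restricted to the same cells.
        return TinyData(
            {cell: row for cell, row in self.protein.items() if cell in wanted},
            {cell: row for cell, row in self.rna.items() if cell in wanted},
        )


data = TinyData(
    protein={"c1": {"CD4": 5}, "c2": {"CD4": 0}, "c3": {"CD4": 9}},
    rna={"c1": {"GAPDH": 12}, "c2": {"GAPDH": 7}, "c3": {"GAPDH": 3}},
)
subset = data[["c1", "c3"]]  # the original object is left unchanged
```

With a real ImmunoPhenoData object the call has the same shape: pass a list or pandas index of cell names inside the square brackets.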

property protein: DataFrame

Get or set the current protein dataframe in the object.

Setting a new protein dataframe requires the format to have rows (cells) and columns (proteins/antibodies).

Returns:

Dataframe containing protein data.

Return type:

pd.DataFrame

property rna: DataFrame

Get or set the current gene/rna dataframe in the object.

Setting a new RNA dataframe requires the format to have rows (cells) and columns (genes).

Returns:

Dataframe containing RNA data.

Return type:

pd.DataFrame

property fits: dict

Get the mixture model fits for each antibody in the protein dataframe.

Each mixture model fit will be stored in a dictionary, where the key is the name of the antibody.

Returns:

Key-value pairs represent an antibody name with a nested dictionary containing the respective mixture model fits. Fits are ranked by the lowest AIC.

Return type:

dict
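To make "ranked by the lowest AIC" concrete, here is a small sketch of the ranking itself. The log-likelihood values are made up for illustration, and the layout of the candidates dictionary is an assumption, not the structure ImmunoPheno stores in fits.

```python
def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def rank_by_aic(candidates: dict) -> list:
    """Order candidate component counts from best (lowest AIC) to worst."""
    return sorted(candidates, key=lambda n: aic(*candidates[n]))


# Hypothetical fits for one antibody: {n_components: (log-likelihood, n_params)}.
# A 1-D Gaussian mixture with k components has 3k - 1 free parameters
# (k means, k variances, k - 1 independent weights).
candidates = {
    1: (-1250.0, 2),
    2: (-1100.0, 5),
    3: (-1098.5, 8),
}
ranking = rank_by_aic(candidates)
```

Here the three-component model has a slightly higher likelihood than the two-component one, but the penalty for its extra parameters ranks it second.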

property normalized_counts: DataFrame

Get the normalized protein dataframe in the object.

This dataframe will only be present if normalize_all_antibodies has been called. The format will have rows (cells) and columns (proteins/antibodies). Note that some rows may be missing/filtered out from the normalization step.

Returns:

Normalized protein counts for each antibody.

Return type:

pd.DataFrame

property labels: DataFrame

Get or set the current cell labels for all normalized cells in the object.

This dataframe will contain rows (cells) and two columns: “labels” and “celltypes”. All values in the “labels” column will follow the EMBL-EBI Cell Ontology ID format. A common name for each value in “labels” will be in the “celltypes” column.

Setting a new cell labels table will only update rows (cells) that are shared between the existing and new table.

Returns:

Dataframe containing two columns: “labels” and “celltypes”.

Return type:

pd.DataFrame
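The rule that only shared rows are updated can be sketched with plain dictionaries standing in for the label tables. The function update_shared_labels is hypothetical and only illustrates the described behaviour.

```python
def update_shared_labels(existing: dict, new: dict) -> dict:
    """Return a copy of `existing` with labels replaced only for shared cells."""
    updated = dict(existing)
    for cell in existing.keys() & new.keys():
        updated[cell] = new[cell]
    return updated


existing = {"c1": "CL:0000084", "c2": "CL:0000236"}
new = {"c2": "CL:0000576", "c9": "CL:0000623"}

# c2 is shared and gets the new label; c1 keeps its old label;
# c9 is absent from the existing table and is therefore ignored.
result = update_shared_labels(existing, new)
```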

convert_labels() → None

Convert all cell ontology IDs to a common name.

Requires all values in the “labels” column of the cell labels dataframe to follow the cell ontology format of CL:0000000 or CL_0000000.

Returns:

None. Modifies the cell labels dataframe in-place.
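Because convert_labels expects every label to match CL:0000000 or CL_0000000, it can help to validate labels beforehand. The sketch below is one way to do that; the helper names are hypothetical, and the seven-digit pattern is read off the two example formats above.

```python
import re

# "CL", then ":" or "_", then exactly seven digits.
CL_ID = re.compile(r"CL[:_]\d{7}")


def is_cell_ontology_id(label: str) -> bool:
    """Return True if `label` follows the CL:0000000 or CL_0000000 format."""
    return CL_ID.fullmatch(label) is not None


def invalid_labels(labels: list) -> list:
    """Collect the labels that would not satisfy the expected format."""
    return [label for label in labels if not is_cell_ontology_id(label)]


bad = invalid_labels(["CL:0000084", "CL_0000236", "T cell", "CL:84"])
```

Any entries returned by invalid_labels would need to be corrected in the "labels" column before conversion.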

remove_antibody(antibody: str) → None

Removes an antibody from all protein data and mixture model fits.

Removes all values for an antibody from all protein dataframes in-place. If fit_antibody or fit_all_antibodies has been called, it will also remove the mixture model fits for that antibody.

Parameters:

antibody (str) – name of antibody to be removed.

Returns:

None. Modifies all protein dataframes and fits data in-place.

select_mixture_model(antibody: str, mixture: int) → None

Overrides the best mixture model fit for an antibody.

Parameters:
  • antibody (str) – name of the antibody whose best mixture model fit will be overridden.

  • mixture (int) – number of mixture components of the fit that should be preferred instead.

Returns:

None. Modifies mixture model order in-place.

fit_antibody(input: list | str, ab_name: str | None = None, transform_type: str | None = None, transform_scale: int = 1, model: str = 'gaussian', plot: bool = False, **kwargs) → dict

Fits a mixture model to an antibody and returns its optimal parameters.

This function can be called to either initially fit a single antibody with a mixture model or replace an existing fit. This function can be called after fit_all_antibodies has been called to modify individual fits.

Parameters:
  • input (list | str) – raw values from protein data or antibody name.

  • ab_name (str, optional) – name of antibody. Ignore if calling fit_antibody by supplying the antibody name in the “input” parameter.

  • transform_type (str) – type of transformation. “log” or “arcsinh”.

  • transform_scale (int) – multiplier applied during transformation.

  • model (str) – type of model to fit. “gaussian” or “nb”.

  • plot (bool) – option to plot each model.

  • **kwargs – initial arguments for sklearn’s GaussianMixture (optional).

Returns:

Optimization results, returned as either gauss_params or nb_params depending on the model.

Return type:

dict
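A minimal sketch of refitting a single antibody by name, using only the keyword arguments listed above. The wrapper refit_antibody is hypothetical, `data` is assumed to be an existing ImmunoPhenoData object, and "CD4" is a placeholder antibody name.

```python
def refit_antibody(data, antibody: str,
                   transform_type: str = "log",
                   model: str = "gaussian") -> dict:
    """Refit one antibody by name and return what fit_antibody reports."""
    return data.fit_antibody(
        input=antibody,              # antibody name rather than raw values
        transform_type=transform_type,
        transform_scale=1,
        model=model,
        plot=False,
    )


# Example call, assuming `data` already holds protein data:
# params = refit_antibody(data, "CD4")
```

Because the antibody name is supplied through input, ab_name is left at its default, as the parameter description suggests.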

fit_all_antibodies(transform_type: str | None = None, transform_scale: int = 1, model: str = 'gaussian', plot: bool = False, **kwargs) → None

Fits all antibodies with a Gaussian or Negative Binomial mixture model.

Fits a Gaussian or Negative Binomial mixture model to every antibody in the protein dataset. Once all antibodies are fit, the output reports how many antibodies were fit with each number of mixture components and lists the names of the antibodies that were fit with a single-component model.

Parameters:
  • transform_type (str) – type of transformation. “log” or “arcsinh”.

  • transform_scale (int) – multiplier applied during transformation.

  • model (str) – type of model to fit. “gaussian” or “nb”.

  • plot (bool) – option to plot each model.

  • **kwargs – initial arguments for sklearn’s GaussianMixture (optional).

Returns:

None. Results will be stored in the class. This is accessible using the “fits” property.
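Since the method returns None, results are read back through the fits property. The sketch below shows that pattern; fit_and_list is a hypothetical helper and `data` is assumed to be an ImmunoPhenoData object with protein data loaded.

```python
def fit_and_list(data, transform_type: str = "log",
                 model: str = "gaussian") -> list:
    """Fit every antibody, then return the antibody names found in data.fits."""
    data.fit_all_antibodies(
        transform_type=transform_type,
        transform_scale=1,
        model=model,
        plot=False,
    )
    # fits is a dictionary keyed by antibody name.
    return sorted(data.fits)


# Example call:
# fitted_antibodies = fit_and_list(data)
```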

normalize_all_antibodies(p_threshold: float = 0.05, sig_expr_threshold: float = 0.85, bg_expr_threshold: float = 0.15, bg_cell_z_score: int = 10, cumulative: bool = False) → DataFrame

Normalizes all antibodies in the protein data.

The normalization step uses the fits from the mixture model to remove background noise from the overall signal expression of an antibody. This will take into account non-specific antibody binding if RNA data is present. If RNA data is present, the effects of cell size on the background noise will be regressed out for cells not expressing the antibody. Likewise, if cell labels are provided, the effects of cell types on the background noise for these cells will also be regressed out. These effects are determined by performing a linear regression using the total number of mRNA UMI counts as a proxy for cell size.

Parameters:
  • p_threshold (float) – level of significance for testing the association between cell size/type and background noise from linear regression. If the p-value is smaller than the threshold, these factors are regressed out.

  • sig_expr_threshold (float) – cells with a percentage of expressed proteins above this threshold are filtered out.

  • bg_expr_threshold (float) – cells with a percentage of expressed proteins below this threshold are filtered out.

  • bg_cell_z_score (int) – number of standard deviations of average protein expression used to separate cells that express an antibody from cells that do not. A larger value results in more discrete clusters in the normalized protein expression space.

  • cumulative (bool) – flag to indicate whether to return the cumulative distribution probabilities.

Returns:

None. Results will be stored in the class. This is accessible using the “normalized_counts” property.
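A sketch of running the normalization and checking which cells were filtered out, reading the result from the normalized_counts property as described above. The helper normalize_and_report is hypothetical; it assumes the antibodies have already been fit and that both tables expose cell names through an index attribute, as pandas dataframes do.

```python
def normalize_and_report(data, **thresholds):
    """Normalize all antibodies and list the cells dropped by the filters."""
    # Record the cell names before normalization so the comparison does not
    # depend on whether the protein table itself is modified.
    cells_before = set(data.protein.index)

    data.normalize_all_antibodies(**thresholds)

    normalized = data.normalized_counts
    dropped = sorted(cells_before - set(normalized.index))
    return normalized, dropped


# Example call with the documented default thresholds spelled out:
# normalized, dropped = normalize_and_report(
#     data, p_threshold=0.05, sig_expr_threshold=0.85, bg_expr_threshold=0.15
# )
```

Inspecting the dropped cells is a quick way to see how strict sig_expr_threshold and bg_expr_threshold are for a given dataset.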