Querying the Database
ImmunoPhenoDB_Connect
This class connects to the database containing the reference dataset and provides an interface for querying it.
- class immunopheno.connect.ImmunoPhenoDB_Connect(url: str)
A class to interact with the ImmunoPheno database
Performs queries to a database containing curated single cell data from different experiments, tissues, antibodies, and cell populations. These queries can be used to find antibodies for gating specific cell populations, perform automatic annotation of cytometry data, and design optimal antibody panels for cytometry experiments.
- Parameters:
url (str) – URL link to the ImmunoPheno database server.
- imputed_reference
reference antigenic dataset returned from the database after calling run_stvea. Format: Row (cells) x column (antibodies).
- Type:
pd.DataFrame
- transfer_matrix
transfer matrix returned after calling run_stvea. Format: Row (query cells) x column (reference cells).
- Type:
pd.DataFrame
- plot_db_graph(root=None) Figure
Plots a graph of all cell type ontologies in the database
The graph starts at the root node “CL:0000000”, which represents “cell”. The root node can be changed to home in on a particular cell type and its descendant cell types. Nodes colored in red indicate cell ontologies for which the database contains cells. Nodes in blue indicate intermediary cell ontologies that have been derived from those already in the database.
- Parameters:
root (str) – Root node in the graph. Modifying the root node will return a subgraph containing descendants for only that modified root node.
- Returns:
Graph containing cell ontologies as nodes. This plotly figure can be further updated or styled.
- Return type:
go.Figure
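Because the returned object is a Plotly figure, it can be restyled after the call. The following is a minimal sketch, assuming `db` is an already constructed ImmunoPhenoDB_Connect instance. The helper name `plot_subgraph` is hypothetical and not part of ImmunoPheno, and “CL:0000236” is simply one of the ontology IDs used as an example elsewhere on this page.

```python
def plot_subgraph(db, root, title=None):
    """Hypothetical helper (not part of ImmunoPheno).

    `db` is assumed to be a connected ImmunoPhenoDB_Connect instance.
    """
    # Restrict the ontology graph to the descendants of `root`.
    fig = db.plot_db_graph(root=root)
    if title is not None:
        # Standard Plotly call for restyling an existing figure.
        fig.update_layout(title_text=title)
    return fig

# Usage, assuming a reachable database server:
# fig = plot_subgraph(db, "CL:0000236", title="Subgraph below CL:0000236")
# fig.show()
```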
- find_antibodies(id_CLs: list, background_id_CLs: list | None = None, idBTO: list | None = None, idExperiment: list | None = None) tuple
Queries the database to find antibodies that mark a provided list of cell populations.
This function accepts two lists of cell populations, each given as cell ontology IDs. If only “id_CLs” is provided, the function returns a table of antibodies that are expressed in those cell populations. If a list is also provided for “background_id_CLs”, the function returns a table of antibodies that can distinguish cells in the populations defined in “id_CLs” from those defined in “background_id_CLs”. Additional filters based on tissue and experiment IDs can be applied to restrict the data in the query.
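- Parameters:
id_CLs (list) – list of cell ontology IDs for the cell populations of interest.
background_id_CLs (list, optional) – list of cell ontology IDs for the populations to distinguish from those in “id_CLs”.
idBTO (list, optional) – list of tissue IDs used to restrict the query.
idExperiment (list, optional) – list of experiment IDs used to restrict the query.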
- Returns:
Returns a tuple (pd.DataFrame, dictionary).
The dataframe contains rows for all possible antibodies found in each cell population. The columns contain statistics regarding the upregulation or downregulation of an antibody for a cell population. The level of detection of an antibody for a cell population is also included.
The dictionary contains boxplots of antibodies for each provided cell population in both “id_CLs” and “background_id_CLs”. Each plot contains the distribution of normalized expression levels for each antibody in a specific cell population. Each set of boxplots is accessible by providing the cell population ID as the key.
- Return type:
tuple
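As a rough illustration of the call pattern described for find_antibodies, the sketch below wraps the query in a small helper and keeps only the boxplots of the target populations. It assumes `db` is an already constructed ImmunoPhenoDB_Connect instance. The helper name `markers_for` is hypothetical, and the ontology IDs in the usage comment are the example IDs that appear elsewhere on this page.

```python
def markers_for(db, targets, background=None, tissues=None, experiments=None):
    """Hypothetical helper (not part of ImmunoPheno).

    `db` is assumed to be a connected ImmunoPhenoDB_Connect instance.
    Returns the antibody table and the boxplots of the target populations.
    """
    table, plots = db.find_antibodies(
        id_CLs=targets,
        background_id_CLs=background,
        idBTO=tissues,
        idExperiment=experiments,
    )
    # The plot dictionary is keyed by cell population ID.
    target_plots = {cl: plots[cl] for cl in targets if cl in plots}
    return table, target_plots

# Usage, assuming a reachable database server:
# table, plots = markers_for(db, ["CL:0000236"], background=["CL:0000084"])
```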
- find_celltypes(ab_ids: list, idBTO: list | None = None, idExperiment: list | None = None) tuple
Queries the database to find cell populations that are marked by a provided list of antibodies.
The “ab_ids” parameter accepts a list of antibody IDs. The function finds all cell populations that are marked by each antibody provided. Additional filters based on tissue and experiment IDs can be applied to restrict the data in the query.
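- Parameters:
ab_ids (list) – list of antibody IDs.
idBTO (list, optional) – list of tissue IDs used to restrict the query.
idExperiment (list, optional) – list of experiment IDs used to restrict the query.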
- Returns:
Returns a tuple (dictionary, dictionary).
The first dictionary in the tuple holds the query results from the database as dataframes, one per antibody. Each dataframe contains rows for all possible cell populations that the antibody marks. The columns contain statistics regarding the regulation of the antibody in comparison to all other cell populations. Each dataframe is accessible by providing the antibody ID as the key.
The second dictionary in the tuple will be the boxplots for each provided antibody in “ab_ids”. Each plot contains the distribution of normalized expression levels for cell populations marked by each antibody. Each set of boxplots is accessible by providing the antibody ID as the key.
- Return type:
tuple
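Since both returned dictionaries are keyed by antibody ID, they can be paired up per antibody. The following is a minimal sketch, assuming `db` is an already constructed ImmunoPhenoDB_Connect instance. The helper name `celltypes_by_antibody` is hypothetical, and the antibody ID in the usage comment is borrowed from the merge_option example on this page.

```python
def celltypes_by_antibody(db, ab_ids, tissues=None, experiments=None):
    """Hypothetical helper (not part of ImmunoPheno).

    `db` is assumed to be a connected ImmunoPhenoDB_Connect instance.
    Returns {antibody ID: (results dataframe, boxplots or None)}.
    """
    tables, plots = db.find_celltypes(
        ab_ids=ab_ids, idBTO=tissues, idExperiment=experiments
    )
    # Skip antibodies for which the query returned no table.
    return {ab: (tables[ab], plots.get(ab)) for ab in ab_ids if ab in tables}

# Usage, assuming a reachable database server:
# results = celltypes_by_antibody(db, ["AB_2734345"])
# table, boxplots = results["AB_2734345"]
```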
- find_experiments(ab: list | None = None, idCL: list | None = None, idBTO: list | None = None) DataFrame
Queries the database to find experiments under specific requirements
At least one parameter must be provided when looking up experiments. The “ab” parameter searches by antibodies used in an experiment and accepts a list of antibody IDs. The “idCL” parameter searches by the presence of cell populations and accepts a list of cell ontology IDs. The “idBTO” parameter searches by the usage of specific tissues and accepts a list of BRENDA tissue ontology IDs. These parameters can be combined to further narrow the number of experiments.
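- Parameters:
ab (list, optional) – list of antibody IDs used in the experiment.
idCL (list, optional) – list of cell ontology IDs for cell populations present in the experiment.
idBTO (list, optional) – list of BRENDA tissue ontology IDs for tissues used in the experiment.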
- Returns:
Returns a dataframe containing information about each experiment. This includes its database ID, name, type, PMID, DOI, tissue ID, and tissue name.
- Return type:
pd.DataFrame
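The requirement that at least one filter be supplied can be checked before the query is sent. The sketch below shows one way to do this, assuming `db` is an already constructed ImmunoPhenoDB_Connect instance. The wrapper name `experiments_matching` and its error message are hypothetical, and how find_experiments itself reacts to an empty query is not specified here.

```python
def experiments_matching(db, ab=None, idCL=None, idBTO=None):
    """Hypothetical wrapper (not part of ImmunoPheno).

    `db` is assumed to be a connected ImmunoPhenoDB_Connect instance.
    """
    # Fail early when no filter is given (None or an empty list).
    if not (ab or idCL or idBTO):
        raise ValueError("Provide at least one of ab, idCL, or idBTO")
    return db.find_experiments(ab=ab, idCL=idCL, idBTO=idBTO)

# Usage, assuming a reachable database server:
# experiments = experiments_matching(db, idCL=["CL:0000236"])
```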
- which_antibodies(search_query: str) DataFrame
Queries the database to find antibodies based on a search phrase
- Parameters:
search_query (str) – a plain text input that contains words that will be used to look up antibodies in the database.
- Returns:
Returns a dataframe containing information about each antibody. This includes: antibody ID, name, target, clonality, citation, clone ID, host organism, vendor, catalog number, and all experiment IDs in which the antibody is used.
- Return type:
pd.DataFrame
- which_celltypes(search_query: str) DataFrame
Queries the database to find cell types based on a search phrase
- Parameters:
search_query (str) – a plain text input that contains words that will be used to look up cell types in the database.
- Returns:
Returns a dataframe containing information about each cell type. This includes: cell ontology ID, cell type name, and all experiment IDs in which the cell type is found.
- Return type:
pd.DataFrame
- which_experiments(search_query: str) DataFrame
Queries the database to find all experiments based on a search phrase
- Parameters:
search_query (str) – a plain text input that contains words that will be used to look up experiments in the database.
- Returns:
Returns a dataframe containing information about each experiment. This includes its database ID, name, type, PMID, DOI, tissue ID, and tissue name.
- Return type:
pd.DataFrame
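The three which_* methods share the same single-argument signature, so one search phrase can be run against antibodies, cell types, and experiments in one step. The following is a minimal sketch, assuming `db` is an already constructed ImmunoPhenoDB_Connect instance. The helper name `search_all` and the search phrase in the usage comment are hypothetical.

```python
def search_all(db, phrase):
    """Hypothetical helper (not part of ImmunoPheno).

    `db` is assumed to be a connected ImmunoPhenoDB_Connect instance.
    Returns one result table per kind of database entry.
    """
    return {
        "antibodies": db.which_antibodies(phrase),
        "celltypes": db.which_celltypes(phrase),
        "experiments": db.which_experiments(phrase),
    }

# Usage, assuming a reachable database server:
# hits = search_all(db, "B cell")
# hits["antibodies"].head()
```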
- setup_quadtree(spatial: DataFrame, max_points: int = 100000, x_coord_col: str = 'x_coord', y_coord_col: str = 'y_coord')
Performs quadtree image decomposition on spatial immunohistochemistry data.
Recursively divides the spatial data into four equal quadrants until the point count set by max_points is reached. Used in conjunction with run_stvea when annotating IHC data in chunks.
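- Parameters:
spatial (pd.DataFrame) – spatial data containing x and y coordinates
max_points (int) – number of points at which the recursive subdivision stops. Defaults to 100000.
x_coord_col (str) – name of column in spatial containing x coordinates. Defaults to 'x_coord'.
y_coord_col (str) – name of column in spatial containing y coordinates. Defaults to 'y_coord'.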
- Returns:
dataframe containing each box_id (quadrant) and its respective x and y coordinates
- Return type:
boundary_table (pd.DataFrame)
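To make the idea of quadtree decomposition concrete, the sketch below splits a list of points into four equal quadrants recursively and stops once a box holds no more than max_points points. It is a generic, pure-Python illustration and not ImmunoPheno's implementation. The function name `quadtree_boxes`, the `max_depth` safety cap, and the tuple layout of the result are assumptions made for this example and do not reflect the real boundary_table format.

```python
def quadtree_boxes(points, max_points, max_depth=32):
    """Illustrative quadtree split; not ImmunoPheno's implementation.

    points: list of (x, y) tuples.
    Returns a list of (x_min, y_min, x_max, y_max, members) leaf boxes.
    """
    boxes = []

    def split(x0, y0, x1, y1, members, depth):
        # A box becomes a leaf once it is small enough or the depth cap is hit.
        if len(members) <= max_points or depth >= max_depth:
            boxes.append((x0, y0, x1, y1, members))
            return
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        quadrants = {}
        for x, y in members:
            quadrants.setdefault((x >= xm, y >= ym), []).append((x, y))
        # Recurse only into quadrants that actually contain points.
        for (right, upper), sub in quadrants.items():
            split(xm if right else x0, ym if upper else y0,
                  x1 if right else xm, y1 if upper else ym,
                  sub, depth + 1)

    if points:
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        split(min(xs), min(ys), max(xs), max(ys), list(points), 0)
    return boxes

# Usage:
# boxes = quadtree_boxes([(0, 0), (1, 1), (9, 9), (10, 10)], max_points=2)
```

The depth cap only guards against endless splitting when many points share the same coordinates.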
- merge_quadrants(boundary_table: DataFrame, min_points: int) DataFrame
Iteratively finds partitions with fewer than min_points and merges them with an adjacent neighbor until all partitions meet the minimum size.
- Parameters:
boundary_table (pd.DataFrame) – The table generated by setup_quadtree.
min_points (int) – The minimum number of cells a partition must have.
- Returns:
A new, updated boundary_table with small partitions merged.
- Return type:
pd.DataFrame
- plot_quadtree(spatial: DataFrame, boundary_table: DataFrame, x_coord_col: str = 'x_coord', y_coord_col: str = 'y_coord')
Plot quadrants on spatial data as defined by the quadtree
- Parameters:
spatial (pd.DataFrame) – spatial data containing x and y coordinates
boundary_table (pd.DataFrame) – dataframe containing quadrant boundaries from setup_quadtree
x_coord_col (str) – name of column in spatial containing x coordinates
y_coord_col (str) – name of column in spatial containing y coordinates
- Returns:
Plotly figure containing spatial points outlined by quadrants
- Return type:
fig
- get_quadrant_points(spatial: DataFrame, boundary_table: DataFrame, box_id: int)
Retrieve all points and their coordinates within a quadrant
- Parameters:
spatial (pd.DataFrame) – spatial data containing x and y coordinates
boundary_table (pd.DataFrame) – dataframe containing quadrant boundaries from setup_quadtree
box_id (int) – number of the box or quadrant of interest
- Returns:
dataframe containing points from only that quadrant
- Return type:
spat_box (pd.DataFrame)
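The quadtree methods are meant to be chained: build the partition, optionally merge small partitions, then pull out the points of each quadrant. The following is a minimal sketch of that workflow, assuming `db` is an already constructed ImmunoPhenoDB_Connect instance and `spatial` is a dataframe with the default coordinate column names. The helper name `points_per_quadrant` is hypothetical, and reading quadrant identifiers from a column named "box_id" is an assumption based on the description of the boundary table.

```python
def points_per_quadrant(db, spatial, max_points=100000, min_points=None):
    """Hypothetical helper (not part of ImmunoPheno).

    `db` is assumed to be a connected ImmunoPhenoDB_Connect instance.
    Returns {box_id: points that fall inside that quadrant}.
    """
    boundary_table = db.setup_quadtree(spatial, max_points=max_points)
    if min_points is not None:
        # Merge partitions that ended up with too few cells.
        boundary_table = db.merge_quadrants(boundary_table, min_points)
    # Assumption: quadrant identifiers live in a column named "box_id".
    # dict.fromkeys removes repeated IDs while keeping their order.
    box_ids = list(dict.fromkeys(boundary_table["box_id"]))
    return {
        box_id: db.get_quadrant_points(spatial, boundary_table, box_id)
        for box_id in box_ids
    }

# Usage, assuming `spatial` has 'x_coord' and 'y_coord' columns:
# chunks = points_per_quadrant(db, spatial, max_points=50000, min_points=1000)
```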
- run_stvea(IPD: ImmunoPhenoData, IPD_reference: ImmunoPhenoData | None = None, idBTO: list | None = None, idExperiment: list | None = None, parse_option: int = 1, rho: float = 0.5, population_size: int = 50, k_find_nn: int = 40, k_find_anchor: int = 20, k_filter_anchor: int = 40, k_score_anchor: int = 30, k_find_weights: int = 40, k_transfer_matrix: int = 40, c_transfer_matrix: float = 0.5, mask_threshold: float = 0.75, mask: bool = True, seed: int = 42, map_seed: int = 42, num_chunks: int = 1, num_cores: int = 1)
Automatically transfers single cell annotations to cytometry data
Uses reference data stored in the ImmunoPhenoDB database to annotate cells in a cytometry dataset. This function uses an algorithm called STvEA, which uses a kNN approach in a consolidated protein expression space to map annotations from the reference to query data. This function will find the appropriate reference dataset using the spreadsheet provided in the ImmunoPhenoData object. This spreadsheet must contain all antibodies and antibody RRIDs that were used in that experiment. For any antibodies in the spreadsheet that are not found in the database, a matching algorithm is used to find the next best antibody instead. This level of matching can be adjusted in the “parse_option” parameter. This function can be parallelized by specifying the number of chunks to split the dataset and the number of cores.
- Parameters:
IPD (ImmunoPhenoData) – ImmunoPhenoData object that must already contain the normalized protein counts and a spreadsheet with all antibody IDs for each antibody used in the experiment.
IPD_reference (ImmunoPhenoData, optional) – ImmunoPhenoData object containing normalized protein counts used as the reference dataset instead of retrieving a normalized protein table from the database. This must also contain the corresponding cell annotations for each cell in the normalized protein counts.
idBTO (list, optional) – list of tissue IDs used to restrict the reference dataset. This is optional, but specifying a tissue will greatly improve the accuracy of the annotations.
idExperiment (list, optional) – list of experiment IDs to restrict the reference dataset. This is optional, but specifying certain experiments can greatly improve the accuracy of the annotations.
parse_option (int) –
level of strictness when searching antibodies in the database. Levels are as follows:
1: parse by clone ID and alias (default)
2: parse by alias and antibody target (most relaxed)
3: parse by antibody ID (strictest)
rho (float) – weight parameter to adjust the number of cells or antibodies in the reference dataset. A small value of rho will provide more cells and fewer antibodies. A large value of rho will provide more antibodies and fewer cells. Defaults to 0.5.
population_size (int) – the minimum number of cells needed to define a cell type population. This is used to downsample a large reference dataset. Defaults to 50 cells as the minimum number to define a population.
k_find_nn (int) – the number of nearest neighbors. Defaults to 40.
k_find_anchor (int) – the number of neighbors to find anchors. Defaults to 20.
k_filter_anchor (int) – the number of nearest neighbors to find in the original data space. Defaults to 40.
k_score_anchor (int) – The number of neighbors to find anchors. Fewer k_anchor should mean higher quality of anchors. Defaults to 30.
k_find_weights (int) – the number of nearest anchors to use in correction. Defaults to 40.
k_transfer_matrix (int) – the number of nearest anchors to use in correction. Defaults to 40.
c_transfer_matrix (float) – a constant that controls the width of the Gaussian kernel. Defaults to 0.5.
mask_threshold (float) – specifies threshold to discard query cells. Defaults to 0.75.
mask (bool) – a boolean value to specify whether to discard query cells that don’t have nearby reference cells. Defaults to True.
seed (int, optional) – seed value when randomly downsampling the reference table on the server
map_seed (int, optional) – seed value when finding anchors during the mapping from reference to query
num_chunks (int) – number of chunks to split the protein dataset for parallelization. Defaults to 1.
num_cores (int) – number of cores used to run in parallel. Defaults to 1.
- Returns:
Returns ImmunoPhenoData object containing transferred annotations accessible in the “labels” property of the new object.
- Return type:
ImmunoPhenoData
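A typical annotation call passes the query object, optional tissue and experiment filters, and the parallelization settings, then reads the two attributes that run_stvea populates on the connection. The sketch below illustrates this, assuming `db` is an already constructed ImmunoPhenoDB_Connect instance and `ipd` is an ImmunoPhenoData object that already holds normalized protein counts and the antibody spreadsheet. The helper name `annotate` and the keys of the returned dictionary are hypothetical.

```python
def annotate(db, ipd, tissues=None, experiments=None, num_chunks=1, num_cores=1):
    """Hypothetical helper (not part of ImmunoPheno).

    `db` is assumed to be a connected ImmunoPhenoDB_Connect instance and
    `ipd` a prepared ImmunoPhenoData object.
    """
    annotated = db.run_stvea(
        IPD=ipd,
        idBTO=tissues,
        idExperiment=experiments,
        parse_option=1,          # match by clone ID and alias (the default)
        num_chunks=num_chunks,
        num_cores=num_cores,
    )
    return {
        "annotated": annotated,                      # labels in annotated.labels
        "imputed_reference": db.imputed_reference,   # cells x antibodies
        "transfer_matrix": db.transfer_matrix,       # query cells x reference cells
    }

# Usage, assuming a reachable database server:
# result = annotate(db, ipd, num_chunks=4, num_cores=4)
# result["annotated"].labels
```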
- run_dt(IPD, idBTO: list | None = None, idExperiment: list | None = None, parse_option: int = 1, **kwargs)
Predicts cell annotations using a decision tree model (rpart)
Uses reference data stored in the ImmunoPhenoDB database to annotate cells in a cytometry dataset. This function uses a decision tree model called Recursive Partitioning and Regression Trees (rpart), which can handle missing data with the use of surrogate splits. Using the spreadsheet provided in the ImmunoPhenoData object, the function will find the appropriate reference dataset from the database. This spreadsheet must contain all antibodies and antibody RRIDs that were used in that experiment. For any antibodies in the spreadsheet that are not found in the database, a matching algorithm is used to find the next best antibody instead. This level of matching can be adjusted in the “parse_option” parameter.
- Parameters:
IPD (ImmunoPhenoData) – ImmunoPhenoData object that must already contain the normalized protein counts and a spreadsheet with all antibody IDs for each antibody used in the experiment.
idBTO (list, optional) – list of tissue IDs used to restrict the reference dataset. This is optional, but specifying a tissue will greatly improve the accuracy of the annotations.
idExperiment (list, optional) – list of experiment IDs to restrict the reference dataset. This is optional, but specifying certain experiments can greatly improve the accuracy of the annotations.
parse_option (int) –
level of strictness when searching antibodies in the database. Levels are as follows:
1: parse by clone ID and alias (default)
2: parse by alias and antibody target (most relaxed)
3: parse by antibody ID (strictest)
kwargs – initial arguments for rpart’s decision tree model. These parameters are used to modify “rpart.control”, which control aspects of the rpart fit.
- Returns:
Returns ImmunoPhenoData object containing transferred annotations accessible in the “labels” property of the new object.
- Return type:
ImmunoPhenoData
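Extra keyword arguments are forwarded to the rpart fit, so tree settings can be supplied alongside the usual filters. The following is a minimal sketch, assuming `db` is an already constructed ImmunoPhenoDB_Connect instance and `ipd` is a prepared ImmunoPhenoData object. The helper name `annotate_with_tree` is hypothetical. `maxdepth` and `cp` are arguments of R's rpart.control, but whether run_dt accepts them under exactly these names is an assumption.

```python
def annotate_with_tree(db, ipd, tissues=None, experiments=None, **rpart_control):
    """Hypothetical helper (not part of ImmunoPheno).

    `db` is assumed to be a connected ImmunoPhenoDB_Connect instance.
    Keyword arguments are passed through unchanged to run_dt.
    """
    return db.run_dt(
        ipd,
        idBTO=tissues,
        idExperiment=experiments,
        parse_option=1,   # match by clone ID and alias (the default)
        **rpart_control,
    )

# Usage, assuming run_dt accepts rpart.control names as keywords:
# annotated = annotate_with_tree(db, ipd, maxdepth=10, cp=0.01)
# annotated.labels
```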
- filter_labels(IPD: ImmunoPhenoData, localization=False, merging=False, distance_ratio=False, entropy=False, p_threshold_localization=0.05, p_threshold_merging=0.05, epsilon_merging=4, distance_ratio_threshold=2, entropy_threshold=2, remove_labels=False)
Filters out poor-quality annotations using the protein expression space
- Parameters:
IPD (ImmunoPhenoData) – ImmunoPhenoData object that contains cell labels in the “labels” property of the object.
localization (bool) – option to filter annotations by the localization of annotations in the protein expression space.
merging (bool) – option to merge two annotations that cannot be separated in the protein expression space.
distance_ratio (bool) – option to filter annotations by a mapping distance ratio calculated from the nearest neighbors in the reference data.
entropy (bool) – option to filter annotations by entropy calculated from cell type probabilities for each cell.
p_threshold_localization (float) – p value threshold for Fisher's exact test during localization filtering.
p_threshold_merging (float) – p value threshold for Fisher's exact test when merging two annotations.
epsilon_merging (int) – epsilon threshold for deciding to merge two cell types based on the proportion of each cell type.
distance_ratio_threshold (int) – ratio threshold when filtering cells by nearest neighbor distance ratios (D1/D2).
entropy_threshold (int) – entropy threshold when filtering cells by entropies calculated from cell type probabilities.
remove_labels (bool) – option to remove rows/cells from the object that are labeled as “Not Assigned” after filtering.
- Returns:
Returns ImmunoPhenoData object that contains modified cell labels. These could be “Not Assigned” or two labels merged together. This object could also have rows/cells filtered out.
- Return type:
ImmunoPhenoData
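To give a feel for the entropy option, the sketch below computes an entropy from each cell's cell type probabilities and relabels cells above a threshold as “Not Assigned”. It is a generic illustration, not ImmunoPheno's implementation. It assumes Shannon entropy in bits and a rule that discards cells whose entropy exceeds the threshold, and the library may use a different logarithm base or rule. The function names are hypothetical.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits; zero probabilities contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def filter_by_entropy(labels, probabilities, threshold):
    """Illustrative filter; not ImmunoPheno's implementation.

    labels: one label per cell.
    probabilities: one list of cell type probabilities per cell.
    Cells whose entropy exceeds `threshold` become "Not Assigned".
    """
    filtered = []
    for label, probs in zip(labels, probabilities):
        confident = shannon_entropy(probs) <= threshold
        filtered.append(label if confident else "Not Assigned")
    return filtered

# Usage: a cell split evenly over four types has 2 bits of entropy,
# while a cell assigned to a single type with certainty has 0 bits.
# filter_by_entropy(["A", "B"], [[1.0, 0, 0, 0], [0.25] * 4], threshold=1.0)
```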
- db_stats()
Prints summary statistics about the experiments and data stored in the database
- Returns:
None. Prints statistics about the information stored in the database. This includes: number of experiments, tissues, cells, antibodies, antibody targets, antibody clones, and the average number of experiments in which each antibody is used.
- optimal_antibody_panel(target: list, background: list | None = None, tissue: list | None = None, experiment: list | None = None, panel_size: int = 10, max_itr: int = 1000, random_state: int = 0, plot_decision_tree: bool = False, plot_gates: bool = False, plot_gates_option: int = 1, rho: float = 0.35, shift: int = 8, seed: int = 42, merge_option: int = 1) DataFrame
Finds an optimized panel of antibodies to mark cell populations and tissues
Uses reference data stored in the ImmunoPhenoDB database to generate a panel of antibodies that mark the specified cell populations and tissue types. This function uses a k-feature decision tree to select the antibodies that best mark the desired populations and tissues. There is an option to save the generated decision tree. There is also an option to display suggested flow cytometry gating plots using the antibodies in the optimized panel.
- Parameters:
target (list) – list of cell populations and/or tissues to query as the target for antibodies in the panel. This must use cell ontology IDs and BRENDA tissue ontology IDs. Example: [“CL:0000818”, “CL:0000236”]
background (list, optional) – list of cell populations used as comparison. These will be used to differentiate the target from these background populations. Example: [“CL:0000084”]
tissue (list, optional) – list of BRENDA tissue ontologies to restrict the entire query.
experiment (list, optional) – list of experiment IDs from the database to restrict the entire query.
panel_size (int) – desired number of antibodies in the panel. Defaults to 10.
max_itr (int, optional) – number of iterations in the decision tree. Defaults to 1000.
random_state (int, optional) – random state for the decision tree classifier. Defaults to 0.
plot_decision_tree (bool, optional) – option to plot and save the decision tree. This will create one PNG in the current directory: “decision_tree.png”. Defaults to False.
plot_gates (bool, optional) – option to display suggested gating strategies. Defaults to False.
plot_gates_option (int, optional) – type of plot generated. Defaults to 1.
“1”: displays a static plot using seaborn.
“2”: displays an interactive plot using Plotly. Creates a file called “multiple_plots.html”.
“3”: displays an interactive plot using Dash. Must be viewed in a browser at 127.0.0.1:8050.
rho (float, optional) – weight parameter to adjust the number of cells or antibodies in the reference dataset. A small value of rho will provide more cells and fewer antibodies. A large value of rho will provide more antibodies and fewer cells. Defaults to 0.35.
shift (int, optional) – value subtracted from all protein expression counts in the antibody panel reference dataset. A larger value for the shift will lead to less discrete populations in the gating plots. Defaults to 8; a value of 0 applies no shift.
seed (int, optional) – seed value when randomly downsampling the reference table on the server
merge_option (int, optional) –
merge antibodies together based on a shared field:
“1”: merges all antibodies in the panel that share the same clone ID
“2”: merges all antibodies in the panel that share the same antibody target
Example: ‘AB_2734345 (CD7)’ and ‘AB_2800914 (CD7)’ become ‘AB_2734345 (CD7)_AB_2800914 (CD7)’
“3”: no merging performed, antibodies remain individual
- Returns:
Returns a tuple (pd.DataFrame, pd.DataFrame).
The first dataframe in the tuple will be the optimal antibody panel. This contains a list of antibodies ranked by their importance. This includes the specific antibody ID and the protein that the antibody targets.
The second dataframe in the tuple will be the purity and yield metrics for each cell population returned from the decision tree.
- Return type:
tuple
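Since the method returns a pair of dataframes, a call usually unpacks the panel and the metrics together. The following is a minimal sketch, assuming `db` is an already constructed ImmunoPhenoDB_Connect instance. The helper name `design_panel` is hypothetical, and the ontology IDs in the usage comment are the ones given as examples in the parameter descriptions.

```python
def design_panel(db, target, background=None, tissue=None, panel_size=10):
    """Hypothetical helper (not part of ImmunoPheno).

    `db` is assumed to be a connected ImmunoPhenoDB_Connect instance.
    Returns (panel, metrics): the ranked antibodies and the purity and
    yield metrics per cell population.
    """
    panel, metrics = db.optimal_antibody_panel(
        target=target,
        background=background,
        tissue=tissue,
        panel_size=panel_size,
        plot_decision_tree=False,  # do not write decision_tree.png
        plot_gates=False,          # skip the gating plots
    )
    return panel, metrics

# Usage, assuming a reachable database server:
# panel, metrics = design_panel(
#     db, ["CL:0000818", "CL:0000236"], background=["CL:0000084"], panel_size=8
# )
```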