Inspect & map identifiers
To make data queryable by an entity identifier, the identifiers must comply with a chosen standard.
Bionty enables this by mapping metadata against versioned ontologies using inspect().
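For terms that are not directly mappable, we offer synonym mapping with map_synonyms(), auto-completed lookup with lookup(), and fuzzy string matching with fuzzy_match().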
from bionty import Gene, CellMarker, CellType
import pandas as pd
Inspect and map synonyms of gene identifiers
To illustrate this, let us generate a DataFrame that stores a number of gene identifiers, some of which are corrupted.
data = {
    "gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
    "hgnc id": ["HGNC:24086", "HGNC:5", "HGNC:1101", "corrupted"],
    "ensembl_gene_id": [
        "ENSG00000148584",
        "ENSG00000121410",
        "ENSG00000188389",
        "ENSGcorrupted",
    ],
}
df_orig = pd.DataFrame(data).set_index("ensembl_gene_id")
df_orig
| ensembl_gene_id | gene symbol | hgnc id |
|---|---|---|
| ENSG00000148584 | A1CF | HGNC:24086 |
| ENSG00000121410 | A1BG | HGNC:5 |
| ENSG00000188389 | FANCD1 | HGNC:1101 |
| ENSGcorrupted | corrupted | corrupted |
First we can check whether any of our values are mappable against the ontology reference.
Tip: available fields are accessible via auto-completion: gene_bionty.
gene_bionty = Gene()
gene_bionty.inspect(df_orig.index, gene_bionty.ensembl_gene_id)
✅ 3 terms (75.0%) are mapped.
🔶 1 terms (25.0%) are not mapped.
{'mapped': ['ENSG00000148584', 'ENSG00000121410', 'ENSG00000188389'],
'not_mapped': ['ENSGcorrupted']}
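Conceptually, this kind of inspection is a membership check of each value against a reference column. The following is a minimal sketch of that idea, not Bionty's implementation; `inspect_terms` and the miniature reference list are hypothetical names introduced only for illustration.

```python
def inspect_terms(values, reference):
    """Split values into those found in the reference and those that are not."""
    reference_set = set(reference)  # a set makes each membership check cheap
    mapped = [v for v in values if v in reference_set]
    not_mapped = [v for v in values if v not in reference_set]
    return {"mapped": mapped, "not_mapped": not_mapped}


# Hypothetical miniature reference; a real ontology table is far larger.
reference_ids = ["ENSG00000148584", "ENSG00000121410", "ENSG00000188389"]
query_ids = reference_ids + ["ENSGcorrupted"]

result = inspect_terms(query_ids, reference_ids)
share_mapped = 100 * len(result["mapped"]) / len(query_ids)
print(f"{len(result['mapped'])} terms ({share_mapped:.1f}%) are mapped.")
print(result)
```

Keeping the input order in both lists makes it easy to line the result up with the original index.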
The same procedure is available for gene symbols. First, we inspect which symbols are mappable against the ontology.
gene_bionty.inspect(df_orig["gene symbol"], gene_bionty.symbol)
🔶 The identifiers contain synonyms!
💡 To increase mappability, convert them into standardized names/symbols using '.map_synonyms()'
✅ 2 terms (50.0%) are mapped.
🔶 2 terms (50.0%) are not mapped.
{'mapped': ['A1CF', 'A1BG'], 'not_mapped': ['FANCD1', 'corrupted']}
Two of the four gene symbols are directly mappable. Bionty also warns us that some of the remaining symbols are synonyms that can be converted into standardized symbols.
Mapping synonyms returns a list of standardized terms:
mapped_symbol_synonyms = gene_bionty.map_synonyms(
    df_orig["gene symbol"], gene_bionty.symbol
)
mapped_symbol_synonyms
['A1CF', 'A1BG', 'BRCA2', 'corrupted']
Optionally, return only a mapper of {synonym: standardized name}:
gene_bionty.map_synonyms(df_orig["gene symbol"], gene_bionty.symbol, return_mapper=True)
{'FANCD1': 'BRCA2'}
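To make the relationship between the mapper and the standardized list concrete, here is a small sketch of how such a mapper could be derived from a reference table. It assumes each standardized name carries a pipe-separated synonym string (modeled on the synonyms columns in the ontology tables in this guide); `build_synonym_mapper` and the three-entry reference are hypothetical, and only the FANCD1 to BRCA2 pair is taken from the output above.

```python
def build_synonym_mapper(values, reference):
    """Return {synonym: standardized name} for values that match a synonym.

    `reference` maps each standardized name to a pipe-separated synonym
    string, or None if the name has no synonyms (an assumed layout).
    """
    synonym_to_name = {}
    for name, synonyms in reference.items():
        for synonym in (synonyms or "").split("|"):
            if synonym:
                synonym_to_name[synonym] = name
    standardized = set(reference)
    # Values that are already standardized are left out of the mapper.
    return {
        v: synonym_to_name[v]
        for v in values
        if v not in standardized and v in synonym_to_name
    }


# Hypothetical miniature reference table.
reference = {"A1CF": None, "A1BG": None, "BRCA2": "FANCD1"}
symbols = ["A1CF", "A1BG", "FANCD1", "corrupted"]

mapper = build_synonym_mapper(symbols, reference)
# Applying the mapper keeps unknown values such as "corrupted" unchanged.
standardized_symbols = [mapper.get(s, s) for s in symbols]
print(mapper)
print(standardized_symbols)
```

Values without a known synonym pass through untouched, which is why a second inspection is still needed after mapping.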
We can use the standardized symbols as the new index:
df_curated = df_orig.reset_index()
df_curated.index = mapped_symbol_synonyms
df_curated
| | ensembl_gene_id | gene symbol | hgnc id |
|---|---|---|---|
| A1CF | ENSG00000148584 | A1CF | HGNC:24086 |
| A1BG | ENSG00000121410 | A1BG | HGNC:5 |
| BRCA2 | ENSG00000188389 | FANCD1 | HGNC:1101 |
| corrupted | ENSGcorrupted | corrupted | corrupted |
You may return a DataFrame with a boolean column indicating if the identifiers are mappable:
gene_bionty.inspect(df_curated.index, gene_bionty.symbol, return_df=True)
✅ 3 terms (75.0%) are mapped.
🔶 1 terms (25.0%) are not mapped.
| | __mapped__ |
|---|---|
| A1CF | True |
| A1BG | True |
| BRCA2 | True |
| corrupted | False |
Standardize and look up unmapped CellMarker identifiers
Depending on how the data was collected and which terminology was used, it is not always possible to curate values. Some values might have used a different standard or be corrupted.
This section demonstrates how to look up unmatched terms and curate them using CellMarker.
First, we take an example DataFrame whose index contains both valid and invalid cell markers (antibody targets) and an additional feature (Time) from a flow cytometry dataset.
markers = pd.DataFrame(
    index=[
        "KI67",
        "CCR7x",
        "CD14",
        "CD8",
        "CD45RA",
        "CD4",
        "CD3",
        "CD127",
        "PD1",
        "Invalid-1",
        "Invalid-2",
        "CD66b",
        "Siglec8",
        "Time",
    ]
)
Let’s instantiate the CellMarker ontology with the default database and version.
cell_marker_bionty = CellMarker()
cell_marker_bionty
CellMarker
Species: human
Source: cellmarker, 2.0
📖 CellMarker.df(): ontology reference table
🔎 CellMarker.lookup(): autocompletion of ontology terms
🔗 CellMarker.ontology: Pronto.Ontology object
First, we can have a look at the cell marker table that we just loaded.
df = cell_marker_bionty.df()
df.head()
| | id | name | ncbi_gene_id | gene_symbol | gene_name | uniprotkb_id | synonyms |
|---|---|---|---|---|---|---|---|
| 0 | CM_MERTK | MERTK | 10461 | MERTK | MER proto-oncogene, tyrosine kinase | Q12866 | None |
| 1 | CM_CD16 | CD16 | 2215 | FCGR3A | Fc fragment of IgG receptor IIIb | O75015 | None |
| 2 | CM_CD206 | CD206 | 4360 | MRC1 | mannose receptor C-type 1 | P22897 | None |
| 3 | CM_CRIg | CRIg | 11326 | VSIG4 | V-set and immunoglobulin domain containing 4 | Q9Y279 | None |
| 4 | CM_CD163 | CD163 | 9332 | CD163 | CD163 molecule | Q86VB7 | None |
Now let’s check which cell markers from the file can be found in the reference:
cell_marker_bionty.inspect(markers.index, cell_marker_bionty.name, return_df=True)
🔶 The identifiers contain synonyms!
💡 To increase mappability, convert them into standardized names/symbols using '.map_synonyms()'
✅ 7 terms (50.0%) are mapped.
🔶 7 terms (50.0%) are not mapped.
| | __mapped__ |
|---|---|
| KI67 | False |
| CCR7x | False |
| CD14 | True |
| CD8 | True |
| CD45RA | True |
| CD4 | True |
| CD3 | True |
| CD127 | True |
| PD1 | False |
| Invalid-1 | False |
| Invalid-2 | False |
| CD66b | True |
| Siglec8 | False |
| Time | False |
Logging suggests we map synonyms:
synonyms_mapper = cell_marker_bionty.map_synonyms(
    markers.index, cell_marker_bionty.name, return_mapper=True
)
The mapper covers 3 additional terms:
synonyms_mapper
{'KI67': 'Ki67', 'PD1': 'PD-1', 'Siglec8': 'SIGLEC8'}
Let’s replace the synonyms with standardized names in the markers DataFrame:
markers.rename(index=synonyms_mapper, inplace=True)
Inspecting again, the logging below shows that 4 terms are still not found in the reference.
Among them, Time, Invalid-1 and Invalid-2 are non-marker channels, which won’t be curated by CellMarker.
cell_marker_bionty.inspect(markers.index, cell_marker_bionty.name, return_df=True)
✅ 10 terms (71.4%) are mapped.
🔶 4 terms (28.6%) are not mapped.
| | __mapped__ |
|---|---|
| Ki67 | True |
| CCR7x | False |
| CD14 | True |
| CD8 | True |
| CD45RA | True |
| CD4 | True |
| CD3 | True |
| CD127 | True |
| PD-1 | True |
| Invalid-1 | False |
| Invalid-2 | False |
| CD66b | True |
| SIGLEC8 | True |
| Time | False |
CCR7x is still not found, so let’s check the lookup with auto-completion:
cell_marker_bionty_lookup = cell_marker_bionty.lookup()
cell_marker_bionty_lookup.CCR7
cell_marker(index=163, id='CM_CCR7', name='CCR7', ncbi_gene_id='1236', gene_symbol='CCR7', gene_name='C-C motif chemokine receptor 7', uniprotkb_id='P32248', synonyms=None)
Indeed, the marker should be CCR7; CCR7x was a typo.
Now let’s fix the markers so all of them can be linked:
Tip: using the lookup object instead of typing a string helps eliminate possible typos!
curated_df = markers.rename(index={"CCR7x": cell_marker_bionty_lookup.CCR7.name})
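The benefit of an auto-completing lookup can be sketched with a plain namespace object whose attributes are the standardized names. This is an illustrative stand-in, not how Bionty builds its lookup: `make_lookup` is a hypothetical helper, the marker subset is made up for the example, and the attributes here hold plain strings rather than full records with a `.name` field.

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Expose each name as an attribute so an editor can auto-complete it."""
    fields = {}
    for name in names:
        # Replace characters that cannot appear in a Python identifier.
        key = re.sub(r"\W", "_", name)
        # Identifiers cannot be empty or start with a digit.
        if not key or key[0].isdigit():
            key = "_" + key
        fields[key] = name
    return SimpleNamespace(**fields)


# Hypothetical subset of standardized marker names.
lookup = make_lookup(["CCR7", "PD-1", "Ki67", "CD45RA"])

markers = ["Ki67", "CCR7x", "CD14"]
# A mistyped attribute such as lookup.CCR7x raises AttributeError immediately,
# whereas a mistyped string would silently produce another unmapped term.
curated = [lookup.CCR7 if m == "CCR7x" else m for m in markers]
print(curated)
```

Names with characters such as hyphens become attributes with underscores (`lookup.PD_1`), while the stored value keeps the original spelling.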
Now we can run inspect again: all cell markers are linked, and only the three non-marker channels remain unmapped.
cell_marker_bionty.inspect(curated_df.index, cell_marker_bionty.name)
✅ 11 terms (78.6%) are mapped.
🔶 3 terms (21.4%) are not mapped.
{'mapped': ['Ki67',
'CCR7',
'CD14',
'CD8',
'CD45RA',
'CD4',
'CD3',
'CD127',
'PD-1',
'CD66b',
'SIGLEC8'],
'not_mapped': ['Invalid-1', 'Invalid-2', 'Time']}
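When some unmapped terms are expected, it can help to check them against an explicit list so that new, unexpected terms stand out. A minimal sketch follows; `NON_MARKER_CHANNELS` and `split_unmapped` are hypothetical names, and the channel list is an assumption based on the three terms discussed above.

```python
# Assumption: these channels are known to be non-markers for this dataset.
NON_MARKER_CHANNELS = {"Time", "Invalid-1", "Invalid-2"}


def split_unmapped(not_mapped, non_marker_channels):
    """Separate expected non-marker channels from terms that still need curation."""
    expected = [t for t in not_mapped if t in non_marker_channels]
    unresolved = [t for t in not_mapped if t not in non_marker_channels]
    return expected, unresolved


not_mapped = ["Invalid-1", "Invalid-2", "Time"]
expected, unresolved = split_unmapped(not_mapped, NON_MARKER_CHANNELS)
print("expected non-marker channels:", expected)
print("still unresolved:", unresolved)
```

An empty `unresolved` list indicates that every remaining unmapped term is accounted for.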
Map CellType names via fuzzy string matching
cell_type_bionty = CellType()
cell_type_bionty.fuzzy_match("T cells", cell_type_bionty.name)
| name | ontology_id | definition | synonyms | children | __ratio__ |
|---|---|---|---|---|---|
| T cell | CL:0000084 | A Type Of Lymphocyte Whose Defining Characteri... | T-lymphocyte\|T lymphocyte\|T-cell | [CL:0002419, CL:0000798, CL:0000789, CL:0002420] | 92.307692 |
By default, fuzzy_match also matches against synonyms:
cell_type_bionty.fuzzy_match("P cell", cell_type_bionty.name)
| name | ontology_id | definition | synonyms | children | __ratio__ |
|---|---|---|---|---|---|
| nodal myocyte | CL:0002072 | A Specialized Cardiac Myocyte In The Sinoatria... | P cell\|cardiac pacemaker cell\|myocytus nodalis | [CL:1000409, CL:1000410] | 100.0 |
You can turn off synonym matching with synonyms_field=None:
cell_type_bionty.fuzzy_match("P cell", cell_type_bionty.name, synonyms_field=None)
| name | ontology_id | definition | synonyms | children | __ratio__ |
|---|---|---|---|---|---|
| PP cell | CL:0000696 | A Cell That Stores And Secretes Pancreatic Pol... | type F enteroendocrine cell | [CL:0002680] | 92.307692 |
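The effect of including or excluding synonyms can be reproduced with a small sketch built on the standard library's `difflib.SequenceMatcher`. This is a stand-in scorer chosen for illustration; the source does not state which similarity measure Bionty uses internally, and `best_match` plus the three-entry reference are hypothetical.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity in percent; identical strings score 100.0."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def best_match(query, reference, use_synonyms=True):
    """Return (name, score) for the highest-scoring reference entry."""
    scored = []
    for name, synonyms in reference.items():
        candidates = [name]
        if use_synonyms and synonyms:
            # Synonyms are assumed to be stored as a pipe-separated string.
            candidates += synonyms.split("|")
        scored.append((name, max(ratio(query, c) for c in candidates)))
    return max(scored, key=lambda item: item[1])


# Hypothetical miniature reference: {name: pipe-separated synonyms}.
reference = {
    "nodal myocyte": "P cell|cardiac pacemaker cell|myocytus nodalis",
    "PP cell": "type F enteroendocrine cell",
    "T cell": "T-lymphocyte|T lymphocyte|T-cell",
}

print(best_match("P cell", reference))
print(best_match("P cell", reference, use_synonyms=False))
```

With synonyms enabled, an exact synonym hit outranks a near match on the name, which is the behavior shown in the two tables above.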
Return all results ranked by matching ratios:
cell_type_bionty.fuzzy_match(
    "P cell", cell_type_bionty.name, return_ranked_results=True
).head()
| name | ontology_id | definition | synonyms | children | __ratio__ |
|---|---|---|---|---|---|
| nodal myocyte | CL:0002072 | A Specialized Cardiac Myocyte In The Sinoatria... | P cell\|cardiac pacemaker cell\|myocytus nodalis | [CL:1000409, CL:1000410] | 100.000000 |
| double-positive, alpha-beta thymocyte | CL:0000809 | A Thymocyte Expressing The Alpha-Beta T Cell R... | DP thymocyte\|DP cell\|double-positive, alpha-be... | [CL:0002428, CL:0002430, CL:0002427, CL:000242... | 92.307692 |
| PP cell | CL:0000696 | A Cell That Stores And Secretes Pancreatic Pol... | type F enteroendocrine cell | [CL:0002680] | 92.307692 |
| pigmented ciliary epithelial cell | CL:0002303 | A Cell That Is Part Of Pigmented Ciliary Epith... | PE cell | [] | 92.307692 |
| GIP cell | CL:0002278 | An Enteroendocrine Cell Of Duodenum And Jejunu... | type K enteroendocrine cell | [] | 85.714286 |
Tied results will all be returned:
cell_type_bionty.fuzzy_match("A cell", cell_type_bionty.name, synonyms_field=None)
| name | ontology_id | definition | synonyms | children | __ratio__ |
|---|---|---|---|---|---|
| T cell | CL:0000084 | A Type Of Lymphocyte Whose Defining Characteri... | T-lymphocyte\|T lymphocyte\|T-cell | [CL:0002419, CL:0000798, CL:0000789, CL:0002420] | 83.333333 |
| B cell | CL:0000236 | A Lymphocyte Of B Lineage That Is Capable Of B... | B lymphocyte\|B-cell\|B-lymphocyte | [CL:0009114, CL:0001201] | 83.333333 |
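Ranking and tie handling can be illustrated in a few lines. The sketch below scores names only (no synonyms) with `difflib.SequenceMatcher` as an assumed stand-in for the actual scorer; `rank_matches`, `top_matches`, and the short name list are hypothetical.

```python
from difflib import SequenceMatcher


def rank_matches(query, names):
    """Score every name against the query and sort from best to worst."""
    scored = [
        (name, 100 * SequenceMatcher(None, query, name).ratio()) for name in names
    ]
    # sorted() is stable, so equally scored names keep their input order.
    return sorted(scored, key=lambda item: item[1], reverse=True)


def top_matches(query, names):
    """Return every entry that shares the highest score, not just the first."""
    ranked = rank_matches(query, names)
    best_score = ranked[0][1]
    return [item for item in ranked if item[1] == best_score]


# Hypothetical miniature list of cell type names.
names = ["T cell", "B cell", "PP cell", "nodal myocyte"]

for name, score in rank_matches("A cell", names):
    print(f"{name}: {score:.6f}")
print(top_matches("A cell", names))
```

Returning all tied entries leaves the final choice to the curator instead of picking one arbitrarily.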