Inspect & map identifiers

To make data queryable by an entity identifier, one needs to ensure that identifiers comply with a chosen standard.

Bionty enables this by mapping metadata against versioned ontologies using inspect().

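For terms that are not directly mappable, we offer synonym mapping with map_synonyms(), term lookup with auto-completion via lookup(), and fuzzy string matching with fuzzy_match().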

from bionty import Gene, CellMarker, CellType
import pandas as pd

Inspect and map synonyms of gene identifiers

To illustrate this, let us generate a DataFrame that stores a number of gene identifiers, some of which are corrupted.

data = {
    "gene symbol": ["A1CF", "A1BG", "FANCD1", "corrupted"],
    "hgnc id": ["HGNC:24086", "HGNC:5", "HGNC:1101", "corrupted"],
    "ensembl_gene_id": [
        "ENSG00000148584",
        "ENSG00000121410",
        "ENSG00000188389",
        "ENSGcorrupted",
    ],
}
df_orig = pd.DataFrame(data).set_index("ensembl_gene_id")

df_orig

	gene symbol	hgnc id
ensembl_gene_id
ENSG00000148584	A1CF	HGNC:24086
ENSG00000121410	A1BG	HGNC:5
ENSG00000188389	FANCD1	HGNC:1101
ENSGcorrupted	corrupted	corrupted

First we can check whether any of our values are mappable against the ontology reference.

Tip: available fields are accessible via auto-completion: gene_bionty.

gene_bionty = Gene()

gene_bionty.inspect(df_orig.index, gene_bionty.ensembl_gene_id)

✅ 3 terms (75.0%) are mapped.

🔶 1 terms (25.0%) are not mapped.

{'mapped': ['ENSG00000148584', 'ENSG00000121410', 'ENSG00000188389'],
 'not_mapped': ['ENSGcorrupted']}
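
Conceptually, the result above is a membership check of each identifier against a reference column. The following is only a minimal sketch of that idea in plain Python, not Bionty's implementation; the function name inspect_sketch and the toy reference list are assumptions made for illustration.

```python
def inspect_sketch(identifiers, reference):
    """Split identifiers into mapped / not mapped, preserving input order."""
    reference = set(reference)
    mapped = [i for i in identifiers if i in reference]
    not_mapped = [i for i in identifiers if i not in reference]
    n = len(identifiers)
    percent_mapped = round(100 * len(mapped) / n, 1) if n else 0.0
    return {
        "mapped": mapped,
        "not_mapped": not_mapped,
        "percent_mapped": percent_mapped,
    }


# Toy reference (assumption): only the three valid Ensembl IDs from the example.
toy_reference = ["ENSG00000148584", "ENSG00000121410", "ENSG00000188389"]
ids = toy_reference + ["ENSGcorrupted"]

result = inspect_sketch(ids, toy_reference)
print(result["not_mapped"])
```

A real reference table holds far more identifiers, but the split into 'mapped' and 'not_mapped' follows the same pattern.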

The same procedure is available for gene symbols. First, we inspect which symbols are mappable against the ontology.

gene_bionty.inspect(df_orig["gene symbol"], gene_bionty.symbol)

🔶 The identifiers contain synonyms!

💡 To increase mappability, convert them into standardized names/symbols using '.map_synonyms()'

✅ 2 terms (50.0%) are mapped.

🔶 2 terms (50.0%) are not mapped.

{'mapped': ['A1CF', 'A1BG'], 'not_mapped': ['FANCD1', 'corrupted']}

Only 2 of the 4 gene symbols are mapped directly. Bionty further warns us that some of our symbols are synonyms that can be converted into standardized symbols.

Mapping synonyms returns a list of standardized terms:

mapped_symbol_synonyms = gene_bionty.map_synonyms(
    df_orig["gene symbol"], gene_bionty.symbol
)

mapped_symbol_synonyms

['A1CF', 'A1BG', 'BRCA2', 'corrupted']

Optionally, return only a mapper of {synonym: standardized name}:

gene_bionty.map_synonyms(df_orig["gene symbol"], gene_bionty.symbol, return_mapper=True)

{'FANCD1': 'BRCA2'}
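
To show how the mapper relates to the list of standardized terms, here is a small sketch in plain Python. It assumes a reference that stores synonyms as a pipe-delimited string per standardized name; the function build_synonym_mapper and the toy reference are hypothetical and only mimic the behavior shown above.

```python
def build_synonym_mapper(identifiers, reference):
    """Return {synonym: standardized name} for identifiers that are synonyms.

    `reference` maps each standardized name to a pipe-delimited synonym
    string or None (an assumed layout for this sketch).
    """
    synonym_to_name = {}
    for name, synonyms in reference.items():
        for synonym in (synonyms or "").split("|"):
            if synonym:
                synonym_to_name[synonym] = name
    # Identifiers that are already standardized names are left out of the mapper.
    return {
        i: synonym_to_name[i]
        for i in identifiers
        if i not in reference and i in synonym_to_name
    }


# Toy reference (assumption): holds only the alias used in the example above.
toy_reference = {"A1CF": None, "A1BG": None, "BRCA2": "FANCD1"}
symbols = ["A1CF", "A1BG", "FANCD1", "corrupted"]

mapper = build_synonym_mapper(symbols, toy_reference)
# Applying the mapper and keeping unknown values unchanged yields the full list.
standardized = [mapper.get(s, s) for s in symbols]
print(mapper)
print(standardized)
```

Values that are neither standardized names nor known synonyms, such as "corrupted", pass through unchanged, which matches the list returned earlier.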

We can use the standardized symbols as the new index:

df_curated = df_orig.reset_index()
df_curated.index = mapped_symbol_synonyms

df_curated

	ensembl_gene_id	gene symbol	hgnc id
A1CF	ENSG00000148584	A1CF	HGNC:24086
A1BG	ENSG00000121410	A1BG	HGNC:5
BRCA2	ENSG00000188389	FANCD1	HGNC:1101
corrupted	ENSGcorrupted	corrupted	corrupted

You may return a DataFrame with a boolean column indicating if the identifiers are mappable:

gene_bionty.inspect(df_curated.index, gene_bionty.symbol, return_df=True)

✅ 3 terms (75.0%) are mapped.

🔶 1 terms (25.0%) are not mapped.

	__mapped__
A1CF	True
A1BG	True
BRCA2	True
corrupted	False

Standardize and look up unmapped CellMarker identifiers

Depending on how the data was collected and which terminology was used, it is not always possible to curate values. Some values might have used a different standard or be corrupted.

This section will demonstrate how to look up unmatched terms and curate them using CellMarker.

First, we take an example DataFrame whose index contains valid and invalid cell markers (antibody targets) and an additional feature (Time) from a flow cytometry dataset.

markers = pd.DataFrame(
    index=[
        "KI67",
        "CCR7x",
        "CD14",
        "CD8",
        "CD45RA",
        "CD4",
        "CD3",
        "CD127",
        "PD1",
        "Invalid-1",
        "Invalid-2",
        "CD66b",
        "Siglec8",
        "Time",
    ]
)

Let’s instantiate the CellMarker ontology with the default database and version.

cell_marker_bionty = CellMarker()

cell_marker_bionty

CellMarker
Species: human
Source: cellmarker, 2.0

📖 CellMarker.df(): ontology reference table
🔎 CellMarker.lookup(): autocompletion of ontology terms
🔗 CellMarker.ontology: Pronto.Ontology object

First, we can have a look at the cell marker table that we just loaded.

df = cell_marker_bionty.df()

df.head()

	id	name	ncbi_gene_id	gene_symbol	gene_name	uniprotkb_id	synonyms
0	CM_MERTK	MERTK	10461	MERTK	MER proto-oncogene, tyrosine kinase	Q12866	None
1	CM_CD16	CD16	2215	FCGR3A	Fc fragment of IgG receptor IIIb	O75015	None
2	CM_CD206	CD206	4360	MRC1	mannose receptor C-type 1	P22897	None
3	CM_CRIg	CRIg	11326	VSIG4	V-set and immunoglobulin domain containing 4	Q9Y279	None
4	CM_CD163	CD163	9332	CD163	CD163 molecule	Q86VB7	None

Now let’s check which cell markers from the DataFrame can be found in the reference:

cell_marker_bionty.inspect(markers.index, cell_marker_bionty.name, return_df=True)

🔶 The identifiers contain synonyms!

💡 To increase mappability, convert them into standardized names/symbols using '.map_synonyms()'

✅ 7 terms (50.0%) are mapped.

🔶 7 terms (50.0%) are not mapped.

	__mapped__
KI67	False
CCR7x	False
CD14	True
CD8	True
CD45RA	True
CD4	True
CD3	True
CD127	True
PD1	False
Invalid-1	False
Invalid-2	False
CD66b	True
Siglec8	False
Time	False

Logging suggests we map synonyms:

synonyms_mapper = cell_marker_bionty.map_synonyms(
    markers.index, cell_marker_bionty.name, return_mapper=True
)

We have now mapped 3 additional terms:

synonyms_mapper

{'KI67': 'Ki67', 'PD1': 'PD-1', 'Siglec8': 'SIGLEC8'}

Let’s replace the synonyms with standardized names in the markers DataFrame:

markers.rename(index=synonyms_mapper, inplace=True)

Inspecting again, the logging shows that 4 terms are still not found in the reference.

Among them, Time, Invalid-1 and Invalid-2 are non-marker channels, which CellMarker won’t curate.

cell_marker_bionty.inspect(markers.index, cell_marker_bionty.name, return_df=True)

✅ 10 terms (71.4%) are mapped.

🔶 4 terms (28.6%) are not mapped.

	__mapped__
Ki67	True
CCR7x	False
CD14	True
CD8	True
CD45RA	True
CD4	True
CD3	True
CD127	True
PD-1	True
Invalid-1	False
Invalid-2	False
CD66b	True
SIGLEC8	True
Time	False

CCR7x is still not found, so let’s check the lookup with auto-completion:

cell_marker_bionty_lookup = cell_marker_bionty.lookup()

https://d33wubrfki0l68.cloudfront.net/eee08aab484a13dbaefc78633d1805ee61cd933c/8d864/_images/lookup_ccr7.png

cell_marker_bionty_lookup.CCR7

cell_marker(index=163, id='CM_CCR7', name='CCR7', ncbi_gene_id='1236', gene_symbol='CCR7', gene_name='C-C motif chemokine receptor 7', uniprotkb_id='P32248', synonyms=None)

Indeed, it should be CCR7; CCR7x was a typo.

Now let’s fix the markers so all of them can be linked:

Tip

Using .lookup() instead of passing a string helps eliminate possible typos!

curated_df = markers.rename(index={"CCR7x": cell_marker_bionty_lookup.CCR7.name})
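
One way to picture why attribute access guards against typos is an object whose attributes are the reference records: a misspelled attribute raises an error instead of silently passing through. The sketch below is an assumption about the general idea, not Bionty's implementation. Record, to_attribute and build_lookup are hypothetical names, and the rule for turning names into attributes is invented for illustration.

```python
import re
from collections import namedtuple
from types import SimpleNamespace

# Reduced record with three of the columns shown in the reference table.
Record = namedtuple("Record", ["id", "name", "gene_symbol"])


def to_attribute(name):
    """Turn a term name into a valid attribute name (assumed rule)."""
    return re.sub(r"\W", "_", name)


def build_lookup(records):
    """Expose each record as an attribute so editors can auto-complete it."""
    return SimpleNamespace(**{to_attribute(r.name): r for r in records})


records = [
    Record("CM_CCR7", "CCR7", "CCR7"),
    Record("CM_CD16", "CD16", "FCGR3A"),
]
lookup = build_lookup(records)

markers_list = ["CCR7x", "CD16"]
fixed = [lookup.CCR7.name if m == "CCR7x" else m for m in markers_list]
print(fixed)
```

Accessing a misspelled attribute such as lookup.CCR7x raises an AttributeError, whereas a misspelled string would go unnoticed until the next inspection.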

Now we can run inspect again, and all cell markers are linked!

cell_marker_bionty.inspect(curated_df.index, cell_marker_bionty.name)

✅ 11 terms (78.6%) are mapped.

🔶 3 terms (21.4%) are not mapped.

{'mapped': ['Ki67',
  'CCR7',
  'CD14',
  'CD8',
  'CD45RA',
  'CD4',
  'CD3',
  'CD127',
  'PD-1',
  'CD66b',
  'SIGLEC8'],
 'not_mapped': ['Invalid-1', 'Invalid-2', 'Time']}
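
To confirm that only the expected non-marker channels remain unmapped, the result dictionary can be compared against an allow-list. This is a small sketch in plain Python; the set non_marker_channels is a hypothetical allow-list that you would define for your own dataset.

```python
# Result dictionary as returned by the final inspect() call above.
result = {
    "mapped": [
        "Ki67", "CCR7", "CD14", "CD8", "CD45RA", "CD4",
        "CD3", "CD127", "PD-1", "CD66b", "SIGLEC8",
    ],
    "not_mapped": ["Invalid-1", "Invalid-2", "Time"],
}

# Hypothetical allow-list of channels that are not expected to be cell markers.
non_marker_channels = {"Time", "Invalid-1", "Invalid-2"}

# Anything unmapped that is not on the allow-list still needs curation.
unexpected = [t for t in result["not_mapped"] if t not in non_marker_channels]
if unexpected:
    raise ValueError(f"Unmapped cell markers remain: {unexpected}")
print("All cell markers are linked.")
```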

Map CellType names via fuzzy string matching

cell_type_bionty = CellType()

cell_type_bionty.fuzzy_match("T cells", cell_type_bionty.name)

	ontology_id	definition	synonyms	children	__ratio__
name
T cell	CL:0000084	A Type Of Lymphocyte Whose Defining Characteri...	T-lymphocyte|T lymphocyte|T-cell	[CL:0002419, CL:0000798, CL:0000789, CL:0002420]	92.307692
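
The __ratio__ column is a string similarity score in percent. As a rough illustration, Python's standard difflib computes a ratio of 2 * M / T, where M is the number of matching characters and T the total length of both strings. Whether Bionty uses this exact scorer is an assumption; the sketch only shows how such a score behaves for the query above.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity in percent: 2 * matching characters / total characters."""
    return 100 * SequenceMatcher(None, a, b).ratio()


# "T cells" (7 characters) and "T cell" (6 characters) share 6 characters,
# so the score is 100 * 2 * 6 / 13.
print(round(ratio("T cells", "T cell"), 6))
```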

By default, fuzzy_match also matches against synonyms:

cell_type_bionty.fuzzy_match("P cell", cell_type_bionty.name)

	ontology_id	definition	synonyms	children	__ratio__
name
nodal myocyte	CL:0002072	A Specialized Cardiac Myocyte In The Sinoatria...	P cell|cardiac pacemaker cell|myocytus nodalis	[CL:1000409, CL:1000410]	100.0

You can turn off synonym matching with synonyms_field=None:

cell_type_bionty.fuzzy_match("P cell", cell_type_bionty.name, synonyms_field=None)

	ontology_id	definition	synonyms	children	__ratio__
name
PP cell	CL:0000696	A Cell That Stores And Secretes Pancreatic Pol...	type F enteroendocrine cell	[CL:0002680]	92.307692
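
The difference between the two results can be sketched as taking, for each term, the best score over its name and (optionally) its synonyms. The code below is a simplified illustration using difflib as a stand-in scorer; best_match, use_synonyms and the two-entry toy reference are assumptions, not Bionty's API.

```python
from difflib import SequenceMatcher


def score(query, candidate):
    """Similarity in percent between two strings."""
    return 100 * SequenceMatcher(None, query, candidate).ratio()


def best_match(query, reference, use_synonyms=True):
    """Return (name, score) of the best-scoring term.

    `reference` maps each name to a pipe-delimited synonym string.
    """
    results = {}
    for name, synonyms in reference.items():
        candidates = [name]
        if use_synonyms and synonyms:
            candidates += synonyms.split("|")
        # A term scores as well as its best-matching name or synonym.
        results[name] = max(score(query, c) for c in candidates)
    best = max(results, key=results.get)
    return best, results[best]


# Toy reference (assumption): two terms taken from the tables above.
toy_reference = {
    "nodal myocyte": "P cell|cardiac pacemaker cell|myocytus nodalis",
    "PP cell": "type F enteroendocrine cell",
}

print(best_match("P cell", toy_reference))
print(best_match("P cell", toy_reference, use_synonyms=False))
```

With synonyms, "nodal myocyte" wins through its exact synonym "P cell"; without them, only names are compared and "PP cell" becomes the closest match.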

Return all results ranked by matching ratios:

cell_type_bionty.fuzzy_match(
    "P cell", cell_type_bionty.name, return_ranked_results=True
).head()

	ontology_id	definition	synonyms	children	__ratio__
name
nodal myocyte	CL:0002072	A Specialized Cardiac Myocyte In The Sinoatria...	P cell|cardiac pacemaker cell|myocytus nodalis	[CL:1000409, CL:1000410]	100.000000
double-positive, alpha-beta thymocyte	CL:0000809	A Thymocyte Expressing The Alpha-Beta T Cell R...	DP thymocyte|DP cell|double-positive, alpha-be...	[CL:0002428, CL:0002430, CL:0002427, CL:000242...	92.307692
PP cell	CL:0000696	A Cell That Stores And Secretes Pancreatic Pol...	type F enteroendocrine cell	[CL:0002680]	92.307692
pigmented ciliary epithelial cell	CL:0002303	A Cell That Is Part Of Pigmented Ciliary Epith...	PE cell	[]	92.307692
GIP cell	CL:0002278	An Enteroendocrine Cell Of Duodenum And Jejunu...	type K enteroendocrine cell	[]	85.714286

Tied results are all returned:

cell_type_bionty.fuzzy_match("A cell", cell_type_bionty.name, synonyms_field=None)

	ontology_id	definition	synonyms	children	__ratio__
name
T cell	CL:0000084	A Type Of Lymphocyte Whose Defining Characteri...	T-lymphocyte|T lymphocyte|T-cell	[CL:0002419, CL:0000798, CL:0000789, CL:0002420]	83.333333
B cell	CL:0000236	A Lymphocyte Of B Lineage That Is Capable Of B...	B lymphocyte|B-cell|B-lymphocyte	[CL:0009114, CL:0001201]	83.333333
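
Ranking and tie handling can be sketched as follows: score every name, sort by score, and keep every entry that shares the top score. This is an illustrative stand-in built on difflib, not Bionty's implementation; fuzzy_match_sketch and the toy name list are assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_match_sketch(query, names, return_ranked_results=False):
    """Score names against the query; return all results or only the top ties."""
    scores = {n: 100 * SequenceMatcher(None, query, n).ratio() for n in names}
    # sorted() is stable, so names with equal scores keep their input order.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if return_ranked_results:
        return ranked
    top_score = ranked[0][1]
    return [(name, s) for name, s in ranked if s == top_score]


# Toy list of names (assumption), taken from the tables above.
toy_names = ["T cell", "B cell", "PP cell", "GIP cell"]

for name, s in fuzzy_match_sketch("A cell", toy_names):
    print(name, round(s, 6))
```

Both "T cell" and "B cell" differ from the query by a single character, so they share the top score and are returned together.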