Katy Wolstencroft’s Ph.D. work with Andy Brass and me on classifying protein phosphatases with an ontology was one of the best uses of OWL and reasoning to “do” some biology with which I’ve been involved. The ISMB paper that came out of this work was:
K. Wolstencroft, P. Lord, L. Tabernero, A. Brass, and R. Stevens. Protein classification using ontology classification. Bioinformatics, 22(14):e530-538, 2006.
The core of this work was delightfully simple. Katy wanted to generate catalogues of protein phosphatases from an organism’s genome. It so happens that protein phosphatases can be defined in terms of their protein domain composition; that is, each category of protein phosphatase can be recognised just in terms of what protein domains it has as parts. So, if we then use some bioinformatics tools to recognise protein domains for an individual protein and we have each type of phosphatase defined in terms of what protein domains it contains, then we should be able to use the OWL automated reasoner to realise to which class of protein phosphatase that particular individual belongs. ‘Realising’ individuals within the classes of an ontology is what OWL and reasoners do; we sort of forget about individuals, as we focus so much on making the class-level structure. (Of course, here we’re not actually talking about individuals really, but it is a suitable use of the machinery.) The workflow is something like:
- Take one ontology of protein phosphatases with each class of phosphatase defined in terms of its protein domains;
- Take a genome’s proteins and shove them through a tool such as InterProScan to find the predicted protein domains;
- Turn those proteins and their domains into OWL individuals describing the protein with a series of facts about the domain structure (all done according to the schema provided by the phosphatase ontology);
- Put into the ontology, simmer with the reasoner and realise the individuals;
- Read off the classes of phosphatases as the genome’s catalogue of protein phosphatases.
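The third step of that workflow, turning proteins and their predicted domains into OWL individuals, can be sketched in a few lines of Python. This is only an illustration and not the code used in the paper: the function name, the protein identifier and the domain hits are made up, and I am assuming that each protein is typed as an Enzyme and that each domain is recorded with an exact count, so that the cardinality-based definitions further down have something to match against.

```python
from collections import Counter


def individual_frame(protein_id, domain_hits):
    """Render one protein as a Manchester-syntax Individual frame.

    domain_hits holds one entry per predicted domain occurrence, so a
    domain that is found twice appears twice in the list.
    """
    counts = Counter(domain_hits)
    # Assumption: the protein is asserted to be an Enzyme, and every domain
    # is stated with an exact count rather than just "some".
    types = ["Enzyme"]
    for domain, n in sorted(counts.items()):
        types.append(f"(hasPart exactly {n} {domain})")
    return f"Individual: {protein_id}\n    Types: " + ",\n        ".join(types)


# Hypothetical domain hits, as might be parsed from InterProScan output.
hits = {
    "protein_001": [
        "IPR000242-Tyrosine_specific_protein_phosphatase",
        "IPR000242-Tyrosine_specific_protein_phosphatase",
        "IPR000998-MAM",
    ],
}

for protein_id, domain_hits in hits.items():
    print(individual_frame(protein_id, domain_hits))
```

The frames that come out would then be loaded alongside the phosphatase ontology before the reasoner is run.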
So, a protein phosphatase in Katy’s protein phosphatase ontology is (in this blog I’ve smoothed out parts of the ontology for easier presentation):
Class: Protein_Phosphatase EquivalentTo: Enzyme and hasPart some (IPR000106-Low_molecular_weight_phosphotyrosine_protein_phosphatase or IPR000222-Protein_phosphatase_2C or IPR000387-Tyrosine_specific_protein_phosphatase_and_dual_specificity_protein_phosphatase or IPR000751-M-Phase_inducer_phosphatase or IPR006186-Serine_threonine-specific_protein_phosphatase_and_bis_5-nucleosyl_-tetraphosphatase or IPR006545-EYA)
The ontology uses the InterPro protein domains. The protein phosphatase class is defined as any protein that has at least one of the various protein phosphatase catalytic domains. So, a phosphatase can be recognised as such because it is a protein with a phosphatase catalytic domain. We could have neatened this up with a Protein phosphatase catalytic domain defined class that was equivalent to that disjunction. Neater, but not of any especial virtue apart from readability. Having one of these catalytic domains is sufficient to recognise a protein as being a member of the protein phosphatase class. All other protein phosphatases are necessarily kinds of this class.
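To make the "at least one catalytic domain" condition concrete, here is a rough Python imitation of that definition. It is a plain set test and not a description logic reasoner, and the function and variable names are my own; the InterPro accessions are the ones in the Protein_Phosphatase definition above, with IPR000998 (MAM) standing in as a non-catalytic domain.

```python
# Accessions taken from the Protein_Phosphatase definition; the named set
# plays the role of a "protein phosphatase catalytic domain" class.
PHOSPHATASE_CATALYTIC_DOMAINS = {
    "IPR000106",
    "IPR000222",
    "IPR000387",
    "IPR000751",
    "IPR006186",
    "IPR006545",
}


def is_protein_phosphatase(is_enzyme, domains):
    """Mimic 'Enzyme and hasPart some (D1 or D2 or ...)' with a set test."""
    return is_enzyme and bool(PHOSPHATASE_CATALYTIC_DOMAINS & set(domains))


# One catalytic domain is enough, whatever else the protein carries.
print(is_protein_phosphatase(True, ["IPR000222", "IPR000998"]))  # True
# No catalytic domain, so not recognised as a phosphatase.
print(is_protein_phosphatase(True, ["IPR000998"]))  # False
```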
Going down the hierarchy a bit, a protein tyrosine phosphatase (PTP) is:
Class: PTP EquivalentTo: Protein_Phosphatase and hasPart some IPR000387-Tyrosine_specific_protein_phosphatase_and_dual_specificity_protein_phosphatase
This uses the InterPro domain IPR000387 to recognise a protein phosphatase as one that is a tyrosine phosphatase. Then, at a leaf of the ontology, we have:
Class: RK_RM EquivalentTo: RA_RE_RN_RN2 and (hasPart some IPR000998-MAM) and (hasPart some IPR003599-Immunoglobulin_subtype) and (hasPart some IPR007110-Immunoglobulin-like) and (hasPart some IPR008979) and (hasPart some IPR013151)
This uses a range of InterPro domains to narrow down a protein phosphatase to some type or other. Where all the definitions so far have just asked for at least one of a given domain, we can also use qualified cardinality constraints to specify exactly how many of a particular domain are needed to recognise a protein as being a member of a particular class. An R2A phosphatase is recognised by:
Class: R2A EquivalentTo: R1_R6 and hasPart exactly 2 IPR000242-Tyrosine_specific_protein_phosphatase
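The difference between "some" and "exactly 2" is easy to see with a small counting sketch in Python. Again this is only an imitation: the helper names are mine, membership of the parent class R1_R6 is simply passed in as a flag, and the code counts the listed domains as if the list were complete, whereas a reasoner working under the open world assumption has to be told that there are no further occurrences of the domain.

```python
from collections import Counter


def has_exactly(domains, domain, n):
    """True when `domain` occurs exactly n times in the list of domain hits."""
    return Counter(domains)[domain] == n


def is_r2a(is_r1_r6, domains):
    """Mimic 'R1_R6 and hasPart exactly 2 IPR000242' on a closed list of hits."""
    return is_r1_r6 and has_exactly(domains, "IPR000242", 2)


print(is_r2a(True, ["IPR000242", "IPR000242"]))  # True
print(is_r2a(True, ["IPR000242"]))  # False: only one copy
print(is_r2a(True, ["IPR000242"] * 3))  # False: three copies
```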
Here we need exactly two of this domain; not one or three, but two. For phosphatases we don’t need ordering of domains, but it would be easy enough to add. A hack could use a data property hasPosition with an integer giving the ordering of domains, so we could have
hasPart some (DomainA and hasPosition value 1), hasPart some (DomainB and hasPosition value 2)
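A small Python sketch shows what that hasPosition hack amounts to. It assumes each domain occurrence is recorded as a (domain, position) pair, which is my own choice of representation, and DomainA and DomainB are the same placeholder names as in the restrictions above.

```python
def matches_order(positioned_domains, pattern):
    """Check that each domain in `pattern` sits at its expected position.

    positioned_domains is a list of (domain, position) pairs and pattern
    lists domain names in the required order, starting at position 1.
    """
    found = set(positioned_domains)
    return all(
        (domain, position) in found
        for position, domain in enumerate(pattern, start=1)
    )


protein = [("DomainA", 1), ("DomainB", 2)]
print(matches_order(protein, ["DomainA", "DomainB"]))  # True
print(matches_order(protein, ["DomainB", "DomainA"]))  # False
```

Like the restrictions it imitates, the check only asks that the named domains sit at the named positions; it says nothing about any other domains the protein may carry.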
The details of all this can be read in the paper, but the highlights are:
- Katy surveyed the protein phosphatases for human and Aspergillus fumigatus and produced a catalogue.
- For human, this catalogue matched the expert-produced catalogue, but differed in one respect. There was an extra phosphatase; it was already known, but had been ‘mis-placed’ and left out of the catalogue.
- The mould was even better. For phosphatases of a known specific type, the individual is realised down at the leaves of the ontology. An unknown protein that is recognised as a protein phosphatase, but of unknown type, will be realised somewhere else in the hierarchy; a phosphatase, but of a more abstract type. This happened for the mould complement of protein phosphatases. A putative new phosphatase, with a novel combination of domains, was found. The ontology had enough information to recognise a protein as a phosphatase, but to only classify it part way down the ontology. A novel combination (or composition) of protein domains meant it didn’t classify down at a leaf. In this way a putatively novel protein phosphatase was recognised by ontology and automated reasoning.
By reasoning with what we know about a field of interest we can catalogue the stuff we know about; sort of obvious. Also, as we have various abstractions over the classes of the actual concrete things we find in cells etc., we can discover new entities from that field of interest that conform to some part of the known, but vary in some way. So, we can find new protein phosphatases: they have a phosphatase catalytic domain, but are variants on the described patterns.
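The way a novel protein ends up part way down the hierarchy can be imitated with a toy version of realisation in Python. This is a sketch under heavy assumptions and not a reasoner: the hierarchy has only three classes, Example_Leaf and its domain requirements are invented for illustration, IPR_NOVEL is a made-up placeholder for an unexpected domain, and the domain lists are treated as complete.

```python
CATALYTIC = {
    "IPR000106", "IPR000222", "IPR000387",
    "IPR000751", "IPR006186", "IPR006545",
}

# (class name, parent, test on the protein's domain set); parents come first.
# Example_Leaf is a hypothetical leaf class, not one from the real ontology.
HIERARCHY = [
    ("Protein_Phosphatase", None, lambda d: bool(CATALYTIC & d)),
    ("PTP", "Protein_Phosphatase", lambda d: "IPR000387" in d),
    ("Example_Leaf", "PTP", lambda d: {"IPR000998", "IPR003599"} <= d),
]


def realise(domains):
    """Return the most specific classes whose definitions the protein meets."""
    d = set(domains)
    member = set()
    for name, parent, test in HIERARCHY:
        # A class applies only if its parent applies and its own test holds.
        if (parent is None or parent in member) and test(d):
            member.add(name)
    # Drop any class that has a more specific matching child.
    parents_of_members = {p for n, p, _ in HIERARCHY if n in member and p}
    return sorted(member - parents_of_members)


# A known pattern classifies all the way down to the leaf.
print(realise(["IPR000387", "IPR000998", "IPR003599"]))  # ['Example_Leaf']
# A novel combination is still a PTP, but stops part way down.
print(realise(["IPR000387", "IPR_NOVEL"]))  # ['PTP']
# No catalytic domain: not a phosphatase at all.
print(realise(["IPR000998"]))  # []
```

The second call is the interesting one: the protein is recognised as a phosphatase, yet it does not fit any of the described leaf patterns, which is the signal that something new may have turned up.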