Archive for July, 2013

One hundred years of ontology

July 26, 2013

At the start of July 2013 I did a PubMed search for “ontology” or “ontologies” and recorded the number of papers per year. I did the same thing again, this time searching for “Gene Ontology”. A bar chart of the numbers is below (a table of the numbers is at the end of this post).

 

 

Up until the 1990s, things just rumble along at a very low level, with only the occasional mention of “ontology” or “ontologies”. Things pick up in the 1990s, as the CS notion of ontology was introduced as a way of organising heterogeneous data, and then begin to explode in the 2000s with the advent of the Gene Ontology. The numbers for “Gene Ontology” start in 2000, pick up fast and fairly consistently track the total number of ontology papers; in recent years they form a high proportion of a large number of papers. There appears to be an anomaly or outlier in 2010 (61 “Gene Ontology” papers against 899 ontology papers): a weirdness, a mistake or a community holiday. I’ve done no further analysis of these numbers…
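As a rough illustration of the “high proportion” and of the 2010 oddity, here is a minimal sketch that divides the “Gene Ontology” counts by the “ontology OR ontologies” counts for a few years. The numbers are copied from the table at the end of this post; the variable names are just for illustration.

```python
# Papers per year, copied from the table at the end of the post.
ontology = {2009: 815, 2010: 899, 2011: 1036, 2012: 1257}
gene_ontology = {2009: 528, 2010: 61, 2011: 664, 2012: 835}

# Share of the ontology papers that also match "Gene Ontology".
share = {year: gene_ontology[year] / ontology[year] for year in ontology}

# The year with the smallest share is the one that looks out of line.
lowest = min(share, key=share.get)

for year in sorted(share):
    print(year, round(share[year], 2))
print("lowest share:", lowest)
```

This only restates the table as ratios; it says nothing about why 2010 looks the way it does.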

 

Purely out of interest, I had a look at the earliest paper to mention “ontology”:

 

Bryce P.H. ONTOLOGY IN RELATION TO PREVENTIVE MEDICINE. Am J Public Health (N Y). 1912 Jan;2(1):32-3. (PMID: 18008609)

http://www.ncbi.nlm.nih.gov/pubmed/?term=18008609

 

The article is very much of its time and contains some outrageous comments on the causes and spread of disease. However, the core of the paper is about using ontology as a tool for discussion; the opening sentence is

 

We have to thank the metaphysicians for, if not explaining many things, at least giving us useful terms under which discussions may be carried on. Ontology is defined ‘as that branch of metaphysics which investigates and explains the nature of all things or existences.’ While he who coined the word cannot be accused of excessive modesty, yet one may thank him for it since it does give a direction to thought,…

 

which sort of sums it up – especially the bit about excessive modesty.

 

“For the student of preventive medicine there must arise the question: In what ethical category must he place the agents of disease as mosquitos, the hosts of many diseases or the specific microbes and protozoa, their most direct causes? What is the meaning of pestis, cholera, tuberculosis or syphilis in the plan of life?”

 

These are questions the modern ontology community is tackling, and with the greatest of modesty. Hopefully we can do this without Bryce’s appeal to the merits of various forms of civilisation, religion and the notion of eugenics, which was acceptable at that time… Bryce says “… until man with the splendid intelligence with which he is endowed shall have learned the life conditions under which each of these evils attacking him exists, and how each in turn may either be subdued to his uses or removed from his pathway.” – he obviously isn’t prone to the lack of modesty he ascribes to metaphysicians.

 

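| Year | Ontology OR Ontologies (papers) | Gene Ontology (papers) |
|------|---------------------------------|------------------------|
| 2013 | 813 | 587 |
| 2012 | 1257 | 835 |
| 2011 | 1036 | 664 |
| 2010 | 899 | 61 |
| 2009 | 815 | 528 |
| 2008 | 731 | 464 |
| 2007 | 698 | 436 |
| 2006 | 535 | 344 |
| 2005 | 457 | 273 |
| 2004 | 301 | 162 |
| 2003 | 177 | 85 |
| 2002 | 91 | 34 |
| 2001 | 5 | 4 |
| 2000 | 43 | 2 |
| 1999 | 21 | |
| 1998 | 33 | |
| 1997 | 15 | |
| 1996 | 7 | |
| 1995 | 21 | |
| 1994 | 11 | |
| 1993 | 6 | |
| 1992 | 6 | |
| 1991 | 7 | |
| 1990 | 6 | |
| 1989 | 1 | |
| 1988 | 3 | |
| 1987 | 3 | |
| 1985 | 1 | |
| 1984 | 1 | |
| 1983 | 1 | |
| 1982 | 3 | |
| 1981 | 2 | |
| 1980 | 1 | |
| 1979 | 1 | |
| 1977 | 1 | |
| 1974 | 2 | |
| 1972 | 2 | |
| 1971 | 2 | |
| 1968 | 2 | |
| 1967 | 1 | |
| 1965 | 1 | |
| 1961 | 1 | |
| 1951 | 1 | |
| 1912 | 1 | |
| Total | 8022 | 4479 |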

   

Finding irregularities in the syntax of an ontology’s axioms

July 8, 2013

I’ve recently used Eleni Mikroyannidi’s Regularity Inspector for Ontologies (RIO) plugin for Protégé to tidy up some irregularities in the axioms of my Periodic Table Ontology (NPTO). I was happy with the ontology and the inferences it draws, but I knew that I’d not been entirely consistent in the annotation properties I’d used and the syntactic form of the ontology’s axiomatisation. For example, I had expressions such as

 

SubClassOf:

    hasPart some x

    and hasPart some y

 

As well as

 

SubClassOf:

    hasPart some x,

    hasPart some y

 

which have exactly the same logical effects, but are not so easy to handle programmatically – also, it’s just not neat.
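To make the “handle programmatically” point concrete, here is a minimal sketch of one way to normalise the first form into the second: split a class expression on its top-level “and”, leaving any “and” inside brackets alone. The function name `split_top_level_and` is made up for this illustration, and the whitespace tokenising is a simplification that only suits expressions like the ones in this post; it is not how Protégé or RIO do it.

```python
def split_top_level_and(expression):
    """Split a Manchester-syntax class expression on its top-level 'and'.

    Conjunctions nested inside brackets are left untouched.
    """
    # Pad brackets with spaces so they become tokens of their own.
    tokens = expression.replace("(", " ( ").replace(")", " ) ").split()
    parts, current, depth = [], [], 0
    for token in tokens:
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
        if token == "and" and depth == 0:
            # A top-level conjunction: close off the current conjunct.
            parts.append(" ".join(current))
            current = []
        else:
            current.append(token)
    parts.append(" ".join(current))
    return parts


flat = split_top_level_and("hasPart some x and hasPart some y")
nested = split_top_level_and(
    "hasValenceElectronShell only (SShell and (contains exactly 1 Electron))"
)
print(flat)
print(nested)
```

Each returned part can then be written out as its own SubClassOf axiom, which gives the comma-separated form.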

 

However, the ontology has 118 atoms, as well as the other classes that make up the ontology’s structure; going through all of these classes and neatening up the syntax is both tedious and error prone. One issue is that, as an author, I don’t necessarily know what things to look for to fix. So, a find and replace will not suffice (and would probably only work in the simplest of cases).

 

This is where RIO comes into play; it uses some off-the-shelf unsupervised clustering techniques to find regularities in axiom usage. Once it finds these clusters it forms generalisations over these clusters. All the details may be seen in

 

E. Mikroyannidi, L. Iannone, R. Stevens, and A. Rector. Inspecting regularities in ontology design using clustering. In International Semantic Web Conference (ISWC) 2011, pages 438-453. Springer, 2011.

 

And subsequent papers of Eleni’s, which can be found on my publications page. The ontology’s repository has three versions of the NPTO together with some output from RIO; together these files show the regularities and irregularities found in the NPTO, what RIO told me about them and how I’ve improved the form of the ontology. Below I pick out some highlights.

 

The core of the clustering is about measuring the similarity of axioms and groups of axioms. Here I’ve been using a popularity measure of each entity’s usage for variable substitution in the axiom patterns. Looking at the file “npto_syntactic_popularity.xml_output.txt”, I see a cluster of 93 atom classes. I know there are 118 atom classes, so the other 25 are described in some variant way. One cluster I see apart from this one is

 

?cluster_1 SubClassOf ?cluster_3 only (SShell and (?cluster_4 exactly 1 ?DomainEntity) and (?cluster_5 value “?constant”^^string))

     Instantiations: (6)

    PotassiumAtom SubClassOf hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 4))

    CesiumAtom SubClassOf hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 6))

    LithiumAtom SubClassOf hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 2))

    FranciumAtom SubClassOf hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 7))

    RubidiumAtom SubClassOf hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 5))

    SodiumAtom SubClassOf hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 3))

 

 

This shows the pattern I outlined above of using “and” rather than “,” to separate axioms. RIO shows me which classes have used this pattern and I can fix them. There are other little clusters showing variants of this form.

 

RIO found one instance of the form

 

?cluster_2 SubClassOf (?cluster_3 some ?Metalness) and (?cluster_3 only (SShell and (?cluster_4 exactly 1 ?DomainEntity) and (?cluster_5 value “?constant”^^string)))

     Instantiations: (1)

    HydrogenAtom SubClassOf (hasMetalness some NonMetal) and (hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 1)))

 

 

Which is a slightly different form of using “and” – difficult to find by eye, relatively easy to find by regular expression once you know it’s there, but RIO finds it for you without you having to know what to look for.
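To give a feel for what grouping axioms by their syntactic form means, here is a deliberately crude sketch: it replaces every entity name with “?x” and every number with “?n”, then groups axioms whose generalised forms are identical. This is not RIO’s algorithm (RIO clusters on a similarity measure and chooses which entities to generalise); the keyword list and the function names `shape` and `group_by_shape` are assumptions made for this example only.

```python
import re
from collections import defaultdict

# Manchester-syntax keywords that are kept as they are (a partial list).
KEYWORDS = {"SubClassOf", "some", "only", "and", "or", "not",
            "exactly", "min", "max", "value"}


def shape(axiom):
    """Generalise an axiom: entity names become ?x, numbers become ?n."""
    def generalise(match):
        token = match.group(0)
        if token in KEYWORDS:
            return token
        return "?n" if token.isdigit() else "?x"
    return re.sub(r"[A-Za-z_]\w*|\d+", generalise, axiom)


def group_by_shape(axioms):
    """Group axioms that share the same generalised form."""
    groups = defaultdict(list)
    for axiom in axioms:
        groups[shape(axiom)].append(axiom)
    return dict(groups)


axioms = [
    "LithiumAtom SubClassOf hasValenceElectronShell only "
    "(SShell and (contains exactly 1 Electron) and (hasOrder value 2))",
    "SodiumAtom SubClassOf hasValenceElectronShell only "
    "(SShell and (contains exactly 1 Electron) and (hasOrder value 3))",
    "HydrogenAtom SubClassOf (hasMetalness some NonMetal) and "
    "(hasValenceElectronShell only "
    "(SShell and (contains exactly 1 Electron) and (hasOrder value 1)))",
]
groups = group_by_shape(axioms)
for form, members in groups.items():
    print(len(members), form)
```

With these three axioms the lithium and sodium axioms fall into one group and the hydrogen axiom sits on its own, which is the kind of odd one out described above.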

 

Rather comfortingly, I see things like all the gas state, solid state, metals and non-metals clustered together. So RIO is spotting the things I’ve done the same way and the things I’ve done in different ways. If you look at the second popularity-based file (npto_popularity_output-v2) you can see that, in the third version of the NPTO in the repository, there are 118 atoms in cluster_1: all the atoms are in one cluster, indicating that they are beginning to look syntactically the same. There are, however, still a few irregularities I missed, but the output enabled me to spot them. For instance, RIO is revealing deeper nested versions of the same pattern I’ve been targeting. Overall, this scan of the NPTO by RIO shows things becoming nicely regular; it does expose some deviations from style, but most of these are deliberate (and often a hack, like my treatment of the actinides and lanthanides).

The various outputs of RIO in the repository have enabled me to see errors of style (or “bad axiom smells”) in the syntax I’ve used. RIO also enabled me to find errors in annotations: there were missing discovery years, and there’s still a bit of variety in how I’ve labelled atoms. Most importantly, RIO enabled me to spot things I didn’t know were there and that would have been very tedious to find by eye. RIO’s output presentation still needs some attention, but I already find it useful.

The rise and rise of the Gene Ontology

July 7, 2013

Geraint Duck, one of our Ph.D. students, has just published a paper on a named entity recogniser for databases and software used in bioinformatics and computational biology. This is part of a wider project looking at extracting computational biological methods from text. As part of the paper about the BioNERDS tool, we did a survey of databases and software reported in the full texts of Genome Biology and BMC Bioinformatics in PMC. More recently we’ve done a full survey of PMC, but the paper just reports on the two journals. The paper’s full reference is

 

Geraint Duck, Goran Nenadic, Andy Brass, David Robertson, and Robert Stevens. bionerds: exploring bioinformatics’ database and software use through literature mining. BMC Bioinformatics, 14(1):194, 2013. (DOI: 10.1186/1471-2105-14-194).

 

Here I want to report on the survey and, in particular, what it says about the reported usage of the Gene Ontology. We surveyed BMC Bioinformatics and Genome Biology; the former has a remit to report development of bioinformatics methods, tools and databases, while the latter has a remit to report more on the use of those resources to actually “do biology” – though, of course, there is overlap. The table below shows the top ten resources for each journal over the lifetime of each journal.

 

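| BMC Bioinformatics resource | Count | Genome Biology resource | Count |
|-----------------------------|-------|-------------------------|-------|
| R | 1922 | R | 574 |
| GO | 1102 | GO | 516 |
| BLAST | 870 | BLAST | 430 |
| analysis | 696 | GenBank | 414 |
| PDB | 631 | GEO | 287 |
| Network | 553 | Ensembl | 266 |
| Q | 494 | S4 | 229 |
| GenBank | 468 | tRNA | 195 |
| KEGG | 463 | analysis | 193 |
| GEO | 416 | RefSeq | 175 |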

 

 

These numbers are the numbers of documents in which the resource was mentioned. There are a few resources that are over-reported – “network” and “analysis” are both real bioinformatics resources, but with highly inconvenient names for text-mining. “analysis” is not an unusual word to find in reports of science, so calling a tool “analysis” is, we think, something of an infelicity. However, the text-miners dealt with this kind of thing for gene and protein names, so I’m sure we will also do so.

 

In both journals the Gene Ontology is up there in the top resources reported in the literature, alongside the usual suspects: R is now top dog, followed by the likes of BLAST, GO, Ensembl, KEGG, GEO and GenBank. I’m reasonably happy in concluding that the Gene Ontology is one of the central resources in these journals.

 

We also had a look at the GO’s usage over time. We calculated the relative use of the GO by dividing the number of documents mentioning GO in a given year by the total number of documents published that year in each journal.
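The relative-use calculation is simple enough to sketch. The counts below are made up purely to show the arithmetic (they are not the survey’s data), and `relative_use` is a name chosen for this example.

```python
def relative_use(mentions, totals):
    """Fraction of each year's documents that mention the resource."""
    return {
        year: mentions.get(year, 0) / totals[year]
        for year in sorted(totals)
        if totals[year] > 0  # skip years with no documents at all
    }


# Made-up counts for one journal, for illustration only.
go_mentions = {2004: 20, 2005: 45, 2006: 50}
all_documents = {2004: 200, 2005: 300, 2006: 400}

ratios = relative_use(go_mentions, all_documents)
for year, ratio in ratios.items():
    print(year, ratio)
```

Dividing by the yearly totals matters because a journal that publishes more papers each year would otherwise show rising raw counts even if the share of papers mentioning GO were flat or falling, as in the made-up 2006 figure here.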

 

We can see the mentions of GO in BMC Bioinformatics increasing fairly rapidly until 2005 and then increasing more slowly, and even tailing off a bit, thereafter (the paper has more details on these trends, including normalisation and statistical testing, but the trends appear to be OK). The picture in Genome Biology is a little less clear, but GO becomes an established resource. My suspicion is that numbers appear to tail off (as they do for other resources) as resources become part of the fabric and are no longer explicitly mentioned; also, there are more resources to use and cite, so competition is fierce (I’ve no evidence for these thoughts, but that’s my conjecture).

 

In these two journals GO is a “top” resource – we have an ontology that is a key resource for bioinformatics and computational biology. Something happens in 2005/2006 to GO’s usage (the paper has some plots of acceleration of usage too) – some kind of saturation, establishment as “a top resource”, or something else. A similar picture is seen in the whole of PMC – GO is in the top ten – I’ll report on that, and on how other ontologies fare, in another post. However, the take home message is that there is an ontology that is a central resource in bioinformatics and computational biology. That it is the GO is no surprise.

Putting an ontology of the atoms in order

July 4, 2013

 

I’ve come to the end of one thread of my playing with an ontology of the Periodic table of the Elements – thanks to an excellent third year project by Ionica Durchi. I’ve already written about an ontology of the atoms, where each atom is described according to its electronic configuration; then the atom families are also defined according to common electronic configuration. It is this electronic configuration that defines the physicochemical properties of the atom, the substances it forms and the families we observe in the Periodic table. An OWLViz view of this ontology can be seen below.

 


 

This ontology lacks the explicative power of the standard view of the Periodic Table. As we move left to right in the table, we have increasing atomic mass; we also observe periodicity in the physicochemical properties of the elements. As this periodicity or regularity happens, we group similar elements together. So, lithium, sodium, potassium, cesium and so on are all light, soft, highly reactive, combine with halogens in the ratio one to one, and so on. This gives us the standard view of the Periodic table below with the alkali metals described above on the far left-hand side (though note that the tabular form is a visual artefact of putting it on a two-dimensional medium – it’s really a spiral).

 

Source: http://www.bpc.edu/mathscience/chemistry/images/periodic_table_of_elements.jpg

 

My ontology has all the information (or proxies for it) for the standard view of the Periodic table, but it doesn’t show off this periodicity. So, the problem I set for Ionica was to implement an algorithm and some visualisation that would bridge this gap. The rules of engagement were couched as the two questions that could be asked of the ontology:

 

  1. What is the next atom?
  2. To what family does the atom belong?

     

 

These two questions encapsulate the two dimensions along which the Periodic Table is arranged – the increasing atomic mass of the atoms and the periodically occurring physicochemical families to which they belong. The aim was to be able to render the ontology as the periodic table looks, putting in gaps where appropriate. Just as Mendeleev left gaps where he thought elements should be present (though not yet discovered), the algorithm for rendering the ontology, using the two questions above, should put the atoms in order of increasing atomic mass, but also order them in a second dimension by physicochemical family.

 

 

 

Ionica’s algorithm for doing this is outlined in the decision tree below. It takes the next element in order of increasing atomic number and checks its family membership against the elements already displayed. If the result is ‘Yes’, it places the element beneath the one with the same superclass; if the result is ‘No’, it creates a new column for the newly ‘discovered’ element and shuffles all the elements above accordingly. After executing either branch, it returns to the top, extracts the next element and repeats the process until space has been allocated for all elements.
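Here is a minimal sketch of my reading of that decision tree; it is not Ionica’s code. It assumes the family of each atom has already been looked up (in the application this comes from the ontology), and the names `layout` and `render`, the tuple format and the family labels are all illustrative.

```python
def layout(elements):
    """Place (atomic_number, symbol, family) tuples into rows and columns."""
    columns = []        # family names, left to right
    rows = []           # one dict per period: family -> symbol
    last_family = None  # family of the most recently placed element

    for _, symbol, family in sorted(elements):
        if family in columns:
            # 'Yes' branch: the family already has a column. If that column
            # is at or to the left of the last one used, start a new period.
            if columns.index(family) <= columns.index(last_family):
                rows.append({})
        else:
            # 'No' branch: open a new column just right of the last one used.
            # Earlier rows have no entry for it, so they show a gap there.
            position = columns.index(last_family) + 1 if columns else 0
            columns.insert(position, family)
            if not rows:
                rows.append({})
        rows[-1][family] = symbol
        last_family = family
    return columns, rows


def render(columns, rows):
    """Turn the layout into a grid of symbols, with '' marking a gap."""
    return [[row.get(family, "") for family in columns] for row in rows]


FAMILIES = ["alkali metal", "alkaline earth metal", "boron group",
            "carbon group", "pnictogen", "chalcogen", "halogen", "noble gas"]
SYMBOLS = "Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca".split()

# Atomic numbers 3 to 20, then scandium as the first transition element.
elements = [(n + 3, s, FAMILIES[n % 8]) for n, s in enumerate(SYMBOLS)]
elements.append((21, "Sc", "scandium group"))

grid = render(*layout(elements))
for row in grid:
    print(" ".join(symbol.ljust(2) for symbol in row))
```

Storing each row as a mapping from family to symbol means that adding a column never requires touching earlier rows: the gap above scandium appears simply because lithium’s and sodium’s rows have nothing in that family.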

 


 

 

 

 

 

 

The pictures below show the programme working with various ranges of atomic number (as a proxy for atomic mass) and/or year of discovery. The algorithm can be seen working, adding gaps to the table as necessary. The application looks like this (below); it can filter by discovery year and draw specified ranges of atomic numbers.

 

 

 

For atomic numbers 3 to 20 we just get three rows of elements up until the element prior to the first transition element; the algorithm checks each atom in turn to see if it’s a member of a current group – on reaching sodium the answer is, for the first time, yes, and a new period is started.

 

On reaching scandium, we find it is not a member of the boron family or any other family, so a gap for a new family should be started.

 


 

 

This carries on adding “gaps” until all the transition elements are done. Then we get the rest of the Periodic Table as we’d expect to see.

 

 

If we start with element 2 (helium) we end up with the noble gases on the left hand side:

 


 

 

This looks strange, but is conceptually OK, as the “table” is continuous, so having the noble gases at the left or right doesn’t really matter. However, we do prefer to have the non-metals together on the right-hand side – so there’s a little tweak to the algorithm to deal with hydrogen and helium. A future piece of work may render the thing as a rotatable spiral…

 

 

The algorithm also leaves gaps appropriately for “undiscovered” elements:

 


Drawing the Periodic table from year 0 to 1891 (above) has only those elements that Mendeleev knew.

 


 

The ontology has all the information about the periodicity of the physicochemical properties of the elements (or that which accounts for it), but doesn’t make this explicit. It is only the layout that makes this periodicity with increasing atomic mass explicit. A simple observation it may be, but despite the ontology having the knowledge, it is how that knowledge is presented that often matters.

Realising the potential of OWL…

July 3, 2013

Katy Wolstencroft’s Ph.D. work with Andy Brass and me on classifying protein phosphatases with an ontology was one of the best uses of OWL and reasoning to “do” some biology with which I’ve been involved. The ISMB paper that came out of this work was

K Wolstencroft, P. Lord, L. Tabernero, A. Brass, and R. Stevens. Protein classification using ontology classification. Bioinformatics, 22(14):e530-538, 2006.

The core of this work was delightfully simple. Katy wanted to generate catalogues of protein phosphatases from an organism’s genome. It so happens that protein phosphatases can be defined in terms of their protein domain composition; that is, each category of protein phosphatase can be recognised just in terms of what protein domains it has as parts. So, if we then use some bioinformatics tools to recognise protein domains for an individual protein and we have each type of phosphatase defined in terms of what protein domains it contains, then we should be able to use the OWL automated reasoner to realise to which class of protein phosphatase that particular individual belongs. ‘Realising‘ individuals within the classes of an ontology is what OWL and reasoners do; we sort of forget about individuals, as we focus so much on making the class level structure. (Of course, here we’re not actually talking about individuals really, but it is a suitable use of the machinery.) The workflow is something like:

  1. Take one ontology of protein phosphatases with each class of phosphatase defined in terms of its protein domains;
  2. Take a genome’s proteins and shove them through a tool such as InterProScan to find the predicted protein domains;
  3. Turn those proteins and their domains into OWL individuals describing the protein with a series of facts about the domain structure (all done according to the schema provided by the phosphatase ontology);
  4. Put into the ontology, simmer with the reasoner and realise the individuals;
  5. Read off the classes of phosphatases as the genome’s catalogue of protein phosphatases.

So, a protein phosphatase in Katy’s protein phosphatase ontology is (in this blog I’ve smoothed out parts of the ontology for easier presentation):

Class: Protein_Phosphatase

    EquivalentTo:
        Enzyme
        and hasPart some
            (IPR000106-Low_molecular_weight_phosphotyrosine_protein_phosphatase
             or IPR000222-Protein_phosphatase_2C
             or IPR000387-Tyrosine_specific_protein_phosphatase_and_dual_specificity_protein_phosphatase
             or IPR000751-M-Phase_inducer_phosphatase
             or IPR006186-Serine_threonine-specific_protein_phosphatase_and_bis_5-nucleosyl_-tetraphosphatase
             or IPR006545-EYA)

The ontology uses the InterPro protein domains. The protein phosphatase class is defined as any protein that has at least one of the various protein phosphatase catalytic domains. So, a phosphatase can be recognised as such because it is a protein with a phosphatase catalytic domain. We could have neatened this up with a defined class, Protein phosphatase catalytic domain, that was equivalent to that disjunction. Neater, but not of any especial virtue apart from readability. Having one of these catalytic domains is sufficient to recognise a protein as being a member of the protein phosphatase class. All other protein phosphatases are necessarily kinds of this class.

Going down the hierarchy a bit, and a protein tyrosine phosphatase (PTP) is

Class: PTP

    EquivalentTo:
        Protein_Phosphatase
        and hasPart some IPR000387-Tyrosine_specific_protein_phosphatase_and_dual_specificity_protein_phosphatase

This uses the InterPro domain IPR000387 to give the means to recognise a protein phosphatase as one that is a tyrosine phosphatase. Then, at a leaf of the ontology, we have an RK_RM:

Class: RK_RM

    EquivalentTo:
        RA_RE_RN_RN2
        and (hasPart some IPR000998-MAM)
         and (hasPart some IPR003599-Immunoglobulin_subtype)
         and (hasPart some IPR007110-Immunoglobulin-like)
         and (hasPart some IPR008979)
         and (hasPart some IPR013151)

This uses a range of InterPro domains to narrow down a protein phosphatase to some type or other. Where the definitions so far have just stated at least one of a domain, we can also use qualified cardinality constraints to specify exactly how many of a particular domain are needed to recognise a protein as being a member of a particular class. An R2A phosphatase is recognised by:

Class: R2A

    EquivalentTo:
        R1_R6
        and hasPart exactly 2 IPR000242-Tyrosine_specific_protein_phosphatase

Here we need exactly two of this domain; not one or three, but two. For phosphatases we don’t need ordering of domains, but it would be easy enough to add. A hack could use a data property hasPosition with an integer giving the ordering of domains, so we could have

hasPart some (DomainA and hasPosition value 1),
hasPart some (DomainB and hasPosition value 2)
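To show the flavour of recognising a protein from its domain composition, here is a small stand-in written in plain Python rather than with an OWL reasoner. It treats each protein as a closed bag of InterPro domain identifiers, which sidesteps OWL’s open-world semantics, so it is only an analogy for what the reasoner does. The catalytic domain identifiers and the PTP test are taken from the definitions above; the class `TwoDomainPTP`, its placement under PTP and the function names are hypothetical, made up to illustrate an “exactly 2” constraint.

```python
from collections import Counter

# The catalytic domains from the Protein_Phosphatase definition.
CATALYTIC = {"IPR000106", "IPR000222", "IPR000387",
             "IPR000751", "IPR006186", "IPR006545"}

# Class name -> (parent class or None, test on a Counter of domains).
CLASSES = {
    "Protein_Phosphatase": (None, lambda d: any(d[x] >= 1 for x in CATALYTIC)),
    "PTP": ("Protein_Phosphatase", lambda d: d["IPR000387"] >= 1),
    # Hypothetical class showing an 'exactly 2' cardinality constraint.
    "TwoDomainPTP": ("PTP", lambda d: d["IPR000242"] == 2),
}


def member_of(name, domains):
    """True if the domains satisfy this class and all of its ancestors."""
    parent, test = CLASSES[name]
    return (parent is None or member_of(parent, domains)) and test(domains)


def realise(domain_list):
    """Return the most specific classes a protein belongs to."""
    domains = Counter(domain_list)  # missing domains count as zero
    matched = {name for name in CLASSES if member_of(name, domains)}
    parents = {CLASSES[name][0] for name in matched}
    return sorted(matched - parents)


print(realise(["IPR000387"]))
print(realise(["IPR000387", "IPR000242", "IPR000242"]))
print(realise(["IPR000222"]))
```

The last call is the interesting one: a protein with a catalytic domain but no more specific pattern is recognised as a Protein_Phosphatase and goes no further down the hierarchy.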

The Phosphatase Ontology at Work

The details of all this can be read in the paper, but the highlights are:

  1. Katy surveyed the protein phosphatases for human and Aspergillus fumigatus and produced a catalogue.
  2. For human, this catalogue matched the expert produced catalogue, but differed in one respect. There was an extra phosphatase – it was already known, but had been ‘mis-placed’ and left out of the catalogue.
  3. The mould was even better. For phosphatases of a known specific type, the individual is realised down at the leaves of the ontology. An unknown protein that is recognised as a protein phosphatase, but of unknown type, will be realised somewhere else in the hierarchy; a phosphatase, but of a more abstract type. This happened for the mould complement of protein phosphatases. A putative new phosphatase, with a novel combination of domains, was found. The ontology had enough information to recognise a protein as a phosphatase, but to only classify it part way down the ontology. A novel combination (or composition) of protein domains meant it didn’t classify down at a leaf. In this way a putatively novel protein phosphatase was recognised by ontology and automated reasoning.

By reasoning with what we know about a field of interest we can catalogue the stuff we know about – sort of obvious. Also, as we have various abstractions over the classes of the actual concrete things we find in cells etc., we can discover new entities from that field of interest that conform to some part of the known, but vary in some way. So, we can find new protein phosphatases – they have a phosphatase catalytic domain, but are variants on the described patterns.