Archive for June, 2014

Learning about a domain from an ontology

June 20, 2014

One of the things I have done (and I think we collectively have done to a great extent) is to forget about or neglect ontology as “tutorial”. We used to talk about this way back in the TAMBIS days, and others did so as well. The idea is that by looking at an ontology I can learn about a field of interest. Our idea in TAMBIS was that one should be able to look at the TAMBIS ontology and learn about the basics of molecular biology and an operational aspect of bioinformatics (though this exact idea was never explored or evaluated). Ontologies are often described as the “background” knowledge of a discipline; they contain the entities in a domain, their definitions, descriptions and inter-relatedness. From this, a “reader” of an ontology should be able to get some kind of understanding of a domain.

With an ontology, there are two ways I can learn about a field of interest: First, I can look at an ontology for that field, explore it and from that derive an understanding of how the entities of that field “work”; Second, I can write an ontology about that field and, in doing so, do the learning. This latter one only works for small topics or learning at a fairly superficial level. I’ve done this for heraldry; cloud nomenclature; anatomy of flowers; plate armour; galenic medicine; and a few others. This isn’t scalable; we can’t all write ontologies for a field of interest, just to learn about it. I have, however, found it a useful way to help myself structure my understanding, even if the resulting ontologies rarely, if ever, amount to very much at all (these have also largely been for fun and not an endeavour to drive some research).

 

Is this tutorial aspect of ontology going to give a full understanding? For most ontologies of which I’m aware, looking at that ontology will not act like a college course in that subject area. Looking at an ontology is more like looking at an encyclopaedia; it is a list of things and descriptions of those things, which is all an ontology is really trying to do. A so-called reference ontology can fit into this encyclopaedic role well; an application ontology should do so, but just for that application area. However, I should be able to look at an ontology or a collection of ontologies and get a decent overview of a domain.

 

Having said this, however, we can make quite a good encyclopaedia from an ontology or set of ontologies, especially if there is an adequate number of semantic relationships between entities, as well as good editorial and other metadata around those entities. I say “ontologies” as just having an encyclopaedia or ontology of molecular function (as an example) tells me what molecular functions there are and how they’re organised, but it doesn’t give me, as a learner, much of a biological context. This isn’t the fault of the ontology; I just need to look at a broader picture of biology to really learn anything. If I could ask questions such as “what molecular functions exist in the mitochondria of mammals and in what processes do they participate”, then I have something to work with (I suspect). There then, of course, remains the question of how all this knowledge should be presented. I feel there’s mileage in a standard sort of encyclopaedic form, using the label (term), synonyms and natural language definitions, together with the structure of the ontology, to present something useful.
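As a rough illustration of that encyclopaedic form, here is a minimal Python sketch that turns one ontology entity into a readable entry. The `Entry` class, the `render` function and the cloud example are all assumptions made up for this illustration; they are not taken from any real ontology or tool.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    """One entity, holding only the parts an encyclopaedia entry would use."""
    label: str
    definition: str
    synonyms: list[str] = field(default_factory=list)
    parents: list[str] = field(default_factory=list)  # "is a" links
    relations: dict[str, list[str]] = field(default_factory=dict)  # other links


def render(entry: Entry) -> str:
    """Lay out label, synonyms, definition and structure as plain text."""
    lines = [entry.label.upper()]
    if entry.synonyms:
        lines.append("Also known as: " + ", ".join(entry.synonyms))
    lines.append(entry.definition)
    if entry.parents:
        lines.append("A kind of: " + ", ".join(entry.parents))
    for relation, targets in sorted(entry.relations.items()):
        lines.append(relation.capitalize() + ": " + ", ".join(targets))
    return "\n".join(lines)


# Toy entity, invented for the example.
entry = Entry(
    label="cumulonimbus",
    definition="A dense, towering cloud associated with thunderstorms.",
    synonyms=["thundercloud"],
    parents=["cloud"],
    relations={"produces": ["precipitation"]},
)
print(render(entry))
```

The point of the sketch is only that the label, synonyms, definition and relationships already present in an ontology are enough to fill a simple entry template; a real presentation would need to follow links across several ontologies to give the broader context.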

 

I’m still sort of taken with the idea of ontology as tutorial; I should be able to look at the ontologies from a field of interest and learn about that field of interest. It probably won’t be an in-depth learning; shallower even than that offered by the excellent resource Wikipedia, which can readily be used as an introduction to a subject area. However, I should be able to get a decent enough view of a field of interest from its ontologies that I can structure my learning from other resources.

The Software Ontology (SWO)

June 19, 2014

Our paper on the Software Ontology (SWO) has just been published in the Journal of Biomedical Semantics (JBMS) thematic issue on ontologies. The paper is:

 

James Malone, Andy Brown, Allyson Lister, Jon Ison, Duncan Hull, Helen Parkinson, and Robert Stevens. The software ontology (swo): a resource for reproducibility in biomedical data analysis, curation and digital preservation. Journal of Biomedical Semantics, 5(1):25, 2014.

 

There’s also a lot of information about how we went about making the SWO at the SWO blog.

 

We now have a range of bio-ontologies covering everything from sequences, gene products, their functions, the processes in which they participate, and cellular and gross anatomy, to diseases and phenotypes. These are primarily used to describe the entities in the masses of data biology now produces. More recently, there’s been work on describing the investigations by which these data were produced and analysed; the SWO fits into the ontology landscape at this location. The data is just a load of stuff; we detect things in these datasets with some software and the provenance trail of how these entities were detected needs to include the software that was used.

 

The SWO describes software, the software suites of which it is a part, its inputs and outputs, the tasks it supports, its versions, licensing, its interface, and its developers. It doesn’t capture the hardware upon which the software runs, the software’s dependencies, cost of ownership (not the price in lucre, but does it need a lot of sys admin kind of thing), software architecture… (see the paper and blog for more).

 

The scope of the SWO is thus wide and we could have included a whole lot more than we did; much of the stuff not included is important and useful, but resources are scarce and some of the features, like the hardware, are very hard to represent. One of the major problems in writing an ontology is scope and mission creep: how do we stop modelling the world and spending inordinate amounts of time on pathological edge cases? To help us in this we used some Agile techniques in producing the SWO. Perhaps the most useful were the “planning poker” and “buy a feature” games we played. In the SWO project we used a bunch of stakeholders to help us out and the use of these techniques in the SWO went something like this:

 

  1. We did the usual thing of asking for competency questions (which play the role of user stories); clustering them and drawing out a set of features that needed to be modelled.
  2. For the planning poker, we asked people to estimate the effort needed to represent the feature on a numeric scale. Here the trick is that everyone has cards with notional costs written upon them. All cards are held up simultaneously to prevent bias from the first to reveal his or her card. Discussion ensues and a consensus effort for the ontological feature is decided upon.
  3. We then did the same thing for choosing a feature. Depending on the values for effort, an amount of “money” is calculated and distributed evenly amongst the stakeholders; there is not enough money to buy everything. Each feature has a cost and each stakeholder can spend his or her money on the features he or she thinks most important. Negotiating and so on takes place and features to be modelled are either bought or not bought.
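The two games above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the median stands in for the discussion that produces a consensus effort, the total “money” is assumed to be half the summed effort, and the stakeholder names, card values and bids are invented.

```python
from statistics import median_high


def consensus_effort(cards: dict[str, int]) -> int:
    """Assumption: the median revealed card stands in for the discussion."""
    return median_high(cards.values())


def budgets(costs: dict[str, int], stakeholders: list[str],
            fraction: float = 0.5) -> dict[str, float]:
    """Share out less money than the total cost, evenly amongst stakeholders."""
    share = sum(costs.values()) * fraction / len(stakeholders)
    return {name: share for name in stakeholders}


def bought_features(costs: dict[str, int],
                    bids: dict[str, dict[str, float]],
                    budget: dict[str, float]) -> list[str]:
    """A feature is bought when the pooled bids cover its cost."""
    pooled: dict[str, float] = {}
    for name, spend in bids.items():
        if sum(spend.values()) > budget[name]:
            raise ValueError(f"{name} has overspent")
        for feature, amount in spend.items():
            pooled[feature] = pooled.get(feature, 0) + amount
    return [f for f, cost in costs.items() if pooled.get(f, 0) >= cost]


# Hypothetical cards, all revealed at the same time.
cards = {
    "versions": {"ann": 5, "bob": 8, "cat": 8},
    "licences": {"ann": 3, "bob": 5, "cat": 3},
    "hardware": {"ann": 13, "bob": 20, "cat": 13},
}
costs = {feature: consensus_effort(c) for feature, c in cards.items()}
budget = budgets(costs, ["ann", "bob", "cat"])

# Ann and Bob pool their money for versions; Cat buys licences alone.
bids = {"ann": {"versions": 4}, "bob": {"versions": 4}, "cat": {"licences": 3}}
print(bought_features(costs, bids, budget))  # ['versions', 'licences']
```

In this toy run nobody can afford the hardware feature, which is the sort of outcome the limited budget is meant to force.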

This actually worked well and produced a list of prioritised SWO features. We didn’t do it often enough, as priorities and cost estimations change, but the features to be modelled could be seen to change in a single iteration of the planning. In the SWO we think this technique struck a good balance between what was needed and what was achievable.

 

We also needed to add content for these features to the SWO. In the first round this was driven by what our customers needed; this was largely, but not exclusively, the EBI’s Gene Expression Atlas. Later on, we’ve been a bit more systematic about what to put into the SWO. Using a named entity recogniser for bioinformatics software and databases (BioNERDS) we’ve done a survey of all of PMC for mentions of said bioinformatics databases and software. We pulled out the top 50 of these software mentions and we’re slowly ploughing our way through those (I’ve put this list at the end of this blog).
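The ranking step of such a survey can be sketched as below. This does not use or imitate BioNERDS itself; it assumes the recogniser has already produced a list of mentions for each article, and it assumes that every mention counts (rather than one count per article). The sample mentions are invented.

```python
from collections import Counter


def top_mentions(mentions_per_article: list[list[str]], n: int = 50) -> list[str]:
    """Rank software names by total number of mentions across all articles."""
    counts: Counter[str] = Counter()
    for mentions in mentions_per_article:
        counts.update(mentions)
    return [name for name, _ in counts.most_common(n)]


# Hypothetical recogniser output for three articles.
articles = [["BLAST", "R", "BLAST"], ["R", "ClustalW"], ["BLAST"]]
print(top_mentions(articles, n=2))  # ['BLAST', 'R']
```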

 

The paper itself is one in the JBMS thematic series on ontologies; it does for ontologies what the NAR annual database issue does – describes, in this case, ontologies, their state of play and what updates have happened. This is what the SWO paper does. It has the motivation – we need to know how our data were produced and analysed and software plays a crucial role in this analysis. The paper describes what features were bought by our stakeholders, how we axiomatised descriptions of these software features and outlines some of the more tricky modelling issues. My two favourite tricky bits were:

 

  1. Versions of software. The vast variety of versioning schemes is horrid to represent; we did it with individuals of the class “version name” representing a version for a given bit of software. These versions are linked to preceding and succeeding versions to support the obvious queries. It’s not beautiful, but works well enough.
  2. Licences for software. Again, this has to support the multitude of licences, but the interesting thing here is to be able to infer that, for instance, a bit of software is open source – the paper describes the axiom pattern to do this trick.
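To give a feel for these two bits, here is a small Python sketch. It is not the SWO’s OWL axiom pattern (the paper describes that); it only mimics the idea, and the `VersionName` class, the clause names and the version labels are assumptions invented for the illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VersionName:
    """A version as an individual, ordered by links, not by parsing its label."""
    label: str
    preceded_by: Optional["VersionName"] = None


def earlier_versions(version: VersionName) -> list[str]:
    """Walk the 'preceded by' links back to the first version."""
    found = []
    current = version.preceded_by
    while current is not None:
        found.append(current.label)
        current = current.preceded_by
    return found


# Hypothetical clauses that, taken together, define an open source licence.
OPEN_SOURCE_CLAUSES = {"source code available", "derivatives allowed"}


def is_open_source(licence_clauses: set[str]) -> bool:
    """Infer open source from the clauses a licence has, not from its name."""
    return OPEN_SOURCE_CLAUSES <= licence_clauses


# Labels from three different versioning schemes still form one chain.
v1 = VersionName("1.0")
v2 = VersionName("1.1-beta", preceded_by=v1)
v3 = VersionName("2014a", preceded_by=v2)
print(earlier_versions(v3))  # ['1.1-beta', '1.0']

print(is_open_source({"source code available", "derivatives allowed",
                      "attribution required"}))  # True
print(is_open_source({"academic use only"}))  # False
```

Linking versions explicitly avoids having to understand each versioning scheme, and describing licences by their clauses lets “open source” be inferred rather than asserted licence by licence.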

 

 

The paper also describes the SWO’s merger with EDAM, which has brought a lot of content into the SWO. The SWO is being used, and not just by the EBI (the paper has some examples), and will continue to grow. The SWO represents a complex field of human-developed artefacts. In doing so the SWO team has very much taken a pragmatic approach in its representation. The SWO is already quite complex, but we have tried to avoid being too baroque.

 

Here’s the top 50 as produced by BioNERDS (it’s actually 49 and there’s a couple of glitches in this data, but it’s good enough)

 

R

PSI-BLAST

BLAT

Firefox

neighbor

BLAST

FASTA

Entrez

Tree View

PSSM

UCSC Genome Browser

MATLAB

RepeatMasker

Weka

SAM

Q

Apache

Image

PAML

Phred

Network

Cytoscape

MIPS

EMBOSS

TMHMM

ClustalW

BLASTN

DAVID

ClustalX

BLASTP

Bioconductor

SAM

MEME/MAST

T-COFFEE

MUMmer

Cluster

HMMER

MUSCLE

SOAP

Primer3

analysis

PHYLIP

PostgreSQL

Match

PhyML

Excel

MEDLINE

Microarray Suite

SEQUEST

MAFFT