Which is used most for biomedical ontologies: OBO Format or OWL?

May 16, 2013

I was reading Robert Hoehndorf et al.’s paper “Relations as patterns: bridging the gap between OBO and OWL” and was rather struck by the opening sentence:

“The OBO Flatfile Format [1] is used to represent most biomedical ontologies, among them the Gene Ontology (GO) [2] and most of the OBO Foundry ontologies [3].”

The bit “The OBO Flatfile Format [1] is used to represent most biomedical ontologies,…” struck me as unlikely (at least at face value). So, I had a look. Using the RESTful API to BioPortal, Nico Matentzoglu (one of our group’s Ph.D. students) downloaded all the publicly available ontologies (the API lets you get both public and private, but we didn’t use the private ones). We got a total of 347 ontologies, which used the following representations:

OBO 114
OWL 161
OWL-DL 32
OWL-FULL 9
PROTEGE frames 2
RRF 26
UMLS-RELA 3

So, OBO Format has 114 and OWL has 202 (161 + 32 + 9; the different flavours of OWL are apparently different ontologies, so summing them does not double count). I don’t need to do the stats – there are more OWL ontologies than OBO Format ontologies. I’m assuming that BioPortal is a representative sample of biomedical ontologies. With this assumption, Rob’s statement is wrong.
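For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The counts are copied from the table above; treating every label that starts with “OWL” as OWL is my grouping assumption, and the `share` helper is just an illustrative name.

```python
# Counts per representation, copied from the BioPortal tally in this post.
counts = {
    "OBO": 114,
    "OWL": 161,
    "OWL-DL": 32,
    "OWL-FULL": 9,
    "PROTEGE frames": 2,
    "RRF": 26,
    "UMLS-RELA": 3,
}

# Assumption: every label starting with "OWL" counts as an OWL ontology.
total = sum(counts.values())
owl = sum(n for label, n in counts.items() if label.startswith("OWL"))
obo = counts["OBO"]


def share(n, whole):
    """Percentage of `whole` made up by `n`, rounded to one decimal place."""
    return round(100 * n / whole, 1)


print(f"total: {total}")
print(f"OBO: {obo} ({share(obo, total)}%)")
print(f"OWL: {owl} ({share(owl, total)}%)")
```

On these figures OBO Format is roughly a third of the sample and OWL is somewhat over half, so OBO Format falls short of “most” by either measure.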

Can we change the statement to “the OBO Format is the representation of the most widely used biomedical ontologies”? The Gene Ontology (and other OBO Format ontologies) have a large corpus of annotations; I have no numbers across the board, but GO has 3,898,904 annotations (number of filtered annotations from the Gene Ontology Annotations page on 15 May 2013) and is also widely used in gene over expression analysis etc. This is a big number – and other OBO Format ontologies are used for annotations too, though to what extent I don’t yet know.

If we look at some OWL ontologies like SNOMED and NCIT (we can probably argue about whether SNOMED is natively OWL, but we’ll go with it for now), we also probably have some big numbers. The nature of SNOMED annotations of health records means it may be difficult to get the numbers and, even though the “mandate” for use and actual use may be different, I suspect the numbers will still be quite big. Anyway, let’s make something up – UK health records are annotated (I think with Read codes, which are now part of SNOMED) and there are 60 million UK people and, assuming 1 code per person’s record, we’ve got 60 million annotations – quite big. The Experimental Factor Ontology (EFO) is more bio and is used for some 636k annotations in the Gene Expression Atlas (thanks to James Malone for the numbers) – not GO sized, but getting on for a biggish number.

So, in terms of numbers OWL ontologies are widely used.

What happens if we take the medical ones out? Then the numbers will start to look much less healthy for OWL ontologies. Nevertheless, we’ve got a lot of OWL ontologies and fewer OBO ontologies and, even if we have fewer OWL ontologies actually used, we’ve got a lot of use of biomedical OWL ontologies. Taking the medical ones out of the OWL set, I suspect we’ve still got more OWL bio-ontologies, but the OBO ones are used more widely in bio (and the “important” ones are OBO). Taking a look at BioPortal’s OWL ontologies, one gets the suspicion that a lot of them are “toy” ontologies (I’m sure some OBO Format ontologies come into this category too). This will reduce the number of OWL ontologies, but I don’t want to do the categorisation.

Despite this blog having deteriorated from firm numbers to speculation, I think we could go with an opening sentence for Rob’s paper of

“At the time of writing, most of the widely used bio-ontologies use the OBO Format….”

or

“At the time of writing, most of the important bio-ontologies that are extensively used for description of data use the OBO Format representation….”