
The Gene Ontology as a language: Investigating Gene Ontology annotations with Zipf’s law

July 28, 2012

Ontologies are key for communication, but are they fit for purpose? Do they allow us to communicate? The main driver for ontology authoring remains the need to communicate what we know about an entity, often a gene product. In this communication the speakers are usually annotators and the listeners (or at least the intended audience) are usually biologists. We can then think of something like the Gene Ontology as a means of communication between an annotator speaker and a biologist listener, and so the Gene Ontology should act like a language, with annotations as utterances in that language.

We’ve recently had a paper published exploring these questions with the Gene Ontology (GO). That is, does the GO act like a language and do GO annotations behave like utterances in that language? The paper is

Leila Kalankesh, Robert Stevens, and Andy Brass. The language of gene ontology: a Zipf’s law analysis. BMC Bioinformatics, 13(1):127, 2012. http://www.biomedcentral.com/1471-2105/13/127/abstract

and is the work of Leila Kalankesh, one of Andy Brass’ Ph.D. students. In this work we used Zipf’s law to explore the language-like characteristics of GO; this distribution (a kind of power law) is characteristic of human languages as well as many other phenomena. In a corpus of text, if all the words are ordered by their frequency, a Zipfian distribution is seen: the most highly ranked word occurs twice as often as the second and three times as often as the third, and so on (frequency is inversely proportional to rank). This gives one of those curves that decline very steeply and have a long, long tail.
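Written out, the inverse relationship between frequency and rank takes the form below. The symbols f(r) (frequency of the item at rank r), C (a constant) and s (the exponent) are notation introduced here for illustration and are not taken from the paper; the classic case described above corresponds to s = 1.

```latex
f(r) = \frac{C}{r^{s}}
\quad\Longrightarrow\quad
\log f(r) = \log C - s \log r
```

With s = 1 this gives f(2) = f(1)/2 and f(3) = f(1)/3, and on logarithmic axes the steep curve with its long tail becomes a straight line whose gradient is -s.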

In a log-log plot of rank against frequency, the gradient can be revealing about the “effort” used in encoding and decoding utterances in that language. A gradient of 1.6 is suggestive of a child-like language and a gradient of 2.4 is that of a sophisticated reader.
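As a rough illustration of what measuring such a gradient involves, here is a minimal Python sketch that ranks items by frequency and fits a straight line to the log-log points by ordinary least squares. The function names and the toy corpus are made up for this example, and this is not necessarily how the gradients in the paper were fitted; the values quoted above may come from a different formulation of the power law.

```python
import math
from collections import Counter


def rank_frequency(items):
    """Return [(rank, frequency)], with rank 1 for the most frequent item."""
    counts = Counter(items)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(rank_freq):
    """Least-squares slope of log10(frequency) against log10(rank)."""
    if len(rank_freq) < 2:
        raise ValueError("need at least two ranks to fit a line")
    xs = [math.log10(rank) for rank, _ in rank_freq]
    ys = [math.log10(freq) for _, freq in rank_freq]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy corpus (invented): the item at rank r occurs exactly 600 / r times,
# i.e. frequency is inversely proportional to rank.
tokens = []
for rank in range(1, 7):
    tokens += [f"term{rank}"] * (600 // rank)

slope = loglog_slope(rank_frequency(tokens))
print(round(slope, 3))  # close to -1 for this idealised corpus
```

The fitted slope is negative because frequency falls as rank rises; when a gradient is quoted as a positive number, it is the magnitude that is meant.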

So, as GO terms are used to describe some biological phenomena and are a form of communication (where the annotator is the speaker and the user of those annotations the “listener”), do the ranked frequencies of GO terms in annotation corpora follow a Zipfian distribution; that is, do they behave like utterances in a language? If so, do the gradients of the curves tell us anything about this communication process? If GO annotations behave like statements in a language, then we could start to think of applying all the tools of computational linguistics to ontology annotations. If GO annotations do not behave like a language, we might wish to ask ourselves why. Finally, the gradient of the plots of rank vs frequency might be able to tell us something about the quality of the communication between annotator and user.

So, this is what we did:

  1. Downloaded GO annotations for a range of model organisms.
  2. Plotted the curve of rank vs frequency for each GO sub-ontology.
  3. Separated out the GO evidence code subsets that indicated the highest confidence…
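The counting behind these steps can be sketched as follows, assuming the annotations are in the tab-separated GAF format, where lines starting with "!" are comments and columns 5, 7 and 9 hold the GO ID, the evidence code and the aspect (P, F or C). The function name, the sample rows and the choice of "high-confidence" evidence codes are assumptions made for illustration; the post does not say which codes the paper used.

```python
from collections import Counter, defaultdict

ASPECT_NAMES = {
    "P": "biological_process",
    "F": "molecular_function",
    "C": "cellular_component",
}


def count_terms_by_aspect(gaf_lines, evidence_codes=None):
    """Count how often each GO ID is used, per sub-ontology.

    If evidence_codes is given, only annotations whose evidence code
    is in that set are counted.
    """
    counts = defaultdict(Counter)
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and GAF comment/header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # malformed row
        go_id, evidence, aspect = cols[4], cols[6], cols[8]
        if evidence_codes is not None and evidence not in evidence_codes:
            continue
        counts[ASPECT_NAMES.get(aspect, aspect)][go_id] += 1
    return counts


def make_row(go_id, evidence, aspect):
    """Build a minimal 9-column GAF-like row (placeholder values, not real data)."""
    return "\t".join(["DB", "ID", "sym", "", go_id, "REF:0", evidence, "", aspect])


# Invented sample rows using the three GO root terms.
sample = [
    "!gaf-version: 2.2",
    make_row("GO:0008150", "IDA", "P"),
    make_row("GO:0008150", "IEA", "P"),
    make_row("GO:0003674", "IMP", "F"),
    make_row("GO:0005575", "IEA", "C"),
]

all_counts = count_terms_by_aspect(sample)
# Assumption: treat two experimental codes as the "high-confidence" subset.
confident = count_terms_by_aspect(sample, evidence_codes={"IDA", "IMP"})

# Ranked frequencies for one sub-ontology, ready for a rank vs frequency plot.
bp_ranked = sorted(all_counts["biological_process"].values(), reverse=True)
```

Sorting each sub-ontology's counts in descending order gives the frequency at rank 1, 2, 3 and so on, which is the data a rank vs frequency plot needs.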

In overview, this is what we found (look at the paper for details):

  1. Most of the species annotations with GO look Zipfian;
  2. Most molecular function and cellular component annotations have a mean slope of around 1.8 and those for biological process one of 2.1;
  3. Things look more Zipfian, with a steeper slope, for the annotations that have a higher confidence;
  4. The gradient is not a function of ontology or genome size.

So, what does all of this tell us? Well, in general, we know that GO annotations behave like statements in a language and, by extension, GO is a language. We also see that annotations in the biological process (BP) ontology appear to be more “sophisticated” utterances than those for molecular function (MF) and cellular component (CC). We can speculate why this might be: there is less to say about function and location; they are smaller sub-languages. For BP there is much more to say: a gene product may be involved in a large number of processes, and there are many more processes than there are functions and locations. We might also see this happening in, for instance, the Phenotypic Quality Ontology, where there is (probably) a lot to say about the phenotype of entities.

This work established that we can view GO annotations (and probably other annotations) as communications in a language. We’d like to explore whether we can use this kind of approach as a means of investigating the quality of an ontology and/or statements made using that ontology. We saw that the D. rerio genome BP annotations had a significantly lower gradient, at 1.8, than the mean of 2.2; why? One reading is that the communication between annotator and user may be impaired. This may be because the annotations are not of high quality, which itself may be for a variety of reasons, including the state of our knowledge (there are fewer papers for this model organism). There is literature (see the paper) that talks about the gradient of the Zipfian distribution indicating a degree of effort or “willingness” to communicate. The linkage to communication effectiveness is controversial, but it remains an attractive thought that we can measure annotation quality (and perhaps, indirectly, ontology quality) through this kind of simple computation.