Archive for July, 2015

Open data and the need for ontologies

July 24, 2015

This is an abstract for “Digital Scholarship and Open Science in Psychology and the Behavioural Sciences”, a Dagstuhl Perspectives Seminar (15302) held in the week commencing 20 July 2015. The workshop brought together computer scientists, computational biologists and people from the behavioural sciences. The workshop explored eScience, data, data standards and ontologies in psychology and other behavioural sciences. This abstract gives my view on the advent of eScience in parts of biology and the role open data and metadata supplied by ontologies played in this change.

A path can be traced from the use of open data in the biological domain to the rise in the use of ontologies for describing those data. Biology has long had open repositories for its nucleic acid and protein sequence data, and controlled vocabularies were used to describe those data. These sequence data are the core ground truth in biology; all else comes from nucleic acids and, these days, the environment. As whole genome sequences became available, different organism communities found that the common vocabulary used to represent sequences facilitated their comparison at that level, but the lack of a common vocabulary for what was known about those sequences blocked comparison of that knowledge. Thus we could tell that sequence A and sequence B were very similar, but comparing their functions, the processes in which they were involved, where they were to be found and so on was much more difficult, especially for computers. Biologists therefore created common vocabularies, delivered by ontologies, for describing the knowledge held about sequences. This approach has spread to many types of data and many types of biological phenomena, from genotype to phenotype and beyond, so that there is now a rich, common language for describing what we know about biological entities of many types.
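The difference a common vocabulary makes to a computer can be sketched in a few lines of Python. This is only an illustration under assumptions: the sequence names `seq_A` and `seq_B`, the free-text descriptions and the annotations themselves are made up, and the helper `common_annotations` is hypothetical. The identifiers are written in the style of Gene Ontology term identifiers, but nothing here reads from a real repository.

```python
# Hypothetical free-text annotations, as two organism communities
# might have written them independently of each other.
free_text = {
    "seq_A": {"protein kinase", "found in nucleus"},
    "seq_B": {"phosphorylates proteins", "nuclear"},
}

# The same knowledge expressed with shared term identifiers.
# The identifier style follows the Gene Ontology; the annotations are invented.
shared_terms = {
    "seq_A": {"GO:0016301", "GO:0005634"},
    "seq_B": {"GO:0016301", "GO:0005634"},
}


def common_annotations(annotations, first, second):
    """Return the annotation terms that the two sequences have in common."""
    return annotations[first] & annotations[second]


# Free text: a human sees the overlap, a simple comparison does not.
print(common_annotations(free_text, "seq_A", "seq_B"))
# prints: set()

# Shared identifiers: the comparison becomes a plain set intersection.
print(sorted(common_annotations(shared_terms, "seq_A", "seq_B")))
# prints: ['GO:0005634', 'GO:0016301']
```

With free text, the comparison finds nothing even though a biologist would read the two descriptions as saying much the same thing; with shared terms the overlap is found directly.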

The advent of eScience came at roughly the same time. The availability of open data and tools via the Web, together with sufficient network infrastructure to use them, led to systems that co-ordinated distributed resources to achieve some scientific goal, often in the form of workflows. Open tools, open data, open standards and open, common metadata all contribute to making this work, but it can be done in stages; not everything has to be perfect for something to happen. The mere availability of data will help, irrespective of its metadata. Open data will, however, provoke the advent of common data and metadata standards, as people wish to do more and to do it more easily.

In summary, we can use the FAIR principles (Findable, Accessible, Interoperable and Reusable) to chart this story. First, we need data and tools to be accessible, and this means openness. Metadata, via ontologies, also have a role to play in this accessibility: do we know what those data are? Metadata have an obvious role in making tools and data findable; calling the same things by the same term, and knowing what those terms mean, makes things findable. The same argument works for interoperable tools and data.