Competence questions, user stories and testing

February 2, 2014

The notion of competency questions as a means of gathering requirements for, and a means of evaluating, an ontology comes from a 1994 paper by Gruninger and Fox: “These requirements, which we call competency questions, are the basis for a rigorous characterization of the problems that the enterprise model is able to solve, providing a new approach to benchmarking as applied to enterprise modelling and business process engineering. Competency questions are the benchmarks in the sense that the enterprise model is necessary and sufficient to represent the tasks specified by the competency questions and their solution. They are also those tasks for which the enterprise model finds all and only the correct solutions. Tasks such as these can serve to drive the development of new theories and representations and also to justify and characterize the capabilities of existing theories for enterprise modelling.” And “We use a set of problems, which we call competency questions that serve to characterize the various ontologies and microtheories in our enterprise model. The microtheories must contain a necessary and sufficient set of axioms to represent and solve these questions, thus providing a declarative semantics for the system.” Here we read “enterprise model” as ontology (or, more correctly, an enterprise model may have an ontology as a part, since a KR can have other things than an ontology…).

 

Below you can see examples of what we gathered as competency questions during some Pizza tutorials. They mostly take the form of example questions:

 

  • Find me pizza with hot spicy peppers
  • What sorts of pizza base are there?
  • What vegetarian pizzas are there?
  • What pizzas are there with more than one type of cheese?
  • What kinds of pizza contain anchovy?

What we usually do is to cluster these in several ways to find the major categories we need in the ontology; we also extract example class labels and so on. This also feeds into the abstractions: gathering together vegetable, fish and meat as types of ingredient, for instance. The CQs can also pull out things like qualities of these ingredients – spiciness and so on. Usually there are many versions of the same kind of question. A few examples are:

     

  • Find pizza with ingredient x
  • Find pizza with ingredient x, but not y
  • Find pizza without ingredient z
  • Find pizza with ingredient that has some quality or other
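These query patterns can be sketched in a few lines of code over a toy set of pizzas. The pizza and ingredient names below are made up for illustration, and closed-world set membership is only a crude stand-in for an ontology's class descriptions:

```python
# Toy sketch of the competency-question patterns above, using plain
# Python sets in place of an ontology. All names are illustrative.
PIZZAS = {
    "Margherita": {"Tomato", "Cheese"},
    "Siciliana": {"Tomato", "Cheese", "Anchovy", "Olive"},
    "Capricciosa": {"Tomato", "Cheese", "Anchovy", "Caper"},
}

def with_ingredient(x):
    """Find pizza with ingredient x."""
    return {name for name, toppings in PIZZAS.items() if x in toppings}

def with_but_not(x, y):
    """Find pizza with ingredient x, but not y."""
    return {name for name, toppings in PIZZAS.items()
            if x in toppings and y not in toppings}

def without_ingredient(z):
    """Find pizza without ingredient z."""
    return {name for name, toppings in PIZZAS.items() if z not in toppings}
```

In OWL the same patterns need explicit closure axioms on each pizza before the “but not y” and “without z” queries behave this way; the sets here get that closure for free.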

 

We can view the informal, natural language competency questions as being like user stories in agile software engineering. We use a typical template for a user story:

 

As a role I want to do task for a benefit

 

Usually, the “benefit” boils down to money. We can adapt the “five whys” technique from problem solving: ask the role holder of the user story why they want the task and, when applied with some skill, one can get to a root justification (the return on investment) for the user story. Often it is money, but sometimes users ask for edge cases – this is especially true of ontology types – and some fun, intricate or complex modelling or logic can ensue for no real return. I’ve done this kind of thing a bit and found it rather useful at weeding out spurious user stories, and also at getting better justifications and thus higher priorities for a user story.

 

I’ll carry on in this blog with the CQ

 

“Find pizza with anchovy but not capers”

 

We could take our example CQ and do the following (in the context of the Intelligent Pizza Finder):

 

“As a customer I wish to be able to find pizzas that have anchovy but no capers, because I like anchovy and don’t like capers”

 

And abstract to

 

As a customer I want to find pizzas with and without certain ingredients to make it easier to choose the pizza I want.

 

The benefit here bottoms out in money (spending money on something that is actually desired), but goes through customer satisfaction in finding the pizza to buy with more ease. Such a user story tells me that my ontology must describe pizzas in terms of their ingredients, and therefore have a description (hierarchy) of ingredients, as well as needing to close down descriptions of pizzas (a pizza has this and that, and only this and that – that is, no others). Other CQ user stories give me other requirements:

 

As a vegetarian customer I want to be able to ask for vegetarian pizzas, otherwise I won’t be able to eat anything.

 

This suggests I need abstractions over my ingredients. User stories can imply other stories; an epic user story can be broken down into smaller (in terms of effort) user stories, and this would seem like a sensible thing to do. If CQs are thought of in terms of user stories, then one can bring in techniques of effort estimation and do some planning poker. We did this quite successfully in the Software Ontology.

 

In engineering, and especially agile software engineering, these CQs or user stories also give me some acceptance tests – those things by which we can test whether the product is acceptable. A competency question fits neatly into this: my ontology should be competent to answer the question. Acceptance tests are run against the software, with inputs and expected outputs; a user story is not complete until its acceptance test(s) pass. For competency questions as acceptance tests, input data doesn’t really make sense, though the results of the competency question do make sense as “output” data.

 

If we take a natural language CQ such as

 

Find me pizza with anchovy, but no capers

 

We may get a DL query like

 

Pizza and (hasTopping some AnchovyTopping) and (hasTopping only not CaperTopping) 

 

Which I can use as a test. I was stumped for a while: without necessarily having any ontology, and without knowing the answer, it is hard to run the test “before” and to know whether it has passed or failed. However, it may all fall out easily enough (and may have already been done in some environments); here’s the scenario:

 

  1. I have my query: Pizza and (hasTopping some AnchovyTopping) and (hasTopping only not CaperTopping) and no ontology; I’m setting up the ontology and a “test before” testing style.
  2. The test fails; I can pass the test by adding Pizza, hasTopping, AnchovyTopping and CaperTopping to my currently empty ontology; the test passes in that the query is now valid
  3. I also add pizzas to my test that I expect to be in the answer – NapolitanaPizza; again, the test fails
  4. I add NapolitanaPizza and the test is valid in that the entities are there in the ontology, but I need to add NapolitanaPizza as a subclass of Pizza for there to be any chance of a pass.
  5. I do the addition, but still the test fails; I need to re-factor to add the restrictions from NapolitanaPizza to its ingredients (TomatoTopping, CheeseTopping, OliveTopping, AnchovyTopping and CaperTopping)
  6. My test passes
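A rough sketch of this scenario as code, with the DL query approximated by closed-world checks over a toy data structure. All the names here are made up, and the set semantics is only a crude stand-in for OWL’s open-world semantics plus the closure axioms on each pizza:

```python
# Sketch of the "test before" scenario above, with closed-world set
# semantics standing in for the DL query. All names are illustrative.
ONTOLOGY = {}  # class name -> closed set of toppings

def anchovy_no_caper():
    """Stand-in for: Pizza and (hasTopping some AnchovyTopping)
    and (hasTopping only not CaperTopping)."""
    return {p for p, toppings in ONTOLOGY.items()
            if "AnchovyTopping" in toppings
            and "CaperTopping" not in toppings}

def acceptance_test_passes():
    """The CQ passes once the expected pizza is in the answer."""
    return "SampleAnchovyPizza" in anchovy_no_caper()

# Steps 1-2: empty ontology -- the test fails.
assert not acceptance_test_passes()

# Steps 3-5: describe (and close) the pizzas; the test now passes,
# and a pizza with capers is correctly excluded from the answer.
ONTOLOGY["SampleAnchovyPizza"] = {"TomatoTopping", "AnchovyTopping"}
ONTOLOGY["SampleCaperPizza"] = {"AnchovyTopping", "CaperTopping"}
assert acceptance_test_passes()
assert "SampleCaperPizza" not in anchovy_no_caper()
```

In the real OWL setting the exclusion of the caper pizza only falls out once the covering/closure axioms are asserted, which is exactly the re-factoring in step 5.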

 

 

I’m bouncing between the test itself passing a validity check and the ontology passing the test. It’s easier to see these tests all working in a test-after scenario; it can work in a test-before scenario too, but it seems a bit clunky. This could perhaps be sorted out in a sensible environment. I could even (given the right environment) mock up parts of the ontology and supply the query with some test data.

 

My example query implies other test queries, as it implies other bits of pizza ontology infrastructure: there’s an implication of a pizza hierarchy and an ingredients hierarchy, and we’d want tests for these. Also, not all tests need be DL queries – we have SPARQL too (see, as an example, tests for annotations on entities below).

 

There are other kinds of test too:

  1. Non-logical tests – checks that all classes and properties have labels, that there’s provenance information, and so on.
  2. Tests that patterns are complied with – normalisation, for instance, could include tests for trees of classes with pairwise disjoint siblings.
  3. Tests to check that classes can be traced back to some kind of top-level ontology class.
  4. Tests of up-to-dateness with imported portions of ontology (Chris Mungall and co describe continuous integration testing in GO).
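The first kind of test is easy to sketch. Assuming a toy mapping from entities to their annotations (the names and the data structure are illustrative, not any real ontology API), a label check is just:

```python
# Minimal sketch of a "non-logical" test: flag every entity in a toy
# ontology that lacks an rdfs:label-style annotation. The dictionary
# here is an illustrative stand-in for a real annotation store.
ANNOTATIONS = {
    "Pizza": {"label": "pizza", "creator": "tutorial"},
    "AnchovyTopping": {"label": "anchovy topping"},
    "CaperTopping": {},  # missing label: should be flagged
}

def entities_missing_label(annotations):
    """Return the entities without a label, sorted for stable output."""
    return sorted(e for e, ann in annotations.items()
                  if "label" not in ann)
```

In a real set-up the same check would be a SPARQL query over the ontology’s annotation triples, run as part of a continuous integration job.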

 

Some or all of these can probably be done, and are being done, in some set-ups. However, as I pointed out in a recent blog, the new wave of ontology environments needs to make these types of testing as easy, automatable and reproducible as they are in many software development environments.

Issues in authoring ontologies

January 31, 2014

We’ve recently had a short paper accepted to Computer Human Interaction (CHI) where we describe the outcomes from a qualitative study on what ontologists do when they author an ontology. The paper is called “Design Insights for the Next Wave Ontology Authoring Tools” and the motivation is a desire to understand how people currently go about authoring ontologies (the work was carried out by Markel Vigo in collaboration with Caroline Jay and me). Ultimately we want to understand how people judge the consequences of adding axioms so that we can support the answering of “what if…?” questions (and we’ll do this by creating models of ontology authoring). So, if I add some axiom to a large system of axioms, what will the consequences be? As we work up to this, we’re doing studies where we record what people are doing as they author an ontology, logging all activities in the Protégé 4 environment, as well as capturing screen recordings and eye-tracking data.

 

This first study was a qualitative study where we asked 15 experienced ontology builders a series of open questions:

 

  • Can you describe the authoring tasks you perform?
  • How do you use tools to support you in these tasks?
  • What sort of problems do you encounter?

 

You can read the details of the method and analysis in the paper. We chose to do this study with experienced ontology authors as this will, in the fullness of time, inform us about how authoring takes place without any confounding factors, such as not fully understanding ontologies, OWL or the tools being used. Understanding the issues faced by novices also needs to be done, but that’s for another time.

 

The 15 participants partition into three groups of five: ontology researchers, ontology developers and ontology curators. Ontology researchers are CS types who do research on ontologies and associated tools and techniques; ontology developers are CS types who work closely with domain experts to create ontologies; and curators are those with deep domain knowledge who maintain what are often large ontologies.

 

The tools participants use are (listing those with more than one user): Protégé (14 users), the OWL API (6), OBO-Edit (4) and BioPortal (3). We didn’t choose participants by the tools they used; these are simply the tools that the people we talked to happened to use.

 

The analysis of the interviews revealed themes based on the major tasks undertaken by ontologists, the problems they encounter, and the strategies they use to deal with those problems.

 

  1. Sense-making, exploration and searching: Knowing the state of the ontology, finding stuff, understanding how it’s all put together – “making sense” of an ontology.
  2. Ontology building: Efficiently adding axioms to an ontology en masse and effectively supporting what we called “definition orientated” ontology building.
  3. Reasoning: Size and complexity of ontologies hampering use of ontologies.
  4. Debugging: Finding faults and testing ontologies.
  5. Evaluation: is it a good thing?

 

The paper describes in more detail the strategies people use in these five themes. For instance, speeding up reasoning by restricting the ontology to a profile like OWL EL and using a fast reasoner like ELK; chopping up the ontology to make reasoning faster; relying on user feedback for evaluation; using repositories and search tools to find ontologies and re-use parts of them; using the OWL API to programmatically add axioms; and so on (the paper gives more of the strategies people use).

 

There will be other issues; there will be ones we may not have encountered through our participants and there will be important issues that were in our interviews, but may not have been common enough to appear in our analysis.

 

There may well be tools and techniques around that address many of the issues raised here (we’ve made some of them here in Manchester). However, the ontology authors in this sample don’t use them. Even if tools that address these problems exist, are known about and work, they don’t work together in a way that ontology authors either can use or want to use. So, whilst we may have many useful tools and techniques, we don’t have the delivery of these techniques right. What we really need to build the new wave of ontology authoring environments are models of the authoring process. These will inform us about how the interactions between author and computational environment will work. This qualitative study is our first step on the way to elucidating such a model. The next study is looking at how experienced ontology authors undertake some basic ontology authoring tasks.

Manchester Advanced OWL tutorial: Family History

January 28, 2014

Manchester Family History Advanced OWL Tutorial

Dates: 27th/28th February 2014

Time: 10am – 5pm

Location: Room G306a Jean McFarlane Building, University of Manchester.

The Bio-Health Informatics Group at The University of Manchester invites you to participate in a newly developed OWL tutorial that covers more advanced language concepts of OWL.

The overall goal for this tutorial is to introduce the more advanced language concepts for OWL. This new tutorial builds on the Manchester Pizza Tutorial, by exploring OWL concepts in greater depth, concentrating on properties, property hierarchies, property features and individuals.

 

The topic of family history is used to take the tutee through various modelling issues, using many features of OWL 2 along the way to build a Family History Knowledgebase (FHKB). The exercises involving the FHKB are designed to maximise inference about family history through use of an automated reasoner on an OWL knowledgebase (KB) containing many members of the Stevens family. The aim, therefore, is to enable people to learn advanced features of OWL 2 in a setting that involves both classes and individuals, while attempting to maximise the use of inference within the FHKB.

 

By the end of the tutorial you will be able to:

  1. Know about the separation of entities into TBox and ABox;
  2. Use classes and individuals in modelling;
  3. Write detailed class expressions;
  4. Assert facts about individuals;
  5. Use the effects of property hierarchies, property characteristics, domain/range constraints to drive inference;
  6. Use property characteristics and subproperty chains to drive inferences about individuals;
  7. Understand and manage the consequences of the open world assumption in the TBox and ABox;
  8. Use nominals in class expressions;
  9. Appreciate some of the limits of OWL 2;
  10. Discover how many people in the Stevens family are called “James”.

 

The tutorial is led by Professor Robert Stevens and run by a team of experienced OWL users and researchers from Manchester.

 

Supplementary material for the tutorial can be found at: http://owl.cs.manchester.ac.uk/publications/talks-and-tutorials/fhkbtutorial/

The cost of the course is £250 per day.

 

Registration and Further Information

To register, please email Kieran O’Malley

(kieran.omalley@manchester.ac.uk) prior to February 21st 2014. Payment options will be returned to you following reservation. For further information please visit the website at:

http://owl.cs.manchester.ac.uk/

Generating natural language from OWL and the uncanny valley

January 28, 2014

There is this phenomenon called the uncanny valley where, in situations like CGI, robotics and so on, as the human-like thing gets closer and closer to being human, but is not quite human, the human observer is weirded out or creeped out. If the human-like thing obviously isn’t human, say a simple cartoon, all is OK, but if it is almost human then people really don’t like it.

 

In our recent work on natural language generation (NLG) from OWL I’ve noticed a related phenomenon: readers aren’t weirded out by the generated and somewhat clunky English, but are irritated or piqued by it in a way they wouldn’t be by, for instance, some Manchester Syntax for the same axioms, or by some hand-crafted but imperfect text. The Manchester Syntax is further away from “naturalness”, but apparently less irritating – perhaps because the expectations are lower. It is easy enough to make the Manchester Syntax for some axioms a “correct” instance of what it is (Manchester Syntax); it’s not so easy to make generated natural language “natural”, and if we only get close-ish we’ve met a “valley”, perhaps of irritation. So it’s not really an uncanny valley we’ve seen in our work with NLG from OWL ontologies: when we generate sentences and paragraphs from OWL, readers like the NLG form, but are irritated by English that is almost English, but not quite “natural” natural language. As we’ll see, this may be down to the nature of the errors; they’re basic errors in, for instance, the use of articles and plurals – not grammar fascism.

 

Doing NLG from an OWL axiom is sort of obvious: an axiom is the correlate of a sentence – we have nouns (and adjectives) in the form of classes and individuals, and properties (relationships) often do verb-like things. A class or concept is the correlate of a paragraph; it’s what we want to say on a topic. So, we can take a set of axioms for classes from SNOMED CT like

 

Class: Heart Disease

SubClassOf: (Disorder of Cardiovascular System) and (is-located-in some Heart Structure)

 

and similarly for hypertensive heart disease:

 

Class: Hypertensive heart disease

SubClassOf: (Heart Disease) and (is-associated-with some Hypertensive disorder)

 

And produce paragraphs like

 

A heart disease is a disorder of the cardiovascular system that is found in a heart structure.

 

and

 

A hypertensive heart disease is a heart disease that is associated with a hypertensive disorder.

 

These paragraphs are OK (produced by OntoVerbal), but are not “beautiful” English prose. In these cases we’ve got the articles right and so on, but it all seems a little plodding. There is some clunkiness that is a little irritating, but overall I think they’re pretty good, and they give a decent view of a set of axioms that can be fairly hard work to read. It is possible to produce better English, but at the cost of making a bespoke verbaliser for each ontology, especially for the “unpacking” of complex class labels to get articles and plurals correct; OntoVerbal is generic (though we did a little local fixing to help out with articles for SNOMED classes). However, what we did do in OntoVerbal is to try to generate coherent, structured paragraphs of text for a class’ axioms. To get this coherence (rather than a set of sentences from unordered axioms for a class) we used rhetorical structure theory (RST) and mapped various types of OWL axiom to roles within RST. Example RST roles are evidence, motivation, contrast, elaboration, result, cause, condition, antithesis, alternative, list, concession and justification. These may be implicit within a text, but are often signalled by “discourse markers”: “because” for evidence, “in order to” for enablement, “although” for antithesis, “but” for concession, “and” for list, “or” for alternative, and so on. You can see how we put all of this together in our IJACSA paper.
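The core of this axiom-to-sentence mapping can be caricatured in a few lines. This toy verbaliser is an illustration of the idea only – it is not how OntoVerbal itself is implemented, and all the function names are made up:

```python
# Toy caricature of verbalising one OWL axiom pattern:
# "Class: C SubClassOf: P and (r some F)" becomes one sentence.
# Illustrative only -- not OntoVerbal's implementation.
def verbalise(cls, parent, prop_phrase, filler):
    """Render a named class, its parent and one existential
    restriction as a single English sentence."""
    return (f"A {cls.lower()} is a {parent.lower()} "
            f"that {prop_phrase} a {filler.lower()}.")

sentence = verbalise("Heart Disease",
                     "disorder of the cardiovascular system",
                     "is found in", "heart structure")
```

Even here the hard-coded article “a” breaks for classes like intracranial procedure – exactly the article-and-plural fiddliness discussed above, and part of why generic verbalisation is hard.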

 

 

In the IJACSA paper we did an evaluation to look at the acceptability of these types of rendering and whether they were faithful enough to the OWL to allow “round-tripping” – that is, whether people experienced in OWL could take the verbalised OWL and re-produce the OWL axioms in Manchester Syntax. We also looked at quality by comparing the machine verbalisations to human-generated verbalisations. The argument is that a human produces good quality text (under the constraints given), so if the OntoVerbal text is similar to the human-written text, then it should be of reasonable quality. Below are the OWL axioms from SNOMED for 10 classes, along with natural language verbalisations generated by a human (not in the project) and by OntoVerbal.

 

 

Each numbered example below gives the OWL input, followed by OntoVerbal’s verbalisation and then the ontologist’s verbalisation.

Example 1

OWL input:
pelvic structure SubClassOf: lower trunk structure
lower trunk structure SubClassOf: structure of subregion of the trunk
pelvic structure SubClassOf: the pelvis and lower extremities and the abdomen and the pelvis and lower trunk structure

OntoVerbal: A lower trunk structure is a kind of structure of subregion of the trunk. A more specialised kind of lower trunk structure is pelvic structure. Another relevant aspect of lower trunk structure is that a pelvic structure is defined as the pelvis and lower extremities, the abdomen and the pelvis and a lower trunk structure.

Ontologist: A lower trunk structure is a structure of the subregion of the trunk. A pelvic structure is a subtype of a lower trunk structure.

Example 2

OWL input:
procedure on the brain SubClassOf: intracranial procedure
intracranial procedure SubClassOf: procedure on the central nervous system
intracranial procedure SubClassOf: procedure on the head
intracranial procedure EquivalentClass: procedure by site and has a procedure site some intracranial structure

OntoVerbal: An intracranial procedure is a kind of procedure on the central nervous system and procedure on the head. A more specialised kind of intracranial procedure is procedure on the brain. Additionally, an intracranial procedure is defined as a procedure by site that has a procedure site in an intracranial structure.

Ontologist: An intracranial procedure is a procedure on the central nervous system and a procedure on the head. Any procedure by site in which the procedure site is an intracranial structure is also an intracranial procedure. A procedure on the brain is a subtype of intracranial procedure.

Example 3

OWL input:
abdominal vascular structure SubClassOf: abdominal and pelvic vascular structure
abdominal and pelvic vascular structure SubClassOf: vascular structure of the trunk
abdominal vascular structure SubClassOf: abdominal structure and abdominal and pelvic vascular structure

OntoVerbal: An abdominal and pelvic vascular structure is a kind of vascular structure of the trunk. A more specialised kind of abdominal and pelvic vascular structure is abdominal vascular structure. Another relevant aspect of abdominal and pelvic vascular structure is that an abdominal vascular structure is defined as an abdominal structure and an abdominal and pelvic vascular structure.

Ontologist: An abdominal and pelvic vascular structure is a vascular structure of the trunk. An abdominal vascular structure is a subtype of an abdominal and pelvic vascular structure.

Example 4

OWL input:
chronic disease of the genitourinary system SubClassOf: chronic disease
chronic disease of the genitourinary system SubClassOf: disorder of the genitourinary system
chronic hypertensive uraemia SubClassOf: chronic disease of the genitourinary system
chronic disease of the genitourinary system EquivalentClass: chronic disease and disorder of the genitourinary system and has a finding site some structure of the genitourinary system

OntoVerbal: Chronic disease of the genitourinary system is a kind of chronic disease and disorder of the genitourinary system. A more specialised kind of chronic disease of the genitourinary system is chronic hypertensive uraemia. Additionally, chronic disease of the genitourinary system is defined as chronic disease that is a disorder of the genitourinary system, and has a finding site in a structure of the genitourinary system.

Ontologist: A chronic disease of the genitourinary system is a chronic disease and a disorder of the genitourinary system. Any chronic disease which is also a disorder of the genitourinary system and is found in the structure of the genitourinary system is also a chronic disease of the genitourinary system. A chronic hypertensive uraemia is a subtype of a chronic disease of the genitourinary system.

Example 5

OWL input:
finding of the head and the neck region SubClassOf: finding of the body region
head finding SubClassOf: finding of the head and the neck region
finding of the head and the neck region EquivalentClass: finding of the body region and has a finding site some head and neck structure
head finding EquivalentClass: finding of the head and the neck region and has a finding site some head structure

OntoVerbal: A finding of the head and the neck region is a kind of finding of the body region. A more specialised kind of finding of the head and the neck region is head finding. Additionally, A finding of the head and the neck region is defined as a finding of the body region that has a finding site in a head and neck structure. Another relevant aspect of finding of the head and the neck region is that a head finding is defined as a finding of the head and the neck region that has a finding site in a head structure.

Ontologist: A finding of the head and the neck region is a finding of the body region. Any finding of the body which is found in a head and neck structure is also a finding of the head and neck region. A head finding is a subtype of the finding of the head and the neck region.

Example 6

OWL input:
nephrosclerosis SubClassOf: degenerative disorder
degenerative disorder SubClassOf: disease
arteriosclerotic vascular disease SubClassOf: degenerative disorder
degenerative disorder EquivalentClass: disease and has an associated morphology some degenerative abnormality

OntoVerbal: Degenerative disorder is a kind of disease. More specialised kinds of degenerative disorder are nephrosclerosis and arteriosclerotic vascular disease. Additionally, degenerative disorder is defined as disease that has an associated morphology in a degenerative abnormality.

Ontologist: A degenerative disorder is a disease. Any disease which has an associated morphology of degenerative abnormality is also a degenerative disease. Nephrosclerosis and arteriosclerotic vascular disease are subtypes of degenerative disease.

Example 7

OWL input:
kidney graft material SubClassOf: urinary tract material
kidney graft material SubClassOf: solid organ graft material
kidney graft material SubClassOf: urinary tract material and solid organ graft material
transplant of the kidney EquivalentClass: kidney operation and solid organ transplant and renal replacement and has a method some surgical transplantation action and has a direct substance some kidney graft material and has an indirect procedure site some kidney structure

OntoVerbal: A kidney graft material is a kind of urinary tract material and solid organ graft material. Another relevant aspect of kidney graft material is that a transplant of the kidney is defined as a kidney operation that is a solid organ transplant, and is a renal replacement, and has a method in a surgical transplantation action, and has a direct substance in a kidney graft material, and has an indirect procedure site in a kidney structure.

Ontologist: Kidney graft material is a urinary tract material and a solid organ graft material. A kidney operation, solid organ transplant and renal replacement which has a method of surgical transplantation action, a direct substance of kidney graft material and an indirect procedure site of kidney structure is a type of transplant of the kidney.

Example 8

OWL input:
graft SubClassOf: biological surgical material
tissue graft material SubClassOf: graft
tissue graft material SubClassOf: graft and body tissue surgical material

OntoVerbal: A graft is a kind of biological surgical material. A more specialised kind of graft is tissue graft material. Another relevant aspect of graft is that a tissue graft material is defined as a graft and a body tissue surgical material.

Ontologist: A graft is a biological surgical material. Tissue graft material is a subtype of graft as well as a body tissue surgical material.

Example 9

OWL input:
benign essential hypertension complicating and/or reason for care during pregnancy SubClassOf: essential hypertension complicating and/or reason for care during pregnancy
essential hypertension complicating and/or reason for care during pregnancy SubClassOf: essential hypertension in the obstetric context
essential hypertension complicating and/or reason for care during pregnancy SubClassOf: pre-existing hypertension in the obstetric context
essential hypertension complicating and/or reason for care during pregnancy SubClassOf: essential hypertension in the obstetric context and pre-existing hypertension in the obstetric context
benign essential hypertension complicating and/or reason for care during pregnancy SubClassOf: benign essential hypertension in the obstetric context and essential hypertension complicating and/or reason for care during pregnancy

OntoVerbal: Essential hypertension complicating and/or reason for care during pregnancy is a kind of essential hypertension in the obstetric context and pre-existing hypertension in the obstetric context. A more specialised kind of essential hypertension complicating and/or reason for care during pregnancy is benign essential hypertension complicating and/or reason for care during pregnancy. Another relevant aspect of essential hypertension complicating and/or reason for care during pregnancy is that benign essential hypertension complicating and/or reason for care during pregnancy is defined as benign essential hypertension in the obstetric context and essential hypertension complicating and/or reason for care during pregnancy.

Ontologist: An essential hypertension complicating and/or reason for care during pregnancy is an essential hypertension in the obstetric context and a pre-existing hypertension in the obstetric context. A benign essential hypertension complicating and/or reason for care during pregnancy is a subtype of essential hypertension complicating and/or reason for during pregnancy.

Example 10

OWL input:
procedure on artery of the abdomen SubClassOf: procedure on the abdomen
procedure on artery of the abdomen SubClassOf: procedure on artery of the thorax and the abdomen
abdominal artery implantation SubClassOf: procedure on artery of the abdomen
procedure on artery of the abdomen EquivalentClass: procedure on artery and has a procedure site some structure of artery of the abdomen

OntoVerbal: A procedure on artery of the abdomen is a kind of procedure on the abdomen and procedure on artery of the thorax and the abdomen. A more specialised kind of procedure on artery of the abdomen is abdominal artery implantation. Additionally, a procedure on artery of the abdomen is defined as a procedure on artery that has a procedure site in a structure of artery of the abdomen.

Ontologist: A procedure on artery of the abdomen is a procedure of the abdomen and a procedure on artery of the thorax and the abdomen. Any procedure on artery which has a procedure site of structure of artery of the abdomen is also a procedure on artery of the abdomen. An abdominal artery implantation is a subtype of procedure on artery of the abdomen.

 

 

 

You can see that the verbalisations are fairly similar. Given the task of being faithful to the OWL and enabling “round-tripping”, very similar texts are produced by a human and OntoVerbal; the machine and human verbalisations are of a very similar quality. Evaluators could use both to round-trip to the OWL axioms, but did better with the OntoVerbal-generated text. This is, we think, at least in part due to OntoVerbal being more “complete” in its verbalisation. The human verbalisation is smoother, but presumably not as smooth as a description written by a human domain expert could be (though do look at James Malone’s blog on this topic). However, I suspect that such smooth, natural language texts would be much harder to match to a set of OWL axioms.

 

Where does this leave my uncanny valley, or valley of irritation, for generated natural language? Domain experts writing definitions without the constraint of being faithful to the ontology’s axioms will probably avoid the valley of irritation; if there’s too much “ontologising” in the verbalisation there will be irritation (this came up in another paper on an earlier verbalisation); if there’s clunky English there is irritation. In a generic tool like OntoVerbal this is probably inevitable, and I suspect it’s irritating because these are minor English errors, which are always irritating as they disrupt reading. However, the use of RST does seem to give OntoVerbal’s NLG verbalisations a good level of coherence and fluency, even if they’re not perfectly fluent. They are also cheap to produce. As they are close to the OWL, they give an alternative view on those axioms – one thing I’d like to find out is whether a verbalised view is any better (or worse) at allowing error spotting, and whether it is the verbalisation or just the alternative view that does the job. One could also provide a variety of verbalisations – hand-crafted, luxury ones; ones close to the OWL; and ones with and without the often impenetrable words used in ontologies (especially for relationships).

My first publication discovered

December 24, 2013

I’ve been poking around in the long-tail of my publications as gathered by Google Scholar. Within this I found the following little publication:

 

Five glycyl tRNA genes within the noc gene complex of Drosophila melanogaster.

YB Meng, RD Stevens, W Chia, S McGill, M Ashburner

Nucleic acids research 16 (14), 7189-7189 1988

 

And this must be me. I did my undergraduate biochemistry project with Bill Chia and I sequenced, by hand, some tRNA genes in Drosophila. This is the first I’ve known of this publication and it has made me happy: that my sausage-like fingers clumsily squirting stuff around willy-nilly in Bill’s lab actually earned me a name on the paper; it is a lovely thing to find.

 

This should be my opportunity to drone on about pouring polyacrylamide gels, doing dideoxy reactions, running gels, exposing autoradiograms, reading gels, etc. etc., but that’s enough of that. I should also perhaps say that using the lab’s BBC microcomputer to run a programme overnight to find tRNA genes was the start of my interest in bioinformatics – but it wasn’t. It was, however, a continuation of an interest in what was then known as molecular biology; ultimately bioinformatics has been a way of carrying on that interest.

Making scholarly articles born semantic

August 14, 2013

In Sepublica 2012 I gave the invited keynote talk entitled “Semantic Publishing: What does it all mean anyway?”. This was on the back of an increasing interest I have in semantic publishing, motivated by the work Phil Lord and I have done on the Ontogenesis knowledgeblog. The invitation itself was, I suspect, also prompted by the fun I had with the ontology submission I made to Sepublica that year, which was just the RDF of the Amino Acids Ontology. Nevertheless, I did the keynote and one of the things I did in the talk was to make the distinction between a scholarly article being born semantic and made semantic.

 

We can author a scholarly document and then add semantics post hoc – for instance, labelling entities with their semantic type – authors, proteins, genes, tools etc., as well as elements of document structure, the nature of the citations (and other stuff you will find in the Semantic Publishing and Referencing Ontologies (SPAR) suite of ontologies). All this is done after the author writes, by hand or through text-mining, etc.

 

In contrast, this could also be done at authoring time, with the encoding of the semantics being done by the author at the time of authoring – rather than by a third party post publication, which would typically be the case in being “made semantic” – with all the obvious issues of such things. So, the author does the same kind of semantic mark-up as before, but as he or she writes the document; the semantics then persist through the publication process and the semantic content is available both for readers and machines.

 

When talking about this to my colleague Sean Bechhofer, he made the analogy with photographs and music being either “born digital” or “made digital”: “However, as of yet, few born-digital (defined in opposition to “made digital” or “digitized” photographs, which are created by scanning analogue sources) photographs have been acquired by archives.” (Becoming Digital: The Challenges of Archiving Digital Photographs, Karen Rae Simonson, University of Manitoba (Canada), 2006). Similarly, in music we had Analogue (A) and Digital (D) recording, mastering and publication. Compact discs were labelled with AAD to DDD depending on which combination of analogue/digital recording, mastering and publication was used. A piece of music that is DDD is “born digital”; anything else is “made digital”. A gramophone record would be AAA. So, a scholarly publication can be “born semantic” if it is semantic from the start or “made semantic” if the semantics are added post hoc.

 

We talked about how to make scholarly publications “born semantic” in our “three steps paper“. It’s all very well wanting all this semantics in a paper from its birth, but it doesn’t happen for free. A made semantic paper costs resource and a born semantic paper also costs – in this case the author. If we think about the three players in the scholarly work game – the author, the reader and the mediating machine – the advantages of semantics are fairly obvious for the reader and the machine. The reader gets better search and active documents, where the machine’s ability to use the semantics of a document’s entities can enhance the reading experience by, for instance, doing protein sequence things with things it knows are protein sequences. The machine, knowing what things are, can do appropriate things with those entities; it becomes computationally more effective.

 

This leaves the author; what’s in being born semantic for the author? Grief and pain if we’re just asking an author to do loads of mark up. Just as there are advantages for being born semantic for the reader and machine, there needs to be an advantage for the author. This means that the adding of semantics either has to help the author in his or her task or it has to come for “free” as a side-effect of something else the author would do anyway.

 

Phil Lord has started doing some of this in the knowledgeblog. There’s some simple markup that indicates a thing is an author name or a citation. By labelling a DOI or PubMed id as a citation, all my references get compiled and styled; one can imagine adding a CiTO attribute as well, though it’s a bit tricky to work out what’s in it for the author. Knowledgeblog does a bit with knowing that a string is an author, but one can imagine labelling something as an ORCID and getting all sorts of stuff for free – a semantically marked-up affiliation, for example. On the “semantics by stealth” side of things, we could have style sheets already marked up with elements of rhetorical structure and so on. This doesn’t really help the author, but should aid machine processing down the line; the key is that it costs nothing. (The “three steps paper” above gives more examples.)
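As a sketch of the sort of “free” compilation that labelled citations make possible (this is an illustration, not Knowledgeblog’s actual implementation), once DOIs are identifiable in a post a reference list can be gathered mechanically:

```python
import re

# Hypothetical sketch: find DOIs an author has dropped into a post and
# compile them into a reference list. The DOI pattern is the common
# "10.prefix/suffix" form; trailing sentence punctuation is stripped.
# Real posts would carry the markup labelling each one as a citation,
# which is what makes the semantics reliable.
DOI = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

def compile_references(text: str) -> list[str]:
    """Return the DOIs mentioned in a post, in order, without duplicates."""
    refs = []
    for doi in DOI.findall(text):
        doi = doi.rstrip(".,;")
        if doi not in refs:
            refs.append(doi)
    return refs
```

The point is the author’s cost: they just write the DOI, and the styling and de-duplication come for nothing.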

 

One consequence of not making demands of an author (without payback) is that a new tool or specialised environment won’t work. Whatever born semantic stuff we use for authoring, it’s got to work in MS Word, LaTeX or whatever: nice WYSIWYG tools or very simple markup (if markup is your thing). Any additions by the author have to be low-cost or they won’t happen. This may mean we don’t get a lot of semantics, but that may just be the way that it is – unless the payback is down the line with demonstrably more readers and, of course, citations. The other thing that needs to happen is to do away with strange publisher processes that take camera-ready (possibly semantic) documents and re-do them from scratch, which would remove the semantics gained at birth.


 

One hundred years of ontology

July 26, 2013

At the start of July 2013 I did a Pubmed search for “ontology” or “ontologies” and recorded the numbers of papers per year. I did the same thing again, this time searching for “Gene Ontology”. A bar chart of the numbers is below (a table of the numbers is at the end of the blog).

 

 

Up until the 1990s, things just rumble along at a very low level, with only the occasional mention of “ontology” or “ontologies”. Things pick up in the 1990s, as the computer science notion of ontology was introduced as a way of organising heterogeneous data, and then begin to explode in the 2000s with the advent of the Gene Ontology. The numbers for the “Gene Ontology” start in 2000, pick up fast, fairly consistently track the total number of ontology papers and, in recent years, form a high proportion of a large number of papers. There appears to be an anomaly or outlier in 2010; a weirdness, a mistake or a community holiday. I’ve done no further analysis of these numbers…
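For anyone wanting to repeat the exercise, the per-year counts can be gathered with NCBI’s E-utilities esearch service. This is a sketch of the idea rather than the exact queries I ran; the fetch itself is left to whatever HTTP client you prefer:

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(term: str, year: int) -> str:
    """Build an esearch query counting PubMed records published in one year."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",   # filter on publication date
        "mindate": year,
        "maxdate": year,
        "rettype": "count",   # we only want the hit count
    }
    return EUTILS + "?" + urlencode(params)

def parse_count(xml_text: str) -> int:
    """Pull the <Count> element out of an esearch response."""
    return int(ET.fromstring(xml_text).findtext("Count"))
```

Looping `esearch_url("ontology OR ontologies", year)` over the years and parsing each response gives the table at the end of this post.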

 

Purely out of interest, I had a look at the earliest paper to mention “ontology”:

 

Bryce P.H. ONTOLOGY IN RELATION TO PREVENTIVE MEDICINE. Am J Public Health (N Y). 1912 Jan;2(1):32-3. (PMID: 18008609)

http://www.ncbi.nlm.nih.gov/pubmed/?term=18008609

 

The article is very much of its time and contains some outrageous comments on the causes and spread of disease. However, the core of the paper is about using ontology as a tool for discussion; the opening sentence is

 

We have to thank the metaphysicians for, if not explaining many things, at least giving us useful terms under which discussions may be carried on. Ontology is defined ‘as that branch of metaphysics which investigates and explains the nature of all things or existences.’ While he who coined the word cannot be accused of excessive modesty, yet one may thank him for it since it does give a direction to thought,…

 

which sort of sums it up – especially the bit about excessive modesty.

 

“For the student of preventive medicine there must arise the question: In what ethical category must he place the agents of disease as mosquitos, the hosts of many diseases or the specific microbes and protozoa, their most direct causes? What is the meaning of pestis, cholera, tuberculosis or syphilis in the plan of life?”

 

These are questions the modern ontology community is tackling, and with the greatest of modesty. Hopefully we can do this without Bryce’s appeal to the merits of various forms of civilisation, religion and, for that time, the acceptable notion of eugenics… Bryce says “… until man with the splendid intelligence with which he is endowed shall have learned the life conditions under which each of these evils attacking him exists, and how each in turn may either be subdued to his uses or removed from his pathway.” – he obviously isn’t prone to the lack of modesty he ascribes to metaphysicians.

 

| Year  | Ontology OR Ontologies | Gene Ontology |
|-------|------------------------|---------------|
| 2013  | 813                    | 587           |
| 2012  | 1257                   | 835           |
| 2011  | 1036                   | 664           |
| 2010  | 899                    | 61            |
| 2009  | 815                    | 528           |
| 2008  | 731                    | 464           |
| 2007  | 698                    | 436           |
| 2006  | 535                    | 344           |
| 2005  | 457                    | 273           |
| 2004  | 301                    | 162           |
| 2003  | 177                    | 85            |
| 2002  | 91                     | 34            |
| 2001  | 5                      | 4             |
| 2000  | 43                     | 2             |
| 1999  | 21                     |               |
| 1998  | 33                     |               |
| 1997  | 15                     |               |
| 1996  | 7                      |               |
| 1995  | 21                     |               |
| 1994  | 11                     |               |
| 1993  | 6                      |               |
| 1992  | 6                      |               |
| 1991  | 7                      |               |
| 1990  | 6                      |               |
| 1989  | 1                      |               |
| 1988  | 3                      |               |
| 1987  | 3                      |               |
| 1985  | 1                      |               |
| 1984  | 1                      |               |
| 1983  | 1                      |               |
| 1982  | 3                      |               |
| 1981  | 2                      |               |
| 1980  | 1                      |               |
| 1979  | 1                      |               |
| 1977  | 1                      |               |
| 1974  | 2                      |               |
| 1972  | 2                      |               |
| 1971  | 2                      |               |
| 1968  | 2                      |               |
| 1967  | 1                      |               |
| 1965  | 1                      |               |
| 1961  | 1                      |               |
| 1951  | 1                      |               |
| 1912  | 1                      |               |
| Total | 8022                   | 4479          |

Finding irregularities in the syntax of an ontology’s axioms

July 8, 2013

I’ve recently used Eleni Mikroyannidi’s Regularity Inspector for Ontologies (RIO) plugin for Protégé to tidy up some irregularities in the axioms of my Periodic Table Ontology (NPTO). I was happy with the ontology and the inferences it draws, but I knew that I’d not been entirely consistent in the annotation properties I’d used and the syntactic form of the ontology’s axiomatisation. For example, I had expressions such as

 

SubClassOf:

    hasPart some x

    and hasPart some y

 

As well as

 

SubClassOf:

    hasPart some x,

    hasPart some y

 

which have exactly the same logical effect, but mixed forms are not so easy to handle programmatically – also, it’s just not neat.

 

However, the ontology has 118 atoms, as well as the other classes that make up the ontology’s structure; going through all of these classes and neatening up the syntax is both tedious and error-prone. One issue is that, as an author, I don’t necessarily know what to look for to fix. So a find-and-replace will not suffice (and would probably only work in the simplest of cases).

 

This is where RIO comes into play; it uses some off-the-shelf unsupervised clustering techniques to find regularities in axiom usage. Once it finds these clusters it forms generalisations over these clusters. All the details may be seen in

 

E. Mikroyannidi, L. Iannone, R. Stevens, and A. Rector. Inspecting regularities in ontology design using clustering. In International Semantic Web Conference (ISWC) 2011, pages 438-453. Springer, 2011.

 

And subsequent papers of Eleni’s, which can be found on my publications page. The ontology’s repository has three versions of the NPTO together with some output from RIO; together these files show the regularities and irregularities found in the NPTO, what RIO told me about them and how I’ve improved the form of the ontology. Below are some highlights.

 

The core of the clustering is about measuring the similarity of axioms and groups of axioms. Here I’ve been using a popularity measure of an entity’s usage for variable substitution in the axiom patterns. Looking at the file “npto_syntactic_popularity.xml_output.txt”, I see a cluster of 93 atom classes. I know there are 118 atom classes, so I have 25 variants in how I’ve described atoms. One cluster I see apart from this one is

 

?cluster_1 SubClassOf ?cluster_3 only (SShell and (?cluster_4 exactly 1 ?DomainEntity) and (?cluster_5 value “?constant”^^string))

     Instantiations: (6)

    PotassiumAtom SubClassOf hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 4))

    CesiumAtom SubClassOf hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 6))

    LithiumAtom SubClassOf hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 2))

    FranciumAtom SubClassOf hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 7))

    RubidiumAtom SubClassOf hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 5))

    SodiumAtom SubClassOf hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 3))

 

 

And this shows the pattern I outlined above of using “and” rather than “,” to separate the restrictions. RIO shows me which classes have used this pattern and I can fix them. There are other little clusters showing variants of this form.

 

RIO found one instance of the form

 

?cluster_2 SubClassOf (?cluster_3 some ?Metalness) and (?cluster_3 only (SShell and (?cluster_4 exactly 1 ?DomainEntity) and (?cluster_5 value “?constant”^^string)))

     Instantiations: (1)

    HydrogenAtom SubClassOf (hasMetalness some NonMetal) and (hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 1)))

 

 

Which is a slightly different use of “and” – difficult to find by eye, relatively easy to find by regular expression once you know it’s there; RIO finds it for you without you having to know what to look for.
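To illustrate the “easy by regular expression once you know it’s there” point, here is a minimal sketch (not part of RIO) of finding two restrictions on the same property joined by “and”; the property and class names are the ones from the examples above:

```python
import re

# Find "P some X and P some Y" conjunctions over the same property; the
# backreference \1 ties the two restrictions to one property name, so a
# match flags an axiom using "and" where a "," would be the regular form.
pattern = re.compile(r"(\w+) some (\w+) and \1 some (\w+)")

axiom = "SubClassOf: hasPart some x and hasPart some y"
m = pattern.search(axiom)
```

The catch, of course, is that you have to know this is the pattern to hunt for; RIO’s clustering is what tells you that in the first place.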

 

Rather comfortingly, I see things like all the gas state, solid state, metals and non-metals clustered together. So RIO is spotting both the things I’ve done the same way and the things I’ve done in different ways. If you look at the second popularity-based file (npto_popularity_output-v2) you can now see that, in the third version of the NPTO in the repository, there are 118 atoms in cluster_1 – all the atoms are in one cluster, indicating they are beginning to look syntactically the same. There are, however, still a few irregularities I missed, but the output enabled me to spot them. For instance, RIO reveals deeper-nested versions of the same pattern I’ve been targeting.

This scan of the NPTO by RIO shows things becoming nicely regular. It does expose some deviations from style, but most of these are deliberate (and often a hack, like my treatment of the actinides and lanthanides). The various outputs of RIO in the repository have enabled me to see errors of style (or “bad axiom smells”) with respect to the syntax I’ve used. RIO also enabled me to find errors in annotations – there were missing discovery years and there’s still a bit of variety in how I’ve labelled atoms. RIO let me spot things I didn’t know were there and that would have been very tedious to find by eye. RIO’s output presentation still needs some attention, but I already find it useful.

The rise and rise of the Gene Ontology

July 7, 2013

Geraint Duck, one of our Ph.D. students, has just published a paper on a named entity recogniser for the databases and software used in bioinformatics and computational biology. This is part of a wider project looking at extracting computational biology methods from text. As part of the paper about the BioNERDS tool, we did a survey of databases and software reported in the full-texts of Genome Biology and BMC Bioinformatics in PMC. More recently we’ve done a full survey of PMC, but the paper just reports on the two journals. The paper’s full reference is

 

Geraint Duck, Goran Nenadic, Andy Brass, David Robertson, and Robert Stevens. BioNERDS: exploring bioinformatics’ database and software use through literature mining. BMC Bioinformatics, 14(1):194, 2013. (DOI: 10.1186/1471-2105-14-194).

 

Here I want to report on the survey and, in particular, what it says about the reported usage of the Gene Ontology. We surveyed BMC Bioinformatics and Genome Biology; the former has a remit to report the development of bioinformatics methods, tools and databases, while the latter has a remit to report more on the use of those resources to actually “do biology” – though, of course, there is overlap. The table below shows the top ten resources for each journal over the lifetime of each journal.

 

| BMC Bioinformatics | Count | Genome Biology | Count |
|--------------------|-------|----------------|-------|
| R                  | 1922  | R              | 574   |
| GO                 | 1102  | GO             | 516   |
| BLAST              | 870   | BLAST          | 430   |
| analysis           | 696   | GenBank        | 414   |
| PDB                | 631   | GEO            | 287   |
| Network            | 553   | Ensembl        | 266   |
| Q                  | 494   | S4             | 229   |
| GenBank            | 468   | tRNA           | 195   |
| KEGG               | 463   | analysis       | 193   |
| GEO                | 416   | RefSeq         | 175   |

These numbers are the documents in which the resource was mentioned. There are a few resources that are over-reported – “Network” and “analysis” are both real bioinformatics resources, but with highly inconvenient names for text-mining. “analysis” is not an unusual word to find in reports of science, so calling a tool “analysis” is, we think, something of an infelicity. However, the text-miners dealt with this kind of thing for gene and protein names, so I’m sure we will also do so.

 

In both journals the Gene Ontology is up there in the top resources reported in the literature, alongside the usual suspects. R is now top dog, with BLAST, GO, Ensembl, KEGG, GEO and GenBank following. I’m reasonably happy in concluding that the Gene Ontology is one of the central resources in these journals.

 

We also had a look at the GO’s usage over time. We calculated the relative use of the GO by dividing the number of documents mentioning GO by the number of documents in that year in each journal.
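The calculation itself is simple; a sketch with illustrative numbers (not the paper’s data):

```python
# Per-year relative use: documents mentioning GO in a year, divided by
# all documents that journal published in the same year. The counts
# below are made up for illustration.
def relative_use(go_mentions: dict, totals: dict) -> dict:
    """Fraction of a journal's papers that mention GO, per year."""
    return {year: go_mentions[year] / totals[year] for year in go_mentions}

fractions = relative_use({2004: 50, 2005: 120}, {2004: 400, 2005: 600})
```

Normalising this way stops the journals’ own growth masquerading as growth in GO usage.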

 

We can see the mentions of GO in BMC Bioinformatics increasing fairly rapidly until 2005 and then increasing more slowly, and even tailing off a bit, thereafter (the paper has more details on these trends – normalising, statistical testing and so on – but the trends appear to be sound). The picture in Genome Biology is a little less clear, but GO becomes an established resource. My suspicion is that numbers appear to tail off (as they do for other resources) as resources become part of the fabric and are no longer explicitly mentioned; also, there are more resources to use and cite, so competition is fierce. I’ve no evidence for these thoughts, but that’s my conjecture.

 

In these two journals GO is a “top” resource – we have an ontology that is a key resource for bioinformatics and computational biology. Something happens in 2005/2006 to GO’s usage (the paper has some plots of acceleration of usage too) – some kind of saturation, establishment as “a top resource”, or something else. A similar picture is seen in the whole of PMC – GO is in the top ten – I’ll report on that, and on how other ontologies fare, in another post. However, the take home message is that there is an ontology that is a central resource in bioinformatics and computational biology. That it is the GO is no surprise.

Putting an ontology of the atoms in order

July 4, 2013

 

I’ve come to the end of one thread of my playing with an ontology of the Periodic Table of the Elements, thanks to an excellent third-year project by Ionica Durchi. I’ve already written about an ontology of the atoms, where each atom is described according to its electronic configuration; the atom families are then also defined according to common electronic configuration. It is this electronic configuration that determines the physicochemical properties of the atom, the substances it forms and the families we observe in the Periodic Table. An OWLViz view of this ontology can be seen below.

 


 

This ontology lacks the explicative power of the standard view of the Periodic Table. As we move left to right in the table, atomic mass increases; we also observe periodicity in the physicochemical properties of the elements. As this periodicity or regularity happens, we group similar elements together. So lithium, sodium, potassium, cesium and so on are all light, soft, highly reactive, combine with halides in the ratio one to one, and so on. This gives us the standard view of the Periodic Table below, with the alkali metals described above on the far left-hand side (though note that the tabular form is a visual artefact of putting it on a two-dimensional medium – it’s really a spiral).

 

Source: http://www.bpc.edu/mathscience/chemistry/images/periodic_table_of_elements.jpg


 

My ontology has all the information (or proxies for it) for the standard view of the Periodic table, but it doesn’t show off this periodicity. So, the problem I set for Ionica was to implement an algorithm and some visualisation that would bridge this gap. The rules of engagement were couched as the two questions that could be asked of the ontology:

 

  1. What is the next atom?
  2. To what family does the atom belong?

 

These two questions encapsulate the two dimensions along which the Periodic Table is arranged: increasing atomic mass and the periodically occurring physicochemical families to which the atoms belong. The aim was to be able to render the ontology as the Periodic Table looks, putting in gaps where appropriate. Just as Mendeleev left gaps where he thought elements should be present (though not yet discovered), the algorithm for rendering the ontology, using the two questions above, should put the atoms in order of increasing atomic mass, but also order them in a second dimension by physicochemical family.

 

 

 

Ionica’s algorithm for doing this is outlined in the decision tree below. It takes the next element in increasing atomic number and checks its family membership against the already displayed elements. If the answer is ‘Yes’, it simply places the element beneath the one with the same superclass; if the answer is ‘No’, it creates a new column for the newly ‘discovered’ family and shuffles all the elements above accordingly. After executing either branch, it returns to the top, extracting the next element and repeating the process until space has been allocated for all elements.
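The loop can be sketched as follows (my reading of the algorithm as described, not Ionica’s code; the element tuples and family names are illustrative, and I’ve chosen to start a new period whenever a family repeats in the current row):

```python
# Rows are periods; columns are families in order of first appearance.
# An element whose family is already in the current row starts a new
# period ('Yes' branch); an element with an unseen family "discovers" a
# new column ('No' branch), leaving gaps in the rows above.
def layout(elements):
    columns = []            # family order, left to right
    rows = [[]]             # each row is a list of (family, name) pairs
    for name, _z, family in sorted(elements, key=lambda e: e[1]):
        if any(f == family for f, _ in rows[-1]):
            rows.append([])            # family repeats: a new period begins
        if family not in columns:
            columns.append(family)     # new family: a new column
        rows[-1].append((family, name))
    # Render the grid, leaving a gap where a period lacks a family
    grid = []
    for row in rows:
        placed = dict(row)
        grid.append([placed.get(f, "") for f in columns])
    return columns, grid

elements = [("Li", 3, "alkali"), ("Be", 4, "alkaline earth"),
            ("Na", 11, "alkali"), ("Mg", 12, "alkaline earth"),
            ("K", 19, "alkali"), ("Ca", 20, "alkaline earth"),
            ("Sc", 21, "transition")]
cols, grid = layout(elements)
```

On this toy input, scandium’s new “transition” column appears on the right with empty cells in the two earlier periods – the same gap-leaving behaviour described for the full table.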

 


 

 

 

 

 

 

The pictures below show the programme working with various ranges of atomic number (as a proxy for atomic mass) and/or their year of discovery. The algorithm can be seen working, adding in gaps into the table as necessary. The application looks like this (below) and can filter by discovery year and do specified ranges of atomic numbers.

 

 

 

For atomic numbers 3 to 20 we just get three rows of elements, up to the element prior to the first transition element; the algorithm checks each atom in turn to see if it’s a member of any of the current groups – on reaching sodium the answer is, for the first time, yes, and a new period is started.

 

On reaching scandium, we find it is not a member of the boron family or any other family, so a gap for a new family should be started.

 


 

 

This carries on adding “gaps” until all the transition elements are done. Then we get the rest of the Periodic Table as we’d expect to see.

 

 

If we start with element 2 (helium) we end up with the noble gases on the left hand side:

 


 

 

This looks strange, but is conceptually OK, as the “table” is continuous, so having the noble gases at the left or the right doesn’t really matter. However, we do prefer to have the non-metals together on the right-hand side – so there’s a little tweak to the algorithm to deal with hydrogen and helium. A future piece of work may render the thing as a rotatable spiral…

 

 

The algorithm also leaves gaps appropriately for “undiscovered” elements:

 


Drawing the Periodic Table from year 0 to 1891 (above) shows only those elements that Mendeleev knew.

 


 

The ontology has all the information about the periodicity of the physicochemical properties of the elements (or that which accounts for it), but doesn’t make this explicit. It is only the layout that makes this periodicity with increasing atomic mass explicit. It may be a simple observation, but despite the ontology having the knowledge, it is how that knowledge is presented that often matters.

