I’ve recently used Eleni Mikroyannidi’s Regularity Inspector for Ontologies (RIO) plugin for Protégé to tidy up some irregularities in the axioms of my Periodic Table Ontology (NPTO). I was happy with the ontology and the inferences it draws, but I knew that I’d not been entirely consistent in the annotation properties I’d used and the syntactic form of the ontology’s axiomatisation. For example, I had expressions such as
hasPart some x
and hasPart some y
as well as
hasPart some x,
hasPart some y
which have exactly the same logical effects, but are not so easy to handle programmatically – also, it’s just not neat.
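As a rough illustration of the equivalence between the two forms, here is a minimal Python sketch that rewrites the "and" form into the separate-axiom form. It assumes the class expression is available as a plain Manchester-syntax-like string; the helper name split_top_level_and is hypothetical and is not part of RIO, Protégé or the OWL API.

```python
def split_top_level_and(expression: str) -> list[str]:
    """Split a class expression on every 'and' that sits outside parentheses.

    Hypothetical helper for illustration only: it works on plain strings,
    not on parsed OWL axioms.
    """
    # Pad parentheses with spaces so they become separate tokens.
    tokens = expression.replace("(", " ( ").replace(")", " ) ").split()
    parts, current, depth = [], [], 0
    for token in tokens:
        if token == "(":
            depth += 1
        elif token == ")":
            depth -= 1
        if token == "and" and depth == 0:
            # A top-level conjunction: close off the current conjunct.
            parts.append(current)
            current = []
        else:
            current.append(token)
    parts.append(current)
    # Re-join tokens and remove the padding added around parentheses.
    return [" ".join(p).replace("( ", "(").replace(" )", ")") for p in parts]


print(split_top_level_and("hasPart some x and hasPart some y"))
# prints ['hasPart some x', 'hasPart some y']
```

Conjunctions nested inside parentheses are left alone, so only the outermost "and" is turned into separate superclass expressions.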
However, the ontology has 118 atoms, as well as the other classes that make up the ontology’s structure; going through all of these classes and neatening up the syntax is both tedious and error-prone. One issue is that, as an author, I don’t necessarily know what to look for. So a find and replace will not suffice (and would probably only work in the simplest of cases).
This is where RIO comes into play; it uses some off-the-shelf unsupervised clustering techniques to find regularities in axiom usage. Once it has found these clusters, it forms generalisations over them. All the details may be seen in
E. Mikroyannidi, L. Iannone, R. Stevens, and A. Rector. Inspecting regularities in ontology design using clustering. In International Semantic Web Conference (ISWC) 2011, pages 438-453. Springer, 2011.
and in Eleni’s subsequent papers, which can be found on my publications page. The ontology’s repository has three versions of the NPTO together with some output from RIO. Together these files show the regularities and irregularities found in the NPTO, what RIO told me about them, and how I’ve improved the form of the ontology. Below I give some highlights.
The core of the clustering is about measuring the similarity of axioms and groups of axioms. Here I’ve been using a popularity measure of entities’ usage for variable substitution in the axiom patterns. Looking at the file “npto_syntactic_popularity.xml_output.txt”, I see a cluster of 93 atom classes. I know there are 118 atom classes, so 25 atoms deviate in some way from how I’ve described the rest. One cluster I see apart from this one is
?cluster_1 SubClassOf ?cluster_3 only (SShell and (?cluster_4 exactly 1 ?DomainEntity) and (?cluster_5 value “?constant”^^string))
PotassiumAtom SubClassOf hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 4))
CesiumAtom SubClassOf hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 6))
LithiumAtom SubClassOf hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 2))
FranciumAtom SubClassOf hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 7))
RubidiumAtom SubClassOf hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 5))
SodiumAtom SubClassOf hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 3))
This shows the pattern I outlined above of using “and” rather than “,” to separate axioms. RIO shows me which classes have used this pattern and I can fix them. There are other little clusters showing variants of this form.
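To give a feel for what a generalisation over a cluster looks like, here is a toy Python sketch. It is not RIO's algorithm: RIO clusters entities by similarity and then generalises, whereas this simply replaces the subject class and any numeric literal with placeholders and groups axioms whose resulting strings are identical. The function names and the ?atom placeholder are assumptions made for the illustration; the axioms are taken from the RIO output quoted in this post.

```python
import re
from collections import defaultdict


def generalise(axiom: str) -> str:
    """Replace the subject class and numeric 'value' fillers with placeholders."""
    _subject, rest = axiom.split(" SubClassOf ", 1)
    rest = re.sub(r"\bvalue\s+\d+", "value ?constant", rest)
    return "?atom SubClassOf " + rest


def group_by_pattern(axioms):
    """Group subject classes by the generalised form of their axiom."""
    groups = defaultdict(list)
    for axiom in axioms:
        subject = axiom.split(" SubClassOf ", 1)[0]
        groups[generalise(axiom)].append(subject)
    return dict(groups)


axioms = [
    "PotassiumAtom SubClassOf hasValenceElectronShell only "
    "(SShell and (contains exactly 1 Electron) and (hasOrder value 4))",
    "LithiumAtom SubClassOf hasValenceElectronShell only "
    "(SShell and (contains exactly 1 Electron) and (hasOrder value 2))",
    "HydrogenAtom SubClassOf (hasMetalness some NonMetal) and "
    "(hasValenceElectronShell only "
    "(SShell and (contains exactly 1 Electron) and (hasOrder value 1)))",
]

groups = group_by_pattern(axioms)
for pattern, subjects in groups.items():
    print(len(subjects), subjects)
    print("   ", pattern)
```

Potassium and lithium end up under one pattern, while hydrogen sits alone under a different one; small groups of this kind are the ones worth inspecting.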
RIO found one instance of the form
?cluster_2 SubClassOf (?cluster_3 some ?Metalness) and (?cluster_3 only (SShell and (?cluster_4 exactly 1 ?DomainEntity) and (?cluster_5 value “?constant”^^string)))
HydrogenAtom SubClassOf (hasMetalness some NonMetal) and (hasValenceElectronShell only (SShell and (contains exactly 1 Electron) and (hasOrder value 1)))
This is a slightly different form of using “and”: difficult to find by eye, relatively easy to find by regular expression once you know it’s there, but RIO finds it for you without you having to know what to look for.
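For comparison, a regular expression for this particular form might look like the following Python sketch. It assumes the axioms have been exported one per line as plain text, which is an assumption about the workflow rather than something RIO or Protégé provides. It is also a crude filter: regular expressions cannot balance parentheses, so it can over-match on more deeply nested expressions.

```python
import re

# Matches "<Class> SubClassOf (...) and (...)", i.e. a superclass expression
# that is itself a conjunction of bracketed expressions.
BRACKETED_AND = re.compile(r"^(\S+) SubClassOf \(.+\) and \(.+\)$")


def find_bracketed_conjunctions(axioms):
    """Return the subject classes whose axiom has the bracketed 'and' form."""
    return [m.group(1) for axiom in axioms if (m := BRACKETED_AND.match(axiom))]


axioms = [
    "HydrogenAtom SubClassOf (hasMetalness some NonMetal) and "
    "(hasValenceElectronShell only "
    "(SShell and (contains exactly 1 Electron) and (hasOrder value 1)))",
    "PotassiumAtom SubClassOf hasValenceElectronShell only "
    "(SShell and (contains exactly 1 Electron) and (hasOrder value 4))",
]

print(find_bracketed_conjunctions(axioms))
# prints ['HydrogenAtom']
```

The catch is the one noted above: writing this expression requires already knowing that the form exists.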
Rather comfortingly, I see things like all the gas state, solid state, metals and non-metals clustered together. So RIO is spotting both the things I’ve done the same way and the things I’ve done in different ways. If you look at the second popularity-based file (npto_popularity_output-v2) you can see that, in the third version of the NPTO in the repository, there are 118 atoms in cluster_1: all the atoms are in one cluster, indicating that they are beginning to look the same syntactically. There are, however, still a few irregularities I missed, and the output enabled me to spot them. For instance, RIO is revealing more deeply nested versions of the same pattern I’ve been targeting. Overall, this scan of the NPTO by RIO shows things becoming nicely regular; it does expose some deviations from style, but most of these are deliberate (and often a hack, like my treatment of the actinides and lanthanides).
The various outputs of RIO in the repository have enabled me to see errors of style (or “bad axiom smells”) in the syntax I’ve used. RIO also enabled me to find errors in annotations: there were missing discovery years, and there’s still some variety in how I’ve labelled atoms. In short, RIO enabled me to spot things I didn’t know were there and that would have been very tedious to find by eye. RIO’s output presentation still needs some attention, but I already find it useful.