Making scholarly articles born semantic

August 14, 2013

At SePublica 2012 I gave the invited keynote talk, entitled “Semantic Publishing: What does it all mean anyway?”. This was on the back of my increasing interest in semantic publishing, motivated by the work Phil Lord and I have done on the Ontogenesis knowledgeblog. The invitation itself was, I suspect, also on the back of the fun I had with the ontology submission I made to SePublica that year, which consisted of just the RDF of the Amino Acids Ontology. Nevertheless, I gave the keynote, and one of the things I did in the talk was to make the distinction between a scholarly article being born semantic and being made semantic.

We can author a scholarly document and then add semantics post hoc: for instance, labelling entities with their semantic type (authors, proteins, genes, tools, etc.), as well as elements of document structure and the nature of the citations (and other stuff you will find in the Semantic Publishing and Referencing (SPAR) suite of ontologies). All this is done after the author writes, by hand or through text-mining, etc. Such a document is “made semantic”.

In contrast, this could be done at authoring time, with the author encoding the semantics while writing, rather than a third party doing so post publication, as would typically be the case in being “made semantic”, with all the obvious issues of such things. So, the author does the same kind of semantic mark-up as before, but as he or she writes the document; the semantics then persist through the publication process and the semantic content is available for both readers and machines.

When talking about this to my colleague Sean Bechhofer, he made the analogy with photographs and music being either “born digital” or “made digital”: “However, as of yet, few born-digital (defined in opposition to “made digital” or “digitized” photographs, which are created by scanning analogue sources), photographs have been acquired by archives.” (Becoming Digital: The Challenges of Archiving Digital Photographs, Karen Rae Simonson, University of Manitoba (Canada), 2006). Similarly, in music the recording, mastering and publication stages could each be Analogue (A) or Digital (D). Compact discs were labelled from AAD to DDD, depending on which combination of analogue and digital recording, mastering and publication was used. A piece of music that is DDD is “born digital” music; anything else is “made digital”. A gramophone record would be AAA. So, a scholarly publication can be “born semantic” if it is semantic from the start, or “made semantic” if the semantics are added post hoc.

We talked about how to make scholarly publications “born semantic” in our “three steps paper”. It’s all very well wanting all this semantics in a paper from its birth, but it doesn’t happen for free. A made semantic paper costs resource, and a born semantic paper also costs: in this case, it costs the author. If we think about the three players in the scholarly work game (the author, the reader and the mediating machine), the advantages of semantics are fairly obvious for the reader and the machine. The reader can get better search and active documents, where the machine’s ability to use the semantics of a document’s entities can enhance the reading experience by, for instance, doing protein sequence things with things it knows are protein sequences. The machine, knowing what things are, can do appropriate things with those entities; it becomes computationally more effective.

This leaves the author: what’s in being born semantic for the author? Grief and pain, if we’re just asking an author to do loads of mark-up. Just as there are advantages of being born semantic for the reader and the machine, there needs to be an advantage for the author. This means that adding semantics either has to help the author in his or her task, or it has to come for “free” as a side-effect of something else the author would do anyway.

Phil Lord has started doing some of this in the knowledgeblog. There’s some simple markup that indicates a thing is an author name or a citation. By labelling a DOI or PubMed id as a citation, all my references get compiled and styled (one can imagine adding a CiTO attribute as well, though it’s a bit tricky to work out what’s in it for the author). Knowledgeblog does a bit with knowing that a string is an author, but one can imagine labelling something as an ORCID and getting all sorts of stuff for free, such as a semantically marked up affiliation. On the “semantics by stealth” side of things, we could have style sheets already marked up with elements of rhetorical structure and so on. This doesn’t really help the author, but should aid machine processing down-the-line; the key is that it costs nothing. (The three steps paper above gives more examples.)
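
To make the payback for the author a little more concrete, here is a minimal Python sketch of the citation idea: the author labels an identifier as a citation and gets a numbered reference list compiled for free. This is not how knowledgeblog actually does it; the `[cite]…[/cite]` tag, the function names and the placeholder identifiers are all assumptions made up for illustration.

```python
import re

# Assumed inline markup for this sketch only: [cite]identifier[/cite]
CITE_PATTERN = re.compile(r"\[cite\](.+?)\[/cite\]")


def classify(identifier):
    """Guess the kind of identifier with deliberately crude rules."""
    if identifier.startswith("10.") and "/" in identifier:
        return "doi"
    if identifier.isdigit():
        return "pmid"
    return "unknown"


def compile_citations(text):
    """Swap each cite tag for a numbered marker and collect the references.

    Returns the rewritten text plus a list of (number, kind, identifier)
    tuples, numbered in order of first appearance.
    """
    numbers = {}  # identifier -> reference number

    def replace(match):
        identifier = match.group(1).strip()
        if identifier not in numbers:
            numbers[identifier] = len(numbers) + 1
        return "[{}]".format(numbers[identifier])

    body = CITE_PATTERN.sub(replace, text)
    references = [(n, classify(i), i) for i, n in numbers.items()]
    return body, references


if __name__ == "__main__":
    # Placeholder identifiers, not real references.
    draft = (
        "Ontologies help [cite]10.1000/example.1[/cite]; see also "
        "[cite]12345678[/cite] and again [cite]10.1000/example.1[/cite]."
    )
    body, references = compile_citations(draft)
    print(body)
    for number, kind, identifier in references:
        print(number, kind, identifier)
```

The author's only cost is typing the tag around an identifier he or she already has to hand; the numbering, de-duplication and (in a real tool) the styling come as the side-effect.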

 

One consequence of not making demands of an author (without payback) is that a new tool or specialised environment won’t work. Whatever born semantic stuff we use for authoring, it’s got to work in MS Word, LaTeX or whatever: nice WYSIWYG tools, or very simple markup (if markup is your thing). Any additions by the author have to be low-cost or they won’t happen. This may mean we don’t get a lot of semantics, but that may just be the way that it is, unless the payback comes down-the-line with demonstrably more readers and, of course, citations. The other thing that needs to happen is to do away with strange publisher processes that take camera-ready (possibly semantic) documents and re-do them from scratch, which would remove the semantics gained at birth.