Archive for September, 2014

Patterns of bioinformatics software and database usage

September 27, 2014


I published a blog on the rise and rise of the Gene Ontology. This described my Ph.D. student Geraint Duck’s work on bioNerDS, a named entity recogniser for bioinformatics databases and software. In a survey of Genome Biology and BMC Bioinformatics full text articles we saw that the Gene Ontology is in the top ten of mentioned resources (a fact reflected in our survey of the whole of 2013’s PMC). This interesting survey was, however, a bit of a side-show to our goal of trying to extract descriptions of bioinformatics and computational biology method from text. Geraint has just presented a paper at ECCB 2014 called:


Geraint Duck, Goran Nenadic, Andy Brass, David L. Robertson, and Robert Stevens. Extracting patterns of database and software usage from the bioinformatics literature. Bioinformatics, 30(17):i601-i608, 2014.


That has edged towards our ultimate goal of extracting bioinformatics and computational method from text. Ideally this would be in a form that people wishing to use bioinformatics tools and data to analyse their data could consult a resource of methods and see what was commonly done, how and with what it was done, what’s the latest method for data, who’s done each method and so on and so on.


Geraint’s paper presents some networks of interacting bioinformatics software and databases that shows patterns of commonly occurring pairs of resources appearing in 22,418 papers from the 2013 PMC corpus that had the MeSH term “Bioinformatics” as a tag. When assembled into a network, there are things that look remarkably like methods, though they are not methods that necessarily appear in any one individual paper. What Geraint did in the ECCB paper was:


  1. Take the results of his bioNerDS survey of the articles in PMC 2013 labelled with the MeSH term “Bioinformatics”.
  2. Removed all resources that were only mentioned once (as they probably don’t really reflect “common” method).
  3. Filter the papers down to method sections.
  4. Get all the pairs of adjacent resources.
  5. Assuming the most used ordering (“Software A takes data from Database B” or “Data from Database B is put into Software A”), we used a binomial test to find the dominant ordering and assumed that was the correct ordering (our manually sampled and tested pairs suggests this is the case).
  6. Resources were labelled as to whether they are software or a database. A network is constructed by joining the remaining pairs together.

The paper gives the details of our method for constructing patterns of usage and describes the evaluations of each part of the method’s outputs.


Some pictures taken from the paper of these networks created from assembling these ordered pairs of bioinformatics resources are:


Figure 1 A network formed from the software recovered by bioNerDS at the 95% confidence level


This shows the network with only bioinformatics software. In Figure 1 we can see a central set of sequence alignment tools, split into homologue search, multiple sequence alignment and pairwise sequence alignment tools), which reflects the status of these core, basic techniques in bioinformatics based analyses. Feeding into this are sequence assembly, gene locator and mass spectroscopy tools. Out of the sequence analysis tools come proteomic tools, phylogeny tools and then some manual alignment tools. Together these look like a pipeline of core bioinformatics tasks, orientated around what we may call “bioinformatics 101” – it’s the core, vital tasks that many biologists and bioinformaticians undertake to analyse their data.


The next picture shows a network created from both bioinformatics software and databases. Putting in both software and databases in Figure 2, we can see what the datasets are “doing” in the pipelines above: UniProt and GEO are putting things into BLAST; GenBank links into multiple sequence alignment tools; PDB links into various sequence prediction and evaluation tools.


Figure 2 A network formed from the bioinformatics and database recovered by bioNerDS at the 95% confidence level


Finally, we have the same network of bioinformatics software and databases, but with the Gene Ontology node (which we count as a database) highlighted.


Figure 3 The same network of bioinformatics software and databases, but with the Gene Ontology and its associates highlighted.


In another blog I spoke about the significance of the Gene Ontology, as recorded by bioNerDS, and this work also highlights this point. In this network we’re seeing GO as a “data sink”, it’s where data goes, not where it comes from – presumably as it is playing its role in annotation tasks. However, its role in annotation tasks, as well as a way of retrieving data, fits sensibly with what we’ve seen in this work. It may well be that we need a more detailed analysis of the language to pick up and distinguish where GO is used as a means of getting a selection of sequences one wants for an analysis – or to find out if people do report this activity. Again we see GO with a central role in bioinformatics – a sort of confirmation of its appearance in the top flight of bioinformatics resource mentions in the whole PMC corpus.


What are we seeing here? We are not extracting methods from the text (and certainly not individual methods from individual papers). What we’ve extracted are patterns of usage as generalised over a corpus of text. What we can see, however, are things that look remarkably like bioinformatics and computational biology method. In particular, we see what we might call “bioinformatics 101” coming through very strongly. It’s the central dogma of bioinformatics… protein or nucleic acid sequences are taken from a database and then aligned. Geraint’s paper also looks at patterns over time – and we can see change. Assuming that this corpus of papers from PMC is a proxy for biology and bioinformatics as a whole and that, despite the well-known inadequacy of method reporting, the methods are a reasonable proxy for what is actually done, BioNerDS is offering a tool for looking at resources and their patterns of usage.

Being a credible virtual witness

September 19, 2014

Tim Clark introduced me to the notion of a scientific paper acting as a virtual witness upon a scientific investigation. We, the readers, weren’t there to see the experiment being done, but the scientific article acts as a “witness statement” upon the work for us to judge that work. There’s been a deal of work over recent time about how poorly methods are described in scientific papers – method is key to being able to judge the findings in a paper and then to repeat and reproduce the work. Method is thus central to a scientific paper being a “credible virtual witness”. One of the quotes on Wikipedia’s description of credible witness is “Generally, a witness is deemed to be credible if they are recognized (or can be recognized) as a source of reliable information about someone, an event, or a phenomenon”. We need papers to be credible witnesses on the phenomena they report.


We’ve recently added to this body of work on reproducibility with a systematic review of method reporting for ‘omic experiments on a set of parasite host investigations. This work was done by Oscar Florez-Vargas, a Ph.D. student supervised by Andy Brass and me; the work was also done with collaborators in Manchester researching into parasite biology. The paper is:


Oscar Flórez-Vargas, Michael Bramhall, Harry Noyes, Sheena Cruickshank, Robert Stevens, and Andy Brass. The quality of methods reporting in parasitology experiments. PLoS ONE, 9(7):e101131, July 2014.


Oscar has worked for 10 years on the immunogenetics of Chagas disease, which is caused by one of the Trypanosoma parasites. Oscar wished to do some meta-analyses by collecting together various results from ‘omics experiments. He came with one issue of apparently contradictory results – Some papers say that the Th17 immune response, T regulatory cells and Nitric Oxide may be critical to infection and others say that they are not. Our first instinct is to go to the methods used in apparently similar experiments to see if differences in the methods could explain the apparent contradiction; the methods should tell us whether these results can be reasonably compared. Unfortunately the methods of the papers involved don’t give enough information for us to know what’s going on (details of the papers are in Oscar’s article). If we are to compare results from different experiments, we have to base that comparison on the methods by which the data were produced. In a broader context, method lets us judge the validity of the results presented and should enable the results in a paper to be reproduced by other scientists.


This need to do meta-analyses of Trypanosoma experiment data caused us to look systematically at a collection of ‘omic experiments from a series of parasite host experiments (Trypanosoma, Leishmania, Toxoplasma, Plasmodium, Trichuris and Schistosoma, as well as the non-parasitic Mycobacterium). Oscar worked with our collaborating parasitologists to develop a checklist of what essential parameters that should be reported in methods sections. This included parameters in three domains – the parasite, the host and the experimental infection. Oscar then used the appropriate PRISMA guidelines in a systematic review of 23 Trypanosoma spp. papers and 10 from each of the other organisms from the literature on these experiments (all the details are in the paper – we aimed to have our method well reported…).


We looked for effects on the level of reporting from organism and publication venue (various bibliometric features such as impact factor, the journal’s h-index and citations for the article).


Perhaps not unsurprisingly the reporting of methods was not as complete as one may wish. The mean of scores achieved by Trypanosoma articles through the checklist was 65.5% (range 32–90%). The method reporting in the other organisms was similarly poor, except in Trichuriasis experiments, which achieved the highest scores and included the only paper to score 100% in all criteria. We saw no effect of publication (some negative correlation with Google Scholar citation levels, though this is confounded by the tendency of older publications to have more citations). There has been no apparent improvement in reporting over time.


Some highlights of what we found were:

  • Species were described, but strains were not and it’s known that this can have a large effect on outcome;
  • Host’s sex has an influence on immunological response and it was sometimes not described;
  • The passage treatment of the parasite influences its infectivity and this treatment was often not reported;
  • Housing and treatment of hosts (food, temperature, etc.) effects infectivity and response to infection and these were frequently not reported.


We know method reporting tends to be poor. It is unlikely that any discipline is immune from this phenomenon. Human frailty is probably at the root – as authors, we’d like to think all the parameters we describe are taken into account in the experimental design. The fault is presumably in the reporting rather than the execution. Can we get both authors and reviewers to use checklists? The trick is, I suspect, to make such checklists help scientists do their work – not a stick, but some form of carrot. This is a similar notion that Phil Lord has used in discussing semantic publishing – the semantics have to help the author do their work, not be just another hindrance. We need checklists in a form that help scientists write their methods sections. Methods need to become a first class citizen in scientific writing, rather than a bit of a chore. Method is vital to being a credible virtual witness and we need to enable us all to be credible in our witness statements.

An accessible front end to Google Calendar

September 15, 2014

I’ve not written about being blind and using computers in this forum before, but I actually have something to say – my new Accessible Google Calendar (AGC) is ready and I like it. As can be appreciated, a calendar or diary is a tremendously useful thing. Not having effective (as far as I’m concerned) access to electronic calendars, and being able to share commonly used calendar mechanisms with colleagues, makes working more trying than it need be.


The advent of on-line calendars and so on should have made life easier, but the two-dimensional table layout of calendars/diaries makes it too much like hard work. In addition, the Web 2.0 nature of tools like google Calendar is not to my screenreader’s liking and therefore not my cup of tea. As a consequence, for many years I had to organise my diary vicariously and, as a result, badly (just due to the overheads of communication, not the people at the other end of my communications).


My first step along the path to a solution was a little command line gadget made for me by Simon Jupp, one of my research associates. This gadget took some arguments that scoped time and then printed out that portion of my google Calendar diary to the screen in text, which was easy for my screenreader to handle. Additions to my diary had to, of course, be done by someone else.

Dimitris Zlitidis then did my M.Sc. project on creating an accessible front end to Google Calendar and this allowed me to both read and write to my google Calendar. This project gave the design of the AGC’s user interface I describe here. I’ve been using this for many years. Google changing their calendar’s API has prompted a re-write by Nikita Abramovs, a vacation student at the School of Computer Science of the University of Manchester, and it’s this re-write I now describe.


The Accessible google Calendar (AGC) tool was written in C#; this has all the user interface stuff that is native to Windows, the operating system I use, so its interface is inclined to work with my screenreader JAWS immediately. I then looked at scoping and prioritising what I wanted done. There’s a lot that one can do with Google Calendar – a lot of management of calendar type stuff – who can edit the entries; inclusion of schedules of public holidays etc. I left these out. When I want them I will work with the Web version and do so vicariously as necessary. The two things I really want to do are:


  1. Look at entries in the portions of time that I most frequently wish to look;
  2. Add, modify and delete entries. I want to do this with access to the facilities for specifying times (all day and fragments of days) and to do recurring events.

As the “past is a foreign country”, the main things I want to do are to look at the “now” and the “future” events in my diary. So, there’s a list of simple patterns of ways in which I choose events at which to look:


  1. Today and tomorrow;
  2. This week and next week;
  3. This month and next month
  4. “Select month period” extends the month functionality by being able to choose months further into the future, which the option of a) single month; b) all months; c) intervening months;
  5. For the rare dates that fall outside this scope there’s a choose date dialogue where I can specify start and end days.
  6. Finally, I can use a search date for events by their content.


AGC’s Event Tab is shown here:


Figure 1An image of AGC’s Events tab showing a week’s events and the various controls for selecting events; details are in the rest of the text.


Events are shown as a simple list that I can move up and down with my cursor key. Unconfirmed events are indicated by a “*” at the start of the entry. I can update events by clicking (pressing return) on the event, which brings up an update event dialogue (similar to the add event dialogue described below). There’s a settings tab that allows me to specify things like: Showing end times; 12 or 24 hour clock; separators for parts of dates (space, slash or dash); and some sounds or text to indicate errors.


AGC’s add date functionality is a moderately complex dialogue, but it flattens out any two-dimensional calendar presentation from which to pick dates. Nearly everything is done via little spin boxes that let me pick years, months and days via my cursor keys. As I fix the start time, the end time dialogue keeps track, defaulting to one hour later, to reduce the amount of “setting” I have to do. Checkboxes for whole day events limits the interaction to setting the day date and a recurring events checkbox exposes dialogue for setting for how long the recurring holds and on which days the recurring event happens. Finally the dialogue allows me to set a reminder time and whether or not the event is confirmed. There’s also an “add quick event” tab that lets me use Google’s controlled natural language for setting dates – “Dinner with Isaac Newton 7 p.m. next Friday” does as it says on the tin. There’s a menu of template CNL sentences from which to pick.

The Add Event tab, showing the recurring events bit, is shown here:


Figure 2An image of AGC’s Set events tab showing an event that recurs weekly from September to December


I’ve used the original version of AGC for several years and it’s been a vital tool. Dimitris and I got the user interface more or less right and Nikita’s re-write and update has made it even better. I rarely need to get outside intervention in my diary setting and the view events tab has a nice regularity, symmetry and simplicity about it that I rather like. I rarely use the choose date and search functions (though they are nice to have for the odd occasion); just having today, tomorrow, this week, next week, this month and next month does it for me nearly all the time. The user interface, having been used for years, has had lots of testing and, while the user base is not extensive (me), it does all that I need to do on a frequent and regular basis. It’s good that Google have exposed the API to their calendar. Ideally I’d like the Web offering of their calendar to work well for me, but I need to do my diary now and AGC is my solution.


The AGC installer can be downloaded from There is a short readme file with a description of AGC functionality and how to install it can be found at: