Work

Reference Genome Graphs

I posted this wondering on Twitter earlier today:

Which elicited precisely no response. So I actually had to do my own research.

sigh.

These are the hacked-together results of that hour or so, presented on the grounds that they may help others, as well as serving as an aide-mémoire for me.

Motivation

In general, linear representations of reference genomes fail (unsurprisingly) to capture the complexities of populations. It is almost universally the case, though, that the tools we have available to us work with precisely this type of reference.

If we take the example of the human genome – since GRCh37 (Feb ’09), the GRC has attempted to represent some of the population-level complexity of humans by making alternative haplotypes available for regions of high complexity (the MHC locus of chromosome 6 is the canonical example). Church et al (2015) provide a more complete overview of the issues than I could hope to [1].

Despite this recognition of the issues at hand, a flattened representation of the ‘genome graph’ is still predominantly used for downstream mapping applications. With GRCh38 having 178 regions with alt loci (as opposed to 3 in GRCh37), the need for an adequate toolchain becomes more pressing.

The Approach

Church et al point to a GitHub repo that tracks software tools making use of the full GRC assembly. I think its contents speak for themselves.

I’ve found a few useful resources discussing the technology surrounding the use of a reference graph, as opposed to the flattened representation. Kehr et al (2014) discuss the different graph models available for building graph-based alignment tools [2], though their coverage of actual implementations is rather historical (the most recent tool referenced was released in 2011).

More up-to-date is the work of Dilthey et al (2015), which looks at improving the genome representation of the MHC region specifically, through the use of reference graphs and hidden Markov models [3]. However, this work doesn’t seek to tackle a generic approach to read alignment to a genome graph. We do get a proposal for a useful graph model (the Population Reference Graph, or PRG) and a nice HMM-based method for using data specifically in the MHC region. It’s also unclear to me (from my brief reading of the paper) how well this approach would scale.

From the CS side of things, we get an extension of the Burrows-Wheeler Transform to graphs rather than strings from Sirén et al (2014) [4]. This approach would clearly allow the transformation so widely used in short-read alignment to be adapted to a graph-based algorithm.

Then, finally, I came across an implementation. BWBBLE represents a collection of genomes, not using a graph structure, but by making use of ambiguity codes and string padding [5]. Huang et al then rely on a conventional BWT to index this string representation of the ‘multi-genome’. This work also describes the implementation of an aligner that makes use of this genome representation.
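To make the ambiguity-code idea concrete, here’s a toy sketch (my own illustration, not BWBBLE’s actual encoding, which also pads the string to cope with indels): SNPs from a population are collapsed into a single linear string using IUPAC codes, which a conventional BWT index can then handle.

```python
# Toy sketch of the 'multi-genome' idea: collapse population SNPs into one
# linear string using IUPAC ambiguity codes, which an ordinary BWT can index.
# Illustration only - not BWBBLE's actual encoding. Three-allele codes are
# omitted for brevity and fall back to 'N' here.

IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("ACGT"): "N",
}

def flatten_snps(reference, snps):
    """Merge SNP alternatives into the reference string using ambiguity codes.

    snps: dict of 0-based position -> set of alternative bases seen there.
    """
    out = list(reference)
    for pos, alts in snps.items():
        alleles = frozenset(alts) | frozenset(reference[pos])
        out[pos] = IUPAC.get(alleles, "N")
    return "".join(out)

# Two SNPs layered onto a short reference: C/T at position 1, C/G at position 5
print(flatten_snps("ACGTACGT", {1: {"T"}, 5: {"G"}}))  # -> AYGTASGT
```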

BWBBLE feels a bit ‘close but no cigar’ – it’s not using a graph to represent a population of genomes and no one is really using it.

Finally, I get to the place where I actually started: HISAT2. I knew that this aligner uses graph-based FM indices to represent the genome (based on the work of Sirén et al mentioned above), and that this makes it possible to represent SNPs and small indels in the reference. I have no idea whether it allows HISAT2 to fully represent alt loci, though from the descriptions available it seems unlikely. I was pretty impressed by HISAT when it was released [6], but have yet to test HISAT2 in anger.

HISAT was posited as the replacement for Tophat as an RNA-Seq aligner, and HISAT2’s webpage makes it clear the developers intend it to be used in place of both HISAT and Tophat. It is not clear that any such thing is happening: for those who still rely on alignment and have moved on from Tophat, STAR seems to be the de facto replacement in most people’s RNA-Seq toolchains.

Conclusions

My perspective on this is one of a complete outsider doing about an hour’s worth of reading, but it seems to me that there is a real paucity of tools allowing alignment to a graph model of a reference genome. There’s plenty of discussion of the issues, and a recognition that such tools are necessary, but little so far in the way of implementation.

I assume these tools are in the works (as I said, I’m an outsider looking in here, I have no idea who’s developing what).

I’ll leave with this, which is part of what got me wondering about the state of this field in the first place:

I guess 3-5 years is a loooong time in genomics.

EDIT: Clearly I missed a whole bunch of stuff. Many useful comments on Twitter, especially from @ZaminIqbal and @erikgarrison. Most especially, Erik points out the completeness of vg – which:

implements alignment, succinct db of graph (sequences + haplotyps), text/binary formats, visualization, lossless RDF transformation, annotation, variant calling, graph statistics, normalization, long read aln/assembly, sequence to debruijn graph, kmers, read simulation, graph comparison, and tools to project models (graph alns and variant calls) into linear ref.

(from a collection of tweets: 1, 2, 3, 4, 5, 6, 7, and 8).

Jeffrey in the comments below also points out a presentation on Google Docs by Erik: https://docs.google.com/presentation/d/1bbl2zY4qWQ0yYBHhoVuXb79HdgajRotIUa_VEn3kTpI/edit#slide=id.p.

Another tool to get a mention was progressiveCactus (see here), a graph-based alignment tool that seems to be under active development on Github.

So, not quite the paucity of implementation I first feared (though it was clear stuff must be in the works – good to know some of the what and where) – and Twitter came in handy for the research in the end…

[1]: doi: 10.1186/s13059-015-0587-3

[2]: doi: 10.1186/1471-2105-15-99

[3]: doi: 10.1038/ng.3257

[4]: doi: 10.1109/TCBB.2013.2297101

[5]: doi: 10.1093/bioinformatics/btt215

[6]: doi: 10.1038/nmeth.3317

“Alignment free” transcriptome quantification

Last week a pre-print was published describing Kallisto – a quantification method for RNA-Seq reads described (by the authors) as ‘near optimal’. Here is Lior Pachter’s associated blog post. Kallisto follows a current trend for ‘alignment free’ quantification methods for RNA-Seq analysis, with recent methods including Sailfish and its successor, Salmon.

These methods do, of course, do ‘alignment’ (pseudo-alignments in Kallisto parlance, lightweight alignments according to Salmon), but they do not expend computation in determining the optimal alignment, or disk space in saving the results of the alignment. Making an alignment ‘good enough’ to determine the originating transcript and then disposing of it seems like a sensible use of resources – if you don’t intend to use the alignment for anything else later on (which usually I do – so alignment isn’t going to go away any time soon).
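To make the distinction concrete, here’s a toy sketch of the general idea (my own simplification, not Kallisto’s or Salmon’s actual algorithm): rather than computing a base-level alignment, you intersect a read’s k-mers with a k-mer index of the transcriptome to find the set of transcripts the read is compatible with, and that compatibility class is all the quantification step needs.

```python
from collections import defaultdict

def build_kmer_index(transcripts, k=5):
    """Map each k-mer to the set of transcripts containing it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def pseudoalign(read, index, k=5):
    """Return the transcripts compatible with every k-mer of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k], set())
        compatible = hits if compatible is None else compatible & hits
        if not compatible:       # no transcript explains all of the k-mers
            return set()
    return compatible or set()

transcripts = {"tx1": "ACGTACGTTTGACCA", "tx2": "ACGTACGTTTGTTTT"}
index = build_kmer_index(transcripts)
print(pseudoalign("ACGTACGTTT", index))  # {'tx1', 'tx2'} - ambiguous, shared prefix
print(pseudoalign("GTTTGACCA", index))   # {'tx1'} - resolves to a single transcript
```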

I’ve been using Salmon for a few months now, and have been very impressed with its speed and apparent accuracy, but I felt that the publication of Kallisto means I should actually do some testing.

What follows is pretty quick and dirty, and I know it could well stand some improvements along the way. I’ve tried to document the workflow adequately – let me know if you have any specific suggestions for improvement.

Building a test set

Initially I wanted to run the tests with a full transcriptome simulation, but I’ve been having some technical issues with polyester, the RNA-Seq experiment simulator, and I haven’t had the time to work them out. So instead, I am working with the transcripts of a sample of 250 random genes. Reads were simulated for these transcripts to match empirically observed counts from a recent experiment. This simulation gave me a test set of 206,124 paired-end reads for 1,129 transcripts. This data set was then used for quantification with both Salmon (v0.3.0) and Kallisto (v 0.42.1).
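For the record, the set-up amounted to something like the sketch below (the file and column names are hypothetical, and the actual read simulation was done separately): sample 250 gene IDs, pull out their transcripts, and use the empirically observed counts as the targets for the simulator.

```python
import random
import pandas as pd

# Hypothetical input: a table of per-transcript counts from a previous
# experiment, with a 'gene_id' column (names are illustrative only).
counts = pd.read_csv("empirical_transcript_counts.tsv", sep="\t", index_col=0)

random.seed(42)
genes = random.sample(sorted(counts["gene_id"].unique()), 250)

# Transcripts belonging to the sampled genes; their observed counts become
# the per-transcript read numbers handed to the read simulator.
subset = counts.loc[counts["gene_id"].isin(genes), ["gene_id", "count"]]
subset.to_csv("simulation_targets.tsv", sep="\t")
print(len(subset), "transcripts selected for simulation")
```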

Quantification

I then ran the quantification with each tool on the simulated reads 10 times, to get an idea of the variability of the results. For Salmon, this meant 10 separate runs; for Kallisto, it meant extracting the HDF5 file from a run with 10 bootstraps. For interest, I tracked the time and resource use of each tool, though this is not a big consideration. Since we are now in the place where most tools operate in minutes per sample (alignment via STAR or HISAT, and these quantification methods), a few minutes either way is going to make a negligible difference, and speed is usually entirely the wrong metric to focus on.
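For anyone wanting to do the same, pulling the bootstrap estimates out of Kallisto’s HDF5 output is straightforward with h5py. The sketch below assumes the layout Kallisto used at the time (transcript IDs under aux/ids, each bootstrap under bootstrap/bs0, bs1, …) – check your own file with h5ls if in doubt.

```python
import h5py
import numpy as np

# Read per-transcript bootstrap estimates from a Kallisto abundance.h5 file.
# Assumes the HDF5 layout Kallisto wrote at the time: transcript names under
# /aux/ids and each bootstrap's counts under /bootstrap/bs0 ... /bootstrap/bsN.
with h5py.File("abundance.h5", "r") as h5:
    ids = [name.decode() for name in h5["aux/ids"][:]]
    boots = np.vstack([h5["bootstrap"][key][:] for key in sorted(h5["bootstrap"])])

print(len(ids), "transcripts,", boots.shape[0], "bootstrap estimates each")
```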

For the record – Salmon took a lot longer for this experiment, though the requirement to keep counting reads until at least 50,000,000 have been processed (by default, configurable through the -n parameter) accounts for the majority of the time discrepancy, and would not be such a ‘penalty’ with a normally-sized experiment.

Variability

The two graphs below show the coefficient of variation for the observations from each of the software tools, plotted against the raw read count. In both cases, the transcripts with a low base mean exhibit the largest relative variance, which is hardly surprising: in general, variance is proportional to mean expression.
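The coefficient of variation here is just the per-transcript standard deviation across the 10 observations divided by the per-transcript mean – a minimal sketch, assuming a runs-by-transcripts matrix like the one built above:

```python
import numpy as np

def coefficient_of_variation(obs):
    """Per-transcript CV from an (n_runs, n_transcripts) matrix of counts."""
    mean = obs.mean(axis=0)
    std = obs.std(axis=0, ddof=1)
    # avoid dividing by zero for transcripts that were never observed
    return mean, np.divide(std, mean, out=np.zeros_like(mean), where=mean > 0)

# e.g. with the bootstrap matrix read above (dummy data used here)
obs = np.random.poisson(20, size=(10, 1129)).astype(float)
mean, cv = coefficient_of_variation(obs)
print(cv[:5])  # plot cv against mean (or raw read count) to reproduce the figures
```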

salmon_variance

Per-transcript coefficient of variation for Salmon observations

kallisto_variance

Per-transcript coefficient of variation for Kallisto observations

Comparison

For comparison, I took the mean observation from the 10 runs of Salmon, and the final abundance.txt observations from Kallisto, and compared them to the ‘ground truth’ – the count table from which the simulation was derived. Plots of these comparisons are below.

salmon_truth

Correlation of Salmon counts with ground truth. The red line indicates perfect correlation.

kallisto_truth

Correlation of Kallisto counts with ground truth. The red line indicates perfect correlation.

I think both tools here are doing a pretty bang-up job — though Kallisto is performing better in this test, particularly with high-abundance isoforms. Its correlation with ‘truth’ is stronger (Spearman, reported on the graphs above), and its mean absolute difference from truth is smaller (10.04 for Kallisto, 60.66 for Salmon).
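For completeness, the comparison itself boils down to something like this, assuming a table with ‘truth’, ‘salmon’ and ‘kallisto’ count columns joined on transcript ID (the file name is hypothetical):

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical joined table: one row per transcript, with the simulated
# 'truth' counts, the mean of the 10 Salmon runs, and Kallisto's est_counts.
df = pd.read_csv("counts_vs_truth.tsv", sep="\t", index_col=0)

for tool in ("salmon", "kallisto"):
    rho, _ = spearmanr(df["truth"], df[tool])
    mad = (df[tool] - df["truth"]).abs().mean()
    print(f"{tool}: Spearman rho = {rho:.3f}, mean absolute difference = {mad:.2f}")
```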

Both Salmon and Kallisto are still under active development (Salmon has not yet been published, and Kallisto is only just in pre-print), so this is actually relatively early days for quantification by alignment-free methods (see this post by the Salmon developer Rob Patro for some potential future directions). The fact, then, that both tools are already doing such a good job of quantification is very exciting.

EDIT:

In response to the comment from Rob Patro below, I’m including a graph of the comparison of TPM (Transcripts Per Million) – again, truth vs Salmon & Kallisto, this time in one figure.
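Both tools report TPM directly, but for reference the definition is simple – counts normalised by effective length, then scaled to sum to a million – and the plots use a log2(TPM+1) transform. A quick sketch of the conversion:

```python
import numpy as np

def counts_to_tpm(counts, effective_lengths):
    """Convert raw counts to TPM given per-transcript effective lengths."""
    rate = counts / effective_lengths       # reads per base of transcript
    return rate / rate.sum() * 1e6          # scale so the TPMs sum to a million

# the values plotted below are log2(TPM + 1)
counts = np.array([0.0, 150.0, 2000.0])
eff_len = np.array([500.0, 1500.0, 2000.0])
print(np.log2(counts_to_tpm(counts, eff_len) + 1))
```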

Transcripts per million comparison. Graph is of log2(TPM+1).

EDIT THE SECOND:

Further feedback from Rob suggested I use the non-bias-corrected results from Salmon. This has a pretty significant effect on the results. The revised plots are included below. The Salmon help does mention that bias correction is experimental…

Non-Bias Corrected Salmon counts vs Ground Truth

TPM comparison with non-bias corrected Salmon results

I’m a bioinformatician

Navel-gazing: this opinion piece, called “Who qualifies to be a bioinformatician?”, seems to have prompted rather a lot of it. I’m breaking my extended silence to add my two-pennyworth on what being a bioinformatician means to me:

  • Being a bioinformatician means being a biologist, programmer, sysadmin, statistician and grief counsellor, rolled into one
  • Being a bioinformatician means writing thousands of lines of glue code in [scripting language of choice]
  • Being a bioinformatician means teaching experienced scientists something new, and getting to see the dawning realization that it might just be useful
  • Being a bioinformatician means jumping through 112 hoops to compile the latest and greatest tool, just to find it segfaults on anything other than the test data
  • Being a bioinformatician means embarking on what seems like a simple job, only to find six weeks later you’ve written yet another short-read aligner
  • Being a bioinformatician means crafting an exquisite pipeline that has to be subtly changed with each run because every dataset is a special little flower that needs bespoke treatment
  • Being a bioinformatician means writing yet another hacky data munging script that will break on the 32,356th line of the poorly defined, exception riddled, lumpen slurry of an input file you’re having to deal with this time
  • Being a bioinformatician means learning that Excel is an acceptable interoperability format, whether you like it or not (I don’t)
  • Being a bioinformatician means knowing enough biology, computing and statistics to be looked down on by purists in all three disciplines
  • Being a bioinformatician means playing a key role in an unparalleled range of exciting, cutting edge research
  • Being a bioinformatician means being part of an open, collaborative worldwide community who are genuinely supportive and helpful

Now, this list may be a little flippant in places — but it is intended to make a point. There are no hard and fast rules about what a bioinformatician is and isn’t; the label will mean different things to different people. But what it does involve is an unusually wide skill set, usually hard-won over many years, and the knowledge of when and where to apply those skills. It definitely doesn’t involve looking down on hardworking practitioners in the field purely because they don’t fit your elitist mould — the only thing that is likely to do is exclude people who are interested in the field but don’t fit your preconceived ideals.

If you want to let me know what being a bioinformatician means to you, feel free to comment below.

Housekeeping

After years of renting a VM from first Slicehost, and then Rackspace, I’ve finally taken the decision to change my web hosting arrangements. This site is now hosted at wordpress.com. I’m in the process of tidying things up and consolidating what was a confusing tangle of a web presence that had evolved over a number of years.

The move to a hosted blog, rather than self-hosting, means that stuff is certain to be broken in older posts which rely on plugins I’m no longer able to deploy.

I’m not promising a massive uptick in activity or anything, but at least things should be a bit more organised now.

Bioinformatics Community Building in Newcastle

Ooohhh look, a blog post. Not seen one of those in a while around these parts…

I’ve been rather busy since taking on my new job (which I guess I should now refer to just as ‘my job’). Hence the lack of posts for quite some time. Mostly I’ve been focussed on making sure the Bioinformatics Support Unit continues to run smoothly, but like anyone in a new job, I’ve also wanted to make my mark by changing the way things operate a little bit. So this year we’ve run a proper training course for the first time, for instance. I’m also trying to make the unit more central to the way bioinformatics is done throughout the faculty, by establishing and running the Newcastle University Bioinformatics Special Interest Group. The aim of this group is to foster communication between bioinformaticians working at the University, and hopefully establish some sort of mutually supportive local geek community. The first meeting took place a couple of weeks ago, and I wrote it up for the Special Interest Group blog, but I thought I would reproduce that post here as well. The remainder of this post is taken from that site, with permission of the author (me).

In the first Bioinformatics Special Interest Group meeting, we heard a talk from Dr Andrew “Harry” Harrison entitled ‘On the causes of correlations seen between probe intensities in Affymetrix GeneChips’.

Harry started his talk with a brief overview of the Affymetrix microarray platform, including the important observation (as will become obvious later) that the distance between full length probes on the surface of a GeneChip is around 3nm. Full length probes are around 20nm long, so there is plenty of scope for adjacent probes to interact with one another. Also reviewed was the progress made in the summarisation of probe information from GeneChips into probeset observations per gene.

The biggest uncertainty in GeneChip analysis is how to merge all the probe information for one gene [cite]10.1186/1471-2105-8-195[/cite]

The Affymetrix-developed MAS5.0 algorithm [cite]10.1093/bioinformatics/18.12.1585[/cite], which takes the Tukey bi-weighted mean of the difference in logs of PM and MM probes, was swiftly shown to be outperformed by methods developed in academia once Affymetrix released data that could be used to develop other summarisation algorithms – in particular dChip [cite]10.1007/0-387-21679-0_5[/cite], RMA [cite]10.1093/biostatistics/4.2.249[/cite] and GCRMA [cite]10.1198/016214504000000683[/cite], which take into account systematic hybridisation patterns (i.e. the fact that some probes are “stickier” than others).

Finally, for his introductory segment, Harry also mentioned the “curse of dimensionality” – the fact that high-throughput ‘omics experiments make tens of thousands of measurements, and identifying the small but significant differences that express what’s going on in the biology suffers from an enormous multiple-testing problem. Therefore, we want to be sure that those things we are measuring are truly indicative of the underlying biology.

For the main portion of his talk, Harry went on to detail a number of features of GeneChip data that mean the correlations we measure using this technology may not be due entirely to biology. This was split into four sections, each with their own conclusions.

Section 1

Different probesets mapping to the same gene may not always be up- and down-regulated together [cite]10.1186/1471-2105-8-13[/cite]. The obvious explanation for this is that probes map to different exons, and alternative splicing means that differing probes may be differentially regulated, even if they map to the same gene. The follow-on suggestion from this is that while genes come in pieces, exons do not, and the exon can be considered the ‘atomic unit’ of transcription.

Conclusions: Exons need to be considered and classified separately. We should be careful of assumptions that contradict known biology.

Section 2

By examining correlations across >6,000 GeneChips (HG-U133A, from experiments that are publicly available in the Gene Expression Omnibus), the causes of coherent signals across these experiments can be investigated. Colour map correlation plots can show at a glance the relationships between the probes in many probesets, and anomalous probesets can be easily targeted for investigation. One such probeset (209885_at) looked like it was showing splicing in action (3 of 11 probes clearly did not correlate with the remainder of the probeset across the arrays in GEO), but on further investigation it was found that all the probes in the probeset mapped to the same exon. Another probeset (31846_at) that also mapped to the same exon showed a very similar pattern. By investigating the correlation of all of the probes in the 2 probesets, Harry clearly demonstrated that those 4 outlier probes correlated with one another, even though they did not correlate with any of the other probes.

The probes in the 2 probesets under investigation (centre panel, red bars) can clearly be seen to all be located in the final exon of the RHOD gene (top panel) on chromosome 11, in spite of the fact that the Affy annotation (bottom panel) has the probesets annotating the entire gene.

Further investigation showed that all 4 of these outlier probes contain long (4 or more) runs of guanine in their sequence. Harry showed that if you compare all probes with runs of guanine, you find more correlation than you would expect, and the more Gs, the better the correlation. A possible explanation was provided: the runs of Gs found in the probes could lead to G-quadruplexes being formed between adjacent probes on the GeneChip surface. This would mean that any RNA molecule with a run of Cs could hybridise to the remaining free probes, and with a much greater affinity than at normal spots on the array, due to a much lower effective probe density in that spot (see [cite]10.1093/bib/bbp018[/cite] for more details on the physics of this).

Conclusions: Probes containing runs of 4 or more guanines are correlated with one another, and therefore are not measuring the expression of the gene they represent. It is proposed that the signals of these probes should be ignored when analysing a GeneChip experiment.

Section 3

Probes that contain the sequence GCCTCCC are, just like probes containing runs of guanine, more correlated with one another than you might expect them to be (see picture below, taken from [cite]10.1093/bfgp/elp027[/cite]). The proposed reason for this is that this sequence will hybridize to the primer spacer sequence that is attached to all aRNA prior to hybridizing to the GeneChip.

 

Probe correlations

Pairwise correlations for probes containing GCCTCCC. From Briefings in Functional Genomics and Proteomics (2009) 8 (3): 199-212.

Conclusions: Probes containing the complementary sequence to the primer spacer are probably not measuring gene expression. As with GGGG probes, they should be ignored in analysis.
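Taken together, sections 2 and 3 imply a simple sequence-based filter on probes before summarisation. A minimal sketch of that idea (my own illustration, not the custom-CDF machinery mentioned in the resources below):

```python
import re

# Flag probes whose sequence suggests they are not measuring their target:
# runs of four or more guanines (candidate G-quadruplex formation between
# adjacent probes) or the GCCTCCC motif complementary to the primer spacer.
SUSPECT = re.compile(r"GGGG|GCCTCCC")

def keep_probe(sequence):
    """Return True if the probe sequence passes the filter."""
    return SUSPECT.search(sequence.upper()) is None

probes = [
    "ATCGATCGGGGGATCGATCGATCGA",  # run of Gs - drop
    "ATCGCCTCCCTTAGGACCATTGGCA",  # primer-spacer complement - drop
    "ATCGATTTAGCCATTAGGCCATTAG",  # keep
]
print([keep_probe(p) for p in probes])  # [False, False, True]
```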

Section 4

In the final section of his talk, Harry focussed on physical reasons for correlations between probes, showing that many probes show a correlation purely because they are found adjacent to very bright probes [cite]10.2202/1544-6115.1590[/cite]. So their correlated measurements are almost entirely due to poor focussing on the instrument capturing the image of the array. It can be shown that sharply focussed arrays have big values right next to small values, whereas poorly focussed arrays will have smaller differences between adjacent spots, because the large values have some of their intensity falling into their small neighbours. Harry also showed that you can use this objective measure to show the “quality” of a particular array scanner, and how it changes over time (since the scanner ID is contained within the metadata in a CEL file).
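One crude way to see what Harry means (my own toy illustration, not the method from the cited paper): a sharpness proxy such as the mean absolute difference between neighbouring spot intensities is large for a well-focussed image, and shrinks as blur spreads intensity into neighbouring spots.

```python
import numpy as np

def neighbour_contrast(image):
    """Mean absolute difference between horizontally and vertically adjacent spots."""
    dx = np.abs(np.diff(image, axis=1)).mean()
    dy = np.abs(np.diff(image, axis=0)).mean()
    return (dx + dy) / 2

rng = np.random.default_rng(0)
sharp = rng.exponential(1000, size=(100, 100))      # stand-in for a crisp scan
blurred = (sharp + np.roll(sharp, 1, axis=1)) / 2   # crude blur: bleed into neighbours
print(neighbour_contrast(sharp) > neighbour_contrast(blurred))  # True: blur lowers contrast
```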

Conclusions: There is evidence that many GeneChip images are blurred. This blurring can confound the measurement of biology that you are trying to take in your experiment.

The take home message from Harry’s engaging and thought-provoking talk is that the analysis of high-throughput experiments like those using Affymetrix GeneChips cannot happen in isolation. The things we can learn from considering the statistics and bio-physics (among other things) of these experiments can be invaluable in interpreting the data.

Further resources:

One of the questions after the talk asked how to generate custom CDFs for removing the problematic probes that Harry highlighted during his talk. The answer was to use a tool like Xspecies (NASC) for achieving this.

Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

Announcing a Bioinformatics Kblog writeathon

(Reposted from Knowledgeblog.org)

The Knowledgeblog team is holding a ‘writeathon’ to produce content for a tutorial-focused bioinformatics kblog.

The event will be taking place in Newcastle on the 21st June 2011.  We’re looking for volunteer contributors who would like to join us in Newcastle on the day, or would like to contribute tutorial material remotely to the project.

We will be sending invites shortly to a few invited contributors, but are looking for 15 to 20 participants in total.

Travel and accommodation costs (where appropriate) can be reimbursed.

If you would like to contribute tutorial material on microarray analysis, proteomics, next-generation sequencing, bioinformatics workflow development, bioinformatics database resources, network analysis or data integration, and receive a citable DOI for your work, please get in touch with us at admin@knowledgeblog.org.

For more information about Knowledgeblog please see http://knowledgeblog.org.  For examples of existing Knowledgeblogs please see http://ontogeneis.knowledgeblog.org and http://taverna.knowledgeblog.org.

Automatic citation processing with Zotero and KCite

Writing papers. It’s a pain, right? Journals are finicky about formatting. You write the content and then the journal wants you to make it look right. You finally get the content in the right shape and then they tell you that you’ve formatted the bibliography wrong. Your bibliography is clearly in Harvard format, when the journal only accepts papers where the bibliography is formatted Chicago style. Another hour or two of spitting and cursing as you try to massage the citations and bibliography into the “correct” format. You’re not even allowed to cite everything you want to, because the internet is clearly so untrusted a resource.

I’m of the opinion that publishing should be lightweight, the publishers should get out of the way of the author’s process, not actively get in the way. Working on the Knowledgeblog project has only reinforced this opinion. Why should I spend days formatting the content, when any web content management system (CMS) worth its salt will take raw content and format it in a consistent way? Why should I process all the citations and format the bibliography when it should be (relatively) simple to do this in software? Why should I spend time producing complicated figures that compromise what I am able to show when data+code would give the reader far more power to visualise my results themselves?

This document is written in Word 2007 on a Windows 7 virtual machine. On this virtual machine I have also installed Standalone Zotero. The final piece of this particular jigsaw is a Citation Style Language (CSL) style document I wrote (you can download it from the Knowledgeblog Google Code site) that formats a citation in such a way that KCite, Knowledgeblog’s citation engine, can understand it. Now, when I insert citations into my Word document via the Zotero Add-In, I can pick the “KCite” style from the list, and the citation is popped into my document. Now when I hit “Publish” in Word, the document is pushed to my blog, KCite sees the citation as added by Zotero, and processes it, producing a nicely formatted bibliography. We are working on the citeproc-js implementation that means the reader can format this bibliography any way they choose (Phil has a working prototype of this). The biggest current limitation is that your Zotero library entry must have a DOI in it for everything to join up.

So, here is a paragraph with some (contextually meaningless) citations in it [cite]10.1006/jmbi.1990.9999[/cite]. All citations have been added into the Word doc via Zotero, and processed in the page you’re viewing by KCite [cite]10.1073/pnas.0400782101[/cite]. Adding a reference into the document from your Zotero library takes 3-4 clicks, no further processing is needed [cite]10.1093/bioinformatics/btr134[/cite].

Other popular reference management tools, such as Mendeley and Papers, also use CSL styles to format citations and bibliographies, so this same style could be employed to enable KCite referencing with those tools as well. This opens up a wide range of possible tool chains for effective blogging. Mendeley + OpenOffice on Ubuntu. Papers + TextMate on OS X (Papers can be used to insert citations into more than just office suite documents, more on that in a later post). The possibilities are broad (but not endless, not yet anyway). Hopefully this means many people’s existing authoring toolchain is already fully supported by Knowledgeblog.

Image credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/ (Sybren Stüvel on Flickr)

CASE PhD studentship in Bioinformatics available

I’m delighted to announce we’re offering a PhD studentship, commencing in October. I’ve spent most of my time on the Ondex project building an integrated network focussed on drug repositioning (see [cite source=’doi’]10.2390/biecoll-jib-2010-116[/cite]). I’m very excited that we’ve managed to secure a CASE studentship, in collaboration with Philippe Sanseau at GSK, to continue and considerably extend this work. I think this is a very exciting opportunity. Full details below.

Where? – Newcastle University – School of Computing Science

What? – Development of Novel Computational Approaches to Mine Integrated Datasets for Drug Repurposing Opportunities

The blurb

We invite applications for a CASE PhD studentship in Bioinformatics at Newcastle University in the North East of England. The project is a 3-year EPSRC PhD sponsored by GlaxoSmithKline (GSK) and involves the development of novel methods of finding new targets for existing drugs using data integration.

Ondex is a data integration computational platform for Systems Biology (SB). The student will research the optimization and application of Ondex integrated datasets to the identification of repurposing opportunities for existing compounds with a particular, but not exclusive, focus in the infectious diseases therapeutic area. The student will also use the dataset to explore the interplay between microbial targets and perturbations in the metabolic and community structure of the human gut microbiome.

An ideal student will have a background in computing science, good programming skills (preferably in Java) and an interest in biology and bioinformatics. Applicants should also possess an upper second class undergraduate degree. Only students who meet the EPSRC home student requirements are eligible for the full award; other EU students are eligible for fees-only support. Students from outside the EU are not eligible to apply – please see the EPSRC website for details.

The studentship will start in October 2011, jointly supervised by Prof. Anil Wipat and Dr. Simon Cockell at Newcastle University, and Dr. Philippe Sanseau at GSK. The student will spend at least three months at GSK in Stevenage as part of the project. Home students are eligible for payment of full fees and an enhanced stipend of approximately £18,000 tax free. To apply, please send an email to [anil dot wipat at ncl dot ac dot uk] with a CV (including the contact details of at least two referees) and a cover letter indicating your suitability for the position. Please include “Application CASE PhD” in the subject of the email. Applications will be dealt with as they arrive – there is no closing date.



The Problem with DOIs

This article was jointly authored by Phillip Lord and Simon Cockell.

Rhodopsin is a protein found in the eye, which mediates low-light-level vision. It is one of the 7-transmembrane domain proteins and is found in many organisms, including humans.

Rhodopsin has a number of identifiers attached to it, which allow you to get additional data about the protein. For instance, the human version is identified by the string “OPSD_HUMAN” in UniProt. If you wish, you can go to http://www.uniprot.org/OPSD_HUMAN and find additional information. Actually, this URI redirects to http://www.uniprot.org/P08100.html. P08100 is an alternative (semantic-free) identifier for the same protein; P08100 is called the accession number and it is stable, as you can read in the user manual. If you don’t like the HTML presentation, you can always get the traditional structured text so beloved of bioinformatics; this is at http://www.uniprot.org/P08100.txt. Or the UniProt XML (that is at http://www.uniprot.org/P08100.xml). Or http://www.uniprot.org/P08100.rdf if you want RDF. If you just want the sequence, that is at http://www.uniprot.org/P08100.fasta, or http://www.uniprot.org/P08100.gff if you want the sequence features. You might be worried about changes over time, in which case you can see all at http://www.uniprot.org/uniprot/P08100?version=*. Or if you are worried about changes in the future, then http://www.uniprot.org/uniprot/P08100.rss?version=* is the place to be. Obviously, if you want to move outward from here to the DNA sequence, or a report about the protein family, or any of the domains, then all of that is linked from here. If you don’t want to code this for yourself, there are libraries in Perl, Python and Java which will handle these forms of data for you.
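To labour the point, fetching any of these representations is a one-liner over plain HTTP, using exactly the URL patterns above (whether they keep resolving in years to come is up to UniProt, of course):

```python
import urllib.request

# Fetch the same UniProt record (P08100, human rhodopsin) in several formats,
# using the URL patterns described above. Assumes these URLs still resolve
# (or redirect) - that part is in UniProt's hands, not ours.
BASE = "http://www.uniprot.org/P08100"
for ext in ("txt", "xml", "rdf", "fasta", "gff"):
    with urllib.request.urlopen(f"{BASE}.{ext}") as response:
        data = response.read()
    print(ext, len(data), "bytes")
```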

So this might be overkill, but the point is surely clear enough. It’s very easy to get the data in a variety of formats, through stable identifiers. The history is clear, and the future as clear as it can be. The technology is simple, and straightforward for both humans and computers to access. The world of the biologist is a good place to be.

What does this have to do with DOIs? Let’s consider a selection of publications from one of us. Of course, one of the nice things about DOIs is that you can convert them into URIs. But what do they point to? Well, a variety of different things. Maybe the full HTML article. Or perhaps an HTML abstract and a picture of the front page. Or more links. Or, bizarrely, a list of the author biographies. Or just another image of a print-out of the front page of an identified digital object.

These are a selection from our conference and journal publications. Obviously, this doesn’t cover many of our conference papers, as most don’t have DOIs unless they are published by a big publisher. Or our books. These are published by big publishers, but obviously they are books which is different. I’ve also organised or been on the PC for a number of workshops. They don’t have DOIs either. All of them do have URIs.

In no case can we guarantee that what we see today will be the same as what we get tomorrow, even though DOIs are supposedly persistent. The presentation of the HTML on those pages that display HTML is wildly different; in many cases, there is no standard metadata. Given the DOI, there doesn’t appear to be a standard way to get hold of the metadata. If you poke around really hard on the DOI website, you may get to http://www.doi.org/tools.html. At this point, you probably already know about http://dx.doi.org, which allows you to resolve a DOI through HTTP. The list of links doesn’t take that long to work through, so you might eventually get to http://www.crossref.org. From here, you can perform searches, including extracting metadata for articles; obviously, you need to register, and you need an API key for this. It doesn’t always work, so if that fails, you can try http://www.pubmed.org, which returns metadata for some DOIs that CrossRef doesn’t, but doesn’t hold a DOI for every publication it lists (even those that have them), so it also fails in unpredictable ways.
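By way of contrast, about the only thing you can reliably do with a DOI programmatically is follow the http://dx.doi.org redirect and see where you end up. The sketch below just reports the final URL and content type, which – as described above – varies wildly from publisher to publisher.

```python
import urllib.request

def resolve_doi(doi):
    """Follow the dx.doi.org redirect for a DOI and report where it lands."""
    with urllib.request.urlopen(f"http://dx.doi.org/{doi}") as response:
        return response.geturl(), response.headers.get("Content-Type")

# one of the DOIs cited elsewhere in this post; note that some publishers
# refuse requests from scripts, which rather proves the point
print(resolve_doi("10.1093/bioinformatics/btr134"))
```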

The difference between the two situations couldn’t really be clearer. Within biology, we have an open, accessible and usable system. With DOIs, we don’t. The DOI handbook spends an awful lot of time describing the advantages of DOIs for publishers; very little is spent on the advantages for the people generating and accessing the content. It is totally unclear to us what use case DOIs are trying to address from our point of view; whatever it is, they certainly seem to fail in their purpose.

So, why do we care about this? Well, recently, we have been implementing DOIs for kblogs. Ontogenesis articles now all have DOIs. When we were originally thinking about kblogs, our investigations on how to mint new DOIs came to very little. If DOIs are hard to use, creating them is even worse: you need a Registration Authority, and setting this up within a university would be a nightmare. Compare this to the £9 credit card transaction required for a domain name (even this can be quite hard in a University setting!). In the end, we have managed to achieve this using DataCite. Ironically, they are misusing technology intended for articles to represent data; we are misusing DataCite to represent articles again. We also have to keep a hard record of our own of the DOIs we have minted because, despite the fact that all this information is stored in the DataCite database, there is no way of discovering if a DOI points at a given URL using the DataCite API, so we have no way of doing a reverse lookup from a blog post to discover its DOI.

We’ve also created a referencing system for WordPress. This does DOI lookups for the user, currently using CrossRef or PubMed. We are not sure yet whether we can retrieve DataCite metadata in this way also.

The irony of this is that it is all totally pointless. WordPress already creates permalinks, based on a URI. These URIs are trackback/pingback capable so can be used bi-directionally. We have added support so that URIs maintain their own version history, so that you can see all previous versions. If you do not trust us, or if we go away, then URIs are archived and versioned by the UK Web archive. Currently, we are adding features for better metadata support, which will use a simple REST style API like Uniprot. Hopefully, multiple format and subsection access will follow also.

So, why are we using DOIs at all? For the same reason as DataCite, which has as one of its aims “to increase acceptance of research data as legitimate, citable contributions to the scientific record”. We need DOIs for kblog because, although DOIs are pointless, they have become established, they are used for assigning credit, and they are used as a badge of worth. We find it unfortunate that, in the process of using DOIs, we are supporting their credentials as a badge of worth, but it seems the course of least resistance.

Blogging with KCite – a real world test

In my last post I introduced the latest output from the Knowledgeblog project, the KCite plugin for adding citations and bibliographies to blog posts. In this post, I’m using the plugin to add citations to the introduction from one of my papers. The paper is “An integrated dataset for in silico drug discovery”, published last year in the Journal of Integrative Bioinformatics under an unspecified “Open Access” license [cite source=’doi’]10.2390/biecoll-jib-2010-116[/cite].

1. Introduction

The drug development process is increasing in cost and becoming less productive. In order to arrest the decline in the productivity curve, pharmaceutical companies, biotechnology companies and academic researchers are turning to systems biology approaches to discover new uses for existing pharmacotherapies, and in some cases, reviving abandoned ones [cite]10.1038/nrd2265[/cite]. Here, we describe the use of the Ondex data integration platform for this purpose.

1.1 Drug Repositioning

There is recognition in the pharmaceutical industry that the current paradigm of research and development needs to change. Drugs based on novel chemistry still take 10-15 years to reach the market, and development costs are usually between $500 million and $2 billion [cite]10.1016/S0167-6296(02)00126-1[/cite] [cite]10.1377/hlthaff.25.2.420[/cite]. Most novel drug candidates fail in or before the clinic, and the costs of these failures must be borne by the companies concerned. These costs make it difficult even for large pharmaceutical companies to bring truly new drugs to market, and are completely prohibitive for publicly-funded researchers. An alternative means of discovering new treatments is to find new uses for existing drugs or for drug candidates for which there is substantial safety data. This repositioning approach bypasses the need for many of the pre-approval tests required of completely new therapeutic compounds, since the agent has already been documented as safe for its original purpose [cite]10.1038/nrd1468[/cite].

There are a number of examples where a new use for a drug has been discovered by a chance observation. New uses have been discovered for drugs from the observation of interesting side-effects during clinical trials, or by drug administration for one condition having unintended effects on a second. Sildenafil is probably the best-known example of the former; this drug was developed by Pfizer as a treatment for angina and hypertension; during clinical trials, the serendipitous discovery was made that the drug was a potential treatment for erectile dysfunction in men. The direction of research was changed and sildenafil was renamed “Viagra” [cite]10.1056/NEJM199805143382001[/cite].

In order that a systematic approach may be taken to repositioning, a methodology that is less dependent on chance observation is required for the identification of compounds for alternative use. For instance, duloxetine (Cymbalta) was originally developed as an antidepressant, and was postulated to be a more effective alternative to selective serotonin reuptake inhibitors (SSRIs) such as fluoxetine (Prozac). However, a secondary indication, as a treatment for stress urinary incontinence, was found by examining its mode of action [cite source=’pubmed’]7636716[/cite].

Performing such an analysis on a drug-by-drug basis is impractical, time consuming and inappropriate for systematic screens. Nevertheless, such a re-screening approach, in which alternative single targets for existing drugs or drug candidates are sought by simple screening, has been attempted by Ore Pharmaceuticals [cite]10.1007/s00011-009-0053-3[/cite]. Systems biology provides a complementary method to manual reductionist approaches, by taking an integrated view of cellular and molecular processes. Combining data integration technology with systems approaches facilitates the analysis of an entire knowledgebase at once, and is therefore more likely to identify promising leads. This general approach, of using Systems approaches to search for repositionable candidates, is also being developed by e-Therapeutics plc and others exploring Network Pharmacology [cite]10.1038/nchembio.118[/cite]. However, network pharmacology differs from the approach we set out here, by examining the broadest range of the interventions in the proteome caused by a molecule, and using complex network analysis to interpret these in terms of efficacy in multiple clinical indications.

1.2 The Ondex data integration and visualisation platform

Biological data exhibit a wide variety of technical, syntactic and semantic heterogeneity. To use these data in a common analysis regime, the differences between datasets need to be tackled by assigning a common semantics. Different data integration platforms tackle this complicated problem in a variety of ways. BioMart [cite]10.1093/nar/gkp265[/cite], for instance, relies on transforming disparate database schema into a unified Mart format, which can then be accessed through a standard query interface. On the other hand, systems such as the Distributed Annotation System (DAS) take a federated approach to data integration; leaving data on multiple, distributed servers and drawing it together on a client application to provide an integrated view [cite]10.1186/1471-2105-8-333[/cite].

Ondex is a data integration platform for Systems Biology [cite]10.1093/bioinformatics/btl081[/cite], which addresses the problem of data integration by representing many types of data as a network of interconnected nodes. By allowing the nodes (or concepts) and edges (or relations) of the graph to be annotated with semantically rich metadata, multiple sources of information can be brought together meaningfully in the same graph. So, each concept has a Concept Class, and each relation a Relation Type. In this way it is possible to encode complex biological relationships within the graph structure; for example, two concepts of class Protein may be joined by an interacts_with relation, or a Transcription Factor may be joined to a Gene by a regulates relation. The Ondex data structure also allows both concepts and relations to have attributes, accessions and names. This feature means that almost any information can be attached to the graph in a systematic way. The parsing mechanism also records the provenance of the data in the graph. Ondex data is stored in the OXL data format [cite]10.2390/biecoll-jib-2007-62[/cite], a custom XML format designed for the exchange of integrated datasets, and closely coupled with the design of the data structure of Ondex.
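The flavour of that data model is easy to sketch – the toy below illustrates the concepts-and-relations idea only, and is not Ondex’s actual API or the OXL format (the example identifiers are just for illustration):

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str
    concept_class: str                    # e.g. Protein, Gene, Transcription Factor
    accessions: dict = field(default_factory=dict)
    attributes: dict = field(default_factory=dict)

@dataclass
class Relation:
    source: Concept
    target: Concept
    relation_type: str                    # e.g. interacts_with, regulates
    attributes: dict = field(default_factory=dict)

# Two proteins joined by interacts_with, and a transcription factor joined
# to a gene by regulates (identifiers are illustrative only).
p1 = Concept("OPSD_HUMAN", "Protein", accessions={"UniProt": "P08100"})
p2 = Concept("GNAT1_HUMAN", "Protein")
tf = Concept("CRX", "Transcription Factor")
gene = Concept("RHO", "Gene")

graph = [
    Relation(p1, p2, "interacts_with", attributes={"provenance": "example parser"}),
    Relation(tf, gene, "regulates"),
]
for r in graph:
    print(f"{r.source.name} -[{r.relation_type}]-> {r.target.name}")
```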

The Ondex framework therefore combines large-scale database integration with sequence analysis, text mining and graph-based analysis. The system is not only useful for integrating disparate data, but can also be used as a novel analysis platform.

Using Ondex, we have built an integrated dataset of around 120,000 concepts and 570,000 relations to visualise the links between drugs, proteins and diseases. We have included information from a wide variety of publicly available databases, allowing analysis on the basis of: drug molecule similarity; protein similarity; tissue specific gene expression; metabolic pathways and protein family analysis. We analysed this integrated dataset to highlight known examples of repositioned drugs, and their connectivity across multiple data sources. We also suggest methods of automated analysis for discovery of new repositioning opportunities on the basis of indicative semantic motifs.