Referencing

Automatic citation processing with Zotero and KCite

Writing papers. It’s a pain, right? Journals are finicky about formatting. You write the content and then the journal wants you to make it look right. You finally get the content in the right shape and then they tell you that you’ve formatted the bibliography wrong. Your bibliography is clearly in Harvard format, when the journal only accepts papers where the bibliography is formatted Chicago style. Another hour or two of spitting and cursing as you try to massage the citations and bibliography into the “correct” format. You’re not even allowed to cite everything you want to, because the internet is clearly so untrusted a resource.

I’m of the opinion that publishing should be lightweight: publishers should get out of the author’s way, not actively get in it. Working on the Knowledgeblog project has only reinforced this opinion. Why should I spend days formatting the content, when any web content management system (CMS) worth its salt will take raw content and format it in a consistent way? Why should I process all the citations and format the bibliography when it should be (relatively) simple to do this in software? Why should I spend time producing complicated figures that compromise what I am able to show, when data+code would give the reader far more power to visualise my results themselves?

This document is written in Word 2007 on a Windows 7 virtual machine, on which I have also installed Zotero Standalone. The final piece of this particular jigsaw is a Citation Style Language (CSL) style document I wrote (you can download it from the Knowledgeblog Google Code site) that formats a citation in such a way that KCite, Knowledgeblog’s citation engine, can understand it. When I insert citations into my Word document via the Zotero Add-In, I can pick the “KCite” style from the list, and the citation is popped into my document. When I hit “Publish” in Word, the document is pushed to my blog, and KCite recognises the citation that Zotero added and processes it, producing a nicely formatted bibliography. We are working on a citeproc-js implementation that will let the reader format this bibliography any way they choose (Phil has a working prototype of this). The biggest current limitation is that your Zotero library entry must have a DOI in it for everything to join up.

So, here is a paragraph with some (contextually meaningless) citations in it [cite]10.1006/jmbi.1990.9999[/cite]. All of the citations were added to the Word document via Zotero, and are processed by KCite in the page you’re viewing [cite]10.1073/pnas.0400782101[/cite]. Adding a reference into the document from your Zotero library takes three or four clicks; no further processing is needed [cite]10.1093/bioinformatics/btr134[/cite].
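For the curious, here is a rough sketch, in Python rather than the plugin’s actual PHP, of the kind of processing just described: find the cite shortcodes that the CSL style leaves in the post, fetch metadata for each DOI (here via DOI content negotiation, which is my choice for illustration, not necessarily how KCite fetches its data), and replace the shortcodes with numbered markers plus a bibliography.

import re
import requests  # any HTTP client would do

CITE_RE = re.compile(r"\[cite(?:\s+source='doi')?\](.+?)\[/cite\]")

def fetch_csl_json(doi):
    """Fetch CSL-JSON metadata for a DOI via content negotiation at doi.org."""
    response = requests.get(
        "https://doi.org/" + doi,
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()

def process_post(html):
    """Replace cite shortcodes with [n] markers and append a simple bibliography."""
    dois = []

    def to_marker(match):
        doi = match.group(1)
        if doi not in dois:
            dois.append(doi)
        return "[%d]" % (dois.index(doi) + 1)

    body = CITE_RE.sub(to_marker, html)
    items = []
    for doi in dois:
        meta = fetch_csl_json(doi)
        items.append("<li>%s. doi:%s</li>" % (meta.get("title", ""), doi))
    return body + "<ol>" + "".join(items) + "</ol>"

The real plugin does rather more than this (it also accepts PubMed identifiers, as used later in this post), but the principle is the same.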

Other popular reference management tools, such as Mendeley and Papers, also use CSL styles to format citations and bibliographies, so this same style could be used to enable KCite referencing with those tools as well. This opens up a wide range of possible toolchains for effective blogging: Mendeley + OpenOffice on Ubuntu; Papers + TextMate on OS X (Papers can be used to insert citations into more than just office suite documents; more on that in a later post). The possibilities are broad (but not endless, not yet anyway). Hopefully this means many people’s existing authoring toolchain is already fully supported by Knowledgeblog.

Image credit: http://www.flickr.com/photos/sybrenstuvel/2468506922/ (Sybren Stüvel on Flickr)

Blogging with KCite – a real-world test

In my last post I introduced the latest output from the Knowledgeblog project, the KCite plugin for adding citations and bibliographies to blog posts. In this post, I’m using the plugin to add citations to the introduction from one of my papers. The paper is “An integrated dataset for in silico drug discovery”, published last year in the Journal of Integrative Bioinformatics under an unspecified “Open Access” license [cite source=’doi’]10.2390/biecoll-jib-2010-116[/cite].

1. Introduction

The drug development process is increasing in cost and becoming less productive. In order to arrest the decline in the productivity curve, pharmaceutical companies, biotechnology companies and academic researchers are turning to systems biology approaches to discover new uses for existing pharmacotherapies, and in some cases, reviving abandoned ones [cite]10.1038/nrd2265[/cite]. Here, we describe the use of the Ondex data integration platform for this purpose.

1.1 Drug Repositioning

There is recognition in the pharmaceutical industry that the current paradigm of research and development needs to change. Drugs based on novel chemistry still take 10-15 years to reach the market, and development costs are usually between $500 million and $2 billion [cite]10.1016/S0167-6296(02)00126-1[/cite] [cite]10.1377/hlthaff.25.2.420[/cite]. Most novel drug candidates fail in or before the clinic, and the costs of these failures must be borne by the companies concerned. These costs make it difficult even for large pharmaceutical companies to bring truly new drugs to market, and are completely prohibitive for publicly-funded researchers. An alternative means of discovering new treatments is to find new uses for existing drugs or for drug candidates for which there is substantial safety data. This repositioning approach bypasses the need for many of the pre-approval tests required of completely new therapeutic compounds, since the agent has already been documented as safe for its original purpose [cite]10.1038/nrd1468[/cite].

There are a number of examples where a new use for a drug has been discovered by a chance observation. New uses have been found through interesting side-effects observed during clinical trials, or when a drug administered for one condition has had unintended effects on a second. Sildenafil is probably the best-known example of the former: the drug was developed by Pfizer as a treatment for pulmonary arterial hypertension, but during clinical trials the serendipitous discovery was made that it was a potential treatment for erectile dysfunction in men. The direction of research was changed and sildenafil was marketed as “Viagra” [cite]10.1056/NEJM199805143382001[/cite].

In order that a systematic approach may be taken to repositioning, a methodology that is less dependent on chance observation is required for the identification of compounds for alternative use. For instance, duloxetine (Cymbalta) was originally developed as an antidepressant, and was postulated to be a more effective alternative to selective serotonin reuptake inhibitors (SSRIs) such as fluoxetine (Prozac). However, a secondary indication, as a treatment for stress urinary incontinence, was found by examining its mode of action [cite source=’pubmed’]7636716[/cite].

Performing such an analysis on a drug-by-drug basis is impractical, time-consuming and inappropriate for systematic screens. Nevertheless, such a re-screening approach, in which alternative single targets for existing drugs or drug candidates are sought by simple screening, has been attempted by Ore Pharmaceuticals [cite]10.1007/s00011-009-0053-3[/cite]. Systems biology provides a complementary method to manual reductionist approaches by taking an integrated view of cellular and molecular processes. Combining data integration technology with systems approaches facilitates the analysis of an entire knowledgebase at once, and is therefore more likely to identify promising leads. This general strategy of using systems approaches to search for repositionable candidates is also being developed by e-Therapeutics plc and others exploring network pharmacology [cite]10.1038/nchembio.118[/cite]. However, network pharmacology differs from the approach we set out here by examining the broadest range of interventions a molecule makes in the proteome, and using complex network analysis to interpret these in terms of efficacy in multiple clinical indications.

1.2 The Ondex data integration and visualisation platform

Biological data exhibit a wide variety of technical, syntactic and semantic heterogeneity. To use these data in a common analysis regime, the differences between datasets need to be tackled by assigning a common semantics. Different data integration platforms tackle this complicated problem in a variety of ways. BioMart [cite]10.1093/nar/gkp265[/cite], for instance, relies on transforming disparate database schemas into a unified Mart format, which can then be accessed through a standard query interface. On the other hand, systems such as the Distributed Annotation System (DAS) take a federated approach to data integration, leaving data on multiple distributed servers and drawing it together in a client application to provide an integrated view [cite]10.1186/1471-2105-8-333[/cite].

Ondex is a data integration platform for Systems Biology [cite]10.1093/bioinformatics/btl081[/cite], which addresses the problem of data integration by representing many types of data as a network of interconnected nodes. By allowing the nodes (or concepts) and edges (or relations) of the graph to be annotated with semantically rich metadata, multiple sources of information can be brought together meaningfully in the same graph. So, each concept has a Concept Class, and each relation a Relation Type. In this way it is possible to encode complex biological relationships within the graph structure; for example, two concepts of class Protein may be joined by an interacts_with relation, or a Transcription Factor may be joined to a Gene by a regulates relation. The Ondex data structure also allows both concepts and relations to have attributes, accessions and names. This feature means that almost any information can be attached to the graph in a systematic way. The parsing mechanism also records the provenance of the data in the graph. Ondex data is stored in the OXL data format [cite]10.2390/biecoll-jib-2007-62[/cite], a custom XML format designed for the exchange of integrated datasets, and closely coupled with the design of the data structure of Ondex.
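To make the data model concrete, here is a toy sketch in Python; it is not the Ondex API, just an illustration of concepts carrying a Concept Class, accessions and attributes, and relations carrying a Relation Type. The names and accessions are placeholders.

from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str
    concept_class: str                               # e.g. "Protein", "Gene", "Drug"
    accessions: dict = field(default_factory=dict)   # e.g. {"UniProt": "..."}
    attributes: dict = field(default_factory=dict)   # arbitrary key/value data

@dataclass
class Relation:
    source: Concept
    target: Concept
    relation_type: str                               # e.g. "interacts_with", "regulates"
    attributes: dict = field(default_factory=dict)   # evidence, provenance, etc.

# Two concepts of class Protein joined by an interacts_with relation, as in the
# example in the text (names and accessions are placeholders).
a = Concept("ProteinA", "Protein", accessions={"UniProt": "P00001"})
b = Concept("ProteinB", "Protein", accessions={"UniProt": "P00002"})
link = Relation(a, b, "interacts_with", attributes={"provenance": "example parser"})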

The Ondex framework therefore combines large-scale database integration with sequence analysis, text mining and graph-based analysis. The system is not only useful for integrating disparate data, but can also be used as a novel analysis platform.

Using Ondex, we have built an integrated dataset of around 120,000 concepts and 570,000 relations to visualise the links between drugs, proteins and diseases. We have included information from a wide variety of publicly available databases, allowing analysis on the basis of drug molecule similarity, protein similarity, tissue-specific gene expression, metabolic pathways and protein family analysis. We analysed this integrated dataset to highlight known examples of repositioned drugs, and their connectivity across multiple data sources. We also suggest methods of automated analysis for the discovery of new repositioning opportunities on the basis of indicative semantic motifs.
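As an illustration of what an automated motif search might look like, here is a small self-contained sketch in Python; the motif (Drug binds Protein, Disease associated_with Protein) and the relation and class names are assumptions made for the example, not necessarily the motifs used in our analysis.

from collections import namedtuple

# Each edge records its endpoints, their concept classes and the relation type,
# mirroring the concept/relation annotations described earlier.
Edge = namedtuple("Edge", "source source_class relation target target_class")

def find_candidates(edges):
    """Return (drug, protein, disease) triples matching the motif
    Drug -binds-> Protein <-associated_with- Disease."""
    binds = [(e.source, e.target) for e in edges
             if e.relation == "binds"
             and e.source_class == "Drug" and e.target_class == "Protein"]
    assoc = [(e.source, e.target) for e in edges
             if e.relation == "associated_with"
             and e.source_class == "Disease" and e.target_class == "Protein"]
    return [(drug, protein, disease)
            for drug, protein in binds
            for disease, other in assoc
            if other == protein]

# Toy example: drug D binds protein P, and disease X is associated with P,
# so (D, P, X) is flagged as a connection worth examining.
example = [
    Edge("DrugD", "Drug", "binds", "ProteinP", "Protein"),
    Edge("DiseaseX", "Disease", "associated_with", "ProteinP", "Protein"),
]
print(find_candidates(example))  # [('DrugD', 'ProteinP', 'DiseaseX')]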