# Announcing a Bioinformatics Kblog writeathon

(Reposted from Knowledgeblog.org)

The Knowledgeblog team is holding a ‘writeathon’ to produce content for a tutorial-focused bioinformatics kblog.

The event will be taking place in Newcastle on the 21st June 2011.  We’re looking for volunteer contributors who would like to join us in Newcastle on the day, or would like to contribute tutorial material remotely to the project.

We will shortly be sending invitations to a small number of contributors, but are looking for 15 to 20 participants in total.

Travel and accommodation costs (where appropriate) can be reimbursed.

If you would like to contribute tutorial material on microarray analysis, proteomics, next-generation sequencing, bioinformatics workflow development, bioinformatics database resources, network analysis or data integration, and receive a citable DOI for your work, please get in touch with us at admin@knowledgeblog.org.

# The Problem with DOIs

Rhodopsin is a protein found in the eye, where it mediates low-light vision. It is one of the 7-transmembrane domain proteins and is found in many organisms, including humans.

Rhodopsin has a number of identifiers attached to it, which allow you to get additional data about the protein. For instance, the human version is identified by the string “OPSD_HUMAN” in UniProt. If you wish, you can go to http://www.uniprot.org/OPSD_HUMAN and find additional information. Actually, this URI redirects to http://www.uniprot.org/P08100.html. P08100 is an alternative (semantic-free) identifier for the same protein; P08100 is called the accession number and it is stable, as you can read in the user manual. If you don’t like the HTML presentation, you can always get the traditional structured text so beloved of bioinformatics; this is at http://www.uniprot.org/P08100.txt. Or the UniProt XML (that is at http://www.uniprot.org/P08100.xml). Or http://www.uniprot.org/P08100.rdf if you want RDF. If you just want the sequence, that is at http://www.uniprot.org/P08100.fasta, or http://www.uniprot.org/P08100.gff if you want the sequence features. You might be worried about changes over time, in which case you can see all previous versions at http://www.uniprot.org/uniprot/P08100?version=*. Or if you are worried about changes in the future, then http://www.uniprot.org/uniprot/P08100.rss?version=* is the place to be. Obviously, if you want to move outward from here to the DNA sequence, or a report about the protein family, or any of the domains, then all of that is linked from here. If you don’t want to code this for yourself, there are libraries in Perl, Python and Java which will handle these forms of data for you.
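To show just how mechanical this access is, all of the format-specific URLs above can be built from the accession number alone. A minimal sketch (the helper names and format list are my own illustration, not an official UniProt client):

```python
# Sketch of the UniProt URL patterns described above, built from an
# accession number alone. Illustrative only, not an official client.
FORMATS = ("html", "txt", "xml", "rdf", "fasta", "gff")

def uniprot_url(accession, fmt="txt"):
    """Return the URL serving a UniProt entry in the given format."""
    if fmt not in FORMATS:
        raise ValueError("unknown format: %s" % fmt)
    return "http://www.uniprot.org/%s.%s" % (accession, fmt)

def uniprot_history_url(accession):
    """Return the URL listing all versions of an entry over time."""
    return "http://www.uniprot.org/uniprot/%s?version=*" % accession
```

A plain HTTP GET on any of these returns the corresponding representation; no API key, no registration.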

So this might be overkill, but the point is surely clear enough. It’s very easy to get the data in a variety of formats, through stable identifiers. The history is clear, and the future as clear as it can be. The technology is simple, and straightforward for both humans and computers to access. The world of the biologist is a good place to be.

What does this have to do with DOIs? Let’s consider a selection of publications from one of us. Of course, one of the nice things about DOIs is that you can convert them into URIs. But what do they point to? Well, a variety of different things. Maybe the full HTML article. Or, perhaps, an HTML abstract and a picture of the front page. Or more links. Or, bizarrely, a list of the author biographies. Or just another image of a printout of the front page of an identified digital object.

These are a selection from our conference and journal publications. Obviously, this doesn’t cover many of our conference papers, as most don’t have DOIs unless they are published by a big publisher. Or our books; these are published by big publishers, but they are books, which is different. I’ve also organised or been on the PC for a number of workshops. They don’t have DOIs either. All of them do have URIs.

In no case can we guarantee that what we see today will be the same as what we get tomorrow, even though DOIs are supposedly persistent. The presentation of the HTML on those pages that display HTML is wildly different; in many cases, there is no standard metadata. Given the DOI, there doesn’t appear to be a standard way to get hold of the metadata. If you poke around really hard on the DOI website, you may get to http://www.doi.org/tools.html. At this point, you probably already know about http://dx.doi.org, which allows you to resolve a DOI through HTTP. The list of links doesn’t take that long to work through, so you might eventually get to http://www.crossref.org. From here, you can perform searches, including extracting metadata for articles; obviously, you need to register, and you need an API key for this. It doesn’t always work, so if that fails, you can try http://www.pubmed.org, which returns metadata for some DOIs that CrossRef doesn’t, but which doesn’t hold a DOI for every publication it lists (even those that have one), so it also fails in unpredictable ways.
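The lookup dance just described amounts to a fallback chain: try one metadata source, and when it fails (as it unpredictably will), try the next. A sketch of that logic, with the fetchers injected so the chain itself is testable; the function names and structure are my own illustration, not any real client library:

```python
def lookup_metadata(doi, sources):
    """Try each (name, fetch) pair in turn, returning the first hit.

    fetch(doi) should return metadata or None; any exception is
    treated as a miss, since (as described above) both CrossRef and
    PubMed fail in unpredictable ways."""
    for name, fetch in sources:
        try:
            metadata = fetch(doi)
        except Exception:
            continue  # this source failed; fall through to the next
        if metadata:
            return name, metadata
    return None, None
```

In practice the two fetchers would call the CrossRef query API (with an API key) and NCBI eUtils respectively; the point is that a caller needs this fallback machinery at all, where UniProt needs none.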

The difference between the two situations couldn’t really be clearer. Within biology, we have an open, accessible and usable system. With DOIs, we don’t. The DOI handbook spends an awful lot of time describing the advantages of DOIs for publishers; very little is spent on the advantages for the people generating and accessing the content. It is totally unclear to us what use case DOIs are trying to address from our point of view; whatever it is, they certainly seem to fail in their purpose.

So, why do we care about this? Well, recently, we have been implementing DOIs for kblogs. Ontogenesis articles now all have DOIs. When we were originally thinking about kblogs, our investigations into how to mint new DOIs came to very little. If DOIs are hard to use, creating them is even worse: you need a Registration Authority, and setting one up within a university would be a nightmare. Compare this to the £9 credit card transaction required for a domain name (even this can be quite hard in a university setting!). In the end, we have managed to achieve this using DataCite. Ironically, they are misusing technology intended for articles to represent data; we are misusing DataCite to represent articles again. We also have to keep a hard record of our own of the DOIs we have minted because, although all this information is stored in the DataCite database, there is no way of discovering whether a DOI points at a given URL using the DataCite API, so we have no way of doing a reverse lookup from a blog post to discover its DOI.
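The “hard record of our own” amounts to nothing more than a two-way map. A minimal sketch of the reverse lookup that the DataCite API does not offer (the class and its names are illustrative, not our actual implementation; 10.5072 is shown only as an example prefix):

```python
class DoiRegistry:
    """Local record of minted DOIs, kept because the DataCite API
    offers no reverse lookup from a URL back to its DOI."""

    def __init__(self):
        self._doi_to_url = {}
        self._url_to_doi = {}

    def mint(self, doi, url):
        """Record a newly minted DOI and the URL it points at."""
        self._doi_to_url[doi] = url
        self._url_to_doi[url] = doi

    def url_for(self, doi):
        """Forward lookup: the URL a DOI resolves to."""
        return self._doi_to_url.get(doi)

    def doi_for(self, url):
        """Reverse lookup: the DOI for a blog post's URL --
        the operation DataCite cannot do for us."""
        return self._url_to_doi.get(url)
```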

We’ve also created a referencing system for WordPress. This does DOI lookups for the user, currently using CrossRef or PubMed. We are not sure yet whether we can also retrieve DataCite metadata in this way.

The irony of this is that it is all totally pointless. WordPress already creates permalinks, based on a URI. These URIs are trackback/pingback capable, so they can be used bi-directionally. We have added support so that URIs maintain their own version history, so that you can see all previous versions. If you do not trust us, or if we go away, the URIs are archived and versioned by the UK Web Archive. Currently, we are adding features for better metadata support, which will use a simple REST-style API like UniProt’s. Hopefully, multiple-format and subsection access will follow.

So, why are we using DOIs at all? For the same reason as DataCite, which has as one of its aims “to increase acceptance of research data as legitimate, citable contributions to the scientific record”. We need DOIs for kblog because, although DOIs are pointless, they have become established: they are used for assigning credit, and as a badge of worth. We find it unfortunate that, in the process of using DOIs, we are supporting their credentials as a badge of worth, but it seems the course of least resistance.

# Blogging with KCite – a real world test

In my last post I introduced the latest output from the Knowledgeblog project, the KCite plugin for adding citations and bibliographies to blog posts. In this post, I’m using the plugin to add citations to the introduction from one of my papers. The paper is “An integrated dataset for in silico drug discovery”, published last year in the Journal of Integrative Bioinformatics under an unspecified “Open Access” license [cite source=’doi’]10.2390/biecoll-jib-2010-116[/cite].

1. Introduction

The drug development process is increasing in cost and becoming less productive. In order to arrest the decline in the productivity curve, pharmaceutical companies, biotechnology companies and academic researchers are turning to systems biology approaches to discover new uses for existing pharmacotherapies, and in some cases, reviving abandoned ones [cite]10.1038/nrd2265[/cite]. Here, we describe the use of the Ondex data integration platform for this purpose.

1.1 Drug Repositioning

There is recognition in the pharmaceutical industry that the current paradigm of research and development needs to change. Drugs based on novel chemistry still take 10-15 years to reach the market, and development costs are usually between $500 million and $2 billion [cite]10.1016/S0167-6296(02)00126-1[/cite] [cite]10.1377/hlthaff.25.2.420[/cite]. Most novel drug candidates fail in or before the clinic, and the costs of these failures must be borne by the companies concerned. These costs make it difficult even for large pharmaceutical companies to bring truly new drugs to market, and are completely prohibitive for publicly-funded researchers. An alternative means of discovering new treatments is to find new uses for existing drugs or for drug candidates for which there is substantial safety data. This repositioning approach bypasses the need for many of the pre-approval tests required of completely new therapeutic compounds, since the agent has already been documented as safe for its original purpose [cite]10.1038/nrd1468[/cite].

There are a number of examples where a new use for a drug has been discovered by a chance observation. New uses have been discovered for drugs from the observation of interesting side-effects during clinical trials, or by drug administration for one condition having unintended effects on a second. Sildenafil is probably the best-known example of the former; this drug was developed by Pfizer as a treatment for pulmonary arterial hypertension; during clinical trials, the serendipitous discovery was made that the drug was a potential treatment of erectile dysfunction in men. The direction of research was changed and sildenafil was renamed “Viagra” [cite]10.1056/NEJM199805143382001[/cite].

In order that a systematic approach may be taken to repositioning, a methodology that is less dependent on chance observation is required for the identification of compounds for alternative use. For instance, duloxetine (Cymbalta) was originally developed as an antidepressant, and was postulated to be a more effective alternative to selective serotonin reuptake inhibitors (SSRIs) such as fluoxetine (Prozac). However, a secondary indication, as a treatment for stress urinary incontinence, was found by examining its mode of action [cite source=’pubmed’]7636716[/cite].

Performing such an analysis on a drug-by-drug basis is impractical, time consuming and inappropriate for systematic screens. Nevertheless, such a re-screening approach, in which alternative single targets for existing drugs or drug candidates are sought by simple screening, has been attempted by Ore Pharmaceuticals [cite]10.1007/s00011-009-0053-3[/cite]. Systems biology provides a complementary method to manual reductionist approaches, by taking an integrated view of cellular and molecular processes. Combining data integration technology with systems approaches facilitates the analysis of an entire knowledgebase at once, and is therefore more likely to identify promising leads. This general approach, of using Systems approaches to search for repositionable candidates, is also being developed by e-Therapeutics plc and others exploring Network Pharmacology [cite]10.1038/nchembio.118[/cite]. However, network pharmacology differs from the approach we set out here, by examining the broadest range of the interventions in the proteome caused by a molecule, and using complex network analysis to interpret these in terms of efficacy in multiple clinical indications.

1.2 The Ondex data integration and visualisation platform

Biological data exhibit a wide variety of technical, syntactic and semantic heterogeneity. To use these data in a common analysis regime, the differences between datasets need to be tackled by assigning a common semantics. Different data integration platforms tackle this complicated problem in a variety of ways. BioMart [cite]10.1093/nar/gkp265[/cite], for instance, relies on transforming disparate database schema into a unified Mart format, which can then be accessed through a standard query interface. On the other hand, systems such as the Distributed Annotation System (DAS) take a federated approach to data integration; leaving data on multiple, distributed servers and drawing it together on a client application to provide an integrated view [cite]10.1186/1471-2105-8-333[/cite].

Ondex is a data integration platform for Systems Biology [cite]10.1093/bioinformatics/btl081[/cite], which addresses the problem of data integration by representing many types of data as a network of interconnected nodes. By allowing the nodes (or concepts) and edges (or relations) of the graph to be annotated with semantically rich metadata, multiple sources of information can be brought together meaningfully in the same graph. So, each concept has a Concept Class, and each relation a Relation Type. In this way it is possible to encode complex biological relationships within the graph structure; for example, two concepts of class Protein may be joined by an interacts_with relation, or a Transcription Factor may be joined to a Gene by a regulates relation. The Ondex data structure also allows both concepts and relations to have attributes, accessions and names. This feature means that almost any information can be attached to the graph in a systematic way. The parsing mechanism also records the provenance of the data in the graph. Ondex data is stored in the OXL data format [cite]10.2390/biecoll-jib-2007-62[/cite], a custom XML format designed for the exchange of integrated datasets, and closely coupled with the design of the data structure of Ondex.
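The data model just described can be sketched as a small property graph: concepts carry a Concept Class, relations carry a Relation Type, and both can hold arbitrary attributes. This toy version is my own illustration (the class names and the example interaction are not Ondex’s actual API):

```python
# Toy sketch of the Ondex-style data model described above:
# annotated concepts joined by annotated relations.
class Concept:
    def __init__(self, name, concept_class, **attributes):
        self.name = name
        self.concept_class = concept_class  # e.g. "Protein", "Gene"
        self.attributes = attributes        # accessions, provenance, ...

class Relation:
    def __init__(self, source, target, relation_type, **attributes):
        self.source = source
        self.target = target
        self.relation_type = relation_type  # e.g. "interacts_with", "regulates"
        self.attributes = attributes

# Two Protein concepts joined by an interacts_with relation, as in
# the example above (the specific pairing is illustrative).
opsd = Concept("OPSD_HUMAN", "Protein", accession="P08100")
partner = Concept("GNAT1_HUMAN", "Protein")
interaction = Relation(opsd, partner, "interacts_with")
```

Because everything is typed metadata rather than fixed schema, new kinds of data (a Transcription Factor regulating a Gene, say) slot into the same structure without changing the graph machinery.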

The Ondex framework therefore combines large-scale database integration with sequence analysis, text mining and graph-based analysis. The system is not only useful for integrating disparate data, but can also be used as a novel analysis platform.

Using Ondex, we have built an integrated dataset of around 120,000 concepts and 570,000 relations to visualise the links between drugs, proteins and diseases. We have included information from a wide variety of publicly available databases, allowing analysis on the basis of: drug molecule similarity; protein similarity; tissue specific gene expression; metabolic pathways and protein family analysis. We analysed this integrated dataset to highlight known examples of repositioned drugs, and their connectivity across multiple data sources. We also suggest methods of automated analysis for discovery of new repositioning opportunities on the basis of indicative semantic motifs.

# KCite – easy citations in WordPress

For a couple of months now, I’ve been working on a referencing plugin for Knowledgeblog. The idea is to make it easy for authors to add citations to their posts, and have a bibliography produced automatically. Key to this approach (as with everything we’re doing on Knowledgeblog) is enabling authors to use their pre-existing workflow. So, if they are used to writing documents/papers in Word, they should be able to continue using it for writing posts for Knowledgeblog. If, on the other hand, they prefer to write collaboratively using Google Docs, we shouldn’t put unnecessary obstacles in their path, and so on. So the tool that we have produced, called KCite, uses simple text-based tags to process citations. These tags can be added from any platform (they are extremely simple to just type in), and WordPress will interpret them when it renders the post.

There is no attempt to manage references, to create a database and allow selection from that database when adding new citations. This is quite deliberate: researchers already use such tools, which are external to WordPress and, as yet, incompatible with it. By keeping the system as simple (I hope) as possible, citations should be perfectly manageable by copy&paste from a browser or reference manager of your choosing, into the tool of your choosing.

I will publish an example of the plugin in action as a separate post, but in short the idea is that you surround either a DOI or a PMID with a cite shortcode. The plugin queries the CrossRef API or PubMed (via NCBI eUtils) in order to retrieve metadata about each publication, and uses that data to build the bibliography, which is then appended to the foot of the post. As yet this is far from being completely generic, and there will be circumstances where the lookup fails, but I have attempted to handle these situations as gracefully as possible, so hopefully a usable bibliography will be produced in as many cases as possible.
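The scan for cite shortcodes can be sketched with a regular expression; this is my own illustration of the idea (KCite itself is a PHP plugin built on WordPress’s shortcode API, and its actual parsing differs):

```python
import re

# Matches [cite]10.1038/nrd2265[/cite] and
# [cite source='pubmed']7636716[/cite], the two forms described above.
CITE_RE = re.compile(
    r"\[cite(?:\s+source='(?P<source>\w+)')?\](?P<id>[^\[]+)\[/cite\]")

def extract_citations(post_text):
    """Return (source, identifier) pairs; source defaults to 'doi'."""
    return [(m.group("source") or "doi", m.group("id"))
            for m in CITE_RE.finditer(post_text)]
```

Each extracted identifier would then be sent to CrossRef or eUtils, and the returned metadata formatted into the bibliography appended to the post.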

This is a 0.1 release, intended almost as a preview. The plugin is currently nowhere near what we would consider to be feature complete. There are a number of things on my TODO list to address over the next few weeks, but I would welcome feature requests and bug reports. You can follow development, and contact us, through the Google Code page for Knowledgeblog.

# The Taverna Knowledgeblog

Today I am sat in a room with a fairly large group of people, who all work on the Taverna project. They are writing a Knowledgeblog book about the workflow manager, and I am providing help and technical assistance as a part of my role on the Knowledgeblog project. As well as producing a hopefully useful product (a beginner’s guide to Taverna), we are testing some of the procedures and products that we have been working on over the last few months on the project.

Posts on a Knowledgeblog now have several features that were in our plan for the project. Specifically, post revisions are now publicly exposed, providing a public provenance trail and preventing someone from ‘unsaying’ anything without the proper process. The editorial workflow is better defined than it was for Ontogenesis (the Knowledgeblog prototype), meaning that requests for reviews, and the provision of the reviews themselves, should be more streamlined; and, despite today’s face-to-face approach, it doesn’t require all of the collaborators on a publication to be sitting in the same room (for this we are using the excellent EditFlow plugin, which provides ‘editorial comments’ on posts, and can fire email events upon certain pre-defined operations).

Posts can have multiple authors, which, combined with the ability to author posts in genuinely collaborative tools such as Google Docs (as opposed to totally non-collaborative tools like Word documents shared by email, although you can write posts like that too if you like), allows jointly authored posts to be both simple to generate and properly attributed. Finally, easy to generate tables of contents, for both posts and whole sites, makes navigating the content simple.

There are still a number of pieces of the puzzle that need to be slotted into place for us to have a fully functional platform, but I can’t help but feel we’re getting there. As I mentioned, I was here for technical support, and I didn’t really have a massive amount to do today (I spent most of it tinkering with the chosen theme to get it to support CoAuthors Plus).

The next major step will be a plugin, which I am currently writing, to assist with citing papers and generating bibliographies; more on that in a future post. I agree with many of Martin Fenner’s points in his post of a few days ago: citations are not currently well supported by WordPress, or by any plugins so far. I am working on the dynamic generation of citations and bibliographies from specific tags within posts. This should allow for simple management of referencing by authors, and provide a range of tools for readers of articles, such as BibTeX/RIS export and on-the-fly bibliography reformatting.

# Pretty equations in WordPress

We’ve spent a couple of months now on Knowledgeblog since JISC funded the project. My one day a week working on developing the tools and workflows for lightweight publishing has presented totally different challenges to the majority of my work, and I’m really enjoying it so far. Hopefully we’re engaged in building something that a lot of people will find useful in the long run.

Part of what will make the project useful to as many people as possible is the set of incremental goals that we will combine into the whole platform, each of which will hopefully be useful to a lot of folks in its own right. The first of these milestones is MathJax-LaTeX, a plugin for WordPress that renders mathematical equations in as attractive a way as possible.

WP-LaTeX is the usual way to do this in WordPress. This plugin takes inline LaTeX code in blog posts, and converts it into PNG images. These images look good at the default resolution and they do the job, but we thought there might be a better way. Images are not particularly accessible, and they don’t scale very well (as you zoom in on a page, they start to pixelate pretty badly). WP-LaTeX also requires running LaTeX locally, or on a third-party server, which might be undesirable for some people.

I’m aware of MathJax because I used to listen to the Stack Overflow podcast, and Joel and Jeff talked about it in one episode in relation to the Math Overflow site, because it was being leveraged there to render the large quantity of equations that a site like that requires. MathJax is a Javascript library that interprets LaTeX and MathML, and renders it as scalable web fonts inline. The LaTeX that is interpreted remains in the source of the page, and the equations are not images, so they scale perfectly with the rest of the text on the page. So the question is, what’s the best way to use MathJax to render equations in blog posts?

The instructions on the MathJax page tell you to edit the header of your blog theme to introduce the Javascript library on every page of the blog. We thought that using a plugin to inject the Javascript only on the pages it is required would be more efficient (it’s a big library, and you don’t want to load it on every page if you don’t have to). It also allows us to stay compatible with WP-LaTeX, because we can leverage the shortcode API that is a brilliant part of the WordPress environment.
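The decision the plugin makes is simple: only emit the MathJax script tag when a page actually contains maths. A sketch of that check in Python (the real plugin is PHP using WordPress hooks and the shortcode API; the marker patterns and script URL here are illustrative assumptions):

```python
import re

# Signals that a post needs MathJax: a [mathjax] force-load shortcode,
# or WP-LaTeX-style [latex]...[/latex] markup (assumed markers).
MATH_MARKERS = (
    re.compile(r"\[mathjax\]"),
    re.compile(r"\[latex\].*?\[/latex\]", re.S),
)

def needs_mathjax(post_content):
    """True if this post contains any maths marker."""
    return any(p.search(post_content) for p in MATH_MARKERS)

def page_scripts(posts, mathjax_url="http://example.org/MathJax.js"):
    """Return the script URLs to inject for a page of posts:
    the (illustrative) MathJax URL only when some post needs it."""
    if any(needs_mathjax(p) for p in posts):
        return [mathjax_url]
    return []
```

Pages without equations pay nothing; pages with them get the library exactly once, which is the efficiency gain over editing the theme header.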

Well, the MathJax-LaTeX plugin was published this week; you can download it now, and there’s a page on knowledgeblog.org describing in full how it works. If you’ve used WP-LaTeX in the past, MathJax-LaTeX understands the same syntax, so you can replace one with the other, if you wish.

Here are a few examples:

The probability of getting \(k\) heads when flipping \(n\) coins:

$P(E) = {n \choose k} p^k (1-p)^{n-k}$

This is an inline equation: $\sqrt{3x-1}+(1+x)^2$; it should be rendered without affecting the text around it.

OK, one more, the definition of \(e\):

$e = \lim_{n\to\infty} \left( 1 + \frac{1}{n} \right)^n$

[mathjax]

If you want to keep tabs on how Knowledgeblog is developing, you can follow us on Twitter, watch our Google Code repository, and keep an eye on the site.

</plug>

(Photo courtesy of bourgeoisbee on flickr.com)