Reference Genome Graphs

I posted this wondering on Twitter earlier today:

Which elicited precisely no response. So I actually had to do my own research.

sigh.

This is the hacked-together result of that hour or so, presented on the grounds that it may help others, as well as being an aide-mémoire for me.

Motivation

In general, linear representations of reference genomes fail (unsurprisingly) to capture the complexities of populations. It is almost universally the case, though, that the tools we have available to us work with precisely this type of reference.

If we take the example of the human genome – since GRCh37 (Feb ’09), the GRC has attempted to represent some of the population-level complexity of humans by making alternative haplotypes available for regions of high complexity (the MHC locus of chromosome 6 is the canonical example). Church et al (2015) provide a more complete overview of the issues than I could hope to [1].

Despite this recognition of the issues at hand, a flattened representation of the ‘genome graph’ is still predominantly used for downstream mapping applications. With GRCh38 having 178 regions with alt loci (as opposed to 3 in GRCh37), the need for an adequate toolchain becomes more pressing.

The Approach

Church et al point to a GitHub repo that tracks software tools making use of the full GRC assembly. I think its contents speak for themselves.

I’ve found a few useful resources discussing the technology surrounding the use of a reference graph, as opposed to the flattened representation. Kehr et al (2014) discuss the different graph models available for building graph-based alignment tools [2], though the focus on actual implementations is rather historical (the most recent tool referenced was released in 2011).

More up-to-date is the work of Dilthey et al (2015), which looks at improving the representation of the MHC region specifically through the use of reference graphs and hidden Markov models [3]. This work doesn’t seek to tackle a generic approach to read alignment against a genome graph, but we do get a proposal for a useful graph model (the Population Reference Graph, or PRG) and a nice HMM-based method for using data in the MHC region. It’s also unclear to me (from my brief reading of the paper) how well this approach would scale.

From the CS side of things, we get an extension of the Burrows-Wheeler Transform from strings to graphs, courtesy of Sirén et al (2014) [4]. This approach would clearly allow the transformation so widely used in short-read alignment to be adapted to graph-based references.
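
For context, a minimal Python sketch of the transform in the plain-string case is below. This is my own illustration of the classic construction, not the graph extension described in the paper.

def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler Transform of a plain string (quadratic; for illustration only)."""
    text = text + sentinel
    # sort all rotations of the text; the BWT is the last column of the sorted rotation matrix
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("GATTACA"))  # ACTGA$TA - identical characters tend to cluster, which aids compression and indexing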

Then, finally, I came across an implementation. BWBBLE represents a collection of genomes not with a graph structure, but by making use of ambiguity codes and string padding [5]. Huang et al then rely on a conventional BWT to index this string representation of the ‘multi-genome’. The paper also describes an aligner that makes use of this genome representation.
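
To make that idea concrete, here is a toy sketch (mine, not code from the BWBBLE paper) that collapses pre-aligned sequences differing only by SNPs into a single ambiguity-coded string; the published method also handles indels via string padding, which I skip here.

IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M", frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

def collapse(sequences):
    """Collapse equal-length, pre-aligned sequences into one IUPAC ambiguity-coded string."""
    return "".join(IUPAC[frozenset(column)] for column in zip(*sequences))

print(collapse(["GATTACA", "GATCACA", "GACTACA"]))  # GAYYACA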

BWBBLE feels a bit ‘close but no cigar’ – it’s not using a graph to represent a population of genomes and no one is really using it.

Finally, I get to the place where I actually started: HISAT2. I knew that this aligner uses graph-based FM indices to represent the genome (based on the work of Sirén et al mentioned above), and that this makes it possible to represent SNPs and small indels in the reference. I have no idea whether this allows HISAT2 to fully represent alt loci in its reference, though from the descriptions available it seems unlikely. I was pretty impressed by HISAT when it was released [6], but have yet to test HISAT2 in anger.

HISAT was posited as the replacement for TopHat as an RNA-Seq aligner, and HISAT2’s webpage makes it clear the developers intend it to be used in place of both HISAT and TopHat. It is not clear that any such shift is happening (STAR seems to be the de facto TopHat replacement in most people’s RNA-Seq toolchains, at least among those who still rely on alignment and haven’t simply stuck with TopHat).

Conclusions

My perspective on this is that of a complete outsider doing about an hour’s worth of reading, but it seems to me that there is a real paucity of tools allowing for alignment to a graph model of a reference genome. There’s plenty of discussion of the issues, and a recognition that such tools are necessary, but little so far in the way of implementation.

I assume these tools are in the works (as I said, I’m an outsider looking in here, I have no idea who’s developing what).

I’ll leave with this, which is part of what got me wondering about the state of this field in the first place:

I guess 3-5 years is a loooong time in genomics.

EDIT: Clearly I missed a whole bunch of stuff. Many useful comments on Twitter, especially from @ZaminIqbal and @erikgarrison. Most especially, Erik points out the completeness of vg – which:

implements alignment, succinct db of graph (sequences + haplotypes), text/binary formats, visualization, lossless RDF transformation, annotation, variant calling, graph statistics, normalization, long read aln/assembly, sequence to debruijn graph, kmers, read simulation, graph comparison, and tools to project models (graph alns and variant calls) into linear ref.

(from a collection of tweets: 1, 2, 3, 4, 5, 6, 7, and 8).

Jeffrey in the comments below also points out a presentation on Google Docs by Erik: https://docs.google.com/presentation/d/1bbl2zY4qWQ0yYBHhoVuXb79HdgajRotIUa_VEn3kTpI/edit#slide=id.p.

Another tool to get a mention was progressiveCactus (see here), a graph-based alignment tool that seems to be under active development on GitHub.

So, not quite the paucity of implementation I first feared (though it was clear stuff must be in the works – good to know some of the what and where) – and Twitter came in handy for the research in the end…

[1]: doi: 10.1186/s13059-015-0587-3

[2]: doi: 10.1186/1471-2105-15-99

[3]: doi: 10.1038/ng.3257

[4]: doi: 10.1109/TCBB.2013.2297101

[5]: doi: 10.1093/bioinformatics/btt215

[6]: doi: 10.1038/nmeth.3317

“Alignment free” transcriptome quantification

Last week a pre-print was published describing Kallisto – a quantification method for RNA-Seq reads described (by the authors) as ‘near optimal’. Here is Lior Pachter’s associated blog post. Kallisto follows a current trend for ‘alignment free’ quantification methods for RNA-Seq analysis, with recent methods including Sailfish and its successor, Salmon.

These methods do, of course, do ‘alignment’ (pseudo-alignments in Kallisto parlance, lightweight alignments according to Salmon), but they do not expend computation in determining the optimal alignment, or disk space in saving its results. Making an alignment ‘good enough’ to determine the originating transcript and then disposing of it seems like a sensible use of resources – if you don’t intend to use the alignment for anything else later on (which I usually do, so alignment isn’t going to go away any time soon).

I’ve been using Salmon for a few months now, and have been very impressed with its speed and apparent accuracy, but I felt that the publication of Kallisto meant I should actually do some testing.

What follows is pretty quick and dirty, and I know it could well stand some improvements along the way. I’ve tried to document the workflow adequately – let me know if you have any specific suggestions for improvement.

Building a test set

Initially I wanted to run the tests with a full transcriptome simulation, but I’ve been having some technical issues with polyester, the RNA-Seq experiment simulator, and I haven’t had the time to work them out. So instead, I am working with the transcripts of a sample of 250 random genes. Reads were simulated for these transcripts to match empirically observed counts from a recent experiment. This simulation gave me a test set of 206,124 paired-end reads for 1,129 transcripts. This data set was then used for quantification with both Salmon (v0.3.0) and Kallisto (v 0.42.1).

Quantification

I then ran the quantification with each tool and each set of reads 10 times, to get an idea of the variability of the results. For Salmon, this meant 10 separate runs; for Kallisto, it meant extracting the HDF5 file from a run with 10 bootstraps. For interest, I tracked the time and resource use of each tool, though this is not a big consideration. Since we are now in the place where most tools operate in minutes per sample (alignment via STAR or HISAT, and these quantification methods), a few minutes either way makes a negligible difference, and speed is usually entirely the wrong metric to focus on.

For the record, Salmon took a lot longer for this experiment, though the requirement to keep counting reads until at least 50,000,000 have been processed (by default, configurable through the -n parameter) accounts for the majority of the time discrepancy, and would not be such a ‘penalty’ with a normally-sized experiment.

Variability

The two graphs below show the per-transcript coefficient of variation for the observations from each of the software tools, plotted against the raw read count. In both cases, the transcripts with low base mean exhibit the largest relative variance – this is hardly surprising; in general, variance is proportional to mean expression.

Per-transcript coefficient of variation for Salmon observations

Per-transcript coefficient of variation for Kallisto observations
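
For the record, the statistic plotted is just the per-transcript standard deviation of the repeated estimates divided by their mean. A minimal sketch of the calculation (the file name and column layout here are hypothetical, not my actual working files):

import pandas as pd

# one row per transcript, one column per repeated estimate for a single tool
estimates = pd.read_csv("salmon_runs.csv", index_col=0)

mean_count = estimates.mean(axis=1)
cv = estimates.std(axis=1) / mean_count   # coefficient of variation per transcript
pd.DataFrame({"mean_count": mean_count, "cv": cv}).to_csv("salmon_cv.csv")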

Comparison

For comparison, I took the mean observation from the 10 runs of Salmon, and the final abundance.txt observations from Kallisto, and compared them to the ‘ground truth’ – the count table from which the simulation was derived. Plots of these comparisons are below.

Correlation of Salmon counts with ground truth. The red line indicates perfect correlation.

Correlation of Kallisto counts with ground truth. The red line indicates perfect correlation.

I think both tools here are doing a pretty bang-up job, though Kallisto is performing better in this test, particularly with high-abundance isoforms. Its correlation with ‘truth’ is stronger (Spearman, reported on the graphs above), and its mean absolute difference from truth is smaller (10.04 for Kallisto vs 60.66 for Salmon).
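
Both summary statistics are simple to reproduce. A sketch, assuming a table holding the truth counts alongside each tool's estimates (the file and column names are hypothetical):

import pandas as pd
from scipy.stats import spearmanr

# one row per transcript: ground-truth count plus each tool's estimated count
df = pd.read_csv("counts_vs_truth.csv", index_col=0)

for tool in ("kallisto", "salmon"):
    rho, _ = spearmanr(df["truth"], df[tool])
    mad = (df[tool] - df["truth"]).abs().mean()   # mean absolute difference from truth
    print(tool, round(rho, 3), round(mad, 2))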

Both Salmon and Kallisto are still under active development (Salmon has not yet been published, and Kallisto is only just in pre-print), so this is actually relatively early days for quantification by alignment-free methods (see this post by the Salmon developer Rob Patro for some potential future directions). The fact, then, that both tools are already doing such a good job of quantification is very exciting.

EDIT:

In response to the comment from Rob Patro below, I’m including a graph of the comparison of TPM (Transcripts Per Million) – again, truth vs Salmon & Kallisto, this time in one figure.

Transcripts per million comparison. Graph is of log2(TPM+1).
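
For anyone unfamiliar with the unit, TPM divides each count by the transcript's (effective) length and rescales so that the values sum to one million; the plots use log2(TPM+1). A quick sketch of the conversion (my own, with hypothetical file and column names):

import numpy as np
import pandas as pd

def tpm(counts, lengths):
    """Length-normalised counts, rescaled so that the column sums to one million."""
    per_base = counts / lengths
    return per_base / per_base.sum() * 1e6

df = pd.read_csv("counts_vs_truth.csv", index_col=0)
df["salmon_log2_tpm"] = np.log2(tpm(df["salmon"], df["length"]) + 1)
df["kallisto_log2_tpm"] = np.log2(tpm(df["kallisto"], df["length"]) + 1)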

EDIT THE SECOND:

Further feedback from Rob suggested I use the non-bias-corrected results from Salmon. This has a pretty significant effect on the results; the revised plots are included below. The Salmon help does mention that bias correction is experimental…

Non-Bias Corrected Salmon counts vs Ground Truth

TPM comparison with non-bias corrected Salmon results

I’m a bioinformatician

Navel-gazing: this opinion piece, called “Who qualifies to be a bioinformatician?”, seems to have prompted rather a lot of it. I’m breaking my extended silence to add my two-pennyworth on what being a bioinformatician means to me:

  • Being a bioinformatician means being a biologist, programmer, sysadmin, statistician and grief counsellor, rolled into one
  • Being a bioinformatician means writing thousands of lines of glue code in [scripting language of choice]
  • Being a bioinformatician means teaching experienced scientists something new, and getting to see the dawning realization that it might just be useful
  • Being a bioinformatician means jumping through 112 hoops to compile the latest and greatest tool, just to find it segfaults on anything other than the test data
  • Being a bioinformatician means embarking on what seems like a simple job, only to find six weeks later you’ve written yet another short-read aligner
  • Being a bioinformatician means crafting an exquisite pipeline that has to be subtly changed with each run because every dataset is a special little flower that needs bespoke treatment
  • Being a bioinformatician means writing yet another hacky data munging script that will break on the 32,356th line of the poorly defined, exception riddled, lumpen slurry of an input file you’re having to deal with this time
  • Being a bioinformatician means learning that Excel is an acceptable interoperability format, whether you like it or not (I don’t)
  • Being a bioinformatician means knowing enough biology, computing and statistics to be looked down on by purists in all three disciplines
  • Being a bioinformatician means playing a key role in an unparalleled range of exciting, cutting edge research
  • Being a bioinformatician means being part of an open, collaborative worldwide community who are genuinely supportive and helpful

Now, this list may be a little flippant in places, but it is intended to make a point. There are no hard and fast rules about what a bioinformatician is and isn’t; the label will mean different things to different people. But what it does involve is an unusually wide skill set, usually hard-won over many years, and the knowledge of when and where to apply those skills. It definitely doesn’t involve looking down on hardworking practitioners in the field purely because they don’t fit your elitist mould – the only thing that is likely to do is exclude people who are interested in the field but don’t fit your preconceived ideals.

If you want to let me know what being a bioinformatician means to you, feel free to comment below.

Housekeeping

After years of renting a VM from first Slicehost, and then Rackspace, I’ve finally taken the decision to change my web hosting arrangements. This site is now hosted at wordpress.com. I’m in the process of tidying things up and consolidating what was a confusing tangle of a web presence that had evolved over a number of years.

The move to a hosted blog, rather than self-hosting, means that stuff is certain to be broken in older posts which rely on plugins I’m no longer able to deploy.

I’m not promising a massive uptick in activity or anything, but at least things should be a bit more organised now.

Bioinformatics Community Building in Newcastle

Ooohhh look, a blog post. Not seen one of those in a while around these parts…

I’ve been rather busy since taking on my new job (which I guess I should now refer to just as ‘my job’), hence the lack of posts for quite some time. Mostly I’ve been focussed on making sure the Bioinformatics Support Unit continues to run smoothly, but like anyone in a new job, I’ve also wanted to make my mark by changing the way things operate a little bit. So this year we’ve run a proper training course for the first time, for instance. I’m also trying to make the unit more central to the way bioinformatics is done throughout the faculty, by establishing and running the Newcastle University Bioinformatics Special Interest Group. The aim of this group is to foster communication between bioinformaticians working at the University, and hopefully establish some sort of mutually supportive local geek community. The first meeting took place a couple of weeks ago, and I wrote it up for the Special Interest Group blog, but I thought I would reproduce that post here as well. The remainder of this post is taken from that site, with permission of the author (me).

In the first Bioinformatics Special Interest Group meeting, we heard a talk from Dr Andrew “Harry” Harrison entitled ‘On the causes of correlations seen between probe intensities in Affymetrix GeneChips’.

Harry started his talk with a brief overview of the Affymetrix microarray platform, including the important observation (as will become obvious later) that the distance between full-length probes on the surface of a GeneChip is around 3nm. Full-length probes are around 20nm long, so there is plenty of scope for adjacent probes to interact with one another. Also reviewed was the progress made in the summarisation of probe information from GeneChips into probeset observations per gene.

The biggest uncertainty in GeneChip analysis is how to merge all the probe information for one gene (doi:10.1186/1471-2105-8-195)

The Affymetrix-developed MAS5.0 algorithm (doi:10.1093/bioinformatics/18.12.1585), which takes the Tukey bi-weighted mean of the difference in logs of PM and MM probes, was swiftly shown to be outperformed by methods developed in academia once Affymetrix released data that could be used to develop other summarisation algorithms – in particular dChip (doi:10.1007/0-387-21679-0_5), RMA (doi:10.1093/biostatistics/4.2.249) and GCRMA (doi:10.1198/016214504000000683), which take into account systematic hybridisation patterns, i.e. the fact that some probes are “stickier” than others.
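
As an aside, the Tukey biweight is simply a robust mean that down-weights probes lying far from the median. A rough one-step sketch is below; it is my own simplification for illustration, not Affymetrix's exact MAS5.0 implementation.

import numpy as np

def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight mean: each value is weighted by its distance from the
    median (in units of the median absolute deviation); far-out values get weight zero."""
    values = np.asarray(values, dtype=float)
    centre = np.median(values)
    mad = np.median(np.abs(values - centre))
    u = (values - centre) / (c * mad + epsilon)
    weights = np.where(np.abs(u) < 1, (1 - u ** 2) ** 2, 0.0)
    return np.sum(weights * values) / np.sum(weights)

# robust summary of per-probe log-scale signals for one probeset; the outlying probe barely counts
print(tukey_biweight(np.log2([120, 135, 128, 950, 122])))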

Finally for his introductory segment, Harry also mentioned the “curse of dimensionality” – the fact that high-throughput ‘omics experiments make tens of thousands of measurements, and identifying small but significant differences that express what’s going on in the biology suffers from an enormous multiple-testing problem. Therefore, we want to be sure that the things we are measuring are truly indicative of the underlying biology.

For the main portion of his talk, Harry went on to detail a number of features of GeneChip data that mean the correlations we measure using this technology may not be due entirely to biology. This was split into four sections, each with their own conclusions.

Section 1

Different probesets mapping to the same gene may not always be up- and down-regulated together (doi:10.1186/1471-2105-8-13). The obvious explanation for this is that probes map to different exons, and alternative splicing means that differing probes may be differentially regulated, even if they map to the same gene. The follow-on suggestion from this is that while genes come in pieces, exons do not, and the exon can be considered the ‘atomic unit’ of transcription.

Conclusions: Exons need to be considered and classified separately. We should be careful of assumptions that contradict known biology.

Section 2

By investigating correlations across >6,000 GeneChips (HGU-133A, from experiments that are publicly available in the Gene Expression Omnibus), the causes of coherent signals across these experiments can be investigated. Colour map correlation plots can show at a glance the relationships between the probes in many probesets, and anomalous probesets can be easily targeted for investigation. One such probeset (209885_at) was one that looked like it was showing splicing in action (3 of 11 probes clearly did not correlate with the remainder of the probeset across the arrays in GEO), but on further investigation it was found that all the probes in the probeset mapped to the same exon. Another probeset (31846_at) that also mapped to the same exon showed a very similar pattern. By investigating the correlation of all of the probes in the 2 probesets, Harry clearly demonstrated that those 4 outlier probes correlated with one another, even though they did not correlate with any of the other probes.
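
Correlation maps of this sort are straightforward to produce once you have a matrix of probe intensities (probes as rows, arrays as columns); a rough sketch, with a hypothetical input file, follows.

import matplotlib.pyplot as plt
import pandas as pd

# rows = individual probes from the probesets of interest, columns = arrays
signal = pd.read_csv("probe_signal_matrix.csv", index_col=0)

corr = signal.T.corr()   # pairwise probe-probe correlations across all arrays
plt.imshow(corr, cmap="RdBu_r", vmin=-1, vmax=1)
plt.colorbar(label="Pearson correlation")
plt.savefig("probe_correlation_map.png", dpi=150)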

The probes in the 2 probesets under investigation (centre panel, red bars) can clearly be seen to all be located in the final exon of the RHOD gene (top panel) on chromosome 11, in spite of the fact that the Affy annotation (bottom panel) has the probesets annotating the entire gene.

Further investigation showed that all of these 4 outlier probes contain long (4 or more) runs of guanine in their sequence. Harry showed that if you compare all probes with runs of guanine, you find more correlation than you would expect, and the more Gs, the better the correlation. A possible explanation for this was provided, with the suggestion being that the runs of Gs found in the probes could lead to G-quadruplexes being formed between adjacent probes on the GeneChip surface. This would mean that any RNA molecule with a run of Cs could hybridise to the remaining, free probes, and with a much greater affinity than at normal spots on the array, due to a much lower effective probe density in that spot (see doi:10.1093/bib/bbp018 for more details on the physics of this).

Conclusions: Probes containing runs of 4 or more guanines are correlated with one another, and therefore are not measuring the expression of the gene they represent. It is proposed that the signals of these probes should be ignored when analysing a GeneChip experiment.

Section 3

Probes that contain the sequence GCCTCCC are, just like probes containing runs of guanine, more correlated with one another than you might expect them to be (see picture below, taken from doi:10.1093/bfgp/elp027). The proposed reason for this is that this sequence will hybridise to the primer spacer sequence that is attached to all aRNA prior to hybridisation to the GeneChip.

Pairwise correlations for probes containing GCCTCCC. From Briefings in Functional Genomics and Proteomics (2009) 8 (3): 199-212.

Conclusions: Probes containing the complementary sequence to the primer spacer are probably not measuring gene expression. As with GGGG probes, they should be ignored in analysis.
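
Both filters are easy to apply to a probe annotation yourself. A minimal sketch (the probe table name and columns here are hypothetical):

import pandas as pd

# one row per probe, with its 25-mer sequence
probes = pd.read_csv("hgu133a_probe_sequences.csv")

suspect = (probes["sequence"].str.contains("GGGG")
           | probes["sequence"].str.contains("GCCTCCC"))
print("flagged", int(suspect.sum()), "of", len(probes), "probes")
clean = probes[~suspect]   # probes to retain, e.g. when building a custom CDF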

Section 4

In the final section of his talk, Harry focussed on physical reasons for correlations between probes, showing that many probes show a correlation purely because they are found adjacent to very bright probes (doi:10.2202/1544-6115.1590), so their correlated measurements are almost entirely due to poor focussing of the instrument capturing the image of the array. It can be shown that sharply focussed arrays have big values right next to small values, whereas poorly focussed arrays will have smaller differences between adjacent spots, because the large values have some of their intensity falling into their small neighbours. Harry also showed that you can use this objective measure to show the “quality” of a particular array scanner, and how it changes over time (since the scanner ID is contained within the metadata in a CEL file).
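
A crude way to see this numerically is to compare neighbouring cells of the intensity grid: on a sharply focussed scan the mean absolute difference between adjacent spots is large, and blurring shrinks it. A hedged sketch of such a measure (parsing the CEL file into a 2-D intensity array is left out):

import numpy as np

def sharpness(intensities):
    """Mean absolute log-intensity difference between horizontally and vertically adjacent
    spots; blurring bleeds signal into neighbouring spots and lowers this value."""
    logged = np.log2(np.asarray(intensities, dtype=float) + 1)
    dx = np.abs(np.diff(logged, axis=1)).mean()
    dy = np.abs(np.diff(logged, axis=0)).mean()
    return (dx + dy) / 2

# toy demonstration: smoothing a random 'array image' lowers the score
rng = np.random.default_rng(0)
crisp = rng.lognormal(mean=6, sigma=1, size=(100, 100))
blurred = (crisp + np.roll(crisp, 1, axis=0) + np.roll(crisp, 1, axis=1)) / 3
print(sharpness(crisp), ">", sharpness(blurred))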

Conclusions: There is evidence that many GeneChip images are blurred. This blurring can confound the measurement of biology that you are trying to take in your experiment.

The take home message from Harry’s engaging and thought-provoking talk is that the analysis of high-throughput experiments like those using Affymetrix GeneChips cannot happen in isolation. The things we can learn from considering the statistics and bio-physics (among other things) of these experiments can be invaluable in interpreting the data.

Further resources:

One of the questions after the talk asked how to generate custom CDFs for removing the problematic probes that Harry highlighted during his talk. The answer was to use a tool like Xspecies (NASC) for achieving this.

Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

Making an RSS feed from Google Reader shared items

A month ago there would have been no reason to write this post, because Google Reader made its own RSS feed of the posts you wanted to share. See, Google wants to drive people to use Google+, and they seem to be doing this by crippling their other services that have even a smidgen of social usefulness. The thing is, I liked the social bits of Google Reader. Sure, they were a bit of an afterthought, and mildly dysfunctional, but I got a lot of value from reading what other people shared, and I liked that when I shared something I was creating a kind of archive of stuff I liked on Reader, with an RSS feed of its very own.

The great thing about RSS is that it can be consumed by arbitrary 3rd party applications. The likes of Google, Facebook and Twitter don’t like this, because they want control over the 3rd parties that can access their streams, so they are quite prepared to kill off RSS in their applications, because it does not serve their mission. You can no longer get an RSS feed of your Twitter stream (as far as I know), and there is no RSS built into Google+ (which I have yet to have the time to fully grok).

My needs are small, I want an RSS feed of the stuff I want to share from Google Reader, so that other people can follow the things I share in Reader (if they want), and I can pipe that information elsewhere (I use dlvr.it to post selected RSS feeds into Twitter). Google doesn’t want to provide that anymore, so I’ll hack something together.

The ingredients:

  1. These simple instructions for how to render an RSS feed from a MySQL backend.
  2. The instructions for how to create your own “Send to:” item in Google Reader
  3. My rudimentary PHP hackery skills

The code:
All source is available on BitBucket.

First, we need a database connection. The database is set up exactly as described in (1), above.

<?php 
DEFINE('DB_USER', 'db_user'); 
DEFINE('DB_PASSWORD', 'db_password'); 
DEFINE('DB_HOST', 'localhost'); 
DEFINE('DB_NAME', 'db_name'); 
// Make the connection and then select the database. 
$dbc = @mysql_connect(DB_HOST, DB_USER, DB_PASSWORD) OR die(mysql_error()); 
mysql_select_db(DB_NAME) OR die(mysql_error()); 
?>

Now, when the page is visited, we want to render what is in the database as an RSS feed (again, this is a simple adaptation of the code in (1)):

<?php
  class RSS {
        public function RSS() {
                require_once ('mysql_connect.php');
        }
        public function GetFeed() {
                return $this->getDetails() . $this->getItems();
        }
        private function dbConnect() {
                // only define the connection link once, even if this is called more than once
                if (!defined('LINK')) {
                        DEFINE('LINK', mysql_connect(DB_HOST, DB_USER, DB_PASSWORD));
                }
        }
        private function getDetails() {
                //header of the RSS feed
                $detailsTable = "webref_rss_details";
                $this->dbConnect();
                $query = "SELECT * FROM ". $detailsTable;
                $result = mysql_db_query (DB_NAME, $query, LINK);
                while($row = mysql_fetch_array($result)) {
                        //fairly minimal description of the feed
                        $details = '<?xml version="1.0" encoding="ISO-8859-1" ?>
                                <rss version="2.0">
                                        <channel>
                                                <title>'. $row['title'] .'</title>
                                                <link>'. $row['link'] .'</link>
                                                <description>'. $row['description'] .'</description>
                                                <language>'. $row['language'] .'</language>
                                                ';
                }
                return $details;
        }

        private function getItems() {
                //return all the items for the RSS feed
                $itemsTable = "webref_rss_items";
                $this->dbConnect();
                $query = "SELECT * FROM ". $itemsTable;
                $result = mysql_db_query(DB_NAME, $query, LINK);
                $items = '';
                while($row = mysql_fetch_array($result)) {
                        $items .= '<item>
                                <title>'. $row["title"] .'</title>
                                <link>'. $row["link"] .'</link>
                                <description><![CDATA['. $row["description"] .']]></description>
                        </item>';
                }
                //close the feed
                $items .= '</channel>
                                </rss>';
                return $items;
        }
}
?>

Finally, we need a method for adding new stuff for the feed. This code takes the GET variables passed to it by Google Reader, and stores them in the DB:

<?php
if (isset($_GET['url'])) {
        //receive google reader 'send to' items, and store in mysql db
        require_once('mysql_connect.php');
        //escape the incoming values before they go anywhere near the SQL statement
        $url = mysql_real_escape_string($_GET['url'], $dbc);
        $source = mysql_real_escape_string($_GET['source'], $dbc);
        $title = mysql_real_escape_string($_GET['title'], $dbc);
        $simple_check = $_GET['check'];
        //stops anyone adding new items to your feed unless they have the key
        if ($simple_check == 'uniquepasscodehere') {
                //statement adds new item to RSS database
                $insert_statement = "INSERT INTO webref_rss_items(title, description, link) VALUES('$title', '$source', '$url')";
                $result = mysql_query($insert_statement, $dbc);
                if ($result) {
                        echo "<p>Success!";
                        //would be nice to close the window automatically after a couple of seconds
                }
                else {
                        die('<p>Invalid query: ' . mysql_error());
                }
        }
}
else {
        //render everything in the db as RSS
        header("Content-Type: application/xml; charset=ISO-8859-1"); 
        include("RSS.class.php"); 
        $rss = new RSS(); 
        echo $rss->GetFeed(); 
}
?>

Now, I can set up the Send To: item in Google Reader:
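
The important part of the configuration is the URL template: the custom ‘Send To’ item needs to call the script above with the item details filled in, something like http://example.com/readershare/?url=${url}&title=${title}&source=${source}&check=uniquepasscodehere. The domain and path are placeholders for wherever you deploy the PHP; the ${url}, ${title} and ${source} tokens are the placeholders Google Reader substitutes (matching the GET parameters the script reads), and the check value must match the passcode hard-coded in the script.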

Finally, click ‘Send To: -> Readershare’ in the footer of an item in Google Reader, and it is rendered into my RSS feed, which can then be consumed by other applications, including Google Reader itself (so if you want to subscribe to my Google Reader shared items feed, you can find it at http://fuzzierlogic.com/readershare). Oh, and I can pipe my Google Reader shares back into Twitter again.

R programming courses at Newcastle University

An announcement courtesy of Colin Gillespie, a lecturer in Maths & Stats here in Newcastle:

The School of Mathematics & Statistics at Newcastle University is again running some R courses. In January 2012, we will run:

  • January 16th: Introduction to R;
  • January 17th: Programming with R;
  • January 18th & 19th: Advanced graphics with R.

The courses aren’t aimed at teaching statistics; rather, they aim to go through the fundamental concepts of R programming.

Further information is available at the course website.

Science Online London 2011

It is hard not to get carried away in a room full of people who seem mostly to want the same things. You come away from a conference like Science Online thinking that the open science revolution is inevitable, and there is nothing anyone can do to stop it. Then you get back to your day job and talk of REF and impact factors and get brought back to earth with a bump.

Word Cloud of #solo11 Tweets (tagxedo.com)

The take home message of the conference this year seemed to be this: for open science to work, long term, reward mechanisms within the profession have to change in a comprehensive and profound way. Do I think this is possible? Of course. Do I think this is inevitable? Not by a long chalk. There are too many parties with a vested interest in things remaining the same, some of whom were represented here, despite all of the talk being about openness.

NPG certainly don’t seem that interested in opening things up too far, as the breakout session on APIs demonstrated. Nothing outside of their paywall was discussed, and even their more broadly applicable tools, like Connotea, seem to have been quietly dropped into the background. The research councils are still more interested in “impact” (whatever that means) than genuinely original thinking.

But for all this pessimism, there are interesting things happening, and a mainstream breakthrough becomes more likely as the volume of those agitating for change grows. MaryAnn Martone‘s keynote was genuinely inspiring, a clear case for breaking down the garden walls. Michael Nielsen made a compelling case for wholesale revolution (however unlikely I think this sort of change may be). We showed that in an afternoon, you can set up a collaborative blog and populate it with interesting scientific content, using freely available tools. The interest we always encounter for the Knowledgeblog project enthuses me, and encourages me that something similar will make hay someday soon (even if we don’t manage to be the people who make the breakthrough).

It may be difficult for me to get to SoLo12, but I will try very hard to return, because I always leave with a smile on my face.

Salbutamol promotes SMN2 expression

This is a cross-post from the Blogging for Science Online London group blog. During the Saturday workshop at Science Online London 2011, a bunch of us wrote content relating to Spinal Muscular Atrophy. My post was a short summary of a small scale drug trial, which shows promising results.

This is a summary of a paper that shows that Salbutamol promotes SMN2 expression in vivo (doi:10.1136/jmg.2010.080366).

Patients with Spinal Muscular Atrophy (SMA) have no functioning copy of the gene SMN1. The SMN2 gene can theoretically function in its place, but a change in this gene means that only a small amount of functional protein is produced from the gene.

It is therefore suggested that any intervention that can increase the level of functional SMN2 transcript could well be effective as a treatment for SMA.

Salbutamol is a short-acting beta-adrenergic agonist that is primarily used for treating asthma. A previous study (doi:10.1136/jmg.2007.051177) has shown that Salbutamol is effective in raising SMN2 full-length (SMN2-fl) levels in cultured SMA fibroblasts.


In this study, the researchers administered Salbutamol to 12 patients with SMA, and measured the levels of SMN2-fl 3 times (0, 3 and 6 months). The levels of SMN2-fl were significantly increased in all but 3 patients after 3 months (average increase of 48.9%), and in all patients after 6 months (average increase of 91.8%). They also showed that patients with more copies of the SMN2 gene (some patients had 3 copies, some had 4) showed a larger response to Salbutamol treatment. This increase in expression cannot be explained by normal fluctuations over time in these patients, since studies have shown that levels of SMN2-fl are usually stable over time (doi:10.1212/01.wnl.0000252934.70676.ab; doi:10.1038/ejhg.2009.116). Clearly the big question now is whether this molecular response to the drug is reflected in a beneficial clinical response in the patient. This study does not address this question, but does propose that a full double-blind, placebo-controlled trial should be carried out to ascertain whether or not this treatment is effective in treating the symptoms of SMA.

Tiziano, F., Lomastro, R., Pinto, A., Messina, S., D’Amico, A., Fiori, S., Angelozzi, C., Pane, M., Mercuri, E., Bertini, E., Neri, G., & Brahe, C. (2010). Salbutamol increases survival motor neuron (SMN) transcript levels in leucocytes of spinal muscular atrophy (SMA) patients: relevance for clinical trial design. Journal of Medical Genetics, 47 (12), 856-858. doi:10.1136/jmg.2010.080366

Position available in the Bioinformatics Support Unit, Newcastle

Further to my last post, I am able to announce the availability of a position in the BSU, to work with me (it’s my old job!)

Official details are below, or you can find the full ad on the Newcastle University website.

Experimental Scientific Officer, Bioinformatics Support Unit

£27,428 to £35,788 per annum

Closing Date: 11 September 2011

The Bioinformatics Support Unit at Newcastle University is a successful cross-Faculty service providing high quality scientific support for a range of bioinformatics projects.

We require an Experimental Scientific Officer, with experience of a range of Bioinformatics techniques, to work in the Unit on the development and delivery of scientific projects and liaison with relevant academics.

You should have at least a first degree in a relevant science related subject and preferably a PhD. You will have previous experience in bioinformatics support and an understanding of UK research funding procedures.

For an informal discussion on this opportunity, please contact Dr Simon Cockell (Senior Experimental Scientific Officer).

The BSU website is a useful source of information on the kind of work we undertake. There’s a lot of analysis of high throughput data (as you would expect), but also opportunities to get involved with projects that have a bioinformatics research base. We support groups across the university, and beyond – in other north east universities and the local hospitals.