Bioinformatics Community Building in Newcastle

Ooohhh look, a blog post. Not seen one of those in a while around these parts…

I’ve been rather busy since taking on my new job (which I guess I should now refer to just as ‘my job’). Hence the lack of posts for quite some time. Mostly I’ve been focussed on making sure the Bioinformatics Support Unit continues to run smoothly, but like anyone in a new job, I’ve also wanted to make my mark by changing the way things operate a little bit. So this year we’ve run a proper training course for the first time, for instance. I’m also trying to make the unit more central to the way bioinformatics is done throughout the faculty, by establishing and running the Newcastle University Bioinformatics Special Interest Group. The aim of this group is to foster communication between bioinformaticians working at the University, and hopefully establish some sort of mutually supportive local geek community. The first meeting took place a couple of weeks ago, and I wrote it up for the Special Interest Group blog, but I thought I would reproduce that post here as well. The remainder of this post is taken from that site, with permission of the author (me).

In the first Bioinformatics Special Interest Group meeting, we heard a talk from Dr Andrew “Harry” Harrison entitled ‘On the causes of correlations seen between probe intensities in Affymetrix GeneChips’.

Harry started his talk with a brief overview of the Affymetrix microarray platform, including the important observation (as will become obvious later) that the distance between full length probes on the surface of a GeneChip is around 3nm. Full length probes are around 20nm long, so there is plenty of scope for adjacent probes to interact with one another. Also reviewed was the progress made in the summarisation of probe information from GeneChips into probeset observations per gene.

The biggest uncertainty in GeneChip analysis is how to merge all the probe information for one gene [cite]10.1186/1471-2105-8-195[/cite]

The Affymetrix-developed MAS5.0 algorithm [cite]10.1093/bioinformatics/18.12.1585[/cite], which takes the Tukey bi-weighted mean of the difference in logs of perfect match (PM) and mismatch (MM) probes, was swiftly shown to be outperformed by methods developed in academia once Affymetrix released data that could be used to develop other summarisation algorithms. In particular, dChip [cite]10.1007/0-387-21679-0_5[/cite], RMA [cite]10.1093/biostatistics/4.2.249[/cite] and GCRMA [cite]10.1198/016214504000000683[/cite] take into account systematic hybridisation patterns, i.e. the fact that some probes are “stickier” than others.
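To make the "Tukey bi-weighted mean" a little more concrete, here is a minimal Python sketch of a one-step Tukey biweight applied to log PM minus log MM values. It is my own illustration rather than anything shown in the talk: the tuning constants `c=5` and `epsilon=0.0001` are assumptions, the intensities are invented, and the real MAS5.0 procedure includes further steps (such as handling mismatch values that exceed the perfect match) that are not reproduced here.

```python
import math
from statistics import median


def tukey_biweight(values, c=5.0, epsilon=1e-4):
    """One-step Tukey biweight: a robust mean that down-weights outliers.

    c and epsilon are assumed tuning constants, not taken from the post.
    """
    m = median(values)
    mad = median(abs(v - m) for v in values)  # median absolute deviation
    weighted_sum = 0.0
    weight_total = 0.0
    for v in values:
        u = (v - m) / (c * mad + epsilon)      # scaled distance from the median
        w = (1 - u * u) ** 2 if abs(u) < 1 else 0.0  # far-away points get weight 0
        weighted_sum += w * v
        weight_total += w
    return weighted_sum / weight_total


# Invented PM and MM intensities for one probeset (illustrative values only).
pm = [1200.0, 950.0, 1100.0, 8000.0, 1020.0]
mm = [300.0, 280.0, 310.0, 290.0, 305.0]

log_diffs = [math.log2(p) - math.log2(q) for p, q in zip(pm, mm)]
signal_log2 = tukey_biweight(log_diffs)
print("log2 signal estimate:", round(signal_log2, 3))
```

The fourth probe is far brighter than the rest, so it receives zero weight and the estimate stays close to the other four probes instead of being dragged upwards as a plain mean would be.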

Finally for his introductory segment, Harry also mentioned the “curse of dimensionality” – the fact that high-throughput ‘omics experiments make tens of thousands of measurements, and identifying small but significant differences that express what’s going on in the biology suffers from an enormous multiple-testing problem. Therefore, we want to be sure that those things we are measuring are truly indicative of the underlying biology.
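A quick back-of-the-envelope calculation shows the scale of that multiple-testing problem. The numbers below are assumptions chosen for illustration (20,000 measurements and a conventional 0.05 threshold), and the Bonferroni correction is just one familiar way of compensating, not something attributed to the talk.

```python
n_tests = 20000   # assumed number of measurements per experiment (illustrative)
alpha = 0.05      # conventional per-test significance threshold

# If nothing at all were changing, this many tests would still pass by chance.
expected_false_positives = n_tests * alpha

# Bonferroni correction: divide the threshold by the number of tests.
bonferroni_threshold = alpha / n_tests

print("Expected false positives:", expected_false_positives)
print("Bonferroni per-test threshold:", bonferroni_threshold)
```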

For the main portion of his talk, Harry went on to detail a number of features of GeneChip data that mean the correlations we measure using this technology may not be due entirely to biology. This was split into four sections, each with their own conclusions.

Section 1

Different probesets mapping to the same gene may not always be up- and down-regulated together [cite]10.1186/1471-2105-8-13[/cite]. The obvious explanation for this is that probes map to different exons, and alternative splicing means that differing probes may be differentially regulated, even if they map to the same gene. The follow-on suggestion from this is that while genes come in pieces, exons do not, and the exon can be considered the ‘atomic unit’ of transcription.

Conclusions: Exons need to be considered and classified separately. We should be careful of assumptions that contradict known biology.

Section 2

Correlations across >6,000 GeneChips (HGU-133A, from experiments that are publicly available in the Gene Expression Omnibus) can be used to investigate the causes of coherent signals across these experiments. Colour map correlation plots can show at a glance the relationships between the probes in many probesets, and anomalous probesets can be easily targeted for investigation. One such probeset (209885_at) looked like it was showing splicing in action (3 of 11 probes clearly did not correlate with the remainder of the probeset across the arrays in GEO), but on further investigation it was found that all the probes in the probeset mapped to the same exon. Another probeset (31846_at) that also mapped to the same exon showed a very similar pattern. By investigating the correlation of all of the probes in the 2 probesets, Harry clearly demonstrated that those 4 outlier probes correlated with one another, even though they did not correlate with any of the other probes.
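The kind of probe-versus-probe comparison described here can be sketched in a few lines of Python. This is a toy version under my own assumptions: the intensities are invented, there are only six "arrays" rather than thousands, and flagging a probe when its median correlation with the others drops below 0.5 is an arbitrary rule of mine, not the method used in the talk.

```python
import math
from statistics import median


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def outlier_probes(profiles, threshold=0.5):
    """Probes whose median correlation with the other probes is below threshold."""
    outliers = []
    for name in sorted(profiles):
        others = [pearson(profiles[name], profiles[other])
                  for other in profiles if other != name]
        if median(others) < threshold:
            outliers.append(name)
    return outliers


# Invented log2 intensities for four probes across six arrays.
profiles = {
    "probe_1": [7.1, 8.3, 6.9, 9.0, 7.5, 8.8],
    "probe_2": [7.0, 8.1, 7.0, 9.2, 7.4, 8.9],
    "probe_3": [6.8, 8.4, 6.7, 8.9, 7.6, 8.6],
    "probe_4": [8.0, 8.1, 7.9, 8.0, 8.2, 7.9],  # does not track the others
}

outliers = outlier_probes(profiles)
print("Probes that do not follow the rest of the probeset:", outliers)
```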

The probes in the 2 probesets under investigation (centre panel, red bars) can clearly be seen to all be located in the final exon of the RHOD gene (top panel) on chromosome 11, in spite of the fact that the Affy annotation (bottom panel) has the probesets annotating the entire gene.

Further investigation showed that all of these 4 outlier probes contain long (4 or more) runs of guanine in their sequence. Harry showed that if you compare all probes with runs of guanine, you find more correlation than you would expect, and the more Gs, the better the correlation. A possible explanation for this was provided, with the suggestion being that the runs of Gs found in the probes could lead to G-quadruplexes being formed between adjacent probes on the GeneChip surface. This would mean that any RNA molecule with a run of Cs could hybridise to the remaining, free probes, and with a much greater affinity than at normal spots on the array, due to a much lower effective probe density in that spot (see [cite]10.1093/bib/bbp018[/cite] for more details on the physics of this).

Conclusions: Probes containing runs of 4 or more guanines are correlated with one another, and therefore are not measuring the expression of the gene they represent. It is proposed that the signals of these probes should be ignored when analysing a GeneChip experiment.
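Spotting these probes from their sequences is straightforward. The sketch below measures the longest run of guanines in each probe, since the talk suggested that longer runs correlate more strongly. The probe names and 25-mer sequences are invented for illustration and are not real Affymetrix probes.

```python
import re


def longest_g_run(sequence):
    """Length of the longest run of consecutive G bases in a sequence."""
    runs = re.findall(r"G+", sequence.upper())
    return max((len(run) for run in runs), default=0)


def has_g_run(sequence, min_length=4):
    """True if the sequence contains at least min_length consecutive Gs."""
    return longest_g_run(sequence) >= min_length


# Invented 25-mer sequences (hypothetical probes).
example_probes = {
    "probe_a": "ATCGTACGATCGTAGCTAGCTAGCA",
    "probe_b": "ATCGGGGTACGATCGTAGCTAGCTA",
    "probe_c": "ACGGGGGGTCATGCATGCATGCATG",
}

# Rank probes by their longest G run, longest first.
ranked = sorted(example_probes,
                key=lambda name: longest_g_run(example_probes[name]),
                reverse=True)
for name in ranked:
    seq = example_probes[name]
    print(name, longest_g_run(seq), "flagged" if has_g_run(seq) else "ok")
```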

Section 3

Probes that contain the sequence GCCTCCC are, just like probes containing runs of guanine, more correlated with one another than you might expect them to be (see picture below, taken from [cite]10.1093/bfgp/elp027[/cite]). The proposed reason for this is that this sequence will hybridize to the primer spacer sequence that is attached to all aRNA prior to hybridizing to the GeneChip.


Figure: Probe correlations. Pairwise correlations for probes containing GCCTCCC. From Briefings in Functional Genomics and Proteomics (2009) 8 (3): 199-212.

Conclusions: Probes containing the complementary sequence to the primer spacer are probably not measuring gene expression. As with GGGG probes, they should be ignored in analysis.
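Putting the two sequence-based conclusions together, a probe list could be split into probes to keep and probes to ignore. This is only a sketch of the idea under my own assumptions: the function names and the dictionary layout are hypothetical, and the sequences are invented rather than taken from a real chip.

```python
import re

G_RUN = re.compile(r"G{4,}")      # four or more consecutive guanines
PRIMER_SPACER_MOTIF = "GCCTCCC"   # sequence named in the talk


def problem_reasons(sequence):
    """List the reasons (if any) why a probe sequence looks unreliable."""
    seq = sequence.upper()
    reasons = []
    if G_RUN.search(seq):
        reasons.append("G-run")
    if PRIMER_SPACER_MOTIF in seq:
        reasons.append("GCCTCCC")
    return reasons


def split_probes(probes):
    """Split {probe_id: sequence} into (kept, flagged) dictionaries."""
    kept, flagged = {}, {}
    for probe_id, seq in probes.items():
        reasons = problem_reasons(seq)
        if reasons:
            flagged[probe_id] = reasons
        else:
            kept[probe_id] = seq
    return kept, flagged


# Invented 25-mer sequences (hypothetical probes).
probes = {
    "probe_a": "ATCGTACGATCGTAGCTAGCTAGCA",
    "probe_b": "ATCGGGGTACGATCGTAGCTAGCTA",
    "probe_c": "TTAGCCTCCCAGTACGATCGATCGA",
}

kept, flagged = split_probes(probes)
print("Kept:", sorted(kept))
print("Flagged:", flagged)
```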

Section 4

In the final section of his talk, Harry focussed on physical reasons for correlations between probes, showing that many probes show a correlation purely because they are found adjacent to very bright probes [cite]10.2202/1544-6115.1590[/cite]. So their correlated measurements are almost entirely due to poor focussing on the instrument capturing the image of the array. It can be shown that sharply focussed arrays have big values right next to small values, whereas poorly focussed arrays will have smaller differences between adjacent spots, because the large values have some of their intensity falling into their small neighbours. Harry also showed that you can use this objective measure to show the “quality” of a particular array scanner, and how it changes over time (since the scanner ID is contained within the metadata in a CEL file).

Conclusions: There is evidence that many GeneChip images are blurred. This blurring can confound the measurement of biology that you are trying to take in your experiment.
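The intuition behind the focus measure can be shown with a toy example. The code below is not the statistic from the cited paper; it is a simplified stand-in of my own that averages the absolute log2 difference between neighbouring spots, alongside a crude "blur" that leaks a fraction of each spot's intensity into its neighbours. The grid values are invented.

```python
import math


def neighbour_contrast(grid):
    """Mean absolute log2 difference between horizontally/vertically adjacent spots."""
    rows, cols = len(grid), len(grid[0])
    diffs = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                diffs.append(abs(math.log2(grid[r][c]) - math.log2(grid[r][c + 1])))
            if r + 1 < rows:
                diffs.append(abs(math.log2(grid[r][c]) - math.log2(grid[r + 1][c])))
    return sum(diffs) / len(diffs)


def leak_into_neighbours(grid, fraction=0.1):
    """Toy blur: each spot gives `fraction` of its intensity to its neighbours."""
    rows, cols = len(grid), len(grid[0])
    blurred = [[(1 - fraction) * value for value in row] for row in grid]
    for r in range(rows):
        for c in range(cols):
            neighbours = [(r + dr, c + dc)
                          for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                          if 0 <= r + dr < rows and 0 <= c + dc < cols]
            share = fraction * grid[r][c] / len(neighbours)
            for nr, nc in neighbours:
                blurred[nr][nc] += share
    return blurred


# Invented 4x4 "array": bright and dim spots in a checkerboard pattern.
sharp = [[10000.0 if (r + c) % 2 == 0 else 100.0 for c in range(4)]
         for r in range(4)]
blurred = leak_into_neighbours(sharp)

print("Sharp contrast:  ", round(neighbour_contrast(sharp), 2))
print("Blurred contrast:", round(neighbour_contrast(blurred), 2))
```

The blurred grid scores lower because the dim spots have picked up intensity from their bright neighbours, which is the effect described above.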

The take home message from Harry’s engaging and thought-provoking talk is that the analysis of high-throughput experiments like those using Affymetrix GeneChips cannot happen in isolation. The things we can learn from considering the statistics and bio-physics (among other things) of these experiments can be invaluable in interpreting the data.

Further resources:

One of the questions after the talk asked how to generate custom CDFs for removing the problematic probes that Harry highlighted during his talk. The answer was to use a tool like Xspecies (NASC) for achieving this.

Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

The Taverna Knowledgeblog

Today I am sat in a room with a fairly large group of people, who all work on the Taverna project. They are writing a Knowledgeblog book about the workflow manager, and I am providing help and technical assistance as a part of my role on the Knowledgeblog project. As well as producing a hopefully useful product (a beginner’s guide to Taverna), we are testing some of the procedures and products that we have been working on over the last few months on the project.

Posts on a Knowledgeblog now have several features that were in our plan for the project. Specifically, post revisions are now publicly exposed, providing a public provenance trail, and preventing someone from ‘unsaying’ anything without the proper process. The editorial workflow is better defined than it was for Ontogenesis (the Knowledgeblog prototype), meaning requests for reviews and the provision of the reviews themselves should be more streamlined. Despite today’s approach, the workflow doesn’t require all of the collaborators on a publication to be sitting in the same room: for this we are using the excellent EditFlow plugin, which provides ‘editorial comments’ on posts, and can fire email events upon certain pre-defined operations.

Posts can have multiple authors, which, combined with the ability to author posts in genuinely collaborative tools such as Google Docs (as opposed to totally non-collaborative tools like Word documents shared by email, although you can write posts like that too if you like), allows jointly authored posts to be both simple to generate and properly attributed. Finally, easy-to-generate tables of contents, for both posts and whole sites, make navigating the content simple.

There are still a number of pieces of the puzzle that need to be slotted into place for us to have a fully functional platform, but I can’t help but feel we’re getting there. As I mentioned, I was here for technical support, and I didn’t really have a massive amount to do today (I spent most of it tinkering with the chosen theme to get it to support CoAuthors Plus).

The next major step will be a plugin to assist with citing papers and generating bibliographies, which I am currently in the process of writing; more on that in a future post. I agree with many of Martin Fenner’s points in his post of a few days ago: citations are not currently well supported by WordPress, or by any plugins so far. I am working on the dynamic generation of citations and bibliographies from specific tags within posts. This should allow for simple management of referencing by authors, and provide a range of tools for readers of articles, such as BibTeX/RIS export and on-the-fly bibliography reformatting.
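To give a flavour of what generating citations from tags might involve, here is a small Python sketch. It is purely illustrative and not the plugin itself: it assumes tags of the form [cite]DOI[/cite], it only numbers the citations and lists the identifiers, and it makes no attempt to look up titles or authors for each DOI.

```python
import re

# Assumed tag syntax: [cite]identifier[/cite]
CITE_TAG = re.compile(r"\[cite\](.+?)\[/cite\]")


def number_citations(post_text):
    """Replace each cite tag with a numbered marker and build a bibliography list."""
    order = []  # identifiers in order of first appearance

    def replace(match):
        identifier = match.group(1).strip()
        if identifier not in order:
            order.append(identifier)
        return "[%d]" % (order.index(identifier) + 1)

    body = CITE_TAG.sub(replace, post_text)
    bibliography = ["[%d] doi:%s" % (number, identifier)
                    for number, identifier in enumerate(order, start=1)]
    return body, bibliography


# Made-up post text with placeholder identifiers.
text = "A claim [cite]10.1/a[/cite] and another [cite]10.2/b[/cite], then the first again [cite]10.1/a[/cite]."
body, bibliography = number_citations(text)
print(body)
for entry in bibliography:
    print(entry)
```

Repeated citations of the same identifier reuse the same number, which is the behaviour most reference styles expect.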

Parsing Thermo Finnigan RAW files

In a rare move, I’m going to largely copy across a post from my work blog, because I hope it contains useful information. For background, I’m trying to write a simple Python script that extracts particular metadata from a .RAW file, produced by a Thermo Finnigan mass spectrometer. Tools that exist for parsing these files require access to proprietary XCalibur libraries, which I do not have.

Thermo provided a link to MSFileReader, a ‘freeware’ COM object that should allow interaction with RAW files without an XCalibur installation. They also sent a PDF guide to the COM object. Although this will allow XCalibur to be avoided, the work is still Windows-bound.

Python and COM objects

Python can talk to COM objects, through the win32com.client package. As a test, I installed Python and MSFileReader and the pywin32 libs on my netbook (which is a Windows 7 machine). I can import the required Python module, but need to extend sys.path somewhat:

>>> import sys
>>> sys.path.append(r'C:\Python26\Lib\site-packages\win32')
>>> sys.path.append(r'C:\Python26\Lib\site-packages\win32\lib')
>>> from win32com.client import Dispatch
>>> x = Dispatch("NAME")

The key thing here is “NAME”:

The provided PDF gives C snippets for each method available in the COM object. This only provides one clue as to the possible name of the COM object:

// example for Open
TCHAR* szPathName[] = _T("c:\xcalibur\examples\data\steroids15.raw");
long nRet = XRawfileCtrl.Open( szPathName );
if( nRet != 0 ) {
    ::MessageBox( NULL, _T("Error opening file"), _T("Error"), MB_OK );
}

XRawfileCtrl is used to call the Open() method. However, this and MSFileReader as “NAME” both fail (Invalid class string).

I found ‘multiplierz’, which seems to use MSFileReader to create mzAPI, which focusses on access to the actual data rather than the metadata. The code gives some good clues as to how to use the COM object. [doi:10.1186/1471-2105-10-364]

MSFileReader.XRawfile is used as “NAME” in this code.


>>> import sys
>>> sys.path.append(r'C:\Python26\Lib\site-packages\win32')
>>> sys.path.append(r'C:\Python26\Lib\site-packages\win32\lib')
>>> from win32com.client import Dispatch
>>> x = Dispatch("MSFileReader.XRawfile")
>>> x.Open(r"C:\Users\path\to\file\msfile.RAW")
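The trial and error over possible names could also be wrapped in a small helper. This is a sketch of my own, with a hypothetical function name: it takes the dispatch function as a parameter so that it can be tried without a Windows machine, and it simply reports each failure and moves on to the next candidate.

```python
def first_working_progid(candidates, dispatch):
    """Return (name, com_object) for the first name that dispatch accepts.

    `dispatch` is expected to behave like win32com.client.Dispatch, which
    raises an error (for example "Invalid class string") for unknown names.
    """
    for name in candidates:
        try:
            return name, dispatch(name)
        except Exception as error:
            print("%s failed: %s" % (name, error))
    return None, None


# On a Windows machine with pywin32 and MSFileReader installed, usage might be:
#   from win32com.client import Dispatch
#   name, raw = first_working_progid(
#       ["XRawfileCtrl", "MSFileReader", "MSFileReader.XRawfile"], Dispatch)
```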

To be continued…