Bioinformatics Community Building in Newcastle

Ooohhh look, a blog post. Not seen one of those in a while around these parts…

I’ve been rather busy since taking on my new job (which I guess I should now refer to just as ‘my job’), hence the lack of posts for quite some time. Mostly I’ve been focussed on making sure the Bioinformatics Support Unit continues to run smoothly, but like anyone in a new job, I’ve also wanted to make my mark by changing the way things operate a little bit. So this year we’ve run a proper training course for the first time, for instance. I’m also trying to make the unit more central to the way bioinformatics is done throughout the faculty, by establishing and running the Newcastle University Bioinformatics Special Interest Group. The aim of this group is to foster communication between bioinformaticians working at the University, and hopefully establish some sort of mutually supportive local geek community. The first meeting took place a couple of weeks ago, and I wrote it up for the Special Interest Group blog, but I thought I would reproduce that post here as well. The remainder of this post is taken from that site, with permission of the author (me).

In the first Bioinformatics Special Interest Group meeting, we heard a talk from Dr Andrew “Harry” Harrison entitled ‘On the causes of correlations seen between probe intensities in Affymetrix GeneChips’.

Harry started his talk with a brief overview of the Affymetrix microarray platform, including the observation (which becomes important later) that the distance between full-length probes on the surface of a GeneChip is around 3 nm. Full-length probes are around 20 nm long, so there is plenty of scope for adjacent probes to interact with one another. He also reviewed the progress made in summarising probe information from GeneChips into probeset observations per gene.

The biggest uncertainty in GeneChip analysis is how to merge all the probe information for one gene [cite]10.1186/1471-2105-8-195[/cite]

The Affymetrix-developed MAS5.0 algorithm [cite]10.1093/bioinformatics/18.12.1585[/cite], which takes the Tukey biweight mean of the difference in logs of PM (perfect match) and MM (mismatch) probes, was swiftly shown to be outperformed by methods developed in academia, once Affymetrix released data that could be used to develop other summarisation algorithms. In particular, dChip [cite]10.1007/0-387-21679-0_5[/cite], RMA [cite]10.1093/biostatistics/4.2.249[/cite] and GCRMA [cite]10.1198/016214504000000683[/cite] take into account systematic hybridisation patterns, i.e. the fact that some probes are “stickier” than others.
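To make the "Tukey biweight of the difference in logs" idea a bit more concrete, here is a minimal sketch of my own (it was not part of the talk). It follows the simplified description above rather than the full MAS5.0 procedure, which has extra steps such as background correction. The function names are hypothetical, and the tuning constants `c=5` and `eps=1e-4` are the values usually quoted for MAS5.0, so treat them as assumptions.

```python
import math
from statistics import median


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a robust average that down-weights outliers."""
    m = median(values)
    # Median absolute deviation from the median, used as a robust scale.
    mad = median(abs(v - m) for v in values)
    weights = []
    for v in values:
        u = (v - m) / (c * mad + eps)
        # Points further than c * MAD from the median get zero weight.
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def probeset_signal(pm, mm):
    """Summarise one probeset from paired PM and MM intensities (log2 scale)."""
    log_diffs = [math.log2(p) - math.log2(q) for p, q in zip(pm, mm)]
    return tukey_biweight(log_diffs)
```

The point of the biweight is that a single wildly misbehaving probe pair has little or no influence on the probeset summary, whereas it would drag a plain mean a long way.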

Finally for his introductory segment, Harry also mentioned the “curse of dimensionality” – the fact that high-throughput ‘omics experiments make tens of thousands of measurements, and identifying small but significant differences that express what’s going on in the biology suffers from an enormous multiple-testing problem. Therefore, we want to be sure that those things we are measuring are truly indicative of the underlying biology.
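A quick back-of-the-envelope sketch of why the multiple-testing problem bites (my own illustration, not from the talk). The figure of 20,000 measurements is a hypothetical round number, and the Bonferroni correction is just the simplest familiar adjustment, not a method Harry recommended.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives if no test reflects a real effect."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha):
    """Per-test p-value cut-off that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


n_measurements = 20_000  # hypothetical number of probesets tested
alpha = 0.05

print(expected_false_positives(n_measurements, alpha))
print(bonferroni_threshold(n_measurements, alpha))
```

With an uncorrected threshold of 0.05, around a thousand measurements would look "significant" by chance alone, which is why any systematic, non-biological source of correlation is so troublesome.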

For the main portion of his talk, Harry went on to detail a number of features of GeneChip data that mean the correlations we measure using this technology may not be due entirely to biology. This was split into four sections, each with its own conclusions.

Section 1

Different probesets mapping to the same gene may not always be up- and down-regulated together [cite]10.1186/1471-2105-8-13[/cite]. The obvious explanation for this is that probes map to different exons, and alternative splicing means that differing probes may be differentially regulated, even if they map to the same gene. The follow-on suggestion from this is that while genes come in pieces, exons do not, and the exon can be considered the ‘atomic unit’ of transcription.

Conclusions: Exons need to be considered and classified separately. We should be careful of assumptions that contradict known biology.

Section 2

Correlations across >6,000 GeneChips (HG-U133A, from experiments that are publicly available in the Gene Expression Omnibus, GEO) can be used to investigate the causes of coherent signals across these experiments. Colour map correlation plots can show at a glance the relationships between the probes in many probesets, and anomalous probesets can be easily targeted for investigation. One such probeset (209885_at) looked like it was showing splicing in action (3 of 11 probes clearly did not correlate with the remainder of the probeset across the arrays in GEO), but on further investigation it was found that all the probes in the probeset mapped to the same exon. Another probeset (31846_at) that also mapped to the same exon showed a very similar pattern. By investigating the correlation of all of the probes in the 2 probesets, Harry clearly demonstrated that the 4 outlier probes correlated with one another, even though they did not correlate with any of the other probes.
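The kind of check described above can be sketched in a few lines (my own illustration, not Harry's code). It assumes you already have, for each probe in a probeset, a list of log-intensities across the same set of arrays, and that every probe varies across those arrays. The function names and the 0.5 cut-off are hypothetical.

```python
import math
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def outlier_probes(intensities, threshold=0.5):
    """Return probes whose median correlation with the other probes is low.

    `intensities` maps a probe ID to its log-intensities across the arrays.
    """
    flagged = []
    for probe, values in intensities.items():
        others = [pearson(values, other_values)
                  for other, other_values in intensities.items()
                  if other != probe]
        if median(others) < threshold:
            flagged.append(probe)
    return flagged
```

A probe flagged this way is only a candidate for investigation. As the 209885_at example shows, the interesting step is working out why it disagrees with its neighbours.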

The probes in the 2 probesets under investigation (centre panel, red bars) can clearly be seen to all be located in the final exon of the RHOD gene (top panel) on chromosome 11, in spite of the fact that the Affy annotation (bottom panel) has the probesets annotating the entire gene.

Further investigation showed that all of these 4 outlier probes contain long (4 or more) runs of guanine in their sequence. Harry showed that if you compare all probes with runs of guanine, you find more correlation than you would expect, and the more Gs, the better the correlation. A possible explanation for this was provided, with the suggestion being that the runs of Gs found in the probes could lead to G-quadruplexes being formed between adjacent probes on the GeneChip surface. This would mean that any RNA molecule with a run of Cs could hybridise to the remaining, free probes, and with a much greater affinity than at normal spots on the array, due to a much lower effective probe density in that spot (see [cite]10.1093/bib/bbp018[/cite] for more details on the physics of this).

Conclusions: Probes containing runs of 4 or more guanines are correlated with one another, and therefore are not measuring the expression of the gene they represent. It is proposed that the signals of these probes should be ignored when analysing a GeneChip experiment.

Section 3

Probes that contain the sequence GCCTCCC are, just like probes containing runs of guanine, more correlated with one another than you might expect them to be (see picture below, taken from [cite]10.1093/bfgp/elp027[/cite]). The proposed reason for this is that this sequence will hybridise to the primer spacer sequence that is attached to all aRNA prior to hybridising to the GeneChip.


Pairwise correlations for probes containing GCCTCCC. From Briefings in Functional Genomics and Proteomics (2009) 8 (3): 199-212.

Conclusions: Probes containing the complementary sequence to the primer spacer are probably not measuring gene expression. As with GGGG probes, they should be ignored in analysis.
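Flagging both kinds of problem probe from their sequences is straightforward. Here is a minimal sketch of my own (not from the talk), assuming you have the probe sequences as plain strings. The function name and the example sequences are made up for illustration.

```python
import re

G_RUN = re.compile(r"G{4,}")   # a run of 4 or more guanines
PRIMER_MOTIF = "GCCTCCC"       # the primer spacer motif described above


def problem_reasons(sequence):
    """Return the reasons (possibly none) why a probe sequence looks suspect."""
    sequence = sequence.upper()
    reasons = []
    if G_RUN.search(sequence):
        reasons.append("G-run")
    if PRIMER_MOTIF in sequence:
        reasons.append("primer-spacer motif")
    return reasons


# Made-up 25-mers, purely for illustration.
probes = {
    "probe_a": "ACGTACGTGGGGACGTACGTACGTA",
    "probe_b": "ACGTGCCTCCCACGTACGTACGTAC",
    "probe_c": "ACGTACGTACGTACGTACGTACGTA",
}

# Keep only the probes with no flagged motif.
clean = [name for name, seq in probes.items() if not problem_reasons(seq)]
print(clean)
```

The list of probes that survives a filter like this is what you would feed into a custom CDF, so that the flagged probes never contribute to a probeset summary.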

Section 4

In the final section of his talk, Harry focussed on physical reasons for correlations between probes, showing that many probes show a correlation purely because they are found adjacent to very bright probes [cite]10.2202/1544-6115.1590[/cite]. So their correlated measurements are almost entirely due to poor focussing on the instrument capturing the image of the array. It can be shown that sharply focussed arrays have big values right next to small values, whereas poorly focussed arrays will have smaller differences between adjacent spots, because the large values have some of their intensity falling into their small neighbours. Harry also showed that you can use this objective measure to show the “quality” of a particular array scanner, and how it changes over time (since the scanner ID is contained within the metadata in a CEL file).

Conclusions: There is evidence that many GeneChip images are blurred. This blurring can confound the measurement of biology that you are trying to take in your experiment.
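To get a feel for why blurring shrinks the differences between neighbouring spots, here is a toy sketch (my own, and not the measure used in the cited paper). It treats an array as a small grid of intensities, mixes a fraction of each spot's neighbours into it, and compares the average log-difference between adjacent spots before and after. The function names and the `leak` parameter are hypothetical.

```python
import math


def adjacent_contrast(grid):
    """Mean absolute log2 difference between horizontally/vertically adjacent spots."""
    rows, cols = len(grid), len(grid[0])
    diffs = []
    for r in range(rows):
        for c in range(cols):
            if c + 1 < cols:
                diffs.append(abs(math.log2(grid[r][c]) - math.log2(grid[r][c + 1])))
            if r + 1 < rows:
                diffs.append(abs(math.log2(grid[r][c]) - math.log2(grid[r + 1][c])))
    return sum(diffs) / len(diffs)


def blur(grid, leak=0.1):
    """Replace each spot with (1 - leak) of itself plus leak times its neighbours' mean."""
    rows, cols = len(grid), len(grid[0])
    blurred = []
    for r in range(rows):
        row = []
        for c in range(cols):
            neighbours = [grid[rr][cc]
                          for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                          if 0 <= rr < rows and 0 <= cc < cols]
            row.append((1 - leak) * grid[r][c] + leak * sum(neighbours) / len(neighbours))
        blurred.append(row)
    return blurred


# A toy "sharply focussed" array: bright and dim spots alternating.
sharp = [[1000.0 if (r + c) % 2 == 0 else 100.0 for c in range(4)] for r in range(4)]
print(adjacent_contrast(sharp), adjacent_contrast(blur(sharp)))
```

In this toy example the blurred grid has a lower adjacent-spot contrast than the sharp one, which is the intuition behind using the spread of neighbouring values as an objective measure of focus.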

The take-home message from Harry’s engaging and thought-provoking talk is that the analysis of high-throughput experiments like those using Affymetrix GeneChips cannot happen in isolation. The things we can learn from considering the statistics and bio-physics (among other things) of these experiments can be invaluable in interpreting the data.

Further resources:

One of the questions after the talk asked how to generate custom CDFs that remove the problematic probes Harry highlighted. The answer was to use a tool like Xspecies (NASC) to achieve this.

Please note that this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!
