Research Blogging

Salbutamol promotes SMN2 expression

This is a cross-post from the Blogging for Science Online London group blog. During the Saturday workshop at Science Online London 2011, a bunch of us wrote content relating to Spinal Muscular Atrophy. My post was a short summary of a small-scale drug trial, which shows promising results.

This is a summary of a paper that shows that Salbutamol promotes SMN2 expression in vivo [cite]10.1136/jmg.2010.080366[/cite].

Patients with Spinal Muscular Atrophy (SMA) have no functioning copy of the gene SMN1. The SMN2 gene can theoretically function in its place, but a change in this gene means that only a small amount of functional protein is produced from the gene.

It is therefore suggested that any intervention that can increase the level of functional SMN2 transcript could well be effective as a treatment for SMA.

Salbutamol is a short acting beta-adrenergic agonist that is primarily used for treating asthma. A previous study [cite]10.1136/jmg.2007.051177[/cite] has shown that Salbutamol is effective in raising SMN2 full length (SMN2-fl) levels in cultured SMA fibroblasts.

Figure 1

In this study, the researchers administered Salbutamol to 12 patients with SMA, and measured the levels of SMN2-fl 3 times (0, 3 and 6 months). The levels of SMN2-fl were significantly increased in all but 3 patients after 3 months (average increase of 48.9%), and in all patients after 6 months (average increase of 91.8%). They also showed that patients with more copies of the SMN2 gene (some patients had 3 copies, some had 4) showed a larger response to Salbutamol treatment. This increase in expression cannot be explained by normal fluctuations over time in these patients, since studies have shown that levels of SMN2-fl are usually stable over time [cite]10.1212/01.wnl.0000252934.70676.ab[/cite] [cite]10.1038/ejhg.2009.116[/cite]. Clearly the big question now is whether this molecular response to the drug is reflected in a beneficial clinical response in the patient. This study does not address this question, but does propose that a full double-blind, placebo controlled trial should be carried out to ascertain whether or not this treatment is effective in treating the symptoms of SMA.

Tiziano, F., Lomastro, R., Pinto, A., Messina, S., D’Amico, A., Fiori, S., Angelozzi, C., Pane, M., Mercuri, E., Bertini, E., Neri, G., & Brahe, C. (2010). Salbutamol increases survival motor neuron (SMN) transcript levels in leucocytes of spinal muscular atrophy (SMA) patients: relevance for clinical trial design Journal of Medical Genetics, 47 (12), 856-858 DOI: 10.1136/jmg.2010.080366

Defining absolute protein abundance

At the heart of Systems Biology is a vast hunger for measurements: mRNA abundance, metabolite concentrations, reaction rates, degradation rates, protein abundance. This last measurement has long been problematic for researchers. Mass spectrometers get increasingly accurate and powerful, but are still hindered by the simple fact that observed signal intensity does not necessarily correlate directly with the abundance of that peptide in the sample. Factors such as peptide ionisation efficiencies, dominant neighbour effects, and missing observations all give rise to erroneous estimates of peptide quantities. Until recently, the best way to get close to measures of protein abundance was to use a peptide tagging methodology, but these methods are typically expensive, and provide only relative quantification (useful for expression proteomics studies, less useful if you need to know the absolute levels of a protein for a Systems Biology study).

Recently, a three-step method has been proposed for determining the absolute quantities of proteins in the cell, on a proteome scale. Step one is isoelectric focussing of tryptic digests of whole cell extracts. Step two is calculating the absolute abundance of a small group of proteins by Selected Reaction Monitoring (SRM). SRM uses spiked-in, isotopically labelled peptides of known concentration as references to calculate the actual abundance of peptides of interest. Finally, step three uses these abundances as reference points to calculate the abundance of all proteins in the sample, using the median intensities from the 3 most intense peptides for each protein.
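To make step three a little more concrete, here is a minimal sketch in Python. It is not the authors' pipeline. The protein names and numbers are invented, and the log-log straight-line calibration is my own assumption about how the SRM reference points might be used to scale the other proteins.

```python
import math
from statistics import median


def top3_signal(peptide_intensities):
    """Median of the three most intense peptide signals for one protein."""
    top3 = sorted(peptide_intensities, reverse=True)[:3]
    return median(top3)


def fit_log_line(signals, copies):
    """Least-squares fit of log10(copies) = slope * log10(signal) + intercept.

    Assumption: a straight line in log-log space is a reasonable calibration.
    """
    lx = [math.log10(s) for s in signals]
    ly = [math.log10(c) for c in copies]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_copies(signal, slope, intercept):
    """Convert a top-3 signal into an estimated copy number per cell."""
    return 10 ** (slope * math.log10(signal) + intercept)


# Hypothetical peptide intensities per protein (arbitrary units).
peptides = {
    "anchor_a": [1000.0, 800.0, 1200.0, 50.0],
    "anchor_b": [10000.0, 9000.0, 12000.0],
    "anchor_c": [100000.0, 150000.0, 90000.0, 20000.0],
    "unknown_x": [5000.0, 4000.0, 7000.0, 100.0],
}
# Hypothetical absolute abundances from SRM (copies per cell) for the anchors.
srm_copies = {"anchor_a": 100.0, "anchor_b": 1000.0, "anchor_c": 10000.0}

anchors = sorted(srm_copies)
slope, intercept = fit_log_line(
    [top3_signal(peptides[name]) for name in anchors],
    [srm_copies[name] for name in anchors],
)
unknown_estimate = estimate_copies(top3_signal(peptides["unknown_x"]), slope, intercept)
print(f"unknown_x: about {unknown_estimate:.0f} copies per cell")
```

The point of the sketch is only that a handful of accurately quantified anchor proteins can turn relative intensities into absolute numbers for everything else in the sample.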

Leptospira interrogans (Wikimedia Commons)

Using this methodology, the abundances of >50% of the proteome of a human pathogen (Leptospira interrogans) have been determined to an accuracy of ~2-fold. These abundance measurements were confirmed by almost literally counting the number of flagellar proteins present in a cell by cryo-electron tomography.

Although current hardware probably limits this technique to a few thousand proteins, that is still a big step forward on what was previously possible. If whole-proteome-scale absolute abundance measurements become an achievable reality, maybe proteomics can finally take on microarrays as the dominant technique in the post-genomics world.

Malmström, J., Beck, M., Schmidt, A., Lange, V., Deutsch, E., & Aebersold, R. (2009). Proteome-wide cellular protein concentrations of the human pathogen Leptospira interrogans Nature, 460 (7256), 762-765 DOI: 10.1038/nature08184

From eczema to asthma (in mice)

Eczema and asthma often co-occur; indeed, I suffer from both (albeit mildly). What I wasn’t aware of was that eczema often comes first: asthma has an underlying rate of 4-8% in the general population, but 70% in individuals with a history of chronic severe eczema. The underlying mechanism for this so-called ‘atopic march’ isn’t known, though work published today in PLoS Biology elucidates a possible mechanism.

Researchers genetically engineered mice with chronic skin barrier defects (mice lacking Notch signalling in the skin, leading to impairment of epidermal differentiation), which exhibit an eczema-like skin condition. They then used these mice to demonstrate the predisposition of such affected individuals to allergic asthma. Occurrence of allergic asthma was 7-fold higher in the mutant mouse population, compared to a wild-type population.

The authors then went on to demonstrate that a cytokine called thymic stromal lymphopoietin (TSLP), which is secreted by the damaged skin into the circulation, is required for atopic march in the mutant mice. They show that by knocking out the TSLP receptor in these mice, they can prevent atopic march. They also show that over-production of TSLP in the skin is sufficient to cause allergic asthma, regardless of the cause of that over-production.

This is a paper a little outside my areas of expertise, which is why this is much more of a skim overview than normal. However, there is clearly good work being done here elucidating the molecular mechanisms of a very common disease process. There are also clear implications in this paper for the future management and treatment of eczema and asthma patients. Even though this is unlikely to improve my own experiences of these conditions, I’m very happy this kind of work is being done.

Demehri, S., Morimoto, M., Holtzman, M., & Kopan, R. (2009). Skin-Derived TSLP Triggers Progression from Epidermal-Barrier Defects to Asthma PLoS Biology, 7 (5) DOI: 10.1371/journal.pbio.1000067

Mining literature for PPIs

You know an article is going to be good when it starts with a sentence like ‘Due to the overwhelming increase in [sequence data/transcriptomics data/etc]…’. So an opening gambit of ‘With the overwhelming amount and exponential increase of biomedical literature[…]’ filled me with the promise of things to come.

The aim of this PLoS One paper is to provide an online tool for mining human protein-protein interaction data from the literature, based on the co-occurrence of protein names in PubMed abstracts, together with ‘interaction keywords’. This data, combined with PPI databases and shared GO terms, aims to provide a greatly expanded set of human protein-protein interactions.

The basic suggestion is that if a pair of protein names appear frequently in the same sentence, or paragraph, or even whole article, then there may exist a biologically meaningful relationship between them. This may well be the case in some circumstances, but it certainly does not imply a physical interaction. However, this study also employs natural language processing (NLP) techniques to examine the semantic relationships between co-occurring entities. This NLP approach uses the sentence as the unit of analysis, so may potentially miss the relevant associative language. The authors suggest that the hybrid approach, using both the statistical co-occurrence of terms and the semantic analysis, means they can recover more biologically meaningful relationships than other text mining methods.
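For anyone who, like me, doesn't do much text mining, the statistical half of this idea can be sketched in a few lines of Python. This is only an illustration of sentence-level co-occurrence counting, not PPI Finder's actual algorithm. The protein names, keyword list and example sentences are all made up.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical dictionaries; a real tool would need synonym handling.
PROTEINS = {"PROTA", "PROTB", "PROTC"}
KEYWORDS = {"binds", "interacts", "associates"}


def sentence_cooccurrence(abstracts):
    """Count protein pairs sharing a sentence, with and without a keyword."""
    pair_counts = Counter()
    keyword_counts = Counter()
    for abstract in abstracts:
        # Naive sentence split on terminal punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", abstract):
            tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
            found = sorted(PROTEINS & tokens)
            has_keyword = bool(KEYWORDS & {token.lower() for token in tokens})
            for pair in combinations(found, 2):
                pair_counts[pair] += 1
                if has_keyword:
                    keyword_counts[pair] += 1
    return pair_counts, keyword_counts


# Invented example abstracts.
abstracts = [
    "PROTA binds PROTB in vitro. PROTC was also expressed.",
    "Both PROTA and PROTB were detected in the same tissue.",
]
pairs, with_keyword = sentence_cooccurrence(abstracts)
print(pairs)
print(with_keyword)
```

Note that the second sentence adds to the co-occurrence count without saying anything about binding, which is exactly the kind of evidence the reservations above are about.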

By further combining this data with PPI data from established databases (to filter the ‘known’ PPIs) and information about shared GO terms (to provide some kind of qualitative backup for some predicted interactions), the authors provide a tool of real promise for identifying ‘new’ protein-protein interactions (or at least those not present in the existing databases).

29 genes were used in PPI Finder, finding a total of 944 interactions. Of these, only 28% are already found in trusted protein-protein interaction databases. Is PPI Finder really capable of enriching our knowledge of protein interactions to this degree, or is it merely finding genes which are biologically related, maybe by a transcription factor-target type relationship or similar, but don’t physically interact?

Of 100 trusted interactions studied, 69 were recovered by PPI Finder. Is this demonstrating that the databases can be overzealous in certain circumstances, or that the reporting of PPIs in the literature, especially in the abstracts of high-throughput studies, is woefully inadequate?

Interaction network of DTNBP1. Red relations exist in established PPI databases, cyan relations inferred from the literature by PPI Finder.

We’re in danger of reaching saturation point here. Many more methods of predicting or defining PPIs and we’ll have a complete set: everything will be able to be shown by some method to interact with everything else. So it’s a question of confidence, and I really don’t have a great deal of confidence in these results. A quick scan of the PPI Finder database online (no mean feat, believe me, the interface is not great) reveals that a large number of the interactions are defined by the protein names co-occurring in just one publication. However much NLP you do, this is no great measure of whether 2 proteins interact or not.

I think my figure shows how much ‘noise’ there is in this dataset. Whether this is productive noise or not is irrelevant without some kind of validation procedure external to the PPI Finder scoring mechanism.

My final criticism I have already alluded to… it is woefully difficult to retrieve data from PPI Finder. In constructing the above figure I had to do some 35 or so searches of the database, and page through the results of each 10 at a time. Even an option to display all results on a single page would have made this procedure considerably less painful.

Right, enough complaining, because this was actually quite an enjoyable read. I don’t know a huge amount about text mining, and it does seem obvious to use it for this sort of purpose. All of my reservations shouldn’t distract from the fact that the dataset produced for this paper will have its uses, and the authors are correct in the hyperbole I opened this post with: we do need good predictive tools to keep up with data deposition in the protein domain, because no experimental methodology is going to do it yet.

He, M., Wang, Y., & Li, W. (2009). PPI Finder: A Mining Tool for Human Protein-Protein Interactions PLoS ONE, 4 (2) DOI: 10.1371/journal.pone.0004554

Gold Standard not so shiny?

This paper caused a bit of a stir when it was published last week. The suggestion that highly curated ‘gold standard’ databases may not be as high quality as has been assumed had august figures such as Henning Hermjakob up in arms and countering as swiftly as humanly possible.

Protein-protein interactions, at the ‘interactome’ scale, are determined in two major ways: (i) high-throughput experimental studies, such as yeast two-hybrid and TAP assays, and (ii) curating the literature to gather together many interactions found in low-throughput experiments. Neither of these approaches is capable of fully illuminating the complete interactome of any organism, and so the aim of this paper by Michael Cusick and co-workers is to examine which of the two approaches produces the more reliable results.

With high throughput experiments, the number of interactions tested vs number found is known. This is not the case for curated sets. Negatives are underreported, so a full picture of the experimental background is unclear.

Literature curated sets tend to be used for appraisal of reliability of experimental sets. They make up the gold standard positives (GSP) with which high throughput PPIs are scored. This high reliability of curated data has largely been assumed, not tested.

The study examines the superficial reliability of curated datasets, before examining and reappraising specific interaction sets. Only 25% of yeast PPIs in BioGRID are supported by more than one publication; the numbers are comparable for humans (15%) and lower still for Arabidopsis (only 7%). Interactions supported by a single publication are naturally of lower reliability than those found more than once experimentally. The authors suggest that it is assumed that even the single-publication interactions come from small studies (I’m not sure that this is a valid assumption: most interactions in the literature come from high-throughput datasets, and I’m not convinced that anyone would expect these to be under-represented in curated datasets). Large proportions of single-publication interactions do come from high-throughput studies (not well-validated small-scale studies): almost 25% of yeast interactions in BioGRID come from 1% of publications detailing > 100 interactions.
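These are simple counting statistics, and you can compute the same kind of numbers for any curated dataset you use. A rough sketch in Python follows, assuming the data can be reduced to (protein, protein, publication) records. The record layout, the function name and the toy data are my own, not BioGRID's format or the authors' code.

```python
from collections import defaultdict


def support_summary(records, large_study_threshold=100):
    """Summarise publication support for a set of curated interactions.

    records: iterable of (protein_a, protein_b, publication_id) tuples.
    """
    pubs_per_interaction = defaultdict(set)
    interactions_per_pub = defaultdict(set)
    for protein_a, protein_b, pub_id in records:
        pair = frozenset((protein_a, protein_b))  # A-B is the same as B-A
        pubs_per_interaction[pair].add(pub_id)
        interactions_per_pub[pub_id].add(pair)

    total = len(pubs_per_interaction)
    multi = sum(1 for pubs in pubs_per_interaction.values() if len(pubs) > 1)
    large_pubs = {
        pub for pub, pairs in interactions_per_pub.items()
        if len(pairs) > large_study_threshold
    }
    from_large = sum(
        1 for pubs in pubs_per_interaction.values() if pubs & large_pubs
    )
    return {
        "interactions": total,
        "multi_publication_fraction": multi / total,
        "from_large_study_fraction": from_large / total,
    }


# Toy records; the threshold is lowered so that pub3 counts as a 'large' study.
records = [
    ("A", "B", "pub1"), ("B", "A", "pub2"), ("A", "C", "pub1"),
    ("C", "D", "pub3"), ("D", "E", "pub3"), ("E", "F", "pub3"),
]
summary = support_summary(records, large_study_threshold=2)
print(summary)
```

In this toy set only one of five interactions has more than one supporting publication, and three of the five come from the single 'large' study.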

MINT, IntAct and DIP do not overlap well (this is not enumerated in the text, but is in the figures). This is due to use of different manuscripts, not differential interpretation of the same corpus. This does imply that coverage of the literature is poor, but not that curation is unreliable. Different databases have different starting points in the literature, and there is a lot of it out there.

The next step for this study was to recurate ‘representative samples’ from the 3 organisms already mentioned. 35% of 100 yeast interactions were ‘incorrectly curated’, based on the criteria set out in the methods.

For humans, they chose a set of high-confidence, multiply curated, multiply databased interactions, of which 38% of the ‘curation units’ were found to be ‘wrong’. However, these 38% correspond to only 8.5% of the ‘interaction units’; in other words, 91.5% of these 188 interactions are still supported by at least one publication.
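The gap between those two percentages is easier to see with a toy example. The Python sketch below assumes, as I read the paper, that a ‘curation unit’ is one interaction as reported by one publication, and that an ‘interaction unit’ only loses support when every one of its curation units fails recuration. The data are invented purely for illustration.

```python
from collections import defaultdict


def recuration_summary(curation_units):
    """Return (fraction of wrong curation units, fraction of unsupported interactions).

    curation_units: iterable of (interaction_id, publication_id, verified).
    """
    by_interaction = defaultdict(list)
    for interaction_id, _publication_id, verified in curation_units:
        by_interaction[interaction_id].append(verified)

    all_flags = [flag for flags in by_interaction.values() for flag in flags]
    wrong_fraction = all_flags.count(False) / len(all_flags)
    # An interaction is unsupported only if none of its units was verified.
    unsupported = sum(1 for flags in by_interaction.values() if not any(flags))
    return wrong_fraction, unsupported / len(by_interaction)


# Invented data: four interactions, eight curation units.
units = [
    ("i1", "p1", True), ("i1", "p2", False),
    ("i2", "p3", True), ("i2", "p4", False),
    ("i3", "p5", False),
    ("i4", "p6", True), ("i4", "p7", True), ("i4", "p8", False),
]
wrong_unit_fraction, unsupported_fraction = recuration_summary(units)
print(wrong_unit_fraction, unsupported_fraction)

# Simple arithmetic on the figures quoted above: 8.5% of 188 interactions.
unsupported_in_paper = round(0.085 * 188)
print(unsupported_in_paper)
```

Here half of the curation units are wrong, yet only a quarter of the interactions end up with no support at all. The same effect is what shrinks 38% to 8.5% (about 16 of the 188 interactions) in the human set.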

Only 6% of the less-well studied Arabidopsis representative set were called ‘incorrect’ in the recuration.

There is no denying that, on the face of it, these are disturbing numbers. However, they are presented in the paper as much more alarming than they perhaps really are. With respect to the human reappraisal, only the 38% figure is mentioned in the main body of the text. The fact is, these unsupported annotations undermine less than 10% of the dataset.

The authors suggest the difficulty of curation is underestimated – not by those recruiting curators it isn’t. I would question what makes this ‘recuration’ more reliable than the original curation by experienced and practiced hands. Furthermore, why do the curation methods differ for different organisms? Surely the questions posed in the yeast curation are valid for the human dataset?

They do point out the obscurity of the literature – universal identifiers are lacking, and it can be difficult to even determine the species a given sequence originates from, let alone the specific protein being discussed.
In the light of this, they make the very sensible suggestion that MIMIx is a good thing.

It should be noted also that, as mentioned, Henning Hermjakob sprang to the defence of curated databases, suggesting that the majority of Arabidopsis interactions reported as incorrect are, in fact, accurate. He also suggested that it was hard to debate the veracity of the claims made about yeast and humans, as there is not a direct citation for each interaction to follow up (GenomeWeb).

I think there are some valid points made about the difficulties of curation, and that not all annotations should necessarily be taken at face value, but I also don’t believe the findings of this study are as alarming as they are portrayed. It is not time to ditch all of those gold standard datasets just yet. In general I would still suggest that curated datasets are more reliable than high throughput sets.

Michael E Cusick, Haiyuan Yu, Alex Smolyar, Kavitha Venkatesan, Anne-Ruxandra Carvunis, Nicolas Simonis, Jean-François Rual, Heather Borick, Pascal Braun, Matija Dreze, Jean Vandenhaute, Mary Galli, Junshi Yazaki, David E Hill, Joseph R Ecker, Frederick P Roth, Marc Vidal (2009). Literature-curated protein interaction datasets Nature Methods, 6 (1), 39-46 DOI: 10.1038/nmeth.1284

Contextual Specificity in Peptide-Mediated Protein Interactions

This is my entry for the PLoS ONE @ Two synchroblog. It is my first foray into blogging on peer-reviewed research, and I hope you enjoy it. Here is a link to the paper concerned.

It is a general truism that cellular events are mediated by proteins. It is a further truth that proteins do not function in isolation, but work to accomplish their function in ‘cooperation’ often as part of large macromolecular assemblies. These assemblies are created and coordinated through large networks of mostly transient protein-protein interactions (PPIs). Much research effort has been expended in attempting to elucidate the specifics and means of protein interactions, and much recent Bioinformatics research is dedicated to the prediction and validation of PPIs.

The study in question here [1] examines a particular class of protein-protein interaction, and looks to elucidate the precise mechanisms by which binding strength and specificity are determined. The class of PPIs being studied are those where a globular domain in one protein recognises and binds to a linear peptide from another. This type of transient, peptide-mediated interaction is underrepresented in high-throughput datasets [2]. It has been shown that, while bonding between linear motifs and globular domains is sufficient for binding, it is not enough to explain the high degree of interaction specificity that has been observed in vivo. What then confers the specificity? (Pbs2 in yeast, for instance, only binds to the SH3 domain of Sho1, and does not interact with any of the 26 other SH3 domains found in yeast [3].) The answer, according to Stein and Aloy, is context. This context includes the spatial and temporal location of the proteins concerned (thus limiting the available binding partners), but also the residues that surround the linear binding motif, which contribute to the environment of the interaction and the overall energy of binding.

In order to assess what role the residue context (not spatial or temporal) plays in determining the specificity of PPIs, Stein and Aloy systematically identified all peptide-globular domain interactions (using the ELM database of motifs) of known structure from the PDB, and used them to investigate the contribution of the motif itself and its context to the global binding energy. They ended up with a set of 390 interactions of known structure, that they used for their analysis.

WD40 domain bound to LigEH1 9 amino acid ELM motif. Here just one residue of context provides 9% of the total binding energy.

What they found, using the FoldX Package to perform in silico alanine scanning experiments, is that the residues of the binding motif itself are responsible for, on average, 79% of the global binding energy (between just 12% and 99.7%, depending on the type of interaction). The remaining 21% (on average) is contributed by the residues of the context.
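To show what a figure like ‘79% from the motif’ means in practice, here is a small Python sketch that splits per-residue energy contributions between motif and context. It does not call FoldX. The energies and residue positions are invented, and it assumes the per-residue values from an alanine scan can simply be added up, which is a simplification on my part.

```python
def energy_shares(ddg_by_position, motif_positions):
    """Split summed per-residue binding energy between motif and context.

    ddg_by_position: mapping of peptide residue position to the change in
    binding energy (kcal/mol) when that residue is mutated to alanine.
    Assumes contributions are additive (a simplification).
    """
    total = sum(ddg_by_position.values())
    motif = sum(
        energy for position, energy in ddg_by_position.items()
        if position in motif_positions
    )
    return motif / total, (total - motif) / total


# Hypothetical alanine scan of a five-residue peptide; positions 2-4 form the motif.
ddg = {1: 0.3, 2: 1.0, 3: 0.5, 4: 1.5, 5: 0.7}
motif_share, context_share = energy_shares(ddg, motif_positions={2, 3, 4})
print(f"motif {motif_share:.0%}, context {context_share:.0%}")
```

In this made-up peptide the two flanking residues carry a quarter of the binding energy, which is in the same region as the 21% average reported for the context.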

The second major finding of the paper is that, within a group of domain-peptide interactions, the position of the motif within the interaction is relatively ‘fixed’ (RMSD – 2.5 ± 3.2Å), whereas there is more flexibility in context placement (RMSD – 4.2 ± 4.4Å). This reinforces the idea that the motif is necessary and sufficient for actual binding to take place (since it is more restrained, both sequentially and spatially), but the context is required to ensure specificity of a given interaction.
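For readers who haven't met RMSD before, it is just the root of the mean squared distance between matched atoms in two structures. The Python sketch below assumes the two complexes have already been superimposed on their globular domains, which is my reading of how the motif and context placements would be compared. The coordinates are invented.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z) points.

    Assumes the structures are already superimposed and the atoms are paired
    in the same order in both lists.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Invented coordinates (in Å): the second peptide is shifted 2 Å along z.
peptide_one = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
peptide_two = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
shift = rmsd(peptide_one, peptide_two)
print(f"RMSD = {shift:.1f} Å")
```

A lower RMSD for the motif atoms than for the context atoms, as reported in the paper, means the motif sits in much the same place from one complex to the next while the flanking residues wander.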

Their final observation, that in 5% of cases sequence conservation of <30% was sufficient to allow for exchange of binding partners, is another important one. This suggests that it is extremely difficult to predict any potential cross-reactions that may occur purely from sequence alignments. Therefore structural knowledge is required (whether experimental or modelled) in order to make successful predictions of domain-peptide interactions. Indeed they cite instances where exploiting structural knowledge has been useful for the prediction of domain-domain interactions (though I fear they missed the, clearly vitally important, work of Cockell et al (2007) [4]).

The suggestion is made that the context has evolved, not to maximise binding strength, but binding specificity. This is supported by the observation that the motif sequence, although not being completely responsible for the global binding energy, is often nearly optimal, and also by the relative inflexibility of the motifs in structural terms. This has clear implications for both predicted and experimentally determined PPIs. These implications are not pointed out in the paper, which is largely positive in tone, but I feel they are important.

Where predictions of interactions have been made using linear motifs as the guiding factor, context will not (or maybe very rarely) have been considered. Therefore it may be the case that while a given interaction is technically feasible, and the motif is sufficient for binding to occur, the lack of the correct context means that the interaction is actually unlikely to be found in vivo.

This is also true for experimentally determined interactions. In an experiment such as a yeast two-hybrid screen (for example), 2 proteins are brought together in excess in the often foreign environment of a yeast nucleus. In these circumstances, a match between a globular domain and the appropriate motif partner may well lead to binding and reporter activation, regardless of context, simply due to the fact that no other proteins are around to compete that have a more suitable context for binding.

I enjoyed this paper. It is unusual to find a paper that is largely about binding energies and dissociation constants that doesn’t include a huge amount of laughably complicated mathematics, and Stein and Aloy strike the right balance I think. They make a valid point while summing up that knowledge of how transient PPIs occur, and are mediated, is crucial for both systems and synthetic biology (i.e. understanding and modelling regulatory processes, and designing new circuits). This paper does contribute to that understanding significantly.

1. Amelie Stein, Patrick Aloy (2008). Contextual Specificity in Peptide-Mediated Protein Interactions PLoS ONE, 3 (7) DOI: 10.1371/journal.pone.0002524

2. T. Pawson, R. Linding (2005). Synthetic modular systems – reverse engineering of signal transduction FEBS Letters, 579 (8), 1808-1814 DOI: 10.1016/j.febslet.2005.02.013

3. Ali Zarrinpar, Sang-Hyun Park, Wendell A. Lim (2003). Optimization of specificity in a cellular protein interaction network by negative selection Nature, 426 (6967), 676-680 DOI: 10.1038/nature02178

4. S. J. Cockell, B. Oliva, R. M. Jackson (2007). Structure-based evaluation of in silico predictions of protein protein interactions using Comparative Docking Bioinformatics, 23 (5), 573-581 DOI: 10.1093/bioinformatics/btl661