Gold Standard not so shiny?

This paper caused a bit of a stir when it was published last week. The suggestion that highly curated ‘gold standard’ databases may not be as high quality as has been assumed had august figures such as Henning Hermjakob up in arms and countering as swiftly as humanly possible.

Protein-protein interactions, at the ‘interactome’ scale, are determined in two major ways: (i) high-throughput experimental studies, such as yeast two-hybrid and TAP assays, and (ii) curation of the literature to gather together the many interactions found in low-throughput experiments. Neither approach can fully illuminate the complete interactome of any organism, so the aim of this paper by Michael Cusick and co-workers is to examine which of the two produces the more reliable results.

With high-throughput experiments, the number of interactions tested versus the number found is known. This is not the case for curated sets: negative results are under-reported in the literature, so the full experimental background remains unclear.

Literature-curated sets tend to be used to appraise the reliability of experimental sets: they make up the gold standard positives (GSPs) against which high-throughput PPIs are scored. Yet this high reliability of curated data has largely been assumed, not tested.

The study first examines the superficial reliability of curated datasets, before reappraising specific interaction sets. Only 25% of yeast PPIs in BioGRID are supported by more than one publication, and the numbers are similar or worse for humans (15%) and Arabidopsis (lower still, at only 7%). Interactions supported by a single publication are naturally less reliable than those found experimentally more than once. The authors suggest it is commonly assumed that even the single-publication interactions come from small studies (I’m not sure this is a valid assumption: most interactions in the literature come from high-throughput datasets, and I’m not convinced anyone would expect these to be under-represented in curated databases). In fact, a large proportion of single-publication interactions do come from high-throughput studies rather than well-validated small-scale ones; almost 25% of yeast interactions in BioGRID come from the 1% of publications detailing more than 100 interactions.
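As a toy illustration of the bookkeeping involved (the protein pairs and PMIDs below are invented, not BioGRID records), counting how many curated interactions are backed by more than one publication is just a matter of tallying the publications recorded for each pair:

```python
# Hypothetical mini-dataset: each interaction maps to the set of
# publications (PMIDs) that report it. All identifiers are invented.
support = {
    "P1-P2": {"pmid:1"},            # single-publication interaction
    "P3-P4": {"pmid:2", "pmid:3"},  # multiply supported
    "P5-P6": {"pmid:1"},
    "P7-P8": {"pmid:4"},
}

# Count interactions supported by more than one publication.
multi = sum(1 for pubs in support.values() if len(pubs) > 1)
print(f"{100 * multi / len(support):.0f}% supported by >1 publication")
```

In this invented sample only one of four interactions is multiply supported, mirroring the kind of skew the paper reports for real databases.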

MINT, IntAct and DIP do not overlap well (this is not enumerated in the text, but is shown in the figures). This is due to the use of different manuscripts, not differential interpretation of the same corpus. It implies that coverage of the literature is poor, but not that curation is unreliable: different databases have different starting points in the literature, and there is a lot of it out there.
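The poor overlap is easy to picture with sets. In this minimal sketch (the pairs below are invented placeholders, not real MINT/IntAct/DIP entries), three databases can each be individually accurate yet share almost nothing, simply because they curated different papers:

```python
# Invented interaction sets standing in for three curated databases.
mint = {("A", "B"), ("C", "D"), ("E", "F")}
intact = {("A", "B"), ("G", "H"), ("I", "J")}
dip = {("C", "D"), ("K", "L"), ("M", "N")}

# Union: everything curated anywhere; intersection: shared by all three.
union = mint | intact | dip
shared_by_all = mint & intact & dip
print(f"union: {len(union)}, shared by all three: {len(shared_by_all)}")
```

Here every database contains only correct entries, yet no interaction is common to all three, so low overlap alone says nothing about curation quality.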

The next step for the study was to recurate ‘representative samples’ from the three organisms already mentioned. Of 100 yeast interactions, 35% were ‘incorrectly curated’ based on the criteria set out in the methods.

For humans, they chose a set of high-confidence interactions, each curated multiple times and present in multiple databases, of which 38% of the ‘curation units’ were found to be ‘wrong’. However, these correspond to only 8.5% of the ‘interaction units’; in other words, 91.5% of these 188 interactions are still supported by at least one publication.
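The distinction between curation units and interaction units matters here. A sketch with invented numbers (not the paper’s 188 interactions) shows how a large fraction of wrong curation units can still leave most interactions supported, because an interaction only becomes unsupported when every one of its units fails:

```python
# Toy data: each interaction maps to its curation units, where True means
# the unit survived recuration. All values are invented for illustration.
interactions = {
    "P1-P2": [True, True, False],   # one bad unit, still supported
    "P3-P4": [True, False],
    "P5-P6": [False, False],        # every unit failed: unsupported
    "P7-P8": [True, True],
}

# Fraction of curation units that are wrong.
all_units = [ok for units in interactions.values() for ok in units]
pct_units_wrong = 100 * all_units.count(False) / len(all_units)

# Fraction of interactions left with no surviving unit at all.
unsupported = [pair for pair, units in interactions.items() if not any(units)]
pct_int_unsupported = 100 * len(unsupported) / len(interactions)

print(f"{pct_units_wrong:.0f}% of units wrong, "
      f"but only {pct_int_unsupported:.0f}% of interactions unsupported")
```

In this toy case 44% of units are wrong but only 25% of interactions lose all support, which is exactly the shape of the 38% versus 8.5% gap in the human reappraisal.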

Only 6% of the less-well studied Arabidopsis representative set were called ‘incorrect’ in the recuration.

There is no denying that, on the face of it, these are disturbing numbers. However, they are presented in the paper as rather more alarming than they perhaps really are. With respect to the human reappraisal, only the 38% figure is mentioned in the main body of the text, when in fact these unsupported annotations undermine less than 10% of the dataset.

The authors suggest that the difficulty of curation is underestimated – not by those recruiting curators it isn’t. I would question what makes this ‘recuration’ more reliable than the original curation by experienced and practised hands. Furthermore, why do the curation methods differ between organisms? Surely the questions posed in the yeast curation are equally valid for the human dataset.

They do point out the obscurity of the literature: universal identifiers are lacking, and it can be difficult even to determine which species a given sequence originates from, let alone the specific protein being discussed.
In the light of this, they make the very sensible suggestion that MIMIx is a good thing.

It should also be noted that, as mentioned, Henning Hermjakob sprang to the defence of curated databases, suggesting that the majority of Arabidopsis interactions reported as incorrect are, in fact, accurate. He also suggested that it was hard to debate the veracity of the claims made about yeast and humans, as there is no direct citation for each interaction to follow up (GenomeWeb).

I think there are some valid points made about the difficulties of curation, and that not all annotations should necessarily be taken at face value, but I also don’t believe the findings of this study are as alarming as they are portrayed. It is not time to ditch all of those gold-standard datasets just yet. In general, I would still suggest that curated datasets are more reliable than high-throughput sets.

Michael E. Cusick, Haiyuan Yu, Alex Smolyar, Kavitha Venkatesan, Anne-Ruxandra Carvunis, Nicolas Simonis, Jean-François Rual, Heather Borick, Pascal Braun, Matija Dreze, Jean Vandenhaute, Mary Galli, Junshi Yazaki, David E. Hill, Joseph R. Ecker, Frederick P. Roth & Marc Vidal (2009). Literature-curated protein interaction datasets. Nature Methods, 6(1), 39–46. DOI: 10.1038/nmeth.1284

