You know an article is going to be good when it starts with a sentence like ‘Due to the overwhelming increase in [sequence data/transcriptomics data/etc]…’. So an opening gambit of ‘With the overwhelming amount and exponential increase of biomedical literature[…]’ filled me with the promise of things to come.
The aim of this PLoS One paper is to provide an online tool for mining human protein-protein interaction data from the literature, based on the co-occurrence of protein names in PubMed abstracts, together with ‘interaction keywords’. These data, combined with PPI databases and shared GO terms, are intended to provide a greatly expanded set of human protein-protein interactions.
The basic suggestion is that if a pair of protein names appears frequently in the same sentence, or paragraph, or even whole article, then there may exist a biologically meaningful relationship between them. This may well be the case in some circumstances, but it certainly does not imply a physical interaction. However, this study also employs natural language processing (NLP) techniques to examine the semantic relationships between co-occurring entities. This NLP approach uses the sentence as the unit of analysis, so may potentially miss relevant associative language that falls outside the sentence in question. The authors suggest that the hybrid approach, using both the statistical co-occurrence of terms and the semantic analysis, means they can recover more biologically meaningful relationships than other text mining methods.
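To make the co-occurrence idea concrete, here is a minimal Python sketch of sentence-level counting. It is not the authors' implementation: the keyword list, the naive sentence splitter and the function names are all assumptions made for illustration, and there is no real NLP in it at all.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical keyword list; the paper's actual 'interaction keywords' are not reproduced here.
INTERACTION_KEYWORDS = {"binds", "interacts", "phosphorylates", "associates"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def cooccurrence_counts(abstracts, protein_names, keywords=INTERACTION_KEYWORDS):
    """Count, per protein pair, the sentences that mention both names and a keyword."""
    counts = Counter()
    for abstract in abstracts:
        for sentence in split_sentences(abstract):
            tokens = set(re.findall(r"[A-Za-z0-9-]+", sentence.lower()))
            if not tokens & keywords:
                continue  # no interaction keyword in this sentence
            present = sorted(p for p in protein_names if p.lower() in tokens)
            for pair in combinations(present, 2):
                counts[pair] += 1
    return counts
```

Even in this toy form the weakness is visible: a pair only scores when both names and a keyword land in the same sentence, and a score of one says very little about whether the proteins physically touch.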
By further combining this data with PPI data from established databases (to filter the ‘known’ PPIs) and information about shared GO terms (to provide some kind of qualitative backup for some predicted interactions), the authors provide a tool of real promise for identifying ‘new’ protein-protein interactions (or at least those not present in the existing databases).
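The filtering step can be sketched along the following lines. This is an assumed, simplified version: the function name, the data structures and the idea of representing GO annotations as plain sets of strings are mine, not something taken from the paper.

```python
def classify_candidates(candidates, known_ppis, go_terms):
    """Flag each candidate pair as known or not, and attach any shared GO terms.

    candidates: iterable of (protein_a, protein_b) tuples from text mining
    known_ppis: iterable of pairs already present in trusted databases
    go_terms:   dict mapping protein name -> set of GO term identifiers
    """
    # frozenset makes the comparison independent of the order within a pair
    known_set = {frozenset(pair) for pair in known_ppis}
    results = []
    for a, b in candidates:
        shared = go_terms.get(a, set()) & go_terms.get(b, set())
        results.append({
            "pair": (a, b),
            "known": frozenset((a, b)) in known_set,
            "shared_go": sorted(shared),
        })
    return results
```

Pairs flagged as not known but with at least one shared GO term would be the interesting ‘new’ candidates, though a shared annotation is still only circumstantial support.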
29 genes were used in PPI Finder, finding a total of 944 interactions. Of these, only 28% are already found in trusted protein-protein interaction databases. Is PPI Finder really capable of enriching our knowledge of protein interactions to this degree, or is it merely finding genes which are biologically related, maybe by a transcription factor-target type relationship or similar, but which don’t physically interact?
Of 100 trusted interactions studied, 69 were recovered by PPI Finder. Is this demonstrating that the databases can be overzealous in certain circumstances, or that the reporting of PPIs in the literature, especially in the abstracts of high-throughput studies, is woefully inadequate?
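As a back-of-the-envelope check on what those figures imply, the arithmetic can be written out as a short Python sketch. The function name is hypothetical, and because the 28% is a rounded figure the derived counts are approximate.

```python
def summarise(total_predicted, fraction_known, trusted_tested, trusted_recovered):
    """Turn the reported percentages into approximate counts and a recall figure."""
    known = round(total_predicted * fraction_known)  # approximate: 28% is rounded
    return {
        "known": known,
        "not_in_databases": total_predicted - known,
        "recall": trusted_recovered / trusted_tested,
    }


# Figures quoted above: 944 predictions, 28% already known, 69 of 100 recovered.
summary = summarise(944, 0.28, 100, 69)
```

So roughly 264 of the predictions are already in the databases and about 680 are not, while nearly a third of the trusted interactions tested are missed altogether.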
We’re in danger of reaching saturation point here. Many more methods of predicting or defining PPIs and we’ll have a complete set: everything will be able to be shown by some method to interact with everything else. So it’s a question of confidence, and I really don’t have a great deal of confidence in these results. A quick scan of the PPI Finder database online (no mean feat, believe me, the interface is not great) reveals that a large number of the interactions are defined by the protein names co-occurring in just one publication. However much NLP you do, this is no great measure of whether two proteins interact or not.
I think my figure shows how much ‘noise’ there is in this dataset. Whether this is productive noise or not is irrelevant without some kind of validation procedure external to the PPI Finder scoring mechanism.
My final criticism I have already alluded to: it is woefully difficult to retrieve data from PPI Finder. In constructing the above figure I had to do some 35 or so searches of the database, and page through the results of each 10 at a time. Even an option to display all results on a single page would have made this procedure considerably less painful.
Right, enough complaining, because this was actually quite an enjoyable read. I don’t know a huge amount about text mining, and it does seem obvious to use it for this sort of purpose. None of my reservations should distract from the fact that the dataset produced for this paper will have its uses, and the authors are correct in the hyperbole I opened this post with: we do need good predictive tools to keep up with data deposition in the protein domain, because no experimental methodology is going to do it yet.
He, M., Wang, Y., & Li, W. (2009). PPI Finder: A Mining Tool for Human Protein-Protein Interactions. PLoS ONE, 4(2). DOI: 10.1371/journal.pone.0004554