Graphing protein databases

I’m giving a lecture next week to the Bioinformatics Masters students here about protein structure prediction. As part of the introduction to this topic, I have a traditional ‘data explosion’ slide, to illustrate the gap between the quantity of protein sequence data available and the number of solved protein structures in the PDB (hence the need for bioinformatics to help fill the gap with good prediction algorithms). When I last gave this talk (scarily, 4 years ago), this slide was just text: a description of the then-current sizes of UniProt & the PDB.

Since 2006 my lecturing style has progressed somewhat. I don’t like to have slides with just words on them anymore, so I wanted to replace this slide rather than just update the numbers. Graphs of the growing sizes of the databases are easy to find online, but to my mind the real story here is the gap between the sizes of the 2 databases (UniProt & PDB), and whether it is growing (or whether protein structure determination methods are catching up). Such a graph doesn’t (to my knowledge) exist, so, inspired by this question on BioStar, I set out to draw one.

The first task is to retrieve, from each database, its size at particular dates. For the PDB this is simple, because they distribute a CSV file of this information. You can get it too; it’s linked to here. For UniProt, it was non-obvious where to find this information. Every time there’s a new release, the webpage documenting that release gives the size of UniProt at the point of release (and of its components, SwissProt and TrEMBL), but it is hard to find these pages for any release that is not current. So my approach was to download the history of UniProt from their FTP server, and use BioPython to calculate the size of each release:

import os
from Bio import SwissProt

DATA_DIR = "data"

def count_records(path):
    # Count entries one at a time so a whole release is never held in memory
    with open(path) as handle:
        return sum(1 for _ in SwissProt.parse(handle))

def numbers(release):
    directory = os.path.join(DATA_DIR, release)
    with open(os.path.join(directory, "reldate.txt")) as h:
        lines = h.readlines()
    date = lines[1].rstrip()  # more processing required to return just the date
    sprot_size = count_records(os.path.join(directory, "uniprot_sprot.dat"))
    trembl_size = count_records(os.path.join(directory, "uniprot_trembl.dat"))  # and the same for TrEMBL
    return (date, sprot_size, trembl_size)

def main():
    # One line of output per downloaded release directory
    for date, sprot_size, trembl_size in map(numbers, sorted(os.listdir(DATA_DIR))):
        print(date, sprot_size, trembl_size)

if __name__ == "__main__":
    main()

It was only once I was coming to the end of this process (slow, because we’re dealing with 16 releases of UniProt: 150GB of data) that I found this page, which was fairly hidden away, but gives the sizes of SwissProt over the last 25 years. Curses! So much effort seemingly gone to waste. However, there doesn’t appear to be a corresponding page for TrEMBL, which is much larger (being a conceptual translation of EMBL), and I wanted these numbers too, to illustrate the full scope of the problem. So my effort was not in vain.
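The tuples returned by numbers() still carry the whole release line rather than a bare date, and they hold two sizes per release. As a rough sketch of the remaining processing, the following pulls out the date and emits one row per date and database. It assumes the date appears in the line as something like 02-Nov-2010; the helper names (release_date, to_rows, write_rows) and the sample line in the comment are illustrative rather than taken from the original script.

```python
import csv
import re
from datetime import datetime

# Assumed shape of the date inside a reldate.txt line, e.g.
# "UniProtKB/Swiss-Prot Release 2010_11 of 02-Nov-2010" (illustrative)
DATE_PATTERN = re.compile(r"\d{2}-[A-Za-z]{3}-\d{4}")

def release_date(line):
    """Extract the release date from a line and return it as YYYY-MM-DD."""
    match = DATE_PATTERN.search(line)
    if match is None:
        raise ValueError("no date found in: %r" % line)
    return datetime.strptime(match.group(0), "%d-%b-%Y").date().isoformat()

def to_rows(results):
    """Turn (date_line, sprot_size, trembl_size) tuples into (date, database, size) rows."""
    for date_line, sprot_size, trembl_size in results:
        date = release_date(date_line)
        yield (date, "SwissProt", sprot_size)
        yield (date, "TrEMBL", trembl_size)

def write_rows(results, handle):
    """Write the rows as comma-separated text to an open file handle."""
    csv.writer(handle).writerows(to_rows(results))
```

Writing the date as YYYY-MM-DD is a deliberate choice here, since that is a form R’s as.Date() accepts without a format string.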

Now that we have all the numbers in an appropriate format (DATE,DATABASE,SIZE), we can draw some graphs. For this I use the ggplot2 library and R, which seems to be de rigueur for pretty visualisations these days. Here’s some code:

library(ggplot2)
library(scales)  # provides comma() for the axis labels

# Recent ggplot2 releases spell the axis options date_labels= and labels=;
# older releases used format= and formatter= instead.

pdb <- read.table("/path/to/data/pdb.txt", sep=",")
colnames(pdb) <- c("Year", "Database", "value")
pdb$Year <- as.Date(pdb$Year)
png("/path/to/graphs/uniprot_graphs/pdb.png", bg="transparent", width=800, height=600)
print(qplot(Year, value, data=pdb, geom="line", color=I("red")) + scale_x_date(date_labels="%Y") + scale_y_continuous("Entries", labels=comma))
dev.off()  # close the device so the PNG is actually written

spdb <- read.table("/path/to/data/sp_pdb.txt", sep=",")
colnames(spdb) <- c("Year", "Database", "value")
spdb$Year <- as.Date(spdb$Year)
png("/path/to/graphs/sp_pdb.png", bg="transparent", width=800, height=600)
print(qplot(Year, value, data=spdb, geom="line", group=Database, color=Database) + scale_x_date(date_labels="%Y") + scale_y_continuous("Entries", labels=comma))
dev.off()

all <- read.table("/path/to/data/all.txt", sep=",")
colnames(all) <- c("Year", "Database", "value")
all$Year <- as.Date(all$Year)
png("/path/to/graphs/all.png", bg="transparent", width=800, height=600)
print(qplot(Year, value, data=all, geom="line", group=Database, color=Database) + scale_x_date(date_labels="%Y") + scale_y_log10("Entries", breaks=c(10^4,10^5,10^6,10^7)))
dev.off()

This very simple R produces 3 plots, all of which are informative in different ways.


Plot 1 is a simple restatement of the PDB graph, which I produced just so all my graphs would look the same. It’s a pretty standard exponential curve (though admittedly the numbers are slightly smaller than the numbers you may be used to seeing on such plots).

SwissProt vs PDB

Plot 2 compares the size of SwissProt with the size of the PDB. I’m extremely happy with this one, as it shows precisely what I wanted it to: SwissProt being much larger than the PDB, and marching away at an increasing rate. For the record, the most recent sizes of the PDB and SwissProt in the graph are 68,998 and 522,019 respectively (compared with 40,132 & 241,365 when I last gave the protein structure lecture).
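To put a number on ‘marching away’, the gap can be worked out directly from the four figures quoted above. This is only a back-of-the-envelope sketch: the function name gap and the snapshot labels are made up for illustration, and two time points say nothing about the shape of the curve in between.

```python
# (PDB entries, SwissProt entries) as quoted in the paragraph above
snapshots = {
    "previous lecture": (40132, 241365),
    "latest point in graph": (68998, 522019),
}

def gap(pdb, swissprot):
    """Return the absolute gap and the SwissProt:PDB ratio."""
    return swissprot - pdb, swissprot / pdb

for label, (pdb, swissprot) in snapshots.items():
    difference, ratio = gap(pdb, swissprot)
    print("%s: gap of %d entries, ratio %.2f" % (label, difference, ratio))
```

On these figures the gap grows from 201,233 to 453,021 entries, and SwissProt goes from roughly 6 to roughly 7.6 times the size of the PDB.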

TrEMBL vs SwissProt vs PDB

The final plot is just to scare people. It includes TrEMBL, and had to be plotted on a log10 scale, because TrEMBL is another order of magnitude larger than SwissProt (12,347,303 sequences).

Addendum – further to all this, the problem of the gap between sequence and structure is actually more stark than presented here. Although the PDB today (11/11/10) contains 69,162 structures, they are highly redundant, and there are only 39,724 unique sequences of known structure.
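The same sort of arithmetic shows how much the redundancy matters. The sketch below uses only figures quoted in this post; note that the SwissProt count comes from the graph data rather than from 11/11/10, so the second number is approximate, and the variable names are illustrative.

```python
pdb_structures = 69162      # PDB entries on 11/11/10, from the text
unique_sequences = 39724    # unique sequences of known structure, from the text
swissprot_entries = 522019  # latest SwissProt size quoted earlier in the post

# Share of PDB entries that are distinct sequences
unique_fraction = unique_sequences / pdb_structures
# SwissProt entries for every distinct sequence with a known structure
sequences_per_unique_structure = swissprot_entries / unique_sequences

print("%.1f%% of PDB entries are unique sequences" % (100 * unique_fraction))
print("%.1f SwissProt entries per unique structure" % sequences_per_unique_structure)
```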

While I was away…

View of Whitby

I’ve just returned from a week away in North Yorkshire, and scanning through my RSS backlog and FriendFeed from while I was away, I notice a few interesting developments.

  1. Zotero made a firm announcement of a standalone version of their excellent open-source reference management tool. I’ve been keen on Zotero for a while, but have moved away from Firefox as my browser lately (at present, Zotero is only available as a Firefox plugin). I’m looking forward to using it in earnest again, and trying to integrate it more fully into my workflow (which has been the main problem with other reference managers I’ve tried). No release date yet that I’ve found (sadly).
  2. Nodalpoint, the bioinformatics blog of old, has been reincarnated as a podcast. I love podcasts, and listen to a whole bunch of them (to the point where I don’t have much of a chance to listen to music on my commute any more), but there aren’t many around that are relevant to what I do with most of my time (apart from the excellent c2cbio podcast of course). I’m really looking forward to listening to episode one, and hope more follow in due course.
  3. The people behind the ‘Science is Vital’ lobby group, as well as organising a public rally and a lobby of Parliament, have set up a petition. I urge everyone (in the UK) to go and sign it. If we don’t try to protect science to some degree in the forthcoming round of spending cuts, no one else will.

Parsing Thermo Finnigan RAW files

In a rare move, I’m going to largely copy across a post from my work blog, because I hope it contains useful information. For background, I’m trying to write a simple Python script that extracts particular metadata from a .RAW file, produced by a Thermo Finnigan mass spectrometer. Tools that exist for parsing these files require access to proprietary XCalibur libraries, which I do not have.

Thermo provided a link to MSFileReader, a ‘freeware’ COM object that should allow interaction with RAW files without an XCalibur installation. They also sent a PDF guide to the COM object. Although this will allow XCalibur to be avoided, the work is still Windows-bound.

Python and COM objects

Python can talk to COM objects through the win32com.client package. As a test, I installed Python, MSFileReader and the pywin32 libs on my netbook (which is a Windows 7 machine). I can import the required Python module, but need to extend the module search path (sys.path) somewhat:

>>> import sys
>>> sys.path.append(r'C:\Python26\Lib\site-packages\win32')
>>> sys.path.append(r'C:\Python26\Lib\site-packages\win32\lib')
>>> from win32com.client import Dispatch
>>> x = Dispatch("NAME")

The key thing here is “NAME”.

The provided PDF gives C snippets for each method available in the COM object. These provide only one clue as to the possible name of the COM object:

// example for Open
TCHAR* szPathName[] = _T("c:\xcalibur\examples\data\steroids15.raw");
long nRet = XRawfileCtrl.Open( szPathName );
if( nRet != 0 ) {
    ::MessageBox( NULL, _T("Error opening file"), _T("Error"), MB_OK );
}

XRawfileCtrl is used to call the Open() method. However, this and MSFileReader as “NAME” both fail (Invalid class string).

I then found ‘multiplierz’, which seems to use MSFileReader to create mzAPI, which focusses on access to the actual data rather than the metadata. The code gives some good clues as to how to use the COM object. [doi:10.1186/1471-2105-10-364]

MSFileReader.XRawfile is used as “NAME” in this code.


>>> import sys
>>> sys.path.append(r'C:\Python26\Lib\site-packages\win32')
>>> sys.path.append(r'C:\Python26\Lib\site-packages\win32\lib')
>>> from win32com.client import Dispatch
>>> x = Dispatch("MSFileReader.XRawfile")
>>> x.Open(r"C:\Users\path\to\file\msfile.RAW")  # raw string, so \t and \f are not read as escapes
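The trial and error over possible names could also be scripted. The following is a minimal sketch, assuming only that Dispatch raises an exception for an unregistered name (the ‘Invalid class string’ failure seen above). The helper first_working_progid and the CANDIDATES list are hypothetical; the three names in it are simply the ones tried in this post.

```python
# Names tried in this post, in the order they were tried
CANDIDATES = ["XRawfileCtrl", "MSFileReader", "MSFileReader.XRawfile"]

def first_working_progid(candidates, dispatch):
    """Return (name, COM object) for the first name that dispatch accepts."""
    for progid in candidates:
        try:
            return progid, dispatch(progid)
        except Exception as error:  # pywin32 raises com_error for an invalid class string
            print("%s failed: %s" % (progid, error))
    return None, None

if __name__ == "__main__":
    try:
        from win32com.client import Dispatch  # only available on Windows with pywin32
    except ImportError:
        print("pywin32 is not installed; nothing to probe")
    else:
        name, raw_file = first_working_progid(CANDIDATES, Dispatch)
        print("usable ProgID:", name)
```

Passing the dispatch function in as an argument keeps the loop itself free of any Windows dependency, so it can be tried out with a stand-in function on another machine.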

To be continued…

Telomerase – make your skin immortal!

I know that the beauty industry has made a habit of twisting science somewhat for its own ends (see this and this for instance), but this one takes the biscuit.
The wife spotted a piece in Harper’s Bazaar while she was at the hairdresser’s yesterday, about an amazing new beauty treatment (the article itself is hard to link to, but it’s number 3 in the list of “9 Skin Secrets for Spring”). Injections of telomerase for $1,500 a pop. Apparently it ‘stimulates resting stem cells’. Obviously the Harper’s piece has guff about it being Nobel-prize winning technology.
Telomerase is an enzyme that extends the DNA repeats at the ends of chromosomes. Without this activity, the telomeres would get progressively shorter until the “Hayflick limit” is reached and the cell stops dividing, or undergoes programmed cell death (there’s a reasonable review of the role of telomerase here).
Now I’m no expert, but as far as I know, telomerase is turned off in normal somatic cells, and telomerase activity has been associated with up to 90% of cancers (even its Wikipedia entry will tell me this much; a rather old paper with some concrete figures can be found here). I’m not suggesting for a second that injecting telomerase will give you cancer (the overwhelming probability is it will do nothing at all), but this seems to be an amazing example of abusing science in the name of ‘beauty’.

Impact factors, Colossus and the Wakefield retraction.

(Graphs from The Independent (London), 21 June 2008)

Today was one of those days where lots of interesting stuff turns up. On the BBC, there were 2 very good pieces about the flaws in the scientific process, specifically closed peer review and impact factors.

I also notice that the BBC are running a daily piece about the history of computing in the UK this week; parts one and two have already been published. Today’s article about Colossus is especially good.

Also, after last week’s excellent, and damning, judgement from the GMC regarding Andrew Wakefield’s reprehensible behaviour in his research into the ‘link’ between MMR and autism, today The Lancet finally pulled the paper in which his findings were published 12 years ago. Wakefield et al (1998) (doi:10.1016/S0140-6736(97)11096-0) has now been retracted from the public record after The Lancet concluded that the claims made by the researchers were ‘false’ (apologies for the paywall).

Defining absolute protein abundance

At the heart of Systems Biology is a vast hunger for measurements: mRNA abundance, metabolite concentration, reaction rates, degradation rates, protein abundance. This last measurement has long been problematic for researchers. Mass spectrometers get increasingly accurate and powerful, but are still hindered by the simple fact that observed signal intensity does not necessarily correlate directly with the abundance of that peptide in the sample. Factors such as peptide ionisation efficiencies, dominant neighbour effects, and missing observations all give rise to erroneous estimates of peptide quantities. Until recently, the best way to get close to measures of protein abundance was to use a peptide tagging methodology, but these are typically expensive, and provide only relative quantification (useful for expression proteomics studies, less useful if you need to know the absolute levels of a protein for a Systems Biology study).

Recently, a three-step method has been proposed for determining the absolute quantities of proteins in the cell, on a proteome scale. Step one is isoelectric focussing of tryptic digests of whole cell extracts. Step two is calculating the absolute abundance of a small group of proteins by Selected Reaction Monitoring (SRM). SRM uses spiked-in, isotopically labelled peptides of known concentration as references to calculate the actual abundance of peptides of interest. Finally, step three uses these abundances as reference points to calculate the abundance of all proteins in the sample, using the median intensities from the 3 most intense peptides for each protein.
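Step three can be pictured as a calibration problem. The sketch below is a toy version under stated assumptions, not the method from the paper: it takes the median of each protein’s three most intense peptides, fits a straight line between log intensity and log abundance for the reference proteins, and reads other proteins off that line. The log-log linear fit is an assumption made for illustration, the function names are hypothetical, and all the numbers are invented.

```python
import math
from statistics import median

def top3_median(intensities):
    """Median intensity of the three most intense peptides of one protein."""
    return median(sorted(intensities, reverse=True)[:3])

def fit_loglog(ref_intensities, ref_abundances):
    """Least-squares line through (log10 intensity, log10 abundance) reference points."""
    xs = [math.log10(value) for value in ref_intensities]
    ys = [math.log10(value) for value in ref_abundances]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict(intensity, slope, intercept):
    """Estimate abundance for a protein from its top-3 median intensity."""
    return 10 ** (slope * math.log10(intensity) + intercept)

# Invented reference proteins: SRM-style abundances exactly twice the intensity
slope, intercept = fit_loglog([1e3, 1e4, 1e5], [2e3, 2e4, 2e5])
estimate = predict(top3_median([7e4, 5e4, 1e3, 2e4]), slope, intercept)
```

In this made-up example the top three intensities are 7e4, 5e4 and 2e4, their median is 5e4, and the fitted line doubles it.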

Leptospira interrogans (Wikimedia Commons)

Using this methodology, the abundances of >50% of the proteome of a human pathogen (Leptospira interrogans) have been determined to an accuracy of ~2-fold. These abundance measurements were confirmed by almost literally counting the number of flagellar proteins present in a cell by cryo-electron tomography.

Although current hardware probably limits this technique to a few thousand proteins, that is still a big step forward on what was previously possible. If whole proteome scale absolute abundance measurements become an achievable reality, maybe proteomics can finally take on microarrays as the dominant technique in the post genomics world.

Malmström, J., Beck, M., Schmidt, A., Lange, V., Deutsch, E., & Aebersold, R. (2009). Proteome-wide cellular protein concentrations of the human pathogen Leptospira interrogans. Nature, 460(7256), 762-765. DOI: 10.1038/nature08184

Nature Methods

I love my free Nature Methods subscription. It allows me to get my hands on a paper journal, which I rarely get to do these days, and the content is actually pretty marvellous.

This month there’s a new technique for enzymatic assembly of DNA molecules from the Venter Institute, a standardised methodology for proteomics sample preparation, and a great technology feature from Nathan Blow about new proteomics techniques, including surface plasmon resonance (about which I knew nothing before today). Not to mention cool pictures of mice having light shone on their brains.

You can still apply for a free subscription, and if you are eligible to do so (individuals in North America and Europe involved in research within the life sciences or chemistry), I would urge you to.

Save the Scientist, Save the World?

Gordon Brown has already saved the world once, but it didn’t take. So the world needs another solution. I humbly suggest that the way to help save the economy, of Britain at least, is to invest heavily in Science and Technology. In the following I try to justify this as anything other than pure selfishness.

Science is very often one of the first casualties of government spending in a recession. This is because it is seen as a luxury, a good-time frippery that is difficult to justify when times are hard. The reverse should be true: science and technology investment is not disposable, because it is the generator of future income, the basis of a future successful economy.

The economy of this country has, for a long time now, been based in the service sector. This keeps people employed, which drives the economy because employed people buy things. But we no longer produce anything of note, and we don’t generate significant external input into our economy, except through the financial sector… and I think everyone knows what happened there by now. We reached a point where confidence amongst those employed in the service sector collapsed, so they stopped buying things. The service sector then looked to the financial sector for support, but the financial sector, to all intents and purposes, no longer existed, so the service sector began collapsing upon itself. This is self-perpetuating: it leads to job losses, which lead to less buying of things, which leads to further job losses… and so on. (I realise this is a gross simplification of the real situation, and not 100% accurate, but it is pretty close to the real thing, and makes my point).

The government has declared its intention to follow a Keynesian approach and spend its way out of recession, taking upon itself the responsibility of injecting the cash that the economy needs to rebuild itself. This is a well recognised approach, and has merit: the new investment has to come from somewhere, and no institution has the borrowing power of the government. However, we (as a nation) must be able to recover this investment at some future point. This means we have to create wealth that is not already in the system. We have to make something that the rest of the world wants to buy.

So, invest in science, engineering and technology. Reverse the decline in these disciplines, the unpopularity of Maths and Physics in the classroom, the haemorrhage overseas of the talent we do have. Make our innovation the product that the rest of the world buys into. Funding research keeps the current generation of innovators employed (the selfish bit), and creates new opportunities for the next generation. And not just for those lucky enough to have the education to pursue this route. Infrastructure is needed to surround research. Newcastle University is one of the largest employers in the North East.

At this point I am clearly in danger of getting carried away, so it’s probably best to wrap up. Since I started writing this particular perma-draft, many things have happened. Gordon Brown spoke to Congress about the need to ‘educate our way out of the downturn, invest and invent our way out of the downturn and re-tool and re-skill our way out of the downturn’. The US stimulus package has promised vast investment in science and technology. And just today President Obama unfroze research into Stem Cells in the US. All of these are obviously good things; let’s hope the momentum can be maintained, and the doom merchants don’t win.

Fixing Proteomics

I’ve only just discovered the Fixing Proteomics Campaign, thanks to a post on FriendFeed. It’s an initiative that I probably should have known about before, since it appears to originate, at least partly, from Nonlinear Dynamics, a Newcastle based proteomics informatics company. The campaign is also dedicated to a message I have been trying to spread among the researchers I interact with during my work: experiments must be robustly designed, and an unreproducible experimental result is meaningless.

The website for the campaign contains some useful resources for spreading this message. The most effective are the analogies that illustrate the most common experimental design techniques, and the 4-step guide for Fixing Proteomics (the subject of the FF link, above). I have used something akin to the analogies in lectures I have given about experimental design (indeed I have used the apocryphal ‘Fahrenheit and the Cow’ story itself), and I will certainly be using the 4-steps in the future, and referencing the Fixing Proteomics website too.

Just one note: as Frank points out in the FriendFeed thread, the PSI could be highlighted a little more. Proteomics experiments would not be reproducible at all, particularly cross-site, without the efforts made by the standards community. As AnalysisXML enters its public comment phase, it is worth remembering the contribution they have made to opening up data formats and making data and metadata available in a non-proprietary way.

Blog for Darwin

This post forms part of the ‘Blog for Darwin’ blog carnival.

I wasn’t going to write this post. I am very much of the opinion that holding up one man as a figurehead for an entire science is a mistake, and sets up too many straw-man arguments for detractors to propound (of the nature of: x was mistaken, so his theory y must also be wrong). Darwin lived in the 19th century, limited by the 19th century’s knowledge of science, in a period when ‘Biology’ as a science didn’t really exist. Of course he was wrong about some stuff, and by equating Evolution with Darwinism, we give the denialists a stick with which to beat us (and this also leads to misleading and pernicious headlines like that in the New Scientist a couple of weeks ago). I ‘believe’ in the theory of gravity (as supported by the weight (ho ho) of evidence), but that doesn’t make me a Newtonist.

There is no doubting that evolution is more than just Darwin, and that the Darwinian view of evolution probably doesn’t totally hold water any more, but that is hardly a surprise. It is 150 years old (in its published form). So, much as I admire his achievements, I wasn’t totally behind the idea of ‘Darwin Day’. Grist to the mill of creationists who see Darwin as the sole pedestal for the Theory of Evolution.

But then you see the amount of pseudoscience that persists in the mainstream media, and results of surveys like this one, which suggests that around 10% of Britons believe the earth was created by a supernatural being sometime in the last 10,000 years, and you think: ‘Why should I be churlish about something which is basically pro-science, and is getting a shed-load of high quality, high profile coverage?’. So, yes, if I can increase the positive noise surrounding February 12th 2009, I will. I will shout about Darwin from the rooftops if it gets something close to actual science in the news pages for a change.

For the rest of this year, this is where the battle will be fought. The hearts and minds of the anti-science luddites must be won over by the elegance and wonder of a beautiful theory, arrived at by a brilliant man who spent many years of his life in painstaking examination of the many glorious wonders of the natural world, slowly formulating a way in which they were all connected. He truly changed our understanding of the world. Let us celebrate that fact.

Just don’t call me a Darwinist.