# Technical

My needs are small, I want an RSS feed of the stuff I want to share from Google Reader, so that other people can follow the things I share in Reader (if they want), and I can pipe that information elsewhere (I use dlvr.it to post selected RSS feeds into Twitter). Google doesn’t want to provide that anymore, so I’ll hack something together.

The ingredients:

1. These simple instructions for how to render an RSS feed from a MySQL backend.
3. My rudimentary PHP hackery skills

The code:
All source is available on BitBucket.

First, we need a database connection. The database is set up exactly as described in (1), above.

<?php
DEFINE('DB_USER', 'db_user');
DEFINE('DB_HOST', 'localhost');
DEFINE('DB_NAME', 'db_name');
// Make the connnection and then select the database.
$dbc = @mysql_connect(DB_HOST, DB_USER, DB_PASSWORD) OR die(mysql_error()); mysql_select_db(DB_NAME) OR die(mysql_error()); ?>  Now, when the page is visited, we want to render what is in the database as an RSS feed (again, this is a simple adaptation of the code in (1)): <?php class RSS { public function RSS() { require_once ('mysql_connect.php'); } public function GetFeed() { return$this->getDetails() . $this->getItems(); } private function dbConnect() { DEFINE('LINK', mysql_connect(DB_HOST, DB_USER, DB_PASSWORD)); } private function getDetails() { //header of the RSS feed$detailsTable = "webref_rss_details";
$this->dbConnect($detailsTable);
$query = "SELECT * FROM ".$detailsTable;
$result = mysql_db_query (DB_NAME,$query, LINK);
while($row = mysql_fetch_array($result)) {
//fairly minimal description of the feed
$details = '<?xml version="1.0" encoding="ISO-8859-1" ?> <rss version="2.0"> <channel> <title>'.$row['title'] .'</title>
<link>'. $row['link'] .'</link> <description>'.$row['description'] .'</description>
<language>'. $row['language'] .'</language> '; } return$details;
}

private function getItems() {
//return all the items foe the RSS feed
$itemsTable = "webref_rss_items";$this->dbConnect($itemsTable);$query = "SELECT * FROM ". $itemsTable;$result = mysql_db_query(DB_NAME, $query, LINK);$items = '';
while($row = mysql_fetch_array($result)) {
$items .= '<item> <title>'.$row["title"] .'</title>
<link>'. $row["link"] .'</link> <description><![CDATA['.$row["description"] .']]></description>
</item>';
}
//close the feed
$items .= '</channel> </rss>'; return$items;
}
}
?>


Finally, we need a method for adding new stuff for the feed. This code takes the GET variables passed to it by Google Reader, and stores them in the DB:

<?php
if ($_GET['url']) { //receive google reader 'send to' items, and store in mysqldb$url = $_GET['url'];$source = $_GET['source'];$title = $_GET['title'];$simple_check = $_GET['check']; //stops anyone adding new items to your feed unless they have the key if ($simple_check == 'uniquepasscodehere') {
$insert_statement = "INSERT INTO webref_rss_items(title, description, link) VALUES('$title', '$source', '$url')";
require_once('mysql_connect.php');
$result = mysql_query($insert_statement, $dbc); if ($result) {
echo "<p>Success!";
//would be nice to close the window automatically after a couple of seconds
}
else {
die('<p>Invalid query: ' . mysql_error());
}
}
}
else {
//render everything in the db as RSS
$rss = new RSS(); echo$rss->GetFeed();
}
?>


Now, I can set up the Send To: item in Google Reader:

# Graphing protein databases

I’m giving a lecture next week to the Bioinformatics Masters students here about protein structure prediction. As part of the introduction to this topic, I have a traditional ‘data explosion’ slide, to illustrate the gap between the quantity of protein sequence data available versus the number of solved protein structures in the PDB (hence the need for bioinformatics to help fill the gap, by good prediction algorithms). When I last gave this talk (scarily, 4 years ago), this slide was just text, a description of the present size of UniProt & the PDB.

Since 2006 my lecturing style has progressed somewhat, I don’t like to have slides with just words on anymore, so I wanted to replace this slide, rather than just updating the numbers. Graphs of the growing sizes of the databases are easy to find online, but to my mind the real story here is of the gap in the sizes of the 2 databases (UniProt & PDB), and whether it is growing (or are protein structural determination methods catching up). This graph doesn’t (to my knowledge) exist, so, inspired by this question on BioStar I set out to draw them.

The first task is to retrieve numbers from each of the databases of their size at particular dates. For the PDB this is simple, because they distribute a CSV file of this information. You can get it too, it’s linked to here. For UniProt, it was non-obvious where to find this information. Every time there’s a new release, the webpage documenting that release gives the size of UniProt at the point of release (and it’s components, SwissProt and TrEMBL), but it is hard to find these pages for any release that is not current. So my approach was to download the history of UniProt from their FTP server, and use BioPython to calculate the size of each release:

[python]
import os
import sys
from Bio import SwissProt

def main():
dirs = os.listdir("data")
results = map(numbers, dirs)

def numbers(dir):
directory = "data/"+dir
h = open(directory+"/reldate.txt")
h.close()
date = lines[1].rstrip() #more processing required to return just date
sh = open(directory+"/uniprot_sprot.dat")
descriptions = [record.accessions for record in SwissProt.parse(sh)]
sprot_size = len(descriptions)
sh.close()
th = open(directory+"/uniprot_trembl.dat") #and the same for trembl
descriptions = [record.accessions for record in SwissProt.parse(th)]
trembl_size = len(descriptions)
th.close()
return (date,sprot_size,trembl_size)
[/python]

It was only once I was coming to the end of this process (slow, because we’re dealing with 16 releases of UniProt: 150GB of data) that I found this page, which was fairly hidden away, but gives me the sizes of SwissProt from the last 25 years. Curses! So much effort seemingly gone to waste. However, there doesn’t appear to be a corresponding page for TrEMBL, which is much larger (being a conceptual translation of EMBL), and I wanted these numbers too, to illustrate the full scope of the problem. So my effort was not in vein.

Now that we have all the numbers in an appropriate format (DATE,DATABASE,SIZE), we can draw some graphs. For this I use the ggplot2 library and R, which seems to be de rigueur for pretty visualisations these days. Here’s some code:

library(ggplot2)
colnames(pdb) = c("Year", "Database", "value")
pdb$Year <- as.Date(pdb$Year)
png("/path/to/graphs/uniprot_graphs/pdb.png", bg="transparent", width=800, height=600)
qplot(Year, value, data=pdb, geom="line", color=I("red")) + scale_x_date(format="%Y") + scale_y_continuous("Entries", formatter="comma")
dev.off()

colnames(spdb) = c("Year", "Database", "value")
spdb$Year <- as.Date(spdb$Year)
png("/path/to/graphs/sp_pdb.png", bg="transparent", width=800, height=600)
qplot(Year, value, data=spdb, geom="line", group=Database, color=Database) + scale_x_date(format="%Y") + scale_y_continuous("Entries", formatter="comma")
dev.off()

colnames(all) = c("Year", "Database", "value")
all$Year <- as.Date(all$Year)
png("/path/to/graphs/all.png", bg="transparent", width=800, height=600)
qplot(Year, value, data=all, geom="line", group=Database, color=Database) + scale_x_date(format="%Y") + scale_y_log10("Entries", breaks=c(10^4,10^5,10^6,10^7))
dev.off()


This very simple R produces 3 plots, all of which are informative in different ways.

Plot 1 is a simple restatment of the PDB graph, which I produced just so all my graphs would look the same, it’s a pretty standard exponential curve (though admittedly the numbers are slightly smaller than the numbers you may be used to seeing on such plots).

Plot 2 compares the size of SwissProt with the size of the PDB. I’m extremely happy with this one, as it shows precisely what I wanted it to, SwissProt being much larger than the PDB, and marching away at an increasing rate. For the record, the most recent size of the PDB and SwissProt in the graph are 68,998 and 522,019 respectively (compared with when I last gave the protein structure lecture: 40,132 & 241,365).

The final plot is just to scare people. It includes TrEMBL, and had to be plotted on a log10 scale, because TrEMBL is another order of magnitude larger than SwissProt (12,347,303 sequences).

Addendum – further to all this, the problem of the gap between sequence and structure is actually more stark than presented here. Although the PDB today (11/11/10) contains 69,162 structures, they are highly redundant, and there are only 39,724 unique sequences of known structure.

# Parsing Thermo Finnigan RAW files

In a rare move, I’m going to largely copy across a post from my work blog, because I hope it contains useful information. For background, I’m trying to write a simple python script that extracts particular metadata from a .RAW file, produced by a Thermo Finnigan mass spectrometer. Tools that exist for parsing these files require access to proprietary XCalibur libraries, which I do not have.

Thermo provided a link to MSFileReader, a ‘freeware’ COM object that should allow interaction with RAW files without an XCalibur installation. They also sent a PDF guide to the COM object. Although this will allow XCalibur to be avoided, the work is still Windows-bound.

Python and COM objects

Python can talk to COM objects, through the win32com.client package. As a test, I installed Python and MSFileReader and the pywin32 libs on my netbook (which is a Windows 7 machine). Can import the required Python module, but need to extent the PATH somewhat:

>>> sys.path.append('C:\Python26\Lib\site-packages\win32')
>>> sys.path.append('C:\Python26\Lib\site-packages\win32\lib')
>>> from win32com.client import Dispatch
>>> x = Dispatch("NAME")


The key thing here is “NAME”:

The provided PDF gives C snippets for each method available in the COM object. This only provides one clue as to the possible name of the COM object

// example for Open
TCHAR* szPathName[] = _T(“c:\xcalibur\examples\data\steroids15.raw”);
long nRet = XRawfileCtrl.Open( szPathName );
if( nRet != 0 ) {
::MessageBox( NULL, _T(“Error opening file”), _T(“Error”), MB_OK );
...
}


XRawfileCtrl is used to call the Open() method. However, this and MSFileReader as “NAME” both fail (Invalid class string).

Found ‘multiplierz‘ which seems to use MSFileReader to create mzAPI – which focusses on access to the actual data, rather than the metadata. The code gives some good clues as to how to use the COM object. [doi:10.1186/1471-2105-10-364]

MSFileReader.XRawfile is used as “NAME” in this code.

So:

>>> sys.path.append('C:\Python26\Lib\site-packages\win32')
>>> sys.path.append('C:\Python26\Lib\site-packages\win32\lib')
>>> from win32com.client import Dispatch