Last week a pre-print was published describing Kallisto, a quantification method for RNA-Seq reads that its authors describe as ‘near optimal’. Here is Lior Pachter’s associated blog post. Kallisto follows a current trend for ‘alignment-free’ quantification methods for RNA-Seq analysis, with recent methods including Sailfish and its successor, Salmon.
These methods do, of course, perform ‘alignment’ (pseudo-alignments in Kallisto parlance, lightweight alignments according to Salmon), but they do not expend computation on determining the optimal alignment, or disk space on saving the results. Making an alignment ‘good enough’ to determine the originating transcript and then disposing of it seems like a sensible use of resources – if you don’t intend to use the alignment for anything else later on (which I usually do, so alignment isn’t going to go away any time soon).
I’ve been using Salmon for a few months now, and have been very impressed with its speed and apparent accuracy, but I felt that the publication of Kallisto meant I should actually do some testing.
What follows is pretty quick and dirty, and I know it could well stand some improvements along the way. I’ve tried to document the workflow adequately – let me know if you have any specific suggestions for improvement.
Building a test set
Initially I wanted to run the tests with a full transcriptome simulation, but I’ve been having some technical issues with polyester, the RNA-Seq experiment simulator, and I haven’t had the time to work them out. So instead, I am working with the transcripts of a sample of 250 random genes. Reads were simulated for these transcripts to match empirically observed counts from a recent experiment. This simulation gave me a test set of 206,124 paired-end reads for 1,129 transcripts. This data set was then used for quantification with both Salmon (v0.3.0) and Kallisto (v0.42.1).
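For illustration, the gene-sampling step might look something like the Python sketch below. The input structures (a gene-to-transcript mapping and a dict of transcript sequences, both parsed from the reference annotation) are hypothetical stand-ins, not my actual script:

```python
import random

# Hypothetical inputs: gene_to_tx maps gene IDs to lists of transcript IDs,
# and tx_seqs maps transcript IDs to sequences; both would be parsed from
# the reference GTF/FASTA beforehand.
def sample_test_transcripts(gene_to_tx, tx_seqs, n_genes=250, seed=42):
    """Pick n_genes at random and collect all of their transcripts."""
    random.seed(seed)
    genes = random.sample(sorted(gene_to_tx), n_genes)
    return {tx: tx_seqs[tx] for g in genes for tx in gene_to_tx[g]}
```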
I then ran the quantification with each tool and each set of reads 10 times, to get an idea of the variability of the results. For Salmon, this meant 10 separate runs; for Kallisto, it meant extracting the bootstrap estimates from the HDF5 file produced by a run with 10 bootstraps. For interest, I tracked the time and resource use of each tool, though this is not a big consideration. Since we are now at the point where most tools operate in minutes per sample (alignment via STAR or HISAT, and these quantification methods), a few minutes either way is going to make a negligible difference, and speed is usually entirely the wrong metric to focus on.
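For the Kallisto side, extracting those bootstrap estimates can be done with h5py. The dataset names here (‘aux/ids’, ‘bootstrap/bs0’ and so on) match the abundance.h5 layout I’ve seen, but it’s worth checking against your own file with h5ls; all paths are hypothetical:

```python
import subprocess
import h5py

# One Kallisto run with 10 bootstraps (paths hypothetical).
subprocess.run(
    ["kallisto", "quant", "-i", "kallisto.idx", "-o", "kallisto_out",
     "-b", "10", "reads_1.fq", "reads_2.fq"],
    check=True,
)

# Pull the per-bootstrap count vectors out of the HDF5 output.
with h5py.File("kallisto_out/abundance.h5", "r") as h5:
    tx_ids = [tid.decode() for tid in h5["aux/ids"][...]]
    boots = [h5[f"bootstrap/bs{i}"][...] for i in range(10)]
```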
For the record – Salmon took a lot longer for this experiment, though the requirement to keep counting reads until at least 50,000,000 have been processed (by default, configurable through the -n parameter) accounts for the majority of the time discrepancy and would not be such a ‘penalty’ in a normally-sized experiment.
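For completeness, the Salmon runs (and their timing) amounted to something like the loop below. The library-type flag syntax has varied between Salmon versions, so treat the exact arguments as an assumption and check salmon quant --help for your version:

```python
import subprocess
import time

# Ten independent Salmon runs, timed (paths and -l value hypothetical).
timings = []
for i in range(10):
    start = time.perf_counter()
    subprocess.run(
        ["salmon", "quant", "-i", "salmon_index", "-l", "IU",
         "-1", "reads_1.fq", "-2", "reads_2.fq", "-o", f"salmon_out_{i}"],
        check=True,
    )
    timings.append(time.perf_counter() - start)

print(f"mean wall time per run: {sum(timings) / len(timings):.1f}s")
```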
The two graphs below show the coefficient of variation for the observations by each of the software tools, vs the raw read count. In both cases, the transcripts with low base mean exhibit the largest relative variance – this is hardly surprising, since in general variance is proportional to mean expression.
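The CV calculation itself is simple; given a transcripts × runs matrix of estimated counts (assembled from the ten runs or bootstraps above), a numpy/matplotlib sketch might be:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_cv(counts, label):
    """Scatter CV (std/mean across runs) against mean estimated count."""
    mean = counts.mean(axis=1)
    cv = counts.std(axis=1) / np.where(mean > 0, mean, np.nan)
    plt.scatter(np.log10(mean + 1), cv, s=5, alpha=0.5)
    plt.xlabel("log10(mean estimated count + 1)")
    plt.ylabel("coefficient of variation")
    plt.title(label)
    plt.show()
```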
For comparison, I took the mean observation from the 10 runs of Salmon, and the final abundance.txt observations from Kallisto, and compared them to the ‘ground truth’ – the count table from which the simulation was derived. Plots of these comparisons are below.
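Assembling that comparison table is a straightforward pandas join. The truth-table path and columns are hypothetical; target_id/est_counts are the abundance.txt columns I’ve seen, but check your own output:

```python
import pandas as pd

# salmon_runs: DataFrame of estimated counts, one column per run,
# built from the ten Salmon runs above.
truth = pd.read_csv("truth_counts.tsv", sep="\t", index_col="transcript")
salmon_mean = salmon_runs.mean(axis=1).rename("salmon")
kallisto = pd.read_csv("kallisto_out/abundance.txt", sep="\t",
                       index_col="target_id")["est_counts"].rename("kallisto")
merged = truth.join([salmon_mean, kallisto], how="inner")
```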
I think both tools here are doing a pretty bang-up job, though Kallisto is performing better in this test, particularly with high-abundance isoforms. Its correlation with ‘truth’ is stronger (Spearman, reported on the graphs above), and its mean absolute difference from truth is smaller (10.04 for Kallisto vs 60.66 for Salmon).
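Those two summary statistics come straight from scipy/pandas on the merged table sketched above (‘count’ being the hypothetical ground-truth column):

```python
from scipy.stats import spearmanr

for tool in ("salmon", "kallisto"):
    rho, _ = spearmanr(merged["count"], merged[tool])
    mad = (merged[tool] - merged["count"]).abs().mean()
    print(f"{tool}: Spearman rho = {rho:.3f}, mean |diff| = {mad:.2f}")
```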
Both Salmon and Kallisto are still under active development (Salmon has not yet been published, and Kallisto is only just in pre-print), so these are actually relatively early days for quantification by alignment-free methods (see this post by the Salmon developer Rob Patro for some potential future directions). The fact, then, that both tools are already doing such a good job of quantification is very exciting.
EDIT: In response to the comment from Rob Patro below, I’m including a graph comparing TPM (Transcripts Per Million) – again, truth vs Salmon and Kallisto, this time in one figure.
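Both tools report TPM directly; the ground-truth counts need converting to the same scale, and the standard conversion is just length-normalised counts rescaled to sum to one million. A minimal numpy version (per-transcript effective lengths assumed available):

```python
import numpy as np

def counts_to_tpm(counts, eff_lengths):
    """TPM_i = 1e6 * (counts_i / eff_len_i) / sum_j(counts_j / eff_len_j)."""
    rate = np.asarray(counts, dtype=float) / np.asarray(eff_lengths, dtype=float)
    return 1e6 * rate / rate.sum()
```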
EDIT THE SECOND:
Further feedback from Rob suggested I use the non-bias-corrected results from Salmon. This has a pretty significant effect on the results; the revised plots are included below. The Salmon help does mention that bias correction is experimental…
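If you want to do the same, the swap is just a matter of which output file you read. My understanding is that Salmon versions of this era write both quant.sf and quant_bias_corrected.sf into the output directory, but that (and the column layout) is an assumption worth checking against your own output:

```python
import pandas as pd

# Read the non-bias-corrected estimates; '#' header lines are skipped.
# File name and column layout are assumptions -- inspect your quant.sf.
quant = pd.read_csv("salmon_out_0/quant.sf", sep="\t", comment="#",
                    names=["name", "length", "tpm", "num_reads"])
```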