“Alignment free” transcriptome quantification

Last week a pre-print was published describing Kallisto – a quantification method for RNA-Seq reads described (by the authors) as ‘near optimal’. Here is Lior Pachter’s associated blog post. Kallisto follows a current trend for ‘alignment free’ quantification methods for RNA-Seq analysis, with recent methods including Sailfish and its successor, Salmon.

These methods do, of course, do ‘alignment’ (pseudo-alignments in Kallisto parlance, lightweight alignments according to Salmon), but they do not expend computation in determining the optimal alignment, or disk space in saving the results of the alignment. Making an alignment ‘good enough’ to determine the originating transcript and then disposing of it seems like a sensible use of resources – if you don’t intend to use the alignment for anything else later on (which I usually do – so alignment isn’t going to go away any time soon).

I’ve been using Salmon for a few months now, and have been very impressed with its speed and apparent accuracy, but I felt that the publication of Kallisto meant I should actually do some testing.

What follows is pretty quick and dirty, and I know it could well stand some improvements along the way. I’ve tried to document the workflow adequately – let me know if you have any specific suggestions for improvement.

Building a test set

Initially I wanted to run the tests with a full transcriptome simulation, but I’ve been having some technical issues with polyester, the RNA-Seq experiment simulator, and I haven’t had the time to work them out. So instead, I am working with the transcripts of a sample of 250 random genes. Reads were simulated for these transcripts to match empirically observed counts from a recent experiment. This simulation gave me a test set of 206,124 paired-end reads for 1,129 transcripts. This data set was then used for quantification with both Salmon (v0.3.0) and Kallisto (v0.42.1).


I then ran the quantification with each tool & each set of reads 10 times, to get an idea of the variability of the results. For Salmon, this meant 10 separate runs; for Kallisto, it meant extracting the HDF5 file from a run with 10 bootstraps. For interest, I tracked the time and resource use of each tool, though this is not a big consideration. Since we are now in the place where most tools operate in minutes per sample (alignment via STAR or HISAT, and these quantification methods), a few minutes either way is going to make a negligible difference, and speed is usually entirely the wrong metric to focus on.
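For anyone repeating this, the per-bootstrap estimates can be pulled out of Kallisto’s HDF5 output with h5py. The dataset names below (`est_counts`, `bootstrap/bs<i>`) reflect my reading of kallisto v0.42-era files and should be checked against your own abundance.h5; the demo file here is synthetic.

```python
# Sketch of pulling per-bootstrap estimates out of kallisto's HDF5 output.
# Dataset names are an assumption about the v0.42-era layout -- inspect
# your own abundance.h5 (e.g. with h5py's visit()) before relying on them.
import h5py
import numpy as np

def read_bootstraps(h5_path):
    """Return (point_estimate, bootstraps) as numpy arrays."""
    with h5py.File(h5_path, "r") as f:
        point = f["est_counts"][:]
        # Sort bs0, bs1, ... numerically so bs10 doesn't land before bs2
        keys = sorted(f["bootstrap"], key=lambda k: int(k[2:]))
        boots = np.vstack([f["bootstrap"][k][:] for k in keys])
    return point, boots

# Self-contained demo against a synthetic file with the same layout:
with h5py.File("demo_abundance.h5", "w") as f:
    f["est_counts"] = np.array([10.0, 200.0, 0.0])
    for i in range(10):
        f["bootstrap/bs%d" % i] = np.array([10.0, 200.0, 0.0]) + i

point, boots = read_bootstraps("demo_abundance.h5")
# boots has one row per bootstrap, one column per transcript
```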

For the record – Salmon took a lot longer for this experiment, though the requirement to keep counting reads until at least 50,000,000 have been processed (by default, configurable through the -n parameter) accounts for the majority of the time discrepancy & would not be such a ‘penalty’ to operation with a normally-sized experiment.


The two graphs below show the per-transcript coefficient of variation of the observations from each software tool, plotted against the raw read count. In both cases, the transcripts with a low base mean exhibit the largest relative variance – this is hardly surprising, since count variance generally grows in proportion to mean expression, so the relative spread falls as expression rises.
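The statistic in these plots is just the run-to-run standard deviation divided by the mean count. A minimal sketch with simulated Poisson counts (illustrative numbers, not the post’s data):

```python
# Per-transcript coefficient of variation (CV) across repeated runs.
# The counts are simulated Poisson draws, standing in for the 10
# quantification runs per transcript described above.
import numpy as np

rng = np.random.default_rng(0)
# rows = 3 transcripts at low/medium/high expression, columns = 10 runs
counts = rng.poisson(lam=[[5], [100], [2000]], size=(3, 10)).astype(float)

mean = counts.mean(axis=1)
cv = counts.std(axis=1, ddof=1) / mean
# The low-abundance transcript shows by far the largest relative spread.
```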


Per-transcript coefficient of variation for Salmon observations


Per-transcript coefficient of variation for Kallisto observations


For comparison, I took the mean observation from the 10 runs of Salmon, and the final abundance.txt observations from Kallisto, and compared them to the ‘ground truth’ – the count table from which the simulation was derived. Plots of these comparisons are below.


Correlation of Salmon counts with ground truth. The red line indicates perfect correlation.


Correlation of Kallisto counts with ground truth. The red line indicates perfect correlation.

I think both tools here are doing a pretty bang-up job – though Kallisto is performing better in this test, particularly with high-abundance isoforms. The correlation with ‘truth’ is stronger (Spearman, reported on the graphs above), and the mean absolute difference from truth is smaller (10.04 for Kallisto vs 60.66 for Salmon).
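For the record, the two summary statistics quoted above can be computed as below. The numbers are toy values, not the experiment’s results:

```python
# Spearman correlation against ground truth, plus the mean absolute
# difference between estimated and true counts.
import numpy as np
from scipy.stats import spearmanr

truth = np.array([0.0, 12.0, 150.0, 3000.0])
est = np.array([0.5, 10.0, 160.0, 2900.0])

rho = spearmanr(truth, est).correlation  # rank-based, robust to scale
mad = np.abs(truth - est).mean()         # mean absolute difference
```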

Both Salmon and Kallisto are still under active development (Salmon has not yet been published, and Kallisto is only just in pre-print), so this is actually relatively early days for quantification by alignment-free methods (see this post by the Salmon developer Rob Patro for some potential future directions). The fact, then, that both tools are already doing such a good job of quantification is very exciting.


In response to the comment from Rob Patro below, I’m including a graph of the comparison of TPM (Transcripts Per Million) – again, truth vs Salmon & Kallisto, this time in one figure.
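As a reminder of what is being plotted, TPM scales each transcript’s count by its (effective) length and normalises so the values sum to one million, with log2(TPM+1) used for display. A sketch with made-up effective lengths:

```python
# TPM (Transcripts Per Million) from counts and effective lengths:
# counts per unit of effective length, rescaled to sum to one million.
import numpy as np

counts = np.array([100.0, 400.0, 0.0])
eff_len = np.array([1000.0, 2000.0, 500.0])  # effective lengths in bases

rate = counts / eff_len
tpm = rate / rate.sum() * 1e6
log_tpm = np.log2(tpm + 1)  # the transform used on the figure's axes
```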

Transcripts per million comparison. Graph is of log2(TPM+1).



Further feedback from Rob suggested I use the non-bias-corrected results from Salmon. This has a pretty significant effect on the results. The revised plots are included below. The Salmon help does mention that bias correction is experimental…

Non-Bias Corrected Salmon counts vs Ground Truth


TPM comparison with non-bias corrected Salmon results



  1. Hi Simon,

    Thanks for this post! It’s really interesting. I should mention that the testing setup (i.e. ~200k reads) is really close to the worst scenario possible for Salmon, as the online inference algorithm is really geared toward larger data sets. This means it probably takes a significant hit in terms of runtime (having to pass over the data *many* times) and perhaps, slightly, in terms of accuracy. If you have a chance, would you mind showing the results in terms of TPM? Also, as you mention, Salmon (and Kallisto) are still under active development. We’ve made a potentially significant change to the way that the number of mapping reads is estimated (they were likely being somewhat underestimated previously) — I’d be very curious whether the latest version of Salmon performs any differently. Anyway, these are minor points; thanks for the very cool blog post. We’re working hard on the Salmon manuscript and hope to have a pre-print out relatively soon.


    1. Rob

      Thanks for the comment. I will get around to checking out the latest release of Salmon sometime soon – thanks for the pointer to the “Salmon-only” Github repo. In the meantime, the (rather unpolished!) scripts and data for this analysis are on Github:

      The TPM comparison is now included in the post – the Kallisto TPM calculation is based on effective transcript length, so differs slightly from Salmon, but the results are comparable. The graph is in log2 space because it was easier to see what’s going on…


      1. Thanks a ton, Simon! I will definitely take a look at this as soon as I get a chance. The Salmon-only repo is a new development, as it allows us to slowly strip away the parts of the Sailfish codebase that are not required for Salmon. This, in turn, allows us to update the codebase and add new features more rapidly (Kallisto, actually, has inspired some new features that are currently in the queue). It also makes it easier to point people at just Salmon, which we will be doing with the pre-print (shortly, I hope).

  2. One more quick thought (after looking at the scripts & data), would you mind also including the **non** bias-corrected Salmon results in the plots (I’ll also be looking at this myself)? Thanks again!

  3. Interesting post!
    I’m also doing some comparisons on my end with single-end data, and I observe huge differences between eXpress and Kallisto.

    Do you expect to observe that same “almost perfect” correlation with Kallisto on single-end data?


  4. Looking at the final plot (the TPMs), it seems that Kallisto has some false negatives that aren’t quite being accounted for in the correlations. Thanks to Simon’s awesomeness (i.e. putting the scripts required to reproduce his tests up on GitHub), I’ve reproduced these results accounting for these false negatives. The relevant IPython notebook (I’m horrible at R) is here.


  5. Hi Simon, thanks for your interest in kallisto and thanks for benchmarking it along with Salmon. I have a few points I’d like to raise though:

    – I don’t think it’s fair to compare the variability in the bootstraps from kallisto to the variability in salmon. They are inherently quite different. The variability in bootstraps from kallisto is measuring the variability you would expect assuming that you are taking several draws from the same population (and estimating them using kallisto). It is important to remember that the kallisto algorithm (currently) is 100% deterministic and the standard abundances (non-bootstrap) will always be the same. The differences you are seeing in Salmon are a bit different. It seems the Salmon algorithm is non-deterministic and you are getting different results when quantifying the same data.
    – While using the mean of the bootstraps seems like a reasonable idea, the mean of the bootstraps will never explain the observed data as well as the estimate from the observed data itself, because you are sampling from the observed data, treating it as the population. While this is true, you will see that using more bootstraps (significantly more than 10) will give you results closer to the observed estimates, but they will still have some bias. The purpose of the bootstraps is therefore to estimate the variance in the kallisto abundance estimates, not to refine, or improve on, the initial prediction.

    Finally, I’ve done my own re-analysis of your simulations (thanks again for making your analysis so easily available!). You can find the rmarkdown file here:

    Along with the repo:

    with a few things we have looked at while developing kallisto, some of which are in the preprint:

    – relative difference
    – relative error
    – some helpful plots related to these metrics

    In addition, the zeroes are handled a bit differently than in Rob’s code. Instead of filtering out things where kallisto/Salmon estimate low expression (or values close to zero), we are only filtering out things where the truth is less than the threshold (Rob’s does this as well). I believe this is the correct way of filtering out the lowly expressed.
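The relative-difference metric itself isn’t spelled out in the post; a definition consistent with the maximum of 2.0 mentioned in a later comment is the absolute difference divided by the mean of truth and estimate. A hypothetical sketch (the function name and the handling of the 0/0 case are my own choices):

```python
# Relative difference: |truth - est| / mean(truth, est), which is
# bounded above by 2.0 (reached when exactly one of the pair is zero).
import numpy as np

def relative_difference(truth, est):
    truth = np.asarray(truth, dtype=float)
    est = np.asarray(est, dtype=float)
    denom = (truth + est) / 2.0
    with np.errstate(invalid="ignore", divide="ignore"):
        rd = np.abs(truth - est) / denom
    return np.where(denom == 0.0, 0.0, rd)  # define 0/0 as agreement

# A true TPM of 0 with any non-zero estimate hits the maximum of 2.0:
rd = relative_difference([0.0, 100.0], [1e-6, 110.0])
```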



  6. Hi Harold,

    Thanks for your extra re-analysis. The analysis capabilities of sleuth look quite nice, and I’m looking forward to the package. I was wondering if you might elaborate a bit more on the “filtering” issue. I should mention that in the Python analysis, the filtering is much more of a data-specific “truncation” than a filtering to a sub-set of the data. By this I mean that my `filterExpression` function does not remove any data points from the dataset. Rather, it simply sets values less than a specified threshold to 0 (leaving the corresponding “row” in the data frame). Further, the value is only truncated to 0 in the cell in question (e.g. estimated number of reads); the other columns (e.g. true number of reads) retain their original values. This ensures that the total number of data points remains the same in the different comparisons. Because the Python code doesn’t remove any data points from the comparison but, rather, simply truncates small values, I believe that applying the same truncation value in both the true and estimated expression is the correct thing to do. The motivation for this is that we are, essentially, defining a minimum expression value that “matters”. However, since we are not removing any data points from the analysis, truncating the true expression values and the estimated expression values are independent operations. The filtering in the true data essentially says that transcripts with a true expression less than a specific threshold don’t have meaningful expression — likewise with the estimated expression. It seems that it makes sense to apply such a truncation symmetrically. Of course, when filtering by removing data points, the argument is very different, since removing transcripts with low estimated expression may discard from consideration those transcripts with a low estimated but high true expression (and vice versa), which, clearly, one would not want to do.
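A minimal pandas sketch of the truncation being described, as I read it (column names are made up): values below the threshold are set to zero in place, each column is treated independently, and no rows are removed.

```python
# Truncation, not filtering: zero out sub-threshold cells per column,
# keep every row, and leave the other column of each row untouched.
import pandas as pd

df = pd.DataFrame({
    "true_reads": [0.0, 55.0, 1200.0],
    "est_reads":  [3e-31, 60.0, 1150.0],
})

threshold = 1.0
truncated = df.copy()
for col in ["true_reads", "est_reads"]:
    truncated.loc[truncated[col] < threshold, col] = 0.0
# Row count is unchanged; only sub-threshold cells were zeroed.
```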


  7. P.S. I really wish that the blog comments had an edit capability (like Reddit comments or Gitter chats)! The other thing that I noticed when looking at the data is that there are a few hundred transcripts that were quantified (part of select_transcripts.fa) but which don’t have an entry in the ground truth (oracle) data. I believe that these transcripts should be included in any analysis, since the tools (Kallisto and Salmon) *did* consider them when estimating relative abundances (this means we’re performing analysis on all 1,129 transcripts rather than the 913 that appear in the intersection of the ground truth file and the estimated expressions). Also, I’ll make an attempt to port your other analyses into Python to keep the two at parity. Rather than averaging all of the Salmon runs, I will pick a single run — this seems a better representation since, as you mention, there is a stochastic element to salmon’s optimization and the optimization algorithm works “within sample”. Further, there is a lower bound on “cancelling” errors between runs since 0 is a lower bound on e.g. TPM and estimated number of reads. Currently, I’m getting somewhat different results in the Python analysis of e.g. median relative error, where both methods seem to have a median relative error of 0.0. I’ll report back here when I can figure out what might be causing these differences.


    1. Okay, I’ve updated the previous notebook (available to view here) with an analysis of relative difference. Indeed, in the Python analysis, I get a median relative difference of 0.0 using both methods (and the histograms of relative differences look very similar). Though there are likely some other things going on, it looks like the biggest component of the difference may come from truncating tiny estimated expression values to 0. I have a small analysis of the effect of this on non-zero median relative difference in my notebook. Essentially, the relative difference for salmon was being driven up by transcripts with a tiny predicted TPM where the true TPM is 0 (all of which get a relative difference of 2.0 — the largest possible — the way it is being calculated). For example, this non-zero median relative difference drops by an order of magnitude if we simply say that any predicted TPM < 1e-30 should be 0. I would argue that applying this truncation of values (again, this is not a subset-filtering — all transcripts remain in the comparison) is the fairest way of comparing across different methods. Specifically, I think it makes sense since some methods (in this case, Kallisto) apply a truncation internally, while others (in this case, Salmon) do not.

