I posted this wondering on Twitter earlier today:
— Simon Cockell (@sjcockell) February 22, 2016
Which elicited precisely no response. So I actually had to do my own research.
These are the hacked-together results of that hour or so, presented on the grounds that they may help others, as well as being an aide-mémoire for me.
In general, linear representations of reference genomes fail (unsurprisingly) to capture the complexities of populations. It is almost universally the case, though, that the tools we have available to us work with precisely this type of reference.
If we take the example of the human genome – since GRCh37 (Feb ’09), the GRC has attempted to represent some of the population-level complexity of humans by making alternative haplotypes available for regions of high complexity (the MHC locus of chromosome 6 is the canonical example). Church et al (2015) provide a more complete overview of the issues than I could hope to.
Despite this recognition of the issues at hand, a flattened representation of the ‘genome graph’ is still predominantly used for downstream mapping applications. With GRCh38 having 178 regions with alt loci (as opposed to 3 in GRCh37), the need for an adequate toolchain becomes more pressing.
Church et al point to a GitHub repo which is used for tracking software tools that make use of the full GRC assembly. I think the contents speak for themselves.
I’ve found a few useful resources discussing the technology surrounding the use of a reference graph, as opposed to the flattened representation. Kehr et al (2014) discuss the different graph models available for building graph-based alignment tools, though the focus on actual implementations is rather historical (the most modern tool referenced was released in 2011).
More up-to-date is the work of Dilthey et al (2015), which looks at improving the genome representation of the MHC region specifically through the use of reference graphs and hidden Markov models. However, this work doesn’t seek to tackle a generic approach to read alignment to a genome graph. We do get a proposal for a useful graph model (the Population Reference Graph, or PRG) and a nice HMM-based method for using data specifically in the MHC region. It’s also unclear to me (from my brief reading of the paper) how well this approach would scale.
From the CS side of things, Sirén et al (2014) give us an extension of the Burrows-Wheeler Transform that works on graphs rather than strings. This approach would clearly allow the transformation used so widely in short read alignment to be adapted to a graph-based algorithm.
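For anyone who hasn’t met the BWT before, here is a minimal sketch of the ordinary string version that short read aligners build on, along with the FM-index style backward search that makes it useful for matching. This is the plain-string case only, not the graph extension from Sirén et al. The function names are my own, and the construction is deliberately naive (sorting all rotations), which would not scale to a real genome.

```python
def bwt(text):
    """Return the Burrows-Wheeler Transform of text.

    A '$' sentinel (assumed not to occur in text) is appended first.
    Naive construction: sort every rotation and take the last column.
    """
    text = text + "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact occurrences of pattern using backward search on the BWT."""
    # first_row[c] = index of the first row (in sorted order) starting with c
    first_row = {}
    for i, c in enumerate(sorted(bwt_string)):
        first_row.setdefault(c, i)

    def occ(c, i):
        # Occurrences of c in bwt_string[:i]; real indexes precompute this.
        return bwt_string[:i].count(c)

    lo, hi = 0, len(bwt_string)
    # Extend the match one character at a time, from the end of the pattern.
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + occ(c, lo)
        hi = first_row[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("GATTACA")
    print(transformed)
    print(count_occurrences(transformed, "A"))
    print(count_occurrences(transformed, "TT"))
```

The search never touches the original text, only the transformed string and some counts, which is what makes the index compact enough for whole genomes.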
Then, finally, I came across an implementation. BWBBLE represents a collection of genomes not with a graph structure, but by making use of ambiguity codes and string padding. Huang et al then rely on a conventional BWT to index this string representation of the ‘multi-genome’. This work also describes the implementation of an aligner that makes use of this genome representation.
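To make the ambiguity code idea concrete, here is a rough sketch of collapsing several equal-length genomes into a single string using the standard IUPAC nucleotide codes, then matching a read against it. This is an illustration of the representation only and not BWBBLE’s actual implementation: it ignores indels (and so the string padding), uses a naive scan rather than a BWT index, and the function names `collapse` and `find_read` are made up for this example.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
BASES_TO_CODE = {frozenset(bases): code for code, bases in IUPAC_CODES.items()}


def collapse(genomes):
    """Collapse equal-length genomes (SNPs only) into one IUPAC-coded string."""
    if len({len(genome) for genome in genomes}) != 1:
        raise ValueError("this sketch only handles genomes of equal length")
    # Each column of the implied alignment becomes a single (ambiguity) code.
    return "".join(BASES_TO_CODE[frozenset(column)] for column in zip(*genomes))


def find_read(multi_genome, read):
    """Return start positions where read is compatible with the multi-genome."""
    hits = []
    for start in range(len(multi_genome) - len(read) + 1):
        window = multi_genome[start:start + len(read)]
        if all(base in IUPAC_CODES[code] for base, code in zip(read, window)):
            hits.append(start)
    return hits


if __name__ == "__main__":
    multi = collapse(["ACGTAC", "ACATAC", "ACGTTC"])
    print(multi)                     # ACRTWC
    print(find_read(multi, "CATT"))  # [1]
```

Note that the read `CATT` matches even though none of the three input genomes contains it: the collapsed string keeps the variants but not which genome each one came from, so combinations that never occur together are accepted too.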
BWBBLE feels a bit ‘close but no cigar’ – it’s not using a graph to represent a population of genomes and no one is really using it.
Finally, I get to the place where I actually started: HISAT2. I knew that this aligner was using graph-based FM indices to represent the genome (based on the work of Sirén et al mentioned above). I knew this made it possible to represent SNPs and small indels in the reference. I have no idea whether this allows HISAT2 to fully represent alt loci in its reference, though from the descriptions available it seems unlikely. I was pretty impressed by HISAT when it was released, but have yet to test HISAT2 in anger.
HISAT was posited as the replacement for Tophat as an RNA-Seq aligner, and HISAT2’s webpage makes it clear the developers intend it to be used over both HISAT and Tophat. It is not clear that any such thing is happening: STAR seems to be the de facto Tophat replacement in most people’s RNA-Seq toolchains, at least for those who still rely on alignment and have moved on from Tophat.
My perspective on this is one of a complete outsider doing about an hour’s worth of reading, but it seems to me that there is a real paucity of tools allowing for alignment to a graph model of a reference genome. There’s plenty of discussion of the issues, and a recognition that such tools are necessary, but little so far in the way of implementation.
I assume these tools are in the works (as I said, I’m an outsider looking in here, I have no idea who’s developing what).
I’ll leave with this, which is part of what got me wondering about the state of this field in the first place:
— Titus Brown (@ctitusbrown) February 21, 2016
I guess 3-5 years is a loooong time in genomics.
implements alignment, succinct db of graph (sequences + haplotyps), text/binary formats, visualization, lossless RDF transformation, annotation, variant calling, graph statistics, normalization, long read aln/assembly, sequence to debruijn graph, kmers, read simulation, graph comparison, and tools to project models (graph alns and variant calls) into linear ref.
Jeffrey in the comments below also points out a presentation on Google Docs by Erik: https://docs.google.com/presentation/d/1bbl2zY4qWQ0yYBHhoVuXb79HdgajRotIUa_VEn3kTpI/edit#slide=id.p.
So, not quite the paucity of implementation I first feared (though it was clear stuff must be in the works – good to know some of the what and where) – and Twitter came in handy for the research in the end…