date: 2012-11-2
An increasing number of people are asking about using our assembly approaches for things that we haven’t yet written (or posted) papers about. Moreover, our assembly strategies themselves are also under constant evolution as we do more research and find ever-wider applicability of our approaches.
Note, this is modified from Titus’ blog post, here – go check the bottom of that for comments.
This handbook distills the cumulative expertise of Adina Howe, Titus Brown, Erich Schwarz, Jason Pell, Camille Scott, Elijah Lowe, Kanchan Pavangadkar, Likit Preeyanon, and others.
khmer is really focused on short read data, and, more specifically, Illumina, because that’s where we have a too-much-data problem. However, a lot of the prescriptions below can be adapted to longer read technologies such as 454 and Ion Torrent without much effort.
Don’t try to use our k-mer approaches with PacBio – the error rate is too high.
There are many blog posts about this stuff on Titus Brown’s blog. We will try to link them in where appropriate.
Do all the quality filtering, trimming, etc. that you think you should do.
The khmer tools work “out of the box” on interleaved paired-end data.
All of our scripts will take in .fq or .fastq files as FASTQ, and all other files as FASTA. gzip files are always accepted. Let us know if not; that’s a bug!
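As an illustration of the file-naming rule above, here is a minimal sketch in Python. It is not khmer's actual detection code; the function name `guess_format` is hypothetical, and it assumes that gzipped files carry a trailing `.gz` that can be ignored before checking the extension.

```python
def guess_format(filename):
    """Apply the stated rule: .fq/.fastq means FASTQ, anything else FASTA."""
    name = filename.lower()
    # Assumption: gzip-compressed files end in ".gz"; strip it before checking.
    if name.endswith(".gz"):
        name = name[:-3]
    if name.endswith((".fq", ".fastq")):
        return "FASTQ"
    return "FASTA"


for example in ["reads.fq", "reads.fastq.gz", "contigs.fa", "reads.txt.gz"]:
    print(example, "->", guess_format(example))
```

The practical consequence is that a FASTQ file with an unusual extension (say, `.txt`) would be read as FASTA, so it is worth renaming such files first.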
Broadly, normalize each insert library separately, in the following way:
For high-coverage libraries (> ~50x), do three-pass digital normalization: run normalize-by-median.py with --cutoff=20 and then run filter-abund.py with --cutoff=2. Now split out the remaining paired-end/interleaved and single-end reads using extract-paired-reads.py, and run normalize-by-median.py on the paired-end and single-end files (using --unpaired-reads) with --cutoff=5.
For low-coverage libraries (< 50x), do single-pass digital normalization: run normalize-by-median.py with --cutoff=10.
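The choice between the two recipes can be written down as a small lookup. This is only a sketch of the decision described above: the function name `normalization_plan` is hypothetical, it lists script names and cutoffs rather than complete command lines, and treating exactly 50x as low coverage is an assumption, since the text gives the boundary only approximately.

```python
def normalization_plan(coverage):
    """Return the (script, cutoff option) steps suggested for one insert library."""
    # Assumption: exactly 50x is treated as low coverage.
    if coverage > 50:
        # Three-pass digital normalization for high-coverage libraries.
        return [
            ("normalize-by-median.py", "--cutoff=20"),
            ("filter-abund.py", "--cutoff=2"),
            ("extract-paired-reads.py", None),
            ("normalize-by-median.py", "--cutoff=5"),
        ]
    # Single-pass digital normalization for low-coverage libraries.
    return [("normalize-by-median.py", "--cutoff=10")]


for coverage in (30, 200):
    print(f"{coverage}x:")
    for script, option in normalization_plan(coverage):
        print("   ", script, option or "")
```

Each insert library gets its own plan, since the recommendation is to normalize libraries separately.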
You can read about this process in the digital normalization paper.
Run normalize-by-median.py with --cutoff=20.
You can read about this process in the digital normalization paper.
Run normalize-by-median.py with --cutoff=20 (we’ve also found --cutoff=10 works fine).
Run sandbox/filter-below-abund.py with --cutoff=50 (if you ran normalize-by-median.py with --cutoff=10) or with --cutoff=100 (if you ran normalize-by-median.py with --cutoff=20).
(We actually use Velvet at this point, but there should be no harm in using a metagenome assembler such as MetaVelvet or MetaIDBA or SOAPdenovo.)
Read more about this in the partitioning paper. We have some upcoming papers on partitioning and metagenome assembly, too; we’ll link those in when we can.
(Not tested by us!)
Run normalize-by-median.py with --cutoff=20.
(Not tested by us!)
Others have told us that you can apply digital normalization to Illumina data prior to using Illumina for RNA scaffolding or error correcting PacBio reads.
Our suggestion for this, based on no evidence whatsoever, is to run normalize-by-median.py with --cutoff=20 on the Illumina data.
For now, khmer only deals with assembly! So: assemble. Then go back to your original, unnormalized reads, and map those to your assembly with, e.g., bowtie. Then count as you normally would.
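If your mapper writes SAM output, one simple way to count is to tally mapped reads per contig. The following is a minimal sketch, not part of khmer: the function name `count_mapped_reads` and the example records are made up for illustration, and it relies only on the standard SAM layout (tab-separated fields, FLAG in column 2 with bit 0x4 meaning "unmapped", reference name in column 3).

```python
from collections import Counter


def count_mapped_reads(sam_lines):
    """Count mapped reads per reference sequence (contig) in SAM text."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, contig = int(fields[1]), fields[2]
        if flag & 0x4:
            continue  # bit 0x4 set: the read did not map
        counts[contig] += 1
    return counts


# Hypothetical records, for illustration only.
example_sam = [
    "@SQ\tSN:contig1\tLN:1000",
    "read1\t0\tcontig1\t10\t42\t10M\t*\t0\t0\tACGTTGCAAG\tIIIIIIIIII",
    "read2\t16\tcontig1\t55\t42\t10M\t*\t0\t0\tGGATCCTTAG\tIIIIIIIIII",
    "read3\t0\tcontig2\t7\t42\t10M\t*\t0\t0\tACGTTGCAAG\tIIIIIIIIII",
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTTTTTTT\tIIIIIIIIII",
]
print(count_mapped_reads(example_sam))
```

This counts every mapped record, so multi-mapped or secondary alignments would need extra filtering depending on how you normally count.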
The basic philosophy of digital normalization is “load your most valuable reads first.” Diginorm gets rid of redundancy iteratively, so you are more likely to retain the first reads fed in; this means you should load in paired-end reads, or longer reads, first.
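To see why read order matters, here is a toy version of the keep/discard rule: a read is kept only if the median count of its k-mers, among reads kept so far, is below the cutoff. This is a sketch for intuition only. It uses exact counts in a Python `Counter` and a tiny `k=4`, whereas khmer uses a memory-efficient probabilistic counting structure and much larger k; the function names and the two reads are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, cutoff, k=4, counts=None):
    """Toy digital normalization; returns (kept reads, k-mer counts)."""
    counts = Counter() if counts is None else counts
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k
        # Keep the read only if it is still "under-covered".
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)
    return kept, counts


long_read = "ACGTTGCAAGGCT"
short_read = "ACGTTGCAAG"  # a prefix of long_read

print(normalize([long_read, short_read], cutoff=1)[0])
print(normalize([short_read, long_read], cutoff=1)[0])
```

With the long read first, the short read is redundant and is dropped. With the short read first, it is the long read that gets dropped, because most of its k-mers have already been seen, and the extra sequence it carried is lost.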
You can use --loadgraph and --savegraph to do iterative normalizations on multiple files in multiple steps. For example, break
normalize-by-median.py [ ... ] file1.fa file2.fa file3.fa
into multiple steps like so:
normalize-by-median.py [ ... ] --savegraph file1.ct file1.fa
normalize-by-median.py [ ... ] --loadgraph file1.ct --savegraph file2.ct file2.fa
normalize-by-median.py [ ... ] --loadgraph file2.ct --savegraph file3.ct file3.fa
The results should be identical!
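The reason the stepwise runs match the single run is that the only state carried between reads is the k-mer count table. The toy sketch below makes that explicit by passing a `Counter` from one call to the next, standing in for the saved and reloaded graph file. It is an assumption-laden illustration (exact counts, `k=4`, hypothetical reads and function names), not khmer's implementation.

```python
from collections import Counter
from statistics import median


def normalize(reads, cutoff, k=4, counts=None):
    """Toy digital normalization; `counts` plays the role of the saved graph."""
    counts = Counter() if counts is None else counts
    kept = []
    for read in reads:
        words = [read[i:i + k] for i in range(len(read) - k + 1)]
        if words and median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)
    return kept, counts


a, b = "ACGTTGCAAG", "GGATCCTTAG"  # hypothetical reads sharing no k-mers
file1, file2, file3 = [a, a], [a, b], [b, b]

# One run over all files.
all_at_once, _ = normalize(file1 + file2 + file3, cutoff=2)

# Three runs, each starting from the counts left by the previous one.
stepwise, counts = [], None
for reads in (file1, file2, file3):
    kept, counts = normalize(reads, cutoff=2, counts=counts)
    stepwise.extend(kept)

print(stepwise == all_at_once)
```

The order of the files still matters: the equivalence holds only when the steps process the files in the same order as the single run.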
If you want to independently normalize multiple files for speed reasons, go ahead. Just remember to do a combined normalization at the end. For example, instead of
normalize-by-median.py [ ... ] file1.fa file2.fa file3.fa
you could do
normalize-by-median.py [ ... ] file1.fa
normalize-by-median.py [ ... ] file2.fa
normalize-by-median.py [ ... ] file3.fa
and then do a final
normalize-by-median.py [ ... ] file1.fa.keep file2.fa.keep file3.fa.keep
The results will not be identical, but should not differ significantly. The multipass approach will take more total time but may end up being faster walltime because you can execute the independent normalizations on multiple computers.
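The independent-then-combined workflow can be sketched with the same kind of toy normalizer. Here each "file" is normalized with its own fresh count table, and the kept reads are then pooled and normalized once more. As before, this is only an illustration under stated assumptions (exact counts, `k=4`, hypothetical reads and names), not khmer's code.

```python
from collections import Counter
from statistics import median


def normalize(reads, cutoff, k=4):
    """Toy digital normalization starting from an empty count table."""
    counts, kept = Counter(), []
    for read in reads:
        words = [read[i:i + k] for i in range(len(read) - k + 1)]
        if words and median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)
    return kept


a, b = "ACGTTGCAAG", "GGATCCTTAG"  # hypothetical reads sharing no k-mers
files = [[a, a], [a, a], [b]]

# Step 1: normalize each file on its own (these could run on separate machines).
keeps = [normalize(reads, cutoff=2) for reads in files]

# Step 2: pool the kept reads and normalize them together.
pooled = [read for keep in keeps for read in keep]
final = normalize(pooled, cutoff=2)

print(len(pooled), "reads after independent passes")
print(len(final), "reads after the combined pass")
```

The independent passes cannot see redundancy between files, so the pooled set still contains extra copies of the first read; the combined pass is what removes them. In this tiny example the final result happens to equal a single run over all files, but on real data the two need not match exactly.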
For a cleverer approach that we will someday implement, read the Beachcomber’s Dilemma.