Partitioning large data sets (50m+ reads)

"Partitioning" is what khmer calls the process of separating reads that do not connect to each other into different logical bins. The goal of partitioning is to apply divide & conquer to the process of metagenomic assembly.

Basic partitioning

The basic workflow for partitioning is in the figure below:


Briefly, you load everything into khmer's probabilistic graph representation; exhaustively explore the graph to find all disconnected sequences; merge the results of the (parallelized) graph exploration; annotate sequences with their partition; and then extract the different partitions into files grouped by partition size. These groups can then be assembled individually.

Artifact removal

As part of our partitioning research, we discovered that large Illumina data sets tend to contain a single large, connected component. This connected component seems to stem from sequencing artifacts that causes knots in the assembly graph. We have developed tools to forcibly remove the knot at the heart of the graph.

Here's the workflow:


Running on an example data set

Here is a set of commands for running both basic partitioning and artifact removal on a small soil metagenomics data set that we've made available for this purpose.

The data set is about 1.1G and you can download it from here:

cd /path/to/data

# the next command will create a '50m.ct' and a '50m.tagset',
# representing the de Bruijn graph -k 32 -N 4 -x 16e9 50m iowa-corn-50m.fa.gz

# this will then partition that graph. should take a while.
# update threads to something higher if you have more cores.
# this creates a bunch of files, 50m.subset.*.pmap --threads 4 -s 1e5 50m

# now, merge the pmap files into one big pmap file, 50m.pmap.merged 50m

# next, annotate the original sequences with their partition numbers.
# this will create iowa-corn-50m.fa.gz.part 50m iowa-corn-50m.fa.gz

# now, extract the partitions in groups into 'iowa-corn-50m.groupNNNN.fa' iowa-corn-50m iowa-corn-50m.fa.gz.part

# at this point, you can assemble the group files individually.  Note,
# however, that the last one them is quite big?  this is because it's
# the lump! yay!

# if you want to break up the lump, go through the partitioning bit
# on the group file, but this time with a twist:
mv iowa-corn-50m.group0005.fa corn-50m.lump.fa

# create graph, -x 8e9 lump corn-50m.lump.fa

# create an initial set of stoptags to help in knot-traversal; otherwise,
# partitioning and knot-traversal (which is systematic) is really expensive. lump

# now partition the graph, using the stoptags file --stoptags lump.stoptags lump

# use the partitioned subsets to find the k-mers that nucleate the lump -x 2e8 -N 4 lump

# remove those k-mers from the fasta files *.stoptags corn-50m.lump.fa

# now, reload the filtered data set in and partition again.
# NOTE: '' uses the file extension to determine
# if the file is formatted as FASTA or FASTQ. The default is
# fasta, therefore if your files are fastq formatted you need
# to append 'fastq' to the name so that ''
# will parse the file correctly -x 8e9 lumpfilt corn-50m.lump.fa.stopfilt -T 4 lumpfilt lumpfilt lumpfilt corn-50m.lump.fa.stopfilt corn-50m-lump corn-50m.lump.fa.stopfilt.part

# and voila, after all that, you should now have your de-knotted lump in
#*.fa.  The *.group????.fa files can now be
# assembled individually by your favorite assembler.

Post-partitioning assembly

The 'extract-partitions' script takes reads belonging to each partition and aggregates them into 'group' files; each group file contains at least one entire partition (and generally a lot more). Note, you can control the number of reads in each file (equiv, the size of these files) with some of the arguments that 'extract-partitions' takes.

Now that you have these files... what do you do with them? The short answer is: assemble them! Each of these group files contains reads that do not connect to reads in other files, so the files can be assembled individually (which is the whole point of partitioning).

If you're using Velvet, checkout the sandbox/ script, which you can run like this:

bash /path/to/khmer/sandbox/ <groupfile> <k>

This script does three things:

  • first, it breaks the reads up into paired reads and single reads, and puts them in separate files (.pe and .se);
  • second, it strips off the partition information from the reads, which confuses Velvet;
  • and third, it runs velveth and velvetg to actually assemble.

You can implement your own approach, of course, but this is an example of what we do ourselves.

comments powered by Disqus