Introduction to khmer

Introduction

khmer is a software library and toolkit for k-mer based analysis and transformation of nucleotide sequence data. The primary focus of development has been scaling assembly of metagenomes and transcriptomes.

khmer supports a number of transformations, both inexact transformations (abundance filtering; error trimming) and exact transformations (graph-size filtering, to throw away disconnected reads; partitioning, to split reads into disjoint sets). All of these transformations operate with constant memory consumption (with the exception of partitioning), and typically require less memory than is required to assemble the data.

Most of khmer is built around a single underlying probabilistic data structure known as a Bloom filter (also see Count-Min Sketch and These Are Not The k-mers You're Looking For). In khmer this data structure is implemented as a set of hash tables, each of different size, with no collision detection. These hash tables can store either the presence (Bloom filter) or frequency (Count-Min Sketch) of specific k-mers. The lack of collision detection means that the data structure may report a k-mer as being present when it is not, in fact, in the data set. However, it will never incorrectly report a k-mer as being absent when it truly is present. This one-sided error makes the Bloom filter very useful for certain kinds of operations.
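The multi-table design described above can be sketched in a few lines of plain Python. This is an illustrative model, not khmer's C++ implementation: the table sizes and the use of Python's built-in `hash` are stand-ins. The counting variant shown is the Count-Min Sketch; a presence/absence Bloom filter is the special case where stored values are capped at 1.

```python
# Several hash tables of different sizes, no collision detection.
# Querying takes the minimum across tables: collisions can only inflate
# a count, so the estimate is >= the true count (one-sided error).

class MultiTableCounter:
    def __init__(self, table_sizes):
        self.tables = [[0] * size for size in table_sizes]

    def count(self, kmer):
        h = hash(kmer)  # khmer hashes k-mers itself; this is a stand-in
        for table in self.tables:
            table[h % len(table)] += 1

    def get(self, kmer):
        h = hash(kmer)
        return min(table[h % len(table)] for table in self.tables)

counter = MultiTableCounter([499, 503, 509])
for kmer in ["ATGGC", "ATGGC", "GGCTA"]:
    counter.count(kmer)

assert counter.get("ATGGC") >= 2  # never underestimates a true count
assert counter.get("GGCTA") >= 1
```

Using several tables of different (ideally coprime) sizes means a false positive requires a collision in every table at once, which is what keeps the error rate low at a fixed memory budget.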

khmer supports arbitrarily large k-sizes, although certain graph-based operations are limited to k <= 32.

The khmer core library is implemented in C++, while all of the khmer scripts and tests access the core library via a Python wrapper.

Tutorials highlighting khmer are available at khmer-protocols and khmer-recipes. The former provides detailed protocols for using khmer to analyze either a transcriptome or a metagenome. The latter provides individual recipes for using khmer in a variety of sequence-oriented tasks such as extracting reads by coverage, estimating a genome or metagenome size from unassembled reads, and error-trimming reads via streaming k-mer abundance.

Using khmer

khmer comes "out of the box" with a number of scripts that make it immediately useful for a few different operations, including (but not limited to) the following.

  • normalizing read coverage ("digital normalization")
  • dividing reads into disjoint sets that do not connect ("partitioning")
  • eliminating reads that will not be used by a de Bruijn graph assembler
  • removing reads with low- or high-abundance k-mers
  • trimming reads of certain kinds of sequencing errors
  • counting k-mers and estimating data set coverage based on k-mer counts
  • running Velvet and calculating assembly statistics
  • converting FASTQ to FASTA
  • converting between paired and interleaved formats for paired FASTQ data

Practical considerations

The most important consideration when using khmer is whether the transformation or filter you're applying is appropriate for the data you're trying to assemble. For example, two of the most powerful operations available in khmer, graph-size filtering and graph partitioning, only make sense for assembly datasets with many theoretically unconnected components, which is typical of metagenomic data sets. Also, while digital normalization can be helpful for transcriptome assembly, it is inappropriate for other RNA-seq applications, such as differential expression analysis, that rely on signal from variable coverage.

The second most important consideration is memory usage. The effectiveness of all of the Bloom filter-based functions (which is everything interesting in khmer!) depends critically on having enough memory to do a good job. See Setting khmer memory usage for more information.
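A back-of-envelope calculation shows why memory matters so much. Assuming a filter built from several one-hash-per-table Bloom tables (as sketched above), each table's chance of a spurious hit is roughly its occupancy, 1 - exp(-n/m) for n k-mers in a table of size m, and a false positive requires a hit in every table. The table sizes and k-mer count below are illustrative, not khmer defaults.

```python
# Estimate the false positive rate of a multi-table Bloom filter as the
# product of per-table occupancies. Halving the memory budget can raise
# the error rate by orders of magnitude.

from math import exp, prod

def false_positive_rate(n_kmers, table_sizes):
    return prod(1 - exp(-n_kmers / m) for m in table_sizes)

roomy = false_positive_rate(1_000_000, [16_000_003, 16_000_033, 16_000_087])
cramped = false_positive_rate(1_000_000, [2_000_003, 2_000_029, 2_000_039])

assert roomy < cramped < 1.0  # more memory -> fewer false positives
```

This is why undersizing the tables quietly degrades every downstream operation: the data structure never loses true k-mers, but it starts inventing spurious ones.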
