..
   This file is part of khmer, https://github.com/dib-lab/khmer/, and is
   Copyright (C) 2011-2015 Michigan State University
   Copyright (C) 2015 The Regents of the University of California.
   It is licensed under the three-clause BSD license; see LICENSE.
   Contact: khmer-project@idyll.org

   Redistribution and use in source and binary forms, with or without
   modification, are permitted provided that the following conditions are
   met:

    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above
      copyright notice, this list of conditions and the following
      disclaimer in the documentation and/or other materials provided
      with the distribution.

    * Neither the name of the Michigan State University nor the names
      of its contributors may be used to endorse or promote products
      derived from this software without specific prior written
      permission.

   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
   HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

   Contact: khmer-project@idyll.org

*********************
Introduction to khmer
*********************

Introduction
============

khmer is a library and toolkit for doing k-mer-based dataset analysis and
transformations.
Our focus in developing it has been on scaling assembly of metagenomes and
mRNA.

khmer can be used for a number of transformations, including inexact
transformations (abundance filtering and error trimming) and exact
transformations (graph-size filtering, to throw away disconnected reads; and
partitioning, to split reads into disjoint sets). Of these, only partitioning
does not run in constant memory. In all cases, the memory required for
assembly with Velvet or another de Bruijn graph assembler will be more than
the memory required to use our software. Our software will not increase the
memory required for Velvet, either, although we may not be able to *decrease*
the memory required for assembly for every data set.

Most of khmer relies on an underlying probabilistic data structure known as a
Bloom filter (also see the Count-Min Sketch and the paper "These Are Not The
k-mers You're Looking For"), which is essentially a set of hash tables, each
of a different size, with no collision detection. These hash tables are used
to store the presence of specific k-mers and/or their counts. The lack of
collision detection means that the Bloom filter may report a k-mer as being
"present" when it is not, in fact, in the data set; however, it will never
incorrectly report a k-mer as being absent when it *is* present. This
one-sided error makes the Bloom filter very useful for certain kinds of
operations.

khmer is not tied to a specific k-mer size (K), and currently works for
K <= 32. We will be integrating code for K <= 64 soon.

khmer is implemented in C++ with a Python wrapper, which is what all of the
scripts use.

Documentation for khmer is provided on the Web sites for khmer-protocols and
khmer-recipes. khmer-protocols provides detailed protocols for using khmer to
analyze either a transcriptome or a metagenome.
khmer-recipes provides individual recipes for using khmer in a variety of
sequence-oriented tasks such as extracting reads by coverage, estimating a
genome or metagenome size from unassembled reads, and error-trimming reads
via streaming k-mer abundance.

Using khmer
===========

khmer comes "out of the box" with a number of scripts that make it
immediately useful for a few different operations, including:

- normalizing read coverage ("digital normalization");

- dividing reads into disjoint sets that do not connect ("partitioning");

- eliminating reads that will not be used by a de Bruijn graph assembler;

- removing reads with low- or high-abundance k-mers;

- trimming reads of certain kinds of sequencing errors;

- counting k-mers and estimating data set coverage based on k-mer counts;

- running Velvet and calculating assembly statistics;

- optimizing assemblies on various parameters;

- converting FASTQ to FASTA;

and a few other miscellaneous functions.

Practical considerations
========================

The most important thing to think about when using khmer is whether or not
the transformation or filter you're applying is appropriate for the data
you're trying to assemble. Two of the most powerful operations available in
khmer, graph-size filtering and graph partitioning, only make sense for
assembly data sets with many theoretically unconnected components. This is
typical of metagenomic data sets.

The second most important consideration is memory usage. The effectiveness of
all of the Bloom filter-based functions (which is everything interesting in
khmer!) depends critically on having enough memory to do a good job. See
:doc:`user/choosing-table-sizes` for more information.

Copyright and license
=====================

Portions of khmer are Copyright California Institute of Technology, where the
exact counting code was first developed. All other code developed through
2014 is copyright Michigan State University.
Portions are copyright Michigan State University and the Regents of the
University of California. All of the code is freely available for use and
reuse under the three-clause BSD license; see LICENSE.
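
Example: a toy k-mer counting table
===================================

The one-sided error described in the introduction can be illustrated with a
small, self-contained sketch. This is not khmer's implementation or API: the
``ToyCountingTable`` class, its method names, the SHA-1 based hash, and the
table sizes are all assumptions made for this example. It keeps several
counter tables of different sizes with no collision detection and reports the
minimum counter for a k-mer, so a reported count can be too high but never
too low.

.. code-block:: python

   import hashlib


   class ToyCountingTable:
       """Toy k-mer counter (hypothetical, for illustration only)."""

       def __init__(self, ksize, table_sizes):
           self.ksize = ksize
           # One list of counters per table. The sizes differ so that two
           # k-mers colliding in one table rarely collide in all of them.
           self.tables = [[0] * size for size in table_sizes]

       def _hash(self, kmer):
           # Stand-in hash function chosen for this sketch.
           digest = hashlib.sha1(kmer.encode("ascii")).digest()
           return int.from_bytes(digest[:8], "big")

       def add(self, kmer):
           # Increment one counter per table; colliding k-mers share it.
           h = self._hash(kmer)
           for table in self.tables:
               table[h % len(table)] += 1

       def count(self, kmer):
           # Collisions can only inflate a counter, so the minimum over
           # all tables is an upper bound on the true count.
           h = self._hash(kmer)
           return min(table[h % len(table)] for table in self.tables)

       def consume(self, sequence):
           # Add every k-mer (substring of length ksize) in the sequence.
           for i in range(len(sequence) - self.ksize + 1):
               self.add(sequence[i:i + self.ksize])


   table = ToyCountingTable(ksize=4, table_sizes=[97, 101, 103])
   table.consume("ATGGCGATGGCG")

Because counters are only ever incremented, ``table.count("ATGG")`` is at
least 2 here, the number of times ``ATGG`` occurs in the input. A k-mer that
never occurs in the input can still receive a nonzero count if it happens to
collide with other k-mers in every table. More or larger tables make such
false positives less likely, which is why available memory matters so much.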