=============================
ktable, simple k-mer counting
=============================

(This is not why most people want to use khmer -- see :doc:`guide` for that.)

khmer ktables are a simple C++ library for counting k-mers in DNA
sequences. There's a complete Python wrapping and it should be pretty
darned fast; it's intended for genome-scale k-mer counting.

A ``ktable`` is a table of 4**k counters. khmer then maps each k-mer
into this table with a simple (and reversible) hash function.

Right now, only the Python interface is documented here. The C++
interface is essentially identical; if you need to use it and want it
documented, drop me a line.

Counting Speed and Memory Usage
===============================

On the 5 Mb *Shewanella oneidensis* genome, khmer takes less than a
second to count all k-mers, for any k between 6 and 12. At k = 13 it
craps out because the table goes over my default stack size limit.

Approximate memory usage can be calculated by finding the size of a
``long long`` on your machine and then multiplying that by 4**k. For a
12 bp wordsize, 4**12 works out to 16384*1024 entries; on an Intel-based
processor running Linux, ``long long`` is 8 bytes, so memory usage is
approximately 128 MB.

Python interface
================

Essentially everything requires a ``ktable``. ::

   import khmer
   ktable = khmer.new_ktable(L)

These commands will create a new ``ktable`` of size 4**L, suitable for
counting L-mers.

Each ``ktable`` object has a few accessor functions:

* ``ktable.ksize()`` will return L.

* ``ktable.max_hash()`` will return the max hash value in the table,
  4**L - 1.

* ``ktable.n_entries()`` will return the number of table entries, 4**L.

The forward and reverse hashing functions are directly accessible:

* ``hashval = ktable.forward_hash(kmer)`` will return the hash value of
  the given kmer.

* ``kmer = ktable.reverse_hash(hashval)`` will return the kmer that
  hashes to the given hashval.
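To make the idea of a "simple (and reversible)" hash concrete, here is a minimal pure-Python sketch that packs each base into two bits. This is only an illustration under assumptions: the A/C/G/T to 0/1/2/3 encoding and both function bodies are hypothetical and are not taken from khmer's C++ source, which may order or encode bases differently. Note also that this ``reverse_hash`` takes k explicitly, whereas a real ``ktable`` already knows its own wordsize.

```python
# Hypothetical illustration only; khmer's real hash is implemented in C++
# and may use a different base encoding.
_BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed encoding
_BITS_TO_BASE = "ACGT"


def forward_hash(kmer):
    """Pack a k-mer into an integer in [0, 4**k - 1], two bits per base."""
    hashval = 0
    for base in kmer:
        # Raises KeyError for anything other than A/C/G/T (for example 'N').
        hashval = (hashval << 2) | _BASE_TO_BITS[base]
    return hashval


def reverse_hash(hashval, k):
    """Unpack an integer back into the length-k k-mer that produced it."""
    bases = []
    for _ in range(k):
        bases.append(_BITS_TO_BASE[hashval & 3])  # lowest two bits = last base
        hashval >>= 2
    return "".join(reversed(bases))
```

Under this encoding every L-mer gets a distinct index between 0 and 4**L - 1, which is why a table of exactly 4**L counters is enough and why the mapping can be inverted without storing the k-mer strings.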
There are also some counting functions:

* ``ktable.count(kmer)`` will increment the count associated with the
  given kmer by one.

* ``ktable.consume(sequence)`` will run through the sequence and count
  each kmer present.

* ``n = ktable.get(kmer|hashval)`` will return the count associated with
  the given kmer string or the given hashval, whichever is passed in.

* ``ktable.set(kmer|hashval, count)`` will set the count for the given
  kmer string or hashval.

In all of the cases above, 'kmer' is an L-length string, 'hashval' is a
non-negative integer, and 'sequence' is a DNA sequence containing ONLY
A/C/G/T.

**Note:** 'N' is not a legal DNA character as far as khmer is concerned!

And, finally, there are some set operations:

* ``ktable.clear()`` empties the ktable.

* ``ktable.update(other)`` adds all of the entries in ``other`` into
  ``ktable``. The wordsize must be the same for both ktables.

* ``intersection = ktable.intersect(other)`` returns a ktable where only
  nonzero entries in both ktables are kept. The count for each entry is
  the sum of the counts in ``ktable`` and ``other``.

An Example
==========

This short code example will count all 6-mers present in the given DNA
sequence, and then print them all out along with their prevalence. ::

   import khmer

   # make a new ktable, L=6
   ktable = khmer.new_ktable(6)

   # count all k-mers in the given string
   ktable.consume("ATGAGAGACACAGGGAGAGACCCAATTAGAGAATTGGACC")

   # run through all entries. if they have nonzero presence, print.
   for i in range(ktable.n_entries()):
       n = ktable.get(i)
       if n:
           print("%s is present %d times." % (ktable.reverse_hash(i), n))

CTB, 3/2005