(This is not why most people want to use khmer – see An assembly handbook for khmer - rough draft for that.)
khmer ktables are a simple C++ library for counting k-mers in DNA sequences. There’s a complete Python wrapping and it should be pretty darned fast; it’s intended for genome-scale k-mer counting.
‘ktable’s are a table of 4**k counters. khmer then maps each k-mer into this table with a simple (and reversible) hash function.
Right now, only the Python interface is documented here. The C++ interface is essentially identical; if you need to use it and want it documented, drop me a line.
On the 5 mb Shewanella oneidensis genome, khmer takes less than a second to count all k-mers, for any k between 6 and 12. At 13 it craps out because the table goes over my default stack size limit.
Approximate memory usage can be calculated by finding the size of a long long on your machine and then multiplying that by 4**k. For a 12bp wordsize, this works out to 16384*1024; on an Intel-based processor running Linux, long long is 8 bytes, so memory usage is approximately 128 mb.
Essentially everything requires a ktable.
import khmer
ktable = khmer.new_ktable(L)
These commands will create a new ktable of size 4**L, suitable for counting L-mers.
Each ktable object has a few accessor functions:
- ktable.ksize() will return L.
- ktable.max_hash() will return the max hash value in the table, 4**L - 1.
- ktable.n_entries() will return the number of table entries, 4**L.
The forward and reverse hashing functions are directly accessible:
- hashval = ktable.forward_hash(kmer) will return the hash value
of the given kmer.
- kmer = ktable.reverse_hash(hashval) will return the kmer that hashes
to the given hashval.
There are also some counting functions:
- ktable.count(kmer) will increment the count associated with the given kmer by one.
- ktable.consume(sequence) will run through the sequence and count each kmer present.
- n = ktable.get(kmer|hashval) will return the count associated with the given kmer string or the given hashval, whichever is passed in.
- ktable.set(kmer|hashval, count) set the count for the given kmer string or hashval.
In all of the cases above, ‘kmer’ is an L-length string, ‘hashval’ is a non-negative integer, and ‘sequence’ is a DNA sequence containing ONLY A/C/G/T.
Note: ‘N’ is not a legal DNA character as far as khmer is concerned!
And, finally, there are some set operations:
- ktable.clear() empties the ktable.
- ktable.update(other) adds all of the entries in other into ktable. The wordsize must be the same for both ktables.
- intersection = ktable.intersect(other) returns a ktable where only nonzero entries in both ktables are kept. The count for ach entry is the sum of the counts in ktable and other.
This short code example will count all 6-mers present in the given DNA sequence, and then print them all out along with their prevalence.
# make a new ktable, L=6
ktable = khmer.new_ktable(6)
# count all k-mers in the given string
ktable.consume("ATGAGAGACACAGGGAGAGACCCAATTAGAGAATTGGACC")
# run through all entries. if they have nonzero presence, print.
for i in range(0, ktable.n_entries()):
n = ktable.get(i)
if n:
print ktable.reverse_hash(i), "is present", n, "times."
CTB, 3/2005
This file can be edited directly through the Web. Anyone can update and fix errors in this document with few clicks -- no downloads needed.
For an introduction to the documentation format please see the reST primer.