ktable, simple k-mer counting

(This is not why most people want to use khmer – see An assembly handbook for khmer - rough draft for that.)

khmer ktables are a simple C++ library for counting k-mers in DNA sequences. There’s a complete Python wrapping and it should be pretty darned fast; it’s intended for genome-scale k-mer counting.

‘ktable’s are a table of 4**k counters. khmer then maps each k-mer into this table with a simple (and reversible) hash function.

Right now, only the Python interface is documented here. The C++ interface is essentially identical; if you need to use it and want it documented, drop me a line.

Counting Speed and Memory Usage

On the 5 mb Shewanella oneidensis genome, khmer takes less than a second to count all k-mers, for any k between 6 and 12. At 13 it craps out because the table goes over my default stack size limit.

Approximate memory usage can be calculated by finding the size of a long long on your machine and then multiplying that by 4**k. For a 12bp wordsize, this works out to 16384*1024; on an Intel-based processor running Linux, long long is 8 bytes, so memory usage is approximately 128 mb.

Python interface

Essentially everything requires a ktable.

import khmer
ktable = khmer.new_ktable(L)

These commands will create a new ktable of size 4**L, suitable for counting L-mers.

Each ktable object has a few accessor functions:

  • ktable.ksize() will return L.
  • ktable.max_hash() will return the max hash value in the table, 4**L - 1.
  • ktable.n_entries() will return the number of table entries, 4**L.

The forward and reverse hashing functions are directly accessible:

  • hashval = ktable.forward_hash(kmer) will return the hash value

    of the given kmer.

  • kmer = ktable.reverse_hash(hashval) will return the kmer that hashes

    to the given hashval.

There are also some counting functions:

  • ktable.count(kmer) will increment the count associated with the given kmer by one.
  • ktable.consume(sequence) will run through the sequence and count each kmer present.
  • n = ktable.get(kmer|hashval) will return the count associated with the given kmer string or the given hashval, whichever is passed in.
  • ktable.set(kmer|hashval, count) set the count for the given kmer string or hashval.

In all of the cases above, ‘kmer’ is an L-length string, ‘hashval’ is a non-negative integer, and ‘sequence’ is a DNA sequence containing ONLY A/C/G/T.

Note: ‘N’ is not a legal DNA character as far as khmer is concerned!

And, finally, there are some set operations:

  • ktable.clear() empties the ktable.
  • ktable.update(other) adds all of the entries in other into ktable. The wordsize must be the same for both ktables.
  • intersection = ktable.intersect(other) returns a ktable where only nonzero entries in both ktables are kept. The count for ach entry is the sum of the counts in ktable and other.

An Example

This short code example will count all 6-mers present in the given DNA sequence, and then print them all out along with their prevalence.

# make a new ktable, L=6
ktable = khmer.new_ktable(6)

# count all k-mers in the given string
ktable.consume("ATGAGAGACACAGGGAGAGACCCAATTAGAGAATTGGACC")

# run through all entries. if they have nonzero presence, print.
for i in range(0, ktable.n_entries()):
   n = ktable.get(i)
   if n:
      print ktable.reverse_hash(i), "is present", n, "times."
CTB, 3/2005
comments powered by Disqus

Table Of Contents

Previous topic

Partitioning large data sets (50m+ reads)

Next topic

Architecture and Design

This Page

Edit this document!

This file can be edited directly through the Web. Anyone can update and fix errors in this document with few clicks -- no downloads needed.

  1. Go to ktable, simple k-mer counting on GitHub.
  2. Edit files using GitHub's text editor in your web browser (see the 'Edit' tab on the top right of the file)
  3. Fill in the Commit message text box at the bottom of the page describing why you made the changes. Press the Propose file change button next to it when done.
  4. Then click Send a pull request.
  5. Your changes are now queued for review under the project's Pull requests tab on GitHub!

For an introduction to the documentation format please see the reST primer.