Sequence access patterns in khmer

(Developer docs)

These are the patterns for which we want to provide both flexible and fast parsing options.

Pattern 1: read, no write, with modify data structures

Used by load-into-counting and load-graph.

Single-threaded

Pseudocode:

for read in dataset:
   hash.update(read)

Multi-threaded

Here, threaded_fn is a function that’s called by multiple threads, so e.g. ‘hash.update’ might be being called simultaneously.

def threaded_fn(list_of_reads):
   for read in list_of_reads:
      hash.update(read)

Pattern 2: read and write, with static data structures

Used by filter-abund.

Single-threaded

Here, ro_hash is a read-only data structure; it is not being updated by anyone.

for read in dataset:
   read = ro_hash.modify(read)
   if read:
      output(read)

Multi-threaded

Again, ro_hash is entirely read-only and not being updated or modified at all:

def threaded_fn(list_of_reads):
   for read in list_of_reads:
       read = ro_hash.modify(read)
       if read:
          output(read)

Pattern 3: read, write, and modify data structures

Used by normalize-by-median. This pattern is the most general, so if we can implement if efficiently in all cases, then

Single-threaded

Code:

for read in dataset:
   read = hash.modify(read)
   if condition:
      output(read)

Multi-threaded

Code:

def threaded_fn(list_of_reads):
   for read in list_of_reads:
       read = hash.modify(read)
       if condition:
          output(read)

Notes on parallelization

Re multi-chassis parallelization, the ‘hash’ objects above are large (often multiple GB) and so our primary focus has been on big, multi-threaded SMP machines, rather than on distributing the processing across multiple machines.

Paired-end reads

There are a number of situations where we want to get a pair of reads, if such are available; e.g. instead of:

for read in list_of_reads

we want

for a, b in list_of_reads

Note that in some cases either a or b may be None, i.e. we may be dealing with orphaned reads.

Delivering k-mers rather than reads

In some cases - long sequences/chromosomes, for example – it may be more efficient to have a k-mer data pump rather than a read data pump. Thoughts for the future.

Hash function thoughts

Another thought for the future is that we may want to have multiple distinct hash functions; probably the best way to do this is to allow each Hashtable object to have its own hash function. This way we can take advantage of double-stranded hash functions (the current), single-stranded hash functions, and alternative implementations of either (e.g. cyclic hashes like Jordan is using).

comments powered by Disqus

Edit this document!

This file can be edited directly through the Web. Anyone can update and fix errors in this document with few clicks -- no downloads needed.

  1. Go to Sequence access patterns in khmer on GitHub.
  2. Edit files using GitHub's text editor in your web browser (see the 'Edit' tab on the top right of the file)
  3. Fill in the Commit message text box at the bottom of the page describing why you made the changes. Press the Propose file change button next to it when done.
  4. Then click Send a pull request.
  5. Your changes are now queued for review under the project's Pull requests tab on GitHub!

For an introduction to the documentation format please see the reST primer.