Sequence access patterns in khmer

(Developer docs)

These are the patterns for which we want to provide both flexible and fast parsing options.

Pattern 1: read, no write, with modified data structures

Used by load-into-counting and load-graph.

Single-threaded

Pseudocode:

for read in dataset:
    # update the table in place; nothing is written out
    hash.update(read)
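As a rough illustration of this pattern, the sketch below counts k-mers from a small in-memory dataset. `ToyCountingTable`, the value of `K`, and the example reads are all hypothetical stand-ins; this is not khmer's API, only a minimal model of "read the data, update a structure, write nothing".

```python
from collections import Counter

K = 4  # k-mer size, chosen arbitrarily for this illustration


class ToyCountingTable:
    """Hypothetical stand-in for the 'hash' object; not khmer's API."""

    def __init__(self, k):
        self.k = k
        self.counts = Counter()

    def update(self, read):
        # Count every k-mer in the read; the read itself is never written out.
        for i in range(len(read) - self.k + 1):
            self.counts[read[i:i + self.k]] += 1


dataset = ["ACGTACGT", "CGTACGTA"]  # made-up reads
table = ToyCountingTable(K)
for read in dataset:
    table.update(read)
```

After the loop, `table.counts` holds the abundance of each 4-mer seen across both reads.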

Multi-threaded

Here, threaded_fn is a function that is called by multiple threads, so hash.update may be called simultaneously from several of them.

def threaded_fn(list_of_reads):
    # each thread receives its own list of reads; 'hash' is shared
    for read in list_of_reads:
        hash.update(read)
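One possible way to make concurrent updates safe, sketched here with Python's standard `threading` module, is to count into a thread-local table and merge it into the shared one under a lock. The names `counts`, `counts_lock`, and `K` are assumptions made for this example, and nothing here reflects how khmer's own tables handle concurrency.

```python
import threading
from collections import Counter

K = 4                           # illustrative k-mer size
counts = Counter()              # shared, mutable table (stand-in for 'hash')
counts_lock = threading.Lock()  # guards every write to 'counts'


def threaded_fn(list_of_reads):
    # Count locally first, then merge under the lock, so the shared table
    # is locked once per chunk rather than once per k-mer.
    local = Counter()
    for read in list_of_reads:
        for i in range(len(read) - K + 1):
            local[read[i:i + K]] += 1
    with counts_lock:
        counts.update(local)


chunks = [["ACGTACGT"], ["CGTACGTA"], ["ACGTACGT"]]  # made-up reads
threads = [threading.Thread(target=threaded_fn, args=(chunk,)) for chunk in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Merging per chunk keeps lock contention low, at the cost of the shared table lagging slightly behind while a chunk is in progress. That is acceptable here because nothing reads the table until all threads have finished.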

Pattern 2: read and write, with static data structures

Used by filter-abund.

Single-threaded

Here, ro_hash is a read-only data structure; it is not being updated by anyone.

for read in dataset:
    # ro_hash is only consulted here, never updated
    read = ro_hash.modify(read)
    if read:
        output(read)
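A minimal sketch of this pattern, loosely modeled on abundance-based trimming, is shown below. The counts in `ro_counts`, the `MIN_COUNT` cutoff, and the trimming rule itself are assumptions made for illustration; the only point is that the table is consulted but never changed, which `MappingProxyType` enforces.

```python
from types import MappingProxyType

K = 4          # illustrative k-mer size
MIN_COUNT = 2  # assumed abundance cutoff

# Read-only view of precomputed k-mer counts (made-up values).
ro_counts = MappingProxyType({"ACGT": 5, "CGTA": 5, "GTAC": 4, "TACG": 1})


def modify(read):
    """Trim the read just before its first low-abundance k-mer.

    Returns None when fewer than K bases would remain.
    """
    if len(read) < K:
        return None
    for i in range(len(read) - K + 1):
        if ro_counts.get(read[i:i + K], 0) < MIN_COUNT:
            trimmed = read[:i + K - 1]
            return trimmed if len(trimmed) >= K else None
    return read


def filter_reads(dataset):
    # Same shape as the pseudocode: modify, then output whatever survives.
    for read in dataset:
        read = modify(read)
        if read:
            yield read


kept = list(filter_reads(["ACGTAC", "ACGTACGA", "TACGTA"]))
```

Here the first read passes unchanged, the second is trimmed where the low-count k-mer `TACG` begins, and the third is dropped entirely.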

Multi-threaded

Again, ro_hash is entirely read-only and not being updated or modified at all:

def threaded_fn(list_of_reads):
    for read in list_of_reads:
        read = ro_hash.modify(read)
        if read:
            output(read)
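Because the data structure is never modified, threads can consult it without locking; only the output side needs coordination. The sketch below illustrates that split with a `ThreadPoolExecutor`. The `VALID` set, the filtering rule in `modify`, and the `results` list standing in for an output file are all hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

VALID = frozenset("ACGT")   # immutable, so safe to share without a lock
results = []                # stand-in for an output file
results_lock = threading.Lock()


def modify(read):
    # Hypothetical read-only check: drop reads with non-ACGT characters.
    return read if set(read) <= VALID else None


def output(read):
    # Writes are the only shared mutation, so they are the only locked step.
    with results_lock:
        results.append(read)


def threaded_fn(list_of_reads):
    for read in list_of_reads:
        read = modify(read)
        if read:
            output(read)


chunks = [["ACGT", "ACNT"], ["GGCC"], ["NNNN", "TTAA"]]  # made-up reads
with ThreadPoolExecutor(max_workers=3) as pool:
    # list() forces completion and re-raises any worker exception.
    list(pool.map(threaded_fn, chunks))
```

Note that the order of `results` depends on thread scheduling, so any consumer that needs the input order preserved would have to handle that separately.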

Pattern 3: read, write, and modify data structures

Used by normalize-by-median. This pattern is the most general, so if we can implement it efficiently in all cases, then the other two patterns should be covered as well.

Single-threaded

Code:

for read in dataset:
    read = hash.modify(read)
    # 'condition' is a placeholder for the keep/discard test
    if condition:
        output(read)
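To make the placeholder `condition` concrete, here is a small sketch in the spirit of median-based normalization: a read is kept only while the median count of its k-mers is below a cutoff, and keeping it updates the table. `K`, `CUTOFF`, and the helper names are assumptions for this example rather than khmer's implementation.

```python
from collections import Counter
from statistics import median

K = 4       # illustrative k-mer size
CUTOFF = 2  # assumed coverage cutoff
counts = Counter()  # stand-in for the mutable 'hash'


def kmers(read):
    return [read[i:i + K] for i in range(len(read) - K + 1)]


def normalize(dataset):
    for read in dataset:
        ks = kmers(read)
        if not ks:
            continue  # read shorter than K: nothing to measure
        # The decision depends on the table's current state...
        if median(counts[k] for k in ks) < CUTOFF:
            # ...and keeping the read changes that state for later reads.
            counts.update(ks)
            yield read


kept = list(normalize(["ACGTACGT"] * 4))  # four copies of a made-up read
```

Of the four identical reads, only the first two are kept: after they have been counted, the median k-mer count reaches the cutoff and the remaining copies are discarded.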

Multi-threaded

Code:

def threaded_fn(list_of_reads):
    for read in list_of_reads:
        read = hash.modify(read)
        if condition:
            output(read)
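What makes this case harder than the other two is that the keep/discard decision reads state that other threads are changing. One simple (and fairly coarse) way to keep results deterministic is to treat the check and the update as a single locked step, as sketched below. For brevity the sketch counts whole reads rather than k-mers; `counts`, `kept`, `lock`, and `CUTOFF` are hypothetical names.

```python
import threading
from collections import Counter

CUTOFF = 2           # assumed cutoff: keep at most this many copies of a read
counts = Counter()   # shared, mutable table (stand-in for 'hash')
kept = []            # stand-in for the output stream
lock = threading.Lock()


def threaded_fn(list_of_reads):
    for read in list_of_reads:
        with lock:
            # Check and update must happen together; otherwise two threads
            # could both see a count below CUTOFF and both keep the read.
            if counts[read] < CUTOFF:
                counts[read] += 1
                kept.append(read)


threads = [threading.Thread(target=threaded_fn, args=(["ACGT"] * 3,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Holding one lock around the whole step serializes the threads, so this shows the correctness requirement rather than an efficient design.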

Notes on parallelization

Regarding multi-chassis parallelization: the ‘hash’ objects above are large (often multiple GB), so our primary focus has been on big, multi-threaded SMP machines rather than on distributing the processing across multiple machines.

Paired-end reads

There are a number of situations where we want to get a pair of reads, if one is available; e.g. instead of:

for read in list_of_reads:
    ...

we want:

for a, b in list_of_reads:
    ...

Note that in some cases either a or b may be None, i.e. we may be dealing with orphaned reads.
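One way such an iterator could behave is sketched below. It assumes an interleaved stream of `(name, sequence)` records in which mates share a name and end in `/1` and `/2`; that naming convention, and the helpers `is_pair`, `orphan`, and `iter_pairs`, are assumptions for illustration only.

```python
def is_pair(a, b):
    # Mates share a base name and carry the /1 and /2 suffixes, in that order.
    return (a[0].endswith("/1") and b[0].endswith("/2")
            and a[0][:-2] == b[0][:-2])


def orphan(rec):
    # Put an unpaired record in slot a or b according to its suffix.
    return (None, rec) if rec[0].endswith("/2") else (rec, None)


def iter_pairs(records):
    """Yield (a, b) tuples; a or b is None for orphaned reads."""
    pending = None
    for rec in records:
        if pending is None:
            pending = rec
        elif is_pair(pending, rec):
            yield pending, rec
            pending = None
        else:
            yield orphan(pending)
            pending = rec
    if pending is not None:
        yield orphan(pending)


records = [("r1/1", "AAAA"), ("r1/2", "CCCC"),
           ("r2/1", "GGGG"), ("r3/2", "TTTT")]  # made-up reads
pairs = list(iter_pairs(records))
```

With this input, `r1` comes out as a proper pair, while `r2/1` and `r3/2` are each delivered with `None` in the missing slot.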

Delivering k-mers rather than reads

In some cases (long sequences or chromosomes, for example) it may be more efficient to have a k-mer data pump rather than a read data pump. This is a thought for the future.
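As a sketch of what a k-mer pump might look like for a long sequence that arrives in pieces, the generator below carries the last k-1 bases from one chunk to the next so that k-mers spanning a chunk boundary are not lost. The function name `kmer_pump` and the chunked input are assumptions for this example.

```python
def kmer_pump(chunks, k):
    """Yield every k-mer of a sequence delivered as successive string chunks."""
    tail = ""
    for chunk in chunks:
        window = tail + chunk
        for i in range(len(window) - k + 1):
            yield window[i:i + k]
        # Keep the last k-1 bases so k-mers crossing the boundary are produced
        # exactly once, when the next chunk arrives.
        tail = window[-(k - 1):] if k > 1 else ""


# The sequence "ACGTACG" split into made-up chunks of uneven size.
kmers = list(kmer_pump(["ACG", "TAC", "G"], 3))
```

The consumer never sees the chunk boundaries: the k-mers come out in the same order as if the whole sequence had been held in memory.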

Hash function thoughts

Another thought for the future is that we may want to have multiple distinct hash functions; probably the best way to do this is to allow each Hashtable object to have its own hash function. This way we can take advantage of double-stranded hash functions (the current approach), single-stranded hash functions, and alternative implementations of either (e.g. cyclic hashes like the ones Jordan is using).
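The idea of a per-table hash function can be sketched by passing the function in at construction time. `ToyHashtable` and both key functions below are hypothetical; in particular, taking the lexicographic minimum of a k-mer and its reverse complement is just one common way to get a strand-independent key, and is not necessarily how khmer computes its double-stranded hash.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(kmer):
    return kmer.translate(COMPLEMENT)[::-1]


def single_stranded(kmer):
    # The k-mer and its reverse complement are treated as different keys.
    return kmer


def double_stranded(kmer):
    # Both strands collapse to one canonical key.
    return min(kmer, revcomp(kmer))


class ToyHashtable:
    """Hypothetical table whose key function is chosen per instance."""

    def __init__(self, key_fn):
        self.key_fn = key_fn
        self.counts = Counter()

    def count(self, kmer):
        self.counts[self.key_fn(kmer)] += 1

    def get(self, kmer):
        return self.counts[self.key_fn(kmer)]


ds = ToyHashtable(double_stranded)
ss = ToyHashtable(single_stranded)
for table in (ds, ss):
    table.count("AAAC")
    table.count("GTTT")  # reverse complement of AAAC
```

The double-stranded table reports a count of 2 for `AAAC`, while the single-stranded table keeps the two strands apart and reports 1 for each.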
