(Developer docs)
These are the patterns for which we want to provide both flexible and fast parsing options.
Used by load-into-counting and load-graph.
Here, threaded_fn is a function that is called by multiple threads, so, for example, 'hash.update' may be called simultaneously from several threads.
    def threaded_fn(list_of_reads):
        for read in list_of_reads:
            hash.update(read)
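The pattern above can be sketched as a small runnable example. This is only an illustration under stated assumptions: `ToyCountingHash`, `K`, and `count_in_threads` are hypothetical stand-ins invented here, not khmer's actual classes or API. The sketch shows one way to make simultaneous `update` calls safe, by counting a read's k-mers locally and merging them under a lock.

```python
import threading
from collections import Counter

K = 4  # hypothetical k-mer size for this toy example


class ToyCountingHash:
    """Toy stand-in for a shared counting table (not khmer's real class)."""

    def __init__(self, k):
        self.k = k
        self.counts = Counter()
        self._lock = threading.Lock()

    def update(self, read):
        # Count this read's k-mers locally, then merge under the lock so that
        # simultaneous calls from several threads cannot lose increments.
        local = Counter(read[i:i + self.k]
                        for i in range(len(read) - self.k + 1))
        with self._lock:
            self.counts.update(local)


def threaded_fn(table, list_of_reads):
    # Same shape as the pattern above; the reads themselves are discarded.
    for read in list_of_reads:
        table.update(read)


def count_in_threads(reads, n_threads=4):
    table = ToyCountingHash(K)
    # Give each thread every n-th read.
    chunks = [reads[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=threaded_fn, args=(table, chunk))
               for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return table


table = count_in_threads(["ACGTACGT", "ACGTTTTT"] * 100)
```

A single coarse lock is the simplest correct choice; a real implementation would likely need something finer-grained to scale on large SMP machines.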
Used by filter-abund.
Here, ro_hash is a read-only data structure; it is not being updated by anyone.
    for read in dataset:
        read = ro_hash.modify(read)
        if read:
            output(read)
Again, ro_hash is entirely read-only and not being updated or modified at all:
    def threaded_fn(list_of_reads):
        for read in list_of_reads:
            read = ro_hash.modify(read)
            if read:
                output(read)
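A minimal sketch of the read-only pattern, assuming a frozen table of k-mer counts and a trimming rule chosen purely for illustration. `RO_COUNTS`, `MIN_COUNT`, `modify`, and `filter_reads` are hypothetical names, not khmer's API. Because the table is never written to, the worker threads need no locking around lookups; only the output has to be collected safely, which here is done by having each worker return its own list.

```python
from concurrent.futures import ThreadPoolExecutor

K = 4          # hypothetical k-mer size
MIN_COUNT = 2  # hypothetical abundance threshold

# Frozen k-mer counts: built once, never modified afterwards (toy data).
RO_COUNTS = {"ACGT": 5, "CGTA": 4, "GTAC": 3, "TACG": 1}


def modify(read):
    """Trim the read just before the first k-mer with count < MIN_COUNT.

    Returns the (possibly trimmed) read, or None if fewer than K bases remain.
    """
    for i in range(len(read) - K + 1):
        if RO_COUNTS.get(read[i:i + K], 0) < MIN_COUNT:
            trimmed = read[:i + K - 1]
            return trimmed if len(trimmed) >= K else None
    return read if len(read) >= K else None


def threaded_fn(list_of_reads):
    # Lookups only: safe to run from many threads without a lock.
    kept = []
    for read in list_of_reads:
        read = modify(read)
        if read:
            kept.append(read)
    return kept


def filter_reads(chunks, n_threads=2):
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # pool.map returns results in the same order as the input chunks.
        results = pool.map(threaded_fn, chunks)
        return [read for kept in results for read in kept]


filtered = filter_reads([["ACGTAC", "ACGTACG"], ["TTTTTT"]])
```

Here the second read is trimmed at its low-abundance k-mer and the third is dropped entirely.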
Used by normalize-by-median. This pattern is the most general, so if we can implement it efficiently in all cases, then
Code:
    def threaded_fn(list_of_reads):
        for read in list_of_reads:
            read = hash.modify(read)
            if condition:
                output(read)
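As an illustration of this most general pattern, the following sketch keeps a read only while the median count of its k-mers is below a cutoff, and updates the table for every read it keeps. The rule is a simplified stand-in for what a median-based normalizer might do; `ToyNormalizer`, `K`, and `CUTOFF` are hypothetical and not taken from khmer. The point it shows is that the check and the update must happen atomically when several threads share the table, so both sit under one lock.

```python
import threading
from collections import Counter
from statistics import median

K = 4       # hypothetical k-mer size
CUTOFF = 2  # hypothetical median-count cutoff


class ToyNormalizer:
    """Toy shared table that is both read and updated (not khmer's class)."""

    def __init__(self):
        self.counts = Counter()
        self.kept = []
        self._lock = threading.Lock()

    def process(self, read):
        kmers = [read[i:i + K] for i in range(len(read) - K + 1)]
        if not kmers:
            return
        # Check and update under one lock: otherwise two threads could both
        # see a low median and both keep copies of the same read.
        with self._lock:
            if median(self.counts[kmer] for kmer in kmers) < CUTOFF:
                self.counts.update(kmers)
                self.kept.append(read)  # stands in for output(read)


def threaded_fn(normalizer, list_of_reads):
    for read in list_of_reads:
        normalizer.process(read)


def normalize(reads, n_threads=3):
    normalizer = ToyNormalizer()
    chunks = [reads[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=threaded_fn, args=(normalizer, chunk))
               for chunk in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return normalizer


normalizer = normalize(["ACGTACGT"] * 10)
```

With ten identical reads, only the first two processed are kept, regardless of which threads handle them; after that the median count reaches the cutoff.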
Regarding parallelization across multiple machines: the 'hash' objects above are large (often multiple GB), so our primary focus has been on big, multi-threaded SMP machines rather than on distributing the processing across multiple machines.
There are a number of situations where we want to get a pair of reads, when a pair is available; e.g. instead of:
    for read in list_of_reads:
we want
    for a, b in list_of_reads:
Note that in some cases either a or b may be None, i.e. we may be dealing with orphaned reads.
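A minimal sketch of such a pair-yielding iterator. It assumes, purely for illustration, that reads arrive as `(name, sequence)` tuples, that mates are adjacent in the input, and that they are named with `/1` and `/2` suffixes; none of these conventions comes from the text above, and `iter_pairs` is a hypothetical helper.

```python
def _orphan(record):
    # Put an unpaired read in the slot its name suggests; default to slot a.
    if record[0].endswith("/2"):
        return (None, record)
    return (record, None)


def iter_pairs(records):
    """Yield (a, b) tuples from an iterable of (name, sequence) records.

    Assumes interleaved input where mates are adjacent and named
    'x/1' and 'x/2'. Unpaired reads are yielded with None as the partner.
    """
    pending = None  # a '/1' read waiting for its mate
    for record in records:
        name = record[0]
        if pending is not None:
            if name.endswith("/2") and pending[0][:-2] == name[:-2]:
                yield (pending, record)
                pending = None
                continue
            # The waiting read has no mate: emit it as an orphan.
            yield _orphan(pending)
            pending = None
        if name.endswith("/1"):
            pending = record
        else:
            yield _orphan(record)
    if pending is not None:
        yield _orphan(pending)


records = [
    ("r1/1", "AAAA"), ("r1/2", "CCCC"),  # proper pair
    ("r2/1", "GGGG"),                    # orphan, mate missing
    ("r3/2", "TTTT"),                    # orphan, mate missing
    ("r4", "ACGT"),                      # unpaired read
]
pairs = list(iter_pairs(records))
```

Callers then write `for a, b in iter_pairs(records):` and check each side for `None` before using it.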
In some cases (long sequences or chromosomes, for example) it may be more efficient to have a k-mer data pump rather than a read data pump. This is a thought for the future.
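One way a k-mer data pump could look, sketched as a generator that accepts a long sequence in arbitrary-sized chunks and yields its k-mers one at a time, so the whole chromosome never has to be held as a single read. `kmer_pump` is a hypothetical name and this is only one possible design, not something khmer provides.

```python
def kmer_pump(sequence_chunks, k):
    """Yield every k-mer of a sequence delivered as an iterable of chunks."""
    tail = ""  # last k-1 bases seen so far
    for chunk in sequence_chunks:
        window = tail + chunk
        for i in range(len(window) - k + 1):
            yield window[i:i + k]
        # Keep the last k-1 bases so k-mers that span a chunk boundary are
        # produced exactly once, when the next chunk arrives.
        tail = window[-(k - 1):] if k > 1 else ""


kmers = list(kmer_pump(["ACG", "TAC", "GT"], 4))
```

The output is the same as sliding a window over the concatenated sequence `ACGTACGT`, independent of where the chunk boundaries fall.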
Another thought for the future is that we may want to have multiple distinct hash functions; probably the best way to do this is to allow each Hashtable object to have its own hash function. This way we can take advantage of double-stranded hash functions (the current approach), single-stranded hash functions, and alternative implementations of either (e.g. cyclic hashes like Jordan is using).
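A sketch of the per-table hash function idea, under the assumption that the function is simply passed in at construction time. `ToyHashtable`, `single_stranded`, and `double_stranded` are hypothetical names; for readability the "hash functions" here return a canonical string key rather than an integer, which is enough to show the difference between the two strand conventions.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    return kmer.translate(_COMPLEMENT)[::-1]


def single_stranded(kmer):
    # A k-mer and its reverse complement are treated as different keys.
    return kmer


def double_stranded(kmer):
    # A k-mer and its reverse complement map to the same key.
    return min(kmer, reverse_complement(kmer))


class ToyHashtable:
    """Toy table that carries its own hash function (not khmer's class)."""

    def __init__(self, k, hash_fn=double_stranded):
        self.k = k
        self.hash_fn = hash_fn
        self.counts = Counter()

    def consume(self, read):
        for i in range(len(read) - self.k + 1):
            self.counts[self.hash_fn(read[i:i + self.k])] += 1

    def get(self, kmer):
        return self.counts[self.hash_fn(kmer)]


ds = ToyHashtable(3)                   # double-stranded by default
ss = ToyHashtable(3, single_stranded)  # strand-specific table
for table in (ds, ss):
    table.consume("AAAC")
```

After consuming `AAAC`, the double-stranded table reports a count for `GTT` (the reverse complement of `AAC`), while the single-stranded table does not.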