Sequence access patterns in khmer
=================================

(Developer docs)

These are the patterns for which we want to provide both flexible and
fast parsing options.

Pattern 1: read, no write, modify data structures
------------------------------------------------------

Used by ``load-into-counting`` and ``load-graph``.

Single-threaded
~~~~~~~~~~~~~~~

Pseudocode::

    for read in dataset:
        hash.update(read)

Multi-threaded
~~~~~~~~~~~~~~

Here, ``threaded_fn`` is a function that is called by multiple threads,
so e.g. ``hash.update`` may be called concurrently::

    def threaded_fn(list_of_reads):
        for read in list_of_reads:
            hash.update(read)

Pattern 2: read and write, with static data structures
------------------------------------------------------

Used by ``filter-abund``.

Single-threaded
~~~~~~~~~~~~~~~

Here, ``ro_hash`` is a read-only data structure; it is not being updated
by anyone::

    for read in dataset:
        read = ro_hash.modify(read)
        if read:
            output(read)

Multi-threaded
~~~~~~~~~~~~~~

Again, ``ro_hash`` is entirely read-only and is not being updated or
modified at all::

    def threaded_fn(list_of_reads):
        for read in list_of_reads:
            read = ro_hash.modify(read)
            if read:
                output(read)

Pattern 3: read, write, and modify data structures
--------------------------------------------------

Used by ``normalize-by-median``.

This pattern is the most general, so if we can implement it efficiently
in all cases, then Patterns 1 and 2 are covered as special cases.

Single-threaded
~~~~~~~~~~~~~~~

Code::

    for read in dataset:
        read = hash.modify(read)
        if condition:
            output(read)

Multi-threaded
~~~~~~~~~~~~~~

Code::

    def threaded_fn(list_of_reads):
        for read in list_of_reads:
            read = hash.modify(read)
            if condition:
                output(read)

Notes on parallelization
------------------------

Regarding multi-chassis parallelization: the ``hash`` objects above are
large (often multiple GB), so our primary focus has been on big,
multi-threaded SMP machines rather than on distributing the processing
across multiple machines.

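
As a rough illustration of the shared-memory approach, the multi-threaded
form of Pattern 1 might be sketched as follows. ``ToyHash``,
``load_in_parallel``, the single lock around updates, and the k-mer size
are all assumptions made for this example; none of them are khmer APIs,
and khmer's actual concurrency strategy may differ.

```python
import threading
from collections import Counter

K = 4  # toy k-mer size, chosen only for this illustration


class ToyHash:
    """Hypothetical stand-in for the shared 'hash' object (not khmer's API)."""

    def __init__(self):
        self._counts = Counter()
        self._lock = threading.Lock()

    def update(self, read):
        # Split the read into overlapping k-mers outside the lock,
        # then apply all the counts in one short critical section.
        kmers = [read[i:i + K] for i in range(len(read) - K + 1)]
        with self._lock:
            self._counts.update(kmers)

    def get(self, kmer):
        with self._lock:
            return self._counts[kmer]


def threaded_fn(hash_, list_of_reads):
    # Same shape as the pseudocode: several threads run this at once,
    # so hash_.update may be called concurrently.
    for read in list_of_reads:
        hash_.update(read)


def load_in_parallel(reads, n_threads=4):
    # One shared structure in memory; each thread gets a slice of the reads.
    hash_ = ToyHash()
    chunks = [reads[i::n_threads] for i in range(n_threads)]
    threads = [
        threading.Thread(target=threaded_fn, args=(hash_, chunk))
        for chunk in chunks
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return hash_
```

For example, ``load_in_parallel(["ACGTACGT", "ACGTTTTT"])`` returns one
shared table holding the k-mer counts from both reads. A single lock is
the simplest way to make concurrent updates safe; it is used here for
clarity, not speed.
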

Paired-end reads
----------------

There are a number of situations where we want to get a pair of reads,
if such are available; e.g. instead of::

    for read in list_of_reads

we want::

    for a, b in list_of_reads

Note that in some cases either ``a`` or ``b`` may be ``None``, i.e. we
may be dealing with orphaned reads.

Delivering k-mers rather than reads
-----------------------------------

In some cases (long sequences or chromosomes, for example) it may be
more efficient to have a k-mer data pump rather than a read data pump.
Thoughts for the future.

Hash function thoughts
----------------------

Another thought for the future is that we may want to have multiple
distinct hash functions; probably the best way to do this is to allow
each Hashtable object to have its own hash function. This way we can
take advantage of double-stranded hash functions (the current approach),
single-stranded hash functions, and alternative implementations of
either (e.g. cyclic hashes like the ones Jordan is using).
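
To make the per-table hash function idea concrete, here is a minimal
sketch in which the hash function is passed to the table's constructor.
Everything in it is hypothetical: ``ToyHashtable``, the 2-bit encoding,
and the "take the smaller of the forward and reverse-complement values"
rule are illustrative assumptions, not a description of how khmer's
Hashtable or its current double-stranded hash works.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def single_stranded_hash(kmer):
    # Toy 2-bit-per-base encoding; a k-mer and its reverse
    # complement generally get different values.
    value = 0
    for base in kmer:
        value = (value << 2) | _CODE[base]
    return value


def double_stranded_hash(kmer):
    # Toy strand-independent hash: a k-mer and its reverse
    # complement map to the same value (the smaller encoding).
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(single_stranded_hash(kmer), single_stranded_hash(rc))


class ToyHashtable:
    """Hypothetical table that owns its hash function (not khmer's API)."""

    def __init__(self, ksize, hash_fn=double_stranded_hash):
        self.ksize = ksize
        self.hash_fn = hash_fn
        self._counts = Counter()

    def count(self, sequence):
        # Count every overlapping k-mer using this table's hash function.
        for i in range(len(sequence) - self.ksize + 1):
            self._counts[self.hash_fn(sequence[i:i + self.ksize])] += 1

    def get(self, kmer):
        return self._counts[self.hash_fn(kmer)]
```

With this arrangement, two tables can coexist with different strand
behavior: a table built with ``hash_fn=single_stranded_hash`` counts
``AAA`` and ``TTT`` separately, while the default table treats them as
the same k-mer.
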