Architecture and Design ======================= What follows is an attempt to describe the overall architecture and design of the *khmer* software under the hood. Where appropriate, implementation details will be mentioned. Also, possible future directions and design considerations will be mentioned as appropriate. Overview -------- Data pumps stage data from disk storage into an in-memory cache. The in-memory cache is divided into segments, one segment per thread. A cache manager exposes an interface for staging the data via the data pumps and for accessing the data in the cache segments. Read parsers convert the staged data into read objects. A separate state object is maintained for each thread using a parser. Existing-tracking or counting Bloom filters can use the read parsers as a source of reads from which to extract k-mers. The read parsers and the layers under them can be controlled via global configuration objects, which provide default values during their instantiation. In many cases, these default values can also be overridden by supplying pertinent arguments to the constructors. Only one global configuration object is considered active at a given time; but, a singleton pattern is not enforced and more than one may be available to supply alternative configurations. The top-level makefile for the project contains a user-configurable section, wherein preprocessor, compiler, and linker options may be selected via convenient, prefabricated bundles. The ability to generate profiling instrumentation, compile with debugging symbols, and to generate tracing instrumentation are all controlled via these option bundles. The lower levels of the code, such as the data pumps, cache manager, and read parsers all have significant built-in profiling and tracing instrumentation. This instrumentation is conditionally-compiled according to the option bundles selected in the top-level makefile. Namespace --------- Unless otherwise noted, all C++ classes, functions, and static variables noted in this document are members of the ``khmer`` namespace. Likewise, unless otherwise noted, all Python classes, functions, and module variables noted in this document are members of the ``khmer`` module. .. cpp:namespace:: khmer .. py:module:: khmer .. todo:: Use ``breathe`` to interface with ``Doxygen`` for better documentation. Configuration Objects --------------------- C++ API ~~~~~~~ The declaration of the configuration objects is contained in :file:`lib/khmer_config.hh`. .. cpp:class:: Config .. cpp:function:: Config &get_active_config( ) .. cpp:function:: void set_active_config( Config &c ) An *active* configuration object is always present. A reference to this object is supplied via the :cpp:func:`get_active_config` function. The initial settings of the active configuration object are quite conservative. New configuration objects are created with the empty constructor; all settings modifications occur upon already-created instances via their setter methods. The active configuration object can be set via the :cpp:func:`set_active_config` function, which takes a reference to a :cpp:class:`Config` object as its only argument. Except for read-only configuration options, such as extra sanity checking, which are determined at the time of compilation, the configuration options are manipulated via getter/setter methods. The most prominent or useful getter/setter methods are the following: .. cpp:function:: uint32_t Config::get_number_of_threads( ) const .. cpp:function:: void Config::set_number_of_threads( uint32_t const n ) .. cpp:function:: uint64_t const Config::get_reads_input_buffer_size( ) const .. cpp:function:: void Config::set_reads_input_buffer_size( uint64_t const sz ) Python API ~~~~~~~~~~ The :cpp:class:`Config` objects are exposed in the Python wrapper. .. py:function:: get_config( ) .. py:function:: set_config( c ) The C++ getter/setter methods are exposed via the same names in Python. .. todo:: The getter/setter methods should be exposed as properties in Python. Trace Loggers ------------- Trace loggers can log execution traces and other useful debugging information to files on a per-thread basis. This makes them very useful for debugging multi-threaded code, especially in the absence of advanced commercial debuggers, such as DDT or TotalView. Trace loggers are controlled via several user-configurable variables in the top-level makefile. As of this writing, these variables are ``WITH_INTERNAL_TRACING``, ``TRACE_STATE_CHANGES``, ``TRACE_BUSYWAITS``, ``TRACE_SPINLOCKS``, ``TRACE_MEMCOPIES``, and ``TRACE_DATA``. The ``TRACE_`` options are ineffective unless ``WITH_INTERNAL_TRACING`` is set to ``true``. .. todo:: Replace the editing of makefiles with a configure script or else move to an all-Pythonic regime where the user would edit :file:`setup.cfg`. See `issue #9 `_ in the Github issue tracker for the project. The data pump and read parser code, as well as some of the Bloom filter code, is impregnated with trace loggers. Other parts of the source code could use them as well. Trace logger objects are not exposed directly via the Python wrapper; they are only available in the C++ API. The trace logger class is declared in the :file:`lib/trace_logger.hh` file. .. cpp:class:: TraceLogger Tracing be performed at coarser or finer levels of detail, as desired. An enumeration of named integral constants provides the available levels. The use of ``TLVL_ALL`` will trace everything which is instrumented for tracing. After that, ``TLVL_DEBUG9`` is the next finest level of detail. The enumeration ascends to higher and higher numerical values which indicate more coarseness; specifically the ordering is trace levels ``TLVL_DEBUG8`` through ``TLVL_DEBUG0``, followed by ``TLVL_INFO9`` through ``TLVL_INFO0``, and then ``TLVL_WARNING``, ``TLVL_ERROR``, and ``TLVL_CRITICAL``. The special level ``TLVL_NONE`` means that nothing will be traced even though tracing may be activated at compile time. Note that ``TLVL_ALL`` corresponds to 0 and ``TLVL_NONE`` corresponds to 255; this is useful for setting trace levels in method arguments via the Python interface. .. todo:: Expose trace level names via the Python interface. .. todo:: Allow C++ :cpp:class:`TraceLogger` objects to be targets of Python :py:mod:`logging` module? Two constructors are available for instantiating objects of type :cpp:class:`TraceLogger`. One takes the trace level and a ``FILE *`` stream handle. The other takes the trace level, a file name format string, and a variable number of arguments to ``sprintf`` into that format string. This form exists so that trace files, named according to logical thread ID, can be created, for example. The trace level argument is the finest requested level of detail which will be traced by the object. The objects instantiated by these constructors are *function objects* (also sometimes known as *functors* to the chagrin of some mathematicians). This is to say that the objects may be called. .. cpp:function:: void TraceLogger::operator( )( uint8_t const level, char const * const format, ...) const The ``level`` argument is the desired level of detail. If the object was instantiated for a coarser level than the requested level, then nothing will be logged. The ``format`` argument is the format string for the underlying ``fprintf`` call and a variable number of arguments may be supplied for use with this format string. Performance Metrics ------------------- Performance metrics can be gathered on a per-thread basis and can measure things which are not covered by traditional profiling tools. Such metrics may include the input or output rate in bytes per second, for example. Not all platforms support the high resolution, per-thread timers needed to effectively use these metrics. The Python wrapper does not presently support reporting on performance metrics. .. todo:: Support reporting on performance metrics from within the Python wrapper. The performance metrics abstract base class is declared in the :file:`lib/perf_metrics.hh` file. This class is subclassed for various specific domains. .. cpp:class:: IPerformanceMetrics The class provides a hassle-free stopwatch. .. cpp:function:: void IPerformanceMetrics::start_timers( ) .. cpp:function:: void IPerformanceMetrics::stop_timers( ) These functions record the amount of physical time elapsed since the thread was created and the amount of time that the thread has spent using CPU cores. Two sets of internal scratch variables are used for this purpose: one set of start times and one set of stop times. .. warning:: Because of the use of internal scratch variables, these methods are not reentrant. Timer deltas must be collected before new calls can be issued to the stopwatch. This is the trade-off for convenience.... Once start and stop times have been accumulated, then a timer delta can be calculated and stowed in the appropriate category. Categories are determined by keys which are defined in subclasses of the abstract class. The delta accumulator takes a category key as an argument and is declared pure virtual in the abstract base class so that it must be implemented in subclasses where they category keys are enumerated. .. cpp:function:: void IPerformanceMetrics::accumulate_timer_deltas( uint32_t metrics_key ) Input Data Pumps ---------------- An input data pump object copies data from a file into a cache in memory. Since accesses to memory are typically three orders of magnitude faster than to an individual hard disk drive and since many operations to process the data are slower than reading it from a file, it makes sense to stage some of it into memory. Having the data in memory can reduce the latency from accessing it upon demand. And, the cache in memory can be filled more rapidly than it is processed. The input data pumps are declared in :file:`lib/read_parsers.hh`. All of them derive from an abstract base class. .. todo:: Refactor the data pumps into a separate header and implementation file. .. cpp:class:: IStreamReader Presently, three types of data pumps are implemented. .. cpp:class:: RawStreamReader .. cpp:class:: GzStreamReader .. cpp:class:: Bz2StreamReader These input data pumps are not exposed via the Python wrapper. The :cpp:class:`IStreamReader` interface defines one method of interest. .. cpp:function:: uint64_t const read_into_cache( uint8_t * const cache, uint64_t const cache_size ) This is a pure virtual method which must be overridden in subclasses. The ``cache`` parameter receives an argument which is an arbitrary piece of memory which it treats as an array of bytes. The ``cache_size`` parameter receives an argument which is the size, in bytes, of the cache. The return value is the number of bytes read into the cache from file. Raw Stream Reader ~~~~~~~~~~~~~~~~~ The raw stream reader is constructed from a file descriptor, such as returned by the ``open`` system call. An optional read alignment may be supplied to this constructor. Depending on the operating system and file system, this may be used as a chunk size and alignment for *direct I/O*. Otherwise, it is ignored. Direct I/O allows for blocks to copied directly from a block device into user-space memory rather than passing through a kernel-space block cache first. This reduces the number of memory copies involved in processing the data. .. warning:: Direct I/O has received some testing within the software, but has not been tested enough to be considered production-ready. .. note:: In principle, the file descriptor could number 0 (stdin) and one could create pipelines, but this is not supported at the higher level interfaces. Reading is currently performed in a synchronous manner, which is fine for most typical use cases of the software since input is not the bottleneck. .. todo:: Support asynchronous reading. Gzip Stream Reader and Decompressor ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The stream reader and decompressor for the ``gzip`` format is based on ``zlib``. No direct I/O is supported by this stream reader and its constructor therefore only accepts a file descriptor. Furthermore, data must be copied and decompressed sequentially and cannot be read asynchronously. In the regime that higher level processing is fast, this stream reader is likely to be a bottleneck, especially as there is overhead from decompression. However, as pipelining support does not yet exist in the software, providing native support for a popular compression format makes sense. Also, some users of the software may not be familiar with standard Unix compression tools, such as ``gzip``; built-in support of popular compression formats removes a barrier to entry for these users. .. todo:: Implement higher level support for pipelining so that parallelized decompressors can feed a raw stream reader, assuming that they can output decompressed data to stdout and do so in order. Alternatively, if a parallelized variant of ``zlib`` can be found, then that should be used in place of ``zlib`` for native support. Bzip2 Stream Reader and Decompressor ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The stream reader and decompressor for the ``bzip2`` format is based on the ``bzip2`` library. The same notes and considerations for the ``gzip`` stream reader also apply to this one as well. As a historical note, it is worth mentioning that the logic for reading from a ``bzip2``-compressed file stream is significantly more complicated than for raw or ``gzip``-compressed streams because of the way the library API is structured and the nature of the compression format. Prior to the architecture being described, data pumps and reads parsers were tightly coupled and implementing a ``bzip2`` data pump in that architecture would have been very painful. As it turns out, the current architecture preemptively fixed a bug in the old ``gzip`` data pump before it was reported against that architecture. So, this decoupled design has already paid for itself several times over. Thread Identity Maps -------------------- Higher level processing requires that threads be able to persistently work with the same set of data. A thread does not inherently "know" what its index into a particular lookup table is. However, this index can be mapped to an OS-native identifier for a thread. Using an appropriate system call, a thread can query its own native identifier from the operating system and then use this as a map key to find its logical identifier within the software. This logical identifier serves as the thread's index into any lookup tables which it may need to use. The self-identification is also important on the grounds of a software engineering principle: don't break existing interfaces. Prior to the current architecture, the code was not thread-safe. In order to add thread-safety in a reliable manner and not break existing interfaces, self-identification of threads was necessary. The thread identity map class is declared in the :file:`lib/thread_id_map.hh` file. .. cpp:class:: ThreadIDMap This class is not exposed via the Python wrapper as it is an internal mechanism. And, the implementation of the class varies according to operating system. The only important method for those who wish to avail themselves to this bookkeeping method is the one which returns the logical identifier (lookup table index) of the current thread. .. cpp:function:: uint32_t const ThreadIDMap::get_thread_id( ) New entries are added to the map as new threads call this method for the first time. Thus, the bookkeeping is automatic and does not get in the way of the developer. Cache Managers -------------- A cache manager provides memory into which an input data pump may copy. The provided memory is segmented on a per-thread basis. On machines with multiple NUMA nodes, this can help performance by decreasing the likelihood of cross-node fetches and stores. More importantly, it provides an association between a particular thread and a particular cache segment, so that higher level processing, such as parsing, can always be guaranteed to operate on the same contiguous portion of memory. .. todo:: Implement pinning of threads to specific cores on operating systems which support this. Preventing the migration of threads between cores should mostly eliminate cross-node fetches and stores. The :file:`lib/read_parsers.hh` file declares the cache manager and cache manager segment classes. .. cpp:class:: CacheManager .. cpp:class:: CacheManagerSegment As multiple threads share access to the same data pump, the cache manager orchestrates access to this resource. Internally, a spinlock is used to limit access to one thread at a time. .. todo:: Increase period of spinlock trials from once per iteration to something greater, similar to what the other busywaiters which perform atomic tests use. Internally, a :cpp:class:`ThreadIDMap` is used to match a current thread with its corresponding entry in the table of cache segments. A convenience method is provided for the current thread to find its corresponding cache segment, creating it if it doesn't already exist. .. cpp:function:: CacheSegment &CacheManger::_get_segment( bool const higher = false ) This is a private method used only within cache mangers. The ``higher`` parameter is vestigial remnant from an earlier implementation and can likely be removed. .. todo:: Remove the ``higher`` parameter from :cpp:func:`_get_segment`. Developers wishing to use a cache manager rather than muck around in its implementation will probably find the following methods most useful. .. cpp:function:: bool const CacheManager::has_more_data( ) .. cpp:function:: uint64_t const CacheManager::get_bytes( uint8_t * const buffer, uint64_t buffer_len ) .. cpp:function:: void CacheManager::split_at( uint64_t const pos ) The :cpp:func:`has_more_data` method queries both the underlying stream and the current cache segment to see if more data is available. If both the underlying stream is exhausted and the memory cursor, which tracks how much of a cache segment has been accessed since its last refill, is at the end of the segment, then no more data is considered to be available and the current thread hits a synchronization barrier to wait for the other threads to finish. The :cpp:func:`get_bytes` method copies up to ``buffer_len`` bytes of memory from the current cache segment into the supplied buffer ``buffer``. All bookkeeping, such as replenishing the cache segment from the underlying stream, is handled behind the scenes. The method also copies memory from the appropriate *copyaside buffer* as necessary. Copyaside buffers are created by the :cpp:func:`split_at` method and represent extensions to the current cache segment. .. todo:: Expose the underlying memory segments directly to higher level processing, such as parsing, to eliminate the memory copy overhead that :cpp:func:`get_bytes` carries. Note that this comes at the cost of some horrid bookkeeping on the part of the higher level functions. The :cpp:func:`get_bytes` method exists to handle this bookkeeping. The :cpp:func:`split_at` method copies up to ``pos`` bytes from the beginning of the current cache segment into a copyaside buffer. The copyaside buffer will then be available for the previous (in terms of lookup table index modulo the number of threads) cache segment. This method helps with multi-threaded parsing of files when parser offsets into a file do not correspond with record boundaries. A parser can scan forward to the next record boundary and then set the scanned over bytes aside to be appended to the cache segment which contains the beginning of the partial record. The initial implementation of the cache manager used *setaside buffers*, which were just reserved portions of cache segments and no memory copies were performed. However, the bookkeeping was quite complicated and after several bugs slipped through the cracks, the setaside buffer logic was converted to copyaside buffers. The cost of the memory copies is essentially nothing in the typical use cases encountered by the software. Copyaside buffers are also much more amenable to asynchronous refilling of cache segments, should that be supported at a later point. .. todo:: Implement asynchronous refills of cache segments. Reads and Read Pairs -------------------- Reads are simple data structures which contain genomic sequences, as well identifiers and quality scores for those sequences. The class is declared in :file:`lib/read_parsers.hh`. .. cpp:class:: Read The Python wrapper exposes an interface to reads. .. py:class:: Read The data members are accessed as properties. These mimic the access keys for ``screed`` records. .. py:attribute:: Read.name .. py:attribute:: Read.sequence .. py:attribute:: Read.accuracy .. py:attribute:: Read.annotations No distinction is currently made between FASTA and FASTQ reads. .. todo:: Create an IRead abstract base class and subclass for FASTA and FASTQ record types. This would remove wasted fields for FASTA records and allow the type of records being used at any level of processing. Read pairs are two reads bound together in a STL ``pair``. This is intended to track sequences with paired ends. Read Parsers ------------ C++ API ~~~~~~~ Read parsers create the aforementioned :cpp:class:`Read` objects. The :file:`lib/read_parsers.hh` file declares an abstract base class as well as FASTA and FASTQ parsers derived from that. These are made available from within a namespace which encapsulates most classes in the :file:`lib/read_parsers.hh` file. .. cpp:namespace:: khmer::read_parsers .. cpp:class:: IParser .. cpp:class:: FastaParser .. cpp:class:: FastqParser An instance of the appropriate subclass is created via a factory method provided by the abstract class. This method infers the correct subclass instance to create based on file name extension. The file name is required but the other arguments are optional. If the other arguments are supplied, then they override the defaults from the active :cpp:class:`Config` object. .. cpp:function:: IParser * const IParser::get_parser( std::string const &ifile_name, uint32_t const number_of_threads, uint64_t const cache_size, uint8_t const trace_level ) .. todo:: Sniff file type rather than rely on extension. Just as the :cpp:class:`CacheManager` maintains per-thread state in :cpp:class:`CacheSegment` objects, the parser classes maintain per-thread state in special objects as well. .. cpp:class:: ParserState The parser state maintains a line buffer, among other things, and tracks how much of it has been parsed by each call to the parser. The :cpp:class:`IParser` interface provides some useful methods. .. cpp:function:: bool IParser::is_complete( ) .. cpp:function:: void IParser::imprint_next_read( Read &the_read ) .. cpp:function:: void IParser::imprint_next_read_pair( ReadPair &the_read_pair, uint8_t mode = PAIR_MODE_ERROR_ON_UNPAIRED ) The :cpp:func:`is_complete` method checks if parsing of the current stream is complete and blocks in a synchronization barrier if it is but some threads are still working. The :cpp:func:`imprint_next_read` method attempts to parse another read from the file stream and create a :cpp:class:`Read` object from it. Note that a legacy method ``get_next_read`` is still available but its use in new code is discouraged. The legacy method involves an additional memory copy. The :cpp:func:`imprint_next_read_pair` method attempts to parse a pair of reads from the file stream a create a `ReadPair` object from them. Currently, this has two implemented modes of operation with a third one planned. The modes are ``PAIR_MODE_ALLOW_UNPAIRED``, ``PAIR_MODE_IGNORE_UNPAIRED``, and ``PAIR_MODE_ERROR_ON_UNPAIRED``. The first one is not yet implemented; it may be useful for filtering or diverting paired or unpaired reads out of a stream. The ``PAIR_MODE_IGNORE_UNPAIRED`` mode simply ignores unpaired reads and only returns paired reads. The ``PAIR_MODE_ERROR_ON_UNPAIRED`` mode raises an exception if an unpaired read is encountered. As a note, both the old-style ("/1" and "/2") and new-style ("1..." and "2:...") Illumina read pairs are detected from sequence identifiers. .. todo:: Implement ``PAIR_MODE_ALLOW_UNPAIRED`` mode. .. todo:: Place burden of input parsing and output formatting on :cpp:class:`Read` obects rather than on parser methods. Demote parsers to role of facilitator. Maybe? Python Wrapper ~~~~~~~~~~~~~~ The Python wrapper exposes a read parser class. .. py:class:: ReadParser This class has no subclasses, but handles various formats appropriately. An instance of the class is an iterator, which produces one read at a time. There is also a method for iterating over read pairs and the class exposes the same constants for controlling its behavior as the underlying C++ class does. .. py:method:: ReadParser.iter_read_pairs( pair_mode ) k-mer Counters and Bloom Filters -------------------------------- C++ API ~~~~~~~ The Bloom filter counting is described elsewhere and so we won't go into details of it here. Some of the methods of the hash tables has been granted thread safety and can use the thread-safe :cpp:class:`IParser` objects. .. cpp:class:: Hashtable .. cpp:class:: Hashbits .. cpp:function:: void Hashtable::consume_fasta( IParser * parser, unsigned int &total_reads, unsigned long long &n_consumed, HashIntoType lower_bound, HashIntoType upper_bound, CallbackFn callback, void * callback_data ) .. cpp:function:: void Hashbits::consume_fasta_and_tag( IParser * parser, unsigned int &total_reads, unsigned long long &n_consumed, CallbackFn callback, void * callback_data ) For legacy support, methods with signatures that have a file name parameter rather than a :cpp:class:`IParser` parameter are still provided as well. (They wrap the ones with the parser parameter.) As with the cache managers and read parsers, the hashtables track per-thread state. .. cpp:class:: Hasher Since more than one pool of threads (e.g., one set of threads per reads parser and one reads parser per file stream) may be used with a particular hash table object, the hash table objects internally maintain the notion of thread pools. The universally unique identifier (UUID) of an object (e.g., a reads parser) is used to map to the correct thread pool. This is behind-the-scenes accounting and a developer should generally not have to worry about this. But, if you are converting another method to be thread-safe and it can take different reads parsers on different invocations, then be sure to consider this. .. todo:: Drop more logic currently implemented in Python to C++ to gain multi-threading efficiencies. Not everything can really scale well using the existing interfaces working in Python. .. todo:: Cache k-mers to hash in small buckets which correspond to regions of the hash tables. This will allow for multiple updates per memory page and reduce the number of CPU cache misses. .. todo:: Abstact the counter storage from the hash functions. A number of open issues can be addressed by doing this. The counter storage might be better implemented with partial template specialization than with subclassing. For small hash tables sizes, not hashing makes more sense because every possible k-mer in the k-mer space can be addressed directly in memory. Counter storage will be most efficient for powers-of-two numbers of bits per counter. Blah, blah... these and other thoughts are discussed more thoroughly in the various GitHub issues involving them. Python Wrapper ~~~~~~~~~~~~~~ The hash table objects have methods which take :py:class:`ReadParser` objects and invoke the appropriate C++ methods underneath the hood. .. py:method:: new_hashtable.consume_fasta_with_reads_parser( rparser ) .. py:method:: new_counting_hash.consume_fasta_and_tag_with_reads_parser( rparser ) .. todo:: Convert factory functions into callable classes and properly attribute those classes. Python Wrapper -------------- The Python wrapper resides in :file:`python/_khmermodule.cc`. C++ code is used to call the CPython API to bind some of the C++ classes and methods to Python classes and methods. Some of the newer additions to the wrapper, such as the :py:class:`Read` and :py:class:`ReadParser` classes should be considered models for future additions as they expose callable classes with properties and iterators and which look just like Python classes for the most part. Much of the older code relies on factory functions to create objects and those objects are not very Pythonic. The newer additions are also much less cluttered and more readable (though the author of this sentence may be biased in this regard). .. todo:: Use SWIG to generate the interface. Maybe? .. vim: set ft=rst ts=3 sts=3 sw=3 et tw=79: