What's New In khmer 2.0?¶
Streaming I/O from Unix Pipes¶
All scripts now accept input from named pipes (like
/dev/stdin, or those created using
<( list ) process substitution) and unnamed pipes (like output piped in
from another program with
|). The STDIN stream can also be specified using
a single dash (-).
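The single-dash convention can be illustrated with a minimal Python sketch. This is not khmer's own input handling; the helper names open_input and count_fasta_records are hypothetical and exist only for illustration.

```python
import io
import sys


def open_input(filename, stdin=None):
    """Return a readable text stream for `filename`.

    Hypothetical helper: a single dash selects standard input; any other
    value (including /dev/stdin or a process-substitution path) is opened
    like an ordinary file.
    """
    if filename == "-":
        return stdin if stdin is not None else sys.stdin
    return open(filename, "r")


def count_fasta_records(stream):
    """Count FASTA header lines (those starting with '>') in a stream."""
    return sum(1 for line in stream if line.startswith(">"))


# An in-memory stream stands in for a real pipe here.
fake_stdin = io.StringIO(">r1\nACGT\n>r2\nGGCC\n")
n_records = count_fasta_records(open_input("-", stdin=fake_stdin))
```

Because named pipes and process substitutions appear as ordinary paths, only the dash needs special treatment in a sketch like this.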
New parameter for memory usage, and/or tablesize/number of table parameters.¶
Digital normalization script now supports mixed paired and unpaired read input¶
normalize-by-median.py now supports mixed paired and unpaired (or
"broken-paired") input. Behavior can be forced either to treat all
reads as singletons or, with
--paired, to require that all reads be properly paired. If
--paired is set,
--unpaired-reads can be
used to include a file of unpaired reads. The unpaired reads will be examined
after all of the other sequence files.
normalize-by-median.py --quiet can be used to reduce the amount of output.
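To make "broken-paired" input concrete, the following Python sketch walks a stream of read names and groups adjacent mates into pairs, passing everything else through as singletons. It is an illustration only, not khmer's parser: it assumes the older /1 and /2 name suffixes, and the function names are hypothetical.

```python
def split_pair_suffix(name):
    """Return (base, mate), where mate is 1, 2, or None for an unpaired name.

    Assumes only the '/1' and '/2' suffix convention.
    """
    if name.endswith("/1"):
        return name[:-2], 1
    if name.endswith("/2"):
        return name[:-2], 2
    return name, None


def broken_paired_reader(names):
    """Yield (first, second) tuples; second is None for a singleton.

    Two consecutive reads form a pair when they share a base name and
    carry the /1 and /2 suffixes in that order.
    """
    previous = None
    for name in names:
        if previous is not None:
            prev_base, prev_mate = split_pair_suffix(previous)
            base, mate = split_pair_suffix(name)
            if prev_mate == 1 and mate == 2 and prev_base == base:
                yield previous, name
                previous = None
                continue
            # The held read has no mate next to it: emit it alone.
            yield previous, None
        previous = name
    if previous is not None:
        yield previous, None


grouped = list(broken_paired_reader(["a/1", "a/2", "b/1", "c", "d/1", "d/2"]))
```

In this sketch, requiring proper pairs would amount to rejecting any tuple whose second element is None, while treating all reads as singletons would skip the pairing test entirely.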
Mixed-pair sequence file format support¶
Scripts now output columnar data in CSV format by default¶
All scripts that output any kind of columnar data now do so in CSV format,
with headers. Previously this had to be enabled with --csv.
(Affects abundance-dist-single.py, abundance-dist.py,
count-median.py, and count-overlap.py.)
normalize-by-median.py --report also now outputs in CSV format.
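As a rough sketch of what columnar output with headers looks like to downstream code, the Python example below writes a small abundance histogram as CSV and reads it back by column name. The column names used here (abundance, count, cumulative) are assumptions made for illustration and may not match the headers the scripts actually emit.

```python
import csv
import io


def write_abundance_csv(histogram, out):
    """Write {abundance: count} as CSV with a header row and a running total."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["abundance", "count", "cumulative"])  # assumed headers
    running = 0
    for abundance in sorted(histogram):
        running += histogram[abundance]
        writer.writerow([abundance, histogram[abundance], running])


def read_abundance_csv(stream):
    """Read the rows back as dictionaries keyed by the header names."""
    return [{key: int(value) for key, value in row.items()}
            for row in csv.DictReader(stream)]


buffer = io.StringIO()
write_abundance_csv({1: 10, 2: 5, 5: 1}, buffer)
buffer.seek(0)
rows = read_abundance_csv(buffer)
```

With a header row present, consumers can address columns by name rather than by position, which is the main practical benefit of the new default.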
Estimate number of unique k-mers¶
unique-kmers.py estimates the k-mer cardinality of a dataset using the HyperLogLog probabilistic data structure. This allows very low memory consumption, which can be configured through an expected error rate. Even with a low error rate (and therefore higher memory consumption), it is still much more efficient than exact counting and alternative methods. It supports multicore processing (using OpenMP) and streaming, so it can be used in conjunction with other scripts (like normalize-by-median.py and filter-abund.py). This is the work of Luiz Irber and is the subject of a paper in draft.
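The trade-off between memory and error rate can be seen in a toy HyperLogLog written in Python. This is a sketch of the general technique, not the implementation used by unique-kmers.py: the class name, the choice of SHA-1 as the hash, and the parameter p are all assumptions. With m = 2**p registers the standard error is roughly 1.04 / sqrt(m), so a lower error rate requires more registers.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog cardinality estimator with 2**p registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")  # 64-bit hash value
        index = x >> (64 - self.p)             # top p bits select a register
        rest = x & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # Rank: position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


sketch = HyperLogLog(p=12)
for kmer in kmers("ACGTACGTAC", 4):
    sketch.add(kmer)
approx_unique = sketch.estimate()
```

Only the register array is kept in memory, regardless of how many k-mers are added, and items can be fed in one at a time, which is why the approach suits streaming input.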
New data structure and script names¶
For clarity, the Count-Min Sketch based data structure previously known as
"counting_hash" or "counting_table" and variations of these is now known as
countgraph. Likewise, the Bloom Filter based data structure previously
known as "hashbits", "presence_table" and variations of these is now known as
nodegraph. Many options relating to
table have been changed to graph.
Binary file formats have changed¶
All binary khmer formats (presence tables, counting tables, tag sets,
stop tags, and partition subsets) have changed. Files are now
prepended with the string
OXLI to indicate their origin.
Files of the above types made in previous versions of khmer are not compatible with v2.0; the reverse is also true.
In addition to the
OXLI string, the Nodegraph and Countgraph file formats
now include the number of occupied bins. See khmer/Oxli Binary File Formats
for details.
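One practical consequence is that a file can be checked for the new signature before attempting to load it. The Python sketch below only tests the leading four bytes; it makes no assumptions about the rest of the header, and the function name has_oxli_signature is hypothetical.

```python
SIGNATURE = b"OXLI"


def has_oxli_signature(path):
    """Return True if the file at `path` begins with the 4-byte OXLI string."""
    with open(path, "rb") as handle:
        return handle.read(len(SIGNATURE)) == SIGNATURE
```

A file written by an earlier version of khmer would fail this check, which gives a quick way to explain an incompatible-format error to the user.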
load-graph.py no longer appends .pt to the specified filename¶
Previously, load-graph.py appended a
.pt extension to the
specified output filename and partition-graph.py appended a
.pt to the given input filename. Now, load-graph.py writes to the
specified output filename and partition-graph.py does not append a
.pt to the given input filename.
Some reporting options are now always on¶
The total number of unique k-mers will always be reported every time a new
countgraph is made. The
--report-total-kmers option has been removed from
abundance-dist-single.py, filter-abund-single.py, and
normalize-by-median.py to reflect this. Likewise,
--write-fp-rate has been removed from load-into-counting.py and
load-graph.py; the false positive rate will always be
written to the
An uncommon error recovery routine was removed¶
To simplify the codebase,
--save-on-failure and its helper option
--dump-frequency have been removed from normalize-by-median.py.
Single file output option names have been normalized¶
The common option
--min-tablesize was renamed to
--max-tablesize to reflect
this more desirable behavior.
In conjunction with the new
option, the option
-p has been eliminated.
As CSV format is now the default, the
--csv option has been removed.