All scripts now accept input from named pipes (like /dev/stdin, or those
created using <( list ) process substitution) and unnamed pipes (like output
piped in from another program with |). The STDIN stream can also be specified
using a single dash: -.
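
The dash convention can be sketched in a few lines of Python. This is only an illustration, not the code the scripts use: open_input and count_fasta_records are hypothetical helpers, and an in-memory stream stands in for real piped input.

```python
import io
import sys


def open_input(filename, stdin=None):
    """Return a readable text stream for `filename`.

    A single dash selects standard input; anything else is opened as a
    path, which also covers named pipes such as /dev/stdin.
    """
    if filename == "-":
        return stdin if stdin is not None else sys.stdin
    return open(filename, "r")


def count_fasta_records(stream):
    """Count lines that start a FASTA record (lines beginning with '>')."""
    return sum(1 for line in stream if line.startswith(">"))


# Stand-in for data piped in from another program.
fake_stdin = io.StringIO(">r1\nACGT\n>r2\nTTGA\n")
n_records = count_fasta_records(open_input("-", stdin=fake_stdin))
```

Because a named pipe is opened like any other path, only the dash needs special handling.
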
There is now a -M/--max-memory-usage parameter that sets the number of tables
(-N/--n_tables) and table size (-x/--max-tablesize) parameters automatically
to match the desired memory usage.
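
As a rough illustration of how a memory budget could translate into a table size, the sketch below divides the budget evenly across the tables. It assumes one byte per entry for counting tables and one bit per entry for presence tables; tablesize_for_memory is a hypothetical helper, and the scripts themselves may choose or round the values differently.

```python
def tablesize_for_memory(max_memory_bytes, n_tables, bits_per_entry):
    """Return the number of entries per table that fits the memory budget.

    Assumes total memory = n_tables * tablesize * bits_per_entry / 8.
    """
    if n_tables < 1 or bits_per_entry < 1:
        raise ValueError("n_tables and bits_per_entry must be positive")
    total_bits = max_memory_bytes * 8
    return total_bits // (n_tables * bits_per_entry)


# Hypothetical budget of 1 GB (10**9 bytes) split across 4 tables.
budget = 10**9
counting_entries = tablesize_for_memory(budget, n_tables=4, bits_per_entry=8)
presence_entries = tablesize_for_memory(budget, n_tables=4, bits_per_entry=1)
```
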
normalize-by-median.py now supports mixed paired and unpaired (or
"broken-paired") input. Behavior can be forced to either treat all reads as
singletons or to require that all reads be properly paired, using
--force_single or --paired, respectively. If --paired is set,
--unpaired-reads can be used to include a file of unpaired reads. The
unpaired reads will be examined after all of the other sequence files.
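
The three modes can be illustrated with a small reader over (name, sequence) records. This is a sketch under stated assumptions, not the implementation used by normalize-by-median.py: it assumes mates are adjacent and named with /1 and /2 suffixes, and both function names are hypothetical.

```python
def is_pair(name1, name2):
    """True if the two read names look like /1 and /2 mates of one fragment."""
    return (name1.endswith("/1") and name2.endswith("/2")
            and name1[:-2] == name2[:-2])


def broken_paired_reader(records, force_single=False, require_paired=False):
    """Yield (read1, read2) tuples; read2 is None for a singleton.

    `records` is a list of (name, sequence) tuples in file order.
    """
    i = 0
    while i < len(records):
        first = records[i]
        second = records[i + 1] if i + 1 < len(records) else None
        if not force_single and second and is_pair(first[0], second[0]):
            yield first, second
            i += 2
        else:
            if require_paired:
                raise ValueError("unpaired read: " + first[0])
            yield first, None
            i += 1


reads = [("a/1", "ACGT"), ("a/2", "TTGA"), ("b/1", "GGCC"),
         ("c/1", "AATT"), ("c/2", "CCGG")]
mixed = list(broken_paired_reader(reads))                      # default mode
singles = list(broken_paired_reader(reads, force_single=True)) # all singletons
```

With require_paired=True the same input raises ValueError at the orphaned read b/1, which mirrors the stricter behavior described for --paired.
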
normalize-by-median.py --quiet can be used to reduce the amount of diagnostic
output.
split-paired-reads.py --output-orphaned/-0 has been added so that orphaned
reads are allowed and are sorted into a file of their own.
All scripts that output any kind of columnar data now do so in CSV format,
with headers. Previously this had to be enabled with --csv. (Affects
abundance-dist-single.py, abundance-dist.py, count-median.py, and
count-overlap.py.)
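
Output with a header row can be read by column name instead of by position. The sketch below uses Python's standard csv module on a made-up sample; the column names abundance and count are assumptions for illustration, and the real headers depend on the script.

```python
import csv
import io

# Hypothetical output; real column names depend on the script.
sample_output = "abundance,count\n1,120\n2,45\n3,9\n"

# DictReader uses the header row as the keys of each row dictionary.
rows = list(csv.DictReader(io.StringIO(sample_output)))
total_kmers = sum(int(row["count"]) for row in rows)
```
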
normalize-by-median.py --report also now outputs in CSV format.
sample-reads-randomly.py now retains pairs in the output, by default. This
can be overridden to match previous behavior with --force_single.
unique-kmers.py estimates the k-mer cardinality of a dataset using the HyperLogLog probabilistic data structure. This allows very low memory consumption, which can be configured through an expected error rate. Even with a low error rate (and therefore higher memory consumption), it is still much more efficient than exact counting and alternative methods. It supports multicore processing (using OpenMP) and streaming, and so can be used in conjunction with other scripts (like normalize-by-median.py and filter-abund.py). This is the work of Luiz Irber and it is the subject of a paper in draft.
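
To make the idea concrete, here is a minimal, single-threaded HyperLogLog sketch in Python. It is not the unique-kmers.py implementation: the choice of SHA-1 as the hash, the register count (2**p with p=12), and the helper names are all assumptions made for illustration. More registers lower the expected error at the cost of more memory.

```python
import hashlib
import math


def hll_estimate(items, p=12):
    """Estimate the number of distinct items using 2**p one-byte registers."""
    m = 1 << p
    registers = bytearray(m)
    for item in items:
        # 64-bit hash: the top p bits pick a register, the rest feed the rank.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - p)
        rest = h & ((1 << (64 - p)) - 1)
        # Rank = position of the leftmost 1-bit in the remaining 64 - p bits.
        rank = (64 - p) - rest.bit_length() + 1
        if rank > registers[index]:
            registers[index] = rank

    alpha = 0.7213 / (1 + 1.079 / m)  # bias correction, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw


def kmers(sequence, k):
    """Yield every k-length substring of `sequence`."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


# The input is consumed as a stream; only the registers are kept in memory.
reads = ["ACGTACGTAC", "CGTACGTACG"]
estimate = hll_estimate(km for read in reads for km in kmers(read, 4))
```

Duplicate k-mers never raise a register, so repeated input leaves the estimate unchanged. That property is what lets the sketch run over a stream in one pass.
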
For clarity, the Count-Min Sketch based data structure previously known as
"counting_hash" or "counting_table" and variations of these is now known as
countgraph. Likewise, the Bloom Filter based data structure previously known
as "hashbits", "presence_table" and variations of these is now known as
nodegraph. Many options relating to table have been changed to graph.
All binary khmer formats (presence tables, counting tables, tag sets,
stop tags, and partition subsets) have changed. Files are now
prepended with the string OXLI to indicate that they are from
this project.
Files of the above types made in previous versions of khmer are not compatible with v2.0; the reverse is also true.
In addition to the OXLI string, the Nodegraph and Countgraph file formats
now include the number of occupied bins. See khmer/Oxli Binary File Formats
for details.
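
A quick way to tell a new-style file from an old one is to look at its first four bytes. The sketch below checks only for the OXLI string described above; has_oxli_signature is a hypothetical helper, the bytes after the signature are placeholders, and the rest of the header layout is deliberately not assumed here.

```python
import io

SIGNATURE = b"OXLI"


def has_oxli_signature(stream):
    """Return True if the binary stream starts with the OXLI signature."""
    return stream.read(len(SIGNATURE)) == SIGNATURE


# In-memory stand-ins for files; the bytes after the signature are made up.
new_style = io.BytesIO(b"OXLI" + b"\x00" * 8)
old_style = io.BytesIO(b"\x00" * 12)
new_ok = has_oxli_signature(new_style)
old_ok = has_oxli_signature(old_style)
```

For a file on disk, the same check would be applied to a handle opened in binary mode ("rb").
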
Previously, load-graph.py appended a .pt extension to the specified output
filename and partition-graph.py appended a .pt to the given input filename.
Now, load-graph.py writes to the specified output filename and
partition-graph.py does not append a .pt to the given input filename.
The total number of unique k-mers will always be reported every time a new
countgraph is made. The --report-total-kmers option has been removed from
abundance-dist-single.py, filter-abund-single.py, and normalize-by-median.py
to reflect this. Likewise, write-fp-rate has been removed from
load-into-counting.py and load-graph.py; the false positive rate will always
be written to the .info files.
To simplify the codebase, --save-on-failure and its helper option
--dump-frequency have been removed from normalize-by-median.py.
--out is now --output for both normalize-by-median.py and trim-low-abund.py.
The common option --min-tablesize was renamed to --max-tablesize to reflect
this more desirable behavior.
In conjunction with the new split-paired-reads.py --output-orphaned option,
the option --force-paired/-p has been eliminated.
As CSV format is now the default, the --csv option has been removed.
count-overlap.py has been removed.