What's New In khmer 2.0?

New behavior

Streaming I/O from Unix Pipes

All scripts now accept input from named pipes (such as /dev/stdin, or one created using <( list ) process substitution) and unnamed pipes (such as output piped in from another program with |). The STDIN stream can also be specified using a single dash: -.
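
One common way for a script to honor the single-dash convention is sketched below. This is an illustration only, not khmer's actual code; open_input and count_lines are hypothetical helper names.

```python
import sys


def open_input(filename):
    """Return a readable text stream; a single dash means standard input."""
    if filename == "-":
        return sys.stdin
    # Named pipes such as /dev/stdin open like ordinary files.
    return open(filename, "r")


def count_lines(stream):
    """Consume the stream once, front to back, without seeking."""
    return sum(1 for _ in stream)
```

Because a pipe cannot be rewound, code written this way has to read its input in a single pass.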

New parameter for memory usage, as an alternative to the table size and number of tables parameters

There is now a -M/--max-memory-usage parameter that automatically sets the number of tables (-N/--n_tables) and table size (-x/--max-tablesize) parameters to match the desired memory usage.
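
The arithmetic behind such a parameter can be sketched as follows. This is a minimal sketch, assuming the memory budget is split evenly across the tables; the function name, the default of four tables, and the bits_per_bin values are assumptions for illustration rather than khmer's documented behavior.

```python
def tables_from_memory(max_memory_bytes, n_tables=4, bits_per_bin=8):
    """Split a memory budget evenly across n_tables tables.

    bits_per_bin=8 models one-byte counters; bits_per_bin=1 models a
    one-bit-per-bin presence table. Both values are assumptions.
    """
    if n_tables < 1:
        raise ValueError("n_tables must be at least 1")
    total_bins = max_memory_bytes * 8 // bits_per_bin
    # Each table receives an equal share of the available bins.
    return n_tables, total_bins // n_tables
```

Under these assumptions, a budget of 4,000,000 bytes with four one-byte-per-bin tables gives 1,000,000 bins per table.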

Digital normalization script now supports mixed paired and unpaired read input

normalize-by-median.py now supports mixed paired and unpaired (or "broken-paired") input. Behavior can be forced to either treat all reads as singletons or to require all reads be properly paired using --force_single or --paired, respectively. If --paired is set, --unpaired-reads can be used to include a file of unpaired reads. The unpaired reads will be examined after all of the other sequence files. normalize-by-median.py --quiet can be used to reduce the amount of diagnostic output.
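
To make "broken-paired" input concrete, the sketch below walks a stream of read names and groups adjacent mates. It assumes mates are marked with /1 and /2 name suffixes, which is only one of several naming conventions; split_name and broken_paired are hypothetical names, and this is not the parser khmer uses.

```python
def split_name(name):
    """Return (base, mate) for names ending in /1 or /2, else (name, None)."""
    if name.endswith("/1") or name.endswith("/2"):
        return name[:-2], name[-1]
    return name, None


def broken_paired(names):
    """Yield (first, second) tuples; second is None for an unpaired read."""
    pending = None
    for name in names:
        if pending is not None:
            base1, mate1 = split_name(pending)
            base2, mate2 = split_name(name)
            if mate1 == "1" and mate2 == "2" and base1 == base2:
                yield pending, name
                pending = None
                continue
            # The held read has no adjacent mate, so emit it as a singleton.
            yield pending, None
        pending = name
    if pending is not None:
        yield pending, None
```

A pair is only recognized when the two mates are adjacent in the input, which is why interleaved files with some missing mates can be handled in one pass.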

Mixed-pair sequence file format support

split-paired-reads.py --output-orphaned/-0 has been added so that orphaned reads are allowed in the input and are written to a file of their own.

Scripts now output columnar data in CSV format by default

All scripts that output any kind of columnar data now do so in CSV format, with headers. Previously this had to be enabled with --csv. (Affects abundance-dist-single.py, abundance-dist.py, count-median.py, and count-overlap.py.) normalize-by-median.py --report also now outputs in CSV format.
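
With a header row present, downstream code can address columns by name instead of by position. The sketch below uses Python's standard csv module; the column names and values in SAMPLE are made up for illustration and should not be read as the scripts' actual output.

```python
import csv
import io

# Illustrative sample only: hypothetical column names and values.
SAMPLE = "abundance,count\n1,10\n2,5\n"


def read_rows(stream):
    """Parse CSV text with a header row into a list of dicts."""
    return [dict(row) for row in csv.DictReader(stream)]


rows = read_rows(io.StringIO(SAMPLE))
# Values arrive as strings and must be converted explicitly.
total = sum(int(row["count"]) for row in rows)
```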

Reservoir sampling script extracts paired reads by default

sample-reads-randomly.py now retains pairs in the output, by default. This can be overridden to match previous behavior with --force_single.
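
Retaining pairs amounts to treating each pair as a single unit in the reservoir. The following is a minimal sketch of textbook reservoir sampling (Algorithm R) over pairs, not the script's own code; sample_pairs and its parameters are hypothetical names.

```python
import random


def sample_pairs(pairs, k, seed=None):
    """Keep a uniform random sample of up to k pairs from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for i, pair in enumerate(pairs):
        if i < k:
            # Fill the reservoir with the first k pairs.
            reservoir.append(pair)
        else:
            # Pick a slot in [0, i]; replace only if it falls inside the reservoir.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = pair
    return reservoir
```

Since both mates travel together as one item, a sampled read never loses its partner.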

New scripts

Estimate number of unique k-mers

unique-kmers.py estimates the k-mer cardinality of a dataset using the HyperLogLog probabilistic data structure. This allows very low memory consumption, which can be configured through an expected error rate. Even with a low error rate (and therefore higher memory consumption), it is still much more efficient than exact counting and alternative methods. It supports multicore processing (using OpenMP) and streaming, and so can be used in conjunction with other scripts (like normalize-by-median.py and filter-abund.py). This is the work of Luiz Irber and it is the subject of a paper in draft.
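
For readers unfamiliar with HyperLogLog, the idea can be sketched in a few lines of Python. This is a simplified, single-threaded illustration and not the implementation in unique-kmers.py: the class, the SHA-1 based hash, and the choice of 2^12 registers are assumptions, and reverse-complement k-mers are not collapsed.

```python
import hashlib
import math


def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class HyperLogLog:
    def __init__(self, p=12):
        self.p = p                       # number of index bits
        self.m = 1 << p                  # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")       # 64-bit hash value
        idx = x >> (64 - self.p)                    # top p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank is the 1-based position of the leftmost 1-bit in rest.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)            # constant for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw
```

Memory use is fixed by the number of registers, not by the number of k-mers seen. More registers lower the expected error, which is the trade-off described above.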

Incompatible changes

New data structure and script names

For clarity, the Count-Min Sketch based data structure previously known as "counting_hash", "counting_table", and variations of these is now known as countgraph. Likewise, the Bloom Filter based data structure previously known as "hashbits", "presence_table", and variations of these is now known as nodegraph. Many options relating to table have been changed to graph.

Binary file formats have changed

All binary khmer formats (presence tables, counting tables, tag sets, stop tags, and partition subsets) have changed. Files are now prepended with the string OXLI to indicate that they are from this project.

Files of the above types made in previous versions of khmer are not compatible with v2.0; the reverse is also true.

In addition to the OXLI string, the Nodegraph and Countgraph file format now includes the number of occupied bins. See khmer/Oxli Binary File Formats for details.
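
A quick way to tell whether a binary file carries the new signature is to inspect its first four bytes. The helper below is a hypothetical sketch: it checks only for the OXLI string and makes no attempt to parse the rest of the header.

```python
SIGNATURE = b"OXLI"


def has_oxli_signature(path):
    """Return True if the file begins with the four-byte OXLI string."""
    with open(path, "rb") as handle:
        return handle.read(len(SIGNATURE)) == SIGNATURE
```

A file that fails this check may have been written by an earlier version of khmer and would need to be regenerated.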

load-graph.py no longer appends .pt to the specified filename

Previously, load-graph.py appended a .pt extension to the specified output filename and partition-graph.py appended a .pt to the given input filename. Now, load-graph.py writes to the specified output filename and partition-graph.py does not append a .pt to the given input filename.

Some reporting options are now always on

The total number of unique k-mers will always be reported every time a new countgraph is made. The --report-total-kmers option has been removed from abundance-dist-single.py, filter-abund-single.py, and normalize-by-median.py to reflect this. Likewise, --write-fp-rate has been removed from load-into-counting.py and load-graph.py; the false positive rate will always be written to the .info files.

An uncommon error recovery routine was removed

To simplify the codebase, --save-on-failure and its helper option --dump-frequency have been removed from normalize-by-median.py.

Single file output option names have been normalized

--out is now --output for both normalize-by-median.py and trim-low-abund.py.

Miscellaneous changes

The common option --min-tablesize was renamed to --max-tablesize to reflect this more desirable behavior.

In conjunction with the new split-paired-reads.py --output-orphaned option, the option --force-paired/-p has been eliminated.

As CSV format is now the default, the --csv option has been removed.

Removed script

count-overlap.py has been removed.
