khmer's command-line interface

The simplest way to use khmer's functionality is through the command-line scripts, located in the scripts/ directory of the khmer distribution. Below is our documentation for these scripts. Note that all scripts accept -h/--help, which prints a list of the arguments the script takes.

Scripts that use k-mer counting tables or k-mer graphs take an -M parameter, which sets the maximum memory usage in bytes. This should generally be set as high as possible; see Setting khmer memory usage for more information.
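
For example, to build a countgraph while letting khmer use at most roughly 8 GB of memory (the filenames and parameter values here are illustrative):

load-into-counting.py -M 8e9 -k 20 out.ct input.fa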

  1. k-mer counting and abundance filtering
  2. Partitioning
  3. Digital normalization
  4. Read handling: interleaving, splitting, etc.

Note

Almost all scripts take either FASTA or FASTQ format as input, and output the same format.

Gzip and bzip2 compressed files are detected automatically.

k-mer counting and abundance filtering

load-into-counting.py

Build a k-mer countgraph from the given sequences.

usage: load-into-counting.py [--version] [--info] [-h] [-k KSIZE] [-U UNIQUE_KMERS] [--fp-rate FP_RATE] [-M MAX_MEMORY_USAGE] [--small-count] [-T THREADS] [-b] [-s FORMAT] [-f] [-q] output_countgraph_filename input_sequence_filename [input_sequence_filename ...]

output_countgraph_filename

The name of the file to write the k-mer countgraph to.

input_sequence_filename

The names of one or more FAST[AQ] input sequence files.

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-k <ksize>, --ksize <ksize>

k-mer size to use

--n_tables <n_tables>, -N <n_tables>

Number of tables to use (hidden from --help; superseded by -M/--max-memory-usage)

-U <unique_kmers>, --unique-kmers <unique_kmers>

approximate number of unique kmers in the input set

--fp-rate <fp_rate>

Override the automatic FP rate setting for the current script

--max-tablesize <max_tablesize>, -x <max_tablesize>

Upper bound on the size of each table (hidden from --help; superseded by -M/--max-memory-usage)

-M <max_memory_usage>, --max-memory-usage <max_memory_usage>

maximum amount of memory to use for data structure

--small-count

Reduce memory usage by using a smaller counter for individual kmers.

-T <threads>, --threads <threads>

Number of simultaneous threads to execute

-b, --no-bigcount

The default behaviour is to count past 255 using bigcount. This flag turns bigcount off, limiting counts to 255.

-s {json,tsv}, --summary-info {json,tsv}

What format should the machine-readable run summary be in? (json or tsv; disabled by default)

-f, --force

Overwrite output file if it exists

-q, --quiet

Note: with -b/--no-bigcount the output will be the exact size of the k-mer countgraph, and this script will use a constant amount of memory. In exchange, k-mer counts will stop at 255. The memory usage of this script with -b will be about 1.15x the product of the -x and -N numbers.
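
For instance, with the (illustrative) settings -x 5e7 and -N 4, expect roughly 1.15 x 5e7 x 4 = 2.3e8 bytes, or about 230 MB.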

Example:

load-into-counting.py -k 20 -x 5e7 out data/100k-filtered.fa

Multiple threads can be used to accelerate the process, if you have extra cores to spare.

Example:

load-into-counting.py -k 20 -x 5e7 -T 4 out data/100k-filtered.fa

abundance-dist.py

Calculate abundance distribution of the k-mers in the sequence file using a pre-made k-mer countgraph.

usage: abundance-dist.py [--version] [--info] [-h] [-z] [-s] [-b] [-f] [-q] input_count_graph_filename input_sequence_filename output_histogram_filename

input_count_graph_filename

The name of the input k-mer countgraph file.

input_sequence_filename

The name of the input FAST[AQ] sequence file.

output_histogram_filename

The name of the output histogram file. The columns are: (1) k-mer abundance, (2) k-mer count, (3) cumulative count, (4) fraction of total distinct k-mers.

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-z, --no-zero

Do not output zero-count bins

-s, --squash

Overwrite existing output_histogram_filename

-b, --no-bigcount

Do not count k-mers past 255

-f, --force

Continue even if specified input files do not exist or are empty.

-q, --quiet

Example:

load-into-counting.py -x 1e7 -N 2 -k 17 counts \
        tests/test-data/test-abund-read-2.fa
abundance-dist.py counts tests/test-data/test-abund-read-2.fa test-dist

abundance-dist-single.py

Calculate the abundance distribution of k-mers from a single sequence file.

usage: abundance-dist-single.py [--version] [--info] [-h] [-k KSIZE] [-U UNIQUE_KMERS] [--fp-rate FP_RATE] [-M MAX_MEMORY_USAGE] [--small-count] [-T THREADS] [-z] [-b] [-s] [--savegraph filename] [-f] [-q] input_sequence_filename output_histogram_filename

input_sequence_filename

The name of the input FAST[AQ] sequence file.

output_histogram_filename

The name of the output histogram file. The columns are: (1) k-mer abundance, (2) k-mer count, (3) cumulative count, (4) fraction of total distinct k-mers.

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-k <ksize>, --ksize <ksize>

k-mer size to use

--n_tables <n_tables>, -N <n_tables>

Number of tables to use (hidden from --help; superseded by -M/--max-memory-usage)

-U <unique_kmers>, --unique-kmers <unique_kmers>

approximate number of unique kmers in the input set

--fp-rate <fp_rate>

Override the automatic FP rate setting for the current script

--max-tablesize <max_tablesize>, -x <max_tablesize>

Upper bound on the size of each table (hidden from --help; superseded by -M/--max-memory-usage)

-M <max_memory_usage>, --max-memory-usage <max_memory_usage>

maximum amount of memory to use for data structure

--small-count

Reduce memory usage by using a smaller counter for individual kmers.

-T <threads>, --threads <threads>

Number of simultaneous threads to execute

-z, --no-zero

Do not output zero-count bins

-b, --no-bigcount

Do not count k-mers past 255

-s, --squash

Overwrite output file if it exists

--savegraph <filename>

Save the k-mer countgraph to the specified filename.

-f, --force

Override sanity checks

-q, --quiet

Note that with -b/--no-bigcount this script uses constant memory; in exchange, k-mer counts will stop at 255. The memory usage of this script with -b will be about 1.15x the product of the -x and -N numbers.

To count k-mers in multiple files use load-into-counting.py and abundance-dist.py.

Example:

abundance-dist-single.py -x 1e7 -N 2 -k 17 \
        tests/test-data/test-abund-read-2.fa test-dist

filter-abund.py

Trim sequences at a minimum k-mer abundance.

usage: filter-abund.py [--version] [--info] [-h] [-T THREADS] [-C CUTOFF] [-V] [-Z NORMALIZE_TO] [-o optional_output_filename] [-f] [-q] [--gzip | --bzip] input_count_graph_filename input_sequence_filename [input_sequence_filename ...]

input_count_graph_filename

The input k-mer countgraph filename

input_sequence_filename

Input FAST[AQ] sequence filename

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-T <threads>, --threads <threads>

Number of simultaneous threads to execute

-C <cutoff>, --cutoff <cutoff>

Trim at k-mers below this abundance.

-V, --variable-coverage

Only trim low-abundance k-mers from sequences that have high coverage.

-Z <normalize_to>, --normalize-to <normalize_to>

Base the variable-coverage cutoff on this median k-mer abundance.

-o <optional_output_filename>, --output <optional_output_filename>

Output the trimmed sequences into a single file with the given filename instead of creating a new file for each input file.

-f, --force

Overwrite output file if it exists

-q, --quiet

--gzip

Compress output using gzip

--bzip

Compress output using bzip2

Trimmed sequences will be placed in ${input_sequence_filename}.abundfilt for each input sequence file. If the input sequences are from RNAseq or metagenome sequencing then --variable-coverage should be used.

Example:

load-into-counting.py -k 20 -x 5e7 countgraph data/100k-filtered.fa
filter-abund.py -C 2 countgraph data/100k-filtered.fa

filter-abund-single.py

Trim sequences at a minimum k-mer abundance (in-memory version).

usage: filter-abund-single.py [--version] [--info] [-h] [-k KSIZE] [-U UNIQUE_KMERS] [--fp-rate FP_RATE] [-M MAX_MEMORY_USAGE] [--small-count] [-T THREADS] [-C CUTOFF] [-V] [-Z NORMALIZE_TO] [--savegraph filename] [-o optional_output_filename] [-f] [-q] [--gzip | --bzip] input_sequence_filename

input_sequence_filename

FAST[AQ] sequence file to trim

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-k <ksize>, --ksize <ksize>

k-mer size to use

--n_tables <n_tables>, -N <n_tables>

Number of tables to use (hidden from --help; superseded by -M/--max-memory-usage)

-U <unique_kmers>, --unique-kmers <unique_kmers>

approximate number of unique kmers in the input set

--fp-rate <fp_rate>

Override the automatic FP rate setting for the current script

--max-tablesize <max_tablesize>, -x <max_tablesize>

Upper bound on the size of each table (hidden from --help; superseded by -M/--max-memory-usage)

-M <max_memory_usage>, --max-memory-usage <max_memory_usage>

maximum amount of memory to use for data structure

--small-count

Reduce memory usage by using a smaller counter for individual kmers.

-T <threads>, --threads <threads>

Number of simultaneous threads to execute

-C <cutoff>, --cutoff <cutoff>

Trim at k-mers below this abundance.

-V, --variable-coverage

Only trim low-abundance k-mers from sequences that have high coverage.

-Z <normalize_to>, --normalize-to <normalize_to>

Base the variable-coverage cutoff on this median k-mer abundance.

--savegraph <filename>

If present, the name of the file to save the k-mer countgraph to

-o <optional_output_filename>, --outfile <optional_output_filename>

Override default output filename and output trimmed sequences into a file with the given filename.

-f, --force

Overwrite output file if it exists

-q, --quiet

--gzip

Compress output using gzip

--bzip

Compress output using bzip2

Trimmed sequences will be placed in ${input_sequence_filename}.abundfilt.

This script uses constant memory.

To trim reads based on k-mer abundance across multiple files, use load-into-counting.py and filter-abund.py.

Example:

filter-abund-single.py -k 20 -x 5e7 -C 2 data/100k-filtered.fa

trim-low-abund.py

Trim low-abundance k-mers using a streaming algorithm.

usage: trim-low-abund.py [--version] [--info] [-h] [-k KSIZE] [-U UNIQUE_KMERS] [--fp-rate FP_RATE] [-M MAX_MEMORY_USAGE] [--small-count] [-C CUTOFF] [-Z TRIM_AT_COVERAGE] [-o output_filename] [-V] [-l filename] [-s filename] [-q] [--summary-info FORMAT] [--force] [--ignore-pairs] [-T TEMPDIR] [--gzip | --bzip] [--diginorm] [--diginorm-coverage DIGINORM_COVERAGE] [--single-pass] input_filenames [input_filenames ...]

input_filenames

Input FAST[AQ] sequence filenames.

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-k <ksize>, --ksize <ksize>

k-mer size to use

--n_tables <n_tables>, -N <n_tables>

Number of tables to use (hidden from --help; superseded by -M/--max-memory-usage)

-U <unique_kmers>, --unique-kmers <unique_kmers>

approximate number of unique kmers in the input set

--fp-rate <fp_rate>

Override the automatic FP rate setting for the current script

--max-tablesize <max_tablesize>, -x <max_tablesize>

Upper bound on the size of each table (hidden from --help; superseded by -M/--max-memory-usage)

-M <max_memory_usage>, --max-memory-usage <max_memory_usage>

maximum amount of memory to use for data structure

--small-count

Reduce memory usage by using a smaller counter for individual kmers.

-C <cutoff>, --cutoff <cutoff>

remove k-mers below this abundance

-Z <trim_at_coverage>, --trim-at-coverage <trim_at_coverage>, --normalize-to <trim_at_coverage>

trim reads when the entire read is above this coverage

-o <output_filename>, --output <output_filename>

only output a single file with the specified filename; use a single dash "-" to specify that output should go to STDOUT (the terminal)

-V, --variable-coverage

Only trim low-abundance k-mers from sequences that have high coverage.

-l <filename>, --loadgraph <filename>

load a precomputed k-mer graph from disk

-s <filename>, --savegraph <filename>

save the k-mer countgraph to disk after all reads are loaded.

-q, --quiet

--summary-info {json,tsv}

What format should the machine-readable run summary be in? (json or tsv; disabled by default)

--force

--ignore-pairs

treat all reads as if they were singletons

-T <tempdir>, --tempdir <tempdir>

Set location of temporary directory for second pass

--gzip

Compress output using gzip

--bzip

Compress output using bzip2

--diginorm

Eliminate high-coverage reads altogether (digital normalization).

--diginorm-coverage <diginorm_coverage>

Coverage threshold for --diginorm

--single-pass

Do not do a second pass across the low coverage data

The output is one file for each input file, <input file>.abundtrim, placed in the current directory. This output contains the input sequences trimmed at low-abundance k-mers.

The -V/--variable-coverage parameter will, if specified, prevent elimination of low-abundance reads by only trimming low-abundance k-mers from high-abundance reads; use this for non-genomic data sets that may have variable coverage.

Note that the output reads will not necessarily be in the same order as the reads in the input files; if this is an important consideration, use load-into-counting.py and filter-abund.py. However, read pairs will be kept together, in "broken-paired" format; you can use extract-paired-reads.py to extract read pairs and orphans.
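
For example, a minimal sketch of trimming followed by pair extraction (filenames illustrative):

trim-low-abund.py -M 5e7 -k 20 -C 2 reads.fq
extract-paired-reads.py reads.fq.abundtrim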

Example:

trim-low-abund.py -x 5e7 -k 20 -C 2 data/100k-filtered.fa

count-median.py

Compute median/average k-mer abundance statistics for sequences

usage: count-median.py [--version] [--info] [-h] [-f] input_count_graph_filename input_sequence_filename output_summary_filename

input_count_graph_filename

input k-mer countgraph filename

input_sequence_filename

input FAST[AQ] sequence filename

output_summary_filename

output summary filename

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-f, --force

Overwrite output file if it exists

Count the median/avg k-mer abundance for each sequence in the input file, based on the k-mer counts in the given k-mer countgraph. Can be used to estimate expression levels (mRNAseq) or coverage (genomic/metagenomic).

The output file contains sequence id, median, average, stddev, and seq length, in comma-separated value (CSV) format.

Example:

load-into-counting.py counts tests/test-data/test-reads.fq.gz
count-median.py counts tests/test-data/test-reads.fq.gz medians.txt

NOTE: All 'N's in the input sequences are converted to 'A's.

unique-kmers.py

Estimate number of unique k-mers, with precision <= ERROR_RATE.

usage: unique-kmers.py [--version] [--info] [-h] [-q] [-k KSIZE] [-e ERROR_RATE] [-R filename] [-S] [--diagnostics] input_sequence_filename [input_sequence_filename ...]

input_sequence_filename

Input FAST[AQ] sequence filename(s).

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-q, --quiet

-k <ksize>, --ksize <ksize>

k-mer size to use

-e <error_rate>, --error-rate <error_rate>

Acceptable error rate

-R <filename>, --report <filename>

generate informational report and write to filename

-S, --stream-records

write input sequences to STDOUT

--diagnostics

print out recommended tablesize arguments and restrictions

A HyperLogLog counter is used to do cardinality estimation. Since this counter is based on a tradeoff between precision and memory consumption, the -e/--error-rate parameter can be used to control how much memory will be used. In practice the memory footprint is small even at low error rates (< 0.01).

-k/--ksize should be set to the desired k-mer size.

Informational output is sent to STDERR, but a report file can be generated with -R/--report.

--stream-records will write the input sequences to STDOUT. This is useful for workflows: count unique k-mers in a stream, then do digital normalization.

--diagnostics will print recommended tablesize arguments and restrictions for various false positive rates; this is useful for configuring other khmer scripts. This output is written to STDERR.

Example:

unique-kmers.py -k 17 tests/test-data/test-abund-read{,-2,-3}.fa

Example:

unique-kmers.py -k 17 --diagnostics tests/test-data/test-abund-read.fa

Example:

unique-kmers.py --stream-records -k 17 tests/test-data/test-reads.fa | \
normalize-by-median.py -k 17 -o normalized /dev/stdin

Example:

unique-kmers.py -R unique_count -k 30 \
tests/test-data/test-abund-read-paired.fa

Partitioning

do-partition.py

Load, partition, and annotate FAST[AQ] sequences

usage: do-partition.py [--version] [--info] [-h] [-k KSIZE] [-U UNIQUE_KMERS] [--fp-rate FP_RATE] [-M MAX_MEMORY_USAGE] [-T THREADS] [-s SUBSET_SIZE] [--no-big-traverse] [--keep-subsets] [-f] graphbase input_sequence_filename [input_sequence_filename ...]

graphbase

base name for output files

input_sequence_filename

input FAST[AQ] sequence filenames

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-k <ksize>, --ksize <ksize>

k-mer size to use

--n_tables <n_tables>, -N <n_tables>

Number of tables to use (hidden from --help; superseded by -M/--max-memory-usage)

-U <unique_kmers>, --unique-kmers <unique_kmers>

approximate number of unique kmers in the input set

--fp-rate <fp_rate>

Override the automatic FP rate setting for the current script

--max-tablesize <max_tablesize>, -x <max_tablesize>

Upper bound on the size of each table (hidden from --help; superseded by -M/--max-memory-usage)

-M <max_memory_usage>, --max-memory-usage <max_memory_usage>

maximum amount of memory to use for data structure

-T <threads>, --threads <threads>

Number of simultaneous threads to execute

-s <subset_size>, --subset-size <subset_size>

Set subset size (usually 1e5-1e6 is good)

--no-big-traverse

Truncate graph joins at big traversals

--keep-subsets

Keep individual subsets

-f, --force

Overwrite output file if it exists

Load in a set of sequences, partition them, merge the partitions, and annotate the original sequences files with the partition information.

This script combines the functionality of load-graph.py, partition-graph.py, merge-partition.py, and annotate-partitions.py into one script. This is convenient but should probably not be used for large data sets, because do-partition.py doesn't provide save/resume functionality.

Example:

do-partition.py -k 20 example tests/test-data/random-20-a.fa

load-graph.py

Load sequences into the compressible graph format plus optional tagset.

usage: load-graph.py [--version] [--info] [-h] [-k KSIZE] [-U UNIQUE_KMERS] [--fp-rate FP_RATE] [-M MAX_MEMORY_USAGE] [-T THREADS] [--no-build-tagset] [-f] output_nodegraph_filename input_sequence_filename [input_sequence_filename ...]

output_nodegraph_filename

output k-mer nodegraph filename.

input_sequence_filename

input FAST[AQ] sequence filename

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-k <ksize>, --ksize <ksize>

k-mer size to use

--n_tables <n_tables>, -N <n_tables>

Number of tables to use (hidden from --help; superseded by -M/--max-memory-usage)

-U <unique_kmers>, --unique-kmers <unique_kmers>

approximate number of unique kmers in the input set

--fp-rate <fp_rate>

Override the automatic FP rate setting for the current script

--max-tablesize <max_tablesize>, -x <max_tablesize>

Upper bound on the size of each table (hidden from --help; superseded by -M/--max-memory-usage)

-M <max_memory_usage>, --max-memory-usage <max_memory_usage>

maximum amount of memory to use for data structure

-T <threads>, --threads <threads>

Number of simultaneous threads to execute

--no-build-tagset, -n

Do NOT construct tagset while loading sequences

-f, --force

Overwrite output file if it exists

See extract-partitions.py for a complete workflow.

partition-graph.py

Partition a sequence graph based upon waypoint connectivity

usage: partition-graph.py [--version] [--info] [-h] [-S filename] [-s SUBSET_SIZE] [--no-big-traverse] [-f] [-T THREADS] basename

basename

basename of the input k-mer nodegraph + tagset files

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-S <filename>, --stoptags <filename>

Use stoptags in this file during partitioning

-s <subset_size>, --subset-size <subset_size>

Set subset size (usually 1e5-1e6 is good)

--no-big-traverse

Truncate graph joins at big traversals

-f, --force

Overwrite output file if it exists

-T <threads>, --threads <threads>

Number of simultaneous threads to execute

The resulting partition maps are saved as ${basename}.subset.#.pmap files.

See 'Artifact removal' to understand the stoptags argument.

merge-partition.py

Merge partition map '.pmap' files.

usage: merge-partition.py [--version] [--info] [-h] [-k KSIZE] [--keep-subsets] [-f] graphbase

graphbase

basename for input and output files

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-k <ksize>, --ksize <ksize>

k-mer size (default: 32)

--keep-subsets

Keep individual subsets (default: False)

-f, --force

Overwrite output file if it exists

Take the ${graphbase}.subset.#.pmap files and merge them all into a single ${graphbase}.pmap.merged file for annotate-partitions.py to use.

annotate-partitions.py

Annotate sequences with partition IDs.

usage: annotate-partitions.py [--version] [--info] [-h] [-k KSIZE] [-f] graphbase input_sequence_filename [input_sequence_filename ...]

graphbase

basename for input and output files

input_sequence_filename

input FAST[AQ] sequences to annotate.

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-k <ksize>, --ksize <ksize>

k-mer size (default: 32)

-f, --force

Overwrite output file if it exists

Load in a partition map (generally produced by partition-graph.py or merge-partition.py) and annotate the sequences in the given files with their partition IDs. Use extract-partitions.py to extract sequences into separate group files.

Example (results will be in random-20-a.fa.part):

load-graph.py -k 20 example tests/test-data/random-20-a.fa
partition-graph.py example
merge-partition.py -k 20 example
annotate-partitions.py -k 20 example tests/test-data/random-20-a.fa

extract-partitions.py

Separate sequences that are annotated with partitions into grouped files.

usage: extract-partitions.py [--version] [--info] [-h] [-X MAX_SIZE] [-m MIN_PART_SIZE] [-n] [-U] [-f] [--gzip | --bzip] output_filename_prefix input_partition_filename [input_partition_filename ...]

output_filename_prefix

Prefix for the output group files.

input_partition_filename

Input partition-annotated FAST[AQ] file(s) (e.g. produced by annotate-partitions.py).

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-X <max_size>, --max-size <max_size>

Max group size (n sequences)

-m <min_part_size>, --min-partition-size <min_part_size>

Minimum partition size worth keeping

-n, --no-output-groups

Do not actually output group files.

-U, --output-unassigned

Output unassigned sequences, too

-f, --force

Overwrite output file if it exists

--gzip

Compress output using gzip

--bzip

Compress output using bzip2

Example (results will be in example.group0000.fa):

load-graph.py -k 20 example tests/test-data/random-20-a.fa
partition-graph.py example
merge-partition.py -k 20 example
annotate-partitions.py -k 20 example tests/test-data/random-20-a.fa
extract-partitions.py example random-20-a.fa.part

(extract-partitions.py will produce a partition size distribution in <base>.dist. The columns are: (1) number of reads, (2) count of partitions with n reads, (3) cumulative sum of partitions, (4) cumulative sum of reads.)

Artifact removal

The following scripts are specialized scripts for finding and removing highly-connected k-mers (HCKs). See Partitioning large data sets (50m+ reads).

make-initial-stoptags.py

Find an initial set of highly connected k-mers.

usage: make-initial-stoptags.py [--version] [--info] [-h] [-k KSIZE] [-U UNIQUE_KMERS] [--fp-rate FP_RATE] [-M MAX_MEMORY_USAGE] [--small-count] [-s SUBSET_SIZE] [-S filename] [-f] graphbase

graphbase

basename for input and output filenames

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-k <ksize>, --ksize <ksize>

k-mer size to use

--n_tables <n_tables>, -N <n_tables>

Number of tables to use (hidden from --help; superseded by -M/--max-memory-usage)

-U <unique_kmers>, --unique-kmers <unique_kmers>

approximate number of unique kmers in the input set

--fp-rate <fp_rate>

Override the automatic FP rate setting for the current script

--max-tablesize <max_tablesize>, -x <max_tablesize>

Upper bound on the size of each table (hidden from --help; superseded by -M/--max-memory-usage)

-M <max_memory_usage>, --max-memory-usage <max_memory_usage>

maximum amount of memory to use for data structure

--small-count

Reduce memory usage by using a smaller counter for individual kmers.

-s <subset_size>, --subset-size <subset_size>

Set subset size (default 1e4 is probably ok)

-S <filename>, --stoptags <filename>

Use stoptags in this file during partitioning

-f, --force

Overwrite output file if it exists

Loads a k-mer nodegraph/tagset pair created by load-graph.py, and does a small set of traversals from graph waypoints; on these traversals, it looks for k-mers that are repeatedly traversed in high-density regions of the graph, i.e. that are highly connected. It outputs those k-mers as an initial set of stoptags, which can be fed into partition-graph.py, find-knots.py, and filter-stoptags.py.

The k-mer countgraph size parameters are for a countgraph used to keep track of repeatedly-traversed k-mers. The subset size option specifies the number of waypoints from which to traverse; for highly connected data sets, the default (1e4) is probably ok.
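
A minimal sketch of how this fits into the partitioning pipeline, assuming the stoptags are written to <graphbase>.stoptags as with find-knots.py (filenames illustrative):

load-graph.py -k 32 example data.fa
make-initial-stoptags.py example
partition-graph.py --stoptags example.stoptags example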

find-knots.py

Find all highly connected k-mers.

usage: find-knots.py [--version] [--info] [-h] [-k KSIZE] [-U UNIQUE_KMERS] [--fp-rate FP_RATE] [-M MAX_MEMORY_USAGE] [--small-count] [-f] graphbase

graphbase

Basename for the input and output files.

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-k <ksize>, --ksize <ksize>

k-mer size to use

--n_tables <n_tables>, -N <n_tables>

Number of tables to use (hidden from --help; superseded by -M/--max-memory-usage)

-U <unique_kmers>, --unique-kmers <unique_kmers>

approximate number of unique kmers in the input set

--fp-rate <fp_rate>

Override the automatic FP rate setting for the current script

--max-tablesize <max_tablesize>, -x <max_tablesize>

Upper bound on the size of each table (hidden from --help; superseded by -M/--max-memory-usage)

-M <max_memory_usage>, --max-memory-usage <max_memory_usage>

maximum amount of memory to use for data structure

--small-count

Reduce memory usage by using a smaller counter for individual kmers.

-f, --force

Continue past warnings

Load a k-mer nodegraph/tagset pair created by load-graph.py, and a set of pmap files created by partition-graph.py. Go through each pmap file, select the largest partition in each, and do the same kind of traversal as in make-initial-stoptags.py from each of the waypoints in that partition; this should identify all of the highly connected k-mers in that partition. These HCKs are output to <graphbase>.stoptags after each pmap file.

Parameter choice is reasonably important. See the pipeline in Partitioning large data sets (50m+ reads) for an example run.

This script is not very scalable and may blow up memory and die horribly. You should be able to use the intermediate stoptags to restart the process, and if you eliminate the already-processed pmap files, you can continue where you left off.
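
A minimal usage sketch, following on from load-graph.py and partition-graph.py (filenames illustrative):

load-graph.py -k 32 example data.fa
partition-graph.py example
find-knots.py example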

filter-stoptags.py

Trim sequences at stoptags.

usage: filter-stoptags.py [--version] [--info] [-h] [-k KSIZE] [-f] input_stoptags_filename input_sequence_filename [input_sequence_filename ...]

input_stoptags_filename

The input .stoptags file.

input_sequence_filename

Input FAST[AQ] sequence filename(s).

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-k <ksize>, --ksize <ksize>

k-mer size

-f, --force

Overwrite output file if it exists

Load stoptags from the given .stoptags file and use them to trim or remove the sequences in <file1-N>. Trimmed sequences will be placed in <fileN>.stopfilt.
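
A minimal usage sketch, assuming a stoptags file produced by make-initial-stoptags.py or find-knots.py (filenames illustrative; results will be in data.fa.stopfilt):

filter-stoptags.py -k 32 example.stoptags data.fa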

Digital normalization

normalize-by-median.py

Do digital normalization (remove mostly redundant sequences)

usage: normalize-by-median.py [--version] [--info] [-h] [-k KSIZE] [-U UNIQUE_KMERS] [--fp-rate FP_RATE] [-M MAX_MEMORY_USAGE] [--small-count] [-q] [-C CUTOFF] [-p] [--force_single] [-u unpaired_reads_filename] [-s filename] [-R report_filename] [--report-frequency report_frequency] [-f] [-o filename] [-l filename] [--gzip | --bzip] input_sequence_filename [input_sequence_filename ...]

input_sequence_filename

Input FAST[AQ] sequence filename.

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-k <ksize>, --ksize <ksize>

k-mer size to use

--n_tables <n_tables>, -N <n_tables>

Number of tables to use (hidden from --help; superseded by -M/--max-memory-usage)

-U <unique_kmers>, --unique-kmers <unique_kmers>

approximate number of unique kmers in the input set

--fp-rate <fp_rate>

Override the automatic FP rate setting for the current script

--max-tablesize <max_tablesize>, -x <max_tablesize>

Upper bound on the size of each table (hidden from --help; superseded by -M/--max-memory-usage)

-M <max_memory_usage>, --max-memory-usage <max_memory_usage>

maximum amount of memory to use for data structure

--small-count

Reduce memory usage by using a smaller counter for individual kmers.

-q, --quiet

-C <cutoff>, --cutoff <cutoff>

when the median k-mer coverage level is above this number, the read is not kept.

-p, --paired

require that all sequences be properly paired

--force_single

treat all sequences as single-ended/unpaired

-u <unpaired_reads_filename>, --unpaired-reads <unpaired_reads_filename>

include a file of unpaired reads to which -p/--paired does not apply.

-s <filename>, --savegraph <filename>

save the k-mer countgraph to disk after all reads are loaded.

-R <report_filename>, --report <report_filename>

write progress report to report_filename

--report-frequency <report_frequency>

report progress every report_frequency reads

-f, --force

continue past file reading errors

-o <filename>, --output <filename>

only output a single file with the specified filename; use a single dash "-" to specify that output should go to STDOUT (the terminal)

-l <filename>, --loadgraph <filename>

load a precomputed k-mer graph from disk

--gzip

Compress output using gzip

--bzip

Compress output using bzip2

Discard sequences based on whether or not their median k-mer abundance lies above a specified cutoff. Kept sequences will be placed in <fileN>.keep.

By default, paired end reads will be considered together; if either read should be kept, both will be kept. (This keeps both reads from a fragment, and helps with retention of repeats.) Unpaired reads are treated individually.

If -p/--paired is set, then proper pairing is required and the script will exit on unpaired reads, although --unpaired-reads can be used to supply a file of orphan reads to be read after the paired reads.

--force_single will ignore all pairing information and treat reads individually.

With -s/--savegraph, the k-mer countgraph will be saved to the specified file after all sequences have been processed. -l/--loadgraph will load the specified k-mer countgraph before processing the specified files. Note that these graphs are in the same format as those produced by load-into-counting.py and consumed by abundance-dist.py.

To append reads to an output file (rather than overwriting it), send output to STDOUT with --output - and use UNIX file redirection syntax (>>) to append to the file.

Example:

normalize-by-median.py -k 17 tests/test-data/test-abund-read-2.fa

Example:

normalize-by-median.py -p -k 17 \
tests/test-data/test-abund-read-paired.fa

Example:

normalize-by-median.py -p -k 17 -o - tests/test-data/paired.fq \
>> appended-output.fq

Example:

normalize-by-median.py -k 17 -f tests/test-data/test-error-reads.fq \
tests/test-data/test-fastq-reads.fq

Example:

normalize-by-median.py -k 17 -s test.ct \
tests/test-data/test-abund-read-2.fa \
tests/test-data/test-fastq-reads.fq
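
Because the saved countgraph is in the same format consumed by abundance-dist.py, the graph saved in the last example can then be reused, for example:

abundance-dist.py test.ct tests/test-data/test-abund-read-2.fa test-dist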

Read handling: interleaving, splitting, etc.

extract-long-sequences.py

Extract FASTQ or FASTA sequences longer than a specified length (default: 200 bp).

usage: extract-long-sequences.py [--version] [--info] [-h] [-o output] [-l LENGTH] [--gzip | --bzip] input_filenames [input_filenames ...]

input_filenames

Input FAST[AQ] sequence filename.

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-o <output>, --output <output>

The name of the output sequence file.

-l <length>, --length <length>

The minimum length of sequence to extract (default: 200 bp).

--gzip

Compress output using gzip

--bzip

Compress output using bzip2

Example:

extract-long-sequences.py --length 10 tests/test-data/paired-mixed.fa

extract-paired-reads.py

Take a mixture of reads and split into pairs and orphans.

usage: extract-paired-reads.py [--version] [--info] [-h] [-d OUTPUT_DIR] [-p filename] [-s filename] [-f] [--gzip | --bzip] [infile]

infile

Input FAST[AQ] file of mixed paired and orphaned reads.

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-d <output_dir>, --output-dir <output_dir>

Output split reads to specified directory. Creates directory if necessary

-p <filename>, --output-paired <filename>

Output paired reads to this file

-s <filename>, --output-single <filename>

Output orphaned reads to this file

-f, --force

Overwrite output file if it exists

--gzip

Compress output using gzip

--bzip

Compress output using bzip2

Many read-handling programs (assemblers, mappers, etc.) require that you give them either perfectly interleaved files, or files containing only single reads. This script takes files that were originally interleaved but where reads may have been orphaned (via error filtering, application of abundance filtering, digital normalization in non-paired mode, or partitioning) and separates the interleaved reads from the orphaned reads.

The default output is two files, <input file>.pe and <input file>.se, placed in the current directory. The .pe file contains interleaved and properly paired sequences, while the .se file contains orphan sequences.

The directory into which the interleaved and orphaned reads are output may be specified using -d/--output-dir. This directory will be created if it does not already exist.

Alternatively, you can specify the filenames directly with -p/--output-paired and -s/--output-single, which will override the -d/--output-dir option.
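
For example, to name the outputs directly (output filenames illustrative):

extract-paired-reads.py -p paired.fq -s orphans.fq tests/test-data/paired-mixed.fa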

Example:

extract-paired-reads.py tests/test-data/paired.fq

fastq-to-fasta.py

Convert FASTQ format (.fq) files to FASTA format (.fa).

usage: fastq-to-fasta.py [--version] [--info] [-h] [-o filename] [-n] [--gzip | --bzip] input_sequence

input_sequence

The name of the input FASTQ sequence file.

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-o <filename>, --output <filename>

The name of the output FASTA sequence file.

-n, --n_keep

Keep reads containing 'N's in the input sequence file. The default is to drop such reads.

--gzip

Compress output using gzip

--bzip

Compress output using bzip2
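
A minimal usage sketch (output filename illustrative):

fastq-to-fasta.py -o test-fastq-reads.fa tests/test-data/test-fastq-reads.fq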

interleave-reads.py

Produce interleaved files from R1/R2 paired files

usage: interleave-reads.py [--version] [--info] [-h] [-o filename] [--no-reformat] [-f] [--gzip | --bzip] left right

left

The R1 (left) file of paired reads.

right

The R2 (right) file of paired reads.

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-o <filename>, --output <filename>

The name of the output interleaved file (default: STDOUT).

--no-reformat

Do not reformat read names or enforce consistency

-f, --force

Overwrite output file if it exists

--gzip

Compress output using gzip

--bzip

Compress output using bzip2

The output is an interleaved set of reads, with each read in <R1> paired with a read in <R2>. By default, the output goes to stdout unless -o/--output is specified.

As a "bonus", this file ensures that if read names are not already formatted properly, they are reformatted consistently, such that they look like the pre-1.8 Casava format (@name/1, @name/2). This reformatting can be switched off with the --no-reformat flag.

Example:

interleave-reads.py tests/test-data/paired.fq.1 \
        tests/test-data/paired.fq.2 -o paired.fq

readstats.py

Display summary statistics for one or more FASTA/FASTQ files.

usage: readstats.py [--version] [--info] [-h] [-o filename] [--csv] filenames [filenames ...]

filenames

Input FAST[AQ] sequence filename(s).

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-o <filename>, --output <filename>

output file for statistics; defaults to stdout.

--csv

Use the CSV format for the statistics, including column headers.

Report the number of bases, number of sequences, and average sequence length for one or more FASTA/FASTQ files, and report aggregate statistics at the end.

With -o/--output, the output will be saved to the specified file.
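
For example, to write CSV-formatted statistics to a file (output filename illustrative):

readstats.py --csv -o stats.csv tests/test-data/test-abund-read-2.fa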

Example:

readstats.py tests/test-data/test-abund-read-2.fa

sample-reads-randomly.py

Uniformly subsample sequences from a collection of files

usage: sample-reads-randomly.py [--version] [--info] [-h] [-N NUM_READS] [-M MAX_READS] [-S NUM_SAMPLES] [-R RANDOM_SEED] [--force_single] [-o filename] [-f] [--gzip | --bzip] filenames [filenames ...]

filenames

Input FAST[AQ] sequence filename(s).

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-N <num_reads>, --num_reads <num_reads>

Number of reads to subsample (default: 100,000)

-M <max_reads>, --max_reads <max_reads>

Stop sampling after this many reads (default: 100 million)

-S <num_samples>, --samples <num_samples>

Number of subsamples to take (default: 1)

-R <random_seed>, --random-seed <random_seed>

Provide a random seed for the generator

--force_single

Ignore read pair information if present

-o <filename>, --output <filename>

Output the subsampled reads to this file.

-f, --force

Overwrite output file if it exists

--gzip

Compress output using gzip

--bzip

Compress output using bzip2

Take a list of files containing sequences, and subsample 100,000 sequences (-N/--num_reads) uniformly. Stop after the first 100 million sequences (-M/--max_reads). By default, one subsample is taken; use -S/--samples to take more.

The output is placed in -o/--output <file> (for a single sample) or in <file>.subset.0 to <file>.subset.S-1 (for more than one sample).

This script uses the reservoir sampling algorithm.
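
A minimal usage sketch (parameter values and filenames illustrative): take three subsamples of 100 reads each, stopping after the first million input reads; per the naming described above, the results land in subset.subset.0 through subset.subset.2:

sample-reads-randomly.py -N 100 -S 3 -M 1000000 -o subset tests/test-data/test-reads.fq.gz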

split-paired-reads.py

Split interleaved reads into two files, left and right.

usage: split-paired-reads.py [--version] [--info] [-h] [-d output_directory] [-0 output_orphaned] [-1 output_first] [-2 output_second] [-f] [--gzip | --bzip] [infile]

infile

Input interleaved FAST[AQ] file.

--version

show program's version number and exit

--info

print citation information

-h, --help

show this help message and exit

-d <output_directory>, --output-dir <output_directory>

Output split reads to specified directory. Creates directory if necessary

-0 <output_orphaned>, --output-orphaned <output_orphaned>

Allow "orphaned" reads and extract them to this file

-1 <output_first>, --output-first <output_first>

Output "left" reads to this file

-2 <output_second>, --output-second <output_second>

Output "right" reads to this file

-f, --force

Overwrite output file if it exists

--gzip

Compress output using gzip

--bzip

Compress output using bzip2

Some programs want paired-end read input in the One True Format, which is interleaved; other programs want input in the Insanely Bad Format, with left- and right- reads separated. This reformats the former to the latter.

The directory into which the left- and right- reads are output may be specified using -d/--output-dir. This directory will be created if it does not already exist.

Alternatively, you can specify the filenames directly with -1/--output-first and -2/--output-second, which will override the -d/--output-dir setting on a file-specific basis.

-0/--output-orphaned will allow broken-paired format, and orphaned reads will be saved separately, to the specified file.

Example:

split-paired-reads.py tests/test-data/paired.fq

Example:

split-paired-reads.py -0 reads-output-file tests/test-data/paired.fq

Example:

split-paired-reads.py -1 reads.1 -2 reads.2 tests/test-data/paired.fq