khmer's command-line interface
The simplest way to use khmer's functionality is through the command-line scripts, located in the scripts/ directory of the khmer distribution. Below is our documentation for these scripts. Note that all scripts can be given -h, which will print out a list of arguments taken by that script.
Many scripts take -x and -N parameters, which drive khmer's memory usage. These parameters depend on details of your data set; for more information on how to choose them, see Choosing table sizes for khmer.
You can also override the default values of --ksize/-k, --n_tables/-N, and --min-tablesize/-x with the environment variables KHMER_KSIZE, KHMER_N_TABLES, and KHMER_MIN_TABLESIZE respectively.
- k-mer counting and abundance filtering
- Partitioning
- Digital normalization
- Read handling: interleaving, splitting, etc.
Note
Almost all scripts take in either FASTA or FASTQ format, and output the same. Some scripts may only recognize FASTQ if the file ending is '.fq' or '.fastq', at least for now.
Files ending with '.gz' will be treated as gzipped files, and files ending with '.bz2' will be treated as bzip2'd files.
k-mer counting and abundance filtering
load-into-counting.py
Build a k-mer counting table from the given sequences.
usage: load-into-counting.py [-h] [--version] [-q] [--ksize KSIZE] [--n_tables N_TABLES] [--min-tablesize MIN_TABLESIZE] [--threads THREADS] [-b] [--summary-info FORMAT] [--report-total-kmers] [-f] output_countingtable_filename input_sequence_filename [input_sequence_filename ...]
- output_countingtable_filename: The name of the file to write the k-mer counting table to.
- input_sequence_filename: The names of one or more FAST[AQ] input sequence files.
- -h, --help: show this help message and exit
- --version: show program's version number and exit
- -q, --quiet
- --ksize <int>, -k <int>: k-mer size to use
- --n_tables <int>, -N <int>: number of k-mer counting tables to use
- --min-tablesize <float>, -x <float>: lower bound on tablesize to use
- --threads <int>, -T <int>: Number of simultaneous threads to execute
- -b, --no-bigcount: The default behaviour is to count past 255 using bigcount. This flag turns bigcount off, limiting counts to 255.
- --summary-info <format>, -s <format>: What format should the machine readable run summary be in? (json or tsv, disabled by default)
- --report-total-kmers, -t: Prints the total number of k-mers to stderr
- -f, --force: Overwrite output file if it exists
Note: with -b the output will be the exact size of the k-mer counting table and this script will use a constant amount of memory. In exchange, k-mer counts will stop at 255. The memory usage of this script with -b will be about 1.15x the product of the -x and -N numbers.
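As a rough illustration of the memory rule above, the following Python sketch multiplies out the estimate. The helper name is hypothetical and not part of khmer, and it assumes the product of -x and -N is measured in bytes (one byte per table entry, consistent with counts capped at 255); the value -N 4 is only an example.

```python
def estimate_memory_bytes(min_tablesize, n_tables, overhead=1.15):
    """Rough memory estimate for a counting table built with -b.

    Assumption: one byte per table entry, plus the ~1.15x overhead
    factor quoted in the note above.
    """
    return overhead * min_tablesize * n_tables


# Hypothetical run: -x 5e7 with four tables (-N 4).
estimate = estimate_memory_bytes(5e7, 4)
print(f"approx. {estimate / 1e6:.0f} MB")
```

Doubling either -x or -N doubles the estimate, which is why both parameters matter when sizing a run.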
Example:
load-into-counting.py -k 20 -x 5e7 out.ct data/100k-filtered.fa
Multiple threads can be used to accelerate the process, if you have extra cores to spare.
Example:
load-into-counting.py -k 20 -x 5e7 -T 4 out.ct data/100k-filtered.fa
abundance-dist.py
Calculate abundance distribution of the k-mers in the sequence file using a pre-made k-mer counting table.
usage: abundance-dist.py [-h] [-z] [-s] [--csv] [--version] [-f] input_counting_table_filename input_sequence_filename output_histogram_filename
- input_counting_table_filename: The name of the input k-mer counting table file.
- input_sequence_filename: The name of the input FAST[AQ] sequence file.
- output_histogram_filename: The columns are: (1) k-mer abundance, (2) k-mer count, (3) cumulative count, (4) fraction of total distinct k-mers.
- -h, --help: show this help message and exit
- -z, --no-zero: Do not output 0-count bins
- -s, --squash: Overwrite existing output_histogram_filename
- --csv: Use the CSV format for the histogram. Includes column headers.
- --version: show program's version number and exit
- -f, --force: Continue even if specified input files do not exist or are empty.
abundance-dist-single.py
Calculate the abundance distribution of k-mers from a single sequence file.
usage: abundance-dist-single.py [-h] [--version] [-q] [--ksize KSIZE] [--n_tables N_TABLES] [--min-tablesize MIN_TABLESIZE] [--threads THREADS] [-z] [-b] [-s] [--csv] [--savetable filename] [--report-total-kmers] [-f] input_sequence_filename output_histogram_filename
- input_sequence_filename: The name of the input FAST[AQ] sequence file.
- output_histogram_filename: The name of the output histogram file. The columns are: (1) k-mer abundance, (2) k-mer count, (3) cumulative count, (4) fraction of total distinct k-mers.
- -h, --help: show this help message and exit
- --version: show program's version number and exit
- -q, --quiet
- --ksize <int>, -k <int>: k-mer size to use
- --n_tables <int>, -N <int>: number of k-mer counting tables to use
- --min-tablesize <float>, -x <float>: lower bound on tablesize to use
- --threads <int>, -T <int>: Number of simultaneous threads to execute
- -z, --no-zero: Do not output 0-count bins
- -b, --no-bigcount: Do not count k-mers past 255
- -s, --squash: Overwrite output file if it exists
- --csv: Use the CSV format for the histogram. Includes column headers.
- --savetable <filename>: Save the k-mer counting table to the specified filename.
- --report-total-kmers, -t: Prints the total number of k-mers to stderr
- -f, --force: Overwrite output file if it exists
Note that with -b this script is constant memory; in exchange, k-mer counts will stop at 255. The memory usage of this script with -b will be about 1.15x the product of the -x and -N numbers.
To count k-mers in multiple files use load-into-counting.py and abundance-dist.py.
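To make the four histogram columns concrete, here is a minimal Python sketch that builds the same kind of table from exact k-mer counts. It is not khmer's implementation: it uses a plain Counter instead of a probabilistic counting table, ignores reverse complements, omits zero-count bins (similar in spirit to -z), and assumes column 4 is the cumulative count divided by the number of distinct k-mers. The function names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return every substring of length k in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def abundance_histogram(sequences, k):
    """Return rows of (abundance, count, cumulative count, fraction)."""
    counts = Counter(km for seq in sequences for km in kmers(seq, k))
    # Map each abundance value to the number of distinct k-mers having it.
    hist = Counter(counts.values())
    total = sum(hist.values())
    rows, cumulative = [], 0
    for abundance in sorted(hist):
        cumulative += hist[abundance]
        rows.append((abundance, hist[abundance], cumulative, cumulative / total))
    return rows


for row in abundance_histogram(["ACGT", "ACGA"], 3):
    print(*row)
```

An exact Counter only suits small inputs; the point of the khmer scripts is to produce this kind of table within a fixed memory budget.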
filter-abund.py
Trim sequences at a minimum k-mer abundance.
usage: filter-abund.py [-h] [--threads THREADS] [--cutoff CUTOFF] [--variable-coverage] [--normalize-to NORMALIZE_TO] [-o optional_output_filename] [--version] [-f] input_counting_table_filename input_sequence_filename [input_sequence_filename ...]
- input_counting_table_filename: The input k-mer counting table filename
- input_sequence_filename: Input FAST[AQ] sequence filename
- -h, --help: show this help message and exit
- --threads <int>, -T <int>: Number of simultaneous threads to execute
- --cutoff <int>, -C <int>: Trim at k-mers below this abundance.
- --variable-coverage, -V: Only trim low-abundance k-mers from sequences that have high coverage.
- --normalize-to <int>, -Z <int>: Base the variable-coverage cutoff on this median k-mer abundance.
- -o <optional_output_filename>, --out <optional_output_filename>: Output the trimmed sequences into a single file with the given filename instead of creating a new file for each input file.
- --version: show program's version number and exit
- -f, --force: Overwrite output file if it exists
Trimmed sequences will be placed in ${input_sequence_filename}.abundfilt for each input sequence file. If the input sequences are from RNAseq or metagenome sequencing then --variable-coverage should be used.
Example:
load-into-counting.py -k 20 -x 5e7 table.ct data/100k-filtered.fa
filter-abund.py -C 2 table.ct data/100k-filtered.fa
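The trimming idea can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than what filter-abund.py actually runs: counts are exact, reverse complements are ignored, and a read is assumed to be cut just before the first k-mer whose count falls below the cutoff. The function names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """Return every substring of length k in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Exact k-mer counts over all reads (stand-in for a counting table)."""
    return Counter(km for read in reads for km in kmers(read, k))


def trim_at_low_abundance(seq, counts, k, cutoff):
    """Truncate seq just before the first k-mer with count below cutoff."""
    for i, kmer in enumerate(kmers(seq, k)):
        if counts[kmer] < cutoff:
            # Keep only the bases covered by the preceding k-mers.
            return seq[:i + k - 1]
    return seq


reads = ["ACGTACGT", "ACGTACGA"]
table = count_kmers(reads, 4)
print([trim_at_low_abundance(read, table, 4, 2) for read in reads])
```

In this toy input the final k-mer of the second read occurs only once, so that read loses its last base while the first read is left untouched.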
filter-abund-single.py
Trims sequences at a minimum k-mer abundance (in memory version).
usage: filter-abund-single.py [-h] [--version] [-q] [--ksize KSIZE] [--n_tables N_TABLES] [--min-tablesize MIN_TABLESIZE] [--threads THREADS] [--cutoff CUTOFF] [--savetable filename] [--report-total-kmers] [-f] input_sequence_filename
- input_sequence_filename: FAST[AQ] sequence file to trim
- -h, --help: show this help message and exit
- --version: show program's version number and exit
- -q, --quiet
- --ksize <int>, -k <int>: k-mer size to use
- --n_tables <int>, -N <int>: number of k-mer counting tables to use
- --min-tablesize <float>, -x <float>: lower bound on tablesize to use
- --threads <int>, -T <int>: Number of simultaneous threads to execute
- --cutoff <int>, -C <int>: Trim at k-mers below this abundance.
- --savetable <filename>: If present, the name of the file to save the k-mer counting table to
- --report-total-kmers, -t: Prints the total number of k-mers to stderr
- -f, --force: Overwrite output file if it exists
Trimmed sequences will be placed in ${input_sequence_filename}.abundfilt.
This script is constant memory.
To trim reads based on k-mer abundance across multiple files, use load-into-counting.py and filter-abund.py.
Example:
filter-abund-single.py -k 20 -x 5e7 -C 2 data/100k-filtered.fa
trim-low-abund.py
Trim low-abundance k-mers using a streaming algorithm.
usage: trim-low-abund.py [-h] [--version] [-q] [--ksize KSIZE] [--n_tables N_TABLES] [--min-tablesize MIN_TABLESIZE] [--cutoff CUTOFF] [--normalize-to NORMALIZE_TO] [-o filename] [--variable-coverage] [-l filename] [-s filename] [--force] [--ignore-pairs] [--tempdir TEMPDIR] input_filenames [input_filenames ...]
- input_filenames
- -h, --help: show this help message and exit
- --version: show program's version number and exit
- -q, --quiet
- --ksize <int>, -k <int>: k-mer size to use
- --n_tables <int>, -N <int>: number of k-mer counting tables to use
- --min-tablesize <float>, -x <float>: lower bound on tablesize to use
- --cutoff <int>, -C <int>: remove k-mers below this abundance
- --normalize-to <int>, -Z <int>: base cutoff on this median k-mer abundance
- -o <filename>, --out <filename>: only output a single file with the specified filename; use a single dash "-" to specify that output should go to STDOUT (the terminal)
- --variable-coverage, -V: Only trim low-abundance k-mers from sequences that have high coverage.
- -l <filename>, --loadtable <filename>: load a precomputed k-mer table from disk
- -s <filename>, --savetable <filename>: save the k-mer counting table to disk after all reads are loaded.
- --force
- --ignore-pairs
- --tempdir <str>, -T <str>
The output is one file for each input file, <input file>.abundtrim, placed in the current directory. This output contains the input sequences trimmed at low-abundance k-mers.
The -V/--variable-coverage parameter will, if specified, prevent elimination of low-abundance reads by only trimming low-abundance k-mers from high-abundance reads; use this for non-genomic data sets that may have variable coverage.
Note that the output reads will not necessarily be in the same order as the reads in the input files; if this is an important consideration, use load-into-counting.py and filter-abund.py. However, read pairs will be kept together, in "broken-paired" format; you can use extract-paired-reads.py to extract read pairs and orphans.
Example:
trim-low-abund.py -x 5e7 -k 20 -C 2 data/100k-filtered.fa
count-median.py
Compute summary statistics of k-mer counts for sequences.
usage: count-median.py [-h] [--version] [-f] [--csv] input_counting_table_filename input_sequence_filename output_summary_filename
- input_counting_table_filename: input k-mer count table filename
- input_sequence_filename: input FAST[AQ] sequence filename
- output_summary_filename: output summary filename
- -h, --help: show this help message and exit
- --version: show program's version number and exit
- -f, --force: Overwrite output file if it exists
- --csv: Use the CSV format for the histogram. Includes column headers.
Count the median/avg k-mer abundance for each sequence in the input file, based on the k-mer counts in the given k-mer counting table. Can be used to estimate expression levels (mRNAseq) or coverage (genomic/metagenomic).
The output file contains sequence id, median, average, stddev, and seq length; fields are separated by spaces. For khmer 1.x count-median.py will split sequence names at the first space which means that some sequence formats (e.g. paired FASTQ in Casava 1.8 format) will yield uninformative names. Use --csv to fix this behavior.
Example:
count-median.py counts.ct tests/test-data/test-reads.fq.gz medians.txt
NOTE: All 'N's in the input sequences are converted to 'G's.
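The per-sequence statistics can be illustrated with a short Python sketch. It is an assumption-laden stand-in, not the script's code: counts come from an exact Counter, the median follows Python's statistics.median (which averages the two middle values for even-length input), and the standard deviation is the population form. The function names are hypothetical.

```python
from collections import Counter
from statistics import mean, median, pstdev


def kmers(seq, k):
    """Return every substring of length k in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def sequence_summary(seq, counts, k):
    """Return (median, average, stddev, length) of k-mer counts in seq."""
    values = [counts[km] for km in kmers(seq, k)]
    return median(values), mean(values), pstdev(values), len(seq)


reads = ["ACGTACGT", "ACGTACGA"]
table = Counter(km for read in reads for km in kmers(read, 4))
for read in reads:
    print(read, *sequence_summary(read, table, 4))
```

The median is less sensitive than the average to a handful of erroneous, low-count k-mers, which is why it is a useful coverage estimate.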
count-overlap.py
Count the overlap k-mers, i.e. the k-mers that appear in both of two sequence datasets.
usage: count-overlap.py [-h] [--version] [-q] [--ksize KSIZE] [--n_tables N_TABLES] [--min-tablesize MIN_TABLESIZE] [--csv] [-f] input_presence_table_filename input_sequence_filename output_report_filename
- input_presence_table_filename: input k-mer presence table filename
- input_sequence_filename: input sequence filename
- output_report_filename: output report filename
- -h, --help: show this help message and exit
- --version: show program's version number and exit
- -q, --quiet
- --ksize <int>, -k <int>: k-mer size to use
- --n_tables <int>, -N <int>: number of k-mer counting tables to use
- --min-tablesize <float>, -x <float>: lower bound on tablesize to use
- --csv: Use the CSV format for the curve output in ${output_report_filename}.curve, including column headers.
- -f, --force: Overwrite output file if it exists
An additional report will be written to ${output_report_filename}.curve containing the increase of overlap k-mers as the number of sequences in the second database increases.
Partitioning
do-partition.py
Load, partition, and annotate FAST[AQ] sequences
usage: do-partition.py [-h] [--version] [-q] [--ksize KSIZE] [--n_tables N_TABLES] [--min-tablesize MIN_TABLESIZE] [--threads THREADS] [--subset-size SUBSET_SIZE] [--no-big-traverse] [--keep-subsets] [-f] graphbase input_sequence_filename [input_sequence_filename ...]
- graphbase: base name for output files
- input_sequence_filename: input FAST[AQ] sequence filenames
- -h, --help: show this help message and exit
- --version: show program's version number and exit
- -q, --quiet
- --ksize <int>, -k <int>: k-mer size to use
- --n_tables <int>, -N <int>: number of k-mer counting tables to use
- --min-tablesize <float>, -x <float>: lower bound on tablesize to use
- --threads <int>, -T <int>: Number of simultaneous threads to execute
- --subset-size <float>, -s <float>: Set subset size (usually 1e5-1e6 is good)
- --no-big-traverse: Truncate graph joins at big traversals
- --keep-subsets: Keep individual subsets (default: False)
- -f, --force: Overwrite output file if it exists
Load in a set of sequences, partition them, merge the partitions, and annotate the original sequence files with the partition information.
This script combines the functionality of load-graph.py, partition-graph.py, merge-partitions.py, and annotate-partitions.py into one script. This is convenient but should probably not be used for large data sets, because do-partition.py doesn't provide save/resume functionality.
load-graph.py
Load sequences into the compressible graph format plus optional tagset.
usage: load-graph.py [-h] [--version] [-q] [--ksize KSIZE] [--n_tables N_TABLES] [--min-tablesize MIN_TABLESIZE] [--threads THREADS] [--no-build-tagset] [--report-total-kmers] [--write-fp-rate] [-f] output_presence_table_filename input_sequence_filename [input_sequence_filename ...]
- output_presence_table_filename: output k-mer presence table filename.
- input_sequence_filename: input FAST[AQ] sequence filename
- -h, --help: show this help message and exit
- --version: show program's version number and exit
- -q, --quiet
- --ksize <int>, -k <int>: k-mer size to use
- --n_tables <int>, -N <int>: number of k-mer counting tables to use
- --min-tablesize <float>, -x <float>: lower bound on tablesize to use
- --threads <int>, -T <int>: Number of simultaneous threads to execute
- --no-build-tagset: Do NOT construct tagset while loading sequences
- --report-total-kmers, -t: Prints the total number of k-mers to stderr
- --write-fp-rate, -w: Write false positive rate into .info file
- -f, --force: Overwrite output file if it exists
See extract-partitions.py for a complete workflow.
partition-graph.py
Partition a sequence graph based upon waypoint connectivity
usage: partition-graph.py [-h] [--stoptags filename] [--subset-size SUBSET_SIZE] [--no-big-traverse] [--version] [-f] [--threads THREADS] basename
- basename: basename of the input k-mer presence table + tagset files
- -h, --help: show this help message and exit
- --stoptags <filename>: Use stoptags in this file during partitioning
- --subset-size <float>, -s <float>: Set subset size (usually 1e5-1e6 is good)
- --no-big-traverse: Truncate graph joins at big traversals
- --version: show program's version number and exit
- -f, --force: Overwrite output file if it exists
- --threads <int>, -T <int>: Number of simultaneous threads to execute
The resulting partition maps are saved as '${basename}.subset.#.pmap' files.
See 'Artifact removal' to understand the stoptags argument.
merge-partitions.py
Merge partition map '.pmap' files.
usage: merge-partitions.py [-h] [--ksize KSIZE] [--keep-subsets] [--version] [-f] graphbase
- graphbase: basename for input and output files
- -h, --help: show this help message and exit
- --ksize <int>, -k <int>: k-mer size (default: 32)
- --keep-subsets: Keep individual subsets (default: False)
- --version: show program's version number and exit
- -f, --force: Overwrite output file if it exists
Take the ${graphbase}.subset.#.pmap files and merge them all into a single ${graphbase}.pmap.merged file for annotate-partitions.py to use.
annotate-partitions.py
Annotate sequences with partition IDs.
usage: annotate-partitions.py [-h] [--ksize KSIZE] [--version] [-f] graphbase input_sequence_filename [input_sequence_filename ...]
- graphbase: basename for input and output files
- input_sequence_filename: input FAST[AQ] sequences to annotate.
- -h, --help: show this help message and exit
- --ksize <int>, -k <int>: k-mer size (default: 32)
- --version: show program's version number and exit
- -f, --force: Overwrite output file if it exists
Load in a partition map (generally produced by partition-graph.py or merge-partitions.py) and annotate the sequences in the given files with their partition IDs. Use extract-partitions.py to extract sequences into separate group files.
Example (results will be in random-20-a.fa.part):
load-graph.py -k 20 example tests/test-data/random-20-a.fa
partition-graph.py example
merge-partitions.py -k 20 example
annotate-partitions.py -k 20 example tests/test-data/random-20-a.fa
extract-partitions.py
Separate sequences that are annotated with partitions into grouped files.
usage: extract-partitions.py [-h] [--max-size MAX_SIZE] [--min-partition-size MIN_PART_SIZE] [--no-output-groups] [--output-unassigned] [--version] [-f] output_filename_prefix input_partition_filename [input_partition_filename ...]
- output_filename_prefix
- input_partition_filename
- -h, --help: show this help message and exit
- --max-size <int>, -X <int>: Max group size (n sequences)
- --min-partition-size <int>, -m <int>: Minimum partition size worth keeping
- --no-output-groups, -n: Do not actually output groups files.
- --output-unassigned, -U: Output unassigned sequences, too
- --version: show program's version number and exit
- -f, --force: Overwrite output file if it exists
Example (results will be in example.group0000.fa):
load-graph.py -k 20 example tests/test-data/random-20-a.fa
partition-graph.py example
merge-partitions.py -k 20 example
annotate-partitions.py -k 20 example tests/test-data/random-20-a.fa
extract-partitions.py example random-20-a.fa.part
(extract-partitions.py will produce a partition size distribution
in <base>.dist. The columns are: (1) number of reads, (2) count
of partitions with n reads, (3) cumulative sum of partitions,
(4) cumulative sum of reads.)
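The four columns of the partition size distribution can be reproduced with a small Python sketch. The function name is hypothetical, the input is simply one partition ID per read, and the ascending sort order is an assumption; the real .dist file may order its rows differently.

```python
from collections import Counter


def partition_size_distribution(partition_ids):
    """Return rows of (n reads, partitions with n reads,
    cumulative partitions, cumulative reads)."""
    sizes = Counter(partition_ids)      # partition ID -> number of reads
    by_size = Counter(sizes.values())   # n reads -> number of partitions
    rows, cum_partitions, cum_reads = [], 0, 0
    for n in sorted(by_size):
        cum_partitions += by_size[n]
        cum_reads += n * by_size[n]
        rows.append((n, by_size[n], cum_partitions, cum_reads))
    return rows


# Six reads assigned to four partitions.
for row in partition_size_distribution([1, 1, 2, 3, 3, 4]):
    print(*row)
```

The last row's cumulative columns equal the total number of partitions and reads, which makes a quick sanity check on the output.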
Artifact removal
The following scripts are specialized scripts for finding and removing highly-connected k-mers (HCKs). See Partitioning large data sets (50m+ reads).
make-initial-stoptags.py
Find an initial set of highly connected k-mers.
usage: make-initial-stoptags.py [-h] [--version] [-q] [--ksize KSIZE] [--n_tables N_TABLES] [--min-tablesize MIN_TABLESIZE] [--subset-size SUBSET_SIZE] [--stoptags filename] [-f] graphbase
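- graphbase: basename for input and output filenames
- -h: show this help message and exit
- --version: show program's version number and exit
- --ksize KSIZE: k-mer size to use
- --n_tables N_TABLES: number of k-mer counting tables to use
- --min-tablesize MIN_TABLESIZE: lower bound on tablesize to use
- --subset-size SUBSET_SIZE: Set subset size (default 1e4 is prob ok)
- --stoptags filename: Use stoptags in this file during partitioning
- -f: Overwrite output file if it exists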
Loads a k-mer presence table/tagset pair created by load-graph.py, and does a small set of traversals from graph waypoints; on these traversals, looks for k-mers that are repeatedly traversed in high-density regions of the graph, i.e. are highly connected. Outputs those k-mers as an initial set of stoptags, which can be fed into partition-graph.py, find-knots.py, and filter-stoptags.py.
The k-mer counting table size parameters are for a k-mer counting table to keep track of repeatedly-traversed k-mers. The subset size option specifies the number of waypoints from which to traverse; for highly connected data sets, the default (1000) is probably ok.
find-knots.py
Find all highly connected k-mers.
usage: find-knots.py [-h] [--n_tables N_TABLES] [--min-tablesize MIN_TABLESIZE] [--version] graphbase
- graphbase: Basename for the input and output files.
- -h, --help: show this help message and exit
- --n_tables <int>, -N <int>: number of k-mer counting tables to use
- --min-tablesize <float>, -x <float>: lower bound on the size of the k-mer counting table(s)
- --version: show program's version number and exit
Load a k-mer presence table/tagset pair created by load-graph.py, and a set of pmap files created by partition-graph.py. Go through each pmap file, select the largest partition in each, and do the same kind of traversal as in make-initial-stoptags.py from each of the waypoints in that partition; this should identify all of the HCKs in that partition. These HCKs are output to <graphbase>.stoptags after each pmap file.
Parameter choice is reasonably important. See the pipeline in Partitioning large data sets (50m+ reads) for an example run.
This script is not very scalable and may blow up memory and die horribly. You should be able to use the intermediate stoptags to restart the process, and if you eliminate the already-processed pmap files, you can continue where you left off.
filter-stoptags.py
Trim sequences at stoptags.
usage: filter-stoptags.py [-h] [--ksize KSIZE] [--version] [-f] input_stoptags_filename input_sequence_filename [input_sequence_filename ...]
- -h: show this help message and exit
- --ksize KSIZE: k-mer size
- --version: show program's version number and exit
- -f: Overwrite output file if it exists
Load stoptags in from the given .stoptags file and use them to trim or remove the sequences in <file1-N>. Trimmed sequences will be placed in <fileN>.stopfilt.
Digital normalization
normalize-by-median.py
Do digital normalization (remove mostly redundant sequences)
usage: normalize-by-median.py [-h] [--version] [-q] [--ksize KSIZE] [--n_tables N_TABLES] [--min-tablesize MIN_TABLESIZE] [-C CUTOFF] [-p] [-u unpaired_reads_filename] [-s filename] [-R filename] [-f] [--save-on-failure] [-d DUMP_FREQUENCY] [-o filename] [--report-total-kmers] [--force] [-l filename] input_sequence_filename [input_sequence_filename ...]
- input_sequence_filename: Input FAST[AQ] sequence filename.
- -h, --help: show this help message and exit
- --version: show program's version number and exit
- -q, --quiet
- --ksize <int>, -k <int>: k-mer size to use
- --n_tables <int>, -N <int>: number of k-mer counting tables to use
- --min-tablesize <float>, -x <float>: lower bound on tablesize to use
- -C <int>, --cutoff <int>
- -p, --paired
- -u <unpaired_reads_filename>, --unpaired-reads <unpaired_reads_filename>: with paired data only, include an unpaired file
- -s <filename>, --savetable <filename>: save the k-mer counting table to disk after all reads are loaded.
- -R <filename>, --report <filename>
- -f, --fault-tolerant: continue on next file if read errors are encountered
- --save-on-failure: Save k-mer counting table when an error occurs
- -d <int>, --dump-frequency <int>: dump k-mer counting table every d files
- -o <filename>, --out <filename>: only output a single file with the specified filename; use a single dash "-" to specify that output should go to STDOUT (the terminal)
- --report-total-kmers, -t: Prints the total number of k-mers post-normalization to stderr
- --force: Overwrite output file if it exists
- -l <filename>, --loadtable <filename>: load a precomputed k-mer table from disk
Discard sequences based on whether or not their median k-mer abundance lies above a specified cutoff. Kept sequences will be placed in <fileN>.keep.
Paired end reads will be considered together if -p is set. If either read will be kept, then both will be kept. This should result in keeping (or discarding) each sequencing fragment. This helps with retention of repeats, especially. With -u/--unpaired-reads, unpaired reads from the specified file will be read after the paired data is read.
With -s/--savetable, the k-mer counting table will be saved to the specified file after all sequences have been processed. With -d, the k-mer counting table will be saved every d files for multifile runs; if -s is set, the specified name will be used, and if not, the name backup.ct will be used. -l/--loadtable will load the specified k-mer counting table before processing the specified files. Note that these tables are in the same format as those produced by load-into-counting.py and consumed by abundance-dist.py.
-f/--fault-tolerant will force the program to continue upon encountering a formatting error in a sequence file; the k-mer counting table up to that point will be dumped, and processing will continue on the next file.
To append reads to an output file (rather than overwriting it), send output to STDOUT with --out - and use UNIX file redirection syntax (>>) to append to the file.
Example:
normalize-by-median.py -k 17 tests/test-data/test-abund-read-2.fa
Example:
normalize-by-median.py -p -k 17 tests/test-data/test-abund-read-paired.fa
Example:
normalize-by-median.py -p -k 17 -o - tests/test-data/paired.fq >> appended-output.fq
Example:
normalize-by-median.py -k 17 -f tests/test-data/test-error-reads.fq tests/test-data/test-fastq-reads.fq
Example:
normalize-by-median.py -k 17 -d 2 -s test.ct tests/test-data/test-abund-read-2.fa tests/test-data/test-fastq-reads
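The core decision in digital normalization can be sketched in Python as follows. This is a simplified illustration, not the script's implementation: it uses exact counts, ignores reverse complements and read pairing, and assumes a read is kept when its median k-mer count so far is strictly below the cutoff and that only kept reads are added to the counts. The function names are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return every substring of length k in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize_by_median(reads, k, cutoff):
    """Return the reads kept by a single streaming pass."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k, nothing to evaluate
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)  # only kept reads raise the counts
            kept.append(read)
    return kept


# Five identical reads: later copies are redundant once coverage is reached.
print(len(normalize_by_median(["ACGTACGT"] * 5, 4, 2)))
```

Because the decision depends only on the counts accumulated so far, the reads can be processed in one pass without loading the whole data set.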
Read handling: interleaving, splitting, etc.
extract-long-sequences.py
Extract FASTQ or FASTA sequences longer than specified length (default: 200 bp).
usage: extract-long-sequences.py [-h] [-o OUTPUT] [-l LENGTH] input_filenames [input_filenames ...]
- input_filenames: Input FAST[AQ] sequence filename.
- -h, --help: show this help message and exit
- -o, --output: The name of the output sequence file.
- -l <int>, --length <int>: The minimum length of a sequence to extract.
extract-paired-reads.py
Take a mixture of reads and split into pairs and orphans.
usage: extract-paired-reads.py [-h] [--version] [-f] infile
- infile
- -h, --help: show this help message and exit
- --version: show program's version number and exit
- -f, --force: Overwrite output file if it exists
The output is two files, <input file>.pe and <input file>.se, placed in the current directory. The .pe file contains interleaved and properly paired sequences, while the .se file contains orphan sequences.
Many assemblers (e.g. Velvet) require that you give them either perfectly interleaved files, or files containing only single reads. This script takes files that were originally interleaved but where reads may have been orphaned via error filtering, application of abundance filtering, digital normalization in non-paired mode, or partitioning.
Example:
extract-paired-reads.py tests/test-data/paired.fq
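The pair/orphan split can be illustrated on read names alone. The Python sketch below is a simplification under stated assumptions: reads are represented only by their names, names use the /1 and /2 suffix convention, and two adjacent reads count as a pair only when they share a base name. The function name is hypothetical and the real script works on full FASTA/FASTQ records.

```python
def split_pairs_and_orphans(names):
    """Split an ordered list of read names into (pairs, orphans)."""
    pairs, orphans = [], []
    i = 0
    while i < len(names):
        current = names[i]
        following = names[i + 1] if i + 1 < len(names) else None
        is_pair = (
            following is not None
            and current.endswith("/1")
            and following.endswith("/2")
            and current[:-2] == following[:-2]
        )
        if is_pair:
            pairs.append((current, following))
            i += 2  # consume both mates
        else:
            orphans.append(current)
            i += 1
    return pairs, orphans


pairs, orphans = split_pairs_and_orphans(
    ["a/1", "a/2", "b/1", "c/2", "d/1", "d/2"]
)
print(pairs)
print(orphans)
```

Here the mates of b and c were lost upstream, so they end up as orphans while a and d remain properly paired.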
fastq-to-fasta.py
Converts FASTQ format (.fq) files to FASTA format (.fa).
usage: fastq-to-fasta.py [-h] [-o filename] [-n] input_sequence
- input_sequence: The name of the input FASTQ sequence file.
- -h, --help: show this help message and exit
- -o <filename>, --output <filename>: The name of the output FASTA sequence file.
- -n, --n_keep: Option to drop reads containing 'N's in input_sequence file.
interleave-reads.py
Produce interleaved files from R1/R2 paired files
usage: interleave-reads.py [-h] [-o filename] [--version] [-f] infiles [infiles ...]
- infiles
- -h, --help: show this help message and exit
- -o <filename>, --output <filename>
- --version: show program's version number and exit
- -f, --force: Overwrite output file if it exists
The output is an interleaved set of reads, with each read in <R1> paired with a read in <R2>. By default, the output goes to stdout unless -o/--output is specified.
As a "bonus", this script ensures that if read names are not already formatted properly, they are reformatted consistently, such that they look like the pre-1.8 Casava format (@name/1, @name/2).
Example:
interleave-reads.py tests/test-data/paired.fq.1 tests/test-data/paired.fq.2 -o paired.fq
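Interleaving and the name normalization described above can be sketched in Python on read names alone. The function name is hypothetical, and the sketch assumes names either carry no suffix or already end in /1 and /2; it does not handle Casava 1.8 style names or full sequence records.

```python
def interleave(left_names, right_names):
    """Interleave two equal-length name lists, forcing /1 and /2 suffixes."""
    if len(left_names) != len(right_names):
        raise ValueError("R1 and R2 must contain the same number of reads")
    out = []
    for left, right in zip(left_names, right_names):
        base_left = left[:-2] if left.endswith("/1") else left
        base_right = right[:-2] if right.endswith("/2") else right
        if base_left != base_right:
            raise ValueError(f"read names do not match: {left!r} vs {right!r}")
        out.extend([base_left + "/1", base_right + "/2"])
    return out


print(interleave(["r1", "r2/1"], ["r1", "r2/2"]))
```

Failing loudly on mismatched names is a deliberate choice in this sketch, since silently pairing unrelated reads would corrupt downstream assembly.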
readstats.py
Display summary statistics for one or more FASTA/FASTQ files.
usage: readstats.py [-h] [-o filename] [--csv] filenames [filenames ...]
- filenames
- -h, --help: show this help message and exit
- -o <filename>, --output <filename>: output file for statistics; defaults to stdout.
- --csv: Use the CSV format for the statistics, including column headers.
Report number of bases, number of sequences, and average sequence length for one or more FASTA/FASTQ files; and report aggregate statistics at end.
With -o/--output, the output will be saved to the specified file.
Example:
readstats.py tests/test-data/test-abund-read-2.fa
sample-reads-randomly.py
Uniformly subsample sequences from a collection of files
usage: sample-reads-randomly.py [-h] [-N NUM_READS] [-M MAX_READS] [-S NUM_SAMPLES] [-R RANDOM_SEED] [--force_single] [-o output_file] [--version] [-f] filenames [filenames ...]
- filenames
- -h, --help: show this help message and exit
- -N <int>, --num_reads <int>
- -M <int>, --max_reads <int>
- -S <int>, --samples <int>
- -R <int>, --random-seed <int>
- --force_single: Ignore read pair information if present
- -o <output_file>, --output <output_file>
- --version: show program's version number and exit
- -f, --force: Overwrite output file if it exists
Take a list of files containing sequences, and subsample 100,000 sequences (-N/--num_reads) uniformly, using reservoir sampling. Stop after first 100m sequences (-M/--max_reads). By default take one subsample, but take -S/--samples samples if specified.
The output is placed in -o/--output <file> (for a single sample) or in <file>.subset.0 to <file>.subset.S-1 (for more than one sample).
This script uses the reservoir sampling algorithm.
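For readers unfamiliar with reservoir sampling, the following Python sketch shows the classic single-pass form of the algorithm. It is a generic illustration with a hypothetical function name, not the script's own code, and it ignores read pairing and the -M/--max_reads limit.

```python
import random


def reservoir_sample(items, num_reads, seed=None):
    """Return num_reads items chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(items):
        if n < num_reads:
            # Fill the reservoir with the first num_reads items.
            reservoir.append(item)
        else:
            # Keep the new item with probability num_reads / (n + 1).
            j = rng.randint(0, n)
            if j < num_reads:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1000), 10, seed=1)
print(len(sample))
```

The appeal of this approach is that the input is read once and only num_reads items are held in memory, regardless of how many sequences the files contain.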
split-paired-reads.py
Split interleaved reads into two files, left and right.
usage: split-paired-reads.py [-h] [-o output_directory] [-1 output_first] [-2 output_second] [-p] [--version] [-f] infile
- infile
- -h, --help: show this help message and exit
- -o <output_directory>, --output-dir <output_directory>: Output split reads to specified directory. Creates directory if necessary
- -1 <output_first>, --output-first <output_first>: Output "left" reads to this file
- -2 <output_second>, --output-second <output_second>: Output "right" reads to this file
- -p, --force-paired: Require that reads be interleaved
- --version: show program's version number and exit
- -f, --force: Overwrite output file if it exists
Some programs want paired-end read input in the One True Format, which is interleaved; other programs want input in the Insanely Bad Format, with left- and right- reads separated. This reformats the former to the latter.
The directory into which the left- and right- reads are output may be specified using -o/--output-dir. This directory will be created if it does not already exist.
Alternatively, you can specify the filenames directly with -1/--output-first and -2/--output-second, which will override the -o/--output-dir setting on a file-specific basis.
-p/--force-paired will require the input file to be properly interleaved; by default, this is not required.
Example:
split-paired-reads.py tests/test-data/paired.fq
Example:
split-paired-reads.py -o ~/reads-go-here tests/test-data/paired.fq
Example:
split-paired-reads.py -1 reads.1 -2 reads.2 tests/test-data/paired.fq