==============================
khmer's command-line interface
==============================

The simplest way to use khmer's functionality is through the command
line scripts, located in the scripts/ directory of the khmer
distribution.  Below is our documentation for these scripts.  Note
that all scripts can be given '-h' as an option, which will print out
a list of arguments taken by that script.

Many scripts take '-x' and '-N' parameters, which drive khmer's memory
usage.  These parameters depend on details of your data set; for more
information on how to choose them, see :doc:`choosing-hash-sizes`.

1. :ref:`scripts-counting`
2. :ref:`scripts-partitioning`
3. :ref:`scripts-diginorm`
4. :ref:`scripts-read-handling`

.. note::

   Almost all scripts take in either FASTA or FASTQ format, and output
   the same.  Some scripts may only recognize FASTQ if the file ending
   is '.fq' or '.fastq', at least for now.

   Files ending with '.gz' will be treated as gzipped files, and files
   ending with '.bz2' will be treated as bzip2'd files.

.. _scripts-counting:

k-mer counting and abundance filtering
======================================

**load-into-counting.py**: build a counting hash.

Usage::

   load-into-counting.py [ options ] <output hashtable filename> <input sequence filename> [ <input sequence filename> ... ]

Build a counting hash table using the sequences in the input files,
and save it to the output file.

Note that with the '-b' option the output file will be the exact size
of your hashtables, and load-into-counting.py will use constant
memory; in exchange, k-mer counts will stop at 255.  The memory usage
of this script with '-b' will be about 1.15x the product of the -x and
-N numbers.

Example::

   load-into-counting.py -k 20 -x 5e7 out.kh data/100k-filtered.fa

Multiple threads can be used to accelerate the process, if you have
extra cores to spare.

Example::

   load-into-counting.py -k 20 -x 5e7 -T 4 out.kh data/100k-filtered.fa

**abundance-dist.py**: calculate the abundance distribution of k-mers.

Usage::

   abundance-dist.py [ options ] <hashtable filename> <input sequence filename> <output histogram filename>

Use a counting hash table to count the k-mer abundance distribution in
the input sequence file; output the distribution to the histogram
file.

Example::

   load-into-counting.py -k 20 -x 5e7 out.kh data/100k-filtered.fa
   abundance-dist.py -z out.kh data/100k-filtered.fa out.hist

The file 'out.hist' will contain the histogram of k-mer abundances.
Columns are (1) k-mer abundance, (2) k-mer count, (3) cumulative
count, and (4) fraction of total distinct k-mers.

**abundance-dist-single.py**: calculate the k-mer abundance distribution for a single file.

Usage::

   abundance-dist-single.py [ options ] <input sequence filename> <output histogram filename>

Count the k-mer abundance distribution in the input sequence file;
output the distribution to the histogram file.  (This is a
single-step/in-memory version of abundance-dist.py; no counting hash
file will be created unless --savehash is specified.)

Note that with the '-b' option this script uses constant memory; in
exchange, k-mer counts will stop at 255.  The memory usage of this
script with '-b' will be about 1.15x the product of the -x and -N
numbers.

To count k-mers in multiple files, use load-into-counting.py and
abundance-dist.py.

Example::

   abundance-dist-single.py -k 20 -x 5e7 data/100k-filtered.fa out.hist

The file 'out.hist' will contain the histogram of k-mer abundances.
Columns are (1) k-mer abundance, (2) k-mer count, (3) cumulative
count, and (4) fraction of total distinct k-mers.
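Both load-into-counting.py and abundance-dist-single.py accept the
'-b' flag described above.  As a worked example of the memory rule of
thumb (a sketch -- the filenames here are hypothetical, and the byte
count simply instantiates the 1.15x figure)::

   # With -b, memory is ~1.15 x (-x) x (-N) bytes: here about
   # 1.15 * 5e7 * 4 = ~230 MB, with k-mer counts capped at 255.
   load-into-counting.py -b -k 20 -x 5e7 -N 4 counts.kh reads.fa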
**filter-abund.py**: trim sequences at a minimum k-mer abundance.

Usage::

   filter-abund.py [ -C <cutoff> ] [ -V [ -Z <num> ] ] <hashtable filename> <input sequence filename> [ <input sequence filename> ... ]

Load a counting hash table from the given file and use it to trim the
input sequences.  Trimmed sequences will be placed in <input sequence
filename>.abundfilt.

The '-V' option sets variable coverage, which means that only
low-abundance k-mers in high-coverage reads are trimmed; this should
be used for RNAseq and metagenome data.  The '-Z <num>' option sets
the meaning of "high coverage"; by default it's 20.

Example::

   load-into-counting.py -k 20 -x 5e7 table.kh data/100k-filtered.fa
   filter-abund.py -C 2 table.kh data/100k-filtered.fa

**filter-abund-single.py**: trim sequences at a minimum k-mer abundance.

Usage::

   filter-abund-single.py [ options ] [ -C <cutoff> ] <input sequence filename>

Trim sequences at k-mer abundance < C.  (This is an in-memory version
of filter-abund.py that works on single files.)  Trimmed sequences
will be placed in <input sequence filename>.abundfilt.

This script uses constant memory.

To trim reads based on k-mer abundance across multiple files, use
load-into-counting.py and filter-abund.py.

Example::

   filter-abund-single.py -k 20 -x 5e7 -C 2 data/100k-filtered.fa

**count-median.py**: count median, average, and stddev of k-mer abundance in sequences.

Usage::

   count-median.py <hashtable filename> <input sequence filename> <output summary filename>

Load a counting hash table from the given file and use it to calculate
the median, average, and stddev k-mer count for each sequence in the
input file.  Output counts are placed in the summary file, in the
format "name median average stddev sequence_length".
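There is no example above for count-median.py, so here is a
hypothetical run that reuses the counting hash from the
filter-abund.py example (counts.txt is a made-up output name)::

   # Build the counting hash, then summarize per-sequence k-mer
   # abundance; each line of counts.txt will read
   # "name median average stddev sequence_length".
   load-into-counting.py -k 20 -x 5e7 table.kh data/100k-filtered.fa
   count-median.py table.kh data/100k-filtered.fa counts.txt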
**count-overlap.py**: count the overlap k-mers, which are the k-mers appearing in two sequence datasets.

Usage::

   count-overlap.py [ options ] <first_dataset> <second_dataset> <report_filename> [ <curve_report_filename> ]

Note: "curve_report_filename" is optional.  If it is provided, the
script will generate a report of the increase in overlap k-mers as the
number of sequences in the second dataset increases; this report can
be used to plot a curve showing that trend.  The main report file
contains the numbers of unique k-mers in the two datasets separately,
and the number of overlap k-mers appearing in both datasets.

.. _scripts-partitioning:

Partitioning
============

**do-partition.py**: load, partition, and annotate sequences.

Usage::

   do-partition.py [ options ] <graphbase> <input sequence filename> [ <input sequence filename> ... ]

Load in a set of sequences, partition them, merge the partitions, and
annotate the original sequence files with the partition information.

This script combines the functionality of load-graph.py,
partition-graph.py, merge-partitions.py, and annotate-partitions.py
into one script.  This is convenient, but it should probably not be
used for large data sets, because do-partition.py doesn't provide
save/resume functionality.

**load-graph.py**: load sequences into the compressible graph format.

Usage::

   load-graph.py [ options ] <graphbase> <input sequence filename> [ <input sequence filename> ... ]

Load in a set of sequences, marking waypoints as you go, and save into
a ht/tagset pair of files.  See 'extract-partitions.py' for a complete
workflow.

**partition-graph.py**: partition a graph based on waypoint connectivity.

Usage::

   partition-graph.py [ options ] <graphbase>

Partition the given graph (ht + tagset) into disconnected subgraphs,
and save the resulting partitionmap(s) as ``*.pmap`` files.

See 'Artifact removal' to understand the stoptags argument.

**merge-partitions.py**: merge pmap files into a single merged pmap file.

Usage::

   merge-partitions.py [ options ] <graphbase>

Take the <graphbase>.subset.N.pmap files and merge them all into a
single <graphbase>.pmap.merged file for annotate-partitions.py to use.

**annotate-partitions.py**: annotate sequences with partition IDs.

Usage::

   annotate-partitions.py [ -k <ksize> ] <graphbase> <input sequence filename> [ <input sequence filename> ... ]

Load in a partitionmap (generally produced by partition-graph.py /
merge-partitions.py) and annotate the sequences in the given files
with their partition IDs.  Use 'extract-partitions.py' to actually
extract sequences into separate group files.

Example (results will be in ``random-20-a.fa.part``)::

   load-graph.py -k 20 example tests/test-data/random-20-a.fa
   partition-graph.py example
   merge-partitions.py -k 20 example
   annotate-partitions.py -k 20 example tests/test-data/random-20-a.fa

**extract-partitions.py**: separate sequences annotated with partitions into group files.

Usage::

   extract-partitions.py [ options ] <output filename prefix> <input partitioned filename> [ <input partitioned filename> ... ]

Example (results will be in ``example.group0000.fa``)::

   load-graph.py -k 20 example tests/test-data/random-20-a.fa
   partition-graph.py example
   merge-partitions.py -k 20 example
   annotate-partitions.py -k 20 example tests/test-data/random-20-a.fa
   extract-partitions.py example random-20-a.fa.part

Artifact removal
----------------

The following scripts are specialized tools for finding and removing
highly-connected k-mers (HCKs).  See :doc:`partitioning-big-data`.  A
sketch of how they fit together follows the script descriptions below.

**make-initial-stoptags.py**: find an initial set of HCKs.

Usage::

   make-initial-stoptags.py [ options ] <graphbase>

Load a ht/tagset pair created by load-graph.py, and do a small set of
traversals from graph waypoints; on these traversals, look for k-mers
that are repeatedly traversed in high-density regions of the graph,
i.e. are highly connected.  Output those k-mers as an initial set of
stoptags, which can be fed into partition-graph.py, find-knots.py, and
filter-stoptags.py.

The hashtable size parameters are for a counting hash used to keep
track of repeatedly-traversed k-mers.  The subset size option
specifies the number of waypoints from which to traverse; for highly
connected data sets, the default (1000) is probably ok.

**find-knots.py**: find all HCKs.

Usage::

   find-knots.py [ options ] <graphbase>

Load a ht/tagset pair created by load-graph.py, and a set of pmap
files created by partition-graph.py.  Go through each pmap file,
select the largest partition in each, and do the same kind of
traversal as in make-initial-stoptags.py from each of the waypoints in
that partition; this should identify all of the HCKs in that
partition.  These HCKs are output to <graphbase>.stoptags after each
pmap file is processed.

Parameter choice is reasonably important.  See the pipeline in
:doc:`partitioning-big-data` for an example run.

This script is not very scalable and may blow up memory and die
horribly.  You should be able to use the intermediate stoptags to
restart the process, and if you eliminate the already-processed pmap
files, you can continue where you left off.

**filter-stoptags.py**: trim sequences at stoptags.

Usage::

   filter-stoptags.py [ -k <ksize> ] <stoptags filename> <input sequence filename> [ <input sequence filename> ... ]

Load stoptags in from the given .stoptags file and use them to trim or
remove the sequences in the input files.  Trimmed sequences will be
placed in <input sequence filename>.stopfilt.
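Putting the artifact-removal scripts together, the run below is a
plausible ordering rather than a pipeline copied from these docs: the
``--stoptags`` spelling of partition-graph.py's stoptags argument, the
``example.stoptags`` output name, and the input filenames are all
assumptions -- see :doc:`partitioning-big-data` for real parameter
choices::

   # Build the graph, then find an initial set of HCKs (assumed to be
   # written to example.stoptags).
   load-graph.py -k 32 -x 1e8 example reads.fa
   make-initial-stoptags.py example
   # Partition while respecting the initial stoptags, find the
   # remaining HCKs, and finally trim reads at them.
   partition-graph.py --stoptags example.stoptags example
   find-knots.py example
   filter-stoptags.py -k 32 example.stoptags reads.fa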
.. _scripts-diginorm:

Digital normalization
=====================

**normalize-by-median.py**: do digital normalization (remove mostly redundant sequences).

Usage::

   normalize-by-median.py [ options ] <input sequence filename> [ <input sequence filename> ... ]

Discard sequences based on whether or not their **median** k-mer
abundance lies above a specified cutoff.  Kept sequences will be
placed in <input sequence filename>.keep.

Paired-end reads will be considered together if the ``-p`` flag is
set.  If either read in a pair will be kept, then both will be kept;
this should result in keeping (or discarding) each sequencing fragment
as a unit, which especially helps with the retention of repeats.

With the ``-s``/``--savehash`` option, the counting hash will be saved
to the specified file after all sequences have been processed.  With
``-d``, the counting hash will be saved every d files for multifile
runs; if ``-s`` is set, the specified name will be used, and if not,
the name backup.ht will be used.  ``-l``/``--loadhash`` will load the
specified hash table.

The ``-f`` flag will force the program to continue upon encountering a
formatting error in a sequence file; the counting hash up to that
point will be dumped, and processing will continue on the next file.

Example::

   normalize-by-median.py -k 17 tests/test-data/test-abund-read-2.fa

Example::

   normalize-by-median.py -p -k 17 tests/test-data/test-abund-read-paired.fa

Example::

   normalize-by-median.py -k 17 -f tests/test-data/test-error-reads.fq tests/test-data/test-fastq-reads.fq

Example::

   normalize-by-median.py -k 17 -d 2 -s test.ht tests/test-data/test-abund-read-2.fa tests/test-data/test-fastq-reads.fq

.. _scripts-read-handling:

Read handling: interleaving, splitting, etc.
============================================

**extract-paired-reads.py**: take an interleaved file containing some orphaned reads, and separate the pairs from the orphans.

Usage::

   extract-paired-reads.py <input sequence filename>

The output is two files, <input sequence filename>.pe and <input
sequence filename>.se, placed in the current directory.  The .pe file
contains interleaved and properly paired sequences, while the .se file
contains orphan sequences.

Many assemblers (e.g. Velvet) require that you give them either
perfectly interleaved files, or files containing only single reads.
This script takes files that were originally interleaved but where
reads may have been orphaned via error filtering, application of
abundance filtering, digital normalization in non-paired mode, or
partitioning.

Example::

   extract-paired-reads.py tests/test-data/paired.fq

**interleave-reads.py**: take left (/1) and right (/2) reads, and interleave them.

Usage::

   interleave-reads.py [ -o <output filename> ] <left reads filename> <right reads filename>

The output is an interleaved set of reads, with each read in the left
file paired with a read in the right file.  By default, the output
goes to stdout unless '-o' is specified.

As a "bonus", this script ensures that read names are formatted in a
consistent way, such that they look like the pre-1.8 Casava format
(@name/1, @name/2).

Example::

   interleave-reads.py tests/test-data/paired.fq.1 tests/test-data/paired.fq.2 -o paired.fq

**split-paired-reads.py**: split an interleaved reads file into two files, one for left reads and one for right reads.

Usage::

   split-paired-reads.py <input sequence filename>

The output is placed in <input sequence filename>.1 and <input
sequence filename>.2.

Some programs want paired-end read input in the One True Format, which
is interleaved; other programs want input in Insanely Bad Format, with
left and right reads separated.  This reformats the former into the
latter.

Example::

   split-paired-reads.py tests/test-data/paired.fq
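split-paired-reads.py and interleave-reads.py are inverses of each
other, so a quick round trip makes a handy sanity check.  This sketch
assumes the split files land in the current directory as paired.fq.1
and paired.fq.2, per the naming convention above; paired-restored.fq
is a made-up output name::

   # Interleaved -> split -> re-interleaved.
   split-paired-reads.py tests/test-data/paired.fq
   interleave-reads.py paired.fq.1 paired.fq.2 -o paired-restored.fq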