Blog posts and additional documentation

Hashtable and filtering

The basic inexact-matching approach used by the hashtable code is described in this blog post:

A test data set (soil metagenomics, 88 million reads, 10 GB) is here:

Illumina read abundance profiles

khmer can be used to look at systematic variations in k-mer statistics across Illumina reads; see, for example, this blog post:

The fasta-to-abundance-hist and abundance-hist-by-position scripts can be used to generate the k-mer abundance profile data, after loading all the k-mer counts into a .ct file:

# first, load all the k-mer counts:
 -k 20 -x 1e7 25k.ct data/25k.fq.gz

# then, build the '.freq' file that contains all of the counts by position
python sandbox/ 25k.ct data/25k.fq.gz

# sum across positions.
python sandbox/ data/25k.fq.gz.freq > out.dist
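To make the idea behind these steps concrete, here is a minimal pure-Python sketch of a per-position k-mer abundance profile. It does not use khmer and does not reproduce the '.freq' or 'out.dist' file formats; it uses an exact in-memory counter rather than a .ct counting table, and the function names (count_kmers, abundance_by_position, sum_across_positions) are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def iter_kmers(seq, k):
    """Yield (position, k-mer) for every k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield i, seq[i:i + k]


def count_kmers(reads, k):
    """Exact k-mer counts (a stand-in for loading counts into a table)."""
    counts = Counter()
    for read in reads:
        for _, kmer in iter_kmers(read, k):
            counts[kmer] += 1
    return counts


def abundance_by_position(reads, counts, k):
    """Map each read position to a histogram: abundance -> number of k-mers."""
    profile = defaultdict(Counter)
    for read in reads:
        for pos, kmer in iter_kmers(read, k):
            profile[pos][counts[kmer]] += 1
    return profile


def sum_across_positions(profile):
    """Merge the per-position histograms into one abundance histogram."""
    total = Counter()
    for hist in profile.values():
        total.update(hist)
    return total


if __name__ == "__main__":
    # Toy reads; real input would come from a FASTA/FASTQ file.
    reads = ["ACGTA", "ACGTT", "ACGAA"]
    counts = count_kmers(reads, 3)
    profile = abundance_by_position(reads, counts, 3)
    for pos in sorted(profile):
        print(pos, dict(profile[pos]))
    print(dict(sum_across_positions(profile)))
```

In this toy input the k-mer at position 0 is shared by all three reads, while the k-mers at position 2 are all unique, which is the kind of position-dependent shift the profile is meant to expose.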

The hashtable method 'dump_kmers_by_abundance' can be used to dump high-abundance k-mers, but we don't have a script handy to do that yet.

You can assess high- and low-abundance k-mer distributions with the hi-lo-abundance-by-position script:

 -k 20 25k.ct data/25k.fq.gz
python sandbox/ 25k.ct data/25k.fq.gz

This will produce two output files, <filename>.pos.abund=1 and <filename>.pos.abund=255.
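As a rough illustration of what such a high/low tally could look like, the sketch below counts, for each read position, how many k-mers have abundance 1 and how many sit at the maximum count. It is pure Python and independent of khmer. The reading of the two output files as "count equals 1" and "count equals 255" is an assumption drawn from their names, as is treating 255 as a saturating cap; hi_lo_by_position and its parameters are hypothetical.

```python
from collections import Counter

MAX_COUNT = 255  # assumption: counts saturate at this value


def hi_lo_by_position(reads, k, lo=1, hi=MAX_COUNT, cap=MAX_COUNT):
    """Return (lo_by_pos, hi_by_pos): per-position tallies of k-mers whose
    capped abundance equals `lo` or is at least `hi`."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1

    lo_by_pos = Counter()
    hi_by_pos = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            abundance = min(counts[read[i:i + k]], cap)
            if abundance == lo:
                lo_by_pos[i] += 1
            elif abundance >= hi:
                hi_by_pos[i] += 1
    return lo_by_pos, hi_by_pos


if __name__ == "__main__":
    # Toy reads with a lowered 'hi' threshold so the effect is visible.
    reads = ["ACGTA", "ACGTT", "ACGAA"]
    lo_by_pos, hi_by_pos = hi_lo_by_position(reads, 3, hi=3)
    print("low: ", dict(lo_by_pos))
    print("high:", dict(hi_by_pos))
```

A rise in the low-abundance tally toward the 3' end of reads would be consistent with sequencing errors accumulating there, since an error typically creates k-mers seen only once.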

Finding valleys/minima in k-mer abundance profiles

Using k-mer abundance profiles to dynamically calculate the abundance threshold separating erroneous k-mers from real k-mers is described in this blog post:
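One simple way to pick such a threshold, sketched here under stated assumptions, is to take the first local minimum of the abundance histogram: the valley between the low-abundance error peak and the peak of real k-mers. This is not necessarily the method the blog post uses, it is sensitive to noisy histograms (smoothing may be needed in practice), and first_valley is a hypothetical helper.

```python
def first_valley(hist):
    """Return the smallest abundance a >= 2 that is a local minimum of hist,
    or None if there is none.

    hist[a] is the number of distinct k-mers with abundance a; index 0 is
    unused. A local minimum here means hist[a-1] >= hist[a] < hist[a+1].
    """
    for a in range(2, len(hist) - 1):
        if hist[a] <= hist[a - 1] and hist[a] < hist[a + 1]:
            return a
    return None


if __name__ == "__main__":
    # Toy histogram: error peak at abundance 1, real-k-mer peak near 7.
    hist = [0, 1000, 300, 50, 20, 60, 200, 350, 200, 40]
    print("threshold:", first_valley(hist))
```

K-mers with abundance at or below the returned value would then be treated as likely errors, and those above it as likely real.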
