Thursday, November 22, 2012

JELLYFISH - Fast, Parallel k-mer Counting for DNA


What is Jellyfish - Fast, Parallel k-mer Counting for DNA?
(Taken from Jellyfish Site)
JELLYFISH is a tool for fast, memory-efficient counting of k-mers in DNA. A k-mer is a substring of length k, and counting the occurrences of all such substrings is a central step in many analyses of DNA sequence. JELLYFISH can count k-mers using an order of magnitude less memory and an order of magnitude faster than other k-mer counting packages by using an efficient encoding of a hash table and by exploiting the "compare-and-swap" CPU instruction to increase parallelism.
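
To make the definition of a k-mer concrete, here is a minimal Python sketch of naive k-mer counting. It is not how JELLYFISH works internally (it uses none of the compact hash table encoding or compare-and-swap parallelism described above); it only illustrates what is being counted. The function names `read_fasta` and `count_kmers` are made up for this example, and skipping k-mers that contain characters other than A, C, G, T is an assumption of the sketch.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA or multi-FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def count_kmers(sequence, k):
    """Count every substring of length k in the sequence.

    Assumption of this sketch: k-mers containing anything other than
    A, C, G or T (for example N) are skipped.
    """
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


if __name__ == "__main__":
    # Tiny in-memory multi-FASTA input, invented for illustration.
    fasta = [">seq1", "ACGTACGT", ">seq2", "ACGN"]
    total = Counter()
    for _, seq in read_fasta(fasta):
        total += count_kmers(seq, 3)
    for kmer, n in sorted(total.items()):
        print(kmer, n)
```

A plain dictionary like this holds every k-mer as a separate string, which is exactly the kind of memory cost that becomes a problem on genome-sized input.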

JELLYFISH is a command-line program that reads FASTA and multi-FASTA files containing DNA sequences. It outputs its k-mer counts in a binary format, which can be translated into a human-readable text format using the "jellyfish dump" command. See the documentation below for more details.
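
Once the counts are in text form, they are easy to load into a script. The sketch below assumes the dumped text is FASTA-like, with each count on a ">" header line followed by its k-mer on the next line; that layout is an assumption here and should be checked against the output of your own JELLYFISH version. The function name `parse_dump` is made up for this example.

```python
def parse_dump(lines):
    """Build a {kmer: count} dict from FASTA-like dump text.

    Assumed layout (verify against your own dump output):
        >2
        ACG
        >1
        GTA
    """
    counts = {}
    count = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line carries the count for the next k-mer.
            count = int(line[1:])
        elif count is not None:
            counts[line] = count
            count = None
    return counts


if __name__ == "__main__":
    # Sample text invented for illustration, not real JELLYFISH output.
    sample = [">2", "ACG", ">1", "GTA"]
    for kmer, n in sorted(parse_dump(sample).items()):
        print(kmer, n)
```

For real data, pass an open file handle instead of the in-memory list, since the function only needs an iterable of lines.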



Requirements:

JELLYFISH runs on 64-bit Intel-compatible processors running Linux or FreeBSD (including Intel Macs). It requires GNU GCC to compile.


Download (current version 1.1.6):
http://www.cbcb.umd.edu/software/jellyfish/jellyfish-1.1.6.tar.gz


Installation:
# ./configure --prefix=/usr/local/jellyfish
# make
# make install

Testing - Test 1
# make check

... 
...
====================
All 19 tests passed
(1 test was not run)
====================
...
...
All tests should pass and 1 test should be skipped (big.sh). Running
'make check' will use about 50MB of disk space and will use every CPU
found on the machine. On our test machine with 32 cores, it takes a
few minutes to run.

Testing - Test 2
# make check BIG=1

....
....
PASS: tests/generate_sequence.sh
PASS: tests/serial_hashing.sh
PASS: tests/parallel_hashing.sh
PASS: tests/serial_direct_indexing.sh
PASS: tests/parallel_direct_indexing.sh
....
....
