Software / hpc-clust

Description

HPC-CLUST is a set of tools designed to cluster large numbers (>1 million) of pre-aligned nucleotide sequences. It clusters the sequences using the Hierarchical Clustering Algorithm (HCA). Three cluster metrics are currently implemented: single-linkage, complete-linkage, and average-linkage. Four sequence distance functions are also implemented: identity (a gap-gap position counts as a match), nogap (gap-gap positions are ignored), nogap-single (like nogap, but consecutive gap-nogap positions count as a single mismatch), and tamura (the distance takes into account that transitions are more likely than transversions).
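The difference between the identity and nogap functions can be illustrated with a minimal Python sketch. This is not HPC-CLUST's code: the function names, the use of "-" as the gap character, and the normalization (matches divided by compared positions) are assumptions made for illustration. The nogap-single and tamura functions are left out.

```python
GAP = "-"  # assumed gap character; alignment formats may use others


def identity_similarity(a: str, b: str) -> float:
    """Fraction of matching positions; a gap-gap position counts as a match."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def nogap_similarity(a: str, b: str) -> float:
    """Like identity, but gap-gap positions are skipped entirely."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    matches = 0
    compared = 0
    for x, y in zip(a, b):
        if x == GAP and y == GAP:
            continue  # ignored: neither a match nor a mismatch
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


seq1 = "ACGT--A"
seq2 = "ACGA--A"
print(identity_similarity(seq1, seq2))  # 6 matches over 7 positions
print(nogap_similarity(seq1, seq2))     # 4 matches over 5 compared positions
```

In this example the two gap-gap positions raise the identity value (6/7) relative to the nogap value (4/5), because identity counts shared gaps as agreement.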

One advantage that HCA has over other algorithms is that, instead of producing only the clustering at a given threshold, it produces the set of merges occurring at each threshold. The clusters can then be reported very quickly for any threshold with little extra computation. This also makes it possible to plot how the number of clusters varies with the clustering threshold without rerunning the clustering for each threshold.
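As a rough illustration of why a merge list makes any threshold cheap to report, the Python sketch below replays merges with a union-find structure. The merge format used here, a tuple of (similarity, item_a, item_b), is hypothetical and is not HPC-CLUST's output file format; `clusters_at` is an illustrative name.

```python
from collections import defaultdict


def clusters_at(merges, n_items, threshold):
    """Return the clusters obtained by applying every merge whose
    similarity is at least `threshold`.

    merges: iterable of (similarity, item_a, item_b), where item_a and
    item_b are members of the two clusters being joined (assumed format).
    """
    parent = list(range(n_items))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for similarity, a, b in merges:
        if similarity >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for item in range(n_items):
        groups[find(item)].append(item)
    return sorted(groups.values())


# Five sequences; item 4 never merges with anything.
merges = [(0.99, 0, 1), (0.97, 2, 3), (0.90, 1, 2)]

print(clusters_at(merges, 5, 0.95))  # [[0, 1], [2, 3], [4]]

# Number of clusters at several thresholds, from the same merge list.
print([len(clusters_at(merges, 5, t)) for t in (0.98, 0.95, 0.50)])  # [4, 3, 2]
```

The clustering itself is run once; each additional threshold only costs one pass over the merge list.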

Another advantage of the HPC-CLUST implementation is that the single-, complete-, and average-linkage clusterings can be computed in a single run with little overhead.
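The three linkage criteria are all functions of the same pairwise sequence comparisons, which the following Python sketch makes explicit using the standard textbook definitions. It is an illustration only and does not describe how HPC-CLUST shares work between the three clusterings internally; `linkage_similarities` and `toy_similarity` are hypothetical names.

```python
from itertools import product
from statistics import mean


def toy_similarity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def linkage_similarities(cluster_a, cluster_b, similarity):
    """Similarity between two clusters under each linkage criterion.

    single:   the most similar pair decides
    complete: the least similar pair decides
    average:  the mean over all pairs decides
    """
    pair_values = [similarity(x, y) for x, y in product(cluster_a, cluster_b)]
    return {
        "single": max(pair_values),
        "complete": min(pair_values),
        "average": mean(pair_values),
    }


result = linkage_similarities(["AAAA", "AAAT"], ["AATT"], toy_similarity)
print(result)  # {'single': 0.75, 'complete': 0.5, 'average': 0.625}
```

Because the pairwise values are computed once and reused for all three criteria, the expensive part of the comparison does not have to be repeated for each linkage.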

Documentation

Download

github:

git clone https://github.com/jfmrod/hpc-clust.git

releases:

Reference

Matias Rodrigues JF, Mering C von. HPC-CLUST: Distributed hierarchical clustering for very large sets of nucleotide sequences. Bioinformatics. 2013. [doi:10.1093/bioinformatics/btt657]

License

The main HPC-CLUST code is available under the GPL v3 license. The eutils supporting library included in the distribution is available under a separate license. Commercial use of the unmodified and the binary version is allowed. Please read the COPYING file included in the package for further details.

History