Swiss Institute of Bioinformatics
University of Zürich
HPC-CLUST is a set of tools designed to cluster large numbers (>1 million) of pre-aligned nucleotide sequences. HPC-CLUST clusters sequences using the Hierarchical Clustering Algorithm (HCA). Three linkage methods are currently implemented: single-linkage, complete-linkage, and average-linkage. Four sequence distance functions are also implemented: identity (a gap-gap column counts as a match), nogap (gap-gap columns are ignored), nogap-single (like nogap, but consecutive gap-nogap columns count as a single mismatch), and tamura (the distance accounts for transitions being more likely than transversions).
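To make the difference between the first two distance functions concrete, here is a minimal Python sketch. It is not taken from the HPC-CLUST source: the function names, the `-` gap character, and the normalization (mismatches divided by the number of compared columns) are assumptions made for illustration. The nogap-single and tamura variants are left out.

```python
GAP = "-"  # assumed gap character for aligned sequences


def _check_aligned(a, b):
    # Pre-aligned sequences must have the same number of columns.
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")


def identity_distance(a, b):
    """Fraction of mismatching columns; a gap-gap column counts as a match."""
    _check_aligned(a, b)
    if not a:
        return 0.0
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def nogap_distance(a, b):
    """Like identity_distance, but gap-gap columns are skipped entirely."""
    _check_aligned(a, b)
    compared = 0
    mismatches = 0
    for x, y in zip(a, b):
        if x == GAP and y == GAP:
            continue  # ignored: contributes to neither count
        compared += 1
        if x != y:
            mismatches += 1
    return mismatches / compared if compared else 0.0


if __name__ == "__main__":
    s1, s2 = "AC--GT", "AC--GA"
    print(identity_distance(s1, s2))  # 1 mismatch over 6 columns
    print(nogap_distance(s1, s2))     # 1 mismatch over 4 compared columns
```

In this toy pair the two shared gap columns dilute the identity distance (1/6) but not the nogap distance (1/4), which is the practical effect of ignoring gap-gap columns.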
One advantage that HCA has over other algorithms is that, instead of producing only the clustering at a given threshold, it produces the set of merges occurring at each threshold. With this approach, the clusters can afterwards be reported very quickly for any threshold with little extra computation. It also allows plotting how the number of clusters varies with the clustering threshold without rerunning the clustering for each threshold independently.
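The idea of reusing a merge list for many thresholds can be sketched in a few lines of Python. This is an illustration only: the `(distance, i, j)` merge tuples and the function name are hypothetical and do not reflect HPC-CLUST's actual output format. The sketch assumes the merges are sorted by non-decreasing distance, and it replays them with a union-find structure up to the requested threshold.

```python
def clusters_at_threshold(n_items, merges, threshold):
    """Replay merges with distance <= threshold and return the clusters.

    merges: list of (distance, i, j) tuples sorted by increasing distance,
    each joining the clusters that currently contain items i and j.
    """
    parent = list(range(n_items))

    def find(x):
        # Follow parent links to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for dist, i, j in merges:
        if dist > threshold:
            break  # merges are sorted, so later ones are also too distant
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    groups = {}
    for item in range(n_items):
        groups.setdefault(find(item), []).append(item)
    return sorted(groups.values())


if __name__ == "__main__":
    # Toy merge list for 5 sequences (made-up distances).
    merges = [(0.01, 0, 1), (0.02, 2, 3), (0.05, 1, 2), (0.20, 3, 4)]
    for t in (0.0, 0.03, 0.1, 0.5):
        print(t, len(clusters_at_threshold(5, merges, t)))
```

Each threshold costs one pass over the merge list rather than a new clustering run, so a curve of cluster count against threshold can be produced from a single clustering.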
Another advantage of the HPC-CLUST implementation is that the single-, complete-, and average-linkage clusterings can be computed in a single run with little overhead.
git clone https://github.com/jfmrod/hpc-clust.git
- hpc-clust-1.2.1 source [19 March 2015]
- hpc-clust-1.2.1 linux binaries [19 March 2015]
- hpc-clust-1.1.1 source [21 October 2014]
- hpc-clust-1.1.1 linux binaries [21 October 2014]
Reference: Matias Rodrigues JF, Mering C von. HPC-CLUST: Distributed hierarchical clustering for very large sets of nucleotide sequences. Bioinformatics. 2013. [doi:10.1093/bioinformatics/btt657]
License: The main HPC-CLUST code is available under the GPL v3 license. The eutils supporting library included in the distribution is available under a separate license. Commercial use of the unmodified and the binary version is allowed. Please read the COPYING file included in the package for further details.
- 1.2.1 (19 March 2015)
  - Fixed bug in loading of fasta file for -makeotus and -makereps actions
- 1.2.0 (5 February 2015)
  - Average linkage computation is faster and is computed until the specified threshold
  - Added -makeotus, -makeotus_mothur, and -makereps options to hpc-clust
- 1.1.1 (21 October 2014)
  - Fixed bug in make-otus.sh when using fasta files
  - Added make-otus-mothur.sh to create otu list files in mothur format
- 1.1.0 (5 June 2014)
  - Fixed bug in mpi version introduced with change to long indices
  - Added test suites
- 1.0.2 (23 May 2014)
  - Added support for aligned fasta format (automatically detects format based on whether the first character is '>')
  - Added support for computing the clustering of more than 2 million sequences (--enable-longind option for configure command)
  - Fixed issue with eutils not compiling with some gnu compiler versions (push_back error in ebasicarray.h)
- 1.0.1 (May 2014)
  - Fixed several bugs in the optimized distance calculation functions, the sorting function (only with -O optimization), and distributed computing (when the distance threshold is strict or the database is sparse)