Clustering High-throughput Sequencing Samples with Long k-tuple Sequence Signatures

CULO-tuple is a pipeline that compares metagenomic sequencing samples using long k-tuples as features. For each high-throughput sequencing sample, a long k-tuple count vector is obtained by counting the occurrences of each long k-tuple. All count vectors are integrated into a feature matrix of either Boolean or frequency type. Features with identical patterns, that is, the same distribution across the samples, are then merged. The term frequency/inverse document frequency (TF-IDF) measure is applied to calculate the weight of each merged k-tuple feature. A feature selection method called Term Contribution (TC) ranks the contribution of each merged feature. An inflection point is observed in the ranked TC curve, and the features whose TC value is greater than the value at the inflection point are selected for clustering. The Hamming and dot-product dissimilarity measures are used to compare two samples.
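The steps above can be sketched in a few lines of Python. This is an illustrative toy, not the pipeline's own code: every function name is hypothetical, the TF-IDF form (count times log(N/df), followed by L2 normalization), the TC form (sum over all pairs of distinct samples of the product of a feature's weights), and the dot-product dissimilarity (one minus the dot product of normalized weight vectors) are assumptions, and the inflection point is replaced by a user-supplied threshold.

```python
from collections import Counter
from math import log, sqrt

def count_ktuples(reads, k):
    """Count every length-k substring (k-tuple) over the reads of one sample."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def feature_matrix(samples, k, boolean=False):
    """Return (features, rows): one row of counts (or 0/1 flags) per sample."""
    counters = [count_ktuples(reads, k) for reads in samples]
    features = sorted(set().union(*counters))
    rows = [[(1 if c[f] else 0) if boolean else c[f] for f in features]
            for c in counters]
    return features, rows

def merge_identical_columns(features, rows):
    """Merge features whose values are identical across all samples."""
    groups = {}
    for j, f in enumerate(features):
        pattern = tuple(row[j] for row in rows)
        groups.setdefault(pattern, []).append(f)
    merged_features = [tuple(fs) for fs in groups.values()]
    merged_rows = [list(r) for r in zip(*groups.keys())]
    return merged_features, merged_rows

def tfidf(rows):
    """Assumed weighting: count * log(N / df), then L2-normalize each sample."""
    n, n_feat = len(rows), len(rows[0])
    df = [sum(1 for row in rows if row[j] > 0) for j in range(n_feat)]
    out = []
    for row in rows:
        w = [row[j] * log(n / df[j]) for j in range(n_feat)]
        norm = sqrt(sum(x * x for x in w)) or 1.0
        out.append([x / norm for x in w])
    return out

def term_contribution(weights):
    """Assumed TC: sum over sample pairs i != j of w[i][t] * w[j][t]."""
    n = len(weights)
    return [sum(weights[a][j] * weights[b][j]
                for a in range(n) for b in range(n) if a != b)
            for j in range(len(weights[0]))]

def select_features(tc, threshold):
    """Keep feature indices with TC above a threshold (stand-in for the inflection point)."""
    return [j for j, v in enumerate(tc) if v > threshold]

def hamming(u, v):
    """Number of positions at which two Boolean vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)

def dot_dissimilarity(u, v):
    """Assumed form: 1 - dot product of two L2-normalized weight vectors."""
    return 1.0 - sum(a * b for a, b in zip(u, v))

# Toy input: three "samples", each a list of reads (made-up sequences).
samples = [["ACGTAC"], ["ACGTTT"], ["GGGGGG"]]
features, rows = feature_matrix(samples, k=3)
merged_features, merged_rows = merge_identical_columns(features, rows)
weights = tfidf(merged_rows)
tc = term_contribution(weights)
selected = select_features(tc, threshold=0.0)
```

In this toy run, k-tuples that occur in exactly the same samples with the same counts collapse into one merged feature, and only a feature shared by more than one sample can receive a non-zero TC under the assumed definition. Real long k-tuples would use a much larger k and far more reads than shown here.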
CULO-tuple is a handy pipeline that is implemented for, and runs on, Linux-like platforms. It lets users run the processing application for each step with simple, user-friendly commands. A more detailed manual for the pipeline is provided here.
Citation Download
The source code of the pipeline.
Development Team
The whole pipeline was designed and implemented by Ying Wang’s group, Automation Department, Xiamen University, P.R. China. Questions and suggestions are more than welcome.

Last edited Oct 7, 2015 at 12:36 PM by Ouye, version 19