Manual for the CULO-tuple pipeline on Linux

Required running environment
  1. A Linux operating system;
  2. Python 2.7 or higher;
  3. R 2.7 or higher;
  4. A tuple-counting tool (e.g., DSK or Jellyfish).

Processing procedure
Step 1: Download the pipeline source code and test data to your workspace directory.

Step 2: Count the tuples of length 40 in each metagenomic sample by scanning its short reads with DSK. The commands are as follows:
  1. Compiling: make omp=1 k=40. This command compiles for parallel execution; k denotes the tuple length. For serial execution, omit the omp option.
  2. Tuple counting: ./dsk Sample_A_01.fa 40
  3. Format transformation: ./parese Sample_A_01.solid_kmers_binary > Sample_A_01_40-tuple.txt
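
To make clear what "tuple counting" produces, here is a minimal Python sketch that counts every substring of length k in a set of reads. It is only an illustration: it is not how DSK works internally, it holds all counts in memory, and it counts tuples on the given strand only. The function name count_tuples is hypothetical.

```python
from collections import Counter

def count_tuples(reads, k=40):
    """Count every length-k substring (tuple) over an iterable of read strings."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        # Slide a window of width k along the read.
        # Reads shorter than k contribute nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts
```

For real metagenomic samples a dedicated counter such as DSK or Jellyfish remains the practical choice, because the number of distinct 40-length tuples quickly exceeds what a Python dictionary can hold.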

Step 3: Filter out the rarely occurring tuples in each sample.

./ -f Sample_A_01_40-tuple.txt -n x
Where x is the threshold: only tuples whose occurrence count is greater than x are retained.
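
The filtering rule can be sketched in Python as follows. This assumes each input line holds a tuple and its count separated by whitespace, which is an assumption about the file layout and not something this manual specifies. The function name filter_tuples is hypothetical.

```python
def filter_tuples(lines, x):
    """Yield (tuple, count) pairs whose count is strictly greater than x.

    Assumes each line looks like "<tuple> <count>" (assumed layout).
    """
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        tup, count = fields[0], int(fields[1])
        if count > x:
            yield tup, count
```

With x = 1, for example, tuples seen only once are dropped.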

Step 4: Sort each filtered sample by tuple sequence.

./ -f xxxx.filtered.txt -p outputDir
Where the -f option takes the filtered sample from the previous step, and the -p option specifies the output directory for the sorted files (default: the current directory).
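
Sorting matters because the join command used in the next step expects both inputs to be sorted on the join field. A minimal sketch of the sort, assuming the records are already loaded as (tuple, count) pairs; the function name sort_by_tuple is hypothetical.

```python
def sort_by_tuple(records):
    """Return (tuple, count) records ordered by tuple sequence.

    Python compares strings by code point, which corresponds to the
    ordering that `sort` and `join` use under LC_ALL=C.
    """
    return sorted(records, key=lambda rec: rec[0])
```

If the sorted files are produced or consumed by other tools, using the same locale everywhere avoids "input is not in sorted order" complaints from join.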

Step 5: Integrate all the samples into a feature matrix.

For this step we use the Linux join command as follows:

join -a1 -a2 -1 1 -2 1 -e '0' -o 0,1.2,2.2 sample-01.txt sample-02.txt > new-01.txt
First, two distinct samples are joined into a new sample; this procedure is then repeated on the new samples until all samples have been integrated into a single feature matrix. For more details on the join command, see its manual page (man join).
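
In the join command above, -a1 and -a2 keep the unpaired lines from both files, -e '0' fills the missing fields with 0, and -o 0,1.2,2.2 prints the tuple followed by the count from each file. The same full outer join can be sketched in Python. Unlike the pairwise procedure described above, this sketch merges all samples in memory in one pass, so it is an illustration of the result and not the pipeline's method. The function name join_samples is hypothetical.

```python
def join_samples(samples):
    """Full outer join of several samples into a feature matrix.

    samples: list of dicts mapping tuple -> count, one dict per sample.
    Returns (tuples, matrix); matrix[i][j] is the count of tuples[i]
    in sample j, or 0 when the tuple is absent from that sample.
    """
    all_tuples = sorted(set().union(*samples))
    matrix = [[s.get(t, 0) for s in samples] for t in all_tuples]
    return all_tuples, matrix
```

When repeating join on already-joined files, note that the -o field list has to name every count column of the intermediate file, not only 1.2 and 2.2.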

Step 6: Merge features with identical patterns.
  • Frequency type
./ -f xxx_OriginalFeatureMatrix.txt -o xxx_MergedFeatureMatrix_frequency.txt
  • Boolean type
./ -f xxx_OriginalFeatureMatrix.txt -o xxx_MergedFeatureMatrix_boolean.txt
Where the -f option takes the feature matrix from the previous step, and the -o option specifies the output file name.
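
One plausible reading of this step is sketched below: tuples whose rows are identical across all samples are collapsed into one merged feature. The sketch assumes that the frequency type compares the raw counts and the Boolean type compares only presence or absence; this manual does not spell out either rule, so treat both as assumptions. The function name merge_identical_patterns is hypothetical.

```python
from collections import OrderedDict

def merge_identical_patterns(names, matrix, boolean=False):
    """Group features whose rows are identical across samples.

    boolean=False: rows must match count for count (assumed frequency type).
    boolean=True:  rows are reduced to presence/absence first (assumed Boolean type).
    Returns a list of (member_names, pattern) in first-seen order.
    """
    groups = OrderedDict()
    for name, row in zip(names, matrix):
        if boolean:
            pattern = tuple(1 if v > 0 else 0 for v in row)
        else:
            pattern = tuple(row)
        groups.setdefault(pattern, []).append(name)
    return [(members, list(pattern)) for pattern, members in groups.items()]
```

Under this reading, the Boolean matrix is never larger than the frequency matrix, since rows that differ only in magnitude fall into the same group.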

Step 7: Calculate the TF-IDF weights and TC values of the merged features.

./ -f xxx_MergedFeatureMatrix.txt -n N
./ -w weight_matrix.txt -t total_w.txt -o xxx_TF-IDF_TC.txt
Where the -f option takes the merged feature matrix (frequency or Boolean) from the previous step, and N is the number of samples.
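
This manual does not give the formulas, so the following is only a sketch under two stated assumptions: the TF-IDF weight of a feature in a sample is its count times log(N / df), where df is the number of samples containing the feature, and TC stands for "term contribution" as used in text clustering, that is, the sum of the products of a feature's weights over all ordered pairs of distinct samples. The pipeline's scripts may normalise or weight differently. Both function names are hypothetical.

```python
import math

def tfidf_weights(matrix):
    """Weight each count by log(N / df), with N samples and df samples containing the feature."""
    n_samples = len(matrix[0]) if matrix else 0
    weights = []
    for row in matrix:
        df = sum(1 for v in row if v > 0)
        idf = math.log(float(n_samples) / df) if df else 0.0
        weights.append([v * idf for v in row])
    return weights

def term_contribution(weight_row):
    """Sum of w_i * w_j over all ordered pairs i != j.

    Uses the identity: sum_{i != j} w_i w_j = (sum w)^2 - sum w^2.
    """
    total = sum(weight_row)
    return total * total - sum(w * w for w in weight_row)
```

Under these assumptions a feature present in every sample gets weight 0 everywhere and therefore a TC of 0, and a feature present in a single sample also gets a TC of 0 because it has no pair of samples to contribute to.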

Step 8: Rank the TC values of the merged features.

./ -f xxx_TF-IDF_TC.txt -n N
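
Ranking here is taken to mean sorting the merged features by TC value from largest to smallest. A minimal sketch, in which the function name rank_by_tc and its top_n parameter are hypothetical and are not meant to mirror the -n option above:

```python
def rank_by_tc(tc_values, top_n=None):
    """Return (feature, tc) pairs sorted by TC value, largest first.

    tc_values: dict mapping feature name -> TC value.
    top_n: optionally keep only the first top_n entries (hypothetical parameter).
    """
    ranked = sorted(tc_values.items(), key=lambda kv: kv[1], reverse=True)
    return ranked if top_n is None else ranked[:top_n]
```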

Step 9: Select the merged features whose TC values are greater than the inflection point.

./feature_selecting -f xxx_MergedFeatureMatrix.txt -t xxx_TF-IDF_TC_topRanking.txt -o xxx_SelectedFeatureMatrix.txt
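
This manual does not say how the inflection point is located. As a stand-in, the sketch below uses a common heuristic: on the curve of TC values sorted in descending order, take the point farthest from the straight line joining the first and last points, and keep the features whose TC value is strictly greater than the value at that point. This is a swapped-in technique for illustration, not necessarily what feature_selecting does. Both function names are hypothetical.

```python
def find_knee(sorted_desc):
    """Index of the point farthest from the chord joining the curve's endpoints."""
    n = len(sorted_desc)
    if n < 3:
        return 0
    y1, y2 = float(sorted_desc[0]), float(sorted_desc[-1])
    x2 = float(n - 1)
    best_i, best_d = 0, -1.0
    for i, y in enumerate(sorted_desc):
        # Perpendicular distance to the chord, up to a constant factor.
        d = abs((y2 - y1) * i - x2 * y + x2 * y1)
        if d > best_d:
            best_i, best_d = i, d
    return best_i

def select_features(tc_values):
    """Keep features whose TC value is strictly above the knee value."""
    ranked = sorted(tc_values.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 3:
        return [name for name, _ in ranked]  # too few points to locate a knee
    threshold = ranked[find_knee([v for _, v in ranked])][1]
    return [name for name, v in ranked if v > threshold]
```

It is worth plotting the sorted TC curve (for example in R) and checking that the automatically chosen cut-off matches the visible bend before relying on it.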

Step 10: Calculate the dissimilarity between samples.
  • Frequency type
./ -f xxx_SelectedFeatureMatrix.txt -s xxx_samplelist.txt
  • Boolean type
./ -f xxx_SelectedFeatureMatrix.txt -s xxx_samplelist.txt
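
This manual does not name the dissimilarity measures, so the sketch below picks two widely used ones purely as examples: Bray-Curtis dissimilarity for frequency-type data and Jaccard distance for Boolean-type data. Both choices are assumptions, and the pipeline's scripts may use other measures. Each function takes the two columns of the selected feature matrix that belong to the two samples being compared.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return float(num) / den if den else 0.0

def jaccard_distance(u, v):
    """Jaccard distance on presence/absence: 1 - |shared| / |present in either|."""
    both = sum(1 for a, b in zip(u, v) if a > 0 and b > 0)
    either = sum(1 for a, b in zip(u, v) if a > 0 or b > 0)
    return 1.0 - float(both) / either if either else 0.0
```

Both measures range from 0 (identical profiles) to 1 (no shared features), so the resulting sample-by-sample matrices can be fed to the same downstream clustering or ordination code.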

Last edited Oct 22, 2015 at 8:55 AM by Ouye, version 25