Identification of Regulatory Modules in Time Series Gene Expression Data using a Linear Time Biclustering Algorithm

Sara C. Madeira, Miguel C. Teixeira, Isabel Sá-Correia and Arlindo L. Oliveira

This webpage makes available a prototype implementation of the CCC-Biclustering algorithm coded in Java together with the datasets and examples used in the paper:

Sara C. Madeira, Miguel C. Teixeira, Isabel Sá-Correia and Arlindo L. Oliveira, "Identification of Regulatory Modules in Time Series Gene Expression Data using a Linear Time Biclustering Algorithm", IEEE/ACM Transactions on Computational Biology and Bioinformatics (to appear). [DOI Article Link]

Datasets

Synthetic

Randomly generated 1500x50 matrix [.txt]

Randomly generated 1500x50 matrix with 10 planted CCC-Biclusters [.txt] [.txt]

Real

Cell Cycle [.txt]

Heat Stress [.txt]

Results

Synthetic

Randomly generated 1500x50 matrix

Sorted by statistical significance p-value [.txt]

Randomly generated 1500x50 matrix with 10 planted CCC-Biclusters

Sorted by statistical significance p-value [.txt]

Sorted by statistical significance p-value; CCC-Biclusters not passing the statistical test at the 1% level (after Bonferroni correction) filtered out [.txt]

Sorted by statistical significance p-value; CCC-Biclusters not passing the statistical test at the 1% level (after Bonferroni correction) and CCC-Biclusters with similarity above 25% filtered out [.txt]

Real

Cell Cycle

Sorted by MSR [.txt]

Sorted by statistical significance p-value; CCC-Biclusters with similarity above 25% filtered out [.txt]

Heat Stress

Sorted by statistical significance p-value [.txt]

Sorted by statistical significance p-value; CCC-Biclusters not passing the statistical test at the 1% level (after Bonferroni correction) filtered out [.txt]

Sorted by statistical significance p-value; CCC-Biclusters not passing the statistical test at the 1% level (after Bonferroni correction) and CCC-Biclusters with similarity above 25% filtered out [.txt] [details]
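Several of the result files listed above keep only the CCC-Biclusters that pass a statistical test at the 1% level after Bonferroni correction. The following is a minimal sketch of such a filter, not the code used to produce the files. It assumes that the number of tests equals the number of p-values supplied; the class name `BonferroniFilter`, the method `significant` and the sample p-values are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of a Bonferroni-corrected significance filter. */
final class BonferroniFilter {

    /**
     * Returns the indices of the p-values that pass the test at level alpha
     * after Bonferroni correction. Assumption: one test per supplied p-value.
     */
    static List<Integer> significant(double[] pValues, double alpha) {
        // Bonferroni: compare each p-value with alpha divided by the number of tests.
        double threshold = alpha / pValues.length;
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < pValues.length; i++) {
            if (pValues[i] < threshold) {
                kept.add(i);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        double[] pValues = {1e-9, 0.004, 2e-4, 0.5}; // made-up values for illustration
        // Threshold is 0.01 / 4 = 0.0025, so indices 0 and 2 pass.
        System.out.println(significant(pValues, 0.01)); // prints [0, 2]
    }
}
```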

Software

The software available here allows the reproduction of the results in the paper and also the execution of the CCC-Biclustering algorithm using a gene expression matrix provided by the user. The gene expression matrix must be a .txt file formatted as in the examples provided below.

The algorithm is coded in Java. Before running the examples below, please make sure that the JDK installed on your computer is version 1.5 or later. The algorithm should run on any operating system. One gigabyte of memory is recommended if you want to run the algorithm on large gene expression matrices.
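As a quick check of these requirements, the small program below prints the version of the Java runtime and the maximum heap the JVM will use. It is only a convenience sketch; the class name `JvmCheck` is hypothetical and is not part of the distributed .jar files.

```java
/** Illustrative sketch: report the Java version and the maximum heap size. */
final class JvmCheck {

    /** Maximum heap the JVM will attempt to use, in megabytes. */
    static long maxHeapMegabytes() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // Standard system property holding the runtime version string.
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("max heap (MB) = " + maxHeapMegabytes());
    }
}
```

Running it with the same `-Xmx1024M` option used in the commands on this page shows whether the requested heap is available; the reported value can be slightly below 1024.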

To run the algorithm, copy the .jar file and the .txt file containing the expression matrix to the same directory and type the commands below on the command line.

If you have any questions, please contact Sara C. Madeira.

Reproduce Results in the Paper

Synthetic Data [.jar] [matrixNoPlantedBiclusters] [matrixPlantedBiclusters]

            java -jar -Xss50M -Xms1024M -Xmx1024M Test_TCBB_CCC_Biclustering_Synthetic.jar

Cell Cycle Data [.jar] [cell_cycle.txt]

            java -jar -Xss50M -Xms1024M -Xmx1024M Test_TCBB_Cell_Cycle.jar

Heat Stress Data [.jar] [heat_stress.txt]

            java -jar -Xss50M -Xms1024M -Xmx1024M Test_TCBB_Heat_Stress.jar

Run CCC-Biclustering with Other Datasets [.jar]

            java -jar -Xss50M -Xms1024M -Xmx1024M Test_TCBB_CCC_Biclustering.jar yourExpressionMatrix.txt overlapping

        yourExpressionMatrix.txt - name of the .txt file containing your expression matrix
        overlapping - float in [0,1] giving the maximum fraction of overlap allowed (all CCC-Biclusters overlapping more than this value are filtered out)
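To illustrate what an overlap threshold of this kind does, here is a minimal sketch of a greedy overlap filter. It is not the implementation inside the .jar file. It assumes that overlap is measured as the Jaccard similarity of the matrix cells covered by two biclusters, and that biclusters are visited in their input order (for example, sorted by p-value). The names `OverlapFilter`, `Bicluster`, `similarity` and `filter` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative sketch of filtering biclusters by pairwise overlap. */
final class OverlapFilter {

    /** A bicluster: a set of genes and a contiguous column range [firstCol, lastCol]. */
    static final class Bicluster {
        final Set<String> genes;
        final int firstCol;
        final int lastCol;

        Bicluster(Set<String> genes, int firstCol, int lastCol) {
            this.genes = genes;
            this.firstCol = firstCol;
            this.lastCol = lastCol;
        }

        /** Number of matrix cells covered by this bicluster. */
        int size() {
            return genes.size() * (lastCol - firstCol + 1);
        }
    }

    /** Assumed overlap measure: shared cells divided by cells in the union (Jaccard). */
    static double similarity(Bicluster a, Bicluster b) {
        Set<String> sharedGenes = new HashSet<>(a.genes);
        sharedGenes.retainAll(b.genes);
        int sharedCols = Math.min(a.lastCol, b.lastCol) - Math.max(a.firstCol, b.firstCol) + 1;
        if (sharedGenes.isEmpty() || sharedCols <= 0) {
            return 0.0; // no common cells
        }
        double shared = (double) sharedGenes.size() * sharedCols;
        return shared / (a.size() + b.size() - shared);
    }

    /** Keeps biclusters in input order, dropping any that overlap a kept one by more than maxOverlap. */
    static List<Bicluster> filter(List<Bicluster> sorted, double maxOverlap) {
        List<Bicluster> kept = new ArrayList<>();
        for (Bicluster candidate : sorted) {
            boolean tooSimilar = false;
            for (Bicluster k : kept) {
                if (similarity(candidate, k) > maxOverlap) {
                    tooSimilar = true;
                    break;
                }
            }
            if (!tooSimilar) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up biclusters for illustration only.
        Bicluster a = new Bicluster(new HashSet<>(Arrays.asList("g1", "g2", "g3")), 0, 3);
        Bicluster b = new Bicluster(new HashSet<>(Arrays.asList("g2", "g3", "g4")), 2, 5);
        Bicluster c = new Bicluster(new HashSet<>(Arrays.asList("g1", "g2")), 0, 3);
        // a and b share 4 of 20 cells (0.2); c lies entirely inside a (8 of 12 cells).
        System.out.println(filter(Arrays.asList(a, b, c), 0.25).size()); // prints 2
    }
}
```

With a threshold of 0.25, `b` is kept because its similarity to `a` is 0.2, while `c` is dropped because its similarity to `a` is about 0.67.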

Supplementary Material [.pdf]

The CCC-Biclustering algorithm (together with extended versions that allow missing values and the discovery of anticorrelated and scaled expression patterns) is integrated in BiGGEsTS (Biclustering Gene Expression Time Series), a free and open source software tool providing an integrated environment for the biclustering analysis of time series gene expression data. BiGGEsTS makes the algorithm available in a user-friendly graphical environment and also allows the user to preprocess the data and to postprocess and analyse the results using several criteria.

Last Update: July 2009