A Polynomial Time Biclustering Algorithm

for Finding Genes with Approximate Expression Patterns

in Gene Expression Time Series

Sara C. Madeira and Arlindo L. Oliveira

This webpage makes available a prototype implementation of the e-CCC-Biclustering algorithm coded in Java together with the dataset and examples used in the paper:

Sara C. Madeira and Arlindo L. Oliveira, "A polynomial time biclustering algorithm for finding genes with approximate expression patterns in gene expression time series", Algorithms for Molecular Biology 2009, 4:8 (4 June 2009). [DOI Article Link]

Datasets

Synthetic

Illustrative example [.txt]

Illustrative example with missing values [.txt]

Real

Heat Stress [.txt]

Results

Synthetic

Illustrative example

Maximal CCC-Biclusters with at least two rows [.txt]

Maximal 1-CCC-Biclusters with at least three rows/columns [.txt]

Maximal 1-CCC-Biclusters with at least three rows/columns and errors restricted to the 1-neighborhood of the alphabet {D,N,U} [.txt]

Illustrative example with missing values

Maximal 1-CCC-Biclusters with at least three rows/columns when missing values are considered as valid errors [.txt]

Maximal 1-CCC-Biclusters with at least three rows and two columns when missing values are "jumped over" [.txt]

Maximal 1-CCC-Biclusters with sign-changes with at least three rows/columns when missing values are "jumped over" [.txt]

Real

e-CCC-Biclustering

Sorted by statistical significance p-value [.txt]

Sorted by statistical significance p-value, filtered statistical p-values not passing the statistical test at 1% level (after Bonferroni correction) [.txt]

Sorted by statistical significance p-value, filtered statistical p-values not passing the statistical test at 1% level (after Bonferroni correction) , filtered similarities above 25% [.txt]

Biological validation details

CCC-Biclustering

Sorted by statistical significance p-value [.txt]

Sorted by statistical significance p-value, filtered statistical p-values not passing the statistical test at 1% level (after Bonferroni correction) [.txt]

Sorted by statistical significance p-value, filtered statistical p-values not passing the statistical test at 1% level (after Bonferroni correction) , filtered similarities above 25% [.txt]

Biological validation details

Software

The software available here allows the reproduction of the results in the paper and also the execution of the e-CCC-Biclustering algorithm using a gene expression matrix provided by the user. The gene expression matrix must be a .txt file formatted as in the examples provided below.

The algorithm is coded in Java. Before running the examples below please make sure the version of jdk installed in your computer is at least jdk1.5. The algorithm should run in any operating system. A gigabyte of memory is recommended if you want to run the algorihm in large gene expression matrices.

In order to run the algorithm copy the .jar file together with the .txt file containing the expression matrix to the same directory and type the commands below in the command line.

If you have any questions please contact Sara C. Madeira.

Reproduce Results in the Paper

Heat Stress Data

CCC-Biclustering: follow the instructions on the CCC-Biclustering webpage.

e-CCC-Biclustering:  [.jar][heat_stress.txt]

        java -jar -Xss50M -Xms1024M -Xmx1024M Test_AMB_Heat_Stress.jar
Run e-CCC-Biclustering with Other Datasets and Options

Running standard e-CCC-Biclustering (no extensions)

        Copy the following [.jar] file into the directory where you want to run the algorithm together with the .txt file containing your expression matrix.

        Then type the command below in your command line and replace the 5 parameters with the values of your choice.

  java -jar -Xss50M -Xms1024M -Xmx1024M Test_AMB_E_CCC_Biclustering.jar yourExpressionMatrix.txt maxErrors rowQuorum columnQuorum overlapping

        yourExpressionMatrix.txt - name of the .txt file containing your expression matrix
maxErrors - integer containing the amount of errors allowed, per gene, in the e-CCC-Biclustering algorithm (value of e)
rowQuorum - integer containing the row quorum (minimum number of genes allowed in e-CCC-Biclusters)
columnQuorum - integer containing the column quorum (minimum number of contiguous time points allowed in e-CCC-Biclusters)
overlapping - float in [0,1] containing the maximum percentage of overlapping allowed (all e-CCC-Biclusters overlapping more than this value are filtered)

For example, if you want to use a matrix in the file matrix.txt, use e=1, row quorum = 3, column quorum = 2 and filter e-CCC-Biclusters overllapping more than 25% you should type and execute the following command:

        java -jar -Xss50M -Xms1024M -Xmx1024M Test_AMB_E_CCC_Biclustering.jar data.txt 1 3 2 0.25

Running extended e-CCC-Biclustering

        Copy the following [.jar] file into the directory where you want to run the algorithm together with the .txt file containing your expression matrix.

        Then type the command below in your command line and replace the 8 parameters with the values of your choice.

   java -jar -Xss50M -Xms1024M -Xmx1024M Test_AMB_E_CCC_Biclustering_Extensions.jar yourExpressionMatrix.txt maxErrors rowQuorum columnQuorum overlapping missings anticorrelation restrictedErrors

        yourExpressionMatrix.txt - name of the .txt file containing your expression matrix
maxErrors - integer containing the amount of errors allowed, per gene, in the e-CCC-Biclustering algorithm (value of e)
rowQuorum - integer containing the row quorum (minimum number of genes allowed in e-CCC-Biclusters)
columnQuorum - integer containing the column quorum (minimum number of contiguous time points allowed in e-CCC-Biclusters)
overlapping - float in [0,1] containing the maximum percentage of overlapping allowed (all e-CCC-Biclusters overlapping more than this value are filtered)

missings - char with three possible values:

                                    R - remove genes with missing values
                                    A - allow missing values as valid errors
                                    J - "jump over" missing values

anticorrelation - char with two possible values:
N - no anticorrelation allowed
Y - anticorrelation alllowed, the algorithm will look for e-CCC-Biclusters with Sign-Changes

restrictedErrors - char with two possible values:
N - errors are not restricted

Y - errors are restricted to the symbols in the 1-neighborhood of the symbols in the alphabet. Since the alphabet {D,N,U} is used in the predefined discretization step provided in this version of the prothotype, the number of neighbors used in the restrited errors extension can only be equal to 1.

For example, if you want to use a matrix in the file matrix.txt, use e=1, row quorum = 3, column quorum = 2, filter e-CCC-Biclusters overllapping more than 25%, "jump over" missing values, allow anticorrelation and restricted the errors to the 1-neighborhood of the symbols in the alphabet you should type and execute the following command:

java -jar -Xss50M -Xms1024M -Xmx1024M Test_AMB_E_CCC_Biclustering.jar data.txt 1 3 2 0.25 J Y Y

The e-CCC-Biclustering algorithm and its extended version are integrated in the software BiGGEsTS (Biclustering Gene Expression Time Series), a free and open source software tool providing an integrated environment for the biclustering analysis of time series gene expression data. This software enables a user-friendly usage of the algorithm in a graphical environment together with the possibility to preprocess the data and postprocess and analyse the results using several criteria.

Last Update: July 2009