ArrayMining - Online Microarray Data Mining
Ensemble and Consensus Analysis Methods for Gene Expression Data
Contact | [X] |
The project and the website are maintained by Enrico Glaab School of Computer Science, Nottingham University, UK webmaster@arraymining.net |
|
Close |
Golub et al. (1999) Leukemia data set | [X] |
Short description: Analysis of patients with acute lymphoblastic leukemia (ALL, 1) or acute myeloid leukemia (AML, 0). Sample types: ALL, AML No. of genes: 7129 No. of samples: 72 (class 0: 25, class 1: 47) Normalization: VSN (Huber et al., 2002) References: - Golub et al., Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring, Science (1999), 531-537 - Huber et al., Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics (2002) 18 Suppl.1 96-104 |
van't Veer et al. (2002) Breast cancer data set | [X] |
Short description: Samples from Breast cancer patients were subdivided in a "good prognosis" (0) and "poor prognosis" (1) group depending on the occurrence of distant metastases within 5 years. The data set is pre-processed as described in the original paper and was obtained from the R package "DENMARKLAB" (Fridlyand and Yang, 2004). Sample types: good prognosis, poor prognosis No. of genes: 4348 (pre-processed) No. of samples: 97 (class 0: 51, class 1: 46) Normalization: see reference (van't Veer et al., 2002) References: - van't Veer et al., Gene expression profiling predicts clinical outcome of breast cancer, Nature (2002), 415, p. 530-536 - Fridlyand,J. and Yang,J.Y.H. (2004) Advanced microarray data analysis: class discovery and class prediction (http://genome. cbs.dtu.dk/courses/norfa2004/Extras/DENMARKLAB.zip) |
Yeoh et al. (2002) Leukemia multi-class data set | [X] | ||
Short description: A multi-class data set for the prediction of the disease subtype in pediatric acute lymphoblastic leukemia (ALL).
No. of samples: 327 Normalization: VSN (Huber et al., 2002) References: - Yeoh et al. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell. March 2002. 1: 133-143 - Huber et al., Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics (2002) 18 Suppl.1 96-104 |
Alon et al. (1999) Colon cancer data set | [X] |
Short description: Analysis of colon cancer tissues (1) and normal colon tissues (0). Sample types: tumour, healthy No. of genes: 2000 No. of samples: 62 (class 1: 40, class 0: 22) Normalization: VSN (Huber et al., 2002) References: - U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine, “Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays,” in Proceedings of the National Academy of Science (1999), vol. 96, pp. 6745–6750 - Huber et al., Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics (2002) 18 Suppl.1 96-104 |
Singh et al. (2002) Prostate cancer data set | [X] |
Short description: Analysis of prostate cancer tissues (1) and normal tissues (0). Sample types: tumour, healthy No. of genes: 2135 (pre-processed) No. of samples: 102 (class 1: 52, class 0: 50) Normalization: GeneChip RMA (GCRMA) References: - D. Singh, P.G. Febbo, K. Ross, D.G. Jackson, J.Manola, C. Ladd, P. Tamayo, A.A. Renshaw, A.V. D’Amico, J.P. Richie, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1(2): pp. 203–209, 2002 - Z. Wu and R.A. Irizarry. Stochastic Models Inspired by Hybridization Theory for Short Oligonucleotide Arrays. Journal of Computational Biology, 12(6): pp. 882–893, 2005 |
Shipp et al. (2002) B-Cell Lymphoma data set | [X] |
Short description: Analysis of Diffuse Large B-Cell lymphoma samples (1) and follicular B-Cell lymphoma samples (0). Sample types: DLBCL, follicular No. of genes: 2647 (pre-processed) No. of samples: 77 (class 1: 58, class 0: 19) Normalization: VSN (Huber et al., 2002) References: - M.A. Shipp, K.N. Ross, P. Tamayo, A.P. Weng, J.L. Kutok, R.C.T. Aguiar, M. Gaasenbeek, M. Angelo, M. Reich, G.S. Pinkus, et al. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8(1): pp. 68–74, 2002 - Huber et al., Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics (2002) 18 Suppl.1 96-104 |
Shin et al. (2007) T-Cell Lymphoma data set | [X] |
Short description: Analysis of cutaneous T-Cell lymphoma (CTCL) samples from lesional skin biopsies. Samples are divided in lower-stage (stages IA and IB, 0) and higher-stage (stages IIB and III) CTCL. Sample types: lower_stage, higher_stage No. of genes: 2922 (pre-processed) No. of samples: 63 (class 1: 20, class 0: 43) Normalization: VSN (Huber et al., 2002) References: - J. Shin, S. Monti, D. J. Aires, M. Duvic, T. Golub, D. A. Jones and T. S. Kuppe, Lesional gene expression profiling in cutaneous T-cell lymphoma reveals natural clusters associated with disease outcome. Blood, 110(8): pp. 3015, 2007 - Huber et al., Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics (2002) 18 Suppl.1 96-104 |
Armstrong et al. (2002) Leukemia data set | [X] |
Short description: Comparison of three classes of Leukemia samples: Acute lymphoblastic leukemia (ALL, 0), acute myelogenous leukemia (AML, 1) and ALL with mixed-lineage leukemia gene translocation (MLL, 3). Sample types: ALL, AML, MLL No. of genes: 8560 (pre-processed) No. of samples: 72 (class 0: 24, class 1: 28, class 2: 20) Normalization: VSN (Huber et al., 2002) References: - S.A. Armstrong, J.E Staunton, L.B. Silverman, R. Pieters, M.L. den Boer, M.D. Minden, S.E. Sallan, E.S. Lander, T.R. Golub, S.J. Korsmeyer; MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics, 30(1): pp. 41–47, 2002 - Huber et al., Variance stabilization applied to microarray data calibration and to the quantification of differential expression, Bioinformatics (2002) 18 Suppl.1 96-104 |
K-Means clustering algorithm | [X] |
Short description: The k-means clustering algorithm (Hartigan and Wong, 1979) is one of the simplest partition-based clustering methods. The data points are partitioned into a given number of k clusters such that the sum of squares from the points to the closest cluster centre is minimized. The algorithm starts with a random cluster centre assignment and then iteratively repeats the following two-step procedure: First, all data points are assigned to the closest cluster centre, then the position of the cluster centres is updated by calculating the centroid of the data points for each cluster. References: - J. A. Hartigan, M. A. Wong, A K-means clustering algorithm, JR Stat. Soc. Ser. C-Appl. Stat, p. 100-108, 28, 1979 |
Partitioning Around Medoids | [X] |
Short description: The "Partioning around medoids"-algorithm (PAM) is a robust altnerative to the partition-based k-means clustering algorithm. PAM searches for k representative objects ("medoids") instead of cluster centres and minimizes a sum of dissimilarities instead of a sum of squared euclidean distances. In contrast to k-means the algorithm also accepts a dissimilarity matrix as input.For a complete description of the PAM algorithm see chapter 2 of Kaufman and Rousseeuw (1990). References: - L. Kaufman and P.J. Rousseeuw, Finding groups in data: An introduction to cluster analysis. Wiley Series in Probability and Mathematical Statistics. Applied Probability and Statistics, New York, 1990 |
Self-Organizing Maps | [X] |
Short description: Kohonen's Self-Organizing Maps (SOM) is a partition-based clustering algorithm which combines clustering and dimension reduction. The data points are mapped into a low-dimensional grid based on their closeness to "prototype" grid points. A SOM can be generated in an online algorithm by iteratively moving prototype grid points that are close to a randomly chosen data point closer towards it based on a given learning rate parameter. A detailed explanation of the algorithm can be found in Kohonen's book "Self-Organizing Maps" (1995). References: - Kohonen, T., Self-Organizing Maps. Springer-Verlag, 1995 |
Self-organizing Tree Algorithm | [X] |
Short description: The Self-Organizing Tree Algorithm (SOTA) builds an unsupervised neural network with a binary tree topology. The algorithm iteratively splits the node with the largest diversity into two child nodes ("cells") until the maximum number of clusters is reached. SOTA can be understood as a combination Kohonen's Self-Organizing Maps (SOM) and hierarchical clustering, which enables the algorithm to capture a hierarchical structure in the data. Further details about the algorithm can be found in the reference paper (Herrero, 2005). References: - Herrero, J., Valencia, A, and Dopazo, J. (2005). A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17, 126-136 |
Divisive Analysis Clustering | [X] |
Short description: Divisive Analysis Clustering (DIANA) is a divisive hierarchical clustering algorithm, starting with a single cluster for all observations and iteratively dividing it. In each step the cluster with the largest diameter is chosen and divided by identifying the most disparate observation, moving it to a newly created cluster and re-assigning the observations that are closer to this new cluster than to their old one. A complete description is available in chapter 6 of Kaufman and Rousseeuw (1990). References: - L. Kaufman and P.J. Rousseeuw, Finding groups in data: An introduction to cluster analysis. Wiley Series in Probability and Mathematical Statistics. Applied Probability and Statistics, New York, 1990 |
Hybrid Hierarchical Clustering | [X] |
Short description: Hybrid Hierarchical Clustering is a combination of agglomerative and devisive hierarchical clustering. The algorithm first uses a bottum-up hierarchical clustering to identify maximum "mutual clusters", where a mutual cluster is a group of data points such that the largest distance between any pair among them is shorter than the shortest distance to any point outside the group. In the second step, top-down clustering is applied on the data, limited by the constraint that mutual clusters may not be broken. A detailed description of the algorithm is given in Chipman and Tibshirani's paper (2006). References: - H. Chipman, R. Tibshirani, Hybrid Hierarchical Clustering with Applications to Microarray Data, Biostatistics, 7, p. 302-317 |
Hierachical clustering | [X] |
Short description: Hierarchical clustering (HCL) is one of the most popular and traditional clustering methods and provides easily interpretable tree-visualizations (dendrograms) of the clustering results. Agglomerative variants of the HCL algorithm iteratively combine clusters with smallest between-cluster dissimilarity (BCD), whereas divisive variants of HCL iteratively divide the cluster with maximum BCD. In contrast to most partition-based clustering algorithms HCL does not require the user to specify the number of clusters as an input parameter, but the method is limited by the assumption that the data has a hierarchical structure. For more details about HCL and different BCD measures see Hartigan, "Clustering Algorithms" (1975). References: - J. A. Hartigan (1975). Clustering Algorithms. New York: Wiley |
Combination of all methods | [X] |
Short description: The "ALL"-option applies all implemented clustering algorithms (k-Means, PAM, SOM, SOTA, HCL, DIANA and AGNES) individually to the selected data set. It provides tables and plots for different cluster validaty measures to compare the results for different clustering algorithms and different numbers of clusters. This option can help the user to identify the optimal number of clusters and choose the most suitable clustering method for a specific data set (however, please note that the runtime for the "ALL"-option is of course longer than for any single clustering algorithm). References: (see references for single methods) |
Help | [X] |
Close |
Terms and Conditions | [X] |
Close |
Arraymining.net - Newsletter | [X] |
Stay informed about updates and new features on our website by joining our newsletter. Your email address remains strictly confidential and will only be used to inform you about major updates of our web-service (<= 1 email per month). You can unsubscribe at any time by clicking on the unsubscribe link at the bottom of our e-mails. |
Arraymining.net - Newsletter | [X] |
|
Gene Expression Omnibus (GEO) data base | [X] |
Short description: The Gene Expression Omnibus (GEO) data base is the largest public microarray data base containing samples from more than 150,000 studies. Every GDS-entry in GEO can be analysed on ArrayMining.net based on the corresponding identifier. Please note that this feature is still experimental and is likely to require longer run-times than the analysis of pre-normalized data sets. References: - Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R. NCBI GEO: mining tens of millions of expression profiles - database and tools update. Nucleic Acids Res, 35, D760, 2007 |
Class Discovery Analysis (Unsupervised Learning)
The module below allows you to perform a clustering analysis for a pre-normalized input matrix.
To obtain instructions click help.