clustergram - Compute hierarchical clustering, display dendrogram and heat map, and create clustergram object

Syntax

CGobj = clustergram(Data)

CGobj = clustergram(Data, ...'RowLabels', RowLabelsValue, ...)
CGobj = clustergram(Data, ...'ColumnLabels', ColumnLabelsValue, ...)
CGobj = clustergram(Data, ...'Standardize', StandardizeValue, ...)
CGobj = clustergram(Data, ...'Cluster', ClusterValue, ...)
CGobj = clustergram(Data, ...'RowPdist', RowPdistValue, ...)
CGobj = clustergram(Data, ...'ColumnPdist', ColumnPdistValue, ...)
CGobj = clustergram(Data, ...'Linkage', LinkageValue, ...)
CGobj = clustergram(Data, ...'Dendrogram', DendrogramValue, ...)
CGobj = clustergram(Data, ...'OptimalLeafOrder', OptimalLeafOrderValue, ...)
CGobj = clustergram(Data, ...'ColorMap', ColorMapValue, ...)
CGobj = clustergram(Data, ...'DisplayRange', DisplayRangeValue, ...)
CGobj = clustergram(Data, ...'SymmetricRange', SymmetricRangeValue, ...)
CGobj = clustergram(Data, ...'LogTrans', LogTransValue, ...)
CGobj = clustergram(Data, ...'Ratio', RatioValue, ...)
CGobj = clustergram(Data, ...'Impute', ImputeValue, ...)
CGobj = clustergram(Data, ...'RowMarker', RowMarkerValue, ...)
CGobj = clustergram(Data, ...'ColumnMarker', ColumnMarkerValue, ...)

Arguments

Data

DataMatrix object or numeric matrix of data. If the matrix contains gene expression data, typically each row corresponds to a gene and each column corresponds to a sample.

RowLabelsValue

Vector of numbers or cell array of text strings to label the rows in the dendrogram and heat map. Default is a vector of values 1 through M, where M is the number of rows in Data.

ColumnLabelsValue

Vector of numbers or cell array of text strings to label the columns in the dendrogram and heat map. Default is a vector of values 1 through N, where N is the number of columns in Data.

StandardizeValue

Numeric value that specifies the dimension for standardizing the values in Data. The standardized values are transformed so that the mean is 0 and the standard deviation is 1 in the specified dimension. Choices are:

  • 1 — Standardize along the columns of data.

  • 2 (default) — Standardize along the rows of data.

  • 3 — Do not perform standardization.
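The effect of the two standardization choices can be sketched with base MATLAB functions. This is only an illustration, not the toolbox's internal code: it assumes that "along the rows" means each row ends up with mean 0 and standard deviation 1, and that the sample standard deviation (N-1 normalization) is used. The matrix X and the variable names are hypothetical.

```matlab
% Hypothetical 4-by-5 data matrix (rows = genes, columns = samples)
X = [ 1  2  3  4  5;
      2  4  6  8 10;
      5  3  9  1  7;
     10 20 15  5  0];

% Assumed equivalent of 'Standardize', 2: subtract each row's mean,
% then divide by each row's sample standard deviation
Zrow = bsxfun(@rdivide, bsxfun(@minus, X, mean(X, 2)), std(X, 0, 2));

% Assumed equivalent of 'Standardize', 1: the same operation per column
Zcol = bsxfun(@rdivide, bsxfun(@minus, X, mean(X, 1)), std(X, 0, 1));
```

After this, every row of Zrow and every column of Zcol has mean 0 and standard deviation 1, up to floating-point rounding.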

ClusterValue

Numeric value that specifies the dimension for clustering the values in Data. Choices are:

  • 1 — Cluster rows of data only.

  • 2 — Cluster columns of data only.

  • 3 (default) — Cluster rows of data, then cluster columns of row-clustered data.

RowPdistValue

String that specifies the distance metric to pass to the pdist function (Statistics Toolbox software) to use to calculate the pairwise distances between rows. For information on choices, see the pdist function. Default is 'euclidean'.

    Note   If the distance metric requires extra arguments, then RowPdistValue is a cell array. For example, to use the Minkowski distance with exponent P, you would use {'minkowski', P}.

ColumnPdistValue

String that specifies the distance metric to pass to the pdist function (Statistics Toolbox software) to use to calculate the pairwise distances between columns. For information on choices, see the pdist function. Default is 'euclidean'.

    Note   If the distance metric requires extra arguments, then ColumnPdistValue is a cell array. For example, to use the Minkowski distance with exponent P, you would use {'minkowski', P}.
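As a brief sketch of the two forms, the call below passes a cell array for the row metric and a plain string for the column metric. The data matrix is hypothetical and exists only so that the call has something to cluster. The exponent 3 is an arbitrary choice, and 'cityblock' is one of the metrics accepted by pdist.

```matlab
% Hypothetical 20-by-6 data matrix used only for illustration
data = reshape(sin(1:120), 20, 6);

% Cell array form: Minkowski distance with exponent 3 between rows.
% String form: city block distance between columns.
cgo = clustergram(data, 'RowPdist', {'minkowski', 3}, ...
                        'ColumnPdist', 'cityblock');
```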

LinkageValue

String or two-element cell array of strings that specifies the linkage method to pass to the linkage function (Statistics Toolbox software) to use to create the hierarchical cluster tree for rows and columns. If a two-element cell array of strings, the first element is used for linkage between rows, and the second element is used for linkage between columns. For information on choices, see the linkage function. Default is 'average'.

    Tip   To specify the linkage method for only one dimension, set the other dimension to ''.
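For instance, following the tip above, a call along these lines would request complete linkage for the rows while leaving the column linkage at its default. The data matrix is hypothetical.

```matlab
% Hypothetical 20-by-6 data matrix used only for illustration
data = reshape(sin(1:120), 20, 6);

% First element: linkage between rows. Second element: linkage between
% columns, left as '' so that the default method is used.
cgo = clustergram(data, 'Linkage', {'complete', ''});
```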

DendrogramValue

Scalar or two-element numeric vector or cell array of strings that specifies the 'colorthreshold' property to pass to the dendrogram function (Statistics Toolbox software) to create the dendrogram plot. If a two-element numeric vector or cell array, the first element is for the rows, and the second element is for the columns. For more information, see the dendrogram function.

    Tip   To specify the 'colorthreshold' property for only one dimension, set the other dimension to ''.

OptimalLeafOrderValue

Property to enable or disable the optimal leaf ordering calculation, which determines the leaf order that maximizes the similarity between neighboring leaves. Choices are true (enable) or false (disable). Default depends on the size of Data. If the number of rows or columns in Data is greater than 1000, default is false; otherwise, default is true.

    Note   Disabling the optimal leaf ordering calculation can be useful when working with large data sets because this calculation uses a large amount of memory and can be very time consuming.

ColorMapValue

Either of the following:

  • M-by-3 matrix of RGB values

  • Name of or handle to a function that returns a colormap, such as redgreencmap or redbluecmap

Default is redgreencmap, in which red represents values above the mean, black represents the mean, and green represents values below the mean of a row (gene) across all columns (samples).

DisplayRangeValue

Positive scalar that specifies the display range of standardized values. Default is 3, which means there is a color variation for values between –3 and 3, but values >3 will be the same color as 3, and values < –3 will be the same color as –3.

For example, if you specify redgreencmap for the 'ColorMap' property, pure red represents values ≥ DisplayRangeValue, and pure green represents values ≤ –DisplayRangeValue.
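The saturation behavior described above amounts to clamping the standardized values to the display range before they are mapped to colors. The following base MATLAB sketch illustrates that idea only. It is not the toolbox's implementation, and the variable names are hypothetical.

```matlab
% Hypothetical standardized values
Z = [-4.2 -3 -1.5 0 0.7 3 5.1];
displayRange = 3;   % same value as the documented default

% Values outside [-displayRange, displayRange] are pulled back to the limits,
% so they receive the same color as the limits themselves
Zclamped = max(min(Z, displayRange), -displayRange);
```

Here -4.2 and 5.1 are clamped to -3 and 3, while the values already inside the range are unchanged.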

SymmetricRangeValue

Property to force the color scale of the heat map to be symmetric around zero. Choices are true (default) or false.

LogTransValue

Controls the log2 transform of Data from natural scale. Choices are true or false (default).

RatioValue

Either of the following:

  • Scalar

  • Two-element vector

It specifies the ratio of space that the row and column dendrograms occupy relative to the heat map. If RatioValue is a scalar, it is used as the ratio for both dendrograms. If RatioValue is a two-element vector, the first element is used for the ratio of the row dendrogram width to the heat map width, and the second element is used for the ratio of the column dendrogram height to the heat map height. The second element is ignored for one-dimensional clustergrams. Default is 1/5.
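As an illustration of the two-element form, the following call would give the row dendrogram 30% of the heat map width and the column dendrogram 15% of the heat map height. The data matrix and the ratio values are hypothetical.

```matlab
% Hypothetical 20-by-6 data matrix used only for illustration
data = reshape(sin(1:120), 20, 6);

% First element: row dendrogram width relative to the heat map width.
% Second element: column dendrogram height relative to the heat map height.
cgo = clustergram(data, 'Ratio', [0.3 0.15]);
```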

ImputeValue

Any of the following:

  • Name of a function that imputes missing data

  • Handle to a function that imputes missing data

  • Cell array where the first element is the name of or handle to a function that imputes missing data and the remaining elements are property name/property value pairs used as inputs to the function

    Caution   If you have missing data points, use the 'Impute' property; otherwise, the clustergram function will error.
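A minimal sketch of the function handle form follows. It assumes that the knnimpute function from the Bioinformatics Toolbox software is available and is a suitable imputation method for the data. The data matrix and the positions of the missing values are hypothetical.

```matlab
% Hypothetical data matrix with two missing values
data = reshape(sin(1:120), 20, 6);
data(3, 2)  = NaN;
data(11, 5) = NaN;

% Impute the missing values with knnimpute before clustering
cgo = clustergram(data, 'Impute', @knnimpute);
```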

RowMarkerValue

Optional structure array for annotating the groups (clusters) of rows determined by the clustergram function. Each structure in the array represents a group of rows and contains the following fields:

  • GroupNumber — The row group number to annotate.

  • Annotation — String specifying text to annotate the row group.

  • Color — String or three-element vector of RGB values specifying a color, which is used to label the row group. For more information on specifying colors, see colorspec. If this field is empty, default is 'blue'.

ColumnMarkerValue

Optional structure array for annotating groups (clusters) of columns determined by the clustergram function. Each structure in the array represents a group of columns and contains the following fields:

  • GroupNumber — The column group number to annotate.

  • Annotation — String specifying text to annotate the column group.

  • Color — String or three-element vector of RGB values specifying a color, which is used to label the column group. For more information on specifying colors, see colorspec. If this field is empty, default is 'blue'.

Description

CGobj = clustergram(Data) performs hierarchical clustering analysis on the values in Data, a DataMatrix object or numeric matrix, creates CGobj, an object containing the analysis data, and displays a dendrogram and heat map. It uses hierarchical clustering with the Euclidean distance metric and average linkage to generate the hierarchical tree. The clustering is performed first along the columns (producing row-clustered data), and then along the rows in the matrix Data. If Data contains gene expression data, typically the rows correspond to genes and the columns correspond to samples.

CGobj = clustergram(Data, ...'PropertyName', PropertyValue, ...) calls clustergram with optional properties that use property name/property value pairs. You can specify one or more properties in any order. Each PropertyName must be enclosed in single quotation marks and is case insensitive. These property name/property value pairs are as follows:

CGobj = clustergram(Data, ...'RowLabels', RowLabelsValue, ...) uses the contents of RowLabelsValue, a vector of numbers or cell array of text strings, as labels for the rows in the dendrogram and heat map. Default is a vector of values 1 through M, where M is the number of rows in Data.

CGobj = clustergram(Data, ...'ColumnLabels', ColumnLabelsValue, ...) uses the contents of ColumnLabelsValue, a vector of numbers or cell array of text strings, as labels for the columns in the dendrogram and heat map. Default is a vector of values 1 through N, where N is the number of columns in Data.

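CGobj = clustergram(Data, ...'Standardize', StandardizeValue, ...) specifies the dimension for standardizing the values in Data. The standardized values are transformed so that the mean is 0 and the standard deviation is 1 in the specified dimension. StandardizeValue can be 1 (standardize along the columns of data), 2 (standardize along the rows of data; this is the default), or 3 (do not perform standardization).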

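CGobj = clustergram(Data, ...'Cluster', ClusterValue, ...) specifies the dimension for clustering the values in Data. ClusterValue can be 1 (cluster rows of data only), 2 (cluster columns of data only), or 3 (cluster rows of data, then cluster columns of row-clustered data; this is the default).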

CGobj = clustergram(Data, ...'RowPdist', RowPdistValue, ...) specifies the distance metric to pass to the pdist function (Statistics Toolbox software) to use to calculate the pairwise distances between rows. RowPdistValue is a string. For information on choices, see the pdist function. Default is 'euclidean'.

CGobj = clustergram(Data, ...'ColumnPdist', ColumnPdistValue, ...) specifies the distance metric to pass to the pdist function (Statistics Toolbox software) to use to calculate the pairwise distances between columns. ColumnPdistValue is a string. For information on choices, see the pdist function. Default is 'euclidean'.

CGobj = clustergram(Data, ...'Linkage', LinkageValue, ...) specifies the linkage method to pass to the linkage function (Statistics Toolbox software) to use to create the hierarchical cluster tree for rows and columns. LinkageValue is a string or two-element cell array of strings. If a two-element cell array of strings, the first element is used for linkage between rows, and the second element is used for linkage between columns. For information on choices, see the linkage function. Default is 'average'.

CGobj = clustergram(Data, ...'Dendrogram', DendrogramValue, ...) specifies the 'colorthreshold' property to pass to the dendrogram function (Statistics Toolbox software) to create the dendrogram plot. DendrogramValue is a scalar or two-element numeric vector or cell array of strings that specifies the 'colorthreshold' property. If a two-element numeric vector or cell array, the first element is for the rows, and the second element is for the columns. For more information, see the dendrogram function.

CGobj = clustergram(Data, ...'OptimalLeafOrder', OptimalLeafOrderValue, ...) enables or disables the optimal leaf ordering calculation, which determines the leaf order that maximizes the similarity between neighboring leaves. Choices are true (enable) or false (disable). Default depends on the size of Data. If the number of rows or columns in Data is greater than 1000, default is false; otherwise, default is true.

CGobj = clustergram(Data, ...'ColorMap', ColorMapValue, ...) specifies the colormap to use to create the clustergram. This controls the colors used to display the heat map. ColorMapValue is either an M-by-3 matrix of RGB values or the name of or handle to a function that returns a colormap, such as redgreencmap or redbluecmap. Default is redgreencmap.

CGobj = clustergram(Data, ...'DisplayRange', DisplayRangeValue, ...) specifies the display range of standardized values. DisplayRangeValue must be a positive scalar. Default is 3, which means there is a color variation for values between –3 and 3, but values >3 will be the same color as 3, and values < –3 will be the same color as –3.

For example, if you specify redgreencmap for the 'ColorMap' property, pure red represents values ≥ DisplayRangeValue, and pure green represents values ≤ –DisplayRangeValue.

CGobj = clustergram(Data, ...'SymmetricRange', SymmetricRangeValue, ...) controls whether the color scale of the heat map is symmetric around zero. SymmetricRangeValue can be true (default) or false.

CGobj = clustergram(Data, ...'LogTrans', LogTransValue, ...) controls the log2 transform of Data from natural scale. Choices are true or false (default).

CGobj = clustergram(Data, ...'Ratio', RatioValue, ...) specifies the ratio of space that the row and column dendrograms occupy relative to the heat map. If RatioValue is a scalar, it is used as the ratio for both dendrograms. If RatioValue is a two-element vector, the first element is used for the ratio of the row dendrogram width to the heat map width, and the second element is used for the ratio of the column dendrogram height to the heat map height. The second element is ignored for one-dimensional clustergrams. Default is 1/5.

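CGobj = clustergram(Data, ...'Impute', ImputeValue, ...) specifies a function and optional inputs that impute missing data. ImputeValue can be the name of a function that imputes missing data, a handle to such a function, or a cell array where the first element is the name of or handle to such a function and the remaining elements are property name/property value pairs used as inputs to the function.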

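CGobj = clustergram(Data, ...'RowMarker', RowMarkerValue, ...) specifies an optional structure array for annotating the groups of rows determined by the clustergram function. Each structure in the array represents a group of rows and contains the fields GroupNumber (the row group number to annotate), Annotation (a string specifying text to annotate the row group), and Color (a string or three-element vector of RGB values specifying the color used to label the row group; if this field is empty, default is 'blue').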

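CGobj = clustergram(Data, ...'ColumnMarker', ColumnMarkerValue, ...) specifies an optional structure array for annotating the groups of columns determined by the clustergram function. Each structure in the array represents a group of columns and contains the fields GroupNumber (the column group number to annotate), Annotation (a string specifying text to annotate the column group), and Color (a string or three-element vector of RGB values specifying the color used to label the column group; if this field is empty, default is 'blue').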

Examples

The following example uses data from an experiment (DeRisi et al., 1997) that used DNA microarrays to study temporal gene expression of almost all genes in Saccharomyces cerevisiae during the metabolic shift from fermentation to respiration. Expression levels were measured at seven time points during the diauxic shift.

  1. Load the MAT-file, provided with the Bioinformatics Toolbox software, that contains filtered yeast data. This MAT-file includes three variables: yeastvalues, a matrix of gene expression data; genes, a cell array of GenBank accession numbers for labeling the rows in yeastvalues; and times, a vector of time values for labeling the columns in yeastvalues.

    load filteredyeastdata
    
  2. Create a clustergram object and display the dendrograms and heat map from the gene expression data in the first 30 rows of the yeastvalues matrix.

    cgo = clustergram(yeastvalues(1:30,:))
    Clustergram object with 30 rows of nodes and 7 columns of nodes.

  3. Use the set method and the genes and times vectors to add meaningful row and column labels to the clustergram.

    set(cgo,'RowLabels',genes(1:30),'ColumnLabels',times)

  4. Add a color bar to the clustergram by clicking the Insert Colorbar button on the toolbar. To view a data tip containing the intensity value, row label, and column label for a specific area of the heat map, click the Data Cursor button on the toolbar, then click an area in the heat map. To delete this data tip, right-click it, then select Delete Current Datatip.

  5. Use the get method to display the properties of the clustergram object, cgo:

    get(cgo)
    
               RowLabels: {30x1 cell}
            ColumnLabels: {7x1 cell}
             Standardize: {'ROW (2)'}
                 Cluster: {'ALL (3)'}
                RowPDist: {'Euclidean'}
             ColumnPDist: {'Euclidean'}
                 Linkage: {'Average'}
              Dendrogram: {[0]}
        OptimalLeafOrder: 1
                LogTrans: 0
                Colormap: [11x3 double]
            DisplayRange: 3
          SymmetricRange: 1
                   Ratio: [0.2000 0.2000]
                  Impute: []
              RowMarkers: []
           ColumnMarkers: []

  6. Change the clustering parameters by changing the linkage method and changing the color of the groups of nodes in the dendrogram whose linkage is less than a threshold of 3.

    set(cgo,'Linkage','complete','Dendrogram',3)

  7. Place the cursor on a branch node in the dendrogram to highlight (in blue) the group associated with it. Press and hold the mouse button to display a data tip listing the group number and the nodes (genes or samples) in the group.

  8. Right-click a branch node in the dendrogram to display a menu of options.

    The following options are available:

  9. Create a clustergram object for Group 18 in the MATLAB Workspace by right-clicking the group, then selecting Export Group to Workspace. In the Export to Workspace dialog box, type Group18, then click OK.

  10. Use the get method to display the properties of the clustergram object, Group18.

    get(Group18)
    
           RowGroupNames: {2x1 cell}
            RowNodeNames: {3x1 cell}
        ColumnGroupNames: {6x1 cell}
         ColumnNodeNames: {7x1 cell}
              ExprValues: [3x7 double]
             Standardize: {'ROW (2)'}
                 Cluster: {'ALL (3)'}
                RowPDist: {'Euclidean'}
             ColumnPDist: {'Euclidean'}
                 Linkage: 'complete'
              Dendrogram: 3
        OptimalLeafOrder: 1
                LogTrans: 0
                Colormap: [11x3 double]
            DisplayRange: 3
          SymmetricRange: 1
                   Ratio: [0.2000 0.2000]
                  Impute: []
              RowMarkers: []
           ColumnMarkers: []

  11. Use the view method to view the clustergram (dendrograms and heat map) of the clustergram object, Group18.

    view(Group18)

  12. View all the gene expression data using a diverging red and blue colormap.

    cgo_all = clustergram(yeastvalues,'Colormap',redbluecmap)
    Clustergram object with 614 rows of nodes and 7 columns of nodes.

  13. Create structure arrays to specify marker colors and annotations for two groups of rows (510 and 593) and two groups of columns (4 and 5).

    rm = struct('GroupNumber',{510,593},'Annotation',{'A','B'},...
         'Color',{'b','m'});
    cm = struct('GroupNumber',{4,5},'Annotation',{'Time1','Time2'},...
         'Color',{[1 1 0],[0.6 0.6 1]});
    
  14. Use the 'RowMarker' and 'ColumnMarker' properties to add the color markers and annotations to the clustergram.

    set(cgo_all,'RowMarker',rm,'ColumnMarker',cm)

    Click the color column markers to display the annotations.

References

[1] Bar-Joseph, Z., Gifford, D.K., and Jaakkola, T.S. (2001). Fast optimal leaf ordering for hierarchical clustering. Bioinformatics 17, Suppl. 1, S22–S29. PMID: 11472989.

[2] Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95, 14863–8.

[3] DeRisi, J.L., Iyer, V.R., and Brown, P.O. (1997). Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680–686.

[4] Golub, T.R., Slonim, D.K., Tamayo, P., et al. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286 (15), 531–537.

See Also

Bioinformatics Toolbox functions: redbluecmap, redgreencmap

Bioinformatics Toolbox object: clustergram object

Bioinformatics Toolbox methods of a clustergram object: get, plot, set, view

Statistics Toolbox functions: cluster, dendrogram, linkage, pdist


© 1984-2009 The MathWorks, Inc.