BMC Bioinformatics

Software

BioMed Central

Open Access

STEM: a tool for the analysis of short time series gene expression data

Jason Ernst* and Ziv Bar-Joseph

Address: Center for Automated Learning and Discovery, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave., Pittsburgh, PA 15213, USA

Email: Jason Ernst* - jernst@cs.cmu.edu; Ziv Bar-Joseph - zivbj@cs.cmu.edu

* Corresponding author

Published: 05 April 2006

BMC Bioinformatics 2006, 7:191

doi:10.1186/1471-2105-7-191

This article is available from: http://wendang.chazidian.com/1471-2105/7/191

Received: 12 December 2005

Accepted: 05 April 2006

which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: Time series microarray experiments are widely used to study dynamical biological processes. Due to the cost of microarray experiments, and also in some cases the limited availability of biological material, about 80% of microarray time series experiments are short (3–8 time points). Previously, short time series gene expression data has been mainly analyzed using more general gene expression analysis tools not designed for the unique challenges and opportunities inherent in short time series gene expression data.

Results: We introduce the Short Time-series Expression Miner (STEM), the first software program specifically designed for the analysis of short time series microarray gene expression data. STEM implements unique methods to cluster, compare, and visualize such data. STEM also supports efficient and statistically rigorous biological interpretations of short time series data through its integration with the Gene Ontology.

Conclusion: The unique algorithms STEM implements to cluster and compare short time series gene expression data, combined with its visualization capabilities and integration with the Gene Ontology, should make STEM useful in the analysis of data from a significant portion of all microarray studies. STEM is available for download for free to academic and non-profit users.

Background

Microarray time series gene expression experiments are widely used to study a range of biological processes such as the cell cycle [1], development [2], and immune response [3]. Based on an analysis of the Gene Expression Omnibus [4], approximately a third of all microarray studies involve time series experiments with three or more time points, and of these time series experiments over 80% contain no more than eight time points (Figure 1). In many cases experimental costs prevent data from more time points from being collected. In some studies, particularly clinical studies, the availability of biological material can limit the number of time points collected. Thus, even if the price of microarray experiments were to go down, short time series expression experiments would remain prevalent.

In this paper we introduce the Short Time-series Expression Miner (STEM), the first software application designed specifically for the analysis of short time series gene expression datasets (3–8 time points). Data from short time series gene expression experiments poses unique challenges. In these experiments thousands of genes are being profiled simultaneously while the number of time points is few. In such cases many genes will have the same expression pattern just by random chance. Furthermore, as with any time series experiment, there are usually few, if any, full time series repeats from which to gain statistical power. STEM uses a method of analysis that takes advantage of the number of genes being large and the number of time points being few to identify statistically significant temporal expression profiles and the genes associated with these profiles [5]. STEM also supports Gene Ontology (GO) [6] enrichment analyses for sets of genes having the same temporal expression pattern, providing the means for an efficient and statistically rigorous biological interpretation of significant temporal expression patterns. The integration of STEM with GO is bidirectional. STEM can easily determine and visualize the behavior of genes belonging to a given GO category, identifying which temporal expression profiles were enriched for genes in that category. Finally, STEM also supports the ability to compare temporal responses of genes across experimental conditions.

Figure 1

Distribution of microarray experiments by type. Summary of the 786 microarray datasets for human, mouse, rat, and yeast in the Gene Expression Omnibus as of August 2005. As can be seen, 27.5% of the sets are time series experiments with 3–8 time points. All of these sets were labeled as either time, development, or age in the database. An additional 1% contain other types of sequential experiments, including dose or temperature response, with 3–8 different levels.

The novel clustering algorithm which STEM implements for short time series expression data is briefly reviewed in the Implementation section. For a detailed discussion of the clustering algorithm, including experimental results on simulated data and a comparison with the k-means clustering algorithm on real biological data using GO, we refer the reader to [5]. The main focus of this paper is on STEM's integration with GO, its support for comparing data sets across experimental conditions, its visualization capabilities, and a comparison with related software.

To date, researchers analyzing short time series expression data relied mainly on two types of software. The first is general gene expression analysis software implementing methods which do not take advantage of the sequential information in time series data. The second is gene expression time series analysis software implementing methods primarily designed for longer time series. General methods for gene expression analysis that are frequently applied to time series expression data include popular clustering methods such as hierarchical clustering [7], k-means clustering [8], and self-organizing maps [9]. These standard clustering methods ignore the temporal dependency among successive time points. Specifically, if we were to randomly permute the order of time points, the results of these methods would not change. Two software packages available for clustering time series gene expression that implement methods that take advantage of the temporal dependency of time points are the Graphical Query Language (GQL) [10] and the Cluster Analysis of Gene Expression Dynamics (CAGED) [11]. GQL implements a clustering algorithm based on a mixture of hidden Markov models. CAGED implements a clustering algorithm based on autoregressive equations. Unlike STEM, these methods generally require the estimation of many parameters and are thus less appropriate for short time series data. Also unlike STEM, both standard clustering methods and previously suggested temporal analysis methods do not differentiate between real and random patterns. This is a particular problem for short time series expression data since, as mentioned above, many genes may have the same expression pattern by random chance. A detailed comparison of STEM with the software implementing methods of analysis primarily designed for longer time series appears in the Discussion section of this paper.
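The remark above, that permuting the order of time points leaves standard clustering results unchanged, can be checked numerically. The following is a minimal sketch, assuming Euclidean distance as the dissimilarity measure; the function names and the toy expression values are illustrative only and are not taken from STEM or from the cited packages. It shows that every pairwise distance between genes is preserved when the same permutation is applied to the time points of all genes.

```python
import itertools
import math


def euclidean(a, b):
    """Euclidean distance between two expression vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(genes):
    """Distances for every unordered pair of genes, in a fixed pair order."""
    return [euclidean(a, b) for a, b in itertools.combinations(genes, 2)]


def permute_time_points(genes, order):
    """Apply the same reordering of time points (columns) to every gene."""
    return [[gene[t] for t in order] for gene in genes]


# Toy data (hypothetical): three genes measured at five time points.
genes = [
    [0.0, 0.5, 1.2, 2.0, 2.4],
    [0.0, -0.4, -1.0, -1.1, -0.3],
    [0.0, 0.6, 1.0, 1.9, 2.6],
]
order = [3, 0, 4, 1, 2]  # an arbitrary shuffling of the five time points

original = pairwise_distances(genes)
shuffled = pairwise_distances(permute_time_points(genes, order))

# The distance is a sum over time points, so reordering the terms cannot change it.
unchanged = all(math.isclose(a, b) for a, b in zip(original, shuffled))
print(unchanged)
```

Because methods such as hierarchical clustering or k-means with Euclidean distance see the data only through quantities of this kind, they cannot use the order of the time points, which is the limitation the text describes.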

STEM is freely available for download at [12] for non-commercial research purposes. A comprehensive and detailed manual is also available at [12] and as Additional file 1 to this paper.

Implementation

STEM is implemented entirely in Java and will work with any operating system supporting Java 1.4 or later. Portions of the interface of STEM are implemented using a third party library, the Java Piccolo toolkit from the University of Maryland [13]. STEM also makes use of external Gene Ontology and gene annotation files. STEM can download these files directly from the websites of the Gene Ontology [14] or European Bioinformatics Institute [15].

Figure 2

STEM input interface. The image shows the STEM input interface, which is divided into four sections. In the top section a user specifies the gene expression data and normalization options. In the second section a user specifies the gene annotation source; in this case the annotations are selected to be Human annotations from the European Bioinformatics Institute. In the third section a user specifies whether to use the STEM clustering method or k-means, and can also change various parameter settings. The fourth section of the interface contains the execute button.

A user of STEM first specifies a tab delimited gene expression data file as input to STEM. Next, the user specifies a gene annotation source, and may adjust default parameters through the input interface shown in Figure 2. Following the input phase, the STEM clustering algorithm executes and a new window will appear displaying the clustering results (Figure 3). From this new window, a user will have the option to specify a comparison data set.

The novel clustering algorithm that STEM implements takes advantage of there being only a few time points in a dataset. The clustering algorithm first selects a set of distinct and representative temporal expression profiles (which we will refer to as model profiles from now on). These model profiles are selected independent of the data. The procedure for selecting the model profiles, and theoretical guarantees that the model profiles selected are representative and distinct, appear in [5]. See Figure 3 for an example of a set of model profiles. The clustering algorithm then assigns each gene passing the filtering criteria (see Additional file 1 for details on gene filtering) to the model profile that most closely matches the gene's expression profile as determined by the correlation coefficient. Since the model profiles were selected independent of the data, the algorithm can then use a permutation test to determine which profiles have a statistically significantly higher number of genes assigned. This test determines an assignment of genes to model profiles using a large number of permutations of the time points (or columns). It then uses standard hypothesis testing to determine which model profiles have significantly more genes assigned under the true ordering of time points compared to the average number assigned to the model profile in the permutation runs. Significant model profiles can either be analyzed independently, or grouped together based on similarity to form clusters of significant profiles.

Figure 3

Example model profiles overview interface. The example data is drawn from an experiment measuring the response of gastric epithelial cells infected with the vacA-mutant strain of the pathogen Helicobacter pylori [3]. The data was sampled at five time points: 0 h, 0.5 h, 3 h, 6 h, and 12 h. The data set was filtered to contain only the 2989 genes with no missing data (though STEM can handle missing data without filtering, see Additional file 1) that exhibited a 0.8 log base two fold increase or decrease for at least one time point. The number in the top left-hand corner of a profile box is the profile ID number. The colored profiles had a statistically significant number of genes assigned. Non-white profiles of the same color represent profiles grouped into a single cluster. By clicking on one of the buttons along the bottom of the window, a dialog window appears by which the profiles can be reordered by various criteria. Another button displays a table of all genes passing filter and the profile to which they were assigned. Clicking on a profile box brings up detailed information about the profile (Figure 5).

Figure 4

Model profiles reordered interface. The profiles from Figure 3 are reordered based on actual size based p-value enrichment for genes being annotated as belonging to the GO category DNA metabolism. For each profile the number of DNA metabolism genes assigned to it and the enrichment p-value appear in the lower left corner of the profile box.

Based on a reviewer's suggestion, STEM now also provides an implementation of the k-means clustering algorithm. A user thus has the option to compare, directly within STEM, the results of STEM's novel clustering method with those produced using k-means. A user who still prefers the k-means clustering methodology for clustering short time series data, or is interested in using k-means to cluster other types of data for which the STEM clustering method does not apply, may still be interested in using STEM's implementation of k-means in order to leverage STEM's visualization capabilities and integration with GO. The results and discussion of STEM in this paper are presented using STEM's novel clustering method. For details on using the k-means clustering algorithm with STEM see Additional file 1.
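The profile assignment and permutation test described above can be illustrated with a short sketch. This is not STEM's code (STEM is written in Java) and it rests on several assumptions that are labeled here and in the comments: the three model profiles are hand-picked rather than selected by the procedure in [5], all permutations of the time points are enumerated rather than sampled, and the "standard hypothesis testing" step is taken to be a one-sided binomial test against the permutation average. All function names are hypothetical.

```python
import itertools
import math


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    denom = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom > 0 else 0.0


def assign(series, profiles):
    """Index of the model profile most correlated with the series."""
    return max(range(len(profiles)), key=lambda i: pearson(series, profiles[i]))


def count_assignments(genes, profiles, order):
    """Genes assigned to each profile after reordering time points by `order`."""
    counts = [0] * len(profiles)
    for gene in genes:
        counts[assign([gene[t] for t in order], profiles)] += 1
    return counts


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space."""
    if k <= 0 or p >= 1.0:
        return 1.0
    if p <= 0.0:
        return 0.0
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)


def profile_significance(genes, profiles):
    """Observed counts, permutation-average counts, and p-values per profile."""
    n_points = len(genes[0])
    observed = count_assignments(genes, profiles, range(n_points))
    # Assumption: enumerate every ordering of the time points (feasible for short series).
    orders = list(itertools.permutations(range(n_points)))
    expected = [0.0] * len(profiles)
    for order in orders:
        for i, count in enumerate(count_assignments(genes, profiles, order)):
            expected[i] += count / len(orders)
    # Assumption: binomial test of the observed count against the expected fraction.
    p_values = [binomial_tail(observed[i], len(genes), expected[i] / len(genes))
                for i in range(len(profiles))]
    return observed, expected, p_values


# Hypothetical model profiles over four time points: increasing, decreasing, peaked.
profiles = [[0, 1, 2, 3], [0, -1, -2, -3], [0, 1, 1, 0]]
# Toy data: ten genes that all increase steadily, at different magnitudes.
genes = [[0.0, 1.0 * s, 2.0 * s, 3.0 * s] for s in range(1, 11)]

observed, expected, p_values = profile_significance(genes, profiles)
print(observed, [round(e, 2) for e in expected], p_values)
```

In this toy case all ten genes fall in the increasing profile under the true ordering, while the permutation runs spread them across the three profiles, so only the increasing profile receives a small p-value. A real analysis would also need a multiple-testing correction across profiles, which this sketch omits.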

Results

Model profiles overview interface

A screenshot of the main interface window of STEM appears in Figure 3. In this window each box corresponds to one of the model temporal expression profiles. Clicking on a profile box displays a new window, described in the next subsection, with detailed information about the profile. The colored profiles have a statistically significant number of genes assigned. Colored profiles which have the same color are all similar to each other (based on correlation coefficients, see Additional file 1 for more details). These profiles are grouped together to form a cluster of significant profiles. By default profiles on the main window are ordered such that significant profiles appear before non-significant profiles, and among significant profiles those profiles of the same color appear next to each other. The profiles can be reordered based on the number of genes assigned, the number of genes expected, or their significance p-value. Additionally, as we discuss below, the profiles can also be reordered based on their relevance to a given GO category (Figure 4), a user defined gene set, or profile(s) from a comparison experiment. When the profiles are reordered, relevant information appears in the profile boxes.

The model overview screen is designed such that by default a user can visualize all profiles simultaneously, but as a result each profile box needs to be relatively small. At times however, a user will be interested in focusing on a