The concept of CoP

The main concept

is to associate a co-expression module, that is, a set of genes with similar expression profiles, with biological information in order to suggest hypotheses about gene function.

The datasets studied

typically include ten thousand or more probes and a hundred or more chips. In the present version of CoP, Affymetrix GeneChip microarray datasets were assembled for eight plant species: thale cress (Arabidopsis thaliana), soybean (Glycine max; Ogata et al. 2009c), barley (Hordeum vulgare), rice (Oryza sativa), poplar (Populus trichocarpa; Ogata et al. 2009a), wheat (Triticum aestivum), grape (Vitis vinifera), and maize (Zea mays). In the near future, data from other microarray designs will also be studied.

Gene expression values

are positive once a CEL file has been processed with the MAS5 algorithm. The signal-to-noise ratio is generally higher for high expression values than for low ones. To make effective use of the data with a high signal-to-noise ratio, we focused on positive gene-to-gene correlation. Kim et al. (BMC Informatics 2006, 7:44) introduced several distance measures between two expression profiles, such as Pearson correlation, cosine correlation, Euclidean distance, and their own measure. Of these, Pearson correlation and cosine correlation are similarity measures. Pearson correlation ranges from -1 to 1 regardless of the sign of the expression values. In contrast, cosine correlation ranges from 0 to 1 when the dataset contains only positive expression values. We therefore selected cosine correlation to represent positive gene-to-gene correlation.
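The difference in range between the two measures can be illustrated with a minimal sketch. The function names and the two toy profiles are assumptions made for illustration; they are not part of CoP or of its datasets.

```python
import numpy as np

def cosine_correlation(x, y):
    """Cosine of the angle between two expression profiles."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def pearson_correlation(x, y):
    """Pearson correlation, i.e. cosine correlation of mean-centred profiles."""
    return float(np.corrcoef(x, y)[0, 1])

# Two hypothetical all-positive profiles with exactly opposite trends.
gene_a = np.array([100.0, 200.0, 300.0, 400.0])
gene_b = np.array([400.0, 300.0, 200.0, 100.0])

pearson = pearson_correlation(gene_a, gene_b)
cosine = cosine_correlation(gene_a, gene_b)
print(f"Pearson: {pearson:.3f}, cosine: {cosine:.3f}")
```

For these profiles Pearson correlation reaches its minimum of -1, whereas cosine correlation stays within 0 to 1 because every element is positive.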

Specific expression

of plant genes was analyzed. Cosine correlation has a further advantage over other metrics of gene expression similarity or dissimilarity, with the exception of Pearson correlation: the correlation coefficient can be explicitly separated into products of pairs of individual elements, as follows.

equation1 equation2

The separated values represent the contribution of each individual element to the correlation coefficient. Using public datasets of plant gene expression profiles, we therefore used these values to assess specific expression in particular experiments. To reduce the influence of sample size on the values, we multiplied each value by the sample size (the standard value of gene expression). A standard value of 1 corresponds to the average. To interpret the standard values statistically, we calculated their percentiles.
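One possible reading of this calculation is sketched here. It assumes that the separated values are the per-sample terms (x_i / |x|)(y_i / |y|), whose sum is the cosine correlation, and that the standard value of a single gene is obtained from the terms of its profile with itself, multiplied by the number of samples, so that the standard values average to 1. Both assumptions, the function names, and the toy profile are illustrative and may differ from the actual CoP procedure.

```python
import numpy as np

def cosine_contributions(x, y):
    """Per-sample terms whose sum equals the cosine correlation of x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return (x / np.linalg.norm(x)) * (y / np.linalg.norm(y))

def standard_values(x):
    """ASSUMPTION: self-contributions of one profile scaled by sample size.

    The self-contributions sum to 1, so after scaling their mean is 1.
    """
    x = np.asarray(x, dtype=float)
    return cosine_contributions(x, x) * x.size

def percentile_ranks(values):
    """Percentage of values less than or equal to each value (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    return np.array([np.mean(values <= v) * 100.0 for v in values])

# Hypothetical profile: the gene is strongly expressed in the last sample only.
profile = np.array([10.0, 12.0, 11.0, 9.0, 80.0])
sv = standard_values(profile)
ranks = percentile_ranks(sv)
print(np.round(sv, 3), ranks)
```

Under these assumptions, a sample with a standard value well above 1 and a high percentile is a candidate for specific expression of the gene in that experiment.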

Co-expression analysis

was performed to predict gene function on the basis of gene expression similarity. To assemble co-expression modules, we used the Confeito algorithm (Ogata et al., Genome Informatics, 23:117-127, 2009b), which is based on co-expression network analysis.
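The general idea of deriving modules from a co-expression network can be sketched as follows. This is not the Confeito algorithm, whose details are given in the cited paper; it is a deliberately simple stand-in that links genes whose cosine correlation reaches a threshold and reports connected components as modules. The function names, the threshold of 0.95, and the toy expression matrix are assumptions.

```python
import numpy as np

def cosine_matrix(expr):
    """Pairwise cosine correlation between the rows (genes) of expr."""
    expr = np.asarray(expr, dtype=float)
    unit = expr / np.linalg.norm(expr, axis=1, keepdims=True)
    return unit @ unit.T

def threshold_modules(expr, threshold=0.95):
    """Connected components of the thresholded co-expression network."""
    adjacent = cosine_matrix(expr) >= threshold
    seen, modules = set(), []
    for start in range(adjacent.shape[0]):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first search from the start gene
            gene = stack.pop()
            component.append(gene)
            for neighbour in np.flatnonzero(adjacent[gene]):
                neighbour = int(neighbour)
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        modules.append(sorted(component))
    return modules

# Hypothetical matrix: rows are genes, columns are chips.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [9.0, 1.0, 1.0, 1.0],
    [8.0, 1.5, 1.0, 1.0],
])
modules = threshold_modules(expr, threshold=0.95)
print(modules)
```

Connected components are the crudest possible module definition; they are used here only because they keep the sketch short.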

Association of co-expression modules with biological information

is a useful approach for predicting the function of the genes included in the modules. For this purpose we obtained three kinds of biological information: 1) metabolic pathway data from KEGG PATHWAY and KaPPA-View 4, 2) biological processes from Gene Ontology, and 3) descriptions of microarray experiments from Gene Expression Omnibus. To assess the similarity in membership between co-expression modules and the genes in metabolic pathways and biological processes, we calculated the harmonic mean of the difference in memberships, as described on the terms page.
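The exact definition is given on the terms page and is not reproduced here. As a rough illustration only, the sketch assumes one common way of scoring membership similarity with a harmonic mean: the harmonic mean of the fraction of module genes that belong to the pathway and the fraction of pathway genes that belong to the module. The function name and the gene identifiers are hypothetical.

```python
def membership_harmonic_mean(module_genes, annotated_genes):
    """ASSUMPTION: harmonic mean of the two overlap fractions between gene sets."""
    module_genes = set(module_genes)
    annotated_genes = set(annotated_genes)
    shared = module_genes & annotated_genes
    if not shared:
        return 0.0  # no common genes, so no similarity
    frac_module = len(shared) / len(module_genes)
    frac_annotation = len(shared) / len(annotated_genes)
    return 2 * frac_module * frac_annotation / (frac_module + frac_annotation)

# Hypothetical gene sets for one module and one metabolic pathway.
module = {"geneA", "geneB", "geneC", "geneD"}
pathway = {"geneC", "geneD", "geneE"}
score = membership_harmonic_mean(module, pathway)
print(round(score, 3))
```

A score near 1 would mean that the module and the pathway contain almost the same genes; a score near 0 would mean that they share few or none.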

Team of Applied Plant Genomics, Kazusa DNA Research Institute