About EXPOLDB
Institute of Genomics and Integrative Biology

“Expression Linked Polymorphism Database (EXPOLDB): A resource for linking genome wide expression with cis modulators of transcription in the Human Genome”

The availability of the human genome sequence and the parallel accumulation of microarray data offer new opportunities to examine the role of repetitive sequences as modulators of gene expression underlying the variation in expression. Therefore, in order to obtain insight into gene regulation in humans, a necessary pre-requisite step is to link gene expression data from microarray experiments with the distribution of cis modulators of transcription such as the dinucleotide (TG/CA)n repeats. At present there is no database linking genome wide expression with the distribution of (TG/CA)n and other simple repeats.

EXPOLDB contains information on (i) transcript levels and their variation from oligonucleotide microarray (HG U95A v2) data, (ii) distribution of (TG/CA)n repeats and their polymorphic status in more than 5000 human genes, (iii) a plethora of functional information such as biochemical roles of the gene products, tissue specific expression, and associated hyperlinks and (iv) levels and variation in expression of known human housekeeping genes across unrelated individuals and monozygotic twins. In addition, examples of well studied genes regulated by (TG/CA)n repeats are included for reference. EXPOLDB can be queried through a user-friendly interface and allows navigation to obtain additional functional information through combinatorial queries. Further information can be obtained by browsing HTML links to publicly available databases including Gene Cards, UCSC Golden Path, Entrez Gene, BodyMap, GXD, ENSEMBL, HuGE Index and PubMed.

EXPOLDB incorporates an online tool, 'SimRep', which can be used to examine the distribution of simple repeats and patterns in EXPOLDB genes or in a given nucleotide sequence. Users can search either for a dinucleotide repeat by selecting it from the pull-down menu or for a specific microsatellite repeat of their choice by entering the pattern to be searched. Patterns can be specified using the standard four base symbols A, T, G and C; other symbols recommended by IUPAC can also be used. The minimum length for scoring a repeat can be specified in the field 'Enter Cut-Off'. SimRep reports the length and location of the repeats or patterns in the given sequence in the form of a table. Repeats are reported on both strands if the option 'All' is selected. The positions of repeats on either the forward strand (+) or the reverse strand (-) are reported with respect to the forward strand only, as per the convention followed by genome sequence annotation groups. In the case of palindromic dinucleotide repeats such as GC and AT, only one strand is reported.
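The strand and cut-off behaviour described above can be illustrated with a minimal Python sketch. This is not the SimRep implementation: the function name find_repeats, the interpretation of the cut-off as a minimum number of repeat units, and the output tuple layout are all assumptions made for illustration. A repeat on the reverse strand is located by searching the forward strand for the reverse complement of the unit, so coordinates are always forward-strand positions.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement of each IUPAC code (e.g. R = A/G pairs with Y = C/T).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(pattern):
    """Reverse complement of a (possibly IUPAC-coded) pattern."""
    return pattern.upper().translate(COMPLEMENT)[::-1]


def find_repeats(sequence, unit, min_units=6, strand="All"):
    """Return (strand, start, end, n_units) tuples, sorted by start.

    Coordinates are 1-based and refer to the forward strand.
    min_units is an assumed reading of the 'Enter Cut-Off' field.
    """
    sequence = sequence.upper()
    unit = unit.upper()
    rc = reverse_complement(unit)
    targets = []
    if strand in ("All", "+"):
        targets.append(("+", unit))
    # A palindromic unit (GC, AT) equals its own reverse complement,
    # so with 'All' it is reported for one strand only.
    if strand == "-" or (strand == "All" and rc != unit):
        targets.append(("-", rc))
    hits = []
    for label, motif in targets:
        unit_regex = "".join(IUPAC[base] for base in motif)
        run = re.compile("(?:%s){%d,}" % (unit_regex, min_units))
        for m in run.finditer(sequence):
            n_units = (m.end() - m.start()) // len(motif)
            hits.append((label, m.start() + 1, m.end(), n_units))
    return sorted(hits, key=lambda h: h[1])


if __name__ == "__main__":
    demo = "AA" + "TG" * 7 + "CC" + "CA" * 6 + "GG"
    for hit in find_repeats(demo, "TG", min_units=6):
        print(hit)
```

Searching for TG with the option 'All' therefore also finds CA runs, which is why the repeats are written as (TG/CA)n throughout.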

As a resource that aims to provide both genotype and phenotype data by including information on gene expression, variability and the presence of cis regulatory elements, EXPOLDB is likely to have wide usability and can serve as a useful resource to researchers assessing natural variation in gene expression and those interested in examining the role of (TG/CA)n repeats in gene expression. A few case studies describing the use of EXPOLDB are highlighted below.

Tetrapodic layout of EXPOLDB: The 4 domains of layout are shown in bold face type. Note that all attributes of a gene are singularly linked to its official HGNC symbol serving as the primary key.


Database and web interface

The backend data was prepared in MS Access 2000 (Microsoft Corporation, Inc. USA). Server-side scripting was prepared using ASP (Active Server Pages, version 3.0), PHP (PHP: Hypertext Preprocessor, version 5.0) and Perl (Practical Extraction and Report Language, version 5.8.1). The client-side scripting was prepared using JavaScript and HTML (Hyper Text Markup Language, version 4.0). Internet Information Server (IIS) version 6.0 was used as the web server.

EXPOLDB as a resource to examine natural variation in population

Natural variation in gene expression between healthy human individuals has been largely unexplored. The variation in gene expression is an outcome of the complex interplay of genetic polymorphisms (acting in cis or in trans), physiological variations (such as time of day, gender) and environmental factors. In order to understand the genetic basis of variation in gene expression between normal human individuals, we need to obtain genome-wide expression data from various populations. We examined the gene expression profiles of blood leukocytes in 13 normal human individuals, including five pairs of monozygotic twins and three unrelated individuals, measured using HG U95Av2 oligonucleotide microarrays containing probes for ~10,000 genes. A total of 5,407 genes were found to be expressed in blood leukocytes. Of these, a total of 2,888 genes were found to be differentially expressed in pairwise comparisons between unrelated individuals and 212 genes in monozygotic twins. Information on mean expression and variability (CV) for human housekeeping genes that had a present call (P) in any of the arrays used is also provided in EXPOLDB. This database is likely to be a useful resource for those interested in studying natural variation in humans.
The raw data from the array experiments have been submitted to the Gene Expression Omnibus (GEO; www.ncbi.nlm.nih.gov/geo) under the Series ID: GSE928, and the following accession numbers: GSM14477, GSM14478, GSM14479, GSM14480, GSM14481, GSM14482, GSM14483, GSM14485, GSM20645, GSM29053, GSM29054, GSM29055, GSM29056, GSM29057 and GSM29058.

Mean expression and Measures of variability

The mean expression of each gene was computed from the log10 transformed 'signal' values with P calls across all 13 arrays. All genes considered for mean expression had a present (P) call in the arrays. Replicate probe sets, if present, were clustered together and averaged. A total of 5,407 genes were found to be expressed in blood leukocytes using these criteria. For these genes, the coefficient of variation (CV) was computed as SD/Mean, where SD is the standard deviation of the log10 transformed 'signal' values across the 13 arrays.
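The computation described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name expression_summary and the input layout are hypothetical, the use of the sample (n - 1) standard deviation is an assumption because the text does not say which estimator was used, and averaging of replicate probe sets is omitted.

```python
import math
import statistics


def expression_summary(signals, calls):
    """Mean and CV of log10 'signal' values for one gene.

    signals: raw signal values, one per array.
    calls:   detection calls for the same arrays ('P' = present).
    Only values with a 'P' call are used. Returns (mean, cv), or
    None when fewer than two usable values remain.
    """
    logs = [math.log10(s) for s, c in zip(signals, calls) if c == "P"]
    if len(logs) < 2:
        return None
    mean = statistics.mean(logs)
    # Assumption: sample standard deviation (n - 1 denominator).
    sd = statistics.stdev(logs)
    return mean, sd / mean


if __name__ == "__main__":
    # Made-up signal values for illustration; not EXPOLDB data.
    print(expression_summary([1000.0, 100.0, 10000.0, 50.0],
                             ["P", "P", "P", "A"]))
```

Because the CV is taken on log10 values, it is dimensionless but depends on the log base, so CVs computed this way are comparable only with values derived from the same transformation.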
Two measures have been used to assess variation in gene expression. The metric ‘signal log ratio’ is used to query differentially expressed genes. The metric ‘coefficient of variation (CV)’ is used to query the variability in expression levels across individuals. It is important to note the difference between the two metrics. First, the computation of ‘signal log ratio’ was carried out according to the Affymetrix procedure, whereas the CV was computed according to Hsiao et al. 2001. Second, the two metrics do not signify the same biological phenomena. For example, signal log ratio is measured in pairwise comparisons whereas CV is measured across all arrays. Although a high signal log ratio may relate to a high CV, the exact relationship has not been established since, to our knowledge, only the HuGE Index reports these values in addition to EXPOLDB. Thus, users are advised to exercise care while using the two metrics; the choice of a particular metric will depend on the nature of the study.
To enable distinction between the two query strategies and to avoid confusion, we have provided the 'signal log ratio' selection option only on the ‘Differentially Expressed Genes’ query page. Similarly, the CV option is available only on the ‘Expression in Blood’ page. Searches for a set of differentially expressed genes can be carried out by either defining a range of signal log ratio or selecting from a given list of specified ranges. The variability in the expression of genes (CV) can be queried through the ‘Expression in Blood’ page by selecting a specified CV range from the list box, specifying a range of CV, or selecting the entire range of CV.


Examining the role of (TG/CA)n repeats as modulators of gene expression

(TG/CA)n repeats as cis modulator of transcription

About 50% of the human genome consists of repetitive elements comprising simple repeats, short interspersed nucleotide elements (SINEs), medium reiteration, long terminal repeats and long interspersed nucleotide elements (LINEs). Among the dinucleotide repeats, (TG/CA)n is the most frequent in the human genome and many of these repeats exhibit length polymorphism. This property has been extensively used in the construction of genetic maps. (TG/CA)n repeats, which have alternating purine/pyrimidine sequences, have a propensity to undergo conformational transition on methylation under physiological conditions.

(TG/CA)n repeats can influence transcription of a gene in cis (variations within the gene) due to ‘incidence’ or ‘secondary elongation’ in these repeats (either within or in close proximity to the gene). The functional roles of (TG/CA)n repeats are beginning to emerge. Since the early observation on the modulation of transcription by (TG)n tracts by Hamada et al. (1984), experimental evidence on the role of (TG/CA)n repeats in the regulation of gene expression has been steadily accumulating. The up-regulation or down-regulation of transcription by (TG/CA)n repeats has been reported for the following genes: rat alpha-lactalbumin (Meera et al., 1989), rat prolactin (Naylor et al. 1990), Acetyl-CoA carboxylase ACC (Tae et al. 1994), matrix metalloprotease MMP-9 (Shimajiri et al. 1999), gamma interferon IFN-gamma (Pravica et al. 1999), epidermal growth factor receptor EGFR (Gebhardt et al. 1999), salt sensitivity HSD11B2 (Agarwal et al. 2000), and tilapia Prolactin1 (Streelman et al. 2002). Expression differences in these genes due to polymorphic (TG/CA)n repeats vary from one gene to another, span a wide range, and in some cases are reported to be as high as 20-fold. While the effects induced by microsatellite variation may be complicated by the presence of other transcriptional regulatory elements in the proximity of a given gene, these observations underscore the importance of the (TG/CA)n repeats and their polymorphisms in gene regulation. A list of more examples from the literature illustrating the role of these repeats as modulators of gene expression is given below.

"Examples from Literature"

Recently, several reports have described the association of polymorphism in (TG/CA)n repeats with genetic diseases, such as coronary heart disease (eNOS, endothelial nitric oxide synthase; Laule et al. 2003), diabetic retinopathy (ALR2, aldose reductase; Kumaramanickavel et al. 2003), asthma (IFN-gamma, gamma interferon; Nagarkatti et al. 2002) and breast cancer (IGF-I, insulin-like growth factor-I; Yu et al. 2001).

Earlier, we showed in rat alpha-lactalbumin (Meera and Brahmachari et al., 1989) that the presence of (TG/CA)n repeats correlates with lower gene expression levels. Recently, we examined the distribution of the (TG/CA)n repeats in the human genome and also the correlation of incidence of the repeats with gene expression levels of housekeeping genes measured using oligonucleotide microarrays. We found that (i) the number of short intragenic (TG/CA)n repeats was significantly higher than the number of long repeats; (ii) genes with (TG/CA)n repeats (n ≥ 12 units) had lower mean expression levels than those without these repeats; and (iii) genes belonging to the functional class of ‘signaling and communication’ had a positive association with repeats, in contrast to genes belonging to the ‘information’ class, which were negatively associated with repeats (Sharma et al. 2003). Taken together, these observations underscore the importance of investigating natural variation in transcript levels at the genomic scale with respect to the distribution of (TG/CA)n repeats both within the genes and in their close proximity.

Case studies to examine the role of (TG/CA)n repeats as modulators of gene expression using EXPOLDB

Housekeeping genes

Housekeeping genes are expressed constitutively in all tissues to maintain cellular functions and are hence used as controls to examine the expression of other genes. The housekeeping genes are less likely to be affected by variations in tissue specific factors, the number of different cell types in blood leukocytes (if similar quantities of total RNA are taken) and other alterations in chromatin structure that may vary between different individuals. The clustering of housekeeping genes in the human genome supports the above rationale and indicates that it may be advantageous to assemble them in a common region that remains in an open conformation across all cells (Lercher et al., 2002).
In order to find the genetic basis of the variations in gene expression, we examined the incidence of (TG/CA)n repeats (n ≥ 12 units) and its correlation with mean expression in the housekeeping genes. Out of 542 'housekeeping genes', 95 had long intragenic (TG/CA)n repeats (n ≥ 12 units) and a total of 362 housekeeping genes did not contain (TG/CA)n repeats of length n ≥ 6 units. The distributions of the mean expression values of genes with repeats (n ≥ 12 units) and of genes without repeats (using n ≥ 6 units) are shown in the figure below. The average of the mean expression values of genes without repeats (3.03 log10 signal units) was found to be higher than the average of the mean expression values of the genes containing long (TG/CA)n repeats (2.90 log10 signal units). This difference was statistically significant (t-test, df = 455, P < 0.006). These results show that the presence of intragenic (TG/CA)n repeats correlates with reduced expression, conforming to one set of trends observed in earlier studies of the genes IFN-gamma, HSD11B2 and EGFR. These results provide insight into the overall effect of (TG/CA)n repeats on gene expression.
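The reported degrees of freedom are consistent with a two-sample t-test with pooled variance on the two groups of genes, since 95 + 362 - 2 = 455. A minimal sketch of that statistic is given here; that the pooled (equal-variance) form was the one actually used is an assumption, the function name pooled_t is hypothetical, and the input values are made up for illustration. The P value is not computed because it requires the t distribution, which is not in the Python standard library.

```python
import math
import statistics


def pooled_t(group_a, group_b):
    """Two-sample t statistic with pooled variance.

    Returns (t, df) with df = n1 + n2 - 2.
    """
    n1, n2 = len(group_a), len(group_b)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * statistics.variance(group_a)
                  + (n2 - 1) * statistics.variance(group_b)) / df
    se = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / se
    return t, df


if __name__ == "__main__":
    # Made-up log10 expression values; not EXPOLDB data.
    with_repeats = [2.7, 2.9, 3.0, 2.8]
    without_repeats = [3.0, 3.1, 2.9, 3.2, 3.05]
    print(pooled_t(with_repeats, without_repeats))
    # Degrees of freedom for the group sizes quoted in the text.
    print(95 + 362 - 2)
```

A negative t for (with repeats, without repeats) corresponds to lower mean expression in the repeat-containing group.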

The RUNX family

The mammalian RUNX genes comprise a small family of three genes, RUNX1, RUNX2 and RUNX3, that act as master regulators of gene expression in major developmental pathways (Levanon et al. 2003). They contain a highly conserved region designated the ‘runt domain’ (RD), found in the Drosophila gene Runt. RUNX1 and RUNX2 play fundamental roles in organogenesis and are associated with human diseases. Only recently has RUNX3 become the focus of investigations. Sequence analysis suggests that RUNX3 is the evolutionary founder of the mammalian RUNX family (Bangsow et al. 2001) and that RUNX1 and RUNX3 have similar architecture. In adults, both RUNX1 and RUNX3 are highly expressed in the hematopoietic system with high levels of mRNA and proteins in spleen, thymus and blood (Levanon et al. 1994, 1996; Meyers et al. 1996; Le et al. 1999; Levanon et al. 2003). Thus, the RUNX family provides a set of genes with similar architecture to investigate the effects of (TG/CA)n repeats on expression.
We queried EXPOLDB by submitting the keyword “RUNX*” in the Gene Symbol field. The records for RUNX1 and RUNX3 were retrieved. The RUNX2 gene was not found to be expressed in blood in our experiments.
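A wildcard query of this kind can be mimicked with shell-style pattern matching. The sketch below is only an analogy: how EXPOLDB evaluates the “*” wildcard internally is not described here, the function name query_symbols is hypothetical, and the small records dictionary is a stand-in for the database that holds only the mean expression values quoted on this page.

```python
from fnmatch import fnmatchcase


def query_symbols(records, pattern):
    """Return gene symbols matching a shell-style wildcard pattern."""
    return sorted(sym for sym in records if fnmatchcase(sym, pattern))


if __name__ == "__main__":
    # Stand-in for the database: symbol -> mean expression
    # (log10 signal units), values as quoted in the text.
    records = {"RUNX1": 2.37, "RUNX3": 2.90, "EIF3S6": 2.73}
    for symbol in query_symbols(records, "RUNX*"):
        print(symbol, records[symbol])
```

RUNX2 does not appear in the result because it has no record among the genes expressed in blood, not because the pattern excludes it.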

The Expol profiles of these two genes indicate that RUNX1 has several (TG/CA)n repeats whereas RUNX3 does not have any (TG/CA)n repeats (n ≥ 6 units). RUNX1 has many long (TG/CA)n repeats (n ≥ 12 units): (CA)17, (CA)22 and (TG)12, (TG)13, (TG)14, (TG)21, (TG)23, (TG)24 in introns and one interrupted (TG)7-CG-(TG)9 repeat in exon 8. The mean expression of RUNX3 (2.90 log10 signal units) is higher than the mean expression of RUNX1 (2.37 log10 signal units). The difference was statistically significant (t-test, df = 13, P < 0.0002). The uniformity observed in the difference between the expression of RUNX3 and RUNX1 in all experiments suggests that the incidence of (TG/CA)n repeats in RUNX1 correlates with its generally observed reduced expression. These results are consistent with previous experimental studies of Interferon-gamma (IFN-gamma, (CA)10-15), Epidermal growth factor receptor (EGFR, (CA)21, 14) and the salt sensitivity gene HSD11B2 ((CA)14, 23) (Pravica et al., 1999; Gebhardt et al., 1999; Agarwal et al., 2000), where reduced expression levels of these genes correlate with either the presence (vs. absence) of repeats or an increase in the length of the repeats.


The Eukaryotic Initiation Factor housekeeping genes

Five eukaryotic initiation factor genes, EIF3S5, EIF3S6, EIF4A1, EIF4A2 and EIF4G2, were retrieved by submitting EIF* in the Gene Symbol field and selecting the housekeeping genes dataset. All these genes had a present (P) call in all array experiments. Of these, only EIF3S6 has a perfect dinucleotide stretch (TG)12 within the gene. The remaining four genes did not have (TG/CA)n repeats (n ≥ 6 units). The mean expression of the EIF genes without (TG/CA)n repeats was 3.27 log10 signal units, which is higher than the mean expression of EIF3S6 (2.73 log10 signal units). The difference was statistically significant (t-test, df = 43, P < 0.0001).

The Type II repeat in EIF3S6 was observed to be polymorphic (expanded or contracted by 1 or 2 units) in the individuals used in this study. In only one case had the Type I repeat expanded by 2 units. These results firmly support the polymorphic property of Type II repeats and follow the expansion/contraction model proposed by Whittaker et al. (2003), who also reported that changes by 1 or 2 units are common. However, no correlation was discernible between differences in expression levels and expansion or contraction of repeats by 1 or 2 units. It may be noted that, using existing measurement techniques, identification of a clear correlation between variation in expression levels and in length of repeats may only be possible in cases where the increment or decrement spans at least 6 repeat units [see Literature Studies section].
Taken together, the observations from the 311 housekeeping genes and the RUNX and EIF families are consistent with our previous comparative analysis of expression of housekeeping genes with (TG/CA)n repeats (n ≥ 12) and genes without (TG/CA)n repeats (Sharma et al., 2003). The genotyping of (TG/CA)n repeats in EIF3S6 revealed length polymorphism. The results of the genotyping exercise are given below.

Table: Repeat length polymorphism in (TG/CA)n repeats in different individual samples and their corresponding mean expression

Repeats in EIF3S6 as per NCBI sequence

SampleID    Repeat 1    Repeat 2    log10 Signal Value
Sample 1    (TG)6*      (TG)10*     NA
Sample 2    (TG)6       (TG)13      NA
Sample 3    (TG)6*      (TG)12      2.75327657
Sample 4    (TG)6       (TG)12      2.663983455
Sample 5    (TG)6       (TG)12      2.977860729
Sample 6    (TG)6       (TG)12      NA
Sample 7    (TG)6       (TG)12      NA
Sample 8    (TG)8       (TG)10      2.324282455
Sample 9    (TG)8       (TG)10      2.352375495

* Ambiguity in the repeat region


Table: Mean expression in different individual samples

SampleID                    EIF4A1         EIF4A2         EIF3S5         EIF4G2
Repeats in NCBI sequence    Nil            Nil            Nil            Nil
Sample 1                    0              0              0              0
Sample 2                    0              0              0              0
Sample 3                    3.321826184    3.349180359    3.284047032    3.580936374
Sample 4                    3.377451963    3.158663981    3.043715858    3.605434345
Sample 5                    3.277540455    3.38877593     3.363856319    3.710887069
Sample 6                    0              0              0              0
Sample 7                    0              0              0              0
Sample 8                    3.101850138    2.71222867     3.135482491    3.351525755
Sample 9                    3.011612729    3.053731316    3.093316601    3.486883666

These results provide leads for further experimental investigations to correlate the ‘incidence’ or ‘secondary elongation’ of (TG/CA)n repeats with the observed gene expression and demonstrate the usefulness of EXPOLDB.

Studying genome wide expression pattern of genes

Examining gene expression and variability in Biochemical Pathways

With the focus of biology shifting towards a systemic approach to understanding the complexity of human biology, biochemical pathways have become one focus of recent investigations. EXPOLDB offers a unique utility by providing information on gene expression and its variability in several biochemical pathways (134) across different individuals. This potential utility of EXPOLDB is illustrated here with the example of the glycolysis pathway, which is a basic source of energy in mammals. Submitting the keyword 'glycolysis' in the 'Functional Pathway' field of the query page 'Expression in Blood' retrieved the records on genes coding for enzymes involved in glycolysis and related linked pathways as outlined in the KEGG or GenMAPP databases. We examined the expression of ten known genes of this pathway involved in the conversion of glucose to pyruvate by selecting the 'square boxes' adjacent to the gene symbols. All ten genes of the glycolysis pathway were present in EXPOLDB and showed a coefficient of variation (CV) of expression below 0.15, indicating low variability in accordance with the most constant housekeeping genes described by Hsiao et al. 2002. Since the expression of the gene GAPDH, coding for glyceraldehyde-3-phosphate dehydrogenase, was found to be highly variable between different individuals [20,23], analysis using EXPOLDB provides information on the variability of additional genes of the glycolytic pathway that are worth examining for their low variability at large scale. If verified, some of these genes, if not all, could be used as internal controls in mRNA quantitation experiments. The potential utility of EXPOLDB illustrated through the example of the 'Glycolysis' pathway, and the list of available pathways, can be accessed through the following link.

"EXPOLDB as a resource to examine Biochemical Pathways"

Thus, in summary EXPOLDB can be a useful resource to examine the expression pattern of genes involved in various biochemical processes.
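The pathway query described above amounts to a keyword match on the functional pathway field followed by a threshold on CV. A minimal sketch of that filter follows; the record layout, the function name low_variability_genes and the placeholder gene symbols and CV values are all hypothetical and do not come from EXPOLDB. Only the 0.15 cut-off is taken from the glycolysis example.

```python
def low_variability_genes(records, pathway_keyword, cv_cutoff=0.15):
    """Symbols of genes in a pathway whose CV is below the cut-off.

    records: iterable of dicts with 'symbol', 'pathway' and 'cv' keys
    (a hypothetical layout, chosen for illustration).
    """
    keyword = pathway_keyword.lower()
    return sorted(
        r["symbol"]
        for r in records
        if keyword in r["pathway"].lower() and r["cv"] < cv_cutoff
    )


if __name__ == "__main__":
    # Placeholder records; symbols and CV values are made up.
    records = [
        {"symbol": "GENE_A", "pathway": "Glycolysis", "cv": 0.05},
        {"symbol": "GENE_B", "pathway": "Glycolysis", "cv": 0.22},
        {"symbol": "GENE_C", "pathway": "Other pathway", "cv": 0.04},
    ]
    print(low_variability_genes(records, "glycolysis"))
```

Genes that pass such a filter are candidates for further verification as internal controls, not validated controls in themselves.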

Identification of genes varying in twins and unrelated individuals

Natural variation in gene expression between healthy human individuals has been the focus of recent studies (Cheung et al., 2003; Whitney et al., 2003). EXPOLDB houses information on the expression patterns of genes that vary in monozygotic twins and unrelated individuals, which can be queried in these distinct datasets by using the appropriate metric of variability (CV or signal log ratio) (Sharma et al., 2005). This genome-wide expression data from various individuals (within monozygotic twins and unrelated individuals) compiled in EXPOLDB provides an opportunity to understand the genetic basis of variation in gene expression between normal human individuals and may serve as a resource for researchers in this field and aid in systemic approaches.


©2003 IGIB