Institute of Genomics and Integrative Biology

Frequently Asked Questions

1. What does the term ‘normal population’ stand for?
2. Which tissue has been considered for the expression studies?
3. How is the monozygosity of the twins established?
4. How is the data scaled/normalized?
5. How are mean expression and coefficient of variation (CV) calculated?
6. What is the criterion for calling a gene differentially expressed in different individuals?
7. What is the difference between the two metrics ‘CV’ and ‘Fold Change’?
8. How are the (TG/CA)n repeats identified? What is the minimum length of a (TG/CA)n repeat? Are imperfect repeats also scored?
9. How are the Alu repeats scored?
10. Are the experimental conditions for each chip accessible?
11. What is the source database for the polymorphic repeats?
12. What is the source of housekeeping genes?
13. How is the EST based expression calculated for these genes?
14. Is the primary data available for download?
15. Where can I download the annotation files for the probes? Where can I find the probe sequences for a probe set?
16. How can one submit the data into EXPOLDB?
17. What are the future plans for new features and data sets?
18. Has the data been deposited into any of the public repositories?

1. What does the term ‘normal population’ stand for?

The term ‘normal population’ refers to the disease-free, healthy individuals considered for the study. Healthy unrelated individuals and twin pairs (with no known physical abnormalities or affliction) were recruited for the study.

2. Which tissue has been considered for the expression studies?

Leucocytes extracted from human blood were considered for the expression studies.

3. How is the monozygosity of the twins established?

Twelve highly polymorphic microsatellite markers, located on 8 different chromosomes (Perkin Elmer Linkage panel set version 2, PE Applied Biosystems, Foster City, CA) were used for haplotyping of genomic DNA from twins to assess their monozygosity.

4. How is the data scaled/normalized?

The data was normalized using the standard Affymetrix procedure. It takes all intensity values from a chip image, removes the top and bottom 2% of probes, and scales all values so that the average of the remaining probes equals a user-defined target value.
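
The scaling step described above can be sketched as follows. This is a minimal illustration, not the Affymetrix software itself: the function name, the `target` argument and the choice to trim 2% from each end of the sorted values are assumptions made for the example.

```python
def scale_intensities(values, target, trim_fraction=0.02):
    """Scale intensities so that their trimmed mean equals `target`.

    Assumption: `trim_fraction` of the probes is dropped from each end
    of the sorted intensities before the mean is taken.
    """
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)      # probes dropped per end
    kept = ordered[k:len(ordered) - k]
    trimmed_mean = sum(kept) / len(kept)
    if trimmed_mean == 0:
        raise ValueError("trimmed mean is zero; cannot scale")
    factor = target / trimmed_mean             # one factor for the whole chip
    return [v * factor for v in values]
```

Because a single factor is applied to every probe on a chip, the relative intensities within the chip are unchanged; only the overall level is brought to the chosen target so that chips can be compared.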

5. How are mean expression and coefficient of variation (CV) calculated?

The mean expression for each gene was computed from the average difference values after global scaling across all 9 arrays, as described previously (Hsiao et al. 2001). All genes considered for mean expression had a present (P) call in all arrays. Replicate probe sets, if present, were clustered together and averaged. A total of 2962 genes were found to be present in all the arrays. For these genes, the coefficient of variation (CV) was computed as SD/Mean, where SD is the standard deviation of the average difference values across all 9 arrays.
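
As a rough sketch of this calculation, the snippet below computes the mean and CV for genes called present on every array. The input layout (one list of values and one list of calls per gene) and the function name are hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which form of SD was used. Averaging of replicate probe sets is left out for brevity.

```python
from statistics import mean, stdev


def expression_summary(signals, calls):
    """Return {gene: (mean, cv)} for genes with a 'P' call on all arrays.

    signals: {gene: [scaled average difference value per array]}
    calls:   {gene: [detection call per array, e.g. 'P' or 'A']}
    """
    summary = {}
    for gene, values in signals.items():
        if any(call != "P" for call in calls[gene]):
            continue                      # gene not present on every array
        m = mean(values)
        if m == 0:
            continue                      # CV is undefined for a zero mean
        summary[gene] = (m, stdev(values) / m)   # CV = SD / Mean
    return summary
```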

6. What is the criterion for calling a gene differentially expressed in different individuals?

In pairwise comparisons, genes with a baseline 'absent' (A) call and genes with high noise (according to the Affymetrix call) were removed from the dataset. The differentially expressed genes were identified by selecting those with a 'present' (P) call that showed a fold change of more than 3 (either increase or decrease). We used a strict 3-fold cutoff instead of the suggested 2-fold cutoff because at this cutoff only 4 genes out of ~10,000 genes varied in duplicate experiments using the same RNA sample.
To identify differentially expressed genes between unrelated individuals an additional filter was applied. The filter dataset consisted of the genes that varied within monozygotic twins and the genes whose expression varied in duplicate array experiments between RNA samples isolated from the same individual at two different time points separated by 6 months. The filter was aimed at removing genes whose expression varied due to environmental causes.
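
The selection logic in this answer can be illustrated with a short sketch. It is not the code used to build EXPOLDB: the function names and data layout are hypothetical, the Affymetrix noise filter is omitted, and requiring a present call in both samples is an assumption. The `exclude` argument stands in for the filter dataset (genes that varied within twins or between time points).

```python
def fold_change(a, b):
    """Symmetric fold change: the larger value divided by the smaller."""
    return max(a, b) / min(a, b)


def differentially_expressed(expr_a, expr_b, calls_a, calls_b,
                             cutoff=3.0, exclude=frozenset()):
    """Return genes whose expression differs by more than `cutoff` fold.

    expr_*:  {gene: expression value} for the two samples compared
    calls_*: {gene: detection call}, 'P' meaning present
    exclude: genes to drop, e.g. those varying for environmental reasons
    """
    hits = set()
    for gene in expr_a.keys() & expr_b.keys():
        if calls_a[gene] != "P" or calls_b[gene] != "P":
            continue                      # assumption: present in both samples
        a, b = expr_a[gene], expr_b[gene]
        if min(a, b) <= 0:
            continue                      # fold change undefined
        if fold_change(a, b) > cutoff and gene not in exclude:
            hits.add(gene)
    return hits
```

For a comparison between unrelated individuals, `exclude` would hold the union of the genes that varied within monozygotic twins and those that varied between the two time points.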

7. What is the difference between the two metrics ‘CV’ and ‘Fold Change’?

The metric ‘fold change’ is used to query differentially expressed genes, whereas ‘coefficient of variation (CV)’ is used to query the variability in expression levels across individuals. ‘Fold change’ is measured in pairwise comparisons, whereas CV is measured across all arrays. To keep the two query strategies distinct and avoid confusion, the fold change selection option is provided only on the ‘Differentially Expressed Genes’ query page. Similarly, the CV option is available only on the ‘Expression in Blood’ page.

8. How are the (TG/CA)n repeats identified? What is the minimum length of a (TG/CA)n repeat? Are imperfect repeats also scored?

Perl scripts were written to identify perfect (TG/CA)n repeats in genes using a conservative cutoff length of n >= 6 units, since a minimum length of 8 repeat units has been reported to be polymorphic. Only perfect intragenic repeats were scored. The Perl program to identify the (TG/CA)n repeats is available on request.
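
For readers who want to see what such a scan looks like, here is a minimal sketch in Python. It is not the authors' Perl program (which is available on request); the function name and output format are hypothetical. It reports perfect runs of TG or CA of at least `min_units` units on the given strand.

```python
import re


def find_tg_ca_repeats(sequence, min_units=6):
    """Return (start, end, unit, n_units) for each perfect TG or CA run.

    Coordinates are 0-based and `end` is exclusive. Only uninterrupted
    (perfect) repeats of at least `min_units` units are reported.
    """
    pattern = re.compile(
        r"(?:TG){%d,}|(?:CA){%d,}" % (min_units, min_units)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        run = match.group()
        hits.append((match.start(), match.end(), run[:2], len(run) // 2))
    return hits
```

Since CA is the reverse complement of TG, searching one strand for both units covers (TG)n repeats on either strand.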

9. How are the Alu repeats scored?

Intragenic Alu repeats were scored using RepeatMasker.

10. Are the experimental conditions for each chip accessible?

Yes, the experimental conditions and other experimental details are available in the “Data Sources” section of the website.

11. What is the source database for the polymorphic repeats?

Polymorphic (TG/CA)n repeat markers were obtained from the CEPH database (http://bioinformatics.weizmann.ac.il/databases/ceph/ceph_genotype_db/ceph_db/v90/mkr/). The local stand-alone version of the NCBI BLAST2 software (Wheeler et al. 2003) was used to locate the polymorphic markers in the human genes.

12. What is the source of housekeeping genes?

Information on the housekeeping genes was retrieved from the Human Gene Expression Index Database (HuGE Index) (http://server3.mgh.harvard.edu/hio/databases), which contains a list of 451 known housekeeping genes with their mean expression values and coefficients of variation (Hsiao et al. 2001).

13. How is the EST based expression calculated for these genes?

The genes were classified into three categories of expression, high (H), moderate (M) and weak (W), by computing the abundance of ESTs of a given gene according to the procedure described by Bortoluzzi et al. (2000). For each gene, the number of ESTs obtained from a specific tissue was used to estimate the expression level of the gene in that tissue, expressed per thousand of the total detected transcriptional activity.
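
The per-thousand estimate can be illustrated as below. This is only a sketch under the assumption that the total detected transcriptional activity is the sum of EST counts over all genes in the tissue; the function name is hypothetical. The thresholds separating the H, M and W categories are not stated here, so the classification step is not shown.

```python
def est_abundance_per_thousand(est_counts):
    """Convert per-gene EST counts for one tissue to per-thousand values.

    est_counts: {gene: number of ESTs from the tissue}
    Assumption: the denominator is the total EST count over all genes.
    """
    total = sum(est_counts.values())
    if total == 0:
        raise ValueError("no ESTs recorded for this tissue")
    return {gene: 1000.0 * n / total for gene, n in est_counts.items()}
```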

14. Is the primary data available for download?

The data housed in EXPOLDB is available for free download as tab-delimited text files. The data files can be downloaded by clicking here. For any further information regarding the data files, contact us at ramu@igib.res.in.

15. Where can I download the annotation files for the probes? Where can I find the probe sequences for a probe set?

The annotation information and the details of a particular probe (including probe sequences) are available at the Affymetrix website.

16. How can one submit the data into EXPOLDB?

Gene expression data obtained through the Affymetrix protocol can be submitted to the database. It is mandatory to first submit the data to NCBI’s GEO repository and then submit it here. The data has to be in the same format as submitted to GEO. To submit data, please send a mail to ramu@igib.res.in and you will be guided through the submission steps.

17. What are the future plans for new features and data sets?

See Data Updates.

18. Has the data been deposited into any of the public repositories?

The raw data from the GeneChip experiments has been submitted to the Gene Expression Omnibus (GEO; www.ncbi.nlm.nih.gov/geo) under the following accession numbers: GSM14477, GSM14478, GSM14479, GSM14480, GSM14481, GSM14482, GSM14483, GSM14485 and GSM20645.

 


©2003 IGIB