Data Collection Document
- Data Collection of syndromic autism genes

- This dataset contains 99 genes taken from Catalina Betancur's review published in Brain Research in 2011 (Betancur, 2011). The review collected the genetic and genomic disorders in which ASDs have been described as one of the possible manifestations.
- Data Collection of non-syndromic autism genes

- Genes, CNVs and linkage regions associated with autism were retrieved from the literature and curated. Six categories of studies were included in our collection: genome-wide association studies, expression profiling, genome-wide CNV studies, linkage analysis, low-scale genetic association studies and other low-scale gene studies. Representative meta-data about key clinical and demographic characteristics were collected.
- Flow Chart of data collection

- We searched PubMed to retrieve the relevant literature. Figure 1 shows the flow chart of the data collection.
- Initial Search for Association Studies: "autism and associat*" (2740 hits)
- Initial Search for other gene Studies: "autism AND (gene OR microarray OR proteomics)" (1368 hits)
- Initial Search for CNV and Linkage Studies: "autism AND (CNV OR copy number variation OR microarray* OR microdel* OR microdup* OR rearrange* OR (genome-wide AND (linkage OR associa* OR scan)))"
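
The three searches above could be re-run programmatically. The following is a minimal sketch, assuming the public NCBI E-utilities ESearch endpoint is used; the helper name `build_esearch_url` and the dictionary keys are hypothetical, and no request is actually sent. Hit counts returned today would differ from the ones reported above.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (assumption: the original search was done
# through the PubMed web interface; this is only one way to reproduce it).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query strings copied verbatim from the searches listed above.
QUERIES = {
    "association": "autism and associat*",
    "other_gene": "autism AND (gene OR microarray OR proteomics)",
    "cnv_linkage": (
        "autism AND (CNV OR copy number variation OR microarray* OR microdel* "
        "OR microdup* OR rearrange* OR (genome-wide AND (linkage OR associa* OR scan)))"
    ),
}


def build_esearch_url(term: str, retmax: int = 0) -> str:
    """Build an ESearch URL for PubMed; retmax=0 asks only for the hit count."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    for name, term in QUERIES.items():
        print(name, build_esearch_url(term))
```
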
- Collection of meta-data
- Information about key clinical and demographic characteristics of each study was collected.
- Genome-wide association studies(GWAS)

- Expression Profiling

- Genome-wide CNV studies

- Genome-wide Linkage studies

- Low-scale association studies

- Other low-scale studies

| Categories | Related Features |
|---|---|
| Publication | First author; Year of publication; PubMed ID; Date of the inclusion |
| Population | Ancestral background, Country of origin; Sample and control inclusion and exclusion criteria; Number of cases and controls with gender ratio; Age at examination; Diagnosis Criteria |
| Type | "GWAS" (genome-wide association study); "Chromosome #" (chromosome-wide association study); "cSNP" (coding-region SNP); "pooled" (large-scale association study based on pooled genotyping); "Other" (other large-scale association study) |
| Stage | Discovery/Replication |
| Study Design | Family-based or case-control; Methods/Platform |
| Results | Number of polymorphisms; Related Genes; P value and combined P value |
| Genotype & allele distribution | Polymorphism (dbSNP ID or most commonly used name); Genotype distribution (allele frequency and genotype frequency) |
| Other autism related features | IQ; autism-specific endophenotype |

Table 1: Collected features of GWAS studies

| Categories | Related Features |
|---|---|
| Publication | First author; Year of publication; PubMed ID; Date of the inclusion |
| Population | Ancestral background, Country of origin; Sample and control inclusion and exclusion criteria; Number of cases and controls with gender ratio; Age at examination; Diagnosis Criteria; Tissue Used |
| Study Design | Methods/Platform; Statistic Methods; GEO ID |
| Results | Reported gene name; Reported probes/ESTs/RefSeq_ID; Fold Change; Up or Down regulated; P value |
| Other autism related features | IQ; autism-specific endophenotype |

Table 2: Collected features of Microarray studies

| Categories | Related Features |
|---|---|
| Publication | First author; Year of publication; PubMed ID; Date of the inclusion |
| Population | Ancestral background, Country of origin; Sample and control inclusion and exclusion criteria; Number of cases and controls with gender ratio; Age at examination; Diagnosis Criteria; Tissue Used |
| Study Design | Methods/Platform |
| Results | Reported gene name |

Table 3: Collected features of proteomics studies

| Categories | Related Features |
|---|---|
| Publication | First author; Year of publication; PubMed ID; Date of the inclusion |
| Population | Ancestral background, Country of origin; Sample and control inclusion and exclusion criteria; Number of cases and controls with gender ratio; Age at examination; Diagnosis Criteria |
| Study Design | Family-based or case-control; Methods/Platform |
| Results | CNV regions (chromosome, start and end); Band; Gain/Loss |
| Evidence Type | CNVs Only Present In Patients; De novo CNVs; Overlapping/Recurrent CNVs; CNVs Overlapping With ACRD; CNVs Not Present In Control; Significant Enriched CNVs; Others |

Table 4: Collected features of CNV studies

| Categories | Related Features |
|---|---|
| Publication | First author; Year of publication; PubMed ID; Date of the inclusion |
| Population | Ancestral background, Country of origin; Sample and control inclusion and exclusion criteria; Number of cases and controls with gender ratio; Age at examination; Diagnosis Criteria |
| Study Design | Family-based or case-control; Methods/Platform |
| Results | Linkage regions (chromosome, start and end); Band; Marker; LOD, NPL or P value |

Table 5: Collected features of Linkage studies

| Categories | Related Features |
|---|---|
| Publication | First author; Year of publication; PubMed ID; Date of the inclusion |
| Population | Ancestral background, Country of origin; Sample and control inclusion and exclusion criteria; Number of cases and controls with gender ratio; Age at examination; Diagnosis Criteria |
| Study Design | Family-based or case-control; Methods/Platform |
| Results | Reported gene name; Reported study results (positive or negative); P value |
| Genotype & allele distribution | Polymorphism (dbSNP ID or most commonly used name); Genotype distribution (allele frequency and genotype frequency) |
| Other autism related features | IQ; autism-specific endophenotype |

Table 6: Collected features of low scale association studies

| Categories | Related Features |
|---|---|
| Publication | First author; Year of publication; PubMed ID; Date of the inclusion |
| Population | Ancestral background, Country of origin; Number of cases and controls with gender ratio; Diagnosis Criteria; Tissue Used; autism-specific endophenotype |
| Study Design | Methods/Platform |
| Results | Reported gene name; Description of the gene with autism; Reported study results (positive or negative) |
| Evidence Type | Genetics; RNA level function; protein level function |

Table 7: Collected features of other low scale studies

- Data Statistics

- Quality Score

- We built a scoring system to score the different datasets. All genes located in the CNVs or linkage regions were retrieved from UCSC. In total, 12,180 genes were collected in our final gene lists. Table 8 shows the scoring function for each category.
- Function of Quality Score for different categories

- Score Distribution of different categories

- Here, we list the quality score distribution of the six categories:
- Ranking System
- Ranking Algorithm

- The scores from each experimental type are weighted individually and then combined into a single score by the following function:
- Score_i = 0 if no positive evidence.
- For N datasets with K possible weights each (e.g. K = N+1), there are K^N possible weight combinations, which form the weight matrix pool.
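
The combination step and the weight matrix pool can be sketched as follows. This is a minimal illustration, not the original implementation: it assumes the combined score is a weighted sum of the per-category quality scores, and that the K candidate weights are the integers 1..K (consistent with the values in Table 9, but not stated in the text). The function names are hypothetical.

```python
from itertools import product

# Category names as they appear in Table 9.
CATEGORIES = ["GWAS", "Expression", "CNV", "Linkage",
              "low scale association", "low scale others"]

# Final weights reported in Table 9.
FINAL_WEIGHTS = dict(zip(CATEGORIES, [7, 1, 3, 1, 6, 5]))


def combined_score(scores, weights):
    """Weighted sum of quality scores (assumed form of the scoring function).

    A category missing from `scores` means no positive evidence, i.e. Score_i = 0.
    """
    return sum(w * scores.get(cat, 0) for cat, w in weights.items())


def weight_pool(categories, k):
    """Yield every assignment of a weight in 1..k to each category (k**N in total)."""
    for combo in product(range(1, k + 1), repeat=len(categories)):
        yield dict(zip(categories, combo))


if __name__ == "__main__":
    # Hypothetical gene with quality score 1 in GWAS and 2 in CNV.
    print(combined_score({"GWAS": 1, "CNV": 2}, FINAL_WEIGHTS))
    # Pool size for N = 6 categories and K = N + 1 = 7 weights.
    print(7 ** len(CATEGORIES))
```

With N = 6 and K = 7 the pool holds 7^6 = 117,649 weight combinations, which is small enough to enumerate exhaustively.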
- Benchmark Dataset

- We collected a benchmark dataset of high-confidence genes from 6 highly accessed review papers published since 2004. Except for the genes in the newest review (State, 2010), every gene had to be mentioned in at least 2 review papers. (http://autism.cbi.pku.edu.cn/core_dataset.php)
- Weight Matrix

- (1) For each weight matrix in the matrix pool, a combined score is calculated for each gene by function 1.
- (2) All genes collected from all sources and the core genes are each sorted by their combined scores.
- (3) From these two sorted lists, a vector is generated to record the ranking positions of the core genes in the ranked candidate gene list.
- (4) The matrix is selected if m of the core genes are ranked in the top n of the candidate genes. The position (j) of the m-th core gene in the candidate gene list is recorded for the evaluation in step two.
- (5) The above steps are repeated until all weight matrices have been analyzed.
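
Steps (1) to (5) can be sketched as the loop below. It is an illustration under stated assumptions: the combined score is taken to be a weighted sum, ties keep their input order, and the selected matrices are returned sorted by j with the smallest first (the text only says that j is recorded for later evaluation). All function and variable names are hypothetical.

```python
def rank_genes(gene_scores, weights):
    """Return gene names sorted by descending combined score (assumed weighted sum)."""
    def total(gene):
        return sum(w * gene_scores[gene].get(cat, 0) for cat, w in weights.items())
    return sorted(gene_scores, key=total, reverse=True)


def evaluate_matrix(gene_scores, core_genes, weights, m, n):
    """Return j, the rank of the m-th core gene, if m core genes fall in the top n.

    Returns None when the matrix does not satisfy the selection criterion.
    """
    ranked = rank_genes(gene_scores, weights)
    # Step (3): 1-based ranking positions of the core genes in the candidate list.
    positions = [i for i, gene in enumerate(ranked, start=1) if gene in core_genes]
    if len(positions) >= m and positions[m - 1] <= n:
        return positions[m - 1]
    return None


def select_matrices(gene_scores, core_genes, pool, m, n):
    """Steps (4) and (5): keep every matrix that passes, together with its j."""
    selected = []
    for weights in pool:
        j = evaluate_matrix(gene_scores, core_genes, weights, m, n)
        if j is not None:
            selected.append((j, weights))
    return sorted(selected, key=lambda item: item[0])


if __name__ == "__main__":
    # Toy data: three hypothetical genes scored in two hypothetical categories.
    scores = {"A": {"x": 3}, "B": {"y": 3}, "C": {"x": 1}}
    core = {"A", "C"}
    pool = [{"x": 2, "y": 1}, {"x": 1, "y": 3}]
    print(select_matrices(scores, core, pool, m=2, n=3))
```
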
- The matrix with the best gene rank (95% of the Benchmark Dataset genes in 98% of all the genes) was chosen:
- Score Distribution

- We use 9, the minimum score in the Benchmark Dataset (SHANK2), as the final cutoff. The dataset has 383 genes. The score distribution is listed below:

| Categories | GWAS | Expression | CNV | Linkage | Low scale association | Low scale others |
|---|---|---|---|---|---|---|
| Weight | 7 | 1 | 3 | 1 | 6 | 5 |

Table 9: Final Weight Matrix

Figure 8: Distribution of the combined score upon cutoff
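
The cutoff rule can be illustrated with a short sketch. It assumes the cutoff is inclusive, so that the lowest-scoring benchmark gene (SHANK2, score 9) is itself retained. The gene names other than SHANK2 and all scores other than 9 are made up for the example.

```python
def apply_cutoff(combined, benchmark):
    """Return (cutoff, kept genes) using the minimum benchmark score as cutoff.

    combined:  dict mapping gene name -> combined score
    benchmark: set of benchmark (core) gene names
    """
    cutoff = min(combined[gene] for gene in benchmark if gene in combined)
    # Assumption: genes scoring exactly the cutoff are kept (inclusive threshold).
    kept = {gene: score for gene, score in combined.items() if score >= cutoff}
    return cutoff, kept


if __name__ == "__main__":
    combined = {"SHANK2": 9, "GENE_A": 20, "GENE_B": 4}  # hypothetical scores
    cutoff, kept = apply_cutoff(combined, {"SHANK2", "GENE_A"})
    print(cutoff, sorted(kept))
```
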
- Acknowledgement
- We thank Viktor Persson for the "Indigo" template of the web interface.
- We thank Ge Gao, Chuan-Yun Li, Yong-Xin Ye, and Ying-Fu Zhong for useful comments on the web interface.

Figure 1: Flow Chart of Data Collection
| Categories of studies | | Number of Genes |
|---|---|---|
| Syndromic Autism Genes | | 99 |
| Non-syndromic Autism Genes | GWAS | 132 |
| | Expression studies | 1664 |
| | Low Scale Association studies | 163 |
| | Other Low Scale Studies | 308 |
| | Total | 2135 |
| Total | | 2193 |
| CNVs | | 4964 CNVs |
| Linkage regions | | 158 Linkage Regions |

Table 7: Data statistics of the current collection
| Experimental Methods | Quality Score of the genes |
|---|---|
| Low scale Association studies | Score 1: one positive study (P<=0.05); Score 2: two or more positive studies and P>0.001; Score 3: two or more positive studies and P<=0.001 |
| GWAS | Score 1: one positive study (P<=1e-5); Score 2: two positive studies and P>1e-7; Score 3: two positive studies and P<=1e-7 |
| Expression studies | Score 1: one positive study; Score 2: two positive studies; Score 3: three or more positive studies |
| Single gene studies | Score 1: one positive study; Score 2: two positive studies; Score 3: three or more positive studies |
| Score of CNVs related genes | Score 1: 1-3 positive studies; Score 2: 4-8 positive studies; Score 3: >=9 positive studies |
| Score of Linkage regions related genes | Score 1: 1-3 positive studies; Score 2: 4-8 positive studies; Score 3: >=9 positive studies |
Table 8: Function of the score system
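
The rules in Table 8 can be written as small functions. This is a sketch under assumptions that the table does not spell out: a study counts as positive when its P value is at or below the Score 1 threshold, the choice between Score 2 and Score 3 is made on the smallest reported P value, and "two positive studies" for GWAS is read as "two or more". Function names are hypothetical.

```python
def count_score(n_positive, bounds):
    """Score 0-3 from a count; bounds are the minimum counts for scores 1, 2, 3."""
    return sum(n_positive >= b for b in bounds)


def expression_score(n_positive):
    """Expression and single gene studies: 1, 2, or 3+ positive studies."""
    return count_score(n_positive, (1, 2, 3))


def region_score(n_positive):
    """Genes in CNVs or linkage regions: 1-3, 4-8, or 9+ positive studies."""
    return count_score(n_positive, (1, 4, 9))


def association_score(p_values, positive_cut, strong_cut):
    """Association-type score from the P values of the reported studies."""
    positive = [p for p in p_values if p <= positive_cut]
    if not positive:
        return 0
    if len(positive) == 1:
        return 1
    # Assumption: the strongest (smallest) P value decides between 2 and 3.
    return 3 if min(positive) <= strong_cut else 2


def low_scale_association_score(p_values):
    return association_score(p_values, positive_cut=0.05, strong_cut=0.001)


def gwas_score(p_values):
    return association_score(p_values, positive_cut=1e-5, strong_cut=1e-7)
```

For example, a gene with two low scale association studies at P = 0.03 and P = 0.0005 would receive Score 3 under these assumptions.
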
| Experimental Methods | Score | Number of genes |
|---|---|---|
| Low scale Association studies | 1 | 128 |
| | 2 | 23 |
| | 3 | 12 |
| GWAS | 1 | 81 |
| | 2 | 46 |
| | 3 | 5 |
| Expression studies | 1 | 1320 |
| | 2 | 285 |
| | 3 | 59 |
| Single gene studies | 1 | 241 |
| | 2 | 37 |
| | 3 | 30 |
| Score of CNVs related genes | 1 | 1086 |
| | 2 | 34 |
| | 3 | 19 |
| Score of Linkage regions related genes | 1 | 535 |
| | 2 | 43 |
| | 3 | 0 |

