AutismKB



Data Collection Document


  • Data Collection of syndromic autism genes
  • This dataset contains 99 genes and comes from Catalina Betancur's review published in Brain Research in 2011 (Betancur, 2011). It collects the genetic and genomic disorders in which ASDs have been described as one of the possible manifestations.

  • Data Collection of non-syndromic autism genes
  • Genes, CNVs and linkage regions associated with autism were identified from the literature and curated. Six categories of studies were included in our collection: genome-wide association studies, expression profiling, genome-wide CNV studies, linkage analysis, low-scale genetic association studies and other low-scale gene studies. Representative meta-data about key clinical and demographic characteristics were collected.
    • Flow Chart of data collection
    • We searched PubMed to retrieve the relevant literature. Figure 1 shows the flow chart of the data collection.
      Initial Search for Association Studies: "autism and associat*" (2740 hits)
      Initial Search for Other Gene Studies: "autism AND (gene OR microarray OR proteomics)" (1368 hits)
      Initial Search for CNV and Linkage Studies: "autism AND (CNV OR copy number variation OR microarray* OR microdel* OR microdup* OR rearrange* OR (genome-wide AND (linkage OR associa* OR scan)))"

      Figure 1: Flow Chart of Data Collection
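The three initial searches can be reproduced programmatically. The sketch below only builds request URLs for the NCBI E-utilities ESearch endpoint and performs no network access. The dictionary keys and the helper name `build_esearch_url` are illustrative assumptions, not part of AutismKB, and hit counts returned today will differ from those reported above.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Query strings copied from the initial searches listed above.
# The dictionary keys are hypothetical labels chosen for this sketch.
QUERIES = {
    "association": "autism and associat*",
    "other_gene": "autism AND (gene OR microarray OR proteomics)",
    "cnv_linkage": (
        "autism AND (CNV OR copy number variation OR microarray* OR microdel* "
        "OR microdup* OR rearrange* OR "
        "(genome-wide AND (linkage OR associa* OR scan)))"
    ),
}


def build_esearch_url(term: str, retmax: int = 0) -> str:
    """Return an ESearch URL for `term`.

    `retmax` caps the number of PubMed IDs returned; the total hit count
    is reported in the response regardless of this value.
    """
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    for name, term in QUERIES.items():
        print(name, build_esearch_url(term))
```

Keeping the queries as plain strings makes it easy to rerun the same search at a later date and compare the hit counts.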

    • Collection of meta-data
      Information about key clinical and demographic characteristics of each study was collected.
      • Genome-wide association studies (GWAS)
      • Categories: Related Features
        Publication:
          First author
          Year of publication
          PubMed ID
          date of the inclusion
        Population:
          Ancestral background, Country of origin
          Sample and control inclusion and exclusion criteria
          Number of cases and controls with gender ratio
          Age at examination
          Diagnosis Criteria
        Type:
          "GWAS" (genome-wide association study);
          "Chromosome #" (chromosome-wide association study);
          "cSNP" (coding-region SNP);
          "pooled" (large-scale association study based on pooled genotyping);
          "Other" (other large-scale association study);
        Stage:
          Discovery/Replication
        Study Design:
          Family-based or case-control
          Methods/Platform
        Results:
          Number of polymorphisms
          Related Genes
          P value and combined P value
        Genotype & allele distribution:
          Polymorphism (dbSNP ID or most commonly used name)
          Genotype distribution (allele frequency and genotype frequency)
        Other autism related features:
          IQ
          autism-specific endophenotype

        Table 1: Collected features of GWAS studies
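One way to hold a row of this meta-data in code is a small record type. The sketch below covers only a subset of the Table 1 features. The class name `GwasFinding`, its field names and the validation rules are assumptions made for illustration and do not describe the actual AutismKB schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Study types and stages as listed in Table 1.
STUDY_TYPES = {"GWAS", "Chromosome #", "cSNP", "pooled", "Other"}
STAGES = {"Discovery", "Replication"}


@dataclass
class GwasFinding:
    """One reported polymorphism from one large-scale association study.

    Hypothetical record layout; field names are not taken from AutismKB.
    """

    first_author: str
    year: int
    pubmed_id: str
    study_type: str      # one of STUDY_TYPES
    stage: str           # one of STAGES
    study_design: str    # "Family-based" or "case-control"
    polymorphism: str    # dbSNP ID or most commonly used name
    related_genes: List[str] = field(default_factory=list)
    p_value: Optional[float] = None

    def __post_init__(self) -> None:
        # Reject values outside the controlled vocabularies of Table 1.
        if self.study_type not in STUDY_TYPES:
            raise ValueError(f"unknown study type: {self.study_type!r}")
        if self.stage not in STAGES:
            raise ValueError(f"unknown stage: {self.stage!r}")


if __name__ == "__main__":
    # All values below are placeholders, not real study data.
    example = GwasFinding(
        first_author="Doe", year=2010, pubmed_id="00000000",
        study_type="GWAS", stage="Discovery", study_design="case-control",
        polymorphism="rs0000000", related_genes=["GENE1"], p_value=1e-6,
    )
    print(example)
```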

      • Expression Profiling
      • Categories: Related Features
        Publication:
          First author
          Year of publication
          PubMed ID
          date of the inclusion
        Population:
          Ancestral background, Country of origin
          Sample and control inclusion and exclusion criteria
          Number of cases and controls with gender ratio
          Age at examination
          Diagnosis Criteria
          Tissue Used
        Study Design:
          Methods/Platform
          Statistic Methods
          GEO ID
        Results:
          Reported gene name
          Reported probes/ESTs/RefSeq_ID
          Fold Change; Up or Down regulated; P value
        Other autism related features:
          IQ
          autism-specific endophenotype

        Table 2: Collected features of Microarray studies

        Categories: Related Features
        Publication:
          First author
          Year of publication
          PubMed ID
          date of the inclusion
        Population:
          Ancestral background, Country of origin
          Sample and control inclusion and exclusion criteria
          Number of cases and controls with gender ratio
          Age at examination
          Diagnosis Criteria
          Tissue Used
        Study Design:
          Methods/Platform
        Results:
          Reported gene name

        Table 3: Collected features of proteomics studies

      • Genome-wide CNV studies
      • Categories: Related Features
        Publication:
          First author
          Year of publication
          PubMed ID
          date of the inclusion
        Population:
          Ancestral background, Country of origin
          Sample and control inclusion and exclusion criteria
          Number of cases and controls with gender ratio
          Age at examination
          Diagnosis Criteria
        Study Design:
          Family-based or case-control
          Methods/Platform
        Results:
          CNV regions (chromosome, start and end)
          Band
          Gain/Loss
        Evidence Type:
          CNVs Only Present In Patients;
          De novo CNVs;
          Overlapping/Recurrent CNVs;
          CNVs Overlapping With ACRD;
          CNVs Not Present In Control;
          Significant Enriched CNVs;
          Others

        Table 4: Collected features of CNV studies

      • Genome-wide Linkage studies
      • Categories: Related Features
        Publication:
          First author
          Year of publication
          PubMed ID
          date of the inclusion
        Population:
          Ancestral background, Country of origin
          Sample and control inclusion and exclusion criteria
          Number of cases and controls with gender ratio
          Age at examination
          Diagnosis Criteria
        Study Design:
          Family-based or case-control
          Methods/Platform
        Results:
          Linkage regions (chromosome, start and end)
          Band
          Marker
          LOD, NPL or P value

        Table 5: Collected features of Linkage studies

      • Low-scale association studies
      • Categories: Related Features
        Publication:
          First author
          Year of publication
          PubMed ID
          date of the inclusion
        Population:
          Ancestral background, Country of origin
          Sample and control inclusion and exclusion criteria
          Number of cases and controls with gender ratio
          Age at examination
          Diagnosis Criteria
        Study Design:
          Family-based or case-control
          Methods/Platform
        Results:
          Reported gene name
          Reported study results (positive or negative)
          P value
        Genotype & allele distribution:
          Polymorphism (dbSNP ID or most commonly used name)
          Genotype distribution (allele frequency and genotype frequency)
        Other autism related features:
          IQ
          autism-specific endophenotype

        Table 6: Collected features of low scale association studies

      • Other low-scale studies
      • Categories: Related Features
        Publication:
          First author
          Year of publication
          PubMed ID
          date of the inclusion
        Population:
          Ancestral background, Country of origin
          Number of cases and controls with gender ratio
          Diagnosis Criteria
          Tissue Used
          autism-specific endophenotype
        Study Design:
          Methods/Platform
        Results:
          Reported gene name
          Description of the gene with autism
          Reported study results (positive or negative)
        Evidence Type:
          Genetics; RNA level function; protein level function

        Table 7: Collected features of other low scale studies

    • Data Statistics
    • Categories of studies: Number of Genes
      Syndromic Autism Genes: 99
      Non-syndromic Autism Genes:
        GWAS: 132
        Expression studies: 1664
        Low Scale Association studies: 163
        Other Low Scale Studies: 308
        Total: 2135
      Total: 2193
      CNVs: 4964 CNVs
      Linkage regions: 158 Linkage Regions

      Table 7: Data statistics of current collection

  • Quality Score
  • We built a scoring system to score the different datasets. All the genes in the CNVs or linkage regions were retrieved from UCSC. In total, 12,180 genes were collected in our final gene lists. Table 8 shows the function of our scoring system.
    • Function of Quality Score for different categories
    • Experimental Methods: Quality Score of the genes
      Low scale Association studies:
        Score 1: one positive study (P<=0.05);
        Score 2: two or more positive studies and P>0.001;
        Score 3: two or more positive studies and P<=0.001
      GWAS:
        Score 1: one positive study (P<=1e-5);
        Score 2: two positive studies and P>1e-7;
        Score 3: two positive studies and P<=1e-7
      Expression studies:
        Score 1: one positive study;
        Score 2: two positive studies
        Score 3: three or more positive studies
      Single gene studies:
        Score 1: one positive study;
        Score 2: two positive studies
        Score 3: three or more positive studies
      Score of CNVs related genes:
        Score 1: 1-3 positive studies;
        Score 2: 4-8 positive studies;
        Score 3: >=9 positive studies
      Score of Linkage regions related genes:
        Score 1: 1-3 positive studies;
        Score 2: 4-8 positive studies;
        Score 3: >=9 positive studies

      Table 8: Function of the score system
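The rules in Table 8 can be written as two small functions. This is a minimal sketch under stated assumptions: the function names are hypothetical, "P" is taken to be the smallest P value among a gene's positive studies, "two positive studies" for GWAS is read as "two or more", and a gene with no positive study scores 0.

```python
def association_score(n_positive: int, best_p: float, strong_p: float) -> int:
    """Quality score for low-scale association or GWAS evidence.

    n_positive: number of studies already counted as positive
                (P<=0.05 for low-scale association, P<=1e-5 for GWAS).
    best_p:     smallest P value among those studies (assumption).
    strong_p:   threshold separating score 2 from score 3
                (0.001 for low-scale association, 1e-7 for GWAS).
    """
    if n_positive <= 0:
        return 0
    if n_positive == 1:
        return 1
    return 3 if best_p <= strong_p else 2


def count_score(n_positive: int, score2_from: int, score3_from: int) -> int:
    """Quality score based only on the number of positive studies.

    Expression and single gene studies: score2_from=2, score3_from=3.
    CNV and linkage related genes:      score2_from=4, score3_from=9.
    """
    if n_positive <= 0:
        return 0
    if n_positive >= score3_from:
        return 3
    if n_positive >= score2_from:
        return 2
    return 1


if __name__ == "__main__":
    # Hypothetical inputs, not real AutismKB records.
    print(association_score(n_positive=2, best_p=0.0004, strong_p=0.001))
    print(count_score(n_positive=5, score2_from=4, score3_from=9))
```

Passing the thresholds as arguments keeps one function per rule shape, so the six categories of Table 8 differ only in their parameters.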

    • Score Distribution of different categories
    • Here we list the quality score distribution of the six categories:
      Experimental Methods: Score: Number of genes
      Low scale Association studies:
        1: 128
        2: 23
        3: 12
      GWAS:
        1: 81
        2: 46
        3: 5
      Expression studies:
        1: 1320
        2: 285
        3: 59
      Single gene studies:
        1: 241
        2: 37
        3: 30
      Score of CNVs related genes:
        1: 1086
        2: 34
        3: 19
      Score of Linkage regions related genes:
        1: 535
        2: 43
        3: 0
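As a consistency check, the per-score gene counts can be summed and compared with the gene counts in the data statistics table. The sketch below reads each row of the distribution as a score followed by a number of genes. Pairing "Single gene studies" with "Other Low Scale Studies" is an assumption, and the CNV and linkage categories are summed but not compared, because the statistics table reports region counts rather than gene counts for them.

```python
# Score distribution, read as {score: number of genes} per category.
DISTRIBUTION = {
    "Low scale Association studies": {1: 128, 2: 23, 3: 12},
    "GWAS": {1: 81, 2: 46, 3: 5},
    "Expression studies": {1: 1320, 2: 285, 3: 59},
    "Single gene studies": {1: 241, 2: 37, 3: 30},
    "CNVs related genes": {1: 1086, 2: 34, 3: 19},
    "Linkage regions related genes": {1: 535, 2: 43, 3: 0},
}

# Gene counts from the data statistics table. "Single gene studies" is
# matched to "Other Low Scale Studies" there (assumption).
REPORTED = {
    "Low scale Association studies": 163,
    "GWAS": 132,
    "Expression studies": 1664,
    "Single gene studies": 308,
}


def category_totals(distribution):
    """Sum the gene counts over all scores for each category."""
    return {name: sum(by_score.values()) for name, by_score in distribution.items()}


def mismatches(distribution, reported):
    """Return {category: (summed, reported)} for categories that disagree."""
    totals = category_totals(distribution)
    return {
        name: (totals[name], count)
        for name, count in reported.items()
        if totals[name] != count
    }


if __name__ == "__main__":
    print(category_totals(DISTRIBUTION))
    print(mismatches(DISTRIBUTION, REPORTED))
```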
  • Ranking System
    • Ranking Algorithm
    • The scores from each experimental type are weighted individually and then combined into a single score, calculated by the following function:
      Score_i = 0 if no positive evidence.
      For N datasets, each dataset can take one of K (e.g. N+1) different weights, which forms a pool of K^N weight matrices.
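The combining function itself is not reproduced in this text. The sketch below assumes the simplest form consistent with the description, a weighted sum of the per-category quality scores, and enumerates the weight pool with one of K candidate weights per category. The names `combined_score` and `weight_pool`, the example gene scores and the use of integer weights 1..K are all assumptions.

```python
from itertools import product
from typing import Dict, Iterable, Iterator, Tuple

# The six evidence categories, in the order used by the weight table.
CATEGORIES = [
    "GWAS", "Expression", "CNV", "Linkage",
    "low scale association", "low scale others",
]


def combined_score(scores: Dict[str, int], weights: Dict[str, int]) -> int:
    """Weighted sum of quality scores (assumed form of the function).

    A category with no positive evidence contributes a score of 0.
    """
    return sum(weights[c] * scores.get(c, 0) for c in CATEGORIES)


def weight_pool(n_categories: int,
                candidate_weights: Iterable[int]) -> Iterator[Tuple[int, ...]]:
    """Yield every assignment of one candidate weight to each category.

    With K candidate weights and N categories this yields K**N tuples.
    """
    return product(list(candidate_weights), repeat=n_categories)


if __name__ == "__main__":
    # Final weights reported in Table 9 of this document.
    final_weights = dict(zip(CATEGORIES, [7, 1, 3, 1, 6, 5]))
    # Hypothetical gene with a GWAS score of 1 and a CNV score of 2.
    print(combined_score({"GWAS": 1, "CNV": 2}, final_weights))
    # Pool size for N=6 categories and K=7 candidate weights.
    print(sum(1 for _ in weight_pool(6, range(1, 8))))
```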

      • Benchmark Dataset
      • We collected a benchmark dataset of high-confidence genes from 6 highly accessed review papers published since 2004. Except for the genes in the newest review (State, 2010), every gene had to be mentioned in at least 2 review papers. (http://autism.cbi.pku.edu.cn/core_dataset.php)

      • Weight Matrix
      • (1) For each weight matrix in the matrix pool, a combined score is calculated for each gene by function 1.
        (2) All genes collected from all sources and the core genes are each sorted by their combined scores.
        (3) From these two sorted lists, a vector is generated to record the ranking positions of the core genes in the ranked candidate gene list.
        (4) The matrix is selected if m of the core genes are ranked in the top n of the candidate genes. The position (j) of the m-th gene in the candidate gene list is recorded for the evaluation in step two.
        (5) The above steps are repeated until all weight matrices have been analyzed.
        The matrix with the best gene rank (95% of the Benchmark Dataset genes in 98% of all the genes) was chosen:
        Categories: Weight
        GWAS: 7
        Expression: 1
        CNV: 3
        Linkage: 1
        low scale association: 6
        low scale others: 5

        Table 9: Final Weight Matrix
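The selection procedure in steps (1) to (5) can be sketched on toy data. This is an illustration under assumptions, not the AutismKB implementation: the combined score is taken to be a weighted sum, ties are broken by gene name, and "best gene rank" is interpreted as the smallest position j of the m-th core gene. All gene names and scores below are made up.

```python
from itertools import product
from typing import Dict, List, Optional, Sequence, Tuple


def core_positions(gene_scores: Dict[str, Sequence[int]],
                   weights: Sequence[int],
                   core_genes: Sequence[str]) -> List[int]:
    """Steps (1)-(3): rank all genes, return sorted ranks of the core genes."""
    combined = {
        gene: sum(w * s for w, s in zip(weights, scores))
        for gene, scores in gene_scores.items()
    }
    # Highest combined score first; gene name breaks ties (assumption).
    ranked = sorted(combined, key=lambda g: (-combined[g], g))
    position = {gene: i + 1 for i, gene in enumerate(ranked)}
    return sorted(position[g] for g in core_genes)


def select_weights(gene_scores: Dict[str, Sequence[int]],
                   core_genes: Sequence[str],
                   candidate_weights: Sequence[int],
                   m: int, n: int) -> Optional[Tuple[Tuple[int, ...], int]]:
    """Steps (4)-(5): keep weight vectors placing m core genes in the top n.

    Returns (weights, j) with the smallest j, or None if nothing qualifies.
    """
    n_categories = len(next(iter(gene_scores.values())))
    best = None
    for weights in product(candidate_weights, repeat=n_categories):
        j = core_positions(gene_scores, weights, core_genes)[m - 1]
        if j <= n and (best is None or j < best[1]):
            best = (weights, j)
    return best


if __name__ == "__main__":
    # Toy quality scores for two evidence categories (hypothetical).
    toy_scores = {"A": (3, 0), "B": (0, 2), "C": (2, 0), "D": (0, 1)}
    print(select_weights(toy_scores, core_genes=["A", "B"],
                         candidate_weights=[1, 2], m=2, n=2))
```

An exhaustive search like this stays cheap for small pools; with six categories and seven candidate weights the pool has 7^6 entries, which is still practical to enumerate.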

      • Score Distribution
      • We use 9, the minimum score in the Benchmark Dataset (SHANK2), as the final cutoff. The resulting dataset has 383 genes. The score distribution is shown below:

        Figure 8: Distribution of the combined score upon cutoff

  • Acknowledgement
    We thank Viktor Persson for the "Indigo" template of the web interface.
    We thank Ge Gao, Chuan-Yun Li, Yong-Xin Ye, and Ying-Fu Zhong for useful comments on the web interface.

AutismKB Statistics

  • Studies: 616
  • Genes: 3,075
  • SNPs/VNTRs: 3,386
  • CNVs: 4,617
  • Linkage Regions: 158
  • Last Update: 05/25/2012

Registered with NIF