Methods

About This Page

Complete documentation of the analysis methodology including data collection from SFARI/PubMed, AI-powered phenotype extraction, Bayesian Gaussian Mixture Model clustering, and multi-method validation. This platform presents a systematic analysis of genotype-phenotype relationships in ASD based on literature extraction from peer-reviewed publications.

  • Genes Analyzed: 241 (with 5+ supporting papers)
  • Papers Processed: ~1,050 (PubMed indexed)
  • Phenotype Extractions: ~6,100 (56 standardized traits)
  • Gene Clusters: 12 (6 major + 6 minor subtypes)

Data Pipeline Overview

Complete pipeline from SFARI gene database to validated phenotype clusters

Data Collection Pipeline

1. Gene Source: SFARI Gene Database

1,267 genes from the January 14, 2026 SFARI release were imported. Each gene includes: symbol, name, Ensembl ID, chromosome location, SFARI evidence score (1-3, where 1 = high confidence), and syndromic status flag.

2. Literature Harvesting via PubMed

For each gene, PubMed was queried via the NCBI Entrez API using the structured query:

[GENE_NAME] AND (autism OR asd) AND Humans[MeSH]

Animal study filtering: Title-level blocklist plus 500+ keywords to exclude non-human studies. PMC full text retrieved when available with 3-second rate limiting. Average yield: ~50 papers per gene.
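
The query template and title-level filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline's actual code: `build_query`, `is_probably_animal_study`, and the short `ANIMAL_TERMS` set are hypothetical names standing in for the real 500+ keyword blocklist, and no request is sent to PubMed.

```python
import re

# Hypothetical stand-in for the real 500+ keyword blocklist.
ANIMAL_TERMS = {"mouse", "mice", "zebrafish", "drosophila", "rat", "rats"}


def build_query(gene_name: str) -> str:
    """Fill the structured PubMed query template for one gene."""
    return f"{gene_name} AND (autism OR asd) AND Humans[MeSH]"


def is_probably_animal_study(title: str, blocklist=ANIMAL_TERMS) -> bool:
    """Return True if any blocklisted term appears as a whole word in the title."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    return not words.isdisjoint(blocklist)


titles = [
    "SHANK3 haploinsufficiency in a cohort of children with autism",
    "Shank3 mutant mice display autistic-like behaviours",
]
kept = [t for t in titles if not is_probably_animal_study(t)]

print(build_query("SHANK3"))
print(kept)
```

Matching on whole words rather than substrings avoids false hits such as "rat" inside "rate"; whether the real filter works this way is not stated on this page.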

3. Analysis Subset Selection

Genes with 5+ supporting papers were selected for clustering analysis, yielding 241 genes with sufficient phenotypic data for robust clustering.

AI-Powered Phenotype Extraction

AI Models

  • Primary Extraction: Claude Haiku 4.5 (claude-haiku-4-5-20251001) - high-throughput extraction
  • Quality Validation: Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) - accuracy verification

Extraction Schema (ASD_DX_SX.json v6.0)

100+ traits across 8 clinical categories aligned with DSM-5 criteria:

  • Social Communication - Social reciprocity, nonverbal communication, relationships
  • Restricted/Repetitive Behaviors - Stereotypies, insistence on sameness, sensory
  • Cognitive/Executive - Executive function, attention, cognitive style
  • Neuro-Motor - Motor coordination, seizures, somatic features
  • Emotional Regulation - Anxiety, externalizing behaviors, shutdowns
  • Physical/Growth - Craniofacial features, cardiac, skeletal, growth
  • Developmental - Global delay, intellectual disability, language
  • Sensory - Hearing, vision impairments

Tri-State Classification

Each phenotype is recorded with:

  • Status: Present / Absent / NR (Not Reported)
  • Confidence Score: 0.0-1.0 indicating extraction reliability
  • Evidence Snippets: Source text for validation
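
One way to picture a single extraction record is the small data structure below. It is a sketch only: the class and field names (`Status`, `PhenotypeExtraction`, `evidence_snippets`) are assumptions chosen for illustration and are not taken from ASD_DX_SX.json.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    PRESENT = "Present"
    ABSENT = "Absent"
    NOT_REPORTED = "NR"


@dataclass
class PhenotypeExtraction:
    trait: str
    status: Status
    confidence: float                                  # 0.0-1.0 extraction reliability
    evidence_snippets: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject confidence values outside the documented range.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0.0, 1.0]")


record = PhenotypeExtraction(
    trait="seizures",
    status=Status.PRESENT,
    confidence=0.9,
    evidence_snippets=["(source sentence quoted from the paper)"],
)
print(record)
```

Keeping "NR" as its own state, rather than folding it into "Absent", preserves the distinction between a trait that was looked for and not found and one that was never mentioned.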

Quality Control

  • Sonnet validation of Haiku extractions (max 50 papers/run)
  • Confidence calibration tracking against validation accuracy
  • Cohort overlap detection via demographic fingerprinting
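
Confidence calibration tracking can be illustrated by binning extraction confidences and comparing each bin's mean confidence with the share of extractions the validator agreed with. The sketch below assumes equal-width bins and made-up validation outcomes; `calibration_table` is a hypothetical helper, not part of the platform.

```python
from collections import defaultdict


def calibration_table(records, n_bins=5):
    """Group (confidence, was_correct) pairs into equal-width bins.

    Returns {bin_index: (mean_confidence, observed_accuracy, count)}.
    """
    bins = defaultdict(list)
    for confidence, was_correct in records:
        # Clamp so that confidence == 1.0 falls in the top bin.
        idx = min(int(confidence * n_bins), n_bins - 1)
        bins[idx].append((confidence, was_correct))
    table = {}
    for idx, items in sorted(bins.items()):
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        table[idx] = (mean_conf, accuracy, len(items))
    return table


# Made-up outcomes: (extraction confidence, did the validator agree?)
sample = [(0.95, True), (0.9, True), (0.85, False), (0.3, False), (0.35, True)]
table = calibration_table(sample)
for idx, (mean_conf, accuracy, count) in table.items():
    print(idx, round(mean_conf, 3), round(accuracy, 3), count)
```

A well-calibrated extractor shows mean confidence close to observed accuracy in every bin; large gaps flag bins where the confidence score should not be trusted.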

Bayesian GMM Clustering

Data Preparation

  • 241 genes with 5+ supporting papers
  • 56 phenotype traits used for clustering
  • ~14% matrix density (binary gene x phenotype)

Publication Bias Correction

Square-root weighting moderates the influence of extensively-studied genes:

weight[g] = sqrt(paper_count[g]) / max(sqrt(all_paper_counts))
corrected_matrix = binary_matrix * weights

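Example: if the most-studied gene has 100 papers, a gene with 10 papers receives a weight of about 0.32 (32%) and the 100-paper gene receives a weight of 1.0 (100%).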
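
The weighting can be reproduced with a few lines of standard-library Python. The gene names and phenotype rows below are hypothetical, and the sketch assumes the maximum is taken over the genes being weighted.

```python
import math


def publication_weights(paper_counts):
    """Map gene -> sqrt(paper_count), scaled by the largest sqrt in the set."""
    roots = {gene: math.sqrt(n) for gene, n in paper_counts.items()}
    max_root = max(roots.values())
    return {gene: root / max_root for gene, root in roots.items()}


def apply_weights(binary_rows, weights):
    """Scale each gene's binary phenotype row by that gene's weight."""
    return {
        gene: [cell * weights[gene] for cell in row]
        for gene, row in binary_rows.items()
    }


paper_counts = {"GENE_A": 10, "GENE_B": 100}          # hypothetical genes
binary_rows = {"GENE_A": [1, 0, 1], "GENE_B": [0, 1, 1]}

weights = publication_weights(paper_counts)
corrected = apply_weights(binary_rows, weights)

print({gene: round(w, 2) for gene, w in weights.items()})
print(corrected)
```

Note that the weight is applied per gene (per row), so every phenotype of a sparsely studied gene is scaled by the same factor.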

Bayesian Gaussian Mixture Model

BayesianGaussianMixture(
    n_components=12,
    covariance_type='full',
    weight_concentration_prior_type='dirichlet_process',
    weight_concentration_prior=0.1,  # Encourages parsimonious clustering
    max_iter=1000,
    n_init=5,
    random_state=42
)

Key algorithmic choices:

  • Dirichlet Process Prior (α = 0.1): Automatically penalizes unnecessary clusters, encouraging parsimony
  • Full covariance: Allows non-spherical, correlated clusters
  • 5 random initializations: Guards against local optima

PCA Dimensionality Reduction

25 principal components retained, explaining ~70% of variance. This reduces noise while preserving meaningful phenotypic variation.
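
The following sketch chains the two steps on synthetic data using scikit-learn. It assumes PCA is applied before the mixture model is fit, which this page does not state explicitly, and it uses a random ~14%-dense matrix in place of the real gene-by-phenotype data. `max_iter` and `n_init` are lowered from the values listed above only so the example runs quickly.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(42)

# Synthetic stand-in for the weighted 241 x 56 gene-by-phenotype matrix.
X = (rng.random((241, 56)) < 0.14).astype(float)

# Reduce to 25 principal components.
pca = PCA(n_components=25, random_state=42)
X_reduced = pca.fit_transform(X)

# Same model family and prior as above; fewer iterations and restarts.
model = BayesianGaussianMixture(
    n_components=12,
    covariance_type="full",
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.1,
    max_iter=200,
    n_init=1,
    random_state=42,
)
labels = model.fit_predict(X_reduced)
posteriors = model.predict_proba(X_reduced)   # one row of probabilities per gene

print(X_reduced.shape, round(float(pca.explained_variance_ratio_.sum()), 3))
print(np.bincount(labels, minlength=12))
```

Because the input here is random noise, the explained variance and cluster sizes will not resemble the reported results; the point is only the order of operations and the shape of the outputs.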


Clustering Results

  • Clusters Identified: 12
  • Assignment Confidence: 99.99%
  • Assignment Entropy: 7.5e-8
Cluster  Label                         Genes  Weight  Type
2        Pure ID                       60     25.0%   Major
9        Minimal Phenotype             43     16.8%   Major
3        Full Syndrome                 25     10.6%   Major
10       Seizure + Language + ADHD     24     9.2%    Major
0        GDD Predominant               22     9.5%    Major
4        Language + Dysmorphic         22     9.3%    Major
1        Behavioral/Anxiety Core       9      4.1%    Minor
5        Hypotonia + Sleep + Seizures  8      3.6%    Minor
6        Dyspraxia + OCD + ADHD        7      3.2%    Minor
7        Head Size + Motor             7      3.1%    Minor
8        Infantile Epilepsy            7      3.1%    Minor
11       Vision + Feeding + Heart      7      2.6%    Minor
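
This page does not define the assignment confidence and assignment entropy statistics. A common reading, assumed in the sketch below, is the mean of each gene's largest posterior membership probability and the mean Shannon entropy (in nats) of each gene's posterior distribution. The posterior values are made up.

```python
import math


def assignment_confidence(posteriors):
    """Mean of each gene's largest cluster-membership probability."""
    return sum(max(row) for row in posteriors) / len(posteriors)


def assignment_entropy(posteriors):
    """Mean Shannon entropy (nats) of the per-gene membership probabilities."""
    total = 0.0
    for row in posteriors:
        # Terms with p == 0 contribute nothing and are skipped to avoid log(0).
        total -= sum(p * math.log(p) for p in row if p > 0.0)
    return total / len(posteriors)


# Made-up posteriors for three genes over three clusters.
posteriors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]
confidence = assignment_confidence(posteriors)
entropy = assignment_entropy(posteriors)
print(round(confidence, 4), round(entropy, 4))
```

Under this reading, a confidence near 100% with entropy near zero means almost every gene is assigned to a single cluster with essentially no ambiguity.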

Validation Methods

Machine Learning Classification

Classification models were trained to predict cluster membership, validating that clusters capture learnable phenotypic patterns:

  • Decision Tree (depth=6): 68.9%
  • Random Forest (100 trees): 85.9%

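Both models were evaluated with 5-fold stratified cross-validation. Random Forest performs well above uniform chance (1/12 ≈ 8.3% for 12 classes) and above the majority-class baseline (60/241 ≈ 24.9%, obtained by always predicting the largest cluster).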
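
Because the clusters are unequal in size, it helps to compare the classifier scores against two reference points. The short calculation below uses the gene counts from the Clustering Results table; the majority-class baseline is an addition for context and is not reported on this page.

```python
# Gene counts per cluster, taken from the Clustering Results table.
cluster_sizes = [60, 43, 25, 24, 22, 22, 9, 8, 7, 7, 7, 7]

n_genes = sum(cluster_sizes)
uniform_chance = 1 / len(cluster_sizes)            # guess every cluster equally often
majority_baseline = max(cluster_sizes) / n_genes   # always guess the largest cluster

print(f"genes: {n_genes}")
print(f"uniform chance: {uniform_chance:.1%}")
print(f"majority-class baseline: {majority_baseline:.1%}")
```

The majority-class figure is the stricter of the two references, since a model that ignores the phenotypes entirely can already reach it.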


Latent Class Analysis (LCA)

An independent GMM with 12 latent classes was fit to validate the cluster structure:

  • Adjusted Agreement: 71.4%
  • Mean Classification Confidence: 97.6%
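
This page does not say how the adjusted agreement figure is computed. One widely used chance-corrected measure for comparing two clusterings is the adjusted Rand index (ARI); the sketch below assumes that reading and should not be taken as the platform's actual metric. The label vectors are made up.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same items."""
    n = len(labels_a)
    pair_counts = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:        # degenerate case, e.g. both partitions trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


gmm_labels = [0, 0, 1, 1, 2, 2]
lca_labels = [1, 1, 0, 0, 2, 2]      # same grouping, different label names

print(adjusted_rand_index(gmm_labels, lca_labels))
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The index depends only on which items are grouped together, so renaming the clusters does not change it; values near 0 indicate agreement no better than chance.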

Permutation Testing

1,000 permutations were run to assess statistical significance of trait-cluster associations:

  • 87.5% of traits show significant cluster associations (FDR < 0.05)
  • Top discriminating traits: GDD (effect size 47.04), ID (41.64), Language delay (28.03)
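
A permutation test of this kind can be sketched as below. The test statistic (largest minus smallest trait prevalence across clusters) and the Benjamini-Hochberg procedure for FDR control are assumptions made for illustration; this page does not specify which statistic or correction was used. The data are synthetic.

```python
import random


def prevalence_range(trait, clusters):
    """Largest minus smallest trait prevalence across clusters."""
    totals, hits = {}, {}
    for value, cluster in zip(trait, clusters):
        totals[cluster] = totals.get(cluster, 0) + 1
        hits[cluster] = hits.get(cluster, 0) + value
    rates = [hits[c] / totals[c] for c in totals]
    return max(rates) - min(rates)


def permutation_p_value(trait, clusters, n_permutations=1000, seed=0):
    """Share of label shuffles whose statistic is at least the observed one."""
    rng = random.Random(seed)
    observed = prevalence_range(trait, clusters)
    shuffled = list(clusters)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if prevalence_range(trait, shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (n_permutations + 1)   # add-one avoids p = 0


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a reject flag per p-value under Benjamini-Hochberg FDR control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= cutoff_rank
    return reject


clusters = [0] * 20 + [1] * 20
linked_trait = [1] * 20 + [0] * 20     # present only in cluster 0
unlinked_trait = [1, 0] * 20           # evenly spread over both clusters

p_values = [
    permutation_p_value(linked_trait, clusters),
    permutation_p_value(unlinked_trait, clusters),
]
print(p_values, benjamini_hochberg(p_values))
```

Shuffling the cluster labels breaks any real link between trait and cluster while keeping cluster sizes and trait frequency fixed, which is what makes the shuffled statistics a valid null distribution.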

Evidence Scoring

Gene-phenotype associations are scored based on publication evidence strength:

score = 0.5 (baseline)
      + 0.2 * log(n)/log(1000)   # Sample size contribution
      + study_type_bonus         # -0.1 to +0.2 based on design
      + 0.1 * has_full_text      # PMC availability
      - 0.1 * no_phenotypes      # Penalty for sparse data
      + 0.02 * n_quantified      # Bonus for quantitative data (max 0.1)

Scores range 0-1, reflecting publication volume and quality rather than clinical significance.
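
The scoring formula can be written as a small function. Two details are assumptions not stated on this page: the sample size `n` is floored at 1 so the logarithm is defined, and the final value is clamped to the 0-1 range (the unclamped terms can sum to slightly more than 1). The function name and argument names are illustrative.

```python
import math


def evidence_score(n, study_type_bonus, has_full_text, no_phenotypes, n_quantified):
    """Evidence score following the formula above.

    Assumptions: n is floored at 1, and the result is clamped to [0, 1].
    """
    score = 0.5                                            # baseline
    score += 0.2 * math.log(max(n, 1)) / math.log(1000)    # sample size contribution
    score += study_type_bonus                              # -0.1 to +0.2 by design
    score += 0.1 if has_full_text else 0.0                 # PMC availability
    score -= 0.1 if no_phenotypes else 0.0                 # penalty for sparse data
    score += min(0.02 * n_quantified, 0.1)                 # quantitative bonus, capped
    return max(0.0, min(1.0, score))


example = evidence_score(
    n=1000, study_type_bonus=0.0, has_full_text=True,
    no_phenotypes=False, n_quantified=2,
)
print(round(example, 2))
```

In this example the terms are 0.5 + 0.2 + 0.0 + 0.1 + 0.04, so the score is 0.84. Because the sample size enters through a logarithm, going from 10 to 100 participants adds as much as going from 100 to 1,000.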

Quality Control

Multiple quality control measures were implemented:

  • Multi-Model Validation - Haiku extractions validated by Sonnet (max 50 papers/run)
  • Confidence Calibration - Confidence scores calibrated against validation accuracy
  • Cohort Deduplication - Demographic fingerprinting to detect overlapping study populations
  • Robust Subset - Clustering restricted to genes with 5+ papers
  • Cross-Validation - All ML models evaluated with 5-fold stratified CV
  • Animal Study Filtering - 500+ keyword blocklist to exclude non-human studies

Critical Limitations

What This Data CANNOT Tell You

  • Penetrance - Cannot estimate the probability a patient with a given variant will develop a phenotype
  • Severity - No information on phenotype severity, age of onset, or progression
  • Absence of association - A missing gene-trait link means "not reported," NOT "not associated"
  • Clinical prediction - Cannot answer "what should I watch for in my patient?"
  • Diagnostic criteria - Cluster rules are statistical patterns, not diagnostic criteria

Data Quality Limitations

  • Severe publication bias - ~62% Present vs ~2% Absent reporting rate. Researchers report what they find, not what they don't find.
  • Ascertainment bias - Data reflects what researchers chose to study and publish
  • AI extraction errors - Phenotypes extracted by AI from papers; accuracy depends on paper clarity
  • Variable coverage - Gene paper counts range from 5 to 50+; many genes are under-characterized
  • Matrix sparsity - Only 14% of gene-phenotype cells have reported data

Analytical Limitations

  • Cluster prediction models have NO clinical validation
  • Minor clusters (7-9 genes each) have limited statistical power
  • "Evidence scores" reflect publication volume, not clinical significance
  • Cluster labels are descriptive summaries, not clinical definitions