Methods
About This Page
Complete documentation of the analysis methodology including data collection from SFARI/PubMed, AI-powered phenotype extraction, Bayesian Gaussian Mixture Model clustering, and multi-method validation. This platform presents a systematic analysis of genotype-phenotype relationships in ASD based on literature extraction from peer-reviewed publications.
Data Pipeline Overview
Complete pipeline from SFARI gene database to validated phenotype clusters
Data Collection Pipeline
1. Gene Source: SFARI Gene Database
1,267 genes from the January 14, 2026 SFARI release were imported. Each gene includes: symbol, name, Ensembl ID, chromosome location, SFARI evidence score (1-3, where 1 = high confidence), and syndromic status flag.
2. Literature Harvesting via PubMed
For each gene, PubMed was queried via NCBI Entrez API using the structured query:
Animal study filtering: A title-level blocklist plus 500+ keywords was used to exclude non-human studies. PMC full text was retrieved when available, with 3-second rate limiting between requests. Average yield: ~50 papers per gene.
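A title-level blocklist of this kind can be sketched as follows. This is a minimal illustration, not the pipeline's actual filter: the handful of terms shown and the names `NON_HUMAN_TERMS` and `is_probable_animal_study` are hypothetical stand-ins for the 500+ keyword list.

```python
import re

# Hypothetical sample of blocklist terms; the real list has 500+ keywords.
NON_HUMAN_TERMS = ("mouse", "mice", "murine", "rat", "rats", "zebrafish", "drosophila")

# Word boundaries keep "rat" from matching inside words such as "rate".
_BLOCKLIST = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in NON_HUMAN_TERMS) + r")\b",
    re.IGNORECASE,
)


def is_probable_animal_study(title: str) -> bool:
    """Return True when the title contains any blocklisted non-human term."""
    return _BLOCKLIST.search(title) is not None


titles = [
    "Shank3 mutant mice display autistic-like behaviours",
    "De novo SHANK3 variants in patients with ASD",
]
human_titles = [t for t in titles if not is_probable_animal_study(t)]
```

Matching on whole words rather than substrings is one way to limit false exclusions; a title-only filter will still miss animal studies that name the model organism only in the abstract.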
3. Analysis Subset Selection
Genes with 5+ supporting papers were selected for clustering analysis, yielding 241 genes with sufficient phenotypic data for robust clustering.
AI-Powered Phenotype Extraction
AI Models
- Primary Extraction: Claude Haiku 4.5 (claude-haiku-4-5-20251001) - high-throughput extraction
- Quality Validation: Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) - accuracy verification
Extraction Schema (ASD_DX_SX.json v6.0)
100+ traits across 8 clinical categories aligned with DSM-5 criteria:
- Social Communication - Social reciprocity, nonverbal communication, relationships
- Restricted/Repetitive Behaviors - Stereotypies, insistence on sameness, sensory
- Cognitive/Executive - Executive function, attention, cognitive style
- Neuro-Motor - Motor coordination, seizures, somatic features
- Emotional Regulation - Anxiety, externalizing behaviors, shutdowns
- Physical/Growth - Craniofacial features, cardiac, skeletal, growth
- Developmental - Global delay, intellectual disability, language
- Sensory - Hearing, vision impairments
Tri-State Classification
Each phenotype is recorded with:
- Status: Present / Absent / NR (Not Reported)
- Confidence Score: 0.0-1.0 indicating extraction reliability
- Evidence Snippets: Source text for validation
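One way to represent such a record is sketched below. The class and field names (`Status`, `PhenotypeCall`, `to_binary`) are illustrative and not taken from ASD_DX_SX.json. Mapping both Absent and NR to 0 in the binary matrix is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PRESENT = "Present"
    ABSENT = "Absent"
    NOT_REPORTED = "NR"


@dataclass(frozen=True)
class PhenotypeCall:
    """One extracted phenotype for one paper (hypothetical record layout)."""
    trait: str
    status: Status
    confidence: float                 # extraction reliability, 0.0-1.0
    evidence: tuple[str, ...] = ()    # source text snippets for validation

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0.0, 1.0]")


def to_binary(call: PhenotypeCall) -> int:
    """Assumed encoding: only Present counts as 1; Absent and NR both map to 0."""
    return 1 if call.status is Status.PRESENT else 0


call = PhenotypeCall("seizures", Status.PRESENT, 0.9, ("recurrent seizures from age 2",))
```

Keeping NR distinct from Absent at the record level preserves the "not reported" versus "not observed" distinction even if a downstream matrix collapses them.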
Quality Control
- Sonnet validation of Haiku extractions (max 50 papers/run)
- Confidence calibration tracking against validation accuracy
- Cohort overlap detection via demographic fingerprinting
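Confidence calibration tracking can be illustrated by binning extraction confidence and comparing each bin's mean confidence with the share of extractions the validator agreed with. The sketch below assumes validation results arrive as `(confidence, was_correct)` pairs. The function name and the choice of five equal-width bins are hypothetical.

```python
def calibration_table(records, n_bins=5):
    """Group (confidence, was_correct) pairs into equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for confidence, was_correct in records:
        index = min(int(confidence * n_bins), n_bins - 1)  # 1.0 falls in the top bin
        bins[index].append((confidence, was_correct))

    table = []
    for index, members in enumerate(bins):
        if not members:
            continue  # skip empty bins rather than report undefined accuracy
        table.append({
            "bin": (index / n_bins, (index + 1) / n_bins),
            "n": len(members),
            "mean_confidence": sum(c for c, _ in members) / len(members),
            "accuracy": sum(1 for _, ok in members if ok) / len(members),
        })
    return table


validated = [(0.95, True), (0.90, True), (0.92, False), (0.10, False)]
table = calibration_table(validated)
```

In a well-calibrated extractor, `mean_confidence` and `accuracy` are close within each bin; a large gap in the top bin would indicate overconfidence.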
Bayesian GMM Clustering
Data Preparation
- 241 genes with 5+ supporting papers
- 56 phenotype traits used for clustering
- ~14% matrix density (binary gene x phenotype)
Publication Bias Correction
Square-root weighting moderates the influence of extensively studied genes:
corrected_matrix = binary_matrix * weights
Example: a gene with 10 papers receives a weight of about 32%, and a gene with 100 papers receives the full 100%.
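The two example weights are consistent with sqrt(n / 100), since sqrt(10 / 100) ≈ 0.316. The sketch below assumes that form, with the weight capped at 1.0 above 100 papers. The cap, the `saturation` parameter, and the function names are assumptions, because the exact weight formula is not shown here.

```python
import math


def publication_weight(n_papers: int, saturation: int = 100) -> float:
    """Square-root weight, assumed to saturate at 1.0 once n_papers >= saturation."""
    return min(1.0, math.sqrt(n_papers / saturation))


def apply_weights(binary_matrix, paper_counts):
    """Scale each gene's row of the binary gene x phenotype matrix by its weight."""
    weights = [publication_weight(n) for n in paper_counts]
    return [[cell * w for cell in row] for row, w in zip(binary_matrix, weights)]


binary_matrix = [[1, 0, 1], [0, 1, 1]]   # two genes, three traits (toy data)
paper_counts = [10, 100]
corrected_matrix = apply_weights(binary_matrix, paper_counts)
```

Under this assumed form a tenfold difference in paper count changes the weight by only about a factor of three, which is the compression that square-root scaling provides.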
Bayesian Gaussian Mixture Model
Key algorithmic choices:
- Dirichlet Process Prior (α = 0.1): Automatically penalizes unnecessary clusters, encouraging parsimony
- Full covariance: Allows non-spherical, correlated clusters
- 5 random initializations: Guards against local optima
PCA Dimensionality Reduction
25 principal components retained, explaining ~70% of variance. This reduces noise while preserving meaningful phenotypic variation.
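The settings listed above map directly onto scikit-learn's `PCA` and `BayesianGaussianMixture`. The sketch below is one plausible arrangement, not the project's code. It assumes PCA is applied to the weighted matrix before the mixture model is fit, and that 12 is the upper bound on components. The function name, the seed, and the random stand-in data are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture


def cluster_genes(weighted_matrix, n_pcs=25, max_clusters=12, seed=0):
    """Reduce a genes x traits matrix with PCA, then fit a Bayesian GMM."""
    pca = PCA(n_components=n_pcs, random_state=seed)
    reduced = pca.fit_transform(weighted_matrix)

    gmm = BayesianGaussianMixture(
        n_components=max_clusters,                 # upper bound (assumed)
        covariance_type="full",                    # non-spherical, correlated clusters
        weight_concentration_prior_type="dirichlet_process",
        weight_concentration_prior=0.1,            # alpha = 0.1
        n_init=5,                                  # 5 random initializations
        random_state=seed,
    )
    labels = gmm.fit_predict(reduced)
    explained = float(pca.explained_variance_ratio_.sum())
    return labels, gmm.weights_, explained


# Random stand-in with the documented shape and density (241 genes, 56 traits, ~14%).
rng = np.random.default_rng(0)
toy_matrix = (rng.random((241, 56)) < 0.14).astype(float)
labels, weights, explained = cluster_genes(toy_matrix)
```

With a small concentration prior, components that the data do not support tend to receive weights near zero, so the number of occupied clusters can be lower than `max_clusters`.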
Clustering Results
| Cluster | Label | Genes | Weight | Type |
|---|---|---|---|---|
| 2 | Pure ID | 60 | 25.0% | Major |
| 9 | Minimal Phenotype | 43 | 16.8% | Major |
| 3 | Full Syndrome | 25 | 10.6% | Major |
| 10 | Seizure + Language + ADHD | 24 | 9.2% | Major |
| 0 | GDD Predominant | 22 | 9.5% | Major |
| 4 | Language + Dysmorphic | 22 | 9.3% | Major |
| 1 | Behavioral/Anxiety Core | 9 | 4.1% | Minor |
| 5 | Hypotonia + Sleep + Seizures | 8 | 3.6% | Minor |
| 6 | Dyspraxia + OCD + ADHD | 7 | 3.2% | Minor |
| 7 | Head Size + Motor | 7 | 3.1% | Minor |
| 8 | Infantile Epilepsy | 7 | 3.1% | Minor |
| 11 | Vision + Feeding + Heart | 7 | 2.6% | Minor |
Validation Methods
Machine Learning Classification
Classification models were trained to predict cluster membership, validating that clusters capture learnable phenotypic patterns:
All models were evaluated with 5-fold stratified cross-validation. Random Forest performed well above the 8.3% accuracy expected from uniform random guessing across 12 classes.
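A validation of this kind can be sketched with scikit-learn's `StratifiedKFold` and `cross_val_score`. The hyperparameters, the function name, and the synthetic data below are assumptions for illustration; the actual model settings are not given in this section.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def cv_accuracy(features, cluster_labels, seed=0):
    """Mean and standard deviation of accuracy over 5 stratified folds."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    model = RandomForestClassifier(n_estimators=200, random_state=seed)
    scores = cross_val_score(model, features, cluster_labels, cv=cv, scoring="accuracy")
    return float(scores.mean()), float(scores.std())


# Synthetic, well-separated classes purely to exercise the function.
rng = np.random.default_rng(0)
toy_labels = np.repeat([0, 1, 2], 30)
toy_features = rng.normal(size=(90, 4)) + toy_labels[:, None] * 10.0
mean_acc, std_acc = cv_accuracy(toy_features, toy_labels)
```

Because cluster sizes are unequal, always predicting the largest cluster (60 of 241 genes) would score about 25%, so that majority-class rate is a stricter baseline than the 8.3% uniform-chance figure. Stratification with 5 folds also requires at least 5 genes per cluster, which the smallest clusters (7 genes) only just satisfy.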
Latent Class Analysis (LCA)
An independent GMM with 12 latent classes was fit to validate the cluster structure:
Permutation Testing
1,000 permutations were run to assess statistical significance of trait-cluster associations:
- 87.5% of traits show significant cluster associations (FDR < 0.05)
- Top discriminating traits: GDD (effect size 47.04), ID (41.64), Language delay (28.03)
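The permutation-plus-FDR procedure can be sketched as below. The test statistic used in the analysis is not specified here, so the between-cluster sum of squares of trait prevalence is a stand-in, and all function names are hypothetical. The Benjamini-Hochberg step is the standard procedure for controlling FDR across traits.

```python
import numpy as np


def between_cluster_ss(trait, labels):
    """Stand-in statistic: size-weighted squared deviation of per-cluster prevalence."""
    overall = trait.mean()
    return sum(
        (labels == k).sum() * (trait[labels == k].mean() - overall) ** 2
        for k in np.unique(labels)
    )


def permutation_p_value(trait, labels, n_perm=1000, seed=0):
    """Share of label shuffles whose statistic reaches the observed value."""
    rng = np.random.default_rng(seed)
    observed = between_cluster_ss(trait, labels)
    hits = sum(
        between_cluster_ss(trait, rng.permutation(labels)) >= observed
        for _ in range(n_perm)
    )
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0


def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity
    q = np.empty(m)
    q[order] = np.clip(ranked, 0.0, 1.0)
    return q


toy_labels = np.repeat([0, 1, 2], 20)
toy_trait = (toy_labels == 0).astype(float)   # trait present only in cluster 0
p_value = permutation_p_value(toy_trait, toy_labels)
q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

With 1,000 permutations the smallest attainable p-value under the add-one correction is 1/1001, which bounds how strongly any single trait can be called significant before FDR adjustment.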
Evidence Scoring
Gene-phenotype associations are scored based on publication evidence strength:
Scores range 0-1, reflecting publication volume and quality rather than clinical significance.
Quality Control
Multiple quality control measures were implemented:
- Multi-Model Validation - Haiku extractions validated by Sonnet (max 50 papers/run)
- Confidence Calibration - Confidence scores calibrated against validation accuracy
- Cohort Deduplication - Demographic fingerprinting to detect overlapping study populations
- Robust Subset - Clustering restricted to genes with 5+ papers
- Cross-Validation - All ML models evaluated with 5-fold stratified CV
- Animal Study Filtering - 500+ keyword blocklist to exclude non-human studies
Critical Limitations
What This Data CANNOT Tell You
- Penetrance - Cannot estimate the probability a patient with a given variant will develop a phenotype
- Severity - No information on phenotype severity, age of onset, or progression
- Absence of association - A missing gene-trait link means "not reported," NOT "not associated"
- Clinical prediction - Cannot answer "what should I watch for in my patient?"
- Diagnostic criteria - Cluster rules are statistical patterns, not diagnostic criteria
Data Quality Limitations
- Severe publication bias - ~62% Present vs ~2% Absent reporting rate. Researchers report what they find, not what they don't find.
- Ascertainment bias - Data reflects what researchers chose to study and publish
- AI extraction errors - Phenotypes extracted by AI from papers; accuracy depends on paper clarity
- Variable coverage - Gene paper counts range from 5 to more than 50; many genes are under-characterized
- Matrix sparsity - Only 14% of gene-phenotype cells have reported data
Analytical Limitations
- Cluster prediction models have NO clinical validation
- Minor clusters (7-9 genes each) have limited statistical power
- "Evidence scores" reflect publication volume, not clinical significance
- Cluster labels are descriptive summaries, not clinical definitions