Methods

About This Page

Complete documentation of the analysis methodology, including data collection from SFARI and PubMed, AI-powered phenotype extraction, Bayesian Gaussian Mixture Model clustering, and multi-method validation. This platform presents a systematic analysis of genotype-phenotype relationships in ASD, based on phenotypes extracted from peer-reviewed publications.

Genes Analyzed

241

With 5+ supporting papers

Papers Processed

~1,050

PubMed indexed

Phenotype Extractions

~6,100

56 standardized traits

Gene Clusters

6 major + 6 minor subtypes

Data Pipeline Overview

Complete pipeline from SFARI gene database to validated phenotype clusters

Data Collection Pipeline

1. Gene Source: SFARI Gene Database

1,267 genes from the January 14, 2026 SFARI release were imported. Each gene includes: symbol, name, Ensembl ID, chromosome location, SFARI evidence score (1-3, where 1 = high confidence), and syndromic status flag.

2. Literature Harvesting via PubMed

For each gene, PubMed was queried via NCBI Entrez API using the structured query:

[GENE_NAME] AND (autism OR asd) AND Humans[MeSH]

Animal study filtering: A title-level blocklist plus 500+ keywords were used to exclude non-human studies. PMC full text was retrieved when available, with a 3-second delay between requests for rate limiting. Average yield: ~50 papers per gene.
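The query template and title-level filter described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names are hypothetical, the six-word keyword tuple is a stand-in for the 500+ keyword list, and the URL points at the public NCBI E-utilities esearch endpoint. No request is sent.

```python
import re
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Hypothetical, heavily shortened stand-in for the 500+ keyword blocklist.
ANIMAL_KEYWORDS = ("mouse", "mice", "rat", "zebrafish", "drosophila", "knockout")


def build_query(gene_symbol: str) -> str:
    """Fill the structured PubMed query for one gene symbol."""
    return f"{gene_symbol} AND (autism OR asd) AND Humans[MeSH]"


def build_esearch_url(gene_symbol: str, retmax: int = 100) -> str:
    """Build an esearch URL for the query (nothing is fetched here)."""
    params = {
        "db": "pubmed",
        "term": build_query(gene_symbol),
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def looks_like_animal_study(title: str) -> bool:
    """Whole-word match of the title against the keyword blocklist."""
    words = set(re.findall(r"[a-z]+", title.lower()))
    return any(keyword in words for keyword in ANIMAL_KEYWORDS)
```

Whole-word matching is a design choice of this sketch: it keeps a keyword such as "rat" from matching "rate". A caller that actually sends requests would also pause between them (the page states 3 seconds).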

3. Analysis Subset Selection

Genes with 5+ supporting papers were selected for clustering analysis, yielding 241 genes with sufficient phenotypic data for robust clustering.

AI-Powered Phenotype Extraction

AI Models

Primary Extraction: Claude Haiku 4.5 (claude-haiku-4-5-20251001) - high-throughput extraction
Quality Validation: Claude Sonnet 4.5 (claude-sonnet-4-5-20250929) - accuracy verification

Extraction Schema (ASD_DX_SX.json v6.0)

100+ traits across 8 clinical categories aligned with DSM-5 criteria:

Social Communication - Social reciprocity, nonverbal communication, relationships
Restricted/Repetitive Behaviors - Stereotypies, insistence on sameness, sensory
Cognitive/Executive - Executive function, attention, cognitive style
Neuro-Motor - Motor coordination, seizures, somatic features
Emotional Regulation - Anxiety, externalizing behaviors, shutdowns
Physical/Growth - Craniofacial features, cardiac, skeletal, growth
Developmental - Global delay, intellectual disability, language
Sensory - Hearing, vision impairments

Tri-State Classification

Each phenotype is recorded with:

Status: Present / Absent / NR (Not Reported)
Confidence Score: 0.0-1.0 indicating extraction reliability
Evidence Snippets: Source text for validation
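One way to represent a tri-state record in code is sketched below. The class and field names are hypothetical, and the rule that only Present maps to 1 in the binary clustering matrix is an assumption of this sketch rather than something the page states.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    PRESENT = "Present"
    ABSENT = "Absent"
    NOT_REPORTED = "NR"


@dataclass
class PhenotypeExtraction:
    trait: str
    status: Status
    confidence: float = 0.0  # extraction reliability, 0.0-1.0
    evidence_snippets: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0.0, 1.0]")


def to_binary(record: PhenotypeExtraction) -> int:
    """Assumed collapse to a binary cell: Present -> 1, Absent or NR -> 0."""
    return 1 if record.status is Status.PRESENT else 0
```

Keeping Absent and NR as separate states in the record preserves the distinction between "reported absent" and "not reported", even if a downstream binary matrix does not use it.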

Quality Control

Sonnet validation of Haiku extractions (max 50 papers/run)
Confidence calibration tracking against validation accuracy
Cohort overlap detection via demographic fingerprinting
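Confidence calibration tracking can be illustrated by binning extraction confidence scores and comparing each bin's mean confidence with the fraction of extractions the validator agreed with. The sketch below is an assumption about how such tracking might look; the function name, the bin count, and the input format are hypothetical.

```python
from collections import defaultdict


def calibration_table(records, n_bins=5):
    """Summarize calibration per confidence bin.

    records: iterable of (confidence, validated_correct) pairs.
    Returns a list of (bin_index, count, mean_confidence, accuracy).
    """
    bins = defaultdict(list)
    for confidence, correct in records:
        # A confidence of exactly 1.0 goes into the top bin.
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((confidence, bool(correct)))

    table = []
    for index in sorted(bins):
        rows = bins[index]
        mean_confidence = sum(c for c, _ in rows) / len(rows)
        accuracy = sum(ok for _, ok in rows) / len(rows)
        table.append((index, len(rows), mean_confidence, accuracy))
    return table
```

A well-calibrated extractor would show mean confidence close to accuracy in every bin; a large gap in the top bins would indicate overconfidence.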

Bayesian GMM Clustering

Data Preparation

241 genes with 5+ supporting papers
56 phenotype traits used for clustering
~14% matrix density (binary gene x phenotype)

Publication Bias Correction

Square-root weighting moderates the influence of extensively-studied genes:

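weight[g] = sqrt(paper_count[g]) / max(sqrt(all_paper_counts))
corrected_matrix[g, :] = binary_matrix[g, :] * weight[g]    # one weight per gene row

Example: if the most-studied gene has 100 papers, a gene with 10 papers is weighted at ~32% (sqrt(10) / sqrt(100)) and the 100-paper gene at 100%.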
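The weighting can be reproduced with NumPy as a quick check of the example. This is a sketch under the assumption that the matrix has one row per gene and one column per trait; the function names and the toy inputs are illustrative only.

```python
import numpy as np


def publication_weights(paper_counts):
    """Square-root weights, scaled so the most-studied gene gets 1.0."""
    roots = np.sqrt(np.asarray(paper_counts, dtype=float))
    return roots / roots.max()


def apply_weights(binary_matrix, paper_counts):
    """Scale each gene row (genes x traits) by that gene's weight."""
    weights = publication_weights(paper_counts)
    return np.asarray(binary_matrix, dtype=float) * weights[:, np.newaxis]


# Toy input: three genes with 10, 25, and 100 papers, three traits.
paper_counts = [10, 25, 100]
binary_matrix = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]
weights = publication_weights(paper_counts)
corrected = apply_weights(binary_matrix, paper_counts)
```

The `weights[:, np.newaxis]` reshape makes the per-gene weights broadcast across trait columns; multiplying by a flat vector would instead scale columns.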

Bayesian Gaussian Mixture Model

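from sklearn.mixture import BayesianGaussianMixture

model = BayesianGaussianMixture(
    n_components=12,                 # Upper bound on the number of clusters
    covariance_type='full',          # One full covariance matrix per cluster
    weight_concentration_prior_type='dirichlet_process',
    weight_concentration_prior=0.1,  # Encourages parsimonious clustering
    max_iter=1000,
    n_init=5,                        # Best of 5 random initializations is kept
    random_state=42
)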

Key algorithmic choices:

Dirichlet Process Prior (α = 0.1): Automatically penalizes unnecessary clusters, encouraging parsimony
Full covariance: Allows non-spherical, correlated clusters
5 random initializations: Guards against local optima
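To show how these choices behave, the sketch below fits the same kind of model to synthetic data: three well-separated blobs, with more components allowed than are needed. The data, the reduced `n_components` and `n_init`, and the variable names are assumptions for illustration; only the scikit-learn calls mirror the configuration above.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(42)

# Synthetic stand-in for the reduced gene matrix: 3 blobs in 4 dimensions.
centers = np.array([[0, 0, 0, 0], [6, 6, 0, 0], [0, 6, 6, 0]], dtype=float)
X = np.vstack([c + rng.normal(scale=0.5, size=(60, 4)) for c in centers])

model = BayesianGaussianMixture(
    n_components=8,  # deliberately more than the 3 true blobs
    covariance_type="full",
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.1,
    max_iter=1000,
    n_init=2,
    random_state=42,
)
labels = model.fit_predict(X)         # hard cluster assignment per row
posteriors = model.predict_proba(X)   # soft assignment, one column per component
occupied = np.unique(labels).size     # components that received any points
```

With a small concentration prior, surplus components tend to end up with near-zero mixture weight, so `occupied` is usually smaller than `n_components`. Inspecting `model.weights_` shows which components the fit effectively switched off.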

PCA Dimensionality Reduction

25 principal components retained, explaining ~70% of variance. This reduces noise while preserving meaningful phenotypic variation.
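The relationship between the number of retained components and explained variance can be checked with a short computation. The sketch below uses a plain SVD on centered data; the function name and the random demo matrix are hypothetical, and the demo data will not reproduce the figures reported above.

```python
import numpy as np


def components_for_variance(matrix, target=0.70):
    """Return the smallest k whose leading components reach `target`
    cumulative explained variance, plus the full cumulative curve."""
    X = np.asarray(matrix, dtype=float)
    X = X - X.mean(axis=0)  # PCA operates on column-centered data
    singular_values = np.linalg.svd(X, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, cumulative.size), cumulative


# Random binary demo matrix with the same shape and density as described.
rng = np.random.default_rng(0)
demo = (rng.random((241, 56)) < 0.14).astype(float)
k, cumulative = components_for_variance(demo, target=0.70)
```

On real phenotype data, correlated traits concentrate variance in the leading components, so fewer components are needed than on uncorrelated random data like this demo.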

Clustering Results

12

Clusters Identified

99.99%

Assignment Confidence

7.5e-8

Assignment Entropy
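The page does not define these two summary statistics. A common reading, assumed here, is that confidence is the mean of each gene's largest posterior cluster probability and entropy is the mean Shannon entropy (in nats) of each gene's posterior distribution. A sketch under that assumption:

```python
import numpy as np


def assignment_summary(posteriors):
    """Mean max-posterior and mean Shannon entropy (nats) over rows."""
    P = np.asarray(posteriors, dtype=float)
    confidence = P.max(axis=1).mean()
    safe = np.clip(P, 1e-300, 1.0)  # avoids log(0); 0 * log(tiny) is still 0
    entropy = -(P * np.log(safe)).sum(axis=1).mean()
    return float(confidence), float(entropy)


# Toy posteriors: one certain assignment, one evenly split between 2 clusters.
confidence, entropy = assignment_summary([[1.0, 0.0], [0.5, 0.5]])
```

Values near 100% confidence and near-zero entropy indicate that almost every gene is assigned to a single cluster with essentially no posterior mass elsewhere.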

Cluster	Label	Genes	Weight	Type
2	Pure ID	60	25.0%	Major
9	Minimal Phenotype	43	16.8%	Major
3	Full Syndrome	25	10.6%	Major
10	Seizure + Language + ADHD	24	9.2%	Major
0	GDD Predominant	22	9.5%	Major
4	Language + Dysmorphic	22	9.3%	Major
1	Behavioral/Anxiety Core	9	4.1%	Minor
5	Hypotonia + Sleep + Seizures	8	3.6%	Minor
6	Dyspraxia + OCD + ADHD	7	3.2%	Minor
7	Head Size + Motor	7	3.1%	Minor
8	Infantile Epilepsy	7	3.1%	Minor
11	Vision + Feeding + Heart	7	2.6%	Minor

Validation Methods

Machine Learning Classification

Classification models were trained to predict cluster membership, validating that clusters capture learnable phenotypic patterns:

68.9%

Decision Tree (depth=6)

85.9%

Random Forest (100 trees)

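All accuracies are from 5-fold stratified cross-validation. Both models substantially outperform uniform chance (8.3% for 12 classes) and the 24.9% majority-class baseline (always predicting the largest cluster, 60 of 241 genes).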
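The evaluation setup can be sketched with scikit-learn as below. The synthetic data, class sizes, and seeds are assumptions for illustration; only the model settings (tree depth 6, 100 trees) and the 5-fold stratified scheme come from the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 4 clusters x 15 genes, 10 binary traits,
# where each cluster has its own trait frequencies.
n_classes, per_class, n_traits = 4, 15, 10
y = np.repeat(np.arange(n_classes), per_class)
trait_probs = rng.uniform(0.05, 0.95, size=(n_classes, n_traits))
X = (rng.random((y.size, n_traits)) < trait_probs[y]).astype(int)

# Stratification keeps every cluster represented in each fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = {
    "decision_tree": DecisionTreeClassifier(max_depth=6, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {
    name: float(cross_val_score(model, X, y, cv=cv).mean())
    for name, model in models.items()
}
uniform_chance = 1.0 / n_classes
```

Stratified splitting requires at least as many members per class as folds, which the smallest clusters here (7 genes) satisfy for 5 folds. Because the classifiers predict labels that were derived from the same features, high accuracy shows the clusters are learnable, not that they are externally valid.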

Latent Class Analysis (LCA)

An independent GMM with 12 latent classes was fit to validate the cluster structure:

71.4%

Adjusted Agreement

97.6%

Mean Classification Confidence

Permutation Testing

1,000 permutations were run to assess statistical significance of trait-cluster associations:

87.5% of traits show significant cluster associations (FDR < 0.05)
Top discriminating traits: global developmental delay (GDD; effect size 47.04), intellectual disability (ID; 41.64), language delay (28.03)
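A permutation test of this kind, followed by FDR control, can be sketched as follows. The page does not specify its test statistic or FDR procedure, so both are assumptions here: the statistic is a size-weighted spread of per-cluster trait prevalence, and the correction is Benjamini-Hochberg. Function names are hypothetical.

```python
import numpy as np


def prevalence_spread(trait, labels):
    """Sum over clusters of n_c * (cluster prevalence - overall prevalence)^2."""
    overall = trait.mean()
    return sum(
        (labels == c).sum() * (trait[labels == c].mean() - overall) ** 2
        for c in np.unique(labels)
    )


def permutation_p_value(trait, labels, n_permutations=1000, seed=0):
    """Share of label shuffles whose statistic reaches the observed one."""
    rng = np.random.default_rng(seed)
    observed = prevalence_spread(trait, labels)
    hits = sum(
        prevalence_spread(trait, rng.permutation(labels)) >= observed
        for _ in range(n_permutations)
    )
    return (hits + 1) / (n_permutations + 1)  # add-one avoids p = 0


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    order = np.argsort(p)
    scaled = p[order] * p.size / np.arange(1, p.size + 1)
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]  # enforce monotonicity
    result = np.empty_like(p)
    result[order] = np.minimum(adjusted, 1.0)
    return result
```

In use, one p-value would be computed per trait, the full vector passed through `benjamini_hochberg`, and traits with adjusted values below 0.05 counted as significantly associated with cluster membership.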

Evidence Scoring

Gene-phenotype associations are scored based on publication evidence strength:

score = 0.5                           # Baseline
      + 0.2 * log(n) / log(1000)      # Sample size contribution
      + study_type_bonus              # -0.1 to +0.2 based on design
      + 0.1 * has_full_text           # PMC availability
      - 0.1 * no_phenotypes           # Penalty for sparse data
      + 0.02 * n_quantified           # Bonus for quantitative data (max 0.1)

Scores range 0-1, reflecting publication volume and quality rather than clinical significance.
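A direct transcription of the scoring formula is sketched below. Two details are assumptions of this sketch: the final clipping to [0, 1], since the listed terms can sum to more than 1, and the treatment of sample sizes below 1 as 1. The function and parameter names are illustrative.

```python
import math


def evidence_score(n, study_type_bonus, has_full_text, has_phenotypes, n_quantified):
    """Publication-evidence score for one gene-phenotype association.

    n: study sample size; study_type_bonus: -0.1 to +0.2 based on design.
    """
    score = 0.5                                                 # baseline
    score += 0.2 * math.log(max(n, 1)) / math.log(1000)         # sample size
    score += study_type_bonus                                   # study design
    score += 0.1 if has_full_text else 0.0                      # PMC full text
    score -= 0.0 if has_phenotypes else 0.1                     # sparse-data penalty
    score += min(0.02 * n_quantified, 0.1)                      # quantitative data
    return max(0.0, min(1.0, score))                            # assumed clipping
```

For example, a 1,000-participant study with the maximum design bonus, full text, and five quantified measures sums to 1.1 before clipping, which is why some bound on the final value is needed for the stated 0-1 range to hold.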

Quality Control

Multiple quality control measures were implemented:

Multi-Model Validation - Haiku extractions validated by Sonnet (max 50 papers/run)
Confidence Calibration - Confidence scores calibrated against validation accuracy
Cohort Deduplication - Demographic fingerprinting to detect overlapping study populations
Robust Subset - Clustering restricted to genes with 5+ papers
Cross-Validation - All ML models evaluated with 5-fold stratified CV
Animal Study Filtering - 500+ keyword blocklist to exclude non-human studies

Critical Limitations

What This Data CANNOT Tell You

Penetrance - Cannot estimate the probability a patient with a given variant will develop a phenotype
Severity - No information on phenotype severity, age of onset, or progression
Absence of association - A missing gene-trait link means "not reported," NOT "not associated"
Clinical prediction - Cannot answer "what should I watch for in my patient?"
Diagnostic criteria - Cluster rules are statistical patterns, not diagnostic criteria

Data Quality Limitations

Severe publication bias - ~62% Present vs ~2% Absent reporting rate. Researchers report what they find, not what they don't find.
Ascertainment bias - Data reflects what researchers chose to study and publish
AI extraction errors - Phenotypes extracted by AI from papers; accuracy depends on paper clarity
Variable coverage - Gene paper counts range from 5 to 50+; many genes are under-characterized
Matrix sparsity - Only 14% of gene-phenotype cells have reported data

Analytical Limitations

Cluster prediction models have NO clinical validation
Minor clusters (7-9 genes each) have limited statistical power
"Evidence scores" reflect publication volume, not clinical significance
Cluster labels are descriptive summaries, not clinical definitions