ML Predictions
About This Analysis
Machine learning models are trained to predict gene membership across 12 ASD subtypes (6 major + 6 minor) from phenotype profiles. Using continuous trait scores (rather than binary presence/absence), Random Forest achieves 86% accuracy (vs. 8.3% random chance), indicating that the clusters have distinct, learnable phenotype signatures. A hierarchical classifier (Major/Minor first, then the specific cluster) achieves even higher stage-wise accuracy.
What it means: With 12 fine-grained clusters, random chance is 8.3%, so 72-86% accuracy indicates that the clusters have distinct, learnable phenotype signatures. Feature importance below reveals which traits best discriminate between subtypes.
What it means: Top-ranked traits are the key discriminators between clusters: they define what makes each cluster distinct.
What it means: Highly predictive traits tend to co-occur with many other phenotypes and may represent core ASD features.
What it means: Breaking the 12-class problem into stages improves accuracy for each decision point, leveraging the natural hierarchy of major (higher-prevalence) and minor (specialized) subtypes.
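The two-stage idea can be sketched as follows. This is a minimal illustration on synthetic data, not the pipeline behind the figures on this page: the data, the mapping of clusters 0-5 to "major" and 6-11 to "minor", and the helper name `predict_hierarchical` are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the gene-by-trait matrix (hypothetical data).
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=15, n_redundant=0,
    n_classes=12, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Assumption for illustration: clusters 0-5 are "major" (0), 6-11 are "minor" (1).
minor_tr = (y_tr >= 6).astype(int)
minor_te = (y_te >= 6).astype(int)

# Stage 1: major vs. minor.
stage1 = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, minor_tr)

# Stage 2: one classifier per group, trained only on that group's genes.
stage2 = {
    g: RandomForestClassifier(n_estimators=200, random_state=0).fit(
        X_tr[minor_tr == g], y_tr[minor_tr == g]
    )
    for g in (0, 1)
}

def predict_hierarchical(X_new):
    """Route each sample through stage 1, then the matching stage-2 model."""
    group = stage1.predict(X_new)
    pred = np.empty(len(X_new), dtype=y_tr.dtype)
    for g in (0, 1):
        mask = group == g
        if mask.any():
            pred[mask] = stage2[g].predict(X_new[mask])
    return pred

y_pred = predict_hierarchical(X_te)

# Stage-wise accuracy: each stage is scored on its own, easier sub-problem.
stage1_acc = stage1.score(X_te, minor_te)
stage2_acc = {
    g: stage2[g].score(X_te[minor_te == g], y_te[minor_te == g]) for g in (0, 1)
}
# End-to-end accuracy: a stage-1 mistake cannot be recovered in stage 2.
end_to_end_acc = (y_pred == y_te).mean()

print(f"stage 1: {stage1_acc:.3f}, stage 2: {stage2_acc}, end-to-end: {end_to_end_acc:.3f}")
```

Stage-wise accuracies are measured on smaller sub-problems (2 classes, then 6), so they are not directly comparable to the flat 12-class accuracy. The end-to-end figure is the one to compare against a flat classifier.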
What it means: Random Forest (purple) achieves the best overall accuracy. Gradient Boosting (pink) offers strong performance with different feature weighting. Decision Tree (blue) provides interpretable rules.
What it means: Consistent scores across folds indicate the model generalizes well and isn't overfitting to specific data subsets.
Methodology
Primary Task: Predict which phenotype-based cluster a gene belongs to using its trait profile. This validates that clusters represent distinct phenotype patterns.
Secondary Task: For each phenotype trait, predict its presence/absence based on other traits. This reveals trait co-occurrence patterns.
Models: Decision Tree for interpretability (clinical rules), Random Forest (200 trees) and Gradient Boosting for robust predictions. Uses continuous trait scores (0-1) for better signal capture.
Validation: 5-fold stratified cross-validation gives more reliable accuracy estimates than a single train/test split and helps detect overfitting.
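The model comparison and validation scheme described above might look like the following sketch. It uses synthetic data in place of the real trait matrix. Apart from the 200 trees and the 5 stratified folds stated in the methodology, every hyperparameter here (tree depth, number of boosting rounds, random seeds) is an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the gene-by-trait matrix (hypothetical data).
X, y = make_classification(
    n_samples=480, n_features=30, n_informative=15, n_redundant=0,
    n_classes=12, n_clusters_per_class=1, random_state=0,
)
# Rescale to [0, 1] only to mimic continuous trait scores; tree-based
# models are insensitive to this monotone transformation.
X = MinMaxScaler().fit_transform(X)

models = {
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=30, random_state=0),
}

# Stratification keeps the 12-class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

A small standard deviation across the five folds is what the "consistent scores across folds" caption refers to. A large spread would suggest the estimate depends heavily on which genes land in the test fold.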
What it means: Features ranked highly by both methods are robustly important. Disagreements may indicate that Gini importance is overestimating features with many distinct values (high cardinality).
What it means: Features ranking high on both metrics are robustly important. Disagreements may indicate features that are used frequently but add little predictive value (high Gini, low Permutation) or vice versa.
Notable Rank Disagreements
Features where Gini and Permutation rankings differ significantly
What it means: Large rank differences suggest the trait's role is method-dependent. Traits ranked higher by Permutation may be more truly predictive; those higher by Gini may be splitting artifacts.
| Trait | Gini Rank | Perm Rank | Difference | Interpretation |
|---|---|---|---|---|
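A Gini vs. Permutation rank comparison of this kind could be computed roughly as below. This is a sketch on synthetic data. The trait names are placeholders, and measuring permutation importance on a held-out split is an assumption, not necessarily how the rankings on this page were produced.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with placeholder trait names (hypothetical).
X, y = make_classification(
    n_samples=500, n_features=12, n_informative=6, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
trait_names = [f"trait_{i}" for i in range(X.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Gini (impurity-based) importance comes from the training-time splits.
gini = rf.feature_importances_
# Permutation importance: drop in held-out accuracy when a feature is shuffled.
perm = permutation_importance(
    rf, X_te, y_te, n_repeats=10, random_state=0
).importances_mean

# Rank 1 = most important under each method.
gini_rank = np.argsort(np.argsort(-gini)) + 1
perm_rank = np.argsort(np.argsort(-perm)) + 1
diff = gini_rank - perm_rank  # positive: Permutation ranks the trait higher

# Print the largest disagreements as rows of a Trait/Gini/Perm/Difference table.
for i in np.argsort(-np.abs(diff))[:5]:
    print(f"| {trait_names[i]} | {gini_rank[i]} | {perm_rank[i]} | {diff[i]:+d} |")
```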
What it means: Selecting only "Expressive Language Delay" without selecting "GDD" or "Hypotonia" will reduce the probability of Intellectual Disability, because absence of those co-occurring traits is informative.
Key insight: The prior P(ID) is 74%. With ELD alone (no GDD or Hypotonia), the posterior drops toward the baseline rather than rising to 83%.
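The reason an unselected trait lowers the estimate can be illustrated with a small naive Bayes calculation. Only the 74% prior comes from the text. The per-trait likelihoods below are hypothetical, with the ELD values chosen so that observing ELD by itself gives roughly the 83% quoted above. The conditional-independence assumption and the function name `posterior_id` are also illustrative, and the model behind this page may work differently.

```python
PRIOR_ID = 0.74  # prior P(ID) quoted in the text

# Hypothetical likelihoods: (P(trait | ID), P(trait | not ID)).
# These are NOT taken from the underlying data.
LIKELIHOODS = {
    "ELD": (0.60, 0.35),
    "GDD": (0.70, 0.30),
    "Hypotonia": (0.50, 0.25),
}

def posterior_id(observed):
    """Return P(ID | observed) under a naive Bayes assumption.

    `observed` maps trait name -> True (present) or False (explicitly absent).
    Traits left out of the dict are treated as unobserved and ignored.
    """
    odds = PRIOR_ID / (1 - PRIOR_ID)
    for trait, present in observed.items():
        p_id, p_not_id = LIKELIHOODS[trait]
        if present:
            odds *= p_id / p_not_id              # evidence for ID
        else:
            odds *= (1 - p_id) / (1 - p_not_id)  # absence is evidence too
    return odds / (1 + odds)

# ELD observed, other traits unknown.
p_eld_only = posterior_id({"ELD": True})
# ELD observed, GDD and Hypotonia explicitly absent.
p_eld_without_others = posterior_id({"ELD": True, "GDD": False, "Hypotonia": False})

print(f"P(ID | ELD) = {p_eld_only:.2f}")
print(f"P(ID | ELD, no GDD, no Hypotonia) = {p_eld_without_others:.2f}")
```

With these made-up likelihoods, treating GDD and Hypotonia as absent pulls the posterior well below the ELD-only value, because each absence multiplies the odds by a factor smaller than 1.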