ML Predictions

12 Subtypes (6+6)
Cluster validation via supervised learning

About This Analysis

Machine learning models are trained to predict gene cluster membership across 12 ASD subtypes (6 major + 6 minor) from phenotype profiles. Using continuous trait scores (rather than binary presence/absence), Random Forest achieves 86% accuracy (versus 8.3% random chance), confirming that the clusters have distinct, learnable phenotype signatures. A hierarchical classifier (Major/Minor first, then specific cluster) achieves even higher stage-wise accuracy.

Clusters to Predict: - (Gene subtypes)
Best Accuracy: - (Random Forest, CV)
Top Predictor: - (Most important trait)
Algorithms: 3 (Decision Tree, Random Forest, Gradient Boosting)
Cluster Classification Performance
Predicting gene cluster membership from phenotype patterns (5-fold CV)
What you're seeing: Classification accuracy for predicting gene cluster membership across 12 subtypes. Results are from 5-fold cross-validation to ensure robust estimates.
What it means: With 12 fine-grained clusters (vs random chance of 8.3%), achieving 72-86% accuracy confirms clusters have distinct, learnable phenotype signatures. Feature importance below reveals which traits best discriminate between subtypes.
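The cross-validation setup behind these numbers can be sketched as follows. This is a minimal illustration on synthetic data; the gene count, trait count, and noise level are assumptions, not the real dataset:

```python
# Sketch of the 12-class cluster classifier with 5-fold stratified CV.
# Features are continuous trait scores in [0, 1]; labels are subtype IDs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_genes, n_traits, n_clusters = 600, 20, 12

# Synthetic trait-score matrix with cluster-specific signal (illustrative).
labels = rng.integers(0, n_clusters, size=n_genes)
centers = rng.random((n_clusters, n_traits))
X = np.clip(centers[labels] + rng.normal(0, 0.15, (n_genes, n_traits)), 0, 1)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, labels, cv=cv)
print("per-fold accuracy:", np.round(scores, 3))
print(f"mean: {scores.mean():.3f}  (chance = {1 / n_clusters:.3f})")
```

Stratification keeps the 12 subtype proportions similar across folds, which matters when some clusters are small.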
Cluster Predictor Importance
Traits most predictive of cluster membership
What you're seeing: Feature importance scores showing which phenotypes are most useful for predicting cluster membership. Toggle between Decision Tree (interpretable) and Random Forest (more accurate).
What it means: Top-ranked traits are the key discriminators between clusters—they define what makes each cluster distinct.
Trait Co-Prediction Importance
Traits that best predict presence of other traits
What you're seeing: Aggregated importance scores showing which traits are most predictive of other traits across all trait-prediction models.
What it means: Highly predictive traits tend to co-occur with many other phenotypes and may represent core ASD features.
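A sketch of how such co-prediction importances could be aggregated. The trait count, the 0.5 binarization threshold, and the data are illustrative assumptions, not the dashboard's actual pipeline:

```python
# For each trait, train a model to predict its (binarized) presence from
# the remaining traits, then average each predictor trait's importance
# across all the models in which it appears as a feature.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.random((300, 10))   # continuous trait scores per gene (synthetic)
present = X > 0.5           # illustrative presence/absence threshold

agg = np.zeros(10)
for t in range(10):
    others = np.delete(np.arange(10), t)
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[:, others], present[:, t])
    agg[others] += clf.feature_importances_

agg /= 9  # each trait serves as a predictor in 9 of the 10 models
top = np.argsort(agg)[::-1]
print("most co-predictive traits (indices):", top[:3])
```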
Hierarchical Classification Strategy
Two-stage approach: Major/Minor first, then specific cluster
What you're seeing: A hierarchical classifier that first predicts Major vs Minor cluster type (6+6 split), then predicts the specific cluster within that group.
What it means: Breaking the 12-class problem into stages improves accuracy for each decision point, leveraging the natural hierarchy of major (higher-prevalence) and minor (specialized) subtypes.
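One plausible implementation of the two-stage scheme. The convention that cluster IDs 0-5 are Major and 6-11 are Minor, like the synthetic data, is an assumption for illustration:

```python
# Stage 1 predicts Major vs Minor; stage 2 routes each sample to a
# group-specific 6-way classifier trained only on that group's genes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_hierarchical(X, y):
    is_minor = y >= 6  # assumed ID convention: 0-5 Major, 6-11 Minor
    stage1 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, is_minor)
    stage2 = {
        False: RandomForestClassifier(n_estimators=100, random_state=0).fit(X[~is_minor], y[~is_minor]),
        True:  RandomForestClassifier(n_estimators=100, random_state=0).fit(X[is_minor], y[is_minor]),
    }
    return stage1, stage2

def predict_hierarchical(stage1, stage2, X):
    group = stage1.predict(X)
    out = np.empty(len(X), dtype=int)
    for g in (False, True):
        mask = group == g
        if mask.any():
            out[mask] = stage2[g].predict(X[mask])
    return out

# Synthetic demonstration data.
rng = np.random.default_rng(2)
y = rng.integers(0, 12, 500)
centers = rng.random((12, 15))
X = np.clip(centers[y] + rng.normal(0, 0.1, (500, 15)), 0, 1)

s1, s2 = fit_hierarchical(X, y)
pred = predict_hierarchical(s1, s2, X)
print("training accuracy:", (pred == y).mean())
```

Each stage faces an easier problem (2-way, then 6-way) than the flat 12-way classifier, which is why stage-wise accuracy can be higher.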
Model Comparison
Decision Tree vs Random Forest vs Gradient Boosting metrics
What you're seeing: A radar chart comparing all three models across multiple performance dimensions including accuracy, consistency, and feature utilization.
What it means: Random Forest (purple) achieves the best overall accuracy. Gradient Boosting (pink) offers strong performance with different feature weighting. Decision Tree (blue) provides interpretable rules.
Cross-Validation Score Distribution
Per-fold accuracy scores for cluster classifier
What you're seeing: Accuracy scores for each of the 5 cross-validation folds. Each fold uses different training/test splits.
What it means: Consistent scores across folds indicate the model generalizes well and isn't overfitting to specific data subsets.

Methodology

Primary Task: Predict which phenotype-based cluster a gene belongs to using its trait profile. This validates that clusters represent distinct phenotype patterns.

Secondary Task: For each phenotype trait, predict its presence/absence based on other traits. This reveals trait co-occurrence patterns.

Models: Decision Tree for interpretability (clinical rules); Random Forest (200 trees) and Gradient Boosting for robust predictions. All models use continuous trait scores (0-1) for better signal capture.

Validation: 5-fold stratified cross-validation ensures reliable accuracy estimates and prevents overfitting.

Gini vs Permutation Importance Comparison
Two methods for measuring feature importance in Random Forest
What you're seeing: Comparison of two importance methods: Gini impurity (built-in) vs Permutation importance (model-agnostic). Permutation importance shuffles each feature and measures accuracy drop.
What it means: Features ranked highly by both methods are robustly important. Disagreements may indicate that Gini is overestimating features with many distinct values (a known cardinality bias of impurity-based importance).
Gini-Perm Correlation: - (Rank correlation)
Major Disagreements: - (Rank diff > 10)
Top by Gini: - (Most splits)
Top by Permutation: - (Most predictive)
What you're seeing: Two importance metrics compared side-by-side. Gini importance measures how often a feature is used for splits; Permutation importance measures accuracy drop when the feature is shuffled.
What it means: Features ranking high on both metrics are robustly important. Disagreements may indicate features that are used frequently but add little predictive value (high Gini, low Permutation) or vice versa.
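Both measures can be obtained from a single fitted forest; a sketch with synthetic data, assuming scikit-learn's `permutation_importance` as the permutation method and a held-out split for measuring the accuracy drop:

```python
# Gini importance comes built in via feature_importances_; permutation
# importance shuffles each feature on held-out data and records the
# resulting accuracy drop.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
y = rng.integers(0, 12, 600)
centers = rng.random((12, 20))
X = np.clip(centers[y] + rng.normal(0, 0.15, (600, 20)), 0, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

gini = clf.feature_importances_
perm = permutation_importance(clf, X_te, y_te, n_repeats=10,
                              random_state=0).importances_mean

# Rank each feature under both measures (0 = most important).
gini_rank = np.argsort(np.argsort(-gini))
perm_rank = np.argsort(np.argsort(-perm))
print("max rank disagreement:", np.abs(gini_rank - perm_rank).max())
```

Evaluating permutation importance on the test split, rather than the training data, avoids rewarding features the forest has merely memorized.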
Side-by-Side Importance
Gini (blue) vs Permutation (purple) for top features
Importance Correlation
Each point is a feature. Diagonal = perfect agreement.

Notable Rank Disagreements

Features where Gini and Permutation rankings differ significantly
What you're seeing: Traits where the two importance metrics disagree substantially on ranking.
What it means: Large rank differences suggest the trait's role is method-dependent. Traits ranked higher by Permutation may be more truly predictive; those higher by Gini may be splitting artifacts.
Trait | Gini Rank | Perm Rank | Difference | Interpretation
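The rank correlation and disagreement flags can be computed as below. The importance vectors here are synthetic placeholders, not the real scores:

```python
# Spearman rank correlation between the two importance vectors, plus the
# ">10 rank difference" rule used to flag major disagreements.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(4)
gini = rng.random(30)
perm = 0.7 * gini + 0.3 * rng.random(30)  # correlated but not identical

rho, _ = spearmanr(gini, perm)
gini_rank = np.argsort(np.argsort(-gini)) + 1  # 1 = most important
perm_rank = np.argsort(np.argsort(-perm)) + 1
disagree = np.flatnonzero(np.abs(gini_rank - perm_rank) > 10)
print(f"Spearman rho = {rho:.2f}; major disagreements: {disagree.size}")
```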
Bayesian Phenotype Prediction
Proper Bayesian inference using both presence AND absence of traits
What you're seeing: A Bayesian model using odds-ratio updating. Selected traits provide positive evidence; unselected traits provide negative evidence (weighted at 30% strength since "not selected" ≠ "confirmed absent").
What it means: Selecting only "Expressive Language Delay" without selecting "GDD" or "Hypotonia" will reduce the probability of Intellectual Disability, because absence of those co-occurring traits is informative.
Key insight: the prior P(ID) = 74%. With ELD alone (no GDD/Hypotonia selected), the posterior drops toward baseline rather than rising to 83%.
Traits in Model: - (Phenotypes)
Highest Prior: - (Most common trait)
Max Conditional: - (Strongest dependency)
Selected Evidence: 0 (Traits selected)

Interactive Posterior Calculator

Select PRESENT phenotypes (unselected = treated as likely absent)
Note: Unselected traits count as weak negative evidence. If you only select "Expressive Language Delay" but NOT "GDD", the model interprets this as: ELD is present, GDD is probably absent → ID probability decreases.
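A minimal sketch of this updating rule, assuming the "30% strength" weighting is implemented as a 0.3 exponent on the absence likelihood ratio; the dashboard's exact scheme and all probabilities besides the 74% prior are assumptions:

```python
# Odds-ratio Bayesian updating: selected traits contribute their full
# likelihood ratio; unselected traits contribute a down-weighted
# absence likelihood ratio (since "not selected" != "confirmed absent").

def posterior(prior, evidence, absence_weight=0.3):
    """evidence: list of (p_given_T, p_given_notT, present) triples."""
    odds = prior / (1 - prior)
    for p_t, p_not, present in evidence:
        if present:
            odds *= p_t / p_not                      # positive evidence
        else:
            lr_absent = (1 - p_t) / (1 - p_not)      # negative evidence,
            odds *= lr_absent ** absence_weight      # applied at 30% strength
    return odds / (1 + odds)

# Target: Intellectual Disability with the dashboard's 74% prior.
# ELD selected; GDD and Hypotonia unselected. Likelihoods are illustrative.
p = posterior(0.74, [(0.85, 0.60, True),    # ELD present
                     (0.80, 0.30, False),   # GDD unselected
                     (0.70, 0.35, False)])  # Hypotonia unselected
print(f"P(ID | evidence) = {p:.2f}")
```

With these illustrative numbers, the weak negative evidence from the two unselected co-occurring traits outweighs the positive evidence from ELD, so the posterior lands below the prior, matching the behavior described above.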
Posterior Probabilities
P(Trait | Selected Evidence)
Prior vs Posterior Comparison
How evidence shifts trait probabilities
Conditional Probability Matrix
P(Column Trait | Row Trait): the probability of the column trait given that the row trait is present