ML Predictions

12 Subtypes (6+6)
Cluster validation via supervised learning

About This Analysis

Machine learning models are trained to predict gene cluster membership across 12 ASD subtypes (6 major + 6 minor) from phenotype profiles. Using continuous trait scores (rather than binary presence/absence), Random Forest achieves 86% accuracy (versus 8.3% random chance), confirming that the clusters have distinct, learnable phenotype signatures. A hierarchical classifier (Major/Minor first, then specific cluster) achieves even higher stage-wise accuracy.

Clusters to Predict: - (Gene subtypes)
Best Accuracy: - (Random Forest, CV)
Top Predictor: - (Most important trait)
Algorithms: 3 (Decision Tree, Random Forest, Gradient Boosting)
Cluster Classification Performance
Predicting gene cluster membership from phenotype patterns (5-fold CV)
What you're seeing: Classification accuracy for predicting gene cluster membership across 12 subtypes. Results are from 5-fold cross-validation to ensure robust estimates.
What it means: With 12 fine-grained clusters (vs random chance of 8.3%), achieving 72-86% accuracy confirms clusters have distinct, learnable phenotype signatures. Feature importance below reveals which traits best discriminate between subtypes.
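The cross-validation setup behind these numbers can be sketched as follows. This is a minimal illustration on synthetic data; the gene count, trait count, and noise level are assumptions, not the real dataset:

```python
# Sketch of the 12-class cluster classifier with 5-fold stratified CV.
# Features are continuous trait scores in [0, 1]; labels are subtype IDs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
n_genes, n_traits, n_clusters = 600, 20, 12

# Synthetic trait-score matrix with cluster-specific signal (illustrative).
labels = rng.integers(0, n_clusters, size=n_genes)
centers = rng.random((n_clusters, n_traits))
X = np.clip(centers[labels] + rng.normal(0, 0.15, (n_genes, n_traits)), 0, 1)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, X, labels, cv=cv)
print("per-fold accuracy:", np.round(scores, 3))
print(f"mean: {scores.mean():.3f}  (chance = {1 / n_clusters:.3f})")
```

Stratification keeps the 12 subtype proportions similar across folds, which matters when some clusters are small.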
Cluster Predictor Importance
Traits most predictive of cluster membership
What you're seeing: Feature importance scores showing which phenotypes are most useful for predicting cluster membership. Toggle between Decision Tree (interpretable) and Random Forest (more accurate).
What it means: Top-ranked traits are the key discriminators between clusters—they define what makes each cluster distinct.
Trait Co-Prediction Importance
Traits that best predict presence of other traits
What you're seeing: Aggregated importance scores showing which traits are most predictive of other traits across all trait-prediction models.
What it means: Highly predictive traits tend to co-occur with many other phenotypes and may represent core ASD features.
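A sketch of how such co-prediction importances could be aggregated. The trait count, the 0.5 binarization threshold, and the data are illustrative assumptions, not the dashboard's actual pipeline:

```python
# For each trait, train a model to predict its (binarized) presence from
# the remaining traits, then average each predictor trait's importance
# across all the models in which it appears as a feature.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.random((300, 10))   # continuous trait scores per gene (synthetic)
present = X > 0.5           # illustrative presence/absence threshold

agg = np.zeros(10)
for t in range(10):
    others = np.delete(np.arange(10), t)
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X[:, others], present[:, t])
    agg[others] += clf.feature_importances_

agg /= 9  # each trait serves as a predictor in 9 of the 10 models
top = np.argsort(agg)[::-1]
print("most co-predictive traits (indices):", top[:3])
```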
Hierarchical Classification Strategy
Two-stage approach: Major/Minor first, then specific cluster
What you're seeing: A hierarchical classifier that first predicts Major vs Minor cluster type (6+6 split), then predicts the specific cluster within that group.
What it means: Breaking the 12-class problem into stages improves accuracy for each decision point, leveraging the natural hierarchy of major (higher-prevalence) and minor (specialized) subtypes.
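One plausible implementation of the two-stage scheme. The convention that cluster IDs 0-5 are Major and 6-11 are Minor, like the synthetic data, is an assumption for illustration:

```python
# Stage 1 predicts Major vs Minor; stage 2 routes each sample to a
# group-specific 6-way classifier trained only on that group's genes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_hierarchical(X, y):
    is_minor = y >= 6  # assumed ID convention: 0-5 Major, 6-11 Minor
    stage1 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, is_minor)
    stage2 = {
        False: RandomForestClassifier(n_estimators=100, random_state=0).fit(X[~is_minor], y[~is_minor]),
        True:  RandomForestClassifier(n_estimators=100, random_state=0).fit(X[is_minor], y[is_minor]),
    }
    return stage1, stage2

def predict_hierarchical(stage1, stage2, X):
    group = stage1.predict(X)
    out = np.empty(len(X), dtype=int)
    for g in (False, True):
        mask = group == g
        if mask.any():
            out[mask] = stage2[g].predict(X[mask])
    return out

# Synthetic demonstration data.
rng = np.random.default_rng(2)
y = rng.integers(0, 12, 500)
centers = rng.random((12, 15))
X = np.clip(centers[y] + rng.normal(0, 0.1, (500, 15)), 0, 1)

s1, s2 = fit_hierarchical(X, y)
pred = predict_hierarchical(s1, s2, X)
print("training accuracy:", (pred == y).mean())
```

Each stage faces an easier problem (2-way, then 6-way) than the flat 12-way classifier, which is why stage-wise accuracy can be higher.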
Model Comparison
Decision Tree vs Random Forest vs Gradient Boosting metrics
What you're seeing: A radar chart comparing all three models across multiple performance dimensions including accuracy, consistency, and feature utilization.
What it means: Random Forest (purple) achieves the best overall accuracy. Gradient Boosting (pink) offers strong performance with different feature weighting. Decision Tree (blue) provides interpretable rules.
Cross-Validation Score Distribution
Per-fold accuracy scores for cluster classifier
What you're seeing: Accuracy scores for each of the 5 cross-validation folds. Each fold uses different training/test splits.
What it means: Consistent scores across folds indicate the model generalizes well and isn't overfitting to specific data subsets.

Methodology

Primary Task: Predict which phenotype-based cluster a gene belongs to using its trait profile. This validates that clusters represent distinct phenotype patterns.

Secondary Task: For each phenotype trait, predict its presence/absence based on other traits. This reveals trait co-occurrence patterns.

Models: Decision Tree for interpretability (clinical rules); Random Forest (200 trees) and Gradient Boosting for robust predictions. All models use continuous trait scores (0-1) for better signal capture.

Validation: 5-fold stratified cross-validation ensures reliable accuracy estimates and prevents overfitting.

Gini vs Permutation Importance Comparison
Two methods for measuring feature importance in Random Forest
What you're seeing: Comparison of two importance methods: Gini impurity (built-in) vs Permutation importance (model-agnostic). Permutation importance shuffles each feature and measures accuracy drop.
What it means: Features ranked highly by both methods are robustly important. Disagreements may indicate that Gini is overestimating features with many distinct values (a known cardinality bias of impurity-based importance).
Gini-Perm Correlation: - (Rank correlation)
Major Disagreements: - (Rank diff > 10)
Top by Gini: - (Most splits)
Top by Permutation: - (Most predictive)
What you're seeing: Two importance metrics compared side-by-side. Gini importance measures how often a feature is used for splits; Permutation importance measures accuracy drop when the feature is shuffled.
What it means: Features ranking high on both metrics are robustly important. Disagreements may indicate features that are used frequently but add little predictive value (high Gini, low Permutation) or vice versa.
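Both measures can be obtained from a single fitted forest; a sketch with synthetic data, assuming scikit-learn's `permutation_importance` as the permutation method and a held-out split for measuring the accuracy drop:

```python
# Gini importance comes built in via feature_importances_; permutation
# importance shuffles each feature on held-out data and records the
# resulting accuracy drop.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
y = rng.integers(0, 12, 600)
centers = rng.random((12, 20))
X = np.clip(centers[y] + rng.normal(0, 0.15, (600, 20)), 0, 1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

gini = clf.feature_importances_
perm = permutation_importance(clf, X_te, y_te, n_repeats=10,
                              random_state=0).importances_mean

# Rank each feature under both measures (0 = most important).
gini_rank = np.argsort(np.argsort(-gini))
perm_rank = np.argsort(np.argsort(-perm))
print("max rank disagreement:", np.abs(gini_rank - perm_rank).max())
```

Evaluating permutation importance on the test split, rather than the training data, avoids rewarding features the forest has merely memorized.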
Side-by-Side Importance
Gini (blue) vs Permutation (purple) for top features
Importance Correlation
Each point is a feature. Diagonal = perfect agreement.

Notable Rank Disagreements

Features where Gini and Permutation rankings differ significantly
What you're seeing: Traits where the two importance metrics disagree substantially on ranking.
What it means: Large rank differences suggest the trait's role is method-dependent. Traits ranked higher by Permutation may be more truly predictive; those higher by Gini may be splitting artifacts.
Trait | Gini Rank | Perm Rank | Difference | Interpretation
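The rank correlation and disagreement flags can be computed as below. The importance vectors here are synthetic placeholders, not the real scores:

```python
# Spearman rank correlation between the two importance vectors, plus the
# ">10 rank difference" rule used to flag major disagreements.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(4)
gini = rng.random(30)
perm = 0.7 * gini + 0.3 * rng.random(30)  # correlated but not identical

rho, _ = spearmanr(gini, perm)
gini_rank = np.argsort(np.argsort(-gini)) + 1  # 1 = most important
perm_rank = np.argsort(np.argsort(-perm)) + 1
disagree = np.flatnonzero(np.abs(gini_rank - perm_rank) > 10)
print(f"Spearman rho = {rho:.2f}; major disagreements: {disagree.size}")
```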
Bayesian Phenotype Prediction
Proper Bayesian inference using both presence AND absence of traits
What you're seeing: A Bayesian model using odds-ratio updating. Selected traits provide positive evidence; unselected traits provide negative evidence (weighted at 30% strength since "not selected" ≠ "confirmed absent").
What it means: Selecting only "Expressive Language Delay" without selecting "GDD" or "Hypotonia" will reduce the probability of Intellectual Disability, because absence of those co-occurring traits is informative.
Key insight: the prior P(ID) = 74%. With ELD alone (no GDD/Hypotonia selected), the posterior drops toward baseline rather than rising to 83%.
Traits in Model: - (Phenotypes)
Highest Prior: - (Most common trait)
Max Conditional: - (Strongest dependency)
Selected Evidence: 0 (Traits selected)

Interactive Posterior Calculator

Select PRESENT phenotypes (unselected = treated as likely absent)
Note: Unselected traits count as weak negative evidence. If you only select "Expressive Language Delay" but NOT "GDD", the model interprets this as: ELD is present, GDD is probably absent → ID probability decreases.
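A minimal sketch of this updating rule, assuming the "30% strength" weighting is implemented as a 0.3 exponent on the absence likelihood ratio; the dashboard's exact scheme and all probabilities besides the 74% prior are assumptions:

```python
# Odds-ratio Bayesian updating: selected traits contribute their full
# likelihood ratio; unselected traits contribute a down-weighted
# absence likelihood ratio (since "not selected" != "confirmed absent").

def posterior(prior, evidence, absence_weight=0.3):
    """evidence: list of (p_given_T, p_given_notT, present) triples."""
    odds = prior / (1 - prior)
    for p_t, p_not, present in evidence:
        if present:
            odds *= p_t / p_not                      # positive evidence
        else:
            lr_absent = (1 - p_t) / (1 - p_not)      # negative evidence,
            odds *= lr_absent ** absence_weight      # applied at 30% strength
    return odds / (1 + odds)

# Target: Intellectual Disability with the dashboard's 74% prior.
# ELD selected; GDD and Hypotonia unselected. Likelihoods are illustrative.
p = posterior(0.74, [(0.85, 0.60, True),    # ELD present
                     (0.80, 0.30, False),   # GDD unselected
                     (0.70, 0.35, False)])  # Hypotonia unselected
print(f"P(ID | evidence) = {p:.2f}")
```

With these illustrative numbers, the weak negative evidence from the two unselected co-occurring traits outweighs the positive evidence from ELD, so the posterior lands below the prior, matching the behavior described above.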
Posterior Probabilities
P(Trait | Selected Evidence)
Prior vs Posterior Comparison
How evidence shifts trait probabilities
Conditional Probability Matrix
P(Column Trait | Row Trait): the probability of the column trait given that the row trait is present