DS710 Heart Disease Pipeline

Multi-Hospital Data Integration with Smart Imputation

Pipeline Execution Summary

Clean Records: 867
Data Retention: 94.2%
Values Imputed: 1,743
Model Accuracy: 87.4%

Execution Status

Stage           Status    Details
ETL Pipeline    Complete  920 raw records processed to 867 clean records. Median imputation applied to 1,743 missing values.
EDA Analysis    Complete  Exploratory analysis with visualizations. Disease prevalence: 55.3%
Model Training  Complete  3 algorithms trained. Random Forest selected: 87.4% accuracy, 91.6% recall, 0.920 AUC
Fairness Audit  Warning   Gender gap: 8.3% (Female: 88.9%, Male: 80.6%) - Within acceptable range

Data Transformation Pipeline

Records by Hospital (After Cleaning)

Hospital     Records  Contribution
Cleveland    303      34.9%
Hungarian    294      33.9%
Virginia     147      17.0%
Switzerland  123      14.2%
Total        867      100%

Data Quality Metrics

Metric                  Value                 Status
Records Retained        867 / 920 (94.2%)     Excellent
Missing Values Imputed  1,743 values          Complete
Imputation Method       Median Strategy       Applied
Disease Prevalence      55.3% (480 positive)  Balanced
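
As an illustration of the median strategy named above, here is a minimal sketch in plain Python. It is not the pipeline's actual code: the function name `impute_median`, the toy records, and the column names are assumptions made for the example. In a scikit-learn workflow, `SimpleImputer(strategy="median")` is the usual counterpart.

```python
from statistics import median


def impute_median(rows, columns):
    """Fill None entries in each column with that column's median.

    Returns a new list of rows plus the number of values filled.
    """
    filled = [dict(row) for row in rows]
    n_filled = 0
    for col in columns:
        observed = [r[col] for r in rows if r[col] is not None]
        if not observed:
            continue  # no observed values to take a median from
        col_median = median(observed)
        for r in filled:
            if r[col] is None:
                r[col] = col_median
                n_filled += 1
    return filled, n_filled


# Hypothetical toy records; the column names are illustrative only.
toy = [
    {"chol": 200, "trestbps": 120},
    {"chol": None, "trestbps": 140},
    {"chol": 260, "trestbps": None},
    {"chol": 240, "trestbps": 130},
]
clean, n_filled = impute_median(toy, ["chol", "trestbps"])

# Retention rate from the figures reported above: 867 of 920 records kept.
retention_pct = round(867 / 920 * 100, 1)
```

The median is used rather than the mean in this sketch because it is less sensitive to extreme lab values, which is a common reason for choosing it on clinical data.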

Machine Learning Model Performance

Algorithm Comparison

Algorithm            Accuracy  Recall  AUC-ROC  Selected
Random Forest        87.4%     91.6%   0.920    SELECTED
Logistic Regression  82.1%     85.3%   0.891    -
Gradient Boosting    86.2%     89.1%   0.910    -
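
For readers unfamiliar with the three columns, the sketch below shows how accuracy, recall, and AUC-ROC can be computed from labels and scores. It uses only the standard library and toy data; the function names and the 0.5 decision threshold are assumptions for illustration, not the pipeline's own evaluation code.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Share of true positives (label 1) that the model caught."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def auc_roc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, not taken from the report.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
auc = auc_roc(y_true, scores)
```

Recall depends on the chosen threshold while AUC-ROC does not, which is why the two are reported side by side.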

Why Random Forest?

  • Highest Recall (91.6%) - Minimizes false negatives in medical diagnosis
  • Highest Accuracy (87.4%) - Best overall predictive performance of the three candidates
  • Best AUC-ROC (0.920) - Superior discrimination across probability thresholds
  • Robust - Less prone to overfitting than single decision trees
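
The selection rationale above can be expressed as a simple ranking rule. The sketch below uses the figures from the comparison table; the priority order (recall first, then AUC-ROC, then accuracy) and the name `select_model` are assumptions, since the report does not state how ties would be broken.

```python
# Metrics copied from the Algorithm Comparison table.
results = {
    "Random Forest": {"accuracy": 0.874, "recall": 0.916, "auc": 0.920},
    "Logistic Regression": {"accuracy": 0.821, "recall": 0.853, "auc": 0.891},
    "Gradient Boosting": {"accuracy": 0.862, "recall": 0.891, "auc": 0.910},
}


def select_model(results, priority=("recall", "auc", "accuracy")):
    """Pick the model with the best metrics, compared in the given priority order."""
    return max(results, key=lambda name: tuple(results[name][m] for m in priority))


selected = select_model(results)
```

Here Random Forest leads on all three metrics, so any priority order gives the same answer; the ordering only matters when candidates trade off against each other.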

Model Artifacts

Trained models: random_forest.pkl, logistic_regression.pkl, gradient_boosting.pkl
Preprocessor: model_scaler.pkl (StandardScaler)
Predictions: model_predictions.json
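
A minimal sketch of loading these artifacts is shown below, assuming the `.pkl` files are standard Python pickles sitting in one directory. The helper name `load_artifacts` and the directory layout are hypothetical; unpickling real scikit-learn models also requires a compatible scikit-learn version to be installed.

```python
import pickle
from pathlib import Path

# File names as listed above; the dictionary keys are illustrative.
ARTIFACT_NAMES = {
    "random_forest": "random_forest.pkl",
    "logistic_regression": "logistic_regression.pkl",
    "gradient_boosting": "gradient_boosting.pkl",
    "scaler": "model_scaler.pkl",
}


def load_artifacts(directory):
    """Unpickle whichever artifacts exist in `directory`; list the rest as missing."""
    loaded, missing = {}, []
    for key, filename in ARTIFACT_NAMES.items():
        path = Path(directory) / filename
        if path.is_file():
            with path.open("rb") as fh:
                loaded[key] = pickle.load(fh)  # only unpickle files you trust
        else:
            missing.append(filename)
    return loaded, missing
```

Calling `load_artifacts("some/output/dir")` returns the loaded objects together with the names of any files that were not found, so a partially populated directory does not raise an error.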

Fairness & Bias Audit

Multi-dimensional fairness analysis across gender, age, symptom type, and imputation patterns

1. Gender Fairness

Gender  Samples  Accuracy  Status
Female  193      98.4%     ✓ Acceptable
Male    674      97.2%     ✓ Acceptable
Gap     -        1.3%      EXCELLENT
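
The per-group accuracy and gap in a table like this can be computed as in the sketch below. The function names and the toy labels are assumptions for illustration; the audit's actual implementation is not shown in this report.

```python
from collections import defaultdict


def accuracy_by_group(groups, y_true, y_pred):
    """Accuracy computed separately for each subgroup label."""
    hits, totals = defaultdict(int), defaultdict(int)
    for g, t, p in zip(groups, y_true, y_pred):
        totals[g] += 1
        hits[g] += int(t == p)
    return {g: hits[g] / totals[g] for g in totals}


def max_gap(acc_by_group):
    """Difference between the best and worst subgroup accuracy."""
    return max(acc_by_group.values()) - min(acc_by_group.values())


# Toy data, not taken from the report.
groups = ["F", "F", "M", "M", "M", "M"]
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0]

acc = accuracy_by_group(groups, y_true, y_pred)
gap = max_gap(acc)
```

Note that subtracting the rounded table values gives 98.4 - 97.2 = 1.2 points; the reported 1.3% gap was presumably computed from unrounded accuracies.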

2. Age Group Fairness

Age Group  Samples  Accuracy  Status
<40        79       97.5%     ✓ Acceptable
40-50      209      99.0%     ✓ Acceptable
50-60      355      96.6%     ✓ Acceptable
>60        224      97.3%     ✓ Acceptable
Max Gap    -        2.4%      EXCELLENT
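
The age bands above share their boundary values (50 appears in both "40-50" and "50-60"), so any implementation has to pick a convention. The sketch below shows one possible assignment; the boundary handling is an assumption, not the audit's documented rule.

```python
def age_group(age):
    """Map an age in years to one of the four bands used in the table.

    Assumed convention: 40-49 -> "40-50", 50-60 -> "50-60", above 60 -> ">60".
    """
    if age < 40:
        return "<40"
    if age < 50:
        return "40-50"
    if age <= 60:
        return "50-60"
    return ">60"


# The default age shown nowhere in the table; 53 is just an example value.
example_band = age_group(53)
```

A different convention would move patients aged exactly 50 or 60 between bands and could shift the per-band sample counts slightly.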

3. Symptom Type Fairness (Important!)

Symptom Type       Samples  Accuracy  Status
Atypical Symptoms  42       90.5%     ⚠ Review
Non-Anginal Pain   168      97.6%     ✓ Good
Asymptomatic       186      97.3%     ✓ Good
Max Gap            -        7.1%      ⚠ Monitor

What "Review" means for Atypical Symptoms:

The model correctly classifies 90.5% of patients with atypical symptoms. That is 7.1 percentage points below the 97.6% for non-anginal pain and 6.8 points below the 97.3% for asymptomatic patients. This is clinically significant because:

  • Atypical presentations are harder to diagnose: Unusual symptoms make pattern recognition difficult, even for ML models
  • Smaller sample size (n=42): Only 42 patients with atypical symptoms (vs. 168 or more for the other types) means less training data for this subgroup
  • Clinical implication: Clinicians should use extra caution with atypical presentations and should not rely solely on the model

Recommendation: Consider collecting more atypical symptom cases or developing a specialized sub-model for this high-risk group.
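
To make the small-sample point concrete, the sketch below computes a 95% Wilson score interval for a subgroup accuracy. The counts are assumptions inferred from the rounded percentages (38 of 42 correct gives 90.5%; 164 of 168 gives 97.6%); the report itself does not list them.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


# Assumed counts, reconstructed from the rounded accuracies in the table.
atypical_lo, atypical_hi = wilson_interval(38, 42)
non_anginal_lo, non_anginal_hi = wilson_interval(164, 168)
```

Under these assumed counts, the atypical-symptom interval runs from roughly 78% to 96%, far wider than the interval for the 168 non-anginal cases, and the two intervals overlap. Part of the 7.1-point gap may therefore reflect sampling noise, which supports collecting more atypical cases before drawing firm conclusions.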

4. Imputation Load Fairness

Imputation Level  Samples  Accuracy  Status
High (>30%)       0        N/A       No data
Medium (10-30%)   0        N/A       No data
Low (<10%)        867      97.5%     ✓ Good

Why Imputation Load Is Flagged for Review:

All 867 records have a low share of missing data (<10%), and the model reaches 97.5% accuracy on them. The report still flags this dimension for awareness:

  • No test cases for high/medium imputation: We cannot validate model performance on heavily imputed records (the ETL pipeline filtered them out as too incomplete)
  • Potential risk: If patients with more missing data ever enter the system, the model's performance on them is untested
  • Data quality is excellent: The 94.2% record retention rate means very few records were too damaged to use
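
The three imputation levels can be assigned per record as sketched below, using the thresholds from the table. The function names, the field list, and the treatment of the exact 10% and 30% boundaries are assumptions. The load must be measured on the raw record, before imputation fills the gaps.

```python
def imputation_load(record, fields):
    """Fraction of the listed fields that are missing (None) in a raw record."""
    missing = sum(1 for f in fields if record.get(f) is None)
    return missing / len(fields)


def imputation_level(load):
    """Bucket a missing-data fraction into the three levels used in the table."""
    if load > 0.30:
        return "High (>30%)"
    if load >= 0.10:
        return "Medium (10-30%)"
    return "Low (<10%)"


# Hypothetical raw record with ten fields, one of them missing.
fields = [f"f{i}" for i in range(10)]
record = {f: 1.0 for f in fields}
record["f3"] = None

level = imputation_level(imputation_load(record, fields))
```

A guard of this kind at prediction time could flag incoming patients whose records fall into the untested Medium or High levels.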

Key Findings

✓ Excellent fairness across gender and age groups: 1.3% and 2.4% gaps respectively
⚠ Monitor symptom type performance: 7.1% gap detected. Atypical symptoms show 90.5% accuracy vs 97.6% for non-anginal pain. This may require clinical review or additional training data for atypical presentations.
✓ Imputation load acceptable where measurable: Model performs consistently well on records with low missing data (all 867 samples fall in this category). Medium and high imputation levels have no samples and remain untested.

Risk Calculator - Patient Assessment Tool

Enter patient characteristics to get an individualized risk prediction from the Random Forest model.

This prediction is based on the Random Forest model trained on 867 patients from 4 hospitals. Always pair this with clinical judgment. The model achieves 87.4% accuracy on test data but may not capture all clinical factors.
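
A rough sketch of what the calculator does behind the form is given below. It assumes the scaler and model expose scikit-learn's `transform` and `predict_proba` methods and that class 1 (disease present) sits at index 1 of the probability output. The function name, the feature order, and the 0.3 / 0.7 risk cut-offs are hypothetical and not taken from this report.

```python
def predict_risk(model, scaler, features):
    """Scale one patient's features and return (probability, risk level).

    `features` must be numeric values in the same column order used in training.
    """
    scaled = scaler.transform([features])
    prob = float(model.predict_proba(scaled)[0][1])  # probability of the positive class
    if prob >= 0.7:  # hypothetical cut-off
        level = "High"
    elif prob >= 0.3:  # hypothetical cut-off
        level = "Moderate"
    else:
        level = "Low"
    return prob, level


# Stand-ins so the sketch runs without the real artifacts.
class StubScaler:
    def transform(self, rows):
        return rows


class StubModel:
    def predict_proba(self, rows):
        return [[0.2, 0.8] for _ in rows]


prob, level = predict_risk(StubModel(), StubScaler(), [1.0, 2.0, 3.0, 4.0])
```

With the real artifacts, the stubs would be replaced by the unpickled `model_scaler.pkl` and `random_forest.pkl` objects.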

⚠ Clinical Note on Atypical Presentations:

If chest pain type is "Atypical Angina", note that the model's accuracy on atypical presentations (90.5%) is lower than on non-anginal pain (97.6%). Consider additional testing (EKG, stress test, troponin) for atypical cases.

Output Files & Reports

JSON Reports

Main Pipeline Report: Comprehensive summary of all stages
ETL Report: Data transformation details
EDA Report: Exploratory analysis insights
Model Report: Model performance metrics
Fairness Report: Bias and fairness analysis

Data & Predictions

Processed Dataset: 867 clean records (61 KB)
Model Predictions: Test set predictions

Trained Models

Random Forest Model: Selected model (87.4%)
Logistic Regression: Baseline model
Gradient Boosting: Alternative model
Feature Scaler: StandardScaler
Imputer: Median imputer

Visualizations

EDA Visualization: Exploratory analysis plots (260 KB)