Introduction

Cardiovascular disease (CVD), or heart disease, is one of the leading causes of death in the United States. The Centers for Disease Control and Prevention estimates 647,000 deaths per year¹. CVD is an umbrella term that covers a range of heart conditions, including diseased blood vessels (atherosclerosis or vasculitis), structural problems (cardiomegaly), and irregular heartbeats (arrhythmia). The most common type of heart disease in the United States is coronary artery disease. Most of the time CVD is “silent” and goes undiagnosed until an individual experiences signs or symptoms of a heart attack, heart failure, or arrhythmia². Research has identified risk factors associated with developing CVD. These risk factors are either non-modifiable (they cannot be changed) or modifiable (they can be changed).

The non-modifiable risk factors are³:

  • Increasing age
  • Biological sex (men are at greater risk than women)
  • Heredity

The modifiable risk factors include³:

  • Smoking tobacco
  • High blood cholesterol
  • High blood pressure
  • Physical inactivity
  • Obesity
  • Diabetes

A physician can use these risk factors to recommend lifestyle changes or treatment strategies for a patient. The question I want to investigate is whether an XGBoost tree model can predict if someone has CVD based on the same risk factors that physicians use.

Data

The data used in this analysis comes from a dataset compiled by four hospitals in Cleveland, Hungary, Switzerland, and VA Long Beach, referred to as the UCI Heart Disease dataset. The dataset consists of 303 individuals with 14 attributes; 138 individuals presented with no CVD and 165 presented with CVD. Originally there were 76 attributes, but published experiments use a subset of only 14. The target variable is the diagnosis of heart disease based on the diameter narrowing in any major blood vessel, with a cutoff of 50% (see attribute #14 below).

The 14 attributes used are:

1. age: age in years

2. sex: sex (1 = male; 0 = female)

3. cp: chest pain type — 1: typical angina, 2: atypical angina, 3: non-anginal pain, 4: asymptomatic

4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)

5. chol: serum cholesterol in mg/dl

6. fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

7. restecg: resting electrocardiographic results — 0: normal, 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria

8. thalach: maximum heart rate achieved

9. exang: exercise-induced angina (1 = yes; 0 = no)

10. oldpeak: ST depression induced by exercise relative to rest

11. slope: the slope of the peak exercise ST-segment — 1: upsloping, 2: flat, 3: downsloping

12. ca: number of major vessels (0–3) colored by fluoroscopy

13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

14. target: diagnosis of heart disease (angiographic disease status); 0: < 50% diameter narrowing, 1: > 50% diameter narrowing

The dataset and the full variable descriptions can be found here: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
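As a rough illustration of how the 14-attribute layout above can be read in code, here is a minimal Python sketch using only the standard library. It assumes the data arrives as comma-separated rows in the attribute order listed above; the two sample rows are made up for illustration, and the helper name `parse_rows` is hypothetical rather than anything from the original analysis.

```python
import csv
import io

# Column order follows the 14 attributes listed above.
COLUMNS = [
    "age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
    "thalach", "exang", "oldpeak", "slope", "ca", "thal", "target",
]

# Code-to-label lookups taken from the attribute descriptions.
CP_LABELS = {1: "typical angina", 2: "atypical angina",
             3: "non-anginal pain", 4: "asymptomatic"}
THAL_LABELS = {3: "normal", 6: "fixed defect", 7: "reversible defect"}


def parse_rows(text):
    """Parse comma-separated text into a list of dicts keyed by COLUMNS."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        rows.append({name: float(value) for name, value in zip(COLUMNS, fields)})
    return rows


# Made-up sample rows in the assumed layout (not real patient records).
SAMPLE = (
    "63,1,1,145,233,1,2,150,0,2.3,3,0,6,0\n"
    "67,1,4,160,286,0,2,108,1,1.5,2,3,3,1\n"
)

rows = parse_rows(SAMPLE)
for row in rows:
    print(row["age"], CP_LABELS[int(row["cp"])],
          THAL_LABELS[int(row["thal"])], int(row["target"]))
```

Keeping the code-to-label lookups next to the column list makes it easier to read model output later, since attributes such as cp and thal are stored as numeric codes rather than names.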

Method

An XGBoost tree model was used for two reasons: 1) each tree in the model is built by splitting on specific features, and 2) it is more robust than other forms of decision tree models⁴.
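To make the idea of "splitting on a specific feature" concrete, the sketch below finds the best single threshold on one feature using Gini impurity. This is a plain decision stump written for illustration, not XGBoost itself: XGBoost builds many trees by gradient boosting and scores splits with its own gain criterion. The function names and the toy values are assumptions made up for this example.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best 'value <= threshold' split.

    If no split improves on the unsplit impurity, threshold is None.
    """
    n = len(labels)
    best = (None, gini(labels))
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


# Made-up toy data: maximum heart rate (thalach) and a 0/1 target.
thalach = [108, 120, 129, 150, 172, 187]
target = [1, 1, 1, 0, 0, 0]

threshold, impurity = best_split(thalach, target)
print(threshold, impurity)
```

On this toy data the two classes separate cleanly, so the chosen threshold falls between 129 and 150 and the weighted impurity drops to zero. Real features rarely separate this neatly, which is one reason boosted ensembles combine many such splits.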

Feature Selection

No feature engineering was done because 1) there were only 14 features, and 2) each feature was treated as an independent variable to see which features contributed to the prediction. To support this, I checked for collinearity among the 14 attributes. Pearson’s correlation showed no strong correlation between any pair of variables (Fig 1).

Furthermore, the data was split 70:30 into training and test sets. This ratio was chosen because the dataset has only 303 individuals, which is relatively small.
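The two checks described here, pairwise Pearson correlation and a 70:30 split, can be sketched in plain Python as follows. The original analysis does not say which library or random seed was used, so the helper names, the seed, the rounding of the test-set size, and the toy values below are all assumptions for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def split_70_30(rows, test_fraction=0.3, seed=42):
    """Shuffle rows reproducibly and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_fraction)  # round test size up
    return shuffled[n_test:], shuffled[:n_test]


# Made-up toy columns, only to exercise the correlation function.
age = [29, 41, 54, 63, 70]
thalach = [202, 172, 160, 150, 109]
r = pearson(age, thalach)

# Row indices stand in for the 303 individuals.
train, test = split_70_30(range(303))
print(round(r, 2), len(train), len(test))
```

With 303 rows and the test size rounded up, this gives 212 training rows and 91 test rows. In practice the correlation would be computed for every pair of the 14 attributes, and a pair would be flagged only if the absolute value of r were close to 1.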

