Predicting Heart Disease using Machine Learning¶
This notebook will introduce some foundation machine learning and data science concepts by exploring the problem of heart disease classification.
For example, given a person's health characteristics, we're going to build a model to predict whether or not they have heart disease.
It is intended to be an end-to-end example of what a data science and machine learning proof of concept might look like.
What is classification?¶
Classification involves deciding whether a sample is part of one class or another (binary classification).
If there are multiple class options, it's referred to as multi-class classification.
What we'll end up with¶
We'll start with the heart disease dataset we've worked on in previous modules and we'll approach the problem following the machine learning modelling framework.
6 Step Machine Learning Modelling Framework
More specifically, we'll look at the following topics.
Step | What we'll cover |
---|---|
Exploratory data analysis (EDA) | The process of going through a dataset and discovering more about it. |
Model training | Create model(s) to learn to predict a target variable based on other variables. |
Model evaluation | Evaluating a model's predictions using problem-specific evaluation metrics. |
Model comparison | Comparing several different models to find the best one. |
Model hyperparameter tuning | Once we've found a good model, can we tweak its hyperparameters to improve it? |
Feature importance | Since we're predicting the presence of heart disease, are there some features/characteristics that are more important for prediction? |
Cross-validation | If we do build a good model, can we be sure it will work on unseen data? |
Reporting what we've found | If we had to present our work, what would we show someone? |
To work through these topics, we'll use pandas, Matplotlib and NumPy for data analysis, as well as Scikit-Learn for machine learning and modelling tasks.
Tools which can be used for each step of the machine learning modelling process.
We'll work through each step and by the end of the notebook, we'll have a handful of models, all of which can predict whether or not a person has heart disease based on a number of different parameters with considerable accuracy.
You'll also be able to describe which parameters are more indicative than others, for example, sex may be more important than age.
1. Going through the 6 step ML framework¶
1.1 Problem Definition¶
In our case, the problem we will be exploring is binary classification (a sample can only be one of two things).
This is because we're going to be using a number of different features (pieces of information such as health characteristics) about a person to predict whether they have heart disease or not.
In a statement,
Given clinical parameters about a patient, can we predict whether or not they have heart disease?
1.2 What data are we using?¶
What you'll want to do here is dive into the data your problem definition is based on.
This may involve sourcing data (if it doesn't already exist), defining different parameters, talking to experts about it and finding out what you should expect.
The original data came from the Cleveland database from UCI Machine Learning Repository.
However, we've downloaded it in a formatted way from Kaggle.
The original database contains 76 attributes, but here only 14 attributes will be used. Attributes (also called features) are the variables we'll use to predict our target variable.
Attributes and features are also referred to as independent variables and a target variable can be referred to as a dependent variable.
Note: We use the independent variable(s) to predict our dependent variable(s).
In our case, the independent variables are a patient's different medical attributes and the dependent variable is whether or not they have heart disease.
1.3 How will we evaluate our model?¶
An evaluation metric is something you usually define at the start of a project.
However, since machine learning is very experimental, it can change over time.
But to begin a project, you might say something like:
If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we'll pursue this project.
The reason this is helpful is it provides a rough goal for a machine learning engineer or data scientist to work towards.
Of course, as the project progresses and gets tested in the real world, you may have to adjust this goal/threshold.
1.4 Which features of the data will be important to us?¶
Features are different parts and characteristics of the data.
During this step, you'll want to start exploring what each portion of the data relates to and then create a reference you can use to look up later on.
One of the most common ways to do this is to create a data dictionary.
Heart Disease Data Dictionary¶
A data dictionary describes the data you're dealing with.
Not all datasets come with them so this is where you may have to do your research or ask a subject matter expert (someone who knows about the data) for more.
The following are the features we'll use to predict our target variable (heart disease or no heart disease).
Feature | Description | Example Values |
---|---|---|
age | Age in years | 29, 45, 60 |
sex | 1 = male; 0 = female | 0, 1 |
cp | Chest pain type | 0: Typical angina (chest pain), 1: Atypical angina (chest pain not related to the heart), 2: Non-anginal pain (typically esophageal spasms, not heart related), 3: Asymptomatic (chest pain not showing signs of disease) |
trestbps | Resting blood pressure (in mm Hg on admission to the hospital) | 120, 140, 150 |
chol | Serum cholesterol in mg/dl | 180, 220, 250 |
fbs | Fasting blood sugar > 120 mg/dl (1 = true; 0 = false) | 0, 1 |
restecg | Resting electrocardiographic results | 0: Nothing to note, 1: ST-T Wave abnormality, 2: Left ventricular hypertrophy |
thalach | Maximum heart rate achieved | 160, 180, 190 |
exang | Exercise induced angina (1 = yes; 0 = no) | 0, 1 |
oldpeak | ST depression (heart potentially not getting enough oxygen) induced by exercise relative to rest | 0.5, 1.0, 2.0 |
slope | The slope of the peak exercise ST segment | 0: Upsloping, 1: Flatsloping, 2: Downsloping |
ca | Number of major vessels (0-3) colored by fluoroscopy | 0, 1, 2, 3 |
thal | Thallium stress test result | 1: Normal, 3: Normal, 6: Fixed defect, 7: Reversible defect |
target | Have disease or not (1 = yes; 0 = no) | 0, 1 |
Note: No personally identifiable information (PII) can be found in the dataset.
It's a good idea to save these to a Python dictionary or in an external file, so we can look at them later without coming back here.
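For example, here's a minimal sketch of what this could look like as a plain Python dictionary (the descriptions are paraphrased from the table above, and the dictionary name is just illustrative):
# A minimal sketch of saving the data dictionary in Python (descriptions paraphrased from the table above)
feature_dict = {
    "age": "Age in years",
    "sex": "1 = male; 0 = female",
    "cp": "Chest pain type (0-3)",
    "trestbps": "Resting blood pressure (in mm Hg on admission to the hospital)",
    "chol": "Serum cholesterol in mg/dl",
    "fbs": "Fasting blood sugar > 120 mg/dl (1 = true; 0 = false)",
    "restecg": "Resting electrocardiographic results (0-2)",
    "thalach": "Maximum heart rate achieved",
    "exang": "Exercise induced angina (1 = yes; 0 = no)",
    "oldpeak": "ST depression induced by exercise relative to rest",
    "slope": "Slope of the peak exercise ST segment (0-2)",
    "ca": "Number of major vessels (0-3) colored by fluoroscopy",
    "thal": "Thallium stress test result",
    "target": "Have disease or not (1 = yes; 0 = no)"
}
# Look a feature up later on
feature_dict["thalach"]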
2. Preparing the tools¶
At the start of any project, it's customary to see the required libraries imported in a big chunk (as you can see in the code cell below).
However, in practice, when starting on new projects you may import libraries as you go (because you don't know what you need ahead of time).
After you've spent a couple of hours working on your problem, you'll probably want to do some tidying up.
This is where you may want to consolidate every library you've used at the top of your notebook.
The libraries you use will differ from project to project. But there are a few you'll likely take advantage of during almost every structured data project.
- pandas for data analysis.
- NumPy for numerical operations.
- Matplotlib/seaborn for plotting or data visualization.
- Scikit-Learn for machine learning modelling and evaluation.
# Regular EDA and plotting libraries
import numpy as np # np is short for numpy
import pandas as pd # pandas is so commonly used, it's shortened to pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns # seaborn gets shortened to sns
## Models
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
## Model evaluators
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
# from sklearn.metrics import plot_roc_curve # note: this was changed in Scikit-Learn 1.2+ to be "RocCurveDisplay" (see below)
from sklearn.metrics import RocCurveDisplay # new in Scikit-Learn 1.2+
# Print last updated
import datetime
print(f"Notebook last updated: {datetime.datetime.now()}\n")
# Print versions of libraries we're using (as long as yours are equal or greater than these, your code should work)
print(f"NumPy version: {np.__version__}")
print(f"pandas version: {pd.__version__}")
print(f"matplotlib version: {matplotlib.__version__}")
print(f"Scikit-Learn version: {sklearn.__version__}")
Notebook last updated: 2024-09-24 13:29:26.771285 NumPy version: 2.1.1 pandas version: 2.2.2 matplotlib version: 3.9.2 Scikit-Learn version: 1.5.1
3. Loading Data¶
There are many different ways to store data.
One typical way of storing tabular data (data similar to what you'd see in an Excel spreadsheet) is the .csv format, short for comma-separated values.
Other common formats include JSON, SQL and parquet.
Pandas has a built-in function to read .csv files called read_csv(), which takes the file path of your .csv file. You'll likely use this a lot.
Note: CSV format is good for smaller datasets but can face some speed issues when working with larger datasets. For more on different data formats pandas is compatible with, I'd check out the pandas guide on reading and writing data.
And there are many more read functions for different data formats in the Input/Output section of the pandas documentation.
df = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/heart-disease.csv")
# df = pd.read_csv("../data/heart-disease.csv") # Read from local directory, 'DataFrame' shortened to 'df'
df.shape # (rows, columns)
(303, 14)
# Check the head of our DataFrame
df.head()
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
Wonderful! We've got some data to work with. Notice how all the column names reflect a field in our data dictionary above.
4. Data Exploration (exploratory data analysis or EDA)¶
Once you've imported a dataset, the next step is to explore.
Or in formal terms, perform an Exploratory Data Analysis (EDA).
There's no set way of doing this.
But what you should be trying to do is become more and more familiar with the dataset.
Compare different columns to each other, compare them to the target variable.
Refer back to your data dictionary and remind yourself of what different columns mean.
One of my favourites is viewing 10-100 random samples of the data.
Your goal here is to become a subject matter expert on the dataset you're working with.
So if someone asks you a question about it, you can give them an explanation, and when you start building models, you can sanity check them to see whether they're performing too well (overfitting and memorizing the data rather than learning generalizable patterns) or why they might be performing poorly (underfitting, or not learning the patterns in the data).
Since EDA has no real set methodology, the following is a short checklist you might want to walk through:
- What question(s) are you trying to solve (or prove wrong)?
- What kind of data do you have and how do you treat different types?
- What’s missing from the data and how do you deal with it?
- Where are the outliers and why should you care about them?
- How can you add, change or remove features to get more out of your data?
One of the quickest and easiest ways to check your data is with the head() function.
Calling it on any DataFrame will print the top 5 rows, while tail() prints the bottom 5. You can also pass a number to them, like head(10), to show the top 10 rows.
# Let's check the top 5 rows of our dataframe
df.head()
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
# And the top 10
df.head(10)
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
5 | 57 | 1 | 0 | 140 | 192 | 0 | 1 | 148 | 0 | 0.4 | 1 | 0 | 1 | 1 |
6 | 56 | 0 | 1 | 140 | 294 | 0 | 0 | 153 | 0 | 1.3 | 1 | 0 | 2 | 1 |
7 | 44 | 1 | 1 | 120 | 263 | 0 | 1 | 173 | 0 | 0.0 | 2 | 0 | 3 | 1 |
8 | 52 | 1 | 2 | 172 | 199 | 1 | 1 | 162 | 0 | 0.5 | 2 | 0 | 3 | 1 |
9 | 57 | 1 | 2 | 150 | 168 | 0 | 1 | 174 | 0 | 1.6 | 2 | 0 | 2 | 1 |
value_counts() shows how many times each of the values of a categorical column appears.
# Let's see how many positive (1) and negative (0) samples we have in our DataFrame
df.target.value_counts()
target 1 165 0 138 Name: count, dtype: int64
Since these two values are close to even, our target column can be considered balanced.
An unbalanced target column, meaning some classes have far more samples, can be harder to model than a balanced set.
In an ideal world, all of your target classes have the same number of samples.
If you'd prefer these values as proportions, value_counts() takes a normalize parameter which can be set to True.
# Normalized value counts
df.target.value_counts(normalize=True)
target 1 0.544554 0 0.455446 Name: proportion, dtype: float64
We can plot the target column value counts by calling the plot() method and telling it what kind of plot we'd like. In this case, a bar plot is a good choice.
# Plot the value counts with a bar graph
df.target.value_counts().plot(kind="bar", color=["salmon", "lightblue"]);
pd.DataFrame.info() gives a quick insight into the number of missing values you have and what type of data you're working with.
In our case, there are no missing values and all of our columns are numerical in nature.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 303 entries, 0 to 302 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 age 303 non-null int64 1 sex 303 non-null int64 2 cp 303 non-null int64 3 trestbps 303 non-null int64 4 chol 303 non-null int64 5 fbs 303 non-null int64 6 restecg 303 non-null int64 7 thalach 303 non-null int64 8 exang 303 non-null int64 9 oldpeak 303 non-null float64 10 slope 303 non-null int64 11 ca 303 non-null int64 12 thal 303 non-null int64 13 target 303 non-null int64 dtypes: float64(1), int64(13) memory usage: 33.3 KB
Another way to get quick insights on your DataFrame is to use pd.DataFrame.describe(), which shows a range of different metrics about your numerical columns such as the mean, max and standard deviation.
df.describe()
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 | 303.000000 |
mean | 54.366337 | 0.683168 | 0.966997 | 131.623762 | 246.264026 | 0.148515 | 0.528053 | 149.646865 | 0.326733 | 1.039604 | 1.399340 | 0.729373 | 2.313531 | 0.544554 |
std | 9.082101 | 0.466011 | 1.032052 | 17.538143 | 51.830751 | 0.356198 | 0.525860 | 22.905161 | 0.469794 | 1.161075 | 0.616226 | 1.022606 | 0.612277 | 0.498835 |
min | 29.000000 | 0.000000 | 0.000000 | 94.000000 | 126.000000 | 0.000000 | 0.000000 | 71.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 47.500000 | 0.000000 | 0.000000 | 120.000000 | 211.000000 | 0.000000 | 0.000000 | 133.500000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 2.000000 | 0.000000 |
50% | 55.000000 | 1.000000 | 1.000000 | 130.000000 | 240.000000 | 0.000000 | 1.000000 | 153.000000 | 0.000000 | 0.800000 | 1.000000 | 0.000000 | 2.000000 | 1.000000 |
75% | 61.000000 | 1.000000 | 2.000000 | 140.000000 | 274.500000 | 0.000000 | 1.000000 | 166.000000 | 1.000000 | 1.600000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 |
max | 77.000000 | 1.000000 | 3.000000 | 200.000000 | 564.000000 | 1.000000 | 2.000000 | 202.000000 | 1.000000 | 6.200000 | 2.000000 | 4.000000 | 3.000000 | 1.000000 |
4.1 Comparing one feature to another¶
If you want to compare two columns to each other, you can use the function pd.crosstab(index, columns).
This is helpful if you want to start gaining an intuition about how your independent variables interact with your dependent variable.
Let's compare our target column with the sex column.
Remember from our data dictionary, for the target column, 1 = heart disease present, 0 = no heart disease.
And for sex, 1 = male, 0 = female.
df.sex.value_counts()
sex 1 207 0 96 Name: count, dtype: int64
There are 207 males and 96 females in our study.
What if we compared the target column values with the sex column values?
# Compare target column with sex column
pd.crosstab(index=df.target, columns=df.sex)
sex | 0 | 1 |
---|---|---|
target | ||
0 | 24 | 114 |
1 | 72 | 93 |
What can we infer from this? Let's make a simple heuristic.
Since there are 96 women and 72 of them have a positive value for heart disease, we might infer, based on this one variable, that if the participant is a woman, there's roughly a 75% chance (72/96 women in our dataset are positive for heart disease) she has heart disease.
As for males, there are 207 in total, with just under half (93/207, about 45%) indicating the presence of heart disease.
So we might predict, if the participant is male, that he'll have heart disease a little under 50% of the time.
Averaging these two values, we might assume, based on no other parameters, that a given person has roughly a 60% chance of having heart disease.
This can be our very simple baseline, we'll try to beat it with machine learning.
Note: A baseline is a simple model or estimate you start with and try to beat/confirm throughout your experiments. It can be as simple as looking at the data as we've done and creating a predictive heuristic to move forward.
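If you'd rather calculate this heuristic with code than by eye, here's a rough sketch using the crosstab above (the variable name is only illustrative):
# A rough sketch of the baseline heuristic above
sex_vs_target = pd.crosstab(index=df.target, columns=df.sex)
# Proportion of each sex with heart disease (column 0 = female, column 1 = male)
sex_vs_target.loc[1] / sex_vs_target.sum()  # ~0.75 for women, ~0.45 for men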
4.2 Making our comparison visual¶
I'm going to introduce you to a motto I remind myself of whenever I'm exploring data.
Visualize, visualize, visualize! - The data explorer's motto.
This is because it's very helpful whenever you're dealing with a new dataset to visualize as much as you can to build up an idea of the dataset in your head.
And one of the best ways to create visualizations is to make plots (graphical representations of our data).
We can plot our pd.crosstab comparison by calling the plot() method and passing it a few parameters:
- kind - The type of plot you want (e.g. "bar" for a bar plot).
- figsize=(width, height) - How big you want it to be.
- color=[colour_1, colour_2] - The different colours you'd like to use.
Different metrics are represented best with different kinds of plots.
In our case, a bar graph is great. We'll see examples of more later. And with a bit of practice, you'll gain an intuition of which plot to use with different variables.
# Create a plot
pd.crosstab(df.target, df.sex).plot(kind="bar",
figsize=(10,6),
color=["salmon", "lightblue"]);
Nice! But our plot is looking pretty bare. Let's add some attributes.
We'll create the plot again with pd.crosstab() and the plot() method.
Then, since our plot is built with matplotlib, we can add some helpful labels to it with plt.title(), plt.xlabel(), plt.legend() and more.
# Create a plot
pd.crosstab(df.target, df.sex).plot(kind="bar", figsize=(10,6), color=["salmon", "lightblue"])
# Add some attributes to it
plt.title("Heart Disease Frequency vs Sex")
plt.xlabel("0 = No Disease, 1 = Disease")
plt.ylabel("Amount")
plt.legend(["Female", "Male"])
plt.xticks(rotation=0); # keep the x-axis labels horizontal (rotation=0 means no rotation)
4.3 Comparing age and maximum heart rate¶
Let's try combining a couple of independent variables, such as age and thalach (maximum heart rate), and then comparing them to our target variable, heart disease.
Because there are so many different values for age and thalach, we'll use a scatter plot.
# Create another figure
plt.figure(figsize=(10,6))
# Start with positive examples
plt.scatter(df.age[df.target==1],
df.thalach[df.target==1],
c="salmon") # define it as a scatter figure
# Now for negative examples, we want them on the same plot, so we call plt again
plt.scatter(df.age[df.target==0],
df.thalach[df.target==0],
c="lightblue") # axis always come as (x, y)
# Add some helpful info
plt.title("Heart Disease as a function of Age and Max Heart Rate")
plt.xlabel("Age")
plt.legend(["Disease", "No Disease"])
plt.ylabel("Max Heart Rate");
What can we infer from this?
It seems the younger someone is, the higher their max heart rate (dots are higher on the left of the graph), and it seems there may be more heart disease in the younger population too (more salmon-coloured dots).
Both of these are observations, of course, but that's what we're trying to do: build an understanding of the data.
Let's check the age distribution.
Note: A distribution can be considered as the spread of the data. As in, when viewed as a whole, what different values appear in the data?
# Histograms are a great way to check the distribution of a variable
df.age.plot.hist();
We can see it's a roughly normal (bell-shaped) distribution shifted slightly towards older ages, which matches what we saw in the scatter plot above.
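If you'd like a number to back up the eyeball test, pandas Series have a built-in skew() method (0 means perfectly symmetric, a positive value means a longer right tail, a negative value a longer left tail):
# Quantify the asymmetry of the age column
df.age.skew()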
Let's keep going.
4.4 Comparing heart disease frequency and chest pain type¶
Let's try comparing another independent variable with our target variable.
This time, we'll use cp (chest pain) as the independent variable.
We'll use the same process as we did before with sex.
pd.crosstab(index=df.cp, columns=df.target)
target | 0 | 1 |
---|---|---|
cp | ||
0 | 104 | 39 |
1 | 9 | 41 |
2 | 18 | 69 |
3 | 7 | 16 |
# Create a new crosstab and base plot
pd.crosstab(df.cp, df.target).plot(kind="bar",
figsize=(10,6),
color=["lightblue", "salmon"])
# Add attributes to the plot to make it more readable
plt.title("Heart Disease Frequency Per Chest Pain Type")
plt.xlabel("Chest Pain Type")
plt.ylabel("Frequency")
plt.legend(["No Disease", "Disease"])
plt.xticks(rotation = 0);
What can we infer from this?
Remember from our data dictionary what the different levels of chest pain are.
Feature | Description | Example Values |
---|---|---|
cp | Chest pain type | 0: Typical angina (chest pain), 1: Atypical angina (chest pain not related to the heart), 2: Non-anginal pain (typically esophageal spasms, not heart related), 3: Asymptomatic (chest pain not showing signs of disease) |
It's interesting that atypical angina (value 1) states it's not related to the heart but seems to have a higher ratio of participants with heart disease than not.
Wait...?
What does atypical angina even mean?
At this point, it's important to remember that if your data dictionary doesn't supply you with enough information, you may want to do further research on your values.
This research may come in the form of asking a subject matter expert (such as a cardiologist or the person who gave you the data) or Googling to find out more.
According to PubMed, it seems even some medical professionals are confused by the term.
Today, 23 years later, “atypical chest pain” is still popular in medical circles. Its meaning, however, remains unclear. A few articles have the term in their title, but do not define or discuss it in their text. In other articles, the term refers to noncardiac causes of chest pain.
Although not conclusive, the plot above is a sign there may be a confusion of definitions being represented in the data.
4.5 Correlation between independent variables¶
Finally, we'll compare all of the independent variables in one hit.
Why?
Because this may give an idea of which independent variables may or may not have an impact on our target variable.
We can do this using pd.DataFrame.corr(), which will create a correlation matrix for us, in other words, a big table of numbers telling us how related each variable is to the others.
# Find the correlation between our independent variables
corr_matrix = df.corr()
corr_matrix
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
age | 1.000000 | -0.098447 | -0.068653 | 0.279351 | 0.213678 | 0.121308 | -0.116211 | -0.398522 | 0.096801 | 0.210013 | -0.168814 | 0.276326 | 0.068001 | -0.225439 |
sex | -0.098447 | 1.000000 | -0.049353 | -0.056769 | -0.197912 | 0.045032 | -0.058196 | -0.044020 | 0.141664 | 0.096093 | -0.030711 | 0.118261 | 0.210041 | -0.280937 |
cp | -0.068653 | -0.049353 | 1.000000 | 0.047608 | -0.076904 | 0.094444 | 0.044421 | 0.295762 | -0.394280 | -0.149230 | 0.119717 | -0.181053 | -0.161736 | 0.433798 |
trestbps | 0.279351 | -0.056769 | 0.047608 | 1.000000 | 0.123174 | 0.177531 | -0.114103 | -0.046698 | 0.067616 | 0.193216 | -0.121475 | 0.101389 | 0.062210 | -0.144931 |
chol | 0.213678 | -0.197912 | -0.076904 | 0.123174 | 1.000000 | 0.013294 | -0.151040 | -0.009940 | 0.067023 | 0.053952 | -0.004038 | 0.070511 | 0.098803 | -0.085239 |
fbs | 0.121308 | 0.045032 | 0.094444 | 0.177531 | 0.013294 | 1.000000 | -0.084189 | -0.008567 | 0.025665 | 0.005747 | -0.059894 | 0.137979 | -0.032019 | -0.028046 |
restecg | -0.116211 | -0.058196 | 0.044421 | -0.114103 | -0.151040 | -0.084189 | 1.000000 | 0.044123 | -0.070733 | -0.058770 | 0.093045 | -0.072042 | -0.011981 | 0.137230 |
thalach | -0.398522 | -0.044020 | 0.295762 | -0.046698 | -0.009940 | -0.008567 | 0.044123 | 1.000000 | -0.378812 | -0.344187 | 0.386784 | -0.213177 | -0.096439 | 0.421741 |
exang | 0.096801 | 0.141664 | -0.394280 | 0.067616 | 0.067023 | 0.025665 | -0.070733 | -0.378812 | 1.000000 | 0.288223 | -0.257748 | 0.115739 | 0.206754 | -0.436757 |
oldpeak | 0.210013 | 0.096093 | -0.149230 | 0.193216 | 0.053952 | 0.005747 | -0.058770 | -0.344187 | 0.288223 | 1.000000 | -0.577537 | 0.222682 | 0.210244 | -0.430696 |
slope | -0.168814 | -0.030711 | 0.119717 | -0.121475 | -0.004038 | -0.059894 | 0.093045 | 0.386784 | -0.257748 | -0.577537 | 1.000000 | -0.080155 | -0.104764 | 0.345877 |
ca | 0.276326 | 0.118261 | -0.181053 | 0.101389 | 0.070511 | 0.137979 | -0.072042 | -0.213177 | 0.115739 | 0.222682 | -0.080155 | 1.000000 | 0.151832 | -0.391724 |
thal | 0.068001 | 0.210041 | -0.161736 | 0.062210 | 0.098803 | -0.032019 | -0.011981 | -0.096439 | 0.206754 | 0.210244 | -0.104764 | 0.151832 | 1.000000 | -0.344029 |
target | -0.225439 | -0.280937 | 0.433798 | -0.144931 | -0.085239 | -0.028046 | 0.137230 | 0.421741 | -0.436757 | -0.430696 | 0.345877 | -0.391724 | -0.344029 | 1.000000 |
Following the data explorer's motto of visualize, visualize, visualize!, let's plot this matrix.
# Let's make it look a little prettier
corr_matrix = df.corr()
plt.figure(figsize=(15, 10))
sns.heatmap(corr_matrix,
annot=True,
linewidths=0.5,
fmt= ".2f",
cmap="YlGnBu");
Much better. A value closer to 1 suggests a potential positive correlation (as one variable increases, so does the other) and a value closer to -1 suggests a potential negative correlation (as one increases, the other decreases).
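Since we're mostly interested in how each feature relates to the target, one handy trick is to pull out just the target column of the correlation matrix and sort it (this is only a rough guide, correlation measures linear relationships and doesn't imply causation):
# Correlation of each feature with the target, sorted from most negative to most positive
corr_matrix["target"].sort_values()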
4.6 Enough EDA, let's model¶
Remember, we do exploratory data analysis (EDA) to start building an intuition of the dataset.
What have we learned so far?
Aside from our baseline estimate using sex, the rest of the data seems to require a bit more exploration before we draw any conclusions.
So what we'll do next is model-driven EDA, meaning we'll use machine learning models to drive our next questions.
A few extra things to remember:
- Not every EDA will look the same, what we've seen here is an example of what you could do for a structured, tabular dataset.
- You don't necessarily have to do the same plots as we've done here, there are many more ways to visualize data; I encourage you to explore more.
- Quite often, we'll want to find the following (see the short code sketch after this list):
  - Distributions - What's the spread of the data? We can check this with pd.DataFrame.hist(column="target_column").
  - Missing values - Is our data missing anything? Why might this be the case and will this affect us going forward? We can check this with pd.DataFrame.info() or pd.isnull().
  - Outliers - Are there any samples that lay quite far outside the rest of our data's distributions? How might these affect the data going forward?
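As a rough sketch of what those checks could look like in code for our dataset:
# A few quick EDA checks (a sketch of the checklist above)
print(df.isnull().sum())                  # missing values per column (all zeros for this dataset)
print(df.describe().loc[["min", "max"]])  # quick scan of min/max values for obvious outliers
df.hist(figsize=(10, 8));                 # distributions of every numerical column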
With this being said, let's build some models!
5. Modeling¶
We've explored the data, now we'll try to build a machine learning model to be able to predict our target variable based on the 13 independent variables.
Remember our problem?
Given clinical parameters about a patient, can we predict whether or not they have heart disease?
That's what we'll be trying to answer.
And remember our evaluation metric?
If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we'll pursue this project.
That's what we'll be aiming for.
But before we build a model, we have to get our dataset ready.
Let's look at it again.
df.head()
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
We're trying to predict our target variable using all of the other variables.
To do this, we'll split the target variable from the rest.
We can do this by creating:
- X - Our features (all variables except the target column), using pd.DataFrame.drop(labels="target").
- y - Our target variable, using df.target.to_numpy() (this will extract the target column as a NumPy array).
# Everything except target variable
X = df.drop(labels="target", axis=1)
# Target variable
y = df.target.to_numpy()
Let's see our new variables.
# Independent variables (no target column)
X.head()
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 |
1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 |
2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 |
3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 |
4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 |
# Targets (in the form of a NumPy array)
y, type(y)
(array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), numpy.ndarray)
5.1 Creating a training and test split¶
Now comes one of the most important concepts in machine learning, creating a training/test split.
This is where we'll split our data into a training set and a test set.
We'll use our training set to train our model and our test set to evaluate it.
All the samples in the training set must be separate from those in the test set (and vice versa).
In short:
- Training set (often 70-80% of total data) - Model learns patterns on this dataset to hopefully be able to predict on similar but unseen samples.
- Testing set (often 20-30% of total data) - Trained model gets evaluated on these unseen samples to see how the patterns learned from the training set may perform on future unseen samples (e.g. when used in an application or production setting). However, performance on the test set is not guaranteed in the real world.
Why not use all the data to train a model?¶
Let's say you wanted to take your model into the hospital and start using it on patients.
How would you know how well your model performs on a new patient not included in the original full dataset you had?
This is where the test set comes in.
It's used to mimic taking your model to a real environment as much as possible.
And it's why it's important to never let your model learn from the test set; it should only be evaluated on it.
To split our data into a training and test set, we can use Scikit-Learn's sklearn.model_selection.train_test_split() and feed it our independent and dependent variables (X and y).
# Random seed for reproducibility (since train_test_split is random by default, setting the seed will create reproducible splits)
np.random.seed(42)
# Split into train & test set
X_train, X_test, y_train, y_test = train_test_split(X, # independent variables
y, # dependent variable
test_size = 0.2) # percentage of data to use for test set
The test_size parameter is used to tell the train_test_split() function how much of our data we want in the test set.
A rule of thumb is to use 80% of your data to train on and the other 20% to test on.
For our problem, a train and test set are enough. But for other problems, you could also use a validation (train/validation/test) set or cross-validation (we'll see this later on).
But again, each problem will differ.
To learn more about the importance of validation and test sets, I'd recommend reading:
- How (and why) to create a good validation set by Rachel Thomas.
- The importance of a test set by Daniel Bourke.
Let's look at our training data.
X_train.head()
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
132 | 42 | 1 | 1 | 120 | 295 | 0 | 1 | 162 | 0 | 0.0 | 2 | 0 | 2 |
202 | 58 | 1 | 0 | 150 | 270 | 0 | 0 | 111 | 1 | 0.8 | 2 | 0 | 3 |
196 | 46 | 1 | 2 | 150 | 231 | 0 | 1 | 147 | 0 | 3.6 | 1 | 0 | 2 |
75 | 55 | 0 | 1 | 135 | 250 | 0 | 0 | 161 | 0 | 1.4 | 1 | 0 | 2 |
176 | 60 | 1 | 0 | 117 | 230 | 1 | 1 | 160 | 1 | 1.4 | 2 | 2 | 3 |
y_train, len(y_train)
(array([1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1]), 242)
Beautiful, we can see we're using 242 samples to train on.
Let's look at our test data.
X_test.head()
age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
179 | 57 | 1 | 0 | 150 | 276 | 0 | 0 | 112 | 1 | 0.6 | 1 | 1 | 1 |
228 | 59 | 1 | 3 | 170 | 288 | 0 | 0 | 159 | 0 | 0.2 | 1 | 0 | 3 |
111 | 57 | 1 | 2 | 150 | 126 | 1 | 1 | 173 | 0 | 0.2 | 2 | 1 | 3 |
246 | 56 | 0 | 0 | 134 | 409 | 0 | 0 | 150 | 1 | 1.9 | 1 | 2 | 3 |
60 | 71 | 0 | 2 | 110 | 265 | 1 | 0 | 130 | 0 | 0.0 | 2 | 1 | 2 |
y_test, len(y_test)
(array([0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0]), 61)
And we've got 61 examples we'll test our model(s) on.
Let's build some.
5.2 Choosing a model¶
Now we've got our data prepared, we can start to fit models.
In the modern world of machine learning, there are many potential models we can choose from.
Rather than trying every potential model, it's often good practice to try a handful and see how they go.
We'll start by trying the following models and comparing their results.
- Logistic Regression - sklearn.linear_model.LogisticRegression()
- K-Nearest Neighbors - sklearn.neighbors.KNeighborsClassifier()
- Random Forest - sklearn.ensemble.RandomForestClassifier()
Why these?¶
If we look at the Scikit-Learn machine learning model map, we can see we're working on a classification problem and these are the algorithms it suggests (plus a few more).
An example path we can take using the Scikit-Learn Machine Learning Map
"Wait, I don't see Logistic Regression and why not use LinearSVC?"
Good questions.
I was confused too when I didn't see Logistic Regression listed, because when you read the Scikit-Learn documentation on it, you can see it's a model for classification.
And as for sklearn.svm.LinearSVC, let's pretend we've tried it (you can try it for yourself if you like), and it doesn't work, so we're following other options in the map.
For now, knowing each of these algorithms inside and out is not essential (however, this would be a fantastic extension to this project).
Machine learning and data science is an iterative practice.
These algorithms are tools in your toolbox.
In the beginning, on your way to becoming a practitioner, it's more important to understand your problem (such as, classification versus regression) and what tools you can use to solve it.
Since our dataset is relatively small, we can run some quick experiments to see which algorithm performs best and iteratively try to improve it.
Many of the algorithms in the Scikit-Learn library have similar APIs (Application Programming Interfaces).
For example, to train a model you can use model.fit(X_train, y_train).
And to score a model, model.score(X_test, y_test) (scoring a model compares its predictions to the ground truth labels).
For classification models, calling score() usually defaults to returning the ratio of correct predictions, also known as accuracy (1.0 = 100% correct).
Since the algorithms we've chosen implement the same methods for fitting them to the data as well as evaluating them, let's put them in a dictionary and create a function which fits and scores them.
# Put models in a dictionary
models = {"KNN": KNeighborsClassifier(),
"Logistic Regression": LogisticRegression(max_iter=100), # Note: if you see a warning about "convergence not reached", you can increase `max_iter` until convergence is reached
"Random Forest": RandomForestClassifier()}
# Create function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
"""
Fits and evaluates given machine learning models.
models : a dict of different Scikit-Learn machine learning models
X_train : training data
X_test : testing data
y_train : labels associated with training data
y_test : labels associated with test data
"""
# Random seed for reproducible results
np.random.seed(42)
# Make a dictionary to keep model scores
model_scores = {}
# Loop through models
for name, model in models.items():
# Fit the model to the data
model.fit(X_train, y_train)
# Evaluate the model and append its score to model_scores
model_scores[name] = model.score(X_test, y_test)
return model_scores
Function built!
Now let's see how our collection of models performs on our data.
model_scores = fit_and_score(models=models,
X_train=X_train,
X_test=X_test,
y_train=y_train,
y_test=y_test)
model_scores
/Users/daniel/miniforge3/envs/ai/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:469: ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT. Increase the number of iterations (max_iter) or scale the data as shown in: https://scikit-learn.org/stable/modules/preprocessing.html Please also refer to the documentation for alternative solver options: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression n_iter_i = _check_optimize_result(
{'KNN': 0.6885245901639344, 'Logistic Regression': 0.8852459016393442, 'Random Forest': 0.8360655737704918}
Beautiful!
It looks like each of our models was able to fit our data without any errors.
How about we compare them visually?
5.3 Comparing the results of several models¶
Since we've saved our models' scores to a dictionary, we can plot them by first converting them to a DataFrame.
model_compare = pd.DataFrame(model_scores, index=['accuracy'])
model_compare.T.plot.bar();
Beautiful!
From the plot it looks like the sklearn.linear_model.LogisticRegression() model performs best.
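As a quick programmatic check, we can also pull the top scorer straight out of the scores dictionary:
# Find the model with the highest test score
best_model_name = max(model_scores, key=model_scores.get)
best_model_name, model_scores[best_model_name]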
Now... since we've found the best model.
Let's take it to the boss and show her what we've found!
5.4 Taking our best model to the boss (and learning about a few new terms)¶
A conversation with the boss/senior engineer begins...
You: I've found it!
Her: Nice one! What did you find?
You: The best algorithm for predicting heart disease is a Logistic Regression!
Her: Excellent. I'm surprised the hyperparameter tuning is finished by now.
You: wonders what hyperparameter tuning is
You: Ummm yeah, me too, it went pretty quick.
Her: I'm very proud, how about you put together a classification report to show the team, and be sure to include a confusion matrix, and the cross-validated precision, recall and F1 scores. I'd also be curious to see what features are most important. Oh and don't forget to include a ROC curve.
You: asks self, "what are those???"
You: Of course! I'll have it to you by tomorrow.
Alright, there were a few words in there that could sound made up to someone who's not a budding data scientist like us.
But being the budding data scientist we are, we also know data scientists make up words all the time.
Let's briefly go through each before we see them in action.
Term | Definition |
---|---|
Hyperparameter tuning | Many machine learning models have a series of settings/dials you can turn to dictate how they perform. Changing these values may increase or decrease model performance. The practice of figuring out the best settings for a model is called hyperparameter tuning. |
Feature importance | If there are a large amount of features we're using to make predictions, do some have more importance than others? For example, for predicting heart disease, which is more important, sex or age? |
Confusion matrix | Compares the predicted values with the true values in a tabular way. If the model is 100% correct, all values in the matrix will lie on the diagonal from top left to bottom right. |
Cross-validation | Splits your dataset into multiple versions of training and test sets and trains/evaluates your model on each version. This ensures your evaluation metrics are averaged across several different splits of data rather than a single split (with only a single split, you might get lucky and see better than usual results; likewise, a poor split might make your metrics look worse than they should be). |
Precision | A common classification evaluation metric. Measures the proportion of true positives over the total number of predicted positives (true positives + false positives). Higher precision leads to fewer false positives. |
Recall | A common classification evaluation metric. Measures the proportion of true positives over the total number of true positives and false negatives. Higher recall leads to fewer false negatives. |
F1 score | Combines precision and recall into one metric. 1 is best, 0 is worst. |
Classification report | Scikit-Learn has a built-in function called classification_report() which returns some of the main classification metrics such as precision, recall and F1-score. |
ROC Curve | A Receiver Operating Characteristic curve is a plot of the true positive rate versus the false positive rate. A perfect curve will follow the left and top borders of the plot. |
Area Under Curve (AUC) | The area underneath the ROC curve. A perfect model achieves a score of 1.0. |
Woah!
There are a fair few things going on here but nothing we can't handle.
We'll explore each of these further throughout the rest of the notebook.
In the meantime, feel free to read more at the linked resources.
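As a quick preview of how a few of the metrics in the table above can be calculated with the functions we imported earlier (we'll work through them properly later in the notebook), here's a minimal sketch using the Logistic Regression model that was fitted inside fit_and_score() above (the variable name preview_preds is just illustrative):
# A quick preview of a few classification metrics (covered in detail later on)
preview_preds = models["Logistic Regression"].predict(X_test)  # model was fitted earlier in fit_and_score()
print(f"Precision: {precision_score(y_test, preview_preds):.2f}")
print(f"Recall: {recall_score(y_test, preview_preds):.2f}")
print(f"F1 score: {f1_score(y_test, preview_preds):.2f}")
print(confusion_matrix(y_test, preview_preds))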
6. Hyperparameter tuning and cross-validation¶
To cook your favourite dish, you know to set the oven to 180 degrees and turn the grill on.
But when your roommate cooks their favourite dish, they use 200 degrees and the fan-forced mode.
Same oven, different settings, different outcomes.
The same can be done for machine learning algorithms. You can use the same algorithms but change the settings (hyperparameters) and get different results.
But just like turning the oven up too high can burn your food, the same can happen for machine learning algorithms.
You change the settings and the model works so well on the data it has seen that it overfits (does too well on the training data).
We're looking for the Goldilocks model.
One which does well on our training dataset but also on unseen examples like in the testing dataset/real world.
To test different hyperparameters, you could use a validation set but since we don't have much data, we'll use cross-validation.
Note: A validation set is a third player in the training/test split game. It's designed to be used in between a training and test set. You can think of it as the practice exam before the final exam. As in, the training set is the course material to learn on, the validation set is the practice exam to practice and tweak your skills on and the test set is the final exam to push your skills. In machine learning, the model learns patterns on the training set and then you can tweak hyperparameters to improve results on the validation set before finally testing your model on the testing set. All samples in the training, validation and test sets should be kept exclusive of each other.
The most common type of cross-validation is k-fold.
It involves splitting your data into k folds (k different splits of the data) and then training and testing a model on each split.
For example, let's say we had 5 folds (k = 5).
This is what it might look like.
Normal train and test split versus 5-fold cross-validation
You have 5 different versions of train and test splits.
This means you'll be able to train and test 5 different versions of your model on different data splits and calculate the average performance.
Why do this?
This prevents us from focusing too much on the metrics from one data split (imagine the data split we do contains all the easy samples and the performance metrics we use say that the model performs better than it does).
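As a quick sketch of what k-fold cross-validation looks like in code (using cross_val_score, which we imported earlier, with an untuned LogisticRegression as the example estimator and max_iter raised to help it converge):
# A sketch of 5-fold cross-validation with cross_val_score (we'll use this properly later on)
example_cv_scores = cross_val_score(estimator=LogisticRegression(max_iter=1000),
                                    X=X,
                                    y=y,
                                    cv=5,               # 5 folds
                                    scoring="accuracy") # evaluation metric for each fold
example_cv_scores, example_cv_scores.mean()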
We'll be using this setup to tune the hyperparameters of some of our models and then evaluate them.
We'll also get a few more metrics like precision, recall, F1-score and ROC at the same time.
Here's the plan:
- Tune model hyperparameters, and see which performs best
- Perform cross-validation
- Plot ROC curves
- Make a confusion matrix
- Get precision, recall and F1-score metrics
- Find the most important model features
6.1 Tune KNeighborsClassifier (K-Nearest Neighbors or KNN) by hand¶
There are several hyperparameters we can tune for the K-Nearest Neighbors (KNN) algorithm (sklearn.neighbors.KNeighborsClassifier).
But for now, let's start with one, the number of neighbors.
The default is 5 (n_neighbors=5).
What are neighbours?
Well, imagine all our different samples on one graph like the scatter graph several cells above.
KNN works by assuming dots which are closer together belong to the same class.
If n_neighbors=5, then it assumes a dot and the 5 closest dots around it are in the same class.
We've left out some details here like what defines close or how distance is calculated but I encourage you to research them by going through the documentation.
For now, let's try a few different values of n_neighbors and see how the results go.
# Create a list of train scores
train_scores = []
# Create a list of test scores
test_scores = []
# Create a list of different values for n_neighbors
neighbors = range(1, 21) # 1 to 20
# Setup algorithm
knn = KNeighborsClassifier()
# Loop through different neighbors values
for i in neighbors:
knn.set_params(n_neighbors = i) # set neighbors value
# Fit the algorithm
knn.fit(X_train, y_train)
# Update the training scores
train_scores.append(knn.score(X_train, y_train))
# Update the test scores
test_scores.append(knn.score(X_test, y_test))
That was quick!
Now let's look at KNN's train scores.
train_scores
[1.0, 0.8099173553719008, 0.7727272727272727, 0.743801652892562, 0.7603305785123967, 0.7520661157024794, 0.743801652892562, 0.7231404958677686, 0.71900826446281, 0.6942148760330579, 0.7272727272727273, 0.6983471074380165, 0.6900826446280992, 0.6942148760330579, 0.6859504132231405, 0.6735537190082644, 0.6859504132231405, 0.6652892561983471, 0.6818181818181818, 0.6694214876033058]
Ok, these are a bit hard to understand, so let's follow the data explorer's motto and visualize, visualize, visualize! In other words, let's plot them.
plt.plot(neighbors, train_scores, label="Train score")
plt.plot(neighbors, test_scores, label="Test score")
plt.xticks(np.arange(1, 21, 1))
plt.xlabel("Number of neighbors")
plt.ylabel("Model score")
plt.legend()
print(f"Maximum KNN score on the test data: {max(test_scores)*100:.2f}%")
Maximum KNN score on the test data: 75.41%
Looking at the graph, n_neighbors = 11 seems best.
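Rather than reading the value off the graph, we could also check it programmatically:
# Which value of n_neighbors gave the highest test score?
best_n_neighbors = list(neighbors)[np.argmax(test_scores)]
print(f"Best n_neighbors: {best_n_neighbors} with a test score of {max(test_scores)*100:.2f}%")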
Even knowing this, the KNN model's performance didn't get near what LogisticRegression or the RandomForestClassifier did.
Because of this, we'll discard KNN and focus on the other two.
We've tuned KNN by hand, but let's see how we can tune LogisticRegression and RandomForestClassifier using RandomizedSearchCV.
Instead of us having to try different hyperparameters by hand, RandomizedSearchCV tries a number of different combinations, evaluates them and saves the best.
6.2 Tuning models with RandomizedSearchCV¶
Reading the Scikit-Learn documentation for LogisticRegression, we find there are a number of different hyperparameters we can tune.
The same goes for RandomForestClassifier.
Let's create a hyperparameter grid (a dictionary of different hyperparameters) for each and then test them out.
Note: Be careful when creating a hyperparameter dictionary for tuning; if there are typos in the dictionary keys, your hyperparameter tuning code will produce errors.
# Different LogisticRegression hyperparameters
log_reg_grid = {"C": np.logspace(-4, 4, 20),
"solver": ["liblinear"]}
# Different RandomForestClassifier hyperparameters
rf_grid = {"n_estimators": np.arange(10, 1000, 50),
"max_depth": [None, 3, 5, 10],
"min_samples_split": np.arange(2, 20, 2),
"min_samples_leaf": np.arange(1, 20, 2)}
Now let's use sklearn.model_selection.RandomizedSearchCV to try and tune our LogisticRegression model.
We'll pass it the different hyperparameters from log_reg_grid as well as set n_iter=20. This means RandomizedSearchCV will try 20 different combinations of hyperparameters from log_reg_grid and save the best ones.
%%time
# Setup random seed
np.random.seed(42)
# Setup random hyperparameter search for LogisticRegression
rs_log_reg = RandomizedSearchCV(LogisticRegression(),
param_distributions=log_reg_grid,
cv=5,
n_iter=20,
verbose=True)
# Fit random hyperparameter search model
rs_log_reg.fit(X_train, y_train);
Fitting 5 folds for each of 20 candidates, totalling 100 fits CPU times: user 160 ms, sys: 7.51 ms, total: 168 ms Wall time: 193 ms
rs_log_reg.best_params_
{'solver': 'liblinear', 'C': np.float64(0.23357214690901212)}
rs_log_reg.score(X_test, y_test)
0.8852459016393442
Nice! That seems on par with the result we got before without any hyperparameter tuning.
Note: Many of the algorithms in Scikit-Learn have pretty good default hyperparameter values, so don't be surprised if they perform well on your data straight out of the box. But don't take this as being true all the time. Just because the default hyperparameters perform well on your data doesn't mean there isn't a better set of hyperparameter values out there.
Now we've tuned LogisticRegression using RandomizedSearchCV, we'll do the same for RandomForestClassifier.
%%time
# Setup random seed
np.random.seed(42)
# Setup random hyperparameter search for RandomForestClassifier
rs_rf = RandomizedSearchCV(RandomForestClassifier(),
param_distributions=rf_grid,
cv=5,
n_iter=20,
verbose=True)
# Fit random hyperparameter search model
rs_rf.fit(X_train, y_train);
Fitting 5 folds for each of 20 candidates, totalling 100 fits CPU times: user 21.6 s, sys: 144 ms, total: 21.8 s Wall time: 22.1 s
# Find the best parameters
rs_rf.best_params_
{'n_estimators': np.int64(210), 'min_samples_split': np.int64(4), 'min_samples_leaf': np.int64(19), 'max_depth': 3}
# Evaluate the randomized search random forest model
rs_rf.score(X_test, y_test)
0.8688524590163934
Excellent! Tuning the hyperparameters gave the RandomForestClassifier a slight performance boost, while the LogisticRegression result stayed about the same as before.
This is akin to tuning the settings on your oven and getting it to cook your favourite dish just right.
But since LogisticRegression is still out in front, we'll try tuning it further with GridSearchCV.
6.3 Tuning a model with GridSearchCV¶
The difference between RandomizedSearchCV and GridSearchCV is:
- sklearn.model_selection.RandomizedSearchCV searches over a grid of hyperparameters performing n_iter combinations (i.e. it will explore random combinations of the hyperparameters for a defined number of iterations).
- sklearn.model_selection.GridSearchCV will test every single possible combination of hyperparameters in the grid (this is a thorough test but can take quite a long time).
Each class will save the best model at the end of testing.
Let's see it in action.
%%time
# Different LogisticRegression hyperparameters
log_reg_grid = {"C": np.logspace(-4, 4, 20),
"solver": ["liblinear"]}
# Setup grid hyperparameter search for LogisticRegression
gs_log_reg = GridSearchCV(LogisticRegression(),
param_grid=log_reg_grid,
cv=5,
verbose=True)
# Fit grid hyperparameter search model
gs_log_reg.fit(X_train, y_train);
Fitting 5 folds for each of 20 candidates, totalling 100 fits CPU times: user 161 ms, sys: 2.41 ms, total: 163 ms Wall time: 212 ms
# Check the best parameters
gs_log_reg.best_params_
{'C': np.float64(0.23357214690901212), 'solver': 'liblinear'}
# Evaluate the model
gs_log_reg.score(X_test, y_test)
0.8852459016393442
In this case, we get the same results as before since our grid only has a maximum of 20 different hyperparameter combinations.
Note: If there are a large number of hyperparameter combinations in your grid, GridSearchCV may take a long time to try them all out. This is why it's a good idea to start with RandomizedSearchCV, try a certain amount of combinations, and then use GridSearchCV to refine them.
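For example, one way of following up our randomized search would be to build a smaller grid centred around the best C value it found and run GridSearchCV over that instead (the exact values and variable names below are only illustrative, not a prescribed recipe):
# A sketch of refining the search with a narrower grid around the best C value found above
refined_log_reg_grid = {"C": np.linspace(0.1, 0.5, 10),  # illustrative range around C ~= 0.23
                        "solver": ["liblinear"]}
gs_log_reg_refined = GridSearchCV(LogisticRegression(),
                                  param_grid=refined_log_reg_grid,
                                  cv=5,
                                  verbose=True)
gs_log_reg_refined.fit(X_train, y_train)
gs_log_reg_refined.best_params_, gs_log_reg_refined.score(X_test, y_test)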
7. Evaluating a classification model, beyond accuracy¶
Now we've got a tuned model, let's get some of the metrics we discussed before.
We want:
Metric/Evaluation Technique | Scikit-Learn method/documentation |
---|---|
ROC curve and AUC score | sklearn.metrics.RocCurveDisplay() (prior to Scikit-Learn 1.2 this was sklearn.metrics.plot_roc_curve()) |
Confusion matrix | sklearn.metrics.confusion_matrix() |
Classification report | sklearn.metrics.classification_report() |
Precision | sklearn.metrics.precision_score() |
Recall | sklearn.metrics.recall_score() |
F1-score | sklearn.metrics.f1_score() |
Luckily, Scikit-Learn has these all built-in.
What many evaluation metrics have in common is that they compare model predictions to ground truth data.
So we'll need some model predictions!
To access them, we'll have to use our model to make predictions on the test set.
We can make predictions by calling predict()
on a trained model and passing it the data you'd like to predict on.
We'll make predictions on the test data.
Note: When making predictions with a trained model, the data you're trying to predict on must be in the same format your model was trained on. For example, if a model was trained with data formatted in a certain way, it's important to make future predictions on data formatted in that same way.
# Make predictions on test data
y_preds = gs_log_reg.predict(X_test)
Let's see them.
y_preds
array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])
They look like our original test data labels, except they differ where the model has predicted incorrectly.
y_test
array([0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])
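As a quick sanity check on how many of these predictions match the true labels, you can compare the two arrays directly. A minimal sketch (this is just the accuracy computed by hand):
# Proportion of predictions that match the true test labels (accuracy by hand)
np.mean(y_preds == y_test)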
Since we've got our prediction values we can find the metrics we want.
Let's start with the ROC curve and AUC scores.
7.1 ROC Curve and AUC Scores¶
What's a ROC curve?
It's a way of understanding how your model is performing by comparing the true positive rate to the false positive rate.
In our case, the true positive rate is the proportion of patients who do have heart disease that the model correctly identifies, and the false positive rate is the proportion of healthy patients the model incorrectly flags as having heart disease.
For a real-world analogy, consider a diagnostic test that seeks to determine whether a person has a certain disease. A false positive occurs when the person tests positive but does not actually have the disease. A false negative, on the other hand, occurs when the person tests negative, suggesting they are healthy, when they actually do have the disease.
Scikit-Learn implements a class called RocCurveDisplay (which replaces the plot_roc_curve function from Scikit-Learn versions below 1.2) that can help us create a ROC curve as well as calculate the area under the curve (AUC) metric.
Reading the documentation for RocCurveDisplay, we can see it has a class method called from_estimator() which takes (estimator, X, y) as inputs.
Where estimator
is a fitted machine learning model and X
and y
are the data you'd like to test it on.
In our case, we'll use the GridSearchCV version of our LogisticRegression
estimator, gs_log_reg
as well as the test data, X_test
and y_test
.
# Before Scikit-Learn 1.2.0 (will error with versions 1.2+)
# from sklearn.metrics import plot_roc_curve
# plot_roc_curve(gs_log_reg, X_test, y_test);
# Scikit-Learn 1.2.0 or later
from sklearn.metrics import RocCurveDisplay
# from_estimator() = use a model to plot ROC curve on data
RocCurveDisplay.from_estimator(estimator=gs_log_reg,
X=X_test,
y=y_test);
This is great, our model does far better than random guessing, which would produce a diagonal line from the bottom left corner to the top right corner (AUC = 0.5).
But a perfect model would achieve an AUC score of 1.0, so there's still room for improvement.
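If you'd also like the AUC value as a standalone number rather than reading it off the plot, you can compute it with sklearn.metrics.roc_auc_score. A minimal sketch, assuming gs_log_reg and the test split are still in memory; note that roc_auc_score expects the probability of the positive class, not the hard labels:
# Calculate the AUC score directly
from sklearn.metrics import roc_auc_score

y_probs_positive = gs_log_reg.predict_proba(X_test)[:, 1]  # probabilities for class 1
roc_auc_score(y_test, y_probs_positive)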
Let's move on to the next evaluation request, a confusion matrix.
7.2 Creating a confusion matrix¶
A confusion matrix is a visual way to show where your model made the right predictions and where it made the wrong predictions (or in other words, got confused).
Scikit-Learn allows us to create a confusion matrix using sklearn.metrics.confusion_matrix()
and passing it the true labels and predicted labels.
# Display confusion matrix
print(confusion_matrix(y_test, y_preds))
[[25 4] [ 3 29]]
As you can see, Scikit-Learn's built-in confusion matrix is a bit bland. For a presentation you'd probably want to make it visual.
Let's create a function which uses Seaborn's heatmap()
for doing so.
# Import Seaborn
import seaborn as sns
sns.set(font_scale=1.5) # Increase font size
def plot_conf_mat(y_test, y_preds):
    """
    Plots a confusion matrix using Seaborn's heatmap().
    """
    fig, ax = plt.subplots(figsize=(3, 3))
    ax = sns.heatmap(confusion_matrix(y_test, y_preds),
                     annot=True, # Annotate the boxes
                     cbar=False)
    plt.xlabel("Predicted label") # confusion_matrix() puts predicted labels in the columns
    plt.ylabel("True label") # and true labels in the rows
plot_conf_mat(y_test, y_preds)
Beautiful! That looks much better.
You can see the model gets confused (predicts the wrong label) a relatively similar amount across both classes.
In essence, there are 4 occasions where the model predicted 1 when it should've been 0 (false positives) and 3 occasions where the model predicted 0 when it should've been 1 (false negatives).
As further evaluation, we could look into these samples and see why this may be the case.
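Scikit-Learn also has a built-in plotting helper for confusion matrices, sklearn.metrics.ConfusionMatrixDisplay (available in recent versions), which you could use instead if you'd rather not bring in Seaborn. A minimal sketch using the predictions from above:
# Plot a confusion matrix with Scikit-Learn's built-in display class
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(y_true=y_test,
                                        y_pred=y_preds);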
7.3 Classification report¶
A classification report is a collection of different metrics and other details.
We can make a classification report using sklearn.metrics.classification_report(y_true, y_pred)
and passing it the true labels as well as our model's predicted labels.
A classification report will also give us information on the precision and recall of our model for each class.
# Show classification report
print(classification_report(y_test, y_preds))
              precision    recall  f1-score   support

           0       0.89      0.86      0.88        29
           1       0.88      0.91      0.89        32

    accuracy                           0.89        61
   macro avg       0.89      0.88      0.88        61
weighted avg       0.89      0.89      0.89        61
What's going on here?
Let's refresh ourselves on each of the above metrics.
Metric/metadata | Explanation |
---|---|
Precision | Indicates the proportion of positive identifications (model predicted class 1) which were actually correct. A model which produces no false positives has a precision of 1.0. |
Recall | Indicates the proportion of actual positives which were correctly classified. A model which produces no false negatives has a recall of 1.0. |
F1 score | A combination of precision and recall. A perfect model achieves an F1 score of 1.0. |
Support | The number of samples each metric was calculated on. |
Accuracy | The accuracy of the model in decimal form. Perfect accuracy is equal to 1.0. |
Macro avg | Short for macro average, the average precision, recall and F1 score between classes. Macro avg doesn't take class imbalance into account, so if you do have class imbalances, pay attention to this metric. |
Weighted avg | Short for weighted average, the weighted average precision, recall and F1 score between classes. Weighted means each metric is calculated with respect to how many samples there are in each class. This metric will favour the majority class (e.g. it will give a high value when one class outperforms another due to having more samples). |
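The table above also lists sklearn.metrics.precision_score(), recall_score() and f1_score(). If you wanted these as standalone numbers (rather than reading them out of the classification report), a minimal sketch using the predictions from above; by default these score the positive class (1):
# Calculate precision, recall and F1-score individually
from sklearn.metrics import precision_score, recall_score, f1_score

print(f"Precision: {precision_score(y_test, y_preds):.2f}")
print(f"Recall: {recall_score(y_test, y_preds):.2f}")
print(f"F1-score: {f1_score(y_test, y_preds):.2f}")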
Ok, now we've got a few deeper insights on our model.
But these were all calculated using a single training and test set.
What we'll do to make them more solid is calculate them using cross-validation.
How?
We'll take the best model along with the best hyperparameters and use cross_val_score()
along with various scoring
parameter values.
cross_val_score()
works by taking an estimator (machine learning model) along with data and labels.
It then evaluates the machine learning model on the data and labels using cross-validation across cv=5
(the default number of splits) splits and a defined scoring
parameter.
Let's remind ourselves of the best hyperparameters and then see them in action.
# Check best hyperparameters
gs_log_reg.best_params_
{'C': np.float64(0.23357214690901212), 'solver': 'liblinear'}
# Import cross_val_score
from sklearn.model_selection import cross_val_score
# Instantiate best model with best hyperparameters (found with GridSearchCV)
clf = LogisticRegression(C=0.23357214690901212,
solver="liblinear")
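As an aside, rather than copying the hyperparameter values across by hand, you could pull the best model straight out of the grid search object. A minimal sketch using sklearn.base.clone to get a fresh, unfitted copy (equivalent to the clf we just created):
# Alternative: take an unfitted copy of the best model found by GridSearchCV
from sklearn.base import clone

best_clf = clone(gs_log_reg.best_estimator_)  # same hyperparameters as clf above, not yet fitted
best_clf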
Now we've got an instantiated classifier, let's find some cross-validated metrics.
%%time
# Cross-validated accuracy score
cv_acc = cross_val_score(clf,
X,
y,
cv=5, # 5-fold cross-validation, this is the default
scoring="accuracy") # accuracy as scoring
cv_acc
CPU times: user 9.91 ms, sys: 1.35 ms, total: 11.3 ms Wall time: 9.75 ms
array([0.81967213, 0.90163934, 0.8852459 , 0.88333333, 0.75 ])
Woah!
The output from cross_val_score()
shows 5 different metrics across different splits of the data.
This goes to show the power of cross-validation.
If we had only gone with the results of one data split, we might have thought our model was performing worse (or better) than it really is.
Since there are 5 metrics here, we'll take the average.
cv_acc = np.mean(cv_acc)
cv_acc
np.float64(0.8479781420765027)
Now we'll do the same for other classification metrics.
# Cross-validated precision score
cv_precision = np.mean(cross_val_score(clf,
X,
y,
cv=5, # 5-fold cross-validation
scoring="precision")) # precision as scoring
cv_precision
np.float64(0.8215873015873015)
# Cross-validated recall score
cv_recall = np.mean(cross_val_score(clf,
X,
y,
cv=5, # 5-fold cross-validation
scoring="recall")) # recall as scoring
cv_recall
np.float64(0.9272727272727274)
# Cross-validated F1 score
cv_f1 = np.mean(cross_val_score(clf,
X,
y,
cv=5, # 5-fold cross-validation
scoring="f1")) # f1 as scoring
cv_f1
np.float64(0.8705403543192143)
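Calling cross_val_score() once per metric works, but each call refits the model across all the folds again. Scikit-Learn's cross_validate() function can compute several metrics in a single pass. A minimal sketch, assuming clf, X and y from above:
# Compute several cross-validated metrics in one go
from sklearn.model_selection import cross_validate

cv_results = cross_validate(clf,
                            X,
                            y,
                            cv=5,
                            scoring=["accuracy", "precision", "recall", "f1"])

# Average each metric across the 5 folds
{metric: np.mean(scores) for metric, scores in cv_results.items() if metric.startswith("test_")}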
Okay, we've got cross validated metrics, now what?
Let's visualize them.
# Visualizing cross-validated metrics
cv_metrics = pd.DataFrame({"Accuracy": cv_acc,
"Precision": cv_precision,
"Recall": cv_recall,
"F1": cv_f1},
index=[0])
cv_metrics.T.plot.bar(title="Cross-Validated Metrics", legend=False);
Great! This looks like something we could share. An extension might be adding the metrics on top of each bar so someone can quickly tell what they were.
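For that extension, Matplotlib's Axes.bar_label() method (available in Matplotlib 3.4+) can write each bar's value above it. A minimal sketch re-using the cv_metrics DataFrame:
# Plot cross-validated metrics with the value written above each bar
ax = cv_metrics.T.plot.bar(title="Cross-Validated Metrics", legend=False)
for container in ax.containers:
    ax.bar_label(container, fmt="%.2f") # add a value label to each bar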
What now?
The final thing to check off the list of our model evaluation techniques is feature importance.
8. Feature importance¶
Feature importance is another way of asking, "Which features contribute most to the outcomes of the model?"
For our problem, trying to predict heart disease using a patient's medical characteristics, getting the feature importance is like asking "Which characteristics contribute most to a model predicting whether someone has heart disease or not?"
Because how each model finds patterns in data is slightly different, how a model judges how important those patterns are is different as well.
This means for each model, there's a slightly different way of finding which features were most important and in turn, the feature importance of one model won't necessarily reflect the feature importance of another.
You can usually find an example via the Scikit-Learn documentation or via searching for something like "MODEL TYPE feature importance", such as, "random forest feature importance".
Since we're using LogisticRegression
, we'll look at one way we can calculate feature importance for it.
To do so, we'll use the coef_
attribute. Looking at the Scikit-Learn documentation for LogisticRegression
, the coef_
attribute is the coefficient of the features in the decision function.
We can access the coef_
attribute after we've fit an instance of LogisticRegression
.
# Fit an instance of LogisticRegression (taken from above)
clf.fit(X_train, y_train);
# Check coef_
clf.coef_
array([[ 0.00369922, -0.90424094, 0.67472825, -0.0116134 , -0.00170364, 0.04787688, 0.33490202, 0.02472938, -0.6312041 , -0.57590972, 0.47095153, -0.65165346, -0.69984212]])
Looking at this, it might not make much sense. But these values are how much each feature contributes to the decision a model makes about whether a sample of a patient's health data leans more towards having heart disease or not.
Even knowing this, in its current form, the coef_ array still doesn't mean much. But it will if we combine it with the columns (features) of our DataFrame.
# Match coefficients to feature names
# Note: zip() stops at the shorter sequence, so the 'target' column at the end of df is left out
features_dict = dict(zip(df.columns, list(clf.coef_[0])))
features_dict
{'age': np.float64(0.0036992219987868977), 'sex': np.float64(-0.9042409356586161), 'cp': np.float64(0.6747282473934053), 'trestbps': np.float64(-0.011613399733807518), 'chol': np.float64(-0.0017036437157196944), 'fbs': np.float64(0.0478768767697894), 'restecg': np.float64(0.3349020243959257), 'thalach': np.float64(0.02472938207178759), 'exang': np.float64(-0.6312040952883138), 'oldpeak': np.float64(-0.575909718275565), 'slope': np.float64(0.4709515257844554), 'ca': np.float64(-0.6516534575992304), 'thal': np.float64(-0.6998421177365038)}
Now we've matched the coefficients to their feature names, let's visualize them.
# Visualize feature importance
features_df = pd.DataFrame(features_dict, index=[0])
features_df.T.plot.bar(title="Feature Importance", legend=False);
You'll notice some are negative and some are positive.
The larger the absolute value (the bigger the bar), the more the feature contributes to the model's decision.
If the value is negative, it means there's a negative correlation between that feature and the target. And vice versa for positive values.
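To rank the features by how much they influence the model regardless of sign, you could sort the coefficients by absolute value. A minimal sketch re-using features_dict from above:
# Rank features by the absolute size of their coefficient (largest influence first)
pd.Series(features_dict).abs().sort_values(ascending=False)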
For example, the sex
attribute has a negative value of -0.904, which means as the value for sex
increases, the target
value decreases.
We can see this by comparing the sex
column to the target
column.
pd.crosstab(df["sex"], df["target"])
target | 0 | 1 |
---|---|---|
sex | ||
0 | 24 | 72 |
1 | 114 | 93 |
You can see that when sex is 0 (female), there are almost 3 times as many people with heart disease (target = 1) as without (72 vs. 24).
And then as sex increases to 1 (male), the ratio drops to roughly 1 to 1, with slightly more people without heart disease than with it (114 vs. 93).
What does this mean?
It means the model has found a pattern which reflects the data. Looking at these figures and this specific dataset, it seems if the patient is female, they're more likely to have heart disease.
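If comparing raw counts across groups of different sizes feels awkward, pd.crosstab() can also show proportions via its normalize parameter. A minimal sketch:
# Show the proportion of each target class within each sex group
pd.crosstab(df["sex"], df["target"], normalize="index")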
How about a positive correlation?
# Contrast slope (positive coefficient) with target
pd.crosstab(df["slope"], df["target"])
target | 0 | 1 |
---|---|---|
slope | ||
0 | 12 | 9 |
1 | 91 | 49 |
2 | 35 | 107 |
Looking back the data dictionary, we see slope
is the "slope of the peak exercise ST segment" where:
- 0: Upsloping: better heart rate with exercise (uncommon)
- 1: Flatsloping: minimal change (typical healthy heart)
- 2: Downsloping: signs of an unhealthy heart
According to the model, slope has a positive coefficient of about 0.47, not as strong (in absolute terms) as the coefficient for sex, but still greater than 0.
This positive correlation means our model is picking up the pattern that as slope
increases, so does the target
value.
Is this true?
When you look at the crosstab (pd.crosstab(df["slope"], df["target"])) above, it is: as slope goes up, so does the proportion of samples where target = 1.
What can you do with this information?
This is something you might want to talk to a subject matter expert about.
They may be interested in seeing where the machine learning model is finding the strongest patterns (highest correlation) as well as where it isn't (lowest correlation).
Doing this has a few benefits:
- Finding out more - If some of the correlations and feature importances are confusing, a subject matter expert may be able to shed some light on the situation and help you figure out more.
- Redirecting efforts - If some features offer far more value than others, this may change how you collect data for different problems. See point 3.
- Less but better - Similar to above, if some features are offering far more value than others, you could reduce the number of features your model tries to find patterns in as well as improve the ones which offer the most. This could potentially lead to saving on computation, by having a model find patterns across less features, whilst still achieving the same performance levels.
9. Experimentation¶
We've completed all the metrics your boss requested!
You should be able to put together a great report containing a confusion matrix and a handful of cross-validated metrics such as precision, recall and F1-score, and you can even include which features contribute most to the model's decisions.
But after all this you might be wondering where step 6 in the framework is, experimentation.
Well the secret here is, as you might've guessed, the whole thing is experimentation.
From trying different models, to tuning different models to figuring out which hyperparameters were best.
What we've worked through so far has been a series of experiments.
And the truth is, we could keep going. But of course, things can't go on forever.
So by this stage, after trying a few different things, we'd ask ourselves did we meet the evaluation metric?
Remember we defined one in step 3.
If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we'll pursue this project.
In this case, we didn't.
The highest accuracy our model achieved was below 90%.
What next?¶
You might be wondering, what happens when the evaluation metric doesn't get hit?
Is everything we've done wasted?
No.
It means we know what doesn't work.
In this case, we know the current model we're using (a tuned version of sklearn.linear_model.LogisticRegression
) along with our specific data set doesn't hit the target we set ourselves.
This is where step 6 comes into its own.
A good next step would be to discuss with your team, or research on your own, different options for going forward.
- Could you collect more data? Across more patients with more features? This may take a while but in machine learning, more data is generally better.
- Could you try a better model? If you're working with structured data, you might want to look into CatBoost or XGBoost.
- Could you improve the current models (beyond what we've done so far)?
- If your model is good enough, how would you export it and share it with others? (Hint: check out Scikit-Learn's documentation on model persistence; there's a sketch of one option after this list.)
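On that last point, one commonly used option for saving Scikit-Learn models is joblib. A minimal sketch, assuming gs_log_reg is the fitted model you want to keep and that the filename is just an example:
# Save a trained model to disk and load it back later
from joblib import dump, load

dump(gs_log_reg, "gs_log_reg_model.joblib")     # example filename
loaded_model = load("gs_log_reg_model.joblib")
loaded_model.score(X_test, y_test)              # should match the score from earlier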
The key here is to remember, your biggest restriction will be time.
That's why it's paramount to minimise the time between experiments (if you can).
The more things you try, the more you figure out what doesn't work, and the more you'll start to get the hang of what does.
And that's the whole nature of machine learning.