Exploring Performance Metrics: NFL Combine Analysis

15 minute read

Background

In the National Football League (NFL), athletes exhibit a remarkable range of physical attributes tailored to the demands of their positions. Every year, the NFL Combine collects physical and performance data from new athletes declaring for the NFL draft. Scouts use this data to further evaluate player prospects beyond their on-field performances.

In this post, I will investigate how well K-means clustering can identify player positions using NFL Combine data from 2000 to 2018. I will also run several hypothesis tests on annual trends in the Cone Drill.

Exploratory Analysis

The following code is run in a Python 3 kernel in Jupyter Notebook. We’ll start by loading the libraries and the dataset below!

# Import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Load NFL Combine data from 2000-2018
df = pd.read_csv("C:/Users/ethan/OneDrive/Data Projects/Datasets/NFL_Combine.csv", index_col="Player")

First, let’s take a look at the raw data.

df.shape

(6218, 15)

# Print the first 6 rows
df.head(6)
Player Pos Ht Wt Forty Vertical BenchReps BroadJump Cone Shuttle Year Pfr_ID AV Team Round Pick
John Abraham OLB 76 252 4.55 NaN NaN NaN NaN NaN 2000 AbraJo00 26 New York Jets 1.0 13.0
Shaun Alexander RB 72 218 4.58 NaN NaN NaN NaN NaN 2000 AlexSh00 26 Seattle Seahawks 1.0 19.0
Darnell Alford OT 76 334 5.56 25.0 23.0 94.0 8.48 4.98 2000 AlfoDa20 0 Kansas City Chiefs 6.0 188.0
Kyle Allamon TE 74 253 4.97 29.0 NaN 104.0 7.29 4.49 2000 NaN 0 NaN NaN NaN
Rashard Anderson CB 74 206 4.55 34.0 NaN 123.0 7.18 4.15 2000 AndeRa21 6 Carolina Panthers 1.0 23.0
Jake Arians K 70 202 NaN NaN NaN NaN NaN NaN 2000 arianjak01 0 NaN NaN NaN

We have 15 fields for 6218 players in our dataset, including information on their draft outcomes. Some players have missing information, and several fields are not of interest to us.

Let’s group similar player positions together (e.g., a linebacker LB and an inside linebacker ILB). This will be useful later in our cluster analysis.

# Create a dictionary to categorize player positions in the dataset
position_mapping = {
    "Quarterback": ["QB"],
    "Receiver": ["WR"],
    "Ball-Carrier": ["RB", "FB"],
    "Tight-End": ["TE"],
    "Secondary": ["S", "SS", "FS", "CB", "DB"],
    "Linebacker": ["ILB", "OLB", "Edge", "LB"],
    "D-Line": ["DE", "DT", "NT"],
    "O-Line": ["OT", "OG", "C", "G", "OL"],
    "Special": ["K", "P", "LS"]
}

# Create a function to map categorized positions back to players
def map_position(pos):
    for position, values in position_mapping.items():
        if pos in values:
            return position
    return "Unknown"

# Reformat Position
df["Position"] = df["Pos"].apply(map_position)

Now let’s drop those pesky fields.

# Drop unnecessary fields
df.drop(columns=["Pos", "AV", "Pfr_ID"], inplace=True)

Upon investigating the data source, it is safe to assume that players with NaN values for Team, Round, and Pick went undrafted. We will alias these players to make them easier to identify. As for players with missing combine metrics, we will remove them altogether.

# Replace NaN values for undrafted players
df.loc[df["Team"].isna(), "Team"] = "Undrafted"
df.loc[df["Round"].isna(), "Round"] = 0
df.loc[df["Pick"].isna(), "Pick"] = 0

# Convert Round and Pick to integers
df['Round'] = df['Round'].astype(int)
df['Pick'] = df['Pick'].astype(int)

# Drop players from dataset with missing combine metrics
df.dropna(inplace=True)

# Rename select fields
df = df.rename(columns={
    'Ht': 'Height',
    'Wt': 'Weight',
    'Vertical': 'Vert',
    'BenchReps': 'Bench',
    'BroadJump': 'Jump'
})

# View the new dataframe
df.head(6)
Player Height Weight Forty Vert Bench Jump Cone Shuttle Year Team Round Pick Position
Darnell Alford 76 334 5.56 25.0 23.0 94.0 8.48 4.98 2000 Kansas City Chiefs 6 188 O-Line
Corey Atkins 72 237 4.72 31.0 21.0 112.0 7.96 4.39 2000 Undrafted 0 0 Linebacker
Reggie Austin 69 175 4.44 35.0 17.0 119.0 7.03 4.14 2000 Chicago Bears 4 125 Secondary
Mark Baniewicz 78 312 5.34 28.0 20.0 96.0 7.72 4.73 2000 Undrafted 0 0 O-Line
Rashidi Barnes 72 208 4.62 35.0 10.0 114.0 6.92 4.32 2000 Cleveland Browns 7 225 Secondary
David Barrett 70 199 4.44 37.5 16.0 116.0 6.81 4.04 2000 Arizona Cardinals 4 102 Secondary

Let’s take one last look at a summary of the new dataframe.

df.describe()
  Height Weight Forty Vert Bench Jump Cone Shuttle Year Round Pick
count 2885.000000 2885.000000 2885.000000 2885.000000 2885.000000 2885.000000 2885.000000 2885.000000 2885.000000 2885.000000 2885.000000
mean 73.992721 252.087348 4.811584 32.598267 21.091854 113.190295 7.305719 4.408548 2009.374350 2.508492 74.869324
std 2.675779 45.882206 0.321231 4.270008 6.395231 9.491045 0.431420 0.269871 5.253415 2.442618 79.269856
min 65.000000 166.000000 4.260000 19.500000 2.000000 82.000000 6.280000 3.750000 2000.000000 0.000000 0.000000
25% 72.000000 210.000000 4.550000 29.500000 17.000000 107.000000 6.970000 4.200000 2005.000000 0.000000 0.000000
50% 74.000000 247.000000 4.730000 33.000000 21.000000 114.000000 7.220000 4.370000 2010.000000 2.000000 53.000000
75% 76.000000 300.000000 5.060000 35.500000 25.000000 120.000000 7.590000 4.590000 2014.000000 5.000000 137.000000
max 81.000000 370.000000 6.000000 45.500000 45.000000 139.000000 9.040000 5.560000 2018.000000 7.000000 259.000000

We are left with 2885 player records to evaluate.
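Before moving on, it may also be worth a quick look at how those records break down by position group:

# Player counts per mapped position group
print(df['Position'].value_counts())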

Variable Correlation

Next, we’ll investigate relationships between numeric variables in the dataset by creating a correlation heatmap.

# Create a copy of the dataframe
df_numeric = df.copy()

# Drop the non-numeric 'Team' and 'Position' columns from the copied dataframe
df_numeric.drop(['Team', 'Position'], axis=1, inplace=True)

# Calculate a correlation matrix
correlation_matrix = df_numeric.corr()

# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap=sns.color_palette("vlag", as_cmap=True), fmt=".2f", square=True)
plt.title('Correlation Heatmap')
plt.show()

Many of these correlations are intuitive. We might expect a player’s weight to be positively correlated with their 40-Yard Dash time, and indeed we see a correlation coefficient of r = 0.89 (i.e., heavier players generally take longer to complete the drill). It’s also reasonable that a player’s jumping ability would be negatively correlated with their 40-Yard Dash time: players who complete the dash faster tend to out-jump heavier, slower players. We will circle back to these correlations in a bit for further analysis.
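If we want the strongest pairwise relationships as a list rather than a grid, a small sketch reusing correlation_matrix from above can flatten the matrix and sort by magnitude:

import numpy as np

# Flatten the correlation matrix, dropping self-correlations
mask = ~np.eye(len(correlation_matrix), dtype=bool)
corr_pairs = correlation_matrix.where(mask).stack()

# Strongest relationships first (each pair appears twice, mirrored)
print(corr_pairs.sort_values(key=np.abs, ascending=False).head(10))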

Cluster Analysis

Let’s explore different player types within the dataset using K-means clustering. To decide on an optimal number of clusters, we’ll write a function that plots inertia across different values of k. The elbow point on this plot marks where increasing k stops producing substantial reductions in inertia, suggesting a reasonable number of clusters.

# Create a function to plot inertia across candidate values of k
def optimize_k(data, max_k):
    means = []
    inertias = []

    for k in range(1, max_k):
        kmeans = KMeans(n_clusters=k)
        kmeans.fit(data)

        means.append(k)
        inertias.append(kmeans.inertia_)

    # Elbow plot to identify the optimal k-value
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(means, inertias, "o-")
    ax.set_xlabel("Number of Clusters")
    ax.set_ylabel("Inertia")
    ax.grid(True)
    plt.show()
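Because inertia always decreases as k grows, the elbow can sometimes be ambiguous. As an optional cross-check (a sketch, not part of the original workflow), the average silhouette score offers a second opinion, with higher values indicating better-separated clusters:

# Optional cross-check: average silhouette score for each k (higher is better)
from sklearn.metrics import silhouette_score

def silhouette_scores(data, max_k):
    for k in range(2, max_k):  # silhouette is undefined for k=1
        labels = KMeans(n_clusters=k).fit_predict(data)
        print(f"k={k}: silhouette = {silhouette_score(data, labels):.3f}")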

Now that we’ve created an inertia function, we can test it by first clustering on all numeric fields in the dataset. This will effectively create k clusters of players with similar performance metrics. (Note that K-means computes distances on raw units, so high-variance fields like Weight will dominate the clustering here.)

# Determine k for clustering on all numeric fields in the data
optimize_k(df[["Height", "Weight", "Forty", "Vert", "Bench", "Jump", "Cone", "Shuttle"]], 10)

Based on the elbow plot above, we will select k=3 for our first K-means clustering on all numeric fields.

# Let k=3 and cluster on all numeric fields
kmeans = KMeans(n_clusters=3)
kmeans.fit(df[["Height", "Weight", "Forty", "Vert", "Bench", "Jump", "Cone", "Shuttle"]])
df["kmeans_3_all"]=kmeans.labels_

# Split up clusters for plotting
cluster_1 = df[df["kmeans_3_all"]==1]
cluster_2 = df[df["kmeans_3_all"]==0]
cluster_3 = df[df["kmeans_3_all"]==2]

# Identify the max player count for positions across all clusters
max_count = max(
    cluster_1['Position'].value_counts().max(),
    cluster_2['Position'].value_counts().max(),
    cluster_3['Position'].value_counts().max()
)
# Set shared y-axis height
y_axis_limit = max_count + 50
# Create empty plots
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)

# Format cluster 1 subplot
cluster_1['Position'].value_counts().plot(kind='bar', color='firebrick', ax=axes[0], width=0.50)
axes[0].set_title('Cluster 1')
axes[0].set_xlabel('Position')
axes[0].set_ylabel('Count')
axes[0].set_ylim(0, y_axis_limit)
axes[0].tick_params(axis='x', rotation=45)

# Format cluster 2 subplot
cluster_2['Position'].value_counts().plot(kind='bar', color='olivedrab', ax=axes[1], width=0.35)
axes[1].set_title('Cluster 2')
axes[1].set_xlabel('Position')
axes[1].set_ylabel('Count')
axes[1].set_ylim(0, y_axis_limit)
axes[1].tick_params(axis='x', rotation=45)

# Format cluster 3 subplot
cluster_3['Position'].value_counts().plot(kind='bar', color='royalblue', ax=axes[2], width=0.30)
axes[2].set_title('Cluster 3')
axes[2].set_xlabel('Position')
axes[2].set_ylabel('Count')
axes[2].set_ylim(0, y_axis_limit)
axes[2].tick_params(axis='x', rotation=45)

# Display subplots
plt.tight_layout()
plt.show()

As specified, the 2885 player records have been clustered on all numeric fields (Height, Weight, Forty, Bench…) and assigned to one of three groups. The plots above show the distribution of player position counts within each cluster.

Cluster 1 contains mostly offensive and defensive linemen, characterized by their substantial stature and weight. These players exhibit impressive power and explosiveness, making them formidable forces in the trenches.

Cluster 2 contains a majority of skill-position players like receivers, defensive backs, and running backs. Known for their agility, speed, and explosive bursts, these athletes excel at making quick, decisive movements on the field.

Cluster 3 presents a diverse mix of player types, including linebackers, defensive ends, and tight ends. Combining elements from both Cluster 1 and Cluster 2, these players showcase a blend of strength, agility, and versatility, making them crucial assets in various facets of the game.
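We can back these descriptions with numbers by profiling the average measurables in each cluster (a quick sketch; note that the label numbers in kmeans_3_all are arbitrary and may not match the plot order above):

# Average combine metrics per cluster
metrics = ["Height", "Weight", "Forty", "Vert", "Bench", "Jump", "Cone", "Shuttle"]
print(df.groupby("kmeans_3_all")[metrics].mean().round(2))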

Rather than looking at all numeric variables, let’s narrow down our selection to two highly correlated variables, like Forty-Yard Dash and Weight (r=0.89). Similar to our last procedure, we’ll have to determine a new value of k to cluster on a new variable set.

# Determine k for clustering on Forty vs Weight
optimize_k(df[["Weight", "Forty"]], 10)

Based on the elbow plot above, we will again select k=3 for clustering on Forty-Yard Dash (Forty) vs Weight.

# Let k=3 and cluster on Forty vs Weight
kmeans = KMeans(n_clusters=3)
kmeans.fit(df[["Weight", "Forty"]])
df["kmeans_3_Wt_Fty"] = kmeans.labels_

# Split up clusters for plotting
# (Again, label numbers are arbitrary; this mapping reflects the labels from this run)
cluster_1 = df[df["kmeans_3_Wt_Fty"] == 2]
cluster_2 = df[df["kmeans_3_Wt_Fty"] == 0]
cluster_3 = df[df["kmeans_3_Wt_Fty"] == 1]

# Identify the max player count for positions across all clusters
max_count = max(
    cluster_1['Position'].value_counts().max(),
    cluster_2['Position'].value_counts().max(),
    cluster_3['Position'].value_counts().max()
)
# Set shared y-axis height
y_axis_limit = max_count + 50
# Format empty plots
fig, axes = plt.subplots(1, 3, figsize=(15, 5), sharey=True)

# Format cluster 1 subplot
cluster_1['Position'].value_counts().plot(kind='bar', color='firebrick', ax=axes[0], width=0.50)
axes[0].set_title('Cluster 1')
axes[0].set_xlabel('Position')
axes[0].set_ylabel('Count')
axes[0].set_ylim(0, y_axis_limit)
axes[0].tick_params(axis='x', rotation=45)

# Format cluster 2 subplot
cluster_2['Position'].value_counts().plot(kind='bar', color='olivedrab', ax=axes[1], width=0.35)
axes[1].set_title('Cluster 2')
axes[1].set_xlabel('Position')
axes[1].set_ylabel('Count')
axes[1].set_ylim(0, y_axis_limit)
axes[1].tick_params(axis='x', rotation=45)

# Format cluster 3 subplot
cluster_3['Position'].value_counts().plot(kind='bar', color='royalblue', ax=axes[2], width=0.30)
axes[2].set_title('Cluster 3')
axes[2].set_xlabel('Position')
axes[2].set_ylabel('Count')
axes[2].set_ylim(0, y_axis_limit)
axes[2].tick_params(axis='x', rotation=45)

# Display plots
plt.tight_layout()
plt.show()

The distribution of player positions within these new Forty-vs-Weight clusters is nearly identical to that of our clusters on all numeric fields. Now that we’ve narrowed our focus to two dimensions, we can generate a scatter plot comparing Forty-Yard Dash and Weight. We’ll color-code these points based on their respective clusters.

# Scatter plot of Forty vs Weight (3 clusters)
fig, ax = plt.subplots(figsize=(8, 6))

# Define custom colors for the clusters
colors = ['firebrick', 'olivedrab', 'royalblue']

# Map plot order (Cluster 1, 2, 3) to this run's arbitrary K-means labels
label_order = [2, 0, 1]

# Scatter plot with custom colors for each cluster
for i, color in enumerate(colors):
    cluster_data = df[df["kmeans_3_Wt_Fty"] == label_order[i]]  # Filter data for each cluster
    ax.scatter(x=cluster_data["Forty"], y=cluster_data["Weight"], c=color, label=f'Cluster {i+1}')

ax.set_xlabel("40-Yard Dash (sec)")
ax.set_ylabel("Weight (lbs)")
ax.legend()  # Show legend for cluster colors

plt.show()

This plot provides a visual representation of our clusters in the context of Forty-Yard Dash times and Weight. As expected, the clusters are clearly delineated: Cluster 1 points (Power Positions) sit toward the top right of the plot, Cluster 2 points (Agile Players) occupy the bottom left, and Cluster 3 points (Hybrid Builds) fall in the middle.
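The fitted centroids tell the same story numerically. A quick sketch (kmeans here refers to the Forty-vs-Weight model fit above, and row order reflects this run’s arbitrary labels):

# Cluster centers in (Weight, Forty) space for the Forty-vs-Weight model
centers = pd.DataFrame(kmeans.cluster_centers_, columns=["Weight", "Forty"])
print(centers.round(2))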

Cone Drill Regression Analysis

As promised, it’s time to shift our focus back to the correlation heatmap above. Considering annual trends, we observed that Year’s strongest correlation was with the Cone Drill (r = -0.16). While this correlation is relatively modest, it raises a question: has the average time to complete the Cone Drill decreased significantly since 2000?

Before answering this question, we can investigate the average Cone Drill time year over year with a line chart.

# Group by year and calculate the average cone drill time for each year
average_cone_time_yearly = df.groupby('Year')['Cone'].mean()

# Plot the average cone drill time YOY
plt.figure(figsize=(10, 6))
average_cone_time_yearly.plot(kind='line', marker='o', color='b')
plt.title('Average Cone Drill Time Year Over Year')
plt.xlabel('Year')
plt.ylabel('Average Cone Drill Time (sec)')
plt.grid(True)
plt.xticks(average_cone_time_yearly.index)  # Set x-axis ticks to be the years
plt.tight_layout()
plt.show()

While there does appear to be a dip in average cone drill times over the years, further analysis is necessary to confirm the trend. We’ll begin by fitting a linear regression model to the data. This model will test the following hypotheses:

𝐻0: There is no linear relationship (i.e., the slope is zero) between the year of the NFL Combine and the average time in which players completed the cone drill from 2000 to 2018.

𝐻𝑎: There is a negative linear relationship between the year of the NFL Combine and the average time in which players completed the cone drill from 2000 to 2018, indicating that the average time has decreased (players have gotten faster) over this period.

# Import scikit-learn libraries
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Reaggregate data by year
yearly_avg_cone_speed = df.groupby('Year')['Cone'].mean().reset_index()

# Extract independent and dependent variables
X = yearly_avg_cone_speed['Year'].values.reshape(-1, 1)  # Independent variable: Year
y = yearly_avg_cone_speed['Cone'].values  # Dependent variable: average cone drill time

# Fit the linear regression model
model = LinearRegression()
model.fit(X, y)

# Predict the average cone drill time YOY
predicted_cone_speed = model.predict(X)

# Plot the data points and the linear regression line
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color='blue', label='Actual Data')
plt.plot(X, predicted_cone_speed, color='red', linewidth=2, label='Linear Regression Line')
plt.title('Linear Regression: Average Cone Drill Time vs. Year')
plt.xlabel('Year')
plt.ylabel('Average Cone Drill Time (sec)')
plt.legend()
plt.show()

# Determine the slope (coefficient) of the linear regression model
slope = model.coef_[0]
print("LR Slope:", slope)

# Calculate R-squared
r_squared = r2_score(y, predicted_cone_speed)
print("R-squared (Aggregated Data):", r_squared)

LR Slope: -0.013476126608817907
R-squared (Aggregated Data): 0.5194528023129359

Across all players, the average time to complete the cone drill has been decreasing by about 0.0135 seconds per year, which works out to roughly a quarter of a second over the full 2000 to 2018 span. Further, approximately 51.95% of the variance in the yearly average cone drill time can be explained by the year. Let’s examine the results of the hypothesis test based on this model.

import numpy as np
from scipy.stats import t

# Number of observations
n = len(X)

# Degrees of freedom
deg = n - 2  # degrees of freedom for a simple linear regression

# Calculate standard error of the slope coefficient
residuals = y - predicted_cone_speed
standard_error_slope = np.sqrt(np.sum(residuals**2) / deg) / np.sqrt(np.sum((X - np.mean(X))**2))

# Calculate t-statistic
t_statistic = slope / standard_error_slope

# Calculate p-value
p_value = 2 * (1 - t.cdf(np.abs(t_statistic), deg))

print("t-statistic:", t_statistic)
print("p-value:", p_value)
t-statistic: -4.286763092593991
p-value: 0.0004989946412297108
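As a sanity check on the manual calculation, scipy.stats.linregress reports the slope and a two-sided p-value in a single call; we would expect it to agree with the numbers above:

from scipy.stats import linregress

# Cross-check: slope, r, and two-sided p-value in one call
result = linregress(yearly_avg_cone_speed['Year'], yearly_avg_cone_speed['Cone'])
print("slope:", result.slope)
print("r-squared:", result.rvalue ** 2)
print("p-value:", result.pvalue)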

The large-magnitude t-statistic and small p-value indicate that the relationship between the year of the NFL Combine and the average cone drill time is statistically significant. Therefore, we reject the null hypothesis and conclude that there is a significant negative linear relationship between Year and average cone drill time in our data. (Note that the p-value above is two-sided; because our alternative hypothesis is directional, a one-sided test would halve it, strengthening the conclusion.)

It’s important to consider the mix of player positions within each year of the NFL Combine. A particular draft class might predominantly consist of fast positions, resulting in a lower average cone drill time for that year, while the following class might be dominated by slower positions. This positional bias could give the impression that overall player speed changed significantly from year to year when it may just reflect the composition of each draft class.
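One quick way to inspect this composition effect (a sketch, not part of the original analysis) is to look at the share of each position group per Combine year:

# Share of each position group within each Combine year
composition = pd.crosstab(df['Year'], df['Position'], normalize='index')
print(composition.round(2))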

We can run the same analysis for each of the Forty-vs-Weight cluster groups from 2000 to 2018. In doing so, we mitigate some of the bias introduced by variations in positional subgroups. Let’s aggregate the data for each cluster by taking the average cone drill time each year.

# Aggregate data within each cluster
cluster_1_aggregated = cluster_1.groupby('Year')['Cone'].mean().reset_index()
cluster_2_aggregated = cluster_2.groupby('Year')['Cone'].mean().reset_index()
cluster_3_aggregated = cluster_3.groupby('Year')['Cone'].mean().reset_index()

Cluster 1 Test

# Extract data for Cluster 1
X_cluster1 = cluster_1_aggregated['Year'].values.reshape(-1, 1)  # Independent variable: Year
y_cluster1 = cluster_1_aggregated['Cone'].values  # Dependent variable: average cone drill time

# Fit the linear regression model for Cluster 1
model_cluster1 = LinearRegression()
model_cluster1.fit(X_cluster1, y_cluster1)

# Predict the average cone drill time for Cluster 1
predicted_cone_speed_cluster1 = model_cluster1.predict(X_cluster1)

# Calculate slope for Cluster 1
slope_cluster1 = model_cluster1.coef_[0]

# Calculate R-squared for Cluster 1
r_squared_cluster1 = r2_score(y_cluster1, predicted_cone_speed_cluster1)

# Calculate standard error of the slope coefficient for Cluster 1
residuals_cluster1 = y_cluster1 - predicted_cone_speed_cluster1
deg_cluster1 = len(X_cluster1) - 2  # degrees of freedom for a simple linear regression
standard_error_slope_cluster1 = np.sqrt(np.sum(residuals_cluster1**2) / deg_cluster1) / np.sqrt(np.sum((X_cluster1 - np.mean(X_cluster1))**2))

# Calculate t-statistic for Cluster 1
t_statistic_cluster1 = slope_cluster1 / standard_error_slope_cluster1

# Calculate p-value for Cluster 1
p_value_cluster1 = 2 * (1 - t.cdf(np.abs(t_statistic_cluster1), deg_cluster1))

# Print results for Cluster 1
print("Cluster 1 Results:")
print("LR Slope:", slope_cluster1)
print("R-squared:", r_squared_cluster1)
print("t-statistic:", t_statistic_cluster1)
print("p-value:", p_value_cluster1)

Cluster 1 Results:
LR Slope: -0.007284025968538165
R-squared: 0.27525153202504027
t-statistic: -2.5409474310335876
p-value: 0.021099052456581946

Similarly for clusters 2 and 3…
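Rather than repeating the Cluster 1 code verbatim, a small helper (a sketch that reuses the same logic and the imports already loaded) could produce both sets of results:

# Helper: run the same slope and t-test procedure on any aggregated cluster
def cone_trend_test(aggregated, name):
    X = aggregated['Year'].values.reshape(-1, 1)
    y = aggregated['Cone'].values
    model = LinearRegression().fit(X, y)
    predicted = model.predict(X)
    slope = model.coef_[0]
    deg = len(X) - 2  # degrees of freedom for a simple linear regression
    residuals = y - predicted
    se_slope = np.sqrt(np.sum(residuals**2) / deg) / np.sqrt(np.sum((X - np.mean(X))**2))
    t_stat = slope / se_slope
    p_val = 2 * (1 - t.cdf(np.abs(t_stat), deg))
    print(f"{name} Results:")
    print("LR Slope:", slope)
    print("R-squared:", r2_score(y, predicted))
    print("t-statistic:", t_stat)
    print("p-value:", p_val)

cone_trend_test(cluster_2_aggregated, "Cluster 2")
cone_trend_test(cluster_3_aggregated, "Cluster 3")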

Cluster 2 Results:
LR Slope: -0.0077117141767901794
R-squared: 0.36953339540415586
t-statistic: -3.156605168807558
p-value: 0.00576068502698579
Cluster 3 Results:
LR Slope: -0.01193051964700224
R-squared: 0.48944498115367974
t-statistic: -4.036966314785228
p-value: 0.0008558581836797252

Cluster 1: The average cone drill time for Power Players has been decreasing by approximately 0.0073 seconds each year. The R-squared value of 0.2752 suggests that about 28% of the variance in the average cone drill time for this cluster can be explained by the year. The negative t-statistic and small p-value indicate a significant negative linear relationship between the year of the NFL Combine and the average cone drill time. Therefore, we reject the null hypothesis and conclude that the average cone drill time for Power Players has been decreasing over the years.

Cluster 2: Agile Players have also seen cone drill performance improve, with their average completion time decreasing by approximately 0.0077 seconds each year. An R-squared value of 0.3695 suggests that about 37% of the variance in the average cone drill time for this cluster can be explained by the year. The negative t-statistic and small p-value again indicate a significant negative linear relationship, so we reject the null hypothesis and conclude that the average cone drill time for Agile Players has been decreasing over the years.

Cluster 3: Hybrid Players have seen the greatest improvement of all, with their average cone drill time decreasing by 0.0119 seconds each year. The R-squared value of 0.4894 suggests that about 49% of the variance in the average cone drill time for this cluster can be explained by the year. The negative t-statistic and small p-value once more indicate a significant negative linear relationship; we reject the null hypothesis and conclude that the average cone drill time for Hybrid Players has been decreasing over the years.

To investigate these results further, we will plot all three clusters alongside their regression models for comparison.

# Function to fit linear regression model and predict
def fit_and_predict(cluster_data):
    X = cluster_data['Year'].values.reshape(-1, 1)
    y = cluster_data['Cone'].values
    model = LinearRegression()
    model.fit(X, y)
    return model, model.predict(X)

# Fit and predict for each cluster
model_cluster1, predicted_cone_speed_cluster1 = fit_and_predict(cluster_1_aggregated)
model_cluster2, predicted_cone_speed_cluster2 = fit_and_predict(cluster_2_aggregated)
model_cluster3, predicted_cone_speed_cluster3 = fit_and_predict(cluster_3_aggregated)

# Generate a year-range for plotting the regression lines
years = np.linspace(df['Year'].min(), df['Year'].max(), 100).reshape(-1, 1)
pred_cluster1_line = model_cluster1.predict(years)
pred_cluster2_line = model_cluster2.predict(years)
pred_cluster3_line = model_cluster3.predict(years)

# Plotting
plt.figure(figsize=(12, 8))

# Scatter plot
plt.scatter(cluster_1_aggregated['Year'], cluster_1_aggregated['Cone'], color='firebrick', label='Cluster 1')
plt.scatter(cluster_2_aggregated['Year'], cluster_2_aggregated['Cone'], color='olivedrab', label='Cluster 2')
plt.scatter(cluster_3_aggregated['Year'], cluster_3_aggregated['Cone'], color='royalblue', label='Cluster 3')

# Plot linear regression lines
plt.plot(years, pred_cluster1_line, color='firebrick', linewidth=2, alpha=0.5)
plt.plot(years, pred_cluster2_line, color='olivedrab', linewidth=2, alpha=0.5)
plt.plot(years, pred_cluster3_line, color='royalblue', linewidth=2, alpha=0.5)

# Labels and legend
plt.xlabel('Year')
plt.ylabel('Time (sec)')
plt.title('Year vs Cone Drill Time')
plt.legend()

plt.show()