Each baseball season there are certain to be teams that get you asking the question "Can they win it all?" As I'm writing this, my mind is turned to the 2021 SF Giants who got off to a hot start and just never cooled off. They currently have the best record in baseball after not being given a chance at the start of the season. However, there is certainly still a narative that come postseason, the teams that were supposed to be good going into the season will edge them out. I believe there are a couple things coming in to play here, but I wanted to look at the playoffs of the 2010's to see if there is any good way to measure how a baseball team will perform in the playoffs.
Now there is so much randomness when it comes to the playoffs that there isn't going to be great way to predict playoff success. However, I wanted to look at 3 metrics that I believe could be useful in determining how a team performs.
This is certainly an oversimplification of the things that can go into playoff success, but I believe it will be an interesting thing to look at. Truth be told this idea came from metric 3. The San Fransisco Giants of 2021 haven't hit great against great starting pitching and I wonder if that spells gloom for them in the postseason.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns
import statsmodels.api as sm
import statsmodels.stats.multicomp
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import os
import random
import warnings
warnings.filterwarnings("ignore")
After loading most of the necessary packages, there are 3 datasets to import. Below are the necessary indices and variables from each of the original data sets
random.seed(4)
start_vars = ['Name', 'Tm', 'IP', 'G', 'GS','ERA+','Year']
game_ind = [0,3,6,9,10,39,67,101,103]
team_dict = {'NYA':'NYY','NYN':'NYM','SDN':'SDP','SLN':'STL','SFN':'SFG','FLO':'FLA','LAN':'LAD','WAS':'WSN','TBA':'TBR','KCA':'KCR','CHN':'CHC','CHA':'CHW','ANA':'LAA'}
game_vars = ['date','tm_away','tm_hm','runs_away','runs_home','er_away','er_home','pitch_id_away','pitch_id_home']
Data for each playoff season from 2010 through 2019 that includes the number of wins in the playoffs and the result of the team from 0 to 4 where 0 would be lost in the wild card round and 4 would be winning the World Series.
playoffs = pd.read_csv("playoff_teams.csv",header=0)
playoffs['Year']=playoffs['Year'].astype(str)
playoffs.rename(columns={'Team':'Tm'},inplace=True)
Data for every pitcher from 2010 through 2019 and their individual stat lines. This is split into two dataframes for relievers and for starters. This data was available on baseball-reference.com.
directory = 'pitchers/'
pitchers_df = pd.DataFrame(columns=start_vars)
for filename in os.listdir(directory):
year = str.split(filename,"_")[0]
df = pd.read_csv(directory+filename,header=0)
df['Year'] = year
df = df[start_vars]
pitchers_df = pd.concat([pitchers_df,df], ignore_index=True)
#Splitting up the name of the pitchers into id and actual name
pitchers_df['player_id']=pitchers_df['Name'].str.split("\\",expand=True)[1]
pitchers_df['Name']=pitchers_df['Name'].str.split("\\",expand=True)[0]
#Setting up the player id to match the format of the player id from Retrosheets
pitchers_df['player_id']=pitchers_df['player_id'].str[:-5]+pitchers_df['player_id'].str[-4:-3]+'0'+pitchers_df['player_id'].str[-2:]
#Dealing with trades by replacing the individual team totals with the season totals for a player
trades = pitchers_df[pitchers_df['Tm']=='TOT']
pitchers_df = pd.merge(pitchers_df, trades, how='left', on= ['Year','player_id','Name'])
pitchers_df.loc[pitchers_df['Tm_y']=='TOT','IP_x'] = pitchers_df.loc[pitchers_df['Tm_y']=='TOT','IP_y']
pitchers_df.loc[pitchers_df['Tm_y']=='TOT','G_x'] = pitchers_df.loc[pitchers_df['Tm_y']=='TOT','G_y']
pitchers_df.loc[pitchers_df['Tm_y']=='TOT','GS_x'] = pitchers_df.loc[pitchers_df['Tm_y']=='TOT','GS_y']
pitchers_df.loc[pitchers_df['Tm_y']=='TOT','ERA+_x'] = pitchers_df.loc[pitchers_df['Tm_y']=='TOT','ERA+_y']
pitchers_df.drop(columns=['Tm_y','IP_y','G_y','GS_y','ERA+_y'],inplace=True)
pitchers_df.rename(columns={'Tm_x':'Tm','IP_x':'IP','G_x':'G','GS_x':'GS','ERA+_x':'ERA+'},inplace=True)
#Drop all 'total' pitcher records now that they have been duplicated
pitchers_df = pitchers_df[pitchers_df['Tm']!='TOT']
#Creating additional variables
pitchers_df['IP_G'] = pitchers_df['IP']/pitchers_df['G'].astype(float)
#Splitting into relievers and starters based on GS vs Games. If more than half of the games were starts than classifying as a starting pitcher.
pitchers_df['Type']='SP'
pitchers_df.loc[pitchers_df['GS']/pitchers_df['G']<0.5,'Type'] = 'RP'
starter_df = pitchers_df[pitchers_df['Type']=='SP']
reliever_df = pitchers_df[pitchers_df['Type']=='RP']
Game log data from Retrosheet.org that includes who the starting pitcher was that day.
directory = 'games/'
game_df = pd.DataFrame(columns=game_vars)
for filename in os.listdir(directory):
df = pd.read_csv(directory+filename,header=None)
df = df[game_ind]
df.columns = game_vars
game_df = pd.concat([game_df,df], ignore_index=True)
#Matching the team abbrieviations to the other data sets
game_df.replace({'tm_away':team_dict,'tm_hm':team_dict},inplace=True)
game_df['Year'] = game_df.date.astype(str).str[:4]
The game logs include both teams per data point so the data needs to be "duplicated" in order to assess each game from each teams point of view. To do I created two dataframes, one for home teams and one for away teams, and then concatenated them. 'opp_er' is the number of earned runs the opponent gave up. The average across the data set is about 4.
away_game = game_df.rename(columns={'tm_away':'Tm','tm_hm':'Opp_Tm','runs_away':'r','runs_home':'opp_r','er_away':'er','er_home':'opp_er','pitch_id_away':'starter','pitch_id_home':'opp_starter'})
home_game = game_df.rename(columns={'tm_hm':'Tm','tm_away':'Opp_Tm','runs_home':'r','runs_away':'opp_r','er_home':'er','er_away':'opp_er','pitch_id_home':'starter','pitch_id_away':'opp_starter'})
home_game = home_game[['date','Tm','Opp_Tm','r','opp_r','er','opp_er','starter','opp_starter']]
all_games = pd.concat([away_game,home_game],ignore_index=True)
print(all_games['opp_er'].mean())
sns.histplot(x='opp_er',data=all_games)
plt.title('Histogram of Earned Runs per Game')
To rank the pitchers I am using ERA+ which is an advanced statistic where 100 is league average and adjusts for park factors. I am going to rank all the starting pitchers per team per year that have met a certain number of innings pitched (set to the mean of the dataset). Then taking the average of the best 3 starters that for each team (penalizing teams that do not have 3 pitchers that qualify).
pitch_ranking_metric = 'ERA+'
starter_num = 3
ip_qual = round(starter_df['IP'].mean())
#Rank starters by year and team
starter_df.sort_values(['Tm','Year',pitch_ranking_metric],ascending=[True,True,False], inplace=True, ignore_index=True)
#Limit search to starters that met the qs threshold
top_starters = starter_df.loc[starter_df['IP']>=ip_qual]
top_starters = top_starters.groupby(['Tm','Year']).head(starter_num)
team_starters = top_starters.groupby(['Tm','Year']).sum()
team_starters.reset_index(inplace=True)
team_starters = team_starters[['Tm','Year','ERA+']]
team_starters.rename(columns={'ERA+':'SP_ERA+'},inplace=True)
team_starters['SP_ERA+'] = team_starters['SP_ERA+']/starter_num
print(team_starters.sort_values('SP_ERA+',ascending=False))
top_starters.loc[(top_starters['Tm']=='HOU')&(top_starters['Year']=='2019')]
The results line up here as the top teams all had amazing pitching staffs and the bottom teams had some rough years.
To find the true "aces" through the years, I am limiting the starters list to only starters that were at least 50% better than their peers, pitched an average of at least 6 innings per start and meeting the innings threshold.
ace_starters = starter_df.loc[(starter_df['ERA+']>=150)&(starter_df['IP_G']>=5)&(starter_df['IP']>=ip_qual)]
ace_starters.reset_index(inplace=True)
print(list(ace_starters['Name'].unique()))
print(len(ace_starters))
sns.histplot(x='Year',data=ace_starters.sort_values('Year'))
Now that is some list. An average of just over 7 pitchers per season. 2012 and 2013 only had a handfull of elite pitchers, but that is okay. Some sure fire hall of famers mixed with some guys who put together some fantastic seasons.
Combining the list of ace starts with the games logs allows us to look at how each team hit against the best starters during the regular season.
games_vs_ace = pd.merge(ace_starters[['Year','player_id']],all_games,left_on=['player_id','Year'],right_on=['opp_starter','Year'])
games_vs_ace['opp_er']=games_vs_ace['opp_er'].astype(float)
print(games_vs_ace['opp_er'].mean())
sns.histplot(x='opp_er',data=games_vs_ace, binwidth=.5)
The average against aces is down about 1.3 runs per game compared to the overall list of games from this decade.
In order to match this data with the rest of our data, this should be grouped by team and year.
team_vs_ace = games_vs_ace.groupby(['Tm','Year']).agg({'opp_er':['mean','count']})
team_vs_ace.reset_index(inplace=True)
team_vs_ace.columns = ['Tm','Year','opp_er','count']
print(team_vs_ace[team_vs_ace['count']>=3].sort_values('opp_er', ascending=False).head(10))
Performing a similar exercise as we did with the starters to get the teams with the best relievers.
reliever_num = 5
ip_qual = round(reliever_df['IP'].mean())
reliever_df.sort_values(['Tm','Year',pitch_ranking_metric],ascending=[True,True,False], inplace=True, ignore_index=True)
top_relievers = reliever_df.loc[reliever_df['IP']>=ip_qual]
top_relievers = top_relievers.groupby(['Tm','Year']).head(reliever_num)
team_relievers = top_relievers.groupby(['Tm','Year']).sum()
team_relievers.reset_index(inplace=True)
team_relievers = team_relievers[['Tm','Year','ERA+']]
team_relievers.rename(columns={'ERA+':'RP_ERA+'},inplace=True)
team_relievers['RP_ERA+'] = team_relievers['RP_ERA+']/reliever_num
print(team_relievers.sort_values('RP_ERA+',ascending=False).head(10))
top_relievers.loc[(top_relievers['Tm']=='STL')&(top_relievers['Year']=='2013')]
Merging the top 3 starter ERA+ average with the playoff average to start creating the master dataframe that we can use to analyze.
playoff_starters = pd.merge(playoffs,team_starters,how='right',right_on=['Tm','Year'],left_on=['Tm','Year'])
games_vs_ace['opp_er']=games_vs_ace['opp_er'].astype(float)
#Teams that were not in the playoffs are given a 0 for wins, round, and 'PostSeason' which designates making the postseason.
playoff_starters['PostSeason']='Made Postseason'
playoff_starters.loc[playoff_starters['Result'].isnull(),'PostSeason']='Missed Postseason'
#Don't count wildcard losers as making the playoffs since it is just a 1 game series
playoff_starters.loc[playoff_starters['Result']==0,'PostSeason']='Missed Postseason'
playoff_starters.loc[playoff_starters['Result'].isnull(),['Wins','Result']]=0
Plotting each teams starting pitching against their playoff wins, shows minimal correlation between the top 3 starters and the result in the postseason. There is a clear dropoff in pitching performance for some non playoff teams. It would appear that if your top 3 starters are not all above average (100 ERA+) then you have no shot at the postseason.
sns.scatterplot(x='Wins',y='SP_ERA+',hue='PostSeason',data=playoff_starters)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title('Top 3 Starter ERA+ vs Playoff Wins')
Adding in the performance of each team against aces. I had to account for small sample sizes as some teams did not play against aces very often some years. I replaced the values in those cases with the overall average of runs scored against aces for that year. The average is typically around 2.5 runs against the ace depending on the year.
playoffs_df = pd.merge(playoff_starters,team_vs_ace,how='left',left_on=['Tm','Year'],right_on=['Tm','Year'])
#Get median of runs scored by year
yearly_avg = games_vs_ace[['Year','opp_er']]
yearly_avg['er_avg'] = yearly_avg['opp_er'].astype(float)
yearly_avg = yearly_avg[['Year','er_avg']].groupby('Year').mean()
yearly_avg.reset_index(inplace=True)
#Replace values with less than 3 games with the median
playoffs_df = pd.merge(playoffs_df,yearly_avg,how='left',on='Year')
playoffs_df.loc[playoffs_df['count']<3,'opp_er'] = playoffs_df['er_avg']
playoffs_df.loc[playoffs_df['count'].isnull(),'opp_er'] = playoffs_df['er_avg']
print(yearly_avg)
#Create full df with all data needed
full_df = pd.merge(playoffs_df,team_relievers,how='left',left_on=['Tm','Year'],right_on=['Tm','Year'])
full_df.drop(columns=['count','er_avg'],inplace=True)
#Max out wins at 11 for scale (only two teams had 12 wins from wildcard play ins)
full_df.loc[full_df['Wins']>11,'Wins'] = 11
Plotting the runs scored against ace pitchers honestly was not as insightful as I would have liked. There isn't much of a trend within the playoff teams and the variance of the non playoff teams encompasses playoff teams. Hopefully this can be useful in conjunction with the other variables.
I believe a big part of the issue here is that this metric is for the entire game against the ace pitcher rather than just the innings against them. There is certainly the possibility of some of these numbers being skewed by games where the bullpen let in runs.
sns.scatterplot(x='Wins',y='opp_er',hue='PostSeason',data=full_df)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title('Runs Scored Against Ace Pitchers vs Playoff Wins')
Plotting the top relief pitchers against the number of playoff wins. There is definitely a bit of a trend here. This may be a bit of a self-fulfilling prophecy due to the fact that a couple of teams won world series off of the strength of their bullpens this decade. Kansas City, I'm looking at you.
sns.scatterplot(x='Wins',y='RP_ERA+',hue='PostSeason',data=full_df)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.title('Top 3 Releiver ERA+ vs Playoff Wins')
Creating a dataframe for only postseason teams for the model
post_df = full_df[full_df['PostSeason']=='Made Postseason'].drop(columns='PostSeason')
Plotting the pair plot of all of the data
sns.pairplot(data=post_df, hue='Result')
Looking at the correlation matrix confirms what we saw in the scatter plot. There is minimal correlation with the starting pitcher ERA+, a bit more with the runs against aces, and a bit more with the bullpen ERA+.
plt.figure(figsize=(10,10))
sns.heatmap(post_df.corr(), square=True, annot=True)
Importing all of the packages I wanted to use
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
I am going to run multiple models with cross validation and look at the scores to determine which might be best to use.
def test_models(models,X,y):
for model in models:
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring = 'neg_mean_absolute_error', cv=cv, n_jobs=-1)
print(model,' MAE: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))
The models I am running are random forest, ada boost regressor, and gradient boost regressor. All of these are more complex versions of a decision tree that will allow our very non linear data to hopefully be modelled accurately. I will be using mean absolute error to evaluate models
models = [RandomForestRegressor(), AdaBoostRegressor(), GradientBoostingRegressor()]
X = post_df[['SP_ERA+','opp_er','RP_ERA+']]
y = post_df['Wins']
test_models(models,X,y)
param_grid = {
'n_estimators': [50, 100,150],
'min_samples_leaf': [1,2],
'min_samples_split': [2, 3],
'random_state': [4]
}
grid_cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=4)
model = GridSearchCV(estimator = RandomForestRegressor(), param_grid = param_grid, scoring = 'neg_mean_absolute_error', cv = grid_cv, verbose=2, n_jobs = -1)
model.fit(X,y)
print(model.best_params_)
print(model.best_score_)
best_RF = model.best_estimator_
param_grid = {
'n_estimators': [50,100,150],
'loss': ['linear','square','exponential'],
'random_state': [4]
}
grid_cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=4)
model = GridSearchCV(estimator = AdaBoostRegressor(), param_grid = param_grid, scoring = 'neg_mean_absolute_error', cv = grid_cv, verbose=2, n_jobs = -1)
model.fit(X,y)
print(model.best_params_)
print(model.best_score_)
best_ada = model.best_estimator_
param_grid = {
'n_estimators': [50,100,150],
'loss': ['ls','lad','huber'],
'min_samples_leaf': [1,2,3],
'min_samples_split': [2,3,4],
'random_state': [4]
}
grid_cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=4)
model = GridSearchCV(estimator = GradientBoostingRegressor(), param_grid = param_grid, scoring = 'neg_mean_absolute_error', cv = grid_cv, verbose=2, n_jobs = -1)
model.fit(X,y)
print(model.best_params_)
print(model.best_score_)
best_GB = model.best_estimator_
After tuning the hyperparameters, random forest was able to achieve the lowest mean absolute error.
Taking the best model and inputing the data to receive the results allows us to plot the prediction against the actual result. This shows that the model struggles to predict the higher end of the wins spectrum. This sense since there is less data for the higher wins totals since only 1 team each year reaches the elusive 11 wins needed for a world series victory.
pred = best_RF.predict(X)
res = y-pred
plt.scatter(y,pred)
plt.title('Predicted Win Total vs Actual Win Total')
Adding code to add the prediction to the original dataframe
results_df = post_df
results_df['res'] = res
results_df['pred'] = pred
Looking at the top 5 teams that under performed based on the model, we see some very good teams. The 2011 Yankees were two years removed from a championship and still had the core including Jeter, A-Rod, Cano, Teixera, and Posada. The 2014 Tigers had a hall of fame staff with Scherzer, Verlander, Porcello, and Price. The 2018 Indians had a wealth of pitching as did those two Nationals teams.
results_df.sort_values('res').head(5)
Looking at the teams that most over performed ends up being a list of world series champs plus the 2017 Dodgers. This is because as I mentioned the data is skewed towards lower win totals.
results_df.sort_values('res',ascending=False).head(5)
We can look at all of the predicted champions based on the model by taking the highest predicted win total from the bunch. The deviations from reality occur with the 2010 Texas Rangers, 2012 Tigers, 2014 Royals, and 2016 Indians. The losers of this deal are the Cubs who still haven't broken their curse and the Giants. The Giants actually end up losing 3 championships! Maybe there was some truth to that even year BS.
champs = results_df.sort_values(['Year','pred'],ascending=[True,False]).head(100)
champs.groupby('Year').head(1)
We can also model all of the non playoff teams to see who may have made some noise if they had been able to make it in. The top choice is the 2019 Mets which makes sense due to their elite pitching staff, decent bullpen, and breakout offense that included ROY winner Pete Alonso.
full_X = full_df[['SP_ERA+','opp_er','RP_ERA+']]
full_y = full_df['Wins']
full_pred = best_RF.predict(full_X)
full_res = full_y-full_pred
full_results_df = full_df
full_results_df['res'] = full_res
full_results_df['pred'] = full_pred
full_results_df[full_results_df['PostSeason']=='Missed Postseason'].sort_values('pred',ascending=False).head(5)
In conclusion, predicting playoff success is very difficult for a multitude of reasons. However, this was a very interesting exercise to go through that led to some fun discoveries about some teams from the past decade.
This type of model is never going to be perfect as well due to the fact that the playoffs are not round-robin but match up based. So a team may be the second best team and have to play the best team in the first round.
Some things to add to improve the model: