In honor of Opening Day (and the Nats’ loss), I wanted to see which teams have had the most success on opening day.
It’s…the Mets (?)
```python
import os, requests, zipfile, io
from IPython.display import HTML  # , Markdown
# from itables import show
import pandas as pd

# Using Quarto's _environment.local file (not saved to GitHub) to set the
# directory where this project lives
DATADIR = os.path.join(os.getenv("MY_PROJECT_DIR"), "data", "gl")
if not os.path.exists(DATADIR):
    os.makedirs(DATADIR)

DOWNLOADFROMRS = True

# Download the current/historical franchise names file from Retrosheet
if DOWNLOADFROMRS:
    r = requests.get("https://www.retrosheet.org/CurrentNames.csv")
    with open(os.path.join(DATADIR, "CurrentNames.csv"), "w") as f:
        f.write("current_team_id,hist_franchise_id,league,division,location,nickname,alt_nickname,first_used,last_used,city,state\n")
        f.write(r.text)

"""
From https://www.retrosheet.org/Nickname.htm:
The fields are:
Current franchise ID
Franchise ID (may be the same as the current one) for earlier data
League
Division (blank before divisional play)
Location name (e.g. Brooklyn, Texas)
Nickname
Alternate nicknames (blank if none for the period)
Date of the first game with this combination
Date of the last game with combination (blank if current one)
City
State
"""

# Download the game-log field descriptions
if DOWNLOADFROMRS:
    r = requests.get("https://www.retrosheet.org/gamelogs/glfields.txt")
    with open(os.path.join(DATADIR, "glfields.txt"), "w") as f:
        f.write(r.text)

# Download and unzip one game-log file per season
if DOWNLOADFROMRS:
    for y in range(1980, 2015):
        zip_file_url = f"https://www.retrosheet.org/gamelogs/gl{y}.zip"
        r = requests.get(zip_file_url)
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall(path=DATADIR)

fields = ['date', 'visit_team_id', 'visit_team_league', 'visit_game_number',
          'home_team_id', 'home_team_league', 'home_game_number',
          'visit_final_score', 'home_final_score']
cols = [0, 3, 4, 5, 6, 7, 8, 9, 10]
namefields = ['current_team_id', 'hist_franchise_id', 'league', 'division',
              'location', 'nickname', 'alt_nickname', 'first_used',
              'last_used', 'city', 'state']

def assign_outcome(row):
    if row.final_score > row.opponent_final_score:
        return 'W'
    elif row.final_score < row.opponent_final_score:
        return 'L'

def read_in_data(year):
    dat = pd.read_csv(f"{DATADIR}/GL{year}.txt", header=None,
                      names=fields, usecols=cols, parse_dates=['date'])
    # One row per team: visiting teams playing their first game of the season ...
    v = (dat.loc[dat.visit_game_number == 1, ]
            .rename(columns=lambda x: x.replace('visit_', ''))
            .rename(columns=lambda x: x.replace('home_', 'opponent_')))
    # ... and home teams playing their first game of the season
    h = (dat.loc[dat.home_game_number == 1, ]
            .rename(columns=lambda x: x.replace('home_', ''))
            .rename(columns=lambda x: x.replace('visit_', 'opponent_')))
    games = pd.concat([v, h], ignore_index=True)
    games['outcome'] = games.apply(assign_outcome, axis=1)
    games['season'] = year
    return games

teamnames = pd.read_csv(os.path.join(DATADIR, 'CurrentNames.csv'),
                        parse_dates=['first_used', 'last_used'])

df = pd.DataFrame()
for year in range(1980, 2015):
    g = read_in_data(year)
    df = pd.concat([df, g], ignore_index=True)

teamsgrouped = df.groupby('team_id')
pct = teamsgrouped.apply(lambda x: x.outcome.value_counts() / x.outcome.size).round(3) * 100
pct = pct.unstack(1)
pct = pct.sort_values('W', ascending=False)  ## decent answer here, but let's merge in team names

pctteamids = list(pct.index.values)
pctteams = (teamnames[teamnames.hist_franchise_id.isin(pctteamids)]
            .sort_values('last_used')
            .drop_duplicates(subset='hist_franchise_id', keep="last"))

finaltable = (pct.merge(pctteams, left_index=True, right_on='hist_franchise_id', how="left")
                 .reset_index()
                 .assign(Team=lambda x: x['location'] + " " + x['nickname'])
                 .rename(columns={'W': 'Wins', 'L': 'Losses'})[['Team', 'Wins', 'Losses']])

HTML(finaltable.to_html(index=False))  # alternative to using the itables package
# with pd.option_context("display.float_format", "{:,.1f}%".format):
#     show(finaltable, dom="lrtp", lengthMenu=[5, 10, len(finaltable)])  # paging=False to show full table
```
Table 1: Opening Day Records, 1980 – 2014 (values are percentages of opening-day games)

| Team | Wins | Losses |
|------|-----:|-------:|
| New York Mets | 71.4 | 28.6 |
| Baltimore Orioles | 65.7 | 34.3 |
| Chicago White Sox | 65.7 | 34.3 |
| Los Angeles Angels | 61.1 | 38.9 |
| Seattle Mariners | 60.0 | 40.0 |
| Arizona Diamondbacks | 58.8 | 41.2 |
| Atlanta Braves | 57.1 | 42.9 |
| Los Angeles Dodgers | 54.3 | 45.7 |
| San Francisco Giants | 54.3 | 45.7 |
| New York Yankees | 54.3 | 45.7 |
| Cincinnati Reds | 54.3 | 42.9 |
| Milwaukee Brewers | 54.3 | 42.9 |
| Detroit Tigers | 54.3 | 45.7 |
| Tampa Bay Rays | 52.9 | 47.1 |
| Florida Marlins | 52.6 | 47.4 |
| Toronto Blue Jays | 51.4 | 48.6 |
| St. Louis Cardinals | 51.4 | 48.6 |
| Colorado Rockies | 50.0 | 50.0 |
| Texas Rangers | 48.6 | 51.4 |
| Pittsburgh Pirates | 48.6 | 51.4 |
| Montreal Expos | 48.0 | 52.0 |
| California Angels | 47.1 | 52.9 |
| Chicago Cubs | 45.7 | 54.3 |
| Boston Red Sox | 45.7 | 54.3 |
| Houston Astros | 42.9 | 57.1 |
| Washington Nationals | 40.0 | 60.0 |
| San Diego Padres | 40.0 | 60.0 |
| Minnesota Twins | 40.0 | 60.0 |
| Cleveland Indians | 40.0 | 60.0 |
| Philadelphia Phillies | 37.1 | 62.9 |
| Oakland Athletics | 37.1 | 62.9 |
| Miami Marlins | 33.3 | 66.7 |
| Kansas City Royals | 28.6 | 71.4 |
Here’s how I did it:
I downloaded[^1] game logs from 1980 – 2014 from [RetroSheet](http://www.retrosheet.org/gamelogs/index.html) and wrote the script above to analyze them. The key was converting the “dyadic” structure of the raw game logs (one row contains information about both teams playing a game) into one with a row for each team. I did this in the `read_in_data` function:
```python
def read_in_data(year):
    dat = pd.read_csv(f"{DATADIR}/GL{year}.txt", header=None,
                      names=fields, usecols=cols, parse_dates=['date'])
    # One row per team: visiting teams playing their first game of the season ...
    v = (dat.loc[dat.visit_game_number == 1, ]
            .rename(columns=lambda x: x.replace('visit_', ''))
            .rename(columns=lambda x: x.replace('home_', 'opponent_')))
    # ... and home teams playing their first game of the season
    h = (dat.loc[dat.home_game_number == 1, ]
            .rename(columns=lambda x: x.replace('home_', ''))
            .rename(columns=lambda x: x.replace('visit_', 'opponent_')))
    games = pd.concat([v, h], ignore_index=True)
    games['outcome'] = games.apply(assign_outcome, axis=1)
    games['season'] = year
    return games
```
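Here's an example of one year's worth of game-log data once it's been run through `read_in_data`. 2014 was a bit of an odd season, since the [Dodgers and the Diamondbacks started their seasons on March 22](https://www.mlb.com/schedule/2014-03-22) by [playing a two-game, regular-season series in Sydney, Australia](https://sabr.org/gamesproj/game/march-22-2014-opening-day-down-under-as-dodgers-defeat-diamondbacks/). Each team then went on to be another team's opening day opponent -- the Dodgers faced the Padres and the Diamondbacks went up against the Giants.

```python
HTML(df[df['season'] == 2014]
     .sort_values("date")
     .drop(columns=["team_league", "opponent_team_league", "season"])
     .to_html(index=False))
```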
The results table includes two teams that have renamed themselves (the Angels and the Marlins), and one that moved and changed its name (the Expos). I'd like to update my script to deal with these, but should I really lump the Expos in with the Nats?
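If I did collapse them, one option would be to map each historical team ID to its current franchise ID before grouping, using the CurrentNames.csv data I already download. Here's a rough sketch of that idea (not something the script above actually does), reusing the `df` and `teamnames` frames, whose `hist_franchise_id` and `current_team_id` columns come from the header line I write into CurrentNames.csv:

```python
# Sketch only: roll each historical team ID up into its current franchise ID
# (e.g. the California Angels rows join the current Angels, the Florida Marlins
# rows join Miami). Note this would also fold the Expos into the Nats, which is
# exactly the question above.
id_map = (teamnames.drop_duplicates(subset='hist_franchise_id', keep='last')
                   .set_index('hist_franchise_id')['current_team_id'])

df['franchise_id'] = df['team_id'].map(id_map).fillna(df['team_id'])

pct_franchise = (df.groupby('franchise_id')
                   .apply(lambda x: x.outcome.value_counts() / x.outcome.size)
                   .round(3) * 100).unstack(1).sort_values('W', ascending=False)
```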
Also, I was relieved to see the Dodgers and the Giants were tied, at least.
[^1]: I did this manually back in 2015, but as I was porting this to the new site, I wrote a script to automatically download and unzip the files.