In honor of Opening Day (and the Nats’ loss), I wanted to see which teams have had the most success on opening day.
It’s…the Mets (?)
```python
import os, requests, zipfile, io
from IPython.display import HTML  # , Markdown
# from itables import show
import pandas as pd

# Using Quarto's _environment.local file (not saved to GitHub) to set the
# directory where this project lives
DATADIR = os.path.join(os.getenv("MY_PROJECT_DIR"), "data", "gl")
if not os.path.exists(DATADIR):
    os.makedirs(DATADIR)

DOWNLOADFROMRS = True

# Download the current/historical franchise names file from Retrosheet
if DOWNLOADFROMRS:
    r = requests.get("https://www.retrosheet.org/CurrentNames.csv")
    with open(os.path.join(DATADIR, "CurrentNames.csv"), "w") as f:
        f.write("current_team_id,hist_franchise_id,league,division,location,nickname,alt_nickname,first_used,last_used,city,state\n")
        f.write(r.text)

"""
From https://www.retrosheet.org/Nickname.htm:
The fields are:
Current franchise ID
Franchise ID (may be the same as the current one) for earlier data
League
Division (blank before divisional play)
Location name (e.g. Brooklyn, Texas)
Nickname
Alternate nicknames (blank if none for the period)
Date of the first game with this combination
Date of the last game with combination (blank if current one)
City
State
"""

# Download the game-log field descriptions
if DOWNLOADFROMRS:
    r = requests.get("https://www.retrosheet.org/gamelogs/glfields.txt")
    with open(os.path.join(DATADIR, "glfields.txt"), "w") as f:
        f.write(r.text)

# Download and unzip one game-log file per season
if DOWNLOADFROMRS:
    for y in range(1980, 2015):
        zip_file_url = f"https://www.retrosheet.org/gamelogs/gl{y}.zip"
        r = requests.get(zip_file_url)
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall(path=DATADIR)

fields = ['date', 'visit_team_id', 'visit_team_league', 'visit_game_number',
          'home_team_id', 'home_team_league', 'home_game_number',
          'visit_final_score', 'home_final_score']
cols = [0, 3, 4, 5, 6, 7, 8, 9, 10]
namefields = ['current_team_id', 'hist_franchise_id', 'league', 'division',
              'location', 'nickname', 'alt_nickname', 'first_used',
              'last_used', 'city', 'state']

def assign_outcome(row):
    if row.final_score > row.opponent_final_score:
        return 'W'
    elif row.final_score < row.opponent_final_score:
        return 'L'

def read_in_data(year):
    dat = pd.read_csv(f"{DATADIR}/GL{year}.txt", header=None,
                      names=fields, usecols=cols, parse_dates=['date'])
    # One row per team: visiting teams playing their first game of the season ...
    v = (dat.loc[dat.visit_game_number == 1, ]
            .rename(columns=lambda x: x.replace('visit_', ''))
            .rename(columns=lambda x: x.replace('home_', 'opponent_')))
    # ... and home teams playing their first game of the season
    h = (dat.loc[dat.home_game_number == 1, ]
            .rename(columns=lambda x: x.replace('home_', ''))
            .rename(columns=lambda x: x.replace('visit_', 'opponent_')))
    games = pd.concat([v, h], ignore_index=True)
    games['outcome'] = games.apply(assign_outcome, axis=1)
    games['season'] = year
    return games

teamnames = pd.read_csv(os.path.join(DATADIR, 'CurrentNames.csv'),
                        parse_dates=['first_used', 'last_used'])

df = pd.DataFrame()
for year in range(1980, 2015):
    g = read_in_data(year)
    df = pd.concat([df, g], ignore_index=True)

teamsgrouped = df.groupby('team_id')
pct = teamsgrouped.apply(lambda x: x.outcome.value_counts() / x.outcome.size).round(3) * 100
pct = pct.unstack(1)
pct = pct.sort_values('W', ascending=False)  ## decent answer here, but let's merge in team names

pctteamids = list(pct.index.values)
pctteams = (teamnames[teamnames.hist_franchise_id.isin(pctteamids)]
            .sort_values('last_used')
            .drop_duplicates(subset='hist_franchise_id', keep="last"))

finaltable = (pct.merge(pctteams, left_index=True, right_on='hist_franchise_id', how="left")
                 .reset_index()
                 .assign(Team=lambda x: x['location'] + " " + x['nickname'])
                 .rename(columns={'W': 'Wins', 'L': 'Losses'})[['Team', 'Wins', 'Losses']])

HTML(finaltable.to_html(index=False))  # alternative to using the itables package
# with pd.option_context("display.float_format", "{:,.1f}%".format):
#     show(finaltable, dom="lrtp", lengthMenu=[5, 10, len(finaltable)])  # paging=False to show full table
```
Table 1: Opening Day Records, 1980 – 2014 (values are percentages of opening-day games)

| Team | Wins | Losses |
|------|-----:|-------:|
| New York Mets | 71.4 | 28.6 |
| Baltimore Orioles | 65.7 | 34.3 |
| Chicago White Sox | 65.7 | 34.3 |
| Los Angeles Angels | 61.1 | 38.9 |
| Seattle Mariners | 60.0 | 40.0 |
| Arizona Diamondbacks | 58.8 | 41.2 |
| Atlanta Braves | 57.1 | 42.9 |
| Los Angeles Dodgers | 54.3 | 45.7 |
| San Francisco Giants | 54.3 | 45.7 |
| New York Yankees | 54.3 | 45.7 |
| Cincinnati Reds | 54.3 | 42.9 |
| Milwaukee Brewers | 54.3 | 42.9 |
| Detroit Tigers | 54.3 | 45.7 |
| Tampa Bay Rays | 52.9 | 47.1 |
| Florida Marlins | 52.6 | 47.4 |
| Toronto Blue Jays | 51.4 | 48.6 |
| St. Louis Cardinals | 51.4 | 48.6 |
| Colorado Rockies | 50.0 | 50.0 |
| Texas Rangers | 48.6 | 51.4 |
| Pittsburgh Pirates | 48.6 | 51.4 |
| Montreal Expos | 48.0 | 52.0 |
| California Angels | 47.1 | 52.9 |
| Chicago Cubs | 45.7 | 54.3 |
| Boston Red Sox | 45.7 | 54.3 |
| Houston Astros | 42.9 | 57.1 |
| Washington Nationals | 40.0 | 60.0 |
| San Diego Padres | 40.0 | 60.0 |
| Minnesota Twins | 40.0 | 60.0 |
| Cleveland Indians | 40.0 | 60.0 |
| Philadelphia Phillies | 37.1 | 62.9 |
| Oakland Athletics | 37.1 | 62.9 |
| Miami Marlins | 33.3 | 66.7 |
| Kansas City Royals | 28.6 | 71.4 |
Here’s how I did it:
I downloaded[^1] game logs from 1980 – 2014 from [RetroSheet](http://www.retrosheet.org/gamelogs/index.html) and wrote the script above to analyze them. The key was converting the “dyadic” structure of the raw game logs (one row contains information about both teams playing a game) into one with a row for each team. I did this in the `read_in_data` function:
```python
def read_in_data(year):
    dat = pd.read_csv(f"{DATADIR}/GL{year}.txt", header=None,
                      names=fields, usecols=cols, parse_dates=['date'])
    # One row per team: visiting teams playing their first game of the season ...
    v = (dat.loc[dat.visit_game_number == 1, ]
            .rename(columns=lambda x: x.replace('visit_', ''))
            .rename(columns=lambda x: x.replace('home_', 'opponent_')))
    # ... and home teams playing their first game of the season
    h = (dat.loc[dat.home_game_number == 1, ]
            .rename(columns=lambda x: x.replace('home_', ''))
            .rename(columns=lambda x: x.replace('visit_', 'opponent_')))
    games = pd.concat([v, h], ignore_index=True)
    games['outcome'] = games.apply(assign_outcome, axis=1)
    games['season'] = year
    return games
```
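Here's an example of one year's worth of game-log data once it's been run through `read_in_data`. 2014 was a bit of an odd season, since the [Dodgers and the Diamondbacks started their seasons on March 22](https://www.mlb.com/schedule/2014-03-22) by [playing a two-game, regular-season series in Sydney, Australia](https://sabr.org/gamesproj/game/march-22-2014-opening-day-down-under-as-dodgers-defeat-diamondbacks/). Each team then went on to be another team's opening day opponent -- the Dodgers faced the Padres and the Diamondbacks went up against the Giants.

```python
HTML(df[df['season'] == 2014]
     .sort_values("date")
     .drop(columns=["team_league", "opponent_team_league", "season"])
     .to_html(index=False))
```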
The results table includes two teams that have renamed themselves (the Angels and the Marlins), and one that moved and changed its name (the Expos). I'd like to update my script to deal with these, but should I really lump the Expos in with the Nats?
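If I did collapse them, one option would be to map each historical team ID to its current franchise ID before grouping, using the CurrentNames.csv data I already download. Here's a rough sketch of that idea (not something the script above actually does), reusing the `df` and `teamnames` frames, whose `hist_franchise_id` and `current_team_id` columns come from the header line I write into CurrentNames.csv:

```python
# Sketch only: roll each historical team ID up into its current franchise ID
# (e.g. the California Angels rows join the current Angels, the Florida Marlins
# rows join Miami). Note this would also fold the Expos into the Nats, which is
# exactly the question above.
id_map = (teamnames.drop_duplicates(subset='hist_franchise_id', keep='last')
                   .set_index('hist_franchise_id')['current_team_id'])

df['franchise_id'] = df['team_id'].map(id_map).fillna(df['team_id'])

pct_franchise = (df.groupby('franchise_id')
                   .apply(lambda x: x.outcome.value_counts() / x.outcome.size)
                   .round(3) * 100).unstack(1).sort_values('W', ascending=False)
```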
Also, I was relieved to see the Dodgers and the Giants were tied, at least.
[^1]: I did this manually back in 2015, but as I was porting this to the new site, I wrote a script to automatically download and unzip the files.