# Fixed-Effects Regressions: An Exploration of PGA Tour Panel Data

# Preface

The PGA Tour, golf’s premier professional tour, travels almost weekly to some of the world’s most renowned golf courses (about 45 tournaments per year). Roughly 120 golfers attack each course for two days, but only the top-performing 65–70 players remain for two more rounds of golf on the weekend. Those remaining golfers have a chance to win the tournament, as well as large sums of money.

The rest of us non-professional golfers have a chance to predict who is going to win those large sums of money each week. **To do so, I will collect informative statistics and create fixed-effects regression models to predict the final score of each golfer at any given tournament.** Tournaments are generally held at the same courses every year, which is what motivates the *fixed-effects regression models*.

# Data Curation

To obtain this data for free, I plan to scrape it from a reputable site. The first site I experiment with is PGATOUR.com/stats, from which I end up scraping about 18,000 rows of data. I clean and merge 13 metrics from each 4-round PGA tournament over the last 7 years; however, the compiled dataset has one major drawback. Data is only available for golfers who made the cut (played all 4 rounds), so it misses about half the field from each tournament and would introduce *survivorship bias* into my dataset.
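The per-metric tables have to be joined on common keys before modeling. A minimal sketch of that merge step with pandas, using invented column names and toy rows (the real tables and keys may differ):

```python
import pandas as pd

# Toy stand-ins for two of the scraped stat tables (columns hypothetical)
driving = pd.DataFrame({"golfer": ["A", "B"], "tournament": ["X", "X"], "sg_tee": [0.4, -0.1]})
putting = pd.DataFrame({"golfer": ["A", "B"], "tournament": ["X", "X"], "sg_putting": [0.2, 0.7]})

# Each per-metric table joins on the golfer/tournament keys
merged = driving.merge(putting, on=["golfer", "tournament"], how="inner")
print(merged.columns.tolist())
```

An inner join here is what creates the survivorship problem: a golfer missing from any one table drops out of the merged dataset entirely.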

I find new data on a different site *(pictured below)* containing statistics on every player in the field in each tournament. The site, however, proves tricky to scrape.

It contains a dropdown menu to select the individual tournaments, multiple years to select for each tournament, and a large embedded chart. I write a bit of a complex script *(portion below)* and learn a few things in the process of collecting the data.

```python
import time

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import Select

results_list = []

for tourney in tourney_list:
    # select the tournament from the dropdown menu
    sel = Select(dropdown)
    sel.select_by_visible_text(f'{tourney}')
    time.sleep(5)  # wait for the embedded chart to reload

    # each tournament page offers multiple years to click through
    years = driver.find_elements_by_css_selector("text.yearoptions")
    for year in years:
        year.click()
        time.sleep(5)

        # grab the embedded chart and its data rows
        graph = driver.find_element_by_css_selector("div.table")
        rows = graph.find_elements_by_class_name("datarow")

        i = 0
        e = 0
        v = 0
        scorelist = []

        for row in rows:
            player_dict = {}
            player_dict["tournament"] = tourney
            player_dict["year"] = year.text
            try:
                golfer = row.find_element_by_id("col_text1").text
            except NoSuchElementException:
                golfer = ''
            player_dict["golfer"] = golfer
```

The script collects data on about 120 golfers per tournament, for almost every PGA tournament dating back to 2010.
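The per-player dictionaries the scraper builds still need to be assembled into a table. A minimal sketch of that step, with invented sample records (and a hypothetical output filename) standing in for the real scrape:

```python
import pandas as pd

# Invented sample of the per-player dictionaries the scraper builds
results_list = [
    {"tournament": "THE PLAYERS Championship", "year": "2019", "golfer": "Rory McIlroy"},
    {"tournament": "THE PLAYERS Championship", "year": "2019", "golfer": ""},
]

# Assemble the scraped records into one DataFrame and drop rows with empty names
df = pd.DataFrame(results_list)
df = df[df["golfer"] != ""].reset_index(drop=True)
df.to_csv("scraped_results.csv", index=False)  # hypothetical filename
```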

# The Data

Totaling 59,000 rows, the dataset is vast. After pruning tournaments missing the four key statistics I use as the Xs in the models (mostly foreign courses), I am left with 52,000 rows. Each row contains the tournament name, the course for the tournament, the year of the tournament (unfortunately not the exact date), and the golfer’s name as identifiers. The panel data is primed for fixed-effects regressions.

The four key statistics in the dataset are all-encompassing measures of how well a golfer performs **off-the-tee**, **approaching-the-green**, **around-the-green**, and **putting**, in comparison to the field. Mark Broadie, a professor at Columbia University’s business school, developed the statistics using data provided to academia by the PGA Tour. Here is *strokes gained: off-the-tee* explained:

> The number of strokes a player takes from a specific distance off the tee on par 4s and par 5s is measured against a statistical baseline to determine the player’s strokes gained or lost off the tee on a hole. The sum of the values for all holes played in a round minus the field average strokes gained/lost for the round is the player’s strokes gained/lost for that round.

*Source: https://www.pgatour.com/stats/stat.02567.html (bottom of the webpage)*
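As a toy illustration of the per-shot arithmetic (all numbers invented, not real PGA Tour baselines):

```python
# Invented numbers: strokes gained off-the-tee on a single par-4 drive.
# The baseline says the field averages 4.10 strokes to hole out from this tee;
# from where the drive finishes, the baseline expectation is 2.95 strokes.
baseline_before = 4.10   # expected strokes from the tee
baseline_after = 2.95    # expected strokes from the drive's finishing position
strokes_taken = 1        # the drive itself

# Strokes gained on the shot: expected progress minus strokes actually spent
sg_off_the_tee = baseline_before - baseline_after - strokes_taken
print(round(sg_off_the_tee, 2))  # 0.15 strokes gained on this drive
```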

Next I **create binary dummy variables** for each tournament course in order to capture the unobserved factors that vary across courses but stay constant over time. I create the same binary dummy variables for each year to do the opposite: capture unobserved factors that vary over time but stay constant across courses.

After making lists of each unique course and a separate list for each year in the dataset, I use for loops to make new columns (regressors) in the dataframe and populate them initially with *zeros*, and then with *ones* for their corresponding courses and years.

```python
# creating columns for each course
for course in courses:
    df1[f'{course}'] = 0

# populating each course column with a 1 for its respective course
for x in courses:
    df1.loc[df1.course == f'{x}', f'{x}'] = 1

# creating columns for each year
for year in years:
    df2[year] = 0

# populating each year column with a 1 for its respective year
for x in years:
    df2.loc[df2.year == x, x] = 1
```
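For what it’s worth, the same one-hot encoding can be produced in a single call with pandas’ `get_dummies`. A sketch on a toy frame (course names and years invented for the example):

```python
import pandas as pd

# Toy frame standing in for the real dataset
df = pd.DataFrame({
    "course": ["TPC Sawgrass", "Muirfield Village GC", "TPC Sawgrass"],
    "year": [2019, 2019, 2020],
})

# pd.get_dummies builds the same 0/1 columns in one call;
# drop_first=True drops one category per set, avoiding perfect multicollinearity
dummies = pd.get_dummies(df, columns=["course", "year"], drop_first=True, dtype=int)
print(dummies.columns.tolist())
```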

With the courses totaling 104 and the years totaling 12, I now have a **total of 120 regressors**, including the four strokes gained statistics, to regress on the golfers’ final scores.

# Models

I read in the final dataset and **randomize its order** to ensure the models are trained using data from all years and courses.

```python
dta2 = pd.read_csv("panel_data_timeandcourse.csv")
dta2['ML_group'] = np.random.randint(100, size=dta2.shape[0])
dta2 = dta2.sort_values(by='ML_group')
```

Employing the TVT split (*train, validate, test*), I create splitting filters that **partition the randomized dataset by its “ML group” number** *(below)*, allowing **80%** of the data to train the model, **10%** to test and predict, and **10%** to validate the prediction.

```python
inx_train2 = dta2.ML_group < 80
inx_valid2 = (dta2.ML_group < 90) & (dta2.ML_group >= 80)
inx_test2 = dta2.ML_group >= 90
```
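An equivalent 80/10/10 split could also be sketched with scikit-learn’s `train_test_split`, applied twice (toy arrays below, not the real dataframe):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real regressors and scores
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# First carve off 20% for validate+test, then split that portion half-and-half
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.2, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
print(len(X_train), len(X_valid), len(X_test))  # 800 100 100
```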

I designate the **final score** data as the Y (dependent variable) and the **120 regressors** we discussed as the Xs (independent variables, *abbreviated below*). However, to avoid *perfect multicollinearity*, I must drop one binary regressor from both the years and the courses variable sets, making this an *“N-1 & T-1 Binary Regression Method”*.

```python
Y_train2 = dta2.score[inx_train2].to_list()
Y_valid2 = dta2.score[inx_valid2].to_list()
Y_test2 = dta2.score[inx_test2].to_list()

X_train2 = dta2.loc[inx_train2, ['sg_putting', 'sg_arg', 'sg_approach', 'sg_tee',
                                 'Muirfield Village GC', 'Muirfield Village Golf Club',
                                 'TPC Louisiana', 'Sherwood Country Club', 'Sedgefield CC',
                                 .......
                                 '2016', '2017', '2018', '2019', '2020', '2021']]
```

*A side note:*

Before I proceed to run a scikit-learn (Python ML package) linear regression, I manipulate the data a little. With four non-binary regressors, I want to test some interaction terms to see if they have statistical significance. I use a lambda function to calculate interaction terms between all four regressors, but they prove counterproductive.

The new regressors overfit the model: while they marginally increase the number of correct or nearly correct score predictions, the number of wildly incorrect predictions increases as well.
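One way to sketch those pairwise interaction terms (a plain loop here rather than the lambda used in the original script, with toy values invented for the example):

```python
from itertools import combinations

import pandas as pd

sg_cols = ["sg_putting", "sg_arg", "sg_approach", "sg_tee"]
df = pd.DataFrame([[0.5, -0.2, 1.1, 0.3]], columns=sg_cols)  # toy row

# Build all pairwise products of the four strokes-gained regressors
for a, b in combinations(sg_cols, 2):
    df[f"{a}_x_{b}"] = df[a] * df[b]
print(df.columns.tolist())  # 4 originals + 6 interaction terms
```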

The first model I utilize is a simple linear regression model. I implement it with sklearn:

```python
from sklearn import linear_model

# model declaration
model = linear_model.LinearRegression()

# training
result3 = model.fit(X_train2, Y_train2)

# prediction (rounded to whole strokes)
dta2['score_hat'] = np.concatenate(
    [result3.predict(X_train2),
     result3.predict(X_valid2),
     result3.predict(X_test2)]
).round().astype(int)

# confusion matrix on the validation set
dta2['result'] = 0
results3 = dta2.loc[inx_valid2].apply(
    lambda x: confusion(x['score'], x['score_hat']), axis=1)
dta2.loc[inx_valid2, 'result'] = results3
```

The *confusion matrix*, **a technique to measure the spread of a model’s predictions**, is implemented with a lambda function that feeds the actual scores *['score']* and the predicted scores *['score_hat']* into a confusion function I wrote beforehand *(below)*.

```python
def confusion(x, y):
    # bucket the prediction error |actual - predicted| into a label
    d = abs(x - y)
    if d == 0:
        return 'Exact'
    elif d <= 12:
        return f'{d} off'
    elif d <= 15:
        return '13-15 off'
    elif d <= 19:
        return '16-19 off'
    elif d <= 25:
        return '20-25 off'
    else:
        return '26+ off'
```

The results of the regular sklearn.linear_model.LinearRegression(), *normalized* (expressed as proportions), are shown below:

```
In [45]: results3.value_counts(normalize=True)
Out[45]:
1 off        0.290673
2 off        0.219722
Exact        0.152012
3 off        0.137517
4 off        0.082205
5 off        0.050734
6 off        0.025367
7 off        0.016975
8 off        0.007820
9 off        0.005722
10 off       0.003052
11 off       0.002670
13-15 off    0.002479
12 off       0.001907
16-19 off    0.000572
20-25 off    0.000572
```

**80%** of the predicted scores are **within 3 strokes** of the golfer’s actual final score, and only **1.1%** of predicted scores are 10 or more strokes away from the actual score. These results are incredibly encouraging compared to the regression results **without the fixed-effects dummy variables** *(below)*…

```
In [53]: results3.value_counts(normalize=True)
Out[53]:
1 off        0.197597
2 off        0.175854
3 off        0.139424
Exact        0.108716
4 off        0.104711
5 off        0.082777
6 off        0.054358
7 off        0.046920
8 off        0.026702
9 off        0.021362
10 off       0.015068
13-15 off    0.008964
11 off       0.008201
12 off       0.005913
16-19 off    0.003052
20-25 off    0.000381
```

…but I think they could still be better.

After testing a handful of tree models and receiving sprawling results, **I do not go further with any other classification machine-learning methods**. However, I run some **ridge** and **lasso** regressions with numerous different alphas.

Every alpha I test from 0.001 to 1 is **unsuccessful** in beating the results of the classic linear regression.
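For reference, the alpha sweep might be sketched like this on synthetic data (invented X and y, not the real dataset):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic stand-ins for the regressors and final scores
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=200)

# Sweep a range of alphas for both penalized regressions
for alpha in [0.001, 0.01, 0.1, 1.0]:
    ridge_r2 = Ridge(alpha=alpha).fit(X, y).score(X, y)
    lasso_r2 = Lasso(alpha=alpha).fit(X, y).score(X, y)
    print(f"alpha={alpha}: ridge R2={ridge_r2:.3f}, lasso R2={lasso_r2:.3f}")
```

In practice the sweep would compare validation-set predictions against the linear-regression baseline rather than in-sample R².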

# Next Steps

I will continue to grow the dataset as each tournament plays out this 2021 season and beyond. I also plan to create forecasting models using AR and ARIMA methods to forecast each non-binary X variable (the strokes-gained statistics), and to use both models in tandem to forecast results for each golfer in advance of each tournament.
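The AR piece of that plan can be sketched by hand on an invented strokes-gained series (the real version would use a proper ARIMA implementation such as statsmodels):

```python
import numpy as np

# Invented strokes-gained history for one golfer
rng = np.random.default_rng(0)
sg = 0.3 + rng.normal(scale=0.5, size=40)

# Hand-rolled AR(1): regress each value on its predecessor, then forecast one step
x, y = sg[:-1], sg[1:]
phi, c = np.polyfit(x, y, 1)   # slope and intercept of the AR(1) fit
next_sg = c + phi * sg[-1]     # one-step-ahead forecast of the next tournament's value
print(round(next_sg, 2))
```

The forecast for each of the four strokes-gained variables would then be fed into the fixed-effects model to predict a final score before the tournament is played.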