NBA Player Salary Prediction using R

Reasoning behind the project

I am a big NBA fan, so when I had the opportunity to analyze NBA data for a class project, I jumped right on it. The goal of this project was to be the culmination of the linear regression analysis topics I learned throughout the semester. This was not a traditional data analysis course for me, as I am an Industrial Engineering student, and this was a project in my Econometrics class. The linear models had to be consistent with economic theory, which was a great learning experience. The data set used in this project was provided by the professor, and is from the 1994-95 NBA season.

Defining the model

With the data provided, I decided to investigate the relationship between the salary of an NBA player and a set of explanatory variables: experience, experience squared, points, rebounds, assists, all star status, and position played. A model that can predict a player's salary from these variables can be very useful in contract negotiations for players and general managers.
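Written out as an equation, the model fitted below looks roughly like this. The \(\beta\) symbols and the error term \(u\) are my notation, and reading the omitted position category as centers is an assumption based on guard and forward being the only position dummies in the formula.

```latex
\[
wage = \beta_0 + \beta_1\, exper + \beta_2\, exper^2 + \beta_3\, points
     + \beta_4\, rebounds + \beta_5\, assists
     + \beta_6\, guard + \beta_7\, forward + \beta_8\, allstar + u
\]
```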

nba_lm1 <- lm(wage ~ exper + expersq + points + rebounds + assists + guard + forward + allstar, data=nbawages)
summary(nba_lm1)

## 
## Call:
## lm(formula = wage ~ exper + expersq + points + rebounds + assists + 
##     guard + forward + allstar, data = nbawages)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1730.82  -387.69   -17.89   372.71  2802.40 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  108.368    157.529   0.688  0.49211    
## exper         98.299     40.921   2.402  0.01700 *  
## expersq       -1.582      2.942  -0.538  0.59118    
## points        85.287     12.174   7.006  2.1e-11 ***
## rebounds      51.943     22.596   2.299  0.02231 *  
## assists       39.709     27.798   1.429  0.15434    
## guard       -485.974    154.564  -3.144  0.00186 ** 
## forward     -305.677    119.203  -2.564  0.01090 *  
## allstar      -30.375    163.952  -0.185  0.85316    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 666.6 on 260 degrees of freedom
## Multiple R-squared:  0.5687, Adjusted R-squared:  0.5555 
## F-statistic: 42.86 on 8 and 260 DF,  p-value: < 2.2e-16

Interpreting the model

The output of this model (Figure 2) shows that there are 5 statistically significant explanatory variables at the 5% level. These variables are experience, points, rebounds, guard, and forward. However, experience squared is insignificant, so we will check the turning point of the experience variable, that is, the number of years at which the predicted effect of experience on salary stops increasing and starts decreasing.

tp <- (coefficients(nba_lm1)["exper"])/(-2*coefficients(nba_lm1)["expersq"])
tp

##    exper 
## 31.06474
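The turning point computed above comes from setting the derivative of the fitted wage equation with respect to experience equal to zero. A short sketch of that step, where \(\hat\beta_1\) and \(\hat\beta_2\) are my labels for the exper and expersq estimates (rounded values from the output, so the result differs slightly in the last digits):

```latex
\[
\frac{\partial\, wage}{\partial\, exper} = \hat\beta_1 + 2\hat\beta_2\, exper = 0
\quad\Longrightarrow\quad
exper^{*} = -\frac{\hat\beta_1}{2\hat\beta_2}
          = \frac{98.299}{2 \times 1.582} \approx 31.1
\]
```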

Given that the longest NBA career of all time is 22 years, and the turning point of the experience variable is 31 years, it makes sense that experience squared is an insignificant variable. We can remove this insignificant nonlinear variable from the model and run it again.

nba_lm2 <- lm(wage ~ exper + points + rebounds + assists + guard + forward + allstar, data=nbawages)
summary(nba_lm2)

## 
## Call:
## lm(formula = wage ~ exper + points + rebounds + assists + guard + 
##     forward + allstar, data = nbawages)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1712.98  -396.13   -23.78   380.33  2817.54 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   141.77     144.57   0.981  0.32767    
## exper          77.33      12.38   6.248 1.68e-09 ***
## points         85.58      12.15   7.046 1.64e-11 ***
## rebounds       52.66      22.53   2.338  0.02017 *  
## assists        42.53      27.26   1.560  0.11991    
## guard        -485.09     154.35  -3.143  0.00187 ** 
## forward      -301.35     118.77  -2.537  0.01176 *  
## allstar       -38.59     163.02  -0.237  0.81305    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 665.7 on 261 degrees of freedom
## Multiple R-squared:  0.5682, Adjusted R-squared:  0.5567 
## F-statistic: 49.07 on 7 and 261 DF,  p-value: < 2.2e-16

In the reduced model the same 5 explanatory variables are significant at the 5% level. To give an example of how you would interpret this output, I will look at the points variable. It is statistically significant and has a parameter estimate of 85.58. Because wage is measured in thousands of dollars, this means that, holding all else constant, each additional point per game is associated with a salary that is on average 85.58 thousand dollars (about $85,580) higher.
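To make the interpretation concrete, here is a small sketch that plugs the rounded estimates from the output above into the fitted equation by hand. The player profile is hypothetical (made up for illustration, not a row from the data set), and the variable names `b`, `player`, and `player_plus1` are mine.

```r
# Coefficient estimates copied (rounded) from the summary(nba_lm2) output above.
b <- c("(Intercept)" = 141.77, exper = 77.33, points = 85.58, rebounds = 52.66,
       assists = 42.53, guard = -485.09, forward = -301.35, allstar = -38.59)

# Hypothetical player: illustrative values only, not taken from the data.
player <- c("(Intercept)" = 1, exper = 5, points = 15, rebounds = 6,
            assists = 3, guard = 0, forward = 1, allstar = 0)

# The same player averaging one more point per game, everything else unchanged.
player_plus1 <- player
player_plus1["points"] <- player["points"] + 1

# Predicted salary is the sum of coefficient * value over all terms.
pred_base  <- sum(b * player[names(b)])
pred_plus1 <- sum(b * player_plus1[names(b)])

pred_base               # predicted salary, in thousands of dollars
pred_plus1 - pred_base  # difference equals the points coefficient
```

With the fitted model object available, `predict(nba_lm2, newdata = ...)` would do the same calculation using the unrounded estimates.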

You might notice this model still has 2 statistically insignificant explanatory variables, assists and allstar. It is rare to have every explanatory variable be statistically significant, even if there is logic as to why it might be. For assists, it could be argued that they were undervalued in the 1990s NBA style, as centers were typically the best players and did not average a lot of assists. The allstar variable being statistically insignificant was surprising. It may be insignificant because it is highly correlated with several of the other variables in the model: a player is usually selected as an all star based on statistics such as points, rebounds, and assists. As a result, the allstar variable adds little information beyond what those variables already capture.
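One way to check whether a variable like allstar overlaps with the other regressors is a variance inflation factor (VIF). I did not run this in the original project, so the following is only a sketch: `vif_one` is a hypothetical helper written with base R, and the data frame `sim` is simulated stand-in data, not the NBA data.

```r
# Variance inflation factor (VIF) for one regressor, using base R only:
# regress the target on the other regressors and compute 1 / (1 - R^2).
vif_one <- function(data, target, others) {
  aux <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(aux)$r.squared)
}

# Simulated stand-in data (NOT the NBA data): all-star status driven by scoring.
set.seed(42)
n <- 269
sim <- data.frame(points   = rnorm(n, mean = 10, sd = 5),
                  rebounds = rnorm(n, mean = 4,  sd = 2))
sim$allstar <- as.numeric(sim$points + rnorm(n, sd = 2) > 17)

vif_allstar <- vif_one(sim, "allstar", c("points", "rebounds"))
vif_allstar

# With the real data loaded, the analogous call would be:
# vif_one(nbawages, "allstar",
#         c("exper", "points", "rebounds", "assists", "guard", "forward"))
```

A VIF close to 1 means the variable is nearly unrelated to the other regressors, while larger values mean more of its variation is already explained by them.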

Testing a different type of model

When doing linear regression analysis, it is good practice to test a few different types of models. In this case, I tested a model that used the natural log of the NBA player salary variable as the dependent variable.

nba_lm3 <- lm(lwage ~ exper + expersq + points + rebounds + assists + guard + forward + allstar, data=nbawages)
summary(nba_lm3)

## 
## Call:
## lm(formula = lwage ~ exper + expersq + points + rebounds + assists + 
##     guard + forward + allstar, data = nbawages)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.0621 -0.3520  0.1134  0.4143  1.6876 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.526227   0.146153  37.811  < 2e-16 ***
## exper        0.142208   0.037966   3.746 0.000222 ***
## expersq     -0.005512   0.002730  -2.019 0.044489 *  
## points       0.074594   0.011295   6.604 2.24e-10 ***
## rebounds     0.036035   0.020964   1.719 0.086819 .  
## assists      0.058565   0.025790   2.271 0.023977 *  
## guard       -0.268800   0.143403  -1.874 0.061991 .  
## forward     -0.028100   0.110595  -0.254 0.799633    
## allstar     -0.343290   0.152112  -2.257 0.024850 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6185 on 260 degrees of freedom
## Multiple R-squared:  0.5224, Adjusted R-squared:  0.5077 
## F-statistic: 35.55 on 8 and 260 DF,  p-value: < 2.2e-16

These parameter estimates are interpreted slightly differently now that the natural log of salary is being used. For the points variable, \(100*(e^{0.074594}-1) = 7.74\) is the predicted percent increase in wages given a 1 point per game increase, holding constant the other explanatory variable values. In other words, for every 1 point per game increase a player has, their salary is expected to go up by 7.74%, assuming nothing else changes.
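The percentage calculation above is easy to reproduce. This sketch uses a small helper, `pct_effect` (my name, not part of the original analysis), and compares the exact figure with the common shortcut of reading 100 times the coefficient as the percent change.

```r
# Exact percent change in salary for a one-unit increase in a regressor
# when the dependent variable is ln(salary).
pct_effect <- function(beta) 100 * (exp(beta) - 1)

points_exact  <- pct_effect(0.074594)  # coefficient on points from nba_lm3
points_approx <- 100 * 0.074594        # shortcut: 100 * beta

round(points_exact, 2)   # 7.74
round(points_approx, 2)  # 7.46
```

The shortcut is close when the coefficient is small and drifts further from the exact value as the coefficient grows in magnitude.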

Comparing the two models

To compare these two models, I used a Box-Cox Test (log-likelihood values). These two models are special cases of the Box-Cox transformation where \(\lambda = 0\) and \(\lambda = 1\). The “better” model will have a larger (or less negative) log-likelihood value.
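For reference, the standard form of the Box-Cox transformation of the dependent variable \(y\) is shown here. At \(\lambda = 1\) it is just \(y - 1\), which is the level model up to a shift in the intercept, and at \(\lambda = 0\) it is the natural log.

```latex
\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[2ex]
\ln(y), & \lambda = 0
\end{cases}
\]
```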

library(MASS)  # provides boxcox()
bc <- boxcox(nba_lm1, lambda = seq(-3, 3, 0.05))

lllam0 <- bc$y[which.min(abs(bc$x - 0))] # Log-likelihood at lambda = 0 (ln(y) model); nearest grid point avoids an exact floating-point comparison
lllam0

## [1] -618.4336

lllam1 <- bc$y[which.min(abs(bc$x - 1))] # Log-likelihood at lambda = 1 (level(y) model); nearest grid point avoids an exact floating-point comparison
lllam1

## [1] -626.8299

From these values it can be concluded that the model where lambda = 0, or the ln(y) model, is the preferable model. The ln(y) model is preferable because it has the larger log-likelihood value.
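To get a feel for how large the gap is, the two values can be compared with the cutoff that defines the usual 95% confidence interval for lambda (values within half of the 0.95 chi-square quantile, with 1 degree of freedom, of the maximum log-likelihood). This is a rough sketch I added afterwards using the two numbers printed above; the variable names are mine.

```r
# Log-likelihood values copied from the Box-Cox output above.
ll_log   <- -618.4336  # lambda = 0, ln(y) model
ll_level <- -626.8299  # lambda = 1, level(y) model

gap    <- ll_log - ll_level       # how much higher the ln(y) model is
cutoff <- qchisq(0.95, df = 1) / 2  # half the 95% chi-square quantile, about 1.92

gap
cutoff
gap > cutoff
```

Because the maximum log-likelihood is at least as large as the value at lambda = 0, a gap bigger than the cutoff implies that lambda = 1 falls outside the 95% interval. It does not by itself say whether lambda = 0 falls inside it.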

What can be drawn from all this analysis?

A few factors were found to affect the salary of an NBA player. The number of points scored by a player had the most statistically significant effect, and this effect was positive. This makes sense because scoring a lot of points is a valuable trait to have. Assists also have a positive effect in the ln(y) model, which makes sense for the same reason. Points and assists appear to be valued similarly: their coefficients in the ln(y) model are of similar size (0.075 and 0.059), although I did not formally test whether they are equal. Rebounds have a weaker relationship with salary (significant only at the 10% level in the ln(y) model), but the relationship is still positive. This checks out for the same reason as points and assists.

Moving away from in-game statistics, the position of a player has an impact on their predicted salary. There is evidence to suggest that guards have lower salaries than non-guards on average. This makes sense, as many of the top players in the NBA at this time were non-guards. The amount of experience a player has also has a positive effect, which makes sense because players who have played longer typically make more money once they are off their low-salary rookie contracts.

The last factor that helps determine the salary of an NBA player is whether or not they were an all star. Holding the other variables constant, this has a negative effect on salary. This does not logically make sense, as all stars are usually the best players, and the best players should theoretically make the most money. The effect could be negative because players are not always chosen to be all stars for the same reasons they are paid large amounts. All star starters are chosen largely by fan voting, which can be more of a popularity contest. Players are paid for their production on the court, and not every highly paid player gets voted in as an all star. All stars make up only 31 of the 269 players in this dataset. It would be interesting to see how this varies from year to year, based on which players are voted in.

Ways to improve or change this model

This model could be improved by using more modern data that includes advanced NBA metrics such as Win Shares and Player Efficiency Rating. These two metrics attempt to convey the entire value of a player in one statistic. There would likely be a very strong correlation between these metrics and the salary of NBA players, so including them would likely give a better fit to the data. A time-series model of NBA salary data would also be useful, as you could attempt to predict a player’s upcoming salary from their previous season statistics if their salary is up for renegotiation. This would require multiple years of data and more variables, such as the number of career All-NBA and All-Defense Team appearances, as these affect the size of the contract a player is eligible to sign.

Thanks for reading

I used R Markdown to compile this project for publication on my website, and it was very easy to use. I look forward to learning more about R and data analysis.