In The Bonus: Predicting Fouls in the NBA

The NBA has a foul problem.

In recent years, NBA officials have rewritten the rulebook, hoping to eliminate offensive players lunging into a jumping defender and, more recently, to penalize take fouls (fouls that stop would-be open runs to the basket). These changes come from a surface-level view of the game, essentially targeting fouls that look funky, without examining which combination of factors leads to more fouls. Understanding that combination could allow the NBA to create rule changes that truly reduce fouls and, in turn, increase pace of play.

Data & Methods:

The data used in this project required quite a bit of cleaning. I used play-by-play data from 500 regular-season games in the 2019-2020 NBA season. Because the original play-by-play data has one row for every event in a game, condensing it into a usable form was essential. I first computed running totals of the counting stats, adding columns for how many steals, blocks, fouls, etc. had occurred so far in each game. I then kept the first row of every minute, which reduced the data further without losing much information. To account for time-based differences in counting stats (running totals are, of course, higher later in the game), I converted each counting stat to a frequency. Although not the simplest pre-processing, this ultimately produced a dataset with a response variable (future foul frequency) and a set of frequencies and shooting percentages as explanatory variables.
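The condensing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the event tuples, the `TRACKED` list, and the `minute_snapshots` helper are all hypothetical, and it assumes that a running count includes the current event and that a "frequency" means a count divided by minutes elapsed.

```python
from collections import defaultdict

# Hypothetical play-by-play rows: (game_id, seconds_elapsed, event_type).
events = [
    ("G1", 15, "foul"),
    ("G1", 40, "steal"),
    ("G1", 70, "foul"),
    ("G1", 95, "block"),
    ("G1", 130, "foul"),
]

TRACKED = ("foul", "steal", "block")  # assumed subset of the counting stats


def minute_snapshots(events):
    """Return one row per (game, minute) with running counts and per-minute rates."""
    running = defaultdict(lambda: dict.fromkeys(TRACKED, 0))
    seen = set()
    rows = []
    for game_id, seconds, event_type in sorted(events, key=lambda e: (e[0], e[1])):
        # Step 1: update the running totals (assumption: the current event counts).
        if event_type in TRACKED:
            running[game_id][event_type] += 1
        # Step 2: keep only the first event seen in each minute of each game.
        minute = seconds // 60
        if (game_id, minute) in seen:
            continue
        seen.add((game_id, minute))
        # Step 3: convert counts to frequencies (assumption: count per minute elapsed).
        elapsed_min = seconds / 60
        row = {"game_id": game_id, "minute": minute}
        for stat in TRACKED:
            count = running[game_id][stat]
            row[f"{stat}_count"] = count
            row[f"{stat}_freq"] = count / elapsed_min if elapsed_min > 0 else 0.0
        rows.append(row)
    return rows


rows = minute_snapshots(events)
for row in rows:
    print(row["minute"], row["foul_count"], round(row["foul_freq"], 3))
```

With real data, a dataframe library would likely be the more natural tool; plain Python is used here only to keep the sketch dependency-free.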

The full set of explanatory variables includes counting stats (points, assists, rebounds, steals, blocks, etc.), shooting percentages (split into 2- and 3-point shots), time remaining, score differential, technical fouls, and past foul frequency. My response variable is future foul rate, defined as the number of fouls in the remainder of the game divided by the time remaining in the game.
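As a worked example of that definition, here is a small sketch. The function name, the 48-minute regulation length, and the sample numbers are assumptions for illustration; how the original analysis handled overtime is not stated, so it is ignored here.

```python
GAME_MINUTES = 48  # assumed regulation length; overtime is ignored in this sketch


def future_foul_rate(fouls_so_far, total_fouls, minutes_elapsed, game_minutes=GAME_MINUTES):
    """Fouls still to come divided by minutes still to play."""
    minutes_left = game_minutes - minutes_elapsed
    if minutes_left <= 0:
        return None  # the rate is undefined once no time remains
    return (total_fouls - fouls_so_far) / minutes_left


# Made-up game: 12 fouls through the first 12 minutes, 39 fouls by the final buzzer.
rate = future_foul_rate(fouls_so_far=12, total_fouls=39, minutes_elapsed=12)
print(rate)  # 27 remaining fouls over 36 remaining minutes
```

Note that the response uses information from the end of the game (the final foul total), so it can only be computed after the fact, which is what makes it a prediction target.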

I used several different methods, starting with a basic linear regression before moving on to LASSO, random forest, and XGBoost models. I personally like the predictive power of XGBoost, and find that SHAP scores can make it much more interpretable than similarly powerful methods. In simple terms, XGBoost is a boosted tree-based algorithm that often outperforms other tree- and non-tree-based algorithms on this kind of tabular data. Using XGBoost was a means of finding a more accurate answer than a simple linear regression might be able to find. Ultimately, my goal was simply to find the most important variables, not to maximize prediction accuracy, so a linear regression was likely sufficient and would easily have been the most time-efficient option. XGBoost did predict foul rates more accurately, so the graphs and insights you’ll see below come from the XGBoost model.
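To make "boosted tree-based" a little more concrete, here is a minimal sketch of gradient boosting with one-split trees (stumps), compared against a one-feature linear fit. It is not XGBoost itself, which adds regularization and many other refinements, and the data are invented: eight made-up foul rates that peak mid-game. All function names are hypothetical.

```python
# Hypothetical data: game minute vs. foul rate, peaking mid-game (made-up numbers).
minutes = [0, 6, 12, 18, 24, 30, 36, 42]
foul_rate = [0.6, 0.8, 1.0, 1.1, 1.1, 1.0, 0.8, 0.5]


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


def fit_line(xs, ys):
    """Ordinary least squares with one feature; returns (intercept, slope)."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def fit_stump(xs, residuals):
    """Find the single threshold split that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [
            p + learning_rate * (left_mean if x <= threshold else right_mean)
            for x, p in zip(xs, preds)
        ]
    return base, stumps, learning_rate


def predict_boosted(model, x):
    base, stumps, learning_rate = model
    return base + sum(learning_rate * (lm if x <= t else rm) for t, lm, rm in stumps)


intercept, slope = fit_line(minutes, foul_rate)
linear_mse = mse(foul_rate, [intercept + slope * x for x in minutes])

model = boost(minutes, foul_rate)
boosted_mse = mse(foul_rate, [predict_boosted(model, x) for x in minutes])

print(f"linear MSE:  {linear_mse:.4f}")
print(f"boosted MSE: {boosted_mse:.4f}")
```

The straight line cannot follow a rate that rises and then falls, while the stacked stumps can. Both errors here are measured on the training points, which flatters the more flexible model; a fair comparison would score each model on held-out games.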

What factors lead to higher foul rates throughout an NBA game?

The graph below shows the SHAP scores for the model that predicts future foul rate. As a brief explainer: SHAP scores assign an importance value to each feature in the model. In this graph, minute is the most important variable, followed by score_diff (score differential) and foul_freq (foul frequency to that point in the game). Each dot represents a row in the testing dataset. A dot's color shows the value of that feature for that row, with purple representing a high value and yellow a low value, and its position shows how much that value pushed the prediction up or down. From this, we see that games with low score differentials have higher expected foul rates. My distilled understanding of both this graph and the other models can be found below.
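For readers curious about what a SHAP score is underneath, the sketch below computes exact Shapley values for one row of a tiny, invented model by enumerating every coalition of features. It is a brute-force illustration rather than what SHAP tooling does for tree models, and it simplifies by standing in a single baseline row for "absent" features instead of averaging over a background dataset. `toy_model`, its coefficients, and both rows are hypothetical; only the feature names come from the graph.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, row, baseline):
    """Exact Shapley values for one row; absent features take their baseline value."""
    names = list(row)
    n = len(names)

    def value(coalition):
        mixed = {name: (row[name] if name in coalition else baseline[name]) for name in names}
        return model(mixed)

    phi = {}
    for name in names:
        others = [other for other in names if other != name]
        total = 0.0
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without = set(subset)
                total += weight * (value(without | {name}) - value(without))
        phi[name] = total
    return phi


def toy_model(row):
    """Hypothetical additive foul-rate model; the coefficients are invented."""
    return 0.9 - 0.02 * row["score_diff"] + 0.25 * row["foul_freq"] - 0.002 * row["minute"]


baseline = {"minute": 24, "score_diff": 10, "foul_freq": 0.8}  # made-up "typical" row
close_game = {"minute": 30, "score_diff": 2, "foul_freq": 1.2}  # made-up row to explain

phi = shapley_values(toy_model, close_game, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.3f}")
```

The contributions add up to the gap between the prediction for this row and the prediction for the baseline row. In this toy model, the small score differential pushes the predicted foul rate up, mirroring the direction seen in the graph.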

SHAP values for predicting the rate of fouls in the remainder of a game

Fouls are predictable — but not removable…

The biggest drivers of higher foul rates are also the two biggest drivers of intensity — a low point differential (more fouls occur in a close game) and middle-to-low time remaining (more fouls are expected to occur during the middle of the game). The start of a game and the last few minutes (when many games are already decided) see the fewest expected fouls.

Expect the most fouls in fast-paced games (those where more shots are being taken) and games with active defenders (higher steal frequency correlates with higher foul frequency).

Further, games which have seen a lot of fouls will continue to have a lot of fouls as the game continues to unfold.

Interestingly, many of the factors I looked at were not predictors of foul rate at all. Technical fouls had no measurable effect on the predicted number of fouls, despite theoretically serving as a deterrent for misbehaving players and coaches. Blocks, assists, and rebounds (whether offensive or defensive) likewise showed no effect on foul frequency. Teams shooting well, or poorly, from anywhere on the court is also not a harbinger of higher foul rates.

What this means for the NBA

The NBA’s foul problem might be here to stay, and unfortunately it will show up the most under the brightest lights, when games are the most intense. Refs could anticipate the increased intensity (in playoff games, rivalry games, and nationally broadcast games) and call fewer fouls throughout the game, but this could be unpopular with fans of players known to create, and benefit from, a lot of contact by driving to the basket.