The NBA has a foul problem.
In recent years, NBA officials have rewritten the rulebook, hoping to eliminate offensive players lunging into a jumping defender and, more recently, to penalize take fouls (fouls on would-be open runs to the basket). These changes take a surface-level view of the NBA, essentially reducing fouls that look funky, without examining which combination of factors leads to more fouls. Understanding those factors could let the NBA craft rule changes that truly reduce fouls and, in turn, increase pace of play.
Data & Methods
The data used in this project required quite a bit of cleaning. I used play-by-play data from 500 regular-season games in the 2019–2020 NBA season. Because the original play-by-play data has one row for every event in a basketball game, condensing it into a usable form was essential. To do this, I first kept running totals of the counting stats as they appeared in the data, adding columns for how many steals, blocks, fouls, etc. had occurred so far. I then kept only the first row of every minute to further reduce the data without losing much information. To account for time-based differences in counting stats, I converted each running total to a frequency. This ultimately produced a dataset with a response variable (future foul frequency) and a number of frequencies and shooting percentages as explanatory variables.
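The condensing steps above can be sketched in pandas. This is a minimal illustration rather than the project's actual cleaning code: the column names (`game_id`, `seconds_elapsed`, `event_type`), the three stats shown, and the choice to express frequency as events per minute elapsed are all assumptions made for the example.

```python
import pandas as pd

STATS = ("foul", "steal", "block")  # hypothetical event labels


def condense_play_by_play(events: pd.DataFrame) -> pd.DataFrame:
    """Reduce one-row-per-event data to one row per game minute.

    Assumes hypothetical columns: game_id, seconds_elapsed, event_type.
    """
    events = events.sort_values(["game_id", "seconds_elapsed"]).copy()

    # Running totals: how many of each event had occurred so far in the game.
    for stat in STATS:
        is_stat = (events["event_type"] == stat).astype(int)
        events[f"{stat}s_so_far"] = is_stat.groupby(events["game_id"]).cumsum()

    # Keep only the first row of every minute of every game.
    events["minute"] = events["seconds_elapsed"] // 60
    per_minute = events.drop_duplicates(["game_id", "minute"], keep="first")
    per_minute = per_minute.reset_index(drop=True)

    # Convert running totals to frequencies (assumed here: events per minute elapsed).
    elapsed_min = per_minute["seconds_elapsed"] / 60
    for stat in STATS:
        freq = per_minute[f"{stat}s_so_far"] / elapsed_min
        # Guard against division by zero at the opening tip.
        per_minute[f"{stat}_freq"] = freq.where(elapsed_min > 0, 0.0)
    return per_minute
```

Keeping the first row of each minute after computing the running totals means each retained row still carries everything that happened before it, which is why the reduction loses little information.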
The full set of explanatory variables includes counting stats (points, assists, rebounds, steals, blocks, etc.), shooting percentages (split into 2- and 3-point shots), time remaining, score differential, technical fouls, and past foul frequency. My response variable is future foul rate, defined as the number of fouls in the remainder of the game divided by the time remaining.
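As a small worked version of that definition, the function below computes the future foul rate for a single row. The function name, its parameters, and the 48-minute regulation default are assumptions for illustration; overtime is ignored here.

```python
def future_foul_rate(fouls_so_far: int, total_fouls: int,
                     minutes_elapsed: float, game_minutes: float = 48.0) -> float:
    """Fouls still to come, per minute of game time remaining.

    game_minutes defaults to a regulation NBA game; overtime is not handled.
    """
    minutes_remaining = game_minutes - minutes_elapsed
    if minutes_remaining <= 0:
        raise ValueError("no time remaining, so the rate is undefined")
    return (total_fouls - fouls_so_far) / minutes_remaining


# A game that ends with 40 fouls, 10 of them called in the first quarter:
rate_after_q1 = future_foul_rate(fouls_so_far=10, total_fouls=40, minutes_elapsed=12)
```

In this example, 30 fouls remain over 36 minutes, so the rate is 30 / 36 fouls per minute.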
I used several different methods, starting with a basic linear regression before moving on to LASSO, random forest, and XGBoost models. I personally like the predictive power of XGBoost, and I find that SHAP scores make it much more interpretable than similarly powerful methods. XGBoost also predicted foul rates the most accurately, so the graphs and insights below come from the XGBoost model.
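A comparison like this can be sketched with scikit-learn. The data here is synthetic and stands in for the real per-minute rows, the hyperparameters are arbitrary, and RMSE on a single random split is an assumed scoring choice, not necessarily the one used for this project. To keep the sketch runnable without extra installs, XGBoost is left out; `xgboost.XGBRegressor` follows the same fit/predict interface and could be added to the dictionary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def compare_models(X, y, models, seed=0):
    """Fit each model on the same train split and return its test RMSE."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        preds = model.predict(X_test)
        scores[name] = float(np.sqrt(mean_squared_error(y_test, preds)))
    return scores


# Synthetic stand-in for the explanatory variables and future foul rate.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 0.8 + 0.3 * X[:, 0] - 0.2 * np.abs(X[:, 1]) + rng.normal(scale=0.1, size=400)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.01),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    # An xgboost.XGBRegressor() could be added here if xgboost is installed.
}
scores = compare_models(X, y, models)
```

One design note: rows from the same game are correlated, so on real data it may be safer to split by game rather than by row, so that no game contributes to both the training and testing sets.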
What Factors Lead to Higher Foul Rates?
The graph below shows the SHAP scores behind the model's predictions of future foul rate. As a brief explainer: SHAP scores assign each feature an importance value for every prediction, showing how far that feature pushed the prediction up or down. In this graph, minute is the most important variable, followed by score differential and foul frequency to that point in the game. Each dot represents a row in the testing dataset, with the feature's value shown by color: purple for a high value and yellow for a low one.
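To make the idea behind a SHAP score concrete, here is a brute-force sketch of the Shapley values that SHAP is built on. It is not the `shap` library and not the XGBoost model from this project: the toy prediction function, its coefficients, and the single baseline row are all invented for illustration. Real SHAP implementations average over background data and use much faster algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, row, baseline):
    """Exact Shapley values for one row, enumerating every feature subset.

    Features left out of a subset take their baseline value.
    """
    names = list(row)
    n = len(names)

    def value(subset):
        mixed = {k: (row[k] if k in subset else baseline[k]) for k in names}
        return predict(mixed)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for subsets of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = value(set(subset) | {name})
                without_feature = value(set(subset))
                total += weight * (with_feature - without_feature)
        phi[name] = total
    return phi


def toy_foul_rate(f):
    """Hypothetical stand-in model; the coefficients are made up."""
    return (0.6 + 0.4 * f["past_foul_freq"]
            - 0.01 * abs(f["score_diff"]) + 0.005 * f["minute"])


row = {"minute": 30, "score_diff": 2, "past_foul_freq": 1.0}
baseline = {"minute": 24, "score_diff": 10, "past_foul_freq": 0.8}
phi = shapley_values(toy_foul_rate, row, baseline)
```

The contributions in `phi` add up to the difference between the prediction for `row` and the prediction for `baseline`. That additivity is what lets each dot in a SHAP plot be read as one feature's share of one prediction.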
Fouls Are Predictable — But Not Removable
The biggest drivers of higher foul rates are also the two biggest drivers of intensity — a low point differential (more fouls occur in a close game) and middle-to-low time remaining (more fouls are expected during the middle of the game). The start of a game and the last few minutes (when many games are already decided) see the fewest expected fouls.
Expect the most fouls in fast-paced games (those where more shots are being taken) and games with active defenders (higher steal frequency correlates with higher foul frequency). Further, games that have already seen a lot of fouls tend to keep producing them as the game unfolds.
Interestingly, many of the factors I looked at were not predictors of foul rate at all. Technical fouls had no impact on the number of fouls called, despite theoretically serving as a tranquilizer for misbehaving players and coaches. Blocks, assists, and rebounds (either offensive or defensive) also had no impact on foul frequency. Nor was a team shooting well, or poorly, from anywhere on the court a harbinger of higher foul rates.
What This Means for the NBA
The NBA's foul problem might be here to stay — and it will show up the most under the brightest lights, when games are the most intense.
Refs could anticipate the increased intensity (in playoff games, rivalry games, and nationally broadcast games) and call fewer fouls throughout the game, but this could be unpopular with fans of players known to create, and benefit from, a lot of contact by driving to the basket.