Let's start with a hypothesis:
Andy Staples, a national CFB writer for The Athletic, states a version of the hypothesis concisely here: teams are playing slower on offense because it saves their defense from getting worn out and giving up points.
Re-framed, it goes something like this:
"Teams with up tempo offenses will have worse defenses."
I think it's important that we set up how we got here and then define some terms in the hypothesis so we can test it. We are here because some college offenses (there is a nice look at who was playing fast in 2016 here from Jake Troch) started playing at a much faster pace, reducing the time between snaps, and increasing the number of plays and drives per game.
Drives are like at-bats in that they are scoring opportunities. You probably want more of those. Plays are a little more ambiguous. Ideally you might not be running a ton of plays; the point of many tempo offenses seemed to be to put the defense in disadvantageous spots to get big plays (reducing the number of plays per drive for a given starting field position). Tempo rested on three prongs: wear the defense out by limiting substitutions (tempo offenses led the world in off-season conditioning-advantage chatter), put the defense in disadvantageous spots through those same limited substitutions, and don't give the defense time to settle between plays. That last prong includes limiting the defense to more vanilla looks as they hustle to get the play in and get set.
There is a bit of a problem though: actual points per drive have gone up as drives per game have gone down. If we are talking about a less worn-out defense playing better defense, then at the very macro level the data doesn't fit the hypothesis.
That, of course, is not final or conclusive evidence of anything. Maybe worse offenses slowed it down more recently, but you probably need to have a companion hypothesis for "teams are slowing it down on offense so they can play better defense" if it turns out that they are not, in fact, playing better defense.
I have to admit that I always found the "you wear out the defense more than yourself without subbing" argument fishy. Unless you are training and playing at altitude against an opponent who trains close to sea level, it made little sense to me that offensive players should get less tired than defensive players. Shouldn't the conditioning you put your offense through also apply to your defense, since you know they will face more drives too? The other team does get the ball after you in football; drives are fairly symmetrical. How did we end up at a place where defenses get so asymmetrically tired?
A possible answer is that we still are not smart enough to evaluate football independent of tempo. As the number of drives a defense faces rises, the number of points that defense is expected to give up rises as well. This was the problem with evaluating Oregon's defense in the Chip Kelly era: commentators were using points/yards-per-game metrics and missing that the Ducks consistently had a pretty good defense under Kelly and early in Helfrich's tenure.
So, given that the metrics I just threw in here are tempo-free and adjusted for field position and opponent, it's important that we agree on how we are going to define "worse" in the hypothesis. But first, a look at some prior work.
Prior Work
Bill Connelly, now at ESPN, did some work specifically on college football data that is cited in this article. It shows the same negative trend on success rate (a different metric, but aiming in the same direction) in play-level data from CFB, and it does not suggest that offenses get more effective as the number of plays the defense has faced increases. Ben Baldwin, now at The Athletic, has this very good piece that specifically examines the rest time defenses get between drives to test whether it has an effect on performance; it does not. Ethan Douglas, also of The Athletic, posted this look at GPS data from the NFL on his Twitter feed, and it specifically refuted the idea that the defense gets asymmetrically worn out. Humans tend to get the same tired.
Hypothesis
So our hypothesis is:
"Teams with up tempo offenses will have worse defenses."
The first thing we need to define is how we are going to measure "worse." Points per drive (PPD) is the cleanest raw efficiency metric that I know of, and it's a good place for us to start. It also happens to be the dependent variable at the top of the hierarchy of Beta_Rank models. Under the hypothesis, the more drives a defense faces, the more points per drive we should expect that defense to surrender.
Points per drive in terms of a very simple mathematical function might look like something like:
PPD = f(starting field position, offensive unit quality, defensive unit quality)
It seems simple, but it explains an awful lot of the variation in points per drive on its own. The hypothesis would add:
PPD = f(starting field position, offensive unit quality, defensive unit quality, drive number)
So let's start by visualizing the distribution of PPD by drive number. If you want to skip ahead to the regressions you are more than welcome to scroll down, but I encourage everyone to visualize distributions of data first. In this first pass I have removed end-of-half/game run-out-the-clock drives and limited the sample to drives that took place in regulation time; all subsequent cuts have the same filters applied. I have also capped the drive number at 17 because the data gets thin and noisy further out.
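For readers who want to follow along, here is a minimal sketch of that first cut in pandas. All of the column names (`drive_num`, `points`, `end_of_half_kneel`, `in_regulation`) are illustrative assumptions, not the actual Beta_Rank schema:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Drive-level data; column names here are assumptions for illustration.
drives = pd.read_csv("drives.csv")

# Remove end-of-half/game clock-killing drives, keep regulation only,
# and cap at drive 17 where the sample gets thin and noisy.
cut = drives[
    (~drives["end_of_half_kneel"])
    & (drives["in_regulation"])
    & (drives["drive_num"] <= 17)
].copy()

# Mean points scored at each drive number of the game.
ppd_by_drive = cut.groupby("drive_num")["points"].mean()

ppd_by_drive.plot(marker="o")
plt.xlabel("Drive number")
plt.ylabel("Points per drive")
plt.show()
```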
At first glance the hypothesis doesn't look good: drive number appears negatively related to points per drive. In these cuts we are not controlling for field position or team quality, relying instead on the size of the sample to give us reasonably unbiased data on those two factors. It's fair to question whether that holds, and we'll move on to explicitly controlling for them before we conclude.
But! Not all ends of games are created equal, right? Sometimes when you are ahead you want to kill clock instead of scoring points. The turtle/David Shaw strategy when ahead late in a game is a real thing. We should attempt to control for this by looking at close games, where a team that is behind or only narrowly ahead still needs to score points, and, crucially, still has enough time to score. So we limit the sample to drives that start within 8 points of the opponent in either direction, with at least two minutes left on the clock when you get the ball.
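Continuing the sketch above, that close-game cut might look like this (with `score_diff` assumed to be the offense's score minus the opponent's at drive start, and `seconds_left` the game clock remaining):

```python
# Close-game cut: within one score (8 points) at drive start and at
# least two minutes on the clock, so the offense both needs and has
# time to score.
close = cut[(cut["score_diff"].abs() < 8) & (cut["seconds_left"] >= 120)]

close.groupby("drive_num")["points"].mean().plot(marker="o")
```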
Here is some evidence that might suggest a wearing-out effect on the defense. At drive 17, where we still have n > 100, points per drive turns sharply upward and matches the points per drive from the first drive, where there is a huge sample and lots of games that are still close only because the game just started (we'll get to the fact that we still need to control for team/unit quality). It might be fairer to also look only at games that were close in the early going, but we'll deal with this in the regression controls.
But it's worth considering that there might be yet another effect confounding what we see late in the game here. I am going to call it the "prevent defense effect": many teams in late-game situations in close games switch to a defense with an overly conservative emphasis on preventing "The Big Play" that the coach and defensive coordinator would have to explain after the game. In testing a roughly designed variable for this, it wasn't the very last drive that caught most of the effect within the boundaries of what we defined as close games, but the second-to-last drive. So we'll split out the drives that were the second-to-last drive of the game, keeping the same controls we have above. We'll still be looking at drives in games that were close at drive start, and you'll still need enough time to score, but most of the potential prevent-defense situations have been removed.
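One way to flag those drives, continuing the sketch (and assuming a `game_id` column identifying each game):

```python
# Flag each game's second-to-last drive, computed on the full regulation
# data so "last" means last in the game, not last within the close cut.
cut["drives_from_end"] = (
    cut.groupby("game_id")["drive_num"].transform("max") - cut["drive_num"]
)

# Re-apply the close-game filter, then drop the flagged drives.
close = cut[(cut["score_diff"].abs() < 8) & (cut["seconds_left"] >= 120)]
no_prevent = close[close["drives_from_end"] != 1]
no_prevent.groupby("drive_num")["points"].mean().plot(marker="o")
```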
There is still a spike at 17, but the trend is clearly down. In fact, starting all the way back at drive 8 (in no way a high-tempo game), you can see that removing the drive-8 drives that were second-to-last drives in close games moves the needle slightly downward. So just as some of the late-drive points per drive might be suppressed by turtle behavior, some of it might be inflated by prevent defense. In fact, the second-to-last drive in close games has a higher points-per-drive total, 2.63(!), than the first, second, or third drives of the game: 2.18, 2.20, and 2.26 points per drive respectively.
Regression
We have been working toward regression in looking at the data in various cuts, and I think we have some functional forms to test. First let's look at the basics of what a points-per-drive regression (by this I mean the data is at the drive level) should include:
Points = B1*(Starting Field Position) + B2*(Offensive Efficiency) + B3*(Defensive Efficiency) + B4*(Drive Number) + e
Beta_Rank's model includes some other effects that link into the other models in the hierarchical form, but that equation explains most of what you see in it. So as we run this, what do we care about in measurement? (A quick fitting sketch follows the list below.)
1. The sign of the coefficient on Drive Number
This is the most important from the pure hypothesis-testing perspective. More drives should mean more points, which should mean the coefficient is positive: the deeper into the game you are, the more points you should see per drive. We should also think about how to interpret the coefficient switching signs depending on specification.
2. The size of the coefficient on Drive Number
This one is critical too: how far is the coefficient from zero? A big coefficient tells you that Drive Number matters a lot. A small one tells you it doesn't.
3. The standard error of that coefficient on Drive Number
Is the coefficient statistically significant? Are we sure it is statistically different from zero? If you can't tell the coefficient from zero, then it's probably not important on its own.
4. Overall model accuracy in predicting wins and spreads
This is the ultimate guide for changes in Beta_Rank: does including this field make the model more accurate on winners and spreads?
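Here's the fitting sketch promised above, using statsmodels on the same assumed drive-level frame. The `off_quality` and `def_quality` columns stand in for Beta_Rank's hierarchical unit-quality terms, which aren't reproduced here:

```python
import statsmodels.formula.api as smf

# Drive-level OLS sketch; all column names are illustrative assumptions.
model = smf.ols(
    "points ~ start_field_pos + off_quality + def_quality + drive_num",
    data=cut,
).fit()

# Sign, size, and standard error of drive_num: the first three criteria.
print(model.params["drive_num"], model.bse["drive_num"], model.tvalues["drive_num"])
```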
Regression One: Drives Alone
The first regression mimics our first cut of the data. If we include Drive Number in the regression, how does the model fit it? Much like the trend we saw in the raw data, the model fits the variable with a negative coefficient.
Drive Number -0.26
Standard Error 3.0219e-06
t-value -8824
Win Fit +0.1%
Spread Fit +0.1%
So the model expects 0.32 fewer points per drive on Drive 17 than on Drive 1. The negative coefficient does not back up the hypothesis that teams play better defense when they face fewer drives, even though we are now controlling explicitly for unit quality and field position. The coefficient may seem small, but nearly a third of a point per drive is substantial, and the relationship is highly statistically significant. The model also predicts game winners and spreads slightly better over 9 years of game data with these additional controls.
Regression Two: Clock Killers and Garbage Time
Now we will fit variables that control for the game situation. We want to differentiate, though, so as not to mix the playing-ahead and playing-behind effects, so we split it into two fields that take the score difference, calculate the number of possessions required to equalize it (divide by 8), and then check whether enough possessions remain to catch up (assuming roughly 2 minutes per possession if you are really trying). This lets us control for garbage time and the clock-killing strategy.
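One plausible construction of those two fields, continuing the sketch above (the exact Beta_Rank formulation may well differ; this is just one reading of the description):

```python
import numpy as np

# Possessions needed to erase the current margin (one score ~ 8 points)
# versus possessions plausibly remaining (~2 minutes each when hurrying).
possessions_to_tie = cut["score_diff"].abs() / 8.0
possessions_left = cut["seconds_left"] / 120.0

# How far out of reach the game is; zero when a comeback is feasible.
out_of_reach = (possessions_to_tie - possessions_left).clip(lower=0)

# Split into ahead/behind fields so each gets its own coefficient.
cut["garbage_ahead"] = np.where(cut["score_diff"] > 0, out_of_reach, 0.0)
cut["garbage_behind"] = np.where(cut["score_diff"] < 0, out_of_reach, 0.0)
```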
Drive Number -0.013
Standard Error 2.9858e-06
t-value -4374
Win Fit +0.1%
Spread Fit +0.3%
We have reduced the size of the negative coefficient on Drive Number, but we failed to make it positive. It remains somewhat substantial in its effect: by drive 17 it is subtracting 0.22 expected points per drive. It remains very statistically significant, though less so than before. We also increased the fit in predicting spreads with this model, but had no effect on predicting winners. Both the control for being ahead and the control for being behind were negative, suggesting a symmetric effect where the further the game is out of hand, the less both teams care about scoring. This additional garbage-time control did boost the defenses of teams with very good offenses in Beta_Rank, so some of the effect was to shift how the model scores the units themselves, and that was not uniformly positive but team dependent. Thus far the hypothesis is not holding up, but we are not done yet.
Regression Three: Prevent Defense
Now we bring in that last effect we saw when we cut the data: the last, or second-to-last, drive effect. We'll simply add a dummy for the second-to-last drive of the game and see whether this variable has the effect we expect within the broader set of controls.
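Continuing the sketch, the dummy is just the second-to-last-drive flag from the earlier cut entering the regression alongside the other controls:

```python
# 1 on each game's second-to-last drive, 0 otherwise (flag from earlier).
cut["prevent_dummy"] = (cut["drives_from_end"] == 1).astype(int)

model3 = smf.ols(
    "points ~ start_field_pos + off_quality + def_quality + drive_num"
    " + garbage_ahead + garbage_behind + prevent_dummy",
    data=cut,
).fit()
```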
Drive Number -0.013
Standard Error 2.9857e-06
t-value -4478
Win Fit +0.1%
Spread Fit +0.4%
This additional control had virtually no effect on our Drive Number field; it remained substantial in its real-world effect and statistically significant. The Prevent Defense control was positive, adding 0.04 points per drive when it was in play, which is not much considering a dummy contributes its full coefficient whenever it applies. It did interact with the game-situation controls, making each of them more negative, and overall it increased the model's fit on spreads over the 9 years of data.
Conclusion
While you never shut the door permanently on a hypothesis, I think we can safely put the tempo hypothesis to bed for a while. This work falls very much in line with the prior work from Connelly, Baldwin, and Douglas on the effect of tempo on a defense in football. While college football lacks the depth of data of the NFL for GPS-level tracking, the data certainly allows us to test this hypothesis fairly and report usable results. Simply running more drives may lead to scoring, and giving up, more points, but it's getting more at-bats that matters, not giving up more points per drive the more drives you face. We can revisit this question in college at another time when the data is deeper, and I would recommend that since college has a greater disparity of scheme than the pros, but I think the tempo question is settled about as well as we can settle it with the current data.
Teams still do go tempo these days in CFB and the pros, and we should recognize that this is more about match-ups, substitutions, and limiting the opposing play-caller's options than some tiring effect that, thus far, is not showing up in multiple levels of data with different analytic methods applied.
Note: Not all of the new variables made it into the latest Beta_Rank production model. In testing the overall fit for parsimony (the model that predicts best with the fewest independent variables is best), I found that the Last Drive variable added no additional predictive power when run alongside the game-situation variables, and it was dropped. I was also able to refactor how I weight for garbage time using the new fields, and this produced a new version of Beta_Rank with a full 0.7% improvement in predicting spreads accurately.