Statistically Speaking

Data Dump: 2007 OPA! numbers

For those of you who followed the development of my OPA! system for measuring fielding, and were curious about the numbers, here are the 2007 numbers in their full glory.  I’ve posted them as Google docs, and sorted them by the total number of runs saved above average.  I couldn’t get the pitcher file to fit into a Google Doc (if someone wants it, e-mail me), and OPA! doesn’t look at catchers.  But the other seven positions are up.  Hopefully, on the spreadsheets, the headings make sense.  Players are labeled with their Retrosheet ID.  As always, feel free to use them as you see fit, just maybe tip the cap over this way.

When the 2008 Retrosheet file comes out, I’ll post the 2008 numbers. 

Enjoy.

What run estimator would Batman use? (Part II)

If you haven’t already, I suggest you read Part I first, but it’s not strictly necessary, so long as you have a feel for how run estimators work. Part I goes into a lot of the background of how run estimators work, but there’s not a lot of technical detail.

Now, let’s go ahead and strap some run estimators down to the table, cut them open and see how they work.

Linear weights

First of all, when I refer to linear weights, I should clarify that I use the term to refer to any linear run estimator, not just Pete Palmer’s Linear Weights System. Onward, then.

Simply looking at a linear weights formula should be pretty straightforward. We’ll look at the reduced version of Extrapolated Runs, Jim Furtado’s version of a linear weights formula*:

(.50 * 1B) + (.72 * 2B) + (1.04 * 3B) + (1.44 * HR) + (.33 * (HP+TBB)) + (.18 * SB) + (-.32 * CS) + (-.098 * (AB - H))

Essentially, every event is multiplied by its average run value, based on a certain run context. (In the case of XR it’s team seasons from 1995 to 1997, but you could use any context you wanted. You could put together a linear weights formula for, say, Greg Maddux’s career if you wanted to.)
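To make the arithmetic concrete, here is a minimal Python sketch of the reduced XR formula quoted above. The function name, the argument names, and the sample stat line are my own assumptions for illustration; only the coefficients come from the formula.

```python
def reduced_xr(singles, doubles, triples, hr, hp, tbb, sb, cs, ab, h):
    """Reduced Extrapolated Runs: each event times its average run value."""
    outs_on_batted_balls = ab - h  # the (AB - H) term in the formula
    return (
        0.50 * singles
        + 0.72 * doubles
        + 1.04 * triples
        + 1.44 * hr
        + 0.33 * (hp + tbb)
        + 0.18 * sb
        - 0.32 * cs
        - 0.098 * outs_on_batted_balls
    )


# Hypothetical stat line (not a real player): 155 hits in 550 at-bats.
example = reduced_xr(
    singles=100, doubles=30, triples=5, hr=20,
    hp=5, tbb=60, sb=10, cs=4, ab=550, h=155,
)
print(round(example, 2))
```

Swapping in a different run context only means swapping the coefficients; the structure of the calculation stays the same.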

This raises the question of how to determine the run value of an event. Looking simply at Runs Batted In won’t help – a single with the bases empty provides value, even though it drives in nobody. So what do we do? Here’s where a concept called run expectancy comes in handy. Every base/out state has a certain run expectancy, which is essentially how many runs, on average, a team scores from that point to the end of the inning. I’m using values from this table by Tango, because they’re already in a nice arrangement.


Runners   0 outs   1 out    2 outs
___       0.555    0.297    0.117
1__       0.953    0.573    0.251
_2_       1.189    0.725    0.344
__3       1.482    0.983    0.387
12_       1.573    0.971    0.466
1_3       1.904    1.243    0.538
_23       2.052    1.467    0.634
123       2.417    1.650    0.815

There’s one case not strictly defined on the table; three outs means a run expectancy of zero.
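For anyone who wants to work with these numbers directly, one possible way to hold the table in Python is a dictionary keyed by the base state, with the three-outs case handled separately. The names `RUN_EXPECTANCY` and `run_expectancy` are mine, not Tango’s; the values are copied from the table above.

```python
# Run expectancy by base state; each tuple is (0 outs, 1 out, 2 outs).
RUN_EXPECTANCY = {
    "___": (0.555, 0.297, 0.117),
    "1__": (0.953, 0.573, 0.251),
    "_2_": (1.189, 0.725, 0.344),
    "__3": (1.482, 0.983, 0.387),
    "12_": (1.573, 0.971, 0.466),
    "1_3": (1.904, 1.243, 0.538),
    "_23": (2.052, 1.467, 0.634),
    "123": (2.417, 1.650, 0.815),
}


def run_expectancy(runners, outs):
    """Look up the run expectancy for a base/out state."""
    if outs >= 3:
        return 0.0  # inning over: no more runs can score
    return RUN_EXPECTANCY[runners][outs]


print(run_expectancy("12_", 0))  # runners on first and second, no outs
print(run_expectancy("___", 3))  # three outs
```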

The linear weights value of an event is the average change in run expectancy caused by that event. Let’s say you have runners on first and second, no outs; that’s an RE of 1.573. A player hits a double, scoring the two runners in front of him:

2 + 1.189 = 3.189

The double scored two runs and left a runner on second with nobody out, a state with an RE of 1.189, for a total of 3.189. Subtract the starting RE of 1.573, and you get 1.616, the run contribution of that double. Take the average RE change of every double available in your dataset, and there’s your linear weights value of a double.
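Here is a rough Python sketch of that averaging step. The three doubles in the list are made up for illustration (they are not real play-by-play data), and the function names are my own; the RE values are the ones from Tango’s table.

```python
def run_value(re_before, runs_scored, re_after):
    """Change in run expectancy for one play: runs scored plus end RE, minus start RE."""
    return runs_scored + re_after - re_before


def linear_weight(plays):
    """Average run value over a list of (re_before, runs_scored, re_after) plays."""
    values = [run_value(*play) for play in plays]
    return sum(values) / len(values)


# Hypothetical sample of doubles, each as (RE before, runs scored, RE after).
sample_doubles = [
    (1.573, 2, 1.189),  # 12_, 0 outs -> two score, batter on second
    (0.555, 0, 1.189),  # ___, 0 outs -> batter on second
    (0.117, 0, 0.344),  # ___, 2 outs -> batter on second
]

print(round(run_value(1.573, 2, 1.189), 3))      # the double worked through above
print(round(linear_weight(sample_doubles), 3))   # average over the toy sample
```

With a real dataset you would feed in every double in the play-by-play file rather than three invented ones, but the calculation is the same.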

(There are other ways to estimate linear weight values when you don’t have sufficient data to do the Run Expectancy analysis; an overview of the subject is available.)


Holds, Saves and Blown Saves

Francisco Rodriguez of the Angels, with 54 saves in 59 opportunities, is on his way to breaking the all-time single-season record of 57, set by Bobby Thigpen of the White Sox in 1990. Percentage-wise, the Phillies’ Brad Lidge is perfect, with 33 saves in 33 opportunities. On the opposite end, there are records such as that of Aaron Heilman of the Mets, 3 for 7 this year and 9 for 33 since 2004. It’s obvious Heilman can’t close games, with a record like that. No wonder Willie Randolph got fired. Right? Wrong!

Saves have become a statistic whose leaders are as well known to the casual fan as the home run leaders, and save percentage is one of the simplest computations in baseball statistics, but it has always contained an error that grossly distorts the value of middle relievers to the general public. It is easy to understand that the setup man isn’t in a position to get many saves, but save percentage has been held up by many, including the media, as evidence that certain pitchers routinely fail when handed a save situation, proof that they can’t handle the closer role.
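For reference, the conventional calculation is just saves divided by save opportunities. A quick Python sketch using the records quoted above (the function and variable names are my own):

```python
def save_pct(saves, opportunities):
    """Conventional save percentage: saves divided by save opportunities."""
    return saves / opportunities


# Records as quoted in this post: (saves, opportunities).
records = {
    "Rodriguez, 2008": (54, 59),
    "Lidge, 2008": (33, 33),
    "Heilman, 2008": (3, 7),
    "Heilman, since 2004": (9, 33),
}

for name, (saves, opps) in records.items():
    print(f"{name}: {save_pct(saves, opps):.3f}")
```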

World Famous StatSpeak Roundtable: September 3

Our humble round table welcomes a new guest knight.  Please welcome to this week’s version of the roundtable, Will Carroll of Baseball Prospectus.  Will has been kind enough to join us here on StatSpeak for a record-setting five-person roundtable.  He joins us in a discussion of the ghosts of trade deadline deals past, injuries and Sabermetrics, C.C.’s sorta no-hitter, instant replay, and who will be looking in from the outside on the AL playoffs in October.

Question #1: When I started doing “Under the Knife” seven years ago, there were no stats and people didn’t think that injuries and sabermetrics went together. I’m still not sure they do, but to me, it’s about information. You guys are stats guys — how would you go about mixing the two?

Will Carroll: I think it comes down to a bit of luck. Is it someone getting hot and carrying the team? Is it an injury that costs them a premier player for a couple weeks or worse? I know that luck is probably the worst thing to say on a site like this but I think it’s the best way to say that small things make a huge difference and I’m not sure which ones. I think we get lost in this fog because we’re seeing quantifiable effects but in such small quantities that we don’t notice, things that amount to 0.1 runs or less, but enough of them that they add up.

Brian Cartwright: Well, my day job is in data processing, which includes designing methods of data collection. So one of my current projects is designing a comprehensive database that hopefully will include everything we can get our hands on, from season stats and play-by-play to transactions and injuries, as opposed to narrowly constructed ad hoc databases. I’d like to be able to look at the pre-injury data and see if there are any indicators, such as simple-to-derive stuff like lists of pitchers headed to a Verducci Effect (and then test how true it is). Post-injury, I’d like to be able to see how well players recover from various types of injuries.

I know Will has done much of this on his own, but I’d like to see the injury data married to the stats and projections to enable more of us to do these kinds of studies.

Colin Wyers: That’s sort of the unexplored frontier of sabermetrics - introducing traditional sorts of data into our models. What’s lacking right now is a good record of who got injured, where, and how. I don’t know if we’ll ever get to that point, but people like Tom Ruane of Retrosheet are working on that sort of data - and all of us who research baseball owe the folks of Retrosheet a huge debt.

Eric Seidman: A fusion of injuries and sabermetrics is something I have actually discussed with Will on numerous occasions because now, with Pitch F/X data in full bloom, there are certain avenues we can explore.  For instance, one idea of Will’s (that I wholeheartedly support) is that pitchers that are on the verge of injury will have consistent release points with inconsistent results.  Before, this really could not be studied, but now it can.  We can run analyses to see which pitchers fit the bill.  Or, if someone is experiencing a “dead arm” we can look to their movement.  Stats cannot tell us everything about injuries, but just like all other aspects of analysis, the combo of numbers and scouting will ultimately prove to be key in this combination.

Pizza Cutter: I don’t think that the two are opposed at all.  I do agree that injury analysis isn’t really something that fits nicely into any of the Sabermetric models that we have now, but that’s more of an engineering problem.  To really pursue this line of study, one would have to be familiar with bio-mechanics and statistics, plus have a fairly extensive injury database handy.  (So basically, you, Will.)  Even at that point, there’s going to be a lot of statistical noise.  Suppose that Larry has an elbow problem and goes on the 15-day DL.  Even if we assume that we know exactly when he was hurt (and when it started hurting his performance), we’ll never really know how hurt he was.  How can we tell if it’s not just him having a bad string of luck?  Maybe with a big enough sample, we can detect a signal, but it’s going to be hard to find.  Calculating the effect of a player’s complete absence is fairly easy.  Calculating what it means to have a player at 80% is a lot harder.

The other side of the Sabermetric-injury nexus is predicting who’s an injury risk.  My guess is that some team (or several) out there hired an actuary to study just that and they’re keeping it close to the vest.  (Can’t blame them.)  Plus, with many teams already insuring contracts, someone out there in the insurance industry must be running some sort of tables.

