How Big Data has Affected Statistics in Baseball
The purpose of this report is to highlight how the arrival of big data in baseball has changed the way the game is played and how it affects the choices managers make before, during, and after a game. Big data analytics allows baseball teams to make sounder decisions when making calls during games and when signing contracts with free-agent and rookie players. The significance of this finding is that teams that adopt the moneyball mentality may be able to perform at much higher levels than before on a much lower budget than other teams. The main conclusion is that the use of data analytics in baseball is a fairly new idea, but if it were implemented by more than a couple of teams, it could greatly change the way baseball is run from a managerial standpoint.
Keywords: sports, data analysis, baseball, performance
Whenever people talk about sports, they tend to cite some statistic to show that their team or a certain player is performing well. This is because statistics have become extremely important in sports, especially for rating performance. While essentially every sport has adopted statistics to quantify performance, baseball is the sport best known for it, both because of the sheer number of stats tracked for each player and team and because it relies on stats the most in how the game is played. MLB publishes about 85 different statistics for individual players; once team statistics are included, the total number of tracked statistics is likely about double that. All of these statistics are calculated by analyzing the large amounts of data collected on players and teams. This report covers the history of big data and data analytics in baseball, what the data tracks, what can be learned from the data, and how the data is used.
The dataset analyzed in this report is the Lahman Sabermetrics dataset. It is a large dataset curated by Sean Lahman that contains baseball data starting from 1871, the year the first professional baseball league was founded 1. The dataset covers many types of statistics, including batting, fielding, pitching, awards, player salaries, and games played. The full dataset runs through the most recent season (2020), but the version that could be accessed for this report covers 1871-2015. I plan to use this dataset to discuss how data like this is used for statistical analysis in baseball and how teams can use it to their advantage.
Games resembling baseball have existed for centuries, but the sport called baseball took shape in the early to mid 1800s. Baseball became popular in the United States in the 1850s, when a baseball craze hit the New York area and the game was quickly named a national pastime. The first professional baseball team, the Cincinnati Red Stockings, was established in 1869, and the first professional league, the National Association of Professional Base Ball Players, was established in 1871. This league lasted only a few years and was replaced in 1876 by a more formally structured league, the National League. The American League was established in 1901 from the Western League. The vast majority of the modern rules of baseball were in place by 1893, and the last major change came in 1901, when foul balls began to count as strikes. The World Series was inaugurated in the fall of 1903, with the champion of the National League playing the champion of the American League. During this time the sport faced many problems, such as player strikes over poor and unequal pay and discrimination against African Americans. The early 1900s were the first era of baseball. The second era started in the 1920s, when a plethora of changes moved the sport from a pitcher’s game toward a hitter’s game, a shift emphasized by the success of the first power hitter in professional baseball, Babe Ruth. In the 1960s, baseball was losing revenue to the rising popularity of football, and the league had to make changes to combat the drop in revenue and popularity. These changes led to higher player salaries and increased attendance at games, which meant increased revenue 2.
Big data enters baseball through statistics, which have become a major part of how the sport is played. The statistical analysis of baseball is known as sabermetrics, and teams use it to base in-game calls on numbers. The idea of using sabermetrics was popularized by the book Moneyball. Published in 2003, the book describes how the Oakland Athletics used sabermetric analysis to compete on equal footing with teams that had much more money and better players than the A’s 3. The book has had a major impact on the way several teams play baseball today. For example, the New York Yankees, the St. Louis Cardinals, and the Boston Red Sox have hired full-time sabermetric analysts to gain an edge over other teams by letting sabermetrics inform their decisions. The Tampa Bay Rays also made the moneyball idea a reality by reaching the 2020 World Series with a much lower payroll and heavy use of sabermetrics in their decision-making 3. The moneyball strategy is not perfect, however. The Rays lost the World Series, and the turning point can be traced to a decision made on the basis of the analytics, which turned out to be the wrong choice and cost them a crucial game in the series.
4. Big Data in Baseball
The statistics used in baseball are calculated from datasets similar to the one this report examines, the Lahman Sabermetrics dataset. Because this dataset contains a vast amount of data for each player, spread across many tables holding different kinds of data, many kinds of statistics can be tracked for each player. With so many statistics available, teams can look to numbers and analytics to decide on the best action to take during a game. Teams can also use analytic technology to predict a player’s performance based on their previous accomplishments, comparing these to similar players and situations in the past 4.
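One simple way to picture the comparison to similar players is a nearest-neighbor projection. The sketch below is only an illustration under stated assumptions, not the method any team is known to use: the player-seasons, the feature scales, and the function names are all hypothetical, and only the Python standard library is used.

```python
import math

# Hypothetical scales so that age, average, and slugging contribute comparably.
SCALES = {"age": 5.0, "avg": 0.030, "slg": 0.060}

def distance(a, b, scales=SCALES):
    """Scaled Euclidean distance between two player-seasons."""
    return math.sqrt(sum(((a[k] - b[k]) / s) ** 2 for k, s in scales.items()))

def project_next_slg(target, history, k=2):
    """Average the following-season slugging of the k most similar past seasons."""
    ranked = sorted(history, key=lambda h: distance(target, h["season"]))
    nearest = ranked[:k]
    return sum(h["next_slg"] for h in nearest) / len(nearest)

# Hypothetical past player-seasons with the slugging each player posted a year later.
history = [
    {"season": {"age": 27, "avg": 0.300, "slg": 0.500}, "next_slg": 0.510},
    {"season": {"age": 28, "avg": 0.290, "slg": 0.480}, "next_slg": 0.470},
    {"season": {"age": 34, "avg": 0.250, "slg": 0.400}, "next_slg": 0.380},
]

# Hypothetical player whose next season we want to project.
target = {"age": 27, "avg": 0.295, "slg": 0.490}
projection = project_next_slg(target, history, k=2)
print(f"Projected slugging next season: {projection:.3f}")
```

With real data, the history would hold many seasons from the dataset, and the choice of features and scales would matter far more than it does in this toy example.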
This kind of analysis can be used to gauge the potential performance of a free-agent or rookie player, as well as to decide which players should be in the starting lineup for upcoming games. Because these sabermetric analyses can project a player’s performance over the coming years, they can flag when a contract is unlikely to be a smart deal. Teams tend to sign long-term deals for large sums, even though the player is unlikely to keep performing at their current level for the entirety of the contract. This is normally due to the difficulty of playing at an incredibly high level consistently over a long period and to the age-related regression in performance, which is shown to start at around 32 years old 4. In short, analysis of similar contracts in the past shows that large contracts tend to be a large loss of money in the long run. This kind of analysis also allows a player’s ranking to be decided by more areas than before. For example, a player’s offensive capabilities can be judged on more than home runs hit and batting average; analysts can also look at baserunning skill, slugging percentage, and overall baseball intelligence. Looking at a player’s overall capabilities in this more analytical manner lets teams avoid putting their whole budget into one or two top prospects and instead spread their money across several talented players to build a good, balanced team. Another reason not to spend heavily on a long contract for a top prospect is that the analysis shows the period in which players perform at their best has been getting shorter, even though other sports have seen the opposite in recent years. However, young players have been performing at a much higher level in recent years, and players have been moving from the minor leagues to the major league much faster than before 4.
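Two of the offensive measures mentioned above, batting average and slugging percentage, can be computed directly from batting counts. The following is a minimal sketch using the standard definitions. The season line is made up for illustration, and the keys are assumed to match the column names in the Lahman Batting table (AB, H, 2B, 3B, HR).

```python
def batting_average(hits, at_bats):
    """Hits divided by at-bats; 0.0 when the player has no at-bats."""
    return hits / at_bats if at_bats else 0.0

def slugging_percentage(hits, doubles, triples, home_runs, at_bats):
    """Total bases divided by at-bats; 0.0 when the player has no at-bats."""
    singles = hits - doubles - triples - home_runs
    total_bases = singles + 2 * doubles + 3 * triples + 4 * home_runs
    return total_bases / at_bats if at_bats else 0.0

# Hypothetical season line, keyed like the assumed Lahman Batting columns.
season = {"AB": 500, "H": 150, "2B": 30, "3B": 5, "HR": 20}

avg = batting_average(season["H"], season["AB"])
slg = slugging_percentage(
    season["H"], season["2B"], season["3B"], season["HR"], season["AB"]
)
print(f"AVG {avg:.3f}  SLG {slg:.3f}")
```

Slugging percentage weights extra-base hits more heavily, which is why two players with the same batting average can differ widely in offensive value.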
The data in the Lahman Sabermetrics database and similar databases lets analysts compare the statistics of one team with those of any other team with relative ease and in an easy-to-understand way. For example, Figure 1 below compares team payroll with winning percentage. It shows that the New York Yankees have a much higher payroll than all other teams and a very good win rate, but there are other teams that also have very high payrolls and a similarly good win rate. Figure 1 also shows that some teams with above-average payrolls have a very bad win rate compared to all other teams, including teams with a much lower payroll 5. This kind of analysis shows that spending lots of money does not guarantee a strong season, which supports the moneyball strategy described earlier, where teams try to waste less money by spreading their budget across several players rather than spending most of it on only one or two.
Figure 1: Comparative Analysis of Payroll to Win Percentage 5
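The kind of payroll comparison behind Figure 1 can be sketched as follows. This is an illustration, not the cited analysis: the rows are invented, the team codes are placeholders, and the keys are assumed to match the Lahman Salaries (yearID, teamID, salary) and Teams (yearID, teamID, W, L) tables.

```python
from collections import defaultdict

# Hypothetical rows shaped like the assumed Lahman Salaries table.
salaries = [
    {"yearID": 2015, "teamID": "AAA", "playerID": "p1", "salary": 20_000_000},
    {"yearID": 2015, "teamID": "AAA", "playerID": "p2", "salary": 10_000_000},
    {"yearID": 2015, "teamID": "BBB", "playerID": "p3", "salary": 5_000_000},
    {"yearID": 2015, "teamID": "BBB", "playerID": "p4", "salary": 5_000_000},
    {"yearID": 2015, "teamID": "CCC", "playerID": "p5", "salary": 40_000_000},
]

# Hypothetical rows shaped like the assumed Lahman Teams table.
teams = [
    {"yearID": 2015, "teamID": "AAA", "W": 90, "L": 72},
    {"yearID": 2015, "teamID": "BBB", "W": 81, "L": 81},
    {"yearID": 2015, "teamID": "CCC", "W": 60, "L": 102},
]

def payroll_vs_wins(salaries, teams):
    """Return one record per team-season with total payroll and win percentage."""
    payroll = defaultdict(int)
    for row in salaries:
        payroll[(row["yearID"], row["teamID"])] += row["salary"]

    summary = []
    for team in teams:
        key = (team["yearID"], team["teamID"])
        games = team["W"] + team["L"]
        if key in payroll and games:
            summary.append({
                "yearID": team["yearID"],
                "teamID": team["teamID"],
                "payroll": payroll[key],
                "win_pct": team["W"] / games,
            })
    return summary

summary = payroll_vs_wins(salaries, teams)
for row in sorted(summary, key=lambda r: r["payroll"], reverse=True):
    print(f'{row["teamID"]}: payroll {row["payroll"]:,}  win% {row["win_pct"]:.3f}')
```

In this made-up example the team with the highest payroll also has the worst record, which mirrors the point that spending alone does not guarantee wins.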
This is only one way the Lahman Sabermetrics dataset can be used; there are many more ways to use this data for league-wide analyses and for comparing a certain team to others. Teams that feel they should be performing better can use such comparisons to learn what they might be doing wrong.
This report discussed the history of baseball and how big data analytics came to be prevalent in the sport, as well as how big data is used in baseball and what has been learned from its use so far. Big data can inform decisions that greatly benefit a team, from saving money on a player’s contract to making a choice during a game. The use of big data analytics in baseball is fairly new, but given the advantages a team can gain from analytics, its use is likely to increase in the near future.
The author would like to thank Dr. Gregor von Laszewski, Dr. Geoffrey Fox, and the associate instructors in the FA20-BL-ENGR-E534-11530: Big Data Applications course (offered in the Fall 2020 semester at Indiana University, Bloomington) for their continued assistance and suggestions in exploring this idea and for their help in preparing the various drafts of this article.
Lahman, Sean. “Lahman’s Baseball Database - Dataset by Bgadoci.” Data.world, 5 Oct. 2016. https://data.world/bgadoci/lahmans-baseball-database
Wharton University of Pennsylvania. “Analytics in Baseball: How More Data Is Changing the Game.” Knowledge@Wharton, 21 Feb. 2019. https://knowledge.wharton.upenn.edu/article/analytics-in-baseball/
Tibau, Marcelo. “Exploratory data analysis and baseball.” Exploratory Data Analysis and Baseball, 3 Jan. 2017. https://rstudio-pubs-static.s3.amazonaws.com/239462_de94dc54e71f45718aa3a03fc0bcd432.html