Hello, Baseball Fans! - Visualizing Player Movement between Major League Baseball Franchises

Learn Data Science
Teradata Employee

This week marks the beginning of Major League Baseball’s 2017 season. To celebrate, I thought it would be fun to explore common player movement patterns between MLB franchises using both Teradata Aster’s nPath function and a sigma chart visualization (I know…I’m weird like that). In other words, we want to visualize the most common patterns of franchises played for, for example: Franchise A followed by Franchise B followed by Franchise C.

 

The Data:

To investigate, we will look at the 2016 version of the Baseball Databank database, which can be downloaded here: http://seanlahman.com/baseball-archive/statistics/

 

Specifically, we will look at the Batting.csv, Teams.csv, and TeamFranchises.csv files.

 

Note from the accompanying README.txt file:

Baseball Databank is a compilation of historical baseball data in a convenient, tidy format, distributed under Open Data terms.

 

This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.  For details see: http://creativecommons.org/licenses/by-sa/3.0/

 

Person identification and demographics data are provided by Chadwick Baseball Bureau (http://www.chadwick-bureau.com), from its Register of baseball personnel.

 

Player performance data for 1871 through 2014 is based on the Lahman Baseball Database, version 2015-01-24, which is Copyright (C) 1996-2015 by Sean Lahman.

 

The tables Parks.csv and HomeGames.csv are based on the game logs and park code table published by Retrosheet. This information is available free of charge from and is copyrighted by Retrosheet.  Interested parties may contact Retrosheet at http://www.retrosheet.org.

 

Data Loading/Prep Work:

Let’s imagine that we work for a company where the Batting.csv, Teams.csv, and TeamFranchises.csv files are loaded onto a Teradata Aster database and stored in tables named [schema].batting, [schema].teams, and [schema].teamfranchise .

 

First, we’ll isolate the records of interest in our data. To answer our question (i.e., what are the most common player movement patterns between Major League Baseball franchises), we will focus on players who played for franchises currently in existence.

 

Note: You might be wondering why we are considering franchises rather than the names of the teams that each player played for. This is because some franchises have changed their associated team’s name over the years at one time or another. Focusing on the name of the franchise prevents us from detecting false movement patterns where a player “moves” from one team to another when both teams actually represent the same franchise. For example, in 2005, the Montreal Expos relocated to Washington, D.C. and became the Washington Nationals, so a player cannot really have moved from the Expos to the Nationals, since these two teams are actually the same franchise entity.

 

The join below leaves us with just the season records of players who played for franchises that currently exist. For ease of coding later, we’ll create this join as a new table called [schema].players .

 

--Create table isolating players who played for franchises that currently exist
 
DROP TABLE IF EXISTS [schema].players;
 
CREATE TABLE [schema].players
DISTRIBUTE BY HASH (playerid)
COMPRESS LOW AS (
       SELECT a.*, b.franchid
       FROM [schema].batting a
       JOIN (
             SELECT yearid, teamid, franchid
             FROM [schema].teams
             WHERE franchid IN (
                    SELECT franchid
                    FROM [schema].teamfranchise
                    WHERE active = 'Y'
             )
       ) b
       ON a.teamid = b.teamid
       AND a.yearid = b.yearid
       ORDER BY a.yearid, a.stint ASC
);

 

Now, we will create a flag that marks the records from both the year that a player started his career and the years that he began playing for a new franchise. Note that players can move from one franchise to another and then back to the previous franchise; they can also move multiple times within the same season (as indicated by the “stint” variable). The following code sets up the flag that we need (for future coding simplicity, we will create this flag within a new table called [schema].firstyearsflagged :

 

--Set up flag for first year that a player played with a franchise
DROP TABLE IF EXISTS [schema].firstyearsflagged;
 
CREATE TABLE [schema].firstyearsflagged
DISTRIBUTE BY HASH (playerid)
COMPRESS LOW AS (
SELECT playerid, yearid, stint, franchid,
       CASE
             WHEN yearid = MIN(yearid) OVER (PARTITION BY playerid, franchid) OR
                     stint > 1 OR
                     yearid - (LAG(yearid, 1, 1) OVER (PARTITION BY playerid,
                                             franchid ORDER BY yearid)) > 1
             THEN 1
             ELSE 0
       END AS firstyear
FROM [schema].players
ORDER BY yearid, stint ASC
);

 

To see this flag in action, let’s look at the career of Greg Maddux. He began his major league career with the Chicago Cubs in 1986. He then moved to the Atlanta Braves in 1993 after being signed as a free agent, to the Cubs in 2004 after being signed as a free agent, to the Los Angeles Dodgers after being traded by the Cubs midseason 2006, to the San Diego Padres in 2007 after being signed as a free agent, and back to the Dodgers after being traded by the Padres midseason 2008 before retiring at the end of the 2008 season. The code below retrieves his records from the [schema].firstyearsflagged table and arranges them in order by yearid, then stint:

 

--Check firstyear flag 
SELECT *
FROM [schema].firstyearsflagged
WHERE playerid = 'maddugr01'
ORDER BY yearid, stint;

 

Output:

 

playerid

yearid

stint

franchid

firstyear

maddugr01

1986

1

CHC

1

maddugr01

1987

1

CHC

0

maddugr01

1988

1

CHC

0

maddugr01

1989

1

CHC

0

maddugr01

1990

1

CHC

0

maddugr01

1991

1

CHC

0

maddugr01

1992

1

CHC

0

maddugr01

1993

1

ATL

1

maddugr01

1994

1

ATL

0

maddugr01

1995

1

ATL

0

maddugr01

1996

1

ATL

0

maddugr01

1997

1

ATL

0

maddugr01

1998

1

ATL

0

maddugr01

1999

1

ATL

0

maddugr01

2000

1

ATL

0

maddugr01

2001

1

ATL

0

maddugr01

2002

1

ATL

0

maddugr01

2003

1

ATL

0

maddugr01

2004

1

CHC

1

maddugr01

2005

1

CHC

0

maddugr01

2006

1

CHC

0

maddugr01

2006

2

LAD

1

maddugr01

2007

1

SDP

1

maddugr01

2008

1

SDP

0

maddugr01

2008

2

LAD

1

 

As we can see, the records from the years 1986, 1993, 2004, 2006 (specifically, the stint with the Dodgers), 2007, and 2008 (again, the stint with the Dodgers) are all flagged with a 1.

 

Exploration:

Now, let’s find the different franchise patterns that exist. To do this, we will run Teradata Aster’s nPath function on the records from the [schema].firstyearsflagged table where firstyear = 1. Also, we only want to look at player moves that happened during or after 1998, since this was the year that the two newest franchises in Major League Baseball (the Arizona Diamondbacks and the Tampa Bay Rays) were created. This restriction controls for age of franchise; franchises that have been around longer naturally have a longer history of players moving to and from their team than newer franchises, so if we don’t control for age of franchise, it artificially inflates the “popularity” of the older franchises.

 

The code below runs nPath on our records of interest. We’ll also create our nPath results as a new table that we can feed into Teradata Aster AppCenter in order to visualize our results.

 

 

--run nPath on [schema].firstyearsflagged - are there common movement patterns 
--between franchises?
DROP TABLE IF EXISTS [schema].mlbmoves;
 
CREATE TABLE [schema].mlbmoves
DISTRIBUTE BY HASH (path)
COMPRESS LOW AS (
SELECT path, count(*) cnt FROM nPath (
       ON (SELECT * FROM [schema].firstyearsflagged WHERE firstyear = 1
             and yearid >= 1998)
       PARTITION BY playerid
       ORDER BY yearid, stint ASC
       Mode (NONOVERLAPPING)
       Pattern ('A.B+')
       Symbols ('TRUE' AS A,
                    franchid != LAG(franchid, 1) AS B)
       Result (ACCUMULATE(franchid OF ANY(A, B)) AS path)
                   
)
GROUP BY path
);

 

Here is a sample of records from our new table:

 

 

For example, we can see from this output that, between 1998 and 2016, three players began their major league careers with the Chicago Cubs and then moved to the Miami Marlins (denoted here by FLA), finishing their careers with the Marlins.

 

Visualization:

At this point, let’s visualize our results. We can do this by building an app in Teradata Aster AppCenter:

  • From the AppCenter home screen, click “Build an App”.
  • Fill out the following fields:
    • App Name
    • App Version
    • Developer Name
    • Developer ID
    • Description
  • Next, go to the Logic tab and click the “Generate Visualization Code…” button.
  • Fill out the following fields as shown below:
    • Data (Table/View): [schema].mlbmoves (Note: replace [schema] with the actual schema of your table)
    • Data Format: nPath output schema
    • Visualization Type: sigma
    • Visualization Title: Add a title here if you want
  • Click the “Add Visualization Statement” button. In the code that appears, make sure that your input table appears as “[schema]”.”mlbmoves”
  • Click “Save”.
  • Configure the app. On the Database Connection tab, connect to the database where the [schema].mlbmoves table was created. Click “Save”.

 

At this point, the app is complete. On the screen that now appears, fill out the “Title” field and click “Run”. When the app finishes, click on the little sigma chart icon that appears next to the report title in order to see the visualization that was produced. Our sigma chart should look like this:

 

 

The larger a point is in our sigma chart, the more likely it is to appear in a movement pattern. According to our diagram above, from the years 1998-2016, players were more likely to move to or from the San Diego Padres than any other franchise. Also, the darker a path is in this diagram, the more often it occurs. Using the toggles that come with the sigma chart visualization, we can hide the less-frequently occurring paths and get the new sigma chart below:

 

Here, we can see that, between 1998 and 2016, some of the most common player movement patterns involved going from the New York Yankees to the Washington Nationals, from the Baltimore Orioles to the Toronto Blue Jays, from the San Diego Padres to either the Oakland Athletics or the Boston Red Sox, from the Athletics to the Colorado Rockies, or from the Seattle Mariners to the Padres.

 

As a next step, we could take a closer look at the stats of the players who moved between these franchises during their careers. For example, we could see if batting averages tend to peak while players are with some franchises and not others. For pitchers (especially American League pitchers), we might follow their earned run average (ERA) over time instead of batting average to see if there is a correlation between ERA and franchise played for.

 

In this post, we have seen how to use Teradata Aster’s nPath function in conjunction with Teradata Aster AppCenter in order to visualize player movement between Major League Baseball franchises. If you have further questions, feel free to contact me at kate.phillips@teradata.com.