Let's Cross That Bridge When We Come to It

Learn Data Science
Teradata Employee

As a millennial, I spend a lot of time on Twitter. And as a Seattleite, I spend a lot of time waiting for one of the seven draw bridges within a couple miles of my house. And thus, as a millennial Seattleite, I spend a lot time sitting in traffic on a draw bridge, hitting refresh on @SDOTbridges to find out how soon I can get home to my dog.

 

freemont_bridge.jpg

 

The Seattle Department of Transportation’s SDOTbridges is a handy little bot that tweets every time that one of the 7 bridges managed by the city opens or closes to street traffic. Recently, while waiting not very patiently, I was struck by how well formatted the tweets were.

 

sdotbridges_tweet.png

 

Like many in the profession of data science, I love building models and finding new insights but find that I often spend time making data useable. These tweets from SDOT are so well formatted that they could be a gift from the data gods – and it would really be rude not to do a mini project with them.

 

Although fun questions come to mind like: Can we predict which bridges are open at a specific time? or If the Ballard Bridge just closed would it be better to wait for it or take the Freemont Bridge? we’re going to start with some data profiling. It really is important to understand the data before digging in to advanced analytics.

 

In this post we will:

  1. Load the data into the Teradata Analytics Platform using Teradata Studio
  2. Turn the raw tweets into a table we can work with
  3. Use SQL aggregates to profile the data

Previous to this post I created a Twitter Developer license and pulled the last 3,200 tweets from @SDOTbridges which gives us almost exactly one month of data. Using the Teradata Studio Smart Loader (as seen in the gif below) I then created a table and loaded the tweet text and timestamps.

bridges_load_data.gif

 

Now that our data is loaded we'll look at the frequency of tweets for each bridge and the frequency of closures for each bridge.

 

CREATE TABLE BRIDE_TWEETS_PARSED AS (
	SELECT 
		TWEET_ID
		, TWEET_TEXT
		, SUBSTR(TWEET_TEXT, 1 , INDEX(TWEET_TEXT, ' has ')) AS BRIDGE
		, SUBSTR(
			TWEET_TEXT
			, INDEX(TWEET_TEXT, ' has ') + 5
			, INDEX(TWEET_TEXT, ' to traffic ') - (INDEX(TWEET_TEXT, ' has ') + 5)
		)AS EVENT_TYPE
		, TWEET_TIMESTAMP at 'America Pacific' AS TWEET_TIMESTAMP
	FROM BRIDGE_TWEETS_LAST_3200	
) WITH DATA;

SELECT TOP 3 * FROM BRIDE_TWEETS_PARSED ORDER BY 1;

--CHECK THAT ALL DATA LOADED
SELECT COUNT(*) FROM BRIDE_TWEETS_PARSED; --3200
-- tweet_id tweet_text                                                      BRIDGE                       EVENT_TYPE tweet_timestamp     
-- -------- --------------------------------------------------------------- ---------------------------- ---------- ------------------- 
--        1 The Ballard Bridge has reopened to traffic at 2:59:45 PM        The Ballard Bridge           reopened   2018-09-17 21:59:47
--        2 The Lower Spokane St Bridge has closed to traffic at 2:57:06 PM The Lower Spokane St Bridge  closed     2018-09-17 21:57:11
--        3 The Ballard Bridge has closed to traffic at 2:55:43 PM          The Ballard Bridge           closed     2018-09-17 21:55:45


SELECT COUNT(DISTINCT TWEET_ID) FROM BRIDE_TWEETS_PARSED; --3200

SELECT BRIDGE, COUNT(*)
FROM BRIDE_TWEETS_PARSED
GROUP BY 1
ORDER BY 1;
-- BRIDGE                       Count(*) 
-- ---------------------------- -------- 
-- The 1st Ave S Bridge              181
-- The Ballard Bridge                886
-- The Fremont Bridge               1049
-- The Lower Spokane St Bridge       212
-- The South Park Bridge             130
-- The University Bridge             742

SELECT EVENT_TYPE, COUNT(*)
FROM BRIDE_TWEETS_PARSED
GROUP BY 1;

-- EVENT_TYPE Count(*) 
-- ---------- -------- 
-- closed         1561
-- reopened       1639

SELECT BRIDGE, EVENT_TYPE, COUNT(*)
FROM BRIDE_TWEETS_PARSED
GROUP BY 1,2
ORDER BY 1,2;

-- BRIDGE                       EVENT_TYPE Count(*) 
-- ---------------------------- ---------- -------- 
-- The 1st Ave S Bridge         closed           90
-- The 1st Ave S Bridge         reopened         91
-- The Ballard Bridge           closed          410
-- The Ballard Bridge           reopened        476
-- The Fremont Bridge           closed          524
-- The Fremont Bridge           reopened        525
-- The Lower Spokane St Bridge  closed          105
-- The Lower Spokane St Bridge  reopened        107
-- The South Park Bridge        closed           62
-- The South Park Bridge        reopened         68
-- The University Bridge        closed          370
-- The University Bridge        reopened        372 

 Lucky for me, most boat traffic (which causes these closures) happens near the bridges I frequent most. We also notice that we don't have an exact match of opens to closures. 
bridge_tweet_count.png

 It's quite possible that we lost some tweets when extracting from Twitter as the developer licenses can be finicky. However, as a frequent user, I know that the worse offender, The Ballard Bridge, has had some maintenance issues this summer and I would not at all be surprised if the sensor really did see more reopenings than closures. In a future post we will explore this idea and try to understand the anomaly. For now, let's assume that after a closure tweet the next reopen tweet is its pair.

 

Using nPath, we split the data by bridge, order by tweet time, and then look for every instance of a closure followed by a reopening. If there are multiple reopenings we take the first one. 

 

CREATE TABLE BRIDGE_EVENTS AS (
	SELECT 
		DT1.*
		,OPENED_TIMESTAMP - CLOSED_TIMESTAMP HOUR TO SECOND AS CLOSED_DURATION
	FROM nPATH(
		ON BRIDE_TWEETS_PARSED
		PARTITION BY BRIDGE
		ORDER BY TWEET_TIMESTAMP
		USING
			MODE(NONOVERLAPPING)
			SYMBOLS(
				EVENT_TYPE = 'closed' AS C
				,EVENT_TYPE = 'reopened' AS O
			)
			PATTERN('C.O+')
			RESULT(
				FIRST(BRIDGE OF C) AS BRIDGE
				,FIRST(TWEET_TIMESTAMP OF C) AS CLOSED_TIMESTAMP
				,FIRST(TWEET_TIMESTAMP OF O) AS OPENED_TIMESTAMP
			)
	) AS DT1
) WITH DATA NO PRIMARY INDEX;

SELECT TOP 10 * FROM BRIDGE_EVENTS;

-- bridge                       closed_timestamp          opened_timestamp          CLOSED_DURATION  
-- ---------------------------- ------------------------- ------------------------- ---------------- 
-- The Lower Spokane St Bridge  2018-08-17 20:32:25-07:00 2018-08-17 20:55:39-07:00   0:23:14.000000
-- The Lower Spokane St Bridge  2018-08-18 14:31:34-07:00 2018-08-18 14:45:20-07:00   0:13:46.000000
-- The Lower Spokane St Bridge  2018-08-18 16:43:01-07:00 2018-08-18 16:52:00-07:00   0:08:59.000000
-- The Lower Spokane St Bridge  2018-08-18 18:53:01-07:00 2018-08-18 19:07:08-07:00   0:14:07.000000
-- The Lower Spokane St Bridge  2018-08-18 20:44:44-07:00 2018-08-18 20:55:54-07:00   0:11:10.000000
-- The Lower Spokane St Bridge  2018-08-18 21:41:40-07:00 2018-08-18 21:50:23-07:00   0:08:43.000000
-- The Lower Spokane St Bridge  2018-08-18 20:20:34-07:00 2018-08-18 20:29:22-07:00   0:08:48.000000
-- The Lower Spokane St Bridge  2018-08-18 00:40:00-07:00 2018-08-18 00:48:34-07:00   0:08:34.000000
-- The Lower Spokane St Bridge  2018-08-17 19:21:49-07:00 2018-08-17 19:34:49-07:00   0:13:00.000000
-- The Lower Spokane St Bridge  2018-08-17 18:31:52-07:00 2018-08-17 18:47:01-07:00   0:15:09.000000

When we summerize closures by bridge we see that we have some data quality issues: It's unlikley there was a 6 second closure and I'm aware of at least one instance when one of these guys (I'm looking at you, Ballard Bridge) was stuck open, that is closed to traffic, for at least 3 hours. Overall, though, our data seems reasonable. 

 

bridge_stats.png

 

As a final step in our exploring let's visualize the closures over time:

 

brige_time.png

 

We see again that Ballard, Freemont, and The University Bridge are closed most often. Traffic closures, denoted in red, are fairly consistent in length, but we an see some anomalies where the bridges were closed for longer periods of time. We see consistent trends of large white spaces at every bridge at the same time, when the bridges are physically closed but open to traffic, these are mostly likely during the hours of 11PM to 7AM where bridges will only open to boats who have a previously made appointment. 

 

As we've started to play with this data some obvious next steps have stood out:

  • How do we define the closure time if there are multiple open or closed tweets in a row?
  • At a specific date and time, what bridges will be open to traffic?
  • If we see a physically open bridge, how long should we wait before giving up? That is, are we over the average time or might the bridge be broken (really, I've been here 3 times this summer). 

 

We'll explore these next time!