Visualizing View Dependencies with Named Entity Recognition and gTree (Part 1)

Learn Aster

Aggregate views are an excellent way to package business logic into a repeatable and (hopefully) optimized way of producing reports.  A view provides a "single version of the truth", or an accepted and optimized way of joining tables to produce standardized output.  If all of my users only use a single view to produce a monthly sales report, then I can be sure that everyone is seeing the same results, and I can take action to optimize the performance of that view to ensure that I'm meeting SLA's and getting consistent data to users in a timely fashion.

While views can simplify data access, it's still important for DBA's and end-users to have an idea of where the data that they're accessing is physically stored.

Why is it important to be able to track the source of data that is contained in views?

  1. Helps DBA's optimize the performance of the view by tuning the core tables and joins
  2. Useful when troubleshooting data issues that appear downstream in the reporting layer
  3. Helps understand what dependencies exist - if a table is dropped or modified, which views will it affect?

With these advantages in mind, is there a way for us to build a "master list" of all views and the object dependencies that exist within them?  Of course!

For example, let's take a look at some of the data dictionary views from a Teradata database.  The DBC views are built on top of the tables in the DBC database which hold system-level metadata.  The DBC.TableSizeV view in the example below joins three dictionary tables together to show the current size of each table on the Teradata system:

REPLACE VIEW DBC.TableSizeV
AS SELECT
DataBaseSpace.Vproc,
Dbase.DatabaseName,
Dbase.AccountName,
TVM.TVMName,
DataBaseSpace.CurrentPermSpace,
DataBaseSpace.PeakPermSpace
FROM DBC.Dbase, DBC.DataBaseSpace, DBC.TVM
WHERE DataBaseSpace.TableID <> '000000000000'XB
AND DataBaseSpace.TableID = TVM.tvmid
AND TVM.DatabaseId = Dbase.DatabaseId
AND TVM.TableKind NOT IN ('G','M','V');

We can visualize DBC.TableSizeV's dependencies like this:

Since we have DDL for all objects stored in our data dictionary, we can use Aster's text analysis capabilities to examine the DDL text and identify object names.  The Named Entity Recognition (NER) in Aster's SQL-MR library allows us to define a list or "dictionary" of named entities and then extract all instances of those entities from an input document.  For more information on the NER function check out John Thuma's post here.

First we'll need to build our "dictionary" of named entities that we will want to search for.  We will want to search all of our input documents (each view's DDL) for the names of other views or tables.  Since we want to search for names of both views AND tables, our list has to include names of both.  I built a dictionary table like the one below, which holds the object's full name in one column and the object type in another:

CREATE DIMENSION TABLE pdcr.tblvw_dict (
objecttype varchar(1) NULL,  --"V" for views,"T" for tables
dict varchar(100) NULL   --fully qualified objectname (databasename.objectname)
)
DISTRIBUTE BY HASH (dict)

Second, I'll need my input documents.  In this case we're looking at view DDL to identify individual object names.  We need to build a table that holds the DDL for each view.  Aster's SQL-MR function will search the DDL for instances of the object names.  We need to keep in mind that the DDL may need to be slightly modified in order for NER to work.  For example, NER won't recognize the object name if it is immediately preceded by a comma without a space.  It is also a good practice to remove double quotes from the DDL.  I built a table like the one below, using regular expressions (regexp_replace) to make any relevant modifications to the DDL

regexp_replace((regexp_replace(regexp_replace(createtext,'"','','g'),';',' ','g')),',',' , ','g') as createtext

CREATE FACT TABLE pdcr.tblvw_info (
objectfullname varchar NULL,   --fully qualified objectname (databasename.objectname)
objecttype varchar(1) NULL,   --"V" for views,"T" for tables
createtext varchar NULL   -- full text DDL, modified as noted below
)
DISTRIBUTE BY HASH (objectfullname)

NOTE: Be aware that Aster is case-sensitive while other databases (like Teradata) may not always be.  You'll want to keep this in mind when using any of Aster's text analytics functions.  In this case, I have casted the content of all relevant text fields (objectname, objecttype, createtext) to uppercase.

Now that we have both of our inputs for the NER function (our dictionary and our input documents), we call the NER function with the syntax below:

CREATE TABLE pdcr.objectlinks
DISTRIBUTE BY HASH(objectfullname)
AS (
SELECT DISTINCT cast(objectfullname as varchar(100)) as objectfullname
, objecttype as objecttype
, CAST(entity as varchar(100)) entity
, "type"
from NER (
on pdcr.tblvw_info PARTITION BY ANY  -- list of input documents (DDL)
on pdcr.tblvw_dict AS dict DIMENSION   --dictionary of object names
textcolumn('createtext')        --input text column
showContext(4)
accumulate ('objectfullname', 'objecttype')  --retreive the objectname and it's type from the DDL
));

This will create a new table with a list of each view, along with each object found within that view's DDL.  For example, let's look at DBC.TableSizeV from above:

select * from pdcr.objectlinks where objectfullname = 'DBC.TABLESIZEV'

objectfullnameobjecttypeentitytype
DBC.TABLESIZEVVDBC.DataBaseSpaceT
DBC.TABLESIZEVVDBC.DbaseT
DBC.TABLESIZEVVDBC.TableSizeVV
DBC.TABLESIZEVVDBC.TVMT

As you can see, NER returned one row for each object name that it found in the DDL for DBC.TableSizeV.  This includes the object itself "DBC.TABLESIZEV", which is right after the CREATE TABLE statement.

With simple views like the example above, identifying the source tables that hold the data is not very difficult, since both dependencies of the view are tables.  However, in cases where there are many views that are "nested" within a view as dependencies, tracking down the source data tables can get both complicated and time consuming.  What if DBC.TVM was also a view that selected from 2 tables and 3 more views? How can we programattically identify and visualize the dependencies and sub-dependencies of a specific view? 

Part 2 can be found here.