Building an Events Table or Integrated Contact History Table in Aster

Learn Data Science
Teradata Employee

Building an Integrated Contact History (ICH) table:

  A note to the reader:

    This text speaks about some concepts, practices, techniques

    and -most importantly- pitfalls in the process of building

    an Integrated Contact History structure/ table.

    If you are a seasoned ICH builder you may find this text too

    obvious for your skills.  However, if you have never built one

    of these before, I urge you to read through.  I tried to keep

    "cruft" out and focused on discussing the challenges I faced

    when building an ICH for the first time.

    Some of the ideas and methods are quite commonplace, but I

    sure as hell hope someone had pointed them out to me before

    I got started my first time.  It would have cut days off my

    development efforts.


An ICH is an excellent data representation to store, track, analyze

and report on time-series event sequences which offer a glimpse into

user behavior.  However, as simple as the ICH structure may seem in

the number of columns it holds, its construction requires manipulation

and careful crafting of the large scale source data-sets which feed

events into it.  I cannot stress this fact enough: it may look simple,

but the transformations that conform them are not.

  e.g: In a Digital Marketing world, billions of rows representing

  ad impressions page-view hits, click-through actions, shopping

  cart actions, etc. will be analyzed and then transformed into

  the standard form of the ICH.

This writeup pretends to be a guide on concepts, practices and common

pitfalls to avoid when building an ICH.

Given the large volumes of data (and number of rows) to be processed

during construction, being mindful about the process is likely to save

you hours of re-processing time after misinterpreted, erroneous or

just plain failed assembly attempts.


The ICH is a structure that is central to a Digital Marketing project

or any other multi-channel event analysis effort.

An ICH comes in the form of a large FACT table which hosts timestamped

events for a universe of identities, be them customers, on-line web

site visitors or otherwise.  As such, an ICH is distributed by hashing

their identity column, as to keep all of a user's (identity) events

grouped within the same partition.

Events in the ICH can be of many types and are labeled to reflect this

fact.  Then furthermore, each event is accompanied by a group of

properties/ attributes relevant to its type.

The simplest table schema that I can think of for an ICH is:

    CREATE FACT TABLE public.ich (

       user_id        VarChar,

       event_time     Timestamp,

       event_type     VarChar,

       event_data     VarChar


    DISTRIBUTE BY HASH(user_id);

In this example, think of 'event_data' as a large VarChar BLOB that

holds information relevant to the 'event_type' it is associated with,

keeping in mind that in real applications the ICH will use more than

one column to store different attributes for an event.

In the Digital Marketing world event types will typically represent

on-line customer interactions like ad impressions, ad clicks, e-mail

sent to the user, e-mail click-throughs, etc.  However, an ICH is not

conceptually limited to representing only Digital Marketing events.

The ICH structure: Event data and polymorphic columns

If you paid attention and thought about the minimal schema I used as

an example, you will come to the inevitable conclusion: "This column

layout pretends to use the 'event_data' column as a free-for-all

BLOB-of-sorts storage field for attributes associated with the event

itself".  This notion underpins the polymorphic event-attribute

columns in an ICH and, to exemplify, I will talk about "Creative IDs".

  NOTE: As mentioned before, your ICH is [very] likely to have

  more than a single 'event_data' column, so the discussion

  below will introduce other column names.

  Don't be surprised by this.

In Marketing-speak a Creative ID identifies an ad (advertisement) like

"the dog jumping over a red fence" and such.  I mention it because it

took me a while to learn this; i.e: a Creative ID identifies an ad.

Creative IDs are found/ known in different sources by different names

(column names) like 'creative_id', 'ad_id', etc.  While building an

ICH you will be storing the values found in these sources into one

column named 'marketing_id' which [more generally] represents an item

of marketing material that a user was exposed to; e.g: an ad, a

specific type of e-mail, etc.

For example, when building your ICH you will be tasked with:

  1. Gathering all ad-serve actions to INSERT them into the

     ICH, as an "Impression" (IMP) event type while taking

     care of placing the 'ad_id' value into the ICH's

     'marketing_id' column.

  2. Gathering all outgoing e-mail events and INSERT them into

     the ICH as an "Email" (EML) event type and placing the

     'creative_id' value (again) in the ICH's 'marketing_id'.

Needless to say, as you populate the target column in the ICH it is

quite easy to build confusion on what values from what source went

into which of the ICH columns.  This is a both nomenclature and

communications challenge when designing an ICH.

You will be best served by having a clear and well understood piece of

documentation -available to all involved- about these mappings.  It

will avoid mis-communication, the need to reprocess and, eventually,

will serve as a reference to all users of the ICH.  (More later on a

good documentation artifact to keep track of all this)

Event types: Labeling

Event types are things like page views, ad impressions, ad clicks,

sent e-mails, shopping cart actions (e.g: add and remove items) and

purchase check-outs.

These events are typically and best identified by a short character

string like 'IMP', 'CKL', 'EML', 'SCA' (after "Shopping Cart Add"),

'SCR' (after "Shopping Cart Remove") and 'CNV' (after "Conversion").

Sometimes an event type will have a category or sub-type; e.g: An ad

impression might have been served by an external agency on a public

site (like Facebook or Google), or it could have been served by an

internal ad-serving system (like X+1) on your customer's own site.

You might represent this using two columns in the ICH: the Event Type

and a Category, which will reflect (for each event) classification

labels like:

   ev_type | category


   IMP     | Display Advertising   <- External Internet

   IMP     | On-site Advertising   <- Internal site

I personally like to also track the original source of each event

placed in the ICH and offer the following revised ICH schema, enhanced

for better tracking of event sources and types:

    CREATE FACT TABLE public.ich (

    -- The original source for the event

       src              VarChar,

       user_id          VarChar,

       event_time       Timestamp,

    -- Type and category for the event

       event_type       VarChar,

       event_category   VarChar,

       event_data       VarChar


    DISTRIBUTE BY HASH(user_id);

The source ('src') column allows me to reference the original data

source from which the event was harvested; the event type and category

work as described above.

When inserting events into the ICH, you will get used to seeing SQL

forms like the one below:

    INSERT INTO public.ich


    -- X+1 (aka: 'XPO') is the source

       'XPO'::VarChar        AS src,

       user_id, event_time,

    -- The event type and category

       'IMP'::VarChar        AS event_type,

       'On-site Advertising' AS event_category,

    -- Data associated with the event

       some_column::VarChar  AS event_data


    -- The source is the X+1 impressions table



    -- Only records which meet this condition in the

    -- source are considered X+1 impressions to be

    -- reflected in the ICH



The above means you are selecting specific rows from the original X+1

impressions data and inserting well-tagged records into the ICH with

'some_column' as relevant attributes for this type of event.

Special events and double counting:

As you gather event types from their original sources you will find

that some of them are more scarce and important than others, and hence

more "valuable" for their rarity.

Such is the case of, say, conversions, which come in many forms like:

shopping cart additions, purchase check-outs and on-line application


I assume you are familiar with the concepts of shopping cart addition

and removal of items and purchase check-outs, so I will only speak

about the last:

  Application completion refers to the act of completing an

  on-line application for a product or service, normally after

  having gone through what is known as an "application start"

  which means: having visited the application page for said

  product or service.

These types of events (when compared to ad impressions or page visits)

deserve closer attention in their counting, accounting and insertion

into the ICH.

When looking through your data you may find what looks like an item

being removed from a shopping cart without being previously added, or

the completion of an application without evidence of the application

[start] page being visited before it.  You may also detect artifacts

in the data where for some reason (like a user clicking their browser

"Reload" button when it seems to have hung) a purchase check-out is

registered twice (same user, same amount, etc)

These situations deserve special attention to avoid double counting,

or just plain counting of stuff that your customer's business cannot

rightly justify reflected in their accounting.

In an Aster world, special methods to pay attention to these events

and rules include (but is not limited to) the use of 'Sessionize()'

and nPath as tools to validate source records and decide which to

include (and not include) in your ICH.

Event attributes: Values and References

By the discussion above you should know that to populate the ICH you

will be gathering specific rows of data from a multitude of sources

and creating well identified (source and type) event records in the


It should also be clear that you will be mapping source column values

into semantically polymorphic columns in the ICH.  (if you don't get

this last statement, go re-read the section on polymorphic columns)

Now, the attribute values populated into the ICH will sometimes be the

information you are looking for, but sometimes these values will be

mere (numeric or string) references to what you want, like a string

description.  An example of this is (again) the "Creative ID".

The Creative ID is typically a numeric value representing a specific

piece of advertisement (e.g: the tall brunette in a blue bikini).  But

the numeric representation is -of course- not sufficient.  It is only

a reference and people will want to see a description for it in the

ICH itself (without the need to go and LEFT OUTER JOIN some other


Harvesting these descriptions is often harder than would be expected

and the task normally falls into one of three scenarios I have had to

tackle (from easiest to hardest):

  1. The customer has provided you with a nice, clean and

     compact DIMENSION TABLE catalog of IDs to description.

     All you have to do is LEFT OUTER JOIN (not JOIN) your

     selected source rows to it and *presto*, the needed

     descriptions will be in place.

  2. A tricky scenario comes when the customer *believes*

     they have a catalog for you, but what they provided

     is *NOT really* a proper catalog.

     i.e: in a catalog each ID must appear only once

     associated with a description and -hopefully- all

     IDs in the source can be found in said catalog.

     This is a common scenario with so-called Creative ID

     tables, where you will find a table with more than

     two columns: ID -> Description.  Multiple columns is

     a tell-tale sign that what you are seeing is NOT a

     proper catalog, and that you will find more than one

     record for each ID, sometimes associated to the same

     description and sometimes with a different one.

     Make sure what you have in hand is a proper look-up

     catalog before using it, otherwise it will throw off

     your row/ event counts when you LEFT OUTER JOIN.

  3. The trickiest scenario (that I have seen) goes something

     like this:

     "You have an add-click event which does not contain a

      Creative ID.  Its corresponding Creative ID can only

      be found in a preceding impression to the same user"

     Now you need to find a way to associate a point-in-time

     click event with a preceding impression (through some

     other rule), harvest the sought-after Creative ID from

     there and then look-up its description against a proper

     catalog (see #2 and #1 above)

Whichever the case, you are very likely to be confronted with at least

#2 above; sometimes #1 and others #3.  Remember that this treatment

will typically be different for each attribute in each of the sources

that you are transporting into the ICH and, as such, can be easy to

mix-up, mis-communicate across your team and incur in delays reviewing

and re-reviewing definitions with those who misunderstood, followed by

a re-processing once they have been [allegedly] understood (for the

Nth time).  I'm not joking here.

Event and identity counts: Life-saving checks and balances

After presenting the challenge of resolving attribute references to

descriptions in the previous section I want to define the ill effects

of a bad catalog and a technique to spot them when they happen.

The technique is as simple as counting.

By definition, the ICH assembles user events from different sources in

a 1-to-1 relationship.  This means that if you have a source table

that contains, say, 1 million ad-impression records, your ICH will

hold the same number of ad-impression events (from said source).

  MAXIM: Get used to audit your ICH insertion operations by

  counting event types (by source if necessary) before and

  after your INSERT operations.

I know that the maxim above is tedious and requires processing time

when you are dealing with an ICH with several billion rows.  However

(believe you me) it will keep you out of trouble.

The following template has proven to be a life-saver for me:


      src, event_type, event_category,

      a.row_cnt AS event_count,

      b.row_cnt AS unique_uid_count


   -- The number of rows for each event type

      (SELECT src, event_type, event_category,

              Sum(1) AS row_cnt

         FROM public.ich

        GROUP BY 1,2,3

      ) AS a


   -- The number of unique User IDs for each event type

      (SELECT src, event_type, event_category,

              Sum(1) AS row_cnt

         FROM (SELECT src, event_type, event_category, user_id

                FROM public.ich

               GROUP BY 1,2,3,4) AS r

        GROUP BY 1,2,3

      ) AS b

   USING (src, event_type, event_category)

   ORDER BY 1,2,3


You can compare the results from the query above before and after an

INSERT operation into the ICH and validate the difference against the

number of rows in the original source which you expected to generate

new event rows in your ICH.

Counting at the source: Avoid data-skew

As you bring rows over from diverse sources into the ICH you should be

mindful about the possibility of data-skew, both at the source and (by

consequence) in the ICH.  Remember that data-skew will impact your run

times significantly on almost all operations performed on a data-set.

Moreover, you will find that on-line data-sets hold a large number of

interactions generated by robots (bots) and test/ monitoring systems.

You will be best served by detecting these rows and filtering them out

of the ICH.

As an example, consider the following row-counts extracted from an X+1

impressions data-set:



      Trunc(100.0 * a.row_cnt / b.row_cnt, 3) AS pct,


   FROM (SELECT user_id,Sum(1) AS row_cnt

          FROM stage.x1_impressions GROUP BY 1) AS a,

        (SELECT Count(*) AS row_cnt FROM stage.x1_impressions) AS b


   LIMIT 5;

    row_cnt |  pct  |       user_id       


     15,732 | 5.518 | OO-00000000000000000

        380 | 0.138 | LF-00000000199843325

        358 | 0.126 | AD-00000000226293653

        356 | 0.125 | CB-00000000204177401

        345 | 0.121 | LF-00000000190174618

   (5 rows)

Clearly the 'OO-*' User ID is off the charts with regard to the number

of impressions and should be filtered when inserting into the ICH.

Remember to apply this counting technique to the ICH repeatedly too

(and not just the data-sources themselves).  This will help you ensure

you are building a well distributed (not skewed) table.

Complex operations (like nPath and table JOINs) will sometimes run for

hours instead of minutes as the larger partitions (e.g: 'user_id') are

slowly completed after all the smaller ones have already been processed.

Keep an eye out for this data-skew phenomenon.

Identities in the ICH: The basics

Thus far I have described the idea that each and every event within

the ICH has a 'user_id' designation that identifies the user that it

is associated with.  In real life the full story is somewhat more


Events stored in the ICH are harvested from an array of different data

sets.  It is typical for each individual data-set to use its own kind

of unique identifier for its data; i.e: X+1 uses X+1 User IDs, Adobe

Insight uses Profile IDs (and 1st Visit cookies), DoubleClick uses a

User ID which is unrelated to that in X+1.

Under this light, the problem of identity (and tracking it) in the ICH

is more complicated than a simple [overarching] User ID.

In order to present a consistent User ID within the ICH you will be

faced with a couple of different scenarios:

  * The customer will offer you one or more [allegedly] correct

    catalogs to translate each source's identifier into a single

    unified ID space, or

  * You will have to come up with mechanisms to resolve each source's

    identities into a unified space you can use in your ICH.  (more on

    this later)

Whichever path you choose (or have available) it will not be perfect

for two reasons:

  1. There will be source identities which you will fail to

     resolve.  Learn to measure and deal with this effect

     and get used to it. (more on this later)

  2. When resolving source identities into a unified space you

     face the same risk as when using bad (non unique catalogs).

     Use the counting technique described earlier to verify you

     are not producing double (or triple or worse) records in

     the ICH when JOIN(ing) to identity tables.

Of the two imperfections described above the second is dangerous, but

I have described how counting will help you identify when it happens.

There is a technique that is great to deal with the first:

You may or may not have noticed that the 'user_id' in the ICH is

defined to be of type VarChar instead of BigInt as would otherwise be

common with large cardinality types in a FACT table.  There is a

reason for this:

Imagine now that you are dealing with two data sources (but the same

principle applies to more of them): DoubleClick (DFA) and Adobe

Insight (INS).

Furthermore, imagine you have a way to map some DFA User IDs into INS

Profile IDs, with the latter being a preferred identification because

it deals with users who have authenticated on your customer's site.

Under this scenario, the layout of the ICH can be expressed as:

    CREATE FACT TABLE public.ich (

    -- The original source for the event

       src              VarChar,

    -- A whole set of columns for user identification

       user_id          VarChar,

       ins_profile_id   VarChar,

       dfa_user_id      VarChar,

    -- o -

       event_time       Timestamp,

    -- Type and category for the event

       event_type       VarChar,

       event_category   VarChar,

       event_data       VarChar


    DISTRIBUTE BY HASH(user_id);

Notice -of course- how I have extended the ICH's "identity section" of

columns to store not just the 'user_id' value that the table will be

HASH DISTRIBUTEd on, but also two additional columns to store the

original identity from its source channel.

When inserting events from the Adobe Insight (INS) source I will use:

    INSERT INTO public.ich


    -- Insight (aka: 'INS') is the source

       'INS'::VarChar           AS src,

    -- Source-tagged identities

       'ins:' || profile_id     AS user_id,

       NULL::VarChar            AS dfa_user_id,

       profile_id               AS ins_profile_id

    ... yadda, yadda, yadda ...

    FROM src.ins_impressions

More interestingly, assuming that my DFA User ID to Insight Profile ID

catalog is stored in a table named 'dfa__uid_to_ins_pid', to insert

events from the DFA source I will use:

    INSERT INTO public.ich


    -- DFA is the source

       'DFA'::VarChar                 AS src,

    -- Preference identity selection


          'ins:' || b.ins_profile_id,

          'dfa:' || a.user_id)        AS user_id,

    -- Source identities

       b.ins_profile_id               AS ins_profile_id,

       a.user_id                      AS dfa_user_id,



       (SELECT * FROM src.dfa_impressions


           -- Only records which meet this condition in the

           -- source are considered DFA impressions to be

           -- reflected in the ICH

              <condition>) AS a

    LEFT OUTER JOIN dfa_uid_to_ins_pid AS b

      ON a.user_id = b.dfa_user_id


Notice how the ICH 'user_id' column is populated using ins:X' when it

is available in the look-up catalog and as 'dfa:Y' when not, with the

associated meaning:

  "The Insight Profile identity will be used in the ICH over

   the DFA User identity whenever a resolution is possible.


Notice also how the 'ins_profile_id' and 'dfa_user_id' identity

columns are also populated accordingly.

Identities in the ICH: Leveraging cross-channel identities

The previous section discussed the basics of identity designation in

the ICH using a somewhat trivial scenario where there was a DFA User

ID to Adobe Insight Profile ID look-up catalog available.

More often than not such a catalog will not exist, and you can find

ways to construct one but such topic is beyond the scope of this

document which I might discuss in "Evidence-based Identity Tracking"

if I find the time.

What will be discussed in this section is the notion of cross channel

identities.  These are pieces of information found in source records

from different channels and will improve your outcome in identification.

For this discussion I want you to imagine a DFA record containing a

column named 'profile_id' which will [sometimes] contain the value of

the user's Insight Profile ID.  The value will have been populated by

DoubleClick when your browser holds a cookie with such information.

Under this scenario my DFA-originated event insertions into the ICH

may look like:  (I'll discuss below)

    INSERT INTO public.ich


    -- DFA is the source

       'DFA'::VarChar                 AS src,

    -- Preference identity selection


          'ins:' || b.ins_profile_id,


            'ins:' || a.profile_id,

            'dfa:' || a.user_id))     AS user_id,

    -- Source identities



          a.profile_id)               AS ins_profile_id,

       a.user_id                      AS dfa_user_id,



       (SELECT * FROM src.dfa_impressions


           -- Only records which meet this condition in the

           -- source are considered DFA impressions to be

           -- reflected in the ICH

              <condition>) AS a

    LEFT OUTER JOIN dfa_uid_to_ins_pid AS b

      ON a.user_id = b.dfa_user_id


Note how the nested 'Coalesce()' functions in the identity section

express the rule:

  "The catalog-based Insight Profile identity will be used in

   the ICH whenever available.  If missing, the DFA-reported

   value for the Profile ID cookie will be used if it exists

   in the source.  Otherwise the DFA User ID will be used.


This technique improves the tagging of DFA-originated events -when

possible- through the use of either a look-up catalog or a hint to the

Insight Profile ID sometimes present in the source DFA record itself.

The scenario described above is a bit of a myth since DFA and Insight

do not generally play together in the way described.  However, there

is a data element which typically does, it is called a First Visit

cookie (or 1st Visit cookie, or 'v1st' for short).  This cookie value

is implanted in a user's browser when they first visit your customer's

site and is reported by DFA whenever it is detected on a click-through

event in the DFA source data.

   NOTE: v1st is an interesting value because it identifies a

   visitor's browser uniquely a.  Not quite the identity of the

   visitor (she sitting at the browser) but the browser used

   which can be loosely associated to a user.

When inserting DFA clicks into the ICH you are likely to use a similar

incantation to the one presented above, but also considering the v1st

cookie value:

    INSERT INTO public.ich


    -- DFA is the source

       'DFA'::VarChar                  AS src,

    -- Preference identity selection


          'ins:' || b.ins_profile_id,


            'ins:' || a.profile_id,


              'vst:' || a.v1st_cookie,

              'dfa:' || a.user_id)))   AS user_id,

    -- Source identities



          a.profile_id)                AS ins_profile_id,

       a.user_id                       AS dfa_user_id,

       a.v1st_cookie                   AS v1st_cookie,



       (SELECT * FROM src.dfa_clicks


           -- Only records which meet this condition in the

           -- source are considered DFA impressions to be

           -- reflected in the ICH

              <condition>) AS a

    LEFT OUTER JOIN dfa_uid_to_ins_pid AS b

      ON a.user_id = b.dfa_user_id


Note how I have implied that the ICH now has a column to store v1st

cookie values named 'v1st_cookie' and how the 'Coalesce()' group now

favors (in order of preference):

  1. The catalog-based Profile ID if it can be determined

  2. The DFA-record's Profile ID if one is present

  3. The browser of origin's v1st cookie value

  4. The DFA-source User ID

In such a list you might ask yourself: "Why is the v1st cookie more

important than the original DFA User ID?", and the reason is: once

your customer's site has implanted a v1st cookie on a browser, several

on-line (source) systems will report it back.  Using a 'v1st:' tagged

identity helps group together events which would otherwise be tagged

as 'dfa:', thus offering no opportunity for cross-channel analysis.

  Capice?  Is your head spinning yet?

Identities in the ICH: Evidence-based Identity Tracking

   This is an advanced topic in cross-channel identity propagation

   and resolution which I will not be covering here.  If and when

   I write some about it I'll come back and plunk that stuff in.

Artifact: Mapping sources to ICH event types

  src | from            | where  | ev_type | ev_category         | attribute_1  |  ...

-----+-----------------+--------+---------+---------------------+--------------+ ...

  DFA | dfa_impressions | <cond> | IMP     | Display Advertising |  creative_id

  DFA | dfa_impressions | <cond> | IMP     | Display Advertising | *creative_id

  TODO: Enhance and formalize Mike's technique for representing these

  things on a 2-dimensional whiteboard.

Artifact: Catalogs available

  TODO: A way to document "catalogs" available and their use.

        (i.e: when catalogs are not really catalogs and how to

         build sensible look-up tables for them)

In praise of the ICH:

This section was to be entitled "Just why the hell would I want to

build this ICH structure?" but it was both too long and did not quite

reflect the tone of what I am going to say here.

  The ICH is a good thing.

  Actually, the ICH is an *excellent* thing.

Repeat after me: "The ICH is an excellent thing", then keep repeating

this as a mantra while you build yours and work through all of its

intricate construction details.  Believe me, you'll need to.

But why is it so good?  There's no mention of it in Moses' Tablets,

the Qur'an or Talmud, so I'll have to explain.

Firstly and simply, in an Aster SQL-MR world the ICH represents a

single time-series structure upon which pathing (nPath) can be used in

a straightforward way to inspect, detect, measure and report over all

user-related events gathered from a multitude of sources/ channels.

It is also a direct source (and single-stop shop) for event data to be

evaluated for attribution.  But this is not all, so read on.

By reading this writeup and/ or by working in an ICH assembly project

you must have gotten a taste of the variety of data handling decisions

that are pertinent in gathering multi-source (channel) information and

storing it in a single table.  Understanding that this is a serious

exercise defining the treatment of source data is an important fact

and realization.

In an analytics world where "Data Scientists" have not constructed an

ICH, each individual will arrive to different decisions as to how each

case of data-handling must be treated and, when they do so, they will

produce different results on the exact same topics; e.g: number of

conversion events (remember "double counting"?).

What ensues is neither pretty nor productive: people must sit down

with their individual results and trace back through their [complex]

processing methods to the original sources to understand and explain

differences in their results ... typically to Management, which is not


So the ICH not only gathers all of the multi-channel data available in

a single place, but it also does so using a consistent set of rules;

good or bad, but consistent and which can be tweaked in a single

process: the process of ICH construction.

As such, the availability of an ICH wherever multi-source analysis

takes place is a hallmark of maturity.  It offers the consumers of all

this data a pre-built repository from which to bootstrap their work.

It provides them with a normalized starting point that encodes a

wealth of processing decisions they would otherwise have had to make

themselves, individually and inconsistently.