Defensive Programming in the ELT environment...

There was a discussion over on the Teradata Masters list about NoPI. Basically, someone was questioning why you would ever use such a feature... and I was a bit astounded by some of the commentary I received. I'm not going to rewrite my post; it's quoted below. But it made me think about how important it is for us to remember our "boundary basics", so I thought I'd illustrate one of the reasons with a real-life example...

At eBay, the "Marketplaces Site" is a very large and complex technical ecosystem, in which new releases of code roll to production quite frequently. In one such release, a new feature caused a spike in activity on an existing data feed - and in that feed, the PK was set to a default value for unknowns, which is what all of the new records were...

This data happened to arrive at an early point in the nightly batch, one with many, many downstream dependencies. The load job failed for lack of spool, the DBAs and development staff had to get up in the middle of the night to develop and implement a workaround, and we missed our SLA for the day. This was not the first or last time this happened.

We actually had to change the default value to a complex key and write a quite difficult transform on a binary data feed with UTF8 characters in the data stream... not the kind of decision making you want to do at 2am. There was no way to just run an update. And there was no way to write a check for the validity of the PK values on the database - because the data was not on the database yet.
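
This is exactly the gap a NoPI staging table closes: rows land without being hashed on any column, so even a badly skewed feed loads cleanly, and the validity check becomes plain SQL on the database. A minimal sketch - the table, column, and default value (stg_feed, pk_col, 'UNKNOWN') are hypothetical stand-ins for our real feed:

    -- Hypothetical NoPI landing table: no hashing on any column,
    -- so a feed full of default-valued PKs still loads without skew.
    CREATE TABLE stg_feed (
        pk_col   VARCHAR(64),
        payload  VARCHAR(1000)
    ) NO PRIMARY INDEX;

    -- Once the data is on the database, the sanity check is trivial
    -- and can gate the rest of the batch before anything breaks:
    SELECT COUNT(*) AS suspect_rows
    FROM stg_feed
    WHERE pk_col = 'UNKNOWN';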

Defensive Programming is something few of us learn until we become professional programmers - those little tricks which keep you out of trouble. This is certainly one of them. As much as there will be a performance improvement to the loads from NoPI - which frees AWTs and loader slots - the benefit to me is not getting up at 2am... or having our team have to do so...

Bob Hahn also noted the excellent NoPI Orange book "No Primary Index (NoPI) Table User's Guide" by Tam Ly.

My original email to the masters list: 

NoPI is going to be a great feature for sites which have two general categories of use: 1) ELT, and 2) sandboxes or user/app-created tables. Both of these have been mentioned previously in this chain. I would expect that in both cases, the default for an organization will be to use NoPI.

For me, this is a defensive feature which prevents runtime breakage. It prevents our DBA team from having to deal with exceptions, and our development staff from having to deal with a class of boundary conditions which have no real solution on TD. Runtime exceptions are the most expensive to deal with from a staff perspective, particularly when your batch runs in the middle of the night...

In our case, our primary loading methodology is load, transform, apply. This involves at least two staging tables: the first is the target of the FastLoad, and the second is the target of the final transformations (aka target-rich rows) prior to application to the base production tables (this second table allows for rollback/roll-forward AND for movement to another system). In both tables, the only time a full table scan will not be applied is when partially enriching an attribute with a join to some other table - which itself is almost always going to involve an FTS. Put another way: how many transformation processes do you have which warrant an INDEX between your load and your final table? By definition this should nearly always be none...
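
In Teradata SQL, that flow looks roughly like the sketch below. All of the names here (stg1_feed, stg2_feed, dim_site) are hypothetical; the point is simply that neither staging table declares a PI:

    -- Stage 1: FastLoad target. NoPI, so the load appends rows across
    -- the AMPs without hashing on any column - no skew on a bad feed.
    CREATE TABLE stg1_feed (
        pk_col    VARCHAR(64),
        raw_attr  VARCHAR(256)
    ) NO PRIMARY INDEX;

    -- Stage 2: target-rich rows after transformation, still NoPI.
    CREATE TABLE stg2_feed (
        pk_col         VARCHAR(64),
        raw_attr       VARCHAR(256),
        enriched_attr  VARCHAR(256)
    ) NO PRIMARY INDEX;

    -- Assumed pre-existing reference table used for the enrichment join.
    CREATE TABLE dim_site (
        site_id    VARCHAR(64),
        attr_desc  VARCHAR(256)
    ) PRIMARY INDEX (site_id);

    -- The transform is a full table scan anyway, and the enrichment join
    -- redistributes on the join key regardless of how stg1_feed is stored.
    INSERT INTO stg2_feed
    SELECT s.pk_col,
           s.raw_attr,
           d.attr_desc
    FROM stg1_feed s
    LEFT JOIN dim_site d
      ON d.site_id = s.pk_col;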

Since you're doing an FTS, in most cases it will not make much sense to organize the data (which is what a key or distribution does...) just to persist it. The first time you join the data, it has to be redistributed on the join key - but that has to be done anyway. Further, this lack of organization protects you from unknown spikes in the input data. When your data is NOT highly structured, or does not have an absolute and known distribution of values, you can easily get an error which causes great skew - and that skew is very difficult to find and filter, because you cannot test for it on the database until the data is landed. (This is a derivation of the BI tool case.)
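
Once the rows have landed in the NoPI table, though, testing a candidate key's distribution is a one-liner using Teradata's hash functions. A sketch, reusing the hypothetical stg1_feed from above:

    -- For each AMP, count the rows that WOULD hash to it if pk_col were
    -- the primary index. A few AMPs with huge counts is exactly the skew
    -- that would otherwise surface as a spool failure at 2am.
    SELECT HASHAMP(HASHBUCKET(HASHROW(pk_col))) AS amp_no,
           COUNT(*)                             AS row_cnt
    FROM stg1_feed
    GROUP BY 1
    ORDER BY row_cnt DESC;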

Almost ALL BI tools are written for SMP databases first, and as such typically have no concept of data distribution. So all of us customers, along with the MPP database vendors, have to go back to them to get these concepts built in - and they are not simple concepts to implement. Even MSTR has issues, and they're the best at it of the big BI vendors.

This is the same concept with data analysts. The less they need to know about platform-specific implementation semantics, the more productive they are and the less "churn" there is. How frustrating is it for an analyst who continuously runs out of spool because there are 2000 vprocs on a box and 100K rows load up on one vproc? It takes a very expensive escalation process to a DBA to figure this out - and we all look bad in the process. Worse, the user didn't get the job done.

I consider NoPI the default for any new data which goes on the box. If you are a consultant who wants their ELT code to never break, you will use this liberally. If you're an analyst with a sandbox, you will want this to be the default until you get to know the data.
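
For the sandbox case, that looks something like the sketch below (names hypothetical again). If memory serves, there is also a DBS Control setting that makes NoPI the system-wide default for tables created without an explicit PI - check the Orange Book above for the specifics:

    -- Analyst sandbox: copy unfamiliar data without guessing at a PI.
    -- Get to know the distribution first; pick an index later.
    CREATE TABLE sandbox_db.my_experiment AS (
        SELECT *
        FROM production_db.some_feed
    ) WITH DATA
    NO PRIMARY INDEX;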

This feature is going to be one of your best friends.