Currently faced with an environment that looks like this:
ETL Tool: Informatica
Requirement: Low Latency / Near Real Time Data Availability
Problem: When pulling data out of the AS/400 journals (via CDC) and populating a Teradata staging table (via TPT Stream), the loads do not complete in time, given the near-real-time data SLAs and the data volumes. Additionally, this approach consumes a significant number of sessions, which is a long-term setback (high cost) when you look at the number of loads that run in the pipeline.
Consideration: We are currently considering an architecture where data is moved into flat files first (at no cost), then like data is merged into a consolidated file (again at no cost). Finally, a single bulk load (at much lower cost and with better performance) loads that data into the target table, as opposed to multiple streams.
The suggested approach reduces the overall number of sessions and the cost on the Teradata box (capacity that can be used elsewhere) and increases performance, at the trade-off of landing data files temporarily. Are there any other best practices that should be considered here, or concerns to look at? This seems to be our best approach so far.
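For what it's worth, the consolidation step above can be sketched in a few lines. This is a minimal, hypothetical illustration only: the directory layout, file naming pattern, and function name are assumptions, not anything from our actual pipeline, and in practice the output would then feed a single TPT bulk load instead of many Stream sessions.

```python
import glob
import os

def consolidate(landing_dir, pattern, out_path):
    """Merge per-session extract files (names/pattern are hypothetical)
    into one consolidated file, so a single bulk load can replace
    many concurrent stream sessions."""
    # Sort for a deterministic merge order across the landed extracts.
    parts = sorted(glob.glob(os.path.join(landing_dir, pattern)))
    with open(out_path, "w") as out:
        for part in parts:
            with open(part) as f:
                for line in f:
                    out.write(line)
    return len(parts)  # number of extract files merged
```

The key design point is that the merge itself costs nothing on the Teradata box; all the work happens on the file system before the one load starts.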
Thanks in Advance
How long has the existing CDC process been in place? Is it home grown? Was Teradata engaged to tune or architect it? The reason I'm asking is that I know of a client that uses a similar approach, and I don't remember them having problems, but then again the data volumes may not be equivalent.
What I'd recommend is that you just test your new approach and see if it meets your SLA requirements. It is sound, but don't make too many assumptions until you've run some tests bulk loading the data in the fashion you described above.
You should make sure to engage your Teradata Solution Architect/AE for additional help.
The CDC process has been in place for some time, and we have done a POC which shows the proposed architecture working nicely. The concern really is whether there is another alternative methodology we should consider other than using an interim file system. We have Teradata expertise on the ground here (past employees and PS). This was a sanity check to see if there is something else we should consider, as our experience in the real-time DW environment is lacking (compared to batch and mini-batch).
Then you should be good. You could possibly improve performance by eliminating the landing of the data. Below is a link you can look at which shows a TPT-only example. I personally prefer to have the file landed.
Also here is an excellent Informatica Best Practices Document that you can review.
What version of Teradata are you on?
I am always thinking of solutions :). Please bear with me if this is out of place :) Teradata has recently acquired Think Big Analytics. Think Big Analytics has rich experience integrating Hadoop, HBase, Cassandra, and MongoDB, as well as Storm for real-time complex event processing. I have evaluated quite a number of big data/Hadoop tools, and I feel Storm is good. I used to work with a few ETL-specific near-real-time features, like Ab Initio's continuous flows and SAP DS's real-time services, and they are very slow.