Conquering Big Data with Teradata Data Mover parallelism

Teradata Employee

One of the big advantages of Teradata Data Mover (DM) is built-in parallelism. The underlying utilities that Teradata DM uses, such as Teradata Parallel Transporter API, do offer ways to do parallel work, but they are either limited to a single client machine or require you to build your own code framework. Teradata DM takes care of all the hard work and puts the world of multiple client machine parallelism at your fingertips.

But how do you make the best use of this parallelism for your big jobs? There’s a lot of power under the hood, but it might not be obvious how to put that power to work. Here we’ll talk about which parallelism features are available and provide tips on how to use them to get your big data moving faster.

It All Starts with the Agents

Teradata DM is designed for scaling. When you ask Teradata DM to move your data, DM breaks the work down into tasks and passes these out to the DM Agents to perform. Each DM Agent can execute multiple tasks at once, but you can also add more DM Agents, which work alongside each other to perform the work.

For jobs moving large amounts of data, the key feature is Teradata DM’s ability to spread the work for a single task across multiple DM Agents. If, for example, you need to move a very large table, rather than giving a single DM Agent the task of moving that data, Teradata DM can instead use multiple DM Agents together to perform the work. This means you’ll have multiple processes across multiple client machines all pulling together to get that data moving.

So key number one is having multiple DM Agents available to do your work.

Know Thy Parallelism Attributes

When creating your big Teradata DM jobs, you need to be aware of the parallelism attributes and what they can do for you. The key attributes are:

  • Data streams
  • Max agents per task
  • Source/Target sessions

Data streams: In Teradata DM terminology, a data stream is the pairing of the utility job that pulls data from your source database system and the utility job that pushes that data to your target database system. When Teradata ARC is used, a data stream is the pairing of two processes on the same client machine, the ARCHIVE process and the COPY process. When Teradata Parallel Transporter API is used, a data stream is the pairing of the Export thread and the Load/Update/Stream thread, both within the same process on the same client machine.

           

A data stream is used when executing the tasks responsible for moving a database’s or a table’s data. Specifying multiple data streams means multiple data streams will be created and work together when executing the task.

By default, all Teradata DM jobs use a single data stream. For big data jobs, you are going to want to use more than one data stream to make use of Teradata DM’s parallelism. Each data stream is confined to working on a single DM Agent. Having multiple data streams allows the work to be spread across multiple DM Agents by running different data streams on different DM Agents.

Max Agents Per Task: After telling Teradata DM to use multiple data streams, you can spread those data streams across multiple DM Agents by using the max agents per task attribute. The max agents per task attribute tells Teradata DM the maximum number of DM Agents that it can use together when executing the tasks for your job.

It’s easiest to explain by example here so let’s walk through a quick case. Let’s say we are moving a big table and we’ve chosen to use 8 data streams. If we don’t specify a max agents per task value, then the default of 1 will be used. This means when DM gets to the task that does the work for moving the table’s data, DM will assign that task to a single DM Agent and all 8 data streams will run on that same agent.

            

Figure: Max agents per task = 1

This is not necessarily a terrible thing: the managed servers on which the DM Agents run have multiple CPU cores, so we would still see a performance improvement over using a single data stream. However, if more DM Agents are available, we can maximize our performance by getting those other DM Agents involved. So let’s instead specify a max agents per task value of 2. If we have at least 2 DM Agents, then DM will assign two of them to work on the task. Data streams are currently distributed evenly, so each DM Agent will run 4 data streams for a total of 8. Instead of having 8 data streams all competing for the same resources on 1 DM Agent, we have half as many data streams running on each DM Agent, resulting in better performance.

             

Figure: Max agents per task = 2
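The walkthrough above can be sketched as a small calculation. This is only an illustration of the arithmetic, not Teradata DM code: the function name and the `available_agents` parameter are hypothetical, and the handling of a stream count that does not divide evenly is an assumption, since the article only states that streams are currently distributed evenly.

```python
def distribute_streams(data_streams, max_agents_per_task=1, available_agents=1):
    """Return how many data streams each assigned DM Agent would run."""
    if data_streams < 1 or max_agents_per_task < 1 or available_agents < 1:
        raise ValueError("all arguments must be positive integers")
    # DM cannot use more agents than exist or than the attribute allows.
    # Assumption: it also never assigns more agents than there are streams.
    agents_used = min(max_agents_per_task, available_agents, data_streams)
    base, extra = divmod(data_streams, agents_used)
    # Assumption: a remainder is handed out one stream at a time to the first agents.
    return [base + 1 if i < extra else base for i in range(agents_used)]


print(distribute_streams(8))        # default of 1 agent per task -> [8]
print(distribute_streams(8, 2, 2))  # 2 agents allowed and available -> [4, 4]
```

Note that raising max agents per task has no effect in this sketch unless that many DM Agents are actually available, which matches the "if we have at least 2 DM Agents" condition above.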

Source/Target Sessions: Sessions are a long-established method for parallelizing the work done over the network between the client and the Teradata database. I’ll assume you are already familiar with the concept of a session, so I’ll skip to how they apply to Teradata DM’s parallelism.

Each Teradata DM data stream connects two sets of sessions: the source sessions between the client and the source Teradata database, and the target sessions between the client and the target Teradata database. The current default number of sessions is 1, which really doesn’t make the best use of the data stream’s ability. It’s the equivalent of hiring a manager to oversee only 1 employee. The manager can obviously handle more employees than that. So setting a higher number of sessions is always recommended. No Teradata DM job should use the default of 1 unless you have no expectations for performance.

When it comes to multiple data streams, the number of sessions you specify is handled differently depending on which underlying utility is used. If Teradata ARC is used, then each data stream connects the number of sessions you specify. So if you specified 2 source sessions and 2 data streams, a total of 4 sessions will be connected to the source Teradata database.

            

If Teradata Parallel Transporter API is used, then the number of sessions you specify is divided amongst the data streams. So if you specified 2 source sessions and 2 data streams, each data stream will connect a single session.

              

Thus it is important to pay attention to which utility is being used when specifying the number of sessions.
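To make the difference concrete, here is a minimal sketch of the session arithmetic for one side (source or target) of a job. The function name and the "ARC" / "TPTAPI" labels are hypothetical, chosen for illustration only. How Teradata Parallel Transporter API handles a session count that does not divide evenly among the data streams is not covered in this article, so the remainder handling below is an assumption.

```python
def session_plan(utility, sessions, data_streams):
    """Return (sessions per data stream, total sessions connected)."""
    if sessions < 1 or data_streams < 1:
        raise ValueError("sessions and data_streams must be positive integers")
    if utility == "ARC":
        # Each data stream connects the full number of sessions specified.
        per_stream = [sessions] * data_streams
    elif utility == "TPTAPI":
        # The specified sessions are divided among the data streams.
        base, extra = divmod(sessions, data_streams)
        # Assumption: leftover sessions go one each to the first streams.
        per_stream = [base + 1 if i < extra else base for i in range(data_streams)]
    else:
        raise ValueError("utility must be 'ARC' or 'TPTAPI'")
    return per_stream, sum(per_stream)


print(session_plan("ARC", 2, 2))     # ([2, 2], 4)
print(session_plan("TPTAPI", 2, 2))  # ([1, 1], 2)
```

With the same two inputs, the ARC case connects twice as many sessions to the database as the TPT API case, which is why the session value should be chosen with the utility in mind.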

Where to Find These Attributes?

Teradata DM Viewpoint Portlet: When creating a Teradata DM job using the DM Viewpoint Portlet, you can find these parallelism attributes by doing the following:

  1. After clicking on the “New Job” button, select the “Job Settings” tab.
  2. Click on the “Advanced Settings” button near the bottom of the portlet.
  3. Here you can provide values for the data streams, source sessions, target sessions, and max agents per task attributes.

Teradata DM Commandline Interface: When creating a Teradata DM job using the DM Commandline Interface, you can specify these parallelism attributes by adding their corresponding XML tags to the XML for the job:

Data streams: <data_streams>8</data_streams>
Max agents per task: <max_agents_per_task>4</max_agents_per_task>
Source Sessions: <source_sessions>10</source_sessions>
Target Sessions: <target_sessions>20</target_sessions>                              
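If you generate job XML from a script, the four tags above can be produced programmatically. The sketch below is an assumption-laden illustration: the helper name `parallelism_tags` is hypothetical, it emits only the four elements listed above, and it does not attempt to build the rest of the job XML or guess where in the job definition these elements belong.

```python
import xml.etree.ElementTree as ET


def parallelism_tags(data_streams, max_agents_per_task, source_sessions, target_sessions):
    """Return the four parallelism elements as XML text, one per line."""
    settings = {
        "data_streams": data_streams,
        "max_agents_per_task": max_agents_per_task,
        "source_sessions": source_sessions,
        "target_sessions": target_sessions,
    }
    lines = []
    for tag, value in settings.items():
        # Build each element separately so no wrapper element is invented.
        element = ET.Element(tag)
        element.text = str(value)
        lines.append(ET.tostring(element, encoding="unicode"))
    return "\n".join(lines)


# Prints the same four tags shown above, ready to paste into the job XML.
print(parallelism_tags(8, 4, 10, 20))
```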
1 REPLY
Teradata Employee

Re: Conquering Big Data with Teradata Data Mover parallelism

Thanks Kevin for the article. It's a valuable article on the features which are usually neglected. 

-Smarak