How do SQL and SQL-MR queries work internally in Teradata Aster?

Enthusiast


Hi,

I've gone through the documentation but didn't get a clear picture on the questions below.

1. How do SQL or SQL-MR queries work internally in the Aster database? Do they generate MapReduce jobs when queries are fired, the way Hive and Pig do in Hadoop?

      For example,

      The customer table has 10 records distributed by hash(customer id), and say there are 3 workers, so the data is split among them as

       worker1 -  3 records, worker2 - 3 records, worker3- 4 records.

    Since the data is distributed across three different machines (nodes), how does a simple SQL query like select * from customer, or any SQL-MR query, fetch the data? Does it send the same query to all three nodes, or does it generate MapReduce jobs that fetch the rows?

2. Once the customer table is created with data distributed by hash(customer id), can the distribution be changed later to a different column, say age instead of customer id? (A rough DDL sketch of such a table is shown after these questions.)

3. In Hadoop, can jobs be run only one at a time, so that you can't run multiple jobs simultaneously and the others get queued? Does the same thing happen in the Aster database, or can we run multiple jobs on the cluster?
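
For reference, a minimal sketch of how such a distributed table can be declared in Aster (the name column is made up for illustration; customer id and age come from the questions above):

CREATE TABLE customer (
    customer_id INTEGER,   -- distribution column from question 1
    name        VARCHAR,
    age         INTEGER    -- candidate distribution column from question 2
)
DISTRIBUTE BY HASH(customer_id);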

Please help me to understand these.

Pradi

5 REPLIES
Enthusiast

Re: How do SQL and SQL-MR queries work internally in Teradata Aster?

Let me try to share my limited knowledge. I am not an expert in Aster.

You can access data stored in HDFS using Aster SQL-H. SQL-H interacts with HCatalog, a sort of metadata library. The format stored in HDFS is different, but SQL-H generates the MR program(s) so developers do not need to write them, which helps a lot. SQL-MR is a framework developed by Teradata Aster, so SQL-MR functions written by developers can be called via SQL through the Aster database, a discovery platform.

Hadoop and Teradata Aster can be connected by the two-way Aster-Hadoop Adaptor, and the Aster-Teradata adapter connects the Discovery Platform's Aster database and the Teradata database.

Data, in turn, can be pulled from HDFS and joined or operated on together with data from the Discovery Platform's Aster database. The interoperability of Hadoop, Aster, and the Teradata IDW is, I feel, one strong point, since everything stays in-house.
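
As a rough sketch of what an SQL-H access looks like (based on the load_from_hcatalog function; the server, database, and table names are placeholders, some arguments are omitted, and the exact argument names should be checked against the SQL-H documentation for your version):

SELECT *
FROM load_from_hcatalog(
    ON mr_driver
    SERVER('hcatalog-host')     -- placeholder: HCatalog/Hive metastore host
    DBNAME('default')           -- placeholder: Hive database
    TABLENAME('web_clicks')     -- placeholder: Hive table
);

The result can then be joined to regular Aster tables like any other relation.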



The Hadoop part will remain there as it is, with the catalog, NameNode, MR framework, JobTracker, database connector, TaskTracker, DataNode, database, etc.

So I feel the data still sits the HDFS way inside Hadoop, with 3-way replication, YARN, and so on.

Also, a few tools like Hive provide an interface that allows SQL to be written; the SQL in turn gets converted to MR, which is known to be slower than an RDBMS :). Newer tools like Impala and HAWQ run SQL on Hadoop with their own engines rather than MR.

As far as I know, the Oozie scheduler can run many jobs simultaneously. I also think Teradata Unity Ecosystem Manager can do it.

Enthusiast

Re: How do SQL and SQL-MR queries work internally in Teradata Aster?

Thanks for the reply, Raja.

I learned some of this by going through the documentation, but it didn't answer my questions: does SQL-MR internally generate a MapReduce job or not? And for plain SQL, how does it fetch distributed data from the different nodes?

Enthusiast

Re: How do SQL and SQL-MR queries work internally in Teradata Aster?

An SQL-MapReduce job is automatically started when you issue a query that includes an SQL-MapReduce function. A GUI is available to monitor the job, too.

In the Aster Developer Guide 6.00, the steps are given as below (page numbers highlighted too):

Write an SQL-MapReduce Function in Java -- page 26 (SDK, required interfaces, framework details, functions, ...)

or

Write an SQL-MapReduce Function in C -- page 35 (SDK, required header files, ...)

----------------------------------------------------------------------------------------------------------------------------------

Build and Package the SQL-MapReduce Function -- page 29 (compiling and packaging)

----------------------------------------------------------------------------------------------------------------------------------

Install and Use a Sample Function --page 39

Procedure:

.....

\install counttokens.jar textanalysis/counttokens.jar

\install tokenize.jar textanalysis/tokenize.jar

Run the Function:

...

SELECT token, sum(count)
FROM textanalysis.tokenize(ON documents)
GROUP BY token
ORDER BY sum(count);

...

----------------------------------------------------------------------------------------------------------------------------------

Manage SQL-MapReduce Execution -- page 42

A GUI is there to monitor the job.

----------------------------------------------------------------------------------------------------------------------------------

Finally, on page 42, "Start an SQL-MapReduce Job":

An SQL-MapReduce job is automatically started when you issue a query that includes an SQL-MapReduce function.

----------------------------------------------------------------------------------------------------------------------------------

Now, from the Hadoop point of view:

There are the input reader, map function, partition function, compare function, reduce function, and output writer.

When you run a job, you double-check whether the NameNode, DataNode, JobTracker, TaskTracker, and SecondaryNameNode are running.

Then you write the MR job, mostly in Java: say, a mapper and a reducer.

Next, compile and create the jar file.

Next, run the MR job, making sure any input parameters are supplied. There are lots of schedulers here.

Let us hear from other experts too.

Teradata Employee

Re: How do SQL and SQL-MR queries work internally in Teradata Aster?

Let me try...

1. How do SQL or SQL-MR queries work internally in the Aster database? Do they generate MapReduce jobs when queries are fired, the way Hive and Pig do in Hadoop?

      For example,

      The customer table has 10 records distributed by hash(customer id), and say there are 3 workers, so the data is split among them as

       worker1 -  3 records, worker2 - 3 records, worker3- 4 records.

    Since the data is distributed across three different machines (nodes), how does a simple SQL query like select * from customer, or any SQL-MR query, fetch the data? Does it send the same query to all three nodes, or does it generate MapReduce jobs that fetch the rows?

MapReduce jobs are only used when you invoke an SQL-MR function (i.e., anything with an ON clause is an SQL-MR function). When you run regular SQL such as select *, it is not like Hadoop, where a JVM is spun up to execute the job; it is handled in a fashion similar to regular Postgres. With regular SQL, the queen sends the commands to each individual worker, the workers compile their answers based on the data on their nodes and return the result sets to the queen, and the queen combines them into a single dataset. When it comes to things like joins, this can become complicated because data has to be shuffled between nodes; the queen facilitates this process, as the workers are blind to what exists on other workers.
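
To make the distinction concrete, here is the difference in query shape, reusing the sample tokenize function quoted from the Developer Guide earlier in this thread (documents is its sample input table):

-- Plain SQL: the queen sends this to every worker, each worker scans its
-- own hash partition of customer, and the queen merges the result sets.
SELECT * FROM customer;

-- SQL-MR: the ON clause marks an SQL-MR function invocation, which the
-- queen runs as a MapReduce-style job across the workers.
SELECT token, sum(count)
FROM textanalysis.tokenize(ON documents)
GROUP BY token
ORDER BY sum(count);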



2. Once the customer table is created with data distributed by hash(customer id), can the distribution be changed later to a different column, say age instead of customer id?

I'm 99 percent sure you can't change the distribution key; you would have to perform a CTAS (CREATE TABLE ... AS SELECT) using the new key.
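
A minimal sketch of that CTAS approach, assuming the customer table and age column from the question (check the exact DISTRIBUTE BY placement against the Aster SQL reference):

-- Rebuild the table with the new distribution key, then swap it in.
CREATE TABLE customer_by_age
DISTRIBUTE BY HASH(age)
AS SELECT * FROM customer;

DROP TABLE customer;
ALTER TABLE customer_by_age RENAME TO customer;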



3. In Hadoop, can jobs be run only one at a time, so that you can't run multiple jobs simultaneously and the others get queued? Does the same thing happen in the Aster database, or can we run multiple jobs on the cluster?

 

You can run multiple jobs on Aster and they will execute in parallel. The caveat is that performance may be impacted, since the jobs share the cluster's resources.

HTH... let me know if you need clarification

Enthusiast

Re: How do SQL and SQL-MR queries work internally in Teradata Aster?

Thanks a lot, ewan.

So in the case of a simple SQL select * statement, the queen sends commands to each worker, and each worker answers from the data it holds.

If possible, can you please explain in a bit more detail how SQL-MR queries work? For example, how many maps are generated, etc.?

How does the queen node know whether a worker node is alive or dead? Do the worker nodes send heartbeats to the queen regularly, like in Hadoop?

Once again, thanks a lot for the reply.

Pradi