Export to Hadoop (hdfs) using TPT

Enthusiast

Hi All,

We have TPT 15.10 and we are trying to export rows from Teradata (using the Export operator) and push them to Hadoop (HDFS).

Below is my TPT Control File:

USING CHARACTER SET UTF8
DEFINE JOB EVEREST_COL_BASE
DESCRIPTION 'Export script for EVEREST_COL_BASE from TD Table'
(
DEFINE SCHEMA SCHEMA_EVEREST_COL_BASE FROM TABLE 'XXXXXX.EVEREST_COL_BASE';
/*
DEFINE SCHEMA SCHEMA_EVEREST_COL_BASE
(
"SYS_COL"  INTEGER,
"PRIN_COL"  INTEGER,
"AGNT_COL"  INTEGER,
"COLL_CODE_COL"  INTEGER,
"DELQ_FAMILY_COL"  CHAR(6),
"DELQ_FAMILY_DESCR_COL"  VARCHAR(50),
"DROP_DTE_COL"  ANSIDATE,
"LS_WORK_DTE_COL"  ANSIDATE,
"LS_TRAN_DTE_COL"  ANSIDATE,
"NO_ACTS_COL"  INTEGER,
"NO_MEMOS_COL"  INTEGER,
"REACTIVE_DTE_COL"  ANSIDATE,
"SUB_ACCT_NO_COL"  CHAR(32),
"START_DTE_COL"  ANSIDATE,
"WORK_DTE_COL"  ANSIDATE,
"AUDIT_PROCESS_DT"  ANSIDATE,
"AUDIT_BATCH_ID"  INTEGER,
"AUDIT_JOB_ID"  INTEGER,
"AUDIT_DML_ACTION_CD"  CHAR(2),
"AUDIT_INSERTED_TS"  TIMESTAMP(6),
"AUDIT_UPDATED_TS"  TIMESTAMP(6),
"RECORD_START_TS"  TIMESTAMP(0),
"RECORD_END_TS"  TIMESTAMP(0),
"CURRENT_FLAG"  VARCHAR(4),
"REPORTING_START_TS"  TIMESTAMP(0),
"REPORTING_END_TS"  TIMESTAMP(0)
);
*/
DEFINE OPERATOR o_ExportOper
TYPE EXPORT
SCHEMA SCHEMA_EVEREST_COL_BASE
ATTRIBUTES
(
 VARCHAR UserName               = @UserName
,VARCHAR UserPassword           = @UserPassword
,VARCHAR TdpId                  = @TdpId
,INTEGER MaxSessions            = @MaxSessions
,INTEGER MinSessions            = @MinSessions
,VARCHAR PrivateLogName         = 'Export'
,VARCHAR SpoolMode              = 'NoSpool'
,VARCHAR WorkingDatabase        = @WorkingDatabase
,VARCHAR SourceTable            = @SourceTable
,VARCHAR SelectStmt             = @SelectStmt
,VARCHAR DateForm               = 'ansiDate'
);

DEFINE OPERATOR o_FileWritter
TYPE DATACONNECTOR CONSUMER
SCHEMA SCHEMA_EVEREST_COL_BASE
ATTRIBUTES
(
 VARCHAR PrivateLogName         = 'DataConnector'
,VARCHAR HadoopFileFormat       = 'textfile'
,VARCHAR HadoopHost             = @HadoopHost
,VARCHAR HadoopJobType          = 'hdfs'
,VARCHAR HadoopUser             = @HadoopUser
,VARCHAR HadoopTargetPaths      = @HadoopTargetPaths
,VARCHAR HadoopSeparator        = @TextDelimiter
);

APPLY TO OPERATOR (o_FileWritter[@LoadInst])
SELECT * FROM OPERATOR (o_ExportOper[@ReadInst]);
)
;

Below is the console output I'm getting when the job fails:

Teradata Parallel Transporter Version 15.10.01.02 64-Bit
Job log: /opt/teradata/client/15.10/tbuild/logs/EVEREST_COL_BASE-130.out
Job id is EVEREST_COL_BASE-130, running on 
Teradata Parallel Transporter DataConnector Operator Version 15.10.01.02
o_FileWritter[1]: Instance 1 directing private log report to 'DataConnector-1'.
o_FileWritter[1]: DataConnector Consumer operator Instances: 1
Teradata Parallel Transporter Export Operator Version 15.10.01.02
o_ExportOper: private log specified: Export
o_FileWritter[1]: ECI operator ID: 'o_FileWritter-1513'
o_FileWritter[1]: Operator instance 1 processing file '/hdfs/user/hive/warehouse/col_base_aroy001c'.
o_ExportOper: connecting sessions
TPT_INFRA: TPT02639: Error: Conflicting data type for column(1) - "SYS_COL". Source column's data type (VARCHAR) Target column's data type (INTEGER).
o_ExportOper: TPT12108: Output Schema does not match data from SELECT statement
o_ExportOper: disconnecting sessions
o_ExportOper: Total processor time used = '0.039994 Second(s)'
o_ExportOper: Start : Thu Dec  8 12:23:06 2016
o_ExportOper: End   : Thu Dec  8 12:23:07 2016
o_FileWritter[1]: TPT19609 After 60 seconds, all clients have not connected; a timeout has occurred.
o_FileWritter[1]: TPT19610 A failure occurred while writing to the Hadoop client. See log TDCH-TPT_log_1513.txt in TPT logs directory for details.
o_FileWritter[1]: Total files processed: 0.
Job step MAIN_STEP terminated (status 8)
Job EVEREST_COL_BASE terminated (status 8)
Job start: Thu Dec  8 12:23:02 2016
Job end:   Thu Dec  8 12:24:06 2016

Below is my tbuild command:

tbuild -f /home/aroy001c/Everest_Extract_to_Hadoop.tpt.ctl -v /home/aroy001c/logon_ndw_extracts_file_tpt -u " WorkingDatabase='XXXXXX' , SourceTable='EVEREST_COL_BASE' , load_op=o_ExportOper , LoadInst=1 , ReadInst=1 , MaxSessions=10 , MinSessions=5 , FileName='HDFS://COL_BASE_AROY001C.txt' , LOAD_DTS='2016-12-08-103021' , Format='DELIMITED' , TextDelimiter='|' , SkipRows=0 , SelectStmt='SELECT "SYS_COL","PRIN_COL","AGNT_COL","COLL_CODE_COL","DELQ_FAMILY_COL","DELQ_FAMILY_DESCR_COL","DROP_DTE_COL" ,"LS_WORK_DTE_COL" ,"LS_TRAN_DTE_COL" ,"NO_ACTS_COL","NO_MEMOS_COL","REACTIVE_DTE_COL" ,"SUB_ACCT_NO_COL","START_DTE_COL" ,"WORK_DTE_COL" ,"AUDIT_PROCESS_DT" ,"AUDIT_BATCH_ID","AUDIT_JOB_ID","AUDIT_DML_ACTION_CD","AUDIT_INSERTED_TS","AUDIT_UPDATED_TS","RECORD_START_TS","RECORD_END_TS","CURRENT_FLAG","REPORTING_START_TS","REPORTING_END_TS" FROM NDW_EXTRACT_VIEWS.EVEREST_COL_BASE; ' HadoopHost= '<Name_Node_Host>:<Name_Node_Port>' ,HadoopUser='aroy001c' ,HadoopTargetPaths='/hdfs/user/hive/warehouse/col_base_aroy001c', HadoopFileFormat='textfile', HadoopJobType='hdfs' " EVEREST_COL_BASE -e UTF8
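Since the -u string is getting unwieldy, one option would be to move those assignments into a job variables file and pass it with -v (untested sketch; the file name is hypothetical, and SelectStmt plus the logon variables from logon_ndw_extracts_file_tpt would move into the same file):

 WorkingDatabase   = 'XXXXXX'
,SourceTable       = 'EVEREST_COL_BASE'
,MaxSessions       = 10
,MinSessions       = 5
,LoadInst          = 1
,ReadInst          = 1
,TextDelimiter     = '|'
,HadoopHost        = '<Name_Node_Host>:<Name_Node_Port>'
,HadoopUser        = 'aroy001c'
,HadoopTargetPaths = '/hdfs/user/hive/warehouse/col_base_aroy001c'

tbuild -f /home/aroy001c/Everest_Extract_to_Hadoop.tpt.ctl -v /home/aroy001c/everest.jobvars EVEREST_COL_BASE -e UTF8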

Below is the TDCH TPT Log:

WARNING: Use "yarn jar" to launch YARN applications.
16/12/08 10:44:18 INFO tool.ConnectorImportTool: ConnectorImportTool starts at 1481193858654
16/12/08 10:45:12 INFO common.ConnectorPlugin: load plugins in file:/tmp/hadoop-unjar9068557967474246190/teradata.connector.plugins.xml
16/12/08 10:45:14 INFO processor.IDataStreamInputProcessor: input preprocessor com.teradata.connector.idatastream.processor.IDataStreamInputProcessor starts at:  1481193914184
16/12/08 10:45:26 INFO processor.IDataStreamInputProcessor: the teradata connector for hadoop version is: 1.4.1
16/12/08 10:45:27 INFO processor.IDataStreamInputProcessor: the number of mappers are 2
16/12/08 10:45:27 INFO processor.IDataStreamInputProcessor: input preprocessor com.teradata.connector.idatastream.processor.IDataStreamInputProcessor ends at:  1481193927423
16/12/08 10:45:27 INFO processor.IDataStreamInputProcessor: the total elapsed time of input preprocessor com.teradata.connector.idatastream.processor.IDataStreamInputProcessor is: 13s
16/12/08 10:45:58 INFO impl.TimelineClientImpl: Timeline service address: http://<SomeHostNotSure>:8188/ws/v1/timeline/
16/12/08 10:47:01 WARN ipc.Client: Failed to connect to server: <SomeHostNotSure>:8032: retries get failed due to exceeded maximum allowed retries number: 0
java.net.ConnectException: Connection refused
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:649)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:744)
        at org.apache.hadoop.ipc.Client$Connection.access$3000(Client.java:397)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
        at org.apache.hadoop.ipc.Client.call(Client.java:1431)
        at org.apache.hadoop.ipc.Client.call(Client.java:1392)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy19.getNewApplication(Unknown Source)
        at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getNewApplication(ApplicationClientProtocolPBClientImpl.java:221)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
        at com.sun.proxy.$Proxy20.getNewApplication(Unknown Source)
        at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getNewApplication(YarnClientImpl.java:220)
        at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.createApplication(YarnClientImpl.java:228)
        at org.apache.hadoop.mapred.ResourceMgrDelegate.getNewJobID(ResourceMgrDelegate.java:188)
        at org.apache.hadoop.mapred.YARNRunner.getNewJobID(YARNRunner.java:231)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:153)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
        at com.teradata.connector.common.tool.ConnectorJobRunner.runJob(ConnectorJobRunner.java:134)
        at com.teradata.connector.common.tool.ConnectorImportTool.run(ConnectorImportTool.java:67)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
        at com.teradata.connector.common.tool.ConnectorImportTool.main(ConnectorImportTool.java:745)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
16/12/08 10:47:04 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
16/12/08 10:47:40 INFO impl.TimelineClientImpl: Timeline service address: http://<SomeHostNotSure>:8188/ws/v1/timeline/
        at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:649)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:744)
        at org.apache.hadoop.ipc.Client$Connection.access$3000(Client.java:397)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1521)
        at org.apache.hadoop.ipc.Client.call(Client.java:1431)
        at org.apache.hadoop.ipc.Client.call(Client.java:1392)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy19.getClusterMetrics(Unknown Source)
        at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getClusterMetrics(ApplicationClientProtocolPBClientImpl.java:206)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
        at com.sun.proxy.$Proxy20.getClusterMetrics(Unknown Source)
        at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getYarnClusterMetrics(YarnClientImpl.java:501)
        at org.apache.hadoop.mapred.ResourceMgrDelegate.getClusterMetrics(ResourceMgrDelegate.java:152)
        at org.apache.hadoop.mapred.YARNRunner.getClusterMetrics(YARNRunner.java:179)
        at org.apache.hadoop.mapreduce.Cluster.getClusterStatus(Cluster.java:247)
        at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:804)
        at org.apache.hadoop.mapred.JobClient$4.run(JobClient.java:802)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.mapred.JobClient.getClusterStatus(JobClient.java:802)
        at com.teradata.connector.common.utils.HadoopConfigurationUtils.getAllActiveHosts(HadoopConfigurationUtils.java:251)
        at com.teradata.connector.idatastream.IDataStreamInputFormat.getSplits(IDataStreamInputFormat.java:347)
        at com.teradata.connector.common.ConnectorCombineInputFormat.getNoneFileSplit(ConnectorCombineInputFormat.java:68)
        at com.teradata.connector.common.ConnectorCombineInputFormat.getSplits(ConnectorCombineInputFormat.java:47)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
        at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
        at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
        at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
        at com.teradata.connector.common.tool.ConnectorJobRunner.runJob(ConnectorJobRunner.java:134)
        at com.teradata.connector.common.tool.ConnectorImportTool.run(ConnectorImportTool.java:67)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
        at com.teradata.connector.common.tool.ConnectorImportTool.main(ConnectorImportTool.java:745)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
16/12/08 10:47:41 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
16/12/08 10:47:44 WARN mapred.ResourceMgrDelegate: getBlacklistedTrackers - Not implemented yet
16/12/08 10:47:45 INFO mapreduce.JobSubmitter: number of splits:2
16/12/08 10:47:51 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1480200761664_60132
16/12/08 10:48:06 INFO impl.YarnClientImpl: Submitted application application_1480200761664_60132
16/12/08 10:48:09 INFO mapreduce.Job: The url to track the job: http://<SomeHostNotSure>:8088/proxy/application_1480200761664_60132/
16/12/08 10:48:09 INFO mapreduce.Job: Running job: job_1480200761664_60132
16/12/08 10:48:09 INFO mapred.ClientServiceDelegate: Application state is completed. FinalApplicationStatus=FAILED. Redirecting to job history server
16/12/08 10:48:10 INFO mapreduce.Job: Job job_1480200761664_60132 running in uber mode : false
16/12/08 10:48:10 INFO mapreduce.Job:  map 0% reduce 100%
16/12/08 10:48:11 INFO mapreduce.Job: Job job_1480200761664_60132 failed with state FAILED due to:
16/12/08 10:48:12 INFO tool.ConnectorImportTool: ConnectorImportTool ends at 1481194092071
16/12/08 10:48:12 INFO tool.ConnectorImportTool: ConnectorImportTool time is 233s
16/12/08 10:48:12 INFO tool.ConnectorImportTool: job completed with exit code 1

Now we are facing two problems here:

1) When we use DEFINE SCHEMA FROM TABLE, we get a schema mismatch error. We are trying to generate the schema definition from a view; the underlying table has INTEGER as the data type, but TPT generates the column as VARCHAR. When we use an explicit column list, we do not get the schema mismatch error.

2) Even though we set HadoopHost = <Name_Node:Port>, the job tries to connect to some other server (which is not the NameNode) on port 8032 and gets a connection refused error. Why is it trying to connect on port 8032?
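For what it's worth, the stack trace above shows the failing call is YarnClientImpl.getNewApplication via ResourceMgrDelegate, and 8032 is the default yarn.resourcemanager.address port, so TDCH appears to be submitting a MapReduce job to the YARN ResourceManager rather than connecting to the NameNode directly. One way to check which ResourceManager address the Hadoop client configuration on the TPT box picks up (assuming the hadoop client tools are installed there):

hdfs getconf -confKey yarn.resourcemanager.address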

 

Any help regarding this would be greatly appreciated.

 

Thanks & Regards,

Arpan.

10 REPLIES
Enthusiast

Re: Export to Hadoop (hdfs) using TPT

Can anyone please help us here?

 

Thanks & Regards,

Arpan.

Enthusiast

Re: Export to Hadoop (hdfs) using TPT

Is there any sample script to load data into HDFS using the TDCH TPT interface? I checked PTS00031, PTS00030 and PTS00032, but none of them writes to HDFS using TDCH TPT.
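In the meantime, stripping my script above down to just the Hadoop-relevant pieces, I believe the minimal shape is something like this (untested sketch; the schema and operator names are placeholders):

USING CHARACTER SET UTF8
DEFINE JOB EXPORT_TO_HDFS
(
  DEFINE SCHEMA S FROM TABLE @SourceTable;

  DEFINE OPERATOR o_Export
  TYPE EXPORT
  SCHEMA S
  ATTRIBUTES
  (
   VARCHAR TdpId        = @TdpId
  ,VARCHAR UserName     = @UserName
  ,VARCHAR UserPassword = @UserPassword
  ,VARCHAR SelectStmt   = @SelectStmt
  );

  DEFINE OPERATOR o_HdfsWriter
  TYPE DATACONNECTOR CONSUMER
  SCHEMA S
  ATTRIBUTES
  (
   VARCHAR HadoopHost        = @HadoopHost        /* <name node host>:<port> */
  ,VARCHAR HadoopUser        = @HadoopUser
  ,VARCHAR HadoopJobType     = 'hdfs'
  ,VARCHAR HadoopFileFormat  = 'textfile'
  ,VARCHAR HadoopTargetPaths = @HadoopTargetPaths /* target HDFS directory */
  ,VARCHAR HadoopSeparator   = '|'
  );

  APPLY TO OPERATOR (o_HdfsWriter)
  SELECT * FROM OPERATOR (o_Export);
);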

 

Thanks & Regards,

Arpan.

Teradata Employee

Re: Export to Hadoop (hdfs) using TPT

For #1, are there different databases with a table named "EVEREST_COL_BASE"?

The DEFINE SCHEMA statement references one potential table by that name.

The SourceTable attribute references a table by that name.

And the SelectStmt specifically references the object in the "NDW_EXTRACT_VIEWS" database.

 

 

-- SteveF
Enthusiast

Re: Export to Hadoop (hdfs) using TPT

Hi Steve,

The EVEREST_COL_BASE view only exists in the NDW_EXTRACT_VIEWS database. But whenever I run the tbuild command, I get the "Output Schema does not match data from SELECT statement" error.

 

Thanks & Regards,

Arpan.

Enthusiast

Re: Export to Hadoop (hdfs) using TPT

Hi Steve,

It's my observation (which may not be correct) that whenever I add HadoopHost, DEFINE SCHEMA FROM TABLE stops working, i.e. I get the schema mismatch error:

TPT_INFRA: TPT02639: Error: Conflicting data type for column(1) - "SYS_COL". Source column's data type (VARCHAR) Target column's data type (INTEGER).

But DEFINE SCHEMA FROM TABLE works perfectly when I write to a local file, i.e. when I'm not pushing the data to Hadoop.

Is this true?

 

Thanks & Regards,

Arpan.

Teradata Employee

Re: Export to Hadoop (hdfs) using TPT

This issue has been fixed in 15.10.01.04.

Please upgrade to at least this version.

 

 

-- SteveF
Enthusiast

Re: Export to Hadoop (hdfs) using TPT

Hi Steve,

After removing the Format and TextDelimiter variables from the tbuild command, the automatic schema generation (DEFINE SCHEMA FROM TABLE) issue is resolved; presumably Format='Delimited' was making TPT derive an all-VARCHAR schema, which then conflicted with the table's INTEGER columns.

But I'm still getting the error below:

o_FileWritter[1]: TPT19609 After 60 seconds, all clients have not connected; a timeout has occurred.
o_FileWritter[1]: TPT19434 pmGetPos failed. General failure (34): 'Unknown Access Module failure'
o_FileWritter[1]: TPT19307 Fatal error checkpointing data.
o_FileWritter[1]: TPT19015 TPT Exit code set to 12.
TPT_INFRA: TPT02258: Error: Operator checkpointing error, status = Multi Phase Error
TPT_INFRA: TPT03720: Error: Checkpoint command failed with 23

I've just installed TDCH 1.4, but haven't done any configuration.

1) Do we need to do any configuration after installing TDCH? Is there any document available for this?

If required, I can share the TDCH log, the TPT script, the tbuild command and the TPT job log.
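In parallel, I'm thinking of testing TDCH standalone to take TPT out of the picture, along these lines (option names are from the TDCH README and the jar path is whatever the install created, so treat this as an untested sketch):

hadoop jar /usr/lib/tdch/1.4/lib/teradata-connector-1.4.1.jar \
    com.teradata.connector.common.tool.ConnectorImportTool \
    -url jdbc:teradata://<TdpId>/database=NDW_EXTRACT_VIEWS \
    -username aroy001c -password '<password>' \
    -jobtype hdfs -fileformat textfile -separator '|' \
    -sourcetable EVEREST_COL_BASE \
    -targetpaths /hdfs/user/hive/warehouse/col_base_aroy001c \
    -nummappers 2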

 

Thanks & Regards,

Arpan.

Teradata Employee

Re: Export to Hadoop (hdfs) using TPT

Are you trying to read from TDCH and write to a flat file?

Please explain the intent of your job.

As to configuration, you would need to look at the TDCH installation instructions.

And then also look at the TPT manuals, which describe all of the operator attributes needed for a TPT job to integrate with TDCH, as well as any environment variables that might need to be set up.
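One concrete clue: the WARNING at the top of your TDCH log ('Use "yarn jar" to launch YARN applications.') is what the hadoop wrapper script prints when something runs "hadoop jar", so TPT is launching the connector through the hadoop command line. At a minimum, then, the Hadoop client and its cluster configuration files must be reachable from the environment the TPT job runs in; something along these lines (paths are illustrative, not from your system):

# illustrative only; adjust to wherever your distribution puts the Hadoop client
export HADOOP_HOME=/usr/hdp/current/hadoop-client
export PATH=$HADOOP_HOME/bin:$PATH   # 'hadoop' must resolve for TPT to launch TDCH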

 

-- SteveF
Enthusiast

Re: Export to Hadoop (hdfs) using TPT

Hi Steve,

Thanks for the reply. 

I'm trying to read data from Teradata and write to HDFS. Using the HDFS API, we were able to write the data to HDFS successfully. Now we are trying to use the TDCH TPT interface so that we can write to Hive tables as well.
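If I'm reading the TPT reference correctly, once the HDFS path works, pointing the same consumer at a Hive table should mainly be an attribute change, roughly like this (untested; the database and table names are illustrative):

DEFINE OPERATOR o_HiveWriter
TYPE DATACONNECTOR CONSUMER
SCHEMA SCHEMA_EVEREST_COL_BASE
ATTRIBUTES
(
 VARCHAR HadoopHost           = @HadoopHost
,VARCHAR HadoopUser           = @HadoopUser
,VARCHAR HadoopJobType        = 'hive'               /* 'hive' instead of 'hdfs' */
,VARCHAR HadoopFileFormat     = 'textfile'
,VARCHAR HadoopTargetDatabase = 'default'            /* illustrative Hive database */
,VARCHAR HadoopTargetTable    = 'col_base_aroy001c'  /* illustrative Hive table */
,VARCHAR HadoopSeparator      = '|'
);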

Regarding the installation of TDCH, I can't find an instruction manual like the one we have for TPT.

Do we have any document describing the steps to install and configure TDCH?

 

Thanks & Regards,

Arpan.