tpt script to populate edw from kerberized hadoop instance


Evening folks,

I am writing a TPT script that will populate an EDW table from an HDFS file. The script is run from the edge node, and I have verified that the environment variables are properly set per the TPT Reference documentation. I have also looked at the PTS00029 example, and I seem to be doing things according to that sample. The problem I am having is that I am getting the following error messages:

TPT19609 After 60 seconds, all clients have not connected; a timeout has occurred.

TPT19434 pmRead failed, General failure (34): 'Unknown Access Module failure'

TPT19305 Fatal error reading data.

TPT19015 TPT Exit code set to 12.

Then the script exits.

 

If I check the TDCH-TPT_log_nnnnn.txt file, I see that a job has been successfully submitted to the correct name node, and there are no errors in the log file; in fact, the last few lines indicate that the MapReduce job is running properly. When I check the yarn-yarn-resourcemanager-NN.log file on the name node master (we have an HA setup, by the way), I see the job progressing through the file retrieval with no errors (although it completes well after the TPT script has bailed out, as the file is about 1.3 GB in size).

 

Since the job is being submitted properly and the name node is processing it, I don't immediately see what the error is. It seems like it is simply a timing issue, but I am not sure how to proceed, either toward a resolution or with further debugging.

 

Any help is, as always, appreciated,

tom

4 REPLIES

Re: tpt script to populate edw from kerberized hadoop instance

Adding some additional info/context to this issue; I believe I have also come up with a partial resolution. We are running TPT 15.10 on a Hadoop HDP 2.3.4 system, and my TPT script is as follows:

 

define job LOAD_hdfsTest
(
define schema schema_hdfsTest
(
in_int_value VARCHAR(16),
in_string_value VARCHAR(16),
in_int_value2 VARCHAR(16)
);

define operator op_hdfsTest
type dataconnector producer
schema schema_hdfsTest
attributes
(
VARCHAR HadoopHost = 'hostnameTPTScriptRunsOn',
VARCHAR HadoopJobType = 'hdfs',
VARCHAR HadoopSourcePaths = '/user/test/hdfsTestData.txt',
VARCHAR HadoopSeparator = '|'
);

define operator od_hdfsTest
TYPE DDL
ATTRIBUTES
(
VARCHAR PrivateLogName = '',
VARCHAR LogonMech = '',
VARCHAR TdpId = 'EDWname',
VARCHAR UserName = 'dbName',
VARCHAR UserPassword = 'dbPassword',
VARCHAR ErrorList = ['3807','3803']
);

define operator ol_hdfsTest
TYPE LOAD
SCHEMA *
ATTRIBUTES
(
VARCHAR LogonMech = '',
VARCHAR TdpId = 'EDWname',
VARCHAR UserName = 'dbName',
VARCHAR UserPassword = 'dbPassword',
VARCHAR LogTable = 'hdfsTest_LG',
VARCHAR ErrorTable1 = 'hdfsTest_ET',
VARCHAR ErrorTable2 = 'hdfsTest_UV',
VARCHAR TargetTable = 'hdfsTest'
);

step stSetupTables
(
APPLY
('DROP TABLE hdfsTest_LG'),
('DROP TABLE hdfsTest_ET'),
('DROP TABLE hdfsTest_UV'),
('DROP TABLE hdfsTest'),
('CREATE MULTISET TABLE hdfsTest,
NO FALLBACK,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
(
val1 VARCHAR(16) CHARACTER SET LATIN NOT CASESPECIFIC,
val2 VARCHAR(16) CHARACTER SET LATIN NOT CASESPECIFIC,
val3 VARCHAR(16) CHARACTER SET LATIN NOT CASESPECIFIC
) NO PRIMARY INDEX;')
to operator (od_hdfsTest);
);

step stLoadFile
(
APPLY
('INSERT INTO hdfsTest(
val1,
val2,
val3
) VALUES (
:in_int_value,
:in_string_value,
:in_int_value2
)
') to operator (ol_hdfsTest)
select * from operator ( op_hdfsTest );
);

);

 

The file in Hadoop is as follows:

1|val 1|1
2|val 2|2
3|val 3|3
4|val 4|4
5|val 5|5

 

This file is ingested properly from HDP to EDW using the TDCH connector. If I increase the file from 5 rows to roughly 5 MB in size, the transfer fails with the errors mentioned in the first post. What seems to be happening is that the listener on the TPT scripting side is timing out after 60 seconds, but for larger files the TDCH process takes longer than that to spin up and start streaming data through the pipe that is set up between the Hadoop cluster and the node I am running the TPT script on.

 

This is most likely a question for the TDCH/TPT development team itself, but someone else may have run into this as well: is there any way in the TPT scripting code to increase the amount of time that the TPT side will listen while the TDCH producer process spins up?

 

Hope this makes sense, and again, thanks.

 

tom

 


Re: tpt script to populate edw from kerberized hadoop instance

And as soon as I post the question, I think to check the TPT User Guide again and find that you can specify the

 

INTEGER Timeout = nnnn (I used 300)

 

parameter in either the PRODUCER or CONSUMER configuration. Setting it resolved the issue.
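
For anyone who finds this later, here is a minimal sketch of where that attribute sits, based on the producer definition from my earlier post (300 is just the value I used):

define operator op_hdfsTest
type dataconnector producer
schema schema_hdfsTest
attributes
(
VARCHAR HadoopHost = 'hostnameTPTScriptRunsOn',
VARCHAR HadoopJobType = 'hdfs',
VARCHAR HadoopSourcePaths = '/user/test/hdfsTestData.txt',
VARCHAR HadoopSeparator = '|',
INTEGER Timeout = 300  /* how long to wait for the TDCH clients to connect */
);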

 

Again, thanks, and I hope this helps resolve the issue for someone else as well.

 

tom


Re: tpt script to populate edw from kerberized hadoop instance

Afternoon again. The problem I thought was resolved is not resolved. The scenario is that if the file is large enough (in this case I have the file at 2 GB in size), I get the same error sequence as before:

TPT19609 After 60 seconds, all clients have not connected; a timeout has occurred.

TPT19434 pmRead failed, General failure (34): 'Unknown Access Module failure'

TPT19305 Fatal error reading data.

TPT19015 TPT Exit code set to 12.

 

This worked for files up to about 1.5 GB in size, but the transfers are consistently failing for files 2 GB and over. I have HadoopNumMappers = 1 to try to keep things simple and to be sure I am not requesting more resources on the Hadoop side than may be available, but otherwise the TPT script shown earlier is what is being used. I have tried adjusting the Timeout setting in both the data consumer and data producer sections, but to no avail (although I thought it had made a difference originally). I have also tried specifying VigilWaitTime, but that too made no difference in the behaviour of the TPT side of the script.
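
For reference, this is roughly the shape of those attempts on the producer side; the values shown are the ones I tried, and neither attribute changed the 60-second behaviour:

define operator op_hdfsTest
type dataconnector producer
schema schema_hdfsTest
attributes
(
VARCHAR HadoopHost = 'hostnameTPTScriptRunsOn',
VARCHAR HadoopJobType = 'hdfs',
VARCHAR HadoopSourcePaths = '/user/test/hdfsTestData.txt',
VARCHAR HadoopSeparator = '|',
VARCHAR HadoopNumMappers = '1',  /* kept low to limit resource requests */
INTEGER Timeout = 300,           /* no effect on the 60-second timeout */
INTEGER VigilWaitTime = 300      /* no effect either */
);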

 

I also tried to pause the TPT job after starting it, once I saw it had fired off the TDCH job, but even with the job in a paused state the timeout occurred and the job terminated. Is anyone aware of any additional parameters that can be specified to increase the timeout? Again, specifically, this is the 60-second timeout in effect on the TPT side, which posts a listener waiting for the start of the streamed data coming from the TDCH side of a TPT-TDCH script. The exact same script works fine for files under 1 GB in size but fails for files larger than that.

 

Again thanks,

Tom

 


Re: tpt script to populate edw from kerberized hadoop instance

Morning again,

Again adding to this post, as I have gotten what seems to be a valid and viable solution. (I know I said that before, but really, it has been reliably working this time, and it will be until it isn't, correct?) Finally, after more than three weeks of investigation, debugging, TAYS incident discussions, and "I wonder what happens if I push that button" cogitations, I discovered that the solution was to increase the number of mappers in the Hadoop jobs. I had been working on the thought that I needed to decrease the number of mappers, due to how I was interpreting some error code explanations.

 

Based on the explanation in the error code descriptions (sorry, I can't locate the exact document/link where I got them), I thought I needed to decrease the number of mappers, as there was wording indicating that perhaps the number of mappers requested was too high for what the Hadoop cluster could handle under load. That was the crux of my issue: I went the wrong way with that setting. I had tried moving it down from the default of 10, with values of 1, 5, and 7, and none made any difference to the timeout that was occurring. On an off chance, I tried a value of 50, and the timeout did not occur; the job completed successfully in 403 seconds.

 

Here is the adjusted (and working) script section that addresses the number of mappers and buffer size initialization. I believe IOBufferSize defaults to 65536, which matches the named pipe buffer size on Linux for kernel versions > 2.6.11. I tried setting it to 512288 to determine the effect, and performance almost doubled (403 seconds down to 199 seconds) with the larger buffer allocation.

 

define operator op_hdfsTest
type dataconnector producer
schema schema_hdfsTest
attributes
(
VARCHAR HadoopHost = 'hostnameTPTScriptRunsOn',
VARCHAR HadoopJobType = 'hdfs',
VARCHAR HadoopSourcePaths = '/user/test/hdfsTestData.txt',
VARCHAR HadoopSeparator = '|',
VARCHAR HadoopNumMappers = '50',  /* increased from the default of 10 */
INTEGER IOBufferSize = 512288     /* up from the 65536 default */
);

 

Section 6 of the Teradata Connector for Hadoop Tutorial v1.5 PDF adds a lot of good context around the selection of the number of mappers used in the TDCH portion of TPT-TDCH jobs. After reading back through it, now with some understanding of what worked and what didn't, it made a bit more sense to me, but I still have a long way to go in understanding the exact interaction between the systems.
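
One convenience for anyone tuning this themselves: the mapper count can be pulled out into a TPT job variable so the script does not need editing between trials. A minimal sketch, using a job variable I have named NumMappers:

define operator op_hdfsTest
type dataconnector producer
schema schema_hdfsTest
attributes
(
VARCHAR HadoopHost = 'hostnameTPTScriptRunsOn',
VARCHAR HadoopJobType = 'hdfs',
VARCHAR HadoopSourcePaths = '/user/test/hdfsTestData.txt',
VARCHAR HadoopSeparator = '|',
VARCHAR HadoopNumMappers = @NumMappers,  /* supplied at submit time */
INTEGER IOBufferSize = 512288
);

/* submitted as, for example: tbuild -f hdfsTest.tpt -u "NumMappers='50'" */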

 

Just thought some may like to know the solution, or perhaps you already did and I have finally caught up.

 

tom