Processing multiple files from S3 using Teradata TPT


Hi All,

We need to process a large set of files from AWS S3 using TPT. Can anyone please advise how we can achieve this using the "Teradata Access Module for Amazon S3"?

The documentation provides a sample script only for a single file; please advise how to process multiple files.

 

Thanks & Regards,

Shayan

13 REPLIES
Senior Supporter

Re: Processing multiple files from S3 using Teradata TPT

Check the S3SinglePartFile option and its description (see below).

So it is possible to load multiple files - the question to me is whether the naming scheme of the files is fixed (F000001, F000002, etc.) or whether arbitrary files are picked...

 

Ulrich

 

 

From the documentation:

"This optional parameter determines whether the S3 object being
read or written is a single file with the name specified by S3Object
or a set of numbered files in an “apparent” directory named by the
S3Object. As discussed above, directories don’t really exist, but
the GUI and CLI segregate files based on “/” making it appear that
there is a directory.
When an application creates an object on S3, the object can be
created from up to 10,000 “pieces”. The size and SHA hash of the
“pieces” must be presented when the connection is opened. Since it
is not desirable to buffer the file on disk, the size of a “piece” is
limited to the size of the buffer. The minimum piece size/buffer size
is 5MB. By default 8MB is used as a good trade-off between memory
usage and performance. When the pieces are to be turned into an S3
object (a file) the command is limited to 10,000 pieces. That
effectively limits the size of a single S3 object (file) that is created to
80 Gigabytes.
The solution is to provide a mechanism to automatically treat a list
of objects (files) with related pathnames as a single file for the export
or load operation. The connector appends a “/”, followed by an “F”,
followed by a sequence number. So there is an “apparent” directory
created with the name of the requested S3 object. The files in that
“apparent” directory are named F000001 , F000002 , F000003 , etc.
This solves the problem of being able to handle an arbitrarily large
result set from the database. The suffixes of the “parts” are
predefined, and cannot be overridden.

If it is possible that the output will exceed 80 Gigabytes, the
S3SinglePartFile parameter must be set to false so a set of
numbered files can be created. The default value for
S3SinglePartFile is "false".
The S3SinglePartFile parameter may be set to "true" in the
following scenarios:
• For tests using small datasets
• For production where the customer is absolutely certain the
result set to be stored is smaller than 10,000*buffersize
The S3SinglePartFile must be set to true when reading a
single S3 object (file) of arbitrary size that isn’t split up into parts
(F000001, F000002, etc.)."

Enthusiast

Re: Processing multiple files from S3 using Teradata TPT

Ulrich - we have seen the documentation but are having some difficulty understanding it.

 

Can you (or anyone) guide me on what to give in the S3Object field?

 

My sample S3 path is => /bucket_name/retail/usa/daily/test_files/
The test_files folder above has 3 files in it, named F000001, F000002, and F000003. I want to process all 3 files via TPT.

 

Below are the attributes I have given in my TPT:

VARCHAR PrivateLogName = 'private_test_Log',
VARCHAR AccessModuleName = '/usr/lib64/libs3axsmod.so',
VARCHAR AccessModuleInitStr = 'S3Region="us-west-2" S3Bucket="bucket_name" S3Prefix="retail/usa/daily/test_files" S3Object=? S3SinglePartFile="False" S3ConnectionCount=50 S3BufferSize=128M ',
VARCHAR Format = 'Delimited',
VARCHAR OpenMode = 'Read',
VARCHAR TextDelimiter = '|'
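
My guess, purely based on the F000001/F000002 naming described in the documentation quoted above, would be to put the "apparent" directory name in S3Object rather than in S3Prefix - but this is untested and the exact prefix/object split is only my assumption:

VARCHAR AccessModuleInitStr = 'S3Region="us-west-2" S3Bucket="bucket_name" S3Prefix="retail/usa/daily/" S3Object=test_files S3SinglePartFile="False" S3ConnectionCount=50 S3BufferSize=128M ',

Is that the right way to point the access module at the numbered parts?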

 

In a similar way, I need to process the files in another S3 directory, which has around 1000+ files. Each file is around 400 MB in size (uncompressed).

Teradata Employee

Re: Processing multiple files from S3 using Teradata TPT

You can use Wildcard character specifications if the filenames follow a naming convention.

 

This feature is available in

 

AccessModuleAmazonS3 15.10.01.003 or later
AccessModuleAmazonS3 16.10.00.000 or later

 

Wildcard expressions in FileName

    Wildcard expressions in a filename can be used in the data import scenario,
    where data is loaded into the Teradata Database. Wildcard characters in a filename
    can be specified by setting "S3SinglePartFile=True" and simultaneously using
    the wildcard characters * or ? in the S3Prefix name or S3Object name.
    S3 Objects are selected using the same rules as in the Unix shell. Filenames matching
    the wildcard file specifications are added in the job log, and data from each file whose
    name matches the wildcard expressions is loaded into the Teradata Database table.
    Wildcard expressions that match a "/" in the object name with a wildcard character will be skipped.
    Wildcard characters will no longer match the '/' separator in names (e.g. '/ab*' will not match '/abc/xyz').
    You can say /alph*/* to match /alphabet/a, /alphabet/b, etc.
    This is the same behavior that you would expect from GLOB during shell filename expansion.
    S3Prefix and S3Object are concatenated before expansion. Wildcard characters can appear in either.
    Wildcards which do not match any files will show the correct pattern in the error response.

 

Example: reading the testdata1.raw, testdata2.raw, and testdata3.raw objects in one load job using wildcards.

DEFINE JOB IMPORT_TO_TERADATA
DESCRIPTION 'Import data to Teradata from Amazon S3'
(
    SET TargetTable = 'S3W001';
    STEP IMPORT_THE_DATA
    (
        APPLY $INSERT TO OPERATOR ($LOAD)
        SELECT * FROM OPERATOR ($FILE_READER ()
            ATTR
            (
                AccessModuleName = 'libs3axsmod.so',
                AccessModuleInitStr = 'S3Bucket=<S3BucketName> S3Prefix="<PrefixName>/" S3Object=testdata* S3SinglePartFile=True'
            )
        );
    );
);
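
Applied to the test_files example above, the init string would look roughly like this - assuming the three objects really are plain delimited files, so treat it as an untested sketch rather than a verified setting:

AccessModuleInitStr = 'S3Region="us-west-2" S3Bucket="bucket_name" S3Prefix="retail/usa/daily/test_files/" S3Object=F00000? S3SinglePartFile=True'

The "?" wildcard matches a single character (same rules as the Unix shell), so F00000? would pick up F000001, F000002, and F000003.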

 

OR

 

You can try the UNION ALL operator of TPT to read multiple files simultaneously if the filenames don't follow a naming convention.

 

DEFINE JOB IMPORT_TO_TERADATA
DESCRIPTION 'Import data to Teradata from Amazon S3'
(
    SET TargetTable = 'S3W001';

    STEP IMPORT_THE_DATA
    (
        APPLY $INSERT TO OPERATOR ($LOAD)
        SELECT * FROM OPERATOR ($FILE_READER ()
            ATTR
            (
                PrivateLogName = 'S3AXSMODREADW001P2_1',
                AccessModuleName = 'libs3axsmod.so',
                AccessModuleInitStr = 'S3Bucket=<S3BucketName> S3Prefix="<PrefixName>/" S3Object=testdata1.raw S3SinglePartFile=True'
            )
        )
        UNION ALL
        SELECT * FROM OPERATOR ($FILE_READER ()
            ATTR
            (
               PrivateLogName = 'S3AXSMODREADW001P2_2',
               AccessModuleName = 'libs3axsmod.so',
               AccessModuleInitStr = 'S3Bucket=<S3BucketName> S3Prefix="<PrefixName>/" S3Object=testdata2.raw S3SinglePartFile=True'
            )
        )
        UNION ALL
        SELECT * FROM OPERATOR ($FILE_READER ()
            ATTR
            (
               PrivateLogName = 'S3AXSMODREADW001P2_3',
               AccessModuleName = 'libs3axsmod.so',
               AccessModuleInitStr = 'S3Bucket=<S3BucketName> S3Prefix="<PrefixName>/" S3Object=testdata3.raw S3SinglePartFile=True'
            )
        );
    );
);

 

 

 

 

Teradata Employee

Re: Processing multiple files from S3 using Teradata TPT

In the import scenario, 'S3SinglePartFile=False' is helpful
if you are importing data to Teradata from S3 that was earlier exported from Teradata to S3
using s3axsmod with the 'S3SinglePartFile=False' option specified during export.

 

The maximum size of a single object that can be created through s3axsmod
while exporting data from Teradata is calculated as follows:

 

S3BufferSize * 10000

 

If you specify S3BufferSize=5M, then during export the maximum size of a single object is 50GB.

 

If the Teradata table contains more than 50GB of data, then:

 

  • Either increase the S3BufferSize value (see the sketch just below), or
  • Use 'S3SinglePartFile=False' if you cannot increase the S3BufferSize value for performance reasons, or if you don't want to keep a large volume of data in a single object.
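
For example, a hedged sketch of the first option (the arithmetic is just the 10,000-piece limit from the documentation quoted earlier; bucket and prefix are placeholders):

VARCHAR AccessModuleInitStr = 'S3Region="us-west-2" S3Bucket="bucket_name" S3Prefix="retail/usa/daily/test_files" S3Object=Backup S3SinglePartFile="True" S3ConnectionCount=50 S3BufferSize=32M '

With S3BufferSize=32M the ceiling for a single object works out to roughly 10,000 * 32MB = 320GB.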

If you use the following initialization string during export:

 

VARCHAR AccessModuleInitStr = 'S3Region="us-west-2" S3Bucket="bucket_name" S3Prefix="retail/usa/daily/test_files" S3Object=Backup S3SinglePartFile="False" S3ConnectionCount=50 S3BufferSize=128M '

 

then s3axsmod will create multiple smaller objects in S3 to hold the data and name them as follows:

 

/bucket_name/retail/usa/daily/test_files/Backup/F000001
/bucket_name/retail/usa/daily/test_files/Backup/F000002
/bucket_name/retail/usa/daily/test_files/Backup/F000003
.....................
....................

 

You need to specify

 

VARCHAR AccessModuleInitStr = 'S3Region="us-west-2" S3Bucket="bucket_name" S3Prefix="retail/usa/daily/test_files" S3Object=Backup S3SinglePartFile="False" S3ConnectionCount=50 S3BufferSize=128M '

 

in the import job when loading data back into Teradata that was earlier exported through s3axsmod with the 'S3SinglePartFile=False' option.
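
Putting that into a complete job sketch, along the same lines as the wildcard example earlier in this thread (untested; the job name, target table, and log name are placeholders):

DEFINE JOB IMPORT_MULTIPART_FROM_S3
DESCRIPTION 'Import multipart S3 objects back to Teradata'
(
    SET TargetTable = 'S3W001';

    STEP IMPORT_THE_DATA
    (
        APPLY $INSERT TO OPERATOR ($LOAD)
        SELECT * FROM OPERATOR ($FILE_READER ()
            ATTR
            (
                PrivateLogName = 'S3AXSMOD_MULTIPART_READ',
                AccessModuleName = 'libs3axsmod.so',
                AccessModuleInitStr = 'S3Region="us-west-2" S3Bucket="bucket_name" S3Prefix="retail/usa/daily/test_files" S3Object=Backup S3SinglePartFile="False" S3ConnectionCount=50 S3BufferSize=128M'
            )
        );
    );
);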

Enthusiast

Re: Processing multiple files from S3 using Teradata TPT

@Mishra_HS @ulrich Thanks for the reply. Sorry, I didn't mention that I have already tried the wildcard approach. When running with wildcards, it read 20 files and after that the job disconnected its sessions and terminated. Below is the log for your reference.

 

We need to process hundreds of .gz files or thousands of .csv files. We are planning to go ahead with .gz files. Do you have any recommendation?

 

S3AXM [25509]:  S3 Parameter        Parameter Value                     Parameter Origin
S3AXM [25509]:  -------------       --------------------------------    ---------------------
S3AXM [25509]:  S3ConfigDir       = /home/hadoop/.aws                   $HOME environment variable
S3AXM [25509]:  S3Profile         = default                             Default
S3AXM [25509]:  S3Region          = us-west-2                           Module Parameters
S3AXM [25509]:  S3Bucket          = bucket_name                         Module Parameters
S3AXM [25509]:  S3Prefix          = retail/usa/daily/test_files/        Module Parameters
S3AXM [25509]:  S3Object          = filename_8_21*.csv                  Module Parameters
S3AXM [25509]:  S3ConnectionCount = 50                                  Module Parameters
S3AXM [25509]:  S3BufferCount     = 100                                 Computed Default
S3AXM [25509]:  S3BufferSize      = 134217728                           Module Parameters
S3AXM [25509]:  S3SinglePartFile  = True                                Module Parameters
S3AXM [25509]:  S3AccessID        = ****************                    AWS "credentials" file
S3AXM [25509]:  S3AccessKey       = ****************                    AWS "credentials" file
S3AXM [25509]:  -----------------------------------------------------------------------------
OPR_AccessModule_Read_RtlrLctnUPC_PY[1]: ECI operator ID: 'OPR_AccessModule_Read_RtlrLctnUPC_PY-25509'
Load_Oper_RtlrLctnUPC_PY: connecting sessions
Load_Oper_RtlrLctnUPC_PY: preparing target table(s)
Load_Oper_RtlrLctnUPC_PY: entering DML Phase
Load_Oper_RtlrLctnUPC_PY: entering Acquisition Phase
S3AXM [25509]: For offset            0 opened object of length    167061571: "retail/usa/daily/test_files/filename_8_21_0_0_0.csv"
S3AXM [25509]: For offset    167061571 opened object of length    167363421: "retail/usa/daily/test_files/filename_8_21_0_0_1.csv"
S3AXM [25509]: For offset    334424992 opened object of length    169741546: "retail/usa/daily/test_files/filename_8_21_0_0_10.csv"
S3AXM [25509]: For offset    504166538 opened object of length    169998470: "retail/usa/daily/test_files/filename_8_21_0_0_100.csv"
S3AXM [25509]: For offset    674165008 opened object of length    172088574: "retail/usa/daily/test_files/filename_8_21_0_0_101.csv"
S3AXM [25509]: For offset    846253582 opened object of length    172925467: "retail/usa/daily/test_files/filename_8_21_0_0_102.csv"
S3AXM [25509]: For offset   1019179049 opened object of length    168335355: "retail/usa/daily/test_files/filename_8_21_0_0_103.csv"
S3AXM [25509]: For offset   1187514404 opened object of length    170752191: "retail/usa/daily/test_files/filename_8_21_0_0_104.csv"
S3AXM [25509]: For offset   1358266595 opened object of length    171636895: "retail/usa/daily/test_files/filename_8_21_0_0_105.csv"
S3AXM [25509]: For offset   1529903490 opened object of length    172512607: "retail/usa/daily/test_files/filename_8_21_0_0_106.csv"
S3AXM [25509]: For offset   1702416097 opened object of length    169532035: "retail/usa/daily/test_files/filename_8_21_0_0_107.csv"
S3AXM [25509]: For offset   1871948132 opened object of length    172511710: "retail/usa/daily/test_files/filename_8_21_0_0_108.csv"
S3AXM [25509]: For offset   2044459842 opened object of length    173964168: "retail/usa/daily/test_files/filename_8_21_0_0_109.csv"
S3AXM [25509]: For offset   2218424010 opened object of length    168322153: "retail/usa/daily/test_files/filename_8_21_0_0_11.csv"
S3AXM [25509]: For offset   2386746163 opened object of length    146942344: "retail/usa/daily/test_files/filename_8_21_0_0_110.csv"
S3AXM [25509]: For offset   2533688507 opened object of length    174048390: "retail/usa/daily/test_files/filename_8_21_0_0_12.csv"
S3AXM [25509]: For offset   2707736897 opened object of length    174221309: "retail/usa/daily/test_files/filename_8_21_0_0_13.csv"
S3AXM [25509]: For offset   2881958206 opened object of length    171484850: "retail/usa/daily/test_files/filename_8_21_0_0_14.csv"
S3AXM [25509]: For offset   3053443056 opened object of length    174115611: "retail/usa/daily/test_files/filename_8_21_0_0_15.csv"
Load_Oper_RtlrLctnUPC_PY: disconnecting sessions
S3AXM [25510]: AWS S3 ACCESS Module Shutdown Completed
Load_Oper_RtlrLctnUPC_PY: Total processor time used = '3.876 Second(s)'
Load_Oper_RtlrLctnUPC_PY: Start : Tue Aug 22 00:55:38 2017
Load_Oper_RtlrLctnUPC_PY: End   : Tue Aug 22 01:00:54 2017
S3AXM [25509]: AWS S3 ACCESS Module Shutdown Completed
OPR_AccessModule_Read_RtlrLctnUPC_PY[1]: Total files processed: 0.
Job step STP_Import_RtlrLctnUPC_PY terminated (status 12)

 

Enthusiast

Re: Processing multiple files from S3 using Teradata TPT

@Mishra_HS Thanks again for the quick response. I will try the below option. Hopefully it will solve my issue.

 

VARCHAR AccessModuleInitStr = 'S3Region="us-west-2" S3Bucket="bucket_name" S3Prefix="retail/usa/daily/test_files" S3Object=Backup S3SinglePartFile="False" S3ConnectionCount=50 S3BufferSize=128M '

Enthusiast

Re: Processing multiple files from S3 using Teradata TPT

@Mishra_HS One more question: if I want to run multiple jobs in parallel, how do I define different checkpoints? If I run 2 jobs in parallel, by default the checkpoint files are created at /opt/teradata/client/16.10/tbuild/checkpoint/ with the same name for both jobs. For example, if one job is running and I start a second job, the second job restarts from the checkpoint file that is already present (i.e. the one belonging to the first job).

 

Any guidance on how to avoid this kind of scenario, or how to set different checkpoint file names for different jobs?

 

Thanks in advance.

Buvan

Teradata Employee

Re: Processing multiple files from S3 using Teradata TPT

Even though we do not enforce it, it is HIGHLY RECOMMENDED that you assign a job name to each TPT job you run.

Use the -j command line option.

This will create separate checkpoint/restart files for you.

TPT actually does not allow you to run 2 jobs simultaneously without a job name (which would default to the logon id of the user running the TPT job).

It is just good practice to provide a unique name to each job you run.
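
For example (the script and job names here are just placeholders):

tbuild -f load_job_1.tpt -j daily_s3_load_1
tbuild -f load_job_2.tpt -j daily_s3_load_2

Because each run has its own job name, each gets its own checkpoint/restart files and the two jobs will not collide.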

 

-- SteveF
Enthusiast

Re: Processing multiple files from S3 using Teradata TPT

@feinholz Thank you. Can you please let me know why the job terminated after processing 20 files? I attached the log in my previous reply. It would be great if you could provide your insights.

 

Thanks,

Buvan.