Trying to load some very big JSON files

Enthusiast

Hi, I've got some very big JSON files (8 MB to 16 MB in size) that I need to load into a DWH table so I can shred the JSON. How do you load files like these into the DWH? The Teradata manual only shows very small examples. There is a small sample below; it's all in one text file, and there are millions of rows:

{"type":"FeatureCollection", "features": [
{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[136.31477686688,-11.96325894658],[136.31553970687,-11.96244894754],[136.31669962686,-11.96232897847],[136.32062182682,-11.96415597674],[136.32182674681,-11.96539896566],[136.3228189068,-11.9652919591],[136.3229405868,-11.9659889166],[136.32207190681,-11.96800491264],[136.32207082681,-11.96859693764],[136.3232908668,-11.97033294356],[136.3231846668,-11.97125094356],[136.32242182681,-11.97184593511],[136.32054478683,-11.97223590672],[136.31866882685,-11.97181893933],[136.31816482685,-11.97143697486],[136.31837758685,-11.96996790285],[136.31760070685,-11.9672689829],[136.31665390686,-11.96678190746],[136.31621182687,-11.96488191949],[136.31576974687,-11.96423895402],[136.31488486688,-11.96374998307],[136.31477686688,-11.96325894658]]]},"properties":{"LG_PLY_PID":"2178","DT_CREATE":"2016-05-25T00:00:00.000Z","DT_RETIRE":"1899-11-30T00:00:00.000Z","LGA_PID":"NT93","NT_LGA_shp":"2016-05-25T00:00:00.000Z","NT_LGA_s_1":"1899-11-30T00:00:00.000Z","NT_LGA_s_2":"EAST ARNHEM REGION","NT_LGA_s_3":"EAST ARNHEM","NT_LGA_s_4":"1899-11-30T00:00:00.000Z","NT_LGA_s_5":"7"}},
{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[136.44633670522,-11.8473469464],[136.44654982521,-11.84653398088],[136.44845782519,-11.84500290127],[136.44883870518,-11.84434698174],[136.44873286518,-11.84358692875],[136.44822886519,-11.84272995029],[136.4472377852,-11.84208293886],[136.4467327052,-11.84104695039],[136.44717586519,-11.84002094937],[136.44838186518,-11.84006990929],[136.45318786512,-11.84162097197],[136.45428694511,-11.8422109229],[136.45472974511,-11.84286296636],[136.45489678511,-11.84398896135],[136.45457674511,-11.84469798037],[136.45286674513,-11.84546990798],[136.45117366516,-11.84672595308],[136.45001482517,-11.84689092116],[136.44919078518,-11.84744189971],[136.4476489052,-11.84723193281],[136.44661174521,-11.84767193538],[136.44633670522,-11.8473469464]]]},"properties":{"LG_PLY_PID":"2179","DT_CREATE":"2016-05-25T00:00:00.000Z","DT_RETIRE":"1899-11-30T00:00:00.000Z","LGA_PID":"NT93","NT_LGA_shp":"2016-05-25T00:00:00.000Z","NT_LGA_s_1":"1899-11-30T00:00:00.000Z","NT_LGA_s_2":"EAST ARNHEM REGION","NT_LGA_s_3":"EAST ARNHEM","NT_LGA_s_4":"1899-11-30T00:00:00.000Z","NT_LGA_s_5":"7"}},
{"type":"Feature","geometry":{"type":"Polygon","coordinates":[[[136.05522658898,-11.62505993214],[136.05539362898,-11.62425695413],[136.05737758896,-11.62424399154],[136.05798850896,-11.62500197051],[136.05870454895,-11.62505597098],[136.05870562895,-11.62592699148],[136.05749962896,-11.62857893203],[136.05646246897,-11.62890896157],[136.05557758898,-11.62956892711],[136.05502858899,-11.62946597513],[136.05601966898,-11.62799492259],[136.05595774898,-11.6270179751],[136.05522658898,-11.62505993214]]]}


1 REPLY
Teradata Employee

Re: Trying to load some very big JSON files

For JSON files that don't fit in a database row (>64 KB) you need to use "deferred" loading with Teradata Parallel Transporter (TPT). Inline (standard) loading is limited to the maximum size of a database row (64 KB) because it actually sends rows of data across a TPT data stream. Deferred loading is used when objects don't fit into a data row and can't be sent across the TPT data stream. Instead, a pointer to the actual file location is sent and the file is loaded from that location directly. More details here: https://downloads.teradata.com/tools/articles/large-object-loading-with-teradata-parallel-transporte...
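
To make the contrast concrete, here is a rough sketch of the same schema column declared both ways. The inline VARCHAR declaration and its size are just placeholders; the deferred declaration is the one used in the schema below.

/* Inline: the value itself travels on the TPT data stream, so it must fit within the row-size limit */
JSON_TEXT VARCHAR(64000)

/* Deferred: only a file name travels on the data stream; TPT reads each named file directly */
JSON_TEXT BLOB(20000000) AS DEFERRED BY NAME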


Deferred loading uses a script very similar to inline loading, with a few differences.  You must specify "AS DEFERRED BY NAME" in the TPT input schema, as shown below:

DEFINE SCHEMA JSON_SCHEMA
DESCRIPTION 'JSON RECORD INPUT SCHEMA'
(
JSON_TEXT BLOB(20000000) AS DEFERRED BY NAME
);


Then, in your DataConnector producer operator, you specify the location of a file which contains the locations of the objects you're loading.  That can be confusing, so to reiterate: you need to supply your TPT script with a file that contains a list of the file locations of the JSON files you're trying to load.


DEFINE OPERATOR FLIGHT_READER()
DESCRIPTION 'TERADATA PARALLEL TRANSPORTER DATA CONNECTOR OPERATOR'
TYPE DATACONNECTOR PRODUCER
SCHEMA JSON_SCHEMA
ATTRIBUTES
(
VARCHAR DirectoryPath = 'C:\input',
VARCHAR FileName = 'json_file_locations.txt',
VARCHAR OpenMode = 'Read',
VARCHAR Format = 'Delimited',
INTEGER IOBufferSize = 500000
);


The contents of json_file_locations.txt could be just this:

C:\Input\JSON_Files\*.json


...which would load all .json files in C:\Input\JSON_Files.
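
Alternatively, the locations file can list individual files, one path per line (these file names are made-up examples):

C:\Input\JSON_Files\nt_lga_polygons_01.json
C:\Input\JSON_Files\nt_lga_polygons_02.json
C:\Input\JSON_Files\nt_lga_polygons_03.json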


Below is a TPT script that should work for you. You just need to update the directory paths, the connection information, and your json_file_locations.txt file:


DEFINE JOB LOAD_JSON_FROM_FILE
DESCRIPTION 'Load file using inserter operator'
(
/*****************************/
DEFINE SCHEMA JSON_SCHEMA
DESCRIPTION 'JSON RECORD INPUT SCHEMA'
(
JSON_TEXT BLOB(20000000) AS DEFERRED BY NAME
);
/*****************************/

DEFINE OPERATOR FLIGHT_READER()
DESCRIPTION 'TERADATA PARALLEL TRANSPORTER DATA CONNECTOR OPERATOR'
TYPE DATACONNECTOR PRODUCER
SCHEMA JSON_SCHEMA
ATTRIBUTES
(
VARCHAR DirectoryPath = 'C:\input',
VARCHAR FileName = 'json_file_locations.txt',
VARCHAR OpenMode = 'Read',
VARCHAR Format = 'Delimited',
INTEGER IOBufferSize = 500000
);

/*****************************/
DEFINE OPERATOR FLIGHT_INSERTER
DESCRIPTION 'Teradata PT INSERTER OPERATOR'
TYPE INSERTER
SCHEMA *
ATTRIBUTES
(
VARCHAR TdpId = '*********',
VARCHAR UserName = '****',
VARCHAR UserPassword = '****',
VARCHAR PrivateLogName = 'SQL_inserter_log',
INTEGER IOBufferSize = 1000000
);

/*****************************/

STEP load_json_files
(
APPLY ('INSERT INTO mt.json_files_stg VALUES (1,:JSON_TEXT);')
TO OPERATOR (FLIGHT_INSERTER() )
SELECT JSON_TEXT FROM OPERATOR (FLIGHT_READER());
);
);
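
Once the script is saved (say as load_json.tpt), you run it with tbuild. To round things out, here is a rough sketch of the pieces around the script: a possible staging table behind the APPLY statement, and one way to shred the loaded documents with JSON_Table afterwards. The script file name, the column names (file_id, json_text), the JSON column size, and the JSON paths in the colexpr are assumptions based on your sample; adjust them to your data, and make sure the json_text column type is compatible with what the inserter sends (the schema above declares a BLOB, so you may prefer a binary-stored JSON or BLOB column depending on your release).

/* Run the TPT job from the command line (the script file name is an example) */
tbuild -f load_json.tpt

/* A possible staging table for mt.json_files_stg - the APPLY above inserts
   a literal 1 plus the JSON text, so two columns are assumed */
CREATE TABLE mt.json_files_stg
(
file_id INTEGER,
json_text JSON(16000000) CHARACTER SET LATIN
)
PRIMARY INDEX (file_id);

/* One way to shred the loaded FeatureCollection with JSON_Table;
   extend colexpr with whichever properties you need */
SELECT *
FROM JSON_Table
( ON ( SELECT file_id, json_text FROM mt.json_files_stg )
  USING rowexpr('$.features[*]')
        colexpr('[ {"jsonpath" : "$.properties.LGA_PID",    "type" : "VARCHAR(20)"},
                   {"jsonpath" : "$.properties.NT_LGA_s_2", "type" : "VARCHAR(60)"} ]')
) AS jt(file_id, lga_pid, lga_name);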