Teradata Studio vs teradata Python library performance: python lib is slow

General
Fan


Hi,

 

We have an application that needs to offload data from Teradata on a daily basis. The application is written in Python and uses the Python teradata library (version 15.10.0.21). Functionally it works, but downloading the data is really slow. Execution of the query itself is acceptable, about 2 minutes (the table has >400M rows); however, the actual offloading of the records is slow.

 

For example:

  • offloading 1.1M rows of data takes ~1 hour using the python application (data volume 660MB)
  • whereas offloading the exact same data using Teradata Studio takes only 15 minutes

So offloading using Teradata Studio is 4 times faster (both situations were run from the same machine, using the same connection to Teradata).

 

Can someone explain why Teradata Studio is so much faster? Is there some (undocumented) setting in the Python library that would make it faster? Are there alternatives?

 

I also tried SQLAlchemy in combination with the tdodbc1620__linux_indep.16.20.00.50-1 ODBC driver, and that performed 10% slower than the Python teradata library.

Any help is appreciated. 

 

Regards,

Gero

5 REPLIES
Teradata Employee

Re: Teradata Studio vs teradata Python library performance: python lib is slow

Hi Gero,

 

Teradata Studio uses the FastExport properties tied to the JDBC driver.

From the link you provided, FastExport is not supported within the teradata module.

 

Just do a TPT Export if you want optimal response time; that is the fastest option we have.

You can execute it from your Python script or any batch job.
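
A minimal TPT export job could look like the sketch below. This assumes the `$EXPORT` and `$FILE_WRITER` operator templates that ship with TPT; the job name, file name, delimiter, and table are placeholders, and exact attribute names can vary between TPT versions:

```
/* export_job.tpt -- a sketch, not a tested job */
DEFINE JOB export_job
DESCRIPTION 'Offload a table to a delimited file'
(
  APPLY TO OPERATOR ($FILE_WRITER()
    ATTRIBUTES (FileName = 'offload.csv', Format = 'Delimited', TextDelimiter = '|'))
  SELECT * FROM OPERATOR ($EXPORT()
    ATTRIBUTES (SelectStmt = 'SELECT * FROM mydb.mytable;'));
);
```

You would then run it with something like `tbuild -f export_job.tpt -v jobvars.txt`, where `jobvars.txt` supplies TdpId, UserName, and UserPassword.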

Teradata Employee

Re: Teradata Studio vs teradata Python library performance: python lib is slow

Just in principle, you may want to change from the older teradata Python module (which uses the ODBC driver under the covers) to the current "native" Teradata SQL Driver for Python (teradatasql) module for more efficient access from Python. But teradatasql does not yet support FastExport protocol either. So either TPT or a JDBC client that knows how to exploit FastExport (such as Teradata Studio / Studio Express) will be faster.
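
To illustrate the kind of access pattern that keeps memory flat while offloading, here is a small batching sketch that works with any DB-API 2.0 cursor (both teradatasql and the ODBC-based drivers expose one); `stream_rows` and the batch size are illustrative names, not part of either driver:

```python
def stream_rows(cursor, batch_size=10000):
    """Yield rows one at a time, fetching from the cursor in batches.

    Works with any DB-API 2.0 cursor, so the whole result set never
    sits in memory at once.
    """
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield from rows
```

With teradatasql this would look roughly like `cur.execute(query)` followed by `for row in stream_rows(cur): writer.writerow(row)`.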

Fan

Re: Teradata Studio vs teradata Python library performance: python lib is slow

Thanks, we're gonna give the JDBC driver a try.

 

Gero

Fan

Re: Teradata Studio vs teradata Python library performance: python lib is slow

Hi,

 

We have implemented the JDBC library and it is now ~25% faster, which is a nice first step. Our earlier comparison with Teradata Studio showed a 75% reduction in time, so there is more to gain. Below is our code. We experimented a bit with the size passed to fetchmany(), but that did not show significant differences.

 

Any suggestions on how we can further optimize?

#!/usr/bin/env python3
# pylint: disable=invalid-name,no-else-return

import logging
import time

import jaydebeapi

start_time = time.time()

logging.basicConfig()
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

logger.info("Start time {}".format(start_time))

connection = None
cursor = None
try:
    logger.info("Start connection")
    connection = jaydebeapi.connect(
        jclassname="com.teradata.jdbc.TeraDriver",
        url="jdbc:teradata://000.000.000.1/TYPE=FASTEXPORT",
        driver_args=['USERNAME', 'PASSWORD'],
        jars=['/opt/teradata/tdgssconfig.jar', '/opt/teradata/terajdbc4.jar'])
    logger.info("Connected")

    logger.info("Get cursor")
    cursor = connection.cursor()

    logger.info("Run query")
    query = "SELECT * FROM TABLE"
    cursor.execute(query)

    logger.info("Get column information")
    columns = [column[0] for column in cursor.description]
    logger.info(columns)

    # Note: this only fetches the first batch; loop over fetchmany()
    # until it returns an empty list to retrieve the full result set.
    logger.info("Fetch rows")
    logger.info(cursor.fetchmany(size=1000))

except Exception:
    logger.exception("Error")
finally:
    if cursor:
        logger.info("Close cursor")
        cursor.close()
    if connection:
        logger.info("Close connection")
        connection.close()

end_time = time.time()
logger.info("End time: {}".format(end_time))
logger.info("Difftime: {}".format(end_time - start_time))

Thanks,

Gero

Teradata Employee

Re: Teradata Studio vs teradata Python library performance: python lib is slow

You should really give TPT a try; it's much faster.

Run it to produce a zipped file, use pandas to read that zipped file, and you're mostly done.

I think this could execute in less than a minute.
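
As a sketch of the pandas step, reading a gzip-compressed delimited file is a one-liner; the `|` delimiter is an assumption about how the export job wrote the file, and `read_offload` is our own helper name:

```python
import pandas as pd

def read_offload(path):
    # pandas infers gzip compression from a ".gz" extension
    # (or pass compression="gzip" explicitly); the '|' delimiter
    # must match whatever the TPT job used.
    return pd.read_csv(path, sep="|", header=None)
```

Usage would be along the lines of `df = read_offload("offload.csv.gz")`.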