Python with Teradata ODBC

Teradata Employee

1 Introduction

This document is a high-level tutorial that describes how to connect to the Teradata Database using the ODBC Driver for Teradata from a scripting language such as Python.

Scripting languages are quickly becoming a common choice for implementation in many areas, especially where development time is more critical than execution time. And in areas where execution time does matter, the languages and their environments are tuned and optimized to boost performance. Scripting languages make a variety of scenarios and configurations possible, and increasingly serve as full application platforms in their own right. For example, many large commercial Internet applications are now developed in Perl, Python, or PHP.

This document is designed to demonstrate the ease with which applications written in these languages can interact with the Teradata database. 

1.1 Driver Manager


When an application tries to connect to a data source using the connection methods of the ODBC API, the Driver Manager (DM) determines which driver is required and loads it into memory. The Driver Manager then takes any incoming function call from the application and calls the function of the same name in the driver. It also performs other duties, such as checking that functions are called in the right order and that arguments contain valid values, and unloading the driver from memory.

The ODBC driver manager acts as the mediator between the application and the database thereby creating a layer of abstraction. It manages the interactions between application programs and drivers. It has the capability to manage multiple applications and multiple drivers simultaneously. 


There are different Driver Managers which can be used. Some of the most commonly used Driver Managers are:

unixODBC DM: an open source Driver Manager for various platforms

iODBC DM: an open source Driver Manager shipped with Mac OS X

Microsoft ODBC DM: Driver Manager shipped with Microsoft Windows OS 

DataDirect DM: Driver Manager built by DataDirect 

The ODBC Driver for Teradata works with iODBC on Mac OS X and with the Microsoft ODBC Driver Manager on Windows OS. On all other supported platforms, it is shipped with DataDirect’s Driver Manager. 

1.2 References

[DD] – DataDirect - www.datadirect.com

[iODBC] - Independent Open Database Connectivity - www.iodbc.org

[ODBC] Microsoft ODBC Specification - http://msdn.microsoft.com/en-us/library/ms710252%28VS.85%29.aspx

[PYTHON] – Python Programming Language - www.python.org

[PYTHON DAPI] – Python Database API Specification - www.python.org/dev/peps/pep-0249

[PYODBC] – Python ODBC Library - code.google.com/p/pyodbc

[TD ODBC] – ODBC Driver for Teradata User Guide – www.info.teradata.com

[TD DOWNLOAD] - Teradata Download Center - http://www.teradata.com/downloadcenter

1.3 Glossary

The following terms are used in this document and may not be familiar to all readers:

Term Definition
CGI Common Gateway Interface
DBMS Database Management System
DSN Data Source Name
FastCGI CGI extension
HTML HyperText Markup Language
IIS Internet Information Services
iODBC Independent Open Database Connectivity
ODBC Open Database Connectivity


2 Configuring Python

2.1 Prerequisites 

The components used throughout this tutorial:

Component Version URL
Teradata ODBC Driver 15.10.00.01 www.teradata.com/DownloadCenter
Python 2.7.11 http://www.python.org/download
Pyodbc 2.1.6 / 3.0.7 http://code.google.com/p/pyodbc/downloads/list

Several modules exist that enable connectivity to an ODBC data source. This tutorial demonstrates an ODBC connection to the Teradata database using one such module, pyodbc ([PYODBC]). Pyodbc is an open-source Python database module for ODBC that implements the Python DB API 2.0 specification. One restriction is the Python version: pyodbc requires Python 2.4 or greater.
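Since pyodbc requires Python 2.4 or greater, a script can fail fast on an unsupported interpreter instead of producing an obscure import error later. A minimal sketch:

```python
import sys

# pyodbc requires Python 2.4 or newer; fail fast with a clear message
if sys.version_info < (2, 4):
    raise RuntimeError("pyodbc requires Python 2.4 or greater, "
                       "found %d.%d" % sys.version_info[:2])
```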

On the Windows platform, the pyodbc module is available for download as an installation package from [PYODBC].

There is no binary distribution of pyodbc for UNIX-like platforms. On these platforms, it is necessary to download the pyodbc source distribution and build it against an ODBC driver manager installed on the machine.

The rest of this section describes the steps required to build the pyodbc module against the DataDirect driver manager, which is the driver manager distributed with the Teradata ODBC driver on UNIX platforms.

Unpack:

unzip pyodbc-2.1.6.zip
cd pyodbc-2.1.6

Open the setup.py file in a text editor and find these lines:

extra_compile_args = None
extra_link_args = None

Update the lines as follows:

extra_compile_args = None
extra_link_args = ['-L/opt/teradata/client/15.10/lib64']

Note that the installation path of the Teradata ODBC driver and its components depends on the operating system and the driver version. The snippet above assumes a default installation of TDODBC 15.10 on Linux.

For pyodbc-2.1.6 find this line:

extra_compile_args = ['-Wno-write-strings']

Edit the line so that it looks like this:

extra_compile_args = ['-Wno-write-strings', '-I/opt/teradata/client/15.10/include', '-DSQL_CP_ONE_PER_HENV=2']

For pyodbc-3.0.7 find this line:

settings['extra_compile_args'] = ['-Wno-write-strings']

Edit the line so that it looks like this:

settings['extra_compile_args'] = ['-Wno-write-strings', '-I/opt/teradata/client/15.10/include', '-DSQL_CP_ONE_PER_HENV=2']

NOTE the definition of -DSQL_CP_ONE_PER_HENV=2. The include file sqlext.h distributed with the DataDirect driver manager is missing some definitions from the ODBC 3.0 specification. For instance, the following values are not defined:

/* values for SQL_ATTR_CONNECTION_POOLING */
#define SQL_CP_OFF 0UL
#define SQL_CP_ONE_PER_DRIVER 1UL
#define SQL_CP_ONE_PER_HENV 2UL
#define SQL_CP_DEFAULT SQL_CP_OFF

The snippet below illustrates code from a connection routine in pyodbc. Note the use of the attribute value SQL_CP_ONE_PER_HENV, which is missing from the include files.

if (bPooling)
{
    if (!SQL_SUCCEEDED(SQLSetEnvAttr(SQL_NULL_HANDLE, SQL_ATTR_CONNECTION_POOLING,
                                     (SQLPOINTER)SQL_CP_ONE_PER_HENV, sizeof(int))))
    {
        Py_FatalError("Unable to set SQL_ATTR_CONNECTION_POOLING attribute.");
    }
}

In order to build pyodbc successfully, we inject this value through the build script. Moreover, connection pooling is not supported by the Teradata ODBC driver on UNIX platforms and should therefore be disabled in scripts to avoid connection failures. The next section shows how to turn off connection pooling.

Build:

python setup.py build

Install as root:

python setup.py install
running install
running build
running build_ext
running install_lib
copying build/lib.linux-x86_64-2.7/pyodbc.so -> /usr/lib64/python2.7/site-packages
running install_egg_info
Removing /usr/lib64/python2.7/site-packages/pyodbc-2.1.6-py2.7.egg-info
Writing /usr/lib64/python2.7/site-packages/pyodbc-2.1.6-py2.7.egg-info

To verify successful installation of pyodbc, execute the following in an interactive Python session:

import pyodbc
pyodbc.apilevel

It should print out the string constant '2.0' indicating this module supports DB API level 2.0:

'2.0'

Before connecting to the database, we need to make sure the ODBCINI environment variable is set.

We will be using the 64-bit version of the ODBC driver, since the 64-bit pyodbc library was installed. The path depends upon the version of the driver installed.

Execute the following:

ODBCINI=/opt/teradata/client/15.10/odbc_64/odbc.ini
export ODBCINI
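The same check can be performed from within Python before connecting. The sketch below falls back to a default path if the variable is not already exported; the path is an assumption matching a default 64-bit TDODBC 15.10 install and should be adjusted to the local driver version:

```python
import os

# Assumed default location of the 64-bit driver's odbc.ini (TDODBC 15.10);
# adjust to match the installed driver version
DEFAULT_ODBCINI = '/opt/teradata/client/15.10/odbc_64/odbc.ini'

# Use the exported value if present, otherwise fall back to the default
odbcini = os.environ.setdefault('ODBCINI', DEFAULT_ODBCINI)
print('ODBCINI=%s' % odbcini)
```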

2.2 Create Sample Data

This section examines a script that generates the sample data in the Payroll DSN used throughout the rest of this tutorial.

The pyodbc workflow for interacting with the database is very similar to a generic ODBC application workflow, and can be described with the following steps:
  1. Open a database connection using the connect() function;
  2. Create a cursor object using the cursor() function of the connection object;
  3. Execute a SQL statement using the execute() function of the cursor object;
  4. Disconnect with the close() function.

2.2.1 Connection

To begin, the script imports the appropriate Python module for connecting to the database product being used; in the case of this tutorial, that is pyodbc.

import pyodbc

Then connection pooling is disabled. As mentioned in the previous section, the reason is the limitations of the ODBC driver on UNIX platforms.

pyodbc.pooling = False

The next step is to create an object that represents a database connection. 

connection = pyodbc.connect('DSN=Payroll')

Note that the argument to the connect() function is an ODBC connection string. If required, additional connection attributes, such as the user name and password, can be supplied in the connection string:

connection = pyodbc.connect('DSN=Payroll;UID=user;PWD=password')
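Because a connection string is just a semicolon-delimited list of key=value pairs, it can be assembled from parts. The helper below is a hypothetical convenience, not part of pyodbc; pyodbc itself simply takes the finished string:

```python
def make_conn_str(dsn, **attrs):
    """Build an ODBC connection string such as 'DSN=Payroll;UID=user'.

    Hypothetical helper: extra keyword arguments (uid, pwd, ...) are
    appended as upper-cased KEY=value pairs after the DSN.
    """
    parts = ['DSN=%s' % dsn]
    parts.extend('%s=%s' % (key.upper(), value)
                 for key, value in sorted(attrs.items()))
    return ';'.join(parts)

# Usage (with pyodbc):
# connection = pyodbc.connect(make_conn_str('Payroll', uid='user', pwd='password'))
```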

2.2.2 Cursor

Next, the script creates a cursor object. Cursors act as handles for a given SQL statement. They allow retrieval of one or more rows of the result set. 

cursor = connection.cursor()

The cursor's execute() method prepares and executes the SQL statement:

cursor.execute("DROP TABLE employee")

Note that DDL statements should be committed:

connection.commit()

Alternatively, autocommit mode can be enabled after the connection is established:

connection.autocommit = True

2.2.3 Cleanup

The last step is closing the connection:

connection.close()

2.2.4 Complete Example

# import pyodbc module
import pyodbc

# disable connection pooling
pyodbc.pooling = False

# create connection
connection = pyodbc.connect('DSN=Payroll;UID=;PWD=')

# enable auto commit
connection.autocommit = True

# print driver and database info
print '-ODBC version =',connection.getinfo(10)
print '-DBMS name =',connection.getinfo(17)
print '-DBMS version =',connection.getinfo(18)
print '-Driver name =',connection.getinfo(6)
print '-Driver version =',connection.getinfo(7)
print '-Driver ODBC version =',connection.getinfo(77)

# create cursor
cursor = connection.cursor()

# drop table 'employee' if it already exists
if cursor.tables(table='employee').fetchone():
    cursor.execute("DROP TABLE employee")

# create employee table
cursor.execute("CREATE SET TABLE employee (employee_number INTEGER NOT NULL PRIMARY KEY, last_name VARCHAR(50) NOT NULL, first_name VARCHAR(50) NOT NULL)")

# populate employee table with sample data
cursor.execute("INSERT INTO employee VALUES (2, 'Olson', 'Chuck')")
cursor.execute("INSERT INTO employee VALUES (3, 'Lee', 'Bill')")
cursor.execute("INSERT INTO employee VALUES (4, 'Chapman', 'Lisa')")
cursor.execute("INSERT INTO employee VALUES (1, 'Miller', 'Susan')")

#disconnect
connection.close()

2.3 Retrieve Data

To retrieve data using pyodbc, a Python application executes the usual steps to establish a connection, as described in the section above. The actual data fetch is then accomplished with one of the fetch methods of the cursor object.

2.3.1 Cursor

Create cursor: 

cursor = connection.cursor()

The cursor's execute() method prepares and executes the statement:

cursor.execute("select first_name, last_name from employee")

Row objects are returned from the cursor's fetch functions. As specified in the DB API, they are tuple-like and can be accessed with indices:

for row in cursor:
    print row[0], row[1]

Pyodbc also allows more convenient access to the values by column name:

for row in cursor:
    print row.first_name, row.last_name

Note that results can also be retrieved using the different fetch methods provided by the cursor object, such as fetchone(), fetchall(), etc.:

row = cursor.fetchone()
print row.first_name, row.last_name
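Rows can also be converted into plain dictionaries using cursor.description, whose first element for each column is the column name (per the DB API). The helper below is a sketch, not part of pyodbc, and works with any DB API cursor's description and row tuples:

```python
def rows_to_dicts(description, rows):
    """Map DB API result rows to dicts keyed by column name.

    description is cursor.description; each entry's first item is the
    column name. rows is any sequence of row tuples.
    """
    names = [col[0] for col in description]
    return [dict(zip(names, row)) for row in rows]

# Usage with pyodbc:
# employees = rows_to_dicts(cursor.description, cursor.fetchall())
```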

Once work with the cursor is done, it should be closed:

cursor.close()

For more details refer to [PYODBC] and [PYTHON DAPI].

2.3.2 Complete example

# import pyodbc module
import pyodbc

# disable connection pooling
pyodbc.pooling = False

# create connection
connection = pyodbc.connect('DSN=Payroll')

# create cursor
cursor = connection.cursor()

# execute SQL statement
cursor.execute("select first_name, last_name from employee")

# fetch result set rows
for row in cursor:
    print row.first_name, row.last_name

# close cursor
cursor.close()

# disconnect
connection.close()

5 Comments
Teradata Employee

Is it correct to assume that there cannot be more than one ODBC Driver Manager installed at the same time? Since Teradata doesn't support unixODBC on Linux, does this mean unixODBC will need to be uninstalled (and hence any other ODBC database drivers that rely on it) before installing the DataDirect ODBC DM?

Also, I recommend the tdodbc module from https://github.com/Teradata/PyTd, which is supported by Teradata and is a pure-Python wrapper around the ODBC driver. I found pyodbc to be buggy and not very well supported.

Teradata Employee

I have used the Teradata ODBC drivers with unixODBC multiple times before, never had an issue. Are you referring to a specific problem?

Teradata Employee

The instructions say:

Update the lines as follows:

extra_compile_args = None
extra_link_args = ['-L/opt/teradata/client/15.10/lib64']

Is that for 2.1.6 only?  "extra_link_args" does not appear in the 3.0.7 setup.py.  Does it need to be added?

Teradata Employee

It's good to know that the Teradata ODBC drivers work with the unixODBC DM. The article suggests that the unixODBC DM is not supported on Linux. I wasn't sure if it was possible to have both the DataDirect and unixODBC DMs installed simultaneously. If so, will the appropriate DM be used automatically, or will some configuration changes be needed?

Also, there is "works" and then there is "supported". If you run into any issues, or if some future changes cause the diver+DM combination to stop working, you are out of luck if not officially supported.

Please, could someone point out where I can read/study more to learn how to optimize the batch size when using Python to insert data into a Teradata table using ODBC?

 

For example, when I issue the "session.executemany( query2, row_array, batch = True )" command, the array named "row_array" is an n-dimensional array. I'd like to learn the factors that affect performance, and how many rows I should configure per batch. The variable "query2" contains the parameterized insert statement, for example: "query2='INSERT INTO UI_Results_DB.tableName (Unique_Id,field1, field2...) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)'".
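One factor that is easy to control from the script side is how the input rows are split up before each executemany() call. A generic chunking helper (a sketch, not part of the teradata module; the right batch_size has to be found empirically):

```python
def batches(rows, batch_size):
    """Yield successive slices of rows, each at most batch_size long.

    Generic helper: each yielded slice can be passed to
    session.executemany(query, batch, batch=True).
    """
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# Usage with the teradata module:
# for batch in batches(row_array, 2048):
#     session.executemany(query2, batch, batch=True)
```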

 

I've experimented and it seems that somewhere between 2000 and 4000 rows, there is a problem. At 2000 rows, the import is reasonably speedy.

 

Let me share a sample program below, for reference. Before I removed specific info for my company, it worked fine. I need to implement best-practice functional and error-handling techniques. I think when I pasted the program, all the tabs/leading spaces were removed, which is disastrous for Python. Any constructive criticism would be welcome:

 

#------------------------------------------
#add try: and except: error handling
#https://github.com/Teradata/PyTd
#https://developer.teradata.com/tools/reference/teradata-python-module?
#ericscheie, escheie
#The module is hosted on PyPi: https://pypi.python.org/pypi/teradata
#Read one line of data from datafile. Parse the pipe-delimited data into a 32-item list. Append to an array. After the array is X rows deep, send to Teradata.
#Continue cycling through the read-parse-append-upload process until the source datafile is completely read.
#------------------------------------------
import teradata
import fileinput
udaExec = teradata.UdaExec(appName="Test_Import", version="1.0", logConsole=False)
session = udaExec.connect(method="odbc", system="serverName.com", username="username", password="password")
#create table in teradata db. Or, create table outside of Python program.
source_file = 'C:\\path\\datafile.txt'
#f=open(source_file,'r')
query2 = 'INSERT INTO UI_Results_DB.table_name (field1, field2, field3... fieldN) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)'
row_array = []
outer_loop = 0
inner_loop = 0
batch_size = 2048
with open(source_file, 'r') as f:
    #--------------------------------------------------
    for line in f:
        numbers = line.split(sep='|')  #problem with empty values in record that is numeric in destination db table.
        row_array.append(numbers)
        inner_loop = inner_loop + 1
        #print('Inner Loop' + str(inner_loop))
        if inner_loop >= batch_size:  #Find the near-optimum batch size for the ODBC connection.
            #Upload batch to Teradata.
            try:
                session.executemany(query2, row_array, batch=True)
                session.commit()  #Teradata library may auto-commit.
            except:
                print('Error Uploading Data. Outer_Loop = ' + str(outer_loop) + '. Inner Loop = ' + str(inner_loop))
            finally:
                print('Executing Finally Clause1')
                row_array = []  #Empty array to prepare for next batch.
                inner_loop = 0
                outer_loop = outer_loop + 1
                print('Outer Loop = ' + str(outer_loop))
    #Loop ends when source file is completely read. The final batch will probably not match the batch size, therefore the second upload block.
    #--------------------------------------------------
    #Upload final batch to Teradata.
    try:
        session.executemany(query2, row_array, batch=True)
        session.commit()  #Teradata library may auto-commit.
    except:
        print('Error Uploading Data. Outer_Loop = ' + str(outer_loop) + '. Inner Loop = ' + str(inner_loop))
    finally:
        print('Executing Finally Clause1')
        row_array = []  #Empty array to prepare for next batch.
        inner_loop = 0
        outer_loop = outer_loop + 1
        print('Outer Loop = ' + str(outer_loop))
del row_array, outer_loop, inner_loop, batch_size
print('Collecting Statistics')
session.execute("COLLECT STATISTICS ON ui_results_db.table_name INDEX( Unique_Id );")
cur = session.cursor()
cur.execute("SELECT COUNT(*) FROM UI_Results_DB.table_name;")
for row in cur:
    print('Row Count = ' + str(row))
session.close()
#print('Teradata Session Closed =' + str(session.closed))
print('File Closed =' + str(f.closed))
#------------------------------------------