Large Objects - Part 1 Loading


In part 1, we will look at getting those Large Objects of yours into Teradata. One way or another we will get them in, kicking and screaming if need be (however, it will be relatively painless, I promise), and we will do this despite any "Objections" "Large" may have.

Later in part 2, we will drag those Large Objects back out and use them in a web application.

What is a Large Object?

As the name implies, Large Objects allow you to store large things in a database. They come in two flavours (sadly, neither Chocolate nor Vanilla):

  • Character Large Object or CLOB – used to store character data (e.g. text documents including plain text, XML, HTML, Rich Text Format etc) 
  • Binary Large Object or BLOB – used to store binary data such as images, sounds and structured binary documents (e.g. pdf, Word, Excel etc).

These sound very similar to existing data types (char/varchar and byte/varbyte), so why do we need them? Well, there are a couple of differences between Large Objects and the existing data types.

Large Objects and the “equivalent” data types

The first and most obvious difference is that the maximum size of Large Objects is significantly larger than their “equivalent” data types.

The maximum size of char, varchar, byte and varbyte (on Teradata) is around 64KB. For character fields this translates to around 64,000 characters for single-byte character encoding schemes (e.g. ASCII) or around 30,000 characters for multi-byte character encoding schemes (e.g. Unicode). Large Objects can be up to 2 gigabytes (GB) in size.

Storing Large Objects

A second difference is how Large Objects are stored within the database.

Astute readers will note that the maximum row size (i.e. the combined size of all columns in a single row) is around 64KB; the actual value depends on the database version. This presents us with a bit of a conundrum: if the maximum row size is 64KB, how can we store (multiple) objects of up to 2GB in it?

The database internals folks tell us that Large Objects are not actually stored in the “row”. Instead they are stored in a sub-table.

Table and sub table containing extended data

In simple terms, a sub-table is a set of data that augments one or more tables in some way. Although we might not think of them as such, the most common examples of sub-tables are indexes other than the primary index, such as secondary indexes and join indexes.

Management of the sub-table is handled automatically by Teradata. In most, if not all, circumstances, this “separation of data” within the row is completely transparent and doesn’t affect anything that you would be doing.

While splitting the row up into "basic attributes" (e.g. name, id, etc.) and Large Objects might sound like a bad idea, it is actually a good one. Most queries filter on the basic data stored in the numeric and character (i.e. varchar) attributes, not on the content of the Large Objects. If you are performing an "all rows scan", for example, splitting the rows like this will make the query run faster on a table that contains Large Objects.

I do not wish to get into an internals discussion (that may be the subject of a future article), but consider this query:

select count(*)
from customer
where gender = 'F';

When running this query, Teradata needs to scan the table to satisfy the "gender = 'F'" condition. This means reading each row (or rather each block of data) from disk into memory, checking the gender column for the value 'F' and counting the row if the condition is satisfied.

Getting down to tin tacks, the performance of the query is directly related to the size of the rows being read from disk. It is not possible for any database to read just the "gender" column – the entire row must be read and the gender column extracted from it. In fact, databases don't even read individual rows – they read blocks, each of which can contain multiple rows.

If the rows are large, the query will take longer than if they were smaller. Why? Because smaller rows mean more rows fit into each block of data, and hence more rows are read per block. Hint: there is a performance tip hidden here for those of you building online "tactical" applications that need to run all rows scan queries (e.g. a search by name or partial name).
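For a rough, purely illustrative calculation: assuming a (hypothetical) 64KB data block, 500-byte rows pack roughly 130 rows per block, while 6,500-byte rows pack only about 10. An all rows scan over a million such rows would therefore read on the order of 8,000 blocks in the first case and 100,000 in the second. The numbers are invented, but the ratio is the point.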

If the above customer table had several Large Objects (e.g. a photo, digital signature etc) and these were stored in the row (as opposed to being stored in the sub-table), then the rows would be larger and queries such as the one above would take longer to run because more I/O is required. By splitting this "unnecessary" data out of the primary table and placing it in the sub-table, performance can be improved.

Having said all that, let me add that common sense must prevail. I am absolutely not advocating that anyone should go out and remove all the less frequently used attributes from your entities, encode them into a BLOB or CLOB and store them in a sub table on the basis that you are unlikely to filter your results based upon those less commonly used values. All you will achieve is to make life difficult – but if masochism is one of your guiding principles …

Loading Large Objects

Resistance is futile; whatever “Large’s” “Objection” is, he (or she) will be loaded into the database whether he/she likes it or not, kicking and screaming if need be (although it will be relatively painless I promise).

So what are our options?

Options

There are three main options to get large objects into the database. These are:

  • Use a Teradata utility such as bteq or SQL Assistant (a.k.a. Queryman),
  • Use a third party utility that has good Large Object support,
  • Write your own program using CLI, JDBC or ODBC.

I will briefly cover the Teradata Utility option and present a Java (i.e. JDBC) application to load data. The code for the application is attached.

The table

In all of the examples, the following table will be used:

create table devX_lob (
    id varchar(80),
    mime_type varchar(80),
    char_lob character large object,
    binary_lob binary large object
)
unique primary index (id);

Both types of Large Object are present in the table, but the examples predominantly use the Binary Large Object. This is because Binary Large Objects are typically cooler than Character Large Objects and that’s what this article is all about – being cool and impressing your friends (or not)!

The Primary Key for this table is the id, which is a varchar. The reason for this will become apparent in part 2 of this article, but for now let's just say it contains the name of the file representing the Large Object. Finally, there is the mime_type; this also becomes important in part 2, where we output the Large Objects in a web application.

Load LOBs with a Teradata Utility

bteq

Until recently, bteq did not really support Large Objects. However, more recent versions of bteq (12.00.00.02 or later) do provide support for them.

The following example shows how a control file (lobjobbteq.txt) can be used to control the loading of large objects.

delete from devX_lob all;
.import vartext ',' LOBCOLS=1 file='/temp/lob/lobjobbteq.txt';

.repeat *
using (lobFile blob as deferred, id varchar(40), lobType varchar(1), mimeType varchar(40))
insert into devX_lob (id, mime_type, binary_lob)
values (:id, :mimeType, :lobFile);

The content of the control file is this:

/temp/lob/thumbs down.gif,thumbsdown.gif,B,image/gif
/temp/lob/thumbs up.gif,thumbsup.gif,B,image/gif
/temp/lob/crying.gif,crying.gif,B,image/gif
/temp/lob/small.pdf,small.pdf,B,application/pdf

The Examples.zip attachment contains both of the above files together with the images and pdf.

Once the bteq script has been run, there are four records in the table as can be seen in the result set window.

Since this example does not use CLOBs, the char_lob column is always null.

The LOB data is identified by file name. The file names must be located in the first column(s) of the control file, as indicated by the LOBCOLS option.

TPT

Support for large objects can also be found in TPT; refer to the TPT documentation for details.

SQL Assistant (QueryMan)

The tool formerly known as QueryMan has featured basic Large Object support for some time. You can import Large Objects using SQL Assistant by setting up an import job. An import job consists of the same elements as the bteq import, but is a little simpler.

First you need a parameterised query:

insert into devx_lob (id, mime_type, binary_lob)
values (?, ?, ?B);

The question mark fields in the “values” clause correspond to the columns in your control file. The “question mark B” column tells SQL Assistant that the third column is a BLOB and that the control file lists the file name of the blob.

The control file looks like this:

thumbsdown.gif,image/gif,thumbs down.gif
thumbsup.gif,image/gif,thumbs up.gif
crying.gif,image/gif,crying.gif
small.pdf,application/pdf,small.pdf

Finally you set SQL Assistant into “Import mode” and run the query. When you do, SQL Assistant will prompt you for the location of the control file. Notice that the file names (the 3rd column) do not have paths. SQL Assistant assumes that the files representing the LOB data are in the same directory as the control file. If you attempt to use path names to locate files in other directories SQL Assistant will abort the import and display an error in the status bar.

The Examples.zip attachment contains the control script together with the images and pdf. You will need to copy and paste the query from the article.

When I tried this, SQL Assistant successfully imported the 3 graphics files, but for an unknown reason, it could not import the PDF. The error reported was “A data file specified on record 4 is missing or too large”. The PDF was only 51.5K, which is not too large, and it definitely existed in the same directory as the 3 GIF files. The other two methods presented here both successfully imported the 4 files. My suspicion is that this is an issue in my environment, not an issue with SQL Assistant, ODBC or the import function.

Update: Refer to the comments below for the solution to the above problem.

Write your own

At first thought, writing your own utility to import Large Objects may seem like overkill. However, if you are on a bteq version earlier than 12.00.00.02 and encounter a problem similar to the one I had with SQL Assistant, then you may have to resort to this approach.

Fortunately, writing a program to import Large Objects into the database isn't that hard. And once you have the basic program, you can modify it to work any way you want it to.

Your choices are to write the program using Java (and JDBC) or one of the many languages that use ODBC based database connectivity (e.g. C, C++, Visual Basic etcetera). In this article we will use Java and JDBC.

A JDBC Lob Loader

In the attachment (LOBLoad.zip), you will find a complete NetBeans project that contains the source code for LobLoad. The program is relatively simple, so simple that the entire project consists of just one class (LobLoad.java).

LobLoad is a Java program that, curiously enough, reads a control file and loads Large Objects into the database using JDBC. The format of the control file is:

  • The id of the Large Object (id column)
  • The type of Large Object (C or B meaning character or binary). This determines which column in the above table is being loaded.
  • The MIME type which is used in part 2
  • The name of the file containing the Large Object.

The sample control file looks like this:

thumbsdown.gif,B,image/gif,/temp/lob/thumbs down.gif
thumbsup.gif,B,image/gif,/temp/lob/thumbs up.gif
crying.gif,B,image/gif,/temp/lob/crying.gif
lobjob.btq,C,text/plain,/temp/lob/lobjob.btq
small.pdf,B,application/pdf,/temp/lob/small.pdf

Most of the program relates to establishing the database session and processing the control file.
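Establishing the session is standard JDBC. As a minimal sketch (the system name, user and password below are placeholders, and your connection parameters may well differ), it might look something like this:

import java.sql.Connection;
import java.sql.DriverManager;
...
// Register the Teradata JDBC driver and connect.
// "mytdsystem", "myuser" and "mypassword" are placeholders - substitute your own values.
Class.forName("com.teradata.jdbc.TeraDriver");
Connection connection = DriverManager.getConnection(
        "jdbc:teradata://mytdsystem/CHARSET=UTF8",
        "myuser", "mypassword");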

The part relating to the query that actually loads the Large Objects is as follows:

...
sqlText = "insert into devX_lob (id, mime_type, char_lob, binary_lob)\n" +
"values (?, ?, ?, ?);";
PreparedStatement lobStmt = connection.prepareStatement(sqlText);
...

The above establishes the query we will be using to load the Large Objects and prepares it ready for use. Preparing a query means sending the query text (without values) to Teradata for parsing. The data values will be supplied at a later time, which is what the rest of the program does.

The control file is read within the loop. The loop is executed once for each input line. Each time through, the line is parsed (split into its component parts). For the sake of brevity I’ve omitted the loop and the sections that extract and set up the other values from the control file.
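For the curious, a minimal sketch of what such a loop might look like follows (imports from java.io assumed; the field order matches the sample control file above, and the variable names are illustrative rather than necessarily those used in LobLoad.java):

BufferedReader control = new BufferedReader(new FileReader(controlFileName));
String line;
while ((line = control.readLine()) != null) {
    // id, C/B flag, mime type, file name - comma separated, as in the sample above.
    String[] fields = line.split(",", 4);
    String id          = fields[0].trim();
    boolean isBlob     = fields[1].trim().equalsIgnoreCase("B");
    String mimeType    = fields[2].trim();
    String lobFileName = fields[3].trim();
    // ... open the file and execute the insert as shown below ...
}
control.close();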

The following shows how the file representing the large object is opened. The InputStream is then passed to our PreparedStatement for transmission to Teradata.

File lobFile = new File(lobFileName);
FileInputStream lobInputStream = new FileInputStream(lobFile);
long lobSize = lobFile.length();

lobStmt.setString(1, id);
lobStmt.setString(2, mimeType);
if (isBlob) {
    lobStmt.setNull(3, Types.CLOB);
    lobStmt.setBinaryStream(4, lobInputStream, (int) lobSize);
}
else {
    lobStmt.setAsciiStream(3, lobInputStream, (int) lobSize);
    lobStmt.setNull(4, Types.BLOB);
}
lobStmt.executeUpdate();
lobInputStream.close();

In the above code, we establish a FileInputStream, but this could be any of the other types of InputStream. For example if you wish to store (persist) a Java Object in your table, you could convert it to a byte array (byte []) and use a ByteArrayInputStream to store it in a BLOB column.
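As a rough sketch of that idea (assuming your object implements java.io.Serializable; the serialisation details will vary), the conversion might look like this:

// Serialise the object into a byte array ...
ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
ObjectOutputStream objectOut = new ObjectOutputStream(bytesOut);
objectOut.writeObject(myObject);    // myObject is any Serializable instance
objectOut.close();
byte[] objectBytes = bytesOut.toByteArray();

// ... and feed it to the BLOB parameter just as we did with the file.
lobStmt.setBinaryStream(4, new ByteArrayInputStream(objectBytes), objectBytes.length);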

After setting the id and mime type, we decide if we are working with a Binary Large Object or a Character Large Object. If we are working with a BLOB then the 3rd parameter (the CLOB) is set to null and the InputStream is supplied to parameter 4 (the BLOB). Otherwise we have a CLOB and these assignments are reversed (the CLOB gets the InputStream and the BLOB is null).

Finally we execute the statement which causes the row to be created and our file to be uploaded into either the BLOB or CLOB column. With a bit of luck our Large Objects will appear in the database. 

A note for Java 6.0 users

The astute Java 6.0 reader may wonder why I went to all the bother to use the particular setXxxStream method that I did.

If you refer to the java.sql.PreparedStatement javadocs for release 6.0, you will find there are two additional setXxxStream methods which would have been more convenient to use. Indeed there are more than just two, but the point is the same. In the case of setBinaryStream, these additional methods are:

void setBinaryStream(int parameterIndex, InputStream x, long length);
void setBinaryStream(int parameterIndex, InputStream x);

Equivalent signatures exist for setAsciiStream. Both of the above methods would be more convenient to use:

  • If I used the first method, the cast of the file length to int would be unnecessary (saving a whopping 6 characters per call, not to mention the corresponding productivity boost!).
  • If I used the second method all the code relating to figuring out how big the file is (and the cast to int) would not be required. This includes the construction of the File object whose only purpose in life is to obtain the length of the file (FileInputStream has a constructor that takes the file name as a String and thus does not need a File).
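To make that concrete, with Java 6 the stream set-up above would reduce to something like this sketch (don't actually use it with the current driver, for the reasons below):

// Java 6 style - no File object and no length calculation required.
// Compiles, but throws AbstractMethodError with a JDBC 3.0 driver (see below).
lobStmt.setBinaryStream(4, new FileInputStream(lobFileName));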

So why did I use such a “convoluted” approach? The answer is twofold:

  • Again in reference to the javadoc, you will note that the methods listed above are new in 1.6 (a.k.a. 6.0) and thus users of earlier JDKs would not have access to these methods. This means the attached code wouldn’t work for them.
  • The JDBC driver is compiled to be compliant with the JDBC 3.0 specification (which is part of Java 5.0 and 1.4). Thus the driver does not include these methods.

So if you are using JDK 6.0 (like me) you could try to call these methods. If you do, your program will compile, but you will get an “AbstractMethodError” thrown because the methods simply do not exist in the JDBC driver. This seems to apply up to and including version 12.00.00.106 of the driver. 

Conclusion

Hopefully this article gives you some ideas as to how you might go about loading Large Objects into Teradata.

In part 2 of this article, we will look at retrieving our Large Objects and using them in a web application. Stay tuned!

17 REPLIES

Re: Large Objects - Part 1 Loading

Waiting anxiously to retrieve the stored objects...

Re: Large Objects - Part 1 Loading

The problem you got when loading the PDF file using SQL Assistant was probably the fact that SQL Assistant has an option to specify the maximum size of a LOB to be imported.
My guess is that you had this value set smaller than 51K, which would result in the message that the LOB was too large.

This option is used to limit the amount of memory it consumes but it must always be set greater than the largest file you are planning to import.

Note that SQL Assistant 12 is limited to a maximum LOB size of 10MB;
version 13 has no such limit.
(BTEQ was limited to 64K - not sure about 12.0 or 13.0)

Re: Large Objects - Part 1 Loading

Please add a "bteq" tag to the article. Thanks. Also, BTEQ supports deferred mode transfer of LOB files starting with the 12.00.00.02 version. It is its support for inline mode transfer that is limited to 64K.

Re: Large Objects - Part 1 Loading

Thanks Mike for the tip relating to the option in SQL Assistant that specifies the maximum LOB size.
We had a bit of discussion on the topic and discovered that in addition to setting this parameter there is a setting in the ODBC Data Source definition that must also be checked.

1) Select your Data source and choose configure.
2) Click the "Options >>>" button
3) Ensure that "Use Native Large Object support" is checked.

Re: Large Objects - Part 1 Loading

The attachment seems to be missing.

Re: Large Objects - Part 1 Loading

Hmmm, it was there, but now it is missing.
I've re-added it - all the best.

Re: Large Objects - Part 1 Loading

Windows Scripting Host & OLEDB/ADO is an alternative for uploading & downloading BLOBs/CLOBs from/to OS files as explained here:

http://carlosal.wordpress.com/2008/11/23/insertar-ficheros-de-imagen-en-teradata-usando-windows-scripting-host-y-ado/

http://carlosal.wordpress.com/2010/01/26/guardar-datos-de-columnas-blob-en-teradata-como-ficheros-usando-windows-scripting-host-y-oledbado/

HTH.

Cheers.

Carlos.


Re: Large Objects - Part 1 Loading

Attachment Examples.zip seems to be missing. Please re-add it. Thx

Re: Large Objects - Part 1 Loading

I've re-added it. Hopefully it sticks around this time :-)