Picture This - Teradata and Digital Image Processing

As most of you might agree, managing our collections of digital pictures is becoming quite a challenge.  The number of photos continues to grow and now includes pictures from cameras as well as multiple mobile devices.  And to add to my troubles, I find that I have duplicate copies in different folders and on different computers.  Getting this organized is becoming a high priority.  Sure, there are management solutions already available, but hey, we're tech people and it's more fun to try to build our own!  With the free Teradata Express database and some Java coding, we have the right tools to get started.

Most databases, including Teradata, support storing large binary objects using the SQL BLOB data type.  This is exactly how we'll store our digital photos.  But a key message that I really want to demonstrate is that the Teradata platform is much more than storage.  Sure, Teradata is the leader in large-scale database capacity: with our MPP platform, we can scale from gigabytes to petabytes by adding more and more hardware nodes.  But just as importantly, Teradata is also an extremely powerful processing platform.  Data analytics on these large volumes of data is the true value that Teradata brings to its customers.  So even with these simple digital photo examples, what I really hope to demonstrate is that you can apply compute processing to these binary data objects using SQL and custom Java that can be as functional as you need it to be.

ELT vs ETL

When it comes to preparing data for loading into a database, there are two basic schools of thought on where that processing should take place.  The first is that the preparation should be done before loading.  This follows the standard ETL process: (Extract), Transform, and then Load.  The other camp leverages the processing capabilities of the database platform: ELT - (Extract), Load, and then Transform.  The general idea here is that the source data is inserted into temporary load tables first, where any transformation processing can be done, followed by moving the records to their proper target tables.  These "load first, transform second" jobs benefit from Teradata's parallel processing prowess.

Loading LOBs

OK, let's get some data into Teradata.  Here on Developer Exchange we have some good articles on how to deal with these BLOB data types.

Since we're flexing our Java muscles with this exercise, I've included a very simple Java file-loading example/tool to get us started.  This tool loads our files into a temporary loading table named "image_load_table", which is created by our ImageLoader tool (below).

CREATE TABLE image_load_table
(
  id INTEGER,
  source VARCHAR(255),
  image BLOB
)

The id column gives us a very simple record identifier while the source field is where we can store the path/filename.  And of course the BLOB field is where we'll put our image bytes.
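As a side note, if you're wondering how the tool can issue that DDL itself, a minimal sketch of such a table-creation step might look like the following.  This is my own illustration rather than the attached code; it assumes the tool's conn and db fields (shown later) and ignores Teradata's "table already exists" error, code 3803, so reruns don't fail.

// sketch: create image_load_table at startup, ignoring error 3803
// ("table already exists") so the tool can be rerun safely
private static void createLoadTable() throws SQLException
{
    String ddl = "CREATE TABLE " + db + ".image_load_table" +
        " (id INTEGER, source VARCHAR(255), image BLOB)";

    Statement stmt = conn.createStatement();
    try
    {
        stmt.executeUpdate(ddl);
    }
    catch (SQLException e)
    {
        if ( e.getErrorCode() != 3803 ) // 3803 = table already exists
            throw e;
    }
    finally
    {
        stmt.close();
    }
}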

Quick and Simple Java Load Tool

The Java load tool has two simple requirements.  The first is to create a target database that we will use to load our images, with 'CREATE TABLE' permission granted to our user.  And while we are doing this, there are a few other GRANTs that we'll need in order to deploy and execute the Java UDFs that we will be creating later on, so I'll add that syntax here too.  This is easily done with the following SQL statements executed by the dbc user:

CREATE DATABASE {imagedb} AS PERM = 500000000

GRANT CREATE TABLE ON {imagedb} TO {user}

GRANT EXECUTE PROCEDURE ON SQLJ TO {user}

GRANT CREATE EXTERNAL PROCEDURE ON {imagedb} TO {user}
GRANT CREATE FUNCTION ON {imagedb} TO {user}

where {imagedb} is the name of the database that you are creating for this project and {user} is the database login for your connection.  The 'PERM' parameter is the size of the new database in bytes.  In this example I'm using 500 MB, which should give us just enough space.  I've kept this simple on my Teradata Express instance and am using the dbc user and a database named imagedb.

The second requirement is that our images are in a subdirectory named "images" relative to where we execute our Java tool.  All files in this directory will then be loaded by the tool (see the optional filter sketch below).  For the examples in this article, I've focused strictly on JPEG images.
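If you'd rather have the tool skip anything that isn't a JPEG instead of loading every file it finds, the dir.listFiles() call in the listing below could be swapped for a filtered version.  This tweak is my own illustration and is not in the attached jar:

// optional tweak (requires import java.io.FilenameFilter):
// only pick up JPEG files from the images directory
File[] children = dir.listFiles(new FilenameFilter()
{
    public boolean accept(File d, String name)
    {
        String lower = name.toLowerCase();
        return lower.endsWith(".jpg") || lower.endsWith(".jpeg");
    }
});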

The key code snippets below show connecting to Teradata with JDBC, looping through the directory, and loading the BLOB data:

public class ImageLoader
{
    public static void main(String[] args)
    {
        if ( args.length != 4 )
        {
            System.out.println("Required Parameters: [Teradata host address] [user] [password] [database]");
            return;
        }

        conn = getConnection(args);
        .......

        System.out.println("Scanning " + dir.toString());

        File[] children = dir.listFiles();

        // loop through all files and insert them into the database;
        // get the starting index for the row id in case the table already contains rows
        int startIndex = getStartIndex();

        for (int i = 0; i < children.length; i++)
        {
            if ( children[i].isFile() )
            {
                loadImage(i + startIndex, children[i]);
        ....

    private static Connection getConnection(String[] args)
    {
        Connection conn = null;

        String host = args[0];
        String user = args[1];
        String pwd = args[2];
        db = args[3];

        String url = "jdbc:teradata://" + host + "/" + db + "/";

        System.out.println("Connecting to Teradata database " + db + " at address " + host);

        try
        {
            // load the Teradata JDBC driver, then open the connection
            String driverClassName = "com.teradata.jdbc.TeraDriver";
            Class.forName(driverClassName);
            conn = DriverManager.getConnection(url, user, pwd);
        .......
        return conn;
    }

    private static void loadImage(int id, File file)
    {
        String sql = "insert into " + db + ".image_load_table(id,source,image) values" +
            " ( ?, ?, ?)";

        try
        {
            java.sql.PreparedStatement stmt = conn.prepareStatement(sql);
            System.out.println("Loading image file: " + file.getName());

            stmt.setInt(1, id);
            stmt.setString(2, file.getName());

            // stream the file's bytes into the BLOB parameter
            FileInputStream imageInputStream = new FileInputStream(file);
            long size = file.length();
            stmt.setBinaryStream(3, imageInputStream, (int) size);

            stmt.executeUpdate();
            imageInputStream.close();

            stmt.close();
            conn.commit();
        ..........
    }
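The getStartIndex() helper is elided in the snippet above.  Here is a minimal sketch, under the assumption that it simply continues numbering after the highest id already in the load table:

// sketch of the elided helper: continue ids after any existing rows
// (returns 0 when the table is empty)
private static int getStartIndex()
{
    int start = 0;
    try
    {
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
            "SELECT COALESCE(MAX(id), -1) + 1 FROM " + db + ".image_load_table");
        if ( rs.next() )
            start = rs.getInt(1);
        rs.close();
        stmt.close();
    }
    catch (SQLException e)
    {
        System.out.println(e);
    }
    return start;
}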

On my system I have created a directory "c:\projects\ImageLoader" where I've put the ImageLoader.jar file, along with the images subdirectory, "c:\projects\ImageLoader\images".

The source code and jar file are attached.  To run this tool from the command line (in the ImageLoader directory), use this syntax with the Teradata connection parameters:

java -jar ImageLoader.jar [Teradata host address] [user] [password] [database]

To load images to my local VMware image of Teradata Express, my command syntax looks like this:

C:\projects\ImageLoader>java -jar ImageLoader.jar 192.168.100.150 dbc dbc imagedb
Connecting to Teradata database imagedb at address 192.168.100.150
Scanning images
Loading image file: P1020983.JPG
Loading image file: P1020986.JPG
Loading image file: P1020987.JPG
Loading image file: P1020988.JPG
Loading image file: P1020989.JPG
Loading image file: P1020990.JPG
Loading image file: P1020992.JPG
Loading image file: P1020993.JPG
Loading image file: P1020996.JPG
Loading image file: P1020997.JPG
Loading image file: P1030008.JPG
Loading image file: P1030011.JPG
Loading image file: P1030014.JPG
Loading image file: P1030022.JPG
Loading image file: P1030027.JPG
Loading image file: P1030028.JPG
Loading image file: P1030030.JPG
Loading image file: P1030039.JPG
Loading image file: P1030041.JPG
Loading image file: P1030042.JPG
Loading image file: P1030043.JPG
Loading image file: P1030047.JPG
Loading image file: P1030049.JPG
Loading image file: P1030050.JPG
Loading image file: P1030052.JPG
Loading image file: P1030053.JPG
Loading image file: P1030055.JPG
Loading image file: P1030058.JPG
Loading image file: P1030061.JPG

You should now have some images loaded into your database.  A quick check with the Teradata SQL Assistant tool will even show us the images.

Teradata can now serve up these images as part of a database application; for example, here's a Developer Exchange article showing how Teradata can support a web application with images.  But remember, Teradata is much more than a simple storage platform.  Let's see what more we can do with this data.
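As a rough sketch of what that serving piece could look like (the servlet, its "id" parameter, and the shared conn are my own illustration, not taken from the linked article), a servlet can stream the BLOB straight from JDBC to the browser:

// inside a javax.servlet.http.HttpServlet subclass;
// conn is a JDBC connection to Teradata opened elsewhere
protected void doGet(HttpServletRequest req, HttpServletResponse resp)
    throws ServletException, IOException
{
    int id = Integer.parseInt(req.getParameter("id"));
    resp.setContentType("image/jpeg");

    try
    {
        PreparedStatement stmt = conn.prepareStatement(
            "SELECT image FROM image_load_table WHERE id = ?");
        stmt.setInt(1, id);
        ResultSet rs = stmt.executeQuery();
        if ( rs.next() )
        {
            // copy the BLOB stream to the HTTP response in chunks
            InputStream in = rs.getBlob(1).getBinaryStream();
            OutputStream out = resp.getOutputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) > 0; )
                out.write(buf, 0, n);
            in.close();
        }
        rs.close();
        stmt.close();
    }
    catch (SQLException e)
    {
        throw new ServletException(e);
    }
}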

Flexing our Processing Power

Now that we have our binary data loaded into Teradata, let's put Teradata's processing power to use.  Teradata supports custom functions in the form of User Defined Functions (UDFs), which can be written in C or Java.  UDFs open the door to almost unlimited functionality on our data.  As you can imagine, there are many real-world applications that already exist for processing and analyzing digital photos.  A great example is facial recognition, where individuals can be identified in a digital image by a computer algorithm.  Imagine a Teradata platform holding thousands or millions of images, where using SQL we can search through these binary data objects looking for a particular person.  Now that's digital photo management!

OK, I'm not about to embark on writing a facial recognition program.  But I do have a simple example that we can use to demonstrate the binary analytic capabilities of Teradata using a Java UDF.  Remember in the introduction when I said that one of my challenges is that I have duplicate images?  Many of these also have different filenames, so it's a very manual process to find them: basically I view them side by side.  What I need is a way to uniquely fingerprint each image so that I can easily search for duplicates.  Hello, MD5 checksums!

MD5 UDF

MD5 is a hash function that is widely used to verify file integrity, such as when downloading large files.  While collisions have been shown to be possible, it is still a very easy way to uniquely fingerprint files, such as our binary images.

All we need is a very simple Java class that we can deploy to Teradata as a UDF.  Here is an excellent Developer Exchange article that will walk you through the steps to create and deploy a Java function using Teradata's Eclipse development tool.

For our MD5 calculation we'll use an open-source package from Apache - the Commons Codec components:

use this download link: http://commons.apache.org/codec/download_codec.cgi

In our code we'll import the DigestUtils class and use the md5Hex() method to calculate the 32-character checksum:

import org.apache.commons.codec.digest.DigestUtils;

String md5 = DigestUtils.md5Hex(byteData);

We'll then create a class with this method that we can deploy to Teradata so that we can call our calcMD5 function from SQL.

Using the Java UDF tutorial on Developer Exchange, I created a Java project named 'TD-MD5'.  I downloaded the Apache Commons Codec jar and added it to my project library as an external jar.  I then added the following Java code (and the import statement for DigestUtils) to a class named 'CalcMD5' with a method named 'calcMD5'.  Using the 'create Teradata UDF' wizard (as explained in the tutorial), I created a scalar UDF that takes a large binary object as an input parameter, which I named 'binaryData', and returns a VARCHAR(32) value containing the checksum.

package judf;

import java.sql.Blob;
import java.io.InputStream;

// Apache DigestUtils for the MD5 calculation
import org.apache.commons.codec.digest.DigestUtils;

/**
 * @author Teradata Mike
 */
public class CalcMD5
{
    /**
     * @param binaryData
     * @return MD5 checksum String
     */
    public static String calcMD5(Blob binaryData)
    {
        String md5 = null;
        try
        {
            InputStream inputStream = binaryData.getBinaryStream();
            md5 = DigestUtils.md5Hex(inputStream);
            inputStream.close();
        }
        catch (Exception e)
        {
            // printing doesn't really make sense here; you may want to
            // throw a SQLException or simply return NULL
            System.out.println(e);
        }

        return md5;
    }
}
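Before deploying, it's worth a quick sanity check on the desktop: the same DigestUtils call against a file on disk should produce exactly the checksum the UDF returns for that image once it's loaded.  A throwaway test class (my own, not part of the UDF project):

// throwaway local test: print the MD5 of a file so it can be
// compared against the UDF's output for the same image
import java.io.FileInputStream;
import org.apache.commons.codec.digest.DigestUtils;

public class CalcMD5Test
{
    public static void main(String[] args) throws Exception
    {
        FileInputStream in = new FileInputStream(args[0]);
        System.out.println(args[0] + " -> " + DigestUtils.md5Hex(in));
        in.close();
    }
}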

When deploying these UDFs, make sure you add the external library jars (Apache Commons Codec) to the deployment package, otherwise you'll get an exception at SQL runtime.  Follow the steps in the tutorial.  For clarity, the deployment is a two-step process: first a jar file is created and sent to Teradata, and second, the install SQL is run.  These are the two tabs at the lower right of the wizard's deployment screen: first use the 'JAR Files' tab to DEPLOY, then the SQL tab to INSTALL.  If you encounter any SQL permission errors, refer back to the GRANT privileges that were highlighted earlier in this article.

Our First Image Analytic Query

Let's analyze some binary data!  

SELECT id, source, calcMD5(image)
FROM image_load_table
ORDER BY id

id  source        calcMD5(image)
--  ------------  --------------------------------
 1  tree.jpg      22147a4beb52a7599746ed901aebf3bb
 2  P1020983.JPG  47a0b819fc5195e760bf7702b70fe0e3
 3  P1020986.JPG  affb7bf239969c47b2c6a8e2e46af8fa
 4  P1020987.JPG  f8b629bff260bf258cb7681688375123
 5  P1020988.JPG  4a5683140caa61c55c53697927063eef
 6  P1020989.JPG  f175125ba53da1cdcc9ffb4e99aed19e
 7  P1020990.JPG  3a2c6c8d4bd10cefc5434f127642a6af
 8  P1020992.JPG  8a08ee270142f7d7cc44f05e2ed38b5e
 9  P1020993.JPG  116668900b1a45306d30ced1d4261350
10  P1020996.JPG  52ef343fab22cbe5ee71cbf8c1364f92
11  P1020997.JPG  9b0d4a7bea9fa82141f84ac750fd6216
12  P1030008.JPG  c292826f5530f5b81044cc156f0e8da7
13  P1030011.JPG  f02a9805f847ad96d3f8c3b3ddcbc25b
14  P1030014.JPG  b67e2be4409dc45c6dde196f1f42cdc9
15  P1030022.JPG  11a06a1b6a86948ea75d910d9c4cfd5b
16  P1030027.JPG  950d2eb1f6043022a608cbd932995569
17  P1030028.JPG  203539900e59fadd8b80d27dd09de041
18  P1030030.JPG  12a61b3922bf74152fdc629589bd5630
19  P1030039.JPG  eee3eea0b2bb902995515af6c31551e1
20  P1030041.JPG  284de1b0daa4d9c084192b0d4468d388

Our results now show the MD5 checksum for each image... pretty cool, huh?

To look for duplicates, the SQL would be:

SELECT calcMD5(image), COUNT(*)
FROM image_load_table
GROUP BY 1
HAVING COUNT(*) > 1
ORDER BY 1

If you didn't have any duplicates already in your set, just rerun the load utility to double up all your previous images (or maybe just a select few to keep things simple).

EXIF Extraction

As you know, digital photos also include embedded metadata tags within the binary file.  From your computer you can view these properties with your file explorer tools.  Usually these tags tell you the camera model, image width and height, original creation date, shutter and aperture settings, and some even include GPS coordinates to tell you where the picture was taken.  All great stuff that belongs in our database tables!  Here is an example of some of the EXIF information in one of my photos:

Exif metadata:
  Root:
    Make: 'Panasonic'
    Model: 'DMC-ZS1'

  Exif:
    Exposure Time: 10/8000 (0.001)
    FNumber: 56/10 (5.6)
    Exposure Program: 8
    ISO: 80
    Exif Version: 48, 50, 50, 49
    Date Time Original: '2010:06:04 10:36:12'
    Create Date: '2010:06:04 10:36:12'

Let's create another Java project in Eclipse for these new image functions.  I've named my new project 'TD-ImageEXIF', again following the steps outlined in the Developer Exchange tutorial.

For this project we'll use another Apache library, Sanselan, which gives us the tools to extract EXIF metadata from our digital photos.  Here is the link to download the jar file: http://commons.apache.org/sanselan/download_sanselan.cgi.  Remember, for these examples I'm only focused on JPEG images, and the examples to follow are coded only for JPEG.  For other image types, I'll leave it to you to add that support to the code.

I've extracted the Sanselan jar file and copied it to a resources folder within my Eclipse project.  This jar file will need to be added to the project library as an external jar.
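With the jar in place, a quick way to see what Sanselan gives you, before wiring anything into a UDF, is a tiny local dump program (my own throwaway test); its output has the same shape as the EXIF sample shown earlier:

// quick local test: print the full EXIF dump for one file with Sanselan
import java.io.File;
import org.apache.sanselan.Sanselan;
import org.apache.sanselan.common.IImageMetadata;

public class ExifDump
{
    public static void main(String[] args) throws Exception
    {
        IImageMetadata metadata = Sanselan.getMetadata(new File(args[0]));
        System.out.println(metadata != null ? metadata : "no metadata found");
    }
}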

The first UDF we'll create extracts the full EXIF metadata from the file as a semi-structured text message.  I've defined the UDF to take our image BLOB as input and return a VARCHAR(5000), which should be large enough to hold the full EXIF text (and is easily increased if needed); I noticed that some of my own images had text just over 3000 characters.  I named my Java class 'ImageEXIF' and the method 'imageEXIF'.

The source code is a little more complex than the MD5 example, but still fairly straightforward:

package judf;

import java.sql.Blob;
import java.io.InputStream;

// Sanselan is the Apache image EXIF library
import org.apache.sanselan.Sanselan;
import org.apache.sanselan.common.IImageMetadata;

/**
 * @author Teradata Mike
 */
public class ImageEXIF
{
    /**
     * @param blobData
     * @return EXIF String
     */
    public static String imageEXIF(Blob blobData)
    {
        String returnValue = null;

        try
        {
            InputStream inputStream = blobData.getBinaryStream();
            IImageMetadata metadata = getImageMetadata(inputStream);
            inputStream.close();

            if ( metadata != null )
                returnValue = metadata.toString();
        }
        catch (Exception e)
        {
            // printing doesn't really make sense here; you may want to
            // throw a SQLException or simply return NULL
            System.out.println(e);
        }

        return returnValue;
    }

    /**
     * getImageMetadata()
     *
     * @return EXIF metadata
     *
     * retrieve image EXIF metadata; protected so other classes can reuse it
     */
    protected static IImageMetadata getImageMetadata(InputStream inputStream)
    {
        try
        {
            // the second argument is an optional filename, not needed here
            IImageMetadata metadata = Sanselan.getMetadata(inputStream, null);
            return metadata;
        }
        catch (Exception e)
        {
            // printing doesn't really make sense here; you may want to
            // throw a SQLException or simply return NULL
            System.out.println(e);
            return null;
        }
    }
}

Again, run through the deployment steps outlined in the tutorial.  Remember to include the Sanselan jar file in the deployment step and then run the SQL install script.  You should now be able to execute this SQL statement to see the EXIF data:

SELECT id, imageEXIF(image)
FROM image_load_table
ORDER BY id

There should be all types of goodies in there, from camera model to shutter speed, aperture, and image size.

Individual Metadata Tag Extract

But let's not stop there.  This EXIF data is good, but it's still very much unstructured and not very useful in an analytic query.  What we need is a way to extract specific metadata tags so that we can use them in our SQL, so let's create one more UDF to get specific values.  To do this we'll add another input parameter for the EXIF tag that we want.  We'll use the same Eclipse Java project.  Again following the tutorial steps for creating a UDF, I have a new class named 'ImageTag' with a new method named 'imageTag'.  The inputs are our BLOB binary data and a VARCHAR tag name, and it returns a VARCHAR(2000) containing the specified tag's value (I think 2000 is a bit of overkill, but just to be safe...).

Now we can run more specific SQL queries based on individual image tags.  For example:

SELECT id, source, imageTag(image,'create_date'),
imageTag(image,'width'), imageTag(image,'height'),
imageTag(image,'longitude'), imageTag(image,'latitude')
FROM image_load_table
ORDER BY id

Here is the source code for our new class.  In the spirit of reuse, I'm using the method from our first ImageEXIF class, ImageEXIF.getImageMetadata(inputStream).  One edit that I did need to make was changing the scope of that method from private to protected.

Also, you'll see that I've added a mapping layer between the TiffConstants and our input tag name, which should make the UDF easier to use.  You can modify this code and add more EXIF tags to our UDF; as you can see, I haven't included any of the camera settings such as ISO or shutter speed.

package judf;

import java.sql.Blob;
import java.io.InputStream;

// Sanselan is the Apache image EXIF library
import org.apache.sanselan.formats.jpeg.JpegImageMetadata;
import org.apache.sanselan.formats.tiff.TiffField;
import org.apache.sanselan.formats.tiff.constants.TagInfo;
import org.apache.sanselan.formats.tiff.constants.TiffConstants;

/**
 * @author Teradata Mike
 */
public class ImageTag
{
    /**
     these are some of the useful EXIF tag constants:

     TiffConstants.EXIF_TAG_CREATE_DATE
     TiffConstants.EXIF_TAG_EXIF_IMAGE_LENGTH
     TiffConstants.EXIF_TAG_EXIF_IMAGE_WIDTH
     TiffConstants.EXIF_TAG_ISO
     TiffConstants.EXIF_TAG_EXPOSURE_TIME
     TiffConstants.EXIF_TAG_APERTURE_VALUE
     TiffConstants.EXIF_TAG_FNUMBER
     TiffConstants.GPS_TAG_GPS_LATITUDE_REF
     TiffConstants.GPS_TAG_GPS_LATITUDE
     TiffConstants.GPS_TAG_GPS_LONGITUDE_REF
     TiffConstants.GPS_TAG_GPS_LONGITUDE
     */

    public static String imageTag(Blob blobData, String tag)
    {
        String returnValue = null;

        TagInfo tagInfo = null;

        // map the tag parameter to the matching TiffConstant
        tag = tag.toUpperCase(); // makes matching easier

        if ( tag.equals("CREATE_DATE") )
            tagInfo = TiffConstants.EXIF_TAG_CREATE_DATE;
        else if ( tag.equals("LONGITUDE") )
            tagInfo = TiffConstants.GPS_TAG_GPS_LONGITUDE;
        else if ( tag.equals("LATITUDE") )
            tagInfo = TiffConstants.GPS_TAG_GPS_LATITUDE;
        else if ( tag.equals("WIDTH") )
            tagInfo = TiffConstants.EXIF_TAG_EXIF_IMAGE_WIDTH;
        else if ( tag.equals("HEIGHT") || tag.equals("LENGTH") ) // added HEIGHT for ease of use
            tagInfo = TiffConstants.EXIF_TAG_EXIF_IMAGE_LENGTH;

        try
        {
            InputStream inputStream = blobData.getBinaryStream();
            JpegImageMetadata metadata = (JpegImageMetadata) ImageEXIF.getImageMetadata(inputStream);
            inputStream.close();

            // guard against images with no metadata and unrecognized tag names
            if ( metadata != null && tagInfo != null )
                returnValue = getEXIFValue(metadata, tagInfo);
        }
        catch (Exception e)
        {
            // printing doesn't really make sense here; you may want to
            // throw a SQLException or simply return NULL
            System.out.println(e);
        }

        return returnValue;
    }

    /**
     * getEXIFValue()
     * @param jpegMetadata
     * @param tagInfo
     *
     * @return String tag value
     *
     * extract EXIF value from image metadata
     * returns NULL if tag value not found
     */
    private static String getEXIFValue(JpegImageMetadata jpegMetadata, TagInfo tagInfo)
    {
        TiffField field = jpegMetadata.findEXIFValue(tagInfo);

        if ( field == null )
            return null;

        return field.getValueDescription();
    }
}
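You can also exercise this class off-database first.  javax.sql.rowset.serial.SerialBlob (part of the JDK) will wrap a file's bytes as the Blob the method expects; this harness is my own throwaway test, not part of the deployment:

// throwaway harness: call ImageTag.imageTag() locally against one file
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import javax.sql.rowset.serial.SerialBlob;
import judf.ImageTag;

public class ImageTagTest
{
    public static void main(String[] args) throws Exception
    {
        // read the whole file into memory and wrap it as a Blob
        File file = new File(args[0]);
        byte[] bytes = new byte[(int) file.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(file));
        in.readFully(bytes);
        in.close();

        SerialBlob blob = new SerialBlob(bytes);
        System.out.println("create_date: " + ImageTag.imageTag(blob, "create_date"));
        System.out.println("width:       " + ImageTag.imageTag(blob, "width"));
        System.out.println("height:      " + ImageTag.imageTag(blob, "height"));
    }
}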

Deployment - be sure to include the Sanselan jar file in the deployment screen, as well as check the box "include all classes in project" so that the ImageEXIF class we're using to get the metadata is packaged too.

Transform and Load

And finally, to finish our ELT process, let's transform the raw data in our temporary load table and move it over to our 'production' target table.

This can be as simple as an "INSERT ... SELECT" statement.

Let's create a 'production' table to hold these images and their calculated metadata.

CREATE TABLE images
(
  id INTEGER,
  source VARCHAR(255),
  create_date VARCHAR(100),
  width INTEGER,
  height INTEGER,
  longitude VARCHAR(100),
  latitude VARCHAR(100),
  md5 CHAR(32),
  image BLOB
)

And here's our INSERT ... SELECT example using our calcMD5 and imageTag UDFs.  Nice, right?

INSERT images (id, source, create_date, width, height, longitude, latitude, md5, image)
SELECT id + 0, source, imageTag(image,'create_date'),
       imageTag(image,'width'), imageTag(image,'height'),
       imageTag(image,'longitude'), imageTag(image,'latitude'),
       calcMD5(image), image
FROM image_load_table

Notice that I've added a placeholder, 'id + 0', to show where you'll need to insert a starting index so that you can load multiple batches without id collisions on the primary key.  Something like this should work in a pinch:

SELECT MAX(id) + 1
FROM images

Now, I'll leave it to you to modify the functions to be friendlier with create_date and the lat/lon coordinates.  In the UDF code, we could transform the create_date format to make it more SQL friendly so that we can cast it to a true TIMESTAMP.  And for extra credit we could do the same for the GPS coordinates and store them as Teradata's ST_GEOMETRY data type for our geospatial library functions (yet another Developer Exchange article!).
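For the create_date piece, the transformation is mostly string surgery: Sanselan's value description arrives quoted, as in '2010:06:04 10:36:12' (see the EXIF dump earlier), while a SQL-friendly timestamp wants dashes in the date part.  One possible sketch of that conversion, which could live inside the UDF:

// convert an EXIF date value like '2010:06:04 10:36:12'
// into the SQL-friendly form 2010-06-04 10:36:12
private static String exifDateToSql(String exifDate)
{
    if ( exifDate == null )
        return null;

    // strip the quotes that Sanselan puts around the value description
    String s = exifDate.replace("'", "").trim();

    // EXIF uses ':' in the date part; swap the first two for '-'
    if ( s.length() >= 10 && s.charAt(4) == ':' && s.charAt(7) == ':' )
        s = s.substring(0, 4) + '-' + s.substring(5, 7) + '-' + s.substring(8);

    return s;
}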

Conclusion

I hope that I've given a good demonstration of the power of Teradata as a binary data processing platform.  While these examples were a bit simple, they should help open your eyes to the potential of the Teradata platform's capabilities with new binary data types: transformations like resizing, conversion to new binary formats, and more advanced processing such as facial recognition, image comparison, and audio and video analysis.