Large Objects - Part 2 retrieval web application

In Part 1, we looked at what Large Objects are and some tactics that we could employ to load them into Teradata. As I stated back in Part 1, loading large objects is great, but by itself it is pretty useless. In this article, we will look at extracting Binary Large Objects from Teradata for inclusion in a web application.

But before we get to that, we need to know something about the large objects we will be dealing with.

What is that Large Object?

In Part 1, we looked at Large Objects as data types that can store large things. Examples included text documents, images, PDF documents, sounds, structured data and pretty much anything else that you could imagine.

But once we’ve loaded a large object into the database, how do we know what it is later on?

The table that we used didn't give any clue as to what the Large Objects were; they were simply present in the table. Indeed, we loaded both images and PDF documents into it. To recap, here is the table definition:

create table devX_lob (
    id varchar (80),
    mime_type varchar (80),
    char_lob character large object,
    binary_lob binary large object
)
unique primary index (id);

The binary_lob column could theoretically contain anything up to 2GB in size. We could have given the column a more descriptive name. For example:

create table customer (
    id integer,
    ...
    photo binary large object,
    signature binary large object,
    ...
)
unique primary index (id);

In the above table we could reasonably assume that the first Large Object is a photo of the customer and the second a signature. But what is the format? Are they GIF, PNG, BMP or JPEG images? Perhaps they are a vector format such as WMF, SVG or EPS? Maybe even something else?
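
If the format were not recorded anywhere, one fallback would be to sniff the first few bytes of the object and compare them against well-known file signatures. The helper below is only a sketch and is not part of the example application; the point of this article is that recording the format explicitly is much easier.

// Hypothetical helper: guess a MIME type from the first bytes of a binary object.
// Only a handful of signatures are checked; a real application should record
// the MIME type explicitly instead.
public static String guessMimeType(byte[] head) {
    if (head.length >= 4 && head[0] == (byte) 0x89 && head[1] == 'P'
            && head[2] == 'N' && head[3] == 'G') {
        return "image/png";                 // PNG signature
    }
    if (head.length >= 4 && head[0] == 'G' && head[1] == 'I'
            && head[2] == 'F' && head[3] == '8') {
        return "image/gif";                 // GIF87a / GIF89a
    }
    if (head.length >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xD8) {
        return "image/jpeg";                // JPEG start-of-image marker
    }
    if (head.length >= 4 && head[0] == '%' && head[1] == 'P'
            && head[2] == 'D' && head[3] == 'F') {
        return "application/pdf";           // "%PDF"
    }
    return "application/octet-stream";      // unknown - treat as raw bytes
}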

MIME Types

Most web browsers (and email clients) can display a variety of image types, play many sound formats and display other documents thanks to plugins.

All of these things are received by the browser as a mass of data that, by itself, is not terribly meaningful. To display it correctly, the browser needs a hint as to what type of data it is receiving.

Consider the following HTML snippet:

    <img src="photo.jpg">

When the browser sees this, it makes a request back to the server to retrieve "photo.jpg". It is reasonable to assume that the image being returned will be a JPEG image. But what about this tag:

    <img src="/servlet/images/id?2">

In this case there is no clue as to what type of image we will be getting back. All we know is that it is an image (or at least we hope it is). For what it is worth, the above would return an image, presumably from a database row where the id = 2. Yet the browser will (or should) still display the image correctly. How does it know?

Enter the MIME type. MIME stands for Multipurpose Internet Mail Extensions. Simplistically, a MIME type is a text string in two parts: a type and a sub-type. Common types are "application", "image" and "audio". Sub-types tend to represent recognised formats and include things like "pdf", "jpeg" and "mpeg".

A MIME type tells the browser (and email client) what type of data is being returned. So in the case of the second HTML example above, the browser will receive a MIME type telling it what type of image id = 2 is.

An example of a MIME type is "application/pdf", which means that the data to follow should be treated as a PDF document, whereas "image/jpeg" means that the following data is a JPEG image. Character data can also have MIME types, such as "text/html", "text/plain" and "text/richtext". A lengthy list of MIME types can be found here: www.webmaster-toolkit.com/mime-types.shtml

In the first example, where we retrieved photo.jpg, the browser probably does not determine the file type from the extension in the file name. Rather, when the server returns "photo.jpg", it precedes the data with a MIME type of "image/jpeg".

In our little test table, this is the purpose of the mime_type column. In addition to the file name (key), we are storing the MIME type along with the large object data. By doing this, the process of returning data to the browser for display suddenly becomes much easier.
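
If, as in Part 1, the objects are loaded from files on disk, one convenient way to populate the mime_type column is to derive the type from the file name at load time. The helper below is only a sketch (Part 1 may well have populated the column differently); java.net.URLConnection knows the common mappings, and we fall back to a generic binary type otherwise.

// Hypothetical load-time helper: derive a MIME type from a file name.
public static String mimeTypeFor(String fileName) {
    // URLConnection knows the common file-name-to-MIME-type mappings.
    String mimeType = java.net.URLConnection.guessContentTypeFromName(fileName);
    return (mimeType != null) ? mimeType : "application/octet-stream";
}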

The Large Objects Web application

Retrieve LOB Servlet

In our web application we will use a Servlet to process requests from the browser for our large objects. The two attachments contain the project in its complete glory. One attachment contains a NetBeans 6.5 project, the other an Eclipse 3.4.2 version of the project. Apart from the IDE used to create them, both versions are the same. I will leave it to you to figure out which is the NetBeans project and which is the Eclipse version.

Our Servlet will basically retrieve a Large Object from our table and return it to the browser. The browser will know what type of data it is being sent because the Servlet will also send the MIME type along with it.

(Figure: the retrieve LOB web app)

The key will be the file name which is stored in the id column of the table.

If everything goes to plan, we should see this less than aesthetically pleasing result in our web browser.

Most of the time, web requests that involve parameters consist of the page (or Servlet name) followed by a question mark and then the parameters. So at first thought our request might look something like this:

    <img src="/servlet/RetrieveLob?id=thumbsup.gif">

The above will run the RetrieveLob Servlet and supply it with the parameters following the question mark (i.e. ID = “thumbsup.gif”). The Servlet would retrieve the appropriate record from Teradata and return the Large Object data to the browser along with the associated MIME type.

While that is perfectly workable, it has a drawback which I will discuss below. So, our requests will look more like this:

    <img src="/servlet/RetrieveLob/thumbsup.gif">

In addition to displaying images, we can link to the PDF (small.pdf) or any of the other large objects we have loaded using their names:

    <a href="/servlet/RetrieveLob/small.pdf">Click to view</a>

You could also link to the images in similar fashion if you so desired:

    <a href="/servlet/RetrieveLob/small.pdf">
        <img src="/servlet/RetrieveLob/thumbsup.gif" width="100" height="100">
    </a>
 

The above would represent a sort of thumbnail view which, when clicked, should display the image full size in a page of its own. I say it is a "sort of thumbnail view" as my little images are smaller than 100 x 100, so the "thumbnail view" consists of images that are larger than the "full size" image, but hopefully you get the idea.

The Servlet descriptor

The obvious question is how does this work? Do we need one Servlet for every Large Object in our database? For example, do we need a Servlet called "small.pdf" that retrieves the row identified by id = 'small.pdf' and another called "thumbsup.gif" to retrieve the row identified by id = 'thumbsup.gif'?

The short answer is no, we don't. Our Servlet is called RetrieveLob. It gets invoked when the web server sees the pattern /servlet/RetrieveLob in an incoming request. The Servlet then extracts the last part of the Servlet path, which it uses as the key into the table. Thus the path "/servlet/RetrieveLob/small.pdf" causes our RetrieveLob Servlet to be invoked.

The Servlet will extract the file name (i.e. small.pdf) from the incoming request’s path, retrieve the row from the table identified by id = ‘small.pdf’ and send the data back to the browser.

The key to this little piece of magic is in the Servlet's deployment descriptor. The descriptor looks like this:

<servlet-mapping>
    <servlet-name>RetrieveLob</servlet-name>
    <url-pattern>/servlet/RetrieveLob/*</url-pattern>
</servlet-mapping>

(Figure: the NetBeans servlet URL pattern dialogue)

The biggest advantage of doing this is that if your end user elects to save your image, PDF or whatever, the browser will use the path to identify the file name. So, for example, if the user alternate-clicks the "thumbsup.gif" image and chooses "Save as…", the default file name will be "thumbsup.gif". Similarly, if they alternate-click the "click to view" link and choose "Save as…", the default file name will be "small.pdf".

If we did not code our Servlet this way, the default name would be “RetrieveLob” which is rather useless.

The add Servlet wizard in NetBeans consists of 2 steps. The first is to define the class that implements the Servlet. The second is to define the Servlet mappings.

Note the "URL patterns" entry. The value "/servlet/RetrieveLob/*" means any request to this web application that starts with "/servlet/RetrieveLob" will be handled by the com.teradata.lob.RetrieveLob Servlet class. This means that we could handle requests that include path names – for example: "/servlet/RetrieveLob/images/photos/GlennMc.jpg". For simplicity, I haven't bothered with supporting "path names". However, adding support for them is extremely easy, which I will leave as an exercise for you.

The process to define the Servlet in Eclipse consists of three screens. The second screen is where you enter the URL pattern to invoke the Servlet. To specify the path, you must select the default URL mapping, click “Edit…” then adjust the path to the value shown (/servlet/RetrieveLob/*).

The RetrieveLob Servlet

The RetrieveLob Servlet consists of three main parts. These are:

  • Extraction of the Large Object id from the path in the incoming request.
  • Retrieval of the Large Object from the database.
  • Transmission of the data back to the browser.

Since it isn’t that long, here is the entire “business logic” part of the Servlet.

protected void processRequest(HttpServletRequest request, HttpServletResponse response)
        throws ServletException, IOException {

    HttpSession session = request.getSession ();

    System.out.println ("RetrieveLob: request.getPathInfo () = \"" + request.getPathInfo() + "\"");
    // Get the ID as the request path info minus the leading "/" character.
    String id = request.getPathInfo().substring(1);
    String sqlText = "select mime_type, binary_lob from devX_lob where id = ?;";
    Connection connection = (Connection) session.getAttribute("connection");
    if (connection == null) {
        return;
    }
    try {
        PreparedStatement stmt = connection.prepareStatement(sqlText);
        stmt.setString (1, id);
        ResultSet rs = stmt.executeQuery ();
        if (rs.next ()) {
            String mimeType = rs.getString (1);
            System.out.println ("mime type = " + mimeType);
            response.setContentType(mimeType);
            ServletOutputStream out = response.getOutputStream();
            Blob blob = rs.getBlob(2);
            InputStream blobStream = blob.getBinaryStream();

            int b;
            while ((b = blobStream.read ()) != -1) {
                out.write (b);
            }
            blobStream.close ();
            out.close ();
        }
        else {
            System.out.println ("No such record: " + id);
        }

        rs.close ();
        stmt.close ();
    }
    catch (SQLException e) {
        e.printStackTrace ();
    }
}

The ID of the Large Object to be retrieved is obtained from the call to request.getPathInfo (). According to the JavaDocs, the HttpServletRequest.getPathInfo () method "Returns any extra path information associated with the URL the client sent when making the request". It waffles on for a bit then adds "… and will start with a '/' character".

Apparently the interpretation of this varies between application servers, so you will need to check what yours does. In Tomcat 6.0 (which is what I am using), this method returns the part of the request URL corresponding to the asterisk ("*") in the URL pattern. That is, it returns the "filename" starting with a "/"; for example, if the request URL is "/servlet/RetrieveLob/thumbsup.gif", request.getPathInfo () returns "/thumbsup.gif". This can be confirmed in the server logs by reviewing the output of the println at the start of the method.

In Part 1, we didn't bother putting a leading "/" character into the names of our Large Objects. Consequently, we need to strip the leading slash from the "path info" for any retrieval request. Obviously we could have included the leading "/", and this would probably be a good idea if you plan to extend the example to work with full path names.
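
Note that getPathInfo () returns null when the request URL contains nothing after the Servlet path, in which case the substring call would throw a NullPointerException. A slightly more defensive version of the ID extraction (a sketch only, not in the attached project) might look like this:

// Defensive ID extraction - getPathInfo() returns null when no extra path was supplied.
String pathInfo = request.getPathInfo();
if (pathInfo == null || pathInfo.length() <= 1) {
    response.sendError(HttpServletResponse.SC_BAD_REQUEST, "No Large Object id supplied");
    return;
}
String id = pathInfo.substring(1);   // strip the leading "/" to match the id column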

Getting back to the example, the next step is to obtain a database connection and run our query. In this particular example, I use a per-user connection which is established via the logon page (index.jsp). The connection is stored in the user's session, so I retrieve it from the session object via session.getAttribute ("connection"). For a production version I would strongly suggest using a connection pool, but obviously this will depend upon your particular needs.

Once we have our connection, the query is prepared and the ID plugged in. For the purposes of the exercise, we are only going to work with Binary Large Objects, so we only retrieve the binary_lob and mime_type values.

Since the id is the Unique Primary Index (UPI), there should only ever be one row returned (or no rows if the id doesn't exist). The if statement (as opposed to a while) on rs.next () checks to see whether the row was found. If it was, we return the data to the requestor (i.e. the browser); otherwise we simply exit the Servlet. What happens if no data is returned to the requestor (i.e. the ID was not found)? This will depend upon the browser and the context, but in Internet Explorer the "broken image" icon will be displayed in the case of the <img> tag.
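
If you would rather tell the browser explicitly that the object does not exist, one option (a sketch only, not in the attached project) is to send an HTTP 404 status from the else branch instead of returning an empty response:

// Sketch: report a missing Large Object as an HTTP 404 rather than an empty response.
else {
    System.out.println ("No such record: " + id);
    response.sendError(HttpServletResponse.SC_NOT_FOUND, "No large object with id " + id);
}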

The final step is to return the Large Object and MIME type to the browser. This occurs in the body of the if statement.

The first part of returning the data is to set the content type by supplying the MIME type. This is done via the response.setContentType (String) call. Recall from above that this is how the browser (or other requestor) will know what type of data it is being sent. Finally, we send the actual BLOB data. This is achieved as follows:

  • Obtain the OutputStream representing the connection back to the requestor (the browser). This is achieved by the call to response.getOutputStream () which returns a ServletOutputStream.

    Note that this is different to the more commonly used response.getWriter () method that returns a PrintWriter. The main difference is that the PrintWriter is for text data whereas ServletOutputStream is for binary (e.g. image) data.
  • Next we extract the BLOB from the ResultSet and obtain an InputStream to read its data.
  • Finally, we read the data one byte at a time and send it to the requestor (the browser).

The other bits

The rest of the application is relatively straightforward. The diagram shows how this incredibly (not) complex application hangs together in all of its glory.

(Figure: web app organisation)

The only significant thing to point out here is that the SignOn Servlet establishes a database session for each user using their personal Teradata credentials. This session is then stored in the web users’ session objects. An improvement could be to use a connection pool, but for simplicity I didn’t do this. This could be another enhancement that I leave to you, the reader.

The index.jsp prompts for everything required for establishing a database session. For my convenience, and laziness, I've hard coded all the values that work for me (except my super secret password – which astute readers might observe in Part 1). I recommend you adopt the laziness principle and change my values to ones that work for you.

If you load more large objects into the database, you can access them simply by updating the display.jsp page. Copying and pasting what is already in the display.jsp page is probably the easiest way to go. Feel free to get adventurous and try some other things. You could, for example, make the page really irritating through the addition of a sound file. For maximum irritation, I suggest a sound file that loops forever – don’t forget to turn your volume control right up!

As mentioned, the individual database connections are maintained within the web session objects. The user's web session (HttpSession object) is easily obtained via the request in the opening lines of the Servlet.

The main drawbacks to using sessions to store "stuff" such as the database connection are that the web session will eventually expire and that there is no sharing of the critical connection resource. So if the user is inactive for a while, the connection may be lost (because the session expired), and if there are lots of people trying this example at the same time (or you forget to sign off) you may start to run out of sessions.

Connection Pools would not experience this “problem”. However, since the establishment of a connection pool is often application server dependent and I wanted to keep this simple, I’ve decided to track the database session this way.

If you wanted to work with a connection pool, the only change would be where the Servlet obtains its connection. Instead of extracting the connection from the session object, a request would be made of the application server to get a connection. In NetBeans, you simply alternate-click in the code, locate the "Use Database" option and fill in the dialogue; it then inserts the necessary code to get a connection. Or you could use a method like this:

public static Connection getConnection ()
        throws NamingException, SQLException {

    InitialContext initialContext = new InitialContext ();
    Context context = (Context) initialContext.lookup ("java:comp/env");
    DataSource ds = (DataSource) context.lookup ("jdbc/TestLobPool");
    Connection conn = ds.getConnection ();
    return conn;
}

If you do use a connection pool, don't forget to close the connections before returning from the Servlet. If you do forget, you will be sorry – I can promise you that. Worse still, if you have some paths through your Servlet that don't close the connection (i.e. sometimes the connection isn't closed), you will be several orders of magnitude sorrier.
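
One way to guarantee the connection always goes back to the pool, whatever path the request takes, is to close it in a finally block. The following is only a sketch, assuming the getConnection () method shown above:

// Sketch: always return the pooled connection, even if the query fails.
Connection connection = null;
try {
    connection = getConnection ();   // the pool-based method shown above
    // ... prepare the statement, run the query and stream the BLOB here ...
}
catch (NamingException e) {
    e.printStackTrace ();
}
catch (SQLException e) {
    e.printStackTrace ();
}
finally {
    if (connection != null) {
        try {
            connection.close ();     // hands the connection back to the pool
        }
        catch (SQLException e) {
            e.printStackTrace ();
        }
    }
}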

What next?

The example I’ve presented works with Binary Large Objects – because in my opinion they are cooler. However, for those mono-spaced fixed pitch courier fans out there and those that have to do real work, the example could be easily modified to work with Character large objects as well as Binary Large Objects.

To add this enhancement is really easy:

  • Obviously, the char_lob column must be retrieved from the table in the select query.
  • There needs to be some method to decide which value (i.e. BLOB or CLOB) is to be returned to the requestor, because only one value can be returned per request. Perhaps you could look at one of the values (e.g. the BLOB) and, if it is null, return the other one (i.e. the CLOB); a sketch of this approach follows this list. You could try to decipher the MIME type to determine what to return; however, there is a virtually unlimited list of MIME types, with new ones being introduced all the time (yes, you can even make up your own), so interpreting and keeping up with them would quickly become a daunting task.
  • If you are going to return a CLOB, then you will likely need to use a PrintWriter, which can be obtained using response.getWriter (). I haven't tried that, but if someone does, please leave a comment on the article.

    I expect that the ServletOutputStream will work just as effectively for transmitting a CLOB. After all, characters are a subset of all the possible binary values – aren’t they?
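
For what it is worth, here is a minimal, untested sketch of the null-check approach. It assumes the query has been changed to select mime_type, binary_lob and char_lob, and it falls back to the CLOB (sent through a Writer) when the BLOB column is null:

// Sketch only (untested): choose between the BLOB and the CLOB based on which is present.
String mimeType = rs.getString("mime_type");
response.setContentType(mimeType);
Blob blob = rs.getBlob("binary_lob");
if (blob != null) {
    ServletOutputStream out = response.getOutputStream();
    InputStream in = blob.getBinaryStream();
    int b;
    while ((b = in.read()) != -1) {
        out.write(b);
    }
    in.close();
    out.close();
}
else {
    // No binary value - fall back to the character large object.
    Clob clob = rs.getClob("char_lob");
    PrintWriter writer = response.getWriter();
    Reader reader = clob.getCharacterStream();
    int c;
    while ((c = reader.read()) != -1) {
        writer.write(c);
    }
    reader.close();
    writer.close();
}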

The other extension mentioned above is to allow path names to be handled by the server. In that way you could group your BLOBs into “directories”. At the end of the day, it will still just be a row in the table, but if you like directories feel free to do so. To do this is also easy:

  • Increase the width of the id column in the table. Eighty characters isn’t going to cut it for long.
  • Optionally, don't strip off the leading "/" via the substring call in the Servlet (a sketch follows this list). If you do this, remember to modify the load process so that leading "/" characters are included in the ids.
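
A minimal sketch of that change, assuming the ids in the table are stored with their leading "/" (and any intermediate path segments):

// Sketch: keep the full path info, including the leading "/", as the key.
// A request for /servlet/RetrieveLob/images/photos/GlennMc.jpg would then
// look up the row with id = '/images/photos/GlennMc.jpg'.
String id = request.getPathInfo();
if (id == null || id.length() == 0) {
    response.sendError(HttpServletResponse.SC_BAD_REQUEST, "No Large Object id supplied");
    return;
}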

As a final suggestion, add a few more images, maybe some sound effects, a search function, a few more colours and fonts; who knows, you might have your own version of eBay up and running before morning tea!

6 REPLIES
Teradata Employee

Re: Large Objects - Part 2 retrieval web application

I saved my first BLOBs based on Article 1 and have started to retrieve them using the algorithm in Article 2; however, save performance seems to be far superior to retrieve performance (somewhat counter-intuitive, as DB writes are typically slower than reads). My guess is that it is due to the int-based blobStream.read below:

int b;
while ((b = blobStream.read ()) != -1)
{
    out.write (b);
}

This presumably forces the Java program (and the underlying JDBC driver / spool) into a byte-at-a-time mode. Do you have any code that does this in a buffer-based manner (say an int array[1024])?
Teradata Employee

Re: Large Objects - Part 2 retrieval web application

Hi Mike,
Have you tried wrapping the InputStream in a BufferedInputStream?
For example:
....
InputStream blobStream = blob.getBinaryStream();
BufferedInputStream bufferedBlobStream = new BufferedInputStream(blobStream);
....

Unfortunately I'm not currently in a position to try this, but would be interested in your feedback if you give it a try.

Both the InputStream and the BufferedInputStream define a method with this signature:
int read(byte[] b, int off, int len)
This might be what you are after. The only "problem" with this method is that it repeatedly calls the read method len times - so that probably won't affect what you are seeing.

I expect that a BufferedInputStream would be your best bet.
Teradata Employee

Re: Large Objects - Part 2 retrieval web application

Asking the question made me start thinking and I came up with a similar approach to yours, as so:

Blob rpm = rs.getBlob(1);
InputStream blobStream = rpm.getBinaryStream();
byte[] b = new byte[1024];
int bytesRead = blobStream.read(b);

while (bytesRead != -1)
{
    rpmOutputStream.write(b);
    bytesRead = blobStream.read(b);
}

I got the following numbers reading a single BLOB column (between two VMs).

26113953 bytes read in 1363ms
552672 bytes read in 598ms
172953870 bytes read in 8600ms
2659349 bytes read in 2499ms
35004447 bytes read in 2444ms
127637040 bytes read in 6085ms
99740019 bytes read in 4195ms
15963897 bytes read in 981ms
13776278 bytes read in 880ms
32194971 bytes read in 2360ms
28939266 bytes read in 2697ms

Wrapping a BufferedInputStream around this as so:

Blob rpm = rs.getBlob(1);
InputStream blobStream = rpm.getBinaryStream();
BufferedInputStream blobBufferedStream = new BufferedInputStream(blobStream);

byte[] b = new byte[1024];
int bytesRead = blobBufferedStream.read(b);

while (bytesRead != -1)
{
    rpmOutputStream.write(b);
    rpmSize += bytesRead;
    bytesRead = blobBufferedStream.read(b);
}

Did not seem to make an appreciable difference.

As these are large files (100MB RPMs) I'll try increasing the size of the byte[] b buffer and run some more tests.

MikeC....
Teradata Employee

Re: Large Objects - Part 2 retrieval web application

Quick follow-up (having talked to the JDBC experts). The underlying JDBC driver will read up to 64K when you make the Blob blob = rs.getBlob() and blob.getBinaryStream() calls.

In the single-byte blobStream.read () mode you are essentially spinning back and forth between your code and the JDBC driver, with all the overhead of Java method call invocation causing the delay, as the JDBC driver will only go and get the next 64K when you exhaust its buffer.

With the byte[1024] method you are reducing the number of Java method invocations you make, so if you can spare the memory you can go up to 64K to match the driver. You may want to consider the size of your BLOBs, though, as there is no point in going above the maximum size of your binary files (i.e. 10K GIFs wouldn't need the full 64K).

MikeC....
Teradata Employee

Re: Large Objects - Part 2 retrieval web application

Interesting feedback. I would have assumed that the driver would cache the C/BLOB either way, and that access via the stream obtained from the ResultSet and access via the Blob object would be equivalent.

As a matter of interest, how much improvement was there when you used the Blob object? Quite a good improvement I expect.
:-)
Teradata Employee

Re: Large Objects - Part 2 retrieval web application

First off an admission - There is a bug in my code!! Like such a thing could ever occur..

rpmOutputStream.write(b); // Wrong, Flawed....

Should of course read

rpmOutputStream.write(b, 0, bytesRead);

The problem is that the last blobBufferedStream.read(b) will almost never return a value equal to b.length, which is what the first line of code actually writes; hence the second line of code is correct for both full buffers and the final (typically not full) buffer. This means the correct BLOB reader code should be:

Blob rpm = rs.getBlob(1);
InputStream blobStream = rpm.getBinaryStream();
byte[] b = new byte[65536]; // Go 64K to match the JDBC LOB Buffer
int bytesRead = blobStream.read(b);

while (bytesRead != -1)
{
    rpmOutputStream.write(b, 0, bytesRead); // bytes read should be 65536 except for the last buffer
    bytesRead = blobStream.read(b);
}

Now to the numbers:

Byte at a time

26113953 bytes read in 39.176s
552672 bytes read in 1.387s
172953870 bytes read in 261.245s
2659349 bytes read in 4.877s
35004447 bytes read in 51.513s

64K Buffer

26113953 bytes read in 1.307s
552672 bytes read in 0.587s
172953870 bytes read in 8.171s
2659349 bytes read in 2.548s
35004447 bytes read in 2.504s

The bigger the file, the more the overhead of repeated Java method invocations is compounded (172953870 invocations of blobStream.read () in "byte at a time" mode versus just 2640 invocations of blobStream.read(b) with a 64K buffer for the largest file, which equates to over 4 minutes versus 8 seconds)!

MikeC.....