A Quick Tour of the XML Type

Extensibility
Extensibility covers the mechanisms by which you, as the user or developer, can extend the functionality of the Teradata Database, for example with the use of User Defined Functions, or UDFs.
Teradata Employee

A Quick Tour of the XML Type

XML is a markup language, used to format data in a wide variety of applications. It is commonly used as a message format for application integration (e.g. XML messages exchanged between applications, with those applications implementing an XML based API). Somewhat less commonly, it is used as a document format, to tag information in a platform independent manner. Starting in Teradata Database version 14.10, XML is supported as a native SQL data type. This means, XML documents can be stored in the database, and common XML processing operations like query, validation and transformation of the XML documents can be done in the database. Also supported are the abilities to shred XML documents (extract data from XML documents into database tables) and to publish SQL query results in XML format. This article gives a quick overview of the XML processing capabilities in the Teradata Database along with code samples that demonstrate this functionality. Subsequent articles will cover some of these topics in greater detail.

Sample code

The attachment to this article (XMLBASIC.zip) contains sample code pertinent to this article. The sample code is self contained in the sense that it includes sample data (XML Documents, XML Schemas), BTEQ scripts and a Readme.txt file that describes the sample code and provides instructions on how to execute the sample code.

The Basics

The XML data type can be used to represent XML documents in the relational database. Instances of the XML data type can be constructed using the new XML() constructor, or by using the CREATEXML() function. Each of these will take the string representation of the XML as its lone parameter, and parses it to produce the XML data type instance. Let’s try these out by writing the traditional first “Hello World” example:

SELECT new XML(‘<Greeting>Hello World</Greeting>’);

SELECT CREATEXML(‘<Greeting>Hello World</Greeting>’);

Both of these take a string representation of the XML as parameter and return an XML type instance. If the string represents a “not well-formed” XML, these functions will return an error.

Creating Tables with XML columns

Merely creating transient instances of XML type is not of interest for most users. So let’s look at how we can store XML values in the database. First, let’s create a table with an XML column.

CREATE TABLE T_XMLDOCS(

ID VARCHAR(30),

XMLDOC XML

);

This creates a table with 2 columns, the second of which can store XML values. XML is a lob type, so the size of the XML document can be up to 2GB, the LOB size limit in Teradata. Internally, XML is stored in a binary format that is much more compact than the original string representation. Further, if this internal representation is smaller than 4KB, the XML is stored in-row rather than as a LOB file. This allows for better performance for smaller documents. A maximum of 15 XML type columns can occur in a row.

Loading XML data into XML typed columns

XML is a lob type, and a couple of Teradata utilities can be used for loading XML data. This article shows how BTEQ can be used to load XML values. TPT can also be used to load LOBs. This article covers BTEQ and the sample code attached also contains an example of using TPT for this purpose.

To load XML values into the database using BTEQ, the following steps should be followed.

    1. Place the XML files on some designated staging directory on the file system.
    1. Compose a text file with a listing of XML documents to be loaded
    1. Invoke a BTEQ script that uses the IMPORT instruction to load the files into the table.

The code to load the data looks something like below:

.REPEAT *

.IMPORT vartext file=xml.txt

USING (a XML AS DEFERRED BY NAME, b VARCHAR(50))

INSERT INTO T_XMLDOCS VALUES(:b, :a);

The xml.txt file referenced in the code above contains the listing of the files to be loaded. The contents of that file look something like below:

file1.xml|1|

file2.xml|2|

file3.xml|3|



The file is essentially a ‘|’ delimited data file. The USING clause identifies the first of the fields as the XML document (the path to the file containing the XML document) and the second field as an identifier. The INSERT statement then inserts these two values into the T_XMLDOCS table.

Retrieving XML values

When retrieving XML values, we are often interested in one of the following:

    1. Retrieving documents as a whole
    1. Retrieving parts of the XML document
    1. Retrieving documents built by query and transform of existing XML values
    1. Retrieving data extracted from XML documents

We will look at examples of each of these, with the documents we loaded in the previous example.

Retrieving document as a whole

The simplest way to retrieve XML document as a whole is the most intuitive:

SELECT XMLDOC FROM T_XMLDOCS

When retrieving the documents as XML typed values, the data will be encoded in UTF-8. When client applications need to parse the resulting XML documents, they should direct the parser to ignore the XML encoding declaration and expect UTF-8 encoded documents.

It is possible to return documents in a specific encoding, using the XMLSERIALIZE function. This function allows you to specify that the document be serialized to VARCHAR, CLOB, VARBYTE or BLOB data types. When serialized to VARBYTE or BLOB, it will be possible to specify the encoding that should be used.

SELECT XMLSERIALIZE(DOCUMENT xmldoc as CLOB INCLUDING XMLDECLARATION)

FROM T_XMLDOCS;

Retrieving Part of the XML documents

It is possible to extract portions of the XML document using the XML Query language. Details of XML Query are covered in the next part of this article series, but a couple of simple examples are shown here. When XQuery is used to query an XML document, the results might not always be well-formed XML documents. One example is a query that returns a sequence of two or more elements (a well-formed XML document can only have one root element). If the user wants to retrieve a well-formed document, they should either modify the query so that it returns a well-formed document, or add additional processing on the client side to format the query results into a well-formed document.

SELECT id, xmldoc.XMLEXTRACT('/Customers/Customer[@C_CUSTKEY < 31003]', NULL)

FROM T_XMLDOCS;

This query would run against documents that look like the sample below:

<Customers>
 <Customer C_CUSTKEY="11001" C_NAME="LCDs R Us">
  <C_ADDRESS>10001 Cary Parkway</C_ADDRESS>
  <C_NATIONKEY>21</C_NATIONKEY>
  <C_PHONE>31-389-986-4741</C_PHONE>
  <C_ACCTBAL>7140.81</C_ACCTBAL>
  <C_MKTSEGMENT>Electronics</C_MKTSEGMENT>
  <C_COMMENT>Manufacturer of Liquid Crystal Displays</C_COMMENT>
 </Customer>
 <Customer C_CUSTKEY="21001" C_NAME="Chips Electronics">
  <C_ADDRESS>2101 Stevens Creek Pkwy</C_ADDRESS>
  <C_NATIONKEY>21</C_NATIONKEY>
  <C_PHONE>31-389-986-4741</C_PHONE>
  <C_ACCTBAL>7140.81</C_ACCTBAL>
  <C_MKTSEGMENT>Electronics</C_MKTSEGMENT>
  <C_COMMENT>E-Commerce website owner</C_COMMENT>
 </Customer>
</Customers>

The query above will return a sequence of customer elements, which does not make a well-formed XML document. The example in the next section will tell you how to modify the query to ensure that we return a well-formed XML document.

Retrieving Documents built/transformed from existing documents

XQuery is both an XML Query language, and an XML transformation language (i.e., it has the capabilities for transforming an XML document into any other text based format – including an XML document with a different structure). In practice, this means XQuery can do all the things that XSLT can do, and more. This capability allows XQuery to extract parts of an XML document and re-form it into a different desired document structure. The query below extracts Customer elements from the XML documents just as before, but now adds a root element named “EarlyCustomers” so that the results are well-formed.

SELECT id, xmldoc.XMLEXTRACT('<EarlyCustomers>{/Customers/Customer[@C_CUSTKEY < 31003]}</EarlyCustomers>', NULL)

FROM T_XMLDOCS;

Retrieving data extracted from XML documents

Sometimes, we are interested only in extracting data from an XML document, and don’t really care about getting it back in XML format. In these cases, users can leverage functions like XMLEXTRACT along with the casts from XML data type to other SQL data types. Additionally the XMLTABLE table function provides a way to convert an XML document tree in rows with columns of different SQL data types. Examples of XMLTABLE are included in the XML Shredding article, the 3rd part of this series of articles.

XML Shredding

XML Shredding is the process of extracting data from XML documents to populate other tables in the database. In TD14.10, shredding is implemented primarily through annotated schemas. The AS_SHRED_BATCH stored procedure (for annotated schema, batch shredding) is used to extract data from XML documents in a staging table to populate a set of target tables. The XMLTABLE table function is another option for shredding on a limited basis. XML Shredding is covered in more detail in part 3 of this series of articles.

XML Publishing

XML Publishing is the process of converting the results of a SQL query into an XML format. The XML Publishing stored procedures will take the query and a mapping as parameters and gives you the results of the query in XML format as defined by the mapping. XML Publishing is covered in part 4 of this series of articles.

Other XML Processing capabilities

Other than the capabilities mentioned above, TD14.10 offers XML processing functions for common operations like XML parsing, validation, transformation and construction through a number of functions defined in the SQL/XML specification. These functions, including XMLPARSE, XMLVALIDATE, XMLTRANSFORM, XMLELEMENT, XMLCONCAT etc., are meant to provide means of working with XML formatted data within a relational database.

A note regarding memory usage

Some of the XML processing operations are memory intensive since the XML document needs to be loaded into memory. To prevent XML operations from consuming all the system memory, we impose a limit on the amount of memory these operations can use. This limit is set using the dbscontrol setting XMLMemoryLimit. The default value for this parameter is 4MB, and can be increased up to 32MB (if you need to increase it further, please contact Teradata Support). When the amount of memory required for XML processing exceeds the value set of XMLMemoryLimit, an Out Of Memory error is returned. There is one known DR which affects memory usage by the XMLQUERY and XMLTABLE functions (DR167091). This DR was shipped to TD14.10.4 maintenance release. If you are having memory issues with these functions, please upgrade to this release.

Summary

This article provided a quick overview of the XML data type introduced in TD14.10. Sample code is provided to give developers a starting point for common XML processing operations including storage, query and retrieval. Upcoming article will take a more detailed look at some of these areas with further examples.

1 REPLY
Teradata Employee

Re: A Quick Tour of the XML Type

I tried using the SHRED demo and the bteq import isn't working, perhaps you can help?

The xml.txt and stream_shred_demo.xml files are both in c:\.  The error I get is LOB filename data is incomplete.  I'm simply using what was supplied.

BTEQ -- Enter your SQL request or BTEQ command:

.IMPORT vartext '|' lobcols = 1 file=c:\xml.txt

.REPEAT *

BTEQ -- Enter your SQL request or BTEQ command:

USING (a XML AS DEFERRED BY NAME, b VARCHAR(30))

INSERT INTO i303030.T_XMLDOCS VALUES(:b, :a, :a);

*** Starting Row 0 at Fri Apr 10 13:32:10 2015

 *** Error:  not found.

 *** Warning: Unconditionally aborting the last request.

 *** Failure 3110 The transaction was aborted by the user.            Statement# 1, Info =0

 *** Total elapsed time was 1 second.

 *** Error: LOB filename data is incomplete.

 *** Warning: Out of data.

 *** Finished at input row 2 at Fri Apr 10 13:32:10 2015

 *** Total number of statements: 1,  Accepted : 0,  Rejected : 1

 *** Total elapsed time was 1 second.