Workload Toolkit Part I – Excessive use of String Manipulation Verbs

Database
The Teradata Database channel includes discussions around advanced Teradata features such as high-performance parallel database technology, the optimizer, mixed workload management solutions, and other related technologies.
Teradata Employee

Workload Toolkit Part I – Excessive use of String Manipulation Verbs

We all know the benefits of identifying and fixing badly performing workloads. Many sites use the common method of identifying the top n highest consuming or skewed queries (often the same queries) and fixing those.

In addition to this common practice, one of the first things I like to do when arriving at a client site for performance analysis is to run a few queries I have in a set of queries I call my “toolkit”. These are queries I have developed over the years that can give me insight into workload performance at a little bit higher level. Most of these look at trends in either workloads or modeling practices, more of a “holistic” tuning methodology rather than a one query at a time correction.

The specific query I will discuss today is run against the PDCR SQL History table (or against your own DBQL historical data if you have a homegrown capture process). It looks at all of the queries for a given time period that contain one of these five SQL verbs in the where clause:

SUBSTR (ING)

POSITION

TRIM

LIKE

|| (double pipe - concatenation characters)

What do these particular verbs have in common? They all perform some sort of string manipulation within column data. Now string manipulation by itself is not necessarily bad when used infrequently in say, an ETL process.

However, I limited the query to look at non-ETL workloads, and it is written to look for these verbs in the where clause of the queries. Excessive use of these verbs against the where clause columns causes inefficiency in the queries by precluding the use of statistics or indexes on those columns, sometimes causing full table scans to identify all the qualifying rows. Definitely, something we would like to correct if possible.

Let’s take a look at a sample run of this query. Not all the results are displayed, just a few to show there are good and bad results and to give you a feel for what is returned. I grouped the queries by using the standard 4 positions of the account string that we normally use to identify workloads in TASM. In this case I have highlighted any percentage that is 10% or greater of the total number of queries.

Looking at the highlighted columns, we have to ask why some of these workloads are doing what they do. In the case of workload BBBB about 40% of all queries use SUBSTR and TRIM. For workload LLLL the LIKE verb and Concatenation is used in approximately 80% of the queries. I’m not sure I even want to see workload OOOO. As you can see from the percentages, these tell me something is seriously amiss in these workloads and they need to be looked into.

I usually find one of two reasons for excessive use of these verbs:

1) The need for more developer/end user training in the correct use of the data model or Business Intelligence tool

2) The physical model does not correctly represent hierarchy data (dimensions or lookups) in the database, forcing users to attempt to “find” the hierarchy data by searching the column contents

For the first issue, many times this is an education issue, in that the BI developers simply do not know that they could use another table that already decodes the values they are looking for, so they chose to “decode” the fact table value by using SUBSTR, LIKE, etc.

For the second problem, the data model does not have the proper decoded hierarchy data in dimension or lookup tables to be used by the queries, forcing the BI developers to use alternative, less performing methods to identify the data, or join to another table by creating values on the fly, i.e. concatenating.

The fix for both is to look at the identified workload, determine which of the scenarios is in play and take the appropriate action through training or changing the model to provide appropriate hierarchy data.

Here is the query I use to do the analysis:

SELECT ACCOUNT_STRING,
QUALIFIED_QUERIES (FORMAT 'ZZZ,ZZZ,ZZ9'),
(USE_SUBSTR / (QUALIFIED_QUERIES (FLOAT))) * 100 (FORMAT 'ZZ9.99') PER_SUBSTR,
(USE_POS / (QUALIFIED_QUERIES (FLOAT))) * 100 (FORMAT 'ZZ9.99') PER_POSITION,
(USE_LIKE / (QUALIFIED_QUERIES (FLOAT))) * 100 (FORMAT 'ZZ9.99') PER_LIKE,
(USE_TRIM / (QUALIFIED_QUERIES (FLOAT))) * 100 (FORMAT 'ZZ9.99') PER_TRIM,
(USE_CONCAT / (QUALIFIED_QUERIES (FLOAT))) * 100 (FORMAT 'ZZ9.99') PER_CONCAT

FROM
(
SELECT ACCOUNT_STRING,
COUNT(*) QUALIFIED_QUERIES,
SUM(CASE WHEN WHERE_CLAUSE LIKE '%SUBSTR%' THEN 1 ELSE 0 END) USE_SUBSTR,
SUM(CASE WHEN WHERE_CLAUSE LIKE '%POSITION%' THEN 1 ELSE 0 END) USE_POS,
SUM(CASE WHEN WHERE_CLAUSE LIKE '%TRIM%' THEN 1 ELSE 0 END) USE_TRIM,
SUM(CASE WHEN WHERE_CLAUSE LIKE '%LIKE%' THEN 1 ELSE 0 END) USE_LIKE,
SUM(CASE WHEN WHERE_CLAUSE LIKE '%||%' THEN 1 ELSE 0 END) USE_CONCAT
FROM
(
SELECT l.LOGDATE,
/* USING TERADATA PERFORMANCE COE RECOMMENDED ACCOUNT STRING FORMAT */
SUBSTR(AcctString,5,4) ACCOUNT_STRING,
POSITION('WHERE' IN s.sqltextinfo) WHERE_POS,
SUBSTR(s.sqltextinfo, WHERE_POS, 32000 - WHERE_POS) WHERE_CLAUSE

FROM PDCRINFO.DBQLogTbl l,
PDCRINFO.DBQLSqlTbl s

WHERE l.ProcID = s.ProcID
AND l.QueryID = s.QueryID
AND s.SqlRowNo = 1

AND l.LOGDATE >= CURRENT_DATE - 30
AND WHERE_POS > 0
AND ACCOUNT_STRING NOT IN ('XXXX','YYYY')
) QQ

HAVING QUALIFIED_QUERIES > 100
AND ( USE_SUBSTR > 0 OR
USE_POS > 0 OR
USE_TRIM > 0 OR
USE_LIKE > 0 OR
USE_CONCAT > 0
)

GROUP BY 1
) FF

ORDER BY 2 DESC,1
;

I will be curious to see some of the results people get, the reasons you find these issues exist, and your fixes and results. Please feel free to share your successes back with the DevX community.

Happy hunting!

Dave

2 REPLIES
Enthusiast

Re: Workload Toolkit Part I – Excessive use of String Manipulation Verbs

Very good article indeed....it hits at the right areas. Talking of performance analysis can I request you to pen down important steps to observe performance issues. At our site, we had a performance meltdown on a particular day with loads gatting delayed, lots many locks and then business users queries in turn taking lot many resources . To avoid a repeatation ,I would like to know the observations to carry out to deduce and to solve/reduce the impact.
kindly suggest.
Enthusiast

Re: Workload Toolkit Part I – Excessive use of String Manipulation Verbs

The power of DBQL, thanks for the tip, it is definitely useful to understand the workload on the system with respect to string manipulations. The applied instrumentation through the use of DBQL is simply intriguing.