QUERY OPTIMIZATION

Database
Enthusiast

QUERY OPTIMIZATION

I faced this scenario in an interview. Suppose there are 2 large tables say CUSTOMERS AND CUSTOMER_SESSIONS. The CUSTOMERS table holds the CUST_ID and CUST_NAME say for a website such as facebook/amazon. The CUSTOMER_SESSIONS table contains the CUST_ID referencing the CUSTOMERS tables and the SESSION_TIME every time the user logs in. Both of these tables will be huge. Now the question is how do I optimize the query to find all those customers who have logged in for the first time this year.

I have the sql something like this:

SELECT A.CUST_NAME FROM CUSTOMERS A WHERE NOT EXISTS(SELECT 1 FROM CUSTOMER_SESSIONS B WHERE A.CUST_ID=B.CUST_ID AND B.SESSION_DATETIME>='2014-01-01');

Now it is obvious that when looking up on the CUSTOMER_SESSIONS table it will be highly time consuming.

I have the following suggestions:

1) Create an intermediate table say CUST_SESSIONS_INTERMEDIATE on CUSTOMER_SESSIONS table to hold 1 year of data only. Everyday the 366th day's data will be loaded into the CUSTOMER_SESSIONS table from CUST_SESSIONS_INTERMEDIATE and the 366th day's data will be removed from the CUST_SESSIONS_INTERMEDIATE table.

2) Create a join index for the query above. 

3) Also suggesting that the CUST_ID for CUSTOMERS be the unique_primary_index, the CUST_ID for CUSTOMER_SESSIONS be the non unique primary index and that the CUST_ID for CUST_SESSIONS_INTERMEDIATE is also the non unique primary index.

Is it the right approach?

4 REPLIES
Enthusiast

Re: QUERY OPTIMIZATION

The intermediate table is a waste of IO.  Bullet three is the key here, PI to PI joins, making the joins amp local are going to be key to this query.  You can also partition the sessions table by logon date.  

You could also create a join index as you suggest, but you might find that the JI is more work than it's worth, depending upon your ETL processes.  

Junior Contributor

Re: QUERY OPTIMIZATION

Your query returns the customers who did not login in the year 2014.

"Logged in for the first time this year" is something different for me:

​SELECT * 
FROM CUSTOMERS A
WHERE CUST_ID IN
(
SELECT CUST_ID
FROM CUSTOMER_SESSIONS
WHERE SESSION_DATETIME>= DATE '2014-01-01'
GROUP BY 1
HAVING COUNT(*) = 1 -- or MIN(SESSION_DATETIME) = MAX(SESSION_DATETIME)
)

For optimizing both tables should have the same PI on CUST_ID plus CUSTOMER_SESSIONS should be partitioned based on SESSION_DATETIME, probably daily partitions.

A JI (or a table or denormalization) with MIN/MAX/COUNT per CUST_ID would be quite small but with high maintenance cost...

Supporter

Re: QUERY OPTIMIZATION

not sure that Dieters SQL works out.

The where condition will exclude all older sessions -> all cust_ids with one session will be returned...And I don't read the question that only one session is allowed.

AJ with MIN (first) per cust_id would be good but as Dieter said would have some costs.

In case of an AJ

SELECT *
FROM CUSTOMERS A
WHERE CUST_ID IN
(
SELECT CUST_ID
FROM CUSTOMER_SESSIONS
GROUP BY 1
HAVING MIN(SESSION_DATETIME) >= '2014-01-01'
)

would give you the right answer.

Alternative - Partition by 2014 and the rest (as a optimization for this specific query, often this is not what you want)

In this case

SELECT *
FROM CUSTOMERS A
WHERE CUST_ID exists
(
SELECT si.CUST_ID
FROM CUSTOMER_SESSIONS si
where si.SESSION_DATETIME >= '2014-01-01'
and si.cust_id = a.cust_id
)
and not exits
(
SELECT so.CUST_ID
FROM CUSTOMER_SESSIONS so
where so.SESSION_DATETIME < '2014-01-01'
and so.cust_id = a.cust_id
)
Enthusiast

Re: QUERY OPTIMIZATION

Thank you so much Dieter, Ulrich and VandeBerg.