Q

Customers who bought X at least once and Y at least twice

I have a table CUSTOMERS (customer_id) and a table PURCHASES (customer_id, purchase_date, product_id). I am trying to find the distinct customers who have bought product_id=X at least once and, in the previous 12 months, bought product_id=Y at least twice. Any ideas for a query that does the job in minimum time?

Solutions that do the job "in a minimum time" are always challenging. Success depends on the existence of proper indexes. Assuming these are in place, we can still sometimes see markedly different results for solutions written with different query constructions.
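To make "proper indexes" a little more concrete, here is a minimal sketch of one index that would plausibly help with this kind of lookup. The index name is made up, and the column order is an assumption (customer first, then product, then date); the right choice depends on your DBMS and your data, so check the execution plan before relying on it.

```sql
-- Hypothetical index: the name and the column order are assumptions,
-- not something given in the question.
-- It lets the database locate one customer's purchases of one product,
-- narrowed by date, without scanning all of PURCHASES.
create index purchases_cust_prod_date
    on PURCHASES
     ( customer_id
     , product_id
     , purchase_date );
```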

The first step towards a solution, the most important step, is to make sure we understand the exact requirements. In this case, we don't care about purchase data, just that it exists. We aren't actually retrieving anything from the PURCHASES table! If you had said "give the date of the latest Product X purchase, and the total number of Product Y purchases" then we'd need to write a totally different query.
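For contrast, here is a rough sketch of what that other requirement might look like. It is an illustration only: the column aliases are invented, it reports every customer who has at least one purchase, and a customer with no Product X purchases gets a null date.

```sql
-- Sketch of the *other* requirement, not the one actually asked.
-- latest_x_purchase and total_y_purchases are made-up aliases.
select C.customer_id
     , max(
        case when P.product_id = 'X'
             then P.purchase_date end
          ) as latest_x_purchase     -- null if no X purchases
     , sum(
        case when P.product_id = 'Y'
             then 1 else 0 end
          ) as total_y_purchases
  from CUSTOMERS as C
inner
  join PURCHASES as P
    on C.customer_id
     = P.customer_id
group
    by C.customer_id
```

Here values from PURCHASES end up in the SELECT list, which is exactly what the question as asked does not need.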

Here's one solution:

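select C.customer_id
  from CUSTOMERS as C
    -- at least one purchase of product X, at any time
 where exists
       ( select *
           from PURCHASES
          where customer_id
              = C.customer_id
            and product_id = 'X' )
    -- at least two purchases of product Y inside the window;
    -- date1 and date2 stand for the start and end of the 12-month period
   and 2 <=
       ( select count(*)
           from PURCHASES
          where customer_id
              = C.customer_id
            and product_id = 'Y'
            and purchase_date
                between date1
                    and date2 )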

Each of the two subqueries above is a correlated subquery. This means that it considers only those purchases which match the customer_id of the correlated row in the main query.

One advantage of using correlated subqueries is that it's fairly easy to understand what they're doing. They "read" well. In this case, though, there are two of them, which leaves open the possibility that the database optimizer will generate two separate joins in order to execute them. (Correlated subqueries are usually executed as joins.)

Here's a different solution:

select C.customer_id
  from CUSTOMERS as C
inner
  join PURCHASES as P
    on C.customer_id
     = P.customer_id
group
    by C.customer_id
    -- count of product X purchases must be at least 1
having 0 <
       sum(
        case when P.product_id = 'X'
             then 1 else 0 end
          )
    -- count of product Y purchases between date1 and date2
    -- must be at least 2
   and 2 <=
       sum(
        case when P.product_id = 'Y'
              and P.purchase_date
                between date1
                    and date2
             then 1 else 0 end
          )

Here you can see we've taken matters into our own hands and performed one join explicitly. Note that we're still returning only customer_id; nothing from the PURCHASES table appears in the SELECT list. The WHERE EXISTS construction is replaced by taking a count and making sure it's not zero. Each count is obtained as the SUM of a CASE expression that yields 1 for a row that matches and 0 for a row that doesn't.
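If it's not obvious what the HAVING clause is testing, it can help to move the two sums into the SELECT list and look at them. The following is a sketch for inspection only: the column aliases are invented, and the two date literals are arbitrary stand-ins for date1 and date2, written in ANSI date literal syntax, which not every DBMS accepts.

```sql
-- Inspection query: shows the two conditional counts per customer
-- instead of filtering on them. Aliases and dates are assumptions.
select C.customer_id
     , sum(
        case when P.product_id = 'X'
             then 1 else 0 end
          ) as x_purchases
     , sum(
        case when P.product_id = 'Y'
              and P.purchase_date
                between date '2004-03-01'
                    and date '2005-02-28'
             then 1 else 0 end
          ) as y_purchases_in_window
  from CUSTOMERS as C
inner
  join PURCHASES as P
    on C.customer_id
     = P.customer_id
group
    by C.customer_id
```

A customer qualifies when the first column is at least 1 and the second is at least 2.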

Which of the solutions is faster? Try them both, and see.
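To try them, you need some data. Here is a small made-up data set, a sketch only: the column types, the sample rows and the window dates are all assumptions, and the ANSI date literals may need adjusting for your DBMS.

```sql
-- Hypothetical table definitions; the question gives only the column names.
create table CUSTOMERS
( customer_id   integer     not null primary key
);
create table PURCHASES
( customer_id   integer     not null
, purchase_date date        not null
, product_id    varchar(10) not null
);

insert into CUSTOMERS (customer_id) values (1);
insert into CUSTOMERS (customer_id) values (2);
insert into CUSTOMERS (customer_id) values (3);
insert into CUSTOMERS (customer_id) values (4);

-- customer 1: one X, two Y inside the window
insert into PURCHASES values (1, date '2004-06-15', 'X');
insert into PURCHASES values (1, date '2004-09-01', 'Y');
insert into PURCHASES values (1, date '2004-11-20', 'Y');

-- customer 2: one X, but only one Y inside the window (the other is too old)
insert into PURCHASES values (2, date '2004-07-04', 'X');
insert into PURCHASES values (2, date '2004-10-10', 'Y');
insert into PURCHASES values (2, date '2002-01-05', 'Y');

-- customer 3: two Y inside the window, but no X
insert into PURCHASES values (3, date '2004-08-08', 'Y');
insert into PURCHASES values (3, date '2004-12-12', 'Y');

-- customer 4: no purchases at all
```

With date '2004-03-01' substituted for date1 and date '2005-02-28' for date2, either query should return only customer 1. A handful of rows tells you nothing about speed, of course; for that you need realistic volumes and a look at the execution plan.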


This was first published in March 2005