You are here: Re: SQL query - duplicate records - different dates - how to get only latest information? « MsSQL Server « IT news, forums, messages
Re: SQL query - duplicate records - different dates - how to get only latest information?

Posted by Hugo Kornelis on 03/24/06 23:41

On Mon, 20 Mar 2006 10:53:29 -0500, clare at snyder.on.ca wrote:

(snip)
>From Hugo's final version, which, by the way, appears to be the only
>one that actually works the way I had intended it to. I wish I could
>completely understand the query, and it might be helpfull if Hugo
>could put a documented version of the code up for the edification of
>all of us.
>
>How 'bout it Hugo??

Hi Clare,

Flattery always works for me. And you have included enough flattery to
last me a year or two! <g>

I'm not sure which one of my two queries you'd like to see commented, so
I'll just do both.

Here's #1. Basically a copy of Erland's original suggestion, with an
extra subquery in the WHERE clause to handle the additional problem of
cancelled policies.

SELECT DISTINCT c.lastname, c.email, p.polno, p.polexpdate
FROM afw_customer AS c
INNER JOIN (SELECT custid, MAX(polexpdate) AS polexpdate
FROM afw_basicpolinfo
WHERE polno LIKE '1932700%'
GROUP BY custid) AS p1

The four lines above are a so-called "derived table". A derived table is
a subquery in the FROM clause. It can only be a non-correlated query,
i.e. you must be able to run it on it's own. If you select just the part
from "SELECT custind" up to "GROUP BY custid" and execute it, you'll get
a normal result. Using a derived table is basically just a shortcut for
making, using, and then dropping a temp table or a view. The "AS p1"
gives the result of this derived table a name (alias).

In this derived table, rows are selected that match the LIKE; they are
then arranged into groups for each distinct custid and the most recent
expiration date ("MAX(polexpdate)") in each group is selected. The
result of this derived table is a table named p1, with two columns:
custid (holding the customerid) and polexpdate (holding the most recent
expiration date for policy 1932700 for this customer).

ON p1.custid = c.custid
INNER JOIN afw_basicpolinfo AS p
ON p.custid = p1.custid
AND p.polexpdate = p1.polexpdate

The results of the derived table are now joined back to the complete
afw_basicpolinfo table. Since both custid and polexpdate have to match,
we'll only join to rows that match a "most recent" policy expiration
date. Through this technique, we can now address other columns in the
same row, that we couldn't include in the derived table without breaking
it's logic. In this query, this extra join is required because you want
to display p.polno in the results.

WHERE p.polexpdate BETWEEN '20060201' AND '20060228'
AND c.email IS NOT NULL
AND NOT EXISTS (SELECT *
FROM afw_basicpolinfo
WHERE (polno LIKE '1932700%CANC'
OR polno LIKE '1932700%NOTRENEWED')
AND custid = c.custid)

Here's the subquery I added. Erland's first suggestion when you added
the requirement to filter out cancelled policies was based on the
assumption that the encoding ("ending in CANC or NOTRENEWED") was always
in the row with the most recent expiration date. You said that the CANC
or NOTRENEWED code might also be in a different row for the same policy
and the same customer. So I set up this subquery. This is a correlated
subquery: it can't be executed on it's own, since it references columns
from tables in the outer query. It has to be re-evaluated for each row
in the outer table. (In theory - the query optimizer in the DB engine
might transform this to different code that produces the same results
faster).

Given a specific customer (c.custid), this subquery attempts to find any
row for that customer with a policy number that starts with 1932700 and
ends in either CANC or NOTRENEWED. If it finds one (or more), then the
EXISTS evaluates to true and hence the NOT EXISTS evaluates to false and
the row in the outer query is rejected from the result set.

ORDER BY p.polexpdate, c.lastname


That was my first query. A quite straightforward addition of your extra
requirement to the existing query suggested by Erland.

My second suggestion was more adventurous and needs lots more testing
but has great potential of being faster, since it eliminates the need to
evaluate a subquery for each row that is produced by the joins in the
outer query.

SELECT DISTINCT c.lastname, c.email, p.polno, p.polexpdate
FROM afw_customer AS c
INNER JOIN (SELECT custid, MAX(polexpdate) AS polexpdate
FROM afw_basicpolinfo
WHERE polno LIKE '1932700%'
GROUP BY custid
HAVING MAX(CASE WHEN polno LIKE '%CANC'
OR polno LIKE '%NOTRENEWED'
THEN 1
ELSE 0
END) = 0) AS p1

What we have here is the same derived table technique, but the subquery
in the derived table is extended with a HAVING clause. A HAVING clause
is much like a WHERE clause - except a WHERE clause is used to reject
individual rows before grouping them (if a GROUP BY is present); a
HAVING clause eliminates complete groups of rows after grouping them
according to the GROUP BY clause.

This particular HAVING clause is complex. Let's start at the innermost
level:

CASE WHEN polno LIKE '%CANC'
OR polno LIKE '%NOTRENEWED'
THEN 1
ELSE 0
END

This CASE expression will be evaluated for each row in the group. If the
polno of that row ends in CANC of NOTRENEWED, the result is 1; if not,
it is 0. So 1 means that the row is flagged as cancelled; 0 means it's
not.

MAX(CASE WHEN polno LIKE '%CANC'
OR polno LIKE '%NOTRENEWED'
THEN 1
ELSE 0
END) = 0) AS p1

After determining 1 (flagged as cancelled) or 0 (not thusly flagged) for
each row in the group, the maximum value of the group is identified. If
at least one row in the group is flagged as cancelled, the MAX value of
the CASE expression has to be 1. The MAX can only be 0 if not a single
row in the group is marked as cancelled.

HAVING MAX(CASE WHEN polno LIKE '%CANC'
OR polno LIKE '%NOTRENEWED'
THEN 1
ELSE 0
END) = 0

And therefor, this HAVING clause will completely reject all rows from a
customer that has at least one row with a cancellation marker.

In the end, the derived table will hold custid and their most recent
expiration date (as in the first query), but only for customers for whom
there is no cancellation marker set on any row.

ON p1.custid = c.custid
INNER JOIN afw_basicpolinfo AS p
ON p.custid = p1.custid
AND p.polexpdate = p1.polexpdate

And as a result, this inner join will only provide the full information
from rows for customers without cancellation and their most recent
expiration date.

WHERE p.polexpdate BETWEEN '20060201' AND '20060228'
AND c.email IS NOT NULL
ORDER BY p.polexpdate, c.lastname

Since the customers with cancellation were already filtered out in the
derived table, the rest of the query is in this case unchanged from the
original query by Erland.

I already mentioned that I expect this query to be faster than the first
suggestion. In case you're wondering why: the subquery in the first
suggestion asks SQL Server to go back to the table, and do another
search over the data. This will probably result in more logical I/O
requests. And depending on the size of your table, that might even
result in more physical I/O as well.

The second suggestion "moves" the extra logic to the derived table. This
means that SQL Server has to do more processing there, but it's done in
a place where we already ask SQL Server to read all rows. It might take
a bit more CPU power, but not more I/O. And since I/O is almost always
the bottleneck for performance in databases, I expect this query to run
faster.

I hope this helps!!

--
Hugo Kornelis, SQL Server MVP

 

Navigation:

[Reply to this message]


Удаленная работа для программистов  •  Как заработать на Google AdSense  •  England, UK  •  статьи на английском  •  PHP MySQL CMS Apache Oscommerce  •  Online Business Knowledge Base  •  DVD MP3 AVI MP4 players codecs conversion help
Home  •  Search  •  Site Map  •  Set as Homepage  •  Add to Favourites

Copyright © 2005-2006 Powered by Custom PHP Programming

Сайт изготовлен в Студии Валентина Петручека
изготовление и поддержка веб-сайтов, разработка программного обеспечения, поисковая оптимизация