Re: SQL query - duplicate records - different dates - how to get only latest information? — MsSQL Server

You are here: Re: SQL query - duplicate records - different dates - how to get only latest information? « MsSQL Server « IT news, forums, messages

Posted by clare at snyder.on.ca on 03/25/06 02:38

On Fri, 24 Mar 2006 22:41:58 +0100, Hugo Kornelis
<hugo@perFact.REMOVETHIS.info.INVALID> wrote:

>On Mon, 20 Mar 2006 10:53:29 -0500, clare at snyder.on.ca wrote:
>
>(snip)
>>From Hugo's final version, which, by the way, appears to be the only
>>one that actually works the way I had intended it to. I wish I could
>>completely understand the query, and it might be helpfull if Hugo
>>could put a documented version of the code up for the edification of
>>all of us.
>>
>>How 'bout it Hugo??
>
>Hi Clare,
>
>Flattery always works for me. And you have included enough flattery to
>last me a year or two! <g>
>
>I'm not sure which one of my two queries you'd like to see commented, so
>I'll just do both.
>
>Here's #1. Basically a copy of Erland's original suggestion, with an
>extra subquery in the WHERE clause to handle the additional problem of
>cancelled policies.
>
>SELECT DISTINCT c.lastname, c.email, p.polno, p.polexpdate
>FROM afw_customer AS c
>INNER JOIN (SELECT custid, MAX(polexpdate) AS polexpdate
> FROM afw_basicpolinfo
> WHERE polno LIKE '1932700%'
> GROUP BY custid) AS p1
>
>The four lines above are a so-called "derived table". A derived table is
>a subquery in the FROM clause. It can only be a non-correlated query,
>i.e. you must be able to run it on it's own. If you select just the part
>from "SELECT custind" up to "GROUP BY custid" and execute it, you'll get
>a normal result. Using a derived table is basically just a shortcut for
>making, using, and then dropping a temp table or a view. The "AS p1"
>gives the result of this derived table a name (alias).
>
>In this derived table, rows are selected that match the LIKE; they are
>then arranged into groups for each distinct custid and the most recent
>expiration date ("MAX(polexpdate)") in each group is selected. The
>result of this derived table is a table named p1, with two columns:
>custid (holding the customerid) and polexpdate (holding the most recent
>expiration date for policy 1932700 for this customer).
>
> ON p1.custid = c.custid
>INNER JOIN afw_basicpolinfo AS p
> ON p.custid = p1.custid
> AND p.polexpdate = p1.polexpdate
>
>The results of the derived table are now joined back to the complete
>afw_basicpolinfo table. Since both custid and polexpdate have to match,
>we'll only join to rows that match a "most recent" policy expiration
>date. Through this technique, we can now address other columns in the
>same row, that we couldn't include in the derived table without breaking
>it's logic. In this query, this extra join is required because you want
>to display p.polno in the results.
>
>WHERE p.polexpdate BETWEEN '20060201' AND '20060228'
>AND c.email IS NOT NULL
>AND NOT EXISTS (SELECT *
> FROM afw_basicpolinfo
> WHERE (polno LIKE '1932700%CANC'
> OR polno LIKE '1932700%NOTRENEWED')
> AND custid = c.custid)
>
>Here's the subquery I added. Erland's first suggestion when you added
>the requirement to filter out cancelled policies was based on the
>assumption that the encoding ("ending in CANC or NOTRENEWED") was always
>in the row with the most recent expiration date. You said that the CANC
>or NOTRENEWED code might also be in a different row for the same policy
>and the same customer. So I set up this subquery. This is a correlated
>subquery: it can't be executed on it's own, since it references columns
>from tables in the outer query. It has to be re-evaluated for each row
>in the outer table. (In theory - the query optimizer in the DB engine
>might transform this to different code that produces the same results
>faster).
>
>Given a specific customer (c.custid), this subquery attempts to find any
>row for that customer with a policy number that starts with 1932700 and
>ends in either CANC or NOTRENEWED. If it finds one (or more), then the
>EXISTS evaluates to true and hence the NOT EXISTS evaluates to false and
>the row in the outer query is rejected from the result set.
>
>ORDER BY p.polexpdate, c.lastname
>
>
>That was my first query. A quite straightforward addition of your extra
>requirement to the existing query suggested by Erland.
>
>My second suggestion was more adventurous and needs lots more testing
>but has great potential of being faster, since it eliminates the need to
>evaluate a subquery for each row that is produced by the joins in the
>outer query.
>
>SELECT DISTINCT c.lastname, c.email, p.polno, p.polexpdate
>FROM afw_customer AS c
>INNER JOIN (SELECT custid, MAX(polexpdate) AS polexpdate
> FROM afw_basicpolinfo
> WHERE polno LIKE '1932700%'
> GROUP BY custid
> HAVING MAX(CASE WHEN polno LIKE '%CANC'
> OR polno LIKE '%NOTRENEWED'
> THEN 1
> ELSE 0
> END) = 0) AS p1
>
>What we have here is the same derived table technique, but the subquery
>in the derived table is extended with a HAVING clause. A HAVING clause
>is much like a WHERE clause - except a WHERE clause is used to reject
>individual rows before grouping them (if a GROUP BY is present); a
>HAVING clause eliminates complete groups of rows after grouping them
>according to the GROUP BY clause.
>
>This particular HAVING clause is complex. Let's start at the innermost
>level:
>
> CASE WHEN polno LIKE '%CANC'
> OR polno LIKE '%NOTRENEWED'
> THEN 1
> ELSE 0
> END
>
>This CASE expression will be evaluated for each row in the group. If the
>polno of that row ends in CANC of NOTRENEWED, the result is 1; if not,
>it is 0. So 1 means that the row is flagged as cancelled; 0 means it's
>not.
>
> MAX(CASE WHEN polno LIKE '%CANC'
> OR polno LIKE '%NOTRENEWED'
> THEN 1
> ELSE 0
> END) = 0) AS p1
>
>After determining 1 (flagged as cancelled) or 0 (not thusly flagged) for
>each row in the group, the maximum value of the group is identified. If
>at least one row in the group is flagged as cancelled, the MAX value of
>the CASE expression has to be 1. The MAX can only be 0 if not a single
>row in the group is marked as cancelled.
>
> HAVING MAX(CASE WHEN polno LIKE '%CANC'
> OR polno LIKE '%NOTRENEWED'
> THEN 1
> ELSE 0
> END) = 0
>
>And therefor, this HAVING clause will completely reject all rows from a
>customer that has at least one row with a cancellation marker.
>
>In the end, the derived table will hold custid and their most recent
>expiration date (as in the first query), but only for customers for whom
>there is no cancellation marker set on any row.
>
> ON p1.custid = c.custid
>INNER JOIN afw_basicpolinfo AS p
> ON p.custid = p1.custid
> AND p.polexpdate = p1.polexpdate
>
>And as a result, this inner join will only provide the full information
>from rows for customers without cancellation and their most recent
>expiration date.
>
>WHERE p.polexpdate BETWEEN '20060201' AND '20060228'
>AND c.email IS NOT NULL
>ORDER BY p.polexpdate, c.lastname
>
>Since the customers with cancellation were already filtered out in the
>derived table, the rest of the query is in this case unchanged from the
>original query by Erland.
>
>I already mentioned that I expect this query to be faster than the first
>suggestion. In case you're wondering why: the subquery in the first
>suggestion asks SQL Server to go back to the table, and do another
>search over the data. This will probably result in more logical I/O
>requests. And depending on the size of your table, that might even
>result in more physical I/O as well.
>
>The second suggestion "moves" the extra logic to the derived table. This
>means that SQL Server has to do more processing there, but it's done in
>a place where we already ask SQL Server to read all rows. It might take
>a bit more CPU power, but not more I/O. And since I/O is almost always
>the bottleneck for performance in databases, I expect this query to run
>faster.
>
>I hope this helps!!

Thanks Hugo!! Clarifies a lot for me, and answers the question the
other guy asked.
*** Free account sponsored by SecureIX.com ***
*** Encrypt your Internet usage with a free VPN account from http://www.SecureIX.com ***

Navigation:

Next in forum: Re: Dell Raid 0+1
Prev in forum: Re: ORDER BY in VIEW not working
Thread view: Re: SQL query - duplicate records - different dates - how to get only latest information?

[Reply to this message]

Удаленная работа для программистов • Как заработать на Google AdSense • England, UK • статьи на английском • PHP MySQL CMS Apache Oscommerce • Online Business Knowledge Base • DVD MP3 AVI MP4 players codecs conversion help

Home • Search • Site Map • Set as Homepage • Add to Favourites

Сайт изготовлен в Студии Валентина Петручека —
изготовление и поддержка веб-сайтов, разработка программного обеспечения, поисковая оптимизация