|
Posted by Thomas R. Hummel on 10/22/19 11:33
Hello,
I am currently working on a monthly load process with a datamart. I
originally designed the tables in a normalized fashion with the idea
that I would denormalize as needed once I got an idea of what the
performance was like. There were some performance problems, so the
decision was made to denormalize. Now the users are happy with the
query performance, but loading the tables is much more difficult.
Specifically...
There were two main tables, a header table and a line item table. These
have been combined into one table. For my loads I still receive them as
separate files though. The problem is that I might receive a line item
for a header that began two months ago. When this happens I don't get a
header record in the current month's file - I just get the record in
the line items file. So now I have to join the header and line item
tables in my staging database to get the denormalized rows, but I also
may have to get header information from my main denormalized table
(~150 million rows). For simplicity I will only include the primary
keys and one other column to represent the rest of the row below. The
tables are actually very wide.
Staging database:
CREATE TABLE dbo.Claims (
CLM_ID BIGINT NOT NULL,
CLM_DATA VARCHAR(100) NULL )
CREATE TABLE dbo.Claim_Lines (
CLM_ID BIGINT NOT NULL,
LN_NO SMALLINT NOT NULL,
CLM_LN_DATA VARCHAR(100) NULL )
Target database:
CREATE TABLE dbo.Target (
CLM_ID BIGINT NOT NULL,
LN_NO SMALLINT NOT NULL,
CLM_DATA VARCHAR(100) NULL,
CLM_LN_DATA VARCHAR(100) NULL )
I can either pull back all of the necessary header rows from the target
table to the claims table and then do one insert using a join between
claims and claim lines into the target table OR I can do one insert
with a join between claims and claim lines and then a second insert
with a join between claim lines and target for those lines that weren't
already added.
Some things that I've tried:
INSERT INTO Staging.dbo.Claims (CLM_ID, CLM_DATA)
SELECT DISTINCT T.CLM_ID, T.CLM_DATA
FROM Staging.dbo.Claim_Lines CL
LEFT OUTER JOIN Staging.dbo.Claims C ON C.CLM_ID = CL.CLM_ID
INNER JOIN Target.dbo.Target T ON T.CLM_ID = CL.CLM_ID
WHERE C.CLM_ID IS NULL
INSERT INTO Staging.dbo.Claims (CLM_ID, CLM_DATA)
SELECT T.CLM_ID, T.CLM_DATA
FROM Staging.dbo.Claim_Lines CL
LEFT OUTER JOIN Staging.dbo.Claims C ON C.CLM_ID = CL.CLM_ID
INNER JOIN Target.dbo.Target T ON T.CLM_ID = CL.CLM_ID
WHERE C.CLM_ID IS NULL
GROUP BY T.CLM_ID, T.CLM_DATA
INSERT INTO Staging.dbo.Claims (CLM_ID, CLM_DATA)
SELECT DISTINCT T.CLM_ID, T.CLM_DATA
FROM Target.dbo.Target T
INNER JOIN (SELECT CL.CLM_ID
FROM Staging.dbo.Claim_Lines CL
LEFT OUTER JOIN Staging.dbo.Claims C ON C.CLM_ID =
CL.CLM_ID
WHERE C.CLM_ID IS NULL) SQ ON SQ.CLM_ID = T.CLM_ID
I've also used EXISTS and IN in various queries. No matter which method
I use, the query plans tend to want to do a clustered index scan on the
target table (actually a partitioned view partitioned by year). The
number of headers that were in the target but not the header file this
month was about 42K out of 1M.
So.... any other ideas on how I can set up a query to get the distinct
headers from the denormalized table? Right now I'm considering using
worktables if I can't figure anything else out, but I don't know if
that will really help me much either.
I'm not looking for a definitive answer here, just some ideas that I
can try.
Thanks,
-Tom.
Navigation:
[Reply to this message]
|