|  | Posted by Thomas R. Hummel on 06/19/19 11:33 
Hello,
 I am currently working on a monthly load process with a datamart. I
 originally designed the tables in a normalized fashion with the idea
 that I would denormalize as needed once I got an idea of what the
 performance was like. There were some performance problems, so the
 decision was made to denormalize. Now the users are happy with the
 query performance, but loading the tables is much more difficult.
 Specifically...
 
 There were two main tables, a header table and a line item table. These
 have been combined into one table. For my loads I still receive them as
 separate files though. The problem is that I might receive a line item
 for a header that began two months ago. When this happens I don't get a
 header record in the current month's file - I just get the record in
 the line items file. So now I have to join the header and line item
 tables in my staging database to get the denormalized rows, but I also
 may have to get header information from my main denormalized table
 (~150 million rows). For simplicity I will only include the primary
 keys and one other column to represent the rest of the row below. The
 tables are actually very wide.
 
 Staging database:
 
 CREATE TABLE dbo.Claims (
 CLM_ID BIGINT NOT NULL,
 CLM_DATA VARCHAR(100) NULL )
 
 CREATE TABLE dbo.Claim_Lines (
 CLM_ID BIGINT NOT NULL,
 LN_NO SMALLINT NOT NULL,
 CLM_LN_DATA VARCHAR(100) NULL )
 
 Target database:
 
 CREATE TABLE dbo.Target (
 CLM_ID BIGINT NOT NULL,
 LN_NO SMALLINT NOT NULL,
 CLM_DATA VARCHAR(100) NULL,
 CLM_LN_DATA VARCHAR(100) NULL )
 
 I can either pull back all of the necessary header rows from the target
 table to the claims table and then do one insert using a join between
 claims and claim lines into the target table OR I can do one insert
 with a join between claims and claim lines and then a second insert
 with a join between claim lines and target for those lines that weren't
 already added.
 
 Some things that I've tried:
 
 INSERT INTO Staging.dbo.Claims (CLM_ID, CLM_DATA)
 SELECT DISTINCT T.CLM_ID, T.CLM_DATA
 FROM Staging.dbo.Claim_Lines CL
 LEFT OUTER JOIN Staging.dbo.Claims C ON C.CLM_ID = CL.CLM_ID
 INNER JOIN Target.dbo.Target T ON T.CLM_ID = CL.CLM_ID
 WHERE C.CLM_ID IS NULL
 
 INSERT INTO Staging.dbo.Claims (CLM_ID, CLM_DATA)
 SELECT T.CLM_ID, T.CLM_DATA
 FROM Staging.dbo.Claim_Lines CL
 LEFT OUTER JOIN Staging.dbo.Claims C ON C.CLM_ID = CL.CLM_ID
 INNER JOIN Target.dbo.Target T ON T.CLM_ID = CL.CLM_ID
 WHERE C.CLM_ID IS NULL
 GROUP BY T.CLM_ID, T.CLM_DATA
 
 INSERT INTO Staging.dbo.Claims (CLM_ID, CLM_DATA)
 SELECT DISTINCT T.CLM_ID, T.CLM_DATA
 FROM Target.dbo.Target T
 INNER JOIN (SELECT CL.CLM_ID
 FROM Staging.dbo.Claim_Lines CL
 LEFT OUTER JOIN Staging.dbo.Claims C ON C.CLM_ID =
 CL.CLM_ID
 WHERE C.CLM_ID IS NULL) SQ ON SQ.CLM_ID = T.CLM_ID
 
 I've also used EXISTS and IN in various queries. No matter which method
 I use, the query plans tend to want to do a clustered index scan on the
 target table (actually a partitioned view partitioned by year). The
 number of headers that were in the target but not the header file this
 month was about 42K out of 1M.
 
 So.... any other ideas on how I can set up a query to get the distinct
 headers from the denormalized table? Right now I'm considering using
 worktables if I can't figure anything else out, but I don't know if
 that will really help me much either.
 
 I'm not looking for a definitive answer here, just some ideas that I
 can try.
 
 Thanks,
 -Tom.
  Navigation: [Reply to this message] |