 Posted by JayCallas on 07/12/45 11:43 
Thanks for replying. As you probably guessed, there are specific 
reasons why I am asking these questions. Maybe a bit more info would be 
helpful. 
 
* I am not talking about the same data being written to multiple 
tables. I'm referring to situations where different columns from the same 
row are written to different tables. 
 
A concrete example would be securities (stocks) data. There are different 
types of securities (Equities and Options are just two of them). Both 
types share common information like symbol, name, and security type, 
but options have additional data (strike price, class, and expiration). 
 
My current schema has a table named SecurityMaster where the common data 
goes, with Symbol as the primary key. I also have a 
SecurityMasterDerivative table where I put the Option-specific data; it 
has a foreign key reference to the SecurityMaster table. 
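
To make it concrete, roughly something like this (column names and types 
are made up just for illustration): 

CREATE TABLE SecurityMaster (
    Symbol        varchar(20)   NOT NULL PRIMARY KEY,
    SecurityName  varchar(100)  NOT NULL,
    SecurityType  char(3)       NOT NULL   -- e.g. 'EQ' or 'OPT'
);

CREATE TABLE SecurityMasterDerivative (
    Symbol          varchar(20)   NOT NULL PRIMARY KEY
                    REFERENCES SecurityMaster (Symbol),
    StrikePrice     decimal(18,4) NOT NULL,
    OptionClass     char(1)       NOT NULL,  -- e.g. 'C' or 'P'
    ExpirationDate  date          NOT NULL
);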
 
So when I get a new security master file, the set-based approach says 
to insert all the new common security data into the SecurityMaster 
table and then go back and reprocess the new data to insert the 
option-specific data. My dilemma is that I feel it is more "natural" to 
treat each row as a single entity. 
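
In SQL terms the two-pass version would look roughly like this 
(StagingSecurityFile is a made-up name for the loaded file, and the 
columns follow the sketch above): 

-- pass 1: common columns for every new security
INSERT INTO SecurityMaster (Symbol, SecurityName, SecurityType)
SELECT s.Symbol, s.SecurityName, s.SecurityType
FROM   StagingSecurityFile s
WHERE  NOT EXISTS (SELECT 1 FROM SecurityMaster m WHERE m.Symbol = s.Symbol);

-- pass 2: go back over the same staging data for the option-only columns
INSERT INTO SecurityMasterDerivative (Symbol, StrikePrice, OptionClass, ExpirationDate)
SELECT s.Symbol, s.StrikePrice, s.OptionClass, s.ExpirationDate
FROM   StagingSecurityFile s
WHERE  s.SecurityType = 'OPT'
  AND  NOT EXISTS (SELECT 1 FROM SecurityMasterDerivative d WHERE d.Symbol = s.Symbol);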
 
* The stored procedure approach I mentioned in the first point is 
related to the example I just gave. When inserting a new security into 
the "Security Master database" (I do not mean the single table), there 
is at least one table involved but possibly up to 4 tables, depending 
on the type of security we are dealing with. 
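
A bare-bones sketch of what I mean (SQL Server-style syntax; the 
procedure name and parameters are invented, and only two of the tables 
are shown): 

CREATE PROCEDURE InsertSecurity
    @Symbol         varchar(20),
    @SecurityName   varchar(100),
    @SecurityType   char(3),
    @StrikePrice    decimal(18,4) = NULL,
    @OptionClass    char(1)       = NULL,
    @ExpirationDate date          = NULL
AS
BEGIN
    BEGIN TRANSACTION;

    -- every security gets a SecurityMaster row
    INSERT INTO SecurityMaster (Symbol, SecurityName, SecurityType)
    VALUES (@Symbol, @SecurityName, @SecurityType);

    -- only some security types touch the extra tables
    IF @SecurityType = 'OPT'
        INSERT INTO SecurityMasterDerivative (Symbol, StrikePrice, OptionClass, ExpirationDate)
        VALUES (@Symbol, @StrikePrice, @OptionClass, @ExpirationDate);

    COMMIT TRANSACTION;
END;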
 
* As far as exception handling goes, why is it considered "normal" to 
want either all of the data or none of it? What about the idea of getting 
"most" of the data in and then dealing with those "exceptions"? Is it 
considered "business logic" to say "I'd rather have the majority of the 
data as opposed to NONE of the data"? 
 
I understand that I can write one large INSERT query which also performs 
ALL the required checks in order to filter out "bad" rows. But I would 
think that the query could get very complex very quickly. I regularly 
deal with situations where the data not only cannot be trusted to be 
complete, but where column names and formats are VERY different even for 
"similar" information. 
 
I find that I need to "backfill" a lot of data or perform a good number 
of value conversions. 
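
Just to illustrate, the single-query version of that ends up with every 
check, backfill, and conversion crammed into one statement (the specific 
checks below are only examples): 

INSERT INTO SecurityMaster (Symbol, SecurityName, SecurityType)
SELECT s.Symbol, s.SecurityName, s.SecurityType
FROM   StagingSecurityFile s
WHERE  s.Symbol IS NOT NULL
  AND  s.SecurityType IN ('EQ', 'OPT')
  AND  NOT EXISTS (SELECT 1 FROM SecurityMaster m WHERE m.Symbol = s.Symbol);
  -- ...plus every backfill and format conversion inlined the same way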
 
One example that comes to mind, where I used a single query, resulted in 
code that took 25+ minutes to run. In that situation I was dealing with 
position data for securities. 
 
For Equities, I trusted the CUSIP value they gave me but not the 
symbol, so I decided to "verify" the symbol against what I knew it as 
(i.e. backfill the column). But it was not enough to just do a left 
join, because it was possible for the CUSIP to be repeated in the lookup 
table (the PK was CUSIP and Symbol). So my Symbol lookup logic 
(remember, this was a single query) was to see if there was only one 
row with the given CUSIP, then check to see if there was only one 
active security row with the given CUSIP, THEN check to see if there 
was only one North American active security with the given CUSIP, 
and otherwise use what the source gave us. 
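
Written out, the fallback chain looked more or less like this 
(SecurityLookup, StagingPosition, IsActive, and Region are stand-in 
names for my actual lookup table and columns): 

SELECT p.CUSIP,
       COALESCE(
           -- exactly one lookup row for this CUSIP?
           (SELECT MAX(l.Symbol) FROM SecurityLookup l
             WHERE l.CUSIP = p.CUSIP
             HAVING COUNT(*) = 1),
           -- exactly one ACTIVE row for this CUSIP?
           (SELECT MAX(l.Symbol) FROM SecurityLookup l
             WHERE l.CUSIP = p.CUSIP AND l.IsActive = 1
             HAVING COUNT(*) = 1),
           -- exactly one active North American row for this CUSIP?
           (SELECT MAX(l.Symbol) FROM SecurityLookup l
             WHERE l.CUSIP = p.CUSIP AND l.IsActive = 1 AND l.Region = 'NA'
             HAVING COUNT(*) = 1),
           -- otherwise fall back to whatever the source gave us
           p.Symbol) AS VerifiedSymbol
FROM   StagingPosition p;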
 
And that was only for equities... For options the logic is 100% 
opposite... Talk about complex queries that SUCK in performance... 
 
The reason why I suggested the multiple pass validation approach was to 
be able to record or mark not only those rows that failed but WHY they 
failed. (Assume that there is a [ValidationCode] and [ValidationReason] 
column in the "sandbox" table.) I need to be able to create an 
exception report so somebody can go back and "fix" the data. 
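
For example, each validation pass could be a single UPDATE against the 
sandbox table (the table name, codes, and rules here are invented; only 
the [ValidationCode] and [ValidationReason] columns are from what I 
described above): 

UPDATE SandboxSecurity
SET    ValidationCode = 101, ValidationReason = 'Missing CUSIP'
WHERE  ValidationCode IS NULL
  AND  (CUSIP IS NULL OR CUSIP = '');

UPDATE SandboxSecurity
SET    ValidationCode = 102, ValidationReason = 'Unknown security type'
WHERE  ValidationCode IS NULL
  AND  SecurityType NOT IN ('EQ', 'OPT');

-- ...one pass per rule, then only the clean rows get promoted:
INSERT INTO SecurityMaster (Symbol, SecurityName, SecurityType)
SELECT Symbol, SecurityName, SecurityType
FROM   SandboxSecurity
WHERE  ValidationCode IS NULL;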
 
----- 
 
I guess the bottom line is that I am probably trying to apply approaches 
and opinions from my many years of application development to database 
development, and the two do not mesh.
 
  