Posted by JayCallas on 10/01/45 11:43
Thanks for replying. As you probably guessed, there are specific
reasons why I am asking these questions. Maybe a bit more info would be
helpful.
* I am not talking about the same data being written to multiple
tables. I'm referring to situations where different columns from the
same row are written to different tables.
A concrete example would be securities (stocks) data. There are
different types of securities (equities and options are just two of
them). Both types share common information like symbol, name, and
security type, but options have additional data (strike price, class,
and expiration).
My current schema has a table named SecurityMaster where the common
data goes, with Symbol as the primary key. I also have a
SecurityMasterDerivative table where I put the option-specific data;
it has a foreign key reference back to SecurityMaster.
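Roughly like this (the exact column names and data types are not the
point, just the shape of the two tables):

CREATE TABLE SecurityMaster (
    Symbol       varchar(20)  NOT NULL PRIMARY KEY,
    SecurityName varchar(100) NOT NULL,
    SecurityType varchar(10)  NOT NULL  -- e.g. 'EQUITY' or 'OPTION'
);

CREATE TABLE SecurityMasterDerivative (
    Symbol         varchar(20)   NOT NULL PRIMARY KEY
        REFERENCES SecurityMaster (Symbol),
    StrikePrice    decimal(18,4) NOT NULL,
    OptionClass    char(1)       NOT NULL,  -- 'C' = call, 'P' = put
    ExpirationDate datetime      NOT NULL
);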
So when I get a new security master file, the set-based approach says
to insert all the new common security data into the SecurityMaster
table and then go back and reprocess the new data to insert the
option-specific data (see the sketch just below). My dilemma is that I
feel it is more "natural" to treat each row as a single entity.
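Here is roughly what that two-pass load looks like, assuming the file
has first been loaded into a staging table (SecurityStage and its
columns are just illustrative names):

-- Pass 1: the common columns for every new security
INSERT INTO SecurityMaster (Symbol, SecurityName, SecurityType)
SELECT s.Symbol, s.SecurityName, s.SecurityType
FROM SecurityStage AS s
WHERE NOT EXISTS (SELECT 1 FROM SecurityMaster AS m
                  WHERE m.Symbol = s.Symbol);

-- Pass 2: back over the same staging data for the option-specific columns
INSERT INTO SecurityMasterDerivative
    (Symbol, StrikePrice, OptionClass, ExpirationDate)
SELECT s.Symbol, s.StrikePrice, s.OptionClass, s.ExpirationDate
FROM SecurityStage AS s
WHERE s.SecurityType = 'OPTION'
  AND NOT EXISTS (SELECT 1 FROM SecurityMasterDerivative AS d
                  WHERE d.Symbol = s.Symbol);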
* The stored procedure approach I mentioned in the first point is
related to the example I just gave. When inserting a new security into
the "Security Master database" (I do not mean the single table), there
is at least one table involved, but possibly up to four tables
depending on the type of security we are dealing with.
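A sketch of what I mean, one call per security (the parameter names
are illustrative, error handling is left out, and a real version would
branch into the other security-specific tables as well):

CREATE PROCEDURE dbo.InsertSecurity
    @Symbol         varchar(20),
    @SecurityName   varchar(100),
    @SecurityType   varchar(10),
    @StrikePrice    decimal(18,4) = NULL,
    @OptionClass    char(1)       = NULL,
    @ExpirationDate datetime      = NULL
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;

    -- Common columns always go to SecurityMaster
    INSERT INTO SecurityMaster (Symbol, SecurityName, SecurityType)
    VALUES (@Symbol, @SecurityName, @SecurityType);

    -- Option-specific columns only when the security is an option
    IF @SecurityType = 'OPTION'
        INSERT INTO SecurityMasterDerivative
            (Symbol, StrikePrice, OptionClass, ExpirationDate)
        VALUES (@Symbol, @StrikePrice, @OptionClass, @ExpirationDate);

    -- ...inserts for the other security-type tables would go here...

    COMMIT TRANSACTION;
END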
* As far as exception handling goes, why is it considered "normal" to
want either all of the data or none of it? What about the idea of
getting "most" of the data in and then dealing with the "exceptions"?
Is it considered "business logic" to say "I would rather have the
majority of the data as opposed to NONE of the data"?
I understand that I can make one large INSERT query which also
performs ALL the required checks in order to filter out "bad" rows.
But I would think that the query could get very complex very quickly.
I regularly deal with situations where the data not only cannot be
trusted to be complete, but where column names and formats are VERY
different even for "similar" information.
I find that I need to "backfill" a lot of data or perform a good number
of value conversions.
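Just to illustrate, folding every check and backfill into one
statement ends up shaped like this, and each additional rule grows the
WHERE clause or the joins feeding it (these particular checks are only
examples):

INSERT INTO SecurityMaster (Symbol, SecurityName, SecurityType)
SELECT s.Symbol, s.SecurityName, s.SecurityType
FROM SecurityStage AS s
WHERE s.Symbol IS NOT NULL
  AND s.SecurityType IN ('EQUITY', 'OPTION')
  AND NOT EXISTS (SELECT 1 FROM SecurityMaster AS m
                  WHERE m.Symbol = s.Symbol);
-- ...every new rule, lookup, or conversion gets wedged in here...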
One example that comes to mind, where I used a single query, resulted
in code that took 25+ minutes to run. In that situation I was dealing
with position data for securities.
For equities, I trusted the CUSIP value they gave me but not the
symbol, so I decided to "verify" the symbol against what I knew it as
(i.e. backfill the column). But it was not enough to just do a left
join, because the CUSIP could be repeated in the lookup table (the PK
was CUSIP and Symbol). So my symbol lookup logic (remember, this was a
single query) was: see if there was only one row with the given CUSIP;
then check whether there was only one active security row with the
given CUSIP; THEN check whether there was only one North American
active security with the given CUSIP; otherwise use what the source
gave us.
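Stripped down, the per-row backfill looked something like this (the
table and column names here are simplified stand-ins for the real
ones, and in the actual query all of this was folded into the larger
statement):

SELECT
    p.CUSIP,
    COALESCE(
        -- 1) exactly one lookup row for the CUSIP
        (SELECT MIN(l.Symbol) FROM SecurityLookup AS l
         WHERE l.CUSIP = p.CUSIP
         HAVING COUNT(*) = 1),
        -- 2) exactly one ACTIVE row for the CUSIP
        (SELECT MIN(l.Symbol) FROM SecurityLookup AS l
         WHERE l.CUSIP = p.CUSIP AND l.IsActive = 1
         HAVING COUNT(*) = 1),
        -- 3) exactly one active NORTH AMERICAN row for the CUSIP
        (SELECT MIN(l.Symbol) FROM SecurityLookup AS l
         WHERE l.CUSIP = p.CUSIP AND l.IsActive = 1
           AND l.Region = 'NA'
         HAVING COUNT(*) = 1),
        -- 4) otherwise fall back to whatever the source file gave us
        p.SourceSymbol
    ) AS Symbol
FROM PositionStage AS p;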
And that was only for equities... For options the logic is 100%
opposite... Talk about complex queries that SUCK in performance...
The reason I suggested the multiple-pass validation approach was to be
able to record or mark not only the rows that failed but WHY they
failed. (Assume there are [ValidationCode] and [ValidationReason]
columns in the "sandbox" table.) I need to be able to create an
exception report so somebody can go back and "fix" the data.
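What I had in mind is one UPDATE per rule against the sandbox table,
something like this (the codes and rules are just placeholders):

-- Pass 1: missing symbol
UPDATE SecuritySandbox
SET ValidationCode = 101, ValidationReason = 'Symbol is missing'
WHERE ValidationCode IS NULL
  AND Symbol IS NULL;

-- Pass 2: unrecognized security type
UPDATE SecuritySandbox
SET ValidationCode = 102, ValidationReason = 'Unrecognized security type'
WHERE ValidationCode IS NULL
  AND SecurityType NOT IN ('EQUITY', 'OPTION');

-- Pass 3: security already exists
UPDATE s
SET s.ValidationCode = 103, s.ValidationReason = 'Security already exists'
FROM SecuritySandbox AS s
JOIN SecurityMaster AS m ON m.Symbol = s.Symbol
WHERE s.ValidationCode IS NULL;

-- Whatever is still unmarked gets loaded; everything else goes on
-- the exception report.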
-----
I guess the bottom line is that I am probably trying to apply my many
years of application development approaches and opinions to database
development and the two do not mesh.