Posted by --CELKO-- on 03/31/06 01:21
>> * Different data from the same row needs to be inserted into multiple tables. "Common sense" (maybe "gut instinct" is better) says to handle
each row as a "unit". Seems weird to process the entire set for one
table, then to process the entire set AGAIN for another table, and then
YET AGAIN for a third table, and so on. <<
You are still thinking sequentially and procedurally. The SQL engine is free
to hold the data internally and parallelize the process within a single
transaction. It is not as if you must open and close a file over and
over. In Standard SQL/PSM, the keywords are BEGIN ATOMIC .. END;. What a
series of inserts must do is see that the order of insertion is
EFFECTIVELY preserved. That is a technical thing in "SQL-Speak" that
says if I can put data item X in tables T1, T2, .. Tn all at once, I can
do it in parallel. I might have to defer constraints, etc., but if the
effect is the same, sequential order is not required.
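A minimal sketch of that idea in SQL/PSM; the table and column names
(Staging, Orders, OrderAudit, Shipping) are made up for illustration:

-- Sketch only: tables and columns are hypothetical. One atomic compound
-- statement; the engine may run the three inserts in parallel as long as
-- the net effect is the same as doing them in order.
BEGIN ATOMIC
  INSERT INTO Orders (order_nbr, cust_nbr, order_amt)
  SELECT order_nbr, cust_nbr, order_amt FROM Staging;

  INSERT INTO OrderAudit (order_nbr, load_date)
  SELECT order_nbr, CURRENT_DATE FROM Staging;

  INSERT INTO Shipping (order_nbr, ship_addr)
  SELECT order_nbr, ship_addr FROM Staging;
END;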
>> * Exception handling. Set based processing means that if one row fails the entire set fails. Looping through allows you to fail a row but allow everything else to be processed properly. It also allows you to gather statistics. (How many failed, how many worked, how many were skipped, etc.)<<
Look up the concept of SAVEPOINTs and transaction granularity. A
SAVEPOINT is a "mini-COMMIT"; when you hit an error, you can roll back
to the last savepoint, do something, and try again.
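A rough sketch using the Standard SQL savepoint syntax (products vary;
T-SQL, for example, spells it SAVE TRANSACTION). The Deposits table and
values are hypothetical:

-- Sketch only: Deposits is a made-up table.
START TRANSACTION;

INSERT INTO Deposits (acct_nbr, deposit_amt) VALUES (100, 500.00);

SAVEPOINT first_deposit_done;

INSERT INTO Deposits (acct_nbr, deposit_amt) VALUES (999, 250.00);

-- If the second insert fails a constraint, undo only the work done since
-- the savepoint, correct the data, and try again; the first insert
-- survives inside the same transaction.
ROLLBACK TO SAVEPOINT first_deposit_done;

COMMIT;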
The other method is to break the batch into the smallest possible
transaction granularity and run each one. The gimmick is that they
have to be separable. Example: making bank deposits to single
accounts.
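Something like this, where each deposit is its own transaction so one
bad row rolls back alone (the Accounts table and the values are made up):

-- Sketch only: Accounts and the amounts are hypothetical. Each deposit is
-- a separate transaction, so a failed deposit rolls back by itself and
-- the rest of the batch still commits.
START TRANSACTION;
UPDATE Accounts
   SET acct_balance = acct_balance + 500.00
 WHERE acct_nbr = 100;
COMMIT;

START TRANSACTION;
UPDATE Accounts
   SET acct_balance = acct_balance + 250.00
 WHERE acct_nbr = 200;
COMMIT;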
Univac experimented with a concurrency control method they called
logical concurrency control. The idea was to look at the job queue, see
which statements worked on disjoint sets (posting deposits versus
bowling league scores) and let them run at the same time without any
locks or further checking. I do not know what happened to the project.
>> The alternative is to create a temporary table (sandbox or workspace type thing), copy the data to there along with "status" or "validation" columns, run through the set many times over looking for any rows that may fail, marking them as such, and then at the end only dealing with those rows which "passed" the testing. Of course, in order for this to work you must know (and duplicate) all constraints so you
know what to look for in your testing. <<
This is the approach used for data scrubbing in Data Warehouses. It is
also the reason we are seeing Business Rules engines that could be
used to generate code for the RDBMS, the DW, the front-end code, etc.
with the certainty that they will all match.
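In outline, the staging-table pattern looks like this; Staging_Orders,
its columns, and the particular checks are made up for the example:

-- Sketch only: the tables and the validation rules are hypothetical;
-- the point is the pattern, not the particular checks.
CREATE TABLE Staging_Orders
(order_nbr  INTEGER NOT NULL,
 cust_nbr   INTEGER,
 order_amt  DECIMAL(12,2),
 row_status CHAR(10) DEFAULT 'pending' NOT NULL);

-- Mark rows that would violate the target table's rules.
UPDATE Staging_Orders
   SET row_status = 'bad cust'
 WHERE cust_nbr NOT IN (SELECT cust_nbr FROM Customers);

UPDATE Staging_Orders
   SET row_status = 'bad amt'
 WHERE order_amt <= 0.00;

-- Load only the rows that passed every test; the marked rows stay behind
-- for reporting (how many failed, how many worked, etc.).
INSERT INTO Orders (order_nbr, cust_nbr, order_amt)
SELECT order_nbr, cust_nbr, order_amt
  FROM Staging_Orders
 WHERE row_status = 'pending';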