Posted by Erland Sommarskog on 07/15/07 16:17
raylopez99 (raylopez99@yahoo.com) writes:
> No, just dump your data into a flat file, use a hash table to index
> every word in said file, and use hardware (CAM memory) to do a fast
> search. Simple really. Probably what Google does right now. You
> think they are using SQL "queries" to do fast searches? No way Jose.
It's not implausible that Google does something like that. After all,
they have a simple data model, but a lot of data. Use the right tool
for the right task.
But not everything is suitable for flat files. The system I work with
is a system for stock brokers and asset managers. We keep track of what
their customers buy and sell, their positions, their cash holdings.
We keep a lot of data about the various financial instruments and what
happens to them. Say that you have 6000 HP shares in your portfolio.
What is the acquisition value of that position? To be able to calculate
that, we need to know that this position is the result of buying HP,
Compaq and Digital along the way. We need to report things to the tax
authorities, levy source tax on dividends, etc. And then there are
special accounts for pension savings with their own set of rules. All
in all, there are over 1000 tables, and the most efficient way to
extract a desired set of data is far from always obvious.
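Just to give a flavour: with SQL, the acquisition-value question is
a few lines. (This is a sketch with a made-up trades table and made-up
id values, not anything from our actual schema.)

    -- Hypothetical schema: trades(account_id, instrument_id, quantity, price).
    -- The acquisition value of the HP position is the sum of what was paid
    -- for the shares that, through the mergers, ended up as HP stock.
    SELECT SUM(quantity * price) AS acquisition_value
    FROM   trades
    WHERE  account_id = 4711              -- the customer's account
      AND  instrument_id IN (1, 2, 3)     -- HP, Compaq, Digital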
An RDBMS is not only about retrieving data, but also about adding data.
How do you ensure in your flat file that no transaction is entered
for an instrument that does not exist in the instrument file? How do
you ensure that when a withdrawal transaction is added to the
transaction file, the cash-holdings file is also updated?
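In an RDBMS you get both more or less for free. A sketch with made-up
table names, assuming an instruments table keyed on instrument_id:

    -- A foreign key makes it impossible to enter a transaction for an
    -- instrument that does not exist:
    CREATE TABLE transactions (
        transaction_id int   NOT NULL PRIMARY KEY,
        instrument_id  int   NOT NULL REFERENCES instruments (instrument_id),
        amount         money NOT NULL
    )

    -- And a database transaction makes the withdrawal and the update of
    -- the cash holdings one atomic operation: both happen, or neither.
    BEGIN TRANSACTION
       INSERT transactions (transaction_id, instrument_id, amount)
       VALUES (1, 1, -5000)
       UPDATE cash_holdings
       SET    balance = balance - 5000
       WHERE  account_id = 4711
    COMMIT TRANSACTION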
> You wrote your dB is 300 GB--but how much of this is raw data?
> Perhaps 10%? The other 90% is junk to link the data (I'm guessing).
> If so, you can buy 30 GB of RAM and when x86/Windows supports 64 bit
> better (though I think they already do--Itanium?) you can access this
> 30 GB with no problem. 30 GB RAM costs about $1000. Not expensive.
The size of the database at our biggest customer is around 350 GB. Is
all that raw data? No. Is there some "junk" that no one ever looks at?
Probably. But we cannot really throw it away. All transactional data
must be kept for ten years, to comply with Swedish legislation. But
the really interesting part is that one reason the database is inflated
is that we have huge historical tables with the positions and cash
holdings for every day going back a couple of years. They are there to
provide faster access to the data.
And then there is the problem with your 30 GB of memory. To be useful,
you will need more than 30 GB: your hash tables will easily take up
a couple of GB more. Keep in mind that you will need more than one
hash table per file, because searches may be on different attributes.
And hash tables alone won't do. A range query, like "give me the
sales for the first three months of this year in Finland", is better
served by a sorted index. There you have more of the "junk" in a
relational database.
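That is, something like this (hypothetical table and index names again):

    -- A hash table answers "sales on exactly this date", but a range query
    -- needs the rows in sorted order. A B-tree index serves it directly:
    CREATE INDEX ix_sales_country_date ON sales (country, sale_date)

    SELECT SUM(amount) AS q1_sales
    FROM   sales
    WHERE  country   = 'Finland'
      AND  sale_date >= '20070101'
      AND  sale_date <  '20070401'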
And in a database that is for a data warehouse, you will find even more
junk, as in such a database you preaggregate both this and that, and
the database easily ends up several terabytes in size. (And in this
case, the language of choice is MDX rather than SQL.)
A lot of that "junk" serves to make data access faster. And after all,
while memory is cheap and processors keep getting faster, searching
30 GB by brute force, even in memory, takes a long time.
> Correct. You are "right as rain" on this point. I do think the C#
> feature of VS2005 vis-a-vis the SQL dB aspect of it is broken--maybe
> because I'm using the SQL Server 2005 Express (free) edition (just
> upgraded for a mere $50 to the Developer's edition--I'll repost here
> if it solves this small problem).
If you have problems with Visual Studio, you may be better off asking
in a Visual Studio group. Although I have to admit that I find it
difficult to believe that it matters which language you are using.
> Yes, but it seems you are advocating a THIRD table "instead there is a
> table AuthorBooks". Why, just to link the two tables Authors and
> Books? Why not simply add a foreign key in Books called "AuthorID",
> that is the primary key in Authors? (This is exactly what my Frasier
> book proposes).
Because, as I said, a book can have many authors. This is a classical
database-design problem: what is the cardinality of the relationship?
One-to-one? Many-to-one? Many-to-many? Authors and books is a classical
example of many-to-many. One reason it is commonly used is that, since
most books have only one author, it is a good example of a case where
you may initially overlook that the relationship is many-to-many.
And the normal way of resolving a many-to-many relationship is to
add a third table. In its simplest form, it holds only the foreign
keys, but you can add attributes to the relationship itself, for
instance each author's share of the royalties.
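In SQL the third table could look like this (again a sketch with
made-up names, not a real schema):

    CREATE TABLE authors (
        author_id int          NOT NULL PRIMARY KEY,
        name      nvarchar(80) NOT NULL
    )

    CREATE TABLE books (
        book_id int          NOT NULL PRIMARY KEY,
        title   nvarchar(80) NOT NULL
    )

    -- The junction table resolves the many-to-many relationship. Note
    -- that royalty_share is an attribute of the relationship itself:
    -- it belongs to neither the author alone nor the book alone.
    CREATE TABLE author_books (
        author_id     int           NOT NULL REFERENCES authors (author_id),
        book_id       int           NOT NULL REFERENCES books (book_id),
        royalty_share decimal(5, 2) NOT NULL DEFAULT 100,
        PRIMARY KEY (author_id, book_id)
    )

A book with one author gets a single row in author_books; a book with
three authors gets three rows, and no schema change is needed.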
> Also remember right now with RDBMS you expend a tremendous amount of
> work putting data into "orthogonal" databases, via data entry forms.
> You have to pay people to enter the data correctly (even if your program
> rejects bad data entry, you still have to pay people to enter the data
> correctly). You can avoid all of this with a flat file.
Huh? You suggest that people instead of using entry forms, would edit
the flat files directly? Au contraire my friend, with a less intelligent
data store, it's even more essential that the data entry is done right.
--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se
Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downloads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx