Reply to Re: Updating the SQL key value — PHP Programming Language

Posted by Jerry Stuckle on 03/31/07 22:46

Toby A Inkster wrote:
> Jerry Stuckle wrote:
>> Toby A Inkster wrote:
>>> Erwin Moller wrote:
>>>
>>> Example data in table licences:
>>>
>>> GPL GNU General Public Licence http://www.gnu...
>>> FDL GNU Free Documentation Licence http://www.gnu...
>>> CC Creative Commons Licence http://www.cre...
>>>
>>> Usage is fairly similar, apart from the fact that now, when you look at
>>> the table 'articles' without doing any joins, you can still infer a bit of
>>> information about which licence each article is under, without having to
>>> inner join onto the licences table.
>> True, but you also have to look at performance issues. When searching
>> an index, comparing an int is always faster than comparing a varchar.
>> And comparing a single column is always faster than comparing multiple
>> columns. And the index file itself is smaller.
>
> Firstly, I said char, not varchar. char columns are fixed width storage
> and consequently a lot faster to search than varchar columns. On a 32-bit
> processor, char(4) is particularly fast as it corresponds to the
> processor's word length.
>

True, but unless it's a binary column the search must be case-insensitive.

Also, even if you specify char(4) the field may not end up that way.
For instance, in MySQL if you have a mixture of both char and varchar
columns, the result is generally all char columns are converted to varchar.

Finally, even if none of the above are true, an int column will almost
always be word aligned in memory. A char field - even if it's char(4),
is not always word aligned.

>> Additionally, the PK should not be dependent on data which may change -
>> i.e. if part of your key was the license (code), what would happen if
>> they changed the licensing terms?
>
> Firstly, in this example, the licence code is an arbitrary string chosen by
> the creator of the data, so it needs to change no more often than an
> integer key would.
>

So why not just use an auto-number field. It seems you're unnecessarily
creating an unnecessary bottleneck.

> Secondly, it is perfectly acceptable for a primary key to change. A decent
> database can even cascade changes to dependent tables for you.
>

Yes, it's acceptable. But it's not recommended, and good database
design would argue against it.

>>> And before you say that this is a waste of space as char(8) takes up
>>> eight bytes rather than 4 bytes for an integer, you're second-guessing
>>> the database engine there. Database engines really are dead clever.
>> Not generally, they don't, because of potential problems. For instance,
>> if the data is stored as a pointer to the license table, the system
>> has to do an additional file lookup to fetch the license data. And what
>> happens if the entry from the license table is altered - or worse yet,
>> deleted? There is no referential integrity built in here. Deleting an
>> item from the license table would require all other tables which point
>> to that entry be updated - that is, the varchar data would have to be
>> reinserted into each row in every table which pointed to the license
>> table.
>
> Of course there's referential integrity built-in, via foreign key
> constraints. If foreign keys are correctly set up, then depending on
> how you configured it, one of the following actions would happen when
> you tried to delete a licence from the licences table which was being
> used in the articles table:
>
> 1. the RDBMS would not allow you to delete it;
> 2. the articles which used that licence would also be deleted; or
> 3. the articles which used that licence would have their licence
> set to the default licence.
>

Sure. But you don't have to use referential integrity. And no,
databases do not just point to another table, as I indicated. Look at
the data in your tables. You will find the char fields.

> It's all about cascading.
>
>> And in your case, the data would be stored in 4 bytes ("GPL" + 1 byte
>> length). So it would take the same 4 bytes - but comparisons would
>> still be slower.
>
> Greater than / less than comparisons would be slower. But (case sensitive)
> equal to / not equal to comparisons should be of equivalent speed, as it's
> a simple matter of choosing if one 32-bit value is the same as another
> 32-bit value -- exactly the same process used if they were integers.
> Probably the same CPU instructions used.
>

But char values are not necessarily word aligned, and they aren't
necessarily handled as a 32 bit int.

In fact, even if they were - the database would have to make the
conscious decision that it is valid, and load the fields on a word
boundary. Both take additional time over a simple int field.

>> That's true. But you're also taking up more storage space and slowing
>> down searches.
>
> More storage space on, say, the articles table, for storing the user's
> login in a column instead of an integer; but less space in the user table,
> as there's no need to store an extra integer column.
>

And the result is more space required. And every additional table
requires that much extra space.

> Slower to perform a join between the article and user tables, but fewer
> occasions when you'll need to perform a join, as you have some useful user
> information (the login) in the article table already.
>
>

Even searching for the row identified by the PK will be slower.

>>> On one of my current projects, I've got twelve tables, only one of
>>> which has a surrogate integer primary key (technically it doesn't
>>> autonumber, but I use MAX() to simulate autonumbering when creating a
>>> new record). Guess which table is causing me the most problems?
>> I can imagine. Using MAX() like this can cause concurrency problems.
>> That's why RDB designers came out with the auto-numbering columns.
>
> Not heard of transactions?
>

Sure. And in this case transactions cause their own problems. You need
at least 4 calls to the database:

START TRANSACTION
SELECT MAX(colId)+1 FROM mytable
INSERT INTO mytable ...
COMMIT

Significantly slower than a single call which returns an auto-numbered
column. And worse yet, it will tie up the table for additional changes
until the COMMIT. Very bad for concurrency.

>>> Better would have been to design my table so that it had a primary key
>>> like (url, revision).
>> But then your comment table wouldn't point to it anyway if you use (url,
>> revision). Or, if you just use url, your comment table would be
>> pointing at two different entries - which is not good foreign key
>> design.
>
> It would be pointing to a partial key. Which I think is against 4NF or
> 5NF, but I mostly ignore those two as I've found them of very little
> practical use.
>

It's against foreign key constraints, also. The referenced column(s)
must be either the primary key or have a unique index. Non-unique
entries are not allowed.

>>> Anyhow, it's mostly a matter of taste. I was being somewhat tongue in
>>> cheek when describing surrogate keys as "the root of all evil", but I
>>> can't stand to see a good candidate key go to waste.
>> Here, I agree. But I tend to lean more towards the performance side.
>> Design the database with efficiency in mind. That's why all databases
>> aren't 5NF form.
>
> I tend to err on the side of my own sanity. As far as performance is
> concerned, I'm of the opinion that it's premature optimisation, and that
> the RDBMS programmers will have already done a much better job than I
> would have done.
>

It depends on the database and your needs. For something running 10K
hits a on a database with 50K rows, this is fine.

But I've worked on some databases with hundreds of millions of rows (not
all one table, of course) with size well into the terabyte region.
Number of requests range upwards of 10's of thousands per second.

Performance here is important. And I carry these same techniques down
to smaller databases. I don't consider it 'premature optimization' to
use good rdb design techniques - including 3NF.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

[Back to original message]

Удаленная работа для программистов • Как заработать на Google AdSense • England, UK • статьи на английском • PHP MySQL CMS Apache Oscommerce • Online Business Knowledge Base • DVD MP3 AVI MP4 players codecs conversion help

Home • Search • Site Map • Set as Homepage • Add to Favourites

Сайт изготовлен в Студии Валентина Петручека —
изготовление и поддержка веб-сайтов, разработка программного обеспечения, поисковая оптимизация