|
Posted by Hugo Kornelis on 06/15/07 22:23
On Fri, 15 Jun 2007 18:14:07 +0200, Gert-Jan Strik wrote:
>> >> Well, that definitely rules out newid() as a "good" pseudo random number
>> >> generator, then. A sequence of random numbers should have a chance to
>> >> hold duplicates.
>> >
>> >Good observation. And so you correctly concluded that RAND() also does
>> >not do this.
>>
>> Am I reading you incorrectly, or are you saying that the sequence of
>> numbers generated by RAND() never produces the same value twice?
>
>No, I am not saying that. It might, I haven't analyzed the algorithm
>thoroughly. But that doesn't matter. A good pseudo random number
>generator should incorporate the idea that in a range of 2 billion
>values, there is a one in 2 billion chance that the same value is
>selected next. And after that, then again there is a one in 2 billion
>chance it will appear again. And that is something the algorithm doesn't
>do. The algorithm is totally deterministic.
Hi Gert-Jan,
Any pseudoRNG will always have a deterministic algorithm; the only
alternative would be some device that measures some physical magnitude
that is deemed to be random enough. And both philosophers and physicists
would probably argue whether even that is truly random.
Anyway, a deterministic algorithm can still satisfy the 1 in 2 billion
chance of producing the occasional duplicate. I'll try to illustrate
with a simplified example, using lower numbers (to save me the hassle of
translating Dutch words for extremely high numbers to English, and you
the hassle of translating them back :-)
Let's say that we have an algorith to produce random numbers between 1
and 64. We do of course not want to limit the seed to that range of 64
numbers - instead we use an integer to store the seed, giving us a range
of over 4 billion different seed values. We use an algorithm to
calculate next seed from the previous seed in such a way that there
won't be any obvious pattern to the series and that all 4-and-a-bit
billion possible values are calculated once before the series starts
over. We then use an other algorithm to hash each of the possible
integer values into a number between 1 and 64, such that there will be
an equal distribution but (again) no obvious pattern.
Even though the algorithm is entirely deterministic, if you write down
the full sequence of 4,294,967,296 numbers this algorithm generates
before starting over, you will see a 1 in 64 chance of getting the same
number twice in a row, a 1 in 64^2 chance of getting the same number
thrice and a 1 in 64^3 chance of getting four equal numbers in a row.
The chance for getting five equals in a row might be somewhat more or
less than 1 in 64^4, and the chance of six in a row will definitely
differ significantly from 1 in 64^5. These issues can be fixed by
increasing the ratio of seed numbers vs generated values (e.g. by using
biging instead of int for the seed).
Note that nothing of the above necessarily applies to the random number
generation by SQL Server. This applies to random number generation in
general - as far as I know, the random number generator in SQL Server is
not described in detail, so only MS employees are able to tell if it's
implemented as described here, or in another way.
--
Hugo Kornelis, SQL Server MVP
My SQL Server blog: http://sqlblog.com/blogs/hugo_kornelis
[Back to original message]
|