'Duplicate Records with RAND() function

I have a python program to extract data from Microsoft SQL Server and load them to another table in the same database. Using this extraction program, I am trying to do the following.

Get the input arguments from an excel file
Fetch one random record from the database for each row present in the excel
Load the output to another table

Using the RAND() function, I'm seeing duplicate records being retrieved most of the time even though the combination has a sufficient number of entries in the database. I tried a couple of other approaches like NEWID() and calculating the total number of rows and then retrieving a random row using numpy. But these queries take hours to execute even for a single combination and does not seem feasible.

Note: The table is huge (~7 million records) and left joins are used to retrieve the required data.

Are there any other methods to fix this issue?

Solution 1:^[1]

RANDOM() give a randomized value betwwen 0 and 1, but this value is always the same when used in a SELECT steement because without any argument, this will be a "constant".

As a proof :

SELECT RAND() AS RANDOM_VALUE, *
FROM   sys.objects

If you want to pick one row randomly you need to use an UUID or GUID this way :

WITH
T AS (SELECT RANK() OVER(ORDER BY NEWID()) AS RANDOM_VALUE, *
      FROM   sys.objects)
SELECT *
FROM   T
WHERE  RANDOM_VALUE = (SELECT MAX(RANDOM_VALUE) FROM T)

Of course you can choose MAX or MIN

Another soluton is to use TABLESAMPLE combined to this if the table is veru huge...

WITH
T AS (SELECT RANK() OVER(ORDER BY NEWID()) AS RANDOM_VALUE, *
      FROM   a_table TABLESAMPLE (1 PERCENT))
SELECT *
FROM   T
WHERE  RANDOM_VALUE = (SELECT MAX(RANDOM_VALUE) FROM T)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	SQLpro

'Duplicate Records with RAND() function

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]