'Duplicate Records with RAND() function
I have a python program to extract data from Microsoft SQL Server and load them to another table in the same database. Using this extraction program, I am trying to do the following.
- Get the input arguments from an excel file
- Fetch one random record from the database for each row present in the excel
- Load the output to another table
Using the RAND() function, I'm seeing duplicate records being retrieved most of the time even though the combination has a sufficient number of entries in the database. I tried a couple of other approaches like NEWID() and calculating the total number of rows and then retrieving a random row using numpy. But these queries take hours to execute even for a single combination and does not seem feasible.
Note: The table is huge (~7 million records) and left joins are used to retrieve the required data.
Are there any other methods to fix this issue?
Solution 1:[1]
RANDOM() give a randomized value betwwen 0 and 1, but this value is always the same when used in a SELECT steement because without any argument, this will be a "constant".
As a proof :
SELECT RAND() AS RANDOM_VALUE, *
FROM sys.objects
If you want to pick one row randomly you need to use an UUID or GUID this way :
WITH
T AS (SELECT RANK() OVER(ORDER BY NEWID()) AS RANDOM_VALUE, *
FROM sys.objects)
SELECT *
FROM T
WHERE RANDOM_VALUE = (SELECT MAX(RANDOM_VALUE) FROM T)
Of course you can choose MAX or MIN
Another soluton is to use TABLESAMPLE combined to this if the table is veru huge...
WITH
T AS (SELECT RANK() OVER(ORDER BY NEWID()) AS RANDOM_VALUE, *
FROM a_table TABLESAMPLE (1 PERCENT))
SELECT *
FROM T
WHERE RANDOM_VALUE = (SELECT MAX(RANDOM_VALUE) FROM T)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | SQLpro |
