'Aggregate multiple tables considering indexing with values derived from one table
Consider I have three tables (bet,win,cancel) that will contain the data for rounds of bets. Each round can have 1 or more bets and 1 or more wins. A round may be cancelled before any wins are recorded.
I would like to aggregate the data so that each round is displayed as one row, however my queries are very slow after testing with millions of dummy records each day.
Below are example data
Bet
bookmaker_id provider_id round_id transaction_id game_id user_id bet_amount balance bet_timestamp
1 1 Round1 bet_tx1 game1 123 10.00 90.00 2022-03-17 01:40:57.400
1 1 Round2 bet_tx2 game1 123 10.00 70.00 2022-03-17 02:40:57.400
1 1 Round2 bet_tx3 game1 123 10.00 80.00 2022-03-17 03:40:57.400
1 1 Round3 bet_tx4 game1 123 10.00 70.00 2022-03-17 04:40:57.400
Win
bookmaker_id provider_id round_id transaction_id user_id win_amount balance win_timestamp
1 1 Round1 win_tx1 123 0.00 80.00 2022-03-17 01:40:57.400
1 1 Round2 win_tx2 123 10.00 80.00 2022-03-17 02:40:57.400
Cancel
bookmaker_id provider_id round_id transaction_id user_id cancel_timestamp
1 1 Round3 can_tx1 123 2022-03-17 01:40:57.400
Each table has two indexes:
Primary Key: (bookmaker_id , provider_id , transaction_id)
Index: (bookmaker_id, provider_id, round_id)
The following query attempts to aggregate the round information for the first bet transaction. It totals the amount bet, totals the amount won, links the cancellation record if it exists and appends the transaction_id's together if there are multiple. It also returns the opening balance of the first bet and the closing balance of the last win.
Attempt at query
SELECT
bet.*,
win.*,
cancel.transaction_id as cancel_transaction_id,
cancel.cancel_timestamp
FROM
(
SELECT
round_id,
game_id,
provider_id,
bookmaker_id,
user_id,
STRING_AGG(bet.transaction_id, ', ') as bet_transaction_ids,
SUM(bet_amount) as total_bet,
MIN(bet_timestamp) as bet_timestamp,
opening_balance
FROM
(
SELECT
FIRST_VALUE(balance) OVER (
PARTITION BY round_id
order by
bet_timestamp asc
) as opening_balance,
*
FROM
bet
WHERE
bookmaker_id = 1
AND provider_id = 1
AND transaction_id = 'bet_tx1'
) AS bet
GROUP BY
round_id,
game_id,
provider_id,
bookmaker_id,
user_id,
opening_balance
) AS bet
LEFT JOIN (
SELECT
round_id,
STRING_AGG(win.transaction_id, ', ') as win_transaction_ids,
SUM(win_amount) as total_won,
MAX(win_timestamp) as win_timestamp,
closing_balance
FROM
(
SELECT
FIRST_VALUE(balance) OVER (
PARTITION BY round_id
ORDER BY
win_timestamp desc
) AS closing_balance,
*
FROM
win
) AS win
GROUP BY
win.round_id,
closing_balance
) AS win ON bet.round_id = win.round_id
LEFT JOIN (
SELECT
[round_id],
[transaction_id],
[cancel_timestamp]
FROM
cancel
) AS cancel ON bet.round_id = cancel.round_id
ORDER BY
total_won;
Result
round_id game_id provider_id bookmaker_id user_id bet_transaction_ids total_bet bet_timestamp opening_balance round_id win_transaction_ids total_won win_timestamp closing_balance cancel_transaction_id cancel_timestamp
Round1 game1 1 1 123 bet_tx1 10.00 2022-03-17 01:40:57.400 90.00 Round1 win_tx1 0.00 2022-03-17 01:40:57.400 80.00 null null
I assume this is not the most efficient way to link these tables if all I have is the bet transaction id considering the indexes and millions of potential records. I would also like to select multiple rounds filtered by bet_timestamp but I hope I can apply anything learned here to that query too.
How should I query these three tables more efficiently?
Any assistance would be most appreciated.
Solution 1:[1]
I do not know if it works, but could you try this one:
First add an index to your bet table like this(keep the current one too):
Index: (bet_timestamp, round_id, bookmaker_id, provider_id)
Also you have primary key like below:
Primary Key: (bookmaker_id , provider_id , transaction_id)
I do not know histogram of your data in columns "bookmaker_id , provider_id" and in this example you are using "transaction_id" as one of key column of the primary key, but it does not provide any relationship between other tables. After checking historgram of the data refreshing architecture could be good.
Please try this(before executing please create index):
WITH openingtab
AS (SELECT round_id,
game_id,
provider_id,
bookmaker_id,
user_id,
String_agg(bet.transaction_id, ', ') AS bet_transaction_ids,
Sum(bet_amount) AS total_bet,
Min(bet_timestamp) openingbet_time
FROM bet
WHERE bookmaker_id = 1
AND provider_id = 1
GROUP BY round_id,
game_id,
provider_id,
bookmaker_id,
user_id),
bettab
AS (SELECT b.round_id,
b.user_id,
b.game_id,
b.balance opening_bet,
b.provider_id,
b. bookmaker_id,
o.bet_transaction_ids,
o.total_bet,
o.openingbet_time
FROM openingtab o
JOIN bet b
ON o.openingbet_time = b.bet_timestamp
AND o.round_id = b.round_id
AND o.provider_id = b.provider_id
AND o.bookmaker_id = b.bookmaker_id),
betwin
AS (SELECT b.*,
w.transaction_id wtid,
w.win_amount,
w.balance,
w.win_timestamp,
c.transaction_id ctid,
c.cancel_timestamp
FROM bettab b
LEFT OUTER JOIN win w
ON b.round_id = w.round_id
AND b.provider_id = w.provider_id
AND b.bookmaker_id = w.bookmaker_id
LEFT OUTER JOIN cancel c
ON b.round_id = c.round_id
AND b.provider_id = c.provider_id
AND b.bookmaker_id = c.bookmaker_id)
SELECT *
FROM betwin;
Solution 2:[2]
First, the subquery of bet might rewrite as below query, because there is a Primary Key: (bookmaker_id , provider_id , transaction_id) which all hit your condition, so that might be no sense to use aggregate & window function because Primary Key will not be duplicated.
SELECT
round_id,
game_id,
provider_id,
bookmaker_id,
user_id,
transaction_id as bet_transaction_ids,
bet_amount as total_bet,
bet_timestamp as bet_timestamp,
balance as opening_balance
FROM bet
WHERE
bookmaker_id = 1
AND provider_id = 1
AND transaction_id = 'bet_tx1'
The second thing, try to use indexes & filter your result set as possible.
In the below query, I had added conditions bookmaker_id & provider_id on the cancel table when doing OUTER JOIN which might hit your indexes as possible.
because the first column of index is very important that will let QO determine whether use indexes, so try to use first column as well
SELECT
bet.*,
win.*,
cancel.transaction_id as cancel_transaction_id,
cancel.cancel_timestamp
FROM
(
SELECT
round_id,
game_id,
provider_id,
bookmaker_id,
user_id,
transaction_id as bet_transaction_ids,
bet_amount as total_bet,
bet_timestamp as bet_timestamp,
balance as opening_balance
FROM bet
WHERE
bookmaker_id = 1
AND provider_id = 1
AND transaction_id = 'bet_tx1'
) AS bet
LEFT JOIN (
SELECT
round_id,
STRING_AGG(win.transaction_id, ', ') as win_transaction_ids,
SUM(win_amount) as total_won,
MAX(win_timestamp) as win_timestamp,
closing_balance
FROM
(
SELECT
FIRST_VALUE(balance) OVER (
PARTITION BY round_id
ORDER BY
win_timestamp desc
) AS closing_balance,
*
FROM
win
) AS win
GROUP BY
win.round_id,
closing_balance
) AS win ON bet.round_id = win.round_id
LEFT JOIN cancel ON bet.round_id = cancel.round_id
-- add filter condition
AND bet.bookmaker_id = cancel.bookmaker_id
AND bet.provider_id = cancel.provider_id
ORDER BY
total_won;
Final need to review all of the columns whether you indeed need to try to avoid select * (unless you are using cluster index and really need all of them)
The rewritten query of the execution plan might do cluster seek on bet & cancel table and reduce more cost the original one.
Solution 3:[3]
Try with the following index :
CREATE INDEX X001 ON win (balance, win_timestamp, round_id) INCLUDE (transaction_id, win_amount)
I have rewrite your query with CTE :
WITH
WIN0 AS
(
select FIRST_VALUE(balance) OVER (PARTITION BY round_id order by win_timestamp desc) as closing_balance,
round_id, transaction_id, win_amount, win_timestamp
from win
),
WIN1 AS
(
SELECT round_id,
STRING_AGG(transaction_id, ', ') as win_transaction_ids,
SUM(win_amount) as total_won,
MAX(win_timestamp) as win_timestamp,
closing_balance
FROM WIN0
group by round_id,
closing_balance
),
BET0 AS
(
select FIRST_VALUE(balance) OVER (PARTITION BY round_id order by bet_timestamp asc) as opening_balance,
*
from bet
where bookmaker_id = 1
and provider_id = 1
and transaction_id = 'bet_tx1'
),
BET1 AS
(
SELECT round_id,
game_id,
provider_id,
bookmaker_id,
user_id,
STRING_AGG(bet.transaction_id, ', ') as bet_transaction_ids,
SUM(bet_amount) as total_bet,
MIN(bet_timestamp) as bet_timestamp,
opening_balance
FROM BET0 as bet
group by round_id,
game_id,
provider_id,
bookmaker_id,
user_id,
opening_balance
),
CCL0 AS
(
SELECT round_id,
transaction_id,
cancel_timestamp
FROM cancel
)
select betx.*,
winx.*,
cancelx.transaction_id as cancel_transaction_id,
cancelx.cancel_timestamp
from BET1 as betx
LEFT OUTER JOIN WIN1 as winx
ON betx.round_id = winx.round_id
LEFT OUTER JOIN CCL0 as cancelx
ON betx.round_id = cancelx.round_id
order by total_won;
Solution 4:[4]
IMO, when you are asking about performance of query then following info is missing
i) Explain table structure.Like how transaction_id,round_id,bookmaker_id,provider_id are form and combination of what make it unique in Table.
ii)DataType :round_id,transaction_id seem to be very important column and they cannot be nvarchar or varchar
iii) What are your existing index on table ?I cannot suggest index now because table structure is not clear to me.
There is two very visible fault in your query
i) Using subquery and nested subquery.
ii) In inner "Win query" there is no filter,it will select all unecessary data
iii) Partition function can "Under perform" specially if it select million of record.I have no idea how many record it will select and requirement is not 100% clear,so I have not corrected it.I may use Group By over Partition function for this even if i hv to use multiple Select.
iv) if bookmaker_id, provider_id,transaction_id select one record then why aggregate it or put Partition function.
Try my sample query
declare @Bet table(bookmaker_id int, provider_id int, round_id varchar(50), transaction_id varchar(50), game_id varchar(50), user_id int, bet_amount decimal(5,2), balance decimal(5,2), bet_timestamp datetime)
insert into @Bet (bookmaker_id, provider_id, round_id,transaction_id,game_id, user_id ,bet_amount, balance, bet_timestamp)
values
(1 ,1,'Round1','bet_tx1','game1',123, 10.00,90.00,'2022-03-17 01:40:57.400')
,(1 ,1,'Round2','bet_tx2','game1',123, 10.00,70.00,'2022-03-17 02:40:57.400')
,(1 ,1,'Round2','bet_tx3','game1',123, 10.00,80.00,'2022-03-17 03:40:57.400')
,(1 ,1,'Round3','bet_tx4','game1',123, 10.00,70.00,'2022-03-17 04:40:57.400')
declare @Win table(bookmaker_id int, provider_id int, round_id varchar(50), transaction_id varchar(50), user_id int, win_amount decimal(5,2), balance decimal(5,2), win_timestamp datetime)
insert into @Win (bookmaker_id, provider_id, round_id,transaction_id, user_id ,win_amount, balance, win_timestamp)
values
(1,1, 'Round1','win_tx1',123, 0.00 ,80.00,'2022-03-17 01:40:57.400')
,(1,1, 'Round2','win_tx2',123, 10.00 ,80.00,'2022-03-17 02:40:57.400')
declare @Cancel table(bookmaker_id int, provider_id int, round_id varchar(50), transaction_id varchar(50), user_id int, cancel_timestamp datetime)
insert into @Cancel (bookmaker_id,provider_id, round_id, transaction_id, user_id, cancel_timestamp)
values
(1 ,1,'Round3','can_tx1',123,'2022-03-17 01:40:57.400')
--select * from @Cancel
create table #Bet (bookmaker_id int,provider_id int, round_id varchar(50), bet_transaction_ids varchar(50), game_id varchar(50), user_id int, total_bet decimal(5,2), opening_balance decimal(5,2), bet_timestamp datetime)
-- you can bet the dynamic filter here
insert into #Bet (round_id,game_id,provider_id,bookmaker_id,user_id,bet_transaction_ids,total_bet,bet_timestamp,opening_balance)
SELECT
round_id,
game_id,
provider_id,
bookmaker_id,
user_id,
STRING_AGG(bet.transaction_id, ', ') as bet_transaction_ids,
SUM(bet_amount) as total_bet,
MIN(bet_timestamp) as bet_timestamp,
opening_balance
FROM
(
SELECT
FIRST_VALUE(balance) OVER (
PARTITION BY round_id
order by
bet_timestamp asc
) as opening_balance,
*
FROM
@Bet
WHERE
bookmaker_id = 1
AND provider_id = 1
AND transaction_id = 'bet_tx1'
) AS bet
GROUP BY
round_id,
game_id,
provider_id,
bookmaker_id,
user_id,
opening_balance
create table #Win (bookmaker_id int, provider_id int, round_id varchar(50), win_transaction_ids varchar(50), user_id int, total_won decimal(5,2), closing_balance decimal(5,2), win_timestamp datetime)
insert into #Win (round_id,win_transaction_ids,total_won,win_timestamp,closing_balance)
SELECT
round_id,
STRING_AGG(transaction_id, ', ') as win_transaction_ids,
SUM(win_amount) as total_won,
MAX(win_timestamp) as win_timestamp,
closing_balance
FROM
(
SELECT
FIRST_VALUE(balance) OVER (
PARTITION BY round_id
ORDER BY
win_timestamp desc
) AS closing_balance,
*
FROM @Win win
where exists( select * from #Bet bet where bet.round_id = win.round_id )
)t
select bet.round_id,
bet.game_id,
bet.provider_id,
bet.bookmaker_id,
bet.user_id,
bet_transaction_ids,
total_bet,
bet_timestamp,
opening_balance ,
win_transaction_ids,
total_won,
win_timestamp,
closing_balance ,
cancel.[transaction_id] as cancel_transaction_id,
cancel.[cancel_timestamp]
from #Bet bet
left join #Win win on bet.round_id=win.round_id
left join @Cancel cancel on bet.round_id = cancel.round_id
drop table #Bet,#Win
If you provide more detail then I can further optimize my query
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Umut TEK?N |
| Solution 2 | |
| Solution 3 | SQLpro |
| Solution 4 | KumarHarsh |
