What is the best way to enhance query performance for a large database?
I'm working with an RDS MariaDB database that contains more than 270 highly connected tables, some with more than 10 million records. The main goal is to improve query performance; the most important queries are very complex, and I need a solution that handles them more effectively. For example, the following query takes 6 minutes to return its result:
SELECT table1.col1 AS col1,
       table1.col2 AS col2,
       table2.col1 AS country,
       table3.col1 AS col1,
       table2.col1 AS col1,
       table3.col2 AS time,
       Concat_ws('; ', Collect_set(DISTINCT table4.col1)) AS Route,
       CASE WHEN table5.col1 IS NULL THEN 'N/A' ELSE table5.col1 END AS MAH,
       CASE WHEN table6.col1 IS NULL THEN 'N/A' ELSE table6.col1 END AS MFG,
       table7.col1 AS PackSize,
       table1.col3 AS col3
FROM table3
     INNER JOIN table7  ON table3.col3 = table7.col1
     INNER JOIN table8  ON table3.col4 = table8.col4
     INNER JOIN table2  ON table8.col4 = table2.col1
     LEFT JOIN table9 AS table5 ON table3.col5 = table5.col1
     LEFT JOIN table9 AS table6 ON table3.col6 = table6.col1
     INNER JOIN table10 ON table3.col3 = table10.col1
     INNER JOIN table4  ON table10.col2 = table4.col1
     INNER JOIN (SELECT DISTINCT table11.col1 AS col3,
                        table12.col1 AS DosageUnit_col1,
                        table13.col1 AS col1,
                        table13.col2 AS col2
                 FROM table7
                      INNER JOIN table11 ON table7.col2 = table11.col1
                      INNER JOIN table14 ON table7.col1 = table14.col1
                      INNER JOIN table12 ON table14.col1 = table12.col1
                      INNER JOIN (SELECT table15.col1 AS col1,
                                         Concat_ws(' / ', Collect_set(table16.col1)) AS col1,
                                         Concat_ws('/', Collect_set(table17.col2, table18.col1)) AS col2
                                  FROM table15
                                       INNER JOIN table17 ON table15.col1 = table17.col3
                                       INNER JOIN table18 ON table17.col1 = table18.col1
                                       INNER JOIN table16 ON table17.col4 = table16.col1
                                  GROUP BY table15.col1) AS table13
                            ON table12.col2 = table13.col1
                 GROUP BY dosageunit_col1) AS table1
           ON table7.col1 = table1.dosageunit_col1
WHERE table3.col3 IN ( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?) and `market_authorizations`.`authority_id` in (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
AND table3.col4 IN ( ?, ?, ?, ?, ?, ?, ?, ?, ?, ? )
AND table3.mapped_col1 IS NULL
AND col2 >= ''
GROUP BY table3.col1
ORDER BY table3.col2 DESC;
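Before reaching for a migration, a useful first diagnostic on MariaDB is to inspect the query plan. This is a minimal sketch (not from the original post); the table and column names are the placeholders used in the question, and the suggested index is purely illustrative:

```sql
-- Prefix the slow query (or a simplified portion of it) with EXPLAIN to see
-- the join order, each table's access type, and which indexes are chosen.
EXPLAIN
SELECT table3.col1, table3.col2
FROM table3
     INNER JOIN table7 ON table3.col3 = table7.col1
WHERE table3.col4 IN (?, ?, ?);

-- A row with type = ALL (full table scan) on a large table usually means a
-- missing index on the join or filter column, e.g. (hypothetical):
CREATE INDEX idx_table3_col3 ON table3 (col3);
```

`ANALYZE FORMAT=JSON` on the same query additionally reports actual row counts per step, which helps confirm whether the optimizer's estimates are off.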
The first idea I had was to migrate the data to Amazon Redshift, so I created an AWS DMS migration task to export all the data from RDS to S3 in Apache Parquet format, to take advantage of columnar storage and compression. The migration task created one Parquet file per exported table. To test the performance of the various options under consideration (RDS, Redshift, Redshift Spectrum, Athena), I used the following query, which returns the first million records of a table:
select * from table limit 1000000;
and the execution time in seconds was as follows:
- RDS MariaDB (8 vCPUs, 32 GB RAM, 600 IOPS) => 23.462 sec
- Redshift Spectrum (two nodes, each with 2 vCPUs and 15 GB RAM) => 23 sec
- Athena => 4.01 sec
Although Athena and Redshift Spectrum read the same file stored in S3, the performance difference between them is significant.
Why didn't performance improve at all when I used Redshift Spectrum?
Why do Redshift and RDS show the same performance?
Why did I get exactly the same time (23 sec) after resizing the Redshift cluster from one node to two?
Is that because the data is unpartitioned (one Parquet file per table)?
Why does Athena perform best even though it reads from the same source as Redshift Spectrum?
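For reference on the partitioning question: with a single Parquet file per table, every query must scan the whole file, and partition pruning never kicks in. One way to repartition is an Athena CTAS statement; this is a hypothetical sketch, where the database, table, column, and bucket names are assumptions for illustration:

```sql
-- Rewrite one large Parquet file into a table partitioned by country, so
-- queries filtering on country read only the matching S3 prefixes.
-- In Athena CTAS, partition columns must come last in the SELECT list.
CREATE TABLE my_db.table3_partitioned
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/table3_partitioned/',
  partitioned_by = ARRAY['country']
) AS
SELECT col1, col2, col3, country
FROM my_db.table3;
```

Redshift Spectrum can also prune partitions, but only if the external table is declared with `PARTITIONED BY` and the partitions are registered; a flat, unpartitioned layout gives it nothing to prune.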
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
