Traffic outbound when querying S3 from Apache Impala
Although I have no problem getting my on-premises Impala daemons to work with S3, I'm wondering how it actually works.
For example, consider a table with 100 columns, stored as Parquet and partitioned by (year, month). What happens when I run the following query?
SELECT col1, col2, col3, avg(col4)
FROM table_with_100columns
WHERE year IN (2021, 2022) AND month IN (1, 12)
GROUP BY 1,2,3;
Does Impala download the entire partitions from S3 (with all 100 columns) and process them locally, or is it able to download only the data for col1, col2, col3 and col4? Or does it run the query on S3 and download only the result? (That would be great, but I can't see how it would be possible.)
The goal of this question is to understand the impact of using a remote storage layer with Impala, namely the network requirements and the billing.
Thanks!
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
