Read a table only once in Spark SQL
I have a Spark SQL query that references the same table three times, and the Spark UI shows the table being read three times. Is there any way to make Spark read the table just once?
create temporary view my_tab_v as
select prsn_id, cust_id, orig_dt, eff_dt
from db.my_tab_t
where state = 'MA';
with sub1 as
(
select prsn_id as indv, eff_dt
from my_tab_v
union
select cust_id as indv, eff_dt
from my_tab_v
),
sub2 as
(
select indv, eff_dt, row_number() over (partition by indv order by eff_dt desc) as rn
from sub1
)
select indv, '1900-01-01' as orig_dt, eff_dt
from sub2
where rn = 1
union
select prsn_id, orig_dt, eff_dt
from my_tab_v;
The Spark UI's SQL tab shows that db.my_tab_t is read 3 times. Is there any trick to make Spark read the table only once instead of 3 times?
Note: This is just an experimental query, mainly for educational purposes.
I am using Spark SQL v2.4.
Thanks
Solution 1:[1]
You can read the table into a DataFrame, then call df.cache() if the data fits in memory, or df.persist() otherwise. That way, the original table is read only once; the second and third scans are served from the cached data instead of the source table.
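If you prefer to stay in SQL rather than dropping into the DataFrame API, the same effect can be achieved with Spark SQL's CACHE TABLE statement applied to the temporary view. A sketch, reusing the view from the question:

```sql
-- Cache the filtered view so repeated scans hit memory instead of db.my_tab_t.
create temporary view my_tab_v as
select prsn_id, cust_id, orig_dt, eff_dt
from db.my_tab_t
where state = 'MA';

-- Materializes the view's result on first use; add LAZY to defer population
-- until the first query that touches it.
cache table my_tab_v;

-- ... run the WITH query against my_tab_v as before ...

-- Release the cached data when finished.
uncache table my_tab_v;
```

Note that caching does not change the plan shape shown in the UI: the first scan still reads the source table, but subsequent scans of my_tab_v are served from the in-memory columnar cache.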
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Warren Zhu |
