'Issue while writing sql query results in Spark SQL
I use Spark SQL v2.4. with the SQL API. I have a sql query, which fails when I run the job in Spark, it fails with the error :-
WARN SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints
(spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes).
This may impact query planning performance.
ERROR TransportClient: Failed to send RPC RPC 8371705265602543276 to xx.xxx.xxx.xx:52790:java.nio.channels.ClosedChannelException
The issue occurs when I am triggering write command to save the output of the query to parquet file on S3:
The query is:-
create temp view last_run_dt
as
select dt,
to_date(last_day(add_months(to_date('[${last_run_date}]','yyyy-MM-dd'), -1)), 'yyyy-MM-dd') as dt_lst_day_prv_mth
from date_dim
where dt = add_months(to_date('[${last_run_date}]','yyyy-MM-dd'), -1);
create temp view get_plcy
as
select plcy_no, cust_id
from (select
plcy_no,
cust_id,
eff_date,
row_number() over (partition by plcy_no order by eff_date desc) AS row_num
from plcy_mstr pm
cross join last_run_dt lrd
on pm.curr_pur_dt <= lrd.dt_lst_day_prv_mth
and pm.fund_type NOT IN (27, 36, 52)
and pm.fifo_time <= '2022-02-12 01:25:00'
and pm.plcy_no is not null
)
where row_num = 1;
I am writing the output as :
df.coalesce(10).write.parquet('s3:/some/dir/data', mode="overwrite", compression="snappy")
The "plcy_mstr" table in the above query is a big table of 500 GB size and is partitioned on eff_dt column. Partitioned by every date.
I have tried to increase the executor memory by applying the following configurations, but the job still fails.
set spark.driver.memory=20g;
set spark.executor.memory=20g;
set spark.executor.cores=3;
set spark.executor.instances=30;
set spark.memory.fraction=0.75;
set spark.driver.maxResultSize=0;
The cluster contains 20 nodes with 8 cores each and 64GB of memory.
Can anyone please help me identify the issue and fix the job ? Any help is appreciated.
Happy to provide more information if required.
Thanks
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
