Getting number format exception in Hive when changing mapred.max.split.size

I am running many commands like this to export data from Hive as CSVs:

INSERT OVERWRITE DIRECTORY '/output/database/table/' 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
SELECT * FROM database.table;

That command does exactly what I want, but the output files come out in seemingly random sizes, ranging from 100 MB to 500 MB. To cap the maximum file size, I am setting these Hive settings:

set mapred.min.split.size=1024;
set mapred.max.split.size=128000000‬;
set tez.grouping.min-size=1024;
set tez.grouping.max-size=128000000;

But when I run the above query I am getting this error:

ERROR : Status: Failed
ERROR : Vertex failed, vertexName=Map 1, vertexId=vertex_1642026592077_2420_5_00, diagnostics=[Vertex vertex_1642026592077_2420_5_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: table initializer failed, vertex=vertex_1642026592077_2420_5_00 [Map 1], java.lang.NumberFormatException: For input string: "128000000‬"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Long.parseLong(Long.java:589)
        at java.lang.Long.parseLong(Long.java:631)
        at org.apache.hadoop.conf.Configuration.getLong(Configuration.java:1540)
        at org.apache.hadoop.hive.conf.HiveConf.getLongVar(HiveConf.java:5134)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat$Context.<init>(OrcInputFormat.java:642)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1960)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:532)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:789)
        at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
]
ERROR : DAG did not succeed due to VERTEX_FAILURE. failedVertices:1 killedVertices:0

As far as I know 128000000‬ is a valid number, so I am at a loss with this error



Solution 1:[1]

The settings you are using control mapper parallelism (the size of the input splits), not the size of the output files.

Try triggering a merge step and specifying the required file size instead. See the comments:

set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=128000000; --Size of merged files at the end of the job
set hive.merge.smallfiles.avgsize=120000000; --When the average output file size of a job is less than this number,
--Hive will start an additional map-reduce job to merge the output files into bigger files
--For Tez
set hive.merge.tezfiles=true;
set hive.merge.size.per.task=128000000;
set hive.merge.smallfiles.avgsize=120000000;
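
As for the NumberFormatException itself: a plain run of digits such as 128000000 parses fine as a long, so the quoted input string must contain some other character, most likely an invisible one picked up when the value was pasted. The following is a minimal Python sketch for checking a setting line before running it. It is not part of the original answer: the helper name find_hidden_chars is hypothetical, and the sample string with U+202C is only an assumed example of such a character.

```python
import unicodedata


def find_hidden_chars(text):
    """Return (index, codepoint, name) for every character outside printable ASCII.

    Intended for a single pasted line; tabs and newlines are reported too.
    """
    hits = []
    for i, ch in enumerate(text):
        if not (32 <= ord(ch) <= 126):
            name = unicodedata.name(ch, "UNKNOWN")
            hits.append((i, "U+%04X" % ord(ch), name))
    return hits


# Assumed example: an invisible formatting character right after the number.
suspect = "set mapred.max.split.size=128000000\u202c;"
clean = "set mapred.max.split.size=128000000;"

print(find_hidden_chars(suspect))  # one entry, pointing at the hidden character
print(find_hidden_chars(clean))    # empty list
```

If the check reports anything, retyping the value by hand instead of pasting it should remove the stray character.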

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 leftjoin