Disable Multipart Upload to S3 On Spark
I'm trying to write to a bucket where access is granted anonymously (the bucket policy allows our VPC). For a small workload it works fine, but for a big one I get the following exception:
22/02/08 19:25:40 WARN TaskSetManager: Lost task 7.0 in stage 2.0 (TID 227) (172.20.64.7 executor 1): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:396)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:284)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:150)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:119)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:91)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:813)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1620)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:816)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:672)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.file.AccessDeniedException: cdc_streaming/sitemercado/data/groceries-sm.dbo.produto_integracao/part-00007-43d2b092-ea37-43c3-908a-a987df7a9a88-c000.snappy.parquet: initiate MultiPartUpload on cdc_streaming/sitemercado/data/groceries-sm.dbo.produto_integracao/part-00007-43d2b092-ea37-43c3-908a-a987df7a9a88-c000.snappy.parquet: com.amazonaws.services.s3.model.AmazonS3Exception: Anonymous users cannot initiate multipart uploads. Please authenticate.; request: POST https://prd-ifood-data-lake-transient-groceries.bucket.vpce-08b663a29475cd9f4-wcex1383.s3.us-east-1.vpce.amazonaws.com cdc_streaming/sitemercado/data/groceries-sm.dbo.produto_integracao/part-00007-43d2b092-ea37-43c3-908a-a987df7a9a88-c000.snappy.parquet {key=[null]} Hadoop 2.7.4, aws-sdk-java/1.11.655 Linux/5.4.0-1063-azure OpenJDK_64-Bit_Server_VM/25.302-b08 java/1.8.0_302 scala/2.12.10 vendor/Azul_Systems,_Inc. com.amazonaws.services.s3.model.InitiateMultipartUploadRequest; Request ID: WFX4E5H30RYT9SXJ, Extended Request ID: JMxLdOn+T0y1tysF63mg31uvPRttI3wTC7xQAlTxxfpiSY6myfzKYWdmL4G8Jvr1vMNcWgKAos0=, Cloud Provider: Azure, Instance ID: da1fea5b08ef43f1b09adb89a162772c (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: WFX4E5H30RYT9SXJ; S3 Extended Request ID: JMxLdOn+T0y1tysF63mg31uvPRttI3wTC7xQAlTxxfpiSY6myfzKYWdmL4G8Jvr1vMNcWgKAos0=), S3 Extended Request ID: JMxLdOn+T0y1tysF63mg31uvPRttI3wTC7xQAlTxxfpiSY6myfzKYWdmL4G8Jvr1vMNcWgKAos0=:AccessDenied
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:248)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:111)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.lambda$retry$3(Invoker.java:265)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:322)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:261)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.retry(Invoker.java:236)
at shaded.databricks.org.apache.hadoop.fs.s3a.WriteOperationHelper.retry(WriteOperationHelper.java:132)
at shaded.databricks.org.apache.hadoop.fs.s3a.WriteOperationHelper.initiateMultiPartUpload(WriteOperationHelper.java:215)
at shaded.databricks.org.apache.hadoop.fs.s3a.S3ABlockOutputStream$MultiPartUpload.<init>(S3ABlockOutputStream.java:579)
at shaded.databricks.org.apache.hadoop.fs.s3a.PageCRCVerifyingS3ABlockOutputStream$PageCRCVerifyingMultiPartUpload.<init>(PageCRCVerifyingS3ABlockOutputStream.java:153)
at shaded.databricks.org.apache.hadoop.fs.s3a.PageCRCVerifyingS3ABlockOutputStream.initMultipartUpload(PageCRCVerifyingS3ABlockOutputStream.java:113)
at shaded.databricks.org.apache.hadoop.fs.s3a.S3ABlockOutputStream.uploadCurrentBlock(S3ABlockOutputStream.java:326)
at shaded.databricks.org.apache.hadoop.fs.s3a.S3ABlockOutputStream.write(S3ABlockOutputStream.java:306)
at shaded.databricks.org.apache.hadoop.fs.s3a.PageCRCVerifyingS3ABlockOutputStream.write(PageCRCVerifyingS3ABlockOutputStream.java:18)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.parquet.hadoop.util.HadoopPositionOutputStream.write(HadoopPositionOutputStream.java:45)
at org.apache.parquet.bytes.ConcatenatingByteArrayCollector.writeAllTo(ConcatenatingByteArrayCollector.java:46)
at org.apache.parquet.hadoop.ParquetFileWriter.writeDataPages(ParquetFileWriter.java:536)
at org.apache.parquet.hadoop.ColumnChunkPageWriteStore$ColumnChunkPageWriter.writeToFileWriter(ColumnChunkPageWriteStore.java:246)
at org.apache.parquet.hadoop.ColumnChunkPageWriteStore.flushToFileWriter(ColumnChunkPageWriteStore.java:316)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:202)
at org.apache.parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:127)
at org.apache.parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:165)
at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.close(ParquetOutputWriter.scala:41)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.releaseResources(FileFormatDataWriter.scala:58)
at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.commit(FileFormatDataWriter.scala:75)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$2(FileFormatWriter.scala:377)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1654)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:383)
... 19 more
Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: Anonymous users cannot initiate multipart uploads. Please authenticate.; request: POST https://prd-ifood-data-lake-transient-groceries.bucket.vpce-08b663a29475cd9f4-wcex1383.s3.us-east-1.vpce.amazonaws.com cdc_streaming/sitemercado/data/groceries-sm.dbo.produto_integracao/part-00007-43d2b092-ea37-43c3-908a-a987df7a9a88-c000.snappy.parquet {key=[null]} Hadoop 2.7.4, aws-sdk-java/1.11.655 Linux/5.4.0-1063-azure OpenJDK_64-Bit_Server_VM/25.302-b08 java/1.8.0_302 scala/2.12.10 vendor/Azul_Systems,_Inc. com.amazonaws.services.s3.model.InitiateMultipartUploadRequest; Request ID: WFX4E5H30RYT9SXJ, Extended Request ID: JMxLdOn+T0y1tysF63mg31uvPRttI3wTC7xQAlTxxfpiSY6myfzKYWdmL4G8Jvr1vMNcWgKAos0=, Cloud Provider: Azure, Instance ID: da1fea5b08ef43f1b09adb89a162772c (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: WFX4E5H30RYT9SXJ; S3 Extended Request ID: JMxLdOn+T0y1tysF63mg31uvPRttI3wTC7xQAlTxxfpiSY6myfzKYWdmL4G8Jvr1vMNcWgKAos0=), S3 Extended Request ID: JMxLdOn+T0y1tysF63mg31uvPRttI3wTC7xQAlTxxfpiSY6myfzKYWdmL4G8Jvr1vMNcWgKAos0=
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4926)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4872)
at com.amazonaws.services.s3.AmazonS3Client.initiateMultipartUpload(AmazonS3Client.java:3560)
at shaded.databricks.org.apache.hadoop.fs.s3a.S3AFileSystem.initiateMultipartUpload(S3AFileSystem.java:3641)
at shaded.databricks.org.apache.hadoop.fs.s3a.WriteOperationHelper.lambda$initiateMultiPartUpload$0(WriteOperationHelper.java:216)
at shaded.databricks.org.apache.hadoop.fs.s3a.Invoker.once(Invoker.java:109)
... 48 more
Since the problem is with multipart upload, I've tried to disable it. I've already tried the following (a sketch of how the settings were applied follows the list):
- set spark.hadoop.fs.s3.multipart.uploads.enabled to false
- set spark.hadoop.fs.s3a.multipart.uploads.enabled to false
- set spark.hadoop.fs.s3n.multipart.uploads.enabled to false
- set spark.hadoop.fs.s3.multipart.threshold to a very large value
- set spark.hadoop.fs.s3a.multipart.threshold to a very large value
- set spark.hadoop.fs.s3n.multipart.threshold to a very large value
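For reference, a minimal sketch of how these settings were applied with PySpark (the threshold value is a placeholder; on Databricks the same key/value pairs go in the cluster's Spark config instead, since spark.hadoop.* options are only read at cluster startup):

```python
# Minimal sketch (placeholder values): the settings listed above, applied
# at session startup. On Databricks, put these key/value pairs in the
# cluster's Spark config, because spark.hadoop.* options are only read
# when the cluster starts.
from pyspark.sql import SparkSession

# Any value larger than the biggest file written will do; 5 TB here.
BIG = str(5 * 1024**4)

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3.multipart.uploads.enabled", "false")
    .config("spark.hadoop.fs.s3a.multipart.uploads.enabled", "false")
    .config("spark.hadoop.fs.s3n.multipart.uploads.enabled", "false")
    .config("spark.hadoop.fs.s3.multipart.threshold", BIG)
    .config("spark.hadoop.fs.s3a.multipart.threshold", BIG)
    .config("spark.hadoop.fs.s3n.multipart.threshold", BIG)
    .getOrCreate()
)
```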
All of these were set at cluster startup, and none of them works; the error stays the same. It is worth mentioning that:
- It is an Azure Databricks instance.
- I'm using Pyspark.
- There's a security restriction on creating users (i.e., on access through an accessKey/secretKey pair), hence the anonymous access.
Has anyone had a similar issue and successfully disabled multipart upload? Cheers!
Solution 1:[1]
As a performance optimization, by default the commit service sometimes pushes very small updates directly from the control plane to S3. To disable this optimization, set the Spark parameter spark.hadoop.fs.s3a.databricks.s3commit.directPutFileSizeThreshold to 0. You can apply this setting in the cluster's Spark config or set it in a global init script.
Link: docs.databricks.com
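A minimal sketch of applying that parameter (placement is a judgment call; on Databricks it normally belongs in the cluster's Spark config or a global init script so it is in place at cluster startup):

```python
# Minimal sketch: disable the commit service's direct-put optimization.
# On Databricks, set this in the cluster's Spark config or a global init
# script; spark.hadoop.* settings must be present at cluster startup.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config(
        "spark.hadoop.fs.s3a.databricks.s3commit.directPutFileSizeThreshold",
        "0",
    )
    .getOrCreate()
)
```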
Solution 2:[2]
it says "Anonymous users cannot initiate multipart uploads."
If you can grant this permission, do so. Otherwise, play with the S3A options that set the size at which uploads switch from a single POST to multipart (as sketched below). I think it is 'fs.s3a.multipart.threshold'; set it to something big like "1G", and make sure you have enough local storage to buffer all active uploads.
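A minimal sketch of that suggestion, using the standard Hadoop S3A key (newer Hadoop versions accept size suffixes like "1G"; older ones want a raw byte count):

```python
# Minimal sketch: raise the size at which S3A switches from a single PUT
# to a multipart upload. "1G" is the example value from the answer; older
# Hadoop releases may require the byte count "1073741824" instead.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.multipart.threshold", "1G")
    .getOrCreate()
)
```

Note the caveat from the answer: each in-flight upload is buffered locally up to that threshold, so local disk must be sized accordingly.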
Note that the stack trace includes 'PageCRCVerifyingS3ABlockOutputStream'. That class is not in the Apache S3A connector, so you will be getting whatever Databricks has implemented, which may behave differently.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Mike Beck |
| Solution 2 | |
