The format of the output files when using a sink table
When I use the Table API to create the sink table and submit the job, the files in S3 have names in a format like this:
part-2db289e0-e70a-48d4-ac11-3e75372f621d-1-179
I wonder what this format means. To my knowledge, the naming follows the pattern below, and I would like to know whether that is correct:
part-<job_id>-<partition_id>-[numOfCommit]
If it is correct, there are some questions I would like to ask.
I have set the commit interval using sink.rolling-policy.check-interval = 1min. Does the numOfCommit part of the file name mean that every time the commit interval is reached, the current file is closed and given that number? If so, what happens when the data is quite large and needs more than one commit interval: is it written to another file, and if so, what is the format of that file's name?
One more question: how can we set the size of the output files, since what the documentation recommends is adjusting the commit time?
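For reference, here is a minimal sketch of the kind of sink table I am creating (the table name, schema, bucket path, and format below are placeholders, not my actual job):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class S3SinkTableSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Placeholder sink table: the schema, path, and format are illustrative;
        // the relevant part is the rolling-policy check interval mentioned above.
        tEnv.executeSql(
                "CREATE TABLE s3_sink (\n"
                        + "  id BIGINT,\n"
                        + "  payload STRING\n"
                        + ") WITH (\n"
                        + "  'connector' = 'filesystem',\n"
                        + "  'path' = 's3://my-bucket/output/',\n"
                        + "  'format' = 'json',\n"
                        + "  'sink.rolling-policy.check-interval' = '1min'\n"
                        + ")");
    }
}
```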
Thanks all
Solution 1:[1]
The details of how the underlying file system connector works are described in the documentation for the DataStream FileSink connector.
The default naming scheme is:

In-progress / Pending: part-<uid>-<partFileIndex>.inprogress.uid
Finished: part-<uid>-<partFileIndex>

The uid is a random id assigned to a subtask of the sink when the subtask is instantiated. This uid is not fault-tolerant, so it is regenerated when the subtask recovers from a failure.
If you use the DataStream API you can customize the bucket assigner and rolling policy, but with the SQL/Table API you are limited to the options described in its documentation.
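For example, with the DataStream API a FileSink can be given an explicit rolling policy, bucket assigner, and part-file prefix/suffix. The following is a minimal sketch rather than code from the original answer: the path, bucket format, and size/time thresholds are placeholder values, and the builder method signatures shown (the Duration/MemorySize variants) are those of recent Flink releases:

```java
import java.time.Duration;

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.functions.sink.filesystem.OutputFileConfig;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

public class FileSinkSketch {

    public static FileSink<String> buildSink() {
        return FileSink
                // Row-format sink writing plain strings; the S3 path is a placeholder.
                .forRowFormat(new Path("s3://my-bucket/output"),
                        new SimpleStringEncoder<String>("UTF-8"))
                // Custom bucket assigner: one bucket (directory) per hour.
                .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd--HH"))
                // Custom rolling policy: roll on size, age, or inactivity.
                .withRollingPolicy(
                        DefaultRollingPolicy.builder()
                                .withMaxPartSize(MemorySize.ofMebiBytes(128))
                                .withRolloverInterval(Duration.ofMinutes(15))
                                .withInactivityInterval(Duration.ofMinutes(5))
                                .build())
                // Custom part-file naming: prefix and suffix around the uid/index.
                .withOutputFileConfig(
                        OutputFileConfig.builder()
                                .withPartPrefix("part")
                                .withPartSuffix(".json")
                                .build())
                .build();
    }
}
```

With the SQL/Table API, the equivalent knobs are the documented sink.rolling-policy.* options, for example sink.rolling-policy.file-size and sink.rolling-policy.rollover-interval in addition to sink.rolling-policy.check-interval, which is how the file-size question above is normally addressed.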
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | David Anderson |
