Python: how to know if a file has finished uploading into HDFS
So I have 2 scripts: script1 uploads a file to HDFS, and script2 accesses the folder and reads the files every n seconds.
My upload script looks like this:
```python
from hdfs import InsecureClient
from requests import Session
from requests.auth import HTTPBasicAuth

# Authenticate the WebHDFS session with basic auth.
session = Session()
session.auth = HTTPBasicAuth('hadoop', 'password')
client_hdfs = InsecureClient('http://hadoop.domain.com:50070', user='hadoop', session=session)

# hdfsPath and filePath are set elsewhere in my script.
client_hdfs.upload(hdfsPath, filePath, overwrite=True)
```
When I read https://martin.atlassian.net/wiki/spaces/lestermartin/blog/2019/03/21/1172373509/are+partially-written+hdfs+files+accessible+not+exactly+but+much+more+yes+than+I+previously+thought or the Stack Overflow question "Accessing a file that is being written", I learned the following.
It seems that when I upload using the hadoop dfs -put command (or -copyFromLocal or -cp), Hadoop creates [filename]._COPYING_ while the copy is in progress and renames it once it is finished. But with the Python script, the file appears under its final name right away and its size keeps growing until the upload completes, so script2 can download it before it is complete and end up with a corrupted file.
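As a side note on the reading side, a heuristic script2 could use (my own sketch, not something from the linked article) is to treat a file as finished only once its reported length stops changing between two polls; the wait_seconds value is an arbitrary assumption:

```python
import time
from hdfs import InsecureClient

client = InsecureClient('http://hadoop.domain.com:50070', user='hadoop')

def size_is_stable(hdfs_path, wait_seconds=5):
    """Heuristic: assume an upload is finished if the reported file
    length does not change across two status calls wait_seconds apart."""
    before = client.status(hdfs_path)['length']
    time.sleep(wait_seconds)
    after = client.status(hdfs_path)['length']
    return before == after
```

This is only a heuristic (a stalled upload also looks stable), which is part of why I am leaning toward the temporary-folder approach below.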
I want to ask: is there a way to upload the file using Python so that we know whether the file has finished uploading or not?
Actually I have another workaround: upload the files into a temporary folder and move them to the correct folder after everything is finished (I am still trying to implement this, see the sketch below). But if there is another idea for this, it will be appreciated.
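For reference, a minimal sketch of that workaround with the same hdfs client; the /data/_incoming staging directory and the helper name are placeholders of mine. Because an HDFS rename is a metadata-only operation, script2 never sees a half-written file under the final path:

```python
import os
import posixpath
from hdfs import InsecureClient
from requests import Session
from requests.auth import HTTPBasicAuth

session = Session()
session.auth = HTTPBasicAuth('hadoop', 'password')
client_hdfs = InsecureClient('http://hadoop.domain.com:50070', user='hadoop', session=session)

def upload_then_move(final_dir, local_path, staging_dir='/data/_incoming'):
    """Upload into a staging directory first, then rename into place,
    so readers of final_dir only ever see fully written files."""
    name = os.path.basename(local_path)
    tmp_path = posixpath.join(staging_dir, name)
    final_path = posixpath.join(final_dir, name)
    client_hdfs.makedirs(staging_dir)
    client_hdfs.upload(tmp_path, local_path, overwrite=True)
    client_hdfs.rename(tmp_path, final_path)
```

One caveat: I believe HdfsCLI's rename fails with an HdfsError if the destination already exists, so an overwriting version would have to delete the old file first.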