How to grab all Hive files after a certain date for S3 upload (Python)
I'm writing a program for a daily upload to S3 of all our Hive tables from a particular database. This database contains records from many years ago, however, and is way too large for a full copy/distcp.
I want to search the entire directory in HDFS that contains the database and grab only the files whose last modified date is after a specified (input) date.
I will then distcp these matching files to S3. (If I need to just write the paths/names of the matching files to a separate file and then distcp from that file, that's fine too.)
Looking online, I found that I can sort the files by their last modified date using the -t flag, so I started out with something like `hdfs dfs -ls -R -t <path_to_db>`, but this isn't enough. It prints something like 500,000 files, and I still need to figure out how to filter out the ones from before the input date...
EDIT: I'm writing a Python script, sorry for not clarifying initially!
EDIT pt2: I should note that I need to traverse several thousand, or even several hundred thousand, files. I've written a basic script in an attempt to solve my problem, but it takes an incredibly long time to run. I need a way to speed up the process.
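(For illustration, a minimal sketch of this kind of script, assuming the standard eight-column output of `hdfs dfs -ls -R` and paths without spaces; the database path and cutoff date are placeholders.)

```python
# Minimal sketch (not the actual script): stream `hdfs dfs -ls -R` output and
# keep only files modified after a cutoff date. Assumes the standard 8-column
# output: perms, repl, owner, group, size, date, time, path (no spaces in paths).
import subprocess
from datetime import datetime

def files_modified_after(hdfs_path, cutoff):
    proc = subprocess.Popen(
        ["hdfs", "dfs", "-ls", "-R", hdfs_path],
        stdout=subprocess.PIPE, text=True,
    )
    matches = []
    for line in proc.stdout:
        fields = line.split()
        if len(fields) < 8 or fields[0].startswith("d"):
            continue  # skip malformed lines and directories
        modified = datetime.strptime(f"{fields[5]} {fields[6]}", "%Y-%m-%d %H:%M")
        if modified > cutoff:
            matches.append(fields[7])
    proc.wait()
    return matches

# Write the matching paths to a file that can serve as a distcp source list.
recent = files_modified_after("/path/to/db", datetime(2022, 1, 1))  # placeholder path/date
with open("files_to_copy.txt", "w") as out:
    out.write("\n".join(recent))
```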
Solution 1:[1]
I'm not sure if you use Java, but here is an example of what can be done. I made some small modifications to use the last modified time.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// For date conversion from long (epoch millis) to human readable.
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;

public class FileStatusChecker {
    public static void main(String[] args) throws Exception {
        try {
            FileSystem fs = FileSystem.get(new Configuration());
            // You need to pass in your own HDFS path here.
            String hdfsFilePath = "hdfs://My-NN-HA/Demos/SparkDemos/inputFile.txt";
            FileStatus[] status = fs.listStatus(new Path(hdfsFilePath));
            for (int i = 0; i < status.length; i++) {
                long lastModifiedTimeLong = status[i].getModificationTime();
                Date lastModifiedTimeDate = new Date(lastModifiedTimeLong);
                DateFormat df = new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss Z");
                System.out.println("The file '" + status[i].getPath()
                        + "' was last modified at: " + df.format(lastModifiedTimeDate));
            }
        } catch (Exception e) {
            System.out.println("File not found");
            e.printStackTrace();
        }
    }
}
```
It would enable you to create a list of files and do "things" with them.
Solution 2:[2]
You can use WebHDFS to pull the exact same information: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/WebHDFS.html
This might be more friendly to use with Python.
Status of a File/Directory: submit an HTTP GET request.

```
curl -i "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETFILESTATUS"
```

The client receives a response with a FileStatus JSON object:

```
HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked

{
  "FileStatus":
  {
    "accessTime"      : 0,
    "blockSize"       : 0,
    "group"           : "supergroup",
    "length"          : 0,             //in bytes, zero for directories
    "modificationTime": 1320173277227,
    "owner"           : "webuser",
    "pathSuffix"      : "",
    "permission"      : "777",
    "replication"     : 0,
    "type"            : "DIRECTORY"    //enum {FILE, DIRECTORY}
  }
}
```

List a Directory: submit an HTTP GET request.

```
curl -i "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS"
```

The client receives a response with a FileStatuses JSON object:

```
HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 427

{
  "FileStatuses":
  {
    "FileStatus":
    [
      {
        "accessTime"      : 1320171722771,
        "blockSize"       : 33554432,
        "group"           : "supergroup",
        "length"          : 24930,
        "modificationTime": 1320171722771,
        "owner"           : "webuser",
        "pathSuffix"      : "a.patch",
        "permission"      : "644",
        "replication"     : 1,
        "type"            : "FILE"
      },
      {
        "accessTime"      : 0,
        "blockSize"       : 0,
        "group"           : "supergroup",
        "length"          : 0,
        "modificationTime": 1320895981256,
        "owner"           : "szetszwo",
        "pathSuffix"      : "bar",
        "permission"      : "711",
        "replication"     : 0,
        "type"            : "DIRECTORY"
      },
      ...
    ]
  }
}
```
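Building on the LISTSTATUS response above, a hedged Python sketch of this approach might look like the following. The NameNode host/port, the database path, and the cutoff date are placeholders, and the request is unauthenticated; a secured cluster would need Kerberos/SPNEGO or a delegation token.

```python
# Sketch: recursively walk a directory via WebHDFS LISTSTATUS and collect files
# whose modificationTime (epoch milliseconds) is newer than a cutoff.
import requests
from datetime import datetime

WEBHDFS = "http://namenode.example.com:9870/webhdfs/v1"  # placeholder host/port

def list_newer_than(path, cutoff_ms, found=None):
    if found is None:
        found = []
    resp = requests.get(WEBHDFS + path, params={"op": "LISTSTATUS"})
    resp.raise_for_status()
    for status in resp.json()["FileStatuses"]["FileStatus"]:
        suffix = status["pathSuffix"]
        child = f"{path.rstrip('/')}/{suffix}" if suffix else path
        if status["type"] == "DIRECTORY":
            list_newer_than(child, cutoff_ms, found)
        elif status["modificationTime"] > cutoff_ms:
            found.append(child)
    return found

# Example: files under the (placeholder) db directory modified after 2022-01-01.
cutoff = int(datetime(2022, 1, 1).timestamp() * 1000)  # WebHDFS times are epoch millis
recent_files = list_newer_than("/warehouse/tablespace/managed/hive/mydb.db", cutoff)
```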
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Matt Andruff |
| Solution 2 | Matt Andruff |
