Is there a way to list S3 objects by last modified time using Airflow?

Here is my code so far:

t1 = S3ListOperator(
    task_id='list_s3_files',
    bucket='mybucket',
    prefix='v01/{{ds}}/',
    delimiter='/'
)

I will then copy the latest file across using S3CopyObjectOperator.



Solution 1:[1]

This isn't a particularly "Airflow" way of doing it, but you could do this inside a PythonOperator:

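import boto3

# S3 cannot sort server-side, so list the objects and sort them client-side.
all_objects = boto3.resource('s3').Bucket(your_bucket_name).objects.all()
sorted_objs = sorted(all_objects, key=lambda o: o.last_modified)
latest_file = sorted_objs[-1]  # most recently modified object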

It's not an "industrial solution", though, as it requires listing every object in the bucket just to sort them. S3's list API can't sort or filter by modification time, so it doesn't support "querying" like that.

If you have a predictable way to segment the files (e.g. per-day or per-hour key prefixes), it wouldn't be that bad, though, since you only need to list one segment.
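To make the segmented version concrete, here is a minimal sketch of a callable you could hand to a PythonOperator. The function names `latest_key` and `find_latest_key` are made up for illustration, and it assumes your keys share a date prefix like the `v01/{{ds}}/` one in the question. It only lists objects under that prefix and picks the newest one with `max`, so nothing has to be fully sorted.

```python
def latest_key(bucket, prefix):
    """Return the key of the most recently modified object under prefix, or None."""
    # Restrict the listing to one prefix; the comparison still happens client-side.
    objects = bucket.objects.filter(Prefix=prefix)
    newest = max(objects, key=lambda o: o.last_modified, default=None)
    return newest.key if newest is not None else None


def find_latest_key(bucket_name, prefix):
    """Hypothetical python_callable: look up the bucket and return the newest key."""
    import boto3  # imported here so the DAG file can be parsed without boto3 loaded
    bucket = boto3.resource("s3").Bucket(bucket_name)
    return latest_key(bucket, prefix)
```

If you pass `find_latest_key` as the `python_callable` with `op_kwargs={'bucket_name': 'mybucket', 'prefix': 'v01/{{ds}}/'}`, the returned key should end up in XCom by default, where the copy task can pull it. Treat that wiring as an assumption to check against your Airflow version.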

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Kache