How can I join these DataFrames on the closest timestamp?
I have two DataFrames:
```python
from datetime import datetime
from pyspark.sql.types import StructType, StructField, IntegerType, DateType

a = spark.createDataFrame(
    data=[
        (1, datetime.strptime('2022-05-16', '%Y-%m-%d')),
        (1, datetime.strptime('2022-05-15', '%Y-%m-%d')),
        (1, datetime.strptime('2022-05-14', '%Y-%m-%d')),
        (1, datetime.strptime('2022-05-14', '%Y-%m-%d')),
        (1, datetime.strptime('2022-05-05', '%Y-%m-%d')),
    ],
    schema=StructType([
        StructField('seller_id', IntegerType()),
        StructField('completed_at', DateType()),
    ]),
)
b = spark.createDataFrame(
    data=[
        (1, datetime.strptime('2022-05-16', '%Y-%m-%d'), 70),
        (1, datetime.strptime('2022-05-15', '%Y-%m-%d'), 71),
        (1, datetime.strptime('2022-05-14', '%Y-%m-%d'), 70),
        (1, datetime.strptime('2022-05-03', '%Y-%m-%d'), 65),
    ],
    schema=StructType([
        StructField('user_id', IntegerType()),
        StructField('event_timestamp', DateType()),
        StructField('lat', IntegerType()),
    ]),
)
```
I want to join them on `user_id == seller_id`, matching each `completed_at` to its exact `event_timestamp` when one exists, and otherwise to the closest `event_timestamp`.
What is the best way to do this?
Desired output should look like this:
| seller_id | completed_at | lat |
|---|---|---|
| 1 | 2022-05-05 | 65 |
| 1 | 2022-05-14 | 70 |
| 1 | 2022-05-14 | 70 |
| 1 | 2022-05-15 | 71 |
| 1 | 2022-05-16 | 70 |
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
