How to get the group of different intervals in PySpark?
I have one PySpark DataFrame with different intervals and their corresponding groups, and I need to evaluate a column of another DataFrame to get the group of the interval each value falls into.
These are the intervals:
```
# +-----+-----+-----+
# |start|  end|grupo|
# +-----+-----+-----+
# |    0|   10|    1|
# |   11|   27|    2|
# |   28|   33|    3|
# |   34|   41|    4|
# |   42|   46|    5|
# +-----+-----+-----+
```
And I have this:
```
# +------+
# |result|
# +------+
# |     5|
# |     7|
# |    33|
# |    22|
# |    41|
# +------+
```
And I need this:
```
# +------+-----+
# |result|grupo|
# +------+-----+
# |     5|    1|
# |     7|    1|
# |    33|    3|
# |    22|    2|
# |    41|    4|
# +------+-----+
```
Solution 1:[1]
You can join the DataFrame containing the intervals with the results DataFrame on the condition
```
df["result"].between(df_intervals["start"], df_intervals["end"])
```
Working example:
```
df_intervals = spark.createDataFrame(
    [(0, 10, 1),
     (11, 27, 2),
     (28, 33, 3),
     (34, 41, 4),
     (42, 46, 5)],
    ("start", "end", "group"))

df = spark.createDataFrame(
    [(5,), (7,), (33,), (22,), (41,)],
    ("result",))

(df_intervals
 .join(df, df["result"].between(df_intervals["start"], df_intervals["end"]))
 .select("result", "group")
 .show())
"""
+------+-----+
|result|group|
+------+-----+
| 5| 1|
| 7| 1|
| 22| 2|
| 33| 3|
| 41| 4|
+------+-----+
"""
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Nithish |
