Read lines from file in order, parallelization based on file structure
I have a text file formatted as such:
itemID_1:
(observation 1 for itemID_1)
(observation 2 for itemID_1)
...
(observation k_1 for itemID_1)
itemID_2:
(observation 1 for itemID_2)
(observation 2 for itemID_2)
...
(observation k_2 for itemID_2)
...
I want to create a dataframe where each row is (itemID, observation) (there can be multiple rows for the same itemID).
I would go about doing this in Python like so:

import re

rows = []
cur_itemID = None
with open('my-file.txt') as file:
    for line in file:
        # Header lines look like "123:"; everything else is an observation.
        match = re.match(r'(\d+):', line)
        if match:
            cur_itemID = match[1]
        else:
            rows.append([cur_itemID, line.rstrip('\n')])
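The rows list would then become the dataframe. For the sequential version I could do something like the following (assuming here that I ultimately want a Spark DataFrame and that a SparkSession is already available; a pandas DataFrame could be built from rows the same way):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Column names are just my choice; each row is (itemID, observation).
df = spark.createDataFrame(rows, ['itemID', 'observation'])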
So I need to read the file in order, but only so that each observation is associated with the correct itemID from the header line above it. It should be possible to parallelize this if the rows for each item could be processed simultaneously (i.e. from the line "itemID_i:" up to, but not including, the line "itemID_{i+1}:"). I'm not sure how to do something like this in Spark and would appreciate any advice.
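To make the idea concrete, here is a rough sketch of what I have in mind, not something I know to be the proper Spark approach: split the text into per-item blocks with the same header regex, then let Spark turn each block into rows independently. This assumes numeric itemIDs (as in my regex above) and a file small enough to read on the driver:

import re
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Read the whole file on the driver and split on header lines like "123:".
# re.split with a capturing group keeps the captured itemIDs in the result.
with open('my-file.txt') as f:
    text = f.read()
parts = re.split(r'^(\d+):\s*$', text, flags=re.MULTILINE)
# parts looks like ['', id_1, block_1, id_2, block_2, ...]
blocks = list(zip(parts[1::2], parts[2::2]))

def explode(block):
    item_id, body = block
    # One (itemID, observation) row per non-empty line in the block.
    return [(item_id, line) for line in body.splitlines() if line.strip()]

# Each (itemID, block) pair can be processed independently and in parallel.
rows_rdd = sc.parallelize(blocks).flatMap(explode)
df = spark.createDataFrame(rows_rdd, ['itemID', 'observation'])

Whether something like this (or, say, mapPartitions over the raw lines) is how this should be done in Spark is exactly what I'm unsure about.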
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow