Loop again through loop in Python (processing records in batches)
I have a table of 10,000+ records and want to process the data into another source. I don't want to send all the records in one go; rather, I want to break them into batches, e.g. batch_size=1000, so the loop sends 1000 records, then the next 1000, and so on until the last record.
How can I structure this as two loops, so that the outer loop selects 1000 records and the inner loop processes them? Once processing is done, control should return to the outer loop, advance to the next 1000 records, run the inner loop again, and so on until total_rows is reached. Please help.
Solution 1:[1]
You can use pyspark.sql.functions.slice (note that it operates on array columns, not on the DataFrame itself), but you can also do it "on foot" with regular Python list slices once the rows are collected to the driver:
```python
rows = dfpatch.rdd.collect()

batch_size = 1000
# Slice the collected rows into consecutive chunks of batch_size.
batches = [rows[r:r + batch_size] for r in range(0, len(rows), batch_size)]
print(len(batches))  # -> 11

for batch in batches:       # outer loop: one batch of up to 1000 rows
    for row in batch:       # inner loop: process each row in the batch
        # process each row here
        ...
```
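The same slicing idea can be wrapped in a small generator so batches are produced lazily instead of all at once. This sketch is plain Python with no Spark dependency; `batched` here is an illustrative helper, not a library function (Python 3.12's `itertools.batched` offers similar behavior for iterables):

```python
def batched(rows, batch_size):
    """Yield consecutive slices of `rows`, each at most `batch_size` long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# Stand-in for the 10,000+ collected records from the question.
rows = list(range(10500))

batches = list(batched(rows, 1000))
print(len(batches))      # 11 batches: ten of 1000 rows and one of 500
print(len(batches[-1]))  # 500
```

Because the generator yields one batch at a time, the outer loop can send each batch and only then pull the next one, which matches the "send 1000, then the next 1000" flow described in the question.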
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Tomalak |
