'Merge 2 or more csv files with time overlap data
How do I merge 2 or more csv files with time overlap data? For e.g.,
data1 is
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.84005458 0 0.0093211568 0
0.94034343 0 0.0094739951 0
data2 is
Time u v w
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
What I want is:
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
So the last few rows of data from the 1st csv file is deleted and the 2nd csv file is merged so that the time sequence is increasing.
How can that be done? Thanks.
Solution 1:[1]
Python has an excellent built in library function to help with this called heapq.merge()
.
Assuming your data is space delimited, you could use this as follows:
from heapq import merge
import csv
filenames = ['data1.csv', 'data2.csv']
merge_list = []
for filename in filenames:
f_input = open(filename)
csv_input = csv.reader(f_input, delimiter=' ', skipinitialspace=True)
header = next(csv_input)
merge_list.append(csv_input)
with open('output.csv', 'w', newline='') as f_output:
csv_output = csv.writer(f_output, delimiter=' ')
csv_output.writerow(header)
csv_output.writerows(merge(*merge_list, key=lambda x: float(x[0])))
This would produce a CSV output format as:
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.74041502 0 0.0095119512 0
0.84005458 0 0.0093211568 0
0.84043291 0 0.0095214359 0
0.94034343 0 0.0094739951 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
This will work for any number of input CSV files.
Solution 2:[2]
If both files are individually ordered by time already. Using for loop is enough:
# csv cell should be separated by comma, change if required
dilimeter = ','
# open files and read lines
f1 = open('data1.csv', 'r')
f1_lines = f1.readlines()
f1.close()
f2 = open('data2.csv', 'r')
f2_lines = f2.readlines()
f2.close()
# extract header
output_lines = [f1_lines[0]]
# start scanning frome line 2 of both files (line 1 is header)
f1_index = 1
f2_index = 1
while True:
# all data1 are processed, append remaining lines from data2
if f1_index >= len(f1_lines):
output_lines += f2_lines[f2_index:]
break
# all data2 are processed, append remaining lines from data1
if f2_index >= len(f2_lines):
output_lines += f1_lines[f1_index:]
break
f1_line_time = float(f1_lines[f1_index].split(dilimeter)[0]) # get the time cell of data1
f2_line_time = float(f2_lines[f2_index].split(dilimeter)[0]) # get the time cell of data2
if f1_line_time < f2_line_time:
output_lines.append(f1_lines[f1_index])
f1_index += 1
elif f1_lines == f2_line_time:
# if they are equal in time, pick one
output_lines.append(f1_lines[f1_index])
f1_index += 1
f2_index += 1
else:
output_lines.append(f2_lines[f2_index])
f2_index += 1
f_output = open('out.csv', 'w')
f_output.write(''.join(output_lines))
f_output.close()
Solution 3:[3]
Another option:
import csv
delimiter = " "
with open("data1.csv", "r") as fin1,\
open("data2.csv", "r") as fin2,\
open("data.csv", "w") as fout:
reader1 = csv.reader(fin1, delimiter=delimiter)
reader2 = csv.reader(fin2, delimiter=delimiter)
writer = csv.writer(fout, delimiter=delimiter)
next(reader2)
first_row = next(reader2)
start2 = float(first_row[0])
writer.writerow(next(reader1))
for row in reader1:
if start2 <= float(row[0]):
break
writer.writerow(row)
writer.writerow(first_row)
writer.writerows(reader2)
Assumption is that the files are already ordered individually:
- First take the first data row of
data2.csv
and convert its first entry into a floatstart2
. - With that in mind write all rows from
data1.csv
with a time less thanstart2
into the new filedata.csv
, and break out of the loop once the condition isn't met anymore. - Then write the already extracted first data row from
data2.csv
todata.csv
, and afterwards write the rest ofdata2.csv
todata.csv
.
Result for
data1.csv
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.84005458 0 0.0093211568 0
0.94034343 0 0.0094739951 0
data2.csv
Time u v w
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
is
Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0
A more general solution (multiple files) could look like:
import csv
delimiter = " "
files = ["data1.csv", "data2.csv", "data3.csv"]
stops = []
for file in files[1:]:
with open(file, "r") as file:
reader = csv.reader(file, delimiter=delimiter)
header = next(reader)
stops.append(float(next(reader)[0]))
stops.append(float("inf"))
with open("data.csv", "w") as fout:
writer = csv.writer(fout, delimiter=delimiter)
writer.writerow(header)
for stop, file in zip(stops, files):
with open(file, "r") as fin:
next(fin)
reader = csv.reader(fin, delimiter=delimiter)
for row in reader:
if stop <= float(row[0]):
break
writer.writerow(row)
This would work for overlaps looking like
1. file: |------|
2. file: |--------|
3. file: |------|
but not
1. file: |--------|
2. file: |-------|
3. file: |--------------|
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | AlexanderK1987 |
Solution 3 |