'Merge 2 or more csv files with time overlap data

How do I merge 2 or more csv files with time overlap data? For e.g.,

data1 is

Time u v w
0.24001821                          0        0.009301949        0
0.6400364                           0        0.009311552        0
0.84005458                          0        0.0093211568        0
0.94034343                          0        0.0094739951        0

data2 is

Time u v w
0.74041502                          0        0.0095119512        0
0.84043291                          0        0.0095214359        0
0.94045075                          0        0.0095309047        0
1.2404686                           0        0.0095403752        0

What I want is:

Time u v w
0.24001821                          0        0.009301949        0
0.6400364                           0        0.009311552        0
0.74041502                          0        0.0095119512        0
0.84043291                          0        0.0095214359        0
0.94045075                          0        0.0095309047        0
1.2404686                           0        0.0095403752        0

So the last few rows of data from the 1st csv file is deleted and the 2nd csv file is merged so that the time sequence is increasing.

How can that be done? Thanks.



Solution 1:[1]

Python has an excellent built in library function to help with this called heapq.merge().

Assuming your data is space delimited, you could use this as follows:

from heapq import merge
import csv

filenames = ['data1.csv', 'data2.csv']
merge_list = []

for filename in filenames:
    f_input = open(filename)
    csv_input = csv.reader(f_input, delimiter=' ', skipinitialspace=True)
    header = next(csv_input)
    merge_list.append(csv_input)
    
with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output, delimiter=' ')
    csv_output.writerow(header)
    csv_output.writerows(merge(*merge_list, key=lambda x: float(x[0])))

This would produce a CSV output format as:

Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.74041502 0 0.0095119512 0
0.84005458 0 0.0093211568 0
0.84043291 0 0.0095214359 0
0.94034343 0 0.0094739951 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0

This will work for any number of input CSV files.

Solution 2:[2]

If both files are individually ordered by time already. Using for loop is enough:

# csv cell should be separated by comma, change if required
dilimeter = ','
# open files and read lines
f1 = open('data1.csv', 'r')
f1_lines = f1.readlines()
f1.close()
f2 = open('data2.csv', 'r')
f2_lines = f2.readlines()
f2.close()

# extract header
output_lines = [f1_lines[0]]

# start scanning frome line 2 of both files (line 1 is header)
f1_index = 1
f2_index = 1
while True:
    # all data1 are processed, append remaining lines from data2
    if f1_index >= len(f1_lines):
        output_lines += f2_lines[f2_index:]
        break
    # all data2 are processed, append remaining lines from data1
    if f2_index >= len(f2_lines):
        output_lines += f1_lines[f1_index:]
        break
    f1_line_time = float(f1_lines[f1_index].split(dilimeter)[0]) # get the time cell of data1
    f2_line_time = float(f2_lines[f2_index].split(dilimeter)[0]) # get the time cell of data2
    if f1_line_time < f2_line_time:
        output_lines.append(f1_lines[f1_index])
        f1_index += 1
    elif f1_lines == f2_line_time:
        # if they are equal in time, pick one
        output_lines.append(f1_lines[f1_index])
        f1_index += 1
        f2_index += 1
    else:
        output_lines.append(f2_lines[f2_index])
        f2_index += 1
f_output = open('out.csv', 'w')
f_output.write(''.join(output_lines))
f_output.close()

Solution 3:[3]

Another option:

import csv

delimiter = " "
with open("data1.csv", "r") as fin1,\
     open("data2.csv", "r") as fin2,\
     open("data.csv", "w") as fout:
    reader1 = csv.reader(fin1, delimiter=delimiter)
    reader2 = csv.reader(fin2, delimiter=delimiter)
    writer = csv.writer(fout, delimiter=delimiter)
    next(reader2)
    first_row = next(reader2)
    start2 = float(first_row[0])
    writer.writerow(next(reader1))
    for row in reader1:
        if  start2 <= float(row[0]):
            break
        writer.writerow(row)
    writer.writerow(first_row)
    writer.writerows(reader2)

Assumption is that the files are already ordered individually:

  • First take the first data row of data2.csv and convert its first entry into a float start2.
  • With that in mind write all rows from data1.csv with a time less than start2 into the new file data.csv, and break out of the loop once the condition isn't met anymore.
  • Then write the already extracted first data row from data2.csv to data.csv, and afterwards write the rest of data2.csv to data.csv.

Result for data1.csv

Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.84005458 0 0.0093211568 0
0.94034343 0 0.0094739951 0

data2.csv

Time u v w
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0

is

Time u v w
0.24001821 0 0.009301949 0
0.6400364 0 0.009311552 0
0.74041502 0 0.0095119512 0
0.84043291 0 0.0095214359 0
0.94045075 0 0.0095309047 0
1.2404686 0 0.0095403752 0

A more general solution (multiple files) could look like:

import csv

delimiter = " "
files = ["data1.csv", "data2.csv", "data3.csv"]
stops = []
for file in files[1:]:
    with open(file, "r") as file:
        reader = csv.reader(file, delimiter=delimiter)
        header = next(reader)
        stops.append(float(next(reader)[0]))
stops.append(float("inf"))

with open("data.csv", "w") as fout:
    writer = csv.writer(fout, delimiter=delimiter)
    writer.writerow(header)
    for stop, file in zip(stops, files):
        with open(file, "r") as fin:
            next(fin)
            reader = csv.reader(fin, delimiter=delimiter)
            for row in reader:
                if stop <= float(row[0]):
                    break
                writer.writerow(row)

This would work for overlaps looking like

1. file:   |------|
2. file:       |--------|
3. file:               |------|

but not

1. file:  |--------|
2. file:         |-------|
3. file:     |--------------|

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 AlexanderK1987
Solution 3