How to read tab-separated values that contain strings which are themselves tab-separated?

I have the following tab-separated data, where the strings within some fields are themselves tab-separated. The data is in this text format:

Sl_No   Name    Designation      OrderID        ProductID   Address_Details     Market_Price
1   Rahul   KS  Systems     Engineer    557   201673    Bangalore   Karnataka   India   256
2   Prem    Bhaskar     Data    Analyst     267   275826        Mysore      Karnataka   India   671
.   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   .   
207 Pranav      Rajashekar  Data    Engineer    103   106431    Mangalore   Karnataka   India   159

There are 120 columns in total; I have taken only 7 of them to illustrate the problem.

I'm trying to get the data in this format: [screenshot of the expected output dataframe]

But if I use pandas to convert the text file into a dataframe using tab as the separator, the data is coming out like this: [screenshot of the misaligned dataframe]

Pandas is treating the tab-separated strings inside a field as values belonging to the following columns.
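For reference, a read along these lines reproduces that behaviour (a minimal sketch; the file name data.txt is an assumption):

import pandas as pd

# Every tab starts a new column, so a tab inside a field (e.g. "Systems<TAB>Engineer")
# spills into the neighbouring columns; depending on how many extra tokens a row has,
# pandas may also raise a tokenizing error instead.
df = pd.read_csv("data.txt", sep="\t")
print(df.head())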

If I know a field is going to have n strings, which are tab-separated, how do I automate the process of making sure all n strings come under the same column?

Is there any way to fix this issue without manually changing the text file's data?



Solution 1:[1]

Updated answer after discussion with OP

OP provided more sample data.

The input data is not consistent in any way regarding tabs/spaces. The easiest way to fix it is to normalize the whitespace and write a custom line parser.
The code is heavily commented to explain what is being done and why...

With the provided data, the per-row sanity check produced no hits (no length mismatches).

Note: I know there are more columns in the real source data, but you should be able to expand on the logic below to extend it to your actual source data.

The best option would of course be to get a properly formatted input file, if that is feasible.
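For comparison, if the producer of the file could quote the fields that contain tabs (standard CSV-style quoting), no custom parser would be needed; a minimal sketch, where input_quoted.tsv is a hypothetical, correctly quoted version of the source file:

import pandas as pd

# With embedded separators protected by quotes, e.g.
#   1<TAB>"Rahul KS"<TAB>"Systems Engineer"<TAB>557<TAB>201673<TAB>"Bangalore Karnataka India"<TAB>256
# pandas' default quoting (quotechar='"') keeps each quoted field in a single column.
df = pd.read_csv("input_quoted.tsv", sep="\t")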

import csv


with open("output.txt") as infile:
    data = infile.readlines()


def parse_data(data):
    header = [element.strip() for element in data.pop(0).split("\t") if element != ""]
    data_output = [header]
    for row in data:
        # some cleaning of the lines. 
        # replace tabs with spaces. 
        # only add to row_current if the element is not empty
        row_current = [element.strip() for element in row.replace("\n", "").replace("\t", " ").split(" ") if element != ""]
        # There's some constants in the data. 
        # First element is always Sl_No
        # Second element is always last name
        # Third element is always first name (or initials)
        # Let's put all of that into the first 2 elements of the output row, joining last and first name into 1 element
        row_output = [row_current[0], " ".join(row_current[1:3])]
        
        # The next two elements appear to be the Designation of the person in question. It's always followed by a numeric element (OrderID) though
        # So it's safer to add everything from element 3 till the first numeric element
        # use enumerate to get the index
        for indexOrderID, element in enumerate(row_current[3:]):
            if element.isnumeric():
                # break if isnumeric so we have the correct indexOrderID
                # we need to add 3 to offset the starting enumeration
                indexOrderID += 3
                break
        # join the elements between index 3 and indexOrderID into the Designation column
        row_output.append(" ".join(row_current[3:indexOrderID]))
        
        # next two elements are always the OrderID and ProductID, so we can add them as is
        row_output.append(row_current[indexOrderID])
        row_output.append(row_current[indexOrderID+1])
        
        # What's left now is the Address_Details and the Market_Price
        # We could do the same as with the OrderID but we can just join the rest excluding the last element into Address_Details
        # and then add the last element as Market_Price
        row_output.append(" ".join(row_current[indexOrderID+2:-1]))
        row_output.append(row_current[-1])
        
        # do sanity check to see if the length of row_output matches header
        if len(row_output) != len(header):
            print(f"Length mismatch on this row:\n  {row_output}")
        else:
            data_output.append(row_output)
    # finally, return the resulting data_output
    return data_output


data_output = parse_data(data)


with open("outputdata.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerows(data_output)
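Since the original goal was a pandas dataframe, the generated CSV can then be loaded directly (a minimal sketch, assuming pandas is installed):

import pandas as pd

# The custom parser has already merged the tab-split strings into single fields,
# so a plain comma-separated read now lines up with the header.
df = pd.read_csv("outputdata.csv")
print(df.head())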

Old Answer

Below is a partial answer based on the provided input data. The trick in this case is that the elements in a row that are actually part of a single column end with a space.

There is one false positive in the provided data, though: the last line. Its second-to-last element, "India ", ends with a trailing space. This is the only inconsistency in the provided sample data.
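To make that convention concrete, here is a tiny illustration with hypothetical tokens for one row (the trailing spaces are the signal the parser keys on):

# Hypothetical tokens as they might come out of row.split("\t") for one sample row.
# Tokens ending with a space are buffered and merged with the token(s) that follow;
# tokens without a trailing space close the current column.
tokens = ["1", "Rahul ", "KS", "Systems ", "Engineer", "557", "201673",
          "Bangalore ", "Karnataka ", "India", "256"]
# After merging: ["1", "Rahul KS", "Systems Engineer", "557", "201673",
#                 "Bangalore Karnataka India", "256"]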

Try running the code below on the actual real data and see how far that gets you.

Some notes:

First, note that I copied/pasted the data from your question into a new file called inputdata.tsv. Replace that part with your actual input data, of course...

Second, note that if, after processing a row, its length doesn't match the length of the header, the script will print out the offending line.

Third, there's some data cleaning involved in both the header generation and the rows themselves.

Lastly, it will save the resulting data into a proper CSV file using the csv standard module...

import csv


with open("inputdata.tsv") as infile:
    data = infile.readlines()


header = [element.strip() for element in data.pop(0).split("\t") if element != ""]
data_output = [header]


for row in data:
    row_current = row.replace("\n", "").split("\t")
    row_output = []
    column_buffer = ""
    for column in row_current:
        if len(column) > 0 and column[-1] == " ":
            column_buffer = f'{column_buffer}{column}'
        elif len(column) > 0:
            row_output.append(f'{column_buffer}{column}'.strip())
            column_buffer = ""
    data_output.append(row_output)
    if len(row_output) != len(header):
        print(f"Length mismatch on this row:\n  {row_output}")


with open("outputdata.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerows(data_output)

The resulting output file is as below. Please note the last line though, as that is the one with an incorrect result...

Sl_No,Name,Designation,OrderID,ProductID,Address_Details,Market_Price
1,Rahul KS,Systems Engineer,557,201673,Bangalore Karnataka India,256
2,Prem Bhaskar,Data Analyst,267,275826,Mysore Karnataka India,671
207,Pranav Rajashekar,Data Engineer,103,106431,Mangalore Karnataka India 159

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1