How to read tab-separated values that contain strings which are themselves tab-separated?
I have the following tab-separated data, where some fields contain strings that are themselves tab-separated. The data is in this text format:
Sl_No Name Designation OrderID ProductID Address_Details Market_Price
1 Rahul KS Systems Engineer 557 201673 Bangalore Karnataka India 256
2 Prem Bhaskar Data Analyst 267 275826 Mysore Karnataka India 671
...
207 Pranav Rajashekar Data Engineer 103 106431 Mangalore Karnataka India 159
There are a total of 120 columns; I have taken only 7 columns to illustrate this example.
I'm trying to get the data in this format:
Sl_No  Name           Designation       OrderID  ProductID  Address_Details            Market_Price
1      Rahul KS       Systems Engineer  557      201673     Bangalore Karnataka India  256
2      Prem Bhaskar   Data Analyst      267      275826     Mysore Karnataka India     671
But if I use pandas to convert the text file into a dataframe using tab as the separator, the data comes out misaligned.

It treats the tab-separated strings within a field as data belonging to the next column.
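For context, the naive pandas read that produces this misalignment would look roughly like the sketch below (the file name output.txt is an assumption; substitute your own):

import pandas as pd

# Naive read: every tab starts a new column, so the extra tabs inside
# Designation and Address_Details push those pieces into the wrong columns.
df = pd.read_csv("output.txt", sep="\t")
print(df.head())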
If I know a field is going to have n tab-separated strings, how do I automate the process so that all n strings come under the same column?
Is there any way to fix this issue without manually changing the text file's data?
Solution 1:[1]
Updated answer after discussion with OP
OP provided more sample data.
The input data is not consistent in any way regarding tabs/spaces. The easiest way to fix it is to normalize the data and write a custom line parser.
The code is heavily commented to explain what is being done and why...
With the provided data this generated no hits on the sanity check for each row.
Note, I know there are more columns in the real source data, but you should be able to expand on the logic below to extend it to your actual source data.
The best option would of course be to get a properly formatted input file, if that is feasible.
import csv

with open("output.txt") as infile:
    data = infile.readlines()

def parse_data(data):
    header = [element.strip() for element in data.pop(0).split("\t") if element != ""]
    data_output = [header]
    for row in data:
        # Some cleaning of the lines:
        # replace tabs with spaces,
        # only add to row_current if the element is not empty.
        row_current = [element.strip() for element in row.replace("\n", "").replace("\t", " ").split(" ") if element != ""]
        # There are some constants in the data.
        # First element is always Sl_No.
        # Second element is always the last name.
        # Third element is always the first name (or initials).
        # Let's put all of that into the first 2 elements of the output row,
        # joining last and first name into 1 element.
        row_output = [row_current[0], " ".join(row_current[1:3])]
        # The next two elements appear to be the Designation of the person in question.
        # It's always followed by a numeric element (OrderID) though,
        # so it's safer to add everything from element 3 till the first numeric element.
        # Use enumerate to get the index.
        for indexOrderID, element in enumerate(row_current[3:]):
            if element.isnumeric():
                # Break if isnumeric so we have the correct indexOrderID;
                # we need to add 3 to offset the starting point of the enumeration.
                indexOrderID += 3
                break
        # Join the elements between index 3 and indexOrderID into the Designation column.
        row_output.append(" ".join(row_current[3:indexOrderID]))
        # The next two elements are always the OrderID and ProductID, so we can add them as-is.
        row_output.append(row_current[indexOrderID])
        row_output.append(row_current[indexOrderID + 1])
        # What's left now is the Address_Details and the Market_Price.
        # We could do the same as with the OrderID, but we can just join the rest
        # excluding the last element into Address_Details,
        # and then add the last element as Market_Price.
        row_output.append(" ".join(row_current[indexOrderID + 2:-1]))
        row_output.append(row_current[-1])
        # Do a sanity check to see if the length of row_output matches the header.
        if len(row_output) != len(header):
            print(f"Length mismatch on this row:\n {row_output}")
        else:
            data_output.append(row_output)
    # Finally, return the resulting data_output.
    return data_output

data_output = parse_data(data)

with open("outputdata.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerows(data_output)
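If the end goal is a pandas DataFrame (as in the question), the cleaned CSV can simply be read back; a minimal sketch, assuming the outputdata.csv file produced above:

import pandas as pd

# Read the normalized CSV produced by parse_data(); each logical column
# (Name, Designation, Address_Details, ...) now stays in a single column.
df = pd.read_csv("outputdata.csv")
print(df.head())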
Old Answer
Below is a partial answer based on the provided input data. The trick in this case is that the elements in a row that are actually part of a single column end with a space.
There is one false positive in the provided data, though: the last line. Its second-to-last element, "India ", ends with a space. That is the only inconsistency in the provided sample data.
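To illustrate the trick, here is a hypothetical raw row (constructed to match the behaviour described above, not copied from the real file): fragments that belong to one logical column end with a trailing space, while complete values do not.

# Hypothetical raw line, split on tabs:
row = "1\tRahul KS\tSystems \tEngineer\t557\t201673\tBangalore \tKarnataka \tIndia\t256\n"
print(row.replace("\n", "").split("\t"))
# ['1', 'Rahul KS', 'Systems ', 'Engineer', '557', '201673', 'Bangalore ', 'Karnataka ', 'India', '256']
# "Systems ", "Bangalore " and "Karnataka " end with a space, so the parser
# below buffers them and glues them onto the next element.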
Try running the code below on the actual real data and see how far that gets you.
Some notes:
First, note that I copied/pasted the data from your question into a new file called inputdata.tsv. Replace that part with your actual input data, of course...
Second, note that if, after processing a row, the length of that row doesn't match the length of the header, the script will print out the offending line.
Third, there's some data cleaning involved in both the header generation and the rows themselves.
Lastly, it will save the resulting data into a proper CSV file using the csv standard module...
import csv

with open("inputdata.tsv") as infile:
    data = infile.readlines()

# Clean the header: split on tabs, strip whitespace, drop empty elements.
header = [element.strip() for element in data.pop(0).split("\t") if element != ""]
data_output = [header]

for row in data:
    row_current = row.replace("\n", "").split("\t")
    row_output = []
    column_buffer = ""
    for column in row_current:
        # Elements ending with a space are fragments of a larger column:
        # buffer them until an element without a trailing space closes the column.
        if len(column) > 0 and column[-1] == " ":
            column_buffer = f'{column_buffer}{column}'
        elif len(column) > 0:
            row_output.append(f'{column_buffer}{column}'.strip())
            column_buffer = ""
    data_output.append(row_output)
    # Print any row whose length doesn't match the header.
    if len(row_output) != len(header):
        print(f"Length mismatch on this row:\n {row_output}")

with open("outputdata.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerows(data_output)
The resulting output file is as below. Please note the last line though, as that is the one with an incorrect result...
Sl_No,Name,Designation,OrderID,ProductID,Address_Details,Market_Price
1,Rahul KS,Systems Engineer,557,201673,Bangalore Karnataka India,256
2,Prem Bhaskar,Data Analyst,267,275826,Mysore Karnataka India,671
207,Pranav Rajashekar,Data Engineer,103,106431,Mangalore Karnataka India 159
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
