Convert txt file, with variable categories, to dictionary and pandas df
I've converted a txt file in which every entry has a fixed number of variables to a dict and a df. For example, if every entry in the txt file has a Date followed by Category A, Category B, and Category C, it has a set number of variables. In my current case there are many categories, but most are empty (NaN), and I don't want to carry the null category entries in the txt file. The file may then have one Date entry with only Category A and Category C, and another Date entry with only Category B. In the final df, each date is a row and some columns of that row are NaN. My problem is that reading a txt file where some entries leave categories unlisted produces an "All arrays must be of the same length" error.
For example, I've read this txt file:
Date: Jan 1
Cat A: 5
Cat B: 8
Cat C: 2
Date: Feb 6
Cat A: 7
Cat B: 2
Cat C: 3
etc...for multiple dates.
import re
import pandas as pd
with open("text.txt", "r") as file:
    text = file.read()
# Each pattern returns one list item per matching line in the file
date = re.findall(r"Date:(.*?)\n", text)
col_a = re.findall(r"Cat A:(.*?)\n", text)
col_b = re.findall(r"Cat B:(.*?)\n", text)
col_c = re.findall(r"Cat C:(.*?)\n", text)
dic = {'Date': date, 'Col A': col_a, 'Col B': col_b, 'Col C': col_c}
df = pd.DataFrame(dic)
This works fine when every entry in the txt file contains Cat A, B, and C. In another case I have many categories, and listing all of them for every date, even when there is no value, would make the text file unnecessarily messy. I want to read this txt:
Date: Jan 1
Cat A: 1
Cat F: 5
Date: Jan 10
Cat F: 6
Date: Dec 5
Cat A: 8
Cat B: 5
Cat E: 7
Cat F: 1
In the final df, each row can have NaN values where the category wasn't in the entry.
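The error most likely comes from the per-category lists ending up with different lengths once some entries skip a category. A minimal sketch of that mismatch, assuming the sparse sample above is held in a string (the variable name `text` is only for illustration):

```python
import re

# Sparse sample copied from the question.
text = """Date: Jan 1
Cat A: 1
Cat F: 5
Date: Jan 10
Cat F: 6
Date: Dec 5
Cat A: 8
Cat B: 5
Cat E: 7
Cat F: 1
"""

date = re.findall(r"Date:(.*?)\n", text)
col_a = re.findall(r"Cat A:(.*?)\n", text)
col_f = re.findall(r"Cat F:(.*?)\n", text)

# Three dates but only two "Cat A" values: the lists no longer line up,
# and nothing records which date each "Cat A" value belongs to.
print(len(date), len(col_a), len(col_f))
```

Building a df from a dict of lists needs every list to have the same length, so the values have to be grouped per date before the df is built.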
Solution 1:[1]
With the text file you provided, here is one way to do it:
import pandas as pd
# Import and order data: one column per key ("Date", "Cat A", ...)
df = pd.read_table("file.txt", sep=":", header=None).pivot(columns=0, values=1)
# Deal with NA values: forward-fill the date so each row knows its entry
df["Date"] = df["Date"].ffill()
# Within each date, spread the known values across all rows of that date
for date in df["Date"].unique():
    df.loc[df["Date"] == date, :] = (
        df.loc[df["Date"] == date, :].ffill().bfill()
    )
# Cleanup: keep one row per date and move "Date" to the first column
df = (
    df.drop_duplicates()
    .pipe(
        lambda df_: df_.reindex(
            columns=["Date"] + [col for col in df_.columns if col != "Date"]
        )
    )
    .reset_index(drop=True)
)
df.columns.name = ""
print(df)
# Output
     Date Cat A Cat B Cat E Cat F
0   Jan 1     1   NaN   NaN     5
1  Jan 10   NaN   NaN   NaN     6
2   Dec 5     8     5     7     1
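As an alternative that also yields the dictionaries asked for in the title, the file can be read line by line into one dict per date and passed to `pd.DataFrame`, which fills missing keys with NaN. This is a sketch rather than part of the answer above: it assumes every entry starts with a `Date:` line, and `io.StringIO` stands in for the real file (the names `sample`, `records`, and `cats` are illustrative).

```python
import io

import pandas as pd

# Sample text copied from the question; replace with open("text.txt") for a real file.
sample = io.StringIO(
    "Date: Jan 1\n"
    "Cat A: 1\n"
    "Cat F: 5\n"
    "Date: Jan 10\n"
    "Cat F: 6\n"
    "Date: Dec 5\n"
    "Cat A: 8\n"
    "Cat B: 5\n"
    "Cat E: 7\n"
    "Cat F: 1\n"
)

records = []  # one dict per entry
for line in sample:
    if ":" not in line:
        continue  # skip blank or malformed lines
    key, value = (part.strip() for part in line.split(":", 1))
    if key == "Date":
        records.append({})  # assumption: a "Date" line starts a new entry
    if records:
        records[-1][key] = value

# Keys missing from an entry become NaN in the corresponding row.
df = pd.DataFrame(records)

# Put "Date" first, sort the category columns, and convert them to numbers.
cats = sorted(col for col in df.columns if col != "Date")
df = df[["Date"] + cats]
df[cats] = df[cats].apply(pd.to_numeric)
print(df)
```

Because each entry is collected in its own dict, categories never have to be listed in the txt file when they are empty, and new categories are picked up without changing the code.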
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Laurent |
