Convert txt file, with variable categories, to dictionary and pandas df

I've converted a txt file that has a fixed number of variables for every entry to a dict and a df. For example, if every entry in the txt file has a Date followed by Category A, Category B, and Category C, it has a set number of variables. In my current problem there are many categories, but most are empty (NaN), and I don't want to carry the null categories in the txt file. So the txt file may have one Date entry with only Category A and Category C, and another Date entry with only Category B. In the final df, each date is a row, and some columns of that row are NaN. My problem is that reading this txt file, with some categories unlisted in some entries, produces an "All arrays must be of the same length" error.

For example, I've read this txt file:

Date: Jan 1
Cat A: 5
Cat B: 8
Cat C:  2

Date: Feb 6
Cat A: 7
Cat B: 2
Cat C: 3

and so on for multiple dates, using this code:

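import re
import pandas as pd

# Read the whole file into one string
with open('text.txt', "r") as file:
    text = file.read()

# One list per field; these only line up when every entry has every category
date = re.findall('Date:(.*?)\n', text)
col_a = re.findall('Cat A:(.*?)\n', text)
col_b = re.findall('Cat B:(.*?)\n', text)
col_c = re.findall('Cat C:(.*?)\n', text)

dic = {'Date': date, 'Col A': col_a, 'Col B': col_b, 'Col C': col_c}
df = pd.DataFrame(dic)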

This works fine when every entry in the txt file contains Cat A, B, and C. In my other case I have many categories, and the text file would be unnecessarily messy if it carried all of them even when there is no value. I want to read this txt:

Date: Jan 1
Cat A: 1
Cat F: 5

Date: Jan 10
Cat F: 6

Date: Dec 5
Cat A: 8
Cat B: 5
Cat E: 7
Cat F: 1

In the final df, each row can have NaN values where the category wasn't in the entry.
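
To make the failure concrete, here is a minimal sketch of the per-column regex approach applied to the second file. The sample text is inlined as a string for illustration (an assumption, standing in for the contents of the txt file), and only three of the columns are extracted. Each `re.findall` call returns a list of a different length, so pandas cannot build a frame from them.

```python
import re
import pandas as pd

# Inlined sample, standing in for the contents of the txt file (illustrative)
text = """Date: Jan 1
Cat A: 1
Cat F: 5

Date: Jan 10
Cat F: 6

Date: Dec 5
Cat A: 8
Cat B: 5
Cat E: 7
Cat F: 1
"""

# One independent list per field, with no link back to the entry it came from
date = re.findall(r"Date:(.*?)\n", text)
col_a = re.findall(r"Cat A:(.*?)\n", text)
col_f = re.findall(r"Cat F:(.*?)\n", text)

lengths = {"Date": len(date), "Col A": len(col_a), "Col F": len(col_f)}
print(lengths)

# Lists of unequal length cannot be combined into one DataFrame
try:
    pd.DataFrame({"Date": date, "Col A": col_a, "Col F": col_f})
    error = None
except ValueError as exc:
    error = exc
    print(type(exc).__name__, exc)
```

Even when the lists happen to have equal lengths, the values are no longer tied to their dates, so rows could be silently misaligned.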

python pandas

Solution 1:^[1]

With the text file you provide, here is one way to do it:

import pandas as pd

# Import and order data
df = pd.read_table("file.txt", sep=":", header=None).pivot(columns=0, values=1)

# Deal with NA values
# (.ffill()/.bfill() replace fillna(method=...), which is deprecated in recent pandas)
df["Date"] = df["Date"].ffill()
for date in df["Date"].unique():
    df.loc[df["Date"] == date, :] = (
        df.loc[df["Date"] == date, :].ffill().bfill()
    )

# Cleanup
df = (
    df.drop_duplicates()
    .pipe(
        lambda df_: df_.reindex(
            columns=["Date"] + [col for col in df_.columns if col != "Date"]
        )
    )
    .reset_index(drop=True)
)
df.columns.name = ""

print(df)
# Output
      Date Cat A Cat B Cat E Cat F
0    Jan 1     1   NaN   NaN     5
1   Jan 10   NaN   NaN   NaN     6
2    Dec 5     8     5     7     1
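
An alternative that is not part of the answer above, offered as a minimal sketch: parse the file entry by entry into a list of dicts and let `pd.DataFrame` fill in the missing categories. The helper name `parse_records` is hypothetical, and the sketch assumes that entries are separated by blank lines and that every line has the form `key: value`. The sample is inlined through `io.StringIO` so the snippet runs on its own.

```python
import io
import pandas as pd

def parse_records(lines):
    """Group 'key: value' lines into one dict per blank-line-separated entry."""
    records, current = [], {}
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current entry, if there is one
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only, so values may contain colons
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        # The last entry may not be followed by a blank line
        records.append(current)
    return records

# Inlined sample, standing in for the txt file (illustrative)
sample = io.StringIO(
    "Date: Jan 1\nCat A: 1\nCat F: 5\n\n"
    "Date: Jan 10\nCat F: 6\n\n"
    "Date: Dec 5\nCat A: 8\nCat B: 5\nCat E: 7\nCat F: 1\n"
)

records = parse_records(sample)  # list of dicts, one per date
df = pd.DataFrame(records)       # categories missing from an entry become NaN
print(df)
```

For a real file, the same function can be given the open file object, for example `with open("file.txt") as f: records = parse_records(f)`. The values stay as strings here; converting the category columns with `pd.to_numeric` would be a separate step.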

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Laurent
