How to reference elements in a multi-dict within a list comprehension statement

I could not find an answer to this when googling through the archives (feel like this question should have been asked before).

I am running a hugging face pipeline on top of a pandas dataframe. The structure of my dataframe is simply two columns:

index  text
0      Here is some text
1      Here is another text

For each value in text, the pipeline (we'll call it model_func) runs on that value and returns a multidict, e.g. model_func(df.text.values[0]) returns:

{'var1': [1,2,3], 'var2': [4,5,6], 'var3': [7,8,9]}

I want to run this function for all values in df.text and assign the outputted var1,var2 key/values in the dictionary as columns in the original dataframe e.g.

index  text                  var1     var2
0      Here is some text     [1,2,3]  [4,5,6]
1      Here is another text  [7,8,9]  [10,11,12]

My current (non-working) list comprehension statement attempting to do the above looks like this:

df[['var1','var2']] = [model_func(x)['var1','var2'] for x in (df['text'])]

Essentially, I want to:

  1. Access the first two keys (and their respective values) in the multidict returned from model_func.
  2. Assign these values to each obs(row) as columns in the original dataframe.

(I'll then use the explode function to expand the respective lists into a long data format so as not to have lists within a pandas series).

I realize this is a bit messy. I would think there has to be a more efficient method for this so am all ears on that end. For now though, the main question I have is quite basic:

How do I reference those two multi-dict keys within a list comprehension? Assigning a single key/value pair works, e.g. df['var1'] = [model_func(x)['var1'] for x in (df['text'])], but assigning multiple key/value pairs, as attempted in the first list comprehension above, does not.



Solution 1:[1]

You can use the DataFrame.apply method for this, which applies a function to each column (or row) of a DataFrame. In your case, this could be a custom lambda function. The example here uses character counts as a stand-in for model_func:

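# result_type="expand" turns each returned tuple into one column per element;
# without it, the tuples are not split per row and the values end up transposed.
df[["var1","var2"]] = df.apply(lambda row: 
    (
        [row["text"].count("a"), row["text"].count("e"), row["text"].count("i")],
        [row["text"].count("h"), row["text"].count("i"), row["text"].count("t")]
    )
, axis=1, result_type="expand")

which results in

                   text       var1       var2
0     Here is some text  [0, 4, 1]  [0, 1, 2]
1  Here is another text  [1, 4, 1]  [1, 1, 3]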

Some clarification:

  • axis=1 applies the function to each row, so the lambda receives one row at a time.
  • You can assign multiple columns like this, but make sure the returned data has the appropriate dimensions. Here, the lambda function returns a tuple holding two lists: the first for var1, the second for var2.
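
If you would rather stay with a list comprehension, as in the question, here is a minimal sketch. The model_func below is a hypothetical stand-in that only mimics the dict-of-lists shape described in the question, and the names keys, outputs, rows and long_df are mine. It calls the model once per row, keeps the two wanted keys with a dict comprehension, joins the result back onto df, and then explodes both list columns together (passing a list of columns to explode assumes pandas 1.3 or newer).

```python
import pandas as pd


# Hypothetical stand-in for the real pipeline: it only mimics the
# dict-of-lists shape described in the question.
def model_func(text):
    n = len(text)
    return {"var1": [n, n + 1], "var2": [n * 2, n * 2 + 1], "var3": [0, 0]}


df = pd.DataFrame({"text": ["Here is some text", "Here is another text"]})

# Keys to keep from each returned dict.
keys = ["var1", "var2"]

# Call the model once per row, then keep only the wanted keys.
outputs = [model_func(x) for x in df["text"]]
rows = [{k: out[k] for k in keys} for out in outputs]

# One dict per row becomes one DataFrame row; reusing df's index
# lets join line the new columns up with the original rows.
df = df.join(pd.DataFrame(rows, index=df.index))

# Expand both list columns in lockstep into a long format.
# The lists in each row must have matching lengths.
long_df = df.explode(keys, ignore_index=True)
print(long_df)
```

Selecting keys by name rather than by position avoids relying on the order in which the pipeline returns them.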

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 rammelmueller