How to reference elements in a multi-dict within a list comprehension statement
I could not find an answer to this when googling through the archives (feel like this question should have been asked before).
I am running a hugging face pipeline on top of a pandas dataframe. The structure of my dataframe is simply two columns:
| index | text |
|---|---|
| 0 | Here is some text |
| 1 | Here is another text |
The pipeline (we'll call it model_func) runs on each value in text and returns a multidict, e.g. model_func(df.text.values[0]) returns...
{'var1': [1,2,3], 'var2': [4,5,6], 'var3': [7,8,9]}
I want to run this function for all values in df.text and assign the outputted var1,var2 key/values in the dictionary as columns in the original dataframe e.g.
| index | text | var1 | var2 |
|---|---|---|---|
| 0 | Here is some text | [1,2,3] | [4,5,6] |
| 1 | Here is another text | [7,8,9] | [10,11,12] |
My current (non-working) list comprehension statement attempting to do the above looks like this:
df[['var1','var2']] = [model_func(x)['var1','var2'] for x in (df['text'])]
Essentially, I want to:
- Access the first two keys (and their respective values) in the multidict returned from model_func.
- Assign these values to each observation (row) as columns in the original dataframe.
(I'll then use the explode function to expand the respective lists into a long data format so as not to have lists within a pandas series).
I realize this is a bit messy. I would think there has to be a more efficient method for this so am all ears on that end. For now though, the main question I have is quite basic:
How do I reference those two multi-dict keys within a list comprehension? Assigning a single key/value pair works, e.g. df['var1'] = [model_func(x)['var1'] for x in (df['text'])], but assigning multiple key/value pairs, as attempted in the first list comprehension code block above, does not.
Solution 1:[1]
You can use the DataFrame.apply method for this, which simply maps a function to a column (or row) of a DataFrame. In your case, this could be a custom lambda function, for instance:
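# result_type="expand" spreads the returned tuple over two columns; without it,
# apply returns a single Series of tuples that does not map onto var1/var2 row by row.
df[["var1", "var2"]] = df.apply(
    lambda row: (
        [row["text"].count("a"), row["text"].count("e"), row["text"].count("i")],
        [row["text"].count("h"), row["text"].count("i"), row["text"].count("t")],
    ),
    axis=1,
    result_type="expand",
)
which results in
text var1 var2
0 Here is some text [0, 4, 1] [0, 1, 2]
1 Here is another text [1, 4, 1] [1, 1, 3]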
Some clarification:
- axis=1 specifies that the function is applied to each row (not to each column).
- You can assign multiple columns like this, but make sure that the returned data has the appropriate dimensions. Here, the lambda function returns a tuple that holds two lists: the first list for var1, the second for var2.
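If you would rather stay with list comprehensions, as in the question, one possible approach is to call the model once per row, keep the returned dicts, and then pull out each key separately. The sketch below is only an illustration: model_func here is a made-up stand-in for the real pipeline (it just derives numbers from the text length), and the final explode step assumes pandas 1.3 or newer and lists of equal length within each row.

```python
import pandas as pd


def model_func(text):
    """Hypothetical stand-in for the real pipeline: returns a dict of lists."""
    n = len(text)
    return {
        "var1": [n, n + 1, n + 2],
        "var2": [n * 2, n * 2 + 1, n * 2 + 2],
        "var3": [0, 0, 0],  # returned by the model but not wanted as a column
    }


df = pd.DataFrame({"text": ["Here is some text", "Here is another text"]})

# Run the (possibly expensive) model exactly once per row and keep the dicts.
results = [model_func(x) for x in df["text"]]

# One list comprehension per wanted key; each cell ends up holding a list.
keys = ["var1", "var2"]
for key in keys:
    df[key] = [result[key] for result in results]

# Optional follow-up from the question: one row per list element.
# Exploding several columns at once assumes pandas >= 1.3.
long_df = df.explode(keys)
```

Storing the results first avoids calling the model twice per row, which is what two independent single-key comprehensions would do.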
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | rammelmueller |
