Extract specific text from a long complex text string in pandas dataframe using Python
Here is how my dataframe looks:
ID|string_column|column3
A101|"[{Lorem Ipsum is simply {dummy} text, of the printing {} and typesetting industry.}, "london", 0.5808755159378052, "london", 0.2203546166419983, "uk", 0.6141567826271057, "europe", 0.9081151485443115, "europe", 0.9140098094940186]"|"3022"
A102|"[{Lorem Ipsum is simply {dummy} }, "delhi", 0.59378052, "UP", 0.9983, "india", 0.14, "sub-continent", 0.9081151485443115, "asia", 0.82]"|"3028"
I would like to extract the five label and confidence pairs ("string1", confidence1, "string2", confidence2, "string3", confidence3, "string4", confidence4, "string5", confidence5) from string_column and put each value in its own new dataframe column.
The new dataframe would look like this:
ID|string_column|column3|string1|confidence1|string2|confidence2|string3|confidence3|string4|confidence4|string5|confidence5
A101|"[{Lorem Ipsum is simply {dummy} text, of the printing {} and typesetting industry.}, "london", 0.5808755159378052, "london", 0.2203546166419983, "uk", 0.6141567826271057, "europe", 0.9081151485443115, "europe", 0.9140098094940186]"|"3022"|"london"|0.5808755159378052|"london"|0.2203546166419983|"uk"|0.6141567826271057|"europe"|0.9081151485443115|"europe"|0.9140098094940186
How would I do that?
I tried converting it into a list by splitting on the "," delimiter, but the text is messy and can contain several commas inside the {}, so that fails.
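One way to sidestep the comma problem might be to skip splitting altogether and instead search for the `"label", number` pattern directly with a regular expression. The sketch below is only an illustration under assumptions: the names `PAIR_RE`, `extract_pairs`, and `n_pairs` are made up for this example, each pair is assumed to be a double-quoted label followed by a plain decimal number, and the wanted pairs are assumed to be the last five in the string.

```python
import re

# Assumed pair format: a double-quoted label, a comma, then a decimal number.
PAIR_RE = re.compile(r'"([^"]*)"\s*,\s*(\d+(?:\.\d+)?)')

def extract_pairs(text, n_pairs=5):
    """Return a flat dict (string1, confidence1, ...) built from the trailing pairs."""
    # Keep only the last n_pairs matches, in case the free text in {} happens
    # to contain something that looks like a pair.
    pairs = PAIR_RE.findall(text)[-n_pairs:]
    row = {}
    for i, (label, conf) in enumerate(pairs, start=1):
        row[f"string{i}"] = label
        row[f"confidence{i}"] = float(conf)
    return row

# Second sample row from the question (ID A102).
sample = (
    '[{Lorem Ipsum is simply {dummy} }, "delhi", 0.59378052, "UP", 0.9983, '
    '"india", 0.14, "sub-continent", 0.9081151485443115, "asia", 0.82]'
)
result = extract_pairs(sample)
```

To get the new columns in pandas, the function could be applied per row, for example by building `pd.DataFrame(df["string_column"].apply(extract_pairs).tolist(), index=df.index)` and attaching it to the original with `df.join(...)`. If the messy text can itself contain quoted words followed by numbers, or if the confidences can appear in scientific notation, the pattern would need adjusting.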
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
