Efficient way of removing leading and trailing whitespace around dashes in text strings and then splitting the text into multiple columns in pandas

Suppose the pandas dataframe contains the following:

 import pandas as pd
 df = pd.DataFrame({'text': ['ABC - XYZ- Some Text', 'DEF- XYZ -sometext', 'GHI -XYZ - sometext', 'JKL-XYZ- sometext', 'MNO1- XYZ- some text', 'MNO2 - XYZ - some text', 'MNO3 - XYZ-some text', 'MNO4-XYZ -some text', 'MNO5- XYZ-sometext -someother text', 'MNO6 -XYZ -sometext-someother text']})

All I want to do is remove the leading and trailing whitespace around only the dashes and then split the data into multiple new columns of a new dataframe, so that the new dataframe looks like this:

 Col1    Col2    Col3         Col4              Col5    Col6 ....
 ABC     XYZ     Some Text    none              none    none
 DEF     XYZ     sometext     none              none    none
 GHI     XYZ     sometext     none              none    none
 JKL     XYZ     sometext     none              none    none
 .
 .
 MNO6    XYZ     sometext     someother text    none    none

Basically, the number of columns in the new dataframe depends on the highest number of dashes (e.g. if it is 6 dashes then there will be 6 columns), and wherever a column has no value after the split, it will hold a none value.

Now, what I am trying to do is something like this:

df1 = df['text'].str.split(' - ', n=2, expand=True)
df1.columns = ['Col_1_{}'.format(x+1) for x in df1.columns]

and then 

df2 = df1['Col_1_1'].str.split('- ', n=1, expand=True)
df2.columns = ['Col_1_1_{}'.format(x+1) for x in df2.columns]

and so on, so that later I can merge all these columns and rename them.

But this does not seem efficient; sorry, I am not a Python pro :'(

Is there an efficient way of achieving the result I want? Any suggestions would be appreciated.
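
For illustration only, here is a minimal sketch of one way to do the whole thing in a single pass. It assumes that every dash is a separator and that only the whitespace next to a dash should be dropped. The pattern `\s*-\s*` matches a dash together with any surrounding whitespace, so the split and the trimming happen in one `str.split` call. The sample frame is a shortened copy of the data above, and the `Col1`, `Col2`, ... names are just placeholders.

```python
import pandas as pd

# Shortened copy of the sample data from the question.
df = pd.DataFrame({'text': ['ABC - XYZ- Some Text',
                            'DEF- XYZ -sometext',
                            'MNO6 -XYZ -sometext-someother text']})

# Split on a dash plus any whitespace around it. expand=True returns one
# column per piece; rows with fewer pieces are padded with missing values.
parts = df['text'].str.split(r'\s*-\s*', expand=True)

# Rename the default integer column labels 0, 1, 2, ... to Col1, Col2, ...
parts.columns = ['Col{}'.format(i + 1) for i in parts.columns]
print(parts)
```

The number of columns follows the row with the most pieces, so no chain of intermediate dataframes or merges is needed.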



Solution 1:[1]

You can use the generator function to create the number of records that you need, then use ROW_NUMBER() to make sure that it's producing a sequential list of numbers. With that, you can easily add that sequence of numbers to your start date.

In this example, $numdays needs to be a variable because the generator function only takes constants. That makes it a little trickier to use in a view (you can use session variables in a view, but then the view will only work in a session where you've created that variable), but you could definitely use this output to create a table, and run that regularly in a stored procedure.

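-- Earliest sale date; used as the start of the series.
set startdate = (select min(date(SALE_TIMESTAMP)) from fact_sales);
-- Number of dates to generate; + 1 so that the latest sale date is included.
set numdays = (select datediff(DAY, min(date(SALE_TIMESTAMP)), max(date(SALE_TIMESTAMP))) + 1 from fact_sales);

-- ROW_NUMBER() - 1 yields sequential offsets 0 .. numdays - 1.
SELECT DATEADD(DAY, c.n, $startdate) AS MY_DATE
FROM (SELECT ROW_NUMBER() OVER (ORDER BY 1) - 1 FROM TABLE(generator(rowcount=>$numdays))) c(n);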

Here's a workaround that would allow you to create this in a view. The main concession here is using an arbitrary value for the number of values that the generator can create (note the generator(rowcount=>10000)). You can set this value high enough that it's very unlikely that you'll run out of dates, and it'll still be performant.

SELECT 
    DATEADD(DAY, c.n, (SELECT date(min(SALE_TIMESTAMP)) FROM fact_sales)) AS MY_DATE
FROM(SELECT ROW_NUMBER() OVER (ORDER BY 1) - 1 FROM TABLE(generator(rowcount=>10000))) c(n)
WHERE MY_DATE <= (SELECT date(max(SALE_TIMESTAMP)) FROM fact_sales);
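
The arithmetic behind these queries (a zero-based run of integers added to the earliest date, stopping at the latest date) can also be sketched outside SQL. The following is only an illustration in plain Python, not Snowflake code; the helper name `date_range_inclusive` and the two bounds are hypothetical stand-ins for the minimum and maximum of SALE_TIMESTAMP.

```python
from datetime import date, timedelta


def date_range_inclusive(start, end):
    """Return every date from start to end, both included."""
    # The day difference counts the gaps between dates, so add 1
    # to get the number of dates when both endpoints are wanted.
    num_days = (end - start).days + 1
    # Offsets 0 .. num_days - 1 play the role of ROW_NUMBER() - 1.
    return [start + timedelta(days=n) for n in range(num_days)]


# Hypothetical bounds for illustration.
dates = date_range_inclusive(date(2021, 1, 30), date(2021, 2, 2))
print(dates)
```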

Solution 2:[2]

You can use the following query to generate a date table, with some extra columns which you may want for data manipulation.

select dateadd(day,seq,dt::date) dat ,  year(dat) as "YEAR", quarter(dat) as "QUARTER OF YEAR",
       month(dat) as "MONTH", day(dat) as "DAY", dayofmonth(dat) as "DAY OF MONTH",
       dayofweek(dat) as "DAY OF WEEK",dayname(dat) as dayName,
       dayofyear(dat) as "DAY OF YEAR"
from (
select seq4() as seq,  dateadd(month, 1, '1980-01-01'::date) dt from table(generator(rowcount => 16000))
);
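
If the date table is needed on the pandas side instead, a rough counterpart can be sketched with `pd.date_range` and the `.dt` accessors. This is an assumption-laden illustration rather than a translation of the query: the start date and the row count of 5 are arbitrary, the column names simply mirror the ones above, and the day-of-week numbering in pandas (Monday = 0) may not match the numbering your database uses.

```python
import pandas as pd

# Arbitrary start date and length, chosen only for illustration.
cal = pd.DataFrame({'dat': pd.date_range(start='1980-02-01', periods=5, freq='D')})

# Derive calendar attributes from the date column.
cal['YEAR'] = cal['dat'].dt.year
cal['QUARTER OF YEAR'] = cal['dat'].dt.quarter
cal['MONTH'] = cal['dat'].dt.month
cal['DAY OF MONTH'] = cal['dat'].dt.day
cal['DAY OF WEEK'] = cal['dat'].dt.dayofweek  # pandas convention: Monday = 0
cal['dayName'] = cal['dat'].dt.day_name()
cal['DAY OF YEAR'] = cal['dat'].dt.dayofyear
print(cal)
```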

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution      Source
Solution 1
Solution 2    Himanshu Kandpal