Efficient way of removing leading and trailing whitespace around dashes in text strings and then splitting the text into multiple columns in pandas
Suppose the pandas dataframe contains the following:
import pandas as pd
df = pd.DataFrame({'text': ['ABC - XYZ- Some Text', 'DEF- XYZ -sometext', 'GHI -XYZ - sometext', 'JKL-XYZ- sometext', 'MNO1- XYZ- some text', 'MNO2 - XYZ - some text', 'MNO3 - XYZ-some text', 'MNO4-XYZ -some text', 'MNO5- XYZ-sometext -someother text', 'MNO6 -XYZ -sometext-someother text']})
All I want to do is remove the leading and trailing whitespace around 'only' the dashes and then split the data into new (multiple) columns of a new dataframe, so that the new dataframe looks like this:
Col1   Col2   Col3        Col4             Col5   Col6   ....
ABC    XYZ    Some Text   none             none   none
DEF    XYZ    sometext    none             none   none
GHI    XYZ    sometext    none             none   none
JKL    XYZ    sometext    none             none   none
.
.
MNO6   XYZ    sometext    someother text   none   none
Basically, the number of columns in the new dataframe depends on the highest number of dashes in any row (e.g. if it is 6 dashes then there will be 6 columns), and wherever a column has no value after the split, it should contain none.
Right now, I am trying something like this:
df1 = df['text'].str.split(' - ', n=2, expand=True)
df1.columns = ['Col_1_{}'.format(x+1) for x in df1.columns]
and then
df2 = df1['Col_1_1'].str.split('- ', n=1, expand=True)
df2.columns = ['Col_1_1_{}'.format(x+1) for x in df2.columns]
and so on, so that later I can merge all these columns and rename them.
But this does not seem efficient. Sorry, I am not a Python pro :'(
Is there an efficient way of achieving the result I want? Any suggestions would be appreciated.
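For reference, the transformation described above can be written as a single split on a regular expression that swallows the whitespace around each dash. This is only a minimal sketch, not one of the posted solutions: it assumes a dash never appears inside a value, it uses a shortened copy of the sample data, and the `Col1`, `Col2`, ... names are taken from the desired output.

```python
import pandas as pd

# Shortened copy of the sample data from the question.
df = pd.DataFrame({'text': [
    'ABC - XYZ- Some Text',
    'DEF- XYZ -sometext',
    'MNO5- XYZ-sometext -someother text',
    'MNO6 -XYZ -sometext-someother text',
]})

# r'\s*-\s*' matches a dash plus any whitespace on either side of it.
# A pattern longer than one character is treated as a regex by default.
out = df['text'].str.split(r'\s*-\s*', expand=True)

# Rename the integer column labels 0, 1, 2, ... to Col1, Col2, Col3, ...
out.columns = [f'Col{i + 1}' for i in range(out.shape[1])]
```

With `expand=True` the number of columns follows the row with the most pieces, and shorter rows are padded with missing values, so no chained splits or merges are needed.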
Solution 1:[1]
You can use a generator function to create the number of records that you need, then use ROW_NUMBER() to make sure that it produces a gap-free sequence of numbers. With that you can easily add that sequence of numbers to your start date.
In this example, $numdays needs to be a variable because the generator function only takes constants. That makes it a little trickier to use in a view (you can use session variables in a view, but then the view will only work in a session where you've created that variable), but you could definitely use this output to create a table, and run that regularly in a stored procedure.
set startdate = (select min(date(SALE_TIMESTAMP)) from fact_sales);
-- +1 so that offsets 0..numdays-1 also reach the last sale date
set numdays = (select datediff(DAY, min(date(SALE_TIMESTAMP)), max(date(SALE_TIMESTAMP))) + 1 from fact_sales);
SELECT DATEADD(DAY, c.n, $startdate) AS MY_DATE
FROM(SELECT ROW_NUMBER() OVER (ORDER BY 1) - 1 FROM TABLE(generator(rowcount=>$numdays))) c(n);
Here's a workaround that would allow you to create this in a view.
The main concession here is using an arbitrary value for the number of values that the generator can create (note the generator(rowcount=>10000)). You can set this value high enough that it's very unlikely that you'll run out of dates, and it'll still be performant.
SELECT
DATEADD(DAY, c.n, (SELECT date(min(SALE_TIMESTAMP)) FROM fact_sales)) AS MY_DATE
FROM(SELECT ROW_NUMBER() OVER (ORDER BY 1) - 1 FROM TABLE(generator(rowcount=>10000))) c(n)
WHERE MY_DATE <= (SELECT date(max(SALE_TIMESTAMP)) FROM fact_sales);
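To make the logic of the view query easier to follow outside Snowflake, here is a minimal Python sketch of the same idea: generate a fixed, generous number of offsets, add each one to the earliest date, and keep only the dates up to the latest one. The `sale_dates` list is made-up sample data standing in for `fact_sales.SALE_TIMESTAMP`, and `ROWCOUNT` mirrors the arbitrary `rowcount=>10000`.

```python
from datetime import date, timedelta

# Hypothetical sale dates standing in for fact_sales.SALE_TIMESTAMP.
sale_dates = [date(2021, 3, 4), date(2021, 3, 1), date(2021, 3, 7)]

start, end = min(sale_dates), max(sale_dates)
ROWCOUNT = 10000  # arbitrary upper bound, like generator(rowcount=>10000)

# The offsets 0, 1, 2, ... play the role of ROW_NUMBER() - 1.
# The filter corresponds to the WHERE MY_DATE <= max(...) clause.
calendar = [
    start + timedelta(days=n)
    for n in range(ROWCOUNT)
    if start + timedelta(days=n) <= end
]
```

As in the SQL version, the bound only has to be larger than the number of days between the first and last sale; the filter discards everything past the end date.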
Solution 2:[2]
You can use the following query to generate a date table, with some extra columns that you may want for data manipulation.
select dateadd(day,seq,dt::date) dat , year(dat) as "YEAR", quarter(dat) as "QUARTER OF YEAR",
month(dat) as "MONTH", day(dat) as "DAY", dayofmonth(dat) as "DAY OF MONTH",
dayofweek(dat) as "DAY OF WEEK",dayname(dat) as dayName,
dayofyear(dat) as "DAY OF YEAR"
from (
select seq4() as seq, dateadd(month, 1, '1980-01-01'::date) dt from table(generator(rowcount => 16000))
);
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Himanshu Kandpal |
