pandas or python equivalent of tidyr complete
I have data that looks like this:
library("tidyverse")
df <- tibble(user = c(1, 1, 2, 3, 3, 3), x = c("a", "b", "a", "a", "c", "d"), y = 1)
df
# user x y
# 1 1 a 1
# 2 1 b 1
# 3 2 a 1
# 4 3 a 1
# 5 3 c 1
# 6 3 d 1
The same data in Python:
import pandas as pd
df = pd.DataFrame({'user':[1, 1, 2, 3, 3, 3], 'x':['a', 'b', 'a', 'a', 'c', 'd'], 'y':1})
I'd like to "complete" the data frame so that every user has a record for every possible x with the default y fill set to 0.
This is somewhat trivial in R (tidyverse/tidyr):
df %>%
  complete(nesting(user), x = c("a", "b", "c", "d"), fill = list(y = 0))
# user x y
# 1 1 a 1
# 2 1 b 1
# 3 1 c 0
# 4 1 d 0
# 5 2 a 1
# 6 2 b 0
# 7 2 c 0
# 8 2 d 0
# 9 3 a 1
# 10 3 b 0
# 11 3 c 1
# 12 3 d 1
Is there a complete equivalent in pandas / python that will yield the same result?
Solution 1:[1]
You can use reindex with a MultiIndex built by MultiIndex.from_product:
df = df.set_index(['user','x'])
mux = pd.MultiIndex.from_product([df.index.levels[0], df.index.levels[1]],names=['user','x'])
df = df.reindex(mux, fill_value=0).reset_index()
print(df)
user x y
0 1 a 1
1 1 b 1
2 1 c 0
3 1 d 0
4 2 a 1
5 2 b 0
6 2 c 0
7 2 d 0
8 3 a 1
9 3 b 0
10 3 c 1
11 3 d 1
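Note that df.index.levels only holds the x values that actually occur in the data, whereas the R call in the question passes the x values explicitly. If the set of valid values is known up front, it can be passed to MultiIndex.from_product directly. A minimal sketch, assuming a hypothetical extra value 'e' that no user has recorded (it is not part of the original data):

```python
import pandas as pd

df = pd.DataFrame({'user': [1, 1, 2, 3, 3, 3],
                   'x': ['a', 'b', 'a', 'a', 'c', 'd'],
                   'y': 1})

# Assumption for illustration: 'e' is a valid x value that never appears in df.
x_values = ['a', 'b', 'c', 'd', 'e']

# Build every (user, x) pair from the observed users and the explicit x values.
mux = pd.MultiIndex.from_product([df['user'].unique(), x_values],
                                 names=['user', 'x'])

# Pairs missing from df are created with y = 0.
completed = df.set_index(['user', 'x']).reindex(mux, fill_value=0).reset_index()
print(completed)
```

Each of the three users ends up with five rows, one per entry in x_values.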
Or set_index + unstack + stack:
df = df.set_index(['user','x'])['y'].unstack(fill_value=0).stack().reset_index(name='y')
print(df)
user x y
0 1 a 1
1 1 b 1
2 1 c 0
3 1 d 0
4 2 a 1
5 2 b 0
6 2 c 0
7 2 d 0
8 3 a 1
9 3 b 0
10 3 c 1
11 3 d 1
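Both variants above rely on (user, x) being unique, since reindex and unstack raise an error on a duplicated index. If that is not guaranteed, one alternative is to build the grid of combinations as a regular frame and left-merge the data onto it. This is a sketch of a different technique, not part of the original answer; the names grid and merged are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'user': [1, 1, 2, 3, 3, 3],
                   'x': ['a', 'b', 'a', 'a', 'c', 'd'],
                   'y': 1})

# Every (user, x) combination as a plain two-column frame.
grid = pd.MultiIndex.from_product(
    [df['user'].unique(), df['x'].unique()], names=['user', 'x']
).to_frame(index=False)

# A left merge keeps every grid row; pairs absent from df get NaN in y.
merged = grid.merge(df, on=['user', 'x'], how='left')

# Fill the gaps with 0 and restore the integer dtype lost to NaN.
merged['y'] = merged['y'].fillna(0).astype(int)
merged = merged.sort_values(['user', 'x']).reset_index(drop=True)
print(merged)
```

Unlike fill_value in reindex, the merge introduces NaN first, so y passes through float before being cast back to int.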
Solution 2:[2]
It's now very easy to use those dplyr/tidyr APIs in Python with datar:
>>> from datar.all import f, c, tibble, complete, nesting
>>> df = tibble(user=c(1, 1, 2, 3, 3, 3), x=c("a", "b", "a", "a", "c", "d"), y=1)
>>> df >> complete(nesting(f.user), x=c("a", "b", "c", "d"), fill={'y': 0})
user x y
<int64> <object> <float64>
0 1 a 1.0
1 1 b 1.0
2 1 c 0.0
3 1 d 0.0
4 2 a 1.0
5 2 b 0.0
6 2 c 0.0
7 2 d 0.0
8 3 a 1.0
9 3 b 0.0
10 3 c 1.0
11 3 d 1.0
I am the author of the package. Feel free to submit issues if you have any questions.
Solution 3:[3]
We could use the complete function from pyjanitor, which provides a convenient abstraction for generating the missing rows:
# pip install pyjanitor
import pandas as pd
import janitor
df.complete('user', 'x', fill_value=0)
user x y
0 1 a 1
1 1 b 1
2 1 c 0
3 1 d 0
4 2 a 1
5 2 b 0
6 2 c 0
7 2 d 0
8 3 a 1
9 3 b 0
10 3 c 1
11 3 d 1
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | jezrael |
| Solution 2 | |
| Solution 3 | |
