'Filtering Pandas DataFrame based on number of non-null values
I am analyzing a consumer survey and doing the data cleaning with Pandas. I have a question where participants could answer with what frequency they are confronted with ads ('Daily': 1, 'Multiple times a week': 2, 'Once a week':3, 'Once a year': 4, 'Never': 5).
If the participant would answer that they would see ads at least weekly (1,2,3) they would get a whole new line of questions based on the product categories whether they were confronted with them and how often. The survey system would not ask all of the categories but would randomize the survey so 4 categories were asked about. The DataFrame with the answers looks like this:
| Respondent ID | Question Frequency Ads | Question Product 1 | Question Product 2 | Question Product 3 | ... | Question Product 19 | Question Product 20 |
|---|---|---|---|---|---|---|---|
| 1 | 5 | ||||||
| 2 | 4 | ||||||
| 3 | 2 | 1 | 2 | 3 | 5 | ||
| 4 | 1 | 1 | 3 | 5 | 2 | ||
| 5 | 1 | 5 | 3 | 5 | 2 | ||
| 5 | 1 | 4 | 5 | 5 | 2 |
I therefore want to filter the respondents that filled in the questionnaire as it should. The respondents that answered that they are confronted at least weekly with ads, should have at least an answer for 4 product questions. I already tried the following code:
data = data[((data['Question Frequency Ads'].isin([1,2,3])) & (data['Question Product 1'].isnull() + data['Question Product 2'].isnull() + data['Question Product 3'].isnull() + ... + data['Question Product 20'].isnull())) == (20-4)]
I realize this code will not work because of the fact that when adding these boolean expressions, you just get a True/False value and no integer which shows how many True/False values there are. Can someone help me with getting the right expression for this problem?
Solution 1:[1]
Update
IIUC:
m1 = df['Question Frequency Ads'].le(3)
m2 = df.iloc[:, 2:].notna().sum(axis=1).ge(4)
out = df[~m1 | (m1 & m2)]
~m1: ifQuestion Frequency Ads > 3, we keep the respondent because he don't have to answer to questions.m1 & m2: ifQuestion Frequency Ads <= 3, we keep the respondent only if he answered at least to 4 questions.
Old answer You can use:
out = df[df['Question Frequency Ads'].le(3) & df.iloc[:, 2:].notna().sum(axis=1).ge(4)]
Output:
| Respondent ID | Question Frequency Ads | Question Product 1 | Question Product 2 | Question Product 3 | Question Product 19 | Question Product 20 |
|---|---|---|---|---|---|---|
| 3 | 2 | 1 | 2 | 3 | 5 | |
| 4 | 1 | 1 | 3 | 5 | 2 | |
| 5 | 1 | 5 | 3 | 5 | 2 | |
| 5 | 1 | 4 | 5 | 5 | 2 |
Solution 2:[2]
You could use:
m1 = df['Question Frequency Ads'].between(1,3)
m2 = df.filter(like='Question Product').notnull().sum(1).ge(4)
df.loc[m1&m2, 'valid'] = 'valid'
output:
Respondent ID Question Frequency Ads Question Product 1 \
0 1 5 NaN
1 2 4 NaN
2 3 2 1.0
3 4 1 1.0
4 5 1 NaN
5 5 1 4.0
Question Product 2 Question Product 3 ... Question Product 19 \
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 2.0 3.0 NaN 5.0
3 NaN 3.0 NaN 5.0
4 5.0 3.0 NaN 5.0
5 5.0 NaN NaN 5.0
Question Product 20 valid
0 NaN NaN
1 NaN NaN
2 NaN valid
3 2.0 valid
4 2.0 valid
5 2.0 valid
Solution 3:[3]
IIUC, try:
- Keep all rows where Question Frequency Ads is in
[1, 2, 3]and - Where the number of questions that have a response is exactly 4.
>>> >>> data[data["Question Frequency Ads"].isin([1,2,3]) & data.filter(like="Question Product").count(1).eq(4)]
Or:
>>> data[data["Question Frequency Ads"].isin([1,2,3]) & data.drop(["Respondent ID", "Question Frequency Ads"], axis=1).count(1).eq(4)]
Solution 4:[4]
here is a different approach:
def is_int(x):
try:
int(x)
except:
return False
return True
def is_valid(line):
count = 0
for el in line[2:]:
if is_int(el):
count+=1
return count == 4
df[df[df['Question Frequency Ads'].between(1,3)].apply(is_valid, axis=1)]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | |
| Solution 4 | nicofrlo |
