'Select columns using pandas dataframe.query()
The documentation on dataframe.query() is very terse http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html . I was also unable to find examples of projections by web search.
So I tried simply providing the column names: that gave a syntax error. Likewise for typing select and then the column names. So .. how to do this?
Solution 1:[1]
After playing around with this for a while and reading through the source code for DataFrame.query, I can't figure out a way to do it.
If it's not impossible, apparently it's at least strongly discouraged. When this question came up on github, prolific Pandas dev/maintainer jreback suggested using df.eval() for selecting columns and df.query() for filtering on rows.
UPDATE:
javadba points out that the return value of eval is not a dataframe. For example, to flesh out jreback's example a bit more...
df.eval('A')
returns a Pandas Series, but
df.eval(['A', 'B'])
does not return at DataFrame, it returns a list (of Pandas Series).
So it seems ultimately the best way to maintain flexibility to filter on rows and columns is to use iloc/loc, e.g.
df.loc[0:4, ['A', 'C']]
output
A C
0 -0.497163 -0.046484
1 1.331614 0.741711
2 1.046903 -2.511548
3 0.314644 -0.526187
4 -0.061883 -0.615978
Solution 2:[2]
Dataframe.query is more like the where clause in a SQL statement than the select part.
import pandas as pd
import numpy as np
np.random.seed(123)
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
To select a column or columns you can use the following:
df['A'] or df.loc[:,'A']
or
df[['A','B']] or df.loc[:,['A','B']]
To use the .query method you do something like
df.query('A > B') which would return all the rows where the value in column A is greater than the value in column b.
A B C D
2000-01-03 1.265936 -0.866740 -0.678886 -0.094709
2000-01-04 1.491390 -0.638902 -0.443982 -0.434351
2000-01-05 2.205930 2.186786 1.004054 0.386186
2000-01-08 -0.140069 -0.861755 -0.255619 -2.798589
Which is more readable in my opinion that boolean index selection with
df[df['A'] > df['B']]
Solution 3:[3]
How about
df_new = df.query('col1==1 & col2=="x" ')[['col1', 'col3']]
Would filter rows where col1 equals 1 and col2 equals "X" and return only columns 1 and 3.
but you would need to filter for rows otherwise it doesn't work.
for filtering columns only better use .loc or .iloc
Solution 4:[4]
pandasql
https://pypi.python.org/pypi/pandasql/0.1.0
Here is an example from the following blog http://blog.yhat.com/posts/pandasql-sql-for-pandas-dataframes.html . The inputs are two DataFrames meat and births : and this approach gives the projections, filtering, aggregation and sorting expected from sql.
@maxpower did mention this package is buggy: so let's see.. At least the code from the blog and shown below works fine.
pysqldf = lambda q: sqldf(q, globals())
q = """
SELECT
m.date
, m.beef
, b.births
FROM
meat m
LEFT JOIN
births b
ON m.date = b.date
WHERE
m.date > '1974-12-31';
"""
meat = load_meat()
births = load_births()
df = pysqldf(q)
The output is a pandas DataFrame as desired.
It is working great for my particular use case (evaluating us crimes)
odf = pysqldf("select %s from df where sweapons > 10 order by sweapons desc limit 10" %scols)
p('odf\n', odf)
odf
: SMURDER SRAPE SROBBERY SAGASSLT SOTHASLT SVANDLSM SWEAPONS
0 0 0 0 1 1 10 54
1 0 0 0 0 1 0 52
2 0 0 0 0 1 0 46
3 0 0 0 0 1 0 43
4 0 0 0 0 1 0 33
5 1 0 2 16 28 4 32
6 0 0 0 7 17 4 30
7 0 0 0 0 1 0 29
8 0 0 0 7 16 3 29
9 0 0 0 1 0 5 28
Update I have done a bunch of stuff with pandasql now: calculated fields, limits, aliases, cascaded dataframes.. it is just so productive.
Another update (3 yrs later) This works but warning it is very slow (seconds vs milliseconds) –
Solution 5:[5]
Just a simpler example solution (using get):
My goal:
I want the lat and lon columns out of the result of the query.
My table details:
df_city.columns
Index(['name', 'city_id', 'lat', 'lon', 'CountryName', 'ContinentName'], dtype='object')
# All columns
city_continent = df_city.get(df_city['ContinentName']=='Oceania')
# Only lat and lon
city_continent[['lat', 'lon']]
lat lon 113883 -19.12753 -169.84623 113884 -19.11667 -169.90000 113885 -19.10000 -169.91667 113886 -46.33333 168.85000 113887 -46.36667 168.55000 ... ... ... 347956 -23.14083 113.77630 347957 -31.48023 131.84242 347958 -28.29967 153.30142 347959 -35.60358 138.10548 347960 -35.02852 117.83416 3712 rows × 2 columns
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
| Solution 3 | Fabich |
| Solution 4 | |
| Solution 5 | Arindam Roychowdhury |
