Category "pandas"

Pandas dataframe in pyspark to hive

How to send a pandas dataframe to a hive table? I know if I have a spark dataframe, I can register it to a temporary table using df.registerTempTable("table_

pandas data mining from Eurostat

I'm starting work on analysing data from statistical institutions like Eurostat using Python, and therefore pandas. I found out there are two methods to get data from Eurost

Pandas get topmost n records within each group

Suppose I have pandas DataFrame like this: df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]}) which looks like: id value 0 1
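One common approach to this kind of question is `groupby(...).head(n)`, which keeps the first n rows of each group in their original order. A minimal sketch using the DataFrame from the excerpt; `n = 2` is an assumed value, and if "topmost" means largest rather than first, the frame would need to be sorted by `value` beforehand:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

n = 2  # assumed number of records to keep per group

# head(n) on a groupby keeps the first n rows of every group
# and preserves the original row order and index.
top_n = df.groupby('id').head(n)
print(top_n)
```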

How to determine whether a column/variable is numeric or not in Pandas/NumPy?

Is there a better way to determine whether a variable in Pandas and/or NumPy is numeric or not? I have a self-defined dictionary with dtypes as keys and nume
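pandas ships a helper for this check, `pandas.api.types.is_numeric_dtype`, which may remove the need for a hand-maintained dtype dictionary. A minimal sketch; the column names and values below are made up for illustration:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical sample data: two numeric columns and one string column.
df = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5], 'c': ['x', 'y']})

# Map each column name to True/False depending on its dtype.
numeric_flags = {col: is_numeric_dtype(df[col]) for col in df.columns}
print(numeric_flags)
```

An alternative under the same assumptions is `df.select_dtypes(include='number')`, which returns only the numeric columns.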

Get DataFrame with the number of rows for each time interval

Given the following pandas DataFrame in Python: | ID | date | |--------------|------------------------------------

Splitting dataframe into multiple dataframes

I have a very large dataframe (around 1 million rows) with data from an experiment (60 respondents). I would like to split the dataframe into 60 dataframes (a d
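Splitting by a grouping key is often done by iterating over `groupby`. A minimal sketch; the excerpt is cut off before naming the column, so `respondent` and the sample values are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for the experiment data.
df = pd.DataFrame({'respondent': ['r1', 'r1', 'r2', 'r3', 'r3'],
                   'score': [10, 12, 7, 3, 5]})

# Iterating over a groupby yields (key, sub-DataFrame) pairs,
# so a dict comprehension gives one DataFrame per respondent.
frames = {key: group for key, group in df.groupby('respondent')}

print(frames['r1'])
```

Keeping the pieces in a dict avoids creating 60 separate variables; each sub-DataFrame can be looked up by its key.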

How to use a df column in a vertica_python SQL query?

I have a dataframe with names that I set to a dictionary, like this: {1: "Bob", 41: "John", 126: "Jim", 167: "Pete"} I am using Vertica. I want to be able to p

df.isna().sum() is not working on titanic dataset

I tried the Titanic model on Kaggle, and strangely isna().sum() outputs wrong information. import os import pandas as pd import numpy as np import statsmode

Not able to see all the methods under dt accessor in Jupyter notebook

Maybe a silly question. I have been trying to use the dt accessor in pandas to use datetime methods on certain date fields in my DataFrame. Not sure why, but the a

How to name the column when using the value_counts function in pandas?

I was counting the number of occurrences of angle and dist with the code below: g = new_df.value_counts(subset=['Current_Angle','Current_dist'] ,sort = False) the out
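`value_counts` returns a Series, so one way to name the count column is `reset_index(name=...)`. A minimal sketch; the sample values and the column name `n_occurrences` are assumptions for illustration:

```python
import pandas as pd

# Hypothetical sample data using the column names from the question.
new_df = pd.DataFrame({'Current_Angle': [0, 0, 90],
                       'Current_dist': [1, 1, 2]})

g = new_df.value_counts(subset=['Current_Angle', 'Current_dist'], sort=False)

# reset_index turns the index levels back into columns and
# gives the counts column the requested name.
counts = g.reset_index(name='n_occurrences')
print(counts)
```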

How to select values from pandas dataframe by column value

I am doing an analysis of a dataset with 6 classes, zero based. The dataset is many thousands of items long. I need two dataframes with classes 0 & 1 fo

How to add tags when uploading to S3 from pandas?

Pandas lets you pass an AWS S3 path directly to .to_csv() and .to_parquet(). There's a storage_options argument for passing S3 specific arguments. I would like

Removing [' and '] from CSV

I have several GB of CSV files where values in one of the columns look like this: Which is a consequence of this: urls.append(re.findall(r'http\S+', hashtags_r

Python Pandas Error tokenizing data

I'm trying to use pandas to manipulate a .csv file but I get this error: pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 fields in li

string split with expand=True. Can anyone explain what is the meaning?

all_data['Title']= all_data['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0] Can anyone explain what is the meaning of this line of code?
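The line can be unpacked step by step: `expand=True` makes `str.split` return a DataFrame with one column per piece, so `[1]` and `[0]` select a piece by column position. A minimal sketch with made-up names in the "Surname, Title. Given names" pattern the code appears to assume:

```python
import pandas as pd

# Hypothetical names; the real data comes from the asker's dataset.
all_data = pd.DataFrame({'Name': ['Smith, Mr. John', 'Doe, Mrs. Jane']})

# Step 1: split on ', ' into columns 0 (surname) and 1 (the rest).
after_comma = all_data['Name'].str.split(', ', expand=True)[1]

# Step 2: split the rest on '.' and keep column 0, the title.
# A one-character pattern is treated as a literal, not a regex.
all_data['Title'] = after_comma.str.split('.', expand=True)[0]

print(all_data)
```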

Converting pandas.DataFrame to bytes

I need to convert the data stored in a pandas.DataFrame into a byte string where each column can have a separate data type (integer or floating point). Here is a

How to check that None is not passed as an argument where a pandas DataFrame is expected

I have a function which looks like below. def some_func(df:pd.Dataframe=pd.Dataframe()): if not df or df.empty: //some dataframe operations I want to ens
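Note that `not df` raises a ValueError for a DataFrame because its truth value is ambiguous, so an explicit `is None` test is usually safer. A minimal sketch; the helper name `is_usable_frame` is hypothetical and this is only one way to guard the argument:

```python
from typing import Optional

import pandas as pd


def is_usable_frame(df: Optional[pd.DataFrame] = None) -> bool:
    """Return True only for a non-empty DataFrame."""
    # Compare against None explicitly; `not df` would raise ValueError.
    if df is None:
        return False
    # Reject anything that is not a DataFrame at all.
    if not isinstance(df, pd.DataFrame):
        raise TypeError(f"expected a DataFrame, got {type(df).__name__}")
    return not df.empty


print(is_usable_frame(None))
print(is_usable_frame(pd.DataFrame({'a': [1]})))
```

Using `None` as the default also avoids building a DataFrame once at function definition time.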

Pandas read json ValueError: Protocol not known

I ran this code a while ago and it worked, but now it raises ValueError: Protocol not known. Could anyone help? Thanks. import json temp = json.dumps([status.

How to create a dictionary of two pandas DataFrame columns

What is the most efficient way to organise the following pandas Dataframe: data = Position Letter 1 a 2 b 3 c 4 d 5
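A frequently used idiom for this is `dict(zip(...))` over the two columns. A minimal sketch using the first four rows visible in the excerpt; the excerpt is truncated, so which column should serve as the key is an assumption:

```python
import pandas as pd

data = pd.DataFrame({'Position': [1, 2, 3, 4],
                     'Letter': ['a', 'b', 'c', 'd']})

# Pair the columns row by row; Position is assumed to be the key.
mapping = dict(zip(data['Position'], data['Letter']))
print(mapping)
```

Swapping the two arguments to `zip` gives the reverse mapping, from letter to position.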