Category "dataframe"

Summing row values after a groupby but based on a dictionary condition?

I am trying to figure out how to add the row entries of the numeric columns (supply, demand). I am at a complete loss. My initial thoughts are to do this with a dic

Sum of list values in a df, new column, values are objects

I have a df made of values from a dictionary. I can get rid of [], ',' and split it all in different cols (one col per number). But can't make the transfer to f

Create several new variables using a vector of names and a vector for computation within dplyr::mutate

I'd like to create several new columns. They should take their names from one vector and they should be computed by taking one column in the data and dividing i

Make a mean of several years' dataframes, hour by hour

I have several dataframes of some value taken every hour, over several years, like this: df1 Out[6]: time P G(i) H_sun T2m WS10m Int

How to create a new data frame of the latest value reported in each column?

I've got a table like this: country continent date n_case Ex TD TC -----------------------------------------------------

Reading a Parquet file from S3 in an EMR cluster is taking a long time

I am trying to read a parquet file (not compressed) into a pandas dataframe on an EMR cluster. I am using EMR 6.4 and parquet version 1.1.5. We are in the proces

Better/Efficient way to filter out Spark Dataframe rows with multiple conditions

I have a dataframe that looks like this: id pub_date version unique_id c_id p_id type source lni001 20220301 1

How to handle a variable-size JSON file in Python to create a DataFrame using pandas

I am trying to build a DataFrame using pandas, but I am not able to handle the case where the JSON chunks I receive vary in size. E.g., 1st chunk: {'a

Generate binary outcome dummy data based on probability of items and its feature

I want to generate synthetic data from scratch, namely binary outcome sequence data (0/1). My data has the following property: For the sake of an example, let's

PySpark: how to join common column values to a list value

I am trying to join column values to a list of values. df1= name | department| state | id| -----+-----------+-------+---+ James|Sales |NY |101 Maria|F

Using pandas to read an Excel file from S3, apply some operation, and write the file to the same location

I am using pandas to read an Excel file from S3, and I will be doing some operation on one of the columns and writing the new version to the same location. Basically ne

pcolormesh plot with data from csv file

I have some time series data in the following format from csv file: Time 1 5 10 18 21 22 29 30 35 2019/11/01 09:00 5

Find last available date if date does not exist in other DataFrame

Suppose that you have two data frames which can be created using code below: df1 = pd.DataFrame(data={'start_date': ['2021-07-02', '2021-07-09',
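One common way to approach this kind of lookup is an as-of join. The following is only a minimal sketch, not the asker's code: the second frame `df2`, its `date` and `value` columns, and all sample values other than the two `start_date` entries shown above are assumptions made for illustration.

```python
import pandas as pd

# df1 uses the two start dates visible in the excerpt; the rest is hypothetical.
df1 = pd.DataFrame({"start_date": pd.to_datetime(["2021-07-02", "2021-07-09"])})

# Hypothetical second frame: 2021-07-02 is missing, 2021-07-09 is present.
df2 = pd.DataFrame({
    "date": pd.to_datetime(["2021-07-01", "2021-07-09"]),
    "value": [1.0, 2.0],
})

# merge_asof needs both frames sorted by their join keys.
# direction="backward" picks the last df2 row whose date is <= start_date.
result = pd.merge_asof(
    df1.sort_values("start_date"),
    df2.sort_values("date"),
    left_on="start_date",
    right_on="date",
    direction="backward",
)
print(result)
```

In this sketch, a `start_date` with no exact match falls back to the closest earlier `date`, while an exact match is kept as-is.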

Simple fread operation with fill=TRUE fails

The following code generates data files where each row has a different number of columns. The option fill=TRUE appears to work only when a certain character lim

Plotting the frequency of occurrences per date

I'm new to pandas and plotly. And I have a large csv file with two columns, a date column and a column that contains a string of text (event). Each event is a n
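Counting occurrences per date is usually the first step before any plotting. A minimal sketch, assuming hypothetical column names `date` and `event` and a tiny in-memory sample in place of the large CSV file:

```python
import pandas as pd

# Hypothetical sample standing in for the CSV (column names are assumptions).
df = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01", "2021-01-02"],
    "event": ["login", "logout", "login"],
})
df["date"] = pd.to_datetime(df["date"])

# Count how many events fall on each calendar date.
counts = df.groupby(df["date"].dt.date).size().rename("occurrences")
print(counts)
```

The resulting Series, indexed by date, can then be handed to a plotting library as the data for a bar or line chart.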

Can't copy tuple from one dataframe into another (Length of values does not match length of index)

I want to create columns in a dataframe (df_joined) that contain as values tuples from a second df (df_tupels). The tuples are (10,50) and (20,60). I tried var

Find matching name in another table, return value associated w/ column in pandas

I have 2 tables. I want to take DF1 and adjust the values in the tables given the values in DF2. DF2 is simply a groupby of a column in DF1. In domain terms, I

Smart for loop in python for a portfolio performance

This is my first question here, so go easy on me. I've computed a certain portfolio in python, for which I've gotten a dataframe (or list for that matter) of ar

Image Processing from PDF to excel

I have a pdf file which holds a few tables containing different colors instead of RGB values. I have been tasked to fetch RGB value from each row and transition

Groupby id and change values for all rows for the earliest date to NaN

I have the following df, and I would like to group by id and then replace value X with NaN. My current df. ID Date X other variables.. 1 1/1/18
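One possible approach is to compare each row's date with the per-group minimum. This is a minimal sketch under stated assumptions: the sample rows are hypothetical, and the dates are assumed to be in month/day/two-digit-year form, which the excerpt does not confirm.

```python
import numpy as np
import pandas as pd

# Hypothetical data using the column names from the excerpt (ID, Date, X).
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "Date": ["1/1/18", "1/1/18", "2/1/18", "3/1/18", "4/1/18"],
    "X": [10.0, 11.0, 12.0, 20.0, 21.0],
})

# Assumption: dates are month/day/two-digit year.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%y")

# Earliest date per ID, broadcast back onto every row of that group.
earliest = df.groupby("ID")["Date"].transform("min")

# Blank out X on every row that carries its group's earliest date.
df.loc[df["Date"] == earliest, "X"] = np.nan
print(df)
```

Using `transform("min")` keeps the result aligned with the original index, so all rows sharing the earliest date in a group are affected, not just the first one.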