Pandas to_gbq() ArrowTypeError: "Expected bytes, got a 'int' object"
I am using the pandas_gbq module to try to append a dataframe to a table in Google BigQuery.
I keep getting an ArrowTypeError: Expected bytes, got a 'int' object.
I can confirm the data types of the dataframe match the schema of the BQ table.
I found this post regarding Parquet files not being able to have mixed datatypes: Pandas to parquet file
The error message references a Parquet file, so I'm assuming the df.to_gbq() call creates a Parquet file behind the scenes and that I have a mixed data type column, which is causing the error. The error message doesn't specify which column.
My challenge is that I can't seem to find which column has the mixed data type - I've tried casting them all as strings and then specifying the table_schema parameter, but that hasn't worked either.
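For anyone hitting the same thing, a quick way to narrow down the offending column is to run each column through Arrow on its own, so the conversion error names the column instead of being buried in the traceback. This is just a sketch: it assumes `df` is the dataframe from the failing call and that pyarrow is importable (which the traceback suggests it is):

```python
import pyarrow as pa

# Sketch: convert each column to an Arrow array individually so that the
# ArrowTypeError points to a specific column.
for col in df.columns:
    try:
        pa.Array.from_pandas(df[col])
    except (pa.ArrowTypeError, pa.ArrowInvalid) as exc:
        print(f"Column {col!r} fails Arrow conversion: {exc}")
```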
The full error message I'm receiving is below.
Any help would be appreciated!
```
In [76]: df.to_gbq('Pricecrawler.Daily_Crawl_Data', project_id=project_id, if_exists='append')
ArrowTypeError Traceback (most recent call last)
<ipython-input-76-74cec633c5d0> in <module>
----> 1 df.to_gbq('Pricecrawler.Daily_Crawl_Data', project_id=project_id, if_exists='append')
~\Anaconda3\lib\site-packages\pandas\core\frame.py in to_gbq(self, destination_table,
project_id, chunksize, reauth, if_exists, auth_local_webserver, table_schema, location,
progress_bar, credentials)
1708 from pandas.io import gbq
1709
-> 1710 gbq.to_gbq(
1711 self,
1712 destination_table,
~\Anaconda3\lib\site-packages\pandas\io\gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, reauth, if_exists, auth_local_webserver, table_schema, location, progress_bar, credentials)
209 ) -> None:
210 pandas_gbq = _try_import()
--> 211 pandas_gbq.to_gbq(
212 dataframe,
213 destination_table,
~\Anaconda3\lib\site-packages\pandas_gbq\gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, reauth, if_exists, auth_local_webserver, table_schema, location, progress_bar, credentials, api_method, verbose, private_key)
1191 return
1192
-> 1193 connector.load_data(
1194 dataframe,
1195 destination_table_ref,
~\Anaconda3\lib\site-packages\pandas_gbq\gbq.py in load_data(self, dataframe, destination_table_ref, chunksize, schema, progress_bar, api_method, billing_project)
584
585 try:
--> 586 chunks = load.load_chunks(
587 self.client,
588 dataframe,
~\Anaconda3\lib\site-packages\pandas_gbq\load.py in load_chunks(client, dataframe, destination_table_ref, chunksize, schema, location, api_method, billing_project)
235 ):
236 if api_method == "load_parquet":
--> 237 load_parquet(
238 client,
239 dataframe,
~\Anaconda3\lib\site-packages\pandas_gbq\load.py in load_parquet(client, dataframe, destination_table_ref, location, schema, billing_project)
127
128 try:
--> 129 client.load_table_from_dataframe(
130 dataframe,
131 destination_table_ref,
~\Anaconda3\lib\site-packages\google\cloud\bigquery\client.py in load_table_from_dataframe(self, dataframe, destination, num_retries, job_id, job_id_prefix, location, project, job_config, parquet_compression, timeout)
2669 parquet_compression = parquet_compression.upper()
2670
-> 2671 _pandas_helpers.dataframe_to_parquet(
2672 dataframe,
2673 job_config.schema,
~\Anaconda3\lib\site-packages\google\cloud\bigquery\_pandas_helpers.py in dataframe_to_parquet(dataframe, bq_schema, filepath, parquet_compression, parquet_use_compliant_nested_type)
584
585 bq_schema = schema._to_schema_fields(bq_schema)
--> 586 arrow_table = dataframe_to_arrow(dataframe, bq_schema)
587 pyarrow.parquet.write_table(
588 arrow_table, filepath, compression=parquet_compression, **kwargs,
~\Anaconda3\lib\site-packages\google\cloud\bigquery\_pandas_helpers.py in dataframe_to_arrow(dataframe, bq_schema)
527 arrow_names.append(bq_field.name)
528 arrow_arrays.append(
--> 529 bq_to_arrow_array(get_column_or_index(dataframe, bq_field.name), bq_field)
530 )
531 arrow_fields.append(bq_to_arrow_field(bq_field, arrow_arrays[-1].type))
~\Anaconda3\lib\site-packages\google\cloud\bigquery\_pandas_helpers.py in bq_to_arrow_array(series, bq_field)
288 if field_type_upper in schema._STRUCT_TYPES:
289 return pyarrow.StructArray.from_pandas(series, type=arrow_type)
--> 290 return pyarrow.Array.from_pandas(series, type=arrow_type)
291
292
~\Anaconda3\lib\site-packages\pyarrow\array.pxi in pyarrow.lib.Array.from_pandas()
~\Anaconda3\lib\site-packages\pyarrow\array.pxi in pyarrow.lib.array()
~\Anaconda3\lib\site-packages\pyarrow\array.pxi in pyarrow.lib._ndarray_to_array()
~\Anaconda3\lib\site-packages\pyarrow\error.pxi in pyarrow.lib.check_status()
ArrowTypeError: Expected bytes, got a 'int' object
```
Solution 1:[1]
Not really an answer, but a kludgy workaround. I'm having this exact same problem with dataframes that contain columns of the INT64 type. I've found that doing the following works:
from io import StringIO
import pandas as pd

# temporarily store the dataframe as a CSV in a string variable
temp_csv_string = df.to_csv(sep=";", index=False)
temp_csv_string_IO = StringIO(temp_csv_string)

# create a new dataframe from the string variable
new_df = pd.read_csv(temp_csv_string_IO, sep=";")

# this new df can be uploaded to BQ with no issues
new_df.to_gbq(table_id, project_id, if_exists="append")
I have no idea why this works. Both dataframes seem to be identical if you look at df.info() and new_df.info(). I decided to try this after saving the offending dataframe as a CSV and uploading it to BigQuery in that format, which worked.
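A possible explanation (an assumption, not verified against the original data): df.info() only reports the pandas dtype, so an object-dtype column holding a mix of Python ints and strings looks unchanged after the CSV round trip, even though read_csv re-infers a single type per column. Inspecting the element types directly shows the difference:

```python
# Sketch: list the Python types actually stored in each object-dtype column.
# A column that mixes int and str here is a likely source of the ArrowTypeError.
for col in df.select_dtypes(include="object").columns:
    print(col, df[col].map(type).value_counts().to_dict())
```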
Note that this specifically happens with INT64 columns. I'm uploading dataframes generated in the same way that don't contain INT64 columns without any issues.
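If the round trip through a CSV string feels too indirect, another option worth trying is to call pandas_gbq.to_gbq directly with api_method="load_csv" (the api_method argument is visible in the traceback's pandas_gbq signature), which stages the upload as CSV and skips the Arrow/Parquet conversion that raises the error. A sketch, using the table and project names from the question:

```python
import pandas_gbq

# Sketch: stage the upload as CSV instead of Parquet so the pyarrow
# conversion that raises ArrowTypeError is skipped entirely.
pandas_gbq.to_gbq(
    df,
    "Pricecrawler.Daily_Crawl_Data",  # table from the question
    project_id=project_id,
    if_exists="append",
    api_method="load_csv",
)
```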
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Óscar |
