"Row Filter for Table is invalid" in PySpark
I have a PySpark DataFrame coming from a view in BigQuery, which I load after configuring the Spark session:
import pyspark
from pyspark.sql import SparkSession

config = pyspark.SparkConf().setAll([
    ('spark.executor.memory', '10g'),
    ('spark.driver.memory', '30g'),
    ('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0')])
sc = pyspark.SparkContext(conf=config)
spark = SparkSession.builder.master('yarn').appName('base_analitica_entidades').config(conf=config).getOrCreate()
I read the dataset with:
recomendaveis = spark.read.format("bigquery").option("viewsEnabled", "true").load("resource_group:some_group.someView")
Then I filter a specific column with isNotNull:
recomendaveis_mid = recomendaveis.filter(recomendaveis["entities_mid"].isNotNull())
The schema of recomendaveis_mid is:
DataFrame[uid: string, revision: bigint, title: string, subtitle: string, access: string, branded_content: boolean, image: string, published_in: date, changed_in: date, entities_extracted_in: string, translation_extracted_in: string, categories_extracted_in: string, bigquery_inserted_in: string, public_url: string, private_url: string, text: string, translation_en: string, authors_name: string, categories_name: string, categories_confidence: double, entities_name: string, entities_type: string, entities_salience: double, entities_mid: string, entities_wikipedia_url: string, named_entities: string, publications: string, body: string, Editoria: string, idmateria: string]
When I try to get the minimum date of the published_in column with:
recomendaveis_mid.select(F.min("published_in")).collect()
It throws this error:
Caused by: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: request failed: Row filter for table resource_group:some_group.table is invalid. Filter is '(`entities_mid` IS NOT NULL)'
    at com.google.cloud.spark.bigquery.repackaged.io.grpc.Status.asRuntimeException(Status.java:533)
    ... 14 more
The field published_in has nothing to do with my filter on entities_mid, and when I run the date query without first applying the entities_mid isNotNull filter, the code works fine. Any suggestions?
There is a similar error here, but I couldn't get any other ideas from it. Thanks in advance.
Solution 1:[1]
We faced a similar issue in Scala Spark while reading from a view.
Upon analysis, we observed that when we do
df.printSchema()
df.show(1,false)
it prints all fields even before the join operation takes place. But when loading or writing the DataFrame to external storage or a table, it throws the error:
INVALID_ARGUMENT: request failed: Row filter for table
After some experimentation, we observed that if we persist the DataFrame
df.persist()
it works fine.
It looks like, after the join, the column used in the filter would also need to be kept in the select; since we didn't want that column in our final DataFrame, we persisted the DataFrame on the cluster instead.
You can either unpersist with
df.unpersist()
once the data operation completes, or leave it as is if you are using an ephemeral cluster, since the cached data will be removed when the cluster is deleted.
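Putting the steps above together, here is a minimal sketch of the workaround applied to the original question's code. The DataFrame and column names (recomendaveis, entities_mid, published_in) and the BigQuery path come from the question; running this requires a Spark cluster with the spark-bigquery connector, so treat it as an illustration rather than a drop-in script.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('persist_workaround').getOrCreate()

# Load the BigQuery view, as in the question.
recomendaveis = (spark.read.format("bigquery")
                 .option("viewsEnabled", "true")
                 .load("resource_group:some_group.someView"))

# Filter, then persist so the filtered result is materialized on the
# cluster before any later action pushes the filter down to BigQuery.
recomendaveis_mid = (recomendaveis
                     .filter(F.col("entities_mid").isNotNull())
                     .persist())

# The aggregation now runs against the cached data.
min_date = recomendaveis_mid.select(F.min("published_in")).collect()

# Release the cache when done (optional on an ephemeral cluster).
recomendaveis_mid.unpersist()
```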
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Khilesh Chauhan |
