"Row Filter for Table is invalid" in PySpark

I have a DataFrame in PySpark coming from a view in BigQuery that I import after configuring the Spark session:

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

config = pyspark.SparkConf().setAll([
    ('spark.executor.memory', '10g'),
    ('spark.driver.memory', '30G'),
    ('spark.jars.packages', 'com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.18.0'),
])
sc = pyspark.SparkContext(conf=config)
spark = SparkSession.builder.master('yarn').appName('base_analitica_entidades').config(conf=config).getOrCreate()

I read this dataset through:

recomendaveis = spark.read.format("bigquery").option("viewsEnabled", "true").load("resource_group:some_group.someView")

Then I filter a specific column with IsNotNull:

recomendaveis_mid = recomendaveis.filter(recomendaveis["entities_mid"].isNotNull())

This recomendaveis_mid dataset is:

DataFrame[uid: string, revision: bigint, title: string, subtitle: string, access: string, branded_content: boolean, image: string, published_in: date, changed_in: date, entities_extracted_in: string, translation_extracted_in: string, categories_extracted_in: string, bigquery_inserted_in: string, public_url: string, private_url: string, text: string, translation_en: string, authors_name: string, categories_name: string, categories_confidence: double, entities_name: string, entities_type: string, entities_salience: double, entities_mid: string, entities_wikipedia_url: string, named_entities: string, publications: string, body: string, Editoria: string, idmateria: string]

When I try to get the minimum date of the column published_in with:

recomendaveis_mid.select(F.min("published_in")).collect()

It throws this error:

Caused by: com.google.cloud.spark.bigquery.repackaged.io.grpc.StatusRuntimeException: INVALID_ARGUMENT: request failed: Row filter for table resource_group:some_group.table is invalid. Filter is '(`entities_mid` IS NOT NULL)'
    at com.google.cloud.spark.bigquery.repackaged.io.grpc.Status.asRuntimeException(Status.java:533)
    ... 14 more

The field published_in has nothing to do with my filter on entities_mid, and when I run the date aggregation without the entities_mid isNotNull filter, my code works fine. Any suggestions? As an aside: there is a similar error here, but I couldn't get any other ideas from it. Thanks in advance.



Solution 1:[1]

We faced a similar issue in Scala Spark while reading from a view.

Upon analysis, we observed that when we do

df.printSchema()
df.show(1, false)

it prints all fields even before the join operation takes place. But when loading/writing the data frame to external storage or a table, it throws the error:

INVALID_ARGUMENT: request failed: Row filter for table

After some experimenting, we observed that if we persist the dataframe

df.persist()

it works fine.

It looks like after joining we would otherwise also need to keep the column used in the filter in the select; since we didn't want that column in our final dataframe, we persisted the dataframe in the cluster instead.

You can either unpersist

df.unpersist()

once the data operation completes, or leave it as is if you are using an ephemeral cluster, since the cached data will be deleted along with the cluster.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Khilesh Chauhan