Does Spark have a cache with a TTL option? I need to do lookups on reference data to perform some transformations in my Spark streaming application. Also, the lookup dataset
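Spark has no built-in TTL cache for DataFrames, so a common workaround is to reload and re-cache the lookup data once a chosen interval has elapsed, e.g. from inside foreachBatch. A minimal sketch, with hypothetical paths, join key, and interval:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ttl-lookup-sketch").getOrCreate()

TTL_SECONDS = 600                     # hypothetical refresh interval
_lookup = {"df": None, "loaded_at": 0.0}

def get_lookup_df():
    """Return the cached lookup DataFrame, reloading it once the TTL expires."""
    now = time.time()
    if _lookup["df"] is None or now - _lookup["loaded_at"] > TTL_SECONDS:
        if _lookup["df"] is not None:
            _lookup["df"].unpersist()                 # drop the stale cache
        _lookup["df"] = spark.read.parquet("/path/to/reference").cache()
        _lookup["loaded_at"] = now
    return _lookup["df"]

def process_batch(batch_df, batch_id):
    # used with stream.writeStream.foreachBatch(process_batch); each micro-batch
    # is joined against the (possibly refreshed) lookup data
    batch_df.join(get_lookup_df(), on="key", how="left") \
            .write.mode("append").parquet("/path/to/output")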
I want to convert each row of my dataframe into a Python class object called Fruit. I have a dataframe df with the following columns: Identifier, Name, Quant
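One way to do this is a small dataclass plus a list comprehension over collect(). A sketch, assuming the truncated third column is named Quantity (hypothetical):

from dataclasses import dataclass
from pyspark.sql import SparkSession

@dataclass
class Fruit:
    identifier: int
    name: str
    quantity: int        # assumed name for the truncated "Quant..." column

spark = SparkSession.builder.appName("fruit-sketch").getOrCreate()
df = spark.createDataFrame([(1, "apple", 3)], ["Identifier", "Name", "Quantity"])

# collect() pulls every row to the driver, so this suits small DataFrames;
# for large data, map over df.rdd instead
fruits = [Fruit(r["Identifier"], r["Name"], r["Quantity"]) for r in df.collect()]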
I have trained a logistic regression algorithm to match job titles and descriptions to a set of 4-digit numeric codes. It does this very well. It will form part
I'm reading a .json file that contains the structure below, and I need to generate a CSV with this data in column form. I know that I can't directly write an ar
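Since CSV cannot hold array columns directly, the usual approach is to explode the array into one row per element and flatten the struct fields into columns. A sketch with hypothetical field names and paths, since the question's JSON structure is cut off:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("json-to-csv-sketch").getOrCreate()

df = spark.read.json("/path/to/input.json")                 # hypothetical path
flat = (df.select("id", F.explode("items").alias("item"))   # one row per element
          .select("id", "item.*"))                          # one column per struct field
flat.write.mode("overwrite").option("header", True).csv("/path/to/output")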
I need to extract objects from an array; where there's more than one object in that array, I need to repeat for every id, and if the field is null then I want to
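For the null case, the distinction that usually matters is explode() versus explode_outer(): explode() drops rows whose array is null or empty, while explode_outer() keeps them with a null element. A sketch with illustrative names and data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("explode-sketch").getOrCreate()
df = spark.createDataFrame(
    [(1, [("a",), ("b",)]), (2, None)],
    "id INT, objs ARRAY<STRUCT<v: STRING>>",
)

# one row per object per id; id 2 survives with a null object
df.select("id", F.explode_outer("objs").alias("obj")).show()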
Hive has a metastore, and HiveServer2 listens for SQL requests; with the help of the metastore, the query is executed and the result is passed back. The Thrift framework
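A sketch of that flow from the client side, using PyHive as the Thrift client (host, port, and table are placeholders): the client speaks Thrift to HiveServer2, which plans the query with metastore metadata and streams the result back.

from pyhive import hive

conn = hive.connect(host="hs2.example.com", port=10000, username="analyst")
cur = conn.cursor()
cur.execute("SELECT * FROM some_table LIMIT 10")   # HiveServer2 executes and returns rows
for row in cur.fetchall():
    print(row)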
I am from LinkedIn, and we are having a compatibility issue with spark-cdm-connector. To give a little context, I have CDM data in ADLS which I’m trying to read
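For reference, a read along the lines of the connector's documented examples; the option names (storage, manifestPath, entity, plus AAD app credentials) come from the spark-cdm-connector README and can differ between connector versions, which is exactly where compatibility issues tend to surface:

# assumes a SparkSession `spark` with the spark-cdm-connector jar on the classpath
df = (spark.read.format("com.microsoft.cdm")
      .option("storage", "<account>.dfs.core.windows.net")
      .option("manifestPath", "container/path/default.manifest.cdm.json")
      .option("entity", "MyEntity")
      .option("appId", "<app-id>")
      .option("appKey", "<app-key>")
      .option("tenantId", "<tenant-id>")
      .load())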
I am attempting to use a PySpark kernel from inside an EMR Notebook hosted on an AWS managed service (EMR), and I am unable to access Artifactory to install
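One route that may help: the EMR Notebooks PySpark kernel exposes sc.install_pypi_package, which accepts an alternative repository URL, so pointing it at an internal Artifactory PyPI index (the URL below is hypothetical) works when the cluster can reach that host:

# `sc` is the SparkContext the EMR PySpark kernel provides
sc.install_pypi_package(
    "pandas==1.0.5",
    "https://artifactory.example.com/artifactory/api/pypi/pypi-local/simple")
sc.list_packages()   # verify the package landed on the cluster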
I am trying to write data to an S3 bucket from my local computer: spark = SparkSession.builder \ .appName('application') \ .config("spark.hadoop.fs.s3a.
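Writing to S3 from a local machine generally needs the hadoop-aws module on the classpath plus s3a credentials. A sketch with placeholder keys and bucket; the hadoop-aws version must match the Hadoop build Spark ships with:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("application")
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
         .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
         .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
         .getOrCreate())

df = spark.createDataFrame([(1, "x")], ["id", "val"])
df.write.mode("overwrite").parquet("s3a://my-bucket/path/")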
I have a dataframe having a column of type MapType<StringType, StringType>.
|-- identity: map (nullable = true)
|    |-- key: string
|    |-- value: string
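Two common ways to work with such a column: pull known keys out with getItem, or turn the whole map into key/value rows with explode. A sketch with illustrative keys:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("map-sketch").getOrCreate()
df = spark.createDataFrame([({"a": "1", "b": "2"},)], "identity MAP<STRING, STRING>")

df.select(F.col("identity").getItem("a").alias("a")).show()     # one known key
df.select(F.explode("identity").alias("key", "value")).show()   # all pairs as rows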
I have a Spark standalone cluster configured with 3 nodes. I want to read CSV data stored in S3-compatible storage (Dell ECS) in PySpark. Here's the method and configuration
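For S3-compatible stores such as Dell ECS, s3a needs the custom endpoint and usually path-style access on top of the credentials. A sketch with placeholder endpoint, keys, and bucket:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ecs-read")
         .config("spark.hadoop.fs.s3a.endpoint", "https://ecs.example.com:9021")
         .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY>")
         .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_KEY>")
         .config("spark.hadoop.fs.s3a.path.style.access", "true")
         .getOrCreate())

df = spark.read.option("header", True).csv("s3a://my-bucket/data.csv")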
I am using Spark version 3.1.2, and I need to load data from a CSV with encoding UTF-16LE. df = spark.read.format("csv") .option("delimiter", ",") .opti
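The CSV reader takes an "encoding" option (alias "charset"); for non-UTF-8 files it is often paired with multiLine=true, since line splitting otherwise assumes UTF-8 newlines. A sketch with a placeholder path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("utf16-csv-sketch").getOrCreate()

df = (spark.read.format("csv")
      .option("delimiter", ",")
      .option("header", True)
      .option("encoding", "UTF-16LE")
      .option("multiLine", True)    # per-record decoding honours the charset
      .load("/path/to/file.csv"))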
Given a date, I create a column in the ISO 8601 week date format: from pyspark.sql import functions as F df = spark.createDataFrame([('2019-03-
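Worth noting: Spark 3's default formatter rejects the week-based pattern letters (Y, w) in date_format, so one workaround is to assemble the ISO 8601 week date from extract fields; YEAROFWEEK, WEEK, and DAYOFWEEK_ISO are the ISO-8601-based variants. A sketch:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("iso-week-sketch").getOrCreate()
df = spark.createDataFrame([("2019-03-18",)], ["d"]).select(F.col("d").cast("date").alias("d"))

df.select(
    F.format_string(
        "%04d-W%02d-%d",
        F.expr("extract(YEAROFWEEK FROM d)"),
        F.expr("extract(WEEK FROM d)"),
        F.expr("extract(DAYOFWEEK_ISO FROM d)"),
    ).alias("iso_week_date")
).show()   # 2019-W12-1 for 2019-03-18, a Monday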
I'm not experienced with Java or the Hadoop ecosystem. I configured my Spark cluster to connect to Amazon Keyspaces using spark-cassandra-connector from DataStax.
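For reference, the general shape of a Keyspaces connection with spark-cassandra-connector, per the settings AWS documents: TLS on port 9142, service-specific credentials, and a truststore holding the Starfield CA. Region, paths, and credentials below are placeholders:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("keyspaces-sketch")
         .config("spark.cassandra.connection.host", "cassandra.us-east-1.amazonaws.com")
         .config("spark.cassandra.connection.port", "9142")
         .config("spark.cassandra.connection.ssl.enabled", "true")
         .config("spark.cassandra.connection.ssl.trustStore.path", "/path/cassandra_truststore.jks")
         .config("spark.cassandra.connection.ssl.trustStore.password", "<password>")
         .config("spark.cassandra.auth.username", "<service-user>")
         .config("spark.cassandra.auth.password", "<service-password>")
         .getOrCreate())

df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(table="my_table", keyspace="my_keyspace")
      .load())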
I am working in PySpark with the flatMap function, and I am using split within the function. But I am getting an error which says: AttributeError: 'NoneType' object has no attribute 'split'
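That error means some record is None when split is called on it; guarding (or filtering) before splitting fixes it. A sketch with illustrative data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flatmap-sketch").getOrCreate()
rdd = spark.sparkContext.parallelize(["a b c", None, "d e"])

def safe_split(line):
    return line.split(" ") if line is not None else []   # skip null records

print(rdd.flatMap(safe_split).collect())   # ['a', 'b', 'c', 'd', 'e']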
I was able to create a Docker-based Bitnami standalone Spark instance and run Spark jobs on it. However, I'm not able to write data to Snowflake from the
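Writing to Snowflake needs the spark-snowflake and snowflake-jdbc jars on every node (spark.jars.packages handles that) plus the sfOptions below. A sketch; account, credentials, and table are placeholders, and the package versions must match your Spark/Scala build:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("snowflake-sketch")
         .config("spark.jars.packages",
                 "net.snowflake:snowflake-jdbc:3.13.30,"
                 "net.snowflake:spark-snowflake_2.12:2.12.0-spark_3.4")
         .getOrCreate())

sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<db>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

df = spark.createDataFrame([(1, "x")], ["id", "val"])
(df.write.format("net.snowflake.spark.snowflake")
   .options(**sf_options)
   .option("dbtable", "MY_TABLE")
   .mode("append")
   .save())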
I am querying a Hudi table using Hive, which is running on the Spark engine, in an EMR 6.3.1 cluster. The Hudi version is 0.7. I have inserted a few records and then updated them
I have a DataFrame that consists of a column of ArrayType, and the array may have a different length in each row of the data. I have provided some example code
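The ask is cut off, but a frequent first step with ragged arrays is posexplode, which yields one row per element together with its index, after which per-row lengths stop mattering; size() gives the length itself. A sketch with illustrative data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ragged-array-sketch").getOrCreate()
df = spark.createDataFrame([(1, [10, 20, 30]), (2, [40])], ["id", "vals"])

df.select("id", F.posexplode("vals").alias("pos", "val")).show()
df.select("id", F.size("vals").alias("len")).show()   # per-row array length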
I have a Spark Scala DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text. How would I calculate the position
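Spark SQL's locate/instr return the 1-based position of a substring (0 when absent); since the built-in column functions take a literal search string, expr() is the usual route for a column-vs-column lookup. Shown in PySpark for consistency with the rest of this list, but expr("locate(subtext, text)") works identically in Scala:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("locate-sketch").getOrCreate()
df = spark.createDataFrame([("hello world", "world")], ["text", "subtext"])

df.withColumn("pos", F.expr("locate(subtext, text)")).show()   # pos = 7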