Category "apache-spark-sql"

Processing data from a kafka stream using Pyspark

What the console of the kafka consumer looks like: ["2017-12-31 16:06:01", 12472391, 1] ["2017-12-31 16:06:01", 12472097, 1] ["2017-12-31 16:05:59", 12471979,

How to apply a pandas geocode function to Pyspark column

Table is like this id ADDRESS 0 6101 SUMMITVIEW AVE STE 200 YAKIMA 1 527 CEDAR WAY SUITE 105 OAKMONT 2 1700 N ROSE AVE SUITE 460 OXNARD 3 1275 YORK AVE NEW YOR

Databricks local test fail with java.lang.NoSuchMethodError: org.apache.hadoop.security.HadoopKerberosName.setRuleMechanism

I have a unit test to databricks code, and I want to run it locally on windows. Unluckily when I run pytest with PyCharm, it throws the following exception: Exc

Insert Overwrite in data bricks overwriting complete data in table?

I am have two table 1 is with 50K records and other is with 2.5K records and I want to update this 2.5K records into table one. Currently I was doing this by us

Spark/Scala approximate group by

Is there a way of counting approximately after a group by on an sql dataset in Spark? Or more generally, what is the fastest way of group by counting in Spark?

How to use Apache Spark to query Hive table with Kerberos?

I am attempting to use Scala with Apache Spark locally to query Hive table which is secured with Kerberos. I have no issues connecting and querying the data pro

Spark SQL: Parse date string from dd/mm/yyyy to yyyy/mm/dd

I want to use spark SQL or pyspark to reformat a date field from 'dd/mm/yyyy' to 'yyyy/mm/dd'. The field type is string: from pyspark.sql import SparkSession fr

Pyspark Window function on entire data frame

Consider a pyspark data frame. I would like to summarize the entire data frame, per column, and append the result for every row. +-----+----------+-----------+

Spark SQL error from EMR notebook with AWS Glue table partition

I'm testing some pyspark code in an EMR notebook before I deploy it and keep running into this strange error with Spark SQL. I have all my tables and metadata i

Convert date to ISO week date in Spark

Having dates in one column, how to create a column containing ISO week date? ISO week date is composed of year, week number and weekday. year is not the same as

Auto increment id in delta table while inserting

I have a problem regarding merging csv files using pysparkSQL with delta table. I managed to create upsert function that update if matched and insert if not mat

Convert UTC timestamp to local time based on time zone in PySpark

I have a PySpark DataFrame, df, with some columns as shown below. The hour column is in UTC time and I want to create a new column that has the local time based

Scala error - Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z

I have a requirement where i am reading data from a CSV file and writing data to a Delta table over scala on window OS. My scala code is given below:- import co

Category "apache-spark-sql"

Processing data from a kafka stream using Pyspark

How to apply a pandas geocode function to Pyspark column

Databricks local test fail with java.lang.NoSuchMethodError: org.apache.hadoop.security.HadoopKerberosName.setRuleMechanism

Insert Overwrite in data bricks overwriting complete data in table?

Spark/Scala approximate group by

How to use Apache Spark to query Hive table with Kerberos?

Spark SQL: Parse date string from dd/mm/yyyy to yyyy/mm/dd

Pyspark Window function on entire data frame

Spark SQL error from EMR notebook with AWS Glue table partition

Convert date to ISO week date in Spark

Auto increment id in delta table while inserting

Convert UTC timestamp to local time based on time zone in PySpark

Scala error - Exception in thread "main" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z

TypeError: 'str' object is not callable -Pyspark

Start of the week on Monday in Spark

Reading image dataset into data frame and feature extraction [spark with python]

SPARK SQL create table does not show / read all columns as expected

How to return null in SUM if some values are null?

Exception in thread "main" java.lang.IllegalAccessError: class org.apache.spark.storage.StorageUtils$

bucketing with QuantileDiscretizer using groupBy function in pyspark

Category "apache-spark-sql"

Other Categories