Category "scala"

Random Sampling based on 1 column after Groupby

I have a Spark table which contains 400+ million records/rows. I used spark.table to convert it into a DF. The DF looks like this below: id pub_date
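
A common way to take a fixed-size random sample per group (here, per pub_date) is to rank rows in random order inside a window and keep the first n. A minimal sketch, assuming the table and column names from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rand, row_number}

val spark = SparkSession.builder.getOrCreate()
val df = spark.table("my_table")   // hypothetical table with columns id, pub_date

// Rank rows in random order within each pub_date group, then keep the first n per group.
val w = Window.partitionBy("pub_date").orderBy(rand(42))
val sampled = df
  .withColumn("rn", row_number().over(w))
  .filter(col("rn") <= 10)         // e.g. keep 10 random rows per pub_date
  .drop("rn")
```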

I am trying to set up Spark locally but am getting an error

Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(
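
For reference, the startup message itself only points at the verbosity knob; a minimal local setup sketch (master and app name are assumptions):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local[*]")      // run Spark locally using all available cores
  .appName("local-test")   // hypothetical app name
  .getOrCreate()

// Adjust logging verbosity as the startup message suggests:
spark.sparkContext.setLogLevel("ERROR")
```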

Spark load csv file in jar from resources folder

I am trying to create a Spark application in Scala that reads a .csv file located in the src/main/resources directory and saves it to the local hdfs
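
One hedged approach: a file packaged inside the jar is not on a filesystem path Spark can read directly, so read the classpath resource into memory first and hand the lines to the CSV reader. A sketch suitable for small files; the file name is an assumption:

```scala
import org.apache.spark.sql.SparkSession
import scala.io.Source

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Read the bundled resource via the classloader, then let Spark parse it as CSV.
val lines = Source
  .fromInputStream(getClass.getResourceAsStream("/data.csv"))   // hypothetical file name
  .getLines()
  .toSeq

val df = spark.read.option("header", "true").csv(lines.toDS())
```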

Azure Storage Account file details in a table in databricks

I am loading data via pipelines into an ADLS gen2 container. Now I want to create a table with details of when the pipeline started running and when it completed. l
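
A hedged sketch for a Databricks notebook: one way to capture per-file details after the pipeline lands data is to list the container with dbutils.fs.ls and persist the listing. The path and table name below are assumptions:

```scala
import spark.implicits._

// List the files in the landing path and keep name, full path and size.
val files = dbutils.fs
  .ls("abfss://container@storageaccount.dfs.core.windows.net/landing/")  // hypothetical path
  .map(f => (f.name, f.path, f.size))
  .toDF("file_name", "file_path", "size_bytes")

// Append each run's listing to an audit table (hypothetical name).
files.write.mode("append").saveAsTable("pipeline_file_audit")
```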

Word count using map reduce on Seq[String]

I have a Seq which contains randomly generated words. I want to calculate the occurrence count of each word using map reduce. Now, I have been able to map the w
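
A minimal map-reduce-style sketch over a plain Seq[String]: map each word to a (word, 1) pair, then reduce by grouping on the word and summing the counts:

```scala
val words: Seq[String] = Seq("spark", "scala", "spark", "flink", "scala", "spark")

val counts: Map[String, Int] =
  words
    .map(w => (w, 1))                                          // "map" step
    .groupBy(_._1)                                             // group pairs by word
    .map { case (word, pairs) => word -> pairs.map(_._2).sum } // "reduce" step
// counts: Map(spark -> 3, scala -> 2, flink -> 1)
```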

java.lang.NoClassDefFoundError: org/apache/flink/streaming/api/scala/StreamExecutionEnvironment

package com.knoldus import org.apache.flink.api.java.utils.ParameterTool import org.apache.flink.streaming.api.scala._ import org.apache.flink.streaming.api.win
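
A NoClassDefFoundError for StreamExecutionEnvironment at runtime usually means the Flink streaming dependency is missing from the runtime classpath, often because it is marked "provided" while the job is run locally. A hedged build.sbt sketch; the version is an assumption:

```scala
// build.sbt (sketch) - version is an assumption
val flinkVersion = "1.11.2"

libraryDependencies ++= Seq(
  // Drop the "provided" scope (or include provided deps in your run config)
  // when executing the job locally instead of on a Flink cluster.
  "org.apache.flink" %% "flink-streaming-scala" % flinkVersion,
  "org.apache.flink" %% "flink-clients"         % flinkVersion
)
```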

How to use countDistinct using a window function in Spark/Scala?

I need to use a window function that is partitioned by 2 columns and do a distinct count on the 3rd column, adding that as the 4th column. I can do count without any is
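
Spark does not support countDistinct directly over a window, but two common workarounds are an exact count via collect_set + size, or approx_count_distinct. A sketch assuming a dataframe df with columns col1, col2, col3:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{approx_count_distinct, col, collect_set, size}

val w = Window.partitionBy("col1", "col2")

val withDistinct = df
  .withColumn("distinct_col3_exact",  size(collect_set(col("col3")).over(w)))     // exact
  .withColumn("distinct_col3_approx", approx_count_distinct(col("col3")).over(w)) // approximate
```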

get count of partitions in a kafka topic with scala 2.12

With scala 2.11 and spark-streaming-kafka-0-8_2.11 I could do import org.apache.spark.streaming.kafka.KafkaCluster val params = Map[String, Object]( "bootstr
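
On Scala 2.12 the old KafkaCluster helper from spark-streaming-kafka-0-8 is gone; a hedged alternative is the plain Kafka AdminClient from kafka-clients. Broker address and topic name below are assumptions:

```scala
import java.util.Properties
import org.apache.kafka.clients.admin.AdminClient
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")    // hypothetical broker

val admin = AdminClient.create(props)
try {
  val description = admin
    .describeTopics(List("my-topic").asJava)
    .values()
    .get("my-topic")
    .get()                                           // KafkaFuture -> TopicDescription
  println(s"my-topic has ${description.partitions().size()} partitions")
} finally {
  admin.close()
}
```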

Getting Error while encoding from reading text file

I have a pipe-delimited file that I need to strip the first two rows off of. So I read it into an RDD, exclude the first two rows, and make it into a data frame. va
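
A hedged sketch of the usual pattern: read the file as text, drop the first two rows by index with zipWithIndex, split on the pipe, and only then build the DataFrame. Path, arity and column names are assumptions:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

val raw = spark.read.textFile("/path/to/file.txt")   // hypothetical path

val df = raw.rdd
  .zipWithIndex()
  .filter { case (_, idx) => idx >= 2 }              // skip the first two rows
  .map { case (line, _) => line.split("\\|") }
  .map(a => (a(0), a(1), a(2)))                      // adjust to the real column count
  .toDF("col1", "col2", "col3")
```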

Scala Jackson deserialization failing with "non-static inner classes" error with Jackson 2.10

I am trying to upgrade from Jackson 2.5 to 2.10. The deserialization code below was working for me before, but post-upgrade it is failing with the following er
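
The "non-static inner classes" failure typically means the target class is declared inside another class, which Jackson 2.10 rejects more strictly than older versions. A hedged sketch with the case class moved to the top level and the Scala module registered; class and field names are assumptions:

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Declared at the top level (not inside another class), so it is not an inner class.
case class User(name: String, age: Int)

val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)

val user = mapper.readValue("""{"name":"ann","age":30}""", classOf[User])
```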

sbt-avro is not generating Scala classes, possible settings issue

I'm trying to use sbt-avro in a Scala project to generate Scala classes from an Avro schema. Here is the project structure: multi-tenant/ build.sbt proj
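
One possible explanation rather than a settings fix: sbt-avro drives Avro's Java code generator, so it emits .java sources; generating Scala case classes usually goes through a different plugin such as sbt-avrohugger. A hedged project/plugins.sbt sketch; the version is an assumption:

```scala
// project/plugins.sbt (sketch) - version is an assumption
addSbtPlugin("com.julianpeeters" % "sbt-avrohugger" % "2.0.0")
```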

Spark in SBT console: "Could not find spark-version-info.properties"

I'm trying to instantiate a SparkContext inside an sbt console, using the following Scala commands: import org.apache.spark.SparkConf import org.apache.spark.Spa
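
For context, a hedged sketch of the console commands being run; constructing the context below is what triggers the spark-version-info.properties lookup:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

// A plain local SparkContext, as typically created in the sbt console.
val conf = new SparkConf().setMaster("local[*]").setAppName("sbt-console")
val sc = new SparkContext(conf)
```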

How to generate html test report for junit in sbt?

I have JUnit tests in my Scala sbt project. I know that I can generate HTML reports for ScalaTest with: testOptions in Test += Tests.Argument(TestFrameworks.
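
For reference, the ScalaTest setting the question points at looks like the sketch below; the report directory is an assumption, and HTML output additionally needs ScalaTest's HTML reporter dependency (flexmark, or pegdown on older ScalaTest versions) on the Test classpath:

```scala
// build.sbt (sketch): ScalaTest HTML report output
testOptions in Test += Tests.Argument(TestFrameworks.ScalaTest, "-h", "target/html-test-report")
```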

Spark scala data frame udf returning rows

Say I have a dataframe which contains a column (called colA) which is a seq of rows. I want to append a new field to each record of colA. (And the new field
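
A hedged alternative that sidesteps returning Rows from a UDF entirely: on Spark 3.1+ the array column can be rewritten with transform and each struct extended with withField. Field names below are assumptions:

```scala
import org.apache.spark.sql.functions.{col, lit, transform}

// Add a new field "flag" to every struct element of colA, derived from an existing
// field "b" (both names hypothetical), without writing a UDF that returns Rows.
val enriched = df.withColumn(
  "colA",
  transform(col("colA"), item => item.withField("flag", item.getField("b") > lit(0)))
)
```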

Error while running Scala code - Databricks 7.3LTS and above

I am running Databricks 7.3 LTS and getting errors while trying to use the Scala bulk copy. The error is: object sqldb is not a member of package com.microsoft. I hav
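
The missing-package error suggests the older azure-sqldb-spark connector (package com.microsoft.azure.sqldb.spark), which is not available on Spark 3 / DBR 7.x. A hedged sketch using the newer Apache Spark connector for SQL Server instead; server, database, table and credential names are assumptions:

```scala
// Requires the com.microsoft.sqlserver.jdbc.spark connector library installed on the
// cluster; it performs bulk inserts through the DataFrame writer.
df.write
  .format("com.microsoft.sqlserver.jdbc.spark")
  .mode("append")
  .option("url", "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb")
  .option("dbtable", "dbo.target_table")
  .option("user", dbUser)         // hypothetical credentials
  .option("password", dbPassword)
  .save()
```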

Summation/multiplication of a list of tuples

I'm trying to figure out a simple operation that takes a list of (Int, Int) tuples, multiplies within each tuple, and then sums those results. Example: v
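
A minimal sketch of the two usual one-liners: map-then-sum, or a single-pass foldLeft:

```scala
val pairs = List((1, 2), (3, 4), (5, 6))

// Multiply within each tuple, then sum the products: 1*2 + 3*4 + 5*6 = 44
val total = pairs.map { case (a, b) => a * b }.sum

// Same result in a single pass:
val total2 = pairs.foldLeft(0) { case (acc, (a, b)) => acc + a * b }
```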

Get only Right values from Either sequence

This code prints List(1, 2, ()) def f1(i:Int): Either[String,Int] = if (i > 0) Right(1) else Left("error 1") def f2(i:Int): Either[String,Int] = if
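
The stray () in the output is a Unit value produced by one of the branches; the usual way to keep only the Right values is collect (or flatMap via toOption). A sketch with assumed inputs:

```scala
val results: Seq[Either[String, Int]] = Seq(Right(1), Right(2), Left("error 3"))

// Keep only the Right values:
val rights: Seq[Int] = results.collect { case Right(v) => v }   // Seq(1, 2)

// Equivalent on Scala 2.12+, where Either is right-biased:
val rights2: Seq[Int] = results.flatMap(_.toOption)
```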

How to convert timestamp column of Spark Dataframe to string column

I want to convert all TIMESTAMP columns of a Spark dataframe into String columns. Could anybody say how to do that automatically for each dataframe? The position of
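
A hedged sketch that works regardless of column position or name: walk the schema and cast every TimestampType column to string:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.TimestampType

def timestampsToString(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, field) =>
    if (field.dataType == TimestampType)
      acc.withColumn(field.name, col(field.name).cast("string"))  // replace in place
    else acc
  }
```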

Spark SQL error : org.apache.spark.sql.catalyst.parser.ParseException: extraneous input '$' expecting

I am forming a query in a StringBuilder like the one below: println(dataQuery) Execution started at 2019-10-31 02:58:24.006019 PST res245: String = " SELECT transac
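
A literal '$' reaching the SQL parser usually means ${...} placeholders were left unsubstituted because the string was built without the s interpolator. A hedged sketch; table and variable names are assumptions:

```scala
val tableName = "transactions"   // hypothetical
val startDate = "2019-10-01"

// Note the leading s on the string: without it, "$tableName" is passed to the
// parser literally and triggers the extraneous-input-'$' ParseException.
val dataQuery =
  s"""SELECT transaction_id, amount
     |FROM $tableName
     |WHERE txn_date >= '$startDate'
     |""".stripMargin

val result = spark.sql(dataQuery)
```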