How to extract URLs from HYPERLINKs in an Excel file when reading into a Scala Spark DataFrame

I have an Excel file with Column A containing HYPERLINKS like this:

=HYPERLINK("https://google.com","View Link")

I can load the Excel file into a Scala Spark DataFrame using the com.crealytics.spark.excel library, but only the 'View Link' text comes through, which DOES NOT contain the URL:

import org.apache.spark.sql._
import org.apache.spark.sql.types._

object Tut {
  def main(args: Array[String]): Unit = {
    println("started")

    val spark = SparkSession
      .builder()
      .appName("MySpark")
      .config("spark.master", "local")
      .getOrCreate()

    val customSchema = StructType(Array(
      StructField("A", StringType, nullable = false),
      StructField("B", IntegerType, nullable = false)))


    val df = spark.read.format("com.crealytics.spark.excel")
      .option("useHeader", "true").schema(customSchema)
      .option("dataAddress", "A1")
      .load("/MY_PATH/src/main/resources/SampFile.xlsx")

    df.printSchema()
    df.show()
  }

}

My goal is to load the entire content of the HYPERLINK cell as a string, i.e. =HYPERLINK("https://google.com","View Link"), and then extract the URL https://google.com from it.

Is there a way to do this using the com.crealytics.spark.excel library or any other Spark library? Thanks in advance!

Solution 1:^[1]

Regarding the other question you linked in the comments: it tries to read the column as BinaryType and cast it directly to StringType. That is not possible out of the box (even in plain Scala), because you need to know how to interpret the bytes as a human-readable string, for instance which encoding to use.
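To illustrate why the encoding matters, here is a minimal sketch in plain Scala (no Spark required). The object name `DecodeDemo` is only for illustration. The same two bytes decode to different strings depending on the charset you pick:

```scala
import java.nio.charset.StandardCharsets

object DecodeDemo {
  // "é" (U+00E9) encoded as UTF-8 occupies two bytes: 0xC3 0xA9
  val bytes: Array[Byte] = "\u00e9".getBytes(StandardCharsets.UTF_8)

  // Decoding with the charset used for encoding round-trips correctly
  val asUtf8: String = new String(bytes, StandardCharsets.UTF_8)

  // Decoding with a different charset maps each byte to its own character
  val asLatin1: String = new String(bytes, StandardCharsets.ISO_8859_1)

  def main(args: Array[String]): Unit = {
    println(asUtf8)   // the original single character
    println(asLatin1) // two unrelated characters
  }
}
```

Calling `new String(bytes)` without a charset uses the JVM default, so passing the charset explicitly is usually the safer choice when the result must be the same on every machine.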

Now we know that we need to define a custom approach. I used a sample in-code dataframe, and this approach worked:

scala> import spark.implicits._
import spark.implicits._

scala> val df = Seq(
     |   ("ddd".getBytes, 1)
     | ).toDF("A", "B")
df: org.apache.spark.sql.DataFrame = [A: binary, B: int]

scala> val btos: Array[Byte] => String = bytes => new String(bytes) // short for bytes to string
btos: Array[Byte] => String = $Lambda$2322/665683021@738f6e44

scala> spark.udf.register("btos", btos)
res0: org.apache.spark.sql.expressions.UserDefinedFunction = SparkUserDefinedFunction($Lambda$2322/665683021@738f6e44,StringType,List(Some(class[value[0]: binary])),Some(btos),true,true)

scala> df.withColumn("C", expr("btos(A)")).show
+----------+---+---+
|         A|  B|  C|
+----------+---+---+
|[64 64 64]|  1|ddd|
+----------+---+---+
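Once the cell content is available as a string, the URL can be pulled out of the formula with a regular expression. The following is only a sketch: it assumes the full formula text from the question (not just the 'View Link' label) has actually reached the DataFrame, and `HyperlinkUrl` and `extractUrl` are hypothetical names, not part of the spark-excel library.

```scala
object HyperlinkUrl {
  // Matches HYPERLINK("<url>", ...) with an optional leading "=",
  // case-insensitively, and captures the first quoted argument.
  private val HyperlinkPattern =
    """(?i)^\s*=?\s*HYPERLINK\(\s*"([^"]*)".*$""".r

  // Returns Some(url) for a HYPERLINK formula, None for anything else
  // (including null, which Spark may pass in for empty cells).
  def extractUrl(formula: String): Option[String] =
    Option(formula).collect { case HyperlinkPattern(url) => url }

  def main(args: Array[String]): Unit = {
    val sample = "=HYPERLINK(\"https://google.com\",\"View Link\")"
    println(extractUrl(sample)) // Some(https://google.com)
  }
}
```

A function like this could be registered as a UDF in the same way as `btos` above, for example wrapped as `(s: String) => HyperlinkUrl.extractUrl(s).orNull`. Spark's built-in `regexp_extract` function with a similar pattern is another option.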

Hope this works for you.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	AminMal
