spark.read.excel - Not reading all Excel rows when using custom schema
I am trying to read a Spark DataFrame from an Excel file using the crealytics spark-excel dependency.
Without any predefined schema, all rows are read correctly, but every column comes back as a string.
To prevent that, I am using my own schema (in which certain columns are declared as IntegerType), but in that case most of the rows are dropped while the file is being read.
The library dependency used in build.sbt:
"com.crealytics" %% "spark-excel" % "0.11.1",
Scala version - 2.11.8
Spark version - 2.3.2
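For reference, a minimal build.sbt sketch combining the versions listed above; the spark-sql dependency is an assumption about the rest of the build, marked provided as is typical for Spark jobs:
// Sketch only, not the asker's actual build file
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "2.3.2" % "provided", // assumed; matches the stated Spark version
  "com.crealytics"   %% "spark-excel" % "0.11.1"
)
Reading the file without a predefined schema: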
val inputDF = sparkSession.read.excel(useHeader = true).load(inputLocation(0))
The above reads all the data - around 25000 rows.
But,
val inputWithSchemaDF: DataFrame = sparkSession.read
  .format("com.crealytics.spark.excel")
  .option("useHeader", "false")
  .option("inferSchema", "false")
  .option("addColorColumns", "true")
  .option("treatEmptyValuesAsNulls", "false")
  .option("keepUndefinedRows", "true")
  .option("maxRowsInMemory", 2000)
  .schema(templateSchema)
  .load(inputLocation)
This gives me only 450 rows. Is there a way to prevent that? Thanks in advance!
Solution 1:[1]
As of now, I haven't found a fix for the problem itself, but I worked around it by manually type-casting the columns after reading. To keep the number of lines of code down, I used a for loop over the columns. My solution is as follows:
Step 1: Create my own schema as a StructType:
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val requiredSchema = new StructType()
  .add("ID", IntegerType, true)
  .add("Vendor", StringType, true)
  .add("Brand", StringType, true)
  .add("Product Name", StringType, true)
  .add("Net Quantity", StringType, true)
Step 2: Type-cast the DataFrame AFTER it has been read from the Excel file WITHOUT the custom schema (instead of applying the schema while reading the data):
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

def convertInputToDesiredSchema(inputDF: DataFrame, requiredSchema: StructType)(implicit sparkSession: SparkSession): DataFrame = {
  var schemaDf: DataFrame = inputDF
  // Cast only the columns whose type differs from the required schema
  for (i <- inputDF.columns.indices) {
    if (inputDF.schema(i).dataType.typeName != requiredSchema(i).dataType.typeName) {
      schemaDf = schemaDf.withColumn(schemaDf.columns(i), col(schemaDf.columns(i)).cast(requiredSchema(i).dataType))
    }
  }
  schemaDf
}
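For context, a minimal sketch of how the helper above might be called; it assumes the requiredSchema from Step 1 and an inputLocation array like the one in the question are already in scope:
import org.apache.spark.sql.{DataFrame, SparkSession}

// The SparkSession is picked up implicitly by convertInputToDesiredSchema
implicit val sparkSession: SparkSession = SparkSession.builder()
  .appName("excel-typecast-example")
  .getOrCreate()

// Read everything as strings first (no custom schema) ...
val rawDF: DataFrame = sparkSession.read
  .format("com.crealytics.spark.excel")
  .option("useHeader", "true")
  .load(inputLocation(0))

// ... then cast only the mismatched columns to the required types
val typedDF: DataFrame = convertInputToDesiredSchema(rawDF, requiredSchema)
typedDF.printSchema()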
This might not be the most efficient solution, but it is better than writing out a separate cast for every column.
I am still looking for an answer to my original question.
This workaround is just for anyone who wants to try it and is in immediate need of a quick fix.
Solution 2:[2]
Here's a workaround in PySpark: read everything as strings first, then apply a schema file consisting of field name and data type pairs:
# 1st: load the dataframe with StringType for all columns
# (isHeaderOn, xlsxAddress1 and inputfile are defined elsewhere by the answerer)
import pandas as pd
from pyspark.sql.types import *
from pyspark.sql.functions import col

input_df = spark.read.format("com.crealytics.spark.excel") \
    .option("header", isHeaderOn) \
    .option("treatEmptyValuesAsNulls", "true") \
    .option("dataAddress", xlsxAddress1) \
    .option("setErrorCellsToFallbackValues", "true") \
    .option("ignoreLeadingWhiteSpace", "true") \
    .option("ignoreTrailingWhiteSpace", "true") \
    .load(inputfile)
# 2nd: modify the datatypes within the dataframe using a file containing
# column names and the expected data types
dtypes = pd.read_csv("/dbfs/mnt/schema/{}".format(file_schema_location), header=None).to_records(index=False).tolist()

# Build a StructType from the (name, type) pairs read from the schema file
fields = [StructField(dtype[0], globals()[f'{dtype[1]}']()) for dtype in dtypes]
schema = StructType(fields)

# Cast each column to its expected type, e.g. "IntegerType" -> "Integer"
for dt in dtypes:
    colname = dt[0]
    coltype = dt[1].replace("Type", "")
    input_df = input_df.withColumn(colname, col(colname).cast(coltype))
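The answer does not show what the schema file looks like; the sketch below assumes it is a headerless two-column CSV of column name and Spark type name (e.g. ID,IntegerType), and gives a rough Scala equivalent of the same cast-in-a-loop idea for readers on the Scala API. Here stringDF stands for the all-string DataFrame read from the Excel file, and the file path is hypothetical:
import scala.io.Source
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Assumed schema file contents (no header row), e.g.:
//   ID,IntegerType
//   Vendor,StringType
//   Net Quantity,StringType
val dtypes: Seq[(String, String)] =
  Source.fromFile("/dbfs/mnt/schema/column_types.csv") // hypothetical path
    .getLines()
    .map(_.split(",", 2))
    .map(parts => (parts(0).trim, parts(1).trim))
    .toSeq

// Cast each column; "IntegerType" becomes "Integer", which cast(String) accepts
val typedDF: DataFrame = dtypes.foldLeft(stringDF) { case (df, (name, tpe)) =>
  df.withColumn(name, col(name).cast(tpe.replace("Type", "")))
}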
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Hema Priya Velaga |
| Solution 2 | tjheslin1 |
