How to read files with .xlsx and .xls extensions in Azure Data Factory?
I am trying to read an Excel file with the .xlsx extension from Azure Blob Storage in my Azure Data Factory dataset. It throws the following error:
Error found when processing 'Csv/Tsv Format Text' source 'Filename.xlsx' with row number 3: found more columns than expected column count: 1.
What are the right column and row delimiters for Excel files to be read in Azure Data Factory?
Solution 1:[1]
Azure Data Factory V2 has recently released an update to support parsing Excel (.xls and .xlsx) files on existing connectors.
Currently, the connectors that support Excel files are:
- Amazon S3
- Azure Blob
- Azure Data Lake Storage Gen1
- Azure Data Lake Storage Gen2
- Azure File Storage
- File System
- FTP
- Google Cloud Storage
- HDFS
- HTTP
- SFTP
More details can be found here: https://docs.microsoft.com/en-us/azure/data-factory/format-excel
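As a rough sketch of what this looks like, an ADF dataset of type Excel points at the file location and names the worksheet; the dataset, linked service, container, and sheet names below are placeholders, so check the linked documentation for the exact properties:

```json
{
    "name": "ExcelDataset",
    "properties": {
        "type": "Excel",
        "linkedServiceName": {
            "referenceName": "AzureBlobLinkedService",
            "type": "LinkedServiceReference"
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "mycontainer",
                "fileName": "Filename.xlsx"
            },
            "sheetName": "Sheet1",
            "firstRowAsHeader": true
        }
    }
}
```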
Solution 2:[2]
Update March 2022: ADF now has better support for Excel via Mapping Data Flows:
https://docs.microsoft.com/en-us/azure/data-factory/format-excel
Excel files have a proprietary format and are not simple delimited files. As indicated here, Azure Data Factory does not have a direct option to import Excel files, e.g. you cannot create a Linked Service to an Excel file and read it easily. Your options are:
- Export or convert the data to flat files before transfer to the cloud, e.g. .csv, tab-delimited or pipe-delimited files, which are easier to read than Excel files. This is your simplest option, although it obviously requires a change in process.
- Try shredding the XML: create a custom task to open the Excel file as XML (an .xlsx file is a ZIP archive of XML parts) and extract your data, as suggested here.
- SSIS packages are now supported in Azure Data Factory (with the Execute SSIS Package activity) and have better support for Excel files, e.g. a Connection Manager, so it may be an option to create an SSIS package to handle the Excel file and host it in ADFv2. Warning! I have not tested this; I am only speculating that it is possible. There is also the overhead of creating an Integration Runtime (IR) to run SSIS in ADFv2.
- Try some other custom activity, e.g. there is a custom U-SQL extractor for shredding XML on GitHub here.
- Try reading the Excel file using Databricks (some examples here), although spinning up a Spark cluster to read a few Excel files does seem somewhat overkill. This might be a good option if Spark is already in your architecture.
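The XML-shredding option above can be sketched with nothing beyond the Python standard library, since an .xlsx file is a ZIP archive of XML parts. This is a minimal illustration under simplifying assumptions, not production code: it handles shared-string and plain numeric cells only, and ignores inline strings, dates, and gaps between columns.

```python
# Minimal .xlsx reader using only the standard library: an .xlsx file is
# a ZIP archive of XML parts, so we unzip it and walk the worksheet XML.
import zipfile
import xml.etree.ElementTree as ET

# SpreadsheetML namespace prefix used by all the XML parts we touch.
NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

def read_xlsx_rows(path, sheet="xl/worksheets/sheet1.xml"):
    """Return the cell values of one worksheet as a list of string rows."""
    with zipfile.ZipFile(path) as zf:
        # Text cells store an index into the shared-strings table.
        shared = []
        if "xl/sharedStrings.xml" in zf.namelist():
            sst = ET.fromstring(zf.read("xl/sharedStrings.xml"))
            shared = ["".join(t.text or "" for t in si.iter(f"{NS}t"))
                      for si in sst.iter(f"{NS}si")]
        rows = []
        for row in ET.fromstring(zf.read(sheet)).iter(f"{NS}row"):
            cells = []
            for c in row.iter(f"{NS}c"):
                v = c.find(f"{NS}v")
                text = v.text if v is not None else ""
                if c.get("t") == "s":  # shared-string cell: resolve the index
                    text = shared[int(text)]
                cells.append(text)
            rows.append(cells)
        return rows
```

A real extractor would also need to handle inline strings (`t="inlineStr"`), date serial numbers, and missing cells (via the `r` attribute), which is why the custom-activity route linked above is more work than it first appears.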
Let us know how you get on.
Solution 3:[3]
Azure Data Factory does not have a direct option to upload an Excel file; however, it can be transferred as a binary file through linked services with some configuration. Follow the steps below:
- Create a linked service to the source using the appropriate protocol (FTP, file share, SFTP, etc.).
- Create a linked service to your cloud storage.
- Add a Copy Data activity and define a file-system source dataset and a sink dataset using your previously defined linked services.
- In both the source and sink datasets, tick the Binary file option.
- Publish and execute your pipeline.
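The copy step above might look roughly like the following in pipeline JSON, assuming both referenced datasets were created with the Binary format; the activity and dataset names are placeholders, and real datasets would also carry store settings for their linked services:

```json
{
    "name": "CopyExcelAsBinary",
    "type": "Copy",
    "inputs": [
        { "referenceName": "SourceBinaryDataset", "type": "DatasetReference" }
    ],
    "outputs": [
        { "referenceName": "SinkBinaryDataset", "type": "DatasetReference" }
    ],
    "typeProperties": {
        "source": { "type": "BinarySource" },
        "sink": { "type": "BinarySink" }
    }
}
```

Because the copy is binary, the Excel file arrives byte-for-byte intact; ADF never tries to parse it, which is what avoids the delimiter error from the question.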
Solution 4:[4]
I would like to elaborate a bit more on option 3, SSIS packages. I tested the solution and it sort of worked. First of all, the SSIS IR is very expensive, and you will want to use HTTP POSTs to start and stop it while it is not needed.
Secondly, SSIS does not support processing blob files out of the box, but there are some ways around it. You can use third-party software (I didn't try that) or explore other options.
In my case I used a trick: I deployed a package that downloads the file locally, processes it and then deletes it. Since the SSIS IR is a VM, it basically did the same on the IR after deployment. It was a workaround rather than a solution, and it was not 100% stable. Another issue: numeric fields picked up small errors during processing, e.g. uploading 25 resulted in 25.0000000001 on the database side. No idea why; I did not have enough time to investigate further.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Diego Mendes |
| Solution 2 | |
| Solution 3 | user2029350 |
| Solution 4 | nokufano |
