'How do we access a file in github repo inside our azure databricks notebook
We have a requirement where we need to access a file hosted on our github private repo in our Azure Databricks notebook. Currently we are doing it using curl command using the Personal Access Token of a user.
curl -H 'Authorization: token INSERTACCESSTOKENHERE' -H 'Accept:
application/vnd.github.v3.raw' -O -L
https://api.github.com/repos/*owner*/*repo*/contents/*path*
Is there a way we can avoid the use of PAT and use deploy keys or anything?
Solution 1:[1]
From summer 2021 databricks has introduced integration of git repos functionality. More info here: https://docs.microsoft.com/en-us/azure/databricks/repos
If you add your file (excel, json etc.) in the repo, then you can use a relative path to access it and read it.
e.g. pd.read_excel("./test_data.xlsx")
Be aware that you need a cluster with a databricks version 8.4+ (or 9.1+?)
You can also test what is your current working directory by executing the following command. os.getcwd()
If you have correctly integrated the repo then your result should be something like:
/Workspace/Repos/[email protected]/REPO_FOLDER/analysis
otherwise it will be something like: /databricks/driver
Solution 2:[2]
Integrate Git and azure databricks.
This documentation shows how to integrate Git and azure databricks
Step1: Get raw URL of the File.
Step2: Use wget to access the file:
wget https://github.com/githubtraining/hellogitworld/blob/master/resources/labels.properties
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | AbhishekKhandave-MT |
