Azure Databricks API: how to attach a cluster to an uploaded notebook via the API

I am using Python 3.6 to run the following script, which uploads a local file to a specific folder in an Azure Databricks workspace. I have followed the documentation in this link. With the script below I have been able to upload the notebook. However, I could not figure out how to attach a cluster to this notebook after it is uploaded via the API, and I could not find it in the documentation. Is it possible at all? If yes, how?

The Python script I am using is as follows:

import requests
import os
from os.path import isfile, join, splitext
from os import listdir
import base64


dbrks_create_dir_url = "https://" + os.environ['DBRKS_INSTANCE'] + ".azuredatabricks.net/api/2.0/workspace/mkdirs"
dbrks_import_rest_url = "https://" + os.environ['DBRKS_INSTANCE'] + ".azuredatabricks.net/api/2.0/workspace/import"


DBRKS_REQ_HEADERS = {
    'Authorization': 'Bearer ' + os.environ['DBRKS_BEARER_TOKEN'],
    'X-Databricks-Azure-Workspace-Resource-Id': '/subscriptions/' + os.environ['DBRKS_SUBSCRIPTION_ID'] + '/resourceGroups/' + os.environ['DBRKS_RESOURCE_GROUP'] + '/providers/Microsoft.Databricks/workspaces/' + os.environ['DBRKS_WORKSPACE_NAME'],
    'X-Databricks-Azure-SP-Management-Token': os.environ['DBRKS_MANAGEMENT_TOKEN']}

## This is the local folder containing the notebooks to upload.
path = os.environ['DefaultWorkingDirectory'] + "/notebooks/"

onlyfiles = [f for f in listdir(path) if isfile(join(path, f))]
for file in onlyfiles:
    # join() avoids the double slash that `path + "/" + file` would
    # produce, since `path` already ends with "/".
    fileLocation = join(path, file)

    if file.endswith(".py"):
        # The import API expects the notebook content base64-encoded.
        with open(fileLocation, "rb") as f:
            encoded = base64.b64encode(f.read()).decode("ascii")

        # Strip the extension to get the notebook name in the workspace.
        fileName = splitext(file)[0]
        print(fileName)
        response = requests.post(
            dbrks_import_rest_url,
            headers=DBRKS_REQ_HEADERS,
            # The import API accepts a JSON body with the base64 content.
            json={'path': '/Users/myuser/' + fileName,
                  'language': 'PYTHON',
                  'format': 'SOURCE',
                  'overwrite': True,
                  'content': encoded})

        # Check the response inside the if-block, so `response` is never
        # referenced when a non-.py file was skipped.
        if response.status_code == 200:
            print(response.json())  # json() is a method, not an attribute
        else:
            raise Exception(response.content)
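
Note that dbrks_create_dir_url is defined but never called. If the target folder might not exist yet, here is a minimal sketch of creating it first with the same headers (the /Users/myuser path matches the example path used above):

# Create the target workspace folder; the mkdirs endpoint succeeds
# even if the folder already exists.
response = requests.post(
    dbrks_create_dir_url,
    headers=DBRKS_REQ_HEADERS,
    json={"path": "/Users/myuser"})
response.raise_for_status()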

All the OS environment variables are set by my Azure DevOps pipeline. However, you don't need to execute the script from a pipeline: you can run it from your local machine as long as you have a service principal with access to a Databricks workspace. To run the Python script locally, replace those environment variables with your own credentials.

Explaining the variables in the script:

  • os.environ['DBRKS_INSTANCE']: Name of the Azure Databricks instance.
  • os.environ['DBRKS_BEARER_TOKEN']: The bearer token. You need it to authenticate your service principal or your user to Databricks. How to obtain it is explained below.
  • os.environ['DBRKS_MANAGEMENT_TOKEN']: If the service principal you are using is not added as a Databricks workspace user or admin, you need this token as well. How to obtain it is explained below.
  • os.environ['DBRKS_SUBSCRIPTION_ID']: The Azure subscription Id that contains the Databricks workspace.
  • os.environ['DBRKS_RESOURCE_GROUP']: Name of the Azure resource group of the Databricks workspace.
  • os.environ['DBRKS_WORKSPACE_NAME']: Name of the Azure Databricks workspace.
  • os.environ["DBRKS_CLUSTER_ID"]: The Id of the cluster that will execute the job in Databricks (see the sketch after this list).
  • os.environ['DefaultWorkingDirectory']: Replace it with the path on your local machine to the folder containing a sample notebook file. A notebook file here is just a file with a .py extension; it can contain a comment or simply print("Hello World!").
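
The upload script above never actually uses DBRKS_CLUSTER_ID. As far as I can tell, the Workspace import API only stores the notebook; a cluster becomes associated with it when the notebook is run. A minimal sketch of running an uploaded notebook on the existing cluster through the Jobs runs/submit endpoint, reusing DBRKS_REQ_HEADERS from the script above (the notebook path is a placeholder):

# Submit a one-time run of an uploaded notebook on an existing cluster.
dbrks_submit_run_url = "https://" + os.environ['DBRKS_INSTANCE'] + ".azuredatabricks.net/api/2.0/jobs/runs/submit"

payload = {
    "run_name": "run-uploaded-notebook",
    "existing_cluster_id": os.environ["DBRKS_CLUSTER_ID"],
    # Placeholder: path of one of the notebooks imported above.
    "notebook_task": {"notebook_path": "/Users/myuser/my_notebook"},
}
response = requests.post(dbrks_submit_run_url, headers=DBRKS_REQ_HEADERS, json=payload)
response.raise_for_status()
print(response.json())  # contains the run_id of the submitted one-time run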

One final point to make the above script run: to get the values for DBRKS_BEARER_TOKEN and DBRKS_MANAGEMENT_TOKEN, you can run the following script and manually replace os.environ['DBRKS_BEARER_TOKEN'] and os.environ['DBRKS_MANAGEMENT_TOKEN'] with the values printed after execution:

import requests
import os


TOKEN_BASE_URL = 'https://login.microsoftonline.com/' + os.environ['SVCDirectoryID'] + '/oauth2/token'
TOKEN_REQ_HEADERS = {'Content-Type': 'application/x-www-form-urlencoded'}


def get_aad_token(resource):
    """Request an AAD access token for the given resource via the
    client-credentials flow. The token endpoint requires a POST request."""
    body = {
        'grant_type': 'client_credentials',
        'client_id': os.environ['SVCApplicationID'],
        'client_secret': os.environ['SVCSecretKey'],
        'resource': resource}
    response = requests.post(TOKEN_BASE_URL, headers=TOKEN_REQ_HEADERS, data=body)
    if response.status_code != 200:
        raise Exception(response.text)
    return response.json()['access_token']


def dbrks_management_token():
    return get_aad_token('https://management.core.windows.net/')


def dbrks_bearer_token():
    # This GUID is the fixed AAD resource Id of the Azure Databricks service.
    return get_aad_token('2ff814a6-3304-4ab8-85cb-cd0e6f879c1d')


DBRKS_BEARER_TOKEN = dbrks_bearer_token()
DBRKS_MANAGEMENT_TOKEN = dbrks_management_token()

os.environ['DBRKS_BEARER_TOKEN'] = DBRKS_BEARER_TOKEN
os.environ['DBRKS_MANAGEMENT_TOKEN'] = DBRKS_MANAGEMENT_TOKEN

print("DBRKS_BEARER_TOKEN", os.environ['DBRKS_BEARER_TOKEN'])
print("DBRKS_MANAGEMENT_TOKEN", os.environ['DBRKS_MANAGEMENT_TOKEN'])
  • SVCDirectoryID is the Azure Active Directory (AAD) tenant Id of the service principal.
  • SVCApplicationID is the client (application) Id of the AAD service principal.
  • SVCSecretKey is the client secret of the AAD service principal.
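
To check that the tokens work before running the upload script, here is a minimal sketch that lists the workspace root (it assumes DBRKS_REQ_HEADERS has been built as in the first script):

# Sanity check: list the workspace root. A 200 response means the tokens
# and the workspace headers are accepted.
dbrks_list_url = "https://" + os.environ['DBRKS_INSTANCE'] + ".azuredatabricks.net/api/2.0/workspace/list"

response = requests.get(dbrks_list_url, headers=DBRKS_REQ_HEADERS, params={"path": "/"})
response.raise_for_status()
for obj in response.json().get("objects", []):
    print(obj["object_type"], obj["path"])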

Thank you for your valuable input.



Solution 1:[1]

Thank you, user Alex Ott (Stack Overflow). Posting your suggestions as an answer to help other community members.

You can use the Nutter framework for testing Databricks notebooks, and use Repos instead of uploading notebooks through the Workspace API.
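
For reference, a minimal Nutter test fixture following the pattern shown in the Nutter README (the notebook name notebook_under_test is a placeholder, and dbutils is only available inside a Databricks runtime):

# Runs inside a Databricks notebook: each test case is a run_/assertion_ pair.
from runtime.nutterfixture import NutterFixture

class MyTestFixture(NutterFixture):
    def run_test_notebook(self):
        # Execute the notebook under test with a 600-second timeout.
        dbutils.notebook.run('notebook_under_test', 600)

    def assertion_test_notebook(self):
        # Replace with real assertions about the notebook's results.
        assert True

result = MyTestFixture().execute_tests()
print(result.to_string())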

Your Databricks workspace needs to have the Repos functionality enabled; if it is, you will see a "Repos" icon in the navigation panel. Then:

  • Fork the repository into your environment, GitHub or Azure DevOps (follow the Databricks documentation on using it)
  • In Repos, click "Create Repo" and link it to the Git repository that you forked (an API sketch follows this list)
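
Creating the repo can also be done programmatically. A minimal sketch using the Repos API (the repository URL and the target path are placeholders, and DBRKS_REQ_HEADERS is the header dictionary built in the question's script):

import os
import requests

# POST /api/2.0/repos clones a Git repository into the workspace's Repos area.
dbrks_repos_url = "https://" + os.environ['DBRKS_INSTANCE'] + ".azuredatabricks.net/api/2.0/repos"

payload = {
    "url": "https://github.com/myorg/my-notebooks-repo",  # placeholder fork URL
    "provider": "gitHub",
    "path": "/Repos/myuser/my-notebooks-repo",            # placeholder target path
}
response = requests.post(dbrks_repos_url, headers=DBRKS_REQ_HEADERS, json=payload)
response.raise_for_status()
print(response.json())  # contains the repo id and the checked-out branch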


References: microsoft/nutter (testing framework for Databricks notebooks) and alexott/databricks-nutter-repos-demo (demo of using Nutter for testing Databricks notebooks in a CI/CD pipeline).

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: MadhurajVadde-MT