Deploying to Posit Connect

For simply accessing data from Unity Catalog, such as in a Shiny app, we recommend using an ODBC connection instead of a Spark session. The advantage of this is that a connection to a Databricks SQL Warehouse does not require a running cluster. For more information about creating dashboards with databases, visit the Databases with R site.
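
As a minimal sketch of that approach, assuming a recent version of the odbc package (which provides the databricks() helper) and that the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables are set; the warehouse HTTP path and table name below are placeholders:

library(DBI)

# Connect to a Databricks SQL Warehouse over ODBC; no running cluster is needed.
# odbc::databricks() reads DATABRICKS_HOST and DATABRICKS_TOKEN from the environment.
con <- dbConnect(
  odbc::databricks(),
  httpPath = "/sql/1.0/warehouses/<warehouse-id>"  # placeholder HTTP path
)

# Query a Unity Catalog table directly
result <- dbGetQuery(con, "SELECT * FROM my_catalog.my_schema.my_table LIMIT 10")

dbDisconnect(con)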

However, there are cases when it is necessary to deploy a solution that requires a Spark session, for example a long-running job that needs to run on a schedule. Those kinds of jobs can be placed inside a Quarto document and published to Posit Connect, where they can run at specific date/time intervals. Posit Connect supports Python environments, so it is an ideal platform for deploying these kinds of solutions.
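
For illustration, a minimal sketch of such a scheduled Quarto document might look like the following; the title is a placeholder, and the cluster ID matches the example used later in this article:

---
title: "Nightly Databricks job"
---

```{r}
library(sparklyr)
library(pysparklyr)

# Connect to the Databricks cluster
sc <- spark_connect(
  cluster_id = "1026-175310-7cpsh3g8",
  method = "databricks_connect"
)

# Long-running, scheduled work goes here

spark_disconnect(sc)
```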

Preparing for deployment

When deploying to Posit Connect, there are specific pieces of information that we need to make sure are sent over:

  • Your cluster’s ID
  • Your workspace URL
  • Your token
  • Your Python environment
  • To use the pysparklyr extension

Based on the recommendations in this article, the cluster's ID should be in the code of your document, and the workspace URL and token should already be in the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables. If you are using something other than these environment variables to authenticate, see the section Alternatives to Environment Variables.
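
For example, those two environment variables can be defined in your .Renviron file; the values shown here are placeholders:

# .Renviron -- avoid committing real credentials to version control
DATABRICKS_HOST=<Your workspace URL>
DATABRICKS_TOKEN=<Your personal access token>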

Make sure that your document has a library(pysparklyr) call. This will let the Posit Connect deployment process know that it needs to install this package and its dependencies, such as reticulate.

The next section introduces a function that eases finding the Python environment. For reference, here is the connection code that embeds the cluster's ID in the document:

library(sparklyr)
library(pysparklyr)

sc <- spark_connect(
    cluster_id = "1026-175310-7cpsh3g8",
    method = "databricks_connect"
)

Using deploy_databricks()

The deploy_databricks() function makes it easier to deploy your content to Posit Connect. It does its best to gather all of the pieces of information mentioned above and builds the correct command to publish.

The path to your content is the last piece of information we need. Ideally your content is either located in its own folder inside your project or it is at the root level of your project. There are three ways that you can let deploy_databricks() know the path to use:

  • If you are in RStudio and the document you wish to publish is open, deploy_databricks() will use the RStudio API to get the path of that document and then use its containing folder as the content’s location. This is the preferred method for deployment.

  • Use the appDir argument to pass the path to be used. Something such as here::here("my-cool-document") would work, as shown in the example after this list.

  • If no document is open in RStudio and appDir is left empty, then getwd() will be used.
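
For example, a sketch of passing the path explicitly via appDir; the folder name is a placeholder, and the version argument is explained in the next section:

pysparklyr::deploy_databricks(
  appDir = here::here("my-cool-document"),
  version = "14.1"
)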

The aim of the new function is to be both flexible and helpful. It will gather your document location, credentials, and URLs. The Python location needs to be defined when calling deploy_databricks(). If you are in RStudio and your document is open, here are several ways to do this:

  • If you know the cluster’s DBR version:

    pysparklyr::deploy_databricks(version = "14.1")
  • The cluster’s ID can also be used and pysparklyr will automatically determine the required DBR version:

    pysparklyr::deploy_databricks(cluster_id = "1026-175310-7cpsh3g8")
  • If you just ran the code of the content you plan to deploy, then the Python environment will already be loaded in the R session. deploy_databricks() will validate that the path of that Python environment conforms to one that pysparklyr creates and use that. This will happen if no version, cluster_id, or python argument is provided. At that point simply run:

    pysparklyr::deploy_databricks()
  • You can also pass the path to the Python environment to use by setting the python argument:

    pysparklyr::deploy_databricks(python = "/Users/edgar/.virtualenvs/r-sparklyr-databricks-14.1")
  • You can use the DATABRICKS_CLUSTER_ID environment variable. If you have it set, simply run:

    pysparklyr::deploy_databricks()

Here is an example of the output returned when using deploy_databricks(). Before submitting your content, you will be prompted to confirm that the information gathered is correct. Also notice that if you have more than one Posit Connect server set up in your RStudio IDE, it will choose the top one as the default, but allow you to easily change it if necessary:

> pysparklyr::deploy_databricks(version = "14.1")

── Starting deployment ──────────────────────────────────────────────────────────────────────────────────────
ℹ Source directory: /Users/edgar/r_projects/practice/test-deploy
ℹ Python: /Users/edgar/.virtualenvs/r-sparklyr-databricks-14.1/bin/python
ℹ Posit server: colorado.posit.co
  Account name: edgar
ℹ Host URL: rstudio-partner-posit-default.cloud.databricks.com
  Token: '<REDACTED>'

Does everything look correct?

1: Yes
2: No
3: Change 'Posit server'

Selection: 1

The first time that you publish, the function will check to see if you have a requirements.txt file. If you do not have one, it will ask if you wish to create it. The requirements.txt file lists the packages in your current Python environment, along with their versions. This helps when you re-publish your content, because you will not need to pass the version again.

Would you like to create the 'requirements.txt' file?
Why consider? This will allow you to skip using `version` or `cluster_id`

1: Yes
2: No

If you select No, deploy_databricks() will not ask you again when you re-deploy.
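
For reference, the resulting requirements.txt is a plain-text list of Python packages and version pins. A simplified, hypothetical example; the exact packages and versions will depend on your environment:

databricks-connect==14.1.0
pandas==2.1.4
pyarrow==14.0.2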

Alternatives to Environment Variables

The deploy_databricks() function has host and token arguments. These arguments take precedence over the environment variables, if set.

There are a variety of reasons to set these arguments when publishing. For example, locally you may authenticate with a Databricks configuration file, but when deploying you will need to tell deploy_databricks() what values to use for the PAT and Host URL. Another example is that your deployed content may need to use a service account that differs from the credentials you use when developing.

As usual, we recommend that you avoid placing your credentials as plain text in your code. An effective way of managing local-versus-remote credentials is with the config package. Here is an example:

config.yml

default:
  host_url: "[Your Host URL]"
  token: "[Your Token]"

rsconnect:
  host_url: "[Your Host URL]"
  token: "[Service Token]"

R script

config <- config::get()

pysparklyr::deploy_databricks(
  version = "14.1", 
  host = config$host_url,
  token = config$token
  )

The integration between Posit Connect and config allows your deployed content to automatically use the values under the rsconnect section of the YAML file, instead of the values from the default section.
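
This works because the config package selects the active configuration from the R_CONFIG_ACTIVE environment variable, which Posit Connect sets to "rsconnect" for deployed content. A minimal sketch of that selection, simulated locally:

# Posit Connect sets R_CONFIG_ACTIVE to "rsconnect" for deployed content;
# setting it locally switches config::get() to the 'rsconnect' section
Sys.setenv(R_CONFIG_ACTIVE = "rsconnect")

config <- config::get()
config$host_url  # value from the 'rsconnect' section of config.yml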