Deploying to Posit Connect
For simply accessing data from Unity Catalog, such as in a Shiny app, we recommend using an ODBC connection instead of a Spark session. The advantage is that a connection to a Databricks Warehouse does not require a running cluster. For more information about creating dashboards backed by databases, visit the Databases using R site.
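As a sketch of that approach, the odbc package ships a Databricks helper that picks up DATABRICKS_HOST and DATABRICKS_TOKEN from the environment; the httpPath value below is a placeholder for your own SQL Warehouse path:

library(DBI)

con <- DBI::dbConnect(
  odbc::databricks(),
  httpPath = "/sql/1.0/warehouses/<your-warehouse-id>"  # placeholder path
)

DBI::dbListTables(con)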
However, there are cases when it is necessary to deploy a solution that requires a Spark session. For example, when there is a long-running job that needs to run on a schedule. Those kinds of jobs can be put inside a Quarto document and published to Posit Connect, where they can run on specific date/time intervals. Posit Connect supports Python environments, so it is an ideal platform to deploy these kinds of solutions.
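For illustration, a minimal sketch of such a Quarto document could look like this; the title is made up, and the cluster ID is read from an environment variable:

---
title: "Nightly Databricks job"
format: html
---

```{r}
library(sparklyr)
library(pysparklyr)

# Connect to the cluster that runs the scheduled job
sc <- spark_connect(
  cluster_id = Sys.getenv("DATABRICKS_CLUSTER_ID"),
  method = "databricks_connect"
)

# Long-running job code goes here

spark_disconnect(sc)
```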
Preparing for deployment
When deploying to Posit Connect, there are specific pieces of information that we need to make sure are sent over:
- Your cluster’s ID
- Your workspace URL
- Your token
- Your Python environment
- The pysparklyr extension
Based on the recommendations in this article, the cluster’s ID should be in the code of your document, and the workspace URL and token should already be in the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables. If you are using something other than these environment variables to authenticate, see the section Alternatives to Environment Variables.
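For reference, those two variables are typically defined in your .Renviron file; the values below are placeholders:

DATABRICKS_HOST="https://<your-workspace>.cloud.databricks.com"
DATABRICKS_TOKEN="<your-personal-access-token>"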
Make sure that your document has a library(pysparklyr) call. This lets the Posit Connect deployment process know that it needs to install this package and its dependencies, such as reticulate.
The next section introduces a function that makes it easier to find the Python environment.
library(sparklyr)
library(pysparklyr)
sc <- spark_connect(
  cluster_id = "1026-175310-7cpsh3g8",
  method = "databricks_connect"
)
Using deploy_databricks()
The deploy_databricks() function makes it easier to deploy your content to Posit Connect. It does its best to gather all of the pieces of information mentioned above, and builds the correct command to publish.
The path to your content is the last piece of information we need. Ideally, your content is either located in its own folder inside your project, or at the root level of your project. There are three ways that you can let deploy_databricks() know the path to use:
- If you are in RStudio and the document you wish to publish is open, deploy_databricks() will use the RStudio API to get the path of that document and then use its containing folder as the content’s location. This is the preferred method for deployment.
- Use the appDir argument to pass the path to be used. Something such as here::here("my-cool-document") would work (see the sketch after this list).
- If no document is open in RStudio and appDir is left empty, then getwd() will be used.
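For example, a call that deploys a specific folder could look like the following; the "my-cool-document" folder name is hypothetical:

pysparklyr::deploy_databricks(
  version = "14.1",
  appDir = here::here("my-cool-document")  # hypothetical folder in your project
)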
The aim of the new function is to be both flexible and helpful. It will gather your document location, credentials, and URLs. The Python location needs to be defined when calling deploy_databricks(). If you are in RStudio and your document is open, here are several ways to do this:
- If you know the cluster’s DBR version:

  pysparklyr::deploy_databricks(version = "14.1")
- The cluster’s ID can also be used, and pysparklyr will automatically determine the required DBR version:

  pysparklyr::deploy_databricks(cluster_id = "1026-175310-7cpsh3g8")
- If you just ran the code of the content you plan to deploy, then the Python environment will already be loaded in the R session. deploy_databricks() will validate that the path of that Python environment conforms to one that pysparklyr creates, and use it. This happens if no version, cluster_id, or python argument is provided. At that point, simply run:

  pysparklyr::deploy_databricks()
- You can also pass the path to the Python environment to use by setting the python argument:

  pysparklyr::deploy_databricks(python = "/Users/edgar/.virtualenvs/r-sparklyr-databricks-14.1")
- You can use the DATABRICKS_CLUSTER_ID environment variable. If you have it set, simply run:

  pysparklyr::deploy_databricks()
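As a sketch of that last option, the variable can be set for the current session before calling the function; in practice you would define it in .Renviron instead (the ID below is a placeholder):

Sys.setenv(DATABRICKS_CLUSTER_ID = "1026-175310-7cpsh3g8")  # placeholder ID
pysparklyr::deploy_databricks()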
Here is an example of the output returned when using deploy_databricks(). Before submitting your content, you will be prompted to confirm that the information gathered is correct. Also notice that if you have more than one Posit Connect server set up in your RStudio IDE, it will choose the top one as the default, but allow you to easily change it if necessary:
> pysparklyr::deploy_databricks(version = "14.1")
── Starting deployment ──────────────────────────────────────────────────────────────────────────────────────
ℹ Source directory: /Users/edgar/r_projects/practice/test-deploy
ℹ Python: /Users/edgar/.virtualenvs/r-sparklyr-databricks-14.1/bin/python
ℹ Posit server: colorado.posit.co
ℹ Account name: edgar
ℹ Host URL: rstudio-partner-posit-default.cloud.databricks.com
ℹ Token: '<REDACTED>'
Does everything look correct?

1: Yes
2: No
3: Change 'Posit server'

Selection: 1
The first time that you publish, the function will check to see if you have a requirements.txt file. If you do not have the file, it will ask if you wish to create it. The requirements.txt file contains the list of the Python libraries in your current Python environment, along with their versions. This will help when you re-publish your content, because you will not need to pass the version again.
Would you like to create the 'requirements.txt' file?
Why consider? This will allow you to skip using `version` or `cluster_id`

1: Yes
2: No
If you select No, deploy_databricks() will not ask you again when you re-deploy.
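For reference, the resulting requirements.txt file is a plain list of Python packages pinned to versions. Its contents will depend on your environment; the entries below are illustrative only:

# Illustrative entries; your environment determines the real list
databricks-connect==14.1.0
pandas==2.1.2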
Alternatives to Environment Variables
The deploy_databricks() function has host and token arguments. These arguments take precedence over the environment variables, if set.
There are a variety of reasons to set these arguments when publishing. For example, locally you might authenticate with a Databricks configuration file, but when deploying you will need to let deploy_databricks() know what values to use for the PAT and Host URL. Another example is that your deployed content may need to use a service account that differs from the credentials you use when developing.
As usual, we recommend that you avoid placing your credentials as plain-text values in your code. An effective way of managing local-vs-remote credentials is with the config package. Here is an example:
config.yml
default:
host_url: "[Your Host URL]"
token: "[Your Token]"
rsconnect:
host_url: "[Your Host URL]"
token: "[Service Token]"
R script
config <- config::get()

pysparklyr::deploy_databricks(
  version = "14.1",
  host = config$host_url,
  token = config$token
)
The integration between Posit Connect and config allows your deployed content to automatically use the values under the rsconnect section of the YAML file, instead of the values from the default section.
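This works because config::get() reads the section named by the R_CONFIG_ACTIVE environment variable, which Posit Connect sets to "rsconnect". A quick local sketch to verify the behavior:

# Simulate the environment that Posit Connect provides
Sys.setenv(R_CONFIG_ACTIVE = "rsconnect")

config <- config::get()
config$token  # returns the value from the `rsconnect` section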