# Databricks Connect

Nuvolos now offers a VSCode application with Python 3.9 and R 4.2 and Databricks Connect (`databrics-connect`) pre-installed. From this application, you can submit Spark jobs to Databrics-hosted Spark clusters.

[PySpark](https://spark.apache.org/docs/latest/api/python/) and [sparklyR](https://spark.rstudio.com/) are both installed in the application.

### Prerequisites

{% hint style="info" %}
[Databricks Connect](https://docs.databricks.com/dev-tools/databricks-connect.html) supports Databricks clusters up to 10.4 LTS.
{% endhint %}

To configure the connection, you need:

* the URL of your Databricks cluster, and
* a [personal access token](https://docs.databricks.com/dev-tools/api/latest/authentication.html#token-management).\
  Personal access tokens are not available in Databricks Community Edition.

To connect from Nuvolos:

1. Create a **Databricks 10.4 LTS + Py39 + R 4.2** application.
2. Start the application.
3. Open a terminal and run:

   ```
   databricks-connect configure
   ```
4. Enter the [Databricks cluster URL](https://docs.databricks.com/dev-tools/databricks-connect.html#step-2-configure-connection-properties) and your [personal access token](https://docs.databricks.com/dev-tools/api/latest/authentication.html#token-management) when prompted.
5. Verify the setup with:

   ```
   databricks-connect test
   ```

### Python example

To run the example, please install the `slugify` Python package with the following command:

`conda install -y -c conda-forge python-slugify`

Once you have configured the Databricks connection, you can try the following simple example to create a Databricks table and run a SQL query on the table:

```python
import pandas as pd
from pyspark.sql import SparkSession
from slugify import slugify

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled","true")

# Use NYC Squirrel Census data
df = pd.read_csv("https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD")
df.columns = [slugify(c, separator="_") for c in df.columns]
df = df.drop(columns=["vehicle_location", "electric_utility"])

df = spark.createDataFrame(df)
df.write.mode('overwrite').saveAsTable('ev_data')

spark.sql('select make, model, count(*) as registered from ev_data group by make, model order by registered desc').show(10)
```

### R example

The `sparklyr` package is pre-installed in the application which allows you to connect to Databricks Spark clusters, configured with `databricks-connect.`

You can run the following R script example to run a simple job on your Databricks cluster:

```r
library(sparklyr)
library(dplyr)

databricks_connect_spark_home <- system("databricks-connect get-spark-home", intern = TRUE)
sc <- spark_connect(method = "databricks", spark_home = databricks_connect_spark_home)

cars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

cars_tbl %>% 
  group_by(cyl) %>% 
  summarise(
    mean_mpg = mean(mpg, na.rm = TRUE),
    mean_hp  = mean(hp, na.rm = TRUE)
    )

print(as_tibble(cars_tbl), n = 10)

spark_disconnect(sc)
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.nuvolos.com/user-guides/application-specific-guides/databricks-connect.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
