
Commit a9122d0

yyassi-heartex, robot-ci-heartex, and makseq authored and committed
feat: BROS-353: Databricks storage integration (#8245) (#8542)
Co-authored-by: robot-ci-heartex <87703623+robot-ci-heartex@users.noreply.github.com>
Co-authored-by: makseq <makseq@gmail.com>
Co-authored-by: robot-ci-heartex <robot-ci-heartex@users.noreply.github.com>
Co-authored-by: makseq <makseq@users.noreply.github.com>
1 parent 6c15e61 commit a9122d0

File tree

1 file changed: +43 −103 lines


docs/source/guide/storage.md

Lines changed: 43 additions & 103 deletions
````diff
@@ -13,37 +13,13 @@ section: "Import & Export"
 
 Integrate popular cloud and external storage systems with Label Studio to collect new items uploaded to the buckets, containers, databases, or directories and return the annotation results so that you can use them in your machine learning pipelines.
 
-<div class="opensource-only">
-
-| Storage | Community | Enterprise |
-|---|---|---|
-| [Amazon S3](#Amazon-S3) |||
-| [Amazon S3 with IAM role](https://docs.humansignal.com/guide/storage#Set-up-an-S3-connection-with-IAM-role-access) |||
-| [Google Cloud Storage](#Google-Cloud-Storage) |||
-| [Google Cloud Storage WIF Auth](https://docs.humansignal.com/guide/storage#Google-Cloud-Storage-with-Workload-Identity-Federation-WIF) |||
-| [Microsoft Azure Blob Storage](#Microsoft-Azure-Blob-storage) |||
-| [Microsoft Azure Blob Storage with Service Principal](https://docs.humansignal.com/guide/storage#Azure-Blob-Storage-with-Service-Principal-authentication) |||
-| [Databricks Files (UC Volumes)](https://docs.humansignal.com/guide/storage#Databricks-Files-UC-Volumes) |||
-| [Redis database](#Redis-database) |||
-| [Local storage](#Local-storage) |||
-
-</div>
-
-<div class="enterprise-only">
-
-| Storage | Community | Enterprise |
-|---|---|---|
-| [Amazon S3](#Amazon-S3) |||
-| [Amazon S3 with IAM role](#Set-up-an-S3-connection-with-IAM-role-access) |||
-| [Google Cloud Storage](#Google-Cloud-Storage) |||
-| [Google Cloud Storage WIF Auth](#Google-Cloud-Storage-with-Workload-Identity-Federation-WIF) |||
-| [Microsoft Azure Blob Storage](#Microsoft-Azure-Blob-storage) |||
-| [Microsoft Azure Blob Storage with Service Principal](#Azure-Blob-Storage-with-Service-Principal-authentication) |||
-| [Databricks Files (UC Volumes)](#Databricks-Files-UC-Volumes) |||
-| [Redis database](#Redis-database) |||
-| [Local storage](#Local-storage) (on-prem only) |||
-
-</div>
+Set up the following cloud and other storage systems with Label Studio:
+- [Amazon S3](#Amazon-S3)
+- [Google Cloud Storage](#Google-Cloud-Storage)
+- [Microsoft Azure Blob storage](#Microsoft-Azure-Blob-storage)
+- [Redis database](#Redis-database)
+- [Local storage](#Local-storage) <div class="enterprise-only">(for On-prem only)</div>
+- [Databricks Files (UC Volumes)](#Databricks-Files-UC-Volumes)
 
 
 ## Troubleshooting
````
````diff
@@ -1294,7 +1270,7 @@ Complete the following fields and then click **Test connection**:
 | | |
 | --- | --- |
 | Storage Title | Enter a name for the storage connection to appear in Label Studio. |
-| Storage Name | Enter the name of your storage account. |
+| Storage Name | Enter the name of your Azure storage account. |
 | Container Name | Enter the name of a container within the Azure storage account. |
 | Tenant ID | Specify the **Directory (tenant) ID** from your App Registration. |
 | Client ID | Specify the **Application (client) ID** from your App Registration. |
````
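The hunk above edits the Azure Service Principal connection table (storage account, container, tenant, client). As a hedged sketch of how those fields map onto an actual credential check, assuming the `azure-identity` and `azure-storage-blob` packages; all names below are placeholders, not values from this diff:

```python
# Sketch: verify that a Service Principal (Tenant ID / Client ID / secret,
# as in the connection table) can list blobs in the configured container.
def account_url(storage_name: str) -> str:
    # The "Storage Name" field corresponds to this account endpoint.
    return f"https://{storage_name}.blob.core.windows.net"

def list_container(storage_name: str, container: str,
                   tenant_id: str, client_id: str, client_secret: str) -> list:
    # Imports are local so the pure URL helper above works without the SDKs.
    from azure.identity import ClientSecretCredential
    from azure.storage.blob import BlobServiceClient

    cred = ClientSecretCredential(tenant_id, client_id, client_secret)
    service = BlobServiceClient(account_url(storage_name), credential=cred)
    return [b.name for b in service.get_container_client(container).list_blobs()]
```

If "Test connection" fails in the UI, running a listing like this outside Label Studio helps separate credential problems from network ones.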
````diff
@@ -1516,92 +1492,60 @@ If you're using Label Studio in Docker, you need to mount the local directory th
 
 <div class="enterprise-only">
 
-Connect Label Studio Enterprise to Databricks Unity Catalog (UC) Volumes to import files as tasks and export annotations as JSON back to your volumes. This connector uses the Databricks Files API and operates only in proxy mode (presigned URLs are not supported by Databricks).
+Connect Label Studio Enterprise to Databricks Unity Catalog (UC) Volumes to import files as tasks and export annotations as JSON back to your volumes. This connector uses the Databricks Files API and operates only in proxy mode, because presigned URLs are not supported by Databricks.
 
 ### Prerequisites
-- A Databricks workspace URL (Workspace Host), for example `https://adb-12345678901234.1.databricks.com` (or Azure domain).
-
-    See [Create a workspace](https://docs.databricks.com/aws/en/admin/workspace/) and [Get identifiers for workspace objects](https://docs.databricks.com/aws/en/workspace/workspace-details#workspace-url).
-- A Databricks Personal Access Token (PAT) with permission to access the Files API.
-
-    You can generate tokens from **Settings > Developer**. See [Databricks personal access token authentication](https://docs.databricks.com/en/dev-tools/auth/pat.html).
-- A UC Volume path under `/Volumes/<catalog>/<schema>/<volume>` with files you want to label.
-
-    See [What are Unity Catalog volumes?](https://docs.databricks.com/aws/en/volumes/).
-
-### Create a source storage connection in the Label Studio UI
-
-From Label Studio, open your project and select **Settings > Cloud Storage > Add Source Storage**.
-
-Select **Databricks Files (UC Volumes)** and click **Next**.
-
-#### Configure Connection
-
-Complete the following fields and then click **Test connection**:
-
-<div class="noheader rowheader">
-
-| | |
-| --- | --- |
-| Storage Title | Enter a name for the storage connection to appear in Label Studio. |
-| Workspace Host | Enter your workspace URL, for example `https://<workspace-identifier>.cloud.databricks.com` |
-| Access Token | Enter your personal access token that you generated in Databricks. |
-| Catalog <br> Schema <br> Volume | Specify your volume path (UC coordinates). You can find this from the **Catalog Explorer** in Databricks (see screenshot below). |
-
-</div>
-
-![Screenshot of Databricks UI and LS UI](/images/storages/databricks-volume.png)
+- A Databricks workspace URL (Workspace Host), for example `https://adb-12345678901234.1.databricks.com` (or Azure domain)
+- A Databricks Personal Access Token (PAT) with permission to access the Files API
+- A UC Volume path under `/Volumes/<catalog>/<schema>/<volume>` with files you want to label
 
-#### Import Settings & Preview
+References:
+- Databricks workspace: https://docs.databricks.com/en/getting-started/index.html
+- Personal access tokens: https://docs.databricks.com/en/dev-tools/auth/pat.html
+- Unity Catalog and Volumes: https://docs.databricks.com/en/files/volumes.html
 
-Complete the following fields and then click **Load preview** to ensure you are syncing the correct data:
-
-<div class="noheader rowheader">
-
-| | |
-| --- | --- |
-| Bucket Prefix | Optionally, enter the directory name within the volume that you would like to use. For example, `data-set-1` or `data-set-1/subfolder-2`. |
-| Import Method | Select whether you want to create a task for each file in your container or whether you would like to use a JSON/JSONL/Parquet file to define the data for each task. |
-| File Name Filter | Specify a regular expression to filter bucket objects. Use `.*` to collect all objects. |
-| Scan all sub-folders | Enable this option to perform a recursive scan across subfolders within your container. |
-
-</div>
-
-#### Review & Confirm
-
-If everything looks correct, click **Save & Sync** to sync immediately, or click **Save** to save your settings and sync later.
+### Set up connection in the Label Studio UI
+1. Open Label Studio → project → **Settings > Cloud Storage**.
+2. Click **Add Source Storage**. Select **Databricks Files (UC Volumes)**.
+3. Configure the connection:
+   - Workspace Host: your Databricks workspace base URL (no trailing slash)
+   - Access Token: your PAT
+   - Catalog / Schema / Volume: Unity Catalog coordinates
+   - Click **Next** to open Import Settings & Preview
+4. Import Settings & Preview:
+   - Bucket Prefix (optional): relative subpath under the volume (e.g., `images/train`)
+   - File Name Filter (optional): regex to filter files (e.g., `.*\.json$`)
+   - Scan all sub-folders: enable for recursive listing; disable to list only the current folder
+   - Click **Load preview** to verify files
+5. Click **Save** (or **Save & Sync**) to create the connection and sync tasks.
+
+### Target storage (export)
+1. Open **Settings > Cloud Storage** → **Add Target Storage** → **Databricks Files (UC Volumes)**.
+2. Use the same Workspace Host/Token and UC coordinates.
+3. Set an Export Prefix (e.g., `exports/${project_id}`).
+4. Click **Save** and then **Sync** to push annotations as JSON files to your volume.
 
 !!! note "URI schema"
-    To reference Databricks files directly in task JSON (without using source storage), use Label Studio’s Databricks URI scheme:
+    To reference Databricks files directly in task JSON (without using an Import Storage), use Label Studio’s Databricks URI scheme:
 
     `dbx://Volumes/<catalog>/<schema>/<volume>/<path>`
 
     Example:
-    `{ "image": "dbx://Volumes/main/default/dataset/images/1.jpg" }`
+    ```
+    { "image": "dbx://Volumes/main/default/dataset/images/1.jpg" }
+    ```
 
 
 !!! note "Troubleshooting"
-    - If your file preview returns zero files, verify the path under `/Volumes/<catalog>/<schema>/<volume>/<prefix?>` and your PAT permissions.
+    - If listing returns zero files, verify the path under `/Volumes/<catalog>/<schema>/<volume>/<prefix?>` and your PAT permissions.
     - Ensure the Workspace Host has no trailing slash and matches your workspace domain.
-    - If previews work but media fails to load, confirm proxy mode is allowed for your organization in Label Studio (**Organization > Usage & License > Features**) and network egress allows Label Studio to reach Databricks.
+    - If previews work but media fails to load, confirm proxy mode is allowed for your organization in Label Studio and network egress allows Label Studio to reach Databricks.
 
 
 !!! warning "Proxy and security"
     This connector streams data **through the Label Studio backend** with HTTP Range support. Databricks does not support presigned URLs, so this option is also not available in Label Studio.
 
-### Create a target storage connection in the Label Studio UI
-
-Repeat the steps from the previous section but using **Add Target Storage**. Use the same workspace host, token, and volume path (UC coordinates).
-
-For your **Bucket Prefix**, set an export folder to use (e.g., `exports/${project_id}`) and determine whether you want to allow files to be deleted from target storage.
-
-When file deletion is enabled, if you delete an annotation in Label Studio (via UI or API), Label Studio will also delete the corresponding exported JSON file from your target storage for this storage connection.
-
-Note that this only affects files that were exported by that target storage, not your source media or tasks. Your PAT permissions must also allow deletion.
-
-After adding, click **Sync** to export annotations as JSON files to your target volume.
 
 </div>
````
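The URI-scheme note in the hunk above shows a single task dict. For bulk imports, such task JSON can be generated programmatically; a minimal sketch, where `dbx_task` and the catalog/schema/volume values are hypothetical illustrations, not part of Label Studio:

```python
# Sketch: build Label Studio task JSON entries that reference files in a
# Databricks UC Volume via the dbx:// URI scheme described in the docs.
def dbx_task(catalog: str, schema: str, volume: str, path: str,
             data_key: str = "image") -> dict:
    """Return one task dict pointing at a file inside a UC Volume."""
    uri = f"dbx://Volumes/{catalog}/{schema}/{volume}/{path}"
    return {data_key: uri}

# One task per image file, mirroring the single-task example in the note.
tasks = [dbx_task("main", "default", "dataset", f"images/{i}.jpg")
         for i in (1, 2, 3)]
print(tasks[0])  # {'image': 'dbx://Volumes/main/default/dataset/images/1.jpg'}
```

A list built this way can then be imported as tasks through the Label Studio UI or API.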
````diff
@@ -1615,11 +1559,7 @@ Databricks Unity Catalog (UC) Volumes integration is available in Label Studio E
 - Stream media securely via the platform proxy (no presigned URLs)
 - Export annotations back to your Databricks Volume as JSON
 
-Learn more and see the full setup guide in the Enterprise documentation:
-
-[Databricks Files (UC Volumes)](https://docs.humansignal.com/guide/storage#Databricks-Files-UC-Volumes).
-
-If your organization needs governed access to Databricks data with Unity Catalog, consider [Label Studio Enterprise](https://humansignal.com/).
+Learn more and see the full setup guide in the Enterprise documentation: [Databricks Files (UC Volumes)](https://docs.humansignal.com/guide/storage#Databricks-Files-UC-Volumes). If your organization needs governed access to Databricks data with Unity Catalog, consider [Label Studio Enterprise](https://humansignal.com/).
 
 </div>
````
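The connector in this commit is built on the Databricks Files API. As a rough, hedged sketch of what a volume listing against that API looks like (the `/api/2.0/fs/directories` endpoint path is my reading of the Databricks Files API and should be checked against current Databricks docs; host, token, and UC coordinates are placeholders):

```python
# Sketch: list files in a UC Volume via the Databricks Files API, using a PAT.
def listing_url(host: str, catalog: str, schema: str, volume: str,
                prefix: str = "") -> str:
    # Host must have no trailing slash, matching the troubleshooting note.
    path = f"/Volumes/{catalog}/{schema}/{volume}"
    if prefix:
        path += f"/{prefix.strip('/')}"
    return f"{host}/api/2.0/fs/directories{path}"

def list_volume(host: str, token: str, catalog: str, schema: str,
                volume: str, prefix: str = "") -> list:
    import requests  # local import; only the URL helper is needed offline
    resp = requests.get(
        listing_url(host, catalog, schema, volume, prefix),
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    return [e["path"] for e in resp.json().get("contents", [])]
```

An empty result from a call like this points at the same causes as the zero-files case in the troubleshooting note: a wrong `/Volumes/...` path or insufficient PAT permissions.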
