
# Data storage

Nuvolos provides integrated data storage alongside file storage and compute. This chapter explains how the data warehouse - the Scientific Data Warehouse (SDW) - fits into the rest of Nuvolos, and the design decisions behind dataset Spaces and vintaging.

### How data storage works on Nuvolos

Most cloud research environments treat data the way they treat code: a folder full of files. That works for small projects, but it scales poorly. Joining datasets means writing custom scripts; tracking provenance means keeping a spreadsheet by hand; sharing requires duplication; and the result is rarely reproducible across collaborators.

Nuvolos integrates a SQL-compliant data warehouse - the Scientific Data Warehouse (SDW), built on Snowflake - into the same hierarchy that holds files and Applications. Data is queried and joined with SQL, and access is controlled through the same role system that governs everything else. Key capabilities:

* **Native Application access** - data can be queried directly from Nuvolos Applications (Python, R, Stata, MATLAB) using pre-installed connectors that handle authentication automatically.
* **External access** - the same data can be accessed from non-Nuvolos applications via standard ODBC connectors and access tokens.
* **Scalable storage** - the data warehouse handles dataset sizes that would be impractical on a regular file system.
* **Access control** - licensed or sensitive data can be controlled at the organisation and Space level, mirroring the file-system permission model.
* **Add-on databases** - applications can be extended with [add-on](/reference/applications/add-ons.md) sidecar services such as PostgreSQL, MongoDB, MariaDB, Redis, Neo4j, OpenSearch, and PostGIS for use cases that require a dedicated database engine.
* **Standalone database servers** - you can run separate database server applications on Nuvolos and connect to them securely via Nuvolos networking. Depending on the application's visibility settings, a database server can be accessible within an instance, across an entire space, or organisation-wide. See [Connecting to apps from other applications](/reference/applications/configuring-applications.md#connecting-to-apps-from-other-applications) for configuration details.

For details on connecting to data, uploading tables, and running queries, see [Database integration](/concepts/data-integration.md).

### Why data integration is useful

Data integration enables workflows in research and education that go beyond simple file sharing.

#### Data is accessible

Access to data can be fine-tuned at both the organisation and project level. Organisation-wide public datasets are viewable by any member; restricted datasets can be made available only to specified users; and the same dataset can be public to faculty and private to outside collaborators on the same Nuvolos instance.

#### Data is annotated

Tables and columns can carry descriptions. This sounds minor, but it is the difference between a usable dataset and an unusable one - column names alone rarely tell you what the values mean. Descriptions can be added through the [Tables view](/reference/data-storage/nuvolos-tables.md#the-tables-view) or programmatically via SQL `COMMENT` statements.
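As an illustration of the programmatic route, the sketch below builds `COMMENT ON TABLE` and `COMMENT ON COLUMN` statements from a dictionary of descriptions. The table name, column names, and descriptions are hypothetical, and the helper functions are not part of any Nuvolos library; the sketch also assumes unquoted identifiers and descriptions without backslashes. It only produces the SQL text, which you would then run over whichever connection you normally use.

```python
def sql_literal(text: str) -> str:
    """Quote a Python string as a SQL string literal, doubling single quotes."""
    return "'" + text.replace("'", "''") + "'"


def comment_statements(table: str, table_doc: str, column_docs: dict) -> list:
    """Build one COMMENT statement for the table and one per documented column."""
    statements = [f"COMMENT ON TABLE {table} IS {sql_literal(table_doc)};"]
    for column, doc in column_docs.items():
        statements.append(
            f"COMMENT ON COLUMN {table}.{column} IS {sql_literal(doc)};"
        )
    return statements


# Hypothetical table and descriptions, used for illustration only.
statements = comment_statements(
    "GDP_QUARTERLY",
    "Quarterly GDP by country, provider's release",
    {
        "ISO3": "ISO 3166-1 alpha-3 country code",
        "GDP_USD": "GDP in current US dollars",
    },
)

for statement in statements:
    print(statement)
```

Keeping the descriptions in a dictionary next to the data preparation code means the annotations can be re-applied whenever the table is rebuilt.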

#### Data is vintaged

You can maintain multiple point-in-time versions of a dataset, called vintages. If a data provider updates a financial dataset quarterly, you can store each quarterly release as a separate vintage. This lets you reference the exact data that was available when you ran an analysis - essential for replicability, particularly when the underlying data is revised over time. Vintages are created using the [snapshot feature](/how-to-guides/common-workflows/snapshots.md) and stored in [dataset spaces](/concepts/data-integration.md#dataset-spaces).
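To make the replicability argument concrete, here is a minimal sketch of choosing the vintage that was current on the day an analysis ran. The vintage names and release dates are invented for illustration, and `vintage_as_of` is a hypothetical helper rather than a Nuvolos function.

```python
from datetime import date

# Hypothetical vintages of one dataset: snapshot name -> release date.
vintages = {
    "2023Q1": date(2023, 4, 15),
    "2023Q2": date(2023, 7, 15),
    "2023Q3": date(2023, 10, 15),
}


def vintage_as_of(vintages: dict, analysis_date: date) -> str:
    """Return the latest vintage released on or before analysis_date."""
    candidates = [
        (released, name)
        for name, released in vintages.items()
        if released <= analysis_date
    ]
    if not candidates:
        raise LookupError(f"no vintage was available on {analysis_date}")
    # max() compares release dates first, so it picks the most recent release.
    return max(candidates)[1]


print(vintage_as_of(vintages, date(2023, 8, 1)))
```

An analysis dated 1 August 2023 resolves to the second-quarter vintage here, even after later releases revise the same figures.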

#### Data can be distributed

If you need data from a dataset in your own space, you can use [distribution](/concepts/distribution.md) to copy only the tables and files you need, rather than duplicating entire datasets. This is the same distribution mechanism described in the previous chapter - uniform across files, tables, and Applications.

### Dataset Spaces

Datasets are a special kind of space, optimised for hosting curated, well-documented data rather than active work. The design assumption is:

* A dataset space contains only finished, polished data and the documentation that describes it.
* Data preparation - harvesting, cleaning, curating - happens in a separate research space, not in the dataset space itself.
* Once data is distributed to a dataset space and snapshotted as a vintage, it is essentially read-only.

Some practical guidelines that follow from this:

* One dataset should map to one space.
* If a dataset has multiple sub-databases (for example, topical ones such as Health Indicators and Development Indicators), populate multiple Instances in the same dataset space.
* Create vintages of your data to differentiate point-in-time states of the same data.
* Table names need to be unique within a single vintage, but may be the same across multiple Instances.
* When distributing from multiple Instances, name clashes can occur, so avoid overlapping names where possible.
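The last two guidelines can be checked before distributing. The sketch below is one possible pre-flight check, assuming you already have a list of table names per Instance; the Instance and table names are hypothetical. It compares names case-insensitively on the assumption that tables use unquoted identifiers.

```python
from collections import defaultdict


def find_name_clashes(tables_by_instance: dict) -> dict:
    """Map each table name used by more than one Instance to those Instances."""
    owners = defaultdict(list)
    for instance, tables in tables_by_instance.items():
        for table in tables:
            # Assumption: unquoted identifiers, so names differing only in case clash.
            owners[table.upper()].append(instance)
    return {
        table: sorted(instances)
        for table, instances in owners.items()
        if len(instances) > 1
    }


# Hypothetical Instances in one dataset space.
tables_by_instance = {
    "health_indicators": {"countries", "life_expectancy"},
    "development_indicators": {"COUNTRIES", "gdp_per_capita"},
}

clashes = find_name_clashes(tables_by_instance)
print(clashes)
```

An empty result means the tables from all listed Instances can land in the same target without overwriting one another.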

#### Self-service and managed datasets

Dataset spaces in your organisation can be populated in two ways:

* **Self-service** - you (or your research team) prepare the data in a research space and distribute the finished result to the dataset space. This is the default workflow and is documented in [How-to › For Researchers › Set up a dataset](/how-to-guides/workflows-for-researchers/setting-up-a-dataset-on-nuvolos.md).
* **Managed** - for some datasets, Nuvolos provides the data preparation as a professional service. The intermediate pipeline (harvesting, cleaning, scrubbing) is not visible in your organisation as a research space; only the finished dataset is available in the dataset space with the appropriate access rights configured.

Both kinds of dataset space behave identically once populated - they are read-only sources of distribution, support vintaging via snapshots, and integrate with the same role and visibility model. The difference is only in how the data arrived.

### How the warehouse maps onto the hierarchy

Data storage reuses the same hierarchy as the rest of Nuvolos. In the underlying database layout:

* An organisation and space together form a Database.
* An instance and snapshot together form a Schema.

All Instances in a Space therefore share the same Database, while each vintage lives in a separate Schema. This means a SQL query can join across vintages within the same dataset, and across Instances within the same Space, without a cross-database query - but data from a different Space sits in a different Database, with access governed by the role system.
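The mapping can be modelled in a few lines. This sketch represents the Database and Schema as tuples of their components; it deliberately does not reproduce the actual naming scheme used in the warehouse, and the organisation, space, instance, and snapshot names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableLocation:
    organisation: str
    space: str
    instance: str
    snapshot: str
    table: str

    @property
    def database(self) -> tuple:
        # An organisation and space together form a Database.
        return (self.organisation, self.space)

    @property
    def schema(self) -> tuple:
        # An instance and snapshot together form a Schema.
        return (self.instance, self.snapshot)


def same_database(a: TableLocation, b: TableLocation) -> bool:
    """True when two tables can be joined without crossing Databases."""
    return a.database == b.database


# Two vintages of the same dataset, plus a table in a different Space.
q1 = TableLocation("my_org", "macro_data", "master", "2023Q1", "GDP")
q2 = TableLocation("my_org", "macro_data", "master", "2023Q2", "GDP")
other = TableLocation("my_org", "thesis_project", "master", "draft", "GDP")

print(same_database(q1, q2), q1.schema != q2.schema)
print(same_database(q1, other))
```

The two vintages share a Database and differ only in Schema, while the table in the other Space sits in a different Database.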

### Connection modes

Users automatically receive an account on the Scientific Data Warehouse once they have access to a Space with tables enabled. To follow industry best practices for data security, the warehouse offers two connection modes:

* **Service mode** - the connection uses a username and an RSA key, and is only allowed from within Nuvolos. Nuvolos Applications receive the RSA key automatically; you never handle it directly. This is the default and the safer mode for users who only need data access from within the platform.
* **User mode** - the connection is possible from both inside and outside Nuvolos. From inside, the RSA key is used as in Service mode. From outside, you connect with a username and access token, and connection attempts must be approved on a multi-factor authentication device.

The split exists because the security profile of in-platform access is fundamentally different from external access - Service mode prevents tokens from ever leaving Nuvolos, while User mode is necessary for users who need to query the warehouse from a local IDE or batch job. New users default to Service mode and can switch when needed.
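The rules for the two modes reduce to a small decision table. The sketch below encodes them as described above; the `Mode` enum, the function name, and the credential labels are illustrative names, not part of any Nuvolos API.

```python
from enum import Enum


class Mode(Enum):
    SERVICE = "service"
    USER = "user"


def required_credentials(mode: Mode, inside_nuvolos: bool) -> tuple:
    """Return what a connection needs, or raise if it is not allowed."""
    if inside_nuvolos:
        # Both modes use the automatically provided RSA key inside Nuvolos.
        return ("username", "rsa_key")
    if mode is Mode.USER:
        # External access: access token plus approval on an MFA device.
        return ("username", "access_token", "mfa_approval")
    raise PermissionError("Service mode only allows connections from within Nuvolos")


print(required_credentials(Mode.SERVICE, inside_nuvolos=True))
print(required_credentials(Mode.USER, inside_nuvolos=False))
```

The only combination that fails is Service mode from outside the platform, which is the restriction that makes it the safer default.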

#### Change the connection mode <a href="#change-the-connection-mode" id="change-the-connection-mode"></a>

Every user can change their SDW connection mode at any time in the [Table Access](https://app.nuvolos.cloud/user/settings/tables) menu. New users start in Service mode, and you can switch back and forth between Service and User mode.

### Where to go next

* For practical instructions on querying, uploading, and accessing data, see [Reference › Data storage](/reference/data-storage.md).
* For dataset creation workflows, see [How-to › For Researchers › Set up a dataset](/how-to-guides/workflows-for-researchers/setting-up-a-dataset-on-nuvolos.md).
* For how data fits into the broader reproducibility model, see [Snapshots, distribution, and states](/concepts/distribution.md).

