
# Set up a dataset

<mark style="color:$primary;">**Outcome**</mark>\
You create a dataset space with the right visibility and populate it with curated data via distribution.

<mark style="color:$primary;">**Before you start**</mark>

* You hold the **Organisation Faculty** role (required to create a new space).
* You have decided on visibility - Private, Faculty-only, or Public - and understand that this can't be changed later.
* Your data has been prepared in a regular research space (produced by a pipeline run, curated manually, or imported).

Dataset spaces are specialised spaces for hosting finished, curated data and its documentation. Unlike a research space, a dataset space can't run applications - it serves as a source of data and distribution, not as a working environment. For the conceptual model and design guidelines (one dataset per space, vintages, name uniqueness within vintages), see [Concepts › Data storage](/concepts/data-integration.md).

### Create the dataset space

{% stepper %}
{% step %}
Navigate to the Dashboard.
{% endstep %}

{% step %}
Click **+ NEW DATASET** in the Datasets section.
{% endstep %}

{% step %}
Enter a name and description.
{% endstep %}

{% step %}
Choose the visibility (see below).
{% endstep %}

{% step %}
Click **+ ADD SPACE**.
{% endstep %}
{% endstepper %}

Visibility options:

* **Private** (default) - visible only to users you explicitly invite.
* **Faculty-only** - visible to all Faculty users in the organisation.
* **Public** - visible to all users in the organisation. Public visibility does not automatically grant access to the contents - users initially receive the Instance Observer role and must request the Viewer role to access data.

{% hint style="info" %}
You can't run applications in dataset spaces. The intended workflow is to prepare data in a regular research space, then distribute the finished data to the dataset space.
{% endhint %}

### Prepare your data in a research space

Dataset spaces hold static information. The recommended workflow is:

{% stepper %}
{% step %}
Set up a regular research space where you can run applications and develop your data pipeline.
{% endstep %}

{% step %}
Execute the pipeline (ETL steps, transformations, cleaning) until the data is in its final form.
{% endstep %}

{% step %}
Keep the research space as the long-term home for the pipeline - dataset spaces only hold the published artefacts.
{% endstep %}
{% endstepper %}

For the tools and patterns for building data pipelines on Nuvolos, see [Import data](/how-to-guides/workflows-for-researchers/importing-data-on-nuvolos.md).

### Distribute your finished data to the dataset space

Once your pipeline is complete:

{% stepper %}
{% step %}
Stage the tables or files to distribute from the source research space.
{% endstep %}

{% step %}
Open the distribution flow.
{% endstep %}

{% step %}
Choose the dataset space (and the appropriate instance) as the target.
{% endstep %}

{% step %}
*Optional:* Include an application. It is useful as a documentation aid or a software library blueprint, even though the application cannot run in the dataset space.
{% endstep %}

{% step %}
Complete the distribution.
{% endstep %}
{% endstepper %}

{% hint style="info" %}
Before distributing an updated dataset, clean up the Current state in the dataset space first. This ensures the next vintage starts from a clean baseline and does not carry over leftover artefacts from the previous one.
{% endhint %}

### Snapshot the dataset as a named vintage

Once distribution completes, create a named snapshot of the dataset space with a full description of the data's provenance and date. Nuvolos calls these snapshots **vintages** because the same dataset evolves over time - for example, a financial dataset updated quarterly has one vintage per quarter.

See [How-to › Common Workflows › Create a snapshot](/how-to-guides/common-workflows/snapshots/create-a-snapshot.md) for the snapshot procedure. For the conceptual model of vintaging, see [Concepts › Data storage](/concepts/data-integration.md).
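If you publish on a fixed schedule, it can help to derive vintage names and descriptions consistently. The following is a minimal sketch, assuming quarterly vintages as in the example above. The helper names `vintage_label` and `vintage_description` and the `YYYYQn` naming scheme are hypothetical conventions for illustration only; they are not part of Nuvolos, which does not prescribe a naming format on this page.

```python
from datetime import date


def vintage_label(as_of: date) -> str:
    """Return a quarterly label such as '2024Q2' (hypothetical naming scheme)."""
    quarter = (as_of.month - 1) // 3 + 1
    return f"{as_of.year}Q{quarter}"


def vintage_description(source: str, as_of: date) -> str:
    """Build a short provenance note to paste into the snapshot description."""
    return f"Source: {source}. Data as of {as_of.isoformat()}."


# Example: data finalised on 17 May 2024 belongs to the second quarter.
print(vintage_label(date(2024, 5, 17)))  # 2024Q2
print(vintage_description("quarterly pipeline run", date(2024, 5, 17)))
```

You would still create the snapshot itself through the Nuvolos interface; the sketch only produces text to enter there.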

### How public datasets work

Public datasets are visible to all members of an organisation, but visibility ≠ access:

* Users in a Public dataset space initially receive the Instance Observer role.
* To access the contents, they must request the Instance Viewer role.
* The organisation manager reviews and approves these requests - see [Administration › Organisation administration](/administration/organisation-management.md).
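The distinction between visibility and access can be pictured as a small state model. The following is an illustrative sketch only: the role names come from this page, but the `Role` enum and the functions `initial_role`, `can_access_contents`, and `review_request` are hypothetical and do not correspond to any Nuvolos API.

```python
from enum import Enum


class Role(Enum):
    """Instance roles named on this page; the enum itself is illustrative."""
    OBSERVER = "Instance Observer"
    VIEWER = "Instance Viewer"


def initial_role() -> Role:
    """A user in a Public dataset space starts as Instance Observer."""
    return Role.OBSERVER


def can_access_contents(role: Role) -> bool:
    """Only the Instance Viewer role can access the data."""
    return role is Role.VIEWER


def review_request(current: Role, approved: bool) -> Role:
    """Model the organisation manager's decision on a Viewer request."""
    if current is Role.OBSERVER and approved:
        return Role.VIEWER
    return current


role = initial_role()
print(can_access_contents(role))  # False: the space is visible, contents are not
role = review_request(role, approved=True)
print(can_access_contents(role))  # True: access follows approval
```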
