Set up a dataset

Outcome: You create a dataset space with the right visibility and populate it with curated data via distribution.

Before you start

  • You hold the Organisation Faculty role (required to create a new space).

  • You have decided on visibility - Private, Faculty-only, or Public - and understand that this can't be changed later.

  • Your data has been prepared in a regular research space (a pipeline run, manually curated, or imported).

Dataset spaces are specialised spaces for hosting finished, curated data and its documentation. Unlike research spaces, dataset spaces cannot run applications; they act purely as sources of data and distribution. For the conceptual model and design guidelines (one dataset per space, vintages, name uniqueness within vintages), see Concepts › Data storage.

Create the dataset space

1. Navigate to the Dashboard.

2. Click + NEW DATASET in the Datasets section.

3. Enter a name and description.

4. Choose the visibility (see below).

5. Click + ADD SPACE.

Visibility options:

  • Private (default) - visible only to users you explicitly invite.

  • Faculty-only - visible to all Faculty users in the organisation.

  • Public - visible to all users in the organisation. Public visibility does not automatically grant access to the contents: users initially receive the Instance Observer role and must request the Instance Viewer role to access the data.
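
The visibility rules above can be summarised as a small decision function. The sketch below is illustrative Python only, not a Nuvolos API: `Visibility`, `User`, and `can_see_space` are hypothetical names. It assumes every user already belongs to the organisation, and that an explicit invitation makes a space visible regardless of its visibility setting.

```python
from dataclasses import dataclass
from enum import Enum


class Visibility(Enum):
    """The three visibility options (hypothetical representation)."""
    PRIVATE = "private"
    FACULTY_ONLY = "faculty-only"
    PUBLIC = "public"


@dataclass(frozen=True)
class User:
    """A member of the organisation (hypothetical model)."""
    name: str
    is_faculty: bool = False


def can_see_space(visibility: Visibility, user: User,
                  invited: frozenset = frozenset()) -> bool:
    """Return True if the user can see the dataset space."""
    if user.name in invited:
        return True  # assumption: an explicit invitation always grants visibility
    if visibility is Visibility.PUBLIC:
        return True  # visible to all users in the organisation
    if visibility is Visibility.FACULTY_ONLY:
        return user.is_faculty  # visible to Faculty users only
    return False  # PRIVATE: invited users only
```

Note that this models visibility only. Whether a user can read the contents of a space depends on their role.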

You can't run applications in dataset spaces. The intended workflow is to prepare data in a regular research space, then distribute the finished data to the dataset space.

Prepare your data in a research space

Dataset spaces hold static data. The recommended workflow is:

1. Set up a regular research space where you can run applications and develop your data pipeline.

2. Execute the pipeline (ETL steps, transformations, cleaning) until the data is in its final form.

3. Keep the research space as the long-term home for the pipeline - dataset spaces only hold the published artefacts.

For the tools and patterns of building data pipelines on Nuvolos, see Import data below.

Distribute your finished data to the dataset space

Once your pipeline is complete:

1. Stage the tables or files to distribute from the source research space.

2. Open the distribution flow.

3. Choose the dataset space (and the appropriate instance) as the target.

4. Optionally include an application - useful as a documentation aid or a software library blueprint, even though the application cannot run in the dataset space.

5. Complete the distribution.

Before distributing an updated dataset, clean up the Current state in the dataset space first. This ensures the next vintage starts from a clean baseline and does not carry over leftover artefacts from the previous one.

Snapshot the dataset as a named vintage

Once distribution completes, create a named snapshot of the dataset space with a full description of the data's provenance and date. Nuvolos calls these snapshots vintages because the same dataset evolves over time - for example, a financial dataset updated quarterly has one vintage per quarter.
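
One way to keep vintage names consistent is to derive them from the data's reference date. The Python sketch below assumes a quarterly naming scheme such as `2024Q2`. The function names and the description layout are hypothetical conventions for illustration, not something Nuvolos prescribes.

```python
from datetime import date


def quarterly_vintage_name(as_of: date) -> str:
    """Build a name like '2024Q2' from the data's reference date (assumed convention)."""
    quarter = (as_of.month - 1) // 3 + 1  # months 1-3 -> Q1, 4-6 -> Q2, and so on
    return f"{as_of.year}Q{quarter}"


def vintage_description(as_of: date, source: str) -> str:
    """Compose a snapshot description recording the data's date and provenance."""
    name = quarterly_vintage_name(as_of)
    return f"Vintage {name}: data as of {as_of.isoformat()}, source: {source}"
```

For example, `quarterly_vintage_name(date(2024, 5, 15))` returns `"2024Q2"`. The description string can be pasted into the snapshot description field when you create the vintage.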

See How-to › Common Workflows › Create a snapshot for the snapshot procedure. For the conceptual model of vintaging, see Concepts › Data storage.

How public datasets work

Public datasets are visible to all members of an organisation, but visibility ≠ access:

  • Users in a Public dataset space initially receive the Instance Observer role.

  • To access the contents, they must request the Instance Viewer role.

  • The organisation manager reviews and approves these requests - see Administration › Organisation administration.
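
The request-and-approve flow above can be pictured as a small state machine. This is a conceptual Python sketch, not Nuvolos code: the `Role` enum and `PublicDatasetMembership` class are hypothetical names, and the model assumes a request only takes effect once the organisation manager approves it.

```python
from enum import Enum


class Role(Enum):
    INSTANCE_OBSERVER = "Instance Observer"
    INSTANCE_VIEWER = "Instance Viewer"


class PublicDatasetMembership:
    """Tracks one user's role in a Public dataset space (hypothetical model)."""

    def __init__(self) -> None:
        # Users in a Public dataset space start as Instance Observers.
        self.role = Role.INSTANCE_OBSERVER
        self.request_pending = False

    def can_access_contents(self) -> bool:
        # Only the Instance Viewer role grants access to the data.
        return self.role is Role.INSTANCE_VIEWER

    def request_viewer_role(self) -> None:
        # An observer asks for access; nothing changes until approval.
        if self.role is Role.INSTANCE_OBSERVER:
            self.request_pending = True

    def approve_request(self) -> None:
        # Performed on behalf of the organisation manager.
        if self.request_pending:
            self.role = Role.INSTANCE_VIEWER
            self.request_pending = False
```

In this model, calling `approve_request()` without a pending request leaves the user as an Instance Observer, which mirrors the point that visibility alone never grants access.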
