Data storage
Nuvolos provides integrated data storage alongside file storage and compute. This chapter explains how the data warehouse - the Scientific Data Warehouse (SDW) - fits into the rest of Nuvolos, and the design decisions behind dataset Spaces and vintaging.
How data storage works on Nuvolos
Most cloud research environments treat data the way they treat code: a folder full of files. That works for small projects, but it scales poorly. Joining datasets means writing custom scripts; tracking provenance means keeping a spreadsheet by hand; sharing requires duplication; and the result is rarely reproducible across collaborators.
Nuvolos integrates a SQL-compliant data warehouse - the Scientific Data Warehouse (SDW), built on Snowflake - into the same hierarchy that holds files and Applications. This means data is queried with SQL, joined with other datasets via SQL, and access-controlled through the same role system that controls everything else. Key capabilities:
Native Application access - data can be queried directly from Nuvolos Applications (Python, R, Stata, MATLAB) using pre-installed connectors that handle authentication automatically.
External access - the same data can be accessed from non-Nuvolos applications via standard ODBC connectors and access tokens.
Scalable storage - the data warehouse handles dataset sizes that would be impractical on a regular file system.
Access control - licensed or sensitive data can be controlled at the organisation and Space level, mirroring the file-system permission model.
Add-on databases - applications can be extended with add-on sidecar services such as PostgreSQL, MongoDB, MariaDB, Redis, Neo4j, OpenSearch, and PostGIS for use cases that require a dedicated database engine.
Standalone database servers - you can run separate database server applications on Nuvolos and connect to them securely via Nuvolos networking. Depending on the application's visibility settings, a database server can be accessible within an instance, across an entire space, or organisation-wide. See Connecting to apps from other applications for configuration details.
For details on connecting to data, uploading tables, and running queries, see Database integration.
Why data integration is useful
Data integration enables workflows in research and education that go beyond simple file sharing.
Data is accessible
Access to data can be fine-tuned at both the organisation and project level. Organisation-wide public datasets are viewable by any member; restricted datasets can be made available only to specified users; and the same dataset can be public to faculty and private to outside collaborators on the same Nuvolos instance.
Data is annotated
Tables and columns can carry descriptions. This sounds minor, but it is the difference between a usable dataset and an unusable one - column names alone rarely tell you what the values mean. Descriptions can be added through the Tables view or programmatically via SQL COMMENT statements.
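As an illustration of the programmatic route, the sketch below generates COMMENT statements for one table. It assumes the Snowflake-style `COMMENT ON TABLE ... IS '...'` and `COMMENT ON COLUMN ... IS '...'` syntax; the helper names, the table `gdp_quarterly`, and its columns are hypothetical and not part of Nuvolos.

```python
def quote_literal(text: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"


def comment_statements(table: str, table_doc: str, column_docs: dict[str, str]) -> list[str]:
    """Build COMMENT statements for a table and its columns.

    Identifiers are inserted as-is, so pass only trusted table and column names.
    """
    stmts = [f"COMMENT ON TABLE {table} IS {quote_literal(table_doc)}"]
    for column, doc in column_docs.items():
        stmts.append(f"COMMENT ON COLUMN {table}.{column} IS {quote_literal(doc)}")
    return stmts


# Hypothetical table and column names, for illustration only.
statements = comment_statements(
    "gdp_quarterly",
    "Quarterly GDP by country, provider's release",
    {
        "iso3": "ISO 3166-1 alpha-3 country code",
        "gdp_usd": "GDP in current US dollars",
    },
)
for stmt in statements:
    print(stmt + ";")
```

The generated statements can then be executed through whichever connector you already use to run SQL against the warehouse.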
Data is vintaged
You can maintain multiple point-in-time versions of a dataset, called vintages. If a data provider updates a financial dataset quarterly, you can store each quarterly release as a separate vintage. This lets you reference the exact data that was available when you ran an analysis - essential for replicability, particularly when the underlying data is revised over time. Vintages are created using the snapshot feature and stored in dataset spaces.
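To make the replicability point concrete, here is a minimal bookkeeping sketch that picks the vintage that was available on a given analysis date. It is not a Nuvolos API: the vintage labels, release dates, and the function `vintage_as_of` are all hypothetical.

```python
from datetime import date

# Hypothetical vintages of one dataset: label -> date the release became available.
VINTAGES = {
    "2023Q1": date(2023, 4, 15),
    "2023Q2": date(2023, 7, 14),
    "2023Q3": date(2023, 10, 16),
}


def vintage_as_of(vintages: dict[str, date], analysis_date: date) -> str:
    """Return the label of the latest vintage released on or before analysis_date."""
    eligible = [
        (released, label)
        for label, released in vintages.items()
        if released <= analysis_date
    ]
    if not eligible:
        raise ValueError(f"no vintage was available on {analysis_date.isoformat()}")
    # Tuples compare by release date first, so max() picks the most recent one.
    return max(eligible)[1]


print(vintage_as_of(VINTAGES, date(2023, 8, 1)))
```

Recording the chosen label next to your results lets a collaborator rerun the analysis against the same release, even after the provider has published revisions.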
Data can be distributed
If you need data from a dataset in your own space, you can use distribution to copy only the tables and files you need, rather than duplicating entire datasets. This is the same distribution mechanism described in the previous chapter - uniform across files, tables, and Applications.
Dataset Spaces
Datasets are a special kind of space, optimised for hosting curated, well-documented data rather than active work. The design assumption is:
A dataset space contains only finished, polished data and the documentation that describes it.
Data preparation - harvesting, cleaning, curating - happens in a separate research space, not in the dataset space itself.
Once data is distributed to a dataset space and snapshotted as a vintage, it is essentially read-only.
Some practical guidelines that follow from this:
One dataset should map to one space.
If a dataset has multiple sub-databases (such as topical sub-databases - Health Indicators, Development Indicators), populate multiple Instances in the same dataset space.
Create vintages of your data to differentiate point-in-time states of the same data.
Table names need to be unique within a single vintage, but may be the same across multiple Instances.
When distributing from multiple Instances, name clashes can occur, so avoid overlapping names where possible.
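The last two guidelines can be checked before distributing. The sketch below is one way to spot overlapping table names across Instances, assuming you already have a list of table names per Instance. The function and the example Instance and table names are hypothetical, and names are compared case-insensitively on the assumption that unquoted SQL identifiers ignore case.

```python
from collections import defaultdict


def find_name_clashes(tables_by_instance: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each table name used by more than one Instance to those Instances."""
    owners: dict[str, list[str]] = defaultdict(list)
    for instance, tables in tables_by_instance.items():
        # Normalise case so that COUNTRIES and countries count as the same name.
        for table in {t.lower() for t in tables}:
            owners[table].append(instance)
    return {
        table: sorted(instances)
        for table, instances in owners.items()
        if len(instances) > 1
    }


# Hypothetical Instances of one dataset space and their tables.
clashes = find_name_clashes(
    {
        "health_indicators": {"countries", "life_expectancy"},
        "development_indicators": {"COUNTRIES", "gdp_per_capita"},
    }
)
print(clashes)
```

A non-empty result means a Space that receives distributions from both Instances would see a clash on those names, so renaming (for example with an Instance-specific prefix) before distribution avoids the problem.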
Self-service and managed datasets
Dataset spaces in your organisation can be populated in two ways:
Self-service - you (or your research team) prepare the data in a research space and distribute the finished result to the dataset space. This is the default workflow and is documented in How-to › For Researchers › Set up a dataset.
Managed - for some datasets, Nuvolos provides the data preparation as a professional service. The intermediate pipeline (harvesting, cleaning, scrubbing) is not visible in your organisation as a research space; only the finished dataset is available in the dataset space with the appropriate access rights configured.
Both kinds of dataset space behave identically once populated - they are read-only sources of distribution, support vintaging via snapshots, and integrate with the same role and visibility model. The difference is only in how the data arrived.
How the warehouse maps onto the hierarchy
Data storage reuses the same hierarchy as the rest of Nuvolos. In the underlying database layout:
An organisation and space together form a Database.
An instance and snapshot together form a Schema.
All Instances in a Space therefore share the same Database, while each vintage lives in a separate Schema. This means a single SQL query can join across vintages of the same dataset, and across Instances within the same Space, simply by referencing different Schemas in one Database. Data from a different Space sits in a different Database, with access governed by the role system.
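The mapping can be sketched as a small data structure. The naming scheme used here (joining the parts with a double underscore) is purely an assumption for illustration, as are the organisation, space, and table names; the real Database and Schema identifiers are assigned by Nuvolos.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableLocation:
    organisation: str
    space: str
    instance: str
    snapshot: str
    table: str

    @property
    def database(self) -> str:
        # Organisation + space -> Database (hypothetical naming scheme).
        return f"{self.organisation}__{self.space}"

    @property
    def schema(self) -> str:
        # Instance + snapshot -> Schema (hypothetical naming scheme).
        return f"{self.instance}__{self.snapshot}"

    def qualified(self) -> str:
        """Fully qualified, double-quoted table reference."""
        return f'"{self.database}"."{self.schema}"."{self.table}"'


def same_database(a: TableLocation, b: TableLocation) -> bool:
    """True when both tables live in the same organisation and space."""
    return a.database == b.database


# Two vintages of the same Instance, plus a table in a different Space.
old = TableLocation("uni", "wdi", "main", "2023q3", "indicators")
new = TableLocation("uni", "wdi", "main", "2023q4", "indicators")
other = TableLocation("uni", "thesis", "main", "current", "indicators")

# A join across vintages only needs different Schemas within one Database.
query = (
    f"SELECT n.country, n.value - o.value AS revision\n"
    f"FROM {new.qualified()} AS n\n"
    f"JOIN {old.qualified()} AS o ON n.country = o.country"
)
print(same_database(old, new), same_database(old, other))
print(query)
```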
Connection modes
Users automatically receive an account on the Scientific Data Warehouse once they have access to a Space with tables enabled. To follow industry best practices for data security, the warehouse offers two connection modes:
Service mode - the connection uses a username and an RSA key, and is only allowed from within Nuvolos. Nuvolos Applications receive the RSA key automatically; you never handle it directly. This is the default and the safer mode for users who only need data access from within the platform.
User mode - the connection is possible from both inside and outside Nuvolos. From inside, the RSA key is used as in Service mode. From outside, you connect with a username and access token, and connection attempts must be approved on a multi-factor authentication device.
The split exists because the security profile of in-platform access is fundamentally different from that of external access - Service mode keeps credentials from ever leaving Nuvolos, while User mode is necessary for users who need to query the warehouse from a local IDE or batch job. New users default to Service mode and can switch when needed.
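The rules for the two modes can be summarised as a small decision function. This is only a model of the behaviour described above, not Nuvolos code; the names `Mode` and `credentials_for` are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    SERVICE = "service"
    USER = "user"


def credentials_for(mode: Mode, inside_nuvolos: bool) -> str:
    """Describe how a connection authenticates for a given mode and origin."""
    if inside_nuvolos:
        # Both modes use the automatically provisioned RSA key inside Nuvolos.
        return "username + RSA key"
    if mode is Mode.USER:
        # External access is only possible in User mode.
        return "username + access token + MFA approval"
    raise PermissionError("Service mode connections are only allowed from within Nuvolos")


for mode in Mode:
    print(mode.value, "inside:", credentials_for(mode, inside_nuvolos=True))
print("user outside:", credentials_for(Mode.USER, inside_nuvolos=False))
```

The only combination that fails is Service mode from outside the platform, which is the case the default setting is designed to rule out.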
Change the connection mode
Every user can change their SDW connection mode at any time in the Table Access menu. New users start in Service mode, and you can switch back and forth between Service and User mode as often as needed.
Where to go next
For practical instructions on querying, uploading, and accessing data, see Reference › Data storage.
For dataset creation workflows, see How-to › For Researchers › Work with data.
For how data fits into the broader reproducibility model, see Snapshots, distribution, and states.