# Access data from applications

You can access your data tables stored in the Scientific Data Warehouse (SDW) either from applications running in Nuvolos or from external, non-Nuvolos applications.

## Accessing data from applications running in Nuvolos

Currently, data access from Nuvolos is supported in the following applications: Python (Spyder, JupyterLab and VS Code), RStudio, Matlab and Stata.

### Accessing data tables from Python in Nuvolos

If you use Nuvolos-hosted Python (via JupyterLab, Spyder or VS Code), data access is straightforward:

1. Make sure you have the data available.
2. **Run** your application.
3. Inside your app, you will need to use the `nuvolos-odbc` [Python library](https://github.com/nuvolos-cloud/python-connector) developed by Alphacruncher, which is pre-installed in the Jupyter application.

Usage example:

```python
from nuvolos import get_connection
import pandas as pd

con = get_connection()
df = pd.read_sql("SELECT * FROM table", con=con)
```

#### Stopping queries from Python

Please refer to the [canceling queries](#canceling-queries) section for the available SQL commands. You can execute them as follows:

```python
df = pd.read_sql("<SQL_COMMAND>", con=con)
```

#### Upgrade instructions for nuvolos-odbc

You can use pip to upgrade nuvolos-odbc in existing applications:

```bash
pip install --upgrade nuvolos-odbc
```

### Accessing data tables from R in Nuvolos

If you use Nuvolos-hosted R (via RStudio), data access is straightforward:

1. Make sure you have the data available.
2. **Run** your application.
3. Inside RStudio, you will need to use the [r-connector](https://github.com/nuvolos-cloud/r-connector) developed by Alphacruncher.

Usage example:

```r
library(DBI)

con <- nuvolos::get_connection()
result_data <- dbGetQuery(con, "SELECT * FROM table LIMIT 10")
```

#### Stopping queries from R

Please refer to the [canceling queries](#canceling-queries) section for the available SQL commands. You can execute them as follows:

```r
result_data <- dbGetQuery(con, "<SQL_COMMAND>")
```

### Accessing data tables from Stata in Nuvolos

If you use Nuvolos-hosted Stata, data access is greatly simplified. Nuvolos ships its own `sysprofile.do` that automatically sets up your access parameters. Stata communicates with the database using `odbc`, so issue the following command to load the query:

```
odbc load, exec(`"SELECT * FROM "table" LIMIT 10"') connectionstring($conn_str)
```

#### Stopping queries from Stata

Please refer to the [canceling queries](#canceling-queries) section for the available SQL commands. You can execute them as follows:

```
odbc load, exec(`"<SQL_COMMAND>"') connectionstring($conn_str)
```

### Accessing data tables from Matlab in Nuvolos

First test your query in the Tables view of your space (or create a new query by using our query builder form). Then select the option **Run from application > Matlab** when looking at the query results preview to see the instructions on executing the given query inside Matlab on Nuvolos.

We suggest using Matlab's `select` function, as it returns results in the [table](https://www.mathworks.com/help/matlab/tables.html) data type. For example:

```
conn = get_connection();
result_data = select(conn,'SELECT * FROM TABLE_NAME LIMIT 10');
```

In this example, `result_data` will be of the table type, and thus column names will be also available for the programmer as part of the return data structure.

## Accessing data tables from external, non-Nuvolos applications

### Connecting with R

First, please install the Nuvolos `r-connector` package developed for Nuvolos:

```r
options(repos = "https://cran.rstudio.com")
install.packages("remotes")
remotes::install_github("nuvolos-cloud/r-connector")
```

Next, [obtain access tokens](https://docs.nuvolos.com/features/database-integration/access-data-from-applications/obtain-tokens-for-data-access) and database/schema names from the Connection Guide on the Nuvolos *Tables* interface of the instance you wish to access.

Finally, pass the SQL statement, the database name and the schema name to the `read_sql()` function:

```r
result_data <- nuvolos::read_sql("SELECT * FROM \"TABLE\" LIMIT 10", dbname = "my_database", schemaname = "my_schema")
```

**Attention:** follow the quotation style shown in the example code. If a table name is case-insensitive, it can be referred to as *table* or *\"TABLE\"*. If it is case-sensitive (it mixes upper- and lowercase letters or contains special characters), it must be double-quoted, for example *\"Table\"*.
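The quoting rule above can be sketched as a small helper (illustrative only; `quote_identifier` is not part of the Nuvolos connectors):

```python
import re

def quote_identifier(name: str) -> str:
    """Wrap a table name in double quotes when the database would
    otherwise fold or reject it (mixed case or special characters)."""
    # All-uppercase or all-lowercase alphanumeric names (plus underscores)
    # are case-insensitive and can be used unquoted.
    if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name) and name in (name.upper(), name.lower()):
        return name
    # Case-sensitive names must be quoted; embedded double quotes
    # are escaped by doubling them.
    return '"' + name.replace('"', '""') + '"'
```

For example, `quote_identifier("table")` returns `table` unchanged, while `quote_identifier("Table")` returns `"Table"`.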

**Credentials**: The first time you connect to the Nuvolos database, the connector asks for your credentials. Check the "Remember with keyring" box to avoid entering them again. You can find your credentials by following the [connection guide](https://app.gitbook.com/@alphacruncher-1/s/nuvolos/~/drafts/-MMuNtFnFrIaP3B5ov-M/data/access-data-from-applications/obtain-tokens-for-your-data/). You do not need to write your credentials explicitly in your scripts; the connector accesses your stored token securely during the connection process.

In case you need to re-enter your credentials, execute the command below in the R console.

```r
nuvolos::input_nuvolos_credential()
```

#### Stopping queries from R

Please refer to the [canceling queries](#canceling-queries) section for the available SQL commands. You can execute them as follows:

```r
result_data <- nuvolos::execute("<SQL_COMMAND>", dbname = "my_database", schemaname = "my_schema")
```

### Connecting with Python

First, install the `nuvolos` package developed for Nuvolos:

```bash
pip install --upgrade nuvolos
```

{% hint style="info" %}
Since version 0.4.0 of the `nuvolos` package, installing the ODBC driver is no longer required.
{% endhint %}

Next, obtain access tokens and database/schema names from the Connection Guide on the Nuvolos *Tables* interface of the instance you wish to access.

Finally, pass the database and schema names specified in the [Connection Guide](https://app.gitbook.com/@alphacruncher-1/s/nuvolos/~/drafts/-MMuNtFnFrIaP3B5ov-M/data/access-data-from-applications/obtain-tokens-for-your-data/) to the `get_connection()` function:

```python
from nuvolos import get_connection
import pandas as pd

con = get_connection(dbname="dbname", schemaname="schemaname")
df = pd.read_sql("SELECT * FROM table", con=con)
```

**Credentials**: The first time you connect from Python, the connector asks for your Nuvolos credentials. Retrieve them from the [Connection Guide](https://app.gitbook.com/@alphacruncher-1/s/nuvolos/~/drafts/-MMuNtFnFrIaP3B5ov-M/data/access-data-from-applications/obtain-tokens-for-your-data/) in the Nuvolos Tables interface. The connector uses them securely, so you do not need to hardcode them in your script. Your operating system may also prompt for your local computer password to allow access to the stored credential. If you need to re-enter the credentials later, use the Python console command provided below.

```python
from nuvolos import input_nuvolos_credential
input_nuvolos_credential()
```

#### Stopping queries from Python

Please refer to the [canceling queries](#canceling-queries) section for the available SQL commands. You can execute them as follows:

```python
df = pd.read_sql("<SQL_COMMAND>", con=con)
```

#### Special Notes for Win10

If you encounter the error below:

```
RuntimeError: The current Numpy installation (...) fails to pass a sanity check due to a bug in the windows runtime...
```

Install an older version of NumPy from a terminal as a temporary workaround specific to Windows 10:

```bash
pip install numpy==1.19.3
```

### Connecting with Stata

Accessing data from out-of-Nuvolos Stata applications consists of the following steps:

1. Install [the Snowflake ODBC driver](https://docs.nuvolos.com/features/database-integration/access-data-from-applications/set-up-odbc-drivers).
2. [Obtain access tokens](https://docs.nuvolos.com/features/database-integration/access-data-from-applications/obtain-tokens-for-data-access) and database/schema names from the Connection Guide on the Nuvolos tables interface.
3. Establish a connection.

{% hint style="info" %}
To simplify work, we suggest that you save your connection parameters to global macros and finally create a connection string as a global macro. On Nuvolos, this is part of the sysprofile.do file of the application.

Add these macros to your profile.do or sysprofile.do only if you will work in a single instance and state/snapshot.
{% endhint %}

To set up your access parameters, issue the following commands. These have to be issued only once. The values for `username` and `snowflake_access_token` can be obtained [following these instructions](https://docs.nuvolos.com/features/database-integration/access-data-from-applications/obtain-tokens-for-data-access), and for `database_name` and `schema_name`, follow instructions [here](https://docs.nuvolos.com/features/database-integration/access-data-from-applications/find-database-and-schema-path).

```
set odbcmgr unixodbc
global user "<username>"
global dbpwd "<snowflake_access_token>"
global dbpath_db `"<database_name>"'
global dbpath_schema `"<schema_name>"'
global conn_str `"DRIVER=SnowflakeDSIIDriver;SERVER=alphacruncher.eu-central-1.snowflakecomputing.com;DATABASE=$dbpath_db;SCHEMA=$dbpath_schema;UID=$user;PWD=$dbpwd"'
```
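For reference, the same connection string can be assembled programmatically before being handed to an ODBC client. A minimal Python sketch (the function name and default server value are illustrative, not part of any Nuvolos package):

```python
def snowflake_conn_str(user, token, database, schema,
                       server="alphacruncher.eu-central-1.snowflakecomputing.com"):
    """Assemble the Snowflake ODBC connection string used above."""
    return (
        "DRIVER=SnowflakeDSIIDriver;"
        f"SERVER={server};DATABASE={database};SCHEMA={schema};"
        f"UID={user};PWD={token}"
    )
```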

You can then access data just as you would on Nuvolos:

```
odbc load, exec(`"SELECT * FROM "table" LIMIT 10"') connectionstring($conn_str)
```

#### Stopping queries from Stata

Please refer to the [canceling queries](#canceling-queries) section for the available SQL commands. You can execute them as follows:

```
odbc load, exec(`"<SQL_COMMAND>"') connectionstring($conn_str)
```

### Connecting with Matlab

First, please download and install the [`nuvolos`](https://ch.mathworks.com/matlabcentral/fileexchange/82903-nuvolos?s_tid=srchtitle) toolbox developed for Nuvolos. You can also click "Get Add-ons", search for "nuvolos" in the Matlab Add-on Explorer, and then click "Add" to install the toolbox.

Next, [obtain access tokens](https://docs.nuvolos.com/features/database-integration/access-data-from-applications/obtain-tokens-for-data-access) and database/schema names from the Connection Guide on the Nuvolos *Tables* interface of the instance you wish to access.

Finally, pass the database and schema names specified in the [Connection Guide](https://app.gitbook.com/@alphacruncher-1/s/nuvolos/~/drafts/-MMuNtFnFrIaP3B5ov-M/data/access-data-from-applications/obtain-tokens-for-your-data/) to the `get_connection()` function:

```
conn = get_connection("dbname", "schemaname");
dataset = select(conn, "SELECT * FROM my_table");
```

To keep your credentials safe, the Matlab connector displays a login dialog the first time you connect to the Nuvolos database. You can find your credentials by following the [Connection Guide](https://app.gitbook.com/@alphacruncher-1/s/nuvolos/~/drafts/-MMuNtFnFrIaP3B5ov-M/data/access-data-from-applications/obtain-tokens-for-your-data/).

If you need to correct or change your credentials, use the command below to enter them again.

```
create_credential(true)
```

### Connecting with Excel

For both Windows and Mac OS, please first install the [Snowflake ODBC database driver](https://docs.snowflake.com/en/user-guide/odbc.html) for your platform, which is required to access the Nuvolos database service. You only need to satisfy the prerequisites and finish the ODBC driver installation (first step). You don't need to further configure and test the driver.

#### Windows

1. Install the Snowflake ODBC driver, then download and unzip the [Excelerator](https://github.com/Snowflake-Labs/Excelerator) package locally.
2. In File Explorer, right-click `SnowflakeExcelAddin.xlam`, select Properties, and enable Unblock.
3. Open Excel and install the add-in:
   * Go to File > Options > Add-ins > Go...
   * Select Browse
   * Choose `SnowflakeExcelAddin.xlam` from the unzipped Excelerator folder

{% hint style="info" %}
For more detailed Excel add-in installation and management, please refer to this [guide](https://exceloffthegrid.com/install-uninstall-excel-add/).
{% endhint %}

4. In Excel, open the Home tab and select Connect from the Excelerator add-in. In the connection window, use User & Password authentication and enter:
   * Hostname: your Nuvolos server URL
   * User ID
   * Password.

{% hint style="info" %}
Get these values from the Connection Guide on the Tables page in your Nuvolos space.
{% endhint %}

5. Select Query in the add-in. Then:
   * choose the target Database and Schema
   * select a table, or enter a custom SQL statement
   * optionally use Choose to limit the returned columns
   * click Execute to insert the results into the active worksheet as a table.
6. When writing SQL manually, quote identifiers as `"database-name"."schema-name"."table-name"`.

#### Uploading Data

You can use the Excelerator add-in to upload worksheet data to Nuvolos, either into an existing table or as a new table.

To upload data to Nuvolos:

1. Click the cell in the first row and first column of the data range you want to upload.
2. On the Home ribbon, select Define Data Types. Excelerator inserts a row above your data where you must choose the data type for each column. The add-in does not infer data types automatically.
3. After selecting the column data types, click Upload.
4. In the upload dialog, choose one of the following actions for the target table:
   * Update an existing table
   * Append to an existing table
   * Replace an existing table
   * Create new table then upload, then enter the new table name.
5. Click Upload to start the upload.

#### Mac OS

1. After installing the Snowflake ODBC driver, open **Terminal** and copy the Snowflake ODBC driver files (typically from `/opt/snowflake/`) into `/Library/ODBC/ODBCDataSources`.

   ```
   sudo cp -r /opt/snowflake/ /Library/ODBC/ODBCDataSources
   ```
2. Edit the file `/Library/ODBC/ODBCDataSources/snowflakeodbc/lib/universal/simba.snowflake.ini` and update the two file path entries so they point to the new locations under `/Library/ODBC/ODBCDataSources`. Save the file after making the changes.

```
ErrorMessagesPath = /Library/ODBC/ODBCDataSources/snowflakeodbc/ErrorMessages
CABundleFile = /Library/ODBC/ODBCDataSources/snowflakeodbc/lib/universal/cacert.pem
```

3. Open the ODBC administration application with administrator rights. This is required to avoid a General installer error later in the setup.
4. In the ODBC Drivers tab, check whether the Snowflake driver is already present. If it is, confirm that the driver path is `/Library/ODBC/ODBCDataSources/snowflakeodbc/lib/universal/Snowflake.dylib`. If it is not present, add it manually:
   * Enter a driver description such as `SnowflakeODBC`
   * Browse to the driver file in `/Library/ODBC/ODBCDataSources/snowflakeodbc/lib/universal/`
   * Save the driver entry.
5. In the System DSN tab, add a new DSN using the Snowflake driver. Enter a unique DSN name, such as `SnowflakeExcel`, then add keyword/value pairs for:
   * `server`
   * `port`
   * `database`
   * `schema`

{% hint style="info" %}
Get these values from the **Connection Guide** on the **Tables** page in your Nuvolos space.
{% endhint %}

6. Open Microsoft Excel and go to Data > New Database Query > From Database. In the iODBC Data Source Chooser, open the System DSN tab, select the DSN you created, and click OK.
7. When prompted, enter your Snowflake username and password from the Connection Guide on the Tables page.
8. In the Microsoft Query window, enter your SQL statement and click Run. Then click Return Data to import the results into the spreadsheet. Quote table references using `"database-name"."schema-name"."table-name"`. Database, schema, and table names are available in the Connection Guide.

## Canceling queries

Running queries can be listed and (selectively) canceled using SQL statements.

### List running queries

You can check how many running queries you have with the command below. Substitute your Nuvolos username for the \<USERNAME> placeholder; you can find your Nuvolos username on the Profile page of the Nuvolos web interface.

```sql
SELECT USER_NAME, QUERY_ID, SESSION_ID, QUERY_TEXT, START_TIME FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_USER(USER_NAME=>'<USERNAME>')) WHERE EXECUTION_STATUS = 'RUNNING'
```

### Cancel a particular query

Once you know the query id, you can cancel a specific query with

```sql
SELECT SYSTEM$CANCEL_QUERY('<QUERY_ID>')
```

### Cancel all running queries

Alternatively, you can cancel all your currently running queries with the command

```sql
ALTER USER <USERNAME> ABORT ALL QUERIES
```
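If you manage queries from a script, the three statements above can be assembled as plain strings and passed to any of the connectors in this guide. A sketch, with hypothetical helper names:

```python
def list_queries_sql(username: str) -> str:
    """SQL that lists the user's currently running queries."""
    return (
        "SELECT USER_NAME, QUERY_ID, SESSION_ID, QUERY_TEXT, START_TIME "
        "FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY_BY_USER("
        f"USER_NAME=>'{username}')) "
        "WHERE EXECUTION_STATUS = 'RUNNING'"
    )

def cancel_query_sql(query_id: str) -> str:
    """SQL that cancels one query by its ID."""
    return f"SELECT SYSTEM$CANCEL_QUERY('{query_id}')"

def cancel_all_sql(username: str) -> str:
    """SQL that aborts all of the user's running queries."""
    return f"ALTER USER {username} ABORT ALL QUERIES"
```

For example, `pd.read_sql(list_queries_sql("<USERNAME>"), con=con)` would return the running queries as a DataFrame.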

## Exporting big data

If your work requires frequently reading large amounts of tabular data from Nuvolos tables (for example, in cluster batch jobs), or you wish to export the result of a query that exceeds 1 million rows, we recommend retrieving the data locally as compressed [Parquet](https://databricks.com/glossary/what-is-parquet) data files.

Parquet files offer high data compression and significantly faster read times compared to CSV files and are supported by most scientific applications (Matlab, Pandas, R).

### Export steps

The general steps that need to be performed are the following:

1. Create a staging area in the database where the database engine will export the query results as a Parquet file,
2. Run the query and export the data,
3. Retrieve the exported data from the staging area in the database,
4. Drop the remote staging area.

This translates to the following SQL, assuming the compressed data will be less than 5 GB:

```sql
CREATE OR REPLACE FILE FORMAT PARQUET_FORMAT TYPE = PARQUET COMPRESSION = SNAPPY;
CREATE OR REPLACE STAGE PARQUET_STAGE FILE_FORMAT=PARQUET_FORMAT;

COPY INTO @PARQUET_STAGE/orders.parquet
FROM (SELECT 
    O_ORDERKEY, 
    O_CUSTKEY, 
    O_TOTALPRICE, 
    O_ORDERDATE, 
    O_ORDERPRIORITY, 
    O_CLERK, 
    O_SHIPPRIORITY, 
    O_COMMENT 
FROM ORDERS)
HEADER = TRUE
OVERWRITE = TRUE
SINGLE = TRUE
MAX_FILE_SIZE = 5000000000;


GET @PARQUET_STAGE file:///files/query_results/ PATTERN='orders.parquet' PARALLEL=4;

DROP STAGE PARQUET_STAGE;
DROP FILE FORMAT PARQUET_FORMAT;
```

If you anticipate that your query's result set will be larger than 5 GB compressed (\~50M rows), you will need to split the exported data into multiple Parquet files. The export will also be faster this way, as the database engine can parallelize it:

```sql
CREATE OR REPLACE FILE FORMAT PARQUET_FORMAT TYPE = PARQUET COMPRESSION = SNAPPY;
CREATE OR REPLACE STAGE PARQUET_STAGE FILE_FORMAT=PARQUET_FORMAT;

COPY INTO @PARQUET_STAGE/orders_
FROM (SELECT 
    O_ORDERKEY, 
    O_CUSTKEY, 
    O_TOTALPRICE, 
    O_ORDERDATE, 
    O_ORDERPRIORITY, 
    O_CLERK, 
    O_SHIPPRIORITY, 
    O_COMMENT 
FROM ORDERS)
HEADER = TRUE
OVERWRITE = TRUE
SINGLE = FALSE
MAX_FILE_SIZE = 5000000000;


GET @PARQUET_STAGE file:///files/query_results/ PATTERN='orders_*' PARALLEL=4;

DROP STAGE PARQUET_STAGE;
DROP FILE FORMAT PARQUET_FORMAT;
```

### Matlab example

You can download data as Parquet files and read them with Matlab, as the following example demonstrates: it retrieves and loads the ORDERS table, which contains 15 million records.

Matlab requires integer columns to be explicitly cast to an integer type so that the logical data type of the Parquet field is INT. Each field in a Parquet file has a physical type (INT64, DOUBLE, BYTE, etc.) and a logical type that tells the processing application how to interpret the stored data. Please refer to the [Matlab documentation on the supported Parquet logical types](https://www.mathworks.com/help/matlab/import_export/datatype-mappings-matlab-parquet.html).

```
conn = get_connection();
execute(conn, 'CREATE OR REPLACE FILE FORMAT PARQUET_FORMAT TYPE = PARQUET COMPRESSION = SNAPPY;');
execute(conn, 'CREATE OR REPLACE STAGE PARQUET_STAGE FILE_FORMAT=PARQUET_FORMAT;');
execute(conn, sprintf([ ... 
'COPY INTO @PARQUET_STAGE/orders.parquet '...
'FROM (SELECT  '...
'    O_ORDERKEY::INT,  '...
'    O_CUSTKEY::INT,  '...
'    O_TOTALPRICE::DOUBLE,  '... 
'    O_ORDERDATE,  '...
'    O_ORDERPRIORITY,  '...
'    O_CLERK,  '...
'    O_SHIPPRIORITY::INT,  '...
'    O_COMMENT  '...
'FROM ORDERS) '...
'HEADER = TRUE '...
'OVERWRITE = TRUE '...
'SINGLE = TRUE '...
'MAX_FILE_SIZE = 5000000000; '...
]));


try
    % This line will fail with an error currently, however, the file will be retrieved.
    execute(conn, 'GET @PARQUET_STAGE/orders.parquet file:///files/query_results/ PARALLEL=4 ;');
catch
    warning('Data has been retrieved');
end

% Please run these clean-up statements after the error.
execute(conn, 'DROP STAGE PARQUET_STAGE;');
execute(conn, 'DROP FILE FORMAT PARQUET_FORMAT;');

info = parquetinfo('/files/query_results/orders.parquet')
T = parquetread('/files/query_results/orders.parquet')
```
