Apache Airflow
For researchers who require scheduled workflows, Nuvolos supports Airflow as a self-service application. Airflow runs inside a JupyterLab application, making it easy to edit Airflow DAG files, install packages and use the Nuvolos filesystem for data processing.
The JupyterLab application is collaborative, so DAGs can be worked on simultaneously by multiple users in a "Google Docs"-like fashion.
Configuration
DAGs should be created as Python files in the /files/airflow/dags folder. Refer to the Airflow documentation for an example.
Setting up your first DAG
Create a new Python file named /files/airflow/dags/tutorial.py and copy the contents of the tutorial DAG from the Airflow tutorial.
Click on the Airflow tab and select the All DAGs filter in the UI. The DAG should appear in the list, as in the screenshot below. It can take up to a minute for the DAG to show up, because Airflow periodically scans the Python files in the /files/airflow/dags folder for new DAG definitions.
Click the slider toggle next to the tutorial DAG name to enable the DAG and start the first execution.
A 1 in a green circle in the Runs column shows that the DAG has executed successfully.
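If a DAG does not appear in the list after a minute or so, one possible cause is a syntax error in the DAG file. The sketch below is a hypothetical helper, not part of Airflow or Nuvolos, that can be run from a JupyterLab terminal to parse every Python file in the DAG folder. It only checks syntax; it does not import the files, so missing packages or Airflow-specific errors are not detected.

```python
import ast
from pathlib import Path

# DAG folder on Nuvolos, as stated in this guide.
DAG_DIR = Path("/files/airflow/dags")


def find_syntax_errors(dag_dir=DAG_DIR):
    """Return a {file path: error message} dict for Python files that fail to parse."""
    errors = {}
    dag_dir = Path(dag_dir)
    if not dag_dir.is_dir():
        # Nothing to check if the folder does not exist.
        return errors
    for path in sorted(dag_dir.rglob("*.py")):
        try:
            # Parse only; the file is never executed or imported.
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            errors[str(path)] = f"line {exc.lineno}: {exc.msg}"
    return errors


if __name__ == "__main__":
    for name, message in find_syntax_errors().items():
        print(f"{name}: {message}")
```

An empty output means every file parsed cleanly; any other problem should show up in the Airflow UI or in the logs.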
Airflow Connections and Variables can be configured on the Airflow UI.
Airflow on Nuvolos uses a CeleryExecutor back-end to be able to execute tasks in parallel.
Installing packages
To install packages used in DAGs, open a JupyterLab terminal and install the required package with pip, conda, or mamba. See the Install a software package chapter for detailed instructions.
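After installing, it can be useful to confirm that the modules a DAG imports are visible to Python. The following is a minimal sketch using only the standard library. The function name is hypothetical, and the module names in the usage section are examples to be replaced with whatever your DAG imports. It accepts top-level module names only.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [
        name
        for name in module_names
        if importlib.util.find_spec(name) is None
    ]


if __name__ == "__main__":
    # Example module names; adjust to match the imports in your DAG files.
    required = ["pandas_datareader", "pyarrow"]
    missing = missing_modules(required)
    if missing:
        print("Not installed:", ", ".join(missing))
    else:
        print("All required modules are available.")
```

Run it in the same environment in which the packages were installed, for example from the JupyterLab terminal.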
Logs
Task execution, scheduler and DAG bag update logs are in /files/airflow/logs.
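To find the most recent log output quickly, the log folder can be scanned by modification time. The sketch below is a hypothetical helper that assumes log files end in .log; the exact layout and naming of the files under /files/airflow/logs may differ between Airflow versions.

```python
from pathlib import Path

# Log folder on Nuvolos, as stated in this guide.
LOG_DIR = Path("/files/airflow/logs")


def newest_logs(log_dir=LOG_DIR, limit=5):
    """Return up to `limit` log files, most recently modified first."""
    log_dir = Path(log_dir)
    if not log_dir.is_dir():
        return []
    # Assumption: log files use the .log extension.
    files = [p for p in log_dir.rglob("*.log") if p.is_file()]
    files.sort(key=lambda p: p.stat().st_mtime, reverse=True)
    return files[:limit]


if __name__ == "__main__":
    for path in newest_logs():
        print(path)
```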
Saving data to Nuvolos
The following example DAG downloads CSV-style time series data from an API, saves it as a compressed Parquet file, and uploads it as a Nuvolos table. Airflow uses the database credentials of the user who started the application.
Prerequisites
Create and start a new Airflow application in your working instance.
Open a new terminal tab and install the required packages:
mamba install -y --freeze-installed -c conda-forge pandas-datareader
mamba install -y --freeze-installed -c conda-forge pyarrow
Save the example script as /files/airflow/dags/csv_to_nuvolos. After saving the file, wait a few seconds for the new DAG to appear in the Airflow tab.
Enable the csv_to_nuvolos DAG with the toggle switch next to its name.
Select the play button to trigger a DAG run manually.
Select the DAG name to open its detail view and monitor task progress. Completed tasks appear in dark green.
After the DAG succeeds, open the Tables view to verify that the output table was created.
Airflow with VSCode
Airflow is also available bundled with VSCode, which can make DAG development easier. To use it:
In the applications gallery, choose the latest Airflow + Code-server + Python application.
After the application starts, open the Command Palette with Ctrl + Shift + P on Windows/Linux or Command + Shift + P on macOS.
Search for Airflow and select Airflow: Show Airflow.
Airflow opens in a new VSCode tab. Use that tab to view and manage DAGs.
To install additional Python dependencies, open a Terminal in VSCode and run: