cascadeflow

This guide describes how to configure and deploy cascadeflow within a Nuvolos application.

cascadeflow is an intelligent AI model cascading library designed to optimize both cost and latency for Large Language Model (LLM) interactions. It functions by dynamically selecting the most appropriate model for each query: simple tasks are handled by smaller, faster, cheaper models, while complex requests are automatically escalated to powerful flagship models only when necessary. By employing techniques like speculative execution and quality validation, cascadeflow can significantly reduce API costs—often by 40-85%—while delivering high-quality results.

Why use cascadeflow on Nuvolos?

Deploying cascadeflow on the Nuvolos platform unlocks a powerful, hybrid AI development environment tailored for professional data science and research.

Privacy-First Coding

Securely process code within your Nuvolos container. By routing autocomplete and basic logic tasks to local models (like qwen3 or ministral) running directly in your workspace, sensitive code snippets never leave your secure environment.

Hybrid Intelligence & Cost Optimization

Development involves thousands of queries, from simple syntax checks to complex architectural reasoning.

  • Zero Cost for Routine Tasks: Use free, local models for autocomplete and simple edits.

  • Component-Level Routing: Smartly escalate only the hardest problems to paid APIs (like Anthropic's Claude), significantly reducing API costs.

High-Availability Performance

The configuration separates the autocomplete engine from the chat engine by running them against dedicated Ollama instances on separate ports. This ensures that heavy reasoning tasks (chatting with the agent) never block the instantaneous feedback required for code completion while typing.

Flexible Research Platform

Nuvolos provides the compute resources to run open-weights models effectively. This setup serves as a customizable template for testing "Mixture of Experts" strategies, allowing you to swap out models without changing your client-side configuration.


Technical Architecture & Setup

This reference architecture implements a proxy layer that manages two parallel local Ollama instances and one external API provider.

Prerequisites

This case study uses the cascadeflow-openai-proxy repository available at https://github.com/nuvolos/cascadeflow-openai-proxy.

1. Service Orchestration

To support this hybrid architecture, the workspace uses custom shell scripts to manage the lifecycle of the required processes; a minimal sketch of the startup flow follows the list below.

  • cascadeflow-openai-proxy/start_services.sh:

    • Launches Ollama Instance 1 (Port 11434) for low-latency autocomplete.

    • Launches Ollama Instance 2 (Port 11435) for parallel processing of chat requests.

    • Initializes the cascadeflow Proxy (Port 8000).

    • Manages Python virtual environments and model pulling.

  • cascadeflow-openai-proxy/stop_services.sh:

    • Performs a clean shutdown of all related processes (Proxy and Ollama instances) using tracked PIDs to prevent resource leaks.
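
For orientation, here is a minimal sketch of what a startup script along these lines can look like. It is not the repository's actual file: the proxy entry point, virtual-environment path, PID file locations, and command-line flag are assumptions; only the Ollama commands, model names, and port layout follow the description above.

```bash
#!/usr/bin/env bash
# Minimal sketch of a start script -- illustrative, not the repository's actual file.
set -euo pipefail

# Ollama instance 1: low-latency autocomplete on port 11434
OLLAMA_HOST=127.0.0.1:11434 ollama serve &
echo $! > /tmp/ollama_autocomplete.pid

# Ollama instance 2: parallel handling of chat/edit requests on port 11435
OLLAMA_HOST=127.0.0.1:11435 ollama serve &
echo $! > /tmp/ollama_chat.pid

# Give the servers a moment, then pull the local models (tags as listed in section 2)
sleep 5
OLLAMA_HOST=127.0.0.1:11434 ollama pull qwen3:1.7b
OLLAMA_HOST=127.0.0.1:11435 ollama pull ministral-3:8b

# Start the cascadeflow proxy on port 8000 inside its Python virtual environment
source .venv/bin/activate            # assumed venv location
python proxy.py --port 8000 &        # assumed entry point and flag
echo $! > /tmp/cascadeflow_proxy.pid
```

The tracked PID files are what allow stop_services.sh to shut everything down cleanly afterwards.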

2. Component Configuration

Continue Extension

File: /home/datahub/.continue/config.yaml

The client side is configured to treat the Proxy as a standard OpenAI-compatible provider. Note how tasks are split:

  • Autocomplete connects directly to localhost:11434 for maximum speed.

  • Chat/Edit connects to the Proxy at localhost:8000.
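
For illustration, a Continue configuration along these lines expresses that split. This is a sketch rather than the exact file from the repository: the top-level metadata, block names, the model id exposed by the proxy, and the placeholder API key are assumptions; only the ports and roles follow the description above.

```yaml
# Sketch of /home/datahub/.continue/config.yaml -- names and apiKey are illustrative
name: nuvolos-cascadeflow
version: 1.0.0
models:
  # Autocomplete talks to the local Ollama instance directly for minimum latency
  - name: local-autocomplete
    provider: ollama
    model: qwen3:1.7b
    apiBase: http://localhost:11434
    roles:
      - autocomplete
  # Chat and Edit go through the cascadeflow proxy, addressed as an OpenAI-compatible API
  - name: cascadeflow
    provider: openai
    model: cascadeflow                # model id exposed by the proxy (assumption)
    apiBase: http://localhost:8000/v1
    apiKey: not-needed-locally        # placeholder; real key handling lives in the proxy
    roles:
      - chat
      - edit
```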

cascadeflow Proxy Backend

File: cascadeflow-openai-proxy/config.yaml

The proxy configuration defines the available models and their costs, aggregating the disparate providers (local Ollama vs. the Anthropic API) into a unified model list, as summarized below.

| Model Name | Provider | Endpoint | Role |
| --- | --- | --- | --- |
| qwen3:1.7b | Ollama | localhost:11434 | Fast generic tasks |
| ministral-3:8b | Ollama | localhost:11435 | Intermediate reasoning |
| claude-sonnet | Anthropic | External API | Complex coding tasks |
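
The exact schema here is defined by the cascadeflow-openai-proxy repository itself; the fragment below is only an illustrative sketch of how such a model list could look, and every key name should be treated as a placeholder rather than the proxy's documented format.

```yaml
# Illustrative sketch only -- key names are placeholders, not the proxy's documented schema
models:
  - name: qwen3:1.7b
    provider: ollama
    endpoint: http://localhost:11434
    tier: fast                     # cheap generic tasks, tried first
    cost_per_million_tokens: 0.0
  - name: ministral-3:8b
    provider: ollama
    endpoint: http://localhost:11435
    tier: intermediate             # harder reasoning, still free and local
    cost_per_million_tokens: 0.0
  - name: claude-sonnet
    provider: anthropic
    api_key_env: ANTHROPIC_API_KEY
    tier: flagship                 # escalation target for complex coding tasks
    cost_per_million_tokens: 3.0   # illustrative figure, not a quoted price
```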

Getting Started

  1. Initialize Services: Open a terminal in your Nuvolos workspace and run:
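
    Assuming the repository has been cloned into the current directory, this is along the lines of:

    ```bash
    bash cascadeflow-openai-proxy/start_services.sh
    ```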

    Wait for the logs to confirm that both Ollama instances and the Proxy are active.

  2. Verify Connections: Open the Continue extension in VS Code. The models should now be available for selection.

  3. Shutdown: When finished, ensure resources are released:
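
    Assuming the same repository location as above:

    ```bash
    bash cascadeflow-openai-proxy/stop_services.sh
    ```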

Screenshots

Using cascadeflow within the Continue extension in Antigravity on Nuvolos, you can observe the model routing in action:
