# cascadeflow

This guide describes how to configure and deploy **cascadeflow** within a Nuvolos application.

**cascadeflow** is an intelligent AI model cascading library designed to optimize both cost and latency for Large Language Model (LLM) interactions. It functions by dynamically selecting the most appropriate model for each query: simple tasks are handled by smaller, faster, cheaper models, while complex requests are automatically escalated to powerful flagship models only when necessary. By employing techniques like speculative execution and quality validation, cascadeflow can significantly reduce API costs—often by 40-85%—while delivering high-quality results.

## Why use cascadeflow on Nuvolos?

Deploying cascadeflow on the Nuvolos platform unlocks a powerful, hybrid AI development environment tailored for professional data science and research.

### Privacy-First Coding

Securely process code within your Nuvolos container. By routing autocomplete and basic logic tasks to local models (like `qwen3` or `ministral`) running directly in your workspace, sensitive code snippets never leave your secure environment.

### Hybrid Intelligence & Cost Optimization

Development involves thousands of queries, from simple syntax checks to complex architectural reasoning.

* **Zero Cost for Routine Tasks**: Use free, local models for autocomplete and simple edits.
* **Component-Level Routing**: Smartly escalate only the hardest problems to paid APIs (like Anthropic's Claude), significantly reducing API costs.

### High-Availability Performance

The configuration separates the autocomplete engine from the chat engine using parallel ports. This ensures that heavy reasoning tasks (chatting with the agent) never block the instantaneous feedback required for code completion while typing.

### Flexible Research Platform

Nuvolos provides the compute resources to run open-weights models effectively. This setup serves as a customizable template for testing "Mixture of Experts" strategies, allowing you to swap out models without changing your client-side configuration.

***

## Technical Architecture & Setup

This reference architecture implements a proxy layer that manages two parallel local Ollama instances and one external API provider.

<figure><img src="/files/MJD77czvbf45BANb5Ltf" alt=""><figcaption></figcaption></figure>

### Prerequisites

This case study uses the `cascadeflow-openai-proxy` repository available at <https://github.com/nuvolos-cloud/cascadeflow-openai-proxy/>.

### 1. Service Orchestration

To support this hybrid architecture, the workspace uses custom shell scripts to manage the lifecycle of the multiple required processes.

* **`cascadeflow-openai-proxy/start_services.sh`**:
  * Launches **Ollama Instance 1** (Port 11434) for low-latency autocomplete.
  * Launches **Ollama Instance 2** (Port 11435) for parallel processing of chat requests.
  * Initializes the **cascadeflow Proxy** (Port 8000).
  * Manages Python virtual environments and model pulling.
* **`cascadeflow-openai-proxy/stop_services.sh`**:
  * Performs a clean shutdown of all related processes (Proxy and Ollama instances) using tracked PIDs to prevent resource leaks.

### 2. Component Configuration

#### Continue Extension

*File: `/home/datahub/.continue/config.yaml`*

The client side is configured to treat the Proxy as a standard OpenAI provider. Note how tasks are split:

* **Autocomplete** connects directly to `localhost:11434` for maximum speed.
* **Chat/Edit** connects to the Proxy at `localhost:8000`.

```yaml
models:
  - name: cascadeflow
    provider: openai
    model: local-model
    apiBase: http://localhost:8000/v1
    roles: [chat, edit, apply]
  - name: Qwen3 1.7B
    provider: ollama
    model: qwen3:1.7b
    roles: [autocomplete, embed]
```

#### cascadeflow Proxy backend

*File: `cascadeflow-openai-proxy/config.yaml`*

The proxy defines the logic for available models and their costs. It aggregates the disparate providers (Local vs. Anthropic) into a unified list.

| Model Name         | Provider  | Endpoint          | Role                   |
| ------------------ | --------- | ----------------- | ---------------------- |
| **qwen3:1.7b**     | Ollama    | `localhost:11434` | Fast generic tasks     |
| **ministral-3:8b** | Ollama    | `localhost:11435` | Intermediate reasoning |
| **claude-sonnet**  | Anthropic | External API      | Complex coding tasks   |

## Getting Started

1. **Initialize Services**: Open a terminal in your Nuvolos workspace and run:

   ```bash
   ./cascadeflow-openai-proxy/start_services.sh
   ```

   *Wait for the logs to confirm that Ollama instances and the Proxy are active.*
2. **Verify Connections**: Open the **Continue** extension in VS Code. The models should now be available for selection.
3. **Shutdown**: When finished, ensure resources are released:

   ```bash
   ./cascadeflow-openai-proxy/stop_services.sh
   ```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.nuvolos.com/user-guides/application-specific-guides/cascadeflow_on_nuvolos.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
