Beyond the Hype: Building a Local-First, AI-Powered Dev Workflow (Sans the Cloud)

Let’s be honest: we’ve reached peak AI fatigue. Every day, there’s a new SaaS tool promising to write your code, debug your legacy systems, and tuck you in at night—for the low, low price of a premium monthly subscription and your company's intellectual property. If you’ve been hanging out on Hacker News lately, you’ve probably noticed a growing counter-movement: "Sans AI" (or more accurately, "Sans Cloud AI").

Developers are starting to push back against the standard practice of sending every keystroke to third-party APIs. Whether it's due to strict corporate compliance, flight cabin offline mode, or just sheer impatience with API latency, the desire for local-first developer tools is skyrocketing. But can you actually build a productive, modern development workflow using 100% local, open-source AI models without sacrificing quality?

The short answer is yes. In this post, we’re going to step away from the proprietary APIs and build a fully local, privacy-first AI development assistant. We’ll look at how to set up local inference, hook it into your IDE, and write code without sending a single byte over the wire.

Why Go Local? The Developer's Case for "Sans Cloud"

Before we dive into the YAML and config files, let's address the elephant in the room. Why should you care about running models locally when OpenAI, Anthropic, and Google spend billions optimizing their cloud infrastructure?

Zero Latency: Waiting for a cloud API to round-trip while you're in the middle of a coding flow state is a productivity killer. Local models run directly on your hardware, eliminating network latency.
Privacy and IP Security: If you work in fintech, healthcare, or any enterprise with strict NDAs, pasting proprietary code into a web UI or a cloud-connected IDE extension is a fireable offense. Local models keep your code on your machine.
Cost Predictability: No token limits, no surprise bills, and no tier upgrades. Once you have the hardware, running the models is virtually free.
Offline Capability: Whether you're on an airplane, commuting on a train, or experiencing a Wi-Fi outage, your development environment remains fully functional.

The Modern Local AI Stack

To get a seamless local development experience, we need three core components:

An Inference Engine: This is the backend that loads the weights of our LLM (Large Language Model) and exposes a local API. We'll be using Ollama, which has quickly become the Docker of local AI.
The Model: We need a model optimized for code generation and autocomplete. We will use Qwen2.5-Coder (specifically the 1.5B or 7B parameter variants), which currently rivals GPT-3.5 and even GPT-4 on many coding benchmarks.
The IDE Integration: We need an editor extension that can speak to our local API. We'll use Continue.dev, an incredible open-source autopilot alternative for VS Code and JetBrains IDEs.

Step 1: Setting Up Ollama and Fetching the Model

First, let's get Ollama up and running. It supports macOS, Linux, and Windows natively. Head over to your terminal and run the installation script (for Mac/Linux):

curl -fsSL https://ollama.com/install.sh | sh

Once installed, the Ollama daemon will run in the background. Now, we want to pull a highly efficient model optimized for code autocomplete (Fill-in-the-Middle, or FIM) and general software engineering chat.

For mid-range machines (like an Apple M1/M2/M3 with 16GB RAM), the Qwen2.5-Coder 7B model is the sweet spot. If you are on an older machine or want lightning-fast inline completions, the Qwen2.5-Coder 1.5B model is incredibly lightweight and capable.

Let's pull both: the 7B model for complex chat/refactoring, and the 1.5B model for fast autocomplete.

# For general coding assistant chat
ollama pull qwen2.5-coder:7b

# For fast, inline tab-completions
ollama pull qwen2.5-coder:1.5b

To verify the installation works, you can run a quick prompt directly in your terminal:

ollama run qwen2.5-coder:7b "Write a Python function to check if a string is a palindrome."

Step 2: Configuring Your IDE with Continue.dev

Now that our local inference engine is serving models on http://localhost:11434, we need to connect our IDE to it. Install the Continue extension from the VS Code Marketplace or JetBrains Plugins Marketplace.

Once installed, click the Continue icon in your sidebar. It will prompt you to configure your setup. This is managed via a JSON configuration file located at ~/.continue/config.json. Let’s edit this file to wire up our local Ollama instances.

Open your config.json and replace its contents with the following configuration:

{
  "models": [
    {
      "title": "Qwen 2.5 Coder 7B (Local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen 2.5 Coder 1.5B (Local)",
    "provider": "ollama",
    "model": "qwen2.5-coder:1.5b"
  },
  "customCommands": [
    {
      "name": "test",
      "prompt": "Write a comprehensive unit test suite for this code using pytest, ensuring edge cases are covered.",
      "description": "Generate Pytest Unit Tests"
    }
  ],
  "contextProviders": [
    {
      "name": "code",
      "params": {}
    },
    {
      "name": "docs",
      "params": {}
    }
  ]
}

Understanding the Configuration

models: This array defines the models available in your sidebar chat. We use the 7B model here because chat interactions are less sensitive to a few milliseconds of latency, allowing us to leverage a smarter model.
tabAutocompleteModel: This is the secret sauce. By using the smaller 1.5B model, we ensure that as you type, code suggestions pop up instantly (under 100ms), giving you a GitHub Copilot-like experience without sending your code to the cloud.
customCommands: These are shortcuts. You can highlight code, type /test, and Continue will automatically generate unit tests using our local model.

Under the Hood: How Local Autocomplete Works

You might wonder how a local model can predict what you're going to write next. Modern code models use a technique called Fill-in-the-Middle (FIM).

Traditional language models only look at the text *before* the cursor to predict the next word. However, when coding, you are often editing in the middle of a file. The IDE needs to send the text *before* the cursor (the prefix) and the text *after* the cursor (the suffix) to the model. The model then fills in the blank.

Here is a conceptual architecture of how your IDE communicates with Ollama locally:

┌────────────────────────────────────────────────────────┐
│                      Your IDE                          │
│                                                        │
│  def process_user_data(data):                          │
│      [Cursor here - typing...]                         │
│  # Ensure database connection is closed                │
└───────────────────────────┬────────────────────────────┘
                            │
                            │ (Prefix + Suffix via local HTTP)
                            ▼
┌────────────────────────────────────────────────────────┐
│               Ollama (localhost:11434)                 │
│                                                        │
│  [Runs Qwen2.5-Coder:1.5b using system RAM/VRAM]       │
└───────────────────────────┬────────────────────────────┘
                            │
                            │ (Suggested: "db.close()")
                            ▼
┌────────────────────────────────────────────────────────┐
│                      Your IDE                          │
│                                                        │
│  def process_user_data(data):                          │
│      db.close()  <-- (Instantly Completed)              │
│  # Ensure database connection is closed                │
└────────────────────────────────────────────────────────┘

Putting It to the Test: Real-World Scenarios

Let's look at how this setup handles common developer tasks. Assume we have a simple Express.js controller where we want to implement user registration with password hashing.

Scenario 1: Inline Autocomplete (FIM)

As you start typing your controller, the 1.5B model analyzes the imports and context. You type:

const bcrypt = require('bcrypt');
const User = require('../models/User');

const registerUser = async (req, res) => {
    //

Within milliseconds, Continue and your local Ollama instance will suggest the rest of the function block:

    try {
        const { email, password } = req.body;
        const hashedPassword = await bcrypt.hash(password, 10);
        const newUser = await User.create({ email, password: hashedPassword });
        return res.status(201).json({ user: newUser });
    } catch (error) {
        return res.status(500).json({ error: error.message });
    }
}

Scenario 2: Context-Aware Chat

If you highlight the code above and press Cmd+L (or Ctrl+L on Windows/Linux) to open the Continue chat sidebar, you can ask the 7B model to review the code.

By typing: "Are there any security issues with this registration controller?", the local model will analyze the snippet and point out:

You should check if the user already exists before hashing the password to avoid unnecessary computation.
Returning the raw newUser object (including the hashed password) in the JSON response is a security risk.

All of this architectural reasoning happens completely offline, powered by your machine's GPU or CPU.

Performance Tuning: Getting the Best Out of Your Hardware

If your local model feels sluggish, here are a few developer-to-developer tips to optimize your local inference engine:

VRAM is King: LLMs run incredibly fast on GPUs. If you are on an Apple Silicon Mac (M1/M2/M3), your system RAM is unified, meaning Ollama can use it as VRAM. If you are on Windows/Linux, make sure Ollama is utilizing your NVIDIA/AMD GPU instead of fallback CPU mode.
Quantization Matters: When you pull a model from Ollama, it defaults to a 4-bit quantization (Q4_K_M). This is the optimal balance of speed and intelligence. Avoid pulling unquantized (FP16) models unless you are running a workstation with dual RTX 4090s.
Keep Your Context Window Small: For autocomplete, set your context length to 2048 or 4096 tokens in your Continue configuration. This limits how much of your surrounding code is analyzed, keeping completion speeds snappy.

Conclusion

The "Sans AI" movement isn't necessarily about rejecting the power of large language models; it's about rejecting the centralization, latency, and privacy compromises that come with cloud-hosted AI. By setting up a local stack with Ollama, Qwen2.5-Coder, and Continue, you get the best of both worlds: a highly intelligent coding companion that respects your privacy, costs nothing to run, and works flawlessly when you're completely disconnected from the grid.

Have you tried moving your development workflow offline? What local models are you running on your machine? Let me know in the comments below, or hit me up on the sysseder forums!

Happy coding, offline and sans cloud!