Model Backends

Caro supports multiple inference backends for running language models. Choose the best option for your hardware and use case.

Available Backends

MLX (Apple Silicon)

Optimized for Apple Silicon (M1 and later) using Metal Performance Shaders. This is the fastest option available on those Macs.

  • Platform: macOS (Apple Silicon only)
  • Performance: Fastest on supported hardware
  • Memory: Uses unified memory efficiently
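
A minimal selection sketch, assuming Caro picks its backend through a `backend` key in config.toml (the key name is an assumption, so check Caro's configuration reference for the exact schema):

# In ~/.config/cmdai/config.toml (the backend key is an assumption)
backend = "mlx"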

Ollama

Connect to a local Ollama instance for inference. Great for users who already have Ollama set up.

  • Platform: Cross-platform
  • Requires: Ollama running locally
  • Models: Any Ollama-compatible model
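
A sketch of pointing Caro at a local Ollama instance; the key names below are assumptions, while port 11434 is Ollama's actual default:

# In ~/.config/cmdai/config.toml (key names are assumptions)
backend = "ollama"
ollama_url = "http://localhost:11434"  # Ollama's default listen address
default_model = "qwen2.5-coder:7b"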

vLLM

Connect to a vLLM server for high-throughput inference. Best for users with powerful GPU servers.

  • Platform: Cross-platform (client)
  • Requires: vLLM server running
  • Performance: Excellent for batch processing
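
vLLM serves an OpenAI-compatible HTTP API, so Caro only needs the server's URL. The server command below is vLLM's standard entry point; the Caro-side keys are assumptions:

# Start an OpenAI-compatible vLLM server (listens on port 8000 by default)
python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-Coder-7B-Instruct

# In ~/.config/cmdai/config.toml (key names are assumptions)
backend = "vllm"
vllm_url = "http://localhost:8000"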

CPU Backend (Candle)

Pure CPU inference using the Candle library. It works everywhere but is slower than the GPU-accelerated options.

  • Platform: Cross-platform
  • Performance: Slower but universal
  • No GPU required
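
Because Candle is an embedded library rather than a separate server, there is nothing extra to run; a selection sketch, again assuming a `backend` key:

# In ~/.config/cmdai/config.toml (the backend key is an assumption)
backend = "cpu"  # no GPU or external server required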

Recommended Models

Caro works best with coding-focused models:

Qwen 2.5 Coder (Recommended)

  • qwen2.5-coder:3b - Fast, good for simple commands
  • qwen2.5-coder:7b - Balanced performance and quality
  • qwen2.5-coder:14b - Best quality, needs more RAM

Other Compatible Models

  • codellama:7b - Meta's coding model
  • deepseek-coder:6.7b - DeepSeek's coding model
  • starcoder2:7b - BigCode's model

Setting Up Ollama

  1. Install Ollama:
    curl -fsSL https://ollama.com/install.sh | sh
  2. Pull a model:
    ollama pull qwen2.5-coder:7b
  3. Configure Caro:
    # In ~/.config/cmdai/config.toml
    default_model = "qwen2.5-coder:7b"
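  4. Verify the setup (standard Ollama commands, independent of Caro):
    # The pulled model should appear in the list
    ollama list
    # Ollama's HTTP API answers on port 11434 by default
    curl http://localhost:11434/api/tags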

Model Caching

Caro caches downloaded models to speed up subsequent runs. The cache is stored in:

  • macOS/Linux: ~/.cache/cmdai/models/
  • Windows: %LOCALAPPDATA%\cmdai\models\

Cache Management

# View cache size
du -sh ~/.cache/cmdai/

# Limit the maximum cache size (in ~/.config/cmdai/config.toml)
cache_max_size_gb = 20

When the cache exceeds this limit, Caro automatically evicts the least recently used models.
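
To reclaim disk space immediately rather than waiting for automatic eviction, the cache directory can simply be deleted; cached models are plain downloads, so anything removed is re-fetched on the next run:

# Remove all cached models (they are re-downloaded as needed)
rm -rf ~/.cache/cmdai/models/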

Performance Tips

  • Apple Silicon: Use MLX backend for best performance
  • NVIDIA GPU: Use Ollama with GPU acceleration
  • Limited RAM: Use smaller models (3B parameters); see the sizing rule below this list
  • Slow internet: Pre-download models with Ollama
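
As a rough sizing rule for the RAM tip above (an approximation, not an official Caro figure): required memory ≈ parameter count × bytes per weight, plus 1-2 GB of overhead for context and runtime. At 4-bit quantization that is about 0.5 bytes per weight, so a 7B model needs roughly 7 × 10⁹ × 0.5 B ≈ 3.5 GB and fits comfortably in 8 GB of RAM, while a 14B model needs about 7 GB and is better served by 16 GB.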