Model Backends
Caro supports multiple inference backends for running language models. Choose the best option for your hardware and use case.
Available Backends
MLX (Apple Silicon)
Optimized for Apple M1/M2/M3 chips using Metal Performance Shaders. This is the fastest option for Mac users with Apple Silicon.
- Platform: macOS (Apple Silicon only)
- Performance: Fastest on supported hardware
- Memory: Uses unified memory efficiently
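A quick way to confirm your Mac qualifies before enabling this backend:

```sh
# Prints "arm64" on Apple Silicon; "x86_64" on Intel Macs
uname -m
```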
Ollama
Connect to a local Ollama instance for inference. Great for users who already have Ollama set up.
- Platform: Cross-platform
- Requires: Ollama running locally
- Models: Any Ollama-compatible model
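Caro can only use this backend when the Ollama daemon is reachable. A quick check:

```sh
# List models known to the local Ollama daemon
ollama list

# Or probe the HTTP API directly (Ollama listens on port 11434 by default)
curl -s http://localhost:11434/api/tags
```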
vLLM
Connect to a vLLM server for high-throughput inference. Best for users with powerful GPU servers.
- Platform: Cross-platform (client)
- Requires: vLLM server running
- Performance: Excellent for batch processing
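If you don't already have a server running, a typical way to start one is vLLM's OpenAI-compatible entrypoint; the model name and port below are illustrative:

```sh
# Start an OpenAI-compatible vLLM server (downloads the model on first run)
vllm serve Qwen/Qwen2.5-Coder-7B-Instruct --port 8000
```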
CPU Backend (Candle)
Pure CPU inference using the Candle library. It works everywhere but is slower than the GPU-accelerated options.
- Platform: Cross-platform
- Performance: Slower but universal
- No GPU required
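As a sketch of what selecting a backend might look like in config.toml — note that the `backend` key name is an assumption for illustration; only `default_model` and `cache_max_size_gb` are shown elsewhere on this page:

```toml
# In ~/.config/cmdai/config.toml
# NOTE: the "backend" key is assumed here; check the config reference
# for the exact key name. Valid values would mirror the list above.
backend = "ollama"
```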
Recommended Models
Caro works best with coding-focused models:
Qwen 2.5 Coder (Recommended)
- `qwen2.5-coder:3b` - Fast, good for simple commands
- `qwen2.5-coder:7b` - Balanced performance and quality
- `qwen2.5-coder:14b` - Best quality, needs more RAM
Other Compatible Models
- `codellama:7b` - Meta's coding model
- `deepseek-coder:6.7b` - DeepSeek's coding model
- `starcoder2:7b` - BigCode's model
Setting Up Ollama
1. Install Ollama:

```sh
curl -fsSL https://ollama.com/install.sh | sh
```

2. Pull a model:

```sh
ollama pull qwen2.5-coder:7b
```

3. Configure Caro:

```toml
# In ~/.config/cmdai/config.toml
default_model = "qwen2.5-coder:7b"
```
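Before pointing Caro at the model, it's worth smoke-testing it directly through Ollama:

```sh
# Ask the pulled model for a one-off completion to confirm it loads and responds
ollama run qwen2.5-coder:7b "Write a shell command that lists files larger than 100 MB"
```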
Model Caching
Caro caches downloaded models to speed up subsequent runs. The cache is stored in:
- macOS/Linux: `~/.cache/cmdai/models/`
- Windows: `%LOCALAPPDATA%\cmdai\models\`
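To see what is currently cached (macOS/Linux path shown; adjust for Windows):

```sh
# List cached model files with sizes
ls -lh ~/.cache/cmdai/models/
```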
Cache Management
```sh
# View cache size
du -sh ~/.cache/cmdai/
```

```toml
# Configure max cache size (in config.toml)
cache_max_size_gb = 20
```

When the cache exceeds the limit, Caro automatically removes the least-recently-used models.
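If you need disk space back immediately rather than waiting for eviction, the cache can be cleared by hand. This assumes Caro re-downloads models on demand, which these docs don't state explicitly:

```sh
# Delete every cached model (they will need to be fetched again)
rm -rf ~/.cache/cmdai/models/
```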
Performance Tips
- Apple Silicon: Use MLX backend for best performance
- NVIDIA GPU: Use Ollama with GPU acceleration
- Limited RAM: Use smaller models (3B parameters)
- Slow internet: Pre-download models with Ollama
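Putting the RAM tip into practice, a config for a memory-constrained machine might look like this. Values are illustrative; both keys appear earlier on this page:

```toml
# In ~/.config/cmdai/config.toml
default_model = "qwen2.5-coder:3b"  # 3B model: lowest RAM footprint of the recommended set
cache_max_size_gb = 5               # keep the on-disk model cache small too
```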