Model Backends
Caro supports multiple inference backends for running language models. Choose the best option for your hardware and use case.
Available Backends
MLX (Apple Silicon)
Optimized for Apple M1/M2/M3 chips using Metal Performance Shaders. This is the fastest option for Mac users with Apple Silicon.
- Platform: macOS (Apple Silicon only)
- Performance: Fastest on supported hardware
- Memory: Uses unified memory efficiently
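A quick way to confirm your Mac qualifies before enabling this backend:

```sh
# Prints "arm64" on Apple Silicon; "x86_64" on Intel Macs
uname -m
```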
Ollama
Connect to a local Ollama instance for inference. Great for users who already have Ollama set up.
- Platform: Cross-platform
- Requires: Ollama running locally
- Models: Any Ollama-compatible model
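Caro can only use this backend when the Ollama daemon is reachable. A quick check:

```sh
# List models known to the local Ollama daemon
ollama list

# Or probe the HTTP API directly (Ollama listens on port 11434 by default)
curl -s http://localhost:11434/api/tags
```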
vLLM
Connect to a vLLM server for high-throughput inference. Best for users with powerful GPU servers.
- Platform: Cross-platform (client)
- Requires: vLLM server running
- Performance: Excellent for batch processing
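If you don't already have a server running, a typical way to start one is vLLM's OpenAI-compatible entrypoint; the model name and port below are illustrative:

```sh
# Start an OpenAI-compatible vLLM server (downloads the model on first run)
vllm serve Qwen/Qwen2.5-Coder-7B-Instruct --port 8000
```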
CPU Backend (Candle)
Pure CPU inference using the Candle library. It works everywhere but is slower than the GPU-accelerated options.
- Platform: Cross-platform
- Performance: Slower but universal
- No GPU required
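As a sketch of what selecting a backend might look like in config.toml — note that the `backend` key name is an assumption for illustration; only `default_model` and `cache_max_size_gb` are shown elsewhere on this page:

```toml
# In ~/.config/cmdai/config.toml
# NOTE: the "backend" key is assumed here; check the config reference
# for the exact key name. Valid values would mirror the list above.
backend = "ollama"
```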
Recommended Models
Caro works best with coding-focused models:
Qwen 2.5 Coder (Recommended)
- `qwen2.5-coder:3b` - Fast, good for simple commands
- `qwen2.5-coder:7b` - Balanced performance and quality
- `qwen2.5-coder:14b` - Best quality, needs more RAM
Other Compatible Models
- `codellama:7b` - Meta's coding model
- `deepseek-coder:6.7b` - DeepSeek's coding model
- `starcoder2:7b` - BigCode's model
Setting Up Ollama
1. Install Ollama:

```sh
curl -fsSL https://ollama.com/install.sh | sh
```

2. Pull a model:

```sh
ollama pull qwen2.5-coder:7b
```

3. Configure Caro:

```toml
# In ~/.config/cmdai/config.toml
default_model = "qwen2.5-coder:7b"
```
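Before pointing Caro at the model, it's worth smoke-testing it directly through Ollama:

```sh
# Ask the pulled model for a one-off completion to confirm it loads and responds
ollama run qwen2.5-coder:7b "Write a shell command that lists files larger than 100 MB"
```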
Model Caching
Caro caches downloaded models to speed up subsequent runs. The cache is stored in:
- macOS/Linux: `~/.cache/cmdai/models/`
- Windows: `%LOCALAPPDATA%\cmdai\models\`
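To see what is currently cached (macOS/Linux path shown; adjust for Windows):

```sh
# List cached model files with sizes
ls -lh ~/.cache/cmdai/models/
```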
Cache Management
```sh
# View cache size
du -sh ~/.cache/cmdai/
```

```toml
# Configure max cache size (in config.toml)
cache_max_size_gb = 20
```

When the cache exceeds the limit, Caro automatically removes the least-recently-used models.
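If you need disk space back immediately rather than waiting for eviction, the cache can be cleared by hand. This assumes Caro re-downloads models on demand, which these docs don't state explicitly:

```sh
# Delete every cached model (they will need to be fetched again)
rm -rf ~/.cache/cmdai/models/
```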
Performance Tips
- Apple Silicon: Use MLX backend for best performance
- NVIDIA GPU: Use Ollama with GPU acceleration
- Limited RAM: Use smaller models (3B parameters)
- Slow internet: Pre-download models with Ollama
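Putting the RAM tip into practice, a config for a memory-constrained machine might look like this. Values are illustrative; both keys appear earlier on this page:

```toml
# In ~/.config/cmdai/config.toml
default_model = "qwen2.5-coder:3b"  # 3B model: lowest RAM footprint of the recommended set
cache_max_size_gb = 5               # keep the on-disk model cache small too
```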