v1.1.4 · Released April 2026

tamebi docs

Detect hardware. Know what you can run. One command gives you a full hardware scan, a ranked model list, and ready-to-copy Ollama commands.

$ pip install tamebi
$ tamebi check

Overview

tamebi is a general-purpose Python package for working with open-source AI models. It is designed to grow with you, from understanding your hardware, to running models, to building agents and deploying in production.

The current release focuses on hardware detection and model compatibility. Give it one command and it automatically scans your CPU, RAM, GPU, and disk, then tells you exactly which models you can run locally, at which precision, and what performance to expect. Every result includes a memory breakdown, throughput estimates, and ready-to-copy Ollama commands.

More features are on the way, including model inference, agent primitives, provider abstraction, and deployment helpers. This is the foundation.

NVIDIA, AMD, Intel, and Apple Silicon are all detected automatically. No extra flags or environment variables needed. The model catalog updates weekly and covers the latest releases from major labs.

Hardware-aware
Detects GPU, CPU, RAM, and disk. Knows exactly what you can run. No guessing.
One command
tamebi check gives you a full compatibility report in seconds.
Ollama-ready
Every compatible model ships with a ready-to-copy ollama run command.

Installation

Install with pip:

pip install tamebi

Or with uv (recommended, significantly faster):

uv pip install tamebi
Tip
uv resolves and installs tamebi roughly 10–100× faster than pip. If you don't have it: curl -LsSf https://astral.sh/uv/install.sh | sh

No extra flags, extras, or system dependencies are needed. Platform detection (NVIDIA via NVML, AMD via ROCm, Intel via OpenCL, Apple via system_profiler) is handled automatically at runtime.

Quick Start

Run a hardware scan with a single command:

tamebi check

Output is divided into three sections:

~/Workspace $ tamebi check

Hardware
  CPU                       Apple M4 Pro
  Architecture              arm64
  Cores / Threads           12 cores / 12 threads @ 4.5 GHz
  RAM                       24.0 GB total / 8.2 GB available
  GPU                       Apple M4 Pro, 24.0 GB VRAM (20.1 GB free)
  Disk                      512.0 GB free / 1024.0 GB total
  OS                        Darwin 25.3.0
  Available for inference   20.1 GB (VRAM)
Top Recommendations
  #  Model           Precision  Memory  Run with Ollama
  1  Qwen3.5 9B      INT4       6.3 GB  ollama run qwen3.5:9b
  2  gemma 4 E4B     INT8       5.5 GB  ollama run gemma4:4b
  3  DeepSeek R1 8B  INT4       5.6 GB  ollama run deepseek-r1:8b
Runnable Models

… all compatible models with memory breakdown, speed estimates, and TTFT

CLI Reference

tamebi check

Detect hardware and show what's runnable. Output has three sections: Hardware, Top Recommendations, and Runnable Models.

tamebi check [OPTIONS]
Flag              Short  Default  Description
--json            -j     false    Output as JSON instead of rich tables.
--context-length  -c     4096     Context length in tokens. The KV cache scales linearly with this; 4K vs 128K changes memory dramatically.
--batch-size      -b     1        Concurrent requests. Each gets its own KV cache. Set >1 if planning to serve multiple users.
--verbose         —      false    Show detailed detection info: driver versions, compute capability, etc.

tamebi models

Show the full model compatibility matrix. Every model in the catalog across all precisions (INT4, INT8, FP16), with fit status and memory at each level.

tamebi models [OPTIONS]
Flag              Short  Default  Description
--context-length  -c     4096     Context length for KV cache estimation.
--batch-size      -b     1        Batch size for KV cache estimation.

tamebi update

Pull the latest model catalog from the remote. The catalog updates automatically in the background, but you can force a refresh at any time.

tamebi update
Info
The catalog is fetched directly from HuggingFace Hub and covers models from Meta, Mistral, Google, Qwen, DeepSeek, GLM, MiniMax, Kimi, Liquid, and AllenAI. It updates weekly automatically. Run tamebi update to force a refresh.

Examples

Basic hardware check

Scan your machine and see all compatible models with Ollama commands.

tamebi check
JSON output for scripting

Machine-readable output. Pipe into jq, scripts, or CI pipelines.

tamebi check --json
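A minimal sketch of the scripting pattern: feed the JSON report into a Python filter instead of reading tables. The field names below (`hardware`, `recommendations`, `memory_gb`, and so on) are hypothetical stand-ins for illustration; check the actual `--json` output of your installed version for the real schema.

```python
import json

# Hypothetical sample of `tamebi check --json` output -- the real field
# names may differ; this only illustrates the scripting pattern.
sample = """
{
  "hardware": {"gpu": "Apple M4 Pro", "vram_gb": 24.0},
  "recommendations": [
    {"model": "qwen3.5:9b", "precision": "INT4", "memory_gb": 6.3},
    {"model": "deepseek-r1:8b", "precision": "INT4", "memory_gb": 5.6}
  ]
}
"""

report = json.loads(sample)

# Keep only models that fit a chosen memory budget.
budget_gb = 6.0
fitting = [r["model"] for r in report["recommendations"]
           if r["memory_gb"] <= budget_gb]
print(fitting)
```

In a real pipeline you would replace `sample` with the command's stdout, e.g. `subprocess.run(["tamebi", "check", "--json"], capture_output=True)`.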
Serving multiple users

Each concurrent user gets their own KV cache. This estimates memory for 4 simultaneous requests with 8K context each.

tamebi check --batch-size 4 --context-length 8192
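Because the KV cache is linear in both context length and batch size, this flag combination costs 8× the default (2× context × 4× batch). A quick sanity check, using an illustrative Llama-3-8B-like shape (32 layers, 8 KV heads, head dim 128; these numbers are assumptions, not tamebi's catalog data):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, batch_size,
                   bytes_per_elem=2):
    """KV cache: 2 (K and V) x layers x KV heads x head_dim x context x bytes x batch."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem * batch_size

# Illustrative Llama-3-8B-like architecture.
base  = kv_cache_bytes(32, 8, 128, context_len=4096, batch_size=1)
heavy = kv_cache_bytes(32, 8, 128, context_len=8192, batch_size=4)

print(base / 2**30)   # 0.5 GiB at the defaults
print(heavy / base)   # 8.0 -- doubling context and quadrupling batch
```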
Use native max context window

Pass --context-length 0 to use each model's own maximum context window instead of the 4K default.

tamebi check --context-length 0
Browse all models

See every model in the catalog and their compatibility across INT4, INT8, and FP16 precisions.

tamebi models

Supported Hardware

tamebi detects your hardware automatically using platform-native APIs. No extra configuration is needed.

Platform       Detection Method       What's Reported
NVIDIA         nvidia-ml-py (NVML)    Model, VRAM, CUDA version, compute capability
AMD            rocm-smi (subprocess)  Model, VRAM (requires ROCm)
Intel          OpenCL / WMI           Model, VRAM (Arc discrete + integrated graphics)
Apple Silicon  system_profiler        Chip model (M1/M2/M3/M4), unified memory
CPU-only       psutil + py-cpuinfo    Cores, threads, frequency, architecture
Apple Silicon note
On Apple Silicon, the GPU and CPU share unified memory. tamebi uses the total unified memory as the available VRAM, which matches how frameworks like Ollama and llama.cpp actually allocate memory on these devices.

Model Catalog

The model catalog is maintained automatically: it is fetched directly from HuggingFace Hub, with no manual curation needed. It updates in the background weekly and covers the latest releases from all major labs.

Meta (Llama)
Mistral AI
Google (Gemma)
Alibaba (Qwen)
DeepSeek
Kimi / Moonshot
Liquid AI
AllenAI
GLM / Zhipu
MiniMax

Models are catalogued across multiple precisions (INT4, INT8, FP16) with accurate parameter counts, context windows, and layer counts for GQA-aware KV cache estimation. Run tamebi update at any time to pull the latest catalog without reinstalling.

How Estimation Works

Memory is estimated per model and precision using the following formula:

Memory estimation formula

Total VRAM    = Model Weights + KV Cache + Overhead

Model Weights = params (billions) × bytes_per_param
                FP16 → 2 bytes/param
                INT8 → 1 byte/param
                INT4 → 0.5 bytes/param

KV Cache      = 2 × layers × num_kv_heads × head_dim
                × context_len × bytes_per_element × batch_size
                (GQA-aware: uses KV heads, not Q heads)

Overhead      = 15% of weights (activations + fragmentation)
                + 0.5 GB CUDA runtime (NVIDIA only)
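The formula above can be sketched in a few lines of Python. This is an illustration of the arithmetic, not tamebi's actual code, and the worked example uses assumed Llama-3-8B-like architecture numbers (32 layers, 8 KV heads, head dim 128) with the KV cache kept in FP16:

```python
def estimate_vram_gb(params_b, bytes_per_param, layers, kv_heads, head_dim,
                     context_len=4096, batch_size=1, kv_bytes=2, nvidia=False):
    """Sketch of the estimation formula: weights + GQA-aware KV cache + overhead."""
    weights = params_b * 1e9 * bytes_per_param            # model weights in bytes
    kv = (2 * layers * kv_heads * head_dim                # K and V, KV heads only
          * context_len * kv_bytes * batch_size)
    overhead = 0.15 * weights + (0.5e9 if nvidia else 0)  # activations + CUDA runtime
    return (weights + kv + overhead) / 1e9

# Illustrative 8B model at INT4 (0.5 bytes/param) on an NVIDIA GPU.
total = estimate_vram_gb(8, 0.5, layers=32, kv_heads=8, head_dim=128, nvidia=True)
print(round(total, 1))  # ~5.6 GB: 4.0 weights + ~0.54 KV cache + 1.1 overhead
```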
Accuracy
Performance estimates (tokens/sec and time to first token) are based on hardware-class lookup tables. They show ranges, not exact numbers. Actual performance depends on your drivers, software stack (Ollama, llama.cpp, vLLM), and workload pattern.

The KV cache formula is GQA-aware: models that use grouped query attention (like Llama 3, Qwen 2.5, Gemma) have far fewer KV heads than query heads, so their KV cache is proportionally smaller. tamebi uses the actual architecture metadata from HuggingFace, not a rough approximation.

tamebilab

© 2026 Tamebi AI. All rights reserved.