v1.1.4 · Released April 2026

tamebi docs

Detect hardware. Know what you can run. One command gives you a full hardware scan, a ranked model list, and ready-to-copy Ollama commands.

$ pip install tamebi
$ tamebi check

Overview

tamebi is a general-purpose Python package for working with open-source AI models. It is designed to grow with you, from understanding your hardware, to running models, to building agents and deploying in production.

The current release focuses on hardware detection and model compatibility. Give it one command and it automatically scans your CPU, RAM, GPU, and disk, then tells you exactly which models you can run locally, at which precision, and what performance to expect. Every result includes a memory breakdown, throughput estimates, and ready-to-copy Ollama commands.

More features are on the way, including model inference, agent primitives, provider abstraction, and deployment helpers. This is the foundation.

NVIDIA, AMD, Intel, and Apple Silicon are all detected automatically. No extra flags or environment variables needed. The model catalog updates weekly and covers the latest releases from major labs.

Hardware-aware
Detects GPU, CPU, RAM, and disk. Knows exactly what you can run. No guessing.
One command
tamebi check gives you a full compatibility report in seconds.
Ollama-ready
Every compatible model ships with a ready-to-copy ollama run command.

Installation

Install with pip:

pip install tamebi

Or with uv (recommended, significantly faster):

uv pip install tamebi
Tip
uv resolves and installs tamebi roughly 10–100× faster than pip. If you don't have it: curl -LsSf https://astral.sh/uv/install.sh | sh

No extra flags, extras, or system dependencies are needed. Platform detection (NVIDIA via NVML, AMD via ROCm, Intel via OpenCL, Apple via system_profiler) is handled automatically at runtime.

Quick Start

Run a hardware scan with a single command:

tamebi check

Output is divided into three sections:

~/Workspace $ tamebi check

Hardware
  CPU                       Apple M4 Pro
  Architecture              arm64
  Cores / Threads           12 cores / 12 threads @ 4.5 GHz
  RAM                       24.0 GB total / 8.2 GB available
  GPU                       Apple M4 Pro, 24.0 GB VRAM (20.1 GB free)
  Disk                      512.0 GB free / 1024.0 GB total
  OS                        Darwin 25.3.0
  Available for inference   20.1 GB (VRAM)
Top Recommendations
  #  Model           Precision  Memory  Run with Ollama
  1  Qwen3.5 9B      INT4       6.3 GB  ollama run qwen3.5:9b
  2  gemma 4 E4B     INT8       5.5 GB  ollama run gemma4:4b
  3  DeepSeek R1 8B  INT4       5.6 GB  ollama run deepseek-r1:8b
Runnable Models

… all compatible models with memory breakdown, speed estimates, and TTFT

CLI Reference

tamebi check

Detect hardware and show what's runnable. Output has three sections: Hardware, Top Recommendations, and Runnable Models.

tamebi check [OPTIONS]
Flag              Short  Default  Description
--json            -j     false    Output as JSON instead of rich tables.
--context-length  -c     4096     Context length in tokens. The KV cache scales linearly with this; 4K vs 128K changes memory dramatically.
--batch-size      -b     1        Concurrent requests. Each gets its own KV cache. Set >1 if planning to serve multiple users.
--verbose         —      false    Show detailed detection info: driver versions, compute capability, etc.

tamebi models

Show the full model compatibility matrix. Every model in the catalog across all precisions (INT4, INT8, FP16), with fit status and memory at each level.

tamebi models [OPTIONS]
Flag              Short  Default  Description
--context-length  -c     4096     Context length for KV cache estimation.
--batch-size      -b     1        Batch size for KV cache estimation.

tamebi update

Pull the latest model catalog from the remote. The catalog updates automatically in the background, but you can force a refresh at any time.

tamebi update
Info
The catalog is fetched directly from HuggingFace Hub and covers models from Meta, Mistral, Google, Qwen, DeepSeek, GLM, MiniMax, Kimi, Liquid, and AllenAI. It updates weekly automatically. Run tamebi update to force a refresh.

Examples

Basic hardware check

Scan your machine and see all compatible models with Ollama commands.

tamebi check
JSON output for scripting

Machine-readable output. Pipe into jq, scripts, or CI pipelines.

tamebi check --json
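A minimal sketch of the scripting pattern: feed the JSON report into a Python filter instead of reading tables. The field names below (`hardware`, `recommendations`, `memory_gb`, and so on) are hypothetical stand-ins for illustration; check the actual `--json` output of your installed version for the real schema.

```python
import json

# Hypothetical sample of `tamebi check --json` output -- the real field
# names may differ; this only illustrates the scripting pattern.
sample = """
{
  "hardware": {"gpu": "Apple M4 Pro", "vram_gb": 24.0},
  "recommendations": [
    {"model": "qwen3.5:9b", "precision": "INT4", "memory_gb": 6.3},
    {"model": "deepseek-r1:8b", "precision": "INT4", "memory_gb": 5.6}
  ]
}
"""

report = json.loads(sample)

# Keep only models that fit a chosen memory budget.
budget_gb = 6.0
fitting = [r["model"] for r in report["recommendations"]
           if r["memory_gb"] <= budget_gb]
print(fitting)
```

In a real pipeline you would replace `sample` with the command's stdout, e.g. `subprocess.run(["tamebi", "check", "--json"], capture_output=True)`.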
Serving multiple users

Each concurrent user gets their own KV cache. This estimates memory for 4 simultaneous requests with 8K context each.

tamebi check --batch-size 4 --context-length 8192
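Because the KV cache is linear in both context length and batch size, this flag combination costs 8× the default (2× context × 4× batch). A quick sanity check, using an illustrative Llama-3-8B-like shape (32 layers, 8 KV heads, head dim 128; these numbers are assumptions, not tamebi's catalog data):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, batch_size,
                   bytes_per_elem=2):
    """KV cache: 2 (K and V) x layers x KV heads x head_dim x context x bytes x batch."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem * batch_size

# Illustrative Llama-3-8B-like architecture.
base  = kv_cache_bytes(32, 8, 128, context_len=4096, batch_size=1)
heavy = kv_cache_bytes(32, 8, 128, context_len=8192, batch_size=4)

print(base / 2**30)   # 0.5 GiB at the defaults
print(heavy / base)   # 8.0 -- doubling context and quadrupling batch
```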
Use native max context window

Pass --context-length 0 to use each model's own maximum context window instead of the 4K default.

tamebi check --context-length 0
Browse all models

See every model in the catalog and their compatibility across INT4, INT8, and FP16 precisions.

tamebi models

Supported Hardware

tamebi detects your hardware automatically using platform-native APIs. No extra configuration is needed.

Platform       Detection Method       What's Reported
NVIDIA         nvidia-ml-py (NVML)    Model, VRAM, CUDA version, compute capability
AMD            rocm-smi (subprocess)  Model, VRAM (requires ROCm)
Intel          OpenCL / WMI           Model, VRAM (Arc discrete + integrated graphics)
Apple Silicon  system_profiler        Chip model (M1/M2/M3/M4), unified memory
CPU-only       psutil + py-cpuinfo    Cores, threads, frequency, architecture
Apple Silicon note
On Apple Silicon, the GPU and CPU share unified memory. tamebi uses the total unified memory as the available VRAM, which matches how frameworks like Ollama and llama.cpp actually allocate memory on these devices.

Model Catalog

The model catalog is maintained automatically: it is fetched directly from HuggingFace Hub, with no manual curation needed. It updates in the background weekly and covers the latest releases from all major labs.

Meta (Llama)
Mistral AI
Google (Gemma)
Alibaba (Qwen)
DeepSeek
Kimi / Moonshot
Liquid AI
AllenAI
GLM / Zhipu
MiniMax

Models are catalogued across multiple precisions (INT4, INT8, FP16) with accurate parameter counts, context windows, and layer counts for GQA-aware KV cache estimation. Run tamebi update at any time to pull the latest catalog without reinstalling.

How Estimation Works

Memory is estimated per model and precision using the following formula:

Memory estimation formula

Total VRAM    = Model Weights + KV Cache + Overhead

Model Weights = params (billions) × bytes_per_param
                FP16 → 2 bytes/param
                INT8 → 1 byte/param
                INT4 → 0.5 bytes/param

KV Cache      = 2 × layers × num_kv_heads × head_dim
                × context_len × bytes_per_element × batch_size
                (GQA-aware: uses KV heads, not Q heads)

Overhead      = 15% of weights (activations + fragmentation)
                + 0.5 GB CUDA runtime (NVIDIA only)
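The formula above can be sketched in a few lines of Python. This is an illustration of the arithmetic, not tamebi's actual code, and the worked example uses assumed Llama-3-8B-like architecture numbers (32 layers, 8 KV heads, head dim 128) with the KV cache kept in FP16:

```python
def estimate_vram_gb(params_b, bytes_per_param, layers, kv_heads, head_dim,
                     context_len=4096, batch_size=1, kv_bytes=2, nvidia=False):
    """Sketch of the estimation formula: weights + GQA-aware KV cache + overhead."""
    weights = params_b * 1e9 * bytes_per_param            # model weights in bytes
    kv = (2 * layers * kv_heads * head_dim                # K and V, KV heads only
          * context_len * kv_bytes * batch_size)
    overhead = 0.15 * weights + (0.5e9 if nvidia else 0)  # activations + CUDA runtime
    return (weights + kv + overhead) / 1e9

# Illustrative 8B model at INT4 (0.5 bytes/param) on an NVIDIA GPU.
total = estimate_vram_gb(8, 0.5, layers=32, kv_heads=8, head_dim=128, nvidia=True)
print(round(total, 1))  # ~5.6 GB: 4.0 weights + ~0.54 KV cache + 1.1 overhead
```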
Accuracy
Performance estimates (tokens/sec and time to first token) are based on hardware-class lookup tables. They show ranges, not exact numbers. Actual performance depends on your drivers, software stack (Ollama, llama.cpp, vLLM), and workload pattern.

The KV cache formula is GQA-aware: models that use grouped query attention (like Llama 3, Qwen 2.5, Gemma) have far fewer KV heads than query heads, so their KV cache is proportionally smaller. tamebi uses the actual architecture metadata from HuggingFace, not a rough approximation.

tamebilab

© 2026 Tamebi AI. All rights reserved.