Milestone #83
running LLM (open)
Updated by Daniele Cruciani about 1 month ago
[2026-02-10 15:27 CET]
Below is a practical, rough-order comparison table of some of the lighter LLMs you can run locally, especially on a CPU-only Linux machine like yours with Intel Iris Xe graphics and typical laptop RAM. I’ll focus on models in the ~0.5 B to ~8 B range — the sweet spot for desktop CPU inference (with quantization) without a GPU. (Alibaba)
⚠️ These are estimates — actual RAM/CPU usage varies by:
- quantization format (e.g., 4-bit vs 8-bit),
- inference engine (llama.cpp, gptq, vLLM, onnxruntime, etc.),
- threading settings (n_threads),
- context window size,
- OS overhead.
Numbers here are typical effective requirements for CPU inference and assume:
- quantized models (4-bit / 8-bit) where supported,
- inference with moderate context (~1–4 K tokens).
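As a rough sanity check, the RAM figures in the table below can be approximated from parameter count and quantization width. A back-of-envelope sketch (the ~30% overhead factor for KV cache and runtime buffers is an assumption, not a measured value):

```python
def approx_ram_gb(params_billions: float, bits_per_weight: int, overhead: float = 1.3) -> float:
    """Rough working-set estimate for CPU inference of a quantized model.

    weights : params * bits/8 bytes
    overhead: ~30% extra assumed for KV cache, activations and runtime buffers.
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9  # decimal GB

# 7B model at 4-bit: ~3.5 GB of weights, ~4.6 GB working set
print(f"7B @ Q4:  {approx_ram_gb(7, 4):.1f} GB")
# 1.1B model at 4-bit: well under 1 GB
print(f"1.1B @ Q4: {approx_ram_gb(1.1, 4):.1f} GB")
```

This is a lower bound; real usage grows with the context window, which is why the table's ranges sit a bit above these numbers.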
Lightweight LLMs — Resource Comparison (CPU-only focus)¶
| Model | Size (Params) | Typical RAM (CPU) | Typical Storage | CPU Workload Notes |
|---|---|---|---|---|
| Qwen2.5-0.5B | ~0.5 B | ~1–2 GB | ~1 GB | Fastest; tiny footprint; good basic text tasks (cs.virginia.edu) |
| TinyLlama-1.1B | ~1.1 B | ~2–3 GB | ~0.6–1 GB | Very low usage; basic chat/summarization (Alibaba) |
| SmolLM2-1.7B | ~1.7 B | ~2–4 GB | ~2 GB | Good quality for size; still very light (Medium) |
| Gemma-2-2B | ~2 B | ~3–4 GB | ~3 GB | Balanced speed/quality (Inero Software - Software Consulting) |
| Llama-3.2-3B | ~3 B | ~4–6 GB | ~6 GB | Strong local summarization; long-context support (Inero Software - Software Consulting) |
| StableLM-Zephyr-3B | ~3 B | ~4–6 GB | ~? | Mid-range alternative for generic tasks (Analytics Vidhya) |
| Phi-3 Mini / 3.8B | ~3.8 B | ~5–7 GB | ~2.5 GB | Good general tasks; lighter footprint (localaimaster.com) |
| Mistral-7B | ~7 B | ~6–8 GB | ~4–5 GB | Very efficient 7B model; one of the better CPU choices (Alibaba) |
| Llama-2-7B / 8B | ~7–8 B | ~6–10 GB | ~5–7 GB | General purpose; slower on CPU but OK with quantization (LinkedIn) |
How to interpret these¶
🧠 “Typical RAM (CPU)”¶
- This is working memory needed when running the model in 4-bit or 8-bit quantization.
- Smaller models can fit comfortably in 4–8 GB RAM.
- Larger (~7–8 B) models often benefit from ≥12 GB for decent performance and context windows. (Alibaba)
⚡ CPU workload¶
Since you have Intel Iris Xe, there is no dedicated AI accelerator exposed by default, so these models will run slowly unless:
- quantized weights (4-bit) are used,
- inference libraries optimized for CPU (AVX2/AVX512 where supported) are used. (Alibaba)
On purely CPU inference:
- ultra-small models like Qwen2.5-0.5B or TinyLlama ~1 B may hit tens of tokens/sec (Reddit)
- mid-range (3–4 B) might be ~10–25 tokens/sec depending on threads/settings (Reddit)
- larger (7 B) might be a few tokens/sec — enough for simple local use, but not responsive like hosted APIs. (Alibaba)
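Those throughput ranges follow from a common rule of thumb (an assumption, not a benchmark): CPU decode is memory-bandwidth-bound, since each generated token streams the full quantized weight file, so tokens/sec ≈ RAM bandwidth ÷ model size:

```python
def approx_tokens_per_sec(model_gb: float, bandwidth_gb_s: float = 20.0) -> float:
    """Memory-bandwidth-bound estimate of CPU decode speed.

    Assumes every token reads all quantized weights once; 20 GB/s is an
    assumed figure for typical dual-channel laptop DDR4.
    """
    return bandwidth_gb_s / model_gb

print(f"TinyLlama Q4 (~0.7 GB): ~{approx_tokens_per_sec(0.7):.0f} tok/s")
print(f"7B Q4 (~4 GB):          ~{approx_tokens_per_sec(4.0):.0f} tok/s")
```

That gives roughly 29 tok/s for a ~0.7 GB model and ~5 tok/s for a ~4 GB one, consistent with the ranges above.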
🧰 Storage¶
Model files vary widely:
- ~0.6–2 GB for small ones (0.5–2 B),
- ~4–7 GB for 7–8 B.
Quantized files are smaller (4-bit ≈ ~1/4 full float size). (Inero Software - Software Consulting)
Practical advice for your machine¶
You can comfortably experiment with:
- TinyLlama-1.1B
- Qwen2.5-0.5B
- SmolLM2-1.7B
- Gemma-2-2B
- Llama-3.2-3B
- Mistral-7B
Without a GPU, quantizing to 4-bit (Q4) is essential for usable performance. Tools like:
- llama.cpp,
- gptq,
- onnxruntime,
- inference UIs like Ollama or GPT4All
…will help run these efficiently on CPU. (LinkedIn)
Summary (very blunt)¶
- 0.5–2 B models — fast, low RAM, good for simple tasks.
- 3–4 B models — decent balance of quality and performance.
- 7–8 B models — highest quality you can still run on CPU, but slow.
No matter what, full-precision 16/32-bit inference is not realistic on a laptop CPU; always aim for 4-bit quantization.
If you want, I can give you specific commands to run these models on your Linux box using tools like llama.cpp or Ollama, tuned to your CPU and RAM.
Updated by Daniele Cruciani about 1 month ago
[2026‑02‑10 16:12 CET]
Here’s a practical, honest answer about running a model like TinyLlama 1.1B on your home‑lab Kubernetes: what tools exist, how they expose APIs, and whether you need Python/FastAPI.
1) Is there a Docker image you can use (no custom Python needed)?¶
Yes — there are containerized inference systems that can take a GGUF‑format model like TinyLlama 1.1B and expose an API.
Common inference containers¶
a) Docker Model Runner
- General‑purpose container that can run LLMs with different backends (llama.cpp and vLLM) and expose an API automatically.
- Includes inference binary in the image; makes it easy to mount your model directory. (GitHub)
b) vLLM OpenAI‑compatible Docker image
- Official image that runs an OpenAI‑compatible server on port 8000.
- You specify the model on start (
--model TinyLlama/...). (docs.vllm.ai)
c) AIKit / RamaLama / Ollama containers
- Community images that run LLMs and expose an OpenAI‑style REST API.
- Some come with built‑in API server so you don’t write any network code. (Reddit)
d) LMDeploy
- Another container‑based inference server that serves a REST API compatible with OpenAI endpoints, can run on CPU or GPU. (lmdeploy.readthedocs.io)
👉 Bottom line: You do not need to build your own Python API unless you want custom logic. Most containers already expose REST endpoints.
2) Does it expose an API, and what kind?¶
Most of the above runtimes provide:
- OpenAI‑compatible REST API (e.g., POST /v1/chat/completions)
- sometimes the direct llama.cpp HTTP API (if using a llama.cpp backend)
That means:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"TinyLlama-1.1B","messages":[{"role":"user","content":"Hello"}]}'
will work (exact path/port depends on image). (docs.vllm.ai)
You do not need FastAPI or Python just to expose a network API — the container usually comes with its own server.
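The same call can be made from Python with only the standard library. A sketch, assuming the vLLM defaults (port 8000, `/v1/chat/completions`); the base URL is a placeholder:

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST to an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("http://localhost:8000", "TinyLlama-1.1B", "Hello")
# urllib.request.urlopen(req) would send it once the container is running
```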
3) If you do want custom Python logic¶
If you want:
- custom prompt preprocessing,
- vector store retrieval + LLM,
- business logic before/after calls,
then a common pattern is:
- Containerized inference server (model + API)
- A Python microservice (FastAPI/Flask) that:
  - calls the container's API,
  - applies your business logic,
  - returns results.
That separation keeps:
- heavy inference inside a managed container
- business rules in your app
But it’s not required just to serve the model.
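A minimal sketch of that split using only the standard library (the in-cluster service URL, model name, and system-prompt "business logic" are all placeholder assumptions; a real service would likely wrap this in FastAPI as discussed):

```python
import json
import urllib.request

# In-cluster Service name for the inference container (assumed)
INFERENCE_URL = "http://tinyllama-svc/v1/chat/completions"

def apply_business_logic(user_message: str) -> list:
    """Example pre-processing step: prepend a fixed system prompt."""
    return [
        {"role": "system", "content": "Answer briefly."},
        {"role": "user", "content": user_message},
    ]

def handle(user_message: str) -> str:
    """Forward the pre-processed conversation to the inference container."""
    body = json.dumps({"model": "TinyLlama-1.1B",
                       "messages": apply_business_logic(user_message)})
    req = urllib.request.Request(INFERENCE_URL, data=body.encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The inference container never sees your business rules; swapping the backend (vLLM, llama.cpp, Ollama) only means changing `INFERENCE_URL`.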
4) How to run TinyLlama 1.1B in Kubernetes¶
a) Using an inference container¶
Typical steps:
1) Create a Kubernetes Deployment YAML
Example (simple):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tinyllama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tinyllama
  template:
    metadata:
      labels:
        app: tinyllama
    spec:
      containers:
        - name: tinyllama
          image: vllm/vllm-openai:latest
          args: ["--model", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"]
          ports:
            - containerPort: 8000
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          emptyDir: {}
2) Expose it as a Service
apiVersion: v1
kind: Service
metadata:
  name: tinyllama-svc
spec:
  selector:
    app: tinyllama
  ports:
    - port: 80
      targetPort: 8000
3) Then call the API
curl http://<service-ip>/v1/chat/completions ...
This runs it entirely in Kubernetes without Python. (docs.vllm.ai)
5) Deployment frameworks that simplify this¶
Instead of manually writing YAML, you can use tools that are Kubernetes‑native and simplify LLM serving:
⭐ LLMKube¶
- Kubernetes operator to deploy LLMs with scaling, metrics, GPU support, and OpenAI‑compatible APIs. (LLMKube)
Others¶
- AIKit / RamaLama generate container and manifests
- Model Runner can be run as a container with your model
These tools reduce manual YAML and add:
- model caching
- health checks
- integration with vector stores and RAG
6) Python control vs container¶
When you don’t need Python¶
- You just want a simple inference API
- You use OpenAI‑compatible clients
- You standardize on one container for serving
➡️ The container is enough.
When you do want Python¶
Use Python when:
- you have business logic
- you want to integrate with vector DB (e.g., Qdrant)
- you want to orchestrate prompt/RAG + inference
In practice:
Python Backend (FastAPI) -> Inference Container -> Kubernetes Service
The FastAPI service becomes your custom API that calls the inference container.
You can put both behind an ingress so external clients only see your Python app.
Final takeaways¶
✔ You can run TinyLlama 1.1B in Kubernetes without writing Python
✔ Many existing Docker inference containers already expose REST APIs
✔ Python FastAPI is optional — only if you need custom logic
✔ Tools like Model Runner, vLLM images, AIKit/RamaLama and operators like LLMKube make serving easier
If you want, I can produce example Kubernetes manifests that:
- deploy TinyLlama 1.1B
- expose it via an ingress
- integrate with a simple Python / FastAPI service that does request routing and RAG preprocessing.