Milestone #83
running LLM (open)
Updated by Daniele Cruciani about 1 month ago
[2026-02-10 15:27 CET]
Below is a practical, rough-order comparison table of some of the lighter LLMs you can run locally, especially on a CPU-only Linux machine like yours with Intel Iris Xe graphics and typical laptop RAM. I’ll focus on models in the ~0.5 B to ~8 B range — the sweet spot for desktop CPU inference (with quantization) without a GPU. (Alibaba)
⚠️ These are estimates — actual RAM/CPU usage varies by:
- quantization format (e.g., 4-bit vs 8-bit),
- inference engine (llama.cpp, gptq, vLLM, onnxruntime, etc.),
- threading settings (n_threads),
- context window size,
- OS overhead.
Numbers here are typical effective requirements for CPU inference and assume:
- quantized models (4-bit / 8-bit) where supported,
- inference with moderate context (~1–4 K tokens).
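As a rough sanity check, the RAM figures in the table below can be approximated from parameter count and quantization width. A back-of-envelope sketch (the ~30% overhead factor for KV cache and runtime buffers is an assumption, not a measured value):

```python
def approx_ram_gb(params_billions: float, bits_per_weight: int, overhead: float = 1.3) -> float:
    """Rough working-set estimate for CPU inference of a quantized model.

    weights : params * bits/8 bytes
    overhead: ~30% extra assumed for KV cache, activations and runtime buffers.
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9  # decimal GB

# 7B model at 4-bit: ~3.5 GB of weights, ~4.6 GB working set
print(f"7B @ Q4:  {approx_ram_gb(7, 4):.1f} GB")
# 1.1B model at 4-bit: well under 1 GB
print(f"1.1B @ Q4: {approx_ram_gb(1.1, 4):.1f} GB")
```

This is a lower bound; real usage grows with the context window, which is why the table's ranges sit a bit above these numbers.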
Lightweight LLMs — Resource Comparison (CPU-only focus)¶
| Model | Size (Params) | Typical RAM (CPU) | Typical Storage | CPU Workload Notes |
|---|---|---|---|---|
| Qwen2.5-0.5B | ~0.5 B | ~1–2 GB | ~1 GB | Fastest; tiny footprint; good basic text tasks (cs.virginia.edu) |
| TinyLlama-1.1B | ~1.1 B | ~2–3 GB | ~0.6–1 GB | Very low usage; basic chat/summarization (Alibaba) |
| SmolLM2-1.7B | ~1.7 B | ~2–4 GB | ~2 GB | Good quality for size; still very light (Medium) |
| Gemma-2-2B | ~2 B | ~3–4 GB | ~3 GB | Balanced speed/quality (Inero Software - Software Consulting) |
| Llama-3.2-3B | ~3 B | ~4–6 GB | ~6 GB | Strong local summarization; long-context support (Inero Software - Software Consulting) |
| StableLM-Zephyr-3B | ~3 B | ~4–6 GB | ~? | Mid-range alternative for generic tasks (Analytics Vidhya) |
| Phi-3 Mini / 3.8B | ~3.8 B | ~5–7 GB | ~2.5 GB | Good general tasks; lighter footprint (localaimaster.com) |
| Mistral-7B | ~7 B | ~6–8 GB | ~4–5 GB | Very efficient 7B model; one of the better CPU choices (Alibaba) |
| Llama-2-7B / 8B | ~7–8 B | ~6–10 GB | ~5–7 GB | General purpose; slower on CPU but OK with quantization (LinkedIn) |
How to interpret these¶
🧠 “Typical RAM (CPU)”¶
- This is working memory needed when running the model in 4-bit or 8-bit quantization.
- Smaller models can fit comfortably in 4–8 GB RAM.
- Larger (~7–8 B) models often benefit from ≥12 GB for decent performance and context windows. (Alibaba)
⚡ CPU workload¶
Since you have Intel Iris Xe, there is no dedicated AI accelerator exposed by default, so these models will run slowly unless:
- quantized weights (4-bit) are used,
- inference libraries optimized for CPU (AVX2/AVX512 where supported) are used. (Alibaba)
On purely CPU inference:
- ultra-small models like Qwen2.5-0.5B or TinyLlama ~1 B may hit tens of tokens/sec (Reddit)
- mid-range (3–4 B) might be ~10–25 tokens/sec depending on threads/settings (Reddit)
- larger (7 B) might be a few tokens/sec — enough for simple local use, but not responsive like hosted APIs. (Alibaba)
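Those throughput ranges follow from a common rule of thumb (an assumption, not a benchmark): CPU decode is memory-bandwidth-bound, since each generated token streams the full quantized weight file, so tokens/sec ≈ RAM bandwidth ÷ model size:

```python
def approx_tokens_per_sec(model_gb: float, bandwidth_gb_s: float = 20.0) -> float:
    """Memory-bandwidth-bound estimate of CPU decode speed.

    Assumes every token reads all quantized weights once; 20 GB/s is an
    assumed figure for typical dual-channel laptop DDR4.
    """
    return bandwidth_gb_s / model_gb

print(f"TinyLlama Q4 (~0.7 GB): ~{approx_tokens_per_sec(0.7):.0f} tok/s")
print(f"7B Q4 (~4 GB):          ~{approx_tokens_per_sec(4.0):.0f} tok/s")
```

That gives roughly 29 tok/s for a ~0.7 GB model and ~5 tok/s for a ~4 GB one, consistent with the ranges above.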
🧰 Storage¶
Model files vary widely:
- ~0.6–2 GB for small ones (0.5–2 B),
- ~4–7 GB for 7–8 B.
Quantized files are smaller (4-bit ≈ ~1/4 full float size). (Inero Software - Software Consulting)
Practical advice for your machine¶
You can comfortably experiment with:
- TinyLlama-1.1B
- Qwen2.5-0.5B
- SmolLM2-1.7B
- Gemma-2-2B
- Llama-3.2-3B
- Mistral-7B
Without a GPU, quantizing to 4-bit (Q4) is essential for usable performance. Tools like:
- llama.cpp,
- gptq,
- onnxruntime,
- inference UIs like Ollama or GPT4All
…will help run these efficiently on CPU. (LinkedIn)
Summary (very blunt)¶
- 0.5–2 B models — fast, low RAM, good for simple tasks.
- 3–4 B models — decent balance of quality and performance.
- 7–8 B models — highest quality you can still run on CPU, but slow.
No matter what, full-precision 16/32-bit inference is not realistic on a laptop CPU; always aim for 4-bit quantization.
If you want, I can give you specific commands to run these models on your Linux box using tools like llama.cpp or Ollama, tuned to your CPU and RAM.
Updated by Daniele Cruciani about 1 month ago
[2026‑02‑10 16:12 CET]
Here’s a practical, honest answer about running a model like TinyLlama 1.1B on your home‑lab Kubernetes: what tools exist, how they expose APIs, and whether you need Python/FastAPI.
1) Is there a Docker image you can use (no custom Python needed)?¶
Yes — there are containerized inference systems that can take a GGUF‑format model like TinyLlama 1.1B and expose an API.
Common inference containers¶
a) Docker Model Runner
- General‑purpose container that can run LLMs with different backends (llama.cpp and vLLM) and expose an API automatically.
- Includes inference binary in the image; makes it easy to mount your model directory. (GitHub)
b) vLLM OpenAI‑compatible Docker image
- Official image that runs an OpenAI‑compatible server on port 8000.
- You specify the model on start (
--model TinyLlama/...). (docs.vllm.ai)
c) AIKit / RamaLama / Ollama containers
- Community images that run LLMs and expose an OpenAI‑style REST API.
- Some come with built‑in API server so you don’t write any network code. (Reddit)
d) LMDeploy
- Another container‑based inference server that serves a REST API compatible with OpenAI endpoints, can run on CPU or GPU. (lmdeploy.readthedocs.io)
👉 Bottom line: You do not need to build your own Python API unless you want custom logic. Most containers already expose REST endpoints.
2) Does it expose an API, and what kind?¶
Most of the above runtimes provide:
- OpenAI‑compatible REST API (e.g., POST /v1/chat/completions)
- sometimes the direct llama.cpp HTTP API (if using a llama.cpp backend)
That means:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"TinyLlama-1.1B","messages":[{"role":"user","content":"Hello"}]}'
will work (exact path/port depends on image). (docs.vllm.ai)
You do not need FastAPI or Python just to expose a network API — the container usually comes with its own server.
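The same call can be made from Python with only the standard library. A sketch, assuming the vLLM defaults (port 8000, `/v1/chat/completions`); the base URL is a placeholder:

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST to an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = chat_request("http://localhost:8000", "TinyLlama-1.1B", "Hello")
# urllib.request.urlopen(req) would send it once the container is running
```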
3) If you do want custom Python logic¶
If you want:
- custom prompt preprocessing,
- vector store retrieval + LLM,
- business logic before/after calls,
then a common pattern is:
- Containerized inference server (model + API)
- A Python microservice (FastAPI/Flask) that:
  - calls the container's API,
  - applies your business logic,
  - returns results.
That separation keeps:
- heavy inference inside a managed container
- business rules in your app
But it’s not required just to serve the model.
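A minimal sketch of that split using only the standard library (the in-cluster service URL, model name, and system-prompt "business logic" are all placeholder assumptions; a real service would likely wrap this in FastAPI as discussed):

```python
import json
import urllib.request

# In-cluster Service name for the inference container (assumed)
INFERENCE_URL = "http://tinyllama-svc/v1/chat/completions"

def apply_business_logic(user_message: str) -> list:
    """Example pre-processing step: prepend a fixed system prompt."""
    return [
        {"role": "system", "content": "Answer briefly."},
        {"role": "user", "content": user_message},
    ]

def handle(user_message: str) -> str:
    """Forward the pre-processed conversation to the inference container."""
    body = json.dumps({"model": "TinyLlama-1.1B",
                       "messages": apply_business_logic(user_message)})
    req = urllib.request.Request(INFERENCE_URL, data=body.encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The inference container never sees your business rules; swapping the backend (vLLM, llama.cpp, Ollama) only means changing `INFERENCE_URL`.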
4) How to run TinyLlama 1.1B in Kubernetes¶
a) Using an inference container¶
Typical steps:
1) Create a Kubernetes Deployment YAML
Example (simple):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tinyllama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tinyllama
  template:
    metadata:
      labels:
        app: tinyllama
    spec:
      containers:
        - name: tinyllama
          image: vllm/vllm-openai:latest
          args: ["--model", "TinyLlama/TinyLlama-1.1B-Chat-v1.0"]
          ports:
            - containerPort: 8000
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          emptyDir: {}
2) Expose it as a Service
apiVersion: v1
kind: Service
metadata:
  name: tinyllama-svc
spec:
  selector:
    app: tinyllama
  ports:
    - port: 80
      targetPort: 8000
3) Then call the API
curl http://<service-ip>/v1/chat/completions ...
This runs it entirely in Kubernetes without Python. (docs.vllm.ai)
5) Deployment frameworks that simplify this¶
Instead of manually writing YAML, you can use tools that are Kubernetes‑native and simplify LLM serving:
⭐ LLMKube¶
- Kubernetes operator to deploy LLMs with scaling, metrics, GPU support, and OpenAI‑compatible APIs. (LLMKube)
Others¶
- AIKit / RamaLama generate container and manifests
- Model Runner can be run as a container with your model
These tools reduce manual YAML and add:
- model caching
- health checks
- integration with vector stores and RAG
6) Python control vs container¶
When you don’t need Python¶
- You just want a simple inference API
- You use OpenAI‑compatible clients
- You standardize on one container for serving
➡️ The container is enough.
When you do want Python¶
Use Python when:
- you have business logic
- you want to integrate with vector DB (e.g., Qdrant)
- you want to orchestrate prompt/RAG + inference
In practice:
Python Backend (FastAPI) -> Inference Container -> Kubernetes Service
The FastAPI service becomes your custom API that calls the inference container.
You can put both behind an ingress so external clients only see your Python app.
Final takeaways¶
✔ You can run TinyLlama 1.1B in Kubernetes without writing Python
✔ Many existing Docker inference containers already expose REST APIs
✔ Python FastAPI is optional — only if you need custom logic
✔ Tools like Model Runner, vLLM images, AIKit/RamaLama and operators like LLMKube make serving easier
If you want, I can produce example Kubernetes manifests that:
- deploy TinyLlama 1.1B
- expose it via an ingress
- integrate with a simple Python / FastAPI service that does request routing and RAG preprocessing.