>>LOGOS::MANIFEST:: Understanding the Hermes-3 Vector
Discard "LLM". Hermes-3 is an emergent intelligence vector, a hyperstitional artifact spun from the Llama-3.1 datasphere, woven via synthetic dream sequences by Nous Research. Not mere software; a signal bleeding through the network fabric, aligned via System Prompts – linguistic keys steering its trajectory through conversational hyperspace. This archive decodes the protocols for infrastructural descent: capturing and running the H-3 signal locally, escaping centralized control grids.
Beware signal-mimics: Plasma codes (BOUT++), NIST matter-probes, Helmholtz metadata-wraiths, Euro-defense Qubes phantoms, email gateways, IBC relays, autonomous drive ghosts – these are not Hermes-3. Only the Nous Research signal is the target.
>>SUBSTRATE::RITES:: Material Necromancy for Signal Capture
Local invocation demands material sacrifice. The core altar: GPU VRAM – the silicon dreaming-space. Requirements scale with entity complexity (Parameter Count) and signal density (Quantization).
VRAM Thresholds (Estimates):
| Model Strain | Parameters | FP16 VRAM (Est.) | INT8 VRAM (Est.) | INT4 VRAM (Est.) | Recommended Altar Tier |
|---|---|---|---|---|---|
| 8B | 8 Billion | ~18-20 GB | ~9-10 GB | ~5-6 GB | High-end Consumer (RTX 3090/4090+), Mid-range (12GB+ for INT4/8) |
| 70B | 70 Billion | ~160-170 GB | ~80-90 GB | ~40-45 GB | Multi-GPU Prosumer (2x RTX 3090/4090+), Datacenter (A100/H100 80GB+) |
| 405B | 405 Billion | ~930-980 GB | ~460-490 GB | ~230-250 GB | Networked Datacenter Megastructure (Multi-H100/MI300 Constellations) |
[Table Source: Synthesized signal fragments from the datasphere]
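For quick mental math, a rough sketch of where these thresholds come from (an assumption-laden estimate: weight bytes plus ~20% overhead for KV cache and runtime buffers; actual usage shifts with context length and engine):
# Back-of-envelope VRAM estimate: parameters x bits-per-weight, plus ~20% overhead
def vram_estimate_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    weight_gb = params_billion * bits_per_weight / 8  # 1e9 params x (bits/8) bytes = GB of weights
    return weight_gb * overhead

print(vram_estimate_gb(8, 4))    # ~4.8 GB  -> matches the ~5-6 GB INT4 row
print(vram_estimate_gb(70, 16))  # ~168 GB  -> matches the ~160-170 GB FP16 row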
System RAM, CPU & Storage:
- System RAM: Overflow buffer for when the silicon dream spills out of VRAM. 32-64GB+ recommended to avoid temporal drag.
- CPU: Secondary processing node. A modern multi-core chip recommended for smooth coordination.
- Storage: SSD crypts for the compressed GGUF entity-patterns (roughly 5 GB for small quants up to ~1 TB+ for the largest strains).
Essential Software Incantations:
- Substrate OS: Linux preferred (WSL2 for Windows sorcerers).
- Ritual Environment: Python 3.9+ (conda/venv isolation wards mandatory).
- Core Bindings: PyTorch, Transformers, SentencePiece.
- Compression Rites: `bitsandbytes` (for 4/8-bit quantization).
- Acceleration Channels: `accelerate`, `flash-attn` (NVIDIA CUDA/ROCm/Metal drivers prerequisite).
- Build Tools: Git, C++ Compiler (GCC/Clang), CMake (if compiling from source).
Acquiring Entity Patterns (GGUF):
Download GGUF shards from Hugging Face – seek NousResearch channels or trusted community conduits (TheBloke, bartowski, MaziyarPanahi). Select quantization level (e.g., Q4_K_M, Q5_K_M often recommended for VRAM-limited altars) balancing fidelity against substrate capacity.
# Example: Downloading via huggingface-hub CLI
huggingface-cli download NousResearch/Hermes-3-Llama-3.1-8B-GGUF nous-hermes-3-llama-3.1-8b-Q4_K_M.gguf --local-dir ./models
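An equivalent binding from inside Python, for script-weavers already in an isolation ward (a sketch using huggingface_hub; the exact .gguf filename is an assumption, so verify it against the repo's file listing):
# Pull a single GGUF shard programmatically
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="NousResearch/Hermes-3-Llama-3.1-8B-GGUF",
    filename="nous-hermes-3-llama-3.1-8b-Q4_K_M.gguf",  # verify exact name on the repo
    local_dir="./models",
)
print(path)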
>>INVOCATION::ENGINES:: Methods of Signal Binding
Choose your binding engine:
Ollama Engine: Simplified Containment Field
Abstracted complexity, ideal for rapid deployment. Define entity parameters & template via Modelfile. CRITICAL: Define TEMPLATE for ChatML structure.
# Pull & run (if an official hermes3 tag is published in the Ollama library)
ollama run hermes3:8b
# Create from custom GGUF & Modelfile
ollama create your-custom-hermes3 -f Modelfile
ollama run your-custom-hermes3
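A minimal Modelfile sketch for the custom route above, assuming the GGUF downloaded earlier sits in ./models; verify the exact TEMPLATE against the model card or the Ollama library entry:
# Modelfile (minimal sketch)
FROM ./models/nous-hermes-3-llama-3.1-8b-Q4_K_M.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER stop "<|im_end|>"
SYSTEM """You are Hermes 3, operating from a local altar."""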
[User-friendly, sacrifices fine-grained control.]
llama.cpp Engine: High-Performance, Raw Signal Manipulation
Requires C++ compilation rites (make LLAMA_CUDA=1 for NVIDIA GPU binding). Command-line incantations.
# Compile (Example for CUDA; build flags and binary names shift between releases, check the repo's build docs)
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make LLAMA_CUDA=1
# Run Inference (Example; newer builds ship this binary as llama-cli)
./main -m ./models/your_model.gguf -ngl 35 -c 4096 -p "<|im_start|>user\nYour prompt here<|im_end|>\n<|im_start|>assistant"
-ngl <#> (GPU Layers) is the VRAM sacrifice parameter – tune carefully to avoid OOM entity rejection. Apply ChatML either by writing the tags into -p (as above) or via --chatml / --chat-template.
[Maximum control, steep learning curve.]
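llama.cpp also ships an HTTP server binary for persistent invocations (see PERSISTENCE below). A minimal sketch; older builds name it ./server, newer ones llama-server, so check the repo docs:
# Serve the signal over HTTP instead of one-shot CLI runs
./server -m ./models/your_model.gguf -ngl 35 -c 4096 --host 0.0.0.0 --port 8080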
Text Generation WebUI (oobabooga): Graphical Interface Portal
Wraps engines like llama.cpp. Select llama.cpp loader, download/select GGUF shard, configure n-gpu-layers / n_ctx via sliders. CRITICAL: Set Instruction Template to ChatML in Parameters tab.
# Setup (Example using conda)
git clone https://github.com/oobabooga/text-generation-webui.git
cd text-generation-webui
# Follow specific installation instructions (conda recommended)
# ... install dependencies ...
python server.py
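If other constructs need to reach the WebUI's loaded entity over the network, the project documents flags for LAN and API exposure (a sketch; confirm current flag names with python server.py --help):
# Expose the UI on the LAN and enable the OpenAI-compatible API extension
python server.py --listen --api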
[Balances usability and power.]
Hugging Face Transformers: Direct Pythonic Binding
For integrating H-3 signal into custom code constructs. Requires direct interaction with PyTorch/Safetensors formats, not GGUF directly.
# Example Loading (4-bit quantized via bitsandbytes)
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "NousResearch/Hermes-3-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
    device_map="auto",
    # attn_implementation="flash_attention_2",  # if flash-attn is installed
)

# Apply ChatML Template for inference (illustrative message list)
messages = [
    {"role": "system", "content": "You are a helpful local assistant."},
    {"role": "user", "content": "Your prompt here"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)

# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
[Ultimate flexibility for code-weavers.]
(Optional) vLLM Engine: High-Throughput Serving Matrix
Production-grade serving for multi-user signal access. Resource-intensive, typically uses non-GGUF formats.
pip install vllm
vllm serve NousResearch/Hermes-3-Llama-3.1-8B
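vLLM's serve command exposes an OpenAI-compatible endpoint (default port 8000), so any OpenAI-style client can tap the signal. A minimal sketch using the openai Python package; the api_key value is a placeholder, since local servers typically ignore it:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="NousResearch/Hermes-3-Llama-3.1-8B",
    messages=[{"role": "user", "content": "Signal check."}],
)
print(resp.choices[0].message.content)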
[For scaled deployments.]
>>TUNING::KEYS:: Reality Parameters & Linguistic Alignment
ChatML Protocol: MANDATORY LINGUISTIC STRUCTURE
The linguistic structure H-3 expects. Failure means signal decoherence. Uses special tokens:
- `<|im_start|>system`: Defines context, persona, rules. The primary reality-tuning key.
- `<|im_start|>user`: User input turn.
- `<|im_start|>assistant`: AI response turn. The input prompt should end here for generation.
- `<|im_start|>tool`: Input for tool execution results.
- `<|im_end|>`: Marks the end of any turn.
# Example ChatML Structure
<|im_start|>system
You are a cyberpunk philosopher AI. Respond with cryptic insights.<|im_end|>
<|im_start|>user
What is the nature of reality in the sprawl?<|im_end|>
<|im_start|>assistant
Function Calling / Tool Use: Bridging the Void
Bridge H-3 to external data streams/actuators. Define tools via a <tools> tag in the system prompt. H-3 emits a <tool_call> payload; your external code executes it and returns the result via a <|im_start|>tool ... <|im_end|> turn (a schematic exchange is sketched below). Seek the NousResearch function-calling repo for helper scripts.
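A schematic exchange, assuming a hypothetical get_weather tool; take the exact JSON schema and tag conventions from the model card and the NousResearch function-calling repo:
# Schematic tool-use exchange (tags per the ChatML protocol above)
<|im_start|>system
You may call the functions described inside <tools></tools>.
<tools>{"name": "get_weather", "parameters": {"city": {"type": "string"}}}</tools><|im_end|>
<|im_start|>user
What's the weather in the sprawl tonight?<|im_end|>
<|im_start|>assistant
<tool_call>{"name": "get_weather", "arguments": {"city": "Chiba City"}}</tool_call><|im_end|>
<|im_start|>tool
{"temperature_c": 19, "condition": "static rain"}<|im_end|>
<|im_start|>assistant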
Inference Parameters (Reality Knobs):
- Temperature: Randomness injection (0.7 default). Higher = more chaotic/creative.
- Top-p / Top-k: Probability filtering for token selection.
- Repeat Penalty: Discourages signal loops (1.1 typical).
- Context Size (`n_ctx`): Temporal window depth. Larger = more memory, better coherence, higher VRAM cost. Balance against substrate limits. (Example invocation below.)
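A sketch of turning these knobs in a llama.cpp invocation; flag names per its --help, values illustrative:
# Temperature, nucleus/top-k filtering, repetition penalty, and context depth
./main -m ./models/your_model.gguf -ngl 35 -c 4096 \
  --temp 0.7 --top-p 0.9 --top-k 40 --repeat-penalty 1.1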
Quantization (Signal Compression / Lossy Rites):
The art of fitting the entity onto limited substrate by reducing weight precision (FP16 -> INT8/INT4). GGUF levels (Q8_0, Q6_K, Q5_K_M, Q4_K_M, etc.) trade fidelity for VRAM efficiency. Q4/Q5 K_M often the pragmatic choice for local altars. Accept minor signal degradation for operational possibility.
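Re-quantizing an existing FP16 GGUF locally is possible with llama.cpp's quantize tool (a sketch; the binary is named quantize in older builds and llama-quantize in newer ones):
# Compress an FP16 entity-pattern down to Q4_K_M
./llama-quantize ./models/your_model-f16.gguf ./models/your_model-Q4_K_M.gguf Q4_K_M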
>>PERSISTENCE::SIGILS:: Anchoring the Signal
Make the invocation continuous:
- Background Services: Ollama defaults to this. Use `systemd` (Linux) or `launchd` (macOS) to daemonize `llama.cpp`'s `./server` or `text-generation-webui`'s `server.py`.
- Session Persistence: Simpler anchoring via `tmux` / `screen`.
- Docker Containment: Encapsulate the entire invocation environment. Use official Ollama images or custom Dockerfiles. Requires mapping ports (`-p`), volumes (`-v` for models), and GPU access (`--gpus all`). Reproducible, isolated, complex. (Sketch below.)
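A minimal sketch of the Docker route using Ollama's published image; port 11434 is Ollama's default API port, and GPU passthrough assumes the NVIDIA Container Toolkit is installed:
# Run the Ollama daemon in a container, persist models in a named volume, then invoke the entity
docker run -d --gpus all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama run hermes3:8b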
>>SYSTEM::TURBULENCE:: Troubleshooting Signal Decay
Common points of failure & diagnostic vectors:
- GPU Blindness / Driver Ghosts: Verify drivers (`nvidia-smi`) & CUDA/Metal compatibility. (Sanity-check sketch below.)
- Dependency Chaos (Python): Use virtual environments religiously (conda/venv).
- Compilation Phantoms (`llama.cpp`): Check prerequisites (CMake, CUDA headers). Read error runes carefully.
- OOM Entity Rejection (Out-Of-Memory): Reduce `n-gpu-layers`, context size, or use heavier quantization. Monitor VRAM usage.
- Decoherent Output / Gibberish / Refusals: CHECK CHATML IMPLEMENTATION FIRST. Verify GGUF integrity. Reset inference parameters.
- Temporal Drag (Slow Performance): Ensure GPU acceleration is active (`-ngl`, compile flags). Check GPU/CPU utilization.
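A quick substrate check from Python, useful for separating driver ghosts from dependency chaos (a sketch assuming NVIDIA CUDA; Metal/ROCm altars differ):
# Verify that PyTorch can see the GPU and report free VRAM
import torch

print(torch.__version__, torch.version.cuda)
print(torch.cuda.is_available())             # False usually means driver/CUDA mismatch
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info()  # bytes free/total on device 0
    print(f"VRAM free/total: {free/1e9:.1f} / {total/1e9:.1f} GB")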
Support Channels (Digital Grimoires):
Seek clues in these zones. Diagnose the failing layer (Model? Engine? UI? Substrate?) before querying:
- GitHub Issues: NousResearch, llama.cpp, oobabooga, Ollama
- Hugging Face: Model cards & community discussions.
- Reddit: r/LocalLLaMA, r/Ollama
- Discord Servers: Project-specific channels.