Running GPT-OSS-120B on Framework Desktop with AMD Strix Halo
The Framework Desktop with the AMD Strix Halo AI chip offers a compelling platform for running large language models locally. With 128GB of unified memory shared between the CPU and GPU, it removes the VRAM bottleneck that typically limits consumer hardware to smaller quantised models.
This guide demonstrates how to run gpt-oss-120b, a 120-billion parameter open-source language model, using a containerised approach that sidesteps the complexity of installing AMD's GPU compute toolchain directly on the host system.
Why Containerisation Matters
Installing the AMD GPU compute stack and building llama.cpp with Vulkan support directly on the host system is complex. The container image's build process requires:
- Build toolchain: gcc, clang, cmake, ninja-build, and associated development libraries
- Vulkan stack: Vulkan SDK, Mesa drivers (RADV), and GLSL compiler
- llama.cpp compilation: Building from source with Vulkan backend enabled and optional performance patches
- Runtime dependencies: System libraries, GPU monitoring tools, and custom library paths
The final container image contains over 1.45GB of compiled binaries plus 413MB of runtime dependencies. Replicating this manually means managing version compatibility across dozens of packages, building from source, and troubleshooting library paths.
The container approach encapsulates all this complexity into a single podman run command.
The Complete Command
podman run -d \
--restart=always \
--name=gpt-oss-120b \
--device /dev/dri \
--group-add video \
--security-opt seccomp=unconfined \
-v ~/.cache/llama.cpp:/root/.cache/llama.cpp \
-p 8080:8080 \
docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv \
llama-server \
-hf ggml-org/gpt-oss-120b-GGUF \
--host 0.0.0.0 \
--ctx-size 120000 \
--temp 0.8 \
--no-mmap \
-ngl 999 \
-fa on \
--jinja \
-a gpt-oss-120b
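Before breaking the options down, it helps to confirm the server actually comes up. On the first run, llama-server downloads the model weights into the mounted cache directory, which can take a while. A quick check, assuming a recent llama.cpp build that exposes the /health endpoint:
# Follow the startup logs (Ctrl+C to stop following)
podman logs -f gpt-oss-120b
# Once the model has loaded, the health endpoint should answer
curl http://localhost:8080/health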
Command Breakdown
Container Options
- -d: Run the container in detached mode (background)
- --restart=always: Automatically restart the container if it stops or after system reboots
- --name=gpt-oss-120b: Assign a friendly name for easier management
- --device /dev/dri: Grant access to Direct Rendering Infrastructure devices (the GPU)
- --group-add video: Add the container process to the video group for GPU access
- --security-opt seccomp=unconfined: Disable seccomp filtering to allow GPU operations
- -v ~/.cache/llama.cpp:/root/.cache/llama.cpp: Mount the host cache directory to avoid re-downloading model weights (120B models are large!)
- -p 8080:8080: Expose the llama-server API on port 8080
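Before launching the container, it's worth verifying that the host actually provides what these flags reference. A minimal sketch, assuming a typical Linux setup where the GPU nodes live under /dev/dri and the group is named video:
# GPU device nodes that --device /dev/dri passes through
ls -l /dev/dri
# The group that --group-add video refers to
getent group video
# Pre-create the cache directory used by the volume mount
mkdir -p ~/.cache/llama.cpp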
Model and Server Options
- -hf ggml-org/gpt-oss-120b-GGUF: Download the model from the Hugging Face repository. This model uses mxfp4 quantisation, a native quantisation format optimised for large models
- --host 0.0.0.0: Bind to all network interfaces (the default is 127.0.0.1, which wouldn't be reachable from outside the container)
- -a gpt-oss-120b: Set an alias for the model name in API responses
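Once the server is running, the alias can be verified through the OpenAI-compatible models endpoint; a quick sketch (the exact response shape varies between llama.cpp versions):
# The alias set with -a should appear as the model id
curl http://localhost:8080/v1/models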
Context and Inference Settings
- --ctx-size 120000: Set the prompt context size to 120,000 tokens. The large unified memory allows for massive context windows that would be impossible on traditional GPU setups
- --temp 0.8: Set the sampling temperature to 0.8 for balanced creativity and coherence
Performance and Memory Options
- --no-mmap: Disable memory-mapped file I/O. This loads the entire model into RAM, which is slower on startup but can improve inference performance and reduce page faults with unified memory
- -ngl 999: Offload 999 layers to the GPU (effectively "all layers", since the model has far fewer than that). This maximises GPU acceleration
- -fa on: Enable Flash Attention, an optimised attention mechanism that reduces memory usage and improves performance
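Because --no-mmap pulls the full weights into the unified pool, the memory cost is easy to observe from the host while the model loads. A rough check (how GPU allocations are attributed varies by driver and podman version):
# Container-level resource usage as reported by podman
podman stats --no-stream gpt-oss-120b
# Overall memory usage of the unified pool on the host
free -h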
Chat Template
- --jinja: Use Jinja templating for the chat format. This enables the server to automatically apply the model's chat template (defined in its metadata) for properly formatted conversations
Unified Memory Advantage
The AMD Strix Halo's unified memory architecture is the key enabler here. Traditional discrete GPU setups face two constraints:
- VRAM capacity: A 24GB GPU can only load models that fit in 24GB
- Transfer overhead: Moving data between system RAM and VRAM introduces latency
With 128GB of unified memory, both constraints disappear. The GPU can directly access the full 128GB pool, allowing:
- Larger models: GPT-OSS-120B with mxfp4 quantisation uses approximately 68GB of RAM when loaded, fitting comfortably in the available memory
- Massive context windows: 120,000 tokens of context without running out of memory
- No transfer penalties: The GPU accesses the same memory as the CPU
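One way to see how much of the unified pool the GPU can actually address is the amdgpu sysfs counters; this is a sketch that assumes the GPU is exposed as card0 and that the kernel driver provides these files:
# Dedicated VRAM carve-out and GTT (GPU-addressable system memory), in bytes
cat /sys/class/drm/card0/device/mem_info_vram_total
cat /sys/class/drm/card0/device/mem_info_gtt_total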
Accessing the API
Once the container is running, the llama-server API is accessible at http://localhost:8080. The server provides an OpenAI-compatible API endpoint and includes a web UI (accessible at the same URL).
Example API call:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-oss-120b",
"messages": [
{"role": "user", "content": "Explain unified memory architecture"}
]
}'
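For interactive clients, the same endpoint also supports streamed responses via the standard OpenAI-style stream flag; a minimal sketch (output arrives as server-sent events, so -N keeps curl from buffering):
curl -N http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-oss-120b",
"stream": true,
"messages": [
{"role": "user", "content": "Write a short note on unified memory"}
]
}'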
Container Management
Check container status:
podman ps -a | grep gpt-oss-120b
View logs:
podman logs -f gpt-oss-120b
Stop the server:
podman stop gpt-oss-120b
Start the server:
podman start gpt-oss-120b
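To pick up a newer image build, pull the tag again and recreate the container with the same podman run command from the top of this guide; the downloaded model weights survive because they live in the mounted host cache directory:
podman pull docker.io/kyuz0/amd-strix-halo-toolboxes:vulkan-radv
podman stop gpt-oss-120b
podman rm gpt-oss-120b
# ...then re-run the full podman run command shown earlier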
Performance Considerations
GPT-OSS-120B achieves approximately 50 tokens/s on the Strix Halo thanks to the model's Mixture of Experts (MoE) architecture, which activates only a subset of its parameters for each token. While the unified memory architecture enables running large models, inference speed depends on several factors:
- Model quantisation: GPT-OSS-120B uses mxfp4 quantisation, which balances model size and quality for large parameter models
- Context size: Larger context windows increase memory usage and computation time
- Layer offloading: The -ngl 999 option ensures all layers run on the GPU, but memory bandwidth still matters
The Strix Halo's integrated GPU shares memory bandwidth with the CPU, so performance may not match dedicated high-end GPUs, but running a 120B-parameter model at approximately 50 tokens/s on consumer hardware is a significant achievement.
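To sanity-check throughput on your own unit, one rough approach is to time a fixed-length completion and divide the generated token count by wall-clock time. The sketch below assumes jq is installed and that the server reports a usage.completion_tokens field in the response, which recent llama.cpp builds do:
start=$(date +%s)
tokens=$(curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"gpt-oss-120b","max_tokens":256,"messages":[{"role":"user","content":"Count from 1 to 100."}]}' \
| jq '.usage.completion_tokens')
end=$(date +%s)
# Wall-clock time includes prompt processing, so this slightly underestimates generation speed
echo "~$(( tokens / (end - start) )) tokens/s"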