6 changes: 6 additions & 0 deletions CMakeLists.txt
@@ -204,6 +204,12 @@ if(SD_WEBM)
endif()
endif()

if (SD_RPC)
message("-- Use RPC as backend stable-diffusion")
set(GGML_RPC ON)
add_definitions(-DSD_USE_RPC)
endif ()

set(SD_LIB stable-diffusion)

file(GLOB SD_LIB_SOURCES CONFIGURE_DEPENDS
220 changes: 220 additions & 0 deletions docs/rpc.md
@@ -0,0 +1,220 @@
# Building and Using the RPC Server with `stable-diffusion.cpp`

This guide covers how to build a version of [the RPC server from `llama.cpp`](https://github.com/ggml-org/llama.cpp/blob/master/tools/rpc/README.md) that is compatible with your version of `stable-diffusion.cpp`, so that you can manage multi-backend setups. RPC allows you to offload specific model components to a remote server.

> **Note on Model Location:** The model files (e.g., `.safetensors` or `.gguf`) remain on the **Client** machine. The client parses the file and transmits the necessary tensor data and computational graphs to the server. The server does not need to store the model files locally.

## 1. Building `stable-diffusion.cpp` with RPC client

First, build the client application from source. Pass `-DSD_RPC=ON` so that the RPC backend is included in your client.

```bash
mkdir build
cd build
# Append any other build flags you normally use (e.g., -DSD_VULKAN=ON) to the configure command.
cmake .. -DSD_RPC=ON
cmake --build . --config Release -j $(nproc)
```

> **Note:** Ensure you add the other flags you would normally use (e.g., `-DSD_VULKAN=ON`, `-DSD_CUDA=ON`, `-DSD_HIPBLAS=ON`, or `-DGGML_METAL=ON`). For more information about building `stable-diffusion.cpp` from source, please refer to the [build.md](build.md) documentation.

## 2. Ensure `llama.cpp` is at the correct commit

`stable-diffusion.cpp`'s RPC client is designed to work with a specific version of `llama.cpp` (compatible with the `ggml` submodule) to ensure API compatibility. The commit hash for `llama.cpp` is stored in `ggml/scripts/sync-llama.last`.

> **Start from Root:** Perform these steps from the root of your `stable-diffusion.cpp` directory.

1. Read the target commit hash from the submodule tracker:

```bash
# Linux / WSL / macOS
HASH=$(cat ggml/scripts/sync-llama.last)

# Windows (PowerShell)
$HASH = Get-Content -Path "ggml\scripts\sync-llama.last"
```

2. Clone `llama.cpp` and check out the target commit:
```bash
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
git checkout $HASH
```
To save on download time and storage, you can use a shallow clone to download only the target commit:
```bash
mkdir -p llama.cpp
cd llama.cpp
git init
git remote add origin https://github.com/ggml-org/llama.cpp.git
git fetch --depth 1 origin $HASH
git checkout FETCH_HEAD
```
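
The hash read in step 1 can be sanity-checked before it is passed to `git`, since an empty or truncated `$HASH` would make the checkout fail or land on the wrong ref. The following is a minimal sketch, assuming the tracker file holds a single full 40-character hexadecimal SHA; `read_pinned_hash` is a hypothetical helper, not part of either project.

```bash
# Hypothetical helper: print the pinned hash from a tracker file, or fail
# if the content is not a full 40-character hexadecimal SHA-1.
read_pinned_hash() {
    # Strip whitespace, including a trailing newline or a Windows CR.
    hash=$(tr -d '[:space:]' < "$1") || return 1
    case "$hash" in
        *[!0-9a-fA-F]*) return 1 ;;   # contains a non-hex character
    esac
    [ "${#hash}" -eq 40 ] || return 1
    printf '%s\n' "$hash"
}

# Usage, from the stable-diffusion.cpp root:
#   HASH=$(read_pinned_hash ggml/scripts/sync-llama.last) || echo "unexpected tracker content" >&2
```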

## 3. Build `llama.cpp` (RPC Server)

The RPC server acts as the worker. You must explicitly enable the **backend** (the hardware interface, such as CUDA for Nvidia, Metal for Apple Silicon, or Vulkan) when building; otherwise, the server will use only the CPU.

To find the correct flags for your system, refer to the official documentation for the [`llama.cpp`](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md) repository.

> **Crucial:** You must include the compiler flag required for API compatibility with `stable-diffusion.cpp` (`-DGGML_MAX_NAME=128`). Without this flag, `GGML_MAX_NAME` defaults to `64` on the server, and data transfers between the client and server will fail. `-DGGML_RPC=ON` must also be set.
>
> We recommend disabling the `LLAMA_CURL` flag to avoid unnecessary dependencies, and disabling shared library builds to avoid potential conflicts.

> **Build Target:** We are specifically building the `rpc-server` target. This prevents the build system from compiling the entire `llama.cpp` suite (like `llama-server`), making the build significantly faster.

### Linux / WSL (Vulkan)

```bash
mkdir build
cd build
# Make sure the backend flag for your hardware is enabled (Vulkan here).
cmake .. -DGGML_RPC=ON \
-DGGML_VULKAN=ON \
-DGGML_BUILD_SHARED_LIBS=OFF \
-DLLAMA_CURL=OFF \
-DCMAKE_C_FLAGS=-DGGML_MAX_NAME=128 \
-DCMAKE_CXX_FLAGS=-DGGML_MAX_NAME=128
cmake --build . --config Release --target rpc-server -j $(nproc)
```

### macOS (Metal)

```bash
mkdir build
cd build
cmake .. -DGGML_RPC=ON \
-DGGML_METAL=ON \
-DGGML_BUILD_SHARED_LIBS=OFF \
-DLLAMA_CURL=OFF \
-DCMAKE_C_FLAGS=-DGGML_MAX_NAME=128 \
-DCMAKE_CXX_FLAGS=-DGGML_MAX_NAME=128
cmake --build . --config Release --target rpc-server
```

### Windows (Visual Studio 2022, Vulkan)

```powershell
mkdir build
cd build
cmake .. -G "Visual Studio 17 2022" -A x64 `
-DGGML_RPC=ON `
-DGGML_VULKAN=ON `
-DGGML_BUILD_SHARED_LIBS=OFF `
-DLLAMA_CURL=OFF `
-DCMAKE_C_FLAGS=-DGGML_MAX_NAME=128 `
-DCMAKE_CXX_FLAGS=-DGGML_MAX_NAME=128
cmake --build . --config Release --target rpc-server
```
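
Because a missing `GGML_MAX_NAME` define only shows up later as failed transfers, it may be worth confirming that the flag reached the build configuration. The sketch below, for Linux, WSL, or macOS shells, inspects `CMakeCache.txt`, where CMake records `CMAKE_C_FLAGS` and `CMAKE_CXX_FLAGS`. The `has_max_name_128` helper name is made up for this example, and it only checks the configured flags, not the compiled binary.

```bash
# Hypothetical helper: succeed only if both the C and C++ flags recorded in
# <build-dir>/CMakeCache.txt contain -DGGML_MAX_NAME=128.
has_max_name_128() {
    cache="$1/CMakeCache.txt"
    [ -f "$cache" ] || return 1
    for var in CMAKE_C_FLAGS CMAKE_CXX_FLAGS; do
        # Cache entries have the form NAME:TYPE=VALUE.
        grep -q "^${var}:[A-Z]*=.*-DGGML_MAX_NAME=128" "$cache" || return 1
    done
}

# Usage, from the llama.cpp directory:
#   has_max_name_128 build && echo "GGML_MAX_NAME=128 is configured"
```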

## 4. Usage

Once both applications are built, you can run the server and the client to manage your GPU allocation.

### Step A: Run the RPC Server

Start the server. It listens for connections on the default address (usually `localhost:50052`). If your server is on a different machine, ensure the server binds to the correct interface and your firewall allows the connection.

**On the Server:**
If running on the same machine, you can use the default address:

```bash
./rpc-server
```

If you want to allow connections from other machines on the network:

```bash
./rpc-server --host 0.0.0.0
```

> **Security Warning:** The RPC server does not currently support authentication or encryption. **Only run the server on trusted local networks**. Never expose the RPC server directly to the open internet.

> **Drivers & Hardware:** Ensure the Server machine has the necessary drivers installed and functional (e.g., Nvidia Drivers for CUDA, Vulkan SDK, or Metal). If no devices are found, the server will simply fall back to the CPU.
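
When the server runs on another machine, a quick TCP probe from the client can help separate network or firewall problems from RPC problems. This is a minimal sketch, assuming an `nc` binary that supports `-z` and `-w` (OpenBSD netcat and Ncat do). `split_endpoint` and `can_reach` are hypothetical helpers, and a successful probe only shows that something accepts connections on that port.

```bash
# Hypothetical helper: split "host:port" into "host port"; fail without a port.
split_endpoint() {
    case "$1" in
        ?*:?*) printf '%s %s\n' "${1%:*}" "${1##*:}" ;;
        *)     return 1 ;;
    esac
}

# Hypothetical helper: succeed if a TCP connection opens within 2 seconds.
can_reach() {
    endpoint=$(split_endpoint "$1") || return 1
    # Intentionally unquoted so that host and port become two arguments.
    set -- $endpoint
    nc -z -w 2 "$1" "$2"
}

# Usage, from the client machine:
#   can_reach 192.168.1.10:50052 && echo "port is reachable"
```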

<!-- ### Step B: Check if the client is able to connect to the server and see the available devices

We're assuming the server is running on your local machine, and listening on the default port `50052`. If it's running on a different machine, you can replace `localhost` with the IP address of the server.

**On the Client:**

```bash
./sd-cli --rpc-servers localhost:50052 --list-devices
```

If the server is running and the client is able to connect, you should see `RPC0 localhost:50052` in the list of devices.

Example output:
(Client built without GPU acceleration, two GPUs available on the server)

```
List of available GGML devices:
Name Description
-------------------
CPU AMD Ryzen 9 5900X 12-Core Processor
RPC0 localhost:50052
RPC1 localhost:50052
``` -->

### Step B: Run with RPC device

If everything is working correctly, you can now run the client while offloading some or all of the work to the RPC server.

Example: set the main backend to the `RPC0` device so that all of the work is done on the server.

```bash
./sd-cli -m models/sd1.5.safetensors -p "A cat" --rpc-servers localhost:50052 --backend RPC0
```

---

## 5. Scaling: Multiple RPC Servers

You can connect the client to multiple RPC servers simultaneously to scale out your hardware usage.

Example: A main machine (192.168.1.10) with three GPUs, one running CUDA and the other two running Vulkan, and a second machine (192.168.1.11) with only one GPU.

**On the first machine (Running two server instances):**

**Terminal 1 (CUDA):**

```bash
# Linux / WSL
export CUDA_VISIBLE_DEVICES=0
cd ./build_cuda/bin/Release
./rpc-server --host 0.0.0.0

# Windows PowerShell
$env:CUDA_VISIBLE_DEVICES="0"
cd .\build_cuda\bin\Release
./rpc-server --host 0.0.0.0
```

**Terminal 2 (Vulkan):**

```bash
cd ./build_vulkan/bin/Release
# ignore the first GPU (used by CUDA server)
./rpc-server --host 0.0.0.0 --port 50053 -d Vulkan1,Vulkan2
```

**On the second machine:**

```bash
cd ./build/bin/Release
./rpc-server --host 0.0.0.0
```

**On the Client:**
Pass multiple server addresses separated by commas.

```bash
./sd-cli --rpc-servers 192.168.1.10:50052,192.168.1.10:50053,192.168.1.11:50052 [...]
```
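
A malformed entry in a long server list is easy to miss. The sketch below checks each comma-separated entry for the `host:port` shape used by `--rpc-servers`. `check_rpc_servers` is a hypothetical helper, and the port range check (1 to 65535) is an assumption about what counts as valid, not the client's own parsing logic.

```bash
# Hypothetical helper: report entries that are not of the form host:port.
# Returns 0 when every entry looks valid, 1 otherwise.
check_rpc_servers() {
    status=0
    old_ifs=$IFS
    IFS=,                      # split the argument on commas
    for entry in $1; do
        port=${entry##*:}      # text after the last colon
        case "$entry" in
            ?*:?*) ;;          # has a non-empty host and port
            *) echo "bad entry: $entry"; status=1; continue ;;
        esac
        case "$port" in
            *[!0-9]*) echo "bad port: $entry"; status=1; continue ;;
        esac
        if [ "$port" -lt 1 ] || [ "$port" -gt 65535 ]; then
            echo "bad port: $entry"; status=1
        fi
    done
    IFS=$old_ifs
    return $status
}

# Usage:
#   check_rpc_servers "192.168.1.10:50052,192.168.1.10:50053,192.168.1.11:50052" && echo "list looks fine"
```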

The client maps these servers to sequential device IDs in the order they are listed (here, RPC0 from the first server, RPC1 and RPC2 from the second, and RPC3 from the third). With this setup, you could for example use RPC0 for the main backend, RPC1 and RPC2 for the text encoders, and RPC3 for the VAE.
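
The numbering can be illustrated with a small sketch. It assumes that devices are numbered consecutively from RPC0 in the order the servers are listed, and the per-server device counts are supplied by hand. The `map_rpc_devices` helper and its `endpoint=count` input format are invented for this example and do not query any server.

```bash
# Hypothetical helper: print "RPC<n> <endpoint>" for every device, given a
# comma-separated list of endpoint=device_count pairs.
map_rpc_devices() {
    next=0
    old_ifs=$IFS
    IFS=,                          # split the argument on commas
    for pair in $1; do
        endpoint=${pair%=*}        # text before the last '='
        count=${pair##*=}          # text after the last '='
        i=0
        while [ "$i" -lt "$count" ]; do
            echo "RPC$next $endpoint"
            next=$((next + 1))
            i=$((i + 1))
        done
    done
    IFS=$old_ifs
}

# One CUDA device, two Vulkan devices, one device on the second machine.
map_rpc_devices "192.168.1.10:50052=1,192.168.1.10:50053=2,192.168.1.11:50052=1"
```

For the three servers in this example, the sketch prints RPC0 for `192.168.1.10:50052`, RPC1 and RPC2 for `192.168.1.10:50053`, and RPC3 for `192.168.1.11:50052`.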

---

## 6. Performance Considerations

RPC performance depends heavily on network bandwidth, because large weights and activations must be transferred back and forth over the network, especially for large models or high resolutions. For best results, ensure your network connection is stable and has sufficient bandwidth (more than 1 Gbps recommended). This should not be a concern if you are running the server and client on the same machine, as the data transfer then happens over the loopback interface.
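
To get a feel for the bandwidth requirement, a back-of-the-envelope estimate can help. The sketch below computes an idealized transfer time from a payload size and a link speed. `transfer_seconds` is a hypothetical helper, the 4 GiB payload is an illustrative number rather than a measurement of any model, and latency and protocol overhead are ignored, so real transfers will take longer.

```bash
# Hypothetical helper: seconds needed to move <size_gib> GiB over a link of
# <gbps> gigabits per second, ignoring latency and protocol overhead.
transfer_seconds() {
    awk -v gib="$1" -v gbps="$2" \
        'BEGIN { printf "%.1f\n", gib * 1024 * 1024 * 1024 * 8 / (gbps * 1e9) }'
}

transfer_seconds 4 1     # 4 GiB over 1 Gbps
transfer_seconds 4 0.1   # the same payload over 100 Mbps
```

Under these assumptions the two calls give roughly 34 seconds and 344 seconds, which is why a slow link can dominate the total generation time.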
24 changes: 18 additions & 6 deletions examples/cli/README.md
@@ -43,7 +43,10 @@ Context Options:
--high-noise-diffusion-model <string> path to the standalone high noise diffusion model
--uncond-diffusion-model <string> path to the standalone unconditional diffusion model, currently used by
Ideogram4 CFG
--embeddings-connectors <string> path to LTXAV embeddings connectors
--vae <string> path to standalone vae model
--vae-format <string> VAE latent format override: auto, flux, sd3, or flux2 (default: auto)
--audio-vae <string> path to standalone LTX audio vae model
--taesd <string> path to taesd. Using Tiny AutoEncoder for fast decoding (low quality)
--tae <string> alias of --taesd
--control-net <string> path to control net model
@@ -53,12 +56,18 @@ Context Options:
--tensor-type-rules <string> weight type per tensor pattern (example: "^vae\.=f16,model\.=q8_0")
--photo-maker <string> path to PHOTOMAKER model
--upscale-model <string> path to esrgan model.
--backend <string> runtime backend assignment, e.g. cpu or clip=cpu,vae=cuda0,diffusion=vulkan0
--params-backend <string> parameter backend assignment, e.g. cpu or diffusion=cpu,clip=cpu
--rpc-servers <string> comma-separated list of RPC servers to connect to for offloading, in the
format host:port, e.g. localhost:50052,192.168.1.3:50052
-t, --threads <int> number of threads to use during computation (default: -1). If threads <= 0,
then threads will be set to the number of CPU physical cores
--chroma-t5-mask-pad <int> t5 mask pad size of chroma
--max-vram <float> maximum VRAM budget in GiB for graph-cut segmented execution. 0 disables
graph splitting; a negative value auto-detects free VRAM, sparing the
specified value (e.g. -0.5 will keep at least 0.5 GiB free)
--stream-layers enable residency+prefetch streaming on top of --max-vram (no effect without
--max-vram; defaults to false)
--force-sdxl-vae-conv-scale force use of conv scale on sdxl vae
--offload-to-cpu place the weights in RAM to save VRAM, and automatically load them into VRAM
when needed
@@ -109,7 +118,8 @@ Generation Options:
--extra-sample-args <string> extra sampler/scheduler/guidance args, key=value list. APG supports apg_eta,
apg_momentum, apg_norm_threshold, apg_norm_threshold_smoothing; SLG supports
slg_uncond; lcm supports noise_clip_std, noise_scale_start, noise_scale_end;
ltx2 supports max_shift, base_shift, stretch, terminal; euler_ge supports gamma
ltx2 supports max_shift, base_shift, stretch, terminal; euler_ge supports
gamma
--extra-tiling-args <string> extra VAE tiling args, key=value list. LTX video VAE supports
temporal_tile_frames (default: 4), temporal_tile_overlap (default: 1)
-H, --height <int> image height, in pixel space (default: 512)
@@ -153,7 +163,7 @@ Generation Options:
--high-noise-eta <float> (high noise) noise multiplier (default: 0 for ddim_trailing, tcd,
res_multistep and res_2s; 1 for euler_a, er_sde and dpm++2s_a)
--strength <float> strength for noising/unnoising (default: 0.75)
--pm-style-strength <float>
--pm-style-strength <float>
--control-strength <float> strength to apply Control Net (default: 0.9). 1.0 corresponds to full
destruction of information in init image
--moe-boundary <float> timestep boundary for Wan2.2 MoE model. (default: 0.875). Only enabled if
@@ -172,13 +182,15 @@ Generation Options:
-s, --seed RNG seed (default: 42, use random seed for < 0)
--sampling-method sampling method, one of [euler, euler_a, heun, dpm2, dpm++2s_a, dpm++2m,
dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd, res_multistep, res_2s,
er_sde, euler_cfg_pp, euler_a_cfg_pp] (default: euler for Flux/SD3/Wan, euler_a otherwise)
er_sde, euler_cfg_pp, euler_a_cfg_pp](default: euler for Flux/SD3/Wan,
euler_a otherwise)
--high-noise-sampling-method (high noise) sampling method, one of [euler, euler_a, heun, dpm2, dpm++2s_a,
dpm++2m, dpm++2mv2, ipndm, ipndm_v, lcm, ddim_trailing, tcd, res_multistep,
res_2s, er_sde, euler_cfg_pp, euler_a_cfg_pp] default: euler for Flux/SD3/Wan, euler_a otherwise
res_2s, er_sde, euler_cfg_pp, euler_a_cfg_pp] default: euler for
Flux/SD3/Wan, euler_a otherwise
--scheduler denoiser sigma scheduler, one of [discrete, karras, exponential, ays, gits,
smoothstep, sgm_uniform, simple, kl_optimal, lcm, bong_tangent, ltx2], default:
model-specific
smoothstep, sgm_uniform, simple, kl_optimal, lcm, bong_tangent, ltx2],
default: model-specific
--sigmas custom sigma values for the sampler, comma-separated (e.g.,
"14.61,7.8,3.5,0.0").
--hires-sigmas custom sigma values for the highres fix second pass, comma-separated (e.g.,
5 changes: 5 additions & 0 deletions examples/common/common.cpp
@@ -423,6 +423,10 @@ ArgOptions SDContextParams::get_options() {
"--params-backend",
"parameter backend assignment, e.g. cpu or diffusion=cpu,clip=cpu",
&params_backend},
{"",
"--rpc-servers",
"comma-separated list of RPC servers to connect to for offloading, in the format host:port, e.g. localhost:50052,192.168.1.3:50052",
&rpc_servers},
};

options.int_options = {
@@ -817,6 +821,7 @@ sd_ctx_params_t SDContextParams::to_sd_ctx_params_t(bool vae_decode_only, bool f
stream_layers,
backend.c_str(),
params_backend.c_str(),
rpc_servers.c_str(),
};
return sd_ctx_params;
}
1 change: 1 addition & 0 deletions examples/common/common.h
@@ -148,6 +148,7 @@ struct SDContextParams {
bool stream_layers = false;
std::string backend;
std::string params_backend;
std::string rpc_servers;
bool enable_mmap = false;
bool control_net_cpu = false;
bool clip_on_cpu = false;