14 changes: 13 additions & 1 deletion README.md
@@ -1,6 +1,11 @@
# ComfyUI-GGUF
GGUF Quantization support for native ComfyUI models

> [!NOTE]
> This is a fork of the original nodes, updated to support loading Ideogram 4 GGUFs and Krea 2 GGUFs.
> To use it, clone `https://github.com/city96/ComfyUI-GGUF` and not the original repo.

This is currently very much WIP. These custom nodes provide support for model files stored in the GGUF format popularized by [llama.cpp](https://github.com/ggerganov/llama.cpp).

While quantization wasn't feasible for regular UNET models (conv2d), transformer/DiT models such as flux seem less affected by it. This allows running them with variable-bitrate quants at much lower bits per weight on low-end GPUs. For further VRAM savings, a node to load a quantized version of the T5 text encoder is also included.
@@ -35,15 +40,22 @@ Simply use the GGUF Unet loader found under the `bootleg` category. Place the .g

LoRA loading is experimental but it should work with just the built-in LoRA loader node(s).

Pre-quantized models:
Pre-quantized models (🍴 icon on ones added by this fork):

- [flux1-dev GGUF](https://huggingface.co/city96/FLUX.1-dev-gguf)
- [flux1-schnell GGUF](https://huggingface.co/city96/FLUX.1-schnell-gguf)
- [stable-diffusion-3.5-large GGUF](https://huggingface.co/city96/stable-diffusion-3.5-large-gguf)
- [stable-diffusion-3.5-large-turbo GGUF](https://huggingface.co/city96/stable-diffusion-3.5-large-turbo-gguf)
- [Krea 2 (Both Turbo and Raw)](https://huggingface.co/molbal/krea2-gguf) 🍴
- [Ideogram 4](https://huggingface.co/molbal/ideogram-4-gguf) 🍴


> [!IMPORTANT]
> Please note that this fork does not support `_K` quants on diffusion models, only on text encoders. Such files may or may not load, and inference may be very slow if they do. Other forks or custom nodes may have better support for these quantization types.

Initial support for quantizing T5 has also been added recently. These models can be loaded with the various `*CLIPLoader (gguf)` nodes, which can be used in place of the regular ones. For the CLIP model, use whatever model you were using before for CLIP. The loader can handle both types of files - `gguf` and regular `safetensors`/`bin`.

- [t5_v1.1-xxl GGUF](https://huggingface.co/city96/t5-v1_1-xxl-encoder-gguf)
- [Qwen3-VL-4B-Instruct-GGUF](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct-GGUF) 🍴

See the instructions in the [tools](https://github.com/city96/ComfyUI-GGUF/tree/main/tools) folder for how to create your own quants.
3 changes: 3 additions & 0 deletions dequant.py
@@ -18,6 +18,9 @@ def dequantize_tensor(tensor, dtype=None, dequant_dtype=None):

     if qtype in TORCH_COMPATIBLE_QTYPES:
         return tensor.to(dtype)
+    elif qtype == gguf.GGMLQuantizationType.BF16:
+        tensor = torch.Tensor(tensor.data.view(torch.bfloat16).reshape(oshape))
+        return tensor if dtype is None or dtype == torch.bfloat16 else tensor.to(dtype)
     elif qtype in dequantize_functions:
         dequant_dtype = dtype if dequant_dtype == "target" else dequant_dtype
         return dequantize(tensor.data, qtype, oshape, dtype=dequant_dtype).to(dtype)
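
The BF16 branch above reinterprets the stored bytes as bfloat16 instead of running a block dequantizer. As a rough illustration of why a plain reinterpretation is enough, the sketch below decodes bfloat16 with only the standard library, relying on the fact that a bfloat16 value is the upper 16 bits of an IEEE-754 float32. The helper names `bf16_bits_to_float` and `decode_bf16_bytes` are hypothetical and not part of this repository, and little-endian storage is assumed.

```python
import struct


def bf16_bits_to_float(bits: int) -> float:
    """Widen a 16-bit bfloat16 pattern to a Python float."""
    # bfloat16 keeps the sign bit, the 8 exponent bits and the top 7 mantissa
    # bits of float32, so appending 16 zero bits yields the float32 pattern.
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def decode_bf16_bytes(raw: bytes) -> list:
    """Decode packed bfloat16 values (assumption: little-endian layout)."""
    if len(raw) % 2:
        raise ValueError("bfloat16 data must have an even number of bytes")
    count = len(raw) // 2
    return [bf16_bits_to_float(b) for b in struct.unpack(f"<{count}H", raw)]


# 0x3F80 encodes 1.0 and 0xC000 encodes -2.0 in bfloat16.
values = decode_bf16_bytes(bytes([0x80, 0x3F, 0x00, 0xC0]))
```

No scales or block structure are involved, which is why the patch treats BF16 as full-precision storage rather than a compressed quant.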
60 changes: 60 additions & 0 deletions editor.md
@@ -0,0 +1,60 @@
# Prompt Canvas Editor

A single-file browser editor for building Ideogram-style structured JSON prompts with canvas-based bounding boxes. The app lives entirely in `ui.html`; there is no build step, package manager, or local server requirement.

## Features

- Set canvas width and height with sliders or by double-clicking the displayed values.
- Draw, move, resize, delete, and edit bounding boxes directly on the canvas.
- Cycle through selected boxes with the `<-` and `->` controls when boxes overlap.
- Edit global prompt fields, including high-level description, aesthetics, lighting, medium, style/photo mode, background, and color palette.
- Edit per-box mode, description, optional text content, and per-box color palette.
- Generate formatted JSON from the current canvas and form state.
- Paste existing prompt JSON into the JSON box and load it back into the editable canvas.

## Usage

Open `ui.html` directly in a modern browser.

The Tailwind design system is loaded from the Tailwind CDN, so the page needs internet access for styling. The editor logic itself is plain HTML, CSS, and JavaScript.

## Basic Workflow

1. Set the canvas size.
2. Draw boxes on the canvas by clicking and dragging.
3. Select a box and edit its properties in the right panel.
4. Fill in the global prompt settings.
5. Click `Generate JSON` to write the prompt JSON into the textarea.
6. Copy or save the generated JSON wherever your workflow needs it.

To edit an existing prompt, paste the JSON into the textarea and click `Load JSON`. The editor will rebuild the canvas boxes and form fields from the prompt.

## JSON Shape

The editor expects prompt JSON in this general form:

```json
{
  "high_level_description": "",
  "style_description": {
    "aesthetics": "",
    "lighting": "",
    "medium": "",
    "art_style": "",
    "color_palette": []
  },
  "compositional_deconstruction": {
    "background": "",
    "elements": [
      {
        "type": "obj",
        "bbox": [0, 0, 1000, 1000],
        "desc": "",
        "color_palette": []
      }
    ]
  }
}
```

Bounding boxes use normalized coordinates from `0` to `1000` in `[y1, x1, y2, x2]` order. The editor converts those coordinates to the current canvas size when loading JSON, then converts them back to normalized coordinates when generating JSON.
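
The round trip between normalized and pixel coordinates described above can be sketched as follows. This is an illustration in Python rather than the editor's actual JavaScript; the function names are hypothetical, and the rounding and clamping behavior is an assumption, since the text only states the coordinate range and the `[y1, x1, y2, x2]` order.

```python
NORM_MAX = 1000  # normalized coordinate range stated in the text


def bbox_to_pixels(bbox, canvas_width, canvas_height):
    """Map normalized [y1, x1, y2, x2] to a pixel (x, y, width, height)."""
    y1, x1, y2, x2 = bbox
    left = x1 / NORM_MAX * canvas_width
    top = y1 / NORM_MAX * canvas_height
    right = x2 / NORM_MAX * canvas_width
    bottom = y2 / NORM_MAX * canvas_height
    return (left, top, right - left, bottom - top)


def pixels_to_bbox(x, y, width, height, canvas_width, canvas_height):
    """Map a pixel rectangle back to normalized [y1, x1, y2, x2]."""
    def norm(value, extent):
        # Assumption: round to integers and clamp into 0..NORM_MAX.
        return max(0, min(NORM_MAX, round(value / extent * NORM_MAX)))

    return [
        norm(y, canvas_height),
        norm(x, canvas_width),
        norm(y + height, canvas_height),
        norm(x + width, canvas_width),
    ]


# A box on a 1024x768 canvas survives the round trip unchanged.
rect = bbox_to_pixels([250, 100, 750, 600], 1024, 768)
restored = pixels_to_bbox(*rect, 1024, 768)
```

Note that the y coordinate comes first in the normalized form, while the pixel rectangle here uses the more common x-first order.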
19 changes: 19 additions & 0 deletions fp8/transformer/config.json
@@ -0,0 +1,19 @@
{
  "_class_name": "Ideogram4Transformer2DModel",
  "_diffusers_version": "0.39.0.dev0",
  "_name_or_path": "/home/jinli/.cache/huggingface/hub/models--ideogram-ai--debug-ideogram-v4/snapshots/41af6183c9fd9b6254864b0720319ef984535bfc/transformer",
  "adaln_dim": 512,
  "attention_head_dim": 256,
  "in_channels": 128,
  "intermediate_size": 12288,
  "llm_features_dim": 53248,
  "mrope_section": [
    24,
    20,
    20
  ],
  "norm_eps": 1e-05,
  "num_attention_heads": 18,
  "num_layers": 34,
  "rope_theta": 5000000
}

Large diffs are not rendered by default.

19 changes: 19 additions & 0 deletions fp8/unconditional_transformer/config.json
@@ -0,0 +1,19 @@
{
  "_class_name": "Ideogram4Transformer2DModel",
  "_diffusers_version": "0.39.0.dev0",
  "_name_or_path": "/home/jinli/.cache/huggingface/hub/models--ideogram-ai--debug-ideogram-v4/snapshots/41af6183c9fd9b6254864b0720319ef984535bfc/unconditional_transformer",
  "adaln_dim": 512,
  "attention_head_dim": 256,
  "in_channels": 128,
  "intermediate_size": 12288,
  "llm_features_dim": 53248,
  "mrope_section": [
    24,
    20,
    20
  ],
  "norm_eps": 1e-05,
  "num_attention_heads": 18,
  "num_layers": 34,
  "rope_theta": 5000000
}

Large diffs are not rendered by default.

54 changes: 50 additions & 4 deletions loader.py
@@ -9,10 +9,23 @@
 from .ops import GGMLTensor
 from .dequant import is_quantized, dequantize_tensor

-IMG_ARCH_LIST = {"flux", "sd1", "sdxl", "sd3", "aura", "hidream", "cosmos", "ltxv", "hyvid", "wan", "lumina2", "qwen_image"}
+IMG_ARCH_LIST = {"flux", "sd1", "sdxl", "sd3", "aura", "hidream", "cosmos", "ltxv", "hyvid", "wan", "lumina2", "qwen_image", "ideogram", "krea2"}
 TXT_ARCH_LIST = {"t5", "t5encoder", "llama", "qwen2vl", "qwen3", "qwen3vl", "gemma3"}
 VIS_TYPE_LIST = {"clip-vision", "mmproj"}

+def device_supports_bf16():
+    """
+    Return True if the active torch device can run bf16 natively. On devices
+    without native bf16 support, computation silently falls back to fp32 which
+    is very slow, so callers should load tensors as fp16 instead.
+    """
+    try:
+        import comfy.model_management
+        return comfy.model_management.should_use_bf16(comfy.model_management.get_torch_device())
+    except Exception:
+        # If support can't be determined, keep the previous bf16 behavior.
+        return True
+
 def get_orig_shape(reader, tensor_name):
     field_key = f"comfy.gguf.orig_shape.{tensor_name}"
     field = reader.get_field(field_key)
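
`device_supports_bf16()` above follows a "probe, and keep the old behavior if the probe cannot run" pattern. A minimal sketch of that pattern, independent of ComfyUI and torch, might look like this; `probe_with_default` and `broken_probe` are hypothetical names used only for illustration.

```python
def probe_with_default(probe, default=True):
    """Run a capability probe; return `default` if it cannot be evaluated."""
    try:
        return bool(probe())
    except Exception:
        # Mirrors the patch: an unknown answer keeps the previous behavior.
        return default


def broken_probe():
    # Stand-in for an environment where the probe's import fails.
    raise ImportError("comfy is not importable here")


unknown_device = probe_with_default(broken_probe)      # falls back to True
device_without_bf16 = probe_with_default(lambda: False)
```

Defaulting to `True` means environments where the probe fails behave exactly as they did before the change, at the cost of not benefiting from the fp16 fallback.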
@@ -113,6 +126,9 @@ def gguf_sd_loader(path, handle_prefix="model.diffusion_model.", is_text_model=F
         logging.warning(f"Warning: This gguf model file is loaded in compatibility mode '{compat}' [arch:{arch_str}]")

     # main loading loop
+    # Devices without native bf16 fall back to slow fp32 compute, so load the
+    # full-precision BF16 storage tensors as fp16 there instead.
+    bf16_storage_dtype = torch.bfloat16 if device_supports_bf16() else torch.float16
     state_dict = {}
     qtype_dict = {}
     for sd_key, tensor in tensors:
@@ -138,9 +154,10 @@ def gguf_sd_loader(path, handle_prefix="model.diffusion_model.", is_text_model=F
             torch_tensor = torch_tensor.view(*shape)
         state_dict[sd_key] = GGMLTensor(torch_tensor, tensor_type=tensor.tensor_type, tensor_shape=shape)

-        # 1D tensors shouldn't be quantized, this is a fix for BF16
-        if len(shape) <= 1 and tensor.tensor_type == gguf.GGMLQuantizationType.BF16:
-            state_dict[sd_key] = dequantize_tensor(state_dict[sd_key], dtype=torch.float32)
+        # BF16 GGUF tensors are full-precision storage, not compressed quants.
+        if tensor.tensor_type == gguf.GGMLQuantizationType.BF16:
+            dtype = torch.float32 if len(shape) <= 1 else bf16_storage_dtype
+            state_dict[sd_key] = dequantize_tensor(state_dict[sd_key], dtype=dtype)

         # keep track of loaded tensor types
         tensor_type_str = getattr(tensor.tensor_type, "name", repr(tensor.tensor_type))
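
The new load-time rule for BF16 tensors in the hunk above can be restated as a small pure function: tensors with at most one dimension go to fp32, everything else goes to bf16 or, on devices without native bf16, to fp16. The sketch below uses dtype names as strings so it runs without torch; the function name and the example tensor names and shapes are hypothetical.

```python
def choose_bf16_load_dtype(shape, supports_bf16):
    """Pick the dtype a BF16 GGUF tensor is converted to at load time."""
    if len(shape) <= 1:
        # 1D tensors (norm weights, biases) are kept in fp32, as before.
        return "float32"
    return "bfloat16" if supports_bf16 else "float16"


# Hypothetical tensor names and shapes, for illustration only.
tensors = {
    "norm.weight": (4608,),
    "attn.qkv.weight": (13824, 4608),
}
plan_fp16_device = {
    name: choose_bf16_load_dtype(shape, supports_bf16=False)
    for name, shape in tensors.items()
}
```

Before this change only the 1D case was converted; multi-dimensional BF16 tensors stayed wrapped as `GGMLTensor` and were handled at inference time.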
@@ -501,6 +518,35 @@ def gguf_clip_loader(path):
         if arch == "qwen2vl":
             vsd = gguf_mmproj_loader(path)
             sd.update(vsd)
+        if arch == "qwen3vl" and "model.visual.deepstack_merger_list.0.norm.weight" not in sd:
+            # Standard llama.cpp Qwen3-VL GGUFs omit the visual tower. Without it,
+            # detect_te_model() mis-classifies the state dict as QWEN3_4B/8B (Qwen3 LM)
+            # instead of QWEN3VL_4B/8B, so clip_type=KREA2 never selects the 12-layer
+            # tap encoder and conditioning has shape (B, seq, 2560) instead of (B, seq, 30720).
+            # Inject zero sentinel tensors with shapes that exactly match the model
+            # parameters so that load_state_dict(strict=False) doesn't raise a size
+            # mismatch error while still satisfying detect_te_model()'s key checks.
+            # deepstack_merger_list.0.norm -> LayerNorm(vis_hidden * 4) shape [merge_dim]
+            # merger.linear_fc2 -> Linear(merge_dim, lm_hidden) shape [lm_hidden, merge_dim]
+            ln_key = "model.layers.0.input_layernorm.weight"
+            lm_hidden = int(sd[ln_key].shape[0]) if ln_key in sd else 2560
+            vis_hidden = 1024 if lm_hidden == 2560 else 1152 # Qwen3-VL-4B vs 8B
+            merge_dim = vis_hidden * 4 # spatial_merge_size=2
+            sd["model.visual.deepstack_merger_list.0.norm.weight"] = torch.zeros(merge_dim)
+            sd["model.visual.merger.linear_fc2.weight"] = torch.zeros(lm_hidden, merge_dim)
+            logging.info(f"qwen3vl GGUF: injected visual marker tensors (lm_hidden={lm_hidden}, merge_dim={merge_dim}) for model type detection")
+    elif arch == "ideogram":
+        # Dequantize Ideogram model for inference
+        logging.info("Dequantizing Ideogram model for inference...")
+        # Use BF16 to save VRAM while maintaining quality, but fall back to FP16
+        # on devices that don't support bf16 (avoids slow fp32 compute fallback).
+        target_dtype = torch.bfloat16 if device_supports_bf16() else torch.float16
+        dequantized_count = 0
+        for key in list(sd.keys()):
+            if is_quantized(sd[key]):
+                sd[key] = dequantize_tensor(sd[key], dtype=target_dtype)
+                dequantized_count += 1
+        logging.info(f"Dequantized {dequantized_count} tensors for Ideogram model ({target_dtype})")
     else:
         pass
     return sd
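
The shape arithmetic behind the injected Qwen3-VL sentinel tensors can be checked in isolation. The sketch below reproduces only the shape computation from the hunk above, without torch; `sentinel_shapes` is a hypothetical helper, and the value 4096 is used purely as an example of a hidden size other than 2560. Reading 30720 as 12 tapped layers of width 2560 is an inference from the numbers in the code comment, not something the patch states.

```python
def sentinel_shapes(lm_hidden=2560):
    """Shapes of the two marker tensors, following the rule in the patch."""
    vis_hidden = 1024 if lm_hidden == 2560 else 1152
    merge_dim = vis_hidden * 4  # the patch attributes the factor to spatial_merge_size=2
    return {
        "model.visual.deepstack_merger_list.0.norm.weight": (merge_dim,),
        "model.visual.merger.linear_fc2.weight": (lm_hidden, merge_dim),
    }


shapes_default = sentinel_shapes(2560)
shapes_other = sentinel_shapes(4096)  # example non-2560 hidden size (assumption)

# Consistency check on the widths quoted in the comment: 12 * 2560 == 30720.
TAP_LAYERS = 12
conditioning_width = TAP_LAYERS * 2560
```

Because the tensors are all zeros and exist only to steer model-type detection, getting the shapes right matters more than the values.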
40 changes: 37 additions & 3 deletions ops.py
@@ -8,6 +8,18 @@
 import comfy.model_management
 from .dequant import dequantize_tensor, is_quantized

+def _valid_compute_dtype(dtype):
+    return dtype in {torch.float16, torch.bfloat16, torch.float32, torch.float64}
+
+def _infer_compute_dtype(tensor_type, fallback=None):
+    if _valid_compute_dtype(fallback):
+        return fallback
+    if tensor_type == gguf.GGMLQuantizationType.BF16:
+        return torch.bfloat16
+    if tensor_type == gguf.GGMLQuantizationType.F32:
+        return torch.float32
+    return torch.float16
+
 def chained_hasattr(obj, chained_attr):
     probe = obj
     for attr in chained_attr.split('.'):
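
The precedence in `_infer_compute_dtype` above (a valid fallback wins, then the tensor's own storage type, then fp16) can be illustrated without torch or gguf. In this sketch `QType` is a hypothetical stand-in for `gguf.GGMLQuantizationType` and dtypes are plain strings; neither is part of the patch.

```python
from enum import Enum


class QType(Enum):
    """Hypothetical stand-in for gguf.GGMLQuantizationType."""
    F32 = "F32"
    BF16 = "BF16"
    Q8_0 = "Q8_0"


VALID_COMPUTE_DTYPES = {"float16", "bfloat16", "float32", "float64"}


def infer_compute_dtype(tensor_type, fallback=None):
    """Pick the floating-point dtype a packed tensor should report."""
    if fallback in VALID_COMPUTE_DTYPES:
        return fallback  # an explicit, valid request always wins
    if tensor_type is QType.BF16:
        return "bfloat16"
    if tensor_type is QType.F32:
        return "float32"
    return "float16"  # default for every other quant type


quant_default = infer_compute_dtype(QType.Q8_0)
quant_with_model_dtype = infer_compute_dtype(QType.Q8_0, "bfloat16")
invalid_fallback = infer_compute_dtype(QType.F32, "uint8")
```

The point of the rule is that the result is always a floating-point dtype, never the packed byte storage type, even when the supplied fallback is unusable.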
@@ -45,20 +57,22 @@ class GGMLTensor(torch.Tensor):
     """
     Main tensor-like class for storing quantized weights
     """
-    def __init__(self, *args, tensor_type, tensor_shape, patches=[], **kwargs):
+    def __init__(self, *args, tensor_type, tensor_shape, patches=[], compute_dtype=None, **kwargs):
         super().__init__()
         self.tensor_type = tensor_type
         self.tensor_shape = tensor_shape
         self.patches = patches
+        self.compute_dtype = compute_dtype

-    def __new__(cls, *args, tensor_type, tensor_shape, patches=[], **kwargs):
+    def __new__(cls, *args, tensor_type, tensor_shape, patches=[], compute_dtype=None, **kwargs):
         return super().__new__(cls, *args, **kwargs)

     def to(self, *args, **kwargs):
         new = super().to(*args, **kwargs)
         new.tensor_type = getattr(self, "tensor_type", None)
         new.tensor_shape = getattr(self, "tensor_shape", new.data.shape)
         new.patches = getattr(self, "patches", []).copy()
+        new.compute_dtype = getattr(self, "compute_dtype", None)
         return new

     def clone(self, *args, **kwargs):
@@ -81,9 +95,17 @@ def new_empty(self, size, *args, **kwargs):
             new_tensor,
             tensor_type = getattr(self, "tensor_type", None),
             tensor_shape = size,
-            patches = getattr(self, "patches", []).copy()
+            patches = getattr(self, "patches", []).copy(),
+            compute_dtype = getattr(self, "compute_dtype", None),
         )

+    @property
+    def dtype(self):
+        qtype = getattr(self, "tensor_type", None)
+        if qtype in GGMLLayer.torch_compatible_tensor_types:
+            return torch.Tensor(self).dtype
+        return _infer_compute_dtype(qtype, getattr(self, "compute_dtype", None))
+
     @property
     def shape(self):
         if not hasattr(self, "tensor_shape"):
@@ -121,8 +143,12 @@ def ggml_load_from_state_dict(self, state_dict, prefix, local_metadata, strict,
         prefix_len = len(prefix)
         for k,v in state_dict.items():
             if k[prefix_len:] == "weight":
+                if isinstance(v, GGMLTensor):
+                    v.compute_dtype = self._ggml_compute_dtype(v, "weight")
                 self.weight = torch.nn.Parameter(v, requires_grad=False)
             elif k[prefix_len:] == "bias" and v is not None:
+                if isinstance(v, GGMLTensor):
+                    v.compute_dtype = self._ggml_compute_dtype(v, "bias")
                 self.bias = torch.nn.Parameter(v, requires_grad=False)
             else:
                 unexpected_keys.append(k)
@@ -137,6 +163,12 @@ def ggml_load_from_state_dict(self, state_dict, prefix, local_metadata, strict,
         if getattr(self.weight, "is_largest_weight", False):
             self.largest_layer = True

+    def _ggml_compute_dtype(self, tensor, param_name):
+        if self.dequant_dtype is not None and self.dequant_dtype != "target":
+            return self.dequant_dtype
+        model_dtype = getattr(self, f"{param_name}_comfy_model_dtype", None)
+        return _infer_compute_dtype(getattr(tensor, "tensor_type", None), model_dtype)
+
     def _save_to_state_dict(self, *args, **kwargs):
         if self.is_ggml_quantized():
             return self.ggml_save_to_state_dict(*args, **kwargs)
@@ -238,6 +270,8 @@ def __init__(self, in_features, out_features, bias=True, device=None, dtype=None
             self.out_features = out_features
             self.weight = None
             self.bias = None
+            self.weight_comfy_model_dtype = dtype
+            self.bias_comfy_model_dtype = dtype

         def forward_ggml_cast_weights(self, input):
             weight, bias = self.cast_bias_weight(input)
42 changes: 42 additions & 0 deletions pr.md
@@ -0,0 +1,42 @@
# Summary

This adds support for Ideogram GGUF models.

# What Changed

- Added `ideogram` to the supported image GGUF architectures.
- Added Ideogram model detection to the converter.
- Added GGUF dtype handling needed by Ideogram inference.
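- Fixed the Ideogram inference failure where a packed GGUF weight exposed its `torch.uint8` storage dtype, causing a byte tensor to reach the CUDA linear kernel.
- Adjusted BF16 GGUF loading so Ideogram can start inference faster: BF16 tensors are converted at load time to bf16, or to fp16 on devices without native bf16 support, with 1D tensors kept in fp32.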

# Error Fixed

Before this change, Ideogram GGUF models could load but failed during sampling with:

```text
"addmm_cuda" not implemented for 'Byte'
```

# Tests

Ran Python compile checks:

```text
python -m py_compile dequant.py ops.py loader.py nodes.py tools\convert.py
```

Checked a synthetic GGUF linear forward:

- packed storage stayed `torch.uint8`
- reported dtype was `torch.bfloat16`
- input dtype was `torch.bfloat16`
- output dtype was `torch.bfloat16`

Checked BF16 loading behavior with finite values.

# Notes

Full ComfyUI inference was not run from this environment.

Other GGUF quant types still use the existing dequant paths.