city96 · molbal · Jun 12, 2026 · Jun 24, 2026 · Jun 25, 2026 · Jun 25, 2026
diff --git a/README.md b/README.md
@@ -1,6 +1,11 @@
 # ComfyUI-GGUF
 GGUF Quantization support for native ComfyUI models
 
+> [!NOTE]  
+> This is a fork of the original nodes, updated to support loading Ideogram 4 GGUFs and Krea 2 GGUFs. 
+> To use it, clone `https://github.com/city96/ComfyUI-GGUF` and not the original repo.
+
+
 This is currently very much WIP. These custom nodes provide support for model files stored in the GGUF format popularized by [llama.cpp](https://github.com/ggerganov/llama.cpp).
 
 While quantization wasn't feasible for regular UNET models (conv2d), transformer/DiT models such as flux seem less affected by quantization. This allows running it in much lower bits per weight variable bitrate quants on low-end GPUs. For further VRAM savings, a node to load a quantized version of the T5 text encoder is also included.
@@ -35,15 +40,22 @@ Simply use the GGUF Unet loader found under the `bootleg` category. Place the .g
 
 LoRA loading is experimental but it should work with just the built-in LoRA loader node(s).
 
-Pre-quantized models:
+Pre-quantized models (🍴 icon on ones added by this fork):
 
 - [flux1-dev GGUF](https://huggingface.co/city96/FLUX.1-dev-gguf)
 - [flux1-schnell GGUF](https://huggingface.co/city96/FLUX.1-schnell-gguf)
 - [stable-diffusion-3.5-large GGUF](https://huggingface.co/city96/stable-diffusion-3.5-large-gguf)
 - [stable-diffusion-3.5-large-turbo GGUF](https://huggingface.co/city96/stable-diffusion-3.5-large-turbo-gguf)
+- [Krea 2 (Both Turbo and Raw)](https://huggingface.co/molbal/krea2-gguf) 🍴
+- [Ideogram 4](https://huggingface.co/molbal/ideogram-4-gguf) 🍴
+
+
+> [!IMPORTANT]  
+> Please note that this fork does not support _K quants on diffusion models, only on text encoders. Such files may or may not load, and inference may be very slow. Other forks or custom nodes may have better support for these quantization types.
 
 Initial support for quantizing T5 has also been added recently, these can be used using the various `*CLIPLoader (gguf)` nodes which can be used inplace of the regular ones. For the CLIP model, use whatever model you were using before for CLIP. The loader can handle both types of files - `gguf` and regular `safetensors`/`bin`.
 
 - [t5_v1.1-xxl GGUF](https://huggingface.co/city96/t5-v1_1-xxl-encoder-gguf)
+- [Qwen3-VL-4B-Instruct-GGUF](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct-GGUF) 🍴
 
 See the instructions in the [tools](https://github.com/city96/ComfyUI-GGUF/tree/main/tools) folder for how to create your own quants.
diff --git a/dequant.py b/dequant.py
@@ -18,6 +18,9 @@ def dequantize_tensor(tensor, dtype=None, dequant_dtype=None):
 
     if qtype in TORCH_COMPATIBLE_QTYPES:
         return tensor.to(dtype)
+    elif qtype == gguf.GGMLQuantizationType.BF16:
+        tensor = torch.Tensor(tensor.data.view(torch.bfloat16).reshape(oshape))
+        return tensor if dtype is None or dtype == torch.bfloat16 else tensor.to(dtype)
     elif qtype in dequantize_functions:
         dequant_dtype = dtype if dequant_dtype == "target" else dequant_dtype
         return dequantize(tensor.data, qtype, oshape, dtype=dequant_dtype).to(dtype)
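
The BF16 branch in this hunk reinterprets the stored bytes as bfloat16 instead of running a dequantization kernel. A minimal standard-library sketch of why a plain reinterpretation is enough, assuming little-endian storage: a bfloat16 value is the upper 16 bits of an IEEE 754 float32, so widening it is a 16-bit left shift. The helper name `bf16_bytes_to_floats` is hypothetical and is not part of this PR.

```python
import struct

def bf16_bytes_to_floats(raw: bytes) -> list:
    """Decode little-endian bfloat16 values into Python floats (illustrative only)."""
    if len(raw) % 2:
        raise ValueError("bfloat16 data must contain an even number of bytes")
    count = len(raw) // 2
    halves = struct.unpack(f"<{count}H", raw)
    # Put each 16-bit pattern into the high half of a 32-bit word,
    # then reinterpret that word as a float32.
    return [struct.unpack("<f", struct.pack("<I", h << 16))[0] for h in halves]

# 1.0 is 0x3F800000 as float32, so its bfloat16 bit pattern is 0x3F80.
raw = struct.pack("<3H", 0x3F80, 0xC000, 0x0000)
print(bf16_bytes_to_floats(raw))  # [1.0, -2.0, 0.0]
```

No scales or block structure are involved, which is why BF16 can be treated as storage rather than as a compressed quant.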

diff --git a/editor.md b/editor.md
@@ -0,0 +1,60 @@
+# Prompt Canvas Editor
+
+A single-file browser editor for building Ideogram-style structured JSON prompts with canvas-based bounding boxes. The app lives entirely in `ui.html`; there is no build step, package manager, or local server requirement.
+
+## Features
+
+- Set canvas width and height with sliders or by double-clicking the displayed values.
+- Draw, move, resize, delete, and edit bounding boxes directly on the canvas.
+- Cycle through selected boxes with the `<-` and `->` controls when boxes overlap.
+- Edit global prompt fields, including high-level description, aesthetics, lighting, medium, style/photo mode, background, and color palette.
+- Edit per-box mode, description, optional text content, and per-box color palette.
+- Generate formatted JSON from the current canvas and form state.
+- Paste existing prompt JSON into the JSON box and load it back into the editable canvas.
+
+## Usage
+
+Open `ui.html` directly in a modern browser.
+
+The Tailwind design system is loaded from the Tailwind CDN, so the page needs internet access for styling. The editor logic itself is plain HTML, CSS, and JavaScript.
+
+## Basic Workflow
+
+1. Set the canvas size.
+2. Draw boxes on the canvas by clicking and dragging.
+3. Select a box and edit its properties in the right panel.
+4. Fill in the global prompt settings.
+5. Click `Generate JSON` to write the prompt JSON into the textarea.
+6. Copy or save the generated JSON wherever your workflow needs it.
+
+To edit an existing prompt, paste the JSON into the textarea and click `Load JSON`. The editor will rebuild the canvas boxes and form fields from the prompt.
+
+## JSON Shape
+
+The editor expects prompt JSON in this general form:
+
+```json
+{
+  "high_level_description": "",
+  "style_description": {
+    "aesthetics": "",
+    "lighting": "",
+    "medium": "",
+    "art_style": "",
+    "color_palette": []
+  },
+  "compositional_deconstruction": {
+    "background": "",
+    "elements": [
+      {
+        "type": "obj",
+        "bbox": [0, 0, 1000, 1000],
+        "desc": "",
+        "color_palette": []
+      }
+    ]
+  }
+}
+```
+
+Bounding boxes use normalized coordinates from `0` to `1000` in `[y1, x1, y2, x2]` order. The editor converts those coordinates to the current canvas size when loading JSON, then converts them back to normalized coordinates when generating JSON.
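
The coordinate conversion described in the last paragraph of `editor.md` can be sketched as follows. This is an illustration in Python to match the rest of the repository, not the JavaScript in `ui.html`; the function names and the rounding on the way back to normalized coordinates are assumptions.

```python
def bbox_to_pixels(bbox, canvas_width, canvas_height):
    """Map a normalized [y1, x1, y2, x2] box (0..1000) to pixel (x, y, width, height)."""
    y1, x1, y2, x2 = bbox
    left = x1 / 1000 * canvas_width
    top = y1 / 1000 * canvas_height
    right = x2 / 1000 * canvas_width
    bottom = y2 / 1000 * canvas_height
    return (left, top, right - left, bottom - top)

def pixels_to_bbox(x, y, width, height, canvas_width, canvas_height):
    """Map a pixel rectangle back to a normalized [y1, x1, y2, x2] box."""
    # Rounding to integers is an assumption; the editor may round differently.
    return [
        round(y / canvas_height * 1000),
        round(x / canvas_width * 1000),
        round((y + height) / canvas_height * 1000),
        round((x + width) / canvas_width * 1000),
    ]

box = [100, 200, 600, 700]
rect = bbox_to_pixels(box, 1024, 768)
print(pixels_to_bbox(*rect, 1024, 768))
```

Note that the y coordinate comes first in the normalized form, so the unpacking order differs from the usual `(x, y, width, height)` canvas rectangle.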
diff --git a/fp8/transformer/config.json b/fp8/transformer/config.json
@@ -0,0 +1,19 @@
+{
+  "_class_name": "Ideogram4Transformer2DModel",
+  "_diffusers_version": "0.39.0.dev0",
+  "_name_or_path": "/home/jinli/.cache/huggingface/hub/models--ideogram-ai--debug-ideogram-v4/snapshots/41af6183c9fd9b6254864b0720319ef984535bfc/transformer",
+  "adaln_dim": 512,
+  "attention_head_dim": 256,
+  "in_channels": 128,
+  "intermediate_size": 12288,
+  "llm_features_dim": 53248,
+  "mrope_section": [
+    24,
+    20,
+    20
+  ],
+  "norm_eps": 1e-05,
+  "num_attention_heads": 18,
+  "num_layers": 34,
+  "rope_theta": 5000000
+}
diff --git a/fp8/transformer/diffusion_pytorch_model.safetensors.index.json b/fp8/transformer/diffusion_pytorch_model.safetensors.index.json
diff --git a/fp8/unconditional_transformer/config.json b/fp8/unconditional_transformer/config.json
@@ -0,0 +1,19 @@
+{
+  "_class_name": "Ideogram4Transformer2DModel",
+  "_diffusers_version": "0.39.0.dev0",
+  "_name_or_path": "/home/jinli/.cache/huggingface/hub/models--ideogram-ai--debug-ideogram-v4/snapshots/41af6183c9fd9b6254864b0720319ef984535bfc/unconditional_transformer",
+  "adaln_dim": 512,
+  "attention_head_dim": 256,
+  "in_channels": 128,
+  "intermediate_size": 12288,
+  "llm_features_dim": 53248,
+  "mrope_section": [
+    24,
+    20,
+    20
+  ],
+  "norm_eps": 1e-05,
+  "num_attention_heads": 18,
+  "num_layers": 34,
+  "rope_theta": 5000000
+}
diff --git a/fp8/unconditional_transformer/diffusion_pytorch_model.safetensors.index.json b/fp8/unconditional_transformer/diffusion_pytorch_model.safetensors.index.json
diff --git a/loader.py b/loader.py
@@ -9,10 +9,23 @@
 from .ops import GGMLTensor
 from .dequant import is_quantized, dequantize_tensor
 
-IMG_ARCH_LIST = {"flux", "sd1", "sdxl", "sd3", "aura", "hidream", "cosmos", "ltxv", "hyvid", "wan", "lumina2", "qwen_image"}
+IMG_ARCH_LIST = {"flux", "sd1", "sdxl", "sd3", "aura", "hidream", "cosmos", "ltxv", "hyvid", "wan", "lumina2", "qwen_image", "ideogram", "krea2"}
 TXT_ARCH_LIST = {"t5", "t5encoder", "llama", "qwen2vl", "qwen3", "qwen3vl", "gemma3"}
 VIS_TYPE_LIST = {"clip-vision", "mmproj"}
 
+def device_supports_bf16():
+    """
+    Return True if the active torch device can run bf16 natively. On devices
+    without native bf16 support, computation silently falls back to fp32 which
+    is very slow, so callers should load tensors as fp16 instead.
+    """
+    try:
+        import comfy.model_management
+        return comfy.model_management.should_use_bf16(comfy.model_management.get_torch_device())
+    except Exception:
+        # If support can't be determined, keep the previous bf16 behavior.
+        return True
+
 def get_orig_shape(reader, tensor_name):
     field_key = f"comfy.gguf.orig_shape.{tensor_name}"
     field = reader.get_field(field_key)
@@ -113,6 +126,9 @@ def gguf_sd_loader(path, handle_prefix="model.diffusion_model.", is_text_model=F
         logging.warning(f"Warning: This gguf model file is loaded in compatibility mode '{compat}' [arch:{arch_str}]")
 
     # main loading loop
+    # Devices without native bf16 fall back to slow fp32 compute, so load the
+    # full-precision BF16 storage tensors as fp16 there instead.
+    bf16_storage_dtype = torch.bfloat16 if device_supports_bf16() else torch.float16
     state_dict = {}
     qtype_dict = {}
     for sd_key, tensor in tensors:
@@ -138,9 +154,10 @@ def gguf_sd_loader(path, handle_prefix="model.diffusion_model.", is_text_model=F
             torch_tensor = torch_tensor.view(*shape)
         state_dict[sd_key] = GGMLTensor(torch_tensor, tensor_type=tensor.tensor_type, tensor_shape=shape)
 
-        # 1D tensors shouldn't be quantized, this is a fix for BF16
-        if len(shape) <= 1 and tensor.tensor_type == gguf.GGMLQuantizationType.BF16:
-            state_dict[sd_key] = dequantize_tensor(state_dict[sd_key], dtype=torch.float32)
+        # BF16 GGUF tensors are full-precision storage, not compressed quants.
+        if tensor.tensor_type == gguf.GGMLQuantizationType.BF16:
+            dtype = torch.float32 if len(shape) <= 1 else bf16_storage_dtype
+            state_dict[sd_key] = dequantize_tensor(state_dict[sd_key], dtype=dtype)
 
         # keep track of loaded tensor types
         tensor_type_str = getattr(tensor.tensor_type, "name", repr(tensor.tensor_type))
@@ -501,6 +518,35 @@ def gguf_clip_loader(path):
         if arch == "qwen2vl":
             vsd = gguf_mmproj_loader(path)
             sd.update(vsd)
+        if arch == "qwen3vl" and "model.visual.deepstack_merger_list.0.norm.weight" not in sd:
+            # Standard llama.cpp Qwen3-VL GGUFs omit the visual tower. Without it,
+            # detect_te_model() mis-classifies the state dict as QWEN3_4B/8B (Qwen3 LM)
+            # instead of QWEN3VL_4B/8B, so clip_type=KREA2 never selects the 12-layer
+            # tap encoder and conditioning has shape (B, seq, 2560) instead of (B, seq, 30720).
+            # Inject zero sentinel tensors with shapes that exactly match the model
+            # parameters so that load_state_dict(strict=False) doesn't raise a size
+            # mismatch error while still satisfying detect_te_model()'s key checks.
+            #   deepstack_merger_list.0.norm  -> LayerNorm(vis_hidden * 4)  shape [merge_dim]
+            #   merger.linear_fc2             -> Linear(merge_dim, lm_hidden) shape [lm_hidden, merge_dim]
+            ln_key = "model.layers.0.input_layernorm.weight"
+            lm_hidden = int(sd[ln_key].shape[0]) if ln_key in sd else 2560
+            vis_hidden = 1024 if lm_hidden == 2560 else 1152  # Qwen3-VL-4B vs 8B
+            merge_dim = vis_hidden * 4  # spatial_merge_size=2
+            sd["model.visual.deepstack_merger_list.0.norm.weight"] = torch.zeros(merge_dim)
+            sd["model.visual.merger.linear_fc2.weight"] = torch.zeros(lm_hidden, merge_dim)
+            logging.info(f"qwen3vl GGUF: injected visual marker tensors (lm_hidden={lm_hidden}, merge_dim={merge_dim}) for model type detection")
+    elif arch == "ideogram":
+        # Dequantize Ideogram model for inference
+        logging.info("Dequantizing Ideogram model for inference...")
+        # Use BF16 to save VRAM while maintaining quality, but fall back to FP16
+        # on devices that don't support bf16 (avoids slow fp32 compute fallback).
+        target_dtype = torch.bfloat16 if device_supports_bf16() else torch.float16
+        dequantized_count = 0
+        for key in list(sd.keys()):
+            if is_quantized(sd[key]):
+                sd[key] = dequantize_tensor(sd[key], dtype=target_dtype)
+                dequantized_count += 1
+        logging.info(f"Dequantized {dequantized_count} tensors for Ideogram model ({target_dtype})")
     else:
         pass
     return sd
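
The shape arithmetic for the injected Qwen3-VL marker tensors can be checked without loading a model. The sketch below mirrors the constants from the hunk above; the function name `visual_marker_shapes` is hypothetical, and it returns plain tuples instead of allocating `torch.zeros` tensors.

```python
def visual_marker_shapes(lm_hidden=2560):
    """Return the shapes of the two sentinel tensors for a given LM hidden size."""
    # Same rule as the diff: 1024 for the 2560-wide LM, otherwise 1152.
    vis_hidden = 1024 if lm_hidden == 2560 else 1152
    merge_dim = vis_hidden * 4  # spatial_merge_size=2, so 2 * 2 patches are merged
    return {
        "model.visual.deepstack_merger_list.0.norm.weight": (merge_dim,),
        "model.visual.merger.linear_fc2.weight": (lm_hidden, merge_dim),
    }

for hidden in (2560, 4096):  # 4096 is used here only as an example non-2560 value
    print(hidden, visual_marker_shapes(hidden))
```

Any hidden size other than 2560 takes the 1152 branch, so a model variant not anticipated by the diff would still receive sentinels, possibly with shapes that do not match its parameters.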
diff --git a/ops.py b/ops.py
@@ -8,6 +8,18 @@
 import comfy.model_management
 from .dequant import dequantize_tensor, is_quantized
 
+def _valid_compute_dtype(dtype):
+    return dtype in {torch.float16, torch.bfloat16, torch.float32, torch.float64}
+
+def _infer_compute_dtype(tensor_type, fallback=None):
+    if _valid_compute_dtype(fallback):
+        return fallback
+    if tensor_type == gguf.GGMLQuantizationType.BF16:
+        return torch.bfloat16
+    if tensor_type == gguf.GGMLQuantizationType.F32:
+        return torch.float32
+    return torch.float16
+
 def chained_hasattr(obj, chained_attr):
     probe = obj
     for attr in chained_attr.split('.'):
@@ -45,20 +57,22 @@ class GGMLTensor(torch.Tensor):
     """
     Main tensor-like class for storing quantized weights
     """
-    def __init__(self, *args, tensor_type, tensor_shape, patches=[], **kwargs):
+    def __init__(self, *args, tensor_type, tensor_shape, patches=[], compute_dtype=None, **kwargs):
         super().__init__()
         self.tensor_type = tensor_type
         self.tensor_shape = tensor_shape
         self.patches = patches
+        self.compute_dtype = compute_dtype
 
-    def __new__(cls, *args, tensor_type, tensor_shape, patches=[], **kwargs):
+    def __new__(cls, *args, tensor_type, tensor_shape, patches=[], compute_dtype=None, **kwargs):
         return super().__new__(cls, *args, **kwargs)
 
     def to(self, *args, **kwargs):
         new = super().to(*args, **kwargs)
         new.tensor_type = getattr(self, "tensor_type", None)
         new.tensor_shape = getattr(self, "tensor_shape", new.data.shape)
         new.patches = getattr(self, "patches", []).copy()
+        new.compute_dtype = getattr(self, "compute_dtype", None)
         return new
 
     def clone(self, *args, **kwargs):
@@ -81,9 +95,17 @@ def new_empty(self, size, *args, **kwargs):
                 new_tensor,
                 tensor_type = getattr(self, "tensor_type", None),
                 tensor_shape = size,
-                patches = getattr(self, "patches", []).copy()
+                patches = getattr(self, "patches", []).copy(),
+                compute_dtype = getattr(self, "compute_dtype", None),
         )
 
+    @property
+    def dtype(self):
+        qtype = getattr(self, "tensor_type", None)
+        if qtype in GGMLLayer.torch_compatible_tensor_types:
+            return torch.Tensor(self).dtype
+        return _infer_compute_dtype(qtype, getattr(self, "compute_dtype", None))
+
     @property
     def shape(self):
         if not hasattr(self, "tensor_shape"):
@@ -121,8 +143,12 @@ def ggml_load_from_state_dict(self, state_dict, prefix, local_metadata, strict,
         prefix_len = len(prefix)
         for k,v in state_dict.items():
             if k[prefix_len:] == "weight":
+                if isinstance(v, GGMLTensor):
+                    v.compute_dtype = self._ggml_compute_dtype(v, "weight")
                 self.weight = torch.nn.Parameter(v, requires_grad=False)
             elif k[prefix_len:] == "bias" and v is not None:
+                if isinstance(v, GGMLTensor):
+                    v.compute_dtype = self._ggml_compute_dtype(v, "bias")
                 self.bias = torch.nn.Parameter(v, requires_grad=False)
             else:
                 unexpected_keys.append(k)
@@ -137,6 +163,12 @@ def ggml_load_from_state_dict(self, state_dict, prefix, local_metadata, strict,
         if getattr(self.weight, "is_largest_weight", False):
             self.largest_layer = True
 
+    def _ggml_compute_dtype(self, tensor, param_name):
+        if self.dequant_dtype is not None and self.dequant_dtype != "target":
+            return self.dequant_dtype
+        model_dtype = getattr(self, f"{param_name}_comfy_model_dtype", None)
+        return _infer_compute_dtype(getattr(tensor, "tensor_type", None), model_dtype)
+
     def _save_to_state_dict(self, *args, **kwargs):
         if self.is_ggml_quantized():
             return self.ggml_save_to_state_dict(*args, **kwargs)
@@ -238,6 +270,8 @@ def __init__(self, in_features, out_features, bias=True, device=None, dtype=None
             self.out_features = out_features
             self.weight = None
             self.bias = None
+            self.weight_comfy_model_dtype = dtype
+            self.bias_comfy_model_dtype = dtype
 
         def forward_ggml_cast_weights(self, input):
             weight, bias = self.cast_bias_weight(input)
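
The precedence that `_ggml_compute_dtype` and `_infer_compute_dtype` establish together may be easier to follow in isolation. The sketch below is torch-free and uses strings in place of `torch.dtype` and `gguf.GGMLQuantizationType` values; `resolve_compute_dtype` is a hypothetical name, not a function in this PR.

```python
VALID_COMPUTE_DTYPES = {"float16", "bfloat16", "float32", "float64"}

def resolve_compute_dtype(tensor_type, dequant_dtype=None, model_dtype=None):
    """Pick the dtype a packed GGUF weight should report (illustrative only)."""
    # 1. An explicit dequant dtype wins, unless it is the "target" placeholder.
    if dequant_dtype is not None and dequant_dtype != "target":
        return dequant_dtype
    # 2. Otherwise use the dtype the layer was constructed with, if it is a float type.
    if model_dtype in VALID_COMPUTE_DTYPES:
        return model_dtype
    # 3. Otherwise infer from the storage type, defaulting to float16.
    if tensor_type == "BF16":
        return "bfloat16"
    if tensor_type == "F32":
        return "float32"
    return "float16"

print(resolve_compute_dtype("Q8_0"))                          # float16
print(resolve_compute_dtype("Q8_0", model_dtype="bfloat16"))  # bfloat16
print(resolve_compute_dtype("BF16", dequant_dtype="target"))  # bfloat16
```

The result is always a floating-point dtype, which is how the new `dtype` property on `GGMLTensor` keeps packed byte storage from being reported as `torch.uint8`.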

diff --git a/pr.md b/pr.md
@@ -0,0 +1,42 @@
+# Summary
+
+This adds support for Ideogram GGUF models.
+
+# What Changed
+
+- Added `ideogram` to the supported image GGUF architectures.
+- Added Ideogram model detection to the converter.
+- Added GGUF dtype handling needed by Ideogram inference.
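+- Fixed the Ideogram inference failure where a packed GGUF weight reported its `torch.uint8` storage dtype, so a byte tensor reached the CUDA linear op.
+- Adjusted BF16 GGUF loading so that BF16 tensors are converted to plain `torch.bfloat16` tensors at load time (or `torch.float16` on devices without native bf16), which lets Ideogram start inference faster.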
+
+# Error Fixed
+
+Before this change, Ideogram GGUF models could load but failed during sampling with:
+
+```text
+"addmm_cuda" not implemented for 'Byte'
+```
+
+# Tests
+
+Ran Python compile checks:
+
+```text
+python -m py_compile dequant.py ops.py loader.py nodes.py tools\convert.py
+```
+
+Checked a synthetic GGUF linear forward:
+
+- packed storage stayed `torch.uint8`
+- reported dtype was `torch.bfloat16`
+- input dtype was `torch.bfloat16`
+- output dtype was `torch.bfloat16`
+
+Checked BF16 loading behavior with finite values.
+
+# Notes
+
+Full ComfyUI inference was not run from this environment.
+
+Other GGUF quant types still use the existing dequant paths.