Qwen-Image

Qwen-Image is an image-generation AI from Alibaba Cloud in China.

These are my notes on running it from Python on an M4 Max Mac Studio with 128 GB of memory. In general, running Qwen-Image-gguf in ComfyUI seems to be the easier route.

First install PyTorch and the latest diffusers (pip install torch diffusers). After that, adapt the sample code as needed.

from diffusers import DiffusionPipeline
import torch

device = "mps"
pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16).to(device)

positive_magic = {
    "en": ", Ultra HD, 4K, cinematic composition.", # for english prompt
    "zh": ", 超清，4K，电影级构图." # for chinese prompt
}

prompt = "a Japanese anime girl"  # prompt text

image = pipe(
    prompt=prompt + positive_magic["en"],
    negative_prompt=" ",
    width=640,
    height=640,
    num_inference_steps=50,
    true_cfg_scale=4.0,
    generator=torch.Generator(device=device).manual_seed(42),
).images[0]

image.save("example.webp", lossless=True, method=6)

Any format works for saving; above I used lossless WebP.

A 640×640 image takes a little over 3 minutes; 1664×928 takes about 15 minutes. While generating, the M4 Max Mac Studio draws about 60 W more power and gets slightly warm.
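To put those two timings side by side, here is a rough back-of-the-envelope sketch. It assumes the VAE downsamples by 8 and the latents are packed in 2×2 patches (my reading of the [B, num_patches, 64] latent shape with 16 latent channels, not something I verified against the model config), and it uses the rounded timings quoted above. The helper name num_tokens is mine.

```python
def num_tokens(width, height, vae_scale=8, patch=2):
    """Packed latent tokens per image, ASSUMING 8x VAE downsampling and 2x2 packing."""
    cell = vae_scale * patch  # one token covers a 16x16 pixel cell under these assumptions
    return (width // cell) * (height // cell)

small = num_tokens(640, 640)    # 40 * 40
large = num_tokens(1664, 928)   # 104 * 58
token_ratio = large / small

# Rounded wall-clock timings from the text: "a little over 3 min" vs "about 15 min".
time_ratio = 15 / 3

print(small, large, round(token_ratio, 2), time_ratio)
```

Under these assumptions the larger image has roughly 3.8 times as many tokens but takes about 5 times as long, so the time seems to grow somewhat faster than the token count. The timings are too rough to read much more into it.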

To understand how the diffusion model works, I had GPT-5 write code that saves the intermediate steps. Only the added parts are shown.

height, width = 640, 640

# --- live preview callback ---
def on_step_end(pipe, i, t, kw):
    tokens = kw["latents"]  # shape: [B, num_patches, 64]
    with torch.no_grad():
        # 1) unpack packed tokens -> VAE latent grid [B, z_dim(=16), T(=1), H, W]
        lat = pipe._unpack_latents(tokens, height, width, pipe.vae_scale_factor)  # private helper used by the pipeline

        # 2) match VAE dtype/device
        lat = lat.to(pipe.vae.dtype, non_blocking=True)

        # 3) un-normalize (pipeline does this right before decoding)
        mean = torch.tensor(pipe.vae.config.latents_mean, device=lat.device, dtype=lat.dtype).view(1, pipe.vae.config.z_dim, 1, 1, 1)
        stdinv = 1.0 / torch.tensor(pipe.vae.config.latents_std, device=lat.device, dtype=lat.dtype).view(1, pipe.vae.config.z_dim, 1, 1, 1)
        lat = lat / stdinv + mean

        # 4) decode and postprocess to a PIL image
        frame = pipe.vae.decode(lat, return_dict=False)[0][:, :, 0]  # take temporal dim 0
        pil = pipe.image_processor.postprocess(frame, output_type="pil")[0]

        pil.save(f"step_{i:03d}.webp")  # or display in-notebook
        print(f"[preview] step {i:02d} saved")

    # IMPORTANT: return the (possibly updated) tensors for the sampler to continue
    return {"latents": tokens}

image = pipe(
    prompt=prompt + positive_magic["en"],
    negative_prompt=" ",
    width=width,
    height=height,
    num_inference_steps=50,
    true_cfg_scale=4.0,
    generator=torch.Generator(device=device).manual_seed(42),
    callback_on_step_end=on_step_end,
    callback_on_step_end_tensor_inputs=["latents"],
).images[0]
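Step 3 of the callback (lat / stdinv + mean) looks odd at first, so here is a tiny pure-Python check that dividing by 1/std is the same as multiplying by std, which undoes a normalization of the form z = (x - mean) / std. The mean and std values are made up for illustration and are not the real VAE statistics; the function name unnormalize is mine.

```python
def unnormalize(z, mean, std):
    """Per-channel inverse of z = (x - mean) / std, written the way the callback does it."""
    std_inv = [1.0 / s for s in std]
    return [zi / si + m for zi, si, m in zip(z, std_inv, mean)]

mean = [0.5, -1.0]  # made-up per-channel means (illustrative only)
std = [2.0, 4.0]    # made-up per-channel standard deviations (illustrative only)

x = [1.5, 3.0]                                        # "true" latent values
z = [(xi - m) / s for xi, m, s in zip(x, mean, std)]  # normalized form
restored = unnormalize(z, mean, std)

print(z, restored)
```

The restored values match x, which is why the previews come out with sensible colors once this step is applied before decoding.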

I joined the resulting 50 WebP files into a video (ffmpeg -framerate 12 -i step_%03d.webp -c:v libx264 -pix_fmt yuv420p qwen_steps.mp4).

Next is the result of varying num_inference_steps from 1 to 25. This time I made it an animated WebP.
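A sweep like this can be sketched as a small loop. This is only a sketch under assumptions: generate is a hypothetical wrapper that you supply around the pipe(...) call, and the file names example001.webp, example002.webp, ... are my guess at a numbering that fits the example%03d.webp pattern, not necessarily what was actually used.

```python
from typing import Any, Callable, List

def sweep_steps(generate: Callable[[int], Any], max_steps: int = 25) -> List[str]:
    """Render one image per step count and save each as exampleNNN.webp."""
    paths = []
    for steps in range(1, max_steps + 1):
        image = generate(steps)             # hypothetical: returns an object with .save(path)
        path = f"example{steps:03d}.webp"   # assumed naming: example001.webp, example002.webp, ...
        image.save(path)
        paths.append(path)
    return paths
```

With the pipeline loaded, generate could be something like lambda n: pipe(prompt=prompt + positive_magic["en"], negative_prompt=" ", width=640, height=640, num_inference_steps=n, true_cfg_scale=4.0, generator=torch.Generator(device=device).manual_seed(42)).images[0]. Creating a fresh generator with the same seed on every call keeps the starting noise identical, so only the step count changes between frames.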

How I made that last animated WebP:

ffmpeg -framerate 5 -i example%03d.webp \
  -vf "drawtext=: \
       text='%{eif\:n+1\:d\:02}': \
       x=10:y=10:fontsize=24:fontcolor=white:box=1:boxcolor=black@0.5" \
  -c:v libwebp -lossless 0 -q:v 80 -loop 0 output.webp