Image ComfyUI
Stable Cascade Stability AI · 5.9B · 2024
3-stage cascade (Würstchen architecture). Strong prompt adherence, lighter to run than its 5.9B parameter count suggests.
1024×1024 13 GB disk 16 GB RAM ✓ Stability AI Non-Commercial
Image ComfyUI
Z-Image Turbo Tongyi Lab · 6B · 2025
Distilled few-step model. FP16 fits comfortably in 16GB. Apache-licensed.
1024×1024 fast 14 GB disk 16 GB RAM ✓ Apache 2.0
Image ComfyUI
Z-Image Base Tongyi Lab · 6B · 2025
Full (non-distilled) Z-Image. More steps than Z-Image Turbo but higher fidelity. Apache-licensed.
1024×1024 16 GB disk 16 GB RAM ✓ Apache 2.0
Image ComfyUI ★ Most popular community model
Stable Diffusion XL Stability AI · 3.5B · 2023
Q4 GGUF · ~3.5 GB VRAM
The workhorse. Sharp 1024px output, huge fine-tune ecosystem (Pony, Illustrious, Juggernaut).
1024×1024 7 GB disk 16 GB RAM ✓ CreativeML Open RAIL++-M
Image ComfyUI
SDXL Turbo Stability AI · 3.5B · 2023
1-step distilled SDXL. Generates in under a second on midrange GPUs. Lower fidelity than full SDXL.
512×512 real-time 7 GB disk 16 GB RAM ✓ Stability AI Non-Commercial
Edit ComfyUI
OmniGen 2 BAAI · 3.8B · 2025
Unified text-to-image and image-editing model. One model handles generate, edit, compose.
1024×1024 11 GB disk 16 GB RAM ✓ MIT
Video ComfyUI
CogVideoX-5B THUDM (Tsinghua) · 5B · 2024
Q4 GGUF · ~5.5 GB VRAM
5B T2V/I2V from Tsinghua. Mid-range hardware target. 6-second clips at 720p.
720p · 6s 16 GB disk 16 GB RAM ✓ CogVideoX (open)
Image ComfyUI
FLUX.2 [klein] 4B Black Forest Labs · 4B · 2025
Q4 GGUF · ~3.0 GB VRAM
The smaller Apache-licensed FLUX.2. 4B params at 1024px — fits comfortably on midrange GPUs and high-end phones.
1024×1024 fast 12 GB disk 16 GB RAM ✓ Apache 2.0
Image 🤗 Hugging Face ★ Distilled FLUX.2 for consumers
FLUX.2 Klein Black Forest Labs · 4B · 2026
4B model distilled from the 32B FLUX.2. Real-time generation on consumer GPUs at FLUX-grade fidelity.
1024×1024 fast 11 GB disk 16 GB RAM ✓ FLUX.2 Klein (Apache-style)
Music 🤗 Hugging Face ★ Top open music model
ACE-Step 1.5 ACE-Step · 3.5B · 2025
Best open full-song generator with vocals. Generates 4-minute tracks in ~20 s on a 4090.
Up to 4 min · stereo fast 10 GB disk 16 GB RAM ✓ Apache 2.0
Music 🤗 Hugging Face
MusicGen Large Meta · 3.3B · 2023
Meta's flagship MusicGen. Strong instrumental tracks. Non-commercial only.
32 kHz · 30s 8.5 GB disk 16 GB RAM ✓ CC BY-NC 4.0 (non-commercial)
Music 🤗 Hugging Face
MusicGen Stereo Large Meta · 3.3B · 2024
Stereo variant of MusicGen Large. True L/R channels for a richer mix.
32 kHz · 30s 8.5 GB disk 16 GB RAM ✓ CC BY-NC 4.0 (non-commercial)
Image ComfyUI
Stable Diffusion 3 Medium Stability AI · 2B · 2024
MMDiT architecture with triple text encoders. Better text rendering than SDXL.
1024×1024 11 GB disk 16 GB RAM ✓ Stability AI Community
Image ComfyUI
Stable Diffusion 3.5 Medium Stability AI · 2.5B · 2024
Improved 3.5 generation with better composition and aesthetics than SD3 Medium.
1024×1024 12.5 GB disk 16 GB RAM ✓ Stability AI Community
Image ComfyUI
Lumina Image 2.0 Alpha-VLLM · 2.6B · 2025
Compact next-gen DiT with strong photographic realism. Apache-licensed for commercial use.
1024×1024 12 GB disk 16 GB RAM ✓ Apache 2.0
Video ComfyUI
LTX-Video Lightricks · 2B · 2024
Realtime-capable video model. Generates faster than playback on a 4090. The speed champion.
768×512 real-time 11 GB disk 16 GB RAM ✓ Lightricks Open License
Image 🤗 Hugging Face
Kolors Kuaishou · 2.6B · 2024
Strong photographic realism, multilingual prompts (Chinese + English). Mid-weight footprint.
1024×1024 12 GB disk 16 GB RAM ✓ Kolors (open)
TTS 🤗 Hugging Face
Parler-TTS Large Hugging Face + Parler · 2.3B · 2024
Larger Parler with broader voice and prosody range. Slower, higher fidelity.
44.1 kHz · 30s 7 GB disk 16 GB RAM ✓ Apache 2.0
TTS 🤗 Hugging Face
Orpheus 3B CanopyAI · 3B · 2025
Q4 GGUF · ~4.5 GB VRAM
LLM-style TTS with rich emotional control. Streaming output via vLLM.
24 kHz · streaming 8 GB disk 16 GB RAM ✓ Apache 2.0
Image ComfyUI
Hunyuan-DiT Tencent · 1.5B · 2024
Bilingual (Chinese + English) DiT. Strong at Chinese text rendering, light to run.
1024×1024 11 GB disk 16 GB RAM ✓ Tencent Hunyuan Community
Video ComfyUI
Stable Video Diffusion Stability AI · 1.5B · 2023
Image-to-video, ~14–25 frames at 576×1024. The original consumer-friendly local video model.
576×1024 · 25 frames 10 GB disk 16 GB RAM ✓ Stability AI Non-Commercial
Video ComfyUI
Wan 2.1 T2V 1.3B Alibaba · 1.3B · 2025
Tiny T2V. The most compatible video model — fits on 8GB GPUs. Apache-licensed.
480p 9 GB disk 16 GB RAM ✓ Apache 2.0
Video ComfyUI
SkyReels V2 1.3B Skywork AI · 1.3B · 2025
Lightweight successor with infinite-length generation. Fits on 8 GB GPUs.
540p 10 GB disk 16 GB RAM ✓ Apache 2.0
Image 🤗 Hugging Face
SANA 1.5 NVIDIA Labs · 1.6B · 2025
Linear DiT that generates up to 4K images on as little as 8 GB. Tiny weights, massive output.
Up to 4K fast 9 GB disk 16 GB RAM ✓ NVIDIA Source Code (Non-Commercial)
TTS 🤗 Hugging Face
Bark Suno · 0.9B · 2023
Low-VRAM · ~2.5 GB VRAM
Generates speech, laughter, sighs, music — all from a text prompt. Quirky but expressive.
24 kHz · 14s 5 GB disk 16 GB RAM ✓ MIT
TTS 🤗 Hugging Face
Parler-TTS Mini Hugging Face + Parler · 0.88B · 2024
Prompt-controlled TTS — describe the voice in natural language ('young woman, warm, slow').
44.1 kHz · 30s 3 GB disk 8 GB RAM ✓ Apache 2.0
TTS 🤗 Hugging Face
Sesame CSM 1B Sesame · 1B · 2025
Conversational speech model — context-aware prosody for dialogue agents.
24 kHz · streaming 5 GB disk 16 GB RAM ✓ Apache 2.0
TTS 🤗 Hugging Face
Fish Speech 1.5 Fish Audio · 1.4B · 2024
Multilingual zero-shot voice cloning. Strong on Chinese, Japanese, Korean.
44.1 kHz · 30s 6.5 GB disk 16 GB RAM ✓ CC BY-NC-SA 4.0
Music 🤗 Hugging Face
MusicGen Medium Meta · 1.5B · 2023
Mid-tier MusicGen. Faster than Large, slightly less rich.
32 kHz · 30s 4.5 GB disk 8 GB RAM ✓ CC BY-NC 4.0 (non-commercial)
Music 🤗 Hugging Face
Stable Audio Open 1.0 Stability AI · 1.21B · 2024
Sound effects, ambient textures, short instrumental loops up to 47 s.
44.1 kHz · 47s 4 GB disk 16 GB RAM ✓ Stability AI Community
Music 🤗 Hugging Face
Stable Audio Open 1.5 Stability AI · 1.5B · 2025
Refined sound design model. Cleaner ambient textures, tighter SFX.
44.1 kHz · 47s 5 GB disk 16 GB RAM ✓ Stability AI Community
Music 🤗 Hugging Face
DiffRhythm ASLP-lab · 1.5B · 2025
Latent-diffusion full-song generator. End-to-end. 8 GB minimum with chunked inference.
44.1 kHz · 4 min 6 GB disk 16 GB RAM ✓ Apache 2.0
Image ComfyUI
Stable Diffusion 1.5 Stability AI / RunwayML · 0.86B · 2022
The classic. Tiny, fast, runs on almost anything. Massive ecosystem of LoRAs and fine-tunes.
512×512 very fast 4.5 GB disk 8 GB RAM ✓ CreativeML Open RAIL-M
Image ComfyUI
Stable Diffusion 2.1 Stability AI · 0.86B · 2022
Successor to SD 1.5 with native 768px training. Smaller community than 1.5 but still light on hardware.
768×768 very fast 5.5 GB disk 8 GB RAM ✓ CreativeML Open RAIL++-M
Image ComfyUI
PixArt-Σ PixArt-α team · 0.6B · 2024
Featherweight 0.6B DiT with T5-XXL encoder. Beautiful 4K output for the param count.
Up to 4K 11.5 GB disk 16 GB RAM ✓ OpenRAIL++
TTS 🤗 Hugging Face
Chatterbox Turbo Resemble AI · 0.35B · 2025
350M model with a 1-step distilled decoder. Sub-200 ms latency for production agents.
44.1 kHz · streaming real-time 2 GB disk 8 GB RAM ✓ MIT
TTS 🤗 Hugging Face
F5-TTS SWivid · 0.33B · 2024
Flow-matching TTS with strong voice cloning from a 6-second reference clip.
24 kHz · 30s 2.5 GB disk 16 GB RAM ✓ CC BY-NC 4.0
TTS 🤗 Hugging Face
XTTS v2 Coqui · 0.47B · 2023
Multi-language voice cloning from 6 seconds of reference audio. 17 languages.
24 kHz · streaming 2.5 GB disk 8 GB RAM ✓ Coqui Public Model License
TTS 🤗 Hugging Face
Spark-TTS 0.5B SparkAudio · 0.5B · 2025
Tiny LLM-driven zero-shot voice cloning. 4 GB VRAM, runs comfortably on most GPUs.
24 kHz · streaming 3 GB disk 8 GB RAM ✓ Apache 2.0
TTS 🤗 Hugging Face
Kani-TTS 2 Kani · 0.4B · 2025
400M streaming TTS on a Liquid LFM2 backbone with NVIDIA NanoCodec. 3 GB VRAM end-to-end.
24 kHz · streaming real-time 2 GB disk 8 GB RAM ✓ Apache 2.0
Music 🤗 Hugging Face
MusicGen Small Meta · 0.3B · 2023
Tiny MusicGen for fast prototyping. Runs on a 4 GB GPU.
32 kHz · 30s fast 1 GB disk 8 GB RAM ✓ CC BY-NC 4.0 (non-commercial)
Music 🤗 Hugging Face
Magenta RealTime Google Magenta · 0.8B · 2025
Live, prompt-steerable instrumental generation. Optimized for streaming.
44.1 kHz · streaming real-time 4 GB disk 8 GB RAM ✓ Apache 2.0
Image ComfyUI
Stable Diffusion 3.5 Large Stability AI · 8B · 2024
Q4 GGUF · ~8.0 GB VRAM
Stability's flagship 8B MMDiT. Sharp text, strong composition. Triple text encoder (CLIP-L, CLIP-G, T5-XXL).
1024×1024 20 GB disk RAM tight · −15 Stability AI Community
Image ComfyUI
AuraFlow v0.3 Fal.ai · 6.8B · 2024
Truly open-source flow-matching model. The largest fully Apache-licensed image model.
1024×1024 16 GB disk RAM tight · −15 Apache 2.0
Image ComfyUI
FLUX.1 schnell Black Forest Labs · 12B · 2024
Q4 GGUF · ~7.0 GB VRAM
4-step distilled FLUX. The fastest way to get FLUX-tier quality. Apache-licensed.
1024×1024 fast 33 GB disk RAM tight · −15 Apache 2.0
Image ComfyUI ★ Flagship FLUX
FLUX.1 dev Black Forest Labs · 12B · 2024
Q4 GGUF · ~7.0 GB VRAM
The model to beat. State-of-the-art prompt adherence and photorealism. Hungry but worth it.
1024×1024 33 GB disk RAM tight · −15 FLUX.1 Non-Commercial
Edit ComfyUI
FLUX.1 Kontext dev Black Forest Labs · 12B · 2025
Q4 GGUF · ~7.5 GB VRAM
Image editing variant of FLUX. Reference image + prompt → edited result.
1024×1024 33 GB disk RAM tight · −15 FLUX.1 Non-Commercial
Image ComfyUI
HiDream-I1 HiDream-ai · 17B (8.5B active) · 2025
Q4 GGUF · ~11.0 GB VRAM
Hybrid DiT + MoE. Beats Flux on several benchmarks. MIT-licensed for any use.
1024×1024 38 GB disk RAM tight · −15 MIT
Image ComfyUI
Qwen-Image Alibaba · 20B · 2025
Q4 GGUF · ~14.0 GB VRAM
20B MMDiT from Alibaba. Best-in-class text rendering — handles paragraphs of text in images.
1328×1328 45 GB disk RAM low · −25 Apache 2.0
Edit ComfyUI
Qwen-Image-Edit Alibaba · 20B · 2025
Q4 GGUF · ~14.0 GB VRAM
Edit variant of Qwen-Image. Replace, add, restyle objects via natural prompts.
1328×1328 45 GB disk RAM low · −25 Apache 2.0
Image ComfyUI
Hunyuan Image 2.1 Tencent · 17B · 2025
Q4 GGUF · ~11.0 GB VRAM
Tencent's 17B image flagship. Strong at composition and Chinese text.
1024×1024 36 GB disk RAM tight · −15 Tencent Hunyuan Community
Image ComfyUI ★ 84B MoE giant
Hunyuan Image 3 Tencent · 84B (13B active) · 2025
Q4 GGUF · ~32.0 GB VRAM
Massive 84B MoE image model with 13B active. Quality rivals closed-source flagships. Needs serious hardware.
1024×1024 180 GB disk RAM low · −25 Tencent Hunyuan Community
Image ComfyUI ★ Next-gen FLUX
FLUX.2 dev Black Forest Labs · 32B · 2025
Q4 GGUF · ~22.0 GB VRAM
Successor to FLUX.1 dev. Higher fidelity, longer context. Heavy hardware required.
1024×1024 72 GB disk RAM low · −25 FLUX.2 Non-Commercial
Image ComfyUI
ERNIE-Image Baidu · 10B · 2025
Baidu's 10B image model. Strong at Chinese prompts and stylized output.
1024×1024 22 GB disk RAM tight · −15 Baidu Community
Edit ComfyUI
HiDream-E1.1 HiDream-ai · 17B (8.5B active) · 2025
Q4 GGUF · ~11.0 GB VRAM
Editing variant of HiDream-I1. Same MoE backbone tuned for image editing.
1024×1024 36 GB disk RAM tight · −15 MIT
Video ComfyUI
Wan 2.1 T2V 14B Alibaba · 14B · 2025
Q4 GGUF · ~12.0 GB VRAM
Full-size Wan T2V. Strong motion, 720p output. Quantization makes it viable down to 12GB.
720p 30 GB disk RAM tight · −15 Apache 2.0
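As a rough illustration of why quantization matters for a 14B model, the sketch below estimates the size of the weights alone at a few precisions. The helper name `weight_footprint_gb` and the bits-per-weight values are illustrative assumptions, not figures from this listing. The estimate ignores the text encoder, VAE, activations, and quantization metadata, so a listed requirement such as the ~12 GB Q4 GGUF figure will typically be higher than the weights-only number.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes.

    Hypothetical helper for illustration: params (in billions) times bits per
    weight, divided by 8 bits per byte. It does not model runtime overhead.
    """
    if params_billions < 0 or bits_per_weight <= 0:
        raise ValueError("params must be >= 0 and bits_per_weight must be > 0")
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    # 14B parameters, as listed for Wan 2.1 T2V 14B; precisions are assumed.
    for label, bits in (("FP16", 16), ("FP8", 8), ("4-bit", 4)):
        print(f"{label}: ~{weight_footprint_gb(14, bits):.1f} GB of weights")
```

At FP16 the weights come to about 28 GB, which is in line with the 30 GB disk figure on this card. At 4 bits they drop to about 7 GB, leaving room on a 12 GB GPU for the other components.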
Video ComfyUI
Wan 2.1 I2V 14B Alibaba · 14B · 2025
Q4 GGUF · ~12.0 GB VRAM
Image-to-video flagship. Best open I2V motion in early 2025.
720p 30 GB disk RAM tight · −15 Apache 2.0
Video ComfyUI ★ Best open video < 8B
Wan 2.2 TI2V 5B Alibaba · 5B · 2025
Q4 GGUF · ~7.0 GB VRAM
Unified text + image to video. Best-in-class T2V/I2V at 5B. Fits on a 24GB GPU at FP16.
720p 13 GB disk RAM tight · −15 Apache 2.0
Video ComfyUI
Wan 2.2 T2V A14B (MoE) Alibaba · 27B (14B active) · 2025
Q4 GGUF · ~18.0 GB VRAM
MoE flagship — 27B total / 14B active. Top open-video quality.
720p 56 GB disk RAM low · −25 Apache 2.0
Video ComfyUI
LTX-2 Lightricks · 8B · 2026
Distilled FP8 · ~12.0 GB VRAM
Successor to LTX-Video. 4K-capable on high-end hardware. Distilled variants run on 12GB.
1080p–4K 28 GB disk RAM tight · −15 Lightricks Open License
Video ComfyUI
HunyuanVideo Tencent · 13B · 2024
Q4 GGUF · ~12.0 GB VRAM
13B T2V — closest open rival to Sora at release. Slow but cinematic.
720p 50 GB disk RAM tight · −15 Tencent Hunyuan Community
Video ComfyUI ★ Consumer-friendly Hunyuan
HunyuanVideo 1.5 Tencent · 8.3B · 2025
Q4 GGUF · ~9.0 GB VRAM
Lighter, sharper successor. 8.3B params, 14GB minimum, runs on consumer GPUs.
720p 22 GB disk RAM tight · −15 Tencent Hunyuan Community
Video ComfyUI
Mochi 1 Genmo · 10B · 2024
Q4 GGUF · ~9.0 GB VRAM
10B asymmetric DiT. Ambitious motion. Apache-licensed for any use.
480p–720p 22 GB disk RAM tight · −15 Apache 2.0
Video ComfyUI
Pyramid Flow Pyramid Flow team · 2B · 2024
Pyramidal flow-matching for efficient long videos. Strong T2V quality at 2B.
768p · 10s 11 GB disk RAM tight · −15 MIT
Image ComfyUI
Chroma 1 HD lodestones (community) · 8.9B · 2025
Q4 GGUF · ~5.5 GB VRAM
Community 8.9B model derived from FLUX.1-schnell, fully Apache 2.0. Strong photographic quality with no commercial restrictions.
1024×1024 24 GB disk RAM tight · −15 Apache 2.0
Image ComfyUI
FLUX.1 [klein] Black Forest Labs · 9B · 2025
Q4 GGUF · ~5.5 GB VRAM
Apache-licensed 9B distillation of FLUX.1. Smaller and lighter than schnell while keeping FLUX-tier prompt adherence.
1024×1024 fast 24 GB disk RAM tight · −15 Apache 2.0
Image ComfyUI
FLUX.2 [klein] 9B Black Forest Labs · 9B · 2025
Q4 GGUF · ~5.5 GB VRAM
The larger Apache-licensed FLUX.2. 9B params, sharper detail than the 4B variant.
1024×1024 24 GB disk RAM tight · −15 Apache 2.0
Video ComfyUI
SkyReels V1 Skywork AI · 13.8B · 2025
Q4 GGUF · ~12.0 GB VRAM
Human-centric video foundation model based on HunyuanVideo. Up to 12 s clips at 24 fps, 544×960. Strong at faces and motion.
544×960 · 12 s 50 GB disk RAM tight · −15 Apache 2.0
Video ComfyUI
SkyReels V2 14B Skywork AI · 14B · 2025
Q4 GGUF · ~12.0 GB VRAM
Flagship infinite-length video model derived from Wan 2.1 14B. Top-tier open-source motion at 720p.
720p 30 GB disk RAM tight · −15 Apache 2.0
Image 🤗 Hugging Face
Cosmos-Predict2 2B Text2Image NVIDIA · 2B · 2025
Physics-aware text-to-image — coherent geometry and lighting, tuned for sim/robotics.
1024×1024 22 GB disk RAM tight · −15 NVIDIA Open Model License
Video 🤗 Hugging Face
Cosmos-Predict2 2B Video2World NVIDIA · 2B · 2025
World-model T2V/I2V. Generates physics-consistent motion. Heavy VRAM appetite.
720p · 16fps 28 GB disk RAM tight · −15 NVIDIA Open Model License
TTS 🤗 Hugging Face ★ Best ultra-light TTS
Kokoro 82M hexgrad · 0.082B · 2025
FP32 CPU · ~0.0 GB VRAM
Featherweight TTS that punches way above its weight. Sub-300 ms inference, runs on CPU.
24 kHz · streaming real-time 1 GB disk 8 GB RAM ✓ Apache 2.0
Music 🤗 Hugging Face
YuE 7B Multimodal Art Projection · 7B · 2025
Q4 GGUF (GP) · ~6.0 GB VRAM
Suno-style full-song generation with synchronized lyrics + vocals. Up to 5-minute tracks.
44.1 kHz · up to 5 min 18 GB disk RAM tight · −15 Apache 2.0