Alibaba/happyhorse-1.0

From $0.115800 / call

A 15B-parameter video model from Alibaba's ATH Innovation Unit, described as the first with native joint audio-video generation (dialogue, ambient sound, and lip-sync). It ranks #1 on the Artificial Analysis Video Arena and is built for short-form, advertising, and dialogue-driven video.

Text to Video, Image to Video

Key Capabilities

  • Native Joint A/V Generation: Video frames and audio tokens are denoised together in one forward pass, producing dialogue, ambience, and Foley aligned to visuals without any post-production dub or lip-sync step.
  • Multilingual Lip-Sync: Native support for English, Mandarin, Cantonese, Japanese, Korean, German, and French, with a low word error rate (WER) in generated speech and natural, accurate mouth motion.
  • Unified T2V / I2V / Ref-to-Video Pipeline: A single model handles text-to-video, image-to-video, reference-to-video, and edit modes — no model switching required.
  • 1080p Cinematic Quality: 720p–1080p output with strong facial detail, texture fidelity, and natural motion — well-suited to short-form drama and ads.
  • Blazing Inference Speed: DMD-2 distillation (8 steps) plus MagiCompiler delivers ~2s for a 5-second 256p clip and ~38s for 1080p on H100 — among the fastest in its quality tier.
  • 3–15 Second Short Clips: Aligned with short-form platform durations; particularly strong on single-character dialogue scenes.
  • SOTA Blind-Preference Rankings: Artificial Analysis Video Arena Elo of 1333 (T2V) and 1392 (I2V) — #1 on both, based on blind votes from real users.
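
The limits listed above (four modes, 720p–1080p output, 3–15 second clips, seven lip-sync languages) can be checked before a job is submitted. The following is a minimal sketch only: `ClipRequest` and its field names are hypothetical and do not reflect the provider's actual request schema.

```python
from dataclasses import dataclass

# Limits copied from the capability list; all identifiers here are hypothetical.
SUPPORTED_MODES = {"text-to-video", "image-to-video", "reference-to-video", "edit"}
SUPPORTED_RESOLUTIONS = {"720p", "1080p"}
LIP_SYNC_LANGUAGES = {
    "English", "Mandarin", "Cantonese", "Japanese", "Korean", "German", "French",
}
MIN_SECONDS, MAX_SECONDS = 3, 15


@dataclass
class ClipRequest:
    """Hypothetical job description, not the real API payload."""
    mode: str
    prompt: str
    resolution: str = "720p"
    duration_s: int = 5
    language: str = "English"

    def problems(self) -> list:
        """Return the reasons this request falls outside the documented limits."""
        issues = []
        if self.mode not in SUPPORTED_MODES:
            issues.append(f"unsupported mode: {self.mode}")
        if self.resolution not in SUPPORTED_RESOLUTIONS:
            issues.append(f"unsupported resolution: {self.resolution}")
        if not MIN_SECONDS <= self.duration_s <= MAX_SECONDS:
            issues.append(f"duration must be {MIN_SECONDS}-{MAX_SECONDS}s")
        if self.language not in LIP_SYNC_LANGUAGES:
            issues.append(f"no native lip-sync for: {self.language}")
        return issues


request = ClipRequest(mode="image-to-video", prompt="A barista greets a customer",
                      resolution="1080p", duration_s=8, language="Cantonese")
print(request.problems())  # an empty list means the request is within the limits
```

Returning a list of problems rather than raising on the first one lets a caller report every out-of-range field at once.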

Technical Strengths

Feature | Benefit
--- | ---
Unified 40-Layer Self-Attention Transformer | One sequence, one forward pass for all modalities: minimal architecture, strong extensibility, efficient inference
Sandwich Modality Layout | Modality-specific projections at the ends with 32 shared middle layers, giving efficient and well-aligned cross-modal reasoning
No Cross-Attention | A/V alignment is learned inside denoising rather than fixed post-hoc, eliminating the sync drift inherent to dub-then-lipsync pipelines
DMD-2 Distillation + No CFG | Just 8 denoising steps for high-quality output, roughly 3–4× faster than typical 30+ step diffusion models
MagiCompiler Acceleration | Compiler-level inference optimization pushes 1080p generation down to ~38 seconds on a single H100
Independent Blind Leaderboard Win | Leads the Artificial Analysis Video Arena by 107 Elo points; the result is not self-reported but validated by real-user blind votes
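
The figures in this table can be sanity-checked with some back-of-envelope arithmetic. The sketch below rests on three assumptions that the listing does not confirm: the 8 non-shared layers are split evenly between the two ends of the sandwich, denoising cost scales linearly with step count, and the arena uses the standard logistic Elo formula with a 400-point scale.

```python
def sandwich_split(total_layers: int, shared_layers: int) -> tuple:
    """Split the non-shared layers evenly between both ends (an assumption)."""
    outer = total_layers - shared_layers
    if outer < 0 or outer % 2:
        raise ValueError("non-shared layers cannot be split evenly")
    return outer // 2, shared_layers, outer // 2


def step_speedup(baseline_steps: int, distilled_steps: int) -> float:
    """Speedup from fewer denoising steps, assuming cost is linear in steps."""
    return baseline_steps / distilled_steps


def elo_expected_score(rating_gap: float) -> float:
    """Expected head-to-head score for the higher-rated model (standard Elo)."""
    return 1.0 / (1.0 + 10 ** (-rating_gap / 400))


print(sandwich_split(40, 32))             # (4, 32, 4)
print(step_speedup(30, 8))                # 3.75
print(round(elo_expected_score(107), 3))  # 0.649
```

Under these assumptions, 30 steps against 8 gives the quoted 3–4× range, and a 107-point lead corresponds to winning roughly 65% of blind pairwise comparisons against the runner-up.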

Use Cases

  • Short-Form Content Creation: 3–15s native-audio clips plug directly into TikTok, Reels, Douyin, and Shorts pipelines without external audio tooling.
  • Dialogue-Driven Short Drama: Single-character dialogue with multilingual lip-sync avoids the sync drift of "silent video + post-dub" pipelines for vlogs, short dramas, and TV commercials (TVC).
  • Advertising & Marketing Video: 1080p output in ~38s dramatically shortens A/B testing and creative iteration cycles for ad teams.
  • Product Walkthroughs & Character Voiceover: Upload a product image or character art plus a script to produce voiced, lip-synced explainer videos in one pass.
  • Multilingual Localization: Native lip-sync across seven languages lets a single script ship in multiple locales without re-modeling.
  • Creative Iteration & Storyboarding: Fast 1080p generation (~38s per clip on an H100) lets directors and producers validate shot ideas and pacing early in pre-production.
  • AI Film & Game NPC Assets: Native A/V output serves as a content production pipeline for AI films, interactive drama, and virtual character dialogue.
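
To see what the ~38-second 1080p figure means for iteration cycles, the following rough sketch estimates throughput. It assumes every clip takes the quoted ~38 seconds on one H100 and ignores queueing, upload, and download time, so real turnaround will be longer. The function names are illustrative.

```python
import math

SECONDS_PER_1080P_CLIP = 38  # approximate figure quoted for a single H100


def clips_per_hour(seconds_per_clip: float, gpus: int = 1) -> int:
    """Whole clips each GPU can finish in an hour, times the number of GPUs."""
    return int(3600 // seconds_per_clip) * gpus


def wall_clock_minutes(variants: int, seconds_per_clip: float, gpus: int = 1) -> float:
    """Minutes to render all variants when GPUs work through them in batches."""
    batches = math.ceil(variants / gpus)
    return batches * seconds_per_clip / 60


print(clips_per_hour(SECONDS_PER_1080P_CLIP))                        # 94
print(round(wall_clock_minutes(20, SECONDS_PER_1080P_CLIP), 1))      # 12.7
print(round(wall_clock_minutes(20, SECONDS_PER_1080P_CLIP, gpus=4), 1))  # 3.2
```

On these assumptions, a 20-variant ad test renders in under 13 minutes on one GPU.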

Pricing

Resolution | LinkAI Price (USD / call) | Official Price (USD / call)
--- | --- | ---
1080P | 0.198400 | 0.264500
720P | 0.115800 | 0.154400
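
A short sketch for budgeting against these prices follows. It assumes the figures are US dollars per call, as the "From $0.115800 / call" line suggests, and that each call is billed at a flat rate regardless of clip length. The dictionary layout and function names are illustrative.

```python
from decimal import Decimal

# Prices copied from the table above; assumed to be USD per call.
PRICES = {
    "720P": {"linkai": Decimal("0.115800"), "official": Decimal("0.154400")},
    "1080P": {"linkai": Decimal("0.198400"), "official": Decimal("0.264500")},
}


def batch_cost(resolution: str, calls: int, tier: str = "linkai") -> Decimal:
    """Total cost of a number of calls at one resolution and price tier."""
    return PRICES[resolution][tier] * calls


def discount_percent(resolution: str) -> Decimal:
    """LinkAI discount relative to the official price, in percent."""
    row = PRICES[resolution]
    return (1 - row["linkai"] / row["official"]) * 100


print(batch_cost("720P", 100))                 # 11.580000
print(batch_cost("1080P", 100))                # 19.840000
print(round(discount_percent("720P"), 2))      # 25.00
print(round(discount_percent("1080P"), 2))     # 24.99
```

`Decimal` is used instead of `float` so that six-decimal prices multiply without binary rounding error. Both resolutions come out at about 25% below the official price.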