Alibaba/happyhorse-1.0
From $0.115800 / call
Alibaba ATH Innovation Unit's 15B video model, the first with native joint audio-video generation (dialogue, ambient sound, and lip-sync). Ranked #1 on the Artificial Analysis Video Arena and built for short-form, ads, and dialogue-driven video.
Text to Video · Image to Video
README
Key Capabilities
- Native Joint A/V Generation: Video frames and audio tokens are denoised together in one forward pass, producing dialogue, ambience, and Foley aligned to visuals without any post-production dub or lip-sync step.
- Multilingual Lip-Sync: Native support for English, Mandarin, Cantonese, Japanese, Korean, German, and French with ultra-low Word Error Rate for natural, accurate mouth motion.
- Unified T2V / I2V / Ref-to-Video Pipeline: A single model handles text-to-video, image-to-video, reference-to-video, and edit modes — no model switching required.
- 1080p Cinematic Quality: 720p–1080p output with strong facial detail, texture fidelity, and natural motion — well-suited to short-form drama and ads.
- Blazing Inference Speed: DMD-2 distillation (8 steps) plus MagiCompiler delivers ~2s for a 5-second 256p clip and ~38s for 1080p on a single H100, among the fastest in its quality tier.
- 3–15 Second Short Clips: Aligned with short-form platform durations; particularly strong on single-character dialogue scenes.
- SOTA Blind-Preference Rankings: Artificial Analysis Video Arena Elo of 1333 (T2V) and 1392 (I2V) — #1 on both, based on blind votes from real users.
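To put an Elo lead into more intuitive terms, the sketch below converts a rating gap into an expected head-to-head win rate using the standard Elo logistic curve. Whether the Artificial Analysis Video Arena uses exactly this formula is an assumption; the 107-point gap is the lead quoted under Technical Strengths, and the function name `elo_expected_score` is illustrative only.

```python
def elo_expected_score(rating_gap: float) -> float:
    """Expected win probability for the higher-rated model.

    Uses the standard Elo logistic curve with a 400-point scale.
    Assumption: the arena's ratings follow this convention.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / 400.0))


# 107 is the lead quoted in this README; 0 and 50 are shown for comparison.
for gap in (0, 50, 107):
    rate = elo_expected_score(gap)
    print(f"gap = {gap:>3} Elo -> expected win rate {rate:.1%}")
```

Under this assumption, a 107-point lead corresponds to the higher-rated model being preferred in roughly two out of three blind comparisons against the runner-up.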
Technical Strengths
| Feature | Benefit |
|---|---|
| Unified 40-Layer Self-Attention Transformer | One sequence, one forward pass for all modalities — minimal architecture, strong extensibility, efficient inference |
| Sandwich Modality Layout | Modality-specific projections at the ends with 32 shared middle layers — efficient and well-aligned cross-modal reasoning |
| No Cross-Attention | A/V alignment is learned inside denoising rather than fixed post-hoc, eliminating the sync drift inherent to dub-then-lipsync pipelines |
| DMD-2 Distillation + No CFG | Just 8 denoising steps for high-quality output — 3-4× faster than typical 30+ step diffusion models |
| MagiCompiler Acceleration | Compiler-level inference optimization pushes 1080p generation down to ~38 seconds on a single H100 |
| Independent Blind Leaderboard Win | Leads the Artificial Analysis Video Arena by 107 Elo points — not self-reported, validated by real-user blind votes |
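The layer counts and step counts in the table can be checked with a little arithmetic. The following is a minimal sketch, not the model's actual implementation: it assumes the 8 layers outside the 32 shared ones are modality-specific and split evenly between the two ends of the stack, and it takes 30 steps as the baseline for a "typical" diffusion model. The class name `SandwichLayout` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SandwichLayout:
    total_layers: int = 40   # from the table
    shared_layers: int = 32  # from the table

    @property
    def layers_per_end(self) -> int:
        # Assumption: the non-shared layers are split evenly between both ends.
        return (self.total_layers - self.shared_layers) // 2

    def role(self, index: int) -> str:
        """Classify a zero-based layer index as shared or modality-specific."""
        if not 0 <= index < self.total_layers:
            raise IndexError(f"layer index {index} out of range")
        edge = self.layers_per_end
        if index < edge or index >= self.total_layers - edge:
            return "modality-specific"
        return "shared"


layout = SandwichLayout()
roles = [layout.role(i) for i in range(layout.total_layers)]
print(f"{layout.layers_per_end} modality-specific layers at each end, "
      f"{roles.count('shared')} shared in the middle")

# Step-count ratio: 8 distilled steps versus an assumed 30-step baseline.
baseline_steps, distilled_steps = 30, 8
speedup = baseline_steps / distilled_steps
print(f"step reduction: {speedup:.2f}x")
```

The step ratio alone (30 / 8) falls inside the quoted 3-4× range; skipping classifier-free guidance would reduce per-step cost further, which this sketch does not model.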
Use Cases
- Short-Form Content Creation: 3–15s native-audio clips plug directly into TikTok, Reels, Douyin, and Shorts pipelines without external audio tooling.
- Dialogue-Driven Short Drama: Single-character dialogue with multilingual lip-sync outperforms "silent video + post-dub" pipelines for vlogs, short dramas, and TV commercials (TVC).
- Advertising & Marketing Video: 1080p output in ~38s dramatically shortens A/B testing and creative iteration cycles for ad teams.
- Product Walkthroughs & Character Voiceover: Upload a product image or character art plus a script to produce voiced, lip-synced explainer videos in one pass.
- Multilingual Localization: Native lip-sync across seven languages lets a single script ship in multiple locales without re-modeling.
- Creative Iteration & Storyboarding: Fast 1080p generation (~38s on a single H100) lets directors and producers validate shot ideas and pacing early in pre-production.
- AI Film & Game NPC Assets: Native A/V output serves as a content production pipeline for AI films, interactive drama, and virtual character dialogue.
Pricing
| Resolution | LinkAI Price (USD / call) | Official Price (USD / call) |
|---|---|---|
| 1080P | 0.198400 | 0.264500 |
| 720P | 0.115800 | 0.154400 |
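For budgeting, the table can be turned into a small cost estimator. This is a sketch under the assumption that both columns are priced in USD per call (the listing's "From $0.115800 / call" matches the 720P LinkAI price); the names `PRICES`, `batch_cost`, and `discount` are illustrative and not part of any API.

```python
# Prices copied from the table above; assumed to be USD per call.
PRICES = {
    "1080P": {"linkai": 0.198400, "official": 0.264500},
    "720P": {"linkai": 0.115800, "official": 0.154400},
}


def batch_cost(resolution: str, calls: int, provider: str = "linkai") -> float:
    """Total cost of `calls` generations at the given resolution."""
    return PRICES[resolution][provider] * calls


def discount(resolution: str) -> float:
    """Fractional saving of the LinkAI price relative to the official price."""
    row = PRICES[resolution]
    return 1.0 - row["linkai"] / row["official"]


for res in PRICES:
    print(f"{res}: 100 calls cost ${batch_cost(res, 100):.2f} via LinkAI "
          f"({discount(res):.1%} below the official price)")
```

At both resolutions the LinkAI price works out to about 75% of the official price.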