happyhorse-1.0 - LinkModel

Native Joint A/V Generation: Video frames and audio tokens are denoised together in one forward pass, producing dialogue, ambience, and Foley aligned to visuals without any post-production dub or lip-sync step.
Multilingual Lip-Sync: Native support for English, Mandarin, Cantonese, Japanese, Korean, German, and French, with a low word error rate (WER) in the generated speech and natural, accurate mouth motion.
Unified T2V / I2V / Ref-to-Video Pipeline: A single model handles text-to-video, image-to-video, reference-to-video, and edit modes — no model switching required.
1080p Cinematic Quality: 720p–1080p output with strong facial detail, texture fidelity, and natural motion — well-suited to short-form drama and ads.
Blazing Inference Speed: DMD-2 distillation (8 steps) plus MagiCompiler delivers ~2s for a 5-second 256p clip and ~38s for 1080p on a single H100, among the fastest in its quality tier.
3–15 Second Short Clips: Aligned with short-form platform durations; particularly strong on single-character dialogue scenes.
SOTA Blind-Preference Rankings: Artificial Analysis Video Arena Elo of 1333 (T2V) and 1392 (I2V) — #1 on both, based on blind votes from real users.
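
The speed figures above are easier to compare after some quick arithmetic. The sketch below uses only numbers quoted on this page: 8 distilled steps, ~2s and ~38s generation times, and 30 steps as the low end of the "30+ step" range given for typical diffusion models. It assumes the 1080p timing also refers to a 5-second clip, which the text implies but does not state. The function names are illustrative, and the step ratio is only a rough proxy for wall-clock speedup.

```python
# Figures quoted on this page; all timings are approximate.
CLIP_SECONDS = 5.0                          # assumed to apply to both timings
GEN_SECONDS = {"256p": 2.0, "1080p": 38.0}  # single H100
DISTILLED_STEPS = 8
BASELINE_STEPS = 30                         # low end of the "30+ step" range


def real_time_factor(gen_seconds: float, clip_seconds: float) -> float:
    """Seconds of compute per second of output video (below 1.0 beats real time)."""
    return gen_seconds / clip_seconds


def step_reduction(baseline_steps: int, distilled_steps: int) -> float:
    """How many times fewer denoising steps the distilled model runs."""
    return baseline_steps / distilled_steps


for resolution, seconds in GEN_SECONDS.items():
    print(f"{resolution}: {real_time_factor(seconds, CLIP_SECONDS):.1f}x real time")
print(f"step reduction: {step_reduction(BASELINE_STEPS, DISTILLED_STEPS):.2f}x")
```

Under these assumptions the 256p setting runs faster than real time, while 1080p takes several times the clip's own duration. The 8-versus-30 step ratio falls inside the quoted 3–4× range.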

Feature	Benefit
Unified 40-Layer Self-Attention Transformer	One sequence, one forward pass for all modalities — minimal architecture, strong extensibility, efficient inference
Sandwich Modality Layout	Modality-specific projections at the ends with 32 shared middle layers — efficient and well-aligned cross-modal reasoning
No Cross-Attention	A/V alignment is learned inside denoising rather than fixed post-hoc, eliminating the sync drift inherent to dub-then-lipsync pipelines
DMD-2 Distillation + No CFG	Just 8 denoising steps for high-quality output, 3–4× faster than typical 30+ step diffusion models
MagiCompiler Acceleration	Compiler-level inference optimization pushes 1080p generation down to ~38 seconds on a single H100
Independent Blind Leaderboard Win	Leads the Artificial Analysis Video Arena by 107 Elo points — not self-reported, validated by real-user blind votes
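
The layer counts in the table imply a simple stack layout. The page gives only the totals (40 layers, 32 of them shared), so the minimal sketch below assumes the remaining 8 modality-specific layers are split evenly, 4 at each end. That even split and the function name are assumptions for illustration, not a description of the actual implementation.

```python
def sandwich_layout(total_layers: int = 40, shared_layers: int = 32) -> list[str]:
    """Label each transformer layer as modality-specific or shared.

    Assumes the modality-specific layers are split evenly between the two
    ends of the stack; the source does not state the exact split.
    """
    specific = total_layers - shared_layers
    if specific < 0 or specific % 2 != 0:
        raise ValueError("expected an even, non-negative number of modality-specific layers")
    per_end = specific // 2
    return (
        ["modality-specific"] * per_end
        + ["shared"] * shared_layers
        + ["modality-specific"] * per_end
    )


layout = sandwich_layout()
print(len(layout), layout.count("shared"), layout[:5])
```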

Short-Form Content Creation: 3–15s native-audio clips plug directly into TikTok, Reels, Douyin, and Shorts pipelines without external audio tooling.
Dialogue-Driven Short Drama: Single-character dialogue with multilingual lip-sync outperforms "silent video + post-dub" pipelines for vlogs, short dramas, and TV commercials (TVCs).
Advertising & Marketing Video: 1080p output in ~38s dramatically shortens A/B testing and creative iteration cycles for ad teams.
Product Walkthroughs & Character Voiceover: Upload a product image or character art plus a script to produce voiced, lip-synced explainer videos in one pass.
Multilingual Localization: Native lip-sync across seven languages lets a single script ship in multiple locales without re-modeling.
Creative Iteration & Storyboarding: 1080p generation in ~38 seconds lets directors and producers validate shot ideas and pacing early in pre-production.
AI Film & Game NPC Assets: Native A/V output serves as a content production pipeline for AI films, interactive drama, and virtual character dialogue.
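
The limits stated on this page (four modes, 720p or 1080p output, 3–15 second clips, seven lip-sync languages) can be checked before a job is submitted. The sketch below is purely illustrative: `ClipRequest`, its field names, and the mode strings are hypothetical and do not correspond to any published API. The rule that image-to-video and reference-to-video need an input image is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Values taken from the capability list on this page.
MODES = {"t2v", "i2v", "ref2v", "edit"}
RESOLUTIONS = {"720p", "1080p"}
LANGUAGES = {"English", "Mandarin", "Cantonese", "Japanese", "Korean", "German", "French"}
MIN_SECONDS, MAX_SECONDS = 3, 15


@dataclass
class ClipRequest:  # hypothetical type, not part of any published API
    mode: str
    prompt: str
    resolution: str = "1080p"
    seconds: int = 5
    language: str = "English"
    image_path: Optional[str] = None  # assumed input for image-conditioned modes

    def problems(self) -> list[str]:
        """Return the reasons this request falls outside the stated limits."""
        found = []
        if self.mode not in MODES:
            found.append(f"unknown mode: {self.mode}")
        if self.resolution not in RESOLUTIONS:
            found.append(f"unsupported resolution: {self.resolution}")
        if not MIN_SECONDS <= self.seconds <= MAX_SECONDS:
            found.append(f"duration must be {MIN_SECONDS}-{MAX_SECONDS} seconds")
        if self.language not in LANGUAGES:
            found.append(f"no native lip-sync for: {self.language}")
        if self.mode in {"i2v", "ref2v"} and self.image_path is None:
            found.append("image-conditioned modes need an image")
        return found


# A product walkthrough request that breaks three of the stated limits.
request = ClipRequest(mode="i2v", prompt="Presenter introduces the product",
                      seconds=20, language="Spanish")
for problem in request.problems():
    print(problem)
```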

Resolution	LinkAI Price	Official Price
1080P	0.198400	0.264500
720P	0.115800	0.154400
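
The relative saving in the table can be worked out without knowing the billing unit or currency, neither of which is stated here. A small sketch using the listed prices follows; the dictionary layout and function name are illustrative.

```python
# Prices copied from the table above; billing unit and currency are not stated.
PRICES = {
    "1080P": {"linkai": 0.198400, "official": 0.264500},
    "720P": {"linkai": 0.115800, "official": 0.154400},
}


def discount_percent(linkai: float, official: float) -> float:
    """Percentage by which the LinkAI price undercuts the official price."""
    return (1 - linkai / official) * 100


for resolution, price in PRICES.items():
    saving = discount_percent(price["linkai"], price["official"])
    print(f"{resolution}: {saving:.1f}% below the official price")
```

Both rows work out to a saving of about 25%.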

Alibaba/happyhorse-1.0