Both models appear to be proprietary (Alibaba has published the EMO paper, but the trained weights and code are not released), and I couldn't find much information on HeyGen beyond its website (nothing on architecture, etc.), so it's hard to say exactly what the differences are when you can't freely test either one. From what I can deduce, EMO's main edge seems to be handling fast-paced audio like songs; I didn't find many examples of HeyGen animating anything other than speech. Any other direct comparison is hard to make without running both models side by side on similar inputs.
EDIT: I researched HeyGen a bit more. Its main product requires an actual video input to produce an avatar, while EMO only needs a still image plus audio at inference time (it additionally needs frames with facial movements during training), so that's a massive difference. HeyGen does have a "Photo Avatar" tool that seems similar to what EMO is doing, but from whatever snippets I could find, its animation quality is far worse than EMO's: the movements and facial expressions are pretty unnatural and rudimentary, it can't animate small facial muscle movements, and I'm not sure it can emulate body movements at all (like in EMO's Rap God example). There is another similar model from Pika Labs, but its quality is also garbage compared to what Alibaba has shown.