Let's talk about animation quality (theorangeduck.com)
237 points by ibobev 3 months ago | 78 comments



Something I really enjoyed about this article is that it really helps explain a counterintuitive result in hand-drawn 2D animation. It's a well-known phenomenon in hand-drawn 2D animation that naively tracing over live-action footage usually results in unconvincing, poor-quality animation. The article demonstrates how sampling and even small amounts of noise can make a movement seem unconvincing or jittery, and seeing that, it suddenly makes sense how something like simple tracing at 12 fps would produce bad results without substantial error correction (which is where traditional wisdom like arcs, simplification etc. comes in).


2D animation traced over live action is called rotoscoping. Many of Disney's animated movies from the Walt Disney era used rotoscoping, so I don't think it's fair to say it results in poor quality.

https://en.wikipedia.org/wiki/List_of_rotoscoped_works#Anima...


The comment was about naive tracing. When Disney used rotoscoping they had animators draw conforming to a character model on top of the live action pose.

The experienced animator and inbetweeners knew how to produce smooth line motion, and the live action was used for lifelike pose, movement, etc. It wasn’t really tracing.

There are examples of this in the Disney animation books: the finished animation looks very different from the live actors, but with the same movement.


On the other side of the same coin, when animating VFX for live action, animation which looks "too clean" is also a failure mode. You want to make your poses a little less good for camera, introduce a little bit of grime and imperfection, etc.

Animation is a great art and it takes a lot of skill to make things look the way they ought to for whatever it is you are trying to achieve.

Most animators don't like the "digital makeup" comparison (because it's often used in a way which feels marginalizing to their work on mocap-heavy shows), but if you interpret it in the sense that makeup makes people look the way they are "supposed to" I think it's a good model for understanding why rotoscope and motion capture don't yet succeed without them.


Rotoscoping has its place. It can save a lot of time/money for scenes with complex motion and can produce good results, but overreliance on it does tend to produce worse animation since it can end up being constrained to just what was captured on film. Without it, animators are more free to exaggerate certain motions, or manipulate the framerate, or animate things that could never be captured on camera in the first place. That kind of freedom is part of what makes animation such a cool medium. Animation would definitely be much worse off if rotoscoping was all we had.


"Animation would definitely be much worse off if rotoscoping was all we had." Yeah, then it wouldn't be animation anymore.


I mean, rotoscoping is still animation, but it's just one technique/tool of the trade. I thought it was used well in Undone, and I enjoyed The Case of Hana & Alice


Rotoscoping was utilized for some difficult shots. Mostly live action was used for reference, not directly traced, Fleischer style. I've never seen rotoscoping that looked as masterful as Snow White and similar golden age films.

https://www.youtube.com/watch?v=smqEmTujHP8


A Scanner Darkly is rotoscoped

https://youtu.be/l1-xKcf9Q4s


A Scanner Darkly is more like a manual post-processing effect than animation.


The creators reference that as rotoscopy very often, there are some cool details (not much related to this discussion) here https://blogs.iu.edu/establishingshot/2022/03/28/rotoscoping...


I'm not saying it isn't really roto, I'm saying it's not really used here as part of an animation pipeline so much as in a compositing pipeline.


I spend a lot of my time researching live-action anime[0][1], and there's an important thing to learn from Japanese animators: sometimes an animation style may seem technically lacking, but visually stunning.

When animator Ken Arto was on the Trash Taste podcast he mentioned how Disney had the resources to perfect the animation, while in Japan they had to achieve more with less.

This basically shifts the "what is good animation" discussion in ways that are not as clear from looking at the stats.

[0] https://blog.nestful.app/p/ways-to-use-nestful-outlining-ani...

[1] https://www.youtube.com/watch?v=WiyqBHNNSlo


These kinds of perspectives are often found and parroted in perceived 'elite' circles. It's no wonder the author works at Epic Games, a place where one needs high technical chops to work.

It's also no wonder why such people get disconnected from some realities on the ground. Sure on paper people do want higher quality things but they don't even know what those are. Most people have low-brow tastes; they'd take a cheaper and well-marketed thing over a 1% improvement.

Japan didn't need to compete on the same ladder for success; it needed to mix various elements of what it's good at to achieve its own success.


Exactly right. Sometimes those "higher quality" things may lead to reduced quality, most commonly by reaching the uncanny valley.

Interestingly that does not happen in the opposite direction. When "reducing" certain stats on real footage (which is what live-action anime should do[0]) the uncanny valley is skipped. Maybe it's harder to fall into when going backwards? More research is needed.

BTW, I love your books

[0] https://www.youtube.com/shorts/3ZiBu5Il2eY


We're really good at filling in the blanks with less data. That's why even the best animated scenes may not hit as hard as a page of manga with 6-7 panels dedicated to the scene: we imagine that scene ourselves, guided by the panels. It's also why a recreation of a scene in a book can fall short of your imagination.

To contrast the above comment, video games don't let you "skip" steps these days. It's unsurprising to hear the author works at Epic Games because you get a lot less room to be experimental in that 3D real-time realm compared to any other medium like movies. When interactivity is involved, fluidity and responsiveness are key to keeping a player immersed, compared to a movie that could suddenly lower its framerate to create an ironically more engaging fight scene.


Those dumb artists focusing on quality instead of revenue!


I own an animation studio.

Animation and motion are two different things—related, but definitely not the same. They don't rely on the same principles and they don't capture the same data.

Most people use the terms interchangeably, probably because the tools to process key frames are USUALLY the same.

Animation frames aren't regular the way mo-cap is. Instead, they are designed to furnish the eye (persistence of vision) with images that, in sequence, produce a sense of crisp motion to the viewer.

It's a subtle distinction, but the result is wildly different. In animation, the ACTUAL POSES matter a great deal. In mo-cap, they don't matter at all; it's all about regular sampling and then you just render (in 3D) what you want.

Video game cut scenes are what more-or-less raw "mo-cap" looks like if you're curious.


"Obviously a huge part of this is the error propagation that we get down the joint chain... but"

This shouldn't be glossed over and a proper consideration of the error metric here is key to storing quality animation with fewer bits, lower bandwidth and higher performance.
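To make the error-propagation point concrete, here is a minimal sketch (not from the article) using a hypothetical planar chain of four unit-length bones. The names `forward_kinematics` and `tip_error` and the angle values are made up for illustration. The only claim is that the same 1 degree error moves the tip further when it is applied nearer the root, which is one reason a per-joint rotation error metric can be misleading.

```python
import math

def forward_kinematics(angles, bone_length=1.0):
    """Return the 2D end position of each bone in a planar chain.

    `angles` are local (parent-relative) rotations in radians.
    """
    x, y, total = 0.0, 0.0, 0.0
    positions = []
    for a in angles:
        total += a  # local rotations accumulate down the chain
        x += bone_length * math.cos(total)
        y += bone_length * math.sin(total)
        positions.append((x, y))
    return positions

def tip_error(angles, joint_index, angle_error):
    """Distance the chain tip moves when one joint's angle is perturbed."""
    perturbed = list(angles)
    perturbed[joint_index] += angle_error
    tx, ty = forward_kinematics(angles)[-1]
    px, py = forward_kinematics(perturbed)[-1]
    return math.hypot(px - tx, py - ty)

chain = [0.3, -0.2, 0.4, 0.1]  # hypothetical 4-bone limb, radians
one_degree = math.radians(1.0)

root_err = tip_error(chain, 0, one_degree)  # error injected at the root joint
leaf_err = tip_error(chain, 3, one_degree)  # same error at the last joint
print(f"tip displacement: root={root_err:.4f}, leaf={leaf_err:.4f}")
```

An error metric measured on joint positions (or weighted by how much of the chain hangs below each joint) would treat these two cases differently, whereas a raw rotation error would score them the same.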


It does feel like joint rotations are probably not a great representation for processing animation data. And it's not generally how the motion is captured, nor does it seem to be how the human eye analyses it. But I don't know what an alternative would look like, and I'm pretty sure smarter people than I have spent a lot of time thinking about it.


The joint rotation / skinning concept is about compression and ability to playback a stored animation rapidly. It has also been done by storing the entire posed mesh per frame and interpolating between meshes.

It follows from digitizing the articulated armatures of stop motion animation which preceded CGI. Some of the first CGI was input using metal jointed armatures with potentiometers in every joint, which is a form of 'rigid' transforms.

I'm not sure what you are thinking about when you say it's not a great representation for processing animation data but I speculate that you mean it's not a great representation for how a flesh and blood creature moves and you'd be right. Advanced CGI can sometimes tackle a more thorough simulation of the musculature, complex joints, tendons, ligaments, and the physical dynamics of these all being controlled by a brain. A lot of soft body physical interactions as well. The sheer amount of processing and data needed for such a real simulation is why the cheating approximations are used for interactive simulations and games.

See https://www.wetafx.co.nz/research-and-tech/technology/tissue


Fitting joints onto a text-prompted Sora-generated video: could "transformers" not make all this stuff obsolete too? You might need the motion capture data for ground truth to fit joints, but maybe not to generate animation itself.



Seems like the media files still load from the original domain


Image generation has its own problems with non-cancelling noise.

For example, images are often generated with jpeg artifacts in regions but not globally.

Watermarks are also reproduced.

Some generated images have artifacts from CCD cameras

https://www.eso.org/~ohainaut/ccd/CCD_artifacts.html

Images generated from Google Street View data would likely contain features specific to the cars/cameras used in each country

https://www.geometas.com/metas/categories/google_car/


It seems like such an obvious and surmountable problem though. Indeed since 2020 there are robust approaches to eliminating JPEG artifacts, for example - browse around here - https://openmodeldb.info/.


You're right. In order to be a big problem, the error needs to be non-cancelling and inseparable.

That makes me wonder: if you label good data, and generate data with the good label, how much benefit do you get from also training on okay data?


The author did some very cool work with Raylib interpolating between animations to make transitions more natural. I remember being blown away at how realistic it looked from the videos he posted in the Discord. Glad to see he's still pushing the boundaries on what's possible with quality animation. And of course Cello rocks!


The points about the effects of noise are super interesting. Kind of mind blowing to think about the sensitivity of our perception being so different across visual channels (color, shape, movement, etc).


> one of the highest quality publicly available datasets of motion capture in the graphics community

> This data is sampled at 120 Hz, with finger and toe motions

But when I watch the videos they look like the dancer had palsy affecting their hands or were wearing astronaut gloves, because the fingers barely move for the most part.


The dataset Github page https://github.com/simonalexanderson/MotoricaDanceDataset does mention some missing finger data:

> Session 2: Casual dancing ... No finger motion.

> Session 3: Vintage jazz dancing ... Fingers captured with Manus gloves, which unfortunately suffered in quality due to sensor drift during rapid motion.

> Session 4: Street dancing ... Simplified finger motion (markers on thumb, index finger, and pinky according to the OptiTrack layout).

The first video in the article does have a bit of finger motion, so I'm guessing it's from session 4. Toes also look a bit iffy and clip into the ground instead of curling at times.


If one looks at the Yoda puppet in The Empire Strikes Back, it of course moves like a puppet, but the motion is real. Jerky, emotional, human-like.

Move on to The Clone Wars and the CGI movements are mechanical. Maybe the way to go about animation is not to rely on the eye of the beholder but on careful comparison of analog vs digital renderings: film a human running on analog and pair it pixel by pixel with the digital CGI counterpart.


That's called rotoscoping.


The author discusses the perceptual allowances for different kinds of inputs (the noise in images, etc), and it's a really interesting point that helps sketch some boundaries around where the LLM/Diffusion model paradigms are useful.

Human color perception is almost entirely comparative - we see something as blue because, given the other objects in a scene and the perceived lighting, blue is the color an object would have to be to look the way it does (this is the blue dress phenomenon) - and so noise in images is easy for us to ignore. Similarly, audio and especially speech perception is also very strongly contextually dependent (as attested by the McGurk effect), so we can also deal with a lot of noise or imprecision - in other words, generative guesswork.

Motion, on the other hand, and especially human motion, is something we're exquisitely attentive to - think of how many horror movies convey a character's 'off-ness' by subtle variations in how they move. In this case, the diffusion model's tendency towards guesswork is much, much less easily ignored - our brains are paying tight attention to subtle variations, and anything weird alarms us.

A constant part of the conversation around LLMs, etc. is exactly this level of detail-mindedness (or, the "hallucinations" conversation), and I think that's basically where you're going to land with things like this - where you need actual genuine precision, where there's some proof point on whether or not something is accurate, the generative models are going to be a harder fit, whereas areas where you can get by with "pretty good", they'll be transformative.

(I've said it elsewhere here, but my rule of thumb for the LLMs and generative models is that if a mediocre answer fast moves the needle - basically, if there's more value in speed than precision - the LLMs are a good fit. If not, they're not.)


The shoulder rotation plotted at various frequencies sparked a question for me: is there an "MP3" of character animation data? The way that we have compression optimized for auditory perception… it feels like we might be missing an open standard for compressing this kind of animation data?

edit: Claude is thinking MP3 could work directly: pack 180Hz animation channels into a higher frequency audio signal with some scheme like Frequency Division / Time Division Multiplexing, or Amplitude Modulation. Boom, high compression with commonplace hardware support.


That same graph had me jump to the sampling theorem: playing back an animation with linear interpolation creates hard edges, i.e. frequency spikes. I'm not sure if the movement space is comparable to audio here, but I can't see why not.

So, if the sampling theorem applies, sampling at 2x the maximum movement "frequency" should be enough to perfectly recreate the motion, as long as you "filter out" any higher frequencies when playing back the animation by using something like FFT upscaling (re-sampling) instead of linear or Bezier interpolation.

(having written this, I realize that‘s probably what everyone is doing.)
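A small sketch of that idea, under loud assumptions: the "motion" is a made-up periodic signal containing only 1 and 3 cycles per 16-sample window (well below Nyquist), and `dft`, `fourier_eval`, and `linear_eval` are hypothetical helpers written with a naive O(n²) DFT rather than a real FFT library. It compares reconstruction halfway between samples using linear interpolation against band-limited (Fourier) interpolation.

```python
import cmath
import math

N = 16  # samples in one (assumed periodic) window

def motion(t):
    # Hypothetical band-limited joint angle: 1 and 3 cycles per window.
    return math.sin(2 * math.pi * t / N) + 0.5 * math.cos(2 * math.pi * 3 * t / N)

def dft(samples):
    """Naive normalized DFT; fine for a 16-sample demo."""
    n = len(samples)
    return [
        sum(s * cmath.exp(-2j * math.pi * f * k / n) for k, s in enumerate(samples)) / n
        for f in range(n)
    ]

def fourier_eval(coeffs, t):
    """Evaluate the band-limited interpolant at fractional sample index t."""
    n = len(coeffs)
    total = 0j
    for f, c in enumerate(coeffs):
        signed = f if f < n / 2 else f - n  # upper bins are negative frequencies
        total += c * cmath.exp(2j * math.pi * signed * t / n)
    return total.real

def linear_eval(samples, t):
    """Plain linear interpolation between neighbouring samples."""
    i = int(math.floor(t))
    frac = t - i
    return samples[i] * (1 - frac) + samples[i + 1] * frac

samples = [motion(k) for k in range(N)]
coeffs = dft(samples)
midpoints = [k + 0.5 for k in range(N - 1)]  # halfway between stored samples

linear_err = max(abs(linear_eval(samples, t) - motion(t)) for t in midpoints)
fourier_err = max(abs(fourier_eval(coeffs, t) - motion(t)) for t in midpoints)
print(f"linear: {linear_err:.4f}  band-limited: {fourier_err:.2e}")
```

Real motion is neither periodic nor strictly band-limited, so treat this as an illustration of the theorem rather than a recommendation for playback.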


I would love to be corrected on this - but my understanding of frequency compression is that you have to decode the entire file before being able to play back the audio. Therefore, in real time applications with limited RAM (video games) you don't want to wait for the entire animation to be decoded before streaming the first frames.

Can anyone think of a system with better time-to-first-frame that achieves good compression?


most audio and video schemes support streaming, in the case of MP3 we are talking about frame-based compression

I guess to restate my curiosity: are things like Animation Pose Compression in Unity or equivalents in other engines remotely as good as audio techniques with hardware support? The main work on this seems to be here and I didn't see any references to audio codecs in the issue history fwiw. https://github.com/nfrechette/acl
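On the streaming point: the property that matters for time-to-first-frame is that each block can be decoded on its own. The sketch below is not MP3 and not how ACL works; it is plain per-block 8-bit uniform quantization with hypothetical names (`encode_blocks`, `decode_sample`, a block size of 16), just to show random access into a compressed channel without decoding the rest.

```python
import math

BLOCK = 16  # samples per independently decodable block (arbitrary choice)

def encode_blocks(values, block_size=BLOCK):
    """Quantize each block to 8 bits against its own min/max range."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        lo, hi = min(chunk), max(chunk)
        scale = (hi - lo) / 255 or 1.0  # avoid divide-by-zero on flat blocks
        codes = bytes(round((v - lo) / scale) for v in chunk)
        blocks.append((lo, scale, codes))
    return blocks

def decode_sample(blocks, index, block_size=BLOCK):
    """Decode one sample by touching only the block that contains it."""
    lo, scale, codes = blocks[index // block_size]
    return lo + codes[index % block_size] * scale

channel = [math.sin(0.05 * i) for i in range(100)]  # made-up joint angle curve
blocks = encode_blocks(channel)

# Worst-case reconstruction error across the whole channel.
worst = max(abs(decode_sample(blocks, i) - v) for i, v in enumerate(channel))
print(f"{len(blocks)} blocks, worst error {worst:.5f}")
```

A perceptual codec would spend its bits very differently, but the block structure is what lets playback start from the first block (or seek to any block) without waiting for the whole clip.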


Oh, Ubisoft provided some datasets publicly. That's very nice of them.

>All of these datasets are NOT licensed for commercial use. They are for research use only.

I mean, I knew this already when I looked at the license (a thing any commercially-oriented dev should do on any repo) and saw CC-4 (non-derivative, non-commercial). But it's still sad that this somewhat repeats the very mantra said a few sections up:

>Almost all games and VFX companies don't share their data with the academic world (with the exception of Ubisoft who have released some very high quality animation datasets), so how can we penalize academics for not reaching a bar they don't even know exists?

But alas, this is one barrier that separates an indie and even some AA games from a AAA game. At least the article gave tips on what to look out for if trying to prepare your own animation dataset.


I love the statement in the conclusion.

Curation is something we intrinsically favor over engagement algorithms. Noisy is easy to quantify, but greatness is not. Greatness might have a lag in engagement metrics while folks read or watch the material. It might provoke consideration, instead of reaction.

Often we need seasons of production in order to calibrate our selection criteria, and hopefully this season of booming generation leads to a very rich new opportunity to curate great things to elevate from the noise.


Why is curation relevant to ‘greatness’?

By definition 99% of the content produced has to be in the bottom 99 percentiles, in any given year.

Even if the entire world decided everything must be curated, that would just mean the vast vast majority of curators have not-great taste.

Whereas in a future world where 99% of it is driven by algorithms, that would mean the vast majority of curators have ‘great’ taste.

But this seems entirely orthogonal.


Something I keep seeing is that modern ML makes for some really cool and impressive tech demos in the creative field, but is not productionizable due to a lack of creative control.

Namely, anything generating music / video / images - tweaking the output is not workable.

Some notable exceptions are when you need stock art for a blog post (no need for creative control), Adobe's recolorization tool (lots of control built in), and a couple more things here and there.

I don't know how it is for 3D assets or rigged model animation (as per the article), never worked with them. I'd be curious to hear about successful applications, maybe there's a pattern.


Something I realized about AI is that an AI that generates "art" be it text, image, animation, video, photography, etc., is cool. The product it generates, however, is not.

It's very cool that we have a technology that can generate video, but what's cool is the tech, not the video. It doesn't matter if it's a man eating spaghetti or a woman walking in front of dozens of reflections. The tech is cool, the video is not. It could be ANY video and just the fact AI can generate is cool. But nobody likes a video that is generated by AI.

A very cool technology to produce products that nobody wants.


That's an oversimplification, I think. If you're only generating a video because 'I can oooh AI' - then of course no one wants it. If you treat the tools as what they are, tools - then people may want it.

No one really cares about a tech demo, but if generative tools help you make a cool music video to an awesome song? People will want it.

Well, as long as they aren't put off by a regressive stigma against new tools at least.


Are there any valid reasons people might not like this or is it only "regressive stigma?"


Humans find lots of value in human effort towards culturally important things.

See: a grandmother’s food vs. the industrial equivalent


Well, we had an entire article explaining parts of that. You can skimp on some areas and fool a human, but human proportions and environmental weight are hard. Then you get to motion and it's extremely hard to fool the human eye.

One end of art is spending millions of man-hours to polish this effect to fool the eye. The other side simplifies the environment and focuses more on making this new environment cohesive, which relaxes our expectations. Take your favorite 90's/early 00's 3D game and compare it to Mass Effect: Andromeda to get a feel for this.

AI is promising to do the former with the costs of the latter. And so far it's maybe halfway to Andromeda in its infancy of videos.


If you used AI to make something awesome, even if I liked it, I'd feel scammed if it wasn't clearly labelled as AI, and if it was clearly labelled as AI I wouldn't even look at it.


> if it was clearly labelled as AI I wouldn't even look at it.

If you dislike it without even seeing it, that would indicate the problem isn't with the video...


Yes, the problem is with AI. I'm tired of trying to find X and finding "AI X" instead. I google "pixel art" I get "AI pixel art." I google clipart I get "AI clipart." I go to /r/logodesign to see some cool logo designs, it's 50% people who used ChatGPT asking if it looks good enough.

The only good AI is AI out of my sight.


> I'd feel scammed if it wasn't clearly labelled as AI

TBF - have you looked at a digital photo made in the last decade? Likely had significant 'AI' processing applied to it. That's why I call it a regressive pattern to dislike anything with a new label attached - it minimizes at best and often flat out ignores the very real work very real artists put in to leverage the new tools.


Face it. People are okay with super resolution efforts, including most deep learning-based methods. But not "AI". You can run video through i2i as a cleanup tool and upload it on the Internet, some tried and quit. YouTubers and TikTokers aren't doing it and they're all for attention.

The output of current image generators is trash. It's unsalvageable. That's the problem, not a "regressive pattern".


You still have to take the photo. That's a billion times more effort than typing a prompt in ChatGPT.


honestly, that's the same argument people made against photographs when the technology became available. Same argument made against the printing press.

New tools aren't inherently inferior, they open up new opportunities.


I've never seen a photograph pretending to be an illustration, or vice-versa. It's only AI that pretends to be a genre it isn't.


If anything photography mostly just replaced painted portraits, hence why high art went super abstract in response.


> A very cool technology to produce products that nobody wants.

creative power without control is like a rocket with no navigation—sure, you'll launch, but who knows where you'll crash!


Yes, it turns out there's more to creating good art than simulating the mechanics and technique of good artists. The human factor actually matters, and that factor can't be extrapolated from the data in the model itself. In essence it's a lossy compression problem.

It is technically interesting, and a lot of what it creates does have its own aesthetic appeal just because of how uncanny it can get, particularly in a photorealistic format. It's like looking at the product of an alien mind, or an alternate reality. But as an expression of actual human creative potential and directed intent I think it will always fall short of the tools we already have. They require skilled human beings who require paychecks and sustenance and sleep and toilets, and sometimes form unions, and unfortunately that's the problem AI is being deployed to solve in the hope that "extruded AI art product" is good enough to make a profit from.


The problem in your example is that you wouldn’t think a picture of a man eating spaghetti taken by a real person would be cool.

You may feel different if it’s, say, art assets in your new favorite video game, frames of a show, or supplementary art assets in some sort of media.


> or a woman walking in front of dozens of reflections

A lot of people will not notice the missing reflections and because of this our gatekeepers to quality will disappear.


While I am in the same camp as you, there is one exception: Music. Especially music with lyrics (like suno.com) - Although I know that it's not created by humans, the music created by Suno is still very listenable and it evokes feelings just like any other piece of music does. Especially if I am on a playlist and doing something else and the songs just progress into the unknown. Even when I am in a more conscious state - i.e. creating my own songs in Suno, the end result is so good that I can listen to it over and over again. Especially those ones that I create for special events (like mocking a friend's passing phase of communism and reverting back to capitalism).


In my opinion, Suno is good for making really funny songs, but not for making really moving songs. Examples of songs that make me chuckle that I've had it do:

A Bluegrass song about how much fun it is to punch holes in drywall like a karate master.

A post-punk/hardcore song about the taste of the mud and rocks at the bottom of a mountain stream in the newly formed mountains of Oklahoma.

A hair band power ballad about white dad sneakers.

But for "serious" songs, the end result sounds like generic muzak you might hear in the background at Wal-Mart.


appreciate your position but mine is that everything out of suno sounds like copycat dog water.


Makes sense that GP appreciates the taste of dog water when they’re mocking their friends for having had values (friends who likely gave up their values to stop being mocked)


My generation do not give up on their values because they are being mocked - they mock back even harder until somebody ends up dying from laughter.


Reminds me of Cohen as covered by the Doug Anthony All Stars

    I got my shit together meeting Christ and reading Marx
    It failed my little fire but it spread a dying spark
https://youtu.be/elr0JmB7Ac8?t=42


Probably accurate for videos and music. Videos because there’s going to be just too many things to correct to make it time efficient. Music because music just needs to be excellent or it’s trash. That is for high quality art of course. You can ship filler garbage for lots of things.

2D art has a lot of strong tooling though. If you’re actually trying to use AI art tooling, you won’t be just dropping a prompt and hoping for the best. You will be using a workflow graph and carefully iterating on the same image with controlled seeds and then specific areas for inpainting.

We are at an awkward inflection point where we have great tooling for the last generation of models like SDXL, but haven’t really made them ready for the current gen of models (Flux) which are substantially better. But it’s basically an inevitability on the order of months.


Even with the relatively strong tooling for 2D art it's still very difficult to push the generated image in novel directions though, hence the heavy reliance on LoRAs trained on prior examples. There doesn't seem to be an answer to "how would you create [artist]'s style with AI" that doesn't require [artist] to already exist so you can throw their life's work into a blender and make a model that copies it.

I've found this to be observable in practice - I follow hundreds of artists who I could reliably name by seeing a new example of their work, even if they're only amateurs, but I find that AI art just blurs together into a samey mush with nothing to distinguish the person at the wheel from anyone else using the same models. The tool speaks much louder than the person supposedly directing it, which isn't the case with say Photoshop, Clip Studio or Blender.


Shrug. That’s a very different goal. Yes, if you want to leverage a different style your best bet is to train a LoRA off a dozen images in that style.

Art made by unskilled randos is always going to blur together. But the question I feel we’re discussing here is whether a dedicated artist can use them for production grade content. And the answer is yes.



> but is not productionizable due to a lack of creative control.

It's just a matter of time until some big IP holder makes "productionizable" generative art, no? "Tweaking the output" is just an opinion, and people already ship tons of AAA art with flaws that lacked budget to tweak. How is this going to be any different?


No, it's not "just a matter of time." It's an open question whether it's even possible with anything resembling current techniques.


I don't think it is a question at all. It is not just possible, it's implemented in reality. Compositing is a thing in imagen space, and source adjustments in this scheme are trivial. I'm talking about controlnets, style transfer adapters, straight up neural rendering of simplified 3D scenes, training on custom references, and a ton of other methods to establish control. Temporal stability is also a solved issue.

What it really lacks is domain knowledge. Current imagen is done by ML nerds, not artists, and they are simply unaware of what needs to be done to make it useful in the industry, and what to optimize for. I expected big animation studios to pick up the tech like they did with 3D CGI in the 90s, but they seem to be pretty stagnant nowadays, even besides the animosity and the weird culture war surrounding this space.

In other words, it's not productized because nobody productized it, not because it's impossible.


The generated artwork will initially displace clipart/stock footage and then illustrators and graphic designers.

The last 2 can have tremendous talent but the society at large isn’t that sensitive to the higher quality output.


Seems like this site is getting hugged to death right now


I haven't checked, but I think some of the videos on the page might be served directly from the server.

Edit: Wow! They are loaded directly from the server, where I assume no CDN is involved. And what's even worse, they're not lazy loaded. No wonder it cannot handle a little bit of traffic.


> The people who are actually trying to build quality content are being forced to sink or swim - optimize for engagement or else be forgotten... There are many people involved in deep learning who are trying very hard to sell you the idea that in this new world of big-data...

It's always easy to talk about "actually trying to build quality content" in the abstract. Your thing, blog post or whatever, doesn't pitch us a game. Where is your quality content?

That said, having opinions is a pitch. A16Z will maybe give you like, $10m for your "Human Generated Authentic badge" anti-AI company or whatever. Go for it dude, what are you waiting for? Sure it's a lot less than $220m for "Spatial Intelligence." But it's $10m! Just take it!

You can slap your badge onto Fortnite and try to become a household name by shipping someone else's IP. That makes sense to me. Whether you can get there without considering "engagement," I don't know.


fwiw, the author:

>My name is Daniel Holden. I'm a programmer and occasional writer currently working as a Principal Animation Programmer at Epic Games and doing research mainly on Machine Learning and Character Animation.

Their quality content is assuredly building tools for other professionals. So, B2B. A very different kind of "content creator" than those working B2C.

And their pitch itself is already done, likely being paid 200k+/yr to directly or indirectly help make Fortnite look better or iterate faster. So... mission accomplished? How Epic is sourcing their deep learning tech will be interesting to see in the coming years as society figures out boundaries on AI tech.



