I wrote a tool to do automated QA on internet video (HLS/DASH, the tech used by Netflix, YouTube, Twitch, etc.).
It evaluates streams against a database of 100 or so "quirks" that identify either general issues or issues that will only manifest on certain player libraries. For instance, specific encodings that are technically "in spec" but non-standard in practice get flagged.
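To give a rough sense of the shape, each quirk can be thought of as a small rule plus metadata about which players it affects. A purely illustrative TypeScript sketch (the names, fields, and example rule here are made up, not the tool's actual API):

```typescript
// Hypothetical sketch of a quirk rule - illustrative only, not the tool's real API.
interface ParsedStream {
  protocol: "HLS" | "DASH";
  codecs: string[];              // e.g. ["avc1.64001f", "mp4a.40.2"]
  segmentDurationsSec: number[]; // per-segment durations from the manifest
  targetDurationSec?: number;    // HLS EXT-X-TARGETDURATION, if present
}

interface Quirk {
  id: string;
  description: string;
  affectedPlayers: string[];     // empty = a general issue, not player-specific
  check: (stream: ParsedStream) => boolean; // true = quirk triggered
}

// Example of the kind of rule: segments longer than the advertised target
// duration - a common real-world authoring problem that only some players tolerate.
const segmentExceedsTarget: Quirk = {
  id: "hls-segment-exceeds-target",
  description: "Segment duration exceeds EXT-X-TARGETDURATION",
  affectedPlayers: ["hls.js", "ExoPlayer"],
  check: (s) =>
    s.protocol === "HLS" &&
    s.targetDurationSec !== undefined &&
    s.segmentDurationsSec.some((d) => d > s.targetDurationSec!),
};

function evaluate(stream: ParsedStream, quirks: Quirk[]): Quirk[] {
  return quirks.filter((q) => q.check(stream));
}
```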
Built on TypeScript/node/Docker over the course of maybe 18 months. Used it fairly often when I was working in the space, not at all these days. Originally the plan was to license it as an enterprise vid tool.
(I've been considering open-sourcing it - would YOU use it if so?)
I am definitely curious about a tool like this. I work with a lot of video streams, and this collective knowledge of quirks might be useful as a QA tool.
The code signing process has always sucked and been deeply cryptic, even on macOS - getting something usable and improvable that's cross-platform is a fantastic win.
I'd label Warden an "AI maximalist" approach. It also has a lot of merit and is very interesting, but it's much harder to get fine control or tweakability, and much, much harder to run fully hands off. "Hands off"-ability is an important metric because it's useful not to have to babysit every second of footage.
Writing comedy is really hard! Great concept for a show, tho. TBH it's a good example of something that could be interesting and viable with a tool like On Screen but would never make it as a studio or even indie production.
The scope and target complexity of the series I'm making with On Screen is _dramatically_ cut down from what I started with, and it's still a bit of a stretch for the models at times. I started at DS9/Babylon 5 and ended up at Flash Gordon...
I am planning on doing some more articles/director commentary as it goes along.
I have a number of episodes in the queue and each one is better than the last. My plan is to release an entire season of 12 or so.
The "I'm a GPT that wants everyone to be friends and how" is increasingly better in those episodes.
Even incremental improvements in stuff like background music make a big big difference.
I really want to do a v2 that is more of a "copilot" than an "AI first" experience. But I need partners to help with funding; I've taken it about as far as I can on a solo basis. The next step is a team of 4-5 people levelling it up. Every piece could be 10x better, and it would be a different beast entirely if that happened. I think there are some super exciting directions this could go.
The vision of a distributed creator system is very interesting, as is letting people do more hands-on writing/rewriting.
You should link not to the first episode but to a playlist that you update in reverse order, so the best episodes come first. It wasn't clear to me that the quality would improve with each episode, and honestly getting through the first was a bit of a struggle.
How much funding do you think you need for an MVP that's more Copilot-like? I might be interested in taking part in a seed round. Having AI do everything is a fun challenge, but I think the sort of people who would actually pay for a product would want to have some creative control and let the AI handle the parts they don't want to or can't do.
The Minecraft-esque graphics probably aren't an issue, but scaling up to provide all the needed assets probably is. There are AIs that can generate 3D models, but a consistent art style is required for it to work visually, and you provided that here. Finding a way to quickly and cheaply scale the "kitbashing" seems key to any kind of productization.
Good call on the playlist. I'll do that soon. I agree, ep1 is rough.
I shot you an e-mail on a v2. (An MVP would be less; I realized I sent you the pitch for a full v2.)
There are a LOT of art packs out there for a ton of different looks and genres. Building sets is quick and easy even with kitbashing. I think you could synthesize 3D content in a lot of ways (vid2vid, Gaussian diffusion generative models, prop placement by LLM, clever use of Stable Diffusion/Firefly for mattes, etc.) or have a small stable of Fiverr types to make art for people on demand in a specific style...
> I am planning on doing some more articles/director commentary as it goes along.
Speaking for myself, I expect that the behind-the-scenes commentary would be the most interesting part of the project!
> The "I'm a GPT that wants everyone to be friends and how" is increasingly better in those episodes.
How long does the pipeline take to run? (Apologies if this was part of the blog series and I missed it.) Depending on how close the whole process is to a self-running CI pipeline, I think it might be interesting to run benchmarks against various versions of the pipeline and evaluate its performance at each stage. I feel like I could evaluate the improvement of the "let's make everyone be friends!" writing if I'm comparing Episode 1 (compiled w/ v0.3) against Episode 1 (compiled w/ v0.8), instead of Episode 1 vs. Episode 12.
Crazy idea: if one could somehow quantify the quality of consistency, dialogue, camera work, etc., then you might be able to watch numbers-go-up in an actual graph sort of way (I'm imagining a multi-agent system where various agents are responsible for monitoring various aspects of script and production quality -- almost like an actor/critic setup).
But at the very least, being able to A/B compare between v0.3 and v0.6 could be very interesting for people interested in the internals.
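A minimal sketch of what that kind of harness could look like (completely hypothetical - the agent names, scoring axes, and types are made up, not anything from the real pipeline):

```typescript
// Hypothetical actor/critic-style evaluation harness for comparing pipeline versions.
interface EpisodeArtifacts {
  script: string;
  shotList: string[];
}

interface CriticAgent {
  name: string;                                     // e.g. "dialogue", "continuity", "camera"
  score: (ep: EpisodeArtifacts) => Promise<number>; // 0-10, however the critic is implemented
}

async function evaluatePipelineVersion(
  version: string,
  episode: EpisodeArtifacts,
  critics: CriticAgent[],
): Promise<Record<string, number>> {
  const scores: Record<string, number> = {};
  for (const critic of critics) {
    scores[critic.name] = await critic.score(episode);
  }
  // Plot these per version and you get the "numbers go up" graph.
  console.log(`pipeline ${version}:`, scores);
  return scores;
}
```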
> I've taken it about as far as I can on a solo basis. The next step is a team of 4-5 people levelling it up. Every piece could be 10x better, and it would be a different beast entirely if that happened. I think there are some super exciting directions this could go.
I think that's the really cool thing about what you've built here -- it's a complete pipeline, and every piece is present -- even if the pieces aren't in their final form, the fact that you've pieced together an entire pipeline is extremely compelling.
> (PS - Hi Han!)
Hi!! It was a very cool surprise to see your name pop up on my HN feed this morning. :D
But I had 8 kids aged 5-15 watch all of Ep1 _AND_ choose to watch Ep2 afterwards last night. They actually sat and watched, too, instead of having it on in the background... AND they were bummed they couldn't watch the super secret pilot episode (which has MAJOR audio issues - I couldn't bring myself to inflict it on them).
So I think something is there.
I agree, there are some great opportunities to track things somewhat more quantitatively. It takes ~15 minutes and about $10 to generate a script, depending on how fast OpenAI is feeling. So in a real-scale v2 it would be very reasonable to explore this.
Yes, I think so! That's super encouraging about holding the attention of a room of kids!
> It takes ~15 minutes and about $10 to generate a script, depending on how fast OpenAI is feeling. So in a real-scale v2 it would be very reasonable to explore this.
Yeah -- still a bit large to truly put into a CI pipeline that is running against every commit tho. :-/
Do you mind sharing your context window size? I always want to use local LLMs for rapid iteration -- I think a 32k window isn't too difficult (Mixtral supports this out of the box, I think?), but I've heard of people pushing 100k tokens locally. Even so, that's peanuts compared to what hosted LLMs are doing, and if quality of writing is your bottleneck, then you wouldn't want to stray too far from GPT-4 / Claude.
> Man, I sure hope I get to build this further!
Yeah!! It really feels like you've latched onto a nugget of something here, and I'm excited to see what's next!
Yeah - I think you don't necessarily go to a Netflix or MGM (or only the very best do), but you could see success the way a lot of smaller podcast content creators do.
10,000 screaming fans can take you a long long way.
It's interesting to consider "AI as its own genre" rather than "AI replacing mainstream content" - like how cheap animation enabled the anime genre or cheap filmmaking enabled the indie genre.
I wrote an autonomous AI space opera TV show generator. It takes a short topic phrase on one end and spits out a 10-15 minute, 3D-animated, AI-voiced video suitable for upload to YouTube on the other end.
Super interesting learning exercise since it intersects with many enterprise topics, but the output is of course more fun.
In some ways it is more challenging - a summary is still useful if it misses a point or is a little scrambled, whereas when a story drops a thread it’s much more immediately problematic.
I’m working on a blog post as well as getting a dozen episodes uploaded for “season 1”.
Disclaimer: I spent five years working at GarageGames doing core Torque development (the Tribes 2 engine derivative we sold).
The core code was pretty clean, but there was a LOT of cruft on top of it which tended to obscure the really good bits. IMHO the library that handled the animation was genius - it was incredibly light, it supported a broad feature set, and it could load any old asset from v1 up to v30. It even did a bunch of crazy data layout stuff to allow extremely fast endian conversion for PPC vs Intel (back when that mattered).
Good, efficient networking for that era meant you had to be miserly with your resources. Tribes 2/Torque was very much aligned with those requirements, and your example is actually a good illustration of those strengths.
The engine had three update cycles, all in service of the networking.
First, it would process fixed timestep logic - ticks guaranteed at 32 per second (this also aligned with the packet send rate). Client and server both ran this. This covered physics, user input, health management, etc.
Second, it would run "time"-based logic. This would be things like particle systems, which don't care a lot about whether you advance them 100ms at a time or 1ms at a time, and don't need to match precisely for gameplay anyway. Only the client ran this.
Third, it would interpolate tick state. This would smoothly interpolate between the last and current game state based on how far you were between the two states. It introduced a small amount of lag, but since it did not predict, it never caused visual glitches. This gave a smooth appearance for any of the stuff that happened in the first step. Only the client ran this.
The result of all this machinery was that you paid exactly what you needed to for each type of thing in your simulation and no more.
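In rough TypeScript pseudocode (a from-memory sketch of the idea, not actual Torque source, which was C++), the client-side frame flow was roughly:

```typescript
// Sketch of the three update cycles - illustrative only.
const TICK_RATE = 32;                  // fixed ticks per second, aligned with packet sends
const TICK_MS = 1000 / TICK_RATE;

interface GameObject {
  processTick(): void;                  // fixed-step gameplay logic (server + client)
  advanceTime(dtMs: number): void;      // time-based cosmetic logic (client only)
  interpolateTick(alpha: number): void; // blend between last and current tick state (client only)
}

let accumulatorMs = 0;

function clientFrame(frameDtMs: number, objects: GameObject[]): void {
  accumulatorMs += frameDtMs;

  // 1. Fixed timestep ticks: physics, input, health, etc.
  while (accumulatorMs >= TICK_MS) {
    for (const obj of objects) obj.processTick();
    accumulatorMs -= TICK_MS;
  }

  // 2. Time-based logic: particles and other things that just need elapsed time.
  for (const obj of objects) obj.advanceTime(frameDtMs);

  // 3. Interpolation: render somewhere between the last two ticks for smoothness.
  const alpha = accumulatorMs / TICK_MS;
  for (const obj of objects) obj.interpolateTick(alpha);
}
```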
Later versions of the engine added lag compensation. This meant that the client would snapshot game state and re-run the fixed timestep logic for compensated objects based on latency. You could configure it to only consider objects that might have interacted with the player (and thus were mispredicted) to save on CPU.
What happened in the case of authoritative skeletal animation?
1. Gameplay-relevant parts of the skeleton would be simulated in the fixed ticks. For instance, the current orientation of a player's weapon, which might need to take their animation pose into account. So you might see the spine and one arm updated here, while the legs wouldn't be touched.
2. In the time-based logic, you would run the full skeletal animation update so that the player could see smooth animation.
3. In the interpolation phase, you would interpolate the position of the player between the two states to give a smooth appearance (in conjunction with the animation work in phase 2).
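Concretely, a player object under that scheme might split its update like this (again a hypothetical sketch, not actual engine code):

```typescript
// Hypothetical split of a networked player's work across the three phases.
// Not actual Torque code - method names are illustrative.
type Vec3 = [number, number, number];

class NetworkedPlayer {
  private lastTickPos: Vec3 = [0, 0, 0];
  private currentTickPos: Vec3 = [0, 0, 0];
  private renderPos: Vec3 = [0, 0, 0];

  // Phase 1: fixed tick (server + client) - only the bones gameplay cares about,
  // e.g. enough of the spine/arm pose to know where the weapon points.
  processTick(): void {
    this.lastTickPos = this.currentTickPos;
    this.currentTickPos = this.simulateMovementAndAimBones();
  }

  // Phase 2: time-based (client only) - advance the full skeleton purely for visuals.
  advanceTime(dtMs: number): void {
    this.advanceAllAnimationTracks(dtMs);
  }

  // Phase 3: interpolation (client only) - blend position between the last two ticks.
  interpolateTick(alpha: number): void {
    this.renderPos = this.lastTickPos.map(
      (v, i) => v + (this.currentTickPos[i] - v) * alpha,
    ) as Vec3;
  }

  private simulateMovementAndAimBones(): Vec3 {
    // Gameplay-relevant bones only (placeholder).
    return this.currentTickPos;
  }
  private advanceAllAnimationTracks(_dtMs: number): void {
    // Cosmetic full-skeleton update (placeholder).
  }
}
```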
I would submit the above is actually a pretty elegant solution to the problem. Unfortunately, the surface-level code was all cut-down versions of Tribes code, which was written with shipping, not long-term reuse, in mind.
Most of the community developers never really wrapped their heads around this architectural stuff. A big fault of our core engine product was that it was oriented towards AAA projects, and we never really made it both user-friendly AND powerful. So we went with powerful, but that didn't serve our indie customer base well.
It took Unity around a decade and $500M to add "really powerful" to "easy to use", so I don't feel too bad about this.