Hacker News
Serverless Video Transcription inspired by Cyberpunk 2077 (github.com/elanmart)
774 points by pierremenard on Dec 22, 2022 | 80 comments



People complain that the magic of programming has been lost because all we do is stitch together APIs. They're oversimplifying the work being done; those people are cynical. This is amazing.


I think part of the magic is still lost, as we used to be very open-source and free-software centric. These days it's all based on paid API calls.


The magic is still there! In this case, the model for OpenAI's Whisper, which is arguably doing the bulk of the work here, is Open Source (under the MIT licence), and freely available for download at https://github.com/openai/whisper. You can run it wherever you want, though something with a GPU will let you do 5x realtime (or better!) transcription.
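
For anyone curious, local usage is about as minimal as it gets; a rough sketch (model choice and file name are just placeholders):

    import whisper

    # "base" is a good starting point; "medium"/"large" trade speed for accuracy
    model = whisper.load_model("base")

    # transcribe() uses ffmpeg under the hood to decode the audio file
    result = model.transcribe("interview.mp3")
    print(result["text"])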


Someone is out there building and maintaining the APIs


One hopes. If they decide to stop providing/maintaining the API, you're out of luck.

Just using APIs also misses out on the learning, community, and collaboration of open source software.


agree with parent re: stitching together -- but also, someone's gotta program the API!


Compare it to cars. New cars are more complex but collectors pay the most for the old, simple cars with a bonus for beauty and speed.


On your old car, you can just take it apart and see how it works. Don’t have the part? Improvise. It’s heavily skill-based, and a little knowledge will take you a long way. As long as you know your fundamental toolchains, you can make any part, do any job.

On your new car, it’s a series of black boxes, cables, and tubes - you don’t know and don’t care what happens inside the boxes, just what goes in, and what comes out. It’s heavily knowledge based - you need to know an awful lot about the boxes and how to work with them - but nobody expects you to machine your own parts. If one isn’t made that suits your purposes, you have to go to an old car guy.

The skillsets are similar but different. To be clear, I really am talking about cars and mechanics here - you won’t find many who can fix up both a 1960 and a 2021 motor. You’ll find plenty who can do one or the other, however.

And yeah - the same applies to developers.


Being an ex-mechanic, I second this. You HAVE to specialise, otherwise you'll be left with the simple, competitive jobs like tyre replacement.

I specialised in new vehicles, and then you have to specialise in a brand to actually know what's going on and how to fix said black boxes and systems.

My reason for retraining as a software engineer was mainly that I felt the knowledge and ongoing training you need to retain/constantly keep up with was not worth the money earned.

Now I earn my bread from what was my hobby instead: essentially the same work ethic as far as knowledge is concerned, just less back-breaking to put into practice.


I like reading comments from non-IT people on HN. It's awesome that you were able to change careers across different domains and still transfer skills.


On new cars, one thing that recently got me is this: they don't even have spare tires anymore.

I blew a tire on my car and due to its unconventional size and delays I had to wait 3 weeks for new tires to arrive.

Then, during those three weeks, wife's car blew a tire too. Gigantic hole in the side of the tire: probably a defect back from when we bought the car that eventually made the tire blow (I kinda noticed it when we bought the car but the official dealership told me the car had been inspected and the tires were fine).

Thankfully it's a BMW with "run flat" tires: even if you blow a tire you can still drive at up to 50 mph for up to 80 miles (officially; you can always try your luck too) on that blown / totally deflated tire.

But yup, even something which was as normal as having a spare tire in your car and being able to swap it yourself (something I've done several times in my life: for me and to help others) is a long gone skill.


Some of that's down to the internal combustion engine being this insane, complex mechanical device with 160 years of evolution on top of it. In comparison, electric motors are much simpler. I'm sure all of the technology in an EV 60 years from now will be amazingly advanced, but at least for now, an EV has no PCV valve to replace, no timing belt keeping the crankshaft and camshaft in sync, no O2 sensors.


Both batteries and electric motors are far older than cars and aren’t getting more complicated any time soon.

Infotainment etc. impacts both equally, so it's clear EVs will just be simpler than ICEs coming out in the same year.


those APIs would be worthless if nobody wanted to stitch them together


>Matching faces to voices relies on simple co-occurence heuristic, and will not work in certain scenarios (e.g. if the whole conversation between two people is recorded from a single angle)

This seems like the really hard part. Maybe if there were a way to detect when the lips move for a given face, or to guess the gender and age of both the face and the voice... or, if the audio is a stereo mix, to use relative position.
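
For context, the kind of co-occurrence heuristic the README describes can be surprisingly short. A made-up sketch (not the project's actual code; the segment/detection structures are assumed): for each diarized speech segment, vote for whichever face is on screen during it.

    from collections import Counter

    def match_speakers_to_faces(speech_segments, face_detections):
        """speech_segments: [(speaker_id, start, end)], face_detections: [(face_id, timestamp)]"""
        votes = {}  # speaker_id -> Counter of face_ids visible while that speaker talks
        for speaker, start, end in speech_segments:
            counter = votes.setdefault(speaker, Counter())
            for face, t in face_detections:
                if start <= t <= end:
                    counter[face] += 1
        # assign each speaker the face that co-occurs with them most often
        return {speaker: counter.most_common(1)[0][0]
                for speaker, counter in votes.items() if counter}

Which, as the README notes, falls apart exactly when only one face is ever on camera.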


There's this that can differentiate speakers

https://github.com/CorentinJ/Real-Time-Voice-Cloning


I think you are referring to GE2E [1], which is the speaker identification model used in that project.

[1] https://arxiv.org/abs/1710.10467


That's from an eternity ago. Speaker diarization has come a long way.


What's the best pretrained model available? The best I've tried is pyannote.
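
If it helps anyone else, the pyannote pipeline is only a few lines to try; a sketch assuming pyannote.audio 2.x and a Hugging Face access token (file name and token are placeholders):

    from pyannote.audio import Pipeline

    # pretrained end-to-end diarization pipeline from the Hugging Face hub
    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                        use_auth_token="hf_...")

    diarization = pipeline("meeting.wav")
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:5.1f}s - {turn.end:5.1f}s  {speaker}")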


But matching a speaker to a face is a different problem from differentiating between speakers in a recording.


Very cool. Zack Friedman made something similar to wear on a hoody. It wasn't perfect, but it worked. https://youtu.be/mTK8dIBJIqg


So I've taken to saving many, many posts from HN with https://web.archive.org/save/ and this is the amusing result I got when trying to save this one: "This host has been already captured 100,091.0 times today. Please try again tomorrow. Please email us at "info@archive.org" if you would like to discuss this more." Haha.


This would be a killer app for a google glass type device. A near realtime closed captioning device with translation.


It made me so sad to see how Glass was neglected/spurned by the public. Especially because it had a camera.

I don't think people understand yet that this stuff is inevitable; whether it takes 20 years or 50 years eventually we (or the next generation) will _all_ be wearing these. So why not do it now?

For all the bad it could do, it could do so much more good/be useful. And if people are worried about the cameras on them, they should be worried about the cameras they _can't_ see, like the OmniVision OV6948.


Pretty darn cool!

I almost want to try to implement this in OpenCV.js and see if it's possible to build a browser extension that does this in the browser itself.


How easy is it to translate word by word like that? I was under the impression that generally it is hard because different languages have different word orders. Is it not necessary to have the whole sentence before starting? (Or maybe Polish is just conveniently similar to English in word order?)


Where does it appear to be going word by word to you?

It's sweeping each sentence at a fixed number of milliseconds per character, set so that both languages finish simultaneously. Look at when "Donald Tusk" appears during the first line. Much sooner in English than in Polish, and both are happening before he actually says the name.

The example from the game is doing entire sentences too.
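
The arithmetic is simple: both sweeps share the same window, so the longer line just gets a smaller per-character delay and they finish at the same moment. A toy sketch (the sentences and the 3-second window below are made up, just to show the calculation):

    def per_char_delay(text: str, window_s: float) -> float:
        """Delay between revealed characters so the sweep fills the whole window."""
        return window_s / len(text)

    # "Donald Tusk" sits at different positions in the two sentences,
    # so it appears at different times even though both lines end together.
    en = "He spoke with Donald Tusk yesterday."
    pl = "Wczoraj rozmawiał z Donaldem Tuskiem."
    print(per_char_delay(en, 3.0), per_char_delay(pl, 3.0))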


I got caught up in the sci-fi of it all and assumed for some reason that the intended application was a sort of “live” translation. That is, not looking ahead at the whole video. Which of course doesn’t really make sense.


Yes it's not the best way to translate, and for some languages it won't make much sense since the word order is different, as in English to Japanese translation.

To say nothing of all the myriad nuances of language like context changing word meaning and colloquial meanings of words.

In the game Cyberpunk, when doing the inline translations between English and Japanese, it is actually swapping out fully translated sentences, so it's not word for word; the effect just looks similar.


I really enjoy that aspect of the visuals because it's like a tiny reflection of the kernel of cyberpunk (the genre, not the game).

Here you have a technology that would work in a way totally different from the way people assume... but someone put the effort into making it look like it does something else, because the skeuomorphic expectations of the human make that inaccurate representation feel more correct.

That gap between the way tech works and the way our perceptions and psychology drive us to assume it works is very real, and it underpins piles of tech we use everyday, from the icons for making calls and hanging up to whether browsers blank the screen when navigating to another site.


Absolutely, and I am perhaps morbidly excited to see what this next generation of technologists come up with, as they have never lived in a world without tech dominating day to day life.

Has our idea of the human/technology interface shifted far enough that the idea of information technology implants will seem a sensible optimization of day-to-day life? Will skeuomorphisms change over time? Maybe the symbol for a neural implant phone call will look like a smartphone (I think it does in Cyberpunk 2077, but I can't recall).


Even really user-friendly and reliable stuff like iPhones still craps out or needs replacement too often for anyone to be comfortable having it surgically implanted, I think.

I do wonder if they’ll be more amenable to the 2035 version of Google Glass though.


Just render reality at an offset to compensate for the latency. Hehe.


I'm really hoping somebody comes along and puts this in a format that I can attach to my head inconspicuously.


Google Glass has already been rejected by the public; I imagine anything more inconspicuous would arouse even deeper suspicion.


What's funny is that the opposite is true. People will cry about the big obvious camera on the Glass, but won't cry about a pair of thick-framed but normal-looking glasses, because they have no idea it has a camera on it.

As mentioned in another post, chuck an OmniVision OV6948 or similar in the glasses and nobody will bat an eye, because they simply won't know.


It was ahead of its time (and wasn't that useful); in 10 years it will be perfectly acceptable.


It was rejected because it didn't do anything useful. Certainly there are many really useful things we could do now that were not possible when it was first released, no?


It didn't do anything useful because it was a prototype; nobody really got the time to build great apps for it. And it didn't get rejected for anything except that it was bulky (which can be solved) and it had a camera on it, so privacy implications (i.e. naive hoomans/public who don't realise this is inevitable, or how much they are already being recorded in public).


It’s inevitable though. Is it not? Contact lenses, eye implants, and so on.


There's this[0], using NREAL Air glasses.

[0] https://www.youtube.com/watch?v=LauvOTnZMZg


This is utterly incredible. I have so many ideas after reading way too much sci-fi and watching Ridley Scott’s sci-fi series Raised By Wolves (what if we could create a benevolent, kind and caring AI to help humans grow and navigate this world, like Father in the series?).

I want to jump in, badly. How would one go about picking up the skills needed to create stuff like this? As pragmatically, concretely and efficiently as possible without getting sidetracked in overly theoretical distractions?

I’m a fullstack engineer and have an MS in CS and pretty good math chops, but I sadly only took 1 machine learning course in all of my formal education.

How do I get into this (GPT-3 and ChatGPT are also on my mind)? Please, any books, MOOCs, etc.


> I have so many ideas

> I want to jump in, badly. How would one go about picking up the skills needed to create stuff like this?

Have you tried to make any of your ideas? If not, pick one and try to make it.


This is a great approach.

I want to build a benevolent, kind, caring and nurturing AI that one can go to for learning, understanding and support in navigating this world and being human.

I want to dump key knowledge into it that I have been fortunate enough to discern during my lifetime, and knowledge from others that I regard highly. Kind of like what Character.AI is doing, but highly fine-tuned and customized.

What skills do I need to do this? What tech do I need to learn? What books, MOOCs, etc., or even just terms to look into? Just looking to be pointed in the right direction; I’ll take care of the rest.

Definitely not trying to reinvent GPT-3 or anything like that, but looking to leverage these tools and heavily customize and fine-tune them.


Isn't what you're describing simply "being a friend" with an encyclopedia? Why would you want to create a machine to do that if we're made like that?

In a sense, instead of training a machine you could steer yourself to become that person. Call it being a life coach or a pastor. That's a lot less hassle than actual, true artificial intelligence (which has yet to be invented).

I'm genuinely curious, not trying to piss on your parade or anything.


Not quite.

Here is my thinking; I can’t be too elaborate because I’m typing on a phone keyboard, but here it is:

We live in the age of information. The amount of information out there is limitless.

This being the case, the importance lies in how a human being interfaces with this vast amount of information and which information out of the vast ocean of information is presented. Therefore curation is important. User experience is important. Discernment of what is actually valuable for a human being is also important. And how the user interacts with it is also important.

ChatGPT is one form of crafting this user interface/experience. There are other ways to go about it. The feel of the interface itself, how it presents itself to the human, and what exactly it presents are all key.

This is very different than say, an encyclopedia in which there is no discernment/bias of what is actually important for a human to know + very bad ui/ux.

And also, most importantly, a human will die. An ai model will not. A human can only speak to one person at a time. An ai model could speak to countless.


A very basic trait of our capitalistic system is that you'll always want more regardless of your actual, personal needs. You're always left wanting, always. And that's by design.

"Information" in the current sense (stats, news, (inter)national "politics") is a commodity that's being generated by the boatloads. Just because there's lots of it doesn't mean it's valuable in the sense that it actually pertains to you and your life.

So from my perspective you have the right ideas but wanting to satiate others with "information" is a bottomless pit (again, by design).

The information that is actually actionable is still right in front of us (at least in the west): it's local media (newspapers, radio, not TV). The slow, boring, local, tedious stuff. Apart from that it happens in associations, sport clubs, regional politics, NGO's.

For that, you don't need an AI.

Just fyi: I'm a doomscroller and news junkie so I'm not in a position to judge you. :)


I agree with many of your points, but I think you misunderstand my intent a bit. It’s not about providing more info. It’s about providing a nurturing friend to help one in life. In addition to this, this nurturing friend would try to convey some key and crucial points about developing as a human that I think are important to understand.

Humans need to get information/transmission of a few key points that lead to their own wisdom and discernment flourishing.

This bit of information would be built into my ideal ai, and it would convey it to human beings in the most beneficial way possible. It would even be built into its character.

And humans need a supportive and nurturing friend for the rocky road that is life.

Those are my intentions

By the way, the benevolent ai would advise you to stop taking in so much new info every day (doomscrolling, too much news), as it leads to disorientation, chaos, confusion, and discombobulation in one’s mind :)


Your enthusiasm is great! Good luck with your plan!


OpenAI GPT-3 fine-tuned models as per their documentation
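
If it helps, the mechanics are mostly about the training data format; a rough sketch of preparing it (file name, example pair, and base model are placeholders, check the docs for current details):

    import json

    # each example is a prompt/completion pair; the tone and "key knowledge"
    # you want the model to carry go into the completions
    examples = [
        {"prompt": "I'm overwhelmed and don't know where to start. ->",
         "completion": " Pick one small, concrete thing and do it today."},
    ]
    with open("mentor.jsonl", "w") as f:
        for ex in examples:
            f.write(json.dumps(ex) + "\n")

    # then, roughly per OpenAI's docs at the time of writing:
    #   openai api fine_tunes.create -t mentor.jsonl -m davinci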


Is it possible to run this tool locally and use it for myself? I don't see any instructions in the README; all I see are ways to set it up for development and ways to deploy it to some 3rd-party cloud solutions.


Videos encode so much information, and it looks pretty cool when projects like this extract higher-level information to play with. And recent models like Whisper and CLIP are amazing at helping make sense of that information, even for personal projects.

We are also trying to do something similar[0], but a lot of work still remains. The idea is to allow real-time processing of any video using just a CPU. [0] https://www.youtube.com/watch?v=E7UPj9blnWc


Nice demo. Any chance this tech gets ported to an Android library soon? Real-time face detection on CPU would be way better than constantly uploading your video feed to AWS Kinesis for ML processing.


This is impressive and so is the writeup. How did you create the diagrams in the readme? I really like that visual format for diagrams.


For the hand-drawn style ones, I think it's https://excalidraw.com/


Now this is both a cool piece of tech and practically useful. Very nice!


I developed an online course on serverless machine learning, where you can learn some of the principles of refactoring ML systems into separate feature/training/inference pipelines: https://github.com/featurestoreorg/serverless-ml-course

Some of the students have built similar systems, for example chaining Whisper and ChatGPT or translation or sentiment analysis of transcribed text, such as here (transcribe Swedish and tell me the sentiment of the text): https://huggingface.co/spaces/Chrysoula/voice_to_text_swedis...


Why is this type of cloud based computing called serverless?

To me "putting all the work on remote machines you don't manage" seems pretty cloudy, and I naively expected "serverless" to mean something like runs locally, not "uses a shit town of someone else's servers".


A good analogy I heard is that serverless uses servers as much as wireless uses wires. It does, but it's not part of your responsibility.


Nice. If anybody has installed and operated an MLFlow server (open source, great product), but then switched to use a serverless platform like weights and biases, you would instinctively get the term serverless. DynamoDB is called a serverless database. Obviously there are servers, it's just somebody else's job to maintain them.


Good question. For me it's in the same category as "hoverboards" that don't actually hover.


Great question. I think of it as follows: if I can just write some operational Python code and "deploy" it, without needing to install/upgrade/maintain systems software (k8s/DB/feature store/etc.), then I have built a serverless system. What's cool now is that lots of these services have a great free tier, so you can run serverless ML systems without paying anything.
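
To make it concrete, the "operational code" can be as small as one handler function. A hand-wavy sketch in the AWS Lambda style (not how this particular project is wired up, it uses Modal; the event shape and names are just examples):

    import json

    def transcribe(video_url: str) -> str:
        # stand-in for the real work (download the video, run Whisper, etc.)
        return f"transcript of {video_url}"

    def handler(event, context):
        """Entry point the platform invokes; no server of ours is ever running or patched."""
        video_url = json.loads(event["body"])["video_url"]
        return {"statusCode": 200,
                "body": json.dumps({"transcript": transcribe(video_url)})}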


Thanks for sharing! Will check it out later, but this looks great.


Nice! Also would be cool to see the text positioned above the face and translated just like in the cyberpunk gif.


It's actually incredible to do all this work and then resist implementing the last few lines of code that would do that.


I would like to point out: as someone who is a newbie coder, I had a ton of fun learning how your code works with the help of ChatGPT. Even learning about unfamiliar topics like serverless apps, NLP transformers, or YAML vs. JSON lol


This is awesome, impressed you threw this together over a weekend!

What did you use to make that entity diagram?

edit: answered below


Neat use of Gradio & Modal


FFmpeg is in there too, of course! Such a powerful tool.

https://youtu.be/9kaIXkImCAM humor


That was the first video I have watched start to finish without skipping in years. Brilliant, thanks.


Clever hack/solution, and the thorough documentation is appreciated!


This is a great example of serverless machine learning with modal.


Does anyone have a link to something similar for real-time captions (audio only, e.g. Google Meet, Twitter Spaces)? Or is that typically done on the device itself?


If I understood you correctly, there's a Chrome accessibility feature that you should check out: https://support.google.com/chrome/answer/10538231

Think of it as YouTube's auto-generated captions, but on a browser level.


I was mainly interested in studying implementations of real time transcription


It seems there is a niche market in this...


More than just a niche. Imagine a pair of glasses a tourist would wear that would translate in real time what that waiter just said.


This is cool. I guess it's possible to do all this in a single multi-task NN.


Mindblowing that this was so easy to build.

Coming soon to a phone near you, and in realtime.


This is incredible



